Week 5
In 1846 the Donner and Reed families left Springfield, Illinois, for California by covered wagon. In July, the Donner Party, as it became known, reached Fort Bridger, Wyoming. There its leaders decided to attempt a new and untested rote to the Sacramento Valley. Having reached its full size of 89 people and 20 wagons, the party was delayed by a difficult crossing of the Wasatch Range and again in the crossing of the desert west of the Great Salt Lake. The group became stranded in the eastern Sierra Nevada mountains when the region was hit by heavy snows in late October. By the time the last survivor was rescued on April 21, 1847, 41 of the 89 members had died from famine and exposure to extreme cold.
41 of the 89 members died
# A tibble: 6 × 3
Age Sex Status
<dbl> <fct> <fct>
1 23 Male Died
2 40 Female Survived
3 40 Male Survived
4 30 Male Died
5 28 Male Died
6 40 Male Died
Rows: 43
Columns: 3
$ Age <dbl> 23, 40, 40, 30, 28, 40, 45, 62, 65, 45, 25, 28, 28, 23, 22, 23,…
$ Sex <fct> Male, Female, Male, Male, Male, Male, Female, Male, Male, Femal…
$ Status <fct> Died, Survived, Survived, Died, Died, Died, Died, Died, Died, D…
Status
(binary outcome: Survived
vs Died
) as a function of Age
(continuous variable)donner %>%
mutate(prob = ifelse(Status == "Survived", 1, 0)) %>%
ggplot(aes(x=Age, y=prob, color=Sex)) +
geom_jitter(height=0.025, width=0) +
lims(x=c(15,70), y=c(-0.1,1.1)) +
scale_y_continuous(breaks=seq(0,1,0.1)) +
geom_smooth(method="glm", method.args=list(family="binomial"), se=FALSE, fullrange=TRUE) +
labs(title="Donner party", y="Probability of Survival") +
theme_bw()
ggplot
easily using geom_smooth()
donner %>%
mutate(prob = ifelse(Status == "Survived", 1, 0)) %>%
filter(Sex=="Female") %>%
ggplot(aes(x=Age, y=prob)) +
geom_jitter(height=0.05, width=0, color="black") +
scale_y_continuous(breaks=seq(0,1,0.1)) +
geom_smooth(method="lm", se=FALSE, fullrange=TRUE, color="orange") +
lims(x=c(0,80), y=c(-0.5,1.5)) +
labs(title="Donner party (Females)", y="Probability of Survival") +
theme_bw()
donner %>%
mutate(prob = ifelse(Status == "Survived", 1, 0)) %>%
filter(Sex=="Female") %>%
ggplot(aes(x=Age, y=prob)) +
geom_jitter(height=0.05, width=0) +
scale_y_continuous(breaks=seq(0,1,0.1)) +
geom_smooth(method="lm", se=FALSE, fullrange=TRUE, color="orange") +
geom_smooth(method="glm", method.args=list(family="binomial"), se=FALSE, fullrange=TRUE, color="blue") +
lims(x=c(5,70), y=c(-0.5,1.5)) +
labs(title="Donner party (Females)", y="Probability of Survival") +
theme_bw()
Also: we violate assumptions of linear regression
donner %>%
mutate(prob = ifelse(Status == "Survived", 1, 0)) %>%
filter(Sex=="Female") %>%
ggplot(aes(x=Age, y=prob)) +
geom_jitter(height=0.05, width=0) +
scale_y_continuous(breaks=seq(0,1,0.1)) +
geom_smooth(method="glm", method.args=list(family="binomial"), se=FALSE, fullrange=TRUE, color="blue") +
lims(x=c(5,70), y=c(-0.05,1.05)) +
labs(title="Donner party (Females)", y="Probability of Survival") +
theme_bw()
All GLMs have four components:
a GLM with a binomial distribution and a logit link function
a GLM with a binomial distribution and a logit link function
putting it together:
log\left(\frac{p}{1-p}\right) = \beta_{0} + \beta_{1}\mathrm{Age}
p = \frac{e^{(\beta_{0} + \beta_{1}\mathrm{Age})}}{1 + e^{(\beta_{0} + \beta_{1}\mathrm{Age})}}
p = \frac{e^{\beta_{0} + \beta_{1}\mathrm{Age}}}{1 + e^{\beta_{0} + \beta_{1}\mathrm{Age}}} = \frac{e^{\beta_{0} + \beta_{1}(30)}}{1 + e^{\beta_{0} + \beta_{1}(30)}}
glm()
functionglm()
is a general function for fitting generalized linear modelsbinomial
and the link function is logit
Call:
glm(formula = Status ~ Age, family = binomial(link = "logit"),
data = donner)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.06749 1.09914 1.881 0.0600 .
Age -0.07339 0.03510 -2.091 0.0366 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 59.028 on 42 degrees of freedom
Residual deviance: 53.134 on 41 degrees of freedom
AIC: 57.134
Number of Fisher Scoring iterations: 4
glm()
functionglm()
is a general function for fitting generalized linear modelsbinomial
and the link function is logit
logit
is the default so we can leave it out if we want:
Call:
glm(formula = Status ~ Age, family = binomial, data = donner)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.06749 1.09914 1.881 0.0600 .
Age -0.07339 0.03510 -2.091 0.0366 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 59.028 on 42 degrees of freedom
Residual deviance: 53.134 on 41 degrees of freedom
AIC: 57.134
Number of Fisher Scoring iterations: 4
Call:
glm(formula = Status ~ Age, family = binomial, data = donner)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.06749 1.09914 1.881 0.0600 .
Age -0.07339 0.03510 -2.091 0.0366 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 59.028 on 42 degrees of freedom
Residual deviance: 53.134 on 41 degrees of freedom
AIC: 57.134
Number of Fisher Scoring iterations: 4
donner %>%
mutate(prob = ifelse(Status == "Survived", 1, 0)) %>%
ggplot(aes(x=Age, y=prob)) +
geom_jitter(height=0.05, width=0) +
scale_y_continuous(breaks=seq(0,1,0.1)) +
geom_smooth(method="glm", method.args=list(family="binomial"), se=FALSE, fullrange=TRUE, color="blue") +
lims(x=c(5,70), y=c(-0.05,1.05)) +
labs(title="Donner party (Males & Females)", y="Probability of Survival") +
theme_bw()
Call:
glm(formula = Status ~ Age, family = binomial, data = donner)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.06749 1.09914 1.881 0.0600 .
Age -0.07339 0.03510 -2.091 0.0366 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 59.028 on 42 degrees of freedom
Residual deviance: 53.134 on 41 degrees of freedom
AIC: 57.134
Number of Fisher Scoring iterations: 4
1
0.2956287
donner %>%
mutate(prob = ifelse(Status == "Survived", 1, 0)) %>%
ggplot(aes(x=Age, y=prob)) +
geom_jitter(height=0.05, width=0) +
scale_y_continuous(breaks=seq(0,1,0.1)) +
geom_smooth(method="glm", method.args=list(family="binomial"), se=FALSE, fullrange=TRUE, color="blue") +
lims(x=c(5,70), y=c(-0.05,1.05)) +
geom_vline(xintercept=40, color="red") +
geom_hline(yintercept=prob_40, color="red") +
labs(title="Donner party (Males & Females)", y="Probability of Survival") +
theme_bw()
\begin{matrix} log\left(\frac{p}{1-p}\right) &= &2.07 - 0.07(\mathrm{Age})\\ p &= &\frac{e^{2.07 - 0.07(\mathrm{Age})}}{1 + e^{2.07 - 0.07(\mathrm{Age})}} \end{matrix}
Call:
glm(formula = Status ~ Age, family = binomial, data = donner)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.06749 1.09914 1.881 0.0600 .
Age -0.07339 0.03510 -2.091 0.0366 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 59.028 on 42 degrees of freedom
Residual deviance: 53.134 on 41 degrees of freedom
AIC: 57.134
Number of Fisher Scoring iterations: 4
\begin{matrix} log\left(\frac{p}{1-p}\right) &= &2.07 - 0.07(\mathrm{Age})\\ p &= &\frac{e^{2.07 - 0.07(\mathrm{Age})}}{1 + e^{2.07 - 0.07(\mathrm{Age})}} \end{matrix}
Call:
glm(formula = Status ~ Age, family = binomial, data = donner)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.06749 1.09914 1.881 0.0600 .
Age -0.07339 0.03510 -2.091 0.0366 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 59.028 on 42 degrees of freedom
Residual deviance: 53.134 on 41 degrees of freedom
AIC: 57.134
Number of Fisher Scoring iterations: 4
\begin{matrix} log\left(\frac{p}{1-p}\right) &= &2.07 - 0.07(\mathrm{Age})\\ p &= &\frac{e^{2.07 - 0.07(\mathrm{Age})}}{1 + e^{2.07 - 0.07(\mathrm{Age})}} \end{matrix}
Call:
glm(formula = Status ~ Age, family = binomial, data = donner)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.06749 1.09914 1.881 0.0600 .
Age -0.07339 0.03510 -2.091 0.0366 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 59.028 on 42 degrees of freedom
Residual deviance: 53.134 on 41 degrees of freedom
AIC: 57.134
Number of Fisher Scoring iterations: 4
for example:
\begin{matrix} log\left(\frac{p}{1-p}\right) &= &2.07 - 0.07(\mathrm{Age})\\ p &= &\frac{e^{2.07 - 0.07(\mathrm{Age})}}{1 + e^{2.07 - 0.07(\mathrm{Age})}} \end{matrix}
Call:
glm(formula = Status ~ Age, family = binomial, data = donner)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.06749 1.09914 1.881 0.0600 .
Age -0.07339 0.03510 -2.091 0.0366 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 59.028 on 42 degrees of freedom
Residual deviance: 53.134 on 41 degrees of freedom
AIC: 57.134
Number of Fisher Scoring iterations: 4
(Intercept) Age SexFemale
2.00466996 -0.08762745 1.50667376
donner %>%
mutate(prob = ifelse(Status == "Survived", 1, 0)) %>%
ggplot(aes(x=Age, y=prob, color=Sex)) +
geom_jitter(height=0.05, width=0) +
scale_y_continuous(breaks=seq(0,1,0.1)) +
geom_smooth(method="glm", method.args=list(family="binomial"), se=FALSE, fullrange=TRUE) +
lims(x=c(5,70), y=c(-0.05,1.05)) +
labs(title="Donner party", y="Probability of Survival") +
theme_bw()
\begin{matrix} log\left(\frac{p}{1-p}\right) &= &2.0 - 0.09(\mathrm{Age}) + 1.5(\mathrm{SexFemale})\\ p &= &\frac{e^{2.0 - 0.09(\mathrm{Age}) + 1.5(\mathrm{SexFemale})}}{1 + e^{2.0 - 0.09(\mathrm{Age}) + 1.5(\mathrm{SexFemale})}} \end{matrix}
\begin{matrix} log\left(\frac{p}{1-p}\right) &= &2.0 - 0.09(\mathrm{Age}) + 1.5(\mathrm{SexFemale})\\ p &= &\frac{e^{2.0 - 0.09(\mathrm{Age}) + 1.5(\mathrm{SexFemale})}}{1 + e^{2.0 - 0.09(\mathrm{Age}) + 1.5(\mathrm{SexFemale})}} \end{matrix}
Call:
glm(formula = Status ~ Age + Sex, family = binomial, data = donner)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.00467 1.21857 1.645 0.1000 .
Age -0.08763 0.04084 -2.146 0.0319 *
SexFemale 1.50667 0.78497 1.919 0.0549 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 59.028 on 42 degrees of freedom
Residual deviance: 48.992 on 40 degrees of freedom
AIC: 54.992
Number of Fisher Scoring iterations: 5
Call:
glm(formula = Status ~ Age + Sex, family = binomial, data = donner)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.00467 1.21857 1.645 0.1000 .
Age -0.08763 0.04084 -2.146 0.0319 *
SexFemale 1.50667 0.78497 1.919 0.0549 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 59.028 on 42 degrees of freedom
Residual deviance: 48.992 on 40 degrees of freedom
AIC: 54.992
Number of Fisher Scoring iterations: 5
Call:
glm(formula = Status ~ Age + Sex, family = binomial, data = donner)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.00467 1.21857 1.645 0.1000 .
Age -0.08763 0.04084 -2.146 0.0319 *
SexFemale 1.50667 0.78497 1.919 0.0549 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 59.028 on 42 degrees of freedom
Residual deviance: 48.992 on 40 degrees of freedom
AIC: 54.992
Number of Fisher Scoring iterations: 5
Call:
glm(formula = Status ~ Age, family = binomial, data = donner)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.06749 1.09914 1.881 0.0600 .
Age -0.07339 0.03510 -2.091 0.0366 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 59.028 on 42 degrees of freedom
Residual deviance: 53.134 on 41 degrees of freedom
AIC: 57.134
Number of Fisher Scoring iterations: 4
Call:
glm(formula = Status ~ Age + Sex, family = binomial, data = donner)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.00467 1.21857 1.645 0.1000 .
Age -0.08763 0.04084 -2.146 0.0319 *
SexFemale 1.50667 0.78497 1.919 0.0549 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 59.028 on 42 degrees of freedom
Residual deviance: 48.992 on 40 degrees of freedom
AIC: 54.992
Number of Fisher Scoring iterations: 5
step()
function to do stepwise model selectioncut()
the predictor Age
into bins using quantile()
Survived
donner %>%
mutate(prob = ifelse(Status == "Survived", 1, 0)) %>%
ggplot(aes(x=Age, y=prob)) +
geom_jitter(height=0.05, width=0) +
scale_y_continuous(breaks=seq(0,1,0.1)) +
geom_smooth(method="glm", method.args=list(family="binomial"), se=FALSE, fullrange=TRUE, color="blue") +
geom_vline(xintercept=quantile(donner$Age), color="red", lty=2) +
lims(x=c(5,70), y=c(-0.05,1.05)) +
labs(title="Donner party", y="Probability of Survival") +
theme_bw()
X:Y
means “the interaction between X and Y”Age:log(Age)
to the GLM modelAge:log(Age)
means “the interaction between Age and log(Age)”p=0.7712
so we are >.05
so we fail to reject the null hypothesis that there is no interaction between Age and log odds of survival
Call:
glm(formula = Status ~ Age + Sex + (Age:log(Age)), family = binomial,
data = donner)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.60391 8.97625 -0.067 0.9464
Age 0.27300 1.23769 0.221 0.8254
SexFemale 1.52563 0.79610 1.916 0.0553 .
Age:log(Age) -0.07959 0.27374 -0.291 0.7712
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 59.028 on 42 degrees of freedom
Residual deviance: 48.902 on 39 degrees of freedom
AIC: 56.902
Number of Fisher Scoring iterations: 5
ggpairs()
from the GGally
package