Week 5
In 1846 the Donner and Reed families left Springfield, Illinois, for California by covered wagon. In July, the Donner Party, as it became known, reached Fort Bridger, Wyoming. There its leaders decided to attempt a new and untested rote to the Sacramento Valley. Having reached its full size of 89 people and 20 wagons, the party was delayed by a difficult crossing of the Wasatch Range and again in the crossing of the desert west of the Great Salt Lake. The group became stranded in the eastern Sierra Nevada mountains when the region was hit by heavy snows in late October. By the time the last survivor was rescued on April 21, 1847, 41 of the 89 members had died from famine and exposure to extreme cold.
 
 
41 of the 89 members died
# A tibble: 6 × 3
    Age Sex    Status  
  <dbl> <fct>  <fct>   
1    23 Male   Died    
2    40 Female Survived
3    40 Male   Survived
4    30 Male   Died    
5    28 Male   Died    
6    40 Male   Died    Rows: 43
Columns: 3
$ Age    <dbl> 23, 40, 40, 30, 28, 40, 45, 62, 65, 45, 25, 28, 28, 23, 22, 23,…
$ Sex    <fct> Male, Female, Male, Male, Male, Male, Female, Male, Male, Femal…
$ Status <fct> Died, Survived, Survived, Died, Died, Died, Died, Died, Died, D…Status (binary outcome: Survived vs Died) as a function of Age (continuous variable)donner %>%
    mutate(prob = ifelse(Status == "Survived", 1, 0)) %>%
    ggplot(aes(x=Age, y=prob, color=Sex)) + 
       geom_jitter(height=0.025, width=0) + 
       lims(x=c(15,70), y=c(-0.1,1.1)) + 
       scale_y_continuous(breaks=seq(0,1,0.1)) +
       geom_smooth(method="glm", method.args=list(family="binomial"), se=FALSE, fullrange=TRUE) +
       labs(title="Donner party", y="Probability of Survival") + 
       theme_bw()
ggplot easily using geom_smooth() 
 
 
 
donner %>%
    mutate(prob = ifelse(Status == "Survived", 1, 0)) %>%
    filter(Sex=="Female") %>%
    ggplot(aes(x=Age, y=prob)) + 
       geom_jitter(height=0.05, width=0, color="black") + 
       scale_y_continuous(breaks=seq(0,1,0.1)) +
       geom_smooth(method="lm", se=FALSE, fullrange=TRUE, color="orange") +
       lims(x=c(0,80), y=c(-0.5,1.5)) + 
       labs(title="Donner party (Females)", y="Probability of Survival") + 
       theme_bw()
donner %>%
    mutate(prob = ifelse(Status == "Survived", 1, 0)) %>%
    filter(Sex=="Female") %>%
    ggplot(aes(x=Age, y=prob)) + 
       geom_jitter(height=0.05, width=0) + 
       scale_y_continuous(breaks=seq(0,1,0.1)) +
       geom_smooth(method="lm", se=FALSE, fullrange=TRUE, color="orange") +
       geom_smooth(method="glm", method.args=list(family="binomial"), se=FALSE, fullrange=TRUE, color="blue") +
       lims(x=c(5,70), y=c(-0.5,1.5)) + 
       labs(title="Donner party (Females)", y="Probability of Survival") + 
       theme_bw()
Also: we violate assumptions of linear regression
donner %>%
    mutate(prob = ifelse(Status == "Survived", 1, 0)) %>%
    filter(Sex=="Female") %>%
    ggplot(aes(x=Age, y=prob)) + 
       geom_jitter(height=0.05, width=0) + 
       scale_y_continuous(breaks=seq(0,1,0.1)) +
       geom_smooth(method="glm", method.args=list(family="binomial"), se=FALSE, fullrange=TRUE, color="blue") +
       lims(x=c(5,70), y=c(-0.05,1.05)) + 
       labs(title="Donner party (Females)", y="Probability of Survival") + 
       theme_bw()
All GLMs have four components:
a GLM with a binomial distribution and a logit link function
a GLM with a binomial distribution and a logit link function
putting it together:
log\left(\frac{p}{1-p}\right) = \beta_{0} + \beta_{1}\mathrm{Age}
p = \frac{e^{(\beta_{0} + \beta_{1}\mathrm{Age})}}{1 + e^{(\beta_{0} + \beta_{1}\mathrm{Age})}}
p = \frac{e^{\beta_{0} + \beta_{1}\mathrm{Age}}}{1 + e^{\beta_{0} + \beta_{1}\mathrm{Age}}} = \frac{e^{\beta_{0} + \beta_{1}(30)}}{1 + e^{\beta_{0} + \beta_{1}(30)}}
glm() functionglm() is a general function for fitting generalized linear modelsbinomial and the link function is logit
Call:
glm(formula = Status ~ Age, family = binomial(link = "logit"), 
    data = donner)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  2.06749    1.09914   1.881   0.0600 .
Age         -0.07339    0.03510  -2.091   0.0366 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 59.028  on 42  degrees of freedom
Residual deviance: 53.134  on 41  degrees of freedom
AIC: 57.134
Number of Fisher Scoring iterations: 4glm() functionglm() is a general function for fitting generalized linear modelsbinomial and the link function is logitlogit is the default so we can leave it out if we want:
Call:
glm(formula = Status ~ Age, family = binomial, data = donner)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  2.06749    1.09914   1.881   0.0600 .
Age         -0.07339    0.03510  -2.091   0.0366 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 59.028  on 42  degrees of freedom
Residual deviance: 53.134  on 41  degrees of freedom
AIC: 57.134
Number of Fisher Scoring iterations: 4
Call:
glm(formula = Status ~ Age, family = binomial, data = donner)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  2.06749    1.09914   1.881   0.0600 .
Age         -0.07339    0.03510  -2.091   0.0366 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 59.028  on 42  degrees of freedom
Residual deviance: 53.134  on 41  degrees of freedom
AIC: 57.134
Number of Fisher Scoring iterations: 4donner %>%
    mutate(prob = ifelse(Status == "Survived", 1, 0)) %>%
    ggplot(aes(x=Age, y=prob)) + 
       geom_jitter(height=0.05, width=0) + 
       scale_y_continuous(breaks=seq(0,1,0.1)) +
       geom_smooth(method="glm", method.args=list(family="binomial"), se=FALSE, fullrange=TRUE, color="blue") +
       lims(x=c(5,70), y=c(-0.05,1.05)) + 
       labs(title="Donner party (Males & Females)", y="Probability of Survival") + 
       theme_bw()
Call:
glm(formula = Status ~ Age, family = binomial, data = donner)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  2.06749    1.09914   1.881   0.0600 .
Age         -0.07339    0.03510  -2.091   0.0366 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 59.028  on 42  degrees of freedom
Residual deviance: 53.134  on 41  degrees of freedom
AIC: 57.134
Number of Fisher Scoring iterations: 4        1 
0.2956287 donner %>%
    mutate(prob = ifelse(Status == "Survived", 1, 0)) %>%
    ggplot(aes(x=Age, y=prob)) + 
       geom_jitter(height=0.05, width=0) + 
       scale_y_continuous(breaks=seq(0,1,0.1)) +
       geom_smooth(method="glm", method.args=list(family="binomial"), se=FALSE, fullrange=TRUE, color="blue") +
       lims(x=c(5,70), y=c(-0.05,1.05)) + 
       geom_vline(xintercept=40, color="red") +
       geom_hline(yintercept=prob_40, color="red") +
       labs(title="Donner party (Males & Females)", y="Probability of Survival") + 
       theme_bw()
\begin{matrix} log\left(\frac{p}{1-p}\right) &= &2.07 - 0.07(\mathrm{Age})\\ p &= &\frac{e^{2.07 - 0.07(\mathrm{Age})}}{1 + e^{2.07 - 0.07(\mathrm{Age})}} \end{matrix}
Call:
glm(formula = Status ~ Age, family = binomial, data = donner)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  2.06749    1.09914   1.881   0.0600 .
Age         -0.07339    0.03510  -2.091   0.0366 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 59.028  on 42  degrees of freedom
Residual deviance: 53.134  on 41  degrees of freedom
AIC: 57.134
Number of Fisher Scoring iterations: 4\begin{matrix} log\left(\frac{p}{1-p}\right) &= &2.07 - 0.07(\mathrm{Age})\\ p &= &\frac{e^{2.07 - 0.07(\mathrm{Age})}}{1 + e^{2.07 - 0.07(\mathrm{Age})}} \end{matrix}
Call:
glm(formula = Status ~ Age, family = binomial, data = donner)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  2.06749    1.09914   1.881   0.0600 .
Age         -0.07339    0.03510  -2.091   0.0366 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 59.028  on 42  degrees of freedom
Residual deviance: 53.134  on 41  degrees of freedom
AIC: 57.134
Number of Fisher Scoring iterations: 4\begin{matrix} log\left(\frac{p}{1-p}\right) &= &2.07 - 0.07(\mathrm{Age})\\ p &= &\frac{e^{2.07 - 0.07(\mathrm{Age})}}{1 + e^{2.07 - 0.07(\mathrm{Age})}} \end{matrix}
Call:
glm(formula = Status ~ Age, family = binomial, data = donner)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  2.06749    1.09914   1.881   0.0600 .
Age         -0.07339    0.03510  -2.091   0.0366 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 59.028  on 42  degrees of freedom
Residual deviance: 53.134  on 41  degrees of freedom
AIC: 57.134
Number of Fisher Scoring iterations: 4for example:
\begin{matrix} log\left(\frac{p}{1-p}\right) &= &2.07 - 0.07(\mathrm{Age})\\ p &= &\frac{e^{2.07 - 0.07(\mathrm{Age})}}{1 + e^{2.07 - 0.07(\mathrm{Age})}} \end{matrix}
Call:
glm(formula = Status ~ Age, family = binomial, data = donner)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  2.06749    1.09914   1.881   0.0600 .
Age         -0.07339    0.03510  -2.091   0.0366 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 59.028  on 42  degrees of freedom
Residual deviance: 53.134  on 41  degrees of freedom
AIC: 57.134
Number of Fisher Scoring iterations: 4(Intercept)         Age   SexFemale 
 2.00466996 -0.08762745  1.50667376 donner %>%
    mutate(prob = ifelse(Status == "Survived", 1, 0)) %>%
    ggplot(aes(x=Age, y=prob, color=Sex)) + 
       geom_jitter(height=0.05, width=0) + 
       scale_y_continuous(breaks=seq(0,1,0.1)) +
       geom_smooth(method="glm", method.args=list(family="binomial"), se=FALSE, fullrange=TRUE) +
       lims(x=c(5,70), y=c(-0.05,1.05)) + 
       labs(title="Donner party", y="Probability of Survival") + 
       theme_bw()
\begin{matrix} log\left(\frac{p}{1-p}\right) &= &2.0 - 0.09(\mathrm{Age}) + 1.5(\mathrm{SexFemale})\\ p &= &\frac{e^{2.0 - 0.09(\mathrm{Age}) + 1.5(\mathrm{SexFemale})}}{1 + e^{2.0 - 0.09(\mathrm{Age}) + 1.5(\mathrm{SexFemale})}} \end{matrix}
\begin{matrix} log\left(\frac{p}{1-p}\right) &= &2.0 - 0.09(\mathrm{Age}) + 1.5(\mathrm{SexFemale})\\ p &= &\frac{e^{2.0 - 0.09(\mathrm{Age}) + 1.5(\mathrm{SexFemale})}}{1 + e^{2.0 - 0.09(\mathrm{Age}) + 1.5(\mathrm{SexFemale})}} \end{matrix}
Call:
glm(formula = Status ~ Age + Sex, family = binomial, data = donner)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  2.00467    1.21857   1.645   0.1000 .
Age         -0.08763    0.04084  -2.146   0.0319 *
SexFemale    1.50667    0.78497   1.919   0.0549 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 59.028  on 42  degrees of freedom
Residual deviance: 48.992  on 40  degrees of freedom
AIC: 54.992
Number of Fisher Scoring iterations: 5
Call:
glm(formula = Status ~ Age + Sex, family = binomial, data = donner)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  2.00467    1.21857   1.645   0.1000 .
Age         -0.08763    0.04084  -2.146   0.0319 *
SexFemale    1.50667    0.78497   1.919   0.0549 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 59.028  on 42  degrees of freedom
Residual deviance: 48.992  on 40  degrees of freedom
AIC: 54.992
Number of Fisher Scoring iterations: 5
Call:
glm(formula = Status ~ Age + Sex, family = binomial, data = donner)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  2.00467    1.21857   1.645   0.1000 .
Age         -0.08763    0.04084  -2.146   0.0319 *
SexFemale    1.50667    0.78497   1.919   0.0549 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 59.028  on 42  degrees of freedom
Residual deviance: 48.992  on 40  degrees of freedom
AIC: 54.992
Number of Fisher Scoring iterations: 5
Call:
glm(formula = Status ~ Age, family = binomial, data = donner)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  2.06749    1.09914   1.881   0.0600 .
Age         -0.07339    0.03510  -2.091   0.0366 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 59.028  on 42  degrees of freedom
Residual deviance: 53.134  on 41  degrees of freedom
AIC: 57.134
Number of Fisher Scoring iterations: 4
Call:
glm(formula = Status ~ Age + Sex, family = binomial, data = donner)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  2.00467    1.21857   1.645   0.1000 .
Age         -0.08763    0.04084  -2.146   0.0319 *
SexFemale    1.50667    0.78497   1.919   0.0549 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 59.028  on 42  degrees of freedom
Residual deviance: 48.992  on 40  degrees of freedom
AIC: 54.992
Number of Fisher Scoring iterations: 5step() function to do stepwise model selectioncut() the predictor Age into bins using quantile()Surviveddonner %>%
    mutate(prob = ifelse(Status == "Survived", 1, 0)) %>%
    ggplot(aes(x=Age, y=prob)) + 
       geom_jitter(height=0.05, width=0) + 
       scale_y_continuous(breaks=seq(0,1,0.1)) +
       geom_smooth(method="glm", method.args=list(family="binomial"), se=FALSE, fullrange=TRUE, color="blue") +
       geom_vline(xintercept=quantile(donner$Age), color="red", lty=2) +
       lims(x=c(5,70), y=c(-0.05,1.05)) + 
       labs(title="Donner party", y="Probability of Survival") + 
       theme_bw()
X:Y means “the interaction between X and Y”Age:log(Age) to the GLM modelAge:log(Age) means “the interaction between Age and log(Age)”p=0.7712 so we are >.05 so we fail to reject the null hypothesis that there is no interaction between Age and log odds of survival
Call:
glm(formula = Status ~ Age + Sex + (Age:log(Age)), family = binomial, 
    data = donner)
Coefficients:
             Estimate Std. Error z value Pr(>|z|)  
(Intercept)  -0.60391    8.97625  -0.067   0.9464  
Age           0.27300    1.23769   0.221   0.8254  
SexFemale     1.52563    0.79610   1.916   0.0553 .
Age:log(Age) -0.07959    0.27374  -0.291   0.7712  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 59.028  on 42  degrees of freedom
Residual deviance: 48.902  on 39  degrees of freedom
AIC: 56.902
Number of Fisher Scoring iterations: 5ggpairs() from the GGally package