Week 4
MPG ~ Weight + Displacement
(using the mtcars
built-in dataset)MPG ~ Weight + Displacement
MPG ~ Weight + Displacement
MPG ~ Weight + Displacement
Y_{i} = \beta_{0} + \displaystyle\sum_{k=1}^{K} \left( \beta_{k} X_{ik} \right) +\varepsilon_{i} - sorry! I can’t draw in > 3 dimensions
The penguins
data from the palmerpenguins package contains size measurements for 344 penguins from three species observed on three islands in the Palmer Archipelago, Antarctica.
Rows: 344
Columns: 8
$ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex <fct> male, female, female, NA, female, male, female, male…
$ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
body_mass_g
of Adelie
penguins using a model that includes:
flipper_length_mm
bill_length_mm
bill_depth_mm
dplyr()
GGally::ggpairs()
lm()
step()
pdata <- penguins %>%
filter(species == "Adelie") %>%
select(body_mass_g, flipper_length_mm, bill_length_mm, bill_depth_mm) %>%
drop_na() # skips rows containing NA values (indicating missing data)
glimpse(pdata)
Rows: 151
Columns: 4
$ body_mass_g <int> 3750, 3800, 3250, 3450, 3650, 3625, 4675, 3475, 4250…
$ flipper_length_mm <int> 181, 186, 195, 193, 190, 181, 195, 193, 190, 186, 18…
$ bill_length_mm <dbl> 39.1, 39.5, 40.3, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0…
$ bill_depth_mm <dbl> 18.7, 17.4, 18.0, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2…
body_mass_g
based on the other three
flipper_length_mm
, bill_length_mm
, and bill_depth_mm
mod.full <- lm(body_mass_g ~ flipper_length_mm + bill_length_mm + bill_depth_mm, data=pdata)
summary(mod.full)
Call:
lm(formula = body_mass_g ~ flipper_length_mm + bill_length_mm +
bill_depth_mm, data = pdata)
Residuals:
Min 1Q Median 3Q Max
-815.12 -200.33 -7.35 201.02 891.03
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4341.302 795.117 -5.460 1.98e-07 ***
flipper_length_mm 17.421 4.385 3.973 0.000111 ***
bill_length_mm 55.368 11.133 4.973 1.81e-06 ***
bill_depth_mm 140.895 24.216 5.818 3.58e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 324.9 on 147 degrees of freedom
Multiple R-squared: 0.5082, Adjusted R-squared: 0.4982
F-statistic: 50.63 on 3 and 147 DF, p-value: < 2.2e-16
shapiro.test()
on the residualsncvTest()
in the car
package)broom
package to tidy
the model object into a tibble
library(broom) # you will need to first install.packages("broom") once
mod.tidy <- tidy(mod.full, conf.int=TRUE, conf.level=0.95)
mod.tidy
# A tibble: 4 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -4341. 795. -5.46 0.000000198 -5913. -2770.
2 flipper_length_mm 17.4 4.39 3.97 0.000111 8.76 26.1
3 bill_length_mm 55.4 11.1 4.97 0.00000181 33.4 77.4
4 bill_depth_mm 141. 24.2 5.82 0.0000000358 93.0 189.
(Intercept)
term from the model output# A tibble: 3 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 flipper_length_mm 17.4 4.39 3.97 0.000111 8.76 26.1
2 bill_length_mm 55.4 11.1 4.97 0.00000181 33.4 77.4
3 bill_depth_mm 141. 24.2 5.82 0.0000000358 93.0 189.
# A tibble: 3 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 flipper_length_mm 17.4 4.39 3.97 0.000111 8.76 26.1
2 bill_length_mm 55.4 11.1 4.97 0.00000181 33.4 77.4
3 bill_depth_mm 141. 24.2 5.82 0.0000000358 93.0 189.
coefplot
package to more easily plot coefficients and their confidence intervalslm()
model objectbody_mass_g
?
step()
function)R^{2} = 1 - \frac{\mathit{SS_{res}}}{\mathit{SS_{tot}}} \text{adj.}R^{2} = 1 - \left( \frac{\mathit{SS_{res}}}{\mathit{SS_{tot}}} \times \frac{N-1}{N-K-1} \right)
mod.full <- lm(body_mass_g ~ flipper_length_mm + bill_length_mm + bill_depth_mm, data=pdata)
summary(mod.full)$adj.r.squared
[1] 0.4981588
bill_depth_mm
) and see what effect it has on adjusted r-squaredmod.red <- lm(body_mass_g ~ flipper_length_mm + bill_length_mm, data=pdata)
summary(mod.red)$adj.r.squared
[1] 0.3867675
bill_depth_mm
to the model so we should keep it, it’s “worth it”AIC \approx \frac{\mathit{SS_{res}}}{\hat{\sigma}^{2}} + 2K
step()
function to test competing models and look at AIC measurestep()
can do 3 flavours of model refinement:
step()
to add or remove variables at each stepstep()
command, start with empty modelstep()
starts with mod.0
and adds the best single variable that most reduces AICstep()
command, start with full modelstep()
starts with a full model and checks if dropping any single variable improves (reduces) AICflipper_length_mm = 200 mm
bill_length_mm = 46 mm
bill_depth_mm = 22 mm
predict()
to predict the value of body_mass_g
summary(mod.refined)