Bivariate Correlation & Intro to Linear Regression

Week 4

course website features

copy icon in code blocks
site-wide search bar
? and m key during slides
for the interested: site made using Quarto

Today

Bivariate Correlation

correlation vs causation
correlation strength/direction
Pearson’s r using cor()
significance test using cor.test()
random correlations & sample size

Linear Regression

equation of a line
x-intercept, y-intercept, slope
finding best fit line using lm()
residuals
model coefficients

Correlation does not equal Causation

*https://tylervigen.com/spurious-correlations*

Correlation does not equal Causation

Questions for today

What is Causation? Why do we care about it?
What is Correlation? Why do we care about it?
Why does correlation not equal causation? …
lots of reasons

Causation

Why does a ping pong ball move when you hit it?

Simple Physics

The energy from swinging the paddle gets transferred to the ping pong ball when they collide
The force in the paddle that is transferred to the ball
causes the ball to move

Measuring two things

What if we measured two things in the ping pong ball example

The strength of the swing
The distance travelled by the ball

Would we expect a relationship between the two measurements?
What would happen to the ball if we swung the paddle soft, to medium, to hard?

What have we learned

We know hitting a ball can cause it to move
The action of the causal force can be measured by relating a measure of the cause (swing strength) to a measure of the outcome (distance travelled)
There is a positive relationship between the two measures, increasing swing strength is associated with longer distances

What have we learned

We use the term correlation to describe the relationship between the two measures, in this case we found a positive correlation
We have seen that a causal force (swing strength) can produce a correlation between two measures

Causation

Psychological science is interested in understanding the causes of psychological processes
We can measure change in a causal force, and measure change in an outcome (psychological process)
If the force causally changes the outcome, we expect a relationship or association between the force and the outcome.

Causation

a causal relationship between X and Y predicts a correlation between X and Y
- we can test whether there’s a correlation between X and Y
- if there isn’t, that argues against a causal relationship
- falsification of our hypothesis
BUT the other way around doesn’t work
- the observation of a correlation between X and Y does not represent evidence of a causal relationship between X and Y
- classic logical error called affirming the consequent

Affirming the Consequent

if theory T is true, that predicts observation/data/pattern P should be seen
- if you don’t see P, that is evidence against theory T
if you do observe pattern P, that does not prove theory T
- P could be caused by something other than T

Affirming the Consequent

theory: I have covid-19
- prediction: covid-19 causes a fever
- observation: I do have a fever
- conclusion: I have covid-19
  - LOGICAL ERROR!
I could have a fever for many other reasons

Example:

thesis research question: does the amount of time spent on social media predict depression?
you collect data on a sample of 100 students
- depression rating between 1 and 10
- time spent on social media in hours per day
you find a correlation between the two measures
- is this evidence in support of your hypothesis that social media causes depression?
- NO!—LOGICAL ERROR! (affirming the consequent)

3 kinds of correlation

Positive correlation

Increases in the X variable are associated with
increases in the Y variable
Decreases in the X variable are associated with
decreases in the Y variable

Negative correlation

Increases in the X variable are associated with
decreases in the Y variable
Decreases in the X variable are associated with
increases in the Y variable

Random (no correlation)

Increases in the X variable are NOT associated with
increases or decreases in the Y variable
Decreases in the X variable are NOT associated with
increases or decreases in the Y variable

Increasing positive correlation

Negative correlations

Correlation Strength

Super strong (Perfect): dots all line up, no exceptions
Strong: Clear pattern, not much variation in dots
Medium: There is a pattern but dots have a lot of variaion
Weak: Sort of a hint of a pattern, dots have loads of variation
None: Dots are everywhere, no clear pattern

Pearson’s r

A number that summarizes the strength & direction of a correlation

varies between -1.0 and 1.0
0.0 means no correlation
1.0: perfect positive correlation
-1.0: perfect negative correlation
values in between: more or less strength

Formula for Pearson’s r

R: Pearson’s r

x <- c(1,2,3,4,5)
y <- c(4,3,6,7,8)
N <- 5
covariation<-sum((x-mean(x))*(y-mean(y)))/N
SD_x <- sqrt(sum((x-mean(x))^2)/N)
SD_y <- sqrt(sum((y-mean(y))^2)/N)
r <- covariation/(SD_x*SD_y)
r

[1] 0.9149914

R’s `cor()` function

R has a function to compute correlations called cor()

x <- c(1,2,3,4,5)
y <- c(4,3,6,7,8)
cor(x,y)

[1] 0.9149914

Interpreting Correlations

What do correlations mean?

Could mean that one variable causes change in another variable

BUT, it can mean other things…

Causal direction problem
Non-linear problem
Spurious correlations
Chance problem
Third variable as cause

1 Causal Directionality problem

2 Nonlinearity problem

3 Spurious correlations

http://www.tylervigen.com/spurious-correlations

4 Chance problem

Correlations between two variables can occur by chance, and be completlely meaningless

we will do some simulations!

5 Third variable

Molly’s parents’ income is $50k
Molly takes the SAT on Monday and scores a 480
on Thursday Mom gets a new job, the new income is $150k
If Molly re-takes the SAT on Friday, will her Mom’s increased income cause her to score 53 points higher? (0.53 * $100k)

SAT = 460 + 0.53 ($inc)

Simpson’s Paradox

a trend appears in several groups of data but disappears or reverses when the groups are combined

Pengun Bill length and depth. Artwork by @allison_horst.

Simpson’s Paradox

a trend appears in several groups of data but disappears or reverses when the groups are combined

Illustration of three species of Palmer Archipelago penguins: Chinstrap, Gentoo, and Adelie. Artwork by @allison_horst.

Correlation and variance

r can be between [-1,1]
r^2 is always between [0,1]
r^{2} is the proportion of variance in one variable that is explained by the other variable
r=0.5 means that 25% of the variance in one variable is explained by the other variable
r=0.1 means that only 1% of the variance in one variable is explained by the other variable
- (99% of the variance is unexplained by the other variable)

Correlations and Random Chance

What is randomness?

Two related ideas:

Things have equal chance of happening (e.g., a coin flip)

50% heads, 50% tails

Correlations and Random Chance

What is randomness?

Two related ideas:

Independence: One thing happening is totally unrelated to whether another thing happens

the outcome of one flip doesn’t predict the outcome of another

Two random variables

On average there should be zero correlation between two variables containing randomly drawn numbers

The numbers in variable X are drawn randomly (independently), so they do not predict numbers in Y
The numbers in variable Y are drawn randomly (independently), so they do not predict numbers in X

Two random variables

If X can’t predict Y, then correlation should be 0 right?

on average yes
for individual samples, no!

R: random numbers

In R, runif() allows you to sample random numbers (uniform distribution) between a min and max value. Numbers in the range have an equal chance of occuring

runif(n=5, min=0, max=10)

[1] 1.982554 1.148456 4.181884 9.571623 7.412287

runif(n=5, min=0, max=10)

[1] 2.3187183 0.9780493 6.0273084 2.1305708 3.3249109

runif(n=5, min=0, max=10)

[1] 3.316249 3.658413 6.655778 5.393493 7.680148

“Random” Correlations

x <- runif(n=5, min=0, max=10)
y <- runif(n=5, min=0, max=10)
x

[1] 5.33771078 6.02776119 6.94650596 6.33250039 0.02851074

[1] 6.6436306899 7.2380377073 8.0283523118 0.0002362533 4.2411074392

cor(x,y)

[1] 0.1630108

Small N “random” correlations

What is chance capable of?

We can see that randomly sampling numbers can produce a range of correlations, even when there shouldn’t be a “correlation”
What is the average correlation produced by chance? (zero)
What is the range of correlations that chance can produce?

Simulating what chance can do

The role of sample size (N)

The range of chance correlations decreases as N increases

The inference problem

Let’s say we sampled some data, and we found a correlation, (r = 0.5)
BUT: We know chance can sometimes produce random correlations

The inference problem

Is the correlation we observed in the sample a reflection of a real correlation in the population?
Is one variable really related to the other?
Or, is there really no correlation between these two population variables?
i.e. the correlation in the sample is spurious
i.e. it was produced by chance through random sampling → this is H_{0}, the null hypothesis

The (simulated) Null Hypothesis

Making inferences about chance

In a sample of size N=10, we observe: r=0.5
Null Hypothesis H_{0}:
→ no correlation actually exists in the population
→ actual population correlation r=0.0
Alternative Hypothesis H_{1}: correlation does exist in the population, and r=0.5 is our best estimate
We don’t know what the truth actually is

Making inferences about chance

We can only make an inference based on:

What is the probability p of observing:
→ an r in a sample as big as r=0.5
→ in a sample of size N=10
→ under the null hypothesis H_{0} in which the population r=0.0

Making inferences about chance

If that probability p is low enough:
→ we conclude that it is an unlikely scenario
→ and we reject H_{0}

Making inferences about chance

How low can you go?

If that probability p is low enough:
→ we conclude that it is an unlikely scenario
→ and we reject H_{0}

How low is low enough?

p < .05
p < .01
p < .001

Making inferences about chance

cor.test() in R will give you the probability p of obtaining a sample correlation as large as you did under the null hypothesis where the population correlation is actually zero

Assumptions:
→ x vs y is a linear relationship (plot it!)
→ x & y variables are normally distributed (shapiro.test())

Making inferences about chance

ggplot(tibble(x,y), aes(x,y)) +
  geom_point(size=3) + 
  stat_poly_line(se=FALSE) +
  stat_correlation(use_label("R"), label.x="right") +
  theme_bw()

Making inferences about chance

shapiro.test(x)


    Shapiro-Wilk normality test

data:  x
W = 0.97589, p-value = 0.3943

shapiro.test(y)


    Shapiro-Wilk normality test

data:  y
W = 0.97548, p-value = 0.3808

Making inferences about chance

cor.test(x, y)


    Pearson's product-moment correlation

data:  x and y
t = -14.55, df = 48, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.9439854 -0.8341607
sample estimates:
       cor 
-0.9028732

Making inferences about chance

if normality assumption is not met, you can use Spearman’s rank correlation coefficient \rho (“rho”)
See Chapter 5.7.6 of Navarro “Learning Statistics with R”

cor.test(x, y, method="spearman")


    Spearman's rank correlation rho

data:  x and y
S = 39340, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
       rho 
-0.8890756

IMPORTANT: “significant” \neq “large”

significance test is not whether r is large
test is whether r is “statistically significant”
whether r is reliably different than 0.0

IMPORTANT: “significant” \neq “large”

N=1000
r=0.055
p = 0.041
is r significant? (yes)
is r large? (no)
r^{2}= .055*.055 = .003025
= 0.3\% variance explained
- 99.7\% of the variance is unexplained

IMPORTANT: “significant” \neq “large”

N=7
r=0.75
p = 0.052
is r significant? (no)
is r large? (yes)

Linear Regression

geometric interpretation of correlation
can be used for prediction
a linear model relating one variable to another variable

Examples of correlation

Correlation with Regression lines

What is a regression line?

first: it’s a line (we will need the equation)
the best fit line
how do we determine which line fits best?

Residuals and error

What is a regression line?

first: it’s a line (we will need the equation)
the best fit line
how do we determine which line fits best?

The regression line minimizes the sum of the (squared) residuals

Animated Residuals

regression minimizes the sum of the (squared) residuals

Finding the best fit line

how do we find the best fit line?
First step, remember what lines are …

Equation for a line

y = mx +b

y = \text{slope}*x + \text{yintercept}

y = value on y-axis
m = slope of the line
x = value on x-axis
b = value on y-axis when x = 0

We will also use this form:

y = \beta_{0} + \beta_{1}x

solving for y

predicting y based on x

y = .5x + 2

What is the value of y, when x is 0?

y = .5*0 + 2
y = 0+2
y = 2

Finding the best fit line

find m and b for:

Y = mX + b

so that the regression line minimizes the sum of the squared residuals

Finding the best fit line

Y = mX + b

sample data

sample calculations

sample plot

b = -0.221
m = 1.19
Y = (1.19)X - 0.221

Linear Regression in R

library(tidyverse)
x <- c(1,4,3,6,5,7,8)
y <- c(2,5,1,8,6,8,9)
n = 7
(b <- ((sum(y)*sum(x^2)) - (sum(x)*sum(x*y))) / ((n*sum(x^2)) - (sum(x))^2))

[1] -0.2213115

(m = ((n*sum(x*y)) - (sum(x)*sum(y))) / ((n*sum(x^2)) - (sum(x))^2))

[1] 1.192623

Linear Regression in R using `lm()`

library(tidyverse)
x <- c(1,4,3,6,5,7,8)
y <- c(2,5,1,8,6,8,9)
df <- tibble(x,y)
df

# A tibble: 7 × 2
      x     y
  <dbl> <dbl>
1     1     2
2     4     5
3     3     1
4     6     8
5     5     6
6     7     8
7     8     9

mymod <- lm(y ~ x, data=df)
coef(mymod)

(Intercept)           x 
 -0.2213115   1.1926230

Linear Regression in R using `lm()`

library(tidyverse)
x <- c(1,4,3,6,5,7,8)
y <- c(2,5,1,8,6,8,9)
df <- tibble(x,y)
df

# A tibble: 7 × 2
      x     y
  <dbl> <dbl>
1     1     2
2     4     5
3     3     1
4     6     8
5     5     6
6     7     8
7     8     9

library(ggpmisc) # for stat_poly_eq
ggplot(data=df, aes(x=x,y=y)) +
  geom_point(size=4, color="black") +
  geom_smooth(method="lm", se=FALSE, color="blue") + 
  stat_poly_eq(use_label(c("eq")), size=6, color="blue") + 
  theme_bw(base_size=18)

Bivariate Correlation & Intro to Linear Regression

course website features

Today

Bivariate Correlation

Linear Regression

Correlation does not equal Causation

Correlation does not equal Causation

Questions for today

Causation

Simple Physics

Measuring two things

What have we learned

What have we learned

Causation

Causation

Affirming the Consequent

Affirming the Consequent

Example:

3 kinds of correlation

Positive correlation

Negative correlation

Random (no correlation)

Increasing positive correlation

Negative correlations

Correlation Strength

Pearson’s r

Formula for Pearson’s r

R: Pearson’s r

R’s cor() function

Interpreting Correlations

What do correlations mean?

1 Causal Directionality problem

2 Nonlinearity problem

3 Spurious correlations

4 Chance problem

5 Third variable

5 Third variable

Simpson’s Paradox

Simpson’s Paradox

Correlation and variance

Correlations and Random Chance

What is randomness?

Correlations and Random Chance

What is randomness?

Two random variables

Two random variables

R: random numbers

“Random” Correlations

Small N “random” correlations

What is chance capable of?

Simulating what chance can do

The role of sample size (N)

The inference problem

The inference problem

The (simulated) Null Hypothesis

Making inferences about chance

Making inferences about chance

Making inferences about chance

Making inferences about chance

How low can you go?

Making inferences about chance

Making inferences about chance

Making inferences about chance

Making inferences about chance

Making inferences about chance

IMPORTANT: “significant” \neq “large”

IMPORTANT: “significant” \neq “large”

IMPORTANT: “significant” \neq “large”

NEXT

Linear Regression

Examples of correlation

Correlation with Regression lines

What is a regression line?

Residuals and error

What is a regression line?

Animated Residuals

Finding the best fit line

Equation for a line

solving for y

Finding the best fit line

R’s `cor()` function

Linear Regression in R using `lm()`

Linear Regression in R using `lm()`