Bivariate Correlation & Intro to Linear Regression

Week 4

course website features

  • copy icon in code blocks
  • site-wide search bar
  • ? and m key during slides
  • for the interested: site made using Quarto

Today

Bivariate Correlation

  • correlation vs causation
  • correlation strength/direction
  • Pearson’s r using cor()
  • significance test using cor.test()
  • random correlations & sample size

Linear Regression

  • equation of a line
  • x-intercept, y-intercept, slope
  • finding best fit line using lm()
  • residuals
  • model coefficients

Correlation does not equal Causation

Correlation does not equal Causation

Questions for today

  1. What is Causation? Why do we care about it?
  2. What is Correlation? Why do we care about it?
  3. Why does correlation not equal causation?
    lots of reasons

Causation

Why does a ping pong ball move when you hit it?

Simple Physics

  1. The energy from swinging the paddle gets transferred to the ping pong ball when they collide

  2. The force transferred from the paddle to the ball causes the ball to move

Measuring two things

What if we measured two things in the ping pong ball example?

  1. The strength of the swing

  2. The distance travelled by the ball

  • Would we expect a relationship between the two measurements?
  • What would happen to the ball if we swung the paddle from soft, to medium, to hard?

What have we learned

  1. We know hitting a ball can cause it to move

  2. The action of the causal force can be measured by relating a measure of the cause (swing strength) to a measure of the outcome (distance travelled)

  3. There is a positive relationship between the two measures, increasing swing strength is associated with longer distances

What have we learned

  1. We use the term correlation to describe the relationship between the two measures; in this case we found a positive correlation

  2. We have seen that a causal force (swing strength) can produce a correlation between two measures

Causation

  1. Psychological science is interested in understanding the causes of psychological processes

  2. We can measure change in a causal force, and measure change in an outcome (psychological process)

  3. If the force causally changes the outcome, we expect a relationship or association between the force and the outcome.

Causation

  • a causal relationship between X and Y predicts a correlation between X and Y
    • we can test whether there’s a correlation between X and Y
    • if there isn’t, that argues against a causal relationship
    • falsification of our hypothesis
  • BUT the other way around doesn’t work
    • the observation of a correlation between X and Y does not represent evidence of a causal relationship between X and Y
    • classic logical error called affirming the consequent

Affirming the Consequent

  • if theory T is true, that predicts observation/data/pattern P should be seen
    • if you don’t see P, that is evidence against theory T
  • if you do observe pattern P, that does not prove theory T
    • P could be caused by something other than T

Affirming the Consequent

  • theory: I have covid-19
    • prediction: covid-19 causes a fever
    • observation: I do have a fever
    • conclusion: I have covid-19
      • LOGICAL ERROR!
  • I could have a fever for many other reasons

Example:

  • thesis research question: does the amount of time spent on social media predict depression?
  • you collect data on a sample of 100 students
    • depression rating between 1 and 10
    • time spent on social media in hours per day
  • you find a correlation between the two measures
    • is this evidence in support of your hypothesis that social media causes depression?
    • NO! LOGICAL ERROR! (affirming the consequent)

3 kinds of correlation

Positive correlation

  • Increases in the X variable are associated with
    increases in the Y variable

  • Decreases in the X variable are associated with
    decreases in the Y variable

Negative correlation

  • Increases in the X variable are associated with
    decreases in the Y variable

  • Decreases in the X variable are associated with
    increases in the Y variable

Random (no correlation)

  • Increases in the X variable are NOT associated with
    increases or decreases in the Y variable

  • Decreases in the X variable are NOT associated with
    increases or decreases in the Y variable
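A tiny illustration of the three cases (a sketch with made-up vectors; the exact numbers are just illustrative):

x <- c(1, 2, 3, 4, 5)
cor(x, c(2, 4, 5, 7, 9))   # positive: y tends to go up as x goes up
cor(x, c(9, 7, 5, 4, 2))   # negative: y tends to go down as x goes up
cor(x, c(5, 2, 9, 1, 6))   # close to zero: no consistent pattern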

Increasing positive correlation

Negative correlations

Correlation Strength

  • Super strong (Perfect): dots all line up, no exceptions
  • Strong: Clear pattern, not much variation in dots
  • Medium: There is a pattern but dots have a lot of variation
  • Weak: Sort of a hint of a pattern, dots have loads of variation
  • None: Dots are everywhere, no clear pattern

Pearson’s r

A number that summarizes the strength & direction of a correlation

  • varies between -1.0 and 1.0
  • 0.0 means no correlation
  • 1.0: perfect positive correlation
  • -1.0: perfect negative correlation
  • values in between: more or less strength

Formula for Pearson’s r
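In symbols, Pearson's r is the covariance of x and y divided by the product of their standard deviations; this is exactly what the R code on the next slide computes (the 1/N factors cancel in the second form):

r = \frac{\text{cov}(x, y)}{SD_{x} \, SD_{y}} = \frac{\sum_{i=1}^{N}(x_{i} - \bar{x})(y_{i} - \bar{y})}{\sqrt{\sum_{i=1}^{N}(x_{i} - \bar{x})^{2}} \sqrt{\sum_{i=1}^{N}(y_{i} - \bar{y})^{2}}}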


R: Pearson’s r

x <- c(1,2,3,4,5)
y <- c(4,3,6,7,8)
N <- 5
covariation <- sum((x-mean(x))*(y-mean(y)))/N   # covariance of x and y (dividing by N)
SD_x <- sqrt(sum((x-mean(x))^2)/N)              # standard deviation of x
SD_y <- sqrt(sum((y-mean(y))^2)/N)              # standard deviation of y
r <- covariation/(SD_x*SD_y)                    # Pearson's r
r
[1] 0.9149914

R’s cor() function

R has a function to compute correlations called cor()

x <- c(1,2,3,4,5)
y <- c(4,3,6,7,8)
cor(x,y)
[1] 0.9149914

Interpreting Correlations

What do correlations mean?

Could mean that one variable causes change in another variable

BUT, it can mean other things…

  1. Causal direction problem
  2. Non-linear problem
  3. Spurious correlations
  4. Chance problem
  5. Third variable as cause

1 Causal Directionality problem

2 Nonlinearity problem

3 Spurious correlations

http://www.tylervigen.com/spurious-correlations

4 Chance problem

Correlations between two variables can occur by chance and be completely meaningless

  • we will do some simulations!

5 Third variable

5 Third variable

  • Molly’s parents’ income is $50k
  • Molly takes the SAT on Monday and scores a 480
  • on Thursday Mom gets a new job, the new income is $150k
  • If Molly re-takes the SAT on Friday, will her Mom’s increased income cause her to score 53 points higher? (0.53 points per $1k × a $100k increase)

SAT = 460 + 0.53 ($inc)

Simpson’s Paradox

  • a trend appears in several groups of data but disappears or reverses when the groups are combined

Penguin bill length and depth. Artwork by @allison_horst.

Simpson’s Paradox

  • a trend appears in several groups of data but disappears or reverses when the groups are combined

Illustration of three species of Palmer Archipelago penguins: Chinstrap, Gentoo, and Adelie. Artwork by @allison_horst.

Correlation and variance

  • r can be between [-1,1]
  • r^2 is always between [0,1]
  • r^{2} is the proportion of variance in one variable that is explained by the other variable
  • r=0.5 means that 25% of the variance in one variable is explained by the other variable
  • r=0.1 means that only 1% of the variance in one variable is explained by the other variable
    • (99% of the variance is unexplained by the other variable)
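For example, using the small x and y vectors from the cor() slide earlier (a quick sketch):

x <- c(1,2,3,4,5)
y <- c(4,3,6,7,8)
cor(x, y)^2   # about 0.84: roughly 84% of the variance in y is shared with x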

Correlations and Random Chance

What is randomness?

Two related ideas:

  1. Things have equal chance of happening (e.g., a coin flip)

50% heads, 50% tails

Correlations and Random Chance

What is randomness?

Two related ideas:

  2. Independence: One thing happening is totally unrelated to whether another thing happens

the outcome of one flip doesn’t predict the outcome of another

Two random variables

On average there should be zero correlation between two variables containing randomly drawn numbers

  1. The numbers in variable X are drawn randomly (independently), so they do not predict numbers in Y
  2. The numbers in variable Y are drawn randomly (independently), so they do not predict numbers in X

Two random variables

If X can’t predict Y, then the correlation should be 0, right?

  • on average yes
  • for individual samples, no!

R: random numbers

In R, runif() allows you to sample random numbers (uniform distribution) between a min and max value. Numbers in the range have an equal chance of occurring

runif(n=5, min=0, max=10)
[1] 1.982554 1.148456 4.181884 9.571623 7.412287
runif(n=5, min=0, max=10)
[1] 2.3187183 0.9780493 6.0273084 2.1305708 3.3249109
runif(n=5, min=0, max=10)
[1] 3.316249 3.658413 6.655778 5.393493 7.680148

“Random” Correlations

x <- runif(n=5, min=0, max=10)
y <- runif(n=5, min=0, max=10)
x
[1] 5.33771078 6.02776119 6.94650596 6.33250039 0.02851074
y
[1] 6.6436306899 7.2380377073 8.0283523118 0.0002362533 4.2411074392
cor(x,y)
[1] 0.1630108

Small N “random” correlations

What is chance capable of?

  1. We can see that randomly sampling numbers can produce a range of correlations, even when there shouldn’t be a “correlation”
  2. What is the average correlation produced by chance? (zero)
  3. What is the range of correlations that chance can produce?

Simulating what chance can do

The role of sample size (N)

The range of chance correlations decreases as N increases
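One way to see this is a quick simulation (a minimal sketch; sim_chance_r() is just an illustrative helper, not course code):

# correlate two unrelated random variables many times for a given sample size N
sim_chance_r <- function(N, reps = 1000) {
  replicate(reps, cor(runif(N), runif(N)))
}

# the most extreme chance correlation shrinks as N grows
sapply(c(10, 50, 100, 1000), function(N) max(abs(sim_chance_r(N))))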

The inference problem

  • Let’s say we sampled some data and found a correlation (r = 0.5)
  • BUT: We know chance can sometimes produce random correlations

The inference problem

  • Is the correlation we observed in the sample a reflection of a real correlation in the population?
    Is one variable really related to the other?
  • Or, is there really no correlation between these two population variables?
    i.e. the correlation in the sample is spurious
    i.e. it was produced by chance through random sampling → this is H_{0}, the null hypothesis

The (simulated) Null Hypothesis

Making inferences about chance

  • In a sample of size N=10, we observe: r=0.5
  • Null Hypothesis H_{0}:
    → no correlation actually exists in the population
    → actual population correlation r=0.0
  • Alternative Hypothesis H_{1}: correlation does exist in the population, and r=0.5 is our best estimate
  • We don’t know what the truth actually is

Making inferences about chance

  • We can only make an inference based on:

    What is the probability p of observing:
    → an r in a sample as big as r=0.5
    → in a sample of size N=10
    → under the null hypothesis H_{0} in which the population r=0.0
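That probability can be approximated by simulating the null hypothesis directly (a sketch; rnorm() stands in for two unrelated population variables):

set.seed(1)
# many samples of size N = 10 where the true population correlation is 0
null_rs <- replicate(10000, cor(rnorm(10), rnorm(10)))

# proportion of chance correlations at least as extreme as the observed r = 0.5
mean(abs(null_rs) >= 0.5)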

Making inferences about chance

If that probability p is low enough:
→ we conclude that it is an unlikely scenario
→ and we reject H_{0}

Making inferences about chance

How low can you go?

If that probability p is low enough:
→ we conclude that it is an unlikely scenario
→ and we reject H_{0}

How low is low enough?

  • p < .05
  • p < .01
  • p < .001

Making inferences about chance

  • cor.test() in R will give you the probability p of obtaining a sample correlation as large as you did under the null hypothesis where the population correlation is actually zero

Assumptions:
x vs y is a linear relationship (plot it!)
x & y variables are normally distributed (shapiro.test())

Making inferences about chance

library(tidyverse)   # tibble(), ggplot()
library(ggpmisc)     # stat_poly_line(), stat_correlation()

ggplot(tibble(x, y), aes(x, y)) +
  geom_point(size=3) + 
  stat_poly_line(se=FALSE) +                            # best-fit line, no confidence band
  stat_correlation(use_label("R"), label.x="right") +   # annotate plot with Pearson's r
  theme_bw()

Making inferences about chance

shapiro.test(x)

    Shapiro-Wilk normality test

data:  x
W = 0.97589, p-value = 0.3943
shapiro.test(y)

    Shapiro-Wilk normality test

data:  y
W = 0.97548, p-value = 0.3808

Making inferences about chance

cor.test(x, y)

    Pearson's product-moment correlation

data:  x and y
t = -14.55, df = 48, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.9439854 -0.8341607
sample estimates:
       cor 
-0.9028732 

Making inferences about chance

  • if normality assumption is not met, you can use Spearman’s rank correlation coefficient \rho (“rho”)
  • See Chapter 5.7.6 of Navarro “Learning Statistics with R”
cor.test(x, y, method="spearman")

    Spearman's rank correlation rho

data:  x and y
S = 39340, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
       rho 
-0.8890756 

IMPORTANT: “significant” \neq “large”

  • significance test is not whether r is large
  • test is whether r is “statistically significant”
  • whether r is reliably different than 0.0

IMPORTANT: “significant” \neq “large”

  • N=1000
  • r=0.055
  • p = 0.041
  • is r significant? (yes)
  • is r large? (no)
  • r^{2}= .055*.055 = .003025
  • = 0.3\% variance explained
    • 99.7\% of the variance is unexplained

IMPORTANT: “significant” \neq “large”

  • N=7
  • r=0.75
  • p = 0.052
  • is r significant? (no)
  • is r large? (yes)

NEXT

Linear Regression

Linear Regression

  • geometric interpretation of correlation
  • can be used for prediction
  • a linear model relating one variable to another variable

Examples of correlation

Correlation with Regression lines

What is a regression line?

  • first: it’s a line (we will need the equation)
  • the best fit line
  • how do we determine which line fits best?

Residuals and error

What is a regression line?

  • first: it’s a line (we will need the equation)
  • the best fit line
  • how do we determine which line fits best?

The regression line minimizes the sum of the (squared) residuals

Animated Residuals

regression minimizes the sum of the (squared) residuals

Finding the best fit line

  • how do we find the best fit line?
  • First step, remember what lines are …

Equation for a line

y = mx + b

y = \text{slope} \times x + \text{y-intercept}

  • y = value on y-axis
  • m = slope of the line
  • x = value on x-axis
  • b = value on y-axis when x = 0

We will also use this form:

y = \beta_{0} + \beta_{1}x

solving for y

  • predicting y based on x

y = .5x + 2

What is the value of y, when x is 0?

y = .5*0 + 2
y = 0+2
y = 2

Finding the best fit line

find m and b for:

Y = mX + b

so that the regression line minimizes the sum of the squared residuals

Finding the best fit line

Y = mX + b

sample data

sample calculations

sample plot

b = -0.221
m = 1.19
Y = (1.19)X - 0.221

Linear Regression in R

library(tidyverse)
x <- c(1,4,3,6,5,7,8)
y <- c(2,5,1,8,6,8,9)
n <- 7
# closed-form least-squares estimates of the intercept (b) and slope (m)
(b <- ((sum(y)*sum(x^2)) - (sum(x)*sum(x*y))) / ((n*sum(x^2)) - (sum(x))^2))
[1] -0.2213115
(m <- ((n*sum(x*y)) - (sum(x)*sum(y))) / ((n*sum(x^2)) - (sum(x))^2))
[1] 1.192623

Linear Regression in R using lm()

library(tidyverse)
x <- c(1,4,3,6,5,7,8)
y <- c(2,5,1,8,6,8,9)
df <- tibble(x,y)
df
# A tibble: 7 × 2
      x     y
  <dbl> <dbl>
1     1     2
2     4     5
3     3     1
4     6     8
5     5     6
6     7     8
7     8     9
mymod <- lm(y ~ x, data=df)
coef(mymod)
(Intercept)           x 
 -0.2213115   1.1926230 
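Because lm() minimizes the sum of squared residuals, that sum is smaller for these coefficients than for any other straight line. A quick check, reusing mymod from above (the comparison line y = x is just an arbitrary example):

resid(mymod)            # residuals: observed y minus predicted y
sum(resid(mymod)^2)     # sum of squared residuals for the best-fit line

sum((y - (0 + 1*x))^2)  # a different line (slope 1, intercept 0) gives a larger value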

Linear Regression in R using lm()

library(tidyverse)
x <- c(1,4,3,6,5,7,8)
y <- c(2,5,1,8,6,8,9)
df <- tibble(x,y)
df
# A tibble: 7 × 2
      x     y
  <dbl> <dbl>
1     1     2
2     4     5
3     3     1
4     6     8
5     5     6
6     7     8
7     8     9
library(ggpmisc) # for stat_poly_eq
ggplot(data=df, aes(x=x,y=y)) +
  geom_point(size=4, color="black") +
  geom_smooth(method="lm", se=FALSE, color="blue") + 
  stat_poly_eq(use_label(c("eq")), size=6, color="blue") + 
  theme_bw(base_size=18)
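Since the regression line can also be used for prediction, here is a brief sketch using predict() with the fitted model (the new x values 2 and 10 are just illustrative):

mymod <- lm(y ~ x, data=df)                      # refit the model from the earlier slide
predict(mymod, newdata = tibble(x = c(2, 10)))   # predicted y at x = 2 and x = 10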