Bivariate Correlation Review

Today

Bivariate Correlation Review

  • correlation vs causation
  • correlation strength/direction
  • Pearson’s r using cor()
  • significance test using cor.test()
  • random correlations & sample size

Causation

  1. Hitting a ball can cause it to move

  2. The action of the causal force can be measured by relating a measure of the cause (swing strength) to a measure of the outcome (distance travelled)

  3. There is a positive relationship between the two measures, increasing swing strength is associated with longer distances

Correlation

  1. We use the term correlation to describe the relationship between the two measures, in this case we found a positive correlation

  2. We have seen that a causal force (swing strength) can produce a correlation between two measures

Causation & Correlation

  1. Psychological science is interested in understanding the causes of psychological processes

  2. We can measure change in a causal force (an independent variable), and measure change in an outcome (a psychological process, the dependent variable)

  3. If the force does causally change the outcome, we expect a relationship or association between the force and the outcome (we can measure this as a correlation between the two variables)

Causation

  • a causal relationship between X and Y predicts a correlation between X and Y
    • we can test whether there’s a correlation between X and Y
    • if there isn’t a correlation, that argues against a causal relationship
    • falsification of our hypothesis
  • BUT the other way around doesn’t work
    • the observation of a correlation between X and Y does not represent evidence of a causal relationship between X and Y
    • it doesn’t prove X causes Y
    • concluding that it does commits the classic logical error of affirming the consequent

Affirming the Consequent

  • if theory T is true, that predicts data/pattern P should be seen
    • if you don’t see P, that is evidence against theory T
  • if you do observe pattern P, that does not prove theory T
    • why? P could be caused by something other than T

Affirming the Consequent

  • theory: I have covid-19
    • prediction: covid-19 causes a fever
    • observation: I do have a fever
    • conclusion: I have covid-19
      • LOGICAL ERROR!
  • I could have a fever for many other reasons

Example:

  • your hypothesis: spending more time on social media negatively affects mental health
  • you collect data on a sample of 100 students
    • a questionnaire gives a depression rating between 1 and 10
    • they also report time spent on social media in a typical week in hours per day
  • you find a correlation does exist between the two measures
    • is this evidence in support of your hypothesis?
    • NO! LOGICAL ERROR! (affirming the consequent)

3 kinds of correlation

Positive correlation

  • Increases in the X variable are associated with
    increases in the Y variable

  • Decreases in the X variable are associated with
    decreases in the Y variable

Negative correlation

  • Increases in the X variable are associated with
    decreases in the Y variable

  • Decreases in the X variable are associated with
    increases in the Y variable

Random (no correlation)

  • Increases in the X variable are NOT associated with
    increases or decreases in the Y variable

  • Decreases in the X variable are NOT associated with
    increases or decreases in the Y variable

Increasing positive correlation

Negative correlations

Correlation Strength

  • Super strong (Perfect): dots all line up, no exceptions
  • Strong: Clear pattern, not much variation in dots
  • Medium: There is a pattern but dots have a lot of variation
  • Weak: Sort of a hint of a pattern, dots have loads of variation
  • None: Dots are everywhere, no clear pattern

Pearson’s r

A number that summarizes the strength & direction of a correlation

  • varies between -1.0 and 1.0
  • 0.0 means no correlation
  • 1.0: perfect positive correlation
  • -1.0: perfect negative correlation
  • values in between indicate intermediate strength

Formula for Pearson’s r
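The formula follows directly from the calculation on the next slide (which uses the population, divide-by-N forms; the N’s cancel, so the sample forms give the same r):

r = \frac{cov(X,Y)}{SD_{X}\,SD_{Y}} = \frac{\frac{1}{N}\sum_{i=1}^{N}(x_{i}-\bar{x})(y_{i}-\bar{y})}{\sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_{i}-\bar{x})^{2}}\;\sqrt{\frac{1}{N}\sum_{i=1}^{N}(y_{i}-\bar{y})^{2}}}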


R: Pearson’s r

x <- c(1, 2, 3, 4, 5)
y <- c(4, 3, 6, 7, 8)
N <- 5

# covariance of x and y (population form, dividing by N)
covariation <- sum((x - mean(x)) * (y - mean(y))) / N

# population standard deviations (also dividing by N; the N's cancel
# in r, so the divide-by-(N-1) sample forms give the same answer)
SD_x <- sqrt(sum((x - mean(x))^2) / N)
SD_y <- sqrt(sum((y - mean(y))^2) / N)

# Pearson's r: covariance scaled by both standard deviations
r <- covariation / (SD_x * SD_y)
r
[1] 0.9149914

R’s cor() function

R has a function to compute correlations called cor()

x <- c(1,2,3,4,5)
y <- c(4,3,6,7,8)
cor(x,y)
[1] 0.9149914
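
Beyond two vectors, cor() also accepts a data frame or matrix and returns the correlation between every pair of columns; for example, using R’s built-in mtcars data:

# correlation matrix for three columns of mtcars
cor(mtcars[, c("mpg", "wt", "hp")])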

What do correlations mean?

Could mean that one variable causes change in another variable

BUT, it can mean other things…

  1. Causal direction problem
  2. Non-linear problem
  3. Spurious correlations
  4. Chance correlations
  5. Third variable as cause

1 Causal Directionality problem

2 Nonlinearity problem

3 Spurious correlations

http://www.tylervigen.com/spurious-correlations

4 Chance correlations

Correlations between two datasets (samples from a population) can occur by chance, and be completely meaningless

  • we will do some simulations!

5 Third variable

Simpson’s Paradox

  • a trend appears in several groups of data but disappears or reverses when the groups are combined

Penguin bill length and depth. Artwork by @allison_horst.

Simpson’s Paradox

  • a trend appears in several groups of data but disappears or reverses when the groups are combined

Illustration of three species of Palmer Archipelago penguins: Chinstrap, Gentoo, and Adelie. Artwork by @allison_horst.
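
A minimal simulation sketch of the paradox (the group centers, slope, and noise levels are made-up values for illustration): y rises with x inside every group, but the group means run the opposite way, so pooling the groups reverses the sign of the correlation.

set.seed(1)

# within each group, y rises with x (slope 0.8)...
make_group <- function(x_center, y_center) {
  x <- rnorm(50, mean = x_center, sd = 1)
  y <- 0.8 * (x - x_center) + y_center + rnorm(50, sd = 0.5)
  data.frame(x = x, y = y)
}

# ...but the group means run the opposite way
groups <- list(make_group(0, 6), make_group(3, 3), make_group(6, 0))

sapply(groups, function(g) cor(g$x, g$y))  # positive within every group

combined <- do.call(rbind, groups)
cor(combined$x, combined$y)                # negative once groups are pooled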

Correlation and variance

  • r lies in [-1, 1]
  • r^{2} always lies in [0, 1]
  • r^{2} is the proportion of variance in one variable that is explained by the other variable
  • r=0.5 means that 25% of the variance in one variable is explained by the other variable
  • r=0.1 means that only 1% of the variance in one variable is explained by the other variable
    • (99% of the variance is unexplained by the other variable)
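
A quick check in R (a small sketch using simulated data): the squared correlation is exactly the R-squared reported by a simple linear regression.

set.seed(1)
x <- rnorm(100)
y <- 0.5 * x + rnorm(100)

r <- cor(x, y)
r^2                            # proportion of variance explained

summary(lm(y ~ x))$r.squared   # identical: R-squared from simple regression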

Correlations and Random Chance

What is randomness?

Two related ideas:

  1. Things have equal chance of happening (e.g., a coin flip)

50% heads, 50% tails
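
A quick sketch in R: sample() simulates fair coin flips, and over many flips the observed proportions settle near 50/50.

flips <- sample(c("heads", "tails"), size = 10000, replace = TRUE)
table(flips) / length(flips)   # both proportions land near 0.5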

Correlations and Random Chance

What is randomness?

Two related ideas:

  2. Independence: One thing happening is totally unrelated to whether another thing happens

the outcome of one flip doesn’t predict the outcome of another
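
Independence can be checked the same way in a quick sketch: the correlation between each flip and the next one should sit near zero.

flips <- sample(0:1, size = 10000, replace = TRUE)   # 1 = heads, 0 = tails
cor(flips[-length(flips)], flips[-1])                # each flip vs the next: near 0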

Two random variables

On average there should be zero correlation between two variables containing randomly drawn numbers

  1. The numbers in variable X are drawn randomly (independently), so they do not predict numbers in Y
  2. The numbers in variable Y are drawn randomly (independently), so they do not predict numbers in X

Two random variables

If X can’t predict Y, then correlation should be 0 right?

  • on average yes
  • for individual samples, no!

R: random numbers

In R, runif() allows you to sample random numbers (uniform distribution) between a min and max value. Numbers in the range have an equal chance of occurring

runif(n=5, min=0, max=10)
[1] 7.09529129 9.19174037 2.48744008 0.09893362 2.50790475
runif(n=5, min=0, max=10)
[1] 3.93085042 0.09306185 1.42922535 4.27942716 3.90240255
runif(n=5, min=0, max=10)
[1] 2.920361 5.632108 7.340943 3.944878 1.847596

“Random” Correlations

set.seed(28120)
x <- runif(n=5, min=0, max=10)
y <- runif(n=5, min=0, max=10)
x
[1] 2.528557 9.495890 6.832830 8.955035 5.113812
y
[1] 1.734883 7.162924 3.113391 9.979855 5.181226
cor(x,y)
[1] 0.8174196

Widget: random sampling from two random variables

What is chance capable of?

  1. Randomly sampling numbers can produce a range of correlations, even when the actual correlation in the population is zero
  2. What is the average correlation produced by chance? (zero)
  3. For a single sample, what is the range of correlations that chance can produce?
    • answer: it depends on the sample size

Simulating what chance can do

The role of sample size (N)

The range of chance correlations decreases as N increases
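
A sketch of that simulation (the sample sizes and the 1000 replications are arbitrary choices): at every N the average chance correlation sits near zero, but the spread shrinks as N grows.

# for each N, correlate two unrelated runif() variables, 1000 times
chance_r <- function(N) replicate(1000, cor(runif(N), runif(N)))

for (N in c(5, 10, 50, 100)) {
  rs <- chance_r(N)
  cat("N =", N, " mean r =", round(mean(rs), 3),
      " range =", round(range(rs), 2), "\n")
}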

The inference problem

  • Let’s say we sampled some data, and we found r = 0.5
  • BUT: We know chance can sometimes produce random correlations in a sample

The inference problem

  • Is the correlation we observed in our sample a reflection of a real correlation in the population?
    Is one variable really related to the other?
  • Or, is there really no correlation between these variables?
  • i.e. the correlation in the sample arises from random chance
    → this is H_{0}, the null hypothesis

The (simulated) Null Hypothesis

Making inferences about chance

  • In a sample of size N=10, we observe: r=0.5
  • Null Hypothesis H_{0}:
    → no correlation actually exists in the population
    → actual population correlation r=0.0
  • Alternative Hypothesis H_{1}: correlation does exist in the population, and r=0.5 is our best estimate
  • We don’t know what the truth actually is

Making inferences about chance

  • We can only make an inference based on:

    What is the probability p of observing:
    → an r in a sample as big as r=0.5
    → in a sample of size N=10
    → under the null hypothesis H_{0} in which the population r=0.0
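
That probability can be estimated directly by simulation, as a sketch under H_{0} (the 10,000 replications are an arbitrary choice):

set.seed(1)

# sample correlations of two unrelated variables, N = 10, under H0
null_rs <- replicate(10000, cor(runif(10), runif(10)))

# proportion of chance correlations at least as extreme as the observed r
mean(abs(null_rs) >= 0.5)   # approximates the two-sided p-value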

Making inferences about chance

If that probability p is low enough:
→ we conclude that it is an unlikely scenario
→ and we reject H_{0}

Making inferences about chance

How low can you go?

If that probability p is low enough:
→ we conclude that it is an unlikely scenario
→ and we reject H_{0}

How low is low enough?

  • p < .05
  • p < .01
  • p < .001

Making inferences about chance

  • cor.test() in R will give you the probability p of obtaining a sample correlation as large as you did under the null hypothesis where the population correlation is actually zero

Assumptions:

  • x vs y is a linear relationship (plot it!)
  • x & y variables are normally distributed (shapiro.test())

Making inferences about chance

library(ggplot2)
library(ggpmisc)   # stat_poly_line(), stat_correlation()
library(tibble)

# x and y here come from a larger sample than the N = 5 example above
ggplot(tibble(x, y), aes(x, y)) +
  geom_point(size = 3) +
  stat_poly_line(se = FALSE) +                          # fitted line
  stat_correlation(use_label("R"), label.x = "right") + # annotate with r
  theme_bw()

Making inferences about chance

shapiro.test(x)

    Shapiro-Wilk normality test

data:  x
W = 0.93812, p-value = 0.01136
shapiro.test(y)

    Shapiro-Wilk normality test

data:  y
W = 0.96352, p-value = 0.1248

Making inferences about chance

cor.test(x, y)

    Pearson's product-moment correlation

data:  x and y
t = -13.576, df = 48, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.9368052 -0.8142478
sample estimates:
       cor 
-0.8907193 

Making inferences about chance

  • if the normality assumption is not met (as for x above, where shapiro.test() gave p < .05), you can use Spearman’s rank correlation coefficient \rho (“rho”)
  • See Chapter 5.7.6 of Navarro “Learning Statistics with R”
cor.test(x, y, method="spearman")

    Spearman's rank correlation rho

data:  x and y
S = 39276, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
       rho 
-0.8860024 

IMPORTANT: “significant” \neq “large”

  • significance test is not whether r is large
  • test is whether r is “statistically significant”
  • whether r is reliably different from 0.0

IMPORTANT: “significant” \neq “large”

  • N=1000
  • r=0.065
  • p = 0.040
  • is r significant? (yes)
  • is r large? (no)
  • r^{2}= .065*.065 = .004225
  • = 0.4\% variance explained
    • 99.6\% of the variance is unexplained

IMPORTANT: “significant” \neq “large”

  • N=7
  • r=0.75
  • p = 0.052
  • is r significant? (no)
  • is r large? (yes)
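
As a check on these examples, the p-value for a Pearson correlation follows directly from r and N through the t-statistic t = r\sqrt{(N-2)/(1-r^{2})} with N-2 degrees of freedom:

# p-value for a Pearson correlation from r and N alone
r_to_p <- function(r, N) {
  t <- r * sqrt((N - 2) / (1 - r^2))   # t-statistic for Pearson's r
  2 * pt(-abs(t), df = N - 2)          # two-sided p-value
}

r_to_p(0.75, 7)   # ~ .052: a large r, but not significant at this small N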