Sampling Distributions & t-tests

In R it is easy to compute measures of central tendency and dispersion. Let’s generate a random sample of size n=10 from a normal distribution with mean 0 and standard deviation 1, and compute several of these measures. Note that when you wrap a statement in round brackets, R prints the result of the statement to the screen. Because the sample is random, your values will differ from the ones shown here.

(x <- rnorm(10, 0, 1))
##  [1]  0.06580364 -0.56815002  0.26812933 -1.37361172 -1.09994771
##  [6] -0.30814480  0.74600471  0.06201456 -0.49934970 -0.05601890
mean(x)
## [1] -0.2763271
median(x)
## [1] -0.1820818
var(x)
## [1] 0.4044395
sd(x)
## [1] 0.6359556
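
To make the link between these functions explicit, here is a minimal sketch that recomputes the variance and standard deviation by hand and compares them with var() and sd(). The seed and the names var_by_hand and sd_by_hand are assumptions added for illustration; they are not part of the original example.

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible
x <- rnorm(10, 0, 1)
n <- length(x)

# sample variance: sum of squared deviations from the mean, divided by n - 1
var_by_hand <- sum((x - mean(x))^2) / (n - 1)
# sample standard deviation: square root of the sample variance
sd_by_hand <- sqrt(var_by_hand)

# compare with the built-in functions (allowing for floating point error)
all.equal(var_by_hand, var(x))
all.equal(sd_by_hand, sd(x))
```

Note that var() and sd() divide by n - 1 rather than n, so they estimate the population variance and standard deviation from a sample.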

Central limit theorem

Given random sampling, the sampling distribution of the mean approaches a normal distribution as the sample size increases, even if the population distribution of raw scores is not normal. More precisely, the central limit theorem states that the sum (and therefore the mean) of a large number of independent observations from the same distribution, provided that distribution has finite variance, is approximately normally distributed, and the approximation improves as the number of observations increases.

We can demonstrate this relatively easily in R, and along the way learn some useful R functions. First let’s choose a non-normal distribution from which we are going to sample. We’ll choose an exponential distribution, which is bunched up at small values with a long positive tail:

hist(rexp(1000))
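
As a rough numerical companion to the histogram, the sketch below compares the mean and median of an exponential sample. With the default rate of 1, the theoretical mean is 1 and the theoretical median is log(2), about 0.69, so the mean sitting above the median reflects the long positive tail. The seed and variable names are assumptions for illustration.

```r
set.seed(2)  # assumption: arbitrary seed, only for reproducibility
y <- rexp(1000)        # default rate = 1
mean_y <- mean(y)      # theoretical value is 1/rate = 1
median_y <- median(y)  # theoretical value is log(2)/rate, about 0.69
c(mean = mean_y, median = median_y)
```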

So what we’re going to do is repeatedly draw a sample of a fixed size from an exponential distribution, each time taking the mean of the sample and storing it. We’ll then vary the sample size and see that as it increases, the sampling distribution of means is approximated better and better by a normal distribution. Here is what the sampling distribution of means looks like when 1,000 samples of size 3 are taken:

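nsamples <- 1000
samplesize <- 3
# preallocate a vector to hold one mean per sample
x <- numeric(nsamples)
for (i in seq_len(nsamples)) {
    x[i] <- mean(rexp(samplesize))
}
# normal curve with the same mean and sd as the simulated means
xfit <- seq(min(x), max(x), length.out=50)
yfit <- dnorm(xfit, mean=mean(x), sd=sd(x))
# plot densities rather than counts so the curve is on the same scale
hist(x, probability=TRUE, main=paste("1,000 samples of size", samplesize))
lines(xfit, yfit, col="red")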
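
The simulated means can also be checked against theory. For an exponential distribution with rate 1, both the mean and the standard deviation are 1, so the sampling distribution of the mean for samples of size n should have mean 1 and standard deviation 1/sqrt(n). The following is a minimal sketch of that comparison. It uses replicate() in place of the explicit loop, and the seed and the names means and theory_sd are assumptions for illustration.

```r
set.seed(42)  # assumption: fixed seed so the sketch is reproducible
nsamples <- 1000
samplesize <- 3

# draw nsamples samples of size samplesize and keep each sample mean
means <- replicate(nsamples, mean(rexp(samplesize)))

# theoretical standard deviation of the sample mean (the standard error)
theory_sd <- 1 / sqrt(samplesize)

# empirical values next to their theoretical counterparts
c(mean_empirical = mean(means), mean_theory = 1)
c(sd_empirical = sd(means), sd_theory = theory_sd)
```

The empirical values will not match exactly, but with 1,000 samples they should be close to the theoretical ones.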

Now let’s see what happens when we take 1,000 samples of size 3, 6, 12 and 24:

# instruct the figure to plot sequentially in a 2x2 grid
par(mfrow=c(2,2))
nsamples <- 1000
x <- numeric(nsamples)
for (s in c(3,6,12,24)) {
    # draw nsamples samples of size s and store each sample mean
    for (i in seq_len(nsamples)) {
        x[i] <- mean(rexp(s))
    }
    xfit <- seq(min(x), max(x), length.out=50)
    yfit <- dnorm(xfit, mean=mean(x), sd=sd(x))
    hist(x, probability=TRUE, main=paste("1,000 samples of size", s))
    lines(xfit, yfit, col="red")
}
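
One way to put a number on what the four histograms show is to compute the skewness of the simulated means at each sample size. A normal distribution has skewness 0, and the mean of n exponential observations has theoretical skewness 2/sqrt(n), so the values should shrink towards 0 as the sample size grows. This is a sketch under stated assumptions: skewness() here is a simple moment-based helper defined for illustration (base R has no such function), and the seed is arbitrary.

```r
set.seed(123)  # assumption: fixed seed so the sketch is reproducible
nsamples <- 1000
sizes <- c(3, 6, 12, 24)

# hypothetical helper: moment-based sample skewness (not part of base R)
skewness <- function(v) {
    d <- v - mean(v)
    mean(d^3) / mean(d^2)^1.5
}

# skewness of the sampling distribution of the mean at each sample size
skew <- sapply(sizes, function(s) {
    skewness(replicate(nsamples, mean(rexp(s))))
})
names(skew) <- sizes

# simulated skewness next to the theoretical value 2/sqrt(n)
round(rbind(simulated = skew, theory = 2 / sqrt(sizes)), 2)
```

The simulated values are noisy, but the overall downward trend mirrors the histograms becoming more symmetric.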