Central limit theorem

Normal distributions and the Central Limit Theorem

If X is a random variable sampled from a normally distributed population with mean \(\mu\) and standard deviation \(\sigma\), then:

\(\frac{\overline{X} - \mu}{\frac{\sigma}{\sqrt{n}}}\)

follows a standard normal distribution (i.e. with mean 0 and standard deviation 1).

sampling <- replicate(10000, {
  n=30
  samp<-rnorm(n, mean=3, sd=2);
  samp_mean <- mean(samp);
  (samp_mean - 3) / (2/sqrt(n))
  }
)
ex1 <- data.frame(data=c(rnorm(10000, mean=3, sd=2), sampling), Label=c(rep("Normal",10000), rep("Sampled", 10000)))
ggplot(ex1, aes(data, fill=Label)) + geom_density(alpha=0.4, trim=T)

This holds true also for small n:

sampling <- replicate(10000, {
  n=2
  samp<-rnorm(n, mean=3, sd=2);
  samp_mean <- mean(samp);
  (samp_mean - 3) / (2/sqrt(n))
  }
)
ex1 <- data.frame(data=c(rnorm(10000, mean=3, sd=2), sampling), Label=c(rep("Normal",10000), rep("Sampled", 10000)))
ggplot(ex1, aes(data, fill=Label)) + geom_density(alpha=0.4, trim=T)

If the sampled population is not normal, for the central limit theorem the sample means approximately follow a normal distribution if n is large:

population <- runif(10000, -20, 0)
pop_mean <- mean(population)
pop_sd <- sd(population)
sampling <- replicate(10000, {
  n=100
  samp<-sample(population, n)
  samp_mean <- mean(samp);
  (samp_mean - pop_mean) / (pop_sd/sqrt(n))
  }
)
ex1 <- data.frame(data=c(population, sampling, rnorm(10000)), Label=c(rep("Uniform pop",10000), rep("Sampled", 10000), rep("Normal", 10000)))
ggplot(ex1, aes(data, colour=Label)) + geom_density(alpha=0.4, trim=T)

However this is not the case for small values of n:

sampling <- replicate(10000, {
  n=3
  samp<-sample(population, n)
  samp_mean <- mean(samp);
  (samp_mean - pop_mean) / (pop_sd/sqrt(n))
  }
)
ex1 <- data.frame(data=c(population, sampling, rnorm(10000)), Label=c(rep("Uniform pop",10000), rep("Sampled", 10000), rep("Normal", 10000)))
ggplot(ex1, aes(data, colour=Label)) + geom_density(alpha=0.4, trim=T)

Note the shorter tails of the distribution of sample means in the figure above.

The Student’s t distribution

In real life scenarions, \(\sigma\) is usually not know, but it can be estimated with the sample standard deviation, s. However, s is a function of the sample and using it to estimate \(\sigma\) increases the uncertainty. For this reason,

\(\frac{\overline{X} - \mu}{\frac{s}{\sqrt{n}}}\)

is a random variable that follows the t distribution with n-1 d.f. (precisely if x is normally distributed, approximately if x is not normal and n is large enough).

n=5
sampling <- replicate(10000, {
  samp<-rnorm(n, mean=3, sd=2);
  samp_mean <- mean(samp);
  samp_sd <- sd(samp)
  (samp_mean - 3) / (samp_sd/sqrt(n))
  }
)
ex1 <- data.frame(data=c(rt(10000,n-1), sampling, rnorm(10000)), Label=c(rep("Student's t",10000), rep("Sampled", 10000), rep("Normal", 10000)))
ggplot(ex1, aes(data, colour=Label)) + geom_density(alpha=0.4, trim=T)

The t distribution has longer tails, reflecting the higher uncertainty due to the fact the we use s to estimate \(\sigma\).

summary(rnorm(10000))

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -3.750000 -0.682200 -0.000073 -0.001534  0.670400  3.390000

summary(sampling)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -9.73500 -0.71890  0.01338  0.01553  0.75430 36.04000

summary(rt(10000,n-1))

##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -14.400000  -0.732100   0.003925  -0.014900   0.724300  15.810000

If x is sampled from a non normal distribution, the sample means still follow a standard normal if n is large

population <- runif(10000, -10, 5)
pop_mean <- mean(population)
n=30
sampling <- replicate(10000, {
  samp<-sample(population, n)
  samp_mean <- mean(samp);
  samp_sd <- sd(samp)
  (samp_mean - pop_mean) / (samp_sd/sqrt(n))
  }
)
ex1 <- data.frame(data=c(rt(10000,n-1), sampling, population, rnorm(10000)), Label=c(rep("Student's t",10000), rep("Sampled", 10000), rep("Uniform population", 10000), rep("Normal", 10000)))
ggplot(ex1, aes(data, colour=Label)) + geom_density(alpha=0.4, trim=T)

This is not the case if n is small:

population <- runif(10000, -10, 5)
pop_mean <- mean(population)
n=4
sampling <- replicate(10000, {
  samp<-sample(population, n)
  samp_mean <- mean(samp);
  samp_sd <- sd(samp)
  (samp_mean - pop_mean) / (samp_sd/sqrt(n))
  }
)
ex1 <- data.frame(data=c(rt(10000,n-1), sampling, population, rnorm(10000)), Label=c(rep("Student's t",10000), rep("Sampled", 10000), rep("Uniform population", 10000), rep("Normal", 10000)))
ggplot(ex1, aes(data, colour=Label)) + geom_density(alpha=0.4, trim=T)

summary(rnorm(10000))

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -3.723000 -0.651600  0.009064  0.018460  0.709000  3.735000

summary(sampling)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -55.69000  -0.75250  -0.01850  -0.00561   0.68950  31.75000

summary(rt(10000,n-1))

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -24.08000  -0.74520   0.02118   0.02478   0.77990  34.12000

On the central limit theorem

Tommaso Leonardi

Normal distributions and the Central Limit Theorem

The Student’s t distribution