On the central limit theorem
Tommaso Leonardi
Normal distributions and the Central Limit Theorem
If X is a random variable sampled from a normally distributed population with mean \(\mu\) and standard deviation \(\sigma\), then:
follows a standard normal distribution (i.e. with mean 0 and standard deviation 1).
sampling <- replicate(10000, {
n=30
samp<-rnorm(n, mean=3, sd=2);
samp_mean <- mean(samp);
(samp_mean - 3) / (2/sqrt(n))
}
)
ex1 <- data.frame(data=c(rnorm(10000, mean=3, sd=2), sampling), Label=c(rep("Normal",10000), rep("Sampled", 10000)))
ggplot(ex1, aes(data, fill=Label)) + geom_density(alpha=0.4, trim=T)
This holds true also for small n:
sampling <- replicate(10000, {
n=2
samp<-rnorm(n, mean=3, sd=2);
samp_mean <- mean(samp);
(samp_mean - 3) / (2/sqrt(n))
}
)
ex1 <- data.frame(data=c(rnorm(10000, mean=3, sd=2), sampling), Label=c(rep("Normal",10000), rep("Sampled", 10000)))
ggplot(ex1, aes(data, fill=Label)) + geom_density(alpha=0.4, trim=T)
If the sampled population is not normal, for the central limit theorem the sample means approximately follow a normal distribution if n is large:
population <- runif(10000, -20, 0)
pop_mean <- mean(population)
pop_sd <- sd(population)
sampling <- replicate(10000, {
n=100
samp<-sample(population, n)
samp_mean <- mean(samp);
(samp_mean - pop_mean) / (pop_sd/sqrt(n))
}
)
ex1 <- data.frame(data=c(population, sampling, rnorm(10000)), Label=c(rep("Uniform pop",10000), rep("Sampled", 10000), rep("Normal", 10000)))
ggplot(ex1, aes(data, colour=Label)) + geom_density(alpha=0.4, trim=T)
However this is not the case for small values of n:
sampling <- replicate(10000, {
n=3
samp<-sample(population, n)
samp_mean <- mean(samp);
(samp_mean - pop_mean) / (pop_sd/sqrt(n))
}
)
ex1 <- data.frame(data=c(population, sampling, rnorm(10000)), Label=c(rep("Uniform pop",10000), rep("Sampled", 10000), rep("Normal", 10000)))
ggplot(ex1, aes(data, colour=Label)) + geom_density(alpha=0.4, trim=T)
Note the shorter tails of the distribution of sample means in the figure above.
The Student’s t distribution
In real life scenarions, \(\sigma\) is usually not know, but it can be estimated with the sample standard deviation, s. However, s is a function of the sample and using it to estimate \(\sigma\) increases the uncertainty. For this reason,
is a random variable that follows the t distribution with n-1 d.f. (precisely if x is normally distributed, approximately if x is not normal and n is large enough).
n=5
sampling <- replicate(10000, {
samp<-rnorm(n, mean=3, sd=2);
samp_mean <- mean(samp);
samp_sd <- sd(samp)
(samp_mean - 3) / (samp_sd/sqrt(n))
}
)
ex1 <- data.frame(data=c(rt(10000,n-1), sampling, rnorm(10000)), Label=c(rep("Student's t",10000), rep("Sampled", 10000), rep("Normal", 10000)))
ggplot(ex1, aes(data, colour=Label)) + geom_density(alpha=0.4, trim=T)
The t distribution has longer tails, reflecting the higher uncertainty due to the fact the we use s to estimate \(\sigma\).
summary(rnorm(10000))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.750000 -0.682200 -0.000073 -0.001534 0.670400 3.390000
summary(sampling)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -9.73500 -0.71890 0.01338 0.01553 0.75430 36.04000
summary(rt(10000,n-1))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -14.400000 -0.732100 0.003925 -0.014900 0.724300 15.810000
If x is sampled from a non normal distribution, the sample means still follow a standard normal if n is large
population <- runif(10000, -10, 5)
pop_mean <- mean(population)
n=30
sampling <- replicate(10000, {
samp<-sample(population, n)
samp_mean <- mean(samp);
samp_sd <- sd(samp)
(samp_mean - pop_mean) / (samp_sd/sqrt(n))
}
)
ex1 <- data.frame(data=c(rt(10000,n-1), sampling, population, rnorm(10000)), Label=c(rep("Student's t",10000), rep("Sampled", 10000), rep("Uniform population", 10000), rep("Normal", 10000)))
ggplot(ex1, aes(data, colour=Label)) + geom_density(alpha=0.4, trim=T)
This is not the case if n is small:
population <- runif(10000, -10, 5)
pop_mean <- mean(population)
n=4
sampling <- replicate(10000, {
samp<-sample(population, n)
samp_mean <- mean(samp);
samp_sd <- sd(samp)
(samp_mean - pop_mean) / (samp_sd/sqrt(n))
}
)
ex1 <- data.frame(data=c(rt(10000,n-1), sampling, population, rnorm(10000)), Label=c(rep("Student's t",10000), rep("Sampled", 10000), rep("Uniform population", 10000), rep("Normal", 10000)))
ggplot(ex1, aes(data, colour=Label)) + geom_density(alpha=0.4, trim=T)
summary(rnorm(10000))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.723000 -0.651600 0.009064 0.018460 0.709000 3.735000
summary(sampling)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -55.69000 -0.75250 -0.01850 -0.00561 0.68950 31.75000
summary(rt(10000,n-1))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -24.08000 -0.74520 0.02118 0.02478 0.77990 34.12000