The central limit theorem is a primary concept in probability theory that has applications in many areas of statistics.
The central limit theorem states that if you have a population with mean \(\mu\) and standard deviation \(\sigma\) and take sufficiently large random samples from the population with replacement, then the distribution of the sample means will be approximately normally distributed. This will hold true regardless of whether the source population is normal or skewed, provided the sample size is sufficiently large (usually n > 30). (LaMorte, 2016)
Put simply, it means that if you take a large enough sample size you can use assumptions from the normal distribution to make inferences, even if the population you are sampling from is not normally distributed.
We can visualize this using simulations. Let’s generate a sample from an exponential distribution.
exponential_population <- rexp(100000)
# Create histogram
hist(exponential_population)
# View the first 100 values
exponential_population[1:100]
## [1] 0.632389415 2.893144987 0.775364080 0.585219613 0.874685526 3.098770815
## [7] 1.602424872 0.363246072 0.896876732 1.296921046 0.227747175 0.382005046
## [13] 0.524895527 0.684398558 0.995037384 3.295051336 0.399972885 1.400912968
## [19] 0.589446661 0.020681689 0.003341043 0.009834128 0.146279449 0.289104827
## [25] 0.279489393 1.957787396 1.064691206 3.203233681 0.459198233 0.510490905
## [31] 0.619248914 0.846925576 2.032354649 0.360950389 0.716687713 0.747305352
## [37] 0.229156344 0.189104565 0.494012315 1.375824899 0.185238257 0.989103414
## [43] 0.245092017 0.449956362 0.896230237 0.042670061 1.344266159 3.862406900
## [49] 0.033138798 3.396231226 1.057077866 0.631171602 1.396739946 0.581414390
## [55] 2.430173012 1.300471503 0.432407979 1.557065507 0.614775172 0.393064424
## [61] 0.623998421 0.265945218 0.640421931 1.750647787 2.928541740 2.330949799
## [67] 0.470032800 2.120552599 1.318408796 2.714644199 0.035155945 0.671875990
## [73] 2.772518655 0.463113400 0.754247281 2.344543219 1.873643453 1.393541778
## [79] 0.769484804 1.018030352 0.464009187 0.274239772 0.307496040 0.817519786
## [85] 0.152125947 2.125922121 0.268627077 0.239423088 2.758761311 1.886636266
## [91] 1.217170684 0.811809601 2.511746203 3.806754848 0.661273088 0.827267238
## [97] 5.235234247 0.286700319 1.173920802 0.343182862
# Summary stats
summary(exponential_population)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000008 0.288375 0.697701 1.001434 1.393376 11.554314
boxplot(exponential_population)
We can see that the distribution has a long tail, mean of 1, and median 0.7.
According to the CLT if we take repeated samples with replacement from this population, the sample means will be normally distributed (provided the sample size is sufficient). We will start with samples of 100 random observations repeated 10 times then observe the sample means.
samples <- array()
for(i in 1:10){
s <- sample(exponential_population, 100)
mu <- mean(s)
samples[i] <- mu
}
samples
## [1] 0.8612153 0.8208838 0.8764301 1.0259946 1.0096955 1.2240990 1.1485466
## [8] 0.9934035 1.1812946 1.2081291
hist(samples)
The distribution looks more normal than exponantial, but not quite there yet. In order for CLT to hold, we need to sample the population a sufficiently large number of times. Let’s try 100 samples:
samples <- array()
for(i in 1:100){
s <- sample(exponential_population, 100)
mu <- mean(s)
samples[i] <- mu
}
samples
## [1] 1.0341758 0.7849288 0.8749542 0.8325325 1.0171413 0.8518120 0.9890670
## [8] 1.0727375 0.9795448 0.9375196 0.9313028 0.8446158 0.9601090 0.8967224
## [15] 0.9612041 0.9764466 0.9447527 1.0029960 0.8545646 0.9794423 0.9979786
## [22] 0.8935843 1.1960514 0.7986960 0.8836585 1.0392336 1.2153476 0.9336208
## [29] 1.0358459 0.9592308 1.1300833 0.9429526 0.9661960 1.0578543 0.8810830
## [36] 0.9649924 0.8816060 1.1421424 0.9525808 0.9555006 0.9056841 0.9415411
## [43] 1.0087524 0.9664238 1.0623481 0.8356298 1.0600430 1.0124160 1.1697420
## [50] 1.0696471 1.1863113 0.8834675 0.9917958 0.9692575 1.0242122 0.8994176
## [57] 1.1512495 1.1625936 0.9752211 1.1882521 1.2005894 0.8930113 0.9897845
## [64] 1.0017321 1.0292010 1.0863022 0.9698716 0.7305418 0.9858192 1.0285456
## [71] 1.1130788 1.0587643 1.1872247 1.0376892 1.0425491 1.0747107 1.0807360
## [78] 1.0781796 1.0426475 0.9773207 1.0172019 0.9261183 0.9391615 0.9249935
## [85] 1.1649770 1.0073275 1.1031006 0.8835616 0.9892376 1.0105783 1.1237809
## [92] 0.9566216 1.0222556 0.9811948 1.0308765 1.0814908 1.1269420 0.9733750
## [99] 0.9923494 1.2196084
hist(samples)
This looks even moreso normal, with tails being much smaller and the center of the distribution having more definition. Let’s try 100,000:
samples <- array()
for(i in 1:5000){
s <- sample(exponential_population, 100)
mu <- mean(s)
samples[i] <- mu
}
hist(samples)
For comparison, let’s plot samples from a normal distribution:
hist(rnorm(5000, mean=1))
As mentioned, this works for any distribution. To demonstrate, let’s simulate data from other distributions and visualize.
# Function for simulating data from a given distribution
generate_dist <- function(name, sample_func, n, ss=0.1*n, sims){
dist <- {}
dist$name <- name
dist$obs <- sample_func(n)
dist$ss <- ss
dist$sample_means <- array()
for(i in seq(1:sims)){
s <- sample(dist$obs, ss, replace=TRUE)
mu <- mean(s)
dist$sample_means[i] <- mu
}
return(dist)
}
# Modified sample functions to set default parameters
rpois_mod <- function(n){return(rpois(n=n, lambda=1))}
beta_05_05 <- function(n){return(rbeta(n=n, shape1=0.5, shape2=0.5))}
beta_5_1 <- function(n){return(rbeta(n=n, shape1=5, shape2=1))}
# Function for generating a list containing data from several distributions
generate_dist_list <- function(n, ss, sims){
list_out <- list(
generate_dist("Normal(0,1)", rnorm, n, ss, sims),
generate_dist("Exponential(1)", rexp, n, ss, sims),
generate_dist("Uniform(0,1)", runif, n, ss, sims),
generate_dist("Poisson(1)", rpois_mod, n, ss, sims),
generate_dist("Beta(0.5,0.5)", beta_05_05, n, ss, sims),
generate_dist("Beta(5,1)", beta_5_1, n, ss, sims)
)
return(list_out)
}
# Create distribution list
dist_list <- generate_dist_list(1000000, ss=50, sims=10000)
# Set up and iterate through data list to plot
par(mfrow=c(2,3))
for(dist in dist_list){
n_size <- format(length(dist$obs), big.mark=",")
hist(dist$obs, main=paste0(dist$name, " \n (N=", n_size, ")"))
}
What happens to sample mean distributions if we use a small sample size (n=5)?
dist_list <- generate_dist_list(1000000, ss=5, sims=10000)
par(mfrow=c(2,3))
for(dist in dist_list){
ss <- format(dist$ss)
sims <- format(length(dist$sample_means), big.mark=",")
hist(dist$sample_means,
main=paste0(dist$name, " Sample Dist. \n (n=", dist$ss, ", ", sims, " simulations)"))
}
A large number of sample means from small samples still appears normal when the population is normally distributed, but you can see that samples from other distributions have non-normal traits. Exponential, Poisson, and Beta(\(\alpha\)=5, \(\beta\)=1) all show skew, while Beta(\(\alpha\)=0.5, \(\beta\)=0.5) and Uniform(0,1) sample mean distributions have heavier tails.
Repeated again, this time with a larger sample size (n=100):
dist_list <- generate_dist_list(1000000, ss=100, sims=10000)
par(mfrow=c(2,3))
for(dist in dist_list){
ss <- format(dist$ss)
sims <- format(length(dist$sample_means), big.mark=",")
hist(dist$sample_means,
main=paste0(dist$name, " Sample Dist. \n (n=", dist$ss, ", ", sims, " simulations)"))
}
Back to normality!