A common source of confusion among students just getting started with statistics is the difference between a population and a sample. Simply put, the entire field of statistics exists so that the true parameters of a population (the true mean, true variance, etc.) can be estimated using a subset of that population, called a sample. This is because it is generally cumbersome, extremely expensive, or downright impossible to measure the parameter for the entire population at once.
One specific area of confusion arises from the fact that when calculating an estimator of the population variance from a sample, we divide the sum of squared differences from the mean by n - 1 (the unbiased estimator) instead of n (the biased estimator). While a great explanation for this can be found here (video) or here (more math-y), I decided to code up a simple practical example that shows why the unbiased estimator is a better proxy for the real population variance.
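For reference, the two estimators differ only in the denominator. Written as R expressions (x here is just a hypothetical placeholder vector holding a sample, not part of the experiment below):
# x stands in for any numeric vector holding a sample (hypothetical placeholder)
x <- runif(50)
# Biased estimator: divide the sum of squared deviations from the mean by n
var.biased <- sum((x - mean(x))^2) / length(x)
# Unbiased estimator: divide by n - 1 instead; this is what R's built-in var() computes
var.unbiased <- sum((x - mean(x))^2) / (length(x) - 1)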
Let's first build a population for our experiment and calculate the true variance (the one we'll estimate using the samples) -
# Population size = 100000
N <- 100000
# Generate a population of 100,000 uniform random values
pop <- runif(N)
# Calculate the population mean
pop.mean <- mean(pop)
# Calculate the population variance
pop.var <- sum((pop - pop.mean)^2)/N
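As a quick sanity check (my own aside, not part of the original walkthrough): we divide by N here because pop is the entire population, whereas R's built-in var() divides by N - 1 and therefore gives a very slightly larger value -
# var() uses the N - 1 denominator, so rescaling it by (N - 1)/N recovers pop.var
all.equal(pop.var, var(pop) * (N - 1) / N)  # TRUE
# The raw difference is tiny for N = 100000
var(pop) - pop.var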
We'll take 10,000 samples of size 50 from this population and use them to estimate its variance. For each iteration, we'll store the biased and unbiased estimates of the population variance in vectors so that we can later analyze and plot the results.
# Number of samples = 10,000
K <- 10000
# Sample size = 50
n <- 50
# Vectors to store the sample means and variances
samp.mean <- NULL
samp.var.biased <- NULL
samp.var.unbiased <- NULL
# Loop over the samples
for (i in 1:K) {
  # Choose a random sample of size n from the population
  samp <- sample(pop, n)
  # Calculate and store the sample mean
  s_mean <- mean(samp)
  samp.mean <- c(samp.mean, s_mean)
  # Calculate and store the *biased* sample variance (divide by n)
  samp.var.biased <- c(samp.var.biased, (sum((samp - s_mean)^2)/n))
  # Calculate and store the *unbiased* sample variance (divide by n - 1)
  samp.var.unbiased <- c(samp.var.unbiased, (sum((samp - s_mean)^2)/(n-1)))
}
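(As an aside, the same simulation can be written without growing vectors inside a loop. The sketch below using replicate() is behaviourally equivalent but is my own restructuring, not how the original code is organised.)
# Equivalent, vectorised version of the loop above
samp.stats <- replicate(K, {
  s <- sample(pop, n)
  ss <- sum((s - mean(s))^2)
  c(mean = mean(s), biased = ss / n, unbiased = ss / (n - 1))
})
# replicate() returns a 3 x K matrix; pull out the rows by name
samp.mean <- samp.stats["mean", ]
samp.var.biased <- samp.stats["biased", ]
samp.var.unbiased <- samp.stats["unbiased", ]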
Now we have 10,000 biased and unbiased estimates of the real population variance. Let's plot them -
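The original figure isn't reproduced here, but a minimal sketch of this kind of plot (overlaid histograms of the two sets of estimates, with the true variance marked in green; the specific plotting choices are my own) might look like this:
# Overlaid histograms of the two sets of variance estimates
hist(samp.var.biased, breaks = 50, col = rgb(1, 0, 0, 0.4),
     xlim = range(c(samp.var.biased, samp.var.unbiased)),
     main = "Estimates of the population variance",
     xlab = "Estimated variance")
hist(samp.var.unbiased, breaks = 50, col = rgb(0, 0, 1, 0.4), add = TRUE)
# Mark the true population variance with a green vertical line
abline(v = pop.var, col = "green", lwd = 2)
legend("topright", legend = c("Biased (/ n)", "Unbiased (/ (n - 1))"),
       fill = c(rgb(1, 0, 0, 0.4), rgb(0, 0, 1, 0.4)))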
The green line is the true population variance.
It is hard to tell from the plot whether the biased or the unbiased estimator is better, so let's just calculate the percentage of samples where the unbiased estimate was equal to or better than the biased estimate -
# Percentage of times 'unbiased' is equal or better estimator of the pop variance
(sum(abs(samp.var.unbiased - pop.var) <= abs(samp.var.biased - pop.var)) / K) * 100
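Another check, closer to what "unbiased" literally means (my own addition, not from the original post): compare the average of each set of estimates to the true variance. On average, the biased estimator should undershoot pop.var by roughly a factor of (n - 1)/n, while the unbiased one should land very close to it -
# Average error of each estimator across the 10,000 samples
mean(samp.var.biased) - pop.var    # expected to be noticeably negative
mean(samp.var.unbiased) - pop.var  # expected to be close to zero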
You'll notice that this value is consistently greater than 50%, demonstrating that the unbiased estimator is, on average, a better estimator of the population variance.
Hope this little demonstration was useful in convincing you about the intuition behind the difference.