Population & Sample Variance

Mon 14 April 2014

A common source of confusion among students just getting started with statistics is the difference between a population and a sample. Simply speaking, the entire field of statistics exists so that the true parameters of a population (its mean, variance, etc.) can be estimated from a subset of the population, called a sample. This is because it is generally cumbersome, extremely expensive or downright impossible to measure the parameters for the entire population at once.

One specific area of confusion arises from the fact that, when estimating the population variance from a sample, we divide the sum of squared differences from the mean by n-1 (the unbiased estimator) instead of n (the biased estimator). While a great explanation for this can be found here (video) or here (more math-y), I decided to code up a simple practical example that shows why the unbiased estimator is a better proxy for the real population variance.
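As a quick aside before we start: R's built-in `var()` function already applies this n-1 correction (often called Bessel's correction), so it matches the "unbiased" formula used below. A minimal sketch, using a small made-up vector, to confirm this:

```r
# Hypothetical example data, just to illustrate the point
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

# Manual unbiased variance: sum of squared deviations divided by n - 1
manual.unbiased <- sum((x - mean(x))^2) / (n - 1)

# R's var() uses the same n - 1 divisor
stopifnot(isTRUE(all.equal(var(x), manual.unbiased)))
```

We'll compute the estimators by hand in the experiment below so the division by n versus n-1 is explicit.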

Let's first build a population for our experiment and calculate the true variance (the one we'll estimate using the samples) -

# Population size = 100000
N <- 100000

# Generate a population of 100000 uniform random values
pop <- runif(N)

# Calculate the population mean
pop.mean <- mean(pop)

# Calculate the population variance
pop.var <- sum((pop - pop.mean)^2)/N
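As a sanity check on this setup (not part of the original experiment): for a Uniform(0, 1) distribution the theoretical variance is 1/12 ≈ 0.0833, so `pop.var` should land very close to that. A quick sketch, with a hypothetical seed added for reproducibility:

```r
set.seed(42)  # hypothetical seed, for reproducibility

# Rebuild the population as above
N <- 100000
pop <- runif(N)
pop.mean <- mean(pop)
pop.var <- sum((pop - pop.mean)^2) / N

# Distance from the theoretical variance of Uniform(0, 1); should be tiny
abs(pop.var - 1/12)
```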

We'll take 10,000 samples of size 50 from this population and use them to estimate the variance of the population. For each iteration, we'll store the biased and unbiased estimates of the population variance in vectors so that we can later analyze and plot the results.

# Number of samples = 10000
K <- 10000

# Sample size = 50
n <- 50

# Vectors to store the sample means and variances
samp.mean <- NULL
samp.var.biased <- NULL
samp.var.unbiased <- NULL

# Loop over the samples
for (i in 1:K) {
  # Choose a random sample
  samp <- sample(pop, n)

  # Calculate and store the sample mean
  s_mean <- mean(samp)
  samp.mean <- c(samp.mean, s_mean)

  # Calculate and store the *biased* sample variance
  samp.var.biased <- c(samp.var.biased, (sum((samp - s_mean)^2)/n))

  # Calculate and store the *unbiased* sample variance
  samp.var.unbiased <- c(samp.var.unbiased, (sum((samp - s_mean)^2)/(n-1)))
}
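Growing vectors with `c()` inside a loop works, but it is slow in R because each call copies the vector. As a side note, the whole loop can be written more idiomatically with `replicate()`, using the fact that `var()` is the unbiased estimator and the biased one is just a rescaling by (n-1)/n. A sketch of that equivalent version:

```r
# Same simulation, vectorized (assumes pop, K and n are defined as above)
N <- 100000
pop <- runif(N)
K <- 10000
n <- 50

# var() divides by n - 1, so this is the unbiased estimate for each sample
samp.var.unbiased <- replicate(K, var(sample(pop, n)))

# The biased estimate divides by n instead, i.e. it is scaled by (n - 1)/n
samp.var.biased <- samp.var.unbiased * (n - 1) / n
```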

Now we have 10,000 biased and 10,000 unbiased estimates of the real population variance. Let's plot them -

[Plot: biased and unbiased estimators of the population variance]

The green line is the true population variance.

It is hard to tell from the plot whether the biased or the unbiased estimator is better, so let's calculate the percentage of samples for which the unbiased estimator was at least as close to the true population variance as the biased estimator -

# Percentage of times 'unbiased' is equal or better estimator of the pop variance
(sum(abs(samp.var.unbiased - pop.var) <= abs(samp.var.biased - pop.var)) / K) * 100
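Another way to see the bias directly (a supplementary check, not in the original post) is to average the estimates: the mean of the unbiased estimates should sit near `pop.var`, while the mean of the biased estimates falls short by roughly the factor (n-1)/n = 0.98. A self-contained sketch, with a hypothetical seed:

```r
set.seed(1)  # hypothetical seed, for reproducibility

# Rebuild the population and the two sets of estimates
N <- 100000
pop <- runif(N)
pop.var <- sum((pop - mean(pop))^2) / N
K <- 10000
n <- 50
samp.var.unbiased <- replicate(K, var(sample(pop, n)))
samp.var.biased <- samp.var.unbiased * (n - 1) / n

# Ratio of the average estimate to the true variance
mean(samp.var.unbiased) / pop.var  # close to 1
mean(samp.var.biased) / pop.var    # close to (n - 1)/n = 0.98
```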

You'll notice that you consistently get a value greater than 50%, demonstrating that the unbiased estimator is, on average, a better estimator of the population variance.

I hope this little demonstration was useful in building some intuition for the difference.