Numoracle Recipes: Student's T-Distribution, Degrees of Freedom

In the previous post, we showed how to compute the confidence interval when σ, the population standard deviation, is known. In practice, however, σ is rarely known.

Computing the Confidence Interval for μ when σ is Unknown: The solution is to use the sample standard deviation S in place of σ, in a variant of the standardized statistic for the normal distribution, z = (X_bar - μ)/(σ/√n).

Student's T Distribution:
For a sample from a normally distributed population, Student's t distribution is the distribution of this standardized statistic: t = (X_bar - μ)/(S/√n), with (n - 1) degrees of freedom (df) for the deviations, where S is the sample standard deviation and n is the sample size. Key points:

The t distribution resembles the bell-shaped z (normal) distribution, but with wider tails than z. Its mean is zero (the mean of z), and its variance tends to 1 (the variance of z) as df increases. For df > 2, the variance of t is df/(df - 2).
The z (normal) distribution deals with one unknown μ - estimated by the random variable X_bar, while the t distribution deals with two unknowns μ and σ - estimated by random variables X_bar and S respectively. So it tacitly handles greater uncertainty in the data.
As a consequence, the t distribution has wider confidence intervals than z.
There is a separate t distribution for each value of df (1, 2, 3, and so on).
A good comparison of the distributions is provided here.
For a small sample (n < 30) taken from a normally distributed population, a (1 - α) 100% confidence interval for μ when σ is unknown is X_bar ± t_(α/2) S/√n, where t_(α/2) is taken from the t distribution with (n - 1) df. This is the better distribution to use for small samples when both μ and σ are unknown.
But for larger samples (n ≥ 30), and hence larger df, the t distribution can be approximated by a z distribution, and the (1 - α) 100% confidence interval for μ is X_bar ± z_(α/2) S/√n (Note: this still uses the sample standard deviation).
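
The key points above can be checked numerically. The following is a minimal Python sketch using only the standard library; the helper names t_pdf, integrate, t_variance and t_upper_critical are made up for this illustration and do not come from the earlier posts. It integrates the t density numerically to confirm the variance df/(df - 2), and finds the two-sided critical value by bisection to compare against z.

```python
import math
from statistics import NormalDist


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def integrate(f, a, b, steps=2000):
    """Composite Simpson's rule on [a, b]; steps must be even."""
    h = (b - a) / steps
    total = f(a) + f(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


def t_variance(df, limit=200.0, steps=40000):
    """Numerical variance of t (mean is 0). Only meaningful for df > 2;
    the truncated tails converge slowly when df is close to 2."""
    return integrate(lambda x: x * x * t_pdf(x, df), -limit, limit, steps)


def t_upper_critical(alpha, df):
    """Find t with P(T > t) = alpha / 2 by bisection on the integrated density.
    Assumes the answer lies in [0, 100], which holds for common alpha and df."""
    target = 0.5 - alpha / 2  # area under the density between 0 and t
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if integrate(lambda x: t_pdf(x, df), 0.0, mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


z_crit = NormalDist().inv_cdf(0.975)  # two-sided 95% critical value for z
for df in (4, 10, 30):
    print(df, round(t_upper_critical(0.05, df), 3), round(z_crit, 3))

for df in (5, 30):
    print(df, round(t_variance(df), 4), round(df / (df - 2), 4))
```

For a 95% interval, the t critical value at df = 4 comes out near 2.78 against roughly 1.96 for z, and it moves toward the z value as df grows, which is the widening of the interval described above.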

We will pause to understand df in this context (based on Aczel's book). We noted earlier that the df acts as a compensating factor; here is how. Assume a population of five numbers: 1, 2, 3, 4, 5. The (known) population mean is μ = (1+2+3+4+5)/5 = 3. Assume we are asked to sample 5 numbers and find the sum of squared deviations (SSD) based on μ:

x | x_bar | deviation | deviation_squared
3 | 3     |  0.0      | 0
2 | 3     | -1.0      | 1
4 | 3     |  1.0      | 1
1 | 3     | -2.0      | 4
2 | 3     | -1.0      | 1
Sum of Squared Deviation = 7

Given the mean, the deviation computation for the 5 random samples effectively retains 5 degrees of freedom. Next, assume we don't know the population mean, and instead are asked to compute the deviations from one number of our choosing. Our goal is to choose a number that will minimize the squared deviations. A readily available number is the sample mean (3+2+4+1+2)/5 = 2.4, so we will use it:

x | x_bar | deviation | deviation_squared
3 | 2.4   |  0.6      | 0.36
2 | 2.4   | -0.4      | 0.16
4 | 2.4   |  1.6      | 2.56
1 | 2.4   | -1.4      | 1.96
2 | 2.4   | -0.4      | 0.16
Sum of Squared Deviation = 5.2

Using the sample mean biases the SSD downward, from 7 (about the population mean) to 5.2. And once the mean has been chosen from the sample, the deviations for the same 5 random samples retain only df = (5 - 1) = 4 degrees of freedom.
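
The downward bias, and the way df compensates for it, can be illustrated with a short Python sketch. The function name ssd is made up for this illustration, and the enumeration assumes sampling with replacement (the sample above contains 2 twice, so that seems to be the intent). It first reproduces the two SSD values, then averages the SSD over every possible sample of 5 drawn from the population.

```python
from itertools import product

population = [1, 2, 3, 4, 5]
sample = [3, 2, 4, 1, 2]
n = len(sample)
mu = sum(population) / len(population)  # known population mean, 3


def ssd(values, center):
    """Sum of squared deviations of values about center."""
    return sum((x - center) ** 2 for x in values)


x_bar = sum(sample) / n
ssd_mu = ssd(sample, mu)       # deviations about the population mean
ssd_xbar = ssd(sample, x_bar)  # deviations about the sample mean
print(ssd_mu, round(ssd_xbar, 2))

# The gap between the two is exactly n * (x_bar - mu)^2, which is never
# negative, so the SSD about the sample mean can never exceed the other.
print(round(ssd_xbar + n * (x_bar - mu) ** 2, 2))

# Average both SSDs over all 5^5 equally likely samples (with replacement).
all_samples = list(product(population, repeat=n))
avg_ssd_mu = sum(ssd(s, mu) for s in all_samples) / len(all_samples)
avg_ssd_xbar = sum(ssd(s, sum(s) / n) for s in all_samples) / len(all_samples)

pop_variance = ssd(population, mu) / len(population)
print(pop_variance)                        # population variance
print(round(avg_ssd_mu / n, 4))            # SSD about mu, divided by df = n
print(round(avg_ssd_xbar / (n - 1), 4))    # SSD about x_bar, divided by df = n - 1
print(round(avg_ssd_xbar / n, 4))          # dividing by n instead underestimates
```

Dividing each SSD by its own df recovers the population variance on average, while dividing the sample-mean SSD by n falls short of it.
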
Subsequent choices of 2 means, (3+2)/2 for the first two points and (4+1+2)/3 for the remaining three, or of 3 means, would reduce the SSD further, while also reducing the degrees of freedom for the deviations of the 5 random samples: df = (5 - 2), df = (5 - 3) and so on. As an extreme case, if we take each sampled number as its own mean, then we have:

x | x_bar | deviation | deviation_squared
3 | 3     |  0        | 0
2 | 2     |  0        | 0
4 | 4     |  0        | 0
1 | 1     |  0        | 0
2 | 2     |  0        | 0
Sum of Squared Deviation = 0

which reduces SSD to 0, and the deviation df to (5-5) = 0. So in general,

deviations (and hence SSD) for a sample of size n taken from a known population mean μ will have df = n
deviations for a sample of size n taken from the sample mean X_bar will have df = (n - 1)
deviations for a sample of size n taken from k ≤ n different numbers (typically mean of sample points) will have df = n - k.
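
The n - k rule can be sketched in Python as follows. The function grouped_ssd is a hypothetical helper written for this illustration: it splits the sample into k groups, measures each point's deviation from its own group mean, and reports the SSD together with df = n - k.

```python
def grouped_ssd(groups):
    """Return (SSD, df) when each group's deviations are measured
    from that group's own mean; df = n - k for k groups."""
    n = sum(len(g) for g in groups)
    k = len(groups)
    total = 0.0
    for g in groups:
        mean = sum(g) / len(g)
        total += sum((x - mean) ** 2 for x in g)
    return total, n - k


# One mean for the whole sample: k = 1
print(grouped_ssd([[3, 2, 4, 1, 2]]))

# Two means, (3+2)/2 and (4+1+2)/3: k = 2
print(grouped_ssd([[3, 2], [4, 1, 2]]))

# Every point is its own mean: k = n
print(grouped_ssd([[3], [2], [4], [1], [2]]))
```

Each extra mean estimated from the sample lowers the SSD a little and uses up one more degree of freedom, until both reach zero when k = n.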

The confidence interval using the t distribution applies to the narrow case of n < 30 and a normal population; the z distribution covers larger samples. The practical use of this distribution appears to be more in comparing two populations using the Student's T-Test, which requires understanding the concepts of Hypothesis Testing. So I will defer the code for an equivalent confidence_interval() routine based on the t distribution for later.