In the previous post, we showed how to compute the confidence interval when σ, i.e. the population standard deviation, is known. In practice, however, σ is rarely known.
Computing Confidence Interval for μ when σ is Unknown:
The solution then is to use the sample standard deviation, together with a variant of the standardized statistic for the normal distribution, z = (X_bar - μ)/(σ/√n).
Student's T Distribution:
For a sample drawn from a normally distributed population, the standardized statistic t = (X_bar - μ)/(S/√n) follows the Student's T distribution with (n - 1) degrees of freedom (df) for the deviations, where S is the sample standard deviation and n is the sample size. Key points:
- The t distribution resembles the bell-shaped z (normal) distribution, but with wider tails than z. Its mean is zero (the mean of z), and its variance tends to 1 (the variance of z) as df increases: for df > 2, the variance is df/(df - 2).
- The z (normal) distribution deals with one unknown μ - estimated by the random variable X_bar, while the t distribution deals with two unknowns μ and σ - estimated by random variables X_bar and S respectively. So it tacitly handles greater uncertainty in the data.
- As a consequence, the t distribution gives wider confidence intervals than z.
- There is a different t distribution for each df = 1, 2, ..., n.
- A good comparison of the distributions is provided here.
- For a small sample (n < 30) taken from a normally distributed population, a (1 - α)100% confidence interval for μ when σ is unknown is X_bar ± t_(α/2) S/√n. This is the better distribution to use for small samples, with (n - 1) df and unknown μ and σ.
- But for larger samples (n ≥ 30), and/or with larger df, the t distribution can be approximated by the z distribution, and the (1 - α)100% confidence interval for μ is X_bar ± z_(α/2) S/√n (note: this still uses the sample standard deviation S).
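The two interval formulas above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the five-value sample is hypothetical, the function name `mean_confidence_interval` is made up for this sketch, and the t critical value for df = 4 at α = 0.05 is read from a standard t table rather than computed.

```python
import math
import statistics


def mean_confidence_interval(sample, critical_value):
    """Return (lower, upper) for X_bar ± critical_value * S / sqrt(n)."""
    n = len(sample)
    x_bar = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation, divides by (n - 1)
    margin = critical_value * s / math.sqrt(n)
    return x_bar - margin, x_bar + margin


# Hypothetical small sample (n = 5, so df = 4).
sample = [3, 2, 4, 1, 2]

# Assumed critical values for a 95% interval (alpha = 0.05):
# t_(alpha/2) with df = 4 is taken from a standard t table,
# z_(alpha/2) is computed from the standard normal distribution.
t_crit = 2.776
z_crit = statistics.NormalDist().inv_cdf(1 - 0.05 / 2)

t_low, t_high = mean_confidence_interval(sample, t_crit)
z_low, z_high = mean_confidence_interval(sample, z_crit)

print(f"t interval: ({t_low:.3f}, {t_high:.3f})")
print(f"z interval: ({z_low:.3f}, {z_high:.3f})")
```

With only five observations the t interval comes out noticeably wider than the z interval built from the same X_bar and S, which is the extra uncertainty the t distribution accounts for.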
x | x_bar | deviation | deviation_squared
3 | 3 | 0.0 | 0
2 | 3 | -1.0 | 1
4 | 3 | 1.0 | 1
1 | 3 | -2.0 | 4
2 | 3 | -1.0 | 1
Sum of Squared Deviation = 7
Given the population mean (3 in the table above), the deviations computed for the 5 random samples effectively retain 5 degrees of freedom. Next, assume we don't know the population mean, and are instead asked to compute the deviations from a single number of our choosing. Our goal is to choose a number that minimizes the sum of squared deviations (SSD). A readily available number is the sample mean (3+2+4+1+2)/5 = 2.4, so we will use it:
x | x_bar | deviation | deviation_squared
3 | 2.4 | 0.6 | 0.36
2 | 2.4 | -0.4 | 0.16
4 | 2.4 | 1.6 | 2.56
1 | 2.4 | -1.4 | 1.96
2 | 2.4 | -0.4 | 0.16
Sum of Squared Deviation = 5.2
The use of the sample mean biases the SSD downward, from 7 (actual) to 5.2. And having chosen one mean from the data, the deviations for the same 5 random samples retain only df = (5 - 1) = 4 degrees of freedom.
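The two tables above can be reproduced with a short Python sketch. It assumes the known population mean is 3 (as the first table's x_bar column suggests); the helper name `sum_squared_deviations` and the grid of candidate centers are made up for illustration.

```python
def sum_squared_deviations(sample, center):
    """Sum of squared deviations (SSD) of the sample from a given center."""
    return sum((x - center) ** 2 for x in sample)


sample = [3, 2, 4, 1, 2]
population_mean = 3  # assumed known, as in the first table
sample_mean = sum(sample) / len(sample)  # (3+2+4+1+2)/5

ssd_about_mu = sum_squared_deviations(sample, population_mean)
ssd_about_x_bar = sum_squared_deviations(sample, sample_mean)

# Scan a coarse grid of candidate centers (0.0, 0.1, ..., 5.0) to see
# which one gives the smallest SSD for this sample.
candidates = [c / 10 for c in range(0, 51)]
best_center = min(candidates, key=lambda c: sum_squared_deviations(sample, c))

# Dividing the SSD about the sample mean by df = n - 1 gives the sample variance.
sample_variance = ssd_about_x_bar / (len(sample) - 1)

print(ssd_about_mu, round(ssd_about_x_bar, 2), best_center, round(sample_variance, 2))
```

On this grid the sample mean is the center with the smallest SSD, which is why measuring deviations from it understates the SSD about the true mean, and why the sample variance divides by (n - 1) rather than n.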
Subsequent choices of 2 means, (3+2)/2 and (4+1+2)/3, or of 3 means, would reduce the SSD further, while also reducing the degrees of freedom of the deviations for the 5 random samples to df = (5 - 2), df = (5 - 3), and so on. As an extreme case, if we take each sampled number as its own mean, then we have:
x | x_bar | deviation | deviation_squared
3 | 3 | 0 | 0
2 | 2 | 0 | 0
4 | 4 | 0 | 0
1 | 1 | 0 | 0
2 | 2 | 0 | 0
Sum of Squared Deviation = 0
which reduces the SSD to 0, and the deviation df to (5 - 5) = 0. So in general,
- Deviations (and hence the SSD) for a sample of size n measured from a known population mean μ will have df = n.
- Deviations for a sample of size n measured from the sample mean X_bar will have df = (n - 1).
- Deviations for a sample of size n measured from k ≤ n different numbers (typically means of subsets of the sample points) will have df = n - k.
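As a rough check of the df = n - k rule, the sketch below splits the same five numbers into k groups, measures each group's deviations from its own mean, and reports the SSD together with n - k. The function name `ssd_and_df` and the particular groupings are chosen only for illustration.

```python
def ssd_and_df(groups):
    """Return (SSD, df) when each group is measured from its own mean.

    groups is a list of lists; df is n - k, where n is the total number
    of points and k is the number of group means estimated from the data.
    """
    n = sum(len(group) for group in groups)
    k = len(groups)
    ssd = 0.0
    for group in groups:
        group_mean = sum(group) / len(group)
        ssd += sum((x - group_mean) ** 2 for x in group)
    return ssd, n - k


one_mean = ssd_and_df([[3, 2, 4, 1, 2]])             # k = 1: the sample mean
two_means = ssd_and_df([[3, 2], [4, 1, 2]])          # k = 2: (3+2)/2 and (4+1+2)/3
five_means = ssd_and_df([[3], [2], [4], [1], [2]])   # k = 5: each point is its own mean

for label, (ssd, df) in [("1 mean", one_mean), ("2 means", two_means), ("5 means", five_means)]:
    print(f"{label}: SSD = {ssd:.4f}, df = {df}")
```

Each additional mean estimated from the data lowers the SSD and uses up one degree of freedom, until nothing is left at k = n.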