In the previous post, we showed how to compute the confidence interval when σ, i.e. the population standard deviation, is known. In practice, however, σ is rarely known.

**Computing Confidence Interval for μ when σ is Unknown:** The solution then is to use the *sample* standard deviation, together with a variant of the standardized statistic for the normal distribution, z = (X_bar - μ)/(σ/√n).

**Student's T Distribution:**

For a *normally distributed population*, the Student's T distribution is given by this standardized statistic: **t = (X_bar - μ)/(S/√n), with (n - 1) degrees of freedom (df) for the deviations**, where S is the sample standard deviation, and **n** is the sample size. Key points:

- The t distribution resembles the bell-shaped z (normal) distribution, but with wider tails than z, with mean zero (the mean of z), and with variance tending to 1 (the variance of z) as df increases. For df > 2, **σ² = df/(df - 2)**.
- The z (normal) distribution deals with one unknown, **μ**, estimated by the random variable X_bar, while the t distribution deals with two unknowns, **μ** and **σ**, estimated by the random variables X_bar and S respectively. So it tacitly handles greater uncertainty in the data.
- As a consequence, the t distribution has wider confidence intervals than z.
- There is a t-distribution for each df = 1, ..., n.
- A good comparison of the distributions is provided here.
- For a sample (n < 30) taken from a normally distributed population, a **(1 - α)100% confidence interval** for μ when σ is __unknown__ is **X_bar ± t_{α/2} S/√n**. This is the better distribution to use for small samples, with (n - 1) df, and unknown μ and σ.
- But for larger samples (n ≥ 30), and/or with larger df, the t distribution can be approximated by a z distribution, and the **(1 - α)100% confidence interval** for μ is **X_bar ± z_{α/2} S/√n** (Note: using the sample sd itself).
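The key points above can be checked numerically. The following is a minimal Python sketch, not a definitive implementation: the helper names `t_statistic`, `t_pdf` and `t_variance` are hypothetical, the five sample values and the hypothesized mean μ = 3 are purely illustrative, and the density is written out from the standard closed form of the t distribution using only the standard library. It computes the t statistic for a small sample, then compares the tail density of t against z for several df.

```python
import math
import statistics


def t_statistic(sample, mu):
    """Return t = (X_bar - mu) / (S / sqrt(n)) and its degrees of freedom."""
    n = len(sample)
    x_bar = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (n - 1 divisor)
    return (x_bar - mu) / (s / math.sqrt(n)), n - 1


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    # lgamma avoids overflow of the gamma function for large df
    log_coeff = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coeff = math.exp(log_coeff) / math.sqrt(df * math.pi)
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)


def t_variance(df):
    """Variance of the t distribution; the formula holds only for df > 2."""
    if df <= 2:
        raise ValueError("df / (df - 2) applies only for df > 2")
    return df / (df - 2)


normal = statistics.NormalDist(mu=0.0, sigma=1.0)

# Illustrative sample and hypothesized mean (assumptions, not real data)
t_value, df = t_statistic([3, 2, 4, 1, 2], mu=3)
print(f"t = {t_value:.4f} with df = {df}")

# Tail density at x = 3: t is heavier than z, and approaches z as df grows
for df_i in (3, 5, 30, 1000):
    print(df_i, round(t_pdf(3.0, df_i), 5), round(normal.pdf(3.0), 5),
          round(t_variance(df_i), 4))
```

The loop makes the "wider tails" point concrete: for small df the t density at 3 is several times the normal density there, and the gap closes as df increases, while the variance column drifts toward 1.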

| x | x_bar | deviation | deviation_squared |
|---|---|---|---|
| 3 | 3 | 0.0 | 0 |
| 2 | 3 | -1.0 | 1 |
| 4 | 3 | 1.0 | 1 |
| 1 | 3 | -2.0 | 4 |
| 2 | 3 | -1.0 | 1 |

Sum of Squared Deviation = 7

Given the mean (here a known population mean of 3, shown in the x_bar column), the deviation computation for the *random* 5 samples effectively retains 5 degrees of freedom. Next, assume we *don't* know the population mean, and instead are asked to compute the deviation from *one* random number. Our goal is to choose a number that will minimize the deviation. A readily available number is the sample mean (3+2+4+1+2)/5 = 2.4, so we will use it:

| x | x_bar | deviation | deviation_squared |
|---|---|---|---|
| 3 | 2.4 | 0.6 | 0.36 |
| 2 | 2.4 | -0.4 | 0.16 |
| 4 | 2.4 | 1.6 | 2.56 |
| 1 | 2.4 | -1.4 | 1.96 |
| 2 | 2.4 | -0.4 | 0.16 |

Sum of Squared Deviation = 5.2

The use of the sample mean biases the SSD downward from 7 (actual) to 5.2. But given the choice of a mean, the deviations for the same random 5 samples retain only df = (5 - 1) = 4 degrees of freedom. Once the sample mean is fixed, the five deviations must sum to zero, so only four of them can vary freely.
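The arithmetic in the two tables above can be reproduced with a short Python sketch. The helper name `ssd` is hypothetical, and the grid of candidate centers is only an illustrative way to show that no other choice of a single number gives a smaller SSD than the sample mean.

```python
def ssd(values, center):
    """Sum of squared deviations of values from a chosen center."""
    return sum((x - center) ** 2 for x in values)


xs = [3, 2, 4, 1, 2]
sample_mean = sum(xs) / len(xs)

ssd_known = ssd(xs, 3)             # deviations from the mean used in the first table
ssd_sample = ssd(xs, sample_mean)  # deviations from the sample mean

# Scan candidate centers 0.0, 0.1, ..., 5.0 and keep the one with the smallest SSD
candidates = [c / 10 for c in range(0, 51)]
best = min(candidates, key=lambda c: ssd(xs, c))

print(ssd_known, round(ssd_sample, 2), best)
```

Because the sample mean is chosen from the data to minimize the SSD, it always fits the sample at least as well as the true mean does, which is the downward bias described above.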

Subsequent choices of 2 means, e.g. (3+2)/2 and (4+1+2)/3, or of 3 means, would reduce the SSD further, while at the same time reducing the degrees of freedom for the deviations of the 5 random samples: df = (5-2), df = (5-3) and so on. As an extreme case, if we take the mean of each sampled number to be the number itself, then we have:

| x | x_bar | deviation | deviation_squared |
|---|---|---|---|
| 3 | 3 | 0 | 0 |
| 2 | 2 | 0 | 0 |
| 4 | 4 | 0 | 0 |
| 1 | 1 | 0 | 0 |
| 2 | 2 | 0 | 0 |

Sum of Squared Deviation = 0

This reduces the SSD to 0, and the deviation df to (5-5) = 0. So in general:

- deviations (and hence SSD) of a sample of size n measured from a known population mean μ will have df = n
- deviations of a sample of size n measured from the sample mean X_bar will have df = (n - 1)
- deviations of a sample of size n measured from k ≤ n different numbers (typically means of groups of sample points) will have df = n - k.
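The df = n - k rule can be illustrated with a small Python sketch. The helper name `grouped_ssd` and the particular groupings are assumptions made for illustration: each group of sample points is measured from its own mean, and every mean estimated from the data costs one degree of freedom.

```python
import statistics


def grouped_ssd(groups):
    """Return (SSD, df) when each group is measured from its own mean."""
    n = sum(len(g) for g in groups)
    k = len(groups)  # number of means estimated from the data
    total = 0.0
    for g in groups:
        g_mean = sum(g) / len(g)
        total += sum((x - g_mean) ** 2 for x in g)
    return total, n - k


one_mean = grouped_ssd([[3, 2, 4, 1, 2]])
two_means = grouped_ssd([[3, 2], [4, 1, 2]])
five_means = grouped_ssd([[3], [2], [4], [1], [2]])

for label, (ssd_value, df) in [("1 mean", one_mean),
                               ("2 means", two_means),
                               ("5 means", five_means)]:
    print(f"{label}: SSD = {ssd_value:.4f}, df = {df}")

# With a single sample mean, SSD / df is the usual sample variance (n - 1 divisor)
sample_variance = one_mean[0] / one_mean[1]
print(sample_variance, statistics.variance([3, 2, 4, 1, 2]))
```

The last two lines connect this back to the t statistic: dividing the SSD by its df = (n - 1), rather than by n, gives the sample variance S² whose square root is the S used in (X_bar - μ)/(S/√n).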

**Student's T-Test** - which requires understanding the concepts of Hypothesis Testing. So I will defer the code for an equivalent confidence_interval() routine based on the T-distribution for later.
