Numoracle Recipes: Sampling Distributions, Confidence Interval

The key goal of inferential statistics is to make predictions/observations about the population (the whole) as a generalization of observations made on a random sample (the part). In the previous post, we discussed common techniques to derive samples from a population. In this post, we will discuss sampling distributions - a key building block for the practice of statistical inference. These tools help answer questions such as: "How large a sample do we need to make a particular inference about this population?" or "100 random water samples along this river show an average of 50 ppm (parts per million) of this toxin, with a standard deviation of 4.5 - what is the 95% confidence interval for the river's average contamination?", and so on.

The objective in the next few posts is to discuss the use of Oracle SQL statistical functions for various sampling distributions. But if you are a typical DB developer with novice/intermediate knowledge of statistics (like me), spending some time on these foundational concepts may be worthwhile. I am currently using Complete Business Statistics and the Idiot's Guide to Statistics as my guides - you may use these or other books and/or the free online references on the right pane of this blog.

The various numerical measures - such as mean, variance etc - when applied to a sample, are called sample statistics or simply statistics.
When these numerical measures are applied to a population, they are called population parameters or simply parameters.
An estimator of a population parameter is the sample statistic used to estimate the parameter. The sample statistic - mean, X_bar - estimates the population mean μ; the sample statistic - variance, S² - estimates the population variance σ².
When a single numerical value is the estimate, it is called a point estimate. For example, when we sample a population and obtain a value for X_bar - the statistic - we get a specific sample mean, denoted by x_bar (lower-case), which is the estimate for the population mean μ. When the estimate covers a range or an interval, it is called an interval estimate - the unknown population parameter is likely to be found in this interval.
A sample statistic, such as X_bar, is a random variable; the values of this random variable depend on the values in the random sample from which the statistic is computed; the sample itself depends on the population from which it is drawn. This random variable has a probability distribution, which is called the sampling distribution.
The principal use of sampling distributions and their related concepts is to help predict how close the estimate is to the population parameter, and with what probability.

Central Limit theorem:
The sample statistic sample mean X_bar exhibits a unique behavior - regardless of the population distribution (uniform, exponential, other), in the limit n → ∞ ("n tends to infinity", where n is the sample size), the sampling distribution of X_bar tends to a normal distribution. The rate at which the sampling distribution approaches normal distribution depends on the population distribution. Now, if the population itself is normally distributed, then X_bar is normally distributed for any sample size. This is the essence of what is called Central Limit theorem.

Formally, when a population with mean μ and standard deviation σ is sampled, the sampling distribution of the sample mean X_bar will tend to a normal distribution with (the same) mean μ and standard deviation σ_{x_bar} = σ/√n, as the sample size n becomes large. "Large" is empirically defined to be n ≥ 30. The value σ_{x_bar} is called the standard error of the mean.
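
To see the standard error formula at work outside the database, here is a small Python sketch (my own illustration, not part of the Oracle toolset). It assumes a hypothetical exponential population with μ = 1 and σ = 1, draws many samples of size n = 30, and compares the spread of the sample means against σ/√n. The helper name sample_means and the seed are arbitrary choices.

```python
import random
import statistics

def sample_means(draw, n, trials, rng):
    """Return `trials` sample means, each computed from `n` calls to draw(rng)."""
    return [statistics.fmean([draw(rng) for _ in range(n)]) for _ in range(trials)]

# Hypothetical, clearly non-normal population: exponential with rate 1,
# so the population mean and standard deviation are both 1.
rng = random.Random(42)
n = 30
means = sample_means(lambda r: r.expovariate(1.0), n, 20000, rng)

observed_mean = statistics.fmean(means)   # should sit close to mu = 1
observed_se = statistics.stdev(means)     # spread of the sample means
predicted_se = 1.0 / n ** 0.5             # sigma / sqrt(n)

print(f"mean of sample means:    {observed_mean:.3f} (population mean 1.000)")
print(f"std dev of sample means: {observed_se:.3f} (sigma/sqrt(n) = {predicted_se:.3f})")
```

The two printed pairs should agree closely, even though the population itself is heavily skewed.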

"Okay,... so what is the big deal?". The big deal is that we can now estimate the population mean (regardless of the population's distribution) using the familiar technique (that we saw in an earlier post) for standard normal distribution.

Now, in practice, one or more of the population parameters (like the standard deviation) are often unknown. The computations have to be modified to accommodate these unknowns - which brings us to two more concepts associated with sampling distributions.

Degrees of Freedom (DF):
If we are asked to choose three random numbers a, b and c, we are free to choose any three numbers without any restrictions - in other words, we have 3 degrees of freedom. But if the three numbers are put together in a model a + b + c = 10, then we have just 2 degrees of freedom - the choice of a and b can be arbitrary, but c is constrained to take a specific value that satisfies the model. The use of df appears to be a compensatory mechanism in the computations, specific to the context/situation in which it is applied - so we'll discuss this in the context of the technique we are illustrating.
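
One familiar instance of this constraint is the sample variance S²: the deviations from the sample mean must sum to zero, so only n - 1 of them are free. The Python sketch below uses four made-up observations to show the last deviation being forced by the others, and the n - 1 divisor that results.

```python
import statistics

sample = [4.0, 7.0, 9.0, 12.0]            # hypothetical observations
x_bar = statistics.fmean(sample)          # sample mean
deviations = [x - x_bar for x in sample]

# The deviations always sum to zero, so once n - 1 of them are known
# the last one is determined: it must cancel the rest.
implied_last = -sum(deviations[:-1])

# The sample variance therefore divides by the n - 1 free deviations.
n = len(sample)
s_squared = sum(d * d for d in deviations) / (n - 1)

print(f"last deviation: {deviations[-1]}, implied by the others: {implied_last}")
print(f"S^2 with n - 1 divisor: {s_squared:.4f}")
```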

Confidence Interval:
An interval estimate, with its associated measure of confidence, is called a confidence interval. It is a range of numbers that probably contains the unknown population parameter, with an adjoining level of confidence that it indeed does. This is better than a point estimate in that it gives some indication of the accuracy of the estimation.

In an earlier post, we briefly touched upon the transformation of a normal random variable (X, with arbitrary μ and σ) to a standard normal variable (Z, with μ = 0 and σ = 1). The transformations are X to Z: Z = (X - μ)/σ and Z to X: X = μ + Zσ. Applying the latter transformation to the sampling distribution of X_bar, which has mean μ and standard deviation σ/√n, the sample mean falls within μ ± Z σ/√n with the probability associated with Z. Equivalently, μ lies within the same distance of the observed sample mean, so the confidence interval for the population mean is x_bar ± Z σ/√n.

A typical question will be "Give me the 95% confidence interval for the population mean". Given the confidence level, and the knowledge that the area under the standard normal curve is 1, we can obtain the value of Z from the standard normal table. For example, a 95% confidence level translates to an area of 0.95 symmetrically distributed around the mean, leaving 0.025 as areas on the left and right tails. From the table, Z = -1.96 for P=0.025, and Z = 1.96 for P=(0.025+0.95). So the 95% confidence interval for the population mean, when the population standard deviation is known, is given by x_bar ± 1.96 σ/√n.
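
The table lookup and the interval arithmetic can be sketched in a few lines of Python, using the standard library's NormalDist in place of the printed table. The example plugs in the river figures from the start of this post (100 samples, mean 50 ppm) and assumes, for illustration only, that 4.5 can be treated as the known population σ. The function names are mine.

```python
from math import sqrt
from statistics import NormalDist

def z_critical(confidence):
    """Z value that leaves (1 - confidence)/2 in each tail of the standard normal."""
    tail = (1.0 - confidence) / 2.0
    return NormalDist().inv_cdf(1.0 - tail)

def interval_known_sigma(x_bar, sigma, n, confidence=0.95):
    """Return (lower, upper) for the population mean when sigma is known."""
    margin = z_critical(confidence) * sigma / sqrt(n)
    return x_bar - margin, x_bar + margin

z95 = z_critical(0.95)
# River example: x_bar = 50 ppm, n = 100, sigma assumed known and equal to 4.5.
lower, upper = interval_known_sigma(50.0, 4.5, 100)
print(f"z = {z95:.3f}, interval = ({lower:.2f}, {upper:.2f}) ppm")
# prints: z = 1.960, interval = (49.12, 50.88) ppm
```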

We'll wrap this section reinforcing some concepts for use later:

In probability-speak, the statement "95% confidence interval for the population mean" implies that "if we repeatedly draw random samples of the same size from the same population and compute an interval from each, about 95% of those intervals will contain the population mean". It does NOT imply a "95% probability that the population mean is a value in the range of the interval" computed from one particular sample: μ is a fixed (if unknown) number, and a given interval either contains it or does not. In the figure, the sample mean x_bar for a specific sample falls within μ ± z σ/√n, so the confidence interval built around that x_bar contains the population mean μ. If x_bar for another sample falls in the tail region, then the interval built around it does not contain μ.
The quantity Z σ/√n is called the sampling error or margin of error.
The combined area under the curve in the tails (i.e. 1 - 0.95 = 0.05 in the above example) is called the level of significance α, or the error probability.
The area under the curve excluding the tails (1 - α) is called the confidence coefficient.
The confidence coefficient x 100, expressed as a percentage, is the confidence level.
The Z value that cuts off the area under the right tail (i.e. the area α/2 on the right of the curve, 1.96 in our example) is denoted as z_α/2.
For a large sample (n ≥ 30), or a sample taken from a normally distributed population, the (1 - α) 100% confidence interval for μ with known σ is x_bar ± z_α/2 σ/√n.
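
The repeated-sampling reading of "95% confidence" can be checked with a simulation. The Python sketch below assumes a hypothetical normal population (μ = 50, σ = 4.5), draws 10,000 samples of size 25, builds the known-σ interval around each sample mean, and counts how often the interval contains μ. All parameter values and the function name are illustrative.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def coverage(mu, sigma, n, confidence, trials, rng):
    """Fraction of simulated intervals x_bar +/- z*sigma/sqrt(n) that contain mu."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    margin = z * sigma / sqrt(n)
    hits = 0
    for _ in range(trials):
        # One random sample of size n from the assumed normal population.
        x_bar = fmean([rng.gauss(mu, sigma) for _ in range(n)])
        if x_bar - margin <= mu <= x_bar + margin:
            hits += 1
    return hits / trials

rng = random.Random(7)
rate = coverage(mu=50.0, sigma=4.5, n=25, confidence=0.95, trials=10000, rng=rng)
print(f"intervals containing mu: {rate:.1%}")
```

The printed rate should land near 95%; the few percent of intervals that miss are the ones whose sample mean fell in a tail.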

Confidence Interval for Population Mean with Known σ:
Excel has a CONFIDENCE() function that returns the margin of error for such a confidence interval. See a simple equivalent for Oracle SQL below. The function takes in a table name, the column name representing the sampled quantity, the percentage of rows to sample, a level of significance of 0.05, 0.01, or 0.1 (corresponding to the three popular confidence levels - 95%, 99%, and 90% - respectively), and an optional seed for repeatable sampling. It returns an object that contains the sample mean (in the field named pop_mean), the sampling error, and the lower and upper bounds of the interval. For any other level of significance it returns NULL.

CREATE OR REPLACE TYPE conf_interval_t AS OBJECT (
  pop_mean NUMBER, sample_err NUMBER, lower NUMBER, upper NUMBER);
/
CREATE OR REPLACE FUNCTION confidence_interval (
  table_name     IN VARCHAR2,
  column_name    IN VARCHAR2,
  sample_percent IN NUMBER,
  alpha          IN NUMBER DEFAULT 0.05,
  seed           IN NUMBER DEFAULT NULL)
RETURN conf_interval_t IS
  pop_mean   NUMBER;
  pop_stddev NUMBER;
  sample_sz  NUMBER;
  z          NUMBER;
  err        NUMBER;
  v_stmt     VARCHAR2(32767);
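BEGIN
  -- Sample mean and sample size from a random sample of the table.
  -- NOTE: despite its name, pop_mean holds the sample mean (x_bar).
  -- table_name and column_name are concatenated as-is: pass trusted values only.
  v_stmt :=
  'SELECT AVG(' || column_name || '), count(*) ' ||
    'FROM (SELECT * ' ||
            'FROM ' || table_name ||
                  ' SAMPLE(' || sample_percent || ')';
  IF (seed IS NOT NULL) THEN
    v_stmt := v_stmt || ' SEED(' || seed || ')';
  END IF;
  v_stmt := v_stmt || ')';
  EXECUTE IMMEDIATE v_stmt INTO pop_mean, sample_sz;

  -- Standard deviation over the full table stands in for the "known" sigma.
  -- STDDEV divides by n - 1; STDDEV_POP would give the population form,
  -- though the two are nearly identical for a large table.
  v_stmt :=
  'SELECT STDDEV(' || column_name || ') ' ||
    'FROM ' || table_name;
  EXECUTE IMMEDIATE v_stmt INTO pop_stddev;

  -- z_alpha/2 for the three supported significance levels (rounded table
  -- values; 1.645 and 2.576 are the more precise figures for 0.1 and 0.01).
  IF (alpha = 0.05) THEN
    z := 1.96;
  ELSIF (alpha = 0.01) THEN
    z := 2.57;
  ELSIF (alpha = 0.1) THEN
    z := 1.64;
  ELSE
    RETURN(NULL);  -- unsupported level of significance
  END IF;

  -- Margin of error z * sigma / sqrt(n), then the interval around the sample mean.
  err := z * pop_stddev / SQRT(sample_sz);
  RETURN (conf_interval_t(pop_mean, err, (pop_mean - err), (pop_mean + err)));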
END confidence_interval;
/

I used this function to find the 90%, 95%, and 99% confidence intervals for the population mean of ORDER_TOTAL in the ORDERS table, sampling approximately 15% of the rows, with a seed to enable repeatable runs from the SQL sampler. Notice how the interval widens and becomes less precise as the confidence level increases. The true population mean is also shown to be contained in each interval.

SQL> select confidence_interval('ORDERS', 'ORDER_TOTAL', 15, 0.1, 3) from dual;

CONFIDENCE_INTERVAL('ORDERS','ORDER_TOTAL',15,0.1,3)(POP_MEAN, SAMPLE_ERR, LOWER, UPPER)
-------------------------------------------------------------------------
CONF_INTERVAL_T(24310.9188, 21444.6451, 2866.27368, 45755.5638)

SQL> select confidence_interval('ORDERS', 'ORDER_TOTAL', 15, 0.05, 3) from dual;

CONFIDENCE_INTERVAL('ORDERS','ORDER_TOTAL',15,0.05,3)(POP_MEAN, SAMPLE_ERR, LOWER, UPPER)
-------------------------------------------------------------------------
CONF_INTERVAL_T(24310.9188, 25628.9661, -1318.0473, 49939.8848)

SQL> select confidence_interval('ORDERS', 'ORDER_TOTAL', 15, 0.01, 3) from dual;

CONFIDENCE_INTERVAL('ORDERS','ORDER_TOTAL',15,0.01,3)(POP_MEAN, SAMPLE_ERR, LOWER, UPPER)
-------------------------------------------------------------------------
CONF_INTERVAL_T(24310.9188, 33605.3279, -9294.4092, 57916.2467)

SQL> select avg(order_total) from orders;

AVG(ORDER_TOTAL)
----------------
      34933.8543

SQL>
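
As a quick sanity check on the three runs above, the widening comes entirely from the Z value: the sample (and so σ/√n) is the same in every run because the seed is fixed. The Python sketch below copies the SAMPLE_ERR figures from the output and the Z values hard-coded in the function, and divides one by the other; the dictionary names are mine.

```python
# Margins of error (SAMPLE_ERR) copied from the three runs above, keyed by alpha.
margins = {0.10: 21444.6451, 0.05: 25628.9661, 0.01: 33605.3279}
# Z values hard-coded in confidence_interval for each alpha.
z_values = {0.10: 1.64, 0.05: 1.96, 0.01: 2.57}

# margin / z should recover the same standard error sigma/sqrt(n) every time.
std_errors = {alpha: margins[alpha] / z_values[alpha] for alpha in margins}
for alpha, se in std_errors.items():
    print(f"alpha={alpha:.2f}  margin={margins[alpha]:.4f}  sigma/sqrt(n)={se:.2f}")
```

All three rows should show the same standard error, roughly 13076.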

Fine - but how do we find the confidence interval when σ is unknown (which is the norm in practice)? Enter T (or Student's) Distribution - we will look at this in the next post.