Continuous Probability Distributions

A continuous probability distribution is an (infinitely large) table that lists the continuous outcomes of an experiment along with the relative frequency (a.k.a. probability) of each outcome. Consider a histogram that plots the probability (y axis) that a particular job will get done within a time interval (x axis). As you keep making the intervals shorter and more fine-grained, the step-like top of the histogram eventually melds into a curve - called the continuous probability distribution. The total area under this probability curve is 1, the probability that the value of x lies between two values a and b is the area under f(x) between a and b, and f(x) >= 0 for all x. For continuous distributions, the probability of any single point in the distribution is 0 - you can compute a non-zero probability only for an interval between two values of the continuous variable x.

Oracle SQL provides statistical functions to determine if the values in a given column fit a particular distribution. Before we proceed to the examples, let us look at some of the most commonly used distributions.

Normal (Gaussian) Probability Distribution
The most common continuous probability distribution - to the point of being synonymous with the concept - is the Normal Probability Distribution. It is represented graphically by a bell curve, plotted with the continuous value on the x axis and the probability along the y axis, symmetric about the mean value of x, with the two ends tapering off to infinity. The curve has these properties:

  • The mean, median and mode are the same
  • The distribution is bell-shaped and symmetrical about the mean
  • The area under the curve is always equal to 1
The generic normal distribution can have any mean value and standard deviation. For example, weather data may indicate an annual average rainfall in Boston of 38 inches with a standard deviation of 3 inches. The smaller the standard deviation (say, 2 inches instead of 3), the steeper the bell curve is about the mean. If the mean were to shift to, say, 40 inches, the symmetric bell curve would shift two inches to the right as well. The probability density function for the normal distribution is given by:
f(x) = (1/(σ√(2π))) · e^(-0.5 · ((x - μ)/σ)²)
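
As a quick illustration, here is a minimal Oracle SQL sketch that evaluates this density at a single point, using the rainfall figures above (μ = 38, σ = 3) at an arbitrary x of 40.5 inches; ACOS(-1) simply stands in for π, and the hard-coded constants are there purely for demonstration:

  -- evaluate the normal density f(x) at x = 40.5 for mu = 38, sigma = 3
  SELECT 1 / (3 * SQRT(2 * ACOS(-1)))
         * EXP(-0.5 * POWER((40.5 - 38) / 3, 2)) AS f_x
  FROM   dual;
  -- roughly 0.094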

The Standard Normal Distribution is a special case of the normal distribution with μ=0 and σ=1 as shown below (graph not to scale). The standard z-score is a derivative of the standard normal distribution. It is given by z = (x - μ)/σ. The value of z is then looked up in a standard normal table to arrive at the probability of the required interval. Unlike discrete random variables in a discrete probability distribution, continuous variables can have infinite values for a given event - so the probability can be computed only for an interval or range of values. Continuing with the rainfall example, if we ask for the probability that the annual rainfall next year at Boston will be <= 40.5 inches, we compute z = (40.5 - 38)/3 = 0.8333. Stated another way, this means that a rainfall of 40.5 inches is 0.8333 standard deviations away from the mean. From the standard normal table, the probability is 0.7967 - that is, roughly 80%.
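
The z computation itself is trivial arithmetic, so a one-line Oracle SQL query (a sketch, with the example's constants hard-coded) is all it takes; the table lookup that yields 0.7967 still happens outside SQL, since, as noted next, Oracle SQL has no NORMDIST()-style convenience function:

  -- z-score for the rainfall example: how many standard deviations
  -- is 40.5 inches from a mean of 38 with a standard deviation of 3?
  SELECT ROUND((40.5 - 38) / 3, 4) AS z_score FROM dual;
  -- returns 0.8333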

Microsoft Excel's NORMDIST() function provides this functionality, but I was surprised to find no function in Oracle SQL with equivalent simplicity - I'll file a feature bug after some more research. The Oracle OLAP Option provides a NORMAL() function as part of its OLAP DML interface. This Calc-like interface is different from SQL - so we will defer this for later.

Application to Data Mining
A key use of the z-score is as a "normalizing" data transformation for mining applications. Note that this concept is completely unrelated to database normalization. The stolen car example in a previous post was a simple example of prediction - we used a few categorical attributes like a car's color and type to predict whether a car will be stolen or not.

In the business world, the applications are more grown-up and mission-critical. One example is churn prediction - i.e. finding out whether a (say, wireless) customer will stay loyal to the current provider, or move on ("churn") to a competitor (in which case, the current provider could try to entice him/her to stay with appropriate promos). The customer data used for such churn prediction applications contains categorical (e.g. gender, education, occupation) and numerical (e.g. age, salary, FICO score, distance of residence from a metro) attributes/columns in a table. The data in these numeric columns will be widely dispersed, across different scales. For example, salary values can range from tens of thousands to several million. Two numerical attributes will also be on different scales - for example, salary (30K - 2 million) vs. age (1 - 100). Such disparity in scales, if left untreated, can throw most mining algorithms out of whack - the attributes with the higher range of values start outweighing those in the lower range during the computation of the prediction. For such algorithms, the numerical data is normalized to a smaller range such as [-1, 1] or [0, 1] using the z-transform, to enable uniform handling of numerical data by the algorithm.
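
As an illustration, here is a minimal Oracle SQL sketch of z-score normalization using analytic functions; the customers table and its cust_id/salary columns are hypothetical stand-ins for whatever the mining input table actually looks like:

  -- z-score normalization of a numeric column using analytic functions:
  -- each salary is re-expressed as "number of standard deviations from the mean"
  SELECT cust_id,
         salary,
         (salary - AVG(salary) OVER ()) / STDDEV(salary) OVER () AS salary_z
  FROM   customers;

Because AVG(...) OVER () and STDDEV(...) OVER () compute the column-wide mean and standard deviation, each row's salary becomes its z-score and the values cluster around 0 regardless of the original scale.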

Min-max and decimal scaling are other data normalization techniques. Here is one primer on mining data transformations. We will discuss mining transformations using DBMS_DATA_MINING_TRANSFORM package and Oracle SQL in a separate post.
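
For comparison, min-max scaling of the same hypothetical salary column to [0, 1] can also be expressed with analytic functions - again just a sketch, not the DBMS_DATA_MINING_TRANSFORM approach that the later post will cover:

  -- min-max scaling of salary to the range [0, 1]
  SELECT cust_id,
         (salary - MIN(salary) OVER ())
           / (MAX(salary) OVER () - MIN(salary) OVER ()) AS salary_minmax
  FROM   customers;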

Uniform Distribution
Ever waited outside an airport terminal under a sign that says "Rental-Cars/Long-term Parking - Pickup Every 10 minutes"? Your wait time is an example of a uniform distribution - assuming a well-run airport, you arrive at the stop and expect to wait anywhere from 5 to at most 15 minutes for your shuttle. This simplest of continuous distributions has the probability function
f(x) = 1/(b - a) for a <= x <= b; f(x) = 0 for all other values of x
and is graphically represented as shown. The probability that a uniformly distributed random variable X will have values in the range x1 to x2 is:
P(x1 <= X <= x2) = (x2 - x1)/(b - a), a <= x1 < x2 <= b.
The mean E(X) = (a+b)/2 and variance V(X) = (b - a)²/12.
To use the shuttle bus example (a = 5, b = 15), the probability that the wait time will be 8 to 11 minutes is
P(8 <= X <= 11) = (11 - 8)/(15 - 5) = 3/10 = 0.3
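
If you want to sanity-check this in Oracle SQL, a quick simulation with DBMS_RANDOM.VALUE (a sketch; the 100,000-sample count is an arbitrary choice) should land close to the analytical answer:

  -- simulate uniform(5, 15) wait times and estimate P(8 <= X <= 11)
  SELECT AVG(CASE WHEN wait BETWEEN 8 AND 11 THEN 1 ELSE 0 END) AS p_estimate
  FROM  (SELECT DBMS_RANDOM.VALUE(5, 15) AS wait
         FROM   dual
         CONNECT BY LEVEL <= 100000);
  -- should come out close to 0.3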

Exponential Distribution
Consider that an event occurs with an average frequency (a.k.a. rate) of λ and this average frequency is constant. Consider that, from a given point in time, you wait for the event to occur. This waiting time follows an exponential distribution - depicted in the adjoining figure. The probability density function is given by: f(x) = λ·e^(-λx), where λ is the frequency with which the event occurs - expressed as a particular number of times per time unit. The mean, or more appropriately, the expected value E(X) of the distribution is μ = 1/λ; the variance is σ² = (1/λ)².

Take the same shuttle bus example. If the bus does not stick to any schedule and randomly goes about its business of picking up passengers, then the wait time is exponentially distributed. This sounds a bit weird, but there are several temporal phenomena in nature that exhibit such behavior:
  • The time between failures of an ordinary light bulb (which typically just blows out suddenly), or of some electronic component, follows an exponential distribution. The mean time between failures (MTBF), μ, is an important metric in the parts failure/warranty claims domain
  • The time between arrivals, i.e. inter-arrival time of customers at any check-in counter is exponentially distributed.
A key property of an exponentially distributed phenomenon is that it is memoryless - best explained with an example. Suppose you buy a 4-pack of GE light bulbs at Walmart with an MTBF of 7000 hours (~10 months). A bulb blows out and you plug in a new one; the time to the next failure of this bulb is exponentially distributed with the same MTBF. Say this second bulb fails and you replace it a month (24x30 = 720 hours) later; the time to the next failure is still exponentially distributed with the same MTBF. In other words, P(X > s + t | X > s) = P(X > t) - the time to the next failure is independent of when the previous bulb failed and of how much time passed before it was replaced.
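
A quick way to convince yourself of this is another Oracle SQL sketch with DBMS_RANDOM, generating exponential failure times via the inverse-transform trick -LN(U)·μ; the 100,000-sample count and the 720-hour offset are just illustrative choices:

  -- memoryless check for MTBF = 7000 hours:
  -- P(X > 720 + 7000 | X > 720) should come out close to P(X > 7000) = e^(-1) ~ 0.368
  WITH samples AS (
    SELECT -LN(DBMS_RANDOM.VALUE) * 7000 AS ttf
    FROM   dual
    CONNECT BY LEVEL <= 100000
  )
  SELECT SUM(CASE WHEN ttf > 7720 THEN 1 ELSE 0 END)
           / SUM(CASE WHEN ttf > 720 THEN 1 ELSE 0 END) AS p_conditional,
         AVG(CASE WHEN ttf > 7000 THEN 1 ELSE 0 END)    AS p_unconditional
  FROM   samples;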

The probability functions are best stated in terms of survival (or failure). The probability that an item will survive for at least x units of time, given an MTBF of μ units, can be stated as:
P(X >= x) = e^(-λx), x >= 0
and conversely, the probability of it failing before x units of time is given by
P(X <= x) = 1 - e^(-λx), x >= 0
where λ = 1/μ.

For example - if Dell claims your laptop fails following an exponential distribution with MTBF 60 months, and the warranty period is 90 days (3 months), what percentage of laptops does Dell expect to fail within the warranty period?
P(X <= 3 months) = 1 - e^(-(1/60)·3) = 1 - e^(-0.05) ≈ 0.0488, i.e. ~5% of laptops.
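
Oracle SQL's EXP() function can evaluate this directly - a trivial check of the arithmetic above:

  -- expected fraction of laptops failing within a 3-month warranty,
  -- given exponentially distributed failures with MTBF = 60 months
  SELECT ROUND(1 - EXP(-(1/60) * 3), 4) AS p_fail_in_warranty FROM dual;
  -- returns 0.0488, i.e. roughly 5%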

Weibull Distribution
The Weibull distribution is a versatile distribution that can emulate other distributions based on its parameter settings, and is a widely used and important tool for reliability engineering/survival analysis. An extensive coverage of reliability analysis is provided here. I am mentioning this distribution here mainly to set up the next post. There are several other popular probability distributions - we will revisit them in the future on an as-needed basis.

For now, let us break off and look at some Oracle code in the next post.
