Talk at WPI - Oracle In-Database Mining
My plan was to finish a full series on Oracle tools for Basic Statistics before moving on to typical "blog-like" postings - but I realize I'll need at least another month (or two) to get to topics like ANOVA and Multiple Regression. I feel compelled to get a basic feel for a technique/concept before writing up examples of the functions here (which anybody can quickly get from the Oracle SQL Reference Manual).
So I am (reluctantly) breaking my rule to sneak in this note about a talk I gave in Nov '07 at the Worcester Polytechnic '07 Colloquium Series - before it loses its relevance. The talk provides the motivation for in-database mining, and the presentation slides offer a good intro to Oracle Data Mining in the 11gR1 Database release.
Oracle has touted the benefits of in-database mining for over 4 years now, with steady improvements to the product and an expanding customer base. The recent announcement from market leader SAS on integration with Teradata is a nice vindication of this approach. The details of the integration, whether this engineering effort pans out, and how the market receives the integrated product remain to be seen - but this is good for the industry. SAS still owns the bully pulpit in Analytics, and this move can hopefully lead to greater awareness of the benefits of this approach, and to consolidation in this space. Marcos has promised to write about these developments from his perspective as a scientist/developer - so I will hold off on my thoughts in anticipation of that post.
Impressions of WPI - a good school with active AI and Database research groups. I had the pleasure of meeting Prof. Ruiz and Prof. Heffernan from the AI research group - they have some interesting projects there - and Prof. Rundentstein and Prof. Mani from the database research group. Check these programs out if you are an aspiring CS student.
Student's T-Distribution, Degrees of Freedom
In the previous post, we showed how to compute the confidence interval when σ - i.e. the population standard deviation - is known. But in a typical situation, σ is rarely known.
Computing the Confidence Interval for μ when σ is Unknown: The solution is to use the sample standard deviation in a variant of the standardized statistic for the normal distribution, z = (X_bar - μ)/(σ/√n).
Student's T Distribution:
For a normally distributed population, the Student's T distribution is given by this standardized statistic: t = (X_bar - μ)/(S/√n), with (n - 1) degrees of freedom (df) for the deviations, where S is the sample standard deviation, and n is the sample size. Key points:
- The t distribution resembles the bell-shaped z (normal) distribution, but with wider tails than z, with mean zero (the mean of z), and with variance tending to 1 (the variance of z) as df increases. For df > 2, σ² = df/(df - 2).
- The z (normal) distribution deals with one unknown μ - estimated by the random variable X_bar, while the t distribution deals with two unknowns μ and σ - estimated by random variables X_bar and S respectively. So it tacitly handles greater uncertainty in the data.
- As a consequence, the t distribution has wider confidence intervals than z
- There is a t-distribution for each df=1,..n.
- A good comparison of the distributions is provided here.
- For a sample (n < 30) taken from a normally distributed population, a (1 - α)100% confidence interval for μ when σ is unknown is X_bar ± tα/2 s/√n. This is the better distribution to use for small samples - with (n - 1) df, and unknown μ and σ.
- For larger samples (n ≥ 30), and/or with larger df, the t distribution can be approximated by the z distribution, and the (1 - α)100% confidence interval for μ is X_bar ± zα/2 s/√n (note: using the sample sd itself).
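As a side note, Oracle SQL (as of this writing) has no simple built-in function for t critical values, so a quick sketch of a t-based interval has to hard-code the value looked up from a t table. The sketch below is illustrative only - it assumes the OE.ORDERS sample-schema table used elsewhere in this blog, a sample of roughly 15 rows, and t0.025 = 2.145 for df = 14 (the t value must match the actual sample size):

-- sketch: 95% t-based confidence interval for the mean ORDER_TOTAL,
-- using the sample standard deviation and a hard-coded t critical value (df = 14)
select avg(order_total) - 2.145 * stddev(order_total)/sqrt(count(*)) lower_bound,
       avg(order_total) + 2.145 * stddev(order_total)/sqrt(count(*)) upper_bound
from (select order_total from orders sample (15) seed (3));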
Degrees of Freedom:
Suppose we take 5 random samples - 3, 2, 4, 1, 2 - from a population whose mean is known to be 3. Given this mean, the deviation computation for the 5 random samples effectively retains 5 degrees of freedom:
x | x_bar | deviation | deviation_squared
3 | 3 | 0.0 | 0
2 | 3 | -1.0 | 1
4 | 3 | 1.0 | 1
1 | 3 | -2.0 | 4
2 | 3 | -1.0 | 1
Sum of Squared Deviation = 7
Next, assume we don't know the population mean, and instead are asked to compute the deviations from one number of our choosing. Our goal is to choose a number that minimizes the deviations. A readily available number is the sample mean (3+2+4+1+2)/5 = 2.4 - so we will use it:
x | x_bar | deviation | deviation_squared
3 | 2.4 | 0.6 | 0.36
2 | 2.4 | -0.4 | 0.16
4 | 2.4 | 1.6 | 2.56
1 | 2.4 | -1.4 | 1.96
2 | 2.4 | -0.4 | 0.16
Sum of Squared Deviation = 5.2
The use of the sample mean biases the SSD downward from 7 (actual) to 5.2. But given this choice of mean, the deviations for the same 5 random samples retain df = (5 - 1) = 4 degrees of freedom.
Subsequent choices of 2 means - (3+2)/2, (4+1+2)/3 - or 3 means would reduce the SSD further, while at the same time reducing the degrees of freedom for the deviations of the 5 random samples: df = (5 - 2), df = (5 - 3), and so on. As an extreme case, if we consider the "mean" of each sampled number to be the number itself, then we have:
x | x_bar | deviation | deviation_squared
3 | 3 | 0 | 0
2 | 2 | 0 | 0
4 | 4 | 0 | 0
1 | 1 | 0 | 0
2 | 2 | 0 | 0
Sum of Squared Deviation = 0
which reduces the SSD to 0, and the deviation df to (5 - 5) = 0. So in general,
- deviations (and hence the SSD) for a sample of size n, measured from a known population mean μ, have df = n
- deviations for a sample of size n, measured from the sample mean X_bar, have df = (n - 1)
- deviations for a sample of size n, measured from k ≤ n different numbers (typically means of sample points), have df = n - k.
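As a quick sanity check of the (n - 1) convention, Oracle's VARIANCE (and STDDEV) functions already divide the sum of squared deviations from the sample mean by (n - 1). A small sketch against the 5 samples above, using an inline collection just for illustration:

select sum(power(x - 2.4, 2)) ssd,                 -- 5.2, as computed above
       sum(power(x - 2.4, 2)) / (5 - 1) mean_sq,   -- 5.2 / 4 = 1.3
       variance(x) sample_var                      -- also 1.3, i.e. df = (n - 1)
from (select column_value x from table(sys.odcinumberlist(3, 2, 4, 1, 2)));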
Sampling Distributions, Confidence Interval
The key goal of inferential statistics is to make predictions/observations about the population (the whole) as a generalization of observations made on a random sample (the part). In the previous post, we discussed common techniques to derive samples from a population. In this post, we will discuss sampling distributions - a key building block for the practice of statistical inference. These tools help answer questions such as: "What should be the sample size to make a particular inference about this population?" or "100 random water samples along this river show an average of 50 ppm (parts per million) of this toxin, with a standard deviation of 4.5 - what is the average contamination of the river, with a 95% confidence interval?", and so on.
The objective in the next few posts is to discuss the use of Oracle SQL statistical functions for various sampling distributions. But if you are a typical DB developer with novice/intermediate knowledge of statistics (like me), spending some time on these foundational concepts may be worthwhile. I am currently using Complete Business Statistics and the Idiot's Guide to Statistics as my guides - you may use these or other books and/or the free online references on the right pane of this blog.
- The various numerical measures - such as mean, variance etc - when applied to a sample, are called sample statistics or simply statistics.
- When these numerical measures are applied to a population, they are called population parameters or simply parameters.
- An estimator of a population parameter is the sample statistic used to estimate the parameter. The sample statistic - mean, X_bar - estimates the population mean μ; the sample statistic - variance, S² - estimates the population variance σ².
- When a single numerical value is the estimate, it is called a point estimate. For example, when we sample a population and obtain a value for X_bar - the statistic - we get a specific sample mean, denoted by x_bar (lower-case), which is the estimate for population mean μ. When the estimate covers a range or an interval, it is called an interval estimate - the unknown population parameter is likely to be found in this interval
- A sample statistic, such as X_bar, is a random variable; the values of this random variable depend on the values in the random sample from which the statistic is computed; the sample itself depends on the population from which it is drawn. This random variable has a probability distribution, which is called the sampling distribution.
- The principal use of sampling distributions and its related concepts is to help predict how close the estimate is to the population parameter, and with what probability.
The sample statistic sample mean, X_bar, exhibits a unique behavior - regardless of the population distribution (uniform, exponential, other), in the limit n → ∞ ("n tends to infinity", where n is the sample size), the sampling distribution of X_bar tends to a normal distribution. The rate at which the sampling distribution approaches the normal distribution depends on the population distribution. Now, if the population itself is normally distributed, then X_bar is normally distributed for any sample size. This is the essence of what is called the Central Limit Theorem.
Formally, when a population with mean μ and standard deviation σ is sampled, the sampling distribution of the sample mean X_bar will tend to a normal distribution with (the same) mean μ and standard deviation σx_bar = σ/√n, as the sample size n becomes large. "Large" is empirically defined to be n ≥ 30. The value σx_bar is called the standard error of the mean.
"Okay,... so what is the big deal?". The big deal is that we can now estimate the population mean (regardless of the population's distribution) using the familiar technique (that we saw in an earlier post) for standard normal distribution.
In practice, however, it is not common that population parameters (like the standard deviation) are known. The computations have to be modified to accommodate these unknowns - which brings us to two more concepts associated with sampling distributions.
Degrees of Freedom (DF):
If we are asked to choose three random numbers a, b and c, we are free to choose any three numbers without any restrictions - in other words, we have 3 degrees of freedom. But if the three numbers are put together in a model a + b + c = 10, then we have just 2 degrees of freedom - the choice of a and b can be arbitrary, but c is constrained to take the specific value that satisfies the model. The use of df appears to be a compensatory mechanism in the computations, specific to the context/situation in which it is applied - so we'll discuss it in the context of the technique we are illustrating.
Confidence Interval:
An interval estimate, with its associated measure of confidence, is called a confidence interval. It is a range of numbers that probably contains the unknown population parameter, with an accompanying level of confidence that it indeed does. This is better than a point estimate in that it gives some indication of the accuracy of the estimation.
In an earlier post, we briefly touched upon the transformation of a normal random variable (X, with arbitrary μ and σ) to a standard normal variable (Z, with μ = 0 and σ = 1). The transformations are X to Z: Z = (X - μ)/σ and Z to X: X = μ + Zσ. Applying the latter transformation to standardized sampling distribution with mean μ and standard deviation σ/√n, the confidence interval for the population mean is μ ± Z σ/√n.
A typical question will be "Give me the 95% confidence interval for the population mean". Given the confidence level, and the knowledge that the area under the standard normal curve is 1, we can obtain the value of Z from the standard normal table. For example, a 95% confidence level translates to an area of 0.95 symmetrically distributed around the mean, leaving 0.025 as areas on the left and right tails. From the table, Z = -1.96 for P=0.025, and Z = 1.96 for P=(0.025+0.95). So the 95% confidence interval for the population mean, when the population standard deviation is known, is given by μ ± 1.96 σ/√n
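To make this concrete, the river-contamination question from the introduction can be answered directly in SQL (a quick sketch - 1.96 is the z value for 95% confidence):

-- 95% confidence interval for the mean toxin level:
-- sample mean 50 ppm, standard deviation 4.5, n = 100 samples
select 50 - 1.96 * 4.5 / sqrt(100) lower_bound,
       50 + 1.96 * 4.5 / sqrt(100) upper_bound
from dual;
-- lower_bound = 49.118, upper_bound = 50.882 - i.e. 50 ± 0.882 ppm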
We'll wrap up this section by reinforcing some concepts for use later:
- In probability-speak, the statement "95% confidence interval for the population mean" implies that "there is a 95% probability that a given confidence interval from a given random sample from the same population will contain the population mean". It does NOT imply a "95% probability that the population mean is a value in the range of the interval". In the figure, sample mean x for a specific sample falls within the interval - based on this, the confidence interval is considered to contain the population mean μ. If x for another sample falls in the tail region, then that confidence interval cannot assert that it contains μ.
- The quantity Z σ/√n is called sampling error or margin of error.
- The combined area under the curve in the tails (i.e. 1 - 0.95 = 0.05 in the above example) is called level of significance α, and/or error probability.
- The area under the curve excluding the tails (1 - α) is called the confidence coefficient.
- The confidence coefficient x 100, expressed as a percentage, is the confidence level.
- The Z value that cuts off the area under the right tail (i.e. the area α/2 on the right of the curve, 1.96 in our example) is denoted as zα/2.
- For a sample taken from a normally distributed population (any n), or a large sample (n ≥ 30) from any population, the (1 - α)100% confidence interval for μ with known σ is X_bar ± zα/2 σ/√n
Excel has a CONFIDENCE() function to compute the confidence interval. See a simple equivalent for Oracle SQL below. The function takes in a table name, the column name representing the sampled quantity, a sample percentage, a level of significance of 0.05, 0.01, or 0.1 (corresponding to the three popular confidence levels - 95%, 99%, and 90% respectively), and an optional seed, and returns an object that contains the sample mean, the sample error, and the lower and upper bounds of the interval.
CREATE OR REPLACE TYPE conf_interval_t AS OBJECT (
pop_mean NUMBER, sample_err NUMBER, lower NUMBER, upper NUMBER);
/
CREATE OR REPLACE FUNCTION confidence_interval (
table_name IN VARCHAR2,
column_name IN VARCHAR2,
sample_percent IN NUMBER,
alpha IN NUMBER DEFAULT 0.05,
seed IN NUMBER DEFAULT NULL)
RETURN conf_interval_t IS
pop_mean NUMBER;
pop_stddev NUMBER;
sample_sz NUMBER;
z NUMBER;
err NUMBER;
v_stmt VARCHAR2(32767);
BEGIN
v_stmt :=
'SELECT AVG(' || column_name || '), count(*) ' ||
'FROM (SELECT * ' ||
'FROM ' || table_name ||
' SAMPLE(' || sample_percent || ')';
IF (seed IS NOT NULL) THEN
v_stmt := v_stmt || ' SEED(' || seed || ')';
END IF;
v_stmt := v_stmt || ')';
EXECUTE IMMEDIATE v_stmt INTO pop_mean, sample_sz;
v_stmt :=
'SELECT STDDEV(' || column_name || ') ' ||
'FROM ' || table_name;
EXECUTE IMMEDIATE v_stmt INTO pop_stddev;
IF (alpha = 0.05) THEN
z := 1.96;
ELSIF (alpha = 0.01) THEN
z := 2.57;
ELSIF (alpha = 0.1) THEN
z := 1.64;
ELSE
RETURN(NULL);
END IF;
err := z * pop_stddev / SQRT(sample_sz);
RETURN (conf_interval_t(pop_mean, err, (pop_mean - err), (pop_mean + err)));
END confidence_interval;
/
I used this function to find the 90%, 95%, and 99% confidence intervals for the population mean of ORDER_TOTAL in the ORDERS table, sampling approximately 15% of the rows, with a seed to enable repeatable runs of the SQL sampler. Notice how the interval widens - becomes less precise - as the confidence level increases. The true population mean is also shown to be contained in the interval.

SQL> select confidence_interval('ORDERS', 'ORDER_TOTAL', 15, 0.1, 3) from dual;
CONFIDENCE_INTERVAL('ORDERS','ORDER_TOTAL',15,0.1,3)(POP_MEAN, SAMPLE_ERR, LOWER, UPPER)
-------------------------------------------------------------------------
CONF_INTERVAL_T(24310.9188, 21444.6451, 2866.27368, 45755.5638)
SQL> select confidence_interval('ORDERS', 'ORDER_TOTAL', 15, 0.05, 3) from dual;
CONFIDENCE_INTERVAL('ORDERS','ORDER_TOTAL',15,0.05,3)(POP_MEAN, SAMPLE_ERR, LOWER, UPPER)
-------------------------------------------------------------------------
CONF_INTERVAL_T(24310.9188, 25628.9661, -1318.0473, 49939.8848)
SQL> select confidence_interval('ORDERS', 'ORDER_TOTAL', 15, 0.01, 3) from dual;
CONFIDENCE_INTERVAL('ORDERS','ORDER_TOTAL',15,0.01,3)(POP_MEAN, SAMPLE_ERR, LOWER, UPPER)
-------------------------------------------------------------------------
CONF_INTERVAL_T(24310.9188, 33605.3279, -9294.4092, 57916.2467)
SQL> select avg(order_total) from orders;
AVG(ORDER_TOTAL)
----------------
34933.8543
SQL>

Fine - but how do we find the confidence interval when σ is unknown (which is the norm in practice)? Enter the t (or Student's) Distribution - we will look at this in the next post.
Sampling
The key goal of inferential statistics is to make predictions/observations about the population (the whole) as a generalization of observations made on a random sample (the part). In this post, we will present tools/techniques in Oracle for sampling data.
Sampling
To ensure high accuracy in the results of a statistical inference technique, the sample dataset should, at a minimum, have these properties:
- the sample must be drawn randomly from the population
- every segment of the population must be adequately and proportionately represented
- the sample should not be biased - a classic example is the non-response bias seen in survey/poll data - where the respondents ignore/refuse to answer a particular question ("Have you smoked marijuana?" in a health questionnaire).
The SAMPLE clause in the SQL SELECT statement supports simple random sampling and clustered sampling. Note that the sampling frame here is simply the set of rows returned by the statement - so you can control how many, and which, of the rows are sampled using filters in the WHERE clause. Here are some examples based on the OE.ORDERS table provided with the Sample Schema (connect oe/oe in SQL*Plus to see this table).
Example: Random Sampling without replacement:
Get random samples from ORDERS table with each row having a 15% chance of being in the sample
SQL> select order_id,order_status,order_total from orders sample (15) seed (3);
ORDER_ID ORDER_STATUS ORDER_TOTAL
---------- ------------ -----------
2354 0 46257
2360 4 990.4
2361 8 120131.3
2384 3 29249.1
2389 4 17620
2400 2 69286.4
2401 3 969.2
2402 8 600
2415 6 310
2416 6 384
2417 5 1926.6
ORDER_ID ORDER_STATUS ORDER_TOTAL
---------- ------------ -----------
2419 3 31574
2423 3 10367.7
2427 7 9055
2429 9 50125
2453 0 129
16 rows selected.
SQL>
The SEED clause in the above statement helps with repeatability of results from one run to the next. Next, achieving clustered sampling requires a BLOCK qualifier in the SAMPLE clause, as in:

select order_id,order_status,order_total from orders sample block (15) seed (3);

Example: Random Sampling with replacement:
This can be accomplished with some minimal coding. In the example below, we use a hashing function (ORA_HASH) on the sequence of enumerated (using the ROWNUM pseudo-column) rows to randomize the selection from the table to be sampled. Let's try this: "Get me random samples with replacement, with a sample size of 10 rows, from the ORDERS table".
-- stage a view with row numbers tacked on to original table
create view orders_view as
select rownum rnum, o.*
from orders o;
-- create a sequence
create sequence orders_seq10;
-- create a mapping table
create table orders_map(rnum number, rhsh number);
-- the requested sample size is 10
begin
for i in 1..10
loop
insert into orders_map (rnum) values (orders_seq10.nextval);
end loop;
end;
/
commit;
-- the complete orders table is the sampling frame
-- mark random sampled entries in mapping table
update orders_map set rhsh = ora_hash(rnum, (select count(*) from orders));
-- use the mapping table and orders_view to create a view with sampled rows
create view orders_sample10_view as
select o.*
from orders_view o, orders_map m
where o.rnum = m.rhsh;
This is the result of the above code snippet, when run in the OE schema using SQL*Plus:

SQL> select * from orders_sample10_view;
RNUM ORDER_ID ORDER_MO CUSTOMER_ID ORDER_STATUS ORDER_TOTAL SALES_REP_ID
------ ---------- -------- ----------- ------------ ----------- ------------
11 2455 direct 145 7 14087.5 160
12 2379 direct 146 8 17848.2 161
13 2396 direct 147 8 34930 161
14 2406 direct 148 8 2854.2 161
15 2434 direct 149 8 268651.8 161
16 2436 direct 116 8 6394.8 161
17 2446 direct 117 8 103679.3 161
18 2447 direct 101 8 33893.6 161
19 2432 direct 102 10 10523 163
20 2433 direct 103 10 78 163
10 rows selected.
SQL>

For repeatability of sampled results, change all "create view" statements above to "create table". It is also useful if the data to be sampled is persisted in a table, rather than presented to the sampling queries as a view.
Stratified Sampling can also be coded using SQL. But rather than provide an example query that is specific to a given table, we will provide a query generator function here. This is adapted from the output of code generated by Oracle Data Miner for stratified sampling. The function accepts a source and result table name, the name of the stratifying column (a.k.a. "variable" in statistics, "attribute" in data mining), the sampling size as a percentage of the table size, a scalar comparison operator ('<', '=', '>'), and an indicator to specify whether each stratum should have roughly the same number of elements.
An important note - Oracle Data Miner, the GUI interface for the Oracle Data Mining platform, provides sampling features and generates SQL/PLSQL code that corresponds to the sampling input. This example has been adapted from that generated code.
CREATE OR REPLACE TYPE TARGET_VALUES_LIST IS TABLE OF VARCHAR2(32);
/
CREATE OR REPLACE TYPE VALUE_COUNT_LIST IS TABLE OF NUMBER;
/
CREATE OR REPLACE
FUNCTION GENERATE_STRATIFIED_SQL (
result_table_name IN VARCHAR2,
source_table_name IN VARCHAR2,
strat_attr_name IN VARCHAR2,
percentage IN NUMBER,
op IN VARCHAR2,
equal_sized_strata IN BOOLEAN DEFAULT FALSE)
RETURN VARCHAR2 IS
v_stmt VARCHAR2(32767);
tmp_str VARCHAR2(4000);
sample_count PLS_INTEGER;
attr_names TARGET_VALUES_LIST;
attr_values TARGET_VALUES_LIST;
counts VALUE_COUNT_LIST;
counts_sampled VALUE_COUNT_LIST;
v_minvalue NUMBER;
BEGIN
v_stmt :=
'SELECT column_name ' ||
'FROM user_tab_columns ' ||
'WHERE table_name = ' || '''' || UPPER(source_table_name) || '''';
EXECUTE IMMEDIATE v_stmt BULK COLLECT INTO attr_names;
v_stmt :=
'SELECT /*+ noparallel(t)*/ ' || strat_attr_name || ', count(*), ' ||
'ROUND ((count(*) * ' || percentage || ')/100.0) ' ||
'FROM ' || source_table_name ||
' WHERE ' || strat_attr_name || ' IS NOT NULL ' ||
'GROUP BY ' || strat_attr_name;
EXECUTE IMMEDIATE v_stmt
BULK COLLECT INTO attr_values, counts, counts_sampled;
IF (equal_sized_strata = TRUE) THEN
FOR i IN counts.FIRST..counts.LAST
LOOP
IF (i = counts.FIRST) THEN
v_minvalue := counts(i);
ELSIF (counts(i) > 0 AND v_minvalue > counts(i)) THEN
v_minvalue := counts(i);
END IF;
END LOOP;
FOR i IN counts.FIRST..counts.LAST
LOOP
counts(i) := v_minvalue;
END LOOP;
END IF;
v_stmt :=
'CREATE TABLE ' || result_table_name || ' AS ' ||
'SELECT ';
FOR i IN attr_names.FIRST..attr_names.LAST
LOOP
IF (i != attr_names.FIRST) THEN
v_stmt := v_stmt || ',';
END IF;
v_stmt := v_stmt || attr_names(i);
END LOOP;
v_stmt := v_stmt ||
' FROM (SELECT /*+ no_merge */ t.*, ' ||
'ROW_NUMBER() OVER ' ||
'(PARTITION BY ' || strat_attr_name ||
' ORDER BY ORA_HASH(ROWNUM)) RNUM ' ||
'FROM ' || source_table_name || ' t) ' ||
'WHERE RNUM = 1 OR ';
FOR i IN attr_values.FIRST..attr_values.LAST
LOOP
IF (i != attr_values.FIRST) THEN
tmp_str := ' OR ';
END IF;
IF (counts(i) <= 2) THEN
sample_count := counts(i);
ELSE
sample_count := counts_sampled(i);
END IF;
tmp_str :=
tmp_str ||
'(' || strat_attr_name || ' = ''' || attr_values(i) || '''' ||
' AND ORA_HASH(RNUM, (' || counts(i) || ' -1), 12345) ' ||
op || sample_count || ') ';
v_stmt := v_stmt || tmp_str;
END LOOP;
RETURN(v_stmt);
END;
/
Now, cut and paste the above code into a SQL*Plus session (for this example, the Sample Schema OE/OE session), and then invoke the function using the wrapper below. The inputs are pretty self-explanatory - we are asking to sample roughly 20% of the ORDERS table, stratified based on values in the ORDER_STATUS column, and to place the sampled output in the table SAMPLED_ORDERS.

set serveroutput on
BEGIN
DBMS_OUTPUT.PUT_LINE(
generate_stratified_sql(
result_table_name => 'sampled_orders',
source_table_name => 'orders',
strat_attr_name => 'order_status',
percentage => 20,
op => ' < '));
END;
/
This will return the following sampling query (output formatted for easy readability):

CREATE TABLE sampled_orders AS
SELECT ORDER_ID,ORDER_MODE,CUSTOMER_ID,ORDER_STATUS,ORDER_TOTAL,
SALES_REP_ID,PROMOTION_ID
FROM (SELECT /*+ no_merge */ t.*,
ROW_NUMBER() OVER
(PARTITION BY order_status
ORDER BY ORA_HASH(ROWNUM)) RNUM
FROM orders t)
WHERE RNUM = 1 OR
(order_status = '1' AND ORA_HASH(RNUM, (7 -1), 12345) < 1) OR
(order_status = '6' AND ORA_HASH(RNUM, (9 -1), 12345) < 2) OR
(order_status = '2' AND ORA_HASH(RNUM, (7 -1), 12345) < 1) OR
(order_status = '5' AND ORA_HASH(RNUM, (15 -1), 12345) < 3) OR
(order_status = '4' AND ORA_HASH(RNUM, (12 -1), 12345) < 2) OR
(order_status = '8' AND ORA_HASH(RNUM, (17 -1), 12345) < 3) OR
(order_status = '3' AND ORA_HASH(RNUM, (9 -1), 12345) < 2) OR
(order_status = '7' AND ORA_HASH(RNUM, (3 -1), 12345) < 1) OR
(order_status = '0' AND ORA_HASH(RNUM, (11 -1), 12345) < 2) OR
(order_status = '10' AND ORA_HASH(RNUM, (5 -1), 12345) < 1) OR
(order_status = '9' AND ORA_HASH(RNUM, (10 -1), 12345) < 2);
If you execute this SQL in the same OE session, the resulting table is:

SQL> select * from sampled_orders;
ORDER_ID ORDER_MO CUSTOMER_ID ORDER_STATUS ORDER_TOTAL SALES_REP_ID
---------- -------- ----------- ------------ ----------- ------------
2363 online 144 0 10082.3
2369 online 116 0 11097.4
2403 direct 162 0 220 154
2439 direct 105 1 22150.1 159
2408 direct 166 1 309 158
2444 direct 109 1 77727.2 155
2358 direct 105 2 7826 155
2400 direct 159 2 69286.4 161
2375 online 122 2 103834.4
2450 direct 147 3 1636 159
2423 direct 145 3 10367.7 160
ORDER_ID ORDER_MO CUSTOMER_ID ORDER_STATUS ORDER_TOTAL SALES_REP_ID
---------- -------- ----------- ------------ ----------- ------------
2385 online 147 4 295892
2437 direct 103 4 13550 163
2389 online 151 4 17620
2364 online 145 4 9500
2377 online 141 5 38017.8
2425 direct 147 5 1500.8 163
2394 direct 109 5 21863 158
2457 direct 118 5 21586.2 159
2426 direct 148 6 7200
2410 direct 168 6 45175 156
2427 direct 149 7 9055 163
ORDER_ID ORDER_MO CUSTOMER_ID ORDER_STATUS ORDER_TOTAL SALES_REP_ID
---------- -------- ----------- ------------ ----------- ------------
2446 direct 117 8 103679.3 161
2402 direct 161 8 600 154
2434 direct 149 8 268651.8 161
2447 direct 101 8 33893.6 161
2365 online 146 9 27455.3
2372 online 119 9 16447.2
2359 online 106 9 5543.1
2368 online 149 10 60065
30 rows selected.
SQL> select count(*) from orders;
COUNT(*)
----------
105
SQL>

That is, 30 rows (a bit more than 20%) out of 105 are sampled, using order_status as the stratifying column.
Try this example on your own schemas and tables, and send me some feedback. The maximum length of the string returned from GENERATE_STRATIFIED_SQL() is 32767. Based on interest, I can post another function that can handle SQL strings of up to 64K in size.
Given the explosive growth in data, sampling continues to remain an active, relevant research area in statistics, data mining, and computer science.
Analyzing your data for Distributions
In the previous post, we looked at continuous probability distributions. You can determine whether the data in a given column of a given table follows a particular distribution using routines in the DBMS_STAT_FUNCS package. The example below uses a table with four numeric columns - each pre-loaded with data synthesized to fit a particular continuous distribution.
You can substitute the table name and column name for your experiment. A minor annoyance in the design of this API - most other Oracle packages will default to the current session name, but this one explicitly expects the schema name as input - which, in my case, is DMUSER - please change this to your schema name accordingly.
set echo on;
set serveroutput on;
create table disttab (
num1 number, num2 number, num3 number, num4 number, cat1 varchar2(2));
rem num1: Numbers generated based on normal distribution
rem num2: Numbers generated based on exponential distribution
rem num3: Numbers generated based on weibull distribution
rem num4: Numbers generated based on uniform distribution
insert into disttab values (9.604955, 1.05536, 4.126087, 22.950835, 'AA');
insert into disttab values (13.022139, 1.714142, 4.999804, 32.598089, 'AA');
insert into disttab values (11.572116, 3.697564, 2.81854, 24.552021, 'AA');
insert into disttab values (9.817124, 1.530935, 2.131106, 6.359504, 'AA');
insert into disttab values (10.146569, 3.295829, 1.510639, 25.218639, 'AA');
insert into disttab values (11.280488, 0.721109, 3.145515, 23.672146, 'BB');
insert into disttab values (9.26679, 1.390282, 4.074397, 11.262112, 'BB');
insert into disttab values (14.303472, 1.327971, 2.51907, 22.675373, 'BB');
insert into disttab values (11.686556, 0.225337, 2.941825, 23.582254, 'BB');
insert into disttab values (13.124479, 7.265271, 0.945059, 29.18001, 'BB');
insert into disttab values (8.601027, 7.060104, 6.078573, 14.878128, 'BB');
insert into disttab values (12.241662, 0.257739, 3.395142, 31.148244, 'CC');
insert into disttab values (13.781857, 4.281371, 1.349627, 23.862069, 'CC');
insert into disttab values (7.827007, 1.347487, 5.836949, 10.76229, 'CC');
insert into disttab values (9.106408, 1.253113, 5.116857, 6.594224, 'CC');
insert into disttab values (11.066785, 4.56512, 3.393899, 22.435955, 'CC');
insert into disttab values (10.71079, 2.700015, 1.922642, 7.635145, 'DD');
insert into disttab values (9.13019, 5.199126, 3.763481, 32.061213, 'DD');
insert into disttab values (7.873859, 0.978657, 2.268487, 1.030052, 'DD');
insert into disttab values (7.731724, 2.382977, 2.639425, 5.676622, 'DD');
insert into disttab values (12.828234, 1.867099, 3.99808, 26.000458, 'DD');
insert into disttab values (12.125892, 1.01285, 3.345311, 8.026281, 'DD');
insert into disttab values (9.800528, 5.869301, 3.840932, 29.928523, 'EE');
insert into disttab values (10.605782, 3.145211, 2.13718, 27.398604, 'EE');
insert into disttab values (14.054569, 4.089033, 2.436408, 4.483585, 'EE');
insert into disttab values (8.120606, 2.155303, 1.787835, 19.513588, 'EE');
insert into disttab values (13.093059, 0.220456, 3.456848, 24.855135, 'EE');
insert into disttab values (8.421441, 2.4819, 2.817669, 21.137668, 'FF');
insert into disttab values (11.899697, 2.507618, 3.770983, 4.016285, 'FF');
insert into disttab values (9.601342, 1.12639, 3.21053, 28.643809, 'FF');
insert into disttab values (9.32297, 10.003288,6.890515, 33.67171, 'FF');
insert into disttab values (6.896019, 10.76641, 3.123496, 29.077463, 'FF');
insert into disttab values (12.542443, 0.228756, 4.081015, 33.542652, 'FF');
insert into disttab values (14.038144, 7.326175, 3.53459, 11.731359, 'FF');
Let us first start with the Normal Distribution. The mean (same as E(x)) and the standard deviation are computed for the column num1 and provided as input.

declare
mean number;
stdev number;
sig number;
begin
select avg(num1) into mean from disttab;
select stddev(num1) into stdev from disttab;
dbms_output.put_line('NORMAL DISTRIBUTION - SHAPIRO_WILKS');
dbms_stat_funcs.normal_dist_fit(
'dmuser', 'disttab', 'num1', 'SHAPIRO_WILKS', mean, stdev, sig);
dbms_output.put_line('Mean : ' || round(mean, 4));
dbms_output.put_line('Stddev : ' || round(stdev, 4));
dbms_output.put_line('Sig : ' || to_char(sig,'9.9999'));
end;
/
The output of this procedure looks like this:

NORMAL DISTRIBUTION - SHAPIRO_WILKS
W value : .9601472834631918774434537597940068448938
Mean : 10.7426
Stddev : 2.1094
Sig : .2456
The high value for the W returned from the Shapiro-Wilks test (explained here) indicates an agreement with the null hypothesis that the data follows normal distribution. The significance (0.25) is also relatively high for the Shapiro-Wilks test (0.05 being the typical threshold).
Next, Exponential distribution. For the above dataset, lambda - i.e. the rate of arrival - is provided (Note that lambda is NOT simply 1/mean(num3), it is 1/E(x) and we don't know how the expected value was computed)
declare
lambda number;
mu number;
sig number;
begin
lambda := 0.3194;
-- select 1/mean into lambda from (select mean(num2) from disttab);
mu := NULL;
sig := NULL;
dbms_output.put_line('EXPONENTIAL DISTRIBUTION - KOLMOGOROV_SMIRNOV');
dbms_stat_funcs.exponential_dist_fit(
'dmuser', 'disttab', 'num3', 'KOLMOGOROV_SMIRNOV', lambda, mu, sig);
dbms_output.put_line('Lambda : ' || lambda);
dbms_output.put_line('Mu : ' || mu);
dbms_output.put_line('Sig : ' || to_char(sig,'9.9999'));
end;
/
The output shows:

EXPONENTIAL DISTRIBUTION - KOLMOGOROV_SMIRNOV
D value : .0919745379005003387429254774811052604723
Lambda : .3194
Mu : 0
Sig : .9237466160
The low D value (0.09) and high significance (0.92) suggest that the data fits an exponential distribution well.
Next, Weibull Distribution. The alpha input parameter to the procedure corresponds to η - the scale, and the mu to γ - the location - in the Weibull probability density function discussed earlier.
declare
alpha number;
beta number;
mu number;
sig number;
begin
alpha := 3;
mu := 0;
beta := 4;
sig := NULL;
dbms_output.put_line('.');
dbms_output.put_line('WEIBULL DISTRIBUTION - KOLMOGOROV_SMIRNOV');
dbms_stat_funcs.weibull_dist_fit(
'dmuser', 'disttab', 'num3', 'KOLMOGOROV_SMIRNOV', alpha, mu, beta, sig);
dbms_output.put_line('Alpha : ' || alpha);
dbms_output.put_line('Mu : ' || mu);
dbms_output.put_line('Beta : ' || beta);
dbms_output.put_line('Sig : ' || to_char(sig,'9.9999'));
end;
/
The output shows:

WEIBULL DISTRIBUTION - KOLMOGOROV_SMIRNOV
D value : .2575286246007637604103723582952313687414
Alpha : 3
Mu : 0
Beta : 4
Sig : .0177026134
The Kolmogorov-Smirnov test does not appear emphatic enough here; we will try other tests later.
Next, Uniform Distribution.
declare
A number;
B number;
sig number;
begin
A := 1;
B := 34;
sig := NULL;
dbms_output.put_line('.');
dbms_output.put_line('UNIFORM DISTRIBUTION - KOLMOGOROV_SMIRNOV');
dbms_stat_funcs.uniform_dist_fit(
'dmuser', 'disttab', 'num4', 'CONTINUOUS', 'KOLMOGOROV_SMIRNOV', A, B, sig);
dbms_output.put_line('A : ' || A);
dbms_output.put_line('B : ' || B);
dbms_output.put_line('Sig : ' || to_char(sig,'9.9999'));
end;
/
The output shows:

UNIFORM DISTRIBUTION - KOLMOGOROV_SMIRNOV
D value : .2083979233511586452762923351158645276292
A : 1
B : 34
Sig : .0904912415
Again, the Kolmogorov-Smirnov test does not appear emphatic enough; we will try other tests later.
Finally, let us try fitting a small dataset to Poisson Distribution.
create table pdisttab (num1 number);
insert into pdisttab values(1);
insert into pdisttab values(2);
insert into pdisttab values(3);
insert into pdisttab values(4);
insert into pdisttab values(5);
declare
mean number;
stdev number;
sig number;
begin
mean := 3.0;
stdev := 1.58114;
dbms_output.put_line('.');
dbms_output.put_line('POISSON DISTRIBUTION - KOLMOGOROV_SMIRNOV');
dbms_stat_funcs.poisson_dist_fit(
'dmuser', 'pdisttab', 'num1', 'KOLMOGOROV_SMIRNOV', mean, sig);
dbms_output.put_line('Mean : ' || mean);
dbms_output.put_line('Stddev : ' || stdev);
dbms_output.put_line('Sig : ' || to_char(sig,'9.9999'));
end;
/
The output of this block:

POISSON DISTRIBUTION - KOLMOGOROV_SMIRNOV
D value : .08391793845975204
Mean : 3
Stddev : 1.58114
Sig : .9999999999
The very high significance is emphatically in agreement with the null hypothesis that the data fits a Poisson distribution.
That was fun, wasn't it? Now, we have introduced several new terms and concepts here (especially if you reviewed the DBMS_STAT_FUNCS docs) - viz. goodness-of-fit testing of whether a given data sample fits a particular distribution based on a null hypothesis, the various test types, the test metrics and the significance output of a test, and various other parameters. In the upcoming posts, we will try to understand these new concepts. I also presume that the various references in the previous post and this one may have been useful to experienced/intermediate statisticians as well.
Continuous Probability Distributions
A continuous probability distribution is an (infinitely large) table that lists the continuous variables (outcomes) of an experiment with the relative frequency (a.k.a. probability) of each outcome. Consider a histogram that plots the probability (y axis) that a particular job will get done within a time interval (x axis). As you keep making the intervals shorter and more fine-grained, the step-like top of the histogram eventually melds into a curve - called the continuous probability distribution. The total area under this probability curve is 1, the probability that the value of x is between two values a and b is the area under f(x) between a and b, and f(x) >= 0 for all x. For continuous distributions, the probability for any single point in the distribution is 0; you can compute a non-zero probability only for an interval between two values of the continuous variable x.
Oracle SQL provides statistical functions to determine if the values in a given column fit a particular distribution. Before we proceed to the examples, let us look at some of the popularly known distributions.
Normal (Gaussian) Probability Distribution
The most common continuous probability distribution - to the point of being synonymous with the concept - is the Normal Probability Distribution, represented graphically by a bell curve, plotted with the continuous value on the x axis and the probability along the y axis, symmetric about the mean x value, with the two ends tapering off to infinity. The curve has these properties:
- The mean, median and mode are the same
- the distribution is bell-shaped and symmetrical about the mean
- the area under this curve is always = 1
Its probability density function is f(x) = (1/(σ√(2π))) * e^(-0.5 * ((x - μ)/σ)²)
The Standard Normal Distribution is a special case of the normal distribution with μ = 0 and σ = 1, as shown below (graph not to scale). The standard z-score is a derivative of the standard normal distribution. It is given by z = (x - μ)/σ. The value of z is then cross-checked against a standard normal table to arrive at the probability of a required interval. Unlike discrete random variables in a discrete probability distribution, continuous variables can have infinite values for a given event - so the probability can be computed only for an interval or range of values. Continuing with the rainfall example (annual rainfall at Boston, normally distributed with mean 38 inches and standard deviation 3 inches), if we are asked for the probability that the annual rainfall next year at Boston will be <= 40.5 inches, we compute z = (40.5 - 38)/3 = 0.8333. Stated another way, this means that a rainfall of 40.5 inches is 0.8333 standard deviations away from the mean. From the standard normal table, the probability is 0.7967 - that is, roughly 80%.
Microsoft Excel's NORMDIST() function provides this functionality, but I was surprised to find no function in Oracle SQL with equivalent simplicity - I'll file a feature bug after some more research. The Oracle OLAP Option provides a NORMAL() function as part of its OLAP DML interface. This Calc-like interface is different from SQL - so we will defer this for later.
Application to Data Mining: A key use of the z-score is as a "normalizing" data transformation for mining applications. Note that this concept is completely unrelated to database normalization. The stolen car example in a previous post was a simple example of prediction - we used a few categorical attributes, like a car's color and type, to predict whether a car will be stolen or not.
In the business world, the applications are more grown-up and mission-critical. One example is churn prediction - i.e. finding out whether a (say, wireless) customer will stay loyal to the current provider, or move on ("churn") to a competitor (in which case, the current provider could try to entice him/her to stay with appropriate promos). The customer data used for such churn prediction applications contains categorical (e.g. gender, education, occupation) and numerical (e.g. age, salary, fico score, distance of residence from a metro) attributes/columns in a table. The data in these numeric columns will be widely dispersed, across different scales. For example, values within salary can range from tens of thousands to several millions, and two numerical attributes will be on different scales - say salary (30K - 2 mil) vs age (1-100). Such disparity in scales, if left untreated, can throw most mining algorithms out of whack - the attributes with the higher range of values will start outweighing those in the lower range during the computation of the prediction. For such algorithms, the numerical data is normalized to a smaller range [-1, 1] or [0, 1] using the z-transform, to enable uniform handling of numerical data by the algorithm.
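As a sketch of the z-transform in plain SQL - the table and column names here (customers, salary) are placeholders, not from any schema used in this blog:

-- z-score normalization of a numeric column using analytic functions
select customer_id,
       salary,
       (salary - avg(salary) over ()) / stddev(salary) over () salary_z
from customers;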
Min-max and decimal scaling are other data normalization techniques. Here is one primer on mining data transformations. We will discuss mining transformations using DBMS_DATA_MINING_TRANSFORM package and Oracle SQL in a separate post.
Uniform Distribution
Ever waited outside an airport terminal under a sign that says "Rental-Cars/Long-term Parking - Pickup Every 10 minutes"? Your wait time is an example of a uniform distribution - assuming a well-run airport, you arrive at the stop and expect to wait anywhere between 5 and 15 minutes for your shuttle. This simplest of continuous distributions has the probability function
f(x) = 1/(b - a) for a <= x <= b, and f(x) = 0 for all other values of x
and is graphically represented as shown. The probability that a uniformly distributed random variable X will have values in the range x1 to x2 is:
P(x1 <= X <= x2) = (x2 - x1)/(b - a), a <= x1 < x2 <= b.
The mean E(X) = (a+b)/2 and variance V(X) = (b - a)²/12.
To use the shuttle bus example, probability that the wait time will be 8 to 11 minutes is
P(8 <= X <= 11) = (11 - 8)/(15 - 5) = 3/10 = 0.3
Exponential Distribution
Consider that an event occurs with an average frequency (a.k.a. rate) of λ, and this average frequency is constant. Consider that, from a given point in time, you wait for the event to occur. This waiting time follows an exponential distribution - depicted in the adjoining figure. The probability density function is given by f(x) = λ e^(-λx), where λ is the frequency with which the event occurs - expressed as a particular number of times per time unit. The mean, or more appropriately the expected value, of the distribution is E(X) = μ = 1/λ, and the variance is σ² = (1/λ)².
Take the same shuttle bus example. If the bus does not stick to any schedule and randomly goes about its business of picking up passengers, then the wait time is exponentially distributed. This sounds a bit weird, but there are several temporal phenomena in nature that exhibit such behavior:
- The time between failures of an ordinary light bulb (which typically just blows out suddenly), or of some electronic component, follows an exponential distribution. The mean time between failures, μ, is an important metric in the parts failure/warranty claims domain
- The time between arrivals, i.e. inter-arrival time of customers at any check-in counter is exponentially distributed.
The probability functions are best stated in terms of survival (or failure). The probability that an item will survive until x units of time, given an MTBF of μ units, can be stated as:
P(X >= x) = e^(-λx), x >= 0
and conversely, the probability of failing before x units of time is given by
P(X <= x) = 1 - e^(-λx), x >= 0
where λ = 1/μ.
For example - if Dell claims your laptop fails following an exponential distribution with MTBF 60 months, and the warranty period is 90 days (3 months), what percentage of laptops does Dell expect to fail within the warranty period?
P(X <= 3 months) = 1 - e^(-(1/60)*3) = 0.048, i.e. roughly 5% of laptops.
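The same figure falls out of a SQL one-liner:

select 1 - exp(-(1/60) * 3) warranty_failure_prob from dual;
-- 0.0487705754..., i.e. roughly 5%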
Weibull Distribution
Weibull distribution is a versatile distribution that can emulate other distributions based on its parameter settings, and is a widely used, important tool for reliability engineering/survival analysis. An extensive coverage on Reliability analysis is provided here. I am mentioning this distribution here mainly to set up the next post. There are several other popular probability distributions - we will revisit them in the future on a needs-basis.
For now, let us break off and look at some Oracle code in the next post.
Discrete Probability Distributions
A Probability Distribution is a table/graph that depicts the assignment of probabilities to the assumption of specific values by a given random variable.
The following concepts are useful to understand probability distributions:
- If Event A can occur in p possible ways and Event B can occur in q possible ways, then both A and B can occur in p x q ways.
- The number of different ways that r objects can be selected from a set of n objects, without regard to order, is called a Combination. The number of combinations of n objects taken r at a time is given by nCr = n! / ((n - r)! r!)
- The number of different ways that a set of objects can be arranged in order is called Permutation. The number of permutations of n objects taken r at a time is given by
nPr = n! / (n - r)!
CREATE OR REPLACE FUNCTION factorial(p_n IN NUMBER) RETURN NUMBER IS
BEGIN
IF p_n IS NULL OR p_n < 0 THEN
RAISE_APPLICATION_ERROR(-20000, 'Invalid Input Value');
ELSIF p_n <= 1 THEN
RETURN 1;
ELSE
RETURN factorial(p_n-1) * p_n;
END IF;
END;
/
EX> Compute 9!

select factorial(9) from dual;
I was curious to see how far I could push this function - the maximum value of n was 83.7 with NUMBER types, and 84.7 when I changed the input parameter and return type to BINARY_DOUBLE (the factorial2 variant below):

SQL> select factorial(83.7) from dual;
FACTORIAL(83.7)
---------------
9.642E+125
SQL> select factorial(83.71) from dual;
FACTORIAL(83.71)
---------------
~
SQL> select factorial2(84.7) from dual;
FACTORIAL2(84.7)
----------------
8.167E+127
SQL> select factorial2(84.71) from dual;
FACTORIAL2(84.71)
-----------------
Inf
EX> Compute the number of combinations of 9 objects taken 3 at a time.

select factorial(9)/(factorial(9-3) * factorial(3)) from dual;
EX> Compute the number of different ways of arranging 9 objects taken 3 at a time.

select factorial(9)/factorial(9-3) from dual;

Discrete Probability Distributions
- The discrete probability distribution is a table that lists the discrete variables (outcomes) of an experiment with the relative frequency (a k a probability) of each outcome.
Example: Tossing a coin two times gives you the combinations (H,H), (H,T), (T,H), (T,T) and hence, the following tuples for (#Heads, Frequency, Relative_Frequency):
(0, 1, 1/4=0.25), (1, 2, 2/4=0.5), (2, 1, 1/4=0.25).
This is the probability distribution for # heads after flipping a coin twice.
- Mean or Expected value of the discrete probability distribution: μ = ∑(i=1 to n) x_i * P(x_i). For the coin example, μ = 0 * 0.25 + 1 * 0.5 + 2 * 0.25 = 1
- Variance of the discrete probability distribution: σ² = ∑(i=1 to n) (x_i - μ)² * P(x_i)
- Standard deviation is the square root of the variance
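Continuing the coin example, the variance and standard deviation work out to:
σ² = (0 - 1)² * 0.25 + (1 - 1)² * 0.5 + (2 - 1)² * 0.25 = 0.5, and σ = √0.5 ≈ 0.71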
A binomial or Bernoulli experiment is one which consists of a fixed number of trials, each independent of the others, each with only two possible outcomes, and with a fixed probability of success (and of failure) on each trial. The Bernoulli process counts the number of successes over a given number of attempts - in other words, the random variable for a Binomial distribution is the number of successes over a given number of attempts.
- The probability of r successes in n trials, with probability of success p and probability of failure q, is given by P(r, n) = (n! / ((n - r)! r!)) p^r q^(n - r)
- The binomial probability distribution is a table of (r, P(r, n)) which can be subsequently graphed, as discussed in this example
EX> Assuming a 40% chance of rain on any given day (p = 0.4, q = 0.6), the probability that it will rain exactly 2 days over the next 7 days is P(2, 7), computed as:

select factorial(7) * power(0.4,2) * power(0.6,(7-2))/
(factorial(7-2) * factorial(2)) p_2_7
from dual;
The probability that it will rain at least 6 days over the next 7 days is P(r >= 6) = P(6,7) + P(7,7), computed using:

select (factorial(7) * power(0.4,6) * power(0.6,(7-6))/
(factorial(7-6) * factorial(6))) +
(factorial(7) * power(0.4,7) * power(0.6,(7-7))/
(factorial(7-7) * factorial(7))) p_r_ge_6
from dual;

Finally, the probability that it will rain no more than 2 days over the next 7 days is P(r <= 2) = P(0,7) + P(1,7) + P(2,7).
- The mean of a binomial distribution is μ = np
- The variance is σ² = npq, so the standard deviation is σ = √(npq)
Computing such cumulative probabilities term by term gets tedious - hence a helper function along the lines of Excel's BINOMDIST():

CREATE OR REPLACE FUNCTION binomdist(r NUMBER, n NUMBER, p NUMBER, cumulative BOOLEAN DEFAULT FALSE) RETURN NUMBER IS
ri NUMBER;
ret NUMBER;
fn NUMBER;
BEGIN
ret := 0;
fn := factorial(n);
FOR ri IN REVERSE 0..r LOOP
ret := ret + (fn * power(p, ri) * power((1-p),(n - ri)))/
(factorial(n - ri) * factorial(ri));
IF NOT cumulative THEN
EXIT;
END IF;
END LOOP;
RETURN ret;
END binomdist;
/
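A quick, hypothetical check of the function against the rain example above, assuming it has been compiled in the current schema (the BOOLEAN parameter means it must be invoked from PL/SQL rather than straight SQL):

set serveroutput on
begin
  dbms_output.put_line('P(r = 2)  : ' || binomdist(2, 7, 0.4));        -- exactly 2 rainy days
  dbms_output.put_line('P(r <= 2) : ' || binomdist(2, 7, 0.4, TRUE));  -- no more than 2
end;
/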
Poisson Probability Distribution
The random variable for a Poisson distribution is the number of occurrences of the event over a measurable metric (time, space). In a Poisson process, the (measured) mean number of occurrences of an event is the same for each interval of measurement, and the number of occurrences in a particular interval is independent of the number of occurrences in other intervals.
- The probability of exactly r occurrences over a given interval is given by P(r) = μ^r * e^(-μ) / r!
- The variance of the Poisson distribution is the same as the (observed) mean.
- A goodness of fit test helps verify if a given dataset fits the Poisson distribution
EX> On average, 25 customers walk into a coffee shop over a given interval. The probability that exactly 31 customers will walk in over that interval is:

select power(25,31) * exp(-25)/factorial(31) p_31
from dual;

Just as we saw in the Binomial distribution, the probability that no more than 31 customers will walk into the coffee shop is P(r <= 31) = P(0) + P(1) + ... + P(31). Inversely, the probability that at least 31 customers will walk in is P(r >= 31) = 1 - P(r < 31). Obviously, this leads up to the need for a function similar to POISSON(r, μ, cumulative) in Excel - where cumulative = FALSE indicates computation of exactly r occurrences, and cumulative = TRUE indicates r or fewer:
CREATE OR REPLACE FUNCTION poissondist(r NUMBER, mu NUMBER,
cumulative BOOLEAN DEFAULT FALSE) RETURN NUMBER IS
ri NUMBER;
ret NUMBER;
BEGIN
ret := 0;
FOR ri IN REVERSE 0..r LOOP
ret := ret + (power(mu, ri) * exp(-mu)/factorial(ri));
IF NOT cumulative THEN
EXIT;
END IF;
END LOOP;
RETURN ret;
END poissondist;
/
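And a similar hypothetical check for the coffee shop example (μ = 25), again assuming the function compiles in the current schema:

set serveroutput on
begin
  dbms_output.put_line('P(r = 31)  : ' || poissondist(31, 25));        -- exactly 31 customers
  dbms_output.put_line('P(r <= 31) : ' || poissondist(31, 25, TRUE));  -- 31 or fewer
end;
/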
Poisson approximation - a Poisson distribution can be used to approximate a Binomial distribution if the number of trials (in the binomial experiment) is >= 20 and the probability of success p is <= 5%.
Random Variables, Probability, Bayes Theorem, Naive Bayes Models
In contrast to descriptive statistics, Inferential statistics describes the population based on information gleaned from a sample taken from the population. Fundamental to understanding statistical inference is the concept of probability.
An experiment is the process of measuring/observing an activity. An outcome is a particular result of the experiment - outcomes are also called Random Variables. Random variables can be discrete - when they can assume a countable number of values (e.g. one of six outcomes from rolling a die) - or continuous - when the variable can assume uncountably infinite values in a given range (time, a person's height). The Sample space is all possible outcomes of the experiment. An event is an outcome of interest.
The Probability of Event A occurring is: P(A) = # of possible outcomes in which Event A occurs/ Total # outcomes in the sample space.
Basic properties of probability:
- P(A) = 1 implies Event A will occur with certainty
- P(A) = 0 implies Event A will not occur with certainty
- 0 <= P(A) <= 1
- The sum of all probabilities for events in the sample space must be 1
- All outcomes in the sample space that are not part of Event A is called the complement of Event A (named A'). P(A') = 1 - P(A)
- Given two events A and B, P(A) or P(B) - i.e. the probability of each event occurring, without knowledge of the other event's occurrence - is called the prior probability.
- Given two events A and B, the probability of event A occurring given that event B has occurred - denoted by P(A / B) - is called the conditional probability or posterior probability of Event A given that Event B has occurred. On the flip side, if P(A / B) = P(A), then events A and B are termed independent.
- Given two events A and B, the probability of both A and B occurring at the same time is called the joint probability for A and B, computed as P(A and B) = P(A) * P(B / A) - which reduces to P(A) * P(B) when A and B are independent
- Given two events A and B, the probability of either A or B occurring is called the union of events A and B. If events A and B do not occur at the same time (i.e. are mutually exclusive), then P(A or B) = P(A) + P(B). If events A and B occur at the same time, i.e. are not mutually exclusive, then P(A or B) = P(A) + P(B) - P(A and B)
- Law of Total Probability: P(A) = P(A / B)P(B) + P(A / B')P(B')
- The Bayes Theorem for probabilities provides the ability to reverse the conditionality of events and compute the outcome:
P(A / B) = P(A) * P(B / A) / (P(A) * P(B / A) + P(A') * P(B / A'))
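A quick worked illustration with made-up numbers: suppose 1% of cars in a city get stolen (P(A) = 0.01), a particular alarm goes off for 95% of cars that are being stolen (P(B / A) = 0.95), and also for 10% of cars that are not (P(B / A') = 0.10). Then the probability that a car is actually being stolen, given that its alarm is going off, is P(A / B) = (0.01 * 0.95) / (0.01 * 0.95 + 0.99 * 0.10) ≈ 0.088 - under 9%, despite the seemingly accurate alarm.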
A modern, predictive, loan processing application builds analytical models based on millions of historical loan applicant records (training data), and uses these models to predict the credit-worthiness (a k a risk of loan default) of an applicant by classifying the applicant into Low, Medium, High or such risk categories. In data mining lingo, a new applicant record is now scored based on the model. At the time of this writing (Aug-Sep 2007), the sub-prime lending woes and its effect on US and world markets is the main story. The trillions lost in this mess is fodder for Quant skeptics/detractors, but as a BusinessWeek cover story ("Not So Smart" - Sep 3 2007) explains, the problem was not analytics per se - the problems were with how various managements (mis)used analytics or (mis)understood their data.
Returning to probability concepts, instead of A and B, the events become A and Bi, i = 1..n. The event A (or more appropriately for this example, the target variable Risk) is a dependent variable that assumes one of a set of discrete values (called classes - low, medium, high) based on predictor variables B1 through Bn (age, salary, gender, occupation, and so on). The probability model for this classifier is P(A / B1,..,Bn). We just shifted the language from statistics into the realm of data mining/predictive analytics. The Bayes theorem intrinsically assumes conditional dependence between B1 through Bn. Now if n is large, or if each Bi takes on a large number of values, computing this model becomes intractable.
The Naive Bayes probabilistic model greatly simplifies this by making a naive/strong assumption that Bi through Bn are conditionally independent - the details are provided here. You can build a Naive Bayes model using the Oracle Data Mining Option, and predict the value for a target variable in new records using SQL Prediction Functions. The following example illustrates the process.
EX> Given a small, synthetic dataset about the attributes of stolen cars, predict if a particular car will be stolen - based on its attributes.
create table stolen_cars(
id varchar2(2),
color varchar2(10),
ctype varchar2(10),
corigin varchar2(10),
stolen varchar2(3));
insert into stolen_cars values ('1', 'Red','Sports','Domestic','yes');
insert into stolen_cars values ('2', 'Red','Sports','Domestic','no');
insert into stolen_cars values ('3', 'Red','Sports','Domestic','yes');
insert into stolen_cars values ('4', 'Yellow','Sports','Domestic','no');
insert into stolen_cars values ('5', 'Yellow','Sports','Imported','yes');
insert into stolen_cars values ('6', 'Yellow','SUV','Imported','no');
insert into stolen_cars values ('7', 'Yellow','SUV','Imported','yes');
insert into stolen_cars values ('8', 'Yellow','SUV','Domestic','no');
insert into stolen_cars values ('9', 'Red','SUV','Imported','no');
insert into stolen_cars values ('10', 'Red','Sports','Imported','yes');
commit;
begin
dbms_data_mining.create_model(
model_name => 'cars',
mining_function => dbms_data_mining.classification,
data_table_name => 'stolen_cars',
case_id_column_name => 'id',
target_column_name => 'stolen');
end;
/
create table new_stolen_cars (
id varchar2(2),
color varchar2(10),
ctype varchar2(10),
corigin varchar2(10));
insert into new_stolen_cars values ('1', 'Red','SUV','Domestic');
insert into new_stolen_cars values ('2', 'Yellow','SUV','Domestic');
insert into new_stolen_cars values ('3', 'Yellow','SUV','Imported');
insert into new_stolen_cars values ('4', 'Yellow','Sports','Domestic');
insert into new_stolen_cars values ('5', 'Red','Sports','Domestic');
commit;
select prediction(cars using *) pred,
prediction_probability(cars using *) prob
from new_stolen_cars;
-- Results
PRE PROB
--- ----------
no .75
no .870967746
no .75
no .529411793
yes .666666687
select *
  from table(dbms_data_mining.get_model_details_nb('cars'));
The SQL output is a collection of objects and may not look pretty at first glance. But once you understand the schema of the output type - viz. DM_NB_DETAILS - you can decipher it into the following simple format. It shows the target variable (STOLEN), its value ('no' or 'yes'), the prior probability (0.5 for each), and the conditional probability contributed by each predictor/value pair towards each target class:
STOLEN
no
.5
DM_CONDITIONALS(
DM_CONDITIONAL('COLOR', NULL, 'Red', NULL, .4),
DM_CONDITIONAL('COLOR', NULL, 'Yellow', NULL, .6),
DM_CONDITIONAL('CTYPE', NULL, 'SUV', NULL, .6),
DM_CONDITIONAL('CORIGIN', NULL, 'Domestic', NULL, .6),
DM_CONDITIONAL('CORIGIN', NULL, 'Imported', NULL, .4),
DM_CONDITIONAL('CTYPE', NULL, 'Sports', NULL, .4))
STOLEN
yes
.5
DM_CONDITIONALS(
DM_CONDITIONAL('COLOR', NULL, 'Red', NULL, .6),
DM_CONDITIONAL('CORIGIN', NULL, 'Imported', NULL, .6),
DM_CONDITIONAL('CORIGIN', NULL, 'Domestic', NULL, .4),
DM_CONDITIONAL('CTYPE', NULL, 'Sports', NULL, .8),
DM_CONDITIONAL('CTYPE', NULL, 'SUV', NULL, .2),
DM_CONDITIONAL('COLOR', NULL, 'Yellow', NULL, .4))
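These conditionals are just count ratios over the training data, so you can sanity-check any of them with a plain aggregate. A minimal sketch (not part of the original example) for P(COLOR = 'Red' | STOLEN = 'yes'):
-- share of 'Red' cars among the stolen ('yes') training rows
select count(case when color = 'Red' then 1 end) / count(*) p_red_given_yes
  from stolen_cars
 where stolen = 'yes';
With the ten training rows above this returns .6, matching DM_CONDITIONAL('COLOR', NULL, 'Red', NULL, .6) under STOLEN = 'yes'.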
The ability to score transactional customer data directly in the database (in other words, to deploy the model right at the source of the customer data) with this simplicity is a key Oracle differentiator and a competitive advantage over standalone data mining tools. For more on ODM, consult the references provided in this blog.
Descriptive Statistics
Descriptive statistics summarizes and displays information about just the observations at hand - i.e. the sample and the population are the same. Using data in the Oracle sample schemas (HR, SH), we will quickly run through descriptive statistics concepts.
Displaying descriptive statistics
Frequency distribution is a table that organizes a number of values into intervals (classes). A histogram is the visual equivalent of frequency distribution - a bar graph that represents the number of observations in each class as the height of each bar.
EX> Compute the frequency distribution of all employee salaries - across 10 classes.
select intvl, count(*) freq
from (select width_bucket(salary,
(select min(salary) from employees),
(select max(salary)+1 from employees), 10) intvl
from HR.employees)
group by intvl
order by intvl;
-- Result from this query
INTVL FREQ
----- --------
1 46
2 10
3 23
4 15
5 8
6 2
7 2
10 1
8 rows selected.
Relative frequency distribution provides the number of observations in each class as a percentage of the total number of observations.
EX> Compute the relative frequency distribution of all employee salaries - across 10 classes.
with Q_freq_dist as
(select intvl, count(*) freq
from (select width_bucket(salary,
(select min(salary) from employees),
(select max(salary)+1 from employees), 10) intvl
from HR.employees)
group by intvl
order by intvl)
--
select intvl, freq/(select count(*) from employees) rel_freq
from Q_freq_dist;
-- Result from this query
INTVL REL_FREQ
----- ----------
1 .429906542
2 .093457944
3 .214953271
4 .140186916
5 .074766355
6 .018691589
7 .018691589
10 .009345794
8 rows selected.
Cumulative frequency distribution provides the percentage of observations that are less than or equal to the class of interest.
EX> Compute the cumulative frequency distribution of all employee salaries - across 10 classes.
with Q_freq_dist as
(select intvl, count(*) freq
from (select width_bucket(salary,
(select min(salary) from employees),
(select max(salary)+1 from employees), 10) intvl
from HR.employees)
group by intvl
order by intvl),
--
Q_rel_freq_dist as
(select intvl, freq/(select count(*) from employees) rel_freq
from Q_freq_dist)
--
select intvl,
sum(rel_freq) over (order by intvl
rows between unbounded preceding and current row) cum_freq
from Q_rel_freq_dist;
-- Result from this query
INTVL CUM_FREQ
----- ----------
1 .429906542
2 .523364486
3 .738317757
4 .878504673
5 .953271028
6 .971962617
7 .990654206
10 1
8 rows selected.
The width_bucket function creates equi-width histograms. A complementary function, NTILE, helps you bin values into intervals of equal height - i.e. with the same count of values in each bin. This is useful for computing quartiles, quintiles, etc.
EX> The following query bins the 107 records/observations into approximately 10 equi-height bins.
select htbin, count(*) ht
from (select ntile(10) over (order by salary) htbin
from hr.employees)
group by htbin
order by htbin;
HTBIN HT
----- ------
1 11
2 11
3 11
4 11
5 11
6 11
7 11
8 10
9 10
10 10
10 rows selected.
Anyone who has taken a competitive exam (SAT through GRE/GMAT) should be familiar with the term percentile. Given n data points, the Pth percentile represents the value which resides above P% of the values, and its position in the ordered data is given by (n + 1) * P/100.
EX> Assume a student took the GRE test in 2004 and received an overall GRE score of 2150 (out of the possible 2400), ranked at the 89th percentile. What does this mean? Based on the information provided here, it means that of the worldwide total of 408,948 examinees, the student ranked above (408948 + 1) * 89/100 ~= 363,965 of them, and conversely about 408,948 - 363,965 = 44,983 examinees ranked above this student. You can compute percentiles in Oracle SQL using CUME_DIST() as follows.
EX> Rank all sales reps and their managers by their salary percentile.
select job_id, last_name, salary,
round(cume_dist() over
(partition by job_id order by salary) * 100) as pctile
from employees
where job_id like '%SA_%'
order by job_id, pctile desc, salary;
JOB_ID LAST_NAME SALARY PCTILE
------ -------------- ------ ------
SA_MAN Russell 14000 100
SA_MAN Partners 13500 80
SA_MAN Errazuriz 12000 60
SA_MAN Cambrault 11000 40
SA_MAN Zlotkey 10500 20
SA_REP Ozer 11500 100
SA_REP Abel 11000 97
SA_REP Vishney 10500 93
SA_REP Tucker 10000 90
SA_REP Bloom 10000 90
SA_REP King 10000 90
...
The partition by job_id clause enables ranking within each job_id. You can infer things like: Ms. Bloom has 90% of her fellow sales reps earning as much as or less than her (the order by salary clause), Mr. Zlotkey is in the bottom quintile for managers, and so on.
EX> Suppose a new job candidate names his salary requirement - say 7.5K. You can use the same function, in what Oracle calls its aggregate usage, to find out what percentile this salary would fall in, as follows:
select round(cume_dist(7500) within group (order by salary) * 100) offer_pctile
  from employees;
OFFER_PCTILE
------------
64
If the new candidate joins the sales force (i.e. when his record is entered into this table), his salary will be in the 64th percentile. The PERCENT_RANK() function is similar to the CUME_DIST() function - see the SQL Reference Manual in the Oracle docs for details.
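As a hedged illustration of that similarity (not part of the original post), PERCENT_RANK also has an aggregate, hypothetical-rank form; it counts only the rows strictly below the hypothetical value, so it typically returns a slightly smaller number than CUME_DIST for the same offer:
select round(percent_rank(7500) within group (order by salary) * 100) offer_rank
  from employees;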
Measures of Central Tendency - Mean, Median, Mode
Mean or Average = sum(all observations)/ count_of_observations. Note the difference between sample mean and population mean for use later.
EX> Compute average salary of employees in the company.
EX> Compute average quantity sold of products from all sale transactions.
select min(salary) min_s, max(salary) max_s,
round(avg(salary),2) avg_s
from HR.employees;
select prod_id, count(*), sum(quantity_sold) sum_qty,
avg(quantity_sold) avg_qty
from SH.sales
group by prod_id
order by prod_id;
We can also compute the mean of grouped data from a frequency distribution using x_bar = (Σ_1_to_m (freq_i * x_i)) / (Σ_1_to_m freq_i), where m is the number of classes and x_i is the value representing class i.
EX> Compute the mean of the frequency distribution discussed above.
with Q_freq_dist as (
select intvl, count(*) freq
from (select width_bucket(salary,
(select min(salary) from employees),
(select max(salary)+1 from employees), 10) intvl
from HR.employees)
group by intvl
order by intvl)
--
select sum(intvl * freq)/sum(freq)
from Q_freq_dist;
Median is the value in a dataset in which half the observations have a higher value and the other half a lower value.
EX> Compute the median salary per department.
select department_id, median(salary)
from HR.employees
group by department_id;
Mode is the observation in the dataset that occurs most frequently.
EX> Compute the mode of salary per department.
select department_id, stats_mode(salary)
from HR.employees
group by department_id;
Measures of Dispersion
These metrics measure how a set of data values is dispersed around (a.k.a. deviates from) the measure of central tendency (typically the mean). Variance represents the relative distance between the data points and the mean of a dataset, and is computed from the sum of squared deviations of each data point from the mean; for a sample, s² = (Σ_1_to_n (x_i - x_bar)²)/(n - 1).
EX> Compute the variance, sample variance, and population variance of employee salaries.
select variance(salary), var_samp(salary), var_pop(salary)
from HR.employees;
Standard deviation is the square root of variance, and is generally more useful than variance because it is in the units of the original data - rather than their square.
EX> Compute the standard deviation, sample standard deviation, and population standard deviation of employee salaries.
select stddev(salary), stddev_samp(salary), stddev_pop(salary)
from HR.employees;
For many commonly occurring datasets, the values tend to cluster around the mean or median. Chebyshev's Theorem states that, for any dataset, at least (1 - 1/k²) x 100% of the values (k > 1) fall within k standard deviations of the mean. If the distribution is more or less symmetrical about the mean (i.e. closer to a normal distribution), the more precise empirical rule states that approximately 68% of the data fall within 1 standard deviation of the mean, 95% within 2 standard deviations, and 99.7% within 3 standard deviations.
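As a quick sanity check (a hedged sketch, not part of the original post), you can see how closely the salary data follows the empirical rule by computing the share of employees within 1, 2, and 3 standard deviations of the mean:
-- fraction of rows within k standard deviations of the mean salary
select round(avg(case when abs(salary - avg_s) <= 1 * sd_s then 1 else 0 end), 2) within_1_sd,
       round(avg(case when abs(salary - avg_s) <= 2 * sd_s then 1 else 0 end), 2) within_2_sd,
       round(avg(case when abs(salary - avg_s) <= 3 * sd_s then 1 else 0 end), 2) within_3_sd
  from (select salary,
               avg(salary) over () avg_s,
               stddev(salary) over () sd_s
          from HR.employees);
Salary distributions are usually right-skewed, so expect the first figure to deviate noticeably from 0.68.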
Skewness is a measure of the unevenness or asymmetry of the distribution. For a perfectly symmetric distribution, the mean, median, and mode coincide. For a skewed distribution, the mean generally falls to one side of the median, which in turn falls to the same side of the mode - that side being right (+ve skew) or left (-ve skew) depending on the skew. Skewness of a population is computed as Skew = (Σ_1_to_N ((x_i - μ)/σ)³)/N, where μ is the population mean and σ is the population standard deviation.
Kurtosis is a measure of the peakedness of the distribution of the population: Kurtosis = (Σ_1_to_N ((x_i - μ)/σ)⁴)/N.
EX> Compute the skewness and kurtosis for the employee salary.
-- Skewness
select avg(val3)
from (select power((salary - avg(salary) over ())/
stddev(salary) over (), 3) val3
from HR.employees);
--
-- Kurtosis
select avg(val4)
from (select power((salary - avg(salary) over ())/
stddev(salary) over (), 4) val4
from HR.employees);
This brevity is made possible by Oracle SQL analytic functions (avg(salary) over () means: compute the average over all rows in employees). Note that you can use the partition by and order by clauses inside the () to group data across different employee classes and scope the computation to each group if required. From a performance standpoint, note that the computation intrinsically demands multiple passes (two at the least) over the employees table (or the tables behind the employees view) - so for a potentially large employees table, the computations may involve on-disk sort space rather than just the PGA (the session's private working memory). For more on such performance implications of SQL, begin your education with the Oracle Concepts or SQL Tuning guides, or look up the various references provided here.
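As a small variation (a hedged sketch, not from the original post), both measures can be computed from one standardized subquery, using the population standard deviation to match the formulas above:
-- Skewness and kurtosis in a single statement
select avg(power(z, 3)) skewness,
       avg(power(z, 4)) kurtosis
  from (select (salary - avg(salary) over ()) / stddev_pop(salary) over () z
          from HR.employees);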