My plan was to finish a full series on Oracle tools for Basic Statistics, before moving on to typical "blog-like" postings - but I realize I'll need at least another month (or two) to get to topics like ANOVA and Multiple Regression. I feel compelled to get a basic feel for a technique/concept before writing up examples of the functions here (which anybody can quickly get from the Oracle SQL Reference Manual).

So I am (reluctantly) breaking my rule to sneak in this note about a talk I gave in Nov '07 at the Worcester Polytechnic '07 Colloqium Series - before it loses its relevance. The talk provides the motivation for in-database mining, and the presentation slides offer a good intro to Oracle Data Mining in the 11gR1 Database release.

Oracle has touted the benefits of in-database mining for over 4 years now, with steady improvements to the product and an expanding customer base. The recent announcement from market leader SAS on integration with Teradata is a nice vindication of this approach. Details of the integration, whether this engineering effort pans out, and how the market receives the integrated product remains to be seen - but this is good for the industry. SAS still owns the bully pulpit in Analytics, and this move can hopefully lead to greater awareness of the benefits of this approach, and consolidation in this space. Marcos has promised to write about these developments from his perspective as a scientist/developer - so I will hold off my thoughts in anticipation of that post.

Impressions on WPI - good school with active AI and Database research groups. I had the pleasure of meeting Prof. Ruiz and Prof. Heffernan from the AI research group - they have some interesting projects there, and Prof. Rundentstein and Prof. Mani from the database research group. Check these programs out if you are an aspiring CS student.

### Talk at WPI - Oracle In-Database Mining

### Student's T-Distribution, Degrees of Freedom

In the previous post, we showed how to compute the confidence interval when σ - i.e. the population standard deviation, is known. But in a typical situation, σ is rarely known.**Computing confidence Interval for μ when σ is Unknown:** The solution then is to use the *sample* standard deviation, and use a variant of the standardized statistic for normal distribution z = (X_bar - μ)/σ.**Student's T Distribution:**

For a *normally distributed population*, the Student's T distribution is given by this standardized statistic: **t = (X_bar - μ)/(S/√n), with (n - 1) degrees of freedom (df) for the deviations**, where S is the sample standard deviation, and **n** is the sample size. Key points:

- The t distribution resembles the bell-shaped z (normal) distribution, but with wider tails than z, with mean zero (the mean of z), and with variance tending to 1 (the variance of z) as df increases. For df > 2,
**σ² = df/(df - 2)**. - The z (normal) distribution deals with one unknown
**μ**- estimated by the random variable X_bar, while the t distribution deals with two unknowns**μ**and**σ**- estimated by random variables X_bar and S respectively. So it tacitly handles greater uncertainty in the data. - As a consequence, the t distribution has wider confidence intervals than z
- There is a t-distribution for each df=1,..n.
- A good comparison of the distributions is provided here.
- For a sample (n < 30) taken from normally distributed population, a
**(1 - α) 100% confidence interval**for μ when σ is__unknown__is**X_bar ± t**. This is the better distribution to use for small samples - with (n - 1) df, and unknown μ and &sigma._{α/2}s/√n - But larger samples (n ≥ 30), and/or with larger df, the t distribution can be approximated by a z distribution, and the
**(1 - α)100% confidence interval**for μ is**X_bar ± z**(Note: Using the sample sd itself)._{α/2}s/√n

x | x_bar | deviation | deviation_squaredGiven the mean, the deviation computation for the

3 | 3 | 0.0 | 0

2 | 3 | -1.0 | 1

4 | 3 | 1.0 | 1

1 | 3 | -2.0 | 4

2 | 3 | -1.0 | 1

Sum of Squared Deviation = 7

*random*5 samples effectively retain 5 degrees of freedom. Next, assume we

*don't*know the population mean, and instead are asked to compute the deviation from

*one*random number. Our goal is to choose a number that will minimize the deviation. A readily available number is the s sample mean (3+2+4+1+2)/5 = 2.4 - so we will use it:

x | x_bar | deviation | deviation_squaredThe use of sample mean biases the SSD downward from 7 (actual) to 5.2. But given the choice of a mean, the deviation for the same random 5 samples retain df = (5 - 1) = 4 degrees of freedom.

3 | 2.4 | 0.6 | 0.36

2 | 2.4 | -0.4 | 0.16

4 | 2.4 | 1.6 | 2.56

1 | 2.4 | -1.4 | 1.96

2 | 2.4 | -0.4 | 0.16

Sum of Squared Deviation = 5.2

Subsequent choices of 2 means - (3+2)/2, (4+1+2)/3 - or 3 means, would reduce the SSD down further; at the same time, reducing the degrees of freedom for the deviation for the 5 random samples: df=(5-2), df=(5-3) and so on. As an extreme case, if we consider the sample mean of each sampled number as itself, then we have:

x | x_bar | deviation | deviation_squaredwhich reduces SSD to 0, and the deviation df to (5-5) = 0. So in general,

3 | 3 | 0 | 0

2 | 3 | 0 | 0

4 | 4 | 0 | 0

1 | 1 | 0 | 0

2 | 2 | 0 | 0

Sum of Squared Deviation = 0

- deviations (and hence SSD) for a sample of size n taken from a known population mean μ will have df = n
- deviations for a sample of size n taken from the sample mean X_bar will have df = (n - 1)
- deviations for a sample of size n taken from k ≤ n different numbers (typically mean of sample points) will have df = n - k.

**Student's T-Test**- which requires understanding the concepts of Hypothesis Testing. So I will defer the code for an equivalent confidence_interval() routine based on T-distribution for later.

### Sampling Distributions, Confidence Interval

The key goal of inferential statistics is to make predictions/observations about the population (the whole) as a generalization of observations made on a random sample (the part). In the previous post, we discussed common techniques to derive samples from a population. In this post, we will discuss *sampling distributions* - a key building block for the practice of statistical inference. These tools help answer questions such as: "What should be the sample size to make a particular inference about this population?" or "100 random water samples along this river show an average of 50 ppm (parts per million) of this toxin, with standard deviation of 4.5 - how much is the river contaminated on average with 95% confidence interval", and so on.

The objective in the next few posts is to discuss the use of Oracle SQL statistical functions for various sampling distributions. But if you are a typical DB developer with novice/intermediate knowledge of statistics (like me), spending some time on these foundational concepts may be worthwhile. I am currently using Complete Business Statistics and the Idiot's Guide to Statisticsas my guides - you may use these or other books and/or the free online references on the right pane of this blog.

- The various numerical measures - such as mean, variance etc - when applied to a
*sample*, are called**sample statistics**or simply**statistics**. - When these numerical measures are applied to a
*population*, they are called**population parameters**or simply**parameters**. - An
**estimator**of a population parameter is the sample statistic used to estimate the parameter. The sample statistic - mean, X_bar - estimates the population mean μ; the sample statistic - variance, S² - estimates the population variance σ². - When a single numerical value is the estimate, it is called a
**point estimate**. For example, when we sample a population and obtain a value for X_bar - the statistic - we get a*specific*sample mean, denoted by x_bar (lower-case), which is the estimate for population mean μ. When the estimate covers a range or an interval, it is called an**interval estimate**- the unknown population parameter is likely to be found in this interval - A sample statistic, such as X_bar, is a
*random variable*; the values of this randome variable depend on the values in the random sample from which the statistic is computed; the sample itself depends on the population from which it is drawn. This random variable has probability distribution, which is called the**sampling distribution** - The principal use of sampling distributions and its related concepts is to help predict how close the estimate is to the population parameter, and with what probability.

**Central Limit theorem**:

The sample statistic sample mean X_bar exhibits a unique behavior - regardless of the population distribution (uniform, exponential, other), in the limit n → ∞ ("n tends to infinity", where n is the sample size), the sampling distribution of X_bar tends to a normal distribution. The

*rate*at which the sampling distribution approaches normal distribution depends on the population distribution. Now, if the

__population itself is normally distributed__, then X_bar is normally distributed for

*any*sample size. This is the essence of what is called Central Limit theorem.

Formally, when a population with mean μ and standard deviation σ is sampled, the sampling distribution of the sample mean X_bar will tend to a normal distribution with (the same) mean

**μ**and standard deviation

**σ**, as the sample size

_{x_bar}= σ/√n**n**becomes large. "Large" is empirically defined to be

**n ≥ 30**. The value

**σ**is called the

_{x_bar}**standard error of the mean**.

"Okay,... so what is the big deal?". The big deal is that we can now estimate the population mean (regardless of the population's distribution) using the familiar technique (that we saw in an earlier post) for standard normal distribution.

Now, it is not common that one or more of the population parameters (like standard deviation) are always known. The computations have to be modified to accommodate these unknowns - which brings us to two more concepts associated with sampling distributions.

**Degrees of Freedom (DF):**

If we are asked to choose three random numbers

*a*,

*b*and

*c*, we are free to choose any three numbers without any restrictions - in other words, we have 3

*degrees of freedom*. But if the three numbers are put together in a model

*a + b + c = 10*, then we have just 2 degrees of freedom - choice of

*a*and

*b*can be arbitrary , but

*c*is constrained to take a specific value that satisfies the model. The use of

*df*appears to be a compensatory mechanism in the computations, specific to the context/situation in which is it applied - so we'll discuss this in the context of the technique we are illustrating.

**Confidence Interval:**

An interval estimate, with its associated measure of confidence is called

**confidence interval**. It is a range of numbers that probably contains the unknown population parameter, with an adjoining

*level of confidence*that it indeed does. This is better than a point estimate in that it gives some indication of the

*accuracy*of the estimation.

In an earlier post, we briefly touched upon the transformation of a normal random variable (X, with arbitrary μ and σ) to a

*standard*normal variable (Z, with μ = 0 and σ = 1). The transformations are X to Z: Z = (X - μ)/σ and Z to X: X = μ + Zσ. Applying the latter transformation to standardized sampling distribution with mean μ and standard deviation σ/√n, the confidence interval for the population mean is μ ± Z σ/√n.

A typical question will be "Give me the 95% confidence interval for the population mean". Given the confidence level, and the knowledge that the area under the standard normal curve is 1, we can obtain the value of Z from the standard normal table. For example, a 95% confidence level translates to an area of 0.95 symmetrically distributed around the mean, leaving 0.025 as areas on the left and right tails. From the table, Z = -1.96 for P=0.025, and Z = 1.96 for P=(0.025+0.95). So the 95% confidence interval for the population mean, when the population standard deviation is known, is given by μ ± 1.96 σ/√n

We'll wrap this section reinforcing some concepts for use later:

- In probability-speak, the statement "95% confidence interval for the population mean" implies that "there is a 95% probability that a given confidence interval from a given random sample from the same population will contain the population mean". It does
__NOT__imply a "95% probability that the population mean is a value in the range of the interval". In the figure, sample mean x for a specific sample falls within the interval - based on this, the confidence interval is considered to contain the population mean μ. If x for another sample falls in the tail region, then that confidence interval cannot assert that it contains μ. - The quantity Z σ/√n is called
**sampling error**or**margin of error**. - The combined area under the curve in the tails (i.e. 1 - 0.95 = 0.05 in the above example) is called
**level of significance α**, and/or**error probability**. - The area under the curve excluding the tails under the curve in the tails
**(1 - α)**is called**confidence coefficient**. - The confidence coefficient x 100, expressed as a percentage, is the
**confidence level**. - The Z value that cuts off the area under the right tail (i.e. the area α/2 on the right of the curve, 1.96 in our example) is denoted as
**z**._{α/2} - For a small sample (n < 30), or a sample taken from a normally distributed population, the
**(1 - α) 100% confidence interval**for μ with known σ is**X_bar ± z**_{α/2}σ/√n

**Confidence Interval for Population Mean with Known σ:**

Excel has a CONFIDENCE() function to compute the confidence interval. See a simple equivalent for Oracle SQL below. The function takes in a table name, the column name representing the sampled quantity, and level of significance value of 0.05, 0.01, or 0.1 (that corresponds to the three popular confidence levels - 95%, 99%, and 90% - respectively), and returns an object that contains the sample mean, sample error, the lower and upper bounds of the interval.

CREATE OR REPLACE TYPE conf_interval_t AS OBJECT (I used this function to find the 90%, 95%, and 99% confidence interval for the population mean of ORDERS_TOTAL in the ORDERS table, with an approx sample size of 15, with a seed to enable repeatable runs from the SQL sampler. Notice how the interval widens and becomes less precise as the confidence level increases. The true population mean is also shown to be contained in the interval

pop_mean NUMBER, sample_err NUMBER, lower NUMBER, upper NUMBER);

/

CREATE OR REPLACE FUNCTION confidence_interval (

table_name IN VARCHAR2,

column_name IN VARCHAR2,

sample_percent IN NUMBER,

alpha IN NUMBER DEFAULT 0.05,

seed IN NUMBER DEFAULT NULL)

RETURN conf_interval_t IS

pop_mean NUMBER;

pop_stddev NUMBER;

sample_sz NUMBER;

z NUMBER;

err NUMBER;

v_stmt VARCHAR2(32767);

BEGIN

v_stmt :=

'SELECT AVG(' || column_name || '), count(*) ' ||

'FROM (SELECT * ' ||

'FROM ' || table_name ||

' SAMPLE(' || sample_percent || ')';

IF (seed IS NOT NULL) THEN

v_stmt := v_stmt || ' SEED(' || seed || ')';

END IF;

v_stmt := v_stmt || ')';

EXECUTE IMMEDIATE v_stmt INTO pop_mean, sample_sz;

v_stmt :=

'SELECT STDDEV(' || column_name || ') ' ||

'FROM ' || table_name;

EXECUTE IMMEDIATE v_stmt INTO pop_stddev;

IF (alpha = 0.05) THEN

z := 1.96;

ELSIF (alpha = 0.01) THEN

z := 2.57;

ELSIF (alpha = 0.1) THEN

z := 1.64;

ELSE

RETURN(NULL);

END IF;

err := z * pop_stddev / SQRT(sample_sz);

RETURN (conf_interval_t(pop_mean, err, (pop_mean - err), (pop_mean + err)));

END confidence_interval;

/

SQL> select confidence_interval('ORDERS', 'ORDER_TOTAL', 15, 0.1, 3) from dual;Fine - but how do we find the confidence interval when σ is unknown (which is the norm in practice)? Enter

CONFIDENCE_INTERVAL('ORDERS','ORDER_TOTAL',15,0.1,3)(POP_MEAN, SAMPLE_ERR, LOWER, UPPER)

-------------------------------------------------------------------------

CONF_INTERVAL_T(24310.9188, 21444.6451, 2866.27368, 45755.5638)

SQL> select confidence_interval('ORDERS', 'ORDER_TOTAL', 15, 0.05, 3) from dual;

CONFIDENCE_INTERVAL('ORDERS','ORDER_TOTAL',15,0.05,3)(POP_MEAN, SAMPLE_ERR, LOWER, UPPER)

-------------------------------------------------------------------------

CONF_INTERVAL_T(24310.9188, 25628.9661, -1318.0473, 49939.8848)

SQL> select confidence_interval('ORDERS', 'ORDER_TOTAL', 15, 0.01, 3) from dual;

CONFIDENCE_INTERVAL('ORDERS','ORDER_TOTAL',15,0.01,3)(POP_MEAN, SAMPLE_ERR, LOWER, UPPER)

-------------------------------------------------------------------------

CONF_INTERVAL_T(24310.9188, 33605.3279, -9294.4092, 57916.2467)

SQL> select avg(order_total) from orders;

AVG(ORDER_TOTAL)

----------------

34933.8543

SQL>

**T (or Student's) Distribution**- we will look at this in the next post.

### Sampling

The key goal of inferential statistics is to make predictions/observations about the *population* (the whole) as a generalization of observations made on a random *sample* (the part). In this post, we will present tools/techniques in Oracle for sampling data.**Sampling**

For ensuring high accuracy in the results of a statistical inference (technique), the sample dataset should minimally have these properties:

- the sample must be drawn randomly from the population
- every segment of the population must be adequately and proportionately represented
- the sample should not be
*biased*- a classic example is the*non-response bias*seen in survey/poll data - where the respondents ignore/refuse to answer a particular question ("Have you smoked marijuana?" in a health questionnaire).

The SAMPLE clause in the SQL SELECT statement supports simple random sampling and clustered sampling. Note that the

*sampling frame*here is simply the set of rows returned by the statement - so you can control how many, and which, of the rows are sampled using filters in the WHERE clause. Here are some examples based on the OE.ORDERS table provided with the Sample Schema (connect oe/oe in SQL*Plus to see this table).

**Example: Random Sampling without replacement:**

Get random samples from ORDERS table with each row having a 15% chance of being in the sample

SQL> select order_id,order_status,order_total from orders sample (15) seed (3);The SEED clause in the above statement helps with repeatability of results from one run to the next. Next, achieving

ORDER_ID ORDER_STATUS ORDER_TOTAL

---------- ------------ -----------

2354 0 46257

2360 4 990.4

2361 8 120131.3

2384 3 29249.1

2389 4 17620

2400 2 69286.4

2401 3 969.2

2402 8 600

2415 6 310

2416 6 384

2417 5 1926.6

ORDER_ID ORDER_STATUS ORDER_TOTAL

---------- ------------ -----------

2419 3 31574

2423 3 10367.7

2427 7 9055

2429 9 50125

2453 0 129

16 rows selected.

SQL>

**clustered sampling**requires a BLOCK qualifier in the SAMPLE clause, as in

select order_id,order_status,order_total from orders sample block (15) seed (3);

**Example: Random Sampling with replacement:**can be accomplished with some minimal coding. In the example below, we using a hashing function (ORA_HASH) on the sequence of enumerated (using ROWNUM pseudo-column) rows to randomize the selection from the table to be sampled. Let's try this: "Get me random samples with replacement with sample size 10 rows from the ORDERS table".

-- stage a view with row numbers tacked on to original tableThis is the result of the above code snippet, when run in the OE schema using SQL*Plus:

create view orders_view as

select rownum rnum, o.*

from orders o;

-- create a sequence

create sequence orders_seq10;

-- create a mapping table

create table orders_map(rnum number, rhsh number);

-- the requested sample size is 10

begin

for i in 1..10

loop

insert into orders_map (rnum) values (orders_seq10.nextval);

end loop;

end;

/

commit;

-- the complete orders table is the sampling frame

-- mark random sampled entries in mapping table

update orders_map set rhsh = ora_hash(rnum, (select count(*) from orders));

-- use the mapping table and orders_view to create a view with sampled rows

create view orders_sample10_view as

select o.*

from orders_view o, orders_map m

where o.rnum = m.rnum;

SQL> select * from orders_sample10_view;For repeatability of sampled results, change all "create view" statements above to "create table". It is also useful if the data to be sampled is persisted in a table, rather than presented to the sampling queries as a view.

RNUM ORDER_ID ORDER_MO CUSTOMER_ID ORDER_STATUS ORDER_TOTAL SALES_REP_ID

------ ---------- -------- ----------- ------------ ----------- ------------

11 2455 direct 145 7 14087.5 160

12 2379 direct 146 8 17848.2 161

13 2396 direct 147 8 34930 161

14 2406 direct 148 8 2854.2 161

15 2434 direct 149 8 268651.8 161

16 2436 direct 116 8 6394.8 161

17 2446 direct 117 8 103679.3 161

18 2447 direct 101 8 33893.6 161

19 2432 direct 102 10 10523 163

20 2433 direct 103 10 78 163

10 rows selected.

SQL>

**Stratified Sampling**can also be coded using SQL. But rather than provide an example query that is specific to a given table, we will provide a query generator function here. This is adapted from the output of code generated by Oracle Data Miner for stratified sampling. The function accepts a source and result table name, the name of the stratifying column (a.k.a. "variable" in statistics, "attribute" in data mining), the sampling size as a percentage of the table size, a scalar comparison operator ('<', '=', '>'), and an indicator to specify if each strata should roughly have the same number of elements.

An important note - Oracle Data Miner - the GUI interface for the Oracle Data Mining platform provides sampling features,

*and*generated SQL/PLSQL code that corresponds to the sampling input. This example has been adapted from this generated code.

CREATE OR REPLACE TYPE TARGET_VALUES_LIST IS TABLE OF VARCHAR2(32);Now, cut and paste the above code in a SQL*Plus session (for this example, the Sample Schema OE/OE session), and then invoke the function in a SQL*Plus session using this wrapper. The inputs are pretty self-explanatory - we are asking to sample roughly 20% of the ORDERS table, stratified based on values in the ORDERS_STATUS column, and to place the sampled output in the table SAMPLED_ORDERS.

/

CREATE OR REPLACE TYPE VALUE_COUNT_LIST IS TABLE OF NUMBER;

/

CREATE OR REPLACE

FUNCTION GENERATE_STRATIFIED_SQL (

result_table_name IN VARCHAR2,

source_table_name IN VARCHAR2,

strat_attr_name IN VARCHAR2,

percentage IN NUMBER,

op IN VARCHAR2,

equal_sized_strata IN BOOLEAN DEFAULT FALSE)

RETURN VARCHAR2 IS

v_stmt VARCHAR2(32767);

tmp_str VARCHAR2(4000);

sample_count PLS_INTEGER;

attr_names TARGET_VALUES_LIST;

attr_values TARGET_VALUES_LIST;

counts VALUE_COUNT_LIST;

counts_sampled VALUE_COUNT_LIST;

v_minvalue NUMBER;

BEGIN

v_stmt :=

'SELECT column_name ' ||

'FROM user_tab_columns ' ||

'WHERE table_name = ' || '''' || UPPER(source_table_name) || '''';

EXECUTE IMMEDIATE v_stmt BULK COLLECT INTO attr_names;

v_stmt :=

'SELECT /*+ noparallel(t)*/ ' || strat_attr_name || ', count(*), ' ||

'ROUND ((count(*) * ' || percentage || ')/100.0) ' ||

'FROM ' || source_table_name ||

' WHERE ' || strat_attr_name || ' IS NOT NULL ' ||

'GROUP BY ' || strat_attr_name;

EXECUTE IMMEDIATE v_stmt

BULK COLLECT INTO attr_values, counts, counts_sampled;

IF (equal_sized_strata = TRUE) THEN

FOR i IN counts.FIRST..counts.LAST

LOOP

IF (i = counts.FIRST) THEN

v_minvalue := counts(i);

ELSIF (counts(i) > 0 AND v_minvalue > counts(i)) THEN

v_minvalue := counts(i);

END IF;

END LOOP;

FOR i IN counts.FIRST..counts.LAST

LOOP

counts(i) := v_minvalue;

END LOOP;

END IF;

v_stmt :=

'CREATE TABLE ' || result_table_name || ' AS ' ||

'SELECT ';

FOR i IN attr_names.FIRST..attr_names.LAST

LOOP

IF (i != attr_names.FIRST) THEN

v_stmt := v_stmt || ',';

END IF;

v_stmt := v_stmt || attr_names(i);

END LOOP;

v_stmt := v_stmt ||

' FROM (SELECT /*+ no_merge */ t.*, ' ||

'ROW_NUMBER() OVER ' ||

'(PARTITION BY ' || strat_attr_name ||

' ORDER BY ORA_HASH(ROWNUM)) RNUM ' ||

'FROM ' || source_table_name || ' t) ' ||

'WHERE RNUM = 1 OR ';

FOR i IN attr_values.FIRST..attr_values.LAST

LOOP

IF (i != attr_values.FIRST) THEN

tmp_str := ' OR ';

END IF;

IF (counts(i) <= 2) THEN

sample_count := counts(i);

ELSE

sample_count := counts_sampled(i);

END IF;

tmp_str :=

tmp_str ||

'(' || strat_attr_name || ' = ''' || attr_values(i) || '''' ||

' AND ORA_HASH(RNUM, (' || counts(i) || ' -1), 12345) ' ||

op || sample_count || ') ';

v_stmt := v_stmt || tmp_str;

END LOOP;

RETURN(v_stmt);

END;

/

set serveroutput onthis will return the following sampling query (output formatted for easy readability):

BEGIN

DBMS_OUTPUT.PUT_LINE(

generate_stratified_sql(

result_table_name => 'sampled_orders',

source_table_name => 'orders',

strat_attr_name => 'order_status',

percentage => 20,

op => ' < '));

END;

/

CREATE TABLE sampled_orders ASIf you execute this SQL in the same OE session, the resulting table is:

SELECT ORDER_ID,ORDER_MODE,CUSTOMER_ID,ORDER_STATUS,ORDER_TOTAL,

SALES_REP_ID,PROMOTION_ID

FROM (SELECT /*+ no_merge */ t.*,

ROW_NUMBER() OVER

(PARTITION BY order_status

ORDER BY ORA_HASH(ROWNUM)) RNUM

FROM orders t)

WHERE RNUM = 1 OR

(order_status = '1' AND ORA_HASH(RNUM, (7 -1), 12345) < 1) OR

(order_status = '6' AND ORA_HASH(RNUM, (9 -1), 12345) < 2) OR

(order_status = '2' AND ORA_HASH(RNUM, (7 -1), 12345) < 1) OR

(order_status = '5' AND ORA_HASH(RNUM, (15 -1), 12345) < 3) OR

(order_status = '4' AND ORA_HASH(RNUM, (12 -1), 12345) < 2) OR

(order_status = '8' AND ORA_HASH(RNUM, (17 -1), 12345) < 3) OR

(order_status = '3' AND ORA_HASH(RNUM, (9 -1), 12345) < 2) OR

(order_status = '7' AND ORA_HASH(RNUM, (3 -1), 12345) < 1) OR

(order_status = '0' AND ORA_HASH(RNUM, (11 -1), 12345) < 2) OR

(order_status = '10' AND ORA_HASH(RNUM, (5 -1), 12345) < 1) OR

(order_status = '9' AND ORA_HASH(RNUM, (10 -1), 12345) < 2);

SQL> select * from sampled_orders;That is, 30 rows (a bit more than 20%) out of 105 are sampled, using order_status as the stratifying column.

ORDER_ID ORDER_MO CUSTOMER_ID ORDER_STATUS ORDER_TOTAL SALES_REP_ID

---------- -------- ----------- ------------ ----------- ------------

2363 online 144 0 10082.3

2369 online 116 0 11097.4

2403 direct 162 0 220 154

2439 direct 105 1 22150.1 159

2408 direct 166 1 309 158

2444 direct 109 1 77727.2 155

2358 direct 105 2 7826 155

2400 direct 159 2 69286.4 161

2375 online 122 2 103834.4

2450 direct 147 3 1636 159

2423 direct 145 3 10367.7 160

ORDER_ID ORDER_MO CUSTOMER_ID ORDER_STATUS ORDER_TOTAL SALES_REP_ID

---------- -------- ----------- ------------ ----------- ------------

2385 online 147 4 295892

2437 direct 103 4 13550 163

2389 online 151 4 17620

2364 online 145 4 9500

2377 online 141 5 38017.8

2425 direct 147 5 1500.8 163

2394 direct 109 5 21863 158

2457 direct 118 5 21586.2 159

2426 direct 148 6 7200

2410 direct 168 6 45175 156

2427 direct 149 7 9055 163

ORDER_ID ORDER_MO CUSTOMER_ID ORDER_STATUS ORDER_TOTAL SALES_REP_ID

---------- -------- ----------- ------------ ----------- ------------

2446 direct 117 8 103679.3 161

2402 direct 161 8 600 154

2434 direct 149 8 268651.8 161

2447 direct 101 8 33893.6 161

2365 online 146 9 27455.3

2372 online 119 9 16447.2

2359 online 106 9 5543.1

2368 online 149 10 60065

30 rows selected.

SQL> select count(*) from orders;

COUNT(*)

----------

105

SQL>

Try this example on your own schemas and tables, and send me some feedback. The maximum length of the string returned from GENERATE_STRATFIED_SQL() is 32767. Based on the interest, I can post another function that can handle SQL strings of <= 64K in size.

Given the explosive growth in data, sampling continues to remain an active, relevant research area in statistics, data mining, and computer science.

### Analyzing your data for Distributions

In the previous post, we looked at continuous probability distributions. You can determine if the data in a given column in a given table follows a particular distribution using routines in the DBMS_STAT_FUNCS package. The example shows a table with four columns - each pre-loaded with data synthesized to fit a particular continuous distribution.

You can substitute the table name and column name for your experiment. A minor annoyance in the design of this API - most other Oracle packages will default to the current session name, but this one explicitly expects the schema name as input - which, in my case, is DMUSER - please change this to your schema name accordingly.

set echo on;Let us first start with

set serveroutput on;

create table disttab (

num1 number, num2 number, num3 number, num4 number, cat1 varchar2(2));

rem num1: Numbers generated based on normal distribution

rem num2: Numbers generated based on exponential distribution

rem num3: Numbers generated based on weibull distribution

rem num4: Numbers generated based on uniform distribution

insert into disttab values (9.604955, 1.05536, 4.126087, 22.950835, 'AA');

insert into disttab values (13.022139, 1.714142, 4.999804, 32.598089, 'AA');

insert into disttab values (11.572116, 3.697564, 2.81854, 24.552021, 'AA');

insert into disttab values (9.817124, 1.530935, 2.131106, 6.359504, 'AA');

insert into disttab values (10.146569, 3.295829, 1.510639, 25.218639, 'AA');

insert into disttab values (11.280488, 0.721109, 3.145515, 23.672146, 'BB');

insert into disttab values (9.26679, 1.390282, 4.074397, 11.262112, 'BB');

insert into disttab values (14.303472, 1.327971, 2.51907, 22.675373, 'BB');

insert into disttab values (11.686556, 0.225337, 2.941825, 23.582254, 'BB');

insert into disttab values (13.124479, 7.265271, 0.945059, 29.18001, 'BB');

insert into disttab values (8.601027, 7.060104, 6.078573, 14.878128, 'BB');

insert into disttab values (12.241662, 0.257739, 3.395142, 31.148244, 'CC');

insert into disttab values (13.781857, 4.281371, 1.349627, 23.862069, 'CC');

insert into disttab values (7.827007, 1.347487, 5.836949, 10.76229, 'CC');

insert into disttab values (9.106408, 1.253113, 5.116857, 6.594224, 'CC');

insert into disttab values (11.066785, 4.56512, 3.393899, 22.435955, 'CC');

insert into disttab values (10.71079, 2.700015, 1.922642, 7.635145, 'DD');

insert into disttab values (9.13019, 5.199126, 3.763481, 32.061213, 'DD');

insert into disttab values (7.873859, 0.978657, 2.268487, 1.030052, 'DD');

insert into disttab values (7.731724, 2.382977, 2.639425, 5.676622, 'DD');

insert into disttab values (12.828234, 1.867099, 3.99808, 26.000458, 'DD');

insert into disttab values (12.125892, 1.01285, 3.345311, 8.026281, 'DD');

insert into disttab values (9.800528, 5.869301, 3.840932, 29.928523, 'EE');

insert into disttab values (10.605782, 3.145211, 2.13718, 27.398604, 'EE');

insert into disttab values (14.054569, 4.089033, 2.436408, 4.483585, 'EE');

insert into disttab values (8.120606, 2.155303, 1.787835, 19.513588, 'EE');

insert into disttab values (13.093059, 0.220456, 3.456848, 24.855135, 'EE');

insert into disttab values (8.421441, 2.4819, 2.817669, 21.137668, 'FF');

insert into disttab values (11.899697, 2.507618, 3.770983, 4.016285, 'FF');

insert into disttab values (9.601342, 1.12639, 3.21053, 28.643809, 'FF');

insert into disttab values (9.32297, 10.003288,6.890515, 33.67171, 'FF');

insert into disttab values (6.896019, 10.76641, 3.123496, 29.077463, 'FF');

insert into disttab values (12.542443, 0.228756, 4.081015, 33.542652, 'FF');

insert into disttab values (14.038144, 7.326175, 3.53459, 11.731359, 'FF');

**Normal Distribution**. The mean (same as E(x)) and standard deviation are computed for the column

*num1*and provided as input.

declareThe output of this procedure looks like this:

mean number;

stdev number;

sig number;

begin

select avg(num1) into mean from disttab;

select stddev(num1) into stdev from disttab;

dbms_output.put_line('NORMAL DISTRIBUTION - SHAPIRO_WILKS');

dbms_stat_funcs.normal_dist_fit(

'dmuser', 'disttab', 'num1', 'SHAPIRO_WILKS', mean, stdev, sig);

dbms_output.put_line('Mean : ' || round(mean, 4));

dbms_output.put_line('Stddev : ' || round(stdev, 4));

dbms_output.put_line('Sig : ' || to_char(sig,'9.9999'));

end;

/

NORMAL DISTRIBUTION - SHAPIRO_WILKS

W value : .9601472834631918774434537597940068448938

Mean : 10.7426

Stddev : 2.1094

Sig : .2456

The high value for the W returned from the Shapiro-Wilks test (explained here) indicates an agreement with the null hypothesis that the data follows normal distribution. The significance (0.25) is also relatively high for the Shapiro-Wilks test (0.05 being the typical threshold).

Next,

**Exponential distribution**. For the above dataset, lambda - i.e. the rate of arrival - is provided (Note that lambda is NOT simply 1/mean(num3), it is 1/E(x) and we don't know how the expected value was computed)

declareThe output shows:

lambda number;

mu number;

sig number;

begin

lambda := 0.3194;

-- select 1/mean into lambda from (select mean(num2) from disttab);

mu := NULL;

sig := NULL;

dbms_output.put_line('EXPONENTIAL DISTRIBUTION - KOLMOGOROV_SMIRNOV');

dbms_stat_funcs.exponential_dist_fit(

'dmuser', 'disttab', 'num3', 'KOLMOGOROV_SMIRNOV', lambda, mu, sig);

dbms_output.put_line('Lambda : ' || lambda);

dbms_output.put_line('Mu : ' || mu);

dbms_output.put_line('Sig : ' || to_char(sig,'9.9999'));

end;

/

EXPONENTIAL DISTRIBUTION - KOLMOGOROV_SMIRNOV

D value : .0919745379005003387429254774811052604723

Lambda : .3194

Mu : 0

Sig : .9237466160

The low D value (0.09) and high significance (0.92) suggests that the data fits an exponential distribution well.

Next,

**Weibull Distribution**. The alpha input parameter to the procedure corresponds to

*η*- the scale, and the mu to

*γ*- the location - in the Weibull probability density function discussed earlier.

declareThe output shows:

alpha number;

beta number;

begin

alpha := 3;

mu := 0;

beta := 4;

sig := NULL;

dbms_output.put_line('.');

dbms_output.put_line('WEIBULL DISTRIBUTION - KOLMOGOROV_SMIRNOV');

dbms_stat_funcs.weibull_dist_fit(

'dmuser', 'disttab', 'num3', 'KOLMOGOROV_SMIRNOV', alpha, mu, beta, sig);

dbms_output.put_line('Alpha : ' || alpha);

dbms_output.put_line('Mu : ' || mu);

dbms_output.put_line('Beta : ' || beta);

dbms_output.put_line('Sig : ' || to_char(sig,'9.9999'));

end;

/

WEIBULL DISTRIBUTION - KOLMOGOROV_SMIRNOV

D value : .2575286246007637604103723582952313687414

Alpha : 3

Mu : 0

Beta : 4

Sig : .0177026134

The Kolmogorov-Smirnov test does not appear emphatic enough, we will try other tests later.

Next,

**Uniform Distribution**.

declareThe output shows:

A number;

B number;

begin

A := 1;

B := 34;

sig := NULL;

dbms_output.put_line('.');

dbms_output.put_line('UNIFORM DISTRIBUTION - KOLMOGOROV_SMIRNOV');

dbms_stat_funcs.uniform_dist_fit(

'dmuser', 'disttab', 'num4', 'CONTINUOUS', 'KOLMOGOROV_SMIRNOV', A, B, sig);

dbms_output.put_line('A : ' || A);

dbms_output.put_line('B : ' || B);

dbms_output.put_line('Sig : ' || to_char(sig,'9.9999'));

end;

/

UNIFORM DISTRIBUTION - KOLMOGOROV_SMIRNOV

D value : .2083979233511586452762923351158645276292

A : 1

B : 34

Sig : .0904912415

The Kolmogorov-Smirnov test does not appear emphatic enough, we will try other tests later.

Finally, let us try fitting a small dataset to

**Poisson Distribution**.

create table pdisttab (num1 number);The output of:

insert into pdisttab values(1);

insert into pdisttab values(2);

insert into pdisttab values(3);

insert into pdisttab values(4);

insert into pdisttab values(5);

declare

mean number;

stdev number;

sig number;

begin

mean := 3.0;

stdev := 1.58114;

dbms_output.put_line('.');

dbms_output.put_line('POISSON DISTRIBUTION - KOLMOGOROV_SMIRNOV');

dbms_stat_funcs.poisson_dist_fit(

'dmuser', 'pdisttab', 'num1', 'KOLMOGOROV_SMIRNOV', mean, sig);

dbms_output.put_line('Mean : ' || mean);

dbms_output.put_line('Stddev : ' || stdev);

dbms_output.put_line('Sig : ' || to_char(sig,'9.9999'));

end;

/

POISSON DISTRIBUTION - KOLMOGOROV_SMIRNOV

D value : .08391793845975204

Mean : 3

Stddev : 1.58114

Sig : .9999999999

emphatically is in agreement with the null hypothesis that the data fits a Poisson distribution.

That was fun, was'n't it? Now, we have introduced several new terms and concepts here (esp if you reviewed the DBMS_STAT_FUNCS docs) - viz.

*goodness-of-fit testing*if a given data sample

*fits*a particular distribution based on a

*null hypothesis*, using various

*test types*, the test metrics, and the

*significance*output of a test, and various other parameters. In the upcoming posts, we will try to understand these new concepts. I also presume that the various references in the previous post and this one may have been useful to experienced/intermediate statisticians also.

### Continuous Probability Distributions

**Continuous Probability Distributions**

A *continuous probability distribution* is a (infinitely large) table that lists the continuous variables (outcomes) of an experiment with the relative frequency (a.k.a probability) of each outcome. Consider a histogram that plots the probability (y axis) that a particular job will get done within a time interval (x axis). As you keep making the interval shorter and more fine-grained, the step-like top of the histogram eventually melds into a curve - called the continuous probability distribution. The total area under this probability curve is 1, the probability that the value of *x* is between two values *a* and *b* is the area under *f(x)* between *a* and *b* and *f(x) <= 0* for all *x* For continuous distributions, the probability for any single point in the distribution is 0, you can compute a non-zero probability only for an interval between two values of the continuous variable *x*.

Oracle SQL provides statistical functions to determine if the values in a given column fit a particular distribution. Before we proceed to the examples, let us look at some of the popularly known distributions.**Normal (Gaussian) Probability Distribution**

The most common continuous probability distribution - to the point of being synonymous with the concept is Normal Probability Distribution, represented graphically by a bell curve, plotted with the continuous value on the x axis, and the probability along the y axis, that is symmetric about the mean x value, with the two ends tapering off to infinity. The curve has these properties:

- The mean, median and mode are the same
- the distribution is bell-shaped and symmetrical about the mean
- the area under this curve is always = 1

*generic*normal distribution can have any mean value and standard deviation. For example, weather info may indicate an annual average rainfall in Boston of 38 inches with standard deviation 3 inches. The smaller the standard deviation (say, 2 inches instead of 3), the steeper the bell curve about the mean. Now if the mean were to shift to, say 40, the symmetric bell curve will shift two places to the right too. The

*probability density function*for normal distribution is given by:

f(x) = 1/(σ√(2π))e

^{-0.5 * ((x - μ)/σ)²}

The

**Standard Normal Distribution**is a special case of normal distribution with σ=0 and μ=1 as shown below (graph not to scale).The

*is a derivative of the standard normal distribution. It is given by*

**standard z-score***z = (x - μ)/σ*. The value of

*z*is then cross-checked against a

*standard normal table*or grid, to arrive at the probability of a required interval - repeat

*interval*. Unlike discrete random variables in a discrete probability distribution, continuous variables can have infinite values for a given event - so the probability can be computed only for an interval or range of values. Continuing with the rainfall example, queried the probability that the annual rainfall next year at Boston will be <= 40.5 inches - we will compute z = (40.5 - 38)/3 = 0.8333. Stated another way, this means that a rainfall of 40.5 inches is 0.8333 standard deviations away from the mean. From the standard normal table, the probability is 0.7967 - that is, roughly 80%.

Microsoft Excel's NORMDIST() function provides this functionality, but I was surprised to find

__no function in Oracle SQL with equivalent simplicity__- I'll file a feature bug after some more research. The Oracle OLAP Option provides a NORMAL() function as part of its OLAP DML interface. This Calc-like interface is different from SQL - so we will defer this for later.

**Application to Data Mining**A key use of the

*z-score*is as a "normalizing" data transformation for mining applications.

__Note__that this concept is

*completely unrelated*to database normalization. The stolen car example in a previous post was a simple example of prediction - we used a few

*categorical*attributes like a car's color, type to predict if a car will be stolen or not.

In the business world, the applications are more grown-up and mission-critical. One example is

**churn prediction**- i.e. finding out if a (say, wireless) customer would stay loyal with the current provider, or move on ("churn") to competitor (in which case, the current provider could try to entice him/her to stay with appropriate promos). The customer data used for such churn prediction applications contains

*categorical*(e.g. gender, education, occupation) and

*numerical*(e.g. age, salary, fico score, distance of residence from a metro) attributes/columns in a table. The data in these numeric columns will be widely dispersed, across different scales. For e.g. values within salary can be from 10s of thousands to several millions. Two numerical attributes will be in different scales - example salary (30K - 2 mil) vs age (1-100). Such disparity in scales, if left untreated, can throw most mining algorithms out of whack - the attributes with higher range of values will start outweighing those in the lower range during the computation of prediction. For such algorithms, the numerical data is

*normalized*to a smaller range [-1, 1] or [0, 1] using the z-transform, to enable uniform handling of numerical data by the algorithm.

Min-max and decimal scaling are other data normalization techniques. Here is one primer on mining data transformations. We will discuss mining transformations using DBMS_DATA_MINING_TRANSFORM package and Oracle SQL in a separate post.

**Uniform Distribution**

Ever waited outside a airport terminal under a sign that says "Rental-Cars/Long-term Parking - Pickup Every 10 minutes"? Your wait time is an example of uniform distribution - assuming a well-run airport, you arrive at the stop and expect to wait between 5 to max 15 minutes for your shuttle. This simplest of continuous distributions has the probability function

*f(x) = 1/ (b - a) a <= x <= b, f(x) = 0 for all other values of x*

and is graphically represented as shown. The probability that a uniformly distributed random variable X will have values in the range x

_{1}to x

_{2}is:

*P(x*.

_{1}<= X <= x_{2}) = (x_{2}- x_{1})/(b - a), a <= x_{1}< x_{2}<= bThe mean

*E(X) = (a+b)/2*and variance

*V(X) = (b - a)²/12*.

To use the shuttle bus example, probability that the wait time will be 8 to 11 minutes is

P(8 <= X <= 11) = (11 - 8)/(15 - 10) = 3/5 = 0.6

**Exponential Distribution**

Consider that an event occurs with an average frequency (a.k.a.

*rate*) of λ and this average frequency is constant. Consider that, from a given point in time, you wait for the event to occur. This

*waiting time*follows an

*exponential distribution*- depicted in the adjoining figure. The probability density function is given by:

*f(x) = λ e*where

^{-λx}*λ*is the frequency with which the event occurs - expressed as a particular number of times per time unit. The

*mean*, or more appropriately, the

*expected value E(X)*of the distribution is

*μ = 1/λ*, the

*variance*is

*σ² = (1/λ)²*.

Take the same shuttle bus example. If the bus does not stick to any schedule and randomly goes about its business of picking up passengers, then the wait time is exponentially distributed. This sounds a bit weird, but there are several temporal phenomena in nature that exhibit such behavior:

- The time between failure of the ordinary light bulb (which typically just blows out suddenly), or some electronic component, follows an exponential distribution. The
*mean time between failure*,*μ*is an important metric in the parts failure/warranty claims domain - The time between arrivals, i.e.
*inter-arrival time*of customers at any check-in counter is exponentially distributed.

*memory-less*- best explained with an example. Suppose you buy a 4-pack GE light at Walmart with a MTBF of 7000 hours (~ 10 months). Assume a bulb blows out, you plug in a new one, the time to next failure of this bulb will remain exponentially distributed. Say, this second bulb fails, and you change the bulb a month (24x30=720 hours) later, the time to next failure will remain exponentially distributed. The time between failure is independent of

*when*the bulb failed and when (i.e. the passage of time before) it was replaced.

The probability functions are best stated in terms of failure (survival). The probability that an item will survive until

*x*units of time, given a MTBF of

*μ*units can be stated as:

*P(X >= x) = e*

^{-λx}x >= 0and conversely, failing before

*x*units of time is given by

*P(X <= x) = 1 - e*

^{-λx}x >= 0where

*λ = 1/μ*.

For example - if Dell claims your laptop fails following an exponential distribution with MTBF 60 months, and the warranty period is 90 days (3 months), what percentage of laptops does Dell expect to fail within the warranty period?

P(X <= 3 months) = 1 - e

^{-(1/60)*3)}= 0.048 ~ 5% of laptops.

**Weibull Distribution**

Weibull distribution is a versatile distribution that can emulate other distributions based on its parameter settings, and is a widely used, important tool for reliability engineering/survival analysis. An extensive coverage on Reliability analysis is provided here. I am mentioning this distribution here mainly to set up the next post. There are several other popular probability distributions - we will revisit them in the future on a needs-basis.

For now, let us break off and look at some Oracle code in the next post.

### Discrete Probability Distributions

A **Probability Distribution** is a table/graph that depicts the assignment of probabilities to the assumption of specific values by a given random variable.

The following concepts are useful to understand probability distributions:

- If Event A can occur in
*p*possible ways and Event B can occur in*q*possible ways, then both A and B can occur in*p*x*q*ways. - The number of different ways that a set of objects can be arranged is called
*Combination*. The number of combinations of*n*objects taken*r*at a time is given by*nCr = n! / (n - r)! r!* - The number of different ways that a set of objects can be arranged in order is called
*Permutation*. The number of permutations of*n*objects taken*r*at a time is given by*nPr = n! / (n - r)!*

FUNCTION factorial(p_n IN NUMBER) RETURN NUMBER ISEX> Compute 9!

BEGIN

IF p_n IS NULL OR p_n < 0 THEN

RAISE_APPLICATION_ERROR(-20000, 'Invalid Input Value');

ELSIF p_n <= 1 THEN

RETURN 1;

ELSE

RETURN factorial(p_n-1) * p_n;

END IF;

END;

select factorial(9) from dual;I was curious to see how far I can push this function - the maximum value of

*n*was 83.7 with NUMBER types, and 84.7 when I changed the input parameter and return type to BINARY_DOUBLE

SQL> select factorial(83.7) from dual;EX> Compute the number of combinations of 9 objects taken 3 at a time.

FACTORIAL(83.7)

---------------

9.642E+125

SQL> select factorial(83.71) from dual;

FACTORIAL(83.71)

---------------

~

SQL> select factorial2(84.7) from dual;

FACTORIAL2(84.7)

----------------

8.167E+127

SQL> select factorial2(84.71) from dual;

FACTORIAL2(84.71)

-----------------

Inf

select factorial(9)/(factorial(9-3) * factorial(3)) from dual;EX> Compute the number of different ways of arranging 9 objects taken 3 at a time.

select factorial(9)/factorial(9-3) from dual;

**Discrete Probability Distributions**

- The
*discrete probability distribution*is a table that lists the discrete variables (outcomes) of an experiment with the relative frequency (a k a probability) of each outcome.

Example: Tossing a coin two times gives you the combinations (H,H), (H,T), (T,H), (T,T) and hence, the following tuples for (#Heads, Frequency, Relative_Frequency):

(0, 1, 1/4=0.25), (1, 2, 2/4=0.5), (2, 1, 1/4=0.25).

This is the probability distribution for # heads after flipping a coin twice. *Mean*or*Expected value*of the discrete probability distribution*μ*= ∑_{i=1_to_n}*x*For the coin example,_{i}* P(x_{i})*μ*= 0 * 0.25 + 1 * 0.5 + 2 * 0.25 = 1*Variance*of the discrete probability distribution*σ²*= ∑_{i=1_to_n}*(x*_{i}- μ)² * P(X_{i})*Standard deviation*is the square root of the variance

**A**

*Binomial Probability Distribution*

*binomial*or

*Bernoulli*experiment is one which consists of a fixed number of trials, each independent of the other, each with only two possible outcomes, with a fixed probability for success or failure representation in each outcome. The Bernoulli process counts the number of successes over a given number of attempts, or in other words, the random variable for a Binomial distribution is the number of successes over given number of attempts.

- The probability of
*r*successes in*n*trials with probability of success*p*and probability of failure*q*is given by*P(r, n) = (n! / (n - r)! r!) p*^{r}q^{(n - r)} - The binomial probability distribution is a table of (r, P(r, n)) which can be subsequently graphed, as discussed in this example

^{2}0.6

^{(7 - 2)}, which can be computed using

select factorial(7) * power(0.4,2) * power(0.6,(7-2))/The probability that it will rain

(factorial(7-2) * factorial(2)) p_2_7

from dual;

*at least*6 days over the next 7 days is P(r >= 6) = P(6,7)+P(7,7), computed using

select (factorial(7) * power(0.4,6) * power(0.6,(7-6))/Finally, the probability that it will rain no more than 2 days over the next 7 days is P(r <= 2) = P(0,7) + P(1,7) + P(2,7)

(factorial(7-6) * factorial(6))) +

(factorial(7) * power(0.4,7) * power(0.6,(7-7))/

(factorial(7-7) * factorial(7))) p_r_ge_6

from dual;

- The
*mean*of a binomial distribution is*μ = np* - The
*standard deviation*is*σ² = npq*

*BINOMDIST(r, n, p, cumulative)*. p is the probability of success, set cumulative=TRUE if you want the probability of r or fewer successes, set cumulative=FALSE if you want exactly r successes. Here is the PL/SQL version:

FUNCTION binomdist(r NUMBER, n NUMBER, p NUMBER, cumulative BOOLEAN DEFAULT FALSE) RETURN NUMBER IS

ri NUMBER;

ret NUMBER;

fn NUMBER;

BEGIN

ret := 0;

fn := factorial(n);

FOR ri IN REVERSE 0..r LOOP

ret := ret + (fn * power(p, ri) * power((1-p),(n - ri)))/

(factorial(n - ri) * factorial(ri));

IF NOT cumulative THEN

EXIT;

END IF;

END LOOP;

RETURN ret;

END binomdist;

*Poisson Probability Distribution*

The random variable for Poission distribution is the number of occurrences of the event over a measurable metric (time, space). In a Poisson process, the (measured) *mean* number of occurences of an event is the same for each interval of measurement, and the number of occurrences in a particular interval are independent of number of occurrences in other intervals.

- The probability of exactly
*r*occurrences over a given interval is given by*P(r) = μ*^{r}* e^{(-μ)}/ r! - The
*variance*of the Poisson distribution is the same as the (observed) mean. - A
*goodness of fit*test helps verify if a given dataset fits the Poisson distribution

select power(25,31) * exp(-25)/factorial(25) p_31Just as we saw in Binomial distribution, the probability that no more than 31 customers will walk into the coffee shop is P(r <= 31) = P(0)+P(1)+..+P(31). Inversely, the probability that

from dual;

*at least*31 customers will walk into the coffee shop is P(r >= 31) = 1 - P(r < 31). Obviously, this leads up to the need for a function similar to POISSON(r, μ, cumulative) in Excel - where cumulative = FALSE indicates computation of exactly r occurrences, and cumulative = TRUE indicates r or fewer.

FUNCTION poissondist(r NUMBER, mu NUMBER,

cumulative BOOLEAN DEFAULT FALSE) RETURN NUMBER IS

ri NUMBER;

ret NUMBER;

BEGIN

ret := 0;

FOR ri IN REVERSE 0..r LOOP

ret := ret + (power(p, ri) * exp(-mu)/factorial(ri));

IF NOT cumulative THEN

EXIT;

END IF;

END LOOP;

RETURN ret;

END poissondist;

*Poisson approximation* - a Poisson distribution can be used to approximate a Binomial distribution if the number of trials (in the binomial experiment) is >= 20 and the probability of success *p* is <= 5%.

### Random Variables, Probability, Bayes Theorem, Naive Bayes Models

In contrast to descriptive statistics, **Inferential statistics** describes the *population* based on information gleaned from a *sample* taken from the population. Fundamental to understanding statistical inference is the concept of probability.

An *experiment* is the process of measuring/observing an activity. An *outcome* is a particular result of the experiment - outcomes are also called **Random Variables**. Random variables can be **discrete** - when it can assume a countable number of values (e.g. one of six outcomes from rolling a dice) or **Continuous** - when the variable can assume uncountably infinite values in a given range of values (time, a person's height). The **Sample space** is all possible outcomes of the experiment. An **event** is an outcome of interest.

The **Probability** of Event A occurring is: P(A) = # of possible outcomes in which Event A occurs/ Total # outcomes in the sample space.**Basic properties of probability:**

- P(A) = 1 implies Event A will occur with certainty
- P(A) = 0 implies Event A will not occur with certainty
- 0 >= P(A) >= 1
- The sum of all probabilities for events in the sample space must be 1
- All outcomes in the sample space that are not part of Event A is called the
*complement*of Event A (named A'). P(A') = 1 - P(A) - Given two events A and B, P(A) or P(B) - i.e. probabilities of each of the events occuring - without the knowledge of the other events occurrence - is called the
*prior probability*. - Given two events A and B, the probability of event A occurring given that event B has occurred - denoted by P(A / B) - is called the
*conditional probability*or*posterior**probability*of Event A given that Event B has occurred. On the flip side, if P(A / B) = P(A), then events A and B are termed*independent*. - Given two events A and B, the probability of both A and B occurring at the same time is called the
*joint probability*for A and B, computed as P(A and B) = P(A) * P(B) - Given two events A and B, the probability of either A or B occurring is called the
*union*of events A and B. If events A and B do not occur at the same time (i.e. are*mutually**exclusive*), then P(A or B) = P(A) + P(B). If events A and B occur at the same time, i.e. are not mutually exclusive, then P(A or B) = P(A) + P(B) - P(A and B) - Law of Total Probability: P(A) = P(A / B)P(B) + P(A / B')P(B')
- The
*Bayes Theorem*for probabilities provides the ability to reverse the conditionality of events and compute the outcome:

P(A / B) = P(A) * P(B / A) / (P(A) * P(B / A) + P(A') * P(B / A'))

*predicting that a given event will occur with a given level of certainty or chance*- quantified by the probability. This is a good segue to look at a real

*business*problem and its solution based on Bayes theorem.

A modern,

*predictive*, loan processing application builds analytical models based on millions of historical loan applicant records (

*training*data), and uses these models to

*predict*the credit-worthiness (a k a risk of loan default) of an applicant by

*classifying*the applicant into

*Low, Medium, High*or such risk categories. In data mining lingo, a new applicant record is now

*scored*based on the model. At the time of this writing (Aug-Sep 2007), the sub-prime lending woes and its effect on US and world markets is the main story. The trillions lost in this mess is fodder for Quant skeptics/detractors, but as a BusinessWeek cover story ("Not So Smart" - Sep 3 2007) explains, the problem was not analytics per se - the problems were with how various managements (mis)used analytics or (mis)understood their data.

Returning to probability concepts, instead of A and B, the events become A and B

_{i}, i=1..n. The event A (or more appropriately for this example, the

*target variable*

**Risk**) is a dependent variable that assumes one of discrete values (called

*classes*-

**low, medium, high**) based on

*predictor*variables B

_{i}through B

_{n}(age, salary, gender, occupation, and so on). The probability model for this

*classifier*is P(A / B

_{1},..,B

_{n}). We just shifted the language from statistics into the realm of data mining/predictive analytics. The Bayes theorem intrinsically assumes conditional dependence between B

_{i}through B

_{n}. Now if

*n*is large, or if each B

_{i}takes on a large number of values, computing this model becomes intractable.

The

**Naive Bayes**probabilistic model greatly simplifies this by making a naive/strong assumption that B

_{i}through B

_{n}are

*conditionally independent*- the details are provided here. You can build a Naive Bayes model using the Oracle Data Mining Option, and predict the value for a target variable in new records using SQL Prediction Functions. The following example illustrates the process.

EX> Given a small, synthetic dataset about the attributes of stolen cars, predict if a particular car will be stolen - based on its attributes.

create table stolen_cars(Table created.

id varchar2(2),

color varchar2(10),

ctype varchar2(10),

corigin varchar2(10),

stolen varchar2(3));

insert into stolen_cars values ('1', 'Red','Sports','Domestic','yes');Commit complete.

insert into stolen_cars values ('2', 'Red','Sports','Domestic','no');

insert into stolen_cars values ('3', 'Red','Sports','Domestic','yes');

insert into stolen_cars values ('4', 'Yellow','Sports','Domestic','no');

insert into stolen_cars values ('5', 'Yellow','Sports','Imported','yes');

insert into stolen_cars values ('6', 'Yellow','SUV','Imported','no');

insert into stolen_cars values ('7', 'Yellow','SUV','Imported','yes');

insert into stolen_cars values ('8', 'Yellow','SUV','Domestic','no');

insert into stolen_cars values ('9', 'Red','SUV','Imported','no');

insert into stolen_cars values ('10', 'Red','Sports','Imported','yes');

commit;

beginPL/SQL procedure successfully completed.

dbms_data_mining.create_model(

model_name => 'cars',

mining_function => dbms_data_mining.classification,

data_table_name => 'stolen_cars',

case_id_column_name => 'id',

target_column_name => 'stolen');

end;

/

create table new_stolen_cars (Table created.

id varchar2(2),

color varchar2(10),

ctype varchar2(10),

corigin varchar2(10));

insert into new_stolen_cars values ('1', 'Red','SUV','Domestic');Commit complete.

insert into new_stolen_cars values ('2', 'Yellow','SUV','Domestic');

insert into new_stolen_cars values ('3', 'Yellow','SUV','Imported');

insert into new_stolen_cars values ('4', 'Yellow','Sports','Domestic');

insert into new_stolen_cars values ('5', 'Red','Sports','Domestic');

commit;

select prediction(cars using *) pred,The query scores each row in the new_stolen_cars table, returning the prediction, and the certainty of this predition. This dataset is very small, but a cursory glance at the results indicates that the predictions are correct - based on the training data. For example, the model predicts 'No' for a domestic yellow sports car - the training data has no such instance. The model predicts 'Yes' for a domestic red sports car, with > 50% certainty - the training data does support this prediction. You can obtain the details of this model using:

prediction_probability(cars using *) prob

from new_stolen_cars;

-- Results

PRE PROB

--- ----------

no .75

no .870967746

no .75

no .529411793

yes .666666687

select *The SQL output is a collection of objects and may not look pretty at first glance. But once you understand the schema of the output type - viz.

from table(dbms_data_mining.get_model_details_nb('cars'));

dm_nb_details- you can decipher the output to the following simple format:

STOLENThis shows the target variable (STOLEN), its value ('yes', 'no'), the prior probability (0.5), and the conditional probability contributed by each predictor/predictor value pair towards each target/class value.

no

.5

DM_CONDITIONALS(

DM_CONDITIONAL('COLOR', NULL, 'Red', NULL, .4),

DM_CONDITIONAL('COLOR', NULL, 'Yellow', NULL, .6),

DM_CONDITIONAL('CTYPE', NULL, 'SUV', NULL, .6),

DM_CONDITIONAL('CORIGIN', NULL, 'Domestic', NULL, .6),

DM_CONDITIONAL('CORIGIN', NULL, 'Imported', NULL, .4),

DM_CONDITIONAL('CTYPE', NULL, 'Sports', NULL, .4))

STOLEN

yes

.5

DM_CONDITIONALS(

DM_CONDITIONAL('COLOR', NULL, 'Red', NULL, .6),

DM_CONDITIONAL('CORIGIN', NULL, 'Imported', NULL, .6),

DM_CONDITIONAL('CORIGIN', NULL, 'Domestic', NULL, .4),

DM_CONDITIONAL('CTYPE', NULL, 'Sports', NULL, .8),

DM_CONDITIONAL('CTYPE', NULL, 'SUV', NULL, .2),

DM_CONDITIONAL('COLOR', NULL, 'Yellow', NULL, .4))

Such ability to score transactional customer data directly from the database (in other words, deploy the model right at the source of customer data) with such simplicity is a key Oracle differentiator and competitive advantage over standalone data mining tools. For more on ODM, consult the references provided in this blog.

### Descriptive Statistics

Descriptive statistics summarizes and displays information about just the observations at hand - i.e. the sample and population are the same. Using data in the sample schema, we will quickly run through descriptive statistics concepts.**Displaying descriptive statistics***Frequency distribution* is a table that organizes a number of values into intervals (classes). A *histogram* is the visual equivalent of frequency distribution - a bar graph that represents the number of observations in each class as the height of each bar.

EX> Compute the frequency distribution of all employee salaries - across 10 classes.

select intvl, count(*) freq

from (select width_bucket(salary,

(select min(salary) from employees),

(select max(salary)+1 from employees), 10) intvl

from HR.employees)

group by intvl

order by intvl;

-- Result from this query

INTVL FREQ

----- --------

1 46

2 10

3 23

4 15

5 8

6 2

7 2

10 1

8 rows selected.

*Relative frequency distribution*provides the number of observations in each class as a percentage of the total number of observations.

EX> Compute the relative frequency distribution of all employee salaries - across 10 classes.

with Q_freq_dist as

(select intvl, count(*) freq

from (select width_bucket(salary,

(select min(salary) from employees),

(select max(salary)+1 from employees), 10) intvl

from HR.employees)

group by intvl

order by intvl)

--

select intvl, freq/(select count(*) from employees) rel_freq

from Q_freq_dist;

-- Result from this query

INTVL REL_FREQ

----- ----------

1 .429906542

2 .093457944

3 .214953271

4 .140186916

5 .074766355

6 .018691589

7 .018691589

10 .009345794

8 rows selected.

*Cumulative frequency distribution*provides the percentage of observations that are less than or equal to the class of interest.

EX> Compute the cumulative frequency distribution of all employee salaries - across 10 classes.

with Q_freq_dist asThe width_bucket function creates equi-width histograms. A complementary function,

(select intvl, count(*) freq

from (select width_bucket(salary,

(select min(salary) from employees),

(select max(salary)+1 from employees), 10) intvl

from HR.employees)

group by intvl

order by intvl),

--

Q_rel_freq_dist as

(select intvl, freq/(select count(*) from employees) rel_freq

from Q_freq_dist)

--

select intvl,

sum(rel_freq) over (order by intvl

rows between unbounded preceding and current row) cum_freq

from Q_rel_freq_dist;

-- Result from this query

INTVL CUM_FREQ

----- ----------

1 .429906542

2 .523364486

3 .738317757

4 .878504673

5 .953271028

6 .971962617

7 .990654206

10 1

8 rows selected.

*NTILE*, helps you bin values into intervals of equal height - i.e. with same count of values in each bin. This is useful for computing

*quartiles*,

*quintiles*etc.

EX> The following query bins the 107 records/observations into approx 10

*equi-height*bins.

select htbin, count(*) htAnyone who has given a competitive exam (SAT thro GRE/GMAT) should be familiar with the term

from (select ntile(10) over (order by salary) htbin

from hr.employees)

group by htbin

order by htbin;

HTBIN HT

----- ------

1 11

2 11

3 11

4 11

5 11

6 11

7 11

8 10

9 10

10 10

10 rows selected.

*percentile*. Given

*n*data points, the

*P*th percentile represents the value which resides above

*P*% of the values, and is given by

*percentile = (n+1) * P/100*

EX> Assume a student took the GRE test in 2004 and received an overall GRE score is 2150 (out of the possible 2400) and ranked at 89th percentile. What does this mean? Based on information provided here, it means, of the worldwide total of 408,948, the student ranked above (408948+1) * 89/100 ~= 363965 examinees, and conversely, 408948-363964= 44982 students ranked above this student. You can compute percentiles in Oracle SQL using CUME_DIST() as follows.

EX> Rank all sales reps and their managers by their salary percentile.

select job_id, last_name, salary,The

round(cume_dist() over

(partition by job_id order by salary) * 100) as pctile

from employees

where job_id like '%SA_%'

order by job_id, pctile desc, salary;

JOB_ID LAST_NAME SALARY PCTILE

------ -------------- ------ ------

SA_MAN Russell 14000 100

SA_MAN Partners 13500 80

SA_MAN Errazuriz 12000 60

SA_MAN Cambrault 11000 40

SA_MAN Zlotkey 10500 20

SA_REP Ozer 11500 100

SA_REP Abel 11000 97

SA_REP Vishney 10500 93

SA_REP Tucker 10000 90

SA_REP Bloom 10000 90

SA_REP King 10000 90

...

*partition by job_id*enables to rank within job_id. You can infer things like Ms. Bloom has 90% of her fellow sales rep earning as much or less than her (

*order by salary*clause), Mr. Zlotkey is in the last quintile for managers, and so on.

EX> Suppose a new job candidate names his salary requirements - say 7.5K - you can use the same function - in what Oracle calls its

*aggregate*usage to find out what percentile this salary number would fit in, as follows:

select round(cume_dist(7500) within group order by (salary) * 100, 0) offer_pctileIf the new candidate joins the sales force (a.k.a when his record is entered into this table), his salary will be in the 64th percentile. The

from employees

OFFER_PCTILE

------------

64

*PERCENT_RANK()*function is similar to

*CUME_DIST()*function - see the SQL Reference Manual in the Oracle docs for details.

**Measures of Central Tendency - Mean, Median, Mode**

**Mean or Average**= sum(all observations)/ count_of_observations. Note the difference between sample mean and population mean for use later.

EX> Compute average salary of employees in the company.

EX> Compute average quantity sold of products from all sale transactions.

select min(salary) min_s, max(salary) max_s,

round(avg(salary),2) avg_s

from HR.employees;

select prod_id, count(*), sum(quantity_sold) sum_qty,We can also compute the

avg(quantity_sold) avg_qty

from SH.sales

group by prod_id

order by prod_id;

**Mean of grouped data from a frequency distribution**using

x_bar = (Σ

_{_1_to_m}(freq

_{i}* x

_{i}))/Σ

_{_1_to_m}(freq

_{i}) where m is the number of classes.

EX> Compute the mean of the frequency distribution discussed above.

with Q_freq_dist as (

select intvl, count(*) freq

from (select width_bucket(salary,

(select min(salary) from employees),

(select max(salary)+1 from employees), 10) intvl

from HR.employees)

group by intvl

order by intvl)

--

select sum(intvl * freq)/sum(freq)

from Q_freq_dist

**Median**is the value in a dataset in which half the observations have a higher value and the other half a lower value.

EX> Compute the median salary per department.

select department_id,median(salary)

from HR.employees

group by department_id;

**Mode**is the observation in the dataset that occurs most frequently.

EX> Compute the mode per department.

select department_id,stats_mode(salary)

from HR.employees

group by department_id;

**Measures of Dispersion**

These metrics measure how a set of data values are dispersed around (a k a deviate from) the measure of central tendency (typically the mean).

**Variance**represents the relative distance between the data points and the mean of a dataset, and is computed as the sum of squared deviation of each data point from the mean. σ

^{2}= (Σ

_{1_to_n}(x

_{i}- x_bar)²)/(n-1)

EX> Compute the variance, sample variance, and population variance of employee salaries.

select variance(salary), var_samp(salary), var_pop(salary)

from HR.employees;

**Standard deviation**is the square root of variance, and generally more useful than variance because it is in units of the original data - rather than its square.

EX> Compute the standard deviation, sample standard deviation, and population standard deviation of employee salaries.

select stddev(salary), stddev_samp(salary), stddev_pop(salary)For several commonly occuring datasets, the values tend to cluster around the mean or median - so that the distribution takes on the shape of a symmetrical bell curve. In such situations, the

from HR.employees;

*Chebyshev's Theorem*states that

*at least*(1 - 1/k²) x 100%, k > 1, will fall within k-standard-deviations from the mean. If the distribution is more or les symmetrical about the mean (i.e. closer to

*normal distribution*) then the more rigorous

*empirical*rule states that

*approximately*68% of the data will fall within 1 standard deviation of the mean, and 95% of the data within 1 standard deviations of the mean, and 99.7% of the data within 3 standard deviations of the mean.

**Skewness**is a measure of the unevenness or asymmetry of the distribution. For a perfectly symmetric distribution, the mean is the same as mode is same as the median. Generally, for a skewed distribution, it is observed that the mean is to side of the median is to the side of the mode - the side being right (=ve skew) or left (-ve skew) depending on the skew. Skewness of a given population is computed by Skew = Σ

_{1_to_N}(x

_{i}- μ)/σ)³/N, where μ is the population mean, or σ is the standard deviation.

**Kurtosis**is the measure of peakedness of the distribution of the population. Kurtosis = Σ

_{1_to_N}(x

_{i}- μ)/σ)

^{4}/N.

EX> Compute the skewness and kurtosis for the employee salary.

-- SkewnessThis brevity is made possible thanks to Oracle SQL analytical functions (

select avg(val3)

from (select power((salary - avg(salary) over ())/

stddev(salary) over (), 3) val3

from employee);

--

-- Kurtosis

select avg(val4)

from (select power((salary - avg(salary) over ())/

stddev(salary) over (), 4) val4

from employee);

*avg(salary) over ()*implies compute average over all rows in the employees). Note that you can use the partition and order by clauses inside the () to group data across different employee classes and scope the computation to the group if required. From a performance standpoint, note that the computation intrinsically demands multiple scans (2 at the least) over the employees table (or the tables behind the employees view) - so for a potentially large employees table, the computations may involve use of sort (i.e. on-disk) memory rather than just the PGA (the database' equivalent of a PC's RAM). For more on such performance implications of SQL, begin your education with Oracle Concepts or SQL Tuning Guide, or look up the various references provided here.