Analyzing your data for Distributions

In the previous post, we looked at continuous probability distributions. You can determine if the data in a given column in a given table follows a particular distribution using routines in the DBMS_STAT_FUNCS package. The example shows a table with four columns - each pre-loaded with data synthesized to fit a particular continuous distribution.

You can substitute the table name and column name for your experiment. A minor annoyance in the design of this API - most other Oracle packages will default to the current session name, but this one explicitly expects the schema name as input - which, in my case, is DMUSER - please change this to your schema name accordingly.

set echo on;
set serveroutput on;
create table disttab (
num1 number, num2 number, num3 number, num4 number, cat1 varchar2(2));
rem num1: Numbers generated based on normal distribution
rem num2: Numbers generated based on exponential distribution
rem num3: Numbers generated based on weibull distribution
rem num4: Numbers generated based on uniform distribution
insert into disttab values (9.604955, 1.05536, 4.126087, 22.950835, 'AA');
insert into disttab values (13.022139, 1.714142, 4.999804, 32.598089, 'AA');
insert into disttab values (11.572116, 3.697564, 2.81854, 24.552021, 'AA');
insert into disttab values (9.817124, 1.530935, 2.131106, 6.359504, 'AA');
insert into disttab values (10.146569, 3.295829, 1.510639, 25.218639, 'AA');
insert into disttab values (11.280488, 0.721109, 3.145515, 23.672146, 'BB');
insert into disttab values (9.26679, 1.390282, 4.074397, 11.262112, 'BB');
insert into disttab values (14.303472, 1.327971, 2.51907, 22.675373, 'BB');
insert into disttab values (11.686556, 0.225337, 2.941825, 23.582254, 'BB');
insert into disttab values (13.124479, 7.265271, 0.945059, 29.18001, 'BB');
insert into disttab values (8.601027, 7.060104, 6.078573, 14.878128, 'BB');
insert into disttab values (12.241662, 0.257739, 3.395142, 31.148244, 'CC');
insert into disttab values (13.781857, 4.281371, 1.349627, 23.862069, 'CC');
insert into disttab values (7.827007, 1.347487, 5.836949, 10.76229, 'CC');
insert into disttab values (9.106408, 1.253113, 5.116857, 6.594224, 'CC');
insert into disttab values (11.066785, 4.56512, 3.393899, 22.435955, 'CC');
insert into disttab values (10.71079, 2.700015, 1.922642, 7.635145, 'DD');
insert into disttab values (9.13019, 5.199126, 3.763481, 32.061213, 'DD');
insert into disttab values (7.873859, 0.978657, 2.268487, 1.030052, 'DD');
insert into disttab values (7.731724, 2.382977, 2.639425, 5.676622, 'DD');
insert into disttab values (12.828234, 1.867099, 3.99808, 26.000458, 'DD');
insert into disttab values (12.125892, 1.01285, 3.345311, 8.026281, 'DD');
insert into disttab values (9.800528, 5.869301, 3.840932, 29.928523, 'EE');
insert into disttab values (10.605782, 3.145211, 2.13718, 27.398604, 'EE');
insert into disttab values (14.054569, 4.089033, 2.436408, 4.483585, 'EE');
insert into disttab values (8.120606, 2.155303, 1.787835, 19.513588, 'EE');
insert into disttab values (13.093059, 0.220456, 3.456848, 24.855135, 'EE');
insert into disttab values (8.421441, 2.4819, 2.817669, 21.137668, 'FF');
insert into disttab values (11.899697, 2.507618, 3.770983, 4.016285, 'FF');
insert into disttab values (9.601342, 1.12639, 3.21053, 28.643809, 'FF');
insert into disttab values (9.32297, 10.003288,6.890515, 33.67171, 'FF');
insert into disttab values (6.896019, 10.76641, 3.123496, 29.077463, 'FF');
insert into disttab values (12.542443, 0.228756, 4.081015, 33.542652, 'FF');
insert into disttab values (14.038144, 7.326175, 3.53459, 11.731359, 'FF');
Let us first start with Normal Distribution. The mean (same as E(x)) and standard deviation are computed for the column num1 and provided as input.
declare
mean number;
stdev number;
sig number;
begin
select avg(num1) into mean from disttab;
select stddev(num1) into stdev from disttab;
dbms_output.put_line('NORMAL DISTRIBUTION - SHAPIRO_WILKS');
dbms_stat_funcs.normal_dist_fit(
'dmuser', 'disttab', 'num1', 'SHAPIRO_WILKS', mean, stdev, sig);
dbms_output.put_line('Mean : ' || round(mean, 4));
dbms_output.put_line('Stddev : ' || round(stdev, 4));
dbms_output.put_line('Sig : ' || to_char(sig,'9.9999'));
end;
/
The output of this procedure looks like this:
NORMAL DISTRIBUTION - SHAPIRO_WILKS
W value : .9601472834631918774434537597940068448938
Mean : 10.7426
Stddev : 2.1094
Sig : .2456
The high value for the W returned from the Shapiro-Wilks test (explained here) indicates an agreement with the null hypothesis that the data follows normal distribution. The significance (0.25) is also relatively high for the Shapiro-Wilks test (0.05 being the typical threshold).

Next, Exponential distribution. For the above dataset, lambda - i.e. the rate of arrival - is provided (Note that lambda is NOT simply 1/mean(num3), it is 1/E(x) and we don't know how the expected value was computed)
declare
lambda number;
mu number;
sig number;
begin
lambda := 0.3194;
-- select 1/mean into lambda from (select mean(num2) from disttab);
mu := NULL;
sig := NULL;
dbms_output.put_line('EXPONENTIAL DISTRIBUTION - KOLMOGOROV_SMIRNOV');
dbms_stat_funcs.exponential_dist_fit(
'dmuser', 'disttab', 'num3', 'KOLMOGOROV_SMIRNOV', lambda, mu, sig);
dbms_output.put_line('Lambda : ' || lambda);
dbms_output.put_line('Mu : ' || mu);
dbms_output.put_line('Sig : ' || to_char(sig,'9.9999'));
end;
/
The output shows:
EXPONENTIAL DISTRIBUTION - KOLMOGOROV_SMIRNOV
D value : .0919745379005003387429254774811052604723
Lambda : .3194
Mu : 0
Sig : .9237466160
The low D value (0.09) and high significance (0.92) suggests that the data fits an exponential distribution well.

Next, Weibull Distribution. The alpha input parameter to the procedure corresponds to η - the scale, and the mu to γ - the location - in the Weibull probability density function discussed earlier.
declare
alpha number;
beta number;
begin
alpha := 3;
mu := 0;
beta := 4;
sig := NULL;
dbms_output.put_line('.');
dbms_output.put_line('WEIBULL DISTRIBUTION - KOLMOGOROV_SMIRNOV');
dbms_stat_funcs.weibull_dist_fit(
'dmuser', 'disttab', 'num3', 'KOLMOGOROV_SMIRNOV', alpha, mu, beta, sig);
dbms_output.put_line('Alpha : ' || alpha);
dbms_output.put_line('Mu : ' || mu);
dbms_output.put_line('Beta : ' || beta);
dbms_output.put_line('Sig : ' || to_char(sig,'9.9999'));
end;
/
The output shows:
WEIBULL DISTRIBUTION - KOLMOGOROV_SMIRNOV
D value : .2575286246007637604103723582952313687414
Alpha : 3
Mu : 0
Beta : 4
Sig : .0177026134
The Kolmogorov-Smirnov test does not appear emphatic enough, we will try other tests later.

Next, Uniform Distribution.
declare
A number;
B number;
begin
A := 1;
B := 34;
sig := NULL;
dbms_output.put_line('.');
dbms_output.put_line('UNIFORM DISTRIBUTION - KOLMOGOROV_SMIRNOV');
dbms_stat_funcs.uniform_dist_fit(
'dmuser', 'disttab', 'num4', 'CONTINUOUS', 'KOLMOGOROV_SMIRNOV', A, B, sig);
dbms_output.put_line('A : ' || A);
dbms_output.put_line('B : ' || B);
dbms_output.put_line('Sig : ' || to_char(sig,'9.9999'));
end;
/
The output shows:
UNIFORM DISTRIBUTION - KOLMOGOROV_SMIRNOV
D value : .2083979233511586452762923351158645276292
A : 1
B : 34
Sig : .0904912415
The Kolmogorov-Smirnov test does not appear emphatic enough, we will try other tests later.

Finally, let us try fitting a small dataset to Poisson Distribution.
create table pdisttab (num1 number);
insert into pdisttab values(1);
insert into pdisttab values(2);
insert into pdisttab values(3);
insert into pdisttab values(4);
insert into pdisttab values(5);
declare
mean number;
stdev number;
sig number;
begin
mean := 3.0;
stdev := 1.58114;
dbms_output.put_line('.');
dbms_output.put_line('POISSON DISTRIBUTION - KOLMOGOROV_SMIRNOV');
dbms_stat_funcs.poisson_dist_fit(
'dmuser', 'pdisttab', 'num1', 'KOLMOGOROV_SMIRNOV', mean, sig);
dbms_output.put_line('Mean : ' || mean);
dbms_output.put_line('Stddev : ' || stdev);
dbms_output.put_line('Sig : ' || to_char(sig,'9.9999'));
end;
/
The output of:
POISSON DISTRIBUTION - KOLMOGOROV_SMIRNOV
D value : .08391793845975204
Mean : 3
Stddev : 1.58114
Sig : .9999999999
emphatically is in agreement with the null hypothesis that the data fits a Poisson distribution.

That was fun, was'n't it? Now, we have introduced several new terms and concepts here (esp if you reviewed the DBMS_STAT_FUNCS docs) - viz. goodness-of-fit testing if a given data sample fits a particular distribution based on a null hypothesis, using various test types, the test metrics, and the significance output of a test, and various other parameters. In the upcoming posts, we will try to understand these new concepts. I also presume that the various references in the previous post and this one may have been useful to experienced/intermediate statisticians also.

1 comment:

Anonymous said...

Hi Ram:

Any more info on how to interpret the sig values returned by the dbms_stat_funcs.normal_dist_fit function. If I want to be 95% sure that the data is normally distributed should I be making sure the that returned sig is 0.05 or less? Also what is this W value and how to interpret it?

Thanks,
Arun