Numoracle Recipes: Analyzing your data for Distributions

In the previous post, we looked at continuous probability distributions. You can determine if the data in a given column in a given table follows a particular distribution using routines in the DBMS_STAT_FUNCS package. The example shows a table with four columns - each pre-loaded with data synthesized to fit a particular continuous distribution.

You can substitute the table name and column name for your experiment. A minor annoyance in the design of this API - most other Oracle packages will default to the current session name, but this one explicitly expects the schema name as input - which, in my case, is DMUSER - please change this to your schema name accordingly.

set echo on;
set serveroutput on;
create table disttab (
  num1 number, num2 number, num3 number, num4 number, cat1 varchar2(2));
rem num1: Numbers generated based on normal distribution
rem num2: Numbers generated based on exponential distribution
rem num3: Numbers generated based on weibull distribution
rem num4: Numbers generated based on uniform distribution
insert into disttab values (9.604955,  1.05536,  4.126087, 22.950835, 'AA');
insert into disttab values (13.022139, 1.714142, 4.999804, 32.598089, 'AA');
insert into disttab values (11.572116, 3.697564, 2.81854,  24.552021, 'AA');
insert into disttab values (9.817124,  1.530935, 2.131106, 6.359504,  'AA');
insert into disttab values (10.146569, 3.295829, 1.510639, 25.218639, 'AA');
insert into disttab values (11.280488, 0.721109, 3.145515, 23.672146, 'BB');
insert into disttab values (9.26679,   1.390282, 4.074397, 11.262112, 'BB');
insert into disttab values (14.303472, 1.327971, 2.51907,  22.675373, 'BB');
insert into disttab values (11.686556, 0.225337, 2.941825, 23.582254, 'BB');
insert into disttab values (13.124479, 7.265271, 0.945059, 29.18001,  'BB');
insert into disttab values (8.601027,  7.060104, 6.078573, 14.878128, 'BB');
insert into disttab values (12.241662, 0.257739, 3.395142, 31.148244, 'CC');
insert into disttab values (13.781857, 4.281371, 1.349627, 23.862069, 'CC');
insert into disttab values (7.827007,  1.347487, 5.836949, 10.76229,  'CC');
insert into disttab values (9.106408,  1.253113, 5.116857, 6.594224,  'CC');
insert into disttab values (11.066785, 4.56512,  3.393899, 22.435955, 'CC');
insert into disttab values (10.71079,  2.700015, 1.922642, 7.635145,  'DD');
insert into disttab values (9.13019,   5.199126, 3.763481, 32.061213, 'DD');
insert into disttab values (7.873859,  0.978657, 2.268487, 1.030052,  'DD');
insert into disttab values (7.731724,  2.382977, 2.639425, 5.676622,  'DD');
insert into disttab values (12.828234, 1.867099, 3.99808,  26.000458, 'DD');
insert into disttab values (12.125892, 1.01285,  3.345311, 8.026281,  'DD');
insert into disttab values (9.800528,  5.869301, 3.840932, 29.928523, 'EE');
insert into disttab values (10.605782, 3.145211, 2.13718,  27.398604, 'EE');
insert into disttab values (14.054569, 4.089033, 2.436408, 4.483585,  'EE');
insert into disttab values (8.120606,  2.155303, 1.787835, 19.513588, 'EE');
insert into disttab values (13.093059, 0.220456, 3.456848, 24.855135, 'EE');
insert into disttab values (8.421441,  2.4819,   2.817669, 21.137668, 'FF');
insert into disttab values (11.899697, 2.507618, 3.770983, 4.016285,  'FF');
insert into disttab values (9.601342,  1.12639,  3.21053,  28.643809, 'FF');
insert into disttab values (9.32297,   10.003288,6.890515, 33.67171,  'FF');
insert into disttab values (6.896019,  10.76641, 3.123496, 29.077463, 'FF');
insert into disttab values (12.542443, 0.228756, 4.081015, 33.542652, 'FF');
insert into disttab values (14.038144, 7.326175, 3.53459,  11.731359, 'FF');

Let us first start with Normal Distribution. The mean (same as E(x)) and standard deviation are computed for the column num1 and provided as input.

declare
 mean   number;
 stdev  number;
 sig    number;
begin
 select avg(num1) into mean from disttab;
 select stddev(num1) into stdev from disttab;
 dbms_output.put_line('NORMAL DISTRIBUTION - SHAPIRO_WILKS');
 dbms_stat_funcs.normal_dist_fit(
  'dmuser', 'disttab', 'num1', 'SHAPIRO_WILKS', mean, stdev, sig);
 dbms_output.put_line('Mean    : ' || round(mean, 4));
 dbms_output.put_line('Stddev  : ' || round(stdev, 4));
 dbms_output.put_line('Sig     : ' || to_char(sig,'9.9999'));
end;
/

The output of this procedure looks like this:
NORMAL DISTRIBUTION - SHAPIRO_WILKS
W value : .9601472834631918774434537597940068448938
Mean : 10.7426
Stddev : 2.1094
Sig : .2456
The high value for the W returned from the Shapiro-Wilks test (explained here) indicates an agreement with the null hypothesis that the data follows normal distribution. The significance (0.25) is also relatively high for the Shapiro-Wilks test (0.05 being the typical threshold).

Next, Exponential distribution. For the above dataset, lambda - i.e. the rate of arrival - is provided (Note that lambda is NOT simply 1/mean(num3), it is 1/E(x) and we don't know how the expected value was computed)

declare
  lambda number;
  mu     number;
  sig    number;
begin
 lambda := 0.3194;
 -- select 1/mean into lambda from (select mean(num2) from disttab);
 mu     := NULL;
 sig    := NULL;
 dbms_output.put_line('EXPONENTIAL DISTRIBUTION - KOLMOGOROV_SMIRNOV');
 dbms_stat_funcs.exponential_dist_fit(
  'dmuser', 'disttab', 'num3', 'KOLMOGOROV_SMIRNOV', lambda, mu, sig);
 dbms_output.put_line('Lambda  : ' || lambda);
 dbms_output.put_line('Mu      : ' || mu);
 dbms_output.put_line('Sig     : ' || to_char(sig,'9.9999'));
end;
/

The output shows:
EXPONENTIAL DISTRIBUTION - KOLMOGOROV_SMIRNOV
D value : .0919745379005003387429254774811052604723
Lambda : .3194
Mu : 0
Sig : .9237466160
The low D value (0.09) and high significance (0.92) suggests that the data fits an exponential distribution well.

Next, Weibull Distribution. The alpha input parameter to the procedure corresponds to η - the scale, and the mu to γ - the location - in the Weibull probability density function discussed earlier.

declare
  alpha  number;
  beta   number;
begin
 alpha := 3;
 mu    := 0;
 beta  := 4; 
 sig   := NULL;
 dbms_output.put_line('.');
 dbms_output.put_line('WEIBULL DISTRIBUTION - KOLMOGOROV_SMIRNOV');
 dbms_stat_funcs.weibull_dist_fit(
  'dmuser', 'disttab', 'num3', 'KOLMOGOROV_SMIRNOV', alpha, mu, beta, sig);
 dbms_output.put_line('Alpha   : ' || alpha);
 dbms_output.put_line('Mu      : ' || mu);
 dbms_output.put_line('Beta    : ' || beta);
 dbms_output.put_line('Sig     : ' || to_char(sig,'9.9999'));
end;
/

The output shows:
WEIBULL DISTRIBUTION - KOLMOGOROV_SMIRNOV
D value : .2575286246007637604103723582952313687414
Alpha : 3
Mu : 0
Beta : 4
Sig : .0177026134
The Kolmogorov-Smirnov test does not appear emphatic enough, we will try other tests later.

Next, Uniform Distribution.

declare
  A      number;
  B      number;
begin
 A   := 1;
 B   := 34;
 sig := NULL;
 dbms_output.put_line('.');
 dbms_output.put_line('UNIFORM DISTRIBUTION - KOLMOGOROV_SMIRNOV');
 dbms_stat_funcs.uniform_dist_fit(
  'dmuser', 'disttab', 'num4', 'CONTINUOUS', 'KOLMOGOROV_SMIRNOV', A, B, sig);
 dbms_output.put_line('A    : ' || A);
 dbms_output.put_line('B    : ' || B);
 dbms_output.put_line('Sig     : ' || to_char(sig,'9.9999'));
end;
/

The output shows:
UNIFORM DISTRIBUTION - KOLMOGOROV_SMIRNOV
D value : .2083979233511586452762923351158645276292
A : 1
B : 34
Sig : .0904912415
The Kolmogorov-Smirnov test does not appear emphatic enough, we will try other tests later.

Finally, let us try fitting a small dataset to Poisson Distribution.

create table pdisttab (num1 number);
insert into pdisttab values(1);
insert into pdisttab values(2);
insert into pdisttab values(3);
insert into pdisttab values(4);
insert into pdisttab values(5);
declare
  mean   number;
  stdev  number;
  sig    number;
begin
 mean  := 3.0;
 stdev := 1.58114;
 dbms_output.put_line('.');
 dbms_output.put_line('POISSON DISTRIBUTION - KOLMOGOROV_SMIRNOV');
 dbms_stat_funcs.poisson_dist_fit(
  'dmuser', 'pdisttab', 'num1', 'KOLMOGOROV_SMIRNOV', mean, sig);
 dbms_output.put_line('Mean    : ' || mean);
 dbms_output.put_line('Stddev  : ' || stdev);
 dbms_output.put_line('Sig     : ' || to_char(sig,'9.9999'));
end;
/

The output of:
POISSON DISTRIBUTION - KOLMOGOROV_SMIRNOV
D value : .08391793845975204
Mean : 3
Stddev : 1.58114
Sig : .9999999999
emphatically is in agreement with the null hypothesis that the data fits a Poisson distribution.

That was fun, was'n't it? Now, we have introduced several new terms and concepts here (esp if you reviewed the DBMS_STAT_FUNCS docs) - viz. goodness-of-fit testing if a given data sample fits a particular distribution based on a null hypothesis, using various test types, the test metrics, and the significance output of a test, and various other parameters. In the upcoming posts, we will try to understand these new concepts. I also presume that the various references in the previous post and this one may have been useful to experienced/intermediate statisticians also.

1 comment:

Anonymous said...: Hi Ram:

Any more info on how to interpret the sig values returned by the dbms_stat_funcs.normal_dist_fit function. If I want to be 95% sure that the data is normally distributed should I be making sure the that returned sig is 0.05 or less? Also what is this W value and how to interpret it?

Thanks,
Arun; April 14, 2008 at 3:52 PM