Descriptive Statistics

Descriptive statistics summarizes and displays information about just the observations at hand - i.e. the sample and population are the same. Using data in the sample schema, we will quickly run through descriptive statistics concepts.

Displaying descriptive statistics
Frequency distribution is a table that organizes a number of values into intervals (classes). A histogram is the visual equivalent of frequency distribution - a bar graph that represents the number of observations in each class as the height of each bar.
EX> Compute the frequency distribution of all employee salaries - across 10 classes.

select intvl, count(*) freq
from (select width_bucket(salary,
(select min(salary) from employees),
(select max(salary)+1 from employees), 10) intvl
from HR.employees)
group by intvl
order by intvl;
-- Result from this query
INTVL FREQ
----- --------
1 46
2 10
3 23
4 15
5 8
6 2
7 2
10 1
8 rows selected.
Relative frequency distribution provides the number of observations in each class as a percentage of the total number of observations.
EX> Compute the relative frequency distribution of all employee salaries - across 10 classes.
with Q_freq_dist as
(select intvl, count(*) freq
from (select width_bucket(salary,
(select min(salary) from employees),
(select max(salary)+1 from employees), 10) intvl
from HR.employees)
group by intvl
order by intvl)
--
select intvl, freq/(select count(*) from employees) rel_freq
from Q_freq_dist;
-- Result from this query
INTVL REL_FREQ
----- ----------
1 .429906542
2 .093457944
3 .214953271
4 .140186916
5 .074766355
6 .018691589
7 .018691589
10 .009345794
8 rows selected.
Cumulative frequency distribution provides the percentage of observations that are less than or equal to the class of interest.
EX> Compute the cumulative frequency distribution of all employee salaries - across 10 classes.
with Q_freq_dist as
(select intvl, count(*) freq
from (select width_bucket(salary,
(select min(salary) from employees),
(select max(salary)+1 from employees), 10) intvl
from HR.employees)
group by intvl
order by intvl),
--
Q_rel_freq_dist as
(select intvl, freq/(select count(*) from employees) rel_freq
from Q_freq_dist)
--
select intvl,
sum(rel_freq) over (order by intvl
rows between unbounded preceding and current row) cum_freq
from Q_rel_freq_dist;
-- Result from this query
INTVL CUM_FREQ
----- ----------
1 .429906542
2 .523364486
3 .738317757
4 .878504673
5 .953271028
6 .971962617
7 .990654206
10 1
8 rows selected.
The width_bucket function creates equi-width histograms. A complementary function, NTILE, helps you bin values into intervals of equal height - i.e. with same count of values in each bin. This is useful for computing quartiles, quintiles etc.
EX> The following query bins the 107 records/observations into approx 10 equi-height bins.
select htbin, count(*) ht
from (select ntile(10) over (order by salary) htbin
from hr.employees)
group by htbin
order by htbin;
HTBIN HT
----- ------
1 11
2 11
3 11
4 11
5 11
6 11
7 11
8 10
9 10
10 10
10 rows selected.
Anyone who has given a competitive exam (SAT thro GRE/GMAT) should be familiar with the term percentile. Given n data points, the Pth percentile represents the value which resides above P% of the values, and is given by percentile = (n+1) * P/100
EX> Assume a student took the GRE test in 2004 and received an overall GRE score is 2150 (out of the possible 2400) and ranked at 89th percentile. What does this mean? Based on information provided here, it means, of the worldwide total of 408,948, the student ranked above (408948+1) * 89/100 ~= 363965 examinees, and conversely, 408948-363964= 44982 students ranked above this student. You can compute percentiles in Oracle SQL using CUME_DIST() as follows.
EX> Rank all sales reps and their managers by their salary percentile.
select job_id, last_name, salary,
round(cume_dist() over
(partition by job_id order by salary) * 100) as pctile
from employees
where job_id like '%SA_%'
order by job_id, pctile desc, salary;
JOB_ID LAST_NAME SALARY PCTILE
------ -------------- ------ ------
SA_MAN Russell 14000 100
SA_MAN Partners 13500 80
SA_MAN Errazuriz 12000 60
SA_MAN Cambrault 11000 40
SA_MAN Zlotkey 10500 20
SA_REP Ozer 11500 100
SA_REP Abel 11000 97
SA_REP Vishney 10500 93
SA_REP Tucker 10000 90
SA_REP Bloom 10000 90
SA_REP King 10000 90
...
The partition by job_id enables to rank within job_id. You can infer things like Ms. Bloom has 90% of her fellow sales rep earning as much or less than her (order by salary clause), Mr. Zlotkey is in the last quintile for managers, and so on.
EX> Suppose a new job candidate names his salary requirements - say 7.5K - you can use the same function - in what Oracle calls its aggregate usage to find out what percentile this salary number would fit in, as follows:
select round(cume_dist(7500) within group order by (salary) * 100, 0) offer_pctile
from employees
OFFER_PCTILE
------------
64
If the new candidate joins the sales force (a.k.a when his record is entered into this table), his salary will be in the 64th percentile. The PERCENT_RANK() function is similar to CUME_DIST() function - see the SQL Reference Manual in the Oracle docs for details.

Measures of Central Tendency - Mean, Median, Mode
Mean or Average = sum(all observations)/ count_of_observations. Note the difference between sample mean and population mean for use later.
EX> Compute average salary of employees in the company.

select min(salary) min_s, max(salary) max_s,
round(avg(salary),2) avg_s
from HR.employees;
EX> Compute average quantity sold of products from all sale transactions.
select prod_id, count(*), sum(quantity_sold) sum_qty,
avg(quantity_sold) avg_qty
from SH.sales
group by prod_id
order by prod_id;
We can also compute the Mean of grouped data from a frequency distribution using
x_bar = (Σ_1_to_m(freqi * xi))/Σ_1_to_m(freqi) where m is the number of classes.
EX> Compute the mean of the frequency distribution discussed above.
with Q_freq_dist as (
select intvl, count(*) freq
from (select width_bucket(salary,
(select min(salary) from employees),
(select max(salary)+1 from employees), 10) intvl
from HR.employees)
group by intvl
order by intvl)
--
select sum(intvl * freq)/sum(freq)
from Q_freq_dist
Median is the value in a dataset in which half the observations have a higher value and the other half a lower value.
EX> Compute the median salary per department.
select department_id, median(salary)
from HR.employees
group by department_id;
Mode is the observation in the dataset that occurs most frequently.
EX> Compute the mode per department.
select department_id, stats_mode(salary)
from HR.employees
group by department_id;
Measures of Dispersion
These metrics measure how a set of data values are dispersed around (a k a deviate from) the measure of central tendency (typically the mean). Variance represents the relative distance between the data points and the mean of a dataset, and is computed as the sum of squared deviation of each data point from the mean. σ2 = (Σ1_to_n (xi - x_bar)²)/(n-1)
EX> Compute the variance, sample variance, and population variance of employee salaries.
select variance(salary), var_samp(salary), var_pop(salary)
from HR.employees;
Standard deviation is the square root of variance, and generally more useful than variance because it is in units of the original data - rather than its square.
EX> Compute the standard deviation, sample standard deviation, and population standard deviation of employee salaries.
select stddev(salary), stddev_samp(salary), stddev_pop(salary)
from HR.employees;
For several commonly occuring datasets, the values tend to cluster around the mean or median - so that the distribution takes on the shape of a symmetrical bell curve. In such situations, the Chebyshev's Theorem states that at least (1 - 1/k²) x 100%, k > 1, will fall within k-standard-deviations from the mean. If the distribution is more or les symmetrical about the mean (i.e. closer to normal distribution) then the more rigorous empirical rule states that approximately 68% of the data will fall within 1 standard deviation of the mean, and 95% of the data within 1 standard deviations of the mean, and 99.7% of the data within 3 standard deviations of the mean.

Skewness is a measure of the unevenness or asymmetry of the distribution. For a perfectly symmetric distribution, the mean is the same as mode is same as the median. Generally, for a skewed distribution, it is observed that the mean is to side of the median is to the side of the mode - the side being right (=ve skew) or left (-ve skew) depending on the skew. Skewness of a given population is computed by Skew = Σ1_to_N (xi - μ)/σ)³/N, where μ is the population mean, or σ is the standard deviation.
Kurtosis is the measure of peakedness of the distribution of the population. Kurtosis = Σ1_to_N(xi - μ)/σ)4/N.
EX> Compute the skewness and kurtosis for the employee salary.
-- Skewness
select avg(val3)
from (select power((salary - avg(salary) over ())/
stddev(salary) over (), 3) val3
from employee);
--
-- Kurtosis
select avg(val4)
from (select power((salary - avg(salary) over ())/
stddev(salary) over (), 4) val4
from employee);
This brevity is made possible thanks to Oracle SQL analytical functions (avg(salary) over () implies compute average over all rows in the employees). Note that you can use the partition and order by clauses inside the () to group data across different employee classes and scope the computation to the group if required. From a performance standpoint, note that the computation intrinsically demands multiple scans (2 at the least) over the employees table (or the tables behind the employees view) - so for a potentially large employees table, the computations may involve use of sort (i.e. on-disk) memory rather than just the PGA (the database' equivalent of a PC's RAM). For more on such performance implications of SQL, begin your education with Oracle Concepts or SQL Tuning Guide, or look up the various references provided here.

No comments: