Descriptive statistics summarizes and displays information about just the observations at hand - i.e. the sample and the population are the same. Using data in the sample schemas, we will quickly run through descriptive statistics concepts.

**Displaying descriptive statistics**

A *frequency distribution* is a table that organizes a number of values into intervals (classes). A *histogram* is the visual equivalent of a frequency distribution - a bar graph that represents the number of observations in each class as the height of each bar.

EX> Compute the frequency distribution of all employee salaries - across 10 classes.

select intvl, count(*) freq
from (select width_bucket(salary,
             (select min(salary) from employees),
             (select max(salary)+1 from employees), 10) intvl
      from HR.employees)
group by intvl
order by intvl;

-- Result from this query
INTVL FREQ
----- --------
    1       46
    2       10
    3       23
    4       15
    5        8
    6        2
    7        2
   10        1

8 rows selected.
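For intuition, here is a minimal Python sketch of the equi-width binning that width_bucket performs. The salary values and the helper function are made up for illustration - they are not the HR schema data or an Oracle API.

```python
from collections import Counter

def width_bucket(x, lo, hi, num_buckets):
    # Mirrors Oracle's width_bucket semantics: buckets 1..num_buckets
    # of equal width over [lo, hi); values below lo land in bucket 0,
    # values at or above hi land in the overflow bucket num_buckets + 1.
    if x < lo:
        return 0
    if x >= hi:
        return num_buckets + 1
    return int((x - lo) / (hi - lo) * num_buckets) + 1

# Made-up salary sample (not the HR.employees data)
salaries = [2100, 2500, 3000, 4800, 6000, 9000, 13000, 17000, 24000]
lo, hi = min(salaries), max(salaries) + 1  # +1 keeps max(salary) inside bucket 10
freq = Counter(width_bucket(s, lo, hi, 10) for s in salaries)
print(sorted(freq.items()))
```

Note how the max(salary)+1 upper bound in the SQL plays the same role as `hi` here: without it, the maximum salary would fall into the overflow bucket.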

A *relative frequency distribution* provides the number of observations in each class as a percentage of the total number of observations.

EX> Compute the relative frequency distribution of all employee salaries - across 10 classes.

with Q_freq_dist as
(select intvl, count(*) freq
 from (select width_bucket(salary,
              (select min(salary) from employees),
              (select max(salary)+1 from employees), 10) intvl
       from HR.employees)
 group by intvl
 order by intvl)
--
select intvl, freq/(select count(*) from employees) rel_freq
from Q_freq_dist;

-- Result from this query
INTVL REL_FREQ
----- ----------
    1 .429906542
    2 .093457944
    3 .214953271
    4 .140186916
    5 .074766355
    6 .018691589
    7 .018691589
   10 .009345794

8 rows selected.

A *cumulative frequency distribution* provides the percentage of observations that are less than or equal to the class of interest.

EX> Compute the cumulative frequency distribution of all employee salaries - across 10 classes.

with Q_freq_dist as
(select intvl, count(*) freq
 from (select width_bucket(salary,
              (select min(salary) from employees),
              (select max(salary)+1 from employees), 10) intvl
       from HR.employees)
 group by intvl
 order by intvl),
--
Q_rel_freq_dist as
(select intvl, freq/(select count(*) from employees) rel_freq
 from Q_freq_dist)
--
select intvl,
       sum(rel_freq) over (order by intvl
           rows between unbounded preceding and current row) cum_freq
from Q_rel_freq_dist;

-- Result from this query
INTVL CUM_FREQ
----- ----------
    1 .429906542
    2 .523364486
    3 .738317757
    4 .878504673
    5 .953271028
    6 .971962617
    7 .990654206
   10          1

8 rows selected.

The width_bucket function creates equi-width histograms. A complementary function, *NTILE*, helps you bin values into intervals of equal height - i.e. with the same count of values in each bin. This is useful for computing *quartiles*, *quintiles*, etc.

EX> The following query bins the 107 records/observations into approximately 10 *equi-height* bins.

select htbin, count(*) ht
from (select ntile(10) over (order by salary) htbin
      from hr.employees)
group by htbin
order by htbin;

HTBIN HT
----- ------
    1     11
    2     11
    3     11
    4     11
    5     11
    6     11
    7     11
    8     10
    9     10
   10     10

10 rows selected.
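The bucket sizes above (seven bins of 11, three of 10) follow from how NTILE deals rows out. A short Python sketch of that dealing logic - the function here is a hypothetical illustration, not an Oracle API:

```python
def ntile(values, num_bins):
    # NTILE-style equi-height binning: sort, then split into num_bins
    # groups whose sizes differ by at most one, giving the earlier
    # buckets the extra rows.
    ordered = sorted(values)
    base, extra = divmod(len(ordered), num_bins)
    bins, start = [], 0
    for i in range(num_bins):
        size = base + (1 if i < extra else 0)
        bins.append(ordered[start:start + size])
        start += size
    return bins

# 107 rows, like HR.employees: 107 = 10*10 + 7, so 7 bins get an extra row
print([len(b) for b in ntile(range(107), 10)])
# [11, 11, 11, 11, 11, 11, 11, 10, 10, 10]
```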

Anyone who has taken a competitive exam (SAT through GRE/GMAT) should be familiar with the term *percentile*. Given *n* data points, the *P*th percentile represents the value which resides above *P*% of the values, and its position is given by

*position = (n+1) * P/100*

EX> Assume a student took the GRE test in 2004 and received an overall GRE score of 2150 (out of a possible 2400), ranking at the 89th percentile. What does this mean? Based on the information provided here, of the worldwide total of 408,948 examinees, the student ranked above (408948+1) * 89/100 ~= 363965 examinees; conversely, 408948 - 363965 = 44983 students ranked above this student. You can compute percentiles in Oracle SQL using CUME_DIST() as follows.
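The arithmetic in the GRE example is easy to double-check; the examinee count is the one quoted in the text, and the formula is the one just given:

```python
n, P = 408948, 89  # examinee count from the text, 89th percentile
position = (n + 1) * P / 100        # position = (n+1) * P/100
ranked_above = n - round(position)  # examinees ranked above this student
print(round(position), ranked_above)  # 363965 44983
```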

EX> Rank all sales reps and their managers by their salary percentile.

select job_id, last_name, salary,
       round(cume_dist() over
            (partition by job_id order by salary) * 100) as pctile
from employees
where job_id like '%SA_%'
order by job_id, pctile desc, salary;

JOB_ID LAST_NAME      SALARY PCTILE
------ -------------- ------ ------
SA_MAN Russell         14000    100
SA_MAN Partners        13500     80
SA_MAN Errazuriz       12000     60
SA_MAN Cambrault       11000     40
SA_MAN Zlotkey         10500     20
SA_REP Ozer            11500    100
SA_REP Abel            11000     97
SA_REP Vishney         10500     93
SA_REP Tucker          10000     90
SA_REP Bloom           10000     90
SA_REP King            10000     90
...

The *partition by job_id* clause enables ranking within each job_id. You can infer things like: Ms. Bloom has 90% of her fellow sales reps earning as much as or less than her (per the *order by salary* clause), Mr. Zlotkey is in the last quintile among the managers, and so on.

EX> Suppose a new job candidate names his salary requirement - say 7.5K. You can use the same function - in what Oracle calls its *aggregate* usage - to find out which percentile this salary number would fit in, as follows:

select round(cume_dist(7500) within group (order by salary) * 100, 0) offer_pctile
from employees;

OFFER_PCTILE
------------
          64

If the new candidate joins the sales force (a.k.a. when his record is entered into this table), his salary will be in the 64th percentile. The *PERCENT_RANK()* function is similar to the *CUME_DIST()* function - see the SQL Reference manual in the Oracle docs for details.

**Measures of Central Tendency - Mean, Median, Mode**

**Mean or Average** = sum(all observations) / count of observations. Note the difference between the sample mean and the population mean, for use later.

EX> Compute average salary of employees in the company.

EX> Compute average quantity sold of products from all sale transactions.

select min(salary) min_s, max(salary) max_s,
       round(avg(salary),2) avg_s
from HR.employees;

select prod_id, count(*), sum(quantity_sold) sum_qty,
       avg(quantity_sold) avg_qty
from SH.sales
group by prod_id
order by prod_id;

We can also compute the **mean of grouped data from a frequency distribution** using x_bar = (Σ_{i=1 to m} freq_i * x_i) / (Σ_{i=1 to m} freq_i), where m is the number of classes and x_i represents class i (here, the interval number).

EX> Compute the mean of the frequency distribution discussed above.

with Q_freq_dist as (
 select intvl, count(*) freq
 from (select width_bucket(salary,
              (select min(salary) from employees),
              (select max(salary)+1 from employees), 10) intvl
       from HR.employees)
 group by intvl
 order by intvl)
--
select sum(intvl * freq)/sum(freq) grouped_mean
from Q_freq_dist;

**Median** is the value in a dataset for which half the observations have a higher value and the other half a lower value.

EX> Compute the median salary per department.

select department_id, median(salary)
from HR.employees
group by department_id;

**Mode** is the observation in the dataset that occurs most frequently.

EX> Compute the mode per department.

select department_id, stats_mode(salary)
from HR.employees
group by department_id;

**Measures of Dispersion**

These metrics measure how a set of data values is dispersed around (a.k.a. deviates from) the measure of central tendency (typically the mean).

**Variance** represents the spread of the data points around the mean of a dataset, and is computed from the sum of squared deviations of each data point from the mean. The sample variance is s² = (Σ_{i=1 to n} (x_i - x_bar)²)/(n-1); the population variance σ² uses the population mean and divides by N instead of n-1.

EX> Compute the variance, sample variance, and population variance of employee salaries.

select variance(salary), var_samp(salary), var_pop(salary)

from HR.employees;
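Python's statistics module makes the same samp/pop distinction explicit, which may help build intuition for the SQL variants above; the salary values below are made up:

```python
import statistics

salaries = [2400, 3100, 5800, 6500, 9000, 13000, 17000]  # made-up sample
n = len(salaries)
s2 = statistics.variance(salaries)       # sample variance: divides by n - 1
sigma2 = statistics.pvariance(salaries)  # population variance: divides by n
# The two always differ by a factor of (n - 1) / n:
print(s2, sigma2, s2 * (n - 1) / n)
```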

**Standard deviation** is the square root of variance, and is generally more useful than variance because it is in the units of the original data rather than their square.

EX> Compute the standard deviation, sample standard deviation, and population standard deviation of employee salaries.

select stddev(salary), stddev_samp(salary), stddev_pop(salary)
from HR.employees;

For several commonly occurring datasets, the values tend to cluster around the mean or median, so that the distribution takes on the shape of a symmetrical bell curve. *Chebyshev's Theorem* states that *at least* (1 - 1/k²) x 100% of the data, for k > 1, will fall within k standard deviations of the mean. If the distribution is more or less symmetrical about the mean (i.e. closer to a *normal distribution*), then the stronger *empirical rule* states that *approximately* 68% of the data will fall within 1 standard deviation of the mean, 95% within 2 standard deviations, and 99.7% within 3 standard deviations.
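Chebyshev's bound is easy to verify numerically on any dataset, including a deliberately skewed one; the numbers below are arbitrary:

```python
import statistics

data = [2, 3, 3, 4, 5, 5, 6, 7, 9, 14, 21, 40]  # arbitrary, right-skewed
mu = statistics.mean(data)
sigma = statistics.pstdev(data)  # population standard deviation

for k in (2, 3):
    within = sum(1 for x in data if abs(x - mu) <= k * sigma)
    frac = within / len(data)
    bound = 1 - 1 / k**2  # Chebyshev: at least this fraction within k sigma
    print(f"k={k}: {frac:.3f} observed, >= {bound:.3f} guaranteed")
    assert frac >= bound  # holds for any dataset, normal or not
```

Unlike the empirical rule, this bound needs no assumption about the shape of the distribution, which is why it is looser.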

**Skewness** is a measure of the unevenness or asymmetry of the distribution. For a perfectly symmetric distribution, the mean, the median, and the mode coincide. For a skewed distribution, generally the mean lies to one side of the median, which in turn lies to the same side of the mode - the side being right (+ve skew) or left (-ve skew) depending on the direction of the skew. The skewness of a given population is computed by Skew = (Σ_{i=1 to N} ((x_i - μ)/σ)³)/N, where μ is the population mean and σ is the population standard deviation.

**Kurtosis** is the measure of peakedness of the distribution of the population: Kurtosis = (Σ_{i=1 to N} ((x_i - μ)/σ)⁴)/N.

EX> Compute the skewness and kurtosis for the employee salary.

-- Skewness
select avg(val3) skewness
from (select power((salary - avg(salary) over ())/
             stddev(salary) over (), 3) val3
      from HR.employees);
--
-- Kurtosis
select avg(val4) kurtosis
from (select power((salary - avg(salary) over ())/
             stddev(salary) over (), 4) val4
      from HR.employees);

This brevity is made possible by Oracle SQL analytic functions: *avg(salary) over ()* means "compute the average over all rows in employees". Note that you can use partition and order by clauses inside the () to group data across different employee classes and scope the computation to each group if required. From a performance standpoint, note that the computation intrinsically demands multiple scans (2 at the least) over the employees table (or the tables behind the employees view) - so for a large employees table, the computation may spill into on-disk sort space rather than staying within the PGA (the database's in-memory work area). For more on such performance implications of SQL, begin your education with the Oracle Concepts guide or the SQL Tuning Guide, or look up the various references provided here.
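The same population skewness and kurtosis formulas can be sketched in a few lines of Python as a cross-check; the salary figures are made up, and mu/sigma play the role of the avg/stddev window functions in the SQL:

```python
import statistics

salaries = [2400, 3100, 3100, 4200, 5800, 6500, 9000, 13000, 24000]  # made up
mu = statistics.mean(salaries)
sigma = statistics.pstdev(salaries)  # population standard deviation

# Skew = sum(((x - mu)/sigma)^3) / N, Kurtosis = sum(((x - mu)/sigma)^4) / N
skew = sum(((x - mu) / sigma) ** 3 for x in salaries) / len(salaries)
kurt = sum(((x - mu) / sigma) ** 4 for x in salaries) / len(salaries)
print(round(skew, 3), round(kurt, 3))
```

A positive skew here reflects the long right tail (the 24000 outlier), matching the mean-to-the-right-of-the-median pattern described earlier.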
