Sampling

The key goal of inferential statistics is to make predictions/observations about the population (the whole) as a generalization of observations made on a random sample (the part). In this post, we will present tools/techniques in Oracle for sampling data.

Sampling
For ensuring high accuracy in the results of a statistical inference (technique), the sample dataset should minimally have these properties:

  • the sample must be drawn randomly from the population
  • every segment of the population must be adequately and proportionately represented
  • the sample should not be biased - a classic example is the non-response bias seen in survey/poll data - where the respondents ignore/refuse to answer a particular question ("Have you smoked marijuana?" in a health questionnaire).
We will use this succinct introduction to sampling techniques (or alternatively, this resource) as a basis to discuss probabilistic (a.k.a random) sampling with and without replacement, clustered sampling, and stratified sampling in Oracle.

The SAMPLE clause in the SQL SELECT statement supports simple random sampling and clustered sampling. Note that the sampling frame here is simply the set of rows returned by the statement - so you can control how many, and which, of the rows are sampled using filters in the WHERE clause. Here are some examples based on the OE.ORDERS table provided with the Sample Schema (connect oe/oe in SQL*Plus to see this table).

Example: Random Sampling without replacement:
Get random samples from ORDERS table with each row having a 15% chance of being in the sample
SQL> select order_id,order_status,order_total from orders sample (15) seed (3);

ORDER_ID ORDER_STATUS ORDER_TOTAL
---------- ------------ -----------
2354 0 46257
2360 4 990.4
2361 8 120131.3
2384 3 29249.1
2389 4 17620
2400 2 69286.4
2401 3 969.2
2402 8 600
2415 6 310
2416 6 384
2417 5 1926.6

ORDER_ID ORDER_STATUS ORDER_TOTAL
---------- ------------ -----------
2419 3 31574
2423 3 10367.7
2427 7 9055
2429 9 50125
2453 0 129

16 rows selected.

SQL>
The SEED clause in the above statement helps with repeatability of results from one run to the next. Next, achieving clustered sampling requires a BLOCK qualifier in the SAMPLE clause, as in
select order_id,order_status,order_total from orders sample block (15) seed (3);
Example: Random Sampling with replacement: can be accomplished with some minimal coding. In the example below, we using a hashing function (ORA_HASH) on the sequence of enumerated (using ROWNUM pseudo-column) rows to randomize the selection from the table to be sampled. Let's try this: "Get me random samples with replacement with sample size 10 rows from the ORDERS table".
-- stage a view with row numbers tacked on to original table
create view orders_view as
select rownum rnum, o.*
from orders o;

-- create a sequence
create sequence orders_seq10;

-- create a mapping table
create table orders_map(rnum number, rhsh number);

-- the requested sample size is 10
begin
for i in 1..10
loop
insert into orders_map (rnum) values (orders_seq10.nextval);
end loop;
end;
/
commit;

-- the complete orders table is the sampling frame
-- mark random sampled entries in mapping table
update orders_map set rhsh = ora_hash(rnum, (select count(*) from orders));

-- use the mapping table and orders_view to create a view with sampled rows
create view orders_sample10_view as
select o.*
from orders_view o, orders_map m
where o.rnum = m.rnum;
This is the result of the above code snippet, when run in the OE schema using SQL*Plus:
SQL> select * from orders_sample10_view;

RNUM ORDER_ID ORDER_MO CUSTOMER_ID ORDER_STATUS ORDER_TOTAL SALES_REP_ID
------ ---------- -------- ----------- ------------ ----------- ------------
11 2455 direct 145 7 14087.5 160
12 2379 direct 146 8 17848.2 161
13 2396 direct 147 8 34930 161
14 2406 direct 148 8 2854.2 161
15 2434 direct 149 8 268651.8 161
16 2436 direct 116 8 6394.8 161
17 2446 direct 117 8 103679.3 161
18 2447 direct 101 8 33893.6 161
19 2432 direct 102 10 10523 163
20 2433 direct 103 10 78 163

10 rows selected.

SQL>
For repeatability of sampled results, change all "create view" statements above to "create table". It is also useful if the data to be sampled is persisted in a table, rather than presented to the sampling queries as a view.

Stratified Sampling can also be coded using SQL. But rather than provide an example query that is specific to a given table, we will provide a query generator function here. This is adapted from the output of code generated by Oracle Data Miner for stratified sampling. The function accepts a source and result table name, the name of the stratifying column (a.k.a. "variable" in statistics, "attribute" in data mining), the sampling size as a percentage of the table size, a scalar comparison operator ('<', '=', '>'), and an indicator to specify if each strata should roughly have the same number of elements.

An important note - Oracle Data Miner - the GUI interface for the Oracle Data Mining platform provides sampling features, and generated SQL/PLSQL code that corresponds to the sampling input. This example has been adapted from this generated code.
CREATE OR REPLACE TYPE TARGET_VALUES_LIST IS TABLE OF VARCHAR2(32);
/
CREATE OR REPLACE TYPE VALUE_COUNT_LIST IS TABLE OF NUMBER;
/

CREATE OR REPLACE
FUNCTION GENERATE_STRATIFIED_SQL (
result_table_name IN VARCHAR2,
source_table_name IN VARCHAR2,
strat_attr_name IN VARCHAR2,
percentage IN NUMBER,
op IN VARCHAR2,
equal_sized_strata IN BOOLEAN DEFAULT FALSE)
RETURN VARCHAR2 IS
v_stmt VARCHAR2(32767);
tmp_str VARCHAR2(4000);
sample_count PLS_INTEGER;
attr_names TARGET_VALUES_LIST;
attr_values TARGET_VALUES_LIST;
counts VALUE_COUNT_LIST;
counts_sampled VALUE_COUNT_LIST;
v_minvalue NUMBER;
BEGIN
v_stmt :=
'SELECT column_name ' ||
'FROM user_tab_columns ' ||
'WHERE table_name = ' || '''' || UPPER(source_table_name) || '''';
EXECUTE IMMEDIATE v_stmt BULK COLLECT INTO attr_names;

v_stmt :=
'SELECT /*+ noparallel(t)*/ ' || strat_attr_name || ', count(*), ' ||
'ROUND ((count(*) * ' || percentage || ')/100.0) ' ||
'FROM ' || source_table_name ||
' WHERE ' || strat_attr_name || ' IS NOT NULL ' ||
'GROUP BY ' || strat_attr_name;
EXECUTE IMMEDIATE v_stmt
BULK COLLECT INTO attr_values, counts, counts_sampled;

IF (equal_sized_strata = TRUE) THEN
FOR i IN counts.FIRST..counts.LAST
LOOP
IF (i = counts.FIRST) THEN
v_minvalue := counts(i);
ELSIF (counts(i) > 0 AND v_minvalue > counts(i)) THEN
v_minvalue := counts(i);
END IF;
END LOOP;
FOR i IN counts.FIRST..counts.LAST
LOOP
counts(i) := v_minvalue;
END LOOP;
END IF;

v_stmt :=
'CREATE TABLE ' || result_table_name || ' AS ' ||
'SELECT ';
FOR i IN attr_names.FIRST..attr_names.LAST
LOOP
IF (i != attr_names.FIRST) THEN
v_stmt := v_stmt || ',';
END IF;
v_stmt := v_stmt || attr_names(i);
END LOOP;

v_stmt := v_stmt ||
' FROM (SELECT /*+ no_merge */ t.*, ' ||
'ROW_NUMBER() OVER ' ||
'(PARTITION BY ' || strat_attr_name ||
' ORDER BY ORA_HASH(ROWNUM)) RNUM ' ||
'FROM ' || source_table_name || ' t) ' ||
'WHERE RNUM = 1 OR ';

FOR i IN attr_values.FIRST..attr_values.LAST
LOOP
IF (i != attr_values.FIRST) THEN
tmp_str := ' OR ';
END IF;
IF (counts(i) <= 2) THEN
sample_count := counts(i);
ELSE
sample_count := counts_sampled(i);
END IF;
tmp_str :=
tmp_str ||
'(' || strat_attr_name || ' = ''' || attr_values(i) || '''' ||
' AND ORA_HASH(RNUM, (' || counts(i) || ' -1), 12345) ' ||
op || sample_count || ') ';
v_stmt := v_stmt || tmp_str;
END LOOP;

RETURN(v_stmt);
END;
/
Now, cut and paste the above code in a SQL*Plus session (for this example, the Sample Schema OE/OE session), and then invoke the function in a SQL*Plus session using this wrapper. The inputs are pretty self-explanatory - we are asking to sample roughly 20% of the ORDERS table, stratified based on values in the ORDERS_STATUS column, and to place the sampled output in the table SAMPLED_ORDERS.
set serveroutput on
BEGIN
DBMS_OUTPUT.PUT_LINE(
generate_stratified_sql(
result_table_name => 'sampled_orders',
source_table_name => 'orders',
strat_attr_name => 'order_status',
percentage => 20,
op => ' < '));
END;
/
this will return the following sampling query (output formatted for easy readability):
CREATE TABLE sampled_orders AS
SELECT ORDER_ID,ORDER_MODE,CUSTOMER_ID,ORDER_STATUS,ORDER_TOTAL,
SALES_REP_ID,PROMOTION_ID
FROM (SELECT /*+ no_merge */ t.*,
ROW_NUMBER() OVER
(PARTITION BY order_status
ORDER BY ORA_HASH(ROWNUM)) RNUM
FROM orders t)
WHERE RNUM = 1 OR
(order_status = '1' AND ORA_HASH(RNUM, (7 -1), 12345) < 1) OR
(order_status = '6' AND ORA_HASH(RNUM, (9 -1), 12345) < 2) OR
(order_status = '2' AND ORA_HASH(RNUM, (7 -1), 12345) < 1) OR
(order_status = '5' AND ORA_HASH(RNUM, (15 -1), 12345) < 3) OR
(order_status = '4' AND ORA_HASH(RNUM, (12 -1), 12345) < 2) OR
(order_status = '8' AND ORA_HASH(RNUM, (17 -1), 12345) < 3) OR
(order_status = '3' AND ORA_HASH(RNUM, (9 -1), 12345) < 2) OR
(order_status = '7' AND ORA_HASH(RNUM, (3 -1), 12345) < 1) OR
(order_status = '0' AND ORA_HASH(RNUM, (11 -1), 12345) < 2) OR
(order_status = '10' AND ORA_HASH(RNUM, (5 -1), 12345) < 1) OR
(order_status = '9' AND ORA_HASH(RNUM, (10 -1), 12345) < 2);
If you execute this SQL in the same OE session, the resulting table is:
SQL> select * from sampled_orders;

ORDER_ID ORDER_MO CUSTOMER_ID ORDER_STATUS ORDER_TOTAL SALES_REP_ID
---------- -------- ----------- ------------ ----------- ------------
2363 online 144 0 10082.3
2369 online 116 0 11097.4
2403 direct 162 0 220 154
2439 direct 105 1 22150.1 159
2408 direct 166 1 309 158
2444 direct 109 1 77727.2 155
2358 direct 105 2 7826 155
2400 direct 159 2 69286.4 161
2375 online 122 2 103834.4
2450 direct 147 3 1636 159
2423 direct 145 3 10367.7 160

ORDER_ID ORDER_MO CUSTOMER_ID ORDER_STATUS ORDER_TOTAL SALES_REP_ID
---------- -------- ----------- ------------ ----------- ------------
2385 online 147 4 295892
2437 direct 103 4 13550 163
2389 online 151 4 17620
2364 online 145 4 9500
2377 online 141 5 38017.8
2425 direct 147 5 1500.8 163
2394 direct 109 5 21863 158
2457 direct 118 5 21586.2 159
2426 direct 148 6 7200
2410 direct 168 6 45175 156
2427 direct 149 7 9055 163

ORDER_ID ORDER_MO CUSTOMER_ID ORDER_STATUS ORDER_TOTAL SALES_REP_ID
---------- -------- ----------- ------------ ----------- ------------
2446 direct 117 8 103679.3 161
2402 direct 161 8 600 154
2434 direct 149 8 268651.8 161
2447 direct 101 8 33893.6 161
2365 online 146 9 27455.3
2372 online 119 9 16447.2
2359 online 106 9 5543.1
2368 online 149 10 60065

30 rows selected.

SQL> select count(*) from orders;

COUNT(*)
----------
105

SQL>
That is, 30 rows (a bit more than 20%) out of 105 are sampled, using order_status as the stratifying column.

Try this example on your own schemas and tables, and send me some feedback. The maximum length of the string returned from GENERATE_STRATFIED_SQL() is 32767. Based on the interest, I can post another function that can handle SQL strings of <= 64K in size.

Given the explosive growth in data, sampling continues to remain an active, relevant research area in statistics, data mining, and computer science.

3 comments:

prodlife said...

I always wondered why the sample clause allows specifying percentage but not number of rows in sample.

If you know something about the distribution, you will be able to calculate how many rows you'll need to get specific bias. Why force us to convert the number into percentage?

Ram said...

Hi, Good question. One clarification before I investigate the answer and get back to you. Note
that the percentage value is not the percentage of rows from the table - as in "sample and return 30% of the rows in a table". Acc to the docs "This percentage indicates the probability of each row, or each cluster of rows in the case of block sampling, being selected as part of the sample." - as in "return rows that have a 30% chance of being sampled". I'll get back with an answer shortly.

Ram said...

Chen, Upon inquiry, I found that the answer to this is non-trivial and a factor of implementation of row storage (not just specific to Oracle, but any DB that has a hierarchy of granularity for storing rows/columns). The short answer to this question is that for the sample clause implementation to consider both performance and "fairness" in sampling - accepting the probability that a row will be present in the sample is better - rather than a specific number of rows. From a usage standpoint, the good news is that the number of rows returned are pretty close to the % requested - for e.g., I input 10% over a 105 row table, and got 13 rows sample. Over several trials of sampling, the number of rows returned will eventually equal the % of rows in the table - that is 11 rows - even if the interpretation of this input appears to be different. Hope this helps. +Ram