Numoracle Recipes: Sampling

The key goal of inferential statistics is to make predictions/observations about the population (the whole) as a generalization of observations made on a random sample (the part). In this post, we will present tools/techniques in Oracle for sampling data.

Sampling
For ensuring high accuracy in the results of a statistical inference (technique), the sample dataset should minimally have these properties:

the sample must be drawn randomly from the population
every segment of the population must be adequately and proportionately represented
the sample should not be biased - a classic example is the non-response bias seen in survey/poll data - where the respondents ignore/refuse to answer a particular question ("Have you smoked marijuana?" in a health questionnaire).

We will use this succinct introduction to sampling techniques (or alternatively, this resource) as a basis to discuss probabilistic (a.k.a random) sampling with and without replacement, clustered sampling, and stratified sampling in Oracle.

The SAMPLE clause in the SQL SELECT statement supports simple random sampling and clustered sampling. Note that the sampling frame here is simply the set of rows returned by the statement - so you can control how many, and which, of the rows are sampled using filters in the WHERE clause. Here are some examples based on the OE.ORDERS table provided with the Sample Schema (connect oe/oe in SQL*Plus to see this table).

Example: Random Sampling without replacement:
Get random samples from ORDERS table with each row having a 15% chance of being in the sample

SQL> select order_id,order_status,order_total from orders sample (15) seed (3);

  ORDER_ID ORDER_STATUS ORDER_TOTAL
---------- ------------ -----------
      2354       0       46257
      2360       4       990.4
      2361       8    120131.3
      2384       3     29249.1
      2389       4       17620
      2400       2     69286.4
      2401       3       969.2
      2402       8         600
      2415       6         310
      2416       6         384
      2417       5      1926.6

  ORDER_ID ORDER_STATUS ORDER_TOTAL
---------- ------------ -----------
      2419       3       31574
      2423       3     10367.7
      2427       7        9055
      2429       9       50125
      2453       0         129

16 rows selected.

SQL>

The SEED clause in the above statement helps with repeatability of results from one run to the next. Next, achieving clustered sampling requires a BLOCK qualifier in the SAMPLE clause, as in

select order_id,order_status,order_total from orders sample block (15) seed (3);

Example: Random Sampling with replacement: can be accomplished with some minimal coding. In the example below, we using a hashing function (ORA_HASH) on the sequence of enumerated (using ROWNUM pseudo-column) rows to randomize the selection from the table to be sampled. Let's try this: "Get me random samples with replacement with sample size 10 rows from the ORDERS table".

-- stage a view with row numbers tacked on to original table
create view orders_view as
select rownum rnum, o.*
  from orders o;

-- create a sequence
create sequence orders_seq10;

-- create a mapping table
create table orders_map(rnum number, rhsh number);

-- the requested sample size is 10
begin
  for i in 1..10
  loop
    insert into orders_map (rnum) values (orders_seq10.nextval);
  end loop;
end;
/
commit;

-- the complete orders table is the sampling frame
-- mark random sampled entries in mapping table
update orders_map set rhsh = ora_hash(rnum, (select count(*) from orders));

-- use the mapping table and orders_view to create a view with sampled rows
create view orders_sample10_view as
select o.*
  from orders_view o, orders_map m
 where o.rnum = m.rnum;

This is the result of the above code snippet, when run in the OE schema using SQL*Plus:

SQL> select * from orders_sample10_view;

RNUM    ORDER_ID   ORDER_MO CUSTOMER_ID ORDER_STATUS ORDER_TOTAL SALES_REP_ID
------  ---------- -------- ----------- ------------ ----------- ------------
11 2455       direct       145       7     14087.5      160
12 2379       direct       146       8     17848.2      161
13 2396       direct       147       8       34930      161
14 2406       direct       148       8      2854.2      161
15 2434       direct       149       8    268651.8      161
16 2436       direct       116       8      6394.8      161
17 2446       direct       117       8    103679.3      161
18 2447       direct       101       8     33893.6      161
19 2432       direct       102      10       10523      163
20 2433       direct       103      10   78      163

10 rows selected.

SQL>

For repeatability of sampled results, change all "create view" statements above to "create table". It is also useful if the data to be sampled is persisted in a table, rather than presented to the sampling queries as a view.

Stratified Sampling can also be coded using SQL. But rather than provide an example query that is specific to a given table, we will provide a query generator function here. This is adapted from the output of code generated by Oracle Data Miner for stratified sampling. The function accepts a source and result table name, the name of the stratifying column (a.k.a. "variable" in statistics, "attribute" in data mining), the sampling size as a percentage of the table size, a scalar comparison operator ('<', '=', '>'), and an indicator to specify if each strata should roughly have the same number of elements.

An important note - Oracle Data Miner - the GUI interface for the Oracle Data Mining platform provides sampling features, and generated SQL/PLSQL code that corresponds to the sampling input. This example has been adapted from this generated code.

CREATE OR REPLACE TYPE TARGET_VALUES_LIST IS TABLE OF VARCHAR2(32);
/
CREATE OR REPLACE TYPE VALUE_COUNT_LIST IS TABLE OF NUMBER;
/

CREATE OR REPLACE
FUNCTION GENERATE_STRATIFIED_SQL ( 
  result_table_name  IN VARCHAR2,
  source_table_name  IN VARCHAR2, 
  strat_attr_name    IN VARCHAR2,
  percentage         IN NUMBER,
  op                 IN VARCHAR2,
  equal_sized_strata IN BOOLEAN DEFAULT FALSE)
RETURN VARCHAR2 IS
  v_stmt                VARCHAR2(32767);
  tmp_str               VARCHAR2(4000);
  sample_count          PLS_INTEGER;
  attr_names            TARGET_VALUES_LIST;
  attr_values           TARGET_VALUES_LIST;
  counts                VALUE_COUNT_LIST;
  counts_sampled        VALUE_COUNT_LIST;
  v_minvalue            NUMBER;
BEGIN
  v_stmt :=
    'SELECT column_name ' ||
      'FROM user_tab_columns ' ||
     'WHERE table_name = ' || '''' || UPPER(source_table_name) || '''';
  EXECUTE IMMEDIATE v_stmt BULK COLLECT INTO attr_names;

  v_stmt := 
    'SELECT /*+ noparallel(t)*/ ' || strat_attr_name || ', count(*), ' ||
            'ROUND ((count(*) * ' || percentage || ')/100.0) ' ||
      'FROM ' || source_table_name || 
    ' WHERE ' || strat_attr_name || ' IS NOT NULL ' ||
    'GROUP BY ' || strat_attr_name;
  EXECUTE IMMEDIATE v_stmt
  BULK COLLECT INTO attr_values, counts, counts_sampled;

  IF (equal_sized_strata = TRUE) THEN
    FOR i IN counts.FIRST..counts.LAST
    LOOP
      IF (i = counts.FIRST) THEN
         v_minvalue := counts(i);
      ELSIF (counts(i) > 0 AND v_minvalue > counts(i)) THEN
         v_minvalue := counts(i);
      END IF;
    END LOOP;
    FOR i IN counts.FIRST..counts.LAST
    LOOP
      counts(i) := v_minvalue;
    END LOOP;
  END IF;

  v_stmt :=
   'CREATE TABLE ' || result_table_name || ' AS ' ||
   'SELECT ';
  FOR i IN attr_names.FIRST..attr_names.LAST 
  LOOP
    IF (i != attr_names.FIRST) THEN
       v_stmt := v_stmt || ',';
    END IF;
    v_stmt := v_stmt || attr_names(i);
  END LOOP;
  
  v_stmt := v_stmt ||
    ' FROM (SELECT /*+ no_merge */ t.*, ' ||
                  'ROW_NUMBER() OVER ' ||
                    '(PARTITION BY ' || strat_attr_name ||
                    ' ORDER BY ORA_HASH(ROWNUM)) RNUM ' ||
             'FROM ' || source_table_name || ' t) ' ||
    'WHERE RNUM = 1 OR ';

  FOR i IN attr_values.FIRST..attr_values.LAST 
  LOOP
    IF (i != attr_values.FIRST) THEN
       tmp_str := ' OR ';
    END IF;
    IF (counts(i) <= 2) THEN
      sample_count := counts(i);
    ELSE
      sample_count := counts_sampled(i);
    END IF;
    tmp_str :=
      tmp_str || 
      '(' || strat_attr_name || ' = ''' || attr_values(i) || '''' ||
      ' AND ORA_HASH(RNUM, (' || counts(i) || ' -1), 12345) ' ||
      op || sample_count || ') ';
    v_stmt := v_stmt || tmp_str;
  END LOOP;

  RETURN(v_stmt);
END;
/

Now, cut and paste the above code in a SQL*Plus session (for this example, the Sample Schema OE/OE session), and then invoke the function in a SQL*Plus session using this wrapper. The inputs are pretty self-explanatory - we are asking to sample roughly 20% of the ORDERS table, stratified based on values in the ORDERS_STATUS column, and to place the sampled output in the table SAMPLED_ORDERS.

set serveroutput on
BEGIN
  DBMS_OUTPUT.PUT_LINE(
    generate_stratified_sql(
      result_table_name => 'sampled_orders',
      source_table_name => 'orders',
      strat_attr_name   => 'order_status',
      percentage        => 20,
      op                => ' < '));
END;
/

this will return the following sampling query (output formatted for easy readability):

CREATE TABLE sampled_orders AS
SELECT ORDER_ID,ORDER_MODE,CUSTOMER_ID,ORDER_STATUS,ORDER_TOTAL,
       SALES_REP_ID,PROMOTION_ID
  FROM (SELECT /*+ no_merge */ t.*,
               ROW_NUMBER() OVER
               (PARTITION BY order_status
                ORDER BY ORA_HASH(ROWNUM)) RNUM
          FROM orders t)
 WHERE RNUM = 1 OR
       (order_status = '1' AND ORA_HASH(RNUM, (7 -1), 12345) < 1)  OR
       (order_status = '6' AND ORA_HASH(RNUM, (9 -1), 12345) < 2)  OR
       (order_status = '2' AND ORA_HASH(RNUM, (7 -1), 12345) < 1)  OR
       (order_status = '5' AND ORA_HASH(RNUM, (15 -1), 12345) < 3) OR
       (order_status = '4' AND ORA_HASH(RNUM, (12 -1), 12345) < 2) OR
       (order_status = '8' AND ORA_HASH(RNUM, (17 -1), 12345) < 3) OR
       (order_status = '3' AND ORA_HASH(RNUM, (9 -1), 12345) < 2) OR
       (order_status = '7' AND ORA_HASH(RNUM, (3 -1), 12345) < 1) OR
       (order_status = '0' AND ORA_HASH(RNUM, (11 -1), 12345) < 2) OR
       (order_status = '10' AND ORA_HASH(RNUM, (5 -1), 12345) < 1) OR
       (order_status = '9' AND ORA_HASH(RNUM, (10 -1), 12345) < 2);

If you execute this SQL in the same OE session, the resulting table is:

SQL> select * from sampled_orders;

  ORDER_ID ORDER_MO CUSTOMER_ID ORDER_STATUS ORDER_TOTAL SALES_REP_ID
---------- -------- ----------- ------------ ----------- ------------
      2363 online     144     0  10082.3
      2369 online     116     0  11097.4
      2403 direct     162     0      220   154
      2439 direct     105     1  22150.1   159
      2408 direct     166     1      309   158
      2444 direct     109     1  77727.2   155
      2358 direct     105     2     7826   155
      2400 direct     159     2  69286.4   161
      2375 online     122     2 103834.4
      2450 direct     147     3     1636   159
      2423 direct     145     3  10367.7   160

  ORDER_ID ORDER_MO CUSTOMER_ID ORDER_STATUS ORDER_TOTAL SALES_REP_ID
---------- -------- ----------- ------------ ----------- ------------
      2385 online     147     4   295892
      2437 direct     103     4    13550   163
      2389 online     151     4    17620
      2364 online     145     4     9500
      2377 online     141     5  38017.8
      2425 direct     147     5   1500.8   163
      2394 direct     109     5    21863   158
      2457 direct     118     5  21586.2   159
      2426 direct     148     6     7200
      2410 direct     168     6    45175   156
      2427 direct     149     7     9055   163

  ORDER_ID ORDER_MO CUSTOMER_ID ORDER_STATUS ORDER_TOTAL SALES_REP_ID
---------- -------- ----------- ------------ ----------- ------------
      2446 direct     117     8 103679.3   161
      2402 direct     161     8      600   154
      2434 direct     149     8 268651.8   161
      2447 direct     101     8  33893.6   161
      2365 online     146     9  27455.3
      2372 online     119     9  16447.2
      2359 online     106     9   5543.1
      2368 online     149    10    60065

30 rows selected.

SQL> select count(*) from orders;

  COUNT(*)
----------
       105

SQL>

That is, 30 rows (a bit more than 20%) out of 105 are sampled, using order_status as the stratifying column.

Try this example on your own schemas and tables, and send me some feedback. The maximum length of the string returned from GENERATE_STRATFIED_SQL() is 32767. Based on the interest, I can post another function that can handle SQL strings of <= 64K in size.

Given the explosive growth in data, sampling continues to remain an active, relevant research area in statistics, data mining, and computer science.

3 comments:

Anonymous said...: I always wondered why the sample clause allows specifying percentage but not number of rows in sample.

If you know something about the distribution, you will be able to calculate how many rows you'll need to get specific bias. Why force us to convert the number into percentage?; December 19, 2007 at 4:21 PM
Anonymous said...: Hi, Good question. One clarification before I investigate the answer and get back to you. Note
that the percentage value is not the percentage of rows from the table - as in "sample and return 30% of the rows in a table". Acc to the docs "This percentage indicates the probability of each row, or each cluster of rows in the case of block sampling, being selected as part of the sample." - as in "return rows that have a 30% chance of being sampled". I'll get back with an answer shortly.; December 21, 2007 at 12:58 PM
Anonymous said...: Chen, Upon inquiry, I found that the answer to this is non-trivial and a factor of implementation of row storage (not just specific to Oracle, but any DB that has a hierarchy of granularity for storing rows/columns). The short answer to this question is that for the sample clause implementation to consider both performance and "fairness" in sampling - accepting the probability that a row will be present in the sample is better - rather than a specific number of rows. From a usage standpoint, the good news is that the number of rows returned are pretty close to the % requested - for e.g., I input 10% over a 105 row table, and got 13 rows sample. Over several trials of sampling, the number of rows returned will eventually equal the % of rows in the table - that is 11 rows - even if the interpretation of this input appears to be different. Hope this helps. +Ram; December 21, 2007 at 3:21 PM

Numoracle Recipes

Sampling

3 comments:

About this Blog

About Me

Blog Posts

Archives by Month

Categories

Search

Subscribe

DM & Analytics

Business Intelligence

Oracle Database

Documentation

Readings

Links of Interest