Random Variables, Probability, Bayes Theorem, Naive Bayes Models

In contrast to descriptive statistics, inferential statistics draws conclusions about a population based on information gleaned from a sample taken from that population. Fundamental to understanding statistical inference is the concept of probability.

An experiment is the process of measuring or observing an activity. An outcome is a particular result of the experiment; a variable that captures the outcome is called a random variable. Random variables can be discrete - when the variable can assume only a countable number of values (e.g. one of the six outcomes from rolling a die) - or continuous - when the variable can assume uncountably many values in a given range (time, a person's height). The sample space is the set of all possible outcomes of the experiment. An event is an outcome (or set of outcomes) of interest.

The probability of Event A occurring (assuming equally likely outcomes) is: P(A) = (# of outcomes in which Event A occurs) / (total # of outcomes in the sample space).

Basic properties of probability:

  • P(A) = 1 implies Event A will occur with certainty
  • P(A) = 0 implies Event A will certainly not occur
  • 0 <= P(A) <= 1
  • The probabilities of all outcomes in the sample space must sum to 1
  • All outcomes in the sample space that are not part of Event A form the complement of Event A (denoted A'). P(A') = 1 - P(A)
  • Given two events A and B, P(A) or P(B) - i.e. the probability of each event occurring without knowledge of the other event's occurrence - is called the prior probability.
  • Given two events A and B, the probability of event A occurring given that event B has occurred - denoted P(A | B) - is called the conditional probability or posterior probability of Event A given Event B. On the flip side, if P(A | B) = P(A), then events A and B are termed independent.
  • Given two events A and B, the probability of both A and B occurring is called the joint probability of A and B, computed as P(A and B) = P(A | B) * P(B). When A and B are independent, this simplifies to P(A and B) = P(A) * P(B).
  • Given two events A and B, the probability of either A or B occurring is called the union of events A and B. If events A and B cannot occur at the same time (i.e. are mutually exclusive), then P(A or B) = P(A) + P(B). If events A and B can occur at the same time (i.e. are not mutually exclusive), then P(A or B) = P(A) + P(B) - P(A and B)
  • Law of Total Probability: P(A) = P(A | B)P(B) + P(A | B')P(B')
  • The Bayes Theorem for probabilities provides the ability to reverse the conditionality of events and compute the outcome (a worked example follows this list):
    P(A | B) = P(A) * P(B | A) / (P(A) * P(B | A) + P(A') * P(B | A'))
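As a quick worked example (the numbers here are hypothetical, chosen purely for illustration): suppose 1% of loan applicants eventually default, a screening rule flags 90% of the applicants who go on to default, and it also flags 5% of the applicants who do not default. What is the probability that a flagged applicant actually defaults? With A = "applicant defaults" and B = "applicant is flagged", Bayes theorem gives:
    P(A | B) = (0.01 * 0.90) / (0.01 * 0.90 + 0.99 * 0.05)
             = 0.009 / 0.0585
             ≈ 0.15
So even though the screen catches 90% of eventual defaulters, only about 15% of flagged applicants actually default, because defaulters are rare to begin with.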
Note that finding the probability of a given event is tantamount to predicting that the event will occur with a given level of certainty or chance, quantified by that probability. This is a good segue to look at a real business problem and its solution based on Bayes theorem.

A modern, predictive loan processing application builds analytical models from millions of historical loan applicant records (training data), and uses these models to predict the credit-worthiness (a.k.a. risk of loan default) of an applicant by classifying the applicant into risk categories such as Low, Medium, or High. In data mining lingo, a new applicant record is scored against the model. At the time of this writing (Aug-Sep 2007), the sub-prime lending woes and their effect on US and world markets are the main story. The trillions lost in this mess are fodder for Quant skeptics/detractors, but as a BusinessWeek cover story ("Not So Smart" - Sep 3 2007) explains, the problem was not analytics per se - the problems were with how various managements (mis)used analytics or (mis)understood their data.

Returning to probability concepts, instead of A and B, the events become A and Bi, i = 1..n. The event A (or, more appropriately for this example, the target variable Risk) is a dependent variable that assumes one of a set of discrete values (called classes - low, medium, high) based on predictor variables B1 through Bn (age, salary, gender, occupation, and so on). The probability model for this classifier is P(A | B1,...,Bn). We just shifted the language from statistics into the realm of data mining/predictive analytics. The full Bayes formulation intrinsically allows for conditional dependence among B1 through Bn. Now if n is large, or if each Bi takes on a large number of values, computing this model becomes intractable.

The Naive Bayes probabilistic model greatly simplifies this by making the naive/strong assumption that B1 through Bn are conditionally independent given the target - a sketch of the resulting factorization follows. You can build a Naive Bayes model using the Oracle Data Mining Option, and predict the value for a target variable in new records using SQL Prediction Functions; the example after the sketch illustrates the process.
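Under that independence assumption, the posterior for a class value a reduces to the class prior multiplied by the per-predictor conditional probabilities, each of which can be estimated from simple counts over the training data:
    P(A = a | B1,...,Bn)  is proportional to  P(A = a) * P(B1 | A = a) * P(B2 | A = a) * ... * P(Bn | A = a)
The classifier scores a new record by evaluating this product for each class value and picking the class with the largest result; normalizing the products so they sum to 1 yields the prediction probability.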
EX> Given a small, synthetic dataset about the attributes of stolen cars, predict if a particular car will be stolen - based on its attributes.
create table stolen_cars(
id varchar2(2),
color varchar2(10),
ctype varchar2(10),
corigin varchar2(10),
stolen varchar2(3));
Table created.
insert into stolen_cars values ('1', 'Red','Sports','Domestic','yes');
insert into stolen_cars values ('2', 'Red','Sports','Domestic','no');
insert into stolen_cars values ('3', 'Red','Sports','Domestic','yes');
insert into stolen_cars values ('4', 'Yellow','Sports','Domestic','no');
insert into stolen_cars values ('5', 'Yellow','Sports','Imported','yes');
insert into stolen_cars values ('6', 'Yellow','SUV','Imported','no');
insert into stolen_cars values ('7', 'Yellow','SUV','Imported','yes');
insert into stolen_cars values ('8', 'Yellow','SUV','Domestic','no');
insert into stolen_cars values ('9', 'Red','SUV','Imported','no');
insert into stolen_cars values ('10', 'Red','Sports','Imported','yes');
commit;
Commit complete.
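Before building the model, a quick sanity check of the class distribution in the training data - this is what the model will later report as the prior probabilities:
select stolen, count(*) cnt
from stolen_cars
group by stolen;
-- returns: yes = 5, no = 5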
begin
dbms_data_mining.create_model(
model_name => 'cars',
mining_function => dbms_data_mining.classification,
data_table_name => 'stolen_cars',
case_id_column_name => 'id',
target_column_name => 'stolen');
end;
/
PL/SQL procedure successfully completed.
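The call above does not pass a settings table, so it relies on Naive Bayes being the default algorithm for classification in Oracle Data Mining. To be explicit about the algorithm, you can supply a settings table - a sketch, using a hypothetical table name nb_settings and a new model name cars_nb so it does not collide with the model already created (the literal setting values correspond to the DBMS_DATA_MINING constants algo_name and algo_naive_bayes):
create table nb_settings (
setting_name varchar2(30),
setting_value varchar2(4000));
insert into nb_settings values ('ALGO_NAME', 'ALGO_NAIVE_BAYES');
commit;
begin
dbms_data_mining.create_model(
model_name => 'cars_nb',
mining_function => dbms_data_mining.classification,
data_table_name => 'stolen_cars',
case_id_column_name => 'id',
target_column_name => 'stolen',
settings_table_name => 'nb_settings');
end;
/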
create table new_stolen_cars (
id varchar2(2),
color varchar2(10),
ctype varchar2(10),
corigin varchar2(10));
Table created.
insert into new_stolen_cars values ('1', 'Red','SUV','Domestic');
insert into new_stolen_cars values ('2', 'Yellow','SUV','Domestic');
insert into new_stolen_cars values ('3', 'Yellow','SUV','Imported');
insert into new_stolen_cars values ('4', 'Yellow','Sports','Domestic');
insert into new_stolen_cars values ('5', 'Red','Sports','Domestic');
commit;
Commit complete.
select prediction(cars using *) pred,
prediction_probability(cars using *) prob
from new_stolen_cars;
-- Results
PRE PROB
--- ----------
no .75
no .870967746
no .75
no .529411793
yes .666666687
The query scores each row in the new_stolen_cars table, returning the prediction and the certainty of that prediction. This dataset is very small, but a cursory glance at the results indicates that the predictions are consistent with the training data. For example, the model predicts 'No' for a domestic yellow sports car - the training data has no instance of such a car being stolen. The model predicts 'Yes' for a domestic red sports car with > 50% certainty - the training data (two of the three such cars were stolen) supports this prediction.
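The prediction functions can also report the probability of a specific target class, which is handy when you care about one outcome (here, the chance of theft) regardless of the predicted class. A sketch, also selecting the case id so each score can be tied back to its input row:
select id,
prediction(cars using *) pred,
prediction_probability(cars, 'yes' using *) prob_yes
from new_stolen_cars
order by id;
You can obtain the details of this model using: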
select *
from table(dbms_data_mining.get_model_details_nb('cars'));
The SQL output is a collection of objects and may not look pretty at first glance. But once you understand the schema of the output type - viz. dm_nb_details - you can decipher the output to the following simple format:
STOLEN
no
.5
DM_CONDITIONALS(
DM_CONDITIONAL('COLOR', NULL, 'Red', NULL, .4),
DM_CONDITIONAL('COLOR', NULL, 'Yellow', NULL, .6),
DM_CONDITIONAL('CTYPE', NULL, 'SUV', NULL, .6),
DM_CONDITIONAL('CORIGIN', NULL, 'Domestic', NULL, .6),
DM_CONDITIONAL('CORIGIN', NULL, 'Imported', NULL, .4),
DM_CONDITIONAL('CTYPE', NULL, 'Sports', NULL, .4))

STOLEN
yes
.5
DM_CONDITIONALS(
DM_CONDITIONAL('COLOR', NULL, 'Red', NULL, .6),
DM_CONDITIONAL('CORIGIN', NULL, 'Imported', NULL, .6),
DM_CONDITIONAL('CORIGIN', NULL, 'Domestic', NULL, .4),
DM_CONDITIONAL('CTYPE', NULL, 'Sports', NULL, .8),
DM_CONDITIONAL('CTYPE', NULL, 'SUV', NULL, .2),
DM_CONDITIONAL('COLOR', NULL, 'Yellow', NULL, .4))
This shows the target variable (STOLEN), its values ('no' and 'yes'), the prior probability of each value (0.5), and the conditional probability contributed by each predictor/value pair towards each target class value.
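Rather than reading the nested collection by eye, you can also flatten it with a second TABLE() expression over the CONDITIONALS collection - a sketch, assuming the attribute names defined for the dm_nb_details and dm_conditional object types:
select t.target_attribute_name target_name,
t.target_attribute_str_value target_value,
t.prior_probability,
c.attribute_name predictor,
c.attribute_str_value predictor_value,
c.conditional_probability
from table(dbms_data_mining.get_model_details_nb('cars')) t,
table(t.conditionals) c
order by target_value, predictor, predictor_value;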

The ability to score transactional customer data directly in the database (in other words, to deploy the model right at the source of the customer data) with this simplicity is a key Oracle differentiator and competitive advantage over standalone data mining tools. For more on ODM, consult the references provided in this blog.
