In contrast to descriptive statistics, **inferential statistics** draws conclusions about the *population* based on information gleaned from a *sample* taken from that population. Fundamental to understanding statistical inference is the concept of probability.

An *experiment* is the process of measuring/observing an activity. An *outcome* is a particular result of the experiment; a variable whose value is determined by the outcome is called a **random variable**. Random variables can be **discrete** - assuming a countable number of values (e.g. one of the six outcomes of rolling a die) - or **continuous** - assuming uncountably many values in a given range (time, a person's height). The **sample space** is the set of all possible outcomes of the experiment. An **event** is an outcome of interest.

The **Probability** of Event A occurring is:

P(A) = (# of outcomes in which Event A occurs) / (Total # of outcomes in the sample space)

**Basic properties of probability:**

- P(A) = 1 implies Event A will occur with certainty
- P(A) = 0 implies Event A will certainly not occur
- 0 <= P(A) <= 1
- The probabilities of all outcomes in the sample space must sum to 1
- All outcomes in the sample space that are not part of Event A are called the *complement* of Event A (denoted A'). P(A') = 1 - P(A)
- Given two events A and B, the probability of each event occurring without knowledge of the other event's occurrence - P(A) or P(B) - is called the *prior probability*.
- Given two events A and B, the probability of Event A occurring given that Event B has occurred - denoted P(A | B) - is called the *conditional probability* or *posterior probability* of Event A given Event B. On the flip side, if P(A | B) = P(A), then events A and B are termed *independent*.
- Given two events A and B, the probability of both A and B occurring at the same time is called the *joint probability* of A and B. In general P(A and B) = P(A | B) * P(B), which reduces to P(A) * P(B) when A and B are independent.
- Given two events A and B, the probability of either A or B occurring is called the *union* of events A and B. If events A and B cannot occur at the same time (i.e. are *mutually exclusive*), then P(A or B) = P(A) + P(B). If they are not mutually exclusive, then P(A or B) = P(A) + P(B) - P(A and B)
- Law of Total Probability: P(A) = P(A | B)P(B) + P(A | B')P(B')
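A quick worked illustration of these properties, using a fair six-sided die: let Event A be "roll an even number", so A = {2, 4, 6} and P(A) = 3/6 = 0.5, and its complement A' = {1, 3, 5} has P(A') = 1 - 0.5 = 0.5. With Event B = "roll a number greater than 4" = {5, 6}, the joint probability is P(A and B) = P({6}) = 1/6; the events are not mutually exclusive, so the union is P(A or B) = 3/6 + 2/6 - 1/6 = 4/6; and the conditional probability is P(A | B) = (1/6) / (2/6) = 0.5 = P(A), so A and B happen to be independent.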
- The *Bayes Theorem* provides the ability to reverse the conditionality of events - that is, to compute P(A | B) from P(B | A):

P(A | B) = P(A) * P(B | A) / (P(A) * P(B | A) + P(A') * P(B | A'))
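To see the theorem in action, here is a worked example with made-up numbers: suppose 1% of loan applicants eventually default (P(A) = 0.01), and a screening test flags 90% of eventual defaulters (P(B | A) = 0.9) but also flags 10% of non-defaulters (P(B | A') = 0.1). The probability that a flagged applicant actually defaults is then

P(A | B) = (0.01 * 0.9) / (0.01 * 0.9 + 0.99 * 0.1) = 0.009 / 0.108 ≈ 0.083

or only about 8% - far lower than the 90% one might naively expect, because defaults are rare to begin with.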

In essence, all of the above is about *predicting that a given event will occur with a given level of certainty or chance* - quantified by the probability. This is a good segue to look at a real *business* problem and its solution based on Bayes theorem.

A modern, *predictive* loan processing application builds analytical models based on millions of historical loan applicant records (*training* data), and uses these models to *predict* the credit-worthiness (a.k.a. risk of loan default) of an applicant by *classifying* the applicant into *Low*, *Medium*, *High* or similar risk categories. In data mining lingo, a new applicant record is then *scored* based on the model.

At the time of this writing (Aug-Sep 2007), the sub-prime lending woes and their effect on US and world markets are the main story. The trillions lost in this mess are fodder for Quant skeptics/detractors, but as a BusinessWeek cover story ("Not So Smart", Sep 3 2007) explains, the problem was not analytics per se - the problems were with how various managements (mis)used analytics or (mis)understood their data.

Returning to probability concepts, instead of A and B, the events become A and B_i, i = 1..n. The event A (or, more appropriately for this example, the *target variable* **Risk**) is a dependent variable that assumes one of a set of discrete values (called *classes* - **low, medium, high**) based on the *predictor* variables B_1 through B_n (age, salary, gender, occupation, and so on). The probability model for this *classifier* is P(A | B_1,..,B_n). We just shifted the language from statistics into the realm of data mining/predictive analytics. The full Bayes model intrinsically accounts for conditional dependence between B_1 through B_n, which means estimating a probability for every combination of predictor values. Now if *n* is large, or if each B_i takes on a large number of values, computing this model becomes intractable: with n predictors of k values each, there are on the order of k^n combinations to estimate.

The **Naive Bayes** probabilistic model greatly simplifies this by making a naive/strong assumption that B_1 through B_n are *conditionally independent* given the target - the details are provided here. You can build a Naive Bayes model using the Oracle Data Mining Option, and predict the value of a target variable in new records using SQL Prediction Functions. The following example illustrates the process.
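Concretely, the conditional-independence assumption collapses the model into a simple product (a standard restatement of the Naive Bayes formula in the notation used above):

P(A | B_1,..,B_n) ∝ P(A) * P(B_1 | A) * P(B_2 | A) * ... * P(B_n | A)

The classifier picks the class value of A that maximizes this product, so only one conditional probability per predictor value per class needs to be estimated - exactly the quantities that appear in the model details shown later.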

EX> Given a small, synthetic dataset about the attributes of stolen cars, predict if a particular car will be stolen - based on its attributes.

create table stolen_cars (
  id      varchar2(2),
  color   varchar2(10),
  ctype   varchar2(10),
  corigin varchar2(10),
  stolen  varchar2(3));

Table created.

insert into stolen_cars values ('1',  'Red',    'Sports', 'Domestic', 'yes');
insert into stolen_cars values ('2',  'Red',    'Sports', 'Domestic', 'no');
insert into stolen_cars values ('3',  'Red',    'Sports', 'Domestic', 'yes');
insert into stolen_cars values ('4',  'Yellow', 'Sports', 'Domestic', 'no');
insert into stolen_cars values ('5',  'Yellow', 'Sports', 'Imported', 'yes');
insert into stolen_cars values ('6',  'Yellow', 'SUV',    'Imported', 'no');
insert into stolen_cars values ('7',  'Yellow', 'SUV',    'Imported', 'yes');
insert into stolen_cars values ('8',  'Yellow', 'SUV',    'Domestic', 'no');
insert into stolen_cars values ('9',  'Red',    'SUV',    'Imported', 'no');
insert into stolen_cars values ('10', 'Red',    'Sports', 'Imported', 'yes');
commit;

Commit complete.
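Before building the model, it is worth eyeballing the target distribution (a minimal sanity check; the counts follow directly from the ten rows above):

```sql
-- Class distribution of the target: 5 'yes' and 5 'no' rows,
-- i.e. a prior probability of 0.5 for each class
select stolen, count(*) cnt
  from stolen_cars
 group by stolen;
```

The even 5/5 split resurfaces as the .5 prior probability in the model details further down.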

begin
  dbms_data_mining.create_model(
    model_name          => 'cars',
    mining_function     => dbms_data_mining.classification,
    data_table_name     => 'stolen_cars',
    case_id_column_name => 'id',
    target_column_name  => 'stolen');
end;
/

PL/SQL procedure successfully completed.
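No settings table is passed to create_model above - Naive Bayes is the default algorithm for classification in this release, so none is needed. To make the algorithm choice explicit, you could supply the optional settings table; the following is a sketch, with cars_settings as a made-up name:

```sql
-- Hypothetical settings table that pins the algorithm to Naive Bayes
create table cars_settings (
  setting_name  varchar2(30),
  setting_value varchar2(4000));

insert into cars_settings values
  (dbms_data_mining.algo_name, dbms_data_mining.algo_naive_bayes);
commit;

-- ...then add settings_table_name => 'cars_settings' to the
-- dbms_data_mining.create_model call
```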

create table new_stolen_cars (
  id      varchar2(2),
  color   varchar2(10),
  ctype   varchar2(10),
  corigin varchar2(10));

Table created.

insert into new_stolen_cars values ('1', 'Red',    'SUV',    'Domestic');
insert into new_stolen_cars values ('2', 'Yellow', 'SUV',    'Domestic');
insert into new_stolen_cars values ('3', 'Yellow', 'SUV',    'Imported');
insert into new_stolen_cars values ('4', 'Yellow', 'Sports', 'Domestic');
insert into new_stolen_cars values ('5', 'Red',    'Sports', 'Domestic');
commit;

Commit complete.

select prediction(cars using *) pred,
       prediction_probability(cars using *) prob
  from new_stolen_cars;

-- Results
PRE PROB
--- ----------
no  .75
no  .870967746
no  .75
no  .529411793
yes .666666687

The query scores each row in the new_stolen_cars table, returning the prediction and the certainty of this prediction. This dataset is very small, but a cursory glance at the results indicates that the predictions are correct - based on the training data. For example, the model predicts 'no' for a domestic yellow sports car - the training data has no stolen instance of such a car. The model predicts 'yes' for a domestic red sports car, with > 50% certainty - the training data (two of the three such cars were stolen) does support this prediction.
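The prediction functions are not limited to staged tables - the PREDICTION ... USING syntax also accepts ad hoc attribute values. For example (a minimal sketch), you can score a single hypothetical car straight from dual:

```sql
-- Score one ad hoc record without inserting it into a table
select prediction(cars using
         'Red'      as color,
         'Sports'   as ctype,
         'Domestic' as corigin) pred,
       prediction_probability(cars using
         'Red'      as color,
         'Sports'   as ctype,
         'Domestic' as corigin) prob
  from dual;
```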

You can obtain the details of this model using:

select *
  from table(dbms_data_mining.get_model_details_nb('cars'));

The SQL output is a collection of objects and may not look pretty at first glance. But once you understand the schema of the output type - viz. dm_nb_details - you can decipher the output to the following simple format:

STOLEN  no  .5
DM_CONDITIONALS(
  DM_CONDITIONAL('COLOR',   NULL, 'Red',      NULL, .4),
  DM_CONDITIONAL('COLOR',   NULL, 'Yellow',   NULL, .6),
  DM_CONDITIONAL('CTYPE',   NULL, 'Sports',   NULL, .4),
  DM_CONDITIONAL('CTYPE',   NULL, 'SUV',      NULL, .6),
  DM_CONDITIONAL('CORIGIN', NULL, 'Domestic', NULL, .6),
  DM_CONDITIONAL('CORIGIN', NULL, 'Imported', NULL, .4))

STOLEN  yes  .5
DM_CONDITIONALS(
  DM_CONDITIONAL('COLOR',   NULL, 'Red',      NULL, .6),
  DM_CONDITIONAL('COLOR',   NULL, 'Yellow',   NULL, .4),
  DM_CONDITIONAL('CTYPE',   NULL, 'Sports',   NULL, .8),
  DM_CONDITIONAL('CTYPE',   NULL, 'SUV',      NULL, .2),
  DM_CONDITIONAL('CORIGIN', NULL, 'Domestic', NULL, .4),
  DM_CONDITIONAL('CORIGIN', NULL, 'Imported', NULL, .6))

This shows the target variable (STOLEN), its value ('no' or 'yes'), the prior probability (.5), and the conditional probability contributed by each predictor/predictor-value pair towards each target/class value.
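You can verify these conditionals straight from the training data (a minimal sketch; ratio_to_report is a standard Oracle analytic function). Among the five stolen = 'yes' rows, for instance, three are Red - matching the .6 above:

```sql
-- Recompute P(predictor value | class) from the raw training data
select stolen, color,
       ratio_to_report(count(*)) over (partition by stolen) cond_prob
  from stolen_cars
 group by stolen, color
 order by stolen, color;
```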

The ability to score transactional customer data directly from the database - in other words, to deploy the model right at the source of customer data - with such simplicity is a key Oracle differentiator and competitive advantage over standalone data mining tools. For more on ODM, consult the references provided in this blog.
