
Advanced Business Analytics

Data Mining: Classification

Advanced Business Analytics – Majid Karimi

Data Mining Revisited

Data Mining

The process of discovering patterns in large data sets for prediction and classification.

Classification

The process of predicting the future value of a qualitative (categorical) variable.

Approaches for predicting a qualitative variable:
• Logistic Regression
• k-Nearest Neighbors (kNN)
• Artificial Neural Networks (ANN)
• …

Data Mining: Classification (©2019 Cengage)


Before we begin: Data Sampling, Preparation, and Partitioning

• When dealing with large volumes of data, best practice is to extract a representative sample for analysis.

• A sample is representative if the analyst can make the same conclusions from it as from the entire population of data.

• The sample of data must be large enough to contain significant information, yet small enough to be manipulated quickly.

• Data mining algorithms typically are more effective given more data.


Data Sampling, Preparation, and Partitioning: Continued

• When obtaining a representative sample, it is generally best to include as many variables as possible in the sample.

• After exploring the data with descriptive statistics and visualization, the analyst can eliminate variables that are not of interest.

• Data mining applications deal with an abundance of data that simplifies the process of assessing the accuracy of data-based estimates of variable effects.


Overfitting

• Model overfitting occurs when the analyst builds a model that does a great job of explaining the sample of data on which it is based, but fails to accurately predict outside the sample data.

• We can use the abundance of data to guard against the potential for overfitting by decomposing the data set into three partitions:
  • The training set.
  • The validation set.
  • The test set.


Data Partitioning

• Training set: Consists of the data used to build the candidate models.

• Validation set: The data set to which the promising subset of models is applied to identify which model is the most accurate at predicting observations that were not used to build the model.

• Test set: The data set to which the final model should be applied to estimate the model's effectiveness when applied to data that have not been used to build or select the model.
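As a rough Python sketch, a three-way 50%/30%/20% partition might be implemented as follows. The data here are hypothetical numbered observations standing in for a real data file, and the function name is illustrative:

```python
import random

def partition(rows, train_frac=0.5, valid_frac=0.3, seed=42):
    """Randomly split rows into training, validation, and test sets."""
    rows = rows[:]                      # copy so the original order is untouched
    random.Random(seed).shuffle(rows)   # fixed seed for a reproducible split
    n = len(rows)
    n_train = int(n * train_frac)
    n_valid = int(n * valid_frac)
    train = rows[:n_train]
    valid = rows[n_train:n_train + n_valid]
    test = rows[n_train + n_valid:]     # remainder (about 20%)
    return train, valid, test

# Hypothetical data: 1,000 numbered observations
data = list(range(1000))
train, valid, test = partition(data)
print(len(train), len(valid), len(test))  # 500 300 200
```

Shuffling before slicing is what keeps each partition representative of the whole sample.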


Data Partitioning Visualized


Data Partitioning Continued

• There are no definite rules for the sizes of the three partitions, but the training set is typically the largest.

• For estimation tasks, a rule of thumb is to have at least 10 times as many observations as variables.

• For classification tasks, a rule of thumb is to have at least 6 × m × q observations, where m is the number of outcome categories and q is the number of variables.
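A one-line helper makes the classification rule of thumb concrete (the function name is illustrative, not part of any library):

```python
def min_obs_classification(m, q):
    """Rule-of-thumb minimum number of observations for a classification
    task: at least 6 * m * q, where m is the number of outcome categories
    and q is the number of variables."""
    return 6 * m * q

# A binary outcome (m = 2) with 10 variables calls for at least 120 observations
print(min_obs_classification(2, 10))  # 120
```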


Data Partitioning with Oversampling

• When we are interested in predicting a rare event, such as a click-through on an advertisement posted on a web site or a fraudulent credit card transaction, it is recommended that the training set oversample the number of observations corresponding to the rare events to provide the data-mining algorithm sufficient data to "learn" about the rare events.

Clicks

If only one out of every 10,000 users clicks on an advertisement posted on a web site, we would not have sufficient information to distinguish between users who do not click through and those who do if we constructed a representative training set consisting of one observation corresponding to a click-through and 9,999 observations with no click-through. In these cases, the training set should contain equal or nearly equal numbers of observations corresponding to the different values of the outcome variable.


Note that we do not oversample the validation and test sets; these samples should be representative of the overall population so that accuracy measures evaluated on these data sets appropriately reflect future performance of the data-mining model.
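As an illustrative sketch (hypothetical data and function names of my choosing), oversampling the rare class in the training partition only might look like this:

```python
import random

def oversample_training(rows, label_fn, seed=7):
    """Build a balanced training set by resampling the rare class (with
    replacement) up to the size of the common class. Only the training
    partition is balanced this way; the validation and test partitions
    must stay representative of the population."""
    rng = random.Random(seed)
    pos = [r for r in rows if label_fn(r) == 1]   # rare events, e.g. click-throughs
    neg = [r for r in rows if label_fn(r) == 0]
    rare, common = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    boosted = rare + [rng.choice(rare) for _ in range(len(common) - len(rare))]
    return common + boosted

# Hypothetical training data: 3 click-throughs among 1,000 impressions
train = [(i, 1 if i < 3 else 0) for i in range(1000)]
balanced = oversample_training(train, label_fn=lambda r: r[1])
clicks = sum(1 for r in balanced if r[1] == 1)
print(len(balanced), clicks)  # 1994 997
```

After balancing, the algorithm sees roughly equal numbers of each outcome, as the slide recommends.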


Data Partitioning in Excel

Credit Scores

• Download the file "Optiva.xlsx" from the Classification module. This file includes loan customers' data. Consider the task of classifying loan customers as either "default" or "no default." Partition the data into three sets of training, validation, and test with a 50%, 30%, 20% split, respectively.

• What is the chance of a customer defaulting on their loan?
• What can you conclude with regard to the partitioning of the data?


Data Partitioning in Excel Using Analytics Solver

Partitioning with Over-sampling

We cover the implementation of partitioning with oversampling during the synchronous classes.


Who is going to win the Oscars?

Using Oscar nominations to predict Oscar winners.

Download the OscardDemo file from the Classification module on Cougar Courses, and fit a regression equation to predict Winning Oscars using the independent variable of Oscars Nominations.


Who is going to win the Oscars? (continued)

• Does this make sense?

• Why can't we apply linear regression to classify a categorical variable?

• We should be estimating the "probability" of winning the Oscars.


Logistic Regression: The Idea

• Logistic regression models the log odds of a binary categorical outcome as a linear function of explanatory variables.

• A linear regression model fails to appropriately explain a categorical outcome variable.

• Odds is a measure related to probability: if an estimate of the probability of an event is p̂, the equivalent odds measure is p̂ / (1 − p̂).

• The odds metric ranges between zero and positive infinity.

• We eliminate the fit problem by using the logit, ln(p̂ / (1 − p̂)), which ranges over the entire real line.
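The odds and logit transformations can be checked numerically with a couple of lines of Python:

```python
import math

def odds(p):
    """Odds equivalent of probability p; ranges over (0, +infinity)."""
    return p / (1 - p)

def logit(p):
    """Log odds of probability p; ranges over the whole real line."""
    return math.log(odds(p))

# A probability of 0.8 corresponds to odds of 4 and a logit of ln(4)
print(round(odds(0.8), 2), round(logit(0.8), 3))  # 4.0 1.386
```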


Logistic Regression: The Procedure

Logistic Regression Model:

ln(p̂ / (1 − p̂)) = b0 + b1x1 + · · · + bnxn

Given a set of explanatory variables, a logistic regression algorithm determines the values of b0, b1, · · · , bn that best estimate the log odds.

To calculate the estimated probability, we can use the logistic function:

p̂ = 1 / (1 + e^−(b0 + b1x1 + · · · + bnxn))
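A minimal sketch of the logistic function applied to fitted coefficients (the coefficient values here are hypothetical, not estimated from any data set):

```python
import math

def logistic_probability(b, x):
    """Estimated probability from a fitted logistic regression:
    p-hat = 1 / (1 + exp(-(b0 + b1*x1 + ... + bn*xn))),
    with b = [b0, b1, ..., bn] and x = [x1, ..., xn]."""
    z = b[0] + sum(bi * xi for bi, xi in zip(b[1:], x))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients b0 = -1.0, b1 = 2.0 evaluated at x1 = 0.5:
# the log odds are exactly 0, so the probability is exactly 0.5
print(logistic_probability([-1.0, 2.0], [0.5]))  # 0.5
```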


Back to the Oscars Example.

• If we apply logistic regression to the Oscars example we get:

p̂ = 1 / (1 + e^−(−6.214 + 0.596x))

• For example, a movie with five nominations has a 3.8% chance of winning the Oscar:

p̂ = 1 / (1 + e^−(−6.214 + 0.596(5))) = 0.038
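The 3.8% figure can be verified directly by plugging the fitted coefficients into the logistic function:

```python
import math

# Logistic function with the Oscars coefficients, evaluated at x = 5 nominations
p_hat = 1 / (1 + math.exp(-(-6.214 + 0.596 * 5)))
print(round(p_hat, 3))  # 0.038
```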


Logistic Regression in Excel

Oscars

Assume we have been given the task of constructing a logistic regression model to classify winners of the Best Picture Oscar, using Winner as the output variable and OscarNominations, GoldenGlobeWins, and Comedy as input variables.
Can we use our linear regression model to fit a logistic regression equation for this data?

Logistic Regression in Analytics Solver

We cover the implementation of logistic regression and the above practice questions during the synchronous classes.


k-Nearest Neighbors

• k-Nearest Neighbors (k-NN): This method can be used either to classify a categorical outcome or to predict a continuous outcome.

• k-NN uses the k most similar observations from the training set, where similarity is typically measured with Euclidean distance.

• k-NN is a "lazy learner" that directly uses the entire training set to classify observations in the validation and test sets.


k-Nearest Neighbors Example

Loan Default

Consider the following customer information.

What is the chance that a 28-year-old customer with an average balance of $900 defaults on their loan?


k-Nearest Neighbors Example Continued

k = 1: Classified as a loan default (Class 1) because its nearest neighbor (Observation 2) is in Class 1.


k = 2: The two nearest neighbors are Observation 2 (Class 1) and Observation 7 (Class 0). Because at least half of the k = 2 neighbors are Class 1, the new observation is classified as Class 1.


k = 3: The three nearest neighbors are Observation 2 (Class 1), Observation 7 (Class 0), and Observation 6 (Class 0). Because only 1/3 of the neighbors are Class 1, the new observation is classified as Class 0.
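The k = 1, 2, 3 behavior above can be reproduced with a small sketch. The training points below are hypothetical stand-ins arranged to mirror the slide's neighbor ordering; they are not the actual Optiva customer data:

```python
import math
from collections import Counter

def knn_classify(train, new_point, k):
    """Classify new_point by majority vote among its k nearest training
    observations (Euclidean distance). Ties go to Class 1, matching the
    'at least half of the neighbors' rule used in the slides."""
    neighbors = sorted(train, key=lambda obs: math.dist(obs[0], new_point))[:k]
    votes = Counter(label for _, label in neighbors)
    return 1 if votes[1] >= k / 2 else 0

# Hypothetical, already-scaled feature points standing in for (age, balance)
train = [((1.0, 0.0), 1),   # stand-in for Observation 2, Class 1 (default)
         ((2.0, 0.0), 0),   # stand-in for Observation 7, Class 0
         ((3.0, 0.0), 0),   # stand-in for Observation 6, Class 0
         ((4.0, 0.0), 0)]
new = (0.0, 0.0)
print([knn_classify(train, new, k) for k in (1, 2, 3)])  # [1, 1, 0]
```

In practice, features such as age and balance should be standardized first so that Euclidean distance is not dominated by the variable with the larger scale.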


k-Nearest Neighbors for Prediction

• When k-NN is used to estimate a continuous outcome, a new observation's outcome value is predicted to be the average of the outcome values of its k nearest neighbors in the training set.
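A minimal sketch of k-NN for prediction, using hypothetical observations of the form ((features,), outcome):

```python
import math

def knn_predict(train, new_point, k):
    """Predict a continuous outcome as the average outcome of the
    k nearest training observations (Euclidean distance)."""
    neighbors = sorted(train, key=lambda obs: math.dist(obs[0], new_point))[:k]
    return sum(y for _, y in neighbors) / k

# Hypothetical training data: one feature, one continuous outcome
train = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((9.0,), 90.0)]
print(knn_predict(train, (0.0,), 3))  # average of 10, 20, 30 -> 20.0
```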


kNN in Excel

Loan Default

Download the file "Optiva.xlsx" from the Classification module. This file includes loan customers' data. Consider the task of classifying loan customers as either "default" or "no default." Partition the data into three sets of training, validation, and test. Apply the k-NN algorithm to answer the following question.

• What is the chance that the following customer defaults on their loan?
• Average Balance: $1,500, Age: 25, Employed, Married, and College Student.

kNN in Analytics Solver

We cover the implementation of kNN and the above practice questions during the synchronous classes.
