
Thursday, December 8, 2016

Regression Analysis




Beginners Guide to Regression Analysis and Plot Interpretations


Introduction

"The road to machine learning starts with Regression. Are you ready?"
If you are aspiring to become a data scientist, regression is the first algorithm you need to master, not just to clear job interviews but to solve real-world problems. To this day, a lot of consultancy firms continue to use regression techniques at scale to help their clients. No doubt, it's one of the easiest algorithms to learn, but it requires persistent effort to get to the master level.
Running a regression model is a no-brainer. A simple model <- lm(y ~ x) does the job. But optimizing this model for higher accuracy is a real challenge. Let's say your model gives adjusted R² = 0.678; how will you improve it?
In this article, I'll introduce you to crucial concepts of regression analysis with practice in R. Data is given for download below. Once you are finished reading this article, you'll be able to build, improve, and optimize regression models on your own. Regression has several types; however, in this article I'll focus on simple linear and multiple regression.
Note: This article is best suited for people new to machine learning with requisite knowledge of statistics. You should have R installed on your laptop.

Table of Contents

  1. What is Regression? How does it work?
  2. What are the assumptions made in Regression?
  3. How do I know if these assumptions are violated in my data?
  4. How can I improve the accuracy of a Regression Model?
  5. How can I assess the fit of a Regression Model?
  6. Practice Time - Solving a Regression Problem

What is Regression? How does it work?

Regression is a parametric technique used to predict a continuous (dependent) variable given a set of independent variables. It is parametric in nature because it makes certain assumptions (discussed next) about the data set. If the data set follows those assumptions, regression gives excellent results. Otherwise, it struggles to provide convincing accuracy. Don't worry. There are several tricks (we'll learn shortly) we can use to obtain convincing results.
Mathematically, regression uses a linear function to approximate (predict) the dependent variable given as:
Y = β0 + β1X + ε
where, Y - Dependent variable
X - Independent variable
β0 - Intercept
β1 - Slope
ε - Error
β0 and β1 are known as coefficients. This is the equation of simple linear regression. It's called 'simple' because there is just one independent variable (X) involved, and 'linear' because Y is a linear function of the coefficients. In multiple regression, we have many independent variables (Xs). If you recall, the equation above is nothing but the line equation (y = mx + c) we studied in school. Let's understand what these parameters say:
Y - This is the variable we predict.
X - This is the variable we use to make a prediction.
β0 - This is the intercept term. It is the prediction value you get when X = 0.
β1 - This is the slope term. It explains the change in Y when X changes by 1 unit.
ε - This represents the error term. Its estimate is the residual, i.e. the difference between actual and predicted values.
Error is an inevitable part of the prediction-making process. No matter how powerful the algorithm we choose, there will always remain an irreducible error (ε) which reminds us that the "future is uncertain."
Yet, we humans have a unique ability to persevere, i.e. we know we can't completely eliminate the error term (ε), but we can still try to reduce it to the lowest. Right? To do this, regression uses a technique known as Ordinary Least Squares (OLS).
So the next time you say you are using linear / multiple regression, you are actually referring to the OLS technique. Conceptually, the OLS technique tries to minimize the sum of squared errors ∑[Actual(y) - Predicted(y')]² by finding the best possible values of the regression coefficients (β0, β1, etc.).
Is OLS the only technique regression can use? No! There are other techniques such as Generalized Least Squares, Percentage Least Squares, Total Least Squares, Least Absolute Deviations, and many more. Then, why OLS? Let's see.
  1. It uses squared error, which has nice mathematical properties: it is easy to differentiate, so the minimum can be found in closed form or by gradient descent.
  2. OLS is easy to analyze and computationally faster, i.e. it can be quickly applied to data sets having 1000s of features.
  3. Interpretation of OLS is much easier than other regression techniques.
Let's understand OLS in detail using an example:
We are given a data set with 100 observations and 2 variables, namely Height and Weight. We need to predict weight (y) given height (x1). The OLS equation can be written as:
                                Y = β0 + β1(Height) + ε
When using R, Python or any computing language, you don't need to know how these coefficients and errors are calculated. As a matter of fact, most people don't care. But you must know, and that's how you'll get close to becoming a master.
The formula to calculate these coefficients is easy. Let's say you are given the data, and you don't have access to any statistical tool for computation. Can you still make any prediction? Yes!
The most intuitive and closest approximation of Y is mean of Y, i.e. even in the worst case scenario our predictive model should at least give higher accuracy than mean prediction. The formula to calculate coefficients goes like this:
β1 = Σ(xi - xmean)(yi - ymean) / Σ(xi - xmean)², where i = 1 to n (no. of obs.)
β0 = ymean - β1(xmean)
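As a quick check of these two formulas, here is a minimal sketch in R. The height and weight values are simulated (the numbers are invented for illustration and are not from any data set used in this article), and the hand-computed coefficients are compared with what lm returns.

```r
# Simulated height (cm) and weight (kg); purely illustrative numbers
set.seed(42)
height <- rnorm(100, mean = 170, sd = 10)
weight <- -60 + 0.75 * height + rnorm(100, sd = 5)

# Slope and intercept from the closed-form OLS formulas
b1 <- sum((height - mean(height)) * (weight - mean(weight))) /
  sum((height - mean(height))^2)
b0 <- mean(weight) - b1 * mean(height)

# The same model fitted with lm()
fit <- lm(weight ~ height)

print(c(intercept = b0, slope = b1))
print(coef(fit))
```

Both print statements should show the same pair of numbers, which is a handy way to convince yourself that lm is doing nothing more mysterious than the formulas above.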
Now you know ymean plays a crucial role in determining regression coefficients and furthermore accuracy. In OLS, the error estimates can be divided into three parts:
Residual Sum of Squares (RSS) - ∑[Actual(y) - Predicted(y)]²
Explained Sum of Squares (ESS) - ∑[Predicted(y) - Mean(ymean)]²
Total Sum of Squares (TSS) - ∑[Actual(y) - Mean(ymean)]²
The most important use of these error terms is in the calculation of the Coefficient of Determination (R²).
R² = 1 - (RSS/TSS) = ESS/TSS
R² metric tells us the amount of variance explained by the independent variables in the model. In the upcoming section, we'll learn and see the importance of this coefficient and more metrics to compute the model's accuracy.
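To make the three sums of squares concrete, here is a small sketch on simulated data (the variables x and y are made up for illustration). It computes R² as 1 - RSS/TSS, which for a model with an intercept equals ESS/TSS, and compares it with the value reported by summary.

```r
# Simulated data; purely illustrative
set.seed(42)
x <- rnorm(100, mean = 170, sd = 10)
y <- -60 + 0.75 * x + rnorm(100, sd = 5)

fit   <- lm(y ~ x)
y_hat <- fitted(fit)

rss <- sum((y - y_hat)^2)        # Residual Sum of Squares
ess <- sum((y_hat - mean(y))^2)  # Explained Sum of Squares
tss <- sum((y - mean(y))^2)      # Total Sum of Squares

r2 <- 1 - rss / tss

print(c(RSS = rss, ESS = ess, TSS = tss))
print(c(by_hand = r2, ess_over_tss = ess / tss, from_lm = summary(fit)$r.squared))
```

Note that TSS = RSS + ESS holds here because the model includes an intercept.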

What are the assumptions made in regression?

As we discussed above, regression is a parametric technique, so it makes assumptions. Let's look at the assumptions it makes:
  1. There exists a linear and additive relationship between dependent (DV) and independent variables (IV). By linear, it means that the change in DV for a 1 unit change in IV is constant. By additive, it means that the effect of X on Y is independent of the other variables.
  2. There must be little or no correlation among independent variables. Presence of correlation among independent variables leads to multicollinearity. If variables are correlated, it becomes extremely difficult for the model to determine the true effect of IVs on DV.
  3. The error terms must possess constant variance. Absence of constant variance leads to heteroskedasticity.
  4. The error terms must be uncorrelated, i.e. the error at time t (εt) must not indicate the error at time t+1 (εt+1). Presence of correlation in error terms is known as autocorrelation. It drastically affects the standard errors (and hence the significance tests of the regression coefficients), since they are based on the assumption of uncorrelated error terms.
  5. The error terms must possess a normal distribution. (The dependent variable itself need not be normal; the assumption concerns the errors.)
The presence of these assumptions makes regression quite restrictive. By restrictive I mean that the performance of a regression model is conditioned on the fulfillment of these assumptions.

How do I know if these assumptions are violated in my data?

Once these assumptions get violated, regression makes biased, erratic predictions. I'm sure you are tempted to ask me, "How do I know these assumptions are getting violated?"
Of course, you can check performance metrics to estimate violation. But the real treasure is present in the diagnostic a.k.a residual plots. Let's look at the important ones:
1. Residual vs. Fitted Values Plot
Ideally, this plot shouldn't show any pattern. But if you see any shape (linear, curve, U shape), it suggests non-linearity in the data set. In addition, if you see a funnel shape pattern, it suggests your data is suffering from heteroskedasticity, i.e. the error terms have non-constant variance.
[Figure: residual vs. fitted plots showing non-constant variance (heteroskedasticity)]
2. Normality Q-Q Plot
As the name suggests, this plot is used to determine the normal distribution of errors. It uses standardized values of residuals. Ideally, this plot should show a straight line. If you find a curved, distorted line, then your residuals have a non-normal distribution (problematic situation).
[Figure: linearity and non-linearity in the data]

3. Scale Location Plot
This plot is also useful to determine heteroskedasticity. Ideally, this plot shouldn't show any pattern. Presence of a pattern indicates heteroskedasticity. Don't forget to corroborate the findings of this plot with the funnel shape in residual vs. fitted values.
If you are a non-graphical person, you can also perform quick tests / methods to check assumption violations:
  1. Durbin-Watson Statistic (DW) - This test is used to check autocorrelation. Its value lies between 0 and 4. DW = 2 indicates no autocorrelation, 0 < DW < 2 implies positive autocorrelation, and 2 < DW < 4 implies negative autocorrelation.
  2. Variance Inflation Factor (VIF) - This metric is used to check multicollinearity. As a rule of thumb, VIF <= 4 suggests little multicollinearity, while VIF >= 10 suggests high multicollinearity. Alternatively, you can also look at the tolerance (1/VIF) value to determine correlation in IVs. In addition, you can also create a correlation matrix to determine collinear variables.
  3. Breusch-Pagan / Cook-Weisberg Test - This test is used to determine the presence of heteroskedasticity. If you find p < 0.05, you reject the null hypothesis of constant variance and infer that heteroskedasticity is present.
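To show what these three numbers actually measure, here is a sketch that computes them by hand in base R. The data are simulated on purpose with correlated predictors and non-constant error variance (all names and numbers are made up for illustration). The Breusch-Pagan part uses the studentized n × R² form of the statistic. In practice you would more likely call ready-made functions such as dwtest and bptest from the lmtest package or vif from the car package.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)                   # built to correlate with x1
y  <- 1 + 2 * x1 - x2 + rnorm(n, sd = exp(0.5 * x1))  # error spread grows with x1

fit <- lm(y ~ x1 + x2)
e   <- resid(fit)

# Durbin-Watson: sum of squared successive differences over sum of squares.
# Values near 2 indicate no first-order autocorrelation.
dw <- sum(diff(e)^2) / sum(e^2)

# VIF for x1: 1 / (1 - R^2) from regressing x1 on the other predictor(s)
vif_x1 <- 1 / (1 - summary(lm(x1 ~ x2))$r.squared)

# Breusch-Pagan (studentized): regress squared residuals on the predictors;
# n * R^2 is compared with a chi-square, df = number of predictors
e2      <- e^2
aux     <- lm(e2 ~ x1 + x2)
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 2, lower.tail = FALSE)

print(c(DW = dw, VIF_x1 = vif_x1, BP_stat = bp_stat, BP_p = bp_p))
```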

How can you improve the accuracy of a regression model?

There is little you can do when your data violates regression assumptions. An obvious solution is to use tree-based algorithms, which capture non-linearity quite well. But if you are adamant about using regression, the following are some tips you can implement:
  1. If your data is suffering from non-linearity, transform the IVs using sqrt, log, square, etc.
  2. If your data is suffering from heteroskedasticity, transform the DV using sqrt, log, square, etc. Also, you can use the weighted least squares method to tackle this problem.
  3. If your data is suffering from multicollinearity, use a correlation matrix to check correlated variables. Let's say variables A and B are highly correlated. Now, instead of removing one of them arbitrarily, use this approach: find the average correlation of A and of B with the rest of the variables. Whichever of the two has the higher average correlation with the other variables, remove it. Alternatively, you can use penalized regression methods such as lasso, ridge, elastic net, etc.
  4. You can do variable selection based on p values. If a variable shows p value > 0.05, we can consider removing that variable from the model, since at p > 0.05 we fail to reject the null hypothesis that its coefficient is zero.
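The average-correlation rule from tip 3 can be sketched in a few lines of R. The four variables below are simulated (A and B are built to be highly correlated; all names and numbers are hypothetical), and absolute correlations are used so that strong negative correlations count too, which is my own assumption about how to apply the rule.

```r
set.seed(7)
n <- 300
A <- rnorm(n)
B <- A + rnorm(n, sd = 0.3)   # highly correlated with A by construction
C <- 0.5 * A + rnorm(n)       # mildly correlated with A
D <- rnorm(n)
X <- data.frame(A, B, C, D)

# Absolute correlation matrix with the diagonal zeroed out
cm <- abs(cor(X))
diag(cm) <- 0

# The most strongly correlated pair of variables
idx  <- which(cm == max(cm), arr.ind = TRUE)[1, ]
pair <- colnames(cm)[idx]

# Average correlation of each member of the pair with the remaining variables
others  <- setdiff(colnames(cm), pair)
avg_cor <- sapply(pair, function(v) mean(cm[v, others]))

# Candidate for removal: the member with the higher average correlation
drop_var <- names(which.max(avg_cor))

print(pair)
print(avg_cor)
print(drop_var)
```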

How can you assess the fit of a regression model?

The ability to determine model fit is a tricky process. The metrics used to determine model fit can have different values based on the type of data. Hence, we need to be extremely careful while interpreting regression analysis. Following are some metrics you can use to evaluate your regression model:
  1. R Square (Coefficient of Determination) - As explained above, this metric gives the percentage of variance explained by the covariates in the model. It ranges between 0 and 1. Usually, higher values are desirable, but it rests on the data quality and domain. For example, if the data is noisy, you'd be happy to accept a model with a low R² value. But it's good practice to consider adjusted R² rather than R² to determine model fit.
  2. Adjusted R² - The problem with R² is that it never decreases as you increase the number of variables, regardless of whether the new variable is actually adding new information to the model. To overcome that, we use adjusted R², which doesn't increase (stays the same or decreases) unless the newly added variable is truly useful.
  3. F Statistic - It evaluates the overall significance of the model. It is the ratio of the variance explained by the model to the unexplained variance. It compares the full model with an intercept-only (no predictors) model. Its value can range between zero and any arbitrarily large number. Naturally, the higher the F statistic, the better the model.
  4. RMSE / MSE / MAE - An error metric is the crucial evaluation number we must check. Since all these are errors, the lower the number, the better the model. Let's look at them one by one:
    • MSE - This is mean squared error. It tends to amplify the impact of outliers on the model's accuracy. For example, suppose the actual y is 10 and the predicted y is 30; the resultant MSE would be (30-10)² = 400.
    • MAE - This is mean absolute error. It is more robust against the effect of outliers. Using the previous example, the resultant MAE would be |30-10| = 20.
    • RMSE - This is root mean square error. It is interpreted as how far, on average, the residuals are from zero. It undoes the squaring in MSE by taking the square root, and so gives the result in the original units of the data. Here, the resultant RMSE would be √(30-10)² = 20. Don't get baffled when you see the same value of MAE and RMSE; that only happens because this example has a single observation. Usually, we calculate these numbers by averaging (actual - predicted) over all observations in the data.
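With more than one observation, MAE and RMSE no longer coincide. Here is a small sketch with five made-up actual/predicted pairs, where the first prediction is the badly-off one from the example above:

```r
# Made-up values for illustration; the first prediction misses by 20
actual    <- c(10, 12, 15, 20, 18)
predicted <- c(30, 11, 14, 22, 17)

err  <- actual - predicted
mse  <- mean(err^2)      # 81.4
mae  <- mean(abs(err))   # 5
rmse <- sqrt(mse)        # about 9.02

print(c(MSE = mse, MAE = mae, RMSE = rmse))
```

The single large miss pulls RMSE well above MAE, which is exactly the outlier sensitivity described in the list.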

Solving a Regression Problem

Let's use our theoretical knowledge and create a model practically. As mentioned above, you should have R installed on your laptop. I've taken the data set from the UCI Machine Learning repository. Originally, the data set is available as a .txt file. To save you some time, I've converted it into .csv, and you can download it here.
Let's load the data set and do initial data analysis:
#set working directory
> path <- "C:/Users/Data/UCI"
> setwd(path)
#load data and check data
> mydata <- read.csv("airfoil_self_noise.csv")
> str(mydata)
This data has 5 independent variables and Sound_pressure_level as the dependent variable (to be predicted). In predictive modeling, we should always check missing values in data. If any data is missing, we can use methods like mean, median, and predictive modeling imputation to make up for missing data.
#check missing values
> colSums(is.na(mydata))
This data set has no missing values. Good for us! Now, to avoid multicollinearity, let's check correlation matrix.
> cor(mydata)
If you look carefully, you'll see that Angle_of_Attack and Displacement show 75% correlation. It's up to us to decide whether this level of correlation is damaging. Usually, correlation above 80% (subjective) is considered high. Therefore, we can let this pair pass and won't remove any variable.
In R, the base function lm is used for regression. We can run regression on this data by:
> regmodel <- lm(Sound_pressure_level ~ ., data = mydata)
> summary(regmodel)
lm(formula = Sound_pressure_level ~ ., data = mydata)
Residuals:
   Min    1Q    Median  3Q    Max 
-17.480 -2.882 -0.209 3.152 16.064
Coefficients:
                             Estimate   Std. Error   t value    Pr(>|t|) 
(Intercept)                  1.328e+02   5.447e-01    243.87   <2e-16 ***
Frquency(Hz)              -1.282e-03   4.211e-05   -30.45    <2e-16 ***
Angle_of_Attack             -4.219e-01   3.890e-02   -10.85    <2e-16 ***
Chord_Length                -3.569e+01   1.630e+00   -21.89    <2e-16 ***
Free_stream_velocity         9.985e-02   8.132e-03    12.28    <2e-16 ***
Displacement                -1.473e+02   1.501e+01   -9.81     <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.809 on 1497 degrees of freedom
Multiple R-squared: 0.5157, Adjusted R-squared: 0.5141 
F-statistic: 318.8 on 5 and 1497 DF, p-value: < 2.2e-16
~ . tells lm to use all the independent variables. Let's understand the regression output in detail:
  • Intercept - This is the β0 value. It's the prediction made by the model when all the independent variables are set to zero.
  • Estimate - This represents the regression coefficients for the respective variables. It's the value of the slope. Let's interpret it for Chord_Length. We can say, when Chord_Length is increased by 1 unit, holding other variables constant, Sound_pressure_level decreases by 35.69 units.
  • Std. Error - This measures the variability associated with each estimate. The smaller the standard error, the more precisely the coefficient is estimated.
  • t value - The t statistic is generally used to determine variable significance, i.e. whether a variable is significantly adding information to the model. As a rule of thumb, |t value| > 2 suggests the variable is significant. I treat it as optional, since the same information can be read from the p value.
  • p value - It's the probability value for each variable, used to judge its significance in the model. p value < 0.05 is the conventional threshold for significance.
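The last two lines of the output can be reproduced from R², the number of predictors and the degrees of freedom. The sketch below plugs in the values printed in the summary above; the formulas are the standard ones for adjusted R² and the overall F statistic, and small differences can arise because the printed R² is rounded.

```r
r2       <- 0.5157   # Multiple R-squared as printed in the summary
p        <- 5        # number of predictors
df_resid <- 1497     # residual degrees of freedom as printed
n        <- df_resid + p + 1   # number of observations

adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))

print(round(adj_r2, 4))  # close to the printed Adjusted R-squared
print(round(f_stat, 1))  # close to the printed F-statistic
```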
The adjusted R² implies that our model explains ~51% of the total variance in the data. And, the overall p value of the model is significant. Can we still improve this model? Let's try to do it. Now, we'll check the residual plots, understand the pattern and derive actionable insights (if any):
> #set graphic output
> par(mfrow=c(2,2))
> #create residual plots
> plot (regmodel)
[Figure: residual plots for the regression model]
Among all, Residual vs. Fitted value catches my attention. Not exactly though, but I see signs of heteroskedasticity in this data. Remember funnel shape? You can see a similar pattern. To overcome this situation, we'll build another model with log(y).
> regmodel <- update(regmodel, log(Sound_pressure_level)~.)
> summary(regmodel)
Call:
lm(formula = log(Sound_pressure_level) ~ Frquency(Hz) + Angle_of_Attack + 
Chord_Length + Free_stream_velocity + Displacement, data = mydata)
Residuals:
   Min        1Q      Median     3Q       Max 
-0.146939 -0.023272 -0.000701 0.025425 0.122213
Coefficients:
                       Estimate     Std. Error   t value  Pr(>|t|) 
(Intercept)             4.891e+00   4.393e-03   1113.31  <2e-16 ***
Frquency(Hz)         -1.054e-05   3.396e-07   -31.05   <2e-16 ***
Angle_of_Attack        -3.369e-03   3.137e-04   -10.74   <2e-16 ***
Chord_Length           -2.878e-01   1.315e-02   -21.89   <2e-16 ***
Free_stream_velocity    8.071e-04   6.559e-05    12.31   <2e-16 ***
Displacement           -1.244e+00   1.211e-01   -10.28   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.03878 on 1497 degrees of freedom
Multiple R-squared: 0.5235, Adjusted R-squared: 0.5219 
F-statistic: 329 on 5 and 1497 DF, p-value: < 2.2e-16
Though the improvement isn't significant, we've increased our adjusted R² to 52.19%. Also, the funnel shape wasn't completely evident in the first place, which suggests the effect of non-constant variance is not severe.
Let's divide the data set into train and test to check our final evaluation metric. We'll keep 70% data in train and 30% in test file. The reason being that we should keep enough data in train so that the model identifies obvious emerging patterns.
#sample
> set.seed(1)
> d <- sample(x = nrow(mydata),size = nrow(mydata)*0.7)
> train <- mydata[d,] #1052 rows
> test <- mydata[-d,] #451 rows
#train model
> regmodel <- lm(log(Sound_pressure_level)~.,data = train)
> summary(regmodel)
#test model
> regpred <- predict(regmodel,test)
#convert back to original value
> regpred <- exp(regpred)
> library(Metrics)
> rmse(actual = test$Sound_pressure_level,predicted = regpred)
[1] 5.03423
Interpretation of RMSE error will be more helpful while comparing it with other models. Right now, we can't say if 5.03 error is the optimal value we could expect. I've got you started solving regression problems. Now, you should spend more time and try to obtain a lower error rate than 5.03. For example, you should next check for outlier values using a box plot:
#save the output of boxplot
> bp <- boxplot(train$Displacement, outline = TRUE, plot = TRUE)
> bp$out #list outlier observations
Outliers have a substantial impact on regression's accuracy. Treating outliers is a tricky task. It requires in-depth understanding of data to acknowledge the existence of these high leverage points. I would suggest you read more about it, and if you are unable to find a way let me know in comments. I'm excited to see if you can do it!

Summary

My motive in writing this article is to get you started at solving regression problems, with a greater focus on the theoretical aspects. Running an algorithm isn't rocket science, but knowing how it works will surely give you more control over what you do.
In this article, I've discussed the basics and semi-advanced concepts of regression. In addition, I've also explained best practices which you are advised to follow when facing low model accuracy. We learned about regression assumptions, violations, model fit, and residual plots with practical dealing in R. If you are a Python user, you can run regression with LinearRegression().fit(x_train, y_train) after importing LinearRegression from scikit-learn's sklearn.linear_model module.

Sunday, December 4, 2016

Uplift Models

What are Uplift Models?


Uplift modeling, also known as incremental modeling, true lift modeling, or net-lift modeling is a predictive modeling technique that directly models the incremental impact of a treatment (such as a direct marketing action) on an individual’s behavior. Uplift modeling has applications in customer relationship management for up-sell, cross-sell and retention modeling. It has also been applied to personalized medicine. Unlike the related Differential Prediction concept in psychology, Uplift modeling assumes an active agent.

Return on Investment

All of your marketing efforts are ultimately about Return on Investment (ROI), unless you are a non-profit. But are you maximizing your ROI? There are at least three ways to calculate profit per sale (profit(sale)):
  • Profit of Product or Service (Margin)
  • Net Present Value (NPV)
  • Life Time Value (LTV)
If marketing is your business, you may have heard the term "uplift" used for a class of predictive models. Uplift is shorthand for the more common term “incremental lift”. This lift is based on the difference in the response rates of a treatment group and a control group:
R(treat) – R(control)
The ROI for a direct mail (DM) campaign, for example, is calculated by:
ROI = (N × profit(sale) × (R(treat) – R(control)) – cost(DM)) / cost(DM)
To improve ROI, the incremental response must be maximized, which is equivalent to increasing R(treat) while reducing R(control).
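Here is the ROI formula as a small R function with a worked example. All the numbers are hypothetical, and N is taken to be the number of customers mailed, which is my reading of the formula rather than something the text spells out.

```r
# ROI of a direct mail campaign driven by incremental response
# n: customers mailed (assumed meaning of N), cost_dm: total campaign cost
roi <- function(n, profit_per_sale, r_treat, r_control, cost_dm) {
  (n * profit_per_sale * (r_treat - r_control) - cost_dm) / cost_dm
}

# Hypothetical campaign: 100,000 mailed, 40 profit per sale,
# 10% response when treated vs. 5% in control, total cost 50,000
campaign_roi <- roi(n = 100000, profit_per_sale = 40,
                    r_treat = 0.10, r_control = 0.05, cost_dm = 50000)
print(campaign_roi)  # about 3, i.e. a 300% return

# If the control group responds just as well, the campaign loses its whole cost
print(roi(100000, 40, 0.10, 0.10, 50000))  # -1
```

The second call is the point of uplift thinking: a high response rate in the treated group is worth nothing if the same customers would have bought anyway.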

Current Situation

Here is the current situation in most marketing organizations, as seen through a modeler’s eyes. For direct marketing, a contact is delivered through a channel such as DM, DEM or OBTM. Customers who are most likely to respond or churn are selected for targeting. The most popular modeling solutions are:
  • Ordinary logistic model, built to score customers’ propensities of product acquisition or service activation
  • Survival model, built to score how likely and when a customer is going to churn
  • Most often, these are called propensity models, response models, or churn models
However, a big assumption is made: a Direct Marketing campaign will achieve maximal Incremental Response when a group of the highest scored customers is targeted.
A Propensity/Response model by itself is not going to tell marketers which customers are most likely to contribute to the incremental campaign response. An alternative statistical model is needed, one that targets the customers whose propensity to respond is driven most strongly by “touching” them with a promotion. There are primarily four types of customers we deal with when planning a marketing campaign, as depicted below.
We want to pay attention to the “Persuadables”, those consumers who will only buy if they receive an offer. Why spend marketing dollars on those who are going to buy anyway, those who do not want to receive your offer, or those who will never consider your offer?
This is something we can accomplish using an uplift model. Before we get to that, however, do not get your hopes up. Uplift models are hard to construct and even harder to maintain, and they are not a "cure all" solution to marketing problems.

Uplift Modeling

Uplift modeling uses a randomized scientific control to not only measure the effectiveness of a marketing action but also to build a predictive model that predicts the incremental response to the marketing action. It is a data mining technique that has been applied predominantly in the financial services, telecommunications and retail direct marketing industries to up-sell, cross-sell, churn and retention activities.

The Path to Uplift Modeling

This kind of modeling is not for the faint of heart. It is a bold approach that requires an understanding of the underlying data, precision in preparing the response data, tremendous modeler judgment, and adept salesmanship. Uplift modeling does not begin with an experimental design, though that is required. Rather, it begins with raising awareness with the client and progresses through proactively measuring uplift for marketing campaigns as they occur, until at last there is sufficient confidence in our ability to predict uplift.

Measuring Uplift

The uplift of a marketing campaign is usually defined as the difference in response rate between a treated group and a randomized control group. This allows a marketing team to isolate the effect of a marketing action and measure the effectiveness or otherwise of that individual marketing action. Honest marketing teams will only take credit for the incremental effect of their campaign.

The uplift problem statement

The uplift problem can be stated as follows. Given:
  • cases P = {1,…,n},
  • treatments J = {1,…,U},
  • expected return R(i,t) for each case i and treatment t,
  • non-negative integers n1,…,nU such that n1 + ⋯ + nU = n,
find a treatment assignment
f: P → J
so that the total return
∑ R(i, f(i)) for i = 1…n
is maximized, subject to the constraint that the number of cases assigned to treatment j does not exceed nj (j = 1,…,U).
Example: Marketing action case
  • P: a group of customers,
  • two treatments:
  1. treatment 1: exercise some marketing action; Ri1 = R(i,1) is the expected return if treatment 1 is given to customer i,
  2. treatment 2: exercise no marketing action; Ri2 = R(i,2) is the expected return if treatment 2 is given to customer i.
The total return splits into two sums: ∑ Ri2 over all customers, plus ∑ (Ri1 - Ri2) over the customers assigned treatment 1. The first sum does not involve f, so maximizing the total return is equivalent to maximizing the second:
∑ (Ri1 - Ri2) over (i: f(i) = 1)
To attain the maximum return:
  • assign treatment 1 to the customers with the largest values of Ri1-Ri2
  • assign treatment 2 to the remaining customers
The difference Ri1-Ri2 is called net lift, uplift, incremental response, differential response, etc.
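The assignment rule above can be sketched in R for the two-treatment case. The expected returns r1 and r2 and the capacity n1 are hypothetical numbers generated for illustration; in a real campaign they would come from models, not from runif.

```r
set.seed(3)
n_customers <- 10
r1 <- round(runif(n_customers, 0, 20), 2)  # hypothetical Ri1: return if contacted
r2 <- round(runif(n_customers, 0, 10), 2)  # hypothetical Ri2: return if not contacted
n1 <- 4                                    # how many customers we can contact

# Net lift per customer
uplift <- r1 - r2

# Treatment 1 for the n1 largest uplifts, treatment 2 for everyone else
treated <- order(uplift, decreasing = TRUE)[seq_len(n1)]
f <- rep(2L, n_customers)
f[treated] <- 1L

total_return <- sum(ifelse(f == 1L, r1, r2))

print(data.frame(customer = seq_len(n_customers), r1, r2, uplift, treatment = f))
print(total_return)
```

Note that the customers chosen are those with the largest difference r1 - r2, not those with the largest r1; ranking by r1 alone is what a traditional response model does.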
If we consider only the response to treatment 1, and base targeting on a model built out of responses to previous marketing actions, we are not proceeding so as to maximize ∑ Rif(i) for i=1…n, and such targeting would not yield the maximum return. We need to consider the return from cases subjected to no marketing action. Now, consider a model with a binary response, e.g., Yes = 1, No = 0. Then the netlift is:
Prob(1) = exp(score1)/(1 + exp(score1));
Prob(0) = exp(score0)/(1 + exp(score0));
netlift = Prob(1) – Prob(0),
where Prob(1) is the probability of a response under treatment 1 (the marketing action), Prob(0) is the probability of a response under no marketing action, and score1 and score0 are the corresponding model scores. The incremental response lift, with all initial values set to 0, can be obtained using the following pseudo code:
prob1 = exp(score_1) / (1 + exp(score_1));
prob0 = exp(score_0) / (1 + exp(score_0));
netlift = prob1 - prob0;
if treatment_flag = 1 then do;
    mail_total = mail_total + 1;
    if response = 1 then mail_response = mail_response + 1;
    expected_netlift_mailed = expected_netlift_mailed + netlift;
    expected_anyway_mailed = expected_anyway_mailed + prob0;
    expected_mail_response = expected_mail_response + prob1;
end;
else do;
    nomail_total = nomail_total + 1;
    if response = 1 then nomail_response = nomail_response + 1;
end;
The associated probabilities and netlift pseudo code is:
if last.pentile;  /* assumed: keep only the final record of each pentile */
empirical_prob_mail = mail_response / mail_total;
empirical_prob_nomail = nomail_response / nomail_total;
empirical_netlift = empirical_prob_mail - empirical_prob_nomail;
percent_gain = 100 * empirical_netlift / empirical_prob_nomail;
empirical_expected_buyanyway_mailed = mail_total * empirical_prob_nomail;
empirical_expected_netlift = mail_response - empirical_expected_buyanyway_mailed;
The table below shows the number of responses and the calculated response rate for a hypothetical marketing campaign. This campaign would be defined as having a response rate uplift of 5 percentage points (10% - 5%). It has created 50,000 incremental responses (100,000 - 50,000).
Group      Number of Customers   Responses   Response Rate
Treated    1,000,000             100,000     10%
Control    1,000,000             50,000      5%
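The same arithmetic in R, using the figures from the table. The percent gain line follows the formula in the pseudo code earlier (empirical netlift relative to the no-mail response rate); the variable names are mine.

```r
campaign <- data.frame(
  group     = c("Treated", "Control"),
  customers = c(1000000, 1000000),
  responses = c(100000, 50000)
)
campaign$response_rate <- campaign$responses / campaign$customers

rate_treated <- campaign$response_rate[campaign$group == "Treated"]
rate_control <- campaign$response_rate[campaign$group == "Control"]

uplift       <- rate_treated - rate_control   # 0.05, i.e. 5 percentage points
incremental  <- uplift * campaign$customers[campaign$group == "Treated"]  # 50,000
percent_gain <- 100 * uplift / rate_control   # 100: treated respond at twice the rate

print(campaign)
print(c(uplift = uplift, incremental = incremental, percent_gain = percent_gain))
```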

Traditional response modeling

Traditional response modeling typically takes a group of treated customers and attempts to build a predictive model that separates the likely responders from the non-responders through the use of one of a number of predictive modeling techniques. Typically this would use decision trees or regression analysis. This model would only use the treated customers to build the model. In contrast uplift modeling uses both the treated and control customers to build a predictive model that focuses on the incremental response. To understand this type of model it is proposed that there is a fundamental segmentation that separates customers into the following groups (Lo, 2002):
  • The Persuadables : customers who only respond to the marketing action because they were targeted
  • The Sure Things : customers who would have responded whether they were targeted or not
  • The Lost Causes : customers who will not respond irrespective of whether or not they are targeted
  • The Do Not Disturbs or Sleeping Dogs : customers who are less likely to respond because they were targeted
The only segment that provides true incremental responses is the Persuadables. Uplift modeling provides a scoring technique that can separate customers into the groups described above. Traditional response modeling often targets the Sure Things being unable to distinguish them from the Persuadables.

Types of Uplift Models

Differential response (two model)

The Two-Model approach requires us to build two logistic models as follows:
Logit(Ptest(response | X, treatment =1)) = a + b*X + g*treatment
Logit(Pcontrol(response | X, treatment=0) ) = a + b*X
We then calculate the uplift score by taking difference of two scores
Score = Ptest(response | X, treatment =1) – Pcontrol(response | X, treatment =0)
Advantages
  • Uses standard logistic regression modeling techniques
  • Easy to implement and maintain
Disadvantages
  • Does not fit the target directly (i.e. incremental response)
  • Introduces modeling errors twice
  • Sensitive to predictive variable selections and parameter estimations
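A minimal sketch of the two-model approach in R, using glm on simulated data. Everything about the data is an assumption made for illustration: one predictor x, a randomly assigned treatment flag, and a "true" response process in which treatment helps more for larger x. Because each model is fitted on one group only, the treatment effect in the treated model is absorbed into its intercept.

```r
set.seed(11)
n <- 4000
x         <- rnorm(n)
treatment <- rbinom(n, 1, 0.5)   # randomized treatment / control flag

# Simulated truth (an assumption for this sketch)
p_true   <- plogis(-1 + 0.5 * x + treatment * (0.2 + 0.6 * x))
response <- rbinom(n, 1, p_true)
dat <- data.frame(x, treatment, response)

# One logistic model per group
m_treat   <- glm(response ~ x, family = binomial, data = dat[dat$treatment == 1, ])
m_control <- glm(response ~ x, family = binomial, data = dat[dat$treatment == 0, ])

# Score every customer with both models; the difference is the uplift score
p_treat   <- predict(m_treat,   newdata = dat, type = "response")
p_control <- predict(m_control, newdata = dat, type = "response")
dat$uplift <- p_treat - p_control

# Customers with the highest estimated uplift
print(head(dat[order(-dat$uplift), ]))
```

Since two separately estimated probabilities are subtracted, the errors of both models end up in the score, which is the "modeling errors twice" disadvantage listed above.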

Differential response (one model)

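Build a single logistic model that includes the treatment indicator and its interaction with the predictors:
Logit(P(response | X, treatment)) = a + b*X + g*treatment + l*treatment*X
Now we calculate the uplift score as the difference between the predicted probabilities with treatment switched on and off:
Score = P(response | X, treatment = 1) - P(response | X, treatment = 0)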
Advantages
  • Uses standard logistic regression modeling techniques
  • Better robustness compared to the two-model approach
  • Captures effect modification due to treatment
Disadvantages
  • Does not fit the target directly (i.e. Lift)
  • Increases modeling complexity due to assumptions of non-linearity
  • Needs trade-off between significances and sizes of parameter estimations due to turning treatment on/off
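The one-model approach can be sketched in base R as below, again on simulated data (the variable names and the simulated response process are illustrative assumptions). The formula `response ~ x * treatment` expands to the main effects plus the `x:treatment` interaction, which matches the a + b*X + g*treatment + l*treatment*X form above.

```r
set.seed(42)
n <- 2000

# Simulated campaign data (illustrative assumption, not real data)
x <- rnorm(n)
treatment <- rbinom(n, 1, 0.5)
p_true <- plogis(-1 + 0.5 * x + treatment * (0.3 + 0.4 * x))
campaign <- data.frame(response = rbinom(n, 1, p_true), x, treatment)

# Single logistic model with a treatment-by-covariate interaction
fit <- glm(response ~ x * treatment, family = binomial, data = campaign)

# Score each customer twice: treatment switched on, then off
p_treated <- predict(fit, newdata = transform(campaign, treatment = 1),
                     type = "response")
p_untreated <- predict(fit, newdata = transform(campaign, treatment = 0),
                       type = "response")
campaign$uplift_one_model <- p_treated - p_untreated

# Coefficients a, b, g and l in the notation used above
coef(fit)
```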

Random Forest

Uplift Random Forests estimate personalized treatment effects (a.k.a. uplift) by binary recursive partitioning. The algorithm and split methods are described in Guelman et al. (2013a, 2013b). In short, an ensemble of B trees is grown, each built on a fraction ν of the training data (which includes both treatment and control records). The estimated personalized treatment effect is obtained by averaging the predictions of the individual trees in the ensemble.
Advantages
  • Random forests are among the more accurate general-purpose learning algorithms; on many data sets they produce a highly accurate classifier.
  • The sampling, motivated by Friedman (2002), incorporates randomness as an integral part of the fitting procedure.
  • Adds an additional layer of randomness, which further reduces the correlation between trees, and hence reduces the variance of the ensemble.
Disadvantages
  • The main limitation of the Random Forests algorithm is that a large number of trees may make the algorithm slow for real-time prediction.
  • For data that include categorical variables with different numbers of levels, random forests are biased in favor of the attributes with more levels. Therefore, the variable importance scores from a random forest are not reliable for this type of data.

Example: Simulation-Educators.com

The majority of direct marketing campaigns are based on purchase propensity models, which select customer email, paper mail or other marketing contact lists based on customers’ probability of making a purchase. Simulation-Educators.com offers training courses in modeling and simulation topics. The following is an example of standard purchase propensity model output for a mailing campaign for such courses.
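| Scoring Rank | Response Rate | Lift |
|---|---|---|
| 1 | 28.1% | 3.41 |
| 2 | 17.3% | 2.10 |
| 3 | 9.6% | 1.17 |
| 4 | 8.4% | 1.02 |
| 5 | 4.8% | 0.58 |
| 6 | 3.9% | 0.47 |
| 7 | 3.3% | 0.40 |
| 8 | 3.4% | 0.41 |
| 9 | 3.5% | 0.42 |
| 10 | 0.1% | 0.01 |
| Total | 8.2% | |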
Table 1. Example of standard purchase propensity model output used to generate direct campaign mailing list at Simulation-Educators.com
This purchase propensity model had a ‘nice’ lift (a rank’s response rate divided by the total response rate) for the top 4 ranks on the validation data set. Consequently, we would contact the customers in the top 4 ranks. After the catalog campaign had been completed, we conducted a post analysis of mailing list performance versus a control group. The control group consisted of customers who were not contacted, grouped by the same purchase probability scoring ranks.
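The lift calculation can be reproduced from the response rates in Table 1 with a few lines of R. This is only a sketch using the rounded percentages printed in the table, so the recomputed lift can differ from the published column by 0.01 to 0.02 (presumably because the original was computed from unrounded rates).

```r
# Response rate (%) by scoring rank, copied from Table 1
response_rate <- c(28.1, 17.3, 9.6, 8.4, 4.8, 3.9, 3.3, 3.4, 3.5, 0.1)
total_rate <- 8.2

# Lift = rank's response rate / total response rate
lift <- round(response_rate / total_rate, 2)
data.frame(rank = 1:10, response_rate, lift)
```

Ranks with a lift above 1 respond better than the campaign average, which is why the top 4 ranks were selected.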
| Scoring Rank | Mailing Group Response Rate | Control Group Response Rate | Incremental Response Rate |
|---|---|---|---|
| 1 | 26.99% | 27.90% | -0.91% |
| 2 | 20.34% | 20.90% | -0.56% |
| 3 | 10.70% | 10.04% | 0.66% |
| 4 | 8.90% | 7.52% | 1.38% |
| Total | 16.70% | 16.55% | 0.15% |
Table 2. Campaign Post analysis
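The incremental response rate in Table 2 is the mailing group response rate net of the control group response rate. A short R sketch using the figures from the table (the per-10,000 column is added here only to give a sense of scale):

```r
# Response rates (%) copied from Table 2
post <- data.frame(
  rank    = c("1", "2", "3", "4", "Total"),
  mailing = c(26.99, 20.34, 10.70, 8.90, 16.70),
  control = c(27.90, 20.90, 10.04, 7.52, 16.55)
)

# Incremental response rate = mailing rate - control rate
post$incremental <- round(post$mailing - post$control, 2)

# Extra responders per 10,000 customers mailed at that incremental rate
post$extra_per_10000 <- post$incremental / 100 * 10000
post
```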
As shown in Table 2, the top four customer ranks selected by the propensity model perform well for both the mailing group and the control group. However, even though the mailing/test group response rate was at a decent level (16.7%), our incremental response rate (mailing group net of control group) for the combined top 4 ranks was only 0.15%. With such a low incremental response rate, our undertaking would likely generate a negative ROI.
Why did our campaign show such poor incremental results? The purchase propensity model did its job well, and we did send an offer to people who were likely to make a purchase. Apparently, modeling based on expected purchase propensity is not always the right solution for a successful direct marketing campaign. Since there was almost no increase in response rate over the control group, we could have been contacting customers who would have bought our product without promotional direct mail. Customers in the top ranks of the purchase propensity model may not need a nudge, or they may be buying in response to a contact via other channels. If that is the case, the customers in the lower purchase propensity ranks would be more ‘responsive’ to a marketing contact.
We should be predicting incremental impact: additional purchases generated by a campaign, not purchases that would be made without the contact. Our marketing mailing can be substantially more cost efficient if we don’t mail customers who are going to buy anyway. Since customers very rarely use promo codes from catalogs or click on web display ads, it is difficult to identify undecided, swing customers based on promotion codes or web display click-throughs. Net lift models predict which customer segments are likely to make a purchase ONLY if prompted by a marketing undertaking.
Purchasers from the mailing group include customers who needed a nudge; however, none of the purchasers in the holdout/control group needed our catalog to make their purchasing decision. All purchasers in the control group can therefore be classified as ‘need no contact’. Since we need a model that separates ‘need contact’ purchasers from ‘no contact’ purchasers, the net lift models look at differences between purchasers in the mailing (contact) group and purchasers in the control group. In order to classify our customers into these groups, we need mailing group and control group purchase results from similar prior campaigns. If there are no comparable historic undertakings, we have to run a small-scale trial before the main rollout.

Uplift modeling approach—probability decomposition models

Segments used in probability decomposition models:
| Segment | Contacted Group | Control Group |
|---|---|---|
| Purchasers prompted by contact | A | D |
| Purchasers not needing contact | B | E |
| Non-purchasers | C | F |
Figure 2. Segments in probability decomposition models
Standard purchase propensity models can only predict all purchasers (segments A and B combined). The probability decomposition model predicts the purchaser segment that needs to be contacted (segment A) by combining two logistic regression models, as shown in the formula below (Zhong, 2009).
P(A | A∪B∪C) = P(A∪B | A∪B∪C) × (2 - 1 / P(A∪B | A∪B∪E))
where:
  • P(A | A∪B∪C) is the probability of a purchase prompted by contact
  • P(A∪B | A∪B∪C) is the probability of purchase within the contact group
  • P(A∪B | A∪B∪E) is the probability of a purchaser being in the contact group, out of all purchasers

Summary of probability decomposition modeling process:

  1. Build a stepwise logistic regression purchase propensity model (M1) and record the model score for every customer in the modeled population.
  2. Use past campaign results or small-scale trial campaign results to create a dataset with two equal-size sections of purchasers from the contact group and the control group. Build a stepwise logistic regression model predicting which purchasers are from the contact group. The main task of this model is to penalize the score of the model built in step 1 when a purchaser is not likely to need contact.
  3. Calculate the net purchaser score using the probability decomposition formula.
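Step 3 amounts to combining the two model scores with the decomposition formula. The sketch below is a hedged illustration: the function name `net_purchase_score` and the segment counts are made up, and the sanity check assumes equally sized contact and control groups, so that the number of "purchasers not needing contact" is the same in both (B equals E). Under that assumption the formula recovers the share of segment A in the contact group.

```r
# Hypothetical helper implementing the decomposition formula.
# p_purchase: score from the step 1 model, P(purchase | contacted)
# p_contact:  score from the step 2 model, P(contacted | purchaser)
net_purchase_score <- function(p_purchase, p_contact) {
  p_purchase * (2 - 1 / p_contact)
}

# Sanity check with made-up segment counts (B == E by assumption)
A <- 60; B <- 140; C <- 800; E <- 140
p1 <- (A + B) / (A + B + C)   # purchasers among contacted customers
p2 <- (A + B) / (A + B + E)   # contacted purchasers among all purchasers

net_purchase_score(p1, p2)    # matches A / (A + B + C)
A / (A + B + C)
```

Note that the score turns negative whenever the second model's probability falls below 0.5, that is, when a purchaser looks more like a control-group purchaser than a contacted one.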

Results of the probability decomposition modeling process

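| Scoring Rank | Contact Group Response % | Control Group Response % | Incremental Response Rate |
|---|---|---|---|
| 1 | 18.8% | 12.9% | 5.9% |
| 2 | 7.8% | 5.4% | 2.4% |
| 3 | 6.9% | 4.5% | 2.5% |
| 4 | 4.3% | 3.6% | 0.7% |
| 5 | 3.9% | 3.5% | 0.4% |
| 6 | 4.1% | 4.1% | 0.0% |
| 7 | 3.7% | 4.0% | -0.2% |
| 8 | 4.7% | 4.1% | 0.6% |
| 9 | 5.0% | 6.7% | -1.7% |
| 10 | 11.0% | 15.7% | -4.7% |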
Table 3. Post analysis of campaign leveraging probability decomposition model for Simulation-Educators.com
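Scoring ranks 1 through 5 show positive incremental response rates, and rank 6 breaks even at 0.0%. Apart from rank 8 (0.6%), the lower ranks are negative. The scoring ranks broadly follow the order of the incremental response rates, though not perfectly.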

Return on investment

Because uplift modeling focuses on incremental responses only, it provides very strong return on investment cases when applied to traditional demand generation and retention activities. For example, by targeting only the persuadable customers in an outbound marketing campaign, contact costs can be reduced and hence the return per unit spend dramatically improved (Radcliffe & Surry, 2011).

Removal of negative effects

One of the most effective uses of uplift modeling is the removal of negative effects from retention campaigns. In both the telecommunications and financial services industries, retention campaigns can often trigger customers to cancel a contract or policy. Uplift modeling allows these customers, the Do Not Disturbs, to be removed from the campaign.

History of uplift modeling

The earliest true response modeling appears to be the work of Radcliffe and Surry (Radcliffe & Surry, 1999). Victor Lo also published on this topic in The True Lift Model (Lo, 2002), as did Radcliffe more recently (Radcliffe, Using Control Groups to Target on Predicted Lift: Building and Assessing Uplift Models, 2007). Radcliffe also provides a very useful frequently asked questions (FAQ) section on his web site, Scientific Marketer (Uplift Modelling FAQ, 2007). Similar approaches have been explored in personalized medicine (Cai, Tian, Wong, & Wei, 2009). Uplift modeling is a special case of the older psychology concept of Differential Prediction. In contrast to differential prediction, uplift modeling assumes an active agent and uses the uplift measure as an optimization metric.