Naive Bayes vs Binary Logistic regression using R

The Naïve Bayes method is one of the most frequently used machine learning algorithms for classification problems. It can therefore be considered an alternative to Binary Logistic Regression and Multinomial Logistic Regression, which we have discussed in previous tutorials. In this tutorial we’ll look at Naïve Bayes in detail.

You can download the data files for this tutorial here.

We’ll begin with an  overview of various classification methods and then introduce the Naïve Bayes Classifier. The method is based on Bayes’ Theorem, which requires an understanding of the concept of conditional probability. We’ll discuss how to form the classification rule using Naïve Bayes method and then implement the method in R. We’ll finally discuss both the advantages and limitations of the method.

Classification Methods

Apart from Naïve Bayes, there are several other machine learning algorithms used for classification problems. These include Support Vector Machines, K-Nearest Neighbours (KNN), Decision Trees, Random Forests and Neural Networks. In this tutorial we’ll focus on the Naïve Bayes method and look at the other methods in subsequent tutorials.

Machine Learning Algorithms

About the Naive Bayes Classifier

The Naïve Bayes classifier is a simple probabilistic classifier based on Bayes’ Theorem. It can be used as an alternative method to binary logistic regression or multinomial logistic regression. It’s important to note that the Naïve Bayes classifier assumes strong conditional independence among predictors, and is particularly suitable when the dimensionality of inputs is high. Despite its simplicity, Naive Bayes may in some situations outperform more sophisticated classification methods, but this is not always the case.

Conditional Probability

Before we look at Bayes’ Theorem, it’s important to understand the concept of conditional probability. Let’s consider a simple textbook example: tossing an unbiased die (with faces numbered 1, 2, 3, 4, 5, 6). The sample space therefore has 6 points.

We can define an event A such that we get a number greater than 1 on the uppermost face of the die. We can define another event B such that we get an even number (i.e. either 2 or 4 or 6) on the uppermost face of the die.

By definition, the probability of an event is the ratio of the number of favourable outcomes to the total number of outcomes. Therefore, the probability of event A is 5/6. Similarly, the probability of event B is 3/6.

We are now interested in knowing the probability that event B occurs given that A has already occurred. Since A has occurred, a number greater than 1 must have appeared on the uppermost face of the die. Hence, for the conditional probability of B given A, the sample space has only 5 points (the uppermost face must be 2, 3, 4, 5 or 6). The number of favourable cases for B is still 3 (namely 2, 4 and 6). Therefore, by the definition of probability, the probability of B given that A has occurred is 3/5.

Conditional Probability
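The die example above can be verified with a few lines of base R; the event definitions are taken directly from the text:

```r
# Sample space of an unbiased die
S <- 1:6

A <- S[S > 1]        # event A: number greater than 1 -> {2,3,4,5,6}
B <- S[S %% 2 == 0]  # event B: even number           -> {2,4,6}

p_A <- length(A) / length(S)   # 5/6
p_B <- length(B) / length(S)   # 3/6

# Conditional probability of B given A: restrict the sample space to A
p_B_given_A <- length(intersect(A, B)) / length(A)   # 3/5
```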

Bayes Theorem

The principle behind Naive Bayes is Bayes’ Theorem, also known as the Bayes Rule. Bayes’ Theorem is used to calculate a conditional probability: the probability of an event occurring given information about related events that have already occurred. Mathematically, Bayes’ Theorem is represented as shown in this equation:

Bayes Theorem
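Using the die events from the previous section, we can check the theorem numerically. P(A|B) can be computed directly by counting, or via Bayes’ Theorem as P(B|A)·P(A)/P(B); both give the same answer (every even number is greater than 1, so the probability is 1):

```r
S <- 1:6
A <- S[S > 1]        # numbers greater than 1
B <- S[S %% 2 == 0]  # even numbers

p_A <- length(A) / length(S)                        # 5/6
p_B <- length(B) / length(S)                        # 1/2
p_B_given_A <- length(intersect(A, B)) / length(A)  # 3/5

# Bayes' Theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_A_given_B <- p_B_given_A * p_A / p_B              # (3/5)*(5/6)/(1/2) = 1

# Direct computation by counting, for comparison
p_A_given_B_direct <- length(intersect(A, B)) / length(B)
```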

Naive Bayes Framework

To get a better understanding of how Naive Bayes works in classification problems, let’s look at the following situation:

Here Y is the target variable, which must be categorical. It is important to note that this method is not applicable if Y is a continuous variable. However, X variables or predictors can be either categorical or continuous variables. The objective is to estimate the probability of Y taking a specific value given the values of X variables. Since these are conditional probabilities, Bayes theorem will be used to estimate them.

Naive Bayes Framework
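A minimal sketch of the “naive” factorisation behind the method: the posterior for each class is proportional to the class prior times the product of the per-predictor conditional probabilities, and the case is assigned to the class with the largest posterior. All the numbers below are made up for illustration; none come from the case study:

```r
# Hypothetical prior P(Y) and conditionals P(Xj = 1 | Y) for two binary predictors
prior      <- c("0" = 0.7, "1" = 0.3)
p_x1_given <- c("0" = 0.4, "1" = 0.8)   # P(X1 = 1 | Y = 0), P(X1 = 1 | Y = 1)
p_x2_given <- c("0" = 0.5, "1" = 0.6)   # P(X2 = 1 | Y = 0), P(X2 = 1 | Y = 1)

# Observed case: X1 = 1, X2 = 1
# Naive assumption: the conditionals multiply within each class
unnormalised <- prior * p_x1_given * p_x2_given
posterior    <- unnormalised / sum(unnormalised)

# Classify to the category with the maximum posterior probability
predicted_class <- names(which.max(posterior))
```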

Naive Bayes Framework – Example

Now let’s look at a hypothetical example to further understand the framework of the Naïve Bayes method. Here Y is a binary variable that takes the value 1 if the person is a “potential buyer” (of a certain product) and 0 if the person is not. We consider two other variables, X1 “Age” and X2 “Gender”, both coded as binary variables.

Buyer/non-buyer

Classification Rule

Here we find the conditional probabilities: the probability that Y equals 0 given the values of X1 and X2, and the probability that Y equals 1 given the values of X1 and X2. Based on these two probabilities, we classify Y as either 0 or 1. In general Y can have more than two categories; in that case, Y is classified to the category for which the conditional probability is maximum.

Naive Bayes Classification rule

Expected Output

This is the expected output when we apply the Naïve Bayes method using any software. We get the estimated probabilities for “Y = 1” as well as “Y = 0”. In general, if we have K categories, we get K estimated probabilities. Based on these predicted probabilities we can classify Y to a specific category.

Naive Bayes output

Advantages of the Naive Bayes Method

Let us look at the advantages of the Naïve Bayes method. Firstly, the classification rule is simple to understand. Secondly, the method requires only a small amount of training data to estimate the parameters necessary for classification. Thirdly, evaluation of the classifier is quick and easy. Finally, the method can be a good alternative to logistic regression.

Advantages of Naive Bayes

Limitations of Naive Bayes Method

There are a few limitations of Naïve Bayes Method too.

The assumption of conditional independence of the independent variables is often unrealistic in practice. In the case of continuous independent variables, the density function must be known or assumed to be normal. In the case of categorical independent variables, the probabilities cannot be calculated if the count in any conditional category is zero. For instance, if there are no respondents in the age group 25-30 yrs, then P(X1=0 | Y=1) = 0. A remedy exists for this limitation: if a category has zero entries, we replace 0 by 0.5/n (where n is the sample size) so that the probability expression does not reduce to zero.

Limitations of Naive Bayes
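The zero-count remedy described above can be sketched in a few lines of base R. (The naiveBayes() function used later also accepts a laplace argument for additive smoothing, which addresses the same problem in a slightly different way.) The category counts below are made up for illustration; only n = 700 matches the case study:

```r
n <- 700                 # sample size, as in the bank loan case study
count <- c(12, 0, 25)    # hypothetical counts per conditional category; one is zero

# Replace a zero count by 0.5 so the estimated probability becomes 0.5/n
# instead of 0, and the product of conditionals cannot collapse to zero
count_adj <- ifelse(count == 0, 0.5, count)
prob <- count_adj / n
```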

Case Study – Modeling Loan Defaults

Let’s now implement the Naïve Bayes method on some bank loan data and compare its performance with Binary Logistic Regression. For this case study, we assume that a bank possesses demographic and transactional data on its loan customers. A model that predicts defaulters can help the bank in loan disbursal decision making. The objective is to predict whether a customer applying for a loan will be a defaulter or not. The sample size is 700. The independent variables are Age group, Years at current address, Years at current employer, Debt to Income Ratio, Credit Card Debt and Other Debt; the dependent variable is Defaulter (= 1 if defaulter, 0 otherwise). The information on the predictors was collected at the time of the loan application; the default status is observed after the loan is disbursed.

Bank loan case study

Bank Loan Data

This is a snapshot of the data. Age is a categorical variable with three categories, although it is coded as integers. The other independent variables are continuous, whereas the dependent variable, “Defaulter”, is binary.

Data Snapshot

Logistic Regression in R

Before we implement the Naïve Bayes method, let’s first apply Binary Logistic Regression (BLR) to the Bank Loan Data to understand the performance of BLR. We import the data file using the familiar read.csv function and look at the structure of the data. We notice that “AGE” is an integer variable, which needs to be converted to a factor so that the variable is treated appropriately. We then use R’s glm function, which stands for Generalized Linear Model. We specify “DEFAULTER” as the dependent variable and the six independent variables; “family=binomial” indicates that it is Binary Logistic Regression. The analysis output is stored in the object “riskmodel”.

# Importing data and checking data structure

 bankloan<-read.csv("BANK LOAN.csv",header=T)
 str(bankloan)  

# Output

Bank loan data
 bankloan$AGE<-factor(bankloan$AGE)
 riskmodel<-glm(DEFAULTER~AGE+EMPLOY+ADDRESS+DEBTINC+CREDDEBT+OTHDEBT,
               family=binomial,data=bankloan) 

glm() fits a generalised linear model. family=binomial ensures that a binary regression is used.

Model Summary

The “summary” function applied on “riskmodel”, gives estimates of Regression Coefficients as well as the results of the hypothesis testing. We can see that except “Age” and “Other debt”, the rest of the independent variables are statistically significant.

 summary(riskmodel) 

summary() generates model summary.

# Output

BLR output

Excluding Insignificant Variables

We’ll now re-apply the Binary Logistic Regression model, excluding the insignificant variables, “Age” and “Other debt”. We use the glm function where we use “Defaulter” as a dependent variable and four independent variables. The revised output is shown here where we see that all the four independent variables are statistically significant.

 riskmodel<-glm(DEFAULTER~EMPLOY+ADDRESS+DEBTINC+CREDDEBT,
                family=binomial,data=bankloan)
 
 summary(riskmodel) 

ROC Curve and Area Under ROC Curve

Let’s assess the performance of the Binary Logistic Regression model using the ROC Curve and the area under the ROC Curve. This concept was already explained in a previous tutorial. Hence, we won’t be discussing each component of the output in detail. The fitted function gives the predicted probabilities and the prediction function prepares the data required to plot the ROC Curve. The performance function calculates the “True Positive Rate” and “False Positive Rate”, which are then used to plot ROC Curve. The abline function gives the reference line which is a diagonal line.

# ROC Curve

  install.packages("ROCR")
 library(ROCR) 
 
 bankloan$predprob<-fitted(riskmodel)
 
 pred<-prediction(bankloan$predprob,bankloan$DEFAULTER)
 
 perf<-performance(pred,"tpr","fpr")
 
 plot(perf)
 
 abline(0,1) 

prediction() prepares the data required for the ROC curve.
performance() creates a performance object with the measures “tpr” (true positive rate) and “fpr” (false positive rate).
plot() plots the object created by performance().
abline() adds a straight reference line to the plot.

This is the output: the ROC Curve and the area under the ROC Curve, which is 0.8556, indicating that the model performance is quite good.

# Output

Area Under ROC curve BLR
 auc<-performance(pred,"auc")
 auc@y.values
[[1]]
[1] 0.8556193 

Naive Bayes Method in R

Now let’s use the Naïve Bayes method for the same problem. The naiveBayes function is available in the e1071 package in R, so we first install that package and load the e1071 library. We call naiveBayes with “DEFAULTER” as the dependent variable and all six variables as independent variables. Although “Age” and “Other Debt” were statistically insignificant in the Binary Logistic Regression, we still use them in the Naïve Bayes method because the statistical theory behind the Naïve Bayes algorithm is different. The output of the Naïve Bayes method is stored in the object “riskmodel2”.

# Install and load package “e1071”.

# Model Fitting

 install.packages("e1071")
 library(e1071)
 
 riskmodel2<-naiveBayes(DEFAULTER~AGE+EMPLOY+ADDRESS+DEBTINC+CREDDEBT+OTHDEBT,
                        data=bankloan) 

naiveBayes() fits a Naive Bayes model. It computes the conditional posterior probabilities of a customer being a defaulter or non-defaulter, given the values of the independent variables, using the Bayes rule.

riskmodel2 

Naive Bayes Model Output

Below is the basic output of the Naïve Bayes method, shown when we type the name of the object “riskmodel2”. As the Naïve Bayes method depends on Bayes’ theorem, which uses conditional probabilities, the output shows conditional probabilities for the variable “Age”, which is a factor variable: these are the probabilities of “Age” given Y = 0 and Y = 1. The other variables are continuous, so instead of conditional probabilities the output shows conditional means and conditional standard deviations. This is an interim output used to estimate the probability that Y equals 0 or 1 given the values of the X variables; we generally do not interpret anything based on this output.

# Output

Naive Bayes Model output

Predicted Probabilities

We use the predict function with riskmodel2 as the object and the argument type='raw', which gives two columns of probabilities: the first column is the probability that Y = 0 and the second is the probability that Y = 1. The two probabilities in each row always sum to 1. The object “prednb” is created using the predict function. For classification purposes we focus on the second column of the output, the probability that Y = 1.

# Predicted Probabilities

 prednb<-predict(riskmodel2,bankloan,type='raw') 

predict() returns predicted probabilities based on the model results and the data. type="raw" returns raw probabilities; if it is not specified, the predicted class is returned for each case.

head(prednb)  

# Output

Predicted probabilities
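To turn these raw probabilities into class labels, each row can be assigned to the column with the larger probability (equivalently, calling predict() without type='raw' returns the class directly). A self-contained sketch on a made-up probability matrix with the same layout as prednb:

```r
# Hypothetical output of predict(..., type = 'raw'): columns P(Y=0), P(Y=1)
prednb_demo <- matrix(c(0.9, 0.1,
                        0.3, 0.7,
                        0.6, 0.4),
                      ncol = 2, byrow = TRUE,
                      dimnames = list(NULL, c("0", "1")))

# Assign each case to the column with the maximum probability
predclass <- colnames(prednb_demo)[max.col(prednb_demo)]
```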

ROC Curve and Area Under ROC Curve

Now we compare the performance of the Naïve Bayes method with that of Binary Logistic Regression using the ROC Curve and the area under the ROC Curve. The syntax remains the same; the difference is that the predicted probabilities used here are estimated by the Naïve Bayes method. The shape of the ROC Curve is quite similar to the one obtained for Binary Logistic Regression, and the area under the curve is smaller than that of Binary Logistic Regression, though still good enough to be acceptable. We can conclude that Naïve Bayes and Binary Logistic Regression perform similarly on this particular data set, but this may not hold in every case: depending on the situation, the Naïve Bayes method can outperform Binary Logistic Regression or vice versa.

# ROC Curve and Area Under ROC Curve

 pred<-prediction(prednb[,2],bankloan$DEFAULTER)
 perf<-performance(pred,"tpr","fpr")
 plot(perf)
 abline(0,1)  
Area under ROC curve

# Area Under ROC Curve

 auc<-performance(pred,"auc")
 auc@y.values  
  [[1]]
 [1] 0.794971 

Quick Recap

Here’s a quick recap. We started with the concept of conditional probability and discussed Bayes’ Theorem, a landmark theorem in Statistics that underlies the Naïve Bayes classifier. We then used the naiveBayes function in the e1071 package and compared the performance of the Naïve Bayes method with Binary Logistic Regression using the ROC Curve and the area under the ROC Curve.

Naive Bayes summary