In this tutorial, we will learn about binary logistic regression and its application to real-life data using Python. We have also covered binary logistic regression in R in another tutorial. Binary logistic regression remains one of the most widely used predictive modeling methods. Logistic regression is a classification algorithm used to predict the probability of a categorical dependent variable. Here, the method is used to model a binary variable that takes two possible values, typically coded as 0 and 1.
You can download the data files for this tutorial here.
We’ll first recap a few aspects of binary logistic regression and then focus on statistical modeling, hypothesis testing and classification tables using Python. We’ll use a case study in the banking domain to demonstrate the method.
Binary Logistic Regression in Python
Binary logistic regression models the relationship between a set of independent variables and a binary dependent variable. It is useful when the dependent variable is dichotomous in nature, such as death or survival, absence or presence, or pass or fail. The dependent variable contains data coded as 1 (yes, success, etc.) or 0 (no, failure, etc.), and the logistic regression model predicts P(Y=1) as a function of X. Independent variables can be categorical or continuous, for example, gender, age, income, geographical region and so on. Binary logistic regression models the logit of p, that is, the log of the odds p/(1-p), where p is the probability that the dependent variable takes the value one.
Statistical Model – For k Predictors
Let’s see what the statistical model in binary logistic regression looks like. In the model equation, p is the probability that Y equals one given X, where Y is the dependent variable and the X’s are the independent variables. b0 to bk are the parameters of the model, which are estimated using the maximum likelihood method. The left-hand side of the equation, the logit of p, ranges from minus infinity to plus infinity.
p : Probability that Y=1 given X
Y : Dependent Variable
X1, X2 ,…, Xk : Independent Variables
b0, b1 ,…, bk : Parameters of Model
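Written out with the symbols defined above, the model presumably takes the standard logit form. This is a reconstruction from the definitions in this section, so the notation may differ slightly from the original figure:

```latex
\log\left(\frac{p}{1-p}\right) = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_k X_k
\qquad\Longleftrightarrow\qquad
p = \frac{e^{\,b_0 + b_1 X_1 + \dots + b_k X_k}}{1 + e^{\,b_0 + b_1 X_1 + \dots + b_k X_k}}
```

The first form shows why the left-hand side is unbounded, while the second shows that p always stays between 0 and 1.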
Case Study – Modeling Loan Defaults
Let’s explain the concept of binary logistic regression using a case study from the banking sector. Our bank has the demographic and transactional data of its loan customers. It wants to develop a model that predicts defaulters and helps the bank in its loan disbursal decision making. The objective here is to predict whether customers applying for a loan will be defaulters or not. We will use a sample of size 700 to develop the model. The independent variables are age group, years at current address, years at current employer, debt to income ratio, credit card debt and other debt, all of which are collected at the time of the loan application process. The dependent variable is the status observed after the loan is disbursed, which is one if the customer is a defaulter and zero if not.
BLR Data Snapshot
Here’s a snapshot of the data. Our dependent variable is binary, whereas the independent variables are either categorical or continuous in nature.
Binary Logistic Regression in Python
Let’s import our data and check the data structure in Python. As usual, we import the data using the read_csv function in the pandas library, and use the info function to check the data structure. We can see here that the AGE variable is stored as an integer type.
# Import data and check data structure before running model
import pandas as pd
bankloan = pd.read_csv('BANK LOAN.csv')
bankloan.info()
Age should be a categorical variable, and therefore needs to be converted into a category type. If it isn’t converted, Python will interpret it as a numeric variable, which is not correct, as we are considering age groups in our model.
# Change ‘AGE’ variable into categorical
AGE is stored as an integer and needs to be converted to the “category” type for modeling purposes.
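The conversion can be sketched with the pandas astype method. The small data frame below is a hypothetical stand-in for the bank loan data (its values are made up); the same call would presumably be applied to bankloan['AGE'].

```python
import pandas as pd

# Hypothetical stand-in for the bank loan data: AGE holds age-group codes.
bankloan_demo = pd.DataFrame(
    {"AGE": [1, 2, 3, 2, 1], "DEFAULTER": [0, 1, 0, 0, 1]}
)

# Convert the integer age-group codes to a categorical type so that a
# formula-based model treats them as groups rather than as a number.
bankloan_demo["AGE"] = bankloan_demo["AGE"].astype("category")

print(bankloan_demo["AGE"].dtype)
print(list(bankloan_demo["AGE"].cat.categories))
```

After the conversion, info() should report the column as category instead of an integer type.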
Logistic regression uses the logit link function. As with the linear regression model, dependent and independent variables are separated using the tilde sign, and independent variables are separated by the plus sign.
So which independent variables affect whether customers turn into defaulters? After fitting the logistic regression model, we carry out individual hypothesis testing to identify significant variables, using the summary function on the model object to get detailed output. Variables whose p-value is less than 0.05 are considered statistically significant. Since the p-value is < 0.05 for EMPLOY, ADDRESS, DEBTINC and CREDDEBT, these independent variables are significant.
Logistic Regression using logit function
import statsmodels.formula.api as smf
riskmodel = smf.logit(formula = 'DEFAULTER ~ AGE + EMPLOY + ADDRESS + DEBTINC + CREDDEBT + OTHDEBT', data = bankloan).fit()
riskmodel.summary()
logit() fits a logistic regression model to the data.
BLR Model summary
summary() generates a detailed summary of the model.
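Besides reading the summary table, the p-values can be pulled from the fitted model and filtered in code. The sketch below uses synthetic data with hypothetical variables X1, X2 and Y (not the bank loan file), where only X1 truly drives the outcome; pvalues is the attribute statsmodels exposes on fitted results.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic, hypothetical data: X1 drives the outcome, X2 is pure noise.
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({"X1": rng.normal(size=n), "X2": rng.normal(size=n)})
true_p = 1 / (1 + np.exp(-(-0.5 + 1.5 * demo["X1"].to_numpy())))
demo["Y"] = rng.binomial(1, true_p)

# disp=0 only silences the optimizer's convergence message.
demo_model = smf.logit("Y ~ X1 + X2", data=demo).fit(disp=0)

# One p-value per coefficient, from the individual Wald tests.
pvalues = demo_model.pvalues
significant = pvalues[pvalues < 0.05].index.tolist()

print(pvalues.round(4))
print("Significant at the 5% level:", significant)
```

The 0.05 cutoff is the same convention used in the text; a different significance level only changes the filter.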
Re-run the BLR Model in Python
Once the variables to be retained are finalized, we re-run the binary logistic regression model, including only the significant variables. Again, the output of the summary function provides the revised coefficients for the model.
riskmodel = smf.logit(formula = 'DEFAULTER ~ EMPLOY + ADDRESS + DEBTINC + CREDDEBT', data = bankloan).fit()
riskmodel.summary()
In this output, all independent variables are statistically significant and the signs are logical, so this model can be used for further diagnosis.
Odds Ratios In Python
After substituting values of parameter estimates this is how the final model will appear.
The probability of defaulting can be predicted if the values of the X variables are entered into the equation.
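As a minimal sketch of that substitution, the function below turns a linear predictor into a probability. The coefficients and inputs are hypothetical placeholders chosen for easy arithmetic, not the estimates from this case study.

```python
import math

def predicted_probability(intercept, coefs, x):
    """Return P(Y=1) for one observation from logit-scale coefficients."""
    linear = intercept + sum(b * v for b, v in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-linear))

# Hypothetical coefficients and inputs, for illustration only.
b0 = -1.0
b = [0.5, -0.2]
x = [2.0, 5.0]

# Linear predictor: -1.0 + 0.5*2.0 - 0.2*5.0 = -1.0
p = predicted_probability(b0, b, x)
print(round(p, 4))  # 0.2689
```

A linear predictor of zero corresponds to a probability of exactly 0.5, which is why the sign of the predictor tells you which side of 50% a customer falls on.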
We use the odds ratio to measure the association between the independent variable and dependent variable. Once the parameter is estimated with confidence intervals, by simply taking the antilog we can get the Odds Ratios with confidence intervals. In Python the ‘conf_int’ function calculates the confidence interval for parameters, and then parameter estimates are added to the object. The antilog values are printed to give a table of odds ratios.
import numpy as np
conf = riskmodel.conf_int()
conf['OR'] = riskmodel.params
conf.columns = ['2.5%', '97.5%', 'OR']
print(np.exp(conf))
conf_int(): calculates confidence intervals for parameters
riskmodel.params: holds the model parameter estimates
Odds Ratios in Python
From the output here, we can see that none of the confidence intervals for the odds ratios includes one, which indicates that all the variables included in the model are significant. The odds ratio for CREDDEBT is approximately 1.77.
So for a one-unit increase in CREDDEBT, the odds of being a defaulter are multiplied by about 1.77, holding the other variables constant.
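The arithmetic behind this interpretation can be sketched as follows. Only the odds ratio of about 1.77 comes from the output above; the starting odds of 0.25 describe a hypothetical customer.

```python
import math

odds_ratio = 1.77                # approximate value reported for CREDDEBT
coef = math.log(odds_ratio)      # the matching coefficient on the log-odds scale

# A k-unit increase multiplies the odds by odds_ratio ** k.
two_unit_factor = odds_ratio ** 2

# Hypothetical customer whose current odds of default are 0.25 (p = 0.2).
odds_before = 0.25
odds_after = odds_before * odds_ratio
p_after = odds_after / (1 + odds_after)

print(round(coef, 3), round(two_unit_factor, 2), round(p_after, 3))
```

Note that the odds are multiplied by 1.77, not the probability: in this example the probability moves from 0.2 to roughly 0.31.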
Predicting Probabilities in Python
We obtain predicted probabilities from the final model using the predict function. The predicted probabilities are saved in the same bankloan dataset in the new variable ‘pred’.
The last column in the data gives predicted probabilities using the final model.
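A minimal sketch of this step is shown below. It fits a model on synthetic data (the values are generated, and only the column names DEBTINC and DEFAULTER are borrowed from the case study) and stores the in-sample predictions in a new pred column; on the real data the same assignment would presumably use riskmodel and bankloan.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic, hypothetical data standing in for the bank loan file.
rng = np.random.default_rng(1)
n = 300
demo = pd.DataFrame({"DEBTINC": rng.uniform(0, 30, size=n)})
true_p = 1 / (1 + np.exp(-(-3.0 + 0.15 * demo["DEBTINC"].to_numpy())))
demo["DEFAULTER"] = rng.binomial(1, true_p)

demo_model = smf.logit("DEFAULTER ~ DEBTINC", data=demo).fit(disp=0)

# predict() with no arguments returns in-sample predicted probabilities,
# which are stored here as a new column named 'pred'.
demo["pred"] = demo_model.predict()

print(demo.head())
```

Passing a data frame with the same predictor columns to predict would score new applicants instead of the training sample.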
It’s important to measure the goodness of fit of any fitted model. Based on some cut off value of probability, the dependent variable Y is estimated to be either one or zero. A cross tabulation of observed values of Y and predicted values of Y is known as a classification table.
The accuracy percentage measures how accurate a model is in predicting the outcomes.
In the table, the dependent variable was observed and predicted to be zero 478 times, and observed and predicted to be one 92 times.
Therefore, the accuracy rate is calculated as (478 + 92) divided by the total sample size of 700, which gives an accuracy of 81.43%. The misclassification rate is the percentage of wrongly predicted observations. In this example, it is obtained as (39 + 91) divided by 700, giving a misclassification rate of 18.57%.
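The same arithmetic can be checked in a few lines. The two correct counts (478 and 92) are stated in the text; the two off-diagonal counts are taken as 39 and 91, the values consistent with the total of 700 and with the sensitivity (92/183) and specificity (478/517) reported later in this tutorial.

```python
# Cell counts of the classification table at the 0.5 cutoff.
tn, tp = 478, 92   # correct predictions, as stated in the text
fp, fn = 39, 91    # wrong predictions, implied by the reported sensitivity/specificity

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
misclassification = (fp + fn) / total

print(total, round(100 * accuracy, 2), round(100 * misclassification, 2))
# 700 81.43 18.57
```

Accuracy and misclassification rate always sum to one, so either can be derived from the other.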
Classification Table Terminology
Different terminologies are used for observations in a classification table. These are sensitivity, specificity, false positive rate and false negative rate. The sensitivity of a model is the percentage of correctly predicted occurrences or events. It is the probability that the predicted value of Y is one, given that the observed value of Y is one. In contrast, specificity is the percentage of non-occurrences that are correctly predicted, that is, the probability that the predicted value of Y is zero, given that the observed value of Y is also zero. The false positive rate is the percentage of non-occurrences that are wrongly predicted as events. Similarly, the false negative rate is the percentage of occurrences that are wrongly predicted as non-occurrences.
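These definitions can be written compactly in terms of the four cells of the classification table. The abbreviations TP, FN, TN and FP (true positives, false negatives, true negatives, false positives) are notation introduced here for convenience and are not used elsewhere in the tutorial:

```latex
\begin{aligned}
\text{Sensitivity} &= \frac{TP}{TP + FN}, &
\text{Specificity} &= \frac{TN}{TN + FP}, \\
\text{False positive rate} &= \frac{FP}{FP + TN} = 1 - \text{Specificity}, &
\text{False negative rate} &= \frac{FN}{FN + TP} = 1 - \text{Sensitivity}
\end{aligned}
```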
Sensitivity and Specificity calculations
This table represents the accuracy, sensitivity and specificity values for different cut off values. On the basis of the accuracy, sensitivity and specificity values, we can deduce that the cut off value of 0.3 is the best cut off value for the model.
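A comparison of this kind can be sketched as a small loop over candidate cutoffs. The observed outcomes and predicted probabilities below are hypothetical and only illustrate the mechanics; rates_at_cutoff is a helper written for this example, not a library function.

```python
import numpy as np

def rates_at_cutoff(y_true, prob, cutoff):
    """Return (accuracy, sensitivity, specificity) for one cutoff."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(prob) > cutoff).astype(int)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return (tp + tn) / len(y_true), tp / (tp + fn), tn / (tn + fp)

# Hypothetical observed outcomes and predicted probabilities.
y_obs = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
p_hat = [0.05, 0.10, 0.20, 0.35, 0.32, 0.45, 0.55, 0.70, 0.60, 0.90]

for cutoff in (0.3, 0.4, 0.5):
    acc, sens, spec = rates_at_cutoff(y_obs, p_hat, cutoff)
    print(f"cutoff={cutoff:.1f}  accuracy={acc:.2f}  "
          f"sensitivity={sens:.2f}  specificity={spec:.2f}")
```

Lowering the cutoff raises sensitivity at the expense of specificity, so the best value depends on how costly a missed defaulter is relative to a wrongly rejected applicant.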
Classification table in Python
Let’s now obtain the classification table in Python. The predict function gives predicted probabilities. We set the threshold value to 0.5 and the predicted class is assigned a value of 1 if the predicted probability is greater than the threshold of 0.5. Finally, we use the confusion_matrix function to obtain a classification table using the observed defaulter status and the predicted class.
from sklearn.metrics import confusion_matrix
predicted_values1 = riskmodel.predict()
threshold = 0.5
predicted_class1 = np.zeros(predicted_values1.shape)
predicted_class1[predicted_values1 > threshold] = 1
cm1 = confusion_matrix(bankloan['DEFAULTER'], predicted_class1)
print('Confusion Matrix : \n', cm1)
The confusion_matrix function creates a cross table of observed Y (defaulter) vs. predicted Y.
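It helps to know how the resulting matrix is laid out before indexing into it. The sketch below uses a tiny hypothetical set of observed and predicted classes to show that rows correspond to observed classes and columns to predicted classes, both in the order 0, 1.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical observed and predicted classes.
observed = np.array([0, 0, 0, 1, 1, 1, 0, 1])
predicted = np.array([0, 1, 0, 1, 0, 1, 0, 1])

cm_demo = confusion_matrix(observed, predicted)

# Layout for a 0/1 outcome:
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = cm_demo.ravel()

print(cm_demo)
print("TN, FP, FN, TP =", tn, fp, fn, tp)
```

With this layout, cm[1,1] is the count of correctly predicted events and cm[0,0] the count of correctly predicted non-events.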
Sensitivity and Specificity in Python
Now let’s calculate sensitivity and specificity values in Python, using the formulas discussed earlier. The sensitivity of the model is 50.27%, whereas the specificity is 92.46%. The sensitivity is lower than desired, so we can try different thresholds to find a better cutoff, as explained earlier.
Sensitivity and Specificity
sensitivity = cm1[1,1]/(cm1[1,0]+cm1[1,1])
print('Sensitivity : ', sensitivity)
specificity = cm1[0,0]/(cm1[0,0]+cm1[0,1])
print('Specificity : ', specificity)
Sensitivity : 0.5027322404371585
Specificity : 0.9245647969052224
The sensitivity is 50.27% and the specificity is 92.46%. Note that the threshold is set at 0.5.
Precision & Recall values of the model
The precision and recall values of the model are routinely assessed in a classification model. Precision tells us what percentage of predicted positive cases are correctly predicted.
Recall tells us what percentage of actual positive cases are correctly predicted.
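Both measures can be computed directly from the cell counts of the classification table. The sketch below uses the counts at the 0.5 cutoff: 92 true positives and 91 false negatives follow from the reported sensitivity (92/183), while the 39 false positives are inferred from the reported specificity (478/517) rather than quoted from the text.

```python
# Cell counts at the 0.5 cutoff (fp is inferred, see the note above).
tp, fp, fn = 92, 39, 91

precision = tp / (tp + fp)   # share of predicted defaulters who actually defaulted
recall = tp / (tp + fn)      # share of actual defaulters the model caught

print(round(precision, 4), round(recall, 4))
```

Recall is the same quantity as sensitivity, so it matches the 50.27% obtained earlier, whereas precision depends on the false positives instead of the false negatives.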
The classification_report function in Python is also very useful. We import it from the sklearn.metrics module. It accepts the observed Y and the predicted class of Y as its first two arguments. The output shows the recall, precision and accuracy of the model.
from sklearn.metrics import classification_report
print(classification_report(bankloan['DEFAULTER'], predicted_class1))
classification_report() gives recall, precision and accuracy along with other measures.
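When individual numbers are needed rather than a printed table, classification_report can return a dictionary through its output_dict argument. The observed and predicted classes below are hypothetical and only show how the result is indexed.

```python
from sklearn.metrics import classification_report

# Hypothetical observed and predicted classes.
observed = [0, 0, 0, 1, 1, 1, 0, 1]
predicted = [0, 1, 0, 1, 0, 1, 0, 1]

report = classification_report(observed, predicted, output_dict=True)

# Keys are the class labels as strings, plus 'accuracy',
# 'macro avg' and 'weighted avg'.
print(report["1"]["precision"], report["1"]["recall"], report["accuracy"])
```

Here the entry for class "1" describes the events, so its recall is the model's sensitivity on this toy data.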
Let’s quickly recap. In this session, we learned about binary logistic regression modelling and its application. We then used Python code to estimate model parameters and obtain a classification report.
This tutorial lesson is taken from Digita Schools Advanced Diploma in Data Analytics and the Postgraduate Diploma in Data Science. Continue to the follow on tutorial on Binary Logistic Regression in Python Part II