In this tutorial we’ll learn about binary logistic regression and its application to real life data. Without any doubt, binary logistic regression remains the most widely used predictive modeling method.
You can download the data files for this tutorial here.
Firstly, we’ll discuss what binary logistic regression is and its applications in different areas. We’ll then focus on statistical models and hypothesis testing, and finally we’ll use a case study in banking to strengthen our understanding.
Binary Logistic Regression
Binary logistic regression models the relationship between a set of independent variables and a binary dependent variable. It’s useful when the dependent variable is dichotomous in nature, like death or survival, absence or presence, pass or fail and so on. Independent variables can be categorical or continuous, for example, gender, age, income or geographical region. Binary logistic regression models a dependent variable as a logit of p, where p is the probability that the dependent variables take a value of 1.
Binary logistic regression models are used across many domains and sectors. It can be used in marketing analytics to identify potential buyers of a product, or in human resources management to identify employees who are likely to leave a company, or in risk management, the objective could be to predict defaulters, or in insurance where the objective is to predict policy lapses. All of these objectives are based on information such as age, gender, occupation, premium amount, purchase frequency, etc and in all these objectives, the dependent variable is binary, whereas independent variables are categorical or continuous.
Why not use a linear regression model?
Why can’t we use linear regression for binary dependent variables.? One reason is that the distribution of Y is random and not normal, as in the case of linear regression. Also, the left-hand and right hand sides of the model will not be comparable if we use linear regression for a binary dependent variable.
Linear regression is suitable for predicting a continuous value such as predicting the price of property based on area in square feet. In such a case the regression line is a straight line. Logistic regression on the other hand is used for classification problems which predict a probability that a dependent variable Y takes a value of ‘one’, given the values of predictors. In binary logistic regression, the regression curve is a sigmoid curve.
This equation is a statistical model for binary logistic regression with a single predictor. Small p is the probability that the dependent variable ‘Y’ will take the value one, given the value of ‘X’, where X is the independent variable. The natural log of “p divided by one minus p” is called the logit or link function. The right-hand side is a linear function of X, very much similar to a linear regression model.
Statistical Model – For k Predictors
The general logistic regression model for a single predictor can be extended to a model with k predictors and is represented as given here. In this equation, p is the probability that Y equals one given X, where Y is the dependent variable and X’s are independent variables. B0 to b K are the parameters of the model, they are estimated using the maximum likelihood method, which we’ll discuss shortly. The left-hand side of the equation ranges between minus infinity to plus infinity.
p : Probability that Y=1 given X
Y : Dependent Variable
X1, X2 ,…, Xk : Independent Variables
b0, b1 ,…, bk : Parameters of Model
Case Study – Modeling Loan Defaults
Let us now look at the concept of binary logistic regression using a banking case study. We have a bank which possesses the demographic and transactional data of its loan customers. This bank wants to develop a model which predicts defaulters and can help the bank in loan disbursal decision making. The objective here is to predict whether customers applying for a loan will default or not. We use a sample of size 700 to develop the model. The independent variables are age group, years at current address, years at current employer, debt to income ratio, credit card debts and other debts. All of these variables are collected at the time of the loan application process and will be used as independent variables. The dependent variable is the status observed after the loan is disbursed, which will be one if it is a defaulter, and zero otherwise
Here is the snapshot of the data. Our dependent variable is binary, whereas the independent variables are either categorical or continuous in nature.
Exploratory Data Analysis
Before moving to modelling, it is always useful to perform exploratory analysis. Let us perform some bivariate analysis. Table 1 presents the relationship between defaulter status and different age groups. Table 2 shows the transactional behavior for defaulters and non-defaulters. The average credit card liability of defaulters is 2.42 vs. 1.25 for non-defaulters. Also, average debt to income for defaulters is very high as compared to non-defaulters.
Binary Logistic Regression Model for the bank loan data
This is how the binary logistic regression model will look for our case study. Here, p is the probability that the customer will be a defaulter. The parameter b is the intercept and b1 b2 etc are coefficients of other independent variables. For the purposes of understanding, we have included all independent variables in the model. However, the final model will be presented using only significant independent variables.
As mentioned before, the parameters of the logistic regression model are estimated using maximum likelihood estimation. The likelihood function is a joint probability of Y1, Y2….up to Yn. The parameters are estimated by maximizing the likelihood function L. Two commonly used iterative algorithms are the Fisher scoring method and the Newton-Raphson method. Both of these algorithms give the same parameter estimates with a slight difference in the estimated covariance matrix. From an application point of view, we don’t need to worry about complex mathematics. The algorithm is available as a built-in function in R and Python.
Maximum Likelihood Estimates of Parameters
The table here gives all parameter estimates that can be used to write the model equation. The model equation can be used to estimate probability of default by substituting values of specific customer characteristics. Note that AGE is a categorical variable with 3 categories and hence two coefficient values are shown for two dummy variables.
Parameters, Probability and Odds
The logit link or logit function is also known as the log of odds, where odds is the probability of success divided by the probability of failure. The estimated parameter gives the change in log of odds given one unit change in the independent variable. For example, the estimated coefficient of employ (that is number of years customer is working at current employer) is -0.26172. This means that one unit change in employ will result in a change of -0.26712 in log of odds. The negative sign implies that a customer with a steady job is less likely to be a loan defaulter. We usually interpret the odds ratio rather than regression coefficients for a logistic regression model. The odds ratio gives a measure of association between the dependent and independent variable.
What is the odds ratio
Let us look at what the odds ratio is. The odds ratio is a measure of association between the independent variable and an outcome. The odds ratio is the ratio of odds of the event occurring given X=1 and X= 0. It represents the factor by which the odds change for a one unit change in the independent variable. Taking the ant-log of the regression coefficient we can get the odds ratio.
Odds Ratio – Case Study
The table shows the odds ratio for each independent variable. An odds ratio greater than one indicates a positive association between the dependent and independent variables, whereas an odds ratio less than one indicates a negative relationship between the dependent and independent variables. An odds ratio equal to one indicates no association between the variables. For example, the odds ratio of the employ independent variable is 0.77 indicates that for one unit change in employ, the odds of being a defaulter will change by 0.77 fold or decrease by 23%.
Individual testing using Wald’s test
To identify which independent variables are statistically significant and are to be included in the final model, we use Wald’s test. This test is used for assessing the significance of each independent variable separately. The null hypothesis for Wald’s test is “Specific Parameter is Zero”. It also means that there is no effect of that predictor on the dependent variable. The test statistic given by Z is the ratio of the estimate of the independent variable to the standard error of the estimate of the independent variable. Under the null hypothesis, the test statistic is assumed to follow standard normal distribution. We reject the null hypothesis if the p value is less than 0.05.
Let us do quick recap. In this session, we learned about the binary logistic regression model and its application. We also discussed how to estimate parameters of binary logistic regression using the maximum likelihood method and how to intrepret the odds ratio.
This tutorial lesson is taken from the Postgraduate Diploma in Data Science.
You can try our courses for free to learn more