In a previous tutorial, we discussed the concept and application of binary logistic regression. We’ll now learn more about binary logistic regression model building and its assessment using R.
First, we’ll recap our earlier case study and then develop a binary logistic regression model in R, followed by an explanation of model sensitivity and specificity and how to estimate these using R.
You can download the data files for this tutorial here.
Binary Logistic Regression Data Snapshot
Let’s consider the same example of loan disbursement discussed in the previous tutorial. Here’s a snapshot of the data. To recall, a bank wants to develop a model which predicts defaulters in order to help its loan disbursal decision making. The dependent variable is the status observed after the loan is disbursed, which will be 1 if a customer is a defaulter, and 0 otherwise. The variables – age group, years at current address, years at current employer, debt to income ratio, credit card debts and other debts are our independent variables.
Binary Logistic Regression in R
First we import our data and check our data structure in R. As usual, we use the read.csv function and use the str function to check data structure. Age is a categorical variable and therefore needs to be converted into a factor variable. We use the ‘factor’ function to convert an integer variable to a factor.
Import data and check data structure before running model
data<-read.csv("BANK LOAN.csv",header=TRUE)
str(data)
Age is read in as an integer and needs to be converted into a factor, since it is a categorical variable.
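The conversion step described above can be sketched as follows. This is a minimal illustration on a toy data frame, not the tutorial's data: the column name AGE and the codes 1, 2, 3 are assumptions made for the example.

```r
# Toy stand-in for the age group column (hypothetical name and codes,
# NOT taken from BANK LOAN.csv)
toy <- data.frame(AGE = c(1L, 3L, 2L, 1L, 2L))
str(toy)                     # AGE is stored as an integer

# factor() turns the integer codes into categories
toy$AGE <- factor(toy$AGE)
str(toy)                     # AGE is now a factor
levels(toy$AGE)              # the distinct categories
```

Once the variable is a factor, glm() creates dummy variables for it automatically instead of treating the codes as a numeric scale.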
Logistic Regression in R
Logistic regression is a type of generalized linear model and therefore the function name is glm. We set the argument family = binomial to specify the regression model as binary logistic regression. As in the linear regression model, dependent and independent variables are separated using the tilde sign and independent variables are separated by the plus sign.
Using the glm function to develop a binary logistic regression model
glm stands for Generalized Linear Model. Logistic regression is a type of GLM.
The dependent variable is on the LHS of ~, and the independent variables on the RHS are separated by ‘+’.
riskmodel is the model object.
Setting family=binomial makes glm() fit a logistic regression model.
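A model call of the kind described in these notes might look like the sketch below. It runs on simulated data so that it is self-contained. The simulated values, the object name riskmodel_sim, and the column names AGE and OTHDEBT are assumptions for illustration; only DEFAULTER, EMPLOY, ADDRESS, DEBTINC and CREDDEBT appear in the tutorial's own code.

```r
# Simulated stand-in data (hypothetical values, NOT the tutorial's BANK LOAN.csv)
set.seed(42)
n <- 500
sim <- data.frame(
  AGE      = factor(sample(1:3, n, replace = TRUE)),  # assumed column name
  EMPLOY   = rpois(n, 8),
  ADDRESS  = rpois(n, 8),
  DEBTINC  = round(runif(n, 1, 30), 1),
  CREDDEBT = round(rexp(n, rate = 1 / 1.5), 2),
  OTHDEBT  = round(rexp(n, rate = 1 / 3), 2)          # assumed column name
)
log_odds <- -1 - 0.25 * sim$EMPLOY - 0.10 * sim$ADDRESS +
  0.09 * sim$DEBTINC + 0.55 * sim$CREDDEBT
sim$DEFAULTER <- rbinom(n, 1, plogis(log_odds))

# Dependent variable on the left of ~, independent variables joined by +
riskmodel_sim <- glm(DEFAULTER ~ AGE + EMPLOY + ADDRESS + DEBTINC + CREDDEBT + OTHDEBT,
                     family = binomial, data = sim)
summary(riskmodel_sim)
```

Because the data are simulated, the estimates and p-values printed here will not match the tutorial's output.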
Individual Hypothesis Testing in R
Which independent variables have an impact on the customer turning into a defaulter? After fitting the logistic regression model, we can carry out individual hypothesis testing to identify significant variables. We apply the summary function to the model object to get the detailed output. Variables whose p-value is less than 0.05 are considered to be statistically significant. Since the p-value is < 0.05 for EMPLOY, ADDRESS, DEBTINC and CREDDEBT, these independent variables are significant.
The summary() function gives the detailed output of glm.
Individual Testing in R
Once we obtain our coefficients, we check them for their signs based on business logic. If the coefficient sign does not match with the business logic, then that variable should be reconsidered for inclusion in the model.
Re-run Model in R
Next, we re-run the binary logistic regression model by including only significant variables. The output of the summary function provides revised estimates of the model parameters.
riskmodel<-glm(DEFAULTER~EMPLOY+ADDRESS+DEBTINC+CREDDEBT, family=binomial,data=data)
summary(riskmodel)
In this output, all independent variables are statistically significant and the signs are logical. This model will therefore be used for further diagnostics.
This is how the final model will look after substituting the values of parameter estimates. The probability of default can be predicted if the values of the X variables are entered into this equation.
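For reference, the general form of a binary logistic regression equation with these four predictors is sketched below. The symbols b_0 to b_4 are placeholders for the estimates reported by summary(riskmodel). They are notation introduced here, not values from the tutorial's output.

```latex
\[
P(Y = 1) =
\frac{\exp\left(b_0 + b_1\,\mathrm{EMPLOY} + b_2\,\mathrm{ADDRESS} + b_3\,\mathrm{DEBTINC} + b_4\,\mathrm{CREDDEBT}\right)}
     {1 + \exp\left(b_0 + b_1\,\mathrm{EMPLOY} + b_2\,\mathrm{ADDRESS} + b_3\,\mathrm{DEBTINC} + b_4\,\mathrm{CREDDEBT}\right)}
\]
```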
Odds Ratio in R
As discussed previously, we use the odds ratio to measure the association between each independent variable and the dependent variable. In R, we extract the model coefficients using the coef function and estimate the odds ratios by taking the antilog, that is, by applying the exp function. Applying exp to the output of the confint function gives the confidence intervals for the odds ratios. Having calculated these, we can then combine the coefficients, odds ratios and confidence intervals into one table using the cbind function.
coef(riskmodel)
exp(coef(riskmodel))
exp(confint(riskmodel))
cbind(coef(riskmodel),odds_ratio=exp(coef(riskmodel)),exp(confint(riskmodel)))
coef(riskmodel): identify the model coefficients.
exp(coef(riskmodel)): find odds ratio.
exp(confint(riskmodel)): calculates confidence interval for odds ratio.
From the output, we can see that none of the confidence intervals for the odds ratios includes one, which indicates that all the variables included in the model are significant. The odds ratio for CREDDEBT is approximately 1.77.
For a one unit increase in CREDDEBT, the odds of being a defaulter are multiplied by about 1.77, that is, they increase by about 77%, holding the other variables constant.
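The arithmetic behind this interpretation can be checked with a few lines of R. This sketch uses only the approximate odds ratio of 1.77 quoted above. The variable names are illustrative.

```r
odds_ratio <- 1.77            # approximate CREDDEBT odds ratio quoted in the text

# The coefficient is the log of the odds ratio, so exp() and log() undo each other
b <- log(odds_ratio)
b
exp(b)                        # recovers the odds ratio

# Percentage increase in the odds for a one unit increase in CREDDEBT
(odds_ratio - 1) * 100

# Odds ratios multiply: a two unit increase multiplies the odds by 1.77^2
odds_ratio^2
```

Note that the 1.77 fold change applies to the odds, not to the probability of default.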
Predicting Probabilities in R
We obtain the predicted probabilities from the final model using the fitted function. The round function rounds the probabilities to two decimal places. Predicted probabilities are saved in the same dataset, ‘data’, in a new variable, ‘predprob’.
The fitted function generates the predicted probabilities based on the final riskmodel.
The round function rounds the probabilities to 2 decimal places.
data$predprob: Predicted probabilities are saved in the same dataset ‘data’ in a new variable ‘predprob’.
This is data with predicted probabilities. The last column in the data gives predicted probabilities using the final model.
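The step described above can be sketched as follows, again on simulated stand-in data so that the example runs on its own. The simulated values and the object names sim and fit are assumptions. The assignment to predprob follows the description in the text.

```r
# Simulated stand-in data (hypothetical values, NOT the tutorial's BANK LOAN.csv)
set.seed(42)
n <- 500
sim <- data.frame(
  EMPLOY   = rpois(n, 8),
  ADDRESS  = rpois(n, 8),
  DEBTINC  = round(runif(n, 1, 30), 1),
  CREDDEBT = round(rexp(n, rate = 1 / 1.5), 2)
)
log_odds <- -1 - 0.25 * sim$EMPLOY - 0.10 * sim$ADDRESS +
  0.09 * sim$DEBTINC + 0.55 * sim$CREDDEBT
sim$DEFAULTER <- rbinom(n, 1, plogis(log_odds))

fit <- glm(DEFAULTER ~ EMPLOY + ADDRESS + DEBTINC + CREDDEBT,
           family = binomial, data = sim)

# fitted() returns one predicted probability per row; round() keeps 2 decimals
sim$predprob <- round(fitted(fit), 2)
head(sim)
```

For a glm object, fitted(fit) gives the same values as predict(fit, type = "response").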
It is important to measure the goodness of fit of any fitted model. Based on some cut off value of probability, the dependent variable Y is estimated to be either one or zero. A cross tabulation of observed values of Y and the predicted values of Y is known as a classification table. Since this classification table varies with the cut off value, it is not considered to be a good measure of goodness of fit unless an optimum cut-off is obtained.
The accuracy percentage measures how accurate a model is in predicting outcomes. In the table, a value of zero for the dependent variable was both observed and predicted 479 times, whereas a value of one was both observed and predicted 92 times. Therefore, the accuracy rate is calculated as 479 plus 92, divided by the total sample size of 700. The accuracy is 81.57%.
Next, we see what is meant by the misclassification rate. The misclassification rate is the percentage of wrongly predicted observations. In this example, the misclassification rate is obtained as 38 + 91 divided by 700, giving a misclassification rate of 18.43%.
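These two calculations can be reproduced in R from the four counts alone. The sketch below types the counts into a matrix by hand. The object name counts and the row and column labels are illustrative choices.

```r
# Counts from the classification table (rows = observed, columns = predicted)
counts <- matrix(c(479, 91, 38, 92), nrow = 2,
                 dimnames = list(observed  = c("0", "1"),
                                 predicted = c("FALSE", "TRUE")))
counts

# Correct predictions sit on the diagonal
accuracy <- sum(diag(counts)) / sum(counts) * 100
round(accuracy, 2)            # 81.57

# Everything off the diagonal is misclassified
misclassification <- 100 - accuracy
round(misclassification, 2)   # 18.43
```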
Classification Table Terminology
Different terms are used for observations in the classification table. They are sensitivity, specificity, false positive rate and false negative rate. The sensitivity of a model is the percentage of correctly predicted occurrences or events. It is the probability that the predicted value of Y is one, given that the observed value of Y is one. Conversely, specificity is the percentage of non-occurrences being correctly predicted: that is, the probability that the predicted value of Y is zero, given that the observed value of Y is also zero. The false positive rate is the percentage of non-occurrences that are predicted wrongly as events. Similarly, the false negative rate is the percentage of occurrences that are predicted wrongly as non-events.
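The four definitions above can be written compactly as formulas. The abbreviations TP, TN, FP and FN (true positives, true negatives, false positives, false negatives) and the symbol \hat{Y} for the predicted value are notation introduced here for convenience. They do not appear in the original tutorial.

```latex
\[
\text{Sensitivity} = P(\hat{Y} = 1 \mid Y = 1) = \frac{TP}{TP + FN},
\qquad
\text{Specificity} = P(\hat{Y} = 0 \mid Y = 0) = \frac{TN}{TN + FP}
\]
\[
\text{False positive rate} = \frac{FP}{TN + FP} = 1 - \text{Specificity},
\qquad
\text{False negative rate} = \frac{FN}{TP + FN} = 1 - \text{Sensitivity}
\]
```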
Sensitivity and Specificity calculations
This table represents the accuracy, sensitivity and specificity values for different cut off values. On the basis of our accuracy, sensitivity and specificity values, we can deduce that the cut off value of 0.3 is the best cut off value for the model.
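A table of this kind could be produced with a loop over cut off values, as in the sketch below. It uses simulated stand-in data, so the figures will not match the tutorial's table. The simulated values, the list of cut offs and the object names sim, fit and perf are all assumptions for illustration.

```r
# Simulated stand-in data (hypothetical values, NOT the tutorial's BANK LOAN.csv)
set.seed(42)
n <- 500
sim <- data.frame(
  EMPLOY   = rpois(n, 8),
  ADDRESS  = rpois(n, 8),
  DEBTINC  = round(runif(n, 1, 30), 1),
  CREDDEBT = round(rexp(n, rate = 1 / 1.5), 2)
)
log_odds <- -1 - 0.25 * sim$EMPLOY - 0.10 * sim$ADDRESS +
  0.09 * sim$DEBTINC + 0.55 * sim$CREDDEBT
sim$DEFAULTER <- rbinom(n, 1, plogis(log_odds))
fit <- glm(DEFAULTER ~ EMPLOY + ADDRESS + DEBTINC + CREDDEBT,
           family = binomial, data = sim)
predprob <- fitted(fit)

# Accuracy, sensitivity and specificity (in %) at each cut off
cutoffs <- c(0.1, 0.2, 0.3, 0.4, 0.5)
obs <- sim$DEFAULTER == 1
perf <- t(sapply(cutoffs, function(cut) {
  pred <- predprob > cut
  c(cutoff      = cut,
    accuracy    = mean(pred == obs) * 100,
    sensitivity = sum(pred & obs) / sum(obs) * 100,
    specificity = sum(!pred & !obs) / sum(!obs) * 100)
}))
perf <- as.data.frame(perf)
round(perf, 1)
```

Raising the cut off can only lower sensitivity and raise specificity, so choosing a cut off means trading one against the other.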
Classification and Sensitivity and Specificity table in R
Let us obtain our classification table in R. We use the table function to create a cross table of the observed and predicted values of the dependent variable. Here, TRUE indicates predicted defaulters, whereas FALSE indicates predicted non-defaulters. There are 479 correctly predicted non-defaulters and 92 correctly predicted defaulters, whereas 38 non-defaulters are wrongly predicted as defaulters and 91 defaulters are wrongly predicted as non-defaulters.
# Classification table
classificationtable<-table(data$DEFAULTER,data$predprob > 0.5)
classificationtable
The table function creates a cross table of observed Y (DEFAULTER) vs. predicted Y (predprob > 0.5).
Sensitivity and Specificity in R
Let us now calculate sensitivity and specificity values in R, using the formula discussed above. On calculation, the sensitivity of the model is 50.3%, whereas specificity is at 92.7%. The sensitivity value is definitely lower than the desired value.
# Sensitivity and Specificity
sensitivity<-(classificationtable[2,2]/(classificationtable[2,2]+classificationtable[2,1]))*100
sensitivity
specificity<-(classificationtable[1,1]/(classificationtable[1,1]+classificationtable[1,2]))*100
specificity
sensitivity  50.27322
specificity  92.6499
The sensitivity is 50.3% and the specificity is 92.7% when the cut off is set at 0.5.
Let’s have a quick recap. In this tutorial, we explained how to perform binary logistic regression in R. Model performance is assessed using sensitivity and specificity values. Sensitivity is the percentage of events correctly predicted, whereas specificity is the percentage of non-events correctly predicted.
This tutorial lesson is taken from the Postgraduate Diploma in Data Science.