Multiple Linear Regression (MLR) is the backbone of predictive modeling and machine learning, and an in-depth knowledge of MLR is critical to understanding these key areas of data science. This tutorial provides an initial introduction to MLR using R. If you’d like to cover the same area using Python, you can find our tutorial here.
You can download the data files for this tutorial here.
When is multiple linear regression used?
Multiple linear regression explains the relationship between one continuous dependent variable and two or more independent variables. The following example will make things clear.
The price of a house in USD can be a dependent variable. The area of the house, its location, the air quality index in the area, and its distance from the airport can be independent variables. Independent variables can be continuous, such as the air quality index, or categorical, such as the location of the house. The price of the house is the target we want to explain, which is why we call it the dependent variable. To sum up, we have one dependent variable and a set of independent variables.
Statistical model for multiple linear regression
The statistical model for multiple linear regression has two parts: the left-hand side has the dependent variable, denoted Y, and the right-hand side has the independent variables, denoted X1, X2, ..., Xp.
This means that there are, in general, p independent variables, each with its own weight, which we call a regression parameter.
The parameter b0 is termed the regression intercept in the model.
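Written out, the model described above takes the following form. The notation is assembled from the surrounding text (Y, X1 to Xp, b0 to bp) plus an error term e, so treat it as a reconstruction under those assumptions rather than the original figure:

```latex
Y = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_p X_p + e
```

Here b_1 to b_p are the regression parameters attached to the p independent variables, and b_0 is the intercept.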
The question is how to obtain values of these unknown parameters from the known values of Y and the X variables.
To do this, we use the least squares method. This method chooses the parameter values that minimize the error sum of squares, that is, the sum of squared differences between the observed and fitted values of Y. Statistical software reports these least squares estimates as the main output of a regression model.
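To make the least squares idea concrete, here is a minimal sketch that computes the estimates by hand and compares them with R's lm function. The data are simulated purely for illustration (they are not the tutorial's dataset), and the names x1, x2, y, b_manual and b_lm are assumptions introduced for this example.

```r
# Simulated data, invented purely for illustration (not the tutorial's dataset)
set.seed(42)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 1.5 * x1 - 0.5 * x2 + rnorm(n, sd = 0.3)

# Design matrix with a leading column of 1s for the intercept b0
X <- cbind(Intercept = 1, x1 = x1, x2 = x2)

# Least squares estimates solve the normal equations (X'X) b = X'y
b_manual <- drop(solve(crossprod(X), crossprod(X, y)))

# lm() minimizes the same error sum of squares
b_lm <- coef(lm(y ~ x1 + x2))

# Error sum of squares at the least squares estimates
sse <- sum((y - X %*% b_manual)^2)

print(b_manual)
print(b_lm)
```

The two sets of estimates should agree up to numerical precision. In practice lm is preferable to solving the normal equations directly, because it uses a more numerically stable decomposition.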
Multiple Linear Regression Case Study
Let’s illustrate these concepts using a case study. The objective is to model a job performance index based on the test scores of newly recruited employees. Our dependent variable is the job performance index, and our independent variables are the scores for aptitude, test of language, technical knowledge, and general information.
Here’s a snapshot of the data with our dependent and independent variables. All variables are numeric, and the employee ID is an identifier only, so it is not used as a model variable.
It’s always advisable to have a graphical representation of the data, such as scatter plots, which will give us insights into the variables’ bivariate relationships.
Now let’s import our example data using the read.csv function in R. We use the GGally library and the ggpairs function to present our data graphically, specifically to create scatterplots for our variables of interest.
Importing the Data
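The import step can be sketched as follows. The tutorial's data file name is not given in the text, so this example writes a small temporary CSV with made-up rows and reads it back with read.csv. The column names jpi, aptitude, tol, technical and general come from the plotting code in this tutorial; the empid column name and every value are assumptions for illustration only.

```r
# Hypothetical file: a temporary CSV with made-up rows stands in for the
# tutorial's data file, whose name is not given in the text.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "empid,jpi,aptitude,tol,technical,general",
  "1,45.5,43.8,55.9,51.8,43.6",
  "2,40.7,32.7,32.6,51.5,51.0",
  "3,51.2,56.6,54.3,52.3,50.6"
), csv_path)

# read.csv() returns a data frame; header = TRUE takes column names from row 1
perindex <- read.csv(csv_path, header = TRUE)

str(perindex)   # column names and types
head(perindex)  # first few rows
```

With the real data, replace csv_path with the path to the downloaded file.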
Graphical Representation of the Data
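# Load GGally, which provides ggpairs()
library(GGally)
# Scatter plot matrix for the dependent variable and the four test scores
ggpairs(perindex[, c("jpi", "aptitude", "tol", "technical", "general")],
        title = "Scatter Plot Matrix",
        columnLabels = c("jpi", "aptitude", "tol", "technical", "general"))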
The ggpairs function in the GGally library helps us visualise the bivariate relationship between each pair of variables and quantify it as a correlation coefficient, while also showing the distribution of each variable. We can observe that the job performance index has a high correlation with the technical knowledge and general information scores.
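If only the correlation coefficients are needed, base R's cor function returns them as a matrix without any extra packages. The sketch below uses simulated scores invented for illustration, so the numbers say nothing about the tutorial's data; the data frame name scores is an assumption.

```r
# Simulated scores, invented for illustration (not the tutorial's dataset)
set.seed(1)
n <- 40
technical <- rnorm(n, mean = 50, sd = 8)
general   <- rnorm(n, mean = 50, sd = 8)
jpi       <- 5 + 0.5 * technical + 0.3 * general + rnorm(n, sd = 3)
scores    <- data.frame(jpi, technical, general)

# Pearson correlation between every pair of columns
cor_matrix <- cor(scores)
round(cor_matrix, 2)
```

The matrix is symmetric with 1s on the diagonal, since every variable correlates perfectly with itself.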
Multiple linear regression usually describes the response better than simple linear regression, because a single predictor often provides inadequate information about the response variable. Studying several variables simultaneously matters because a response is typically influenced by more than one variable, as in the example just explained.
Multiple linear regression can answer many questions such as:
Do tests conducted at recruitment time determine a candidate’s performance in the initial six months of the job?
Which of the four test scores is more significant in determining job performance?
Can any test be discontinued?
Can the performance of newly recruited candidates be estimated based on test scores at the time of recruitment?
Our MLR model for the case study is jpi = b0 + b1(aptitude) + b2(tol) + b3(technical) + b4(general) + e. The left-hand side is the dependent variable, which in our case is the job performance index, and the right-hand side is the set of independent variables. b0 is the intercept or constant of the model, whereas b1 to b4 are the parameters for the respective independent variables. Finally, e is the error term in the model.
Parameters are estimated using the least squares method, as discussed previously, and here are our five parameter estimates: a constant term b0 and one estimate for each of the four independent variables. We now have a model equation wholly defined in terms of variables and estimated parameters.
Let’s now fit the model using the lm function in R. lm stands for linear model. We store the fitted model in an object, jpimodel, and print it to show its coefficient estimates. The lm function takes a formula with the dependent variable on the left and the independent variables, separated by plus signs, on the right.
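# Fit the model and store it in jpimodel
jpimodel <- lm(jpi ~ aptitude + tol + technical + general, data = perindex)
# Print the call and the coefficient estimates
jpimodel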
lm() fits a linear regression.
~ separates the dependent and independent variables.
The left-hand side of the tilde (~) is the dependent variable, and the right-hand side lists the independent variables.
+ separates multiple independent variables.
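The formula rules above can be tried out on a small example. This sketch simulates a data frame that reuses the tutorial's column names, so the values and the resulting estimates are invented and will not match the tutorial's output. The names demo, fit and fit_all are assumptions for this example.

```r
# Simulated data reusing the tutorial's column names (values are made up)
set.seed(123)
n <- 30
demo <- data.frame(
  aptitude  = rnorm(n, 50, 10),
  tol       = rnorm(n, 50, 10),
  technical = rnorm(n, 50, 10),
  general   = rnorm(n, 50, 10)
)
demo$jpi <- with(demo,
  10 + 0.3 * aptitude + 0.1 * tol + 0.4 * technical + 0.2 * general
) + rnorm(n, sd = 2)

# Dependent variable left of ~, independent variables joined by + on the right
fit <- lm(jpi ~ aptitude + tol + technical + general, data = demo)

# Shorthand: "." stands for every other column in the data frame
fit_all <- lm(jpi ~ ., data = demo)

coef(fit)                   # named vector: intercept plus one slope per predictor
summary(fit)$coefficients   # estimates, standard errors, t values, p values
```

The dot shorthand is convenient, but it would also pull in an identifier column such as an employee ID if one were present, so listing the predictors explicitly is often the safer choice.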
The table shows the output of the MLR model as displayed in R. Coefficients are the model parameter estimates, and the sign of each estimate indicates the direction of its relationship with the dependent variable.
Let’s see how to interpret these partial regression coefficients. In general, we say that for every unit increase in the independent variable (X), the expected value of the dependent variable will change by the corresponding parameter estimate (b), keeping all other variables constant. For example, the parameter estimate for the aptitude test is observed to be 0.32. Therefore, we infer that for a one-unit increase in aptitude score, the expected value of the job performance index will increase by 0.32 units, provided the other three test scores stay the same.
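This interpretation can be checked numerically: raise one predictor by one unit, hold the others fixed, and the predicted value changes by exactly that predictor's coefficient. The sketch below uses simulated data with two predictors for brevity, so the estimate will not equal the tutorial's 0.32. The names demo, fit, base and bumped are assumptions for this example.

```r
# Simulated data, invented for illustration (not the tutorial's dataset)
set.seed(7)
n <- 30
demo <- data.frame(aptitude  = rnorm(n, 50, 10),
                   technical = rnorm(n, 50, 10))
demo$jpi <- 10 + 0.3 * demo$aptitude + 0.4 * demo$technical + rnorm(n, sd = 2)

fit <- lm(jpi ~ aptitude + technical, data = demo)

# Two hypothetical candidates who differ by one aptitude point only
base   <- data.frame(aptitude = 50, technical = 50)
bumped <- data.frame(aptitude = 51, technical = 50)

# Change in predicted jpi for a one-unit increase in aptitude
diff_pred <- unname(predict(fit, bumped) - predict(fit, base))

diff_pred
unname(coef(fit)["aptitude"])
```

Both printed values are the same, which is what "keeping all other variables constant" means in practice.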
Here’s a recap of the main concepts covered in this tutorial. First, we learned how to understand our data and ensure consistency in the dataset. We then covered how to represent our data graphically by using the ggpairs function. Lastly, we learned how to fit a multiple linear regression model in R and interpret its coefficients.
This tutorial lesson is taken from the Postgraduate Diploma in Data Science.