Multiple Linear Regression (MLR) is the backbone of predictive modelling and machine learning, and an in-depth knowledge of MLR is critical in the predictive modelling world. We previously discussed implementing multiple linear regression in the R tutorial; now we’ll look at implementing it in Python.
You can download the data files for this tutorial here.
In this tutorial the focus is on estimating model parameters in Python to fit a model and then interpreting the results. We will use the same case study that we used in the R tutorial to explain the Python code. As the statistical concepts were discussed in detail earlier, we will just summarize the key points here.
Multiple Linear Regression in Python
Multiple linear regression is used to explain the relationship between one continuous dependent variable and two or more independent variables.
For example, the price of a house in US dollars can be the dependent variable, and the size of the house, its location, the air quality index in the area, the distance from the airport and so on can be independent variables.
The price of a house is our target variable, which we call the DEPENDENT VARIABLE.
Statistical Model for MLR
Our statistical model has two parts: the left-hand side has the dependent variable, denoted as Y, and the right-hand side has the independent variables, denoted as X1, X2, … up to Xp.
Each independent variable has a specific WEIGHT called a REGRESSION PARAMETER.
The parameter b0 is the intercept in the model.
The parameters of the model are estimated using the LEAST SQUARES METHOD.
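As a compact summary of the points above, the model and the least squares criterion can be written as follows. This is a sketch in standard notation: the symbols Y, X1 … Xp and b0 follow the text, while e for the error term and the observation index i (running over n observations) are notational assumptions.

```latex
% General MLR model: dependent variable on the left, weighted predictors on the right
Y = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_p X_p + e

% Least squares: choose b_0, ..., b_p to minimise the sum of squared errors
\min_{b_0,\dots,b_p} \; \sum_{i=1}^{n} \left( Y_i - b_0 - b_1 X_{i1} - \dots - b_p X_{ip} \right)^2
```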
Multiple Linear Regression Case Study – Modeling Job Performance Index
Let’s illustrate all of these concepts using a case study. The objective is to model the Job Performance Index based on the test scores of newly recruited employees. The dependent variable is the Job Performance Index (jpi), and the independent variables are the aptitude, test of language (tol), technical knowledge and general information scores.
Multiple Linear Regression Dataset Snapshot
Here is a snapshot of the data with our dependent and independent variables.
All variables are numeric in nature. Employee ID only identifies the employee and is not used as a variable in the model.
Graphical Representation of Data
It is always advisable to explore your data graphically through scatter plots, as these give you insights into the bivariate relationships between variables. Let us import our example data with the help of the read_csv function available in the pandas library. To present our data graphically, we use the pairplot function from the seaborn library.
#Importing the Data
import pandas as pd
perindex = pd.read_csv("Performance Index.csv")
#Graphical Representation of the Data
import seaborn as sns
sns.pairplot(perindex)
Scatter Plot Matrix
The pairplot function in the seaborn library helps us to visualize bivariate relationships between variables. It also shows the distribution of each variable using a histogram. We can observe that the job performance index has a high correlation with the technical knowledge and general information scores.
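A visual impression of correlation can be backed up with numbers. The following is a minimal sketch using the pandas corr method on synthetic stand-in data, not the tutorial's dataset; the column names are taken from the model formula used later, and the values and the relationship that generates jpi are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (NOT the tutorial's dataset); column names follow
# the model formula used later in the tutorial.
rng = np.random.default_rng(0)
n = 50
scores = pd.DataFrame({
    "aptitude": rng.normal(50, 10, n),
    "tol": rng.normal(50, 10, n),
    "technical": rng.normal(50, 10, n),
    "general": rng.normal(50, 10, n),
})

# Hypothetical relationship, used only to generate example values for jpi.
scores["jpi"] = (
    0.3 * scores["aptitude"]
    + 1.0 * scores["technical"]
    + 0.5 * scores["general"]
    + rng.normal(0, 5, n)
)

# Pearson correlation of every independent variable with the dependent variable.
corr_with_jpi = scores.corr()["jpi"].drop("jpi").sort_values(ascending=False)
print(corr_with_jpi)
```

With the real data, the same call on perindex (after dropping the employee ID column) would give the correlations behind the scatter plot matrix.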
Model for the Case Study
This is our MLR model for the example, where the left-hand side is the dependent variable, which in our case is the job performance index, and the right-hand side is the set of independent variables. b0 is the intercept or constant of the model, whereas b1 to b4 are the parameters for the respective independent variables. e is the error term in the model.
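Written out, the case study model takes roughly the following form. The variable names are the column names used in the model formula later in this tutorial; the pairing of b1 to b4 with particular variables is an assumption made here for illustration, since only the number of parameters is stated in the text.

```latex
% Case study model; the assignment of b_1..b_4 to variables is assumed
\text{jpi} = b_0 + b_1\,\text{aptitude} + b_2\,\text{tol} + b_3\,\text{technical} + b_4\,\text{general} + e
```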
Parameter Estimation using the Least Squares Method
The parameters are estimated using the least squares method. Here we have 5 parameter estimates: one for each of the four independent variables and a constant term b0. Once these are estimated, the model equation is completely defined in terms of variables and estimated parameters. Let us see how to get these values in Python.
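To make the least squares idea concrete before turning to a regression library, here is a minimal sketch with NumPy on synthetic data. The data, the "true" parameter values and all variable names are hypothetical; the point is only that the estimates minimise the sum of squared residuals and satisfy the normal equations.

```python
import numpy as np

# Synthetic data (hypothetical, for illustration only): 2 predictors, 30 rows.
rng = np.random.default_rng(42)
X = rng.normal(size=(30, 2))
true_b = np.array([1.5, 2.0, -0.5])  # b0, b1, b2 used to simulate y
y = true_b[0] + X @ true_b[1:] + rng.normal(scale=0.1, size=30)

# Design matrix: a leading column of ones estimates the intercept b0.
A = np.column_stack([np.ones(len(X)), X])

# Least squares estimates: minimise the sum of squared residuals.
b_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

# The same estimates obtained from the normal equations (A'A) b = A'y.
b_normal = np.linalg.solve(A.T @ A, A.T @ y)

residuals = y - A @ b_hat
print("estimates:", b_hat)
print("sum of squared residuals:", residuals @ residuals)
```

Both routes should agree up to floating-point error; lstsq is generally the numerically safer choice when predictors are strongly correlated.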
Parameter Estimation Using ols() function in Python
We import the formula interface of the statsmodels library (statsmodels.formula.api) under the alias smf. The function to fit the regression model is ols, which stands for Ordinary Least Squares. The ols function takes a formula that names the dependent variable and the independent variables. The independent variables are separated using a plus sign.
The data argument specifies our case study dataset and the fit method estimates all our regression parameters. The results are stored in the jpimodel object. The params attribute of the jpimodel object holds the parameter estimates.
The sign of each parameter represents its relationship with the dependent variable.
import statsmodels.formula.api as smf
jpimodel = smf.ols('jpi ~ tol + aptitude + technical + general', data=perindex).fit()
jpimodel.params
ols() fits a linear regression.
~ separates dependent and independent variables.
The left-hand side of the tilde (~) is the dependent variable and the right-hand side lists the independent variables.
+ separates multiple independent variables.
jpimodel.params gives the model parameters.
The sign of each parameter shows whether the dependent variable tends to increase or decrease as that independent variable increases.
Interpretation of Partial Regression Coefficients
Let’s learn how to interpret these partial regression coefficients. In general, we say that for every unit increase in an independent variable (X), the expected value of the dependent variable will change by the corresponding parameter estimate (b), keeping all other variables constant. For example, the parameter estimate for the aptitude test is observed to be 0.32. Therefore, we infer that for a one unit increase in the aptitude score, the expected value of the job performance index will increase by 0.32 units, keeping all other scores constant.
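The interpretation above can be checked with a little arithmetic. In the sketch below, only the aptitude coefficient (0.32) comes from the text; the intercept, the other coefficients, the candidate's scores and the predict_jpi helper are all hypothetical values and names chosen for illustration.

```python
# Hypothetical coefficients for illustration; only 0.32 (aptitude) is quoted
# in the tutorial, the intercept and the other values are made up.
b0 = -50.0
coef = {"aptitude": 0.32, "tol": 0.05, "technical": 1.0, "general": 0.5}

def predict_jpi(scores):
    """Expected job performance index for a dict of test scores."""
    return b0 + sum(coef[name] * scores[name] for name in coef)

# Two hypothetical candidates who differ only by one point of aptitude.
candidate = {"aptitude": 50, "tol": 60, "technical": 55, "general": 45}
one_more = {**candidate, "aptitude": candidate["aptitude"] + 1}

change = predict_jpi(one_more) - predict_jpi(candidate)
print(round(change, 2))  # 0.32
```

Because every other score is held fixed, the difference in expected values equals the aptitude coefficient regardless of the other numbers chosen.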
To recap what we learned in this tutorial: we visualized bivariate relationships using a scatter plot matrix, and discussed how to fit an MLR model in Python and interpret the coefficients of the model.
This tutorial lesson is taken from the Postgraduate Diploma in Data Science.