We will refresh the concepts of scatter plot, correlation coefficient and simple regression. These concepts were already discussed in the R programming sessions. The focus here will be on writing Python programs to perform bivariate data analysis.
A scatter plot shows the nature of a bivariate relationship graphically. As you can see from these plots, the nature of a bivariate relationship can be
positive or negative. We can also get a situation where the two variables are not correlated.
In the case of positive correlation, as the value of one variable increases, the value of the other variable also tends to increase.
For example, if an increase in marketing spend is accompanied by an increase in total sales, then the variables have positive correlation.
In the case of negative correlation, as the value of one variable increases, the value of the other variable tends to decrease.
For example, if an increase in price leads to a decrease in product demand, then we get a case of negative correlation.
Pearson’s correlation coefficient, denoted by r, numerically measures the strength of a linear relationship between two variables.
It can take values between -1 and +1. The sign indicates the nature of the bivariate relationship, which can be positive or negative.
The value of r is zero when there is no linear relationship between the two variables. The correlation coefficient is not affected by a change of origin or scale.
Therefore, we need not worry about variables measured in entirely different units.
For example, you could correlate a person’s age with their blood sugar level. Here, the units are completely different.
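To see the change of origin and scale property in action, here is a small sketch. The age and blood sugar values are made up purely for illustration, and the variable names are our own.

```python
import numpy as np

# Hypothetical data (made-up values): age in years, blood sugar in mg/dL.
age_years = np.array([25, 32, 41, 47, 53, 60, 68], dtype=float)
sugar_mg_dl = np.array([88, 92, 99, 97, 110, 108, 121], dtype=float)

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is r.
r_original = np.corrcoef(age_years, sugar_mg_dl)[0, 1]

# Change of scale (years -> months) and change of origin (subtract 80).
age_months = age_years * 12
sugar_shifted = sugar_mg_dl - 80
r_transformed = np.corrcoef(age_months, sugar_shifted)[0, 1]

# Multiplying one variable by a negative number flips the sign of r
# but leaves its magnitude unchanged.
r_flipped = np.corrcoef(-age_years, sugar_mg_dl)[0, 1]

print(round(r_original, 4), round(r_transformed, 4), round(r_flipped, 4))
```

The first two printed values should agree, which is why the units of measurement do not matter for r.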
Simple linear regression is a mathematical model describing the relationship between a dependent variable Y and an independent variable X.
The model is Y = a + bX + e. The unknown parameters are a and b, where a is the intercept (the value at which the fitted line crosses the y-axis, i.e. at X=0)
and b is the slope of the line. The error term e is a random variable and can be thought of as the unexplained part of the model.
The unknown parameters a and b are estimated using sample data on Y and X.
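As a rough illustration of how a and b can be estimated from sample data, the sketch below applies the least squares formulas, b = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2) and a = mean_y - b * mean_x, to a handful of made-up values. The data and variable names are assumptions for this example only, and np.polyfit is used just as a cross-check.

```python
import numpy as np

# Hypothetical sample of X and Y values (illustrative only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

x_mean, y_mean = x.mean(), y.mean()

# Least squares estimates of the slope b and the intercept a.
b = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
a = y_mean - b * x_mean

# Residuals: the part of Y not explained by the fitted line a + b*X.
e = y - (a + b * x)

# Cross-check: a degree-1 polyfit returns [slope, intercept].
b_check, a_check = np.polyfit(x, y, deg=1)

print("intercept a:", a, "slope b:", b)
```

With an intercept in the model, the residuals from a least squares fit sum to zero, which is a handy sanity check.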
A scatter plot is used to visualize the bivariate relationship, whereas the correlation coefficient gives a numerical measure of the strength of
the bivariate relationship. Simple regression is very useful in predicting the value of one variable given the value of another in a bivariate scenario.
Let us revisit the case study. A company conducts different written tests before recruiting employees. The company wishes to see if the scores of these tests have any relation with the post-recruitment performance of those employees.
Here the objective is to study the correlation between aptitude and job proficiency.
Here is a snapshot of the data. The last column is the dependent variable, and the scores of the various tests conducted prior to recruitment are recorded under aptitude, testofen (test of English), tech and gk.
We will start by visualising the bivariate relationship. For creating a scatter plot, we will use the pyplot module from the matplotlib library. The data will be a pandas dataframe.
pd.read_csv() is used to import the csv file, which is stored as job.
plt.scatter() creates the scatter plot. It takes two numeric variables as its two main arguments. You can also specify the colour of the scatter points by adding the color= argument.
plt.xlabel() and plt.ylabel() give labels to the x and y axes respectively.
The slight upward direction of the scatter plot suggests that aptitude score and job proficiency have a positive relationship.
# Importing Data and necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
job = pd.read_csv("Job_Proficiency.csv")
plt.scatter(job.aptitude, job.job_prof, color='red')
plt.xlabel('Aptitude')
plt.ylabel('Job Prof')
Let us now quantify the degree of this bivariate relationship.
numpy has a function corrcoef **NOTE: To be read by the narrator as CORR-COEF**.
It calculates Pearson’s correlation coefficient. Two numeric variables are passed as the two arguments of this function, and it returns a 2x2 correlation matrix whose off-diagonal entries hold the coefficient.
For aptitude and job proficiency, the correlation coefficient is 0.51, which suggests that there is a moderate positive correlation.
import numpy as np
np.corrcoef(job.aptitude, job.job_prof)
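Since np.corrcoef() returns a matrix rather than a single number, it can help to see how the coefficient is picked out. The sketch below uses a small made-up dataframe, job_demo, as a stand-in for the job data; its values are hypothetical and will not reproduce the 0.51 from the case study.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the job data (values are made up).
job_demo = pd.DataFrame({
    "aptitude": [55, 62, 70, 48, 81, 66],
    "job_prof": [60, 71, 68, 52, 85, 64],
})

# 2x2 correlation matrix: diagonal entries are 1, off-diagonal entries are r.
corr_matrix = np.corrcoef(job_demo.aptitude, job_demo.job_prof)
r = corr_matrix[0, 1]

# The pandas Series.corr method (Pearson by default) gives the same value.
r_pandas = job_demo["aptitude"].corr(job_demo["job_prof"])

print(corr_matrix)
print(r, r_pandas)
```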
The statsmodels library has several robust modules for advanced statistical modeling. In order to perform simple linear modeling we’ll import statsmodels.formula.api as smf.
OLS stands for ordinary least squares, the method by which simple linear regression parameters are estimated. smf.ols() creates an OLS regression model from a formula and dataframe.
The formula has the dependent variable on the left side and the independent variable on the right side, separated by a tilde (~). The argument data= is used to specify the dataframe object.
.fit() fits the model.
.summary() gives the model summary.
# Simple Linear Regression
import statsmodels.formula.api as smf
model1 = smf.ols("job_prof ~ aptitude", data=job).fit()
model1.summary()
From the estimated slope, we can infer that job proficiency increases by 0.4992 units, on average, with a one unit increase in aptitude.
The intercept is the estimated value of the dependent variable when the value of the independent variable is zero.
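The slope and intercept can also be read directly from a fitted model and used for prediction. Here is a minimal sketch, assuming a small made-up dataframe job_demo in place of the real job data; the names job_demo, demo_model and new_scores are our own, and the estimates will differ from those in the case study.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical stand-in for the job data (values are made up).
job_demo = pd.DataFrame({
    "aptitude": [55, 62, 70, 48, 81, 66, 59, 74],
    "job_prof": [60, 71, 68, 52, 85, 64, 63, 77],
})

demo_model = smf.ols("job_prof ~ aptitude", data=job_demo).fit()

# .params is a pandas Series indexed by term name.
intercept = demo_model.params["Intercept"]
slope = demo_model.params["aptitude"]

# Predict job proficiency for two new aptitude scores.
new_scores = pd.DataFrame({"aptitude": [50, 60]})
predicted = demo_model.predict(new_scores)

# The same predictions computed by hand from the fitted line.
manual = intercept + slope * new_scores["aptitude"]

print(intercept, slope)
print(predicted)
```

The two predictions differ by ten times the slope, which is the "change per one unit" interpretation applied to a ten unit gap in aptitude.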
Let us consider one more case study.
The data consists of 100 retailers in the platinum segment of an FMCG company. The objective is to describe bivariate relationships in the data.
Each row of the data holds information about one retailer with a unique Retailer ID.
Zone, Retailer Age and NPS category are categorical variables, whereas performance index and growth are continuous numeric variables.
NPS stands for net promoter score and indicates loyalty to the company.
Categorical variables can be summarised using frequency tables.
We import the dataset as a pandas dataframe and store it in an object called retail_data.
pd.crosstab() gives the frequency counts for each combination of the two variables to be studied. The first argument of the function is the first variable, provided as index, whereas the second variable is given as columns. In this data, for instance, the South zone has the maximum number of promoters as well as detractors. The East and North zones have more Passive retailers than the other two.
retail_data = pd.read_csv('Retail_Data.csv')
# Frequency Tables
Freq = pd.crosstab(index=retail_data["Zone"], columns=retail_data["NPS_Category"])
Freq
Adding the argument normalize=True to pd.crosstab() yields proportions of the overall total instead of absolute frequencies.
If normalize= is set to ‘index’, we get the row-wise distribution.
Similarly, to get the column-wise distribution, normalize= should be set to ‘columns’.
# Percentage Frequency Tables
Freq = pd.crosstab(index=retail_data["Zone"], columns=retail_data["NPS_Category"], normalize=True)
Freq
Freq = pd.crosstab(index=retail_data["Zone"], columns=retail_data["NPS_Category"], normalize='index')
Freq
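To compare the three normalize= options side by side, here is a small sketch on a made-up dataframe, retail_demo, which only borrows the column names Zone and NPS_Category from the case study. The rows are hypothetical, so the proportions are not those of the real retailer data.

```python
import pandas as pd

# Hypothetical mini dataset (made-up rows, same column names as the case study).
retail_demo = pd.DataFrame({
    "Zone": ["East", "East", "North", "South", "South", "South", "West", "West"],
    "NPS_Category": ["Passive", "Promoter", "Passive", "Promoter",
                     "Detractor", "Promoter", "Detractor", "Passive"],
})

# Proportions of the grand total: all cells together sum to 1.
overall = pd.crosstab(retail_demo["Zone"], retail_demo["NPS_Category"],
                      normalize=True)

# Row-wise distribution: each Zone row sums to 1.
by_row = pd.crosstab(retail_demo["Zone"], retail_demo["NPS_Category"],
                     normalize="index")

# Column-wise distribution: each NPS_Category column sums to 1.
by_col = pd.crosstab(retail_demo["Zone"], retail_demo["NPS_Category"],
                     normalize="columns")

# Multiply by 100 if percentages are easier to read than proportions.
print((by_row * 100).round(1))
```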
It is also possible to summarise more than two categorical variables.
pd.crosstab() can give frequency counts of three variables in a single table when a list of two variables is passed as the index.
margins=False, which is also the default, ensures row / column subtotals are not printed.
# Three Way Frequency Table
table1 = pd.crosstab([retail_data.Zone, retail_data.NPS_Category], retail_data.Retailer_Age, margins=False)
table1
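For contrast, the sketch below builds the same kind of three-way table with and without margins. The dataframe retail_demo and its Retailer_Age labels are made up for illustration; the actual categories in the retailer data may be different.

```python
import pandas as pd

# Hypothetical mini dataset; the Retailer_Age labels are assumed, not from the data.
retail_demo = pd.DataFrame({
    "Zone": ["East", "East", "North", "South", "South", "South", "West", "West"],
    "NPS_Category": ["Passive", "Promoter", "Passive", "Promoter",
                     "Detractor", "Promoter", "Detractor", "Passive"],
    "Retailer_Age": ["<=2", "2 to 5", ">5", "<=2", "2 to 5", ">5", "<=2", ">5"],
})

# Two row variables (Zone, NPS_Category) against one column variable.
three_way = pd.crosstab([retail_demo.Zone, retail_demo.NPS_Category],
                        retail_demo.Retailer_Age, margins=False)

# margins=True appends an extra "All" row and "All" column holding the subtotals.
with_totals = pd.crosstab([retail_demo.Zone, retail_demo.NPS_Category],
                          retail_demo.Retailer_Age, margins=True)

print(three_way)
print(with_totals)
```

The bottom-right cell of the table with margins is the grand total, which equals the number of rows in the data.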
We revised concepts of scatter plot, correlation coefficient and simple regression.
Then we focused on Python coding for bivariate data analysis.
Finally, we also looked at summarizing two and three categorical variables.