Descriptive Statistics – Bivariate Relationships in Python

We will refresh the concepts of the scatter plot, the correlation coefficient and simple regression. These concepts were already discussed in the R programming sessions. The focus here will be on writing Python programs to perform bivariate data analysis.

A scatter plot shows the nature of a relationship graphically. The nature of a bivariate relationship can be positive or negative, and it is also possible for two variables to be uncorrelated.

In the case of positive correlation, as the value of one variable increases, the value of the other variable also increases.

For example: if an increase in marketing spend implies an increase in total sales, then the variables have a positive correlation.

In the case of negative correlation, as the value of one variable increases, the value of the other variable tends to decrease.

For example: if an increase in price leads to a decrease in product demand, then we get a case of negative correlation.

Pearson’s correlation coefficient numerically measures the strength of a linear relationship between two variables.

It can take values between -1 and +1. The sign indicates the nature of the bivariate relationship, which can be positive or negative.
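For reference, r is defined as the covariance of the two variables divided by the product of their standard deviations: r = cov(X, Y) / (sd(X) × sd(Y)).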

The value of r is zero when the two variables are completely uncorrelated. The correlation coefficient is not affected by a change of origin and scale.

Therefore, we need not worry about variables measured in entirely different units.

For example, you could correlate a person’s age with their blood sugar levels. Here, the units are completely different.
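As a quick illustration of this invariance, here is a minimal sketch (the numbers are made-up illustrative values) showing that shifting and rescaling a variable leaves r unchanged:

# Correlation is unaffected by change of origin and scale

 import numpy as np
 x = np.array([1, 2, 3, 4, 5])
 y = np.array([2, 4, 5, 4, 6])
 np.corrcoef(x, y)[0, 1]          # r for the original data
 np.corrcoef(10 + 2*x, y)[0, 1]   # identical r after shifting and scaling x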

Simple linear regression is a mathematical model describing the relationship between a dependent variable Y and an independent variable X. The model can be written as Y = a + bX + e.

In the model, the unknown parameters are a and b, where a is the intercept (the value at which the fitted line crosses the y-axis, i.e. where X = 0) and b is the slope of the line. The error term e is a random variable and can be thought of as the unexplained part of the model.

The unknown parameters a and b are estimated using sample data on Y and X.

A scatter plot is used to visualize the bivariate relationship, whereas the correlation coefficient gives a numerical measure of the strength of that relationship. Simple regression is very useful in predicting the value of one variable given the value of another in a bivariate scenario.

Let us revisit the case study. A company conducts different written tests before recruiting employees. The company wishes to see whether the scores on these tests have any relation to the post-recruitment performance of those employees.

Here, the objective is to study the correlation between aptitude and job proficiency.

Here is a snapshot of the data. The last column is the dependent variable, and the scores of the various tests conducted prior to recruitment are recorded under aptitude, testofen (test of English), tech and gk.

We will start by visualising the bivariate relationship. To create a scatter plot, we will use the pyplot module from the matplotlib library. The data will be a pandas dataframe.

pd.read_csv() is used to import the csv file, which is stored as job.

plt.scatter() creates the scatter plot. It takes two numeric variables as its two main arguments. You can also specify the colour of the scatter points by adding the color= argument.

plt.xlabel() and plt.ylabel() give labels to the x and y axes respectively.

The slight upward direction of the scatterplot suggests that aptitude score and job proficiency have a positive relationship.

# Importing Data and necessary libraries

 import pandas as pd
 import matplotlib.pyplot as plt
 job = pd.read_csv("Job_Proficiency.csv")

# Scatterplot

 plt.scatter(job.aptitude, job.job_prof, color='red')
 plt.xlabel('Aptitude')
 plt.ylabel('Job Prof')
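If you run this code as a standalone script rather than in a notebook, add plt.show() at the end to display the figure.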

Let us now quantify the degree of this bivariate relationship.

numpy has the function corrcoef, which calculates Pearson’s correlation coefficient. Two numeric variables are passed as the two arguments of this function.

For aptitude and job proficiency, the correlation coefficient is 0.51, which suggests a moderately positive correlation.

# Correlation

 import numpy as np
 np.corrcoef(job.aptitude,job.job_prof) 
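Note that np.corrcoef() returns a 2-by-2 correlation matrix, so the single coefficient can be picked out by indexing:

 r = np.corrcoef(job.aptitude, job.job_prof)[0, 1]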

The statsmodels library has several robust modules for advanced statistical modeling. In order to perform simple linear modeling we’ll import statsmodels.formula.api as smf.

OLS stands for ordinary least squares, the method by which simple linear regression parameters are estimated. smf.ols() creates an OLS regression model from a formula and dataframe.

The formula has the dependent variable on the left side and the independent variable on the right side, separated by a tilde (~). The data= argument is used to specify the dataframe object.

.fit() fits the model.

.summary() gives the model summary.

# Simple Linear Regression

 import statsmodels.formula.api as smf
 model1 = smf.ols("job_prof ~ aptitude", data=job).fit()
 model1.summary() 

We can infer that job proficiency changes by 0.4992 units for every one-unit change in aptitude.

Intercept is the value of the dependent variable when the value of the independent variable is zero.
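The fitted model can also be used for prediction. Here is a minimal sketch, where the aptitude score of 75 is a hypothetical value:

# Prediction for a new aptitude score (hypothetical value)

 new_data = pd.DataFrame({"aptitude": [75]})
 model1.predict(new_data)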

Let us consider one more case study.

The data consists of 100 retailers in the platinum segment of an FMCG company. The objective is to describe bivariate relationships in the data.

Each row of the data contains information about one retailer, identified by a unique Retailer ID.

Zone, Retailer Age and NPS Category are categorical variables, whereas performance index and growth are continuous numeric variables.

NPS stands for net promoter score and indicates loyalty to the company.

Categorical variables can be summarised using frequency tables.

We import the dataset as a pandas dataframe and store it in an object called retail_data.

pd.crosstab() gives the frequency counts of the two variables to be studied. The first argument of the function is the first variable, provided as index, whereas the second variable is given as columns.

In this data, for instance, the South zone has the maximum number of promoters as well as detractors. The East and North zones have more passive retailers than the other two.

#Importing Data

 retail_data = pd.read_csv('Retail_Data.csv') 

# Frequency Tables

 Freq = pd.crosstab(index=retail_data["Zone"], columns=retail_data["NPS_Category"])
 Freq 

Adding the argument normalize=True to the pd.crosstab() call yields proportions instead of absolute frequencies.

If normalize= is set to 'index', we get the row-wise distribution.

Similarly, to get the column-wise distribution, normalize= should be set to 'columns'.

# Percentage Frequency Tables

 Freq = pd.crosstab(index=retail_data["Zone"], columns=retail_data["NPS_Category"], normalize=True)
 Freq

 Freq = pd.crosstab(index=retail_data["Zone"], columns=retail_data["NPS_Category"], normalize='index')
 Freq
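For completeness, the column-wise version mentioned above follows the same pattern:

 Freq = pd.crosstab(index=retail_data["Zone"], columns=retail_data["NPS_Category"], normalize='columns')
 Freq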

It is also possible to summarise more than two categorical variables. 

pd.crosstab() can give frequency counts of three variables in a single table. 

margins=False ensures that row and column subtotals are not printed.

# Three Way Frequency Table

 table1 = pd.crosstab([retail_data.Zone, retail_data.NPS_Category],
                      retail_data.Retailer_Age, margins = False)
 table1
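Conversely, a sketch with margins set to True, which adds an 'All' row and column containing the subtotals:

# Three Way Frequency Table with subtotals

 table2 = pd.crosstab([retail_data.Zone, retail_data.NPS_Category],
                      retail_data.Retailer_Age, margins = True)
 table2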

We revised the concepts of the scatter plot, the correlation coefficient and simple regression.

Then we focused on Python coding for bivariate data analysis.

Finally, we also looked at summarizing two and three categorical variables.

Naive Bayes vs Binary Logistic regression using R

The Naïve Bayes method is one of the most frequently used machine learning algorithms and is used for classification problems. It can therefore be considered an alternative to Binary Logistic Regression and Multinomial Logistic Regression, which we have discussed in previous tutorials. In this tutorial we’ll look at Naïve Bayes in detail. Data for the case study can be downloaded here.


Time Series Decomposition in R

In a previous tutorial, we discussed the basics of time series and time series analysis. We looked at how to convert data into time series data and analyze this in R. In this tutorial, we’ll go into more depth and look at time series decomposition.

We’ll firstly recap the components of a time series and then discuss the moving average concept. After that we’ll focus on two time series decompositions – a simple method based on moving averages and the local regression method.


Binary Logistic Regression in Python – a tutorial Part 1

In this tutorial, we will learn about binary logistic regression and its application to real life data using Python. We have also covered binary logistic regression in R in another tutorial. Without a doubt, binary logistic regression remains the most widely used predictive modeling method. Logistic regression is a classification algorithm that is used to predict the probability of a categorical dependent variable. The method is used to model a binary variable that takes two possible values, typically coded as 0 and 1.


Binary Logistic Regression – a tutorial

In this tutorial we’ll learn about binary logistic regression and its application to real life data. Without any doubt, binary logistic regression remains the most widely used predictive modeling method.


Binary Logistic Regression with R – a tutorial

In a previous tutorial, we discussed the concept and application of binary logistic regression. We’ll now learn more about binary logistic regression model building and its assessment using R.

Firstly, we’ll recap our earlier case study and then develop a binary logistic regression model in R, followed by an explanation of model sensitivity and specificity, and how to estimate these using R.


Multiple Linear Regression in R – a tutorial

Multiple Linear Regression (MLR) is the backbone of predictive modeling and machine learning and an in-depth knowledge of MLR is critical to understanding these key areas of data science. This tutorial is intended to provide an initial introduction to MLR using R. If you’d like to cover the same area using Python, you can find our tutorial here.


Predictive Analytics – An introductory overview

We’ll begin with an introduction to predictive modelling. We’ll then discuss important statistical models, followed by a general approach to building predictive models and finally, we’ll cover the key steps in building predictive models. Please note that prerequisites for starting out in predictive modeling are an understanding of exploratory data analysis and statistical inference.


T Distribution, Kolmogorov-Smirnov and Shapiro-Wilk Tests

In a previous tutorial we looked at key concepts in statistical inference. We’ll now look at the t distribution, the Kolmogorov-Smirnov and Shapiro-Wilk tests, and standard parametric tests. Parametric tests are tests that make assumptions about the parameters of the population distribution from which a sample is drawn. We’ll begin with normality assessment using the Quantile-Quantile plot (also called the Q-Q plot), the Shapiro-Wilk test and the Kolmogorov-Smirnov test. Then, we’ll cover the t distribution briefly. Finally, the one sample t-test, which is a standard parametric test, will be looked at in detail.
