Descriptive Statistics – Bivariate Relationships in Python
We will refresh the concepts of scatter plots, correlation coefficients and simple regression. These concepts were already discussed in the R programming sessions. The focus here will be on writing Python programs to perform bivariate data analysis.
A scatter plot shows the nature of a relationship graphically. As you can see from these plots, the nature of a bivariate relationship can be
positive or negative. We can also encounter situations where the two variables are not correlated.
In the case of positive correlation, as the value of one variable increases, the value of the other variable also increases.
For example: if an increase in marketing spend implies an increase in total sales, then the variables have a positive correlation.
In the case of negative correlation, as the value of one variable increases, the value of the other variable tends to decrease.
For example: if an increase in price leads to a decrease in product demand, then we have a case of negative correlation.
The Pearson’s correlation coefficient numerically measures the strength of a linear relationship between two variables.
It can take values between -1 and +1. The sign indicates the nature of the bivariate relationship, which can be positive or negative.
The value of r can be zero if the two variables are totally uncorrelated. The correlation coefficient is not affected by a change of origin or scale.
Therefore, we need not worry about variables measured on entirely different units.
For Example, you could correlate a person’s age with their blood sugar levels. Here, the units are completely different.
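To make the invariance concrete, here is a small sketch using made-up age and blood sugar values (the numbers are purely illustrative, not real data); shifting and rescaling age leaves the correlation coefficient unchanged:

```python
import numpy as np

# Hypothetical data: ages (years) and blood sugar levels (mg/dL)
age = np.array([25, 34, 48, 52, 61, 70])
sugar = np.array([82, 90, 101, 110, 118, 130])

r = np.corrcoef(age, sugar)[0, 1]

# Change of origin and scale: measure age in months, starting from age 20
age_transformed = (age - 20) * 12
r_transformed = np.corrcoef(age_transformed, sugar)[0, 1]

print(round(r, 4), round(r_transformed, 4))  # the two values agree
```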
Simple linear regression is a mathematical model describing the relationship between a dependent variable Y and an independent variable X.
In the model, the unknown parameters are a and b, where a is the intercept (the value at which the fitted line crosses the y-axis, i.e. where X = 0)
and b is the slope of the line. The error term e is a random variable and can be thought of as the unexplained part of the model.
The unknown parameters a and b are estimated using sample data on Y and X.
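As a quick sketch of how these estimates are computed, the least-squares formulas b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and a = ȳ − b·x̄ can be applied directly to a small sample (the x and y values below are made up, purely for illustration):

```python
import numpy as np

# Illustrative data: x = independent variable, y = dependent variable
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([3.1, 5.2, 6.8, 9.1, 10.9])

# Least-squares estimates of slope b and intercept a
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

print(a, b)  # fitted line: y = a + b * x
```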
A scatter plot is used to visualize the bivariate relationship, whereas the correlation coefficient gives a numerical measure of its strength. Simple regression is very useful for predicting the value of one variable given the value of another in a bivariate scenario.
Let us revisit the case study. A company conducts different written tests before recruiting employees. The company wishes to see if the scores of these tests have any relation with post-recruitment performance of those employees.
Here the objective is to study the correlation between aptitude and job proficiency.
Here is a snapshot of the data. The last column is the dependent variable, and the scores of the various tests conducted prior to recruitment are recorded under aptitude, testofen (test of English), tech and gk.
We will start by visualising the bivariate relationship. To create a scatter plot, we will use the pyplot module from the matplotlib library. The data will be a pandas dataframe.
pd.read_csv() is used to import the csv file, which we store as job.
plt.scatter() creates the scatter plot. It takes two numeric variables as its two main arguments. You can also specify the colour of the scatter points by adding the color= argument.
plt.xlabel() and plt.ylabel() give labels to the x and y axes respectively.
The slight upward direction of the scatterplot suggests that aptitude score and job proficiency have a positive relationship.
# Importing Data and necessary libraries
import pandas as pd
import matplotlib.pyplot as plt

job = pd.read_csv("Job_Proficiency.csv")

plt.scatter(job.aptitude, job.job_prof, color='red')
plt.xlabel('Aptitude')
plt.ylabel('Job Prof')
Let us now quantify the degree of this bivariate relationship.
numpy has the function corrcoef.
It calculates Pearson's correlation coefficient. Two numeric variables are passed as the two arguments of this function.
For aptitude and job proficiency, the correlation coefficient is 0.51 which suggests that there is a moderately positive correlation.
import numpy as np

np.corrcoef(job.aptitude, job.job_prof)
The statsmodels library has several robust modules for advanced statistical modeling. To perform simple linear regression, we'll import statsmodels.formula.api as smf.
OLS stands for ordinary least squares, the method by which simple linear regression parameters are estimated. smf.ols() creates an OLS regression model from a formula and dataframe.
The formula has a dependent variable on the left side and an independent variable on the right side, separated by tilde. Argument data= is used to specify the dataframe object.
.fit() fits the model.
.summary() gives the model summary.
# Simple Linear Regression
import statsmodels.formula.api as smf

model1 = smf.ols("job_prof ~ aptitude", data=job).fit()
model1.summary()
We can infer that job proficiency changes by 0.4992 units for every one-unit change in aptitude.
Intercept is the value of the dependent variable when the value of the independent variable is zero.
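Once a model is fitted, its .predict() method can score new observations. The sketch below uses a small made-up dataframe standing in for the Job_Proficiency data (the numbers are illustrative, not from the case study); the same call applies to model1 above:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Made-up sample standing in for the Job_Proficiency data
job = pd.DataFrame({
    "aptitude": [60, 65, 70, 75, 80, 85],
    "job_prof": [55, 60, 63, 69, 72, 78],
})

model = smf.ols("job_prof ~ aptitude", data=job).fit()

# Predict job proficiency for two new aptitude scores
new_scores = pd.DataFrame({"aptitude": [62, 78]})
preds = model.predict(new_scores)
print(preds)
```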
Let us consider one more case study.
The data consists of 100 retailers in platinum segment of an FMCG company. The objective is to describe bivariate relationships in the data.
Each row of the data is information about one retailer with unique Retailer ID.
Zone, Retailer Age and NPS category are categorical variables whereas performance index and growth are numeric continuous variables.
NPS stands for net promoter score and indicates loyalty with the company.
Categorical variables can be summarised using Frequency Tables.
We import the dataset as a pandas dataframe and store it in object called retail_data.
pd.crosstab() gives the frequency counts of the two variables to be studied. The first argument of the function is the first variable, provided as index, whereas the second variable is given as columns. In this data, for instance, the South zone has the maximum number of promoters as well as detractors. The East and North zones have more passive retailers than the other two.
retail_data = pd.read_csv('Retail_Data.csv')
# Frequency Tables
Freq = pd.crosstab(index=retail_data["Zone"], columns=retail_data["NPS_Category"])
Freq
Adding the argument normalize=True to the pd.crosstab() call yields proportions instead of absolute frequencies.
If normalize= is set to 'index', we get the row-wise distribution.
Similarly, to get the column-wise distribution, normalize= should be set to 'columns'.
# Percentage Frequency Tables
Freq = pd.crosstab(index=retail_data["Zone"], columns=retail_data["NPS_Category"], normalize=True)
Freq

Freq = pd.crosstab(index=retail_data["Zone"], columns=retail_data["NPS_Category"], normalize='index')
Freq
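For completeness, here is the column-wise case on a small made-up frame mimicking the retail data (the rows below are illustrative, not taken from Retail_Data.csv); each column of proportions sums to 1:

```python
import pandas as pd

# Made-up frame mimicking the retail data's categorical columns
retail = pd.DataFrame({
    "Zone": ["East", "East", "North", "South", "South", "West"],
    "NPS_Category": ["Passive", "Promoter", "Passive", "Promoter", "Detractor", "Promoter"],
})

# Column-wise distribution: each NPS category column sums to 1
col_dist = pd.crosstab(index=retail["Zone"], columns=retail["NPS_Category"], normalize='columns')
print(col_dist)
```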
It is also possible to summarise more than two categorical variables.
pd.crosstab() can give frequency counts of three variables in a single table.
margins=False ensures that row/column subtotals are not printed.
# Three Way Frequency Table
table1 = pd.crosstab([retail_data.Zone, retail_data.NPS_Category], retail_data.Retailer_Age, margins=False)
table1
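Conversely, setting margins=True appends 'All' subtotals to the table. A quick sketch on a made-up sample (illustrative rows and age bands, not the actual retail data):

```python
import pandas as pd

# Made-up sample standing in for the retail data
retail = pd.DataFrame({
    "Zone": ["East", "North", "South", "South"],
    "NPS_Category": ["Passive", "Promoter", "Promoter", "Detractor"],
    "Retailer_Age": ["<3 yrs", "3-5 yrs", "<3 yrs", "3-5 yrs"],
})

# margins=True adds an 'All' row and column with subtotals
table = pd.crosstab([retail.Zone, retail.NPS_Category], retail.Retailer_Age, margins=True)
print(table)
```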
We revised concepts of scatter plot, correlation coefficient and simple regression.
Then we focused on python coding for bivariate data analysis.
Finally, we also looked at summarizing two and three categorical variables.
Naive Bayes vs Binary Logistic regression using R
The Naïve Bayes method is one of the most frequently used machine learning algorithms for classification problems. It can therefore be considered an alternative to binary logistic regression and multinomial logistic regression, which we have discussed in previous tutorials. In this tutorial we'll look at Naïve Bayes in detail. Data for the case study can be downloaded here.
Time Series Decomposition in R
In a previous tutorial, we discussed the basics of time series and time series analysis. We looked at how to convert data into time series data and analyze this in R. In this tutorial, we'll go into more depth and look at time series decomposition.
We’ll firstly recap the components of time series and then discuss the moving average concept. After that we’ll focus on two time series decompositions – a simple method based on moving averages and the local regression method.
Why getting a data science master's is a great idea in 2021
There's no doubt that data scientists and their close cousins, artificial intelligence and machine learning specialists, are highly sought after in today's digital economy. Surveys from Glassdoor, Indeed and LinkedIn in 2020 and 2021 rank them as the most in-demand professionals. In particular, data science salaries for well-qualified practitioners can match and exceed those of more traditional highly paid professions.
There’s a good reason for this. While the most successful companies in the world today aim to do what businesses have always aimed to do – create new products and services, sell more of them, serve customers better, drive down costs, generate efficiencies, reach new markets and so on – what they do differently is use a deep understanding of data and the insights drawn from it to drive their strategies and decision making. The use of science and analytics has enabled them to rise at an unprecedented rate and leave traditional incumbents in their wake.
Because this deep understanding of the power of data is at the core of these organisations and this is also increasingly so in companies that need to catch up, there is a shortage of people with data science skills and knowledge, making data scientists highly valued.
The professional title of data scientist first appeared in 2008, so we could say that data science is a very new profession. Organisations have certainly been collecting and analysing data to inform business decisions since long before then and the data analyst role has been around for some time. But data science as a profession has emerged at the same time as the explosion of data resulting from the internet and mobile computing, the exponential increase in computing processing power, cloud computing and advances in statistical knowledge.
Where traditional data analysis focuses on describing situations with past data, creating visual representations and making predictions using a range of software tools and basic statistical techniques, data science adds machine learning, artificial intelligence and big data to the required skill set. All this means that data scientists need advanced statistical mathematical knowledge as well as programming ability.
However, beyond this is the capability to use those skills to uncover novel and hidden solutions to problems in a vast array of fields, so data scientists also need a good understanding of business and most often knowledge of specific domains where data science is used.
So what does it take to be a data scientist (or a machine learning and artificial intelligence specialist)? First of all there needs to be a passion for data and numbers, a curious, creative mind and a desire to solve problems.
Then there’s the level of education. Given the intellectual demands and multidisciplinary nature of data science, machine learning and artificial intelligence, a high level of education is a must.
Data Scientist positions advertised by tech giants, major multinationals, dynamic start ups and specialist consultancies alike generally require at least a postgraduate level of education, even for entry level roles. Increasingly, a master’s in data science itself, of which a growing number are becoming available globally, is a specific prerequisite.
A 2018 analysis by Indeed Engineering found that 75% of data scientists have at least a master's degree, often in a relevant discipline such as computer science, mathematics, statistics or other numerate areas. Interestingly, machine learning engineers had a similar educational profile. Indeed's analysis also found that data scientists had the highest average level of education in comparison with related job titles, including data engineers, software engineers and data analysts.
Data scientists also come from the widest variety of backgrounds. As Chris Linder from Indeed put it: “If you ask every data scientist around you what they did before DS, they’re each likely to give you a different answer. Many come from master’s and PhD programs, in fields ranging from astrophysics to zoology. Others come from the many new data science graduate programs that universities now offer. And still others came from other technology roles, such as software engineering or data analysis.”
Industry and academic practitioners agree that a good postgraduate programme in data science should have a core of technical knowledge – exploratory data analysis, statistical inference, predictive modelling, machine learning and artificial intelligence. There is also programming – a well-rounded practitioner will be able to work in data science with R or Python, or, ideally, both languages. We may then add to this big data analytics and engineering. Perhaps of equal importance is that the programme requires students to look at challenging real-world problems and apply their data skills and thinking to creatively solve them.
There are other ways to learn about data science. Intensive bootcamps, modular, self directed courses on MOOC’s and specialist courses leading to industry certifications, for example, are all opportunities to get an introduction to data science and beyond. These programmes give people an opportunity to learn the basics of data analysis, statistical methods and machine learning. They will also give learners a background in the tools and packages used by data scientists and analysts.
However, as leading data scientist Jeff Leek has stated, “the key word in data science is not ‘data’; it is ‘science’.” Data science is more about using scientific thinking to solve hard problems and gain meaningful insights from data than about the tools and techniques that shorter, less academic courses focus on. These options could be a good route for those who have already completed a master's or higher degree in another area and want to get into data science, but those without a higher education looking to enter the field through them may find it more difficult. The 2018 Indeed Engineering findings referred to earlier back this up: fewer than five percent of data scientists had an education up to high school or associate degree level only.
Universities around the world have begun to recognise data science as a discipline in its own right and as a result have introduced specialist postgraduate data science degrees. This option is the most likely to bring success for prospective data scientists and their employers, but the challenge can be the cost and time it takes to complete a master's degree, particularly for those currently in employment.
The fees for an online data science master's from a reputable university in the US or UK start at $12,000, but are typically around $20,000. For top-ranked schools an online master's degree can cost upwards of $40,000. Studying full-time, a student can expect to spend 18 months to two years; part-time, between two and three years.
One alternative is to study a postgraduate-level diploma that gains advanced entry into a master's degree. One such diploma is the UK-awarded Qualifi Level 7 Diploma in Data Science, which carries 120 UK credits and represents two thirds of a master's degree. This not only gives a choice of universities at which to complete a data science master's but also provides an opportunity to save both time and cost in doing so.
Binary Logistic Regression in Python – a tutorial Part 1
In this tutorial, we will learn about binary logistic regression and its application to real life data using Python. We have also covered binary logistic regression in R in another tutorial. Without a doubt, binary logistic regression remains the most widely used predictive modeling method. Logistic regression is a classification algorithm that is used to predict the probability of a categorical dependent variable. The method is used to model a binary variable that takes two possible values, typically coded as 0 and 1.
Introduction to Multiple Linear Regression – Python
Multiple Linear Regression (MLR) is the backbone of predictive modelling and machine learning, and an in-depth knowledge of MLR is critical in the predictive modeling world. We previously discussed implementing multiple linear regression in an R tutorial; now we'll look at implementing multiple linear regression using Python.
Binary Logistic Regression – a tutorial
In this tutorial we’ll learn about binary logistic regression and its application to real life data. Without any doubt, binary logistic regression remains the most widely used predictive modeling method.
Binary Logistic Regression with R – a tutorial
In a previous tutorial, we discussed the concept and application of binary logistic regression. We’ll now learn more about binary logistic regression model building and its assessment using R.
Firstly, we'll recap our earlier case study and then develop a binary logistic regression model in R, followed by an explanation of model sensitivity and specificity, and how to estimate these using R.
Multiple Linear Regression in R – a tutorial
Multiple Linear Regression (MLR) is the backbone of predictive modeling and machine learning, and an in-depth knowledge of MLR is critical to understanding these key areas of data science. This tutorial is intended to provide an initial introduction to MLR using R. If you'd like to cover the same area using Python, you can find our tutorial here.
Predictive Analytics – An introductory overview
We’ll begin with an introduction to predictive modelling. We’ll then discuss important statistical models, followed by a general approach to building predictive models and finally, we’ll cover the key steps in building predictive models. Please note that prerequisites for starting out in predictive modeling are an understanding of exploratory data analysis and statistical inference.