In the third part of this series of data visualisation tutorials (see part 1 and part 2), we’ll learn how to visually represent relationships between two or more variables and how to create those data visualisations in Python. Specifically, we’ll illustrate how to summarise data using scatter plots, heat maps, bubble charts, trend lines and motion charts in Python.
You can download the data files for this tutorial here.
The first plot to explore is the scatter plot. A scatter plot is a two-dimensional visualisation that uses dots to represent the relationship between two continuous variables, one on the x axis and the other on the y axis. Each dot represents the x and y coordinates of a single observation.
In this example, we want to understand the relationship between the height and weight of children. Height is plotted on the x axis and weight on the y axis. The points show a clear upward trend, implying that as the height of a child increases, their weight also increases. Note that this relationship is not perfect: some taller children might weigh less than shorter ones. But looking at the trend, we can infer that weight is positively correlated with height.
Scatter plots are used when you want to see how two variables are correlated. In the height and weight example, the chart isn’t just a simple log of the height and weight of a set of children; it also visualises the relationship between height and weight, namely that weight increases as height increases.
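To make the idea concrete, here is a minimal sketch that draws such a scatter plot and reports the correlation coefficient. The height and weight values are synthetic and generated only for illustration (they are not the tutorial’s data), and all variable names are assumptions.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data for illustration only: 30 children, height in cm, weight in kg
rng = np.random.default_rng(seed=0)
height = rng.uniform(100, 160, size=30)
weight = 0.5 * height - 30 + rng.normal(0, 4, size=30)

# Pearson correlation coefficient between the two variables
r = np.corrcoef(height, weight)[0, 1]

# One dot per child: height on the x axis, weight on the y axis
fig, ax = plt.subplots()
ax.scatter(height, weight)
ax.set_xlabel("Height (cm)")
ax.set_ylabel("Weight (kg)")
ax.set_title(f"Height vs weight (r = {r:.2f})")
```

Call plt.show() to display the figure when running this outside a notebook.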
Scatter Plots in Python
Let us take another example and see how to create a scatter plot in Python.
The data is for employees of a company and it shows scores of various attribute tests conducted by the company. Sample size is 25.
We want to check the correlation between an employee’s aptitude score and their job proficiency.
This is a snapshot of the data.
Empno is the id variable. Aptitude, Test of English, Technical Score, General Knowledge and Job Proficiency Scores are numeric variables with positive values.
We will start by importing the data as a pandas dataframe.
Using the read_csv() function from pandas, the data object is imported and stored as job.
The argument index_col=0 instructs Python to take the first column of the csv file as the index.
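The effect of index_col=0 can be checked on a small in-memory CSV. This is only a sketch: the three rows below are hypothetical values, and the column names empno, aptitude and job_prof are assumed from the description of the data.

```python
import io

import pandas as pd

# Hypothetical three-row extract; the real file has 25 rows and more columns
csv_text = """empno,aptitude,job_prof
1,86,88
2,62,80
3,110,96
"""

# index_col=0 turns the first column (empno) into the row index
job_demo = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(job_demo.index.name)     # empno
print(list(job_demo.columns))  # ['aptitude', 'job_prof']
```

Because empno becomes the index, it is no longer treated as a numeric column in later plots.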
For visualization, we use two Python libraries: seaborn and matplotlib. Seaborn is imported as sns and the pyplot module from matplotlib is imported as plt.
sns.lmplot() draws a scatter plot and overlays a fitted regression line. It takes the x and y variables, followed by the name of the data object.
Aptitude score is plotted on the x axis and the Job Proficiency score on the y axis. The next few lines of code add attributes to the basic chart using pyplot.
plt.xlabel and plt.ylabel add x and y axis labels respectively, and plt.title adds a title to the graph.
# Importing the data
import pandas as pd
job = pd.read_csv('JOB PROFICIENCY DATA.csv', index_col=0)
# Importing seaborn and pyplot
import seaborn as sns
import matplotlib.pyplot as plt
# Scatter plot of job proficiency against aptitude with regression line
# (x and y are passed by keyword; recent seaborn releases no longer accept them positionally)
sns.lmplot(x='aptitude', y='job_prof', data=job)
plt.xlabel('Aptitude')
plt.ylabel('Job Proficiency')
plt.title('Fig.No. 1: ScatterPlot with Regression Line')
This scatter plot of the data shows that as the aptitude score increases, job proficiency also increases.
For a given aptitude score, the job proficiency can be estimated and vice-versa using the regression line.
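As a rough sketch of how such an estimate can be computed: sns.lmplot() draws the line but does not return its coefficients, so the example below fits a least-squares line separately with numpy.polyfit. The scores are synthetic and the helper estimate_job_prof is a hypothetical name introduced for illustration.

```python
import numpy as np

# Synthetic scores for illustration only (not the tutorial's data)
rng = np.random.default_rng(seed=1)
aptitude = rng.uniform(60, 120, size=25)
job_prof = 40 + 0.5 * aptitude + rng.normal(0, 5, size=25)

# Fit job_prof = slope * aptitude + intercept by ordinary least squares
slope, intercept = np.polyfit(aptitude, job_prof, deg=1)


def estimate_job_prof(score):
    """Estimate job proficiency for a given aptitude score from the fitted line."""
    return slope * score + intercept


print(round(estimate_job_prof(100), 1))
```

Estimates are most trustworthy inside the range of aptitude scores that were actually observed.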
Scatter Plot Matrix in Python
Instead of looking at just two variables at a time, it might be more efficient to visualise several bivariate relationships in a single diagram. A scatter plot matrix does exactly that.
A scatter plot matrix shows the scatter plot of every pair of variables in a single chart, which makes it useful for checking whether there are linear correlations among multiple variables.
sns.pairplot generates a pairwise scatter plot matrix. The data object is the only argument it requires.
plt.suptitle adds a title above the whole matrix; plt.title, used earlier, would only label a single panel of the grid.
# ScatterPlot Matrix
sns.pairplot(job)
plt.suptitle('Fig.No. 2: ScatterPlot Matrix', y=1.02)
The above matrix shows that scores in aptitude, language, technical and general knowledge tests have a positive correlation with job proficiency.
Technical and general knowledge scores are also slightly positively correlated, whereas there is no such noticeable relationship among the other test scores.
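A correlation matrix is a useful numeric companion to the scatter plot matrix, since it puts a number on each pairwise relationship. The sketch below uses synthetic scores; the column name gen_knowledge is hypothetical, and the values are not the tutorial’s data.

```python
import numpy as np
import pandas as pd

# Synthetic test scores for illustration only
rng = np.random.default_rng(seed=2)
n = 25
aptitude = rng.normal(90, 15, size=n)
tech = rng.normal(100, 10, size=n)
gen_knowledge = 0.4 * tech + rng.normal(60, 10, size=n)
job_prof = 0.3 * aptitude + 0.4 * tech + rng.normal(0, 5, size=n)

scores = pd.DataFrame({
    "aptitude": aptitude,
    "tech_": tech,
    "gen_knowledge": gen_knowledge,
    "job_prof": job_prof,
})

# Pairwise Pearson correlations: one value per panel of the scatter plot matrix
corr = scores.corr()
print(corr.round(2))
```

Values near +1 or -1 correspond to panels where the dots fall close to a straight line.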
The next visualisation can be thought of as an extension of the scatterplot concept. Bubble charts are useful when there are three continuous variables to visualise. Two variables are plotted along the two axes, whereas the size of the bubbles is determined by the third variable.
Note here that the aptitude score does not actually show any consistent trend.
Bubble Chart in Python
Let us now see how to create a bubble chart in Python.
sns.scatterplot() draws a scatter plot. It takes the x and y variables, followed by the name of the data object.
The size= argument specifies which variable determines the bubble size. The hue= argument colours the dots according to a variable. In this example, we are only using three continuous variables, so hue and size both use aptitude. However, if your data has a fourth variable, then hue can take that.
Title, x label and y label are added to the plot using pyplot functionalities as illustrated earlier.
sns.scatterplot(x='tech_', y='job_prof', data=job, hue='aptitude', size='aptitude')
plt.title('Fig.No. 3: BUBBLE CHART')
plt.xlabel('Technical')
plt.ylabel('Job Proficiency')
The bubble chart shows that as the technical score increases, job proficiency also increases, irrespective of the aptitude score.
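The same kind of chart can be tried without the data file. The sketch below uses synthetic scores and additionally passes sizes=, a seaborn parameter that sets the smallest and largest bubble area; the (20, 300) range is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic scores for illustration only (not the tutorial's data)
rng = np.random.default_rng(seed=3)
demo = pd.DataFrame({
    "tech_": rng.uniform(80, 120, size=25),
    "aptitude": rng.uniform(60, 120, size=25),
})
demo["job_prof"] = 0.8 * demo["tech_"] + rng.normal(0, 5, size=25)

# The third variable (aptitude) drives both bubble size and colour
ax = sns.scatterplot(data=demo, x="tech_", y="job_prof",
                     size="aptitude", hue="aptitude", sizes=(20, 300))
ax.set(xlabel="Technical", ylabel="Job Proficiency", title="Bubble chart sketch")
```

Widening the sizes range makes differences in the third variable easier to see, at the cost of more overlap between bubbles.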
Heat Maps in Python
A heat map is a graphical representation of data where individual values contained in a matrix are represented as colours. It gives us quick information through color patterns.
In the example below, we can see temperature fluctuations in New York across months over several years.
To create a heatmap in Python, we’ll use New York temperature data with 108 observations. The objective is to see which are the hottest months in the years and how temperature has fluctuated over the years.
This is a snapshot of the data.
Year is a categorical variable with values from 2009 to 2017. Month is also a categorical variable with 12 month names. Temperature is a numeric variable showing the average temperature in degrees Fahrenheit.
We need two libraries to create a heatmap in Python – seaborn and calendar.
We start by importing the data and arranging months in the right order. pd.read_csv() imports the csv file and stores it as a pandas dataframe.
pd.pivot_table() converts the data into tabular format. The first argument in this function is the name of the dataframe object. index= specifies the table rows and columns= specifies the columns. Here, months are rows (indices) and years are columns.
The pivot table shows months in alphabetical order, so we need to reorder them chronologically. agg.reindex() puts the months in the correct order and uses them as the index. calendar.month_abbr gives the abbreviated month names; its first entry is an empty string, so the month names start at position 1.
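The pivot-and-reorder step can be seen on a tiny table. This is a sketch under stated assumptions: the temperatures are hypothetical, and the column names carry no trailing spaces here, unlike the names used with the tutorial’s file.

```python
import calendar

import pandas as pd

# Hypothetical extract: two years, three months, temperatures in degrees Fahrenheit
temps = pd.DataFrame({
    "Year": [2009, 2009, 2009, 2010, 2010, 2010],
    "Month": ["Jan", "Feb", "Mar", "Jan", "Feb", "Mar"],
    "Temperature": [28.0, 36.5, 42.0, 32.5, 33.0, 48.0],
})

table = pd.pivot_table(temps, index="Month", columns="Year", values="Temperature")
print(list(table.index))  # ['Feb', 'Jan', 'Mar'] (alphabetical)

# calendar.month_abbr[0] is an empty string, so keep only names present in the data
month_order = [m for m in calendar.month_abbr if m in table.index]
table = table.reindex(month_order)
print(list(table.index))  # ['Jan', 'Feb', 'Mar'] (chronological)
```

Passing values= keeps the column labels as plain years, so no separate renaming of the columns is needed in this sketch.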
# Importing the packages
import seaborn as sns
import calendar
# Importing Data and Arranging the Months in the right order :
heatmapdata = pd.read_csv('Average Temperatures in NY.csv')
agg = pd.pivot_table(heatmapdata, index=['Month '], columns=['Year '])
agg.columns = (heatmapdata['Year ']).unique()
# month_abbr[0] is an empty string, so skip it to avoid an empty first row
agg = agg.reindex(list(calendar.month_abbr)[1:])
# Heat Map
ax = sns.heatmap(agg)
ax.set(xlabel='Year', ylabel='Month', title='Heatmap')
plt.show()
The heatmap shows that July is the hottest month across the years. Moreover, 2015 showed a longer hot period compared to other years.
Colours in the heatmap make it very easy to interpret complex numbers.
Trend Lines in Python
The next visualization type is a trend line. Trend lines are straight lines that connect two or more data points and then extend into the future to act as a line of support or resistance. They are usually used to plot something over time. We also use them to estimate future values.
In this example, we can observe the increase and decrease in the total number of calls over a period of 24 weeks.
The data that we are using to create a trend line is a telecom company’s data for 24 weeks. The data has 21902 entries and the objective of the visualisation is to observe the trend of total calls over the span of the 24 weeks.
This is a snapshot of the data.
CustID is the id variable.
Week gives the week number, Calls is the number of calls, Minutes gives the total minutes and Amt shows the amount charged in Indian Rupees. All these columns are numeric.
We import this data as a pandas dataframe and store it as transaction.
Before plotting, we need to merge and format the data.
In order to plot weekwise calls, we need to aggregate the Calls variable by Week and use sum as the aggregation function.
to_frame() converts the aggregated result (a pandas Series) into a dataframe.
reset_index() resets the index.
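The aggregation steps described above can be sketched on a few rows. The values below are hypothetical; the column names CustID, Week and Calls follow the description of the data.

```python
import pandas as pd

# Hypothetical extract of the transaction data
transaction_demo = pd.DataFrame({
    "CustID": [101, 102, 101, 103, 102],
    "Week": [1, 1, 2, 2, 2],
    "Calls": [5, 3, 4, 6, 2],
})

# Sum Calls within each Week; the result is a Series indexed by Week
weekly = transaction_demo.groupby("Week")["Calls"].sum()

# to_frame() makes it a dataframe, reset_index() turns Week back into a column
trend_demo = weekly.to_frame().reset_index()

print(trend_demo)
```

After reset_index(), the dataframe has one row per week, with Week and Calls as ordinary columns.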
We’ll now use this new object to create the trend line.
plt.plot() is the basic function to create a plot. The first argument is the variable that needs to be plotted, in our case, Calls. You may use the color= argument to give colour to both points and lines.
The next argument, marker='o', draws a point marker at each observation in addition to the connecting line.
Further attributes are added to the plot using xlabel, ylabel and title from pyplot.
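Here is a minimal, self-contained sketch of these plotting calls. The weekly totals are hypothetical, and the week numbers are passed explicitly as the first argument so that the x axis shows weeks rather than row positions, which is a small variation on the call described above.

```python
import matplotlib.pyplot as plt

# Hypothetical weekly totals for illustration only
weeks = [1, 2, 3, 4, 5, 6]
calls = [820, 910, 1005, 1100, 1040, 1090]

fig, ax = plt.subplots()
# marker="o" adds a dot at each week; color= applies to both dots and line
(line,) = ax.plot(weeks, calls, marker="o", color="tab:blue")
ax.set_xlabel("Week")
ax.set_ylabel("No. of Calls")
ax.set_title("Trend line sketch")
```

Call plt.show() to display the figure when running this outside a notebook.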
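# Importing Data
transaction = pd.read_csv("TelecomData_WeeklyData.csv")
# Merging and Formatting Data
# (reconstructed from the description above: total Calls per Week, stored as a dataframe)
trend = transaction.groupby('Week')['Calls'].sum().to_frame().reset_index()
# Trend Line
plt.plot(trend['Calls'], marker='o')
plt.xlabel('Week')
plt.ylabel('No. of Calls')
plt.title('TREND LINE')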
This trend line clearly shows that the number of calls increases continuously for the first four weeks. After that, it fluctuates up and down.
Motion Charts in Python
The visualizations we have seen up until this point were static. However, you can also create dynamic graphs in Python. Let’s see how to create a dynamic bubble chart which allows exploration of multivariate data.
This chart is called a motion chart and it allows you to plot dimension values against up to four metrics.
To plot a motion chart, we will use data on the sales and market penetration of a company in different regions over the years. The objective of plotting this chart is to visually observe the movement of sales and penetration numbers.
The data has 22 rows.
This is a snapshot of the data.
Year is a numeric variable from 2006 to 2016.
Region is a categorical variable showing two regions North and West.
Sales and Penetration are continuous variables. Sales are in Indian Rupees.
To create a motion chart, you have to execute the Python code in a Jupyter Notebook.
We first import the data as a pandas dataframe and store it as sales.
The library that we need to create this dynamic chart is called plotly-express. This library may not be available in the default Python distribution and hence needs to be installed first and then imported. For installation, use the pip installer in the Windows command shell or Mac terminal. Recent releases of the plotly package bundle Plotly Express, so pip install plotly also provides plotly.express.
Upon successful installation, we import the library as px. Note that the syntax for importing is plotly.express.
px.scatter() is used to create a motion chart. The first argument is the name of the dataset. x= and y= give the variables to be plotted on the x and y axes respectively.
animation_frame= is the time element to be plotted to show movement. In this case, it is the variable Year.
animation_group= takes the categorical variable that identifies the same object from one frame to the next. In this case, it is the variable Region.
size= and color= specify which variables determine the size and colour of the points.
The logical argument log_x=True ensures the x axis is log scaled in cartesian coordinates.
hover_name= adds labels to the hover tooltip.
size_max= sets the maximum mark size.
range_x= and range_y= override auto-scaling on both axes.
sales = pd.read_csv("Sales Data (Motion Chart).csv")
pip install plotly-express
import plotly.express as px
# Motion Chart
px.scatter(sales, x="Penetration", y="Sales",
           animation_frame="Year", animation_group="Region",
           size="Sales", color="Region", hover_name="Region",
           log_x=True, size_max=55, range_x=[700, 2000], range_y=[25, 90])
Once ready, the motion chart shows that both sales and penetration have increased over time, in parallel, for both regions.
Selecting the Right Chart to Use
Now that we have seen how to create and interpret various plots, let’s clarify how to select the right chart. In the case of univariate analysis, if the variable is categorical, then use bar or pie charts. For continuous single variables, use histograms or boxplots.
In the case of bivariate analysis, if both variables are categorical, use multiple or stacked bar charts. If one is continuous and the other is categorical, then use multiple histograms or boxplots. And if both are continuous then use scatterplots or trendlines. Finally, for multivariate analysis, if the variables are a mix of categorical and continuous, use a heatmap. If all are continuous use a scatter plot matrix or bubble chart.
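These selection rules can be written down as a small lookup function. This is only a sketch of the guidance above: the function name suggest_chart and the labels 'categorical' and 'continuous' are hypothetical choices, and multivariate data that is entirely categorical returns an empty list because the guide does not cover it.

```python
def suggest_chart(kinds):
    """Suggest chart types for variables described as 'categorical' or 'continuous'."""
    if not kinds or any(k not in ("categorical", "continuous") for k in kinds):
        raise ValueError("kinds must be a non-empty list of 'categorical'/'continuous'")
    n_cat = kinds.count("categorical")
    n_cont = kinds.count("continuous")

    if len(kinds) == 1:  # univariate
        return ["bar chart", "pie chart"] if n_cat else ["histogram", "boxplot"]
    if len(kinds) == 2:  # bivariate
        if n_cat == 2:
            return ["multiple bar chart", "stacked bar chart"]
        if n_cat == 1:
            return ["multiple histograms", "boxplots"]
        return ["scatter plot", "trend line"]
    # multivariate (three or more variables)
    if n_cat == 0:
        return ["scatter plot matrix", "bubble chart"]
    if n_cont == 0:
        return []  # all-categorical multivariate data is not covered by the guide
    return ["heat map"]


print(suggest_chart(["continuous", "categorical"]))  # ['multiple histograms', 'boxplots']
```

For example, the New York temperature data (Year and Month categorical, Temperature continuous) maps to a heat map.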
Here is a quick recap of everything covered in this tutorial.
The seaborn library has functions for creating scatter plots, scatter plot matrices, scatter plots with regression lines and heatmaps. Trend lines can be created using the basic plot function from the pyplot module of matplotlib.
Additional attributes such as axis labels and titles can be added to any plot using pyplot.
plotly-express has the function scatter() to create a dynamic motion chart.
This tutorial lesson is taken from the Postgraduate Diploma in Data Science.