Data visualization skills are a key part of data analytics and data science, and in this tutorial we’ll cover the most commonly used graphs in Python. We’ll start with a quick introduction to data visualization in Python and then look at Python functions for a range of bar charts and pie charts.
You can download the data files for this tutorial here.
What is Data Visualization in Python?
Data visualization is the visual representation of data in the form of graphs and plots. It is particularly useful because non-technical people often understand data and analysis presented visually much better than complicated numbers and tables.
Data visualization enables us to identify patterns or trends easily, and helps us to visualize data distribution, correlation and causality.
Principles of Data Visualization
Here are some important principles of data visualization that we should keep in mind when creating various charts and graphs.
Let’s consider a case study to explain the various charts and graphs.
A telecom service provider has demographic and transactional information about its customers. We want to visualize the data using usage variables and customer demographic information in order to generate business insights.
There are 1000 customers in our sample. For each customer, age, gender and pincode information is provided. In addition, the number of calls, the number of minutes spoken and the bill amount over a 6-month period are available for each customer.
Simple bar graphs are a very common type of graph used in data visualization and are used to represent one variable. They consist of vertical or horizontal bars of uniform width, with height (or length) proportional to the value of the variable for each group. They are one-dimensional diagrams. The space between any two bars in a simple bar graph must be uniform. The height or length of a bar can represent, for example, the frequency, mean, total or percentage for each category or group of a variable.
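As a minimal sketch of the measures a bar can represent, the snippet below computes the frequency, mean, total and percentage for each group. The seven-row data frame is invented for illustration; only the column names Age_Group and Calls follow the tutorial's data set.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only.
toy = pd.DataFrame({
    "Age_Group": ["18-30", "18-30", "18-30", "30-45", "30-45", ">45", ">45"],
    "Calls": [100, 200, 300, 150, 250, 180, 220],
})

# One row per group, one column per measure a bar could represent.
measures = toy.groupby("Age_Group")["Calls"].agg(
    frequency="count", mean="mean", total="sum"
)

# Each group's share of all calls, in percent.
measures["percentage"] = measures["total"] / measures["total"].sum() * 100
print(measures)
```

Any one of these columns could be passed to plot.bar; which one is appropriate depends on the question being asked of the data.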
Bar Charts in Python
We import pandas, matplotlib and seaborn libraries to construct a simple bar diagram.
The syntax is as follows:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
To construct a simple bar diagram of the total number of calls for each age group, we first need to aggregate our data using the groupby() function.
Importing the Libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
telecom = pd.read_csv("telecom.csv")
telecom1 = telecom.groupby('Age_Group')['Calls'].sum()
telecom1
The output shows the number of calls for each age group:
            Calls
Age_Group
18-30      943187
30-45      798721
>45        128870
Simple Bar Chart – Total calls for different age groups
We use the plt.figure() function to create a new figure, and the plot.bar function to plot the bar chart.
title is a string argument to give the plot a title.
The color argument specifies the plot colour. It accepts colour names, hex codes and other colour codes.
plt.xlabel specifies the x label. plt.ylabel specifies the y label.
plt.figure(); telecom1.plot.bar(title='Fig.No.1 : SIMPLE BAR CHART (Total Calls – Age Group)', color='darkorange'); plt.xlabel('Age Groups'); plt.ylabel('Total Calls')
The figure shows a simple bar diagram of the total number of calls in each age group.
Observing the diagram, we can say that the number of calls made by the 18-30 age group is somewhat higher than that of the 30-45 age group and much higher than that of the over-45 group.
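The color argument described above can be written in more than one form. The sketch below plots the totals from the output shown earlier twice, once with a named colour and once with a hex string; the hex value #ff8c00 is assumed here to be the equivalent of 'darkorange', and the code checks that assumption with matplotlib's to_rgba.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.colors import to_rgba

# Total calls per age group, copied from the output shown earlier.
calls = pd.Series(
    [943187, 798721, 128870],
    index=pd.Index(["18-30", "30-45", ">45"], name="Age_Group"),
    name="Calls",
)

# The same colour written as a named colour and as a hex string.
named, hex_code = "darkorange", "#ff8c00"

fig, (left, right) = plt.subplots(1, 2, figsize=(8, 3))
calls.plot.bar(ax=left, color=named, title="color='darkorange'")
calls.plot.bar(ax=right, color=hex_code, title="color='#ff8c00'")
fig.tight_layout()

# Check whether both specifications resolve to the same RGBA value.
same_colour = to_rgba(named) == to_rgba(hex_code)
print(same_colour)
```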
Simple Bar Chart – Mean calls for different age groups
To construct a simple bar diagram for the mean number of calls for our different age groups, the Python code remains the same; the only difference is that the “mean” function is used instead of “sum” when aggregating the data.
telecom2 = telecom.groupby('Age_Group')['Calls'].mean()
telecom2
                 Calls
Age_Group
18-30      1882.608782
30-45      1866.170561
>45        1815.070423
plt.figure(); telecom2.plot.bar(title='Fig.No.2 : SIMPLE BAR CHART (Mean Calls – Age Group)', color='darkorange'); plt.xlabel('Age Groups'); plt.ylabel('Mean Calls')
This graph gives the distribution of the mean number of calls across different age groups. By plotting the average number of calls, we can see that although there is quite a difference in total calls between each age group, the average number of calls across age groups is similar.
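The reason the totals differ while the means do not is that the age groups contain very different numbers of customers. As a rough check, the sketch below multiplies the mean calls shown above by the customer counts per age group (female plus male, taken from the count table in the stacked bar chart section) and compares the result with the totals. The variable names are illustrative only.

```python
import pandas as pd

idx = pd.Index(["18-30", "30-45", ">45"], name="Age_Group")

# Customers per age group: female + male counts from the stacked bar chart table.
customers = pd.Series([256 + 245, 221 + 207, 32 + 39], index=idx)
# Mean and total calls per age group, copied from the printed outputs.
mean_calls = pd.Series([1882.608782, 1866.170561, 1815.070423], index=idx)
total_calls = pd.Series([943187, 798721, 128870], index=idx)

# total = number of customers x mean calls per customer
reconstructed = (customers * mean_calls).round().astype(int)
print(pd.DataFrame({"customers": customers,
                    "mean_calls": mean_calls,
                    "total_calls": reconstructed}))
```

The 18-30 group has roughly seven times as many customers as the over-45 group, which accounts for almost all of the gap in total calls.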
Simple Bar Chart in Horizontal Orientation
Here we have replaced the plot.bar function with plot.barh to give a horizontal orientation to the bars. The bars show the total calls stored in telecom1, so the horizontal axis is labelled accordingly.
plt.figure(); telecom1.plot.barh(title='Fig.No. 3: SIMPLE BAR CHART - HORIZONTAL', color='darkorange'); plt.xlabel('Total Calls'); plt.ylabel('Age Group')
Stacked Bar Chart in Python
We use the pivot_table function to obtain a count of customers by age group and gender. The index option specifies the rows of the table and the columns option specifies the columns. We’ve used aggfunc='count' to obtain the count of customers based on values=['CustID']. As in the previous case, the plot.bar function is used, in this case with stacked=True.
telecom3 = pd.pivot_table(telecom, index=['Age_Group'], columns=['Gender'], values=['CustID'], aggfunc='count')
telecom3
          CustID
Gender         F    M
Age_Group
18-30        256  245
30-45        221  207
>45           32   39
plt.figure(); telecom3.plot.bar(title='Fig.No. 4 : STACKED BAR CHART', stacked=True); plt.xlabel('Age Group'); plt.ylabel('No.of Customers')
This graph divides the number of customers in each age group by gender.
The graph shows that although there are more young customers in the data, there is an almost equal number of males and females in each age group.
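A count table like the one above can also be built with pd.crosstab, which is a different pandas function from the pivot_table call used in this tutorial. The sketch below compares the two on a small invented data frame; only the column names follow the tutorial's data set.

```python
import pandas as pd

# Hypothetical toy data; only the column names follow the tutorial's data set.
toy = pd.DataFrame({
    "CustID": [1, 2, 3, 4, 5, 6, 7],
    "Age_Group": ["18-30", "18-30", "18-30", "30-45", "30-45", ">45", ">45"],
    "Gender": ["F", "M", "F", "M", "F", "F", "M"],
})

# Count customers with pivot_table, as in the tutorial ...
by_pivot = pd.pivot_table(toy, index="Age_Group", columns="Gender",
                          values="CustID", aggfunc="count")

# ... and with crosstab, which counts rows without needing a values column.
by_crosstab = pd.crosstab(toy["Age_Group"], toy["Gender"])
print(by_crosstab)
```

Either table can be passed to plot.bar with stacked=True. Note that the two functions can differ when a combination of age group and gender has no rows at all, so the comparison here only covers the case where every combination is present.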
Percentage Bar Chart in Python
Now let’s get a percentage bar chart in Python.
We first obtain a table of percentage values by applying the div() function to the pivot table obtained earlier.
Note that the object telecom3 is used to obtain the values, which are stored in object telecom4 as proportions between 0 and 1 and multiplied by 100 when plotting.
The code for the percentage subdivided bar plot is the same as for the previous subdivided bar plot. The only difference is that we plot percentage values instead of counts.
telecom4 = telecom3.div(telecom3.sum(axis=1).astype(float), axis=0)
telecom4
             CustID
Gender            F         M
Age_Group
18-30      0.510978  0.489022
30-45      0.516355  0.483645
>45        0.450704  0.549296
plt.figure();(telecom4*100).plot.bar(title='Fig.No. 5 : PERCENTAGE BAR CHART', stacked=True); plt.xlabel('Age Groups'); plt.ylabel('Customer %')
We can now see the percentage subdivided diagram for the gender-wise distribution of customers across the age groups.
We observe that the data contains an almost equal proportion of male and female customers across the three age groups. Because every bar has the same total height, a percentage stacked graph makes it easy to compare the gender-wise distribution of customers across age groups.
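The row-wise division can be checked by hand. The sketch below rebuilds the count table from the printed output (with plain F and M columns rather than the two-level header, which is a simplification), divides each row by its row total, and confirms that every row adds up to 100%.

```python
import pandas as pd

# Customer counts by age group and gender, copied from the printed table.
counts = pd.DataFrame(
    {"F": [256, 221, 32], "M": [245, 207, 39]},
    index=pd.Index(["18-30", "30-45", ">45"], name="Age_Group"),
)

# Divide each row by its own total, then scale to percent.
row_totals = counts.sum(axis=1)
percent = counts.div(row_totals, axis=0) * 100
print(percent.round(1))

# Every row of a percentage table should add up to 100.
print(percent.sum(axis=1))
```

The axis=0 argument tells div() to match the row totals to the table's row labels, so each row is divided by its own total rather than by a column total.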
Multiple Bar Charts in Python
Let’s now move to multiple bar diagrams.
We use pivot_table() to generate a cross table giving the total number of calls by age group and gender.
We then use the plot.bar function of the resulting data frame with the familiar title argument, together with the plt.xlabel and plt.ylabel functions, to construct a multiple bar diagram.
telecom5 = pd.pivot_table(telecom, index=['Age_Group'], columns=['Gender'], values=['Calls'], aggfunc='sum')
telecom5
            Calls
Gender          F       M
Age_Group
18-30      480235  462952
30-45      408184  390537
>45         58310   70560
plt.figure(); telecom5.plot.bar(title='Fig.No.6 : MULTIPLE BAR CHART (Total Calls - Gender & Age Group)'); plt.xlabel('Age Groups'); plt.ylabel('No. of Calls')
This is how our multiple bar diagram looks. There are two bars for each age group: one for females and the other for males. A multiple bar diagram can be used as an alternative to a stacked bar graph.
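To see the two representations next to each other, the sketch below draws the same table twice, once side by side and once stacked. The call totals are copied from the printed output, with plain F and M columns as a simplification; the panel titles and variable names are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Total calls by age group and gender, copied from the printed table.
calls = pd.DataFrame(
    {"F": [480235, 408184, 58310], "M": [462952, 390537, 70560]},
    index=pd.Index(["18-30", "30-45", ">45"], name="Age_Group"),
)

fig, (grouped_ax, stacked_ax) = plt.subplots(1, 2, figsize=(10, 4))
calls.plot.bar(ax=grouped_ax, title="Multiple (side by side)")
calls.plot.bar(ax=stacked_ax, stacked=True, title="Stacked")
grouped_ax.set_ylabel("No. of Calls")
fig.tight_layout()

# Highest point reached by any bar in each panel.
grouped_top = max(p.get_y() + p.get_height() for p in grouped_ax.patches)
stacked_top = max(p.get_y() + p.get_height() for p in stacked_ax.patches)
print(grouped_top, stacked_top)
```

The side-by-side version makes it easier to compare females and males within an age group, while the stacked version also shows the combined total per age group.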
Pie Charts in Python
Finally, let’s construct a pie chart in Python.
We use the groupby function with calls as the variable and age group as the factor to obtain total calls for each age group. Then we obtain the percentage for each age group using the div() function. Next, the plot.pie function is used to obtain a pie diagram with arguments such as:
label, which provides a user-defined label for the plotted variable
title, which gives the title of the plot; autopct, which is used to display percentage values; and colormap, which can be used to choose the colours
telecom6 = telecom.groupby('Age_Group')['Calls'].sum()
telecom6 = telecom6.div(telecom6.sum().astype(float)).round(2)*100
telecom6
Age_Group
18-30    50.0
30-45    43.0
>45       7.0
telecom6.plot.pie(label=('Age Groups'), title = "Fig.No. 7 : PIE CHART WITH PERCENTAGE",colormap='brg', autopct='%1.0f%%')
The figure displays a pie diagram. Observing the diagram, we can say that 50% of calls are made by the 18-30 age group, 43% by the 30-45 age group and only 7% by the over-45 age group.
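These percentages can be reproduced from the total calls printed earlier. The sketch below computes each group's share and also passes the raw totals straight to plot.pie, relying on matplotlib scaling the wedges to the whole and on autopct printing the shares. This is a variation on the tutorial's approach, which converts to percentages first.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Total calls per age group, copied from the output shown earlier.
total_calls = pd.Series(
    [943187, 798721, 128870],
    index=pd.Index(["18-30", "30-45", ">45"], name="Age_Group"),
    name="Calls",
)

# Share of all calls per age group, rounded to whole percent.
share = (total_calls / total_calls.sum() * 100).round(0)
print(share)

# Raw totals go in; autopct formats each wedge's share of the whole.
fig, ax = plt.subplots()
total_calls.plot.pie(ax=ax, autopct="%1.0f%%")
ax.set_ylabel("")  # hide the default axis label taken from the series name
```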
Pie Chart in Python – More than one
To plot multiple pie charts, the argument subplots=True should be included within the plot.pie function.
telecom7 = pd.pivot_table(telecom, index=['Age_Group'], columns=['Gender'], values=['CustID'], aggfunc='count')
telecom7
          CustID
Gender         F    M
Age_Group
18-30        256  245
30-45        221  207
>45           32   39
plt.figure(); telecom7.plot.pie(title='Fig.No. 8 : MULTIPLE PIE CHARTS', colors=['darkcyan','orange','yellowgreen'],autopct='%.1f%%', subplots=True)
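As a small self-contained sketch of what subplots=True does, the code below rebuilds the count table from the printed output (with plain F and M columns as a simplification) and draws one pie per column. The figsize and legend settings are illustrative choices, not part of the tutorial's code.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Customer counts by age group and gender, copied from the printed table.
counts = pd.DataFrame(
    {"F": [256, 221, 32], "M": [245, 207, 39]},
    index=pd.Index(["18-30", "30-45", ">45"], name="Age_Group"),
)

# subplots=True draws one pie per column and returns one Axes per pie.
axes = counts.plot.pie(subplots=True, autopct="%.1f%%",
                       figsize=(8, 4), legend=False)
print(len(axes))
```

Each pie shows how the customers of one gender are split across the three age groups, so the wedges in each pie add up to 100% separately.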
This tutorial lesson is taken from the Postgraduate Diploma in Data Science.