In our other Data Visualization tutorial we looked at commonly used and informative bar and pie charts and how to create them with Python functions. In this tutorial we’ll cover box plot examples and other graph types that we use in data analytics and data science.
You can download the data files for this tutorial here.
We’ll start with box whisker plots and histograms. We’ll then move on to density plots, stem and leaf diagrams, and finally Pareto charts.
Box Whisker Plots examples
Box whisker plots show the distribution of a variable using five summary measures: minimum, lower quartile, median (the middle quartile), upper quartile and maximum.
The box in the middle represents the middle 50% of the data. The lines (whiskers) extend from the box to the smallest and largest values that are not outliers. Outliers are plotted as individual points, so the minimum and maximum marked by the whiskers are the values that remain after excluding them. A boxplot is particularly effective when comparing two sets of data.
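The five summary measures and the whisker positions can be computed by hand. The sketch below is a minimal illustration, assuming the 1.5 × IQR rule that matplotlib applies by default (whis=1.5); the function name box_plot_stats and the sample values are hypothetical and are not taken from telecom.csv.

```python
import numpy as np

def box_plot_stats(values, whis=1.5):
    """Return the quantities a box plot draws for a 1-D sample."""
    data = np.sort(np.asarray(values, dtype=float))
    q1, median, q3 = np.percentile(data, [25, 50, 75])
    iqr = q3 - q1
    # Fences: points beyond these are drawn individually as outliers.
    lower_fence = q1 - whis * iqr
    upper_fence = q3 + whis * iqr
    inside = data[(data >= lower_fence) & (data <= upper_fence)]
    outliers = data[(data < lower_fence) | (data > upper_fence)]
    return {
        "whisker_low": float(inside.min()),   # minimum after excluding outliers
        "q1": float(q1),
        "median": float(median),
        "q3": float(q3),
        "whisker_high": float(inside.max()),  # maximum after excluding outliers
        "outliers": outliers.tolist(),
    }

# Hypothetical call counts, for illustration only.
sample_calls = [12, 15, 17, 18, 20, 21, 22, 24, 26, 60]
stats = box_plot_stats(sample_calls)
print(stats)
```

Here the value 60 lies above the upper fence, so it would be drawn as an outlier and the upper whisker would stop at 26.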
We’ll use the same case study as in our earlier tutorial on bar and pie charts in Python.
A telecom service provider has demographic and transactional information about their customers. We want to visualize the data using usage variables and customer demographic information to generate business insights.
There are 1000 customers in our data set. For each customer, age, gender and pincode are provided. In addition, the number of calls, number of minutes spoken and bill amount over a 6 month period are available for each customer.
We import the telecom.csv file into Python using the pd.read_csv function and then use the plot.box function to create our boxplot.
The plot.box function in pandas draws a box whisker plot for a column.
telecom.Calls specifies the column for which the box plot needs to be plotted.
The label argument provides a user defined label for the variable on the X axis and plt.ylabel provides a user defined label for the variable on the Y axis.
import pandas as pd
telecom = pd.read_csv("telecom.csv")
BoxPlot – Total Calls
import matplotlib.pyplot as plt
telecom.Calls.plot.box(label='No. Of Calls')
plt.title('Fig.No. 8 : BOX PLOT (Total Calls)')
plt.ylabel('Total Calls')
Here is the output from the Python code for our box whisker plot.
While we see a few outliers, the overall distribution of the number of calls is symmetric.
Now let us obtain the box whisker plot for the “Calls” variable, but separate the plot for each age group.
Note that there are three age groups.
boxplot() in pandas is an alternative to using plot.box() and can draw a separate box plot for each group.
column specifies the variable for which the box plot needs to be plotted.
The by argument specifies that box plots are plotted separately for each age group.
grid=False removes the background grid seen in each plot.
patch_artist=True gives coloured boxes.
telecom.boxplot(column='Calls', by='Age_Group', grid=False, patch_artist=True)
plt.title('Fig.No. 9 : BOXPLOT – Total Calls by Age Group')
# Remove the automatic "Boxplot grouped by Age_Group" heading
plt.suptitle('')
plt.ylabel('Total Calls')
After executing the previous Python code we get three box plots.
Here we can observe that the spread of total calls is higher in the 18-30 age group and the number of outliers is higher in the 30-45 age group. However, symmetry is observed in all age groups.
To construct a histogram, the first step is to “bin” (or “bucket”) the range of values – that is, divide the entire range of values into a series of intervals – and then count how many values fall into each interval. The bins are usually specified as consecutive, non-overlapping intervals of a variable.
Histograms are recommended for continuous variables and are generally used to check the normality of data.
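The binning step described above can be reproduced directly with NumPy. This is a minimal sketch, assuming equal-width bins over the observed range; the values in avg_time are hypothetical and are not taken from telecom.csv.

```python
import numpy as np

# Hypothetical average call times in minutes, for illustration only.
avg_time = np.array([2.1, 3.4, 3.9, 4.2, 4.8, 5.0, 5.5, 6.1, 6.3, 7.7, 8.0, 9.6])

# Ask for 4 equal-width, consecutive, non-overlapping bins.
counts, edges = np.histogram(avg_time, bins=4)

# Each bar of a histogram is one (interval, count) pair.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:5.2f} to {right:5.2f}: {count} customers")
```

Every value falls into exactly one bin, so the counts always add up to the number of observations.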
Let’s obtain a histogram for the variable AvgTime, which is available in the telecom data object, using the telecom.AvgTime.hist function.
The hist function in pandas yields a histogram.
The bins argument specifies the number of bins (bars); a larger number gives narrower bars.
plt.xlabel provides a user defined label for the variable on the X axis.
plt.ylabel provides a user defined label for the variable on the Y axis.
color can be used to input your choice of color for the bars.
telecom.AvgTime.hist(bins=12, grid=False, color='darkorange')
plt.title('Fig.No. 10 : HISTOGRAM – Average Call Time')
plt.xlabel('Average Call Time')
plt.ylabel('No. of Customers')
This plot shows that the distribution of average call time is quite symmetric. Note that the height of each bar is the count of customers in each bin of the average time variable.
Density plots are similar to histograms. Instead of counts, they show an estimated probability density as a smooth curve. We generally use them to check the normality of data when there is a large number of data points.
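To make the idea of a density concrete, the sketch below builds a Gaussian kernel density estimate by hand with NumPy. It is an illustration only: the function name gaussian_kde_1d, the simulated amounts and the rule-of-thumb bandwidth (Scott's rule) are assumptions, and library routines may differ in their details.

```python
import numpy as np

def gaussian_kde_1d(sample, grid, bandwidth):
    """Average of Gaussian bumps centred on each observation."""
    sample = np.asarray(sample, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # z[i, j] = standardised distance from grid point i to observation j
    z = (grid[:, None] - sample[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

# Hypothetical bill amounts, simulated for illustration only.
rng = np.random.default_rng(0)
amounts = rng.normal(loc=500, scale=80, size=200)

# Scott's rule of thumb for the bandwidth (an assumption; other choices exist).
bandwidth = amounts.std(ddof=1) * len(amounts) ** (-1 / 5)

grid = np.linspace(amounts.min() - 4 * bandwidth, amounts.max() + 4 * bandwidth, 400)
density = gaussian_kde_1d(amounts, grid, bandwidth)

# Trapezoidal rule: the area under a density curve should be close to 1.
area = np.sum((density[:-1] + density[1:]) / 2 * np.diff(grid))
print(round(float(area), 3))
```

The heights of the curve are densities rather than probabilities; it is the area under the curve over an interval that gives the probability of falling in that interval.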
Let’s create a density plot of the “Amt” variable available in our “telecom” data object.
plot.kde computes a kernel density estimate of the variable and plots it as a line graph.
plt.title provides the user defined name of the chart. It must be passed as a string, in single or double quotes.
plt.xlabel provides a user defined label for the variable on the X axis.
telecom['Amt'].plot.kde()
plt.title('Fig.No. 11 : DENSITY PLOT (Amount)')
plt.xlabel('Amount')
Here is the density plot displayed after executing the Python code. It shows that the distribution of amount is slightly positively skewed.
Stem and Leaf Diagrams
A stem and leaf diagram is another alternative to a histogram. Here, each numeric value is split into a stem (first digit(s)) and a leaf (last digit). Stem and leaf diagrams show the shape of a distribution (like bar charts), but have the advantage of not losing the detail of the original data.
One can easily locate the median using this plot.
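A text version of such a diagram can be built in plain Python. The sketch below is a minimal illustration, assuming non-negative whole numbers with the last digit as the leaf; the function name stem_and_leaf and the sample values are hypothetical and do not come from telecom.csv.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the rows of a text stem and leaf display for non-negative integers."""
    rows = defaultdict(list)
    for value in sorted(values):
        # Stem = all digits except the last, leaf = the last digit.
        stem, leaf = divmod(int(value), 10)
        rows[stem].append(leaf)
    lines = []
    # Include empty stems so that gaps in the data stay visible.
    for stem in range(min(rows), max(rows) + 1):
        leaves = "".join(str(leaf) for leaf in rows.get(stem, []))
        lines.append(f"{stem:>3} | {leaves}")
    return lines

# Hypothetical call counts, for illustration only.
sample_calls = [12, 15, 21, 24, 24, 27, 30, 33, 38, 45, 61]
lines = stem_and_leaf(sample_calls)
print("\n".join(lines))
```

Because the leaves are written in sorted order, the median can be found by counting to the middle leaf: with 11 values it is the 6th leaf, which is 27 here.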
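The Python syntax is very simple. Note, however, that the stem function in matplotlib yields a stem plot, which draws a vertical line from the baseline to each observation, rather than a classical stem and leaf display of digits.
telecom.Calls specifies the variable for which the stem plot needs to be plotted.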
plt.stem(telecom.Calls)
plt.title('Fig.No. 12 : STEM GRAPH – Total Calls')
plt.xlabel('CustID')
plt.ylabel('Total Calls')
This is how the stem plot looks in Python. The calls variable is distributed symmetrically and a few outliers exist in the data.
The Pareto chart, named after Vilfredo Pareto, is a type of chart that contains both a bar and a line graph, where individual values are represented in descending order by bars. In this way the chart visually depicts which categories are more significant. The cumulative total is represented by the line.
Let’s obtain a Pareto chart for the Calls variable with Age_Group as a factor.
The groupby function is used to split the data into groups based on the variable Age_Group.
The to_frame function is used to convert the given Series object to a DataFrame.
The plt.subplots method provides a way to plot multiple plots on a single figure. Given the number of rows and columns, it returns a tuple (fig, ax), giving a single figure fig with an array of axes ax. Called without arguments, as here, it returns a single axes object.
telecom1.index is the argument that allows the bars to be named according to the row names (the age groups).
telecom1["Calls"] specifies the variable for which the Pareto chart needs to be plotted.
The ax.twinx function creates a twin axes, ax2, sharing the X axis.
set_major_formatter(PercentFormatter()) sets the percentage format on the Y axis for our chart.
ax.tick_params controls the appearance of the axis ticks. Its axis argument takes a string, which has to be put in quotes.
colors can be used to input your choice of color for the ticks and their labels.
ax.set_xlabel and ax.set_ylabel provide user defined labels for the X and Y axes.
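# Total calls per age group, largest first (a Pareto chart needs descending order)
telecom1 = telecom.groupby('Age_Group')['Calls'].sum().sort_values(ascending=False)
# Display the totals (in a notebook)
telecom1
telecom1 = telecom1.to_frame()
# Running share of all calls, in percent
telecom1["cumpercentage"] = telecom1["Calls"].cumsum() / telecom1["Calls"].sum() * 100
fig, ax = plt.subplots()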
Here is the output of the Python code for the Pareto chart. We can see that 50% of the total calls come from the 18-30 age group. Another 42% of calls are made by the 30-45 age group, and only 8% of calls are made by customers older than 45.
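The steps listed above (grouped totals, cumulative percentage, twin axes, percent formatting and tick colours) can be put together as in the sketch below. It is one possible arrangement rather than the exact code behind the figure: the call totals are hypothetical numbers chosen for illustration, and the colours and marker are arbitrary choices.

```python
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter

# Hypothetical call totals per age group, not taken from telecom.csv.
calls = pd.Series({"18-30": 5000, "30-45": 4200, ">45": 800}, name="Calls")

# A Pareto chart needs the categories in descending order.
pareto = calls.sort_values(ascending=False).to_frame()
pareto["cumpercentage"] = pareto["Calls"].cumsum() / pareto["Calls"].sum() * 100
labels = pareto.index.tolist()

fig, ax = plt.subplots()
ax.bar(labels, pareto["Calls"], color="C0")
ax.set_xlabel("Age Group")
ax.set_ylabel("Total Calls")

# Second Y axis, sharing the X axis, for the cumulative percentage line.
ax2 = ax.twinx()
ax2.plot(labels, pareto["cumpercentage"], color="C1", marker="D")
ax2.yaxis.set_major_formatter(PercentFormatter())
ax2.set_ylim(0, 110)

# Colour each Y axis to match the series it belongs to.
ax.tick_params(axis="y", colors="C0")
ax2.tick_params(axis="y", colors="C1")
ax.set_title("Pareto chart (illustrative data)")
```

Calling plt.show() afterwards displays the figure. The bars use the left axis (total calls) and the line uses the right axis (cumulative percentage), which is why two axes sharing one X axis are needed.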
Choosing the Right Type of Chart
It is very important to understand what type of graph is appropriate for a given type of variable. To represent a discrete variable, a bar graph is appropriate. For continuous variables, we can use histograms, boxplots or density plots. For categorical variables, bar graphs, pie charts or Pareto charts are suitable. For a dichotomous variable, we can use multiple bar charts or stacked bar charts.
This tutorial lesson is taken from the Postgraduate Diploma in Data Science.