Box Plot Examples and Advanced Charts

In our other Data Visualization tutorial we looked at commonly used and informative bar and pie charts and how to create them with Python functions. In this tutorial we’ll cover box plot examples and other graph types that we use in data analytics and data science.

You can download the data files for this tutorial here.

We’ll start with box whisker plots and histograms. We’ll then move on to density plots, stem and leaf diagrams, and finally Pareto charts.

Box Whisker Plots examples

Box whisker plots show the distribution of a variable using five summary measures: minimum, lower quartile, median (the middle quartile), upper quartile and maximum.

Box whisker plot example

The box in the middle represents the middle 50% of the data. The lines (whiskers) extend from the box to the smallest and largest values that are not outliers. Outliers are plotted as individual points, so the minimum and maximum shown by the whiskers exclude them. A box plot is particularly effective when comparing two or more sets of data.
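To make the five summary measures and the outlier rule concrete, here is a minimal sketch that computes them by hand. The sample values are hypothetical, and the 1.5 × IQR fence is Matplotlib's default whisker rule, not something specific to our case study.

```python
import pandas as pd

# Hypothetical sample values; 40 is deliberately far from the rest.
calls = pd.Series([2, 4, 4, 5, 6, 7, 8, 9, 10, 40])

# Quartiles (pandas interpolates linearly between observations by default).
q1, median, q3 = calls.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Default rule: points more than 1.5 * IQR beyond the box count as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = calls[(calls < lower_fence) | (calls > upper_fence)]

# The whiskers end at the most extreme values that are not outliers.
inliers = calls[(calls >= lower_fence) & (calls <= upper_fence)]
whisker_low, whisker_high = inliers.min(), inliers.max()

print("quartiles:", q1, median, q3)
print("whiskers :", whisker_low, whisker_high)
print("outliers :", outliers.tolist())
```

In this sample the value 40 lies above the upper fence, so the upper whisker stops at 10 and 40 is drawn as a separate point.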

Case Study

We’ll use the same case study as in our earlier tutorial on bar and pie charts in Python.

A telecom service provider has demographic and transactional information about its customers. We want to visualize the data using usage variables and customer demographic information to generate business insights.

Case study information

There are 1000 customers in our data set. For each customer, age, gender and pincode are provided. In addition, the number of calls, the number of minutes spoken and the bill amount over a 6-month period are available for each customer.

Case study data snapshot

We import the telecom.csv file into Python using the pd.read_csv function and then use the plot.box function to create our box plot.

The plot.box function in pandas draws a box plot of a column

telecom.Calls specifies the column for which the box plot needs to be plotted

The label argument provides a user defined label for the variable on the X axis and plt.ylabel provides a user defined label for the variable on the Y axis

Importing Data

import pandas as pd
telecom = pd.read_csv("telecom.csv")

BoxPlot – Total Calls

import matplotlib.pyplot as plt
telecom.Calls.plot.box(label='No. Of Calls')  # label names the box on the X axis
plt.title('Fig.No. 8 : BOX PLOT (Total Calls)')
plt.ylabel('Total Calls')

Here is the output from the Python code for our box whisker plot.

While we see a few outliers, the distribution of the number of calls is symmetric overall.

Box whisker plot in Python

Now let us obtain the box whisker plot for the “Calls” variable, but separate the plot for each age group.

Note that there are three age groups.

The boxplot() function in pandas is an alternative to using plot.box() and can draw a separate box for each group

column specifies the variable for which the box plot needs to be plotted

The by argument specifies that box plots are plotted separately for each age group

grid=False removes the background grid seen in each plot

patch_artist=True gives coloured boxes

telecom.boxplot(column='Calls', by='Age_Group', grid=False, patch_artist=True)
plt.title('Fig.No. 9 : BOXPLOT – Total Calls by Age Group')
plt.suptitle('')  # remove the automatic 'Boxplot grouped by ...' heading
plt.ylabel('Total Calls')

After executing the previous Python code we get three box plots.

Here we can observe that the spread of total calls is higher in the 18-30 age group and the number of outliers is higher in the 30-45 age group. However, symmetry is observed in all age groups.
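Observations about spread and outliers read off a plot can be cross-checked numerically. The sketch below uses a small hypothetical DataFrame (not the telecom data) with the same Age_Group and Calls column names. The helper box_stats is our own illustration and is not part of pandas.

```python
import pandas as pd

# Hypothetical stand-in for telecom.csv with the two columns used above.
sample = pd.DataFrame({
    "Age_Group": ["18-30"] * 5 + ["30-45"] * 5,
    "Calls": [10, 20, 30, 40, 50, 28, 29, 30, 31, 60],
})

def box_stats(calls: pd.Series) -> pd.Series:
    """Return the IQR and the number of points outside the 1.5 * IQR fences."""
    q1, q3 = calls.quantile([0.25, 0.75])
    iqr = q3 - q1
    outside = (calls < q1 - 1.5 * iqr) | (calls > q3 + 1.5 * iqr)
    return pd.Series({"iqr": iqr, "n_outliers": int(outside.sum())})

# One row of statistics per age group.
rows = {group: box_stats(calls)
        for group, calls in sample.groupby("Age_Group")["Calls"]}
summary = pd.DataFrame(rows).T
print(summary)
```

A larger iqr corresponds to a taller box, and n_outliers counts the points drawn beyond the whiskers for that group.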

box plots python

Histograms in Python

To construct a histogram, the first step is to “bin” (or “bucket”) the range of values – that is, divide the entire range of values into a series of intervals – and then count how many values fall into each interval. The bins are usually specified as consecutive, non-overlapping intervals of a variable.
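The binning step can be sketched directly with NumPy's histogram function, which returns the count per bin and the bin edges. The values below are hypothetical and chosen so the edges come out as round numbers.

```python
import numpy as np

# Hypothetical average call times (minutes).
avg_time = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 5.0, 6.0, 9.0])

# Split the range [1, 9] into 4 equal-width, non-overlapping bins.
counts, edges = np.histogram(avg_time, bins=4)

# Each bin includes its left edge; only the last bin also includes its right edge.
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.1f} to {right:.1f}: {n} value(s)")
```

Every value falls into exactly one bin, so the counts add up to the number of observations.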

Histograms are recommended for continuous variables and are generally used to check the normality of data.

Let’s obtain a histogram for a variable AvgTime, which is available in the telecom data object using the telecom.AvgTime.hist function.

The hist function in pandas yields a histogram

The bins argument specifies the number of equal-width bins, which in turn determines the width of each bar

plt.xlabel provides a user defined label for the variable on the X axis

plt.ylabel provides a user defined label for the variable on the Y axis

color can be used to input your choice of color for the bars

telecom.AvgTime.hist(bins=12, grid=False, color='darkorange')
plt.title('Fig.No. 10 : HISTOGRAM – Average Call Time')
plt.xlabel('Average Call Time')
plt.ylabel('No. of Customers')

This plot shows that the distribution of average call time is quite symmetric. Note that the height of each bar is the count of customers in each bin of the average time variable.

Histogram in Python

Density Plots

Density plots are similar to histograms, but they show an estimated probability density rather than counts. We generally use them to check the normality of data when there is a large number of data points.

Let’s create a density plot of the “Amt” variable available in our “telecom” data object.

plot.kde is used to plot the density values

The kde function estimates the density of the variable (pandas relies on SciPy for this, so SciPy must be installed)

The plot accessor draws the estimated density as a line graph

plt.title provides the user defined name of the chart. It must be put in quotes (single or double)

plt.xlabel provides a user defined label for the variable on the X axis

telecom['Amt'].plot.kde()
plt.title('Fig.No. 11 : DENSITY PLOT (Amount)')
plt.xlabel('Amount')

Here is the density plot displayed after executing the Python code. It shows that the distribution of amount is slightly positively skewed.
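To show what a density estimate is doing, here is a hand-rolled Gaussian kernel density sketch in NumPy, together with a simple skewness calculation. It is an illustration only: the amounts are hypothetical, the function name gaussian_kde_sketch is our own, and the bandwidth uses Scott's rule of thumb as an assumption. It is not the implementation that pandas calls.

```python
import numpy as np

# Hypothetical bill amounts with a long right tail.
amt = np.array([100.0, 120.0, 130.0, 150.0, 160.0, 180.0, 200.0, 260.0, 400.0])

def gaussian_kde_sketch(sample, grid, bandwidth):
    """Average of Gaussian bumps centred on each observation."""
    z = (grid[:, None] - sample[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.sum(axis=1) / (len(sample) * bandwidth)

# Scott's rule of thumb for the bandwidth (an assumption of this sketch).
bandwidth = amt.std(ddof=1) * len(amt) ** (-1 / 5)

# Evaluate the density on a grid that extends past the data on both sides.
grid = np.linspace(amt.min() - 3 * bandwidth, amt.max() + 3 * bandwidth, 500)
density = gaussian_kde_sketch(amt, grid, bandwidth)

# A density curve should enclose an area close to 1.
area = np.sum(density) * (grid[1] - grid[0])

# Sample skewness: a positive value means a longer right tail.
skewness = np.mean((amt - amt.mean()) ** 3) / amt.std() ** 3

print(f"area under curve: {area:.3f}")
print(f"skewness: {skewness:.2f}")
```

A positive skewness value is the numerical counterpart of the longer right tail we read off a positively skewed density plot.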

Density plot in Python

Stem and Leaf Diagrams

A stem and leaf diagram is another alternative to a histogram. Here, each numeric value is split into a stem (first digit(s)) and a leaf (last digit). Stem and leaf diagrams show the shape of a distribution (like bar charts), but have the advantage of not losing the detail of the original data.

One can easily locate the median using this plot.

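Matplotlib does not draw a textual stem and leaf diagram, but its stem function gives a closely related view: a stem plot, with one vertical line per observation. The syntax is very simple.

The stem function in matplotlib yields a stem chart

telecom.Calls specifies the variable for which the stem plot needs to be plotted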

plt.stem(telecom.Calls)
plt.title('Fig.No. 12 : STEM GRAPH – Total Calls')
plt.xlabel('CustID')
plt.ylabel('Total Calls')

This is how the stem plot looks in Python. Each vertical stem shows the total calls of one customer. The Calls variable is distributed symmetrically and a few outliers exist in the data.
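Note that plt.stem draws one vertical line per observation rather than splitting values into stems and leaves. A textual stem and leaf display, as described at the start of this section, can be sketched in plain Python. The function stem_and_leaf and the sample values are hypothetical, and the sketch assumes non-negative integers.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group non-negative integers by stem (all but the last digit) and leaf (last digit)."""
    groups = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        groups[stem].append(leaf)
    # Include empty stems so the shape of the distribution is not distorted.
    return {s: groups.get(s, []) for s in range(min(groups), max(groups) + 1)}

# Hypothetical call counts.
calls = [12, 15, 21, 23, 23, 28, 30, 34, 47]

display = stem_and_leaf(calls)
for stem, leaves in display.items():
    print(f"{stem:>2} | {' '.join(str(leaf) for leaf in leaves)}")
```

Because the leaves are sorted, the median can be found by counting: with nine values it is the fifth leaf, which is 23 here.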

Stem and leaf diagram in Python

Pareto Charts

The Pareto chart, named after Vilfredo Pareto, is a type of chart that contains both a bar and a line graph, where individual values are represented in descending order by bars. In this way the chart visually depicts which categories are more significant. The cumulative total is represented by the line.

Let’s obtain a Pareto chart for the Calls variable with Age_Group as a factor.

The groupby function is used to split the data into groups based on the variable Age_Group

The to_frame function is used to convert the given Series object to a DataFrame

The plt.subplots method provides a way to plot multiple plots on a single figure. Given the number of rows and columns, it returns a tuple (fig, ax), giving a single figure fig with an array of axes ax. Called with no arguments, as here, it returns a single Axes object

telecom1.index is the argument that names the bars according to the row names (the age groups)

telecom1["Calls"] specifies the variable for which the Pareto chart needs to be plotted

The ax.twinx function creates a twin Axes sharing the X axis

set_major_formatter(PercentFormatter()) sets the percentage format on the Y axis of the twin Axes

ax.tick_params controls the appearance of the axis ticks. Its axis argument ("x" or "y") has to be put in quotes

colors can be used in tick_params to input your choice of color for the ticks and tick labels

ax.set_xlabel and ax.set_ylabel provide user defined labels for the X and Y axes

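# Total calls per age group, largest first (a Pareto chart needs descending order)
telecom1 = telecom.groupby('Age_Group')['Calls'].sum().sort_values(ascending=False)
telecom1
telecom1 = telecom1.to_frame()
# Cumulative share of calls, in percent
telecom1["cumpercentage"] = telecom1["Calls"].cumsum() / telecom1["Calls"].sum() * 100
fig, ax = plt.subplots()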
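The remaining plotting calls described above can be put together roughly as follows. This is a self-contained sketch under stated assumptions: the totals per age group are hypothetical numbers standing in for the grouped telecom data, and the colors, marker and title are our own choices.

```python
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.ticker import PercentFormatter

# Hypothetical totals per age group, standing in for the grouped telecom data.
telecom1 = pd.Series({"18-30": 500, "30-45": 420, ">45": 80}, name="Calls").to_frame()
telecom1 = telecom1.sort_values("Calls", ascending=False)
telecom1["cumpercentage"] = telecom1["Calls"].cumsum() / telecom1["Calls"].sum() * 100

fig, ax = plt.subplots()

# Bars: one per category, in descending order.
ax.bar(telecom1.index, telecom1["Calls"], color="C0")
ax.set_xlabel("Age Group")
ax.set_ylabel("Total Calls")

# Line: cumulative percentage on a second Y axis that shares the X axis.
ax2 = ax.twinx()
ax2.plot(telecom1.index, telecom1["cumpercentage"], color="C1", marker="D")
ax2.yaxis.set_major_formatter(PercentFormatter())
ax2.set_ylim(0, 105)

# Color each Y axis to match the series it belongs to.
ax.tick_params(axis="y", colors="C0")
ax2.tick_params(axis="y", colors="C1")
ax.set_title("Pareto chart of total calls by age group")
# Call plt.show() here when running as a script.
```

The bars use the left Y axis (counts) and the line uses the right Y axis (cumulative percent), which is what twinx makes possible.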

Here is the output of the Python code for the Pareto chart. We can see that 50% of the total calls come from the 18-30 age group. Another 42% of calls are made by the 30-45 age group, and only 8% of calls are made by customers older than 45.

Pareto chart in Python

Choosing the Right Type of Chart

It is very important to understand what type of graph is appropriate for a given type of variable. To represent a discrete variable, a bar graph is appropriate. For continuous variables, we can use histograms, box plots or density plots. For categorical variables, bar graphs, pie charts or Pareto charts are suitable. For a dichotomous variable, we can use multiple bar charts or stacked bar charts.

Choosing the right type of chart

This tutorial lesson is taken from the Postgraduate Diploma in Data Science.