In our other Data Visualization tutorial we looked at commonly used and informative bar and pie charts and how to create them with Python functions. In this tutorial we’ll cover box plot examples and other graph types that we use in data analytics and data science.

You can download the data files for this tutorial here.

We’ll start with box whisker plots and histograms. We’ll then move on to density plots, stem and leaf diagrams, and finally Pareto charts.

## Box Whisker Plot Examples

Box whisker plots show the distribution of a variable using five summary measures: the minimum, the lower quartile, the median (the middle quartile), the upper quartile and the maximum.

The box in the middle represents the middle 50% of the data. The lines (whiskers) extend from the box to the smallest and largest values that are not outliers. The plot also marks outliers as individual points, so, importantly, the minimum and maximum shown by the whiskers are the values that remain after excluding outliers. A box plot is particularly effective when comparing two sets of data.
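To make the five summary measures concrete, here is a minimal sketch that computes them with Python's standard `statistics` module. The sample values, the helper name `box_summary` and the 1.5 × IQR rule for flagging outliers (the default convention in matplotlib box plots) are illustrative assumptions, not part of the telecom data.

```python
import statistics

def box_summary(values, whis=1.5):
    """Return the five box plot measures plus any outliers.

    Quartiles use linear interpolation (method="inclusive"). Points further
    than whis * IQR from the box are treated as outliers (assumed convention).
    """
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - whis * iqr, q3 + whis * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "min": inside[0],      # whisker end: smallest non-outlier
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": inside[-1],     # whisker end: largest non-outlier
        "outliers": outliers,
    }

calls = [12, 15, 18, 20, 21, 22, 24, 25, 27, 30, 60]  # hypothetical values
summary = box_summary(calls)
print(summary)
```

With these values the whiskers would end at 12 and 30, and 60 would be drawn as a separate outlier point.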

### Case Study

We’ll use the same case study as in our earlier tutorial on bar and pie charts in Python.

A telecom service provider has demographic and transactional information about their customers. We want to visualize the data using usage variables and customer demographic information to generate business insights.

There are 1000 customers in our data set. For each customer, age, gender and pincode are provided. In addition, the number of calls, number of minutes spoken and bill amount over a 6 month period are available for each customer.

We import the **telecom.csv** file into Python using the **pd.read_csv** function and use the **plot.box** function to create our box plot.

The **plot.box** function in pandas draws a box plot for a Series or DataFrame.

**telecom.Calls** specifies the column for which the box plot is drawn.

The **label** argument provides a user defined label for the variable on the X axis and **plt.ylabel** provides a user defined label for the variable on the Y axis.

### Importing Data

```
import pandas as pd
telecom = pd.read_csv("telecom.csv")
```

### BoxPlot – Total Calls

```
import matplotlib.pyplot as plt
telecom.Calls.plot.box(label='No. Of Calls')
plt.title('Fig.No. 8 : BOX PLOT (Total Calls)')
plt.ylabel('Total Calls')
```

Here is the output from the Python code for our box whisker plot.

While we see a few outliers, the distribution of the number of calls is symmetric overall.

Now let us obtain the box whisker plot for the “Calls” variable, but separate the plot for each age group.

Note that there are three age groups.

**boxplot()** in pandas is a DataFrame method that can draw grouped box plots and is an alternative to using **plot.box()**.

**column** specifies the variable for which the box plot is drawn.

The **by** argument specifies that box plots are drawn separately for each age group.

**grid=False** removes the background grid seen in each plot.

**patch_artist=True** gives coloured boxes.

`telecom.boxplot(column='Calls', by='Age_Group', grid=False, patch_artist=True);plt.title('Fig.No. 9 : BOXPLOT – Total Calls by Age Group');plt.suptitle('');plt.ylabel('Total Calls')`

After executing the previous Python code we get three box plots.

Here we can observe that the spread of total calls is higher in the 18-30 age group and the number of outliers is higher in the 30-45 age group. However, symmetry is observed in all age groups.
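A visual comparison of spread and outliers like this can also be checked numerically. The sketch below is one possible way to do it with pandas. The small data frame is made up and only mimics the `Age_Group` and `Calls` columns, and the helper `iqr_summary` and the 1.5 × IQR outlier rule are assumptions for illustration.

```python
import pandas as pd

# Made-up stand-in for the telecom data (two age groups, five customers each)
telecom = pd.DataFrame({
    "Age_Group": ["18-30"] * 5 + ["30-45"] * 5,
    "Calls": [100, 200, 300, 400, 500, 240, 250, 260, 270, 600],
})

def iqr_summary(calls):
    """Median, interquartile range and outlier count for one group."""
    q1, q3 = calls.quantile(0.25), calls.quantile(0.75)
    iqr = q3 - q1
    is_outlier = (calls < q1 - 1.5 * iqr) | (calls > q3 + 1.5 * iqr)
    return pd.Series({
        "median": calls.median(),
        "iqr": iqr,
        "n_outliers": int(is_outlier.sum()),
    })

# One row per age group, one column per measure
summary = pd.DataFrame({
    group: iqr_summary(calls)
    for group, calls in telecom.groupby("Age_Group")["Calls"]
}).T
print(summary)
```

In this made-up sample the 18-30 group has the wider box (larger IQR), while the 30-45 group has the only outlier.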

## Histograms in Python

To construct a histogram, the first step is to “bin” (or “bucket”) the range of values – that is, divide the entire range of values into a series of intervals – and then count how many values fall into each interval. The bins are usually specified as consecutive, non-overlapping intervals of a variable.
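The binning step can be sketched in a few lines of plain Python. The function name `bin_counts` and the sample values are hypothetical. The sketch assumes equal-width bins and at least two distinct values, and follows the common convention that every bin is half-open except the last one, which also includes the maximum.

```python
def bin_counts(values, bins):
    """Count values in `bins` equal-width, consecutive, non-overlapping intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins          # assumes hi > lo
    edges = [lo + i * width for i in range(bins + 1)]
    counts = [0] * bins
    for v in values:
        # Each bin is [left, right) except the last, which also holds the maximum.
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    return edges, counts

avg_time = [0, 1, 2, 2, 3, 5, 6, 8, 9, 10]  # hypothetical values
edges, counts = bin_counts(avg_time, bins=5)
print(edges)
print(counts)
```

The counts are the bar heights of the histogram, and they always add up to the number of observations.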

Histograms are recommended for continuous variables and are generally used to check the normality of data.

Let’s obtain a histogram for the variable AvgTime, which is available in the telecom data object, using the telecom.AvgTime.**hist** function.

The **hist** function in pandas draws a histogram.

The **bins** argument specifies the number of bins (bars) into which the range of values is divided,

**plt.xlabel** provides a user defined label for the variable on the X axis,

**plt.ylabel** provides a user defined label for the variable on the Y axis.

**color** can be used to set your choice of color for the bars.

```
telecom.AvgTime.hist(bins=12,grid=False, color = 'darkorange');
plt.title('Fig.No. 10 : HISTOGRAM – Average Call Time');
plt.xlabel('Average Call Time');plt.ylabel('No. of Customers')
```

This plot shows that the distribution of average call time is quite symmetric. Note that the height of each bar is the count of customers in each bin of the average time variable.

### Density Plots

**Density plots** are similar to histograms. They show a smoothed estimate of the probability density of a variable. We generally use them to check the normality of data when there is a large number of data points.

Let’s create a density plot of the “Amt” variable available in our “telecom” data object.
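To show what a density curve is, here is a hand-rolled Gaussian kernel density estimate in plain Python. It is only a sketch of the idea, not the code pandas runs. The sample values, the helper name `make_kde` and the bandwidth choice (Scott's rule) are assumptions.

```python
import math
import statistics

def make_kde(sample):
    """Return a density function built from Gaussian bumps centred on each value."""
    n = len(sample)
    # Scott's rule for the bandwidth (an assumed, commonly used default)
    bandwidth = statistics.stdev(sample) * n ** (-1 / 5)

    def density(x):
        total = sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample)
        return total / (n * bandwidth * math.sqrt(2 * math.pi))

    return density, bandwidth

amounts = [10, 12, 15, 20, 22, 30, 55]  # hypothetical bill amounts
density, bandwidth = make_kde(amounts)

# Evaluate the curve on a grid and check the area under it with the trapezoid rule
lo = min(amounts) - 5 * bandwidth
hi = max(amounts) + 5 * bandwidth
steps = 2000
xs = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
ys = [density(x) for x in xs]
area = sum((ys[i] + ys[i + 1]) / 2 * (xs[i + 1] - xs[i]) for i in range(steps))
print(round(area, 3))
```

Unlike a histogram, the Y axis of a density plot is scaled so that the total area under the curve is 1, which is what the final check confirms approximately.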

**plot.kde** is used to plot the density values.

The **kde** function estimates the density values of the variable using kernel density estimation.

The **plot** accessor draws the line graph of the specified variable.

**plt.title** provides the user defined name of the chart. It must be passed as a string, in single or double quotes.

**plt.xlabel** provides a user defined label for the variable on the X axis.

`telecom['Amt'].plot.kde();plt.title('Fig.No. 11 : DENSITY PLOT (Amount)');plt.xlabel('Amount')`

Here is the density plot displayed after executing the Python code. It shows that the distribution of amount is slightly positively skewed.
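A reading such as "slightly positively skewed" can be cross-checked with a skewness coefficient. The sketch below uses the simple moment-based definition (third central moment divided by the 1.5 power of the second). The function name and sample values are hypothetical, and library routines may apply a small-sample correction that this version does not.

```python
def skewness(values):
    """Moment-based skewness: 0 for symmetric data, positive for a long right tail."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n   # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n   # third central moment
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]        # hypothetical values
right_skewed = [1, 2, 3, 4, 10]    # one large value stretches the right tail

print(skewness(symmetric))
print(round(skewness(right_skewed), 4))
```

A value near zero suggests symmetry, while a positive value matches a density curve whose right tail is longer than its left.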

### Stem and Leaf Diagrams

**A stem and leaf diagram** is another alternative to a histogram. Here, each numeric value is split into a **stem** (first digit(s)) and a **leaf** (last digit). Stem and leaf diagrams show the shape of a distribution (like bar charts), but have the advantage of not losing the detail of the original data.

One can easily locate the median using this plot.
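As an illustration of the stem and leaf split itself, here is a minimal text-based sketch in plain Python. The function name `stem_and_leaf` and the sample values are hypothetical, and the sketch assumes non-negative integers with the last digit used as the leaf.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return one text row per stem for non-negative integers (leaf = last digit)."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    rows = []
    for stem in range(min(leaves), max(leaves) + 1):
        # Stems without values are kept so the shape of the distribution is preserved.
        leaf_text = " ".join(str(leaf) for leaf in leaves.get(stem, []))
        rows.append(f"{stem} | {leaf_text}".rstrip())
    return rows

calls = [12, 15, 21, 23, 23, 34, 47, 61]  # hypothetical values
rows = stem_and_leaf(calls)
print("\n".join(rows))
```

Because the leaves are sorted, the median can be found by counting in from either end: with eight values it lies between the fourth and fifth leaves, both of which are 23 here.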

The Python syntax for stem and leaf diagrams is very simple.

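The **stem** function in matplotlib yields a stem chart, which draws a vertical line from a baseline up to each value. Note that this is a different display from the text-based stem and leaf diagram described above.

**telecom.Calls** specifies the variable for which the stem plot is drawn.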

`plt.stem(telecom.Calls);plt.title('Fig.No. 12 : STEM GRAPH – Total Calls'); plt.xlabel('CustID'); plt.ylabel('Total Calls') `

This is how the stem plot looks in Python. The Calls variable is distributed symmetrically and a few outliers exist in the data.

### Pareto Charts

The **Pareto chart**, named after Vilfredo Pareto, is a type of chart that contains both a **bar and a line graph**, where individual values are represented in descending order by bars. In this way the chart visually depicts which categories are more significant. The cumulative total is represented by the line.

Let’s obtain a Pareto chart for the Calls variable with Age_Group as a factor.

The **groupby** function is used to split the data into groups based on the variable Age_Group.

The **to_frame** function is used to convert the given Series object to a DataFrame.

The **plt.subplots** method provides a way to plot multiple plots on a single figure. Given the number of rows and columns, it returns a **tuple (fig, ax)**, giving a single figure fig with an array of axes ax. When it is called with no arguments, ax is a single axes object.

**telecom1.index** is the argument that allows the bars to be named according to the row names of telecom1.

**telecom1["Calls"]** specifies the variable for which the Pareto chart is plotted.

**ax.twinx** creates a twin axes (ax2) sharing the X axis.

**set_major_formatter(PercentFormatter())** sets the percentage format on the Y axis of our chart.

**ax.tick_params** controls the appearance of the axis ticks. The axis name has to be passed as a string, in quotes.

In **tick_params**, the **colors** argument sets your choice of color for the ticks and tick labels.

**ax.set_xlabel, ax.set_ylabel** provide user defined labels for the X and Y axes.

```
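# Total calls per age group, largest first (a Pareto chart orders its bars in descending order)
telecom1 = telecom.groupby('Age_Group')['Calls'].sum().sort_values(ascending=False)
telecom1
# Convert the Series to a DataFrame so a second column can be added
telecom1 = telecom1.to_frame()
# Cumulative share of total calls, in percent
telecom1["cumpercentage"] = telecom1["Calls"].cumsum() / telecom1["Calls"].sum() * 100
fig, ax = plt.subplots()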
```
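The listing above stops at `plt.subplots()`. The following is a sketch of how the remaining steps could look, based only on the function descriptions given earlier. It is not the original code. The data frame is a made-up stand-in for telecom.csv, and the colors, labels and title are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.ticker import PercentFormatter

# Made-up stand-in for telecom.csv (only the two columns needed here)
telecom = pd.DataFrame({
    "Age_Group": ["18-30", "18-30", "30-45", "30-45", ">45", "18-30"],
    "Calls": [300, 250, 200, 220, 80, 150],
})

# Total calls per group in descending order, plus the cumulative percentage
telecom1 = (telecom.groupby("Age_Group")["Calls"].sum()
            .sort_values(ascending=False).to_frame())
telecom1["cumpercentage"] = telecom1["Calls"].cumsum() / telecom1["Calls"].sum() * 100

fig, ax = plt.subplots()

# Bars: one per age group, named after the row labels of telecom1
ax.bar(telecom1.index, telecom1["Calls"], color="C0")
ax.set_xlabel("Age Group")
ax.set_ylabel("Total Calls")
ax.tick_params(axis="y", colors="C0")

# Line: cumulative percentage on a second Y axis that shares the X axis
ax2 = ax.twinx()
ax2.plot(telecom1.index, telecom1["cumpercentage"], color="C1", marker="o")
ax2.set_ylim(0, 105)
ax2.yaxis.set_major_formatter(PercentFormatter())
ax2.tick_params(axis="y", colors="C1")
ax2.set_ylabel("Cumulative share of calls")

ax.set_title("Pareto chart (synthetic data)")
```

In a notebook the `matplotlib.use("Agg")` line can be dropped and the figure is displayed directly; in a script, `fig.savefig(...)` writes it to a file.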

Here is the output of the Python code for the Pareto chart. We can interpret that 50% of the total calls made come from the 18-30 age group. Another 42% of calls are made by the 30-45 age group, and only 8% of calls are made by customers older than 45.

### Choosing the Right Type of Chart

It is very important to understand what type of graph is appropriate for a given type of variable. To represent a discrete variable, a bar graph is appropriate. For a continuous variable, we can use histograms, box plots or density plots. For a categorical variable, bar graphs, pie charts or Pareto charts are suitable. For a dichotomous variable, we can use multiple bar charts or stacked bar charts.

This tutorial lesson is taken from the Postgraduate Diploma in Data Science.