In this tutorial we’ll study data visualization using the ggplot2 package in R. The ggplot2 package is one of the most popular packages in data science.

First we’ll study how to construct various bar charts and graphs using ggplot2 and then we’ll study how to plot regression lines and trend lines in R.

You can download the data files for this tutorial here.

The ggplot2 package, created by Hadley Wickham, offers a powerful graphics language for creating elegant and complex plots. Based on Leland Wilkinson’s “The Grammar of Graphics”, ggplot2 allows us to create graphs that represent both univariate and multivariate, numerical and categorical data in a straightforward manner. We install the package once using the install.packages command and then load it in each session using the library function. You can learn about this in our installing R packages tutorial.

## Case Study

Let’s look at a case study, which we’ll use to explain various graphs.

A telecom service provider has demographic and transactional information for their customers. We want to visualize the data with usage variables and customer demographic information for generating business insights.

The data set comprises 1000 customers, and for each customer the age, gender and pin code are provided. In addition, the number of calls, number of minutes spoken and bill amount over a 6 month period are available for each customer.

Let’s start by importing telecom data using the read.csv function.

Also, please ensure that the ggplot2 package is installed and loaded in R.

## Importing Data

telecom<-read.csv("telecom.csv",header=TRUE)

## Installing and calling the package

install.packages("ggplot2")

library(ggplot2)

Now let’s get a simple bar chart with age group on the X axis and total calls on the Y axis. Within the ggplot function, telecom is the dataset. The aes function specifies the variables to be used on each axis. By default, the geom_bar function makes the height of each bar proportional to the number of cases in each group. The argument stat="identity" instead tells geom_bar to use the values of the y variable for the bar heights. Because the values for all customers in a group are stacked, each bar shows the total calls for that age group. The labs function is used to label the various features of the graph and the title argument gives the name of the diagram.

## Ggplot2 Simple Bar Chart (Age Group)

ggplot(telecom,aes(x=Age_Group,y=Calls))+geom_bar(stat="identity",fill="darkorange")+labs(x="Age Groups",y="Total Calls",title="Fig. No. 1 : Simple Bar Diagram(Age Group)")

Here’s the output we’ll get after executing the R code. To get the bars in the proper order, we’ll have to re-order the levels of column “Age_Group” in telecom data and then run the same ggplot code.

telecom$Age_Group <-factor(telecom$Age_Group,levels =c("18-30","30-45", ">45"))

Here we can see a simple bar diagram with the age groups ordered.
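The effect of re-ordering the levels can be checked without drawing anything. The sketch below uses a small made-up vector in place of telecom$Age_Group; the variable names are illustrative only.

```r
# Made-up values standing in for telecom$Age_Group
age_group <- c("30-45", ">45", "18-30", "18-30", ">45")

# Without explicit levels, factor() sorts the labels, and the sorted
# position of ">45" depends on the locale, so it may not come last
default_order <- levels(factor(age_group))

# Supplying the levels fixes the order that ggplot2 uses on the axis
custom_order <- levels(factor(age_group, levels = c("18-30", "30-45", ">45")))

print(default_order)
print(custom_order)
```

ggplot2 draws discrete axes in the order of the factor levels, which is why the same plotting code gives ordered bars once the levels are set.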

## Simple Bar Chart (Gender) – Horizontal

To construct a simple bar chart with horizontal bars the coord_flip function is used. There’s no change in the other syntax.

ggplot(telecom,aes(x=Gender,y=Calls))+geom_bar(stat="identity",fill="cadetblue")+labs(x="Gender",y="Total Calls",title="Fig. No. 2 : Simple Bar Diagram(Gender)-Horizontal")+coord_flip()

We’ll see horizontal bars when coord_flip is used with the ggplot function.

Now let’s move on to a stacked bar chart using the ggplot function.

Within ggplot, telecom is the dataset. The aes function specifies the variable to be used on the X axis. The **aes function in geom_bar** divides each bar as per the input variable using **fill=**Gender. We use the labs function to label the various features of the graph and the title argument gives the name of the diagram. The **scale_fill_manual function** allows the user to define colors for the sub-divided bars.

## Stacked Bar Chart (or Sub-Divided)
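The stacked chart described above can be sketched as follows. This is a minimal example under assumptions: telecom_demo is a tiny made-up data frame used in place of telecom.csv, the "F"/"M" coding of Gender is a guess, and the title text is illustrative.

```r
library(ggplot2)

# Made-up stand-in for the telecom data: one row per customer
telecom_demo <- data.frame(
  Age_Group = factor(c("18-30", "18-30", "18-30", "30-45", "30-45", ">45"),
                     levels = c("18-30", "30-45", ">45")),
  Gender = c("F", "M", "F", "M", "F", "M")  # assumed coding
)

# geom_bar() counts rows per age group; fill = Gender splits each bar,
# and the default position ("stack") piles the segments on top of each other
p_stacked <- ggplot(telecom_demo, aes(x = Age_Group)) +
  geom_bar(aes(fill = Gender)) +
  labs(x = "Age Group", y = "No. of customers",
       title = "Stacked Bar Chart (Age Group)") +
  scale_fill_manual(values = c("yellowgreen", "cadetblue"))

print(p_stacked)
```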

Here we can see a stacked bar diagram of the number of customers in each age group by gender.

By observing the diagram we can say that although there are more younger customers in the data, there is an almost equal number of males and females present in each age group.


## Multiple Bar Charts (or Grouped Bar Charts)

Next we’ll consider multiple bar diagrams using the ggplot function. Within the ggplot function telecom is the dataset, and the aes function specifies the variable to be used on the X axis.

**The aes function in geom_bar** divides each bar as per the input variable using **fill=**Gender, and the argument position="dodge" places the divided bars one beside the other. We use the labs function to label the various features of the graph and the title argument gives the name of the diagram. The **scale_fill_manual function** allows the user to define colors for the sub-divided bars.

ggplot(telecom,aes(x=Age_Group))+geom_bar(aes(fill=Gender),position="dodge") +labs(x="Age Group", y="No. of customers",title="Fig.No.5 : Multiple Bar Chart(Age-Group)")+scale_fill_manual(values=c("yellowgreen","cadetblue"))

Here we can see the multiple bar diagram of the gender-wise distribution of the number of customers across age groups.


## Ggplot2 Pie chart

To construct a Pie chart using the ggplot function we consider the Age group variable.

Within ggplot, telecom is the dataset. The aes function specifies the variables to be used. The geom_bar function makes the height of the bar proportional to the number of cases in each group, and the width = 1 argument ensures there’s no gap between bars.

The **coord_polar** function transforms stacked bar charts into circular pie charts.

The argument theta="y" within coord_polar maps the Y axis scale to the angle, so each slice is proportional to its count, and the argument **start=pi/3** starts the first portion of the pie at an offset of pi/3 radians. The labs function is used to label the various features of the graph and the title argument gives the name of the diagram. The **scale_fill_manual function** allows the user to define colors for the slices.
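A pie chart along these lines might look like the sketch below. The data frame telecom_demo is made up, and the empty-string x value, the colours and the title are assumptions rather than the tutorial’s original code.

```r
library(ggplot2)

# Made-up stand-in for the telecom data: one row per customer
telecom_demo <- data.frame(
  Age_Group = factor(c("18-30", "18-30", "18-30", "30-45", "30-45", ">45"),
                     levels = c("18-30", "30-45", ">45"))
)

# A single stacked bar (x = "") split by age group, then bent into a circle
p_pie <- ggplot(telecom_demo, aes(x = "", fill = Age_Group)) +
  geom_bar(width = 1) +                      # width = 1 leaves no gap
  coord_polar(theta = "y", start = pi / 3) + # counts become angles
  labs(x = NULL, y = NULL, title = "Pie Chart (Age Group)") +
  scale_fill_manual(values = c("yellowgreen", "cadetblue", "darkorange"))

print(p_pie)
```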

The pie chart shows how the total number of customers is proportionally distributed across the various age groups.


## Ggplot2 Box-whisker plot

A box-whisker plot is very popular among data scientists. Let’s see a box plot using the ggplot function. Within ggplot, telecom is the dataset and the aes function specifies the variables to be used on each axis. Here we want boxplots for the “Calls” variable within each age group.

**geom_boxplot** draws the box plots. The outlier.colour and outlier.size arguments define the colour and size of the outlier points.

The labs function is used to label the various features of the boxplot and the title argument gives a name to the boxplot.

## Ggplot2 Box Plot
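A box plot of this kind can be sketched as below. The data are simulated purely for illustration, and the fill colour, outlier settings and title are assumptions.

```r
library(ggplot2)

# Simulated stand-in for the telecom data (values are made up)
set.seed(1)
telecom_demo <- data.frame(
  Age_Group = factor(rep(c("18-30", "30-45", ">45"), each = 20),
                     levels = c("18-30", "30-45", ">45")),
  Calls = c(rpois(20, 60), rpois(20, 45), rpois(20, 30))
)

# One box per age group; outlier.* arguments style the outlier points
p_box <- ggplot(telecom_demo, aes(x = Age_Group, y = Calls)) +
  geom_boxplot(fill = "cadetblue", outlier.colour = "red", outlier.size = 2) +
  labs(x = "Age Group", y = "Total Calls", title = "Box Plot (Age Group)")

print(p_box)
```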

After executing the R code, we get 3 box plots, one for each age group.

Remember that a box plot also shows outliers.


To construct a horizontal boxplot, the coord_flip function is used. **The aes function in geom_boxplot** gives multiple boxplots one beside the other using **fill=**Gender.

### Box Plot (Gender) – Horizontal
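The horizontal, gender-wise box plots can be sketched as follows. The data are simulated, the "F"/"M" coding of Gender is assumed, and the colours are given as a named vector so that the colour attached to each gender is explicit.

```r
library(ggplot2)

# Simulated stand-in for the telecom data (values are made up)
set.seed(2)
telecom_demo <- data.frame(
  Age_Group = factor(rep(c("18-30", "30-45", ">45"), each = 20),
                     levels = c("18-30", "30-45", ">45")),
  Gender = rep(c("F", "M"), times = 30),  # assumed coding
  Calls = c(rpois(20, 60), rpois(20, 45), rpois(20, 30))
)

# fill = Gender draws one box per gender within each age group;
# coord_flip() swaps the axes so the boxes run horizontally
p_box_h <- ggplot(telecom_demo, aes(x = Age_Group, y = Calls)) +
  geom_boxplot(aes(fill = Gender)) +
  labs(x = "Age Group", y = "Total Calls",
       title = "Box Plot (Gender) - Horizontal") +
  scale_fill_manual(values = c(M = "yellowgreen", F = "cadetblue")) +
  coord_flip()

print(p_box_h)
```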

Here we see horizontal multiple box plots for each age group and for each gender.

Note that the variable “Calls” is described using box plots.

Green boxplots are for males and the blue boxplots are for females.

## ggplot2 Histogram

Another popular graph in data science is the histogram. Within the ggplot function telecom is the dataset. We use the aes function to specify our variable and geom_histogram to plot the histogram. The **binwidth** argument **within geom_histogram** sets the width of each bin (bar) in the graph.

We use the labs function to label the various features of the graph, and title argument gives the name of the diagram.

### Histogram
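A histogram of the “Calls” variable can be sketched as below. The data are simulated for illustration, and the bin width of 10, the colours and the title are assumptions.

```r
library(ggplot2)

# Simulated stand-in for the Calls column (values are made up)
set.seed(3)
telecom_demo <- data.frame(Calls = rpois(100, 50))

# binwidth sets how many units of Calls each bar covers
p_hist <- ggplot(telecom_demo, aes(x = Calls)) +
  geom_histogram(binwidth = 10, fill = "darkorange", colour = "white") +
  labs(x = "Calls", y = "No. of customers", title = "Histogram (Calls)")

print(p_hist)
```

A smaller binwidth gives more, narrower bars; a larger one smooths the shape but hides detail.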

We can see the histogram for the “Calls” variable. Histograms are a traditional but still very powerful way of looking at the data.

To plot a histogram for each age group, one stacked over the other, using the ggplot function, the aes function with fill = Age_Group is specified within the geom_histogram function.
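A sketch of the age-group-wise histogram, again on simulated data with an assumed bin width and title:

```r
library(ggplot2)

# Simulated stand-in for the telecom data (values are made up)
set.seed(4)
telecom_demo <- data.frame(
  Age_Group = factor(rep(c("18-30", "30-45", ">45"), each = 30),
                     levels = c("18-30", "30-45", ">45")),
  Calls = c(rpois(30, 60), rpois(30, 45), rpois(30, 30))
)

# fill = Age_Group colours each bar by age group; the default
# position ("stack") places the groups one above the other
p_hist_age <- ggplot(telecom_demo, aes(x = Calls)) +
  geom_histogram(aes(fill = Age_Group), binwidth = 10) +
  labs(x = "Calls", y = "No. of customers",
       title = "Histogram (Calls by Age Group)")

print(p_hist_age)
```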

This shows histograms of the three age groups, 18-30, 30-45 and greater than 45 plotted one above the other.

### Case Study 2

Here’s another case study. A company conducts different written tests before recruiting employees. The company wishes to see if the scores of these tests have any relation with the post-recruitment performance of those employees. Here the objective is to study the correlation between aptitude and job proficiency.

This is the data snapshot. The last column is a dependent variable, and scores of various tests conducted prior to recruitment are recorded under Aptitude, testofen, tech and gk.

To create a scatter plot using the ggplot function, we first import the job proficiency data using the read.csv function. Within ggplot, job is the dataset. The aes function specifies the variables to be used on each axis, and **the geom_point function** is used to plot the data points, which gives us the scatter plot. **The geom_smooth function** adds a fitted curve, and the argument **method="lm"** makes it a linear regression line. The labs function is used to label the various features of the graph and the title argument gives the name of the diagram.

### Importing Data

## Ggplot2 Scatterplot with Regression Line
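The scatter plot with a regression line can be sketched as below. The data frame job_demo and its column names aptitude and job_prof are hypothetical stand-ins for the job proficiency file, and the values are simulated with a built-in positive relationship.

```r
library(ggplot2)

# Simulated stand-in for the job proficiency data (names and values assumed)
set.seed(42)
job_demo <- data.frame(aptitude = round(runif(30, 40, 90)))
job_demo$job_prof <- 20 + 0.8 * job_demo$aptitude + rnorm(30, sd = 5)

# geom_point() draws the scatter; geom_smooth(method = "lm") overlays
# the fitted straight line with its confidence band
p_scatter <- ggplot(job_demo, aes(x = aptitude, y = job_prof)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  labs(x = "Aptitude", y = "Job Proficiency",
       title = "Scatter Plot with Regression Line")

print(p_scatter)
```

Adding se = FALSE inside geom_smooth would drop the shaded confidence band and leave only the line.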

Here we see a scatterplot with the regression line constructed using the aptitude variable on the X axis and job proficiency on the Y axis. By observing the scatterplot, we can see that as the aptitude score increases, job proficiency also increases. Thus there’s a positive correlation between job proficiency and aptitude score.

## Plotting a trendline

Our next objective is to construct a trend line. Plotting a trendline requires a time-element.

Let’s look at a snapshot of two datasets, TelecomData_CustDemo and TelecomData_WeeklyData.

The TelecomData_CustDemo dataset has information about Customer Id, age, gender, Pincode and whether the customer is an active user of telecom.

The TelecomData_WeeklyData dataset consists of variables such as CustId, Week, Calls, Minutes and amount charged.

We import both the datasets using read.csv function. Then we merge the two using Customer Id.

Our objective is to obtain a trendline for each age group. To obtain the trendline using the ggplot function, we specify the following functions and arguments. Within the ggplot function, Trend is the dataset and the aes function specifies the variables to be used on each axis.

On the X axis we consider weeks and on the Y axis we consider the calls variable. The **geom_line** function is used to draw the trend line and the **geom_point** function is used to plot the data points. The labs function is used to label the various features of the trendline and the title argument gives a name to the graph.

### Merging and Formatting Data

### Creating new variable Age_Group & aggregating
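The merge, grouping and aggregation steps might look like the sketch below. Everything here is an assumption made for illustration: the two small data frames stand in for the CSV files, the shared key is taken to be CustId, the age bands mirror the ones used earlier, and calls are summed per age group and week.

```r
# Made-up stand-ins for TelecomData_CustDemo and TelecomData_WeeklyData
cust_demo <- data.frame(CustId = 1:4, Age = c(22, 35, 51, 28))
weekly <- data.frame(
  CustId = rep(1:4, each = 2),
  Week   = rep(1:2, times = 4),
  Calls  = c(10, 12, 8, 9, 5, 4, 11, 13)
)

# Join the weekly usage to the demographics on the customer id
merged <- merge(cust_demo, weekly, by = "CustId")

# Bin Age into the three age groups used in this tutorial
merged$Age_Group <- cut(merged$Age,
                        breaks = c(18, 30, 45, Inf),
                        labels = c("18-30", "30-45", ">45"),
                        include.lowest = TRUE)

# Total calls per age group per week (sum is an assumed choice)
Trend <- aggregate(Calls ~ Age_Group + Week, data = merged, FUN = sum)
print(Trend)
```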

### Observing Age_group wise Trend
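A trend plot per age group can be sketched as below. The Trend data frame here is made up, and mapping colour to Age_Group to get one line per group is an assumption about how the groups are separated.

```r
library(ggplot2)

# Made-up aggregated data: total calls per age group per week
Trend <- data.frame(
  Age_Group = factor(rep(c("18-30", "30-45", ">45"), times = 4),
                     levels = c("18-30", "30-45", ">45")),
  Week  = rep(1:4, each = 3),
  Calls = c(21, 8, 5, 25, 9, 4, 23, 11, 6, 27, 10, 5)
)

# colour = Age_Group gives one line (and one set of points) per group
p_trend <- ggplot(Trend, aes(x = Week, y = Calls, colour = Age_Group)) +
  geom_line() +
  geom_point() +
  labs(x = "Week", y = "Total Calls", title = "Age Group-wise Trend")

print(p_trend)
```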


Here we can see the trend line for each age group after executing the R code.

A final note on fill: if we set fill="darkorange" in geom_bar, it overrides fill=Gender specified in ggplot. Instead of a multiple bar diagram, we then get a simple bar diagram with every bar in the same colour.
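The override can be checked on a small example. The data frame below is made up and the "F"/"M" coding is assumed; layer_data is used only to count how many bars each plot draws.

```r
library(ggplot2)

# Made-up stand-in for the telecom data: one row per customer
telecom_demo <- data.frame(
  Age_Group = factor(rep(c("18-30", "30-45", ">45"), each = 4),
                     levels = c("18-30", "30-45", ">45")),
  Gender = rep(c("F", "M"), times = 6)  # assumed coding
)

# fill = Gender is mapped in ggplot() and used by geom_bar()
p_mapped <- ggplot(telecom_demo, aes(x = Age_Group, fill = Gender)) +
  geom_bar(position = "dodge")

# A constant fill set in geom_bar() replaces the mapped fill, so the
# bars are no longer split by gender
p_override <- ggplot(telecom_demo, aes(x = Age_Group, fill = Gender)) +
  geom_bar(fill = "darkorange", position = "dodge")

print(nrow(layer_data(p_mapped)))    # bars drawn with the mapping
print(nrow(layer_data(p_override)))  # bars drawn with the constant fill
```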

To recap, in this session we studied how to construct bar charts, pie charts, box-whisker plots, histograms, scatter plots and trend lines in R.

This tutorial lesson is taken from the Postgraduate Diploma in Data Science.