ggplot2 in R – A Tutorial

In this tutorial we’ll study data visualization using the ggplot2 package in R. The ggplot2 package is one of the most popular packages in data science.

First we’ll study how to construct various bar charts and graphs using ggplot2 and then we’ll study how to plot regression lines and trend lines in R.

You can download the data files for this tutorial here.

The ggplot2 package, created by Hadley Wickham, offers a powerful graphics language for creating elegant and complex plots. Based on Leland Wilkinson’s “The Grammar of Graphics”, ggplot2 lets us create graphs of univariate and multivariate, numerical and categorical data in a straightforward manner. We install the package once using the install.packages command and then load it in each session using the library command. You can learn about this in our installing R packages tutorial.

Case Study

Let’s look at a case study, which we’ll use to explain various graphs.

A telecom service provider has demographic and transactional information for their customers. We want to visualize the data with usage variables and customer demographic information for generating business insights.


The data set comprises 1000 customers, and for each customer the age, gender and pin code are provided. In addition, the number of calls, number of minutes spoken and bill amount over a 6-month period are available for each customer.

case study data snapshot

Let’s start by importing the telecom data using the read.csv function.

Also, please ensure that the ggplot2 package is installed and loaded in R.

Importing Data

 telecom<-read.csv("telecom.csv", header=TRUE) 

Installing and loading the package

 install.packages("ggplot2")  # needed only once per R installation
 library(ggplot2)             # needed in every new R session

Now let’s get a simple bar chart with age group on the X axis and total calls on the Y axis. Within the ggplot function, telecom is the dataset. The aes function specifies the variables to be used on each axis. By default, the geom_bar function makes the height of each bar proportional to the number of cases in each group. The argument stat="identity" instead makes the height represent the values in the data, here the Calls values, which are stacked within each age group so that each bar shows the total. The labs function is used to label the various features of the graph, and its title argument gives the name of the diagram.

Ggplot2 Simple Bar Chart (Age Group)

 ggplot(telecom, aes(x=Age_Group, y=Calls))+
       geom_bar(stat="identity", fill="darkorange")+
        labs(x="Age Groups", y="Total Calls",
          title="Fig. No. 1 : Simple Bar Diagram(Age Group)")

Here’s the output we’ll get after executing the R code. To get the bars in the proper order, we’ll have to re-order the levels of the column “Age_Group” in the telecom data and then run the same ggplot code.

ggplot 2 simple bar chart

 telecom$Age_Group <- factor(telecom$Age_Group, levels = c("18-30", "30-45", ">45"))

Here we can see a simple bar diagram with the age groups ordered.

ggplot2 simple bar chart ordered

Simple Bar Chart (Gender) – Horizontal

To construct a simple bar chart with horizontal bars, the coord_flip function is used. The rest of the syntax is unchanged.

 ggplot(telecom, aes(x=Gender, y=Calls))+  
       geom_bar(stat="identity",fill="cadetblue")+ 
        labs(x="Gender",y="Total Calls", 
          title="Fig. No. 2 : Simple Bar Diagram(Gender)-Horizontal")+  
             coord_flip() 

We’ll see horizontal bars when coord_flip is used with the ggplot function.

ggplot2 horizontal bar chart

Now let’s move on to a stacked bar chart using the ggplot function.

Within ggplot, telecom is the dataset. The aes function specifies the variable to be used on the X axis, here the age group.

The aes function inside geom_bar divides each bar as per the input variable, using fill=Gender.

We use the labs function to label the various features of the graph, and the title argument gives the name of the diagram. The scale_fill_manual function allows the user to define colors for the sub-divided bars.

Stacked Bar Chart (or Sub-Divided)
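The code for this chart is not shown here, so the following is only a minimal sketch of what the description suggests. It assumes a small synthetic data frame in place of telecom.csv, Gender values coded as "Female" and "Male", and an illustrative title and colors; none of these come from the original data.

```r
library(ggplot2)

# Synthetic stand-in for telecom.csv (assumed values; column names follow the tutorial)
set.seed(1)
telecom <- data.frame(
  Age_Group = factor(sample(c("18-30", "30-45", ">45"), 200, replace = TRUE),
                     levels = c("18-30", "30-45", ">45")),
  Gender = sample(c("Female", "Male"), 200, replace = TRUE)
)

# No y aesthetic is given, so geom_bar counts the customers in each age group;
# fill = Gender splits each bar into stacked segments
p <- ggplot(telecom, aes(x = Age_Group)) +
  geom_bar(aes(fill = Gender)) +
  labs(x = "Age Group", y = "No. of customers",
       title = "Stacked Bar Chart (Age Group)") +
  scale_fill_manual(values = c("yellowgreen", "cadetblue"))
p
```

Stacking is the default position in geom_bar, so no position argument is needed here.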

Here we can see a stacked bar diagram of the number of customers in each age group by gender. 

By observing the diagram we can say that although there are more young customers in the data, the numbers of males and females in each age group are almost equal.

# Output

ggplot2 stacked bar chart

Multiple Bar Charts (or Grouped Bar Charts)

Next we’ll consider multiple bar diagrams using the ggplot function. Within the ggplot function, telecom is the dataset. The aes function specifies the variable to be used on the X axis.

The aes function in geom_bar divides each bar as per the input variable using fill=Gender, and the argument position="dodge" places the divided bars one beside the other. We use the labs function to label the various features of the graph, and the title argument gives the name of the diagram. The scale_fill_manual function allows the user to define colors for the sub-divided bars.

 ggplot(telecom, aes(x=Age_Group))+
       geom_bar(aes(fill=Gender), position="dodge")+
        labs(x="Age Group", y="No. of customers",
          title="Fig.No.5 : Multiple Bar Chart(Age-Group)")+
             scale_fill_manual(values=c("yellowgreen","cadetblue"))

Here we can see the multiple bar diagram of the gender-wise distribution of the number of customers across age groups.

# Output

multiple bar chart

Ggplot2 Pie chart

To construct a pie chart using the ggplot function, we consider the age group variable.

Within ggplot, telecom is the dataset. The aes function specifies the variables to be used. The geom_bar function makes the height of the bar proportional to the number of cases in each group, and the width=1 argument ensures there’s no gap between bars.

The coord_polar function transforms stacked bar charts into circular pie charts.

The argument theta="y" within coord_polar uses the Y axis scale for proportion, and the argument start=pi/3 starts the first portion of the pie from a pi/3 angle. The labs function is used to label the various features of the graph, and the title argument gives the name of the diagram. The scale_fill_manual function allows the user to define colors for the slices.
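As a rough illustration of these steps, here is a sketch of a pie chart built from a single stacked bar. The data frame is synthetic, and the empty x value, the colors and the title are assumptions made for the example rather than the tutorial's own settings.

```r
library(ggplot2)

# Synthetic stand-in for telecom.csv (assumed values)
set.seed(1)
telecom <- data.frame(
  Age_Group = factor(sample(c("18-30", "30-45", ">45"), 200, replace = TRUE),
                     levels = c("18-30", "30-45", ">45"))
)

# x = "" puts all customers into one bar, stacked by age group;
# coord_polar(theta = "y") then bends that bar into a circle
p <- ggplot(telecom, aes(x = "", fill = Age_Group)) +
  geom_bar(width = 1) +
  coord_polar(theta = "y", start = pi / 3) +
  labs(x = NULL, y = NULL, title = "Pie Chart (Age Group)") +
  scale_fill_manual(values = c("yellowgreen", "cadetblue", "darkorange"))
p
```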

The pie chart shows how the total number of customers is proportionally distributed across the age groups.

# Output

ggplot2 pie chart

Ggplot2 Box-whisker plot

A box-whisker plot is very popular among data scientists. Let’s see a box plot using the ggplot function. Within ggplot, telecom is the dataset, and the aes function specifies the variables to be used on each axis. Here we want box plots for the “Calls” variable within each age group.

The geom_boxplot function draws the box plots. The outlier.colour and outlier.size arguments define the colour and size of the outlier points.

The labs function is used to label the various features of the boxplot, and the title argument gives a name to the boxplot.

Ggplot2 Box Plot
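A minimal sketch of such a box plot follows. It runs on synthetic data instead of telecom.csv, and the outlier colour, outlier size and title are illustrative choices, not values taken from the original code.

```r
library(ggplot2)

# Synthetic stand-in for telecom.csv (assumed values)
set.seed(1)
telecom <- data.frame(
  Age_Group = factor(sample(c("18-30", "30-45", ">45"), 200, replace = TRUE),
                     levels = c("18-30", "30-45", ">45")),
  Calls = rpois(200, lambda = 300)
)

# One box per age group; outlier points are drawn in red at size 2
p <- ggplot(telecom, aes(x = Age_Group, y = Calls)) +
  geom_boxplot(outlier.colour = "red", outlier.size = 2) +
  labs(x = "Age Group", y = "Calls", title = "Box Plot (Age Group)")
p
```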

After executing the R code, we get three box plots, one for each age group.

Remember that a box plot also shows outliers.

# Output

ggplot2 box plot

To construct a horizontal boxplot, the coord_flip function is used. The aes function in geom_boxplot, with fill=Gender, gives multiple boxplots one beside the other.

Box Plot (Gender) – Horizontal
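The sketch below shows one way this could look. It assumes synthetic data with Gender coded as "Male" and "Female", and it uses a named colour vector so that each colour is tied to a gender explicitly; the actual coding in telecom.csv may differ.

```r
library(ggplot2)

# Synthetic stand-in for telecom.csv (assumed values and gender coding)
set.seed(1)
telecom <- data.frame(
  Age_Group = factor(sample(c("18-30", "30-45", ">45"), 200, replace = TRUE),
                     levels = c("18-30", "30-45", ">45")),
  Gender = sample(c("Female", "Male"), 200, replace = TRUE),
  Calls = rpois(200, lambda = 300)
)

# fill = Gender draws one box per gender within each age group;
# coord_flip turns the boxes horizontal
p <- ggplot(telecom, aes(x = Age_Group, y = Calls)) +
  geom_boxplot(aes(fill = Gender)) +
  labs(x = "Age Group", y = "Calls", title = "Box Plot (Gender) - Horizontal") +
  scale_fill_manual(values = c(Male = "yellowgreen", Female = "cadetblue")) +
  coord_flip()
p
```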

This plot shows horizontal box plots for each age group and each gender.

Note that the variable “Calls” is described using box plots.

Green boxplots are for males and the blue boxplots are for females.

horizontal box plot

ggplot2 Histogram

Another popular graph in data science is the histogram. Within the ggplot function, telecom is the dataset. We use the aes function to specify our variable and geom_histogram to plot the histogram. The binwidth argument within geom_histogram sets the width of each bin, and therefore of each bar, in the graph.

We use the labs function to label the various features of the graph, and the title argument gives the name of the diagram.

# Histogram
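A hedged sketch of a histogram of Calls is given below. The data are synthetic, and the bin width of 10, the fill colour and the title are assumptions chosen to suit the made-up values, not the tutorial's original settings.

```r
library(ggplot2)

# Synthetic stand-in for telecom.csv (assumed values)
set.seed(1)
telecom <- data.frame(Calls = rpois(200, lambda = 300))

# Only x is mapped; geom_histogram counts observations per bin,
# and binwidth = 10 makes every bin 10 calls wide
p <- ggplot(telecom, aes(x = Calls)) +
  geom_histogram(binwidth = 10, fill = "darkorange", colour = "white") +
  labs(x = "Calls", y = "No. of customers", title = "Histogram (Calls)")
p
```

A bin width that suits one variable rarely suits another, so it is worth trying a few values on the real data.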

We can see the histogram for the “Calls” variable. Histograms are a traditional but still very powerful way of looking at the data.

ggplot2 histogram

To plot a histogram for each age group, one over the other, using the ggplot function, the aes function with fill=Age_Group is specified within the geom_histogram function.
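One possible version of this plot is sketched here on synthetic data. The bin width, colors and title are assumptions; the only part taken from the text is the fill=Age_Group mapping inside geom_histogram.

```r
library(ggplot2)

# Synthetic stand-in for telecom.csv (assumed values)
set.seed(1)
telecom <- data.frame(
  Age_Group = factor(sample(c("18-30", "30-45", ">45"), 200, replace = TRUE),
                     levels = c("18-30", "30-45", ">45")),
  Calls = rpois(200, lambda = 300)
)

# fill = Age_Group splits every bin by age group; the default position
# stacks the three groups on top of each other
p <- ggplot(telecom, aes(x = Calls)) +
  geom_histogram(aes(fill = Age_Group), binwidth = 10) +
  labs(x = "Calls", y = "No. of customers",
       title = "Histogram (Calls) by Age Group") +
  scale_fill_manual(values = c("yellowgreen", "cadetblue", "darkorange"))
p
```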

This shows histograms of the three age groups, 18-30, 30-45 and greater than 45, plotted one above the other.

histogram

Case Study 2

Here’s another case study. A company conducts different written tests before recruiting employees. The company wishes to see if the scores of these tests have any relation with the post-recruitment performance of those employees. Here the objective is to study the correlation between aptitude and job proficiency.


This is the data snapshot. The last column is the dependent variable, and the scores of the various tests conducted prior to recruitment are recorded under Aptitude, testofen, tech and gk.

 data snapshot

To create a scatter plot using the ggplot function, we first import the job proficiency data using the read.csv function. Within ggplot, job is the dataset. The aes function specifies the variables to be used on each axis, and the geom_point function plots the data points, which gives the scatter plot. The geom_smooth function adds a fitted curve, and the argument method="lm" makes it a linear regression line. The labs function is used to label the various features of the graph, and the title argument gives the name of the diagram.

# Importing Data

Ggplot2 Scatterplot with Regression Line
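The following is a sketch only. The file name of the job proficiency data is not given, so the example builds a synthetic job data frame, and the column names aptitude and job_prof are hypothetical stand-ins for the real ones.

```r
library(ggplot2)

# Synthetic stand-in for the job proficiency data
# (hypothetical column names and made-up values)
set.seed(1)
job <- data.frame(aptitude = round(runif(30, min = 40, max = 90)))
job$job_prof <- 20 + 0.8 * job$aptitude + rnorm(30, sd = 5)

# geom_point draws the scatter plot; geom_smooth(method = "lm")
# overlays the fitted linear regression line with its confidence band
p <- ggplot(job, aes(x = aptitude, y = job_prof)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  labs(x = "Aptitude", y = "Job Proficiency",
       title = "Scatterplot with Regression Line")
p
```

Adding se = FALSE to geom_smooth would drop the shaded confidence band if only the line is wanted.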

Here we see a scatterplot with the regression line, constructed using the aptitude variable on the X axis and job proficiency on the Y axis. By observing the scatterplot, we can see that as the aptitude score increases, job proficiency also increases. Thus there’s a positive correlation between job proficiency and aptitude score.

scatterplot with regression line

Plotting a trendline

Our next objective is to construct a trend line. Plotting a trendline requires a time-element.

Let’s look at a snapshot of two datasets, TelecomData_CustDemo and TelecomData_WeeklyData.

The TelecomData_CustDemo dataset has information about Customer Id, age, gender, Pincode and whether the customer is an active user of telecom.

The TelecomData_WeeklyData dataset consists of variables such as CustId, Week, Calls, Minutes and amount charged.

telecom data

We import both datasets using the read.csv function. Then we merge the two on Customer Id.

Our objective is to obtain a trendline for each age group. To obtain the trendline using the ggplot function, we specify the following functions and arguments. Within the ggplot function, Trend is the dataset and the aes function specifies the variables to be used on each axis.

On the X axis we consider weeks and on the Y axis we consider the calls variable. The geom_line function draws the trend line and the geom_point function plots the data points. The labs function is used to label the various features of the trendline, and the title argument gives a name to the graph.

Merging and Formatting Data

# Creating new variable Age_Group & aggregating
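The data preparation could look roughly like the sketch below. It uses two tiny synthetic data frames in place of the CSV files, and the column names (CustId, Age, Week, Calls), the age cut points and the use of sum as the aggregate are all assumptions made for illustration.

```r
# Synthetic stand-ins for TelecomData_CustDemo and TelecomData_WeeklyData
# (assumed column names and made-up values)
cust <- data.frame(CustId = 1:6, Age = c(22, 28, 35, 41, 50, 63))
weekly <- expand.grid(CustId = 1:6, Week = 1:4)
weekly$Calls <- 10 * weekly$CustId + weekly$Week

# Merge the two datasets on the customer id
merged <- merge(cust, weekly, by = "CustId")

# Create Age_Group from Age; the cut points are an assumption:
# up to 30 -> "18-30", 31 to 45 -> "30-45", above 45 -> ">45"
merged$Age_Group <- cut(merged$Age, breaks = c(17, 30, 45, Inf),
                        labels = c("18-30", "30-45", ">45"))

# Total calls per age group and week: one row per point on the trend line
Trend <- aggregate(Calls ~ Age_Group + Week, data = merged, FUN = sum)
head(Trend)
```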

Observing Age_group wise Trend
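Here is a sketch of the trend plot itself, starting from an already aggregated Trend data frame. The values are made up, and mapping colour and group to Age_Group is an assumption about how one line per age group is obtained.

```r
library(ggplot2)

# Synthetic aggregated data: total calls per week for each age group (made-up values)
Trend <- expand.grid(Week = 1:6, Age_Group = c("18-30", "30-45", ">45"))
Trend$Calls <- c(120, 135, 128, 150, 162, 170,
                  90,  95, 110, 105, 118, 125,
                  60,  58,  66,  70,  68,  75)

# group and colour give one line per age group;
# geom_point marks the weekly values on each line
p <- ggplot(Trend, aes(x = Week, y = Calls,
                       colour = Age_Group, group = Age_Group)) +
  geom_line() +
  geom_point() +
  labs(x = "Week", y = "Total Calls", title = "Trend Line (Age Group)")
p
```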


Here we can see the trend line for each age group after executing the R code.

ggplot2 - trendline

If we use fill="darkorange" in geom_bar, it overrides fill=Gender in ggplot, so instead of the multiple bar diagram shown here we get the simple bar diagram shown after it.

ggplot2 geombar

This simple bar diagram is obtained because fill="darkorange" in geom_bar overrides fill=Gender in ggplot.

ggplot2 in R geombar
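The override can be reproduced with the short sketch below, which builds both plots from the same synthetic data. The data values and the "Female"/"Male" coding are assumptions; the comparison between a mapped fill and a constant fill is the point of the example.

```r
library(ggplot2)

# Synthetic stand-in for telecom.csv (assumed values)
set.seed(1)
telecom <- data.frame(
  Age_Group = factor(sample(c("18-30", "30-45", ">45"), 200, replace = TRUE),
                     levels = c("18-30", "30-45", ">45")),
  Gender = sample(c("Female", "Male"), 200, replace = TRUE)
)

# Fill mapped to Gender in ggplot(): one bar per gender within each age group
p_grouped <- ggplot(telecom, aes(x = Age_Group, fill = Gender)) +
  geom_bar(position = "dodge")

# A constant fill set inside geom_bar replaces the mapped fill,
# so the bars are no longer split by gender
p_simple <- ggplot(telecom, aes(x = Age_Group, fill = Gender)) +
  geom_bar(fill = "darkorange", position = "dodge")

p_grouped
p_simple
```

To keep the bars split by gender while changing their colors, the colors can be set with scale_fill_manual instead of a constant fill.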

To recap, in this session we studied how to construct bar charts, pie charts, box-whisker plots, histograms, scatter plots and trend lines in R. This tutorial is based on lessons from the Data Analytics in R unit of the Digita Schools Advanced Diploma in Data Analytics.


This tutorial lesson is taken from Digita Schools Advanced Diploma in Data Analytics and the Postgraduate Diploma in Data Science.
