Joins in Python- Merging, Appending and Aggregating Data

In this tutorial, we’ll look at different types of joins in Python used to merge two datasets.Then we’ll study how to aggregate data using the groupby function.

You can download the data files for this tutorial here

Joins in Python

Here’s a snapshot of our two datasets –  sal_data and bonus_data. The sal_data dataset gives information about the basic salary a group of employees, and the bonus_data  dataset about the bonus given to each employee. Each row corresponds to information about one employee.

Note that “Employee ID” is the common column in both datasets

Join datasets in Python

Joins in Python is basically the merging of the two datasets with one or more common variables. Pandas provides various facilities for easily combining data frames together. We can use the simple merge() function for merging data in pandas.

First, we import the two datasets as sal_data and bonus_data using the pd.read_csv function.

read csv in python

Import sal_data and bonus_data

 import pandas as pd
 sal_data = pd.read_csv('sal_data.csv') 
 bonus_data = pd.read_csv('bonus_data.csv')  

The different types of joins that we’ll study using the merge function are left join, right join, inner join and outer join.

We’ll look at each type closely using our two data sets.

joins in python

Left Join in Python

Left Join returns all rows from the first dataset, even if there are no matches in the second dataset.

We’ve specified two data sets to be merged inside the pd.merge function as sal_data and bonus_data. The primary key to be used for merging is Employee_ID. The how argument is used to specify the type of join, in this case left.

Employee ID’s in sal_data that are not present in bonus_data  will have NaN values under the Bonus variable.

Display all the information(including bonus) of Employees from sal_data

 leftjoin=pd.merge(sal_data,bonus_data,how='left')
 leftjoin 

# Output

Employee data python

Right Join in Python.

Right Join returns all rows from the second dataset even if there are no matches in the first dataset   

We’ve specified the two data sets to be merged inside merge function as sal_data and bonus_data. The primary key to be used for merging is Employee_ID. The argument

how = ‘right’ is used to keep all  rows from the first data set and only those from second dataset that match,

Employee ID’s in bonus_data that are not present in sal_data  will have NaN values in the columns First_Name, Last_Name and Basic_Salary.

# Display all the information of employees who are receiving bonus

 rightjoin=pd.merge(sal_data,bonus_data,how='right')
 rightjoin 

# Output

Employee data python 2

Inner Join returns only those rows that match in the primary key Employee ID.

We have specified two data sets to be merged inside the merge function as sal_data and bonus_data. The primary key to be used for merging is Employee_ID.By default merge returns an inner join.

# Display all the information about employees which are common in both the tables

 innerjoin=pd.merge(sal_data,bonus_data)
 innerjoin 

# Output

output-inner-join

Outer Join returns all rows from both the datasets, including rows which are not matched in Employee_ID.

The argument how = ‘outer’ indicates that all rows from both the datasets are included.

Here you can see that NaN are generated in case of unmatched rows under relevant columns.

# Combine sal_data and bonus_data

 outerjoin=pd.merge(sal_data,bonus_data,how='outer')
 outerjoin 

# Output

output-outer-join

From merging data sets, let’s now move to appending two data sets.

The basic_salary-1 data set has 5 rows and 6 columns.

The basic_salary-2 data set has 7 rows and 6 columns.

Here we want to obtain a data set with 7 rows and 6 columns.

merge data sets python

Concatenate in Python

Append means adding cases or observations to a dataset. In pandas, the concat() function is used to append data.

We observe that the pd.concat function is used to combine the two datasets, basic_salary-1 and basic_salary-2.

You need to pass a sequence or mapping of dataframes or series to concat as objects for it to append. So frames=[Salary_1,Salary_2] is the sequence that is passed as an object to the  concat function.

The python concat function combines the observation in the dataset by adding rows.

You can see that the new data set has 12 rows and 6 columns.

# Import the data sets and append them using pd.concat()

 Salary_1= pd.read_csv('basic_salary - 1.csv') 
 Salary_2= pd.read_csv('basic_salary - 2.csv')
 frames=[Salary_1,Salary_2] 
 appendsalary=pd.concat(frames)
 appendsalary 

# Output

concatenate-in-python

Let’s now look at a snapshot of the data we will be using to study the aggregate function. The data concerns employee salary components, along with Grade and Location.

Each row corresponds to information about one employee.

Employee salary data

Aggregating data means splitting data into subsets, computing summary statistics on each subset and displaying the results in a conveniently summarised form.

The groupby() function in pandas carries out the process of taking numerous records and collapsing them into a single summary record.

aggregating-data

# Import basic_salary data

salary_data=pd.read_csv('basic_salary.csv')

To calculate the sum of the variable ‘ms’ by the variable ‘Location’ we use the groupby function. In this example we are grouping ms by Location. ‘Location’ enclosed in parentheses tells the function to group according to Location variables. ‘ms’ enclosed in square brackets is used to access the ms variable, so as to apply the base function sum to it.

# Calculate sum of variable ‘ms’ by variable ‘Location’

# In this example we are giving one variable and one factor

 A=salary_data.groupby('Location')['ms'].sum()
 A 

# Output

import-basic-salary

To calculate sum of the variables ‘ms’ and ‘ba’ by variable ‘Location’ we use the groupby function. In this example we are grouping ms and ba by Location. ‘Location’ enclosed in parentheses tells the function to group according to the Location variables. ‘ms’ and ‘ba’ enclosed in square brackets is used to access the ms and ba variable so as to apply the base function sum to it.

# Calculate sum of variables ‘ba’ and ‘ms’ by variables ‘Location’

# In this example we are giving multiple variables and one factor

 B=salary_data.groupby('Location')['ba','ms'].sum()
 B 

# Output

location data

We can also group variables using multiple factors. In this example we are grouping ms and ba by the variables Location and Grade.

Within the groupby function, we specify the variables Location and Grade enclosed in square brackets.‘ms’ and ‘ba’ enclosed in square brackets is used to access the ms and ba variable, so as to apply the base function sum to it.

# Calculate sum of variable ‘ms’ and ‘ba’ by variables ‘Location’ and ‘Grade’

# In this example we are giving two variables and two factors

 C=salary_data.groupby(['Location', 'Grade'])['ba','ms'].sum()
 C 

# Output

location data 2

This is a quick recap of the concepts. We learned different ways of joining two data sets using merge() function.

The different types of joins that can be applied on two datasets are left, Right, Inner and outer

We also studied appending data.

Further we learned how to aggregate data using the groupby function.

joins in R - recap

This tutorial lesson is taken from the Postgraduate Diploma in Data Science