• Magdalena Konkiewicz

Practical uses of merge, join and concat


Image by Jenő Szabó from Pixabay

Introduction

In this article, we will talk about combining data frames. You are probably very familiar with load functions in pandas that allow you to get access to data in order to do some analysis.

However, what happens if your data is not in one file but scattered across multiple ones? In that case, you will need to load the files one by one and combine the data into a single data frame using pandas functions.

We will show you how to do that and what functions to use depending on how you want to combine your data and what you want to achieve. We will learn about:

concat(),

merge(),

and join().

After reading this article you should be able to use all three of them to combine data in different ways.

Let’s get started!



concat()

This is the function that we recommend using if you have multiple data files with the same column names. It could be sales for a chain vendor where each year would be saved in a separate spreadsheet.

We are going to create two separate data frames with some fake data in order to illustrate this. Let’s start with creating a data frame for sales for the year 2018:



import pandas as pd
import numpy as np
sales_dictionary_2018 = {'name': ['Michael', 'Ana'], 
                     'revenue': ['1000', '2000'], 
                     'number_of_itmes_sold': [5, 7]}
sales_df_2018 = pd.DataFrame(sales_dictionary_2018)
sales_df_2018.head()


This is a very simple data frame with their columns summarizing sales for the year 2018. We have a vendor name, a number of units they have sold and the revenue they have created.

Let’s create now a data frame that has exactly the same columns but covers a new time period: the year 2019.



sales_dictionary_2019 = {'name': ['Michael', 'Ana', 'George'], 
                     'revenue': ['1000', '3000', '2000'], 
                     'number_of_itmes_sold': [5, 8, 7]}
sales_df_2019 = pd.DataFrame(sales_dictionary_2019)
sales_df_2019.head()


You can see that in 2019 we had a new rep George except for Michael and Ana from 2018 sales. Otherwise, the data structure is the same as we had for the 2018 year.

So how can you combine these two data frames together in order to have them in one data frame? You can use contact() function:



pd.concat([sales_df_2018, sales_df_2019], ignore_index=True)



You can see that after this operation all our data is now on one data frame!

Concat() function takes a list of data frames and adds all their rows together resulting in one data frame. We set here ignore_index=True as otherwise, the resulting data frame would have indexes taken from the original data frame. In our case, we do not want that. Note that you can pass as many data frames as you want in a list form.

Adding data the way we presented is probably the most common way of using concat() function. You could also use concat() function to add columns by setting axis=1 but there are better ways of adding new data in columns such as join() and merge(). This is because when you add new columns you mostly need to specify some condition on which you would like to join the data and concat() does not allow that.



merge()

We are going to start by learning about merge() function as this is probably the most common way of adding new columns to the data frame based on some common conditions.

In order to illustrate the usage of merge() we will go back to our store example which had data frame from sales for 2018 and 2019.

Imagine that you had another file that was containing personal data for each of the reps. Now, you would like to add this data to sales data frames.

Let’s start by creating a data frame for personal rep info.



rep_info_dictionary = {'name': ['Ana', 'Michael', 'George'], 
                     'location': ['New York', 'San Jose', 'New York']}
rep_info_df = pd.DataFrame(rep_info_dictionary)
rep_info_df.head()

As you can see there are only two columns: name and location. The name is something that was also present in our yearly sales data. Imagine that you would like to add a location of the rep to your 2019 data.

You could merge the rep_info_df to sales_df_2019 using ‘name’ as the column that links both of these data frames:



sales_df_2019.merge(rep_info_df, on='name')



You can see that our original data frame with sales data for 2019 has now another column, location. It is a result of merging rp_info_df and sales_df_2019.

As default pandas merge uses ‘inner join’ to perform merge operation. We are not going to discuss types of joins here but if you are familiar with SQL joins they work exactly the same.

If concepts of inner, left, right and outer joins are not familiar to you I suggest that you find some articles that explain these SQL concepts. Once you understand the SQL joins you will be able to use them with pandas merge() function as they work exactly the same.



join()

This is a special case of merge when at least one of the columns that you are joining on is an index. Let’s modify our 2019 data to have names as indexes:



sales_df_2019.set_index('name', inplace=True)
sales_df_2019.head()

As you can see name is not a column but index of the data frame after we have used set_index() function on it.

Let’s now do the same to rep info data frame:



rep_info_df.set_index('name', inplace=True)
rep_info_df.head()


Now both rep_info_df and sales_df_2019 have names as indexes. Now I can use join() instead of merge() in order to combine the data in the same way as we did with merge() in the previous section.



sales_df_2019.join(rep_info_df)


As you can see I do not have to specify the on parameter as it was the case with merge(). This is because join() function takes indexes of data frames as defaults to combine data from both tables.

Join() function similar to merge() can be modified to use different types of SQL join by specifying the how parameter. The default with join is a ‘left’ join and this is what we have used in our example.



Summary

Let’s summarize our findings.

We have used concat() when we were trying to add multiple data blocks with the same structure and put them in one data frame one below the other.

In order to add column data to existing entries, we have used merge(). The data was added based on the same values for columns on which were merging data frames.

Join() was a special case of merge() when at least one of the entries we are joining on was the index.

I hope you have enjoyed this article and have learned how to use these basic data frame operations to combine data.

46 views0 comments

Recent Posts

See All