Joining and Merging Dataframes - p.6 Data Analysis with Python and Pandas Tutorial



Welcome to Part 6 of the Data Analysis with Python and Pandas tutorial series. In this part, we're going to talk about joining and merging dataframes, as another method of combining dataframes. In the previous tutorial, we covered concatenation and appending.

First we will start with some sample dataframes like before, with one change:

import pandas as pd

df1 = pd.DataFrame({'HPI':[80,85,88,85],
                    'Int_rate':[2, 3, 2, 2],
                    'US_GDP_Thousands':[50, 55, 65, 55]},
                   index = [2001, 2002, 2003, 2004])

df2 = pd.DataFrame({'HPI':[80,85,88,85],
                    'Int_rate':[2, 3, 2, 2],
                    'US_GDP_Thousands':[50, 55, 65, 55]},
                   index = [2005, 2006, 2007, 2008])

df3 = pd.DataFrame({'HPI':[80,85,88,85],
                    'Unemployment':[7, 8, 9, 6],
                    'Low_tier_HPI':[50, 52, 50, 53]},
                   index = [2001, 2002, 2003, 2004])

The only major change is in df3, where we change Int_rate to unemployment. First, let's discuss merging.

print(pd.merge(df1,df3, on='HPI'))

Output:

   HPI  Int_rate  US_GDP_Thousands  Low_tier_HPI  Unemployment
0   80         2                50            50             7
1   85         3                55            52             8
2   85         3                55            53             6
3   85         2                55            52             8
4   85         2                55            53             6
5   88         2                65            50             9

So, here, we saw how there was a shared column (HPI). You can share on multiple columns, here's an example of that:

print(pd.merge(df1,df2, on=['HPI','Int_rate']))

Output:

   HPI  Int_rate  US_GDP_Thousands_x  US_GDP_Thousands_y
0   80         2                  50                  50
1   85         3                  55                  55
2   88         2                  65                  65
3   85         2                  55                  55

Notice here there are two versions of the US_GDP_Thousands. This is because we didn't share on these columns, so both are retained, with another letter to differentiate. Remember before I was saying that Pandas is a great module to marry to a database like mysql? here's why.

Generally, with databases, you want to keep them as lightweight as possible so the queries that run on them can execute as fast as possible.

Let's say you run a website like pythonprogramming.net, where you have users, so you definitely want to track username and encrypted password hashes, so that's 2 columns for sure. Maybe then you have a login name, a username, a password, an email and a join date. So that's already 5 columns with basic data points. Then maybe you have something like user settings, posts if you have a forum, completed tutorials. Then maybe you want to have settings like admin, moderator, regular user.

The lists can go on and on. If you have literally just 1 massive table, this can work, but it might also be better to distribute the table, since many operations will simply be much quicker and more efficient. After mergin, you would probably set the new index. Something like this:

df4 = pd.merge(df1,df3, on='HPI')
df4.set_index('HPI', inplace=True)
print(df4)

Output:

     Int_rate  US_GDP_Thousands  Low_tier_HPI  Unemployment
HPI                                                        
80          2                50            50             7
85          3                55            52             8
85          3                55            53             6
85          2                55            52             8
85          2                55            53             6
88          2                65            50             9

Now, what if HPI was already the index? Or, in our case, We'll probably be joining on the dates, but the dates might be the index. In this case, we'd probably use join.

df1.set_index('HPI', inplace=True)
df3.set_index('HPI', inplace=True)

joined = df1.join(df3)
print(joined)

Output:

     Int_rate  US_GDP_Thousands  Low_tier_HPI  Unemployment
HPI                                                        
80          2                50            50             7
85          3                55            52             8
85          3                55            53             6
85          2                55            52             8
85          2                55            53             6
88          2                65            50             9

Now, let's consider joining and merging on slightly differing indexes. Let's redefine dataframes with df1 and df3 starting, turning them into:

df1 = pd.DataFrame({
                    'Int_rate':[2, 3, 2, 2],
                    'US_GDP_Thousands':[50, 55, 65, 55],
                    'Year':[2001, 2002, 2003, 2004]
                    })

df3 = pd.DataFrame({
                    'Unemployment':[7, 8, 9, 6],
                    'Low_tier_HPI':[50, 52, 50, 53],
                    'Year':[2001, 2003, 2004, 2005]})

Here, we now have similar year columns, but different dates. df3 has 2005 but not 2002, and df1 is the reverse of that. Now, what happens when we merge?

merged = pd.merge(df1,df3, on='Year')
print(merged)

Output:

   Int_rate  US_GDP_Thousands  Year  Low_tier_HPI  Unemployment
0         2                50  2001            50             7
1         2                65  2003            52             8
2         2                55  2004            50             9

Now, a bit more usefully:

merged = pd.merge(df1,df3, on='Year')
merged.set_index('Year', inplace=True)
print(merged)

Output:

      Int_rate  US_GDP_Thousands  Low_tier_HPI  Unemployment
Year                                                        
2001         2                50            50             7
2003         2                65            52             8
2004         2                55            50             9

Notice how 2005 and 2002 are just totally missing? Merge will natively just merge existing/shared data. What can we do about this? It turns out, there is a "how" parameter when merging. This parameter reflects the merging choices that come from merging databases. You have the following choices: Left, right, outer inner.

  • Left - equal to left outer join SQL - use keys from left frame only
  • Right - right outer join from SQL- use keys from right frame only.
  • Outer - full outer join - use union of keys
  • Inner - use only intersection of keys.

merged = pd.merge(df1,df3, on='Year', how='left')
merged.set_index('Year', inplace=True)
print(merged)

Output:

      Int_rate  US_GDP_Thousands  Low_tier_HPI  Unemployment
Year                                                        
2001         2                50            50             7
2002         3                55           NaN           NaN
2003         2                65            52             8
2004         2                55            50             9

Merging on the left is literally on the left dataframe. We had df1, df3, the one on the left is the first one, df1. So, we wound up with an index that was identical to the left dataframe (df1).


merged = pd.merge(df1,df3, on='Year', how='right')
merged.set_index('Year', inplace=True)
print(merged)

Output:

      Int_rate  US_GDP_Thousands  Low_tier_HPI  Unemployment
Year                                                        
2001         2                50            50             7
2003         2                65            52             8
2004         2                55            50             9
2005       NaN               NaN            53             6

We chose right, so this time the index here is from the right (df3).


merged = pd.merge(df1,df3, on='Year', how='outer')
merged.set_index('Year', inplace=True)
print(merged)

Output:

      Int_rate  US_GDP_Thousands  Low_tier_HPI  Unemployment
Year                                                        
2001         2                50            50             7
2002         3                55           NaN           NaN
2003         2                65            52             8
2004         2                55            50             9
2005       NaN               NaN            53             6

This time, we did outer, which is a union of the keys. This means all of the indexes are shown.


merged = pd.merge(df1,df3, on='Year', how='inner')
merged.set_index('Year', inplace=True)
print(merged)

Output:

      Int_rate  US_GDP_Thousands  Low_tier_HPI  Unemployment
Year                                                        
2001         2                50            50             7
2003         2                65            52             8
2004         2                55            50             9

Finally, "inner" is the intersection of keys, basically just what is shared between all the sets. Each of these has its own reasoning, but, as you can see, the default option is "inner."

Now we can check out joining, which will join on the index, so we can do something like this:

df1.set_index('Year', inplace=True)
df3.set_index('Year', inplace=True)
joined = df1.join(df3, how="outer")
print(joined)

Output:

      Int_rate  US_GDP_Thousands  Low_tier_HPI  Unemployment
Year                                                        
2001         2                50            50             7
2002         3                55           NaN           NaN
2003         2                65            52             8
2004         2                55            50             9
2005       NaN               NaN            53             6

Alright, I think we've covered enough about combining dataframes. Let's head back to our real estate investing with our new knowledge and build ourselves an epic dataset.

The next tutorial:




  • Data Analysis with Python and Pandas Tutorial Introduction
  • Pandas Basics - p.2 Data Analysis with Python and Pandas Tutorial
  • IO Basics - p.3 Data Analysis with Python and Pandas Tutorial
  • Building dataset - p.4 Data Analysis with Python and Pandas Tutorial
  • Concatenating and Appending dataframes - p.5 Data Analysis with Python and Pandas Tutorial
  • Joining and Merging Dataframes - p.6 Data Analysis with Python and Pandas Tutorial
  • Pickling - p.7 Data Analysis with Python and Pandas Tutorial
  • Percent Change and Correlation Tables - p.8 Data Analysis with Python and Pandas Tutorial
  • Resampling - p.9 Data Analysis with Python and Pandas Tutorial
  • Handling Missing Data - p.10 Data Analysis with Python and Pandas Tutorial
  • Rolling statistics - p.11 Data Analysis with Python and Pandas Tutorial
  • Applying Comparison Operators to DataFrame - p.12 Data Analysis with Python and Pandas Tutorial
  • Joining 30 year mortgage rate - p.13 Data Analysis with Python and Pandas Tutorial
  • Adding other economic indicators - p.14 Data Analysis with Python and Pandas Tutorial
  • Rolling Apply and Mapping Functions - p.15 Data Analysis with Python and Pandas Tutorial
  • Scikit Learn Incorporation - p.16 Data Analysis with Python and Pandas Tutorial