Welcome to Part 6 of the Data Analysis with Python and Pandas tutorial series. In this part, we're going to talk about joining and merging dataframes, as another method of combining dataframes. In the previous tutorial, we covered concatenation and appending.
First we will start with some sample dataframes like before, with one change:
import pandas as pd df1 = pd.DataFrame({'HPI':[80,85,88,85], 'Int_rate':[2, 3, 2, 2], 'US_GDP_Thousands':[50, 55, 65, 55]}, index = [2001, 2002, 2003, 2004]) df2 = pd.DataFrame({'HPI':[80,85,88,85], 'Int_rate':[2, 3, 2, 2], 'US_GDP_Thousands':[50, 55, 65, 55]}, index = [2005, 2006, 2007, 2008]) df3 = pd.DataFrame({'HPI':[80,85,88,85], 'Unemployment':[7, 8, 9, 6], 'Low_tier_HPI':[50, 52, 50, 53]}, index = [2001, 2002, 2003, 2004])
The only major change is in df3, where we change Int_rate to unemployment. First, let's discuss merging.
print(pd.merge(df1,df3, on='HPI'))
Output:
HPI Int_rate US_GDP_Thousands Low_tier_HPI Unemployment 0 80 2 50 50 7 1 85 3 55 52 8 2 85 3 55 53 6 3 85 2 55 52 8 4 85 2 55 53 6 5 88 2 65 50 9
So, here, we saw how there was a shared column (HPI). You can share on multiple columns, here's an example of that:
print(pd.merge(df1,df2, on=['HPI','Int_rate']))
Output:
HPI Int_rate US_GDP_Thousands_x US_GDP_Thousands_y 0 80 2 50 50 1 85 3 55 55 2 88 2 65 65 3 85 2 55 55
Notice here there are two versions of the US_GDP_Thousands. This is because we didn't share on these columns, so both are retained, with another letter to differentiate. Remember before I was saying that Pandas is a great module to marry to a database like mysql? here's why.
Generally, with databases, you want to keep them as lightweight as possible so the queries that run on them can execute as fast as possible.
Let's say you run a website like pythonprogramming.net, where you have users, so you definitely want to track username and encrypted password hashes, so that's 2 columns for sure. Maybe then you have a login name, a username, a password, an email and a join date. So that's already 5 columns with basic data points. Then maybe you have something like user settings, posts if you have a forum, completed tutorials. Then maybe you want to have settings like admin, moderator, regular user.
The lists can go on and on. If you have literally just 1 massive table, this can work, but it might also be better to distribute the table, since many operations will simply be much quicker and more efficient. After mergin, you would probably set the new index. Something like this:
df4 = pd.merge(df1,df3, on='HPI') df4.set_index('HPI', inplace=True) print(df4)
Output:
Int_rate US_GDP_Thousands Low_tier_HPI Unemployment HPI 80 2 50 50 7 85 3 55 52 8 85 3 55 53 6 85 2 55 52 8 85 2 55 53 6 88 2 65 50 9
Now, what if HPI was already the index? Or, in our case, We'll probably be joining on the dates, but the dates might be the index. In this case, we'd probably use join.
df1.set_index('HPI', inplace=True) df3.set_index('HPI', inplace=True) joined = df1.join(df3) print(joined)
Output:
Int_rate US_GDP_Thousands Low_tier_HPI Unemployment HPI 80 2 50 50 7 85 3 55 52 8 85 3 55 53 6 85 2 55 52 8 85 2 55 53 6 88 2 65 50 9
Now, let's consider joining and merging on slightly differing indexes. Let's redefine dataframes with df1 and df3 starting, turning them into:
df1 = pd.DataFrame({ 'Int_rate':[2, 3, 2, 2], 'US_GDP_Thousands':[50, 55, 65, 55], 'Year':[2001, 2002, 2003, 2004] }) df3 = pd.DataFrame({ 'Unemployment':[7, 8, 9, 6], 'Low_tier_HPI':[50, 52, 50, 53], 'Year':[2001, 2003, 2004, 2005]})
Here, we now have similar year columns, but different dates. df3 has 2005 but not 2002, and df1 is the reverse of that. Now, what happens when we merge?
merged = pd.merge(df1,df3, on='Year') print(merged)
Output:
Int_rate US_GDP_Thousands Year Low_tier_HPI Unemployment 0 2 50 2001 50 7 1 2 65 2003 52 8 2 2 55 2004 50 9
Now, a bit more usefully:
merged = pd.merge(df1,df3, on='Year') merged.set_index('Year', inplace=True) print(merged)
Output:
Int_rate US_GDP_Thousands Low_tier_HPI Unemployment Year 2001 2 50 50 7 2003 2 65 52 8 2004 2 55 50 9
Notice how 2005 and 2002 are just totally missing? Merge will natively just merge existing/shared data. What can we do about this? It turns out, there is a "how" parameter when merging. This parameter reflects the merging choices that come from merging databases. You have the following choices: Left, right, outer inner.
merged = pd.merge(df1,df3, on='Year', how='left') merged.set_index('Year', inplace=True) print(merged)
Output:
Int_rate US_GDP_Thousands Low_tier_HPI Unemployment Year 2001 2 50 50 7 2002 3 55 NaN NaN 2003 2 65 52 8 2004 2 55 50 9
Merging on the left is literally on the left dataframe. We had df1, df3, the one on the left is the first one, df1. So, we wound up with an index that was identical to the left dataframe (df1).
merged = pd.merge(df1,df3, on='Year', how='right') merged.set_index('Year', inplace=True) print(merged)
Output:
Int_rate US_GDP_Thousands Low_tier_HPI Unemployment Year 2001 2 50 50 7 2003 2 65 52 8 2004 2 55 50 9 2005 NaN NaN 53 6
We chose right, so this time the index here is from the right (df3).
merged = pd.merge(df1,df3, on='Year', how='outer') merged.set_index('Year', inplace=True) print(merged)
Output:
Int_rate US_GDP_Thousands Low_tier_HPI Unemployment Year 2001 2 50 50 7 2002 3 55 NaN NaN 2003 2 65 52 8 2004 2 55 50 9 2005 NaN NaN 53 6
This time, we did outer, which is a union of the keys. This means all of the indexes are shown.
merged = pd.merge(df1,df3, on='Year', how='inner') merged.set_index('Year', inplace=True) print(merged)
Output:
Int_rate US_GDP_Thousands Low_tier_HPI Unemployment Year 2001 2 50 50 7 2003 2 65 52 8 2004 2 55 50 9
Finally, "inner" is the intersection of keys, basically just what is shared between all the sets. Each of these has its own reasoning, but, as you can see, the default option is "inner."
Now we can check out joining, which will join on the index, so we can do something like this:
df1.set_index('Year', inplace=True) df3.set_index('Year', inplace=True) joined = df1.join(df3, how="outer") print(joined)
Output:
Int_rate US_GDP_Thousands Low_tier_HPI Unemployment Year 2001 2 50 50 7 2002 3 55 NaN NaN 2003 2 65 52 8 2004 2 55 50 9 2005 NaN NaN 53 6
Alright, I think we've covered enough about combining dataframes. Let's head back to our real estate investing with our new knowledge and build ourselves an epic dataset.