Welcome to Part 5 of our Data Analysis with Python and Pandas tutorial series. In this tutorial, we're going to be covering how to combine dataframes in a variety of ways.
In our case with real estate investing, we're hoping to take the 50 dataframes with housing data and then just combine them all into one dataframe. We do this for multiple reasons. First, it is easier and just makes sense to combine these, but also it will result in less memory being used. Every dataframe has a date and value column. This date column is repeated across all the dataframes, but really they should all just share the one, effectively nearly halving our total column count.
When combining dataframes, you might have quite a few goals in mind. For example, you may want to "append" to them, where you may be adding to the end, basically adding more rows. Or maybe you want to add more columns, like in our case. There are four major ways of combining dataframes, which we'll begin covering now. The four major ways are: Concatenation, joining, merging, and appending. We'll begin with Concatenation. Here are some starting dataframes:
import pandas as pd df1 = pd.DataFrame({'HPI':[80,85,88,85], 'Int_rate':[2, 3, 2, 2], 'US_GDP_Thousands':[50, 55, 65, 55]}, index = [2001, 2002, 2003, 2004]) df2 = pd.DataFrame({'HPI':[80,85,88,85], 'Int_rate':[2, 3, 2, 2], 'US_GDP_Thousands':[50, 55, 65, 55]}, index = [2005, 2006, 2007, 2008]) df3 = pd.DataFrame({'HPI':[80,85,88,85], 'Int_rate':[2, 3, 2, 2], 'Low_tier_HPI':[50, 52, 50, 53]}, index = [2001, 2002, 2003, 2004])
Notice there are two major changes between these. df1 and df3 have the same index, but they have some different columns. df2 and df3 have different indexes and some differing columns. With concatenation, we can talk about various methods of bringing these together. Let's try a simple concatenation:
concat = pd.concat([df1,df2]) print(concat)
Output:
HPI Int_rate US_GDP_Thousands 2001 80 2 50 2002 85 3 55 2003 88 2 65 2004 85 2 55 2005 80 2 50 2006 85 3 55 2007 88 2 65 2008 85 2 55
Easy enough. The major difference between these was merely a continuation of the index, but they shared the same columns. Now they have become a single dataframe. In our case, however, we're curious about adding columns, not rows. What happens when we combine some shared and some new:
concat = pd.concat([df1,df2,df3]) print(concat)
Output:
HPI Int_rate Low_tier_HPI US_GDP_Thousands 2001 80 2 NaN 50 2002 85 3 NaN 55 2003 88 2 NaN 65 2004 85 2 NaN 55 2005 80 2 NaN 50 2006 85 3 NaN 55 2007 88 2 NaN 65 2008 85 2 NaN 55 2001 80 2 50 NaN 2002 85 3 52 NaN 2003 88 2 50 NaN 2004 85 2 53 NaN
Not bad, we have some NaN (not a number), because this data didn't exist for that index, but all of our data is indeed here.
Those are the basics of concatenation, next up, let's cover appending. Appending is like the first example of concatenation, only a bit more forceful in that the dataframe will simply be appended to, adding to rows. Let's show an example of how it usually works, but also show where it could possibly go wrong:
df4 = df1.append(df2) print(df4)
Output
HPI Int_rate US_GDP_Thousands 2001 80 2 50 2002 85 3 55 2003 88 2 65 2004 85 2 55 2005 80 2 50 2006 85 3 55 2007 88 2 65 2008 85 2 55
That's what we expect with an append. In most cases, you are going to do something like this, as if you're inserting a new row in a database. Dataframes were not really made to be appended efficiently, they are meant moreso to be manipulated based on their starting data, but you can append if you need to. What happens when we append data with the same index?
df4 = df1.append(df3) print(df4)
Output
HPI Int_rate Low_tier_HPI US_GDP_Thousands 2001 80 2 NaN 50 2002 85 3 NaN 55 2003 88 2 NaN 65 2004 85 2 NaN 55 2001 80 2 50 NaN 2002 85 3 52 NaN 2003 88 2 50 NaN 2004 85 2 53 NaN
Well, that's unfortunate. Some people ask why both concatenation and append exit. That's why. It's far more efficient here to combine these dataframes since the columns shared contain the same data and same index. One more example is to append possibly a series. You're more likely to be appending a series than whole dataframes given the nature of append. We have not spoken about series to this point. A series is basically a single-columned dataframe. A series does have an index, but, if you convert it to a list, it will be just those values. Whenever we say something like df['column'], the return is a series.
s = pd.Series([80,2,50], index=['HPI','Int_rate','US_GDP_Thousands']) df4 = df1.append(s, ignore_index=True) print(df4)
Output
HPI Int_rate US_GDP_Thousands 0 80 2 50 1 85 3 55 2 88 2 65 3 85 2 55 4 80 2 50
We have to ignore the index when appending a series, because that is the law, unless the series has a name.
Here, we have covered Concatenating and Appending dataframes in Pandas. Next, we're going to talk about joining and merging dataframes.