Concatenating and Appending dataframes - p.5 Data Analysis with Python and Pandas Tutorial



Welcome to Part 5 of our Data Analysis with Python and Pandas tutorial series. In this tutorial, we're going to be covering how to combine dataframes in a variety of ways.

In our case with real estate investing, we're hoping to take the 50 dataframes with housing data and then just combine them all into one dataframe. We do this for multiple reasons. First, it is easier and just makes sense to combine these, but also it will result in less memory being used. Every dataframe has a date and value column. This date column is repeated across all the dataframes, but really they should all just share the one, effectively nearly halving our total column count.

When combining dataframes, you might have quite a few goals in mind. For example, you may want to "append" to them, where you may be adding to the end, basically adding more rows. Or maybe you want to add more columns, like in our case. There are four major ways of combining dataframes, which we'll begin covering now. The four major ways are: Concatenation, joining, merging, and appending. We'll begin with Concatenation. Here are some starting dataframes:

import pandas as pd

df1 = pd.DataFrame({'HPI':[80,85,88,85],
                    'Int_rate':[2, 3, 2, 2],
                    'US_GDP_Thousands':[50, 55, 65, 55]},
                   index = [2001, 2002, 2003, 2004])

df2 = pd.DataFrame({'HPI':[80,85,88,85],
                    'Int_rate':[2, 3, 2, 2],
                    'US_GDP_Thousands':[50, 55, 65, 55]},
                   index = [2005, 2006, 2007, 2008])

df3 = pd.DataFrame({'HPI':[80,85,88,85],
                    'Int_rate':[2, 3, 2, 2],
                    'Low_tier_HPI':[50, 52, 50, 53]},
                   index = [2001, 2002, 2003, 2004])

Notice there are two major changes between these. df1 and df3 have the same index, but they have some different columns. df2 and df3 have different indexes and some differing columns. With concatenation, we can talk about various methods of bringing these together. Let's try a simple concatenation:

concat = pd.concat([df1,df2])
print(concat)

Output:

      HPI  Int_rate  US_GDP_Thousands
2001   80         2                50
2002   85         3                55
2003   88         2                65
2004   85         2                55
2005   80         2                50
2006   85         3                55
2007   88         2                65
2008   85         2                55

Easy enough. The major difference between these was merely a continuation of the index, but they shared the same columns. Now they have become a single dataframe. In our case, however, we're curious about adding columns, not rows. What happens when we combine some shared and some new:

concat = pd.concat([df1,df2,df3])
print(concat)

Output:

      HPI  Int_rate  Low_tier_HPI  US_GDP_Thousands
2001   80         2           NaN                50
2002   85         3           NaN                55
2003   88         2           NaN                65
2004   85         2           NaN                55
2005   80         2           NaN                50
2006   85         3           NaN                55
2007   88         2           NaN                65
2008   85         2           NaN                55
2001   80         2            50               NaN
2002   85         3            52               NaN
2003   88         2            50               NaN
2004   85         2            53               NaN

Not bad, we have some NaN (not a number), because this data didn't exist for that index, but all of our data is indeed here.

Those are the basics of concatenation, next up, let's cover appending. Appending is like the first example of concatenation, only a bit more forceful in that the dataframe will simply be appended to, adding to rows. Let's show an example of how it usually works, but also show where it could possibly go wrong:

df4 = df1.append(df2)
print(df4)

Output

      HPI  Int_rate  US_GDP_Thousands
2001   80         2                50
2002   85         3                55
2003   88         2                65
2004   85         2                55
2005   80         2                50
2006   85         3                55
2007   88         2                65
2008   85         2                55

That's what we expect with an append. In most cases, you are going to do something like this, as if you're inserting a new row in a database. Dataframes were not really made to be appended efficiently, they are meant moreso to be manipulated based on their starting data, but you can append if you need to. What happens when we append data with the same index?

df4 = df1.append(df3)
print(df4)

Output

      HPI  Int_rate  Low_tier_HPI  US_GDP_Thousands
2001   80         2           NaN                50
2002   85         3           NaN                55
2003   88         2           NaN                65
2004   85         2           NaN                55
2001   80         2            50               NaN
2002   85         3            52               NaN
2003   88         2            50               NaN
2004   85         2            53               NaN

Well, that's unfortunate. Some people ask why both concatenation and append exit. That's why. It's far more efficient here to combine these dataframes since the columns shared contain the same data and same index. One more example is to append possibly a series. You're more likely to be appending a series than whole dataframes given the nature of append. We have not spoken about series to this point. A series is basically a single-columned dataframe. A series does have an index, but, if you convert it to a list, it will be just those values. Whenever we say something like df['column'], the return is a series.

s = pd.Series([80,2,50], index=['HPI','Int_rate','US_GDP_Thousands'])
df4 = df1.append(s, ignore_index=True)
print(df4)

Output

   HPI  Int_rate  US_GDP_Thousands
0   80         2                50
1   85         3                55
2   88         2                65
3   85         2                55
4   80         2                50

We have to ignore the index when appending a series, because that is the law, unless the series has a name.

Here, we have covered Concatenating and Appending dataframes in Pandas. Next, we're going to talk about joining and merging dataframes.

The next tutorial:




  • Data Analysis with Python and Pandas Tutorial Introduction
  • Pandas Basics - p.2 Data Analysis with Python and Pandas Tutorial
  • IO Basics - p.3 Data Analysis with Python and Pandas Tutorial
  • Building dataset - p.4 Data Analysis with Python and Pandas Tutorial
  • Concatenating and Appending dataframes - p.5 Data Analysis with Python and Pandas Tutorial
  • Joining and Merging Dataframes - p.6 Data Analysis with Python and Pandas Tutorial
  • Pickling - p.7 Data Analysis with Python and Pandas Tutorial
  • Percent Change and Correlation Tables - p.8 Data Analysis with Python and Pandas Tutorial
  • Resampling - p.9 Data Analysis with Python and Pandas Tutorial
  • Handling Missing Data - p.10 Data Analysis with Python and Pandas Tutorial
  • Rolling statistics - p.11 Data Analysis with Python and Pandas Tutorial
  • Applying Comparison Operators to DataFrame - p.12 Data Analysis with Python and Pandas Tutorial
  • Joining 30 year mortgage rate - p.13 Data Analysis with Python and Pandas Tutorial
  • Adding other economic indicators - p.14 Data Analysis with Python and Pandas Tutorial
  • Rolling Apply and Mapping Functions - p.15 Data Analysis with Python and Pandas Tutorial
  • Scikit Learn Incorporation - p.16 Data Analysis with Python and Pandas Tutorial