This data analysis with Python and Pandas tutorial is going to cover two topics. First, within the context of machine learning, we need a way to create "labels" for our data. Second, we're going to cover mapping functions and the rolling apply capability with Pandas.
Creating labels is essential for the supervised machine learning process, as it is used to "teach" or train the machine correct answers that are associated with features.
Mapping functions to a Pandas Dataframe is useful, to write custom formulas that you wish to apply to the entire dataframe, a certain column, or to create a new column. If you recall, a while back, we made new columns by doing something like df['Column2'] = df['Column1']*1.5, and so on. If you wanted to create far more logically intense operations, however, you would want to write a function. We will show how to do that.
Since mapping functions is one of the two major ways that users can dramatically customize what Pandas can do, we might as well cover the second major way, which is with rolling_apply. This allows us to do a moving window application of a function. We will just write a moving average function, but you could do just about anything you wanted.
To start, we will have some code like:
import Quandl import pandas as pd import matplotlib.pyplot as plt from matplotlib import style import numpy as np from statistics import mean style.use('fivethirtyeight') housing_data = pd.read_pickle('HPI.pickle') housing_data = housing_data.pct_change()
First, we're going to lead in the dataset, and then convert all columns to percent change. This will help us to normalize all of the data.
Next:
housing_data.replace([np.inf, -np.inf], np.nan, inplace=True) housing_data['US_HPI_future'] = housing_data['United States'].shift(-1)
Here, we replace the infinity values with nan values first. Next, we create a new column, which contains the future HPI. We can do this with a new method: .shift(). This method will shift the column in question. Shifting by -1 means we're shifting down, so the value for the next point is moved back. This is our quick way of having the current value, and the next period's value on the same row for easy comparison.
Next up, we will have some NaN data from both the percent change application and the shift, so we need to do:
housing_data.dropna(inplace=True)
Now, we want to do some sort of label creation. There are actually a couple of ways that we could do this, but, since this is a tutorial, let's cover function mapping. There is a one-liner that we could do to achieve this same result, but I don't tend to buy the hype on crazy one-liners. To map a function, it's quite simple. You just simply build a function with reasonable logic, like so:
def create_labels(cur_hpi, fut_hpi): if fut_hpi > cur_hpi: return 1 else: return 0
Here, we're obviously passing the current HPI and the future HPI columns. If the future HPI is higher than the current, this means prices went up, and we are going to return a 1. This is going to be our label. If the future HPI is not greater than the current, then we return a simple 0. To map this function, we can do something like:
housing_data['label'] = list(map(create_labels,housing_data['United States'], housing_data['US_HPI_future']))
This might look like a confusing one-liner, but it doesn't need to be. It breaks down to:
new_column = list(map( function_to_map, parameter1, parameter2, ... ))
That's all there is to it, and you can continue adding more and more parameters.
print(housing_data.head())
AL AK AZ AR CA CO \ Date 1990-03-31 0.003628 0.062548 -0.003033 0.005570 0.007152 0.000963 1990-04-30 0.006277 0.095081 -0.002126 0.005257 0.005569 -0.000318 1990-05-31 0.007421 0.112105 0.001513 0.005635 0.002409 0.004512 1990-06-30 0.004930 0.100642 0.004353 0.006238 0.003569 0.007884 1990-07-31 0.000436 0.067064 0.003322 0.006173 0.004351 0.004374 CT DE FL GA ... WV WI \ Date ... 1990-03-31 -0.009234 0.002786 -0.001259 -0.007290 ... 0.013441 0.015638 1990-04-30 -0.010818 0.000074 0.002675 -0.002477 ... 0.015765 0.015926 1990-05-31 -0.010963 -0.000692 0.004656 0.002808 ... 0.017085 0.012106 1990-06-30 -0.007302 -0.001542 0.003710 0.002857 ... 0.016638 0.010545 1990-07-31 -0.003439 -0.004680 0.003116 0.002276 ... 0.011129 0.009425 WY United States M30 Unemployment Rate GDP \ Date 1990-03-31 0.009831 0.004019 0.090909 0.035714 -0.234375 1990-04-30 0.016868 0.004957 0.119048 -0.068966 4.265306 1990-05-31 0.026130 0.005260 0.117021 0.000000 -1.092539 1990-06-30 0.029359 0.005118 -0.304762 0.074074 3.115183 1990-07-31 0.023640 0.003516 -0.164384 -0.103448 0.441476 sp500 US_HPI_future label Date 1990-03-31 0.030790 0.004957 1 1990-04-30 -0.001070 0.005260 1 1990-05-31 0.045054 0.005118 0 1990-06-30 0.036200 0.003516 0 1990-07-31 -0.001226 0.000395 0 [5 rows x 57 columns]
Next, let's show a custom way to apply a moving-window function. We're going to just do a simple moving average example:
def moving_average(values): ma = mean(values) return ma
There's our function, notice that we just pass the "values" parameter. We do not need to code any sort of "window" or "time-frame" handling, Pandas will handle that for us.
Now, you can use rolling_apply:
housing_data['ma_apply_example'] = pd.rolling_apply(housing_data['M30'], 10, moving_average)
print(housing_data.tail())
AL AK AZ AR CA CO \ Date 2011-07-31 -0.003545 -0.004337 0.002217 0.003215 -0.005579 0.004794 2011-08-31 -0.006886 -0.007139 0.004283 0.000275 -0.007782 0.001058 2011-09-30 -0.011103 -0.007609 0.003190 0.000505 -0.006537 -0.004569 2011-10-31 -0.013189 -0.007754 0.000541 0.001059 -0.005390 -0.009231 2011-11-30 -0.008055 -0.006551 0.005119 -0.000856 -0.003570 -0.010812 CT DE FL GA ... \ Date ... 2011-07-31 -0.002806 -0.001084 -0.001531 -0.003036 ... 2011-08-31 -0.010243 -0.002133 0.001438 -0.006488 ... 2011-09-30 -0.012240 -0.004171 0.002307 -0.013116 ... 2011-10-31 -0.013075 -0.006204 -0.001566 -0.021542 ... 2011-11-30 -0.012776 -0.008252 -0.006211 -0.022371 ... WI WY United States M30 Unemployment Rate \ Date 2011-07-31 -0.002068 0.001897 -0.000756 -0.008130 0.000000 2011-08-31 -0.006729 -0.002080 -0.005243 0.057377 0.000000 2011-09-30 -0.011075 -0.006769 -0.007180 0.031008 -0.100000 2011-10-31 -0.015025 -0.008818 -0.008293 0.007519 -0.111111 2011-11-30 -0.014445 -0.006293 -0.008541 0.014925 -0.250000 GDP sp500 US_HPI_future label ma_apply_example Date 2011-07-31 0.024865 0.031137 -0.005243 0 -0.003390 2011-08-31 0.022862 -0.111461 -0.007180 0 -0.000015 2011-09-30 -0.039361 -0.010247 -0.008293 0 0.004432 2011-10-31 0.018059 0.030206 -0.008541 0 0.013176 2011-11-30 0.000562 0.016886 -0.009340 0 0.015728 [5 rows x 58 columns]