In this Data Analysis with Pandas and Python tutorial series, we're going to show how quickly we can take our Pandas DataFrame and convert it to, for example, a NumPy array, which can then be fed into a variety of other data analysis Python modules. The example we're going to use here is Scikit-learn, or sklearn. In order to do this, you will need to install it:
pip install scikit-learn
From here, we're almost done. For machine learning to take place, at least in the supervised form, we need only a couple of things. First, we need "features." In our case, features are things like the current HPI, maybe the GDP, and so on. Then you have "labels." A label is assigned to each feature "set," where a feature set is the collective GDP, HPI, and so on at any given point in time. Our label, in this case, is either a 1 or a 0, where 1 means the HPI increased in the future, and 0 means it did not.
It should probably go without saying, but I will note: you should not include the "future HPI" column as a feature. If you did, the algorithm would effectively read the answer straight out of the features, producing a very high accuracy that would be impossible to reproduce in the real world, where the future HPI is not known at prediction time.
The previous tutorial's code ended something like:
import Quandl
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import style
import numpy as np
from statistics import mean

style.use('fivethirtyeight')

# Not necessary, I just do this so I do not show my API key.
api_key = open('quandlapikey.txt','r').read()

def create_labels(cur_hpi, fut_hpi):
    if fut_hpi > cur_hpi:
        return 1
    else:
        return 0

def moving_average(values):
    return mean(values)

housing_data = pd.read_pickle('HPI.pickle')
housing_data = housing_data.pct_change()

housing_data.replace([np.inf, -np.inf], np.nan, inplace=True)
housing_data['US_HPI_future'] = housing_data['United States'].shift(-1)
housing_data.dropna(inplace=True)
#print(housing_data[['US_HPI_future','United States']].head())
housing_data['label'] = list(map(create_labels, housing_data['United States'], housing_data['US_HPI_future']))
#print(housing_data.head())
housing_data['ma_apply_example'] = pd.rolling_apply(housing_data['M30'], 10, moving_average)
print(housing_data.tail())
Next, we're going to add some new imports:
from sklearn import svm, preprocessing, cross_validation
We're going to use the svm (support vector machine) module for our machine learning classifier. preprocessing is used to adjust our dataset: machine learning is typically a bit more accurate when your features are scaled to a small, consistent range (preprocessing.scale standardizes each feature to zero mean and unit variance). This is not always true, though, so it's a good idea to check accuracy both with and without the scaling, to be safe. cross_validation is the module we'll be using to create our training and testing sets (in newer scikit-learn releases it has been renamed model_selection). It's just a nice way to automatically, and randomly, sample out your data for training and testing purposes.
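To make the scaling concrete, here is a minimal sketch with made-up numbers showing what preprocessing.scale does: each feature column is standardized to zero mean and unit variance, so columns on wildly different scales end up comparable.

```python
import numpy as np
from sklearn import preprocessing

# Toy feature matrix (made-up values) with very different column scales.
features = np.array([[100.0, 0.01],
                     [200.0, 0.02],
                     [300.0, 0.03]])

scaled = preprocessing.scale(features)

# Each column now has mean ~0 and standard deviation ~1.
print(scaled.mean(axis=0))
print(scaled.std(axis=0))
```

Note that the scaled values are not strictly bounded to [-1, 1]; outliers can land outside that range.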
Now, we can create our features and our labels for training/testing:
X = np.array(housing_data.drop(['label', 'US_HPI_future'], axis=1))
X = preprocessing.scale(X)
Generally, with features and labels, you have X, y. The uppercase X is used to denote a feature set, and y is the label. What we've done here is define the feature set as the NumPy array (this just converts the dataframe's contents to a multi-dimensional array) of the housing_data dataframe's contents, with the "label" and the "US_HPI_future" columns removed.
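As a quick illustration, here is the same drop-and-convert step on a miniature, made-up stand-in for housing_data, so you can see what shape X ends up with:

```python
import numpy as np
import pandas as pd

# Miniature stand-in for housing_data (made-up values).
df = pd.DataFrame({'United States': [0.01, 0.02, 0.015],
                   'M30': [0.003, -0.001, 0.002],
                   'US_HPI_future': [0.02, 0.015, 0.018],
                   'label': [1, 0, 1]})

# Drop the label and future columns, then convert to a NumPy array.
X = np.array(df.drop(['label', 'US_HPI_future'], axis=1))
print(X.shape)  # (3, 2): three rows, two feature columns
```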
y = np.array(housing_data['label'])
Now our labels are defined, and we're ready to split up our data into training and testing sets. We can do this ourselves, but we'll use the cross_validation import from earlier:
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.2)
What this does is split your features (X) and labels (y) into random training and testing groups for you. As you can see, the return values are the feature training set, the feature testing set, the label training set, and the label testing set, which we unpack into X_train, X_test, y_train, and y_test. cross_validation.train_test_split takes your features and labels as parameters, and you can also specify the testing size (test_size), which we've set to 0.2, meaning 20%.
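Here is a minimal sketch of the split on synthetic data (note that in newer scikit-learn releases train_test_split lives in model_selection rather than cross_validation), just to show how the 80/20 shapes work out:

```python
import numpy as np
from sklearn.model_selection import train_test_split  # cross_validation in older releases

# Synthetic data (made-up): 100 samples, 4 features each.
X = np.arange(400, dtype=float).reshape(100, 4)
y = np.array([0, 1] * 50)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

print(X_train.shape, X_test.shape)  # (80, 4) (20, 4)
print(y_train.shape, y_test.shape)  # (80,) (20,)
```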
Now, we can establish the classifier that we intend to use:
clf = svm.SVC(kernel='linear')
We're going to use support vector classification with a linear kernel in this example. You can learn more in the sklearn.svm.SVC documentation.
Next, we want to train our classifier:
clf.fit(X_train, y_train)
Finally, we could actually go ahead and make predictions from here, but let's test the classifier's accuracy on known data:
print(clf.score(X_test, y_test))
I am getting an average of about 70% accuracy. You may get differing results. There are many knobs to turn in machine learning: we could change some of the default parameters, or we could try some of the other algorithms, but this is decent enough for now.
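To recap the whole pipeline in a self-contained, runnable form, here is a sketch on synthetic data; the features and labels below are made up, so only the sequence of steps (scale, split, fit, score) mirrors the tutorial, not the numbers:

```python
import numpy as np
from sklearn import svm, preprocessing
from sklearn.model_selection import train_test_split

# Synthetic dataset (made-up): 200 samples, 3 features,
# with labels that depend on the first two features.
rng = np.random.RandomState(0)
X = rng.randn(200, 3)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Scale features, split 80/20, train, and score.
X = preprocessing.scale(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = svm.SVC(kernel='linear')
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)
print(accuracy)
```

Because this toy data is linearly separable by construction, the linear kernel scores much higher here than the ~70% you should expect on the real housing data.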