Building on the previous machine learning regression tutorial, we'll be performing regression on our stock price data. The code up to this point:
import Quandl
import pandas as pd

df = Quandl.get("WIKI/GOOGL")
df = df[['Adj. Open', 'Adj. High', 'Adj. Low', 'Adj. Close', 'Adj. Volume']]
df['HL_PCT'] = (df['Adj. High'] - df['Adj. Low']) / df['Adj. Close'] * 100.0
df['PCT_change'] = (df['Adj. Close'] - df['Adj. Open']) / df['Adj. Open'] * 100.0
df = df[['Adj. Close', 'HL_PCT', 'PCT_change', 'Adj. Volume']]
print(df.head())
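To make the two derived columns concrete, here is a minimal sketch that applies the same formulas to a single made-up row. The prices are invented for illustration and are not real GOOGL data; `row` is just a name chosen for this example.

```python
import pandas as pd

# One invented trading day; these prices are NOT real GOOGL data.
row = pd.DataFrame({
    "Adj. Open": [98.0],
    "Adj. High": [105.0],
    "Adj. Low": [95.0],
    "Adj. Close": [100.0],
})

# High-minus-low range as a percentage of the close.
row["HL_PCT"] = (row["Adj. High"] - row["Adj. Low"]) / row["Adj. Close"] * 100.0

# Open-to-close move as a percentage of the open.
row["PCT_change"] = (row["Adj. Close"] - row["Adj. Open"]) / row["Adj. Open"] * 100.0

print(row[["HL_PCT", "PCT_change"]])
```

On this row, the high-low range works out to 10% of the close, and the close sits roughly 2.04% above the open.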
The hope here is that we've grabbed data, decided on the valuable data, created some new valuable data through manipulation, and now we're ready to actually begin the machine learning process with regression. First, we're going to need a few more imports. All imports now:
import Quandl, math
import numpy as np
import pandas as pd
from sklearn import preprocessing, cross_validation, svm
from sklearn.linear_model import LinearRegression
We'll be using the numpy module to convert data to numpy arrays, which is what Scikit-learn wants. We will talk more about preprocessing and cross_validation when we get to them in the code, but preprocessing is the module used to do some cleaning/scaling of data prior to machine learning, and cross_validation is used in the testing stages. Finally, we're also importing the LinearRegression algorithm as well as svm from Scikit-learn, which we'll be using as our machine learning algorithms to demonstrate results.
At this point, we've got data that we think is useful. How does the actual machine learning part work? With supervised learning, you have features and labels. The features are the descriptive attributes, and the label is what you're attempting to predict or forecast. Another common regression example is predicting the dollar value of someone's insurance policy premium. The company may collect your age, past driving infractions, public criminal record, and your credit score, for example. The company will take this data for past customers and pair it with the "ideal premium" it thinks each customer should have been given, or with the premium it actually charged if that turned out to be a profitable amount.
Thus, for training the machine learning model, the features are the customer attributes, and the label is the premium associated with those attributes.
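As a rough sketch of how the insurance example splits into features and a label, here is a tiny table of made-up customers. Every number and column name below is invented for illustration; `X` and `y` are just the conventional names for features and labels.

```python
import pandas as pd

# Hypothetical past customers; all values are invented for illustration.
customers = pd.DataFrame({
    "age": [25, 40, 58],
    "infractions": [2, 0, 1],
    "credit_score": [640, 720, 700],
    "premium": [1800.0, 950.0, 1100.0],  # what we want to predict
})

# Features: every descriptive attribute except the thing we predict.
X = customers.drop(columns=["premium"])

# Label: the premium associated with each row of attributes.
y = customers["premium"]

print(X.shape, y.shape)
```

Each row of `X` lines up with one entry of `y`, which is the pairing a supervised learning algorithm trains on.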
In our case, what are the features and what is the label? We're trying to predict the price, so is price the label? If so, what are the features? When it comes to forecasting out the price, our label, the thing we're hoping to predict, is actually the future price. As such, our features are actually the current values: current price, high minus low percent, percent change, and volume. The price that is the label shall be the price at some determined point in the future. Let's go ahead and add a few new lines of code:
forecast_col = 'Adj. Close'
df.fillna(value=-99999, inplace=True)
forecast_out = int(math.ceil(0.01 * len(df)))
Here, we define the forecasting column, then we fill any NaN data with -99999. You have a few choices here regarding how to handle missing data. You can't just pass a NaN (Not a Number) datapoint to a machine learning algorithm; you have to handle it. One popular option is to replace missing data with -99,999. With many machine learning algorithms, this will just be recognized and treated as an outlier feature. You can also just drop all feature/label sets that contain missing data, but then you may be leaving a lot of data out.
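To compare the two options side by side, here is a small sketch on a toy DataFrame. The data is made up, and the names `toy`, `filled`, and `dropped` are only for this example.

```python
import numpy as np
import pandas as pd

# Toy data (invented for illustration) with one missing price.
toy = pd.DataFrame({
    "price": [10.0, np.nan, 12.0, 13.0],
    "volume": [100.0, 110.0, 120.0, 130.0],
})

# Option 1: replace NaN with a sentinel value, keeping every row.
filled = toy.fillna(value=-99999)

# Option 2: drop any row containing a NaN, which also throws away
# that row's perfectly good volume value.
dropped = toy.dropna()

print(len(filled), len(dropped))
```

Filling keeps all four rows, while dropping leaves three, which is the trade-off described above.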
In the real world, many data sets are very messy. Most stock price/volume data is pretty clean, rarely with missing data, but many datasets will have a lot of missing data. I've seen datasets where the majority of the rows contain some missing bit of info. You don't necessarily want to forfeit all of that great data. Plus, if your sample data has holes, you can probably bet your real-world use case will also have holes. You need to train, test, and go live on data with the same characteristics.
Finally, we define how far out we want to forecast. In many cases, such as trying to predict a client's insurance premium, you just want one number for "right now," but with forecasting, you want to forecast out a certain number of datapoints. We're saying we want to forecast out 1% of the entire length of the dataset. Thus, if our data is 100 days of stock prices, we want to be able to predict the price 1 day out into the future. Choose whatever you like. If you are just trying to predict tomorrow's price, then you would forecast just 1 day out. If you forecast 10 days out, we can actually generate a forecast for every day for the next week and a half.
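The arithmetic behind forecast_out can be sketched with a small helper. `forecast_horizon` is a hypothetical function written for this example, not part of the tutorial's code; it just wraps the same `math.ceil` expression.

```python
import math

def forecast_horizon(n_rows, fraction=0.01):
    """Rows to forecast out: a fraction of the dataset length, rounded up.

    Hypothetical helper for illustration; mirrors
    int(math.ceil(0.01 * len(df))) from the tutorial code.
    """
    return int(math.ceil(fraction * n_rows))

print(forecast_horizon(100))  # 100 rows of data
print(forecast_horizon(250))  # 250 rows of data
```

With 100 rows this gives 1, matching the example above. With 250 rows, 1% is 2.5, and rounding up gives 3, so the horizon is always a whole number of rows.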
In our case, we've decided the features are a bunch of the current values, and the label shall be the price, in the future, where the future is 1% of the entire length of the dataset out. We'll assume all current columns are our features, so we'll add a new column with a simple pandas operation:
df['label'] = df[forecast_col].shift(-forecast_out)
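To see what the negative shift does, here is a minimal sketch on a five-row toy column with a forecast horizon of 2. The prices and the horizon are made up for illustration.

```python
import pandas as pd

# Toy closing prices (invented) and a small horizon for readability.
toy = pd.DataFrame({"Adj. Close": [10.0, 11.0, 12.0, 13.0, 14.0]})
forecast_out = 2

# Shift the column UP by forecast_out rows, so each row's label is
# the price forecast_out rows into the future.
toy["label"] = toy["Adj. Close"].shift(-forecast_out)

print(toy)
```

The first row's label is 12.0, the price two rows later. The last `forecast_out` rows get NaN labels, since there is no future price to pull from yet, so those rows need to be handled before training.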
Now we have the data that comprises our features and labels. Next, we need to do some preprocessing and final steps before actually running everything, which is what we will be focusing on in the next tutorial.