Regression - Features and Labels

Building on the previous machine learning regression tutorial, we'll be performing regression on our stock price data. The code up to this point:

import Quandl
import pandas as pd

df = Quandl.get("WIKI/GOOGL")
df = df[['Adj. Open',  'Adj. High',  'Adj. Low',  'Adj. Close', 'Adj. Volume']]
df['HL_PCT'] = (df['Adj. High'] - df['Adj. Low']) / df['Adj. Close'] * 100.0
df['PCT_change'] = (df['Adj. Close'] - df['Adj. Open']) / df['Adj. Open'] * 100.0
df = df[['Adj. Close', 'HL_PCT', 'PCT_change', 'Adj. Volume']]
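
To make the two derived columns concrete, here is a minimal sketch that applies the same formulas to a made-up two-day price table. The `sample` DataFrame and its numbers are assumptions for illustration only, not real GOOGL data.

```python
import pandas as pd

# Hypothetical two-day price table; the values are invented for illustration.
sample = pd.DataFrame({
    'Adj. Open':  [100.0, 102.0],
    'Adj. High':  [110.0, 104.0],
    'Adj. Low':   [95.0, 100.0],
    'Adj. Close': [105.0, 101.0],
})

# Same formulas as in the tutorial code above.
# HL_PCT: the day's high-low range as a percent of the close.
sample['HL_PCT'] = (sample['Adj. High'] - sample['Adj. Low']) / sample['Adj. Close'] * 100.0
# PCT_change: the open-to-close move as a percent of the open.
sample['PCT_change'] = (sample['Adj. Close'] - sample['Adj. Open']) / sample['Adj. Open'] * 100.0

print(sample[['HL_PCT', 'PCT_change']].round(2))
```

On the first day the range is 15 against a close of 105, so HL_PCT is roughly 14.29, and the price rose 5 from an open of 100, so PCT_change is 5.0.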

The hope here is that we've grabbed data, decided on the valuable data, created some new valuable data through manipulation, and now we're ready to actually begin the machine learning process with regression. First, we're going to need a few more imports. All imports now:

import Quandl, math
import numpy as np
import pandas as pd
from sklearn import preprocessing, cross_validation, svm
from sklearn.linear_model import LinearRegression

We'll be using the numpy module to convert data to numpy arrays, which is what Scikit-learn wants. We will talk more about preprocessing and cross_validation when we get to them in the code, but preprocessing is the module used to do some cleaning/scaling of data prior to machine learning, and cross_validation is used in the testing stages. Finally, we're also importing the LinearRegression algorithm as well as svm from Scikit-learn, which we'll be using as our machine learning algorithms to demonstrate results.

At this point, we've got data that we think is useful. How does the actual machine learning thing work? With supervised learning, you have features and labels. The features are the descriptive attributes, and the label is what you're attempting to predict or forecast. Another common example with regression might be to try to predict the dollar value of an insurance policy premium for someone. The company may collect your age, past driving infractions, public criminal record, and your credit score, for example. The company will take this data from past customers and pair it with the "ideal premium" that they think should have been given to each customer, or with the premium they actually charged if they thought it was a profitable amount.

Thus, for training the machine learning model, the features are the customer attributes, and the label is the premium associated with those attributes.
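
As a rough sketch of how that insurance example might be laid out as data, the snippet below pairs one row of features with one label per past customer. Every value here, and the choice of three attributes, is hypothetical and only meant to show the shape of the data.

```python
import numpy as np

# Hypothetical past customers, one row each: [age, driving infractions, credit score]
features = np.array([
    [25, 2, 640],
    [40, 0, 720],
    [58, 1, 690],
])

# Hypothetical premiums in dollars, one per customer, in the same row order.
labels = np.array([1800.0, 950.0, 1100.0])

# Each feature row lines up with exactly one label.
for row, premium in zip(features, labels):
    print(row, '->', premium)
```

The important part is the alignment: row i of the features always belongs with label i.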

In our case, what are the features and what is the label? We're trying to predict the price, so is price the label? If so, what are the features? When it comes to forecasting out the price, our label, the thing we're hoping to predict, is actually the future price. As such, our features are actually: current price, high minus low percent, the percent change volatility, and volume. The price that is the label shall be the price at some determined point in the future. Let's go ahead and add a few new lines:

forecast_col = 'Adj. Close'
df.fillna(value=-99999, inplace=True)
forecast_out = int(math.ceil(0.01 * len(df)))

Here, we define the forecasting column, then we fill any NaN data with -99999. You have a few choices here regarding how to handle missing data. You can't just pass a NaN (Not a Number) datapoint to a machine learning algorithm; you have to handle it. One popular option is to replace missing data with -99,999. With many machine learning algorithms, this will just be recognized and treated as an outlier feature. You can also just drop all feature/label sets that contain missing data, but then you're maybe leaving a lot of data out.

In the real world, many datasets are very messy. Most stock price/volume data is pretty clean, rarely with missing data, but many datasets will have a lot of missing data. I've seen datasets where the majority of the rows contain some missing bit of info. You don't necessarily want to forfeit all of that great data. Plus, if your sample data has holes, you can probably bet your real-world use-case will also have holes. You need to train, test, and go live on data with the same characteristics.
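
The two options for missing data described above, filling with a placeholder or dropping incomplete rows, can be compared on a tiny made-up table. The `toy` DataFrame is an assumption for illustration; `fillna` and `dropna` are the standard pandas methods.

```python
import numpy as np
import pandas as pd

# Hypothetical table with two holes in it.
toy = pd.DataFrame({
    'Adj. Close':  [10.0, np.nan, 12.0, 13.0],
    'Adj. Volume': [1000.0, 1100.0, np.nan, 1300.0],
})

# Option 1: keep every row, replacing each NaN with a placeholder value.
filled = toy.fillna(value=-99999)

# Option 2: drop any row that contains at least one NaN.
dropped = toy.dropna()

print(len(toy), len(filled), len(dropped))
```

Here filling keeps all four rows, while dropping leaves only the two complete ones, which is the trade-off to weigh when a dataset has many holes.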

Finally, we define what we want to forecast out. In many cases, such as in the case of trying to predict a client's premium for insurance, you just want one number, for the "right now", but, with forecasting, you want to forecast out a certain number of datapoints. We're saying we want to forecast out 1% of the entire length of the dataset. Thus, if our data is 100 days of stock prices, we want to be able to predict the price 1 day out into the future. Choose whatever you like. If you are just trying to predict tomorrow's price, then you would just forecast 1 day out. If you predict 10 days out, we can actually generate a forecast for every day, for the next week and a half.
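
To see how the 1% rule plays out for different dataset sizes, here is a small sketch of the same arithmetic. The helper name `forecast_horizon` and its `fraction` parameter are hypothetical, introduced only for this illustration.

```python
import math

def forecast_horizon(n_rows, fraction=0.01):
    """Number of rows to forecast out: a fraction of the data, rounded up."""
    return int(math.ceil(fraction * n_rows))

# 100 rows gives a horizon of 1 row, matching the 100-day example in the text.
for n_rows in (100, 250):
    print(n_rows, 'rows ->', forecast_horizon(n_rows), 'rows out')
```

Because of the rounding up, 250 rows gives a horizon of 3 rather than 2, so even a short dataset forecasts at least one row out.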

In our case, we've decided the features are a bunch of the current values, and the label shall be the price, in the future, where the future is 1% of the entire length of the dataset out. We'll assume all current columns are our features, so we'll add a new column with a simple pandas operation:

df['label'] = df[forecast_col].shift(-forecast_out)
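
What the negative shift does is easiest to see on a tiny example. This sketch uses a made-up five-row price column and a `forecast_out` of 2, both assumptions for illustration.

```python
import pandas as pd

# Hypothetical five days of closing prices.
toy = pd.DataFrame({'Adj. Close': [10.0, 11.0, 12.0, 13.0, 14.0]})
forecast_out = 2  # assumed horizon for this toy example

# Shifting by a negative amount moves values up the column,
# so each row's label is the close from forecast_out rows later.
toy['label'] = toy['Adj. Close'].shift(-forecast_out)

print(toy)
```

The first row's label is 12.0, the close two rows later. The last two rows get NaN labels, because their future prices are not in the data yet, so those rows have features but nothing to train against.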

Now we have the data that comprises our features and labels. Next, we need to do some preprocessing and final steps before actually running everything, which is what we will be focusing on in the next tutorial.
