Regression - Training and Testing

Welcome to part four of the Machine Learning with Python tutorial series. In the previous tutorials, we got our initial data, we transformed and manipulated it a bit to our liking, and then we began to define our features. Scikit-Learn does not fundamentally need to work with Pandas and dataframes, I just prefer to do my data-handling with it, as it is fast and efficient. Instead, Scikit-learn actually fundamentally requires numpy arrays. Pandas dataframes can be easily converted to NumPy arrays, so it just so happens to work out for us!

Our code up to this point:

import Quandl, math
import numpy as np
import pandas as pd
from sklearn import preprocessing, cross_validation, svm
from sklearn.linear_model import LinearRegression

df = Quandl.get("WIKI/GOOGL")


df = df[['Adj. Open',  'Adj. High',  'Adj. Low',  'Adj. Close', 'Adj. Volume']]

df['HL_PCT'] = (df['Adj. High'] - df['Adj. Low']) / df['Adj. Close'] * 100.0
df['PCT_change'] = (df['Adj. Close'] - df['Adj. Open']) / df['Adj. Open'] * 100.0

df = df[['Adj. Close', 'HL_PCT', 'PCT_change', 'Adj. Volume']]

forecast_col = 'Adj. Close'
df.fillna(value=-99999, inplace=True)
forecast_out = int(math.ceil(0.01 * len(df)))

df['label'] = df[forecast_col].shift(-forecast_out)

We'll then drop any still NaN information from the dataframe:


It is a typical standard with machine learning in code to define X (capital x), as the features, and y (lowercase y) as the label that corresponds to the features. As such, we can define our features and labels like so:

X = np.array(df.drop(['label'], 1))
y = np.array(df['label'])

Above, what we've done, is defined X (features), as our entire dataframe EXCEPT for the label column, converted to a numpy array. We do this using the .drop method that can be applied to dataframes, which returns a new dataframe. Next, we define our y variable, which is our label, as simply the label column of the dataframe, converted to a numpy array.

We could leave it at this, and move on to training and testing, but we're going to do some pre-processing. Generally, you want your features in machine learning to be in a range of -1 to 1. This may do nothing, but it usually speeds up processing and can also help with accuracy. Because this range is so popularly used, it is included in the preprocessing module of Scikit-Learn. To utilize this, you can apply preprocessing.scale to your X variable:

X = preprocessing.scale(X)

Next, create the label, y:

y = np.array(df['label'])

Now comes the training and testing. The way this works is you take, for example, 75% of your data, and use this to train the machine learning classifier. Then you take the remaining 25% of your data, and test the classifier. Since this is your sample data, you should have the features and known labels. Thus, if you test on the last 25% of your data, you can get a sort of accuracy and reliability, often called the confidence score. There are many ways to do this, but, probably the best way is using the build in cross_validation provided, since this also shuffles your data for you. The code to do this:

X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.2)

The return here is the training set of features, testing set of features, training set of labels, and testing set of labels. Now, we're ready to define our classifier. There are many classifiers in general available through Scikit-Learn, and even a few specifically for regression. We'll show a couple in this example, but for now, let's use Support Vector Regression from Scikit-Learn's svm package:

clf = svm.SVR()

We're just going to use all of the defaults to keep things simple here, but you can learn much more about Support Vector Regression in the sklearn.svm.SVR documentation.

Once you have defined the classifer, you're ready to train it. With Scikit-Learn (sklearn), you train with .fit:, y_train)

Here, we're "fitting" our training features and training labels.

Our classifier is now trained. Wow that was easy. Now we can test it!

confidence = clf.score(X_test, y_test)

Boom tested, and then:


So here, we can see the accuracy is about 96%. Nothing to write home about. Let's try another classifier, this time using LinearRegression from sklearn:

clf = LinearRegression()

A bit better, but basically the same. So how might we know, as scientists, which algorithm to choose? After a while, you will get used to what works in most situations and what doesn't. You can also check out: choosing the right estimator from scikit-learn's website. This can help you walk through some basic choices. If you ask people who use machine learning often though, it's really trial and error. You will try a handful of algorithms and simply go with the one that works best. Another thing to note is that some of the algorithms must run linearly, others not. Do not confuse linear regression with the requirement to run linearly, by the way. So what does that all mean? Some of the machine learning algorithms here will process one step at a time, with no threading, others can thread and use all the CPU cores you have available. You could learn a lot about each algorithm to figure out which ones can thread, or you can visit the documentation, and look for the n_jobs parameter. If it has n_jobs, you have an algorithm that can be threaded for high performance. If not, tough luck! Thus, if you are processing massive amounts of data, or you need to process medium data but at a very high rate of speed, then you would want something threaded. Let's check for our two algorithms.

Heading to the docs for sklearn.svm.SVR, and looking through the parameters, do you see n_jobs? Not me. So no, no threading here. As you could see, on our small data, it makes very little difference, but, on say even as little as 20mb of data, it makes a massive difference. Next up, let's check out the LinearRegression algorithm. Do you see n_jobs here? Indeed! So here, you can specify exactly how many threads you'll want. If you put in -1 for the value, then the algorithm will use all available threads.

To do this:

clf = LinearRegression(n_jobs=-1)

That's all. While I have you doing such a rare thing (looking at documentation), let me draw your attention to the fact that, just because machine learning algorithms work with default parameters, it doesn't mean you can just ignore them. For example, let's revisit svm.SVR. SVR is support vector regression, which is a kind of...architecture... when doing machine learning. I highly encourage anyone interested in learning more to research the topic and learn from people who are far more educated than I am on fundamentals, by the way, I will do my best to explain things simply here, but I am not an expert. Back on topic, however. There is a parameter to svm.SVR for example which is kernel. What in the heck is that? Think of a kernel like a transformation against your data. It's a way to grossly, and I mean grossly, simplify your data. This makes processing go much faster. In the case of svm.SVR, the default is rbf, which is a type of kernel. You have a few other choices though. Check the documentation, you have 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed' or a callable. Again, just like the suggestion to try the various ML algorithms that can do what you want, try the kernels. Let's do a few:

for k in ['linear','poly','rbf','sigmoid']:
    clf = svm.SVR(kernel=k), y_train)
    confidence = clf.score(X_test, y_test)
linear 0.960075071072
poly 0.63712232551
rbf 0.802831714511
sigmoid -0.125347960903

As we can see, the linear kernel performed the best, closely by rbf, then poly, then sigmoid was clearly just goofing off and definitely needs to be kicked from the team.

So we have trained, and tested. Let's say we're happy with 71% at this point, what would we do next? We've trained and tested, but now we want to move forward and forecast out, which is what we'll be covering in the next tutorial.

The next tutorial:

  • Practical Machine Learning Tutorial with Python Introduction
  • Regression - Intro and Data
  • Regression - Features and Labels
  • Regression - Training and Testing
    You are currently here.
  • Regression - Forecasting and Predicting
  • Pickling and Scaling
  • Regression - Theory and how it works
  • Regression - How to program the Best Fit Slope
  • Regression - How to program the Best Fit Line
  • Regression - R Squared and Coefficient of Determination Theory
  • Regression - How to Program R Squared
  • Creating Sample Data for Testing
  • Classification Intro with K Nearest Neighbors
  • Applying K Nearest Neighbors to Data
  • Euclidean Distance theory
  • Creating a K Nearest Neighbors Classifer from scratch
  • Creating a K Nearest Neighbors Classifer from scratch part 2
  • Testing our K Nearest Neighbors classifier
  • Final thoughts on K Nearest Neighbors
  • Support Vector Machine introduction
  • Vector Basics
  • Support Vector Assertions
  • Support Vector Machine Fundamentals
  • Constraint Optimization with Support Vector Machine
  • Beginning SVM from Scratch in Python
  • Support Vector Machine Optimization in Python
  • Support Vector Machine Optimization in Python part 2
  • Visualization and Predicting with our Custom SVM
  • Kernels Introduction
  • Why Kernels
  • Soft Margin Support Vector Machine
  • Kernels, Soft Margin SVM, and Quadratic Programming with Python and CVXOPT
  • Support Vector Machine Parameters
  • Machine Learning - Clustering Introduction
  • Handling Non-Numerical Data for Machine Learning
  • K-Means with Titanic Dataset
  • K-Means from Scratch in Python
  • Finishing K-Means from Scratch in Python
  • Hierarchical Clustering with Mean Shift Introduction
  • Mean Shift applied to Titanic Dataset
  • Mean Shift algorithm from scratch in Python
  • Dynamically Weighted Bandwidth for Mean Shift
  • Introduction to Neural Networks
  • Installing TensorFlow for Deep Learning - OPTIONAL
  • Introduction to Deep Learning with TensorFlow
  • Deep Learning with TensorFlow - Creating the Neural Network Model
  • Deep Learning with TensorFlow - How the Network will run
  • Deep Learning with our own Data
  • Simple Preprocessing Language Data for Deep Learning
  • Training and Testing on our Data for Deep Learning
  • 10K samples compared to 1.6 million samples with Deep Learning
  • How to use CUDA and the GPU Version of Tensorflow for Deep Learning
  • Recurrent Neural Network (RNN) basics and the Long Short Term Memory (LSTM) cell
  • RNN w/ LSTM cell example in TensorFlow and Python
  • Convolutional Neural Network (CNN) basics
  • Convolutional Neural Network CNN with TensorFlow tutorial
  • TFLearn - High Level Abstraction Layer for TensorFlow Tutorial
  • Using a 3D Convolutional Neural Network on medical imaging data (CT Scans) for Kaggle
  • Classifying Cats vs Dogs with a Convolutional Neural Network on Kaggle
  • Using a neural network to solve OpenAI's CartPole balancing environment