Another popular, yet often confusing, topic is machine learning for algorithmic trading. While machine learning can be a very complex subject, in practice it boils down to a few simple techniques that you can employ with very little knowledge of how machine learning works in the background.
I often compare machine learning with a module like Scikit-Learn to driving a car. You don't need to know all of the inner workings of the car in order to get utility from it; you just need to know how to operate the main parts, like the wheel and pedals.
Machine learning divides into two major categories: supervised and unsupervised learning. We will be leaving unsupervised learning out of this. Supervised machine learning involves the user "teaching" the machine how to reach its results. This entails taking a labeled sample and feeding the data, along with the labels, to the machine, teaching it what is what.
For example, you might feed a supervised machine learning algorithm a bunch of pictures of cars, saying they are cars, and then another bunch of pictures of motorcycles, saying those are motorcycles. The images themselves would be broken down into features, like pixels or the more likely polygons, and then stored into something like an array. After this phase, referred to as training, we're ready to test. We test the machine learning algorithm by feeding it new data whose labels we know but do not reveal to the machine. The machine makes predictions, and we compare these to the known labels to measure accuracy. If the accuracy is decent enough, we might choose to employ the algorithm.
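The train-then-test workflow described above can be sketched in a few lines of plain Python. This is only a toy illustration: the nearest-neighbour rule is a hand-written stand-in for a real Scikit-Learn classifier, and the `[wheels, weight_kg]` features and all of the data are made up for the example.

```python
# Toy illustration of the train/test workflow. The nearest-neighbour rule is a
# stand-in for a real classifier, and every number below is invented.

def predict(train_X, train_y, sample):
    """Return the label of the training sample closest to `sample`."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    nearest = min(range(len(train_X)), key=lambda i: sq_dist(train_X[i], sample))
    return train_y[nearest]

# Training data: feature sets plus the labels we "teach" with.
train_X = [[4, 1500], [4, 1800], [2, 200], [2, 250]]   # [wheels, weight_kg]
train_y = ["car", "car", "motorcycle", "motorcycle"]

# Testing data: we know these labels, but the model only sees the features.
test_X = [[4, 1600], [2, 220], [4, 1400]]
test_y = ["car", "motorcycle", "car"]

predictions = [predict(train_X, train_y, sample) for sample in test_X]

# Accuracy is the fraction of predictions that match the known labels.
correct = sum(p == t for p, t in zip(predictions, test_y))
accuracy = correct / len(test_y)
print(predictions, accuracy)
```

With a real library the shape stays the same: fit on the labeled training data, predict on held-out data, and compare against the labels you kept back.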
If you happen to enjoy machine learning, you may be interested in the separate Scikit-Learn tutorial series, which uses a supervised machine learning algorithm, an SVM, to find long-term investments in companies.
Here, we will not be diving anywhere near as deep. Instead, we'll just be showing a simple example of how to work with the Scikit-Learn module with stock price data. In order to do this, we have to have "features" and "labels" to train with. Features are whatever makes up the object that we classify. The classification is the label. In our case, we'll use pricing movements as feature sets, and their future outcomes as either being "up" or "down" as their labels.
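To make "features" and "labels" concrete, here is a minimal sketch of turning a short run of prices into one feature set and one label. The helper name `make_sample`, the choice of percent change from the first price as the feature, and the rule for deciding "up" versus "down" are all assumptions made for illustration, not necessarily the scheme used later in this tutorial.

```python
def make_sample(prices, feature_window):
    """Build one (features, label) pair from feature_window + 1 prices.

    Hypothetical helper: features are percent changes relative to the first
    price in the window, and the label says whether the next price is higher
    than the last price in the window.
    """
    window = prices[:feature_window]
    next_price = prices[feature_window]
    base = window[0]
    features = [(p - base) / base * 100.0 for p in window]
    label = "up" if next_price > window[-1] else "down"
    return features, label

# Made-up prices: three days of features, then the outcome on day four.
features, label = make_sample([100.0, 102.0, 101.0, 105.0], feature_window=3)
print(features, label)
```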
To start, we'll need some imports and starting code that you've seen from previous tutorials:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC, NuSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn import preprocessing
from collections import Counter
import numpy as np

def initialize(context):
    context.stocks = symbols('XLY',  # XLY Consumer Discretionary SPDR Fund
                             'XLF',  # XLF Financial SPDR Fund
                             'XLK',  # XLK Technology SPDR Fund
                             'XLE',  # XLE Energy SPDR Fund
                             'XLV',  # XLV Health Care SPDR Fund
                             'XLI',  # XLI Industrial SPDR Fund
                             'XLP',  # XLP Consumer Staples SPDR Fund
                             'XLB',  # XLB Materials SPDR Fund
                             'XLU')  # XLU Utilities SPDR Fund

    context.historical_bars = 100
    context.feature_window = 10
First, we're importing a bunch of classifiers: LogisticRegression from the linear models; SVC, LinearSVC, and NuSVC from the SVMs; and a random forest classifier as well. Next, we bring in preprocessing, which is used to normalize data, Counter to count occurrences, and NumPy for some number-crunching tasks.
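As a quick illustration of what `Counter` does, the sketch below tallies a list of predictions and picks the most common one. Using it to take a majority vote across several classifiers is an assumption about where this example is headed, and the prediction values are made up.

```python
from collections import Counter

# Hypothetical predictions from four different classifiers for one stock.
predictions = ["up", "down", "up", "up"]

# Counter maps each distinct value to the number of times it occurs.
votes = Counter(predictions)

# most_common(1) returns a list holding the single (value, count) pair
# with the highest count.
winner, count = votes.most_common(1)[0]
print(winner, count)
```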
Next, we write our initialize method, which is used to establish the starting parameters for our strategy. Here, our stock universe, or the securities we're willing to consider, is the nine major sector SPDR ETFs.
The context.historical_bars references how many bars of data we're wanting to consider from history, and then the feature_window corresponds to how many features will be included in each feature set.
If we're using daily data, this means that our samples will be drawn from the last 100 days of daily data, and each feature set will cover 10 days. Feel free to play with these numbers as you wish. We should probably use larger numbers, especially for the historical bars, but this is just a simple example.
Now that we have our initial settings chosen, we're ready to build the handle_data method:
def handle_data(context, data):
    # Daily closing prices for the last context.historical_bars bars
    prices = history(bar_count = context.historical_bars, frequency='1d', field='price')

    for stock in context.stocks:
        # 50-day and 200-day moving averages for this stock
        ma1 = data[stock].mavg(50)
        ma2 = data[stock].mavg(200)

        start_bar = context.feature_window
        price_list = prices[stock].tolist()

        X = []  # feature sets
        y = []  # labels
Our first task with this handle_data method is to create our feature sets. We begin with the two empty lists, X and y, initialized at the end of the loop body above.
Generally, with supervised machine learning, the capital X is for the feature sets, and the lowercase y denotes the labels. X will be a list of lists, or an array. y will just be a list.
We will populate the X variable with lists of features, and then y will contain the labels that correspond, by index number, to the feature sets.
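As a rough sketch of how X and y line up by index, the loop below slides a 10-bar window across 100 bars of data. The price series is made up and stands in for `prices[stock].tolist()`. Using the raw prices as features and labeling "up" when the next bar closes higher than the last bar in the window are both assumptions chosen to keep the example short.

```python
historical_bars = 100
feature_window = 10

# Hypothetical, steadily rising price series standing in for real data.
price_list = [100.0 + 0.5 * i for i in range(historical_bars)]

X = []  # list of feature sets (a list of lists)
y = []  # list of labels, one per feature set

# Start at feature_window so every sample has a full window of history.
for bar in range(feature_window, len(price_list)):
    features = price_list[bar - feature_window:bar]  # the previous 10 prices
    label = "up" if price_list[bar] > price_list[bar - 1] else "down"
    X.append(features)
    y.append(label)

# X[i] and y[i] describe the same sample.
print(len(X), len(y), len(X[0]))
```

With 100 bars and a window of 10, this produces 90 samples, which is one reason a larger number of historical bars gives the classifier more to learn from.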