Hello and welcome to part 10 (and 11) of the Python for Finance tutorial series. In the previous tutorial, we began to build our labels for our attempt at using machine learning for investing with Python. In this tutorial, we're going to build on that work to actually generate our features and labels.
Full code up to this point:
import bs4 as bs
import datetime as dt
import matplotlib.pyplot as plt
from matplotlib import style
import numpy as np
import os
import pandas as pd
import pandas_datareader.data as web
import pickle
import requests

style.use('ggplot')


def save_sp500_tickers():
    resp = requests.get('http://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
    soup = bs.BeautifulSoup(resp.text, 'lxml')
    table = soup.find('table', {'class': 'wikitable sortable'})
    tickers = []
    for row in table.findAll('tr')[1:]:
        ticker = row.findAll('td')[0].text
        tickers.append(ticker)
    with open("sp500tickers.pickle", "wb") as f:
        pickle.dump(tickers, f)
    return tickers


# save_sp500_tickers()


def get_data_from_yahoo(reload_sp500=False):
    if reload_sp500:
        tickers = save_sp500_tickers()
    else:
        with open("sp500tickers.pickle", "rb") as f:
            tickers = pickle.load(f)
    if not os.path.exists('stock_dfs'):
        os.makedirs('stock_dfs')

    start = dt.datetime(2010, 1, 1)
    end = dt.datetime.now()
    for ticker in tickers:
        # just in case your connection breaks, we'd like to save our progress!
        if not os.path.exists('stock_dfs/{}.csv'.format(ticker)):
            df = web.DataReader(ticker, 'morningstar', start, end)
            df.reset_index(inplace=True)
            df.set_index("Date", inplace=True)
            df = df.drop("Symbol", axis=1)
            df.to_csv('stock_dfs/{}.csv'.format(ticker))
        else:
            print('Already have {}'.format(ticker))


def compile_data():
    with open("sp500tickers.pickle", "rb") as f:
        tickers = pickle.load(f)

    main_df = pd.DataFrame()

    for count, ticker in enumerate(tickers):
        df = pd.read_csv('stock_dfs/{}.csv'.format(ticker))
        df.set_index('Date', inplace=True)

        df.rename(columns={'Adj Close': ticker}, inplace=True)
        # axis is passed by keyword; newer pandas versions reject a positional 1
        df.drop(['Open', 'High', 'Low', 'Close', 'Volume'], axis=1, inplace=True)

        if main_df.empty:
            main_df = df
        else:
            main_df = main_df.join(df, how='outer')

        if count % 10 == 0:
            print(count)
    print(main_df.head())
    main_df.to_csv('sp500_joined_closes.csv')


def visualize_data():
    df = pd.read_csv('sp500_joined_closes.csv')
    df_corr = df.corr()
    print(df_corr.head())
    df_corr.to_csv('sp500corr.csv')
    data1 = df_corr.values
    fig1 = plt.figure()
    ax1 = fig1.add_subplot(111)

    heatmap1 = ax1.pcolor(data1, cmap=plt.cm.RdYlGn)
    fig1.colorbar(heatmap1)

    ax1.set_xticks(np.arange(data1.shape[1]) + 0.5, minor=False)
    ax1.set_yticks(np.arange(data1.shape[0]) + 0.5, minor=False)
    ax1.invert_yaxis()
    ax1.xaxis.tick_top()
    column_labels = df_corr.columns
    row_labels = df_corr.index
    ax1.set_xticklabels(column_labels)
    ax1.set_yticklabels(row_labels)
    plt.xticks(rotation=90)
    heatmap1.set_clim(-1, 1)
    plt.tight_layout()
    plt.show()


def process_data_for_labels(ticker):
    hm_days = 7
    df = pd.read_csv('sp500_joined_closes.csv', index_col=0)
    tickers = df.columns.values.tolist()
    df.fillna(0, inplace=True)

    for i in range(1, hm_days+1):
        df['{}_{}d'.format(ticker, i)] = (df[ticker].shift(-i) - df[ticker]) / df[ticker]

    df.fillna(0, inplace=True)
    return tickers, df
Now we're going to create the function that creates our label. We have a lot of choices here. You might want to have something that dictates buy, sell, or hold, or maybe just buy or sell. I am going to have us do the former. Basically, if the price rises more than 2% in the next 7 days, we're going to say that's a buy. If it drops more than 2% in the next 7 days, that's a sell. If it doesn't do either of those, then it's not moving enough, and we're going to just hold whatever our position is. If we have shares in that company, we do nothing, we keep our position. If we don't have shares in that company, we do nothing, we just wait. Our function to do this:
def buy_sell_hold(*args):
    cols = [c for c in args]
    requirement = 0.02
    for col in cols:
        if col > requirement:
            return 1
        if col < -requirement:
            return -1
    return 0
We're using *args here so we can take any number of columns that we want. The idea is that we're going to map this function across Pandas DataFrame columns, row by row, and the result will become a new column that is our "label." A -1 is a sell, 0 is hold, and 1 is a buy. The *args will be those future price change columns, and we're interested if we see movement that exceeds 2% in either direction. Do note, this isn't a totally perfect function. It returns as soon as it sees the first move past the threshold, so, for example, price might go up 2%, then fall 2%, and we might not be prepared for that, but it will do for now.
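To make the behavior concrete, here is a small sketch that calls the function on a few hand-picked sequences of future percent changes. The input values are made up for illustration; the function body is the same as the one above.

```python
def buy_sell_hold(*args):
    cols = [c for c in args]
    requirement = 0.02
    for col in cols:
        if col > requirement:
            return 1
        if col < -requirement:
            return -1
    return 0


# Crosses +2% on the third value before anything else: buy.
rising = buy_sell_hold(0.005, 0.01, 0.025, 0.03)
# Crosses -2% on the second value: sell.
falling = buy_sell_hold(-0.01, -0.03, 0.01)
# Never moves more than 2% in either direction: hold.
flat = buy_sell_hold(0.01, -0.015, 0.0)
# The first crossing wins, so the later -5% drop is never looked at.
up_then_down = buy_sell_hold(0.021, -0.05)

print(rising, falling, flat, up_then_down)
```

The last call shows the limitation mentioned above: the label is a buy even though the price later falls well below where it started.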
With that, let's actually make our features and labels! For this function, we're going to add the following import:
from collections import Counter
This will let us see the distributions of classes both in our dataset and in our algorithm's predictions. We don't want to feed highly imbalanced datasets to machine learning classifiers, and we also want to see if our classifier is predicting only one class. Our next function:
def extract_featuresets(ticker):
    tickers, df = process_data_for_labels(ticker)

    df['{}_target'.format(ticker)] = list(map(
        buy_sell_hold,
        df['{}_1d'.format(ticker)],
        df['{}_2d'.format(ticker)],
        df['{}_3d'.format(ticker)],
        df['{}_4d'.format(ticker)],
        df['{}_5d'.format(ticker)],
        df['{}_6d'.format(ticker)],
        df['{}_7d'.format(ticker)]
    ))
This function will take any ticker, create the needed dataset, and create our "target" column, which is our label. The target column will have either a -1, 0, or 1 for each row, based on our function and the columns we feed through. Now, we can get the distribution:
    vals = df['{}_target'.format(ticker)].values.tolist()
    str_vals = [str(i) for i in vals]
    print('Data spread:', Counter(str_vals))
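As a rough illustration of what that printout tells us, here is a sketch that runs Counter over a short list of labels. The labels are invented for the example, and the share calculation is an extra step that is not part of the tutorial's function.

```python
from collections import Counter

# Hypothetical labels, made up for illustration only.
labels = [1, 1, 0, -1, 1, 0, 1, -1, 1, 1]

# Same idea as in extract_featuresets: count each class as a string.
spread = Counter(str(label) for label in labels)
print('Data spread:', spread)

# Convert counts to shares to make any imbalance easier to spot.
total = sum(spread.values())
shares = {label: count / total for label, count in spread.items()}
print(shares)
```

If one class takes up most of the shares, a classifier can score well just by always predicting that class, which is why it is worth checking the spread before training.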
Clean up our data:
    df.fillna(0, inplace=True)
    df = df.replace([np.inf, -np.inf], np.nan)
    df.dropna(inplace=True)
We probably have some totally missing data, which we'll replace with 0. Next, we probably have some infinite data, especially if we did a percent change from 0 to anything. We're going to convert infinite values to NaNs, then we're going to drop NaNs. We're *almost* ready to rumble, but right now our "features" are that day's prices for stocks. Those are just static numbers, really nothing telling at all. A better metric would be every company's percent change that day. The idea here is that some companies will change in price before others, and we can maybe profit from the laggards. We'll convert the stock prices to % changes:
    df_vals = df[[ticker for ticker in tickers]].pct_change()
    df_vals = df_vals.replace([np.inf, -np.inf], 0)
    df_vals.fillna(0, inplace=True)
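To see where the infinite values come from, here is a minimal sketch on made-up prices. The tickers AAA and BBB and their prices are assumptions for illustration; the leading 0 stands in for a missing price that was filled with 0 earlier.

```python
import numpy as np
import pandas as pd

# Toy prices, invented for this example.
prices = pd.DataFrame({
    'AAA': [0.0, 10.0, 11.0],
    'BBB': [50.0, 55.0, 49.5],
})

# The first row has no previous day, so it is NaN.
# AAA going from 0 to 10 is a division by zero, which gives inf.
changes = prices.pct_change()
print(changes)

# Same cleanup as above: infinities become 0, then remaining NaNs become 0.
cleaned = changes.replace([np.inf, -np.inf], 0)
cleaned = cleaned.fillna(0)
print(cleaned)
```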
Again, we're being careful about infinite numbers, and then filling any other missing data. Now, finally, we are ready to create our features and labels:
    X = df_vals.values
    y = df['{}_target'.format(ticker)].values

    return X, y, df
The capital X contains our featuresets (daily % changes for every company in the S&P 500). The lowercase y is our "target" or our "label." Basically what we're trying to map our featuresets to.
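To check how the pieces line up, here is a minimal end-to-end sketch on made-up data. The tickers AAA and BBB, their prices, and the 2-day horizon (instead of 7) are assumptions chosen to keep the example small; the steps follow process_data_for_labels and extract_featuresets, but without reading any CSV files.

```python
import numpy as np
import pandas as pd


def buy_sell_hold(*args):
    requirement = 0.02
    for col in args:
        if col > requirement:
            return 1
        if col < -requirement:
            return -1
    return 0


# Toy joined closes, invented for this example.
df = pd.DataFrame({
    'AAA': [100.0, 103.0, 100.0, 100.5],
    'BBB': [20.0, 20.2, 19.9, 20.5],
})
tickers = df.columns.values.tolist()
ticker = 'AAA'
hm_days = 2  # shortened from 7 for readability

# Future % change of AAA, i days ahead of each row.
for i in range(1, hm_days + 1):
    df['{}_{}d'.format(ticker, i)] = (df[ticker].shift(-i) - df[ticker]) / df[ticker]
df.fillna(0, inplace=True)

# One label per row, from that row's future % changes.
df['{}_target'.format(ticker)] = list(map(
    buy_sell_hold,
    df['{}_1d'.format(ticker)],
    df['{}_2d'.format(ticker)],
))

# Features: same-day % change of every ticker.
df_vals = df[tickers].pct_change()
df_vals = df_vals.replace([np.inf, -np.inf], 0)
df_vals = df_vals.fillna(0)

X = df_vals.values
y = df['{}_target'.format(ticker)].values
print(X.shape, y)
```

Each row of X pairs with the entry of y at the same position, so X has one row per trading day and one column per ticker, while y has one label per trading day.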
Alright, we've got features and labels, so we're ready to do some machine learning, which is what we'll cover in the next tutorial.