Hello and welcome to part 9 of the Python for Finance tutorial series. In the previous tutorials, we've covered how to pull in stock pricing data for a large number of companies, how to combine that data into one large dataset, and how to visually represent at least one relationship between all of the companies. Now, we're going to try to take this data and do some machine learning with it!
The idea is to see what might happen if we took data from all of the current companies and fed it through some sort of machine learning classifier. We know that, over time, various companies have different relationships with each other, so, if the machine can recognize and fit these relationships, it's possible we could predict, from changes in prices today, what will happen tomorrow with a specific company. Let's try!
To begin, all machine learning does is take "featuresets" and attempt to map them to "labels." Whether we're doing K Nearest Neighbors or deep learning with neural networks, this remains the same. Thus, we need to convert our existing data to featuresets and labels.
Our features could be other companies' prices, but we're going to instead say the features are the pricing changes that day for all companies. Our label will be whether or not we actually want to buy a specific company. Let's say we're considering Exxon (XOM). What we'll do for featuresets is take into account all company percent changes that day, and those will be our features. Our label will be whether or not Exxon (XOM) rose more than x% within the next x days, where we can pick whatever values we want for the percentage and the number of days. To start, let's say a company is a buy if, within the next 7 days, its price goes up more than 2%, and it is a sell if the price goes down more than 2% within those 7 days.
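To make the rule above concrete, here is a minimal sketch of how one row's label could be derived from a list of future percent changes. The function name `label_from_future_changes`, the `threshold` parameter, and the choice to check for a buy before a sell are all assumptions made for illustration, not necessarily how the labels end up being built later in this series.

```python
def label_from_future_changes(future_changes, threshold=0.02):
    """Return 1 (buy), -1 (sell) or 0 (hold) for a single row.

    future_changes holds fractional price changes over the coming days,
    so 0.03 means +3%. Checking "buy" before "sell" is an assumption.
    """
    # Buy if any of the future changes is above the threshold.
    for change in future_changes:
        if change > threshold:
            return 1
    # Otherwise, sell if any of the future changes is below -threshold.
    for change in future_changes:
        if change < -threshold:
            return -1
    # Neither threshold was crossed.
    return 0


# Three made-up weeks of 7 future changes each.
week_up = [0.001, 0.015, 0.027, -0.01, 0.0, 0.004, 0.01]
week_down = [-0.005, -0.031, 0.002, 0.0, 0.01, -0.01, 0.0]
week_flat = [0.004, -0.012, 0.009, 0.0, 0.01, -0.01, 0.0]

print(label_from_future_changes(week_up))    # 1
print(label_from_future_changes(week_down))  # -1
print(label_from_future_changes(week_flat))  # 0
```

The 7-day window and the 2% threshold are just the starting values from the text, so both are easy to change here.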
This is something we could also relatively easily make a strategy for. If the algorithm says buy, we can buy and place a 2% drop stop-loss (basically an order that tells the exchange: if the price falls below this number, or goes above it if you're shorting the company, then exit my position). Otherwise, sell the company once it has risen 2%, or you could be conservative and sell at a 1% rise, etc. Regardless, you could relatively easily build a strategy from this classifier. In order to begin, we need the prices into the future for our training data.
I am going to keep coding in our same script. If this is a problem for you, feel free to create a new file and import the functions we use.
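As a rough illustration of that exit logic, the sketch below walks through a list of later prices and reports whether a 2% stop-loss or a 2% profit target would have been hit first. The function `simulate_exit` and its parameters are hypothetical, and the sketch assumes we only look at one price per day and ignore fees, gaps, and order-fill details.

```python
def simulate_exit(entry_price, later_prices, stop=0.02, target=0.02):
    """Return (reason, price) for the first exit after buying at entry_price.

    reason is 'stop' if the price fell by `stop` or more, 'target' if it
    rose by `target` or more, and 'open' if neither level was reached.
    """
    stop_price = entry_price * (1 - stop)
    target_price = entry_price * (1 + target)
    last_price = entry_price
    for price in later_prices:
        last_price = price
        if price <= stop_price:
            return 'stop', price
        if price >= target_price:
            return 'target', price
    # Position is still open after the prices we were given.
    return 'open', last_price


print(simulate_exit(100.0, [100.5, 101.2, 102.3]))  # ('target', 102.3)
print(simulate_exit(100.0, [99.5, 97.9, 103.0]))    # ('stop', 97.9)
print(simulate_exit(100.0, [100.2, 99.4, 100.9]))   # ('open', 100.9)
```

Selling at a 1% rise instead is just a matter of passing `target=0.01`.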
Full code up to this point:
import bs4 as bs
import datetime as dt
import matplotlib.pyplot as plt
from matplotlib import style
import numpy as np
import os
import pandas as pd
import pandas_datareader.data as web
import pickle
import requests

style.use('ggplot')


def save_sp500_tickers():
    resp = requests.get('http://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
    soup = bs.BeautifulSoup(resp.text, 'lxml')
    table = soup.find('table', {'class': 'wikitable sortable'})
    tickers = []
    for row in table.findAll('tr')[1:]:
        ticker = row.findAll('td')[0].text
        tickers.append(ticker)
    with open("sp500tickers.pickle", "wb") as f:
        pickle.dump(tickers, f)
    return tickers


# save_sp500_tickers()
def get_data_from_yahoo(reload_sp500=False):
    if reload_sp500:
        tickers = save_sp500_tickers()
    else:
        with open("sp500tickers.pickle", "rb") as f:
            tickers = pickle.load(f)
    if not os.path.exists('stock_dfs'):
        os.makedirs('stock_dfs')

    start = dt.datetime(2010, 1, 1)
    end = dt.datetime.now()
    for ticker in tickers:
        # just in case your connection breaks, we'd like to save our progress!
        if not os.path.exists('stock_dfs/{}.csv'.format(ticker)):
            df = web.DataReader(ticker, 'morningstar', start, end)
            df.reset_index(inplace=True)
            df.set_index("Date", inplace=True)
            df = df.drop("Symbol", axis=1)
            df.to_csv('stock_dfs/{}.csv'.format(ticker))
        else:
            print('Already have {}'.format(ticker))


def compile_data():
    with open("sp500tickers.pickle", "rb") as f:
        tickers = pickle.load(f)

    main_df = pd.DataFrame()

    for count, ticker in enumerate(tickers):
        df = pd.read_csv('stock_dfs/{}.csv'.format(ticker))
        df.set_index('Date', inplace=True)

        df.rename(columns={'Adj Close': ticker}, inplace=True)
        df.drop(['Open', 'High', 'Low', 'Close', 'Volume'], 1, inplace=True)

        if main_df.empty:
            main_df = df
        else:
            main_df = main_df.join(df, how='outer')

        if count % 10 == 0:
            print(count)
    print(main_df.head())
    main_df.to_csv('sp500_joined_closes.csv')


def visualize_data():
    df = pd.read_csv('sp500_joined_closes.csv')
    df_corr = df.corr()
    print(df_corr.head())
    df_corr.to_csv('sp500corr.csv')

    data1 = df_corr.values
    fig1 = plt.figure()
    ax1 = fig1.add_subplot(111)

    heatmap1 = ax1.pcolor(data1, cmap=plt.cm.RdYlGn)
    fig1.colorbar(heatmap1)

    ax1.set_xticks(np.arange(data1.shape[1]) + 0.5, minor=False)
    ax1.set_yticks(np.arange(data1.shape[0]) + 0.5, minor=False)
    ax1.invert_yaxis()
    ax1.xaxis.tick_top()
    column_labels = df_corr.columns
    row_labels = df_corr.index
    ax1.set_xticklabels(column_labels)
    ax1.set_yticklabels(row_labels)
    plt.xticks(rotation=90)
    heatmap1.set_clim(-1, 1)
    plt.tight_layout()
    plt.show()


visualize_data()
Continuing along, let's begin to process some data that will help us to create our labels:
def process_data_for_labels(ticker):
    hm_days = 7
    df = pd.read_csv('sp500_joined_closes.csv', index_col=0)
    tickers = df.columns.values.tolist()
    df.fillna(0, inplace=True)
This function will take one parameter: the ticker in question. Each model will be trained on a single company. Next, we want to know how many days into the future we need prices for. We're choosing 7 here. Now, we'll read in the close prices for all companies that we saved earlier, grab a list of the existing tickers, and fill any missing values with 0. This might be something you want to change in the future, but we'll go with 0 for now. Now, we want to grab the % change values for the next 7 days:
    for i in range(1, hm_days+1):
        df['{}_{}d'.format(ticker, i)] = (df[ticker].shift(-i) - df[ticker]) / df[ticker]
This creates new dataframe columns for our specific ticker in question, using string formatting to create the custom names. The way we're getting future values is with .shift, which basically will shift a column up or down. In this case, we shift by a negative amount, which takes that column and, if you could see it visually, shifts it UP by i rows. This gives us the future values i days in advance, which we can calculate percent change against.
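To see what the negative shift does, here is a small sketch on made-up prices. The four price values and the names `prices`, `tomorrow`, `change_1d`, and `demo` are invented for illustration, but `.shift(-1)` is used the same way as in our function.

```python
import pandas as pd

# Four made-up daily closing prices for one ticker.
prices = pd.Series([100.0, 102.0, 101.0, 105.0], name='XOM')

# shift(-1) moves every value up one row, so each row sees tomorrow's price.
tomorrow = prices.shift(-1)

# Fractional change from today's price to tomorrow's price.
change_1d = (tomorrow - prices) / prices

demo = pd.DataFrame({'today': prices, 'tomorrow': tomorrow, 'XOM_1d': change_1d})
print(demo)
```

The last row has no "tomorrow" to look at, so its shifted value and its percent change both come out as NaN.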
Finally:
    df.fillna(0, inplace=True)
    return tickers, df
We're all set here. We'll return the tickers and the dataframe, and we're well on our way to having some featuresets that our algorithms can use to try to fit and find relationships.
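One thing to be aware of with filling missing prices with 0: wherever the price itself is 0, the percent change divides by zero. The sketch below uses a made-up ticker called TOY to show that the result is an infinite value, which `fillna(0)` does not touch. Replacing the infinities afterward is only one possible way to handle this, and it is an assumption rather than something this tutorial's code does.

```python
import numpy as np
import pandas as pd

# Toy data: the first price is missing, e.g. before the company was listed.
prices = pd.Series([np.nan, 50.0, 51.0], name='TOY')

filled = prices.fillna(0)

# Same calculation as in our function, for a 1-day look ahead.
change_1d = (filled.shift(-1) - filled) / filled  # first row divides by 0
change_1d = change_1d.fillna(0)                   # only replaces NaN, not inf

print(change_1d.tolist())

# One possible cleanup (an assumption): treat infinite changes as 0 too.
safe = change_1d.replace([np.inf, -np.inf], 0)
print(safe.tolist())
```

The first entry of `change_1d` is infinite even after `fillna(0)`, so it is worth checking for before handing the data to a classifier.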
Our full processing function:
def process_data_for_labels(ticker):
    hm_days = 7
    df = pd.read_csv('sp500_joined_closes.csv', index_col=0)
    tickers = df.columns.values.tolist()
    df.fillna(0, inplace=True)

    for i in range(1, hm_days+1):
        df['{}_{}d'.format(ticker, i)] = (df[ticker].shift(-i) - df[ticker]) / df[ticker]

    df.fillna(0, inplace=True)
    return tickers, df
In the next tutorial, we're going to cover how we'll go about creating our "labels."