Hello and welcome to part 7 of the Python for Finance tutorial series. In the previous tutorial, we grabbed price data for every company in the S&P 500. In this tutorial, we're going to bring that data together into one DataFrame.
Code up to this point:
import bs4 as bs
import datetime as dt
import os
import pandas_datareader.data as web
import pickle
import requests


def save_sp500_tickers():
    resp = requests.get('http://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
    soup = bs.BeautifulSoup(resp.text, 'lxml')
    table = soup.find('table', {'class': 'wikitable sortable'})
    tickers = []
    for row in table.findAll('tr')[1:]:
        ticker = row.findAll('td')[0].text
        tickers.append(ticker)
    with open("sp500tickers.pickle", "wb") as f:
        pickle.dump(tickers, f)
    return tickers

# save_sp500_tickers()


def get_data_from_yahoo(reload_sp500=False):
    if reload_sp500:
        tickers = save_sp500_tickers()
    else:
        with open("sp500tickers.pickle", "rb") as f:
            tickers = pickle.load(f)
    if not os.path.exists('stock_dfs'):
        os.makedirs('stock_dfs')

    start = dt.datetime(2010, 1, 1)
    end = dt.datetime.now()

    for ticker in tickers:
        # just in case your connection breaks, we'd like to save our progress!
        if not os.path.exists('stock_dfs/{}.csv'.format(ticker)):
            df = web.DataReader(ticker, 'morningstar', start, end)
            df.reset_index(inplace=True)
            df.set_index("Date", inplace=True)
            df = df.drop("Symbol", axis=1)
            df.to_csv('stock_dfs/{}.csv'.format(ticker))
        else:
            print('Already have {}'.format(ticker))


get_data_from_yahoo()
While we do have all of the data at our disposal, we may want to assess it together. To do this, we're going to join all of the stock datasets into one. Each of the stock files currently comes with: Open, High, Low, Close, Volume, and Adj Close. At least to start, we're mostly interested in the adjusted close.
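If you'd like to peek at what one of those per-ticker files looks like before we combine them, here's a quick sketch (the 'MMM' ticker is just a placeholder for whichever files you actually have in your stock_dfs directory):

import pandas as pd

# Inspect a single saved file to see the columns we'll be working with.
# 'MMM' is only an example ticker; swap in any file you actually saved.
df = pd.read_csv('stock_dfs/MMM.csv')
print(df.columns.tolist())
print(df.head())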
def compile_data():
    with open("sp500tickers.pickle", "rb") as f:
        tickers = pickle.load(f)

    main_df = pd.DataFrame()
To begin, we pull our previously-made list of tickers and start with an empty DataFrame called main_df. Now, we're ready to read in each stock's DataFrame:
    for count, ticker in enumerate(tickers):
        df = pd.read_csv('stock_dfs/{}.csv'.format(ticker))
        df.set_index('Date', inplace=True)
You do not need to use Python's enumerate here; I am just using it so we know where we are in the process of reading in all of the data. You could just iterate over the tickers directly.
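As a quick illustration (the shortened ticker list here is purely hypothetical), enumerate simply pairs each item with its position, which is what lets us print progress later:

# Toy example, not part of the tutorial code.
sample_tickers = ['MMM', 'ABT', 'ABBV']  # hypothetical short list
for count, ticker in enumerate(sample_tickers):
    print(count, ticker)
# 0 MMM
# 1 ABT
# 2 ABBV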
From this point, we *could* generate extra columns with interesting data, like:
        df['{}_HL_pct_diff'.format(ticker)] = (df['High'] - df['Low']) / df['Low']
        df['{}_daily_pct_chng'.format(ticker)] = (df['Close'] - df['Open']) / df['Open']
For now, however, we're not going to bother with this. Just know it could be a path to pursue down the road. Instead, we're really just interested in that Adj Close column:
        df.rename(columns={'Adj Close': ticker}, inplace=True)
        df.drop(['Open', 'High', 'Low', 'Close', 'Volume'], axis=1, inplace=True)
Now we've got just that column (or maybe extras, like above... but remember, in this example, we're not doing the HL_pct_diff or daily_pct_chng). Notice that we have renamed the Adj Close column to whatever the ticker name is. Let's begin building the shared DataFrame:
        if main_df.empty:
            main_df = df
        else:
            main_df = main_df.join(df, how='outer')
If there's nothing in main_df, then we'll start with the current df; otherwise, we're going to use Pandas' join.
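If you're curious what that outer join actually does, here is a minimal sketch with two made-up, two-row DataFrames (the tickers and dates are placeholders, not real data):

import pandas as pd

# Hypothetical mini-DataFrames indexed by Date, one column per ticker.
a = pd.DataFrame({'AAA': [10.0, 11.0]}, index=['2010-01-04', '2010-01-05'])
b = pd.DataFrame({'BBB': [20.0, 21.0]}, index=['2010-01-05', '2010-01-06'])

joined = a.join(b, how='outer')
print(joined)
# An outer join keeps the union of all dates; missing prices become NaN:
#              AAA   BBB
# 2010-01-04  10.0   NaN
# 2010-01-05  11.0  20.0
# 2010-01-06   NaN  21.0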
Still within this for loop, we'll add two more lines:
        if count % 10 == 0:
            print(count)
This will just output the current count if it's evenly divisible by 10, giving us a rough progress report every ten tickers. What count % 10 gives us is the remainder when count is divided by 10, so count % 10 == 0 is only True when count is perfectly divisible by 10, i.e. when the remainder is 0.
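For example:

# The modulo operator returns the remainder of integer division.
print(30 % 10)  # 0  -> evenly divisible, so the if statement runs and 30 is printed
print(37 % 10)  # 7  -> not evenly divisible, so nothing is printed for count 37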
When we're all done with the for-loop:
    print(main_df.head())
    main_df.to_csv('sp500_joined_closes.csv')
This function, and the call to it, up to this point:
def compile_data():
    with open("sp500tickers.pickle", "rb") as f:
        tickers = pickle.load(f)

    main_df = pd.DataFrame()

    for count, ticker in enumerate(tickers):
        df = pd.read_csv('stock_dfs/{}.csv'.format(ticker))
        df.set_index('Date', inplace=True)

        df.rename(columns={'Adj Close': ticker}, inplace=True)
        df.drop(['Open', 'High', 'Low', 'Close', 'Volume'], axis=1, inplace=True)

        if main_df.empty:
            main_df = df
        else:
            main_df = main_df.join(df, how='outer')

        if count % 10 == 0:
            print(count)

    print(main_df.head())
    main_df.to_csv('sp500_joined_closes.csv')


compile_data()
Full code up to this point:
import bs4 as bs
import datetime as dt
import os
import pandas as pd
import pandas_datareader.data as web
import pickle
import requests


def save_sp500_tickers():
    resp = requests.get('http://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
    soup = bs.BeautifulSoup(resp.text, 'lxml')
    table = soup.find('table', {'class': 'wikitable sortable'})
    tickers = []
    for row in table.findAll('tr')[1:]:
        ticker = row.findAll('td')[0].text
        tickers.append(ticker)
    with open("sp500tickers.pickle", "wb") as f:
        pickle.dump(tickers, f)
    return tickers

# save_sp500_tickers()


def get_data_from_yahoo(reload_sp500=False):
    if reload_sp500:
        tickers = save_sp500_tickers()
    else:
        with open("sp500tickers.pickle", "rb") as f:
            tickers = pickle.load(f)
    if not os.path.exists('stock_dfs'):
        os.makedirs('stock_dfs')

    start = dt.datetime(2010, 1, 1)
    end = dt.datetime.now()

    for ticker in tickers:
        # just in case your connection breaks, we'd like to save our progress!
        if not os.path.exists('stock_dfs/{}.csv'.format(ticker)):
            df = web.DataReader(ticker, 'morningstar', start, end)
            df.reset_index(inplace=True)
            df.set_index("Date", inplace=True)
            df = df.drop("Symbol", axis=1)
            df.to_csv('stock_dfs/{}.csv'.format(ticker))
        else:
            print('Already have {}'.format(ticker))


def compile_data():
    with open("sp500tickers.pickle", "rb") as f:
        tickers = pickle.load(f)

    main_df = pd.DataFrame()

    for count, ticker in enumerate(tickers):
        df = pd.read_csv('stock_dfs/{}.csv'.format(ticker))
        df.set_index('Date', inplace=True)

        df.rename(columns={'Adj Close': ticker}, inplace=True)
        df.drop(['Open', 'High', 'Low', 'Close', 'Volume'], axis=1, inplace=True)

        if main_df.empty:
            main_df = df
        else:
            main_df = main_df.join(df, how='outer')

        if count % 10 == 0:
            print(count)

    print(main_df.head())
    main_df.to_csv('sp500_joined_closes.csv')


compile_data()
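If you want to sanity-check the result before moving on, here's a minimal sketch for reading the joined file back in (assuming sp500_joined_closes.csv was written by compile_data above):

import pandas as pd

# Read the combined file back, using the Date column as the index,
# and peek at the first few rows.
joined = pd.read_csv('sp500_joined_closes.csv', parse_dates=True, index_col='Date')
print(joined.head())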
In the next tutorial, we're going to attempt to see if we can quickly find any relationships in the data.