Hello and welcome to part 6 of the Python for Finance tutorial series. In the previous finance with Python tutorial, we covered how to acquire the list of companies that we're interested in (S&P 500 in our case), and now we're going to pull stock pricing data on all of them.
Code up to this point:
import bs4 as bs
import pickle
import requests


def save_sp500_tickers():
    resp = requests.get('http://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
    soup = bs.BeautifulSoup(resp.text, 'lxml')
    table = soup.find('table', {'class': 'wikitable sortable'})
    tickers = []
    for row in table.findAll('tr')[1:]:
        ticker = row.findAll('td')[0].text
        tickers.append(ticker)
    with open("sp500tickers.pickle", "wb") as f:
        pickle.dump(tickers, f)
    return tickers
We're going to add a few new imports:
import bs4 as bs
import datetime as dt
import os
import pandas_datareader.data as web
import pickle
import requests
We'll use datetime to specify dates for the Pandas datareader, and os to check for, and create, directories. You already know what pandas is for!
To start our new function:
# save_sp500_tickers()

def get_data_from_yahoo(reload_sp500=False):
    if reload_sp500:
        tickers = save_sp500_tickers()
    else:
        with open("sp500tickers.pickle", "rb") as f:
            tickers = pickle.load(f)
Here's where I'll just show a quick example of one way you could handle whether or not to reload the S&P 500 list. If we ask it to, the program will re-pull the S&P 500 list; otherwise, it will just use our pickle. Now we want to prepare to grab data.
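Once the function is finished (we build up the rest of it below), calling it would look something like this:

get_data_from_yahoo()                   # use the cached sp500tickers.pickle
get_data_from_yahoo(reload_sp500=True)  # re-scrape Wikipedia for a fresh list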
Now we need to decide what we're going to do with the data. What I tend to do is try to parse websites ONCE, and store the data locally. I don't try to know in advance all of the things I might do with the data, but I know if I am going to pull it more than once, I might as well just save it (unless it's a huge dataset, which this is not). Thus, we're going to pull everything we can from what the datareader returns to us for every stock, and just save it. (Note that, despite the function's name, the code below uses the 'morningstar' source for pandas-datareader rather than Yahoo.) To do this, we'll create a new directory, and, in there, store stock data per company. To begin, we need that initial directory:
if not os.path.exists('stock_dfs'):
    os.makedirs('stock_dfs')
You could just store these datasets in the same directory as your script, but this would get pretty messy in my opinion. Now we're ready to pull the data. You already know how to do this; we did it in the very first tutorial!
start = dt.datetime(2010, 1, 1)
end = dt.datetime.now()

for ticker in tickers:
    # just in case your connection breaks, we'd like to save our progress!
    if not os.path.exists('stock_dfs/{}.csv'.format(ticker)):
        df = web.DataReader(ticker, 'morningstar', start, end)
        # the morningstar reader returns a (Symbol, Date) multi-index;
        # re-index on Date alone and drop the redundant Symbol column
        df.reset_index(inplace=True)
        df.set_index("Date", inplace=True)
        df = df.drop("Symbol", axis=1)
        df.to_csv('stock_dfs/{}.csv'.format(ticker))
    else:
        print('Already have {}'.format(ticker))
You will likely, in time, want to add some sort of force_data_update parameter to this function, since, right now, it will not re-pull data it already sees it has. Since we're pulling daily data, you'd want it to re-pull at least the latest data. That said, if that's the case, you might be better off using a database with a table per company, and then just pulling the most recent values from the data source. We'll keep things simple for now, though!
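As a minimal sketch of that idea (force_data_update is a hypothetical parameter name, not something we actually add in this series), it might look like:

def get_data_from_yahoo(reload_sp500=False, force_data_update=False):
    if reload_sp500:
        tickers = save_sp500_tickers()
    else:
        with open("sp500tickers.pickle", "rb") as f:
            tickers = pickle.load(f)

    if not os.path.exists('stock_dfs'):
        os.makedirs('stock_dfs')

    start = dt.datetime(2010, 1, 1)
    end = dt.datetime.now()

    for ticker in tickers:
        csv_path = 'stock_dfs/{}.csv'.format(ticker)
        # re-pull when forced, or when we have no local copy yet
        if force_data_update or not os.path.exists(csv_path):
            df = web.DataReader(ticker, 'morningstar', start, end)
            df.reset_index(inplace=True)
            df.set_index("Date", inplace=True)
            df = df.drop("Symbol", axis=1)
            df.to_csv(csv_path)
        else:
            print('Already have {}'.format(ticker))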
Full code up to this point:
import bs4 as bs
import datetime as dt
import os
import pandas_datareader.data as web
import pickle
import requests


def save_sp500_tickers():
    resp = requests.get('http://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
    soup = bs.BeautifulSoup(resp.text, 'lxml')
    table = soup.find('table', {'class': 'wikitable sortable'})
    tickers = []
    for row in table.findAll('tr')[1:]:
        ticker = row.findAll('td')[0].text
        tickers.append(ticker)
    with open("sp500tickers.pickle", "wb") as f:
        pickle.dump(tickers, f)
    return tickers


# save_sp500_tickers()

def get_data_from_yahoo(reload_sp500=False):
    if reload_sp500:
        tickers = save_sp500_tickers()
    else:
        with open("sp500tickers.pickle", "rb") as f:
            tickers = pickle.load(f)

    if not os.path.exists('stock_dfs'):
        os.makedirs('stock_dfs')

    start = dt.datetime(2010, 1, 1)
    end = dt.datetime.now()

    for ticker in tickers:
        # just in case your connection breaks, we'd like to save our progress!
        if not os.path.exists('stock_dfs/{}.csv'.format(ticker)):
            df = web.DataReader(ticker, 'morningstar', start, end)
            df.reset_index(inplace=True)
            df.set_index("Date", inplace=True)
            df = df.drop("Symbol", axis=1)
            df.to_csv('stock_dfs/{}.csv'.format(ticker))
        else:
            print('Already have {}'.format(ticker))


get_data_from_yahoo()
Go ahead and run this. You might want to import time and add a time.sleep(0.5) or something if Yahoo throttles you. At the time of my writing this, Yahoo did not throttle me at all, and I was able to run this all the way through without any issues. It might still take you a while, however, especially depending on your machine. The good news is, we won't need to do it again! In practice, since this is daily data, you might do this once a day.
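If you do get throttled, the simplest fix is a sleep inside the download loop. A sketch (the half-second delay is just a guess at a polite rate):

import time

for ticker in tickers:
    if not os.path.exists('stock_dfs/{}.csv'.format(ticker)):
        df = web.DataReader(ticker, 'morningstar', start, end)
        df.reset_index(inplace=True)
        df.set_index("Date", inplace=True)
        df = df.drop("Symbol", axis=1)
        df.to_csv('stock_dfs/{}.csv'.format(ticker))
        time.sleep(0.5)  # brief pause between requests to avoid hammering the API
    else:
        print('Already have {}'.format(ticker))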
Also, if you have slow internet, you don't need to do all the tickers; even just 10 would be enough. You could do for ticker in tickers[:10]:, or something like that, to speed things up.
In the next tutorial, once you have the data downloaded, we're going to compile the data we're interested in into one large Pandas DataFrame.