Hello and welcome to part 5 of the Python for Finance tutorial series. In this tutorial and the next few, we're going to be working on how we can go about grabbing pricing information en masse for a larger list of companies, and then how we can work with all of this data at once.
To begin, we need a list of companies. I could just hand you a list, but actually acquiring a list of stocks can be just one of the many challenges you might encounter. In our case, we want a Python list of the S&P 500 companies.
Whether you are looking for the Dow Jones companies, the S&P 500, or the Russell 3000, chances are someone, somewhere has posted a list of these companies. You will want to make sure it is up-to-date, and chances are it's not already in the perfect format for you. In our case, we're going to grab the list from Wikipedia: http://en.wikipedia.org/wiki/List_of_S%26P_500_companies.
The tickers/symbols on Wikipedia are organized in a table. To handle this, we're going to use the HTML parsing library, Beautiful Soup. If you would like to learn more about Beautiful Soup, I have a quick 4-part tutorial on web scraping with Beautiful Soup.
First, let's begin with some imports:
import bs4 as bs
import pickle
import requests
bs4 is for Beautiful Soup, pickle is so we can easily save this list of companies rather than hitting Wikipedia every time we run (though remember, in time, you will want to update this list!), and we'll be using requests to grab the source code from Wikipedia's page.
To begin our function:
def save_sp500_tickers():
    resp = requests.get('http://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
    soup = bs.BeautifulSoup(resp.text, 'lxml')
    table = soup.find('table', {'class': 'wikitable sortable'})
First, we visit the Wikipedia page and are given the response, which contains our source code. To treat the source code how we want, we access the .text attribute, which we turn into a soup object using BeautifulSoup. If you're not familiar with what BeautifulSoup does for you, it basically turns source code into a BeautifulSoup object that can suddenly be treated much more like a typical Python object.
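As a quick illustrative sketch of what that means in practice (the exact output depends on the page's current HTML, so treat these lines as examples rather than guarantees):

import bs4 as bs
import requests

resp = requests.get('http://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
soup = bs.BeautifulSoup(resp.text, 'lxml')

# Dotted access walks the parse tree like a normal object:
print(soup.title.text)

# find() searches it, just like the table lookup we do next:
first_link = soup.find('a')
print(first_link.get('href'))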
There was once a time when Wikipedia attempted to block requests coming from Python. Currently, at the time of my writing this, the code works without changing headers. If you're finding that the original source code (resp.text) doesn't seem to be returning the same page as you see on your home computer, add the following headers and change the resp line:
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17'}
resp = requests.get('http://en.wikipedia.org/wiki/List_of_S%26P_500_companies', headers=headers)
Once we have our soup, we can find the table of stock data by simply searching for the wikitable sortable class. The only reason I know to specify this table is because I viewed the source code in a browser first. There may come a time when you want to parse a different website's list of stocks; maybe it's in a table, maybe it's a list, or maybe something with div tags (there's a hypothetical div-based sketch after the loop explanation below). This is just one very specific solution. From here, we just iterate through the table:
    tickers = []
    for row in table.findAll('tr')[1:]:
        # .strip() guards against stray whitespace/newlines around the symbol
        ticker = row.findAll('td')[0].text.strip()
        tickers.append(ticker)
For each row after the header row (this is why we're going through with [1:]), we're saying the ticker is the first "table data" (td) cell; we grab the .text of it (stripping any surrounding whitespace), and we append this ticker to our list.
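For comparison, here's the purely hypothetical div-based sketch mentioned above, for a site that wraps each symbol in div tags instead of a table (the ticker-cell class name is invented for illustration; you'd view that site's source to find the real one):

# Hypothetical: a page where each symbol sits in <div class="ticker-cell">,
# with soup parsed from that site's page as before
tickers = []
for div in soup.findAll('div', {'class': 'ticker-cell'}):
    tickers.append(div.text.strip())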
Now, it'd be nice if we could just save this list. We'll use the pickle module for this, which serializes Python objects for us.
    with open("sp500tickers.pickle", "wb") as f:
        pickle.dump(tickers, f)

    return tickers
We'd like to go ahead and save this so we don't have to request Wikipedia multiple times a day. At any time, we can update this list, or we could program it to check once a month, etc.
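When you want the list back later, loading is just the reverse of the dump. A minimal sketch, assuming sp500tickers.pickle exists from a previous run:

import pickle

with open("sp500tickers.pickle", "rb") as f:
    tickers = pickle.load(f)  # deserialize the saved list

print(tickers[:5])  # first few symbols as a sanity check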
Full code up to this point:
import bs4 as bs
import pickle
import requests


def save_sp500_tickers():
    resp = requests.get('http://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
    soup = bs.BeautifulSoup(resp.text, 'lxml')
    table = soup.find('table', {'class': 'wikitable sortable'})
    tickers = []
    for row in table.findAll('tr')[1:]:
        # .strip() guards against stray whitespace/newlines around the symbol
        ticker = row.findAll('td')[0].text.strip()
        tickers.append(ticker)

    with open("sp500tickers.pickle", "wb") as f:
        pickle.dump(tickers, f)

    return tickers


save_sp500_tickers()
Now that we know the tickers, we're ready to pull information on them all, which is something we will do in the next tutorial.