Getting all company pricing data in the S&P 500 - Python Programming for Finance p.6

Hello and welcome to part 6 of the Python for Finance tutorial series. In the previous finance with Python tutorial, we covered how to acquire the list of companies that we're interested in (S&P 500 in our case), and now we're going to pull stock pricing data on all of them.

Code up to this point:

import bs4 as bs
import pickle
import requests

def save_sp500_tickers():
    resp = requests.get('http://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
    soup = bs.BeautifulSoup(resp.text, 'lxml')
    table = soup.find('table', {'class': 'wikitable sortable'})
    tickers = []
    for row in table.findAll('tr')[1:]:
        ticker = row.findAll('td')[0].text.strip()  # strip stray whitespace/newlines from the cell text
        tickers.append(ticker)

    with open("sp500tickers.pickle", "wb") as f:
        pickle.dump(tickers, f)

    return tickers

We're going to add a few new imports:

import bs4 as bs
import datetime as dt
import os
import pandas_datareader.data as web
import pickle
import requests

We'll use datetime to specify dates for the Pandas datareader, and os to check for, and create, directories. You already know what pandas is for!

To start our new function:

# save_sp500_tickers()
def get_data_from_yahoo(reload_sp500=False):
    if reload_sp500:
        tickers = save_sp500_tickers()
    else:
        with open("sp500tickers.pickle", "rb") as f:
            tickers = pickle.load(f)

Here's where I'll just show a quick example of one way you could handle whether or not to reload the S&P 500 list. If we ask it to, the program will re-pull the S&P 500 list; otherwise, it will just use our pickle. Now we want to prepare to grab data.

Now we need to decide what we're going to do with the data. What I tend to do is parse websites ONCE and store the data locally. I don't try to know in advance all of the things I might do with the data, but I know that if I am going to pull it more than once, I might as well just save it (unless it's a huge dataset, which this is not). Thus, we're going to pull everything we can from what the data source returns to us for every stock and just save it. (Despite the function name get_data_from_yahoo, the code below pulls from Morningstar via pandas_datareader, since Yahoo's finance API had become unreliable.) To do this, we'll create a new directory and, in there, store stock data per company. To begin, we need that initial directory:

    if not os.path.exists('stock_dfs'):
        os.makedirs('stock_dfs')

You could just store these datasets in the same directory as your script, but this would get pretty messy in my opinion. Now we're ready to pull the data. You already know how to do this; we did it in the very first tutorial!

    start = dt.datetime(2010, 1, 1)
    end = dt.datetime.now()
    for ticker in tickers:
        # just in case your connection breaks, we'd like to save our progress!
        if not os.path.exists('stock_dfs/{}.csv'.format(ticker)):
            df = web.DataReader(ticker, 'morningstar', start, end)
            # Morningstar gives us a (Symbol, Date) MultiIndex; flatten it,
            # re-index on Date alone, and drop the redundant Symbol column.
            df.reset_index(inplace=True)
            df.set_index("Date", inplace=True)
            df = df.drop("Symbol", axis=1)
            df.to_csv('stock_dfs/{}.csv'.format(ticker))
        else:
            print('Already have {}'.format(ticker))

You will likely, in time, want to add some sort of force_data_update parameter to this function, since, right now, it will not re-pull data it already sees it has. Since we're pulling daily data, you'd want it to re-pull at least the latest data. That said, if that's the case, you might be better off using a database with a table per company, and then just pulling the most recent values from the data source. We'll keep things simple for now though!
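
If you want to experiment with that, here's a minimal sketch of one way the flag could work. Note that update_stock_csvs is a helper name I've made up for illustration; it's not part of the tutorial's code:

import datetime as dt
import os
import pandas_datareader.data as web


def update_stock_csvs(tickers, force_data_update=False):
    # force_data_update: when True, re-download a ticker's CSV even if a
    # copy already exists on disk; when False, skip anything we have.
    start = dt.datetime(2010, 1, 1)
    end = dt.datetime.now()
    if not os.path.exists('stock_dfs'):
        os.makedirs('stock_dfs')
    for ticker in tickers:
        csv_path = 'stock_dfs/{}.csv'.format(ticker)
        if force_data_update or not os.path.exists(csv_path):
            df = web.DataReader(ticker, 'morningstar', start, end)
            df.reset_index(inplace=True)
            df.set_index("Date", inplace=True)
            df = df.drop("Symbol", axis=1)
            df.to_csv(csv_path)
        else:
            print('Already have {}'.format(ticker))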

Full code up to this point:

import bs4 as bs
import datetime as dt
import os
import pandas_datareader.data as web
import pickle
import requests


def save_sp500_tickers():
    resp = requests.get('http://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
    soup = bs.BeautifulSoup(resp.text, 'lxml')
    table = soup.find('table', {'class': 'wikitable sortable'})
    tickers = []
    for row in table.findAll('tr')[1:]:
        ticker = row.findAll('td')[0].text.strip()  # strip stray whitespace/newlines from the cell text
        tickers.append(ticker)
    with open("sp500tickers.pickle", "wb") as f:
        pickle.dump(tickers, f)
    return tickers


# save_sp500_tickers()
def get_data_from_yahoo(reload_sp500=False):
    if reload_sp500:
        tickers = save_sp500_tickers()
    else:
        with open("sp500tickers.pickle", "rb") as f:
            tickers = pickle.load(f)
    if not os.path.exists('stock_dfs'):
        os.makedirs('stock_dfs')

    start = dt.datetime(2010, 1, 1)
    end = dt.datetime.now()
    for ticker in tickers:
        # just in case your connection breaks, we'd like to save our progress!
        if not os.path.exists('stock_dfs/{}.csv'.format(ticker)):
            df = web.DataReader(ticker, 'morningstar', start, end)
            # Morningstar gives us a (Symbol, Date) MultiIndex; flatten it,
            # re-index on Date alone, and drop the redundant Symbol column.
            df.reset_index(inplace=True)
            df.set_index("Date", inplace=True)
            df = df.drop("Symbol", axis=1)
            df.to_csv('stock_dfs/{}.csv'.format(ticker))
        else:
            print('Already have {}'.format(ticker))


get_data_from_yahoo()

Go ahead and run this. You might want to import time and add a time.sleep(0.5) or something if the data source throttles you. At the time of my writing this, I was not throttled at all and was able to run this all the way through without any issues. It might still take you a while, however, especially depending on your machine. The good news, though, is that we won't need to do it again! In practice, since this is daily data, you might re-run it once a day.
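
If you do get throttled, the change is just a pause inside the download loop. Here's a sketch, assuming tickers, start, and end are defined as in get_data_from_yahoo above (the half-second delay is an arbitrary choice):

import os
import time

import pandas_datareader.data as web

# assuming tickers, start, and end are defined as in get_data_from_yahoo
for ticker in tickers:
    if not os.path.exists('stock_dfs/{}.csv'.format(ticker)):
        df = web.DataReader(ticker, 'morningstar', start, end)
        df.reset_index(inplace=True)
        df.set_index("Date", inplace=True)
        df = df.drop("Symbol", axis=1)
        df.to_csv('stock_dfs/{}.csv'.format(ticker))
        time.sleep(0.5)  # brief pause between requests
    else:
        print('Already have {}'.format(ticker))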

Also, if you have a slow internet connection, you don't need to do all of the tickers; even just 10 would be enough. You can just do for ticker in tickers[:10]:, or something like that, to speed things up.
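
Once some files are downloaded, a quick sanity check is to load one back in and look at the first few rows. Here I assume AAPL's CSV was among those downloaded; any ticker you grabbed will do:

import pandas as pd

# load one of the per-company CSVs back in to confirm what we saved
df = pd.read_csv('stock_dfs/AAPL.csv', parse_dates=True, index_col=0)
print(df.head())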

In the next tutorial, once you have the data downloaded, we're going to compile the data we're interested in into one large Pandas DataFrame.

The next tutorial: Combining all S&P 500 company prices into one DataFrame - Python Programming for Finance p.7

  • Intro and Getting Stock Price Data - Python Programming for Finance p.1
  • Handling Data and Graphing - Python Programming for Finance p.2
  • Basic stock data Manipulation - Python Programming for Finance p.3
  • More stock manipulations - Python Programming for Finance p.4
  • Automating getting the S&P 500 list - Python Programming for Finance p.5
  • Getting all company pricing data in the S&P 500 - Python Programming for Finance p.6
  • Combining all S&P 500 company prices into one DataFrame - Python Programming for Finance p.7
  • Creating massive S&P 500 company correlation table for Relationships - Python Programming for Finance p.8
  • Preprocessing data to prepare for Machine Learning with stock data - Python Programming for Finance p.9
  • Creating targets for machine learning labels - Python Programming for Finance p.10 and 11
  • Machine learning against S&P 500 company prices - Python Programming for Finance p.12
  • Testing trading strategies with Quantopian Introduction - Python Programming for Finance p.13
  • Placing a trade order with Quantopian - Python Programming for Finance p.14
  • Scheduling a function on Quantopian - Python Programming for Finance p.15
  • Quantopian Research Introduction - Python Programming for Finance p.16
  • Quantopian Pipeline - Python Programming for Finance p.17
  • Alphalens on Quantopian - Python Programming for Finance p.18
  • Back testing our Alpha Factor on Quantopian - Python Programming for Finance p.19
  • Analyzing Quantopian strategy back test results with Pyfolio - Python Programming for Finance p.20
  • Strategizing - Python Programming for Finance p.21
  • Finding more Alpha Factors - Python Programming for Finance p.22
  • Combining Alpha Factors - Python Programming for Finance p.23
  • Portfolio Optimization - Python Programming for Finance p.24
  • Zipline Local Installation for backtesting - Python Programming for Finance p.25
  • Zipline backtest visualization - Python Programming for Finance p.26
  • Custom Data with Zipline Local - Python Programming for Finance p.27
  • Custom Markets Trading Calendar with Zipline (Bitcoin/cryptocurrency example) - Python Programming for Finance p.28