Combining all S&P 500 company prices into one DataFrame - Python Programming for Finance p.7




Hello and welcome to part 7 of the Python for Finance tutorial series. In the previous tutorial, we grabbed the Yahoo Finance data for the entire S&P 500 of companies. In this tutorial, we're going to bring this data together into one DataFrame.

Code up to this point:

import bs4 as bs
import datetime as dt
import os
import pandas_datareader.data as web
import pickle
import requests


def save_sp500_tickers():
    resp = requests.get('http://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
    soup = bs.BeautifulSoup(resp.text, 'lxml')
    table = soup.find('table', {'class': 'wikitable sortable'})
    tickers = []
    for row in table.findAll('tr')[1:]:
        ticker = row.findAll('td')[0].text
        tickers.append(ticker)
    with open("sp500tickers.pickle", "wb") as f:
        pickle.dump(tickers, f)
    return tickers


# save_sp500_tickers()
def get_data_from_yahoo(reload_sp500=False):
    if reload_sp500:
        tickers = save_sp500_tickers()
    else:
        with open("sp500tickers.pickle", "rb") as f:
            tickers = pickle.load(f)
    if not os.path.exists('stock_dfs'):
        os.makedirs('stock_dfs')

    start = dt.datetime(2010, 1, 1)
    end = dt.datetime.now()
    for ticker in tickers:
        # just in case your connection breaks, we'd like to save our progress!
        if not os.path.exists('stock_dfs/{}.csv'.format(ticker)):
            df = web.DataReader(ticker, 'morningstar', start, end)
            df.reset_index(inplace=True)
            df.set_index("Date", inplace=True)
            df = df.drop("Symbol", axis=1)
            df.to_csv('stock_dfs/{}.csv'.format(ticker))
        else:
            print('Already have {}'.format(ticker))


get_data_from_yahoo()

While we do have all of the data at our disposal, we may want to actually assess the data together. To do this, we're going to join all of the stock datasets together. Each of the stock files at the moment come with: Open, High, Low, Close, Volume, and Adj Close. At least to start, we're mostly just interested in the adjusted close for now.

def compile_data():
    with open("sp500tickers.pickle","rb") as f:
        tickers = pickle.load(f)

    main_df = pd.DataFrame()

To begin, we pull our previously-made list of tickers, and begin with an empty DataFrame, called main_df. Now, we're ready to read in each stock's dataframe:

    for count,ticker in enumerate(tickers):
        df = pd.read_csv('stock_dfs/{}.csv'.format(ticker))
        df.set_index('Date', inplace=True)

You do not need to use Python's enumerate here, I am just using it so we know where we are in the process of reading in all of the data. You could just iterate over the tickers. From this point, we *could* generate extra columns with interesting data, like:

        df['{}_HL_pct_diff'.format(ticker)] = (df['High'] - df['Low']) / df['Low']
        df['{}_daily_pct_chng'.format(ticker)] = (df['Close'] - df['Open']) / df['Open']

For now, however, we're not going to be bothered with this. Just know this could be a path to pursue down the road. Instead, we're really just interested in that Adj Close column:

        df.rename(columns={'Adj Close':ticker}, inplace=True)
        df.drop(['Open','High','Low','Close','Volume'],1,inplace=True)

Now we've got just that column (or maybe extras, like above...but remember, in this example, we're not doing the HL_pct_diff or daily_pct_chng). Notice that we have renamed the Adj Close column to whatever the ticker name is. Let's begin building the shared dataframe:

        if main_df.empty:
            main_df = df
        else:
            main_df = main_df.join(df, how='outer')

If there's nothing in the main_df, then we'll start with the current df, otherwise we're going to use Pandas' join.

Still within this for loop, we'll add two more lines:

        if count % 10 == 0:
            print(count)

This will just output the count of the current ticker if it's evenly divisible by 10. What count % 10 gives us is the remainder if count was to be divided by 10. So if we ask if count % 10 == 0, we're only going to see the if statement True if the current count, divided by 10, has a remainder of 0, or if it is perfectly divisible by 10.

When we're all done with the for-loop:

    print(main_df.head())
    main_df.to_csv('sp500_joined_closes.csv')

This function and calling it up to this point:

    with open("sp500tickers.pickle","rb") as f:
        tickers = pickle.load(f)

    main_df = pd.DataFrame()

    for count,ticker in enumerate(tickers):
        df = pd.read_csv('stock_dfs/{}.csv'.format(ticker))
        df.set_index('Date', inplace=True)

        df.rename(columns={'Adj Close':ticker}, inplace=True)
        df.drop(['Open','High','Low','Close','Volume'],1,inplace=True)

        if main_df.empty:
            main_df = df
        else:
            main_df = main_df.join(df, how='outer')

        if count % 10 == 0:
            print(count)
    print(main_df.head())
    main_df.to_csv('sp500_joined_closes.csv')


compile_data()

Full code up to this point:

import bs4 as bs
import datetime as dt
import os
import pandas as pd
import pandas_datareader.data as web
import pickle
import requests


def save_sp500_tickers():
    resp = requests.get('http://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
    soup = bs.BeautifulSoup(resp.text, 'lxml')
    table = soup.find('table', {'class': 'wikitable sortable'})
    tickers = []
    for row in table.findAll('tr')[1:]:
        ticker = row.findAll('td')[0].text
        tickers.append(ticker)
    with open("sp500tickers.pickle", "wb") as f:
        pickle.dump(tickers, f)
    return tickers


# save_sp500_tickers()
def get_data_from_yahoo(reload_sp500=False):
    if reload_sp500:
        tickers = save_sp500_tickers()
    else:
        with open("sp500tickers.pickle", "rb") as f:
            tickers = pickle.load(f)
    if not os.path.exists('stock_dfs'):
        os.makedirs('stock_dfs')

    start = dt.datetime(2010, 1, 1)
    end = dt.datetime.now()
    for ticker in tickers:
        # just in case your connection breaks, we'd like to save our progress!
        if not os.path.exists('stock_dfs/{}.csv'.format(ticker)):
            df = web.DataReader(ticker, 'morningstar', start, end)
            df.reset_index(inplace=True)
            df.set_index("Date", inplace=True)
            df = df.drop("Symbol", axis=1)
            df.to_csv('stock_dfs/{}.csv'.format(ticker))
        else:
            print('Already have {}'.format(ticker))


def compile_data():
    with open("sp500tickers.pickle", "rb") as f:
        tickers = pickle.load(f)

    main_df = pd.DataFrame()

    for count, ticker in enumerate(tickers):
        df = pd.read_csv('stock_dfs/{}.csv'.format(ticker))
        df.set_index('Date', inplace=True)

        df.rename(columns={'Adj Close': ticker}, inplace=True)
        df.drop(['Open', 'High', 'Low', 'Close', 'Volume'], 1, inplace=True)

        if main_df.empty:
            main_df = df
        else:
            main_df = main_df.join(df, how='outer')

        if count % 10 == 0:
            print(count)
    print(main_df.head())
    main_df.to_csv('sp500_joined_closes.csv')


compile_data()

In the next tutorial, we're going to attempt to see if we can quickly find any relationships in the data.

The next tutorial:





  • Intro and Getting Stock Price Data - Python Programming for Finance p.1
  • Handling Data and Graphing - Python Programming for Finance p.2
  • Basic stock data Manipulation - Python Programming for Finance p.3
  • More stock manipulations - Python Programming for Finance p.4
  • Automating getting the S&P 500 list - Python Programming for Finance p.5
  • Getting all company pricing data in the S&P 500 - Python Programming for Finance p.6
  • Combining all S&P 500 company prices into one DataFrame - Python Programming for Finance p.7
  • Creating massive S&P 500 company correlation table for Relationships - Python Programming for Finance p.8
  • Preprocessing data to prepare for Machine Learning with stock data - Python Programming for Finance p.9
  • Creating targets for machine learning labels - Python Programming for Finance p.10 and 11
  • Machine learning against S&P 500 company prices - Python Programming for Finance p.12
  • Testing trading strategies with Quantopian Introduction - Python Programming for Finance p.13
  • Placing a trade order with Quantopian - Python Programming for Finance p.14
  • Scheduling a function on Quantopian - Python Programming for Finance p.15
  • Quantopian Research Introduction - Python Programming for Finance p.16
  • Quantopian Pipeline - Python Programming for Finance p.17
  • Alphalens on Quantopian - Python Programming for Finance p.18
  • Back testing our Alpha Factor on Quantopian - Python Programming for Finance p.19
  • Analyzing Quantopian strategy back test results with Pyfolio - Python Programming for Finance p.20
  • Strategizing - Python Programming for Finance p.21
  • Finding more Alpha Factors - Python Programming for Finance p.22
  • Combining Alpha Factors - Python Programming for Finance p.23
  • Portfolio Optimization - Python Programming for Finance p.24
  • Zipline Local Installation for backtesting - Python Programming for Finance p.25
  • Zipline backtest visualization - Python Programming for Finance p.26
  • Custom Data with Zipline Local - Python Programming for Finance p.27
  • Custom Markets Trading Calendar with Zipline (Bitcoin/cryptocurrency example) - Python Programming for Finance p.28