Getting more data and meshing data sets

Our end goal with "labeling" data is to categorize it. With investing, we really have just two categories: outperform or under-perform. In the future, we may add a third category, "match." For now, we're only interested in whether it would have been better to invest in the company or not, though it might be a good idea to further differentiate companies later, like:

  • Significantly Outperform
  • Outperform
  • Match (say, within 0.5% or something)
  • Under-perform
  • Significantly Under-perform

Then we could convert those to, say: 2, 1, 0, -1, and -2, respectively.

That said, we're going to keep things simple for now and just do:

  • Outperform (1)
  • Under-perform (0)
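
To make the mapping concrete, here's a minimal sketch of the kind of rule we'll eventually implement; the function name label and the percent-change inputs are placeholders for illustration, not code from this series yet:

def label(stock_pct_change, sp500_pct_change):
    # 1 = the stock outperformed the market, 0 = it under-performed.
    if stock_pct_change > sp500_pct_change:
        return 1
    else:
        return 0

print(label(12.5, 8.0))  # 1: the stock beat the S&P 500
print(label(3.0, 8.0))   # 0: the stock lagged the S&P 500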

With that in mind, how do we intend to do this? Well, there are a few necessary steps!

First, we need something to compare each company's performance against. In our case, we're going to compare to the "market," meaning the S&P 500 index, so we need the S&P 500's value. Yahoo Finance does track this for us, though they switch up their methodology partway through our sample. We could compensate for that, but I think we might as well cover a more valuable learning experience: meshing two different datasets! Fun!

So, our main data set is Yahoo Finance, but we want to combine it with another: the S&P 500 values. For most basic data sets, the website Quandl.com is my go-to source. They have data on all kinds of interesting topics. Their granularity is usually not ideal, but their breadth of coverage is exquisite. You can certainly use Quandl for testing and prototyping to see whether buying a better data set might be worth it.

Head on over to quandl.com, search for S&P 500, choose the index, and choose data from the year 2000 onwards. After that, click "download" and choose CSV. Easy as that, now we've got a CSV. Next, we need to read it and load it into our DataFrame. It turns out Pandas does all of this for us, so let's do it!

Now the top of our program will read like:

import pandas as pd
import os
import time
from datetime import datetime

path = "X:/Backups/intraQuarter"

def Key_Stats(gather="Total Debt/Equity (mrq)"):
    statspath = path+'/_KeyStats'
    stock_list = [x[0] for x in os.walk(statspath)]
    df = pd.DataFrame(columns = ['Date','Unix','Ticker','DE Ratio','Price','SP500'])

    # Load the Quandl S&P 500 CSV, using the date column as the index.
    sp500_df = pd.read_csv("YAHOO-INDEX_GSPC.csv", index_col=0, parse_dates=True)

As you can see, all remains the same besides two changes: we initialize our DataFrame with a few new columns, and we define sp500_df, its own DataFrame, which we load with pd.read_csv, using the date column as the index.
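
If you want a quick sanity check that the load worked, printing the head and the columns will confirm that the date index parsed and that the "Adjusted Close" column we query later is really there:

import pandas as pd

# Peek at the data: the index should be dates, and "Adjusted Close"
# should be among the columns.
sp500_df = pd.read_csv("YAHOO-INDEX_GSPC.csv", index_col=0, parse_dates=True)
print(sp500_df.head())
print(sp500_df.columns)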

Next up, we begin the for loop:

    for each_dir in stock_list[1:25]:
        each_file = os.listdir(each_dir)
        # Directory names end with the ticker, after the backslash.
        ticker = each_dir.split("\\")[1]
        if len(each_file) > 0:
            for file in each_file:
                # File names are timestamps, e.g. 20040130104826.html.
                date_stamp = datetime.strptime(file, '%Y%m%d%H%M%S.html')
                unix_time = time.mktime(date_stamp.timetuple())
                full_file_path = each_dir+'/'+file
                with open(full_file_path, 'r') as f:
                    source = f.read()
                try:
                    value = float(source.split(gather+':</td><td class="yfnc_tabledata1">')[1].split('</td>')[0])

                    try:
                        # Look up the S&P 500 on the same date as the stock file.
                        sp500_date = datetime.fromtimestamp(unix_time).strftime('%Y-%m-%d')
                        row = sp500_df[(sp500_df.index == sp500_date)]
                        sp500_value = float(row["Adjusted Close"].iloc[0])
                    except Exception:
                        # Weekend pull: fall back three days to a trading day.
                        sp500_date = datetime.fromtimestamp(unix_time-259200).strftime('%Y-%m-%d')
                        row = sp500_df[(sp500_df.index == sp500_date)]
                        sp500_value = float(row["Adjusted Close"].iloc[0])

Most of this remains the same, except for the end. You can see that we have a new try-except block containing the definitions of the S&P 500 date and value. What we're doing here is hunting for the value of the S&P 500 index on the same date as our stock file. The reason for the try and except is that some of our stock data may have been pulled on a weekend. If we hunt for a weekend's value of the S&P 500, that date simply won't exist in the dataset, since there is only data for Monday through Friday. When that happens, we subtract 259,200 seconds (three days) from the timestamp to land back on a trading day.
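
The fixed three-day jump can still miss (a market holiday adjacent to a weekend, for example). A minimal sketch of a sturdier lookup, with the hypothetical helper name sp500_value_on, walks back one day at a time until a trading day hits:

from datetime import datetime, timedelta

def sp500_value_on(sp500_df, unix_time, max_days_back=7):
    # Walk backwards from the file's date until a row exists in the
    # S&P 500 DataFrame, i.e. the most recent trading day.
    day = datetime.fromtimestamp(unix_time)
    for _ in range(max_days_back):
        row = sp500_df[sp500_df.index == day.strftime('%Y-%m-%d')]
        if len(row) > 0:
            return float(row["Adjusted Close"].iloc[0])
        day -= timedelta(days=1)
    raise ValueError("no S&P 500 value within %d days" % max_days_back)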

Next in the script, we want to pull the stock price to compare to the S&P 500 value:

                    stock_price = float(source.split('</small><big><b>')[1].split('</b></big>')[0])

Here, again, we're using fairly clunky splitting procedures to hunt for the price. Eventually, we're going to upgrade this using regular expressions.
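
Just for a taste of what that upgrade might look like, here's a sketch using Python's re module against a made-up fragment of page source; the pattern is an assumption for illustration, not the exact one the series settles on:

import re

# A made-up snippet standing in for the real Yahoo Finance page source.
source = '...</small><big><b>1,026.24</b></big>...'

# Capture whatever sits between the price's wrapper tags, then strip
# thousands separators before converting to a float.
match = re.search(r'</small><big><b>(.*?)</b></big>', source)
if match:
    stock_price = float(match.group(1).replace(',', ''))
    print(stock_price)  # 1026.24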

The final major change is adding all of this to our DataFrame as a new row:

                    # Add this observation as a new row in our DataFrame.
                    new_row = pd.DataFrame([{'Date':date_stamp,
                                             'Unix':unix_time,
                                             'Ticker':ticker,
                                             'DE Ratio':value,
                                             'Price':stock_price,
                                             'SP500':sp500_value}])
                    df = pd.concat([df, new_row], ignore_index=True)
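
Concatenating one row at a time works fine at this scale, but it copies the whole DataFrame on every pass. If the loop ever grows into thousands of files, a common pattern, sketched here with made-up values, is to collect plain dicts in a list and build the DataFrame once at the end:

import pandas as pd

rows = []
# Inside the parsing loop we would append one dict per stock file; these
# two made-up rows just stand in for illustration.
rows.append({'Ticker': 'aapl', 'DE Ratio': 35.0, 'Price': 26.2, 'SP500': 1131.1})
rows.append({'Ticker': 'abbv', 'DE Ratio': 60.1, 'Price': 40.0, 'SP500': 1142.5})

# One construction call at the end replaces all the per-row copies.
df = pd.DataFrame(rows)
print(df)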

Just in case you're lost somewhere up to this point, here's the full code:

import pandas as pd
import os
import time
from datetime import datetime

path = "X:/Backups/intraQuarter"

def Key_Stats(gather="Total Debt/Equity (mrq)"):
    statspath = path+'/_KeyStats'
    stock_list = [x[0] for x in os.walk(statspath)]
    df = pd.DataFrame(columns = ['Date','Unix','Ticker','DE Ratio','Price','SP500'])

    # Load the Quandl S&P 500 CSV, using the date column as the index.
    sp500_df = pd.read_csv("YAHOO-INDEX_GSPC.csv", index_col=0, parse_dates=True)

    for each_dir in stock_list[1:25]:
        each_file = os.listdir(each_dir)
        # Directory names end with the ticker, after the backslash.
        ticker = each_dir.split("\\")[1]
        if len(each_file) > 0:
            for file in each_file:
                # File names are timestamps, e.g. 20040130104826.html.
                date_stamp = datetime.strptime(file, '%Y%m%d%H%M%S.html')
                unix_time = time.mktime(date_stamp.timetuple())
                full_file_path = each_dir+'/'+file
                with open(full_file_path, 'r') as f:
                    source = f.read()
                try:
                    value = float(source.split(gather+':</td><td class="yfnc_tabledata1">')[1].split('</td>')[0])

                    try:
                        # Look up the S&P 500 on the same date as the stock file.
                        sp500_date = datetime.fromtimestamp(unix_time).strftime('%Y-%m-%d')
                        row = sp500_df[(sp500_df.index == sp500_date)]
                        sp500_value = float(row["Adjusted Close"].iloc[0])
                    except Exception:
                        # Weekend pull: fall back three days to a trading day.
                        sp500_date = datetime.fromtimestamp(unix_time-259200).strftime('%Y-%m-%d')
                        row = sp500_df[(sp500_df.index == sp500_date)]
                        sp500_value = float(row["Adjusted Close"].iloc[0])

                    stock_price = float(source.split('</small><big><b>')[1].split('</b></big>')[0])
                    #print("stock_price:",stock_price,"ticker:", ticker)

                    # Add this observation as a new row in our DataFrame.
                    new_row = pd.DataFrame([{'Date':date_stamp,
                                             'Unix':unix_time,
                                             'Ticker':ticker,
                                             'DE Ratio':value,
                                             'Price':stock_price,
                                             'SP500':sp500_value}])
                    df = pd.concat([df, new_row], ignore_index=True)
                except Exception as e:
                    pass

    # Strip spaces, parens, and the slash from the gather string to build
    # the output file name, e.g. TotalDebtEquitymrq.csv.
    save = gather.replace(' ','').replace(')','').replace('(','').replace('/','')+('.csv')
    print(save)
    df.to_csv(save)

Key_Stats()
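
With gather left at its default, the chained replace calls reduce "Total Debt/Equity (mrq)" to TotalDebtEquitymrq.csv, so a quick way to verify the run is to read that file back:

import pandas as pd

# Read back the CSV that Key_Stats() just wrote and eyeball a few rows.
df = pd.read_csv("TotalDebtEquitymrq.csv", index_col=0)
print(df.head())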
    

The next tutorial: Labeling of data part 1

  • Intro to Machine Learning with Scikit Learn and Python
  • Simple Support Vector Machine (SVM) example with character recognition
  • Our Method and where we will be getting our Data
  • Parsing data
  • More Parsing
  • Structuring data with Pandas
  • Getting more data and meshing data sets
  • Labeling of data part 1
  • Labeling data part 2
  • Finally finishing up the labeling
  • Linear SVC Machine learning SVM example with Python
  • Getting more features from our data
  • Linear SVC machine learning and testing our data
  • Scaling, Normalizing, and machine learning with many features
  • Shuffling our data to solve a learning issue
  • Using Quandl for more data
  • Improving our Analysis with a more accurate measure of performance in relation to fundamentals
  • Learning and Testing our Machine learning algorithm
  • More testing, this time including N/A data
  • Back-testing the strategy
  • Pulling current data from Yahoo
  • Building our New Data-set
  • Searching for investment suggestions
  • Raising investment requirement standards
  • Testing raised standards
  • Streamlining the changing of standards