Python Programming Tutorials

Finally finishing up the labeling

With supervised learning, you're going to need to label your data. The idea here is that you can "show" and "teach" the machine some examples of "what is." Then, you can eventually show the machine some data, without the label, expecting it to give you an answer. What we do is feed the data with labels to train the machine. Then we test the machine by feeding information and seeing what the answer is compared to the known answer. If the testing goes well, then we're ready to move forward.

In our case, we're trying to find a pretty vague answer. We cannot possibly expect the machine to be extremely accurate in its choice of successful companies, but we definitely want something over 50%, otherwise it's a bad sign. With investing in mind, not only is accuracy important, but so is the degree by which the companies perform. What if our wrong choices are significantly wrong, and our correct ones only just barely? What if our winning stock picks out perform by 1%, but the incorrect choices we made under-perform by 15%?

Regardless, our next step is labeling. There are many choices for labeling our data. Primarily, we want to label our data as a 0 or a 1, or as under-perform or out-perform.

The next question is according to what? We're going to use the S&P 500 index as comparison. The final question is: according to what time frame? The easiest thing we can do is measure performance according to the "now." This is not really the best choice, and will be changed shortly, but, for now, it's what we will use.

To start, we have similar code as usual:

import pandas as pd
import os
import time
from datetime import datetime
from time import mktime

import matplotlib
import matplotlib.pyplot as plt

from matplotlib import style
style.use('dark_background')

import re
import urllib

path = "X:/Backups/intraQuarter"

  


def Key_Stats(gather="Total Debt/Equity (mrq)"):

    statspath = path+'/_KeyStats'
    stock_list = [x[0] for x in os.walk(statspath)]

Next we need to add another column to our Data Frame, which we will call "status," which will house the stock's performance.

    df = pd.DataFrame(columns = ['Date','Unix','Ticker','DE Ratio','Price','stock_p_change','SP500','sp500_p_change','Difference','Status'])

Then we have no change for the following code:

    ticker_list = []

    sp500_df = pd.DataFrame.from_csv("YAHOO-INDEX_GSPC.csv")

    
    for each_dir in stock_list[1:]:
        ticker = each_dir.split("\\")[1]
        each_file = os.listdir(each_dir)
        ticker_list.append(ticker)
        
        starting_stock_value = False
        starting_sp500_value = False
        
        if len(each_file) > 0:
            
            for file in each_file:
                #print(file)

                date_stamp = datetime.strptime(file, '%Y%m%d%H%M%S.html')
                unix_time = mktime(date_stamp.timetuple())
                full_file_path = each_dir+'/'+file
                source = open(full_file_path,'r').read()


                try:

                    try:
                        value = float(source.split(gather+':</td><td class="yfnc_tabledata1">')[1].split('</td>')[0])
                    except:
                        value = float(source.split(gather+':</td>\n<td class="yfnc_tabledata1">')[1].split('</td>')[0])
                        
                    try:
                        sp500_date = datetime.fromtimestamp(unix_time).strftime('%Y-%m-%d')
                        row = sp500_df[(sp500_df.index == sp500_date)]
                        sp500_value = float(row['Adjusted Close'])

                    except:
                        sp500_date = datetime.fromtimestamp(unix_time-259200).strftime('%Y-%m-%d')
                        row = sp500_df[(sp500_df.index == sp500_date)]
                        sp500_value = float(row['Adjusted Close'])

                    try:
                        stock_price = float(source.split('</small><big><b>')[1].split('</b></big>')[0])
                    except:
                
                        try:
                            stock_price = (source.split('</small><big><b>')[1].split('</b></big>')[0])
                            #print(stock_price)

                            stock_price = re.search(r'(\d{1,8}\.\d{1,8})', stock_price)
            
                            stock_price = float(stock_price.group(1))
                            #print(stock_price)


                            
                        except:

                            try:
                                stock_price = (source.split('<span class="time_rtq_ticker">')[1].split('</span>')[0])
                                #print(stock_price)

                                stock_price = re.search(r'(\d{1,8}\.\d{1,8})', stock_price)
                
                                stock_price = float(stock_price.group(1))
                                #print(stock_price)

                            except:

                                print('wtf stock price lol',ticker,file, value)
                                time.sleep(5)
                                
                    if not starting_stock_value:
                        starting_stock_value = stock_price

                    if not starting_sp500_value:
                        starting_sp500_value = sp500_value


                    stock_p_change = ((stock_price - starting_stock_value) / starting_stock_value) * 100
                    sp500_p_change = ((sp500_value - starting_sp500_value) / starting_sp500_value) * 100

                    location = len(df['Date'])

Now is where we account for the difference between the S&P 500 and the stock:

                    difference = stock_p_change-sp500_p_change
                    if difference > 0:
                        status = "outperform"
                    else:
                        status = "underperform"

Now we finish our script, not forgetting to add the new column to our appending row:

                    df = df.append({'Date':date_stamp,
                                    'Unix':unix_time,
                                    'Ticker':ticker,
                                    'DE Ratio':value,
                                    'Price':stock_price,
                                    'stock_p_change':stock_p_change,
                                    'SP500':sp500_value,
                                    'sp500_p_change':sp500_p_change,
                                    ############################
                                    'Difference':difference,
                                    'Status':status},
                                   ignore_index=True)


                    
                except Exception as e:
                    pass
                    #print(ticker,e,file, value)


                    
    #print(ticker_list)   
    #print(df)


    for each_ticker in ticker_list:
        try:
            plot_df = df[(df['Ticker'] == each_ticker)]

            plot_df = plot_df.set_index(['Date'])


            if plot_df['Status'][-1] == 'underperform':
                color = 'r'
            else:
                color = 'g'

            
            plot_df['Difference'].plot(label=each_ticker, color=color)
            plt.legend()
        except Exception as e:
            print(str(e))

    plt.show()

    save = gather.replace(' ','').replace(')','').replace('(','').replace('/','')+str('.csv')
    print(save)
    df.to_csv(save)
    
Key_Stats()

The next tutorial:

Intro to Machine Learning with Scikit Learn and Python
Simple Support Vector Machine (SVM) example with character recognition
Our Method and where we will be getting our Data
Parsing data
More Parsing
Structuring data with Pandas
Getting more data and meshing data sets
Labeling of data part 1
Labeling data part 2
Finally finishing up the labeling
Linear SVC Machine learning SVM example with Python
Getting more features from our data
Linear SVC machine learning and testing our data
Scaling, Normalizing, and machine learning with many features
Shuffling our data to solve a learning issue
Using Quandl for more data
Improving our Analysis with a more accurate measure of performance in relation to fundamentals
Learning and Testing our Machine learning algorithm
More testing, this time including N/A data
Back-testing the strategy
Pulling current data from Yahoo
Building our New Data-set
Searching for investment suggestions
Raising investment requirement standards
Testing raised standards
Streamlining the changing of standards