With supervised learning, you're going to need to label your data. The idea here is that you can "show" and "teach" the machine some examples of "what is." Then, you can eventually show the machine some data, without the label, expecting it to give you an answer. What we do is feed the data with labels to train the machine. Then we test the machine by feeding information and seeing what the answer is compared to the known answer. If the testing goes well, then we're ready to move forward.
In our case, we're trying to find a pretty vague answer. We cannot possibly expect the machine to be extremely accurate in its choice of successful companies, but we definitely want something over 50%, otherwise it's a bad sign. With investing in mind, not only is accuracy important, but so is the degree by which the companies perform. What if our wrong choices are significantly wrong, and our correct ones only just barely? What if our winning stock picks out perform by 1%, but the incorrect choices we made under-perform by 15%?
Regardless, our next step is labeling. There are many choices for labeling our data. Primarily, we want to label our data as a 0 or a 1, or as under-perform or out-perform.
The next question is according to what? We're going to use the S&P 500 index as comparison. The final question is: according to what time frame? The easiest thing we can do is measure performance according to the "now." This is not really the best choice, and will be changed shortly, but, for now, it's what we will use.
To start, we have similar code as usual:
import pandas as pd import os import time from datetime import datetime from time import mktime import matplotlib import matplotlib.pyplot as plt from matplotlib import style style.use('dark_background') import re import urllib path = "X:/Backups/intraQuarter" def Key_Stats(gather="Total Debt/Equity (mrq)"): statspath = path+'/_KeyStats' stock_list = [x[0] for x in os.walk(statspath)]
Next we need to add another column to our Data Frame, which we will call "status," which will house the stock's performance.
df = pd.DataFrame(columns = ['Date','Unix','Ticker','DE Ratio','Price','stock_p_change','SP500','sp500_p_change','Difference','Status'])
Then we have no change for the following code:
ticker_list = [] sp500_df = pd.DataFrame.from_csv("YAHOO-INDEX_GSPC.csv") for each_dir in stock_list[1:]: ticker = each_dir.split("\\")[1] each_file = os.listdir(each_dir) ticker_list.append(ticker) starting_stock_value = False starting_sp500_value = False if len(each_file) > 0: for file in each_file: #print(file) date_stamp = datetime.strptime(file, '%Y%m%d%H%M%S.html') unix_time = mktime(date_stamp.timetuple()) full_file_path = each_dir+'/'+file source = open(full_file_path,'r').read() try: try: value = float(source.split(gather+':</td><td class="yfnc_tabledata1">')[1].split('</td>')[0]) except: value = float(source.split(gather+':</td>\n<td class="yfnc_tabledata1">')[1].split('</td>')[0]) try: sp500_date = datetime.fromtimestamp(unix_time).strftime('%Y-%m-%d') row = sp500_df[(sp500_df.index == sp500_date)] sp500_value = float(row['Adjusted Close']) except: sp500_date = datetime.fromtimestamp(unix_time-259200).strftime('%Y-%m-%d') row = sp500_df[(sp500_df.index == sp500_date)] sp500_value = float(row['Adjusted Close']) try: stock_price = float(source.split('</small><big><b>')[1].split('</b></big>')[0]) except: try: stock_price = (source.split('</small><big><b>')[1].split('</b></big>')[0]) #print(stock_price) stock_price = re.search(r'(\d{1,8}\.\d{1,8})', stock_price) stock_price = float(stock_price.group(1)) #print(stock_price) except: try: stock_price = (source.split('<span class="time_rtq_ticker">')[1].split('</span>')[0]) #print(stock_price) stock_price = re.search(r'(\d{1,8}\.\d{1,8})', stock_price) stock_price = float(stock_price.group(1)) #print(stock_price) except: print('wtf stock price lol',ticker,file, value) time.sleep(5) if not starting_stock_value: starting_stock_value = stock_price if not starting_sp500_value: starting_sp500_value = sp500_value stock_p_change = ((stock_price - starting_stock_value) / starting_stock_value) * 100 sp500_p_change = ((sp500_value - starting_sp500_value) / starting_sp500_value) * 100 location = len(df['Date'])
Now is where we account for the difference between the S&P 500 and the stock:
difference = stock_p_change-sp500_p_change if difference > 0: status = "outperform" else: status = "underperform"
Now we finish our script, not forgetting to add the new column to our appending row:
df = df.append({'Date':date_stamp, 'Unix':unix_time, 'Ticker':ticker, 'DE Ratio':value, 'Price':stock_price, 'stock_p_change':stock_p_change, 'SP500':sp500_value, 'sp500_p_change':sp500_p_change, ############################ 'Difference':difference, 'Status':status}, ignore_index=True) except Exception as e: pass #print(ticker,e,file, value) #print(ticker_list) #print(df) for each_ticker in ticker_list: try: plot_df = df[(df['Ticker'] == each_ticker)] plot_df = plot_df.set_index(['Date']) if plot_df['Status'][-1] == 'underperform': color = 'r' else: color = 'g' plot_df['Difference'].plot(label=each_ticker, color=color) plt.legend() except Exception as e: print(str(e)) plt.show() save = gather.replace(' ','').replace(')','').replace('(','').replace('/','')+str('.csv') print(save) df.to_csv(save) Key_Stats()