More testing, this time including N/A data




We're now ready to test the efficacy of our algorithm. This is a good time to reiterate that, with our example and with many other predictive algorithms, accuracy is only one of the measurements. With only two options, accuracy leaves us with being either correct or incorrect.

The world of investing, however, cares about performance. How accurate, or how inaccurate, were we? I would argue it's actually more of a spectrum. So we're hoping for a number here that is higher than 50% accuracy, but we need to test further later to find out our "degree" of accuracy, meaning "how right" or "how wrong" we are about the companies that we take positions in.

Also, in the world of investing, the companies that we didn't invest in really do not matter to us. It would have been great if we had chosen companies that performed well, but our performance measurement can only come from the companies that we actually took positions in. Because of this, the accuracy percentage carries only a fraction of the overall performance picture.
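To make the difference between accuracy and performance concrete, here is a minimal sketch. All of the numbers are made up for illustration and do not come from our data set; the variable names are hypothetical as well.

```python
# Hypothetical predictions and outcomes for five companies (made-up values).
predictions = [1, 1, 0, 1, 0]    # 1 = we took a position, 0 = we passed
outperformed = [1, 0, 0, 1, 1]   # 1 = the stock actually beat the market
excess_return = [12.0, -3.0, -8.0, 5.0, 20.0]  # stock % change minus market % change

# Accuracy counts every prediction, whether or not we invested.
accuracy = sum(p == o for p, o in zip(predictions, outperformed)) / len(predictions)

# Performance only looks at the companies we actually took positions in.
invested = [r for p, r in zip(predictions, excess_return) if p == 1]
avg_excess = sum(invested) / len(invested)

print("Accuracy:", accuracy * 100)
print("Average excess return on positions taken:", avg_excess)
```

Notice that the last company, which we passed on, hurts the accuracy number but has no effect on the return of the positions we took.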

With trading and investing, most of the time, there's going to be an actual "back test," or a simulation of trading being done under the strategy against historical events. We'll do that after this.

Before we back-test our strategy, the only major consideration at this point is what to do with "N/A," or not available, data.

Most of the time, you're not going to get a pristine data set. It'd be great if you did, but there's almost always missing data. If the amount of missing data is low, you can choose to simply drop the samples that have missing points, but keep in mind that throwing data away like this can noticeably affect your accuracy.
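As a rough sketch of the "just ignore it" option, pandas can drop any row that has a missing value with dropna. The tiny table here is hypothetical and only borrows a couple of our column names; it is worth checking how much data you lose before committing to this.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of a key-stats table; the values are made up.
df = pd.DataFrame({
    "Trailing P/E": [15.2, np.nan, 22.1, 9.8],
    "Price/Book": [2.1, 3.4, np.nan, 1.2],
    "Status": ["outperform", "underperform", "outperform", "underperform"],
})

# Keep only the rows where every column has a value.
complete_rows = df.dropna()

# How much of the data set did that cost us?
fraction_lost = 1 - len(complete_rows) / len(df)
print(len(df), "rows before,", len(complete_rows), "rows after")
print("Fraction of rows lost:", fraction_lost)
```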

Depending on the algorithm you're using, what you do with N/A data may vary. If you have preprocessed your data and it mostly ranges from -1 to 1, some people may suggest that you set the N/A values to -999 or 999, since your machine learning algorithm may just treat them as outliers and ignore them. The code below takes the simpler route of replacing N/A values with 0.
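Here is a minimal sketch of the outlier idea, assuming the features have already been scaled to roughly -1 to 1. The SENTINEL name and the small table are hypothetical, and whether a given algorithm really ignores such a value is something you would need to test for yourself.

```python
import numpy as np
import pandas as pd

SENTINEL = -999  # assumed outlier value; 999 would be used the same way

# Hypothetical, already-scaled feature values with some missing entries.
scaled = pd.DataFrame({
    "Price/Book": [0.4, np.nan, -0.7],
    "Beta": [np.nan, 0.1, 0.9],
})

# Replace every missing entry with the sentinel, leaving real values alone.
filled = scaled.fillna(SENTINEL)
print(filled)
```

Keep the order of operations in mind: if you fill in the sentinel before scaling, the extreme value will distort the scaling of the real data in that column.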

import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, preprocessing
import pandas as pd
from matplotlib import style
style.use("ggplot")


FEATURES =  ['DE Ratio',
             'Trailing P/E',
             'Price/Sales',
             'Price/Book',
             'Profit Margin',
             'Operating Margin',
             'Return on Assets',
             'Return on Equity',
             'Revenue Per Share',
             'Market Cap',
             'Enterprise Value',
             'Forward P/E',
             'PEG Ratio',
             'Enterprise Value/Revenue',
             'Enterprise Value/EBITDA',
             'Revenue',
             'Gross Profit',
             'EBITDA',
             'Net Income Avl to Common ',
             'Diluted EPS',
             'Earnings Growth',
             'Revenue Growth',
             'Total Cash',
             'Total Cash Per Share',
             'Total Debt',
             'Current Ratio',
             'Book Value Per Share',
             'Cash Flow',
             'Beta',
             'Held by Insiders',
             'Held by Institutions',
             'Shares Short (as of',
             'Short Ratio',
             'Short % of Float',
             'Shares Short (prior ']


def Build_Data_Set():
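    # DataFrame.from_csv has been removed from pandas; read_csv with index_col=0 replaces it
    data_df = pd.read_csv("key_stats_acc_perf_WITH_NA.csv", index_col=0)

    #data_df = data_df[:100]
    # shuffle the rows so the training and testing slices are not in file order
    data_df = data_df.reindex(np.random.permutation(data_df.index))
    # treat the "NaN" / "N/A" strings as missing, then fill every missing value with 0
    data_df = data_df.replace(["NaN", "N/A"], np.nan).fillna(0)

    X = np.array(data_df[FEATURES].values)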

    y = (data_df["Status"]
         .replace("underperform",0)
         .replace("outperform",1)
         .values.tolist())

    X = preprocessing.scale(X)


    return X,y


def Analysis():

    test_size = 1000
    X, y = Build_Data_Set()
    print(len(X))

    
    clf = svm.SVC(kernel="linear", C= 1.0)
    clf.fit(X[:-test_size],y[:-test_size])

    correct_count = 0

    for x in range(1, test_size+1):
        # predict expects a 2D array, so reshape the single sample into one row
        if clf.predict(X[-x].reshape(1, -1))[0] == y[-x]:
            correct_count += 1

    print("Accuracy:", (correct_count/test_size) * 100.00)
    

Analysis()
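As a side note, the counting loop above can also be written with the classifier's built-in score method, which returns the fraction of correct predictions on the samples you pass in. This is only a sketch on synthetic data: the random features and the labeling rule are assumptions made so that the example runs on its own, and they stand in for the X and y returned by Build_Data_Set.

```python
import numpy as np
from sklearn import svm

# Synthetic stand-in data: 200 samples, 3 features (not our real data set).
rng = np.random.RandomState(0)
X = rng.randn(200, 3)
y = (X[:, 0] > 0).astype(int)  # made-up rule: label depends on the first feature

test_size = 50

clf = svm.SVC(kernel="linear", C=1.0)
clf.fit(X[:-test_size], y[:-test_size])

# score() predicts on the held-out slice and returns the fraction correct.
accuracy = clf.score(X[-test_size:], y[-test_size:]) * 100.0
print("Accuracy:", accuracy)
```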

		
