Shuffling our data to solve a learning issue

In this machine learning tutorial, we're going to cover shuffling our data for learning. One of the problems we have right now is that we're training on, for example, the first 90% of the data, and our data frame is in alphabetical order by stock ticker. This means that, when we train, we're training on stocks that begin with A through R, for example, and leaving the S through Z stocks for testing.

This is not a very representative test, nor is it necessary to do it this way. The main constraint we have is that we simply cannot train and test against the same data, but there's nothing wrong with training and testing on the same stocks, given our current algorithm. There could be some investing methods and/or algorithms for which this would be a problem, but ours is not one of them.

There is another method of stepping through the data, where you divide the data up into quarters. For example:

1st 25% - train

2nd 25% - train

3rd 25% - train

4th 25% - test

Then you do:

1st 25% - test

2nd 25% - train

3rd 25% - train

4th 25% - train

then:

1st 25% - train

2nd 25% - test

3rd 25% - train

4th 25% - train

Finally:

1st 25% - train

2nd 25% - train

3rd 25% - test

4th 25% - train

Now you have actually trained and tested against all of the data, and you can take an average of the four results to see how you've really done.
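For reference, scikit-learn can automate that fold rotation. Below is a minimal sketch of the idea using cross_val_score (found in sklearn.model_selection in current releases), with a small made-up dataset standing in for our real features and labels; it's an illustration of the technique, not part of our script:

import numpy as np
from sklearn import svm
from sklearn.model_selection import cross_val_score

# Made-up stand-in data: 100 samples, 4 features, binary labels
# (think outperform/underperform).
X = np.random.rand(100, 4)
y = np.random.randint(0, 2, 100)

clf = svm.SVC(kernel="linear", C=1.0)

# cv=4 splits the data into 4 folds and rotates which fold is held out,
# just like the 25% scheme above, returning one accuracy per fold.
scores = cross_val_score(clf, X, y, cv=4)
print("Fold accuracies:", scores)
print("Average accuracy:", scores.mean())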

Instead, here, we're going to just shuffle the data to keep things simple.

To shuffle the rows of a data set, the following code can be used:

import numpy as np
import pandas as pd


def Randomizing():
    # Build a tiny example DataFrame with two columns, D1 and D2.
    df = pd.DataFrame({"D1": range(5), "D2": range(5)})
    print(df)
    # Shuffle the rows by reindexing against a random permutation of the index.
    df2 = df.reindex(np.random.permutation(df.index))
    print(df2)


Randomizing()
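As a quick aside, if your version of pandas has DataFrame.sample, the following one-liner is equivalent:

import pandas as pd

df = pd.DataFrame({"D1": range(5), "D2": range(5)})
# frac=1 asks for a random sample containing 100% of the rows,
# which is just the full DataFrame in shuffled order.
print(df.sample(frac=1))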

Now that we see how we can shuffle rows in the Pandas module, we're ready to apply this to our data.

import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, preprocessing
import pandas as pd
from matplotlib import style
style.use("ggplot")


FEATURES =  ['DE Ratio',
             'Trailing P/E',
             'Price/Sales',
             'Price/Book',
             'Profit Margin',
             'Operating Margin',
             'Return on Assets',
             'Return on Equity',
             'Revenue Per Share',
             'Market Cap',
             'Enterprise Value',
             'Forward P/E',
             'PEG Ratio',
             'Enterprise Value/Revenue',
             'Enterprise Value/EBITDA',
             'Revenue',
             'Gross Profit',
             'EBITDA',
             'Net Income Avl to Common ',
             'Diluted EPS',
             'Earnings Growth',
             'Revenue Growth',
             'Total Cash',
             'Total Cash Per Share',
             'Total Debt',
             'Current Ratio',
             'Book Value Per Share',
             'Cash Flow',
             'Beta',
             'Held by Insiders',
             'Held by Institutions',
             'Shares Short (as of',
             'Short Ratio',
             'Short % of Float',
             'Shares Short (prior ']


def Build_Data_Set():
    # Load the key stats data, using the first column as the index.
    data_df = pd.read_csv("key_stats.csv", index_col=0)

    # Shuffle the rows so train/test splits no longer follow ticker alphabet.
    data_df = data_df.reindex(np.random.permutation(data_df.index))

    X = np.array(data_df[FEATURES].values)

    y = (data_df["Status"]
         .replace("underperform", 0)
         .replace("outperform", 1)
         .values.tolist())

    X = preprocessing.scale(X)

    return X, y


def Analysis():
    # Hold out the last 1000 rows (now randomly ordered) for testing.
    test_size = 1000
    X, y = Build_Data_Set()
    print(len(X))

    # Train a linear-kernel SVM on everything except the held-out rows.
    clf = svm.SVC(kernel="linear", C=1.0)
    clf.fit(X[:-test_size], y[:-test_size])

    correct_count = 0

    # Predict each held-out sample one at a time; predict expects a 2D
    # array, hence the reshape of the single feature row.
    for x in range(1, test_size + 1):
        if clf.predict(X[-x].reshape(1, -1))[0] == y[-x]:
            correct_count += 1

    print("Accuracy:", (correct_count / test_size) * 100.0)


Analysis()

Notice the new line in the Build_Data_Set function: data_df = data_df.reindex(np.random.permutation(data_df.index))
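As a side note, scikit-learn classifiers also expose a score method that returns the mean accuracy directly, so the manual counting loop in Analysis could be collapsed into a single call. A minimal sketch, assuming the same trained clf, X, y, and test_size as above:

# Mean accuracy on the held-out rows, as a fraction between 0 and 1;
# this is the same quantity the correct_count loop computes.
accuracy = clf.score(X[-test_size:], y[-test_size:])
print("Accuracy:", accuracy * 100.0)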

The next tutorial: Using Quandl for more data




