Scaling, Normalizing, and machine learning with many features

In this machine learning with Scikit-learn (sklearn) tutorial, we cover scaling and normalizing data, as well as doing a full machine learning example on all of our features. In general, with machine learning, you ideally want your data normalized, which means all features are on a similar scale.

Looking at our data, we might have some features that typically range from 0 to 12, while other features typically range from 0 to 50 billion. While you can feed features that vary this much through your machine learning algorithm, the results will likely be less accurate and less useful than if you were to scale them.

Applying changes to your data before running it through your machine learning algorithm is generally referred to as "pre-processing." This involves more than just scaling and normalizing your data. Pre-processing can even include the use of unsupervised machine learning to reduce the total number of features, which increases speed and efficiency.
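As a rough illustration of reducing the number of features with an unsupervised method, here is a minimal sketch using PCA from sklearn.decomposition. PCA is just one possible choice and is not used elsewhere in this tutorial; the data (X_demo) is random and made up purely for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up data for illustration only: 50 samples with 10 features each.
rng = np.random.RandomState(0)
X_demo = rng.rand(50, 10)

# Keep the 3 directions that capture the most variance.
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X_demo)

print(X_demo.shape, "->", X_reduced.shape)
```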

Luckily for us, Scikit-Learn has built-in functionality for this under sklearn.preprocessing. There are many options for pre-processing; we're just going to use one, which is sklearn.preprocessing.scale.

This will scale our data for us. It's a fairly complex topic, even just scaling. Consider how you might do it yourself. You, in general, want data to be between -1 and 1 as a target. This isn't a hard rule, you can do -5 to 7, or something like that, but the problem really arises when the ranges differ wildly between features.
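To get a feel for what preprocessing.scale does, here is a minimal sketch on made-up numbers (X_toy is hypothetical, not our stock data). The function standardizes each column to roughly zero mean and unit variance, so values usually land within a few units of zero rather than strictly between -1 and 1.

```python
import numpy as np
from sklearn import preprocessing

# Made-up data for illustration: column 0 is small, column 1 is in the billions.
X_toy = np.array([[1.0, 2.0e9],
                  [6.0, 25.0e9],
                  [12.0, 50.0e9]])

X_scaled = preprocessing.scale(X_toy)

# After scaling, each column has mean ~0 and standard deviation ~1,
# so both features end up on a comparable scale.
col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)

print(X_scaled)
print("means:", col_means)
print("stds:", col_stds)
```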

With scaling, you cannot just do a simple conversion using the minimum and maximum of each feature, because any significant outlier will cause problems: a single extreme value stretches the range and squashes all of the typical values together. You really want the scaled values to still reflect the original data as accurately as possible.
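The outlier problem with a simple minimum/maximum conversion can be seen in a small sketch. The numbers below are made up for illustration, with one deliberately extreme value at the end.

```python
import numpy as np

# Made-up feature values; the last one is a deliberate outlier.
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 1000.0])

# Naive min/max conversion to the range 0..1.
min_max = (values - values.min()) / (values.max() - values.min())

print(min_max)
# The outlier maps to 1.0, while every typical value is squeezed
# into a tiny band near 0, so their differences are almost lost.
```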

With that, let's get started with our scaling and full test.

import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, preprocessing
import pandas as pd
from matplotlib import style

FEATURES =  ['DE Ratio',
             'Trailing P/E',
             'Profit Margin',
             'Operating Margin',
             'Return on Assets',
             'Return on Equity',
             'Revenue Per Share',
             'Market Cap',
             'Enterprise Value',
             'Forward P/E',
             'PEG Ratio',
             'Enterprise Value/Revenue',
             'Enterprise Value/EBITDA',
             'Gross Profit',
             'Net Income Avl to Common ',
             'Diluted EPS',
             'Earnings Growth',
             'Revenue Growth',
             'Total Cash',
             'Total Cash Per Share',
             'Total Debt',
             'Current Ratio',
             'Book Value Per Share',
             'Cash Flow',
             'Held by Insiders',
             'Held by Institutions',
             'Shares Short (as of',
             'Short Ratio',
             'Short % of Float',
             'Shares Short (prior ']

Now we have our typical imports, and then a feature list that we intend to use.

Next, we use a similar function as before to build our data set, with the use of preprocessing.scale():

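def Build_Data_Set():
    # pd.DataFrame.from_csv was removed in pandas 1.0; read_csv with
    # index_col=0 and parse_dates=True matches its old defaults.
    data_df = pd.read_csv("key_stats.csv", index_col=0, parse_dates=True)

    #data_df = data_df[:100]

    X = np.array(data_df[FEATURES].values)

    # Use the "Status" column as our labels.
    y = data_df["Status"].values

    # Standardize every feature so they share a similar scale.
    X = preprocessing.scale(X)

    return X,y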

Notice that we're also commenting out the line that slices our data set.

Now we're going to work on an analysis function:

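def Analysis():

    test_size = 1000
    X, y = Build_Data_Set()

    # Train on everything except the last test_size samples.
    clf = svm.SVC(kernel="linear", C=1.0)
    clf.fit(X[:-test_size], y[:-test_size])

    correct_count = 0

    # Test on the last test_size samples, one at a time.
    for x in range(1, test_size+1):
        # predict() expects a 2D array, so reshape the single sample.
        if clf.predict(X[-x].reshape(1, -1))[0] == y[-x]:
            correct_count += 1

    print("Accuracy:", (correct_count/test_size) * 100.00)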


test_size is a variable that determines how much data we'd like to reserve for testing.

Next, we unpack the values returned by the data-set building function into X and y.

We then specify the classifier we intend to use, and then fit the data to that classifier. Notice that we fit the data up to the negative index of "test_size." We're doing this to make sure we do not train on the data that we want to test against. You do not want to train and test against identical data. For now, we're keeping things basic, but there are many methods for training and testing against the full data-set to get more accurate results. Our methods here are very crude.
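One common, less crude approach is k-fold cross-validation, where every sample is used for testing exactly once. The following is a minimal sketch, assuming synthetic data from make_classification as a stand-in for our stock features; it is not the method used in this tutorial.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption for illustration, not key_stats.csv).
X_demo, y_demo = make_classification(n_samples=200, n_features=10,
                                     random_state=0)

clf = svm.SVC(kernel="linear", C=1.0)

# Split the data into 5 folds; train on 4 and test on the remaining 1,
# rotating so that each fold is used as the test set once.
scores = cross_val_score(clf, X_demo, y_demo, cv=5)

print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```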

Next, we begin testing our data with a simple for-loop, which will check predictions of all of the testing data and then compare the predictions to reality.

There are built-in functions with sklearn for determining prediction accuracy, among other measurements. We're not using them here because some of the later operations we perform are not covered by these functions, so we're going to be looping through anyway.
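For reference, here is a minimal sketch showing that sklearn's accuracy_score gives the same number as a manual counting loop. It uses synthetic data from make_classification (an assumption for illustration), with the same "hold out the last test_size samples" split as our Analysis function.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score

# Synthetic stand-in data (an assumption for illustration, not key_stats.csv).
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     random_state=0)
test_size = 100

# Train on everything except the last test_size samples.
clf = svm.SVC(kernel="linear", C=1.0)
clf.fit(X_demo[:-test_size], y_demo[:-test_size])

# Built-in: predict the whole test set at once and score it.
predictions = clf.predict(X_demo[-test_size:])
builtin_accuracy = accuracy_score(y_demo[-test_size:], predictions)

# Manual: count the matching predictions ourselves.
correct_count = 0
for i in range(1, test_size + 1):
    if predictions[-i] == y_demo[-i]:
        correct_count += 1
manual_accuracy = correct_count / test_size

print("Built-in:", builtin_accuracy, "Manual:", manual_accuracy)
```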

Upon running this script, you should find accuracy to be something between ~53-63%.

It's important to note that the data we're looking at is not really black and white. We're converting a spectrum of data (performance, which could be anything from -15% to +15% on average) to black and white. Our predictions are right more often than not, but, since this is an investing example, we're actually more curious about the "spectrum" in the end.

The goal is that the "loser" stocks that we pick are not extremely bad. Consider for example:

Machine Learning Algorithm is 85% accurate, and chooses 100 stocks to invest in:

15 stocks perform an average -88% compared to market.

85 stocks perform an average +2% compared to market.

This equates to an average of -11.5% in the end compared to market.

Next, instead, consider:

Machine Learning Algorithm is 52% accurate, and chooses 100 stocks to invest in:

48 stocks perform an average of -2% compared to market.

52 stocks perform an average of +2% compared to market.

This equates to an average of +0.08% in the end compared to market.

Which of the two above is clearly the ideal in the end? This is an exaggerated example, of course, but it illustrates a major imperative within the field of investing or trading. Many people get hyper-focused on accuracy, rather than performance. When people focus on performance, however, they tend to hyper-focus there as well, ignoring the importance of risk.
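The two scenarios can be checked with a short weighted-average calculation. The helper name average_return is hypothetical, and the calculation assumes an equal amount is invested in each of the 100 stocks.

```python
def average_return(groups):
    """Average return (in %) across groups of (stock_count, avg_pct_return).

    Assumes an equal amount is invested in every stock.
    """
    total_stocks = sum(count for count, _ in groups)
    weighted_sum = sum(count * pct for count, pct in groups)
    return weighted_sum / total_stocks

# 85% accurate, but the losers are extremely bad.
accurate_but_risky = average_return([(15, -88.0), (85, 2.0)])

# 52% accurate, with small losses and small gains.
less_accurate = average_return([(48, -2.0), (52, 2.0)])

print("85% accurate:", accurate_but_risky)  # -11.5
print("52% accurate:", less_accurate)       # 0.08
```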
