Our Method and where we will be getting our Data

Getting the data you need for your testing can be quite daunting. We're entering the age of big data, and everyone else wants data just as much as you do.

Data is a commodity, and it has a market value. If you've never received a quote for something like Twitter data, you'll most likely be astonished to find out how much it costs. Do you want the firehose? Hope you have millions. Streaming 100K tweets a day? Tens of thousands a month.

That is only true, of course, if someone even has the data to sell you. You'll likely find that a lot of the data you seek doesn't exist at all. Even if it does, it may not exist in a form that you can use. Even if it is nicely organized, is it all in numerical form? Is it normalized?

It can be a massive pain. For us, our question is:

Can we use machine learning to analyze the fundamentals of public companies (things like price/book ratio, P/E ratio, debt/equity, etc.), and then classify each stock as either an out-performer compared to the market (labeled 1) or an under-performer (labeled 0)?
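To make the labeling idea concrete, here is a minimal sketch of how a 1 or 0 label could be assigned. It assumes we already have a percent change for the stock and for the market over the same period. The function name, the tickers, and the numbers are all made up for illustration, and treating a tie as an under-performer is just one possible choice.

```python
def label_stock(stock_change_pct, market_change_pct):
    """Return 1 if the stock beat the market over the period, otherwise 0.

    Assumption: both arguments are percent changes over the same period.
    A tie is labeled 0 here; that is an arbitrary choice for this sketch.
    """
    return 1 if stock_change_pct > market_change_pct else 0


# Hypothetical rows: (ticker, stock % change, market % change)
examples = [
    ("AAA", 12.0, 8.5),
    ("BBB", 3.0, 8.5),
]

labels = {ticker: label_stock(stock, market) for ticker, stock, market in examples}
print(labels)
```

The classifier's job later on is to predict this label from the fundamentals alone.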

With this question, we need fundamental company data, and we need it across many years. You may find that the data you want is just plain not easy to obtain. Much of it is available only on web pages, not in a format that is easy to download and use. We're going to simulate that situation, only without the need for you to actually pull the pages from a web server yourself.

The download for the data is: , which is over a decade's worth of S&P 500 company fundamentals

This data is the raw HTML source of Yahoo Finance pages for the companies in the S&P 500 index, covering a bit over a decade.

This is the data we're going to use. Just for the record, probably the best source for public company data is the SEC (Securities and Exchange Commission) website.

To navigate the SEC.gov website, go to "company filings" near the top right, then use the "fast search" by typing the company's ticker symbol, like AAPL for Apple. Two forms you may be interested in here are the 10-K and the 10-Q. The 10-K is the annual report, and the 10-Q is the quarterly report.

All of that said, we're going to just use Yahoo Finance.

Yahoo Finance has a bunch of nicely organized data points all in a table. An HTML table isn't ideal for us, since we have to parse the values out of the page source, but we can work with it. It turns out there are some options for connecting to EDGAR (the SEC's electronic filing system) via an API, so later we will cover using EDGAR specifically.

Once you download the data, extract the files. The structure is:



--stock files (organized by YYYYMMDDHHMMSS.html)


--stock files (organized by YYYYMMDDHHMMSS.html)


--stock files (organized by YYYYMMDDHHMMSS.html)
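Since each stock file is named by its timestamp in YYYYMMDDHHMMSS form, the file name itself tells us when the snapshot was taken. Here is a small sketch of turning such a name into a datetime and putting a stock's files in chronological order. The file names below are invented for illustration, and the helper names are not part of the data set.

```python
from datetime import datetime
from pathlib import Path


def snapshot_time(filename):
    """Parse a name like '20040130190102.html' into a datetime."""
    stem = Path(filename).stem  # drop the '.html' extension
    return datetime.strptime(stem, "%Y%m%d%H%M%S")


def sorted_snapshots(filenames):
    """Return the file names ordered from oldest to newest snapshot."""
    return sorted(filenames, key=snapshot_time)


# Hypothetical file names for one stock
files = ["20101231120000.html", "20040130190102.html", "20070615083000.html"]

for name in sorted_snapshots(files):
    print(name, snapshot_time(name))
```

Because the timestamp is zero-padded with the largest unit first, a plain string sort would give the same order; parsing to a datetime is mainly useful later, when we need to match each snapshot against a price on that date.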
