Python Programming Tutorials

Our Method and where we will be getting our Data

Getting the data you need to do your testing can prove to be quite daunting. We're entering the age of big data, and, just like you want data, so does everyone else.

Data is a commodity, and it has a market value. If you've never received a quote for something like Twitter data, you'll most likely be astonished to find out how much it costs. Do you want the firehose? Hope you have millions. Streaming 100K tweets a day? Tens of thousands a month.

The above is only true, of course, if someone has the data to even sell you. You'll likely find that a lot of the data you seek doesn't even exist. Even if it does, it may not exist in a form that you can use. Even if it is nicely organized, is it all in numerical form? Is it normalized?

It can be a massive pain. For us, our question is:

Can we use machine learning to analyze the fundamentals of public companies (things like price/book ratio, P/E ratio, debt/equity, etc.), and then classify each stock as either an out-performer compared to the market (labeled 1) or an under-performer (labeled 0)?
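To make the 1/0 labeling concrete, here is a minimal sketch of one way it could be done: compare a stock's percent change over some period against the market's percent change over the same period. The function name `label_performance` and the prices below are hypothetical, chosen only for illustration; this is not necessarily the exact rule used later in the series.

```python
def label_performance(stock_start, stock_end, market_start, market_end):
    """Return 1 if the stock's percent change beat the market's, else 0."""
    stock_change = (stock_end - stock_start) / stock_start
    market_change = (market_end - market_start) / market_start
    return 1 if stock_change > market_change else 0


# Illustrative numbers only (not real prices):
# the stock gains 20% while the market gains 10%, so it is an out-performer.
print(label_performance(100.0, 120.0, 1500.0, 1650.0))

# The stock gains 5% while the market gains 10%, so it is an under-performer.
print(label_performance(100.0, 105.0, 1500.0, 1650.0))
```

A stock that exactly matches the market gets a 0 here, since it did not beat it; that tie-breaking choice is an assumption too.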

To answer this question, we need fundamental company data, and we need it going back over a number of years. You may find that the data you want is just plain not easily obtainable. Much of it is only available on web pages, not in a format that is easy to download and use. We're going to simulate that situation, only without the need for you to actually scrape the pages from a web server yourself.

The download for the data is: , which is over a decade's worth of S&P 500 company fundamentals

This data is the raw HTML source code of Yahoo Finance pages for the companies in the S&P 500 index, spanning a bit over a decade.

This is the data we're going to use. Just for the record, probably the best source for public company data is the SEC (Securities and Exchange Commission) website.

To navigate the SEC.gov website, go to "company filings" near the top right, then use the "fast search" by typing the company's ticker symbol, like AAPL for Apple. Examples of forms you may be interested in here are the 10-K and 10-Q. The 10-K is the annual report, and the 10-Q is the quarterly report.

All of that said, we're going to just use Yahoo Finance.

Yahoo Finance has a bunch of nicely organized data points all in a table. Since that table is buried in HTML, this isn't ideal for us, but we can work with it. It turns out there are some options for connecting to EDGAR (the SEC's filing system) via an API, so later we will cover using EDGAR specifically.

Once you download the data, extract the files. The structure is:

intraQuarter

-_AnnualEarnings

--stock files (organized by YYYYMMDDHHMMSS.html)

-_KeyStats

--stock files (organized by YYYYMMDDHHMMSS.html)

-_QuarterlyEarnings

--stock files (organized by YYYYMMDDHHMMSS.html)
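Since every file is named by its timestamp, the filename alone tells you when each snapshot was taken. Here is a rough sketch of walking the extracted folder and turning those names into datetimes. The helper names `parse_snapshot_time` and `iter_snapshots` are made up for this example, and it assumes the `intraQuarter` folder sits in your current working directory; adjust the path to wherever you extracted it.

```python
import os
from datetime import datetime


def parse_snapshot_time(filename):
    """Turn a name like 'YYYYMMDDHHMMSS.html' into a datetime."""
    stem, _ext = os.path.splitext(os.path.basename(filename))
    return datetime.strptime(stem, "%Y%m%d%H%M%S")


def iter_snapshots(root):
    """Yield (timestamp, path) for every timestamped .html file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.endswith(".html"):
                continue
            try:
                stamp = parse_snapshot_time(name)
            except ValueError:
                continue  # not named by timestamp, so skip it
            yield stamp, os.path.join(dirpath, name)


# Assumption: "intraQuarter" is in the current working directory.
# If it is not there, os.walk finds nothing and the loop simply does not run.
for stamp, path in iter_snapshots("intraQuarter"):
    print(stamp.isoformat(), path)
```

Using `os.walk` means the sketch does not care how deeply the HTML files are nested inside each section folder.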
