Parsing data

Assuming you have downloaded the machine learning data file, you're ready to learn how to parse the data.

The data set mimics having visited the web pages at the time, except that we don't actually need to visit them. We have the full HTML source code, so it is just like parsing the website, minus the bandwidth use.

First, we want to know which date each piece of data corresponds to; then we're going to pull the actual data.

To start:

import pandas as pd
import os
import time
from datetime import datetime

path = "X:/Backups/intraQuarter"

Above, we're importing pandas (as pd), os so that we can interact with directories, and time and datetime for managing time and date information.

Finally, we define "path," which is the path to the intraQuarter folder (you need to unzip the original zip file you downloaded from this website).

def Key_Stats(gather="Total Debt/Equity (mrq)"):
    statspath = path+'/_KeyStats'
    stock_list = [x[0] for x in os.walk(statspath)]

Here, we begin our function, specifying that we're going to try to collect the Debt/Equity value.

statspath is the path to the stats directory.

stock_list is a quick one-line list comprehension that uses os.walk to collect the path of every directory under statspath. The first entry is statspath itself, which is why the loop that follows starts at stock_list[1:].
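To see what that comprehension collects, here is a minimal sketch that runs os.walk over a throwaway directory tree. The helper name list_dirs and the ticker folder names are made up for illustration; they are not part of the data set.

```python
import os
import tempfile

def list_dirs(root):
    """Return root followed by every directory beneath it, as reported by os.walk."""
    # os.walk yields (dirpath, dirnames, filenames); we keep only dirpath,
    # which is what x[0] picks out in the tutorial's one-liner.
    return [dirpath for dirpath, dirnames, filenames in os.walk(root)]

with tempfile.TemporaryDirectory() as tmp:
    stats = os.path.join(tmp, "_KeyStats")
    for ticker in ("aapl", "msft"):  # hypothetical ticker folders
        os.makedirs(os.path.join(stats, ticker))

    dirs = list_dirs(stats)
    # The first entry is the root itself, so the ticker folders start at index 1.
    print(dirs[0] == stats)
    print(sorted(os.path.basename(d) for d in dirs[1:]))
```

Because the root comes first, slicing with [1:] leaves only the per-ticker directories.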


    for each_dir in stock_list[1:]:
        each_file = os.listdir(each_dir)
        if len(each_file) > 0:

Above, we're cycling through every directory, each of which is a stock ticker. each_file is the list of files within that stock's directory. If that list has a length greater than 0, we proceed; some stocks have no files/data.
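As a rough illustration of that check, the sketch below builds one directory with a file and one without, then keeps only the non-empty one. The helper dirs_with_files, the folder names, and the file name are all assumptions for the example.

```python
import os
import tempfile

def dirs_with_files(directories):
    """Yield (directory, file_names) for each directory that has at least one entry."""
    for each_dir in directories:
        each_file = os.listdir(each_dir)  # a list of names, empty if nothing is there
        if len(each_file) > 0:
            yield each_dir, each_file

with tempfile.TemporaryDirectory() as root:
    full = os.path.join(root, "aapl")   # hypothetical ticker with data
    empty = os.path.join(root, "zzzz")  # hypothetical ticker with no data
    os.makedirs(full)
    os.makedirs(empty)
    open(os.path.join(full, "20040130190102.html"), "w").close()

    kept = list(dirs_with_files([full, empty]))

print([(os.path.basename(d), files) for d, files in kept])
# [('aapl', ['20040130190102.html'])]
```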

            for file in each_file:

                date_stamp = datetime.strptime(file, '%Y%m%d%H%M%S.html')
                unix_time = time.mktime(date_stamp.timetuple())
                print(date_stamp, unix_time)


Finally, we run a for loop that pulls the date_stamp from each file name. Our files are stored under their ticker, with a file name that gives the exact date and time at which the information was pulled.

From there, we tell datetime what the format of our date stamp is, then we convert it to a Unix time stamp.
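Here is the same conversion on a single file name, as a small standalone sketch. The file name is a made-up example that follows the naming pattern described above. Note that time.mktime treats the time as local time, so the Unix time stamp you get depends on your machine's time zone.

```python
import time
from datetime import datetime

file_name = "20040130190102.html"  # hypothetical file name in the data set's format

# The literal ".html" in the format string has to match the file extension exactly.
date_stamp = datetime.strptime(file_name, "%Y%m%d%H%M%S.html")

# mktime interprets the time tuple as local time and returns seconds since the epoch.
unix_time = time.mktime(date_stamp.timetuple())

print(date_stamp)  # 2004-01-30 19:01:02
print(unix_time)   # value depends on the local time zone

# Converting back with fromtimestamp recovers the original local date and time.
print(datetime.fromtimestamp(unix_time) == date_stamp)
```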
