Python Programming Tutorials

Parsing data

Once you have downloaded the machine learning data file, you're ready to learn how to parse the data.

The data set is a saved copy of the web pages as they looked at the time, so we don't need to visit the pages ourselves. We have the full HTML source code, so it is just like parsing the website, minus the bandwidth use.

First, we want to find the date that each piece of data corresponds to. After that, we will pull the actual data.

To start:

import pandas as pd
import os
import time
from datetime import datetime

path = "X:/Backups/intraQuarter"

Above, we import pandas (aliased as pd), os so that we can interact with directories, and time and datetime for managing time and date information.

Finally, we define "path," which is the path to the intraQuarter folder (you need to unzip the original zip file you downloaded from this website first).

def Key_Stats(gather="Total Debt/Equity (mrq)"):
    statspath = path+'/_KeyStats'
    stock_list = [x[0] for x in os.walk(statspath)]
    #print(stock_list)

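Here, we begin our function. The default value of the gather parameter specifies that we're going to try to collect the Debt/Equity value.

statspath is the path to the stats directory.

stock_list is built with a one-line list comprehension over os.walk. os.walk yields a (dirpath, dirnames, filenames) tuple for every directory it visits, and x[0] keeps only the directory path. The first entry is statspath itself, which is why the loop later starts from stock_list[1:].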
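To see what that os.walk one-liner returns, here is a minimal sketch that runs against a throwaway temporary directory rather than the real intraQuarter folder. The helper name list_ticker_dirs and the ticker folder names are made up for illustration and are not part of the tutorial's code.

```python
import os
import tempfile


def list_ticker_dirs(statspath):
    """Return every directory path found under statspath, excluding statspath itself.

    Hypothetical helper: it wraps the same list comprehension the tutorial uses.
    """
    # os.walk yields (dirpath, dirnames, filenames); x[0] is dirpath.
    all_dirs = [x[0] for x in os.walk(statspath)]
    # The first entry is always the starting directory, so skip it.
    return all_dirs[1:]


# Build a small stand-in for the _KeyStats layout (folder names are assumptions).
with tempfile.TemporaryDirectory() as root:
    for ticker in ("aapl", "msft"):
        os.mkdir(os.path.join(root, ticker))

    walked = [x[0] for x in os.walk(root)]
    first_is_root = (walked[0] == root)
    ticker_names = sorted(os.path.basename(d) for d in list_ticker_dirs(root))

print(first_is_root, ticker_names)
```

Note that os.walk also descends into nested sub-directories, so this approach assumes each ticker folder contains only files.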

    for each_dir in stock_list[1:]:
        each_file = os.listdir(each_dir)
        if len(each_file) > 0:

Above, we're cycling through every directory (one per stock ticker), skipping the first entry because it is the stats directory itself. each_file is the list of all files within that stock's directory. Some stocks have no files/data, so we only proceed if that list has more than 0 entries.
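The empty-directory check can be tried in isolation. The following is a small sketch, again using a temporary directory; the function name dirs_with_files, the folder names, and the file name are assumptions made up for the demo.

```python
import os
import tempfile


def dirs_with_files(dir_paths):
    """Return only those directories that contain at least one entry.

    Hypothetical helper mirroring the len(each_file) > 0 check.
    """
    non_empty = []
    for each_dir in dir_paths:
        each_file = os.listdir(each_dir)  # names of everything inside each_dir
        if len(each_file) > 0:
            non_empty.append(each_dir)
    return non_empty


with tempfile.TemporaryDirectory() as root:
    with_data = os.path.join(root, "aapl")
    without_data = os.path.join(root, "empty")
    os.mkdir(with_data)
    os.mkdir(without_data)

    # Put a single empty file into one of the two folders.
    with open(os.path.join(with_data, "20040130190102.html"), "w") as f:
        f.write("")

    kept = [os.path.basename(d) for d in dirs_with_files([with_data, without_data])]

print(kept)
```

Since an empty list is falsy in Python, `if each_file:` would behave the same way as the explicit length check.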

            for file in each_file:

                date_stamp = datetime.strptime(file, '%Y%m%d%H%M%S.html')
                unix_time = time.mktime(date_stamp.timetuple())
                print(date_stamp, unix_time)
                #time.sleep(15)

Key_Stats()

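Finally, we run a for loop that pulls the date_stamp from each file. Our files are stored under their ticker, and each file name is the exact date and time at which the information was pulled.

From there, we tell datetime.strptime what the format of our date stamp is, then we convert the result to a unix time stamp. Note that time.mktime interprets the time as local time, so the unix time stamp depends on the time zone of the machine running the code.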
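The date conversion can be checked on a single file name without touching the data set. This is a minimal sketch; the function name parse_snapshot_name and the sample file name are made up for illustration.

```python
import time
from datetime import datetime


def parse_snapshot_name(file_name):
    """Turn a name like 'YYYYmmddHHMMSS.html' into (datetime, unix time stamp).

    Hypothetical helper; it uses the same two calls as the loop in Key_Stats.
    """
    # The literal '.html' in the format string must match the file extension.
    date_stamp = datetime.strptime(file_name, '%Y%m%d%H%M%S.html')
    # mktime treats the time tuple as local time, so the number it returns
    # depends on the machine's time zone.
    unix_time = time.mktime(date_stamp.timetuple())
    return date_stamp, unix_time


# Sample file name is an assumption, not taken from the data set.
date_stamp, unix_time = parse_snapshot_name("20040130190102.html")
print(date_stamp, unix_time)

# A name that does not match the format raises ValueError.
try:
    parse_snapshot_name("notes.txt")
    bad_name_rejected = False
except ValueError:
    bad_name_rejected = True
```

Because strptime raises ValueError on any name that does not fit the format, a stray file in a ticker directory would stop the loop unless the call is wrapped in a try/except.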
