More Parsing

While you may sometimes get lucky and find a nice, structured download for the data you want, this is often not the case. For this reason, it is best to have a good foundation in parsing data from a website. There are many third-party parsing modules for Python, like Beautiful Soup, though parsing from websites is usually simple enough that it does not require any other packages.

Modifying our script from before, we start with the same:

import pandas as pd
import os
import time
from datetime import datetime

path = "X:/Backups/intraQuarter"

def Key_Stats(gather="Total Debt/Equity (mrq)"):
    statspath = path+'/_KeyStats'
    stock_list = [x[0] for x in os.walk(statspath)]

    for each_dir in stock_list[1:]:
        each_file = os.listdir(each_dir)
        ticker = each_dir.split("\\")[1]
        if len(each_file) > 0:
            for file in each_file:
                date_stamp = datetime.strptime(file, '%Y%m%d%H%M%S.html')
                unix_time = time.mktime(date_stamp.timetuple())
                #print(date_stamp, unix_time)

The above is identical to before, with an added "ticker" variable which stores the current ticker being assessed. Note that Windows users will need the double backslashes, while other users may use a forward slash; it depends on how your operating system delimits paths. Windows uses backslashes, which Python treats as escape characters, so we need two: one to escape, and one to be the escaped backslash.
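If you would rather avoid the escaped-backslash issue altogether, `os.path.basename` grabs the last path component regardless of which separator your operating system uses. A minimal sketch, using a hypothetical directory name in place of a real entry from `os.walk()`:

```python
import os

# Hypothetical directory, standing in for one path yielded by os.walk().
# os.path.join builds it with the current OS's separator, and
# os.path.basename strips everything up to that separator.
each_dir = os.path.join("intraQuarter", "_KeyStats", "aapl")
ticker = os.path.basename(each_dir)
print(ticker)  # aapl
```

This works identically on Windows and Unix, whereas `each_dir.split("\\")[1]` assumes backslash separators.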

Next, we do:

                full_file_path = each_dir+'/'+file
                # use a context manager so the file is closed after reading
                with open(full_file_path, 'r') as f:
                    source = f.read()
                value = source.split(gather+':</td><td class="yfnc_tabledata1">')[1].split('</td>')[0]

With this code, we open the file and save its full HTML source to the "source" variable. From there, we search for the "gather" term, which is the name of the feature we want, then split on the opening of the table data tag and again on the closing tag to isolate the value itself.
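To see the splitting logic in isolation, here is a small sketch run against a stand-in snippet of HTML. The snippet's structure is assumed from the parsing code above, not taken from a real page download, and the 34.62 figure is made up for illustration:

```python
gather = "Total Debt/Equity (mrq)"

# Stand-in for the saved page source; only the two table cells matter.
source = ('<td class="yfnc_tablehead1">Total Debt/Equity (mrq):</td>'
          '<td class="yfnc_tabledata1">34.62</td>')

# First split: everything after the opening of the data cell.
# Second split: everything before the closing </td>.
value = source.split(gather + ':</td><td class="yfnc_tabledata1">')[1].split('</td>')[0]
print(value)  # 34.62
```

The first `split(...)[1]` discards the label and everything before it; the second `split('</td>')[0]` trims the tail, leaving just the number as a string.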

My methodology for parsing websites typically goes by viewing the page itself, finding the value I want, copying that value to clipboard, viewing the page source, searching for that value, and just seeing how the page is typically structured around those data points.

Our current splitting method is crude, since it relies on static, hard-coded splitting parameters. Later on, we're going to find our values using Regular Expressions, which do a better job of hunting for data dynamically.
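As a preview, the same lookup can be done with the `re` module. This is only a sketch against the same assumed cell structure as above, not the regex we will settle on later:

```python
import re

gather = "Total Debt/Equity (mrq)"
# Stand-in snippet with a made-up value, mirroring the assumed cell layout.
source = ('<td class="yfnc_tablehead1">Total Debt/Equity (mrq):</td>'
          '<td class="yfnc_tabledata1">34.62</td>')

# re.escape handles the parentheses and slash in the feature name;
# the capture group tolerates digits, commas, dots, and a minus sign.
pattern = re.escape(gather) + r':</td><td class="yfnc_tabledata1">([\d.,-]+)</td>'
match = re.search(pattern, source)
if match:
    value = match.group(1)
    print(value)  # 34.62
```

A capture group like this can be loosened further (whitespace, extra attributes) without touching the surrounding code, which is the main advantage over fixed split strings.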

With this simple splitting method, however, we're able to pull the Debt/Equity ratios for all of the companies.
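One practical caveat: `split(...)[1]` raises an `IndexError` for any saved page that happens to lack the gather term, which would stop the loop over every company. A small guarded helper, assuming a missing value is acceptable as `None` (the function name is mine, not from the script above):

```python
def parse_value(source, gather="Total Debt/Equity (mrq)"):
    """Return the table-cell value after the gather term, or None if absent."""
    try:
        return source.split(gather + ':</td><td class="yfnc_tabledata1">')[1].split('</td>')[0]
    except IndexError:
        # The term was not on this page; skip it rather than crash the loop.
        return None

print(parse_value('Total Debt/Equity (mrq):</td><td class="yfnc_tabledata1">34.62</td>'))  # 34.62
print(parse_value('<html>no such table here</html>'))  # None
```

Dropping the bad files this way lets the batch run finish, and the `None` entries can be filtered out or handled later.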
