Python Programming Tutorials

More Parsing

While you may sometimes get lucky and there will be a nice, structured, download for the data you want, this is often not the case. For this reason, it is best that you have a good foundation for how to parse data from a website. There are many third-party parsing modules for Python, like Beautiful Soup, though parsing from websites is usually fairly trivial and does not require any other packages.

Modifying our script from before, we start with the same:

import pandas as pd
import os
import time
from datetime import datetime

path = "X:/Backups/intraQuarter"

def Key_Stats(gather="Total Debt/Equity (mrq)"):
    statspath = path+'/_KeyStats'
    stock_list = [x[0] for x in os.walk(statspath)]

    for each_dir in stock_list[1:]:
        each_file = os.listdir(each_dir)
        ticker = each_dir.split("\\")[1]
        if len(each_file) > 0:
            for file in each_file:
                date_stamp = datetime.strptime(file, '%Y%m%d%H%M%S.html')
                unix_time = time.mktime(date_stamp.timetuple())
                #print(date_stamp, unix_time)

The above is identical to before, with an added "ticker" variable which stores the current ticker being assessed. Note that Windows users will need the double back slashes, while other users may use a forward slash. It depends on how your operating system defines paths. Windows uses back slashes, which are "Escape characters." This means we need two, one to escape and the other to be the escaped backslash.

Next, we do:

                full_file_path = each_dir+'/'+file
                print(full_file_path)
                source = open(full_file_path,'r').read()
                #print(source)
                value = source.split(gather+':</td><td class="yfnc_tabledata1">')[1].split('</td>')[0]
                print(ticker+":",value)
            
            time.sleep(15)
Key_Stats()

With this code, we access the file, and save the full source code HTML contents to the "source" variable. From there, we do a quick search for the "gather" term, which is the name of the feature we're hunting for, then we split by the opening of the table data tag and the table data closing tag to find the value we're hunting for.

My methodology for parsing websites typically goes by viewing the page itself, finding the value I want, copying that value to clipboard, viewing the page source, searching for that value, and just seeing how the page is typically structured around those data points.

Our current splitting method is very crude, especially using very static splitting parameters. Later on, we're going to be finding our values using Regular Expressions, which do a better job hunting for data dynamically.

With this simple splitting method, however, we're able to pull the Debt/Equity ratios for all of the companies.

The next tutorial:

Intro to Machine Learning with Scikit Learn and Python
Simple Support Vector Machine (SVM) example with character recognition
Our Method and where we will be getting our Data
Parsing data
More Parsing
Structuring data with Pandas
Getting more data and meshing data sets
Labeling of data part 1
Labeling data part 2
Finally finishing up the labeling
Linear SVC Machine learning SVM example with Python
Getting more features from our data
Linear SVC machine learning and testing our data
Scaling, Normalizing, and machine learning with many features
Shuffling our data to solve a learning issue
Using Quandl for more data
Improving our Analysis with a more accurate measure of performance in relation to fundamentals
Learning and Testing our Machine learning algorithm
More testing, this time including N/A data
Back-testing the strategy
Pulling current data from Yahoo
Building our New Data-set
Searching for investment suggestions
Raising investment requirement standards
Testing raised standards
Streamlining the changing of standards