Structuring data with Pandas




Once we're able to acquire the data that we're interested in using for our features, we need some way of structuring and organizing it. I find the Pandas module to be the most useful for this task, given its quick, efficient, and easy data manipulation abilities.

To start, we're going to need to import pandas, as well as set up our DataFrame. The updated top of the script looks like this:

import pandas as pd
import os
import time
from datetime import datetime

# Path to the intraQuarter data set we've been parsing in the previous tutorials
path = "X:/Backups/intraQuarter"

def Key_Stats(gather="Total Debt/Equity (mrq)"):
    statspath = path+'/_KeyStats'
    # One sub-directory per ticker; os.walk yields (dirpath, dirnames, filenames)
    stock_list = [x[0] for x in os.walk(statspath)]
    # Empty DataFrame that we'll fill with one row per parsed stats file
    df = pd.DataFrame(columns = ['Date','Unix','Ticker','DE Ratio'])

Notice the new import at the top, and the definition of a new variable, df. The import speaks for itself. The df variable stores a new "DataFrame" object from Pandas, where we specify the columns to be date, Unix time, ticker, and DE ratio. Think of a DataFrame like a spreadsheet or a relational database table.
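If the spreadsheet analogy isn't obvious yet, here's a quick standalone illustration, with made-up placeholder values, of what that empty DataFrame looks like and what happens when a row is added to it (assuming a Pandas version old enough to still have DataFrame.append, as this series does):

import pandas as pd

# Standalone illustration: the "spreadsheet" our script builds, using placeholder values
df = pd.DataFrame(columns=['Date', 'Unix', 'Ticker', 'DE Ratio'])
print(df)    # no rows yet, just the column headers

sample_row = {'Date': '2004-01-30 14:35:00', 'Unix': 1075473300.0,
              'Ticker': 'aapl', 'DE Ratio': 0.91}
df = df.append(sample_row, ignore_index=True)   # same pattern the loop below uses
print(df)    # one row, one column per dictionary key

Each dictionary key maps to a column, and each appended dictionary becomes a row, which is exactly what the loop in the next block does for every stats file it parses.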

Next up:

    for each_dir in stock_list[1:]:
        each_file = os.listdir(each_dir)
        # The directory name after the backslash is the ticker symbol
        ticker = each_dir.split("\\")[1]
        if len(each_file) > 0:
            for file in each_file:
                # File names are timestamps, e.g. 20040130143500.html
                date_stamp = datetime.strptime(file, '%Y%m%d%H%M%S.html')
                unix_time = time.mktime(date_stamp.timetuple())
                full_file_path = each_dir+'/'+file
                source = open(full_file_path,'r').read()
                try:
                    # Grab the number that follows the gather label in the saved Yahoo Finance HTML
                    value = float(source.split(gather+':</td><td class="yfnc_tabledata1">')[1].split('</td>')[0])
                    # Append the parsed row to the DataFrame
                    df = df.append({'Date':date_stamp,'Unix':unix_time,'Ticker':ticker,'DE Ratio':value}, ignore_index = True)
                except Exception as e:
                    pass

The main change here is the try/except pairing.

Within the try block, we extract the value as before, then re-define our DataFrame object as the previous DataFrame with the new row appended to it. If the value can't be found or converted, the except block simply skips that file. In terms of speed, this is a rather sluggish way to use Pandas, though, as far as I know, there is no better way to "append" rows to your Pandas DataFrame one at a time. Luckily, while it is considered "sluggish" relative to the rest of Pandas, it's still quite fast.
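If the per-row append ever does become a bottleneck for you, or you're on a newer Pandas release (2.0 and later) where DataFrame.append has been removed, a common alternative is to collect the rows in an ordinary Python list inside the loop and build the DataFrame in one call afterwards. A rough sketch of that variation:

rows = []   # plain Python list of dicts, filled inside the for-loop

# ... inside the try block, in place of the df = df.append(...) line:
rows.append({'Date': date_stamp, 'Unix': unix_time,
             'Ticker': ticker, 'DE Ratio': value})

# ... after the loop has finished, build the DataFrame once:
df = pd.DataFrame(rows, columns=['Date', 'Unix', 'Ticker', 'DE Ratio'])

Either way the end result is the same DataFrame; for a data set of this size, the difference in runtime is small enough that the append version above is perfectly workable.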

Finally, we end our Key_Stats function with:

    # Build a file name from the gather string, stripping the characters
    # that can't (or shouldn't) appear in a file name
    save = gather.replace(' ','').replace(')','').replace('(','').replace('/','') + '.csv'
    print(save)
    df.to_csv(save)

Above, we build a custom name for the CSV file (with the default gather string, this works out to TotalDebtEquitymrq.csv), then use the DataFrame's to_csv method to output it to an actual CSV file.

Then we finish off the script, as usual, with a call to:

Key_Stats()

Running this saves the DataFrame as a CSV spreadsheet for us. We save the data because we really only need to acquire and parse it once. When we begin doing analysis, we're going to want to frequently modify the data, re-structure it, and do all sorts of testing with it, and re-pulling the raw data every single time would add a significant amount of time to the overall process.
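From here on, later steps can simply load the saved file back with read_csv rather than re-parsing all of the HTML. Something along these lines works, where TotalDebtEquitymrq.csv is the name the replace() chain above produces for the default gather string:

import pandas as pd

# Re-load the saved output of Key_Stats() for further analysis
df = pd.read_csv('TotalDebtEquitymrq.csv', index_col=0)
print(df.head())   # quick sanity check: first few rows of Date/Unix/Ticker/DE Ratio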



The next tutorial: Getting more data and meshing data sets

  • Intro to Machine Learning with Scikit Learn and Python
  • Simple Support Vector Machine (SVM) example with character recognition
  • Our Method and where we will be getting our Data
  • Parsing data
  • More Parsing
  • Structuring data with Pandas
  • Getting more data and meshing data sets
  • Labeling of data part 1
  • Labeling data part 2
  • Finally finishing up the labeling
  • Linear SVC Machine learning SVM example with Python
  • Getting more features from our data
  • Linear SVC machine learning and testing our data
  • Scaling, Normalizing, and machine learning with many features
  • Shuffling our data to solve a learning issue
  • Using Quandl for more data
  • Improving our Analysis with a more accurate measure of performance in relation to fundamentals
  • Learning and Testing our Machine learning algorithm
  • More testing, this time including N/A data
  • Back-testing the strategy
  • Pulling current data from Yahoo
  • Building our New Data-set
  • Searching for investment suggestions
  • Raising investment requirement standards
  • Testing raised standards
  • Streamlining the changing of standards