Structuring data with Pandas




Once we're able to acquire the data that we're interested in using for our features, we need some way of structuring and organizing it. I find the Pandas module to be the most useful for this task, given its quick, efficient, and easy data manipulation abilities.

To start, we're going to need to import pandas, as well as set up our DataFrame. The updated top of the script looks like this:

import pandas as pd
import os
import time
from datetime import datetime

# Path to the intraQuarter data set we've been parsing in the previous tutorials
path = "X:/Backups/intraQuarter"

def Key_Stats(gather="Total Debt/Equity (mrq)"):
    statspath = path+'/_KeyStats'
    # One sub-directory per ticker; os.walk yields (dirpath, dirnames, filenames)
    stock_list = [x[0] for x in os.walk(statspath)]
    # Empty DataFrame that we'll fill with one row per parsed stats file
    df = pd.DataFrame(columns = ['Date','Unix','Ticker','DE Ratio'])

Notice the new import at the top, and the definition of a new variable, df. The import speaks for itself. The df variable stores a new "DataFrame" object from Pandas, where we specify the columns to be date, Unix time, ticker, and DE ratio. Think of a DataFrame like a spreadsheet or a relational database table.
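If the spreadsheet analogy isn't obvious yet, here's a quick standalone illustration, with made-up placeholder values, of what that empty DataFrame looks like and what happens when a row is added to it (assuming a Pandas version old enough to still have DataFrame.append, as this series does):

import pandas as pd

# Standalone illustration: the "spreadsheet" our script builds, using placeholder values
df = pd.DataFrame(columns=['Date', 'Unix', 'Ticker', 'DE Ratio'])
print(df)    # no rows yet, just the column headers

sample_row = {'Date': '2004-01-30 14:35:00', 'Unix': 1075473300.0,
              'Ticker': 'aapl', 'DE Ratio': 0.91}
df = df.append(sample_row, ignore_index=True)   # same pattern the loop below uses
print(df)    # one row, one column per dictionary key

Each dictionary key maps to a column, and each appended dictionary becomes a row, which is exactly what the loop in the next block does for every stats file it parses.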

Next up:

    for each_dir in stock_list[1:]:
        each_file = os.listdir(each_dir)
        # The directory name after the backslash is the ticker symbol
        ticker = each_dir.split("\\")[1]
        if len(each_file) > 0:
            for file in each_file:
                # File names are timestamps, e.g. 20040130143500.html
                date_stamp = datetime.strptime(file, '%Y%m%d%H%M%S.html')
                unix_time = time.mktime(date_stamp.timetuple())
                full_file_path = each_dir+'/'+file
                source = open(full_file_path,'r').read()
                try:
                    # Grab the number that follows the gather label in the saved Yahoo Finance HTML
                    value = float(source.split(gather+':</td><td class="yfnc_tabledata1">')[1].split('</td>')[0])
                    # Append the parsed row to the DataFrame
                    df = df.append({'Date':date_stamp,'Unix':unix_time,'Ticker':ticker,'DE Ratio':value}, ignore_index = True)
                except Exception as e:
                    pass

The main change here is the try/except pairing.

Within the try block, we extract the value as before, then re-define our DataFrame object as the previous DataFrame with the new row appended to it. If the value can't be found or converted, the except block simply skips that file. In terms of speed, this is a rather sluggish way to use Pandas, though, as far as I know, there is no better way to "append" rows to your Pandas DataFrame one at a time. Luckily, while it is considered "sluggish" relative to the rest of Pandas, it's still quite fast.
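If the per-row append ever does become a bottleneck for you, or you're on a newer Pandas release (2.0 and later) where DataFrame.append has been removed, a common alternative is to collect the rows in an ordinary Python list inside the loop and build the DataFrame in one call afterwards. A rough sketch of that variation:

rows = []   # plain Python list of dicts, filled inside the for-loop

# ... inside the try block, in place of the df = df.append(...) line:
rows.append({'Date': date_stamp, 'Unix': unix_time,
             'Ticker': ticker, 'DE Ratio': value})

# ... after the loop has finished, build the DataFrame once:
df = pd.DataFrame(rows, columns=['Date', 'Unix', 'Ticker', 'DE Ratio'])

Either way the end result is the same DataFrame; for a data set of this size, the difference in runtime is small enough that the append version above is perfectly workable.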

Finally, we end our Key_Stats function with:

    # Build a file name from the gather string, stripping the characters
    # that can't (or shouldn't) appear in a file name
    save = gather.replace(' ','').replace(')','').replace('(','').replace('/','') + '.csv'
    print(save)
    df.to_csv(save)

Above, we build a custom name for the CSV file (with the default gather string, this works out to TotalDebtEquitymrq.csv), then use the DataFrame's to_csv method to output it to an actual CSV file.

Then we finish off the script, as usual, with a call to:

Key_Stats()

Running this saves the DataFrame as a CSV spreadsheet for us. We save the data because we really only need to acquire and parse it once. When we begin doing analysis, we're going to want to frequently modify the data, re-structure it, and do all sorts of testing with it, and re-pulling the raw data every single time would add a significant amount of time to the overall process.
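From here on, later steps can simply load the saved file back with read_csv rather than re-parsing all of the HTML. Something along these lines works, where TotalDebtEquitymrq.csv is the name the replace() chain above produces for the default gather string:

import pandas as pd

# Re-load the saved output of Key_Stats() for further analysis
df = pd.read_csv('TotalDebtEquitymrq.csv', index_col=0)
print(df.head())   # quick sanity check: first few rows of Date/Unix/Ticker/DE Ratio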



The next tutorial: Getting more data and meshing data sets

  • Intro to Machine Learning with Scikit Learn and Python
  • Simple Support Vector Machine (SVM) example with character recognition
  • Our Method and where we will be getting our Data
  • Parsing data
  • More Parsing
  • Structuring data with Pandas
  • Getting more data and meshing data sets
  • Labeling of data part 1
  • Labeling data part 2
  • Finally finishing up the labeling
  • Linear SVC Machine learning SVM example with Python
  • Getting more features from our data
  • Linear SVC machine learning and testing our data
  • Scaling, Normalizing, and machine learning with many features
  • Shuffling our data to solve a learning issue
  • Using Quandl for more data
  • Improving our Analysis with a more accurate measure of performance in relation to fundamentals
  • Learning and Testing our Machine learning algorithm
  • More testing, this time including N/A data
  • Back-testing the strategy
  • Pulling current data from Yahoo
  • Building our New Data-set
  • Searching for investment suggestions
  • Raising investment requirement standards
  • Testing raised standards
  • Streamlining the changing of standards