While you may sometimes get lucky and there will be a nice, structured, download for the data you want, this is often not the case. For this reason, it is best that you have a good foundation for how to parse data from a website. There are many third-party parsing modules for Python, like Beautiful Soup, though parsing from websites is usually fairly trivial and does not require any other packages.
Modifying our script from before, we start with the same:
import pandas as pd import os import time from datetime import datetime path = "X:/Backups/intraQuarter" def Key_Stats(gather="Total Debt/Equity (mrq)"): statspath = path+'/_KeyStats' stock_list = [x[0] for x in os.walk(statspath)] for each_dir in stock_list[1:]: each_file = os.listdir(each_dir) ticker = each_dir.split("\\")[1] if len(each_file) > 0: for file in each_file: date_stamp = datetime.strptime(file, '%Y%m%d%H%M%S.html') unix_time = time.mktime(date_stamp.timetuple()) #print(date_stamp, unix_time)
The above is identical to before, with an added "ticker" variable which stores the current ticker being assessed. Note that Windows users will need the double back slashes, while other users may use a forward slash. It depends on how your operating system defines paths. Windows uses back slashes, which are "Escape characters." This means we need two, one to escape and the other to be the escaped backslash.
Next, we do:
full_file_path = each_dir+'/'+file print(full_file_path) source = open(full_file_path,'r').read() #print(source) value = source.split(gather+':</td><td class="yfnc_tabledata1">')[1].split('</td>')[0] print(ticker+":",value) time.sleep(15) Key_Stats()
With this code, we access the file, and save the full source code HTML contents to the "source" variable. From there, we do a quick search for the "gather" term, which is the name of the feature we're hunting for, then we split by the opening of the table data tag and the table data closing tag to find the value we're hunting for.
My methodology for parsing websites typically goes by viewing the page itself, finding the value I want, copying that value to clipboard, viewing the page source, searching for that value, and just seeing how the page is typically structured around those data points.
Our current splitting method is very crude, especially using very static splitting parameters. Later on, we're going to be finding our values using Regular Expressions, which do a better job hunting for data dynamically.
With this simple splitting method, however, we're able to pull the Debt/Equity ratios for all of the companies.