Welcome to Part 7 of our Data Analysis with Python and Pandas tutorial series. In the last couple of tutorials, we learned how to combine data sets. In this tutorial, we're going to resume under the premise that we're aspiring real estate moguls. We're looking to protect our wealth through diversification, and one component of this is real estate. In Part 4, we had the following code set up:
import Quandl
import pandas as pd

# Not necessary, I just do this so I do not show my API key.
api_key = open('quandlapikey.txt','r').read()

fiddy_states = pd.read_html('https://simple.wikipedia.org/wiki/List_of_U.S._states')

# read_html returns a list of dataframes; the state abbreviations are in the
# first column of the first table, and [1:] skips the header row.
for abbv in fiddy_states[0][0][1:]:
    #print(abbv)
    print("FMAC/HPI_"+str(abbv))
This code was used to get the 50 states, iterate through them, and generate the appropriate Quandl query to return the House Price Index by state. Since we'll wind up with 50 dataframes here, we'd rather combine them all into one massive one. To do this, we can use .join, which we learned in the previous tutorial. We will use .join in this case because the data is returned back to us, using the Quandl module, with an actual index, Date. Normally, you probably won't get this; it'll just be dataframes with regular numbers for the index. In that case, you would use merge, with on='Date'.
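To make the index-versus-column distinction concrete, here is a toy sketch with two small dataframes and entirely made-up numbers (not real HPI data), showing both the .join-on-index case and the merge-on-column case:

```python
import pandas as pd

# Two toy dataframes sharing a Date index (made-up values, not real HPI data)
dates = pd.to_datetime(['1975-01-31', '1975-02-28', '1975-03-31'])
ak = pd.DataFrame({'AK': [34.1, 34.5, 34.9]}, index=dates)
al = pd.DataFrame({'AL': [38.2, 38.6, 39.0]}, index=dates)
ak.index.name = 'Date'
al.index.name = 'Date'

# .join aligns on the index, so the columns line up by Date automatically
joined = ak.join(al)
print(joined)

# If Date were a regular column instead of the index, we would merge on it:
ak_flat = ak.reset_index()
al_flat = al.reset_index()
merged = pd.merge(ak_flat, al_flat, on='Date')
print(merged)
```

Either way, each state ends up as its own column, aligned by date.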
Now, to run through and collect all of the data, we can do the following change:
import Quandl
import pandas as pd

# Not necessary, I just do this so I do not show my API key.
api_key = open('quandlapikey.txt','r').read()

fiddy_states = pd.read_html('https://simple.wikipedia.org/wiki/List_of_U.S._states')

main_df = pd.DataFrame()

for abbv in fiddy_states[0][0][1:]:
    query = "FMAC/HPI_"+str(abbv)
    df = Quandl.get(query, authtoken=api_key)
    if main_df.empty:
        main_df = df
    else:
        main_df = main_df.join(df)
NOTE: Quandl has since changed the returns of its datasets, such that if the return has one column (or so it seems to me), the title of that column is just "Value." Well, that's irritating, but we can work around it. In our for loop, rename the dataframe's column to whatever our abbv value is. Without making this change, you will likely be seeing:
ValueError: columns overlap but no suffix specified: Index([u'Value'], dtype='object')
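A minimal sketch of that workaround, assuming the returned column is titled "Value" as the error above suggests (the stand-in dataframe here just mimics what Quandl returns):

```python
import pandas as pd

# Stand-in for what Quandl returns: a one-column dataframe titled "Value"
df = pd.DataFrame({'Value': [34.1, 34.5, 34.9]})

abbv = 'AK'  # whichever state abbreviation the loop is currently on

# Rename the lone column to the state abbreviation so .join has no overlap
df.columns = [abbv]
# equivalently: df = df.rename(columns={'Value': abbv})
print(df.columns)
```

With each dataframe carrying a unique column name, the .join calls no longer collide.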
Great, but you will find this process can take anywhere from 30 seconds to a few minutes, every single time you want to run it. That's pretty annoying. Right now, your short-term goal is just to make this happen, but then what's next? We're going to keep building on this, and every time we run a test or whatever, we'll have to put up with this nonsense! Because of that, we want to save this data. Now, this is both a Data Analysis and a Pandas tutorial. With Pandas, we could simply output the data to a CSV, or whatever file type we want, including the one we're about to talk about. You may not always have the option to output the data to a simple file, however. In any case, we want to save this data to a file so that we only need to perform this action once, and then we can build on top of it.
Consider machine learning, for example: you generally train a classifier once, and then you can start immediately, and quickly, classifying with it. The problem is that a classifier can't be saved to a .txt or .csv file. It's an object. Luckily, in programming, there are various terms for the process of saving binary data to a file that can be accessed later. In Python, this is called pickling. You may know it as serialization, or maybe even something else. Python has a module called pickle, which will convert your object to a byte stream, or reverse the process with unpickling. What this lets us do is save any Python object. That machine learning classifier? Yep. Dictionary? Yessir. Dataframe? Yep! Now, it just so happens that Pandas has pickling handled in its IO module, but you really should know how to do it with and without Pandas, so let's do that!
First, let's talk about a regular pickle. You can do this with any Python object that you want, it does not need to be a dataframe, but we'll do it with our dataframe.
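For instance, here is the same pattern applied to a plain dictionary (with made-up values), just to show that pickling isn't dataframe-specific:

```python
import pickle

# Any Python object will do; here, a small dictionary with made-up values
some_dict = {'AK': 34.1, 'AL': 38.2}

# Dump the object to a byte stream on disk
pickle_out = open('some_dict.pickle','wb')
pickle.dump(some_dict, pickle_out)
pickle_out.close()

# Load it straight back into a Python object
pickle_in = open('some_dict.pickle','rb')
restored = pickle.load(pickle_in)
pickle_in.close()

print(restored)
```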
To start, import pickle at the top of your script. Then:
pickle_out = open('fiddy_states.pickle','wb')
pickle.dump(main_df, pickle_out)
pickle_out.close()
First, we open a .pickle file with the intention of writing some bytes. Then we call pickle.dump, passing the data we want to pickle and where to dump it (the file we just opened). Finally, as with any file, we close it. Done: we've saved the pickle.
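As an aside, the same open/dump/close sequence is often written with a with statement, which closes the file automatically even if an error occurs partway through. A quick sketch, using a small stand-in dataframe rather than the full Quandl pull:

```python
import pickle
import pandas as pd

# Stand-in dataframe, so we don't need to hit Quandl for this demo
main_df = pd.DataFrame({'AK': [34.1, 34.5, 34.9]})

# The with block closes the file for us when the block ends
with open('fiddy_states.pickle', 'wb') as pickle_out:
    pickle.dump(main_df, pickle_out)

with open('fiddy_states.pickle', 'rb') as pickle_in:
    loaded = pickle.load(pickle_in)

print(loaded)
```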
I would like to have us organize this code now, however. We don't want to run this code every time, but we do still want to reference the state list likely from time to time. Let's clean this up a bit:
import Quandl
import pandas as pd
import pickle

# Not necessary, I just do this so I do not show my API key.
api_key = open('quandlapikey.txt','r').read()

def state_list():
    fiddy_states = pd.read_html('https://simple.wikipedia.org/wiki/List_of_U.S._states')
    return fiddy_states[0][0][1:]

def grab_initial_state_data():
    states = state_list()

    main_df = pd.DataFrame()

    for abbv in states:
        query = "FMAC/HPI_"+str(abbv)
        df = Quandl.get(query, authtoken=api_key)
        print(query)
        # Rename the "Value" column to the state abbreviation (see NOTE above)
        df.columns = [abbv]
        if main_df.empty:
            main_df = df
        else:
            main_df = main_df.join(df)

    pickle_out = open('fiddy_states.pickle','wb')
    pickle.dump(main_df, pickle_out)
    pickle_out.close()

grab_initial_state_data()
Now, we can reference state_list any time we need that state list, then call grab_initial_state_data real quick for the HPI baseline, and we've saved that data to a pickle file.
Now, to get that data again, we just need to do:
pickle_in = open('fiddy_states.pickle','rb')
HPI_data = pickle.load(pickle_in)
print(HPI_data)
The output is more than I want to paste here, but you should get a dataframe that is ~462 rows x 50 columns. There you have it. The loaded object is a dataframe; this is our way of simply "saving" a variable. Very cool! You can do this with the pickle module anywhere in Python, but it turns out that Pandas has its own pickling methods as well, so we might as well illustrate those:
HPI_data.to_pickle('pickle.pickle')
HPI_data2 = pd.read_pickle('pickle.pickle')
print(HPI_data2)
Again, the output is a bit too large to paste here, but you should get the same thing. If you're like me, you might wonder why Pandas made its own pickle option when Python already has one that works just fine. I really do not know. Apparently, the Pandas one can sometimes be quicker on massive datasets.
Now that we've got a pickle of our data, we're ready to continue along in the next tutorial with further research.