Welcome to Part 3 of Data Analysis with Pandas and Python. In this tutorial, we will begin discussing IO, or input/output, with Pandas, and begin with a realistic use-case. To get ample practice, a very useful website is Quandl. Quandl contains a plethora of free and paid data sources. What makes this location great is that the data is generally normalized, it's all in one place, and extracting the data is the same method. If you are using Python, and you access the Quandl data via their simple module, then the data is automatically returned to a dataframe. For the purposes of this tutorial, we're going to just manually download a CSV file instead, for learning purposes, since not every data source you find is going to have a nice and neat module for extracting the datasets.
Let's say we're interested in maybe purchasing or selling a home in Austin, Texas. The zipcode there is 77006. We could go to the local housing listings and see what the current prices are, but this doesn't really give us any real historical information, so let's just try to get some data on this. Let's query for "home value index 77006." Sure enough, we can see an index here. There's top, middle, lower tier, three bedroom, and so on. Let's say, sure, we got a a three bedroom house. Let's check that out. Turns out Quandl already provides graphs, but let's grab the dataset anyway, make our own graph, and maybe do some other analysis. Go to download, and choose CSV. Pandas is capable of IO with csv, excel data, hdf, sql, json, msgpack, html, gbq, stata, clipboard, and pickle data, and the list continues to grow. Check out the IO Tools documentation for the current list. Take that CSV and move it into the local directory (the directory that you are currently working in / where this .py script is).
Starting with this code, loading in a CSV to a dataframe can be as simple as:
import pandas as pd df = pd.read_csv('ZILL-Z77006_3B.csv') print(df.head())
Output:
Date Value 0 2015-06-30 502300 1 2015-05-31 501500 2 2015-04-30 500100 3 2015-03-31 495800 4 2015-02-28 492700
Notice that we have no decent index again. We can fix that like we did before doing:
df.set_index('Date', inplace = True)
Now, let's say we want to send this back to a CSV, we can do:
df.to_csv('newcsv2.csv')
We only have the one column right now, but if you had many columns, and just wanted to send one, you could do:
df['Value'].to_csv('newcsv2.csv')
Remember how we graphed multiple, but not all, columns? See if you can guess how to save multiple, but not all, columns.
Now, let's read that new CSV in:
df = pd.read_csv('newcsv2.csv') print(df.head())
Output:
Date Value 0 2015-06-30 502300 1 2015-05-31 501500 2 2015-04-30 500100 3 2015-03-31 495800 4 2015-02-28 492700
Darn, our index is gone again! This is because CSV has no "index" attribute like our dataframe does. What we can do, is set the index on import, rather than importing and then setting the index. Soemthing like:
df = pd.read_csv('newcsv2.csv', index_col=0) print(df.head())
Output:
Value Date 2015-06-30 502300 2015-05-31 501500 2015-04-30 500100 2015-03-31 495800 2015-02-28 492700
Now, I do not know about you, but the name "value" is fairly worthless. Can we change this? Sure, there are many ways to change the column names, one way is:
df.columns = ['House_Prices'] print(df.head())
Output:
House_Prices Date 2015-06-30 502300 2015-05-31 501500 2015-04-30 500100 2015-03-31 495800 2015-02-28 492700
Next, we can try to save to csv like so:
df.to_csv('newcsv3.csv')
If you look at the CSV there, you should see it has the headers. What if you don't want headers? No problem!
df.to_csv('newcsv4.csv', header=False)
What if the file doesn't have headers? No problem
df = pd.read_csv('newcsv4.csv', names = ['Date','House_Price'], index_col=0) print(df.head())
Output:
House_Price Date 2015-06-30 502300 2015-05-31 501500 2015-04-30 500100 2015-03-31 495800 2015-02-28 492700
These were the basics of IO an some options you have when it comes to input and output.
One interesting thing is the use of Pandas for conversion. So, maybe you are inputting data from a CSV, but you'd really like to display that data to HTML on your website. Since HTML is one of the datatypes, we can just export to HTML, like so:
df.to_html('example.html')
Now we have an HTML file. Open it up, and boom you have a table in HTML.
House_Prices | |
---|---|
Date | |
2015-06-30 | 502300 |
2015-05-31 | 501500 |
2015-04-30 | 500100 |
2015-03-31 | 495800 |
2015-02-28 | 492700 |
2015-01-31 | 493000 |
2014-12-31 | 494200 |
2014-11-30 | 490900 |
2014-10-31 | 486000 |
2014-09-30 | 479800 |
2014-08-31 | 473900 |
2014-07-31 | 467100 |
2014-06-30 | 461400 |
2014-05-31 | 455400 |
2014-04-30 | 450500 |
2014-03-31 | 450300 |
Note, this table is automatically assigned the class of "dataframe." This means you can have custom CSS to handle for dataframe-specific tables!
I particularly like to use Pandas when I have an SQL dump of data. I tend to pour the database data right into a Pandas dataframe, perform the operations that I want to perform, then I display the data in a graph maybe, or I otherwise serve the data in some way.
Finally, what if we want to actually rename just one of the columns? Earlier, you were shown how to name all columns, but maybe you just want to change one without having to type all the others out. Easy enough:
df = pd.read_csv('newcsv4.csv', names = ['Date','House_Price']) print(df.head()) df.rename(columns={'House_Price':'Prices'}, inplace=True) print(df.head())
Output:
Date House_Price 0 2015-06-30 502300 1 2015-05-31 501500 2 2015-04-30 500100 3 2015-03-31 495800 4 2015-02-28 492700 Date Prices 0 2015-06-30 502300 1 2015-05-31 501500 2 2015-04-30 500100 3 2015-03-31 495800 4 2015-02-28 492700
So here, we first imported the headless file, giving the column names of Date and House_Price. Then, we decided, nope we want to call House_Price just Price instead. So, we used df.rename, specifying that we wanted to rename columns, then, in dictionary form, the Key is the original name and the value is the new name. We finally use inplace=True so the original object is modified.