IO Basics - p.3 Data Analysis with Python and Pandas Tutorial



Welcome to Part 3 of Data Analysis with Pandas and Python. In this tutorial, we will begin discussing IO, or input/output, with Pandas, and begin with a realistic use-case. To get ample practice, a very useful website is Quandl. Quandl contains a plethora of free and paid data sources. What makes this location great is that the data is generally normalized, it's all in one place, and extracting the data is the same method. If you are using Python, and you access the Quandl data via their simple module, then the data is automatically returned to a dataframe. For the purposes of this tutorial, we're going to just manually download a CSV file instead, for learning purposes, since not every data source you find is going to have a nice and neat module for extracting the datasets.

Let's say we're interested in maybe purchasing or selling a home in Austin, Texas. The zipcode there is 77006. We could go to the local housing listings and see what the current prices are, but this doesn't really give us any real historical information, so let's just try to get some data on this. Let's query for "home value index 77006." Sure enough, we can see an index here. There's top, middle, lower tier, three bedroom, and so on. Let's say, sure, we got a a three bedroom house. Let's check that out. Turns out Quandl already provides graphs, but let's grab the dataset anyway, make our own graph, and maybe do some other analysis. Go to download, and choose CSV. Pandas is capable of IO with csv, excel data, hdf, sql, json, msgpack, html, gbq, stata, clipboard, and pickle data, and the list continues to grow. Check out the IO Tools documentation for the current list. Take that CSV and move it into the local directory (the directory that you are currently working in / where this .py script is).

Starting with this code, loading in a CSV to a dataframe can be as simple as:

import pandas as pd

df = pd.read_csv('ZILL-Z77006_3B.csv')
print(df.head())

Output:

         Date   Value
0  2015-06-30  502300
1  2015-05-31  501500
2  2015-04-30  500100
3  2015-03-31  495800
4  2015-02-28  492700

Notice that we have no decent index again. We can fix that like we did before doing:

df.set_index('Date', inplace = True)

Now, let's say we want to send this back to a CSV, we can do:

df.to_csv('newcsv2.csv')

We only have the one column right now, but if you had many columns, and just wanted to send one, you could do:

df['Value'].to_csv('newcsv2.csv')

Remember how we graphed multiple, but not all, columns? See if you can guess how to save multiple, but not all, columns.

Now, let's read that new CSV in:

df = pd.read_csv('newcsv2.csv')
print(df.head())

Output:

         Date   Value
0  2015-06-30  502300
1  2015-05-31  501500
2  2015-04-30  500100
3  2015-03-31  495800
4  2015-02-28  492700

Darn, our index is gone again! This is because CSV has no "index" attribute like our dataframe does. What we can do, is set the index on import, rather than importing and then setting the index. Soemthing like:

df = pd.read_csv('newcsv2.csv', index_col=0)
print(df.head())

Output:

             Value
Date              
2015-06-30  502300
2015-05-31  501500
2015-04-30  500100
2015-03-31  495800
2015-02-28  492700

Now, I do not know about you, but the name "value" is fairly worthless. Can we change this? Sure, there are many ways to change the column names, one way is:

df.columns = ['House_Prices']
print(df.head())

Output:

            House_Prices
Date                    
2015-06-30        502300
2015-05-31        501500
2015-04-30        500100
2015-03-31        495800
2015-02-28        492700

Next, we can try to save to csv like so:

df.to_csv('newcsv3.csv')

If you look at the CSV there, you should see it has the headers. What if you don't want headers? No problem!

df.to_csv('newcsv4.csv', header=False)

What if the file doesn't have headers? No problem

df = pd.read_csv('newcsv4.csv', names = ['Date','House_Price'], index_col=0)
print(df.head())

Output:

            House_Price
Date                   
2015-06-30       502300
2015-05-31       501500
2015-04-30       500100
2015-03-31       495800
2015-02-28       492700

These were the basics of IO an some options you have when it comes to input and output.

One interesting thing is the use of Pandas for conversion. So, maybe you are inputting data from a CSV, but you'd really like to display that data to HTML on your website. Since HTML is one of the datatypes, we can just export to HTML, like so:

df.to_html('example.html')

Now we have an HTML file. Open it up, and boom you have a table in HTML.

House_Prices
Date
2015-06-30 502300
2015-05-31 501500
2015-04-30 500100
2015-03-31 495800
2015-02-28 492700
2015-01-31 493000
2014-12-31 494200
2014-11-30 490900
2014-10-31 486000
2014-09-30 479800
2014-08-31 473900
2014-07-31 467100
2014-06-30 461400
2014-05-31 455400
2014-04-30 450500
2014-03-31 450300

Note, this table is automatically assigned the class of "dataframe." This means you can have custom CSS to handle for dataframe-specific tables!

I particularly like to use Pandas when I have an SQL dump of data. I tend to pour the database data right into a Pandas dataframe, perform the operations that I want to perform, then I display the data in a graph maybe, or I otherwise serve the data in some way.

Finally, what if we want to actually rename just one of the columns? Earlier, you were shown how to name all columns, but maybe you just want to change one without having to type all the others out. Easy enough:

df = pd.read_csv('newcsv4.csv', names = ['Date','House_Price'])
print(df.head())

df.rename(columns={'House_Price':'Prices'}, inplace=True)
print(df.head())

Output:

         Date  House_Price
0  2015-06-30       502300
1  2015-05-31       501500
2  2015-04-30       500100
3  2015-03-31       495800
4  2015-02-28       492700
         Date  Prices
0  2015-06-30  502300
1  2015-05-31  501500
2  2015-04-30  500100
3  2015-03-31  495800
4  2015-02-28  492700

So here, we first imported the headless file, giving the column names of Date and House_Price. Then, we decided, nope we want to call House_Price just Price instead. So, we used df.rename, specifying that we wanted to rename columns, then, in dictionary form, the Key is the original name and the value is the new name. We finally use inplace=True so the original object is modified.


There exists 2 challenge(s) for this tutorial. for access to these, video downloads, and no ads.


There exists 1 quiz/question(s) for this tutorial. for access to these, video downloads, and no ads.

The next tutorial:




  • Data Analysis with Python and Pandas Tutorial Introduction
  • Pandas Basics - p.2 Data Analysis with Python and Pandas Tutorial
  • IO Basics - p.3 Data Analysis with Python and Pandas Tutorial
  • Building dataset - p.4 Data Analysis with Python and Pandas Tutorial
  • Concatenating and Appending dataframes - p.5 Data Analysis with Python and Pandas Tutorial
  • Joining and Merging Dataframes - p.6 Data Analysis with Python and Pandas Tutorial
  • Pickling - p.7 Data Analysis with Python and Pandas Tutorial
  • Percent Change and Correlation Tables - p.8 Data Analysis with Python and Pandas Tutorial
  • Resampling - p.9 Data Analysis with Python and Pandas Tutorial
  • Handling Missing Data - p.10 Data Analysis with Python and Pandas Tutorial
  • Rolling statistics - p.11 Data Analysis with Python and Pandas Tutorial
  • Applying Comparison Operators to DataFrame - p.12 Data Analysis with Python and Pandas Tutorial
  • Joining 30 year mortgage rate - p.13 Data Analysis with Python and Pandas Tutorial
  • Adding other economic indicators - p.14 Data Analysis with Python and Pandas Tutorial
  • Rolling Apply and Mapping Functions - p.15 Data Analysis with Python and Pandas Tutorial
  • Scikit Learn Incorporation - p.16 Data Analysis with Python and Pandas Tutorial