Building dataset - p.4 Data Analysis with Python and Pandas Tutorial



In this part of the Data Analysis with Python and Pandas tutorial series, we're going to expand things a bit. Let's imagine we're multi-billionaires (or multi-millionaires, but it's more fun to be billionaires) and we're trying to diversify our portfolio as much as possible. We want to have all types of asset classes, so we've got stocks, bonds, maybe a money market account, and now we're looking to get into real estate to round things out. You've all seen the commercials, right? You buy a CD for $60, attend some $500 seminar, and you're set to start making your six-figure-at-a-time investments into property, right?

Okay, maybe not, but we definitely want to do some research and have some sort of strategy for buying real estate. So, what governs the prices of homes, and do we need to do the research to find this out? Generally, no, you don't really need to do that digging; the factors are well known: the economy, interest rates, and demographics. These are the three major influences on real estate value in general. Of course, if you're buying land, various other things matter: how level it is, whether we'll need to do some work to the land before we can actually lay a foundation, how the drainage is, and so on. If there is a house, then we have even more factors, like the roof, windows, heating/AC, floors, foundation, and so on. We can begin to consider those factors later, but first we'll start at the macro level. You will see how quickly our datasets inflate here as it is; they blow up fast.

So, our first step is to just collect the data. Quandl still represents a great place to start, but this time let's automate the data grabbing. We're going to pull housing data for the 50 states first, and then try to gather other data as well. We definitely don't want to be pulling this data manually. First, if you do not already have an account, you need to get one. This will give you an API key and unlimited API requests to the free data, which is awesome.

Once you create an account, go to your account / me, whatever they are calling it at the time, and find the section marked API key. That's your key, which you will need. Next, we want to grab the Quandl module. We really don't need the module to make requests at all, but it's a very small module, and the slight ease it gives us is worth the size, so we might as well. Open up your terminal/cmd.exe and do pip install quandl (again, remember to specify the full path to pip if pip is not recognized).

Next, we're ready to rumble. Open up a new editor. To start:

import quandl

# Not necessary; I just do this so I do not show my API key.
# Note: recent versions of the package are imported as lowercase quandl;
# older versions used "import Quandl" instead.
api_key = open('quandlapikey.txt', 'r').read().strip()  # strip any trailing newline

df = quandl.get("FMAC/HPI_TX", authtoken=api_key)

print(df.head())

You can just store a plain-text version of your key if you want; I have only hidden mine since this is a tutorial that I am publishing. This is all we need to do to grab the housing price index for Texas. The actual ticker to grab can be found on any dataset's page: whenever you get there, just click on the library you are using in the sidebar (in our case, Python) and up will pop the query you need to type.

As you carry on with your data science career, you will learn various conventions that exist simply because people are logical and reasonable. In our case, we need to grab the data for all of the states. How might we do this? Do we still need to grab every ticker manually? No. Looking at this ticker, FMAC/HPI_TX, we can easily decipher it: FMAC = Freddie Mac, HPI = House Price Index, and TX is Texas, the common two-letter abbreviation for the state. From here, we can safely assume all of the tickers are built this way, so now we just need a list of the state abbreviations. We Google it and pick something like this list of the 50 states. Now what?

We can extract this data in a variety of ways. This is a Pandas tutorial, so if we can use Pandas, we shall. Let's check out read_html from Pandas. It's not being called "experimental" anymore, but I would still label it as experimental. The standard and quality of the other IO modules is very high and dependable; read_html is not quite up to par, but I still say it's very impressive and useful code, and just plain cool. The way it works is that you simply feed in a URL, and Pandas extracts the dataframe-worthy data from any tables it finds. This means that, unlike the other IO methods you will usually use, read_html returns a list of dataframes rather than a single one. First, in order to use read_html, we need html5lib. Open up cmd.exe or your terminal and do: pip install html5lib. Now, we can make our first attempt by doing:

import pandas as pd

fiddy_states = pd.read_html('https://simple.wikipedia.org/wiki/List_of_U.S._states')
print(fiddy_states)

That's more output than I am going to post here, but you get the idea. At least some of this data is what we want, and it looks like the first dataframe is off to a good start. So let's do:

print(fiddy_states[0])
               0               1               2                  3
0   Abbreviation      State name         Capital     Became a state
1             AL         Alabama      Montgomery  December 14, 1819
2             AK          Alaska          Juneau    January 3, 1959
3             AZ         Arizona         Phoenix  February 14, 1912
4             AR        Arkansas     Little Rock      June 15, 1836
5             CA      California      Sacramento  September 9, 1850
6             CO        Colorado          Denver     August 1, 1876
7             CT     Connecticut        Hartford    January 9, 1788
8             DE        Delaware           Dover   December 7, 1787
9             FL         Florida     Tallahassee      March 3, 1845
10            GA         Georgia         Atlanta    January 2, 1788
11            HI          Hawaii        Honolulu    August 21, 1959
12            ID           Idaho           Boise       July 3, 1890
13            IL        Illinois     Springfield   December 3, 1818
14            IN         Indiana    Indianapolis  December 11, 1816
15            IA            Iowa      Des Moines  December 28, 1846
16            KS          Kansas          Topeka   January 29, 1861
17            KY        Kentucky       Frankfort       June 1, 1792
18            LA       Louisiana     Baton Rouge     April 30, 1812
19            ME           Maine         Augusta     March 15, 1820
20            MD        Maryland       Annapolis     April 28, 1788
21            MA   Massachusetts          Boston   February 6, 1788
22            MI        Michigan         Lansing   January 26, 1837
23            MN       Minnesota      Saint Paul       May 11, 1858
24            MS     Mississippi         Jackson  December 10, 1817
25            MO        Missouri  Jefferson City    August 10, 1821
26            MT         Montana          Helena   November 8, 1889
27            NE        Nebraska         Lincoln      March 1, 1867
28            NV          Nevada     Carson City   October 31, 1864
29            NH   New Hampshire         Concord      June 21, 1788
30            NJ      New Jersey         Trenton  December 18, 1787
31            NM      New Mexico        Santa Fe    January 6, 1912
32            NY        New York          Albany      July 26, 1788
33            NC  North Carolina         Raleigh  November 21, 1789
34            ND    North Dakota        Bismarck   November 2, 1889
35            OH            Ohio        Columbus      March 1, 1803
36            OK        Oklahoma   Oklahoma City  November 16, 1907
37            OR          Oregon           Salem  February 14, 1859
38            PA    Pennsylvania      Harrisburg  December 12, 1787
39            RI    Rhode Island      Providence       May 19, 1790
40            SC  South Carolina        Columbia       May 23, 1788
41            SD    South Dakota          Pierre   November 2, 1889
42            TN       Tennessee       Nashville       June 1, 1796
43            TX           Texas          Austin  December 29, 1845
44            UT            Utah  Salt Lake City    January 4, 1896
45            VT         Vermont      Montpelier      March 4, 1791
46            VA        Virginia        Richmond      June 25, 1788
47            WA      Washington         Olympia  November 11, 1889
48            WV   West Virginia      Charleston      June 20, 1863
49            WI       Wisconsin         Madison       May 29, 1848
50            WY         Wyoming        Cheyenne      July 10, 1890

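Since read_html returns a list of dataframes, it can be worth a quick sanity check on what actually came back before we commit to an index. A minimal sketch:

print(len(fiddy_states))      # how many tables Pandas found on the page
print(type(fiddy_states[0]))  # each element is an ordinary dataframe
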
Yep, that table is looking good; we want column 0. So, we want to iterate through column 0 of fiddy_states[0]. Remember, right now fiddy_states is a list of dataframes, and fiddy_states[0] is the first dataframe. To reference column 0, then, we do fiddy_states[0][0]. The first index is a list index, which returns a dataframe; the second is a column within that dataframe. Next, we notice the first item in column 0 is the word "Abbreviation," which we don't want. We can skip it when we iterate through the items in column 0 by slicing with [1:]. As such, our list of abbreviations is fiddy_states[0][0][1:], and we can iterate like so:

for abbv in fiddy_states[0][0][1:]:
    print(abbv)
AL
AK
AZ
AR
CA
CO
CT
DE
FL
GA
HI
ID
IL
IN
IA
KS
KY
LA
ME
MD
MA
MI
MN
MS
MO
MT
NE
NV
NH
NJ
NM
NY
NC
ND
OH
OK
OR
PA
RI
SC
SD
TN
TX
UT
VT
VA
WA
WV
WI
WY

Perfect! Now, let's remember why we're doing this in the first place: we are trying to build the tickers with the state abbreviations to acquire the housing price index for each state. Okay, we can build the tickers:

for abbv in fiddy_states[0][0][1:]:
    #print(abbv)
    print("FMAC/HPI_"+str(abbv))
FMAC/HPI_AL
FMAC/HPI_AK
FMAC/HPI_AZ
FMAC/HPI_AR
FMAC/HPI_CA
FMAC/HPI_CO
FMAC/HPI_CT
FMAC/HPI_DE
FMAC/HPI_FL
FMAC/HPI_GA
FMAC/HPI_HI
FMAC/HPI_ID
FMAC/HPI_IL
FMAC/HPI_IN
FMAC/HPI_IA
FMAC/HPI_KS
FMAC/HPI_KY
FMAC/HPI_LA
FMAC/HPI_ME
FMAC/HPI_MD
FMAC/HPI_MA
FMAC/HPI_MI
FMAC/HPI_MN
FMAC/HPI_MS
FMAC/HPI_MO
FMAC/HPI_MT
FMAC/HPI_NE
FMAC/HPI_NV
FMAC/HPI_NH
FMAC/HPI_NJ
FMAC/HPI_NM
FMAC/HPI_NY
FMAC/HPI_NC
FMAC/HPI_ND
FMAC/HPI_OH
FMAC/HPI_OK
FMAC/HPI_OR
FMAC/HPI_PA
FMAC/HPI_RI
FMAC/HPI_SC
FMAC/HPI_SD
FMAC/HPI_TN
FMAC/HPI_TX
FMAC/HPI_UT
FMAC/HPI_VT
FMAC/HPI_VA
FMAC/HPI_WA
FMAC/HPI_WV
FMAC/HPI_WI
FMAC/HPI_WY

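As an aside, if you would rather hold onto these tickers than just print them, a list comprehension collects them neatly. A minimal sketch:

# Gather the full list of tickers instead of printing them one by one.
tickers = ["FMAC/HPI_" + str(abbv) for abbv in fiddy_states[0][0][1:]]
print(tickers[:5])  # sanity check on the first five
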
We've got the tickers, so now we're ready to pull the dataframes. But then, what will we even do once we have them? Are we going to work with 50 separate dataframes? That sounds like a silly idea; we need some way to combine them. The good people behind Pandas saw that coming and have provided us many ways to combine dataframes. We'll be talking about that in the next tutorial.
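
To make that goal concrete, here is a minimal sketch of what the pulling step could look like, assuming the lowercase quandl module and the API key file from earlier. Note that this fires off 50 API requests, so it takes a moment to run:

import quandl
import pandas as pd

api_key = open('quandlapikey.txt', 'r').read().strip()

fiddy_states = pd.read_html('https://simple.wikipedia.org/wiki/List_of_U.S._states')

# Pull each state's housing price index into its own dataframe.
state_dfs = []
for abbv in fiddy_states[0][0][1:]:
    query = "FMAC/HPI_" + str(abbv)
    state_dfs.append(quandl.get(query, authtoken=api_key))

print(len(state_dfs))  # 50 dataframes, one per state

Combining those 50 dataframes into something workable is exactly where we pick up next.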



The next tutorial: Concatenating and Appending dataframes - p.5 Data Analysis with Python and Pandas Tutorial

  • Data Analysis with Python and Pandas Tutorial Introduction
  • Pandas Basics - p.2 Data Analysis with Python and Pandas Tutorial
  • IO Basics - p.3 Data Analysis with Python and Pandas Tutorial
  • Building dataset - p.4 Data Analysis with Python and Pandas Tutorial
  • Concatenating and Appending dataframes - p.5 Data Analysis with Python and Pandas Tutorial
  • Joining and Merging Dataframes - p.6 Data Analysis with Python and Pandas Tutorial
  • Pickling - p.7 Data Analysis with Python and Pandas Tutorial
  • Percent Change and Correlation Tables - p.8 Data Analysis with Python and Pandas Tutorial
  • Resampling - p.9 Data Analysis with Python and Pandas Tutorial
  • Handling Missing Data - p.10 Data Analysis with Python and Pandas Tutorial
  • Rolling statistics - p.11 Data Analysis with Python and Pandas Tutorial
  • Applying Comparison Operators to DataFrame - p.12 Data Analysis with Python and Pandas Tutorial
  • Joining 30 year mortgage rate - p.13 Data Analysis with Python and Pandas Tutorial
  • Adding other economic indicators - p.14 Data Analysis with Python and Pandas Tutorial
  • Rolling Apply and Mapping Functions - p.15 Data Analysis with Python and Pandas Tutorial
  • Scikit Learn Incorporation - p.16 Data Analysis with Python and Pandas Tutorial