In this part of the Data Analysis with Python and Pandas tutorial series, we're going to expand things a bit. Let's say we're multi-billionaires, or multi-millionaires, but it's more fun to be billionaires, and we're trying to diversify our portfolio as much as possible. We want to have all types of asset classes, so we've got stocks, bonds, maybe a money market account, and now we're looking to get into real estate to be extra solid. You've all seen the commercials, right? You buy a CD for $60, attend some $500 seminar, and you're set to start making your six-figure-at-a-time investments into property, right?
Okay, maybe not, but we definitely want to do some research and have some sort of strategy for buying real estate. So, what governs the prices of homes, and do we need to do the research to find this out? Generally, no, you don't really need to do that digging; the factors are well known. Home prices are governed by three major influences: the economy, interest rates, and demographics. Those are the big three for real estate value in general. Now, of course, if you're buying land, various other things matter: how level is it, will we need to do some work to the land before we can actually lay a foundation, how is the drainage, and so on. If there is a house, then we have even more factors, like the roof, windows, heating/AC, floors, foundation, and so on. We can begin to consider these factors later, but first we'll start at the macro level. You will see how quickly our data sets inflate as it is; it'll blow up fast.
So, our first step is simply to collect the data. Quandl still represents a great place to start, but this time let's automate the data grabbing. We're going to pull housing data for the 50 states first, and then we'll aim to gather other data as well. We definitely don't want to be pulling this data manually. First, if you do not already have an account, you need to get one. This will give you an API key and unlimited API requests to the free data, which is awesome.
Once you create an account, go to your account / me, or whatever they are calling it at the time, and find the section marked API key. That's your key, and you will need it. Next, we want to grab the Quandl module. We don't strictly need the module to make requests at all, but it's a very small module, and the slight ease it gives us is worth the size, so we might as well. Open up your terminal/cmd.exe and do: pip install quandl
(again, remember to specify the full path to pip if pip is not recognized).
Next, we're ready to rumble. Open up a new editor. To start:
# note: the package installed by 'pip install quandl' is imported in all lowercase
import quandl

# Not necessary, I just do this so I do not show my API key.
api_key = open('quandlapikey.txt','r').read()

df = quandl.get("FMAC/HPI_TX", authtoken=api_key)
print(df.head())
You can just store a plain-text version of your key directly in your script if you want; I have only hidden mine since this is a tutorial that I am publishing. This is all we need to do to grab the housing price index for Texas. The actual ticker that we grab can be found on any dataset's page: just click on the library you are using in the sidebar, Python in our case, and up will pop the query you need to type.
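By the way, if you do go the text-file route, creating that file is a one-time job. A minimal sketch, assuming you keep the quandlapikey.txt filename used above ('YOUR_API_KEY_HERE' is a placeholder for your actual key):

# One-time setup: stash your API key in a local text file so it never
# has to appear in the scripts you share or publish.
with open('quandlapikey.txt','w') as f:
    f.write('YOUR_API_KEY_HERE')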
As you carry on with your data science career, you will learn various conventions that exist simply because people are logical and reasonable. In our case, we need to grab the data for all of the states. How might we do this? Do we still need to grab every ticker manually? No. Looking at this ticker, FMAC/HPI_TX, we can easily decipher it: FMAC = Freddie Mac, HPI = House Price Index, and TX is Texas, the common 2-letter abbreviation for the state. From here, we can safely assume all of the tickers are built this way, so now we just need a list of the state abbreviations. We Google it and make a choice, like this list of the 50 states. Now what?
We can extract this data in a variety of ways. This is a Pandas tutorial, so if we can use Pandas, we shall. Let's check out read_html from Pandas. It's not being called "experimental" anymore, but I would still label it as experimental. The standard and quality of the other IO modules is very high and dependable; read_html is not quite up to par, but I still say it's very impressive and useful code, and just plain cool. The way it works is that you simply feed in a URL, and Pandas will extract the dataframe-worthy data from tables into dataframes. This means that, unlike the typical readers you will usually use, read_html ends up returning a list of dataframes. It isn't the only Pandas reader that behaves differently, but it is different. First, in order to use read_html, we need html5lib. Open up cmd.exe or your terminal and do: pip install html5lib. Now, we can make our first attempt by doing:
import pandas as pd

fiddy_states = pd.read_html('https://simple.wikipedia.org/wiki/List_of_U.S._states')
print(fiddy_states)
That's more output than I am going to post here, but you get the idea. At least some of this data is what we want, and it looks like the first dataframe is off to a good start. So let's do:
print(fiddy_states[0])
0 1 2 3
0 Abbreviation State name Capital Became a state
1 AL Alabama Montgomery December 14, 1819
2 AK Alaska Juneau January 3, 1959
3 AZ Arizona Phoenix February 14, 1912
4 AR Arkansas Little Rock June 15, 1836
5 CA California Sacramento September 9, 1850
6 CO Colorado Denver August 1, 1876
7 CT Connecticut Hartford January 9, 1788
8 DE Delaware Dover December 7, 1787
9 FL Florida Tallahassee March 3, 1845
10 GA Georgia Atlanta January 2, 1788
11 HI Hawaii Honolulu August 21, 1959
12 ID Idaho Boise July 3, 1890
13 IL Illinois Springfield December 3, 1818
14 IN Indiana Indianapolis December 11, 1816
15 IA Iowa Des Moines December 28, 1846
16 KS Kansas Topeka January 29, 1861
17 KY Kentucky Frankfort June 1, 1792
18 LA Louisiana Baton Rouge April 30, 1812
19 ME Maine Augusta March 15, 1820
20 MD Maryland Annapolis April 28, 1788
21 MA Massachusetts Boston February 6, 1788
22 MI Michigan Lansing January 26, 1837
23 MN Minnesota Saint Paul May 11, 1858
24 MS Mississippi Jackson December 10, 1817
25 MO Missouri Jefferson City August 10, 1821
26 MT Montana Helena November 8, 1889
27 NE Nebraska Lincoln March 1, 1867
28 NV Nevada Carson City October 31, 1864
29 NH New Hampshire Concord June 21, 1788
30 NJ New Jersey Trenton December 18, 1787
31 NM New Mexico Santa Fe January 6, 1912
32 NY New York Albany July 26, 1788
33 NC North Carolina Raleigh November 21, 1789
34 ND North Dakota Bismarck November 2, 1889
35 OH Ohio Columbus March 1, 1803
36 OK Oklahoma Oklahoma City November 16, 1907
37 OR Oregon Salem February 14, 1859
38 PA Pennsylvania Harrisburg December 12, 1787
39 RI Rhode Island Providence May 19, 1790
40 SC South Carolina Columbia May 23, 1788
41 SD South Dakota Pierre November 2, 1889
42 TN Tennessee Nashville June 1, 1796
43 TX Texas Austin December 29, 1845
44 UT Utah Salt Lake City January 4, 1896
45 VT Vermont Montpelier March 4, 1791
46 VA Virginia Richmond June 25, 1788
47 WA Washington Olympia November 11, 1889
48 WV West Virginia Charleston June 20, 1863
49 WI Wisconsin Madison May 29, 1848
50 WY Wyoming Cheyenne July 10, 1890
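Before we slice that up, a quick aside: earlier I said read_html returns a list of dataframes rather than a single one. If you want to see it for yourself, here is a throwaway sanity check (just a sketch, using the same URL as above):

import pandas as pd

# Same call as before - read_html hands back a plain Python list.
fiddy_states = pd.read_html('https://simple.wikipedia.org/wiki/List_of_U.S._states')

print(type(fiddy_states))     # <class 'list'>
print(len(fiddy_states))      # however many tables Pandas found on the page
print(type(fiddy_states[0]))  # <class 'pandas.core.frame.DataFrame'>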
Back to the task: that first dataframe is looking good, and we want column 0. So, we want to iterate through column 0 of fiddy_states[0]. Remember, right now fiddy_states is a list of dataframes, and fiddy_states[0] is the first dataframe. To reference column 0, then, we do fiddy_states[0][0]. The first index is a list index, which returns a dataframe; the second is a column within that dataframe. Next, we notice the first item in column 0 is the word "Abbreviation," which we don't want. We can skip it when we iterate by slicing column 0 with [1:]. As such, our list of abbreviations is fiddy_states[0][0][1:], and we can iterate like so:
for abbv in fiddy_states[0][0][1:]:
    print(abbv)
AL
AK
AZ
AR
CA
CO
CT
DE
FL
GA
HI
ID
IL
IN
IA
KS
KY
LA
ME
MD
MA
MI
MN
MS
MO
MT
NE
NV
NH
NJ
NM
NY
NC
ND
OH
OK
OR
PA
RI
SC
SD
TN
TX
UT
VT
VA
WA
WV
WI
WY
Perfect! Now we remember why we're doing this in the first place: we are trying to build the tickers with the state abbreviations to acquire the housing price index for each state. Okay, we can build the ticker:
for abbv in fiddy_states[0][0][1:]:
    #print(abbv)
    print("FMAC/HPI_" + str(abbv))
FMAC/HPI_AL
FMAC/HPI_AK
FMAC/HPI_AZ
FMAC/HPI_AR
FMAC/HPI_CA
FMAC/HPI_CO
FMAC/HPI_CT
FMAC/HPI_DE
FMAC/HPI_FL
FMAC/HPI_GA
FMAC/HPI_HI
FMAC/HPI_ID
FMAC/HPI_IL
FMAC/HPI_IN
FMAC/HPI_IA
FMAC/HPI_KS
FMAC/HPI_KY
FMAC/HPI_LA
FMAC/HPI_ME
FMAC/HPI_MD
FMAC/HPI_MA
FMAC/HPI_MI
FMAC/HPI_MN
FMAC/HPI_MS
FMAC/HPI_MO
FMAC/HPI_MT
FMAC/HPI_NE
FMAC/HPI_NV
FMAC/HPI_NH
FMAC/HPI_NJ
FMAC/HPI_NM
FMAC/HPI_NY
FMAC/HPI_NC
FMAC/HPI_ND
FMAC/HPI_OH
FMAC/HPI_OK
FMAC/HPI_OR
FMAC/HPI_PA
FMAC/HPI_RI
FMAC/HPI_SC
FMAC/HPI_SD
FMAC/HPI_TN
FMAC/HPI_TX
FMAC/HPI_UT
FMAC/HPI_VT
FMAC/HPI_VA
FMAC/HPI_WA
FMAC/HPI_WV
FMAC/HPI_WI
FMAC/HPI_WY
We've got the tickers, and now we're ready to pull the dataframes. But then, what will we even do once we have them? Are we going to work with 50 separate dataframes? That sounds like a silly idea; we need some way to combine them. The good people behind Pandas saw that coming and have provided us with many ways to combine dataframes. We'll be talking about that in the next tutorial.
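As a little preview of where this is headed, below is one possible shape for that process. Treat it as a rough sketch rather than the method we'll settle on: it assumes each FMAC/HPI dataset comes back as a single-column dataframe, and the column rename and the use of join are placeholder choices of mine until we cover combining dataframes properly.

import quandl
import pandas as pd

api_key = open('quandlapikey.txt','r').read()

fiddy_states = pd.read_html('https://simple.wikipedia.org/wiki/List_of_U.S._states')

main_df = pd.DataFrame()

for abbv in fiddy_states[0][0][1:]:
    query = "FMAC/HPI_" + str(abbv)
    df = quandl.get(query, authtoken=api_key)
    # Rename the (assumed single) value column to the state abbreviation
    # so the 50 columns don't collide once they're combined.
    df.columns = [str(abbv)]

    if main_df.empty:
        main_df = df
    else:
        # join aligns on the shared date index, adding one column per state
        main_df = main_df.join(df)

print(main_df.head())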