Hello and welcome to part 4 of the data analysis with Python and Pandas series. We're going to be continuing our work with the minimum wage dataset and our correlation table. Where we left off:
import pandas as pd
import numpy as np
df = pd.read_csv("datasets/minwage.csv")
act_min_wage = pd.DataFrame()
for name, group in df.groupby("State"):
    if act_min_wage.empty:
        act_min_wage = group.set_index("Year")[["Low.2018"]].rename(columns={"Low.2018": name})
    else:
        act_min_wage = act_min_wage.join(group.set_index("Year")[["Low.2018"]].rename(columns={"Low.2018": name}))

act_min_wage.head()
min_wage_corr = act_min_wage.replace(0, np.nan).dropna(axis=1).corr()
min_wage_corr.head()
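Before we graph it, a quick aside on what you can do with a correlation table like this (just a sketch with made-up toy data, not part of the series' dataset): if you unstack the matrix and drop the diagonal of self-correlations, you can sort to find the most- and least-correlated pairs of columns.

```python
import pandas as pd
import numpy as np

# Toy stand-in for min_wage_corr: a small symmetric correlation matrix.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.random((20, 4)), columns=["CO", "CT", "DE", "FL"])
corr = data.corr()

# Flatten to (row, col) -> value, drop the self-correlations on the diagonal,
# then sort to see the weakest and strongest relationships.
pairs = corr.unstack()
pairs = pairs[pairs.index.get_level_values(0) != pairs.index.get_level_values(1)]
print(pairs.sort_values().head())   # least-correlated pairs
print(pairs.sort_values().tail())   # most-correlated pairs
```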
Now, we can graph this with Matplotlib. If you do not have it, you need to do a pip install matplotlib. Matplotlib has a nifty graphing function called matshow that we can use:
import matplotlib.pyplot as plt
plt.matshow(min_wage_corr)
plt.show()
It wouldn't be Matplotlib, however, if we didn't need to do some customization.
Again, I will just do the customization. If you would like to learn more about Matplotlib, check out the data visualization series.
import matplotlib.pyplot as plt
labels = [c[:2] for c in min_wage_corr.columns] # get abbv state names.
fig = plt.figure(figsize=(12,12)) # figure so we can add axis
ax = fig.add_subplot(111) # define axis, so we can modify
ax.matshow(min_wage_corr, cmap=plt.cm.RdYlGn) # display the matrix
ax.set_xticks(np.arange(len(labels))) # show them all!
ax.set_yticks(np.arange(len(labels))) # show them all!
ax.set_xticklabels(labels) # set to be the abbv (vs useless #)
ax.set_yticklabels(labels) # set to be the abbv (vs useless #)
plt.show()
Our simple abbreviations aren't cutting it. We need something better. A quick Google search found me https://www.infoplease.com/state-abbreviations-and-state-postal-codes, which contains a table.
Guess what can read tables from the internet? Pandas can! You can use pd.read_html(URL), and Pandas will search the page for any tables and populate a list of dataframes with them. Just remember: pd.read_html returns a list of dfs, not just one df.
# pip install lxml html5lib bs4
# on mac, run: /Applications/Python\ 3.7/Install\ Certificates.command
import requests
web = requests.get("https://www.infoplease.com/state-abbreviations-and-state-postal-codes")
dfs = pd.read_html(web.text)
import pandas as pd
# https://www.infoplease.com/state-abbreviations-and-state-postal-codes
dfs = pd.read_html("https://www.infoplease.com/state-abbreviations-and-state-postal-codes")
for df in dfs:
    print(df.head())  # one table is states, the other is territories
state_abbv = dfs[0]
state_abbv.head()
Often sources decide to disable access, or disappear, so I may want to save this dataframe both for myself and to share with you all in case they stop allowing robot access! Saving a dataframe in pandas is easy:
state_abbv.to_csv("datasets/state_abbv.csv")
Bring back:
state_abbv = pd.read_csv("datasets/state_abbv.csv")
state_abbv.head()
So what happened? Well, we saved and loaded with the "index," which has created duplication. A CSV file has no idea about indexes, so pandas will by default just load in all of the data as columns, and then assign a new index. We can do things like saving with no index, we can opt to save specific columns only, and we can load in and specify an index on load. For example, this time, let's save just the specific columns we're after:
state_abbv[["State/District", "Postal Code"]].to_csv("datasets/state_abbv.csv", index=False) # index in this case is worthless
Then, we can do:
state_abbv = pd.read_csv("datasets/state_abbv.csv", index_col=0)
state_abbv.head()
Any time you're unsure about what you can do, you should always check out the Pandas Docs. They are great to just scroll through, just to learn more about what you can do, but also to learn about various parameters and methods that exist that you might otherwise not realize.
For example, while we're mainly working with CSVs here, we can work with many other formats: SQL, json, HDF5, BigQuery, and much much more!
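For instance, the same save-and-load round trip works with JSON. Here's a quick sketch with a toy dataframe (going through a string rather than a file, just to keep it self-contained):

```python
import pandas as pd
from io import StringIO

df = pd.DataFrame({"State": ["Colorado", "Connecticut"],
                   "Postal Code": ["CO", "CT"]})

# DataFrame -> JSON string -> DataFrame; same idea as the CSV round trip.
json_str = df.to_json()
df2 = pd.read_json(StringIO(json_str))
print(df2)
```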
Back to the task at hand, we are trying to use the Postal Codes for our abbreviations:
abbv_dict = state_abbv.to_dict()
abbv_dict
We can see here that it's a dict that maps to the dict we actually want, so we can reference the codes with:
abbv_dict = abbv_dict['Postal Code']
abbv_dict
Now we can re-do our labels with:
labels = [abbv_dict[c] for c in min_wage_corr.columns] # get abbv state names.
That throws a KeyError: 'Federal (FLSA)' isn't a state, so it's not in our abbreviation table. Okay, fine, we have to hack this one in ourselves!
abbv_dict['Federal (FLSA)'] = "FLSA"
labels = [abbv_dict[c] for c in min_wage_corr.columns] # get abbv state names.
Another KeyError, this time for the territories. Hmm, we might have to revisit those properly later, but for now:
abbv_dict['Guam'] = "GU"
abbv_dict['Puerto Rico'] = "PR"
labels = [abbv_dict[c] for c in min_wage_corr.columns] # get abbv state names.
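As an aside, a more defensive way to build these labels (just a sketch, not how we'll do it here) is dict.get with a fallback, which sidesteps the KeyError entirely by keeping the raw name for anything the mapping doesn't know about:

```python
# Toy mapping for illustration; the real one comes from state_abbv.to_dict().
abbv_dict = {"Colorado": "CO", "Connecticut": "CT"}

columns = ["Colorado", "Connecticut", "Federal (FLSA)"]

# dict.get returns the fallback instead of raising KeyError for unknown keys.
labels = [abbv_dict.get(c, c) for c in columns]
print(labels)  # → ['CO', 'CT', 'Federal (FLSA)']
```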
Okay good enough! Back to our graph now!
fig = plt.figure(figsize=(12,12)) # figure so we can add axis
ax = fig.add_subplot(111) # define axis, so we can modify
ax.matshow(min_wage_corr, cmap=plt.cm.RdYlGn) # display the matrix
ax.set_xticks(np.arange(len(labels))) # show them all!
ax.set_yticks(np.arange(len(labels))) # show them all!
ax.set_xticklabels(labels) # set to be the abbv (vs useless #)
ax.set_yticklabels(labels) # set to be the abbv (vs useless #)
plt.show()
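One optional tweak worth knowing about: matshow returns the image object, and passing it to fig.colorbar adds a legend mapping the red-to-green colors back to actual correlation values. A sketch with a toy matrix:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt

# Toy symmetric matrix standing in for min_wage_corr.
corr = np.corrcoef(np.random.rand(5, 20))

fig = plt.figure(figsize=(6, 6))
ax = fig.add_subplot(111)
cax = ax.matshow(corr, cmap=plt.cm.RdYlGn)  # keep the returned AxesImage
fig.colorbar(cax)  # legend for what the colors mean
fig.savefig("corr.png")
```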
Cool! We've covered quite a bit again, but hopefully that was interesting, and we began to combine datasets, if only to inform our column names.
... but more cool things happen when we combine datasets for their actual data! While correlation is not causation, we can still glean interesting things from it! Plus... we get to make cool graphs, so why not? In the next tutorial, we're going to explore the relationship of minimum wage to unemployment, and maybe even toss in the political persuasions of those states while we're at it!