Visualizing Correlation Table - Data Analysis with Python 3 and Pandas




Practical Data Analysis 4

Hello and welcome to part 4 of the data analysis with Python and Pandas series. We're going to be continuing our work with the minimum wage dataset and our correlation table. Where we left off:

import pandas as pd
import numpy as np

df = pd.read_csv("datasets/minwage.csv")

act_min_wage = pd.DataFrame()

for name, group in df.groupby("State"):
    if act_min_wage.empty:
        act_min_wage = group.set_index("Year")[["Low.2018"]].rename(columns={"Low.2018":name})
    else:
        act_min_wage = act_min_wage.join(group.set_index("Year")[["Low.2018"]].rename(columns={"Low.2018":name}))

act_min_wage.head()

min_wage_corr = act_min_wage.replace(0, np.nan).dropna(axis=1).corr()

min_wage_corr.head()
Alaska Arkansas California Colorado Connecticut Delaware District of Columbia Federal (FLSA) Guam Hawaii ... Pennsylvania Puerto Rico Rhode Island South Dakota Utah Vermont Washington West Virginia Wisconsin Wyoming
Alaska 1.000000 -0.377934 0.717653 -0.129332 0.680886 0.258997 0.321785 0.637679 0.787915 0.672620 ... 0.610814 -0.038118 0.652353 -0.326316 -0.020462 0.663558 0.411593 0.044814 0.702570 0.727932
Arkansas -0.377934 1.000000 -0.234367 0.135749 0.047580 0.016125 0.266889 0.117245 0.039593 0.204801 ... 0.159923 0.232186 0.003498 0.800116 0.194680 0.087429 -0.072343 0.420819 0.000470 -0.250592
California 0.717653 -0.234367 1.000000 0.483313 0.876215 0.479197 0.596865 0.371966 0.492052 0.519241 ... 0.429061 0.512712 0.780916 -0.036787 0.392898 0.877922 0.754085 0.371765 0.584067 0.722617
Colorado -0.129332 0.135749 0.483313 1.000000 0.402020 0.566304 0.673371 -0.232035 -0.192616 0.069800 ... -0.136195 0.657364 0.429852 0.399137 0.622330 0.448485 0.612637 0.533623 0.011501 0.130053
Connecticut 0.680886 0.047580 0.876215 0.402020 1.000000 0.552613 0.652488 0.487750 0.632073 0.621503 ... 0.531769 0.626712 0.802485 0.105707 0.302538 0.898469 0.715691 0.400099 0.585790 0.814971

5 rows × 39 columns

Now, we can graph this with matplotlib. If you do not have it, you need to do a pip install matplotlib. Matplotlib has a nifty graphing function called matshow that we can use:

import matplotlib.pyplot as plt

plt.matshow(min_wage_corr)
plt.show()
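If you want a quick legend for what the colors mean, you can hang a colorbar on the plot. This is an optional extra, not part of the original walkthrough, just a minimal sketch using the same min_wage_corr dataframe:

import matplotlib.pyplot as plt

plt.matshow(min_wage_corr)
plt.colorbar()  # adds a color scale so you can read off the correlation values
plt.show()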

It wouldn't be Matplotlib, however, if we didn't need to do some customization.

Again, I will just do the customization. If you would like to learn more about Matplotlib, check out the data visualization series.

import matplotlib.pyplot as plt

labels = [c[:2] for c in min_wage_corr.columns]  # get abbv state names.

fig = plt.figure(figsize=(12,12))  # figure so we can add axis
ax = fig.add_subplot(111)  # define axis, so we can modify
ax.matshow(min_wage_corr, cmap=plt.cm.RdYlGn)  # display the matrix
ax.set_xticks(np.arange(len(labels)))  # show them all!
ax.set_yticks(np.arange(len(labels)))  # show them all!
ax.set_xticklabels(labels)  # set to be the abbv (vs useless #)
ax.set_yticklabels(labels)  # set to be the abbv (vs useless #)

plt.show()

Our simple abbreviations aren't cutting it. We need something better. A quick Google search turned up https://www.infoplease.com/state-abbreviations-and-state-postal-codes, which contains a table of state abbreviations and postal codes.

Guess what can read tables from the internet? Pandas can! You can use pd.read_html(URL), and pandas will search the page for any tables and return them as a list of dataframes. Just remember: pd.read_html returns a list of dfs, not a single df.

# pip install lxml html5lib bs4
# on mac, run: /Applications/Python\ 3.7/Install\ Certificates.command


import requests

web = requests.get("https://www.infoplease.com/state-abbreviations-and-state-postal-codes")
dfs = pd.read_html(web.text)


import pandas as pd

# https://www.infoplease.com/state-abbreviations-and-state-postal-codes

dfs = pd.read_html("https://www.infoplease.com/state-abbreviations-and-state-postal-codes")
for df in dfs:
    print(df.head())  # one is states, the other territory
  State/District Abbreviation Postal Code
0        Alabama         Ala.          AL
1         Alaska       Alaska          AK
2        Arizona        Ariz.          AZ
3       Arkansas         Ark.          AR
4     California       Calif.          CA
  Territory/Associate Abbreviation Postal Code
0      American Samoa          NaN          AS
1                Guam         Guam          GU
2    Marshall Islands          NaN          MH
3          Micronesia          NaN          FM
4   Northern Marianas          NaN          MP
state_abbv = dfs[0]

state_abbv.head()
State/District Abbreviation Postal Code
0 Alabama Ala. AL
1 Alaska Alaska AK
2 Arizona Ariz. AZ
3 Arkansas Ark. AR
4 California Calif. CA

Sources often disable access or simply disappear, so I want to save this dataframe, both for myself and to share with you all, in case they stop allowing robot access! Saving a dataframe in pandas is easy:

state_abbv.to_csv("datasets/state_abbv.csv")

Bring back:

state_abbv = pd.read_csv("datasets/state_abbv.csv")
state_abbv.head()
Unnamed: 0 State/District Abbreviation Postal Code
0 0 Alabama Ala. AL
1 1 Alaska Alaska AK
2 2 Arizona Ariz. AZ
3 3 Arkansas Ark. AR
4 4 California Calif. CA

So what happened? Well, we saved and loaded with the index, which created duplication. A CSV file has no concept of an index, so pandas loads all of the data in as columns by default and then assigns a fresh index. We could instead save without the index, save only specific columns, or specify an index column on load. For example, this time, let's save just the specific columns we're after:

state_abbv[["State/District", "Postal Code"]].to_csv("datasets/state_abbv.csv", index=False)  # index in this case is worthless

Then, we can do:

state_abbv = pd.read_csv("datasets/state_abbv.csv", index_col=0)
state_abbv.head()
Postal Code
State/District
Alabama AL
Alaska AK
Arizona AZ
Arkansas AR
California CA

Any time you're unsure about what you can do, check out the Pandas docs. They are great to scroll through, both to learn more about what you can do and to discover parameters and methods you might not otherwise realize exist.

For example, while we're mainly working with CSVs here, we can work with many other formats: SQL, JSON, HDF5, BigQuery, and much, much more!
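As a quick illustration, here's a minimal sketch of round-tripping the same dataframe through JSON instead of CSV (the file name is just an example, not something the original uses):

state_abbv.to_json("datasets/state_abbv.json")  # write the dataframe out as JSON
state_abbv = pd.read_json("datasets/state_abbv.json")  # ...and read it back in
state_abbv.head()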

Back to the task at hand, we are trying to use the Postal Codes for our abbreviations:

abbv_dict = state_abbv.to_dict()
abbv_dict
{'Postal Code': {'Alabama': 'AL',
  'Alaska': 'AK',
  'Arizona': 'AZ',
  'Arkansas': 'AR',
  'California': 'CA',
  'Colorado': 'CO',
  'Connecticut': 'CT',
  'Delaware': 'DE',
  'District of Columbia': 'DC',
  'Florida': 'FL',
  'Georgia': 'GA',
  'Hawaii': 'HI',
  'Idaho': 'ID',
  'Illinois': 'IL',
  'Indiana': 'IN',
  'Iowa': 'IA',
  'Kansas': 'KS',
  'Kentucky': 'KY',
  'Louisiana': 'LA',
  'Maine': 'ME',
  'Maryland': 'MD',
  'Massachusetts': 'MA',
  'Michigan': 'MI',
  'Minnesota': 'MN',
  'Mississippi': 'MS',
  'Missouri': 'MO',
  'Montana': 'MT',
  'Nebraska': 'NE',
  'Nevada': 'NV',
  'New Hampshire': 'NH',
  'New Jersey': 'NJ',
  'New Mexico': 'NM',
  'New York': 'NY',
  'North Carolina': 'NC',
  'North Dakota': 'ND',
  'Ohio': 'OH',
  'Oklahoma': 'OK',
  'Oregon': 'OR',
  'Pennsylvania': 'PA',
  'Rhode Island': 'RI',
  'South Carolina': 'SC',
  'South Dakota': 'SD',
  'Tennessee': 'TN',
  'Texas': 'TX',
  'Utah': 'UT',
  'Vermont': 'VT',
  'Virginia': 'VA',
  'Washington': 'WA',
  'West Virginia': 'WV',
  'Wisconsin': 'WI',
  'Wyoming': 'WY'}}

We can see here that it's a dict whose single key maps to the dict we actually want, so we can pull out the inner mapping with:

abbv_dict = abbv_dict['Postal Code']
abbv_dict
{'Alabama': 'AL',
 'Alaska': 'AK',
 'Arizona': 'AZ',
 'Arkansas': 'AR',
 'California': 'CA',
 'Colorado': 'CO',
 'Connecticut': 'CT',
 'Delaware': 'DE',
 'District of Columbia': 'DC',
 'Florida': 'FL',
 'Georgia': 'GA',
 'Hawaii': 'HI',
 'Idaho': 'ID',
 'Illinois': 'IL',
 'Indiana': 'IN',
 'Iowa': 'IA',
 'Kansas': 'KS',
 'Kentucky': 'KY',
 'Louisiana': 'LA',
 'Maine': 'ME',
 'Maryland': 'MD',
 'Massachusetts': 'MA',
 'Michigan': 'MI',
 'Minnesota': 'MN',
 'Mississippi': 'MS',
 'Missouri': 'MO',
 'Montana': 'MT',
 'Nebraska': 'NE',
 'Nevada': 'NV',
 'New Hampshire': 'NH',
 'New Jersey': 'NJ',
 'New Mexico': 'NM',
 'New York': 'NY',
 'North Carolina': 'NC',
 'North Dakota': 'ND',
 'Ohio': 'OH',
 'Oklahoma': 'OK',
 'Oregon': 'OR',
 'Pennsylvania': 'PA',
 'Rhode Island': 'RI',
 'South Carolina': 'SC',
 'South Dakota': 'SD',
 'Tennessee': 'TN',
 'Texas': 'TX',
 'Utah': 'UT',
 'Vermont': 'VT',
 'Virginia': 'VA',
 'Washington': 'WA',
 'West Virginia': 'WV',
 'Wisconsin': 'WI',
 'Wyoming': 'WY'}
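As a side note, since state_abbv is indexed by State/District with a single Postal Code column, you could also skip the nested dict entirely and pull the mapping straight off the column. Same result, just a different route:

abbv_dict = state_abbv["Postal Code"].to_dict()  # Series.to_dict() gives {index: value} directly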

Now we can re-do our labels with:

labels = [abbv_dict[c] for c in min_wage_corr.columns]  # get abbv state names.
-----------------------------------------------------------------
KeyError                        Traceback (most recent call last)
<ipython-input-18-d53be929efcb> in <module>()
----> 1 labels = [abbv_dict[c] for c in min_wage_corr.columns]  # get abbv state names.

<ipython-input-18-d53be929efcb> in <listcomp>(.0)
----> 1 labels = [abbv_dict[c] for c in min_wage_corr.columns]  # get abbv state names.

KeyError: 'Federal (FLSA)'

Okay. Fine, we have to hack this one in ourselves!

abbv_dict['Federal (FLSA)'] = "FLSA"
labels = [abbv_dict[c] for c in min_wage_corr.columns]  # get abbv state names.
-----------------------------------------------------------------
KeyError                        Traceback (most recent call last)
<ipython-input-20-d53be929efcb> in <module>()
----> 1 labels = [abbv_dict[c] for c in min_wage_corr.columns]  # get abbv state names.

<ipython-input-20-d53be929efcb> in <listcomp>(.0)
----> 1 labels = [abbv_dict[c] for c in min_wage_corr.columns]  # get abbv state names.

KeyError: 'Guam'

Hmm, we might have to revisit the territories, but:

abbv_dict['Guam'] = "GU"
abbv_dict['Puerto Rico'] = "PR"
labels = [abbv_dict[c] for c in min_wage_corr.columns]  # get abbv state names.
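If you'd rather not hard-code each missing key as it errors out, one alternative sketch (not what we do above) is to fall back to the first two letters for anything missing from the dict:

labels = [abbv_dict.get(c, c[:2]) for c in min_wage_corr.columns]  # postal code if we have it, else first two letters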

Okay good enough! Back to our graph now!

fig = plt.figure(figsize=(12,12))  # figure so we can add axis
ax = fig.add_subplot(111)  # define axis, so we can modify
ax.matshow(min_wage_corr, cmap=plt.cm.RdYlGn)  # display the matrix
ax.set_xticks(np.arange(len(labels)))  # show them all!
ax.set_yticks(np.arange(len(labels)))  # show them all!
ax.set_xticklabels(labels)  # set to be the abbv (vs useless #)
ax.set_yticklabels(labels)  # set to be the abbv (vs useless #)

plt.show()
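If you want to keep a copy of the figure (purely optional, and the file name here is just an example), you can save it to disk with savefig; in a script, you'd typically call this before plt.show():

fig.savefig("min_wage_corr_matrix.png", dpi=150)  # write the correlation matrix figure to disk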

Cool! We've covered quite a bit again, but hopefully that was interesting, and we got to begin combining datasets, if only to inform our column names.

... but more cool things happen when we combine datasets for their actual data! While correlation is not causation, we can still glean interesting things from it! Plus... we get to make cool graphs, so why not? In the next tutorial, we're going to explore the relationship of minimum wage to unemployment, and maybe even toss in the political persuasions of those states while we're at it!

The next tutorial: Combining multiple datasets - Data Analysis with Python 3 and Pandas

  • Introduction - Data Analysis with Python 3 and Pandas
  • Graphing/visualization - Data Analysis with Python 3 and Pandas
  • Groupby - Data Analysis with Python 3 and Pandas
  • Visualizing Correlation Table - Data Analysis with Python 3 and Pandas
  • Combining multiple datasets - Data Analysis with Python 3 and Pandas
  • Machine Learning with Scikit-learn - Data Analysis with Python 3 and Pandas