Hello and welcome to part 5 of the data analysis with Python and Pandas series. In this tutorial, we're going to build off the lasdt and experiment with combining datasets to see if we can find more relationships.

We've been working with the Kaggle Minimum Wage by State dataset, and now I am curious to bring in the Unemployment by County dataset. Sometimes, it's argued that increasing minimum wage makes it so that employing people is more costly, and, as a result, employment may drop.

import pandas as pd

unemp_county = pd.read_csv("datasets/unemployment-by-county-us/output.csv")
unemp_county.head()

Now, we'd want to map the minimum wage by state to this. In many states, there are varying minimum wages by city. We'll just have to accept that this isn't going to be perfect. Let's now load in our minimum wage data:

df = pd.read_csv("datasets/minwage.csv")

act_min_wage = pd.DataFrame()

for name, group in df.groupby("State"):
    if act_min_wage.empty:
        act_min_wage = group.set_index("Year")[["Low.2018"]].rename(columns={"Low.2018":name})
    else:
        act_min_wage = act_min_wage.join(group.set_index("Year")[["Low.2018"]].rename(columns={"Low.2018":name}))

act_min_wage.head()

Oh right, missing data. Let's throw this out from our analysis:

import numpy as np

act_min_wage = act_min_wage.replace(0, np.NaN).dropna(axis=1)
act_min_wage.head()

So now...what do we do? This seems like it might be a bit more complex than the other things we've done.

Well, our very end goal is to see if there's any relationship between unemployment and the minimum wage. To do this, we'd like to be able to just call .corr or .cov on some dataframe. Seems to me like it'd be most convenient then to do this on our unemp_county dataframe. We'd really just like to add a new column, called minimum wage, then populate that column based on that ROW's state!

Logically, we're going to need to go, row by row, check the state, and set the minimum wage column to that state's value in the other dataframe. It's often going to be the case that you have some custom task like this. In this case, we're mapping some values from one dataframe to another, but maybe another time it wont even be a dataframe to another, it could be some sensor value, some sort of custom user input, or even something that will require even further calculations on. Who knows, but Pandas is extremely flexible and we can map functions to columns, based on row values. Let's see how!

First, let's just create a function that would handle this:

def get_min_wage(year, state):
    try:
        return act_min_wage.loc[year][state]
    except:
        return np.NaN

Then for example we could do:

get_min_wage(2012, "Colorado")

8.33

Now, we map!

%%time
# time will give us the total time to perform some cell's operation.

unemp_county['min_wage'] = list(map(get_min_wage, unemp_county['Year'], unemp_county['State']))

CPU times: user 1min 27s, sys: 126 ms, total: 1min 27s
Wall time: 1min 27s

We can use this method to map just about any function with as many parameters as we want to a column. This method will basically always work, but wont necessarily be the most efficient. Often, we can use .map or .apply insted to a column, or some other built-in methods, but the above is always an option.

unemp_county.head()

unemp_county.tail()

unemp_county[['Rate','min_wage']].corr()

unemp_county[['Rate','min_wage']].cov()

Interesting. It looks like there's a slightly positive relationship (correlation) between the unemployment rate and minimum wage, but also a pretty strong covariance, signaling to us that these two things do tend to vary together. It just looks like, while they definitely vary together, the actual impact of one on the other isn't very substantial. Plus, we'd have to ask next which comes first. The increased unemployment, or the minimum wage increases.

Finally, I'd like to look at election data by county and see if there's a relationship between voting, minimum wage, and unemployment. To do this, I will pull from 2016 US presidential vote by county.

pres16 = pd.read_csv("datasets/pres16results.csv")
pres16.head(15)

This data starts with the entire US aggregate data, but then breaks down by state and county, as well as candidate. Let's include the top 10 candidates. To grab their names:

top_candidates = pres16.head(10)['cand'].values
print(top_candidates)

['Donald Trump' 'Hillary Clinton' 'Gary Johnson' 'Jill Stein'
 'Evan McMullin' 'Darrell Castle' 'Gloria La Riva' 'Rocky De La Fuente'
 ' None of these candidates' 'Richard Duncan']

#county_2015 = unemp_county[ (unemp_county['Year']==2015 and unemp_county["Month"]=="February") ]



county_2015 = unemp_county[ (unemp_county['Year']==2015) & (unemp_county["Month"]=="February")]

county_2015.head()

Now, for county_2015, we'd like to convert the State to the all-caps abbreviation that our pres16 is using. We can do that using our abbreviations that we used before:

state_abbv = pd.read_csv("datasets/state_abbv.csv", index_col=0)
state_abbv.head()

state_abbv_dict = state_abbv.to_dict()['Postal Code']

county_2015['State'] = county_2015['State'].map(state_abbv_dict)

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.

county_2015.tail()

Well that was magical, wasn't it?!

In the case of singe-parmeter functions, we can just use a .map. Or...as you just saw here, if you want to map a key to a value using a dict, you can do the same thing, and just say you want to map the dictionary. Cool, huh?

Now let's map the county's candidate percentages to this. To do this, we have quite a few columns, and really, everything just needs to match up by county and state.

print(len(county_2015))
print(len(pres16))

2802
18475

Since pres16 is longer, we'll map that to county_15, where there are matches. Instead of a map, however, we'll combine with a join. To do this, let's index both of these. They are indexed by state AND county. So, we'll name these both the same, and then index as such.

pres16.rename(columns={"county": "County", "st": "State"}, inplace=True)
pres16.head()

for df in [county_2015, pres16]:
    df.set_index(["County", "State"], inplace=True)

pres16 = pres16[pres16['cand']=="Donald Trump"]
pres16 = pres16[['pct']]
pres16.dropna(inplace=True)
pres16.head(2)

county_2015.head(2)

all_together = county_2015.merge(pres16, on=["County", "State"])
all_together.dropna(inplace=True)
all_together.drop("Year", axis=1, inplace=True)
all_together.head()

all_together.corr()

all_together.cov()

Curiously, min_wage appears to have a negative correlation with the pct vote for Trump, so as minimum wage rises, people are less likely to vote for Trump, for example. That makes sense since higher minimum wages are typically a Democratic agenda, though it is fairly curious that, while correlated, they do not attempt to vary together. Interesting.

Alright, I think that, at this point, we've covered quite a bit about Pandas. There will always be more and more to show, but, for that, I would suggest you just read through the docs and begin to just try doing things yourself.

The next thing that I would like to show is using Pandas as your pre-processing software for data for machine learning, which is what we're going to be doing in the next tutorial!

	Alabama	Alaska	Arizona	Arkansas	California	Colorado	Connecticut	Delaware	District of Columbia	Federal (FLSA)	...	Tennessee	Texas	U.S. Virgin Islands	Utah	Vermont	Virginia	Washington	West Virginia	Wisconsin	Wyoming
Year
1968	0.0	15.12	3.37	1.12	11.88	7.20	10.08	9.00	9.00	8.28	...	0.0	0.00	NaN	7.20	10.08	0.0	11.52	7.20	9.00	8.64
1969	0.0	14.33	3.19	1.07	11.26	6.83	9.56	8.53	8.53	7.85	...	0.0	0.00	NaN	6.83	9.56	0.0	10.92	6.83	8.53	8.19
1970	0.0	13.54	3.02	7.09	10.64	6.45	10.32	8.06	10.32	8.38	...	0.0	0.00	NaN	6.45	10.32	0.0	10.32	6.45	8.38	8.38
1971	0.0	12.99	2.89	6.80	10.20	6.18	9.89	7.73	9.89	8.04	...	0.0	0.00	NaN	6.18	9.89	0.0	9.89	6.18	8.04	8.04
1972	0.0	12.57	2.80	7.19	9.88	5.99	11.08	9.58	9.58	9.58	...	0.0	8.38	NaN	7.19	9.58	0.0	9.58	7.19	8.68	8.98

	Alaska	Arkansas	California	Colorado	Connecticut	Delaware	District of Columbia	Federal (FLSA)	Guam	Hawaii	...	Pennsylvania	Puerto Rico	Rhode Island	South Dakota	Utah	Vermont	Washington	West Virginia	Wisconsin	Wyoming
Year
1968	15.12	1.12	11.88	7.20	10.08	9.00	9.00	8.28	9.00	9.00	...	8.28	3.10	10.08	3.06	7.20	10.08	11.52	7.20	9.00	8.64
1969	14.33	1.07	11.26	6.83	9.56	8.53	8.53	7.85	8.53	8.53	...	7.85	2.94	9.56	2.90	6.83	9.56	10.92	6.83	8.53	8.19
1970	13.54	7.09	10.64	6.45	10.32	8.06	10.32	8.38	10.32	10.32	...	8.38	2.77	10.32	6.45	6.45	10.32	10.32	6.45	8.38	8.38
1971	12.99	6.80	10.20	6.18	9.89	7.73	9.89	8.04	9.89	9.89	...	8.04	2.66	9.89	6.18	6.18	9.89	9.89	6.18	8.04	8.04
1972	12.57	7.19	9.88	5.99	11.08	9.58	9.58	9.58	11.38	9.58	...	9.58	3.89	9.58	5.99	7.19	9.58	9.58	7.19	8.68	8.98

	Year	Month	State	County	Rate	min_wage
885543	2009	November	Maine	Somerset County	10.5	8.46
885544	2009	November	Maine	Oxford County	10.5	8.46
885545	2009	November	Maine	Knox County	7.5	8.46
885546	2009	November	Maine	Piscataquis County	11.3	8.46
885547	2009	November	Maine	Aroostook County	9.0	8.46

	Postal Code
State/District
Alabama	AL
Alaska	AK
Arizona	AZ
Arkansas	AR
California	CA

	Year	Month	State	County	Rate	min_wage
2797	2015	February	ME	Somerset County	8.4	7.92
2798	2015	February	ME	Oxford County	6.8	7.92
2799	2015	February	ME	Knox County	6.1	7.92
2800	2015	February	ME	Piscataquis County	7.0	7.92
2801	2015	February	ME	Aroostook County	7.2	7.92

	Year	Month	State	County	Rate
0	2015	February	Mississippi	Newton County	6.1
1	2015	February	Mississippi	Panola County	9.4
2	2015	February	Mississippi	Monroe County	7.9
3	2015	February	Mississippi	Hinds County	6.1
4	2015	February	Mississippi	Kemper County	10.6

	county	fips	cand	st	pct_report	votes	total_votes	pct	lead
0	NaN	US	Donald Trump	US	0.9951	60350241.0	127592176.0	0.472993	Donald Trump
1	NaN	US	Hillary Clinton	US	0.9951	60981118.0	127592176.0	0.477938	Donald Trump
2	NaN	US	Gary Johnson	US	0.9951	4164589.0	127592176.0	0.032640	Donald Trump
3	NaN	US	Jill Stein	US	0.9951	1255968.0	127592176.0	0.009844	Donald Trump
4	NaN	US	Evan McMullin	US	0.9951	451636.0	127592176.0	0.003540	Donald Trump
5	NaN	US	Darrell Castle	US	0.9951	180877.0	127592176.0	0.001418	Donald Trump
6	NaN	US	Gloria La Riva	US	0.9951	48308.0	127592176.0	0.000379	Donald Trump
7	NaN	US	Rocky De La Fuente	US	0.9951	32120.0	127592176.0	0.000252	Donald Trump
8	NaN	US	None of these candidates	US	0.9951	28824.0	127592176.0	0.000226	Donald Trump
9	NaN	US	Richard Duncan	US	0.9951	23501.0	127592176.0	0.000184	Donald Trump
10	NaN	US	Dan Vacek	US	0.9951	13546.0	127592176.0	0.000106	Donald Trump
11	NaN	US	Alyson Kennedy	US	0.9951	11456.0	127592176.0	0.000090	Donald Trump
12	NaN	US	Mike Smith	US	0.9951	8911.0	127592176.0	0.000070	Donald Trump
13	NaN	US	Chris Keniston	US	0.9951	6617.0	127592176.0	0.000052	Donald Trump
14	NaN	US	Lynn Kahn	US	0.9951	5565.0	127592176.0	0.000044	Donald Trump

		Month	Rate	min_wage	pct
County	State
Major County	OK	February	2.6	2.11	0.864960
Pottawatomie County	OK	February	4.5	2.11	0.701342
Johnston County	OK	February	6.5	2.11	0.770057
Jefferson County	OK	February	5.0	2.11	0.812367
Beaver County	OK	February	2.8	2.11	0.888243

	Rate	min_wage	pct
Rate	1.000000	0.186689	-0.085985
min_wage	0.186689	1.000000	-0.325036
pct	-0.085985	-0.325036	1.000000

	Rate	min_wage	pct
Rate	5.743199	0.683870	-0.031771
min_wage	0.683870	2.336451	-0.076602
pct	-0.031771	-0.076602	0.023772

	Rate	min_wage
Rate	9.687873	0.651586
min_wage	0.651586	1.874228

Combining multiple datasets - Data Analysis with Python 3 and Pandas