Graphing/visualization - Data Analysis with Python 3 and Pandas

Practical Data Analysis 2

Welcome to part 2 of the data analysis with Python and Pandas tutorials, where we're currently working with a dataset of avocado prices. Soon, we'll find a new dataset, but let's learn a few more things with this one. Where we left off, we were graphing Albany's average price over time, but it was quite messy. Here's a recap:

import pandas as pd

df = pd.read_csv("datasets/avocado.csv")

albany_df = df[df['region']=="Albany"]
albany_df.set_index("Date", inplace=True)

albany_df["AveragePrice"].plot()
<matplotlib.axes._subplots.AxesSubplot at 0x11fd925f8>

So dates are a funky type of data: to the computer they're just strings, but to us they also carry an order. When it comes to dates, we have to help computers out a bit. Luckily for us, Pandas comes with built-in ways to handle dates. First, we need to convert the date column to datetime objects:

df = pd.read_csv("datasets/avocado.csv")

df['Date'] = pd.to_datetime(df['Date'])
albany_df = df[df['region']=="Albany"]
albany_df.set_index("Date", inplace=True)

albany_df["AveragePrice"].plot()
<matplotlib.axes._subplots.AxesSubplot at 0x11fa86828>
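As a quick aside: pd.to_datetime can usually infer the format on its own, but if you already know it (these dates look like 2015-01-04, so year-month-day), you can pass it explicitly, which is unambiguous and tends to be faster on big files. A minimal sketch of that:

df['Date'] = pd.to_datetime(df['Date'], format="%Y-%m-%d")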

Alright, the axis formatting looks better, but that graph is pretty wild! Could we settle it down a bit? We could smooth the data with a rolling average.
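If rolling averages are new to you: .rolling(n) slides a window of n rows down the column, and .mean() averages whatever is in each window. A tiny sketch on some made-up numbers, just to show the mechanics:

import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])
print(s.rolling(3).mean())

The first two results are NaN, because a 3-row window isn't full until the third row; after that we get 2.0, 3.0, and 4.0, the averages of each full window. Keep that NaN behavior in mind; it will come up again shortly.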

To do this, let's make a new column, and apply some smoothing:

albany_df["AveragePrice"].rolling(25).mean().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x1223cc278>

Hmm, so what happened? Pandas understands that a date is a date and will sort the x-axis accordingly, but now I'm wondering if the dataframe itself is sorted. If it's not, that would seriously screw up our moving average calculation. This data may be indexed by date, but is it sorted? Let's see.
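Rather than assuming, we can ask; pandas indexes carry an is_monotonic_increasing attribute:

print(albany_df.index.is_monotonic_increasing)

If that prints False, the rows are not in date order, and the rolling average above was computed over a scrambled sequence. Either way, sorting is cheap insurance: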

albany_df.sort_index(inplace=True)
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.

What's this warning above? Should we be worried? Basically, all it's telling us is that we're doing operations on a copy of a slice of a dataframe, and to watch out because we might not be modifying what we were hoping to modify (like the main df). In this case, we're not trying to work with the main dataframe, so I think this warning is just plain annoying, but whatever. It's just a warning, not an error.

albany_df["AveragePrice"].rolling(25).mean().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x1223ccf98>

And there we have it! A more useful summary of avocado prices for Albany over the years.

Visualizations are cool, but what if we want to save our new, smoother data like the above? We can give it a new column in our dataframe:

albany_df["price25ma"] = albany_df["AveragePrice"].rolling(25).mean()
/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/ipykernel_launcher.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
albany_df.head()
Unnamed: 0 AveragePrice Total Volume 4046 4225 4770 Total Bags Small Bags Large Bags XLarge Bags type year region price25ma
Date
2015-01-04 51 1.22 40873.28 2819.50 28287.42 49.90 9716.46 9186.93 529.53 0.0 conventional 2015 Albany NaN
2015-01-04 51 1.79 1373.95 57.42 153.88 0.00 1162.65 1162.65 0.00 0.0 organic 2015 Albany NaN
2015-01-11 50 1.24 41195.08 1002.85 31640.34 127.12 8424.77 8036.04 388.73 0.0 conventional 2015 Albany NaN
2015-01-11 50 1.77 1182.56 39.00 305.12 0.00 838.44 838.44 0.00 0.0 organic 2015 Albany NaN
2015-01-18 49 1.17 44511.28 914.14 31540.32 135.77 11921.05 11651.09 269.96 0.0 conventional 2015 Albany NaN

All NaNs?! Recall that a 25-row rolling window can't produce an average until it has 25 rows, so the first 24 rows of price25ma are NaN. A perfect example of why tail is useful sometimes...

albany_df.tail()
Unnamed: 0 AveragePrice Total Volume 4046 4225 4770 Total Bags Small Bags Large Bags XLarge Bags type year region price25ma
Date
2018-03-11 2 1.68 2570.52 131.67 229.56 0.00 2209.29 2209.29 0.00 0.0 organic 2018 Albany 1.4224
2018-03-18 1 1.66 3154.45 275.89 297.96 0.00 2580.60 2577.27 3.33 0.0 organic 2018 Albany 1.4316
2018-03-18 1 1.35 105304.65 13234.86 61037.58 55.00 30977.21 26755.90 3721.31 500.0 conventional 2018 Albany 1.4276
2018-03-25 0 1.57 149396.50 16361.69 109045.03 65.45 23924.33 19273.80 4270.53 380.0 conventional 2018 Albany 1.4272
2018-03-25 0 1.71 2321.82 42.95 272.41 0.00 2006.46 1996.46 10.00 0.0 organic 2018 Albany 1.4368
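If those leading NaNs are a problem for you, rolling() also accepts a min_periods argument, which lets it start averaging before the window is full. Something like:

albany_df["price25ma"] = albany_df["AveragePrice"].rolling(25, min_periods=1).mean()

Just be aware that the earliest values will then be averages of fewer points, and therefore noisier.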

That warning sure is annoying though, isn't it? What could we do? A common idea is to just silence all of the warnings. What if we instead made ourselves an explicit copy?

albany_df = df.copy()[df['region']=="Albany"]
albany_df.set_index('Date', inplace=True)
albany_df["price25ma"] = albany_df["AveragePrice"].rolling(25).mean()

This way, we are explicit. Pandas is happy, we're happy. You can ignore pandas warnings, but I would strongly advise against silencing them. Now you know you can do it, but I am not going to show you how :)
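For what it's worth, an equivalent and common idiom is to filter first and then copy just the slice, rather than copying the entire dataframe before slicing it:

albany_df = df[df['region']=="Albany"].copy()

Either way, you end up with your own copy, so later column assignments won't trip the warning.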

Another subtle thing you might have glossed over is the requirement for us to sort things how we intend before we start performing operations and calculations. Many times, you won't be visualizing columns before you make them. You may actually never visualize them. Imagine if we had written the above code before sorting by date, basically just assuming things were ordered by date. We'd have produced bad data.

Then imagine maybe we're doing some machine learning or other statistical analysis on that data. Chances are, our moving average column is not only fairly useless, it's also frequently being informed by future data!

It's very easy to make mistakes like this. Check your code early and check it often through printing it out and visualizing it where possible!

Alright, we want more cool stuff, what else can we do?

Let's graph prices in the different regions. We hard-coded the Albany region, but hmm, we don't know all of the regions. What do we do?! If we ever just wanted to get a "list" from one of our columns, we could reference just that column, like:

df['region']
0                  Albany
1                  Albany
2                  Albany
3                  Albany
4                  Albany
5                  Albany
6                  Albany
7                  Albany
8                  Albany
9                  Albany
10                 Albany
11                 Albany
12                 Albany
13                 Albany
14                 Albany
15                 Albany
16                 Albany
17                 Albany
18                 Albany
19                 Albany
20                 Albany
21                 Albany
22                 Albany
23                 Albany
24                 Albany
25                 Albany
26                 Albany
27                 Albany
28                 Albany
29                 Albany
               ...
18219             TotalUS
18220             TotalUS
18221             TotalUS
18222             TotalUS
18223             TotalUS
18224             TotalUS
18225                West
18226                West
18227                West
18228                West
18229                West
18230                West
18231                West
18232                West
18233                West
18234                West
18235                West
18236                West
18237    WestTexNewMexico
18238    WestTexNewMexico
18239    WestTexNewMexico
18240    WestTexNewMexico
18241    WestTexNewMexico
18242    WestTexNewMexico
18243    WestTexNewMexico
18244    WestTexNewMexico
18245    WestTexNewMexico
18246    WestTexNewMexico
18247    WestTexNewMexico
18248    WestTexNewMexico
Name: region, Length: 18249, dtype: object

Then convert to an array with:

df['region'].values
array(['Albany', 'Albany', 'Albany', ..., 'WestTexNewMexico',
       'WestTexNewMexico', 'WestTexNewMexico'], dtype=object)

We could go to a list like:

df['region'].values.tolist()

# and then maybe get the uniques with:

print(set(df['region'].values.tolist()))
{'Orlando', 'RaleighGreensboro', 'Charlotte', 'NewOrleansMobile', 'Syracuse', 'Nashville', 'DallasFtWorth', 'Chicago', 'Columbus', 'SanFrancisco', 'Southeast', 'Tampa', 'Jacksonville', 'SanDiego', 'MiamiFtLauderdale', 'Seattle', 'Philadelphia', 'California', 'SouthCentral', 'Pittsburgh', 'GrandRapids', 'Atlanta', 'Indianapolis', 'CincinnatiDayton', 'RichmondNorfolk', 'Louisville', 'Roanoke', 'LasVegas', 'Northeast', 'NorthernNewEngland', 'Detroit', 'Portland', 'Plains', 'Spokane', 'LosAngeles', 'HarrisburgScranton', 'SouthCarolina', 'TotalUS', 'West', 'Albany', 'NewYork', 'WestTexNewMexico', 'BuffaloRochester', 'Sacramento', 'BaltimoreWashington', 'Boston', 'Boise', 'Denver', 'HartfordSpringfield', 'PhoenixTucson', 'Houston', 'GreatLakes', 'Midsouth', 'StLouis'}

So, very quickly you can take your pandas data, get it out into array/list form, and use your own knowledge of Python. Or you could just use a Pandas method. Just know: if you're trying to do something with your data, Pandas probably has a method for it!

df['region'].unique()
array(['Albany', 'Atlanta', 'BaltimoreWashington', 'Boise', 'Boston',
       'BuffaloRochester', 'California', 'Charlotte', 'Chicago',
       'CincinnatiDayton', 'Columbus', 'DallasFtWorth', 'Denver',
       'Detroit', 'GrandRapids', 'GreatLakes', 'HarrisburgScranton',
       'HartfordSpringfield', 'Houston', 'Indianapolis', 'Jacksonville',
       'LasVegas', 'LosAngeles', 'Louisville', 'MiamiFtLauderdale',
       'Midsouth', 'Nashville', 'NewOrleansMobile', 'NewYork',
       'Northeast', 'NorthernNewEngland', 'Orlando', 'Philadelphia',
       'PhoenixTucson', 'Pittsburgh', 'Plains', 'Portland',
       'RaleighGreensboro', 'RichmondNorfolk', 'Roanoke', 'Sacramento',
       'SanDiego', 'SanFrancisco', 'Seattle', 'SouthCarolina',
       'SouthCentral', 'Southeast', 'Spokane', 'StLouis', 'Syracuse',
       'Tampa', 'TotalUS', 'West', 'WestTexNewMexico'], dtype=object)

That was quick and painless!
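And if all you wanted was the number of regions, there's a method for that too:

print(df['region'].nunique())

which gives 54 here.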

graph_df = pd.DataFrame()

for region in df['region'].unique()[:16]:
    print(region)
    region_df = df.copy()[df['region']==region]
    region_df.set_index('Date', inplace=True)
    region_df.sort_index(inplace=True)
    region_df[f"{region}_price25ma"] = region_df["AveragePrice"].rolling(25).mean()

    if graph_df.empty:
        graph_df = region_df[[f"{region}_price25ma"]]  # note the double square brackets!
    else:
        graph_df = graph_df.join(region_df[f"{region}_price25ma"])
Albany
Atlanta
BaltimoreWashington
Boise
Boston
BuffaloRochester
California
Charlotte
Chicago
CincinnatiDayton
Columbus
DallasFtWorth
Denver
Detroit
GrandRapids
GreatLakes

I set the limit to 16 regions just to show how we're getting bogged down. This one really tripped me up; I couldn't quite figure out what was going on. Things were taking exponentially longer and longer, and then memory was getting exhausted. That's not the Pandas I know and love, so what gives? Here's the mechanism: .join() matches rows by index, so when the same date is duplicated on both sides, every duplicate pairs with every other duplicate, and the row count multiplies with each region we join. Sure enough, upon some digging, we find that dates are still getting duplicated. For example:

graph_df.tail()
Albany_price25ma Atlanta_price25ma BaltimoreWashington_price25ma Boise_price25ma Boston_price25ma BuffaloRochester_price25ma California_price25ma Charlotte_price25ma Chicago_price25ma CincinnatiDayton_price25ma Columbus_price25ma DallasFtWorth_price25ma Denver_price25ma Detroit_price25ma GrandRapids_price25ma GreatLakes_price25ma
Date
2018-03-25 1.4368 1.2884 1.3844 1.5016 1.588 1.2232 1.4232 1.4916 1.5708 1.2792 1.1704 1.118 1.2888 1.1492 1.3264 1.2788
2018-03-25 1.4368 1.2884 1.3844 1.5016 1.588 1.2232 1.4232 1.4916 1.5708 1.2792 1.1704 1.118 1.2888 1.1728 1.3164 1.2568
2018-03-25 1.4368 1.2884 1.3844 1.5016 1.588 1.2232 1.4232 1.4916 1.5708 1.2792 1.1704 1.118 1.2888 1.1728 1.3164 1.2788
2018-03-25 1.4368 1.2884 1.3844 1.5016 1.588 1.2232 1.4232 1.4916 1.5708 1.2792 1.1704 1.118 1.2888 1.1728 1.3264 1.2568
2018-03-25 1.4368 1.2884 1.3844 1.5016 1.588 1.2232 1.4232 1.4916 1.5708 1.2792 1.1704 1.118 1.2888 1.1728 1.3264 1.2788

Each row should be a separate date, but it's not. Through some debugging, we can discover what's happening, which also tells us why our previous data looked so ugly.
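One way to dig into it, sketched here: check how many rows share each date within a single region, and what else differs between those rows:

albany_df = df.copy()[df['region']=="Albany"]
print(albany_df["Date"].value_counts().head())  # every date shows up twice
print(albany_df["type"].unique())               # ['conventional' 'organic']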

Our avocados come in two types, each with its own price: organic and conventional! So let's pick one; I'll go with organic. We'll pretty much just start over:

import pandas as pd

df = pd.read_csv("datasets/avocado.csv")
df = df.copy()[df['type']=='organic']

df["Date"] = pd.to_datetime(df["Date"])

df.sort_values(by="Date", ascending=True, inplace=True)
df.head()
Unnamed: 0 Date AveragePrice Total Volume 4046 4225 4770 Total Bags Small Bags Large Bags XLarge Bags type year region
9489 51 2015-01-04 1.24 142349.77 107490.73 25711.96 2.93 9144.15 9144.15 0.00 0.0 organic 2015 California
10269 51 2015-01-04 1.50 6329.83 3730.80 2141.91 0.00 457.12 426.67 30.45 0.0 organic 2015 LasVegas
10893 51 2015-01-04 1.12 17296.85 14569.66 1868.59 0.00 858.60 830.00 28.60 0.0 organic 2015 PhoenixTucson
9437 51 2015-01-04 1.73 379.82 0.00 59.82 0.00 320.00 320.00 0.00 0.0 organic 2015 BuffaloRochester
11621 51 2015-01-04 1.30 5782.70 723.29 4221.15 0.00 838.26 223.33 614.93 0.0 organic 2015 Spokane

Now, let's just copy and paste the code from above, minus the print:

graph_df = pd.DataFrame()

for region in df['region'].unique():
    region_df = df.copy()[df['region']==region]
    region_df.set_index('Date', inplace=True)
    region_df.sort_index(inplace=True)
    region_df[f"{region}_price25ma"] = region_df["AveragePrice"].rolling(25).mean()

    if graph_df.empty:
        graph_df = region_df[[f"{region}_price25ma"]]  # note the double square brackets! (so df rather than series)
    else:
        graph_df = graph_df.join(region_df[f"{region}_price25ma"])

graph_df.tail()
California_price25ma LasVegas_price25ma PhoenixTucson_price25ma BuffaloRochester_price25ma Spokane_price25ma LosAngeles_price25ma Philadelphia_price25ma Boston_price25ma StLouis_price25ma Louisville_price25ma ... Houston_price25ma Chicago_price25ma Plains_price25ma Indianapolis_price25ma SouthCentral_price25ma Columbus_price25ma Albany_price25ma Detroit_price25ma NewOrleansMobile_price25ma NewYork_price25ma
Date
2018-02-25 1.9128 1.9120 1.7468 1.2912 2.1544 1.9024 1.6324 1.7636 1.9240 1.7044 ... 1.6132 1.8160 1.8116 1.4928 1.5728 1.6064 1.5112 1.4980 1.5384 1.9308
2018-03-04 1.8876 1.8748 1.7404 1.2744 2.1040 1.8656 1.6260 1.7708 1.8868 1.6816 ... 1.5960 1.8024 1.7900 1.4744 1.5592 1.5804 1.4992 1.4692 1.5288 1.9156
2018-03-11 1.8636 1.8440 1.7324 1.2652 2.0552 1.8284 1.6300 1.7824 1.8468 1.6508 ... 1.5696 1.7836 1.7672 1.4540 1.5400 1.5496 1.5044 1.4444 1.5076 1.9092
2018-03-18 1.8516 1.8204 1.7216 1.2560 2.0012 1.8160 1.6304 1.7932 1.8192 1.6176 ... 1.5360 1.7732 1.7452 1.4320 1.5204 1.5088 1.5140 1.4092 1.4860 1.8948
2018-03-25 1.8364 1.7968 1.7104 1.2416 1.9496 1.8016 1.6256 1.7984 1.7976 1.5844 ... 1.5128 1.7672 1.7232 1.4160 1.5072 1.4848 1.5188 1.3964 1.4616 1.8876

5 rows × 54 columns

Now it's quick! Awesome!

Let's graph!

graph_df.plot(figsize=(8,5), legend=False)
<matplotlib.axes._subplots.AxesSubplot at 0x124046fd0>
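If you'd like to keep that figure around rather than just viewing it, matplotlib's savefig will write out whatever was last plotted (the filename here is just an example):

import matplotlib.pyplot as plt

graph_df.plot(figsize=(8,5), legend=False)
plt.savefig("avocado_price25ma.png")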

Lots more we could poke around with here, but in the next tutorial we'll be visiting a new dataset with new challenges.

The next tutorial: Groupby - Data Analysis with Python 3 and Pandas

  • Introduction - Data Analysis with Python 3 and Pandas
  • Graphing/visualization - Data Analysis with Python 3 and Pandas
  • Groupby - Data Analysis with Python 3 and Pandas
  • Visualizing Correlation Table - Data Analysis with Python 3 and Pandas
  • Combining multiple datasets - Data Analysis with Python 3 and Pandas
  • Machine Learning with Scikit-learn - Data Analysis with Python 3 and Pandas