Welcome to a data analysis tutorial with Python and the Pandas data analysis library.
The field of data analytics is quite large, and what you might be aiming to do with it will likely never match up exactly with any tutorial. With that in mind, I think the best way for us to approach learning data analysis with Python is simply by example. My plan here is to find some datasets and do some of the common data analysis tasks, using the Pandas package, to hopefully get you familiar enough with the package to work with it on your own.
To begin, let's make sure we're all on the same page.
I will be using Python 3.7 and Pandas 0.24.1.
You can likely follow along with different versions of things, just know there may be minor differences that you will need to work out. With Pandas, I have personally found I can usually google my errors with a high degree of success.
So, after you've got Python and done a pip install pandas, you're ready!
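If you want to double-check which versions you're running once pandas is installed, a quick sanity check works:

import sys
import pandas as pd

print(sys.version)     # your Python version
print(pd.__version__)  # your pandas version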
There will be quite a few packages and libraries that we install through the course of this series. If you'd rather focus on the code and not getting packages, you can check out a pre-compiled and optimized distribution of Python from ActiveState, which will have everything you will need to follow along with this series.
Let's jump in!
Oh, wait, we probably should have a dataset too.
The internet is stuffed full of datasets, so there are many to choose from. I am personally going to be using datasets from Kaggle.
If you are not familiar, Kaggle is a data analysis competitions website. I think that, if you're looking to practice real-world data analysis challenges, Kaggle is the single best place to do it, even if you're not looking to compete.
Many, if not most, of the competitions on Kaggle are actual company problems, things just like those I often get asked to do in my contract work, or that you might be asked to do if you find employment as a data analyst. These are typically "unsolved" types of problems, rather than the simpler, solved issues that you will typically encounter in tutorials.
I don't think we're quite ready to jump into anything serious, so let's find a simpler dataset to start with. To find datasets, check out the Kaggle Datasets page. Tons of goodies there.
To begin, let's check out Avocado Prices. I absolutely adore avocados! Did you know avocados are a fruit? Most closely classified as ... a berry! Imagine getting some "mixed berries" flavored thing, and there's avocado in there. Hah!
Anyway, download that dataset. You will need to log in or create an account to use Kaggle, but you should anyway. If for whatever reason you don't want to, or the dataset is missing, I will also host it here: Avocado Prices.
Unzip the file using whatever you use to zip/unzip things, and you're left with a CSV file.
CSV files are extremely common file types in data analysis. The structure of a CSV is meant to be organized into columns and rows, where the values in each row are separated by commas (hey, is that where the name CSV comes from!?!) and the rows themselves are separated by new lines in the document. So, let's read this CSV in with Pandas.
For now, let's make sure our file is in the same working directory as our Python script, or in a directory like "datasets." I will be doing the latter, but you can feel free to do as you wish. So, to begin, we have a file called avocado.csv, and we want to load that into Pandas. It's a CSV file, so it's already in a sort of columns-and-rows format; we just want to load that into a pandas dataframe.
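Before handing the file over to pandas, it can be instructive to peek at the first few raw lines yourself, just to see that comma-separated structure (a quick sketch, assuming the datasets/avocado.csv path from above):

# print the header row plus a couple of data rows, as raw text
with open("datasets/avocado.csv") as f:
    for _ in range(3):
        print(f.readline().strip())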
To load it into a dataframe, we will use a method called read_csv. Let's see how that works. I am going to be doing this in a Jupyter Notebook. You can use whatever editor you like, but Jupyter notebooks are pretty useful for data analysis and just general poking around with data. To use them, you can just do:
pip install jupyterlab
Then in a terminal/command prompt, you can do:
jupyter lab
Then you can go file > new > notebook, pick Python 3, and you're good to go! Let's start by loading in a file.
import pandas as pd # convention to import and use pandas like this
df = pd.read_csv("datasets/avocado.csv") # df stands for dataframe. Also a common convention to call this df
A dataframe is a type of pandas object that is basically a "table"-like object with columns and rows, on which we can also perform various calculations, statistical operations, etc. We can print it out:
df
Okay, that's a bit messy to print out every time. Often, we just want to see a small snippet of our dataframe, just to make sure everything is what we expect. Most people will use the .head() method for this:
df.head()
You can pass a parameter to .head(), which is how many rows you want, like:
df.head(3)
Often, you may apply rolling window types of operations, where the head will wind up containing NaN data, and instead you want to see the end. You can do that too, with .tail():
df.tail(6)
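To see why the head fills with NaNs, here's a quick illustration using the AveragePrice column (more on column access in a moment); the 25-row window size here is arbitrary, just for demonstration:

# a 25-row rolling mean can't produce a value until it has seen 25 rows,
# so the first 24 results are NaN
rolling_price = df['AveragePrice'].rolling(25).mean()
print(rolling_price.head(3))  # NaN, NaN, NaN
print(rolling_price.tail(3))  # actual numbers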
We can also reference specific columns, like:
df['AveragePrice'].head()
Also, you can use attribute-like dot notation like:
df.AveragePrice.head()
But most people use the dict-like methodology. I am not sure if I have ever seen the attribute-like method in the wild, so probably don't do it, just know that other people might! (It also only works when the column name is a valid Python identifier; a column name with a space in it, for instance, has no dot-notation equivalent.)
A common goal with data analysis is to visualize data. We all love pretty graphs, plus they usually help us generalize data pretty well. So, how might we graph this data? Looking at the data, it's clear that it's actually organized by date, but also by region, so we could plot line graphs of individual regions over time.
To do this, we'll need matplotlib, which is a popular data visualization library. To get it, let's do:
pip install matplotlib
Next, how might we get an individual region? We'd need to filter for that region column! Let's see how we might do that:
albany_df = df[df['region']=="Albany"]
Ok, so that might look a bit dense, but let's parse that out.
albany_df = df[ df['region'] == "Albany" ]
We're just saying that albany_df is the df, where the df['region'] column is equal to "Albany". The result is a new dataframe where this is the case:
albany_df.head()
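If the bracket-in-bracket syntax still feels opaque, it can help to know that the inner comparison by itself just produces a boolean Series, a True/False mask with one entry per row, and the outer df[...] keeps only the rows where the mask is True. A quick sketch:

# the inner comparison yields a boolean Series (True where region is Albany)
mask = df['region'] == "Albany"
print(mask.head())    # True/False for each row
albany_df = df[mask]  # same result as df[df['region'] == "Albany"]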
Okay, so one more thing you will often see: dataframes are "indexed" by something. Let's see what this dataframe is indexed by:
albany_df.index
In this case, the index is worthless to us. It's just incrementing row counts, which we have no use for here. Instead, we should ask ourselves: how is this Albany avocado data organized? How does each row relate to the others? Well, by date. That's the main way this data is organized. So really, we want Date to be our index! We can do this with set_index.
albany_df.set_index("Date")
Wait, what? Why did it print out like that? Part of the benefit of the notebook is that this happened to us, but I would explain this either way. Some of the methods in pandas will modify your dataframe in place, but MOST are going to simply do the thing and return a new dataframe. So if we just check real quick:
albany_df.head()
We can see that the albany_df is not impacted. There are two ways we can handle this. One is to re-define:
albany_df = albany_df.set_index("Date")
albany_df.head()
The other option we can use is the inplace parameter. Something like:
albany_df.set_index("Date", inplace=True)
would also work.
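One caveat: since albany_df was created by filtering df, pandas may complain with a SettingWithCopyWarning when you modify it in place like this. If you run into that, a common fix is to make the filtered frame an explicit copy up front:

# .copy() gives us an independent dataframe, silencing the warning
albany_df = df[df['region'] == "Albany"].copy()
albany_df.set_index("Date", inplace=True)

Okay, now that we've done that, let's plot!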
albany_df['AveragePrice'].plot()
When we call .plot() on a dataframe, it is just assumed that the x-axis will be your index, and then y will be all of your columns, which is why we specified one column in particular.
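One note if you're following along in a plain Python script rather than a notebook: notebooks render the figure inline automatically, but a script needs matplotlib's explicit show() call, something like:

import matplotlib.pyplot as plt  # the library we installed earlier

albany_df['AveragePrice'].plot()
plt.show()  # opens the figure window; not needed inside a notebook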
This graph is a bit messy, however, especially with the dates, which also look out of order and such. Let's see if we can't carry on with this in the next tutorial!