What is going on everyone, welcome to a Data Analysis with Python and Pandas tutorial series. Pandas is a Python module, and Python is the programming language that we're going to use. The Pandas module is a high performance, highly efficient, and high level data analysis library.
At its core, it is very much like operating a headless version of a spreadsheet, like Excel. Most of the datasets you work with will be what are called dataframes. You may be familiar with this term already, it is used across other languages, but, if not, a dataframe is most often just like a spreadsheet. Columns and rows, that's all there is to it! From here, we can utilize Pandas to perform operations on our data sets at lightning speeds.
Pandas is also compatible with many of the other data analysis libraries, like Scikit-Learn for machine learning, Matplotlib for graphing, NumPy (since Pandas is built on NumPy), and more. It's incredibly powerful and valuable to know. If you're someone who finds themselves using Excel, or general spreadsheets, for various computational tasks, where they might take a minute, or an hour, to run, Pandas is going to change your life. I've even seen versions of machine learning like K-Means clustering being done in Excel. That's really cool, but Python is going to do that for you way faster, which will also allow you to be a bit more stringent on parameters, have larger datasets and just plain get more done.
Another bit of good news? You can easily load in, and output out in the xls or xlsx format quickly, so, even if your boss wants to view things the old way, they can. Pandas is also compatible with text files, csv, hdf files, xml, html, and more with its incredibly powerful IO.
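To make the input/output idea a bit more concrete, here is a minimal sketch of a CSV round trip: write a dataframe out, then read it straight back in. The city and population values are made up purely for illustration, and the text is kept in memory with io.StringIO so the example doesn't depend on any file on your machine.

```python
import io

import pandas as pd

# Hypothetical sample data, made up for illustration only.
df = pd.DataFrame({"city": ["Austin", "Boston"], "population": [950000, 650000]})

# Write the dataframe out as CSV text (index=False leaves out the row labels).
csv_text = df.to_csv(index=False)
print(csv_text)

# Read the same text back in; StringIO lets read_csv treat a string like a file.
round_tripped = pd.read_csv(io.StringIO(csv_text))
print(round_tripped)
```

The Excel equivalents are df.to_excel and pd.read_excel, though those typically need an extra Excel engine package, such as openpyxl, to be installed.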
If you're just now joining us with Python, you should be able to follow along without already having mastered Python, and this could even be your intro to Python in general. Most importantly, if you have questions, ask them! If you seek out answers for each of the areas of confusion, and do this for everything, eventually you will have a full picture. Most of your questions will be Google-able as well. Don't be afraid to Google your questions. It won't laugh at you, I promise. I still Google a lot of my goals to see if someone has some example code doing what I want to do, so don't feel like a noob just because you do it.
If I have not sold you yet on Pandas, the elevator pitch is: Lightning fast data analysis on spreadsheet-like data, with an extremely robust input/output mechanism for handling multiple data types and even converting to and from data types.
Alright, you are sold. Now let's get Pandas! First, I am going to assume some people do not even have Python yet. By far the easiest choice is to go with a pre-compiled distribution of Python, such as ActivePython, which is a quick and simple way to get all of the packages and dependencies you need for data science in a bundle, without the headache of installing them one-by-one, especially on 64 bit Windows. I recommend getting the latest version of 64 bit Python. In this series alone, we're using Pandas, which requires NumPy. We'll also be using Matplotlib and Scikit-Learn, all of which come with ActivePython pre-compiled and optimized with MKL. You can download a fully set up Python distribution from ActiveState here.
If you want to manually install Python, head to Python.org, and download Python 3 or later. Just don't get 2.X. Take note of the bit version that you download. Just because your operating system is 64 bit, it doesn't mean that's your Python version. The default download is not necessarily 64 bit (it has often been 32 bit), so check before you install. Choose what you want. 64 bit can be a bit of a headache, so I wouldn't recommend it if you're a newcomer, but 64 bit is ideal for data science so you're not locked into 2GB max of RAM use. If you want to go the 64 bit route, it might help to check out the pip install tutorial, which covers how to handle regular installs as well as the more tricky 64 bit packages. If you're going with 32 bit, then don't worry about that tutorial for now.
So you've got Python installed. Next, go to your terminal or cmd.exe, and type:
pip install pandas
Did you get a "pip is not a recognized command" or something similar? No problem, this means pip is not on your PATH. Pip is a program, but your machine doesn't just simply know where it is unless it is on your PATH. You can look up how to add something to your PATH if you like, but you can always just explicitly give the path to the program you want to execute. On Windows, for example, Python's pip is located in
C:/Python34/Scripts/pip
Python34 means Python 3.4. If you have Python 3.6, then you would use Python36, and so on.
Thus, if regular pip install pandas didn't work, then you can do
C:/Python34/Scripts/pip install pandas
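If you're not sure where your Python lives or whether pip is on your PATH, a small sketch like the one below can tell you. It only uses the standard library; the name pip_path is just a variable chosen for this example.

```python
import shutil
import sys

# Full path of the Python interpreter that is running this script.
print("Python executable:", sys.executable)

# shutil.which searches your PATH for a program, much like the terminal does.
# It returns None when the program can't be found.
pip_path = shutil.which("pip")
if pip_path is None:
    print("pip is not on your PATH")
else:
    print("pip found at:", pip_path)
```

Another option that sidesteps the PATH question for pip itself is to run it through the interpreter with python -m pip install pandas, which works as long as python itself can be found.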
On that note, another major point of contention for people is the editor they choose. The editor really does not matter in the grand scheme of things. You should try multiple editors, and go with the one that suits you best. Whatever you feel comfortable with, and you are productive with. That's what matters most in the end. Some employers are also going to force you to use editor X, Y, or Z in the end as well, so you probably shouldn't become dependent on editor features. With that, I prefer the simple IDLE, so that's what I will code in. Again though, you can code in Wing, emacs, Nano, Vim, PyCharm, IPython, whatever you want. To open IDLE, just go to start, search for IDLE, and choose that. From there, File > New, and boom you have a text editor with highlighting and a few other little things. We'll cover some of these minor things as we go.
Now, with whatever editor you are using, open it up, and let's write some quick code to check out a dataframe.
Generally, a DataFrame is closest to Python's dictionary data structure. If you are not familiar with dictionaries, there's a tutorial for that. I'll annotate things like that in the video, as well as having links to them in the description and on the text-based versions of the tutorials on PythonProgramming.net.
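To see the dictionary comparison in action, here is a minimal sketch that builds a dataframe straight from a dictionary. The column names and numbers are hypothetical, made up just for this example.

```python
import pandas as pd

# Hypothetical data: each dictionary key becomes a column name,
# and each list holds that column's values, one per row.
stats = {
    "Day": [1, 2, 3],
    "Visitors": [43, 34, 65],
}

df = pd.DataFrame(stats)
print(df)

# Selecting a column uses the same square-bracket syntax as a dictionary lookup.
print(df["Visitors"])
```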
First, let's make some simple imports:
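import pandas as pd
import datetime
import pandas_datareader.data as web
Here, we import pandas as pd. This is just a common standard used when importing the Pandas module. Next, we import datetime, which we'll use in a moment to tell Pandas some dates that we want to pull data between. Finally, we import pandas_datareader.data as web, because we're going to use this to pull data from the internet. This used to live at pandas.io.data in older versions of Pandas, but it has since moved to the separate pandas-datareader package, which you can get with pip install pandas-datareader. Next up: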
start = datetime.datetime(2010, 1, 1)
end = datetime.datetime(2015, 8, 22)
Here, we create start and end variables that are datetime objects, pulling data from Jan 1st 2010 to Aug 22nd 2015. Now, we can create a dataframe like so:
df = web.DataReader("XOM", "yahoo", start, end)
This pulls data for Exxon from the Yahoo Finance API, storing the data to our df variable. Naming your dataframe df is not required, but again, it is a pretty popular standard for working with Pandas. It just helps people immediately identify the working dataframe without needing to trace the code back.
So this gives us a dataframe, how do we see it? Well, we can just print it, like:
print(df)
So that's a lot of space. The middle of the dataset is ignored, but that's still a lot of output. Instead, most people will just do:
print(df.head())
Output:
                 Open       High        Low      Close    Volume  Adj Close
Date
2010-01-04  68.720001  69.260002  68.190002  69.150002  27809100  59.215446
2010-01-05  69.190002  69.449997  68.800003  69.419998  30174700  59.446653
2010-01-06  69.449997  70.599998  69.339996  70.019997  35044700  59.960452
2010-01-07  69.900002  70.059998  69.419998  69.800003  27192100  59.772064
2010-01-08  69.690002  69.750000  69.220001  69.519997  24891800  59.532285
This prints the first 5 rows of the dataframe, and is useful for debugging and just generally seeing what your dataframe looks like. As you perform analysis and such, this will be useful to see if what you intended actually happened or not. We'll dive more into this later on, however.
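If the download isn't available to you for whatever reason, one way to keep following along is to type in the five rows shown above by hand. The sketch below does that, and then tries a few close relatives of head(). This stand-in only has those five rows, so it is nowhere near the full dataset.

```python
import pandas as pd

# The five rows shown above, typed in by hand so no download is needed.
data = {
    "Open": [68.720001, 69.190002, 69.449997, 69.900002, 69.690002],
    "High": [69.260002, 69.449997, 70.599998, 70.059998, 69.750000],
    "Low": [68.190002, 68.800003, 69.339996, 69.419998, 69.220001],
    "Close": [69.150002, 69.419998, 70.019997, 69.800003, 69.519997],
    "Volume": [27809100, 30174700, 35044700, 27192100, 24891800],
    "Adj Close": [59.215446, 59.446653, 59.960452, 59.772064, 59.532285],
}
dates = pd.to_datetime(
    ["2010-01-04", "2010-01-05", "2010-01-06", "2010-01-07", "2010-01-08"]
)

df = pd.DataFrame(data, index=dates)
df.index.name = "Date"

print(df.head(3))  # first 3 rows instead of the default 5
print(df.tail(2))  # last 2 rows
print(df.shape)    # (number of rows, number of columns)
```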
We could stop here with the intro, but one more thing: Data Visualization. Like I said earlier, Pandas works great with other modules, Matplotlib being one of them. Let's see! Open your terminal or cmd.exe, and do
pip install matplotlib
You probably already have it, especially if you went with a bundled distribution, but we want to make sure. Now, at the top of your script with the other imports, add:
import matplotlib.pyplot as plt
from matplotlib import style
style.use('fivethirtyeight')
Pyplot is the basic matplotlib graphing module. Style helps us quickly make our graphs look good, and style.use lets us choose a style. Interested in learning more about Matplotlib? Check out the in-depth Matplotlib tutorial series!
Next, below our print(df.head()), we can do something like:
df['High'].plot()
plt.legend()
plt.show()
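If you're working somewhere without a display, or you just want the graph as an image, a variation like the sketch below saves the plot to a file instead of opening a window. It assumes the non-interactive Agg backend, uses only the five "High" values from the sample rows shown earlier, and xom_high.png is just a filename picked for this example.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend: draws to files, no window needed

import matplotlib.pyplot as plt
import pandas as pd
from matplotlib import style

style.use("fivethirtyeight")

# "High" values from the five sample rows shown earlier.
high = pd.Series(
    [69.260002, 69.449997, 70.599998, 70.059998, 69.750000],
    index=pd.to_datetime(
        ["2010-01-04", "2010-01-05", "2010-01-06", "2010-01-07", "2010-01-08"]
    ),
    name="High",
)

ax = high.plot()  # pandas draws the series on a Matplotlib Axes and returns it
plt.legend()      # the series name, "High", is used as the legend label
plt.savefig("xom_high.png")  # hypothetical output filename
```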
Pretty cool! There's a quick introduction to Pandas, but nowhere near what is available. In this series, we're going to be covering more of the basics of pandas, then move on to navigating and working with dataframes. From there, we'll touch a bit more on visualization, input and output with many data formats, basic and intermediate data analysis and operations, merging and combining dataframes, resampling, and much more with a lot of realistic examples.
If you're lost, confused, or need some clarity, don't hesitate to ask questions on the respective videos.