Need help installing packages with pip? See the pip install tutorial.
Hello and welcome to a Python for Finance tutorial series. In this series, we're going to run through the basics of importing financial (stock) data into Python using the Pandas framework. From here, we'll manipulate the data and attempt to come up with some sort of system for investing in companies, apply some machine learning, even some deep learning, and then learn how to back-test a strategy. I assume you know the fundamentals of Python. If you're not sure if that's you, click the fundamentals link, look at some of the topics in the series, and make a judgement call. If at any point you are stuck in this series or confused on a topic or concept, feel free to ask for help and I will do my best to help.
A common question that I am asked is whether or not I make a profit investing or trading with these techniques. I mostly play with finance data for fun and to practice my data analysis skills, but it does also influence my investment decisions to this day. I am not doing active algorithmic trading at the time of my writing this, but I have in the past, and I actually made a profit. It's a lot more work than you might think to algorithmically trade, though. Finally, the knowledge about how to manipulate and analyze financial data, as well as how to backtest trading strategies, has *saved* me a ton of money.
None of the strategies presented here will make you an ultra wealthy person. If they would, I'd probably keep them to myself! The knowledge itself, however, can save you money, and even make you money.
Alright great, let's get started. To begin, I am using Python 3.5, but you should be able to get by with later versions. I will assume you already have Python installed. If you have a 64 bit operating system but not 64 bit Python, get 64 bit Python. It'll help you a bit later. If you're on a 32 bit operating system, I am sorry for your situation, but you should be fine to follow most of this anyway.
Required Modules to start:
That'll do for now. We'll deal with other modules as they come up. To begin, let's cover how we might go about dealing with stock data using pandas, matplotlib and Python.
If you'd like to learn more on Matplotlib, check out the Data Visualization with Matplotlib tutorial series.
If you'd like to learn more on Pandas, check out the Data Analysis with Pandas tutorial series.
To begin, we're going to make the following imports:
import datetime as dt
import matplotlib.pyplot as plt
from matplotlib import style
import pandas as pd
import pandas_datareader.data as web

The datetime module will easily allow us to work with dates, matplotlib is for graphing things, pandas is for manipulating data, and pandas_datareader is the newest pandas io library at the time of my writing this.
Now for some starting setup:
style.use('ggplot')

start = dt.datetime(2015, 1, 1)
end = dt.datetime.now()

We're setting a style so our graphs don't look horrendous. In finance, it's of the utmost importance that your graphs are pretty, even if you're losing money. Next, we're setting start and end datetime objects. These mark the range of dates that we're going to grab stock pricing information for.
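To see what these two objects give us, here is a minimal sketch. The fixed end date is an assumption made only so the result is repeatable; the tutorial itself uses dt.datetime.now().

```python
import datetime as dt

start = dt.datetime(2015, 1, 1)
# Assumption: a fixed end date, used here only so the result is repeatable.
end = dt.datetime(2015, 1, 8)

# Subtracting one datetime from another gives a timedelta object.
span = end - start
print(span.days)  # 7
```

Because these are real datetime objects rather than strings, they can be compared and subtracted directly.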
Now, we can make a dataframe from this data:
Note: This has changed since the video was filmed. Both Yahoo and Google have stopped their APIs, so we'll use morningstar this time:
df = web.DataReader("TSLA", 'morningstar', start, end)
If you're not currently familiar with what a DataFrame object is, you can check out the tutorial on Pandas, or just be content to think of it like a spreadsheet, or a database table that lives in your memory/RAM. It's just a table of rows and columns, with an index and column names. In our case, our index will likely be date. The index should be something that relates to all of the columns.
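If the spreadsheet comparison feels abstract, here is a small hand-built DataFrame as a sketch. The numbers are copied from the first few rows of the TSLA output shown later in this tutorial, and the name df_small is just something made up for this example.

```python
import pandas as pd

# Values copied from the first rows of the TSLA output shown in this tutorial.
data = {
    "Close": [222.41, 219.31, 210.09],
    "Open": [222.41, 222.63, 214.50],
    "Volume": [0, 4764443, 5368477],
}
dates = pd.to_datetime(["2015-01-01", "2015-01-02", "2015-01-05"])

# The dates become the index; the dictionary keys become the column names.
df_small = pd.DataFrame(data, index=dates)
df_small.index.name = "Date"

print(df_small.columns.tolist())  # ['Close', 'Open', 'Volume']

# Look up a single cell by its index label (row) and column name.
close_jan_2 = df_small.loc[pd.Timestamp("2015-01-02"), "Close"]
print(close_jan_2)
```

Each row is tied to one date, which is what it means for the index to relate to all of the columns.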
The line web.DataReader("TSLA", 'morningstar', start, end) uses the pandas_datareader package, looks for the stock ticker TSLA (Tesla), and gets the information from morningstar, starting at whatever start is and ending at the end variable that we chose. Just in case you don't know, a stock is a share of ownership of a company, and the ticker is the "symbol" used to reference the company on the stock exchange that it's on. Most tickers are 1-4 letters.
So now we've got a pandas.DataFrame object that contains stock pricing information for Tesla. Let's see what we have here:
print(df.head())
                    Close    High       Low    Open   Volume
Symbol Date
TSLA   2015-01-01  222.41  222.41  222.4100  222.41        0
       2015-01-02  219.31  223.25  213.2600  222.63  4764443
       2015-01-05  210.09  216.50  207.1626  214.50  5368477
       2015-01-06  211.28  214.20  204.2100  210.06  6261936
       2015-01-07  210.95  214.78  209.7800  213.40  2968390
Now, let's simplify this dataframe slightly:
df.reset_index(inplace=True)
df.set_index("Date", inplace=True)
df = df.drop("Symbol", axis=1)

print(df.head())
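To try these steps without downloading anything, here is a rough sketch on a hand-built frame. The frame named raw is an assumption: a two-row stand-in with the same (Symbol, Date) index layout as the output above, not real downloaded data. It also shows droplevel, a pandas method that is not used in this tutorial but should give the same result in one step.

```python
import pandas as pd

# Assumption: a hand-built stand-in for the downloaded data, with the same
# (Symbol, Date) two-level index as the output shown above.
index = pd.MultiIndex.from_product(
    [["TSLA"], pd.to_datetime(["2015-01-01", "2015-01-02"])],
    names=["Symbol", "Date"],
)
raw = pd.DataFrame(
    {"Close": [222.41, 219.31], "Volume": [0, 4764443]},
    index=index,
)

# The same three steps as in the tutorial, written without inplace=True.
flat = raw.reset_index()            # Symbol and Date become ordinary columns
flat = flat.set_index("Date")       # Date becomes the only index
flat = flat.drop("Symbol", axis=1)  # the Symbol column is no longer needed

# Alternative: remove the Symbol level from the index directly.
same = raw.droplevel("Symbol")

print(flat.index.name, list(flat.columns))  # Date ['Close', 'Volume']
print(flat.equals(same))
```

Either way, the goal is a frame indexed by Date alone, which is easier to work with when there is only one ticker.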
Now, the full code is:
import datetime as dt
import matplotlib.pyplot as plt
from matplotlib import style
import pandas as pd
import pandas_datareader.data as web

style.use('ggplot')

start = dt.datetime(2015, 1, 1)
end = dt.datetime.now()

df = web.DataReader("TSLA", 'morningstar', start, end)

df.reset_index(inplace=True)
df.set_index("Date", inplace=True)
df = df.drop("Symbol", axis=1)

print(df.head())
Giving us:
             Close    High       Low    Open   Volume
Date
2015-01-01  222.41  222.41  222.4100  222.41        0
2015-01-02  219.31  223.25  213.2600  222.63  4764443
2015-01-05  210.09  216.50  207.1626  214.50  5368477
2015-01-06  211.28  214.20  204.2100  210.06  6261936
2015-01-07  210.95  214.78  209.7800  213.40  2968390

Now, this is a Python object made up of rows and columns, like a spreadsheet.
The .head() method is something you can call on Pandas DataFrames, and it will output the first n rows, where n is the optional parameter you pass. If you don't pass a parameter, 5 is the default value. We will mostly use .head() to just get a quick glimpse of our data to make sure we're on the right track. Looks great to me!
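Here is a quick sketch of how the n parameter behaves. The frame df_demo is made up for this example, and .tail(), the counterpart that reads from the bottom, is not used elsewhere in this part of the tutorial.

```python
import pandas as pd

# Assumption: a tiny made-up frame, just to have something to look at.
df_demo = pd.DataFrame({"value": range(10)})

print(len(df_demo.head()))    # 5, the default
print(len(df_demo.head(3)))   # 3

# .tail() works the same way, but from the bottom of the frame.
print(df_demo.tail(2)["value"].tolist())  # [8, 9]
```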
In case you do not know: