Hello and welcome to a tutorial covering how to use Zipline locally. Zipline is by far the best finance back-testing and analysis package for Python. While you can use Zipline, along with a bunch of free data, to back-test your strategies on Quantopian for free, you cannot easily use your own asset data there. Also, if you want to live-trade on your own, you're on your own, since you probably want the same system that back-tests your data to do your live-trading. Some people may also wish to protect their trading algorithm's IP. Finally, if your strategy requires heavy processing, such as deep learning, a lot of data, or maybe high-frequency trading, you're going to have to go at it locally, or on some hosting service, on your own.
If any of those things sound like your needs/wants, or you just want to learn more about Zipline, let's get started. First, installing Zipline can be a pain in the rear. Zipline is highly optimized by using many other packages, which is nice once you have everything working right, but it's quite the laundry list. Zipline is also only supported on Python 2.7 or 3.5, not 3.6 or 3.7 (as of my writing this anyway). It appears to me that the main reason for this is that Zipline also requires an older version of Pandas, which is not compatible with 3.6. I have personally installed Zipline on both Windows and Linux (Ubuntu) via stand-alone Python. That said, you might also just look into using Conda. Otherwise:
I am personally using Zipline 1.2 on Python 3.5 on Windows OS.
Ubuntu Zipline setup is very simple. At the time of my writing this, Zipline only supports up to Python 3.5. If you've already setup Python on Ubuntu, then you just need:
$ pip3 install numpy
$ pip3 install cython
$ pip3 install -U setuptools
$ pip3 install zipline
If you're on a fresh server:
$ sudo apt-get update && sudo apt-get upgrade
$ sudo apt-get install python3-dev
$ sudo apt-get install libatlas-base-dev gfortran pkg-config libfreetype6-dev
$ sudo apt-get install python3-pip
$ pip3 install numpy
$ pip3 install cython
$ pip3 install -U setuptools
$ pip3 install zipline
On Windows, things get a bit more hacky. At the time of my writing this, Zipline only supports up to Python 3.5. One of the main dependencies of Zipline is Pandas, and you need pandas 0.18 specifically, which is an older release. I expect this will one day be fixed, but this has been outdated for almost a year now, so I am guessing it's not high on their priorities. To install on Python 3.5, here's the list of dependencies, linking to the unofficial binaries page:
cython
numpy+mkl
sqlalchemy
bcolz
lru-dict
wrapt
statsmodels
bottleneck
cyordereddict
empyrical
contextlib2
All of those can be downloaded from Unofficial Windows Binaries for Python site.
Now do a pip install zipline to get the list of other non-C dependencies. This will eventually fail. That's fine. It's all going according to plan! It's just our quick way of getting the non-C dependencies installed, rather than manually installing them one-by-one; the C ones will fail.
Then do a pip install --upgrade pandas==0.18.0, which seems to be where the Python 3.5 requirement originates. You can also get a pre-built binary for pandas 0.18.0 here: Pandas 0.18.0
There are likely more dependencies than the ones above; I probably just had them already. I'll try to update this list if people mention others.
Finally, get zipline. I downloaded from here
Even so, zipline will attempt to download different, outdated versions of packages like bcolz. Rather than a regular pip install, which would pull in those dependencies, we're going to just do:
pip install --no-deps zipline-1.2.0-cp35-cp35m-win_amd64.whl
Once you've got everything ... or so you think, run python and try to import zipline
. You're probably missing other things. If you can successfully import Zipline, alright, let's carry on!
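A quick way to verify the install is a short sanity-check script (a sketch; it reports the version if zipline imports, and prints a hint instead of crashing if it doesn't):

```python
# Sanity check: is zipline importable, and which version did we get?
import importlib.util

spec = importlib.util.find_spec("zipline")
if spec is None:
    print("zipline is not importable yet -- recheck the dependency list")
else:
    import zipline
    print("zipline", zipline.__version__)
```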
Once you have Zipline, it's important we talk about some of the basics of using Zipline locally. First, you need data. Data is in the form of bundles
. You can either make your own bundles, or use a pre-made one. Eventually, we will use our own dataset, but, for now, let's use a pre-made one to keep this start up process as easy as possible!
Let's go ahead and ingest a data bundle via the command line interface (via terminal/command-line):
zipline ingest -b quantopian-quandl
The zipline.exe
should be in your scripts
dir for your Python installation. If you haven't set up your python path, you may need to specify the full path to zipline in this case, which would be something like C:/Python35/Scripts/zipline.exe
Aside from your data, your zipline program also, much like on Quantopian, will require an initialize
and handle_data
function. You will build your algorithms pretty much just like you do on Quantopian. Then, when you're ready, you have a few options for how you will run the back-test.
We used the zipline CLI above to grab data. Let's quickly do a zipline --help
:
zipline --help
Usage: zipline [OPTIONS] COMMAND [ARGS]...

  Top level zipline entry point.

Options:
  -e, --extension TEXT            File or module path to a zipline extension
                                  to load.
  --strict-extensions / --non-strict-extensions
                                  If --strict-extensions is passed then
                                  zipline will not run if it cannot load all
                                  of the specified extensions. If this is not
                                  passed or --non-strict-extensions is passed
                                  then the failure will be logged but
                                  execution will continue.
  --default-extension / --no-default-extension
                                  Don't load the default zipline extension.py
                                  file in $ZIPLINE_HOME.
  --help                          Show this message and exit.

Commands:
  bundles  List all of the available data bundles.
  clean    Clean up data downloaded with the ingest...
  ingest   Ingest the data for the given bundle.
  run      Run a backtest for the given algorithm.
As you can see, we can list out our bundles, clean, ingest new data, or run a backtest.
Let's also check out zipline run --help
:
zipline run --help
Usage: zipline run [OPTIONS]

  Run a backtest for the given algorithm.

Options:
  -f, --algofile FILENAME         The file that contains the algorithm to
                                  run.
  -t, --algotext TEXT             The algorithm script to run.
  -D, --define TEXT               Define a name to be bound in the namespace
                                  before executing the algotext. For example
                                  '-Dname=value'. The value may be any python
                                  expression. These are evaluated in order so
                                  they may refer to previously defined names.
  --data-frequency [minute|daily]
                                  The data frequency of the simulation.
                                  [default: daily]
  --capital-base FLOAT            The starting capital for the simulation.
                                  [default: 10000000.0]
  -b, --bundle BUNDLE-NAME        The data bundle to use for the simulation.
                                  [default: quantopian-quandl]
  --bundle-timestamp TIMESTAMP    The date to lookup data on or before.
                                  [default: <current-time>]
  -s, --start DATE                The start date of the simulation.
  -e, --end DATE                  The end date of the simulation.
  -o, --output FILENAME           The location to write the perf data. If
                                  this is '-' the perf will be written to
                                  stdout. [default: -]
  --print-algo / --no-print-algo  Print the algorithm to stdout.
  --help                          Show this message and exit.
I think that playing with Zipline lends itself to using an IPython notebook. If you want to use some other editor, that's totally fine, the differences should be minimal, but, if you want to follow along exactly, get a jupyter notebook going. If you are using IPython notebook with me, let's start off by loading in the Zipline extension:
If you don't have jupyter notebooks, you can do a pip install jupyter
. Then to open the notebooks, open a command prompt, type jupyter notebook
, press enter, a browser should open, then you can go to "new" in the top right, choose python3, and boom, you're in a notebook!
Any time you want to use zipline in a notebook, you need some magic:
%load_ext zipline
Now, let's do the "buy Apple" strategy:
from zipline.api import order, record, symbol


def initialize(context):
    pass


def handle_data(context, data):
    order(symbol('AAPL'), 10)
    record(AAPL=data.current(symbol('AAPL'), 'price'))
Now, we'd like to back-test this. We should be able to either use:
zipline run --bundle quantopian-quandl -f apple_backtest.py --start 2000-1-1 --end 2018-1-1 --output buyapple_out.pickle
via the command line or terminal, or, in IPython notebooks, we can just do something like:
%zipline --bundle quantopian-quandl --start 2008-1-1 --end 2012-1-1 -o dma.pickle
As of my latest testing, this now works. Previously it was broken because Zipline relied on a deprecated API for benchmark data. The current fix appears to use a different API for the benchmark, so this could break again at any time. If it does break, we can easily remedy it, no big deal. You do NOT need to do the following if things are working; it's only for overcoming errors:
So first of all, where are these benchmarks happening? From a quick poking around the error, I spot c:\python35\lib\site-packages\zipline\data\benchmarks.py
. Alright, that's a start. Let's head there. Here's the code:
import numpy as np
import pandas as pd
import pandas_datareader.data as pd_reader


def get_benchmark_returns(symbol, first_date, last_date):
    """
    Get a Series of benchmark returns from Google associated with `symbol`.
    Default is `SPY`.

    Parameters
    ----------
    symbol : str
        Benchmark symbol for which we're getting the returns.
    first_date : pd.Timestamp
        First date for which we want to get data.
    last_date : pd.Timestamp
        Last date for which we want to get data. The furthest date that
        Google goes back to is 1993-02-01. It has missing data for
        2008-12-15, 2009-08-11, and 2012-02-02, so we add data for the
        dates for which Google is missing data.

        We're also limited to 4000 days worth of data per request. If we
        make a request for data that extends past 4000 trading days, we'll
        still only receive 4000 days of data.

    first_date is **not** included because we need the close from day N - 1
    to compute the returns for day N.
    """
    data = pd_reader.DataReader(
        symbol,
        'google',
        first_date,
        last_date
    )

    data = data['Close']

    data[pd.Timestamp('2008-12-15')] = np.nan
    data[pd.Timestamp('2009-08-11')] = np.nan
    data[pd.Timestamp('2012-02-02')] = np.nan

    data = data.fillna(method='ffill')

    return data.sort_index().tz_localize('UTC').pct_change(1).iloc[1:]
Looks to me like *all* we need here is to get this function to return "close" pricing for some asset, with date as the index and missing values filled. So we could use anything here. Quandl is a decent source of stock/finance data, and you can do a pip install for Quandl and grab various datasets. Fascinatingly, they do not have the S&P 500 ETF for free. So I am just going to bebop on over to finance.yahoo.com and manually download this dataset. I could write a script to do this, but I plan to eventually use Bitcoin data myself, and there are many ways to get stock pricing data; if I showed some method here, it'd probably just break in a few months anyway. For that reason, I will also host the spy.csv file, because things always change. Now, put that file somewhere. Next, we're going to re-write benchmarks.py
:
import pandas as pd


def get_benchmark_returns(it, doesnt, matter):
    # Zipline passes symbol/first_date/last_date here; we ignore them and
    # serve returns from a local CSV instead.
    full_file_path = "C:\\Users\\H\\Desktop\\local-zipline\\SPY.csv"
    price_column = "Adj Close"

    df = pd.read_csv(full_file_path, parse_dates=True)
    df.set_index(pd.DatetimeIndex(df["Date"]), inplace=True)
    df = df[price_column]
    df = df.fillna(method='ffill')

    return df.sort_index().tz_localize('UTC').pct_change(1).iloc[1:]


if __name__ == "__main__":
    df = get_benchmark_returns(None, None, None)
    print(df.head())
Run and test it, you should see something like:
1993-02-01 00:00:00+00:00    0.007113
1993-02-02 00:00:00+00:00    0.002117
1993-02-03 00:00:00+00:00    0.010572
1993-02-04 00:00:00+00:00    0.004184
1993-02-05 00:00:00+00:00   -0.000696
Name: Adj Close, dtype: float64
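The transform benchmarks.py applies is simple enough to sketch on toy data: take a "close" price series indexed by date, forward-fill gaps, localize to UTC, and convert to daily percent returns (the prices below are made-up numbers, not real SPY data):

```python
import pandas as pd

# Toy close prices with one missing day, indexed by date.
prices = pd.Series(
    [100.0, 101.0, None, 103.0],
    index=pd.to_datetime(["1993-02-01", "1993-02-02",
                          "1993-02-03", "1993-02-04"]),
)

# Same pipeline as benchmarks.py: forward-fill, UTC-localize, daily returns.
prices = prices.ffill()
returns = prices.sort_index().tz_localize("UTC").pct_change(1).iloc[1:]
print(returns)
```

The first return is 101/100 - 1 = 0.01, the filled day is flat at 0.0, and the last is 103/101 - 1.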
So this is how we can specify our own data for benchmarking, if necessary. For some reason, even if you set a custom benchmark, last I checked, this benchmark file will still run. Maybe this has been fixed, but, if it's ever a problem again, this should help!
Great, let's now try to run a back-test! In our notebook:
%zipline --bundle quantopian-quandl --start 2000-1-1 --end 2012-1-1 -o backtest.pickle
Should get some output:
AAPL  algo_volatility  algorithm_period_return  alpha  benchmark_period_return  benchmark_volatility  beta  capital_used  ending_cash  ending_exposure  ...  short_exposure  short_value  shorts_count  sortino  starting_cash  starting_exposure  starting_value  trading_days  transactions  treasury_period_return
2000-01-03 21:00:00+00:00  111.940  NaN       0.000000e+00   NaN       -0.009787  NaN       NaN       0.00      10000000.00  0.0     ...  0  0  0  NaN         10000000.00  0.0     0.0     1  []                                                   0.0658
2000-01-04 21:00:00+00:00  102.500  0.000001  -1.000000e-07  0.000008  -0.048511  0.329103  0.000003  -1026.00  9998974.00   1025.0  ...  0  0  0  -11.224972  10000000.00  0.0     0.0     2  [{'order_id': '4b13a5b0a1884cccbb4960835cf9d4c...   0.0649
2000-01-05 21:00:00+00:00  104.000  0.000013  1.300000e-06   0.000229  -0.046809  0.334622  0.000030  -1041.00  9997933.00   2080.0  ...  0  0  0  119.146981  9998974.00   1025.0  1025.0  3  [{'order_id': '709643250a934fc6879b1db08c00668...   0.0662
2000-01-06 21:00:00+00:00  95.000   0.000148  -1.680000e-05  -0.000915 -0.062128  0.273233  0.000036  -951.00   9996982.00   2850.0  ...  0  0  0  -7.367062   9997933.00   2080.0  2080.0  4  [{'order_id': 'bda01c6ae0e6448d995edf8bd4e91df...   0.0657
2000-01-07 21:00:00+00:00  99.500   0.000179  -3.400000e-06  -0.000119 -0.007660  0.575339  0.000204  -996.00   9995986.00   3980.0  ...  0  0  0  -1.333453   9996982.00   2850.0  2850.0  5  [{'order_id': '6f83fb6921be4eaa9f97711d5432cb3...   0.0652
Again, any time we're using the magic IPython commands (the %), you can just do the same via your command line, just without the % sign! Okay, so you can see above that we get returned a dataframe, which is also output to backtest.pickle. This contains a bunch of stats on our strategy. In the next tutorial, we're going to break those down a bit, showing you a few of your options for visualizing your outputs.