Welcome to part 8 of the Deep Learning with Python, Keras, and TensorFlow series. In this tutorial, we're going to work on using a recurrent neural network to predict against a time-series dataset, which is going to be cryptocurrency prices.

Whenever I do anything finance-related, I get a lot of people saying they don't understand or don't like finance. If you want, feel free to adapt this tutorial to a dataset you like. Basically, it's just a sequence of features that we're interested in. Any example where you have a sequence of features will suffice.

The data we'll be using is Open, High, Low, Close, Volume data for Bitcoin, Ethereum, Litecoin and Bitcoin Cash.

For our purposes here, we're only going to focus on the Close and Volume columns. What are these? The Close column is the final price at the end of each interval. In this case, these are 1-minute intervals, so it's the price of the asset at the end of each minute.

The Volume column is how much of the asset was traded during each interval, in this case, each minute.

In the simplest terms possible, Close is the price of the thing. Volume is how much of the thing was traded.

Okay, so not too complicated.

Now, we have a few of these "things." We're going to be tracking the Close and Volume every minute for Bitcoin, Litecoin, Ethereum, and Bitcoin Cash.

The theory is that these cryptocoins all have relationships with each other. Could we possibly predict future movements of, say, Litecoin, by analyzing the last 60 minutes of prices and volumes for all 4 of these coins? I would guess that there exists some, at least better than random, relationship here that a recurrent neural network could discover.

Only 1 way to find out!

Alright, so how do we do this? Our data isn't already in some beautiful format where we have sequences mapped to targets. In fact, there are no targets at all. It's just some datapoints every 60 seconds. So, we've got some work to do.

First, we need to combine price and volume for each coin into a single featureset, then we want to take these featuresets and combine them into sequences of 60 of these featuresets. This will be our input.
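To make "sequences of featuresets" a bit more concrete, here's a minimal sketch on made-up numbers. It assumes a sliding window built with collections.deque; the toy window length of 3, the name TOY_SEQ_LEN, and the values themselves are placeholders for illustration, not what we'll actually train on.

```python
from collections import deque

import numpy as np

TOY_SEQ_LEN = 3  # hypothetical toy value, just to keep the output small

# Each row is one featureset: [close, volume] for a single made-up coin.
featuresets = np.array([
    [10.0, 1.0],
    [11.0, 2.0],
    [12.0, 3.0],
    [13.0, 4.0],
    [14.0, 5.0],
])

window = deque(maxlen=TOY_SEQ_LEN)  # the oldest row drops out once the window is full
sequences = []
for row in featuresets:
    window.append(row)
    if len(window) == TOY_SEQ_LEN:
        sequences.append(np.array(window))  # snapshot of the current window

sequences = np.array(sequences)
print(sequences.shape)  # (3, 3, 2): 3 sequences, 3 steps each, 2 features per step
```

With 4 coins and 2 features each, every step would hold 8 values instead of 2, and every sequence would hold 60 steps instead of 3.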

Okay, what about our output? Our targets? Well, we're trying to predict if price will rise or fall. So, we need to take the "prices" of the item we're trying to predict. Let's stick with saying we're trying to predict the price of Litecoin. So we need to grab the future price of Litecoin, then determine if it's higher or lower than the current price. We need to do this at every step.

Great, besides that, we need to:

Balance the dataset between buys and sells. We can also use class weights, but balance is superior.
Scale/normalize the data in some way.
Create reasonable out of sample data that works with the problem.
???
Profit!
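As a rough idea of what two of those steps could look like, here's a sketch on a made-up frame. It assumes out-of-sample data means holding out the most recent slice of time (rather than a random sample), and that balancing means downsampling the larger class. The 80/20 split, the column names, and the numbers are all assumptions for illustration.

```python
import pandas as pd

# Made-up frame: a time-ordered index, one feature, and a 0/1 target.
toy = pd.DataFrame(
    {
        "close": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
        "target": [1, 1, 1, 0, 1, 0, 1, 1, 0, 1],
    },
    index=range(0, 600, 60),
)

# Out of sample: hold out the most recent 20% of rows, not a random 20%.
split_at = int(len(toy) * 0.8)
train, validation = toy.iloc[:split_at], toy.iloc[split_at:]

# Balance: trim the larger class in the training data down to the smaller one.
buys = train[train["target"] == 1]
sells = train[train["target"] == 0]
smaller = min(len(buys), len(sells))
balanced = pd.concat([buys.iloc[:smaller], sells.iloc[:smaller]]).sort_index()

print(len(train), len(validation))
print(balanced["target"].value_counts().to_dict())
```

Splitting by time keeps the validation rows strictly later than anything the model trained on, which is closer to how a prediction would actually be used.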

Let's get to it then. We need the data. Here's the data: Cryptocurrency pricing training dataset. Download that, then extract it in your project dir. You should have a directory called crypto_data and inside of it should be four csv files. To read these files in and manipulate them, we're going to use a library called pandas. Open a console/terminal and do pip install pandas.

Let's just look at one of these files:

import pandas as pd

df = pd.read_csv("crypto_data/LTC-USD.csv", names=['time', 'low', 'high', 'open', 'close', 'volume'])

print(df.head())

         time        low       high       open      close      volume
0  1528968660  96.580002  96.589996  96.589996  96.580002    9.647200
1  1528968720  96.449997  96.669998  96.589996  96.660004  314.387024
2  1528968780  96.470001  96.570000  96.570000  96.570000   77.129799
3  1528968840  96.449997  96.570000  96.570000  96.500000    7.216067
4  1528968900  96.279999  96.540001  96.500000  96.389999  524.539978
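A quick aside on the time column in the output above. The values step by 60, which fits 1-minute data, and they look like Unix timestamps. Assuming they are seconds since 1970-01-01 UTC (the data itself doesn't say so), pandas can turn them into readable datetimes:

```python
import pandas as pd

# First two timestamps from the LTC-USD output above.
times = pd.Series([1528968660, 1528968720])

# Assumption: the values are seconds since the Unix epoch, in UTC.
readable = pd.to_datetime(times, unit="s")

print(readable.iloc[0])         # 2018-06-14 09:31:00
print(readable.diff().iloc[1])  # 0 days 00:01:00
```

We don't need this conversion for training, since the raw integers work fine as an index to join on.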

This is the data for LTC-USD, which is just the USD value for Litecoin. What we want to do is somehow take the close and volume from here, and combine them with the same columns from the other 3 cryptocurrencies.

main_df = pd.DataFrame() # begin empty

ratios = ["BTC-USD", "LTC-USD", "BCH-USD", "ETH-USD"]  # the 4 ratios we want to consider
for ratio in ratios:  # begin iteration
    print(ratio)
    dataset = f'crypto_data/{ratio}.csv'  # relative path to the file, inside the crypto_data directory we extracted
    df = pd.read_csv(dataset, names=['time', 'low', 'high', 'open', 'close', 'volume'])  # read in specific file

    # rename volume and close to include the ticker so we can still tell which close/volume is which:
    df.rename(columns={"close": f"{ratio}_close", "volume": f"{ratio}_volume"}, inplace=True)

    df.set_index("time", inplace=True)  # set time as index so we can join them on this shared time
    df = df[[f"{ratio}_close", f"{ratio}_volume"]]  # ignore the other columns besides price and volume

    if len(main_df)==0:  # if the dataframe is empty
        main_df = df  # then it's just the current df
    else:  # otherwise, join this data to the main one
        main_df = main_df.join(df)

main_df.ffill(inplace=True)  # if there are gaps in data, use previously known values (fillna(method="ffill") is deprecated in newer pandas)
main_df.dropna(inplace=True)
print(main_df.head())  # how did we do??

BTC-USD
LTC-USD
BCH-USD
ETH-USD
            BTC-USD_close  BTC-USD_volume  LTC-USD_close  LTC-USD_volume  \
time
1528968720    6487.379883        7.706374      96.660004      314.387024
1528968780    6479.410156        3.088252      96.570000       77.129799
1528968840    6479.410156        1.404100      96.500000        7.216067
1528968900    6479.979980        0.753000      96.389999      524.539978
1528968960    6480.000000        1.490900      96.519997       16.991997

            BCH-USD_close  BCH-USD_volume  ETH-USD_close  ETH-USD_volume
time
1528968720     870.859985       26.856577      486.01001       26.019083
1528968780     870.099976        1.124300      486.00000        8.449400
1528968840     870.789978        1.749862      485.75000       26.994646
1528968900     870.000000        1.680500      486.00000       77.355759
1528968960     869.989990        1.669014      486.00000        7.503300
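If the join and forward-fill steps feel a bit magical, here's a small sketch of what they do, using two made-up coins. The names coin_a and coin_b and all of the values are hypothetical. Note that DataFrame.join defaults to a left join, so only the timestamps present in the first frame survive.

```python
import pandas as pd

# Two made-up coins indexed by time; coin_b has no row for time 120.
coin_a = pd.DataFrame({"A_close": [1.0, 1.1, 1.2]}, index=[60, 120, 180])
coin_b = pd.DataFrame({"B_close": [5.0, 5.2]}, index=[60, 180])
coin_a.index.name = coin_b.index.name = "time"

joined = coin_a.join(coin_b)  # left join: keeps every time present in coin_a
print(joined)                 # B_close is NaN at time 120

filled = joined.ffill()       # carry the last known B_close forward into the gap
print(filled)                 # B_close at time 120 becomes 5.0
```

That left-join behavior, or the dropna that follows, is a likely reason the head above starts at 1528968720 even though the LTC-USD file starts a minute earlier.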

Next, we need to create a target. To do this, we need to know which price we're trying to predict. We also need to know how far out we want to predict. We'll go with Litecoin for now. Knowing how far out we want to predict probably also depends on how long our sequences are. If our sequence length is 3 (so...3 minutes), we probably can't easily predict out 10 minutes. If our sequence length is 300, 10 might not be as hard. I'd like to go with a sequence length of 60, and predict 3 periods into the future. We could also make the prediction a regression question, using a linear activation with the output layer, but, instead, I am going to just go with a binary classification.

If price goes up in 3 minutes, then it's a buy. If it goes down in 3 minutes (or doesn't move), it's a sell, or at least not a buy. With all of that in mind, I am going to make the following constants:

SEQ_LEN = 60  # how long of a preceding sequence to collect for RNN
FUTURE_PERIOD_PREDICT = 3  # how far into the future are we trying to predict?
RATIO_TO_PREDICT = "LTC-USD"

Next, I am going to make a simple classification function that we'll use to map in a moment:

def classify(current, future):
    if float(future) > float(current):
        return 1
    else:
        return 0

Pretty simple. This function will take values from 2 columns. If the "future" column is higher, great, it's a 1 (buy). Otherwise it's a 0 (sell). To do this, first, we need a future column!

main_df['future'] = main_df[f'{RATIO_TO_PREDICT}_close'].shift(-FUTURE_PERIOD_PREDICT)

A .shift will just shift the values in a column for us, and a negative shift moves them "up." So shifting up 3 rows gives us the price 3 minutes in the future, and we're just assigning this to a new column. The last 3 rows have no future price to pull from, so they end up as NaN.
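Here's a small sketch of that shift on its own, using the five LTC-USD closes from the joined output earlier. One caveat: shift moves by rows, not by clock time, so "3 rows later" only means "3 minutes later" if no minute is missing from the index.

```python
import pandas as pd

# The five LTC-USD closes shown in the joined output earlier.
close = pd.Series(
    [96.660004, 96.570000, 96.500000, 96.389999, 96.519997],
    index=[1528968720, 1528968780, 1528968840, 1528968900, 1528968960],
)

future = close.shift(-3)  # each row now holds the close from 3 rows later
print(future)
```

With only five rows, the first two get a future value (96.389999 and 96.519997) and the last three are NaN, because there is nothing 3 rows below them to pull up.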

Now that we've got the future values, we can use them to make a target using the function we made above.

main_df['target'] = list(map(classify, main_df[f'{RATIO_TO_PREDICT}_close'], main_df['future']))

The above can be confusing. Start by ignoring the list() part. It's only applied at the very end, and I'll explain it in a minute.

The map() is used to map a function. The first parameter here is the function we want to map (classify), and the ones after it supply the arguments to that function, one pair at a time. In this case, that's the current close price and the future price.

The map part is what lets us apply classify row by row across these two columns without writing the loop ourselves, and it's still quite fast. The list part converts the end result to a list, which we can just set as a column.
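For comparison, here's a sketch showing that a plain column comparison gives the same labels as the map() version. The toy values are made up, and the vectorized form and the dropna step are an alternative I'm suggesting, not what this tutorial uses. It also shows one thing to watch for: rows where future is NaN get labeled 0, because any comparison against NaN is False.

```python
import numpy as np
import pandas as pd


def classify(current, future):
    # Same logic as the tutorial's function: 1 if price rises, else 0.
    return 1 if float(future) > float(current) else 0


# Toy closes, with the "future" column shifted up 3 rows.
toy = pd.DataFrame({"close": [96.39, 96.52, 96.44, 96.47, 96.40]})
toy["future"] = toy["close"].shift(-3)

mapped = list(map(classify, toy["close"], toy["future"]))
vectorized = (toy["future"] > toy["close"]).astype(int)

print(mapped)               # [1, 0, 0, 0, 0]
print(vectorized.tolist())  # [1, 0, 0, 0, 0]

# The last 3 rows have no real future price, so their 0 is not a real "sell".
labeled = toy.dropna(subset=["future"])
print(len(labeled))         # 2
```

On the full dataset only the final 3 rows are affected, so the impact is tiny, but dropping them keeps the targets honest.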

Great, let's check out the data:

print(main_df.head())

            BTC-USD_close  BTC-USD_volume  LTC-USD_close  LTC-USD_volume  \
time
1528968720    6487.379883        7.706374      96.660004      314.387024
1528968780    6479.410156        3.088252      96.570000       77.129799
1528968840    6479.410156        1.404100      96.500000        7.216067
1528968900    6479.979980        0.753000      96.389999      524.539978
1528968960    6480.000000        1.490900      96.519997       16.991997

            BCH-USD_close  BCH-USD_volume  ETH-USD_close  ETH-USD_volume  \
time
1528968720     870.859985       26.856577      486.01001       26.019083
1528968780     870.099976        1.124300      486.00000        8.449400
1528968840     870.789978        1.749862      485.75000       26.994646
1528968900     870.000000        1.680500      486.00000       77.355759
1528968960     869.989990        1.669014      486.00000        7.503300

               future  target
time
1528968720  96.389999       0
1528968780  96.519997       0
1528968840  96.440002       0
1528968900  96.470001       1
1528968960  96.400002       0

Looking great! Let's make sequences and train!!!

Not so fast there Carl. We still need to make validation data, sequences, and normalize the data! We have a lot of work still. We will pick up on this in the next tutorial, see you there!

Creating a Cryptocurrency-predicting finance recurrent neural network - Deep Learning basics with Python, TensorFlow and Keras p.8