Leading up to this tutorial, we've learned about recurrent neural networks and deployed one on a simpler dataset. Now we are working on applying one to a more realistic dataset to try to predict cryptocurrency price movements.
So far, our code is:
import pandas as pd
SEQ_LEN = 60  # how long of a preceding sequence to collect for RNN
FUTURE_PERIOD_PREDICT = 3 # how far into the future are we trying to predict?
RATIO_TO_PREDICT = "LTC-USD"
def classify(current, future):
    if float(future) > float(current):
        return 1
    else:
        return 0
main_df = pd.DataFrame() # begin empty
ratios = ["BTC-USD", "LTC-USD", "BCH-USD", "ETH-USD"] # the 4 ratios we want to consider
for ratio in ratios:  # begin iteration
    print(ratio)
    dataset = f'training_datas/{ratio}.csv'  # get the full path to the file.
    df = pd.read_csv(dataset, names=['time', 'low', 'high', 'open', 'close', 'volume'])  # read in specific file

    # rename volume and close to include the ticker so we can still tell which close/volume is which:
    df.rename(columns={"close": f"{ratio}_close", "volume": f"{ratio}_volume"}, inplace=True)

    df.set_index("time", inplace=True)  # set time as index so we can join them on this shared time
    df = df[[f"{ratio}_close", f"{ratio}_volume"]]  # ignore the other columns besides price and volume

    if len(main_df) == 0:  # if the dataframe is empty
        main_df = df  # then it's just the current df
    else:  # otherwise, join this data to the main one
        main_df = main_df.join(df)

main_df.fillna(method="ffill", inplace=True)  # if there are gaps in data, use previously known values
main_df.dropna(inplace=True)
print(main_df.head())  # how did we do??
main_df['future'] = main_df[f'{RATIO_TO_PREDICT}_close'].shift(-FUTURE_PERIOD_PREDICT)
main_df['target'] = list(map(classify, main_df[f'{RATIO_TO_PREDICT}_close'], main_df['future']))
print(main_df.head())
The first thing I would like to do is separate out our validation (out-of-sample) data. In the past, all we did was shuffle the data, then slice it. Does that make sense here though?
The problem with that method is that the data is inherently sequential, so a validation set that doesn't come strictly from the future (relative to training) is likely a mistake. Sequences that are, for example, 1 minute apart will be almost identical, and chances are the target will be the same too (buy or sell). Because of this, any overfitting would spill right over into the validation set. Instead, we want to slice off our validation data while it's still in order. I'd like to take the last 5% of the data. To do that, we'll do:
times = sorted(main_df.index.values) # get the times
last_5pct = times[-int(0.05*len(times))]  # get the timestamp that marks the start of the last 5% of the data
validation_main_df = main_df[(main_df.index >= last_5pct)] # make the validation data where the index is in the last 5%
main_df = main_df[(main_df.index < last_5pct)] # now the main_df is all the data up to the last 5%
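If you want to sanity-check the split (just a throwaway print, not something we need going forward), the validation set should come out to roughly 5% of the rows:
print(f"training rows: {len(main_df)}, validation rows: {len(validation_main_df)}")  # expect roughly a 95/5 split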
Next, we need to balance and normalize this data. By balance, we mean making sure each class is represented by an equal number of samples during training, so our model doesn't just learn to always predict one class.
One way to counteract this is to use class weights, which let you weight the loss higher for less frequent classes. That said, I've never personally seen this be truly comparable to an actually balanced dataset.
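For reference, if you did want to go the class-weight route, Keras accepts a class_weight dictionary in model.fit. Here's a rough sketch of what that might look like (this is not what we'll do in this series, and it assumes train_x/train_y are the arrays we'll get out of preprocess_df later):
# hypothetical sketch: weight the loss instead of balancing the dataset
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=np.array(train_y))
class_weights = {0: weights[0], 1: weights[1]}  # the rarer class gets the larger weight
# model.fit(train_x, np.array(train_y), class_weight=class_weights)  # passed to Keras at training time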
We also need to take our data and make sequences from it.
So...we've got some work to do! We'll start by making a function that will process the dataframes, so we can just do something like:
train_x, train_y = preprocess_df(main_df)
validation_x, validation_y = preprocess_df(validation_main_df)
Let's start by removing the future column (the actual target column is literally called target; we only needed the future column temporarily to create it). Then, we need to scale our data:
from sklearn import preprocessing  # pip install scikit-learn ... if you don't have it!
from collections import deque  # used below to build the rolling sequences
import numpy as np
import random

def preprocess_df(df):
    df = df.drop("future", axis=1)  # don't need this anymore.

    for col in df.columns:  # go through all of the columns
        if col != "target":  # normalize all ... except for the target itself!
            df[col] = df[col].pct_change()  # pct change "normalizes" the different currencies (each coin has vastly different prices; we really care about the relative movements)
            df.dropna(inplace=True)  # remove the NaNs created by pct_change
            df[col] = preprocessing.scale(df[col].values)  # standardize to zero mean and unit variance

    df.dropna(inplace=True)  # cleanup again... jic. Those nasty NaNs love to creep in.
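If you're curious what that normalization actually does, here's a tiny throwaway example with made-up prices (not part of the tutorial code):
import pandas as pd
from sklearn import preprocessing

prices = pd.Series([100.0, 105.0, 102.9, 110.0])
returns = prices.pct_change().dropna()      # [0.05, -0.02, 0.069...]: percent moves, comparable across coins
print(preprocessing.scale(returns.values))  # same moves, rescaled to zero mean / unit variance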
Alright, we've normalized and scaled the data! Next up, we need to create our actual sequences. To do this (still inside preprocess_df):
    sequential_data = []  # this is a list that will CONTAIN the sequences
    prev_days = deque(maxlen=SEQ_LEN)  # these will be our actual sequences. Made with deque, which keeps the maximum length by popping out older values as new ones come in

    for i in df.values:  # iterate over the rows (as numpy arrays)
        prev_days.append([n for n in i[:-1]])  # store all but the target
        if len(prev_days) == SEQ_LEN:  # make sure we have a full 60-step sequence!
            sequential_data.append([np.array(prev_days), i[-1]])  # append the sequence along with its target!

    random.shuffle(sequential_data)  # shuffle for good measure.
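If the deque part feels like magic, here's a tiny standalone demo of how a deque with maxlen builds fixed-length rolling windows (using a made-up window of 3 instead of our SEQ_LEN of 60):
from collections import deque

window = deque(maxlen=3)
for value in [1, 2, 3, 4, 5]:
    window.append(value)
    print(list(window))  # prints [1], [1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5]
Once the deque is full, every new value pushes the oldest one out, which is exactly how we slide a 60-minute window across the data.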
We've got our data, we've got sequences, we've got the data normalized, and we've got it scaled. Now we just need to balance it, and then we're ready to rumble! In the next tutorial, we'll balance the data.