Welcome to part 8 of the Deep Learning with Python, Keras, and Tensorflow series. In this tutorial, we're going to work on using a recurrent neural network to predict against a time-series dataset, which is going to be cryptocurrency prices.
Whenever I do anything finance-related, I get a lot of people saying they don't understand or don't like finance. If you want, feel free to adapt this tutorial to a dataset you like. Basically, it's just a sequence of features that we're interested in. Any example where you have a sequence of features will suffice.
The data we'll be using is Open, High, Low, Close, Volume data for Bitcoin, Ethereum, Litecoin and Bitcoin Cash.
For our purposes here, we're only going to focus on the Close and Volume columns. What are these? The Close column measures the final price at the end of each interval. In this case, these are 1-minute intervals. So, at the end of each minute, what was the price of the asset?
The Volume column is how much of the asset was traded during each interval, in this case, per minute.
In the simplest terms possible, Close is the price of the thing. Volume is how much of the thing was traded.
Okay, so not too complicated.
Now, we have a few of these "things." We're going to be tracking the Close and Volume every minute for Bitcoin, Litecoin, Ethereum, and Bitcoin Cash.
The theory is that these cryptocoins all have relationships with each other. Could we possibly predict future movements of, say, Litecoin, by analyzing the last 60 minutes of prices and volumes for all 4 of these coins? I would guess that there exists some, at least better than random, relationship here that a recurrent neural network could discover.
Only 1 way to find out!
Alright, so how do we do this? Our data isn't already in some beautiful format where we have sequences mapped to targets. In fact, there are no targets at all. It's just some datapoints every 60 seconds. So, we've got some work to do.
First, we need to combine price and volume for each coin into a single featureset, then we want to take these featuresets and combine them into sequences of 60 of these featuresets. This will be our input.
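As a rough sketch of what "sequences of 60 featuresets" means, the snippet below slides a fixed-length window over rows of features. The make_sequences helper and the toy numbers are assumptions for illustration, not code from this series, and a window of 3 stands in for 60 to keep it readable.

```python
from collections import deque

import numpy as np


def make_sequences(rows, seq_len):
    """Hypothetical helper: return every run of seq_len consecutive rows."""
    window = deque(maxlen=seq_len)  # the oldest row drops out automatically
    sequences = []
    for row in rows:
        window.append(row)
        if len(window) == seq_len:  # only keep full windows
            sequences.append(np.array(list(window)))
    return np.array(sequences)


# Toy featuresets: each row is [close, volume] for one minute (made-up values).
toy_rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0], [5.0, 50.0]]
toy_sequences = make_sequences(toy_rows, seq_len=3)
print(toy_sequences.shape)  # (number of windows, seq_len, features per row)
```

With 4 coins and 2 columns each, every row in the real data would hold 8 features instead of 2, but the windowing idea is the same.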
Okay, what about our output? Our targets? Well, we're trying to predict if price will rise or fall. So, we need to take the "prices" of the item we're trying to predict. Let's stick with saying we're trying to predict the price of Litecoin. So we need to grab the future price of Litecoin, then determine if it's higher or lower than the current price. We need to do this at every step.
Great. Besides that, we still need to build the sequences, normalize the data, and set aside validation data.
Let's get to it then. We need the data. Here's the data: Cryptocurrency pricing training dataset. Download that, then extract it in your project dir. You should have a directory called crypto_data, and inside of it should be four csv files. To read these files in and manipulate them, we're going to use a library called pandas. Open a console/terminal and do pip install pandas.
Let's just look at one of these files:
import pandas as pd
df = pd.read_csv("crypto_data/LTC-USD.csv", names=['time', 'low', 'high', 'open', 'close', 'volume'])
print(df.head())
This is the data for LTC-USD, which is just the USD value for Litecoin. What we want to do is somehow take the close and volume from here, and combine it with the other 3 cryptocurrencies.
main_df = pd.DataFrame()  # begin empty

ratios = ["BTC-USD", "LTC-USD", "BCH-USD", "ETH-USD"]  # the 4 ratios we want to consider
for ratio in ratios:  # begin iteration
    print(ratio)
    dataset = f'crypto_data/{ratio}.csv'  # get the full path to the file (the directory we extracted above)
    df = pd.read_csv(dataset, names=['time', 'low', 'high', 'open', 'close', 'volume'])  # read in specific file

    # rename volume and close to include the ticker so we can still tell which close/volume is which:
    df.rename(columns={"close": f"{ratio}_close", "volume": f"{ratio}_volume"}, inplace=True)

    df.set_index("time", inplace=True)  # set time as index so we can join them on this shared time
    df = df[[f"{ratio}_close", f"{ratio}_volume"]]  # ignore the other columns besides price and volume

    if len(main_df) == 0:  # if the dataframe is empty
        main_df = df  # then it's just the current df
    else:  # otherwise, join this data to the main one
        main_df = main_df.join(df)

main_df.ffill(inplace=True)  # if there are gaps in data, use previously known values (fillna(method="ffill") is deprecated in newer pandas)
main_df.dropna(inplace=True)  # drop any rows that still have missing values
print(main_df.head())  # how did we do??
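To see what the join and forward-fill steps are doing, here is a small sketch on made-up data. The column names A_close and B_close and all of the values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Two toy frames indexed by time; B has no row for time 2 (made-up values).
a = pd.DataFrame({"time": [1, 2, 3], "A_close": [10.0, 11.0, 12.0]}).set_index("time")
b = pd.DataFrame({"time": [1, 3], "B_close": [5.0, 7.0]}).set_index("time")

joined = a.join(b)  # left join on the shared index; the missing row becomes NaN
print(joined)

filled = joined.ffill()  # carry the last known value forward into the gap
print(filled)
```

Since join defaults to a left join, the first file read sets the timestamps we keep. Forward-filling also has nothing to fill from if a value is missing at the very start, which is why a dropna follows it.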
Next, we need to create a target. To do this, we need to know which price we're trying to predict. We also need to know how far out we want to predict. We'll go with Litecoin for now. How far out we can predict probably also depends on how long our sequences are. If our sequence length is 3 (so...3 minutes), we probably can't easily predict out 10 minutes. If our sequence length is 300, 10 might not be as hard. I'd like to go with a sequence length of 60, and a future prediction period of 3. We could also make the prediction a regression question, using a linear activation with the output layer, but, instead, I am going to just go with a binary classification.
If price goes up in 3 minutes, then it's a buy. If it goes down in 3 minutes, it's a not-buy/sell. With all of that in mind, I am going to make the following constants:
SEQ_LEN = 60  # how long of a preceding sequence to collect for RNN
FUTURE_PERIOD_PREDICT = 3 # how far into the future are we trying to predict?
RATIO_TO_PREDICT = "LTC-USD"
Next, I am going to make a simple classification function that we'll use to map in a moment:
def classify(current, future):
    if float(future) > float(current):
        return 1
    else:
        return 0
Pretty simple. This function will take values from 2 columns. If the "future" column is higher, great, it's a 1 (buy). Otherwise it's a 0 (sell). To do this, first, we need a future column!
main_df['future'] = main_df[f'{RATIO_TO_PREDICT}_close'].shift(-FUTURE_PERIOD_PREDICT)
A .shift will just shift the column for us. A negative shift will shift it "up." So shifting up 3 will give us the price 3 minutes in the future, and we're just assigning this to a new column.
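If the direction of the shift is hard to picture, a tiny sketch on made-up prices may help. The toy frame and the shift of 2 are assumptions for illustration, not values from the dataset.

```python
import pandas as pd

toy = pd.DataFrame({"close": [100.0, 101.0, 99.0, 102.0, 103.0]})  # made-up prices
toy["future"] = toy["close"].shift(-2)  # each row now holds the close from 2 rows later
print(toy)
```

Notice that the last 2 rows end up with NaN in the future column, since there is no later data to pull from.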
Now that we've got the future values, we can use them to make a target using the function we made above.
main_df['target'] = list(map(classify, main_df[f'{RATIO_TO_PREDICT}_close'], main_df['future']))
The above can be confusing. Start by ignoring the list() part. It's applied at the very end, and I'll explain it in a minute.
The map() is used to map a function. The first parameter here is the function we want to map (classify), then the next ones are the parameters to that function. In this case, the current close price, and then the future price.
The map part is what allows us to do this row-by-row for these columns, but also do it quite fast. The list part converts the end result to a list, which we can just set as a column.
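As a sanity check on what the map is producing, here is a sketch on made-up prices that compares it with a plain pandas column comparison. The toy frame and the vectorized version are my own illustration, not part of the original code.

```python
import pandas as pd


def classify(current, future):
    # Same rule as the function above: 1 if the future price is higher, else 0.
    return 1 if float(future) > float(current) else 0


toy = pd.DataFrame({"close": [100.0, 101.0, 99.0, 102.0]})  # made-up prices
toy["future"] = toy["close"].shift(-1)  # the last row gets NaN here

mapped = list(map(classify, toy["close"], toy["future"]))
vectorized = (toy["future"] > toy["close"]).astype(int).tolist()
print(mapped)
print(vectorized)
```

Both approaches give the same labels. One thing to keep in mind is that a row whose future value is NaN compares as "not higher," so it gets a 0 even though we don't actually know the future price there.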
Great, let's check out the data:
print(main_df.head())
Looking great! Let's make sequences and train!!!
Not so fast there Carl. We still need to make validation data, sequences, and normalize the data! We have a lot of work still. We will pick up on this in the next tutorial, see you there!