Creating a Convolutional Neural Network in Pytorch

Welcome to part 6 of the deep learning with Python and Pytorch tutorials. Leading up to this tutorial, we've covered how to make a basic neural network, and now we're going to cover how to make a slightly more complex neural network: The convolutional neural network, or Convnet/CNN.

Code up to this point:

import os
import cv2
import numpy as np
from tqdm import tqdm


REBUILD_DATA = False # set to True to run once, then back to False unless you want to change something in your training data.

class DogsVSCats():
    IMG_SIZE = 50
    CATS = "PetImages/Cat"
    DOGS = "PetImages/Dog"
    TESTING = "PetImages/Testing"
    LABELS = {CATS: 0, DOGS: 1}
    training_data = []

    catcount = 0
    dogcount = 0

    def make_training_data(self):
        for label in self.LABELS:
            print(label)
            for f in tqdm(os.listdir(label)):
                if "jpg" in f:
                    try:
                        path = os.path.join(label, f)
                        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
                        img = cv2.resize(img, (self.IMG_SIZE, self.IMG_SIZE))
                        self.training_data.append([np.array(img), np.eye(2)[self.LABELS[label]]])  # np.eye(2)[i] is row i of the identity matrix, i.e. the one_hot vector for class i
                        #print(np.eye(2)[self.LABELS[label]])

                        if label == self.CATS:
                            self.catcount += 1
                        elif label == self.DOGS:
                            self.dogcount += 1

                    except Exception as e:
                        pass
                        #print(label, f, str(e))

        np.random.shuffle(self.training_data)
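        # dtype=object because each entry pairs a 50x50 image with a length-2 label (a ragged structure)
        np.save("training_data.npy", np.array(self.training_data, dtype=object))
        print('Cats:', self.catcount)
        print('Dogs:', self.dogcount)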

if REBUILD_DATA:
    dogsvcats = DogsVSCats()
    dogsvcats.make_training_data()


training_data = np.load("training_data.npy", allow_pickle=True)
print(len(training_data))

24946
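As a quick aside on the np.eye(2)[...] trick used for the labels above: indexing a row of the identity matrix gives you the one_hot vector for that class. Here's a minimal sketch, where the short "cat"/"dog" keys are just hypothetical stand-ins for the directory paths used in the class:

```python
import numpy as np

# Hypothetical short label names, for illustration only.
LABELS = {"cat": 0, "dog": 1}

# Row i of the 2x2 identity matrix is the one-hot vector for class i.
cat_vec = np.eye(2)[LABELS["cat"]]
dog_vec = np.eye(2)[LABELS["dog"]]

print(cat_vec)  # [1. 0.]
print(dog_vec)  # [0. 1.]
```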

Now, we're going to build the convnet. We'll begin with some basic imports:

import torch
import torch.nn as nn
import torch.nn.functional as F

Next, we'll make a Net class again, this time having the layers be convolutional:

class Net(nn.Module):
    def __init__(self):
        super().__init__() # just run the init of parent class (nn.Module)
        self.conv1 = nn.Conv2d(1, 32, 5) # 1 input channel (grayscale image), 32 output channels, 5x5 kernel / window
        self.conv2 = nn.Conv2d(32, 64, 5) # input is 32, bc the first layer output 32. Then we say the output will be 64 channels, 5x5 conv
        self.conv3 = nn.Conv2d(64, 128, 5)

The layers take 1 more parameter after the input and output channel counts, which is the kernel size. This is the size of the "window" of pixels that slides over the image. A 5 means we're doing a sliding 5x5 window for convolutions.

The same rules apply as before: the first layer takes in 1 channel (our grayscale image) and outputs 32 channels of convolutional features, then the next layer takes in those 32 channels and outputs 64...and so on.
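To make the channel and kernel numbers a bit more concrete, here's a small sketch that pushes one random 50x50 "image" through a layer defined like conv1. The random input is just stand-in data, not anything from our dataset:

```python
import torch
import torch.nn as nn

conv1 = nn.Conv2d(1, 32, 5)             # 1 input channel, 32 output channels, 5x5 kernel
fake_image = torch.randn(1, 1, 50, 50)  # (batch, channels, height, width), random stand-in data

out = conv1(fake_image)
print(conv1.weight.shape)  # torch.Size([32, 1, 5, 5])
print(out.shape)           # torch.Size([1, 32, 46, 46])
```

With no padding and a stride of 1, a 5x5 window fits into 50 pixels in 46 positions, which is why the height and width shrink from 50 to 46.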

Now comes a new concept. Convolutional features are just that, they're convolutions, maybe max-pooled convolutions, but they aren't flat. We need to flatten them, like we need to flatten an image before passing it through a regular layer.

...but how?

So this is an example that really annoyed me with both TensorFlow and now Pytorch documentation when I was trying to learn things. For example, here's some of the convolutional neural network sample code from Pytorch's examples directory on their github:

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 20, 5, 1)
        self.conv2 = nn.Conv2d(20, 50, 5, 1)
        self.fc1 = nn.Linear(4*4*50, 500)
        self.fc2 = nn.Linear(500, 10)

As I mentioned, you need to flatten the output from the last convolutional layer before you can pass it through a regular "dense" layer (or what Pytorch calls a linear layer). So, looking at this code, you see the input to the first fully connected layer is: 4*4*50.

Where has this come from? What are the 4s? What's the 50?

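These numbers are never explained in any of the docs that I found. This was also a huge headache with TensorFlow initially, but it has since been made super easy with a flatten method. I definitely wish this existed in Pytorch! (Newer Pytorch releases do include torch.flatten and nn.Flatten, but you still need to know the flattened size to define the first linear layer.)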

So instead, I will do my best to help explain how to do this yourself. I did a lot of googling and research to see if I could find a better solution than what I am about to show. There surely is something better, but I couldn't find it.
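For reference, the 4*4*50 in that sample can be worked out by hand, if we assume a 28x28 input image (MNIST-sized) and a 2x2 max pool after each convolution. Neither of those is visible in the snippet above, so treat both as assumptions. A rough sketch of the arithmetic:

```python
def conv_out(size, kernel, stride=1):
    """Side length after a square convolution with no padding."""
    return (size - kernel) // stride + 1

def pool_out(size, window=2):
    """Side length after non-overlapping max pooling."""
    return size // window

size = 28                            # ASSUMPTION: 28x28 input image
size = pool_out(conv_out(size, 5))   # conv1: 28 -> 24, pool: 24 -> 12
size = pool_out(conv_out(size, 5))   # conv2: 12 -> 8,  pool: 8 -> 4
channels = 50                        # output channels of the last conv layer

flat = size * size * channels
print(size, channels, flat)  # 4 50 800
```

So the 4s would be the height and width of the final feature map, and the 50 is the number of channels coming out of conv2.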

The way I am going to solve this is to simply determine the actual shape of the flattened output after the convolutional layers.

How? Well, we can simply pass some fake data through the convolutional layers once, just to get the shape. I'll use a flag to determine whether or not the size still needs to be calculated. We could keep the code cleaner by recomputing that value on every pass, but I'd rather have faster speeds and do the calculation only one time:

class Net(nn.Module):
    def __init__(self):
        super().__init__() # just run the init of parent class (nn.Module)
        self.conv1 = nn.Conv2d(1, 32, 5) # 1 input channel (grayscale image), 32 output channels, 5x5 kernel / window
        self.conv2 = nn.Conv2d(32, 64, 5) # input is 32, bc the first layer output 32. Then we say the output will be 64 channels, 5x5 kernel / window
        self.conv3 = nn.Conv2d(64, 128, 5)

        x = torch.randn(50,50).view(-1,1,50,50)
        self._to_linear = None
        self.convs(x)

Whenever we initialize, we create some random data, set self._to_linear to None, then pass this random x data through self.convs, which doesn't exist yet.

What we're going to do is have self.convs be a part of our forward method. Separating it out just means we can call just this part as needed, without needing to do a full call.

class Net(nn.Module):
    def __init__(self):
        super().__init__() # just run the init of parent class (nn.Module)
        self.conv1 = nn.Conv2d(1, 32, 5) # 1 input channel (grayscale image), 32 output channels, 5x5 kernel / window
        self.conv2 = nn.Conv2d(32, 64, 5) # input is 32, bc the first layer output 32. Then we say the output will be 64 channels, 5x5 kernel / window
        self.conv3 = nn.Conv2d(64, 128, 5)

        x = torch.randn(50,50).view(-1,1,50,50)
        self._to_linear = None
        self.convs(x)

    def convs(self, x):
        # max pooling over 2x2
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        x = F.max_pool2d(F.relu(self.conv2(x)), (2, 2))
        x = F.max_pool2d(F.relu(self.conv3(x)), (2, 2))

        if self._to_linear is None:
            self._to_linear = x[0].shape[0]*x[0].shape[1]*x[0].shape[2]
        return x

Slightly more complicated forward pass here, but not too bad. With:

x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))

First we have: F.relu(self.conv1(x)). This is the same as with our regular neural network. We're just running rectified linear on the convolutional layers. Then, we run that through a F.max_pool2d, with a 2x2 window.
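If it helps to see what those two steps do to actual numbers, here's a tiny sketch on a made-up 4x4 input (the values are arbitrary, picked only so the result is easy to check by eye):

```python
import torch
import torch.nn.functional as F

# Made-up values, shape (batch=1, channels=1, height=4, width=4).
x = torch.tensor([[[[-1.0,  2.0,  0.5, -3.0],
                    [ 4.0, -2.0,  1.0,  0.0],
                    [ 0.0,  1.0, -1.0, -1.0],
                    [ 3.0, -5.0, -2.0,  6.0]]]])

activated = F.relu(x)                     # negatives become 0
pooled = F.max_pool2d(activated, (2, 2))  # keep the max of each 2x2 block

print(pooled.shape)  # torch.Size([1, 1, 2, 2])
print(pooled)        # values: [[4., 1.], [3., 6.]]
```

Each 2x2 pooling step halves the height and width, which is a big part of why the feature maps shrink so quickly as we go through the layers.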

Now, if we have not yet calculated what it takes to flatten (self._to_linear), we want to do that. All we need to do for that is to just grab the dimensions and multiply them. For example, if the shape of the tensor is (2,5,3), you just need to do 2x5x3 (30). If we need to calc that, we do, and store that, so we can continue to reference it. At the end of the convs method, we just need to return x, which can then continue to be passed through more layers.
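Here's that (2,5,3) example as a quick sketch. The batch of 4 is an arbitrary number chosen for illustration, and numel() is just a shortcut for multiplying the dimensions together:

```python
import torch

x = torch.zeros(4, 2, 5, 3)  # hypothetical batch of 4 feature maps, each of shape (2, 5, 3)

by_hand = x[0].shape[0] * x[0].shape[1] * x[0].shape[2]  # 2 * 5 * 3
per_sample = x[0].numel()                                 # same number, computed by Pytorch

flat = x.view(-1, per_sample)  # one flat row per sample
print(by_hand, per_sample, flat.shape)  # 30 30 torch.Size([4, 30])
```

Note that we multiply the dimensions of x[0], a single sample, so the batch size never ends up in the flattened size.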

We do want some more layers, so let's add those to the __init__ method:

self.fc1 = nn.Linear(self._to_linear, 512) #flattening.
self.fc2 = nn.Linear(512, 2) # 512 in, 2 out bc we're doing 2 classes (dog vs cat).

Making our full class now:

class Net(nn.Module):
    def __init__(self):
        super().__init__() # just run the init of parent class (nn.Module)
        self.conv1 = nn.Conv2d(1, 32, 5) # 1 input channel (grayscale image), 32 output channels, 5x5 kernel / window
        self.conv2 = nn.Conv2d(32, 64, 5) # input is 32, bc the first layer output 32. Then we say the output will be 64 channels, 5x5 kernel / window
        self.conv3 = nn.Conv2d(64, 128, 5)

        x = torch.randn(50,50).view(-1,1,50,50)
        self._to_linear = None
        self.convs(x)

        self.fc1 = nn.Linear(self._to_linear, 512) #flattening.
        self.fc2 = nn.Linear(512, 2) # 512 in, 2 out bc we're doing 2 classes (dog vs cat).

    def convs(self, x):
        # max pooling over 2x2
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        x = F.max_pool2d(F.relu(self.conv2(x)), (2, 2))
        x = F.max_pool2d(F.relu(self.conv3(x)), (2, 2))

        if self._to_linear is None:
            self._to_linear = x[0].shape[0]*x[0].shape[1]*x[0].shape[2]
        return x

Finally, we can write our forward method, which will make use of our existing convs method:

    def forward(self, x):
        x = self.convs(x)
        x = x.view(-1, self._to_linear)  # .view is reshape ... this flattens x before the fully connected layers
        x = F.relu(self.fc1(x))
        x = self.fc2(x) # bc this is our output layer. No activation here.
        return F.softmax(x, dim=1)

The input first goes through our convs method, which we separated out so that we could run just that part on its own. After that, we flatten the result, pass it through one regular fully connected layer, and then through a final layer that serves as our output layer.

And that's our convolutional neural network. Full code up to this point:

import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__() # just run the init of parent class (nn.Module)
        self.conv1 = nn.Conv2d(1, 32, 5) # 1 input channel (grayscale image), 32 output channels, 5x5 kernel / window
        self.conv2 = nn.Conv2d(32, 64, 5) # input is 32, bc the first layer output 32. Then we say the output will be 64 channels, 5x5 kernel / window
        self.conv3 = nn.Conv2d(64, 128, 5)

        x = torch.randn(50,50).view(-1,1,50,50)
        self._to_linear = None
        self.convs(x)

        self.fc1 = nn.Linear(self._to_linear, 512) #flattening.
        self.fc2 = nn.Linear(512, 2) # 512 in, 2 out bc we're doing 2 classes (dog vs cat).

    def convs(self, x):
        # max pooling over 2x2
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        x = F.max_pool2d(F.relu(self.conv2(x)), (2, 2))
        x = F.max_pool2d(F.relu(self.conv3(x)), (2, 2))

        if self._to_linear is None:
            self._to_linear = x[0].shape[0]*x[0].shape[1]*x[0].shape[2]
        return x

    def forward(self, x):
        x = self.convs(x)
        x = x.view(-1, self._to_linear)  # .view is reshape ... this flattens x before the fully connected layers
        x = F.relu(self.fc1(x))
        x = self.fc2(x) # bc this is our output layer. No activation here.
        return F.softmax(x, dim=1)


net = Net()
print(net)

Net(
  (conv1): Conv2d(1, 32, kernel_size=(5, 5), stride=(1, 1))
  (conv2): Conv2d(32, 64, kernel_size=(5, 5), stride=(1, 1))
  (conv3): Conv2d(64, 128, kernel_size=(5, 5), stride=(1, 1))
  (fc1): Linear(in_features=512, out_features=512, bias=True)
  (fc2): Linear(in_features=512, out_features=2, bias=True)
)
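The in_features=512 on fc1 is the value our dummy pass calculated: 128 channels times a 2x2 feature map. As a cross-check, here's a sketch of the same layer stack written with nn.Sequential and nn.Flatten, which newer Pytorch releases provide. This is an alternative formulation and not the Net class from this tutorial, and the dummy pass is still what tells us the flattened size:

```python
import torch
import torch.nn as nn

# Same conv/pool stack as Net.convs, ending in nn.Flatten (which keeps the batch dimension).
features = nn.Sequential(
    nn.Conv2d(1, 32, 5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(64, 128, 5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
)

# One dummy pass with random data to discover the flattened size.
with torch.no_grad():
    n_flat = features(torch.randn(1, 1, 50, 50)).shape[1]
print(n_flat)  # 512

model = nn.Sequential(
    features,
    nn.Linear(n_flat, 512), nn.ReLU(),
    nn.Linear(512, 2), nn.Softmax(dim=1),
)

out = model(torch.randn(3, 1, 50, 50))  # batch of 3 random "images"
print(out.shape)                         # torch.Size([3, 2])
```

Each row of the output has 2 values that sum to 1, one per class, which is the same thing the forward method of our Net returns.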

Next, we're ready to actually train the model, so we need to make a training loop. For this, we need a loss metric and optimizer. Again, we'll use the Adam optimizer. This time, since we have one_hot vectors, we're going to use mse as our loss metric. MSE stands for mean squared error.

import torch.optim as optim

optimizer = optim.Adam(net.parameters(), lr=0.001)
loss_function = nn.MSELoss()
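To see what nn.MSELoss is actually measuring with one_hot targets, here's a small sketch with made-up network outputs (the numbers are invented for illustration, not produced by our model):

```python
import torch
import torch.nn as nn

loss_function = nn.MSELoss()

outputs = torch.tensor([[0.8, 0.2], [0.4, 0.6]])  # made-up softmax outputs for 2 samples
targets = torch.tensor([[1.0, 0.0], [1.0, 0.0]])  # one-hot labels (both class 0)

loss = loss_function(outputs, targets)
manual = ((outputs - targets) ** 2).mean()  # mean of the squared differences

print(loss.item(), manual.item())  # both roughly 0.2
```

The confident, correct first prediction contributes very little to the loss, while the wrong second one contributes most of it.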

Now we want to iterate over our data, but we need to also do this in batches. We also want to separate out our data into training and testing groups.

Along with separating out our data, we also need to shape this data (view it, in Pytorch terms) in the way Pytorch expects: (-1, IMG_SIZE, IMG_SIZE).

To begin:

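# Stack into single NumPy arrays first; building a tensor from a list of arrays is much slower.
X = torch.tensor(np.array([i[0] for i in training_data]), dtype=torch.float32).view(-1,50,50)
X = X/255.0
y = torch.tensor(np.array([i[1] for i in training_data]), dtype=torch.float32)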

Above, we're separating out the featuresets (X) and labels (y) from the training data. Then, we're viewing the X data as (-1, 50, 50), where the 50 is coming from image size. Now, we want to separate out some of the data for validation/out of sample testing.

To do this, let's just say we want to use 10% of the data for testing. We can achieve this by doing:

VAL_PCT = 0.1  # let's reserve 10% of our data for validation
val_size = int(len(X)*VAL_PCT)
print(val_size)

2494

We're converting to an int because we're going to use this number to slice our data into groups, so it needs to be a valid index:

train_X = X[:-val_size]
train_y = y[:-val_size]

test_X = X[-val_size:]
test_y = y[-val_size:]

print(len(train_X), len(test_X))

22452 2494
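If the negative slicing looks odd, here's the same split done on a plain list of indexes standing in for our 24946 samples (just a sketch of the arithmetic, no tensors needed):

```python
n_samples = 24946
VAL_PCT = 0.1

val_size = int(n_samples * VAL_PCT)  # 2494.6 truncated to 2494
data = list(range(n_samples))        # stand-in for X

train = data[:-val_size]  # everything except the last val_size items
test = data[-val_size:]   # only the last val_size items

print(len(train), len(test))  # 22452 2494
```

One thing to watch for: if val_size ever came out as 0, data[:-0] would be the same as data[:0], which is an empty list, so this pattern assumes a non-zero validation size.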

Finally, we want to actually iterate over this data to fit and test. We need to decide on a batch size. If you get any memory errors, go ahead and lower the batch size. I am going to go with 100 for now:

BATCH_SIZE = 100
EPOCHS = 1

for epoch in range(EPOCHS):
    for i in tqdm(range(0, len(train_X), BATCH_SIZE)): # from 0 to the len of train_X, stepping BATCH_SIZE at a time
        #print(f"{i}:{i+BATCH_SIZE}")
        batch_X = train_X[i:i+BATCH_SIZE].view(-1, 1, 50, 50)
        batch_y = train_y[i:i+BATCH_SIZE]

        net.zero_grad()

        outputs = net(batch_X)
        loss = loss_function(outputs, batch_y)
        loss.backward()
        optimizer.step()    # Does the update

    print(f"Epoch: {epoch}. Loss: {loss}")

100%|██████████| 225/225 [01:56<00:00,  2.24it/s]

Epoch: 0. Loss: 0.21407592296600342

We'll just do 1 epoch for now, since it's fairly slow.

The code should be fairly obvious, but basically we just iterate over the length of train_X, taking steps of the size of our BATCH_SIZE. From there, we can know our "batch slice" will be from whatever i currently is to i+BATCH_SIZE.
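Here's a quick sketch of just the slicing arithmetic, using our 22452 training samples, which also shows where the 225 in the progress bar comes from:

```python
n_train = 22452
BATCH_SIZE = 100

starts = list(range(0, n_train, BATCH_SIZE))                 # 0, 100, 200, ...
sizes = [min(i + BATCH_SIZE, n_train) - i for i in starts]   # actual size of each slice

print(len(starts), sizes[0], sizes[-1])  # 225 100 52
```

The last batch is smaller (52 samples), since slicing past the end of the data just returns whatever is left. Also note that the loss printed at the end of the epoch is only the loss of that final batch, not an average over the epoch.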

While we wait on that, let's code the validation:

correct = 0
total = 0
with torch.no_grad():
    for i in tqdm(range(len(test_X))):
        real_class = torch.argmax(test_y[i])
        net_out = net(test_X[i].view(-1, 1, 50, 50))[0]  # net returns a batch of outputs; [0] takes the single prediction
        predicted_class = torch.argmax(net_out)

        if predicted_class == real_class:
            correct += 1
        total += 1
print("Accuracy: ", round(correct/total, 3))

100%|██████████| 2494/2494 [00:11<00:00, 226.66it/s]

Accuracy:  0.651

As you can see, after just 1 epoch (pass through our data), we're at 65% accuracy, which is already quite good, considering our data is just about perfectly 50/50 balanced.
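The validation loop above passes one image at a time, which is simple but not the only way to do it. Here's a sketch of a batched version. The batched_accuracy function and the always_cat stand-in "network" are hypothetical names made up for this example; with the real model you would pass net, test_X and test_y instead:

```python
import torch

def batched_accuracy(net, X, y, batch_size=100):
    """Fraction of samples whose argmax prediction matches the one-hot label."""
    correct = 0
    with torch.no_grad():
        for i in range(0, len(X), batch_size):
            batch_X = X[i:i + batch_size].view(-1, 1, 50, 50)
            predicted = torch.argmax(net(batch_X), dim=1)      # predicted class per sample
            actual = torch.argmax(y[i:i + batch_size], dim=1)  # real class per sample
            correct += (predicted == actual).sum().item()
    return correct / len(X)

# Stand-in "network" for demonstration only: always predicts class 0.
def always_cat(batch):
    out = torch.zeros(len(batch), 2)
    out[:, 0] = 1.0
    return out

X_demo = torch.rand(10, 50, 50)                                       # random fake images
y_demo = torch.eye(2)[torch.tensor([0, 0, 0, 1, 1, 0, 1, 0, 0, 1])]  # 6 of class 0, 4 of class 1

print(batched_accuracy(always_cat, X_demo, y_demo, batch_size=4))  # 0.6
```

The demo also shows why balance matters when reading accuracy: a model that always guesses the same class scores exactly the share of that class in the data.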

If you want, you can go ahead and run the code for something more like 3-10 epochs, and you should see we get pretty good accuracy. That said, as we continue to do testing and learn new things about how to measure performance of models, when to stop training...etc, it's going to be muuuuuuch more comfortable if we're using GPUs.

In the next tutorial, I am going to show how we can move things to the GPU, if you have access to one, either locally or in the cloud via something like Linode. This will help us trial/error things much more quickly. You can still continue with us if you do not have access to a high end GPU; it's just that things will take a bit longer to run. If you do not have the patience to train models for something like an hour, then you're probably not cut out for deep learning, where even on extremely high end GPUs, models often take days to train, sometimes weeks, and sometimes even months!

Convolutional Neural Network Model - Deep Learning and Neural Networks with Python and Pytorch p.6
