Over the past week and a bit I’ve been reading up on Deep Learning and Convolutional Neural Networks. I mean a lot of reading! I was dismayed to see that so many of the lectures and tutorials rely on machine learning libraries (like Caffe, Torch, the Python Docker book thing, etc.), which is fine, but I felt that I was kinda missing out on the intuition behind them. So I decided to write the whole thing from scratch in C++, and I finally got it working, so I thought I’d make a blog post about it!
So what is a Convolutional Neural Network (CNN)? It is built around the convolution operation, which you perform by multiplying two matrices element-wise and summing the results. One matrix holds your weights (the filter) and the other is a same-sized patch of the image. You keep sliding the patch over the image until you’ve covered the whole thing. Here’s an example of a 3*3 Sobel filter from an earlier project.
The filter reduces the width*height*depth input (depth = 3 due to the red, green, and blue channels) to just width*height. Technically the width and height are reduced too, but the edges are (usually) just padded with zeros to keep the same dimensions. Convolutional layers use filters just like these; the really cool part is that they *learn* which filters to use!
Notice we treat the 2D input image as a three-dimensional volume, using the RGB channels as the depth component. What happens when you use more than one filter?
You produce a new volume whose z dimension matches the number of filters used: each filter moves over the image and produces its own 2D output image, and these are stacked together to form the resulting volume. The result of each filter is also thresholded by some function (I used ReLU), but that’s not really important right now. Now let’s look at another layer that we can use.
Where convolutional layers (usually) conserve the width and height of the data while altering the depth, pooling layers conserve the depth while reducing the width and height of the input volume. The method I used is called Max Pooling, where you just pick the largest value from each block of the image. Using 2*2 image segments, the input would be altered like so:
If we passed the input from the above example through a max pooling layer we would get:
Now this is where the “network” part comes in: we stack these layers to produce one big network. The object scales are a little off, but it will look something like the below.
At the end of the network we flatten the result and feed it into a ‘fully connected’ layer, which is just a normal neural network with each neuron attached to every input pixel. But now, not only has the size been reduced, the convolutional layers will have trained themselves to look for features inside the picture. What you should find is that the first layer of the network trains itself to look for ‘low level’ features in the image like lines, basically acting like an edge detection filter. The higher level convolutional layers then pick out higher level features from that result; I saw a lecture where they used a technique called guided backpropagation to actually pull out the features which the higher level convolutional layers had trained themselves to look for, like noses and eyes.
It took me quite some time to get everything working, especially calculating and flowing the derivatives through the network for the backpropagation. But it works now! I made a dummy set of 64*64 images of crosses and circles. Starting with a two filter convolutional layer, I fed the network this image of a small circle (notice it’s mostly whitespace).
On the first iteration (no training), the result from the first layer is:
Which makes sense: since all of the filter values start out as positive floating point values, the intensities will be scattered across the picture. After training for 100 iterations we get this volume (think of it as the two slices of a 2-deep volume):
What you can see is that the filters have started to cut out the background (the data set includes all sorts of colours for the background) and just focus on the circle. This is then passed up the network to eventually reach a logistic regression classifier, which gets 100% on this pretty basic example set.
It was pretty frustrating at times even just working with the tiny data set that I had, but writing the whole thing from scratch has been really rewarding in that now I have a pretty solid understanding of the intuitions behind the backpropagation/derivative calculations, CNNs, and machine learning in general. The network is super modular so I can just drop layers in and out and modify their hyperparameters. The setup looks like so:
// Width, height, and depth (3 due to RGB) of input image
Network seqNet(width, height, depth);

// Layer 1 - Convolution - ReLu - MaxPool
seqNet.AddConvolution(5, 3, 1); // Filters - Filter size - Stride
seqNet.AddMaxPooling(2);        // Size

// Layer 2 - Convolution - ReLu - MaxPool
seqNet.AddConvolution(3, 3, 1);
seqNet.AddMaxPooling(2);

// Layer 3 - Linear Classifier (LogRegression)
seqNet.AddLinearClassifier();
I have an idea for a pretty cool project which would use CNNs, but I’ll probably need to write a GPGPU implementation for it. I’ve only just touched on the stuff I’ve learnt about CNNs, but the field is super cool and can do things like not only classify an image but, if the class is part of a larger image, identify the bounding box of that class in the larger image (i.e. where it is). There is so much more to go into in terms of the implementation and how the backpropagation makes sense, but I think this is a fairly decent introduction to what CNNs do.
Shout out to Nando de Freitas’s superb deep learning course at the University of Oxford, whose videos go pretty deep into the maths, which was super useful, and to Andrej Karpathy and Fei-Fei Li’s deep learning course at Stanford University, which has probably the best videos and course notes on the topic.