Convolutional neural networks have recently shown impressive performance on many computer vision tasks, including object recognition. In this exercise you will be introduced to this extremely powerful machine learning method and will gather hands-on experience by tackling a simple object recognition task using TensorFlow.

For this exercise we will use the popular TensorFlow framework on top of Python. If you do not have Python installed on your computer, our recommended choice is the Anaconda distribution.

TensorFlow can be installed with GPU support, that is, you will be able to run your code on an NVIDIA GPU. Running TensorFlow on a GPU typically speeds up execution several-fold and is therefore the recommended option. However, since not all of you may have an NVIDIA GPU available, or may be unwilling to go through the steps involved in installing the GPU-compatible version of TensorFlow, we have tried to scale the exercise with a CPU in mind.

You can find sample images for each of the 10 classes in CIFAR10 in the figure below:

In this exercise, you will learn to

- Set up a Deep Learning problem in TensorFlow
- Implement a Multi-layer Perceptron (MLP)
- Implement a Convolutional Neural Network (CNN)
- Apply both networks to an image classification task
- Optimize the network parameters with a cross-entropy loss
- Improve network performance by
  - Properly initializing the network weights
  - Applying data augmentation
  - Stabilizing training with Batch Normalization
  - Regularizing the network through Dropout

*cifar10_data.py* contains the code which downloads, extracts and reads the CIFAR10 dataset. The download will use up roughly 300MB of disk space. There is no need to modify any of the functions in this file, but you may experiment with the number of training samples used and the image size. The provided function creates a TensorFlow dataset (tf.data.FixedLengthRecordDataset) and already takes care of decoding the raw binary data and creating mini-batches of multiple images.
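To make the binary format concrete, here is a small NumPy sketch of decoding a single raw record, assuming the standard CIFAR10 binary layout of 1 label byte followed by 3072 image bytes (3 channels of 32×32 pixels, stored channel-major). In the exercise, *cifar10_data.py* performs this decoding for you inside the `tf.data` pipeline.

```python
import numpy as np

# Each record: 1 label byte + 3 * 32 * 32 image bytes (channel-major).
RECORD_BYTES = 1 + 3 * 32 * 32

def decode_record(raw: bytes):
    """Split a raw CIFAR10 record into (label, image) with image shape (32, 32, 3)."""
    buf = np.frombuffer(raw, dtype=np.uint8)
    label = int(buf[0])
    # Channel-major layout (3, 32, 32) -> transpose to (height, width, channels).
    image = buf[1:].reshape(3, 32, 32).transpose(1, 2, 0)
    return label, image

# Usage with a synthetic all-zero record labeled as class 7:
fake = bytes([7]) + bytes(3 * 32 * 32)
label, image = decode_record(fake)
print(label, image.shape)  # 7 (32, 32, 3)
```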

Here we will briefly review the machine learning algorithms that we are going to use in this exercise.

Let \(x\in\mathbb{R}^{D}\) be the input to the classifier (e.g. an image flattened to a vector). A linear classifier then uses a weight matrix \(W\in\mathbb{R}^{C\times D}\) and a bias vector \(b\in\mathbb{R}^{C}\), where \(C\) is the number of classes. The raw output \(q\in\mathbb{R}^{C}\) of the linear classifier (often called logits) is then given by

$$q(x)=W\cdot x+b$$

and the classification decision is

$$\hat{c}=\arg\max_{c\in\{1,\dots,C\}}q_{c}(x).$$
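The two equations above can be written in a few lines of NumPy; this is just an illustration with made-up sizes (\(D=4\) input features, \(C=3\) classes) and random parameters, not the TensorFlow model you will build.

```python
import numpy as np

rng = np.random.default_rng(0)
D, C = 4, 3
W = rng.standard_normal((C, D))  # weight matrix, shape (C, D)
b = rng.standard_normal(C)       # bias vector, shape (C,)
x = rng.standard_normal(D)       # flattened input image

q = W @ x + b                # logits q(x) = W x + b, shape (C,)
c_hat = int(np.argmax(q))    # classification decision: index of the largest logit
print(q.shape, c_hat)
```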

Usually, the softmax function is used to convert the logits \(q\) to the predicted output \(\hat{y}\), which can be interpreted as posterior probabilities of the different classes. For class \(c\in\{1,\dots,C\}\), the softmax activation is given by

$$\hat{y}_{c}=\frac{\exp q_{c}}{\sum_{k\in\{1,\dots,C\}}\exp q_{k}}.$$
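A direct NumPy sketch of the softmax formula above; subtracting \(\max_k q_k\) before exponentiating does not change the result (it cancels in numerator and denominator) but avoids numerical overflow.

```python
import numpy as np

def softmax(q):
    """Numerically stable softmax over a vector of logits."""
    e = np.exp(q - np.max(q))
    return e / e.sum()

y_hat = softmax(np.array([2.0, 1.0, 0.1]))
print(y_hat.sum())  # 1 up to floating point error
```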

A multi-layer perceptron stacks one or multiple hidden layers on top of each other, each of which produces a hidden activation vector

$$q^{(i)}=W^{(i)}z^{(i-1)}+b^{(i)},$$

$$z^{(i)}=f(q^{(i)}),$$

where \(i\) is the layer index and \(f\) is a nonlinear activation function like tanh or ReLU. Note that \(z^{(0)}\) is just the input to the network, i.e. \(z^{(0)}:=x\). On top of the highest hidden layer, again a linear classifier is used for the final output of the network. The nonlinearities are needed to make the network more expressive and enable it to learn nonlinear dependencies. Without the activation functions, the whole network would be linear, since a composition of multiple linear functions is again linear.
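The forward pass of such an MLP with one hidden layer can be sketched as follows; the sizes (\(D=8\), \(H=16\), \(C=3\)) and random weights are purely illustrative.

```python
import numpy as np

def relu(q):
    # ReLU nonlinearity f(q) = max(0, q), applied componentwise.
    return np.maximum(0.0, q)

rng = np.random.default_rng(1)
D, H, C = 8, 16, 3
W1, b1 = rng.standard_normal((H, D)), np.zeros(H)  # hidden layer parameters
W2, b2 = rng.standard_normal((C, H)), np.zeros(C)  # linear classifier on top

z0 = rng.standard_normal(D)   # z^(0) := x, the network input
z1 = relu(W1 @ z0 + b1)       # z^(1) = f(W^(1) z^(0) + b^(1))
q = W2 @ z1 + b2              # logits from the linear classifier
print(q.shape)
```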

For optimizing neural networks, we first need to define a loss function which depends on the network output \(\hat{y}\in\mathbb{R}^{C}\) and the ground truth label \(y\). Here, we will consider the ground truth label \(y\) in one-hot format, i.e. \(y\in\mathbb{R}^{C}\), where \(y_{c}\in\{0,1\}\ \forall c\in\{1,\dots,C\}\) and \(\sum_{c=1}^{C}y_{c}=1\), which means that exactly one entry of the vector is \(1\) and all other entries are \(0\).

For the multi-class classification task, usually the cross-entropy loss is applied on top of the softmax activation function. The cross-entropy loss \(\mathcal{L}_{CE}\) is given by

$$\mathcal{L}_{CE}(y,\hat{y};W,b)=-\sum_{c=1}^{C}y_{c}\log\hat{y}_{c}.$$
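Since \(y\) is one-hot, the sum in the loss above collapses to a single term, \(-\log\hat{y}_{c}\) for the true class \(c\). A minimal NumPy sketch with made-up numbers:

```python
import numpy as np

def cross_entropy(y, y_hat):
    """Cross-entropy between one-hot label y and softmax output y_hat."""
    return -np.sum(y * np.log(y_hat))

y = np.array([0.0, 1.0, 0.0])       # one-hot label, true class is 1
y_hat = np.array([0.2, 0.7, 0.1])   # made-up softmax output
loss = cross_entropy(y, y_hat)
print(loss)  # equals -log(0.7), roughly 0.357
```

Note that in TensorFlow the softmax and the cross-entropy are usually fused into one numerically stable operation applied directly to the logits.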

In order to optimize the neural network, we will need to calculate gradients of the loss function with respect to the network weights. In practice, TensorFlow will take care of calculating derivatives for us, hence we will only briefly describe the main idea. To obtain the gradients, we will use the error backpropagation method which is basically the application of the chain rule of differentiation. Assume we are given the partial derivative \(\frac{\partial\mathcal{L}}{\partial z_{j}^{(i)}}\) of the loss with respect to the hidden activation \(z_{j}^{(i)}\) of hidden unit \(j\) of layer \(i\). We now want to compute the partial derivative \(\frac{\partial\mathcal{L}}{\partial z_{k}^{(i-1)}}\) of the loss with respect to the hidden unit \(k\) of the hidden activation \(z_{k}^{(i-1)}\) of the below layer \(i-1\). The chain rule of differentiation tells us

$$\frac{\partial\mathcal{L}}{\partial z_{k}^{(i-1)}}=\sum_{j}\frac{\partial\mathcal{L}}{\partial z_{j}^{(i)}}\frac{\partial z_{j}^{(i)}}{\partial q_{j}^{(i)}}\frac{\partial q_{j}^{(i)}}{\partial z_{k}^{(i-1)}}=\sum_{j}\frac{\partial\mathcal{L}}{\partial z_{j}^{(i)}}f'(q_{j}^{(i)})w_{jk}^{(i)}.$$

We can put this equation into matrix/vector form and obtain

$$\frac{\partial\mathcal{L}}{\partial z^{(i-1)}}=W^{(i),T}\left(\frac{\partial\mathcal{L}}{\partial z^{(i)}}\odot f'(q^{(i)})\right),$$

where \(\odot\) denotes componentwise multiplication.

This equation can be used to efficiently backpropagate the gradients from layer \(i\) to layer \(i-1\). To get this started, the derivative \(\frac{\partial\mathcal{L}}{\partial\hat{y}}\) of the loss function \(\mathcal{L}\) with respect to the network output \(\hat{y}\) needs to be determined first, which can be done by a simple analytical derivation that we omit here. In the same way that we backpropagated gradients from layer \(i\) to layer \(i-1\), we can use the gradients with respect to the hidden activations to calculate the gradients with respect to the weights.
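The matrix form of the backpropagation step can be checked numerically. The sketch below implements it for a single ReLU layer (with made-up sizes and a toy loss \(\mathcal{L}=\sum_j z_j^{(i)}\), so that \(\frac{\partial\mathcal{L}}{\partial z^{(i)}}\) is a vector of ones) and compares against central finite differences; TensorFlow does the same computation automatically for arbitrary networks.

```python
import numpy as np

rng = np.random.default_rng(2)
n_in, n_out = 5, 4
W = rng.standard_normal((n_out, n_in))  # layer weights W^(i)
z_prev = rng.standard_normal(n_in)      # activation z^(i-1) of the layer below

def forward(z):
    # z^(i) = f(W z) with ReLU as f (bias omitted for brevity).
    return np.maximum(0.0, W @ z)

# Analytic backprop: dL/dz^(i-1) = W^T (dL/dz^(i) * f'(q^(i))).
q = W @ z_prev
dL_dz = np.ones(n_out)                  # dL/dz^(i) for L = sum(z^(i))
relu_grad = (q > 0).astype(float)       # f'(q^(i)) for ReLU
dL_dz_prev = W.T @ (dL_dz * relu_grad)

# Numerical check with central differences.
eps = 1e-6
num = np.zeros(n_in)
for k in range(n_in):
    zp, zm = z_prev.copy(), z_prev.copy()
    zp[k] += eps
    zm[k] -= eps
    num[k] = (forward(zp).sum() - forward(zm).sum()) / (2 * eps)
print(np.max(np.abs(dL_dz_prev - num)))  # maximum deviation; should be near zero
```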

Once we have computed gradients \(\frac{\partial\mathcal{L}}{\partial w_{i}}\) with respect to the network weights \(w_{i}\), we update the weights in the direction of the negative gradient using a learning rate \(\lambda\), i.e.

$$w^{new}=w^{old}-\lambda\left.\frac{\partial\mathcal{L}}{\partial w}\right|_{w^{old}}.$$

For efficiency, we calculate the gradient only on a mini-batch of a few images instead of using the whole dataset at once.
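The update rule and the mini-batch idea can be seen on a toy least-squares problem; the sketch below (all sizes, the learning rate, and the number of steps are arbitrary choices for illustration) averages the gradient over a random mini-batch of 32 samples and applies \(w\leftarrow w-\lambda\,\partial\mathcal{L}/\partial w\) repeatedly.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((256, 4))          # toy "dataset" of 256 samples
w_true = np.array([1.0, -2.0, 0.5, 3.0])   # ground-truth weights
y = X @ w_true                             # noiseless targets

w = np.zeros(4)                            # initial weights
lr = 0.1                                   # learning rate lambda
for step in range(500):
    idx = rng.integers(0, len(X), size=32)  # sample a mini-batch
    Xb, yb = X[idx], y[idx]
    # Mean-squared-error gradient, averaged over the mini-batch only.
    grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)
    w -= lr * grad                          # w_new = w_old - lr * grad

print(np.round(w, 2))  # should be close to w_true
```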

Multi-Layer Perceptrons are not well suited to image data. When working on large images, the number of parameters quickly grows too large and MLPs will not generalize. Convolutional Neural Networks (CNNs) alleviate this by using convolutional layers instead of fully-connected layers. Convolutional layers treat the input image as a 3-dimensional tensor (height, width, and channels) and produce a new tensor that keeps the spatial structure but replaces the input channels (initially RGB) by the filter responses. To this end, a convolutional layer applies multiple filters with a limited kernel size (e.g. \(3\times3\)). The filters are slid over the image, and their weights are shared over all spatial positions; this is efficiently implemented using a convolution operation. Another important component of CNNs are pooling layers. A max-pooling layer reduces the spatial dimensions of its input by replacing each block of a certain size (e.g. \(2\times2\)) by the maximum value inside this block, where the maximum is taken separately for each channel. By reducing the size of the tensor, max-pooling leads to higher computational efficiency, and the maximum operation also provides some robustness against small translations.
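Naive single-channel reference implementations of both operations are sketched below, just to make the sliding-window structure explicit; the convolution is technically a cross-correlation (as in most deep learning frameworks), and real layers additionally handle multiple input/output channels, batches, padding, and strides.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid cross-correlation of a 2D image with a 2D kernel."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Shared weights: the same kernel is applied at every position.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2x2(x):
    """Replace each non-overlapping 2x2 block by its maximum."""
    h, w = x.shape[0] // 2, x.shape[1] // 2
    return x[:2 * h, :2 * w].reshape(h, 2, w, 2).max(axis=(1, 3))

img = np.arange(16, dtype=float).reshape(4, 4)
feat = conv2d(img, np.ones((3, 3)) / 9.0)   # 3x3 mean filter as a toy kernel
pooled = max_pool2x2(img)
print(feat.shape, pooled.shape)  # (2, 2) (2, 2)
```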

When building an entire CNN, usually multiple convolutional layers interleaved with max-pooling will be stacked on each other, and in the end the feature tensor can be flattened and fed into a linear classifier.

For this exercise, we will train different models to classify images into the 10 classes given by CIFAR10. The provided code is runnable and will train a simple linear model.

As a first step, you will make yourself familiar with the given code. Afterwards, you will implement simple data augmentation strategies to improve the generalization of the learned models.

Afterwards, you will extend the simple linear model to a Multi-Layer Perceptron (MLP) with a single hidden layer and adjust the weight initialization to work well with the ReLU nonlinearity. You will then investigate how well the fully-connected MLP works for images and whether weight decay can help here.

As the next step, you will implement a convolutional neural network, which is better suited for processing images. To this end, you will first implement generic functions for convolutional and pooling layers and afterwards stack these layers together into a complete network. Afterwards, you will use the batch normalization technique to speed up training.

Finally, we will take a closer look at the pooling layers and consider replacing them with strided convolutions. We will also increase the amount of training data used and train a larger network with dropout as a regularizer.