Recurrent Neural Networks (RNNs) have recently shown impressive results in natural language processing, e.g. for speech recognition and machine translation. Additionally, RNNs have also been successfully integrated into computer vision, for example for automatically generating text descriptions for images. In this exercise, you will be introduced to this extremely powerful machine learning method and will gather hands-on experience by tackling a Shakespeare language modeling and an image captioning task.

For this exercise, we will again use TensorFlow (for instructions on how to set it up, please have a look at Versuch 6). But before that, we will build a simple RNN directly using numpy to better understand the backpropagation through time algorithm.

In this exercise, you will learn to

- Implement a Recurrent Neural Network (RNN)
- Implement a Long Short Term Memory (LSTM) network
- Apply both networks to a temporal prediction task
- Test out the memory capacity of both networks
- Improve network performance by
    - Applying gradient clipping to counter the exploding gradients problem
    - Using LSTMs to counter the vanishing gradient problem

- Use RNNs for a simple language modeling task
- Apply RNNs to an image captioning task

For the first part of the exercise, where you will implement a basic RNN directly with numpy, we will work with randomly generated sequences that are fed into the network, which then has to remember this random input. The file \(\textit{q1_bptt/data_generator.py}\) already implements the data generation.

The second part of the exercise will be based on TensorFlow. Here, we will work with different datasets. The first is a memory task based on randomly generated data similar to before. Handling of this dataset is already implemented in the function \(\textit{generate_batch_memory_task}\) of \(\textit{q2_rnn/data_generator.py}\). Additionally, we will generate sine curves of random frequency and phase shift; this is implemented in the same file in the function \(\textit{generate_sin_data}\). The next dataset will be based on texts of Shakespeare, and the function \(\textit{generate_shakespeare_data}\) is provided to read in that data. The text file shakespeare.txt is supplied directly for download along with the code.
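To make the two toy datasets concrete, here is a minimal sketch of what such generators might look like. The function names, argument names, and shapes below are illustrative assumptions, not the actual signatures from \(\textit{q2_rnn/data_generator.py}\):

```python
import numpy as np

def generate_batch_memory_task_sketch(batch_size, seq_len, n_symbols, seed=None):
    """Hypothetical memory task: the network sees a random symbol sequence
    (one-hot encoded) and the target is to reproduce that same sequence."""
    rng = np.random.default_rng(seed)
    symbols = rng.integers(0, n_symbols, size=(batch_size, seq_len))
    inputs = np.eye(n_symbols)[symbols]      # shape (batch, time, n_symbols)
    targets = symbols                        # "remember the input"
    return inputs, targets

def generate_sin_data_sketch(n_samples, seq_len, seed=None):
    """Hypothetical sine task: curves with random frequency and phase shift."""
    rng = np.random.default_rng(seed)
    freq = rng.uniform(0.5, 2.0, size=(n_samples, 1))
    phase = rng.uniform(0.0, 2 * np.pi, size=(n_samples, 1))
    t = np.linspace(0.0, 4 * np.pi, seq_len)
    return np.sin(freq * t[None, :] + phase)  # shape (n_samples, seq_len)
```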

Recurrent Neural Networks (RNNs) are an extension of feed-forward networks (MLPs) which are designed to work on input sequences rather than just a single vector input. RNNs add recurrent connections through which the hidden state is used as an additional input for the next timestep. The weights of an RNN are shared over time and the RNN can be applied to input sequences of arbitrary length. The figure under the formula shows a simple RNN unfolded over time. Note that the recurrent weight matrix \(W_{hh}\) is shared over time. The hidden activation \(h_{t}\) at time \(t\) can be calculated as

$$h_{t}=f\left(W_{hh}h_{t-1}+W_{xh}x_{t}+b_{h}\right).$$

In order to calculate the activations for all time steps, we first initialize the initial hidden state \(h_{0}\), which can either be learned or, as in the figure, simply set to \(0\). Afterwards, the above equation is applied for every time step.
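The forward pass described above can be sketched directly in numpy (names and shapes here are illustrative, not those of the exercise code):

```python
import numpy as np

def rnn_forward(xs, W_hh, W_xh, b_h, f=np.tanh):
    """Run the recurrence h_t = f(W_hh h_{t-1} + W_xh x_t + b_h)
    over an input sequence xs of shape (T, input_dim)."""
    hidden_dim = W_hh.shape[0]
    h = np.zeros(hidden_dim)          # h_0 simply set to zero
    hs = []
    for x_t in xs:                    # the same weights are reused at every step
        h = f(W_hh @ h + W_xh @ x_t + b_h)
        hs.append(h)
    return np.stack(hs)               # shape (T, hidden_dim)
```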

Usually on top of such a recurrent layer, either more recurrent layers or in the end a final linear output layer are stacked.

In order to calculate the gradient of the loss function with respect to the weights for RNNs, the backpropagation algorithm is extended to backpropagation through time (BPTT). Again, in practice TensorFlow will take care of all derivatives, hence we will only briefly illustrate the basic idea. Consider the hidden layer \(h\) shown in the figure and assume we have a loss function \(E\) which is composed of individual losses from each time step, i.e. \(E=\sum_{t=1}^{T}E_{t}.\) Starting with the output layer, we can then derive gradients \(\frac{\partial E_{t}}{\partial h_{t}}\) for each time step as shown in the figure. Now we are interested in the gradients with respect to the layer below. When evaluating an RNN, we apply the equation for \(h_{t}\) forward in time. During BPTT, we have to go backwards through time to propagate the gradients. Here we will consider the example in the figure with 3 time steps. Since \(h_{1}\) and \(h_{2}\) are earlier in time than \(h_{3}\), their gradients do not have an influence on the gradient for \(h_{3}\). Hence, we can start with

$$\frac{\partial E}{\partial h_{3}}=\frac{\partial E_{3}}{\partial h_{3}}.$$

For \(h_{2}\), we have a gradient both from the above layer and coming backwards in time from \(h_{3}\), i.e.

$$\frac{\partial E}{\partial h_{2}}=\frac{\partial E_{2}}{\partial h_{2}}+\frac{\partial E_{3}}{\partial h_{2}}=\frac{\partial E_{2}}{\partial h_{2}}+W_{hh}^{T}\left(\frac{\partial E_{3}}{\partial h_{3}}\odot f'(h_{3})\right).$$

Finally, for \(h_{1}\), we again have a gradient coming from the above layer and a gradient coming backwards in time, for which we reuse the gradient with respect to \(h_{2}\), i.e.

$$\frac{\partial E}{\partial h_{1}}=\frac{\partial E_{1}}{\partial h_{1}}+\frac{\partial E_{2}}{\partial h_{1}}+\frac{\partial E_{3}}{\partial h_{1}}=\frac{\partial E_{1}}{\partial h_{1}}+\frac{\partial E}{\partial h_{2}}\frac{\partial h_{2}}{\partial h_{1}}=\frac{\partial E_{1}}{\partial h_{1}}+W_{hh}^{T}\left(\frac{\partial E}{\partial h_{2}}\odot f'(h_{2})\right).$$
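The backward recursion derived above can be sketched in a few lines of numpy. The sketch below assumes a tanh nonlinearity, so that \(f'(h)=1-h^{2}\) when \(h\) stores the tanh output (a common convention; the exercise sheet's equations are authoritative for your implementation):

```python
import numpy as np

def bptt_hidden_grads(dEt_dht, hs, W_hh):
    """Backward recursion for a tanh RNN:
    dE/dh_t = dE_t/dh_t + W_hh^T (dE/dh_{t+1} * (1 - h_{t+1}**2)).
    dEt_dht, hs: arrays of shape (T, hidden_dim)."""
    T = len(hs)
    dE_dh = np.zeros_like(hs)
    dE_dh[T - 1] = dEt_dht[T - 1]      # last step: only the local loss gradient
    for t in range(T - 2, -1, -1):     # walk backwards through time
        dE_dh[t] = dEt_dht[t] + W_hh.T @ (dE_dh[t + 1] * (1.0 - hs[t + 1] ** 2))
    return dE_dh
```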

As you’ve just seen, in every step of backpropagation through time, we multiply with the transpose of the recurrent weight matrix \(W_{hh}\). When we do this for many time steps, we effectively take this matrix to a high power. Based on an analysis of the eigenvalues of \(W_{hh}\), we can derive that the gradients will either explode or vanish exponentially over time, which is a problem in practice. One way to combat exploding gradients is gradient clipping, which is described in the exercise sheet. Vanishing gradients can for example be addressed by a more complicated RNN architecture, which will be described in the following.
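A common form of gradient clipping rescales all gradients jointly so that their global L2 norm stays below a threshold. The sketch below shows this variant as an illustration; the exercise sheet's description is authoritative for the exact variant you should implement:

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so that their joint L2 norm does
    not exceed max_norm. Gradients below the threshold pass unchanged."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if global_norm > max_norm:
        scale = max_norm / global_norm
        grads = [g * scale for g in grads]
    return grads
```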

The Long Short-Term Memory (LSTM) architecture extends a basic RNN by introducing a hidden cell state (in addition to the hidden state) and multiple multiplicative gates. The internal cell state is updated at each time step by using an input gate and a forget gate which decide how much influence the input and the previous hidden cell state should have on the current state. Furthermore, the output of the LSTM cell is gated by the output gate. The LSTM architecture enables the network to work much better with long sequences and is less strongly affected by vanishing or exploding gradients.
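One step of a standard LSTM can be sketched as follows. Variable names and the stacking of the gate pre-activations into one matrix are illustrative choices; LSTM variants differ in details (e.g. peephole connections), so treat this as a sketch rather than the exercise's required implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One standard LSTM step. W maps the concatenated [x, h_prev] to the
    stacked pre-activations of input gate i, forget gate f, output gate o,
    and candidate cell update g (each of size d = hidden dim)."""
    d = len(h_prev)
    z = W @ np.concatenate([x, h_prev]) + b   # shape (4*d,)
    i = sigmoid(z[0 * d:1 * d])               # input gate
    f = sigmoid(z[1 * d:2 * d])               # forget gate
    o = sigmoid(z[2 * d:3 * d])               # output gate
    g = np.tanh(z[3 * d:4 * d])               # candidate update
    c = f * c_prev + i * g                    # new hidden cell state
    h = o * np.tanh(c)                        # output, gated by the output gate
    return h, c
```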

Language modeling is an important task in natural language processing and has many applications such as speech recognition or machine translation. A language model assigns a probability to each possible sequence of words. For a sequence \(x_{1},\dots,x_{N}\) of \(N\) words, usually the language model uses the following factorization

$$p(x_{1},\dots,x_{N})=p(x_{1})p(x_{2}|x_{1})\dots p(x_{N}|x_{1},\dots,x_{N-1})=\prod_{n=1}^{N}p(x_{n}|x_{1},\dots,x_{n-1}).$$

This means that language models usually just need to learn a one-word-ahead prediction.
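Given the model's per-step next-word distributions, the probability of an observed sequence follows directly from the factorization above. A small illustrative sketch (names and shapes are assumptions, not part of the exercise code):

```python
import numpy as np

def sequence_log_prob(step_dists, word_ids):
    """step_dists: (N, vocab_size) array where row n is the model's
    distribution p(. | x_1, ..., x_{n-1}); word_ids: the observed words
    x_1, ..., x_N as integer ids. Returns log p(x_1, ..., x_N), i.e. the
    sum of the per-step log-probabilities of the observed words."""
    per_step = step_dists[np.arange(len(word_ids)), word_ids]
    return float(np.sum(np.log(per_step)))
```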

In practice, language models need to deal with large vocabularies of possible words which makes it difficult to apply standard neural networks. Often, an embedding is learned which maps the one-hot encoded input word (i.e. a vector of length of the whole vocabulary, which has a \(1\) at the right word and a \(0\) for all other words) to a dense embedding vector which represents the word. Instead of multiplying a big weight matrix with the sparse one-hot input, we can just look up the corresponding column of the weight matrix without having to perform the multiplication.
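The equivalence between the one-hot matrix multiplication and a direct column lookup can be verified in a few lines of numpy (the vocabulary size, embedding dimension, and the convention that embeddings are the columns of the weight matrix are illustrative assumptions):

```python
import numpy as np

# Hypothetical small vocabulary and embedding matrix (columns = embeddings).
vocab_size, embed_dim = 10, 4
rng = np.random.default_rng(0)
W_embed = rng.normal(size=(embed_dim, vocab_size))

word_id = 7
one_hot = np.zeros(vocab_size)
one_hot[word_id] = 1.0

# Multiplying by a one-hot vector selects one column of W_embed ...
via_matmul = W_embed @ one_hot
# ... so the multiplication can be replaced by a direct column lookup.
via_lookup = W_embed[:, word_id]
assert np.allclose(via_matmul, via_lookup)
```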

As the first task, you will implement the backpropagation through time algorithm for a simple recurrent neural network directly with numpy, i.e. without TensorFlow. To this end, you will first implement the \(\textit{BasicRNNCell.fprop}\) function, which computes the forward pass, and afterwards the \(\textit{BasicRNNCell.bprop}\) function, which computes the backward pass. To make things easier, the exercise sheet lists the precise equations you’ll need to implement. You will then test the correctness of your implementation with the \(\textit{run_check_grads}\) function, which will compare your gradient computation to numeric differentiation.

Now we’ll create a visualization of how the gradients develop over time to get an understanding for exploding gradients. Afterwards, we’ll fix the exploding gradients by implementing gradient clipping.

To make things easier, we switch back to TensorFlow and you’ll now implement the more complex LSTM architecture including the cell state and its gates.

We’ll now consider two toy tasks, one where the network has to extrapolate a sine wave of random frequency and phase shift and one where the network has to remember the input sequence. For both tasks, we’ll compare the capabilities of a basic RNN and the LSTM.