
THE LANGUAGE MODELING PATH: CHAPTER 3

Transformer: the fall of RNNs

A dive inside the model that gave birth to BERT and GPT-3

Dec 30, 2020


Hi and welcome to the third step of the “Language Modeling Path”, a series of articles from Machine Learning Reply aimed at covering the most important milestones of the huge language models, like BERT and GPT-3, that are able to imitate and (let’s say) understand human language.

In order to fully appreciate the details of the model explanation you will need some knowledge of the attention mechanism and of the encoder-decoder model archetype. If you would like to get the basics of these topics before tackling this article, we suggest you take the previous steps of the Language Modeling Path first.

The current article will try to explain the basics of a model that has been one of the greatest boosters of improvement in the Neural Natural Language Processing (NLP) field. The name of this model is “Transformer”, and whenever you see the acronym of one of the recent language models, you can be quite sure that the “T” stands for “Transformer”. But why did this architecture become so popular? Is it really that effective? Before answering, let’s see some examples where the Transformer is applied nowadays.

Transformer in the real world

As with many encoder-decoder models, the primary application of the Transformer was automatic translation (or better, “machine translation”). This is also the reason why our collection of articles, the Language Modeling Path, focuses so much on this particular task. The capability developed by translation models to encode words and their relationships in an effective and accurate representation space, however, is useful in many different NLP applications. This representation is usually called “embedding”, and embedding models based on the Transformer architecture are almost ubiquitous. BERT (Bidirectional Encoder Representations from Transformers) has been used (among other things) by Google to improve the performance of Google Search, making it able to interpret queries based not only on the keywords typed but also on the different importance or role each of them takes from the meaning of the whole user input. OpenAI developed GPT (Generative Pretrained Transformer, whose latest release at the moment is GPT-3), one of the most outstanding generative models, able to generate text almost indistinguishable from human writing. The Multilingual Universal Sentence Encoder developed by Google is also (partially) based on the Transformer architecture. This last model can process up to 16 different languages and is very popular in chatbot and virtual assistant development thanks to its versatility.

Looking for efficiency

I hope we managed to convey the impact that this particular model has had on the Neural NLP world. In this paragraph we will see the basics of the Transformer and what really makes it so different from previous encoder-decoder sequence-to-sequence models.

The Transformer’s intuition was born from the need to increase the amount of parallelization possible when training a standard encoder-decoder architecture. As we saw in the previous steps of the Language Modeling Path, the most common encoder-decoder seq-2-seq models rely mainly on Recurrent Neural Networks, most commonly LSTM RNNs, and that was the state of the art until the Transformer was published in the paper “Attention Is All You Need” in late 2017.

One of the main characteristics of RNNs is that the input and output sequences are processed sequentially. To compute the hidden state at a particular step of the input or output sequence, I need to wait until I know the previous value of the hidden state, and so on back to the very beginning of the sentence. From this it follows that I cannot compute all the hidden states of an input sequence at once with a single matrix computation over the sentence dimension. The final conclusion is that, due to RNNs, I am not able to parallelize the computation within a training example.
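To make this bottleneck concrete, here is a minimal NumPy sketch (our own illustration, not taken from any real implementation; shapes and weights are arbitrary) of a vanilla RNN forward pass: every hidden state needs the previous one, so the loop over time steps cannot be parallelized.

    import numpy as np

    # Toy vanilla RNN forward pass: shapes and weights are arbitrary.
    d_in, d_hid, seq_len = 8, 16, 10
    W_x = np.random.randn(d_in, d_hid) * 0.1   # input-to-hidden weights
    W_h = np.random.randn(d_hid, d_hid) * 0.1  # hidden-to-hidden weights

    x = np.random.randn(seq_len, d_in)         # one input sentence, one vector per token
    h = np.zeros(d_hid)                        # initial hidden state

    hidden_states = []
    for t in range(seq_len):
        # h_t depends on h_{t-1}: this loop cannot be parallelized over t.
        h = np.tanh(x[t] @ W_x + h @ W_h)
        hidden_states.append(h)
    hidden_states = np.stack(hidden_states)    # (seq_len, d_hid)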

As we will see, the key concept of the Transformer architecture is that the RNNs are completely removed and replaced by a particular kind of attention mechanism. The Transformer’s ability to enormously increase parallelization made it possible to train bigger models at the same cost, and bigger models mean more capability to identify patterns inside human language. At the moment the world record is held by OpenAI’s GPT-3, with about 175 billion parameters. Such models were unthinkable before the Transformer.

General Model overview

The Transformer architecture is quite complex, so first I will try to explain the general idea and then we will see some details of the most important building blocks. Like every model we have seen during the previous steps, the Transformer is composed of an encoder and a decoder. Both of them are made of the following blocks of sublayers, each repeated 6 times in sequence. All inputs and outputs of the model’s sublayers and embedding layers have dimension d_model = 512.

Image modified from https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf

Encoder Layers

Each encoder layer contains two sublayers:

  • Multi-Head Self-Attention
    A particular kind of attention. It is called Self-Attention because both the inputs of the alignment model (the model responsible for the encoder hidden states’ weights) and the components of the context vector are the same objects. It is called Multi-Head because, before applying attention, the inputs are linearly projected into 8 different spaces in parallel. We will see this in deeper detail in the next section.
  • Fully connected feed-forward network

Each of these sublayers is wrapped by a residual connection followed by a normalization layer. Translated into a pseudo-formula, the last sentence becomes:

output = LayerNorm(x + Sublayer(x))

If you have never heard of residual connections, I will simply point out that the effect of combining the input x and the sublayer output Sublayer(x) is that the model can ignore the effect of the sublayer whenever it is not necessary, reducing the effect of vanishing gradients.
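As a minimal sketch of this “Add & Norm” wrapping (again our own illustration; the learnable normalization parameters are omitted), the pseudo-formula above can be written as:

    import numpy as np

    def layer_norm(x, eps=1e-6):
        # Normalize each position's vector to zero mean and unit variance
        # (the learnable gain and bias are omitted for brevity).
        mean = x.mean(axis=-1, keepdims=True)
        std = x.std(axis=-1, keepdims=True)
        return (x - mean) / (std + eps)

    def add_and_norm(x, sublayer):
        # Residual connection followed by normalization:
        # output = LayerNorm(x + Sublayer(x))
        return layer_norm(x + sublayer(x))

    # Example: if a sublayer contributes (almost) nothing, the residual path
    # lets the input pass through, which is what mitigates vanishing gradients.
    x = np.random.randn(10, 512)               # (positions, d_model)
    out = add_and_norm(x, lambda v: 0.0 * v)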

Decoder Layers

The decoder part of the transformer contains three sublayers:

  • Masked Multi-Head Self-Attention
    This sublayer is very similar to the first sublayer of the encoder, with the difference that the input representations are masked in such a way that, to compute the output representation at a given position, the sublayer can rely only on the current and previous positions, while the following positions have zero impact. Quoting from the paper [1]:

“This masking, combined with fact that the output embeddings [ed.: here ‘output’ means the target translated sentence’s embeddings, which are fed as the decoder’s input] are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.”

  • Multi-Head Attention
    This second sublayer is almost the same as the attention mechanism we saw in the previous step of the Language Modeling Path, with the trick of linearly projecting the inputs into multiple different spaces in parallel before applying attention. In fact, the values that are combined to create the attention’s context vector come from the encoder’s output, as in classical attention applications.
  • Fully connected feed-forward network

At the end of the encoder-decoder application we end up with an embedded representation for each output position. A linear layer followed by a softmax is finally applied to this representation to obtain a probability distribution over the dictionary. This distribution is used to compute the loss function at training time and to select the predicted token at prediction time.
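A hedged NumPy sketch of this final step (vocabulary size, shapes and the greedy token selection are our own illustrative choices, not the ones used in [1]):

    import numpy as np

    def softmax(z):
        z = z - z.max(axis=-1, keepdims=True)   # for numerical stability
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    d_model, vocab_size, out_len = 512, 30000, 7
    decoder_output = np.random.randn(out_len, d_model)   # one vector per output position
    W_out = np.random.randn(d_model, vocab_size) * 0.01  # learned linear projection

    probs = softmax(decoder_output @ W_out)              # (out_len, vocab_size)

    # Training time: cross-entropy loss against the known target tokens.
    targets = np.random.randint(0, vocab_size, size=out_len)
    loss = -np.log(probs[np.arange(out_len), targets]).mean()

    # Prediction time: e.g. greedily pick the most probable token at each position.
    predicted_tokens = probs.argmax(axis=-1)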

Layers Details

In the previous chapter we listed the main components, described how they are organized and linked together, and mentioned some high-level key concepts. This could be enough for a first glance at the Transformer. If, instead, you would appreciate some additional details on the difference between classical attention as it is usually described and the aforementioned Multi-Head Attention and Self-Attention, in this chapter we will try to go deeper.

Scaled Dot-Product Attention

To follow the same explanation used in [1], it is quite useful to define three concepts related to the attention mechanism: queries (Q), keys (K) and values (V). The link between the standard attention definition and these concepts may not appear trivial, so I will try to state clearly how these definitions map to the standard case of sequence-to-sequence attention.

  • Queries
    The idea of a query Q is that this array represents the information about what we are going to describe using the attention result. In a certain way, it describes what the attention result will be used for. It is the first input of the compatibility function. In the case of standard seq-2-seq attention, the query is the hidden state of the previous step of the decoder (sₜ₋₁ if we use the previous article’s notation).
  • Keys
    The idea of a key K is that this array represents the characteristics of the attention participants, used to check whether they answer the query or not. It is the second input of the compatibility function. If, through the compatibility function, the query and the key result in a good match, then the value associated with that key will have a strong impact on the context vector, i.e. the attention output. In the case of standard seq-2-seq attention, the keys are all the encoder’s hidden states (hᵢ if we use the same notation as our previous article); in fact, the attention’s alignment scores in that case were computed by a small feed-forward network taking sₜ₋₁ and hᵢ as inputs.
  • Values
    The idea of a value V is that this array represents the information that will actually be used to compute the context vector. Each value is associated with a key, and based on the result of the compatibility function between that key and the query, the value will have a higher or lower participation in the attention output. In the case of standard seq-2-seq attention, the values are again all the encoder’s hidden states (hᵢ if we use the same notation as our previous article); in that case the keys coincide with the values. In fact, the context vector was computed as a weighted average of the encoder’s hidden states, and, as we have already mentioned, those weights are based on the compatibility function computed between them and the decoder’s hidden state currently in analysis.

We can generalize the application of attention that we saw in the previous chapter of the Language Modeling Path using the concepts we have just introduced. This generalization will be useful to understand the attention mechanism inside the Transformer architecture. In [1] the way the attention mechanism is applied is called “Scaled Dot-Product Attention” and it is defined by the following formula:

Attention(Q, K, V) = softmax(QKᵗ / √d_k) V

where d_k is the size of the key vectors K. In the Transformer, instead of using a feed-forward neural network to compute the compatibility between Q and K, the compatibility function is simply the dot product QKᵗ (scaled by 1/√d_k).

This choice, too, is made in search of the highest possible efficiency, relying on the high parallelization of matrix computations. As stated in [1]:

“… dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.”

The scaling factor 1/√d_k is used because for large values of d_k the magnitude of the dot product QKᵗ grows, pushing the softmax toward regions of extremely small gradients.
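Translated into a short NumPy sketch (with toy shapes of our choosing), Scaled Dot-Product Attention looks like this:

    import numpy as np

    def softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    def scaled_dot_product_attention(Q, K, V):
        # Attention(Q, K, V) = softmax(Q Kᵗ / sqrt(d_k)) V
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)   # compatibility of every query with every key
        weights = softmax(scores)         # one weight distribution per query
        return weights @ V                # weighted average of the values

    # Toy example: 5 queries attending over 7 key/value pairs of size 64.
    Q = np.random.randn(5, 64)
    K = np.random.randn(7, 64)
    V = np.random.randn(7, 64)
    context = scaled_dot_product_attention(Q, K, V)   # (5, 64)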

Image from https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf

Keep in mind that we have not yet stated what the queries, the keys and the values are in the specific case of the Transformer’s attention. This is because, depending on the sublayer, the elements used as Q, K and V differ. We will come back to this later in the article.

Multi-Head attention

In the General Model Overview section we stated that the input and output representations of each sublayer have size d_model = 512. However, instead of applying attention directly to these representations, the queries Q, the keys K and the values V are linearly projected into h different spaces of size d_k, d_k and d_v respectively (d_k is repeated since queries and keys must have the same dimension). This is exactly the same as applying in parallel h different dense layers with linear activation to Q, K and V. The weights of these layers are trained independently for Q, K and V, hence we end up with 3×h different weight matrices

Wᵢ^Q, Wᵢ^K, Wᵢ^V    with i = 1, …, h

In this way we obtain h different triplets

(Q Wᵢ^Q, K Wᵢ^K, V Wᵢ^V)

with the weight matrices having sizes

Wᵢ^Q ∈ ℝ^(d_model × d_k),  Wᵢ^K ∈ ℝ^(d_model × d_k),  Wᵢ^V ∈ ℝ^(d_model × d_v)

To each of these triplets the Scaled Dot-Product Attention we just discussed is applied in parallel, obtaining h different output values called heads. All the heads are then concatenated and once again projected back into a space of dimension d_model = 512. The complete formula of Multi-Head Attention is the following:

MultiHead(Q, K, V) = Concat(head₁, …, head_h) W^O,  where headᵢ = Attention(Q Wᵢ^Q, K Wᵢ^K, V Wᵢ^V)

Image from https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf

The reason behind the application of Multi-Head Attention instead of a single attention layer is that, quoting [1]:

“Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging inhibits this”

In other words: when we use a single attention head, we are applying a sort of weighted average over the values V based on the alignment of Q and K. The intuition of multi-head attention is that, even if with standard attention each single query Qᵢ might not align very well with a key Kⱼ, a linear combination of all the {Qᵢ} could find a perfect alignment with a different linear combination of all the {Kⱼ}. To increase the model’s ability to find good projection spaces that improve the match between queries and keys, instead of using a single projection we use many of them. In the model implemented in [1] the researchers used h = 8.

For each of these heads we use d_k = d_v = d_model/h = 64. Due to the reduced dimension of each head, the total computational cost is similar to that of single-head attention with full dimensionality.
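Putting the pieces together, here is a minimal NumPy sketch of Multi-Head Attention under the values above (h = 8, d_k = d_v = 64); the weight initialization is arbitrary and only meant for illustration:

    import numpy as np

    def softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    d_model, h = 512, 8
    d_k = d_v = d_model // h   # 64

    # One (W_Q, W_K, W_V) triplet per head, plus the final output projection W_O.
    W_Q = [np.random.randn(d_model, d_k) * 0.03 for _ in range(h)]
    W_K = [np.random.randn(d_model, d_k) * 0.03 for _ in range(h)]
    W_V = [np.random.randn(d_model, d_v) * 0.03 for _ in range(h)]
    W_O = np.random.randn(h * d_v, d_model) * 0.03

    def multi_head_attention(Q, K, V):
        heads = []
        for i in range(h):
            # Project Q, K and V into the i-th subspace ...
            Qi, Ki, Vi = Q @ W_Q[i], K @ W_K[i], V @ W_V[i]
            # ... and apply Scaled Dot-Product Attention there.
            weights = softmax(Qi @ Ki.T / np.sqrt(d_k))
            heads.append(weights @ Vi)
        # Concatenate the h heads and project back to d_model.
        return np.concatenate(heads, axis=-1) @ W_O

    x = np.random.randn(10, d_model)                 # e.g. 10 encoder positions
    self_attended = multi_head_attention(x, x, x)    # self-attention: Q = K = V = x

Note how setting Q = K = V, as in the last line, turns this generic Multi-Head Attention into the self-attention used in the encoder’s first sublayer, which we discuss next.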

Transformer’s Attention

In this paragraph we are going to build a bridge between the general definition of attention based on Q, K and V described in the previous sections and the General Model Overview tailored to the Transformer architecture.

The first sublayer of the Transformer’s encoder is the Multi-Head Self-Attention layer. For this layer Q, K and V all come from the same element, i.e. the output of the previous encoder layer. In the case of the very first encoder layer, this means the output of the input sentence’s embedding layer.

The first sublayer of the decoder is the Masked Multi-Head Self-Attention layer. This layer is almost identical to the encoder’s first sublayer, but in this case each output position of the attention mechanism can be computed attending only to the input positions up to and including that position. This is because at evaluation time the model can attend only to previously generated output tokens, in order to preserve the “auto-regressive property”. To implement this masking inside the Scaled Dot-Product Attention we saw earlier, all the values in the input of the softmax which correspond to illegal connections are set to minus infinity. This leads to weights equal to 0 for all the values related to illegal connections in the weighted average.
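A small sketch of this masking, again in plain NumPy with illustrative shapes: the scores of illegal connections are set to minus infinity before the softmax, so their weights become exactly zero.

    import numpy as np

    def softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    def masked_self_attention(X):
        # X: (seq_len, d_k) decoder-side representations used as Q, K and V at once.
        seq_len, d_k = X.shape
        scores = X @ X.T / np.sqrt(d_k)
        # Causal mask: position i may only attend to positions j <= i.
        illegal = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        scores = np.where(illegal, -np.inf, scores)
        weights = softmax(scores)   # illegal positions get weight exactly 0
        return weights @ X

    out = masked_self_attention(np.random.randn(6, 64))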

The second sublayer of the decoder is a Multi-Head Attention layer. In this case Q is the output of the previous decoder layer, while K and V are the encoder’s output. This is the same as the classical attention we saw in the previous chapter of the Language Modeling Path (except for the multi-head part). There is no masking here, hence the decoder can attend to all positions in the input sequence.

Other Layers

  • Position-wise Feed-Forward Networks
    Both the encoder and the decoder contain a fully connected feed-forward sublayer after the attention sublayers. This sublayer is composed of two dense layers of size d_ff = 2048 and d_model = 512 respectively, with a ReLU activation in between.
  • Embedding and Softmax
    A learned embedding model is used to convert the input tokens and the output tokens (given as input to the decoder) into vectors of dimension d_model. To predict the next-token probabilities, as usual, a learned linear transformation with softmax activation is applied to the decoder output.
  • Positional Encoding
    Since the Transformer avoids recurrence and convolution, the model has no other way to obtain strong positional information about the input tokens, so a “positional encoding” is added to the input embeddings. The positional encodings used in the implementation of [1] are:

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

Where pos is the position of the token and i is the dimension. The positional encodings have the same dimension as the embeddings, so that the two can be summed. This means that the input representation of each token is simply its embedding plus the positional encoding of its position.
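Here is a short NumPy sketch of the sinusoidal positional encoding just described (function and variable names are ours):

    import numpy as np

    def positional_encoding(max_len, d_model=512):
        # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
        # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
        pos = np.arange(max_len)[:, None]        # (max_len, 1)
        i = np.arange(0, d_model, 2)[None, :]    # even dimensions
        angles = pos / np.power(10000, i / d_model)
        pe = np.zeros((max_len, d_model))
        pe[:, 0::2] = np.sin(angles)
        pe[:, 1::2] = np.cos(angles)
        return pe

    # Same dimension as the token embeddings, so the two can simply be summed.
    token_embeddings = np.random.randn(20, 512)  # e.g. a 20-token sentence
    encoder_input = token_embeddings + positional_encoding(20)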

You can take a breath: we are done with the technical description. Now, we hope that, looking back at the picture of the model’s architecture, you will find it a little more familiar.

Why self-attention?

Now that we have some grasp of the details of the role of the attention mechanism inside the Transformer model, we would like to point out some advantages that this choice has brought.

The key aspect is that the self-attention mechanism, unlike RNNs and CNNs, connects all positions with a constant number of sequentially executed, highly parallelizable operations. This, as we said in the first paragraph, means higher efficiency, but it is not even the greatest advantage.

From [1] we can read that:

“Learning long-range dependencies is a key challenge in many sequence transduction tasks. One key factor affecting the ability to learn such dependencies is the length of the paths forward and backward signals have to traverse in the network. The shorter these paths between any combination of positions in the input and output sequences, the easier it is to learn long-range dependencies.”

Since, using self-attention, the connection between distant tokens always has the same length, the Transformer is very good at identifying such patterns, even inside long and complex sentences, while at each recurrence step in an RNN, or at each convolutional layer in a CNN, part of the relationship is lost.

In an RNN, to connect the token at position 1 with the token at position N, I must wait N recurrence steps. In a CNN it depends on the kernel size: if N is greater than the convolution’s kernel size, the first convolution will not put token 1 and token N in contact, so we must add a second convolutional layer to spot their relationship, and if N is large enough even a second layer may not be sufficient, and so on. At each step the pattern linking those tokens fades a bit. Instead, using self-attention, it doesn’t matter how far apart token 1 and token N are: they can find a strong alignment immediately at the first sublayer application.

Another side advantage of attention, which we also discussed in the previous step of the Language Modeling Path, is that it provides a higher level of model interpretability. By inspecting the attention’s alignment matrices, we can explicitly discover what kind of patterns and links between different tokens were spotted during model training.

Example of Multi-Head Self-Attention alignments for the word “making”. Different colors stand for different heads. Image taken from https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf

Training and Results

The training sets used in [1] to train the model we have described so far are the standard WMT 2014 English-German and English-French datasets. We already talked about them in the previous article. They consist respectively of 4.5 million (English-German) and 36 million (English-French) sentence pairs of English sentences with their translations.

The model was trained on one machine with 8 NVIDIA P100 GPUs. Depending on the size of the model, training took between 12 hours and 3.5 days.

The model’s performance was measured using the BLEU score, i.e. the bilingual evaluation understudy; we already mentioned it in the previous chapter of the Language Modeling Path. Here are some additional details. On the English-German translation task the Transformer model trained in [1] was able to outperform all previously reported models, including ensembles of different models, by more than 2.0 BLEU, reaching a state-of-the-art score of 28.4.

On the English-French task as well, the Transformer was able to outperform the previous state of the art, establishing a new record of 41.8 BLEU, but the most amazing thing about these results is that the model surpassed all the others using only a fraction of the computational cost the others required.

Conclusions

Neural NLP models are able to do amazing things. Most of the state-of-the-art models developed by Google, OpenAI and all the big labs dealing with such tasks are based on the Transformer. Despite the huge importance of this architecture, it is often very difficult to go beyond the surprise of such awesome achievements and answer the question “why is this result possible?”. The aim of this article was to provide a deeper explanation of what lies behind the “T” of many popular models like GPT-3 or BERT.

We have seen how the application of different forms of the attention mechanism was able to accomplish multiple important results:

  • High efficiency
  • Vast space exploration thanks to multi-head projection
  • Quick connection between distant inputs

These features, together with those of the classical sequence-to-sequence encoder-decoder architecture, have led to a model able not only to outperform all the previous state-of-the-art models, but also to do it at a fraction of the time and computational cost.

Since the Transformer has been around, bigger and bigger models have come to life. Since the beginning of 2018, Transformer-based models have grown from an order of magnitude of 100 million parameters to the 175 billion parameters of GPT-3, released in May 2020, and the curve does not seem to stop rising. This should give you a hint of the impact that this particular model has had on this field of research.

We really do hope we managed to give you a deeper understanding of what makes Transformer-based models different from all the others, and we look forward to meeting you again for the next and (maybe) final step of the Language Modeling Path: “GPT2 and GPT3: so good to be bad”. It will be completely dedicated to what Transformer-based models can do and could do in the future.

Thank you for reading this article, and if you missed the previous steps we suggest you take a look at them as well.

Bye!

References

[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, L. Kaiser and I. Polosukhin. 2017. “Attention is all you need” In Advances in Neural Information Processing Systems, pages 6000–6010 https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf

