THE LANGUAGE MODELING PATH: CHAPTER 2

Neural Networks, at Attention!

The mechanism that makes language models focus on the important things

Davide Salvaggio
13 min read · Aug 27, 2020


Hi and welcome to the second step of the Language Modeling Path, a series of articles from Machine Learning Reply covering the most important milestones that brought to life huge language models, such as BERT and GPT-3, able to imitate and (let’s say) understand human language.

In this article we are going to talk about attention layers. This architectural trick was first applied in the computer vision field [1], but here we will focus only on its application to Natural Language Processing, and in particular on sequence-to-sequence models for Neural Machine Translation (NMT). While this article is based on two papers ([2] and [3]), attention is a widespread technique that you can find explained in many places if you need to go deeper.

To better understand this chapter, it is strongly suggested to have a good grasp of encoder-decoder sequence-to-sequence models. If you need a refresher on the key concepts of those architectures, you can start from the first chapter of our Language Modeling Path.

Attention in the real world

The introduction of the attention mechanism in common seq2seq applications allows longer and more complex sentences to be handled. As we anticipated, the basic insight behind this trick was born in the computer vision field and was then developed around natural language for Neural Machine Translation (NMT) applications. In this article we will focus on NMT simply because it is the natural habitat of such algorithms; nevertheless, the family of attention-based models (models that rely on this particular architectural pattern) counts among its ranks many state-of-the-art models across most fields of Natural Language Processing.

The attention mechanism, for example, is part of what allows Google Assistant and Amazon Alexa to understand our intentions even when we use more than one simple sentence to express them.

It can give a boost in accuracy to all applications that require text embeddings. Here is a brief (and incomplete) list of areas where we have seen it improve over non-attention-based models:

  • Document retrieval
  • Text classification
  • Text clustering
  • Text similarity
  • Personalized search engines
  • Text generation

Moreover, the attention mechanism gives more detailed insight into which part of the input had the highest impact on the decision made by our model. This is a huge advantage in a production environment, making the black-box neural network a little less black.

A problem in standard Sequence-to-Sequence NMT

In the previous chapter we discussed one of the most effective architectures still used nowadays for the NMT problem: the sequence-to-sequence model. In this kind of network, an input sequence made of the individual words of the sentence in the source language is fed into a recurrent neural network, for example an LSTM. This network tries to collect the information from every input word and stores it inside a fixed-length array. This first part of the model is called the encoder. The encoded array is then passed as the initial hidden state of a second recurrent neural network that tries to generate the correct translation one word at a time, starting each time from the previous hidden state and the previously generated word; this second part is called the decoder.
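As a rough illustration, a minimal encoder-decoder of this kind could be sketched in PyTorch as follows. This is just a toy sketch to fix ideas, not the exact architecture used in [2]: the vocabulary sizes, dimensions and class names are arbitrary choices made for the example.

```python
import torch
import torch.nn as nn

SRC_VOCAB, TGT_VOCAB, EMB, HID = 1000, 1000, 64, 128  # toy sizes

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(SRC_VOCAB, EMB)
        self.rnn = nn.LSTM(EMB, HID, batch_first=True)

    def forward(self, src):                     # src: (batch, T) token ids
        _, (h, c) = self.rnn(self.emb(src))
        return h, c                             # fixed-size summary of the whole sentence

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(TGT_VOCAB, EMB)
        self.rnn = nn.LSTM(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, TGT_VOCAB)

    def forward(self, prev_word, state):        # one generation step at a time
        out, state = self.rnn(self.emb(prev_word), state)
        return self.out(out), state             # logits for the next target word
```

Notice that the whole source sentence has to squeeze through the single fixed-size (h, c) pair returned by the encoder, which is precisely the bottleneck discussed below.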

This kind of approach gives the input and output sequences great flexibility in length and opened the way for deep learning in the field of automatic translation. However, it still has some intuitive downsides.

From [2] we can in fact read:

“A potential issue with this encoder–decoder approach is that a neural network needs to be able to compress all the necessary information of a source sentence into a fixed-length vector. This may make it difficult for the neural network to cope with long sentences, especially those that are longer than the sentences in the training corpus. Cho et al.(2014b) showed that indeed the performance of a basic encoder–decoder deteriorates rapidly as the length of an input sentence increases.”

Even if LSTM recurrent neural networks are able to carry information from the beginning of the input sentence all the way to the end, i.e. into the encoded array, the space available to store that information is limited by the fixed dimension of the encoded array. If the input sentence is too long with respect to this fixed dimension, the loss of some important information cannot be avoided and the translation suffers from it.

The attention-based model

The idea behind the attention mechanism is the following. During the encoding phase, at each step of the RNN I store the hidden state related to each of the input sentence’s words. Then, during decoding, at each step of the RNN I take into account not only the previous decoder hidden state and the previously generated word, but also a weighted average of the stored encoder hidden states. The weights of this average depend directly on the previous decoder hidden state and are learned during training, so as to give more attention to the encoder hidden states related to the words that are most “aligned” (semantically, syntactically, etc.) with the translated word I am trying to generate at the current step.

At each step of the decoder, I therefore highlight the information from the small subset of the input sentence most relevant to the current translation step.

Let’s see in more detail how this idea has been implemented, starting by defining some initial symbols:

  • X = (x₁, …, xT) Input sentence
  • Y = (y₁, …, yS) Output sentence
  • hₜ = Encoder’s hidden state at step t with t in [1, …, T]
  • sₜ = Decoder’s hidden state at step t with t in [1, …, S]

As we saw in chapter one of the Language Modeling Path, the target here is, as always, to estimate the conditional probability of the next generated word given all the previous ones and some information taken from the input sentence X. This conditional probability is computed by applying some function g (for example a dense neural network) to the hidden state sₜ of the current decoder step. In attention-based models, the current state sₜ is in turn computed not only from the previously generated word yₜ₋₁ and the previous hidden state sₜ₋₁, as in common seq2seq architectures, but also from the so-called context vector cₜ, which is the weighted average we were talking about a moment ago.
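In formulas, keeping the notation above (this is the formulation of [2], slightly simplified for readability):

```latex
p(y_t \mid y_1, \dots, y_{t-1}, X) = g(s_t),
\qquad s_t = f(s_{t-1},\, y_{t-1},\, c_t)
```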

The key ingredient of the attention mechanism is indeed the context vector cₜ. Note that this context vector is specific to the decoding step t: in this way, the generation of each word of the output translation gives more importance to a different section of the input sentence.

Generic schema of the decoding phase using the attention mechanism. The red intensity represents the weight given to each encoder step’s hidden state. For example h₃ (the hidden state of the Italian word “intelligenza”) has a higher weight when the decoder tries to generate the word “intelligence”, and a lower weight at the other steps.

In most cases the context vector is just a weighted average of the encoder’s hidden states that we stored at each step of the encoding phase. The weights aₜᵢ are simply a softmax rescaling (so that they lie between 0 and 1 and sum to 1) of certain values eₜᵢ called alignment scores.
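In symbols, following [2] with the notation defined above:

```latex
c_t = \sum_{i=1}^{T} a_{t,i}\, h_i,
\qquad a_{t,i} = \frac{\exp(e_{t,i})}{\sum_{k=1}^{T} \exp(e_{t,k})}
```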

These definitions are common to many of the different kinds of attention that have been implemented; the way the scores eₜᵢ are computed, on the other hand, can differ from paper to paper. In the original implementation described in [2], the scores are computed by applying a small feedforward neural network to the concatenation of sₜ₋₁ and hᵢ, but remember that this is not the only way to implement the attention mechanism. This network is referred to as the alignment model or compatibility function, since it measures how important hᵢ is in deciding the hidden state that follows sₜ₋₁.
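As an illustration, here is a minimal numpy sketch of this additive scoring, of the softmax rescaling and of the resulting context vector. The dimensions and the random matrices are placeholders for the example, not the trained parameters of [2]:

```python
import numpy as np

def softmax(x):
    x = x - x.max()                 # numerical stability
    return np.exp(x) / np.exp(x).sum()

T, enc_dim, dec_dim, attn_dim = 6, 8, 8, 16   # toy sizes

H = np.random.randn(T, enc_dim)     # encoder hidden states h_1 ... h_T
s_prev = np.random.randn(dec_dim)   # previous decoder hidden state s_{t-1}

# Alignment model: a small feedforward net applied to [s_{t-1}; h_i]
W_a = np.random.randn(attn_dim, dec_dim + enc_dim)
v_a = np.random.randn(attn_dim)

e = np.array([v_a @ np.tanh(W_a @ np.concatenate([s_prev, H[i]]))
              for i in range(T)])   # alignment scores e_{t,i}
a = softmax(e)                      # attention weights a_{t,i}: between 0 and 1, summing to 1
c = a @ H                           # context vector c_t, a weighted average of the h_i
```

In [2], W_a and v_a are learned jointly with the rest of the network, so during training the model discovers by itself which source positions deserve the most weight at each decoding step.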

“The probability aₜᵢ, or its associated energy eₜᵢ reflects the importance of the annotation hᵢ with respect to the previous hidden state sₜ₋₁ in deciding the next state sₜ and generating yₜ. […] The decoder decides which parts of the source sentence to pay attention to. By letting the decoder have an attention mechanism, we relieve the encoder from the burden of having to encode all information in the source sentence into a fixed-length vector” [2]

More detailed schema of the attention mechanism as implemented in [2].

At the end of the translation, by collecting all the aₜᵢ weights it is possible to build an alignment matrix that, for each generated word, shows which input words were the most influential at that particular generation step.

Examples of alignment matrices taken from [2]. Each pixel shows the weight aₜᵢ of the annotation of the i-th source word for the t-th target word in grayscale (0: black, 1: white). Notice the inverse alignment of “European Economic Area” with “zone économique européenne” and the influence of the tri-gram “a été signé” on the bi-gram “was signed” in example (a).
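As a pointer, once the aₜᵢ weights are collected into an S × T matrix (one row per generated word, one column per source word), a plot like the ones in [2] can be produced with matplotlib. The matrix below is random, just to show the mechanics:

```python
import numpy as np
import matplotlib.pyplot as plt

S, T = 7, 9                                   # toy output / input lengths
A = np.random.dirichlet(np.ones(T), size=S)   # each row sums to 1, like attention weights

plt.imshow(A, cmap="gray", vmin=0, vmax=1)    # 0: black, 1: white, as in [2]
plt.xlabel("source words")
plt.ylabel("generated words")
plt.show()
```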

Local Attention

This improvement of the architecture, unfortunately, brings with it a significant downside in terms of computational and memory cost. In fact, the longer the input sequence, the greater the number of hidden states I must store and the more alignment scores I must compute to combine them in an appropriate way. Moreover, this combination of numerous arrays must be repeated for each step of the generated sentence, increasing the number of operations required. Since the attention mechanism was developed precisely to improve the handling of long sentences, this issue cannot be ignored. This is why [3] introduced a small variation on the standard attention mechanism called local attention.

The intuition behind local attention is simply that, instead of considering all the encoder’s hidden states to compute the context vector, at each decoder step we select only the ones inside a small window that slides over the encoder’s hidden states.

The window spans [pₜ − D, pₜ + D], where the half-width D is usually selected empirically, while the centre pₜ can be identified in two different ways, depending on the implementation (a small sketch follows the list below).

  • Monotonic → pₜ = t
    This approach assumes that the source and target sentences are already roughly aligned. For many combinations of source and target languages this is not necessarily a wrong assumption, especially if the window dimension D is large enough. It simply means that, to generate the 5th word of the output translation, I will pay attention to the words in the area around the 5th input word.
  • Predictive → pₜ = T · sigmoid(vₚ · tanh(Wₚsₜ))
    In this case the value of pₜ is computed by another feedforward neural network with learnable weights Wₚ and vₚ. The sigmoid function and the multiplication by T (quick reminder: T is the length of the source sentence) ensure that pₜ falls between 0 and T, hence representing a position in the input sentence. By training these weights the model learns on its own, depending on the current decoder hidden state sₜ, where in the input sentence the small window should be centred, and therefore which subset of the encoder’s hidden states must be considered to compute the context vector.

While the Monotonic approach is surely simpler and lighter, the Predictive one has proven to give more accurate translations. In any case, both methods surpassed the results obtained by the classical (global) attention mechanism presented in [2].

Results

To compare the improvements of a new model with respect to the previous state of the art, the common procedure is to select some benchmark datasets on which previous models have already been tested and see whether the newly proposed one can do better. One such benchmark (very popular when [2] and [3] were published) is the WMT’14 dataset, a corpus of sentence pairs where each sentence is associated with a valid translation. Five language pairs are available:

  • French-English
  • Hindi-English
  • German-English
  • Czech-English
  • Russian-English

The papers this article is based on selected the French-English [2] and German-English [3] corpora to evaluate model performance.

In addition to a benchmark dataset, a common metric is also needed to compare the accuracy of each translation. One of the most popular is the BLEU score, which stands for bilingual evaluation understudy. Long story short, this score is a sort of precision computed between the translation generated by the model and the reference translation, which:

  1. Takes into account how many times a certain generated token is present in the target sentences.
  2. Instead of evaluating a word at the time, it is computed over n-grams (groups of n words).
  3. Penalizes shorter translations.

A whole article would be needed to discuss the BLEU score and the benchmarking of language models, but if you want to know more about BLEU, have a look at its original paper.
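As a small example of what such a score looks like in practice, a sentence-level BLEU can be computed with NLTK, assuming the library is available; the sentences below are made up for illustration:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "agreement", "on", "the", "european", "economic",
              "area", "was", "signed", "in", "august", "1992"]]
candidate = ["the", "agreement", "on", "the", "european", "economic",
             "area", "was", "signed", "on", "august", "1992"]

# The default weights average 1- to 4-gram precisions; smoothing avoids
# zero scores when a higher-order n-gram never matches.
score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```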

This boring introduction was only meant to let you understand the following few statements about model results. For example, the first attention-based model [2] was able to outperform non-attentional models on the French-English task by 7 to 11 BLEU points, depending on the maximum length of the test sentences. Unsurprisingly, the greatest improvements were observed on the longest sentences. As for [3], the model that introduced the local attention approach surpassed the results of the basic attention model on the German-English dataset by an additional 1 to 3 points.

To better understand what kind of improvement the attention mechanism has brought, let’s also look at some qualitative results of those models, taken from [2].

As an example, consider this source sentence from the test set:

“An admitting privilege is the right of a doctor to admit a patient to a hospital or a medical centre to carry out a diagnosis or a procedure, based on his status as a healthcare worker at a hospital.”

The RNNencdec-50 [non-attentional seq2seq] translated this sentence into:

“Un privilège d’admission est le droit d’un médecin de reconnaître un patient à l’hôpital ou un centre médical d’un diagnostic ou de prendre un diagnostic en fonction de son état de santé.”

The RNNencdec-50 [non-attentional seq2seq] correctly translated the source sentence until “a medical center”. However, from there on, it deviated from the original meaning of the source sentence. For instance, it replaced “based on his status as a health care worker at a hospital” in the source sentence with “en fonction de son état de santé” (“based on his state of health”). On the other hand, the RNNsearch-50 [attention-based model] generated the following correct translation, preserving the whole meaning of the input sentence without omitting any details:

“Un privilège d’admission est le droit d’un médecin d’admettre un patient à un hôpital ou un centre médical pour effectuer un diagnostic ou une procédure, selon son statut de travailleur des soins de santé à l’hôpital.”

This example strongly highlights the true essence of the attention improvement. With the old seq2seq model, the encoded array was able to store enough information to correctly generate only the first half of the sentence. The second half of the generated sentence is simply a reshuffling of previous elements, with no new components added. Words like “status” or “healthcare worker” are completely ignored, since there is no more room in the encoded hidden state to carry them from the encoder to the decoder. With the attention mechanism, on the other hand, this problem is solved. To generate the first half of the sentence I can focus only on the first words, forgetting what appears at the end; then, as we approach the second half of the sentence, our attention moves to another set of input words, and words like “status” or “healthcare worker” can finally influence the generation of the newly translated words.

Conclusions

The attention mechanism is a keystone to fully understanding the current state of the art of neural language models. One of the most beautiful aspects of its development (in my opinion) is that, as with most deep learning evolutions (or even deep learning itself), the intuition behind it reflects human thinking behaviour. When we approach a translation, we do not simply read the input sentence once and then write the output: that works only for very short sentences, and we cannot require a machine to work like this either. What a human does is read the sentence once and then re-read small pieces of it at a time, focusing on the translation of each piece while keeping in mind where the whole sentence wants to go. This is exactly how sequence-to-sequence attention-based models work. The ability to memorize longer and longer sentences opens the way to models suitable for training on enormous corpora of text, and this, as you can imagine, is how a neural language model learns the way we humans are used to communicating.

I hope this article helped you understand the attention mechanism; I can’t wait to see you in the next chapter of the Language Modeling Path. We will talk about the Transformer, a model first published in the paper “Attention Is All You Need” (coincidence? I think not) that is at the base of all the most popular deep learning language models like BERT, GPT-2, T-NLG and GPT-3.

Bye!

References

[1] Mnih et al. (2014), “Recurrent Models of Visual Attention”, NeurIPS 2014. https://papers.nips.cc/paper/5542-recurrent-models-of-visual-attention.pdf

[2] Bahdanau et al. (2015), “Neural Machine Translation by Jointly Learning to Align and Translate”, ICLR 2015. https://arxiv.org/pdf/1409.0473.pdf

[3] Luong, Pham and Manning (2015), “Effective Approaches to Attention-based Neural Machine Translation”. https://arxiv.org/pdf/1508.04025.pdf
