
The Language Modeling Path: Chapter 1

Sequence-to-Sequence architectures

A turning point in automatic translation


This article is part of the “Language Modeling Path”, a series of articles collecting the main advances in Neural NLP that led to the current SOTA generative models. If you enjoy this one, go check out the others too!

Prerequisites:
To fully appreciate this article, previous experience with the basics of artificial neural networks and with Recurrent Neural Networks is strongly suggested.

Introduction

Our story starts in December 2014 with the publication of a paper by Google named “Sequence to Sequence Learning with Neural Networks” [1]. The researchers behind this paper were guided by a simple but crucial observation.

“Despite their flexibility and power, DNNs can only be applied to problems whose inputs and targets can be sensibly encoded with vectors of fixed dimensionality. It is a significant limitation, since many important problems are best expressed with sequences whose lengths are not known a-priori. For example, speech recognition and machine translation are sequential problems” — Sequence to Sequence Learning with Neural Networks [1]

The road to mastering neural language models like GPT-2 passes through the effort spent on a rather old NLP problem: automatic translation. As everyone knows, translating a sentence does not usually produce a new sentence of the same length. How can we give a deep neural network the freedom to generate a variable-length output sentence while preserving all the information contained in the input sentence?

The answer given by Google’s researchers is the sequence-to-sequence model, a combination of two pre-existing deep learning ideas: Recurrent Neural Networks and the Encoder-Decoder model. You have probably already heard about the basics of RNNs, since they are one of the most common architectures for dealing with sequential data; otherwise I suggest the preparatory material linked at the top of this article. The Encoder-Decoder model, on the other hand, is somewhat less common, so we will start from it. As you might guess from the name, it is made of two different blocks, the encoder model and the decoder model. The encoder processes the input and maps it into a fixed-length latent representation. This encoded representation ideally contains all the input information needed to perform the desired translation. The decoder then, starting from this encoded representation, generates the output sentence word by word.

symbolic scheme of an encoder-decoder architecture

The common denominator between the encoder and the decoder in the sequence-to-sequence approach is the Recurrent Neural Network, and thanks to it we have complete freedom over input and output lengths.

Before diving into the architectural details, let’s see what kind of improvement this evolution of the encoder-decoder brought to the state of the art in 2014. Here is a sample of translations produced by a similar encoder-decoder model published the previous year:

Kalchbrenner, N., & Blunsom, P. (2013, October). Recurrent Continuous Translation Models. In EMNLP (Vol. 3, №39, p. 413).

As you can see, the model is able to translate only short sentences, and in some cases the candidate translation does not even have a meaning similar to the “gold” translation. Let’s see instead what kind of translations the 2014 sequence-to-sequence model can generate.

Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in neural information processing systems (pp. 3104–3112).

You have surely noticed that, even if the ground truth is not perfectly matched, the model provides meaningful translations with high similarity to the true translation, even for very long sentences. I am aware that a single sampled comparison is not statistically significant, but I only want to give you an idea of the kind of improvement brought by this particular architecture. If you want the details of the performance gains on benchmark tasks, I suggest checking papers [1] and [2] listed in the references.

Now let’s look in more detail at how these components interact with each other.

The encoder

As we anticipated, the encoder part of the model is a Recurrent Neural Network. At each step of the recurrence, a new word of the input sentence is fed into the encoder and a hidden state is generated. The next recurrence step uses this hidden state together with the next word of the sentence to generate a new hidden state, and so on until all the words of the sentence have been processed and all the knowledge has been stored inside the final hidden state.
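As a concrete illustration (not the paper’s actual code), here is a minimal PyTorch sketch of such an encoder; the class name, embedding size, and hidden size are my own illustrative choices:

```python
import torch.nn as nn

class Encoder(nn.Module):
    """Reads the source sentence token by token and returns the final hidden state."""
    def __init__(self, vocab_size, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)

    def forward(self, src_tokens):                # src_tokens: (batch, src_len)
        embedded = self.embedding(src_tokens)     # (batch, src_len, emb_dim)
        _, (h_n, c_n) = self.rnn(embedded)        # final hidden and cell states
        return h_n, c_n                           # the encoded summary of the input
```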

The decoder

The decoder section of the model is a recurrent neural network too but with some differences from the encoder.

1. The first step of the recurrence receives as input the final hidden state generated by the encoder, the one that ideally contains all the information stored in the input sentence. (It also receives a fixed token, “<start>”, that lets the decoder know that this is the beginning of the output.)

first step of decoder sequence

2. Then at each subsequent recurrence step the RNN cell gets as input the previous hidden state and a word of the target translated sentence. The network aims to predict the word following the given one.

3. To predict that word, the decoder is designed so that at each step the hidden state, which would normally just be passed on to the next step, is also used to compute a discrete probability distribution (an array of numbers between 0 and 1 that sum to 1) over the output language’s dictionary. This distribution tells us, given the previous hidden state and word, the probability of each word in the dictionary being the next one. Usually there are several layers between the input word and the output distribution, such as additional RNN layers, dense layers, or more complex architectures, depending on the application.

4. The final component of the training phase is the loss function. We construct it by comparing the true translated sentence with the probability distributions over the dictionary that our network generated at each step. Using the per-word probabilities, we can compute the overall probability of any possible output sentence. We then take the probability our decoder assigns to the true translation and turn it into a loss such that the higher this probability is, the better our model is performing and the lower the loss value. Here is the formula of the loss function used in the previously mentioned paper.
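Written out in the notation explained just below, the objective that training maximizes (its negative serves as the loss) is the average log-probability of the correct translation:

1/|S| · Σ log p(T|O)

with the sum running over all sentence pairs (O, T) of the training set.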

Where O is the sentence in the original language, T is the true translation in the target language, p(T|O) is the probability that the model estimates for the true translation given the original sentence, and S is our training set.

The probability p(T|O) is computed as the product, over each step of the translated sentence (t = 1, …, m), of the probability of the translated word yₜ given all the preceding words (y₁, …, yₜ₋₁) and the encoded input v. This value, p(yₜ|v, y₁, …, yₜ₋₁), is exactly what the decoder RNN tries to estimate at each recurrence step.

Using this loss and applying the backpropagation algorithm, the network learns to assign higher probabilities to the dictionary words that form the actual translation of the input sentence, since the higher the assigned probability, the lower the loss.
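To make the four steps above and this loss concrete, here is a minimal PyTorch sketch of a decoder trained on the true translation (the names, sizes, and the <start>/<end> handling are illustrative assumptions, not the original implementation):

```python
import torch.nn as nn

class Decoder(nn.Module):
    """Predicts the next target word from the previous words and the encoded input."""
    def __init__(self, vocab_size, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)    # logits over the target dictionary

    def forward(self, tgt_tokens, encoder_state):
        # tgt_tokens: (batch, tgt_len), starting with the <start> token
        embedded = self.embedding(tgt_tokens)
        outputs, _ = self.rnn(embedded, encoder_state)  # initialized with the encoded input
        return self.out(outputs)                        # (batch, tgt_len, vocab_size)

# A training step (src, tgt_in = <start> + words, tgt_out = words + <end> assumed given):
#   logits = decoder(tgt_in, encoder(src))
#   loss = nn.CrossEntropyLoss()(logits.reshape(-1, logits.size(-1)), tgt_out.reshape(-1))
# Minimizing this cross-entropy minimizes -log p(T|O) (up to a normalization constant).
```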

Here is a scheme representing the whole encoder-decoder model for a sequence-to-sequence application:

This is the general idea of the training phase of a sequence-to-sequence architecture, but there is more important information to take into account. For example, how can we use this architecture to generate a new translation once the training phase is completed?

The test phase

Let’s suppose we have a completely new sentence in Italian and we want to generate its English translation. The encoder part of the model works exactly as during training: apply the RNN to each word of the input sentence and give the final hidden state as the first input to the decoder. In the decoder section, instead, there are two main differences from the training phase that could raise some doubts about how the translation is generated:

  1. During training, at each step of the recurrence, a word taken from the true translation is fed as input to the network, but when the model is actually applied the true translation is not available. What can we provide as input?
  2. At each step of the recurrence, the output of our RNN is not a word but a probability distribution over the whole dictionary. How do we select the next word of the translated sentence?

Question number one is actually not a big deal: we simply treat the words previously generated by our network as if they were the true translation and use them as input to generate the next word at each recurrence step. But this leads us to the second problem: how can we select the right word among all the possibilities in the dictionary, using the probability distribution given by our RNN output?

An obvious approach would be the following: at each step we select the most likely word in the dictionary and fix it as the output of the current step. This approach is called greedy search, sketched in the code below. This kind of generation, however, does not take into account a perhaps-not-so-intuitive fact, stated right after the sketch.
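For concreteness, a greedy decoding loop over the hypothetical Encoder and Decoder sketched earlier could look like this (start_id, end_id, and the recomputation of the whole prefix at every step are simplifying assumptions):

```python
import torch

def greedy_decode(encoder, decoder, src, start_id, end_id, max_len=50):
    """Generate a translation by always picking the most probable next word."""
    state = encoder(src)                                  # encoded summary of the input
    prefix = torch.tensor([[start_id]])                   # generated tokens so far (batch=1)
    for _ in range(max_len):
        logits = decoder(prefix, state)                   # re-run the decoder on the prefix
        next_id = logits[:, -1, :].argmax(dim=-1)         # greedy choice: most probable word
        prefix = torch.cat([prefix, next_id.unsqueeze(1)], dim=1)
        if next_id.item() == end_id:
            break
    return prefix[0, 1:].tolist()                         # drop the <start> token
```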

Selecting the most probable word at each step might not lead to the most probable translated sentence.

This is because, to solve doubt number one, we use the previously selected word to retrieve the distribution of the following one. Let’s suppose that our vocabulary contains only two words that can be selected as the first token, A and B, with probabilities (following the p(T|O) formula of the previous section):

These two probabilities are almost the same but, following the greedy approach, A is selected and B is rejected. If we select A as the first word, we are no longer considering any of the translations starting with B. But it is possible that, even if B had a slightly lower probability on its own, sentences starting with B have a higher probability overall. In other words, it is possible that, having chosen A as the first word, the next word is selected between A and B with overall probabilities (still following the p(T|O) formula of the previous section):

Our final selection would then be the sentence “A B”, with an overall probability of 0.26. But it is possible that choosing B as the first word would lead to

a sentence “B A” with a much higher probability than any sentence generated by fixing A as the first word. If we only select the word with the highest probability at each step, we might miss combinations with a better overall probability.
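To put hypothetical numbers on this (chosen only to be consistent with the 0.26 above): suppose p(A) = 0.50 and p(B) = 0.45 for the first word, so greedy search picks A. If p(B|A) = 0.52, the best sentence reachable from A is “A B” with probability 0.50 × 0.52 = 0.26. But if p(A|B) = 0.90, the discarded branch contains “B A” with probability 0.45 × 0.90 ≈ 0.41, which is higher.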

To be sure that the whole sentence probability is the global maximum, we would have to check every possible output sentence and compute its probability score.

Problem: there is a virtually infinite number of possible translations that our network could generate. Not so much because of the large size that dictionaries usually have, but simply because output sentences, by the very design of the sequence-to-sequence model, can have arbitrary length.

The solution, as often happens, lies between these two extremes, and it is called beam search.

The Beam Search

The beam search algorithm requires setting a parameter B that represents the “size of the beam”. It then proceeds through the following steps. For each step I will show an example applied to the Italian sentence “posso resistere a tutto tranne che alle tentazioni” (“I can resist everything except temptation”) with B=3. As we already said, the encoder part is no different from the training phase, so we will not consider it in the following examples.

1. The first token given as input to the decoder (in addition to the encoded hidden state) at the first recurrence step is always the “<start>” token. The decoder will then provide the usual array of probabilities over the dictionary, and we select from the whole dictionary the B most probable tokens.

2. Then we run the next recurrence step B times, each time using one of the token combinations selected so far as input. Doing this, we obtain B different probability distributions over the whole dictionary. Among all the token combinations generated, we again select the B most probable (considering not only the last token, but the whole sentence generated up to the current step).

3. Repeat step 2 until the “<end>” token has been selected for all B generated sentences, and finally pick the most probable of the B possibilities.

Beam search does not guarantee that the final result is the global optimum: it is a heuristic. It provides better results than greedy search at a small increase in computational cost, exploring a much wider area of the solution space.
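Here is a minimal, framework-free sketch of the idea; step_fn is a hypothetical function, assumed to wrap the decoder and the encoded input, that returns (word id, log-probability) pairs for the next word given the tokens generated so far:

```python
def beam_search(step_fn, start_id, end_id, beam_size=3, max_len=50):
    """Keep the beam_size most probable prefixes at every step."""
    beams = [([start_id], 0.0)]                    # (token ids, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for word_id, logp in step_fn(tokens):  # log p(next word | prefix, encoded input)
                candidates.append((tokens + [word_id], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:   # keep only the beam_size best prefixes
            (finished if tokens[-1] == end_id else beams).append((tokens, score))
        if not beams:                              # every surviving beam has emitted <end>
            break
    return max(finished + beams, key=lambda c: c[1])[0]
```

Note that setting beam_size=1 recovers the greedy search described above.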

Well, we have now seen the main components of a sequence-to-sequence model, but unfortunately, using just a basic recurrent neural network (RNN), none of this would work in the real world. In the next section we will see why, and what kind of evolved RNN can be used to make it work.

The LSTM

Another big issue the Google researchers had to face was the following: how can a classical RNN store and maintain the information from the first part of the sentence all the way to the end of the input encoding, in order to pass it to the decoder? At each step of the encoder the hidden state is updated, which means that words closer to the end of the sentence have a more direct influence on the final hidden state, while earlier words of the input sentence have a lower impact on what is passed to the decoder, simply because they are farther away from the encoded array. This is a well-known problem in sequence processing and goes under the name of the long-term dependencies problem. One of the most popular solutions in deep learning is a particular kind of Recurrent Neural Network called the Long Short-Term Memory network (LSTM) [3].

The LSTM cell, unlike the basic RNN, is able to decide at each recurrence step which parts of the previous hidden state should be used (or ignored) to compute the new hidden state, and which parts must be updated (or left untouched) before being passed to the following step. This allows an LSTM to carry the necessary information from the first elements of the sequence up to the final step, the one that, in a sequence-to-sequence architecture, computes the encoded array passed as initial input to the decoder. Hence the use of LSTMs allows the decoder to know even what stood at the beginning of the original sentence.

In the original implementation, the Google researchers of the aforementioned paper used two different LSTMs, one for the encoder and one for the decoder, and moreover stacked four LSTM layers on top of each other in both the encoder and the decoder. This means that at each recurrence step the hidden state is not only passed to the next step as the “previous hidden state”, but is also given as the “current input” to the following LSTM layer.
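In PyTorch terms, this kind of stacking roughly corresponds to the num_layers argument of the recurrent module; a sketch with illustrative sizes (not the dimensions used in the paper):

```python
import torch.nn as nn

# Four stacked LSTM layers: at every recurrence step the hidden state of each layer
# is fed as the input of the layer above, for both the encoder and the decoder.
encoder_rnn = nn.LSTM(input_size=256, hidden_size=512, num_layers=4, batch_first=True)
decoder_rnn = nn.LSTM(input_size=256, hidden_size=512, num_layers=4, batch_first=True)
```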

I am not going into the details of LSTMs here, but you can find a lot of good material on them; here is my suggestion if you would like to check all the details.

Conclusions

Since their introduction, a huge number of sequence-to-sequence architectures have been developed. Some of them achieve great performance in terms of accuracy and stability, and at my company, Machine Learning Reply, we have relied on these models on many occasions to develop our use cases:

  • In semantic search engines (for example in e-commerce), where these models enable a more flexible and accurate search, retrieving more relevant results.
  • In text embedding, to improve the performance of any text-based model starting from pre-trained semantic representations.
  • In document retrieval, for example to automatically answer questions about the usage of a platform by returning the appropriate section of the documentation.
  • In natural language classification, for example applied to automatic ticket routing.
  • In chatbot development, where these models achieve a higher level of comprehension and context awareness.

There are many additional tricks and evolutions worth examining, but they differ a lot from paper to paper. My goal in this article was to introduce the features that are common to most sequence-to-sequence models, and to give you the basics needed to join me on the path leading to the huge generative models that have amazed the whole NLP community over the last few years.

I sincerely thank you for reading this far. I hope you enjoyed this article, and I can’t wait to present the next one. We will talk about a trick that took to the next level not only NLP sequence-to-sequence models but also other deep learning areas such as computer vision: the attention mechanism.

References

[1] Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in neural information processing systems (pp. 3104–3112).

[2] Kalchbrenner, N., & Blunsom, P. (2013, October). Recurrent Continuous Translation Models. In EMNLP (Vol. 3, №39, p. 413).

[3] Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.


Mathematical Engineer, now working as a Data Scientist at Machine Learning Reply. NLP and deep learning are my main passions.