Cover image: https://commons.wikimedia.org/wiki/File:Muse_2017.jpg

Multilingual Universal Sentence Encoder (MUSE)

Davide Salvaggio
8 min read · Sep 2, 2019


The NLP model that speaks 16 languages

Hi,

As many of you may have experienced, one of the cards the real world often plays to put a spanner in the works of a data scientist dealing with Natural Language Processing is language. Have you just finished training your awesome chatbot on thousands of English FAQs? Well, it would be a shame if the business asked you to deploy it in an Italian branch of the company…

Well, here is something that could ease your multilingual problems.

Last July, Google AI released the Multilingual Universal Sentence Encoder. This is a sentence encoding model, trained simultaneously on multiple tasks and multiple languages, able to create a single embedding space common to all 16 languages it has been trained on.

Am I running too fast? OK let’s start from the beginning.

What is an Encoding Model?

Encoding models are models that map natural language elements like sentences, words and n-grams (sequences of n letters/words) into arrays of numbers. In this way each element can be represented as a single point in a vector space. The aim is to obtain this nice (computationally speaking) representation without losing too much information. By "information" I mean not only the semantics of a given element but also its style, its syntax, and especially its similarities and relationships with other encoded elements.

Example of information kept by encoding models: elements sharing the same relationship have the same distance. (Image from TensorFlow Hub)

Learning a good representation of a word or a sentence lets us perform lots of otherwise impossible tasks, without even training anything else on top of it. For example, in the previously mentioned FAQ chatbot task, one of the most common approaches is simply to compare the embedding of a new user question with the embeddings of all pre-existing FAQs, find the most similar one, and return the corresponding answer.
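To make the idea concrete, here is a minimal sketch of that retrieval logic. The embeddings below are random unit-norm vectors standing in for real sentence embeddings (so the actual match printed here is meaningless); with a real encoder such as the MUSE module shown later in this article, the same dot-product search returns the most semantically similar FAQ.

import numpy as np

rng = np.random.default_rng(0)

def fake_embed(sentences, dim=512):
    # Placeholder for a real sentence encoder: returns one random
    # unit-norm vector per sentence, just so the snippet runs on its own.
    vecs = rng.normal(size=(len(sentences), dim))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

faq_questions = ["How do I reset my password?",
                 "How can I change my billing address?",
                 "Where can I download my invoice?"]
faq_answers = ["Use the 'Forgot password' link on the login page.",
               "Go to Settings > Billing and edit your address.",
               "Invoices are available under Account > Documents."]

faq_embeddings = fake_embed(faq_questions)               # shape: (3, 512)
query_embedding = fake_embed(["I lost my password"])[0]  # shape: (512,)

# For unit-norm vectors the dot product is the cosine similarity,
# so the best-matching FAQ is simply the one with the highest score.
scores = faq_embeddings @ query_embedding
print(faq_answers[int(np.argmax(scores))])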

Encoding models' versatility has made them one of the most powerful weapons that machine learning has ever provided to the Natural Language Processing field. Since the huge spread of the famous Word2Vec model released in 2013, encoding models have proliferated. Some of the most popular: GloVe, Doc2Vec, FastText, ELMo, BERT, GPT and GPT-2.

Universal sentence encoder family

Actually, when talking about USE (Universal Sentence Encoder) we are not referring to a single specific model but to a family of sentence encoding models. At the moment, as we can see from the tensorflow-hub page, there are 5 available sentence encoding models, “trained with different goals including size/performance multilingual, and fine-grained question answer retrieval” (taken from tensorflow-hub page description).

The first two models refer to the paper "Universal Sentence Encoder for English" (2018) and the last three (the object of this article) come from the paper "Multilingual Universal Sentence Encoder for Semantic Retrieval" (2019).

The good of multilingual

What makes the Multilingual Universal Sentence Encoder a must-have in the NLP data scientist's arsenal?

Most publicly available pre-trained sentence embedding models are language specific. That means that the same model architecture must be trained on English to process English input, on Spanish to process Spanish input, and so on, generating different models (same architecture but different training) for each desired language.

MUSE is instead trained simultaneously on multiple languages using a deep learning technique called a "multi-task dual-encoder model". This means that sentences from different languages are mapped into the vector space using exactly the same model.

The main consequence is that sentences from different languages are now directly comparable using their encodings, since they live in the same vector space generated by the model. In other words, with MUSE, it is possible to tell the "meaning" distance between sentences written in two different languages without any intermediate step.
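As a small preview of what this means in practice, here is a hedged, self-contained sketch: the sentence pair is made up, the module URL is the same one used in the full walkthrough later in this article, and the code assumes TensorFlow 1.x with tensorflow_hub and tf_sentencepiece installed.

import tensorflow as tf
import tensorflow_hub as hub
import numpy as np
import tf_sentencepiece  # registers the SentencePiece ops used by the module

# One English and one Italian sentence with (roughly) the same meaning.
sentences = tf.constant(["The cat sleeps on the sofa.",
                         "Il gatto dorme sul divano."])
embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder-multilingual/1")
embeddings = embed(sentences)

with tf.Session() as session:
    session.run([tf.global_variables_initializer(), tf.tables_initializer()])
    en_vec, it_vec = session.run(embeddings)

# Same embedding space, so the dot product is directly meaningful across languages.
print(float(np.inner(en_vec, it_vec)))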

The languages that MUSE is able to encode are:

  1. Arabic (ar)
  2. Chinese (PRC) (zh)
  3. Chinese (Taiwan) (zh-tw)
  4. Dutch (nl)
  5. English (en)
  6. German (de)
  7. French (fr)
  8. Italian (it)
  9. Portuguese (pt)
  10. Spanish (es)
  11. Japanese (ja)
  12. Korean (ko)
  13. Russian (ru)
  14. Polish (pl)
  15. Thai (th)
  16. Turkish (tr)

How can I use it?

This article would be of no use without explaining how to access this useful NLP tool. Thanks to the magnanimous spirit of Google AI researchers, all these pre-trained models are available on tensorflow-hub and easily runnable on your laptop with Python 3 using the tensorflow module. If you are not familiar with tensorflow, no worries; take it as a black box model mapping sentences to encoded arrays.

Let's show how to use it with an example. Suppose you have to translate a text from Italian to English. Suppose also that "Mamma mia" and "Pizza" are the only Italian words you can recall, but the text you are supposed to translate describes a complex Italian recipe (that is not pizza ;D ). Last assumption (I promise): you are very fussy. I mean VERY fussy. It's not enough to throw the text into Google Translate and copy-paste whatever it spits out. You are looking for the very best online translation service, since nothing of this tasty Italian masterpiece shall be lost due to a poor translation. But how can you compare the quality of translations from a language that you do not know? Well, you can do it if that language is among the sixteen listed above, and luckily Italian is.

To solve this problem we are looking, among a handful of the most used online translators, for the one where the original sentences and the translated results are most similar according to their MUSE representations. This is because the closer two MUSE representations are to each other, the more similar the two encoded sentences are, despite the difference in language. Since MUSE representations of sentences are nothing more than normalized 512-dimensional arrays, a measure of similarity between two of them can be computed using the scalar product. We have reduced the original problem to the task of seeking the translator whose encoded translated sentences have the highest scalar product with the encoded Italian sentences. A piece of cake, isn't it?
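To see why the scalar product works as a similarity measure here, recall that for unit-norm vectors the dot product coincides with the cosine similarity. A tiny check with made-up vectors (not real MUSE embeddings):

import numpy as np

# Two made-up vectors, normalized to unit length like MUSE embeddings.
a = np.array([0.3, 0.4, 0.5, 0.7])
b = np.array([0.1, 0.9, 0.2, 0.4])
a = a / np.linalg.norm(a)
b = b / np.linalg.norm(b)

dot = np.dot(a, b)
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(dot, cosine)  # identical: for unit vectors the dot product IS the cosine similarity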

This is a small sample of the Italian text:

Per preparare gli arancini di riso, iniziate lessando il riso in 1,2 l di acqua bollente salata, in modo da far sì che, a cottura avvenuta, l’acqua sia stata completamente assorbita (questo permetterà all’amido di rimanere tutto in pentola e otterrete un riso molto asciutto e compatto). Fate cuocere per circa 15 minuti, poi sciogliete lo zafferano in pochissima acqua calda e unitelo al riso ormai cotto. Unite anche il burro a pezzetti.

This is one of the many reasons why I love my country ❤ (https://en.wikipedia.org/wiki/Arancini)

And these are the translations respectively obtained from Google Translate, Reverso and Bing Translator.

Google Translate (gt)

To prepare the rice arancini, start by boiling the rice in 1.2 l of salted boiling water, so that, once cooked, the water has been completely absorbed (this will allow the starch to remain all in the pot and you will get a very dry and compact rice). Cook for about 15 minutes, then dissolve the saffron in very little hot water and add it to the now cooked rice. Also add the butter into small pieces.

Reverso (rv)

To prepare the rice oranges, start by boiling the rice in 1.2 l of boiling salted water, so that, once cooked, the water has been completely absorbed (this will allow the starch to stay in the pot and you will get a very dry and compact rice). Cook for about 15 minutes, then melt the saffron in very little hot water and add it to the rice now cooked. Add the butter pieces too.

Bing Translator (bng)

To prepare the rice arancini, start by boiling the rice in 1.2 l of boiling salted water, so that, when cooked, the water has been completely absorbed (this will allow the starch to stay all in the pot and you will get a very dry rice compact). Cook for about 15 minutes, then melt the saffron in very little hot water and add it to the cooked rice. Add the butter into small pieces.

They look quite similar, but in many passages the different translators made different decisions, for example in the last sentence. Which one of these translations is the most faithful to the original Italian text?

OK, let's start! First import the necessary packages and set the desired model URL variable.

import tensorflow as tf
import tensorflow_hub as hub
import numpy as np
import tf_sentencepiece  # registers the SentencePiece ops required by the multilingual module
model_url = "https://tfhub.dev/google/universal-sentence-encoder-multilingual/1"

Load the Italian text and the three candidate translations: one from Google Translate, the second from Reverso and the third from Bing Translator.

# Load the Italian text.
with open("Arancini_ITA", "r") as file:
    ita_sentences = file.read().splitlines()
# Load the translation from Google Translate.
with open("Arancini_ENG_GT", "r") as file:
    google_sentences = file.read().splitlines()
# Load the translation from Reverso.
with open("Arancini_ENG_RV", "r") as file:
    reverso_sentences = file.read().splitlines()
# Load the translation from Bing Translator.
with open("Arancini_ENG_BNG", "r") as file:
    bing_sentences = file.read().splitlines()

Set up the tensorflow graph. If you are familiar with tensorflow you will have no problem deciphering the code; otherwise, as I said, take it as a black box model application.

# Graph set-up.
g = tf.Graph()
with g.as_default():
    text_input = tf.placeholder(dtype=tf.string, shape=[None])
    embed = hub.Module(model_url)
    embedded_text = embed(text_input)
    init_op = tf.group([tf.global_variables_initializer(), tf.tables_initializer()])
g.finalize()

Compute the vector representation of all sentences in all four documents. Each sentence is mapped into a 512-element array, so each document is represented as an M x 512 matrix, where M is the number of sentences in the original document.

# Initialize session.
session = tf.Session(graph=g)
session.run(init_op)
# Compute embeddings.
it_result = session.run(embedded_text, feed_dict={text_input: ita_sentences})
gt_result = session.run(embedded_text, feed_dict={text_input: google_sentences})
rv_result = session.run(embedded_text, feed_dict={text_input: reverso_sentences})
bng_result = session.run(embedded_text, feed_dict={text_input: bing_sentences})
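Before comparing anything, a quick sanity check of the claims above: each document should now be an M x 512 matrix and, if the embeddings are indeed (approximately) normalized as described earlier, every row should have norm close to 1.

# One row per sentence, 512 columns per row.
print(it_result.shape)
# Row norms should all be close to 1 if the embeddings are (approximately) normalized.
print(np.linalg.norm(it_result, axis=1))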

Now the final step. To check similarities between Italian and English sentences we just have to compute the scalar product between each Italian sentence encoding and the encoding of the corresponding translated sentence from each translator, and then average these similarities over all sentences.

# Compute the similarity matrices. Higher score indicates greater similarity.
similarity_matrix_gt = np.inner(it_result, gt_result)
similarity_matrix_rv = np.inner(it_result, rv_result)
similarity_matrix_bng = np.inner(it_result, bng_result)
# The diagonal holds the similarities of corresponding sentence pairs
# (Italian sentence i vs. its translation i); average it to score each translator.
gt_avg_similarity = np.mean(np.diag(similarity_matrix_gt))
rv_avg_similarity = np.mean(np.diag(similarity_matrix_rv))
bng_avg_similarity = np.mean(np.diag(similarity_matrix_bng))
print("Google translate average similarity: {0}".format(round(float(gt_avg_similarity), 3)))
print("Reverso average similarity: {0}".format(round(float(rv_avg_similarity), 3)))
print("Bing average similarity: {0}".format(round(float(bng_avg_similarity), 3)))

The result is the following:

Google Translate average similarity: 0.846
Reverso average similarity: 0.834
Bing Translator average similarity: 0.822

The difference in average similarities is very small (quite obvious, since they are all translations of the same text), but Google Translate turned out to generate sentences whose meaning seems slightly closer to the original Italian ones. This approach can also be used to compute an approximate score of how faithful a translation is to the original text.
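For reuse, the whole procedure can be packed into a small helper function. The name and signature below are just a suggestion; it relies on the session, text_input and embedded_text objects built above and assumes the two sentence lists are aligned one-to-one.

def average_similarity(source_sentences, translated_sentences):
    # Embed both aligned sentence lists with the same MUSE model.
    src = session.run(embedded_text, feed_dict={text_input: source_sentences})
    tgt = session.run(embedded_text, feed_dict={text_input: translated_sentences})
    # The diagonal of the similarity matrix holds corresponding sentence pairs.
    return float(np.mean(np.diag(np.inner(src, tgt))))

# For example, average_similarity(ita_sentences, google_sentences) reproduces
# the Google Translate score computed above (~0.846).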

Thank you for reading, and please let me know if this article helped you in any way, or if you have comments, questions or any points of discussion related to NLP and Machine Learning. I'd love to know your opinion on these incredible topics!

Bye!

Davide Salvaggio
