
THE LANGUAGE MODELING PATH: CHAPTER 4

GPT: Magic or Illusion?

Considerations on the most famous large language models

11 min read · Oct 25, 2021


Hi, and welcome to the fourth and final step of the Language Modeling Path, a series of articles from Machine Learning Reply about the birth of the huge language models able to imitate and (let’s say) understand human language, like BERT and GPT-3.

During the previous steps of the Language Modeling Path, we tried to provide you with the technical tools to understand the underlying processes that let Transformer-based models outperform previous approaches. In this article, we will see how, after the Transformer’s release, researchers focused more on new ways of training this kind of architecture than on looking for a different model. This article will focus on one of the brightest examples of Transformer-based architectures, the GPT models. This content will be far less technical than the previous ones, but if you would like to go a little deeper into the theory behind neural natural language models, please consider starting from the beginning of the Language Modeling Path:

1. Chapter one: “Sequence-to-Sequence architectures”

2. Chapter two: “Neural Networks, at Attention!”

3. Chapter three: “Transformer: the fall of RNNs”

GPT Family

Among the vast list of transformer-based models, OpenAI’s GPT models stand out for their popularity and resonance. GPT stands for Generative Pre-Training, because this family of transformer-based models is trained in two different phases: the first one (pre-training) simply consists of “predicting the next word” in a huge corpus of documents. This capability makes this kind of model particularly effective at generating text. The second phase is supervised fine-tuning on different tasks, such as question answering or text classification. This kind of training leads to a very versatile model, while the huge amount of data used makes it able to generate text that is almost indistinguishable from human-written text.
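To make the pre-training phase a bit more concrete, here is a minimal sketch of the next-token-prediction objective: the model produces one score vector per position, and each one is compared against the token that actually comes next. This is only an illustration, not OpenAI’s training code; the vocabulary size is GPT-2’s, and the “model” is a random stand-in that keeps the sketch runnable.

```python
import torch
import torch.nn.functional as F

vocab_size = 50257                           # GPT-2's BPE vocabulary size
tokens = torch.randint(vocab_size, (1, 16))  # a dummy "document" of 16 token ids

# Stand-in for any causal Transformer: returns one logit vector per position.
def model(input_ids):
    return torch.randn(input_ids.shape[0], input_ids.shape[1], vocab_size)

logits = model(tokens[:, :-1])   # predictions made from positions 0..14
targets = tokens[:, 1:]          # the "next words" the model must predict

# Standard language-modeling loss: cross-entropy between predictions and targets.
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
print(loss)
```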

One of the most popular instances of this family is GPT-2, a model with 1.5 billion parameters. The following is an extract from the OpenAI blog post published with the release of the GPT-2 paper (February 2019). In this case, the model is used to generate text that completes a prompt provided as input:

Input (human-written)

A train carriage containing controlled nuclear materials was stolen in Cincinnati today. Its whereabouts are unknown.

Model Completion (first try)

The incident occurred on the downtown train line, which runs from Covington and Ashland stations.

In an email to Ohio news outlets, the U.S. Department of Energy said it is working with the Federal Railroad Administration to find the thief.

“The theft of this nuclear material will have significant negative consequences on public and environmental health, our workforce and the economy of our nation,” said Tom Hicks, the U.S. Energy Secretary, in a statement. “Our top priority is to secure the theft and ensure it doesn’t happen again.”

The stolen material was taken from the University of Cincinnati’s Research Triangle Park nuclear research site, according to a news release from Department officials.

The Nuclear Regulatory Commission did not immediately release any information.

According to the release, the U.S. Department of Energy’s Office of Nuclear Material Safety and Security is leading that team’s investigation.

“The safety of people, the environment and the nation’s nuclear stockpile is our highest priority,” Hicks said. “We will get to the bottom of this and make no excuses.

Personally, I think this is outstanding.
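If you want to try this kind of prompt completion yourself, the released GPT-2 checkpoints can be loaded through the Hugging Face transformers library. Consider this a rough, unofficial reproduction (the original demo used OpenAI’s own sampling code; the checkpoint name and sampling parameters below are my own choices):

```python
from transformers import pipeline

# Load a publicly released GPT-2 checkpoint
# ("gpt2" is the small 124M model; "gpt2-xl" is the full 1.5B one).
generator = pipeline("text-generation", model="gpt2")

prompt = ("A train carriage containing controlled nuclear materials "
          "was stolen in Cincinnati today. Its whereabouts are unknown.")

# Sample a continuation; top-p sampling keeps the output reasonably coherent.
result = generator(prompt, max_length=120, do_sample=True, top_p=0.9)
print(result[0]["generated_text"])
```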

The incredible skill of the GPT-2 model raised some concerns in OpenAI regarding the malicious purposes that this tool could be used for. In the same article, they mentioned applications like

· Generating misleading news articles

· Impersonating others online

· Automating the production of abusive or fake posts on social media

· Automating the production of spam/phishing content

And, in general, the threat it poses is similar to the one posed by “deep fakes” in the field of computer vision. In the same blog post, they stated that

The public at large will need to become more skeptical of text they find online

Due to these possible implications, the GPT-2 parameters were not released all at once but in a few stages, so that OpenAI could gradually evaluate the impact of this incredible generative model on society.

  • February 2019: first announcement, 124M-parameter model, sampling code
  • May 2019: 355M-parameter model
  • August 2019: 774M-parameter model
  • November 2019: 1.5B-parameter model, training code, outputs dataset

GPT-2 was fully open-sourced almost two years ago and society still hasn’t collapsed. If you are reading this article from a future where machines took over humanity, you might agree with OpenAI’s reticence, but many other experts pointed out that similar concerns arose with the birth of Photoshop and other tools that make it easier to produce forged content. Even in that case, we are still holding on.

Since November 2019, a few other players have made their moves, presenting other transformer-based models to outperform GPT-2, like the Turing Natural Language Generation (T-NLG) model from Microsoft with its 17B parameters. OpenAI, however, hasn’t sat on its hands, and in May 2020 they published a paper presenting GPT-3. Apparently, OpenAI took this “number of parameters” challenge quite seriously and decided that 175B parameters was a proper size for GPT-3.

If OpenAI proceeded with extreme caution when releasing GPT-2, access to GPT-3 is even more restricted. Currently, the official version can only be used via a specific API that OpenAI provides. In this way, any use of the model is constantly monitored, making it easier for OpenAI to spot malicious intents.
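For reference, this is roughly what a call looks like with the official openai Python package as documented at the time of writing; the engine name, prompt, and parameters below are just placeholders, and you need an API key from an approved account:

```python
import openai

openai.api_key = "YOUR_API_KEY"  # obtained from your approved OpenAI account

# Ask the model to continue a prompt; "davinci" was the largest
# publicly offered GPT-3 engine at the time.
response = openai.Completion.create(
    engine="davinci",
    prompt="Write a one-sentence summary of the Transformer architecture:",
    max_tokens=64,
    temperature=0.7,
)

print(response.choices[0].text)
```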

On the other hand, it is worth mentioning that after the first release of GPT-2 (or even before), OpenAI started a transition from a non-profit to a capped-profit company, and in September 2020 it licensed GPT-3 exclusively to Microsoft. I am not sure that this is still a way of democratizing Artificial Intelligence. At the birth of OpenAI in 2015, co-founder Elon Musk stated:

If everyone has AI powers, then there’s not any one person or a small set of individuals who can have AI superpower

I hope that this way of imagining AI will persist within this company.

GPT Open source cousins

The restrictions on the usage of GPT-3, given by the costs of OpenAI’s API and the forbidden access to the code, have pushed other labs to work on their own versions of the same model. One of the most notable examples of this urge is EleutherAI, which in March 2021 released GPT-Neo, with up to 2.7 billion parameters, and in June 2021 released GPT-J, with 6 billion parameters. Even if we are quite far from the 175 billion of GPT-3, in this case the researchers decided to completely open-source these models, and trained versions of GPT-Neo and GPT-J are even available on the Hugging Face Model Hub, so if you want to play with them you can rely on the on-demand Inference API. Even if in some cases these inference APIs still imply a cost, it is significantly lower than OpenAI’s pricing.

It is also possible to run the inference code on your own setup, but it requires a lot of resources. For example, GPT-J needs around 25GB of GPU VRAM, which is not quite common for the average user.
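As a rough sketch of what self-hosted inference looks like with the transformers library: the model name is the real Hugging Face Hub identifier, while the generation parameters and the use of half precision to reduce the memory footprint are my own choices, and a large GPU is still required.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-J 6B from the Hugging Face Model Hub; loading the weights in
# float16 roughly halves the memory needed compared to float32.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B", torch_dtype=torch.float16
).to("cuda")

prompt = "The Transformer architecture changed natural language processing because"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")

# Sample a continuation of a few dozen tokens.
output = model.generate(input_ids, do_sample=True, temperature=0.9, max_length=80)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```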

What can it do?

Thousands of applications of the GPT family have seen the light of day since the first model’s release, and many others are still emerging, even relying just on the official GPT-3 API. Let’s see a roundup of the most popular demos:

GPT Helps Writing:

This is the primary task of a GPT model like GPT-3, which was mainly trained on natural language. The peculiarity of this model is that it is highly context-aware and can infer the style of what we want to generate from among a variety of writing fashions.

For example, in this article, Delian Asparouhov uses GPT-3 to generate an investment memo (a concise document that highlights the opportunities, threats, and strategies of a company, to present it to potential investors) on a certain company. It is incredible how the model seems aware of some domain-knowledge entities, such as possible investors or partnerships with existing companies (e.g. ACSM), without any reference to them in the input prompt.

A notable example was the one presented by “The Guardian” in September 2020. The newspaper published an article entitled “A robot wrote this entire article. Are you scared yet, human?”, an op-ed essay generated using GPT-3, which had been prompted to convince humans that robots come in peace. This particular example was strongly criticized because it is clear that its main purpose was to create a lot of hype and turn the mixed feelings that people have about AI into a source of profit. The article is in fact the result of a cut-and-paste process that merged eight different GPT-3 outputs. I agree with those who point out that it would have been even more interesting to see the original results instead of the arranged product; still, I suggest having a look at the article.

The generated text, unlike that of older versions of the model, is quite good at staying on topic even over long spans, and the single paragraphs, taken individually, also retain a certain coherence. On the other side, I felt it was missing a well-built structure smoothly connecting all the paragraphs, which instead look a bit detached from one another. Another thing that I found pretty interesting is this passage: “…. Microsoft tried to create a user-friendly AI, called Tay, who spoke like a teen girl … and was racist. …”. This TRUE piece of information was not in the input that the journalists provided to the model, but it was somewhere in the hundreds of GB of text that the model processed during its training. The model simply decided that this was a good moment to make it pop out. Even just for this particular decision, I find this kind of technology pretty impressive.

It goes without saying that the automatic generation of text could potentially have a huge negative impact on the news world. Currently, even with the most advanced models, it is very difficult to avoid generated text containing elements that are invented from scratch or nonsensical. Keep also in mind that, even if most of the researchers working on such models take care to provide the highest quality training data, it is possible that, among the huge corpus of text used, fake news and socially inappropriate content end up being processed as well, and the language model just imitates what it has seen. So, at the moment, it is still strongly not recommended to delegate news reporting to a machine.

GPT Helps Coding:

Imagine writing code by simply describing what you want to do in natural language. In June 2020 people were blowing their minds with demos like this one from Sharif Sameen, where a small description of a simple webpage layout was fed into GPT-3, which used it to generate the appropriate JSX code. Well, it was just the beginning. Another GPT model from OpenAI, called “Codex”, was released in July 2021 and it looks amazing. The purpose is always the same, translating natural language into working code, but OpenAI’s product seems to be unrivaled in doing it. Thanks to a partnership with GitHub, this model gave birth to “GitHub Copilot”, a Visual Studio Code extension that is proficient in more than a dozen programming languages and that can perform multiple different tasks (a small illustration follows the list):

  • Convert comments that describe a piece of code into the code itself
  • Autofill for repetitive code
  • Auto-generate code tests
  • Navigate through different alternatives for automatic completion
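To give an idea of the first point, here is a hypothetical comment-to-code exchange: the developer writes only the comment and the function signature, and a Codex-style model is expected to fill in the body. This is not actual Copilot output, just an illustration of the interaction.

```python
# The developer writes the comment and the signature...
# Return the n-th Fibonacci number, computed iteratively.
def fibonacci(n: int) -> int:
    # ...and a Codex-style assistant proposes a body like the following.
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

print(fibonacci(10))  # 55
```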

As OpenAI’s developers observe in their blog,

“…, the act of writing code can be thought of as (1) breaking a problem down into simpler problems, and (2) mapping those simple problems to existing code (libraries, APIs, or functions) that already exist. The latter activity is probably the least fun part of programming (and the highest barrier to entry), and it’s where OpenAI Codex excels most.”

To give you a sense of how effective such a model is, I suggest checking this live coding session, where a small game is built completely from scratch simply by describing what the game components are and how they must behave based on player input.

Unfortunately, access to this useful tool is limited (you don’t say?). To obtain the right to download the VS Code extension you need to join a waitlist and hope for the best. At the moment there is no paid version that you can access.

Magic or Illusion?

GPT-like models are just one of the many kinds of models that originated from the Transformer architecture and that vary in structure, training, and field of application. In the previous articles of the Language Modeling Path, we mentioned the BERT model, from which another wide family of models has been developed. It would probably take an entire book to describe all of them, and each day new developments take place and new articles are published. In addition to that, some researchers are working on Transformer alternatives: for example, there are studies claiming that it is possible to achieve the same results as Transformers with simpler architectures.

With this series of articles, we have focused on only a small region of neural NLP, the one that is currently in the spotlight. It is difficult to predict what the next disruptive paper to take its place will be. Will it be possible to keep creating bigger and bigger Transformers, feeding them an unimaginable amount of text to obtain models better and better at imitating human language? Will this be enough to achieve a kind of actual “understanding” of the logical relationships between concepts that current models lack (sometimes GPT-3 talks about lighting fires underwater)? Or, instead, will it be necessary to wait for a completely new approach to make the next step toward the so-called Artificial General Intelligence? Few people can answer these questions, if anyone can.

I have been totally amazed by the capabilities of transformer-based models. The first glance at a text generated by GPT-3 feels like magic, and it was an exciting journey to explore in depth the logic that makes it work. Despite this, I truly hope for a change in the paradigm of neural NLP research. Not only because it has been shown that such models are far from actually ‘understanding’ what they are generating, but mainly because models that require such cumbersome resources to be developed make the power of AI a privilege for a few giants of the sector.

In the first stages of development of these tools, we observed a sharing intent, where the community could more or less play freely, experimenting with and evaluating the trained models provided by the big players, and even then it was not that easy to perform a homemade training. Now, however, we feel a drift towards a more profit-oriented approach, where models can be accessed only upon payment or by specially selected users. The reasons behind this drift may also include some security concerns, with which I agree, but it will probably also make the role of intermediary agents, qualified to handle such resources and able to communicate both with the scientists of the big labs and with business end-users, more and more relevant over time.

Conclusion

Thanks for reading up to the last step of the Language Modeling Path. It was an absolute pleasure to walk through these beautiful tools and the gears that make them work. The road is still long before we arrive at neural NLP models able to perfectly reproduce the human process of communication, but, along that road, GPT and the other Transformer-based models will surely be remembered as an important milestone. I hope that this series of articles helped you to understand the ‘why’ and the ‘how’ of it.

Can’t wait to see you in the next Machine Learning Reply article.

Sincerely, Davide.

References

[1] https://openai.com/blog/better-language-models/

[2] https://openai.com/blog/openai-api/

[3] https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/

[4] https://blogs.microsoft.com/blog/2020/09/22/microsoft-teams-up-with-openai-to-exclusively-license-gpt-3-language-model/

[5] https://github.com/kingoflolz/mesh-transformer-jax/

[6] https://huggingface.co/EleutherAI/gpt-j-6B

[7] https://www.theguardian.com/commentisfree/2020/sep/08/robot-wrote-this-article-gpt-3

[8] https://openai.com/blog/openai-codex/

[9] https://arxiv.org/abs/2105.08050


Mathematical Engineer, now working as a Data Scientist at Machine Learning Reply. NLP and deep learning are my main passions.