Large language models evolved alongside deep-learning neural networks and are critical to generative AI. Here's a first look, including the top LLMs and what they're used for today.
Large language models (LLMs) such as GPT, Bard, and Llama 2 have caught the public’s imagination and garnered a wide variety of reactions. This article looks behind the hype to help you understand the origins of large language models, how they’re built and trained, and the range of tasks they are specialized for. We’ll also look at the most popular LLMs in use today.
What is a large language model?
Language models go back to the early twentieth century, but large language models (LLMs) emerged in force with the rise of deep neural networks. The Transformer deep neural network architecture, introduced in 2017, was particularly instrumental in the evolution from language models to LLMs.
Large language models are useful for a variety of tasks, including text generation from a descriptive prompt, code generation and code completion, text summarization, translating between languages, and text-to-speech and speech-to-text applications.
LLMs also have drawbacks, at least in their current developmental stage. Generated text is usually mediocre, and sometimes comically bad. LLMs are known to invent facts, a phenomenon called hallucination, and the invented facts can seem plausible if you don’t know better. Language translations are rarely 100% accurate unless they’ve been vetted by a native speaker, which is usually only done for common phrases. Generated code often has bugs, and sometimes has no hope of running. While LLMs are usually fine-tuned to avoid making controversial statements or recommending illegal acts, it is possible to breach these guardrails with malicious prompts.
Training large language models requires at least one large corpus of text. Examples of training corpora include the 1B Word Benchmark, Wikipedia, the Toronto Books Corpus, the Common Crawl dataset, and public open source GitHub repositories. Two potential problems with large text datasets are copyright infringement and garbage. Copyright infringement is currently the subject of multiple lawsuits. Garbage, at least, can be cleaned up; an example of a cleaned dataset is the Colossal Clean Crawled Corpus (C4), an 800GB dataset based on the Common Crawl dataset.
Along with at least one large training corpus, LLMs require large numbers of parameters, also known as weights. The number of parameters grew over the years, until it didn’t. ELMo (2018) has 93.6 million parameters; BERT (2018) was released in 110-million and 340-million parameter sizes; GPT (2018) uses 117 million parameters; and T5 (2020) has 220 million parameters. GPT-2 (2019) has 1.5 billion parameters; GPT-3 (2020) uses 175 billion parameters; and PaLM (2022) has 540 billion parameters. GPT-4 (2023) is rumored to have about 1.76 trillion parameters, although OpenAI has not disclosed the actual number.
More parameters make a model more accurate, but models with more parameters also require more memory and run more slowly. In 2023, we’ve started to see some relatively smaller models released at multiple sizes: for example, Llama 2 comes in sizes of 7 billion, 13 billion, and 70 billion parameters, while Claude 2 is reported to come in 93-billion and 137-billion parameter sizes.
A history of AI models for text generation
Language models go back to Andrey Markov, who applied mathematics to poetry in 1913. Markov showed that in Pushkin’s Eugene Onegin, the probability of a character appearing depended on the previous character, and that, in general, consonants and vowels tended to alternate. Today, Markov chains are used to describe a sequence of events in which the probability of each event depends on the state of the previous one.
Markov’s work was extended by Claude Shannon in 1948 for communications theory, and again by Fred Jelinek and Robert Mercer of IBM in 1985, who produced a language model based on cross-validation (which they called deleted estimates) and applied it to real-time, large-vocabulary speech recognition. Essentially, a statistical language model assigns probabilities to sequences of words.
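To make the statistical idea concrete, here is a minimal, purely illustrative sketch of a bigram (first-order Markov chain) language model in Python. The toy corpus and function names are invented for this example; a real model would be estimated from a far larger corpus and would need smoothing for unseen word pairs.

```python
import random
from collections import defaultdict, Counter

# Toy corpus; a real language model would use a far larger one.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count how often each word follows each other word (bigram counts).
bigram_counts = defaultdict(Counter)
for prev, word in zip(corpus, corpus[1:]):
    bigram_counts[prev][word] += 1

def next_word_probability(prev, word):
    """P(word | prev), estimated from the bigram counts."""
    counts = bigram_counts[prev]
    total = sum(counts.values())
    return counts[word] / total if total else 0.0

def generate(start, length=6):
    """Sample a short sequence, one word at a time, from the Markov chain."""
    words = [start]
    for _ in range(length):
        counts = bigram_counts[words[-1]]
        if not counts:
            break
        choices, weights = zip(*counts.items())
        words.append(random.choices(choices, weights=weights)[0])
    return " ".join(words)

print(next_word_probability("the", "cat"))  # 0.25 for this toy corpus
print(generate("the"))                      # e.g. "the dog sat on the mat ."
```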
To quickly see a language model in action, just type a few words into Google Search, or a text message app on your phone, with auto-completion turned on.
In 2000, Yoshua Bengio and co-authors published a paper detailing a neural probabilistic language model in which neural networks replace the probabilities in a statistical language model, bypassing the curse of dimensionality and improving word predictions over a smoothed trigram model (then the state of the art) by 20% to 35%. The idea of feed-forward auto-regressive neural network models of language is still used today, although the models now have billions of parameters and are trained on extensive corpora; hence the term “large language model.”
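For illustration only, here is a rough PyTorch sketch of that kind of feed-forward, auto-regressive language model (not Bengio's exact architecture): the previous few words are embedded, the embeddings are concatenated, and a hidden layer feeds a distribution over the vocabulary for the next word.

```python
import torch
import torch.nn as nn

class FeedForwardLM(nn.Module):
    """A Bengio-style neural language model: predict the next word
    from the embeddings of the previous `context_size` words."""
    def __init__(self, vocab_size, embed_dim=64, context_size=3, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.hidden = nn.Linear(context_size * embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context_ids):
        # context_ids: (batch, context_size) integer word indices
        emb = self.embed(context_ids)            # (batch, context, embed_dim)
        flat = emb.view(emb.size(0), -1)         # concatenate the context embeddings
        hidden = torch.tanh(self.hidden(flat))
        return self.out(hidden)                  # unnormalized scores over the vocabulary

model = FeedForwardLM(vocab_size=10_000)
context = torch.randint(0, 10_000, (1, 3))      # three previous word IDs
next_word_logits = model(context)
probs = torch.softmax(next_word_logits, dim=-1) # P(next word | context)
```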
Language models have continued to get bigger over time, with the goal of improving performance. But such growth has downsides. The 2021 paper, On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?, questions whether we are going too far with the larger-is-better trend. The authors suggest weighing the environmental and financial costs first and investing resources into curating and documenting datasets rather than ingesting everything on the web.
Language models and LLMs explained
Current language models have a variety of tasks and goals and take various forms. For example, in addition to the task of predicting the next word in a document, language models can generate original text, classify text, answer questions, analyze sentiment, recognize named entities, recognize speech, recognize text in images, and recognize handwriting. Customizing language models for specific tasks, typically using small to medium-sized supplemental training sets, is called fine-tuning.
Some of the intermediate tasks that go into language models are as follows:
- Segmentation of the training corpus into sentences
- Word tokenization
- Stemming
- Lemmatizing (conversion to the dictionary form, or lemma)
- POS (part of speech) tagging
- Stopword identification and (possibly) removal
- Named-entity recognition (NER)
- Text classification
- Chunking (breaking sentences into meaningful phrases)
- Coreference resolution (finding all expressions that refer to the same entity in a text)
Several of these are also useful as tasks or applications in and of themselves, such as text classification.
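Many of these intermediate tasks can be seen in a single pass of an off-the-shelf NLP library. The sketch below is a hypothetical example using spaCy (assuming the library and its small English model, en_core_web_sm, are installed); it shows sentence segmentation, tokenization, lemmatization, POS tagging, stopword identification, named-entity recognition, and chunking on one short text.

```python
import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Google released the Transformer architecture in 2017. It changed NLP.")

print([sent.text for sent in doc.sents])        # sentence segmentation
for token in doc:
    # tokenization, lemmatization, POS tagging, stopword identification
    print(token.text, token.lemma_, token.pos_, token.is_stop)

print([(ent.text, ent.label_) for ent in doc.ents])  # named-entity recognition
print([chunk.text for chunk in doc.noun_chunks])      # chunking (noun phrases)
```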
Large language models are different from traditional language models in that they use a deep learning neural network and a large training corpus, and they have millions or even billions of parameters, or weights, in the neural network. Training an LLM is a matter of optimizing the weights so that the model has the lowest possible error rate for its designated task. An example task would be predicting the next word at any point in the corpus, typically in a self-supervised fashion.
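As a hedged sketch of what that objective looks like, the snippet below shows the self-supervised next-token setup: the corpus itself supplies the labels, because the target at each position is just the token that follows, and the error being minimized is typically the cross-entropy between the model's predictions and those targets. The token IDs and random logits here are placeholders standing in for a real tokenizer and model.

```python
import torch
import torch.nn.functional as F

# Placeholder token IDs standing in for a tokenized slice of the training corpus.
token_ids = torch.tensor([[5, 42, 17, 8, 99, 3]])

inputs = token_ids[:, :-1]   # the model sees tokens 0..n-1 ...
targets = token_ids[:, 1:]   # ... and must predict tokens 1..n (shifted by one)

vocab_size = 100
# Random logits stand in for the LLM's output distribution over the vocabulary.
logits = torch.randn(inputs.size(0), inputs.size(1), vocab_size)

# Cross-entropy between predicted distributions and the true next tokens;
# training adjusts the model's weights to drive this loss down.
loss = F.cross_entropy(logits.view(-1, vocab_size), targets.reshape(-1))
print(loss.item())
```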
A look at the most popular LLMs
The recent explosion of large language models was triggered by the 2017 paper, Attention is All You Need, which introduced the Transformer as, “a new simple network architecture … based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.”
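At the heart of that architecture is scaled dot-product attention. Here is a minimal NumPy sketch of the single-head version from the paper, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V; multi-head attention, masking, and the rest of the Transformer stack are omitted.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, per 'Attention Is All You Need'."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)      # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # softmax over the keys
    return weights @ V                                      # weighted sum of the values

# One batch of 4 tokens with 8-dimensional queries, keys, and values.
Q = np.random.randn(1, 4, 8)
K = np.random.randn(1, 4, 8)
V = np.random.randn(1, 4, 8)
print(scaled_dot_product_attention(Q, K, V).shape)  # (1, 4, 8)
```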
Here are some of the top large language models in use today.
ELMo
ELMo is a 2018 deep contextualized word representation LLM from AllenNLP that models both complex characteristics of word use and how that use varies across linguistic contexts. The original model has 93.6 million parameters and was trained on the 1B Word Benchmark.
BERT
BERT is a 2018 language model from Google AI based on the company’s Transformer neural network architecture. BERT was designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. The two model sizes initially released were 110 million and 340 million total parameters. The LLM uses masked language modeling (MLM), in which ~15% of tokens are “corrupted” for training. It was trained on English Wikipedia plus the Toronto Books Corpus.
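Masked language modeling is easy to try with the Hugging Face transformers library (assuming it is installed; the checkpoint downloads on first use): hide a word behind the [MASK] token and ask BERT to fill it in from the surrounding context.

```python
from transformers import pipeline

# Assumes: pip install transformers
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the hidden token from both the left and right context.
for prediction in unmasker("Paris is the [MASK] of France."):
    print(prediction["token_str"], round(prediction["score"], 3))
```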
T5
The 2020 Text-To-Text Transfer Transformer (T5) model from Google synthesizes a new model based on the best transfer learning techniques from GPT, ULMFiT, ELMo, BERT, and their successors. It uses the open source Colossal Clean Crawled Corpus (C4) as a pre-training dataset. The standard C4 for English is an 800GB dataset based on the original Common Crawl dataset. T5 reframes all NLP tasks into a unified text-to-text format where the input and output are always text strings, in contrast to BERT-style models that can only output either a class label or a span of the input. The base T5 model has about 220 million total parameters.
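A small sketch of the text-to-text framing, using the transformers library and the publicly available t5-small checkpoint (assumed to be installed and downloadable): the task is named in a plain-text prefix, and the answer comes back as plain text.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Every task is phrased as text in, text out; the prefix names the task.
prompt = "translate English to German: The house is wonderful."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
output_ids = model.generate(input_ids, max_new_tokens=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))  # e.g. "Das Haus ist wunderbar."
```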
GPT family
OpenAI, an AI research and deployment company, has a mission “to ensure that artificial general intelligence (AGI) benefits all of humanity.” Of course, it hasn’t achieved AGI yet—and some AI researchers, such as machine learning pioneer Yann LeCun of Meta-FAIR, think that OpenAI’s current approach to AGI is a dead end.
OpenAI is responsible for the GPT family of language models. Here’s a quick look at the entire GPT family and its evolution since 2018. (Note that the entire GPT family is based on Google’s Transformer neural network architecture, which is legitimate because Google open-sourced Transformer.)
GPT (Generative Pretrained Transformer) is a 2018 model from OpenAI that uses about 117 million parameters. GPT is a unidirectional transformer pre-trained on the Toronto Books Corpus, and was trained with a causal language modeling (CLM) objective, meaning that it was trained to predict the next token in a sequence.
GPT-2 is a 2019 direct scale-up of GPT with 1.5 billion parameters, trained on a dataset of 8 million web pages encompassing ~40GB of text data. OpenAI originally restricted access to GPT-2 because it was “too good” and would lead to “fake news.” The company eventually relented, although the potential social problems became even worse with the release of GPT-3.
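Because GPT-2’s weights were ultimately released, its causal next-token behavior is easy to try with the transformers library (a hedged example; the sampled continuation will vary from run to run).

```python
from transformers import pipeline, set_seed

generator = pipeline("text-generation", model="gpt2")
set_seed(42)  # make the sampled continuation repeatable

# GPT-2 extends the prompt one predicted token at a time (causal language modeling).
result = generator("Large language models are", max_new_tokens=30, num_return_sequences=1)
print(result[0]["generated_text"])
```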
GPT-3 is a 2020 autoregressive language model with 175 billion parameters, trained on a combination of a filtered version of Common Crawl, WebText2, Books1, Books2, and English Wikipedia. The neural net used in GPT-3 is similar to that of GPT-2, with a couple of additional blocks.
The biggest downside of GPT-3 is that it tends to “hallucinate,” meaning that it makes up facts with no discernible basis. GPT-3.5 and GPT-4 have the same problem, albeit to a lesser extent.
CODEX is a 2021 descendant of GPT-3 that was fine-tuned for code generation on 54 million open source GitHub repositories. It is the model used in GitHub Copilot.
GPT-3.5 is a set of 2022 updates to GPT-3 and CODEX. The gpt-3.5-turbo model is optimized for chat but also works well for traditional completion tasks.
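For reference, this is roughly how gpt-3.5-turbo is called through OpenAI’s Python SDK. The sketch assumes the openai package (v1-style client) is installed and an OPENAI_API_KEY environment variable is set; the SDK’s details have changed over time, so treat this as illustrative.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Chat models take a list of role-tagged messages rather than a single prompt.
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize what a large language model is in one sentence."},
    ],
)
print(response.choices[0].message.content)
```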
GPT-4 is a 2023 large multimodal model (accepting image and text inputs, emitting text outputs) that OpenAI claims exhibits human-level performance on some professional and academic benchmarks. GPT-4 outperformed GPT-3.5 in various simulated exams, including the Uniform Bar Exam, the LSAT, the GRE, and several AP subject exams.
Note that GPT-3.5 and GPT-4 performance has changed over time. A July 2023 Stanford paper identified several tasks, including prime number identification, where the behavior varied greatly between March 2023 and June 2023.
ChatGPT and BingGPT are chatbots that were originally based on gpt-3.5-turbo and were upgraded in March 2023 to use GPT-4. Currently, to access the version of ChatGPT based on GPT-4, you need to subscribe to ChatGPT Plus. The standard ChatGPT, based on GPT-3.5, was trained on data that cut off in September 2021.
BingGPT, aka “The New Bing,” which you can access in the Microsoft Edge browser, was also trained on data that cut off in 2021. When asked, the bot claims that it is constantly learning and updating its knowledge with new information from the web.
In early March 2023, Professor Pascale Fung of the Centre for Artificial Intelligence Research at the Hong Kong University of Science & Technology gave a talk on ChatGPT evaluation. It’s well worth the hour to watch it.
LaMDA
LaMDA (Language Model for Dialogue Applications), Google’s 2021 “breakthrough” conversation technology, is a Transformer-based language model trained on dialogue and fine-tuned to significantly improve the sensibleness and specificity of its responses. One of LaMDA’s strengths is that it can handle the topic drift that is common in human conversations.
PaLM
PaLM (Pathways Language Model) is a 2022 dense decoder-only Transformer model from Google Research with 540 billion parameters, trained with the Pathways system. PaLM was trained using a combination of English and multilingual datasets that include high-quality web documents, books, Wikipedia, conversations, and GitHub code. Google also created a “lossless” vocabulary that preserves all whitespace (especially important for code), splits out-of-vocabulary Unicode characters into bytes, and splits numbers into individual tokens, one for each digit.
PaLM-Coder is a version of PaLM 540B fine-tuned on a Python-only code dataset.
PaLM-E
PaLM-E is a 2023 embodied (for robotics) multimodal language model from Google. The researchers began with PaLM and “embodied” it (the E in PaLM-E) by complementing it with sensor data from the robotic agent. PaLM-E is also a generally capable vision-and-language model; in addition to PaLM, it incorporates the ViT-22B vision model.
Bard
Bard is a 2023 Google conversational AI service based on LaMDA. Google Bard was released March 21, 2023 and made generally available on May 10, 2023.
Bard has been updated multiple times since its release. In April 2023 it gained the ability to generate code in 20 programming languages. In July 2023 it gained support for input in 40 human languages, incorporated Google Lens, and added text-to-speech capabilities in over 40 human languages.
LLaMA
LLaMA (Large Language Model Meta AI) is a 65-billion-parameter “raw” large language model released by Meta AI (formerly known as Facebook AI Research, or FAIR) in February 2023. According to Meta:
Training smaller foundation models like LLaMA is desirable in the large language model space because it requires far less computing power and resources to test new approaches, validate others’ work, and explore new use cases. Foundation models train on a large set of unlabeled data, which makes them ideal for fine-tuning for a variety of tasks.
LLaMA was released at several sizes, along with a model card that details how it was built. Originally, you had to request the checkpoints and tokenizer, but they are in the wild now: a downloadable torrent was posted on 4chan by someone who properly obtained the models by filing a request, according to Yann LeCun of Meta AI.
Llama 2
Llama 2 is the next generation of Meta AI’s large language model, trained between January and July 2023 on 40% more data (2 trillion tokens from publicly available sources) than LLaMA 1, with double the context length (4,096 tokens). Llama 2 comes in a range of parameter sizes—7 billion, 13 billion, and 70 billion—as well as pretrained and fine-tuned variations. Meta AI calls Llama 2 open source, but there are some who disagree, given that it includes restrictions on acceptable use. A commercial license is available in addition to a community license.
Llama 2 is an auto-regressive language model that uses an optimized Transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) to align to human preferences for helpfulness and safety. Llama 2 is currently English-only. The model card includes benchmark results and carbon footprint stats. The research paper, Llama 2: Open Foundation and Fine-Tuned Chat Models, offers additional detail.
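Assuming you have accepted Meta’s license and been granted access to the weights on Hugging Face, the chat-tuned 7-billion-parameter variant can be loaded through the transformers library roughly as shown below (a sketch only; Llama 2’s expected chat prompt format is documented in its model card, and device_map="auto" additionally requires the accelerate package).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Gated model: requires accepting the Llama 2 license and a Hugging Face access token.
model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Explain the difference between pretraining and fine-tuning in two sentences."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=120)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```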
Claude 2
Anthropic’s Claude 2, released in July 2023, accepts up to 100,000 tokens (about 70,000 words) in a single prompt, and can generate stories up to a few thousand tokens. Claude can edit, rewrite, summarize, classify, extract structured data, do Q&A based on the content, and more. It has the most training in English, but also performs well in a range of other common languages, and still has some ability to communicate in less common ones. Claude also has extensive knowledge of programming languages.
Claude was trained with Anthropic’s “constitutional AI” approach to be Helpful, Honest, and Harmless (HHH), and extensively red-teamed to be more harmless and harder to prompt into producing offensive or dangerous output. It doesn’t train on your data or consult the internet for answers, although you can provide Claude with text from the internet and ask it to perform tasks with that content. Claude is available to users in the US and UK as a free beta, and has been adopted by commercial partners such as Jasper (a generative AI platform), Sourcegraph Cody (a code AI platform), and Amazon Bedrock.
Conclusion
As we’ve seen, large language models are under active development at several companies, with new versions shipping more or less monthly from OpenAI, Google AI, Meta AI, and Anthropic. While none of these LLMs achieves true artificial general intelligence (AGI), new models mostly tend to improve on older ones. Still, most LLMs are prone to hallucinations and other ways of going off the rails, and may in some instances produce inaccurate, biased, or otherwise objectionable responses to user prompts. In other words, you should use them only if you can verify that their output is correct.