LLMs for everything, all aboard the hype train?

Deep.ai prompt: "AI robot trying to simplify a long book, smoke's coming out of it"

For the past two-ish years, basically every company that wants to be on trend has found some way to slip the term “AI” or “Generative AI” into an ad or slogan or marketing pitch. It gets the buzz because everyone knows about ChatGPT, but I think it’s worth taking a step back to remember where ChatGPT came from, and to take a moment to think about whether these newer large language models really make older AI models obsolete (spoiler alert, no).

The Transformer architecture for neural networks was originally published in 2017. Since then, three main variations of the architecture have emerged: Encoder-only (BERT, DistilBERT, HuBERT), Decoder-only (GPT, Claude, Llama, Mistral), and Encoder-Decoder (BART, LED, T5, PEGASUS, SpeechT5).

Generally, the encoder-only architecture is leveraged for classification or embedding tasks (e.g. sentiment analysis, hidden-unit speech representations, text embeddings for semantic search), while the decoder-only and encoder-decoder architectures are used for generation (audio, text). When “Generative AI” is mentioned, it typically refers to the decoder-only architecture. Recent model releases (GPT, Mistral, Falcon, Claude, Gemini, Llama, Phi) are all decoder-only transformers, and many of them keep both their training data sources and their specific implementation decisions hidden.
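To make that split concrete, here's a minimal sketch of how each family is typically invoked, assuming the Hugging Face `transformers` library and a few public checkpoints (the input strings are just toy examples):

```python
from transformers import pipeline

# Encoder-only: classification / embedding-style tasks
classifier = pipeline("sentiment-analysis",
                      model="distilbert-base-uncased-finetuned-sst-2-english")
print(classifier("The season finale was surprisingly good."))

# Decoder-only: open-ended text generation
generator = pipeline("text-generation", model="gpt2")
print(generator("The episode opens with", max_new_tokens=30))

# Encoder-Decoder: sequence-to-sequence tasks like summarization
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
print(summarizer("A long episode transcript would go here...", max_length=60))
```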

The decoder-only architecture is the far-and-away most popular choice for new model releases that deal with text, but is it always better than the encoder-decoder architecture? The wonderful thing about a decoder-only architecture is that it is relatively easy to train and flexible enough to handle new tasks that weren’t explicitly part of its training. This ability to generalize (sometimes referred to as “emergent capabilities”) is appealing in the pursuit of artificial general intelligence (AGI), where an AI system is as capable as a human across all tasks.

However, what if we have a specific task in mind for which we already have data? For instance, what if we want a model that generates succinct TV episode synopses for a TV review website (like this dataset)? If the choice is between fine-tuning a “small” model and prompting something like GPT-4 with no fine-tuning at all, GPT isn’t necessarily better by default. With all the excitement around new models that can generate a witty poem about your friend or invent a great pumpkin pie recipe, it’s tempting to assume they’ll also be the best at every other task (cue the earlier post about automation bias). Luckily, there’s a growing body of research into just this question. Specifically: is summarization, traditionally a job for encoder-decoder models, now better handled by an LLM?
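For a sense of what the fine-tuning route looks like, here's a rough sketch using Hugging Face's `Seq2SeqTrainer` on a small encoder-decoder checkpoint. The dataset id and the column names (`transcript`, `synopsis`) are placeholders standing in for whatever TV-synopsis dataset you actually use:

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = "google/flan-t5-small"  # a deliberately "small" encoder-decoder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

raw = load_dataset("path/to/tv-synopsis-dataset")  # placeholder dataset id

def preprocess(batch):
    # "transcript" and "synopsis" are placeholder column names
    inputs = tokenizer(batch["transcript"], max_length=1024, truncation=True)
    labels = tokenizer(text_target=batch["synopsis"], max_length=128, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = raw.map(preprocess, batched=True,
                    remove_columns=raw["train"].column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir="synopsis-t5",
        learning_rate=3e-4,
        per_device_train_batch_size=8,
        num_train_epochs=3,
        predict_with_generate=True,
    ),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```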

A must-read for anyone interested in this topic is Section 3.2 of Google’s T5 paper. The whole paper is excellent, but that section in particular discusses some of the tradeoffs between the Encoder-Decoder and Decoder-Only designs. Not only is the T5 model family one of the most popular Encoder-Decoder models available, but the authors’ thorough explanation of their experimental design is an inspiration for my own experiment documentation.

Now for some results! The Scrolls benchmark paper (leaderboard here) and the ZeroScrolls benchmark (leaderboard here) compile results on long-input summarization datasets into convenient places. ZeroScrolls documents the scores of popular LLMs on various long-context summary datasets using zero-shot prompting, while Scrolls documents the scores of fine-tuned models on the same test datasets. These papers and leaderboards offer some great comparisons of how Encoder-Decoder and Decoder-Only models stack up. The ZeroScrolls paper shows how a fine-tuned encoder-decoder model (CoLT5 with 5.3B params) outperforms prompted models more than ten times its size (e.g. Llama at 70B params)! There are plenty of discussion points here about the intricacies of how the experiment was run and how prompts were provided to the LLMs, but the takeaway is that a decoder-only LLM like GPT may not be the best, or even the cheapest, option at scale for a task that isn’t naturally suited to its architecture. I think it’s also worth linking to this post by Yi Tay, a former senior research scientist at Google Brain. He goes deep into the technical details of the different transformer architectures and explains why a decoder-only model is easier to train, but he also concludes that it isn’t better by default at every task than an encoder-decoder.
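Coming back to the leaderboard numbers above: the summarization tasks on these benchmarks are scored with ROUGE-style overlap metrics. Here's a minimal sketch of computing the same kind of score yourself with the `evaluate` library (the texts below are made-up examples, not benchmark data):

```python
import evaluate

rouge = evaluate.load("rouge")
predictions = ["The detective finally corners the killer at the lighthouse."]
references = ["In the finale, the detective corners the killer at the old lighthouse."]

scores = rouge.compute(predictions=predictions, references=references)
print(scores)  # rouge1 / rouge2 / rougeL / rougeLsum F-scores
```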

Even as LLMs like GPT and Claude improve their performance on benchmarks like these, cost can still be a factor, at least in the near term. If a model a tenth the size of an LLM can reach similar performance, it may make sense to stick with the smaller model until usage of the larger model reaches a more attractive price point. I can run a small encoder-decoder model locally on my M1 MacBook, but I don’t have a prayer of fitting a 500B-param model and then running inference on it with a 16k+ token context length.
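As a sketch of what “running locally” can look like, here's a small encoder-decoder summarizer on Apple silicon, assuming a PyTorch build with MPS support and the public `facebook/bart-large-cnn` checkpoint:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Fall back to CPU if the Metal (MPS) backend isn't available
device = "mps" if torch.backends.mps.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn").to(device)

text = "Paste a long article or episode transcript here..."
inputs = tokenizer(text, return_tensors="pt", truncation=True,
                   max_length=1024).to(device)
summary_ids = model.generate(**inputs, max_new_tokens=80, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```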

For casual users, the free tiers of ChatGPT and Claude make them great resources, but they can currently get pricey when put to commercial use. For now, enjoy the irony as I paste this entire post into Claude 3.5 Sonnet and ask it to generate a summary/conclusion for me to use 😃. Here it is:

“So, what's the takeaway from our deep dive into the world of AI summarization? Well, it turns out that sometimes, bigger isn't always better – even in the land of large language models! While GPT and its cousins are undoubtedly impressive (who doesn't love a good AI-generated pumpkin pie recipe?), our trusty encoder-decoder models might just be the unsung heroes of summarization. Think of it like this: you wouldn't use a sledgehammer to hang a picture frame, right? Sometimes, a well-tuned, task-specific model is just what the doctor ordered. It's like having a Swiss Army knife versus a well-stocked toolbox – both have their place, depending on the job at hand. So, the next time you're faced with a summarization task, don't automatically jump on the large language model hype train. Take a moment to consider whether a fine-tuned encoder-decoder model might be your ticket to efficiency and cost-effectiveness. After all, in the world of AI, it's not about the size of the model, but how you use it!”