A few-shot learning future? Computation costs of different approaches to NLP



Like most things in machine learning, trends in NLP move fast. Transformers are not even three years old and are already ubiquitous. With them came a paradigm shift in how models are trained. Instead of training a model for each task from scratch, common practice is now to start from a model pretrained on a large internet corpus with a self-supervised language modeling objective. This “pretrain-finetune” pipeline lets larger models perform exceptionally well on small datasets where they would normally overfit.

In May (or last century, in pandemic-ML research time), GPT-3 drew headlines for its ability to generate text. Less talked about was the paradigm shift the paper advocates: a move to few-shot learning. I want to examine the workload implications of this change.


In this article I am going to compare different approaches to NLP. I will discuss the GLUE dataset, a set of common NLP tasks, and four different approaches to it: word vectors, ELMo, BERT, and GPT-3. While these models are not the state of the art for this dataset, they provide good benchmarks for the training procedures (the SOTA at the moment is a highly optimized version of BERT).

First, some background reading: The Illustrated BERT explains each of these models and the progression through them, save GPT-3. For that, I would read another post by the same author.

These readings are a good primer on how NLP models, especially transformers, work. This article is a more numerical analysis of what these shifts in model design mean for end users.

Few-shot learning becoming mainstream in NLP would be disruptive, in both the good and bad senses of the term. Because GPT-3-sized models are almost impossible for anyone but industry titans to train, end users will have to use them as a service instead of training their own models. This de facto monopoly on the model itself is worthy of its own article, but I want to focus on the technical implications here.

The Fundamentals: Model Architecture and its applications

Model memory footprint

These model sizes reflect parameter counts, which do not scale with batch size. Activation costs are highly variable and will be examined in detail later.

| Model | Architecture | Parameters |
|---|---|---|
| Word2Vec (GloVe) | 400K words × 300d vectors | 120M |
| ELMo | 4096-dim, 2-layer BiLSTM | 96M |
| BERT-Base | 512 sequence, 12-layer, 768-dim, 30k vocab | 110M |
| BERT-Large | 512 sequence, 24-layer, 1024-dim, 30k vocab | 340M |
| GPT-3 | 2048 sequence, 96-layer, 12288-dim | 175B |

Note: check the GPT-3 Paper for more intermediate results
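The parameter counts in the table can be roughly reproduced from the architecture columns. Here is my own back-of-envelope sketch, which ignores biases, layer norms, and the exact vocabulary sizes, so the totals land slightly below the official figures:

```python
def transformer_params(layers, d_model, vocab, seq_len, d_ff=None):
    """Rough transformer parameter count: embeddings plus per-layer
    attention and feed-forward weight matrices (biases/norms ignored)."""
    d_ff = d_ff or 4 * d_model                 # FFN is 4x d_model in BERT/GPT
    embeddings = vocab * d_model + seq_len * d_model
    attention = 4 * d_model * d_model          # Q, K, V, and output projections
    ffn = 2 * d_model * d_ff                   # two feed-forward matrices
    return embeddings + layers * (attention + ffn)

print(f"BERT-Base:  ~{transformer_params(12, 768, 30522, 512) / 1e6:.0f}M")    # ~109M
print(f"BERT-Large: ~{transformer_params(24, 1024, 30522, 512) / 1e6:.0f}M")   # ~334M
print(f"GPT-3:      ~{transformer_params(96, 12288, 50257, 2048) / 1e9:.0f}B") # ~175B
```

For models this large, the per-layer term (roughly 12 × d_model²) dominates the embeddings entirely, which is why GPT-3's 12288-dim, 96-layer stack lands at 175B.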

Activations per model while training. Note: these are rough calculations and assume a batch size of 1. Feel free to contact me with questions about how I calculated them; they may be off for GPT-3.

| Model | BS 1 activations during “pretrain” | Change during finetune |
|---|---|---|
| Word2Vec | #Tokens × 300 (the token vectors) | Depends on upstream model |
| ELMo | #Tokens × 4 × 4096 (LSTM representations) | Depends on upstream model |
| BERT-Base | 7M per layer × 12 layers (assuming operator fusion + optimization) | Same |
| BERT-Large | 10M per layer × 24 layers (assuming operator fusion + optimization) | Same |
| GPT-3 | 654M per layer × 96 layers (assuming operator fusion + optimization) | 0 (finetuning is few-shot) |

A note about tokens: I listed per-token costs for the first two models because, for an unrolled RNN, the more tokens you unroll the more memory you need to keep. This also applies to transformers, but their cost is usually somewhat “fixed”, since smaller examples are typically packed together. There is a key difference, however, between autoregressive models like GPT-3 and masked-LM models like BERT: BERT goes through training examples 512 tokens at a time, so each activation covers the processing of up to 512 tokens (some masked), while GPT-3 goes token by token. These numbers mainly reflect the rough ratios of how far each model could scale its batch size.

A note on using these numbers in practice: while ideally these calculations would let you figure out the maximum batch size from how memory is allocated, life is far from that easy. I did a detailed analysis of what actually contributes to memory costs in TensorFlow and PyTorch in the past, and minute details like operator fusion and rounded allocations can quickly balloon memory costs well beyond what is needed. Contact me if you are curious!
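For the transformer rows, the per-layer figures can be ballparked by summing the major intermediate tensors in one layer. This is my own rough accounting; it comes out a bit below the table's values, which presumably keep a few extra tensors:

```python
def layer_activations(seq, d_model, heads):
    """Approximate activation element count for one transformer layer at
    batch size 1, keeping only the major tensors (assumes operator fusion)."""
    qkv = 3 * seq * d_model          # query, key, value projections
    scores = heads * seq * seq       # attention score matrix, one per head
    attn_out = seq * d_model         # attention output
    ffn = seq * 4 * d_model          # feed-forward intermediate (4x width)
    ffn_out = seq * d_model          # feed-forward output
    return qkv + scores + attn_out + ffn + ffn_out

print(f"BERT-Base:  ~{layer_activations(512, 768, 12) / 1e6:.1f}M per layer")   # ~6.7M
print(f"BERT-Large: ~{layer_activations(512, 1024, 16) / 1e6:.1f}M per layer")  # ~8.9M
print(f"GPT-3:      ~{layer_activations(2048, 12288, 96) / 1e6:.0f}M per layer")# ~629M
```

Note how GPT-3's attention score term (96 heads × 2048² positions) alone accounts for the bulk of its per-layer cost: activation memory for attention grows quadratically with sequence length.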

Cost Comparison: One-time costs

Most of these numbers are taken from the ELECTRA and GPT-3 papers. I verified most of them with my own Google Sheets spreadsheet, but I quote the papers' figures because my flops/params spreadsheet is less comprehensive.

Disclaimer about using flops to compare models

Flops are a pretty bad metric. The RNN models listed here are much slower than you would guess from flops alone, because computations in an RNN have lower arithmetic intensity: the ratio of memory operations to compute operations is higher. As a result, these operations do not parallelize as nicely as the operations in BERT.
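To make “arithmetic intensity” concrete: an RNN timestep is essentially a matrix-vector product, while a transformer multiplies the same weights against the whole sequence at once. A toy flops-per-byte calculation (fp32, counting only weight plus input/output traffic, ignoring caching) shows the gap:

```python
def matmul_intensity(d, n, bytes_per_elem=4):
    """FLOPs per byte moved for a (d x d) weight times (d x n) input matmul."""
    flops = 2 * d * d * n                            # multiply-accumulate count
    traffic = bytes_per_elem * (d * d + 2 * d * n)   # weights + input + output
    return flops / traffic

# One RNN timestep is a matvec (n = 1): ~0.5 flops/byte, memory bound.
print(matmul_intensity(4096, 1))
# A transformer layer sees the whole sequence at once (n = 512): compute bound.
print(matmul_intensity(768, 512))
```

At 0.5 flops per byte, a GPU spends nearly all its time waiting on memory; at over 100 flops per byte, the same hardware can keep its arithmetic units busy, which is why equal-flops RNNs run so much slower in practice.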

This “comparing apples and oranges” problem becomes more severe for inference and fine-tuning, where the variation in execution time depends heavily on architecture. Since I cannot benchmark GPT-3 myself (not being Microsoft or OpenAI), we are forced to use the inferior measure of flops.

| Model | Flops to fully train |
|---|---|
| ELMo | 3.3 EF |

Cost Comparison: Once-per-task costs

Once pretraining is done, these models diverge in what they ask of their users to adapt them to a target dataset. With word vectors and ELMo, users train a model on top of the representations. With BERT, users fine-tune BERT itself. And with GPT-3, they run inference on a large sequence of up to 2048 tokens.

For BERT, the cost of fine-tuning is usually on the order of 1/100th of the pretraining time. These approximations are for SQuAD.

| Model | Flops to fine-tune |
|---|---|
| BERT-Base | 166 PF |
| BERT-Large | 493 PF |

GPT-3 has no per-task training cost, and for ELMo the cost depends on the upstream model. Usually that model is smaller than ELMo itself, though this is far from a universal statement.

Cost Comparison: Inference

Inference costs again depend on the structure of the model. As before, the cost of the RNN model depends heavily on the system and architecture, because its arithmetic intensity is lower and the upstream model varies wildly by dataset.

The inference costs of ELMo and BERT-Base are around 20-30 gigaflops, but neither was submitted to the SuperGLUE leaderboard.

I also included SuperGLUE scores from the public leaderboard.

| Model | Inference flops | SuperGLUE score |
|---|---|---|
| CBoW (word vectors) | <10 GF | 44.5 |
| BERT-Large | 80 GF | 69 |
| Optimized BERT (RoBERTa) | 80 GF | 84.6 |
| GPT-3 Few-Shot | 1 TF per token | 71.8 |

For GPT-3's inference numbers, I divided the 314 zettaflops used for training (what a wonderful word) by the 300 billion tokens seen, which gives roughly 1 TF per token.

What do the numbers mean?

In my opinion, these numbers mean the following: if you can finetune a BERT-like model, you probably should.

There are several parts to this statement.

GPT-3 doesn’t work great for even small supervised NLU tasks.

The first is that, given sufficient data and a language understanding (rather than generation) task, the accuracy of finetuned models is much better, and the computational benefits of few-shot learning disappear quickly. After only a million tokens or so, the GPT model costs more to use than the much more accurate finetuned models. The finetuned model's accuracy is higher even on tasks with fewer than a thousand sentences, such as the GLUE Diagnostics.
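A back-of-envelope break-even, using the tables above and my own simplifying assumption that one GPT-3 query costs about one token's worth of compute (generous to GPT-3) while one BERT query costs one 80 GF forward pass:

```python
bert_finetune = 493e15   # BERT-Large fine-tuning cost, from the table above
bert_query = 80e9        # BERT-Large inference cost per query
gpt3_query = 1e12        # estimated GPT-3 inference cost per token

# Number of queries after which fine-tuned BERT-Large, including its one-time
# fine-tuning cost, becomes cheaper than querying GPT-3:
break_even = bert_finetune / (gpt3_query - bert_query)
print(f"break-even after ~{break_even:,.0f} queries")   # roughly half a million
```

Under these assumptions the crossover lands in the hundreds of thousands of queries, consistent with the rough million-token figure above; any production workload crosses it quickly.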

GPT-3 enables no-code, few-shot NLP

If you write two examples of a task and pay a fee, you can get an answer from GPT-3 without doing any training. An interesting question is whether the price beats hiring a human. Given the OpenAI pricing model, the answer is a definitive yes: $1 buys about 25 thousand characters of generated text on the $400-per-month plan. That much text would take me around an hour to type.

Another application is rapid prototyping of NLP applications. A litmus test with GPT-3 could check whether a model picks up on the pattern at all. Annotating even 1,000 examples takes far more time than writing two, and that time saving overshadows any disadvantage in compute cost for prototyping.

RNNs are semi-obsolete in NLP, even in production

While I didn’t discuss it much in this article, transformers seem to consistently outperform RNNs at similar compute budgets. I know this holds true down to the mobile scale (due to my colleagues’ work on SqueezeBERT), but none of the models I have analyzed work well for very small IoT devices. If that is your goal, you probably weren’t considering massive NLP models a viable option anyway.