Like most things in machine learning, trends in NLP move fast. Transformers are not even three years old yet, and are already ubiquitous. With them came a paradigm shift in how models are trained. Instead of training a model for each task from scratch, common practice is now to start from an expensive model pretrained on large amounts of internet text. That model is trained with a self-supervised language modeling objective to learn general properties of the language. This “pretrain-finetune” pipeline lets large models perform exceptionally well on small datasets where they would otherwise overfit.
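To make the pipeline concrete, here is a minimal sketch of pretrain-finetune using the Hugging Face `transformers` library; the model name, toy dataset, and hyperparameters are illustrative assumptions, not anything from the post itself.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Step 1: download a model that was already pretrained on internet-scale text
# with a self-supervised language modeling objective.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Step 2: fine-tune the pretrained weights on a small labeled dataset.
texts = ["a great movie", "a waste of two hours"]  # toy sentiment data
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):  # a few epochs over the small dataset is typically enough
    outputs = model(**batch, labels=labels)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

The key point for what follows: every new task repeats step 2, producing a separate full copy of the model's weights.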
In May (or last century in pandemic-ML research time), GPT-3 drew headlines for its ability to generate text. Less talked about was the paradigm shift the paper advocates: a move to few-shot learning. I want to examine the workload implications of this change.
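For contrast with the fine-tuning sketch above, here is what GPT-3-style few-shot learning looks like: the “training set” is just a handful of examples placed in the prompt, and no weights are updated. The prompt mirrors the translation example from the GPT-3 paper; the client call reflects the 2020-era OpenAI Completion API, and the exact parameters should be read as an assumption for illustration.

```python
import openai

# A few in-context examples stand in for a labeled training set.
prompt = (
    "Translate English to French.\n"
    "sea otter => loutre de mer\n"
    "peppermint => menthe poivrée\n"
    "cheese =>"
)

# One forward pass through the (frozen) model; no gradient updates anywhere.
response = openai.Completion.create(
    engine="davinci",   # the original GPT-3 model name in the API
    prompt=prompt,
    max_tokens=5,
    temperature=0,
)
print(response.choices[0].text)
```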