Blog posts


Facebook’s new recommender on the edge

3 minute read


This is a writeup on the feasibility of using TBSM for edge applications.

TBSM has two parts: first, a DLRM generates a vector for each action in the time series. Second, the outputs of those per-step DLRMs serve as "embeddings" for a second, DLRM-like model. While the top and bottom sections differ (particularly in how they normalize), both compute exclusively with dot products between "embeddings" and MLPs.
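As a rough illustration of that structure, here is a minimal numpy sketch. Everything in it (shapes, weights, the `mlp` helper) is a hypothetical stand-in, not the actual TBSM implementation: random matrices replace the trained DLRM, and the "top" section is reduced to dot products between per-step embeddings and a context vector, fed to a small MLP.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, weights):
    # Tiny ReLU MLP; a hypothetical stand-in for the MLPs in both sections.
    for w in weights[:-1]:
        x = np.maximum(x @ w, 0.0)
    return x @ weights[-1]

# Bottom section: a DLRM-like model maps each action in the time series to
# an "embedding" vector (random weights stand in for a trained model here).
T, d = 5, 8                                         # sequence length, embedding width
bottom_w = [rng.standard_normal((16, 32)), rng.standard_normal((32, d))]
actions = rng.standard_normal((T, 16))              # raw per-action features
z = mlp(actions, bottom_w)                          # (T, d) per-step "embeddings"

# Top section: dot products between the per-step embeddings and a context
# vector act as interaction features for a second, DLRM-like MLP.
context = rng.standard_normal(d)
interactions = z @ context                          # (T,) dot products
top_w = [rng.standard_normal((T, 16)), rng.standard_normal((16, 1))]
score = mlp(interactions, top_w)                    # final (1,) relevance score
```

The point of the sketch is only the data flow: every computation is either a dot product between embedding vectors or an MLP, which is what makes the model's edge-inference cost easy to reason about.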

This model is well suited to edge applications when applied to datasets with relatively few actions per example, such as the Taobao dataset.

A few shot learning future? Computation costs of different approaches to NLP

9 minute read


Like most things in machine learning, trends in NLP move fast. Transformers are not even three years old and are already ubiquitous. With them came a paradigm shift in how models are trained. Instead of training a model for each task from scratch, common practice is now to start from an expensive pretrained model downloaded from the internet. That model was trained on a self-supervised language-modeling objective to learn the structure of the language. This "pretrain-finetune" pipeline lets larger models perform exceptionally well on small datasets where they would otherwise overfit.
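The pretrain-finetune split can be sketched in a few lines of numpy. This is a toy illustration under loud assumptions: a fixed random projection stands in for the expensive pretrained network, and "fine-tuning" is reduced to training a small logistic-regression head on frozen features from a tiny labeled dataset.

```python
import numpy as np

rng = np.random.default_rng(1)

# "Pretrained" feature extractor: in practice an expensive self-supervised
# transformer; here a frozen random projection stands in for it (hypothetical).
w_frozen = rng.standard_normal((10, 32))

def pretrained_features(x):
    return np.maximum(x @ w_frozen, 0.0)   # frozen weights, never updated

# Fine-tuning: train only a small task head on a small labeled dataset,
# instead of learning every weight from scratch.
X = rng.standard_normal((20, 10))          # tiny downstream dataset
y = (X[:, 0] > 0).astype(float)            # toy binary labels
H = pretrained_features(X)                 # frozen features

w_head = np.zeros(32)
for _ in range(200):                       # gradient descent on the head only
    p = 1.0 / (1.0 + np.exp(-(H @ w_head)))
    w_head -= 0.1 * H.T @ (p - y) / len(y)

acc = ((1.0 / (1.0 + np.exp(-(H @ w_head))) > 0.5) == y).mean()
```

Only the 32 head weights are trained here; the "pretrained" parameters never move. That asymmetry, a huge frozen (or lightly tuned) backbone plus a cheap per-task update, is what makes the pipeline attractive for small datasets.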

In May (or last century in pandemic-ML research time), GPT-3 drew headlines for its ability to generate text. Less talked about was the paradigm shift the paper advocated for: a move to few-shot learning. I want to examine the workload implications of this change.