Facebook’s new recommender on the edge

3 minute read


This is a writeup on the feasibility of using TBSM for edge applications.

TBSM has three portions: first, a DLRM generates an input vector for each action in a time series. Second, every output of each DLRM in time series is basically used as an “embedding” for a second DLRM like model. While there are differences between the top and bottom section (especially with regards to normalization), both exclusively use dot products between “embeddings” and MLPs for computation.

This model is highly usable for edge applications when applied to models with relatively few data points per example, such as the Taobao dataset.

Concern 1: Paramater space

TBSM’s model checkpoint for the taobao dataset is 317MB with full precision, but this is almost entirely embeddings.

TBSM has the following parameter usage:

ComponentNumber of parametersPercentage of total
User Embeddings16 Million19.1%
Product Embeddings66 Million80.7%
Category Embeddings151 Thousand.183%
Everything else3.6 Thousand.0044%

Clearly, for a given user we only need to store their user embedding as well as product embeddings for only a subset of the products. This dataset tracked interactions including just viewing prodcuts, and has on average 100 entries per user. In all likelihood, storing only 1000 of the 4 million embeddings will be sufficient on device. As a result, the actual amount of storage for an edge application is quite low.

Concern 2: Compute

The compute in TBSM can be structured as follows.

B represents the batch size.

T represents the number of time steps (Default 20).

K represents the number of heads, default 1.
Finally, the dimension size of embeddings is 16.

For the sake of readability, x will be used for dimensions. For the MLPs, I will write the progressing dimension sizes using arrows. Ex. Bx16 -> 32 -> 64 would mean the input of batch size B and dimension 16. It goes through a 16x32 FC layer, and then a 32x64 FC Layer.

ComponentOperation dimensionNumber of times executed
Dense Feature (“Bot”) MLPBx1 -> 16T
Per event SimilaritySix inner products of size 16 vectorsBT
Event Vector (“Top”) MLPBx22 -> 15 -> 15T
Inter event SimilarityT weighted dot productsBK
Per head vectorBxT -> 15 -> 15 -> TK
Per head contextBx15xT * TK
Per head probability estimateBx30 -> 60 -> 1K

In addition, if you have multiple heads, then an additional final MLP is added on top.

Dependency Graph

If you are curious where the numbers come from, here is some explanation:

  1. Bot MLP. Takes in 1 time and outputs size 16 vector.
  2. Similarity. The 3 embeddings + Bot MLP form 3 size 16 vectors. We have (4 choose 2)=6 inner products
  3. Top MLP: The 16 outputs of (1.) + 6 dot products from (2.)
  4. The T dot products come from taking the dot product of each (3.) vector of our “history” (previous 20) with the vector be predicted. It is a weighted dot product, so we multiply each of these vectors by a matrix A.
  5. We pass the T outputs from (4.) throuh an MLP. End size T
  6. We multiply our outputs from (3.) * the outputs of (5.)
  7. We concatenate the output for our target from 3. with the output of (6.)