This is a writeup on the feasibility of using TBSM for edge applications.
TBSM has three portions: first, a DLRM generates an input vector for each action in a time series. Second, every output of each DLRM in time series is basically used as an “embedding” for a second DLRM like model. While there are differences between the top and bottom section (especially with regards to normalization), both exclusively use dot products between “embeddings” and MLPs for computation.
This model is highly usable for edge applications when applied to models with relatively few data points per example, such as the Taobao dataset.
Concern 1: Paramater space
TBSM’s model checkpoint for the taobao dataset is 317MB with full precision, but this is almost entirely embeddings.
TBSM has the following parameter usage:
|Component||Number of parameters||Percentage of total|
|User Embeddings||16 Million||19.1%|
|Product Embeddings||66 Million||80.7%|
|Category Embeddings||151 Thousand||.183%|
|Everything else||3.6 Thousand||.0044%|
Clearly, for a given user we only need to store their user embedding as well as product embeddings for only a subset of the products. This dataset tracked interactions including just viewing prodcuts, and has on average 100 entries per user. In all likelihood, storing only 1000 of the 4 million embeddings will be sufficient on device. As a result, the actual amount of storage for an edge application is quite low.
Concern 2: Compute
The compute in TBSM can be structured as follows.
B represents the batch size.
T represents the number of time steps (Default 20).
K represents the number of heads, default 1.
Finally, the dimension size of embeddings is 16.
For the sake of readability, x will be used for dimensions. For the MLPs, I will write the progressing dimension sizes using arrows. Ex. Bx16 -> 32 -> 64 would mean the input of batch size B and dimension 16. It goes through a 16x32 FC layer, and then a 32x64 FC Layer.
|Component||Operation dimension||Number of times executed|
|Dense Feature (“Bot”) MLP||Bx1 -> 16||T|
|Per event Similarity||Six inner products of size 16 vectors||BT|
|Event Vector (“Top”) MLP||Bx22 -> 15 -> 15||T|
|Inter event Similarity||T weighted dot products||BK|
|Per head vector||BxT -> 15 -> 15 -> T||K|
|Per head context||Bx15xT * T||K|
|Per head probability estimate||Bx30 -> 60 -> 1||K|
In addition, if you have multiple heads, then an additional final MLP is added on top.
If you are curious where the numbers come from, here is some explanation:
- Bot MLP. Takes in 1 time and outputs size 16 vector.
- Similarity. The 3 embeddings + Bot MLP form 3 size 16 vectors. We have (4 choose 2)=6 inner products
- Top MLP: The 16 outputs of (1.) + 6 dot products from (2.)
- The T dot products come from taking the dot product of each (3.) vector of our “history” (previous 20) with the vector be predicted. It is a weighted dot product, so we multiply each of these vectors by a matrix A.
- We pass the T outputs from (4.) throuh an MLP. End size T
- We multiply our outputs from (3.) * the outputs of (5.)
- We concatenate the output for our target from 3. with the output of (6.)