(Probably bc I'm dumb) I'm very confused by this paper. The dimensions are all over the place: first they say M is an N x d x d matrix, then it becomes N x d. And then they try to scale M with g_out and add it to E_attn, which is a T x d matrix??? Are the gates scalars, vectors, or matrices? If they're matrices, the dimensions don't line up with M either
Missed opportunity to call them LMM and enjoy the onslaught of typos
Some Chinese researchers have worked on copper nanotubes and abbreviated them as CuNTs [0].
[0]: https://www.researchgate.net/publication/260800015_Structura...
Or Models with Large Memory. MLMs :)
This seems to be the most appropriate one
I can't wait to enhance my STT-RAG-LLM-TTS with some LMMs
You forgot SOTA; that one is flying around a lot.
Also left out SoDoSoPa
Wouldn't it be STT-LLM-RAG-TTS?
They're still language models
LMLM
Just as Graph RAG should have just been GAG.
That's why they named it LM2 ..
Seems like LM² would have made more sense in that case. Or L²M², or LLMM
The question is whether L and M commute.
Because that makes it harder to type out
(LM)²
The largest model they tested is 1.7B.
Do they believe that it's likely to scale as high as "standard" LLMs? Or is this another state space model moment, where it turns out to be equivalent to them?
The concept of a memory module, separate from a prediction engine, makes sense to me; our brains might operate like that (short- vs. long-term memory).
Many research groups have been trying to implement this idea, recent paper from Meta comes to mind: https://arxiv.org/html/2412.09764
If they believe it’s likely to scale, they will try to scale it, so we will either hear about it in a few months, or we won’t.
Table 2's results are interesting. If the paper is to be believed, just adding the memory model seems to improve reasoning tasks across the board.
That said, I do wonder if this is a bit of a mirage. At 1.7B parameters, they are 3 orders of magnitude down from 4o (well, that isn't completely fair; I don't know what the average 'expert' size is in 4o, but I doubt the authors are doing mixture of experts at only 1.7B). A model can 'memorize' way more shit with that many parameters.
This immediately makes my mind bring up Hopfield networks https://arxiv.org/abs/2008.02217
When I worked with them circa 2012, they were practically toys. Maybe we are in a better place now?
RNN with extra steps?
There are many papers that use a recurrence across sub-sequences and attention within sub-sequences. Google did this with Infini-Attention and one of the variants from the Titans paper. However, I think the earliest example of this is Transformer-XL.
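Roughly, the recipe is: attend within each fixed-size chunk, and carry a cache of the previous chunk's states forward as the recurrence. A minimal PyTorch sketch of that Transformer-XL-style pattern (the function names and the toy single-head attention are mine for illustration, not from any of those papers; causal masking and relative positions are omitted):

    import torch
    import torch.nn.functional as F

    def attend(q, kv):
        # plain single-head dot-product attention, just to make the sketch runnable
        # (causal masking omitted for brevity)
        scores = q @ kv.transpose(-2, -1) / q.shape[-1] ** 0.5
        return F.softmax(scores, dim=-1) @ kv

    def chunked_forward(x, chunk_len=4):
        # x: (batch, seq, dim). Attention is computed within each chunk (plus the
        # cached previous chunk); the cache is the recurrence across chunks.
        memory, outputs = None, []
        for chunk in x.split(chunk_len, dim=1):
            kv = chunk if memory is None else torch.cat([memory, chunk], dim=1)
            outputs.append(attend(chunk, kv))   # queries come from the current chunk only
            memory = chunk.detach()             # carried forward, no gradient (XL-style)
        return torch.cat(outputs, dim=1)

    y = chunked_forward(torch.randn(1, 16, 8))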
Isn't that all of modern AI?
Transformers are completely unlike RNNs.
There are some interesting connections between them. If you remove the softmax from the attention formula, you end up with linear attention, which has a recurrent form.
I haven't read it, but the Mamba 2 paper claims to establish a stronger connection.
Here is a paper explaining it: https://arxiv.org/abs/2006.16236
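For anyone curious, the recurrent form is pretty compact: replace the softmax with a positive feature map phi, and the causal attention output can be computed left to right with two running sums. A minimal sketch of the idea from that paper (the ELU+1 feature map follows the paper, but this is my toy code, not theirs):

    import torch
    import torch.nn.functional as F

    def linear_attention_recurrent(q, k, v):
        # q, k, v: (seq, dim). Without the softmax, attention can be rewritten as a
        # recurrence over running sums S = sum phi(k_i) v_i^T and z = sum phi(k_i).
        feat = lambda x: F.elu(x) + 1                 # positive feature map (ELU + 1)
        q, k = feat(q), feat(k)
        S = torch.zeros(q.shape[-1], v.shape[-1])     # state: (dim, dim_v)
        z = torch.zeros(q.shape[-1])                  # normalizer state
        outs = []
        for qi, ki, vi in zip(q, k, v):
            S = S + torch.outer(ki, vi)               # accumulate key-value outer products
            z = z + ki
            outs.append((qi @ S) / (qi @ z + 1e-6))   # O(d^2) per step, no T x T matrix
        return torch.stack(outs)

    y = linear_attention_recurrent(torch.randn(10, 8), torch.randn(10, 8), torch.randn(10, 8))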
GitHub link in the paper is a 404 - private repo?
You're not dumb. I think it's just poorly written and full of errors.