
LM2: Large Memory Models

110 points | 27 days ago | arxiv.org
ghysznje 26 days ago

(Probably because I'm dumb, but) I'm very confused by this paper. The dimensions are all over the place: first they say M is an N x d x d matrix, then it becomes N x d. And then they scale M with g_out and add it to E_attn, which is a T x d matrix? Are the gates scalars, vectors, or matrices? If they're matrices, the dimensions don't line up with M either.
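
The only self-consistent reading I can come up with (my guess, not necessarily what the authors implemented): keep the memory bank M as N x d, read it with cross-attention from the T x d token stream so the read-out is also T x d, and make g_out a per-token scalar gate. Something like:

    import torch
    import torch.nn as nn

    class MemoryRead(nn.Module):
        # One shape-consistent reading (my guess, not the authors' code): memory bank M
        # is (N, d), the token stream E_attn is (T, d), and g_out is a per-token scalar,
        # so the gated read-out is (T, d) and can be added to E_attn directly.
        def __init__(self, d, n_slots):
            super().__init__()
            self.M = nn.Parameter(torch.randn(n_slots, d) * 0.02)  # (N, d) memory bank
            self.q_proj = nn.Linear(d, d)
            self.k_proj = nn.Linear(d, d)
            self.v_proj = nn.Linear(d, d)
            self.gate = nn.Linear(d, 1)  # one scalar gate per token

        def forward(self, e_attn):                    # e_attn: (T, d)
            q = self.q_proj(e_attn)                   # (T, d) queries from the tokens
            k = self.k_proj(self.M)                   # (N, d) keys from memory slots
            v = self.v_proj(self.M)                   # (N, d) values from memory slots
            attn = torch.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)  # (T, N)
            e_mem = attn @ v                          # (T, d), same shape as e_attn
            g_out = torch.sigmoid(self.gate(e_attn))  # (T, 1), broadcasts over d
            return e_attn + g_out * e_mem             # (T, d)

That at least type-checks; whether it matches what the paper intended, I can't tell.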

biofox 26 days ago

You're not dumb. I think it's just poorly written and full of errors.

ziofill 27 days ago

Missed opportunity to call them LMM and enjoy the onslaught of typos

bayindirh 26 days ago

Some Chinese researchers have worked on copper nanotubes and abbreviated them as CuNTs [0].

[0]: https://www.researchgate.net/publication/260800015_Structura...

esafak 26 days ago

Or Models with Large Memory. MLMs :)

x86hacker1010 26 days ago

This seems to be the most appropriate

stuartjohnson12 27 days ago

I can't wait to enhance my STT-RAG-LLM-TTS with some LMMs

bloomingkales 27 days ago

You forgot SOTA, that one is flying around a lot.

janderson215 26 days ago

Also left out SoDoSoPa

pylotlight 26 days ago

Wouldn't it be STT-LLM-RAG-TTS?

nkozyra 27 days ago

They're still language models

LMLM

DebtDeflation 26 days ago

Just as Graph RAG should have just been GAG.

mgfist 27 days ago

That's why they named it LM2 ..

pylotlight 26 days ago

Seems like LM² would have made more sense in that case. Or L²M², or LLMM.

DiogenesKynikos 26 days ago

The question is whether L and M commute.

mgfist 26 days ago

Because that makes it harder to type out

gpderetta 26 days ago

(LM)²

kadushka 27 days ago

The largest model they tested is 1.7B.

free_bip 27 days ago

Do they believe it's likely to scale as high as "standard" LLMs? Or is this another state space model moment, where it turns out to be equivalent to them?

kadushka 26 days ago

The concept of a memory module, separate from a prediction engine, makes sense to me; our brains might operate like that (short- vs. long-term memory).

Many research groups have been trying to implement this idea; a recent paper from Meta comes to mind: https://arxiv.org/html/2412.09764

If they believe it’s likely to scale, they will try to scale it, so we will either hear about it in a few months, or we won’t.

soganess 26 days ago

Table 2's results are interesting. If the paper is to be believed, just adding the memory module seems to improve reasoning tasks across the board.

That said, I do wonder if this is a bit of a mirage. At 1.7B parameters, they are 3 orders of magnitude down from 4o (well, that isn't completely fair: I don't know what the average 'expert' size is in 4o, but I doubt the authors are doing mixture of experts at only 1.7B). A model can 'memorize' way more shit with that many parameters.

igleria 26 days ago

This immediately brings Hopfield networks to mind: https://arxiv.org/abs/2008.02217

When I worked with them circa 2012, they were practically toys. Maybe we are in a better place now?
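
For anyone who hasn't clicked through: that's the "Hopfield Networks is All You Need" paper, and the reason it comes to mind is that its modern Hopfield update rule is, up to projections, softmax attention. Roughly (my paraphrase of the paper's retrieval step, with toy sizes):

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    d, N, beta = 8, 16, 1.0                # toy sizes, just for illustration
    rng = np.random.default_rng(0)
    X = rng.standard_normal((d, N))        # stored patterns, one per column
    xi = rng.standard_normal(d)            # query / current state

    # Modern Hopfield retrieval: xi_new = X softmax(beta X^T xi). With xi as the
    # query and the stored patterns as keys and values, this is one step of
    # softmax attention.
    xi_new = X @ softmax(beta * (X.T @ xi))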

ottaborra 26 days ago

RNN with extra steps?

tripplyons 26 days ago

There are many papers that use a recurrence across sub-sequences and attention within sub-sequences. Google did this with Infini-Attention and one of the variants from the Titans paper. However, I think the earliest example of this is Transformer-XL.
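
The basic trick, in case it's not obvious: cache the previous segment's hidden states and let the current segment attend over them read-only. A simplified sketch (single head, no relative positions, not any of those papers' exact formulation):

    import torch
    import torch.nn.functional as F

    def segment_attention(x, mem, w_q, w_k, w_v):
        # x: (T, d) current segment; mem: (M, d) cached hidden states from the
        # previous segment; w_q, w_k, w_v: (d, d) projection matrices.
        ctx = torch.cat([mem.detach(), x], dim=0)      # (M+T, d); no gradient into the cache
        q = x @ w_q                                    # (T, d)
        k, v = ctx @ w_k, ctx @ w_v                    # (M+T, d)
        scores = q @ k.T / k.shape[-1] ** 0.5          # (T, M+T)
        T, M = x.shape[0], mem.shape[0]
        # causal mask: token t sees all of mem plus current tokens up to t
        mask = torch.ones(T, M + T).tril(diagonal=M).bool()
        scores = scores.masked_fill(~mask, float("-inf"))
        out = F.softmax(scores, dim=-1) @ v            # (T, d)
        new_mem = x.detach()                           # becomes the cache for the next segment
        return out, new_mem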

biofox 26 days ago

Isn't that all of modern AI?

immibis 26 days ago

Transformers are completely unlike RNNs.

tripplyons 26 days ago

There are some interesting connections between them. If you remove the softmax from the attention formula, you end up with linear attention, which has a recurrent form.

I haven't read it, but the Mamba 2 paper claims to establish a stronger connection.
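
Concretely, on the linear attention point: without the softmax you can regroup (Q K^T) V as Q (K^T V), and the causal version becomes a running sum over outer products, i.e. an RNN state. A quick sketch (assuming the elu(x)+1 feature map commonly used in the linear attention papers):

    import torch

    def phi(x):
        # a common feature-map choice in the linear attention literature
        return torch.nn.functional.elu(x) + 1

    def linear_attention_parallel(q, k, v):
        # causal linear attention computed "attention-style" with cumulative sums
        qp, kp = phi(q), phi(k)                                  # (T, d)
        kv = torch.einsum('td,te->tde', kp, v).cumsum(dim=0)     # running sum of k_t v_t^T
        num = torch.einsum('td,tde->te', qp, kv)                 # (T, d)
        den = (qp * kp.cumsum(dim=0)).sum(dim=-1, keepdim=True)  # (T, 1)
        return num / den

    def linear_attention_recurrent(q, k, v):
        # the same computation as an RNN: S_t = S_{t-1} + phi(k_t) v_t^T
        T, d = q.shape
        S, z, out = torch.zeros(d, d), torch.zeros(d), torch.zeros(T, d)
        for t in range(T):
            qt, kt, vt = phi(q[t]), phi(k[t]), v[t]
            S = S + torch.outer(kt, vt)    # running key-value state
            z = z + kt                     # running normalizer
            out[t] = (qt @ S) / (qt @ z)
        return out

    # The two agree up to floating-point error, which is the sense in which
    # the attention view and the RNN view coincide.
    q, k, v = torch.randn(5, 4), torch.randn(5, 4), torch.randn(5, 4)
    print(torch.allclose(linear_attention_parallel(q, k, v),
                         linear_attention_recurrent(q, k, v), atol=1e-4))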

kadushka 26 days ago

+1

anentropic 26 days ago

GitHub link in the paper is a 404 - private repo?