(Probably bc I'm dumb) I'm very confused by this paper. The dimensions are all over the place: first they say M is an N x d x d matrix, then it becomes N x d. And then they try to scale M with g_out and add it to E_attn, which is a T x d matrix??? Are the gates scalars, vectors, or matrices? If they're matrices, the dimensions don't line up with M either
Missed opportunity to call them LMM and enjoy the onslaught of typos
Some Chinese researchers have worked on copper nanotubes and abbreviated them as CuNTs [0].
[0]: https://www.researchgate.net/publication/260800015_Structura...
Or Models with Large Memory. MLMs :)
This seems to be the most appropriate one
I can't wait to enhance my STT-RAG-LLM-TTS with some LMMs
You forgot SOTA; that one is flying around a lot.
Also left out SoDoSoPa
Wouldn't it be STT-LLM-RAG-TTS?
They're still language models
LMLM
Just as Graph RAG should have just been GAG.
That's why they named it LM2 ..
Seems like LM² would have made more sense in that case. Or L²M², or LLMM
The question is whether L and M commute.
Because that makes it harder to type out
(LM)²
The largest model they tested is 1.7B.
Do they believe that it's likely to scale as high as "standard" LLMs? Or is this another state space model moment, where it turns out to be equivalent to them?
The concept of a memory module, separate from a prediction engine, makes sense to me; our brains might operate like that (short- vs. long-term memory).
Many research groups have been trying to implement this idea, recent paper from Meta comes to mind: https://arxiv.org/html/2412.09764
If they believe it’s likely to scale, they will try to scale it, so we will either hear about it in a few months, or we won’t.
Table 2's results are interesting. If the paper is to be believed, just adding the memory model seems to improve reasoning tasks across the board.
That said, I do wonder if this is a bit of a mirage. At 1.7B parameters, they are 3 orders of magnitude down from 4o (well, that isn't completely fair; I don't know what the average 'expert' size is in 4o, but I doubt the authors are doing mixture of experts at only 1.7B). A model can 'memorize' way more shit with that many parameters.
This immediately makes my mind bring up Hopfield networks https://arxiv.org/abs/2008.02217
When I worked with them circa 2012, they were practically toys. Maybe we are in a better place now?
RNN with extra steps?
There are many papers that use a recurrence across sub-sequences and attention within sub-sequences. Google did this with Infini-Attention and one of the variants from the Titans paper. However, I think the earliest example of this is Transformer-XL.
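Roughly, the recipe is: attend within each fixed-size chunk, and carry a cache of the previous chunk's states forward as the recurrence. A minimal PyTorch sketch of that Transformer-XL-style pattern (the function names and the toy single-head attention are mine for illustration, not from any of those papers; causal masking and relative positions are omitted):

    import torch
    import torch.nn.functional as F

    def attend(q, kv):
        # plain single-head dot-product attention, just to make the sketch runnable
        # (causal masking omitted for brevity)
        scores = q @ kv.transpose(-2, -1) / q.shape[-1] ** 0.5
        return F.softmax(scores, dim=-1) @ kv

    def chunked_forward(x, chunk_len=4):
        # x: (batch, seq, dim). Attention is computed within each chunk (plus the
        # cached previous chunk); the cache is the recurrence across chunks.
        memory, outputs = None, []
        for chunk in x.split(chunk_len, dim=1):
            kv = chunk if memory is None else torch.cat([memory, chunk], dim=1)
            outputs.append(attend(chunk, kv))   # queries come from the current chunk only
            memory = chunk.detach()             # carried forward, no gradient (XL-style)
        return torch.cat(outputs, dim=1)

    y = chunked_forward(torch.randn(1, 16, 8))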
Isn't that all of modern AI?
Transformers are completely unlike RNNs.
There are some interesting connections between them. If you remove the softmax from the attention formula, you end up with linear attention, which has a recurrent form.
I haven't read it, but the Mamba 2 paper claims to establish a stronger connection.
Here is a paper explaining it: https://arxiv.org/abs/2006.16236
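For anyone curious, the recurrent form is pretty compact: replace the softmax with a positive feature map phi, and the causal attention output can be computed left to right with two running sums. A minimal sketch of the idea from that paper (the ELU+1 feature map follows the paper, but this is my toy code, not theirs):

    import torch
    import torch.nn.functional as F

    def linear_attention_recurrent(q, k, v):
        # q, k, v: (seq, dim). Without the softmax, attention can be rewritten as a
        # recurrence over running sums S = sum phi(k_i) v_i^T and z = sum phi(k_i).
        feat = lambda x: F.elu(x) + 1                 # positive feature map (ELU + 1)
        q, k = feat(q), feat(k)
        S = torch.zeros(q.shape[-1], v.shape[-1])     # state: (dim, dim_v)
        z = torch.zeros(q.shape[-1])                  # normalizer state
        outs = []
        for qi, ki, vi in zip(q, k, v):
            S = S + torch.outer(ki, vi)               # accumulate key-value outer products
            z = z + ki
            outs.append((qi @ S) / (qi @ z + 1e-6))   # O(d^2) per step, no T x T matrix
        return torch.stack(outs)

    y = linear_attention_recurrent(torch.randn(10, 8), torch.randn(10, 8), torch.randn(10, 8))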
GitHub link in the paper is a 404 - private repo?
You're not dumb. I think it's just poorly written and full of errors.