ReALM: Reference Resolution as Language Modeling

140 points
dvt · 13 days ago

I'm very excited about work being done in this area. In fact, I'm working on a product that does exactly this (runs in the background, local LLM, has access to screen-space entities, can take certain actions). It feels pretty magical to use (here it's running on my 3090Ti; much slower but still serviceable on my M1 MBP):

Currently using Mistral-7B-Instruct-v0.2, but working on a fine-tuning dataset which should make it work better with local application interfaces (console, browser, email client, Slack, Discord, Word, Excel, etc.).

smravec · 13 days ago

Can you explain how you plan to fine-tune it (how will you make that dataset, what data will you use...)? I've always wondered how it's done.

dvt · 13 days ago

Still a lot of work to be done before that point, but I'll write at least 1000 prompts (some synthetic, some hand-crafted) based on the "arrow notation" I posted here[1]. The fine-tuning itself is actually quite easy[2] as long as the data is properly formatted. After, I'll have to quantize the model (which I've never done before, but doesn't seem too hard).
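For readers unfamiliar with how such datasets are usually laid out, here is a minimal sketch of what one instruction-tuning example might look like. The JSONL layout and field names are common conventions, not the commenter's actual schema, and the "arrow notation" target string is a placeholder:

```python
import json

# Hypothetical shape of one fine-tuning example. The field names and the
# arrow-notation completion below are illustrative placeholders, not the
# actual dataset format described in the comment.
example = {
    "prompt": (
        "User: clone this git repo\n"
        "Screen: https://github.com/example/repo"
    ),
    "completion": "action -> git_clone(https://github.com/example/repo)",
}

# Instruction-tuning datasets are commonly stored one JSON object per line.
with open("finetune.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```

With ~1000 such pairs in a consistent format, standard supervised fine-tuning tooling can consume the file directly.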



lewismenelaws · 13 days ago

Cool! How hard do you think it would be for it to take actions?

dvt · 13 days ago

I've been working on this for the past month, and in its current state it runs some actions, but it's very context-dependent.

For example, if I'm on a GitHub page and I say "clone this git repo," it works pretty flawlessly (it git clones the repo into a scratch directory you set up in the settings). But if I'm reading a blog which references a git repo, it sometimes gets confused (it may try to "clone" the blog URL, for example), so I'm working through a few solutions while trying to avoid multi-shot prompting.

A lot of this involves pretty par-for-the-course data cleaning. For example, you turn the user query into an embedding, which you then compare to a few "contexts" (git workflows, research, creative work) to see which one matches it best, and then prune the raw screen data per that context, removing (or de-emphasizing) non-context-relevant information. So, in the above case, if I'm trying to clone a git repo, I don't care about non-git URLs (we can mask them, remove them, or whatever). Then we feed the sanitized context to the LLM. Et voilà!
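The routing step described above can be sketched in a few lines. This is a toy illustration, not the commenter's implementation: a real system would use a proper sentence embedder (e.g. a sentence-transformer), whereas here a bag-of-words vector and cosine similarity stand in for it, and the context descriptions are invented:

```python
import math
from collections import Counter

def embed(text):
    # Toy stand-in for a real sentence embedder: a bag-of-words count vector.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Candidate "contexts" the query is matched against; the keyword
# descriptions are made up for illustration.
contexts = {
    "git workflows": "clone commit push pull repo branch git",
    "research": "paper abstract citation read summarize",
    "creative work": "draft write story edit document",
}

def pick_context(query):
    # Route the query to the best-matching context, which then decides
    # what screen data to keep (e.g. only git URLs for git workflows).
    scores = {name: cosine(embed(query), embed(desc))
              for name, desc in contexts.items()}
    return max(scores, key=scores.get)

print(pick_context("clone this git repo"))  # → git workflows
```

Once a context is chosen, the pruning step is just filtering the scraped screen entities against that context before building the LLM prompt.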

djohnston · 13 days ago

Your video is private!

dvt · 13 days ago

Whoops, meant to just make it unlisted, fixed.

danielvaughn · 13 days ago

Can someone ELI5 what is meant by reference resolution? Sounds like it means identifying entities, given that it talks about “on screen and in the background”.

shermantanktop · 13 days ago

“This,” “that,” “the other thing.”

Or as the Beatles said, “Here, There, and Everywhere”

Linguistic references which are relative to the speaker's context and viewpoint.

lionkor · 13 days ago

The only time when "oh, this and that" is a very specific and accurate answer

threeseed · 13 days ago

For context this is the basis for Apple's Siri replacement.

I wonder whether this is going to make such a difference given Apple's POI data is so poor.

CharlesW · 13 days ago

> I wonder whether this is going to make such a difference given Apple's POI data is so poor.

Is it still? I know that was a common criticism at launch.

The main provider of map data is TomTom, but data is also supplied by Automotive Navigation Data, Getchee, Hexagon AB, IGN, Increment P, Intermap Technologies, LeadDog, MDA Information Systems, OpenStreetMap, and Waze. Apple renewed their agreement with TomTom in 2015, though later decided to gradually switch to OpenStreetMap and remove all of TomTom-contributed map data except for live traffic information.

yunohn · 13 days ago

Yeah, it still sucks for me in the Netherlands - a country with a massive iOS population.

Since we can’t change default map apps, I always end up having to use it with calendar, reminders, siri, watch, etc. Always disappointed by how much info it lacks.

ultra_nick · 13 days ago

Did they just add 2D text position to the feature vectors?
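One common alternative to adding 2D positions to feature vectors is to serialize the screen itself into text: group entities into rows by vertical position and order them left-to-right, so the LLM sees something like reading order. This sketch is illustrative of that general approach, not the paper's exact encoding; the entity dicts and bracket tagging are made up:

```python
# Hypothetical screen-space entities with pixel coordinates; in a real
# system these would come from accessibility or OCR data.
entities = [
    {"text": "Call", "x": 10, "y": 200},
    {"text": "555-1234", "x": 80, "y": 200},
    {"text": "Contacts", "x": 10, "y": 50},
]

def render_screen(entities, line_height=20):
    # Bucket entities into rows by vertical position, then order
    # left-to-right within each row, approximating reading order.
    rows = {}
    for e in entities:
        rows.setdefault(e["y"] // line_height, []).append(e)
    lines = []
    for y in sorted(rows):
        row = sorted(rows[y], key=lambda e: e["x"])
        lines.append(" ".join(f"[{e['text']}]" for e in row))
    return "\n".join(lines)

print(render_screen(entities))
# [Contacts]
# [Call] [555-1234]
```

The flat text output lets a plain language model resolve references like "call that number" without any changes to its input features.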