ReALM: Reference Resolution as Language Modeling

140 points
dvt · 13 days ago

I'm very excited about work being done in this area. In fact, I'm working on a product that does exactly this (runs in the background, local LLM, has access to screen-space entities, can take certain actions). It feels pretty magical to use (here it's running on my 3090Ti; much slower but still serviceable on my M1 MBP):

Currently using Mistral-7B-Instruct-v0.2, but working on a fine-tuning dataset which should make it work better with local application interfaces (console, browser, email client, Slack, Discord, Word, Excel, etc.).

smravec · 13 days ago

Can you explain how you plan to fine-tune it (how will you make that dataset, what data will you use...)? I've always wondered how it's done.

dvt · 13 days ago

Still a lot of work to be done before that point, but I'll write at least 1000 prompts (some synthetic, some hand-crafted) based on the "arrow notation" I posted here[1]. The fine-tuning itself is actually quite easy[2] as long as the data is properly formatted. After, I'll have to quantize the model (which I've never done before, but doesn't seem too hard).
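For readers unfamiliar with how such datasets are usually laid out, here is a minimal sketch of what one instruction-tuning example might look like. The JSONL layout and field names are common conventions, not the commenter's actual schema, and the "arrow notation" target string is a placeholder:

```python
import json

# Hypothetical shape of one fine-tuning example. The field names and the
# arrow-notation completion below are illustrative placeholders, not the
# actual dataset format described in the comment.
example = {
    "prompt": (
        "User: clone this git repo\n"
        "Screen: https://github.com/example/repo"
    ),
    "completion": "action -> git_clone(https://github.com/example/repo)",
}

# Instruction-tuning datasets are commonly stored one JSON object per line.
with open("finetune.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```

With ~1000 such pairs in a consistent format, standard supervised fine-tuning tooling can consume the file directly.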



lewismenelaws · 13 days ago

Cool! How hard do you think it would be for it to take actions?

dvt · 13 days ago

I've been working on this for the past month, and in its current state it runs some actions, but it's very context-dependent.

For example, if I'm on a GitHub page and I say "clone this git repo," it works pretty flawlessly (it git clones the repo into a scratch directory you set up in the settings). But if I'm reading a blog which references a git repo, it sometimes gets confused (it may try to "clone" the blog URL, for example), so I'm working through a few solutions while trying to avoid multi-shot prompting.

A lot of this involves pretty par-for-the-course data cleaning. For example, you turn the user query into an embedding, which you then compare to a few "contexts" (git workflows, research, creative work) to see which one matches it best, and then prune the raw screen data per that context, removing (or de-emphasizing) non-context-relevant information. So, in the above case, if I'm trying to clone a git repo, I don't care about non-git URLs (we can mask them, remove them, or whatever). Then we feed the sanitized context to the LLM. Et voilà!
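The routing step described above can be sketched in a few lines. This is a toy illustration, not the commenter's implementation: a real system would use a proper sentence embedder (e.g. a sentence-transformer), whereas here a bag-of-words vector and cosine similarity stand in for it, and the context descriptions are invented:

```python
import math
from collections import Counter

def embed(text):
    # Toy stand-in for a real sentence embedder: a bag-of-words count vector.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Candidate "contexts" the query is matched against; the keyword
# descriptions are made up for illustration.
contexts = {
    "git workflows": "clone commit push pull repo branch git",
    "research": "paper abstract citation read summarize",
    "creative work": "draft write story edit document",
}

def pick_context(query):
    # Route the query to the best-matching context, which then decides
    # what screen data to keep (e.g. only git URLs for git workflows).
    scores = {name: cosine(embed(query), embed(desc))
              for name, desc in contexts.items()}
    return max(scores, key=scores.get)

print(pick_context("clone this git repo"))  # → git workflows
```

Once a context is chosen, the pruning step is just filtering the scraped screen entities against that context before building the LLM prompt.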

djohnston · 13 days ago

Your video is private!

dvt · 13 days ago

Whoops, meant to just make it unlisted, fixed.

danielvaughn · 13 days ago

Can someone ELI5 what is meant by reference resolution? Sounds like it means identifying entities, given that it talks about “on screen and in the background”.

shermantanktop · 13 days ago

“This,” “that,” “the other thing.”

Or as the Beatles said, “Here, There, and Everywhere”

Linguistic references which are relative to the speaker's context and viewpoint.

lionkor · 13 days ago

The only time when "oh, this and that" is a very specific and accurate answer

threeseed · 13 days ago

For context this is the basis for Apple's Siri replacement.

I wonder whether this is going to make such a difference given Apple's POI data is so poor.

CharlesW · 13 days ago

> I wonder whether this is going to make such a difference given Apple's POI data is so poor.

Is it still? I know that was a common criticism at launch.

The main provider of map data is TomTom, but data is also supplied by Automotive Navigation Data, Getchee, Hexagon AB, IGN, Increment P, Intermap Technologies, LeadDog, MDA Information Systems, OpenStreetMap, and Waze. Apple renewed their agreement with TomTom in 2015, though later decided to gradually switch to OpenStreetMap and remove all of TomTom-contributed map data except for live traffic information.

yunohn · 13 days ago

Yeah, it still sucks for me in the Netherlands - a country with a massive iOS population.

Since we can’t change default map apps, I always end up having to use it with calendar, reminders, siri, watch, etc. Always disappointed by how much info it lacks.

ultra_nick · 13 days ago

Did they just add 2D text position to the feature vectors?
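One common alternative to adding 2D positions to feature vectors is to serialize the screen itself into text: group entities into rows by vertical position and order them left-to-right, so the LLM sees something like reading order. This sketch is illustrative of that general approach, not the paper's exact encoding; the entity dicts and bracket tagging are made up:

```python
# Hypothetical screen-space entities with pixel coordinates; in a real
# system these would come from accessibility or OCR data.
entities = [
    {"text": "Call", "x": 10, "y": 200},
    {"text": "555-1234", "x": 80, "y": 200},
    {"text": "Contacts", "x": 10, "y": 50},
]

def render_screen(entities, line_height=20):
    # Bucket entities into rows by vertical position, then order
    # left-to-right within each row, approximating reading order.
    rows = {}
    for e in entities:
        rows.setdefault(e["y"] // line_height, []).append(e)
    lines = []
    for y in sorted(rows):
        row = sorted(rows[y], key=lambda e: e["x"])
        lines.append(" ".join(f"[{e['text']}]" for e in row))
    return "\n".join(lines)

print(render_screen(entities))
# [Contacts]
# [Call] [555-1234]
```

The flat text output lets a plain language model resolve references like "call that number" without any changes to its input features.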