Octopus v2: On-device language model for super agent

90 points10
vessenes 10 days ago

Short summary of the paper:

Take Gemma-2B. Take your API. Use ChatGPT-3.5 to generate 1,000 "correct" API function call responses by dint of placing only your API calls in the pre-prompt, then prompting it. I imagine they use ChatGPT to create the request language as well. Then make 1,000 "incorrect" API call responses by filling the pre-prompt with functions not from your API.
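
A rough sketch of that data recipe as I read it, not the authors' pipeline. The API names are made up, and `fake_llm` is a hypothetical stand-in for the call to a large model that would write the query and the function call from whatever functions sit in the pre-prompt.

```python
import random

# Hypothetical function names, for illustration only (not from the paper).
MY_API = ["take_a_photo", "set_timer_alarm", "send_text_message"]
OTHER_API = ["book_flight", "order_pizza", "check_stock_price"]


def fake_llm(functions, rng):
    """Stand-in for the large-model call: picks one function from the
    pre-prompt and returns a (query, call) pair for it."""
    name = rng.choice(functions)
    return {"query": f"Please {name.replace('_', ' ')}", "call": f"{name}()"}


def build_dataset(n_per_class, seed=0):
    """Build n positive rows (pre-prompt holds my API) and n negative rows
    (pre-prompt holds unrelated functions), then shuffle them."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_per_class):
        rows.append({**fake_llm(MY_API, rng), "label": "positive"})
        rows.append({**fake_llm(OTHER_API, rng), "label": "negative"})
    rng.shuffle(rows)
    return rows


dataset = build_dataset(1000)
print(len(dataset), dataset[0])
```

In a real run the stub would be replaced by an actual model call, and the generated rows would presumably need a validation pass before being used for fine-tuning.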


Note that they use "functional tokens" in training - they convert a function to a particular, previously unused tokenization, and refer to it that way. They claim this speeds up inference (I'm sure it does). They don't make any claims as to whether or not it changes their accuracy (I bet that it does). It definitely makes the system more fragile / harder to train for large and very large APIs.

Outcome: highly capable single API function call LLM. They say you could do it with as little as 100 training inputs if you really wanted.

I think this is interesting, but not world-shattering. I could imagine building a nice little service company on it, basically just "send us a git repo and you'll get a helpful function call API for this version of your code which you can hook up to an API endpoint / chatbot".

Limitations are going to be largely around Gemma-2B's skills: a 2B model isn't super sophisticated. And you can see they specify "<30 tokens" for the prompt. But I imagine this could be trained quickly enough that it could be part of a release CI process. There are a number of libraries I use that I would like to have access to such a model for.

I'd be interested in something that has general knowledge of a large set of packages for a language, and could pull in / finetune / MoE little models for specific repositories I'm coding on. Right now I would rely on either a very large model and hope its knowledge cutoff is right (Claude/GPT-4), or using a lot of a large context window. There might be some Goldilocks version in the middle here which would be helpful in a larger codebase but be faster and more accurate than the cloud monopoly providers.

AlexChen-NexaAI 8 days ago

Thanks for sharing. I am the author of this paper. You can try our model, and a YouTuber has created a tutorial video on how to use it.

One note: we haven't tested how many tokens the model can process. Queries over 30 tokens are acceptable. We will follow up on this issue. The number 30 is derived from typical queries in an Android use case.

saltsaman 10 days ago

I can see people training LoRAs this way, which allows for multiple API function calls.

vessenes 10 days ago

Yeah, good idea! I'm not sure if you would be successful mixing LoRA + functional tokens. If you could, that would be great. Then you could ship very light LoRA packs with repositories.

Their LoRA training was, I think, against their finetuned model, not Gemma-2B directly. But it seems worth playing with; it could be super useful.

AlexChen-NexaAI 8 days ago

Yes, you can do that! We have experimented with this idea. The LoRA training is already reported in the paper, and it works well. I think it would be very interesting.

gardnr 10 days ago

> To mitigate such errors, we propose designating functions as unique functional tokens.

I just skimmed the paper but this seems to be the crux of it. They map functions to a single token and can then fine-tune models to use the token instead of the function name. This increases accuracy of smaller LLMs and reduces total number of tokens required for prompts and for generations, which is where they get their speed gains from.
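
A toy sketch of the idea, to make the token savings concrete. Everything here is an assumption for illustration: the function names, the `<fn_i>` naming scheme, and the regex tokenizer, which only roughly imitates how a subword tokenizer would break up a long function name. It is not the paper's code.

```python
import re

# Hypothetical API; these function names are illustrative, not from the paper.
FUNCTIONS = ["take_a_photo", "get_trending_news", "send_text_message"]

# Assumed naming scheme: one reserved token per function.
FUNC_TO_TOKEN = {name: f"<fn_{i}>" for i, name in enumerate(FUNCTIONS)}
TOKEN_TO_FUNC = {tok: name for name, tok in FUNC_TO_TOKEN.items()}


def toy_tokenize(text):
    """Toy tokenizer: functional tokens stay whole, everything else splits
    into alphanumeric runs and single punctuation characters."""
    pattern = r"<fn_\d+>|[A-Za-z0-9]+|[^\sA-Za-z0-9]"
    return re.findall(pattern, text)


def encode_call(call):
    """Replace the function name in a call string with its functional token."""
    name = call.split("(", 1)[0]
    return FUNC_TO_TOKEN[name] + call[len(name):]


def decode_call(encoded):
    """Map the leading functional token back to the function name."""
    token = encoded.split("(", 1)[0]
    return TOKEN_TO_FUNC[token] + encoded[len(token):]


plain = "send_text_message('Bob', 'running late')"
encoded = encode_call(plain)
print(encoded)
print(len(toy_tokenize(plain)), "tokens vs", len(toy_tokenize(encoded)))
```

With a real model the new tokens would also have to be added to the tokenizer vocabulary and given trained embeddings, which is the part the fine-tuning is for.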

The paper is worth a look just to see "Figure (2)"

alwa 10 days ago

Figure 2 is incredible.

With only passing familiarity with the norms in this kind of work, the accuracy rates of all models on this benchmark suite seem suspiciously (and uniformly) high. Is choosing the right intention among “20 vehicle functions” or “20 Android APIs” consistent with an ordinary level of ambition in this kind of research these days?

jerpint 10 days ago

That’s pretty clever, encoding atomic concepts as a token

wanderingmind 10 days ago

I'm going to start commenting on arXiv paper links with the same request.

1. Show me the data

2. Show me the code

3. Show me the model

If we can't play with it and modify it easily, it doesn't belong on HN.

smcleod 10 days ago

Yeah, I've got to agree with this. Having links to a paper is useful, but not that interesting without a demo and the source code. It doesn't help that arXiv has a pretty horrible interface for anyone other than people writing papers.

rfoo 10 days ago


iandanforth 10 days ago

They might even get higher accuracies with a dedicated classification layer. By using the existing vocabulary they are spreading the probability mass across a much larger space. If they stuck to N options where N is the total number of functions available to the model I suspect they could get to 100% accuracy.
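
A small sketch of the probability-mass point. Note this shows masking the output distribution down to the N function tokens at decode time, which is a cheaper cousin of a trained classification layer rather than the same thing. The logits and token names are made up.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a dict of token -> logit."""
    peak = max(logits.values())
    exps = {tok: math.exp(val - peak) for tok, val in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def restricted_softmax(logits, allowed):
    """Softmax computed only over the allowed tokens (the N functions)."""
    return softmax({tok: val for tok, val in logits.items() if tok in allowed})


# Made-up logits: three hypothetical functional tokens plus ordinary tokens.
logits = {
    "<fn_0>": 2.0, "<fn_1>": 3.5, "<fn_2>": 1.0,
    "the": 4.0, "call": 2.5, "(": 1.5,
}
function_tokens = {"<fn_0>", "<fn_1>", "<fn_2>"}

full = softmax(logits)
restricted = restricted_softmax(logits, function_tokens)

# Over the full vocabulary an ordinary token can win outright;
# restricted to the N functions, the best function always wins.
print(max(full, key=full.get), max(restricted, key=restricted.get))
```

Here the unrestricted distribution picks an ordinary word, while the restricted one can only pick a function. Whether that gets anywhere near 100% on real, messy queries is a separate question.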

It's also not clear whether there is sufficient ambiguity in the test data for this to be a generalizable model. The difficulty with "intent recognition" (which they don't mention but is what this problem is called for agents like Siri) is that human generated inputs vary widely and are often badly formed. If they haven't done extensive evaluation with human users and/or they've constrained the functions to be quite distinct then they aren't yet tackling a hard problem, they've just got a complex setting.

turnsout 10 days ago

This is the frontier—tiny, specialized models like this and ReALM [0], coupled to the application logic and able to run on-device.

Eventually devices will be powerful enough to run more general purpose models locally, but for high-frequency user tasks with a low tolerance for error, smaller specialized models may always win.


mikece 10 days ago

"What is better than one recipe for Octopus?"

I can't be the only person who heard that line in their head instantly when reading that headline.

zhiyuan8 8 days ago

Hi all,

Thanks for discussing our work, feel free to contact us for follow-up demos and collaborations!

CGamesPlay 10 days ago

So, I guess it's a LoRA for function calls. Makes sense that this would work well, and bodes well for creating really cheap request routers in more advanced cloud-based situations.

iandanforth 10 days ago

It's not. They do train one version with LoRA but also train three variants without.