> The cold-start training procedure begins by prompting DeepSeek-V3 to decompose complex problems into a series of subgoals
It feels pretty intuitive to me that the ability of an LLM to break a complex problem down into smaller, more easily solvable pieces will unlock the next level of complexity.
This pattern feels like a technique often taught to junior engineers: how to break up a multi-week project into bite-sized tasks. This model is obviously math-focused, but I see no reason why this wouldn't be incredibly powerful for code-based problem solving.
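In Lean terms, I'd picture the decomposition looking something like the toy sketch below (my own illustration, not from the paper): each subgoal becomes a `have` step left as `sorry`, which a later pass (or a smaller prover model) can then fill in.

```lean
import Mathlib

-- Toy example of subgoal decomposition: the proof is sketched as `have`
-- statements with `sorry` placeholders, each of which is an independent,
-- smaller problem to solve.
theorem sum_sq_nonneg (a b : ℝ) : 0 ≤ a ^ 2 + b ^ 2 := by
  -- Subgoal 1: the square of a real number is nonnegative
  have h1 : 0 ≤ a ^ 2 := by sorry
  -- Subgoal 2: same fact for b
  have h2 : 0 ≤ b ^ 2 := by sorry
  -- Combine the two subgoals to close the original goal
  exact add_nonneg h1 h2
```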
The best part about these is that the weights are static, so I know I won't suddenly have to deal with a sassy, unusable update for a week.
Or, like with Claude, it being effectively lobotomized during North American 'business' hours. 3am PST? Cracked. 8am PST? ... mentally challenged.
Super interesting that they chose 671B and 7B, with nothing like a 32B, which feels like a "sweet spot".
Also notable: the earliest planning for a well-received new model release might now include both parameter-based and skill-type market segmentation.
--> "In an increasingly crowded field of LLMs, how will our (costly to produce) model stand out?"
related: I imagine in the future we might have several "expert" LLMs, and a wrapper could delegate tasks to them as needed, as if each were a "tool". That way we get segregation of expertise: each individual model can excel at one single thing.
A prover model might be used as a tool in the near future.
For a concrete example today, see https://openrouter.ai/openrouter/auto
That's nice, but imagine first having models that are experts in specific domains. Routing seems to be the easy part (just feed the available models as tools to your wrapper LLM).
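Concretely, the routing part really isn't much code. A minimal sketch, assuming an OpenAI-compatible endpoint (OpenRouter here) and illustrative expert model names: the wrapper model only sees the tool definitions and decides which expert to delegate to.

```python
import json
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="...")

# Expert models exposed to the wrapper as tools (model ids are illustrative).
EXPERTS = {
    "prover": "deepseek/deepseek-prover-v2",  # formal math / Lean proofs
    "coder": "some-org/code-expert-model",    # hypothetical coding expert
}

tools = [
    {
        "type": "function",
        "function": {
            "name": f"ask_{name}",
            "description": f"Delegate a task to the {name} expert model.",
            "parameters": {
                "type": "object",
                "properties": {"task": {"type": "string"}},
                "required": ["task"],
            },
        },
    }
    for name in EXPERTS
]

def ask_expert(name: str, task: str) -> str:
    """Forward a subtask to the chosen expert model and return its answer."""
    reply = client.chat.completions.create(
        model=EXPERTS[name],
        messages=[{"role": "user", "content": task}],
    )
    return reply.choices[0].message.content

# The wrapper model picks an expert by emitting a tool call.
response = client.chat.completions.create(
    model="openrouter/auto",
    messages=[{"role": "user", "content": "Prove that the sum of two even numbers is even."}],
    tools=tools,
)
for call in response.choices[0].message.tool_calls or []:
    expert = call.function.name.removeprefix("ask_")
    print(ask_expert(expert, json.loads(call.function.arguments)["task"]))
```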
Is that not what MoE models already do?
No. Each expert is not separately trained, and while they may store different concepts, they are not meant to be different experts in specific domains. However, there are certain technologies to route requests to different domain expert LLMs or even fine-tuning adapters, such as RouteLLM.
That might already happen behind what they call test time compute
The No Free Lunch Theorem implies that something like this is inevitable https://en.wikipedia.org/wiki/No_free_lunch_in_search_and_op...
A system of n experts is no different from a single expert with respect to the NFLT. The theorem is entirely indifferent to (i.e. "equally skeptical of") the idea.
> related: I imagine in the future we might have several "expert" LLMs, and a wrapper could delegate tasks to them as needed, as if each were a "tool". That way we get segregation of expertise: each individual model can excel at one single thing.
In the future? I'm pretty sure people do that already.
No, I disagree. I would want ChatGPT to abstract away the expert models (a biochemistry model, a coding model, a physics model), and maybe o3 would use these models as tools to come up with an answer.
The point being that a separate expert model would be better at its own field than a single model that tries to be good at everything. Intuitively it makes sense; in practice, I have seen anecdotes where fine-tuning a small model on domain data makes the model lose coherence on other topics.
Mistral's model is a mixture-of-experts model.
It's crudely done though.
That PutnamBench graph (the middle one) shows a 49/658 solve rate.
> The resulting model, DeepSeek-Prover-V2-671B, achieves state-of-the-art performance in neural theorem proving, reaching 88.9% pass ratio on the MiniF2F-test and solving 49 out of 658 problems from PutnamBench.
Which is 0.07% (edit: 7%) for PutnamBench
49/658 is about 7.4%
Sorry, forgot to multiply by 100
I bet DeepSeek-Prover-V2 wouldn't have made that mistake
And it should be powerful for breaking down reasoning chains of thought too.
IMO current models can already break things up into bite-sized pieces. The limiters I've seen are twofold:
1) Maintaining context of the overall project and goals while working in the weeds on a subtask of a task on an epic (so to speak), both in terms of what has been accomplished already and what still needs to be accomplished,
and 2) Getting an agentic coding tool that can actually handle the scale of doing 50 small projects back to back. With these agentic tools, I find they start projects off really strong, but by task #5 they're just making a mess with every change.
I've played with keeping a dev-progress.md file and an implementation-plan.md file in context for every request, ending each task by updating both files. But me manually keeping all this context isn't solving all my problems.
And all the while, tools like Cline are gobbling up 2M tokens to make small changes.
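For what it's worth, the planning-file workflow above boils down to something like this rough sketch (file names match the comment; the prompt wording is just an illustration):

```python
# Prepend the standing planning files to every request, and ask the model to
# emit updated versions of them at the end of each task.
from pathlib import Path

CONTEXT_FILES = ["implementation-plan.md", "dev-progress.md"]

def build_prompt(task: str) -> str:
    """Assemble the per-task prompt with the standing project context."""
    context = "\n\n".join(
        f"## {name}\n{Path(name).read_text()}" for name in CONTEXT_FILES
    )
    return (
        f"{context}\n\n"
        f"## Current task\n{task}\n\n"
        "When finished, output updated versions of both context files."
    )
```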
> Maintaining context of the overall project and goals while working in the weeds on a subtask of a task on an epic (so to speak), both in terms of what has been accomplished already and what still needs to be accomplished
This is a struggle for every human I’ve ever worked with