DeepSeek-Prover-V2

127 points | 3 hours | github.com
islewis 2 hours ago

> The cold-start training procedure begins by prompting DeepSeek-V3 to decompose complex problems into a series of subgoals

It feels pretty intuitive to me that the ability of an LLM to break a complex problem down into smaller, more easily solvable pieces will unlock the next level of complexity.

This pattern feels like a technique often taught to junior engineers: how to break up a multi-week project into bite-sized tasks. This model is obviously math-focused, but I see no reason why this wouldn't be incredibly powerful for code-based problem solving.
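The decompose-into-subgoals step the paper describes can be sketched roughly like this. Everything here is illustrative, not DeepSeek's actual pipeline: `ask_model` is a stub standing in for a real LLM call (DeepSeek-V3 in the paper), and the prompt wording and numbered-list format are assumptions.

```python
def ask_model(prompt: str) -> str:
    # Stubbed response so the example runs on its own; a real
    # implementation would call an LLM API here.
    return ("1. Show the base case n = 1.\n"
            "2. Assume the statement for n, prove it for n + 1.\n"
            "3. Combine the two steps by induction.")

def decompose(problem: str) -> list[str]:
    """Ask the model for a numbered list of subgoals and parse it."""
    prompt = f"Decompose this problem into a series of subgoals:\n{problem}"
    response = ask_model(prompt)
    subgoals = []
    for line in response.splitlines():
        line = line.strip()
        # Expect one subgoal per line, formatted "1. ...", "2. ...", etc.
        if line and line[0].isdigit():
            subgoals.append(line.split(".", 1)[1].strip())
    return subgoals

subgoals = decompose("Prove that 1 + 2 + ... + n = n(n+1)/2 for all n >= 1.")
for i, goal in enumerate(subgoals, 1):
    print(i, goal)
```

Each subgoal can then be attempted independently (in the paper, by the smaller 7B prover), with the results stitched back into a proof of the original goal.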

cadamsdotcom 1 hour ago

And it should be powerful for breaking down reasoning chains of thought too.

criley2 1 hour ago

Imo current models can already break things up into bite-sized pieces. The limiter I've seen is twofold:

1) Maintaining context of the overall project and goals while working in the weeds on a subtask of a task on an epic (so to speak) both in terms of what has been accomplished already and what still needs to be accomplished

and 2) Getting an agentic coding tool that can actually handle the scale of doing 50 small projects back to back. With these agentic tools, I find they start projects off really strong, but by task #5 they're just making a mess with every change.

I've played with keeping a dev-progress.md file and an implementation-plan.md file that I include in context for every request, ending each task by updating those files. But manually maintaining all this context isn't solving all my problems.

And all the while, tools like Cline are gobbling up 2M tokens to make small changes.
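The progress-file workflow described above can be sketched as a small wrapper. The file names come from the comment; `llm_call` is a hypothetical stand-in for whatever agentic tool handles the actual request.

```python
from pathlib import Path

PLAN = Path("implementation-plan.md")
PROGRESS = Path("dev-progress.md")

def llm_call(prompt: str) -> str:
    # Stand-in for a real agentic-tool invocation.
    return "done"

def run_task(task: str) -> str:
    # Prepend the persistent context files (if present) to every request...
    context = "\n\n".join(p.read_text() for p in (PLAN, PROGRESS) if p.exists())
    answer = llm_call(f"{context}\n\nCurrent task: {task}")
    # ...and end each task by appending an entry to the progress file,
    # so the next request starts with an up-to-date summary.
    with PROGRESS.open("a") as f:
        f.write(f"- {task}: done\n")
    return answer
```

The point of the pattern is that the model never has to reconstruct project state from a long transcript; the two files act as compact, human-auditable memory between tasks.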

jhrmnn 52 minutes ago

> Maintaining context of the overall project and goals while working in the weeds on a subtask of a task on an epic (so to speak) both in terms of what has been accomplished already and what still needs to be accomplished

This is a struggle for every human I’ve ever worked with

qoez 52 minutes ago

The best part about these is that I know the weights are static, so I know I won't suddenly have to deal with a sassy, unusable update for a week.

Implicated 41 minutes ago

Or, like with Claude, it being effectively lobotomized during North American 'business' hours. 3am PST? Cracked. 8am PST? ... mentally challenged.

jasonjmcghee 1 hour ago

Super interesting that they chose 671B and 7B, with nothing like a 32B, which feels like a "sweet spot".

bredren 1 hour ago

Also notable: the earliest planning for a positively received release of a new model might include both parameter-based and skill-type market segmentation.

--> "In an increasingly crowded field of LLMs, how will our (costly to produce) model stand out?"

simianwords 2 hours ago

related: I imagine in the future we might have several "expert" LLMs and a wrapper can delegate tasks as needed as if it were a "tool". That way we can have segregation of expertise: each individual model can excel at one single thing.

A prover model might be used as a tool in the near future.

Arcuru 2 hours ago

For a concrete example today, see https://openrouter.ai/openrouter/auto

simianwords 1 hour ago

That's nice, but imagine first having models that are experts in specific domains. Routing seems to be the easy part (just feed the available models as tools to your wrapper LLM).
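The "experts as tools" idea above can be sketched in a few lines. Everything here is hypothetical: the experts are stub functions, and the routing is a keyword check, where in practice the wrapper would itself be an LLM given the experts as tool definitions.

```python
# Stub "expert" models, keyed by domain. In practice each would be a
# separately trained or fine-tuned model behind an API.
EXPERTS = {
    "prover": lambda q: f"[prover] formal proof for: {q}",
    "coder": lambda q: f"[coder] patch for: {q}",
    "general": lambda q: f"[general] answer to: {q}",
}

def route(question: str) -> str:
    # Stub routing logic; a wrapper LLM would make this choice by
    # selecting one of the expert "tools".
    q = question.lower()
    if "prove" in q:
        name = "prover"
    elif "bug" in q or "code" in q:
        name = "coder"
    else:
        name = "general"
    return EXPERTS[name](question)

print(route("Prove that sqrt(2) is irrational"))
```

The router only needs to pick a name; the hard part, as the comment says, is having experts that are genuinely better in their domain than one generalist model.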

samvaran 2 hours ago

Is that not what MoE models already do?

oofbaroomf 2 hours ago

No. Each expert is not separately trained, and while they may store different concepts, they are not meant to be experts in specific domains. However, there are technologies for routing requests to different domain-expert LLMs, or even to fine-tuning adapters, such as RouteLLM.

retinaros 1 hour ago

That might already be happening behind what they call test-time compute

someguy101010 2 hours ago

The No Free Lunch Theorem implies that something like this is inevitable: https://en.wikipedia.org/wiki/No_free_lunch_in_search_and_op...

repsilat 1 hour ago

A system of n experts is no different to a single expert w.r.t. the NFLT. The theorem is entirely indifferent to (i.e. "equally skeptical of") the idea.

koakuma-chan 2 hours ago

> related: I imagine in the future we might have several "expert" LLMs and a wrapper can delegate tasks as needed as if it were a "tool". That way we can have segregation of expertise: each individual model can excel at one single thing.

In the future? I'm pretty sure people do that already.

simianwords 1 hour ago

No, I disagree. I would want ChatGPT to abstract away the expert models (biochemistry model, coding model, physics model), and maybe O3 would use these models as tools to come up with an answer.

The point being that a separate expert model would be better in its own field than a single model that tries to be good at everything. Intuitively it makes sense; in practice I have seen anecdotes where fine-tuning a small model on domain data makes the model lose coherence on other topics.

kratom_sandwich 2 hours ago

Mistral's model is a mixture-of-experts model

energy123 2 hours ago

It's crudely done though.

smusamashah 2 hours ago

That PutnamBench graph (middle one) is showing a 49/658 solve rate.

> The resulting model, DeepSeek-Prover-V2-671B, achieves state-of-the-art performance in neural theorem proving, reaching 88.9% pass ratio on the MiniF2F-test and solving 49 out of 658 problems from PutnamBench.

Which is 0.07% (edit: 7%) for PutnamBench

darkmighty 2 hours ago

49/658 is 7%

smusamashah 2 hours ago

Sorry, forgot to multiply by 100

booi 1 hour ago

I bet DeepSeek-Prover-V2 wouldn't have made that mistake