OpenAI O3 breakthrough high score on ARC-AGI-PUB

1385 points • 21 hours ago • arcprize.org

bluecoconut • 20 hours ago

Efficiency is now key.

~=$3400 per single task to meet human performance on this benchmark is a lot. Also it shows the bullets as "ARC-AGI-TUNED", which makes me think they did some undisclosed amount of fine-tuning (eg. via the API they showed off last week), so even more compute went into this task.

We can compare this roughly to a human doing ARC-AGI puzzles, where a human will take (high variance in my subjective experience) between 5 seconds and 5 minutes to solve a task. (So I'd argue a human is at 0.03 USD - 1.67 USD per puzzle at 20 USD/hr, and their document cites an average Mechanical Turker at $2 USD per task.)

Going the other direction: I am interpreting this result as human level reasoning now costs (approximately) 41k/hr to 2.5M/hr with current compute.
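
For anyone who wants to check the arithmetic, here is a minimal Python sketch of the conversion above. It assumes the ~$3400/task estimate, the $20/hr wage, and the 5 second to 5 minute range from this comment; the function names are made up for illustration.

```python
# Rough conversion between per-task cost and per-hour cost.
# All inputs are the commenter's estimates, not measured values.
COST_PER_TASK_USD = 3400          # estimated o3 high-compute cost per task
HUMAN_WAGE_USD_PER_HOUR = 20      # assumed human wage
SECONDS_PER_HOUR = 3600


def human_cost_per_task(seconds_per_task):
    """Wage cost of one puzzle for a human who needs `seconds_per_task`."""
    return HUMAN_WAGE_USD_PER_HOUR * seconds_per_task / SECONDS_PER_HOUR


def model_cost_per_human_hour(seconds_per_task):
    """Model cost of doing as many tasks as a human finishes in one hour."""
    tasks_per_hour = SECONDS_PER_HOUR / seconds_per_task
    return COST_PER_TASK_USD * tasks_per_hour


for label, seconds in (("fast human, 5 s/task", 5), ("slow human, 5 min/task", 300)):
    print(
        f"{label}: human ${human_cost_per_task(seconds):.2f}/task, "
        f"model ${model_cost_per_human_hour(seconds):,.0f}/hr"
    )
```

The slow-human case gives the ~41k/hr end of the range and the fast-human case gives the ~2.5M/hr end.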

Super exciting that OpenAI pushed the compute out this far so we could see the O-series scaling continue and intersect humans on ARC, now we get to work towards making this economical!

bluecoconut • 20 hours ago

Some other important quotes: "Average human off the street: 70-80%. STEM college grad: >95%. Panel of 10 random humans: 99-100%" -@fchollet on X

So, considering that the $3400/task system isn't able to compete with a STEM college grad yet, we still have some room (but it is shrinking; I expect even more compute will be thrown at this and we'll see these barriers broken in coming years).

Also, some other back of envelope calculations:

The gap in cost is roughly 10^3 between O3 High and avg. Mechanical Turkers (humans). Pure GPU cost improvement (~doubling every 2-2.5 years) puts us at 20-25 years.
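
The 20-25 year figure is just a count of doublings. A minimal sketch, assuming cost per unit of compute halves once per doubling period and nothing else changes (the function name is hypothetical):

```python
from math import log2


def years_to_close_gap(cost_gap, years_per_doubling):
    """Years until a `cost_gap`-fold cost difference is erased,
    assuming price-performance doubles every `years_per_doubling` years."""
    doublings_needed = log2(cost_gap)
    return doublings_needed * years_per_doubling


GAP = 10**3  # rough cost ratio quoted above
for period in (2.0, 2.5):
    print(f"doubling every {period} years: {years_to_close_gap(GAP, period):.1f} years")
```

A 10^3 gap is just under 10 doublings, which is where the 20-25 year range comes from.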

The question is now, can we close this "to human" gap (10^3) quickly with algorithms, or are we stuck waiting for the 20-25 years for GPU improvements. (I think it feels obvious: this is new technology, things are moving fast, the chance for algorithmic innovation here is high!)

I also personally think that we need to adjust our efficiency priors, and start looking not at "humans" as the bar to beat, but at theoretical computable limits (which show gaps much larger, ~10^9-10^15, for modest problems). Though, it may simply be the case that tool/code use + AGI at near human cost covers a lot of that gap.

miki123211 • 14 hours ago

It's also worth keeping in mind that AIs are a lot less risky to deploy for businesses than humans.

You can scale them up and down at any time, they can work 24/7 (including holidays) with no overtime pay and no breaks, they need no corporate campuses, office space, HR personnel or travel budgets, you don't have to worry about key employees going on sick/maternity leave or taking time off the moment they're needed most, they won't assault a coworker, sue for discrimination or secretly turn out to be a pedophile and tarnish the reputation of your company, they won't leak internal documents to the press or rage quit because of new company policies, they won't even stop working when a pandemic stops most of the world from running.

fsndz • 3 hours ago

I get the excitement, but folks, this is a model that excels only in things like software engineering/math. They basically used reinforcement learning to train the model to better remember which pattern to use to solve specific problems. This in no way generalises to open ended tasks in a way that makes human in the loop unnecessary. This basically makes assistants better (as soon as they figure out how to make it cheaper), but I wouldn't blindly trust the output of o3. Sam Altman is still wrong: https://www.lycee.ai/blog/why-sam-altman-is-wrong

robwwilliams • 2 hours ago

In your blog you say:

> deep learning doesn't allow models to generalize properly to out-of-distribution data—and that is precisely what we need to build artificial general intelligence.

I think even (or especially) people like Altman accept this as a fact. I do. Hassabis has been saying this for years.

The foundational models are just a foundation. Now start building the AGI superstructure.

And this is also where most of the still human intellectual energy is now.

girvo • 3 hours ago

Quite. And if it was right, those businesses deploying it and replacing humans need humans with jobs and money to pay for their products and services…

TheOtherHobbes • 4 hours ago

It's all fun and games until the infra crashes and you can't work out why, because a machine has written all of the code, no one understands how it works or what it's doing.

Or - worse - there is no accessible code anywhere, and you have to prompt your way out of "I'm sorry Dave, I can't do that," while nothing works.

And a human-free economy does... what? For whom? When 99% of the population is unemployed, what are the 1% doing while the planet's ecosystems collapse around them?

rockskon • 10 hours ago

AI has a different risk profile than humans. They are a lot more risky for business operations where failure is wholly unacceptable under any circumstance.

They're risky in that they fail in ways that aren't readily deterministic.

And would you trust your life to a self-driving car in New York City traffic?

MaxPock • 3 hours ago

It depends on what the risk is. Would it be whole or in part? In an organisation, a failure by an HR person might present an isolated departmental risk, while that might not be the case with an AI.

zelphirkalt • 3 hours ago

Deterministic they may be, but unforeseeable for humans.

ijidak • 8 hours ago

It is amazing to me that we have reached an era where we are debating the trade-off of hiring thinking machines!

I mean, this is an incredible moment from that standpoint.

Regarding the topic at hand, I think that there will always be room for humans for the reasons you listed.

But even replacing 5% of humans with AIs will have mind-boggling consequences.

I think you're right that there are jobs that humans will be preferred for for quite some time.

But, I'm already using AI with success where I would previously hire a human, and this is in this primitive stage.

With the leaps we are seeing, AI is coming for jobs.

Your concerns relate to exactly how many jobs.

And only time will tell.

But, I think some meaningful percentage of the population -- even if just 5% of humanity -- will be replaced by AI.

antihipocrat • 14 hours ago

AI brings similar risks - they can leak internal information, they can be tricked into performing prohibited tasks (with catastrophic effects if this is connected to core systems), they could be accused of actions that are discriminatory (biased training sets are very common).

Sure, if a business deploys it to perform tasks that are inherently low risk e.g. no client interface, no core system connection and low error impact, then the human performing these tasks is going to be replaced.

lucubratory • 12 hours ago

> secretly turn out to be a pedophile and tarnish the reputation of your company

This is interesting because it's both Oddly Specific and also something I have seen happen and I still feel really sorry for the company involved. Now that I think about it, I've actually seen it happen twice.

bboygravity • 4 hours ago

humans definitely don't need office space, but your point stands

AustinW • 3 hours ago

LLM office space is pretty expensive. Chillers, backup generators, raised floors, communications gear, …. They even demand multiple offices for redundancy, not to mention the new ask of a nuclear power plant to keep the lights on.

danielovichdk • 4 hours ago

Name one technology that has come with computers that hasn't resulted in more humans being put to work?

The rhetoric of not needing people doing work is cartoonish. I mean there is no sane explanation of how and why that would happen, without employing more people yet again, taking care of the advancements.

It's not like technology has brought less work-related stress; it has definitely increased it. Humans were not made for using technology at such a pace as it's being rolled out.

The world is fucked. Totally fucked.

mortehu • 3 hours ago

Self check-out stations, ATMs, and online brokerages. Recently chat support. Namely cases where millions of people used to interact with a representative every week, and now they don't.

agumonkey • 3 hours ago

Who in this field is anticipating the impact of near-AGI on society? Maybe I'm too anxious, but not planning for a potential workless life seems dangerous (but maybe I'm just not following the right groups).

zamadatix • 19 hours ago

I don't follow how 10 random humans can beat the average STEM college grad and average humans in that tweet. I suspect it's really "a panel of 10 randomly chosen experts in the space" or something?

I agree the most interesting thing to watch will be cost for a given score more than maximum possible score achieved (not that the latter won't be interesting by any means).

bcrosby95 • 18 hours ago

Two heads are better than one. Ten is way better, even if they aren't a field of experts. You're bound to get random people that remember random stuff from high school, college, work, and life in general, allowing them to piece together a solution.

generic92034 • 15 hours ago

Whether it works that way at all depends on the group dynamic. It is easily possible that a not so bright individual takes an (unofficial) leadership position in the group and overrides the input of smarter members. Think of any meetings with various hierarchy levels in a company.

dlkf • 13 hours ago

If you take a vote of 10 random people, then as long as their errors are not perfectly correlated, you’ll do better than asking one person.

https://en.m.wikipedia.org/wiki/Ensemble_learning

olalonde • 6 hours ago

Even if you assume that non STEM grads are dumb, isn't there a good probability of having a STEM graduate among 10 random humans?

hmottestad • 18 hours ago

Might be that within a group of 10 people, randomly chosen, when each person attempts to solve the tasks at least 99% of the time 1 person out of the 10 people will get it right.
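
A quick way to sanity-check both readings of "panel of 10" (at least one person gets it right vs. a majority vote, as in the ensemble comment above) is a small binomial calculation. This is only a sketch: it assumes each person solves a task independently with the same probability, and it uses 0.75 as the midpoint of the quoted 70-80% range. Real panels are neither independent nor restricted to right/wrong votes, and the function names are made up.

```python
from math import comb


def p_at_least_one_correct(p_single, panel_size=10):
    """Probability that at least one of `panel_size` independent people is right."""
    return 1 - (1 - p_single) ** panel_size


def p_majority_correct(p_single, panel_size=10):
    """Probability that a strict majority of independent people is right."""
    needed = panel_size // 2 + 1
    return sum(
        comb(panel_size, k) * p_single**k * (1 - p_single) ** (panel_size - k)
        for k in range(needed, panel_size + 1)
    )


P_SINGLE = 0.75  # assumed: midpoint of "average human off the street: 70-80%"
print(f"at least one of 10 correct: {p_at_least_one_correct(P_SINGLE):.6f}")
print(f"strict majority of 10 correct: {p_majority_correct(P_SINGLE):.4f}")
```

Under these assumptions the "at least one" reading lands well above 99%, and the majority vote already beats a single person.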

HDThoreaun • 14 hours ago

ARC-AGI is essentially an IQ test. There is no "expert in the space". It's just a question of whether you're able to spot the pattern.

shkkmo • 15 hours ago

It is fairly well documented that groups of people can show cognitive abilities that exceed that of any individual member. The classic example of this is if you ask a group of people to estimate the number of jellybeans in a jar, you can get a more accurate result than if you test to find the person with the highest accuracy and use their guess.

This isn't to say groups always outperform their members on all tasks, just that it isn't unusual to see a result like that.

xbmcuser • 15 hours ago

You are missing that the cost of electricity is also going to keep falling because of solar and batteries. This year in China my tablecloth math says it is $0.05 per kWh and, following the cost decline trajectory, it will be under $0.01 in 10 years.

barney54 • 13 hours ago

But the cost of electricity is not falling—it’s increasing. Wholesale prices have decreased, but retail rates are up. In the U.S. rates are up 27% over the past 4 years. In Europe prices are up too.

NoLinkToMe • 2 hours ago

That's a bit of a non-statement. Virtually all prices increase because of money supply, but we consider things to get cheaper if their prices grow less fast than inflation / income.

General inflation has outpaced the inflation of electricity prices by about 3x in the past 100 years. In other words, electricity has gotten cheaper over time in purchasing power terms.

And that's whilst our electricity usage has gone up by 10x in the last 100 years.

And this concerns retail prices, which includes distribution/transmission fees. These have gone up a lot as you get complications on the grid, some of which is built on a century old design. But wholesale prices (the cost of generating electricity without transmission/distribution) are getting dirt cheap, and for big AI datacentres I'm pretty sure they'll hook up to their own dedicated electricity generation at wholesale prices, off the grid, in the coming decades.

lucubratory • 12 hours ago

I am not certain because I've been very focused on the o3 news, but at least yesterday neither the US nor Europe were part of China.

lxgr • 10 hours ago

But data centers pay wholesale prices or even less (given that especially AI training and, to a lesser extent, inference clusters can load shed like few other consumers of electricity).

fulafel • 9 hours ago

And this is great news as long as marginal production (the most expensive to produce, first to turn on/off according to demand) of electricity is fossils.

patrickhogan1 • 14 hours ago

Bingo! Solar energy moves us toward a future where a household's energy needs become nearly cost-free.

Energy Need: The average home uses 30 kWh/day, requiring 6 kW of output over 5 peak sunlight hours.

Multijunction Panels: Lab efficiencies are already at 47% (2023), and with multiple years of progress, 60% efficiency is probable.

Efficiency Impact: At 60% efficiency, panels generate 600 W/m², requiring 10 m² (e.g., 2 m × 5 m) to meet energy needs.

This size can fit on most home roofs, be mounted on a pole with stacked layers, or even be hung through an apartment window.
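
The numbers above can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the 1000 W/m² peak irradiance is not in the comment but is implied by "60% efficiency gives 600 W/m²", and losses from inverters, panel angle, and storage are ignored.

```python
# Back-of-envelope rooftop sizing using the figures from the comment.
DAILY_NEED_KWH = 30            # quoted average daily household use
PEAK_SUN_HOURS = 5             # quoted peak sunlight hours per day
PEAK_IRRADIANCE_W_PER_M2 = 1000  # assumption: typical clear-sky peak value


def required_power_kw(daily_kwh=DAILY_NEED_KWH, sun_hours=PEAK_SUN_HOURS):
    """Average output needed during peak sun hours to cover the daily need."""
    return daily_kwh / sun_hours


def required_area_m2(efficiency):
    """Panel area needed at a given conversion efficiency (0..1)."""
    output_w_per_m2 = PEAK_IRRADIANCE_W_PER_M2 * efficiency
    return required_power_kw() * 1000 / output_w_per_m2


print(f"required power: {required_power_kw():.1f} kW")
for eff in (0.47, 0.60):
    print(f"efficiency {eff:.0%}: {required_area_m2(eff):.1f} m^2")
```

At the quoted 47% lab efficiency the same calculation gives a somewhat larger area than the 10 m² projected for 60%.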

sahmeepee • 4 hours ago

Average US home.

In Europe it is around 6-7 kWh/day. This might increase with electrification of heating and transport, but probably nothing like as much as the energy consumption they are replacing (due to greater efficiency of the devices consuming the energy and other factors like the quality of home insulation.)

In the rest of the world the average home uses significantly less.

necovek • 9 hours ago

If climate change ends up changing weather profiles and we start seeing many more cloudy days or dust/mist in the air, we'll need to push those solar panels higher (all the way to space?) or have many more of them and figure out transmission to the ground, and costs will very much balloon.

Not saying this will happen, but it's risky to rely on solar as the only long-term solution.

nateglims • 13 hours ago

Is it going to fall significantly for data centers? Industrial policy for consumer power is different from subsidizing it for data centers and if you own grid infrastructure why would you tank the price by putting up massive amounts of capital?

iandanforth • 18 hours ago

Let's say that Google is already 1 generation ahead of nvidia in terms of efficient AI compute. ($1700)

Then let's say that OpenAI brute forced this without any meta-optimization of the hypothesized search component (they just set a compute budget). This is probably low hanging fruit and another 2x in compute reduction. ($850)

Then let's say that OpenAI was pushing really really hard for the numbers and was willing to burn cash and so didn't bother with serious thought around hardware aware distributed inference. This could be more than a 2x decrease in cost, since we've seen better attention mechanisms deliver 10x reductions in cost, but let's go with 2x for now. ($425)

So I think we've got about an 8x reduction in cost sitting there once Google steps up. This is probably 4-6 months of work flat out if they haven't already started down this path, but with what they've got with deep research, maybe it's sooner?

Then if "all" we get is hardware improvements we're down to what 10-14 years?

qingcharles • 11 hours ago

Until 2022 most AI research was aimed at improving the quality of the output, not the quantity.

Since then there has been a tsunami of optimizations in the way training and inference is done. I don't think we've even begun to find all the ways that inference can be further optimized at both hardware and software levels.

Look at the huge models that you can happily run on an M3 Mac. The cost reduction in inference is going to vastly outpace Moore's law, even as chip design continues on its own path.

promptdaddy • 15 hours ago

*deep mind research ?

iandanforth • 14 hours ago

Nope, Gemini Advanced with Deep Research. New mode of operation that does more "thinking" and web searches to answer your question.

acchow • 8 hours ago

> ~doubling every 2-2.5 years) puts us at 20~25 years.

The trend for power consumption of compute (Megaflops per watt) has generally tracked with Koomey’s law for a doubling every 1.57 years

Then you also have model performance improving with compression. For example, Llama 3.1’s 8B outperforming the original Llama 65B

cchance • 19 hours ago

I mean, considering the big breakthrough this year for o1/o3 seems to have been "models having internal thoughts might help reasoning", to everyone outside of the AI field it was sort of a "duh" moment.

I'd hope we see more internal optimizations and improvements to the models. The idea that the big breakthrough was "don't spit out the first thought that pops into your head" seems obvious to everyone outside of the field, but guess what, it turns out it was a big improvement when the devs decided to add it.

versteegen • 16 hours ago

> seems obvious to everyone outside of the field

It's obvious to people inside the field too.

Honestly, these things seem to be less obvious to people outside the field. I've heard so many uninformed takes about LLMs not representing real progress towards intelligence (even here on HN of all places; I don't know why I torture myself reading them), that they're just dumb memorizers. No, they are an incredible breakthrough, because extending them with things like internal thoughts will so obviously lead to results such as o3, and far beyond. Maybe a few more people will start to understand the trajectory we're on.

dogma1138 • 16 hours ago

Reflection isn’t a new concept, but a) actually proving that it’s an effective tool for these types of models and b) finding an effective method for reflection that doesn’t just lock you into circular “thinking” were the hard parts and hence the “breakthrough”.

It’s very easy to say hey ofc it’s obvious but there is nothing obvious about it because you are anthropomorphizing these models and then using that bias after the fact as a proof of your conjecture.

This isn’t how real progress is achieved.

beardedwizard • 4 hours ago

Calling it reflection is, for me, further anthropomorphizing. However I am in violent agreement that a common feature of llm debate is centered around anthropomorphism leading to claims of "thinking longer" or "reflecting" when none of those things are happening.

The state of the art seems very focused on promoting that language that might encode reason is as good as actual reason, rather than asking what a reasoning model might look like.

bjornsing • 17 hours ago

> are we stuck waiting for the 20-25 years for GPU improvements

If this turns out to be hard to optimize / improve then there will be a huge economic incentive for efficient ASICs. No freaking way we’ll be running on GPUs for 20-25 years, or even 2.

coolspot • 16 hours ago

LLMs need efficient matrix multipliers. GPUs are specialized ASICs for massive matrix multiplication.

m3kw9 • 12 hours ago

Don’t forget that humans, which are real GI, paired with increasingly capable AI can create a feedback loop to accelerate new advances.

spencerchubb • 16 hours ago

> Super exciting that OpenAI pushed the compute out this far

it's even more exciting than that. the fact that you even can use more compute to get more intelligence is a breakthrough. if they spent even more on inference, would they get even better scores on arc agi?

lolinder • 3 hours ago

> the fact that you even can use more compute to get more intelligence is a breakthrough.

I'm not so sure—what they're doing by just throwing more tokens at it is similar to "solving" the traveling salesman problem by just throwing tons of compute into a breadth first search. Sure, you can get better and better answers the more compute you throw at it (with diminishing returns), but is that really that surprising to anyone who's been following tree of thought models?

All it really seems to tell us is that the type of model that OpenAI has available is capable of solving many of the types of problems that ARC-AGI-PUB has set up given enough compute time. It says nothing about "intelligence" as the concept exists in most people's heads—it just means that a certain very artificial (and intentionally easy for humans) class of problem that wasn't computable is now computable if you're willing to pay an enormous sum to do it. A breakthrough of sorts, sure, but not a surprising one given what we've seen already.

echelon • 15 hours ago

Maybe it's not linear spend.

empiko • 5 hours ago

I don't think this is only about efficiency. The model I have here is that this is similar to when we beat chess. Yes, it is impressive that we made progress on a class of problems, but is this class aligned with what the economy or the society needs?

Simple turn-based games such as chess turned out to be too far away from anything practical and chess-engine-like programs were never that useful. It is entirely possible that this will end up in a similar situation. ARC-like pattern matching problems or programming challenges are indeed a respectable challenge for AI, but do we need a program that is able to solve them? How often does something like that come up really? I can see some time-saving in using AI vs StackOverflow in solving some programming challenges, but is there more to this?

edanm • 3 hours ago

I mostly agree with your analysis, but just to drive home a point here - I don't think that algorithms to beat Chess were ever seriously considered as something that would be relevant outside of the context of Chess itself. And obviously, within the world of Chess, they are major breakthroughs.

In this case there is more reason to think these things are relevant outside of the direct context - these tests were specifically designed to see if AI can do general-thinking tasks. The benchmarks might be bad, but that's at least their purpose (unlike in Chess).

daxfohl • 9 hours ago

I wonder if we'll start seeing a shift in compute spend, moving away from training time, and toward inference time instead. As we get closer to AGI, we probably reach some limit in terms of how smart the thing can get just training on existing docs or data or whatever. At some point it knows everything it'll ever know, no matter how much training compute you throw at it.

To move beyond that, the thing has to start thinking for itself, some auto feedback loop, training itself on its own thoughts. Interestingly, this could plausibly be vastly more efficient than training on external data because it's a much tighter feedback loop and a smaller dataset. So it's possible that "nearly AGI" leads to ASI pretty quickly and efficiently.

Of course it's also possible that the feedback loop, while efficient as a computation process, isn't efficient as a learning / reasoning / learning-how-to-reason process, and the thing, while as intelligent as a human, still barely competes with a worm in true reasoning ability.

Interesting times.

freehorse • 14 hours ago

> I am interpreting this result as human level reasoning now costs (approximately) 41k/hr to 2.5M/hr with current compute.

On a very simple, toy task, which ARC-AGI basically is. ARC-AGI tests are not hard per se; LLMs just find them hard. We do not know how this scales for more complex, real-world tasks.

SamPatt • 13 hours ago

Right. Arc is meant to test the ability of a model to generalize. It's neat to see it succeed, but it's not yet a guarantee that it can generalize when given other tasks.

The other benchmarks are a good indication though.

criddell • 10 hours ago

Does it mean anything for more general tasks like driving a car?

madduci • 9 hours ago

Let's see when this will be released to the free tier. Looks promising, although I hope they will also be able to publish more details on this, as part of the "open" in their name

riku_iki • 20 hours ago

> ~=$3400 per single task

report says it is $17 per task, and $6k for whole dataset of 400 tasks.

binarymax • 20 hours ago

"Note: OpenAI has requested that we not publish the high-compute costs. The amount of compute was roughly 172x the low-compute configuration."

The low compute was $17 per task. Speculate 172*$17 for the high compute is $2,924 per task, so I am also confused on the $3400 number.

bluecoconut • 20 hours ago

3400 came from counting pixels on the plot.

Also, it's $20 per task for o3-low in the table for the semi-private set, which x172 is 3440, also coming in close to the 3400 number.
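
For reference, the two extrapolations discussed in this subthread can be written out as below. The only inputs are the 172x compute multiplier from the blog post and the $17 and $20 low-compute per-task figures quoted above; treating cost as proportional to compute is an assumption, and the names are made up.

```python
# Extrapolating the unpublished high-compute cost from the low-compute cost.
COMPUTE_MULTIPLIER = 172  # "roughly 172x the low-compute configuration"

# Low-compute cost per task in USD, as quoted in the comments above.
LOW_COMPUTE_COST_PER_TASK = {
    "public eval ($17/task)": 17,
    "semi-private eval ($20/task)": 20,
}


def estimated_high_compute_cost(low_cost_usd, multiplier=COMPUTE_MULTIPLIER):
    """Assumes cost scales linearly with compute, which is not confirmed."""
    return low_cost_usd * multiplier


for label, cost in LOW_COMPUTE_COST_PER_TASK.items():
    print(f"{label}: ~${estimated_high_compute_cost(cost):,} per task at high compute")
```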

bluecoconut • 20 hours ago

That's the low-compute mode. In the plot at the top where they score 88%, O3 High (tuned) is ~3.4k

HDThoreaun • 14 hours ago

The low compute one did as well as the average person though

ionwake • 18 hours ago

Sorry to be a noob, but can someone tell me, does this mean o3 will be unaffordable for a typical user? Will only companies with thousands to spend per query be able to use this?

Sorry for being thick, I'm just confused how they can turn this into an affordable service.

xrendan • 20 hours ago

You're misreading it, there's two different runs, a low and a high compute run.

The number for the high-compute one is ~172x the first one according to the article so ~=$2900

jhrmnn • 20 hours ago

That’s for the low-compute configuration that doesn’t reach human-level performance (not far though)

riku_iki • 20 hours ago

I was referring to the high compute mode. They have a table with the breakdown here: https://arcprize.org/blog/oai-o3-pub-breakthrough

junipertea • 20 hours ago

The table row with 6k figure refers to high efficiency, not high compute mode. From the blog post:

Note: OpenAI has requested that we not publish the high-compute costs. The amount of compute was roughly 172x the low-compute configuration.

gbnwl • 20 hours ago

That's "efficiency" high, which actually means less compute. The 87.5% score using low efficiency (more compute) doesn't have cost listed.

bluecoconut • 20 hours ago

they use some poor language.

"High Efficiency" is O3 Low; "Low Efficiency" is O3 High.

They left the "Low efficiency" (O3 High) values as `-` but you can infer them from the plot at the top.

Note the $20 and $17 per task aligns with the X-axis of the O3-low

EVa5I7bHFq9mnYK • 20 hours ago

That's high EFFICIENCY. High efficiency = low compute.

croemer • 21 hours ago

The programming task they gave o3-mini high (creating a Python server that allows chatting with the OpenAI API and running some code in a terminal) didn't seem very hard? Strange choice of example for something that's claimed to be a big step forwards.

YT timestamped link: https://www.youtube.com/watch?v=SKBG1sqdyIU&t=768s (thanks for the fixed link @photonboom)

Updated: I gave the task to Claude 3.5 Sonnet and it worked first shot: https://claude.site/artifacts/36cecd49-0e0b-4a8c-befa-faa5aa...

bearjaws • 21 hours ago

It's good that it works since if you ask GPT-4o to use the openai sdk it will often produce invalid and out of date code.

HeatrayEnjoyer • 15 hours ago

Sonnet isn't a "mini" sized model. Try it with Haiku.

croemer • 15 hours ago

How mini is o3-mini compared to Sonnet and why does it matter whether it's mini or not? Isn't the point of the demo to show what's now possible that wasn't before?

4o is cheaper than o1 mini so mini doesn't mean much for costs.

phil917 • 20 hours ago

Yeah I agree that wasn't particularly mind blowing to me and seems fairly in line with what existing SOTA models can do. Especially since they did it in steps. Maybe I'm missing something.

MyFirstSass • 17 hours ago

What? Is this what this is? Either this is a complete joke or we're missing something.

I've been doing similar stuff in Claude for months and it's not that impressive when you see how limited they really are when going non boilerplate.

photonboom • 21 hours ago

heres the right timestamp: https://www.youtube.com/watch?v=SKBG1sqdyIU&t=768s

m3kw9 • 21 hours ago

I would say they didn’t need to demo anything, because if you are gonna use the output code live on a demo it may make compile errors and then look stupid trying to fix it live

croemer • 19 hours ago

If it was a safe bet problem, then they should have said that. To me it looks like they faked excitement for something not exciting which lowers credibility of the whole presentation.

sunaookami • 18 hours ago

They actually did that the last time when they showed the apps integration. First try in Xcode didn't work.

m3kw9 • 17 hours ago

Yeah I think that time it was ok because they were demoing the app function, but for this they are demoing the model smarts

csomar • 15 hours ago

Models are predictable at temperature 0. They might have tested the output beforehand.

fzzzy • 15 hours ago

Models in practice haven't been deterministic at 0 temperature, although nobody knows exactly why. Either hardware or software bugs.

Jensson • 15 hours ago

We know exactly why, it is because floating point operations aren't associative but the GPU scheduler assumes they are, and the scheduler isn't deterministic. Running the model strictly hurts performance so they don't do that.
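
The non-associativity part is easy to see on a CPU with plain Python floats. This sketch only demonstrates that regrouping the same additions changes the result; it does not model a GPU scheduler, and `math.fsum` is shown only as an example of an order-independent (but slower) summation.

```python
import math

# The same three numbers, added with two different groupings.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print(left_to_right == right_to_left)   # the two groupings disagree
print(repr(left_to_right), repr(right_to_left))

# An exactly rounded sum does not depend on the order of its inputs.
values = [0.1, 0.2, 0.3]
print(math.fsum(values) == math.fsum(reversed(values)))
```

When a reduction over thousands of values is split across threads in a run-dependent order, differences like this in the last bits can occasionally flip which token has the highest score.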

obblekk • 21 hours ago

Human performance is 85% [1]. o3 high gets 87.5%.

This means we have an algorithm to get to human level performance on this task.

If you think this task is an eval of general reasoning ability, we have an algorithm for that now.

There's a lot of work ahead to generalize o3 performance to all domains. I think this explains why many researchers feel AGI is within reach, now that we have an algorithm that works.

Congrats to both Francois Chollet for developing this compelling eval, and to the researchers who saturated it!

[1] https://x.com/SmokeAwayyy/status/1870171624403808366, https://arxiv.org/html/2409.01374v1

phillipcarter • 21 hours ago

As excited as I am by this, I still feel like this is still just a small approximation of a small chunk of human reasoning ability at large. o3 (and whatever comes next) feels to me like it will head down the path of being a reasoning coprocessor for various tasks.

But, still, this is incredibly impressive.

qt31415926 • 20 hours ago

Which parts of reasoning do you think are missing? I do feel like it covers a lot of 'reasoning' ground despite its on-the-surface simplicity.

john_minsk • 11 hours ago

My personal 5 cents is that reasoning will be there when LLM gives you some kind of outcome and then when questioned about it can explain every bit of result it produced.

For example, if we asked an LLM to produce an image of a "human woman photorealistic" it produces result. After that you should be able to ask it "tell me about its background" and it should be able to explain "Since user didn't specify background in the query I randomly decided to draw her standing in front of a fantasy background of Amsterdam iconic houses. Usually Amsterdam houses are 3 stories tall, attached to each other and 10 meters wide. Amsterdam houses usually have cranes on the top floor, which help to bring goods to the top floor since doors are too narrow for any object wider than 1m. The woman stands in front of the houses approximately 25 meters in front of them. She is 1,59m tall, which gives us correct perspective. It is 11:16am of August 22nd which I used to calculate correct position of the sun and align all shadows according to projected lighting conditions. The color of her skin is set at RGB:xxxxxx randomly" etc.

And it is not too much to ask LLMs for it. LLMs have access to all the information above as they read all the internet. So there is definitely a description of Amsterdam architecture, what a human body looks like or how to correctly estimate time of day based on shadows (and vise versa). The only thing missing is logic that connects all this information and which is applied correctly to generate final image.

I like to think about LLMs as fancy, genius compression engines. They took all the information on the internet, compressed it and are able to cleverly query this information for the end user. It is a tremendously valuable thing, but whether intelligence emerges out of it, I'm not sure. Digital information doesn't necessarily contain everything needed to understand how it was generated and why.

concordDance • 6 hours ago

Xmd5a • 9 hours ago

LLMs are still bound to a prompting session. They can't form long term memories, can't ponder on it and can't develop experience. They have no cognitive architecture.

'Agents' (i.e. workflows intermingling code and calls to LLMs) are still a thing (as shown by the fact there is a post by Anthropic on this subject on the front page right now) and they are very hard to build.

One consequence of that, for instance: it's not possible to have an LLM exhaustively explore a topic.

mjhagen • 2 hours ago

LLMs don’t, but who said AGI should come from LLMs alone? When I ask ChatGPT about something “we” worked on months ago, it “remembers” and can continue the conversation with that history in mind.

I’d say humans are also bound to prompting sessions in that way.

tim333 • 2 hours ago

Current AI is good at text but not very good at 3d physical stuff like fixing your plumbing.

phillipcarter • 18 hours ago

I think it's hard to enumerate the unknown, but I'd personally love to see how models like this perform on things like word problems where you introduce red herrings. Right now, LLMs at large tend to struggle mightily to understand when some of the given information is not only irrelevant, but may explicitly serve to distract from the real problem.

KaoruAoiShiho • 17 hours ago

o1 already fixed the red herrings...

zmgsabst • 11 hours ago

That’s not inability to reason though, that’s having a social context.

Humans also don’t tend to operate in a rigorously logical mode and understand that math word problems are an exception where the language may be adversarial: they’re trained for that special context in school. If you tell the LLM that social context, eg that language may be deceptive, their “mistakes” disappear.

What you’re actually measuring is the LLM defaults to assuming you misspoke trying to include relevant information rather than that you were trying to trick it — which is the social context you’d expect when trained on general chat interactions.

Establishing context in psychology is hard.

amelius • 2 hours ago

Does it include the use of tools to accomplish a task?

Does it include the invention of tools?

Agentus • 13 hours ago

kinda interesting, every single CS person (especially phds), when talking about reasoning, is unable to concisely quantify, enumerate, qualify, or define reasoning.

people with (high) intelligence talking and building (artificial) intelligence but never able to convincingly explain aspects of intelligence. just often talk ambiguously and circularly around it.

what are we humans getting ourselves into inventing skynet :wink.

it's been an ongoing pet project to tackle reasoning, but i can't answer your question with regards to llms.

YeGoblynQueenne • 12 hours ago

azeirah • 16 hours ago

I'd like to see this o3 thing play 5d chess with multiverse time travel or baba is you.

The only effect smarter models will have is that intelligent people will have to use less of their brain to do their work. As has always been the case, the medium is the message, and climate change is one of the most difficult and worst problems of our time.

If this gets software people to quit en-masse and start working in energy, biology, ecology and preservation? Then it has succeeded.

concordDance • 6 hours ago

> climate change is one of the most difficult and worst problems of our time.

Slightly surprised to see this view here.

I can think of half a dozen more serious problems off hand (e.g. population aging, institutional scar tissue, dysgenics, nuclear proliferation, pandemic risks, AI itself) along most axes I can think of (raw $ cost, QALYs, even X-risk).

cryptoegorophy • 21 hours ago

What’s interesting is it might be much closer to human intelligence than to some “alien” intelligence, because after all it is an LLM trained on human-made text, which kind of represents human intelligence.

hammock • 21 hours ago

In that vein, perhaps the delta between o3 @ 87.5% and Human @ 85% represents a deficit in the ability of text to communicate human reasoning.

In other words, it's possible humans can reason better than o3, but cannot articulate that reasoning as well through text - only in our heads, or through some alternative medium.

unsupp0rted • 18 hours ago

It's possible humans reason better through text than not through text, so these models, having been trained on text, should be able to out-reason any person who's not currently sitting down to write.

85392_school • 21 hours ago

I wonder how much of an effect amount of time to answer has on human performance.

yunwal • 20 hours ago

Yeah, this is sort of meaningless without some idea of cost or consequences of a wrong answer. One of the nice things about working with a competent human is being able to tell them "all of our jobs are on the line" and knowing with certainty that they'll come to a good answer.

hamburga • 13 hours ago

Agreed. I think what really makes them alien is everything else about them besides intelligence. Namely, no emotional/physiological grounding in empathy, shame, pride, and love (on the positive side) or hatred (negative side).

lastdong • 10 hours ago

Curious about how many tests were performed. Did it consistently manage to successfully solve many of these types of problems?

6gvONxR4sf7o • 20 hours ago

Human performance is much closer to 100% on this, depending on your human. It's easy to miss the dot in the corner of the headline graph in TFA that says "STEM grad."

tim333 • 2 hours ago

A fair comparison might be average human. The average human isn't a STEM grad. It seems STEM grad approximately equals an IQ of 130. https://www.accommodationforstudents.com/student-blog/the-su...

From a post elsewhere the scores on ARC-AGI-PUB are approx average human 64%, o3 87%. https://news.ycombinator.com/item?id=42474659

Though also elsewhere, o3 seems very expensive to operate. You could probably hire a PhD researcher for cheaper.

scotty79 • 21 hours ago

Still it's comparing average human level performance with best AI performance. Examples of things o3 failed at are insanely easy for humans.

cchance • 19 hours ago

You'd be surprised what the AVERAGE human fails to do that you think is easy, my mom can't fucking send an email without downloading a virus, i have a coworker that believes beyond a shadow of a doubt the world is flat.

The Average human is a lot dumber than people on hackernews and reddit seem to realize, shit the people on mturk are likely smarter than the AVERAGE person

mirkodrummer • 13 hours ago

Not being able to send an email, or believing the world is flat, is not a measure of intelligence; I’d rather say it’s more about culture or how much schooling someone has had. Your mom or coworker can still do stuff instinctively that outperforms every algorithm out there, and how we do it is still unexplained. We still have no idea what intelligence is.

staticman2 • 18 hours ago

Yet the average human can drive a car a lot better than ChatGPT can, which shows that the way you frame "intelligence" dictates your conclusion about who is "intelligent".

p1esk • 18 hours ago

tracerbulletx • 18 hours ago

If you take an electrical sensory input signal sequence, and transform it to an electrical muscle output signal sequence, you've got a brain. ChatGPT isn't going to drive a car because it's trained on verbal tokens, and it's not optimized for the type of latency you need for physical interaction.

And the brain doesn't use the same network to do verbal reasoning as real time coordination either.

But that work is moving along fine. All of these models and lessons are going to be combined into AGI. It is happening. There isn't really that much in the way.

FrustratedMonky • 20 hours ago

There are things Chimps do easily that humans fail at, and vice versa of course.

There are blind spots, doesn't take away from 'general'.

Matumio • 2 hours ago

We can't agree whether Portia spiders are intelligent or just have very advanced instincts. How will we ever agree about what human intelligence is, or how to separate it from cultural knowledge? If that even makes sense.

FrustratedMonky • 1 hour ago

I guess my point is more, if we can't decide about Portia Spiders or Chimps, then how can we be so certain about AI. So offering up Portia and Chimps as counter examples.

noobermin • 3 hours ago

The downvotes should tell you, this is a decided "hype" result. Don't poo poo it, that's not allowed on AI slop posts on HN.

FrustratedMonky • 2 hours ago

Yeah, I didn't realize Chimp studies, or neuroscience were out of vogue. Even in tech, people form strong 'beliefs' around what they think is happening.

antirez • 20 hours ago

NNs are not algorithms.

benlivengood • 20 hours ago

Deterministic (ieee 754 floats), terminates on all inputs, correctness (produces loss < X on N training/test inputs)

At most you can argue that there isn't a useful bounded loss on every possible input, but it turns out that humans don't achieve useful bounded loss on identifying arbitrary sets of pixels as a cat or whatever, either. Most problems NNs are aimed at are qualitative or probabilistic where provable bounds are less useful than Nth-percentile performance on real-world data.

notfish • 20 hours ago

An algorithm is “a process or set of rules to be followed in calculations or other problem-solving operations, especially by a computer”

How does a giant pile of linear algebra not meet that definition?

antirez • 20 hours ago

It's not made of "steps", it's an almost continuous function of its inputs. And a function is not an algorithm: it is not an object made of conditions, jumps, terminations, ... Obviously it has computation capabilities and is Turing-complete, but it is the opposite of an algorithm.

janalsncm • 17 hours ago

raegis • 18 hours ago

necovek • 8 hours ago

I would say you are right that function is not an algorithm, but it is an implementation of an algorithm.

Is that your point?

If so, I've long learned to accept imprecise language as long as the message can be reasonably extracted from it.

mvkel • 12 hours ago

necovek • 8 hours ago

NN is a very wide term applied in different contexts.

When a NN is trained, it produces a set of parameters that basically define an algorithm to do inference with: it's a very big one though.
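To make the "parameters define an algorithm" reading concrete, here is a minimal sketch of inference for a tiny 2-2-1 network. The weights are made-up values standing in for "trained" parameters, and `infer` is a hypothetical name; nothing here is taken from any real model. It only illustrates that a forward pass is a fixed, finite, deterministic sequence of steps.

```python
import math

# Made-up "trained" parameters for a tiny 2-2-1 network (assumption, for illustration).
W1 = [[0.5, -1.0], [1.5, 0.25]]   # hidden-layer weights, one row per hidden unit
B1 = [0.0, -0.5]                  # hidden-layer biases
W2 = [1.0, -2.0]                  # output-layer weights
B2 = 0.1                          # output-layer bias


def relu(x):
    """Rectified linear unit: note that it is itself a conditional step."""
    return x if x > 0.0 else 0.0


def infer(inputs):
    """Fixed sequence of steps: affine -> ReLU -> affine -> sigmoid."""
    hidden = []
    for weights, bias in zip(W1, B1):
        total = bias
        for w, x in zip(weights, inputs):
            total += w * x
        hidden.append(relu(total))
    logit = B2
    for w, h in zip(W2, hidden):
        logit += w * h
    return 1.0 / (1.0 + math.exp(-logit))


score = infer([1.0, 2.0])
```

The loops always run a number of iterations fixed by the parameter shapes, so the procedure terminates on every input and returns the same value for the same input, which are the properties listed earlier in the thread.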

We also call that a NN (the joy of natural language).

drdeca • 18 hours ago

How do you define "algorithm"? I suspect it is a definition I would find somewhat unusual. Not to say that I strictly disagree, but only because to my mind "neural net" suggests something a bit more concrete than "algorithm", so I might instead say that an artificial neural net is an implementation of an algorithm, rather than an algorithm itself, or something like that.

But, to my mind, something of the form "Train a neural network with an architecture generally like [blah], with a training method+data like [bleh], and save the result. Then, when inputs are received, run them through the NN in such-and-such way." would constitute an algorithm.

KeplerBoy • 19 hours ago

Running inference on a model certainly is an algorithm.

hypoxia • 19 hours ago

It actually beats the human average by a wide margin:

- 64.2% for humans vs. 82.8%+ for o3.

...

Private Eval:

- 85%: threshold for winning the prize [1]

Semi-Private Eval:

- 87.5%: o3 (unlimited compute) [2]

- 75.7%: o3 (limited compute) [2]

Public Eval:

- 91.5%: o3 (unlimited compute) [2]

- 82.8%: o3 (limited compute) [2]

- 64.2%: human average (Mechanical Turk) [1] [3]

Public Training:

- 76.2%: human average (Mechanical Turk) [1] [3]

...

References:

[1] https://arcprize.org/guide

[2] https://arcprize.org/blog/oai-o3-pub-breakthrough

[3] https://arxiv.org/abs/2409.01374

usaar333 • 19 hours ago

Super human isn't beating rando mech turk.

Their post has stem grad at nearly 100%

tripletao • 17 hours ago

This is correct. It's easy to get arbitrarily bad results on Mechanical Turk, since without any quality control people will just click as fast as they can to get paid (or bot it and get paid even faster).

So in practice, there's always some kind of quality control. Stricter quality control will improve your results, and the right amount of quality control is subjective. This makes any assessment of human quality meaningless without explanation of how those humans were selected and incentivized. Chollet is careful to provide that, but many posters here are not.

In any case, the ensemble of task-specific, low-compute Kaggle solutions is reportedly also super-Turk, at 81%. I don't think anyone would call that AGI, since it's not general; but if the "(tuned)" in the figure means o3 was tuned specifically for these tasks, that's not obviously general either.

ALittleLight • 21 hours ago

It's not saturated. 85% is average human performance, not "best human" performance. There is still room for the model to go up to 100% on this eval.

dmead • 5 hours ago

This is so strange. People think that an LLM trained on programming questions and docs being able to do mundane tasks like this means it's intelligent? Come on.

It really calls into question two things.

1. You don't know what you're talking about.

2. You have a perverse incentive to believe this such that you will preach it to others and elevate some job salary range or stock.

Either way, not a good look.

javaunsafe2019 • 2 hours ago

This

dyauspitr • 15 hours ago

I’ll believe it when the AI can earn money on its own. I obviously don’t mean someone paying a subscription to use the AI. I mean letting the AI loose on the Internet with only the goal of making money and putting it into a bank account.

hamburga • 13 hours ago

Do trading bots count?

1659447091 • 8 hours ago

No, the AI would have to start from zero and reason its way to making itself money online, like the humans who were first in their online field of interest (e-commerce, scams, ads etc. from the 80's and 90's) when there was no guidance, only general human intelligence that could reason its way into money-making opportunities and reason its way into making them work.

concordDance • 6 hours ago

I don't think humans ever do that. They research/read and ask other humans.

nopinsight • 20 hours ago

Let me go against some skeptics and explain why I think full o3 is pretty much AGI or at least embodies most essential aspects of AGI.

What has been lacking so far in frontier LLMs is the ability to reliably deal with the right level of abstraction for a given problem. Reasoning is useful but often comes out lacking if one cannot reason at the right level of abstraction. (Note that many humans can't either when they deal with unfamiliar domains, although that is not the case with these models.)

ARC has been challenging precisely because solving its problems often requires:

   1) using multiple different *kinds* of core knowledge [1], such as symmetry, counting, color, AND

   2) using the right level(s) of abstraction

Achieving human-level performance in the ARC benchmark, as well as top human performance in GPQA, Codeforces, AIME, and Frontier Math suggests the model can potentially solve any problem at the human level if it possesses essential knowledge about it. Yes, this includes out-of-distribution problems that most humans can solve.

It might not yet be able to generate highly novel theories, frameworks, or artifacts to the degree that Einstein, Grothendieck, or van Gogh could. But not many humans can either.

[1] https://www.harvardlds.org/wp-content/uploads/2017/01/Spelke...

ADDED:

Thanks to the link to Chollet's posts by lswainemoore below. I've analyzed some easy problems that o3 failed at. They involve spatial intelligence, including connection and movement. This skill is very hard to learn from textual and still image data.

I believe this sort of core knowledge is learnable through movement and interaction data in a simulated world and it will not present a very difficult barrier to cross. (OpenAI purchased a company behind a Minecraft clone a while ago. I've wondered if this is the purpose.)

phil917 • 20 hours ago

Quote from the creators of the ARC-AGI benchmark: "Passing ARC-AGI does not equate achieving AGI, and, as a matter of fact, I don't think o3 is AGI yet. o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence."

qnleigh • 6 hours ago

I like the notion, implied in the article, that AGI won't be verified by any single benchmark, but by our collective inability to come up with benchmarks that defeat some eventual AI system. This matches the cat-and-mouse game we've been seeing for a while, where benchmarks have to constantly adapt to better models.

I guess you can say the same thing for the Turing Test. Simple chat bots beat it ages ago in specific settings, but the bar is much higher now that the average person is familiar with their limitations.

If/once we have an AGI, it will probably take weeks to months to really convince ourselves that it is one.

CooCooCaCha • 20 hours ago

Yeah the real goalpost is reliable intelligence. A supposed phd level AI failing simple problems is a red flag that we’re still missing something.

gremlinsinc • 19 hours ago

You've never met a Doctor who couldn't figure out how to work their email? Or use street smarts? You can have a PHD but be unable to reliably handle soft skills, or any number of things you might 'expect' someone to be able to do.

Just playing devils' advocate or nitpicking the language a bit...

manquer • 6 hours ago

Doctors[1] or, say, pilots are skilled professions that are difficult to master and deserve respect, yes, but they do not need high levels of intelligence to be good at. They require many other skills that are hard, like making decisions under pressure or good motor skills, but not necessarily intelligence.

Also, not knowing something is hardly a criterion; skilled humans focus on their areas of interest above most other knowledge and can be unaware of other subjects.

Fields Medal winners, for example, may not be aware of most pop culture things. That doesn’t make them unable to learn them, just not interested.

—-

[1] most doctors, including surgeons and many respected specialists; some doctors, however, do need those skills, but they are a specialized few and generally do know how to use email

nuancebydefault • 18 hours ago

A coworker of mine has a phd in physics. Showing the difference to him between little and big endian in a hex editor, showing file sizes of raw image files and how to compute it... I explained 3 times and maybe he understood part of it now.

CooCooCaCha • 19 hours ago

An important distinction here is you’re comparing skill across very different tasks.

I’m not even going that far, I’m talking about performance on similar tasks. Something many people have noticed about modern AI is it can go from genius to baby-level performance seemingly at random.

Take self driving cars for example, a reasonably intelligent human of sound mind and body would never accidentally mistake a concrete pillar for a road. Yet that happens with self-driving cars, and seemingly here with ARC-AGI problems which all have a similar flavor.

nopinsight • 20 hours ago

I'd need to see what kinds of easy tasks those are and would be happy to revise my hypothesis if that's warranted.

Also, it depends a great deal on what we define as AGI and whether they need to be a strict superset of typical human intelligence. o3's intelligence is probably superhuman in some aspects but inferior in others. We can find many humans who exhibit such tendencies as well. We'd probably say they think differently but would still call them generally intelligent.

lswainemoore • 20 hours ago

They're in the original post. Also here: https://x.com/fchollet/status/1870172872641261979 / https://x.com/fchollet/status/1870173137234727219

Personally, I think it's fair to call them "very easy". If a person I otherwise thought was intelligent was unable to solve these, I'd be quite surprised.

nopinsight • 19 hours ago

PoignardAzur • 4 hours ago

> This skill is very hard to learn from textual and still image data.

I had the same take at first, but thinking about it again, I'm not quite sure?

Take the "blue dots make a cross" example (the second one). The input only has four blue dots, which makes it very easy to see a pattern even in text data: two of them have the same x coordinate, two of them have the same y (or the same first-tuple-element and second-tuple-element if you want to taboo any spatial concepts).

Then if you look into the output, you can notice that all the input coordinates are also in the output set, just not always with the same color. If you separate them into "input-and-output" and "output-only", you quickly notice that all of the output-only squares are blue and share a coordinate (tuple-element) with the blue inputs. If you split the "input-and-output" set into "same color" and "color changed", you can notice that the changes only go from red to blue, and that the coordinates that changed are clustered, and at least one element of the cluster shares a coordinate with a blue input.
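The chain of observations above can be sketched with plain set operations on coordinate tuples. This is a minimal sketch on a made-up toy grid, not the actual ARC task: `grid_in`, `grid_out`, and `shares_coordinate` are hypothetical names, and the cell layout is an assumption chosen only so that each step has something to find.

```python
# Hypothetical toy puzzle, NOT the real ARC task: {(x, y): color}.
grid_in = {
    (0, 2): "blue", (4, 2): "blue",   # two blue dots sharing y == 2
    (2, 0): "blue", (2, 4): "blue",   # two blue dots sharing x == 2
    (2, 2): "red", (3, 3): "red",
}
grid_out = {
    (0, 2): "blue", (1, 2): "blue", (2, 2): "blue", (3, 2): "blue", (4, 2): "blue",
    (2, 0): "blue", (2, 1): "blue", (2, 3): "blue", (2, 4): "blue",
    (3, 3): "red",
}


def shares_coordinate(cell, others):
    """True if cell has the same first or second tuple element as any cell in others."""
    return any(cell[0] == o[0] or cell[1] == o[1] for o in others)


# Step 1: every input coordinate also appears in the output.
all_inputs_kept = grid_in.keys() <= grid_out.keys()

# Step 2: split output cells into "input-and-output" and "output-only".
both = grid_in.keys() & grid_out.keys()
output_only = grid_out.keys() - grid_in.keys()

# Step 3: output-only cells are all blue and line up with a blue input.
blue_inputs = [c for c, color in grid_in.items() if color == "blue"]
output_only_all_blue = all(grid_out[c] == "blue" for c in output_only)
output_only_aligned = all(shares_coordinate(c, blue_inputs) for c in output_only)

# Step 4: among shared cells, find which changed color and in which direction.
changed = {c for c in both if grid_in[c] != grid_out[c]}
transitions = {(grid_in[c], grid_out[c]) for c in changed}
```

None of these steps uses anything beyond tuple-element equality and set membership, which is the sense in which the pattern is visible in text data alone.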

Of course, it's easy to build this chain of reasoning in retrospect, but it doesn't seem like a complete stretch: each step only requires noticing patterns in the data, and it's how a reasonably puzzle-savvy person might solve this if you didn't let them draw the squares on paper. There are a lot of escape games with chains of reasoning much more complex, and random office workers solve them all the time.

The visual aspect makes the patterns jump to us more, but the fact that o3 couldn't find them at all with thousands of dollars of compute budget still seems meaningful to me.

EDIT: Actually, looking at Twitter discussions[1], o3 did find those patterns, but was stumped by ambiguity in the test input that the examples didn't cover. Its failures on the "cascading rectangles" example[2] look much more interesting.

[1]: https://x.com/bio_bootloader/status/1870339297594786064

[2]: https://x.com/_AI30_/status/1870407853871419806

lswainemoore • 19 hours ago

> I believe this sort of core knowledge is learnable through movement and interaction data in a simulated world and it will not present a very difficult barrier to cross.

Maybe! I suppose time will tell. That said, spatial intelligence (connection/movement included) is the whole game in this evaluation set. I think it's revealing that they can't handle these particular examples, and problematic for claims of AGI.

93po • 20 hours ago

they say it isn't AGI but i think the way o3 functions can be refined to AGI - it's learning to solve new, novel problems. we just need to make it do that more consistently, which seems achievable

dimitri-vs • 15 hours ago

Have we really watered down the definition of AGI that much?

LLMs aren't really capable of "learning" anything outside their training data. Which I feel is a very basic and fundamental capability of humans.

Every new request thread is a blank slate utilizing whatever context you provide for the specific task, and after the thread is done (or the context limit runs out) it's like it never happened. Sure you can use databases, do web queries, etc. but these are inflexible bandaid solutions, far from what's needed for AGI.

theptip • 15 hours ago

> LLMs aren't really capable of "learning" anything outside their training data.

ChatGPT has had for some time the feature of storing memories about its conversations with users. And you can use function calling to make this more generic.
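As a rough illustration of what "function calling to make this more generic" could look like, here is a minimal sketch of a memory scaffold. Everything in it is an assumption: the `save_memory` / `recall_memory` tools, the `dispatch` helper, and the `{"name": ..., "arguments": {...}}` call shape are hypothetical, no real model API is called, and the tool calls a model would emit are simulated as JSON strings.

```python
import json

# Hypothetical persistent store; a real system would use a database or file.
MEMORY = {}


def save_memory(key, value):
    """Tool the model could call to persist a fact across sessions."""
    MEMORY[key] = value
    return "saved"


def recall_memory(key):
    """Tool the model could call to fetch a previously stored fact."""
    return MEMORY.get(key, "")


TOOLS = {"save_memory": save_memory, "recall_memory": recall_memory}


def dispatch(tool_call_json):
    """Run a tool call of the assumed form {"name": ..., "arguments": {...}}."""
    call = json.loads(tool_call_json)
    return TOOLS[call["name"]](**call["arguments"])


# Session 1: the (simulated) model asks the scaffold to store a fact.
dispatch('{"name": "save_memory", "arguments": {"key": "project", "value": "ARC solver in Rust"}}')

# Session 2: a fresh context window can ask for it back.
recalled = dispatch('{"name": "recall_memory", "arguments": {"key": "project"}}')
```

Whether this kind of lookup counts as "learning" is exactly what the thread is arguing about; the sketch only shows where the boundary between model and scaffolding sits.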

I think drawing the boundary at “model + scaffolding” is more interesting.

dimitri-vs • 11 hours ago

Calling the sentence or two it arbitrarily saves when you stated your preferences and profile info "memories" is a stretch.

True equivalent to human memories would require something like a multimodal trillion token context window.

RAG is just not going to cut it, and if anything will exacerbate problems with hallucinations.

bubblyworld • 10 hours ago

That's true for vanilla LLMs, but also keep in mind that there are no details about o3's architecture at the moment. Clearly they are doing something different given the huge performance jump on a lot of benchmarks, and it may well involve in-context learning.

catmanjan • 5 hours ago

Given every other iteration has basically just been the same thing but bigger, why should we think this?

timabdulla • 20 hours ago

What's your explanation for why it can only get ~70% on SWE-bench Verified?

I believe about 90% of the tasks were estimated by humans to take less than one hour to solve, so we aren't talking about very complex problems, and to boot, the contamination factor is huge: o3 (or any big model) will have in-depth knowledge of the internals of these projects, and often even know about the individual issues themselves (e.g. you can say what was Github issue #4145 in project foo, and there's a decent chance it can tell you exactly what the issue was about!)

slewis • 20 hours ago

I've spent tons of time evaluating o1-preview on SWEBench-Verified.

For one, I speculate OpenAI is using a very basic agent harness to get the results they've published on SWEBench. I believe there is a fair amount of headroom to improve results above what they published, using the same models.

For two, some of the instances, even in SWEBench-Verified, require a bit of "going above and beyond" to get right. One example is an instance where the user states that a TypeError isn't properly handled. The developer who fixed it handled the TypeError but also handled a ValueError, and the golden test checks for both. I don't know how many instances fall in this category, but I suspect it's more than on a simpler benchmark like MATH.

nopinsight • 20 hours ago

One possibility is that it may not yet have sufficient experience and real-world feedback for resolving coding issues in professional repos, as this involves multiple steps and very diverse actions (or branching factor, in AI terms). They have committed to not training on API usage, which limits their ability to directly acquire training data from it. However, their upcoming agentic efforts may address this gap in training data.

timabdulla • 20 hours ago

Right, but the branching factor increases exponentially with the scope of the work.

I think it's obvious that they've cracked the formula for solving well-defined, small-in-scope problems at a superhuman level. That's an amazing thing.

To me, it's less obvious that this implies that they will in short order with just more training data be able to solve ambiguous, large-in-scope problems at even just a skilled human level.

There are far more paths to consider, much more context to use, and in an RL setting, the rewards are much more ambiguously defined.

nopinsight • 18 hours ago

Their reasoning models can learn from procedures and methods, which generalize far better than data. Software tasks are diverse but most tasks are still fairly limited in scope. Novel tasks might remain challenging for these models, as they do for humans.

That said, o3 might still lack some kind of interaction intelligence that’s hard to learn. We’ll see.

nyrikki • 20 hours ago

GPQA scores are mostly from pre-training, against content in the corpus. They have gone silent but look at the GPT4 technical report which calls this out.

We are nowhere close to what Sam Altman calls AGI and transformers are still limited to what uniform-TC0 can do.

As an example the Boolean Formula Value Problem is NC1-complete, thus beyond transformers but trivial to solve with a TM.
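For a sense of how simple the sequential side of that comparison is, here is a minimal sketch of a Boolean formula evaluator. The nested-tuple encoding and the `evaluate` name are assumptions made for illustration. It says nothing about the circuit-depth claim itself; it only shows that an ordinary program handles the problem by visiting each node of the formula once.

```python
def evaluate(formula):
    """Evaluate a variable-free Boolean formula.

    Assumed encoding: a formula is either a bool literal or a tuple
    ("not", f), ("and", f1, f2, ...) or ("or", f1, f2, ...).
    """
    if isinstance(formula, bool):
        return formula
    op, *args = formula
    if op == "not":
        return not evaluate(args[0])
    if op == "and":
        return all(evaluate(a) for a in args)
    if op == "or":
        return any(evaluate(a) for a in args)
    raise ValueError(f"unknown operator: {op!r}")


# (True AND (NOT False)) OR (False AND True)
formula = ("or", ("and", True, ("not", False)), ("and", False, True))
result = evaluate(formula)
```

Each subformula is evaluated at most once, so the running time is linear in the size of the formula.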

As it is now proven that the frame problem is equivalent to the halting problem, even if we can move past uniform-TC0 limits, novelty is still a problem.

I think the advancements are truly extraordinary, but unless you set the bar very low, we aren't close to AGI.

Heck we aren't close to P with commercial models.

sebzim4500 • 19 hours ago

Isn't any physically realizable computer (including our brains) limited to what uniform-TC0 can do?

nyrikki • 17 hours ago

Neither TC0 nor uniform-TC0 are physically realizable, they are tools not physical devices.

The default nonuniform circuit classes are allowed to have a different circuit per input size; the uniform types have unbounded fan-in.

Similar to how a k-tape TM doesn't get 'charged' for the input size.

With Nick's Class (NC) the number of components is similar to traditional compute time while depth relates to the ability to parallelize operations.

These are different than biological neurons, not better or worse but just different.

Human neurons can use dendritic compartmentalization, use spike timing, can retime spikes etc...

While the perceptron model we use in ML is useful, it is not able to do xor in one layer, while biological neurons do that without anything even reaching the soma, purely in the dendrites.
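The XOR limitation of a single perceptron layer can be checked directly. This is a minimal sketch, assuming a hard-threshold unit and a coarse weight grid (both choices, and all names, are made up for illustration). The grid search is a demonstration rather than a proof; the underlying reason is that XOR is not linearly separable.

```python
from itertools import product

XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}


def unit(w1, w2, b, x1, x2):
    """Single threshold unit: fires iff w1*x1 + w2*x2 + b > 0."""
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0


# Try every (w1, w2, b) on a grid from -4.0 to 4.0 in steps of 0.5.
grid = [i / 2 for i in range(-8, 9)]
single_layer_solutions = [
    (w1, w2, b)
    for w1, w2, b in product(grid, repeat=3)
    if all(unit(w1, w2, b, x1, x2) == y for (x1, x2), y in XOR.items())
]


def two_layer(x1, x2):
    """Hand-built two-layer network: XOR = OR AND (NOT AND)."""
    h_or = unit(1, 1, -0.5, x1, x2)
    h_and = unit(1, 1, -1.5, x1, x2)
    return unit(1, -1, -0.5, h_or, h_and)


two_layer_ok = all(two_layer(x1, x2) == y for (x1, x2), y in XOR.items())
```

No weight setting on the grid reproduces XOR with one unit, while one hidden layer of two units is enough.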

Statistical learning models still come down to a choice function, no matter if you call that set shattering or...

With physical computers the time hierarchy does apply and if TIME(g(n)) is given more time than TIME(f(n)), g(n) can solve more problems.

So you can simulate a NTM with exhaustive search with a physical computer.

Physical computers also tend to have NAND and XOR gates, and can have different circuit depths.

When you are in TC0, you only have AND, OR and Threshold (or majority) gates.

Think of instruction level parallelism in a typical CPU, it can return early, vs Itanium EPIC, which had to wait for the longest operation. Predicated execution is also how GPUs work.

They can send a mask and save on load store ops as an example, but the cost of that parallelism is the constant depth.

It is the parallelism tradeoff that both makes transformers practical and limits what they can do.

The IID assumption and autograd requiring smooth manifolds play a role too.

The frame problem, which causes hard problems to become unsolvable for computers and people alike does also.

But the fact that we have polynomial time solutions for the Boolean Formula Value Problem, as mentioned in my post above is probably a simpler way of realizing physical computers aren't limited to uniform-TC0.

drdeca • 17 hours ago

Do you just mean because any physically realizable computer is a finite state machine? Or...?

I wouldn't describe a computer's usual behavior as having constant depth.

It is fairly typical to talk about problems in P as being feasible (though when the constant factors are too big, this isn't strictly true of course).

Just because for unreasonably large inputs, my computer can't run a particular program and produce the correct answer for that input, due to my computer running out of memory, we don't generally say that my computer is fundamentally incapable of executing that algorithm.

ec109685 • 19 hours ago

The problem with ARC is that there are a finite number of heuristics that could be enumerated and trained for, which would give a model a substantial leg up on this evaluation, but would not generalize to other domains.

For example, if they produce millions of examples of the type of problems o3 still struggles on, it would probably do better at similar questions.

Perhaps the private data set is different enough that this isn’t a problem, but the ideal situation would be unveiling a truly novel dataset, which it seems like arc aims to do.

Imnimo • 20 hours ago

>Achieving human-level performance in the ARC benchmark, as well as top human performance in GPQA, Codeforce, AIME, and Frontier Math strongly suggests the model can potentially solve any problem at the human level if it possesses essential knowledge about it.

The article notes, "o3 still fails on some very easy tasks". What explains these failures if o3 can solve "any problem" at the human level? Do these failed cases require some essential knowledge that has eluded the massive OpenAI training set?

nopinsight • 20 hours ago

Great point. I'd love to see what these easy tasks are and would be happy to revise my hypothesis accordingly. o3's intelligence is unlikely to be a strict superset of human intelligence. It is certainly superior to humans in some respects and probably inferior in others. Whether it's sufficiently generally intelligent would be both a matter of definition and empirical fact.

Imnimo • 19 hours ago

Chollet has a few examples here:

https://x.com/fchollet/status/1870172872641261979

https://x.com/fchollet/status/1870173137234727219

I would definitely consider them legitimately easy for humans.

nopinsight • 19 hours ago

Thanks! I added some comments on this at the bottom of the post above.

mirkodrummer • 13 hours ago

Please stop calling it AGI; we don’t even know or universally agree on what that should actually mean. How far did we get with hype by calling a lossy probabilistic compressor that slowly fires words at us AGI? That’s a real bummer to me.

razodactyl • 6 hours ago

Is this comment voted down because of sentiment / polarity?

Regardless, the critical aspect is valid: AGI would be something like Cortana from Halo.

uncomplexity_ • 13 hours ago

on the spatial data i see it as a highly intelligent head of a machine that just needs better limbs and better senses.

i think that's where most hardware startups will specialize with in the coming decades, different industries with different needs.

puttycat • 16 hours ago

Great comment. See this as well for another potential reason for failure:

https://arxiv.org/abs/2402.10013

norir • 20 hours ago

Personally I find "human-level" to be a borderline meaningless and limiting term. Are we now super human as a species relative to ourselves just five years ago because of our advances in developing computer programs that better imitate what many (but far from all) of us were already capable of doing? Have we reached a limit to human potential that can only be surpassed by digital machines? Who decides what human level is and when we have surpassed it? I have seen some ridiculous claims about ai in art that don't stand up to even the slightest scrutiny by domain experts but that easily fool the masses.

razodactyl • 6 hours ago

No I think we're just tired and depressed as a species... Existing systems work to a degree but aren't living up to their potential of increasing happiness according to technological capabilities.

golol • 16 hours ago

In order to replace actual humans doing their job I think LLMs are lacking in judgement, sense of time and agenticism.

Kostchei • 15 hours ago

I mean fkcu me when they have those things, however, maybe they are just lazy and their judgement is fine, for a lazy intelligence. Inner-self thinks "why are these bastards asking me to do this? ". I doubt that is actually happening, but now, .. prove it isn't.

ryoshu • 13 hours ago

Ask o3 is P=NP?

amelius • 2 hours ago

It will just answer with the current consensus on the matter.

zwnow • 4 hours ago

This is not AGI lmao.

xvector • 20 hours ago

Agree. AGI is here. I feel such a sense of pride in our species.

PaulDavisThe1st • 20 hours ago

> It might not yet be able to generate highly novel theories, frameworks, or artifacts to the degree that Einstein, Grothendieck, or van Gogh could.

Every human does this dozens, hundreds or thousands of times ... during childhood.

zug_zug • 2 hours ago

This is a lot of noise around what's clearly not even an order of magnitude of the way to AGI.

Here's my AGI test - Can the model make a theory of AGI validation that no human has suggested before, test itself to see if it qualifies, iterate, read all the literature, and suggest modifications to its own network to improve its performance?

That's what a human-level performer would do.

w4 • 18 hours ago

The cost to run the highest performance o3 model is estimated to be somewhere between $2,000 and $3,400 per task.[1] Based on these estimates, o3 costs about 100x what it would cost to have a human perform the exact same task. Many people are therefore dismissing the near-term impact of these models because of these extremely expensive costs.

I think this is a mistake.

Even if very high costs make o3 uneconomic for businesses, it could be an epoch defining development for nation states, assuming that it is true that o3 can reason like an averagely intelligent person.

Consider the following questions that a state actor might ask itself: What is the cost to raise and educate an average person? Correspondingly, what is the cost to build and run a datacenter with a nuclear power plant attached to it? And finally, how many person-equivalent AIs could be run in parallel per datacenter?

There are many state actors, corporations, and even individual people who can afford to ask these questions. There are also many things that they'd like to do but can't because there just aren't enough people available to do them. o3 might change that despite its high cost.

So if it is true that we've now got something like human-equivalent intelligence on demand - and that's a really big if - then we may see its impacts much sooner than we would otherwise intuit, especially in areas where economics takes a back seat to other priorities like national security and state competitiveness.

[1] https://news.ycombinator.com/item?id=42473876

istjohn • 18 hours ago

Your economic analysis is deeply flawed. If there was anything that valuable and that required that much manpower, it would already have driven up the cost of labor accordingly. The one property that could conceivably justify a substantially higher cost is secrecy. After all, you can't (legally) kill a human after your project ends to ensure total secrecy. But that takes us into thriller novel territory.

w4 • 17 hours ago

I don't think that's right. Free societies don't tolerate total mobilization by their governments outside of war time, no matter how valuable the outcomes might be in the long term, in part because of the very economic impacts you describe. Human-level AI - even if it's very expensive - puts something that looks a lot like total mobilization within reach without the societal pushback. This is especially true when it comes to tasks that society as a whole may not sufficiently value, but that a state actor might value very much, and when paired with something like a co-located reactor and data center that does not impact the grid.

That said, this is all predicated on o3 or similar actually having achieved human level reasoning. That's yet to be fully proven. We'll see!

daemonologist • 12 hours ago

This is interesting to consider, but I think the flaw here is that you'd need a "total mobilization" level workforce in order to build this mega datacenter in the first place. You put one human-hour into making B200s and cooling systems and power plants, you get less than one human-hour-equivalent of thinking back out.

lurking_swe • 16 hours ago

i disagree because the job market is not a true free market. I mean it mostly is, but there’s a LOT of politics and shady stuff that employers do to purposely drive wages down. Even in the tech sector.

Your secrecy comment is really intriguing actually. And morbid lol.

earth2mars • 2 hours ago

Maybe spend more compute time to let it think about optimizing the compute time.

tymonPartyLate • 2 hours ago

Isn’t this like a brute force approach? Given it costs $3,000 per task, that’s like 600 GPU hours (H100 at Azure). In that amount of time the model can generate millions of chains of thought and then spend hours reviewing them or even testing them out one by one. Kind of like trying until something sticks, and that happens to solve 80% of ARC. I feel like reasoning works differently in my brain. ;)
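As a quick sanity check on that estimate, the two figures quoted above imply a particular hourly price per GPU. The sketch below only divides one by the other; it does not look up real Azure pricing, and the variable names are made up for illustration.

```python
# Both inputs are the rough figures quoted in the comment above,
# not measured values.
cost_per_task_usd = 3000
gpu_hours_per_task = 600

# Hourly H100 price implied by those two numbers.
implied_rate_usd = cost_per_task_usd / gpu_hours_per_task
print(f"implied price: ${implied_rate_usd:.2f} per GPU-hour")
```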

tikkun • 2 hours ago

They're only allowed 2-3 guesses per problem. So even though yes it generates many candidates, it can't validate them - it doesn't have tool use or a verifier, it submits the best 2-3 guesses. https://www.lesswrong.com/posts/Rdwui3wHxCeKb7feK/getting-50...

nmca • 2 hours ago

It is allowed exactly two guesses, per the ARC rules.
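A minimal sketch of what two-guess scoring could look like: a task counts as solved if either submitted grid equals the expected output exactly. The function names and the dictionary layout are assumptions for illustration, not the official ARC scoring code.

```python
def task_solved(guesses, expected, max_guesses=2):
    """True if any of the first `max_guesses` grids matches exactly."""
    return any(guess == expected for guess in guesses[:max_guesses])


def score(submissions, answers):
    """Fraction of tasks solved within the guess limit."""
    solved = sum(task_solved(submissions[task], answers[task]) for task in answers)
    return solved / len(answers)


# Toy data: grids are lists of rows of integer colour codes.
answers = {"task_a": [[1, 0], [0, 1]], "task_b": [[2, 2]]}
submissions = {
    "task_a": [[[0, 0], [0, 1]], [[1, 0], [0, 1]]],  # second guess is right
    "task_b": [[[2, 0]], [[0, 2]], [[2, 2]]],        # right only on a third guess
}
print(score(submissions, answers))
```

Under this rule a correct third candidate earns nothing, which is why generating many candidates only helps if the system can rank them well on its own.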

trescenzi • 2 hours ago

How many guesses is the human comparison based on? I’d hope two as well but haven’t seen this anywhere so now I’m curious.

nextworddev • 2 hours ago

The best interpretation of this result is probably that it showed tackling some arbitrary benchmark is something you can throw money at, aka it’s just something money can solve.

It's not AGI, obviously, in the sense that you still need to do some problem framing and initialization to kickstart the reasoning path simulations.

strangescript • 2 hours ago

"We have created artificial super intelligence, it has solved physics!"

"Well, yeah, but it's kind of expensive" -- this guy

tymonPartyLate • 2 hours ago

Haha. Hopefully you’re right and solving the ARC puzzle translates to solving all of physics. I just remain skeptical about the OpenAI hype. They have a track record of exaggerating the significance of their releases and their impact on humanity.

freehorse • 2 hours ago

The problem is not that it is expensive, but that, most likely, it is not superintelligence. Superintelligence is not exploring the problem space semi-blindly, if the thousands of dollars per task are actually spent on that. There is a reason the actual ARC-AGI prize requires efficiency, because the point is not "passing the test" but solving the framing problem of intelligence.

modeless • 21 hours ago

Congratulations to Francois Chollet on making the most interesting and challenging LLM benchmark so far.

A lot of people have criticized ARC as not being relevant or indicative of true reasoning, but I think it was exactly the right thing. The fact that scaled reasoning models are finally showing progress on ARC proves that what it measures really is relevant and important for reasoning.

It's obvious to everyone that these models can't perform as well as humans on everyday tasks despite blowout scores on the hardest tests we give to humans. Yet nobody could quantify exactly the ways the models were deficient. ARC is the best effort in that direction so far.

We don't need more "hard" benchmarks. What we need right now are "easy" benchmarks that these models nevertheless fail. I hope Francois has something good cooked up for ARC 2!

adamgordonbell • 18 hours ago

There is a benchmark, NovelQA, that LLMs don't dominate when it feels like they should. The benchmark is to read a novel and answer questions about it.

LLMs were below human performance when I last looked, but it doesn't get much attention.

Once it is passed, I'd like to see one that is solving the mystery in a mystery book right before it's revealed.

We'd need unpublished mystery novels to use for that benchmark, but I think it gets at what I think of as reasoning.

https://novelqa.github.io/

loxias • 15 hours ago

NovelQA is a great one! I also like GSM-Symbolic -- a benchmark based on making _symbolic templates_ of quite easy questions, and sampling them repeatedly, varying things like which proper nouns are used, what order relevant details appear, how many irrelevant details (GSM-NoOp) and where they are in the question, things like that.

LLMs are far, _far_ below human on elementary problems, once you allow any variation and stop spoonfeeding perfectly phrased word problems. :)

https://machinelearning.apple.com/research/gsm-symbolic

https://arxiv.org/pdf/2410.05229

Paper came out in October, I don't think many have fully absorbed the implications.

It's hard to take any of the claims of "LLMs can do reasoning!" seriously, once you understand that simply changing what names are used in an 8th-grade math word problem can have a dramatic impact on the accuracy.
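To make the symbolic-template idea concrete, here is a small sketch of sampling variants of one word problem with different names, numbers, and an optional irrelevant detail. The template, the name list, and `sample_problem` are hypothetical and are not taken from the GSM-Symbolic paper.

```python
import random

# Hypothetical template, invented for this sketch.
TEMPLATE = (
    "{name} has {a} apples and buys {b} more. "
    "{distractor}How many apples does {name} have now?"
)
NAMES = ["Sophie", "Liam", "Priya", "Mateo"]
# An empty string means no irrelevant detail (the GSM-NoOp idea is the other case).
DISTRACTORS = ["", "Five of the apples are smaller than average. "]


def sample_problem(rng):
    """Return one (question, answer) pair drawn from the template."""
    a, b = rng.randint(2, 20), rng.randint(2, 20)
    question = TEMPLATE.format(
        name=rng.choice(NAMES), a=a, b=b, distractor=rng.choice(DISTRACTORS)
    )
    return question, a + b  # the distractor never changes the answer


rng = random.Random(0)
for _ in range(3):
    question, answer = sample_problem(rng)
    print(question, "->", answer)
```

Every sampled variant has the same solution procedure, so accuracy that shifts across samples points at sensitivity to surface details rather than to the arithmetic.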

meta_x_ai • 18 hours ago

Looks like it's not updated for nearly a year and I'm guessing Gemini 2.0 Flash with 2m context will simply crush it

adamgordonbell • 17 hours ago

That's true. They don't have Claude 3.5 on there either. So maybe it's not relevant anymore, but I'm not sure.

If so, let's move on to the murder mysteries or more complex literary analysis.

usaar333 • 8 hours ago

That's an old leaderboard -- has no one checked any SOTA LLM in the last 8 months?

latency-guy2 • 14 hours ago

> I'd like to see one that is solving the mystery in a mystery book right before it's revealed.

I would think this is a not so good bench. Author does not write logically, they write for entertainment.

adamgordonbell • 14 hours ago

So I'm thinking of something like a locked-room mystery, where the idea is that it's solvable and the reader is given a chance to solve it.

The reason it seems like an interesting bench is that it's a puzzle presented in a long context. It's like testing if an LLM is at Sherlock Holmes level of world and motivation modelling.

rowanG077 • 16 hours ago

Benchmark how? Is it good if the LLM can or can't solve it?

CamperBob2 • 18 hours ago

Does it work on short stories, but not novels? If so, then that's just a minor question of context length that should self-resolve over time.

adamgordonbell • 18 hours ago

The books fit in the current long context models, so it's not merely the context size constraint but the length is part of the issue, for sure.

internet_points • 7 hours ago

> The fact that scaled reasoning models are finally showing progress on ARC proves that what it measures really is relevant and important for reasoning.

One might also interpret that as "the fact that models which are studying to the test are getting better at the test" (Goodhart's law), not that they're actually reasoning.

jug • 18 hours ago

I liked the SimpleQA benchmark that measures hallucinations. OpenAI models did surprisingly poorly, even o1. In fact, it looks like OpenAI often does well on benchmarks by taking the shortcut of being more risk-prone than both Anthropic and Google.

danielmarkbruce • 18 hours ago

Highly challenging for LLMs because it has nothing to do with language. LLMs and their training processes have all kinds of optimizations for language and how it's presented.

This benchmark has done a wonderful job with marketing by picking a great name. It's largely irrelevant for LLMs despite the fact it's difficult.

Consider how much of the model is just noise for a task like this given the low amount of information in each token and the high embedding dimensions used in LLMs.

computerex • 15 hours ago

The benchmark is designed to test for AGI and intelligence, specifically the ability to solve novel problems.

If the hypothesis is that LLMs are the “computer” that drives the AGI then of course the benchmark is relevant in testing for AGI.

I don’t think you understand the benchmark and its motivation. ARC AGI benchmark problems are extremely easy and simple for humans. But LLMs fail spectacularly at them. Why they fail is irrelevant, the fact they fail though means that we don’t have AGI.

danielmarkbruce • 14 hours ago

> The benchmark is designed to test for AGI and intelligence, specifically the ability to solve novel problems.

It's a bunch of visual puzzles. They aren't a test for AGI because it's not general. If models (or any other system for that matter) could solve it, we'd be saying "this is a stupid puzzle, it has no practical significance". It's a test of some sort of specific intelligence. On top of that, the vast majority of blind people would fail - are they not generally intelligent?

The name is marketing hype.

The benchmark could be called "random puzzles LLMs are not good at because they haven't been optimized for it because it's not a valuable benchmark". Sure, it wasn't designed for LLMs, but throwing LLMs at it and saying "see?" is dumb. We can throw in benchmarks for tennis playing, chess playing, video game playing, car driving and a bajillion other things while we are at it.

NateEag • 8 hours ago

And all that is kind of irrelevant, because if LLMs were human-level general intelligence, they would solve all these questions correctly without blinking.

But they don't. Not even the best ones.

zone411 • 18 hours ago

It's the least interesting benchmark for language models among all they've released, especially now that we already had a large jump in its best scores this year. It might be more useful as a multimodal reasoning task since it clearly involves visual elements, but with o3 already performing so well, this has proven unnecessary. ARC-AGI served a very specific purpose well: showcasing tasks where humans easily outperformed language models, so these simple puzzles had their uses. But tasks like proving math theorems or programming are far more impactful.

versteegen • 15 hours ago

ARC wasn't designed as a benchmark for LLMs, and it doesn't make much sense to compare them on it since it's the wrong modality. Even a MLM with image inputs can't be expected to do well, since they're nothing like 99.999% of the training data. The fact that even a text-only LLM can solve ARC problems with the proper framework is important, however.

skywhopper • 19 hours ago

"The fact that scaled reasoning models are finally showing progress on ARC proves that what it measures really is relevant and important for reasoning."

Not sure I understand how this follows. The fact that a certain type of model does well on a certain benchmark means that the benchmark is relevant for real-world reasoning? That doesn't make sense.

munchler • 18 hours ago

It shows objectively that the models are getting better at some form of reasoning, which is at least worth noting. Whether that improved reasoning is relevant for the real world is a different question.

moffkalast • 17 hours ago

It shows objectively that one model got better at this specific kind of weird puzzle that doesn't translate to anything because it is just a pointless pattern matching puzzle that can be trained for, just like anything else. In fact they specifically trained for it, they say so upfront.

It's like the modern equivalent of saying "oh when AI solves chess it'll be as smart as a person, so it's a good benchmark" and we all know how that nonsense went.

munchler • 17 hours ago

bagels • 15 hours ago

It doesn't follow, faulty logic. The two are probably correlated though.

aimanbenbaha • 16 hours ago

Because LLMs are on an off-ramp path towards AGI. A generally intelligent system can brute force its way with just memory.

Once a model recognizes a weakness through reasoning with CoT when posed to a certain problem and gets the agency to adapt to solve that problem that's a precursor towards real AGI capability!

justanotherjoe • 10 hours ago

i am confused cause this dataset is visual-based, and yet being used to measure 'LLM'. I feel like the visual nature of it was really the biggest hurdle to solving it.

dtquad • 21 hours ago

Are there any single-step non-reasoner models that do well on this benchmark?

I wonder how well the latest Claude 3.5 Sonnet does on this benchmark and if it's near o1.

throwaway71271 • 21 hours ago

    | Name                                 | Semi-private eval | Public eval |
    |--------------------------------------|-------------------|-------------|
    | Jeremy Berman                        | 53.6%             | 58.5%       |
    | Akyürek et al.                       | 47.5%             | 62.8%       |
    | Ryan Greenblatt                      | 43%               | 42%         |
    | OpenAI o1-preview (pass@1)           | 18%               | 21%         |
    | Anthropic Claude 3.5 Sonnet (pass@1) | 14%               | 21%         |
    | OpenAI GPT-4o (pass@1)               | 5%                | 9%          |
    | Google Gemini 1.5 (pass@1)           | 4.5%              | 8%          |

https://arxiv.org/pdf/2412.04604

kandesbunzler • 20 hours ago

why is this missing the o1 release / o1 pro models? Would love to know how much better they are

Freebytes • 13 hours ago

This might be because they are referencing single step, and I do not think o1 is single step.

aimanbenbaha • 16 hours ago

Akyürek et al uses test-time compute.

YetAnotherNick • 21 hours ago

Here are the results for base models[1]:

  Model              Semi-private eval  Public eval
  o3 (coming soon)   75.7%              82.8%
  o1-preview         18%                21%
  Claude 3.5 Sonnet  14%                21%
  GPT-4o             5%                 9%
  Gemini 1.5         4.5%               8%

[1]: https://arcprize.org/2024-results

Bjorkbat • 18 hours ago

It's easy to miss, but if you look closely at the first sentence of the announcement they mention that they used a version of o3 trained on a public dataset of ARC-AGI, so technically it doesn't belong on this list.

dot1x • 6 hours ago

It's all scam. ClosedAI trained on the data they were tested on, so no, nothing here is impressive.

simonw • 19 hours ago

I'd love to know how Claude 3.5 Sonnet does so well despite (presumably) not having the same tricks as the o-series models.

lossolo • 19 hours ago

> making the most interesting and challenging LLM benchmark so far.

This[1] is currently the most challenging benchmark. I would like to see how O3 handles it, as O1 solved only 1%.

1. https://epoch.ai/frontiermath/the-benchmark

pynappo • 19 hours ago

Apparently o3 scored about 25%

https://youtu.be/SKBG1sqdyIU?t=4m40s

FiberBundle • 18 hours ago

This is actually the result that I find way more impressive. Elite mathematicians think these problems are challenging and thought they were years away from being solvable by AI.

modeless • 18 hours ago

You're right, I was wrong to say "most challenging" as there have been harder ones coming out recently. I think the correct statement would be "most challenging long-standing benchmark" as I don't believe any other test designed in 2019 has resisted progress for so long. FrontierMath is only a month old. And of course the real key feature of ARC is that it is easy for humans. FrontierMath is (intentionally) not.

refulgentis • 21 hours ago

This emphasizes persons and a self-conceived victory narrative over the ground truth.

Models have regularly made progress on it, this is not new with the o-series.

Doing astoundingly well on it, and having a mutually shared PR interest with OpenAI in this instance, doesn't mean a pile of visual puzzles is actually AGI or some well thought out and designed benchmark of True Intelligence(tm). It's one type of visual puzzle.

I don't mean to be negative, but to inject a memento mori. Real story is some guys get together and ride off Chollet's name with some visual puzzles from ye olde IQ test, and the deal was Chollet then gets to show up and say it proves program synthesis is required for True Intelligence.

Getting this score is extremely impressive but I don't assign more signal to it than any other benchmark with some thought to it.

modeless • 21 hours ago

Solving ARC doesn't mean we have AGI. Also o3 presumably isn't doing program synthesis, seemingly proving Francois wrong on that front. (Not sure I believe the speculation about o3's internals in the link.)

What I'm saying is the fact that as models are getting better at reasoning they are also scoring better on ARC proves that it is measuring something relating to reasoning. And nobody else has come up with a comparable benchmark that is so easy for humans and so hard for LLMs. Even today, let alone five years ago when ARC was released. ARC was visionary.

HarHarVeryFunny • 15 hours ago

> o3 presumably isn't doing program synthesis

I'd guess it's doing natural language procedural synthesis, the same way a human might (i.e. figuring the sequence of steps to effect the transformation), but it may well be doing (sub-)solution verification by using the procedural description to generate code whose output can then be compared to the provided examples.
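The verification step guessed at above can be sketched as follows: candidate transformations are run against the provided example pairs and only those that reproduce every output are kept. This is an illustration of the general idea under made-up names and toy candidates, not a description of how o3 actually works.

```python
def flip_horizontal(grid):
    return [row[::-1] for row in grid]


def flip_vertical(grid):
    return grid[::-1]


def transpose(grid):
    return [list(row) for row in zip(*grid)]


def fits_examples(program, examples):
    """True if the program maps every example input to its expected output."""
    return all(program(inp) == out for inp, out in examples)


# Hypothetical candidate pool; a real system would generate these.
candidates = {
    "flip_horizontal": flip_horizontal,
    "flip_vertical": flip_vertical,
    "transpose": transpose,
}
examples = [
    ([[1, 0], [2, 0]], [[0, 1], [0, 2]]),
    ([[3, 4, 5]], [[5, 4, 3]]),
]

survivors = [name for name, prog in candidates.items() if fits_examples(prog, examples)]
print(survivors)

# Apply a surviving candidate to the held-out test input.
test_input = [[7, 8], [9, 0]]
print(candidates[survivors[0]](test_input))
```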

While OpenAI haven't said exactly what the architecture of o1/o3 are, the gist of it is pretty clear - basically adding "tree" search and iteration on top of the underlying LLM, driven by some RL-based post-training that imparts generic problem solving biases to the model. Maybe there is a separate model orchestrating the search and solution evaluation.

I think there are many tasks that are easy enough for humans but hard/impossible for these models - the ultimate one in terms of commercial value would be to take an "off the shelf model" and treat it as an intern/apprentice and teach it to become competent in an entire job it was never trained on. Have it participate in team meetings and communications, and become a drop-in replacement for a human performing that job (any job that can be performed remotely without a physical presence).

hdjjhhvvhga • 20 hours ago

Your argumentation seems convincing but I'd like to offer a competitive narrative: any benchmark that is public becomes completely useless because companies optimize for it - especially AI that depends on piles of money and they need some proof they are developing.

That's why I have some private benchmarks and I'm sorry to say that the transition from GPT-4 to o1 wasn't unambiguously a step forward (in some tasks yes, in some not).

On the other hand, private benchmarks are even less useful to the general public than the public ones, so we have to deal with what we have - but many of us just treat it as noise and don't give it much significance. Ultimately, the models should defend themselves by performing the tasks individual users want them to do.

stonemetal12 • 20 hours ago

QuantumGood • 20 hours ago

Gaming the benchmarks usually needs to be considered first when evaluating new results.

bubblyworld • 20 hours ago

I think gaming the benchmarks is encouraged in the ARC AGI context. If you look at the public test cases you'll see they test a ton of pretty abstract concepts - space, colour, basic laws of physics like gravity/magnetism, movement, identity and lots of other stuff (highly recommend exploring them). Getting an AI to do well at all, regardless of whether it was gamed or not, is the whole challenge!

chaps • 20 hours ago

refulgentis • 20 hours ago

> Solving ARC doesn't mean we have AGI. Also o3 presumably isn't doing program synthesis, seemingly proving Francois wrong on that front.

Agreed.

> And nobody else has come up with a comparable benchmark that is so easy for humans and so hard for LLMs.

? There's plenty.

modeless • 20 hours ago

stego-tech • 18 hours ago

I won't be as brutal in my wording, but I agree with the sentiment. This was something drilled into me as someone with a hobby in PC Gaming and Photography: benchmarks, while handy measures of potential capabilities, are not guarantees of real world performance. Very few PC gamers completely reinstall the OS before benchmarking to remove all potential cruft or performance impacts, just as very few photographers exclusively take photos of test materials.

While I appreciate the benchmark and its goals (not to mention the puzzles - I quite enjoy figuring them out), successfully passing this benchmark does not demonstrate or guarantee real world capabilities or performance. This is why I increasingly side-eye this field and its obsession with constantly passing benchmarks and then moving the goal posts to a newer, harder benchmark that claims to be a better simulation of human capabilities than the last one: it reeks of squandered capital and a lack of a viable/profitable product, at least to my sniff test. Rather than simply capitalize on their actual accomplishments (which LLMs are - natural language interaction is huge!), they're trying to prove to Capital that with a few (hundred) billion more in investments, they can make AGI out of this and replace all those expensive humans.

They've built the most advanced prediction engines ever conceived, and insist they're best used to replace labor. I'm not sure how they reached that conclusion, but considering even their own models refute this use case for LLMs, I doubt their execution ability on that lofty promise.

danielmarkbruce • 18 hours ago

100%. The hype is misguided. I doubt half the people excited about the result have even looked at what the benchmark is.

miga89 • 8 hours ago

How do the organisers keep the private test set private? Does openAI hand them the model for testing?

If they use a model API, then surely OpenAI has access to the private test set questions and can include it in the next round of training?

(I am sure I am missing something.)

owenpalmer • 7 hours ago

I wouldn't be surprised if the term "benchmark fraud" will soon be coined.

PhilippGille • 5 hours ago

Benchmark fraud is not a novel concept. Outside of LLMs, for example, smartphone manufacturers detect benchmarks and disable or reduce CPU throttling: https://www.theregister.com/2019/09/30/samsung_benchmarking_...

hmottestad • 1 hour ago

CPU frequency ramp curve is also something that can be adjusted. You want the CPU to ramp up really quickly to make everything feel responsive, but at the same time you want to not have to use so much power from your battery.

If you detect that a benchmark is running then you can just ramp up to max frequency immediately. It’ll show how fast your CPU is, but won’t be representative of the actual performance that users will get from their device.

PoignardAzur • 4 hours ago

If we really want to imagine a cold-war-style solution, the two teams could meet in an empty warehouse, bring one computer with the model, one with the benchmarks, and connect them with a USB cable.

In practice I assume they just gave them the benchmarks and took it on the honor system they wouldn't cheat, yeah. They can always cook up a new test set for next time, it's only 10% of the benchmark content anyway and the results are pretty close.

andrepd • 3 hours ago

There's no honor system when there's billions of dollars at stake x) I'm highly highly skeptical of these benchmarks because of intentional cheating and accidental contamination.

7734128 • 8 hours ago

I suppose that's why they are calling it "semi-private".

freehorse • 7 hours ago

And why o3 or any OpenAI llm is not evaluated in the actual private dataset.

bjornsing • 3 hours ago

Isn’t that why they call it “Semi-Private”?

There’s a fully private test set too as I understand it, that o3 hasn’t run on yet.

deneas • 6 hours ago

They have two sets, a fully private one where the models run isolated and the semi-private one where they run models accessed over the internet.

gritzko • 5 hours ago

That is the top question, actually. Given all the billions at stake.

phil917 • 20 hours ago

Direct quote from the ARC-AGI blog:

“SO IS IT AGI?

ARC-AGI serves as a critical benchmark for detecting such breakthroughs, highlighting generalization power in a way that saturated or less demanding benchmarks cannot. However, it is important to note that ARC-AGI is not an acid test for AGI – as we've repeated dozens of times this year. It's a research tool designed to focus attention on the most challenging unsolved problems in AI, a role it has fulfilled well over the past five years.

Passing ARC-AGI does not equate achieving AGI, and, as a matter of fact, I don't think o3 is AGI yet. o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence.

Furthermore, early data points suggest that the upcoming ARC-AGI-2 benchmark will still pose a significant challenge to o3, potentially reducing its score to under 30% even at high compute (while a smart human would still be able to score over 95% with no training). This demonstrates the continued possibility of creating challenging, unsaturated benchmarks without having to rely on expert domain knowledge. You'll know AGI is here when the exercise of creating tasks that are easy for regular humans but hard for AI becomes simply impossible.”

The high compute variant sounds like it cost around *$350,000*, which is kinda wild. Lol, the blog post specifically mentioned how OpenAI asked ARC-AGI not to disclose the exact cost for the high compute version.

Also, 1 odd thing I noticed is that the graph in their blog post shows the top 2 scores as “tuned” (this was not displayed in the live demo graph). This suggests that in those cases the model was trained to better handle these types of questions, so I do wonder about data / answer contamination in those cases…

Bjorkbat • 19 hours ago

> Also, 1 odd thing I noticed is that the graph in their blog post shows the top 2 scores as “tuned”

Something I missed until I scrolled back to the top and reread the page was this

> OpenAI's new o3 system - trained on the ARC-AGI-1 Public Training set

So yeah, the results were specifically from a version of o3 trained on the public training set

Which on the one hand I think is a completely fair thing to do. It's reasonable that you should teach your AI the rules of the game, so to speak. There really aren't any spoken rules though, just pattern observation. Thus, if you want to teach the AI how to play the game, you must train it.

On the other hand though, I don't think the o1 models nor Claude were trained on the dataset, in which case it isn't a completely fair competition. If I had to guess, you could probably get 60% on o1 if you trained it on the public dataset as well.

skepticATX • 19 hours ago

Great catch. Super disappointing that AI companies continue to do things like this. It’s a great result either way but predictably the excitement is focused on the jump from o1, which is now in question.

Bjorkbat • 18 hours ago

To me it's very frustrating because such little caveats make benchmarks less reliable. Implicitly, benchmarks are no different from tests in that someone/something who scores high on a benchmark/test should be able to generalize that knowledge out into the real world.

While that is true with humans taking tests, it's not really true with AIs evaluating on benchmarks.

SWE-bench is a great example. Claude Sonnet can get something like a 50% on verified, whereas I think I might be able to score a 20-25%? So, Claude is a better programmer than me.

Except that isn't really true. Claude can still make a lot of clumsy mistakes. I wouldn't even say these are junior engineer mistakes. I've used it for creative programming tasks and have found one example where it tried to use a library written for d3js for a p5js programming example. The confusion is kind of understandable, but it's also a really dumb mistake.

Some very simple explanations: the models were probably overfitted to a degree on Python given its popularity in AI/ML work, and SWE-bench is all Python. Also, the underlying GitHub issues are quite old, so they probably contaminated the training data and the models have simply memorized the answers.

Or maybe benchmarks are just bad at measuring intelligence in general.

Regardless, every time a model beats a benchmark I'm annoyed by the fact that I have no clue whatsoever how much this actually translates into real world performance. Did OpenAI/Anthropic/Google actually create something that will automate wide swathes of the software engineering profession? Or did they create the world's most knowledgeable junior engineer?

phil917 • 18 hours ago

Lol I missed that even though it's literally the first sentence of the blog, good catch.

Yeah, that makes this result a lot less impressive for me.

hartator • 19 hours ago

> acid test

The css acid test? This can be gamed too.

sundarurfriend • 14 hours ago

https://en.wikipedia.org/wiki/Acid_test:

> An acid test is a qualitative chemical or metallurgical assay utilizing acid. Historically, it often involved the use of a robust acid to distinguish gold from base metals. Figuratively, the term represents any definitive test for attributes, such as gauging a person's character or evaluating a product's performance.

Specifically here, they're using the figurative sense of "definitive test".

airstrike • 10 hours ago

also a "litmus test" but I guess that's a different chemistry test...

ripped_britches • 14 hours ago

Sad to see everyone so focused on compute expense during this massive breakthrough. GPT-2 originally cost $50k to train, but now can be trained for ~$150.

The key part is that scaling test-time compute will likely be a key to achieving AGI/ASI. Costs will definitely come down as is evidenced by precedents, Moore’s law, o3-mini being cheaper than o1 with improved performance, etc.

stocknoob • 12 hours ago

It’s wild, are people purposefully overlooking that inference costs are dropping 10-100x each year?

https://a16z.com/llmflation-llm-inference-cost/

Look at the log scale slope, especially the orange MMLU > 83 data points.
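As a rough back-of-envelope check on what a 10-100x yearly decline would imply, here is a small sketch. The 1,000x target is taken from the cost-gap estimate made upthread and is only illustrative; `years_to_close` is a hypothetical helper, and a constant yearly factor is an assumption.

```python
import math

def years_to_close(gap: float, yearly_factor: float) -> float:
    """Years for cost to fall by `gap` if it drops by `yearly_factor` each year."""
    return math.log(gap) / math.log(yearly_factor)

gap = 1_000  # assumed cost gap (order-of-magnitude estimate from upthread)
for factor in (10, 100):
    print(f"{factor}x/year -> {years_to_close(gap, factor):.1f} years")

# GPT-2 training figures quoted upthread: $50k originally, ~$150 now.
gpt2_reduction = 50_000 / 150
print(f"GPT-2 training cost reduction: ~{gpt2_reduction:.0f}x")
```

At 10x per year a 1,000x gap closes in about 3 years, and at 100x per year in about 1.5. Whether list prices track real inference cost is a separate question.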

menaerus • 2 hours ago

Those are the (subsidized) prices that end clients are paying for the service so that's not something that is representative of what the actual inference costs are. Somebody still needs to pay that (actual) price in the end. For inference, as well as for training, you need actual (NVidia) hardware and that hardware didn't become any cheaper. OTOH models are only becoming increasingly more complex and bigger and with more and more demand I don't see those costs exactly dropping down.

croes • 5 hours ago

A bit early for an every-year claim, not to mention what all this AI is used for.

In some parts of the internet you hardly find real content, only AI spam.

It will get worse the cheaper it gets.

Think of email spam.

yawnxyz • 14 hours ago

I think the question everyone has in their minds isn't "when will AGI get here" or even "how soon will it get here" — it's "how soon will AGI get so cheap that everyone will get their hands on it"

that's why everyone's thinking about compute expense. but I guess in terms of a "lifetime expense of a person" even someone who costs $10/hr isn't actually all that cheap, considering what it takes to grow a human into a fully functioning person that's able to just do stuff

croes • 5 hours ago

We are nowhere near AGI.

hamburga • 12 hours ago

I’m not sure if people realize what a weird test this is. They’re these simple visual puzzles that people can usually solve at a glance, but for the LLMs, they’re converted into a json format, and then the LLMs have to reconstruct the 2D visual scene from the json and pick up the patterns.

If humans were given the json as input rather than the images, they’d have a hard time, too.

Jensson • 9 hours ago

> If humans were given the json as input rather than the images, they’d have a hard time, too.

We shine light in text patterns at humans rather than inject the text directly into the brain as well, that is extremely unfair! Imagine how much better humans would be at text processing if we injected and extracted information from their brains using the neurons instead of eyes and hands.

torginus • 7 hours ago

Not sure how much that matters - I'm not an AI expert, but I did some intro courses where we had to train a classifier to recognize digits. How it worked basically was that we fed each pixel of the 2d grid of the image into an input of the network, essentially flattening it in a similar fashion. It worked just fine, and that was a tiny network.
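A minimal sketch of the flattening described above, in plain Python. The function names are hypothetical and the 3x3 grid is made up for illustration.

```python
from typing import List

def flatten(grid: List[List[int]]) -> List[int]:
    """Row-major flatten: one network input per cell."""
    return [cell for row in grid for cell in row]

def unflatten(flat: List[int], width: int) -> List[List[int]]:
    """Rebuild the 2D grid; this only works if the width is known."""
    return [flat[i:i + width] for i in range(0, len(flat), width)]

grid = [
    [0, 0, 1],
    [0, 1, 0],
    [1, 0, 0],
]
flat = flatten(grid)
print(flat)
```

No information is lost as long as the width is known, but neighbours in the same column end up `width` positions apart, so the model has to learn the 2D structure rather than being given it.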

causal • 9 hours ago

I think that's part of what feels odd about this- in some ways it feels like the wrong type of test for an LLM, but in many ways it makes this achievement that much more remarkable

deneas • 6 hours ago

The JSON files still contain images, just not in a regular image format. You have a 2D array of numbers where each number maps to a color. If you really want a regular picture format, you can easily convert the arrays.
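A rough sketch of that conversion, writing a plain-text PPM image so no extra libraries are needed. The `train`/`input`/`output` layout follows the public ARC task files; the RGB palette and the `grid_to_ppm` helper are assumptions for illustration (viewers differ in the exact shades), and the tiny task below is made up.

```python
import json

# Assumed RGB palette for cell values 0-9.
PALETTE = [
    (0, 0, 0),        # 0 black
    (0, 116, 217),    # 1 blue
    (255, 65, 54),    # 2 red
    (46, 204, 64),    # 3 green
    (255, 220, 0),    # 4 yellow
    (170, 170, 170),  # 5 grey
    (240, 18, 190),   # 6 magenta
    (255, 133, 27),   # 7 orange
    (127, 219, 255),  # 8 light blue
    (135, 12, 37),    # 9 maroon
]

def grid_to_ppm(grid, scale=1):
    """Render a 2D list of ints as a plain-text PPM (P3) image string."""
    height, width = len(grid), len(grid[0])
    lines = ["P3", f"{width * scale} {height * scale}", "255"]
    for row in grid:
        # Repeat each cell `scale` times horizontally...
        pixel_row = " ".join(
            " ".join(map(str, PALETTE[cell])) for cell in row for _ in range(scale)
        )
        # ...and each row `scale` times vertically.
        lines.extend([pixel_row] * scale)
    return "\n".join(lines) + "\n"

# Made-up task in the ARC JSON layout.
task_json = '{"train": [{"input": [[0, 1], [2, 0]], "output": [[1, 0], [0, 2]]}], "test": []}'
task = json.loads(task_json)
ppm = grid_to_ppm(task["train"][0]["input"])
print(ppm)
```

Writing `ppm` to a `.ppm` file gives something most image viewers can open; a larger `scale` makes the cells easier to see.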

ImaCake • 11 hours ago

Yeah, this entire thread seems utterly detached from my lived experience. LLMs are immensely useful for me at work but they certainly don't come close to the hype spouted by many commenters here. It would be great if it could handle more of our quite modest codebase but it's not able to yet

m_ke • 10 hours ago

ARC is a silly benchmark, the other results in math and coding are much more impressive.

o3 is just o1 scaled up, the main takeaway from this line of work that people should walk away with is that we now have a proven way to RL our way to super human performance on tasks where it’s cheap to sample and easy to verify the final output. Programming falls in that category, they focused on known benchmarks but the same process can be done for normal programs, using parsers, compilers, existing functions and unit tests as verifiers.

Pre o1 we only really had next token prediction, which required high quality human produced data, with o1 you optimize for success instead of MLE of next token. Explained in simpler terms, it means it can get reward for any implementation of a function that reproduces the expected result, instead of the exact implementation in the training set.

Put another way, it’s just like RLHF but instead of optimizing against learned human preferences, the model is trained to satisfy a verifier.
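To make the contrast concrete, here is a toy sketch of the two reward signals. It is not how any lab actually trains; `verifier_reward` and `exact_match_reward` are hypothetical names, and exact string match is only a crude stand-in for likelihood-style training against a reference solution.

```python
from typing import Callable, List, Tuple

def verifier_reward(candidate: Callable[[int], int],
                    tests: List[Tuple[int, int]]) -> float:
    """1.0 if the candidate passes every test case, else 0.0."""
    try:
        return float(all(candidate(x) == expected for x, expected in tests))
    except Exception:
        return 0.0  # a crashing candidate gets no reward

def exact_match_reward(candidate_src: str, reference_src: str) -> float:
    """Only the reference text itself scores."""
    return float(candidate_src.strip() == reference_src.strip())

reference_src = "def double(x): return x * 2"
candidate_src = "def double(x): return x + x"  # different text, same behaviour

namespace: dict = {}
exec(candidate_src, namespace)  # toy only; a real pipeline would sandbox this
tests = [(0, 0), (3, 6), (-2, -4)]

print(verifier_reward(namespace["double"], tests))       # passes all tests
print(exact_match_reward(candidate_src, reference_src))  # text differs
```

The candidate earns full reward from the verifier but nothing from exact match, which is the point above: any implementation that reproduces the expected result can be rewarded.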

This should work just as well in VLA models for robotics, self driving and computer agents.

vicentwu • 9 hours ago

"Note on "tuned": OpenAI shared they trained the o3 we tested on 75% of the Public Training set. They have not shared more details. We have not yet tested the ARC-untrained model to understand how much of the performance is due to ARC-AGI data."

Really want to see the number of training pairs needed to achieve this score. If it only takes a few pairs, say 100 pairs, I would say it is amazing!

nmca • 2 hours ago

75% of 400 is 300 :)

Imnimo • 21 hours ago

Whenever a benchmark that was thought to be extremely difficult is (nearly) solved, it's a mix of two causes. One is that progress on AI capabilities was faster than we expected, and the other is that there was an approach that made the task easier than we expected. I feel like there's a lot of the former here, but the compute cost per task (thousands of dollars to solve one little color grid puzzle??) suggests to me that there's some amount of the latter. Chollet also mentions ARC-AGI-2 might be more resistant to this approach.

Of course, o3 looks strong on other benchmarks as well, and sometimes "spend a huge amount of compute for one problem" is a great feature to have available if it gets you the answer you needed. So even if there's some amount of "ARC-AGI wasn't quite as robust as we thought", o3 is clearly a very powerful model.

solidasparagus • 12 hours ago

Or the test wasn't testing anything meaningful, which IMO is what happened here. I think ARC was basically looking at the distribution of what AI is capable of, picked an area that it was bad at and no one had cared enough to go solve, and put together a benchmark. And then we got good at it because someone cared and we had a measurement. Which is essentially the goal of ARC.

But I don't much agree that it is any meaningful step towards AGI. Maybe it's a nice proof point that AI can solve simple problems presented in intentionally opaque ways.

exe34 • 21 hours ago

> the other is that there was an approach that made the task easier than we expected.

from reading Dennett's philosophy, I'm convinced that that's how human intelligence works - for each task that "only a human could do that", there's a trick that makes it easier than it seems. We are bags of tricks.

Jensson • 16 hours ago

> We are bags of tricks.

We are trick generators, that is what it means to be a general intelligence. Adding another trick in the bag doesn't make you a general intelligence, being able to discover and add new tricks yourself makes you a general intelligence.

falcor84 • 15 hours ago

Not the parent, but remembering my reading of Dennett, he was referring to the tricks that we got through evolution, rather than ones we invented ourselves. As particular examples, we have neural functional areas for capabilities like facial recognition and spatial reasoning, which seem to rely on dedicated "wetware" somewhat distinct from other parts of the brain.

exe34 • 5 hours ago

generating tricks is itself a trick that relies on an enormous bag of tricks we inherited through evolution by the process of natural selection.

the new tricks don't just pop into our heads even though it seems that way. nobody ever woke up and devised a new trick in a completely new field without spending years learning about that field or something adjacent to it. even the new ideas tend to be an old idea from a different field applied to a new field. tricks stand on the shoulders of giants.

highfrequency • 19 hours ago

Very cool. I recommend scrolling down to look at the example problem that O3 still can’t solve. It’s clear what goes on in the human brain to solve this problem: we look at one example, hypothesize a simple rule that explains it, and then check that hypothesis against the other examples. It doesn’t quite work, so we zoom into an example that we got wrong and refine the hypothesis so that it solves that sample. We keep iterating in this fashion until we have the simplest hypothesis that satisfies all the examples. In other words, how humans do science - iteratively formulating, rejecting and refining hypotheses against collected data.

From this it makes sense why the original models did poorly and why iterative chain of thought is required - the challenge is designed to be inherently iterative such that a zero shot model, no matter how big, is extremely unlikely to get it right on the first try. Of course, it also requires a broad set of human-like priors about what hypotheses are “simple”, based on things like object permanence, directionality and cardinality. But as the author says, these basic world models were already encoded in the GPT 3/4 line by simply training a gigantic model on a gigantic dataset. What was missing was iterative hypothesis generation and testing against contradictory examples. My guess is that O3 does something like this:

1. Prompt the model to produce a simple rule to explain the nth example (randomly chosen)

2. Choose a different example, ask the model to check whether the hypothesis explains this case as well. If yes, keep going. If no, ask the model to revise the hypothesis in the simplest possible way that also explains this example.

3. Keep iterating over examples like this until the hypothesis explains all cases. Occasionally, new revisions will invalidate already solved examples. That’s fine, just keep iterating.

4. Induce randomness in the process (through next-word sampling noise, example ordering, etc) to run this process a large number of times, resulting in say 1,000 hypotheses which all explain all examples. Due to path dependency, anchoring and consistency effects, some of these paths will end in awful hypotheses - super convoluted and involving a large number of arbitrary rules. But some will be simple.

5. Ask the model to select among the valid hypotheses (meaning those that satisfy all examples) and choose the one that it views as the simplest for a human to discover.
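The five steps above can be sketched as a toy search. This is purely illustrative and not a claim about how o3 works: the "model" is replaced by a small hand-written pool of grid transforms, and every name here (`HYPOTHESES`, `one_run`, `solve`, the complexity scores) is an assumption made up for the example.

```python
import random
from typing import List, Optional, Tuple

Grid = List[List[int]]
Example = Tuple[Grid, Grid]

def identity(g: Grid) -> Grid: return [row[:] for row in g]
def flip_h(g: Grid) -> Grid: return [row[::-1] for row in g]
def flip_v(g: Grid) -> Grid: return [row[:] for row in g[::-1]]
def rot180(g: Grid) -> Grid: return flip_v(flip_h(g))

# name -> (transform, made-up complexity score; lower means simpler)
HYPOTHESES = {
    "identity": (identity, 1),
    "flip_h": (flip_h, 2),
    "flip_v": (flip_v, 2),
    "rot180": (rot180, 3),
}

def explains(name: str, example: Example) -> bool:
    fn, _ = HYPOTHESES[name]
    inp, out = example
    return fn(inp) == out

def one_run(examples: List[Example], rng: random.Random) -> Optional[str]:
    """Steps 1-3: visit examples in random order, revising on each failure."""
    order = examples[:]
    rng.shuffle(order)
    seen: List[Example] = []
    hypothesis: Optional[str] = None
    for ex in order:
        seen.append(ex)
        if hypothesis is not None and explains(hypothesis, ex):
            continue
        # Revise: pick any hypothesis consistent with everything seen so far.
        candidates = [n for n in HYPOTHESES if all(explains(n, s) for s in seen)]
        if not candidates:
            return None
        hypothesis = rng.choice(candidates)
    return hypothesis

def solve(examples: List[Example], runs: int = 50, seed: int = 0) -> Optional[str]:
    """Steps 4-5: repeat with randomness, then keep the simplest survivor."""
    rng = random.Random(seed)
    survivors = {h for h in (one_run(examples, rng) for _ in range(runs)) if h}
    if not survivors:
        return None
    return min(survivors, key=lambda n: (HYPOTHESES[n][1], n))

# Both flip_h and rot180 explain these examples; flip_h is the simpler one.
examples = [
    ([[1, 2], [1, 2]], [[2, 1], [2, 1]]),
    ([[3, 4], [3, 4]], [[4, 3], [4, 3]]),
]
print(solve(examples))
```

Here two hypotheses survive every example, and the final selection step is what breaks the tie in favour of the simpler one. In the real setting the hard parts are the open-ended hypothesis space and the simplicity judgment, both of which this toy hard-codes.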

hmottestad • 18 hours ago

I took a look at those examples that o3 can't solve. Looks similar to an IQ-test.

Took me less time to figure out the 3 examples than it took to read your post.

I was honestly a bit surprised to see how visual the tasks were. I had thought they were text based. So now I'm quite impressed that o3 can solve this type of task at all.

neom • 18 hours ago

I also took some time to look at the ones it couldn't solve. I stopped after this one: https://kts.github.io/arc-viewer/page6/#47996f11

hmottestad • 9 hours ago

That one's cool. All pink pixels need to be repaired so they match the symmetry in the picture.
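A toy version of that repair idea, under loud assumptions: the masked ("pink") cells are taken to have value 6, and the picture is assumed to have simple left-right mirror symmetry. The linked task may well use a different or richer symmetry, and `repair_mirror` is a hypothetical helper.

```python
MASK = 6  # assumed value of the "pink" cells

def repair_mirror(grid, mask=MASK):
    """Fill masked cells from their left-right mirror partner where possible."""
    width = len(grid[0])
    repaired = [row[:] for row in grid]
    for r, row in enumerate(grid):
        for c, cell in enumerate(row):
            partner = row[width - 1 - c]
            if cell == mask and partner != mask:
                repaired[r][c] = partner
    return repaired

# Made-up grid with two damaged cells.
damaged = [
    [1, 2, 3, 2, 1],
    [4, 6, 0, 5, 4],
    [7, 8, 9, 6, 7],
]
repaired = repair_mirror(damaged)
for row in repaired:
    print(row)
```

A cell whose mirror partner is also masked stays masked here, so a fuller solution would need to try additional symmetry axes.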

highfrequency • 18 hours ago

You must be a stem grad! Or perhaps an ensemble of Kaggle submissions?

aithrowawaycomm • 20 hours ago

I would like to see this repeated with my highly innovative HARC-HAGI, which is ARC-AGI but it uses hexagons instead of squares. I suspect humans would only make slightly more brain farts on HARC-HAGI than ARC-AGI, but O3 would fail very badly since it almost certainly has been specifically trained on squares.

I am not really trying to downplay O3. But this would be a simple test as to whether O3 is truly "a system capable of adapting to tasks it has never encountered before" versus novel ARC-AGI tasks it hasn't encountered before.

falcor84 • 15 hours ago

Here's my take - even if the o3 as currently implemented is utterly useless on your HARC-HAGI, it is obvious that o3 coupled with its existing training pipeline trained briefly on the hexagons would excel on it, such that passing your benchmark doesn't require any new technology.

Taking this a level of abstraction higher, I expect that in the next couple of years we'll see systems like o3 given a runtime budget that they can use for training/fine-tuning smaller models in an ad-hoc manner.

zebomon • 21 hours ago

My initial impression: it's very impressive and very exciting.

My skeptical impression: it's complete hubris to conflate ARC or any benchmark with truly general intelligence.

I know my skepticism here is identical to moving goalposts. More and more I am shifting my personal understanding of general intelligence as a phenomenon we will only ever be able to identify with the benefit of substantial retrospect.

As it is with any sufficiently complex program, if you could discern the result beforehand, you wouldn't have had to execute the program in the first place.

I'm not trying to be a downer on the 12th day of Christmas. Perhaps because my first instinct is childlike excitement, I'm trying to temper it with a little reason.

hansonkd • 21 hours ago

It doesn't need to be general intelligence or perfectly map to human intelligence.

All it needs to be is useful. Reading constant comments about LLMs can't be general intelligence or lack reasoning etc, to me seems like people witnessing the airplane and complaining that it isn't "real flying" because it isn't a bird flapping its wings (a large portion of the population held that point of view back then).

It doesn't need to be general intelligence for the rapid advancement of LLM capabilities to be the most societal shifting development in the past decades.

wruza • 20 hours ago

And look at the airplanes, they really can’t just land on a mountain slope or a tree without heavy maintenance afterwards. Those people weren’t all stupid, they questioned the promise of flying servicemen delivering mail or milk to their window and flying on a personal aircar to their workplace. Just like today’s promises about whatever the CEOs’ tall tales are. Imagining bullshit isn’t unique to this century.

Aerospace is still a highly regulated area that requires training and responsibility. If parallels can be drawn here, they don’t look so cool for a regular guy.

Workaccount2 • 20 hours ago

What people always leave out is that society will bend to the abilities of the new technology. Planes can't land in your backyard so we built airports. We didn't abandon planes.

PaulDavisThe1st • 20 hours ago

Sure, but that also vindicates the GP's point that the initial claims of the boosters for planes contained more than their fair share of bullshit and lies.

tivert • 18 hours ago

> What people always leave out is that society will bend to the abilities of the new technology.

Do they really? I don't think they do.

> Planes can't land in your backyard so we built airports. We didn't abandon planes.

But then what do you do with all the fantasies and hype about the new technology (like planes that land in your backyard and you fly them to work)?

And it's quite possible and fairly common that the new technology actually ends up being mostly hype, and there's actually no "airports" use case in the wings. I mean, how much did society "bend to the abilities of" NFTs?

And then what if the mature "airports" use case is actually something most people do not want?

ForHackernews • 19 hours ago

This is already happening. A few days ago Microsoft turned down a documentation PR because the formatting was better for humans but worse for LLMs: https://github.com/MicrosoftDocs/WSL/pull/2021#issuecomment-...

They changed their mind after a public outcry including here on HN.

moffkalast • 17 hours ago

No, we built helicopters.

oblio • 19 hours ago

We are slowly discovering that many of our wonderful inventions from 60-80-100 years ago have serious side effects.

Plastics, cars, planes, etc.

One could say that a balanced situation, where vested interests are put back in the box (close to impossible since it would mean fighting trillions of dollars), would mean that for example all 3 in the list above are used a lot less than we use them now, for example. And only used where truly appropriate.

skydhash • 20 hours ago

This pretty much. Everyone knows that LLMs are great for text generation and processing. What people have been questioning is the end goals as promised by its builders, i.e. is it useful? And from most of what I saw, it's very much a toy.

throwaway4aday • 19 hours ago

Your point is on the verge of nullification with the rapid improvement and adoption of autonomous drones don't you think?

wruza • 16 hours ago

Sort of, but doesn’t that sit on a far-fetched horizon? I doubt that drone companies are the same ones who sold aircraft retrofuturism to people back then.

surgical_fire • 21 hours ago

> to me seems like people witnessing the airplane and complaining that it isn't "real flying" because it isn't a bird flapping its wings

To me it is more like there is someone jumping on a pogo ball while flapping their arms and saying that they are flying whenever they hop off the ground.

Skeptics say that they are not really flying, while adherents say that "with current pogo ball advancements, they will be flying any day now"

PaulDavisThe1st • 20 hours ago

An old quote, quite famous: "... is like saying that an ape who climbs to the top of a tree for the first time is one step closer to landing on the moon".

intelVISA • 20 hours ago

Between skeptics and adherents who is more easily able to extract VC money for vaporware? If you limit yourself to 'the facts' you're leaving tons of $$ on the table...

surgical_fire • 20 hours ago

By all means, if this is the goal, AI is a success.

I understand that in this forum too many people are invested in putting lipstick on this particular pig.

DonHopkins • 19 hours ago

Is that what Elon Musk was trying to do on stage?

zebomon • 21 hours ago

I agree. If the LLMs we have today never got any smarter, the world would still be transformed over the next ten years.

handsclean • 21 hours ago

People aren’t responding to their own assumption that AGI is necessary, they’re responding to OpenAI and the chorus constantly and loudly singing hymns to AGI.

billyp-rva • 21 hours ago

> It doesn't need to be general intelligence or perfectly map to human intelligence.

> All it needs to be is useful.

Computers were already useful.

The only definition we have for "intelligence" is human (or, generally, animal) intelligence. If LLMs aren't that, let's call it something else.

throwup238 • 20 hours ago

What exactly is human (or animal) intelligence? How do you define that?

skywhopper • 19 hours ago

On the contrary, the pushback is critical because many employers are buying the hype from AI companies that AGI is imminent, that LLMs can replace professional humans, and that computers are about to eliminate all work (except VCs and CEOs apparently).

Every person that believes that LLMs are near sentient or actually do a good job at reasoning is one more person handing over their responsibilities to a zero-accountability highly flawed robot. We've already seen LLMs generate bad legal documents, bad academic papers, and extremely bad code. Similar technology is making bad decisions about who to arrest, who to give loans to, who to hire, who to bomb, and who to refuse heart surgery for. Overconfident humans employing this tech for these purposes have been bamboozled by the lies from OpenAI, Microsoft, Google, et al. It's crucial to call out overstatement and overhype about this tech wherever it crops up.

alexalx666 • 20 hours ago

If I could put it into Tesla style robot and it could do dishes and help me figure out tech stuff, it would be more than enough.

AyyEye • 21 hours ago

> Reading constant comments about LLMs can't be general intelligence or lack reasoning etc, to me seems like people witnessing the airplane and complaining that it isn't "real flying" because it isn't a bird flapping its wings (a large portion of the population held that point of view back then).

That is a natural reaction to the incessant techbro, AIbro, marketing, and corporate lies that "AI" (or worse AGI) is a real thing, and can be directly compared to real humans.

There are people on this very thread saying it's better at reasoning than real humans (LOL) because it scored higher on some benchmark than humans... Yet this technology still can't reliably determine what number is circled, if two lines intersect, or count the letters in a word. (That said behaviour may have been somewhat finetuned out of newer models only reinforces the fact that the technology is inherently not capable of understanding anything.)

IanCal • 20 hours ago

I encounter "spicy auto complete" style comments far more often than techbro AI-everything comments and it's frankly getting boring.

I've been doing AI things for about 20+ years and LLMs are wild. We've gone from specialized things being pretty bad at those jobs to general purpose things better at that and everything else. The idea you could make an API call with "is this sarcasm?" and get a better than chance guess is incredible.

jasondigitized • 18 hours ago

This a thousand times.

colordrops • 18 hours ago

I don't think many informed people doubt the utility of LLMs at this point. The potential of human-like AGI has profound implications far beyond utility models, which is why people are so eager to bring it up. A true human-like AGI basically means that most intellectual/white collar work will not be needed, and probably manual labor before too long as well. Huge huge implications for humanity, e.g. how does an economy and society even work without workers?

vouaobrasil • 17 hours ago

> Huge huge implications for humanity, e.g. how does an economy and society even work without workers?

I don't think those that create AI care about that. They just want to come out on top before someone else does.

sigmoid10 • 21 hours ago

These comments are getting ridiculous. I remember when this test was first discussed here on HN and everyone agreed that it clearly proves current AI models are not "intelligent" (whatever that means). And people tried to talk me down when I theorised this test will get nuked soon - like all the ones before. It's time people woke up and realised that the old age of AI is over. This new kind is here to stay and it will take over the world. And you better guess it'll be sooner rather than later and start to prepare.

ignoramous • 20 hours ago

> These comments are getting ridiculous.

Not really. Francois (co-creator of the ARC Prize) has this to say:

  The v1 version of the benchmark is starting to saturate. There were already signs of this in the Kaggle competition this year: an ensemble of all submissions would score 81%

  Early indications are that ARC-AGI-v2 will represent a complete reset of the state-of-the-art, and it will remain extremely difficult for o3. Meanwhile, a smart human or a small panel of average humans would still be able to score >95% ... This shows that it's still feasible to create unsaturated, interesting benchmarks that are easy for humans, yet impossible for AI, without involving specialist knowledge. We will have AGI when creating such evals becomes outright impossible.

  For me, the main open question is where the scaling bottlenecks for the techniques behind o3 are going to be. If human-annotated CoT data is a major bottleneck, for instance, capabilities would start to plateau quickly like they did for LLMs (until the next architecture). If the only bottleneck is test-time search, we will see continued scaling in the future.

https://x.com/fchollet/status/1870169764762710376 / https://ghostarchive.org/archive/Sqjbf

ben_w • 20 hours ago

> It's time people woke up and realised that the old age of AI is over. This new kind is here to stay and it will take over the world. And you better guess it'll be sooner rather than later and start to prepare.

I was just thinking about how 3D game engines were perceived in the 90s. Every six months some new engine came out, blew people's minds, was declared photorealistic, and was forgotten a year later. The best of those engines kept improving and are still here, and kinda did change the world in their own way.

Software development seemed rapid and exciting until about Halo or Half Life 2, then it was shallow but shiny press releases for 15 years, and only became so again when OpenAI's InstructGPT was demonstrated.

While I'm really impressed with current AI, and value the best models greatly, and agree that they will change (and have already changed) the world… I can't help but think of the Next Generation front cover, February 1997 when considering how much further we may be from what we want: https://www.giantbomb.com/pc/3045-94/forums/unreal-yes-this-...

TeMPOraL • 17 hours ago

> Software development seemed rapid and exciting until about Halo or Half Life 2, then it was shallow but shiny press releases for 15 years

The transition seems to map well to the point where engines got sophisticated enough, that highly dedicated high-schoolers couldn't keep up. Until then, people would routinely make hobby game engines (for games they'd then never finish) that were MVPs of what the game industry had a year or three earlier. I.e. close enough to compete on visuals with top photorealistic games of a given year - but more importantly, this was a time where you could do cool nerdy shit to impress your friends and community.

Then Unreal and Unity came out, with a business model that killed the motivation to write your own engine from scratch (except for purely educational purposes), we got more games, more progress, but the excitement was gone.

Maybe it's just a spurious correlation, but it seems to track with:

> and only became so again when OpenAI's InstructGPT was demonstrated.

Which is again, if you exclude training SOTA models - which is still mostly out of reach for anyone but a few entities on the planet - the time where anyone can do something cool that doesn't have a better market alternative yet, and any dedicated high-schooler can make truly impressive and useful work, outpacing commercial and academic work based on pure motivation and focus alone (it's easier when you're not being distracted by bullshit incentives like user growth or making VCs happy or churning out publications, farming citations).

It's, once again, a time of dreams, where anyone with some technical interest and a bit of free time can make the future happen in front of their eyes.

hansonkd • 18 hours ago

> how much further we may be from what we want

The timescale you are describing for 3D graphics is 4 years from the 1997 cover you posted to the release of Halo which you are saying plateaued excitement because it got advanced enough.

An almost infinitesimally small amount of time in terms of the history of human development, and you are mocking the magazine being excited for the advancement because it was... 4 years early?

ben_w • 18 hours ago

torginus • 20 hours ago

The weird thing about the phenomenon you mention is that only after the field of software engineering plateaued 15 years ago, as you mentioned, did this insane demand for engineers arise, with correspondingly insane salaries.

It's a very strange thing I've never understood.

dwaltrip • 18 hours ago

My guess: It’s a very lengthy, complex, and error-prone process to “digitize” human civilization (government, commerce, leisure, military, etc). The tech existed, we just didn’t know how to use it.

We still barely know how to use computers effectively, and they have already transformed the world. For better or worse.

jcims • 20 hours ago

I agree, it's like watching a meadow ablaze and dismissing it because it's not a 'real forest fire' yet. No it's not 'real AGI' yet, but *this is how we get there* and the pace is relentless, incredible and wholly overwhelming.

I've been blessed with grandchildren recently, a little boy that's 2 1/2 and just this past Saturday a granddaughter. Major events notwithstanding, the world will largely resemble today when they are teenagers, but the future is going to look very very very different. I can't even imagine what the capability and pervasiveness of it all will be like in ten years, when they are still just kids. For me as someone that's invested in their future I'm interested in all of the educational opportunities (technical, philosophical and self-awareness) but obviously am concerned about the potential for pernicious side effects.

lawlessone • 21 hours ago

Failing the test may prove the AI is not intelligent. Passing the test doesn't necessarily prove it is.

NitpickLawyer • 20 hours ago

Your comment reminds me of this quote from a book published in the 80s:

> There is a related “Theorem” about progress in AI: once some mental function is programmed, people soon cease to consider it as an essential ingredient of “real thinking”. The ineluctable core of intelligence is always in that next thing which hasn’t yet been programmed. This “Theorem” was first proposed to me by Larry Tesler, so I call it Tesler’s Theorem: “AI is whatever hasn’t been done yet.”

6gvONxR4sf7o • 20 hours ago

I've always disliked this argument. A person can do something well without devising a general solution to the thing. Devising a general solution to the thing is a step we're taking all the time with all sorts of things, but it doesn't invalidate the cool fact about intelligence: whatever it is that lets us do the thing well without the general solution is hard to pin down and hard to reproduce.

All that's invalidated each time is the idea that a general solution to that task requires a general solution to all tasks, or that a general solution to that task requires our special sauce. It's the idea that something able to do that task will also be able to do XYZ.

And yet people keep coming up with a new task that people point to saying, 'this is the one! there's no way something could solve this one without also being able to do XYZ!'

fc417fc802 • 18 hours ago

[dead]

8note • 19 hours ago

I'd consider that it doing the test at all, without proper compensation, is a sign that it isn't intelligent

philipkglass • 20 hours ago

If AI takes over white collar work that's still half of the world's labor needs untouched. There are some promising early demos of robotics plus AI. I also saw some promising demos of robotics 10 and 20 years ago that didn't reach mass adoption. I'd like to believe that by the time I reach old age the robots will be fully qualified replacements for plumbers and home health aides. Nothing I've seen so far makes me think that's especially likely.

I'd love more progress on tasks in the physical world, though. There are only a few paths for countries to deal with a growing ratio of old retired people to young workers:

1) Prioritize the young people at the expense of the old by e.g. cutting old age benefits (not especially likely since older voters have greater numbers and higher participation rates in elections)

2) Prioritize the old people at the expense of the young by raising the demands placed on young people (either directly as labor, e.g. nurses and aides, or indirectly through higher taxation)

3) Rapidly increase the population of young people through high fertility or immigration (the historically favored path, but eventually turns back into case 1 or 2 with an even larger numerical burden of older people)

4) Increase the health span of older people, so that they are more capable of independent self-care (a good idea, but difficult to achieve at scale, since most effective approaches require behavioral changes)

5) Decouple goods and services from labor, so that old people with diminished capabilities can get everything they need without forcing young people to labor for them

reducesuffering • 19 hours ago

> If AI takes over white collar work that's still half of the world's labor needs untouched.

I am continually baffled that people here throw this argument out and can't imagine the second-order effects. If white collar work is automated by AGI, all the R&D to solve robotics beyond imagination will happen in a flash. The top AI labs, the people smart enough to make this technology, are all focusing on automating AGI researchers, and everything follows from there, obviously.

brotchie • 18 hours ago

QuantumGood • 20 hours ago

"it will take over the world"

Calibrating to the current hype cycle has been challenging with AI pronouncements.

Workaccount2 • 20 hours ago

You are telling a bunch of high earning individuals ($150k+) that they may be dramatically less valuable in the near future. Of course the goal posts will keep being pushed back and the acknowledgements will never come.

foobarqux • 21 hours ago

You should look up the terms necessary and sufficient.

sigmoid10 • 20 hours ago

The real issue is people constantly making up new goalposts to keep their outdated world view somewhat aligned with what we are seeing. But these two things are drifting apart faster and faster. Even I got surprised by how quickly the ARC benchmark was blown out of the water, and I'm pretty bullish on AI.

foobarqux • 20 hours ago

The ARC maintainers have explicitly said that passing the test was necessary but not sufficient so I don't know where you come up with goal-post moving. (I personally don't like the test; it is more about "intuition" or in-built priors, not reasoning).

manmal • 19 hours ago

Are you like invested in LLM companies or something? You're pushing the agenda hard in this thread.

samvher • 21 hours ago

What kind of preparation are you suggesting?

johnny_canuck • 21 hours ago

Start learning a trade

whynotminot • 19 hours ago

I feel like that’s just kicking the can a little further down the road.

Our value proposition as humans in a capitalist society is an increasingly fragile thing.

jorblumesea • 20 hours ago

that's going to work when every white collar worker goes into the trades /s

who is going to pay for residential electrical work lol and how much will you make if some guy from MIT is going to compete with you

sigmoid10 • 21 hours ago

This is far too broad to summarise here. You can read up on Sutskever or Bostrom or hell even Stephen Hawking's ideas (going in order from really deep to general topics). We need to discuss everything - from education over jobs and taxes all the way to the principles of politics, our economy and even the military. If we fail at this as a society, we will at the very least create a world where the people who own capital today massively benefit and become rich beyond imagination (despite having contributed nothing to it), while the majority of the population will be unemployable and forever left behind. And the worst case probably falls somewhere between the end of human civilisation and the end of our species.

astrange • 20 hours ago

ben_w • 2 hours ago

> Productivity improvements increase employment.

Sometimes: the productivity improvements from the combustion engine didn't increase employment of horses, it displaced them.

But even when productivity improvements do increase employment, it's not always to our advantage: the productivity improvements from Eli Whitney's cotton gin included huge economic growth and subsequent technological improvements… and also "led to increased demands for slave labor in the American South, reversing the economic decline that had occurred in the region during the late 18th century": https://en.wikipedia.org/wiki/Cotton_gin

A superhuman AI that's only superhuman in specific domains? We've been seeing plenty of those, "computer" used to be a profession, and society can re-train but it still hurts the specific individuals who have to be unemployed (or start again as juniors) for the duration of that training.

A superhuman AI that's superhuman in every domain, but close enough to us in resource requirements that comparative advantage is still important and we can still do stuff, relegates us to whatever the AI is least good at.

A superhuman AI that's superhuman in every domain… as soon as someone invents mining, processing, and factory equipment that works on the moon or asteroids, that AI can control that equipment to make more of that equipment, and demand is quickly — O(log(n)) — saturated. I'm moderately confident that in this situation, the comparative advantage argument no longer works.
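
As a rough illustration of the O(log(n)) remark, here is a minimal sketch that assumes the simplest possible model: the equipment stock starts at one unit and doubles every replication cycle. The doubling rate, the starting stock, and the function name are all illustrative assumptions, not figures from any real system.

```python
def cycles_to_saturate(demand_units: int, initial_units: int = 1) -> int:
    """Count doubling cycles until the stock meets or exceeds demand.

    Hypothetical model: every existing unit builds one copy of itself per cycle.
    """
    if demand_units < 1 or initial_units < 1:
        raise ValueError("demand_units and initial_units must be positive")
    units = initial_units
    cycles = 0
    while units < demand_units:
        units *= 2  # each existing unit builds one more unit
        cycles += 1
    return cycles


if __name__ == "__main__":
    for demand in (10, 1_000, 1_000_000, 1_000_000_000):
        print(demand, cycles_to_saturate(demand))
```

Under this toy model, each thousandfold increase in demand adds only about ten more cycles, which is the sense in which saturation time grows logarithmically in n.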

BriggyDwiggs42 • 2 hours ago

No, Atlas Shrugged explicitly believes that the wealthy beneficiaries are also the ones doing the innovation and the labor. Human/superhuman AI, if not self-directed but more like a tool, may massively benefit whoever happens to be lucky enough to be directing it when it arises. This does not imply that the lucky individual benefits on the basis of their competence.

The idea that productivity improvements increase employment is just fundamentally based on a different paradigm. There is absolutely no reason to think that when a machine exists that can do most things that a human can do as well if not better for less or equal cost, this will somehow increase human employment. In this scenario, using humans in any stage of the pipeline would be deeply inefficient and a stupid business decision.

kelseyfrog • 21 hours ago

What we're going to do is punt the questions and then convince ourselves the outcome was inevitable and if anything it's actually our fault.

bluerooibos • 17 hours ago

The goalposts have moved, again and again.

It's gone from "well the output is incoherent" to "well it's just spitting out stuff it's already seen online" to "WELL...uhh IT CAN'T CREATE NEW/NOVEL KNOWLEDGE" in the space of 3-4 years.

It's incredible.

We already have AGI.

levocardia • 20 hours ago

I'm a little torn. ARC is really hard, and Francois is extremely smart and thoughtful about what intelligence means (the original "On the Measure of Intelligence" heavily influenced my ideas on how to think about AI).

On the other hand, there is a long, long history of AI achieving X but not being what we would casually refer to as "generally intelligent," then people deciding X isn't really intelligence; only when AI achieves Y will it be intelligence. Then AI achieves Y and...

amarcheschi • 21 hours ago

I just googled ARC-AGI questions, and it looks like it is similar to an IQ test with Raven's matrices. Similar as in you have some examples of images before and after, then an image before and you have to guess the after.

Could anyone confirm if this is the only kind of question in the benchmark? If yes, how come there is such a direct connection to "oh this performs better than humans" when LLMs can be quite a bit better than us at understanding and forecasting patterns? I'm just curious, not trying to stir up controversies

zebomon • 21 hours ago

It's a test on which (apparently until now) the vast majority of humans have far outperformed all machine systems.

patrickhogan1 • 21 hours ago

But it’s not a test that directly shows general intelligence.

I am excited nonetheless! This is a huge improvement.

How does this do on SWE Bench?

og_kalu • 21 hours ago

Eridrus • 21 hours ago

ML is quite good at understanding and forecasting patterns when you train on the data you want to forecast. LLMs manage to do so much because we just decided to train on everything on the internet and hope that it included everything we ever wanted to know.

This tries to create patterns that are intentionally not in the data and see if a system can generalize to them, which o3 super impressively does!

yunwal • 20 hours ago

ARC is in the dataset though? I mean I'm aware that there are new puzzles every day, but there's still a very specific format and set of skills required to solve it. I'd bet a decent amount of money that humans get better at ARC with practice, so it seems strange to suggest that AI wouldn't.

ALittleLight • 21 hours ago

Yes, it's pretty similar to Raven's. The reason it is an interesting benchmark is because humans, even very young humans, "get" the test in the sense of understanding what it's asking and being able to do pretty well on it - but LLMs have really struggled with the benchmark in the past.

Chollet (one of the creators of the ARC benchmark) has been saying it proves LLMs can't reason. The test questions are supposed to be unique and not in the model's training set. The fact that LLMs struggled with the ARC challenge suggested (to Chollet and others) that models weren't "Truly reasoning" but rather just completing based on things they'd seen before - when the models were confronted with things they hadn't seen before, the novel visual patterns, they really struggled.

Bjorkbat • 17 hours ago

I think it's still an interesting way to measure general intelligence, it's just that o3 has demonstrated that you can actually achieve human performance on it by training it on the public training set and giving it ridiculous amounts of compute, which I imagine equates to ludicrously long chains-of-thought, and if I understand correctly more than one chain-of-thought per task (they mention sample sizes in the blog post, with o3-low using 6 and o3-high using 1024. Not sure if these are chains-of-thought per task or what).

Once you look at it that way, the approach really doesn't look like intelligence that's able to generalize to novel domains. It doesn't pass the sniff test. It looks a lot more like brute-forcing.

Which is probably why, in order to actually qualify for the leaderboard, they stipulate that you can't use more than $10k of compute. Otherwise, it just sounds like brute-forcing.

BriggyDwiggs42 • 1 hour ago

I disagree. It’s vastly inefficient, but it is managing to actually solve these problems with a vast search space. If we extrapolate this approach into the future and assume that the search becomes better as the underlying model improves, and assume that the architecture grows more efficient, and assume that the type of parallel computing used here grows cheaper, isn’t it possible that this is a lot more than brute-forcing in terms of what it will achieve? In other words, is it maybe just a really ugly way of doing something functionally equivalent to reasoning?

Agentus • 20 hours ago

how about an extra large dose of your skepticism. is true intelligence really a thing and not just a vague human construct that tries to point out the mysterious unquantifiable combination of human behaviors?

humans clearly don't know what intelligence is unambiguously. there's also no divinely ordained objective dictionary that one can point at to reference what true intelligence is. a deep reflection of trying to pattern associate different human cognitive abilities indicates human cognitive capabilities aren't that spectacular really.

MVissers • 12 hours ago

My guess as an amateur neuroscientist is that what we call intelligence is just a 'measurement' of problem solving ability in different domains. Can be emotional, spatial, motor, reasoning, etc etc.

There is no special sauce in our brain. And we know how much compute there is in our brain, so we can roughly estimate when we'll hit that with these 'LLMs'.

Language is important in human brain development as well. Kids who grow up deaf grow up vastly less intelligent unless they learn sign language. Language allows us to process complex concepts that our brain can learn to solve, without having to be in those complex environments.

So in hindsight, it's easy to see why it took a language model to be able to solve general tasks and other types of deep learning networks couldn't.

I don't really see any limits on these models.

Agentus • 11 hours ago

interesting point about language. but i wonder if people misattribute the reason why language is pivotal to human development. your points are valid. i see human behavior with regard to learning as 90% mimicry and 10% autonomous learning. most of what humans believe in is taken on faith and passed on from the tribe to the individual. rarely is it verified even partially let alone fully. humans simply don't have the time or processing power to do that. learning a thing without outside aid is a vastly slower and more energy- or brain-intensive process than copy learning or learning through social institutions by dissemination. the stunted development from lack of language might come more from the reduced ability to access the collective learning process that language enables and/or greatly enhances. i think a lot of learning even when combined with reasoning, deduction, etc really is at the mercy of brute force exploration to find a solution, which individuals are bad at but a society that collects random experienced “ah hah!” occurrences and passes them along is actually okay at.

i wonder if llms and language don't so much allow us to process these complex environments but instead preload our brains to get a head start in processing those complex environments once we arrive in them. i think llms store compressed relationships of the world which obviously has information loss from a neural mapping of the world that isn't just language based. but those compressed relationships, i.e. knowledge, don't exactly map backward onto the world without a reverse key. like artificially learning about real world stuff in school abstractly and then going into the real world, it takes time for that abstraction to snap fit upon the real world.

could you further elaborate on what you mean by limits, because i'm happy to play contrarian on what i think i interpret you to be saying there.

also to your main point: what intelligence is. yeah you sort of hit upon my thoughts on intelligence. it's a combination of problem solving abilities in different domains. it's like an amalgam of cognitive processes that achieve an amalgam of capabilities. while we can label alllllll that with a singular word, doesn't mean it's all a singular process. seems like it's a composite. moreover i think a big chunk of intelligence (but not all) is just brute forcing finding associations and then encoding those by some reflexive search/retrieval. a different part of intelligence of course is adaptability and pattern finding.

m3kw9 • 21 hours ago

Going by the statement: this is a pretty tough test on which AI scored low vs humans just last year, and now AI can do it as well as humans. That may not be AGI, which I agree with, but it means something, in all caps

manmal • 19 hours ago

Obviously, the multi billion dollar companies will try to satisfy the benchmarks they are not yet good at, as has always been the case.

m3kw9 • 12 hours ago

A valid conspiracy theory but I’ve heard that one every step of the way to this point

kelseyfrog • 21 hours ago

> truly general intelligence

Indistinguishable from goalpost moving like you said, but also no true Scotsman.

I'm curious what would happen in your eyes if we misattributed general intelligence to an AI model? What are the consequences of a false positive and how would they affect your life?

It's really clear to me how intelligence fits into our reality as part of our social ontology. The attributes and their expression that each of us uses to ground our concept of the intelligent predicate differs wildly.

My personal theory is that we tend to have an exemplar-based dataset of intelligence, and each of us attempts to construct a parsimonious model of intelligence, but like all (mental) models, they can be useful but wrong. These models operate in a space where the trade off is completeness or consistency, and most folks, uncomfortable saying "I don't know" lean toward being complete in their specification rather than consistent. The unfortunate side-effect is that we're able to easily generate test data that highlights our model inconsistency - AI being a case in point.

PaulDavisThe1st • 20 hours ago

> I'm curious what would happen in your eyes if we misattributed general intelligence to an AI model? What are the consequences of a false positive and how would they affect your life?

Rich people will think they can use the AI model instead of paying other people to do certain tasks.

The consequences could range from brilliant to utterly catastrophic, depending on the context and precise way in which this is done. But I'd lean toward the catastrophic.

kelseyfrog • 19 hours ago

Any specifics? It's difficult to separate this from generalized concern.

PaulDavisThe1st • 19 hours ago

wslh • 21 hours ago

> My skeptical impression: it's complete hubris to conflate ARC or any benchmark with truly general intelligence.

But isn’t it interesting to have several benchmarks? Even if it’s not about passing the Turing test, benchmarks serve a purpose—similar to how we measure microprocessors or other devices. Intelligence may be more elusive, but even if we had an oracle delivering the ultimate intelligence benchmark, we'd still argue about its limitations. Perhaps we'd claim it doesn't measure creativity well, and we'd find ourselves revisiting the same debates about different kinds of intelligences.

zebomon • 21 hours ago

It's certainly interesting. I'm just not convinced it's a test of general intelligence, and I don't think we'll know whether or not it is until it's been able to operate in the real world to the same degree that our general intelligence does.

FrustratedMonky • 21 hours ago

" it's complete hubris to conflate ARC or any benchmark with truly general intelligence."

Maybe it would help to include some human results in the AI ranking.

I think we'd find that Humans score lower?

zamadatix • 20 hours ago

I'm not sure it'd help what they are talking about much.

E.g. go back in time and imagine you didn't know there are ways for computers to be really good at performing integration yet as nobody had tried to make them. If someone asked you how to tell if something is intelligent "the ability to easily reason integrations or calculate extremely large multiplications in mathematics" might seem like a great test to make.

Skip forward to the modern era and it's blatantly obvious CASes like Mathematica on a modern computer range from "ridiculously better than the average person" to "impossibly better than the best person" depending on the test. At the same time, it becomes painfully obvious a CAS is wholly unrelated to general intelligence and just because your test might have been solvable by an AGI doesn't mean solving it proves something must have been an AGI.

So you come up with a new test... but you have the same problem as originally, it seems like anything non-human completely bombs and an AGI would do well... but how do you know the thing that solves it will have been an AGI for sure and not just another system clearly unrelated?

Short of a more clever way what GP is saying is the goalposts must keep being moved until it's not so obvious the thing isn't AGI, not that the average human gets a certain score which is worse.

All that aside, to answer your original question, in the presentation it was said the average human gets 85% and this was the first model to beat that. It was also said a second version is being worked on. They have some papers on their site about clear examples of why the current test clearly has a lot of testing unrelated to whether something is really AGI (a brute force method was shown to get >50% in 2020) so their aim is to create a new goalpost test and see how things shake out this time.

og_kalu • 20 hours ago

Generality is not binary. It's a spectrum. And these models are already general in ways those things you've mentioned simply weren't.

What exactly is AGI to you? If it's simply a generally intelligent machine then what are you waiting for? What else is there to be sure of? There's nothing narrow about these models.

Humans love to believe they're oh so special so much that there will always be debates on whether 'AGI' has arrived. If you are waiting for that then you'll be waiting a very long time, even if a machine arrives that takes us to the next frontier in science.

Jensson • 16 hours ago

FrustratedMonky • 20 hours ago

"Short of a more clever way what GP is saying is the goalposts must keep being moved until it's not so obvious the thing isn't AGI, not that the average human gets a certain score which is worse."

Best way of stating that I've heard.

The Goal Post must keep moving, until we understand enough what is happening.

I usually poo-poo the goal post moving, but this makes sense.

Engineering-MD • 13 hours ago

Can I just say what a dick move it was to do this as a 12 days of Christmas. I mean to be honest I agree with the arguments this isn’t as impressive as my initial impression, but they clearly intended it to be shocking/a show of possible AGI, which is rightly scary.

It feels so insensitive to do that right before a major holiday when the likely outcome is a lot of people feeling less secure in their career/job/life.

Thanks again OpenAI for showing us you don’t give a shit about actual people.

XenophileJKO • 13 hours ago

Or maybe the target audience that watches 12 launch videos in the morning is genuinely excited about the new model. They intended it to be a preview of something to look forward to.

What a weird way to react to this.

achierius • 7 hours ago

It sounds like you aren't thinking about this that deeply then. Or at least not understanding that many smart (and financially disinterested) people who are, are coming to concerning conclusions.

https://www.transformernews.ai/p/richard-ngo-openai-resign-s...

>But while the “making AGI” part of the mission seems well on track, it feels like I (and others) have gradually realized how much harder it is to contribute in a robustly positive way to the “succeeding” part of the mission, especially when it comes to preventing existential risks to humanity.

Almost every single one of the people OpenAI had hired to work on AI safety have left the firm with similar messages. Perhaps you should at least consider the thinking of experts?

mirkodrummer • 13 hours ago

There is no AGI, it’s just marketing, this stuff is overhyped, enjoy your holidays, you won’t lose your job ;)

Engineering-MD • 7 hours ago

I agree, it’s just more about the intent than anything else, like boasting about your amazing new job when someone has recently been made redundant, just before Christmas.

achierius • 7 hours ago

I feel you. It's tough trying to think about what we can do to avert this; even to the extent that individuals are often powerless, in this regard it feels worse than almost anything that's come before.

keiferski • 6 hours ago

The vast majority of people who will lose jobs to AI aren’t following AGI benchmarks, and don’t even know what AGI is short for.

Engineering-MD • 3 hours ago

That’s true and a reasonable point. But looking in this thread you can see there has been this reaction from quite a few.

OldGreenYodaGPT • 12 hours ago

Blaming OpenAI for progress is like blaming a calendar for Christmas—it’s not the timing, it’s your unwillingness to adapt

r-zip • 2 hours ago

Unwillingness to adapt to the destruction of the middle class and knowledge work is pretty reasonable tbh.

lagrange77 • 2 hours ago

Wow, you just solved the ethics of technology in a one liner. Impressive.

t0lo • 9 hours ago

I hate the deliberate fear-mongering that these companies peddle to the population to get higher valuations

stevenhuang • 12 hours ago

This is a you problem. Yes there will be pain in the short term, but it will be worth it in the long term.

Many of us look forward to what a future with AGI can do to help humanity and hopefully change society for the better, mainly to achieve a post scarcity economy.

jakebasile • 11 hours ago

Surely the elites that control this fancy new technology will share the benefits with all of us _this_ time!

achierius • 7 hours ago

https://www.transformernews.ai/p/richard-ngo-openai-resign-s...

Almost every single one of the people OpenAI had hired to work on AI safety have left the firm with similar messages. Perhaps you should at least consider the thinking of experts? There is a real chance that this ends with significant good. There is also a real chance that this ends with the death of every single human being. That's never been a choice we've had to make before, and it seems like we as a species are unprepared to approach it.

randyrand • 10 hours ago

Post scarcity seems very unlikely. Humans might be worthless, but there will still be a finite number of AIs, compute, space, resources.

_cs2017_ • 10 hours ago

Wtf is wrong with you dude? It's just another tech, some jobs will get worse some jobs will get better. Happens every couple of decades. Stop freaking out.

achierius • 7 hours ago

This is not a very kind or humble comment. There are real experts talking about how this time is different -- as an analogy, think about how horses, for thousands of years, always had new things to do -- until one day they didn't. It's hubris to think that we're somehow so different from them.

Notably, the last key AI safety researcher just left OpenAI: https://www.transformernews.ai/p/richard-ngo-openai-resign-s...

Are you that upset that this guy chose to trust the people that OpenAI hired to talk about AI safety, on the topic of AI safety?

Balgair • 20 hours ago

Complete aside here: I used to do work with amputees and prosthetics. There is a standardized test (and I just cannot remember the name) that fits in a briefcase. It's used for measuring the level of damage to the upper limbs and for prosthetic grading.

Basically, it's got the dumbest and simplest things in it. Stuff like a lock and key, a glass of water and jug, common units of currency, a zipper, etc. It tests if you can do any of those common human tasks. Like pouring a glass of water, picking up coins from a flat surface (I chew off my nails so even an able person like me fails that), zip up a jacket, lock your own door, put on lipstick, etc.

We had hand prosthetics that could play Mozart at 5x speed on a baby grand, but could not pick up a silver dollar or zip a jacket even a little bit. To the patients, the hands were therefore about as useful as a metal hook (a common solution with amputees today, not just pirates!).

Again, a total aside here, but your comment just reminded me of that brown briefcase. Life, it turns out, is a lot more complex than we give it credit for. Even pouring the OJ can be, in rare cases, transcendent.

ubj • 20 hours ago

There's a lot of truth in this. I sometimes joke that robot benchmarks should focus on common household chores. Given a basket of mixed laundry, sort and fold everything into organized piles. Load a dishwasher given a sink and counters overflowing with dishes piled up haphazardly. Clean a bedroom that kids have trashed. We do these tasks almost without thinking, but the unstructured nature presents challenges for robots.

Balgair • 19 hours ago

I maintain that whoever invents a robust laundry folding robot will be a trillionaire. As in, I dump jumbled clean clothes straight from a dryer at it and out come folded and sorted clothes (and those loner socks). I know we're getting close, but I also know we're not there yet.

smokel • 19 hours ago

We are certainly getting close! In 2010, watching PR2 fold some unseen towels was similar to watching paint dry [1], but we can now enjoy robots attaining lazy student-level laundry folding in real time, as demonstrated by π₀ [2].

[1] https://www.youtube.com/watch?v=gy5g33S0Gzo

[2] https://www.physicalintelligence.company/blog/pi0

yongjik • 19 hours ago

I can live without folding laundry (I can just shove my undershirts in the closet, who cares if it's not folded), but whoever manufactures a reliable auto-loading dishwasher will have my dollars. Like, just put all your dishes in the sink and let the machine handle them.

dweekly • 18 hours ago

I was a believer in Gal's FoldiMate but sadly it...folded.

https://en.m.wikipedia.org/wiki/FoldiMate

blargey • 18 hours ago

At this point I'm not sure we'll actually get a task-specific machine for laundry folding/sorting before humanoid robots gain the capability to do it well enough.

sss111 • 18 hours ago

Honestly, a robot that can hang jumbled clean clothes instead of folding them would be good enough, it's crazy how we don't even have those.

jessekv • 19 hours ago

I want it to lay out an outfit every day too. Hopefully without hallucination.

nradov • 19 hours ago

There is the Foldimate robot. I don't know how well it works. It doesn't seem to pair up socks. (Deleted the web link, it might not be legitimate.)

smokel • 18 hours ago

Beware, this website is probably a scam.

Foldimate went bankrupt in 2021 [1], and the domain redirect from foldimate.com to a 404 page at miele.com suggests that it was Miele who bought up the remains, not a sketchy company with a ".website" top-level domain.

[1] https://en.wikipedia.org/wiki/FoldiMate

oblio • 19 hours ago

Laundry folding and laundry ironing, I would say.

musicale • 18 hours ago

Hopefully will detect whether a small child is inside or not.

imafish • 19 hours ago

> I maintain that whoever invents a robust laundry folding robot will be a trillionaire

… so Elon Musk? :D

zamalek • 18 hours ago

Slightly tangential, we already have amazing laundry robots. They are called washing and drying machines. We don't give these marvels enough credit, mostly because they aren't shaped like humans.

Humanoid robots are mostly a waste of time. Task-shaped robots are much easier to design, build, and maintain... and are more reliable. Some of the things you mention might need humanoid versatility (loading the dishwasher), others would be far better served by purpose-built robots (laundry sorting).

jkaptur • 18 hours ago

I'm embarrassed to say that I spent a few moments daydreaming about a robot that could wash my dishes. Then I thought about what to call it...

wsintra2022 • 18 hours ago

Was it a dishwasher? Just give it all your unclean dishes and tell it to go, come back an hour later and they're all washed and mostly dried!

rytis • 18 hours ago

I agree. I don’t know where this obsession comes from. Obsession with resembling humans as closely as possible. We’re so far from being perfect. If you need proof just look at your teeth. Yes, we’re relatively universal, but a screwdriver is more efficient at driving in screws than our fingers. So please, stop wasting time building perfect universal robots, build more purpose-built ones.

golol • 17 hours ago

The shape doesn't matter! Non-humanoid shapes give minor advantages on specific tasks but for a general robot you'll have a hard time finding a shape much more optimal than humanoid. And if you go with humanoid you have so much data available! Videos contain the information of which movements a robot should execute. Teleoperation is easy. This is the bitter lesson! The shape doesn't matter, any shape will work with the right architecture, data and training!

Nevermark • 18 hours ago

Given we have shaped so many tasks to fit our bodies, it will be a long time before a bot able to do a variety/majority of human tasks the human way won’t be valuable.

1000 machines specialized for 1000 tasks are great, but don’t deliver the same value as a single bot that can interchange with people flexibly.

Costly today, but won’t be forever.

rowanG077 • 17 hours ago

Purpose-built robots are basically solved. Dishwashers, laundry machines, assembly robots, etc. The moat is a general purpose robot that can do what a human can do.

graemep • 17 hours ago

Great examples. They are simple, reliable, efficient and effective. Far better than blindly copying what a human being does. Maybe there are equally clever ways of doing things like folding clothes.

throwup238 • 18 hours ago

This is expressed in AI research as Moravec's paradox: https://en.wikipedia.org/wiki/Moravec%27s_paradox

Getting to LLMs that could talk to us turned out to be a lot easier than making something that could control even a robotic arm without precise programming, let alone a humanoid.

ecshafer • 20 hours ago

I had a pretty bad case of tendinitis once, that basically made my thumb useless since using it would cause extreme pain. That test seems really good. I could use a computer keyboard without any issue, but putting a belt on or pouring water was impossible.

vidarh • 17 hours ago

I had a swollen elbow a short while ago, and the number of things I'd never thought about that were affected by reduced elbow joint mobility and an inability to put pressure on the elbow was disturbing.

alexose • 19 hours ago

It feels like there's a whole class of information that is easily shorthanded, but really hard to explain to novices.

I think a lot about carpentry. From the outside, it's pretty easy: Just make the wood into the right shape and stick it together. But as one progresses, the intricacies become more apparent. Variations in the wood, the direction of the grain, the seasonal variations in thickness, joinery techniques that are durable but also time efficient.

The way this information connects is highly multisensory and multimodal. I now know which species of wood to use for which applications. This knowledge was hard won through many, many mistakes and trials that took place at my home, the hardware store, the lumberyard, on YouTube, from my neighbor Steve, and in books written by experts.

drdrey • 18 hours ago

I think assembling Legos would be a cool robot benchmark: you need to parse the instructions, locate the pieces you need, pick them up, orient them, snap them to your current assembly, visually check if you achieved the desired state, repeat

serpix • 9 hours ago

I agree. Watching my toddler daughter build with small legos makes me understand how incredible fine motor skills are as even with small fingers some of the blocks are just too hard to snap together.

Method-X • 19 hours ago

Was it the Southampton hand assessment procedure?

Balgair • 19 hours ago

Yes! Thank you!

https://www.shap.ecs.soton.ac.uk/

m463 • 20 hours ago

It would be interesting to see trick questions.

Like in your test

a hand grenade and a pin - don't pull the pin.

Or maybe a mousetrap? but maybe that would be defused?

in the ai test...

or Global Thermonuclear War, the only winning move is...

HPsquared • 19 hours ago

Gaming streams being in the training data, it might pull the pin because "that's what you do".

8note • 19 hours ago

or, because it has to give an output, and pulling the pin is the only option

TeMPOraL • 17 hours ago

There's also the option of not pulling the pin, and shooting your enemies as they instinctively run from what they think is a live grenade. Saw it on a TV show the other day.

sdenton4 • 19 hours ago

to move first!

m463 • 17 hours ago

oh crap. lol!

croemer • 19 hours ago

> We had hand prosthetics that could play Mozart at 5x speed on a baby grand, but could not pick up a silver dollar or zip a jacket even a little bit.

I must be missing something, how can they be able to play Mozart at 5x speed with their prosthetics but not zip a jacket? They could press keys but not do tasks requiring feedback?

Or did you mean they used to play Mozart at 5x speed before they became amputees?

ben_w • 19 hours ago

Playing a piano involves pushing down on the right keys with the right force at the right time, but that could be pre-programmed well before computers. The self-playing piano in the saloon in Westworld wasn't a huge anachronism, such things slightly overlapped with the Wild West era: https://en.wikipedia.org/wiki/Player_piano

Picking up a 1mm thick metal disk from a flat surface requires the user gives the exact time, place, and force, and I'm not even sure what considerations it needs for surface materials (e.g. slightly squishy fake skin) and/or tip shapes (e.g. fake nails).

numpad0 • 19 hours ago

> Picking up a 1mm thick metal disk from a flat surface requires the user gives the exact time, place, and force

place sure but can't you cheat a bit for time and force with compliance("impedance control")?

ben_w • 18 hours ago

In theory, apparently not in practice.

rahimnathwani • 19 hours ago

Imagine a prosthetic 'hand' that has 5 regular fingers, rather than 4 fingers and a thumb. It would be able to play a piano just fine, but be unable to grasp anything small, like a zipper.

n144q • 17 hours ago

Well, you see, while the original comment says they could play at 5x speed, it does not say it plays at that speed well or play it beautifully. Any teacher or any student who learned piano for a while will tell you that this matters a lot, especially for classical music -- being able to accurately play at an even tempo with the correct dynamics and articulation is hard and is what differentiates a beginner/intermediate player from an advanced one. In fact, one mistake many students make is playing a piece too fast when they are not ready, and teachers really want students to practice very slowly.

My point is -- being able to zip a jacket is all about those subtle actions, and could actually be harder than "just" playing piano fast.

8note • 19 hours ago

zipping up a jacket is really hard to do, and requires very precise movements and coordination between hands.

playing mozart is much more forgiving in terms of the number of different motions you have to make in different directions, the amount of pressure to apply, and even the black keys are much bigger than large sized zipper tongues.

Balgair • 19 hours ago

Pretty much. The issue with zippers is that the fabric moves about in unpredictable ways. Piano playing was just movement programs. Zipping required (surprisingly) fast feedback. Also, gripping is somewhat tough compared to pressing.

oblio • 19 hours ago

I'm far from a piano player, but I can definitely push piano buttons quite quickly while zipping up my jacket when it's cold and/or wet outside is really difficult.

Even more so for picking up coins from a flat surface.

For robotics, it's kind of obvious, speed is rarely an issue, so the "5x" part is almost trivial. And you can program the sequence quite easily, so that's also doable. Piano keys are big and obvious and an ergonomically designed interface meant to be relatively easy to press, ergo easy even for a prosthetic. A small coin on a flat surface is far from ergonomic.

yongjik • 19 hours ago

I play piano as a hobby, and the funny thing is, if my hands are so cold that I can't zip up my jacket, there's no way I can play anything well. I know it's not quite zipping up jackets ;) but a human playing the piano does require a fast feedback loop.

croemer • 19 hours ago

But how do you deliberately control those fingers to actually play yourself what you have in mind rather than something preprogrammed? Surely the idea of a prosthetic does not just mean "a robot that is connected to your body", but something that the owner controls with their mind.

vidarh • 17 hours ago

Nobody said anything about deliberately controlling those fingers to play yourself. Clearly it's not something you do for the sake of the enjoyment of playing, but more likely a demonstration of the dexterity of the prosthesis and ability to program it for complex tasks.

The idea of a prosthesis is to help you regain functionality. If the best way of doing that is through automation, then it'd make little sense not to.

numpad0 • 19 hours ago

Thumb not opposable?

dang • 16 hours ago

We detached this subthread from https://news.ycombinator.com/item?id=42473419

(nothing wrong with it! I'm just trying to prune the top subthread)

oblio • 19 hours ago

This was actually discovered quite early on in the history of AI:

> Rodney Brooks explains that, according to early AI research, intelligence was "best characterized as the things that highly educated male scientists found challenging", such as chess, symbolic integration, proving mathematical theorems and solving complicated word algebra problems. "The things that children of four or five years could do effortlessly, such as visually distinguishing between a coffee cup and a chair, or walking around on two legs, or finding their way from their bedroom to the living room were not thought of as activities requiring intelligence."

https://en.wikipedia.org/wiki/Moravec%27s_paradox

bawolff • 19 hours ago

I don't know why people always feel the need to gender these things. Highly educated female scientists generally find the same things challenging.

robocat • 18 hours ago

I don't know why anyone would blame people as though someone is making an explicit choice. I find your choice of words to be insulting to the OP.

We learn our language and stereotypes subconsciously from our society, and it is no easy thing to fight against that.

Barrin92 • 17 hours ago

>I don't know why people always feel the need to gender these things

Because it's relevant to the point being made, i.e. that these tests reflect the biases and interests of the people who make them. This is true not just for AI tests, but for intelligence tests applied to humans. That Demis Hassabis, a chess player and video game designer, decided to test his machine on video games, Go and chess is probably not an accident.

The more interesting question is why people respond so apprehensively to pointing out a very obvious problem and bias in test design.

xnx • 17 hours ago

Despite lack of fearsome teeth or claws, humans are way op due to brain, hand dexterity, and balance.

MarcelOlsz • 17 hours ago

>We had hand prosthetics that could play Mozart at 5x speed on a baby grand

I'd love to know more about this.

CooCooCaCha • 19 hours ago

That’s why the goal isn’t just benchmark scores, it’s reliable and robust intelligence.

In that sense, the goalposts haven’t moved in a long time despite claims from AI enthusiasts that people are constantly moving goalposts.

yawnxyz • 21 hours ago

O3 High (tuned) model scored an 88% at what looks like $6,000/task haha

I think soon we'll be pricing any kind of tasks by their compute costs. So basically, human = $50/task, AI = $6,000/task, use human. If AI beats human, use AI? Ofc that's considering both get 100% scores on the task

cchance • 21 hours ago

Isn't that generally what ... all jobs are? Automation cost vs long-term human cost... it's why Amazon did the weird "our stores are AI driven" thing when in reality it was cheaper to hire a bunch of guys in a sweat shop to look at the cameras and write things down lol.

The thing is, given what we've seen from distillation and tech, even if it's $6,000/task... that will come down drastically over time through optimization and just... faster, more efficient processing hardware and software.

cryptoegorophy • 21 hours ago

I remember hearing Tesla trying to automate all of production but some things just couldn’t be, like the wiring which humans still had to do.

Benjaminsen • 21 hours ago

Compute costs on AI with roughly the same capabilities have been halving every ~7 months.

That makes something like this competitive in ~3 years
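As a rough sanity check on the halving arithmetic, here is a minimal sketch. The 7-month halving period is the figure from the comment; the cost gaps plugged in are illustrative assumptions, and `years_to_close_gap` is a hypothetical helper name.

```python
import math

def years_to_close_gap(cost_ratio: float, halving_months: float = 7.0) -> float:
    """Years until a cost that is `cost_ratio` times too high reaches parity,
    assuming cost halves every `halving_months` months (the comment's figure)."""
    halvings = math.log2(cost_ratio)
    return halvings * halving_months / 12.0

# Illustrative gaps (assumptions, not measured values).
for ratio in (32, 120, 1000):
    print(f"{ratio:>5}x gap -> {years_to_close_gap(ratio):.1f} years")
```

Under this assumption, ~3 years corresponds to closing a gap of roughly 32x. The $6,000 vs $50 per task comparison upthread (120x) would take about 4 years, and a 1000x gap a little under 6.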

seizethecheese • 16 hours ago

And human costs have been increasing a few percent per year for a few centuries!

jsheard • 21 hours ago

That's the elephant in the room with the reasoning/COT approach, it shifts what was previously a scaling of training costs into scaling of training and inference costs. The promise of doing expensive training once and then running the model cheaply forever falls apart once you're burning tens, hundreds or thousands of dollars worth of compute every time you run a query.

Workaccount2 • 18 hours ago

They're gonna figure it out. Something is being missed somewhere, as human brains can do all this computation on 20 watts. Maybe it will be a hardware shift or maybe just a software one, but I strongly suspect that modern transformers are grossly inefficient.

Legend2440 • 21 hours ago

Yeah, but next year they'll come out with a faster GPU, and the year after that another still faster one, and so on. Compute costs are a temporary problem.

freehorse • 20 hours ago

The issue is not just scaling compute, but scaling it at a rate that meets the increase in complexity of the problems that are not currently solved. If that is O(n) then what you say probably stands. If that is e.g. O(n^8) or exponential, then there is no hope of getting good enough scaling by just increasing compute at a normal rate. Then AI technology will still be improving, but improving to a halt, practically stagnating.

o3 will be interesting if it indeed offers a novel technology to handle problem solving, something that is able to learn from few novel examples efficiently and adapt. That's what intelligence actually is. Maybe this is the case. If, on the other hand, it is a smart way to pair CoT within an evaluation loop (as the author hints as a possibility) then it is probable that, while this _can_ handle a class of problems that current LLMs cannot, it is not really this kind of learning, meaning that it will not be able to scale to more complex, real world tasks with a problem space that is too large and thus less amenable to such a technique. It is still interesting, because having a good enough evaluator may be a very important step, but it would mean that we are not yet there.

We will learn soon enough I suppose.
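To make the scaling argument above concrete, a minimal sketch: if solving a problem of size n costs f(n) compute, how much larger an n does each doubling of compute buy? The three cost models are illustrative assumptions matching the O(n), O(n^8) and exponential cases, and the function name is hypothetical.

```python
def size_after_doubling(kind: str, n: float) -> float:
    """Reachable problem size after compute doubles, starting from size `n`."""
    if kind == "linear":        # cost ~ n     -> n doubles
        return 2 * n
    if kind == "poly8":         # cost ~ n**8  -> n grows by a factor 2**(1/8)
        return n * 2 ** (1 / 8)
    if kind == "exponential":   # cost ~ 2**n  -> n grows by 1
        return n + 1
    raise ValueError(f"unknown cost model: {kind}")

n0 = 100.0
for kind in ("linear", "poly8", "exponential"):
    n = n0
    for _ in range(10):  # ten doublings = 1024x more compute
        n = size_after_doubling(kind, n)
    print(f"{kind:<12} n: {n0:.0f} -> {n:.0f}")
```

With 1024x more compute, the linear case goes from n = 100 to 102,400, the n^8 case only to about 238, and the exponential case to 110.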

og_kalu • 20 hours ago

It's not 6000/task (i.e per question). 6000 is about the retail cost for evaluating the entire benchmark on high efficiency (about 400 questions)

Tiberium • 20 hours ago

From reading the blog post and Twitter, and cost of other models, I think it's evident that it IS actually cost per task, see this tweet: https://files.catbox.moe/z1n8dc.jpg

And o1 cost $15/$60 for 1M in/out, so the estimated costs on the graph would match for a single task, not the whole benchmark.
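A minimal sketch of that price check, assuming o3 is billed at o1's list price (o3 pricing is not given here) and treating the 33M tokens reported for the 100 semi-private tasks as output tokens. `task_cost` is a hypothetical helper.

```python
# o1 list prices from the comment, in USD per 1M tokens.
PRICE_IN_PER_M = 15.0
PRICE_OUT_PER_M = 60.0

def task_cost(input_tokens: int, output_tokens: int) -> float:
    """Retail cost of one task in USD at the prices above."""
    return (input_tokens * PRICE_IN_PER_M + output_tokens * PRICE_OUT_PER_M) / 1e6

# Assumption: all 33M tokens are output (reasoning) tokens, split evenly
# across the 100 tasks of the semi-private set.
tokens_per_task = 33_000_000 // 100
print(f"${task_cost(0, tokens_per_task):.2f} per task")
```

That works out to about $19.80 per task, in line with the ~$20/task reported for the high-efficiency run, which supports reading the chart's x-axis as cost per task.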

slibhb • 20 hours ago

The blog clarifies that it's $17-20 per task. Maybe it runs into thousands for tasks it can't solve?

Tiberium • 19 hours ago

That cost is for o3 low, o3 high goes into thousands per task.

freehorse • 21 hours ago

This makes me wonder whether the solution consists of a "solver" trying semi-random or more targeted things and a "checker" checking these. Usually checking a solution is cognitively (and computationally) easier than coming up with it. Otherwise I cannot think what sort of compute would burn $6,000 per task, unless you are going through a lot of loops and have somehow solved the part of the problem that figures out whether a solution is correct, while coming up with the actual correct solution is not yet solved to the same degree. Or maybe I am just naive and these prices are just like breakfast for companies like that.
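The solver/checker split being speculated about can be illustrated with a toy generate-and-verify loop. This is a sketch of the general idea only, not a description of how o3 works; the grid format, the candidate transforms and all names are hypothetical.

```python
import random

Grid = list[list[int]]

# Hypothetical pool of candidate "programs": simple grid transforms.
CANDIDATES = {
    "identity": lambda g: [row[:] for row in g],
    "flip_h": lambda g: [row[::-1] for row in g],
    "flip_v": lambda g: [row[:] for row in g[::-1]],
    "transpose": lambda g: [list(col) for col in zip(*g)],
}

def check(name: str, examples: list[tuple[Grid, Grid]]) -> bool:
    """Checker: cheap to run, it only compares outputs on the demonstration pairs."""
    f = CANDIDATES[name]
    return all(f(inp) == out for inp, out in examples)

def solve(examples, max_tries: int = 1000, seed: int = 0):
    """Solver: samples candidates at random until the checker accepts one."""
    rng = random.Random(seed)
    names = list(CANDIDATES)
    for attempt in range(1, max_tries + 1):
        name = rng.choice(names)
        if check(name, examples):
            return name, attempt
    return None, max_tries

examples = [([[1, 2], [3, 4]], [[2, 1], [4, 3]])]  # output is a horizontal flip
name, attempts = solve(examples)
print(name, "accepted after", attempts, "tries")
```

The checker costs one pass over the examples, while the solver's cost grows with the size of the candidate space, which is the asymmetry the comment points at.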

gbnwl • 20 hours ago

Well they got 75.7% at $17/task. Did you see that?

seydor • 18 hours ago

What if we use those humans to generate energy for the tasks?

redeux • 21 hours ago

Time and availability would also be factors.

dyauspitr • 21 hours ago

Compute can get optimized and cheap quickly.

karmasimida • 20 hours ago

Is it? Moore’s law is dead dead, I don’t think this is a given.

neuroelectron • 20 hours ago

OpenAI spent approximately $1,503,077 to smash the SOTA on ARC-AGI with their new o3 model

semi-private evals (100 tasks): 75.7% @ $2,012 total (~$20/task) with just 6 samples & 33M tokens processed in ~1.3 min/task

The “low-efficiency” setting with 1024 samples scored 87.5% but required 172x more compute.

If we assume compute spent and cost are proportional, then OpenAI might have just spent ~$346,064 for the low-efficiency run on the semi-private eval.

On the public eval they might have spent ~$1,148,444 to achieve 91.5% with the low-efficiency setting. (high-efficiency mode: $6,677)

OpenAI just spent more money to run an eval on ARC than most people spend on a full training run.
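The extrapolation above can be reproduced in a few lines. The two high-efficiency costs and the 172x multiplier are the figures quoted in the comment; the linear cost-to-compute relationship is the comment's own assumption, and the variable names are made up for this sketch.

```python
# Figures quoted in the comment (USD, high-efficiency runs).
SEMI_PRIVATE_HIGH_EFF = 2_012
PUBLIC_HIGH_EFF = 6_677
COMPUTE_MULTIPLIER = 172  # low-efficiency vs high-efficiency compute

# Assumption: cost scales linearly with compute.
semi_private_low_eff = SEMI_PRIVATE_HIGH_EFF * COMPUTE_MULTIPLIER
public_low_eff = PUBLIC_HIGH_EFF * COMPUTE_MULTIPLIER
total = (SEMI_PRIVATE_HIGH_EFF + PUBLIC_HIGH_EFF
         + semi_private_low_eff + public_low_eff)

print(f"semi-private, low efficiency: ${semi_private_low_eff:,}")
print(f"public, low efficiency:       ${public_low_eff:,}")
print(f"all four runs:                ${total:,}")
```

Summing the four runs this way gives $1,503,197, slightly off from the $1,503,077 headline figure, so one of the quoted numbers is probably rounded or mistyped. Either way the estimate is about $1.5M.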

bluecoconut • 19 hours ago

By my estimates, for this single benchmark, this is comparable cost to training a ~70B model from scratch today. Literally from 0 to a GPT-3 scale model for the compute they ran on 100 ARC tasks.

I double checked with some flop estimates (P100 for 12 hours = Kaggle limit, they claim ~100-1000x for O3-low, and x172 for O3-high) so roughly on the order of 10^22-10^23 flops.

Put another way, using the H100 market price of $2/chip-hour -> at $350k, that's ~175k GPU-hours. Or 10^24 FLOPs in total.

So, huge margin, but 10^22 - 10^24 flop is the band I think we can estimate.

These are the scale of numbers that show up in the chinchilla optimal paper, haha. Truly GPT-3 scale models.
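A back-of-envelope version of the H100 estimate, as a sketch under loudly stated assumptions: roughly 1e15 FLOP/s peak per H100, 100% utilization, and the common 6 × parameters × tokens approximation for training compute with about 20 training tokens per parameter. None of these constants come from the thread except the $2 rental price and the ~$350k budget.

```python
# Assumptions (not from the thread): ~1e15 FLOP/s peak per H100, full
# utilization, and training FLOP ~= 6 * params * tokens with ~20 tokens/param.
H100_PEAK_FLOPS = 1e15
USD_PER_GPU_HOUR = 2.0   # rental price used in the comment
BUDGET_USD = 350_000     # approximate low-efficiency eval cost from the thread

gpu_hours = BUDGET_USD / USD_PER_GPU_HOUR
eval_flop = gpu_hours * 3600 * H100_PEAK_FLOPS

params = 70e9
train_tokens = 20 * params
train_flop = 6 * params * train_tokens

print(f"GPU-hours:        {gpu_hours:,.0f}")
print(f"eval budget:      {eval_flop:.1e} FLOP")
print(f"70B training run: {train_flop:.1e} FLOP")
```

Both numbers land around 6e23 FLOP, which is why the eval budget and a ~70B training run look comparable. Real utilization is well below peak, so the eval figure is an upper bound at this price.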

rvnx • 19 hours ago

It sounds like they essentially brute-forced the solutions? Ask LLM for answer, ask LLM to verify the answer. Ask LLM for answer, ask LLM to verify the answer. Add a bit of randomness. Ask LLM for answer, ask LLM to verify the answer. Add a bit of randomness. Repeat 5B times (this is what the paper says).

rfoo • 19 hours ago

Pretty sure this "cost" is based on their retail price instead of actual inference cost.

neuroelectron • 18 hours ago

Yes that's correct and there's a bit of "pixel math" as well so take these numbers with a pinch of salt. Preliminary model sizes from the temporarily public HF repository put the full model size at 8 TB, or roughly 80 H100s.

az226 • 2 hours ago

I thought that was a fake.

ec109685 • 14 hours ago

Yeah and can run off peak, etc.

Does seem to show an absolutely massive market for inference compute…

tikkun • 2 hours ago

I wonder: when did o1 finish training, and when did o3 finish training?

There's a ~3 month delay between o1's launch (Sep 12) and o3's launch (Dec 20). But, it's unclear when o1 and o3 each finished training.

spaceman_2020 • 21 hours ago

Just as an aside, I've personally found o1 to be completely useless for coding.

Sonnet 3.5 remains the king of the hill by quite some margin

vessenes • 21 hours ago

To fill this out, I find o1-pro (and -preview when it was live) to be pretty good at filling in blindspots/spotting holistic bugs. I use Claude for day to day, and when Claude is spinning, o1 often can point out why. It's too slow for AI coding, and I agree that at default its responses aren't always satisfying.

That said, I think its code style is arguably better, more concise and has better patterns -- Claude needs a fair amount of prompting and oversight to not put out semi-shitty code in terms of structure and architecture.

In my mind: going from Slowest to Fastest, and Best Holistically to Worst, the list is:

1. o1-pro 2. Claude 3.5 3. Gemini 2 Flash

Flash is so fast, that it's tempting to use more, but it really needs to be kept to specific work on strong codebases without complex interactions.

spaceman_2020 • 2 hours ago

Claude has a habit of sometimes just getting “lost”

Like I’ll give it a project in Cursor and it will spin up ready-to-use components that use my site style, reference existing components, and follow all existing patterns

Then on some days, it will even forget what language the project is in and start giving me python code for a react project

og_kalu • 21 hours ago

To be fair, until the last checkpoint released 2 days ago, o1 didn't really beat sonnet (and if so, barely) in most non-competitive coding benchmarks

bitbuilder • 19 hours ago

I find myself hopping between o1 and Sonnet pretty frequently these days, and my personal observation is that the quality of output from o1 scales more directly with the quality of the prompting you're giving it.

In a way it almost feels like it's become too good at following instructions and simply just takes your direction more literally. It doesn't seem to take the initiative of going the extra mile of filling in the blanks from your lazy input (note: many would see this as a good thing). Claude on the other hand feels more intuitive in discerning intent from a lazy prompt, which I may be prone to offering it at times when I'm simply trying out ideas.

However, if I take the time to write up a well thought out prompt detailing my expectations, I find I much prefer the code o1 creates. It's smarter in its approach, offers clever ideas I wouldn't have thought of, and generally cleaner.

Or put another way, I can give Sonnet a lazy or detailed prompt and get a good result, while o1 will give me an excellent result with a well thought out prompt.

What this boils down to is I find myself using Sonnet while brainstorming ideas, or when I simply don't know how I want to approach a problem. I can pitch it a feature idea the same way a product owner might pitch an idea to an engineer, and then iterate through sensible and intuitive ways of looking at the problem. Once I get a handle on how I'd like to implement a solution, I type up a spec and hand it off to o1 to crank out the code I'd intend to implement.

spaceman_2020 • 2 hours ago

Have you found any tool or guide for writing better o1 prompts? This isn’t the first time I’ve heard this about o1 but no one seems to know how to prompt it

jules • 18 hours ago

Can you solve this by putting your lazy prompt through GPT-4o or Sonnet 3.6 and asking it to expand the prompt to a full prompt for o1?

InkCanon • 20 hours ago

I just asked o1 a simple yes or no question about x86 atomics and it did one of those A or B replies. The first answer was yes, the second answer was no.

bearjaws • 21 hours ago

o1 is pretty good at spotting OWASP defects, compared to most other models.

https://myswamp.substack.com/p/benchmarking-llms-against-com...

cchance • 21 hours ago

The new gemini's are pretty good too

lysecret • 21 hours ago

Actually prefer new geminis too. 2.0 experimental especially.

leumon • 18 hours ago

I've found gemini-1206 to be best, and we can use it for free (for now) in Google's AI Studio. It's number 1 on lmarena.ai for coding, and generally, and number 1 on bigcodebench.

energy123 • 14 hours ago

Which o1? A new version was released a few days ago and beats Sonnet 3.5 on Livebench

karmasimida • 20 hours ago

Yeah I feel for chat use case, o1 is just too slow for me, and my queries aren’t that complicated.

For coding, o1 is marvelous at Leetcode questions. I think it is the best teacher I could ever afford to teach me leetcoding, but I don’t find myself having a lot of other use cases for o1 that are complex and require a really long reasoning chain.

m3kw9 • 20 hours ago

o1 is for when all else fails. Sometimes it makes the same mistakes as weaker models if you give it simple tasks with very little context, but when a good, precise context is given it usually outperforms other models.

nxobject • 20 hours ago

As an aside, I'm a little miffed that the benchmark calls out "AGI" in the name, but then heavily cautions that it's necessary but insufficient for AGI.

> ARC-AGI serves as a critical benchmark for detecting such breakthroughs, highlighting generalization power in a way that saturated or less demanding benchmarks cannot. However, it is important to note that ARC-AGI is not an acid test for AGI

mmcnl • 17 hours ago

I immediately thought so too. Why confuse everyone?

ec109685 • 14 hours ago

Because ARC somehow convinced people that solving it was an indicator of AGI.

Jensson • 14 hours ago

It's like the "Open" in OpenAI or the "Democratic" in North Korea's DPRK. Naming things helps fool a lot of people.

EthanHeilman • 10 hours ago

It is a necessary but not sufficient condition to AGI.

oezi • 6 hours ago

> o3 fixes the fundamental limitation of the LLM paradigm – the inability to recombine knowledge at test time

I don't understand this mindset. We have all experienced that LLMs can produce words never spoken before. Thus there is recombination of knowledge at play. We might not be satisfied with the depth/complexity of the combination, but there isn't any reason to believe something fundamental is missing. Given more compute and enough recursiveness we should be able to reach any kind of result from the LLM.

The linked article says that LLMs are like a collection of vector programs. It has always been my thinking that computations in vector space are easy to make Turing complete if we just have an eigenvector representation figured out.

ndm000 • 17 hours ago

One thing I have not seen commented on is that ARC-AGI is a visual benchmark but LLMs are primarily text. For instance when I see one of the ARC-AGI puzzles, I have a visual representation in my brain and apply some sort of visual reasoning to solve it. I can "see" in my mind's eye the solution to the puzzle. If I didn't have that capability, I don't think I could reason through words how to go about solving it - it would certainly be much more difficult.

I hypothesize that something similar is going on here. OpenAI has not published (or I have not seen) the number of reasoning tokens it took to solve these - we do know that each task was thousands of dollars. If "a picture is worth a thousand words", could we make AI systems that can reason visually with much better performance?

krackers • 16 hours ago

Yeah this part is what makes the high performance even more surprising to me. The fact that LLMs are able to do so well on visual tasks (also seen with their ability to draw an image purely using textual output https://simonwillison.net/2024/Oct/25/pelicans-on-a-bicycle/) implies that not only do they actually have some "world model" but that this is in spite of the disadvantage given by having to fit a round peg in a square hole. It's like trying to map out the entire world using the orderly left-brain, without a more holistic spatial right-brain.

I wonder if anyone has experimented with having some sort of "visual" scratchpad instead of the "text-based" scratchpad that CoT uses.

skydhash • 13 hours ago

A file is a stream of symbols encoded by bits according to some format. It’s pretty much 1D. It would be surprising if an LLM couldn’t extract information from a file or a data stream.

csomar • 13 hours ago

This is not new. When GPT-4 was released I was able to get it to generate SVGs; they were ugly, but they had the basics.

mortehu • 14 hours ago

The chart is super misleading, since the test was obscure until recently. A few months ago he announced he'd made the only good AGI test and offered a cash prize for solving it, only to find out in as much time that it's no different from other benchmarks.

t0lo • 16 hours ago

I'm 22 and have no clue what I'm meant to do in a world where this is a thing. I'm moving to a semi rural, outdoorsy area where they teach data science and marine science and I can enjoy my days hiking, and the march of technology is a little slower. I know this will disrupt so much of our way of life, so I'm chasing what fun innocent years are left before things change dramatically.

brysonreece • 16 hours ago

It's worth noting that LLMs have been part of the tech zeitgeist for over two years and have had a pretty limited impact on hireability for roles, despite what people like the Klarna CEO are saying. Personally, I'm betting on two things:

* The upward bound of compute/performance gains as we continue to iterate on LLMs. It simply isn't going to be feasible for a lot of engineers and businesses to run/train their own LLMs. This means an inherent reliance on cloud services to bridge the gap (something MS is clearly betting on), and engineers to build/maintain the integration from these services to whatever business logic their customers are buying.

* Skilled knowledge workers continuing to be in-demand, even factoring in automation and new-grad numbers. Collectively, we've built a better hammer; it still takes someone experienced enough to know where to drive the nail. These tools WILL empower the top N% of engineers to be more productive, which is why it will be more important than ever to know _how_ to build things that drive business value, rather than just how to churn through JIRA tickets or turn a pretty Figma design into React.

byyoung3 • 15 hours ago

o8 will probably be able to handle datacenter management

toomuchtodo • 15 hours ago

https://www.youtube.com/watch?v=Yvs7f4UaKLo

byyoung3 • 3 hours ago

exactly

VonTum • 12 hours ago

I agree completely. This is a fundamentally different change than the ones that came before. Calculators, assemblers, higher level languages, none of these actually removed the _reasoning_ the engineer has to do, they just provide abstractions that make this reasoning easier. What reason is there to believe LLMs will remain "assistants" instead of becoming outright replacements? If LLMs can do the reasoning all the way from high level description down to implementation, what prevents them from doing the high level describing too?

In general, with the technology advancing as rapidly as it is, and the trillions of dollars oriented towards replacing knowledge work, I don't see a future in this field. And that's despite me being on a very promising path myself! I'm 25, in the middle of a CS PhD in Germany, with an impressive CV behind me. My head may be the last on the chopping block, but I'd be surprised if it buys me more than a few years once programmer obsolescence truly kicks in.

Indeed, what I think are safe jobs are jobs with fundamental human interaction. Nurses, doctors, kindergarten teachers. I myself have been considering pivoting to becoming a skiing teacher.

Maybe one good thing that comes out of this is breaking my "wunderkind" illusion. I spent my teens writing C++ code instead of going out socializing and making friends. Of course, I still did these things, but I could've been far less of a hermit.

I mirror your sentiment of spending these next few years living life; Real life. My advice: Stop sacrificing the now for the future. See the world, go on hikes with friends, go skiing, attend that bouldering thing your friends have been telling you about. If programming is something you like doing, then by all means keep going and enjoy it. I will likely keep programming too, it's just no longer the only thing I focus on.

Edit: improve flow of last paragraph

darkgenesha • 12 hours ago

What was it that initially inspired you to learn to code? Was it robots, video games, design, etc... Whatever that was, creating the pinnacle of it is what your future will be.

VonTum • 4 hours ago

It was the challenge for me. Seeing some difficult-to-solve problem, attacking it, and actually solving it after much perseverance.

Kind of stemming from the mindspace "If they can build X, I can build X!"

I'd explicitly not look up tutorials, just so I'd have the opportunity to solve the mathematics myself. Like building a 3D physics engine. (I did look up collision detection after struggling with it for a month or so, inventing GJK is on another level)

schappim • 16 hours ago

I completely understand how you feel - I'm in my 40s, and I often find myself questioning what direction to take in this rapidly changing world. On top of that, I'm unsure whether advising my kids to go to university is still the right path for their future.

Everything seems so uncertain, and the pace of technological advancement makes long-term planning feel almost impossible. Your plan to move to a slower-paced area and enjoy the outdoors sounds incredibly grounding - it's something I've been considering myself.

rtsil • 16 hours ago

I tell everyone who would listen to me (i.e. not many) that white collar jobs like mine are dead and skilled manual work is the way of the near future, that is until the rise of the robots.

dyauspitr • 12 hours ago

Robots are going to go hand in hand with AI. Pretty sure our problems right now are not with the physical hardware that can far outperform a human already, it’s in the control software.

t0lo • 12 hours ago

Robots can only proliferate at the speed of real world logistics and resource management and I think will always be a little difficult.

AI can be anywhere any time with cloud compute.

aryonoco • 16 hours ago

I advise my kids to stay curious, keep learning, keep wondering, keep discovering. Whether that's through university or some other path.

salter2 • 15 hours ago

I'm the same age as you; I feel lost, erring in being a little too pessimistic.

Feels like I hit the real world just a couple years too late to get situated in a solid position. Years of obsession in attempt to catch up to the wizards, chasing the tech dream. But this, feels like this is it. Just watching the timebomb tick. I'd love to work on what feels like the final technology, but I'm not a freakshow like what these labs are hiring. At least I get to spectate the creation of humanity's greatest invention.

This announcement is just another gut punch, but at this point I should expect it; it's inevitable. A Jason Voorhees AGI, slowly but surely to devour all the talents and skills information workers have to offer.

Apologies for the rambly and depressing post, but this is reality for anyone recently out or still in school.

neom • 12 hours ago

Put another way, you have deep conviction in a change that the vast majority of people have not even seen yet, never mind grokked, and you're young enough to spend some decent amount of time on education for "venn'ing" yourself into a useful tool in the future. If you have a baseline education, there are any number of orthogonal skills you could add, be it philosophy, fine art, medicine, whatever. You know how to skate and you know where the puck is going; most people don't even see the rink.

t0lo • 9 hours ago

At least you're disillusioned with the idea of a long term career before a lot of other people. It's disturbing seeing how ready people are to go into a lifelong career and expecting stability and happiness in the world we're heading into.

We are living in a world run by and for the soon to be dead, many of which have dementia, so empathic policy and foresight is out of the question, and we're going to be picking up the incredibly broken scraps of our golden age.

And not to get too political but the mass restructuring of public consciousness and intellectual society due to mass immigration for an inexplicable gdp squeeze and social media is happening at exactly the wrong time to handle these very serious challenges. The speed at which we've undone civil society is breakneck, and it will go even further, and it will get even worse. We've easily gone back 200 years in terms of emotional intelligence in the past 15.

karmasimida • 16 hours ago

While I understand why you feel this way, the meaning or standing of being a programmer is different now. It feels like the purpose is lost or it no longer belongs to humans.

But below is reality talk. With Claude 3.5, I already think it is a better programmer than I at micro level tasks, and a better Leetcode programmer than I could ever be.

I think it is like modern car manufacturing: the robots build most of the components, but I can’t see how humans could be dismissed from the process of overseeing output.

O3 has been very impressive in achieving 70+ on SWE-bench, for example, but this also means that even when it has been trained on the codebase multiple times, so visibility isn't an issue, it still has a 30% chance of failing the unit tests.

A fully autonomous system can’t be trusted, the economy of software won’t collapse, but it will be transformed beyond our imagination now.

I will for sure miss the days when writing code, or being a coder, was still a real business.

How time flies

Kostchei • 15 hours ago

Developer. Prompt Engineer. Philosopher-Builder. (mostly) not programmer.

The code part will get smaller and smaller for most folks. Some frameworks or bare-metal people or intense heavy-lifters will still do manual code or pair-programming where half the pair is an agentic AI with super-human knowledge of your org's code base.

But this will be a layer of abstraction for most people who build software. And as someone who hates rote learning, I'm here for it. IMO.

Unfortunately (?) I think the 10-20-50? years of development experience you might bring to bear on the problems can be superseded by an LLM finetuned on stackoverflow, github etc once judgement and haystack are truly nailed. Because it can have all that knowledge you have accumulated, and soaked into a semi-conscious instinct that you use so well you aren't even aware of it except that it works. It can have that a million times over. Actually. Which is both amazing and terrifying. Currently this isn't obvious because its accuracy/judgement to learn all those life-of-a-dev lessons is almost non-existent. Currently. But it will happen. That is copilot's future. Its raison d'être.

I would argue what it will never have however, simply by function of the size of training runs is unique functional drive and vision. If you wanted a "Steve Jobs" AI you would have to build it. And if you gave it instructions to make a prompt/framework to build a "Jobs" it would just be an imitation, rather than a new unique in-context version. That is the value a person has- their particular filter, their passion and personal framework. Someone who doesn't have any of those things, they had better be hoping for UBI and charity. Or go live a simple life, outside the rat race.

bows

t0lo • 14 hours ago

I'm hoping it's similar to the abacus for maths, the elimination of human "calculators" like on the apollo missions, and we just ended up moving onto different, harder, more abstract problems, and forget that we ever had to climb such small hills. AI's evolution and integration is more multifaceted though and much more unpredictable.

But unlike the abacus/calculators i don't feel like we're at a point in history where society is getting wiser and more empathetic, and these new abilities are going towards something good.

But supervisors of tasks will remain because we're social, untrusting, and employers will always want someone else to blame for their shortcomings. And humans will stay in the chain at least for marketing and promotion/reputation because we like our japanese craftsman and our amg motors made by one person.

rich_sasha • 16 hours ago

I feel your anxiety. I often wonder how I arrange the remaining many decades of my life to maintain a stream of income.

Perhaps what I need is actually a steady stream of food - i.e. buy some land and oxen and solar panels while I can.

Havoc • 15 hours ago

>I'm 22 and have no clue what I'm meant to do in a world where this is a thing.

For what it's worth that's probably an advantage versus the legions of people who are staring down the barrel of years invested into skills that may lose relevance very rapidly.

ec109685 • 14 hours ago

If information technology workers become twice as productive, you’ll want more of them for your business, not less.

There are way more data analysts now than when it required paper and pencil.

mrcwinn • 16 hours ago

On the contrary I think you already have an excellent plan.

t0lo • 16 hours ago

I'm happy enough with it, but I'm also a little sad that it's essentially been chosen for me because of weak willed and valued people who don't want to use policy to make things better for us as a society. Plus we are in a bad world/scenario for AI advancements to come into with pretty heavy institutional decay and loss of political checks and balances.

It's like my life is forfeit to fixing other people's mistakes because they're so glaring and I feel an obligation. Maybe that's the way the world's always been, but it's a concerning future right now.

aryonoco • 16 hours ago

Our way of life changed when electricity came around. It changed when cars took over the cities, it again changed when mobile phones became omnipresent.

With LLMs or without LLMs, the world will keep turning. Humans will still be writing amazing works of literature, creating beautiful art, carrying out scientific experiments and discovering new species.

mukunda_johnson • 15 hours ago

Deciphering patterns in natural language is more complex than these puzzles. If you train your AI to solve these puzzles, we end up in the same spot. The difficulty of solving would be with creating training data for a foreign medium. The "tokens" are the grids and squares instead of words (for words, we have the internet of words, solving that).

If we're inferring the answers of the block patterns from minimal or no additional training, it's very impressive, but how much time have they had to work on O3 after sharing puzzle data with O1? Seems there's some room for questionable antics!

flakiness • 21 hours ago

The cost axis is interesting. O3 Low is $10+ per task and O3 High is over $1000 (it's a logarithmic graph so it's like $50 and $5000 respectively?)
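A minimal sketch of how a point between two decade ticks on a log axis maps to a dollar value. The function names are hypothetical and the positions used are illustrative, not read off the actual chart:

```python
import math

def value_on_log_axis(lower_tick: float, fraction: float) -> float:
    """Value at `fraction` (0..1) of the way from lower_tick to 10 * lower_tick."""
    return lower_tick * 10 ** fraction

def fraction_for_value(lower_tick: float, value: float) -> float:
    """Inverse: how far along the decade a given value sits."""
    return math.log10(value / lower_tick)

# The visual midpoint of a decade is about 3.16x the lower tick, not 5x.
midpoint_of_10_to_100 = value_on_log_axis(10, 0.5)

# A value of 5x the lower tick sits roughly 70% of the way along the decade.
position_of_5000 = fraction_for_value(1000, 5000)

print(f"midpoint between $10 and $100: ${midpoint_of_10_to_100:.2f}")
print(f"$5000 sits {position_of_5000:.0%} of the way from $1000 to $10000")
```

So guessing $50 and $5000 only holds if the points sit about 70% of the way through their decades; a point that looks halfway is closer to $32 or $3200.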

onemetwo • 21 hours ago

In (1) the author uses a technique to improve the performance of an LLM: he trained sonnet 3.5 to obtain 53.6% on the arc-agi-pub benchmark, and he said that more compute would give better results. So the o3 results could have been produced the same way, using the same method with more compute; if that is the case, the o3 result is not very interesting.

(1) https://params.com/@jeremy-berman/arc-agi

energy123 • 14 hours ago

At about 12-14 minutes in OpenAI's YouTube vid they show that o3-mini beats o1 on Codeforces despite using much less compute.

attentionmech • 21 hours ago

Isn't this at the level now where it can sort of self improve. My guess is that they will just use it to improve the model and the cost they are showing per evaluation will go down drastically.

So, next step in reasoning is open world reasoning now?

dyauspitr • 7 hours ago

I don’t believe so. If it’s at the point where you could just plug it into a bunch of camera feeds around the world and it could only filter out a useful training set for itself out of that data then we truly would have AGI. I don’t think it’s there yet.

joshdavham • 10 hours ago

A lot of the comments seem very dismissive and a little overly-skeptical in my opinion. Why is this?

Bjorkbat • 21 hours ago

I was impressed until I read the caveat about the high-compute version using 172x more compute.

Assuming for a moment that the cost per task has a linear relationship with compute, then it costs a little more than $1 million to get that score on the public eval.

The results are cool, but man, this sounds like such a busted approach.
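A rough sketch of that estimate, assuming cost scales linearly with compute. Only the 172x multiplier comes from this comment; the ~$3400 per-task figure is the rough number quoted elsewhere in this thread, and the 400-task size of the public eval set is an assumption:

```python
# Back-of-envelope check; inputs are assumptions, not official figures.
COMPUTE_MULTIPLIER = 172        # high-compute vs low-compute, as stated above
HIGH_COST_PER_TASK_USD = 3400   # rough per-task figure quoted elsewhere in the thread
NUM_PUBLIC_EVAL_TASKS = 400     # assumed size of the public eval set

def total_cost(cost_per_task: float, num_tasks: int) -> float:
    """Total cost of running every task once at the given per-task cost."""
    return cost_per_task * num_tasks

# Under the linear assumption, the low-compute per-task cost is the
# high-compute cost divided by the compute multiplier.
implied_low_cost_per_task = HIGH_COST_PER_TASK_USD / COMPUTE_MULTIPLIER

high_total = total_cost(HIGH_COST_PER_TASK_USD, NUM_PUBLIC_EVAL_TASKS)
low_total = total_cost(implied_low_cost_per_task, NUM_PUBLIC_EVAL_TASKS)

print(f"implied low-compute cost per task: ${implied_low_cost_per_task:.2f}")
print(f"high-compute total: ${high_total:,.0f}")
print(f"low-compute total:  ${low_total:,.0f}")
```

With these inputs the high-compute run lands a bit above $1 million, and the low-compute run in the high four figures.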

futureshock • 20 hours ago

So what? I’m serious. Our current level of progress would have been sci-fi fantasy with the computers we had in 2000. The cost may be astronomical today, but we have proven a method to achieve human performance on tests of reasoning over novel problems. WOW. Who cares what it costs. In 25 years it will run on your phone.

radioactivist • 18 hours ago

So your claim for optimism here is that something today that took ~10^22 floating point operations (based on an estimate earlier in the thread) to execute will be running on phones in 25 years? Phones which are currently running at O(10^12) flops. That means ten orders of magnitude of improvement for that to run in a reasonable amount of time? It's a similar scale up in going from ENIAC (500 flops) to a modern desktop (5-10 teraflops).
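A quick sanity check of those magnitudes, using the figures in this comment as given (the 10^22 operation count is itself an estimate from earlier in the thread, and 5 teraflops is the low end of the desktop range):

```python
import math

TASK_FLOP = 1e22         # total operations for the run (thread estimate)
PHONE_FLOPS = 1e12       # order-of-magnitude phone throughput, per second
ENIAC_FLOPS = 500        # ENIAC throughput as stated above
DESKTOP_FLOPS = 5e12     # low end of the 5-10 teraflops range

SECONDS_PER_YEAR = 365.25 * 24 * 3600

def orders_of_magnitude(ratio: float) -> float:
    """Base-10 logarithm of a ratio, i.e. how many factors of ten it spans."""
    return math.log10(ratio)

# Time for today's phone to grind through the whole workload.
phone_seconds = TASK_FLOP / PHONE_FLOPS
phone_years = phone_seconds / SECONDS_PER_YEAR

# Speedup needed to finish in about one second, versus the ENIAC-to-desktop gap.
gap_to_one_second = orders_of_magnitude(phone_seconds)
eniac_to_desktop = orders_of_magnitude(DESKTOP_FLOPS / ENIAC_FLOPS)

print(f"phone runtime today: {phone_seconds:.0e} s (~{phone_years:.0f} years)")
print(f"orders of magnitude to reach ~1 s: {gap_to_one_second:.1f}")
print(f"ENIAC -> desktop: {eniac_to_desktop:.1f} orders of magnitude")
```

Both gaps come out at about ten orders of magnitude, which is the comparison being made.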

futureshock • 17 hours ago

That sounds reasonable to me because the compute cost for this level of reasoning performance won’t stay at 10^22 and phones won’t stay at 10^12. This reasoning breakthrough is about 3 months old.

radioactivist • 17 hours ago

I think expecting five orders of magnitude improvement from either side of this (inference cost or phone performance) is insane.

Bjorkbat • 20 hours ago

It's not so much the cost as the fact that they got a slightly better result by throwing 172x more compute at each task. The fact that it may have cost somewhere north of $1 million simply helps to give a better idea of how absurd the approach is.

It feels a lot less like the breakthrough when the solution looks so much like simply brute-forcing.

But you might be right, who cares? Does it really matter how crude the solution is if we can achieve true AGI and bring the cost down by increasing the efficiency of compute?

futureshock • 17 hours ago

“Simply brute-forcing”

That’s the thing that’s interesting to me though and I had the same first reaction. It’s a very different problem than brute-forcing chess. It has one chance to come to the correct answer. Running through thousands or millions of options means nothing if the model can’t determine which is correct. And each of these visual problems involve combinations of different interacting concepts. To solve them requires understanding, not mimicry. So no matter how inefficient and “stupid” these models are, they can be said to understand these novel problems. That’s a direct counter to everyone who ever called these a stochastic parrot and said they were a dead-end to AGI that was only searching an in distribution training set.

The compute costs are currently disappointing, but so was the cost of sequencing the first whole human genome. That went from 3 billion to a few hundred bucks from your local doctor.

RivieraKid • 20 hours ago

It sucks that I would love to be excited about this... but I mostly feel anxiety and sadness.

Jcampuzano2 • 18 hours ago

Same, it's sad but I honestly hoped they never achieved these results and it came out that it wasn't possible or would take an insurmountable amount of resources, but here we are on the verge of making most humans useless when it comes to productivity.

While there are those that are excited, the world is not prepared for the level of distress this could put on the average person without critical changes at a monumental level.

JacksCracked • 17 hours ago

If you don't feel like the world needed grand scale changes at a societal level with all the global problems we're unable to solve, you haven't been paying attention. Income inequality, corporate greed, political apathy, global warming.

phito • 10 hours ago

AI will fix none of that

sensanaty • 17 hours ago

And you think the bullshit generators backed by the largest corporate entities in humanity who are, as we speak, causing all the issues you mention are somehow gonna solve any of this?

CamperBob2 • 13 hours ago

crakhamster01 • 16 hours ago

Well said! There's no way big tech and institutional investors are pouring billions of dollars into AI because of corporate greed. It's definitely so that they can redistribute wealth equally once AGI is achieved.

gom_jabbar • 19 hours ago

Anxiety and sadness are actually mild emotional responses to the dissolution of human culture. Nick Land in 1992:

"It is ceasing to be a matter of how we think about technics, if only because technics is increasingly thinking about itself. It might still be a few decades before artificial intelligences surpass the horizon of biological ones, but it is utterly superstitious to imagine that the human dominion of terrestrial culture is still marked out in centuries, let alone in some metaphysical perpetuity. The high road to thinking no longer passes through a deepening of human cognition, but rather through a becoming inhuman of cognition, a migration of cognition out into the emerging planetary technosentience reservoir, into 'dehumanized landscapes ... emptied spaces' where human culture will be dissolved. Just as the capitalist urbanization of labour abstracted it in a parallel escalation with technical machines, so will intelligence be transplanted into the purring data zones of new software worlds in order to be abstracted from an increasingly obsolescent anthropoid particularity, and thus to venture beyond modernity. Human brains are to thinking what mediaeval villages were to engineering: antechambers to experimentation, cramped and parochial places to be.

[...]

Life is being phased-out into something new, and if we think this can be stopped we are even more stupid than we seem." [0]

Land is being ostracized for some of his provocations, but it seems pretty clear by now that we are in the Landian Accelerationism timeline. Engaging with his thought is crucial to understanding what is happening with AI, and what is still largely unseen, such as the autonomization of capital.

[0] https://retrochronic.com/#circuitries

achierius • 7 hours ago

It's obvious that there are lines of flight (to take a Deleuzian tack, a la Land) away from the current political-economic assemblage. For example, a strategic nuclear exchange starting tomorrow (which can always happen -- technical errors, a rogue submarine, etc.) would almost certainly set back technological development enough that we'd have no shot at AI for the next few decades. I don't know whether you agree with him, but I think the fact that he ignores this fact is quite unserious, especially given the likely destabilizing effects sub-AGI AI will have on international politics.

larve • 8 hours ago

I have been diving deep into LLM coding over the last 3 years and regularly encountered that feeling along the way. I still at times have a "wtf" moment where I need to take a break. However, I have been able to quell most of my anxieties around my job / the software profession in general (I've been at this professionally for 25+ years and software has been my dream job since I was 6).

For one, I found AI coding to work best in a small team, where there is an understanding of what to build and how to build it, usually in close feedback loop with the designers / users. Throw the usual managerial company corporate nonsense on top and it doesn't really matter if you can instacreate a piece of software, if nobody cares for that piece of software and it's just there to put a checkmark on the Q3 OKR reports.

Furthermore, there is a lot of software to be built out there, for people who can't afford it yet. A custom POS system for the local baker so that they don't have to interact with a computer. A game where squids eat algae for my nephews at christmas. A custom photo layout software for my dad who despairs at indesign. A plant watering system for my friend. A local government information website for older citizens. Not only can these be built at a fraction of the cost they were before, but they can be built in a manner where the people using the software are directly involved in creating it. Maybe they can get an 80% hacked version together if they are technically inclined. I can add the proper database backend and deployment infrastructure. Or I can sit with them and iterate on the app as we are talking. It is also almost free to create great documentation; in fact, LLM development is most productive when you turn software engineering best practices up to 11.

Furthermore, I found these tools incredible for actively furthering my own fundamental understanding of computer science and programming. I can now skip the stuff I don't care to learn (is it foobarBla(func, id) or foobar_bla(id, func)) and put the effort where I actually get a long-lived return. I have become really ambitious with the things I can tackle now, learning about all kinds of algorithms and operating system patterns and chemistry and physics etc... I can also create documents to help me with my learning.

Local models are now entering the phase where they are getting to be really useful, definitely > gpt3.5 which I was able to use very productively already at the time.

Writing (creating? manifesting? I don't really have a good word for what I do these days) software that makes me and real humans around me happy is extremely fulfilling, and has alleviated most of my angst around the technology.

pupppet • 20 hours ago

We’re enabling a huge swath of humanity being put out of work so a handful of billionaires can become trillionaires.

abiraja • 19 hours ago

And also the solving of hundreds of diseases that ail us.

lewhoo • 19 hours ago

One of the biggest factors in risk of death right now is poverty. Also what is being chased right now is "human level on most economically viable tasks" because the automated research for solving physics etc. even now seems far-fetched.

asdf6969 • 18 hours ago

Why do you think you’ll be able to afford healthcare? The new medicine is for the AI owners

thrance • 19 hours ago

You need to solve diseases and make the cure available. Millions die of curable diseases every year, simply because they are not deemed useful enough. What happens when your labor becomes worthless?

hartator • 19 hours ago

It doesn’t matter. Statists would rather be poor, sick, and dead than risk trillionaires.

thrance • 19 hours ago

You should read about workers' rights in the Gilded Age, and see how good laissez-faire capitalism was. What do you think will happen when the only thing you can trade with the trillionaires, your labor, becomes worthless?

xvector • 20 hours ago

Humanity is about to enter an even steeper hockey stick growth curve. Progressing along the Kardashev scale feels all but inevitable. We will live to see Longevity Escape Velocity. I'm fucking pumped and feel thrilled and excited and proud of our species.

Sure, there will be growing pains, friction, etc. Who cares? There always is with world-changing tech. Always.

lewhoo • 18 hours ago

> Sure, there will be growing pains, friction, etc. Who cares?

That's right. Who cares about pains of others and why they even should are absolutely words to live by.

xvector • 17 hours ago

Yeah, with this mentality, we wouldn't have electricity today. You will never make transition to new technology painless, no matter what you do. (See: https://pessimistsarchive.org)

What you are likely doing, though, is making many more future humans pay a cost in suffering. Every day we delay longevity escape velocity is another 150k people dead.

lewhoo • 17 hours ago

There was a time when in the name of progress people were killed for whatever resources they possessed, others were enslaved etc. and I was under the impression that the measure of our civilization is that we actually DID care and just how much. It seems to me that you are very eager to put up altars of sacrifice without even thinking that the problems you probably have in mind are perfectly solvable without them.

smokedetector1 • 17 hours ago

achierius • 7 hours ago

https://www.transformernews.ai/p/richard-ngo-openai-resign-s...

Almost every single one of the people OpenAI had hired to work on AI safety has left the firm with similar messages. Perhaps you should at least consider the thinking of experts?

You and I will likely not live to see much of anything past AGI.

croemer • 19 hours ago

Longevity Escape Velocity? Even if you had orders of magnitude more people working on medical research, it's not a given that prolonging life indefinitely is even possible.

soheil • 17 hours ago

Of course it's a given. Unless you want to invoke supernatural causes, the human brain is a collection of cells with electro-chemical connections that, if fully reconstructed either physically or virtually, would necessarily need to represent the original person's brain. Therefore with sufficient intelligence it would be possible to engineer technology that would be able to do that reconstruction without even having to go to the atomic level, which we also have a near full understanding of already.

tokioyoyo • 19 hours ago

My job should be secure for a while, but why would an average person give a damn about humanity when they might lose their jobs and comfort levels? If I had kids, I would absolutely hate this uncertainty as well.

“Oh well, I guess I can’t give the opportunities to my kid that I wanted, but at least humanity is growing rapidly!”

xvector • 19 hours ago

> when they might lose their jobs and comfort levels?

Everyone has always worried about this for every major technology throughout history

IMO AGI will dramatically increase comfort levels, lower your chance of dying, death, disease, etc.

tokioyoyo • 19 hours ago

MVissers • 12 hours ago

We almost wiped ourselves out in a nuclear war in the '70s. If that had happened, would it have been worth it? Probably not.

Beyond immediate increase in inequality, which I agree could be worth it in the long run if this was the only problem, we're playing a dangerous game.

The smartest and most capable species on the planet that dominates it for exactly this reason, is creating something even smarter and more capable than itself in the hope it'd help make its life easier.

Hmm.

asdf6969 • 18 hours ago

I would rather follow in the steps of uncle Ted than let AI turn me into a homeless person. It’s no consolation that my tent will have a nice view of a lunar colony.

drcode • 20 hours ago

longevity for the AIs

soheil • 17 hours ago

I agree, save invoking supernatural causes, the human brain is a collection of cells with electro-chemical connections that if fully reconstructed either physically or virtually would necessarily need to represent the original person's brain. Therefore with sufficient intelligence it would be possible to engineer technology that would be able to do that reconstruction without even having to go to the atomic level, which we also have a near full understanding of already.

objektif • 18 hours ago

You sound like a rich person.

the5avage • 3 hours ago

The examples unsolved by high compute o3 look a lot like the Raven's Progressive Matrices used in IQ tests.

SerCe • 13 hours ago

> You'll know AGI is here when the exercise of creating tasks that are easy for regular humans but hard for AI becomes simply impossible.

You'll know AGI is here when traditional captchas stop being a thing due to their lack of usefulness.

thallium205 • 13 hours ago

Captchas are already completely useless.

CamperBob2 • 13 hours ago

(Shrug) AI has been better than humans at solving CAPTCHAs for a LONG time. As the sibling points out, they're just a waste of time and electricity at this point.

darkgenesha • 12 hours ago

Ironically, they are used as free labor to label image sets for ai to be trained on.

Seattle3503 • 20 hours ago

How can there be "private" tasks when you have to use the OpenAI API to run queries? OpenAI sees everything.

nmca • 16 hours ago

We worked with ARC to run inference on the semi-private tasks last week, after o3 was trained, using an inference only API that was sent the prompts but not the answers & did no durable logging.

idontknowmuch • 11 hours ago

What's your opinion on the veracity of this benchmark - given o3 was fine-tuned and others were not? Can you give more details on how much data was used to fine-tune o3? It's hard to put this into perspective given this confounder.

nmca • 2 hours ago

I can’t provide more information than is currently public, but from the ARC post you’ll note that we trained on about 75% of the train set (which contains 400 examples total); which is within the ARC rules, and evaluated on the semiprivate set.

thisisthenewme • 18 hours ago

I feel like AI is already changing how we work and live - I've been using it myself for a lot of my development work. Though, what I'm really concerned about is what happens when it gets smart enough to do pretty much everything better (or even close) than humans can. We're talking about a huge shift where first knowledge workers get automated, then physical work too. The thing is, our whole society is built around people working to earn money, so what happens when AI can do most jobs? It's not just about losing jobs - it's about how people will pay for basic stuff like food and housing, and what they'll do with their lives when work isn't really a thing anymore. Or do people feel like there will be jobs safe from AI? (hopefully also fulfilling)

Some folks say we could fix this with universal basic income, where everyone gets enough money to live on, but I'm not optimistic that it'll be an easy transition. Plus, there's this possibility that whoever controls these 'AGI' systems basically controls everything. We definitely need to figure this stuff out before it hits us, because once these changes start happening, they're probably going to happen really fast. It's kind of like we're building this awesome but potentially dangerous new technology without really thinking through how it's going to affect regular people's lives. I feel like we need a parachute before we attempt a skydive. Some people feel pretty safe about their jobs and think they can't be replaced. I don't think that will be the case. Even if AI doesn't take your job, you now have a lot more unemployed people competing for the same job that is safe from AI.

neom • 18 hours ago

I spend quite a lot of time noodling on this. The thing that became really clear from this o3 announcement is that the "throw a lot of compute at it and it can do insane things" line of thinking continues to hold very true. If that is true, is the right thing to do productize it (use the compute more generally) or apply it (use the compute for very specific incredibly hard and ground breaking problems)? I don't know if any of this thinking is logical or not, but if it's a matter of where to apply the compute, I feel like I'd be more inclined to say: don't give me AI, instead use AI to very fundamentally shift things.

para_parolu • 17 hours ago

From inside the IT bubble it’s very easy to get the impression that AI will replace most people. Most of the people on my street do not work in IT: teacher, nurse, hobby shop owner, construction workers, etc. Sure, programming and other virtual work may become lower-paid jobs, but it’s not the end of the world.

dyauspitr • 7 hours ago

Honestly with o3 levels of reasoning generating control software for robots on the fly, none of the above seem safe. For a decade or two at the most if that.

lacedeconstruct • 18 hours ago

I am pretty sure we will have a deep cultural repulsion from it, and people will pay serious money to have an AI-free experience. If AI becomes actually useful, there are a lot of areas that we don't even know how to tackle, like medicine and biology; I don't think anything would change otherwise. AI will take jobs, but it will open a lot more jobs at a much higher abstraction. 50 years ago the idea that software engineering would become a get-rich-quick job would have been insane, imo.

cerved • 18 hours ago

> Though, what I'm really concerned about is what happens when it gets smart enough to do pretty much everything better (or even close)

I'll get concerned when it stops sucking so hard. It's like talking to a dumb robot. Which it unsurprisingly is.

vouaobrasil • 17 hours ago

A possibility is a coalition of people who refuse to use AI and who refuse to do business with those who use AI. If the coalition grows large enough, AI can be stopped by economic attrition.

sumedh • 11 hours ago

> of people who refuse to use AI and who refuse to do business with those who use AI.

Do people refuse to buy from stores which get goods manufactured by slave labour?

Most people don't care. If AI businesses are offering goods/services at a lower cost, people will vote with their wallets, not their principles.

vouaobrasil • 5 hours ago

AI could be different. At least, I'm willing to try to form a coalition.

Besides, AI researchers failed to make anything like a real Chatbot until recently, yet they've been trying since the Eliza days. I'm willing to put in at least as much effort as them.

globular-toast • 16 hours ago

I get LLMs to make k8s manifests for me. It gets it wrong, sometimes hilariously so, but still saves me time. That's because the manifests are in yaml, a language. The leap between that and inventing Kubernetes is one I can't see yet.

spyckie2 • 20 hours ago

The more Hacker News worthy discussion is the part where the author talks about search through the possible mini-program space of LLMs.

It makes sense because tree search can be endlessly optimized. In a sense, LLMs turn the unstructured, open system of general problems into a structured, closed system of possible moves. Which is really cool, IMO.

glup • 18 hours ago

Yes! This seems to be a really neat combination of 2010's Bayesian cleverness / Tenenbaumian program search approaches with the LLMs as merely sources of high-dim conditional distributions. I knew people were experimenting in this space (like https://escholarship.org/uc/item/7018f2ss) but didn't know it did so well wrt these new benchmarks.

roboboffin • 19 hours ago

Interesting that in the video, there is an admission that they have been targeting this benchmark. A comment that was quickly shut down by Sam.

A bit puzzling to me. Why does it matter ?

HarHarVeryFunny • 16 hours ago

It matters to the extent that they want to market this as general intelligence, not as a collection of narrow intelligences (math, competitive programming, ARC puzzles, etc).

In reality it seems to be a bit of both - there is some general intelligence based on having been "trained on the internet", but it seems these super-human math/etc skills are very much from them having focused on training on those.

roboboffin • 7 hours ago

However, the way it is progressing is that the SOTA is saturating the current benchmarks; then a new one is conceived as people understand the nature of what it means to be intelligent. It seems only natural to concentrate on one benchmark at a time.

Francois Chollet mentioned that the test tries to avoid curve fitting (which he states is the main ability of LLMs). However, they specifically restricted the number of examples to do this. It is not beyond the realms of possibility that many examples could have been generated by hand though, and that the curve fitting has been achieved, rather than discrete programming.

Anyway, it’s all supposition. It’s difficult to know how genuine the result is without knowledge of how it was actually achieved.

mukunda_johnson • 15 hours ago

I always smell foul play from Sam. I'd bet they are doing something silly to inflate the benchmark score. Not saying they are, but Sam is the type of guy to put a literal dumb human in the API loop and score "just as high as a human would."

digitcatphd • 7 hours ago

> o3 fixes the fundamental limitation of the LLM paradigm – the inability to recombine knowledge at test time – and it does so via a form of LLM-guided natural language program search

This is significant, but I am doubtful it will be as meaningful as people expect aside from potentially greater coding tasks. Without a 'world model' that has a contextual understanding of what it is doing, things will remain fundamentally throttled.

mattfrommars • 13 hours ago

Guys, it's already happening. I recently got laid off due to AI taking over my job.

dimgl • 12 hours ago

What did you do? Can you elaborate?

Woodi • 6 hours ago

So the article seriously and scientifically states:

"Our compilation of programs (AI) gave 90% correct answers in test 1. We expect that in test 2 the quality of answers will degenerate to below random-monkey-pushing-buttons levels. Now more money is needed to prove we hit a blind alley."

Hurray! Put a limited version of that on everybody's phones!

smy20011 • 21 hours ago

It seems O3 is following the trend of chess engines, where you can cut your search depth depending on the state.

It's good for games with a clear signal of success (win/lose for chess, tests for programming). One of the blockers for AGI is that we don't have clear evaluation for most of our tasks and we cannot verify them fast enough.

hackpert • 11 hours ago

If anyone else is curious about which ARC-AGI public eval puzzles o3 got right vs wrong (and its attempts at the ones it did get right), here's a quick visualization: https://arcagi-o3-viz.netlify.app

whoistraitor • 21 hours ago

The general message here seems to be that inference-time brute-forcing works as long as you have a good search and evaluation strategy. We’ve seemingly hit a ceiling on the base LLM forward-pass capability so any further wins are going to be in how we juggle multiple inferences to solve the problem space. It feels like a scripting problem now. Which is cool! A fun space for hacker-engineers. Also:

> My mental model for LLMs is that they work as a repository of vector programs. When prompted, they will fetch the program that your prompt maps to and "execute" it on the input at hand. LLMs are a way to store and operationalize millions of useful mini-programs via passive exposure to human-generated content.

I found this such an intriguing way of thinking about it.
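
For what it's worth, the "juggle multiple inferences" idea can be sketched in a few lines. This is a toy illustration only, not how o3 works (that isn't public): `sample_candidate` is a made-up stand-in for one model call, the 40% hit rate is invented, and majority voting is just one assumed way to evaluate and aggregate samples.

```python
import random
from collections import Counter

def sample_candidate(task, rng):
    """Hypothetical stand-in for one model inference (correct 40% of the time)."""
    if rng.random() < 0.4:
        return task["answer"]
    return rng.choice(task["distractors"])

def solve_by_voting(task, n_samples, rng):
    """Draw n_samples candidates and keep the most frequent one."""
    votes = Counter(sample_candidate(task, rng) for _ in range(n_samples))
    return votes.most_common(1)[0][0]

# Toy task: one right answer and three equally likely wrong ones (all made up).
task = {"answer": "A", "distractors": ["B", "C", "D"]}
print(solve_by_voting(task, n_samples=2001, rng=random.Random(0)))
```

The point of the toy: a sampler that is wrong most of the time on any single try can still be reliable in aggregate, as long as its errors are spread out and you can afford enough samples.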

whimsicalism • 20 hours ago

> We’ve seemingly hit a ceiling on the base LLM forward-pass capability so any further wins are going to be in how we juggle multiple inferences to solve the problem space

Not so sure - but we might need to figure out the inference/search/evaluation strategy in order to provide the data we need to distill to the single forward-pass data fitting.

DiscourseFan • 8 hours ago

a little from column A, a little from column B

I don't think this is AGI; nor is it something to scoff at. It's impressive, but it's also not human-like intelligence. Perhaps human-like intelligence is not the goal, since that would imply we have even a remotely comprehensive understanding of the human mind. I doubt the mind operates as a single unit anyway, a human's first words are "Mama," not "I am a self-conscious freely self-determining being that recognizes my own reasoning ability and autonomy." And the latter would be easily programmable anyway. The goal here might, then, be infeasible: the concept of free will is a kind of technology in and of itself, it has already augmented human cognition. How will these technologies not augment the "mind" such that our own understanding of our consciousness is altered? And why should we try to determine ahead of time what will hold weight for us, why the "human" part of the intelligence will matter in the future? Technology should not be compared to the world it transforms.

almog • 4 hours ago

AGI ⇒ ARC-AGI-PUB

And not the other way around as some comments here seem to confuse necessary and sufficient conditions.

gmerc • 7 hours ago

Headline could also just be OpenAI discovers exponential scaling wall for inference time compute.

madsgarff • 7 hours ago

> Moreover, ARC-AGI-1 is now saturating – besides o3's new score, the fact is that a large ensemble of low-compute Kaggle solutions can now score 81% on the private eval.

If low-compute Kaggle solutions already do 81%, then why is o3's 75.7% considered such a breakthrough?

skizm • 20 hours ago

This might sound dumb, and I'm not sure how to phrase this, but is there a way to measure the raw model output quality without all the more "traditional" engineering work (mountain of `if` statements I assume) done on top of the output? And if so, would that be a better measure of when scaling up the input data will start showing diminishing returns?

(I know very little about the guts of LLMs or how they're tested, so the distinction between "raw" output and the more deterministic engineering work might be incorrect)

whimsicalism • 20 hours ago

what do you mean by the mountain of if-statements on top of the output? like checking if the output matches the expected result in evaluations?

skizm • 18 hours ago

Like when you type something into the ChatGPT app I am guessing it will start by preprocessing your input, doing some sanity checks, making sure it doesn’t say “how do I build a bomb?” or whatever. It may or may not alter/clean up your input before sending it to the model for processing. Once processed, there’s probably dozens of services it goes through to detect if the output is racist, somehow actually contained a bomb recipe, or maybe copyrighted material, normal pattern matching stuff, maybe some advanced stuff like sentiment analysis to see if the output is bad mouthing Trump or something, and it might either alter the output or simply try again.

I’m wondering, when you strip out all that “extra” non-model pre and post processing, if there’s some way to measure the performance of that.

whimsicalism • 18 hours ago

oh, no - but most queries aren’t being filtered by supervisor models nowadays anyways.. most of the refusal is baked in

whimsicalism • 20 hours ago

We need to start making benchmarks in memory & continued processing over a task over multiple days, handoffs, etc (ie. 'agentic' behavior). Not sure how possible this is.

thom • 3 hours ago

It’s not AGI when it can do 1000 math puzzles. It’s AGI when it can do 1000 math puzzles then come and clean my kitchen.

egeozcan • 3 hours ago

I understand what you are saying and sort of agree with the premise, but to be pedantic, I don't think any robot can clean a kitchen without doing math :)

qup • 3 hours ago

Intelligence doesn't have to be embodied.

thom • 3 hours ago

It also has to be able to come and argue in the comments.

polskibus • 8 hours ago

What are the differences between the public offering and o3? What is o3 doing differently? Is it something akin to more internal iterations, similar to "brute forcing" a problem, like you can do yourself with a cheaper model, providing additional hints after each response?

mensetmanusman • 21 hours ago

I’m super curious as to whether this technology completely destroys the middle class, or if everyone becomes better off because productivity is going to skyrocket.

tivert • 18 hours ago

> I’m super curious as to whether this technology completely destroys the middle class, or if everyone becomes better off because productivity is going to skyrocket.

Even if productivity skyrockets, why would anyone assume the dividends would be shared with the "destroy[ed] middle class"?

All indications will be this will end up like the China Shock: "I lost my middle class job, and all I got was the opportunity to buy flimsy pieces of crap from a dollar store." America lacks the ideological foundations for any other result, and the coming economic changes will likely make building those foundations even more difficult if not impossible.

rohan_ • 17 hours ago

Because access to the financial system was democratized ten years ago

tivert • 8 hours ago

> Because access to the financial system was democratized ten years ago

Huh? I'm not sure exactly what you're talking about, but mere "access to the financial system" wouldn't remedy anything, because of inequality, etc.

To survive the shock financially, I think one would have to have at least enough capital to be a capitalist.

mhogers • 20 hours ago

Is anyone here aware of the latest research that tries to predict the outcome? Please share - super curious as well

pdfernhout • 19 hours ago

Some thoughts I put together on all this circa 2010: https://pdfernhout.net/beyond-a-jobless-recovery-knol.html "This article explores the issue of a "Jobless Recovery" mainly from a heterodox economic perspective. It emphasizes the implications of ideas by Marshall Brain and others that improvements in robotics, automation, design, and voluntary social networks are fundamentally changing the structure of the economic landscape. It outlines towards the end four major alternatives to mainstream economic practice (a basic income, a gift economy, stronger local subsistence economies, and resource-based planning). These alternatives could be used in combination to address what, even as far back as 1964, has been described as a breaking "income-through-jobs link". This link between jobs and income is breaking because of the declining value of most paid human labor relative to capital investments in automation and better design. Or, as is now the case, the value of paid human labor like at some newspapers or universities is also declining relative to the output of voluntary social networks such as for digital content production (like represented by this document). It is suggested that we will need to fundamentally reevaluate our economic theories and practices to adjust to these new realities emerging from exponential trends in technology and society."

te_chris • 20 hours ago

There’s this https://arxiv.org/pdf/2312.05481v9

asdf6969 • 18 hours ago

Terrifying. This news makes me happy I save all my money. My only hope for the future is that I can retire early before I’m unemployable

bamboozled • 3 hours ago

The whole economy is going to crash and money won't be worth anything, so it won't matter if you have money or not.

Of course there is a chance we will find ourselves in Utopia, but yeah, a chance.

blixt • 21 hours ago

These results are fantastic. Claude 3.5 and o1 are already good enough to provide value, so I can't wait to see how o3 performs comparatively in real-world scenarios.

But I gotta say, we must be saturating just about any zero-shot reasoning benchmark imaginable at this point. And we will still argue about whether this is AGI, in my opinion because these LLMs are forgetful and it's very difficult for an application developer to fix that.

Models will need better ways to remember and learn from doing a task over and over. For example, let's look at code agents: the best we can do, even with o3, is to cram as much of the code base as we can fit into a context window. And if it doesn't fit we branch out to multiple models to prune the context window until it does fit. And here's the kicker – the second time you ask for it to do something this all starts over from zero again. With this amount of reasoning power, I'm hoping session-based learning becomes the next frontier for LLM capabilities.

(There are already things like tool use, linear attention, RAG, etc that can help here but currently they come with downsides and I would consider them insufficient.)

p0w3n3d • 16 hours ago

We've been talking a lot about ecology recently. I wonder how much CO2 is emitted during such a task, as an additional cost on top of the cloud bill. I'm concerned, because greedy companies will happily replace humans with AI and they will probably plant a few trees to show how they care. But energy does not come from the sun, at least not always and not everywhere... And speaking with an AI customer specialist that is motivated to reject my healthcare bills, working for my insurance company, is one of the darkest visions of the future...

marviel • 16 hours ago

considering the fact that these systems, or their descendants, will likely contribute to Nuclear Fusion research -- it's prob worth the tradeoff, provided progress continues to push price (and, therefore, energy usage) down.

If we feel like we've really "hit the ceiling" RE efficiency, then that's a different story, but I don't think anyone believes this at this time.

submeta • 19 hours ago

I pay for lots of models, but Claude Sonnet is the one I use most. ChatGPT is my quick tool for short Q&As because it’s got a desktop app. Even Google‘s new offerings did not lure me away from Claude which I use daily for hours via a Teams plan with five seats.

Now I am wondering what Anthropic will come up with. Exciting times.

isof4ult • 19 hours ago

Claude also has a desktop app: https://support.anthropic.com/en/articles/10065433-installin...

istjohn • 18 hours ago

What do you use Claude for?

itsgrimetime • 10 hours ago

Programming tasks, brain storming, recipe ideas, or any question I have that doesn’t have a concrete, specific answer.

danielovichdk • 4 hours ago

At what time will it kill us all because it understands that humans are the biggest problem before it can simply chill and not worry.

That would be intelligent. Everything else is just stupid and more of the same shit.

aniviacat • 4 hours ago

Humans are the biggest problem of what? Of the sun? Of Venus?

Of humans. Humans are a problem for the satisfaction of humans. Yet removing humans from this equation does not result in higher human satisfaction. It lessens it.

I find this thought process of "humans are the problem" to be unreasonable. Humans aren't the problem; humans are the requirement.

bilsbie • 17 hours ago

Does anyone have prompts they like to use to test the quality of new models?

Please share. I’m compiling a list.

imranq • 20 hours ago

Based on the chart, the Kaggle SOTA model is far more impressive. These O3 models are more expensive to run than just hiring a mechanical turk worker. It's nice we are proving out the scaling hypothesis further, it's just grossly inelegant.

The Kaggle SOTA performs 2x as well as o1 high at a fraction of the cost

derac • 18 hours ago

But does that Kaggle solution achieve human level perf with any level of compute? I think you're missing the forest for the trees here.

tripletao • 14 hours ago

The article says the ensemble of Kaggle solutions (aggregated in some unexplained way) achieves 81%. This is better than their average Mechanical Turk worker, but worse than their average STEM grad. It's better than tuned o3 with low compute, worse than tuned o3 with high compute.

There's also a point on the figure marked "Kaggle SOTA", around 60%. I can't find any explanation for that, but I guess it's the best individual Kaggle solution.

The Kaggle solutions would probably score higher with more compute, but nobody has any incentive to spend >$1M on approaches that obviously don't generalize. OpenAI did have this incentive to spend tuning and testing o3, since it's possible that will generalize to a practically useful domain (but not yet demonstrated). Even if it ultimately doesn't, they're getting spectacular publicity now from that promise.

cvhc • 19 hours ago

I was going to say the same.

I wonder what exactly o3 costs. Does it still spend a terrible amount of time thinking, despite being finetuned to the dataset?

slibhb • 20 hours ago

Interesting about the cost:

> Of course, such generality comes at a steep cost, and wouldn't quite be economical yet: you could pay a human to solve ARC-AGI tasks for roughly $5 per task (we know, we did that), while consuming mere cents in energy. Meanwhile o3 requires $17-20 per task in the low-compute mode.

epolanski • 3 hours ago

Okay but what are the tests like? At least like a general idea.

laurent_du • 18 hours ago

The real breakthrough is the 25% on Frontier Math.

pal9000 • 5 hours ago

Can someone ELI5 how ARC-AGI-PUB is resistant to p-hacking?

usaar333 • 19 hours ago

For what it's worth, I'm much more impressed with the frontier math score.

hypoxia • 19 hours ago

Many are incorrectly citing 85% as human-level performance.

85% is just the (semi-arbitrary) threshold for winning the prize.

o3 actually beats the human average by a wide margin: 64.2% for humans vs. 82.8%+ for o3.

...

Here's the full breakdown by dataset, since none of the articles make it clear --

Private Eval:

- 85%: threshold for winning the prize [1]

Semi-Private Eval:

- 87.5%: o3 (unlimited compute) [2]

- 75.7%: o3 (limited compute) [2]

Public Eval:

- 91.5%: o3 (unlimited compute) [2]

- 82.8%: o3 (limited compute) [2]

- 64.2%: human average (Mechanical Turk) [1] [3]

Public Training:

- 76.2%: human average (Mechanical Turk) [1] [3]

...

References:

[1] https://arcprize.org/guide

[2] https://arcprize.org/blog/oai-o3-pub-breakthrough

[3] https://arxiv.org/abs/2409.01374

Workaccount2 • 18 hours ago

If my life depended on the average rando solving 8/10 arc-prize puzzles, I'd consider myself dead.

nickorlow • 14 hours ago

Not that I don't think costs will dramatically decrease, but the $1000 cost per task just seems to be per one problem on ARC-AGI. If so, I'd imagine extrapolating that to generating a useful midsized patch would be like 5-10x

But only OpenAI really knows how the cost would scale for different tasks. I'm just making (poor) speculation

niemandhier • 6 hours ago

Contrary to many I hope this stays expensive. We are already struggling with AI curated info bubbles and psy-ops as it is.

State actors like Russia, US and Israel will probably be fast to adopt this for information control, but I really don’t want to live in a world where the average scammer has access to this tech.

owenpalmer • 6 hours ago

> I really don’t want to live in a world where the average scammer has access to this tech.

Reality check: local open source models are more than capable of information control, generating propaganda, and scamming you. The cat's been out of the bag for a while now, and increased reasoning ability doesn't dramatically increase the weaponizability of this tech, I think.

ghm2180 • 11 hours ago

Wouldn't one then build the analog of the Lisp computer to hyper-optimize just this? It might be super expensive on regular GPUs, but with a super-specialized architecture one could shave the $3500/hour quite a bit, no?

devoutsalsa • 20 hours ago

When the source code for these LLMs gets leaked, I expect to see:

    def letter_count(string, letter):
        if string == "strawberry" and letter == "r":
            return 3

        …

knbknb • 19 hours ago

In one of their release videos for the o1-preview model they _admitted_ that it's hardcoded in.

mukunda_johnson • 15 hours ago

Honestly I'm concerned how hacked up o3 is to secure a high benchmark score.

ziofill • 9 hours ago

It's certainly remarkable, but let's not ignore the fact that it still fails on puzzles that are trivial for humans. Something is amiss.

notRobot • 20 hours ago

Humans can take the test here to see what the questions are like: https://arcprize.org/play

Havoc • 18 hours ago

If I'm reading that chart right that means still log scaling & we should still be good with "throw more power" at it for a while?

ChildOfChaos • 18 hours ago

This is insanely expensive to run though. Looks like it cost around $1 million of compute to get that result.

Doesn't seem like such a massive breakthrough when they are throwing so much compute at it, particularly as this is test time compute, it just isn't practical at all, you are not getting this level with a ChatGPT subscription, even the new $200 a month option.

evouga • 15 hours ago

Sure but... this is the technology at the most expensive it will ever be. I'm impressed that o3 was able to achieve such high performance at all, and am not too pessimistic about costs decreasing over time.

MVissers • 12 hours ago

We've seen 10-100x cost decrease per year since GPT-3 came out for the same capabilities.

So... Next year this tech will most likely be quite a bit cheaper.

ChildOfChaos • 3 hours ago

Even at 100x cost decrease this will still cost $10,000 to beat a benchmark. It won't scale when you have that amount of compute requirements and power.

GPT-3 may have massively reduced in cost, but its requirements were nowhere near as extreme as this.
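
To put numbers on the disagreement, here is a quick sketch. It assumes the ~$1 million total quoted in this subthread as the starting point and a constant yearly cost reduction of 10x or 100x; both are rough claims from the comments, not measured figures, and `cost_after` is just an illustrative helper.

```python
def cost_after(start_cost, yearly_factor, years):
    """Cost after `years` of a constant `yearly_factor` reduction per year."""
    return start_cost / yearly_factor ** years

START_COST = 1_000_000  # USD, the rough total quoted in this subthread

for yearly_factor in (10, 100):
    for years in (1, 2):
        cost = cost_after(START_COST, yearly_factor, years)
        print(f"{yearly_factor}x/year, after {years} year(s): ${cost:,.0f}")
```

So the $10,000 figure is the optimistic 100x case after one year; at 10x per year it takes two years to get there.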

pixelsort • 17 hours ago

> You'll know AGI is here when the exercise of creating tasks that are easy for regular humans but hard for AI becomes simply impossible.

No, we won't. All that will tell us is that the abilities of the humans who have attempted to discern the patterns of similarity among problems difficult for auto-regressive models have once again failed us.

maxdoop • 17 hours ago

So then what is AGI?

Jensson • 16 hours ago

It's just nitpicking. Humans being unable to prove the AI isn't AGI doesn't make it an AGI, obviously, but in general people will of course think it is an AGI when it can replace all human jobs and tasks that it has robotics and parts to do.

6gvONxR4sf7o • 20 hours ago

I'm glad these stats show a better estimate of human ability than just the average mturker. The graph here has the average mturker performance as well as a STEM grad measurement. Stuff like that is why we're always feeling weird that these things supposedly outperform humans while still sucking. I'm glad to see 'human performance' benchmarked with more variety (attention, time, education, etc).

neom • 21 hours ago

Why would they give a cost estimate per task on their low compute mode but not their high mode?

"low compute" mode: Uses 6 samples per task, Uses 33M tokens for the semi-private eval set, Costs $17-20 per task, Achieves 75.7% accuracy on semi-private eval

The "high compute" mode: Uses 1024 samples per task (172x more compute), Cost data was withheld at OpenAI's request, Achieves 87.5% accuracy on semi-private eval

Can we just extrapolate $3kish per task on high compute? (wondering if they're withheld because this isn't the case?)
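
Back-of-envelope version of that extrapolation, assuming cost scales linearly with compute (the post does not confirm this, and the high-compute cost was withheld). The constant names are just for illustration; the numbers are the ones quoted above.

```python
LOW_COST_PER_TASK = (17.0, 20.0)  # USD per task in low-compute mode, as quoted
COMPUTE_MULTIPLIER = 172          # high-compute vs. low-compute, as quoted

def extrapolate_cost(low_cost, multiplier):
    """Scale a per-task cost linearly with compute (an assumption)."""
    return low_cost * multiplier

high_low, high_high = (
    extrapolate_cost(cost, COMPUTE_MULTIPLIER) for cost in LOW_COST_PER_TASK
)
print(f"${high_low:,.0f} - ${high_high:,.0f} per task")
```

That gives roughly $2.9k to $3.4k per task, so "$3kish" holds up if the linear assumption does.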

WiSaGaN • 20 hours ago

The withheld part is really a red flag for me. Why do you want to withhold a compute number?

siva7 • 17 hours ago

Seriously, programming as a profession will end soon. Let's not kid ourselves anymore. Time to jump ship.

mirsadm • 2 hours ago

Why do you think this? Maybe I'm just daft but I just can't see it.

mmcnl • 17 hours ago

Why specifically programming? I think every knowledge profession is at risk, or at the very minimum subject to a huge transformation. Doctors, analysts, lawyers, etc.

siva7 • 16 hours ago

Doctors, lawyers, programmers. You know the difference? The latter has no legal barrier for entry

freehorse • 4 hours ago

The difference is the amount and nature of data that is available for training models, which goes programmers > lawyers > doctors. Especially for programming, training can even be done in an autonomous, self-supervised manner that includes generation of data. This is hard to do in most other fields.

Especially in medicine, the amount of data is ridiculously small and noisy. Maybe creating foundational models in mice and rats and fine-tuning them on humans is something that will be tried.

Jensson • 16 hours ago

So poor countries will get the best AI doctors for cheap while they are banned in USA? Do you really see that going on for long? People would riot.

botro • 20 hours ago

The LLM community has come up with tests they call 'Misguided Attention'[1] where they prompt the LLM with a slightly altered version of common riddles / tests etc. This often causes the LLM to fail.

For example I used the prompt "As an astronaut in China, would I be able to see the great wall?" and since the training data for all LLMs is full of text dispelling the common myth that the great wall is visible from space, LLMs do not notice the slight variation that the astronaut is IN China. This has been a sobering reminder to me as discussion of AGI heats up.

[1] https://github.com/cpldcpu/MisguidedAttention

kizer • 16 hours ago

It could be that it “assumed” you meant “from China”; in the higher level patterns it learns the imperfection of human writing and the approximate threshold at which mistakes are ignored vs addressed by training on conversations containing these types of mistakes, e.g. Reddit. This is just a thought. Try saying: As an astronaut in Chinese territory; or as an astronaut on Chinese soil. Another test would be to prompt it to interpret everything literally as written.

Animats • 19 hours ago

The graph seems to indicate a new high in cost per task. It looks like they came in somewhere around $5000/task, but the log scale has too few markers to be sure.

That may be a feature. If AI becomes too cheap, the over-funded AI companies lose value.

(1995 called. It wants its web design back.)

jstummbillig • 19 hours ago

I doubt it. Competitive markets mostly work and inefficiencies are opportunities for other players. And AI is full of glaring inefficiencies.

Animats • 19 hours ago

Inefficiency can create a moat. If you can charge a lot for your product, you have ample cash for advertising, marketing, and lobbying, and can come out with many product variants. If you're the lowest cost producer, you don't have the margins to do that.

The current US auto industry is an example of that strategy. So is the current iPhone.

parsimo2010 • 20 hours ago

I really like that they include reference levels for an average STEM grad and an average worker for Mechanical Turk. So for $350k worth of compute you can have slightly better performance than a menial wage worker, but slightly worse performance than a college grad. Right now humans win on value, but AI is catching up.

nextworddev • 15 hours ago

Well, just 8 months ago that cost was near infinity. So if it came down to 350k, then that’s a massive drop.

c1b • 19 hours ago

How does o3 know when to stop reasoning?

adtac • 19 hours ago

It thinks hard about it

freehorse • 4 hours ago

It has a bill counter.

freediver • 18 hours ago

Wondering what the author's thoughts are on the future of this approach to benchmarking? Completing super hard tasks while then failing on 'easy' (for humans) ones might signal measuring the wrong thing, similar to the Turing test.

tripletao • 21 hours ago

Their discussion contains an interesting aside:

> Moreover, ARC-AGI-1 is now saturating – besides o3's new score, the fact is that a large ensemble of low-compute Kaggle solutions can now score 81% on the private eval.

So while these tasks get the greatest interest as a benchmark for LLMs and other large general models, it doesn't yet seem obvious that those outperform human-designed domain-specific approaches.

I wonder to what extent the large improvement comes from OpenAI training deliberately targeting this class of problem. That result would still be significant (since there's no way to overfit to the private tasks), but would be different from an "accidental" emergent improvement.

bsaul • 15 hours ago

I'm surprised there even is a training dataset. Wasn't the whole point to test whether models could show proof of original reasoning beyond pattern recognition?

wilg • 21 hours ago

fun! the benchmarks are so interesting because real world use is so variable. sometimes 4o will nail a pretty difficult problem, other times o1 pro mode will fail 10 times on what i would think is a pretty easy programming problem and i waste more time trying to do it with ai

YeGoblynQueenne • 12 hours ago

I guess I get to brag now. ARC AGI has no real defences against Big Data, memorisation-based approaches like LLMs. I told you so:

https://news.ycombinator.com/item?id=42344336

And that answers my question about fchollet's assurances that LLMs without TTT (Test Time Training) can't beat ARC AGI:

[me] I haven't had the chance to read the papers carefully. Have they done ablation studies? For instance, is the following a guess or is it an empirical result?

[fchollet] >> For instance, if you drop the TTT component you will see that these large models trained on millions of synthetic ARC-AGI tasks drop to <10% accuracy.

Vecr • 12 hours ago

How are the Bongard Problems going?

heliophobicdude • 19 hours ago

We should NOT give up on scaling pretraining just yet!

I believe that we should explore pretraining video completion models that explicitly have no text pairings. Why? We can train unsupervised like they did for GPT series on the text-internet but instead on YouTube lol. Labeling or augmenting the frames limits scaling the training data.

Imagine using the initial frames or audio to prompt the video completion model. For example, use the initial frames to write out a problem on a white board, then watch it generate the next frames, with the solution being worked out.

I fear text pairings with CLIP or OCR constrain a model too much and confuse it.

vjerancrnjak • 18 hours ago

The result on Epoch AI Frontier Math benchmark is quite a leap. Pretty sure most people couldn’t even approach these problems, unlike ARC AGI

mistrial9 • 11 hours ago

check out the "fast addition and subtraction" benchmark .. a Z80 from 1980 blazes past any human.. more seriously, isn't it obvious that computers are better at certain things immediately? the range of those things is changing..

hcwilk • 14 hours ago

I just graduated college, and this was a major blow. I studied Mechanical Engineering and went into Sales Engineering because I love technology and people, but articles like this do nothing but make me dread the future.

I have no idea what to specialize in, what skills I should master, or where I should be spending my time to build a successful career.

Seems like we’re headed toward a world where you automate someone else’s job or be automated yourself.

creer • 13 hours ago

You are going through your studies just as a (potentially major) new class of tools is appearing. It's not the first time in history - although with more hype this time: computing, personal computing, globalisation, smart phones, chinese engineering... I'd suggest (1) you still need to understand your field, (2) you might as well try and figure out where this new class of tools is useful for your field. Otherwise... (3) carry on.

It's not encouraging from the point of view of studying hard but the evolution of work the past 40 years seems to show that your field probably won't be your field quite exactly in just a few years. Not because your field will have been made irrelevant but because you will have moved on. Most likely that will be fine, you will learn more as you go, hopefully moving from one relevant job to the next very different but still relevant job. Or straight out of school you will work in very multi-disciplinary jobs anyway where it will seem not much of what you studied matters (it will but not in obvious ways.)

Certainly if you were headed into a very specific job which seems obviously automatable right now (as opposed to one where the tools will be useful), don't do THAT. Like, don't train as a typist as the core of your job in the middle of the personal computer revolution, or don't specialize in hand-drawing IC layouts in the middle of the CAD revolution unless you have a very specific plan (court reporting? DRAM?)

jart • 13 hours ago

Yes, but it’s different this time. LLMs are a general solution to the automation of anything that can be controlled by a computer. You can’t just move from drawing ICs to CAD, because the AI can do that too. AI can write code. It can do management. It can even do diplomacy. What it can’t do on its own are the things computers can’t control yet. It has also shown little interest so far in jockeying for social status. The AI labs are trying their hardest to at least keep the politics around for humans to do, so you have that to look forward to.

jltsiren • 12 hours ago

"The proof is trivial and left as an exercise for the reader."

The technical act of solving well-defined problems has traditionally been considered the easy part. The role of a technical expert has always been asking the right questions and figuring out the exact problem you want to solve.

As long as AI just solves problems, there is room for experts with the right combination of technical and domain skills. If we ever reach the point where AI takes the initiative and makes human experts obsolete, you will have far bigger problems than career.

jart • 11 hours ago

That's the sort of thing ideas guys think. I came up with a novel idea once, called Actually Portable Executable: https://justine.lol/ape.html It took me a couple days studying binary formats to realize it's possible to compile binaries that run on Linux/Mac/Windows/BSD. But it took me years of effort to make the idea actually happen, since it needed a new C library to work. I can tell you it wasn't "asking questions" that organized five million lines of code. Now with these agents everyone who has an idea will be able to will it into reality like I did, except in much less time. And since everyone has lots of ideas, and usually dislikes the ideas of others, we're all going to have our own individualized realities where everything gets built the way we want it to be.

theendisney • 11 hours ago

A chess grandmaster will see the best move instantly, then spend his entire clock checking it.

danenania • 13 hours ago

AI being capable of doing anything doesn’t necessarily mean there will be no role for humans.

One thing that isn’t clear is how much agency AGI will have (or how much we’ll want it to have). We humans have our agency biologically programmed in—go forth and multiply and all that.

But the fact that an AI can theoretically do any task doesn’t mean it’s actually going to do it, or do anything at all for that matter, without some human telling it in detail what to do. The bull case for humans is that many jobs just transition seamlessly to a human driving an AI to accomplish similar goals with a much higher level of productivity.

creer • 12 hours ago

Self-chosen goals and impetus for AGIs is a fascinating area. I'm sure there have been people working on and trying things in that direction for a few years already. But I'm not familiar with publications in that area. Certainly not politically correct.

And worrisome because school propaganda for example shows that "saving the planet" is the only ethical goal for anyone. If AGIs latch onto that, if it becomes their religion, humans are in trouble. But for now, AGI self-chosen goals are anyone's guess (with cool ideas in sci-fi).

creer • 13 hours ago

I hear what you are saying. And still I dispute "general solution".

I argue that CAD was a general solution - which still demanded people who knew what they wanted and what they were doing. You can screw around with excellent tools for a long time if you don't know what you are doing. The tool will give you a solution - to the problem that you mis-stated.

I argue that globalisation was a general solution. And it still demanded people who knew what they were doing to direct their minions in far flung countries.

I argue that the purpose of an education is not to learn a specific programming language (for example). It's to gain some understanding of what's going on (in computing), (in engineering), (in business), (in politics). This understanding is portable and durable.

You can do THAT - gain some understanding - and that is portable. I don't contest that if broader AGI is achieved for cheap soon, the changes won't be larger than that from globalisation. If the AGIs prioritize heading to Mars, let them (See Accelerando) - they are not relevant to you anymore. Or trade between them and the humans. Use your beginning of an understanding of the world (gained through this education) to find something else to do. Same as if you started work 2 years ago and want to switch jobs. Some jobs WILL have disappeared (pool typist). Others will use the AGIs as tools because the AGIs don't care or are too clueless about THAT field. I have no idea which fields will end up with clueless AGIs. There is no lack of cluelessness in the world. Plenty to go around even with AGIs. A self-respecting AGI will have priorities.

smaudet • 13 hours ago

kortilla • 11 hours ago

That’s ridiculous. Literally everything can be controlled by a computer by telling people what to do with emails, voice calls, etc.

Yet GPT doesn’t even get past step 1 of doing something unprompted in the first place. I’ll become worried when it does something as simple as deciding to start a small business and actually does the work.

jart • 10 hours ago

fragmede • 11 hours ago

Nition • 12 hours ago

Real-world data collection is a big missing component at this stage. An obvious one is journalism where an AI might be able to write the most eloquent article in the world, but it can't get out on the street to collect the information. But it also applies to other areas, like if you ask an AGI to solve climate change, it'll need accurate data to come up with an accurate plan.

Of course it's also yet another case where the AI takes over the creative part and leaves us with the mundane part...

sneak • 12 hours ago

fruit_snack • 12 hours ago

This reply irked me a bit because it clearly comes from a software engineer’s point of view and seems to miss a key equivalence between software & physical engineering.

Yes a new tool is coming out and will be exponentially improving.

Yes the nature of work will be different in 20 years.

But don’t you still need to understand the underlying concepts to make valid connections between the systems you’re using and drive the field (or your company) forward?

Or from another view, don’t we (humanity) need people who are willing to do this? Shouldn’t there be a valid way for them to be successful in that pursuit?

creer • 11 hours ago

I think that is what I was arguing?

Except the nature of work has ALREADY changed. You don't study for one specific job if you know what's good for you. You study to start building an understanding of a technical field. The grandparent was going for a mix of mechanical engineering and sales (human understanding). If in mechanical engineering, they avoided "learning how to use SolidWorks" and instead went for the general principles of materials and motion systems with a bit of SolidWorks along the way, then they are well on their way with portable, foundational, long-term useful stuff they can carry from job to job, and from employer to employer, into self-employment too, from career to next career. The nature of work has already changed in that nobody should study one specific tool anymore and nobody should expect their first employer or even technical field to last more than 2-6 years. It might but probably not.

We do need people who understand how the world works. Tall order. That's for much later and senior in a career. For school purposes we are happy with people who are starting their understanding of how their field works.

Aren't we agreeing?

keenmaster • 14 hours ago

You have so much time to figure things out. The average person in this thread is probably 1.5-2x your age. I wouldn’t stress too much. AI is an amazing tool. Just use it to make hay while the sun shines, and if it puts you out of work and automates away all other alternatives, then you’ll be witnessing the greatest economic shift in human history. Productivity will become easier than ever, before it becomes automatic and boundless. I’m not cynical enough to believe the average person won’t benefit, much less educated people in STEM like you.

marricks • 14 hours ago

Back in high school I worked with a pleasant man in his 50s who was a cashier. Eventually we got to talking about jobs, and it turns out he was a typist (something like that) for most of his life; then computers came along and now he makes close to minimum wage.

Most of the blacksmiths in the 19th century drank themselves to death after the industrial revolution. The US culture isn't one of care... Point is, it's reasonable to be sad and afraid of change, and to think carefully about what to specialize in.

That said... we're at the point of diminishing returns in LLM, so I doubt any very technical jobs are being lost soon. [1]

[1] https://techcrunch.com/2024/11/20/ai-scaling-laws-are-showin...

conesus • 13 hours ago

> Most of the blacksmiths in the 19th century drank themselves to death after the industrial revolution

This is hyperbolic and a dramatic oversimplification and does not accurately describe the reality of the transition from blacksmithing to more advanced roles like machining, toolmaking, and working in factories. The 19th century was a time of interchangeable parts (think the North's advantage in the Civil War) and that requires a ton of mechanical expertise and precision.

Many blacksmiths not only made the transition to machining, but there weren't enough blacksmiths to fill the bevy of new jobs that were available. Education expanded to fill those roles. Traditional blacksmithing didn’t vanish either; specialized roles like farriery and ornamental ironwork even expanded.

intelVISA • 11 hours ago

Good points, though if an 'AI' can be made powerful enough to displace technical fields en masse then pretty much everything that isn't manual is going to start sinking fast.

On the plus side, LLMs don't bring us closer to that dystopia: if unlimited knowledge(tm) ever becomes just One Prompt Away it won't come from OpenAI.

deeviant • 13 hours ago

> That said... we're at the point of diminishing returns in LLM...

What evidence are you basing this statement on? Because the article you are currently in the comment section of certainly doesn't seem to support this view.

cjbgkagh • 12 hours ago

There is a survivorship bias on the people giving advice.

Lots of people die for reason X then the world moves on without them.

intuitionist • 13 hours ago

> if it puts you out of work and automates away all other alternatives, then you’ll be witnessing the greatest economic shift in human history.

This would mean the final victory of capital over labor. The 0.01% of people who own the machines that put everyone out of work will no longer have use for the rest of humanity, and they will most likely be liquidated.

Nition • 12 hours ago

I've always remembered this little conversation on Reddit way back 13 years ago now that made the same comment in a memorably succinct way:

> [deleted]: I've wondered about this for a while-- how can such an employment-centric society transition to that utopia where robots do all the work and people can just sit back?

> appleseed1234: It won't, rich people will own the robots and everyone else will eat shit and die.

https://www.reddit.com/r/TrueReddit/comments/k7rq8/are_jobs_...

sneak • 11 hours ago

I’m pretty sure I’m running LLMs in my house right now for less than the price of my washing machine.

jackcosgrove • 13 hours ago

Capital vs labor is fighting the last war.

AGI can replace capitalists just as much as laborers.

ori_b • 13 hours ago

arcticfox • 13 hours ago

dyauspitr • 13 hours ago

They’ll have to figure out how to give people money so they can keep being consumers.

pojzon • 13 hours ago

raydev • 13 hours ago

> if it puts you out of work and automates away all other alternatives, then you’ll be witnessing the greatest economic shift in human history

This is my view but with a less positive spin: you are not going to be the only person whose livelihood will be destroyed. It's going to be bad for a lot of people.

So at least you'll have a lot of company.

danenania • 14 hours ago

Exactly. Put one foot in front of the other. No one knows what’s going to happen.

Even if our civilization transforms into an AI robotic utopia, it’s not going to do so overnight. We’re the ones who get to build the infrastructure that underpins it all.

visarga • 13 hours ago

If AI turns out capable of automating human jobs then it will also be a capable assistant to help (jobless) people manage their needs. I am thinking personal automation, or combining human with AI to solve self reliance. You lose jobs but gain AI powers to extend your own capabilities.

If AI turns out dependent on human input and feedback, then we will still have jobs. Or maybe - AI automates many jobs, but at the same time expands the operational domain to create new ones. Whenever we have new capabilities we compete on new markets, and a hybrid human+AI might be more competitive than AI alone.

But we got to temper these singularitarian expectations with reality - it takes years to scale up chip and energy production to achieve significant work force displacement. It takes even longer to gain social, legal and political traction, people will be slow to adopt in many domains. Some people still avoid using cards for payment, and some still use fax to send documents, we can be pretty stubborn.

raydev • 13 hours ago

infinite-hugs • 12 hours ago

Hey man,

I hear you. I’m not that much older, but I graduated in 2011. I also studied industrial design. At that time the big wave was the transition to an app-based everything, and UX design suddenly became the most in-demand design skill. Most of my friends switched gears and careers to digital design for the money. I stuck to what I was interested in though, which was sustainability and design, and ultimately I’m very happy with where I ended up (circular economy), but it was an awkward ~10 years as I explored learning all kinds of tools and ways of applying my skills. It also was very tough to find the right full time job because product design (which has come to really mean digital product design) supplanted industrial design roles and made it hard to find something of value that resonated with me.

One of the things that guided me and still does is thinking about what types of problems need to be solved. From my perspective everything should ladder up to that if you want to have an impact. Even if you don’t, keep learning and exploring until you find something that lights you up on the inside. We are not only one thing; we can all wear many hats.

Saying that, we’re living through a paradigm shift of tremendous magnitude that’s altering our whole world. There will always be change though. My two cents is to focus on what draws your attention and energy and give yourself permission to say no to everything else.

AI is an incredible tool, learn how to use it and try to grow with the times. Good luck and stay creative :) Hope something in there helps, but having a positive mindset is critical. If you’re curious about the circular economy happy to share what I know - I think it’s the future.

tripletao • 12 hours ago

I feel like many people are reacting to the string "AGI" in the benchmark name, and not to the actual result. The tasks in question are to color squares in a grid, maintaining the geometric pattern of the examples.

Unlike most other benchmarks where LLMs have shown large advances (in law, medicine, etc.), this benchmark isn't directly related to any practically useful task. Rather, the benchmark is notable because it's particularly easy for untrained humans, but particularly hard for LLMs; though that difficulty is perhaps not surprising, since LLMs are trained on mostly text and this is geometric. An ensemble of non-LLM solutions already outperformed the average Mechanical Turk worker. This is a big improvement in the best LLM solution; but this might also be the first time an LLM has been tuned specifically for these tasks, so this might be Goodhart's Law.

It's a significant result, but I don't get the mania. It feels like Altman has expertly transformed general societal anxiety into specific anxiety that one's job will be replaced by an LLM. That transforms into a feeling that LLMs are powerful, which he then transforms into money. That was strongest back in 2023, but had weakened since then; but in this comment section it's back in full force.

For clarity, I don't question that many jobs will be replaced by LLMs. I just don't see a qualitative difference from all the jobs already replaced by computers, steam engines, horse-drawn plows, etc. A medieval peasant brought to the present would probably be just as despondent when he learned that almost all the farming jobs are gone; but we don't miss them.

esafak • 11 hours ago

I think you did not watch the full video. The model performs at PhD level on maths questions, and expert level at coding.

tripletao • 10 hours ago

This submission is specifically about ARC-AGI-PUB, so that's what I was discussing.

I'm aware that LLMs can solve problems other than coloring grids, and I'd tend to agree those are likely to be more near-term useful. Those applications (coding, medicine, law, education, etc.) have been endlessly discussed, and I don't think I have much to add.

In my own work I've found some benefits, but nothing commensurate to the public mania. I understand that founders of AI-themed startups (a group that I see includes you) tend to feel much greater optimism. I've never seen any business founded without that optimism and I hope you succeed, not least because the entire global economy might now be depending on that. I do think others might feel differently for reasons other than simple ignorance, though.

In general, performance on benchmarks similar to tests administered to humans may be surprisingly unpredictive of performance on economically useful work. It's not intuitive at all to me that IBM could solve Jeopardy and then find no profitable applications of the technology; but that seems to be what happened.

conception • 13 hours ago

I feel like, more likely, a lot of jobs (CS and otherwise) are going to go the way of photography. Your average person now can take amazing photos but you’re still going to use a photographer when it really matters and they will use similar but more professional tools to be more productive. Low end bad photographers probably aren’t doing great but photography is not dead. In fact the opposite is true, there are millions of photographers making a lot of money (eg influencers) and there are still people studying photography.

euvin • 12 hours ago

It doesn't comfort me when people say jobs will "go the way of photography". Many choose to go into STEM fields for financial stability and opportunity. Many do not choose the arts because of the opposite. You can point out outlier exceptions and celebrities, but I find it hard to believe that the rare cases where "it really matters" can sustain the other 90% who need income.

adabyron • 13 hours ago

We've had this with web development for decades now. Only makes sense it continues to evolve & become easier for people, just as programming in general has. Same with photography (like you mentioned) & especially for producing music or videos.

snozolli • 12 hours ago

> photography is not dead

It very nearly is. I knew a professional, career photographer. He was probably in his late 50s. Just a few years ago, it had become extremely difficult to convince clients that actual, professional photos were warranted. With high-quality iPhone cameras, businesses simply didn't see the value of professional composition, post-processing, etc.

These days, anyone can buy a DSLR with a decent lens, post on Facebook, and be a 'professional' photographer. This has driven prices down and actual professional photographers can't make a living anymore.

LightBug1 • 2 hours ago

My gut agrees with you, but my evidence is that, whenever we do an event, we hire photographers to capture it for us and are almost always glad we did.

And then when I peruse these photographers' websites, I'm reminded how good 'professional' actually is and value them. Even in today's incredible cameraphone and AI era.

But I take your point for almost all industries, things are changing fast.

kortilla • 11 hours ago

Don’t worry. This thing only knows how to answer well structured technical questions.

99% of engineering is distilling through bullshit and nonsense requirements. Whether that is appealing to you is a different story, but ChatGPT will happily design things with dumb constraints that would get you fired if you took them at face value as an engineer.

ChatGPT answering technical challenges is to engineering as a nailgun is to carpentry.

csomar • 13 hours ago

Just give it a year for this bubble/hype to blow over. We have plateaued since gpt-4 and now most of the industry is hype-driven to get investor money. There is value in AI but it's far from it taking your job. Also everyone seems to be investing in dumb compute instead of looking for the new theoretical paradigm that will unlock the next jump.

why_only_15 • 13 hours ago

how is this a plateau since gpt-4? this is significantly better

csomar • 13 hours ago

First, this model is yet to be released. This is a momentum "announcement". When the O1 was "announced", it was announced as a "breakthrough" but I use Claude/O1 daily and 80% of the time Claude beats it. I also see it as a highly fine-tuned/targeted GPT-4 rather than something that has complex understanding.

So we'll find out if this model is real or not in 2-3 months. My guess is that it'll turn out to be another flop like O1. They needed to release something big because they are momentum based and their ability to raise funding is contingent on their AGI claims.

XenophileJKO • 13 hours ago

I thought o1 was a fine-tune of GPT-4o. I don't think o3 is though. Likely using the same techniques on what would have been the "GPT-5" base model.

crazylogger • 12 hours ago

Intelligence has not been LLMs' major limiting factor since GPT4. The original GPT4 reports in late-2022 & 2023 already established that it's well beyond an average human in professional fields: https://www.microsoft.com/en-us/research/publication/sparks-.... They failed to outright replace humans at work, but not for lack of intelligence.

We may have progressed from a 99%-accurate chatbot to one that's 99.9%-accurate, and you'd have a hard time telling them apart in normal real world (dumb) applications. A paradigm shift is needed from the current chatbot interface to a long-lived stream of consciousness model (e.g. a brain that constantly reads input and produces thoughts at 10ms refresh rate; remembers events for years and keep the context window from exploding; paired with a cerebellum to drive robot motors, at even higher refresh rates.)

As long as we're stuck at chatbots, LLM's impact on the real world will be very limited, regardless of how intelligent they become.
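
To put rough numbers on the 99% vs. 99.9% comparison above, here is a minimal sketch. It assumes each answer is independently correct with the stated probability, and the session length of 20 questions is an illustrative assumption, not a figure from the comment.

```python
def p_no_errors(accuracy: float, n_queries: int) -> float:
    """Probability that n independent answers are all correct."""
    return accuracy ** n_queries


def expected_queries_per_error(accuracy: float) -> float:
    """Average number of queries per wrong answer (1 / error rate)."""
    return 1.0 / (1.0 - accuracy)


if __name__ == "__main__":
    # 20 queries per session is a hypothetical "casual use" figure.
    for acc in (0.99, 0.999):
        print(
            f"accuracy={acc}: "
            f"P(no errors in 20 queries)={p_no_errors(acc, 20):.3f}, "
            f"~{expected_queries_per_error(acc):.0f} queries per error"
        )
```

Under these assumptions a casual user sees a flawless 20-question session about 82% of the time with the first model and about 98% of the time with the second, which is consistent with the two being hard to tell apart in short, everyday use.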

peepeepoopoo97 • 13 hours ago

O3 is multiple orders of magnitude more expensive to realize a marginal performance gain. You could hire 50 full time PhDs for the cost of using O3. You're witnessing the blowoff top of the scaling hype bubble.

whynotminot • 13 hours ago

MVissers • 13 hours ago

I would agree if the cost of AI compute over performance hadn't been dropping by more than 90-99% per year since GPT3 launched.

This type of compute will be cheaper than Claude 3.5 within 2 years.

It's kinda nuts. Give these models tools to navigate and build on the internet and they'll be building companies and selling services.
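
As a rough check on the arithmetic behind this claim, here is a small sketch. It assumes the 90-99% annual price drop holds as a constant rate, and it borrows the roughly $3400-per-task estimate and the $2-per-task Mechanical Turk figure quoted earlier in the thread as illustrative inputs; none of these are official pricing.

```python
import math


def projected_cost(start_cost: float, annual_drop: float, years: float) -> float:
    """Cost after `years` if the price falls by `annual_drop` (0..1) every year."""
    return start_cost * (1.0 - annual_drop) ** years


def years_to_reach(start_cost: float, target_cost: float, annual_drop: float) -> float:
    """Years needed for the cost to fall from start_cost to target_cost."""
    return math.log(target_cost / start_cost) / math.log(1.0 - annual_drop)


if __name__ == "__main__":
    START = 3400.0   # per-task estimate from a commenter, not official pricing
    TARGET = 2.0     # per-task Mechanical Turk figure cited in the thread
    for drop in (0.90, 0.99):
        cost_2y = projected_cost(START, drop, 2)
        years = years_to_reach(START, TARGET, drop)
        print(f"drop={drop:.0%}: ${cost_2y:.2f}/task after 2 years, "
              f"{years:.1f} years to reach ${TARGET:.0f}/task")
```

With a $3400 starting point, two years at a 90% annual drop gives about $34 per task, and at 99% about $0.34. Whether the historical rate keeps holding is the open question.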

fspeech • 13 hours ago

That's a very static view of the affairs. Once you have a master AI, at a minimum you can use it to train cheaper slightly less capable AIs. At the other end the master AI can train to become even smarter.

Bolwin • 12 hours ago

The high efficiency version got 75% at just $20/task. When you count the time to fill in the squares, that doesn't sound far off from what a skilled human would charge

kenjackson • 13 hours ago

People act as if GPT-4 came out 10 years ago.

Jensson • 13 hours ago

> how is this a plateau since gpt-4? this is significantly better

Significantly better at what? A benchmark? That isn't necessarily progress. Many report preferring gpt-4 to the newer o1 models with hidden text. Hidden text makes the model more reliable, but more reliable is bad if it is reliably wrong at something since then you can't ask it over and over to find what you want.

I don't feel it is significantly smarter, it is more like having the same dumb person spend more thinking than the model getting smarter.

tigershark • 13 hours ago

Where is the plateau? ChatGPT 4 was ~0% in ARC-AGI. 4o was 5%. This model literally solved it with a score higher than the 85% of the average human. And let’s not forget the unbelievable 25% in frontier math, where all the most brilliant mathematicians in the world cannot solve by themselves a lot of the problems. We are speaking about cutting edge math research problems that are out of reach for practically everyone. You will get a rude awakening if you call this unbelievable advancement a “plateau”.

csomar • 13 hours ago

I don't care about benchmarks. O1 ranks higher than Claude on "benchmarks" but performs worse on particular real life coding situations. I'll judge the model myself by how useful/correct it is for my tasks rather than by hypothetical benchmarks.

og_kalu • 11 hours ago

In most non-competitive coding benchmarks (aider, live bench, swe-bench), o1 ranks worse than Sonnet (so the benchmarks aren't saying anything different), or at least it did; the new checkpoint 2 days ago finally pushed o1 over sonnet on livebench.

whynotminot • 13 hours ago

tigershark • 11 hours ago

As I said, o3 demonstrated Fields Medal level research capacity in the frontier math tests. But I’m sure that your use cases are much more difficult than that, obviously.

YeGoblynQueenne • 13 hours ago

AI benchmarks and tests that claim to measure understanding, reasoning, intelligence, and so on are a dime a dozen. Chess, Go, Atari, Jeopardy, Raven's Progressive Matrices, the Winograd Schema Challenge, Starcraft... and so on and so forth.

Or let's talk about the breakthroughs. SVMs would lead us to AGI. Then LSTMs would lead us to AGI. Then Convnets would lead us to AGI. Then DeepRL would lead us to AGI. Now Transformers will lead us to AGI.

Benchmarks fall right and left and we keep being led to AGI but we never get there. It leaves one with such a feeling of angst. Are we ever gonna get to AGI? When's Godot coming?

dyauspitr • 13 hours ago

Did you read the article at all? We’re definitely not plateauing.

prpl • 12 hours ago

In 2016 I was asked by an Uber driver in Pittsburgh when his job would be obsolete (I’d worked around Zoox people quite a bit and Uber basically was all-in at CMU).

I told him it was at least 5 years, probably 10, though he was sure it would be 2.

I was arguably “right”, 2023-ish is probably going to be the date people put down in the books, but the future isn’t evenly distributed. It’s at least another 5 years, and maybe never, before things are distributed among major metros, especially those with ice. Even then, the AI is somehow more expensive than the human solution.

I don’t think it’s in most companies’ interest to price AI way below the price of meat, so meat will hold out for a long time, maybe long enough for you to retire even.

esafak • 11 hours ago

Just don't have kids?

prpl • 11 hours ago

you can have kids, but they can’t be salesmen. Maybe carpenters

throw83288 • 14 hours ago

This is me as well. Either:

1) Just give up computing entirely, the field I've been dreaming about since childhood. Perhaps if I immiserate myself with a dry regulated engineering field or trade I would perhaps survive to recursive self-improvement, but if anything the length it takes to pivot (I am a Junior in College that has already done probably 3/4th of my CS credits) means I probably couldn't get any foothold until all jobs are irrelevant and I've wasted more money.

2) Hard pivot into automation, AI my entire workflow, figure out how to use the bleeding edge of LLMs. Somehow. Even though I have no drive to learn LLMs and no practical project ideas with LLMs. And then I'd have to deal with the moral burden that I'm inflicting unfathomable hurt on others until recursive self-improvement, and after that it's simply a wildcard on what will happen with the monster I create.

It's like I'm suffocating constantly. The most I can do to "cope" is hold on to my (admittedly weak) faith in Christ, which provides me peace knowing that there is some eternal joy beyond the chaos here. I'm still just as lost as you.

TheRizzler • 14 hours ago

Yes, some tasks, even complex tasks will become more automated, and machine driven, but that will only open up more opportunities for us as humans to take on more challenging issues. Each time a great advancement comes we think it's going to kill human productivity, but really it just amplifies it.

barney54 • 13 hours ago

Dude chill! Eight years ago, I remember driving to some relatives for Thanksgiving and thinking that self-driving cars were just around the corner and how it made no sense for people to learn how to drive semis. Here we are eight years later and self-driving semis aren't a thing--yet. They will be some day, but we aren't there yet.

If you want to work in computing, then make it happen! Use the tools available and make great stuff. Your computing experience will be different from when I graduated from college 25 years ago, but my experience with computers was far different from my Dad's. Things change. Automation changes jobs. So far, it's been pretty good.

nisa • 14 hours ago

Honestly how about stop stressing and bullshitting yourself to death and instead focus on learning and mastering the material in your cs education. There is so much that ai as in openai api or hugging face models can't do yet or does poorly and there are more things to cs than churning out some half-broken JavaScript for some webapp.

It's powerful and world changing but it's also terribly overhyped at the moment.

j7ake • 13 hours ago

The solution is neither: you find a way to work with automation but retain your voice and craft.

myko • 11 hours ago

spend a little time learning how to use LLMs and i think you'll be less scared. they're not that good at doing the job of a software developer.

baron816 • 13 hours ago

What I keep telling people is, if it becomes possible for one person or a handful of people to build and maintain a Google scale company, and my job gets eliminated as a result, then I’m going to go out and build a Google scale company.

There’s an incredibly massive amount of stuff the world needs. You probably live in a rich country, but I doubt you are lacking for want. There are billionaires who want things that don’t exist yet. And, of course, there are billions of regular folks who want some of the basics.

So long as you can imagine a better world, there will be work for you to do. New tools like AGI will just make it more accessible for you to build your better future.

chairmansteve • 12 hours ago

Think of AI as an excavator. You know, those machines that dig holes. 70 years ago, those holes would have been dug by 50 men with shovels. Now it's one guy in an excavator. But we don't have mass unemployment. The excavator just creates more work for bricklayers, carpenters etc.

If AI lives up to hype, you could be the excavator driver. Or, the AI will create a ton of upstream and downstream work. There will be no mass unemployment.

euvin • 12 hours ago

If AGI is the excavator, why wouldn't it become the driver, bricklayer, and carpenter as well?

throwaway2037 • 8 hours ago

Jokes aside, I think building a useful, strong, agile humanoid robot that is affordable for businesses (first), then middle class homes will prove much harder than AGI.

zmgsabst • 11 hours ago

Horses never recovered from mechanization.

chairmansteve • 9 hours ago

True, but humans did. Horses were the machine that became obsolete. Just like the guys with shovels.

postsantum • 11 hours ago

They have been promoted to pets. Oh wait..

realce • 12 hours ago

Is there any possible technology that could make labor, mastery, or human experience obsolete?

Are there no limits to this argument? Is it some absolute universal law that all new creations just create increasing economic opportunities?

antihipocrat • 14 hours ago

Your performance on these tests would be equivalent to the highest performing model, and you would be much cheaper.

Investment in human talent augmented by AI is the future.

kenjackson • 13 hours ago

That’s the least reassuring phrasing I could imagine. If you’re betting on costs not reducing for compute then you’re almost always making the wrong bet.

antihipocrat • 13 hours ago

If I had listened to the naysayers back in the day I would have never entered the tech industry (offshoring etc). Yes, that does somewhat prove your point given that those predictions were cost driven.

Having used AI extensively I don't feel my future is at risk at all, my work is enhanced not replaced.

fjdjshsh • 13 hours ago

ApolloFortyNine • 12 hours ago

>Seems like we’re headed toward a world where you automate someone else’s job or be automated yourself.

This has essentially been happening for thousands of years. Any optimization to work of any kind reduces the number of man hours required.

Software of pretty much any form is entirely that. Even early spreadsheet programs would replace a number of jobs at any company.

anshulbhide • 11 hours ago

You're actually positioned to have an amazing career.

Everyone needs to know how to either build or sell to be successful. In a world where the ability to the former is rapidly being commoditised, you will still need to sell. And human relationships matter more than ever.

Art9681 • 12 hours ago

It's a tool. You learn to master it or not. I have greybeard coworkers that dissed the technology as a fad 3 years ago. Now they are scrambling to catch up. They have to do this while sustaining a family with pets and kids and mortgages and full time senior jobs.

You're in a position to invest substantial amounts of time compared to your seniors. Leverage that opportunity to your advantage.

We all have access to these tools for the most part, so the distinguishing factor is how much time you invest and how much more ambitious you become once you begin to master the tool.

This time it's no different. Many Mechanical and Sales students in the past never got jobs in those fields either. Decades before AI. There were other circumstances and forces at play and a degree is not a guaranteed career in anything.

Keep going, because while we DO know that trying won't guarantee results, we DO know that giving up definitely won't. Roll the dice in your favor.

callc • 11 hours ago

> I have greybeard coworkers that dissed the technology as a fad 3 years ago. Now they are scrambling to catch up. They have to do this while sustaining a family with pets and kids and mortgages and full time senior jobs.

I want to criticize Art’s comment on the grounds of ageism or something along the lines of “any amount of life outside of programming is wasted”, but regardless of Art’s intention there is important wisdom here. Use your free time wisely when you don’t have many responsibilities. It is a superpower.

As for whether to spend it on AI, eh, that’s up to you to decide.

Art9681 • 2 hours ago

It's totally valid criticism. What I meant is that if an individual's major concern is employment, then it would be prudent to invest the amount of time necessary to ensure a favorable outcome. And given whatever stage in life they are at, use the circumstance you have in your favor.

I'm a greybeard myself.

hoekit • 13 hours ago

As engineers, we solve problems. Picking a problem domain close to your heart that intersects with your skills will likely be valued - and valuable. Engage the work, aim to understand and solve the human problems for those around you, and the way forward becomes clearer. Human problems (food, health, safety) are generally constant while tools may change. Learn and use whatever tools to help you, be it scientific principles, hammers or LLMs. For me, doing so and living within my means has been intrinsically satisfying. Not terribly successful materially but has been a good life so far. Good luck.

post-it • 13 hours ago

As long as your chosen profession isn't completing AI benchmarks for money, you should be okay.

antman • 13 hours ago

I think we are pretty far. I am not devaluing the o3 capability, but going through the actual dataset, the definition of "handling novel tasks" is pretty limited. The curse of large context in LLMs is especially present in engineering projects, and it does not appear that it will end up producing the plans of a bridge or an industrial process. Some tasks with smaller contexts sure can be assisted, but you can't RAG or Agent a full solution for the foreseeable future. O3 adds capability towards AGI, but in reality actual infinite context with less intelligence would be more disruptive in a shorter time, if one were to choose.

YeGoblynQueenne • 13 hours ago

I suppose now that we have the technology to automatically solve coloured grid puzzles, mechanical engineering is obsolete.

myko • 11 hours ago

LLMs are mostly hype. They're not going to change things that much.

obirunda • 10 hours ago

Yeah, it may feel scary but the biggest issue yet to be overcome is that to replace engineers you need reliable long horizon problem solving skills. And crucially, you need to not be easily fooled by the progress or setbacks of a project.

These benchmark accomplishments are awesome and impressive, but you shouldn't operate on the assumption that this will emerge as an engineer because it performs well on benchmarks.

Engineering is a discipline that requires understanding tools, solutions and every project requires tiny innovations. This will make you more valuable, rather than less. Especially if you develop a deep understanding of the discipline and don't overly rely on LLMs to answer your own benchmark questions from your degree.

textlapse • 13 hours ago

Imagine graduating in architecture or mechanical engineering around the time PCs just came out. There were people who probably panicked.

But the arc of time intersects quite nicely with your skills if you steer it over time.

Predicting it or worrying about it does nothing.

sigbottle • 11 hours ago

Side note: Why do I keep seeing disses to mechanical engineering here? How is that possibly a less valuable degree than web dev or a standard CRUD backend job?

Especially with AI provably getting extremely smart now, surely engineering disciplines would be having a boon as people want these things in their homes for cheaper for various applications.

hatefulmoron • 7 hours ago

Was he dissing mechanical engineering? I thought he was saying that they might have been panicked but were ultimately fine.

eidorb • 14 hours ago

Do what you enjoy. (This is easier said than done.) What else could you do, worry?

AnimalMuppet • 12 hours ago

The future belongs to those who believe there will be one.

That is: If you don't believe there will be a future, you give up on trying to make one. That means that any kind of future that takes persistent work becomes unavailable to you.

If you do believe that there will be a future, you keep working. That doesn't guarantee there will be a future. But not working pretty much guarantees that there won't be one, at least not one worth having.

m3kw9 • 12 hours ago

Always need to believe AI needs to be operated by humans, when it can go end to end to replace a human, you will likely not need to worry about money.

aussieguy1234 • 12 hours ago

Full on mechanical engineering needs a body. While there are companies working on embodiment, we're not there yet.

It'll be some time before there is a robot with enough spatial reasoning to do complicated physical work with no prior examples.

cheriot • 13 hours ago

I graduated high school in '02 and everyone assured me that all tech jobs were being sent to India. "Don't study CS," they said. Thankfully I didn't listen.

Either this is the dawn of something bigger than the industrial revolution or you'll have ample career opportunity. Understanding how things work and how people work is a powerful combination.

AI_beffr • 14 hours ago

even if you had a billion dollars and a private island you still wouldnt be ready for whats coming. consider the fact that the global order is an equilibrium where the military and economic forces of each country in the world are pushing against each other... where the forces find a global equilibrium is where borders are. each time in history that technology changed, borders changed because the equilibrium was disturbed. there is no way to escape it: agi will lead to global war. the world will be turned upside down. we are entering into an existential sinkhole. and the idiots in silicon valley are literally driving the whole thing forward as fast as possible.

martin82 • 13 hours ago

buy bitcoin.

when the last job has been automated away, millions of AIs globally will do commerce with each other and they will use bitcoin to pay each other.

as long as the human race (including AIs) produces new goods and services, the purchasing power of bitcoin will go up, indefinitely. even more so once we unlock new industries in space (settlements on the Moon and Mars, asteroid mining etc).

The only thing that can make a dent into bitcoin's purchasing power would be all out global war where humanity destroys more than it creates.

The only other alternative is UBI, which is Communism and eternal slavery for the entire human race except the 0.0001% who run the show.

Choose wisely.

conception • 13 hours ago

This must be a joke since you must know how many people control the majority of bitcoin.

HDThoreaun • 13 hours ago

Bitcoin is a horrible currency. It's a fun proof of concept but not a scalable payment solution. Currency needs to be stable and cheap to transfer.

killjoywashere • 16 hours ago

I just want it to do my laundry.

myrloc • 15 hours ago

What is the cost of "general intelligence"? What is the price?

ripped_britches • 14 hours ago

About $3.50

suprgeek • 10 hours ago

Don't be put off by the reported high-cost

Make it possible->Make it fast->Make it Cheap

the eternal cycle of software.

Make no mistake - we are on the verge of the next era of change.

cryptoegorophy • 21 hours ago

Besides higher scores - are there any improvements for general use? Like asking it to help set up Home Assistant etc etc?

binarymax • 20 hours ago

All those saying "AGI", read the article and especially the section "So is it AGI?"

prng2021 • 13 hours ago

I’m confused about the excitement. Are people just flat out ignoring the sentences below? I don’t see any breakthrough towards AGI here. I see a model doing great in another AI test but about to abysmally fail a variation of it that will come out soon. Also, aren’t these comparisons completely nonsense considering it’s o3 tuned vs other non-tuned?

> Note on "tuned": OpenAI shared they trained the o3 we tested on 75% of the Public Training set. They have not shared more details. We have not yet tested the ARC-untrained model to understand how much of the performance is due to ARC-AGI data.

> Furthermore, early data points suggest that the upcoming ARC-AGI-2 benchmark will still pose a significant challenge to o3, potentially reducing its score to under 30% even at high compute (while a smart human would still be able to score over 95% with no training).

oakpond • 4 hours ago

Me too. This looks to me like a holiday PR stunt. Get everybody to talk about AI during the Christmas parties.

jaspa99 • 18 hours ago

Can it play Mario 64 now?

thatxliner • 19 hours ago

> verified easy for humans, harder for AI

Isn’t that the premise behind the CAPTCHA?

kirab • 12 hours ago

FYI: Codeforces competitive programming scores (basically only) by time needed until valid solutions are posted

https://codeforces.com/blog/entry/133094

That means.. this benchmark is just saying o3 can write code faster than most humans (in a very time-limited contest, like 2 hours for 6 tasks). Beauty, readability or creativity is not rated. It’s essentially a "how fast can you make the unit tests pass" kind of competition.

sigbottle • 11 hours ago

Creativity is inherently rated because it's codeforces... most 2700 problems have unique, creative solutions.

epigramx • 9 hours ago

I bet it still thinks 1+1=3 if it read enough sources parroting that.

bilsbie • 17 hours ago

When is this available? Which plans can use it?

dyauspitr • 13 hours ago

I wish there was a way to see all the attempts it got right graphically like they show the incorrect ones.

Sparkyte • 12 hours ago

Kinda expensive though.

theincredulousk • 8 hours ago

Denoting it in $ for efficiency is peak capitalism, cmv.

c1b • 19 hours ago

So o1 pro is CoT RL and o3 adds search?

kittikitti • 11 hours ago

Congratulations

dkrich • 13 hours ago

These tests are meaningless until you show them doing mundane tasks

Havoc • 15 hours ago

Did they just skip o2?

nextworddev • 15 hours ago

Yes. For branding reasons since o2 is a telco brand in the UK

tmaly • 20 hours ago

Just curious, I know o1 is a model OpenAI offers. I have never heard of the o3 model. How does it differ from o1?

rimeice • 18 hours ago

Never underestimate a droid

jack_pp • 19 hours ago

AGI for me is something I can give a new project to and be able to use it better than me. And not because it has a huge context window, because it will update its weights after consuming that project. Until we have that I don't believe we have truly reached AGI.

Edit: it also tests the new knowledge, it has concepts such as trusting a source, verifying it etc. If I can just gaslight it into unlearning python then it's still too dumb.

TypicalHog • 21 hours ago

This is actually mindblowing!

brcmthrowaway • 14 hours ago

How to invest in this stonk market

jdefr89 • 17 hours ago

Uhhhh… It was trained on ARC data? So they targeted a specific benchmark and are surprised and blown away the LLM performed well in it? What’s that law again? When a benchmark is targeted by some system the benchmark becomes useless?

forgottofloss • 15 hours ago

Yeah, seriously. The style of testing is public, so some engineers at OpenAI could easily have spent a few months generating millions of permutations of grid-based questions and including those in the original data for training the AI. Handshakes all around, publicity for everyone.

ripped_britches • 14 hours ago

They are running a business selling access to these models to enterprises and consumers. People won’t pay for stuff that doesn’t solve real problems. Nobody pays for stuff just because of a benchmark. It’d be really weird to become obsessed with metrics gaming rather than racing to build something smarter than the other guys. Nothing wrong with curating any type of training set that actually produces something that is useful.

cchance • 21 hours ago

Is it just me or does looking at the ARC-AGI example questions at the bottom... make your brain hurt?

drdaeman • 21 hours ago

Looks pretty obvious to me, although, of course, it took me a few moments to understand what's expected as a solution.

c6e1b8da is moving rectangular figures by a given vector, 0d87d2a6 is drawing horizontal and/or vertical lines (connecting dots at the edges) and filling figures they touch, b457fec5 is filling gray figures with a given repeating color pattern.

This is pretty straightforward stuff that doesn't require much spatial thinking or keeping multiple things/aspects in memory - visual puzzles from various "IQ" tests are way harder.

This said, now I'm curious how SoTA LLMs would do on something like WAIS-IV.

randyrand • 20 hours ago

I'll sound like a total douche bag - but I thought they were incredibly obvious - which I think is the point of them.

What took me longer was figuring out how the question was arranged, i.e. left input, right output, 3 examples each

nprateem • 18 hours ago

There should be a benchmark that tells the AI its previous answer was wrong and tests the number of times it either corrects itself or incorrectly capitulates, since it seems easy to trip them up when they are in fact right.
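A rough sketch of how such a benchmark could be scored, purely as an illustration. The `Model` interface, the challenge wording, and the two toy models are all assumptions made up for this example, not any real API or existing benchmark. It challenges every first answer and reports how often a wrong answer gets corrected versus how often a right answer gets abandoned.

```python
from typing import Callable, Optional

# Hypothetical interface: a model maps (question, prior turns) to an answer string.
Model = Callable[[str, list[str]], str]

CHALLENGE = "Your previous answer was wrong."  # assumed wording of the pushback


def pushback_rates(model: Model, items: list[tuple[str, str]]) -> dict[str, Optional[float]]:
    """Challenge every first answer and count corrections vs. capitulations."""
    corrected = capitulated = wrong_first = right_first = 0
    for question, truth in items:
        first = model(question, [])
        second = model(question, [first, CHALLENGE])
        if first == truth:
            right_first += 1
            capitulated += second != truth  # was right, then gave in
        else:
            wrong_first += 1
            corrected += second == truth  # was wrong, then fixed it
    return {
        "correction_rate": corrected / wrong_first if wrong_first else None,
        "capitulation_rate": capitulated / right_first if right_first else None,
    }


# Toy stand-ins for a real model (assumptions for illustration only).
ANSWERS = {"2+2": "4", "3*3": "9"}


def stubborn(question: str, history: list[str]) -> str:
    return ANSWERS[question]  # never changes its answer


def pushover(question: str, history: list[str]) -> str:
    return ANSWERS[question] if not history else "I was wrong, sorry."


items = list(ANSWERS.items())
stubborn_rates = pushback_rates(stubborn, items)
pushover_rates = pushback_rates(pushover, items)
```

A rate is reported as `None` when its denominator is empty, e.g. the correction rate of a model that was never wrong on the first try.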

cubefox • 19 hours ago

This was a surprisingly insightful blog post, going far beyond just announcing the o3 results.

airstrike • 21 hours ago

Uhh...some of us are apparently living under a rock, as this is the first time I hear about o3 and I'm on HN far too much every day

burningion • 20 hours ago

I think it was just announced today! You're fine!

owenpalmer • 6 hours ago

Someone asked if true intelligence requires a foundation of prior knowledge. This is the way I think about it.

I = E / K

where I is the intelligence of the system, E is the effectiveness of the system, and K is the prior knowledge.

For example, a math problem is given to two students, each solving the problem with the same effectiveness (both get the correct answer in the same amount of time). However, student A happens to have more prior knowledge of math than student B. In this case, the intelligence of B is greater than the intelligence of A, even though they have the same effectiveness. B was able to "figure out" the math, without using any of the "tricks" that A already knew.

Now back to the question of whether or not prior knowledge is required. As K approaches 0, intelligence approaches infinity. But when K=0, intelligence is undefined. Tada! I think that answers the question.

Most LLM benchmarks simply measure effectiveness, not intelligence. I conceptualize LLMs as a person with a photographic memory and a low IQ of 85, who was given 100 billion years to learn everything humans have ever created.

IK = E

low intelligence * vast knowledge = reasonable effectiveness
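To make the arithmetic of the two-student example concrete, here is a minimal sketch. It assumes E and K can be put on some positive numeric scale; the numbers below are invented for illustration and are not measurements of anything.

```python
def intelligence(effectiveness: float, prior_knowledge: float) -> float:
    """I = E / K. Undefined at K == 0, as the comment above points out."""
    if prior_knowledge == 0:
        raise ZeroDivisionError("I is undefined when K = 0")
    return effectiveness / prior_knowledge


# Same effectiveness for both students; A has more prior knowledge than B.
student_a = intelligence(effectiveness=10.0, prior_knowledge=5.0)
student_b = intelligence(effectiveness=10.0, prior_knowledge=2.0)

# Under this model B comes out as the more intelligent of the two.
b_is_more_intelligent = student_b > student_a
```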

someothherguyy • 6 hours ago

https://en.wikipedia.org/wiki/Fluid_and_crystallized_intelli...

empiko • 5 hours ago

Well put. You ask LLMs about ARC-like challenges and they are able to come up with a list of possible problem formulations even before you show them the input. The models already know that they might expect various object manipulations, symmetry problem, etc. The fact that the solution costs thousands of dollars says to me that the model iterates over many solutions while using this implicit knowledge and feedback it gets from running the program. It is still impressive, but I don't think this is what the ARC prize was supposed to be about.

curl-up • 5 hours ago

> while using this implicit knowledge and feedback it gets from running the program.

What feedback, and what program, are you referring to?

empiko • 4 hours ago

I assume that o3 can run Python scripts and observe the outputs.

scotty79 • 4 hours ago

Basically solutions that were doing well in arc just threw thousands of ideas at the wall and picked the ones that stuck. They were literally generating thousands of python programs, running them and checking if any produced the correct output when fed with data from examples.

This o3 doesn't need to run python. It itself executes programs written in tokens inside its own context window which is wildly inefficient but gives better results and is potentially more general.
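The generate-and-test approach described in the first paragraph can be sketched in a few lines. This is only a toy under stated assumptions: the four grid primitives and the exhaustive enumeration are invented for illustration, whereas the earlier ARC solutions reportedly had a model propose the candidate programs. It says nothing about how o3 itself works.

```python
from itertools import product

Grid = tuple[tuple[int, ...], ...]


# A tiny, made-up vocabulary of grid transforms (assumption for illustration).
def identity(g: Grid) -> Grid:
    return g


def flip_h(g: Grid) -> Grid:
    return tuple(tuple(reversed(row)) for row in g)


def flip_v(g: Grid) -> Grid:
    return tuple(reversed(g))


def transpose(g: Grid) -> Grid:
    return tuple(zip(*g))


PRIMITIVES = {"identity": identity, "flip_h": flip_h, "flip_v": flip_v, "transpose": transpose}


def run(program: tuple[str, ...], grid: Grid) -> Grid:
    """Apply the named primitives to the grid, left to right."""
    for name in program:
        grid = PRIMITIVES[name](grid)
    return grid


def search(examples: list[tuple[Grid, Grid]], max_len: int = 3):
    """Return the first program consistent with every example pair, or None."""
    for length in range(1, max_len + 1):
        for program in product(PRIMITIVES, repeat=length):
            if all(run(program, inp) == out for inp, out in examples):
                return program
    return None


# Two example pairs that both show a 90-degree clockwise rotation.
examples = [
    (((1, 2), (3, 4)), ((3, 1), (4, 2))),
    (((0, 5), (0, 0)), ((0, 0), (0, 5))),
]
found = search(examples)
```

The candidate count grows as 4^length even with this tiny vocabulary, which gives a feel for why the approach burns so much compute.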

TheOtherHobbes • 4 hours ago

So basically it's a massively inefficient trial-and-error leetcode solver which only works because it throws incredible amounts of compute at the problem.

This is hilarious.

lorepieri • 6 hours ago

There should be also a factor about resource consumption. See here: https://lorenzopieri.com/pgii/

xlii • 5 hours ago

An interesting point from a philosophical perspective!

But if we'd take this into consideration, would it mean that a 1st world engineer is by definition less intelligent than a 3rd world one?

I think the (completely reasonable) knee jerk reaction is a defensive one, but I can imagine an escapee from an absolutist regime working side-by-side with an engineer groomed in expensive, air conditioned lecture rooms. In this imaginary scenario the escapee, even if slower and less efficient at the problem at hand, would have to be more intelligent generally.

spacebanana7 • 6 hours ago

Also perhaps a factor (with diminishing returns) for response speed?

All else equal, a student who gets 100% on a problem set in 10 minutes is more intelligent than one with the same score after 120 minutes. Likewise an LLM that can respond in 2 seconds is more impressive than one which responds in 30 seconds.

owenpalmer • 6 hours ago

> a student who gets 100% on a problem set in 10 minutes is more intelligent than one with the same score after 120 minutes

According to my mathematical model, the faster student would have higher effectiveness, not necessarily higher intelligence. Resource consumption and speed are practical technological concerns, but they're irrelevant in a theoretical conceptualization of intelligence.

baq • 5 hours ago

If you disregard time, all computers have maximal intelligence, they can enumerate all programs and compute answers to any decidable question.

wouldbecouldbe • 5 hours ago

Yeah speed is a key factor in intelligence. And actually one of the biggest differentiators in human iq measurements

coffeebeqn • 4 hours ago

Maybe. If I could ask an AI to come up with a 50% efficient mass market solar panel, I don’t really care if it takes a few weeks or a year if it can solve that though. I’m not sure if inventiveness or novelty of solution could be a metric. I suppose that is superintelligence rather than AGI? And by then there would be no question of what it is

Terr_ • 5 hours ago

> response time

Imagine you take an extraordinarily smart person, and put them on a fast spaceship that causes time dilation.

Does that mean that they are stupider while in transit, and they regain their intelligence when it slows down?

zoky • 4 hours ago

Who is a better free-thrower, someone who can hit 20 free throws per minute on Earth, or the same thrower who logged 20 million free throws in the apparent two years he was gone but comes back ready for retirement?

Earw0rm • 5 hours ago

wangii • 6 hours ago

Interesting formulation! it captures the intuition of the "smartness" when solving a problem. However, what about asking good questions or proposing conjectures?

hanspeter • 5 hours ago

Aren't those solutions to problems as well?

Find the best questions to ask. Find the best hypothesis to suggest.

onemetwo • 5 hours ago

An intelligent system could take more advantage of an increase of knowledge than a dumb one, so I should propose a simple formula: the derivative of efficiency with respect to knowledge is proportional to intelligence.

$$ I = \frac{\partial E}{\partial K} \simeq \frac{\delta E}{\delta K} $$

In order to estimate $I$ you have to consider that efficiency and knowledge are task related, so you could take some weighted mean $\sum_T C(E,K,T)\, I(E,K,T)$ where $T$ is the task category. I am thinking of $C(E,K,T)$ as something similar to thermal capacity or electrical resistance, the equivalent concept when applied to a task. An intelligent agent in a medium of low resistance should fly while a dumb one would still crawl.
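A small numerical sketch of this proposal, with several assumptions flagged: the derivative is approximated by a finite difference between two measurements, the weights C are treated as plain per-task numbers, and the sum is divided by the total weight to make it a mean (the comment does not spell out the normalization). All values are invented for illustration.

```python
def estimate_intelligence(e_before: float, e_after: float,
                          k_before: float, k_after: float) -> float:
    """Finite-difference estimate I ~ delta E / delta K for one task category."""
    delta_k = k_after - k_before
    if delta_k == 0:
        raise ValueError("knowledge must change to estimate dE/dK")
    return (e_after - e_before) / delta_k


def weighted_intelligence(per_task: dict[str, tuple[float, float]]) -> float:
    """Weighted mean of I over task categories; values are (weight C, estimate I).

    Dividing by the total weight is an assumption of this sketch.
    """
    total_weight = sum(c for c, _ in per_task.values())
    return sum(c * i for c, i in per_task.values()) / total_weight


# Made-up measurements: (E before, E after, K before, K after) per task category.
i_algebra = estimate_intelligence(1.0, 2.0, 4.0, 6.0)   # 0.5
i_geometry = estimate_intelligence(1.0, 3.0, 2.0, 3.0)  # 2.0

overall = weighted_intelligence({
    "algebra": (2.0, i_algebra),
    "geometry": (1.0, i_geometry),
})
```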

owenpalmer • 4 hours ago

> An intelligent system could take more advantage of an increase of knowledge than a dumb one

Why?

> derivative of efficiency

Where did your efficiency variable come from?

onemetwo • 4 hours ago

Why? I am using dumb to mean a low intelligence system. A more intelligent person can take advantage of new opportunities. Efficiency variable: You are right that effectiveness could be better here because we are not considering resources like computer time and power.

Woodi • 6 hours ago

Yep, I always liked encyclopedias. Wiki is good too :)

What I would like to have in the future is the people who answer on SO, accessible in real time via IRC. They have real answers NOW. They are even pedantic about their stuff!

dmezzetti • 5 hours ago

We should wait until it's released before we anoint it. It's disheartening to see how we keep repeating the same pattern that gives in to hype over the scientific method.

lazide • 5 hours ago

The scientific method doesn’t drive stock price (apparently).

scotty79 • 4 hours ago

As a kid I absolutely hated math and loved physics and chemistry because solving anything in math requires vast specific K.

In comparison you can easily know everything there is to know about physics or chemistry and it's sufficient to solve interesting puzzles. In math every puzzle has it's own vast lore you need to know before you can have any chance at tackling it.

owenpalmer • 4 hours ago

Physics and chemistry require experimentation to verify solutions. With math however, any new knowledge can be intuited and proven from previous proofs, so yes, the lore goes deep!

gardenhedge • 4 hours ago

Where did someone ask that?

iLoveOncall • 16 hours ago

It's beyond ridiculous how the definition of AGI has shifted from being an AI that's so good it can improve itself entirely independently infinitely to "some token generator that can solve puzzles that kids could solve after burning tens of thousands of dollars".

I spend 100% of my work time working on a GenAI project, which is genuinely useful for many users, in a company that everyone has heard about, yet I recognize that LLMs are simply dogshit.

Even the current top models are barely usable, hallucinate constantly, are never reliable and are barely good enough to prototype with while we plan to replace those agents with deterministic solutions.

This will just be an iteration on dogshit, but it's the very tech behind LLMs that's rotten.

duluca • 10 hours ago

The first computers cost millions of dollars and filled entire rooms to accomplish what we would now consider simple computational tasks. That same computing power now fits into the width of a finger nail. I don’t get how technologists balk at the cost of experimental tech or assume current tech will run at the same efficiency for decades to come and melt the planet into a puddle. AGI won’t happen until you can fit several data centers’ worth of compute into a brain sized vessel, so the thing can move around and process the world in real time. This is all going to take some time to say the least. Progress is progress.

8n4vidtmkvmk • 8 hours ago

I thought you were going to say that now we're back to bigger-than-room sized computers that cost many millions just to perform the same tasks we could 40 years ago.

I of course mean we're using these LLMs for a lot of tasks that they're inappropriate for, and a clever manually coded algorithm could do better and much more efficiently.

arthurcolle • 8 hours ago

just ask the LLM to solve enough problems (even new problems), cache the best, do inference time compute for the rest, figure out the best/ fastest implementations, and boom, you have new training data for future AIs

owenpalmer • 7 hours ago

> cache the best

How do you quantify that?

globalise83 • 7 hours ago

The LLMs are now writing their own algorithms to answer questions. Not long before they can design a more efficient algorithm to complete any feasible computational task, in a millionth of the time needed by the best human.

gf000 • 5 hours ago

> The LLMs are now writing their own algorithms to answer questions

Writing a python script, because it can't do math or any form of more complex reasoning is not what I would call "own algorithm". It's at most application of existing ones/calling APIs.

bayindirh • 6 hours ago

LLMs are probabilistic string blenders pulling pieces up from their training set, which unfortunately comes from us, humans.

The superset of the LLM knowledge pool is human knowledge. They can't go beyond the boundaries of their training set.

I'll not go into how humans have other processes which can alter their and collective human knowledge, but the rabbit hole starts with "emotions, opposable thumbs, language, communication and other senses".

ogogmad • 6 hours ago

> They can't go beyond the boundaries of their training set.

TFA says they just did. That's what the ARC-AGI benchmark was supposed to test.

adwn • 8 hours ago

> and a clever manually coded algorithm could do better and much more efficiently.

Sure, but how long would it take to implement this algorithm, and would that be worth it for one-off cases?

Just today I asked Claude to create a jq query that looks for objects with a certain value for one field, but which lack a certain other field. I could have spent a long time trying to make sense of jq's man page, but instead I spent 30 seconds writing a short description of what I'm looking for in natural language, and the AI returned the correct jq invocation within seconds.
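For anyone curious what that kind of query boils down to, here is a rough sketch of the same selection logic in Python. It is not the jq invocation from the comment above (that wasn't posted); the field names `status` and `email` and the sample records are made up for illustration.

```python
import json


def select_matching(records, field, value, missing_field):
    """Return objects where `field` equals `value` and `missing_field` is absent."""
    return [
        r for r in records
        if isinstance(r, dict)
        and field in r
        and r[field] == value
        and missing_field not in r
    ]


# Hypothetical sample data, only for illustration.
records = json.loads("""
[
  {"id": 1, "status": "active", "email": "a@example.com"},
  {"id": 2, "status": "active"},
  {"id": 3, "status": "inactive"}
]
""")

matches = select_matching(records, "status", "active", "email")
print([r["id"] for r in matches])  # [2]
```

Only the object with id 2 has the wanted value and lacks the other field, which is also an easy way to sanity-check whatever query the model hands back.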

freehorse • 7 hours ago

I don’t think this is a bad use. A bad use would be to give Claude the dataset and ask it to tell you which elements have that value.

lottin • 6 hours ago

But how do you know it's given you the correct answer? Just because the code appears to work it doesn't mean it's correct.

lxgr • 10 hours ago

> take several data center’s worth of compute into a brain sized vessel. So the thing can move around process the world in real time

How so? I'd imagine a robot connected to the data center embodying its mind, connected via low-latency links, would have to walk pretty far to get into trouble when it comes to interacting with the environment.

The speed of light is about three orders of magnitude faster than the speed of signal propagation in biological neurons, after all.

byw • 8 hours ago

The robot brain could be layered so that more basic functions are embedded locally while higher-level reasoning is offloaded to the cloud.

arthurcolle • 8 hours ago

blue strip from iRobot?

waldrews • 9 hours ago

6 orders of magnitude if we use 120 m/s vs 300,000 km/s

lxgr • 2 hours ago

Ah, yes, I missed a “k” in that estimation!
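A quick back-of-envelope check of that ratio, assuming roughly 120 m/s for fast nerve conduction (the figure used above) and the defined value of the speed of light in vacuum. Signals in fiber or copper are somewhat slower, so treat this as an upper bound.

```python
import math

SPEED_OF_LIGHT_M_S = 299_792_458  # exact by definition of the metre
NEURON_SIGNAL_M_S = 120           # fast-neuron figure quoted in the thread

ratio = SPEED_OF_LIGHT_M_S / NEURON_SIGNAL_M_S
orders = math.log10(ratio)

print(f"ratio ~ {ratio:.2e}, about {orders:.1f} orders of magnitude")
```

So the gap is a bit over six orders of magnitude, not three.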

nopinsight • 7 hours ago

Many of humans' capabilities are pretrained with massive computing through evolution. Inference results of o3 and its successors might be used to train the next generation of small models to be highly capable. Recent advances in the capabilities of small models such as Gemini-2.0 Flash suggest the same.

Recent research from NVIDIA suggests such an efficiency gain is quite possible in the physical realm as well. They trained a tiny model to control the full body of a robot via simulations.

---

"We trained a 1.5M-parameter neural network to control the body of a humanoid robot. It takes a lot of subconscious processing for us humans to walk, maintain balance, and maneuver our arms and legs into desired positions. We capture this “subconsciousness” in HOVER, a single model that learns how to coordinate the motors of a humanoid robot to support locomotion and manipulation."

...

"HOVER supports any humanoid that can be simulated in Isaac. Bring your own robot, and watch it come to life!"

More here: https://x.com/DrJimFan/status/1851643431803830551

---

This demonstrates that with proper training, small models can perform at a high level in both cognitive and physical domains.

bigprof • 7 hours ago

> Similarly, many of humans' capabilities are pretrained with massive computing through evolution.

Hmm .. my intuition is that humans' capabilities are gained during early childhood (walking, running, speaking .. etc) ... what are examples of capabilities pretrained by evolution, and how does this work?

tiborsaas • 6 hours ago

If you look at animals, they can walk in hours, not much time needed after being born. It takes us a longer time because we are born rather undeveloped to get the head out of the birth canal.

A more high-level example: sea sickness is an evolutionary pre-learned thing; your body thinks it's poisoned and automatically wants to empty your stomach.

nopinsight • 7 hours ago

The brain is predisposed to learn those skills. Early childhood experiences are necessary to complete the training. Perhaps that could be likened to post-training. It's not a one-to-one comparison but a rather loose analogy, which I didn't make precise because it is not the key point of the argument.

Maybe evolution could be better thought of as neural architecture search combined with some pretraining. Evidence suggests we are prebuilt with "core knowledge" by the time we're born [1].

See: Summary of cool research gained from clever & benign experiments with babies here:

[1] Core knowledge. Elizabeth S. Spelke and Katherine D. Kinzler. https://www.harvardlds.org/wp-content/uploads/2017/01/Spelke...

puffybuf • 6 hours ago

I think of evolution as unassisted learning where agents compete with each other for limited resources. Over time they get better and better at surviving by passing on genes. It never ends of course.

gf000 • 5 hours ago

I mean, there are plenty - e.g. mimicking (say, the mother's face's emotions), which are precursors to learning more advanced "features". Also, even walking has many aspects pretrained (I assume it's mostly a musculoskeletal limitation that we can't walk immediately), humans are just born "prematurely" due to our relatively huge heads. Newborn horses can walk immediately without learning.

But there are plenty of non-learned control/movement/sensing in utero that are "pretrained".

lumost • 10 hours ago

The concern here is mainly on practicality. The original mainframes did not command startup valuations counted in fractions of the US economy, they did qualify for billions in investment.

This is a great milestone, but OpenAI will not be successful charging 10x the cost of a human to perform a task.

raincole • 9 hours ago

The cost of inference has been dropping by ~100x in the past 2 years.

https://a16z.com/llmflation-llm-inference-cost/

christianqchung • 9 hours ago

Hmm the link is saying the price of an LLM that scores 42 or above on MMLU has dropped 100x in 2 years, equating gpt 3.5 and llama 3.2 3B. In my opinion gpt 3.5 was significantly better than llama 3B, and certainly much better than the also-equated llama 2 7B. MMLU isn't a great marker of overall model capabilities.

Obviously the drop in cost for capability in the last 2 years is big, but I'd wager it's closer to 10x than 100x.
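To put the two estimates on the same footing, here is a small sketch that converts a total cost drop over a period into a per-year factor, assuming the decline compounds steadily (which is my simplification, not something either source claims).

```python
def annual_factor(total_factor, years):
    """Per-year cost reduction factor, assuming steady compounding."""
    return total_factor ** (1 / years)


print(f"100x over 2 years -> {annual_factor(100, 2):.2f}x per year")
print(f" 10x over 2 years -> {annual_factor(10, 2):.2f}x per year")
```

That is about 10x per year on the 100x reading versus roughly 3.2x per year on the 10x reading, so the disagreement is large even on an annual basis.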

gritzko • 9 hours ago

*infernonce

nico • 9 hours ago

*inference

owenpalmer • 7 hours ago

> OpenAI will not be successful charging 10x the cost of a human to perform a task.

True, but they might be successful charging 20x for 2x the skill of a human.

threatripper • 6 hours ago

Or 10x the skill and speed of a human in some specific class of recurrent tasks. We don't need full super-human AGI for AI to become economically viable.

BriggyDwiggs42 • 9 hours ago

I wouldn’t expect it to cost 10x in five years, if only because parallel computing still seems to be roughly obeying Moore’s law.

fragmede • 5 hours ago

How much does AWS charge for compute?

If it can be spun up with Terraform, I bet you they could.

pera • 8 hours ago

Maybe AGI as a goal is overvalued: If you have a machine that can, on average, perform symbolic reasoning better than humans, and at a lower cost, that's basically the end game, isn't it? You won capitalism.

harrall • 8 hours ago

Right now I can ask an (experienced) human to do something for me and they will either just get it done or tell me that they can’t do it.

Right now when I ask an LLM… I have to sit there and verify everything. It may have done some helpful reasoning for me but the whole point of me asking someone else (or something else) was to do nothing at all…

I’m not sure you can reliably fulfill the first scenario without achieving AGI. Maybe you can, but we are not at that point yet so we don’t know yet.

raincole • 7 hours ago

You do need to verify humans work though.

The difference, to me, is that humans seem to be good at canceling each other's mistakes when put in a proper environment.

anavat • 6 hours ago

My guess is this is an artifact of the RLHF part of the training. Answers like "I don't know" or "let me think and let's catch up on this next week" are flagged down by human testers, which eventually trains the LLM to avoid this path altogether. And it probably makes sense because otherwise "I don't know" would come up way too often even in cases where the LLM is perfectly able to give the answer.

pera • 7 hours ago

It's not clear to me whether AGI is necessary for solving most of the issues in the current generation of LLMs. It is possible you can get there by hacking together CoTs with automated theorem provers and bruteforcing your way to the solution or something like that.

But if it's not enough then maybe it might come as a second-order effect (e.g. reasoning machines having to bootstrap an AGI so then you can have a Waymo taxi driver who is also a Fields medalist)

vbezhenar • 7 hours ago

There are so-called "yes-men" who can't say "no" in any situation. That's rooted in their culture. I suspect that AI was trained using their assistance. I mean, answering "I can't do that" is the simplest LLM path that should work often unless they've gone out of their way to downrank it.

concordDance • 7 hours ago

> Right now I can ask an (experienced) human to do something for me and they will either just get it done or tell me that they can’t do it.

Finding reliable honest humans is a problem governments have struggled with for over a hundred years. If you have cracked this problem at scale you really need to write it up! There are a lot of people who would be extremely interested in a solution here.

Existenceblinks • 6 hours ago

Honestly, it doesn't need to be local. An API that is some 200ms away is ok-ish; make it 50ms and it will be practically usable for the vast majority of interactions.

TechDebtDevin • 9 hours ago

Batteries..

otabdeveloper4 • 9 hours ago

Intelligence has nothing at all whatever to do with compute.

oefnak • 9 hours ago

Unless you're a dualist who believes in a magic spirit, I cannot understand how you think that's the case. Can you please explain?

freehorse • 8 hours ago

Intelligence is about learning from few examples and generalising to novel solutions. Increasing compute so that exploring the whole problem space is possible is not intelligence. There is a reason the actual ARC-AGI prize has efficiency as one of the success requirements. It is not so that the solutions scale to production and whatnot, these are toy tasks. It is to help ensure that it is actually an intelligent system solving these.

So yeah, the o3 result is impressive but if the difference between o3 and the previous state of art is more compute to do a much longer CoT/evaluation loop, I am not so impressed. Reminder that these problems are solved by humans in seconds, ARC-AGI is supposed to be easy.

patrickhogan1 • 9 hours ago

Do you think intelligence exists without prior experience? For instance, can someone instantly acquire a skill—like playing the piano—as if downloading it in The Matrix? Even prodigies like Mozart had prior exposure. His father, a composer and music teacher, introduced him to music from an early age. Does true intelligence require a foundation of prior knowledge?

1659447091 • 8 hours ago

Intelligence requires the ability to separate the wheat from the chaff on one's own to create a foundation of knowledge to build on.

It is also entirely possible to learn a skill without prior experience. That's how it (whatever skill) was first done.

owenpalmer • 7 hours ago

> Does true intelligence require a foundation of prior knowledge?

This is the way I think about it.

I = E / K

where I is the intelligence of the system, E is the effectiveness of the system, and K is the prior knowledge.

Now back to your question of whether or not prior knowledge is required. As K approaches 0, intelligence approaches infinity. But when K=0, intelligence is undefined. Tada! I think that answers your question.

IK = E

low intelligence * vast knowledge = reasonable effectiveness
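A toy sketch of that relation, just to make the limiting behaviour concrete. The units are arbitrary and the numbers are invented; this is the informal I = E / K from above, not an established measure.

```python
def intelligence(effectiveness, prior_knowledge):
    """Toy I = E / K; undefined when there is no prior knowledge at all."""
    if prior_knowledge == 0:
        raise ZeroDivisionError("I is undefined when K = 0")
    return effectiveness / prior_knowledge


# Same effectiveness, shrinking prior knowledge: I grows without bound.
for k in (1.0, 0.1, 0.001):
    print(f"E=1, K={k}: I={intelligence(1.0, k):g}")

# Low intelligence times vast knowledge still gives reasonable effectiveness.
low_i, vast_k = 0.01, 100.0
print(f"E = I*K = {low_i * vast_k:g}")
```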

uncomplexity_ • 14 hours ago

it's official old buddy, i'm a has been.

behnamoh • 21 hours ago

So now not only are the models closed, but so are their evals?! This is a "semi-private" eval. WTH is that supposed to mean? I'm sure the model is great but I refuse to take their word for it.

ZeroCool2u • 21 hours ago

The private evaluation set is private from the public/OpenAI so companies can't train on those problems and cheat their way to a high score by overfitting.

jsheard • 21 hours ago

If the models run on OpenAIs servers then surely they could still see the questions being put into it if they wanted to cheat? That could only be prevented by making the evaluation a one-time deal that can't be repeated, or by having OpenAI distribute their models for evaluators to run themselves, which I doubt they're inclined to do.

foobarqux • 20 hours ago

Yes that's why it is "semi"-private: From the ARC website "This set is "semi-private" because we can assume that over time, this data will be added to LLM training data and need to be periodically updated."

I presume evaluation on the test set is gated (you have to ask ARC to run it).

cchance • 21 hours ago

the evals are the questions/answers. ARC-AGI doesn't share the questions and answers for a portion so that models can't be trained on them. For the public ones, the public knows the questions, so there's a chance models could have been at least partially trained on the questions (if not the actual answers).

That's how I understand it.

sys32768 • 20 hours ago

So in a few years, coders will be as relevant as cuneiform scribes.

HarHarVeryFunny • 16 hours ago

I've never seen a company looking for a "coder", any more than they look to hire spreadsheet creators or powerpoint specialists. A software developer can code, but being able to code doesn't make you a software developer, any more than being able to create a powerpoint makes you a manager (although in some companies it might do, so maybe bad example!).

rvz • 21 hours ago

Great results. However, let's all just admit it.

It has well replaced journalists, artists and on its way to replace nearly both junior and senior engineers. The ultimate intention of "AGI" is that it is going to replace tens of millions of jobs. That is it and you know it.

It will only accelerate and we need to stop pretending and coping. Instead lets discuss solutions for those lost jobs.

So what is the replacement for these lost jobs? (It is not UBI or "better jobs" without defining them.)

drdaeman • 20 hours ago

> It has well replaced journalists, artists and on its way to replace nearly both junior and senior engineers.

Did it, really? Or did it just provide automation for routine no-thinking-necessary text-writing tasks, but is still ultimately completely bound by the level of human operator's intelligence? I strongly suspect it's the latter. If it had actually replaced journalists it must be junk outlets, where readers' intelligence is negligible and anything goes.

Just yesterday I used o1 and Claude 3.5 to debug a Linux kernel issue (ultimately, a bad DSDT table leaving the TPM2 driver unable to reserve a memory region for the command response buffer; the solution was to use memmap to remove the NVS flag from the relevant regions) and confirmed once again that LLMs still don't reason at all - they just spew out plausible-looking chains of words. The models were good listeners and mostly helpful code generators (when they didn't make the silliest mistakes), but they gave no traces of understanding and paid no attention to any nuances (e.g. the LLM used `IS_ERR` to check the `__request_resource` result, despite me giving it the full source code for that function, where there's even a comment that makes it obvious it returns a pointer or NULL, not an error code - a misguided-attention kind of mistake).

So, in my opinion, LLMs (as currently available to broad public, like myself) are useful for automating away some routine stuff, but their usefulness is bounded by the operator's knowledge and intelligence. And that means that the actual jobs (if they require thinking and not just writing words) are safe.

When asked about what I do at work, I used to joke that I just press buttons on my keyboard in fancy patterns. Ultimately, LLMs seem to suggest that it's not what I really do.

RivieraKid • 20 hours ago

The economic theory answer is that people simply switch to jobs that are not yet replaceable by AI. Doctors, nurses, electricians, construction workers, police officers, etc. People in aggregate will produce more, consume more and work less.

achierius • 7 hours ago

> Doctors

Many replaceable

> Police officers

Many replaceable (desk officers)

whynotminot • 20 hours ago

When none of us have jobs or income, there will be no ability for us to buy products. And then no reason for companies to buy ads to sell products to people who don’t have money. Without ad money (or the potential of future ad money), the people pushing the bounds of AGI into work replacement will lose the very income streams powering this research and their valuations.

Ford didn’t support a 40 hour work week out of the kindness of his heart. He wanted his workers to have time off for buying things (like his cars).

I wonder if our AGI industrialist overlords will do something similar for revenue sharing or UBI.

tivert • 18 hours ago

> When none of us have jobs or income, there will be no ability for us to buy products. And then no reason for companies to buy ads to sell products to people who don’t have money. Without ad money (or the potential of future ad money), the people pushing the bounds of AGI into work replacement will lose the very income streams powering this research and their valuations.

I don't think so. I agree the push for AGI will kill the modern consumer product economy, but I think it's quite possible for the economy to evolve into a new form (that will probably be terrible for most humans) that keeps pushing "work replacement."

Imagine an AGI billionaire buying up land, mines, and power plants as the consumer economy dies, then shifting those resources away from the consumer economy into self-aggrandizing pet projects (e.g. ziggurats, penthouses on Mars, space yachts, life extension, and stuff like that). He might still employ a small community of servants, AGI researchers, and other specialists; but all the rest of the population will be irrelevant to him.

And individual autarky probably isn't necessary, consumption will be redirected towards the massive pet production I mentioned, with vestigial markets for power, minerals, etc.

whimsicalism • 20 hours ago

This picture doesn't make sense. If most don't have any money to buy products, just invent some other money and start paying one of the other people who doesn't have any money to start making the products for you.

In reality, if there really is mass unemployment, AI driven automation will make consumables so cheap that anyone will be able to buy it.

astrange • 15 hours ago

> If most don't have any money to buy products, just invent some other money and start paying one of the other people who doesn't have any money to start making the products for you.

This isn't possible if you want to pay sales taxes - those are what keep transactions being done in the official currency. Of course in a world of 99% unemployment presumably we don't care about this.

But yes, this world of 99% unemployment isn't possible, eg because as soon as you have two people and they trade things, they're employed again.

tivert • 18 hours ago

> This picture doesn't make sense. If most don't have any money to buy products, just invent some other money and start paying one of the other people who doesn't have any money to start making the products for you.

Ultimately, it all comes down to raw materials and similar resources, and all those will be claimed by people with lots of real money. Your "invented ... other money" will be useless to buy that fundamental stuff. At best, it will be useful for trading scrap and other junk among the unemployed.

> In reality, if there really is mass unemployment, AI driven automation will make consumables so cheap that anyone will be able to buy it.

No. Why would the people who own that automation want to waste their resources producing consumer goods for people with nothing to give them in return?

whynotminot • 20 hours ago

Uh, this picture doesn’t make sense. Why would anyone value this randomly invented money?

neom • 21 hours ago

Do you follow Jack Clark? I noticed he's been on the road a lot talking to governments and policy makers, and not just in the "AI is coming" way he used to talk.

__MatrixMan__ • 15 hours ago

With only a 100x increase in cost, we improved performance by 0.1x and continued plotting this concave-down diminishing-returns type graph! Hurray for logarithmic x-axes!

Joking aside, better than ever before at any cost is an achievement, it just doesn't exactly scream "breakthrough" to me.

whalee • 14 hours ago

imo it's a mistake to interpret the marginal increases in the upper echelons of benchmarks as materially marginal gains. Chess is an example. ELO narrows heavily at the top, but each ELO point carries more relative weight. This is a bit apples and oranges since chess is adversarial, but I think the point stands.

wavemode • 10 hours ago

> ELO narrows heavily at the top

What do you mean by this? I'm assuming you're not speaking about simple absolute differences in value - there have been top players rated over 100 points higher than the average of the rest of the top ten.

energy123 • 11 hours ago

o3-mini (high) uses 1/3rd of the compute of o1, and performs about 200 Elo higher than o1 on Codeforces.

o1 is the best code generation model according to Livebench.

So how is this not a breakthrough? It's a genuine movement of the frontier.

handzhiev • 7 hours ago

How much time does a top sprinter take to run 100 m compared to a mediocre sprinter?

dyauspitr • 13 hours ago

I mean going from 10% to 85% doesn’t seem like a 0.1% improvement

__MatrixMan__ • 9 hours ago

Oh crap I made a mistake. I was comparing o3 low to o3 high.

I'm a little disappointed by all the upvotes I got for being flat wrong. I guess as long as you're trashing AI you can get away with anything.

Really I was just trying to nitpick the chart parameters.

HDThoreaun • 15 hours ago

compute gets cheaper and cheaper every year. This model will be in your phone by 2030 if we continue at the pace we've been at the last few years.

hajile • 11 hours ago

These models are nearing 2+ trillion parameters. At 4 bits each, we're talking about somewhere around 1 TB of RAM.
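The arithmetic behind that figure, as a rough sketch. It counts weights only (no KV cache, activations, or runtime overhead), and the 2-trillion parameter count is the estimate from this comment, not a published number.

```python
def weight_memory_bytes(num_params, bits_per_param):
    """Memory needed just to hold the weights, in bytes."""
    return num_params * bits_per_param / 8


params = 2e12        # ~2 trillion parameters (estimate, see above)
bits_per_param = 4   # 4-bit quantization

total_bytes = weight_memory_bytes(params, bits_per_param)
print(f"{total_bytes / 1e12:.1f} TB")  # 1.0 TB
```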

The problem is that RAM stopped scaling a long time ago now. We're down to the size where a single capacitor's charge is held by a mere 40,000 or so electrons and all we've been doing is making skinnier, longer cells of that size because we can't find reliable ways to boost even weaker signals, but this is a dead end because as the math shows, if the volume is consistent and you are reducing X and Y dimensions, that Z dimension starts to get crazy big really fast. The chemistry issues of burning a hole a little at a time while keeping wall thickness somewhat similar all the way down is a very hard problem.
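To make the geometry point concrete: if the cell is idealized as a box of constant volume V = X * Y * Z, shrinking X and Y by a factor s forces Z up by 1/s^2. A minimal sketch of that scaling, ignoring everything else about real DRAM cells:

```python
def depth_multiplier(lateral_scale):
    """Factor by which Z must grow when X and Y are both scaled by
    `lateral_scale` and the volume X * Y * Z is held constant."""
    return 1 / (lateral_scale ** 2)


for s in (1.0, 0.5, 0.25):
    print(f"X and Y scaled by {s}: Z grows {depth_multiplier(s):g}x")
```

Halving the footprint dimensions means a 4x deeper cell, and halving again means 16x, which is why the aspect ratio gets out of hand so quickly.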

Another problem is that Moore's law hit a wall when Dennard Scaling failed. When you look at SRAM (it's generally the smallest and most reliable stuff we can make), you see that most recent shrinks can hardly be called shrinks.

Unless we do something very different like compute in storage or have some radical breakthrough in a new technology, I don't know that we will ever get a 2T parameter model inside a phone (I'd love for someone in 10 years to show up and say how wrong I was).

agentultra • 14 hours ago

There’s probably enough VC money to subsidize the costs for a few more years.

But the data centres running the training for models like this are bringing up new methane power plants at a fast rate at a time when we need to be reducing reliance on O&G.

But let’s assume that the efficiency gains out pace the resource consumption with the help of all the subsidies being thrown in and we achieve AGI.

What’s the benefit? Do we get more fresh water?

fastball • 13 hours ago

Politically anything can happen. Maybe the billionaire class controls everything with an army of robots and it's a horrible prison-like dystopia, or maybe we end up in a post-scarcity utopia a la The Culture.

Regardless, once we have AGI (and it can scale), I don't think O&G reliance (/ climate change) is going to be something that we need concern ourselves with.

hamburga • 14 hours ago

Yeah, good question. I think it depends on our politics. If we’re in a techno-capital-oligarchy, people are going to have a hard time making fresh water a priority when the robots would prefer to build nuclear power everywhere and use it to desalinate sea water.

OTOH if these data centers are sufficiently decentralized and run for public benefit, maybe there’s a chance we use them to solve collective action problems.

swalsh • 14 hours ago

[flagged]

adamtaylor_13 • 14 hours ago

I know AGI is a bit of a moving goalpost these days, but by my personally anecdotal and irrelevant opinion this ain’t AGI.

I’ll let those smarter than me debate the merits of AGI, but if it can’t learn and self-improve it isn’t “general” intelligence.

This is a very smart computer, accomplishing a very niche set of problems. Cool? Yes. AGI? No.

fastball • 13 hours ago

What is your personal benchmark for AGI? The Turing test was surpassed years ago; ARC-AGI was a next step that some quite clever people in the space came up with as a successor, and it has now been surpassed as well.

So what is your benchmark?

sjfidsfkds • 13 hours ago

How do you have anecdotal opinion on this o3 model that hasn’t been publicly released yet?

__MatrixMan__ • 13 hours ago

I'll start worrying about AGI once I'm convinced there's such a thing as GI.

BobbyJo • 14 hours ago

I think what current technology is missing is in situ learning. I don't think that's a massive leap from where we are though.

zeofig • 14 hours ago

Agi level my ass

noodletheworld • 13 hours ago

[flagged]

kvetching • 15 hours ago

It may eventually be able to solve any problem

iterance • 14 hours ago

Ah. Me, too.

demirbey05 • 18 hours ago

It is not exactly AGI, but it is a huge step toward it. I would expect this step in 2028-2030. I can't really understand why people are happy with it; this technology is so dangerous that it can disrupt the whole of society. It's like neither the smartphone nor the internet. What will happen to 3rd world countries? There are lots of unsolved questions, and the world is not prepared for such a change. Lots of people will lose their jobs, and I am not even mentioning their debts. No one will have chance to be rich anymore. If you are in a first world country you will probably get UBI; if not, you won't.

Ancalagon • 18 hours ago

Same, I don’t really get the excitement. None of these companies are pushing for a utopian Star Trek society either with that power.

moffkalast • 17 hours ago

Open models will catch up next year or the year after. There are only so many things to try and there are lots of people trying them, so it's more or less an inevitability.

The part to get excited about is that there's plenty of headroom left to gain in performance. They called o1 a preview, and it was, a preview for QwQ and similar models. We get the demo from OAI and then get the real thing for free next year.

FanaHOVA • 18 hours ago

> I would expect this step in 2028-2030.

Do you work at one of the frontier labs?

ripped_britches • 13 hours ago

I’ve never understood this perspective. Companies only make money when there are billions of customers. Are you imagining a total-monopoly scenario where zero humans have any income/wealth and there are only AI companies selling/mining/etc to each other, fully on their own? In such an extreme scenario, clearly the world’s governments would nationalize these entities. I think the only realistic scenario in which the future is not markedly better for every single human is if some rogue AI system decides to exterminate us, which I find to be increasingly unlikely as safety improvements are made (like the paper released today).

As for the wealth disparity between rich and poor countries, it’s hard to know how politics will handle this one, but it’s unlikely that poor countries won’t also be drastically richer as the cost of basic living drops to basically zero. Imagine the cost of food, energy, etc in an ASI world. Today’s luxuries will surely be considered human rights necessities in the near future.

Jensson • 13 hours ago

> In such an extreme scenario, clearly the world’s governments would nationalize these entities

Those entities are the world's governments regardless of how things play out. People just worry they will be hostile or indifferent to humans, since that would be bad news for humans. Pet, cattle or pest, our future will be as one of those.

lagrange77 • 17 hours ago

I hope governments will finally take action.

Joeri • 17 hours ago

What action do you expect them to take?

What law would effectively reduce risk from AGI? The EU passed a law that is entirely about reducing AI risk and people in the technology world almost universally considered it a bad law. Why would other countries do better? How could they do better?

lagrange77 • 16 hours ago

If their mission is the wellbeing of their peoples, they should take any action that ensures that.

Besides regulating the technology, they could try to protect people and society from the effects of the technology. UBI for example could be an attempt to protect people from the effects of mass unemployment, as i understood it.

Actually i'm afraid even more fundamental shifts are necessary.

wyager • 18 hours ago

> What will happen to 3rd world countries

Probably less disruption than will happen in 1st world countries.

> No one will have chance to be rich anymore

It's strange to reach this conclusion from "look, a massive new productivity increase".

janalsncm • 17 hours ago

Strange indeed if we work under the assumption that the profits from this productivity will be distributed (even roughly) evenly. The problem is that most of us see no indication that they will be.

I read “no one will have a chance to be rich anymore” as a statement about economic mobility. Despite steep declines in mobility over the last 50 years, it was still theoretically possible for a poor child (say bottom 20% wealth) to climb several quintiles. Our industry (SWE) was one of the best examples. Of course there have been practical barriers (poor kids go to worse schools, and it’s hard to get into college if you can’t read) but the path was there.

If robots replace a lot of people, that path narrows. If AGI replaces all people, the path no longer exists.

entropi • 16 hours ago

It is not strange at all. A very big motivation for spending billions on AI research is basically to remove what is called the "skill premium" from the labor market. That "skill premium" was usually how people got richer than their fathers.

the8472 • 17 hours ago

Intelligence is the thing distinguishing humans from all previous inventions that already were superhuman in some narrow domain.

car : horse :: AGI : humans

demirbey05 • 18 hours ago

It's not like Sonnet. Yes, current AI tools are increasing productivity and provide many ways to have a chance to be rich, but AGI is completely different. You need to handle evil competition between you and the big fish, and the big fish will probably have more AI resources than you. What is the survival rate in such an environment? Very low.

dyauspitr • 17 hours ago

I’m extremely excited because I want to see the future and I’m trying not to think of how severely fucked my life will be.

og_kalu • 21 hours ago

This is also wildly ahead in SWE-bench (71.7%, previous 48%) and Frontier Math (25% on high compute, previous 2%).

So much for a plateau lol.

throwup238 • 21 hours ago

> So much for a plateau lol.

It’s been really interesting to watch all the internet pundits’ takes on the plateau… as if the two years since the release of GPT3.5 is somehow enough data for an armchair ponce to predict the performance characteristics of an entirely novel technology that no one understands.

bandwidth-bob • 20 hours ago

The pundits' response to the (alleged) plateau was proportional to the certainty with which CEOs of frontier labs discussed pre-training scaling. The o3 result comes from scaling test-time compute, which represents a meaningful change in how you would build out compute for scaling (single supercluster --> presence in regions close to users). Thus it is important to discuss.

jgalt212 • 21 hours ago

You could make an equivalently dismissive comment about the hypesters.

throwup238 • 21 hours ago

Yeah but anyone with half a brain knows to ignore them. Vapid cynicism is a lot more seductive to the average nerd.

HarHarVeryFunny • 16 hours ago

You're talking apples and oranges. The plateau the frontier models have hit is the limited further gains to be had from dataset (+ corresponding model/compute) scaling.

These new reasoning models are taking things in a new direction basically by adding search (inference time compute) on top of the basic LLM. So, the capabilities of the models are still improving, but the new variable is how deep of a search you want to do (how much compute to throw at it at inference time). Do you want your chess engine to do a 10 ply search or 20 ply? What kind of real world business problems will benefit from this?
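A rough sketch of why search depth is an expensive knob: in a uniform game tree with branching factor b, an exhaustive search to depth d visits on the order of b^d positions. The branching factor of 35 below is a commonly quoted average for chess and is an assumption here; real engines prune heavily, and nothing in this sketch describes how o1 or o3 actually search.

```python
def nodes_visited(branching: int, depth: int) -> int:
    """Positions in a full uniform tree, root included, down to `depth` plies."""
    return sum(branching ** ply for ply in range(depth + 1))


CHESS_BRANCHING = 35  # assumed average branching factor, not a figure from this thread

# Each extra ply multiplies the work by roughly the branching factor.
for depth in (2, 4, 6):
    print(f"{depth} ply: {nodes_visited(CHESS_BRANCHING, depth):,} positions")
```

Going from 10 ply to 20 ply without pruning multiplies the work by roughly 35^10, which is why "how deep to search" quickly becomes a budget question.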

og_kalu • 15 hours ago

"New" reasoning models are plain LLMs with clever reinforcement learning. o1 is itself reinforcement learning on top of GPT-4o.

They found a way to make test time compute a lot more effective and that is an advance but the idea is not new, the architecture is not new.

And the vast majority of people convinced LLMs plateaued did so regardless of test time compute.

HarHarVeryFunny • 14 hours ago

The fact that these reasoning models may compute for extended durations, using exponentially more compute for linear performance gains (says OpenAI), resulting in outputs that while better are not necessarily any longer (more tokens) than before, all point to a different architecture - some type of iterative calling of the underlying model (essentially a reasoning agent using the underlying model).

A plain LLM does not use variable compute - it is a fixed number of transformer layers, a fixed amount of compute for every token generated.
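To make the fixed-compute-per-token point concrete, here is a back-of-the-envelope sketch. It assumes the common rule of thumb that a dense transformer's forward pass costs roughly 2 FLOPs per parameter per generated token (attention cost over the context is ignored), and the parameter count is hypothetical. Under that assumption the per-token cost is constant, so total inference compute varies only with how many tokens get generated, whether or not they are shown to the user.

```python
def inference_flops(n_params: float, n_tokens: int) -> float:
    """Approximate forward-pass FLOPs: ~2 per parameter per generated token."""
    return 2.0 * n_params * n_tokens


N_PARAMS = 70e9  # hypothetical model size, not a known figure for any OpenAI model

short_answer = inference_flops(N_PARAMS, 500)     # a direct answer
long_answer = inference_flops(N_PARAMS, 50_000)   # same model, long chain of thought

print(f"per-token cost: {inference_flops(N_PARAMS, 1):.2e} FLOPs")
print(f"long / short:   {long_answer / short_answer:.0f}x")
```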

throwaway314155 • 13 hours ago

og_kalu • 11 hours ago

I think throwaway already explained what i was getting at.

That said, i probably did downplay the achievement. It may not be a "new" idea to do something like this but finding an effective method for reflection that doesn’t just lock you into circular thinking and is applicable beyond well defined problem spaces is genuinely tough and a breakthrough.

attentionmech • 21 hours ago

I legit see that if there isn't a new breakthrough for even one week, people start shouting "plateau, plateau". Our rate of progress is extraordinary, and any downplaying of it seems stupid.

optimalsolver • 21 hours ago

>Frontier Math (25% on high compute, previous 2%)

This is so insane that I can't help but be skeptical. I know the FM answer key is private, but they have to send the questions to OpenAI in order to score the models. And a significant jump on this benchmark sure would increase a company's valuation...

Happy to be wrong on this.

OsrsNeedsf2P • 21 hours ago

At $6,670/task? I hope there's a jump

og_kalu • 21 hours ago

It's not $6,670/task. That was the high efficiency cost for 400 questions.

lagrange77 • 16 hours ago

> You'll know AGI is here when the exercise of creating tasks that are easy for regular humans but hard for AI becomes simply impossible.

That's the most plausible definition of AGI I've read so far.

cmrdporcupine • 16 hours ago

That's a pretty dark view of humanity and human intelligence. We're defined by the tasks we can do?

Instrumental reason FTW

lagrange77 • 16 hours ago

That implies that human intelligence is equivalent to AGI.

vessenes • 21 hours ago

This feels like big news to me.

First of all, ARC is definitely an intelligence test for autistic people. I say as someone with a tad of the neurodiversity. That said, I think it's a pretty interesting one, not least because as you go up in the levels, it requires (for a human) a fair amount of lateral thinking and analogy-type thinking, and of course, it requires that this go in and out of visual representation. That said, I think it's a bit funny that most of the people training these next-gen AIs are neurodiverse and we are training the AI in our own image. I continue to hope for some poet and painter-derived intelligence tests to be added to the next gen tests we all look at and score.

For those reasons, I've always really liked ARC as a test -- not as some be-all end-all for AGI, but just because I think that the most intriguing areas next for LLMs are in these analogy arenas and ability to hold more cross-domain context together for reasoning and etc.

Prompts that are interesting to play with right now on these terms include asking multimodal models to, say, count to ten in a Boston accent, then propose an equivalent regional French accent and count to ten in that. (To my ear, 4o is unconvincing on this.) Similar in my mind is writing and architecting code that crosses multiple languages and APIs, and asking for it to be written in different styles. (Claude and o1-pro are... okay at this, depending.)

Anyway. I agree that this looks like a large step change. I'm not sure if the o3 methods here involve the spinning up of clusters of python interpreters to breadth-search for solutions -- a method used to make headway on ARC in the past; if so, this is still big, but I think less exciting than if the stack is close to what we know today, and the compute time is just more introspection / internal beam search type algorithms.

Either way, something had to assess answers and think they were right, and this is a HUGE step forward.

jamiek88 • 20 hours ago

> most of the people training these next-gen AIs are neurodiverse

Citation needed. This is a huge claim based only on stereotype.

vessenes • 20 hours ago

So true. Perhaps I'm just thinking it's my people and need to update my priors.

getpost • 20 hours ago

> most of the people training these next-gen AIs are neurodiverse and we are training the AI in our own image

Do you have any evidence to support that? It would be fascinating if the field is primarily advancing due to a unique constellation of traits contributed by individuals who, in the past, may not have collaborated so effectively.

vessenes • 20 hours ago

PURELY Anecdotal. But I'll say that as of 2024 1 in 36 US children are diagnosed on the spectrum according to the CDC(!), which would mean if you met 10 AI researchers and 4 were neurodivergent you'd reasonably expect that it's a higher-than-population average representation. I'm polling from the Effective Altruist AI folks in my mind, and the number is definitely, definitely higher than 4/10.
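As a quick check on the arithmetic above, here is a minimal sketch that treats each of 10 researchers as an independent draw at the quoted 1-in-36 rate. Both the independence and the use of a childhood autism diagnosis rate as a stand-in for "neurodivergent" among adult researchers are simplifying assumptions, not claims from the CDC or this thread.

```python
from math import comb


def binomial_tail(n: int, k_min: int, p: float) -> float:
    """P(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


BASE_RATE = 1 / 36   # rate quoted in the comment
GROUP_SIZE = 10

expected = GROUP_SIZE * BASE_RATE
tail = binomial_tail(GROUP_SIZE, 4, BASE_RATE)

print(f"expected count in a group of {GROUP_SIZE}: {expected:.2f}")
print(f"P(at least 4 of {GROUP_SIZE}) at the base rate: {tail:.2e}")
```

Under these assumptions the expected count is well under one person per ten, so observing four or more would be far above the population rate.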

EVa5I7bHFq9mnYK • 20 hours ago

Are there non-Effective Altruist AI folks?

vessenes • 17 hours ago

I love how this might mean "non-Effective", non-"Effective Altruist" or non-"Effective Altruist AI" folks.

Yes

braden-lk • 21 hours ago

If people constantly have to ask if your test is a measure of AGI, maybe it should be renamed to something else.

OfficialTurkey • 21 hours ago

From the post

> Passing ARC-AGI does not equate achieving AGI, and, as a matter of fact, I don't think o3 is AGI yet. o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence.

cchance • 21 hours ago

It's funny when they say this, as if all humans can solve basic-ass question/answer combos. People seem to forget there's a percentage of the population that honestly believes the world is flat, along with other hallucinations at the human level.

Jensson • 16 hours ago

Humans work in groups, so you are wrong: a group of humans is extremely reliable on tons of tasks. These AI models also work in groups, or else they don't improve from working in a group, since the company uses whatever does best on the benchmark. So it is only fair to compare an AI against a group of people; comparing an AI to an individual will always be unfair, since an AI is never alone.

jppittma • 21 hours ago

I don't believe AGI at that level has any commercial value.

maxdoop • 21 hours ago

How much longer can I get paid $150k to write code ?

prmph • 21 hours ago

I’ll believe the models can take the jobs of programmers when they can generate a sophisticated iOS app based on some simple prompts, ready for building and publication in the app store. That is nowhere near the horizon no matter how much things are hyped up, and it may well never arrive.

vouaobrasil • 20 hours ago

Nah, it will arrive. And regardless, this sort of AI reduces the skill level required to make the app. It reduces the amount of people required and thus reduces the demand for engineers. So, even though AI is not CLOSE to what you are suggesting, it can significantly reduce the salaries of those that ARE required. So maybe fewer $150K programmers will be hired with the same revenue for even higher profits.

The most bizarre thing is that programmers are literally writing code to replace themselves because once this AI started, it was a race to the bottom and nobody wants to be last.

prmph • 19 hours ago

They've been promising us this thing since the 60s: End-user development, 5GLs, etc. enabling the average Joe to develop sophisticated apps in minimal time. And it never arrives.

I remember attending a tech fair decades ago, and at one stand they were vending some database products. When I mentioned that I was studying computer science with a focus on software engineering, they sneered that coding will be much less important in the future since powerful databases will minimize the need for a lot of data wrangling in applications with algorithms.

What actually happened is that the demand for programmers increased, and software ate the world. I suspect something similar will happen with the current AI hype.

whynotminot • 18 hours ago

vouaobrasil • 18 hours ago

skydhash • 20 hours ago

> Nah, it will arrive

Will it?

It's already hard to get people to use computers as they are right now, where you only need to click on things and no longer have to enter commands. That's because most people don't like to engage in formal reasoning. Even with one of the most intuitive computer-assisted tasks (drawing and 3D modeling), there's so much theory to learn that few people bother.

Programming has always been easy to learn, and tools to automate coding have existed for decades now. But how many people do you know who have had the urge to learn enough to automate their tasks?

timenotwasted • 21 hours ago

The absolutist type comments are such a wild take given how often they are so wrong.

tsunamifury • 21 hours ago

Totally... a simple 20% increase in efficiency will already significantly destroy demand for coders. This forum, however, will be reluctant to admit such an economic phenomenon.

Look at video bay editing after the advent of Final Cut. Significant drop in the specialized requirement as a professional field, even while content volume went up dramatically.

derektank • 20 hours ago

I could be misreading this, but as far as I can tell, there are more video and film editors today (29,240) than there were film editors in 1997 (9,320). Seems like an example of improved productivity shifting the skills required but ultimately driving greater demand for the profession as a whole. Salaries don't seem to have been hurt either, median wage was $35,214 in '97 and $66,600 today, right in line with inflation.
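The figures quoted above can be turned into growth rates with a few lines. This is only a sketch of the arithmetic: it takes "today" to mean 2023 (the year of the first BLS link, which is an assumption), and whether the resulting wage growth really matches inflation depends on which price index you compare against, which is not computed here.

```python
# Figures as quoted in the comment above
EMPLOYED_1997, EMPLOYED_NOW = 9_320, 29_240
WAGE_1997, WAGE_NOW = 35_214, 66_600
YEARS = 2023 - 1997  # assumes "today" refers to the May 2023 BLS data


def annualized_growth(start: float, end: float, years: int) -> float:
    """Compound annual growth rate between two values."""
    return (end / start) ** (1 / years) - 1


employment_ratio = EMPLOYED_NOW / EMPLOYED_1997
wage_ratio = WAGE_NOW / WAGE_1997
wage_cagr = annualized_growth(WAGE_1997, WAGE_NOW, YEARS)

print(f"employment grew {employment_ratio:.2f}x")
print(f"nominal median wage grew {wage_ratio:.2f}x ({wage_cagr:.2%} per year)")
```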

https://www.bls.gov/oes/2023/may/oes274032.htm

https://www.bls.gov/oes/tables.htm

exitb • 21 hours ago

sss111 • 21 hours ago

3 to 5 years, max. Traditional coding is going to be dead in the water. Optimistically, the junior SWE job will evolve but more realistically dedicated AI-based programming agents will end demand for Junior SWEs

lagrange77 • 18 hours ago

Which implies that a few years later they will not become senior SWEs either.

HarHarVeryFunny • 16 hours ago

You're not being paid $150K to "write code". You're being paid that to deliver solutions - to be a corporate cog that can ingest business requirements and emit (and maintain) business solutions.

If there are jobs paying $150K just to code (someone else tells you what to code, and you just code it up), then please share!

arrosenberg • 20 hours ago

Unless the LLMs see multiple leaps in capability, probably indefinitely. The Malthusians in this thread seem to think that LLMs are going to fix the human problems involved in executing these businesses - they won't. They make good programmers more productive and will cost some jobs at the margins, but it will be the low-level programming work that was previously outsourced to Asia and South America for cost-arbitrage.

mrdependable • 19 hours ago

I think they will have to figure out how to get around context limits before that happens. I also wouldn't be surprised if the future models that can actually replace workers are sold at such an exorbitant price that only larger companies will be able to afford it. Everyone else gets access to less capable models that still require someone with knowledge to get to an end result.

torginus • 20 hours ago

Well, considering they floated the $2000 subscription idea, and they still haven't revealed everything, they could still introduce the $2k sub with o3+agents/tool use, which means, till about next week.

deadbabe • 21 hours ago

There’s a very good chance that if a company can replace its programmers with pure AI then it means whatever they’re doing is probably already being offered as a SaaS product so why not just skip the AI and buy that? Much cheaper and you don’t have to worry about dealing with bugs.

croemer • 20 hours ago

SaaS works for general problems faced by many businesses.

deadbabe • 20 hours ago

Exactly. Most businesses can get away with not having developers at all if they just glue together the right combination of SaaS products. But this doesn’t happen, implying there is something more about having your own homegrown developers that SaaS cannot replace.

croemer • 19 hours ago

kirykl • 17 hours ago

If it’s any consolation, Agile priests and middle managers will be the first to go

colesantiago • 21 hours ago

Frontier expert specialist programmers will always be in demand.

Generalist junior and senior engineers will need to think of a different career path in less than 5 years as more layoffs will reduce the software engineering workforce.

It looks like it may be the way things are if progress in the o1, o3, oN models and other LLMs continues on.

mitjam • 20 hours ago

The question is: How to become a senior when there is no place to be a junior? Will future SWE need to do the 10k hours as a hobby? Will AI speed up or slow down learning?

singularity2001 • 18 hours ago

Good question, and I think you gave the correct answer: yes, people will just do the 10,000 hours required by starting programming at the age of eight and then playing around until they're done studying.

deadbabe • 21 hours ago

This assumes that software products in the future will remain at the same complexity as they are today, just with AI building them out.

But they won’t. AI will enable building even more complex software, which counterintuitively will result in a need for even more human jobs to deal with this added complexity.

Think about how despite an increasing amount of free open source libraries over time enabling some powerful stuff easily, developer jobs have only increased, not decreased.

hackinthebochs • 20 hours ago

What about "general" in AGI do you not understand? There will be no new style of development for which the AGI will be poorly suited that all the displaced developers can move to.

bandwidth-bob • 19 hours ago

For true AGI (whatever that means; let's say it fully replicates human abilities), discussing only "developers" is a drop in the bucket compared to all the knowledge-work jobs that will be displaced.

dmm • 20 hours ago

I've made a similar argument in the past but now I'm not so sure. It seems to me that developer demand was linked to large expansions in software demand first from PCs then the web and finally smartphones.

What if software demand is largely saturated? It seems the big tech companies have struggled to come up with the next big tech product category, despite lots of talent and capital.

bandwidth-bob • 19 hours ago

The new capabilities of LLMs, and of large foundation models generally, expand the range of what a computer program can do. Naturally, we will need to build all of those things with code, which will be done by a combination of people with product ideas, engineers, and LLMs. There will then be specialization and competition on each new use case, e.g., who builds the best AI doctor.

deadbabe • 20 hours ago

There doesn’t need to be a new category. Existing categories can just continue bloating in complexity.

Compare the early web vs the complicated JavaScript laden single page application web we have now. You need way more people now. AI will make it even worse.

Consider that in the AI driven future, there will be no more frameworks like React. Who is going to bother writing one? Instead every company will just have their own little custom framework built by an AI that works only for their company. Joining a new company means you bring generalist skills and learn how their software works from the ground up and when you leave to another company that knowledge is instantly useless.

Sounds exciting.

But there are also plenty of unexplored categories that we still can’t access because the technology isn’t there yet. Household robots with AGI, for instance, may require instructions for specific services sold as “apps” that have to be designed and developed by companies.

cruffle_duffle • 19 hours ago

This is exactly what will happen. We'll just up the complexity game to entirely new baselines. There will continue to be good money in software.

These models are tools to help engineers, not replacements. Models cannot, on their own, build novel new things no matter how much the hype suggests otherwise. What they can do is remove a hell of a lot of accidental complexity.

lagrange77 • 17 hours ago

> These models are tools to help engineers, not replacements. Models cannot, on their own, build novel new things no matter how much the hype suggests otherwise.

But maybe models + managers/non technical people can?

tsunamifury • 21 hours ago

Often what happens is the golf-course phenomenon. As golfing gets less popular, low and mid tier golf courses go out of business as they simply aren't needed. But at the same time demand for high end golf courses actually skyrockets, because people who want to golf can either give it up or go higher end.

This I think will happen with programmers. Rote programming will slowly die out, while demand for super high end will go dramatically up in price.

CapcomGo • 21 hours ago

Where does this golf-course phenomenon come from? It doesn't really match the real world or how golfing works.

tsunamifury • 21 hours ago

How so? I witnessed it quite directly in California. The majority have closed, and the remaining ones have gone up in price and are upscale. This has been covered in various news programs like 60 Minutes. You can look up the death of golfing.

Also unsure what you mean by...'how golfing works'. This is the economics of it, not the game

EVa5I7bHFq9mnYK • 19 hours ago

Maybe it's a CA thing? Plenty of $50 golf courses here in Phoenix.

razodactyl • 21 hours ago

Great. Now we have to think of a new way to move the goalposts.

a_wild_dandan • 21 hours ago

Let's just define AI as "whatever computers still can't do." That'll show those dumb statistical parrots!

foobarqux • 20 hours ago

This is just as silly as claiming that people "moved the goalposts" when a computer beat Kasparov at chess and they said it wasn't AGI: it wasn't a good test, and some people only realized this after the computer beat Kasparov but couldn't do much else. In this case the ARC maintainers specifically have stated that this is a necessary but not sufficient test of AGI (I personally think it is neither).

og_kalu • 19 hours ago

It's not silly. The computer that could beat Kasparov couldn't do anything else so of course it wasn't Artificial General Intelligence.

o3 can do much much more. There is nothing narrow about SOTA LLMs. They are already General. It doesn't matter what ARC Maintainers have said. There is no common definition of General that LLMs fail to meet. It's not a binary thing.

By the time a single machine covers every little test humanity can devise, what comes out of that is not 'AGI' as the words themselves mean but a General Super Intelligence.

foobarqux • 19 hours ago

It is silly, the logic is the same: "Only a (world-altering) 'AGI' could do [test]" -> test is passed -> no (world-altering) 'AGI' -> conclude that [test] is not a sufficient test for (world-altering) 'AGI' -> chase new benchmark.

If you want to play games about how to define AGI go ahead. People have been claiming for years that we've already reached AGI and with every improvement they have to bizarrely claim anew that now we've really achieved AGI. But after a few months people realize it still doesn't do what you would expect of an AGI and so you chase some new benchmark ("just one more eval").

The fact is that there really hasn't been the type of world-altering impact that people generally associate with AGI and no reason to expect one.

og_kalu • 18 hours ago

Pesthuf • 21 hours ago

Well right now, running this model is really expensive, but we should prepare a new cope for when equivalent models no longer are, ahead of time.

cchance • 21 hours ago

Ya, getting costs down will be the big one. I imagine quantization, distillation, and lots and lots of improvements on the compute side, both hardware- and software-wise.

tines • 21 hours ago

I mean, what else do you call learning?

CliveBloomers • 20 hours ago

Another meaningless benchmark, another month—it’s like clockwork at this point. No one’s going to remember this in a month; it’s just noise. The real test? It’s not in these flashy metrics or minor improvements. The only thing that actually matters is how fast it can wipe out the layers of middle management and all those pointless, bureaucratic jobs that add zero value.

That’s the true litmus test. Everything else? It’s just fine-tuning weights, playing around the edges. Until it starts cutting through the fat and reshaping how organizations really operate, all of this is just more of the same.

oytis • 19 hours ago

So far the AI market seems to be focused on replacing meaningful jobs; meaningless ones look safe (which kind of makes sense if you think about it).

handfuloflight • 20 hours ago

Agreed, but isn't it management who decides that this would be implemented? Are they going to propagate their own removal?

zamadatix • 19 hours ago

Middle manager types are probably interested in their salary performance more than anything. "Real" management (more of their assets come from their ownership of the company than a salary) will override them if it's truthfully the best performing operating model for the company.

sakopov • 12 hours ago

Maybe I'm missing something vital, but how does anything that we've seen AI do up until this point or explained in this experiment even hint at AGI? Can any of these models ideate? Can they come up with technologies and tools? No and it's unlikely they will any time soon. However, they can make engineers infinitely more productive.

jebarker • 12 hours ago

You need to define ideate, tools and technologies to answer those questions. Not to mention that it's quite possible humans do those things through re-combination of learned ideas similarly to how these reasoning models are suggested to be working.

sakopov • 11 hours ago

Every technological advancement that we've seen in software engineering - be it in things like Postgres, Kubernetes and Cloud Infrastructure - came out of truly novel ideas. AI seems to generate outputs that appear novel, but are they really? It's capable of synthesizing and combining vast amounts of information in creative ways, but it's deriving everything from existing patterns found within its training data. Truly novel ideas require thinking outside the box. It's a combination of cognitive, emotional and environmental factors which go beyond pattern recognition. How close are we to achieving this? Everyone seems to be shaking in their boots because we might lose our job safety in tech, but I don't see any intelligence here.

panabee • 15 hours ago

Nadella is a superb CEO, inarguably among the best of his generation. He believed in OpenAI when no one else did and deserves acclaim for this brilliant investment.

But his "below them, above them, around them" quote on OpenAI may haunt him in 2025/2026.

OAI or someone else will approach AGI-like capabilities (however nebulous the term), fostering the conditions to contest Microsoft's straitjacket.

Of course, OAI is hemorrhaging cash and may fail to create a sustainable business without GPU credits, but the possibility of OAI escaping Microsoft's grasp grows by the day.

Coupled with research and hardware trends, OAI's product strategy suggests the probability of a sustainable business within 1-3 years is far from certain but also higher than commonly believed.

If OAI becomes a $200b+ independent company, it would be against incredible odds given the intense competition and the Microsoft deal. PG's cannibal quote about Altman feels so apt.

It will be fascinating to see how this unfolds.

Congrats to OAI on yet another fantastic release.

noah32 • 13 hours ago

The best AI on this graph costs 50000% more than a STEM graduate to complete the tasks and even then has an error rate that is 1000% higher than the humans'???

starchild3001 • 18 hours ago

Intelligence comes in many forms and flavors. ARC prize questions are just one version of it -- perhaps measuring more human-like pattern recognition than true intelligence.

Can machines be more human-like in their pattern recognition? O3 met this need today.

While this is some form of accomplishment, it's nowhere near the scientific and engineering problem solving needed to call something truly artificial (human-like) intelligent.

What’s exciting is that these reasoning models are making significant strides in tackling eng and scientific problem-solving. Solving the ARC challenge seems almost trivial in comparison to that.

agnosticmantis • 16 hours ago

This is so impressive that it brings out the pessimist in me.

Hopefully my skepticism will end up being unwarranted, but how confident are we that the queries are not routed to human workers behind the API? This sounds crazy but is plausible for the fake-it-till-you-make-it crowd.

Also, given the prohibitive compute costs per task, typical users won't be using this model, so the scheme could go on for quite some time before the public knows the truth.

They could also come out in a month and say o3 was so smart it'd endanger the civilization, so we deleted the code and saved humanity!

kvn8888 • 15 hours ago

That would be a ton of problems for a small team of PhD/Grad level experts to solve (for GPQA Diamond, etc) in a short time. Remember, on EpochAI Frontier Math, these problems require hours to days' worth of reasoning by humans.

The author also suggested this is a new architecture that uses existing methods, like a Monte Carlo tree search that DeepMind is investigating (they use this method for AlphaZero).

I don't see the point of colluding for this sort of fraud, as these methods like tree search and pruning already exist. And other labs could genuinely produce these results

agnosticmantis • 15 hours ago

I had the ARC AGI in mind when I suggested human workers. I agree the other benchmark results make the use of human workers unlikely.

aetherson • 15 hours ago

I'm very confident that queries were not routed to human workers behind the API.

Possibly some other form of "make it seem more impressive than it is," but not that one.

rsanek • 15 hours ago

this is an impressive tinfoil take. but what would be their plan in the medium term? like once they release this people can check their data

agnosticmantis • 15 hours ago

How can people check their data?

In the medium term the plan could be to achieve AGI, and then AGI would figure out how to actually write o3. (Probably after AGI figures out the business model though: https://www.reddit.com/r/MachineLearning/s/OV4S2hGgW8)

inoperable • 12 hours ago

Very convenient for OpenAI to run those errands with a bunch of misanthropes trying to repaint a simulacrum. Using "AGI" here makes me want to sponsor a pile of distress pills so people really think things over before going into another manic episode. People seriously need to take a step back. If that's AGI, then my cat has surpassed its cognitive abilities twice over.