Chinchilla Scaling: A replication attempt

105 points · 15 hours ago · arxiv.org
magnio14 hours ago

> To extract the data from the figure, we first downloaded the PDF from Hoffmann et al.’s arXiv submission and saved it in SVG format. We then parsed the SVG content to navigate and search the SVG structure. Within the SVG, we identified the group of points representing the scatter plot data and iterated over each point to extract its fill color and position (x and y coordinates) using the attributes of the corresponding SVG elements.

> To map the SVG coordinates to the model size and training FLOP values, we used the location of the labels or ticks on the respective axes. This allowed us to establish a correspondence between the SVG coordinates and the actual data values represented in the plot.

They ... reconstructed the data ... from a plot ... using a ruler and their eyes? Why not just email the original authors for the raw data? I can't help but feel like this is @yuvaltheterrible debunking papers.

mxwsn14 hours ago

Funnily enough, I've done this for a paper I wrote as well. Emailing authors is kind of a crapshoot. It's normal to get no response if it's been several years since the paper came out. In this case, a pdf plot is essentially lossless, and it's much faster than waiting for authors to maybe respond.

V1ndaar14 hours ago

And not only that, in many cases they will tell you (if they reply) "oh, we can't find the source of that plot anymore". Happened to me quite a few times (although in physics).

I'm pretty sure I'm not the only one who's written themselves a mini tool to extract data even from a bitmap plot based on the axes. It involves some manual steps (mainly cropping), but it's very convenient for the cases where people don't even use vector graphics and sometimes just paste screenshots of plots... Do I like it? Hell no! That's why I put quite some effort into doing it better for my PhD thesis.
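
The core of such a tool is just a colour mask plus a pixel-to-axis mapping; here is a minimal sketch along those lines (the marker colour, calibration pixels, and axis ranges are all assumptions, and a real tool would also cluster the masked pixels into one centroid per marker):

  # Minimal sketch of extracting points from a rasterized plot; the colour,
  # calibration pixels, and axis values are assumptions, not from any real figure.
  import numpy as np
  from PIL import Image

  img = np.asarray(Image.open("plot.png").convert("RGB")).astype(int)
  marker = np.array([31, 119, 180])                  # e.g. matplotlib's default blue
  mask = np.abs(img - marker).sum(axis=-1) < 30      # tolerance for antialiasing
  ys, xs = np.nonzero(mask)                          # pixel coordinates of marker-coloured pixels

  def pix_to_val(pix, pix_lo, pix_hi, val_lo, val_hi):
      """Linear map from pixel coordinate to data value, calibrated from two axis ticks."""
      return val_lo + (pix - pix_lo) / (pix_hi - pix_lo) * (val_hi - val_lo)

  x_vals = pix_to_val(xs, 80, 620, 0.0, 100.0)       # assume the x axis spans 0..100
  y_vals = pix_to_val(ys, 440, 40, 0.0, 1.0)         # y axis flipped: pixel rows grow downward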

mirekrusin49 minutes ago

Somebody tell them that Hugging Face, GitHub, GitLab, Codeberg, etc. exist.

WanderPanda3 hours ago

To be fair, sometimes (e.g. in the case of scatter plots with many dots) PDF renderers become very slow and/or mess up the rendering. In that case the easiest option is rasterizing the plot (for performance and for consistency of appearance).

jszymborski2 hours ago

If you have the misfortune of having to use Word for writing manuscripts and/or have scatter plots with a good number of points, SVGs will ruin your day in my experience.

(Yes, I'd much rather use LaTeX)

godelski10 hours ago

Yeah, it's very annoying, especially these days when there's no real excuse not to have a copy. You can easily store all code and data for free and in an accessible manner. Even just GitHub is good enough for 90+% of cases. Hugging Face helps, and there are many other options too.

I remember in my first year of grad school I was trying to replicate work from a very prestigious university. It definitely wasn't reproducible from the text, but I did my best. I couldn't get close to their claims, so I emailed the lead author (another grad student). No response. Luckily my advisor knew their advisor; we got a meeting and then I was sent code. It was nothing like what they described in the paper, so I have no idea what they gave me. Anyway, my paper never got published because I couldn't beat them. It is what it is.

acc_29713 hours ago

In fairness, they did not use a ruler or eyes. Based on the excerpts you quote, they extracted exact coordinates of the data from an SVG, which, if the SVG was created correctly, should at least give an unbiased dataset, perhaps with less precision than the source.
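
Concretely, that kind of SVG scraping looks roughly like the sketch below (illustrative only: the element names, calibration coordinates, and axis values are made-up assumptions, not taken from the paper's actual SVG):

  # Rough sketch of the quoted SVG-scraping procedure; element names,
  # calibration coordinates, and axis values are illustrative assumptions.
  import math
  import xml.etree.ElementTree as ET

  SVG_NS = "{http://www.w3.org/2000/svg}"

  def log_axis(svg_lo, svg_hi, val_lo, val_hi):
      """Map an SVG coordinate to a data value on a log-scaled axis,
      calibrated from two known tick positions."""
      def to_value(c):
          t = (c - svg_lo) / (svg_hi - svg_lo)
          return 10 ** (math.log10(val_lo) + t * (math.log10(val_hi) - math.log10(val_lo)))
      return to_value

  def extract_points(svg_path, x_axis, y_axis):
      points = []
      for circle in ET.parse(svg_path).iter(SVG_NS + "circle"):
          cx, cy = float(circle.get("cx")), float(circle.get("cy"))
          points.append({
              "fill": circle.get("fill"),   # fill colour encodes the model size in the figure
              "x": x_axis(cx),
              "y": y_axis(cy),
          })
      return points

  # Calibration read off the axis tick labels, e.g. SVG x = 100 and 700
  # corresponding to 1e18 and 1e24 FLOP (made-up numbers):
  # pts = extract_points("fig.svg", log_axis(100, 700, 1e18, 1e24),
  #                      log_axis(500, 50, 2.0, 5.0))  # SVG y grows downward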

levocardia13 hours ago

I do that all the time using WebPlotDigitizer [1]. Works great.

[1] https://apps.automeris.io/wpd/

dynm12 hours ago

Seconded. When I first saw this, I thought it looked unintuitive and difficult to use, but when I tried it, it was very easy and I had the extracted data in a few minutes.

Ajoo14 hours ago

They say in one of the replies that they did ask, several times.

ege_erdil9 hours ago

we did and gave them a two week grace period to respond, but they only responded to us after we published on arxiv

also, we didn't reconstruct the data using a ruler, you can automate that entire process so that it's much more reliable than that

saurabh20n5 hours ago

Looks like you’re one of the authors.

It would be nice if you could post whether the actual data matches your reconstruction, now that you have it in hand. That would help us stop worrying about the data provenance and focus on the result you found.

polygamous_bat14 hours ago

> Why not just email the original authors for the raw data?

Industry research labs, especially Google DeepMind, are notoriously closed off about their “proprietary” data. I’ve hit this wall multiple times in my own work in AI.

sp3329 hours ago

https://twitter.com/borgeaud_s/status/1780988694163321250 says they're going to open the data from the paper. Not sure why they didn't do it before, but good news.

williamdclt13 hours ago

I particularly like the second quote; I appreciate them taking the time to explain "what is a graph" in a scientific paper!

gwern10 hours ago

The original Chinchilla authors have now identified the original bug, apparently: https://twitter.com/borgeaud_s/status/1780988694163321250

mirekrusin42 minutes ago

Lovely, they are also open-sourcing the data.

cs70215 hours ago

Interesting! If the authors are right, it seems that the number of training tokens required per parameter (slowly) declines as models become larger (Figure 5).

That's good news. I think it deserves wider dissemination, so I'm upvoting your post.

Thank you for sharing this on HN!

dzdt14 hours ago

Could it be that the independence of the available training points declines as the dataset size grows? At some point it becomes hard to add data that isn't essentially similar to something you've already added.

cs70214 hours ago

Yes, could be. Not sure how or even if anyone could prove it, though.

godelski9 hours ago

This should be true more or less by default. Remember, your dataset is a proxy for some real (but almost surely intractable) distribution.

Now think about filling the space with p-balls whose radii are bounded by the nearest points, so that no data point lies inside any ball. We've turned this into a sphere-packing problem, and we can talk about the sizes and volumes of those spheres.

If we fill the real distribution with data uniformly, the average volume of those spheres decreases. If we fill it non-uniformly, the average ball still shrinks, but the largest ball shrinks more slowly (that being the case where we aren't properly covering some region). Either way, the more data you add, the more the balls shrink, which essentially means the differences between data points decrease. The harder question is about the under-represented regions: finding them and determining how to sample them properly.

Another quick way to convince yourself is to think about basis vectors (this isn't rigorous, but it's a good starting point). In high dimensions, two randomly sampled vectors are almost certainly close to orthogonal. So think of drawing basis vectors (independent vectors that span the space): as we start filling in data, the new vectors (data points) are very likely to be independent of what we have in some way, but as we add more, the likelihood that they are orthogonal to the existing ones decreases. Of course your basis vectors don't need to be orthogonal, but that's mostly semantics, since we can always work in a space where they are.
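
Both intuitions are easy to check numerically. A toy sketch, purely illustrative and using synthetic uniform/Gaussian data rather than anything language-related:

  # Toy numerical check of both arguments above; synthetic data only.
  import numpy as np
  from scipy.spatial import cKDTree

  rng = np.random.default_rng(0)

  # 1) As you add more samples, the mean nearest-neighbour distance shrinks,
  #    i.e. new points look more and more like points you already have.
  d = 16
  for n in (100, 1_000, 10_000):
      x = rng.uniform(size=(n, d))
      dist, _ = cKDTree(x).query(x, k=2)     # k=2: nearest neighbour other than the point itself
      print(n, dist[:, 1].mean())

  # 2) Random high-dimensional vectors are nearly orthogonal: the |cosine similarity|
  #    of random pairs concentrates around 0 as the dimension grows.
  for d in (3, 30, 3_000):
      v = rng.normal(size=(2_000, d))
      v /= np.linalg.norm(v, axis=1, keepdims=True)
      cos = (v[::2] * v[1::2]).sum(axis=1)   # 1,000 independent pairs
      print(d, np.abs(cos).mean())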

sebzim450010 hours ago

I guess you could artificially limit the training data (e.g. by removing languages or categories) and see whether the utility of extra tokens drops off as a result.

Kronopath13 hours ago

This is not good news: it means we could end up with a dangerously superintelligent AI just by scaling up the number of parameters, without increasing the amount of training data.

kelseyfrog12 hours ago

No, but LLMs require orders of magnitude more language input than humans[1]. It's very reasonable to assume that architectural differences (size among them) are the more likely constraint on performance.

1. Specifically, larger than the upper bound on lifetime language input for a human, even assuming reading 24/7 at maximum speed.
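
A back-of-the-envelope version of that footnote (every number here is a rough assumption, not a measurement):

  # Back-of-the-envelope bound on lifetime language input vs. Chinchilla's training set.
  words_per_minute = 300                     # fast, sustained reading speed
  minutes = 80 * 365 * 24 * 60               # 80 years, reading 24/7
  human_words = words_per_minute * minutes   # ~1.3e10 words
  human_tokens = human_words * 1.3           # ~1.6e10 tokens at ~1.3 tokens/word

  chinchilla_tokens = 1.4e12                 # tokens used to train the 70B Chinchilla model
  print(chinchilla_tokens / human_tokens)    # ~85x, i.e. roughly two orders of magnitude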

mirekrusin45 minutes ago

LLMs are already super-intelligent at mimicking; it won't take much time before someone finds some kind of RL loop there.

p1esk10 hours ago

How much language input does a human need to become intelligent if he doesn’t receive any other input?

HeatrayEnjoyer11 hours ago

Do they? What is the total size of all visual, audio, touch, locomotive, scent, and taste data collected between birth and when a human reaches IQ 100? There are multiple high-bandwidth feeds running into the brain 24/7.

zarzavat3 hours ago

Vision is not necessary for language acquisition.

Proof: blind and partially sighted people exist.

cubefox11 hours ago

> language input

TeMPOraL10 hours ago

Yes, but LLMs come out of training as experts in approximately any single thing you can think of, and then some, and all that in dozens of languages. Humans don't achieve even a fraction of this kind of breadth.

godelski9 hours ago

This is not quite accurate, but it's complicated because measurement is hard. The things these models are tested on are almost surely within the training data. Take the bar exam: sure, we don't know exactly what's in GPT's data, but we know it includes Reddit, and we know Reddit has many similar if not identical questions. We also know that the first GPT-4 report did not do real semantic-similarity matching; they just checked three random 50-character substrings (Appendix C), and they only considered the false-positive side. Then there's this line...

  The RLHF post-training dataset is vastly smaller than the pretraining set and unlikely to have any particular question contaminated. However we did not check explicitly.

But my favorite is HumanEval. I'll just remind everyone that this was written by 60 authors, mostly from OpenAI:

  We evaluate functional correctness on a set of 164 handwritten programming problems, which we call the HumanEval dataset. ... __It is important for these tasks to be hand-written, since our models are trained on a large fraction of GitHub, which already contains solutions to problems from a variety of sources.__

The problems? Well, they're leetcode-style... Can you really tell me that questions like the following aren't already on GitHub?

  Human Eval 2

  Prompt:
  def truncate_number(number: float) -> float:
      """ Given a positive floating point number, it can be decomposed into
      and integer part (largest integer smaller than given number) and decimals
      (leftover part always smaller than 1).
      Return the decimal part of the number.
      >>> truncate_number(3.5)
      0.5
      """
  Solution:
      return number % 1.0

  Human Eval 4

  Prompt:
  from typing import List

  def mean_absolute_deviation(numbers: List[float]) -> float:
      """ For a given list of input numbers, calculate Mean Absolute Deviation
      around the mean of this dataset.
      Mean Absolute Deviation is the average absolute difference between each
      element and a centerpoint (mean in this case):
      MAD = average | x - x_mean |
      >>> mean_absolute_deviation([1.0, 2.0, 3.0, 4.0])
      1.0
      """
  Solution:
      mean = sum(numbers) / len(numbers)
      return sum(abs(x - mean) for x in numbers) / len(numbers)

You really want to bet that those aren't on GitHub? Because I'll bet you any dollar amount that solutions in near-exact form were on GitHub prior to their cutoff date (don't take my word for it, you can find them too; they're searchable). Hell, I've poisoned the dataset here!

LLMs are (lossy) compression systems, so they're great for information retrieval. And a lot of what we consider intelligence (and possibly even creativity) is built on information retrieval. That doesn't make these systems any less impressive; it's just a note on how we should interpret results and understand the limitations of our tools. Measuring intelligence is genuinely hard: the term isn't universally agreed upon, so people often talk past one another, and some conflate distinct notions as if they were the same.

exe3413 hours ago

Like a corporation then. We should ban them until we can figure out how to align them!

tehsauce12 hours ago

ASI is nothing like a corporation

TeMPOraL10 hours ago

It is very much like a corporation; a corp is effectively an AGI, just running very slowly, at the speed of bureaucracy.

cgearhart14 hours ago

TL;DR—couldn’t exactly replicate their results, but broadly confirmed their findings. They agree that the optimal range is 5–40 tokens per parameter, and close to 20 for the “chinchilla” model from the original paper.

Very unusual choice to reconstruct the dataset by eyeballing the graph in the source paper (why not just ask for it…?) and it’s not really clear why the result is dressed up behind the salacious-seeming abstract.

ege_erdil9 hours ago

we didn't eyeball the graph, there are more accurate ways of extracting the data from a pdf file than that

we did ask for the data but got no response until we published on arxiv

what is supposed to be "salacious" about the abstract?

newfocogi15 hours ago

Key claims:

"We have found three potential issues with Hoffmann et al.’s estimates of the Chinchilla scaling law that rely on Approach 3: 1. Their estimated model fits the reconstructed data very poorly. These conclusions hold even when accounting for potential noise in data reconstruction and excluding outlier models. 2. The confidence are implausibly tight given the number of data points. Obtaining confidence intervals that tight would require many hundreds of thousands of observations, while they likely had only ∼400. 3. Their estimated model implies a scaling policy that is inconsistent with their other approach"

Data point most people are probably looking for: "We find a range consistent with the 20 tokens per parameter rule of thumb. Indeed, our point estimates imply that 25.6 tokens per parameter is optimal."
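
For context, this is how such a rule of thumb is usually applied: with the common C ≈ 6*N*D approximation for training FLOPs and a compute-optimal ratio D ≈ r*N, a compute budget C pins down both N and D. A small sketch (the 6ND approximation and the ratio are the only inputs; nothing here comes from the paper's fitting code):

  # Applying the "~20 tokens per parameter" rule: with training compute
  # C ~ 6*N*D FLOPs and compute-optimal D ~ r*N, we get N = sqrt(C / (6*r)).
  import math

  def compute_optimal(C, tokens_per_param=20.0):
      N = math.sqrt(C / (6.0 * tokens_per_param))   # parameters
      D = tokens_per_param * N                      # training tokens
      return N, D

  # Chinchilla itself: ~70B params on ~1.4T tokens, i.e. C ~ 5.9e23 FLOPs.
  N, D = compute_optimal(6 * 70e9 * 1.4e12)
  print(f"params ~ {N:.2e}, tokens ~ {D:.2e}")      # ~7.0e10 params, ~1.4e12 tokens

With the 25.6 tokens-per-parameter point estimate instead of 20, the same budget shifts slightly toward a smaller model trained on more tokens.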

moffkalast15 hours ago

Their rule of thumb would imply that a 70B model is saturated with 1.7T tokens, which is inconsistent with reality.

og_kalu15 hours ago

The Chinchilla laws are compute-optimal scaling laws. They're not supposed to tell you what parameter-token combination will saturate a model.

moffkalast15 hours ago

Compute-optimal for what, training? There's nothing optimal about blowing up model size beyond the absolute minimum needed, or you'll spend a country's worth of electricity trying to scale inference later.

FeepingCreature14 hours ago

Blow up model size, get lots of space and parameters to do the double-descent grok thing in, then distill it way way down?

eldenring15 hours ago

No, their claim is that, for a fixed training compute budget, there are diminishing returns to scaling up data past that threshold versus scaling up params.

This doesn't take inference into account either, obviously.

warbaker11 hours ago

Calling this a "replication attempt" implied to me that they tried to replicate the Chinchilla Scaling paper and found that it did not replicate, which would be a very big deal!

Instead, they just redid the analysis based on a figure in the paper and found that the old model with slightly different parameters gave a better fit to the data. This is a valuable contribution, but a bit over-stated by the paper title, and the confrontational, "gotcha" tone of the paper is unwarranted.

A better framing would have been something like "Chinchilla Scaling: Reanalyzed".

ege_erdil9 hours ago

one of their three approaches does not replicate and it's because of a software bug in the optimizer they used, i don't know what else we were supposed to say