Chinchilla Scaling: A replication attempt

105 points · 15 hours ago · arxiv.org
magnio14 hours ago

> To extract the data from the figure, we first downloaded the PDF from Hoffmann et al.’s arXiv submission and saved it in SVG format. We then parsed the SVG content to navigate and search the SVG structure. Within the SVG, we identified the group of points representing the scatter plot data and iterated over each point to extract its fill color and position (x and y coordinates) using the attributes of the corresponding SVG elements.

> To map the SVG coordinates to the model size and training FLOP values, we used the location of the labels or ticks on the respective axes. This allowed us to establish a correspondence between the SVG coordinates and the actual data values represented in the plot.

They ... reconstructed the data ... from a plot ... using a ruler and their eyes? Why not just email the original authors for the raw data? I can't help but feel like this is @yuvaltheterrible debunking papers.

mxwsn14 hours ago

Funnily enough, I've done this for a paper I wrote as well. Emailing authors is kind of a crapshoot. It's normal to get no response if it's been several years since the paper came out. In this case, a pdf plot is essentially lossless, and it's much faster than waiting for authors to maybe respond.

V1ndaar14 hours ago

And not only that, in many cases they will tell you (if they reply) "oh, we can't find the source of that plot anymore". Happened to me quite a few times (although in physics).

I'm pretty sure I'm not the only one who's written themselves a mini tool to extract data even from a bitmap plot based on the axes. It involves some manual steps (mainly cropping), but it's very convenient for the cases where people don't even use vector graphics and sometimes just paste screenshots of plots... Do I like it? Hell no! That's why I put quite some effort into doing it better for my PhD thesis.
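
The core of such a tool is just a colour mask plus a pixel-to-axis mapping; here is a minimal sketch along those lines (the marker colour, calibration pixels, and axis ranges are all assumptions, and a real tool would also cluster the masked pixels into one centroid per marker):

  # Minimal sketch of extracting points from a rasterized plot; the colour,
  # calibration pixels, and axis values are assumptions, not from any real figure.
  import numpy as np
  from PIL import Image

  img = np.asarray(Image.open("plot.png").convert("RGB")).astype(int)
  marker = np.array([31, 119, 180])                  # e.g. matplotlib's default blue
  mask = np.abs(img - marker).sum(axis=-1) < 30      # tolerance for antialiasing
  ys, xs = np.nonzero(mask)                          # pixel coordinates of marker-coloured pixels

  def pix_to_val(pix, pix_lo, pix_hi, val_lo, val_hi):
      """Linear map from pixel coordinate to data value, calibrated from two axis ticks."""
      return val_lo + (pix - pix_lo) / (pix_hi - pix_lo) * (val_hi - val_lo)

  x_vals = pix_to_val(xs, 80, 620, 0.0, 100.0)       # assume the x axis spans 0..100
  y_vals = pix_to_val(ys, 440, 40, 0.0, 1.0)         # y axis flipped: pixel rows grow downward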

mirekrusin49 minutes ago

Somebody tell them that Hugging Face, GitHub, GitLab, Codeberg, etc. exist.

WanderPanda3 hours ago

To be fair, sometimes (e.g. in the case of scatter plots with many dots) PDF renderers become very slow and/or mess up the rendering. In that case the easiest option is rasterizing the plot (for performance and for consistency of appearance).

jszymborski2 hours ago

If you have the misfortune of having to use Word for writing manuscripts and/or have scatter plots with a good number of points, SVGs will ruin your day in my experience.

(Yes, I'd much rather use LaTeX)

godelski10 hours ago

Yeah, it's very annoying, especially these days when there's no real excuse not to have a copy. You can easily store all code and data for free and in an accessible manner. Even just GitHub is good enough for 90+% of cases. Hugging Face helps, and there are many other options too.

I remember in my first year of grad school I was trying to replicate work from a very prestigious university. It definitely wasn't reproducible from the text, but I did my best. I couldn't get close to their claims, so I emailed the lead author (another grad student). No response. Luckily my advisor knew their advisor; we got a meeting and then I was sent code. It was nothing like what they described in the paper, so I have no idea what they gave me. Anyway, my paper never got published because I couldn't beat them. It is what it is.

acc_29713 hours ago

In fairness, they did not use a ruler or eyes. Based on the excerpts you quote, they extracted exact coordinates of the data from an SVG, which, if the SVG was created correctly, should at least give an unbiased dataset, perhaps with less precision than the source.
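
Concretely, that kind of SVG scraping looks roughly like the sketch below (illustrative only: the element names, calibration coordinates, and axis values are made-up assumptions, not taken from the paper's actual SVG):

  # Rough sketch of the quoted SVG-scraping procedure; element names,
  # calibration coordinates, and axis values are illustrative assumptions.
  import math
  import xml.etree.ElementTree as ET

  SVG_NS = "{http://www.w3.org/2000/svg}"

  def log_axis(svg_lo, svg_hi, val_lo, val_hi):
      """Map an SVG coordinate to a data value on a log-scaled axis,
      calibrated from two known tick positions."""
      def to_value(c):
          t = (c - svg_lo) / (svg_hi - svg_lo)
          return 10 ** (math.log10(val_lo) + t * (math.log10(val_hi) - math.log10(val_lo)))
      return to_value

  def extract_points(svg_path, x_axis, y_axis):
      points = []
      for circle in ET.parse(svg_path).iter(SVG_NS + "circle"):
          cx, cy = float(circle.get("cx")), float(circle.get("cy"))
          points.append({
              "fill": circle.get("fill"),   # fill colour encodes the model size in the figure
              "x": x_axis(cx),
              "y": y_axis(cy),
          })
      return points

  # Calibration read off the axis tick labels, e.g. SVG x = 100 and 700
  # corresponding to 1e18 and 1e24 FLOP (made-up numbers):
  # pts = extract_points("fig.svg", log_axis(100, 700, 1e18, 1e24),
  #                      log_axis(500, 50, 2.0, 5.0))  # SVG y grows downward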

levocardia13 hours ago

I do that all the time using WebPlotDigitizer [1]. Works great.

[1] https://apps.automeris.io/wpd/

dynm12 hours ago

Seconded. When I first saw this, I thought it looked unintuitive and difficult to use, but when I tried it, it was very easy and I had the extracted data in a few minutes.

Ajoo14 hours ago

They say in one of the replies that they did ask, several times.

ege_erdil9 hours ago

we did and gave them a two week grace period to respond, but they only responded to us after we published on arxiv

also, we didn't reconstruct the data using a ruler, you can automate that entire process so that it's much more reliable than that

saurabh20n5 hours ago

Looks like you’re one of the authors.

It would be nice if you could post whether the actual data matches your reconstruction, now that you have it in hand. That would help us stop worrying about the data provenance and focus on the result you found.

polygamous_bat14 hours ago

> Why not just email the original authors for the raw data?

Industry research labs, especially Google DeepMind, are notoriously closed off about their “proprietary” data. I’ve hit this wall multiple times in my own work in AI.

sp3329 hours ago

https://twitter.com/borgeaud_s/status/1780988694163321250 says they're going to open the data from the paper. Not sure why they didn't do it before, but good news.

williamdclt13 hours ago

I particularly like the second quote; I appreciate them taking the time to explain "what is a graph" in a scientific paper!

gwern10 hours ago

The original Chinchilla authors have now identified the original bug, apparently: https://twitter.com/borgeaud_s/status/1780988694163321250

mirekrusin42 minutes ago

Lovely, they are also open-sourcing the data.

cs70215 hours ago

Interesting! If the authors are right, it seems that the number of training tokens required per parameter (slowly) declines as models become larger (Figure 5).

That's good news. I think it deserves wider dissemination, so I'm upvoting your post.

Thank you for sharing this on HN!

dzdt14 hours ago

Could it be that the independence of the available training points declines as the dataset size grows? At some point it becomes hard to add data that isn't essentially similar to something you've already added.

cs70214 hours ago

Yes, could be. Not sure how or even if anyone could prove it, though.

godelski9 hours ago

This should be true more or less by default. Remember, your dataset is a proxy for some real (but almost surely intractable) distribution.

Now think about filling the space with p-balls whose radii are bounded by the nearest points, so that no data point lies inside any ball. We've turned this into a sphere-packing problem, and we can talk about the sizes and volumes of those spheres.

If we fill the real distribution with data uniformly, the average volume of those spheres decreases. If we fill it non-uniformly, the average ball still shrinks, but the largest ball shrinks more slowly (that being the case where we aren't properly covering some region). Either way, the more data you add, the more the balls shrink, which essentially means the differences between data points decrease. The harder question is about the under-represented regions: finding them and determining how to sample them properly.

Another quick way to convince yourself is to think about basis vectors (this isn't rigorous, but it's a good starting point). In high dimensions, two randomly sampled vectors are almost certainly close to orthogonal. So think of drawing basis vectors (independent vectors that span the space): as we start filling in data, the new vectors (data points) are very likely to be independent of what we have in some way, but as we add more, the likelihood that they are orthogonal to the existing ones decreases. Of course your basis vectors don't need to be orthogonal, but that's mostly semantics, since we can always work in a space where they are.
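
Both intuitions are easy to check numerically. A toy sketch, purely illustrative and using synthetic uniform/Gaussian data rather than anything language-related:

  # Toy numerical check of both arguments above; synthetic data only.
  import numpy as np
  from scipy.spatial import cKDTree

  rng = np.random.default_rng(0)

  # 1) As you add more samples, the mean nearest-neighbour distance shrinks,
  #    i.e. new points look more and more like points you already have.
  d = 16
  for n in (100, 1_000, 10_000):
      x = rng.uniform(size=(n, d))
      dist, _ = cKDTree(x).query(x, k=2)     # k=2: nearest neighbour other than the point itself
      print(n, dist[:, 1].mean())

  # 2) Random high-dimensional vectors are nearly orthogonal: the |cosine similarity|
  #    of random pairs concentrates around 0 as the dimension grows.
  for d in (3, 30, 3_000):
      v = rng.normal(size=(2_000, d))
      v /= np.linalg.norm(v, axis=1, keepdims=True)
      cos = (v[::2] * v[1::2]).sum(axis=1)   # 1,000 independent pairs
      print(d, np.abs(cos).mean())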

sebzim450010 hours ago

I guess you could artificially limit the training data (e.g. by removing languages or categories) and see whether the utility of extra tokens drops off as a result.

Kronopath13 hours ago

This is not good news: it means we could end up with a dangerously superintelligent AI just by scaling up the number of parameters, without increasing the amount of training data.

kelseyfrog12 hours ago

No, but LLMs require orders of magnitude more language input than humans[1]. It's very reasonable to assume that architectural differences (size among them) are the more likely constraint on performance.

1. Specifically, larger than the upper bound on lifetime language input for a human, even assuming reading 24/7 at maximum speed.
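
A back-of-the-envelope version of that footnote (every number here is a rough assumption, not a measurement):

  # Back-of-the-envelope bound on lifetime language input vs. Chinchilla's training set.
  words_per_minute = 300                     # fast, sustained reading speed
  minutes = 80 * 365 * 24 * 60               # 80 years, reading 24/7
  human_words = words_per_minute * minutes   # ~1.3e10 words
  human_tokens = human_words * 1.3           # ~1.6e10 tokens at ~1.3 tokens/word

  chinchilla_tokens = 1.4e12                 # tokens used to train the 70B Chinchilla model
  print(chinchilla_tokens / human_tokens)    # ~85x, i.e. roughly two orders of magnitude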

mirekrusin45 minutes ago

LLMs are already super-intelligent at mimicking; it won't take much time before someone finds some kind of RL loop there.

p1esk10 hours ago

How much language input does a human need to become intelligent if he doesn’t receive any other input?

HeatrayEnjoyer11 hours ago

Do they? What is the total size of all visual, audio, touch, locomotive, scent, and taste data collected between birth and when a human reaches IQ 100? There are multiple high-bandwidth feeds running into the brain 24/7.

zarzavat3 hours ago

Vision is not necessary for language acquisition.

Proof: blind and partially sighted people exist.

cubefox11 hours ago

> language input

TeMPOraL10 hours ago

Yes, but LLMs come out of training as experts in approximately any single thing you can think of, and then some, and all that in dozens of languages. Humans don't achieve even a fraction of this kind of breadth.

godelski9 hours ago

This is not quite accurate, but it's complicated because measurement is hard. The things these models are tested on are almost surely within the training data. Take the bar exam: sure, we don't know exactly what's in GPT's data, but we know it includes Reddit, and we know Reddit has many similar if not identical questions. We also know that the first GPT-4 report did not do real semantic-similarity matching; they just checked three random 50-character substrings (Appendix C), and they only considered the false-positive side. Then there's this line...

  The RLHF post-training dataset is vastly smaller than the pretraining set and unlikely to have any particular question contaminated. However we did not check explicitly.

But my favorite is HumanEval. I'll just remind everyone that this was written by 60 authors, mostly from OpenAI:

  We evaluate functional correctness on a set of 164 handwritten programming problems, which we call the HumanEval dataset. ... __It is important for these tasks to be hand-written, since our models are trained on a large fraction of GitHub, which already contains solutions to problems from a variety of sources.__

The problems? Well, they're leetcode-style... Can you really tell me that questions like the following aren't already on GitHub?

  Human Eval 2

  Prompt:
  def truncate_number(number: float) -> float:
      """ Given a positive floating point number, it can be decomposed into
      and integer part (largest integer smaller than given number) and decimals
      (leftover part always smaller than 1).
      Return the decimal part of the number.
      >>> truncate_number(3.5)
      0.5
      """
  Solution:
      return number % 1.0

  Human Eval 4

  Prompt:
  from typing import List

  def mean_absolute_deviation(numbers: List[float]) -> float:
      """ For a given list of input numbers, calculate Mean Absolute Deviation
      around the mean of this dataset.
      Mean Absolute Deviation is the average absolute difference between each
      element and a centerpoint (mean in this case):
      MAD = average | x - x_mean |
      >>> mean_absolute_deviation([1.0, 2.0, 3.0, 4.0])
      1.0
      """
  Solution:
      mean = sum(numbers) / len(numbers)
      return sum(abs(x - mean) for x in numbers) / len(numbers)

You really want to bet that those aren't on GitHub? Because I'll bet you any dollar amount that solutions in near-exact form were on GitHub prior to their cutoff date (don't take my word for it, you can find them too; they're searchable). Hell, I've poisoned the dataset here!

LLMs are (lossy) compression systems, so they're great for information retrieval. And a lot of what we consider intelligence (and possibly even creativity) is built on information retrieval. That doesn't make these systems any less impressive; it's just a note on how we should interpret results and understand the limitations of our tools. Measuring intelligence is genuinely hard: the term isn't universally agreed upon, so people often talk past one another, and some conflate distinct notions as if they were the same.

exe3413 hours ago

Like a corporation then. We should ban them until we can figure out how to align them!

tehsauce12 hours ago

ASI is nothing like a corporation

TeMPOraL10 hours ago

It is very much like a corporation; a corp is effectively an AGI, just running very slowly, at the speed of bureaucracy.

cgearhart14 hours ago

TL;DR—couldn’t exactly replicate their results, but broadly confirmed their findings. They agree that the optimal range is 5–40 tokens per parameter, and close to 20 for the “chinchilla” model from the original paper.

Very unusual choice to reconstruct the dataset by eyeballing the graph in the source paper (why not just ask for it…?) and it’s not really clear why the result is dressed up behind the salacious-seeming abstract.

ege_erdil9 hours ago

we didn't eyeball the graph, there are more accurate ways of extracting the data from a pdf file than that

we did ask for the data but got no response until we published on arxiv

what is supposed to be "salacious" about the abstract?

newfocogi15 hours ago

Key claims:

"We have found three potential issues with Hoffmann et al.’s estimates of the Chinchilla scaling law that rely on Approach 3: 1. Their estimated model fits the reconstructed data very poorly. These conclusions hold even when accounting for potential noise in data reconstruction and excluding outlier models. 2. The confidence are implausibly tight given the number of data points. Obtaining confidence intervals that tight would require many hundreds of thousands of observations, while they likely had only ∼400. 3. Their estimated model implies a scaling policy that is inconsistent with their other approach"

Data point most people are probably looking for: "We find a range consistent with the 20 tokens per parameter rule of thumb. Indeed, our point estimates imply that 25.6 tokens per parameter is optimal."
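
For context, this is how such a rule of thumb is usually applied: with the common C ≈ 6*N*D approximation for training FLOPs and a compute-optimal ratio D ≈ r*N, a compute budget C pins down both N and D. A small sketch (the 6ND approximation and the ratio are the only inputs; nothing here comes from the paper's fitting code):

  # Applying the "~20 tokens per parameter" rule: with training compute
  # C ~ 6*N*D FLOPs and compute-optimal D ~ r*N, we get N = sqrt(C / (6*r)).
  import math

  def compute_optimal(C, tokens_per_param=20.0):
      N = math.sqrt(C / (6.0 * tokens_per_param))   # parameters
      D = tokens_per_param * N                      # training tokens
      return N, D

  # Chinchilla itself: ~70B params on ~1.4T tokens, i.e. C ~ 5.9e23 FLOPs.
  N, D = compute_optimal(6 * 70e9 * 1.4e12)
  print(f"params ~ {N:.2e}, tokens ~ {D:.2e}")      # ~7.0e10 params, ~1.4e12 tokens

With the 25.6 tokens-per-parameter point estimate instead of 20, the same budget shifts slightly toward a smaller model trained on more tokens.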

moffkalast15 hours ago

Their rule of thumb would imply that a 70B model is saturated with 1.7T tokens, which is inconsistent with reality.

og_kalu15 hours ago

The Chinchilla laws are compute-optimal scaling laws. They're not supposed to tell you what parameter-token combination will saturate a model.

moffkalast15 hours ago

Compute-optimal for what, training? There's nothing optimal about blowing up model size beyond the absolute minimum needed, or you'll spend a country's worth of electricity trying to scale inference later.

FeepingCreature14 hours ago

Blow up model size, get lots of space and parameters to do the double-descent grok thing in, then distill it way way down?

eldenring15 hours ago

No, their claim is that, for a fixed training compute budget, there are diminishing returns to scaling up data past that threshold versus scaling up params.

This doesn't take inference into account either, obviously.

warbaker11 hours ago

Calling this a "replication attempt" implied to me that they tried to replicate the Chinchilla Scaling paper and found that it did not replicate, which would be a very big deal!

Instead, they just redid the analysis based on a figure in the paper and found that the old model with slightly different parameters gave a better fit to the data. This is a valuable contribution, but a bit over-stated by the paper title, and the confrontational, "gotcha" tone of the paper is unwarranted.

A better framing would have been something like "Chinchilla Scaling: Reanalyzed".

ege_erdil9 hours ago

one of their three approaches does not replicate and it's because of a software bug in the optimizer they used, i don't know what else we were supposed to say