Wow, 44GB SRAM, not HBM3 or HBM3e, but actual SRAM ...
1x MI300x has 192GB HBM3.
1x MI325x has 256GB HBM3e.
They cost less, you can fit more into a rack, and you can buy/deploy at least the 300s today and the 325s early next year. AMD and library software performance for AI is improving daily [0]. I'm still trying to wrap my head around how these companies think they are going to do well in this market without more memory.
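A quick back-of-the-envelope sketch makes the capacity gap concrete. The parameter counts and bytes-per-weight below are my own assumptions for illustration, not figures from the article:

    # Rough sizing: how much memory do the weights alone need?
    # Ignores KV cache and activations; precisions are assumptions.
    GIB = 2**30

    def weight_footprint_gib(params_billion, bytes_per_weight):
        """Approximate weight storage in GiB."""
        return params_billion * 1e9 * bytes_per_weight / GIB

    for params in (8, 70, 405):
        fp16 = weight_footprint_gib(params, 2)   # FP16/BF16
        int8 = weight_footprint_gib(params, 1)   # 8-bit quantized
        print(f"{params:>4}B params: ~{fp16:6.0f} GiB @ FP16, ~{int8:6.0f} GiB @ INT8")

    # Compare against: 44 GB SRAM per wafer, 192 GB HBM3 on MI300x,
    # 256 GB HBM3e on MI325x. A 70B model at FP16 (~130 GiB) already
    # spans several wafers, but fits on a single MI300x.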
> I'm still trying to wrap my head around how these companies think they are going to do well in this market without more memory.
Cerebras and Groq provide the fastest (by an order of magnitude) inference. This is very useful for certain workflows that require low-latency feedback: audio chat with an LLM, robotics, etc.
Outside that narrow niche, AMD stuff seems to be the only contender to NVIDIA, at the moment.
> Cerebras and Groq provide the fastest (by an order of magnitude) inference.
Only on smaller models; their numbers in the article are all for 70B.
Those numbers also need to be adjusted for the comparable amounts of capex+opex costs. If the costs are so high that they have to subsidize the usage/results, then they are just going to run out of money, fast.
> Only on smaller models; their numbers in the article are all for 70B.
No, they are 5x-10x faster for all model sizes (because everything runs from SRAM, and they have more of it than NVIDIA/AMD), even though they benchmarked only up to 70B.
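A rough way to see why: single-stream decode is usually memory-bandwidth-bound, so per-user tokens/s is bounded by bandwidth divided by the bytes read per token. The sketch below uses vendor-quoted headline bandwidth figures and my own model-size assumption (70B at FP16); treat it as an upper-bound illustration, not a benchmark:

    # Rough model: decode reads (most of) the weights once per token, so
    # per-stream tokens/s <= memory_bandwidth / bytes_of_weights.
    # Real throughput also depends on batching, KV-cache traffic, kernels,
    # and interconnect, so the SRAM bound is nowhere near reached in practice.
    WEIGHT_BYTES = 70e9 * 2  # 70B params at FP16 (assumption)

    bandwidth_bytes_per_s = {
        "H100 (HBM3, ~3.35 TB/s)":                 3.35e12,
        "MI300x (HBM3, ~5.3 TB/s)":                5.3e12,
        "WSE-3 (on-chip SRAM, ~21 PB/s claimed)":  21e15,
    }

    for name, bw in bandwidth_bytes_per_s.items():
        print(f"{name}: ~{bw / WEIGHT_BYTES:,.0f} tokens/s upper bound per stream")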
> Those numbers also need to be adjusted for the comparable amounts of capex+opex costs. If the costs are so high that they have to subsidize the usage/results, then they are just going to run out of money, fast.
True. Although, for some workloads, fast enough inference is a strict prerequisite and GPUs just don't cut it.
You are right, assuming that model capabilities are determined only by model size. But consider that OpenAI says it has a way of scaling intelligence with inference-time compute, not just model size. If that proves out, reducing latency per output token becomes as valuable as, or even more valuable than, scaling model size. Speed becomes intelligence. And Cerebras has 1/10 the latency per token of anything else.
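To put some illustrative numbers on that: if inference-time scaling means generating long hidden reasoning chains before the visible answer, per-token speed translates directly into user-facing wait time. The token rates and chain length below are assumptions, not benchmark results:

    # Illustrative arithmetic only: how per-token speed compounds when a
    # model "thinks" for thousands of tokens before answering.
    REASONING_TOKENS = 10_000  # hypothetical hidden chain-of-thought length

    for name, tokens_per_s in [("~50 tok/s (typical GPU API)", 50),
                               ("~500 tok/s", 500),
                               ("~2,000 tok/s (Cerebras-class)", 2_000)]:
        wait_s = REASONING_TOKENS / tokens_per_s
        print(f"{name}: ~{wait_s:,.0f} s wait for a {REASONING_TOKENS:,}-token chain")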
On Google Cloud, a server with 8 TPU v5e chips will do 2,175 tokens/second on Llama 2 70B.
https://cloud.google.com/blog/products/compute/updates-to-ai...
From https://cloud.google.com/tpu/pricing and https://cloud.google.com/vertex-ai/pricing#prediction-prices (search for ct5lp-hightpu-8t on the page) the cost for that appears to be $11.04/hr which is just under $100k for a year. Or half that on a 3-year commit.
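The arithmetic behind that figure, plus a rough cost-per-token conversion derived from the quoted price and throughput (idealized 100% utilization, so treat it as an estimate):

    # Derived from the figures quoted above: $11.04/hr for ct5lp-hightpu-8t
    # and ~2,175 tokens/s on Llama 2 70B. Utilization and discounts ignored.
    PRICE_PER_HOUR = 11.04
    TOKENS_PER_SEC = 2175

    annual_cost = PRICE_PER_HOUR * 24 * 365
    tokens_per_hour = TOKENS_PER_SEC * 3600
    cost_per_million_tokens = PRICE_PER_HOUR / (tokens_per_hour / 1e6)

    print(f"Annual on-demand cost: ~${annual_cost:,.0f}")            # ~$96,710
    print(f"Tokens per hour:       ~{tokens_per_hour/1e6:.1f}M")     # ~7.8M
    print(f"Cost per 1M tokens:    ~${cost_per_million_tokens:.2f}") # ~$1.41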
That seems like a better deal than millions for a few CS-3 nodes.
And they've just announced the v6 TPU:
Compared to TPU v5e, Trillium delivers:
Over 4x improvement in training performance
Up to 3x increase in inference throughput
A 67% increase in energy efficiency
An impressive 4.7x increase in peak compute performance per chip
Double the High Bandwidth Memory (HBM) capacity
Double the Interchip Interconnect (ICI) bandwidth
https://cloud.google.com/blog/products/compute/trillium-sixt...

You're correct on $/bandwidth. The point about low latency continues to be ignored, though.
Is that a benchmark or a shower thought?
> If models keep lasting ~year timescales could we ever see people going with ROM chips for the weights instead of memory?
Before ROM, there's a step where HBM for weights is replaced with Flash or Optane (but still high bandwidth, on top of the chip) and KV cache lives in SRAM - for small batch sizes, that would actually be decently cheap. In this case, even if weights change weekly, it's not a big deal at all.
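A rough sizing sketch of why that split can work. The model dimensions below are Llama-2-70B-like assumptions of mine, not figures from the parent comment:

    # Llama-2-70B-like dimensions with GQA (assumptions for the sketch).
    LAYERS, KV_HEADS, HEAD_DIM, BYTES = 80, 8, 128, 2  # FP16 K and V

    def kv_cache_gib(seq_len, batch):
        """KV cache: 2 (K+V) * layers * kv_heads * head_dim * bytes per token."""
        per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES
        return per_token * seq_len * batch / 2**30

    weights_gib = 70e9 * 2 / 2**30  # ~130 GiB at FP16

    print(f"Weights:                 ~{weights_gib:.0f} GiB (streamed from Flash/Optane/HBM)")
    print(f"KV cache, 4K ctx, bs=1:  ~{kv_cache_gib(4096, 1):.2f} GiB (small enough for SRAM)")
    print(f"KV cache, 4K ctx, bs=32: ~{kv_cache_gib(4096, 32):.1f} GiB (SRAM gets tight fast)")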
How is this a narrow niche?
Chain-of-thought-type operations are in this "niche".
Also anything where the value is in the follow-up chat, not the one-shot.
Groq and Cerebras only make sense at massive scale which is why I guess they pivoted to being API providers so they can amortize the hardware over many customers.
Correct, except that massive scale doesn't work because it just uses up exponentially more power/space/resources.
They also have a very limited use case... if things ever shift away from LLMs and into another form of engineering that their hardware does not support, what are they going to do? Just keep deploying hardware?
Slippery slope.
The article explains the issues with memory in depth; did you read through it?
2x 80GB A100s are better than the MI300x in all the metrics while being cheaper.
clickbait title: inference is not training
The value proposition of Cerebras is that they can compile existing graphs to their hardware and allow inference at lower costs and higher efficiencies. The title does not say anything about creating or optimizing new architectures from scratch.
the title says "Cerebras Trains Llama Models"...
That's correct, and if you read the whole thing you will realize that it is followed by "... to leap over GPUs", which indicates they're not literally referring to optimizing the weights of a graph on a new architecture, or freshly initialized variables on an existing one.
What are you confused about? Their value proposition is very simple and obvious: custom hardware with a compiler that transforms existing graphs into a format that runs at lower cost and higher efficiency because it utilizes a special instruction set only available on Cerebras silicon.
Title is about training... article is about inference.
Why is nobody mentioning that there is no such thing as Llama 3.2 70B?
"It would be interesting to see what the delta in accuracy is for these benchmarks."
^ the entirety of it
Did they release MLPerf data yet, or would that not help their IPO?
"So, the delta in price/performance between Cerebras and the Hoppers in the cloud when buying iron is 2.75X but for renting iron it is 5.2X, which seems to imply that Cerebras is taking a pretty big haircut when it rents out capacity. That kind of delta between renting out capacity and selling it is not a business model, it is a loss leader from a startup trying to make a point."
As always, it is about TCO, not who can make the biggest monster chip.