
Zen 5's 2-ahead branch predictor: how a 30 year old idea allows for new tricks

321 points | 17 days ago | chipsandcheese.com
gary_017 days ago

Here's a great explanation of branch prediction, starting from the earliest implementations: https://danluu.com/branch-prediction/

skywal_l16 days ago

Godbolt recently did an ELI5 on computerphile about everything CPU[0] and about branch prediction in particular[1].

[0]: https://www.youtube.com/watch?v=nhXevKMm3JI&list=PLzH6n4zXuc...

[1]: https://www.youtube.com/watch?v=nczJ58WvtYo&list=PLzH6n4zXuc...

holoduke16 days ago

That's a good read.

ksec16 days ago

It will be interesting to see the SMT performance; I expect this will provide benefits and be further refined in future generations. With Zen 5c we get 192 cores or 384 vCPUs, and we should be getting 256 cores with Zen 6c next year. That means a dual-socket 1U server could potentially have 512 cores with 1024 vCPUs.

Whatever web app scaling issues we had in 2014 could now fit into a single server, assuming we somehow manage to cool the thing. Even at 1 RPS per vCPU that is about 1000 RPS, excluding cache hits. Even the HN front page doesn't hit the server at 1000 page views per second.

bayindirh16 days ago

Serving web pages is cheap. You’ll probably hit network I/O limits before you saturate the cores.

I wonder about its HPC performance. I think cooling this won't be a big problem, but it might be a wet one, requiring DLC after a certain point.

Dylan1680716 days ago

> Serving web pages is cheap. You’ll probably hit network I/O limits before you saturate the cores.

It's hard to be network I/O bound when serving web pages. Netflix struggles to be network I/O bound when serving video, which is so much bigger and uses so much less processing.

Epyc started off with 32 cores on PCIe 3, and quickly moved to 64 cores on PCIe 4. When we hit 256 cores it's probably going to have PCIe 6, which means it's still the same I/O per core.

But those numbers are crazy overkill for web serving anyway. If you wanted to allocate about a gigabit per core, with 512 cores across two CPUs, using PCIe 5 to be conservative, you'd need a total of... 16 lanes, for a single 400 Gb/s card you can buy today.

(This is assuming you mean the I/O from the server to the network. If you're talking about I/O outside the server, then upgrade your switches. Using denser servers doesn't increase your network load, it lets you take the same load and send it to fewer racks. You're already handling the total data somewhere.)
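
A back-of-the-envelope check of the lane math above, as a small sketch (the ~3.9 GB/s usable-per-lane figure for PCIe 5.0 is my assumption):

  #include <stdio.h>

  int main(void) {
      double need_gbit = 512 * 1.0;             /* 512 cores at ~1 Gbit/s each */
      double lane_gbyte = 3.9;                  /* approx. usable GB/s per PCIe 5.0 lane */
      double have_gbit = 16 * lane_gbyte * 8.0; /* one x16 slot, in Gbit/s */
      printf("need ~%.0f Gbit/s, x16 PCIe 5.0 gives ~%.0f Gbit/s\n",
             need_gbit, have_gbit);
      return 0;
  }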

account4214 days ago

Video being bigger also means it does not fit in lower levels of the cache hierarchy. For a normal website 99% of views should be served entirely from RAM.

bayindirh16 days ago

I mean, we already run NDR InfiniBand, sometimes with multiple cards per host in our data center, so the numbers are not surprising or out of this world for me...

Terabit Ethernet is also easy: just ten 100 Gbit fibers. Again, not much either in port count or in physical space (port count is ample, fibers are thin).

What I meant is, in a conventional data center you'll probably hit your allocated bandwidth limits before you saturate a processor like this while serving web pages, unless you're sitting on a big exchange point or the backbone router is in the next system hall. A single-socket Epyc system has 128 PCIe lanes and 12-channel memory. Complete overkill for such a job unless you serve millions, and only have a basement to put your servers in with some very fat network pipes.

touisteur16 days ago

As with Zen 3 and Zen 4, it's been a pleasure to work with good vectorization performance, and it seems they've been doing AVX-512 'right', at least for the HPC I know.

Add to that the 12 channels of DDR per socket and the quite generous cache, and once you've swallowed the need to buy 24 sticks for 2 sockets, these things are beautiful beasts at what Intel has taught us is a very reasonable price (ask your OEM for prices; public prices on CPUs are nuts).

JonChesterfield15 days ago

Lots of water in data centers these days. Works fine. I think the fear of it lost out to its utility some years ago.

jiggawatts15 days ago

We're entering the era of "kilo cores" in the same way computing entered the era of kilobytes in the 1940s. If you consider a tightly-coupled rack of servers with GPUs to be one machine, then we're well into the hundreds of kilocores.

I found it entertaining having a debate with someone here on HN who just couldn't grok the concept that it's possible to serve something the size of Wikipedia from a single server. That's been easy for a while now, it just isn't done for practical reasons such as availability or cost-efficiency.

JonChesterfield15 days ago

Is that actually true, or only that you can serve it from one server behind a lot of servers named CDN?

jiggawatts15 days ago

It’s definitely true. It won’t be optimal or fast for end users, but it’s entirely doable.

One of the main benefits of a CDN is that it brings the data closer to the users, reducing network latency. Obviously a single server can’t be close to every user globally!

In practice one would always use a CDN for a site like Wikipedia, but the point is that it isn't necessary to handle the traffic.

A single two-socket EPYC server can put out something like 400 Gbps of traffic, which is far more than all but the biggest sites.

fulafel13 days ago

GPU terminology "cores" are more like SIMD lanes, they're not comparable.

WithinReason16 days ago

The opposite might be true: the better you are at utilising the CPU pipeline, the less spare capacity you have to run a second thread, so SMT's benefits might diminish.

ColonelPhantom16 days ago

Correct. However, Zen 5 seems to be the first major 'scaling up' that AMD has done since the original Zen, given that the core seems to be moving from dispatching 6 to 8 instructions per clock and from 4 to 6 ALUs. I would expect SMT to be excellent for low-IPC workloads to get the most out of that scaling.

fulafel16 days ago

SMT could use a lot more benchmarking investigations.

Intuitively, having more tasks working on the same problem at half speed should have a memory usage cost. Are apps commonly using more memory for no speed gain when using SMT?

In a lot of published benchmarks it seems most apps don't noticeably benefit in execution speed.

crest16 days ago

SMT also allows the cores to hide memory latency by executing the other thread(s). Taken to the extreme end this would give you a barrel CPU.

fulafel16 days ago

Yes, this is reflected in the execution speed of the workloads (vs being an additional thing).

Other things making up the potential for being more than a zero sum game (half the speed) include making better use of the ALU and branch units that would otherwise be waiting for dependencies. On the other side of the scale is how different working sets of the threads can thrash each others cache working sets.

JonChesterfield15 days ago

Push the hyperthread count up higher and you have enough stuff to do while waiting on memory that branch prediction doesn't matter as much. Could call it a GPU.

nullc15 days ago

I think unfortunately with just 2-way SMT it's pretty easy to get both threads stalled at once.

The POWER9 4-way SMT seems to be a fair bit more effective than what I've seen on x86.

teaearlgraycold16 days ago

Web app scaling issues are usually around database latency.

jiggawatts15 days ago

Only because the database is on the far end of a network hop, and then all too often its storage is remote from its compute as well. This is the most common scenario in enterprise or cloud deployments, where for good measure a couple of firewalls, load balancers, routers, and reverse proxies are added at every stage.

CodesInChaos16 days ago

In my experience, scaling is limited by database throughput, not latency.

teaearlgraycold16 days ago

Depends. Most apps that are slow have shitty app code with N+1 queries. If you can get past that stage and write good enough app code then eventually you'll hit database throughput issues.

codesnik16 days ago

well, DB latency in some naive unoptimized app with multiple sequential queries directly affects the _app_'s throughput

jeltz16 days ago

But database performance is also a lot about CPU and RAM speed plus concurrent use of those resources.

andrepd16 days ago

Man I sure hope that your server does more than 1 RPS :)

ksec16 days ago

I wish I were joking, but there are plenty of production instances doing less than 10 RPS per vCPU. (I don't think the "request" part is well defined, since it could be a page view but it could also be an API request. In this case it mostly refers to a page view.)

IvanAchlaqullah17 days ago

It's always interesting to see decades-old papers, sometimes published with little to no fanfare, suddenly become "state of the art" because hardware has become powerful enough.

For example, Z-buffers[1], used by 3D video games. When first published, it wasn't even the main topic of the paper, just a side note, because it required an expensive amount of memory to run.

Turns out megabytes are quite cheap a few decades later, and every realtime 3D renderer ended up using it.

[1] https://en.wikipedia.org/wiki/Z-buffering

abainbridge16 days ago

Another example is Low Density Parity Check Codes [1]. Discovered in 1962 by Robert Gallager but abandoned and forgotten about for decades due to being computationally impractical. It looks like there was a 38 year gap in the literature until rediscovered by David MacKay [2].

The first mainstream use was in 2003. It is now used in WiFi, Ethernet and 5G.

[1] https://en.wikipedia.org/wiki/Low-density_parity-check_code

[2] https://scholar.google.com/scholar?q=%22low+density+parity+c...

bee_rider17 days ago

I sometimes wonder if there's an academic career hidden in there for an engineer: go to the library and read what the CS folks were publishing on physical paper; maybe there are some ideas that can actually be implemented now that weren't practical back then.

Terr_17 days ago

In a series of books by David Brin [0] there is a galaxy-wide institution known as the library, and civilizations regularly mine its millions of years of data for suddenly-relevant-again techniques and technologies.

I remember one bit where a species had launched some tricky fleet-destroying weapon to surprise their enemies with esoteric physics, only to have it reversed against them, possibly because the Librarian that once helped their research-agent wasn't entirely unbiased.

[0] https://en.m.wikipedia.org/wiki/Uplift_Universe

AceJohnny217 days ago

Also in Vinge's Deepness in the Sky, there aren't really "programmers" as we know them anymore, but "programmer-archeologists" that just search the archives for code components to reuse.

abecedarius16 days ago

I think that's unfair to the programmer-archaeologists: young Pham wanted to write things from scratch (and took advantage when he got a chance to), and other characters said he was a talented hacker, but as they also said, it was just way more productive most of the time to cobble together ancient code.

Terr_17 days ago

Also: In the Destiny mythic sci-fi franchise, the human golden age ended with a mysterious apocalypse, leaving "cryptarchs" (crypto-archeologists) to try to rebuild from arcane fragments of encrypted data or unknown formats.

CoastalCoder17 days ago

Mind editing that to give a spoiler alert?

Terr_17 days ago

Don't worry, it's nowhere near the main plot or characters, just a small "meanwhile, elsewhere" vignette. Basically emphasizing the "why bother everything's already invented" mentality of most client-races, and how deep access and query-secrecy have big impacts.

kvemkon17 days ago

Academic? Perhaps, applied:

"When Soviets Won the Cold War: Wading Through Pyotr Ufimstev's Work on Stealth" (26.03.2024)

https://news.ycombinator.com/item?id=39830671

> In the early 1970s, Lockheed Martin engineer Denys Overholser discovered the key to stealth technology hidden in a stack of translated Soviet technical papers. Disregarded by the Soviet academic elite, and unheard of in the United States, Pyotr Ufimstev had worked out calculations that would help win ...

Not sure of the details of this story, but in general, having enough people that you can grant them time just to search for something interesting doesn't seem unrealistic.

findthewords17 days ago

Yes, "read 10 year old papers as a source of ideas ripe for commercialization" IS common advice in universities.

chrisbrandow17 days ago

A post-doc in my chemistry lab had the saying, “two weeks in the lab will save you a day in the library”

scns16 days ago

Weeks of work can save hours of planning.

fragmede16 days ago

The whole AI trend is in part due to things that are now possible on GPU supercomputers with gigabytes of RAM, backed by petabytes of data and the top speed of GPUs. Some of the algorithms date back to before the AI winter; it's just that we can now do the same thing with a ton more data and faster.

zaptrem16 days ago

All of the main algorithms do (multi-layer perceptrons and stochastic gradient descent are from the 50s and 60s!). Basically the only thing that changed is we decided to multiply some of the outputs of the multi-layer perceptrons by each other and softmax it (attention) before passing them back into more layers. Almost all of the other stuff is just gravy to make it converge faster (and run faster on modern hardware).

toast016 days ago

Heck. Look at 10 year old product launch PRs from big tech. Anything that Google launched 10 years ago and killed, but seems like a good idea today is probably also easier to do. And if you look 5-10 years before that, you can find the Yahoo launch PR where they did the same thing ;p

dehrmann16 days ago

I always like to point out Wordle for nailing the timing. Some version of it has been viable since 2000, and mass smartphone adoption helped with how it spread, but it could have been viable in 2010. What it did was land at the right moment.

pornel17 days ago

Image and video compression are like that. Ideas for mainframes in the '80s are realtime algorithms now.

eru17 days ago

I guess they are _soft_ realtime algorithms now?

eru17 days ago

I wonder if current AI training's effort to hoover up all the training data they can find will accidentally give us most of the benefits of that?

A human can only read so much, so has to discriminate. But our machines are omnivorous readers.

lispwitch16 days ago

this seems to be how PEG parsing became popular during the last couple of decades, for example; see https://bford.info/pub/lang/peg/ (peg.pdf p11: "This work is inspired by and heavily based on Birman’s TS/TDPL and gTS/GTDPL systems [from the 1970s ...] Unfortunately it appears TDPL and GTDPL have not seen much practical use, perhaps in large measure because they were originally developed and presented as formal models [...]")

The_Colonel16 days ago

> suddenly becomes "state of the art" because hardware have become powerful enough.

I'd rather say that we've been capable of such a design for several decades, but only now has the current set of trade-offs made it attractive. Single-core performance was stifled in the last two decades by prioritizing horizontal scaling (more cores), so the complexity / die area of each individual core became a concern. I imagine if this trend had not taken place for some reason, and CPU designers had primarily pursued single-core performance, we'd have seen this implemented much sooner.

Regarding the Z-buffer, I kind of get why this would appear as a side note; it's a simple concept. Perhaps an even better example is ray tracing: the concept is quite obvious even to people without a 3D graphics background, but it was just impractical (for real-time rendering) in terms of performance until recently. What I find interesting is that we haven't found a simpler approach to approximate true-to-life rendering and need to fall back on this old, sort of naive (and expensive) solution.

pcwalton16 days ago

Another example is Rust's borrow checker, which has roots in substructural type system papers from decades earlier. Many academics considered substructural type systems dead (killed by GC, more or less) until Rust resurrected the idea by combining it with some new ideas from C++ of the time.

protomolecule16 days ago

"with some new ideas from C++ of the time"

Could you elaborate on that?

masklinn16 days ago

I assume it's the deterministic, implicit, synchronous, single-use destructor.

Affine logic only says what you're able to do with a value, it doesn't say anything about what happens when that value goes out of scope.

protomolecule16 days ago

Hmm, that's possible

pcwalton15 days ago

Move semantics and rvalue references. It was clear that within C++11 there was a substructural type system struggling to get out.

protomolecule14 days ago

I was going to say that Rust's destructive move is quite different from C++'s move semantics, but official docs disagree: "C++: references, RAII, smart pointers, move semantics, monomorphization, memory model" [0]

[0] https://doc.rust-lang.org/reference/influences.html

crest16 days ago

Z-buffers don't just require about another frame buffer's worth of memory, but also lots of read and write bandwidth per pixel. This extra memory bandwidth requirement made them expensive to implement well. High-end implementations used dedicated RAM channels for them, but on lower-end hardware they "stole" a lot of memory bandwidth from a shared memory interface, e.g. some N64 games optimised by drawing a software-managed background/foreground with the Z-buffer disabled to avoid the cost of reading and updating the depth information.

rasz16 days ago

The power of the Z-buffer lies in letting you skip all of the expensive per-pixel computations. A secondary perk is avoidance of overdraw.

Early 3D implementations, especially gaming-related ones, didn't have any expensive per-pixel calculations. The cost of looking up and updating depth in the Z-buffer was higher than just rendering and storing pixels. Clever programming to avoid overdraw (Doom sectors, Duke3D portals, Quake sorted polygon spans) was still cheaper than Z-buffer read-compare-update.

Even the first real 3D accelerators (3dfx) treated the Z-buffer as an afterthought used at the back of the fixed pipeline: every pixel of every triangle was brute-force rendered, textured, lit and blended just to be discarded by the Z-buffer at the very end. It's very likely the N64 acted in the same manner and enabling the Z-buffer didn't cut the cost of texturing and shading.

The Z-buffer started to shine with the introduction of pixel shaders, where per-pixel computations became expensive enough (~2000, GeForce/Radeon).

kevingadd16 days ago

Some modern algorithmic smarts (like hierarchical Z) helped a lot to reduce the memory bandwidth demands, to be fair.

BitPirate16 days ago

The EEVDF scheduling algorithm is also a good example. Designed back in 1995 and it's the default process scheduler of Linux now.

throwaway203715 days ago

I never heard about this before your post.

Wiki tells me: <<Earliest eligible virtual deadline first (EEVDF) is a dynamic priority proportional share scheduling algorithm for soft real-time systems.>>

Ref: https://en.wikipedia.org/wiki/Earliest_eligible_virtual_dead...

twoodfin17 days ago

On the software side, garbage collection was well explored by academia for more than two decades before the JVM brought it to wide commercial adoption ca. 1995.

bhouston17 days ago

Lisp and Smalltalk and BASIC (to name a few) were popular languages before the JVM that used GC.

tasty_freeze17 days ago

When I was in high school, I wrote a Z80 disassembler in BASIC and printed out the entire TRS-80 Model 3 ROM on a thick stack of fan folded greenline paper. I spent a lot of free time figuring out what was what and commenting the code by pen in the right margin.

The (string) garbage collector was amazingly frugal and correspondingly slow. Because the garbage collector typically ran only when memory was critically low it used almost no memory, just 4 bytes I think. (it was also possible to invoke it any time with the FRE(0) function call IIRC)

The routine would sweep all of the symbol table and expression stack looking for string pointers, and would remember the lowest (highest? I forget which) string base address which hadn't already been compacted. After each pass it would move that one string and fix the symbol table pointer (or expression stack string pointer), then do another pass. As a result, it was quadratic with the number of live strings -- so if you had a large number of small strings, it was possible for your program to lock solidly for 30 seconds or more.
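
A rough C sketch of that one-string-per-pass compaction, reconstructed from the description above (not the actual ROM code; names, sizes and the sample strings are made up):

  #include <stdio.h>
  #include <string.h>

  #define HEAP_SIZE 64
  static char heap[HEAP_SIZE];
  static int compact_top = HEAP_SIZE;   /* everything at/above this is already compacted */

  /* stand-in for the BASIC symbol table: length + base address of each live string */
  static struct { int len; int addr; } strings[] = { {3, 5}, {4, 20}, {2, 40} };
  #define NSTR (int)(sizeof(strings) / sizeof(strings[0]))

  static void collect(void) {
      for (;;) {
          int best = -1;
          /* one full sweep per pass, keeping only a few bytes of state */
          for (int i = 0; i < NSTR; i++)
              if (strings[i].addr + strings[i].len <= compact_top &&
                  (best < 0 || strings[i].addr > strings[best].addr))
                  best = i;
          if (best < 0) break;                        /* nothing left to compact */
          int dst = compact_top - strings[best].len;  /* slide this one string up */
          memmove(&heap[dst], &heap[strings[best].addr], strings[best].len);
          strings[best].addr = dst;                   /* fix its pointer, then sweep again */
          compact_top = dst;
      }
  }

  int main(void) {
      memcpy(&heap[5],  "foo",  3);
      memcpy(&heap[20], "barz", 4);
      memcpy(&heap[40], "qx",   2);
      collect();                                      /* O(live strings^2) overall */
      for (int i = 0; i < NSTR; i++)
          printf("string %d now at %d\n", i, strings[i].addr);
      return 0;
  }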

bhouston15 days ago

Your description closely matches what is described here: https://en.wikipedia.org/wiki/Garbage_collection_(computer_s...

fulafel16 days ago

Not academia: GC (and lots of other PLT) was mature and productized by then, just missed by the majority of the industry living in the dark ages. In the 90s, enterprisey architecture astronaut cultures went Java/C++ and hackerish cultures went Perl/Python seasoned with C.

fragmede15 days ago

given how much C++ underpins the video game industry, I don't think "enterprisey architecture astronaut" is entirely fair to video game programmers

fulafel15 days ago

Sure, it's a gross generalization, games and embedded have an excuse. Though I don't think C++ was that big in the mid-90s on games side yet, it was more C.

mrlonglong17 days ago

Speculative predictors have been subjected to a number of attacks to weasel out private data. Given that so many of the common ISAs are vulnerable, are they taking steps to reduce the impact of such attacks?

kmeisthax17 days ago

The vulnerability is speculative execution, not branch prediction. The branch predictor is the thing you have to trick to force the processor to speculatively execute code in the victim program. Furthermore you also need a valid timing source to read out the results of the speculative execution.

As for how to stop that, short of boiling the ocean[0], you don't. Speculative execution is so valuable for performance that a computer without it is completely unusable. If you really want a processor without it, buy an old first-gen Pentium.

Actual practical mitigations for speculative execution vulnerabilities are varied, but at a minimum you have to ensure process separation between a victim process holding secrets and any potential attackers that may have the opportunity to influence victim process execution. Intel was caught with their hands in their pants speculating across rings, which is why you could read kernel or hypervisor memory from userspace, but on not-poorly-designed CPUs the main victim you have to worry about is HTML iframes. Different origins aren't allowed to make HTTP requests to one another[1], but they can transclude[2] one another without permission. That traditionally loaded information from the origin into the attacker's process, which could be exfiltrated with timing attacks.

The web's solution to this was actually not to process-separate iframes, at least not initially, but to take away shared-memory multithreading entirely. If you deny the attacker a timing reference then it doesn't matter what they can make the victim speculatively execute. But to do this you have to take away multithreading because otherwise a thread can just repeatedly write known data in a loop to create a clock.

[0] https://hackaday.com/2013/08/02/the-mill-cpu-architecture/

[1] At least not without the target origin allowing it via CORS

[2] e.g. hotlink images or embed iframes
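
A minimal sketch of the "thread writing in a loop to create a clock" idea mentioned above, in C with pthreads (illustrative only; real attacks pair such a counter with cache probing):

  #include <pthread.h>
  #include <stdio.h>
  #include <stdatomic.h>

  static atomic_ulong ticks;   /* the shared counter acts as a makeshift clock */
  static atomic_int stop;

  static void *clock_thread(void *arg) {
      (void)arg;
      while (!atomic_load_explicit(&stop, memory_order_relaxed))
          atomic_fetch_add_explicit(&ticks, 1, memory_order_relaxed);
      return NULL;
  }

  int main(void) {
      pthread_t t;
      pthread_create(&t, NULL, clock_thread, NULL);   /* build with -pthread */

      unsigned long before = atomic_load(&ticks);
      volatile double x = 1.0;
      for (int i = 0; i < 1000000; i++) x *= 1.0000001;  /* the "work" being timed */
      unsigned long after = atomic_load(&ticks);

      printf("elapsed ~%lu counter ticks\n", after - before);
      atomic_store(&stop, 1);
      pthread_join(t, NULL);
      return 0;
  }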

topspin16 days ago

> Speculative execution is so valuable for performance that a computer without it is completely unusable.

Jim Keller's view aligns with this and goes further. My interpretation of his thinking is that predictors and speculation are the only meaningful features of CPUs today. ISA doesn't matter anymore because the power of modern compilers makes high performance software highly portable and all CPUs end up bottlenecked on the quality and capacity of the predictors, regardless of the ISA. For example, the burden of x86 complexity no longer matters because it amounts to a "tax" small enough to be lost in the noise.

That's from a designer making high performance RISC-V CPUs.

fanf216 days ago

This article has several paragraphs discussing how the decode width of x86 front ends is limited by the need to discover instruction boundaries, which in turn limits the useful issue width. Big ARM cores have much larger decode width than x86 cores, so they don’t need SMT to keep their execution units busy.

ISA doesn’t matter any more because all the CISCiest CISCs and the RISCiest RISCs have been discarded (except for RISC V…) so modern CPUs don’t have to cope with multiple memory addresses per instruction or indirect addressing.

topspin12 days ago

> limited by the need to discover instruction boundaries

Keller specifically addressed this. His view is that yes, the x86 instruction width problem is real, but it's not an important performance bottleneck because decoders are now sophisticated enough that other bottlenecks (predictors) dominate the net performance. Yes, this means more power usage and die area, but only a part of the market is sensitive to this: where other considerations dominate the cost premium of x86 can be ignored.

He also believes that "RISC-V is the future of computing," (not surprising from the CEO of a RISC-V vendor,) so it isn't as if the legacy ISAs are somehow just fine. But ISA complexity isn't the key determinant in that future. The keys are the open ISA, commoditization of designs and high software portability.

This makes sense to me. x86, despite its inherent issues, has fought off alternatives before. But now, as per Keller, there is a Cambrian explosion of RISC-V designs appearing, and what emerges from that is going to prevail against any legacy ISA, not just x86. Ultimately, therefore, the x86 "tax" doesn't actually matter.

kllrnohj15 days ago

> Big ARM cores have much larger decode width than x86 cores

Not in general they don't. Apple's does, and Qualcomm's newest Snapdragon X Elite also does. But most big ARM cores don't have larger decode widths than x86 cores. The Cortex-A78 is a 4-wide decode, same as the majority of x86 CPUs on the market. ARM's latest & greatest Cortex X2 is only a 5-wide decode, it only just finally surpassed the original Zen design (4-wide).

Also Zen 5 bumps to an 8-wide decode, the same as what Apple & Qualcomm's best ARM chips can do.

The_Colonel16 days ago

This is so annoying about the hype around ARM, which even smart people fall for. Yes, Apple Silicon is good, but it's not because of the ARM ISA. I still keep hearing that ARM is RISC, which has been wrong since the 1990s.

ribit16 days ago

While I understand the argument, it would also be good to see some empirical evidence. So far all x86 builds need more power to reach the same performance level as ARM. Of course, Apple is still the outlier.

scheme27116 days ago

I think a few other things like memory models also matter and affect cpu architecture. E.g. the x86 total store order vs arm64's model. You potentially get to do a few nice optimizations on arm64 vs x86. I'm not sure how much of a difference that makes though.

topspin14 days ago

Memory models are indeed important. That's why Apple extended ARM with total store order capability. A CPU can efficiently implement more than one memory model.

hajile16 days ago

Look into Intel's iAPX and tell me that ISA doesn't matter. How would you go about making that pile of garbage into something fast?

Most CISC designs died. A large part of x86's survival is because it wasn't exceptionally CISC.

topspin15 days ago

> How would you go about making that pile of garbage into something fast?

Make a Rosetta analog and translate to an instruction set that is amenable to efficient prediction, wide dispatch, etc., replacing all the iAPX hardware object-oriented nonsense with conventional logic. If necessary, extend the CPU to accommodate whatever memory ordering behavior is necessary for efficient execution, à la Apple Silicon ARM with x86 total store ordering.

toast016 days ago

> Speculative execution is so valuable for performance that a computer without it is completely unusable. If you really want a processor without it, buy an old first-gen Pentium.

Pentiums have branch prediction and speculative execution. You need to go back to i486 if you don't want speculative execution. Most of the socket 5/7 processors from other makers also had branch predictors and speculative execution, but not the Centaur Winchip. The Cyrix 5x86 for socket 3 (486) had speculative execution, but it was disabled by default and is reported to be buggy (but helps performance on published benchmarks).

pezezin16 days ago

According to Wikipedia, the original Pentium had branch prediction, but speculative execution was first implemented by the Pentium Pro: https://en.wikipedia.org/wiki/P6_(microarchitecture)#Feature...

gpderetta16 days ago

The first pentium had branch prediction so of course it had speculative execution.

What it didn't have is out of order execution, which greatly increases the speculation window.

akira250117 days ago

Interactions between speculative execution and virtual memory translation and caches are exploitable. It's not an inherent vulnerability in prediction.

paulmd17 days ago

Sure, as long as it’s not observable in any way, it’s not observable. But the problem is that speculation has been repeatedly been found to be observable in unexpected ways from both brands.

AMD, for example, has observability issues in all Zen 1/2/3 processors that leak enough data to break KASLR, which remain unpatched in all chips except for Milan (specifically Epyc only, not Ryzen). They didn't expect cache ways to be visible in that fashion, and it's observable that the model is incorrectly implemented.

the idea cannot fail, only be failed, because if it’s observable then obviously you just did it wrong.

akira250117 days ago

The point was, you can engineer around problems that aren't _inherent_ in the design, but arise as an interaction between systems. You just change the rules of interaction between systems to mitigate the vulnerabilities.

So, to your original question, we're always going to have prediction, and we're going to have to solve any vulnerabilities that arise downstream. Which, thankfully, is always an available option.

emn1317 days ago

As a novice in this area, it's not clear to me after reading this what exactly the 2-ahead branch predictor is.

sillywalk16 days ago

It's from around 30 years ago; my guess is it's referring to this[0] paper from 1996. It's above my head, but it seems to help with branch prediction issues arising with both many instruction units and high clock speed, which were sort of either/or in the '90s, but I think most modern processors are both.

Multiple-block ahead branch predictors

Abstract:

A basic rule in computer architecture is that a processor cannot execute an application faster than it fetches its instructions. This paper presents a novel cost-effective mechanism called the two-block ahead branch predictor. Information from the current instruction block is not used for predicting the address of the next instruction block, but rather for predicting the block following the next instruction block. This approach overcomes the instruction fetch bottle-neck exhibited by wide-dispatch "brainiac" processors by enabling them to efficiently predict addresses of two instruction blocks in a single cycle. Furthermore, pipelining the branch prediction process can also be done by means of our predictor for "speed demon" processors to achieve higher clock rate or to improve the prediction accuracy by means of bigger prediction structures. Moreover, and unlike the previously-proposed multiple predictor schemes, multiple-block ahead branch predictors can use any of the branch prediction schemes to perform the very accurate predictions required to achieve high-performance on superscalar processors."

[0] https://dl.acm.org/doi/10.1145/237090.237169

EDIT: oops. Looks like eyegor posted the link earlier. Oh well, enjoy the abstract.

cpldcpu17 days ago

My understanding is that they do not predict the target of the next branch but of the one after the next (2-ahead). This is probably much harder than next-branch prediction but does allow initiating code fetch much earlier to feed even deeper pipelines.

layer817 days ago

Surely you must also predict the next branch to predict the one after. Otherwise you wouldn’t know which is the one after.

Given that, I still don’t understand how predicting the next two branches is different from predicting the next branch and then the next after that, i.e. two times the same thing.

ithkuil17 days ago

Interestingly you can build a branch predictor that predicts the second branch without predicting the first.

A branch predictor result is just a tuple of ("branch instruction address", "branch target address") that hints the processor that when the CPU will encounter a given branch instructions in the future (at "branch instruction address") it will likely branch to the branch target and so it would make sense to start fetching that address and filling the instruction pipeline with whatever steps are safe to perform before the jump will be actually performed.

Now, commonly this branch happens to be at the end of the current basic block and I assume some branch predictors may also leverage this fact in order to encode only offsets from the current instruction pointer.

But there is no reason why the predicted branch can't be located after some other branches that may be taken. As long as the CPU eventually gets to that branch location the prediction will be useful. If the IP never reaches that location, it's as if the branch was never actually taken.
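
A toy software model of that tuple, just to make the data structure concrete (a direct-mapped table; the size, hashing and addresses are arbitrary):

  #include <stdint.h>
  #include <stdbool.h>
  #include <stdio.h>

  #define BTB_BITS 9
  #define BTB_SIZE (1u << BTB_BITS)

  struct btb_entry { uint64_t branch_pc; uint64_t target; bool valid; };
  static struct btb_entry btb[BTB_SIZE];

  static unsigned btb_index(uint64_t pc) { return (pc >> 2) & (BTB_SIZE - 1); }

  /* fetch-side lookup: do we have a ("branch address", "target address") tuple for this PC? */
  static bool btb_predict(uint64_t pc, uint64_t *target) {
      struct btb_entry *e = &btb[btb_index(pc)];
      if (e->valid && e->branch_pc == pc) { *target = e->target; return true; }
      return false;
  }

  /* execute-side training: record where this branch actually went */
  static void btb_update(uint64_t pc, uint64_t taken_target) {
      struct btb_entry *e = &btb[btb_index(pc)];
      e->branch_pc = pc; e->target = taken_target; e->valid = true;
  }

  int main(void) {
      btb_update(0x401020, 0x401100);   /* branch at 0x401020 last went to 0x401100 */
      uint64_t t;
      if (btb_predict(0x401020, &t))
          printf("predict fetch from %#llx\n", (unsigned long long)t);
      return 0;
  }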

toast016 days ago

> Given that, I still don’t understand how predicting the next two branches is different from predicting the next branch and then the next after that, i.e. two times the same thing.

I'm not involved in CPU design, I just read a lot, but...

I think you need to do something special to have a second prediction, because you have to track three windows of out of order execution:

Window 0: code you're definitely running but is still being completed.

Window 1: code from the branch you think will be taken

Window 2: code from the 2-ahead branch you think will be taken.

If you figure out that the window 1 branch isn't taken, you have to drop the whole pipeline (pipeline bubble). But if you figure out that window 1 is taken, then window 1 becomes window 0 and window 2 becomes window 1.

With a 1 ahead predictor, the pipeline stalls if you get to a conditional branch while speculating in window 1, because the processor can't manage three instruction windows.

IMHO, it sounds like if the core is doing SMT and both threads are active, each thread only gets 1-ahead prediction because the two windows are statically divided between the cpu threads. This may mean a) a significant boost for some loads when SMT is not in use and b) SMT can branch predict in both threads in the same cycle, I don't think that was possible on AMD before (no idea for other vendors)

gpderetta16 days ago

With a typical OoO CPU you have reorder buffers in the hundreds of instructions, so you are not speculating two or three branches but potentially dozens ahead of non-speculative execution (and that's why prediction accuracy is so important).

So the question remain, what's the innovation here? I'm sure there is something, but it is not simply speculating two ahead.

I need to read further, but this might be an optimization in the fetch stage to avoid a bubble when fetching consecutive taken branches.

Filligree17 days ago

Building on the sibling comment:

  if (a) { ... }

  if (b) { return x; } else { return y; }

The two branches can be wholly independent, but predicting the second is still a two-ahead prediction.
fulafel16 days ago

> Surely you must also predict the next branch to predict the one after. Otherwise you wouldn’t know which is the one after.

I'd think if you are at PC N and there are branches at N+1 and N+2, predicting just branch N+2 is fine because you predicted the N+1 branch previously, at PC N-1.

flamedoge16 days ago

I wonder what they had before this change. Branch predictor hardware may not have accounted for depth beyond a single conditional branch? But the pipeline was probably always kept filled, just unpredicted.

emn1317 days ago

Ah, that makes sense in the context of the article - thanks!

hmry17 days ago

You're not alone. Non-novice, same here. Article spends a lot of time explaining the absolute basics of branch prediction but then when it gets to 2-ahead it just skips over explaining it...

eyegor17 days ago

I think it just predicts 2 branches per cycle instead of 1. So it can evaluate the result of n+2 ahead of time instead of only n+1 (typical branch prediction). How this works without wrecking the L1 cache, I'm not sure. It seems like the lookahead past n+1 would make cache evictions much more likely, so maybe I'm missing something here.

> Zen 5 can look farther forward in the instruction stream beyond the 2nd taken branch and as a result Zen 5 can have 3 prediction windows where all 3 windows are useful in producing instructions for decoding.

The original paper is open access but I haven't read far into it: https://dl.acm.org/doi/10.1145/237090.237169

phire16 days ago

> It seems like the lookahead past n+1 would make cache evictions much more likely, so maybe I'm missing something here.

The frontend is already predicting dozens of branches ahead of what the backend can actually confirm. Looking ahead by one extra branch doesn't really hurt.

Also, modern TAGE branch predictors are scary accurate, well above 99% on most code (including unpredictable indirect jumps). Besides, the majority of branch prediction targets are already in the L1 cache, it only predicts them because it saw them recently.

The branch predictor in Apple's M1 actually takes advantage of the latter fact. It doesn't predict what address to fetch next, it predicts which cacheline in L1 holds the target. So you only actually get branch predictions for targets in L1.

mjevans16 days ago

That seems like a good idea against speculation based attacks too; predict within what is at hand, do not cause side effects.

phire16 days ago

I didn't even notice that advantage. I was just thinking about how it minimised each branch target entry to about 16 bits.

I suspect Apple are also using it as a way predictor. If the BTB points directly to the correct cacheline and each cacheline points to the next way, then the only time you need to do a search is on a branch mispredict.

tigrou12 days ago

A regular branch predictor tries to guess which way a branch (e.g. if-else) will go before it's executed. That way, the CPU can fetch and decode instructions in advance.

Each side of the branch leads to the start of a new block of instructions. The last instruction of such a block is usually another branch.

In other words, a branch predictor guesses the address of the next block. A 2-ahead branch predictor does the same thing, but for the two subsequent blocks.

As stated in the paper: "information from the current instruction block is used for predicting the address of the block following the next instruction block".

Unlike a regular branch predictor, it can do that without having to wait for instructions in the next block to be decoded. That way, you can feed multiple instruction decoders at once.

This is particularly useful in modern CPUs where the instruction decoder has become a bottleneck. With only 1 decoder and 1 instruction decoded per cycle (at best), it cannot cope with a wide core that can execute many instructions (e.g. 4-6) per cycle.
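
To make the "block after next" idea concrete, here is a toy software model (purely illustrative; real predictors are far more elaborate and operate on fetch blocks, not a trace array):

  #include <stdio.h>
  #include <stdint.h>

  #define TABLE_BITS 10
  #define TABLE_SIZE (1u << TABLE_BITS)

  /* table[hash(block)] caches the block seen two steps later, so while fetching
     block B the frontend already has a guess for B's grand-successor */
  static uint32_t two_ahead[TABLE_SIZE];

  static uint32_t idx(uint32_t block) { return (block >> 4) & (TABLE_SIZE - 1); }

  int main(void) {
      /* hypothetical trace of basic-block start addresses (a loop over 3 blocks) */
      uint32_t trace[] = {0x1000, 0x1040, 0x1080, 0x1000, 0x1040, 0x1080,
                          0x1000, 0x1040, 0x1080, 0x1000, 0x1040, 0x1080};
      int n = sizeof(trace) / sizeof(trace[0]);
      int hits = 0, preds = 0;

      for (int i = 0; i + 2 < n; i++) {
          uint32_t guess = two_ahead[idx(trace[i])];
          if (guess) {                              /* only count once trained */
              preds++;
              hits += (guess == trace[i + 2]);
          }
          two_ahead[idx(trace[i])] = trace[i + 2];  /* train: remember the grand-successor */
      }
      printf("2-ahead hits: %d/%d\n", hits, preds);
      return 0;
  }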

deadmutex17 days ago

You can check out the seminal paper linked in the article. Or start by summarizing the paper with Gemini, Claude, ChatGPT, etc. to get a high level overview (and then confirm the answer by reading the paper).

pyrolistical16 days ago

We would need more branch hints? https://github.com/ziglang/zig/issues/5177

Cold, warm, warmer, omit hot as it is the default? Sometimes you would set all branches to be cold except one
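
For comparison, this is roughly what graded hints look like in C with GCC/Clang extensions today (a sketch; __builtin_expect_with_probability needs GCC 9+ or a recent Clang, and the function and numbers here are made up):

  #include <stdio.h>
  #include <stdlib.h>

  __attribute__((cold, noinline))
  static void handle_rare_error(int code) {      /* "cold": moved out of the hot path */
      fprintf(stderr, "error %d\n", code);
      exit(1);
  }

  int process(int value) {
      if (__builtin_expect(value < 0, 0))        /* plain "unlikely" */
          handle_rare_error(value);

      /* "warm-ish": hint that this is taken about 30% of the time */
      if (__builtin_expect_with_probability(value % 3 == 0, 1, 0.3))
          return value * 2;

      return value + 1;
  }

  int main(void) {
      int sum = 0;
      for (int i = 0; i < 100; i++) sum += process(i);
      printf("%d\n", sum);
      return 0;
  }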

Szpadel17 days ago

That's probably a bad idea but I would like to learn why:

Why, when we have a conditional branch, can't we just fetch and prepare instructions for both possible branches and then discard the incorrect one?

Is this that much harder, or are there other reasons that make it not worth it?

phire16 days ago

It's a sub-optimal strategy.

A modern TAGE branch predictor is correct well over 99% of the time. So those extra instructions for the other side of the branch are almost always discarded.

Worse, the frontend is fetching dozens of branches ahead of where the backend can actually confirm which direction to take. What are you going to do at the next branch, start decoding four possible branches? then 8, 16, 32 possible branches? Remember, most of the time you are going to throw it away.

If you actually have the hardware to fetch from multiple instruction streams in parallel (which Intel's Gracemont/Goldmont/Skymont and now AMDs Zen 5 do), the better strategy is to assume your branch predictor is actually correct 100% of the time. Follow one side of the branch, then the one after it.

Intel's Skymont actually decodes the next 3 branch targets in parallel because it has three decoders, each 3-wide. Intel actually introduces fake branches to break up large blocks of code, so that all three decoders are always active decoding different parts of the upcoming instruction stream. The three uop streams are later merged, allowing Skymont to maintain an effective decode bandwidth of 9 instructions per cycle.

If you executed both sides of the branch, you would only slightly reduce the branch mispredict delay in the rare case the branch prediction was wrong. Instead, by following one side of the branch through the next two or three predictions, Intel and AMD can make multiple decoders do work in parallel. Intel is doing 9-wide with three simpler 3-wide decoders, and AMD can do 8-wide with two simpler 4-wide decoders.

tigrou12 days ago

Are you saying that branches can be used as a way of dividing code into several parts, so that several instruction decoders can be fed at the same time? Is it because the instruction decoder is actually one of the bottlenecks, not being able to deliver instructions fast enough to the hungry frontend? If branches can be used in this way, unconditional jumps (e.g. gotos) and function calls can probably be used for similar purposes.

phire10 days ago

Yep. It uses the branch predictor to slice up the instruction stream and feed it to multiple instruction decoders. Because the variable length nature of x86 instructions makes it hard to implement extremely wide instruction decoders. To reach a throughput of 4 instructions per cycle of the typical 4-wide x86 decoder you actually need 16 length decoders running in parallel, one for every byte position. The next pipeline stage then picks the 4 outputs that don't overlap before feeding it into proper decoders.

Intel have actually implemented a 6-wide decoder for Golden Cove, but they must really be pushing the limits of propagation delays. So it's attractive to just use two or three smaller decoders in parallel.
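
A toy model of that "length-decode every byte position, then pick the non-overlapping ones" idea (the length function here is a fake stand-in, not a real x86 length decoder):

  #include <stdio.h>
  #include <stdint.h>

  /* stand-in for a real x86 length decoder -- lengths are just faked from the byte value */
  static int fake_insn_length(uint8_t first_byte) { return 1 + (first_byte % 4); }

  int main(void) {
      uint8_t window[16] = {0x48, 0x89, 0xc7, 0xe8, 0x10, 0x00, 0x00, 0x00,
                            0x31, 0xc0, 0xc3, 0x90, 0x90, 0x90, 0x90, 0x90};
      int len_at[16];
      for (int i = 0; i < 16; i++)          /* "16 length decoders in parallel" */
          len_at[i] = fake_insn_length(window[i]);

      int pos = 0, picked = 0;
      while (picked < 4 && pos < 16) {      /* pick stage: follow the chain of lengths */
          printf("insn %d at offset %d, length %d\n", picked, pos, len_at[pos]);
          pos += len_at[pos];
          picked++;
      }
      return 0;
  }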

> If branches can be used in this way, unconditional jumps (e.g. gotos) and function calls can probably be used for similar purposes.

It's a common misconception that the branch predictor only predicts conditional branches. They need to predict anything that causes control flow to diverge. So unconditional branches, calls, returns, and maybe even syscalls, all count as "branches" and get handled by the branch predictor.

(On most RISC instruction sets, these instruction are often labeled branch. Call is just "Branch And Link" which saves the return address in a register. To return, you just use the "Branch to Register" instruction)

This is necessary because the frontend is pipelined. It takes a minimum of four cycles to fetch the instruction data from instruction cache and decode it before the frontend can even know there is a call, return or unconditional branch instruction. If we are decoding four instruction per cycle, we are already 16 instructions past the branch before we even know it's there, and it's common to have a branch every 3-4 instructions.

Basically, the frontend is blind and can't see any branches.

Predicting the direction of conditional branches is almost a secondary task for modern branch predictors. It's much more important to predict where the branches are and their destination, so that instruction fetch stage can request the correct data from the instruction cache and keep the pipeline fed.

tigrou6 days ago

Thanks for your reply. I learned a lot. For people wondering why you need a lot of decoders with x86, this answer explain it well (see "Decode problem" part) : https://qr.ae/p2IEt9

swatcoder17 days ago

Because it's rare for a branch result to be random. The compiler/runtime/cpu/etc can often guess which result is more likely, and correctly not do the extra work in the first place, and so that's usually the better strategy than spending silicon and heat on the wrong answer just in case.

I think a lot of people don't have an intuition about how accurate branch prediction can be, but if you look at your own code, you'll quickly realize "well, yeah, control flow is almost always going to go this way and we just have this branch so we can handle the exceptional case" -- compilers can often deduce this pretty well themselves now, and cpus/jits/runtimes can develop some pretty impressive heuristics as well, and when all those fail you can often add explicit hints in your code that tell your compiler/etc what you expect if they can't guess.

branko_d17 days ago

> it's rare for a branch result to be random

How rare, though?

QuickSort has fundamentally unpredictable branches, and it’s a pretty widely used algorithm. Binary search, B-trees also come to mind.

kllrnohj17 days ago

Binary searching is quite slow and should be used sparingly, not necessarily because of branch misprediction but because of memory stalls: you're almost always guaranteed to have a cache miss during the search. Similarly for B-trees, it's going to be memory stalls that you're probably more focused on addressing, not branch mispredicts.

barrkel17 days ago

B-trees are cache friendly, spending more time doing linear scans at each level to keep the tree shallow, and thus indirection less frequent. They're designed for high latency indirection - loading pages from spinning rust.

saagarjha15 days ago

Most code with branches is not “algorithms”. (Bear with me.) It’s simple loops and function calls and vtable indirection. Those can be predicted with very high accuracy, and loops in particular by their very nature dominate execution. For every unpredictable branch in your quicksort there are several very predictable branches that do things like check bounds, the recursion base case, and return address.

eyegor17 days ago

Also happens to be why quicksort loses to almost anything else on small arrays, even bubble sort

gpderetta16 days ago

Qsort can be implemented branchless by converting the branch into a data dependency.
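
A minimal sketch of that trick applied to a Lomuto-style partition step (my own toy code, not any particular library's implementation): the swap happens unconditionally and the comparison only feeds an index increment, so the inner loop has no data-dependent branch.

  #include <stdio.h>

  /* branchless Lomuto-style partition: the comparison result becomes a data
     dependency (an index increment) instead of a conditional jump */
  static size_t partition(int *a, size_t n, int pivot) {
      size_t boundary = 0;                  /* a[0..boundary) holds elements < pivot */
      for (size_t i = 0; i < n; i++) {
          int v = a[i];
          a[i] = a[boundary];               /* swap unconditionally... */
          a[boundary] = v;
          boundary += (size_t)(v < pivot);  /* ...and advance only when v belongs left */
      }
      return boundary;
  }

  int main(void) {
      int a[] = {4, 2, 5, 1, 3, 8, 0};
      size_t b = partition(a, 7, 3);
      printf("boundary=%zu:", b);
      for (int i = 0; i < 7; i++) printf(" %d", a[i]);
      printf("\n");
      return 0;
  }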

sjburt16 days ago

In benchmarks, branch predictors guess correctly 90-95% of the time.

jayd1617 days ago

Loops are usually much more one way than the other.

sapiogram17 days ago

Disclaimer: I don't work in this field, just an enthusiast.

As far as I can tell, branch predictors have always been too good for it to be worth it. Moderns CPUs have instruction reorder buffers that are hundreds of instructions deep, so even if only 8 of those instructions are conditional jumps, there's 256 different paths your program could take. If your branch predictor predicts all 8 correctly >50% of the time (It does), doing 256x the work to cover your ass is not worth it.

ithkuil17 days ago

That's called speculative execution and IIRC all modern CPUs are doing it.

It requires more silicon to hold more microarchitectural state and more execution units to fully exploit the technique, but superscalar CPUs already have those since they are essential to exploit instruction level parallelism in non-branchy code. The rest is "just" a lot of headaches to handle complicated stuff such as aliasing, interrupts, ... But hardware engineers are such wizards they can do these things too.

Turns out however that speculative execution opens up a possibility of abusing a cache timing side channel to extract information from data touched by branches of code that has been only speculatively executed but whose architectural side effects were not committed (i.e. not "really" executed).

Which includes code that had been explicitly not executed because of a conditional check (e.g. permissions, ...)

A familiar instance of such an attack is Spectre [1]

1: https://en.m.wikipedia.org/wiki/Spectre_(security_vulnerabil...

MBCook17 days ago

We reached 90% accuracy decades ago. Depending on workload modern chips can do way better.

So basically it’s just nowhere near worth it. Much better to use those chip resources for another thread or core.

wmf17 days ago

It's a huge waste of energy and in some cases it would even be slower because you'd execute more instructions overall. If the branch mispredict rate is around 1% it's simply not worth paying a penalty 99% of the time to get a gain 1% of the time. Maybe it would be worth doing on low-confidence branches.

kolbe17 days ago

The vast vast vast majority of instruction pipelines are sparse. You can pack in non-dependent instructions essentially for free.

gpderetta16 days ago

Divergence. On an OoO pipeline, fetch might be ahead of retire by hundreds of instructions. That's dozens of branches, and if each caused execution to speculatively fork, it wouldn't scale.

You could potentially do it only for predicted "unpredictable" branches. Now the tradeoff is wasted power and execution units for dead work and so far the tradeoff was just not worth it.

Some form of this has been experimented with. In the late '90s SPARC experimented with scouting threads, and as mentioned elsewhere in the thread, the Efficient Intel cores can fetch (but not execute) across branches.

0x000xca0xfe17 days ago

- Side effects; how do you handle two different writes to memory?

- Double the execution units; very expensive for wide vector units

- Massive waste of energy as half the resources will always be wasted no matter what

- Bad scaling, i.e. four branches ahead would require 16x the resources

isotypic17 days ago

Handling two different writes to memory is not really a concern - existing speculative/out of order processors already solve this issue by completing (perform architectural side effects) instructions in-order. So even if two writes are made, one in each branch, by the time the write is meant to be completed, the prior branch is resolved and we know which write is actually meant to be made and the bad one can be discarded.

Doubling the execution units also isn't strictly needed - you can use the existing out-of-order core to send two sets of instructions through the same functional units. There will be more contention for the resources, possibly causing stalls, but you don't need to fully double everything.

Things similar to this idea are already done in processors - simultaneous multithreading, early branch resolution, conditional instructions, are all ideas that have similar implementation difficulties. So the reason this specific idea is not done is more in line with your last two points rather than the first two.

eigenform17 days ago

Most branches are biased one way or the other. "Fetching down both paths" means not exploiting any information you might have gathered about a branch being biased - I think that would be equivalent to randomly predicting the branch (except for it would cost more than a random predictor because you're actually fetching both ways instead of just one).

dymk17 days ago

Transistor count; now you have to duplicate all the decode and speculative execution circuitry for both possible branches

immibis17 days ago

No, the same circuits would execute them interleaved just like they execute multiple hardware threads now

nomel17 days ago

With the SMT core count having to be one less.

bsder17 days ago

> why when we have a conditional branch we cannot just fetch and prepare instructions for both possible branches and then discard the incorrect one?

We actually do that. It's called a GPU. And it sucks for general code.

hnpl16 days ago

I'd love to see the performance data before judging whether it is a good idea. There's no information on the branch prediction penalty of this approach as well.

Anyway, I think the intuition of this approach is to aggressively fetch/decode instructions that might not already be in the L1 instruction cache / micro-op cache. This is important for x86 (and probably RISC-V) because both have variable instruction length, and just by looking at an instruction cache block, the core wouldn't know how to decode the instructions in that block. Both ISAs (x86, RISC-V) require knowing the PC of at least one instruction to start decoding an instruction cache block. So, knowing where the application can jump to 2 blocks ahead helps the core fetch/decode further ahead compared to the current approach.

This approach is comparable to instruction prefetching, though, instruction prefetching does not give the core information about the starting point.

(High-performance ARM cores probably won't suffer from the "finding the starting point" problem because every instruction is 32 bits long, so the decoding procedure can be done in parallel without knowing a starting point.)

This approach likely benefits front-end heavy applications (applications with hot code blocks scatter everywhere in the binary, e.g., cloud workloads). I wonder if there's any performance benefit/hit for other types of applications.

im3w1l17 days ago

I still have no idea what a 2-ahead branch predictor is.

adrian_b16 days ago

You can better start by reading the old research papers linked in the article.

In general, the older research papers suppose that the reader knows much less about such subjects, because they were still much more niche knowledge at that time.

im3w1l16 days ago

Okay, so a "block" is a sequence of instructions that can be executed from top to bottom, i.e. only the last instruction may be a jump.

A 2-ahead branch predictor is a branch predictor that predicts not which block to run next after the current one, but rather, which block should be run after that one.

vegabook17 days ago

Now all it needs is more memory bandwidth, because those two memory channels on the consumer AM5 socket are pathetic given the compute this will deliver, and especially in comparison with even the most basic ASi.

I moved to an M2 Max from a chunky Zen setup and it's a revelation how much the memory bandwidth improvement accelerates intensive data work. Also for heavy-ish multitasking the Zen setup's narrow memory pipe would often choke.

AnthonyMouse16 days ago

There are very few applications that are actually memory bandwidth bound but aren't more suited to a GPU than a CPU.

The reason people have been looking at Apple Silicon for LLMs in particular is that even though they are more suited to GPUs, they also require a lot of VRAM and NVIDIA charges an extortionate amount of money for GPUs with a lot of VRAM.

What AMD should really do if they want to steal NVIDIA's thunder is to sell consumer GPUs with 64-128GB of VRAM.

The_Colonel16 days ago

> What AMD should really do if they want to steal NVIDIA's thunder is to sell consumer GPUs with 64-128GB of VRAM.

For gaming it's overkill (esp. given the price); it would be useful only for LLM enthusiasts, which isn't that big of a market.

AnthonyMouse16 days ago

Then how come so many people are buying GPUs with that amount of VRAM for extraordinary amounts of money?

Nobody said it was for gaming. Though of course consumers could use it for both.

cesarb16 days ago

> because those two memory channels on the consumer AM5 socket

There are actually four memory channels in AM5, because DDR5 doubles the number of channels.

pezezin16 days ago

But each channel is half the width, so we are back at the same point.

yogrish14 days ago

When I began my career as a DSP optimization engineer, I worked on ZSP processors. I had a particular passion for branch prediction. By analyzing data, we would code and recode branch instructions to minimize mispredictions. I loved the superscalar architecture of the ZSP, and its branch prediction feature was amazing, apart from its multi-instruction execution in a cycle.

phkahler17 days ago

>> Now when Zen 5 has two threads active, the decode clusters and the accompanying fetch pipes are statically partitioned.

This sounds like a big boost for hyper threading performance. My Zen1 gets about 25 percent faster due to HT. Has anyone tested the newer ones in this regard?

paulmd17 days ago

High SMT speedups aren’t a good thing, because those are pipeline bubbles - resources that can’t be saturated by the first thread.

The ideal speedup from SMT is 0% because you’d be getting full output for a single thread. Like if you got a 50% speedup from SMT, in an ideal world that means a single thread could run 50% faster than it is, but it’s being stalled by pipeline bubbles.

Remnant4417 days ago

For a fully compute-bound workload, you're certainly correct.

That's rare though. All it takes is a couple stalls waiting on memory where a second thread is able to make progress to make that "ideal speedup" certainly be nonzero.

paulmd17 days ago

Regardless though why would it potentially being higher in newer architectures be viewed as a good thing?

adrian_b16 days ago

Yes, while in my older tests on Intel Skylake derivatives I have also obtained SMT speedups around 25%, on the newer Zen 3 (a 5900X) I have obtained at most a 20% speedup in the most SMT friendly task that I have ever encountered, i.e. the compilation of a big software project (the comparison being done between optimal parameters for SMT disabled vs. SMT enabled, which for a 5900X have been determined to be "make -j13" vs. "make -j24").

An example of a multithreaded benchmark that is not SMT friendly is the GeekBench 6 multithreaded test, where Zen 3 with SMT disabled (12 threads on a 5900X) is slightly faster than with SMT enabled (24 threads on a 5900X).

ColonelPhantom16 days ago

It's worth noting that compilation is a partially serial task (e.g. linking is often largely single-core). It's entirely possible that going from 4 to 8 threads is much more helpful than 12 to 24 threads, as a 24-thread system will have far more idle threads in comparison. (Of course this is assuming 4c8t Skylake, so a normal consumer i7. Skylake-X had more cores.)

adrian_b16 days ago

For big projects, as I have mentioned, the linking phase is at most a few percent of the compilation time.

I have CPUs with 48 threads and for big projects the compilation time decreases monotonically from 1 thread to 48 threads (and almost proportionally with the number of threads until 24 threads, then with a constant smaller slope until 48). The published benchmarks show that for big software projects the compilation times decrease monotonically until 512 threads on a 2-socket MB with 128-core CPUs.

So the compilation of a big software project, like Chromium, Firefox, LLVM, gcc, the Linux kernel, Libreoffice, etc. (all of which have many thousands of files that must be compiled) is one of the tasks that can use efficiently any number of threads that is currently available.

Moreover, there are now linkers that work partially concurrently. Tasks like the relocation of the object files and of the external symbol references can be done completely concurrently for all source files (after the start addresses are known for all object files and the corresponding object files are known for each symbol).

ryukoposting17 days ago

4 years after graduating college, my decision to dive into computer architecture classes has borne no fruit except my ability to loosely understand what writeups like this are talking about. But, I guess that's the point, isn't it? This is fascinating stuff, whether or not I need to know it.

brcmthrowaway17 days ago

Did the paper author get a cut from AMD?

adrian_b16 days ago

They have no reason to give him any cut.

On the contrary, he points out that one of their most important innovations is not that innovative, because it was analyzed in detail almost 30 years ago.

Before AMD, Intel had started to use such multiple decoders in their Atom cores, now rebranded as E-cores (efficient cores). The multiple decoders have been first used in Tremont (2020), then in Gracemont (2021) and Crestmont (2023), and now in Skymont (2024).

It is hard to predict which of the decoders of Skymont (triple 3-instruction decoder) or of Zen 5 (double 4-instruction decoder) is better. AMD has made the choice of the double decoder because their core uses SMT, and a double decoder is easy to be partitioned into two decoders when both simultaneous threads are active.

sholladay16 days ago

Despite being aware of AMD’s Zen chips, for a moment I thought this was about a ZEN player. Good times!

https://en.wikipedia.org/wiki/Creative_Zen