Back

Do not taunt happy fun branch predictor

275 points18 hoursmattkeeter.com
jerf16 hours ago

This is another good example of how our CPUs are in many ways specialized C processors. C is a structured programming language that uses functions, so our processors like functions. If you jump out of that paradigm, even if the assembly instructions nominally seem to allow it, you'll run more slowly. Even when it seems like what you're offering is a shortcut to the CPU.

This is neither praise nor criticism of the current CPU paradigm; it's just something you need to understand if you want the best performance out of our machines.

A different paradigm, like a concatenative-paradigm-based program, might naively be more inclined to compile into code that looks more like what the author tried, jumping between implementations of the stack operators without it actually being "functions". One can imagine processors that would be "happier" with that, and would be bothered by things that look like function returns more. But that's not the CPUs we have.

flohofwoe15 hours ago

It's just a natural side effect of software and hardware forming a symbiosis and that they cannot really be separated. New software is (or should be!) built to run well on existing hardware, and new hardware is built to run existing software well. Besides, subroutine call instructions were a thing long before C became mainstream and manual assembly coding ruled supreme.

brigade13 hours ago

If you want to encode a series of indirect jumps between concatenated functions, you can do that already and the return address stack won’t get involved; you’ll simply get the normal branch predictors.

But generalized branch prediction is expensive (multiple entries in the BTB hashed by the previous program flow, and mispredicts if the next function in the chain changes); the point of an RAS is that it’s a very cheap way to keep some 100% predictable branches from using those general resources. Concatenation pretty much requires branches without unique characteristics, so no fast path shortcuts and everything is a bit slower.

mncharity9 hours ago

> Concatenation pretty much requires branches without unique characteristics, so no fast path

Discussion the other day, of optimizing dispatch overhead in no-jit-allowed lightweight-word vm's, raised a brainstormy idea of having multiple copies of some words, and the compiler juggling them with awareness of their tailcall-next-word varied branch predictor states. Sort of a dynamic version of reducing dispatch by identifying hot phrases and creating multi-word-composite new words for them.

sacnoradhq2 hours ago

General-purpose CPUs began as universal Turing machines for running any sort of mathematic calculations automatically. This was done in binary on punch cards and/or with panel cable settings. Then came textual assemblers to make that easier. Imperative procedural programs were then grafted-on to simplify writing assembly rather than in any sort of assembly language.

Most processors have accumulated functionality beyond minimal instructions for the simplification and acceleration of operating system services, hardware interfacing, cryptography, vector & matrix integer and floating-point math, arbitrary precision math, string processing, virtualization (IO and CPU), secure computing, and virtual memory management; just to name a few. :)

A future era of green field development closer to the technological singularity will look across the software-hardware interface boundaries to optimize both silicon (or superconductors) and compilers to generate fast, small, and/or power efficient designs and code without as many limitations. Generative AI and holistic algorithms with enormous computing power will make this possible. It's almost possible now, it's just a few leaps beyond what EDA and compilers are already doing.

masklinn15 hours ago

Author was not jumping out of the paradigm here, they were deliberately misusing constructs specialised for the paradigm (br/ret). That’s like saying the toolcase is a specialised screw processor because you’re trying to drive nails using a screwdriver and it does not go well.

And C is hardly the first or only procedural langage.

Nevermark5 hours ago

No but C is a better model of most processors assembly, than most other procedural languages.

pjmlp3 hours ago

Only when we are talking about PDP-11.

Thankfully nowadays we have Compiler Explorer supporting almost every flavour of mainstream compiled languages to dispel myths.

+1
flohofwoe2 hours ago
masklinn3 hours ago

Not sure how that is relevant. The GP asserts (and bemoans) the reverse cause-and-effect.

ablob15 hours ago

This has more to do with the ISA than C I'd assume. C was built as an "easy assembler". Furthermore, the computer doesn't really care about structure (in the structured programming sense).

In this case the implementation of the ISA stores information on each 'ret' depending on some 'bl' that came before. One can imagine that a different optimization technique which actually leads to a speedup exists. Without a branch-predictor, the program that Matt wrote might've been faster.

Imo, this has nothing to do with paradigms, but how control-flow interacts with the system. This interaction is different between different implementations of an architecture.

Code written for Cache-locality and paradigms that work well with it, for example, only became a "winner" after caches were widely implemented. Before that, the cost of sequential array access and random array access was identical. With caches, a linear search for an element can be faster than a binary search, even though it requires a lot more memory accessess. Thus, the optimal implementation for finding an element in an array is now dependent on size as well. (I.e. after an ordered array reaches a certain size, binary search should become faster on average).

int_19h14 hours ago

The point is that the ISA is assuming that BL and RET opcodes come in pairs - which is an assumption that does reflect what structured code is typically compiled down to. Going by the semantics of the opcodes themselves, there's no reason why RET should be treated differently from a simple indirect jump here.

zerohp7 hours ago

> Going by the semantics of the opcodes themselves, there's no reason why RET should be treated differently from a simple indirect jump here.

RET is documented as a subroutine return hint in the official ARM architecture reference manual. That's the only reason it has a distinct opcode from BR.

BL is also documented as a subroutine call hint.

masklinn13 hours ago

> there's no reason why RET should be treated differently from a simple indirect jump here.

There’s its very existence. If you’re looking for a general purpose indirect jump, use that.

tsimionescu13 hours ago

As far as I understand, the ISA explicitly intends for BL and RET to come in pairs. The only point of RET is to jump to the address last stored by BL. If you don't need this behavior, there's no reason to use RET - as the article itself shows, B x30 does the exact same job, and doesn't come with the extra assumptions.

+1
int_19h10 hours ago
raverbashing8 hours ago

> Going by the semantics of the opcodes themselves,

This is RISC wishful thinking that makes it sound like BL is not CALL. Maybe in ARM 1 it wasn't, it is now.

classichasclass14 hours ago

I agree, because this semantic difference between br x30 and ret doesn't exist in many other RISCs which also mostly run C-like languages. Power just has blr, and MIPS jr $ra, for example. This feels more like a well-intentioned footgun in the ISA.

shadowofneptune12 hours ago

The potential population for that footgun is compiler writers, who should safely fall into 'know what they are doing' territory.

dmurray9 hours ago

> C is a structured programming language that uses functions, so our processors like functions.

The main alternative paradigms that are ever mooted are LISP machines, which elevate the importance of functions, and maybe stack-based programming languages like Forth, which emphasize the feature of functions seen here (that they keep popping things off the stack).

It's hard to argue that C has led us down to a local minimum because it's just too functional.

pjmlp3 hours ago

> C is a structured programming language that uses functions, so our processors like functions.

So were most of the structured programming languages from 1960's.

PL/I, ESPOL, NEWP, ALGOL, Pascal, ...

prohisto11 hours ago

[flagged]

titzer17 hours ago

> More specifically, the branch predictor probably keeps an internal stack of function return addresses, which is pushed to whenever a bl is executed. When the branch predictor sees a ret coming down the pipeline, it assumes that you're returning to the address associated with the most recent bl (and begins prefetching / speculative execution / whatever), then pops that top address from its internal stack.

There's no need for "probably" here. The micro-architectural mechanism is known as a return stack buffer[1] and is generally separate from the branch predictor unit, though the processor may make use of indirect branch prediction entries for returns as well.

[1] It is, indeed, a tiny little stack of return addresses and indeed, the article hit performance issues by misaligning it. The (Intel chips') RSB is behind the Retbleed vulnerabilities.

ahh10 hours ago

Interestingly, Matt has invented a variant on the retpoline [1] which _intentionally_ missteers the branch predictor to prevent various speculative attacks. (Invented by my former Google manager.). It's pretty cool how much simpler a retpoline would be in aarch64, since we have explicit control over the link register rather than having to play stupid games with stacks.

(Real retpolines have a little more magic, naturally.)

[1]https://stackoverflow.com/questions/48089426/what-is-a-retpo...

FullyFunctional15 hours ago

Well, of course, the Return Address Stack (RAS) predictor maintains its own call stack and you need to understand how it works. However, there's a subtler way to break it: recurse too deeply. The RAS only has a fixed, small, and implementation dependent length. If you use deep recursion with non-trivial control flow (in particular multiple call sites), then the RAS will starting missing once you return from beyond that limit.

Another consequence of the RAS is that co-routines switching is more expensive than they might appear at first. RISC-V has encoding hints to mark call(jal)/returns that are actually co-routine switching but the full cost can't be avoided.

gpderetta12 hours ago

you can mitigate the cost by not 'call'-ing into your coroutine switch function but inlining the code into the surrounding coroutine. As a bonus you get a bit better branch prediction on your yield because distinct yields will share less state.

Of course there is always going to be a penality for stackful coroutines that yield deep into a callstack.

10000truths10 hours ago

Unfortunately, this is very difficult to do above the assembly level because it requires a custom calling convention that doesn’t yet seem to be supported by any systems programming language compiler. You have to use an assembler macro, or pipe the assembler output through sed, to patch the call and ret instructions:

https://stackoverflow.com/questions/43894511

gpderetta10 hours ago

GCC inline assembly can be coerced to do the right thing.

saagarjha8 hours ago

It’s not great, though. Modern C compilers will generally fight you if you try to subvert structured programming.

jefftk16 hours ago

> The SIMD code does come with one asterisk, though: because floating-point addition is not associative, and it performs the summation in a different order, it may not get the same result as straight-line code. In retrospect, this is likely why the compiler doesn't generate SIMD instructions to compute the sum!

What if you set -funsafe-math-optimizations, which allows "optimizations that allow arbitrary reassociations and transformations with no accuracy guarantees"?

Arnavion10 hours ago

Based on the "We can get faster still by trusting the compiler" part of the article, the author's using Rust. It doesn't have a global flag like `-funsafe-math-optimizations` or `-ffast-math` so the change would have to be a bit more involved. They'd have to change their use of `+`, `*` etc operators on f32 to `std::intrinsics::fadd_fast`, `std::intrinsics::fmul_fast`, etc functions.

So, putting it into https://rust.godbolt.org/z/jT6Mb1K13 , it seems to indeed be using the SIMD instructions.

tialaramex9 hours ago

They're asking for sum() on a slice f32s. The sum() function actually works via a Trait for this specific purpose, Sum, so you could go like this...

New Type wrapper for f32 called like FastFloat, marked #[repr(transparent)], and if necessary (not sure) have the compiler promise you you're getting the same in memory representation as an actual f32.

Implement Sum over FastFloat by having it use the faster SIMD intrinsics for this work to give you an answer, accepting the potential loss of accuracy.

Now, unsafely transmute the f32 slice into a FastFloat slice (in principle this is zero instructions, it just satisfies the type checking) and ordinary sum() goes real fast because it's now Sum on the slice of FastFloats.

Arnavion2 hours ago

If you want to go the newtype + Sum impl route, you don't have to make it `#[repr(transparent)]` or transmute the slice. You can just `impl Sum<FastFloat> for f32` and do `f.iter().copied().map(FastFloat).sum()`

https://rust.godbolt.org/z/b9s3dna6r

tialaramex34 minutes ago

Oh, I didn't think of that, clever.

not2b10 hours ago

It's a bad idea to scramble the order of floating point operations unless you can tolerate extreme inaccuracy, and in some applications accurate FP results don't matter, but the non-associativity of FP isn't just a technicality: you can lose all of your significant digits if the order of operations in well-written scientific code is changed.

thomasahle2 hours ago

But if we aren't assuming the original order was particularly nice, we are simple substituting one random (or arbitrary) order for another. No reason to expect it to be any worse or better.

makapuf15 hours ago

You could just turn gcc to 11 and use -Ofast

zokier13 hours ago

Both clang and gcc vectorize if you ask them nicely: https://gcc.godbolt.org/z/xvjY8P4cM

ogogmad17 hours ago

Minor erratum: Floating point addition actually is commutative; it's in fact non-associative.

dekhn16 hours ago

with some significant exceptions, such as NaNs.

recursive16 hours ago

Can you think of some `x` where `x + NaN` is not identical to `NaN + x`? I can't.

jcranmer12 hours ago

It depends on your model of NaN. If you think there is only one NaN value, then the two values return the same result. If there are multiple distinct NaN values (i.e., you distinguish between NaNs with different payloads), then the resulting payload of an arithmetic operation is... not well-specified.

Most hardware architectures (x86, ARM fall into this category) pick the rule that the payload is the Nth operand (usually first, sometimes second) when multiple inputs are NaN. I believe there's some hardware where the lesser of the payloads (converted to an integer) is picked instead. RISC-V dispenses with NaN payload propagation entirely. There is theoretically the ability for hardware to generate a new unique NaN with the hardware address of the instruction into the payload, but I don't believe any hardware actually does it.

Most programming languages do not generally model NaN with greater fidelity than "there is a NaN value" or maybe "there is a distinction between qNaN and sNaN."

MaulingMonkey16 hours ago

If x is another NaN with a different payload, the result is likely a NaN with a payload corresponding to the left side of the addition.

raphlinus15 hours ago

That is correct, here is a playground link which shows that result: https://play.rust-lang.org/?version=stable&mode=debug&editio...

stephencanon13 hours ago

This varies tremendously across architectures and uarches and compilers. The result will be a quiet NaN, and you can’t say anything beyond that.

dekhn16 hours ago

you mean, like 1 + NaN = NaN and NaN + 1 = NaN, but NaN != NaN? (I'm not a numerical expert, just repeating what others have told me)

+1
CountSessine15 hours ago
mkeeter16 hours ago

Thanks, fixed!

lmm12 hours ago

It's not commutative; -0.0 + 0.0 = -0.0 but 0.0 + -0.0 = 0.0.

jcranmer11 hours ago

Um, IEEE 754 specifies that -0.0 + 0.0 is 0.0, not -0.0. It's still commutative.

marcosdumay14 hours ago

Yes, it's commutative. But it is associative.

Make a number with 3 bits of mantissa, and go add a thousand repetitions of 1. You can get hundreds of different results depending on the order you add them up.

wizzwizz413 hours ago

> But it is associative.

You're describing a non-associative operation.

ogogmad12 hours ago
marcosdumay9 hours ago

Ops. Yes, you are right.

snerbles16 hours ago

While I was in undergrad I toyed around with abusing the branch predictor on a few different machines, compiling something like the following with optimizations off - it performs an identical computation regardless of branch outcome:

    void branchLoop(unsigned int condition, unsigned int &sum)
    {
        // put something suitably large here
        unsigned int loopCount = 0x0fffffff;

        unsigned int i;

        // compile with -O0 or this gets optimized away
        for (i = 0; i < loopCount; i++)
            if ((i & condition) == 0)
                sum++;
            else
                sum++;
    }
The Core Duo on my Thinkpad T60 had some very distinct slowdowns on certain bit patterns, which were not repeatable on the handful of other CPUs I had access to at the time. I haven't tried this with more modern CPUs, however.
gpderetta12 hours ago

Predictors are getting better and better at recognizing long patterns (sometime at the cost of not being optimal with short patterns).

sacnoradhq2 hours ago

After about the Pentium-/Pentium Pro-era, hand-coded assembly generally is premature optimization (and wasted effort).

Once upon a time(tm), you could write self-modifying code or guess at keeping pipelines occupied by manual instruction reordering, but cache line invalidation and OOOE make these moot.

The problem is that with a pipeline stall (wrong branch predicted) in hyper-deep pipelines, the penalty is enormous: waiting for the other condition calculation to percolate through the pipeline or independent stages.

Processors are optimized for the mainstream, usually the current or last generation of compilers when the processors were designed.

To generate the generally fastest bitcode, it would require an incremental JIT with history that can permute and mutate bitcode from runtime metrics. That's beyond HotSpot(tm), LLVM, or anything of the sort.

iamsmooney16 hours ago

Title is a reference to this old SNL skit: https://www.youtube.com/watch?v=GmqeZl8OI2M

Agentlien2 hours ago

It is actually linked from the article. In the sentence "In conclusion, do not taunt happy fun branch predictor with asymmetric usage of bl and ret instructions." the words "do not taunt happy fun branch predictor" are a link to the video on YouTube.

sophacles14 hours ago

It's definitely a classic - if you haven't seen it I'd recommend it even if it wasn't related to the article :D

bioint781215 hours ago
efitz15 hours ago

I think that the days where hand tuned assembly language outperform compiler generated code are largely behind us (let loose the contrary anecdotes).

Compilers and microprocessors are way more complex than they were back in the 80s or 90s, and compiler engineers know way more about how instructions are actually executed than the vast majority of programmers.

jcalvinowens14 hours ago

I think about it like this:

If I ask you to write assembler faster than what the compiler emits for one very specific CPU model, you'll pretty much always be able to do that (given enough time).

But as CPUs become more complex, and compilers get better, achieving that win requires more and more "overfitting" to implementation details of the specific hardware, in ways that are not helpful or even counterproductive on older or newer models of the same CPU family.

It's still worth it sometimes, of course. It's just more work and a lower value proposition on average.

makapuf15 hours ago

... and yet in this article, the author beats by 10x a C compiled code with hand tuned assembly. (By using SIMD and unrolling, which the compiler did not. Granted linear compiler code is faster than hand made linear assembly)

zokier13 hours ago

Author beats compiler because floating point constraints, with -ffast-math compiler vectorizes the code.. I don't have arm64 hardware to test the result, but its probably again pretty fast: https://gcc.godbolt.org/z/xvjY8P4cM

ravi-delia14 hours ago

The only exception (pointed out in this post) is leveraging SIMD instructions, which aren't nearly as well exploited by compilers. I doubt, of course, that this will last all that long, but for now there are too many subtle differences that might matter and totally different instruction sets between cpus and even generations.

Agentlien2 hours ago

I feel like I've been hearing complaints that compilers aren't leveraging SIMD instructions for years. I wonder when this will be solved.

wolf550e14 hours ago

compiler autovectorization is poor, people often outperform it when writing SIMD code. intrinsics are so low level they might as well be assembly.

JonChesterfield14 hours ago

I've had really good results from the two in LLVM (one works on loops, one within basic blocks). Optimal loop body when the pointers had alignment metadata attached, though at the time it failed to unroll the tail.

Using intrinsics with the control flow in C or C++ works really well - you get the right instruction selection from the intrinsics, easy to reason about control flow and the compiler deals with register allocation (and possibly instruction scheduling) which are a pain to do by hand.

userbinator9 hours ago

If you're optimising for size, you can still very easily beat the compiler.

rowanG07713 hours ago

I used to think that was true. Then I had a crypto course in uni which required us to write and optimize 3 different hashing and encryption algorithms.

I was stunned by the first once, which I first did in C and then moved to ASM. The C code was pretty straightforward. But it was trivial to beat in ASM by something like 120%.

That course taught me how bad compilers(Or rather GCC) are at high-level register optimization and using the barrel shifter. Basically all the performance I squeezed out of ASM was because the compiler just wasn't figuring basic stuff out. Mind you I was(am) not an ASM expert. That piece of code was the first thing I ever write in ASM. And yet I was able to easily beat a decades old world class compiler.

nobody999910 hours ago

Thank you for posting this.

Somewhat OT, but I miss Phil Hartman[0][1][2]. You may remember him[3] from shows like Saturday Night Live, The Simpsons and "Planet of the Apes."[4]

[0] https://en.wikipedia.org/wiki/Phil_Hartman

[1] https://en.wikipedia.org/wiki/Happy_Fun_Ball

[2] https://www.youtube.com/watch?v=GmqeZl8OI2M

[3] https://screenrant.com/the-simpsons-funniest-troy-mcclure-qu...

[4] https://www.youtube.com/watch?v=yOeUXEpxzcc

londons_explore16 hours ago

Observation: Almost any code, when micro-optimized, can gain about 10x performance.

So, if we had the time and energy, we could probably make all of computing at least 10x faster.

But we don't have the time or energy to dedicate that much effort to every line of code... But perhaps AI does?

celeritascelery16 hours ago

I don’t think that is generally true. He only got a large speedup because he used SIMD, which has nothing to do with micro optimization. I would say a better take away is that micro optimization is really hard and you will often make things worse if you don’t know what you are doing. Even if you do, you are only going to get a few percentage points.

thethirdone15 hours ago

My experience micro optimizing things is that even without SIMD, most software can get at least a 5x in performance. With SIMD, you can often get 50x improvements.

The reason why people thing "Even if you do, you are only going to get a few percentage points." is because it generally takes 5-50x the developer time to optimize such code. If it takes half a day to write naive code to do something like validate utf8, it probably takes ~25 workdays to make a fast SIMD version. If you instead spend an extra half a day, there is a good chance you get a 10-50% speedup using normal code.

mumumu15 hours ago

This is true on a few streaming application (such as parsing).

And most of the speedup is because of tricks to avoid doing beaches. There is a great blog post from one of the authors of JSON SIMD discussing this.

I'm on mobile, there is a link for the blog post on the simd JSON github repository.

mumumu7 hours ago

*avoid branches

The blog post I mentioned:

Paper: Parsing Gigabytes of JSON per Second https://branchfree.org/2019/02/25/paper-parsing-gigabytes-of...

Another related post from Lemire:

Ridiculously fast unicode (UTF-8) validation https://lemire.me/blog/2020/10/20/ridiculously-fast-unicode-...

Those algorithms are fast. But to put them in perspective. A single x86 CPU can write 64B per cycle. At 5GHz, the theorical maximum bandwidth is 320 GBps. IIRC, the read bandwidth is twice that.

There are others botlenecks, and is very hard to write code that writes at every cycle.

A interesting consequence, is that the theorical maximum bandwidth is logarithmical to the number of cycles. Again, talking about branchless streaming application.

bee_rider13 hours ago

It also can we weird and non-obvious. For example depending on the instruction mix and hardware it might not be worth getting dinged by AVX-512 clocks. And if you are, say, writing the UTF validation code as a library (more favorable to put effort into a library!) you might not know where the code is being used, so you might not even know the instruction mix…

shoo15 hours ago

> Almost any code, when micro-optimized, can gain about 10x performance.

well, it depends on where you start from. the speedups can become arbitrarily impressive sounding when the starting point is arbitrarily inefficient.

e.g. if you're starting with a python script that wasn't written with performance in mind and has ended up being compute-bound in pure python number crunching code, if you rewrite the thing in C while thinking a little about appropriate data structures and memory allocation/access -- i.e. replacing a festival of python dict lookups and very frequent memory allocation for tiny intermediate results with indexing into preallocated arrays or so on -- it's quite common to see speedups of 500x - 1000x. this is before micro-optimisation, before introducing SIMD or multi-threading or setting the compiler to build for your exact CPU or so on.

Pet_Ant15 hours ago

Whenever you read/hear/think "AI" replace it with "a probabilistic process". If you can validate the output then it's really a beam search ( https://en.wikipedia.org/wiki/Beam_search ) of the solution space, and not really "AI".

If you can't, then it's really a crap shoot as to whether the output is at all valid. If we want to use more opaque processes, I think we need more transparent outputs. If a neural net can produce a machine checkable proof or supply it with the optimisation that's great, but otherwise it's just hot air.

david2ndaccount15 hours ago

He didn’t gain 10x from a micro optimization, he gained that much by converting it to use SIMD which is a macro optimization. You usually have to structure your program in such a way to be SIMD friendly. In this case it was already simd-friendly (adding a large number of floats).

Jensson14 hours ago

> In this case it was already simd-friendly (adding a large number of floats).

So it was micro optimization afterall!

lmm12 hours ago

It was a micro optimization because it was a toy example. Doing floating-point arithmetic (and not already using BLAS etc.) is very niche in real code.

+1
Jensson12 hours ago
fluoridation16 hours ago

Not true by a long shot, unfortunately. Getting even a 100% performance gain out of manual optimization is unusual. Usually the only way to get significant gains is by either switching algorithms or by relaxing problem requirements.

cozzyd11 hours ago

It... depends. Often things like changing memory layout, intelligent prefetching, etc. can make a pretty big difference.

water-your-self16 hours ago

If an HN reader wanted to play around with similar digging, what would be the essential tools to be aware of and where best could he start?

Assuming prior knowledge of assembly/C but without much experience decompiling or testing speed.

shoo15 hours ago

Learn how to use a decent profiler. if you're running linux, that's probably perf:

https://man7.org/linux/man-pages/man1/perf.1.html

https://www.brendangregg.com/perf.html

Here's a fun article from the cloudflare blog that gives an example of using of perf to diagnose performance of a small utility: https://blog.cloudflare.com/when-bloom-filters-dont-bloom/

Matt Godbolt's compiler explorer is also worth checking out: https://godbolt.org/

dcow15 hours ago

Your compiler can spit out assembly, you just need to know how to read it. Sounds like the author was also using Xcode Instruments https://help.apple.com/instruments/mac/current/#/dev7b09c84f... to check cpu counters. And they were using criterion https://crates.io/crates/criterion to microbenchmark.

My guess would be that the author is porting some C code to Rust and making sure not to regress performance along the way (probably hopefully trying to increase it). Likely their program was written in Rust and the section they were trying to optimize called some old c code. Sounds like they rewrote the section in Rust since Rust <-> C ffi calls break out of the happy realm the rust compiler likes and end up causing a performance hit themselves. You can write inline assembly in Rust using the macro https://doc.rust-lang.org/reference/inline-assembly.html.

gpderetta16 hours ago

Yes, never push and ret. Here is something I wrote (/me checks calendar) more than 15 years ago about optimizing coroutine control flow: https://www.crystalclearsoftware.com/soc/coroutine/coroutine...

xKingfisher9 hours ago

An interesting application of this is massaging the branch predictor using tail calls to speed up a parser/interpreter:

https://blog.reverberate.org/2021/04/21/musttail-efficient-i...

saagarjha8 hours ago

The wins from tail calls are generally more from being able to skip the overhead that comes from a function call rather than better branch prediction.

xKingfisher7 hours ago

It's mentioned in passing at the end of the "the trouble with interpreter loops" section.

A traditional switch/goto loop can thrash the branch predictor. Separating into different tail calling functions gives you more slots and allows the branch predictor to learn relationships between ops.

Not to discount the many other benefits of tail calls.

*Edit: I misspoke slightly, computed gotos can also split the patch jump, but less reliably[0].

[0]https://gcc.gnu.org/pipermail/gcc/2021-April/235891.html

error50310 hours ago

Interesting.

I thought it would be interesting to compare the behaviour of (very) different AArch64 processors on this code.

I ran your code on an Oracle Cloud Ampere Altra A1:

  sum_slice                  time:   [677.45 ns 684.25 ns 695.67 ns]
  sum_ptr                    time:   [689.11 ns 689.42 ns 689.81 ns]
  sum_ptr_asm_matched        time:   [1.3773 µs 1.3787 µs 1.3806 µs]
  sum_ptr_asm_mismatched     time:   [1.0405 µs 1.0421 µs 1.0441 µs]
  sum_ptr_asm_mismatched_br  time:   [699.79 ns 700.38 ns 701.02 ns]
  sum_ptr_asm_branch         time:   [695.80 ns 696.61 ns 697.56 ns]
  sum_ptr_asm_simd           time:   [131.28 ns 131.42 ns 131.59 ns]
It looks like there's no penalty on this processor, though I would be surprised if it does not have a branch predictor / return stack tracking at all. In general there's less variance here than the M1. The SIMD version is indeed much faster, but by a smaller factor.

And on the relatively (very) slow Rockchip RK3399 on OrangePi 4 LTS (1.8GHz Cortex-A72):

  sum_slice                  time:   [1.7149 µs 1.7149 µs 1.7149 µs]
  sum_ptr                    time:   [1.7165 µs 1.7165 µs 1.7166 µs]
  sum_ptr_asm_matched        time:   [3.4290 µs 3.4291 µs 3.4292 µs]
  sum_ptr_asm_mismatched     time:   [1.7284 µs 1.7294 µs 1.7304 µs]
  sum_ptr_asm_mismatched_br  time:   [1.7384 µs 1.7441 µs 1.7519 µs]
  sum_ptr_asm_branch         time:   [1.7777 µs 1.7980 µs 1.8202 µs]
  sum_ptr_asm_simd           time:   [421.93 ns 422.63 ns 423.30 ns]
Similar to the Ampere processor, but here we pay much more for the extra instructions to create matching pairs. Interesting here that the mismatched branching is faster than the single branch.

I guess absolute numbers are not too meaningful here, but a bit interesting that Ampere Altra is also the fastest of the 3 except in SIMD where M1 wins. I would have expected that with 80 of these cores on die they'd be more power constrained than M1, but I guess not.

Edit: I took the liberty of allowing LLVM to do the SIMD vectorization rather than OP's hand-built code (using the fadd_fast intrinsic and fold() instead of sum()). It is considerably faster still:

Ampere Altra:

  sum_slice               time:   [86.382 ns 86.515 ns 86.715 ns]
RK3399:

  sum_slice               time:   [306.94 ns 306.94 ns 306.95 ns]
sweetjuly7 hours ago

>It looks like there's no penalty on this processor, though I would be surprised if it does not have a branch predictor / return stack tracking at all

One potential explanation is that a lot of processors have "meta" predictors. That is, they have multiple distinct predictors that they then use a second level predictor to decide when and how they use it. This is really useful because some predictors perform very well in certain cases but very poorly in others. Therefore, what may be happening is that the RAS is getting overridden by another structure since the predictor detects that the RAS is frequently wrong but another predictor is frequently right.

sakras8 hours ago

If you’re only heavily using one of the cores, that core is free to use a lot more power, and can probably push its clock speed much higher than if this were an all-core workload. So I’d actually expect the opposite, that the ampere would be allowed to use a lot more power than the M1 (since it’s not a laptop).

LoganDark6 hours ago

I love how "rewrite it in Rust" is an actual thing they tried, and it actually performed pretty well given the circumstances.

pranith15 hours ago

Great investigative work! The stack structure you refer to here is called the Return Address Stack (RAS).

Malic17 hours ago

Ow. My head hurts.

And this is why optimizing compilers are some of the most complex programs there are. (or so I have been taught)

shadowgovt16 hours ago

It's also why the modern rule of thumb is "don't optimize by writing your own assembly."

The rule is a boiled-down version of the larger notion "Don't optimize by writing your own assembly, because even with domain knowledge of the problem you're trying to solve, you're probably not more clever than the engineer-decades that went into building your compiler toolchain and processor architecture, unless you're an expert in both fields, in which case good luck and shoulder that maintenance burden."

The rule of thumb drops a lot of detail on the ground but is a good first approximation.

pjdesno15 hours ago

Maybe better expressed as "don't write your own assembly unless you know why it might be better than the compiler".

Trying to beat the compiler at optimizing normal code is kind of like trying to beat a calculator with paper and pencil - computers are just better at that sort of thing than people are.

One use case is where you want to use bizarre CPU functions (popcount, encryption, load CR3, etc.) that no one's taught the compiler how to generate, although for some of them you might be better off using compiler intrinsics.

Another is when you're dealing with things underneath the language abstraction, like the Boost co-routines mentioned in a link a few comments above. Of course, if the language folks decide to add the capability you want (e.g. C++20 coroutines), you're back to being better off using the compiler.

Finally there are pedagogical reasons, e.g. sometimes I show my classes the world's simplest "Hello World", using a few lines of assembler to invoke the write and exit syscalls.

JonChesterfield14 hours ago

The boost coroutines mentioned are not the same thing as the C++20 ones. Boost captures the stack, that's what the register shuffling is for. C++ is a compiler transform that moves some state onto the heap, but probably not the whole stack, and builds a switch dispatch style thing to transform the control flow. This is why C++ comes with co_* annotations and coroutines don't.

RodgerTheGreat16 hours ago

I would just phrase it as "if you optimize by writing your own assembly, don't expect your program or its performance to be portable."

avgcorrection12 hours ago

I will never understand the trend of “using quotes around things”. Which is a shorthand version for “using quotes around things that I wanted to point to and say, hey, this is something that “goes over here”, you know, inside these quotes, to make sure that you understand exactly what I’m delimiting, since using commas, and semicolons, and colons wouldn’t fit for some reason. Oh look the thing that I’m quoting now consumes 80% of this paragraph. But this is way better than just saying “the modern rule of thumb is to not optimize by writing your own assembly.” Because then it isn’t 100% clear that the rule of thumb is delimited by (exactly) “[do] not optimize by writing your own assembly.” Ya see?

astrobe_16 hours ago

... Which is kind of worrying; is it really a good thing that processors are so complex that you need "decades" to use them to fully. Bottom line, you end up with chaotic (in the "sensitive to the slightest change") performance behavior.

OTOH this reminds of another saying, "don't roll your own crypto". But all those "don't" are a bit frustrating.

miloignis16 hours ago

I've seen people get frustrated by the "don't"s before, but I think that's generally taking the first degree approximation too literally. Feel free to hand-write assembly or roll your own crypto, but don't depend on it for anything serious unless you are an expert or have it reviewed by one. Doing so for learning and fun is fine, if that's clearly called out such that no one accidentally depends on it for something serious. There's only one way to become good at something, and that's good practice!

In a professional setting there's a responsibility to the end user which generally precludes doing these dangerous things - that is, one should feel free to take up woodworking as a hobby but shouldn't offer to build someone's house unless you're a licensed professional.

olliej15 hours ago

But you don't need decades of experience: we have compilers and optimizers to do that.

+1
bioint781213 hours ago
magicalhippo11 hours ago

Another is that your assembly will never target newer CPUs, but the compiler will.

I've gained a lot of performance by replacing handwritten assembly functions with plain code versions, just because CPUs and compilers have evolved over the last 15-20 years, while that assembly code is what it is.

JonChesterfield14 hours ago

There's a sense in which they're complicated. It's a sequence of graph transforms which mostly deal with non-polynomial time problems using heuristics, where mistakes in the transforms can manifest quite a long way away from the error. There's a significant risk that they're implemented in languages unique to that compiler toolchain, as compiler devs are quite prone to solving problems by writing compilers.

There's also a sense in which they're really simple. The input format and output format are (usually) well defined. The user interface is largely printing things to stderr and giving up, possibly in a retry loop when there's an editor involved. The program dependency graph is often quite small so the bug you're looking at is probably in the source code you checked out. Security is not the dominant concern you have elsewhere.

astrange8 hours ago

Optimizing compilers don't model things like branch prediction well, and aren't great at autovectorizing either. They work just well enough.

In general I think they aren't that complicated since they have a pass structure that's relatively easy to inspect and helps avoid spaghetti code.

bob102913 hours ago

The state of modern compilers is pretty staggering to me. I've seen some code folding in RyuJIT that makes me feel inferior as a developer.

You've got a few compilers (Java, .NET, et. al.) which are capable of re-compiling hot path code during live execution and then seamlessly transitioning to those paths. This recompilation can be based upon the statistics of the live process, so it's almost like a sort of adaptive AI. Which paths are hot in production does not need to be known at compile time with these approaches.

CalChris15 hours ago

It took me a while to understand the mismatched bl/ret pairs because I'm used to reading matched bl/ret pairs. This confusion has to be similar to what the silicon is 'thinking': Human, I see matched bl/ret pairs all the time and I'm good at them. Why are you giving me mismatched pairs? I'll do the right thing but I'm not so good at them.

Still, this seems like function inlining. But why not just inline and use a regular branch loop? Is foo() also being called from elsewhere? Is space at a premium?

masklinn13 hours ago

> Upon seeing this program, it's a common reaction to ask "why is foo a subroutine at all?"

> The answer is "because this is a didactic example, not code that's trying to go as fast as possible".

ShroudedNight13 hours ago

I feel like this treatment is incomplete without having tested the scenario where the unmatched ret is replaced with a br lr.

EDIT: Reading the documentation after the fact, it appears that that was what br x30 was - naively I had interpreted the hex as a fixed offset to a label.

sylware14 hours ago

I write assembly mainly _not_ because it is faster, but because I don't depend on an absurdely complex and massive compiler.

packetlost14 hours ago

So... how does that work? What sort of work do you do that you have the time to write raw ASM and still be productive? I'm asking in earnest, because I'm curious what sort of workflows still allow for writing ASM directly outside of very specific cases (such as initializing embedded MCUs, for example)

Am4TIfIsER0ppos11 hours ago

any audio or video work

You don't want to be forced to trick the compiler into using the SIMD instructions you are aware of so you write an assembly function.