Sometimes, it is a compiler bug: inding and fixing a bug in a C++ toolchain

lisper • 2 years ago

I found a compiler bug once, about 30 years ago. It took nearly a year. This was Lisp code (the T 3.1 compiler to be exact) that ran on an embedded system to control a robot. It was a 680x0 processor running vxWorks. The robot also had an arm (i.e. a robot arm, not an ARM processor - those didn't exist yet). The symptom was an intermittent crash that happened only when the arm was moving. Forensic analysis revealed a badly corrupted heap, so the actual bug was well upstream of the crash.

The problem turned out to be a confluence of two circumstances. First there was the compiler bug, which inverted the order of two instructions when popping values off the stack. It decremented the stack pointer and then pulled a now-unprotected value off the stack into a register. Second, vxWorks did not have a dedicated stack for the operating system, so when an interrupt happened it would be processed using the same stack as the process being interrupted. When an interrupt happened right on the boundary between those two mis-ordered instructions, it would overwrite the temporarily unprotected value still sitting on the stack above the stack pointer.
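A toy model of that race, as a sketch only: the real thing was 680x0 machine code emitted by the T compiler, and every name below (`SharedStack`, `interrupt_handler`, `pop_correct`, `pop_buggy`) is made up for illustration. The stack is modeled abstractly as "live slots below `top`", without trying to match the 680x0's actual stack direction.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Toy model of a stack shared by a task and its interrupt handler.
// Slots at indices below `top` are live; anything at or above `top` is
// free and may be overwritten at any time.
struct SharedStack {
    std::array<std::uint32_t, 16> slots{};
    std::size_t top = 0;

    void push(std::uint32_t value) { slots[top++] = value; }
};

// Hypothetical interrupt handler: it borrows the interrupted task's stack
// for one scratch value and leaves the stack pointer where it found it.
inline void interrupt_handler(SharedStack& stack) {
    stack.push(0xDEADBEEF);
    --stack.top;
}

// Correct order: read the value while its slot is still live, then release.
inline std::uint32_t pop_correct(SharedStack& stack, bool interrupt_fires) {
    std::uint32_t value = stack.slots[stack.top - 1];
    if (interrupt_fires) interrupt_handler(stack);  // scribbles above the live slot: harmless
    --stack.top;
    return value;
}

// Buggy order: release the slot first, then read it.
inline std::uint32_t pop_buggy(SharedStack& stack, bool interrupt_fires) {
    --stack.top;                                    // the slot is now unprotected
    if (interrupt_fires) interrupt_handler(stack);  // overwrites the value we still need
    return stack.slots[stack.top];                  // reads whatever is there now
}
```

In this model `pop_buggy` returns the right value whenever no interrupt lands between its two steps, which is consistent with the crash being intermittent.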

It took us several months to find a way to reliably reproduce the problem, and another couple of months of single-stepping through machine code to finally figure it all out. Good times.

Tyr42 • 2 years ago

Ouch.

I remember I had an OS bug in my realtime class where I didn't save the ARM core's flags that get set by the compare operations.

In arm, it's two instructions to do something like `if (x > 3) { blah; }`. First do a subtract of 3 and x, and this will set the NZCV flags. If the result of `3-x` is negative, the N flag will be set.

Then do a branch if N instruction afterwards.

https://developer.arm.com/documentation/dui0801/g/Condition-...

So if there was a context switch / interrupt between the two steps, my code would take the wrong arm of the `if`. So I'd get things like

    if (x > 0) {
      y = 5 / x;
    }

throwing a divide by zero error, which was super confusing. And I could work around it by doing

    if (x > 0) {
      if (x > 0) { // For real this time
        y = 5 / x;
      }
    }

since it was very unlikely to happen twice in a row.
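Roughly what was going on, as a simulation rather than real ARM code. The `Cpu` struct, `other_task`, and both `context_switch_*` functions are invented for this sketch, and a single bool stands in for the N flag:

```cpp
#include <cassert>

// Toy CPU with one condition flag, standing in for the N bit.
struct Cpu {
    bool n_flag = false;
};

// Models the compare step: computes lhs - rhs and sets N if it is negative.
// (Inputs are assumed small enough that the subtraction cannot overflow.)
inline void compare(Cpu& cpu, int lhs, int rhs) {
    cpu.n_flag = (lhs - rhs) < 0;
}

// Hypothetical other task that runs during the switch and does its own compare.
inline void other_task(Cpu& cpu) {
    compare(cpu, 0, 1);  // 0 - 1 is negative, so N ends up set
}

// Buggy switch: the other task's flags leak back into the interrupted task.
inline void context_switch_buggy(Cpu& cpu) {
    other_task(cpu);
}

// Fixed switch: save the flags before running anything else, restore after.
inline void context_switch_fixed(Cpu& cpu) {
    bool saved = cpu.n_flag;
    other_task(cpu);
    cpu.n_flag = saved;
}

// Returns true if the body of `if (x > 0)` would run.
inline bool takes_branch(int x, bool preempted, bool save_flags) {
    Cpu cpu;
    compare(cpu, 0, x);  // step 1: 0 - x, N is set exactly when x > 0
    if (preempted) {     // a switch lands between the two steps
        if (save_flags) context_switch_fixed(cpu);
        else            context_switch_buggy(cpu);
    }
    return cpu.n_flag;   // step 2: branch if N
}
```

With `x == 0`, `takes_branch(0, true, false)` comes back true, which is the divide by zero case.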

Eventually I fixed it, but that was fun.

lisper • 2 years ago

I don't understand. Are you saying that interrupts clobber the processor flags on ARM cores? That seems improbable. They'd be crashing all the time.

zen_1 • 2 years ago

I'm guessing (since this is what my RTOS did) that interrupts were used to preempt the running task, and the OS would need to explicitly save and restore CPSR/SPSR (the arm32 processor flags) for the interrupted process, since odds are another process would be scheduled to run next, with its own execution state and CPSR.

Tyr42 • 2 years ago

I didn't issue the instructions to save and restore the flags register, which was the bug.

It's possible to get it right.

lisper • 2 years ago

> I didn't issue the instructions to save and restore the flags register, which was the bug.

So... you were writing your own interrupt handler then, yes?

zen_1 • 2 years ago

Was this CS452 at uWaterloo by any chance? If so, then I encountered the same bug last term

dblohm7 • 2 years ago

(CS452 vet here, Fall 2004)

Am I reading this right, you're using ARM now? We were using a Pentium PC (and that was already old for 2004, but good enough for the course, obviously).

zen_1 • 2 years ago

Yes, we're currently using ARM32, and I'm actually working on porting the course to the Raspberry Pi 4.

pfdietz • 2 years ago

I've found hundreds of compiler bugs over the years. It's not that hard to find compiler bugs, especially in lesser used compilers. Finding them in the code you're writing "in the wild", rather than just for compiler testing, is harder, as the bugs that show up in that sort of code tend to quickly be fixed.

The way to find these bugs is systematic randomized testing, where programs are generated by various random processes, and the compiler tested on them. A bug is found either when the compiler crashes, or the code does something different from similar code (or on a different compiler) that should otherwise perform the same computation. If your compiler has not been subjected to this kind of testing it will inevitably have bugs that this testing will reveal, and possibly a large number of them.
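A minimal sketch of the idea, scaled down to a toy: the "compiler" here is a made-up translator from expression trees to a little stack machine, checked against a tree-walking evaluator as the reference. None of this is how any real tester is written; the names and the planted bug (`inject_bug`) are just for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <random>
#include <utility>
#include <vector>

// A tiny expression language: constants combined with +, - and *.
struct Expr {
    char op = 'k';            // 'k' for a constant, otherwise '+', '-' or '*'
    std::uint32_t value = 0;  // used only when op == 'k'
    std::unique_ptr<Expr> lhs, rhs;
};

// Random program generator.
inline std::unique_ptr<Expr> generate(std::mt19937& rng, int depth) {
    auto node = std::make_unique<Expr>();
    if (depth == 0 || rng() % 3 == 0) {
        node->value = static_cast<std::uint32_t>(rng() % 100);
        return node;
    }
    node->op = "+-*"[rng() % 3];
    node->lhs = generate(rng, depth - 1);
    node->rhs = generate(rng, depth - 1);
    return node;
}

// Unsigned arithmetic, so overflow wraps instead of being undefined.
inline std::uint32_t apply(char op, std::uint32_t a, std::uint32_t b) {
    switch (op) {
        case '+': return a + b;
        case '-': return a - b;
        default:  return a * b;
    }
}

// Reference implementation: walk the tree directly.
inline std::uint32_t eval_tree(const Expr& e) {
    if (e.op == 'k') return e.value;
    return apply(e.op, eval_tree(*e.lhs), eval_tree(*e.rhs));
}

// Implementation under test: compile to postfix code, then run it.
struct Instr { char op; std::uint32_t value; };

inline void compile(const Expr& e, std::vector<Instr>& code) {
    if (e.op == 'k') { code.push_back({'k', e.value}); return; }
    compile(*e.lhs, code);
    compile(*e.rhs, code);
    code.push_back({e.op, 0});
}

inline std::uint32_t run(const std::vector<Instr>& code, bool inject_bug) {
    std::vector<std::uint32_t> stack;
    for (const Instr& in : code) {
        if (in.op == 'k') { stack.push_back(in.value); continue; }
        std::uint32_t b = stack.back(); stack.pop_back();
        std::uint32_t a = stack.back(); stack.pop_back();
        if (inject_bug && in.op == '-') std::swap(a, b);  // planted bug: operands swapped
        stack.push_back(apply(in.op, a, b));
    }
    return stack.back();
}

// Differential test: count random programs where the two implementations disagree.
inline int count_mismatches(int trials, bool inject_bug) {
    std::mt19937 rng(12345);
    int mismatches = 0;
    for (int i = 0; i < trials; ++i) {
        auto program = generate(rng, 4);
        std::vector<Instr> code;
        compile(*program, code);
        if (run(code, inject_bug) != eval_tree(*program)) ++mismatches;
    }
    return mismatches;
}
```

`count_mismatches(1000, false)` should report no disagreements, while turning on the planted bug should make it report some. A real tester would also shrink each failing program down to a small reproducer before filing it.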

In the compiler I most often test this way (the compiler in the Common Lisp implementation SBCL) it will commonly quickly reveal if new commits (at least, changes that hit the compiler) have bugs. Very handy.

klik99 • 2 years ago

In my 25 years working with c++ I’ve encountered just one compiler bug - msvc hard crash no error - found that reversing the order of two int declarations fixed it. Was quite a while ago. It’s insane how rare bugs are in these incredibly complicated programs. Annoyances? Oh sure there’s a ton, but straight up bugs are so close to non existent it’s safer to say “it’s never a compiler bug”

FartyMcFarter • 2 years ago

I've seen quite a few compiler bugs, but they're almost always in new features rather than the stuff everyone's been using for years.

As examples, I've seen intrinsics for new x86 instructions being broken in both GCC and Intel's compiler, as well as Profile-guided optimizations crashing in MSVC with a large program (I reported this to MS but I'm not sure they ever got around to looking at it).

The lesson I took from this is to not use the latest and greatest features except in toy projects.

DylanSp • 2 years ago

Similar experience here, the one compiler bug I've found (a Typescript type-checking bug) was in a feature that had just been released.

benibela • 2 years ago

The mainstream compilers are rather robust

I use a niche language.

It feels like I find a compiler bug there every month.

pfdietz • 2 years ago

I feel that niche language implementations need to be more rigorously tested, because they have fewer users testing them "for free" (not that that would be sufficient.)

maccard • 2 years ago

I've been on the unfortunate end of compiler bugs on multiple occasions - ICEs, invalid codegen and invalid optimisations. It still takes a lot to convince me it's a compiler bug, but they seem to be like Pringles - once you start you can't stop.

yakubin • 2 years ago

I think it's more productive not to assume a religious stance on whether the compiler has bugs one way or the other. If there is something wrong, better to read the language spec, the ABI spec and the machine code the compiler generated and judge for yourself. People who, given an extensive description of what the compiler did and why it's wrong, get back at you with the religious "it's never a compiler bug" just make me not want to talk to them.

3836293648 • 2 years ago

I encountered two or three compiler bugs within weeks of starting to (seriously) work with C++. And one of them was as simple as enabling lto caused GCC to segfault.

RcouF1uZ4gsC • 2 years ago

> Conclusion: The problem was specific to Windows x64 builds.

> Then I tried a Release build with GCC (MinGW), and I successfully reproduced the bug.

GCC (MinGW) is probably the least battle tested of the major C++ compilers on Windows. The vast majority of commercial software on Windows is built with MSVC. Google builds Chrome with Clang. GCC on Windows doesn’t have nearly the same number of (paid) eyeballs looking for problems.

The chances of finding compiler bugs go up if you use less widely used compilers on a platform.

boris • 2 years ago

I think this is the correct line of reasoning but I disagree with the conclusion: yes, MSVC is (unfortunately) most battle-tested where codegen is concerned. But next I believe is MinGW GCC, not Clang. Clang is used to build Chrome (and Firefox) but both of these codebases have C++ exceptions disabled. The result is you end up with bugs like this one[1], which remain unfixed for years. IME, MinGW GCC's Achilles' heel is requiring pthread (via libwinpthreads) for C++11 threads support (see here[2] for details).

[1] https://bugs.llvm.org/show_bug.cgi?id=45021

[2] https://github.com/build2-packaging/libmingw-stdthread#backg...

strager • 2 years ago

That's a good point. I never thought of this. I assumed that, because Clang+MinGW was newer, it's less stable. That's why I went with GCC+MinGW. I completely forgot about the positive corporate influence on Clang.

junek • 2 years ago

I once found a compiler bug in C#/.NET , this must have been ten years ago by now. I was a junior dev debugging some weird problem with a desktop app.

The code was something like:

    if (foo.bar < 5) {
        a()
    } else {
        b()
    }

But `b()` would never get called. I stepped through with a debugger, checking the relevant values, and the `else` branch would never execute.

I remember inspecting the IR and seeing that yes, in fact, the `else` branch was simply missing. It would come back if I made a trivial change like deleting a blank line before the conditional block, which led me to believe the compiler must be getting into a weird state somehow.

I showed my work to the senior engineers and they confirmed that yes this was a compiler bug. We submitted a bug report to MS but never heard back from them. I wonder if it ever got fixed?

maccard • 2 years ago

This isn't a compiler bug, and while the OP clearly did their due diligence, they make multiple statements in the blog post that are wrong, although they're on the right path.

mort96 • 2 years ago

"It's never a compiler bug" is a reference to a commonly held idea that programmers are usually too quick to blame the compiler before realising the bug is actually in their own code. I chose to read the title to mean that the author _does_ know the difference between the compiler and the rest of the toolchain, but chose to let the slight inaccuracy slide for the sake of the reference.

More interestingly, which other statements are wrong?

maccard • 2 years ago

> but chose to let the slight inaccuracy slide for the sake of the reference.

There's a _huge_ difference between a toolchain bug and a compiler bug. It's not really a slight inaccuracy.

> More interestingly, which other statements are wrong?

"""I switched to the x64 version of VS Code and installed quick-lint-js from the VS Code Marketplace. The buggy squiggly appeared! I successfully reproduced the bug.

Conclusion: The problem was specific to Windows x64 builds."""

The conclusion here is quite a stretch - the author has _not_ confirmed the problem is specific to x64 builds, they have confirmed they can repro it in x64 builds. The reason I call it out is because the author _is_ careful to differentiate between hypothesis and conclusion in the rest of the blog post, so this is either a copy error or a misunderstanding on their part!

strager • 2 years ago

You're right. I should have written 'Hypothesis', not 'Conclusion', for that line. EDIT: I corrected the article.

The statement is correct though. The bug was specific to Windows x64 builds when compiled with GCC-MinGW.

beached_whale • 2 years ago

I had the fun the other day of having a unit test suite cause an ICE (internal compiler error, i.e. a bug) on MSVC, Clang, and GCC. A trifecta of compiler bugs. Luckily it was older compilers and it’s been fixed.

optimalsolver • 2 years ago

Not sure if the typo in the title is intentional irony.

vegerot • 2 years ago

JonChesterfield • 2 years ago

Almost every bug I see is a compiler bug. The few remaining are hardware bugs.

strager • 2 years ago

Do you work on a compiler backend? =]

JonChesterfield • 2 years ago

Essentially yes :) llvm dev here

mysterydip • 2 years ago

I've had this happen a few times with a newer embedded system I've been developing on. Fortunately the mid-step generated assembly is available to look through. One that had me scratching my head, some specific instance of:

    if x and y then do z

operated wrong, so I had to do:

    if x then if y then do z

UncleEntity • 2 years ago

The bug is the compiler is supposed to generate code to save the register or the implementation they patched is wrong?

Kind of hard to wade through the overly descriptive story to figure out what the problem was.

enoent • 2 years ago

The implementation was not saving any of the ymm registers, you can see this in the bug report's attached patch. It also took me a while to get, because the last snippet in the writeup only highlights the ymm1 register, instead of showing the actual diff along with the comment.

As an aside, I didn't find the writeup to be overly descriptive. The author does a great job linking every step and how they came to each conclusion along the way, even including hypotheses where they came out empty-handed. When a writeup omits these, it makes the author look like they were somehow enlightened to find the solution, instead of just following a methodical approach.

strager • 2 years ago

> It also took me a while to get

Thanks for the feedback. I redesigned that code block, making it a side-by-side diff. Is it easier to understand now?

saagarjha • 2 years ago

The former.

CountSessine • 2 years ago

I found a genuine compiler bug once, but it was in an obscure C++ compiler for the Playstation from an outfit called SN Systems (I think it used the EDG front end).

But it was in a semi-experimental feature that did some really weird stuff - it was for inline assembly code without an `asm` block - you could just inter-mix MIPS assembly instructions with your C++ code statements. Or rather you could if it worked properly. It was an ambitious feature - a bit too ambitious I think.

tomerv • 2 years ago

Is it really a compiler bug if the root cause is in library code?

alophawen • 2 years ago

Right, this blog finds a bug in binutils not restoring a register in a single code path on Windows.

That is not a compiler bug.

strager • 2 years ago

dlltool generates implibs. quick-lint-js has a .def -> .lib build step in its CMake build system. I'd say this is a form of compilation.

Izikiel43 • 2 years ago

I found a compiler bug while doing a university project for applying image filters using C, asm and gcc. I had created a packed structure for an RGB pixel, so 3 bytes. The C code was super simple, read a byte, apply function, move on, but an image diff showed errors. The problem was that gcc was reading 4 bytes at a time, and this caused an out of bounds read that shouldn't happen.
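To show why a 4-byte read is a problem here (just the arithmetic, not the actual code GCC emitted, which I no longer have): the `Rgb` struct and both helper functions below are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Three one-byte channels with no padding, so pixels sit 3 bytes apart.
struct Rgb {
    std::uint8_t r, g, b;
};
static_assert(sizeof(Rgb) == 3, "Rgb is expected to be exactly 3 bytes");

// Would a load of `load_width` bytes starting at pixel `index` stay inside
// a buffer holding `pixel_count` pixels?
inline bool load_in_bounds(std::size_t pixel_count, std::size_t index,
                           std::size_t load_width) {
    return index * sizeof(Rgb) + load_width <= pixel_count * sizeof(Rgb);
}

// The access pattern the source asks for: one byte at a time.
inline void invert(std::vector<Rgb>& image) {
    for (Rgb& p : image) {
        p.r = static_cast<std::uint8_t>(255 - p.r);
        p.g = static_cast<std::uint8_t>(255 - p.g);
        p.b = static_cast<std::uint8_t>(255 - p.b);
    }
}
```

For a 4-pixel buffer, `load_in_bounds(4, 3, 3)` holds but `load_in_bounds(4, 3, 4)` does not: a 4-byte load on the last pixel reaches one byte past the end.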

strager • 2 years ago

Interesting. I've heard of GCC doing this and breaking multi-threaded code, but not breaking single-threaded code. I'm curious. Can you share the bug report?

alophawen • 2 years ago

> Conclusion: The problem was specific to Windows x64 builds.

This is an odd conclusion as the author just laid out how he didn't test with 32-bit vscode.

makomk • 2 years ago

He didn't test the version of the extension that he'd built himself with 32-bit vscode, but the previous test with the prebuilt extension from VS Code Marketplace which didn't reproduce the bug must have been done using the 32-bit version since that's what he had installed at that point in the tale.

gabcoh • 2 years ago

The fun thing about working on compilers is that you find compiler bugs all the time (usually introduced by you, but not always)!!

mabster • 2 years ago

I once had a function call in a rather large outer function causing unbalanced stack operations. It would pop more space than it pushed to invoke the function. Thankfully that meant a pretty quick crash. After trying a number of things, the solution in the end was to call the function twice since it was idempotent.

0xTJ • 2 years ago

I've only found one compiler bug, and it's in an interaction between the m68k code and inline assembly in modern GCC, which gives less hope of it being fixed. As far as I know it's still unfixed and open on the bug tracker.

wyldfire • 2 years ago

Inline assembly is notorious for being nontrivial and is a big source of invalid compiler bug reports.

Make sure you understand all of the subtle rules about how to use it. If you can, use intrinsics instead.

One particularly tricky/subtle item is the early-clobbers indication.
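A small example of where it matters, assuming GCC or Clang extended asm on x86-64 (other targets take the plain C++ fallback). The function name `add_then_double` is made up for this sketch:

```cpp
#include <cassert>
#include <cstdint>

// Computes (a + b) * 2.
inline std::uint64_t add_then_double(std::uint64_t a, std::uint64_t b) {
#if defined(__GNUC__) && defined(__x86_64__)
    std::uint64_t out;
    asm("mov %1, %0\n\t"  // out = a    (out is written here...)
        "add %2, %0\n\t"  // out += b   (...but b is only read here)
        "add %0, %0"      // out += out
        : "=&r"(out)      // '&' marks out as early-clobber
        : "r"(a), "r"(b)
        : "cc");          // the adds modify the condition flags
    return out;
#else
    return (a + b) * 2;   // portable fallback for other compilers and targets
#endif
}
```

Without the `&`, the compiler is allowed to assume all inputs are consumed before any output is written, so it may put `out` in the same register as `b`. The first `mov` would then overwrite `b` before the `add` reads it.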

netheril96 • 2 years ago

I've seen many C++ compiler bugs. After that I made it almost mandatory for people installing my software from source to run the tests before using it. Before that I figured I only needed to run tests on my own machines.

rurban • 2 years ago

This is just dlltool, which was always broken. Luckily you don't need to use it, just use the linker.

Lately just postgresql kept using dlltool, nobody else.

strager • 2 years ago

Does GNU ld support delayload?

rurban • 2 years ago

It does -z lazy but not on PE i386.

So if you need that you'd need to either fix dlltool, or do it manually in the code.

xbar • 2 years ago

I have made a lot of mistakes in my life. By count, most of them were grumbling claims of non-existent compiler bugs.

stewx • 2 years ago

*finding

zigzag312 • 2 years ago

I recently used C++ for the first time. Because I saw a blog post that MSVC now supports C++ modules, I decided to try them. During the development I started getting weird linker errors about duplicate COMDAT, and from my research it seems it's a compiler or linker bug.

zigzag312 • 2 years ago

To reproduce that bug, create a CMake project, create a shared library subproject, add a module to the library, and use `std::filesystem::path` in the module. Then use that shared library from another project and try to build it with MSVC (VS2022 17.3.0 Preview 1).

https://developercommunity.visualstudio.com/t/lnk1179-invali...