
Why CVE-2022-3602 was not detected by fuzz testing (2022)

67 points | 2 days | allsoftwaresucks.blogspot.com
cjbprime 3 hours ago

I'm not sure the post (from 2022) is/was correct. I've looked into it too, and I expect this was reachable by the existing x509 fuzzer. There's a fallacy in assuming that a fuzzer will solve for all reachable code paths in a reasonable time, and that if it doesn't then there must be a problem with the harness. The harness is a reasonable top-level x509 parsing harness, but starting all the way from the network input makes solving those deep constraints unlikely to happen by (feedback-driven) chance, which is what I think happened here.

Of course, a harness that started from the punycode parsing -- instead of the top-level X509 parsing -- finds this vulnerability immediately.
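
For concreteness, a narrow harness of that shape might look roughly like the sketch below. The ossl_punycode_decode prototype is an assumption based on OpenSSL 3.0's internal crypto/punycode.h and may not match a given tree exactly; the point is only that the fuzzer's input goes straight into the decoder, with no X.509 or ASN.1 layers in front of it.

    /* Sketch of a narrow libFuzzer-style harness that targets the punycode
     * decoder directly.  The ossl_punycode_decode prototype below is an
     * assumption and may differ between OpenSSL versions. */
    #include <stddef.h>
    #include <stdint.h>

    int ossl_punycode_decode(const char *pEncoded, size_t enc_len,
                             unsigned int *pDecoded, unsigned int *pout_length);

    int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
    {
        unsigned int decoded[64];       /* small, fixed-size output buffer */
        unsigned int decoded_len = 64;  /* in: capacity, out: code points written */

        /* Raw fuzzer input goes straight to the decoder: no TLS, ASN.1 or
         * X.509 constraints for the fuzzer to solve before reaching this code. */
        ossl_punycode_decode((const char *)data, size, decoded, &decoded_len);
        return 0;
    }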

hedora 2 hours ago

I've found that writing randomized unit tests for each small part of a system finds this sort of stuff immediately.

In this case, a test that just generated 1,000,000 random strings and passed them to punycode would probably have found the problem (maybe not in the first run, but after a week or two in continuous integration).

I try to structure the tests so they can run with dumb random input or with coverage-guided input from a fuzzer. The former usually finds >90% of the bugs the fuzzer would, and does so 100x faster, so it's better than fuzzing during development, but worse during nightly testing.

One other advantage for dumb random input is that it works with distributed systems and things written in multiple languages (where coverage information isn't readily available).
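
A rough sketch of that kind of "dumb random strings" test, assuming a hypothetical decode_alabel() wrapper around the code under test (not a real OpenSSL API), could look like this; memory errors are expected to be caught by ASan/UBSan or assertions inside the wrapper:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    extern void decode_alabel(const char *in, size_t in_len);  /* hypothetical */

    int main(void)
    {
        unsigned seed = (unsigned)time(NULL);
        srand(seed);
        printf("seed = %u\n", seed);   /* log the seed so a failing run can be replayed */

        /* Bias toward punycode-relevant characters so random strings hit the
         * interesting code paths more often than pure noise would. */
        static const char alphabet[] = "xn-abcdefghijklmnopqrstuvwxyz0123456789";
        char buf[256];

        for (long i = 0; i < 1000000L; i++) {
            size_t len = (size_t)(rand() % (int)sizeof(buf));
            for (size_t j = 0; j < len; j++)
                buf[j] = alphabet[rand() % (int)(sizeof(alphabet) - 1)];
            decode_alabel(buf, len);   /* any crash or sanitizer report fails the test */
        }
        return 0;
    }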

arp242 4 hours ago

(2022)

Previous:

Why CVE-2022-3602 was not detected by fuzz testing - https://news.ycombinator.com/item?id=33693873 - Nov 2022 (171 comments)

dang 2 hours ago

Year added above. Thanks!

_flux 4 hours ago

I suppose it would be nice to require 100% line coverage of encryption-related functionality when it is exercised by a fuzzer. Deviations from that could easily be detected in CI.

Testing cannot be used to prove that a flaw doesn't exist, only that it does.

taviso 3 hours ago

> Testing cannot be used to prove that a flaw doesn't exist, only that it does.

FWIW, I wrote a similar blog post about a different encryption bug that really seemed like it should have been found by fuzzing, and had 100% coverage.

https://googleprojectzero.blogspot.com/2021/12/this-shouldnt...

Not that I disagree with you, just a practical example.

jcims 27 minutes ago

It's pretty remarkable how effective fuzzing is despite the layers upon layers of optimizations/assumptions that it requires in order to be feasible at all (eg max_len = 10000). I haven't tinkered with fuzzing since afl was a toddler but its mechanism for pruning the test space seemed so brilliant and tweakable at the time.

It would be interesting to find a way to create excursions into these various parameter spaces, but some of them are baked into the infrastructure in a way that makes that difficult.

Retr0id 4 hours ago

> Testing cannot be used to prove that a flaw doesn't exist

This is not universally true; formal methods can take you a long way, depending on the problem domain.

And sometimes you can test for every possible input. This usually isn't feasible, but it's nice when it is.
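
As a toy illustration of the "test every possible input" case (not code from the thread): exhaustively checking a bit-trick 16-bit average against a wider-precision reference for all 2^32 input pairs, which runs to completion in minutes.

    #include <stdint.h>
    #include <stdio.h>

    static uint16_t avg_floor(uint16_t a, uint16_t b)
    {
        /* floor((a + b) / 2) without the intermediate sum overflowing */
        return (uint16_t)((a & b) + ((a ^ b) >> 1));
    }

    int main(void)
    {
        for (uint32_t a = 0; a <= UINT16_MAX; a++) {
            for (uint32_t b = 0; b <= UINT16_MAX; b++) {
                uint32_t expect = (a + b) / 2;   /* reference done in 32-bit arithmetic */
                if (avg_floor((uint16_t)a, (uint16_t)b) != expect) {
                    fprintf(stderr, "mismatch: a=%u b=%u\n", a, b);
                    return 1;
                }
            }
        }
        puts("all 2^32 input pairs verified");
        return 0;
    }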

_flux 37 minutes ago

True, I should have remembered to express it as "Non-exhaustive testing cannot...".

I also don't consider formal methods to be testing. But model checking (e.g. with TLC) is something in between, and it sort of takes advantage of the "This usually isn't feasible, but it's nice when it is" point: when you limit your domain enough, it becomes feasible, yet exhaustively checking each small-world scenario gives pretty good confidence that the model is correct.

dllthomas 3 hours ago

Formal methods aren't "testing".

As you say, though, exhaustive testing is sometimes possible.

Retr0id 2 hours ago

I agree that it's not testing in the strict sense, but fuzzing isn't really testing either.

The work done by a theorem prover and a fuzzer isn't all that different (and sufficiently advanced fuzzers use SMT solvers, etc.).

_flux 30 minutes ago

I guess I need to disagree :). Why isn't fuzzing testing? Ultimately it ends up running the code and seeing if it works (for some definition of "working", namely that it doesn't crash), even if it chooses its inputs by considering how the code behaves while it runs.

Also, how is even "smart fuzzing" similar to theorem provers in any way? Fuzzers still think in terms of runs, but a theorem prover considers all cases. The best fuzzer wouldn't be able to tell whether a sort algorithm works for all inputs, presuming the input size is unbounded. Actually, my understanding is that a fuzzer wouldn't even be able to prove that the identity function works for all kinds of objects it may be given.

NovemberWhiskey 58 minutes ago

This seems to neglect the fundamental difference between static and dynamic analysis. Formal methods approaches don’t usually need me to run my code.

bluGill 3 hours ago

It is normally safe to assume that exhaustive testing isn't possible, because the number of states to test exhaustively exceeds the number of atoms in the universe. There are a few exceptions, but in general we should assume it is impossible to exhaustively test programs. Which means we need to use something else to work out what the limits are and test only those, or use formal methods (both would be my recommendation, though I'll admit I have yet to figure out how to use formal methods).

dllthomas 3 hours ago

+1

bluGill 3 hours ago

Formal methods catch a different class of problem from other tests and so are worth it. However, they cannot catch everything. The classic example: "midpoint = (high + low) / 2" is easy to formally prove correct, but in fact it isn't correct because high + low could overflow.
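
A minimal sketch of that failure mode and the conventional repair, in plain C (not code from any particular project):

    #include <limits.h>
    #include <stdio.h>

    int main(void)
    {
        int low  = INT_MAX - 2;
        int high = INT_MAX;

        /*
         * int bad = (high + low) / 2;
         *
         * high + low overflows here, which is undefined behaviour for signed
         * int in C, even though the formula is "provably correct" over
         * unbounded integers.
         */

        int mid = low + (high - low) / 2;   /* every intermediate stays in range */
        printf("mid = %d (expected %d)\n", mid, INT_MAX - 1);
        return 0;
    }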

_flux 22 minutes ago

There is a point here: if your implementation doesn't match the specification, then that can be the case. And your example presumes the specification assumed unbounded numbers while the implementation did not have them. In that case the specification just needs to be revised to include that information.

But the trick is to remember to make the specification model the real world in sufficient detail: not so detailed that the specification becomes unwieldy, and not so relaxed that it skips real-world issues that need to be modeled.

I don't think there are too many good solutions to that, except perhaps theorem provers from which you can also extract code (e.g. Coq), but it seems these tools are either still difficult for engineers to use, or there is an expectation that models and actual implementations are going to differ anyway, making it difficult to see whether the implementation actually implements the spec.

dinosaurdynasty 3 hours ago

Any formal system worth its salt is going to model overflow.

Jtsummers 2 hours ago

And with SPARK/Ada it will. When I was learning it, that was actually a very common thing it caught in my code (various forms of over/underflow errors). In practice most of them would not have occurred, but a few could have under real-world use cases.

thfuran 2 hours ago

If you've formally proved that correct, then either it can't overflow or there's a massive bug in the prover.

mathgradthrow 2 hours ago

I don't mean to be rude, but why are you commenting on formal methods when you clearly have no idea what they are?

bluGill 49 minutes ago

+1

TacticalCoder 4 hours ago

Why is OpenSSL using punycode? To do internationalized domain name parsing?

amiga386 4 hours ago

I would assume it's needed for hostname validation. Is one hostname equivalent to another? Does this wildcard match this hostname?

EDIT: I looked it up. It is used to implement RFC 8398 (Internationalized Email Addresses in X.509 Certificates)

https://www.rfc-editor.org/rfc/rfc8398#section-5

cryptonector 4 hours ago

Yes.

stcredzero 4 hours ago

Would it be possible to attach an LLM to a debugger session executing all of the fuzzer seeds, and ask it to figure out how to expand coverage?

swatcoder 4 hours ago

Not to dismiss the novelty and power of LLMs, but why would you turn to a black-box language interface for that?

Wouldn't you expect a purpose-designed system to be more efficient, complete, reliable, and auditable? Most of those characteristics are critical to security applications, and the nature of LLMs largely runs counter to them.

cjbprime 3 hours ago

These links are a little different to the GP comment, though. Both of these cases (which I agree show LLMs being an excellent choice for improving fuzzing coverage) are static analysis, going from the project source code to a new harness.

Some issues with that are that the model probably doesn't have enough context to be given all the project source code, so you have to work out which subset to share, including definitions of all relevant symbols (but not too many).

It helps a bit that the foundation models were already pre-trained on these large open source projects in oss-fuzz, so they already know something about the project's symbols and definitions from their original training sets -- and even from public discussions about the code! -- but that wouldn't work for a private codebase or for recent changes to a large public one.

Then the harness source that the LLM writes might have syntax errors/fail to compile, and you have to deal with that somehow, and the harness source that the LLM writes might be valid but not generate any coverage improvement and you have to deal with that, and so on.

GP seems instead to be talking about some form of LLM-aided dynamic analysis, where you are probably using some kind of symbolic execution to generate new seeds, not new harnesses.

That's important work too, because I think in this case (disagreeing with the blog post author) the vulnerable function was actually reachable by existing harnesses, just not through the seed corpora (at least the public ones).

One approach could be for the LLM to become a kind of a symbolic execution constraint solver, using the debugger as a form of path instrumentation and producing new seeds by predicting what a new input would look like when you invert each interesting constraint that the fuzzing coverage is blocked by, as the debugger hits the test for that constraint (which is difficult because it can require actual computation, not pattern matching, and because of path explosion).

Or perhaps more plausibly, the LLM could generate Z3 or other SMT-solver code to define and solve those constraints to generate new seeds, replacing what is currently extremely tedious and time-consuming work when done by humans.

swatcoder 3 hours ago

Those demonstrate that they're capable of generating capable ones, which is really cool but also not surprising.

What matters for engineering is how that technique compares to others in the context of specific requirements.

A big part of "falling for hype" is mistaking a new and capable tool for being the right or optimal tool.

cjbprime 3 hours ago

+1