
Why CVE-2022-3602 was not detected by fuzz testing (2022)

67 points | 2 days | allsoftwaresucks.blogspot.com
cjbprime 3 hours ago

I'm not sure the post (from 2022) is/was correct. I've looked into it too, and I expect this was reachable by the existing x509 fuzzer. There's a fallacy in assuming that a fuzzer will solve for all reachable code paths in a reasonable time, and that if it doesn't then there must be a problem with the harness. The harness is a reasonable top-level x509 parsing harness, but starting all the way from the network input makes solving those deep constraints unlikely to happen by (feedback-driven) chance, which is what I think happened here.

Of course, a harness that started from the punycode parsing -- instead of the top-level X509 parsing -- finds this vulnerability immediately.
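
For concreteness, a narrow harness of that shape might look roughly like the sketch below. The ossl_punycode_decode prototype is an assumption based on OpenSSL 3.0's internal crypto/punycode.h and may not match a given tree exactly; the point is only that the fuzzer's input goes straight into the decoder, with no X.509 or ASN.1 layers in front of it.

    /* Sketch of a narrow libFuzzer-style harness that targets the punycode
     * decoder directly.  The ossl_punycode_decode prototype below is an
     * assumption and may differ between OpenSSL versions. */
    #include <stddef.h>
    #include <stdint.h>

    int ossl_punycode_decode(const char *pEncoded, size_t enc_len,
                             unsigned int *pDecoded, unsigned int *pout_length);

    int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
    {
        unsigned int decoded[64];       /* small, fixed-size output buffer */
        unsigned int decoded_len = 64;  /* in: capacity, out: code points written */

        /* Raw fuzzer input goes straight to the decoder: no TLS, ASN.1 or
         * X.509 constraints for the fuzzer to solve before reaching this code. */
        ossl_punycode_decode((const char *)data, size, decoded, &decoded_len);
        return 0;
    }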

hedora 2 hours ago

I've found that writing randomized unit tests for each small part of a system finds this sort of stuff immediately.

In this case, a test that just generated 1,000,000 random strings and passed them to punycode would probably have found the problem (maybe not in the first run, but after a week or two in continuous integration).

I try to structure the tests so they can run with dumb random input or with coverage-guided input from a fuzzer. The former usually finds >90% of the bugs the fuzzer would, and does so 100x faster, so it's better than fuzzing during development, but worse during nightly testing.

One other advantage for dumb random input is that it works with distributed systems and things written in multiple languages (where coverage information isn't readily available).
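
A rough sketch of that kind of "dumb random strings" test, assuming a hypothetical decode_alabel() wrapper around the code under test (not a real OpenSSL API), could look like this; memory errors are expected to be caught by ASan/UBSan or assertions inside the wrapper:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    extern void decode_alabel(const char *in, size_t in_len);  /* hypothetical */

    int main(void)
    {
        unsigned seed = (unsigned)time(NULL);
        srand(seed);
        printf("seed = %u\n", seed);   /* log the seed so a failing run can be replayed */

        /* Bias toward punycode-relevant characters so random strings hit the
         * interesting code paths more often than pure noise would. */
        static const char alphabet[] = "xn-abcdefghijklmnopqrstuvwxyz0123456789";
        char buf[256];

        for (long i = 0; i < 1000000L; i++) {
            size_t len = (size_t)(rand() % (int)sizeof(buf));
            for (size_t j = 0; j < len; j++)
                buf[j] = alphabet[rand() % (int)(sizeof(alphabet) - 1)];
            decode_alabel(buf, len);   /* any crash or sanitizer report fails the test */
        }
        return 0;
    }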

arp242 4 hours ago

(2022)

Previous:

Why CVE-2022-3602 was not detected by fuzz testing - https://news.ycombinator.com/item?id=33693873 - Nov 2022 (171 comments)

dang 2 hours ago

Year added above. Thanks!

_flux 4 hours ago

I suppose it would be nice to require 100% line coverage of encryption-related functionality when it is exercised by a fuzzer. Deviations from that could easily be detected in CI.

Testing cannot be used to prove that a flaw doesn't exist, only that it does.

taviso 3 hours ago

> Testing cannot be used to prove that a flaw doesn't exist, only that it does.

FWIW, I wrote a similar blog post about a different encryption bug that really seemed like it should have been found by fuzzing, and had 100% coverage.

https://googleprojectzero.blogspot.com/2021/12/this-shouldnt...

Not that I disagree with you, just a practical example.

jcims 27 minutes ago

It's pretty remarkable how effective fuzzing is despite the layers upon layers of optimizations/assumptions that it requires in order to be feasible at all (eg max_len = 10000). I haven't tinkered with fuzzing since afl was a toddler but its mechanism for pruning the test space seemed so brilliant and tweakable at the time.

It would be interesting to find a way to create excursions into these various parameter spaces, but some of them are baked into the infrastructure in a way that makes that difficult.

Retr0id 4 hours ago

> Testing cannot be used to prove that a flaw doesn't exist

This is not universally true; formal methods can take you a long way, depending on the problem domain.

And sometimes you can test for every possible input. This usually isn't feasible, but it's nice when it is.
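
As a toy illustration of the "test every possible input" case (not code from the thread): exhaustively checking a bit-trick 16-bit average against a wider-precision reference for all 2^32 input pairs, which runs to completion in minutes.

    #include <stdint.h>
    #include <stdio.h>

    static uint16_t avg_floor(uint16_t a, uint16_t b)
    {
        /* floor((a + b) / 2) without the intermediate sum overflowing */
        return (uint16_t)((a & b) + ((a ^ b) >> 1));
    }

    int main(void)
    {
        for (uint32_t a = 0; a <= UINT16_MAX; a++) {
            for (uint32_t b = 0; b <= UINT16_MAX; b++) {
                uint32_t expect = (a + b) / 2;   /* reference done in 32-bit arithmetic */
                if (avg_floor((uint16_t)a, (uint16_t)b) != expect) {
                    fprintf(stderr, "mismatch: a=%u b=%u\n", a, b);
                    return 1;
                }
            }
        }
        puts("all 2^32 input pairs verified");
        return 0;
    }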

_flux 37 minutes ago

True, I should have remembered to express it as "Non-exhaustive testing cannot...".

I also don't consider formal methods to be testing. But model checking (e.g. with TLC) is something in between, and it sort of takes advantage of the "This usually isn't feasible, but it's nice when it is" point: when you limit your domain enough, it becomes feasible, yet exhaustively checking each small-world scenario gives pretty good confidence that the model is correct.

dllthomas 3 hours ago

Formal methods aren't "testing".

As you say, though, exhaustive testing is sometimes possible.

Retr0id 2 hours ago

I agree that it's not testing in the strict sense, but fuzzing isn't really testing either.

The work done by a theorem prover and a fuzzer isn't all that different (and sufficiently advanced fuzzers use SMT solvers, etc.).

_flux 30 minutes ago

I guess I need to disagree :). Why isn't fuzzing testing? Ultimately it ends up running the code and seeing if it works (for some definition of "working", namely that it doesn't crash), even if it chooses its inputs by considering how the code behaves while it runs.

Also, how is even "smart fuzzing" similar to theorem provers in any way? Fuzzers still think in terms of runs, but a theorem prover considers all cases. The best fuzzer wouldn't be able to tell whether a sort algorithm works for all inputs, presuming the input size is unbounded. Actually, my understanding is that a fuzzer wouldn't even be able to prove that the identity function works for all kinds of objects it may be given.

NovemberWhiskey 58 minutes ago

This seems to neglect the fundamental difference between static and dynamic analysis. Formal methods approaches don’t usually need me to run my code.

bluGill 3 hours ago

It is normally safe to assume that exhaustive testing isn't possible, because the number of states to test exhaustively exceeds the number of atoms in the universe. There are a few exceptions, but in general we should assume it is impossible to exhaustively test programs. Which means we need to use something else to work out what the limits are and test only those, or use formal methods (both would be my recommendation, though I'll admit I have yet to figure out how to use formal methods).

dllthomas 3 hours ago

+1

bluGill 3 hours ago

Formal methods catch a different class of problem from other tests and so are worth it. However, they cannot catch everything. The classic example: "midpoint = (high + low) / 2" is easy to formally prove correct, but in fact it isn't correct because high + low could overflow.
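
A minimal sketch of that failure mode and the conventional repair, in plain C (not code from any particular project):

    #include <limits.h>
    #include <stdio.h>

    int main(void)
    {
        int low  = INT_MAX - 2;
        int high = INT_MAX;

        /*
         * int bad = (high + low) / 2;
         *
         * high + low overflows here, which is undefined behaviour for signed
         * int in C, even though the formula is "provably correct" over
         * unbounded integers.
         */

        int mid = low + (high - low) / 2;   /* every intermediate stays in range */
        printf("mid = %d (expected %d)\n", mid, INT_MAX - 1);
        return 0;
    }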

_flux 22 minutes ago

There is a point here: if your implementation doesn't match the specification, then that can be the case. And your example presumes the specification assumed unbounded numbers while the implementation did not have them. In that case the specification just needs to be revised to include that information.

But the trick is to remember to make the specification model the real world in sufficient detail: not so detailed that the specification becomes unwieldy, and not so relaxed that it skips real-world issues that need to be modeled.

I don't think there are too many good solutions to that, except perhaps theorem provers from which you can also extract code (e.g. Coq), but it seems these tools are either still difficult for engineers to use, or there is an expectation that models and actual implementations are going to differ anyway, making it difficult to see whether the implementation actually implements the spec.

dinosaurdynasty 3 hours ago

Any formal system worth its salt is going to model overflow.

Jtsummers 2 hours ago

And with SPARK/Ada it will. When I was learning it, that was actually a very common thing it caught in my code (various forms of over/underflow errors). In practice most of them would not have occurred, but a few could have under real-world use cases.

thfuran 2 hours ago

If you've formally proved that correct, then either it can't overflow or there's a massive bug in the prover.

mathgradthrow 2 hours ago

I don't mean to be rude, but why are you commenting on formal methods when you clearly have no idea what they are?

bluGill 49 minutes ago

+1

TacticalCoder 4 hours ago

Why is OpenSSL using punycode? To do internationalized domain name parsing?

amiga386 4 hours ago

I would assume it's needed for hostname validation. Is one hostname equivalent to another? Does this wildcard match this hostname?

EDIT: I looked it up. It is used to implement RFC 8398 (Internationalized Email Addresses in X.509 Certificates)

https://www.rfc-editor.org/rfc/rfc8398#section-5

cryptonector 4 hours ago

Yes.

stcredzero 4 hours ago

Would it be possible to attach an LLM to a debugger session executing all of the fuzzer seeds, and ask it to figure out how to expand coverage?

swatcoder 4 hours ago

Not to dismiss the novelty and power of LLMs, but why would you turn to a black-box language interface for that?

Wouldn't you expect a purpose-designed system to be more efficient, complete, reliable, and auditable? Most of those characteristics are critical to security applications, and the nature of LLMs largely runs counter to them.

cjbprime 3 hours ago

These links are a little different to the GP comment, though. Both of these cases (which I agree show LLMs being an excellent choice for improving fuzzing coverage) are static analysis, going from the project source code to a new harness.

Some issues with that are that the model probably doesn't have enough context to be given all the project source code, so you have to work out which subset to share, including definitions of all relevant symbols (but not too many).

It helps a bit that the foundation models were already pre-trained on these large open source projects in oss-fuzz, so they already know something about the project's symbols and definitions from their original training sets -- and even from public discussions about the code! -- but that wouldn't work for a private codebase or for recent changes to a large public one.

Then the harness source that the LLM writes might have syntax errors/fail to compile, and you have to deal with that somehow, and the harness source that the LLM writes might be valid but not generate any coverage improvement and you have to deal with that, and so on.

GP seems instead to be talking about some form of LLM-aided dynamic analysis, where you are probably using some kind of symbolic execution to generate new seeds, not new harnesses.

That's important work too, because I think in this case (disagreeing with the blog post author) the vulnerable function was actually reachable by existing harnesses, just not through the seed corpora (at least the public ones).

One approach could be for the LLM to become a kind of a symbolic execution constraint solver, using the debugger as a form of path instrumentation and producing new seeds by predicting what a new input would look like when you invert each interesting constraint that the fuzzing coverage is blocked by, as the debugger hits the test for that constraint (which is difficult because it can require actual computation, not pattern matching, and because of path explosion).

Or perhaps more plausibly, the LLM could generate Z3 or other SMT-solver code to define and solve those constraints to generate new seeds, replacing what is currently extremely tedious and time-consuming work when done by humans.

swatcoder 3 hours ago

Those demonstrate that they're capable of generating capable ones, which is really cool but also not surprising.

What matters for engineering is how that technique compares to others in the context of specific requirements.

A big part of "falling for hype" is mistaking a new and capable tool for being the right or optimal tool.

cjbprime 3 hours ago

+1