You know, articles like this make me wish an OS would actually have a built-in fast, reliable, fully-featured HTTP parser. I've written a couple of (very strict) HTTP parsers on my own, and this whole "request smuggling" is possible only because HTTP messages have rather delicate and fragile, totally non-robust framing and structure. Miss a letter in one of the relevant RFCs (IIRC, you need to read at least the first 3 RFCs to learn the complete wire format of an HTTP message) and you'll end up with a subtly non-compliant and vulnerable parser.
And yet, every single programming language/platform builds its own HTTP-handling library, usually several, of widely varying quality and feature support. Again, it would not be as bad if HTTP were a robust format where you could skip recognizing and correctly dealing with half of the features you don't intend to support, but it is not: even if you don't want to accept e.g. trailers, you still have to be aware of them. We have OpenSSL, why not also have OpenHTTP (in sans-io style)?
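For readers unfamiliar with the term, "sans-io" means the parser only consumes bytes and returns events, and never touches a socket itself. A minimal sketch of that shape follows; the class name `RequestHeadParser`, the `feed` method and the 64 KiB limit are all made up for illustration, and it only handles the request line and header block, not bodies or trailers.

```python
class RequestHeadParser:
    """Hypothetical sans-io parser: the caller does all I/O and feeds bytes in."""

    MAX_HEAD = 64 * 1024  # assumed limit for this sketch, not taken from any spec

    def __init__(self):
        self._buf = bytearray()

    def feed(self, data: bytes):
        """Buffer data; return (method, target, version, headers) or None if incomplete."""
        self._buf += data
        end = self._buf.find(b"\r\n\r\n")
        if end == -1:
            if len(self._buf) > self.MAX_HEAD:
                raise ValueError("request head too large")
            return None
        head = bytes(self._buf[:end])
        del self._buf[:end + 4]  # keep any bytes after the head for the caller
        lines = head.split(b"\r\n")
        parts = lines[0].split(b" ")
        if len(parts) != 3:
            raise ValueError("malformed request line")
        method, target, version = parts
        headers = []
        for line in lines[1:]:
            # A bare CR or LF left over after splitting on CRLF is rejected outright.
            if b"\r" in line or b"\n" in line:
                raise ValueError("bare CR or LF in header line")
            name, sep, value = line.partition(b":")
            # Reject missing colon, empty name, and whitespace around the name.
            if not sep or not name or name != name.strip():
                raise ValueError("malformed header line")
            headers.append((name.lower(), value.strip(b" \t")))
        return method, target, version, headers
```

The same object can then sit behind blocking sockets, asyncio, or a test harness that replays captured bytes, which is the main appeal of the style.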
I've implemented a few things from RFCs and I always wish that for each RFC there was a library of test cases to test your implementation.
Does anyone know if there is anything like this for HTTP or associated RFCs?
Eg, for HTTP header parameter, names can have a * to change the character encoding of the parameter value. How many implementations test this? Or tests for decoding of URI paths that contain escaped / characters to make sure they're not confused with the /s that are the path separators.
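The escaped-slash case is easy to get wrong by decoding too early. A small sketch of the safe order (split on the literal separators first, then percent-decode each segment); `split_path` is a hypothetical helper name, and the input is assumed to be an absolute path:

```python
from urllib.parse import unquote


def split_path(raw_path: str) -> list[str]:
    """Split a raw absolute URI path into decoded segments."""
    # Split on literal "/" first so that an escaped %2F stays inside its segment,
    # then percent-decode each segment on its own.
    return [unquote(segment) for segment in raw_path.split("/")[1:]]


def split_path_wrong(raw_path: str) -> list[str]:
    """The buggy order: decoding first turns %2F into a path separator."""
    return unquote(raw_path).split("/")[1:]
```

With the raw path `/files/a%2Fb/c`, the first function keeps `a/b` as one segment while the second splits it into `a` and `b`.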
Or at least a bunch of examples in the RFC itself. Don't you just love reading a long description of a convoluted data format with literally zero examples of how the full thing looks and what it is supposed to mean? Sadly, leaving the validation undocumented is pretty common across format/protocol descriptions, and RFCs actually seem to generally be on the "more specific" end of the scale, thanks to the ubiquitous use of MUST/SHOULD language. But I recently wrote a toy ELF parser and it's amazing how many things in its spec are left implicit: e.g. you probably should check that calculating a segment's end (base+size) doesn't overflow and wrap over zero... should you? Maybe you're supposed to support segments that span over the MAX_MEMORY_ADDRESS into the lower memory, who knows? The spec does not say.
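For what it's worth, here is one possible policy for the segment-end check, written out: reject anything that would wrap. This is an assumption about what a careful parser might do, not something the ELF spec mandates; `segment_end` and `U64_MAX` are names invented for the sketch, and the comparison is phrased the way you would write it in a language with fixed-width integers.

```python
U64_MAX = 2**64 - 1  # largest address representable in a 64-bit ELF field


def segment_end(base: int, size: int) -> int:
    """Return base + size, refusing segments that would wrap past the address space."""
    if not (0 <= base <= U64_MAX and 0 <= size <= U64_MAX):
        raise ValueError("base and size must fit in 64 bits")
    # Compare before adding, so the check itself cannot overflow in C-like languages.
    if size > U64_MAX - base:
        raise OverflowError("segment wraps around the address space")
    return base + size
```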
>Eg, for HTTP header parameter, names can have a * to change the character encoding of the parameter value
Where did you read this? HTTP header fields may contain MIME-encoded values using the encoding scheme outlined in rfc2047, but I haven't heard of the asterisk having any special meaning...
I believe he refers to RFC 8187.
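A rough sketch of what decoding such a parameter value looks like. In RFC 8187 the starred form (e.g. `title*=`) carries a value shaped like `charset'language'percent-encoded-bytes`. The helper name `decode_ext_value` is made up, and this sketch only accepts UTF-8, which is the one charset the RFC requires recipients to support.

```python
from urllib.parse import unquote


def decode_ext_value(ext_value: str) -> str:
    """Decode an RFC 8187 ext-value such as UTF-8''%e2%82%ac%20rates."""
    # Unpacking raises ValueError if the two single quotes are missing.
    charset, _language, encoded = ext_value.split("'", 2)
    if charset.lower() != "utf-8":
        raise ValueError("this sketch only handles UTF-8")
    # errors="strict" makes invalid byte sequences fail instead of being replaced.
    return unquote(encoded, encoding="utf-8", errors="strict")
```

Using the example value from the RFC, `decode_ext_value("UTF-8''%c2%a3%20and%20%e2%82%ac%20rates")` gives `£ and € rates`.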
Fuchsia has an http client component which is part of the platform and, given Fuchsia's component architecture, it's accessed through a message-passing protocol which is programming language agnostic.
: https://cs.opensource.google/fuchsia/fuchsia/+/main:sdk/fidl... : https://cs.opensource.google/fuchsia/fuchsia/+/main:src/conn... : https://fuchsia.dev/fuchsia-src/reference/fidl/language/lang...
When I started to build my browser I realized that there's literally no standard test suite to test your HTTP implementation against.
There are test suites for _some_ subsets of the spec, and there are implementation-specific testsuites (e.g. in chromium) ... but there's not a single HTTP 1.1 all-in-one testserver that you can test your client or server implementation against - over the wire.
Add to that the lack of tests for hop-by-hop behavior (e.g. the Transfer-Encoding parts of the HTTP/1.1 spec) and you have a disaster waiting to happen.
Combine that with 206 Partial Content and say, some byte ranges a server cannot process...and you've got a simple way to crash a lot of server implementations.
There's not a single web server implementation out there that correctly implements multiple byte range requests and responses especially not when chunked encoding can be a requested thing. Don't get me started on the ";q=x.y" value headers, they are buggy everywhere, too.
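To make the ";q=x.y" complaint concrete: the grammar in RFC 9110 is narrow (`0` with up to three decimals, or `1` with up to three zeros), and a lenient `float()` call accepts far more than that. A strict check could look like this; `parse_qvalue` is a hypothetical helper, not any library's API.

```python
import re

# qvalue = ( "0" [ "." 0*3DIGIT ] ) / ( "1" [ "." 0*3("0") ] )
_QVALUE = re.compile(r"0(?:\.[0-9]{0,3})?|1(?:\.0{0,3})?")


def parse_qvalue(text: str) -> float:
    """Parse a quality value strictly; reject anything outside the grammar."""
    if not _QVALUE.fullmatch(text):
        raise ValueError(f"invalid qvalue: {text!r}")
    return float(text)
```

Inputs like `1.5`, `.5`, `0.1234` or `1e0` all parse as floats but are rejected here.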
For my browser project, I had to build a pcap (tcpdump) based test runner that can set up temporary local networks with throttling and fragmentation behaviour, so that I have reproducible tests that I can analyze later when they fail. Otherwise it would be a useless, implementation-specific network protocol test like all the others.
I think the web heavily needs a standard HTTP test suite, similar to the ACID tests back then...but for malicious and spec compliant HTTP payloads combined.
So from that observation, why don't you put your browser project on hold for a while and start that http test server project?
I think there currently is no such thing because writing test cases for protocols is an uphill start. You simply don't have any constraints on how to start. Write the tests in plain text? How to encode the behavior? Write the tests in a programming language? How to execute the tested client? It's not impossible to have a client/server-agnostic test library, but it's non-trivial to design the framework.
"And the next thing you know, you’re at the zoo, shaving a yak, all so you can wax your car."
That said, it depends on your goals. Writing pragmatic, limited test cases for protocols is super hard, due as you say to the lack of constraints.
But if your goal from the outset is to write a definitive, exhaustive test suite then it's a far more mechanical task (much the way that writing a chess AI is hard if you want it to run on a desktop computer, but writing a program to play perfect chess only requires a simple understanding of graph searching if you don't care how fast it is.) Just start from the start of the protocol and work your way through one statement at a time, enumerate all the different ways that an implementation could cock it up, and write a test for each. Of course there are still engineering decisions to be made but you don't have to pick the perfect solution to each. A solution is enough, you (or someone else) can always improve it later.
> articles like this make me wish an OS would actually have a built-in fast, reliable, fully-featured HTTP parser.
You mean, like Windows?
HTTP is application layer; it should have nothing to do with the operating system. In fact, the OS should likely have no access to the HTTP frame at all if the connection uses TLS.
Can we please stop with this OSI nonsense already? HTTP is a transport-level protocol today. If something uses TCP, chances are pretty good it also uses HTTP on top of that, and some sort of homegrown RPC on top of that.
And the OS absolutely has access to the HTTP frame: it manages the process's network buffers and its whole memory mapping, it locates and loads OpenSSL at the process's startup... a process is really not a black box from the OS point of view.
I think FFI, as you mention with OpenSSL, would be the better approach. And I think this would be a good idea in general. But most languages don't make FFI easy on either side.
The best solution would be to make an http version 4 with a non-fragile format, e.g. json.
Otherwise we will keep chasing bugs forever.
Configuring your webserver/reverse proxy to talk HTTP/2 to backend appservers is a good improvement against request smuggling. (If they support it, sadly not guaranteed). The binary format is much less ambiguous.
There is a talk by James Kettle about request smuggling with HTTP/2, but it is largely about attacks when the frontend talks HTTP/2 and then converts to HTTP/1.1 to talk to backend servers. That said, it does also highlight some HTTP/2-only quirks, so it’s not completely perfect, but it’s so much better than HTTP/1.1.
A lot of the new http bugs aren’t caused by ambiguities in http1 headers, or ambiguities in http2 headers. They happen when an http2 message gets rewritten into http1 and “valid” http header characters (like new lines) show up as header separators in http1.
The problem isn’t that we don’t have a good header format. The problem is we have too many.
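A sketch of the kind of check a downgrading proxy needs before it writes an HTTP/2 field out as an HTTP/1.1 header line. The function name `to_http1_header_line` is invented for illustration, and the forbidden set here is deliberately minimal (CR, LF, NUL); a real proxy would validate against the full field grammar.

```python
FORBIDDEN = frozenset(b"\r\n\x00")  # byte values that must never reach the HTTP/1.1 wire


def to_http1_header_line(name: bytes, value: bytes) -> bytes:
    """Serialize one field as an HTTP/1.1 header line, refusing injection attempts."""
    if not name or b":" in name:
        raise ValueError("invalid field name")
    if any(byte in FORBIDDEN for byte in name + value):
        # A CR or LF here would start a new header line (or a new request).
        raise ValueError("CR, LF or NUL in field")
    return name + b": " + value + b"\r\n"
```

Without the check, a value such as `1\r\nTransfer-Encoding: chunked` would silently become two header lines on the backend connection.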
> a non-fragile format, e.g. json.
JSON is a terrible format. Especially for streaming data.
Here's a super simple shell script for generating invalid JSON that will blow Python's stack:
  n="$(python3 -c 'import math; import sys; sys.stdout.write(str(math.floor(sys.getrecursionlimit() - 4)))')"
  left="$(yes [ | head -n "$n" | tr -d '\n')"
  echo "$left" | python3 -c 'import json; print(json.loads(input()))'
It is invalid JSON. But you cannot tell if it's invalid until either:
a) You run out of memory.
b) The connection ends.
Because JSON is terrible for streaming data.
HTTP/2 and HTTP/3 have much more strictly defined binary formats. JSON would be a step back in terms of spec and performance.
JSON has the problem that different parsers handle multiple occurrences of an object member differently. You would also need to make sure that header names are ASCII-only; otherwise you could run into string comparisons behaving differently on different platforms.
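The duplicate-member problem is easy to demonstrate with Python's standard `json` module, which by default keeps the last value for a repeated name. A parser that wants to refuse such input can use `object_pairs_hook`; the helper names `reject_duplicates` and `loads_strict` below are just for this sketch.

```python
import json


def reject_duplicates(pairs):
    """object_pairs_hook that refuses objects with repeated member names."""
    keys = [key for key, _ in pairs]
    if len(keys) != len(set(keys)):
        raise ValueError("duplicate object member")
    return dict(pairs)


def loads_strict(text: str):
    """Like json.loads, but raises ValueError on duplicate member names."""
    return json.loads(text, object_pairs_hook=reject_duplicates)


# Default behaviour: the last occurrence silently wins.
lenient = json.loads('{"host": "a.example", "host": "b.example"}')
```

If a frontend picks the first `host` and a backend picks the last, you have the same disagreement that request smuggling exploits, just in a different syntax.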
Why is HTTP so complex? The base use case (hypermedia request-response) sounds really simple.
HTTP/1.1 is an old protocol. Over time, new requirements made modifications necessary, some things fell out of use, and some changes turned out to be mistakes. That it's text-based without using doesn't help.
The basis is simple, but then add Cookies, HTTPS, Authentication, Redirecting, Host headers, caching, chunked encoding, WebDAV, CORS, etc etc. All justifiable but all adding complexity.
HTTP/0.9 is pretty simple, but for a fast web we need more complexity.
More parsing and data processing = faster web. It all makes sense, really!
Joking aside, some "features" in HTTP/1.1 are really questionable. Trailing headers? 1xx responses? Comments in chunked encoding? The headers that proxies must cut out in addition to those specified in "Connection" header except the complete list of those is specified nowhere? The methods that prohibit the request/response to have a body but again, the full list is nowhere to be found?
All these features have justifications but the end result is a protocol with rather baroque syntax and semantics.
P.S. By the way, the HTTP/1.1 specs allow a GET request to have a body in chunked encoding — guess how many existing servers support that.
Nearly all L7 protocols and their parsers are complex. HTTP is kind of simple, relatively speaking.
Nice. Link to the whitepaper: https://bahruz.me/papers/ccs2021treqs.pdf
I only skimmed very quickly to look for which server setups they found new vulnerabilities for, and it looked like they tested a 2D matrix of popular webservers/caches/reverse-proxies with each other? Which is neat for automation, but in the real world I'm not usually going to be running haproxy behind nginx or vice versa. I'd be much more interested in findings for popular webserver->appserver setups, e.g., nginx in front of gunicorn/django.
I've definitely seen people do nginx+haproxy setups in the real world.
Sure, I'm not saying it doesn't happen or that there's no reason to do it, I just think that in practical terms the much more widespread attack surface area would be interaction between one of these and common application servers.
Tomcat in front of anything else (Apache, Nginx) is a common combination they tested. This is for a Java application with a webserver frontend that's enforcing rules/caching/authentication.
the grammar is pretty great
On the whole, what I found more interesting here was the techniques they came across through fuzzing that had some impact. Yes, it's interesting to see the specific combinations that were impacted, but in the real world there are so many other potential combinations.
The dominant method for request smuggling over the last few years has been with `Content-Length` and `Transfer-Encoding`. What I found most interesting, and the biggest takeaway as someone who has worked doing web-app assessments, is the attacks that they found to work and cause problems.
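For anyone who hasn't seen the `Content-Length` / `Transfer-Encoding` disagreement spelled out, here is a toy illustration. The two functions are deliberately naive stand-ins for a frontend that trusts `Content-Length` and a backend that trusts chunked encoding; they are not modeled on any real server, and the request bytes are a made-up example.

```python
RAW = (
    b"POST / HTTP/1.1\r\n"
    b"Host: example.com\r\n"
    b"Content-Length: 13\r\n"
    b"Transfer-Encoding: chunked\r\n"
    b"\r\n"
    b"0\r\n"
    b"\r\n"
    b"SMUGGLED"
)


def split_head(raw: bytes):
    """Return (headers dict, bytes after the blank line). Toy parser, no validation."""
    head, _, rest = raw.partition(b"\r\n\r\n")
    headers = {}
    for line in head.split(b"\r\n")[1:]:
        name, _, value = line.partition(b":")
        headers[name.strip().lower()] = value.strip()
    return headers, rest


def leftover_by_content_length(raw: bytes) -> bytes:
    """Bytes left on the connection if the body is framed by Content-Length."""
    headers, rest = split_head(raw)
    return rest[int(headers[b"content-length"]):]


def leftover_by_chunked(raw: bytes) -> bytes:
    """Bytes left on the connection if the body is framed by chunked encoding."""
    _headers, rest = split_head(raw)
    pos = 0
    while True:
        eol = rest.index(b"\r\n", pos)
        size = int(rest[pos:eol], 16)
        pos = eol + 2
        if size == 0:
            return rest[pos + 2:]  # skip the CRLF that ends the empty trailer section
        pos += size + 2  # skip chunk data and its trailing CRLF
```

The first parser sees one complete request with nothing left over; the second stops at the zero-size chunk and treats `SMUGGLED` as the start of the next request on the same connection.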
I mean, the details about particular server pairs having issues are great information, as is the fuzzing setup (great use of differential fuzzing), but I think more important is being able to take these potential attack avenues that they had success with and run them against your own deployments. Given how many applications internally are running their own stacks, there is still a lot of room for potential issues. I can imagine people running with some of these for bounties in the near future.
A brief summary of the manipulations they had some success with is on pages 7-8. Though if you don't feel like reading, the "headline" version of the list gives you a pretty decent idea of what was having an impact:
Request Line Mutations
- Mangled Method
- Distorted Protocol
- Invalid Version
- Manipulated Termination
- Embedded Request Lines
Request Headers Mutations
- Distorted Header Value
- Manipulated Termination
- Expect Header
- Identity Encoding
- V1.0 Chunked Encoding
- Double Transfer-Encoding
Request Body Mutations
- Chunk-Size Chunk-Data Mismatch
- Manipulated Chunk-Size Termination
- Manipulated Chunk-Extension Termination
- Manipulated Chunk-Data Termination
- Mangled Last-Chunk
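To give a feel for the body mutations in that list, here is a sketch of a deliberately strict chunked decoder that rejects most of them. It is a guess at what "strict" could mean, not what the paper or any server does: `decode_chunked_strict` is a made-up name, it works on a complete body rather than a stream, and it refuses chunk extensions and trailers entirely, which is stricter than the RFC requires.

```python
import re

_CHUNK_SIZE = re.compile(rb"[0-9A-Fa-f]{1,8}")  # hex digits only; the length cap is an assumption


def decode_chunked_strict(body: bytes) -> bytes:
    """Decode a complete chunked body, rejecting anything not exactly well-formed."""
    out = bytearray()
    pos = 0
    while True:
        eol = body.find(b"\r\n", pos)
        if eol == -1:
            raise ValueError("chunk-size line not terminated by CRLF")
        size_line = body[pos:eol]
        # fullmatch also rejects bare LF, whitespace, signs and chunk extensions.
        if not _CHUNK_SIZE.fullmatch(size_line):
            raise ValueError("bad chunk-size line")
        size = int(size_line, 16)
        pos = eol + 2
        if size == 0:
            if body[pos:] != b"\r\n":
                raise ValueError("trailers or extra bytes after last-chunk")
            return bytes(out)
        data = body[pos:pos + size]
        # The chunk data must be followed immediately by CRLF, nothing else.
        if len(data) != size or body[pos + size:pos + size + 2] != b"\r\n":
            raise ValueError("chunk-data does not match chunk-size")
        out += data
        pos += size + 2
```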
And a bit of self-promotion but we talked about this paper on Monday on the DAY podcast I cohost (24m53s - 35m30s) https://www.youtube.com/watch?v=GmOuX8nHZuc&t=1497s
This sort of thing is why it's nice to have authz throughout your environment. A client request that gets incorrectly forwarded to a proxy should be rejected by the downstream service.
The problem is that people keep all of their authn/authz at the boundary and then, once you're past that, it's a free for all.
Every service needs to validate authorization of the request.
Can someone give an example of how this works? The article was light on details.
the eternal cat mouse game, an escher triangle of misery, always going up and down at the same time, never really getting anywhere