Increasing QUIC and UDP Throughput over Tailscale

254 points
gorkish 7 months ago

GSO (generic segmentation offload) has been supported for UDP in the Linux kernel since 4.18, so I would assume these optimizations should be readily possible for the kernel driver as well, and could take advantage of any drivers that pass GSO down to the network hardware too.

Kinda weird optimization though; I'm not exactly sure why it works as well as it appears to; I am thinking that the gain may be far far less noticeable/important at <1gbps. This may be why there aren't any benchmarks for lower end devices or bandwidth constrained paths.

On another front, I wonder whether there would be any advantage to encapsulating a native UDP protocol like WireGuard in QUIC frames. Might this result in increased reliability and improved packet handling from firewalls and middleboxes?

tomohawk 7 months ago

GSO improves performance because you can send a ~64kB "datagram" through the stack, and then it will be split up either by the network card (if it supports GSO) or by the kernel. This is a lot more efficient than sending a pile of 1500 byte packets through the stack. It also saves a lot of context switches with the kernel.

By way of comparison, when you send data over TCP, the system call interface lets you write very large chunks, so a single system call can write gigabytes of data.
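For anyone who wants to see the splitting described above, here is a minimal sketch in Python. It only simulates the segmentation in user space; in practice the kernel or the NIC does this work. `segment` and `enable_udp_gso` are hypothetical helper names, and the `UDP_SEGMENT` value (103) is copied from Linux's `linux/udp.h` on the assumption that the `socket` module may not export it.

```python
import socket

# Assumption: numeric value of UDP_SEGMENT from Linux's <linux/udp.h>,
# defined here because the socket module may not export it.
UDP_SEGMENT = 103


def segment(payload: bytes, gso_size: int) -> list[bytes]:
    """Split one large buffer into gso_size-byte datagrams.

    This mimics what segmentation offload does after a single large send:
    every datagram is gso_size bytes except possibly the last one.
    """
    if gso_size <= 0:
        raise ValueError("gso_size must be positive")
    return [payload[i:i + gso_size] for i in range(0, len(payload), gso_size)]


def enable_udp_gso(sock: socket.socket, gso_size: int) -> bool:
    """Ask the kernel to segment large sends on this UDP socket.

    Returns False if the option is not supported on this system.
    """
    try:
        sock.setsockopt(socket.IPPROTO_UDP, UDP_SEGMENT, gso_size)
        return True
    except OSError:
        return False


payload = bytes(60_000)
segments = segment(payload, 1_400)
print(len(segments), len(segments[-1]))
```

With a 60,000-byte buffer and 1,400-byte segments, one trip through the stack stands in for 43 separate datagrams, which is where the savings come from.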

stock_toaster 7 months ago

  > GSO down to the network hardware too

Isn't the article specifically talking about tx-udp-segmentation in the TUN driver though, and not the hardware NIC?

From the article:

  > The TUN driver support in v6.2 was the missing piece
  > needed to improve UDP throughput over wireguard-go. With
  > it toggled on, wireguard-go can receive “monster” UDP 
  > datagrams from the kernel

dan-robertson 7 months ago

Wireguard udp packets are already pretty unstructured. The first four bytes are 4, 0, 0, 0. They have another 12 bytes of header and then encrypted data.
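To make that layout concrete, here is a small sketch that parses the 16-byte header described above. It assumes the fields after the type bytes are a 4-byte receiver index followed by an 8-byte counter, both little-endian, as in the published WireGuard protocol description; `parse_data_header` is a hypothetical name, not part of any WireGuard library.

```python
import struct

# Header of a transport data message: 1-byte type (4), 3 reserved zero bytes,
# 4-byte receiver index, 8-byte counter, little-endian (16 bytes in total).
HEADER = struct.Struct("<B3xIQ")
MSG_TYPE_DATA = 4


def parse_data_header(packet: bytes) -> tuple[int, int, bytes]:
    """Return (receiver_index, counter, encrypted_payload) for a data packet."""
    if len(packet) < HEADER.size:
        raise ValueError("packet too short")
    if packet[:4] != b"\x04\x00\x00\x00":
        raise ValueError("not a transport data message")
    _msg_type, receiver, counter = HEADER.unpack_from(packet)
    return receiver, counter, packet[HEADER.size:]


# Build a fake packet: header plus 32 bytes standing in for ciphertext.
sample = HEADER.pack(MSG_TYPE_DATA, 0x1234, 7) + b"\xaa" * 32
receiver, counter, ciphertext = parse_data_header(sample)
print(receiver, counter, len(ciphertext))
```

Everything after those 16 bytes is opaque to anything on the path, which is the sense in which the packets are "unstructured".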

ultrahax 7 months ago

Tangentially, I've faced some interesting challenges getting a multi-gigabit Wireguard VPN operating through my 2Gb Frontier connection.

My UDM Pro seems to top out around ~800mbit per UDP stream - pegged at 100% CPU on a single core. Likely it can't keep up with the interrupt rate, given it's ksoftirqd pegging it. Replaced UDM Pro with a pfsense machine.

Then I started getting 100% packet loss on the edge of Frontier's network after a couple of minutes of sustained UDP near-line-rate throughput. In the end, after trying and failing to explain this to Frontier's tech support, I reached out to their engineering management on LinkedIn, and got put in touch with the local NOC director. Turns out some intermediate hop is rebooting after a few minutes, and they're "in contact with the manufacturer". Haven't heard back in a few months.

tldr as >1Gb connections become more ubiquitous, other bottlenecks will become apparent!
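A rough back-of-the-envelope calculation shows why a single core can struggle at these rates. The sketch below assumes 1500-byte packets and ignores header overhead; the 64 KiB figure is only there to show what batching would change and is not a claim about what the UDM Pro does.

```python
def packets_per_second(bits_per_second: float, packet_bytes: int) -> float:
    """Packets needed each second to carry the given bit rate."""
    return bits_per_second / (packet_bytes * 8)


# One packet at a time: 800 Mbit/s in 1500-byte packets.
per_packet = packets_per_second(800e6, 1500)

# Same bit rate if the stack could handle 64 KiB batches instead.
batched = packets_per_second(800e6, 64 * 1024)

print(round(per_packet), round(batched))
```

That is roughly 67,000 packets per second for one core to process individually, versus about 1,500 units of work if they were batched.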

jjjjmoney 7 months ago

> I reached out to their engineering management on LinkedIn, and got put in touch with the local NOC director.

I hate that this is a thing. I'm dealing with a similar potential issue on Charter Spectrum right now. Specifically it's an issue that's called out here (failing the IPv4 fragmentation test).

How on earth is one supposed to get past the front-line tech support in 2023?

thfuran 7 months ago

You're not supposed to.

tialaramex 7 months ago

You could look for a better ISP. The larger problem is that in the US it's completely normal for there to be no actual choice, or for your "choice" to be between two equally huge uninterested corporations who know they don't need to be better than each other to keep the same revenue.

Separating the last mile infrastructure from the ISP can make it possible to have natural monopoly for everybody's last miles, but widespread competition for ISPs. That might be really hard to pull off in the US but I think it'd be worth striving for.

sofixa 7 months ago

> Separating the last mile infrastructure from the ISP can make it possible to have natural monopoly for everybody's last miles, but widespread competition for ISPs. That might be really hard to pull off in the US but I think it'd be worth striving for.

Or even better, the model we have in France. The last mile is a monopoly for a limited time only (2-3 years). So if you build a connection to some place that didn't have one, you can profit off exclusivity for some time, and are incentivised to be good to the consumers because they can switch, but will probably only do so if you're shit/too expensive.

thfuran 7 months ago

sammy2255 7 months ago

Through LinkedIn evidently

keep_reading 7 months ago

Smells like Sandvine traffic shaper falling over or something.

remram 7 months ago

Tailscale uses a custom, userspace, Golang implementation of Wireguard instead of the kernel one? Why?

TheDong 7 months ago

Using the same code across more OSs than just linux seems nice.

Also, it's based on code by the wireguard author:

They customized it some, but it's all more or less upstream condoned code that Jason built.

Also, if you want to access your tailscale network, but don't have permissions to create a tun or wg device, the fully userspace implementation can work in that situation, which seems like a nice property to have.

H8crilA 7 months ago

Also, Wireguard is really easy to implement, making it less of a problem to have multiple implementations. Each implementation is more likely to be correct/invulnerable.

Small implementation was a design objective of Wireguard, after the horrors of IPsec (see Linus' email that praises the difference).

dastbe 7 months ago

what’s the right way to interpret the last section on cpu utilization? i.e. now that you’re able to achieve 12.5gbps how much overhead is this at a machine level?

also was ena express used on the c6i.8xlarge? that should allow for getting past the ec2 single flow limits.

fulafel 7 months ago

I wonder what explains the large gap between Mellanox and AWS kernel wireguard performance (the original Go code has a much smaller difference so shouldn't be just CPU speed difference).

imperialdrive 7 months ago

Y'all are impressive—Kudos!

cmer 7 months ago

Off-topic: Tailscale seems like such a perfect acquisition target for Cloudflare. It seems like there's perfect product and culture alignment. Amirite?

wmf 7 months ago

I don't think they want to be acquired and I'm not sure it's healthy for everything to get sucked into Cloudflare either.

candiddevmike 7 months ago

Cloudflare tunnels/zero trust largely do the same thing as Tailscale. Yes there are some gaps but probably not acquisition worthy

apitman 7 months ago

As far as I know Cloudflare doesn't do any p2p/e2ee, which is a pretty fundamental difference

neverrroot 7 months ago

Hope not. CF is by now too large to really care about “retail” that much.

aerhardt 7 months ago

What do you mean by “retail”? Because Cloudflare does plenty of products for smaller teams - I’d even say it’s a cornerstone of their strategy.

vluft 7 months ago

I sure hope you are not.

dontdoxxme 7 months ago

Not sure if serious. Tailscale is end to end encrypted.

Cloudflare is probably just this generation’s Crypto AG:

djha-skin 7 months ago

This is pretty cool, but I would have liked to see more benchmarks around phones to servers instead of Linux box to Linux box.

UDP and QUIC are most effective serving traffic from phones to servers, not from Linux box to Linux box.

Linux boxes are typically either servers or sit behind a corporate firewall, e.g. user laptops such as the one I use at work. There are distinct disadvantages to running QUIC in that environment:

* UDP is often blocked by default by corporate firewalls.

* Having everything in user space means having to actually update the user software before getting a security patch deep in the transport layer. Compare that with getting a kernel patch to fix a TCP vulnerability, which typically happens more often on a Linux box and is more stable than updating userspace software.

* TCP throughput in a data center or behind a corporate firewall is typically fast enough for most needs.

However, from a phone on a cell tower, QUIC starts to make sense:

* Having everything in user space means I can update for security patches every time I update the app, which is much more frequent than OS updates on, for example, Android.

* Having everything over UDP means I can get the usual benefits of avoiding head-of-line blocking that are so often touted, with top-notch security as well.

FridgeSeal 7 months ago

> UDP is often blocked by default by corporate firewalls.

I mean, that sounds a lot like a “them problem”. Kind of like the people that can’t use gRPC because their silly corporate firewall MITMs the encryption. The rest of us aren’t beholden to archaic IT decisions.

> TCP throughput in a data center or behind a corporate firewall is typically fast enough for most needs

Yeah but if I can go even faster, why wouldn’t I? QUIC gives per-substream back pressure out of the box, and that’s so useful! No HoL blocking, substreams, out-of-order reassembly: there are so many neat features, why wouldn’t you want to use it, given the chance?

> Having everything in user space…

Means we basically get “user space networking for the rest of us” and that we can run QUIC implementations that fit our own application’s requirements.
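Since head-of-line blocking comes up a few times in this subthread, here is a toy model of the difference. It is a simplified simulation, not an implementation of TCP or QUIC: packets are numbered in send order, each belongs to a named stream, and one packet is assumed lost. `deliver_single_stream` and `deliver_per_stream` are hypothetical names.

```python
def deliver_single_stream(streams: list[str], arrived: set[int]) -> list[int]:
    """TCP-like: one ordered stream, so delivery stops at the first gap."""
    delivered = []
    for seq in range(len(streams)):
        if seq not in arrived:
            break
        delivered.append(seq)
    return delivered


def deliver_per_stream(streams: list[str], arrived: set[int]) -> dict[str, list[int]]:
    """QUIC-like: each stream is ordered on its own, so a gap only
    stalls the stream the missing packet belongs to."""
    delivered = {name: [] for name in streams}
    blocked = set()
    for seq, name in enumerate(streams):
        if name in blocked:
            continue
        if seq in arrived:
            delivered[name].append(seq)
        else:
            blocked.add(name)
    return delivered


# Six packets alternating between streams "a" and "b"; packet 1 is lost.
streams = ["a", "b", "a", "b", "a", "b"]
arrived = {0, 2, 3, 4, 5}

single = deliver_single_stream(streams, arrived)
multi = deliver_per_stream(streams, arrived)
print(single, multi)
```

In the single-stream case only packet 0 gets through until the retransmit arrives, while with per-stream ordering stream "a" keeps flowing and only stream "b" waits.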

KMag 7 months ago

Regarding kernel patches coming more often than userspace updates, presumably you mean in the context of random third-party apps not maintained as well as the major browsers. That's fair, but if QUIC becomes popular enough, I imagine we'll see distros including QUIC dlls that these minor apps link against.