Increasing QUIC and UDP Throughput over Tailscale

254 points15
gorkish 15 days ago

GSO (Generic segmentation offload) is supported for UDP in the Linux kernel from 4.18 so I would assume these optimizations should be easily possible for the kernel driver as well and can take advantage of any drivers that pass the GSO down to the network hardware too.

Kinda weird optimization though; I'm not exactly sure why it works as well as it appears to; I am thinking that the gain may be far far less noticeable/important at <1gbps. This may be why there aren't any benchmarks for lower end devices or bandwidth constrained paths.

On another front, I wonder whether there would be any advantage to encapsulating a native UDP protocol like wireguard in QUIC frames. Might this result in increased reliability and improved packet handling from firewalls and middleboxes?

tomohawk 15 days ago

GSO improves performance because you can send a ~64kB "datagram" through the stack, and then it will be split up either by the network card (if it supports UDP segmentation offload) or by the kernel just before transmission. This is a lot more efficient than sending a pile of 1500 byte packets through the stack. It also saves a lot of context switches with the kernel.
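
To make the batching concrete, here is a rough Python sketch that mimics what segmentation does to one large buffer. It is only a simulation under stated assumptions: `segment` and `syscalls_needed` are helper names made up for illustration, the 1400-byte segment size is an arbitrary example, and `enable_udp_gso` assumes Linux, where the `UDP_SEGMENT` socket option (value 103 in `linux/udp.h`) exists from 4.18 onward.

```python
import socket

# From linux/udp.h; written as a literal because this sketch does not
# assume the socket module exports the constant.
UDP_SEGMENT = 103


def segment(payload: bytes, gso_size: int) -> list[bytes]:
    """Split one large buffer into gso_size chunks, as segmentation would."""
    if gso_size <= 0:
        raise ValueError("gso_size must be positive")
    return [payload[i:i + gso_size] for i in range(0, len(payload), gso_size)]


def syscalls_needed(total_bytes: int, bytes_per_call: int) -> int:
    """Number of send calls needed when each call carries bytes_per_call."""
    return -(-total_bytes // bytes_per_call)  # ceiling division


def enable_udp_gso(sock: socket.socket, gso_size: int) -> bool:
    """Ask the kernel to segment large sends on this socket (Linux only)."""
    try:
        sock.setsockopt(socket.IPPROTO_UDP, UDP_SEGMENT, gso_size)
        return True
    except OSError:
        # Option not supported on this platform or kernel.
        return False
```

With a 1400-byte segment size, a 63,000-byte buffer becomes 45 segments, so one send call stands in for 45 separate ones.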

By way of comparison, when you send data over TCP, the system call interface lets you write very large chunks, so a single system call can write gigabytes of data.

stock_toaster 15 days ago

  > GSO down to the network hardware too
Isn't the article specifically talking about tx-udp-segmentation in the tun driver though, and not the hardware nic adapter?

From the article:

  > The TUN driver support in v6.2 was the missing piece
  > needed to improve UDP throughput over wireguard-go. With
  > it toggled on, wireguard-go can receive “monster” UDP 
  > datagrams from the kernel

dan-robertson 15 days ago

Wireguard udp packets are already pretty unstructured. The first four bytes are 4, 0, 0, 0. They have another 12 bytes of header and then encrypted data.
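
That layout is simple enough to parse in a few lines. A minimal Python sketch, assuming the transport data message layout from the WireGuard protocol description (1-byte type 4, 3 reserved zero bytes, 4-byte little-endian receiver index, 8-byte little-endian counter, then the encrypted payload); `parse_transport_header` is a name made up for illustration:

```python
import struct

# type (u8), 3 reserved bytes, receiver index (u32 LE), counter (u64 LE)
HEADER = struct.Struct("<B3xIQ")
TRANSPORT_DATA = 4


def parse_transport_header(packet: bytes) -> tuple[int, int, bytes]:
    """Return (receiver_index, counter, encrypted_payload) for a type-4 message."""
    if len(packet) < HEADER.size:
        raise ValueError("packet shorter than the 16-byte header")
    msg_type, receiver, counter = HEADER.unpack_from(packet)
    if msg_type != TRANSPORT_DATA or packet[1:4] != b"\x00\x00\x00":
        raise ValueError("not a transport data message")
    # Everything after the header is opaque ciphertext to an observer.
    return receiver, counter, packet[HEADER.size:]
```

The 4 bytes of type plus padding and the 12 bytes of index and counter add up to the 16-byte header described above.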

ultrahax 15 days ago

Tangentially, I've faced some interesting challenges getting a multi-gigabit Wireguard VPN operating through my 2Gb Frontier connection.

My UDM Pro seems to top out around ~800mbit per UDP stream - pegged at 100% CPU on a single core. Likely it can't keep up with the interrupt rate, given it's ksoftirqd pegging it. Replaced UDM Pro with a pfsense machine.
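
For a sense of scale, a back-of-the-envelope Python calculation of the packet rate behind those numbers. It assumes 1500-byte packets and ignores interrupt coalescing, so it is an upper bound on per-packet events rather than a measured interrupt rate; `packets_per_second` is a name made up here:

```python
def packets_per_second(bits_per_second: int, packet_bytes: int = 1500) -> int:
    """Whole packets per second needed to carry the given bit rate."""
    return bits_per_second // (packet_bytes * 8)


print(packets_per_second(800_000_000))    # ~800 Mbit/s single stream
print(packets_per_second(2_000_000_000))  # 2 Gbit/s line rate
```

That is roughly 66,000 packets per second at 800 Mbit/s and 166,000 at 2 Gbit/s, which is the per-packet load one core has to absorb when a single flow is not spread across cores.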

Then I started getting 100% packet loss on the edge of Frontier's network after a couple of minutes of sustained UDP near-line-rate throughput. In the end, after trying and failing to explain this to Frontier's tech support, I reached out to their engineering management on LinkedIn, and got put in touch with the local NOC director. Turns out some intermediate hop is rebooting after a few mins, and they're "in contact with the manufacturer". Haven't heard back in a few months.

tldr as >1Gb connections become more ubiquitous, other bottlenecks will become apparent!

jjjjmoney 15 days ago

> I reached out to their engineering management on LinkedIn, and got put in touch with the local NOC director.

I hate that this is a thing. I'm dealing with a similar potential issue on Charter Spectrum right now. Specifically it's an issue that's called out here (failing the IPv4 fragmentation test).

How on earth is one supposed to get past the front-line tech support in 2023?

thfuran 15 days ago

You're not supposed to.

tialaramex 14 days ago

You could look for a better ISP. The larger problem is that in the US it's completely normal for there to be no actual choice, or for your "choice" to be between two equally huge uninterested corporations who know they don't need to be better than each other to keep the same revenue.

Separating the last mile infrastructure from the ISP can make it possible to have natural monopoly for everybody's last miles, but widespread competition for ISPs. That might be really hard to pull off in the US but I think it'd be worth striving for.

sofixa 14 days ago

> Separating the last mile infrastructure from the ISP can make it possible to have natural monopoly for everybody's last miles, but widespread competition for ISPs. That might be really hard to pull off in the US but I think it'd be worth striving for.

Or even better, the model we have in France. The last mile is a monopoly for a limited time only (2-3 years). So if you build a connection to some place that didn't have one, you can profit off exclusivity for some time, and are incentivised to be good to the consumers because they can switch, but will probably only do so if you're shit/too expensive.

thfuran 14 days ago

sammy2255 14 days ago

Through LinkedIn evidently

keep_reading 15 days ago

Smells like Sandvine traffic shaper falling over or something.

remram 15 days ago

Tailscale uses a custom, userspace, Golang implementation of Wireguard instead of the kernel one? Why?

TheDong 15 days ago

Using the same code across more OSs than just linux seems nice.

Also, it's based on code by the wireguard author:

They customized it some, but it's all more or less upstream condoned code that Jason built.

Also, if you want to access your tailscale network, but don't have permissions to create a tun or wg device, the fully userspace implementation can work in that situation, which seems like a nice property to have.

H8crilA 14 days ago

Also, Wireguard is really easy to implement, making it less of a problem to have multiple implementations. Each implementation is more likely to be correct/invulnerable.

Small implementation was a design objective of Wireguard, after the horrors of IPsec (see Linus' email that praises the difference).

dastbe 15 days ago

what’s the right way to interpret the last section on cpu utilization? i.e. now that you’re able to achieve 12.5gbps how much overhead is this at a machine level?

also was ena express used on the c6i.8xlarge? that should allow for getting past the ec2 single flow limits.

fulafel 15 days ago

I wonder what explains the large gap between Mellanox and AWS kernel wireguard performance (the original Go code has a much smaller difference so shouldn't be just CPU speed difference).

imperialdrive 15 days ago

Y'all are impressive—Kudos!

cmer 15 days ago

Off-topic: Tailscale seems like such a perfect acquisition target for Cloudflare. It seems like there's perfect product and culture alignment. Amirite?

wmf 15 days ago

I don't think they want to be acquired and I'm not sure it's healthy for everything to get sucked into Cloudflare either.

candiddevmike 15 days ago

Cloudflare tunnels/zero trust largely do the same thing as Tailscale. Yes there are some gaps but probably not acquisition worthy

apitman 14 days ago

As far as I know Cloudflare doesn't do any p2p/e2ee, which is a pretty fundamental difference

neverrroot 15 days ago

Hope not. CF has by now grown too large to really care about “retail” that much.

aerhardt 14 days ago

What do you mean by “retail”? Because Cloudflare does plenty of products for smaller teams - I’d even say it’s a cornerstone of their strategy.

vluft 15 days ago

I sure hope you are not.

dontdoxxme 15 days ago

Not sure if serious. Tailscale is end to end encrypted.

Cloudflare is probably just this generation’s Crypto AG:

djha-skin 15 days ago

This is pretty cool, but I would have liked to see more benchmarks around phones to servers instead of Linux box to Linux box.

UDP and QUIC are most effective serving traffic from phones to servers, not from Linux box to Linux box.

Linux boxes are typically either servers or behind a corporate firewall, e.g. user laptops such as the one I use at work. There are distinct disadvantages to running QUIC in that environment:

* UDP is often blocked by default by corporate firewalls.

* Having everything in user space means having to actually update the user software before getting a security patch deep in the transport layer. Compare that with getting a kernel patch to fix a TCP vulnerability, which typically happens more often on a Linux box and is more stable than updating userspace software.

* TCP throughput in a data center or behind a corporate firewall is typically fast enough for most needs.

However, from a phone on a cell tower, QUIC starts to make sense:

* Having everything in user space means I can update for security patches every time I update the app, which is much more frequent than OS updates on, for example, Android.

* Having everything over UDP means I can get the usual non head of line blocking benefits so often touted, with top notch security as well.

FridgeSeal 15 days ago

> UDP is often blocked by default by corporate firewalls.

I mean, that sounds a lot like a “them problem”. Kind of like the people that can’t use grpc because their silly corporate firewall MITM’s the encryption. The rest of us aren’t beholden to archaic IT decisions.

> TCP throughput in a data center or behind a corporate firewall is typically fast enough for most needs

Yeah but if I can go even faster, why wouldn’t I? Quic gives per-sub-stream back pressure out of the box, and that’s so useful! No HoL blocking, substreams, out-of-order reassembly; there are so many neat features, why wouldn’t you want to use it, given the chance?
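
To illustrate the head-of-line blocking point, here is a toy Python model. It is not a QUIC implementation, just a simplified sketch: it assumes each packet carries data for exactly one stream, and `deliverable` is a name made up for this example. It compares what an application can read after one packet is lost when everything shares a single ordering (TCP-like) versus when each stream is ordered on its own (QUIC-like).

```python
def deliverable(sent, lost, shared_order):
    """Return the packets an application can read before any retransmission.

    sent: list of (seq, stream_id) tuples in sequence order.
    lost: set of seq numbers that did not arrive.
    shared_order=True  -> one ordered byte stream for the whole connection.
    shared_order=False -> each stream is ordered independently.
    """
    blocked = set()  # ordering domains stuck behind a gap
    out = []
    for seq, stream in sent:
        # With a shared order there is a single domain for all streams.
        domain = None if shared_order else stream
        if seq in lost:
            blocked.add(domain)
        elif domain not in blocked:
            out.append((seq, stream))
    return out


sent = [(0, "a"), (1, "b"), (2, "a"), (3, "b"), (4, "c")]
print(deliverable(sent, {1}, shared_order=True))
print(deliverable(sent, {1}, shared_order=False))
```

In the shared-order case only packet 0 is readable until packet 1 is retransmitted; with independent ordering, streams "a" and "c" keep flowing and only stream "b" waits.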

> Having everything in user space…

Means we basically get “user space networking for the rest of us” and that we can run Quic implementations that fit our own application’s requirements.

KMag 15 days ago

Regarding kernel patches coming more often than userspace updates, presumably you mean in the context of random third-party apps not maintained as well as the major browsers. That's fair, but if QUIC becomes popular enough, I imagine we'll see distros including QUIC dlls that these minor apps link against.