Just Say No to Paxos Overhead: Replacing Consensus with Network Ordering (2016)

110 points9 days26
jpgvm9 days ago

They aren't wrong but they are basically just digging up the Tandem architecture and adapting it to the current RoCE/FCoE enhanced ethernet we have today.

Ironically that is where most of our newer interconnects have their heritage. PCI-e is descendant from Infiniband which in turn is decendant from ServerNet which was developed at Tandem as the internal interconnect for their NonStop systems for the very usecases this paper describes.

i.e the very ordered networking this relies on and the architectures it enables were invented ~25 years ago.

Unfortunately the reality of today is we don't build data centres and especially not for very highly available systems. Instead we pay that latency penalty to build geographically redundant systems using cheap rented virtual hardware from 3rd parties.

This is in part because of how the always on nature of the internet changed the availability requirements for most software (which is now delivered as SaaS) from business hours in 1 timezone to always-on everywhere in the world. It's no longer an excuse for a business critical application to be down because "DC lost power" or some variation of that.

I'm old and grumpy though so whatever, everything old becomes new again eventually.

jiggawatts9 days ago

I recently got to work on some HPE NonStop hardware for the local emergency number call centre. As you can imagine, availability was a critical requirement, because it truly is a 24/7 service and you can't have staff interrupted in a middle of a call when someone is dying.

The actual final result was a shambles and couldn't remotely achieve the desired availability level. For example, mirrored or synchronized systems are more difficult to update by their very nature. They're also highly constrained physically, such as having to have physical proximity. They're so expensive that the org couldn't afford enough of them to provide dev/test/staging etc... environments. They were also already out-of-date and barely supported, because upgrading every 5 years was also too expensive.

My $0.02 was that modern development practices of highly redundant but ordinary commodity servers would have been both far cheaper and far more robust in practice.

Mind you, that specific case had a particularly schizophrenic architecture: They decided they wanted a Tier-1 data centre for availability, but their core application couldn't tolerate the latency increase. So they used Citrix to paper over this, but Citrix is complex and has many moving parts prone to failure. They also wanted security, so they tunnelled everything over the various WAN links. They would have been better off just leaving the servers in the same building as the call centre: no long distance links to break, no need for SSL gateways, no need for fibre rings or router failover, etc...

My method would have been to have two call centres, each with local compute. Much simpler, fewer things to break, and truly highly available.

hinkley9 days ago

I joke about how we're just slowly rediscovering all of the tech developed for supercomputers in the late 80's to early 90's. Except it's not really a joke, is it?

Time sharing and partitioning were big deals in Big Iron, and here we are piling little iron into heaps and having the same set of problems.

bob10298 days ago

Don't forget the power of batch processing. Even today, you can dramatically speed up processing of work by putting things into batches. Speedups of 3-4 orders of magnitude can easily be realized, even if you hedge and do micro batches (10-1000 microseconds wide).

To a consumer of one of these systems, you can make the batched processing occur so quickly as to seem real-time in nature, while still realizing most of the low-level CPU architecture benefits of doing things in reasonably-sized chunks.

There are benefits to doing it the old-school way too (i.e. nightly). Giving the business ample time to respond to errors before EOD processing is a huge perk in many industries where the downstream consequences of errors cannot be reconciled. Real-time processing is wonderful, but there are still a lot of places where you don't really need it or where it causes more harm than good.

ignoramous8 days ago

> Unfortunately the reality of today is we don't build data centres and especially not for very highly available systems. Instead we pay that latency penalty to build geographically redundant systems using cheap rented virtual hardware from 3rd parties.

Reminds me of AWS' contrasting approach (reliable delivery, out-of-order messages) in dealing with the latency penalty using specialized hardware and a proprietary transport protocol viz. SRD (Scalable Reliable Datagram):

> They aren't wrong but they are basically just digging up the Tandem architecture and adapting it to the current RoCE/FCoE enhanced ethernet we have today.

Pat Helland, who worked on Tandem, published a paper on a novel way to avoid distributed transactions, which I believe is similar to Google Megastore's design (on top of BigTable) and seems (to my untrained eye) spiritually closer to Cloudflare's Durable Objects:

AceJohnny29 days ago

that's funny, the tech lead at my last company was formerly from Tandem, and the architecture of our HA product reflected that :)

It was a cool system, with architecture completely alien to the consumer products I work on nowadays, and I kinda miss QNX.

Did you know a gregk?

nigwil_9 days ago

"PCI-e is descendant from Infiniband which in turn is decendant from ServerNet ..." When you say PCIe descended from IB, at the signalling layer or a higher layer? Can you suggest a starting place to understand this lineage please, was it in the PCIe standards group discussions? I've not heard of this relationship before and I'm not having much luck finding the original discussion.

jpgvm8 days ago

PCI-e's IB heritage is present in every aspect of the standard, it's electrically the same, signalling and encoding are the same etc.

In the early days of all this there was a few competing standards that all had the same underlying architecture, serial, 8b/10b encoding, etc. The 2 big ones were NGIO and FutureIO. These two were eventually merged and was called ServerIO before being renamed to Infiniband.

Infiniband was going to be the next big thing and a few companies set out to make the silicon, Mellanox, QLogic/Silverstorm, Topspin/Cisco among the main ones.

Sun and Microsoft both committed to building the drivers to make this work. However pretty soon MS ran into issues with their drivers and eventually pulled the plug on shipping IB drivers in NT. This caused Cisco to drop IB and eventually doomed IB in the enterprise datacenter space and settled for Ethernet.

Sun pretty much followed suit for standard enterprise data centers but kept working on IB drivers and IB switches for their HPC and storage units. You can still see it kicking around in stuff like Exadata.

Intel is mainly responsible for making PCI-e happen. They had proposed one of the initial standards that became IB, NGIO. When NGIO was fused with FutureIO to create IB the same group renamed it to Arapahoe which was later renamed 3GIO before being renamed to PCI-e when standardized by the PCI SIG. Design wise little changed it just eschewed some of the switching complexity and focused on being an internal system interconnect.

So in many ways it's not quite PCI-e is descendant from IB but rather PCI-e -is- IB lol.

p_l9 days ago

You can see very early signs of PCI Express in intel presentations about 3GIO, which originally had a lot more external cabling in vision (which finally became common with thunderbolt, but with extra tunnel layer)

rkagerer9 days ago

Interesting, I hadn't heard that. I always just assumed it was a descendant of PCI and ISA.

Here's a history video on busses from ISA to PCIe:

w79 days ago

Yeah, that statement piqued my curiosity and I'd like a source too. I couldn't tell if it was meant there was a direct lineage or that work and concepts were done in tandem (ha) with the relevant standards bodies and engineers. Googling didn't yield much except InfiniBand's earlier genesis than PCIe, and Intel engineers being involved with InfiniBand and doing the initial work on PCIe.

strictfp9 days ago

When working on networked control systems I realized that ordered networking is the reason why automation still uses proprietary protocols and interconnects such as Profibus and Profinet.

If you want good guarantees you need determinism.

foobiekr9 days ago

man, I agree with this so much. reading the tandem technical articles were very enlightening for me when I was a younger engineer.

wahern9 days ago

> The first aspect of our design is network serialization, where all OUM packets for a particular group are routed through a sequencer on the common path.

This solution actually just shunts the problem to a different layer. To be robust to sequencer failure and rollover, you will need to rely on an inner consensus protocol to choose sequencers. Which is basically how Multi-Paxos, Raft, etc work--you use the costly consensus protocol to pick a leader, and thereafter simply rely on the leader to serialize and ensure consistency.

It seems like an interesting paper w/ a novel engineering model and useful proofs. But from an abstract algorithmic perspective it doesn't actually offer anything new, AFAICT. There are an infinite number of ways to shuffle things around to minimize the components and data paths directly reliant on the consensus protocol. None obviate the need for an underlying consensus protocol, absent problems where invariants can be maintained independently.

hardwaresofton9 days ago

Shameless plug for my list of paxos variants out there:

Unfortunately there are a few I'm missing and I haven't had time to update that post or make a new one -- I haven't had time to read these and try to fit them in the narrative:

- Compartmentalized Paxos (

- Matchmaker Paxos (

- PigPaxos (

- Lineraizable Quorum reads in Paxos (

klodolph9 days ago

Yeah, this paper gets dredged up every now and then. Seems a bit like a cheap trick to say “just say no to Paxos overhead” and then sequence network requests. I’m all for exploring alternatives to Paxos, but if you’re doing something so novel, my response is that I’ll believe it when I see it operate at scale.

I’m just not sure how you would sequence requests in a typical setup, with Clos fabrics everywhere, possibly when your redundancy group is spread across different geographical locations. Wouldn’t you need some kind of queue to reorder messages? That queue could get large, and quickly.

Paxos and Raft have the advantage of being simple and easy to understand. (Not necessarily easy to incorporate into system designs, sure, and the resulting systems can be fairly complicated, but Paxos and Raft themselves are simple enough to fit on notecards.)

throwawaaarrgh9 days ago

From what I understand, the concept is as simple as "No packets out of order cross the border", and "Retransmit if your packets ain't legit". Will there be lag? Sure. But will it be faster 98% of the time than Paxos? Probably. (But I also could be reading it wrong)

> when your redundancy group is spread across different geographical locations

I think 99% of the time this is just a bad idea, unless the thing you're accessing is already strongly consistent somehow. If a service has to be accessible across regions, there's potentially a million reasons for the service to be loosely coupled across those regions besides the availability issues.

hinkley9 days ago

I don't get to apply this enough, but I'm a big fan of distributed read, centralized write solutions. So many problems we have are read-mostly, but we mask them by doing a bunch of feature factory work that turns what should be a cheap, rather simple solution into something that takes a huge amount of resources to implement.

Having read replicas in several locations makes it easier to stand up a new master if your main data center catches fire (OVH)

gfv9 days ago

>in a typical setup, with Clos fabrics everywhere

You choose a single spine switch to carry the multicast messages destined to the process group. The paper also explicitly notes that different process groups need not to share the sequencer.

>when your redundancy group is spread across different geographical locations

The paper applicability is limited to data center networks with programmable network switching hardware.

infogulch9 days ago

Kinda neat. Splits the big problem of consensus into ordering and replication, and then leans on a network device like a switch to 'solve' the ordering problem in the context of a single data center. The key observation is that all those packets are going through the switch anyways, and the switch has enough spare compute to maintain a counter and add it as a header to packets, and it can easily be dynamically programmed with SDN...

I bet public clouds could offer this as a specialized 'ordered multicast vnet' infrastructure primitive to build intra-dc replicated systems on top of.

strictfp9 days ago

Reminds me of the hack of using auxiliary information such as AWS group metadata to determine ordering.

If there's already an ordering system in place, directly or indirectly, why not use it?

mro_name8 days ago

> Network- Ordered Paxos (NOPaxos)


noxer8 days ago

Stupid banner about politically motivated garbage virtue signaling that only US people would ever care about is the first thing I read on that page.

And it the last thing too because I closed the page. Could people just stop with this nonsense.

brighton369 days ago

These consensus systems are usually solutions in search of a problem. It's pretty rare for these consensus systems to offer an efficiency in practice...

pfraze9 days ago

To be clear, this paper is referring to non-decentralized (highly-consistent I assume) consensus algorithms where the goal is to operate a cluster of machines as one logical system. You use this, for instance, to maintain high uptime in the face of individual machines going down.

I suspect you were reacting to decentralized consensus (blockchains) which is a pretty different space.