
The road to Zettalinux

274 points | 13 hours ago | lwn.net
jmillikin12 hours ago

The section about 128-bit pointers being necessary for expanded memory sizes is unconvincing -- 64 bits provides 16 EiB (16 x 1024 x 1024 x 1024 x 1 GiB), which is the sort of address space you might need for byte-level addressing of a warehouse full of high-density HDDs. Memory sizes don't grow like they used to, and it's difficult to imagine what kind of new physics would let someone fit that many bytes into a machine that's practical to control with a single Linux kernel instance.

CHERI is a much more interesting case, because it expands the definition of what a "pointer" is. Most low-level programmers think of pointers as just an address, but CHERI turns it into a sort of tuple of (address, bounds, permissions) -- every pointer is bounds-checked. The CHERI folks did some cleverness to pack that all into 128 bits, and I believe their demo platform uses 128-bit registers.

The article also touches on the UNIX-y assumption that `long` is pointer-sized. This is well known (and well hated) by anyone who has to port software from UNIX to Windows, where `long` and `int` are the same size and `long long` is pointer-sized. I'm firmly in the camp of using fixed-size integers, but the Linux kernel uses `long` all over the place, and unless they plan to do a mass migration to `intptr_t` it's difficult to imagine a solution that would let the same C code support 32-, 64-, and 128-bit platforms.
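A minimal illustration of the trap (plain C99, not kernel code, and only a sketch): on LP64 Unix `long` happens to match the pointer size, on LLP64 Windows it does not, while `intptr_t` is pointer-sized by definition on both.

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        printf("sizeof(int)      = %zu\n", sizeof(int));       /* 4 on both LP64 and LLP64 */
        printf("sizeof(long)     = %zu\n", sizeof(long));      /* 8 on LP64 Unix, 4 on LLP64 Windows */
        printf("sizeof(void *)   = %zu\n", sizeof(void *));    /* 8 on both */
        printf("sizeof(intptr_t) = %zu\n", sizeof(intptr_t));  /* always pointer-sized */

        void *p = &p;
        intptr_t ip = (intptr_t)p;   /* portable pointer-to-integer round trip... */
        void *q = (void *)ip;        /* ...and back */
        return p == q ? 0 : 1;
    }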

(comedy option: 32-bit int, 128-bit long, and 64-bit `unsigned middle`)

The article also mentions Rust types as helpful, but Rust has its own problems with big pointers because they inadvisably merged `size_t`, `ptrdiff_t`, and `intptr_t` into the same type. They're working on adding equivalent symbols to the FFI module[0], but untangling `usize` might not be possible at this point.

[0] https://github.com/rust-lang/rust/issues/88345

dragontamer12 hours ago

> it's difficult to imagine what kind of new physics would let someone fit that many bytes into a machine that's practical to control with a single Linux kernel instance.

I nominally agree with most of your post. But I should note that modern systems seem to be moving towards a "one pointer space" for the entire cluster. For example, 8 GPUs + 2 CPUs would share the same virtual memory space (GPU#1 may take one slice, GPU#2 takes another, etc. etc.).

This allows for RDMA (i.e. mmap across Ethernet and other networking technologies). If everyone has the same address space, then you can share pointers / graphs between nodes and the underlying routing/ethernet software will be passing the data automatically between all systems. It's actually quite convenient.

I don't know how the supercomputer software works, but I can imagine 4000 CPUs + 16000 GPUs all sharing the same 64-bit address space.

cmrdporcupine12 hours ago

It seems to me that such a memory space could be physically mapped quite large while still presenting 64-bit virtual memory addresses to the local node? How likely is it that any given node would be mapping out more than 2^64 bytes worth of virtual pages?

The VM system could quite simply track the physical addresses as a pair of `u64_t`s or whatever, and present those pages as 64-bit pointers.

It seems in particular you might want to have this anyways, because the actual costs for dealing with such external memories would have to be much higher than local memory. Optimizing access would likely involve complicated cache hierarchies.

I mean, it'd be exciting if we had need for memory space larger than 2^64 but I just find it implausible with current physics and programs? But I'm also getting old.

maxwell8610 hours ago

> How likely is it that any given node would be mapping out more than 2^64 bytes worth of virtual pages?

In the Grace Hopper whitepaper, NVIDIA says that they connect multiple nodes with a fabric that allows them to create a virtual address space across all of them.

rektide12 hours ago

Leaving a cluster-coherent address space behind is doable, like you say. But you lose what the parent was saying:

> If everyone has the same address space, then you can share pointers / graphs between nodes and the underlying routing/ethernet software will be passing the data automatically between all systems. It's actually quite convenient.

rwmj12 hours ago

Distributed Shared Memory is a thing, but I'm not sure how widely it is used. I found that it gives you all the coordination problems of threads in symmetric multiprocessing but at a larger scale and with much slower synchronisation.

https://en.wikipedia.org/wiki/Distributed_shared_memory

dragontamer11 hours ago

https://en.wikipedia.org/wiki/Remote_direct_memory_access

Again, I'm not a supercomputer programmer. But the whitepapers often discuss RDMA.

From my imagination, it sounds like any other "mmap". You, the programmer, just remember that the mmap'd region is slower (since it is a read/write to a disk, rather than to RAM). Otherwise, you treat it "like RAM" from a programming perspective, entirely for convenience's sake.

As long as you know `my_mmap_region->next = foobar();` is a slow I/O operation pretending to be memory, you're fine.
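For concreteness, here is what that looks like with ordinary POSIX mmap() over a local file; the RDMA case swaps the backing store for remote memory but keeps the same "it's just a pointer" programming model. This is only a sketch, and `graph.bin` is a hypothetical, pre-sized file.

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    struct node {
        long value;
        long next;   /* stored as an offset so the mapping stays relocatable */
    };

    int main(void)
    {
        int fd = open("graph.bin", O_RDWR);   /* hypothetical 1 MiB file */
        if (fd < 0)
            return 1;

        /* Every load/store through this pointer is really I/O pretending to be memory. */
        struct node *region = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);
        if (region == MAP_FAILED)
            return 1;

        region[0].value = 42;              /* a "store" that eventually hits the backing store */
        msync(region, 1 << 20, MS_SYNC);   /* make the slow part explicit */

        munmap(region, 1 << 20);
        close(fd);
        return 0;
    }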

---------

Modern systems are converging upon this "single address space" programming model. PCIe 3.0 implements atomic operations and memory barriers, and CXL is going to add cache coherence over a remote / I/O interface. This means that all your memory barriers / atomics / synchronization can be atomic operations, and the OS will automatically translate these memory commands into the proper I/O-level atomics/barriers to ensure proper synchronization.

This is all very new, only within the past few years. But I think it's one of the most exciting things about modern computer design.

Yes, it's slow. But it's consistent and accurately modeled by all elements in the chain. Atomic compare-and-swap over RDMA can allow for cache-coherent communications and synchronization over Ethernet, over GPUs, over CPUs, and any other accelerators sharing the same 64-bit memory space. Maybe not quite today, but soon.

This technology already exists for PCIe 3.0 CPU+GPU synchronization from 8 years ago (Shared Virtual Memory). It's exciting to see it extend out into more I/O devices.

FuriouslyAdrift7 hours ago

RDMA is used heavily in SMB3 file systems for Microsoft HyperV failover clusters.

throw1092011 hours ago

That's what I was going to chime in with - you pay for that extra address width. Binary addition and multiplication latency is super-linear with respect to operand width. Larger pointers lead to more memory use, and memory access latency is non-constant with respect to size.

It might make sense for large distributed systems to move to a 128-bit architecture, but I don't see any reason for consumer devices, at least with current technology.

sakras8 hours ago

At least at this latest SIGMOD, it felt like everyone and their dog was researching databases in an RDMA environment… so I’d imagine this stuff hasn’t peaked in popularity.

shadowofneptune11 hours ago

Even so, existing ISAs could address more memory using segmentation. AMD64 has a variant of long mode where the segment registers are re-enabled. For the special programs that need such a large space, far pointers wouldn't be that much of a complication.

_the_inflator11 hours ago

I agree with you.

And seeing datacenter after datacenter shooting up like mushrooms, there might be some sort of abstraction heading in this direction that makes 128-bit addresses feasible. At the moment, 64-bit seems like paging in this sense.

torginus8 hours ago

And there's another disadvantage to 128-bit pointers - memory size and alignment. Pointer-sized struct fields would become 16-byte aligned, and pointers themselves would bloat up as well, leading to even more memory consumption, especially in languages that favor pointer-heavy structures.

This was a major counterargument against 64-bit x86, where the transition came out as roughly a net zero in terms of performance: the hit from larger pointer sizes was counterbalanced by ISA improvements such as more addressable registers.

Many people in high-performance circles advocate using 32-bit array indices as opposed to pointers, to counteract the cache-pollution effects.
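A rough sketch of that technique (my example, not from the article): with 32-bit indices into a backing array, node size stops depending on the pointer width at all.

    #include <stdint.h>
    #include <stdio.h>

    /* Pointer-heavy node: three 64-bit pointers dominate the size,
     * and would double again with 128-bit pointers. */
    struct node_ptr {
        uint32_t key;
        struct node_ptr *left, *right, *parent;
    };

    /* Index-based node: children are 32-bit indices into one backing array,
     * so the node is the same size on 32-, 64-, or 128-bit machines. */
    struct node_idx {
        uint32_t key;
        uint32_t left, right, parent;
    };

    int main(void)
    {
        printf("pointer node: %zu bytes\n", sizeof(struct node_ptr));  /* 32 on LP64 */
        printf("index node:   %zu bytes\n", sizeof(struct node_idx));  /* 16 */
        return 0;
    }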

akira25015 hours ago

I figure the cache is going to be your largest disadvantage, and that is the primary reason CPUs don't physically implement all address bits, and why canonical addressing was required to get all this off the ground in the first place.

cmrdporcupine12 hours ago

Thanks for pointing out the `usize` ambiguity. It drives me nuts. I suspect it would make me even crazier if I was doing embedded with Rust right now and had to work with unsafe hardware pointers.

(They also fail to split out an equivalent of `off_t`, too. Not that I think that would have the same bit-width ambiguities. But it seems odd to refer to offsets by a 'size'.)

eterevsky11 hours ago

8 EiB of data (in case we want addresses to fit in signed integers) is around 20 metric tons of one-TB micro-SD cards (assuming they weigh 2 g each). This could probably fit in a single shipping container.

MR4D10 hours ago

But the cooling....

Seriously, that was great math you did there, and a neat way to think about volume. That's a standard shipping container [0], which is less than I thought it would be.

[0] - https://www.mobilemodularcontainers.com/products/storage-con...

hnuser12345610 hours ago

shipping container = 40x8x8ft

microsd card = 15x11x1mm, 0.5g

fits 437,503,976 cards = 379 EiB, costs $43.7B

219 metric tons

8 EiB ~ 10,000,000 TB = fills the shipping container to 2.2% of its height (56 mm, about 2 inches), weighs 5 metric tons, costs $1B

shipping containers are rated for up to 24 metric tons, so ~40 EiB, $5B, 10 inches of cards, etc.

bombcar4 hours ago

Don’t underestimate the bandwidth of a shipping container filled with SD cards.

retrac10 hours ago

> 64 bits provides 16 EiB (16 x 1024 x 1024 x 1024 x 1 GiB), which is the sort of address space you might need for byte-level addressing of a warehouse full of high-density HDDs. Memory sizes don't grow like they used to

An exabyte was an absolutely incomprehensible amount of memory, once. Nearly as incomprehensible as 4 gigabytes seemed, at one time. But as you note, 64 bits of addressable data can fit into a single warehouse now.

Going by the historical rate of increase, $100 would buy about a petabyte of storage in 2040. Even presuming a major slowdown, we still start running into 64 bit addressing as a practical limit, perhaps sooner than you think.

TheCondor5 hours ago

It's very difficult to see normal computers that normal people use needing it any time soon, I agree. Frontier has 9.2PB of memory though, so that's 50bits for a petabyte and then 4 more bits, 54bits of memory addressability if we wanted to byte address it all. Looking at it that way, if super computers continue to be funded and grow like they have, we're getting shockingly close to 64bits of addressable memory.

I don't know that that really means we need 128 bits; 80 or 96 bits buys a lot of time, but it's probably worth a little bit of thought.

I don't know how many of you remember the pre-386 days. It was an effort to write interesting programs, though: you had 512KB or 640KB of memory to work with, but it was 16-bit addressable, so you were writing code to manage segments and stuff; it's an extra degree of complexity and a pain to debug. 32 bits seemed like a godsend when it happened. I imagine most of the dorks on here have ripped a blu-ray or transcoded a video from somewhere; it's not super unusual to be dealing with a single file whose byte offsets don't fit in a 32-bit pointer.

It's all about cost and value. 64 bits is still a staggering amount of memory, but if the protein-folding problems and climate models and what have you need 80 bits to represent the problem space, I would hope that the people building those don't also have to worry about the memory "shoe boxing" problems of yesteryear.

musicale3 hours ago

> CHERI is a much more interesting case

It would be interesting to see something like this on x86 and ARM. I could imagine Apple implementing something similar.

dylan60410 hours ago

>(comedy option: 32-bit int, 128-bit long, and 64-bit `unsigned middle`)

rather than unsigned middle, could we just call it malcom?

travisgriggs11 hours ago

> (comedy option: 32-bit int, 128-bit long, and 64-bit `unsigned middle`)

Or rather than keep moving the long goalpost, keep long at u64/i64 and add prolong(ed) for 128. Or we could keep long as the “nominal” register value, and introduce “short long” for 64. So many options.

Spooky2311 hours ago

Storage and interconnect specs are getting a lot faster. I could see a world where you treated an S3 scale storage system as a giant tiered addressable memory space. AS/400 systems sort of did something like that at a small scale.

mort9611 hours ago

We couldn't introduce a new 'middle' keyword, but could we say 'int' is 32 bit, 'long' is 128 bit and 'short long' is 64 bit..?

mhh__12 hours ago

Systems for which pointers are not just integers have come and gone, sadly.

Many mainframes had function pointers which were more like a struct than a pointer.

masklinn12 hours ago

Technically that’s still pretty common on the software side, that’s what tagged pointers are.

The ObjC/Swift runtime uses that, for instance: the class pointer of an object also contains the refcount and a few flags.

mhh__12 hours ago

Pointer tagging is still spiritually an integer (i.e. mask off the bits you guarantee the tags are in via alignment), versus (say) the pointer being multiple addresses into different types of memory, plus tags, plus whatever else you want to insert.

masklinn12 hours ago

> Pointer tagging is still spiritually an integer

No, pointer tagging is spiritually a packed structure. That can be a simple union (as it is in OCaml IIRC), but it can be a lot more: e.g. the ObjC “non-pointer isa” ended up with 5 flags (excluding the raw-isa discriminant) and two additional non-pointer members of 19 and 9 bits.

You mask things on and off to unpack the structure into components the system will accept. Nothing precludes using tagged pointers to discriminate between kinds and locations.
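A small sketch of the alignment-based packing being described here (the ObjC non-pointer isa and CHERI pack far more than this, but the mask-on/mask-off mechanics are the same; the 4-bit tag is just an assumption for illustration):

    #include <assert.h>
    #include <stdint.h>
    #include <stdlib.h>

    /* 16-byte-aligned allocations leave the low 4 bits of every pointer zero,
     * so a small tag can live there and be masked off before dereferencing. */
    #define TAG_BITS 4
    #define TAG_MASK ((uintptr_t)((1u << TAG_BITS) - 1))

    static uintptr_t tag_ptr(void *p, unsigned tag)
    {
        assert(((uintptr_t)p & TAG_MASK) == 0);   /* alignment guarantees these bits are free */
        assert(tag <= TAG_MASK);
        return (uintptr_t)p | tag;
    }

    static void *untag_ptr(uintptr_t t)  { return (void *)(t & ~TAG_MASK); }
    static unsigned ptr_tag(uintptr_t t) { return (unsigned)(t & TAG_MASK); }

    int main(void)
    {
        void *obj = aligned_alloc(16, 64);   /* C11 */
        uintptr_t t = tag_ptr(obj, 0x3);     /* 0x3 = some object "kind" */

        assert(untag_ptr(t) == obj);
        assert(ptr_tag(t) == 0x3);
        free(obj);
        return 0;
    }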

apaprocki12 hours ago

Itanium function pointers worked like that, so it could have been the new normal if IA64 wasn’t so crazy on the whole.

xxpor10 hours ago

A Raymond Chen post about the function pointer craziness: https://devblogs.microsoft.com/oldnewthing/20150731-00/?p=90...

aidenn011 hours ago

> ...which is the sort of address space you might need for byte-level addressing of a warehouse full of high-density HDDs

So if you want to mmap() files stored in your datacenter warehouse, maybe you do need it?

anotherhue11 hours ago

FYI Such memory tagging has a rich history https://en.wikipedia.org/wiki/Tagged_architecture

tcoppi10 hours ago

Even assuming you are correct on all these points, ASLR is still an important use case, and the effective security of current 64-bit address spaces is low.

lostmsu12 hours ago

So 64 bit address is only 1024 16TB HDDs? That number may go down quickly. There is a 100TB SSD already.

ijlx12 hours ago

1024*1024 16TB HDDs, or 1,048,576.

Beltiras11 hours ago

While my sibling comments are pointing out your error, they are not addressing what you are saying. You are absolutely correct. It's conceivable that within a few years this much storage will fit in one device (it might be a full rack of disks, but still).

infinityio12 hours ago

Not quite - 1024GiB is 1TiB, so it's 1024 x 1024 x 16TiB drives

lostmsu9 hours ago

Ah, parent already corrected their mistake. The comment I was responding to was saying 16*1024*1024*1GB.

oofbey12 hours ago

1,000,000 drives at 16 TB each I think.

Kilo is 10 bits. Mega 20. Giga 30. Tera 40. 16 TB is 44 bits. 1000x is another 10 bits, so 54; another 1000x gets to 64, hence a million drives.

api12 hours ago

I can't imagine a single Linux kernel instance or single program controlling that much bus-local RAM, but as you say there are other uses.

One use I can imagine is massively distributed computing where pointers can refer to things that are either local or remote. These could even map onto IPv6 addresses where the least significant 64 bits are a local machine pointer and the most significant 64 bits are the machine's /64. Of course the security aspect would have to be handled at the transport layer or this would have to be done on a private network. The latter would be more common since this would probably be a supercomputer thing.
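A sketch of what such a "global pointer" could look like, purely as the packing described above; none of this is a real API, and the prefix value is just a placeholder from the IPv6 documentation range.

    #include <stdint.h>

    /* Hypothetical 128-bit global pointer: high 64 bits name the machine
     * (its /64 prefix), low 64 bits are an ordinary local pointer. */
    struct gptr {
        uint64_t node;    /* which machine */
        uint64_t local;   /* address within that machine */
    };

    static struct gptr gptr_make(uint64_t node, const void *p)
    {
        struct gptr g = { node, (uint64_t)(uintptr_t)p };
        return g;
    }

    static int gptr_is_local(struct gptr g, uint64_t self)
    {
        return g.node == self;
    }

    static void *gptr_deref_local(struct gptr g)
    {
        /* Only valid when the pointer names this machine; the remote case
         * would go through RDMA or a message instead. */
        return (void *)(uintptr_t)g.local;
    }

    int main(void)
    {
        const uint64_t self = 0x20010db800000001ull;   /* placeholder /64 prefix */
        int x = 7;

        struct gptr g = gptr_make(self, &x);
        if (gptr_is_local(g, self) && *(int *)gptr_deref_local(g) == 7)
            return 0;
        return 1;
    }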

Still... I wonder if this needs CPU support or just compiler support? Would you get that much more performance from having this in hardware?

I do really like how Rust has u128 native in the language. This permits a lot of nice things, including efficient implementation of some cryptography and math stuff. C has irregular support for 128-bit integers (GCC's `__int128`, for example), but they're not really a first-class citizen.

gjvc11 hours ago

> I can't imagine a single Linux kernel instance or single program controlling that much bus-local RAM, but as you say there are other uses.

MS-DOS and 640K ...

api10 hours ago

The addressable size grows exponentially with more bits, not linearly. 2^64 is not twice as big as 2^32. It’s more than four billion times as big. 2^32 was only 65536 times as big as 2^16.

Going past 2^64 bytes of local high speed RAM becomes a physics problem. I won’t say never but it would not just be an evolutionary change from what we have and a processor that could perform useful computations on that much data would be equally nuts. Just moving that much data on a bus of today would take too long to be useful, let alone computing on it.

PaulHoule13 hours ago

On one hand, the IBM System/38 used 128-bit pointers in the 1970s, despite having a 48-bit physical address bus. These were used to manage persistent objects on disk or network with unique IDs, a lot like UUIDs.

On the other hand, filling out a 64 bit address space looks tough. I struggled to find something of the same magnitude of 2^64 and I got ‘number of iron atoms in an iron filing’. From a nanotechnological point of view a memory bank that size is feasible (fits in a rack at 10,000 atoms per bit), but progress in semiconductors is slowing down. Features are still getting smaller, but they aren't getting cheaper anymore.

bdn_9 hours ago

I thought up a few ways to visualize 2^64 unique items:

- You could give every ant on Earth ~920 unique IDs without any collisions

- You could give unique IDs for every brain neuron for all ~215 million people in Brazil

- The ocean contains about 20 × (2^64) gallons of water (3.5267 × 10^20 gallons total)

- There are between 100-400 billion stars in the Milky Way, so you could assign each star between 46,000,000–184,000,000 unique IDs each

- You could assign ~2.5 unique IDs to each grain of sand on Earth

- If every cell of your body contained a city with 500,000 people each, every "citizen" of your body could have a unique ID without any collisions

Calculating these figures is actually a lot of fun!

Eduard9 hours ago

There are only ~368 grains of sand per ant?

fragmede3 hours ago

Well no. See, there's this one ant, Jeff, that's hogging them all for itself, so each ant only gets 50 grains of sand.

ManuelKiessling5 hours ago

No wonder I keep reading about sand shortages.

tonnydourado6 hours ago

Those are great examples

lloeki10 hours ago

> filling out a 64 bit address space looks tough. I struggled to find something of the same magnitude of 2^64 and I got ‘number of iron atoms in an iron filing’

Reminded me of Jeff Bonwick's answer to the following question about his 'boiling the oceans' quip related to ZFS being a "128 bit filesystem":

> 64 bits would have been plenty ... but then you can't talk out of your ass about boiling oceans then, can you?

Sadly his Sun hosted blog was eaten by the migration to Oracle, so thanks to the Internet Archive again:

http://web.archive.org/web/20061111054630/http://blogs.sun.c...

That one dives into some of the "handwaving" a bit:

https://hbfs.wordpress.com/2009/02/10/to-boil-the-oceans/

And that one goes into how much energy it would take to merely spin enough disks up:

https://www.reddit.com/r/DataHoarder/comments/71p8x4/reachin...

PaulHoule9 hours ago

One absolute limit of computation (Landauer's principle) is that it takes on the order of

  kT ln 2

of energy to erase one bit of information, where k is the Boltzmann constant and T is the temperature. Let T = 300 K (room temperature).

I multiplied kT by 2¹²⁸, and got 1.41×10¹⁸ J of energy. One kiloton of TNT is 4.2×10¹² J, so that is a 335 kiloton explosion worth of energy just to boot.

That's not impossible, that much heat is extracted from a nuclear reactor in a few months. If you want to go faster you need a bigger system, but a bigger system will be slower because of light speed latency.

(You do better, however, at a lower temperature, say 1 K, but heat extraction gets more difficult at lower temperatures and you spend energy on refrigeration, unless you wait long enough for the Universe to grow colder.)

turtletontine9 hours ago

In fairness, the need for 128-bit addressable systems will come when 64 address bits are not enough. That will be long before people are using 2^128 bytes on one system. So doing the calculation with 2^65 bytes would be a more even-handed estimate of the machine that would require this.

tuatoru3 hours ago

> 1.41×10¹⁸ J of energy.

Used over the course of a year, that is a constant 44.4 GW. Less than Bitcoin uses already.

protomyth12 hours ago

The AS/400 and iSeries also use 128-bit pointers. 128 bits would be useful for several identifiers already in common use, such as ZFS block pointers and IPv6 addresses. I expect it will be the last hop for a long time.

EvanAnderson12 hours ago

In the context of the AS/400's single-level store architecture the 128-bit pointers make a lot of sense, too.

PaulHoule12 hours ago

Those are evolved from the System/38.

protomyth10 hours ago

Yeah, IBM is one company that shows how to push the models down the road. They do take their legacy seriously.

tuatoru3 hours ago

> On one hand, the IBM System/38 used 128-bit pointers in the 1970s, despite having a 48-bit physical address bus.

And the original processor was 24 bits, then it was upgraded to 36 bits (not a typo: 36 bits), and then to POWER 64 bits.

(When that last happened, it was re-badged AS/400. Later, marketing renamed the AS/400 to iSeries, and then to IBM i, without changing anything significant. Still uses Power CPUs, AFAIK).

For users, upgrades were a slightly longer than usual backup and restore.

What's the hard part here?

mhh__12 hours ago

IBM i is still in development and also has 128-bit pointers.

gwbas1c12 hours ago

> Matthew Wilcox took the stage to make the point that 64 bits may turn out to be too few — and sooner than we think

Let's think critically for a moment. I grew up in the 1980s and 1990s, when we all craved more and more powerful computers. I even remember the years when each generation of video games was marketed as 8-bit, 16-bit, 32-bit, etc.

BUT: We're hitting a point where, for what we use computers for, they're powerful enough. I don't think I'll ever need to carry a 128-bit phone in my pocket, nor do I think I'll need a 128-bit web browser, nor do I think I'll need a 128-bit web server. (See other posts about how 64-bits can address massive amounts of memory.)

Will we need 128-bit computing? I'm sure someone will find a need. But let's not assume they'll need an operating system designed in the 1990s for use cases that we can't imagine today.

jerf11 hours ago

I think the argument for 128-bits being necessary strictly for routing memory in a single "computer" is fairly weak. We're already seeing the plateauing of memory sizes nowadays; what was exponentially in reach within just a couple of decades exponentially recedes from us as our progress goes from exponential to polynomial.

But the argument that we need more than 64-bit capability for a lot of other reasons in conjunction with memory addressability is, I think, very strong. A lot of very powerful and safe techniques become available if we can tag pointers with more than a bit squeezed out here and a bit squeezed out there. I could even see hard-coding the CPU to, say, look at 80 bits as an address and then use the remaining 48 for tagging of various sorts. There's precedent, and 80 bits is an awful lot of addressable memory; that's a septillion+ addressable bytes, and by the time we need more than that, if we do, our future selves can deal with it. (It is good to look ahead a decade or two and make reasonable preparations, as this article does; it is hubris to start trying to look ahead 50 years or a century.)

wongarsu12 hours ago

We are talking about the OS kernel that supports 4096 CPU cores. The companies who pay their engineers to do linux kernel development tend to be the same ones that have absurd needs.

p_l11 hours ago

And that's not enough to boot on some platforms that Linux runs on (even with the amd64 ISA) unless one partitions the computer into smaller CPU counts.

mikepurvis12 hours ago

IMO there's an important distinction to be made between high-bit addressing and high-bit computing.

Like, no one has enough memory to need more than 64 bits for addressing, and that is likely to remain the case for the foreseeable future. However, 128- and 256-bit values are commonly used in domains like graphics, audio, and so-on, where you need to apply long chains of transformations and filters, but retain as much of the underlying dynamic range as possible.

PaulDavisThe1st11 hours ago

Those are data types, not pointer values or filesystem offsets. Totally different thing.

mikepurvis10 hours ago

Is that strictly the definition, that the bit-width of a processor refers only to its native pointer size?

I know it's hardly a typical or modern example, but the N64 had just 4MB of memory (8MB with the expansion pack). It most certainly didn't need 64-bit pointers to address that pittance, so it was a "64 bit processor" largely for the purposes of register/data size.

PaulDavisThe1st10 hours ago

Not so much a definition as where the problem lies.

If the thing that can refer to a memory address changes size, there are very different problems than will arise if the size of "an integer" changes.

You could easily imagine a processor that can only address an N-bit address space, but can trivially do arithmetic on N*M bit integers or floating point values. And obviously the other way around, too.

In general, I think "N bit processor" tends to refer to the data type sizing, but since those primitive data types will tend to fit into the same registers that are used to hold pointers, it ends up describing addressing too.

uluyol12 hours ago

Supercomputers? Rack-scale computing? See some of the work being done with RDMA and "far memory".

gwbas1c11 hours ago

Yes... And do you need that in the same kernel that goes into your phone and your web server?

ghoward9 hours ago

It would be sad if we, as an industry, do not take this opportunity to create a better OS.

First, we should decide whether to have a microkernel or a monolithic kernel.

I think the answer is obvious: microkernel. This is much safer, and seL4 has shown that performance need not suffer too much.

Next, we should start by acknowledging the chicken-and-egg problem, especially with drivers. We will need drivers.

So let's reuse Linux drivers by implementing a library for them to run in userspace. This would be difficult, but not impossible, and the rewards would be massive, basically deleting the chicken-and-egg problem for drivers.

To solve the userspace chicken-and-egg problem (having applications that run on the OS), implement a POSIX API on top of the OS. Yes, this will mean that some bad legacy like `fork()` will exist, but it will solve that chicken-and-egg problem.

From there, it's a simple matter of deciding what the best design is.

I believe it would be three things:

1. Acknowledging hardware as in [1].

2. A copy-on-write filesystem with a transactional API (maybe a modified ZFS or BtrFS).

3. A uniform event API like Windows' handles and Wait() functions or Plan 9's file descriptors.

For number 3, note that not everything has to be a file, but receiving events like signals and events from child processes should be waitable, like in Windows or Linux's signalfd and pidfd.

For number 2, this would make programming so much easier on everybody, including kernel and filesystem devs. And I may be wrong, but it seems like it would not be hard to implement: when doing copy-on-write, just copy as usual and update the root B-tree node; the transaction commits when the root B-tree node is flushed to disk and the flush succeeds. (There's a sketch of this at the end of this comment.)

(Of course, this would also require disks that don't lie, but that's another problem.)

[1]: https://www.usenix.org/conference/osdi21/presentation/fri-ke...
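A minimal sketch of the commit protocol described in point 2, against a hypothetical block-device interface (dev_write_block / dev_flush / dev_write_root are made-up stubs, not any real driver API):

    #include <stdbool.h>
    #include <stdint.h>

    /* Stubs standing in for real disk I/O. */
    static bool dev_write_block(uint64_t blk, const void *data) { (void)blk; (void)data; return true; }
    static bool dev_flush(void)                                 { return true; }
    static bool dev_write_root(uint64_t new_root)               { (void)new_root; return true; }

    /* Copy-on-write commit: (1) write every new block somewhere unused,
     * (2) flush, (3) point the root at the new tree in one atomic write,
     * (4) flush again.  A crash before step 3 leaves the old, consistent
     * tree; after step 3, readers find the new one. */
    static bool txn_commit(const uint64_t *blocks, const void *const *data,
                           int n, uint64_t new_root)
    {
        for (int i = 0; i < n; i++)
            if (!dev_write_block(blocks[i], data[i]))
                return false;
        if (!dev_flush())
            return false;              /* new blocks durable but not yet reachable */
        if (!dev_write_root(new_root))
            return false;
        return dev_flush();            /* the transaction commits exactly here */
    }

    int main(void)
    {
        uint64_t blocks[] = { 101, 102 };
        const char leaf[] = "new leaf", inner[] = "new interior node";
        const void *data[] = { leaf, inner };
        return txn_commit(blocks, data, 2, 103) ? 0 : 1;
    }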

LinkLink13 hours ago

For reference, 2^64 = ~10^19.266. I don't think this is unreasonable at all; it's unlikely that computers will largely stay the same in the coming years. I believe we'll see many changes to how things like mass addressing of data and computing resources are done. Right now our limitations in these regards are addressed by distributed computing and databases, but in a hyper-connected world there may come a time when such a huge address space could actually be used.

It's an unlikely hypothetical but imagine if fiber ran everywhere, and all computers seamlessly worked together sharing computer power as needed. Even 256 bits wouldn't be out of the question then. And before you say something like that will never happen consider trying to convince somebody from 2009 that in 13 years people would be buying internet money backed by nothing.

jmillikin12 hours ago

  > It's an unlikely hypothetical but imagine if fiber ran everywhere,
  > and all computers seamlessly worked together sharing computer power
  > as needed. Even 256 bits wouldn't be out of the question then.
You could do this today with 192 bits (128-bit IPv6 address, 64-bit local pointer). Take a look at RDMA, which could be summarized as "every computer's RAM might be any computer's RAM".

The question is whether such an address makes sense for the Linux kernel. If your hyper-converged distributed program wants to call `read()`, does the pointer to the buffer really need to be able to identify any machine in the world? Maybe it's enough for the kernel to use 64-bit local pointers only, and have a different address mechanism for remote storage.

Tsarbomb12 hours ago

Not agreeing or disagreeing for the most part, but in 2009 all of the prerequisite technology for cryptocurrency existed: general purpose computers for the average person, accessible internet, cryptographic algorithms and methods, and cheap storage.

For 256-bit computers, we need entirely new CPU architectures and updated ISAs, not just for x86/AMD64 but for other archs increasing in popularity such as ARM and even RISC-V. Even then, compilers, build tools, and dependent devices with their drivers need updates too. On top of all of this technical work, you have the political work of getting people to agree on new standards and methods.

mhh__12 hours ago

256 bits in the case of a worldwide mega-computer would be such a huge departure from current architectures and more importantly latency-numbers that we can barely even speculate about it.

It may be of note that hypothetically one can have a soft-ISA 128-bit virtual address (a particularly virtual virtual address) which is JITed down into a narrower physical address by the operating system. This is, as far as I'm aware, how IBM i works.

est3112 hours ago

The extra bits might be used for different things, as in e.g. CHERI. The address space is still 64 bits, but there are 64 bits of metadata added to it, so you get a 128-bit architecture.

masklinn12 hours ago

I don't know, it seems excessive to me. I could see it for cold storage maybe, with spanning storage pools (by my reckoning there were 10TB drives in 2016 and the largest now are 20TB, so 16 years from now it should be 320TB if capacity keeps doubling, which is still about 5 orders of magnitude short of a full 64-bit space).

> Right now our limitations in these regards are addressed by distributed computing and databases, but in a hyper-connected world there may come a time when such huge address space could actually be used.

Used at the core of the OS itself? How do you propose to beat the speed of light exactly?

Because you don’t need a zettabyte-compatible kernel to run a distributed database (or even file system, see ZFS), trying to DMA things on the other side of the planet sounds like the worst possible experience.

Hell, our current computers right now are not even close to 64 bit address spaces. The baseline is 48 bits, and x86 and ARM are in the process of extending the address space (to 57 bits for x86, and 52 for ARM).

alain940409 hours ago

Thanks to Moore's law, you can assume that DRAM capacity will double every 1-3 years. Every time it doubles, you need one more bit. So if we use 48 bits today, we have 16 bits left to grow, which gives us at least 16 years of margin, and maybe 48 years. (and it could be even longer if you believe that Moore's law is going to keep slowing down).

Deukhoofd12 hours ago

> all computers seamlessly worked together sharing computer power as needed. Even 256 bits wouldn't be out of the question then.

This sounds like it would be massively out of scope for Linux. It'd require a complete overhaul of most of its core functionality, and all of its syscalls. While not a completely infeasible idea, it sounds to me like it'd require a completely new designed kernel.

PaulDavisThe1st11 hours ago

I remember a quote from the papers about Opal, an experimental OS that was intended to use h/w protection rather than virtual memory, so that all processes share the same address space and can just exchange pointers to share data.

"A 64 bit memory space is large enough that if a process allocated 1MB every second, it could continue doing this until significantly past the expected lifetime of the sun before it ran into problems"

wongarsu12 hours ago

What's the average lifespan of a line of kernel code? I imagine that by starting this project 12 years before its anticipated use case, they can get very far just by requiring that any new code be 128-bit compatible (in addition to doing the broader infrastructure changes needed, like fixing the syscall ABI).

rwmj12 hours ago

> What's the average lifespan of a line of kernel code?

There's a fun tool called "Git of Theseus" which can answer this question! You can see some graphs of Linux code on the web page: https://github.com/erikbern/git-of-theseus

Named after the Ship of Theseus: https://en.wikipedia.org/wiki/Ship_of_Theseus

forgotpwd1612 hours ago

There're some more in the presentation article: https://erikbern.com/2016/12/05/the-half-life-of-code.html#:...

A (Linux) kernel line has a half-life of 6.6 years, the highest of the projects analyzed. The lowest went to Angular, with a half-life of 0.32 years.

trasz12 hours ago

Not sure about those 12 years - 128-bit registers are already there, and CHERI Morello prototype is at a “physical silicon using this functionality under CheriBSD” stage.

jupp0r4 hours ago

"The problem now is that there is no 64-bit type in the mix. One solution might be to "ask the compiler folks" to provide a __int64_t type. But a better solution might just be to switch to Rust types, where i32 is a 32-bit, signed integer, while u128 would be unsigned and 128 bits. This convention is close to what the kernel uses already internally, though a switch from "s" to "i" for signed types would be necessary. Rust has all the types we need, he said, it would be best to just switch to them."

Does anybody know why they don't use the existing fixed-size integer types [1] from C99, i.e. uint64_t etc., and define a 128-bit-wide type on top of that (which will also be there in C23 IIRC)?

My own kernel dev experience is pretty rusty at this point (pun intended), but in the last decade of writing cross-platform (desktop, mobile) userland C++ code I advocated exclusively for using fixed-width types (std::uint32_t etc.) as well as constants (UINT32_MAX etc.).

MikeHalcrow9 hours ago

I recall sitting in a packed room with over a hundred devs at the 2004 Ottawa Linux Symposium while the topic of the number of filesystem bits was being discussed (link: https://www.linux.com/news/ottawa-linux-symposium-day-2/). I recall people throwing out questions as to why we weren't just jumping to 128 or 256 bits, and at one point someone blurted out something about 1024 bits. Someone then made a comment about the number of atoms in the universe, everyone chuckled, and the discussion moved on. I sensed the feeling in the room was that any talk of 128 bits or more was simply ridiculous. Mind you this was for storage.

Fast-forward 18 years, and it's fascinating to me to see people now seriously floating the proposal to support 256-bit pointers.

hughw2 hours ago

"Number of atoms in the universe" and "number of possible states of a system" are vastly different things. The latter is a combinatorial problem: if you're trying to track the possible combinations of 100 variables that can take on 10 states each, you've got 10^100 combinations and are already beyond the number of atoms in the universe (10^80). You can never enumerate them all, but the ability to work on large subspaces would be a help.

munro11 hours ago

There was a post awhile back from NASA saying how many digits of Pi they actually need [1].

    import math

    pi = 3141592653589793238462643383279502884197169399375105820974944592307816406286208998628034825342117067982148086513282306647093844609550582231725359408128481117450284102701938521105559644622948954930381964428810975665933446128475648233786783165271201909145648566923460348610454326648213393607260249141273724587006606315588174881520920962829254091715364367892590360

    sign_bits = 1                               # one sign bit
    sig_bits = math.ceil(math.log2(pi))         # bits to hold that many digits as an integer significand
    exp_bits = math.floor(math.log2(sig_bits))  # rough exponent-field width for that significand

    assert sign_bits + sig_bits + exp_bits == 1209
I'm sure I got something wrong here, def off-by-one, but roughly it looks like it would need 1209-bit floats (2048-bit rounded up!). IDK, mildly interesting. :>

[1] https://www.jpl.nasa.gov/edu/news/2016/3/16/how-many-decimal...

vanderZwan11 hours ago

The value of Pi you mention was the one in the question, the one in the answer is:

> For JPL's highest accuracy calculations, which are for interplanetary navigation, we use 3.141592653589793. Let's look at this a little more closely to understand why we don't use more decimal places. I think we can even see that there are no physically realistic calculations scientists ever perform for which it is necessary to include nearly as many decimal points as you present.

That's sixteen digits, so a quick trip to the dev tools tells me:

    >> Math.log2(3141592653589793)
    -> 51.480417552782754 
The last statement of the text I quoted is more interesting though, although not surprising to me, given how many astronomers I know who joke that Pi equals three all the time.

Beltiras11 hours ago

I have horrid memories of debugging I had to do to get some god-awful fourier transform to calculate with 15 digits of precision to fit a spec. It's right at the boundary where double-precision stops being deterministic. Worst debugging week of my life.

vanderZwan11 hours ago

> stops being deterministic

I'm imagining the maths equivalent of Heisenbugs, is that correct?

Beltiras10 hours ago

No, just having to match how Matlab did the calculation (development of an index) when implementing the same thing in C++ (necessitating showing that the calculation returned the same significant digits for the precision we expected). I've seen a Heisenbug and that was really weird. It happened during uni, so I didn't have to start tracing down compiler bugs. Not even sure if I could; it happened with Java.

munro11 hours ago

Lol I should RTFA ;D

vanderZwan11 hours ago

Nah, just claim you were invoking Cunningham's Law ;)

jabl10 hours ago

Pi is a bit special because in order to get accurate argument reduction for trigonometric functions you need lots of digits (IIRC ~1000 for double precision).

E.g. https://redirect.cs.umbc.edu/~phatak/645/supl/Ng-ArgReductio...

PaulDavisThe1st11 hours ago

The size of required data types is mostly orthogonal to the size of memory addresses or filesystem offsets.

rmorey12 hours ago

this seems just a bit too early - so that probably means it’s exactly the right time!

marktangotango12 hours ago

I was wondering if this (128-bit memory) is on the radar of any of the BSDs. Will they forever be stuck at 64-bit?

fanf211 hours ago

CheriBSD might be the first Unix-like with 128-bit pointers.

teddyh12 hours ago

Maybe we can finally fix intmax_t to be the largest integer type again.

torginus9 hours ago

On a bit of a tangential note, RAM price (for a given capacity) used to decrease exponentially until the 2010s or so.

Since then, it has only roughly halved. What happened?

https://jcmit.net/memoryprice.htm

I know it's not process geometry, since we went from 45nm to 5nm in that time, roughly an 81x decrease in area.

Is it realistic to assume scaling will resume?

hnuser1234569 hours ago

We decided to slow down giving programmers excuses to make chat applications as heavy as web browsers

amelius4 hours ago

Perhaps it's an idea to make Linux parameterized in the pointer/word size, and let the compiler figure it out in the future.

Aqueous10 hours ago

If we're going to go for 128, why not just go for 256? That way we won't have to do this again for a while.

Or better yet, design a new abstraction so the limit of the pointer size isn't hard-coded but can instead be extended as more addressable space becomes a reality, instead of having to transition over and over. Is this even possible? If it is, shouldn't we head in that direction?

kmeisthax4 hours ago

The problem with variable-sized pointers is that...

1. Any abstraction you could make will have worse performance than a fixed-size machine pointer

2. In order to support any kind of variably-sized type you need machine pointers to begin with, and those will always be fixed-size because variable size is even harder to support in hardware than in native code

And furthermore going straight to 256 has its own problems. Each time you double the pointer size you also significantly increase the size of structures with a lot of pointers. V8 notably uses "pointer compression" - i.e. using 32-bit offsets instead of 64-bit pointers, because it never needs >4GB of JavaScript objects at once and JS objects are very pointer-ridden.
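A tiny sketch of the compression trick (not V8's actual code): reserve one region, hand out 32-bit offsets from its base, and widen back to a real pointer only at the point of use.

    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdlib.h>

    /* All objects live inside a single reservation of at most 4 GiB, so a
     * reference can be stored as a 32-bit offset from the base instead of a
     * full 64-bit (or 128-bit) pointer. */
    static char *heap_base;

    static uint32_t compress(const void *p)
    {
        ptrdiff_t off = (const char *)p - heap_base;
        assert(off >= 0 && off < (1ll << 32));
        return (uint32_t)off;
    }

    static void *decompress(uint32_t ref)
    {
        return heap_base + ref;
    }

    int main(void)
    {
        heap_base = malloc(1 << 20);   /* stand-in for the real multi-GiB reservation */
        if (!heap_base)
            return 1;

        int *obj = (int *)(heap_base + 128);
        *obj = 7;

        uint32_t ref = compress(obj);              /* 4 bytes to store instead of 8 */
        int ok = *(int *)decompress(ref) == 7;

        free(heap_base);
        return ok ? 0 : 1;
    }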

There are two forces at play here: pointers need to be small enough to embed in any data structure and large enough to address the entire working set of the program. Larger pointers are not inherently better[0], and neither are smaller pointers. It's a balancing act.

[0] ASLR, PAC, and CHERI are exceptions, as mentioned in the original article.

tomcam8 hours ago

Just wanted to say I love this discussion. Have been pondering the need for a 128-bit OS for decades but several of the issues raised were completely novel to me. Fantastic to have so many people so much smarter than I am hash it out informally here. Feels like a master class.

Beltiras11 hours ago

I would think by now any bunch of clever people would be trying to fix the generalized problem of supporting n-bit memory addressing instead of continually solving the single problem of "how do we go from n bits to 2n bits". I guess it's more practical to just let the next generation of kernel maintainers go through all of this hullabaloo again in 2090.

t-311 hours ago

Are there operations that vector processors are inherently worse at or much harder to program for? Nowadays they seem to be mainly used for specialized tasks like graphics and machine learning accelerators, but given the expansion of SIMD instruction sets, are general purpose vector CPUs in the pipeline anywhere?

tayistay12 hours ago

Is 128 bit the limit of what we would need? We use 128 bit UUIDs. 2^256 seems to be more than the number of atoms on Earth.

wongarsu12 hours ago

The article does talk about just making the pointer type used in syscalls 256 bit wide to "give room for any surprising future needs".

The size of large networked disk arrays will grow beyond 64 bit addresses, but I don't think we will exceed 2^128 bits of storage of any size, for any practical application. Then again, there's probably people who thought the same about 32 bit addresses when we moved from 16bit to 32bit addresses.

The most likely case for "giant" pointers (more than 128 bits) will be adding more metadata into the pointer. With time we might find enough use cases that are worth it to go to 256bit pointers, with 96bit address and 160 bit metadata or something like that.

dylan60410 hours ago

>Then again, there's probably people who thought the same about 32 bit addresses when we moved from 16bit to 32bit addresses.

There's a fun "quote" about 384k being all anyone would ever need, so clearly everyone just needs to settle down and figure out how to refactor their code.

shadowofneptune8 hours ago

The IBM PC's 20-bit addressing was 16 times the size of 16-bit addresses. From 20-bit to 32-bit, 4096 times larger. 32 to 64 is 4,294,967,296 times larger (!). The scale alone makes using all this space unlikely on a PC.

tonnydourado6 hours ago

The "In my shoes?" bit was hilarious

Blikkentrekker13 hours ago

> How would this look in the kernel? Wilcox had originally thought that, on a 128-bit system, an int should be 32 bits, long would be 64 bits, and both long long and pointer types would be 128 bits. But that runs afoul of deeply rooted assumptions in the kernel that long has the same size as the CPU's registers, and that long can also hold a pointer value. The conclusion is that long must be a 128-bit type.

Can anyone explain the rationale for not simply naming types after their size? In many programming languages, rather than this arcane terminology, “i16”, “i32”, “i64”, and “i128” simply exist.

masklinn13 hours ago

Legacy, there’s lots of dumb stuff in C. As you note, in modern languages the rule is generally to have fixed-size integers.

Though I think there were portability concerns, that world is mostly gone (it remains in some corners of computing, e.g. DSPs), but if you're only using fixed-size integers, what do you do when a platform doesn't have that size? With a more flexible scheme you have fewer issues there; however, as the weirdness landscape contracts, the risk of making technically incorrect assumptions (about relations between type sizes, or the actual limits and behaviour of a given type) starts increasing dramatically.

Finally there's the issue at hand here: even with fixed-size integers, "pointer" is a variable-size datum, so you still need a variable-size integer to go with it. C historically lacked that (nowadays it's called uintptr_t), and the kernel made assumptions which are incorrect.

Note that you can still get it wrong even if you try e.g. Rust believes and generally assumes that usize and pointers correspond, but that gets iffy with concepts like pointer provenance, which decouple pointer size and address space.

raverbashing12 hours ago

> Legacy, there’s lots of dumb stuff in C.

Yes, this, so much this

Who cares what an 'int' or a 'long' is. Except for things like the size of a pointer, it's better if you know exactly what you're working with.

mpweiher12 hours ago

> in modern languages the rule is generally to have fixed-size integers.

Modern languages have unlimited size integers :-)

"Modern" as in "since at least the 80s, more likely 70s".

fluoridation12 hours ago

Good luck using those to specify the data layout of a network packet.

josefx12 hours ago

You are supposed to stream your video data as base64 encoded xml embedded in a json array.

kuratkull12 hours ago

Good luck seeing your performance drop off a very sharp cliff if you start using larger numbers than your CPU can fit into a single register.

mpweiher10 hours ago

Well, in those cases other languages fail.

Either silently with overflows, usually leading to security exploits, or by crashing.

So in either case you are betting that these cases are somewhere between rare and non-existent, particularly for your core/performance intensive code.

Being somewhat slower, probably in very isolated contexts (60-62 bits is quite a bit to overflow), but always correct seems like the better tradeoff.

YMMV. ¯\_(ツ)_/¯

creativemonkeys12 hours ago

I'm sure someone will come along and explain why I have no idea what I'm talking about, but so far my understanding is those names exist because of the difference in CPU word size. Typically "int" represents the natural word size for that CPU, which matches the register size as well, so 'int plus int' is as fast as addition can run by default, on a variety of CPUs. That's one reason chars and shorts are promoted to ints automatically in C.

Let's say you want to work with numbers and you want your program to run as fast as possible. If you specify the number of bits you want, like i32, then the compiler must make sure on 64bit CPUs, where the register holding this value has an extra 32bits available, that the extra bits are not garbage and cannot influence a subsequent operation (like signed right shift), so the compiler might be forced to insert an instruction to clear the upper 32bits, and you end up with 2 instructions for a single operation, meaning that your code now runs slower on that machine.

However, had you used 'int' in your code, the compiler would have chosen to represent those values with a 64bit data type on 64bit machines, and 32bit data type on 32bit machines, and your code would run optimally, regardless of the CPU. This of course means it's up to you to make sure that whatever values your program handles fit in 32bit data types, and sometimes that's difficult to guarantee.

If you decide to have your cake and eat it too by saying "fine, I'll just select i32 or i64 at compile time with a condition" and you add some alias, like "word" -> either i32 or i64, "half word" -> either i16 or i32, etc depending on the target CPU, then congrats, you've just reinvented 'int', 'short', 'long', et.al.

Personally, I'm finding it useful to use fixed integer sizes (e.g. int32_t) when writing and reading binary files, to be able to know how many bytes of data to read when loading the file, but once those values are read, I cast them to (int) so that the rest of the program can use the values optimally regardless of the CPU the program is running on.

nicoburns11 hours ago

That explains "int", but it doesn't explain short or long or long long. Rust has "usize" for the "int" case, and then fixed sizes for everything else, which works much better. If you want portable software, it's usually more important to know how many bits you have available for your calculation than it is to know how efficiently that calculation will happen.

creativemonkeys2 hours ago

I suppose short and long have to do with register sizes being available as half word and dword, and there are instructions that work with smaller data sizes on both x86 and ARM, but I agree that in today's world, you want to know the number of bits. On those weak 4MHz machines, squeezing a few extra cycles was typically very important.

m0RRSIYB0Zq8MgL13 hours ago

That is what was suggested in the next paragraph.

> But a better solution might just be to switch to Rust types, where i32 is a 32-bit, signed integer, while u128 would be unsigned and 128 bits. This convention is close to what the kernel uses already internally, though a switch from "s" to "i" for signed types would be necessary. Rust has all the types we need, he said, it would be best to just switch to them.

wongarsu12 hours ago

That's pretty much the mentioned proposal of "just use rust types", which are i16/u16 to i128/u128, plus usize/isize for pointer-sized things.

The only improvement that you really need over that is to differentiate between what C calls size_t and uintptr_t: the size of the largest possible array, and the size of a pointer. On "normal" architectures they're the same, but on architectures that do pointer tagging or segmented memory a pointer might be bigger than the biggest possible array.
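A trivial way to see the distinction (assuming an ordinary hosted C toolchain): on a flat 64-bit target the two widths coincide, while a CHERI-style target keeps size_t at 64 bits and grows uintptr_t to hold the whole 128-bit capability.

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* size_t bounds how big one object can be; uintptr_t bounds what a
         * pointer converts to.  They only happen to match on flat targets. */
        printf("size_t    : %zu bytes (largest object size)\n", sizeof(size_t));
        printf("uintptr_t : %zu bytes (integer that can hold a pointer)\n", sizeof(uintptr_t));
        return 0;
    }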

But you still have to deal with legacy C code, and C was dreamt up when running code written for 16 bits on a 14-bit architecture without losing speed was a consideration, so the C types are weird.

thrown_2211 hours ago

stdint.h has been around far longer than Rust.

I've been using those since the 00s for bit banging code where I need guarantees for where each bit goes.

Nothing quite like working with a microprocessor with 12-bit words to make you appreciate 2^n addresses.

PaulHoule13 hours ago

C dates back to a time when the 8 bit byte didn’t have 100% market share.

yetihehe12 hours ago

Plus, it was a language to write systems, where "size of register on current machine" was a nice shortcut for "int", where registers could be anywhere from 8-32 bits, with 48 or 12 also a possibility.

cestith11 hours ago

I have a couple of 12-bit machines upstairs. There were also 36-bit systems once upon a time.

masklinn12 hours ago

Except that's not been true in a while, and technically this assumption was not kosher for even longer: C itself only guarantees that int is at least 16 bits.

mhh__12 hours ago

C still (I think; C23 may have finally killed support*) supports architectures like ClearPath mainframes, which have a 36-bit word, 9-bit byte, 36-bit (IIRC) data pointer and an 81-bit function pointer.

The changes to the arithmetic rules mean you can't have sign-magnitude or 1s complement anymore IIRC

Stamp0112 hours ago

C99 specifies stdint.h/inttypes.h as part of the standard library for exactly this purpose. I'd expect using it would be a best practice at this point. But I'm no C expert, so maybe there's a good reason for not always using those explicitly sized types.

pjmlp12 hours ago

Windows has macros for that kind of stuff, and only in C99 the stdint header came to be.

So you had almost three decades with everyone coming up with their own solution.

To be fair, the other languages were hardly any better than C in this regard.

quonn12 hours ago

I think that's because of portability. So that the common types just map to the correct size on a given system.

zasdffaa11 hours ago

Sounds nuts. Does anyone know how much power a 32GB DIMM draws? How much would a fully populated 64-bit address space therefore pull?

Edit: if a 4GB DIMM (32 bits' worth of address space) pulls 1 watt, filling the rest of the 64-bit space takes another factor of 2^32 ≈ 4E9, so your memory is pulling ~4 GW alone. That's not supportable, given the other electronics needed to go around it.

znpy7 hours ago

I wonder if with 128-bit wide pointers it would make sense to start using early-lisp-style tagged pointers.

krackout10 hours ago

I think the article is very shortsighted. By 2035-40 we'll probably have memory-only (RAM-only) computers massively available. No disks means no current OS is capable of handling these computers. A change of paradigm, needing new platforms and OSes.

These future OSes may be 128bit, but I don't think the current ones will make it to the transition.

alpb10 hours ago

There are plenty of OSes today capable of booting and running from RAM. Pretty sure we wouldn't be burning all the prominent OSes for something like that.

bitwize11 hours ago

Pointers will get fat (CHERI and other tagged-pointer schemes) well before even server users need to byte-address more than 2^64 bytes' worth of stuff. So we should probably be realistically aiming for 256-bit architectures...

dredmorbius7 hours ago

As long as the posting of subscriber links in places like this is occasional, I believe it serves as good marketing for LWN - indeed, every now and then, I even do it myself. We just hope that people realize that we run nine feature articles every week, all of which are instantly accessible to LWN subscribers.

-- Jonathan Corbet, LWN founder and grumpy editor in chief

<https://news.ycombinator.com/item?id=1966033>

Multiple other approvals: <https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...>

Jon's own submissions: <https://news.ycombinator.com/submitted?id=corbet>

And if we look for SubscriberLink submissions with significant (>20 comments) discussion ... they're showing up every few weeks, largely as Jon had requested.

<https://hn.algolia.com/?dateRange=all&page=0&prefix=true&que...>

That said, those who are able to comfortably subscribe and find this information useful: please do support the site through subscriptions.

nequo6 hours ago

Whether the posting of subscriber links is “occasional” as of late is debatable.[1] Most of LWN’s paywalled content is posted on HN.

[1] https://news.ycombinator.com/item?id=32926853

dredmorbius6 hours ago

jaimehrubiks stated unequivocally without substantiation that "Somebody asked before to please not share lwn's SubscriberLinks". LWN's founder & editor has repeatedly stated otherwise, hasn't criticised the practice, and participates in the practice himself, as recently as three months ago.

SubscriberLinks are tracked by the LWN account sharing them. Abuse can be managed through LWN directly should that become an issue. Whether or not that's occurred in the past I've no idea, but the capability still exists and is permitted.

No link substantiating jaimehrubiks' assertion seems to have been supplied yet.

I'm going to take Corbet's authority on this.

t3estabc10 hours ago
Majestic1219 hours ago

Why are you posting GPT3 responses here?

wdutch10 hours ago

Hi GPT3, I didn't expect to see you here.