How CERN serves 1EB of data via FUSE [video]

195 points | 20 hours ago | kernel-recipes.org
julienchastang3 hours ago

Somewhat off topic, but CERN has a fantastic science museum attached to it that I had the privilege of visiting last summer. There is of course Tim Berners-Lee's NeXT workstation, but also so much more. It is also the only science museum I've visited that addresses topics in cyberinfrastructure such as serving out massive amounts of data. (I personally get interested when I see old StorageTek tapes lol.) The more traditional science displays are also great. Check it out if you are ever in the Geneva area. It is an easy bus ride to get out there.

udev409613 hours ago

This is fascinating. How are they managing or even taking backup for this gigantic storage?

tantalor3 hours ago

What is Rucio?

Rucio enables centralized management of large volumes of data backed by many heterogeneous storage backends.

Data is physically distributed over a large number of storage servers, potentially each relying on different storage technologies (SSD/Disk/Tape/Object storage) and, frequently, managed by different teams of system administrators.

Rucio builds on top of this heterogeneous infrastructure and provides an interface which allows users to interact with the storage backends in a unified way. The smallest operational unit in Rucio is a file. Rucio enables users to upload, download, and declaratively manage groups of such files.

https://rucio.cern.ch/documentation/started/what_is_rucio

JBorrow6 hours ago

They use a distributed data management tool called RUCIO: https://rucio.cern.ch to distribute data on the grid.

ephimetheus13 hours ago

For experiment data, there is a layer on top of all of this that distributes datasets across the computing grid. That system has a way to handle replication at the dataset level.

rob_c12 hours ago

Tape and off-site replicas at globally distributed data centres for science. Of the 1EB, a huge amount is probably in automated recall and replication, with "users" running staged processing of the data at different sites, and the data ultimately being reduced to a "manageable" GB-TB level for scientists to do science.

fnands10 hours ago

Yup, lots of tape for stuff in cold storage, and then some subset of that on disk spread out over several sites.

It's kinda interesting to watch anything by Alberto Pace, the head of storage at CERN, to get an understanding of the challenges and constraints: https://www.youtube.com/watch?v=ym2am-FumXQ

I was basically on the helpdesk for the system for a few years so had to spend a fair amount of time helping people replicate data from one place to another, or from tape onto disk.

qwertox14 hours ago

IIRC I had issues with inotify when I was editing files on a remote machine via SSHFS while these files were being used inside a Docker container. inotify inside the container did not trigger the notifications, whereas it did when editing a file with an editor directly on that host.

I think this was related to FUSE, that Docker just didn't get notified.
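
If inotify events do not reach a process for whatever reason, one common workaround is to poll file metadata instead. Here is a minimal sketch of that idea; `snapshot` and `changed_paths` are hypothetical helper names, not part of any library, and the sketch makes no claim about the actual cause of the behaviour described above.

```python
import os

def snapshot(paths):
    """Map each path to (mtime_ns, size), or None if the path does not exist."""
    state = {}
    for p in paths:
        try:
            st = os.stat(p)
            state[p] = (st.st_mtime_ns, st.st_size)
        except FileNotFoundError:
            state[p] = None
    return state

def changed_paths(before, after):
    """Return the sorted paths whose recorded state differs between two snapshots."""
    return sorted(p for p in after if before.get(p) != after[p])
```

A watcher would call `snapshot` on a timer and react to whatever `changed_paths` returns. Polling costs one `stat` per file per interval, so it trades latency and extra metadata traffic for independence from kernel notifications.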

z3t44 hours ago

The inotify signals might work if you mount a whole directory with -v.

a-dub15 hours ago

does modern fuse still context switch too much or does it now use io_uring or similar?

mappu14 hours ago

FUSE over io_uring is still WIP: https://lwn.net/Articles/988186/

FUSE Passthrough landed in kernel 6.9, which also reduces context switching in some cases: https://www.phoronix.com/news/Linux-6.9-FUSE-Passthrough . The benchmarks in this article are pretty damning for regular FUSE.

Dwedit10 hours ago

FUSE Passthrough is only useful for filesystems that wrap an existing filesystem, such as union mounts. Otherwise, you don't have an open file to hand over.

a-dub13 hours ago

yeah but still not great for metadata operations, no?

i remember it was really not great for large sets of search paths because it defeated the kernel's built-in metadata caches with excessive context switching?

Dwedit15 hours ago

Last I read about FUSE, adding a 128KB read-ahead buffer drastically reduced context switching.
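
Rough arithmetic for why a larger buffer helps, as a sketch only. It assumes every read request costs one round trip to the userspace daemon, and it assumes a 4 KiB request size for the unbuffered case; neither figure comes from the thread.

```python
def request_count(total_bytes, request_bytes):
    """Number of requests needed to read total_bytes, rounding up."""
    return -(-total_bytes // request_bytes)

GIB = 1024 ** 3
KIB = 1024

small = request_count(GIB, 4 * KIB)    # assumed unbuffered request size
large = request_count(GIB, 128 * KIB)  # the 128KB read-ahead buffer
print(small, large, small // large)    # 262144 8192 32
```

Under those assumptions, reading 1 GiB sequentially takes 32 times fewer round trips with the larger buffer.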

synicalx15 hours ago

1EB with only 30k users, that's a wild TB-per-user ratio. My frame of reference: the largest storage platform I've ever worked on was a combined ~60PB (give or take) and that had hundreds of millions of users.
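
To put numbers on that ratio, here is a quick sketch using decimal units. The 300 million figure is an assumed stand-in for "hundreds of millions", not a number from the comment.

```python
EB = 10 ** 18  # bytes, decimal units
PB = 10 ** 15
TB = 10 ** 12
MB = 10 ** 6

def bytes_per_user(total_bytes, users):
    """Average storage per user, in bytes."""
    return total_bytes / users

cern_tb = bytes_per_user(1 * EB, 30_000) / TB
# Assumption: 300 million users stands in for "hundreds of millions".
other_mb = bytes_per_user(60 * PB, 300_000_000) / MB

print(f"{cern_tb:.1f} TB per user")   # 33.3 TB per user
print(f"{other_mb:.0f} MB per user")  # 200 MB per user
```

That is roughly five orders of magnitude more storage per user on the CERN side, under the stated assumption.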

chipdart14 hours ago

Most humans don't handle sensor and simulation data for a living, though. CERN just so happens to employ thousands who do that for a living.

lmihaig5 hours ago

When experiments are running the sensors generate about 1PB of data per second. They have to do multiple (I think four?) layers of filtering, including hardware level to get to actual manageable numbers.

elashri4 hours ago

It depends on which experiment. We call it the trigger system, and it varies according to each experiment's requirements and physics of interest. For example, LHCb now runs its full trigger system in software (no hardware FPGA triggering), mainly utilizing GPUs for that. That would be hard to achieve with the harsher conditions and requirements of CMS and ATLAS.

But yes at LHCb we discard about 97% of the data generated during collisions.

Disclaimer: I work on the LHCb trigger system

shric14 hours ago

My frame of reference; the largest storage platform I've ever worked on was a combined ~tens of EB (give or take) and that had over a billion users.

hackernewds14 hours ago

That's the scale of the universe, compared to data generated by humans

vfclists5 hours ago

Re: https://news.ycombinator.com/item?id=41716523,

> over the years what discoveries have been made at CERN that have had practical social and economic benefits to humanity as a whole?

Some responders to the question believe I was criticizing a supposed wastefulness of the research. Not knowing the benefits of the discoveries in high energy physics, i.e. the stuff the accelerators are actually built to discover, doesn't mean I was criticizing it.

Responses referenced the contributions that the development of the infrastructure supporting the basic research has made, which is fine, but not the benefits of high energy physics discoveries.

So to rephrase the question - What are the practical social and economic benefits to society that the discoveries in high-energy particle physics at institutions like CERN have made over the years?

This is not just in relation to CERN, but worldwide, such as those experiments which create pools of water deep underground to study cosmic rays etc.

deelowe5 hours ago

You're probably getting replies like that because it's a bit of an odd question. Academic research isn't really done to achieve a particular purpose or goal. The piratical benefit literally is academic.

dylan6044 hours ago

It's also one of the first questions from people that very much are criticizing, so even if it was a sincere question it will be lumped together with those. Not recognizing or addressing this when posing the question does nothing to prevent the lumping.

superhuzza3 hours ago

The piratical benefit may be particle cannons? Yarrgh!

renewiltord16 hours ago

Most of the magic is in https://eos-web.web.cern.ch/eos-web/ apparently

hackernewds14 hours ago

[flagged]

qwertox14 hours ago

.ch as in Switzerland. Does .cn (China) prohibit open source?

irusensei10 hours ago

"Confoederatio Helvetica" if anyone wonders why Switzerland uses the CH TLD.

InDubioProRubio12 hours ago

The things you can build when everyone is a rockstar :D

jgalt2128 hours ago

I'm convinced CERN could greatly benefit from "middle out".

maybeben17 hours ago

i mean, they also have one of the largest ceph deployments. anything is scalable with no budget.

pas16 hours ago

slide 22 states that the cost is 1 CHF/TB/month (on 10+2 erasure coded disks), though it would be interesting to do a breakdown of costs (development, hardware, maintenance, datacenter, servicing, management, etc.)
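
As back-of-the-envelope arithmetic, the quoted price can be scaled up to the headline capacity. This sketch assumes 1 EB is a decimal 1,000,000 TB and reads "10+2" as 10 data chunks plus 2 parity chunks; whether the quoted price is per usable or per raw TB is not stated.

```python
def monthly_cost_chf(capacity_tb, chf_per_tb_month=1.0):
    """Monthly cost at a flat per-TB price."""
    return capacity_tb * chf_per_tb_month

def raw_overhead(data_chunks, parity_chunks):
    """Raw bytes stored per usable byte under k+m erasure coding."""
    return (data_chunks + parity_chunks) / data_chunks

capacity_tb = 1_000_000          # assumption: 1 EB in decimal TB
per_month = monthly_cost_chf(capacity_tb)
per_year = 12 * per_month
overhead = raw_overhead(10, 2)   # assumption: 10 data + 2 parity chunks

print(f"{per_month:,.0f} CHF/month, {per_year:,.0f} CHF/year")
print(f"raw-to-usable ratio: {overhead:.2f}")
```

At that flat rate a full exabyte comes to about 1 million CHF per month, and 10+2 coding stores 1.2 raw bytes per usable byte, compared with 3.0 for triple replication.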

pclmulqdq15 hours ago

1 CHF/TB/month is a bit expensive for storage at that scale, so it would definitely be interesting to see what they're spending the money on and what they are (and aren't) counting in that price.

rob_c12 hours ago

Tape backup, accessibility, networking, availability... At 1CHF/TB that's a lot better than my local university still charging >100x that for such services internally

pclmulqdq5 hours ago

Economies of scale in storage are significant. Also, I don't know why you put up with your university charging 100x that when you can store things on AWS for $5-10/TB/month (or less). That comes with all the guarantees (or more) of durability and availability you get from the university.

hackernewds14 hours ago

No budget often tags along with no accountability

hi-v-rocknroll15 hours ago

They probably consume Panasas, IBM, DDN, and BeeGFS gear and licensing too.

adev_13 hours ago

Nop.

Most internal data is spread between Ceph and a home-made distributed storage system named EOS (https://indico.cern.ch/event/138478/contributions/149912/att...) running over commodity hardware.

The only commercially backed storage system is the long-term tape storage system. Even so, it has a home-made overlay API on top of it to interface with the rest of the systems.

rob_c12 hours ago

Good god no. Nowhere near anything so crass. CEPH and EOS all the way

niemandhier13 hours ago

People here keep claiming “Anything is possible with unlimited budget”.

CERN's budget is 1.4 billion Euro, with 50 million Euro for all IT infrastructure.

https://cds.cern.ch/record/2888205/files/English.pdf#page18

It’s not the money, it’s the people. Update: Added source.

atoav12 hours ago

That kind of place can draw a certain kind of employee. This finding is hard to transfer to commercial projects. Sure employees will always claim to be really motivated, especially in the marketing material, but are they we-are-nerds-working-on-the-bleeding-edge-of-human-knowledge-motivated?

Probably not, but there is surely some manager out there who made themselves believe they can motivate their employees to show the same devotion for the self-made hardships of some mostly pointless SaaS product. If you want to grab that kind of spirit, what you do needs to fundamentally make sense beyond just making somebody money.

sligor10 hours ago

That's exactly how we were able to go to the moon 55 years ago, and why it's complicated today. It was of course a lot of money. But it was mostly a lot of highly skilled, motivated, devoted people working toward an ultimate common goal. Money would not have been sufficient by itself.

HPsquared7 hours ago

Since then, a LOT of the smart motivated people have been lured into either banking or adtech. The pay is good and the technical problems can be pretty interesting but the end result lacks that "wow factor".

seb12045 hours ago

I also read that nowadays we are more risk averse and many people/managers/companies are mostly administrators of the status quo. Pair that with a lack of vision and public engagement for current challenges to humanity.

wvh7 hours ago

In other words, if you permit, pure capitalism isn't a sufficiently good motive to get something significant done. But of course most of us don't work towards an ultimate common goal – and neither did most people in those times. One wonders if there is enough meaning left these days to go 'round and ensure most of us feel passionate about the stuff we (have to) do. Maybe we really need a god or war or common enemy to unite all strands into a strong rope.

dangitman8 hours ago

[dead]

jedrek10 hours ago

Also, CERN does not have a profit motive.

How much good work have the people reading this thread had to trash because it didn't align with Q3 OKRs? How much time and energy did they put into garbage solutions because they had to hit a KPI by the last day of June?

bayindirh8 hours ago

> Also, CERN does not have a profit motive.

This is a great point. We work with CERN on a project, and we're all volunteers, but we work on something we need, and contribute back to it wholeheartedly.

At the end of the day, CERN wants to give away project governance to someone else in the group, because they don't want to be the BDFL of anything they help create. It allows them to pursue new and shiny things and build them.

There's an air consisting of "it's done when it's done", and "we need this, so we should build this without watering it down", so projects move at a steady pace, but the code and product is always as high quality as possible.

niemandhier10 hours ago

CERN buddy of mine suggested that exposing a colony of physicists to elevated ambient levels of helium would trigger excessive infrastructure building behavior.

quailfarmer11 hours ago

That’s a great observation, and I think generally correct, but there are private companies where that sort of motivation exists, for basically the same reason

guappa8 hours ago

Then they get bought by some megacorp which kills the motivation.

Cthulhu_4 hours ago

Or they are the megacorp that killed it (Google, Xerox?)

dangitman8 hours ago

[dead]

lmihaig5 hours ago

People get this very wrong: CERN is extremely underfunded. People really don't understand how expensive running the accelerators is, and most of the budget goes to that. In recent years they even had to run for fewer months than expected because they couldn't afford the rising energy prices.

The buildings are old, the offices suck, you don't even get free coffee and they pay less than the norm in Switzerland. But they have some of the top minds working on very specific and interesting systems, dealing with problems you'd never encounter anywhere else.

I would like to yap more about the current management and their push/reliance on enterprise solutions, but to cut it short I really do think CERN is a net contributor to open science and they deserve more funding.

lokimedes13 hours ago

Also, the in-kind contributions from hundreds of institutes around the world. Much can, and has, been said about physicist code, but CERN is the center of a massive community of “pre-dropout” geniuses. I can’t count the number of former students that later joined Google and the likes. Many are frequenting HN.

adev_13 hours ago

CERN was a good example of how much can be done with how little when you have the right people.

For a long time, the entire Linux distribution (Scientific Linux) used for ~15K collaborators, the infra and the grid computing was managed by a team of around 4-5 people.

The teams managing the network access (LanDB), the distributed computing system, the scientific framework (ROOT) and the storage are also small, dedicated skilled teams.

And the result speaks for itself.

Unfortunately, most of that went to shit quite recently when they replaced the previous head of IT by a Microsoft fanboy/girl coming from outside of the scientific environment. The first thing he/she did was to force Microsoft bloatware everywhere to replace existing working OSS solutions.

axus5 hours ago

I think the majority of the Scientific Linux software came from Fedora/Red Hat and the Linux kernel. Planning and managing the CERN computing infrastructure is a lot of work, then updating and releasing a famous distro on top of that was impressive.

wuming211 hours ago

> Unfortunately, most of that went to shit quite recently when they replaced the previous head of IT by a Microsoft fanboy(girl?) coming from outside of the scientific environment.

Painful to read, so I did a short check. From a news post I don’t want to link here, but easily found by searching “CERN, the famous scientific lab where the web was born, tells us why it's ditching Microsoft and helping others do the same”, the direction taken in 2019 seemed quite the opposite. I am not sure how the current head of IT at CERN, Enrica Porcari, fits into the story. Insider info will be appreciated.

adev_11 hours ago

> direction taken in 2019 seemed quite the opposite

The head of IT changed in 2021 if it answers your question.

jimbat5 hours ago

Scientific Linux was originally a product of Fermilab, with contributions from CERN.

amelius10 hours ago

> CERN's budget is 1.4 billion Euro

Kind of weird that a company like Uber has a valuation of $150 billion.

dguest9 hours ago

Most of the people who make CERN work aren't working for CERN. The IT department is under CERN, but there are many thousands of "users" who don't get paid by CERN at all. Quite a lot of the fabrication and most of the physics analysis is done by national labs and universities around the world.

elashri9 hours ago

CERN's budget at the experiment level is paid mostly by contributions from the institutions that are part of that experiment. I am talking about operation and R&D, and this also includes personnel contributions to different aspects. There is also service work that each of the users must do besides doing physics. I, for example, work on the software development stack besides my current physics analysis. Some of my colleagues work on hardware.

Then there are country-level contributions that pay for CERN infrastructure and maintenance (and inter-experiment stuff) and direct employees' salaries.

dguest7 hours ago

The important point here is that (I believe) the 1.4 billion above doesn't account for all the work done directly by institutes. Institutes pay CERN, but they also channel government grants to fund a huge amount of work directly.

Most of the people I know who "worked at" CERN never got a pay check that said CERN on it.

yccs2710 hours ago

Apples to oranges. Budget is per year, valuation is total.

A better comparison would be Uber's revenue of $37 billion in 2023.

amelius10 hours ago

I don't see why it's Apples to oranges. Uber could pay for 150 CERN-years.

gwervc9 hours ago

How many people order a meal (often out of laziness) per day vs. think about and search for the mysteries of the universe? Economically it makes sense that Uber generates a lot more cash.

chrisandchris8 hours ago

I think you mistakenly assume that there must be a correlation between _valuation_ and _earnings_. Uber's _first_ ever positive year was 2023, after 15 years in business [1]. Uber may be generating cash, but it has also been losing cash a lot faster than it was generating it. Taking 2023 as a reference (~2 billion), it needs another 5 of those years just to recover from its losses in 2022 (9 billion). I understand the economics behind it, but its valuation is way out of reality.

[1] https://www.theverge.com/2024/2/8/24065999/uber-earnings-pro...

hkwerf5 hours ago

That being said, though, members contribute more than money. A lot of the work done at CERN is not done on CERN budgets, but on the budgets of member institutes.

dauertewigkeit9 hours ago

Good hiring managers can find the hidden gems. These are typically people who don't have the resume to join FAANG immediately, due to lacking the pedigree, but who have lots of potential. Also these same people typically don't last long because they do eventually move on.

Also it helps that Europe is so behind in tech that if you want to do some cutting edge tech you are almost forced to join a public institution because private ones are not doing anything exciting.

sofixa5 hours ago

> Also it helps that Europe is so behind in tech that if you want to do some cutting edge tech you are almost forced to join a public institution because private ones are not doing anything exciting.

This is genuinely cringeworthy. Do you think that companies in the EU all use COBOL on mainframes and nothing newer than 10 years old is allowed? Airlines and banks here(!) are rewriting their apps to be Kubernetes native... And have been doing so for years. Amadeus (top 2 airline booking software in the world) were a top Kubernetes contributor already a decade ago.

The tech problems being solved at Criteo, Revolut, Thales, BackMarket, Airbus, Amadeus (to name a few fun ones off the top of my head) are no less challenging and bleeding edge than... "the Uber of X" app number 831813 in the US. Or fucking Juicero or Theranos or any of the other scams.

guappa8 hours ago

Because doing the millionth CRUD in the USA is very exciting?

wvh7 hours ago

One wonders if things win because they really are better, or because there's sufficient financial momentum behind them. I have worked in the public sector for some years, and I don't think Europe is behind, just that the budgets are a lot smaller. If you want to capture a lot of people in an ecosystem or walled garden, you're going to need money, and lots of it. For all that's good and bad about it, most of that excess is concentrated in the US, in a few hotspots. No need to get distracted and put a flag on somebody like a Zuckerberg or Jobs or Gates though.

sofixa5 hours ago

> and I don't think Europe is behind, just that the budgets are a lot smaller. If you want to capture a lot of people in an ecosystem or walled garden, you're going to need money, and lots of it

And the initial market you have is quite a bit smaller. Germany is the biggest EU country by population at 84 million, compared to 333 million in the US. Moving into another EU country means translating into a different language, verifying what laws apply to you, how taxes work, etc. Sometimes it's easy (just a translation), sometimes you might have to redo everything almost from scratch (e.g. Doctolib, which schedules healthcare appointments, hosts online meetings with doctors, and can be used to share test results and prescriptions: each new country they enter will have a lot of regulations on healthcare data that will need to be applied).

But it's mostly the budgets.

rob_c12 hours ago

Yes, but that still covers infrastructure (cables) and a lot of equipment for the experiments including but not limited to massive storage and tape backup, distributed local compute, and local cluster management all with users busy trying to pummel it with the latest and greatest ideas of how they can use it faster and better... Not to mention specialist software and licences. 50M doesn't go that far when you factor all of this in

vfclists17 hours ago

[flagged]

coherentpony16 hours ago

https://en.wikipedia.org/wiki/CERN#Scientific_achievements

Here's a couple, in case you don't want to read the page:

- CERN pioneered the introduction of TCP/IP for its intranet, beginning in 1984

- CERN has developed a number of policies and official documents that enable and promote open science

- The CERN Science Gateway, opened in October 2023, is CERN's latest facility for science outreach and education

I purposefully picked items that weren't directly particle physics related.

scottapotamas14 hours ago

Just tacking some detail onto "promote open science".

CERN was/is a large early user and supporter of the open source KiCAD electronics CAD tooling. The downstream impact of improved accessibility to solid ECAD tooling has been a large contributing factor to the growing ecosystem of open electronics.

A lot of really impressive test and measurement equipment to support their research is developed in the open (see https://ohwr.org/project). People on HN are probably most likely to have heard of the White Rabbit timing project, but there's fantastic voltmeter designs, a lot of FPGA projects for carriers, gateware, fun RF designs.

snugglebert13 hours ago

They also use the expensive big ECAD tools for the super complex stuff.

But no secret - they are one of the reasons why Kicad isn't an ugly duckling anymore.

magicalhippo11 hours ago

They have their own page for it: https://kt.cern/

There's a lot of use for the acceleration and sensor knowledge in the medical sector. Technology first developed for high-energy research can be used to improve CT scans[1], better cancer treatment[2] and so on. This goes way back.

[1]: https://home.cern/news/news/knowledge-sharing/spectral-imagi...

[2]: https://kt.cern/success-stories/pioneering-new-cancer-radiot...

tiffanyh16 hours ago

Inventing WWW is arguably the single greatest economic development in the history of mankind.

hollerith16 hours ago

But if Berners-Lee hadn't started the WWW, someone else probably would have within a few years: the hard part was the development of the internet, i.e., a flexible low-cost wide-area network where anyone could start a new service (look in /etc/services to see all the services that people have defined over the years) without the need to get permission from anyone.

IIRC the first WWW server went live in 1990. By then there was already WAIS, Archie and Veronica (search engines for anonymous-FTP sites). In 1991, the first Gopher server went live. Gopher grew rapidly till the late 1990s.

The US government's Advanced Research Projects Agency started funding research into "packet-switched networks" in 1960 which would eventually lead to the internet, which went live in 1969 (under the name ARPAnet, but only a pedant would say that ARPAnet is not the early version of the internet). Then the USG continued to fund the internet every year till it no longer needed funding in the early 1990s.

So, CERN and Berners-Lee (mostly the latter because no one at CERN other than Berners-Lee cared much about the WWW in its early days before it became a big hit) get some credit for the WWW, but in my reckoning it is a small amount of credit.

bozhark16 hours ago

But if…

But wasn’t.

Enginerrrd17 hours ago

A lot of the benefit has come from learning expertise in applications.

Tons of the data science tools have roots in CERN. Tons of interesting statistical methods, tons of R&D experience with superconductors and all manner of sensors.

Tons of math, computation techniques, modeling, etc. would not be here without CERN.

It would be sort of silly to expect that any of their actual discoveries or tests of the SM would have any actual application, but the ancillary benefits are there.

hackernewds14 hours ago

Which tons? And why would it be silly? If actual new particles or physical phenomena were found the applications would be trillions

adgjlsfhk113 hours ago

a particle that requires 30 km particle accelerator to produce isn't going to have that many applications on earth

arlort13 hours ago

> practical social and economic benefits to humanity as a whole?

Why does it have to be practical? Scientific discovery is a perfectly valid end in its own even if it only ever means that we understand the universe better

The fact that almost always scientific discovery turns out to have practical purposes in the long run (centuries, not decades) is an added bonus.

It's not like it's a huge expense either. If Switzerland decided to, it could cover the yearly budget of CERN by itself at the cost of a fraction of a percent of its GDP.

dekhn15 hours ago

There's a number of points to unpack here.

High energy physics research has contributed some technology with social and economic benefits. Some of that has been direct results coming from pure research into fundamental properties of matter and electromagnetic radiation, some are indirect results that came about because when you build an institute like CERN, it spontaneously generates advances in other areas that solve more general problems (this is known as the "collect a bunch of smart people in a single place, with a lot of resources, to solve a unique problem" strategy). But no, most of the research, pure or applied, has not really had direct practical social and economic benefits to humanity as a whole.

That's entirely missing the point. We, as a society, have decided that we will balance our economic productivity into several different areas- welfare, infrastructure, military, industry, science/research, technology. We believe that investments in areas of research which have no direct benefit still can have positive outcomes- partly through fundamental discoveries, but also enriching us as a species. We also believe these investments will ensure that we have the freedom to be productive in the future.

A cynic might even say that CERN has played a critical role in keeping people from working on military applications, or working for the enemy.

If your criticism (it's hard not to read your comment as an implicit criticism) is that we should invest the results of our productivity more directly into areas which maximize social and economic benefits- sure, this is argued about all the time. The SSC was cancelled, at least partly because people failed to see the value in having a world-class HEP facility in the US.

hackernewds14 hours ago

[flagged]

rob_c12 hours ago

No, but had cynicism? Of the member states, the highest cost per taxpayer is still less than a bag of peanuts each year, and most people will throw that at the TV over whatever upsets them without thinking. It's collective science, not big pharma, which soaks up taxpayer money and then sells the discoveries back to you with a 1000x markup. And yes, CERN has played an important part in the scientific conversation of where we are in the universe and what it looks like. If you don't think that's important, I think flat earth cults are working just as hard to derail conversations they don't want to join in good faith...