
Building an open data pipeline in 2024

98 points | 2 days ago | blog.twingdata.com
amadio 1 day ago

I take issue with this part of the article:

> In general, managed tools will give you stronger governance and access controls compared to open source solutions. For businesses dealing with sensitive data that requires a robust security model, commercial solutions may be worth investing in, as they can provide an added layer of reassurance and a stronger audit trail.

There are definitely open source solutions capable of managing vast amounts of data securely. The storage group at CERN develops EOS (a distributed filesystem based on the XRootD framework), and CERNBox, which puts a nice web interface on top. See https://github.com/xrootd/xrootd and https://github.com/cern-eos/eos for more information. See also https://techweekstorage.web.cern.ch, a recent event we had along with CS3 at CERN.

AnthonyMouse 1 day ago

Not only that, open source and proprietary software both generally handle the common case well, because otherwise nobody would use it.

It's when you start doing something outside the norm that you notice a difference. Neither of them will be perfect when you're the first person trying to do something with the software, but for proprietary software that's game over, because you can't fix it yourself.

datadrivenangel 18 hours ago

Your options are to use off-the-shelf tools and end up with a brittle and janky setup, or use open source and end up with a brittle and janky setup that is more customized to your workflows... It's a tradeoff, though, and all the hosting and security work of open source can be a huge time sink.

AnthonyMouse 17 hours ago

You don't actually have to do any of that work if you don't want to. Half the open source software companies have that as their business model -- you can take the code and do it yourself or you can buy a support contract and they do it for you. But then you can make your own modifications even if you're paying someone to handle the rest of it.

RadiozRadioz 2 days ago

> And if you’re dealing with truly massive datasets you can take advantage of GPUs for your data jobs.

I don't think scale is the key deciding factor for whether GPUs are applicable for a given dataset.

I don't think this is a particularly insightful article. Read the first paragraph of the "Cost" section.

bradford 1 day ago

> I don't think this is a particularly insightful article.

Data engineering can be lonely. I like seeing the approach that others are taking, and this article gives me a good idea of the implementation stack.

marcyb5st 1 day ago

After reading this I suggest having a look at Apache Beam if you are not using it already. I have the feeling that you could achieve the same result with far fewer pieces in the stack.

Also, were you to decide to run it on another "runner" (Dataflow, Flink, Spark, ...), you could do so without rewriting your pipelines.

Additionally, you can truly reuse your Apache Beam logic for streaming and batch jobs. Other tools can perhaps do that too, but from some experiments I ran a while ago it's not as straightforward.

And finally, if one or more of your processing steps needs access to GPUs, you can request that (provided your runner supports it: https://beam.apache.org/documentation/runtime/resource-hints... ).
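
For a sense of what that looks like, here is a minimal Beam sketch (the file names and column positions are made up; the runner is chosen purely through pipeline options, so the same transform code can target the local DirectRunner, Dataflow, Flink, or Spark):

    # Minimal Apache Beam pipeline sketch: the transforms below are runner-agnostic,
    # so moving from the local DirectRunner to Dataflow/Flink/Spark is a change of
    # pipeline options rather than a rewrite.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    opts = PipelineOptions(runner="DirectRunner")  # e.g. runner="DataflowRunner" in production

    with beam.Pipeline(options=opts) as p:
        (
            p
            | "Read" >> beam.io.ReadFromText("sales-*.csv")        # hypothetical input files
            | "Parse" >> beam.Map(lambda line: line.split(","))
            | "KeyByRegion" >> beam.Map(lambda row: (row[0], float(row[2])))
            | "SumPerRegion" >> beam.CombinePerKey(sum)
            | "Format" >> beam.MapTuple(lambda region, total: f"{region},{total}")
            | "Write" >> beam.io.WriteToText("revenue-by-region")
        )

Swapping the text source for a streaming one (plus windowing) is, in principle, the batch-versus-streaming switch described above, and accelerator resource hints are attached per transform where the runner supports them.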

fredguth 1 day ago

Same here. I just wish OP had pointed to an example repo with a minimal working example.

gchamonlive 2 days ago

Yes, I also believe both the dataset and the transformation algorithms have to lend themselves well to parallelization for GPUs to be useful. GPUs don't do magic; they are just really good at parallel computing.

winwang 2 days ago

That's right, and that means most transforms in big data. The fact that the dataset can be distributed at all typically implies that the task is parallel.

nyokodo 1 day ago

> The fact that the dataset can be distributed at all typically implies that the task is parallel.

It depends. Big data tasks aren't necessarily CPU bound but IO bound, so you won't see any speedup from throwing GPUs at the problem, but you will see a speedup from throwing in more worker nodes that come with their own network bandwidth and memory. Nor are all CPU-bound problems appropriate for GPUs. I suspect the article author is thinking of ML pipelines, where at large scales GPUs are definitely necessary, but at lower scales you can get away with ordinary CPUs.

winwang 1 day ago

+1

dangoldin 2 days ago

Author here. There's nuance here, but as a rule of thumb data size is a decent enough proxy. The audience isn't everyone; the goal was to give less experienced data engineers and folks a sense of modern data tools and a possible approach.

But what did you mean by "Read the first paragraph of the `Cost` section"?

llm_trw 1 day ago

>Author here and there's nuance here but as a rule of thumb data size is a decent enough proxy.

It isn't though.

What matters is the memory footprint of the algorithm during execution.

If you're doing transformations that take constant time per item regardless of data size, sure, go for a GPU. If you're doing linear work, you can't fit more than 24 GB on a desktop card, and prices go to the moon quickly after that.

Junior devs doing the equivalent of an outer product on data is the number one reason I've seen data pipelines explode in production.
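
A toy illustration of that failure mode (table and column names invented): a many-to-many join on a low-cardinality key is effectively an outer product, so memory grows quadratically with input size rather than linearly.

    import pandas as pd

    # Two modest inputs, both keyed on a low-cardinality column.
    n = 1_000
    events = pd.DataFrame({"region": ["us"] * n, "amount": range(n)})
    attrs = pd.DataFrame({"region": ["us"] * n, "tag": range(n)})

    # The join matches every event with every attribute sharing the key:
    # 1k x 1k inputs -> 1M output rows. At 1M-row inputs this would be 10^12 rows,
    # which is the kind of step no 24 GB card (or server) survives.
    blown_up = events.merge(attrs, on="region")
    print(len(blown_up))  # 1000000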

dangoldin 1 day ago

Yes, but most data-heavy tasks are parallelizable. SQL itself is naturally parallelizable. There's a reason Nvidia RAPIDS, Voltron Data, Kinetica, SQream, etc. exist.

Full transparency: I don't have a huge amount of experience working at this massive scale, and to your point, you need to understand the problem and constraints before you propose a solution.
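
For what it's worth, the GPU-dataframe version of that argument looks roughly like this with RAPIDS cuDF, which mirrors the pandas API so a group-by aggregation (the bread and butter of parallel SQL) runs on the GPU without custom kernels. The file names here are made up, and this assumes a CUDA-capable GPU with cudf installed.

    import cudf  # RAPIDS GPU dataframe library; requires a CUDA-capable GPU

    # Hypothetical input file; cuDF mirrors the pandas API but executes on the GPU.
    df = cudf.read_parquet("events.parquet")

    # A typical SQL-style aggregation: GROUP BY region, SUM(amount).
    revenue_by_region = (
        df.groupby("region")
          .agg({"amount": "sum"})
          .reset_index()
    )

    revenue_by_region.to_parquet("revenue_by_region.parquet")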

llm_trw 13 hours ago

There are more asterisks attached to each assertion you're making than you can shake a stick at.

There is always a 'simple' transformation the business requires that turns out to need n^2 space and kills the server it's running on, because people believe everything you said above.

Or in other words: most of the time you don't need a seat belt in a car either.

cgio 1 day ago

You have to revisit the assertion that SQL is naturally parallelizable. As a guide, have a look at the semantics around Spark shuffles.
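
A small PySpark sketch of the distinction (file and column names invented): a narrow transformation like a filter parallelizes trivially, but a wide one like a group-by forces a shuffle, which is where "SQL is naturally parallelizable" picks up its asterisks.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("events.parquet")  # hypothetical input

    # Narrow: each partition is processed independently, no data movement.
    positive = df.where(F.col("amount") > 0)

    # Wide: all rows sharing a key must meet on one executor, so Spark
    # repartitions (shuffles) the data across the network first.
    totals = positive.groupBy("region").agg(F.sum("amount").alias("revenue"))

    totals.explain()  # the plan shows an Exchange (shuffle) before the aggregation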

zX41ZdbW 1 day ago

In fact, the opposite is true. While small datasets can be handled on GPU (although there are no good GPU databases comparable in performance to ClickHouse), large datasets don't fit, and unless there is a large amount of computation per byte of data, moving the data around will eat the performance gains.

lmeyerov 1 day ago

Sort of

Databricks-style architectures mostly spend their time slowly moving data from A to B while doing little work, and it's worse when that describes the distributed compute part too. I read a paper a while back where that was often half the total time.

GPU architectures end up being explicitly about scaling the bandwidth. One of my favorite projects to do with teams tackling something like using GPU RAPIDS to scale event/log processing is to get GPU Direct Storage going. Imagine an SSD array reading back at 100GB/s, which feeds 1-2 GPUs at the same rate (PCI cards), and then TB/s for loaded data cross-GPU and a mind-blowing number of FLOPS. Modern GPUs get you 0.5TB+ of per-node GPU RAM. So a single GPU node, when you get the IO bandwidth right for such fat streaming, is insane.

So yeah, taking a typical Spark cloud cluster and putting GPUs in it vs the above is the difference between drinking from a children's twirly straw vs a firehose.
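
To put rough numbers on the straw-versus-firehose comparison (purely illustrative figures, in line with the bandwidths mentioned above):

    # Back-of-envelope scan times for a 10 TB dataset at different read bandwidths.
    dataset_gb = 10_000
    for label, gb_per_s in [
        ("single SSD (~2 GB/s)", 2),
        ("10 GbE network (~1.25 GB/s)", 1.25),
        ("GPU-direct SSD array (~100 GB/s)", 100),
    ]:
        minutes = dataset_gb / gb_per_s / 60
        print(f"{label:34s} ~{minutes:6.1f} minutes")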

datadrivenangel 19 hours ago

An A100 only has 40GB of GPU RAM, so inter-node memory movement can be a bandwidth issue.

lmeyerov 15 hours ago

I don't follow. Maybe the point is a lot of people are not balancing their systems, so by sticking with wimpy-era IO architectures, they're not feeding their GPUs?

I think about balancing nodes differently when designing older Spark CPU clusters vs modern GPU systems. (Newer Spark clusters have changed again to look more GPU-like/vertical, but that's another story.)

In the Databricks wonder years, horizontal scale made sense. Lots of cheap, wimpy nodes with minimal compute per node were cost-effective for a lot of problems. It was faster because the comparison point was older Hadoop jobs that didn't run in-memory. But every byte moves far, and each node does very little... slow, energy costs, etc. That makes sense when vertically scaled components are more expensive for the same power budget, which used to be true before multicore & GPU chips got a lot cheaper, and the same goes for memory & IO (and software caught up too).

As soon as you jump to GPU nodes, you're back to vertical-scaling thinking. Instead of chaining a lot of single-GPU A100 boxes and waiting on inter-node IO, you go multi-GPU (intra-node) and bring data closer/wider. One PCI card on a consumer device might do, say, 8-30GB/s, and much more if you go server grade. Similar multiples for IO, like 4-15 SSDs at 2GB/s each, or whatever network you can get (GDS, ...), or getting more CPU RAM (a TB is a lot cheaper now!) to feed the local GPUs.

It takes a lot to saturate a single GPU node that looks like that. Foundation model teams like OpenAI's & Facebook's core ones doing massive training runs will use hundreds/thousands of GPUs and need those nodes. But people doing fine-tuning, serving inference, and 400GB/s ETL... won't. Replace your roomful of Spark racks with a GPU rack or two. E.g., we have a customer who had a big graph database spread over many CPU nodes, but nowadays we can fit their biggest graph in one GPU's memory. They have many smaller graphs, so we can add a second GPU on the same server and keep all of their graphs in CPU RAM. So a 2-GPU node with a bunch of CPU RAM can replace a rack from the CPU-era vendor. Not even a rack, just a single node. Nvidia's success stories on cutting down Pixar render farms worked similarly at way more impressive scales.

And for folks who haven't been following... Nvidia RAM increases have been impressive. An H100 doubles the A100's RAM (40GB => 80GB), and the H200s OpenAI started using have 141GB. For a lot of workloads we see bursty use vs always-on, so we actually often price out based on $ per GB of GPU RAM: <3 T4 GPUs, despite being old!

victor106 21 hours ago

> Cloudflare R2 (better than AWS S3)

The article links to [1]. Is R2 really better than S3?

[1] https://dansdatathoughts.substack.com/p/from-s3-to-r2-an-eco...

dangoldin 9 hours ago

From an egress + storage cost standpoint, absolutely, which ends up being a big factor for these large-scale data systems.
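
As a rough, assumption-heavy illustration (list prices change; this assumes roughly $0.09/GB for S3 internet egress on the first pricing tier and R2's advertised zero egress fees):

    # Back-of-envelope egress cost for reading a 10 TB dataset out to another
    # cloud or to end users once per month. Prices are assumptions, not quotes.
    gb_per_month = 10_000
    s3_egress_per_gb = 0.09   # approximate S3 internet egress, first tier
    r2_egress_per_gb = 0.00   # R2 advertises no egress fees

    print(f"S3 egress: ${gb_per_month * s3_egress_per_gb:,.0f}/month")   # ~$900
    print(f"R2 egress: ${gb_per_month * r2_egress_per_gb:,.0f}/month")   # $0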

There’s a prior discussion on HN about that post: https://news.ycombinator.com/item?id=38118577

And full disclosure: I'm the author of both posts - I've just shifted my writing to focus more on the company one.

esafak 17 hours ago

By what metric? They're worth consideration and getting better: https://blog.cloudflare.com/r2-events-gcs-migration-infreque...

esafak 17 hours ago

Can someone explain this "semantic layer" business (cube.dev)? Is it just a signal registry that helps you keep track of and query your ETL pipeline outputs?

dangoldin 9 hours ago

Author here. The basic idea is that you want some way of defining metrics. So something like "revenue = sum(sales) - sum(discount)" or "retention = whatever", which gets generated via SQL at query time rather than being baked into a table. Then you can have higher confidence that multiple access paths all use the same definitions for the metrics.
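
A minimal, tool-agnostic sketch of that idea (this is not cube.dev's actual API; the names are invented): the metric is defined once, and every access path renders its SQL from the same definition.

    # Toy "semantic layer": metric definitions live in one place, and every
    # client (dashboard, notebook, API) builds SQL from the same definitions.
    METRICS = {
        "revenue": "SUM(sales) - SUM(discount)",
        "order_count": "COUNT(DISTINCT order_id)",
    }

    def metric_query(metric: str, table: str, group_by: str) -> str:
        """Render consistent SQL for a named metric, grouped by one dimension."""
        return (
            f"SELECT {group_by}, {METRICS[metric]} AS {metric} "
            f"FROM {table} GROUP BY {group_by}"
        )

    print(metric_query("revenue", "orders", "region"))
    # SELECT region, SUM(sales) - SUM(discount) AS revenue FROM orders GROUP BY region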

Phlogi 17 hours ago

Why do I need sqlmesh if I use dbt/Snowflake?

dangoldin 9 hours ago

You don’t need to. dbt/sqlmesh are competitive. I just like the model of sqlmesh over dbt but dbt is much more dominant.