
Citus 10 brings columnar compression to Postgres

220 points | 3 years ago | citusdata.com
georgewfraser 3 years ago

Beware that simply adding a column-oriented storage engine to a row store like Postgres is not going to get you anywhere near the performance of a ground-up columnar system like Redshift or Snowflake. This paper explains why [1]. Short version: most of the benefits are in the column-oriented execution engine, which differs in every aspect of its implementation from a row-oriented execution engine.

[1] https://stratos.seas.harvard.edu/files/stratos/files/columns...

mslot 3 years ago

Correct, though it depends on whether you are CPU-bound or I/O-bound. We see the latter a lot more often for large data sets.

Columnar storage for PostgreSQL is especially relevant in cloud environments. Most database servers in the cloud use managed, network-attached disks because of durability, availability, and encryption-at-rest requirements. However, those do come with a performance penalty compared to local SSDs. The VMs also have IOPS and bandwidth limits, partly to manage capacity within the IaaS platform.

If you can reduce the data size by 10x, then you are effectively increasing your disk bandwidth by that much as well. Moreover, you can keep more data in memory, so you will read much less data from disk, plus you'll only read the columns used by the query. Hence, you're likely to see speed ups of more than 10x for some queries, even without column-oriented execution.
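The I/O point above can be sketched in a few lines of Python (illustrative only, nothing Citus-specific; the column names and data are made up): when each column is compressed separately, a query that touches only one low-cardinality column reads a small fraction of what a row store would.

```python
# Illustrative sketch (not Citus internals): compressing each column
# separately means a query that touches only one column reads far less.
import random
import zlib

random.seed(1)
events = ["view", "click", "purchase"]
rows = [(i, random.getrandbits(128).to_bytes(16, "big").hex(), random.choice(events))
        for i in range(5000)]

# Row store: a query on the event column still reads the whole table.
row_store = zlib.compress("".join(f"{i},{h},{e};" for i, h, e in rows).encode())

# Column store: each column compressed separately; read only the event column.
event_col = zlib.compress(",".join(e for _, _, e in rows).encode())

print(f"row store:    {len(row_store):>6} bytes read")
print(f"event column: {len(event_col):>6} bytes read")
```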

glogla 3 years ago

That's exactly what we've seen. I don't know exactly how Citus stores data, but we found a difference of 30x between gzipped Parquet and a "generic Oracle table".

There's a huge difference between an analytical query scanning the full 30 GB and scanning 1 GB (or maybe even half that or less if you only need to scan some columns).

antpls 3 years ago

Wouldn't it be possible to create a new type of index in Postgres (or maybe it already exists) that takes all the data of a column and simply lays it out in columnar format in memory, with all the benefits of compression?

GordonS 3 years ago

I agree that it's never going to be as effective as something built from the ground up as a column store. But the performance and storage improvements can still be immense over standard Postgres. And if you're already using Postgres, it means one less service to configure and maintain, and means you can easily combine standard relational data with columnar data. For a lot of folks, that's a huge win.

dang 3 years ago

Some past threads on that paper:

The design and implementation of modern column-oriented database systems - https://news.ycombinator.com/item?id=18076547 - Sept 2018 (42 comments)

Design and Implementation of Modern Column-Oriented Databases (2012) [pdf] - https://news.ycombinator.com/item?id=11803299 - May 2016 (9 comments)

teej 3 years ago

One of the gotchas of columnar storage (coming from Redshift) is that you lose all of the compression benefits if you have just one column that’s fat or hard to compress.

In Redshift columns are stored in blocks. You want to fit roughly the same number of column values per block across all your columns. But if you have one column where a small number of values can fit in a block, the rest of the columns end up leaving most of the block space unused. The result is wasted disk space and poor query performance.
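A back-of-envelope sketch of that block math, with made-up value sizes (only the 1 MB block size is real Redshift behavior; everything else is illustrative):

```python
# Back-of-envelope sketch of the fixed-block problem, with made-up value
# sizes; only the 1 MB block size is taken from Redshift's docs.
BLOCK = 1 << 20                      # 1 MiB per column block

fat_value = 512                      # bytes per value, hard to compress
int_value = 4                        # bytes per compressed int value

# If all columns must hold the same number of rows per block, the fat
# column dictates that number, and narrow columns sit mostly empty.
rows_per_block = BLOCK // fat_value
used = rows_per_block * int_value    # bytes the int column actually uses

print(f"rows per block:        {rows_per_block}")
print(f"int block utilization: {used / BLOCK:.1%}")   # ~0.8%
```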

This Postgres extension has similar-sounding storage ideas with stripes, but it’s not clear to me if it suffers from the same issue.

My first test to vet this would be a table with 50 columns of ints and one column of md5 hashes stored as varchar.

jeff-davis 3 years ago

In Citus columnar, stripes/chunks are variable in size. If you have one large column, that doesn't bloat the other columns.

teej 3 years ago

Cool! So if I select two columns with a predicate in the first, does it scan all stripes for the second column?

jeff-davis 3 years ago

Stripes aren't really "scanned". They are more of a logical concept that tracks where the physical data for each column is, and only fetches what it needs.

If I understand what you are asking, let me restate: "can you apply a predicate first on column A before reading column B, so that you can avoid reading column B if the predicate on A doesn't match?".

The answer is: "sometimes". Predicates match some rows and not others, so matching rows may be mixed in with non-matching rows, so it's not always possible to avoid the IO to read column B. However, if the matching and non-matching rows happen to be separated (e.g. if your table is naturally in time order and you have a time-based predicate), then it's able to do this optimization. Please see the section on "Chunk Group Filtering".
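A toy zone-map sketch of this idea (not Citus code; the chunk size and data here are invented, but the min/max-per-chunk mechanism is what the "Chunk Group Filtering" docs describe):

```python
# Toy zone-map sketch of chunk group filtering; chunk size and data are
# made up, the min/max-per-chunk idea is the mechanism being described.
values = list(range(100_000))        # a naturally ordered column (e.g. time)
CHUNK = 10_000
chunks = [values[i:i + CHUNK] for i in range(0, len(values), CHUNK)]

# Metadata kept per chunk: (min, max) of the predicate column.
zone_map = [(min(c), max(c)) for c in chunks]

def chunks_to_read(lo, hi):
    """Indexes of chunks that may contain values in [lo, hi]."""
    return [i for i, (mn, mx) in enumerate(zone_map) if mx >= lo and mn <= hi]

# A narrow range predicate skips 9 of 10 chunks, so the data for the
# other columns in those chunks is never fetched at all.
print(chunks_to_read(42_000, 43_000))   # → [4]
```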

teej 3 years ago

Thanks for the in-depth explanation! I look forward to exploring it more.

georgewfraser 3 years ago

Do you have a source for this, or a code sample that can demonstrate it? This would be an extremely naive implementation of columnar storage. There are some truly hard cases around long variable-length strings, but any halfway decent columnar storage engine should be able to handle columns with different widths.

zten 3 years ago

https://github.com/awslabs/amazon-redshift-utils/blob/master...

The compression actually works as advertised, but that document outlines why you don't want to have the sortkey skew from other columns. The disk I/O balloons as it goes and fetches a larger number of blocks from the other columns that match what you're filtering in the sortkey.

teej 3 years ago

They call this issue out in the docs specifically when the fat column is a sort/distkey. rsbb in the Intermix Slack has studied this in the most detail I know of outside AWS, but they haven't published their book yet.

mrits 3 years ago

Several different compression algorithms would do very well for this, e.g. dictionary compression for low-cardinality wide columns. I'd recommend familiarizing yourself with a handful of different encodings for columnar data stores before you give up on the idea.
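A minimal dictionary-encoding sketch of the kind of encoding mentioned above (illustrative only, not any particular engine's format):

```python
# Minimal dictionary-encoding sketch for a low-cardinality wide column;
# real engines pack the codes into a few bits each, this uses a byte.
values = ["northeast-region", "southwest-region", "northeast-region",
          "northeast-region", "southwest-region"] * 2000

dictionary = sorted(set(values))                 # the few distinct strings
code = {v: i for i, v in enumerate(dictionary)}
encoded = bytes(code[v] for v in values)         # one byte per value

raw = sum(len(v) for v in values)
packed = len(encoded) + sum(len(v) for v in dictionary)
print(f"raw: {raw} bytes, dict-encoded: {packed} bytes")

# Lossless round trip back to the original column.
decoded = [dictionary[b] for b in encoded]
assert decoded == values
```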

tticvs 3 years ago

This caveat would apply to columns queried, not columns stored, right?

biggerfisch 3 years ago

I'm curious to see how this compares in real life to TimescaleDB hypertables with compression, which reads to me as much the same thing. I'm wondering if Citus is possibly bringing a lower-level implementation of the idea?

mslot 3 years ago

The access method approach followed in Citus is indeed lower level and more generic, which means it can be used on both time series data and other types of data.

For time series data, you can use built-in partitioning in PostgreSQL. It's not as easy to use as TimescaleDB, but pg_partman goes a long way, see: https://docs.citusdata.com/en/latest/use_cases/timeseries.ht...

You can then use the columnar access method to compress old partitions (see the end of the doc), and use distributed tables to shard and parallelize queries and DML.

LogicX 3 years ago

Came here to say this. I was looking to see how the compression compares to Timescale's stated 91% compression.

https://docs.timescale.com/latest/using-timescaledb/compress...

atanasovskib 3 years ago

There are a lot of differences that need to be taken into account before making a comparison.

1. TimescaleDB implements the compression at a higher level; the underlying table storage/access method remains the same

2. TimescaleDB doesn't compress the latest data, allowing you to keep fast writes and edits for recent data while still benefiting from compression on row-based data

3. Although not currently available, it is possible to have a TimescaleDB hypertable with a column-based access method

4. Comparing would have to take into account the data model, access methods (types of queries), ingestion vs query comparison (batch vs real time), backfilling and editing, etc

I agree that this (Columnar) would be closer to Parquet.

mslot 3 years ago

It always depends on the data, but we've seen 92.5% and more: https://twitter.com/JeffMealo/status/1368030569557286915

akulkarni 3 years ago

(TimescaleDB person)

TimescaleDB users have seen 98% (i.e. over 50x) compression rates in some real-world cases (e.g., for some IT monitoring datasets), but the compression ratio will definitely vary by dataset. (For example, a dataset of just 0s will compress even better! But that's probably not a realistic dataset :-) )
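A quick illustration of how strongly the ratio depends on the data (zlib here is just a stand-in for any compressor; the numbers are illustrative, not TimescaleDB's algorithms):

```python
# Quick illustration of how much the compression ratio depends on the
# data; zlib is just a stand-in for any compressor.
import os
import zlib

zeros = bytes(1_000_000)             # the unrealistic "all 0s" dataset
noise = os.urandom(1_000_000)        # incompressible worst case

print(f"zeros: {len(zeros) / len(zlib.compress(zeros)):.0f}x")
print(f"noise: {len(noise) / len(zlib.compress(noise)):.2f}x")
```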

The reality is that Citus and TimescaleDB [0][1] take very different approaches to columnar compression, which result in different usability and performance trade-offs. Ultimately, one should choose the right tool for their workload.

(As an aside, if you have time-series data, no one has spent more time developing an awesome time-series experience on Postgres than the TimescaleDB team has :-) )

Kudos to the Citus team for this launch! I love seeing how different members of the Postgres community keep pushing the state-of-the art.

[0] Building columnar compression in a row-oriented database (https://blog.timescale.com/blog/building-columnar-compressio...)

[1] Time-series compression algorithms, explained (https://blog.timescale.com/blog/time-series-compression-algo...)

bradleyjg 3 years ago

This reads to me more like Parquet.

simonw 3 years ago

Reassuring to see big new features like this coming out after the Microsoft acquisition, a healthy sign that this open source product continues to see serious ongoing investment.

ablekh 3 years ago

Technically, the Citus product is open core, not open source. Though open-sourcing the shard rebalancer is a welcome recent step.

merb 3 years ago

Well, since they are the only cloud vendor with an HA master-master PostgreSQL database (managed CitusDB), it's probably also a selling point.

manigandham 3 years ago

Where are you seeing this multi-master feature? As far as I can tell, it uses standby nodes, not multiple active coordinators.

pm90 3 years ago

Wait... they have 2 read/write masters? And you can read/write from either master at the same time?

manigandham 3 years ago

No, it creates a standby (of every node) when HA is enabled.

https://docs.microsoft.com/en-us/azure/postgresql/howto-hype...

merb 3 years ago

Yes, but it's more like Vitess, which is basically sharded masters. Most of the time this works way better than something like Galera or BDR.

gigatexal 3 years ago

Great start! Keep in mind the limitations:

What are the Limitations?

These limitations are not set in stone, and we look forward to working on them in the future:

* No UPDATE or DELETE support
* No index support
* No logical replication or logical decoding support

See more limitations in the columnar README

tpetry 3 years ago

cstore_fdw had the same limitations, so even though they say these limitations may not persist forever, I am not very hopeful they really want to solve this problem.

But they could solve it with just a little bit of work:

* Keep a hidden bitmap that stores whether each „row" of the columnar table is still valid
* When updating/deleting values, only set the bitmap bit to zero to indicate the value is no longer valid
* Append every updated „row" to a special chunk at the end of the table in uncompressed format
* When vacuuming the table, delete the old rows with bitmap=0 and merge in the values from the special chunk

So you would get the same performance if UPDATE/DELETE is never done, as the complete bitmap is filled with 1s. And every update/delete would make it just a little bit slower; since mostly old data is stored in compressed format, not many uncompressed values would accumulate. And a VACUUM FULL would optimize these old tables again.
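A toy Python sketch of that scheme (all names invented; this has nothing to do with Citus internals): a validity bitmap over the compressed rows, an uncompressed tail chunk for new versions, and a vacuum step that merges the two.

```python
# Toy sketch of the proposed scheme (invented names, not Citus code):
# a validity bitmap over compressed rows, an uncompressed tail chunk
# for new versions, and a vacuum that merges the two.
class ColumnarTable:
    def __init__(self, rows):
        self.compressed = list(rows)     # stand-in for compressed stripes
        self.valid = [True] * len(rows)  # the hidden bitmap
        self.tail = []                   # uncompressed special chunk

    def delete(self, idx):
        self.valid[idx] = False          # just flip the bit

    def update(self, idx, new_row):
        self.valid[idx] = False          # old version becomes invisible...
        self.tail.append(new_row)        # ...new version goes to the tail

    def scan(self):
        yield from (r for r, v in zip(self.compressed, self.valid) if v)
        yield from self.tail

    def vacuum(self):
        """Drop bitmap=0 rows and fold the tail back into the main store."""
        self.compressed = list(self.scan())
        self.valid = [True] * len(self.compressed)
        self.tail = []

t = ColumnarTable(["a", "b", "c"])
t.update(1, "b2")
t.delete(2)
print(list(t.scan()))                    # → ['a', 'b2']
t.vacuum()
```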

jeff-davis 3 years ago

Thank you for the suggestions! We are interested in UPDATE/DELETE.

Can you describe your use case for columnar update/delete in a little more detail? Is it a few random updates, or bulk updates, or something else?

arp242 3 years ago

Not the previous poster, but my use case is storing site analytics (i.e. "Google Analytics"-like stuff) and allowing people to retain pageviews for only n days.

jeff-davis 3 years ago

+1
tpetry 3 years ago

I would need just random updates. In almost all cases I wouldn't update old records, but sometimes some records need to be rewritten or deleted because of GDPR requests.

wyck 3 years ago

Amazon really missed the boat on Citus; a few more great acquisitions like this and Azure is going to look great (yes, I know it can still be self-installed on AWS).

manigandham 3 years ago

AWS has Aurora and Redshift, along with plenty of other database tech.

jfbaro 3 years ago

I am glad to see the PostgreSQL ecosystem growing. It gets better by the day. Thanks to all the people involved!

znpy 3 years ago

Since we're talking PostgreSQL:

I recently started diving into PostgreSQL and it seems to me that there is a patchwork of HA solutions (with Patroni being the most feature-full), but no real multi-master solution for PostgreSQL released under an open source license.

There's BDR (bi-directional replication), but apparently 2ndQuadrant pulled it back under a proprietary license, am I right?

What's the current status of PostgreSQL multi-master HA?

whitepoplar 3 years ago

Not multi-master, but pg_auto_failover looks to be a very nice solution for HA, and one that is operationally simple to manage: https://github.com/citusdata/pg_auto_failover

porsager 3 years ago

I just yesterday set up pg_auto_failover, and it was such a breeze compared to Patroni/PAF/repmgr. Fewer moving parts and a much simpler but flexible setup. They did a really good job there! It also forced me to finally implement multihost support in my Postgres.js driver.

merb 3 years ago

> i recently started diving into postgresql and it seems to me that there is a patchwork of HA solutions (with patroni being the most feature-full) but no real multi-master solution for postgresql released under an open source license.

True multi-master is rarely needed. But there is Citus, which uses the GPL and can be run in multi-master mode. And yes, Patroni is really, really awesome.

znpy 3 years ago

Thanks, I'll look into that!

brightball 3 years ago

Awesome news! After seeing the benefits with the cstore extension, I assumed this was only a matter of time.

BenoitP 3 years ago

So are they using Parquet, ORC, or Arrow under the hood, or do they have a custom format?

mslot 3 years ago

It is a custom format that was originally derived from ORC, but is very different at this point. For instance, all the metadata is kept in PostgreSQL catalog tables to make changes transactional.

truth_seeker 3 years ago

Memory, I/O, computation, and DB connection concurrency: all four have to be kept in mind to achieve optimal latency across a wide range of complex queries.

It has been my observation that following a 3NF design, table partitioning, and carefully crafted materialized views works 8 out of 10 times when the data size is huge.

nwatson 3 years ago

Sensage/Addamark was too early to the columnar storage game in 2001-2003 ... https://en.wikipedia.org/wiki/Sensage.

atombender 3 years ago

Sybase IQ (now SAP IQ, originally acquired from Expressway) was launched in 1995.

There are other old column databases, such as kdb, but I believe Sybase/Expressway was the first commercial product.

jacques_chester 3 years ago

I think the claim here is that Citus added it to PostgreSQL.

Depending on how one defines that claim, Greenplum may have been first.

Disclosure: I work for VMware, which sponsors Greenplum development.

clairegiordano 3 years ago

Definitely not intending to take credit away from any other teams. There are so many good Postgres extensions and forks in this ecosystem.

What we've done is add a Columnar storage feature into the Citus open source extension to Postgres, as part of Citus 10.

One way to think of it: Citus 10 now gives Postgres users (those who are also using the Citus extension) a columnar compression option, for use on a single node and/or a distributed Citus cluster.

jacques_chester 3 years ago

Thanks. Regardless of who did what, columnar stores are non-trivial engineering and anyone who produces a production-ready one is worthy of admiration.

asah 3 years ago

Greenplum was founded in 2003, A/S was 2001.

jacques_chester 3 years ago

Again, my qualifier is "to PostgreSQL". I couldn't see such a connection on a brief skim of the A/S wikipedia entry.

mcrutcher 3 years ago

+2