
Citus 10 brings columnar compression to Postgres

220 points | 3 years ago | citusdata.com
georgewfraser 3 years ago

Beware that simply adding a column-oriented storage engine to a row store like Postgres is not going to get you anywhere near the performance of a ground-up columnar system like Redshift or Snowflake. This paper explains why [1]. Short version: most of the benefits are in the column-oriented execution engine, which differs in every aspect of its implementation from a row-oriented execution engine.

[1] https://stratos.seas.harvard.edu/files/stratos/files/columns...

mslot 3 years ago

Correct, though it depends on whether you are CPU-bound or I/O-bound. We see the latter a lot more often for large data sets.

Columnar storage for PostgreSQL is especially relevant in cloud environments. Most database servers in the cloud use managed, network-attached disks because of durability, availability, and encryption-at-rest requirements. However, those do come with a performance penalty compared to local SSDs. The VMs also have IOPS and bandwidth limits, partly to manage capacity within the IaaS platform.

If you can reduce the data size by 10x, then you are effectively increasing your disk bandwidth by that much as well. Moreover, you can keep more data in memory, so you will read much less data from disk, plus you'll only read the columns used by the query. Hence, you're likely to see speed ups of more than 10x for some queries, even without column-oriented execution.
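The I/O point above can be sketched in a few lines of Python (illustrative only, nothing Citus-specific; the column names and data are made up): when each column is compressed separately, a query that touches only one low-cardinality column reads a small fraction of what a row store would.

```python
# Illustrative sketch (not Citus internals): compressing each column
# separately means a query that touches only one column reads far less.
import random
import zlib

random.seed(1)
events = ["view", "click", "purchase"]
rows = [(i, random.getrandbits(128).to_bytes(16, "big").hex(), random.choice(events))
        for i in range(5000)]

# Row store: a query on the event column still reads the whole table.
row_store = zlib.compress("".join(f"{i},{h},{e};" for i, h, e in rows).encode())

# Column store: each column compressed separately; read only the event column.
event_col = zlib.compress(",".join(e for _, _, e in rows).encode())

print(f"row store:    {len(row_store):>6} bytes read")
print(f"event column: {len(event_col):>6} bytes read")
```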

glogla 3 years ago

That's exactly what we've seen. I don't know exactly how Citus stores data, but we found a difference of 30x between gzipped Parquet and a "generic Oracle table".

There's a huge difference between an analytical query scanning the full 30 GB and scanning 1 GB (or maybe even half that or less if you only need to scan some columns).

antpls 3 years ago

Wouldn't it be possible to create a new type of index in Postgres (or maybe it already exists) that takes all the data of a column and simply lays it out in columnar format in memory, with all the benefits of compression?

GordonS 3 years ago

I agree that it's never going to be as effective as something built from the ground up as a column store. But the performance and storage improvements can still be immense over standard Postgres. And if you're already using Postgres, it means one less service to configure and maintain, and means you can easily combine standard relational data with columnar data. For a lot of folks, that's a huge win.

dang 3 years ago

Some past threads on that paper:

The design and implementation of modern column-oriented database systems - https://news.ycombinator.com/item?id=18076547 - Sept 2018 (42 comments)

Design and Implementation of Modern Column-Oriented Databases (2012) [pdf] - https://news.ycombinator.com/item?id=11803299 - May 2016 (9 comments)

teej 3 years ago

One of the gotchas of columnar storage (coming from Redshift) is that you lose all of the compression benefits if you have just one column that’s fat or hard to compress.

In Redshift columns are stored in blocks. You want to fit roughly the same number of column values per block across all your columns. But if you have one column where a small number of values can fit in a block, the rest of the columns end up leaving most of the block space unused. The result is wasted disk space and poor query performance.
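A back-of-envelope sketch of that block math, with made-up value sizes (only the 1 MB block size is real Redshift behavior; everything else is illustrative):

```python
# Back-of-envelope sketch of the fixed-block problem, with made-up value
# sizes; only the 1 MB block size is taken from Redshift's docs.
BLOCK = 1 << 20                      # 1 MiB per column block

fat_value = 512                      # bytes per value, hard to compress
int_value = 4                        # bytes per compressed int value

# If all columns must hold the same number of rows per block, the fat
# column dictates that number, and narrow columns sit mostly empty.
rows_per_block = BLOCK // fat_value
used = rows_per_block * int_value    # bytes the int column actually uses

print(f"rows per block:        {rows_per_block}")
print(f"int block utilization: {used / BLOCK:.1%}")   # ~0.8%
```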

This Postgres extension has similar-sounding storage ideas with stripes, but it’s not clear to me if it suffers from the same issue.

My first test to vet this would be a table with 50 columns of ints and one column of md5 hashes stored as varchar.

jeff-davis 3 years ago

In Citus columnar, stripes/chunks are variable in size. If you have one large column, that doesn't bloat the other columns.

teej 3 years ago

Cool! So if I select two columns with a predicate in the first, does it scan all stripes for the second column?

jeff-davis 3 years ago

Stripes aren't really "scanned". They are more of a logical concept that tracks where the physical data for each column is, and only fetches what it needs.

If I understand what you are asking, let me restate: "can you apply a predicate first on column A before reading column B, so that you can avoid reading column B if the predicate on A doesn't match?".

The answer is: "sometimes". Predicates match some rows and not others, so matching rows may be mixed in with non-matching rows, so it's not always possible to avoid the IO to read column B. However, if the matching and non-matching rows happen to be separated (e.g. if your table is naturally in time order and you have a time-based predicate), then it's able to do this optimization. Please see the section on "Chunk Group Filtering".
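A toy zone-map sketch of this idea (not Citus code; the chunk size and data here are invented, but the min/max-per-chunk mechanism is what the "Chunk Group Filtering" docs describe):

```python
# Toy zone-map sketch of chunk group filtering; chunk size and data are
# made up, the min/max-per-chunk idea is the mechanism being described.
values = list(range(100_000))        # a naturally ordered column (e.g. time)
CHUNK = 10_000
chunks = [values[i:i + CHUNK] for i in range(0, len(values), CHUNK)]

# Metadata kept per chunk: (min, max) of the predicate column.
zone_map = [(min(c), max(c)) for c in chunks]

def chunks_to_read(lo, hi):
    """Indexes of chunks that may contain values in [lo, hi]."""
    return [i for i, (mn, mx) in enumerate(zone_map) if mx >= lo and mn <= hi]

# A narrow range predicate skips 9 of 10 chunks, so the data for the
# other columns in those chunks is never fetched at all.
print(chunks_to_read(42_000, 43_000))   # → [4]
```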

teej 3 years ago

Thanks for the in-depth explanation! I look forward to exploring it more.

georgewfraser 3 years ago

Do you have a source for this, or a code sample that can demonstrate it? This would be an extremely naive implementation of columnar storage. There are some truly hard cases around long variable-length strings, but any halfway decent columnar storage engine should be able to handle columns with different widths.

zten 3 years ago

https://github.com/awslabs/amazon-redshift-utils/blob/master...

The compression actually works as advertised, but that document outlines why you don't want to have the sortkey skew from other columns. The disk I/O balloons as it goes and fetches a larger number of blocks from the other columns that match what you're filtering in the sortkey.

teej 3 years ago

They call this issue out in the docs specifically when the fat column is a sort/distkey. rsbb in the Intermix Slack has studied this in the most detail I know of outside AWS, but they haven't published their book yet.

mrits 3 years ago

Several different compression algorithms would do very well for this, e.g. dictionary compression for low-cardinality wide columns. I'd recommend familiarizing yourself with a handful of different encodings for columnar data stores before you give up on the idea.
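A minimal dictionary-encoding sketch of the kind of encoding mentioned above (illustrative only, not any particular engine's format):

```python
# Minimal dictionary-encoding sketch for a low-cardinality wide column;
# real engines pack the codes into a few bits each, this uses a byte.
values = ["northeast-region", "southwest-region", "northeast-region",
          "northeast-region", "southwest-region"] * 2000

dictionary = sorted(set(values))                 # the few distinct strings
code = {v: i for i, v in enumerate(dictionary)}
encoded = bytes(code[v] for v in values)         # one byte per value

raw = sum(len(v) for v in values)
packed = len(encoded) + sum(len(v) for v in dictionary)
print(f"raw: {raw} bytes, dict-encoded: {packed} bytes")

# Lossless round trip back to the original column.
decoded = [dictionary[b] for b in encoded]
assert decoded == values
```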

tticvs 3 years ago

This caveat would apply to columns queried, not columns stored, right?

biggerfisch 3 years ago

I'm curious to see how this compares in real life to TimescaleDB hypertables with compression, which reads to me as much the same thing. I'm wondering if Citus is possibly bringing a lower-level implementation of the idea?

mslot 3 years ago

The access method approach followed in Citus is indeed lower level and more generic, which means it can be used on both time series data and other types of data.

For time series data, you can use built-in partitioning in PostgreSQL. It's not as easy to use as TimescaleDB, but pg_partman goes a long way, see: https://docs.citusdata.com/en/latest/use_cases/timeseries.ht...

You can then use the columnar access method to compress old partitions (see the end of the doc), and use distributed tables to shard and parallelize queries and DML.

LogicX 3 years ago

Came here to say this. I was looking to see how the compression compares to Timescale's stated 91% compression.

https://docs.timescale.com/latest/using-timescaledb/compress...

atanasovskib 3 years ago

There are a lot of differences that need to be taken into account before making a comparison.

1. TimescaleDB implements the compression at a higher level; the underlying table storage/access method remains the same

2. TimescaleDB doesn't compress the latest data, allowing you to keep fast writes and edits for recent data while still benefiting from compression on row-based data

3. Although not currently available, it is possible to have a TimescaleDB hypertable with a column-based access method

4. Comparing would have to take into account the data model, access methods (types of queries), ingestion vs query comparison (batch vs real time), backfilling and editing, etc

I agree that this (Columnar) would be closer to Parquet.

mslot 3 years ago

It always depends on the data, but we've seen 92.5% and more: https://twitter.com/JeffMealo/status/1368030569557286915

akulkarni 3 years ago

(TimescaleDB person)

TimescaleDB users have seen 98% (i.e. over 50x) compression rates in some real-world cases (e.g., for some IT monitoring datasets), but the compression ratio will definitely vary by dataset. (For example, a dataset of just 0s will compress even better! But that's probably not a realistic dataset :-) )
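A quick illustration of how strongly the ratio depends on the data (zlib here is just a stand-in for any compressor; the numbers are illustrative, not TimescaleDB's algorithms):

```python
# Quick illustration of how much the compression ratio depends on the
# data; zlib is just a stand-in for any compressor.
import os
import zlib

zeros = bytes(1_000_000)             # the unrealistic "all 0s" dataset
noise = os.urandom(1_000_000)        # incompressible worst case

print(f"zeros: {len(zeros) / len(zlib.compress(zeros)):.0f}x")
print(f"noise: {len(noise) / len(zlib.compress(noise)):.2f}x")
```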

The reality is that Citus and TimescaleDB [0][1] take very different approaches to columnar compression, which result in different usability and performance trade-offs. Ultimately, one should choose the right tool for their workload.

(As an aside, if you have time-series data, no one has spent more time developing an awesome time-series experience on Postgres than the TimescaleDB team has :-) )

Kudos to the Citus team for this launch! I love seeing how different members of the Postgres community keep pushing the state-of-the art.

[0] Building columnar compression in a row-oriented database (https://blog.timescale.com/blog/building-columnar-compressio...)

[1] Time-series compression algorithms, explained (https://blog.timescale.com/blog/time-series-compression-algo...)

bradleyjg 3 years ago

This reads to me more like Parquet.

simonw 3 years ago

Reassuring to see big new features like this coming out after the Microsoft acquisition, a healthy sign that this open source product continues to see serious ongoing investment.

ablekh 3 years ago

Technically, the Citus product is open core, not open source. Though open-sourcing the shard rebalancer is a welcome recent step.

merb 3 years ago

Well, since they are the only cloud vendor with an HA master-master PostgreSQL database (managed CitusDB), it's probably also a selling point.

manigandham 3 years ago

Where are you seeing this multi-master feature? As far as I can tell, it uses standby nodes, not multiple active coordinators.

pm90 3 years ago

Wait... they have 2 read/write masters? And you can read/write from either master at the same time?

manigandham 3 years ago

No, it creates a standby (of every node) when HA is enabled.

https://docs.microsoft.com/en-us/azure/postgresql/howto-hype...

merb 3 years ago

Yes, but it's more like Vitess, which is basically sharded masters. Most of the time this works way better than something like Galera or BDR.

gigatexal 3 years ago

Great start! Keep in mind the limitations:

What are the Limitations?

These limitations are not set in stone, and we look forward to working on them in the future:

* No UPDATE or DELETE support
* No index support
* No logical replication or logical decoding support

See more limitations in the columnar README

tpetry 3 years ago

cstore_fdw had the same limitations, so even though they say these limitations may not persist forever, I am not very hopeful they really want to solve this problem.

But they could solve it with just a little bit of work:

* Keep a hidden bitmap that stores whether each „row" of the columnar table is still valid
* When updating/deleting values, only set the bitmap bit to zero to indicate the value is no longer valid
* Append every updated „row" to a special chunk at the end of the table in uncompressed format
* When vacuuming the table, delete the old rows with bitmap=0 and merge in the values from the special chunk

So you would get the same performance if UPDATE/DELETE is never done, as the complete bitmap is filled with 1s. And every update/delete would make it just a little bit slower; since mostly old data is stored in compressed format, not many uncompressed values would accumulate. And a VACUUM FULL would optimize these old tables again.
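A toy Python sketch of that scheme (all names invented; this has nothing to do with Citus internals): a validity bitmap over the compressed rows, an uncompressed tail chunk for new versions, and a vacuum step that merges the two.

```python
# Toy sketch of the proposed scheme (invented names, not Citus code):
# a validity bitmap over compressed rows, an uncompressed tail chunk
# for new versions, and a vacuum that merges the two.
class ColumnarTable:
    def __init__(self, rows):
        self.compressed = list(rows)     # stand-in for compressed stripes
        self.valid = [True] * len(rows)  # the hidden bitmap
        self.tail = []                   # uncompressed special chunk

    def delete(self, idx):
        self.valid[idx] = False          # just flip the bit

    def update(self, idx, new_row):
        self.valid[idx] = False          # old version becomes invisible...
        self.tail.append(new_row)        # ...new version goes to the tail

    def scan(self):
        yield from (r for r, v in zip(self.compressed, self.valid) if v)
        yield from self.tail

    def vacuum(self):
        """Drop bitmap=0 rows and fold the tail back into the main store."""
        self.compressed = list(self.scan())
        self.valid = [True] * len(self.compressed)
        self.tail = []

t = ColumnarTable(["a", "b", "c"])
t.update(1, "b2")
t.delete(2)
print(list(t.scan()))                    # → ['a', 'b2']
t.vacuum()
```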

jeff-davis 3 years ago

Thank you for the suggestions! We are interested in UPDATE/DELETE.

Can you describe your use case for columnar update/delete in a little more detail? Is it a few random updates, or bulk updates, or something else?

arp242 3 years ago

Not the previous poster, but my use case is storing site analytics (i.e. "Google Analytics"-like stuff) and allowing people to retain pageviews for only n days.

jeff-davis 3 years ago

+1
tpetry 3 years ago

I would need just random updates. In almost all cases I wouldn't update old records, but sometimes some records need to be rewritten or deleted because of GDPR requests.

wyck 3 years ago

Amazon really missed the boat on Citus; a few more great acquisitions like this and Azure is going to look great (yes, I know it can still be self-installed on AWS).

manigandham 3 years ago

AWS has Aurora and Redshift, along with plenty of other database tech.

jfbaro 3 years ago

I am glad to see the PostgreSQL ecosystem growing. It gets better by the day. Thanks to all the people involved!

znpy 3 years ago

Since we're talking PostgreSQL:

I recently started diving into PostgreSQL and it seems to me that there is a patchwork of HA solutions (with Patroni being the most feature-full), but no real multi-master solution for PostgreSQL released under an open source license.

There's BDR (bi-directional replication), but apparently 2ndQuadrant pulled it back under a proprietary license, am I right?

What's the current status of PostgreSQL multi-master HA?

whitepoplar 3 years ago

Not multi-master, but pg_auto_failover looks to be a very nice solution for HA, and one that is operationally simple to manage: https://github.com/citusdata/pg_auto_failover

porsager 3 years ago

I just yesterday set up pg_auto_failover, and it was such a breeze compared to Patroni/PAF/repmgr. Fewer moving parts and a much simpler but flexible setup. They did a really good job there! It also forced me to finally implement multihost support in my Postgres.js driver.

merb 3 years ago

> i recently started diving into postgresql and it seems to me that there is a patchwork of HA solutions (with patroni being the most feature-full) but no real multi-master solution for postgresql released under an open source license.

True multi-master is rarely needed. But there is Citus, which uses the GPL and can be run in multi-master mode. And yes, Patroni is really, really awesome.

znpy 3 years ago

Thanks, I'll look into that!

brightball 3 years ago

Awesome news! After seeing the benefits with the cstore extension, I assumed this was only a matter of time.

BenoitP 3 years ago

So are they using Parquet, ORC, or Arrow under the hood, or do they have a custom format?

mslot 3 years ago

It is a custom format that was originally derived from ORC, but is very different at this point. For instance, all the metadata is kept in PostgreSQL catalog tables to make changes transactional.

truth_seeker 3 years ago

Memory, I/O, computation, and DB connection concurrency: all four have to be kept in mind to achieve optimal latency across a wide range of complex queries.

It has been my observation that following a 3NF design, table partitioning, and carefully crafted materialized views works 8 out of 10 times when the data size is huge.

nwatson 3 years ago

Sensage/Addamark was too early to the columnar storage game in 2001-2003 ... https://en.wikipedia.org/wiki/Sensage.

atombender 3 years ago

Sybase IQ (now SAP IQ, originally acquired from Expressway) was launched in 1995.

There are other old column databases, such as kdb, but I believe Sybase/Expressway was the first commercial product.

jacques_chester 3 years ago

I think the claim here is that Citus added it to PostgreSQL.

Depending on how one defines that claim, Greenplum may have been first.

Disclosure: I work for VMware, which sponsors Greenplum development.

clairegiordano 3 years ago

Definitely not intending to take credit away from any other teams. There are so many good Postgres extensions and forks in this ecosystem.

What we've done is add a Columnar storage feature into the Citus open source extension to Postgres, as part of Citus 10.

One way to think of it: Citus 10 now gives Postgres users (those who are also using the Citus extension) a columnar compression option, for use on a single node and/or a distributed Citus cluster.

jacques_chester 3 years ago

Thanks. Regardless of who did what, columnar stores are non-trivial engineering and anyone who produces a production-ready one is worthy of admiration.

asah 3 years ago

Greenplum was founded in 2003, A/S was 2001.

jacques_chester 3 years ago

Again, my qualifier is "to PostgreSQL". I couldn't see such a connection on a brief skim of the A/S wikipedia entry.

mcrutcher 3 years ago

+2