
Comparing ClickHouse to PostgreSQL and TimescaleDB for time-series data

251 points | 2 years ago | blog.timescale.com
dmw_ng2 years ago

More war stories: found Timescale easier to set up (maybe just because it's more familiar), but raw query perf is not something you just magically get for free. Timescale requires a lot of investment in planning. In one project we had simple time range scan queries against a less-than-RAM-sized table taking tens of seconds to complete.

ClickHouse has a bit more ops overhead, but requires very little in the way of pre-planning. Just throw whatever you want at it and it seems to sing by default.

Regarding ops overhead, ClickHouse also has a "local" mode where you can query a huge range of compressed formats without ever performing any kind of ETL step. That means queries can run e.g. directly against S3 storage. For batch logs analysis, IMHO local is a gamechanger. Most bulk logging systems produce massive amounts of S3 objects, and ClickHouse lets you tear through these infrequently (but at high speed) when desired without any bulky/opsey ETL step, and no cluster running idle just waiting for a handful of daily jobs to arrive.
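
For flavor, a minimal sketch of the kind of ad-hoc query this enables (the bucket path, format, and columns here are made up):

  SELECT status, count() AS hits
  FROM s3('https://my-bucket.s3.amazonaws.com/logs/2021-10-*.json.gz',
          'JSONEachRow',
          'ts DateTime, status UInt16, path String')
  GROUP BY status
  ORDER BY hits DESC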

(I love this style of system in general, but the clear engineering work behind ClickHouse completely won me over, I'd apply it anywhere I could)

akulkarni2 years ago

(TimescaleDB co-founder)

Thank you for the feedback - it is conversations like this that enable us to understand how we can continue to make TimescaleDB better.

And some of the ideas you are discussing are on our roadmap - if anyone wants to help, we are hiring :-)

https://www.timescale.com/careers

kokizzu42 years ago

I found ClickHouse very easy to set up '__') Haven't tried TimescaleDB tho.

andrejserafim2 years ago

Our anecdata: we store telemetry per thing. After loading a month's worth of data, TimescaleDB as hosted by their cloud ran a difference aggregation in seconds. ClickHouse routinely did it in 20 millis.

Simple avg, etc. were better, but ClickHouse was always an order of magnitude faster than Timescale. We didn't invest a whole bunch into optimization other than trying some indexing strategies in TimescaleDB.

So for our use case the choice is clear.

ryanbooz2 years ago

(N.B. post author)

Thanks for the feedback. Without knowing your situation, one of the things we show in the blog post is that TimescaleDB compression often changes the game on those kinds of queries (data is transformed to columnar storage when you compress). You don't mention if you did that or not, but it's something we've seen/noticed in every other benchmark at this point - that folks don't enable it for the benchmark.
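
For anyone curious, enabling it is roughly the following sketch (the hypertable, column names, and policy interval here are illustrative):

  ALTER TABLE metrics SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'device_id',
    timescaledb.compress_orderby = 'time DESC'
  );
  -- compress chunks older than a week in the background
  SELECT add_compression_policy('metrics', INTERVAL '7 days');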

And the second point of the article is that you have lots of options for whatever works in your specific situation. But make sure you're using the chosen database's features before counting it out. :-)

maxmcd2 years ago

I wonder if it's worth taking a page out of the MongoDB book and enabling these kinds of benchmark-altering settings by default. We certainly selected ClickHouse over Timescale internally because of major performance differences in our internal testing that might have gone the other way had we "known better".

ryanbooz2 years ago

Indeed. Lots of discussion over this in the last few months. There are nuances, but I think you'll see some progress in this area over the next year.

ants_a2 years ago

From my experience of benchmarking these databases on scientific data (highly regular timeseries) and looking at the internals of both, these kinds of numbers happen when answering the query needs crunching through many rows but the output has few - i.e. the queries are filtering and/or aggregating a ton of input rows that can't be excluded by indexes or queried from preaggregations.

From what I can tell it comes down to execution engine differences. Timescale, even with compressed tables, uses a row-by-row execution engine architecturally resembling IE6-era JS engines. ClickHouse uses a batched and vectorized execution engine utilizing SIMD. The difference is one to two orders of magnitude of throughput in terms of raw number of rows per core pushed through the execution engine.

Postgres/TimeScale could certainly also implement a similar model of execution, but to call it an undertaking would be an understatement considering the breadth and extensibility of features that the execution engine would need to support. To my knowledge no one is seriously working on this outside of limited capability hacks like vops or PG-Strom extensions.

ryanbooz2 years ago

(post author)

You do a great job summarizing some of the benefits of ClickHouse we mentioned in the post, including the vectorized engine!

That said, I'm not sure I'd refer to the PostgreSQL/TimescaleDB engine architecture as resembling IE6 JS support. Obviously YMMV, but every release of PG and TimescaleDB brings new advancements to query optimizations for the architecture they are designed for, which was the focus of the post.

I'm personally still impressed, after 20+ years of working with SQL and relational databases, when any optimization engine can use statistics to find the "best" plan among (potentially) thousands in a few ms. Maybe I'm too easily impressed. :-D

ants_a2 years ago

The optimization engine is of course great (despite occasionally missing hard), but I am not referring to it. I am referring to the way that PostgreSQL executes query plans: the way rows are pulled up the execution tree is very similar to first-generation JavaScript engines - a tree-based interpreter. Picking out columns from rows and evaluating expressions used to work the same way until PG11, where we got a bytecode-based interpreter and a JIT for those. But so far rows still work the same way, and it hurts pretty badly when row lookup is cheap and the rows end up either thrown away or aggregated together with basic math.

claytonjy2 years ago

Was this for your primary source-of-truth, or more of a downstream data warehouse, or something else?

I'm struggling to imagine a case where these are the two things being considered; Timescale is the obvious choice for a primary database, Clickhouse the obvious choice for a warehouse. I wouldn't let my user-facing app write to Clickhouse, and while I could potentially get away with a read-only Timescale replica for internal-facing reports I would expect to eventually outgrow that and reach for Clickhouse/Snowflake/Redshift.

encoderer2 years ago

> I wouldn't let my user-facing app write to Clickhouse

I’ve been thinking of doing exactly that. What are your concerns?

dreyfan2 years ago

https://blog.cloudflare.com/http-analytics-for-6m-requests-p...

has some good thoughts. The main thing you'll likely need is some sort of a buffer layer so you can do bulk inserts. Do not write a high-volume of single-row inserts into Clickhouse.

shaklee32 years ago

Chproxy is designed to handle this

https://github.com/Vertamedia/chproxy

claytonjy2 years ago

I suppose it depends what you're going to let your user do, but OLAPs in general and Clickhouse in particular don't do well under row-oriented workloads, as described in the post here. I'm imagining users primarily operating on small numbers of rows and sometimes making updates to or deleting them, a worst-case scenario for Clickhouse but best-case for an OLTP like Postgres.

dominotw2 years ago

> Clickhouse the obvious choice for a warehouse

> Clickhouse/Snowflake/Redshift.

But ClickHouse is very unlike the other two. When I think of a warehouse I think star schema, data modeling, etc., not something that hates joins.

claytonjy2 years ago

Agreed, I wouldn't use Clickhouse for usual warehouse stuff either, mostly because I can't imagine it plays well with dbt which is a non-starter these days.

I'd still argue Clickhouse is closer to Snowflake/Redshift than anything OLTP, and their name is intentionally chosen to evoke warehouse-like scenarios.

FridgeSeal2 years ago

What makes you think CH doesn’t like joins?

Having used Redshift, Snowflake and CH for similar workloads, I’d much prefer ClickHouse to the other 2.

Snowflake is hideously expensive for the subpar perf it offers in my experience and Redshift is mediocre at best in general.

david382 years ago

Clickhouse is nothing like snowflake. A Postgres user can be useful in Clickhouse in minutes, not even close for snowflake.

hodgesrm2 years ago

Is your comment on ClickHouse and DBT based on using the DBT ClickHouse plugin? [0] If so I would be very interested in understanding what you or others see as deficiencies.

[0] https://github.com/silentsokolov/dbt-clickhouse

GordonS2 years ago

How many data points were those aggregations being computed over? How much memory does your Postgres server have, and are you using SSD storage (with associated postgres config tweaks)?

ryanbooz2 years ago

(Post author)

Howdy! We provided all of those details in the post and you're welcome to join us next week when we live-stream our setup and test!

https://blog.timescale.com/blog/what-is-clickhouse-how-does-...

GordonS2 years ago

I was responding to @andrejserafim, asking about their scenario, not the article.

ryanbooz2 years ago

Gotcha! My apologies for not seeing the thread nature. HN threads get me sometimes. :-)

qaq2 years ago

ClickHouse shines at scales that Timescale has no hope of ever supporting. Hence the choice of workloads in the test. Cloudflare was ingesting 6,000,000 rows per second into a 36-node (dual E5-2630) ClickHouse cluster back in 2018 - something like 20PB of data per year.

akulkarni2 years ago

(TimescaleDB co-founder)

6,000,000 rows inserted per second is great! And if you need that for your workload, then you probably should choose ClickHouse over TimescaleDB (well at least, for now ;-)

The reason we don't include that in the benchmark is that most developers _don't need_ 6,000,000 rows inserted per second.

And also - that performance doesn't come for free, but requires giving up a lot of things that most developers may need: e.g., no transactions, immutable tables (can't easily update / delete data), SQL-like but not quite SQL query language, inefficient joins, inefficient for point queries retrieving single rows by their keys, etc. (We go into much more detail in the blog post.)

So it comes down to the fundamental analogy used in the post: Do you need a car (versatility) or a bulldozer (narrow specialization)?

If the answer is that you need to support 6,000,000 rows inserted per second, then by all means, choose the bulldozer.

> ClickHouse shines at scales that timescale has no hope of ever supporting.

I'm not sure if this was a throwaway line, or if it was the result of a detailed analysis of TimescaleDB's architecture, but if you don't mind, I'll share this: with TimescaleDB multi-node [0] we are getting close to that performance, and the product keeps getting better.

[0] https://blog.timescale.com/blog/building-a-distributed-time-...

ryanbooz2 years ago

(post author)

Those are great, impressive numbers. We certainly don't claim to be all things to all people, but the benchmark was run using single instances mostly because that is what most other benchmarks published by others have done.

With a multi-node TimescaleDB cluster, ingest does literally scale to millions of rows/second, and we have numerous users achieving great numbers. One Fortune 100 company has a 25+ node TimescaleDB cluster backing their network monitoring stack and loves it for their use case.

At some point, when we can, I'm sure we'll start to do more with multi-node benchmarking too to give some input to the conversation.

qaq2 years ago

And then it will come down to the spec of the nodes, the actual fields, and so on. Also, batch size obviously plays a big role here, as CH is optimized for very large batch sizes and the benchmark is not really using that kind of batch size. BTW, I am not involved with CH, but any vendor benchmarking their wares will always select params that make their offering look good.

ryanbooz2 years ago

Sure, these tests were not using really large batch sizes because of the other benchmarks we were trying to replicate (but with more detail). Honestly, for this single-instance setup, we saw improvement in CH when we went to (say) 5k, 10k, or 20k batches. But it was a few percentage points at a time, not a magnitude different. I'm sure things change with a cluster setup too, that just wasn't the focus of this post.

Interestingly, we were just testing a multi-node TimescaleDB cluster the other day and found that 75k rows/batch was the optimal size as nodes increased.

So you're completely correct. I tried to be very clear that we were not intentionally "cooking the books" and there are surely other optimizations we could have made. Most of the suggestions so far, however, require further setup of CH features that haven't been used in other benchmarks, so we tried to over-communicate our strategy and process.

We also fully acknowledged in the post that a siloed "insert", wait, then "query" test is not real world. But it's the current way TSBS has been used, and other DB engines have come along and used the methodology for now. Maybe that process will change in time with other contributions.

BTW, we'll discuss some of this next week during the live-stream and the video will be available after.

akulkarni2 years ago

Minor correction: It is actually a 40+ node cluster :-)

brightball2 years ago

That’s a really thorough comparison. Much more detailed than I expected.

From what I see, the trade off in disk space usage would point me toward Timescale for most of my workloads. The insert performance tradeoff just wouldn’t justify the difference for me.

ryanbooz2 years ago

(Post author)

Thanks for the compliment! It's becoming a habit with us and benchmarks. We just really want to dig in and understand what's going on and why things work the way they do. ;-)

There really are so many nuances and as we tried to say a number of times, ClickHouse is really great at what it does well. But it's still OLAP at heart (which precludes OLTP features many apps take for granted) and after enabling TimescaleDB compression, the query story isn't as cut and dry. We don't claim that TimescaleDB is the fastest in all circumstances, or that it absolutely has to be for every workload. Features and versatility play a major part in the decision.

ericb2 years ago

A licensing comparison would be a good addition.

PeterZaitsev2 years ago

Good question. Was it Open Source version of TimeScaleDB compared or Source Available ?

ryanbooz2 years ago

Sure. As we shared in the blog post it was tested (like other benchmarks) on dedicated EC2 instances using the freely available Community version.

darthShadow2 years ago

Does the disk usage go down later once the numerous parts are merged or not?

I would assume it does but reading the article implies that it does not.

ryanbooz2 years ago

Great question. Yes, eventually it does, but (at least for now) it wasn't something we could reliably force as part of the query cycle and know everything was in its "best" state with ClickHouse. To be honest, we didn't provide the final compressed size of either database because of the need to wait.

The code that's currently used by TSBS was submitted by Altinity, a heavy supporter of ClickHouse in the U.S., but TSBS is open source and anyone is welcome to contribute and make the process/test better!

darthShadow2 years ago

Thanks for the answer.

May be worth pointing that out in the article since the increased disk usage has been mentioned multiple times in the article without any indication that it's only temporary until ClickHouse merges the parts.

zepearl2 years ago

Concerning ClickHouse, yes it does - it's exactly the same as when you have 2 compressed files, each containing 100 sorted rows: when you merge those 200 rows into a single file, sort them, and compress them, the result will be smaller than the sum of the 2 separate files.

How much you save is again exactly the same as when dealing directly with files: it depends on the data and on the compression algo.

PeterZaitsev2 years ago

Altinity folks have suggested a number of ClickHouse optimizations for the time series benchmark. Did you enable any of those? https://altinity.com/blog/clickhouse-continues-to-crush-time...

ryanbooz2 years ago

Hello @PeterZaitsev!

Actually Altinity is the one that contributed the bits to TSBS for benchmarking ClickHouse[1], so we are using the work that they contributed (and anyone is welcome to make a PR for updates or changes). We also had a former ClickHouse engineer look at the setup to verify it matched best practices with how CH is currently designed, given the TSBS dataset.

As for the optimizations in the article you pointed to from 2019 (specifically how to query "last point" data more efficiently in ClickHouse), it uses a different table type (AggregatingMergeTree) and a materialized view to get better query response times for this query type.

We (or someone in the community) could certainly add that optimization to the benchmark, but it wouldn't be using raw data - which we didn't think was appropriate for the benchmark analysis. But if one wanted to use that optimization, then one should also use Continuous Aggregates for TimescaleDB - ie for an apples to apples comparison - which I think would also lead to similar results to what we show today.
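
For reference, a continuous aggregate is roughly the following sketch (table and column names are illustrative):

  CREATE MATERIALIZED VIEW cpu_hourly
  WITH (timescaledb.continuous) AS
  SELECT time_bucket('1 hour', time) AS bucket,
         hostname,
         avg(usage_user) AS avg_usage_user
  FROM cpu
  GROUP BY bucket, hostname;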

It's actually something we've talked about adding to TSBS for TimescaleDB (as an option to turn on/off) and maybe other DBs could do the same.

[1]: https://github.com/timescale/tsbs/pull/26

PeterZaitsev2 years ago

Thank you for your prompt response!

I think the most important thing is that ClickHouse is NOT designed for small-batch insertion; if you need to do 1000s of inserts/sec, you put a queue in front of ClickHouse. And query speed can be impacted by batch size a lot. So have you looked at query performance with the optimal batch size?

mfreed2 years ago

Yep! The blog post includes data and graphs from both large (5000-15,000 rows / batch) and small (100-500 rows / batch) sizes. Please see the section "Insert Performance". Thanks!

https://blog.timescale.com/blog/what-is-clickhouse-how-does-...

jurajmasar2 years ago

Disclaimer: I'm a co-founder of https://logtail.com, ClickHouse-based hosted log management platform.

PostgreSQL, TimescaleDB, and ClickHouse are all impressive pieces of software. We use both PostgreSQL and ClickHouse at Logtail.

ClickHouse shines for true OLAP use-cases and is very hard to beat performance-wise when configured properly.

Example:

> Poor inserts and much higher disk usage (e.g., 2.7x higher disk usage than TimescaleDB) at small batch sizes (e.g., 100-300 rows/batch).

If your consistency requirements allow, you could use the Buffer Table Engine to get blazing fast inserts: https://clickhouse.com/docs/en/engines/table-engines/special...
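
A minimal sketch of that pattern (database/table names and thresholds are illustrative):

  -- inserts go to the buffer table; rows are flushed to logs.events once time/row/byte thresholds are reached
  CREATE TABLE logs.events_buffer AS logs.events
  ENGINE = Buffer(logs, events, 16,
                  10, 100,              -- min/max seconds
                  10000, 1000000,       -- min/max rows
                  10000000, 100000000); -- min/max bytes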

Horizontal scalability and compression are also unbeatable from what I've seen, to name a few.

There's a hefty price tag, however: ClickHouse is quite ops heavy and its observability has a seriously steep learning curve. Only go for ClickHouse in production if you really know what you're doing :)

MichaelRazum2 years ago

Could you explain what you mean by ops heavy? Just curious. We actually have a production system where Timescale as well as ClickHouse are running in parallel. So far ClickHouse hasn't caused any trouble, but it is rarely used right now.

DevKoala2 years ago

It is not ops heavy. I believe they are trying to sell you on a managed service with that sentence. It is in your best interest for your team to learn how to maintain ClickHouse themselves. The k8s operator will take you far. By the time you need multi-node clusters you will already have a good idea of how it works. If you need a multi cluster deployment from the start then go with some consulting service I guess, but even then I wouldn’t call it “ops heavy”; the clusters are pretty homogeneous, just a bunch of nodes talking over ZK.

eternalban2 years ago

>> Disclaimer: I'm a co-founder of https://logtail.com, ClickHouse-based hosted log management platform.

heliodor2 years ago

> 2.7x higher disk usage than TimescaleDB ...

A classic tenet of computer science is that you can trade speed for space and vice versa. A database index is an example of this concept.

While the article complains about "benchmarketing", its tone is rather dismissive of ClickHouse and parts such as this line caught my eye as unfair.

darksaints2 years ago

It would be awesome to combine the following things:

* PostGIS

* Timescale

* Citus

* Zedstore

This truly would be the relational DB to end all relational DBs. Unfortunately, we run into a couple problems:

* Managing multiple extensions is a burdensome task, which should be in the wheelhouse of cloud providers, but...

* Timescale and Citus are open core, holding back features for customers. Their primary revenue channels are their cloud offerings. Unfortunately you can't get Citus and Timescale in the same cloud offering, because you're dealing with two separate companies.

* PostGIS has multiple cloud providers, but none of them have Timescale or Citus available.

* Citus only has cloud offerings on Azure, excluding the other two major players that often have exclusive relationships with companies.

* Zedstore is really cool and together with Citus could be a massive gamechanger by having columnstore and rowstores in the same distributed database. However, development has stalled, and nobody seems to be able to explain what happened.

Sigh...maybe 5 years from now.

avthar2 years ago

> Timescale and Citus are are open core, holding back features for customers.

One clarification. While TimescaleDB is open-core, our community version is source-available and 100% free to use. We do not "hold back features for customers". You do not need to pay to use any of TimescaleDB's best features, it's all free via the Timescale Community license.

You only pay if you'd like to use our hosted offerings (and save the hassle of self-managing your DB): Timescale Cloud or Managed Service for TimescaleDB.

For more see: https://www.timescale.com/products

(Disclaimer: I work at Timescale)

mfreed2 years ago

Timescale Cloud indeed comes with PostGIS installed by default.

Regarding distributed (Citus) and columnar (Zedstore):

- TimescaleDB's compression actually takes a columnar approach (including that it only reads the individual compressed columns that you SELECT), and so combines both row- and column-oriented data. [0]

- TimescaleDB 2.0 also supports distributed deployment, and Timescale Cloud will (very soon) offer one-click deployment of fully-managed multi-node TimescaleDB. [1]

[0] https://blog.timescale.com/blog/building-columnar-compressio...

[1] https://blog.timescale.com/blog/building-a-distributed-time-...

contrahax2 years ago

If you use Aiven for a cloud PG instance you can get both Timescale + PostGIS installed.

I also really wish ClickHouse would prioritize PostGIS support - IIRC it has been on their roadmap for a while but keeps getting kicked around every year or so. Same thing with CockroachDB - PostGIS support kicked down the road every year.

otan_2 years ago

Hi there, CockroachDB dev here. We've supported spatial features from PostGIS since 20.2 - you can get started here https://www.cockroachlabs.com/docs/stable/spatial-data.html! Of course, there's bits and pieces we've missed but if there's something you're missing in particular you can let me know here or through a GitHub issue.

rkwasny2 years ago
csdvrx2 years ago

> There is some creative engineering going here

Agreed. At a previous job, ClickHouse outperformed Timescale by several orders of magnitude, under about every condition.

The timescale team seems to recognize that (look for the comment about clickhouse being a bulldozer) but they seem to say timescale can be better suited.

In my experience, in about 1% of the cases, yes, timescale will be a better choice (ex: if you do very small batches of insertions, if you need to remove some datapoints) but in 99% of the usecases for a time series database, clickhouse is the right answer.

There seem to have been several improvements to Timescale since 2018, with columnar storage, compression, etc., and that's good because more competition is always better.

But in 2021, clickhouse vs timescale for a timeseries is like postgres vs mongo for a regular database: unless you have special constraints [*], the "cool" solution (timescale or mongo) is the wrong one.

[*]: you may think you have a unique problem and you need unique features, but odds are, YAGNI

https://en.wikipedia.org/wiki/You_aren%27t_gonna_need_it

akulkarni2 years ago

> In my experience, in about 1% of the cases, yes, timescale will be a better choice (ex: if you do very small batches of insertions, if you need to remove some datapoints) but in 99% of the usecases for a time series database, clickhouse is the right answer.

I always find comments like this interesting :-). Things are better for different use cases. If you find yourself inserting a lot of data in batches for OLAP-style analysis, then ClickHouse is a better choice today.

If you find yourself performing a lot of time-series related queries, and needing to build an application on top (e.g., where you might want the OLTP features of Postgres), then Timescale is the better choice.

YMMV! And that's OK :-)

> But in 2021, clickhouse vs timescale for a timeseries is like postgres vs mongo for a regular database: unless you have special constraints [*], the "cool" solution (timescale or mongo) is the wrong one.

This is also a funny statement, because TimescaleDB is built on PostgreSQL.

We actually take great pride in being a "boring" option [0] - in fact I think TimescaleDB is many ways is more "boring" than ClickHouse (again, because of its PostgreSQL foundation). But I think that's actually a good thing - because you should want your database to be "boring" - ie you shouldn't have to worry about it!

(Disclaimer: TimescaleDB co-founder)

[0] https://blog.timescale.com/blog/when-boring-is-awesome-build...

csdvrx2 years ago

> This is also a funny statement, because TimescaleDB is built on PostgreSQL.

I know, but doing timeseries with postgres is "cool", not standard, not boring. I'd even say "risky".

> We actually take great pride in being a "boring" option

No, you're not there yet: doing timeseries with timescale is way riskier than with clickhouse, which is both a bit older (not much) and more mature (much more), while also being more widely used (even if you are doing a lot of outreach like these posts)

ryanbooz2 years ago

We were using tags, so that "else" block isn't the one being used for ClickHouse. Regardless, the table that is created (by the community and verified by former CH engineers) orders by created_at, not time, and so that query should be the "fastest" DISTINCT possible.

ryanbooz2 years ago

(Post author)

I'm not sure why you think that's creative engineering. What you're pointing to is the depth of available configuration that the contributors to TSBS have exposed for each database. It's totally open source and anyone is welcome to add more configuration and options! I believe (although not totally sure) that Altinity and ClickHouse folks added their code a few years ago - at least it wasn't anyone on the Timescale team.

That said, we didn't actually use those scripts to run our tests. Please join us next Wednesday (10AM ET/4PM CET) to see how we set the databases up and ran the benchmarks. We'd be delighted to have you try it on your own too!

rkwasny2 years ago

Ah so the tests you have used are not the ones in https://github.com/timescale/tsbs ?

ryanbooz2 years ago

All the same tests. You simply pointed to a shell script that's configurable to run tests for each database. We provided details in the blog post of exactly what settings we used for each database (cardinality, batch size, time range, TimescaleDB chunk size, etc.) so you can use those script to configure and run the tests too.

sam0x172 years ago

Can someone give me a real-world example of a scenario where they actually need a time series database, like an example query with the business use case / justification? Just super curious.

GordonS2 years ago

We're using TimescaleDB to store log events - so each row has a timestamp, some properties, and a log message. And a lot of that is actually in a JSONB column.

Not the archetypal time series use case, but TimescaleDB is still really useful.

TSDB's compression means we can store a huge volume of data in a fraction of the space of a standard Postgres table. You can achieve even better compression ratios and performance if you spend time designing your schema carefully, but honestly we didn't see the need, as just throwing data in gets us something like 10:1 compression and great performance.

TSDB's chunked storage engine means that queries along chunking dimensions (e.g. timestamp) are super-fast, as it knows exactly which files to read.

Chunking also means that data retention policies execute nearly instantaneously, as it's literally just deleting files from disk, rather than deleting rows one-by-one - millions of rows are gone in an instant!
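
A sketch of what that looks like in practice (the table name and intervals here are made up):

  SELECT create_hypertable('log_events', 'time', chunk_time_interval => INTERVAL '1 day');
  -- old chunks are dropped wholesale rather than deleted row-by-row
  SELECT add_retention_policy('log_events', INTERVAL '90 days');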

And best of all, this all works in Postgres, and we can query TSDB data just the same as regular data.

All that combined easily justified the decision to use TSDB - and if you're familiar with Postgres, it's actually really simple to get started with. Really, we'd have needed a business justification not to use it!

Much love for the TimescaleDB team!

LoriP2 years ago

I know this is not on topic but thank you! We'd love to know more about what you do with TimescaleDB if you should want to share.

sam0x172 years ago

So services like Sentry / Honeybadger / etc probably use this architecture?

GordonS2 years ago

I think most online logging SaaS services actually use ElasticSearch.

akulkarni2 years ago

StackGres by Ongres, which achieved 1.0.0 status today, is another example of someone storing logs in TimescaleDB:

https://twitter.com/ongresinc/status/1451194886795849742

(Also, thank you GordonS for the kind words!)

RhodesianHunter2 years ago

I know that at least two of the top ten use ClickHouse.

claytonjy2 years ago

I introduced Timescale at an IIOT company, we had thousands of sensors regularly sending data up and wanted to efficiently display such metrics to users on a per-sensor basis, and one "tick" of data had a lot of metrics. Timescale let us go from Postgres for metadata and OpenTSDB (awful, stay far away) for time series to just one Timescale instance for everything. Huge win for us. We had enough data that doing the same with vanilla Postgres would have performed much worse (billions of rows).

We wrote more about this for an earlier Timescale blog post: https://blog.timescale.com/blog/how-everactive-powers-a-dens...

tnolet2 years ago

We just migrated to ClickHouse. We collect monitoring data - response times from 20+ different locations. We are not super-duper big, but we have at least 100M+ individual metrics per month. We want to give our users a snappy, interactive dashboard that lets them explore aggregates of that data over time: averages, p99, etc.

That is where a time series DB is very handy

sarlalian2 years ago

I use TimescaleDB to store data from logs (syslog, http logs, render logs), timeseries host metrics (CPU, Disk, Memory, Network), system inventory data, FDW to LDAP, network flows, etc. I do analysis across multiple dimensions for a variety of business uses.

1. Render job logs, combined with host metrics, to determine if any systems are being underutilized.

2. Intranet network usage for monitoring purposes.

3. General monitoring of log events, system metrics, etc. with alerting based on specific query criteria.

4. Temperature / humidity monitoring and trend analysis.

Basically if there are things that happen somewhat often (http logs, system logs, host metrics, power usage, temperature change, etc) and you want to analyze that data, and one of the factors in your analysis is time, then time series databases are for you.

In the end each TSDB has different trade offs, and there are a number of considerations to make, and how you weigh them will change a lot depending on scale. If you are writing 1-20GB/Day vs 1-20TB/Day, you will have different considerations. If you have relatively simple queries vs complex queries with complex joins on other data you will have different considerations. In the end you need to have some idea what your data is, how much of it there is, what the growth looks like and what your write/query usage patterns are going to look like. Insert / Query performance isn't everything, unless for your business it is.

ur-whale2 years ago

> Can someone give me a real-world example of a scenario where they actually need a time series database

Large scale infrastructure monitoring?

If you run a data center with 10K machines in it, whitebox monitoring of these machines and what runs on them generates tons of timestamped data.

These time series can be used to inform an automated alerting system (eg using trends to forecast bad things before they happen).

They can also be analyzed in batch mode to figure out how to optimize many things (power / cooling / workload assignment / etc ...)

pgwhalen2 years ago

Most useful economic or financial markets data is some kind of time series. For example, what was the maximum price of a stock in the last year, and how does it compare to the current price?

FRED (https://fred.stlouisfed.org/) is a great resource for publicly available economic time series data.

jedberg2 years ago

At Netflix all of our monitoring was in a time series database so we could get real time insights into pretty much anything we were monitoring (which was most everything).

ryanbooz2 years ago

(Post author)

This is a great post to give you some talking points:

https://blog.timescale.com/blog/what-the-heck-is-time-series...

I also love this recent one we did with some non-standard time-series data that the NFL provided! Really fun working on that data set.

https://blog.timescale.com/blog/hacking-nfl-data-with-postgr...

detailyang2 years ago

I built a TSDB (10,000,000 samples per second, average query time around 200ms) on top of ClickHouse. It supports multiple insert protocols (InfluxDB line protocol | Prometheus remote write | OpenTSDB JSON format), the OpenTSDB query protocol, and incomplete PromQL. ClickHouse is good, but there are things you need to tune: using a memory table to speed up inserts, a high-performance ZooKeeper to improve replicated tables, and sharding data by time-series id (distributed_group_by_no_merge=1) to reduce load on the bootstrap node.
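
A rough sketch of that last trick, assuming data is sharded by the series id so each shard can aggregate independently (names are illustrative):

  SELECT series_id, avg(value) AS avg_value
  FROM distributed_samples
  GROUP BY series_id
  SETTINGS distributed_group_by_no_merge = 1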

bradstewart2 years ago

We capture and store energy readings at ~5-second intervals, then display total energy at various time granularities by aggregating the values over minutes, hours, days, months, etc.

LoriP2 years ago

Here's some posts from users on the Timescale blog https://blog.timescale.com/tag/dev-q-a/

cercatrova2 years ago

Has anyone else been seeing an influx of timescale.com articles? I count around 10 in the last month.

carlotasoto2 years ago

(Timescale Team member here)

We've been working really hard on our launches / releases this month! We called it "Always Be Launching" - we've been aiming for releasing multiple things per week during October :)

yumraj2 years ago

That sounds great.

However, as a DB where users may store critical data, should you really be "Always be launching"? That sounds a little like FB's "move fast and break things". There's a reason why some of the mission critical open source technologies move slowly.

mfreed2 years ago

We actually are only having one database software release this month (TimescaleDB v2.5), which is aligned with our normal database release cadence.

Timescale (the company) also provides a managed cloud offering, as well as Promscale (an observability product built on top of TimescaleDB).

So #AlwaysBeLaunching is a company-wide effort across different product & engineering teams, as well as folks in Developer Advocacy and others (e.g., who worked on this comparison benchmarks).

What might be also interesting is our introduction of Experimental Schema features in TimescaleDB - explicitly so that we can "Move fast, but don't break things" (which is also key to getting good community feedback):

https://blog.timescale.com/blog/move-fast-but-dont-break-thi...

(Timescale co-founder)

swyx2 years ago

how long does #AlwaysBeLaunching last? seems like a good initiative but also might cause some burnout

avthar2 years ago

Timescale team member here. We take our responsibility to build a rock-solid platform very seriously. We have multiple "levels" of product within Timescale. At our core, we have the open-source database, TimescaleDB. This product releases on a more deliberate and careful cadence, always making sure that we are optimizing for reliability, security, and performance. This has been our approach since our initial launch [0], where we embraced the mantra "boring is awesome", recognizing that for our users stability and reliability is of paramount importance.

Within the core database, we offer features that are carefully marked as "experimental", which we discuss at length in this blog post [1].

Beyond TimescaleDB, we also offer other products that are more SaaS-y in nature. While they're all based on the rock-solid foundation of TimescaleDB, we are also able to ship new features more quickly because they are UI components that make using the database even easier.

Finally, some of our "launches" are more textual in nature, such as this benchmark, which we have spent months researching and compiling.

[0]: https://blog.timescale.com/blog/when-boring-is-awesome-build...

[1]: https://blog.timescale.com/blog/move-fast-but-dont-break-thi...

benwilson-5122 years ago

We've got a few billion rows in TSDB, pretty happy with it so far. Our workload fits the OLTP workflow more than OLAP though, we're processing / analyzing individual data points from IoT devices as they come in, and then providing various visualizations. This tends to mean that we're doing lots of fetches to relatively small subsets of the data at a time, vs trying to compute summaries of large subsets.

Compression is seriously impressive, we see ~90% compression rate on our real world datasets. Having that data right next to our regular postgres tables and being able to operate on it all transactionally definitely simplifies our application logic.

Where I see a lot of folks run into issues with TimescaleDB is that it does require that your related data models hold on to relevant timestamps. If you want to query a hypertable efficiently, you always want to be able to specify the relevant time range so that it can ignore irrelevant chunks. This may mean that you need to put data_starts_at, data_ends_at columns on various other tables in your database to make sure you always know where you find your data. This is actually just fine though, because it also means you have an easy record of those min / max values on hand and don't need to hit the hypertable at all just to go "When did I last get data for this device".
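
In other words, the pattern is to always carry a time predicate into the hypertable query so chunk exclusion can kick in; a sketch with made-up names:

  SELECT time, reading
  FROM device_readings
  WHERE device_id = 42
    AND time >= '2021-10-01' AND time < '2021-10-08'
  ORDER BY time;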

qorrect2 years ago

> Compression is seriously impressive

Does this affect your query performance?

benwilson-5122 years ago

In practice we've seen it actually improve performance, because when fetching a data range for a device fewer actual rows have to be fetched from the disk. You pick certain columns (like device ID) that remain uncompressed and indexed for rapid querying, and then the actual value columns are compressed for a range of time.

qorrect2 years ago

Very cool thanks for sharing

> This may mean that you need to put data_starts_at, data_ends_at columns on various other tables in your database to make sure you always know where you find your data.

Do you have a link to docs for this? Does this mean literally put a first column named (xstartx) and an end column (xendx) as the last column? How do you then utilize it?

Thanks so much!

CodesInChaos2 years ago

Can somebody recommend a database suitable for an event sourced application:

* One series of events per user

* Each series grows at about 10 events/minute while the user is active

* Fancy queries are not required, typically a user's event series is consumed in order to update aggregate state for that user

* Either used online, adding events one at a time and needing to immediately update the aggregate state

* Or used offline, syncing a batch of hours or days at once. When syncing a large time interval, eventually consistent state updates are acceptable

* It must be possible to delete a user's data, regardless how old it is (a nightly batch job deleting multiple users at once is fine, if it helps performance)

* Migrating old data should be possible with reasonable performance and without consuming excessive temporary memory

* Compact storage is important (simple zstd compression should suffice, though columnar compression might be slightly better)

* Being able to use a cheaper object store like S3 for old data would be nice

At a glance timescale community appears to meet most requirements. The userid can be used as `segmentby` key, and the data compressed via columnar compression. But it seems to have limitations with migration (sounds like it requires me to manually decompress and recompress chunks, instead of simply transforming one (chunk, segment) piece at a time) and deletion (I need to delete everything with a specific `segmentby` key).

Alternatively there is the DIY approach, of serializing each entry in a compact format, one file per user, and then once data is old enough compress it (e.g. with zstd) and upload it to S3.

ryanbooz2 years ago

Looks like you edited this with some more detail, so I'll answer higher.

Compression in TimescaleDB used to mean all compressed data was immutable and the table schema couldn't be altered. Since TimescaleDB 2.1, and 2.3 that has changed in a few ways.

- Schema can now have columns added or renamed

- Compressed chunks can now have rows inserted into them (partially compressed; the background job will fully compress when it runs)

Row deletion is not possible yet, but I've personally been having some internal conversations around ways to do exactly what you're suggesting in the near term: deleting rows based on a "segmentby" column. I have some testing to do... but my focus has been taken up by a certain 33-minute-long-read blog post.

Feel free to join our Slack and DM me if you want to talk about it further.

slack.timescale.com

ryanbooz2 years ago

(post author and Timescaler)

What do you mean by "migrating old data"? Don't want to make assumptions before answering further.

CodesInChaos2 years ago

Updating many/all rows as a rare maintenance task, typically as part of deploying a new version of the application.

I know Timescale has native support for the most common cases (adding nullable columns/renaming columns). But sometimes the transformation is more complex. Sometimes an SQL update statement would suffice; sometimes streaming each segment in chronological order to the application, which returns the updated rows, might be required.

smarx0072 years ago

10 events/minute - I would guess SQLite/DuckDB can fit the bill for many years?

CodesInChaos2 years ago

I don't think performance is the issue, but I'd like to keep the storage small/cheap via compression, since this is a hobby project. (though admittedly it's probably premature optimization)

zekrioca2 years ago

I know it is not related, but "ClickHouse" (“_Click_stream” and “Data ware_House_”) doesn't sound like a database name.

LoriP2 years ago

I think the name originally came from company/project origin and how the whole thing started... Recently set up as their own entity, ClickHouse came from Yandex. There are probably others better able to give that history, but that's the gist of it.

mrwnmonm2 years ago

Any plans to do this kind of detailed comparison with Druid too?

ryanbooz2 years ago

There's a long list of DBs users would like to see. Druid is on the list but probably not happening in the near-term without some community help.

Remember, TSBS is open-source and we've had some great contributions from many teams/databases. :-)

eatonphil2 years ago

I was surprised to see that ClickHouse and ElasticSearch have the same number of contributors. That's pretty astounding given how much older and more prominent ElasticSearch has been.

https://github.com/ClickHouse/ClickHouse/graphs/contributors

https://github.com/elastic/elasticsearch/graphs/contributors

Edit: I was very off. The Github contributor graph does not show all actual contributors. ElasticSearch has somewhere around 2-3 times as many contributors as ClickHouse.

AdamProut2 years ago

ClickHouse is by far the leading open source columnar SQL data warehouse at this point. We have had strong open source operational SQL DBs for many years (MySQL, Postgres), but no open source systems that mirrored closed-source MPP columnstores until ClickHouse. It's interesting that it took "this long" for a strong open source SQL DW to emerge.

LoriP2 years ago

ClickHouse has gained a huge following and honestly that's been pretty well earned. For the kinds of apps that they target it's a great choice, it's great technology with a very able team behind it.

rohitnair2 years ago

As per the landing pages of the projects, ES has 1.6k contributors whereas ClickHouse has 803. The contributors page likely only lists the top contributors to keep the page load time manageable.

eatonphil2 years ago

That makes much more sense. Thanks for pointing that out.

hodgesrm2 years ago

ClickHouse now has more unique contributors with merged PRs on an annual basis. The lines crossed early this year, or even late last year.

eatonphil2 years ago

Thanks! Could you point me at something concrete. :)

hodgesrm2 years ago

Go to one of the several public ClickHouse endpoints and run this query:

  -- Elastic vs CH in a single table. 
  SELECT toYear(created_at) Year, 
         uniqIf(creator_user_login, repo_name in ('elastic/elasticsearch')) "Elastic", 
         uniqIf(creator_user_login, repo_name in ('yandex/ClickHouse', 'ClickHouse/ClickHouse')) "CLickHouse" 
  FROM github_events 
  WHERE event_type = 'PullRequestEvent'
  AND merged = 1
  AND repo_name in ('yandex/ClickHouse', 'ClickHouse/ClickHouse', 'elastic/elasticsearch')
  GROUP BY Year ORDER BY Year
You can access the ClickHouse web UI for this dataset here: https://github.demo.trial.altinity.cloud:8443/play?user=demo. The password is "demo" (type it in the left side.) This is the Altinity.Cloud copy of Alexey Milovidov's excellent github_events dataset.

When I run this query I get the following numbers.

  Year|Elastic|CLickHouse|
  ----|-------|----------|
  2015|    191|         0|
  2016|    299|        40|
  2017|    296|        85|
  2018|    284|       142|
  2019|    341|       232|
  2020|    339|       300|
  2021|    243|       294|

Just speculation on my part, but the drop in Elastic contributors may be a side effect of the licensing change.
pradeepchhetri2 years ago

One suggestion, if you really want to benchmark systems:

* Create a setup which is production grade i.e. run a multi-node HA setup of those systems.

* Understand the best practices of those systems, otherwise the results get biased.

* Validate the results with experts of those systems before publishing.

avinassh2 years ago

Related to TimescaleDB, there was a blog post which explained their internals and also compared with another similar time series DB. I can't seem to find the link - does anyone remember?

avthar2 years ago

Timescaler here. I think you're referring to this comparison of InfluxDB vs TimescaleDB [0]?

There's also comparisons of TimescaleDB vs MongoDB[1] and AWS Timestream [2].

[0]: https://blog.timescale.com/blog/timescaledb-vs-influxdb-for-...

[1]: https://blog.timescale.com/blog/how-to-store-time-series-dat...

[2]:https://blog.timescale.com/blog/timescaledb-vs-amazon-timest...

PeterZaitsev2 years ago

The benchmarks which were done a while back did not use compression for TimescaleDB, but also did not use the newer compression settings for ClickHouse either.

https://altinity.com/blog/2019/7/new-encodings-to-improve-cl...

In particular, LowCardinality for strings and time-series-specific compression codecs may be very valuable.
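
For example, roughly along the lines of that post (table, columns, and codec choices here are illustrative):

  CREATE TABLE cpu_ts (
      created_at DateTime CODEC(DoubleDelta, LZ4),
      hostname   LowCardinality(String),
      usage_user Float64 CODEC(Gorilla, LZ4)
  ) ENGINE = MergeTree()
  ORDER BY (hostname, created_at);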

mrwnmonm2 years ago

It is interesting how people love this detailed comparison, but at the same time, when this article https://leventov.medium.com/comparison-of-the-open-source-ol... was posted many times on HN, it didn't receive any attention.

Upitor2 years ago

Would either of these database systems be proper for a case where you have a mix of large measurement data and small reference/master data that you need to join, filter, etc.? Example:

  SELECT r.country, m.time, SUM(m.measurement)
  FROM measurement_table AS m
  INNER JOIN reference_table AS r ON m.device_id = r.device_id
  GROUP BY r.country, m.time

ryanbooz2 years ago

In its current form/state, ClickHouse is not optimized for typical JOIN-type queries, a point we make in the post. You would have to re-write your statement to get better performance. The other main point is that all data is "immutable", so if your reference data needs to be updated, it would still need to go through some kind of asynchronous transform process to ensure you're getting the correct values at query time.
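
One common rewrite, sketched with the hypothetical names from the question, is to load the small reference table into a ClickHouse dictionary and look it up at query time (this assumes a 'device_dict' dictionary keyed by device_id with a 'country' attribute):

  SELECT dictGet('device_dict', 'country', m.device_id) AS country,
         toStartOfHour(m.time) AS hour,
         sum(m.measurement) AS total
  FROM measurement_table AS m
  GROUP BY country, hour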

TimescaleDB is PostgreSQL, so it can easily handle this kind of join aggregate like you would expect. If "m.measurement" was compressed, historical queries with a time predicate would likely be faster than in the uncompressed state.

bvrmn2 years ago

It's a very strange benchmark.

1) ClickHouse allows you to trivially set up a batch buffer on the server side. It will always be faster on insert, with low disk usage.

2) Suspicious axis on the query benchmark. It seems it represents the diff in query latencies, so if one database executes a query in 1.00 and another in 1.34, the graph will show a whopping 34.

oddtodd2 years ago

Last time I looked at TimescaleDB a few months ago, it appeared to me that you had to ingest data into what they call an access node, which made it a non-starter for our use case, but matched perfectly with ClickHouse, which is a symmetric deployment, where every node can be an ingest node.

ryanbooz2 years ago

You are correct. The current multi-node deployment (if you need it for your workload) does have one "managing" access node. All ANs and DNs can be replicated and configured with tooling such as patroni for HA.

We have a few users that have larger multi-node clusters setup this way (one at 40+ nodes so far) and happy with the offering overall. Obviously YMMV depending on requirements/IT support.

mrwnmonm2 years ago

If you have thousands of clients writing individual rows to the database, one per request, and thousands of clients making queries (some of them complex, some not), does ClickHouse even get used in this scenario?

oandrew2 years ago

Yes but typically with some sort of batching proxy in front of it

e.g. https://clickhouse.com/docs/en/interfaces/third-party/proxy/

zepolen2 years ago

Yes, Clickhouse is heavily used for storing http access logs which follows that scenario.

didip2 years ago

Question, which helm chart is the best to install ClickHouse these days?

hodgesrm2 years ago

Don't use helm. The ClickHouse Kubernetes Operator is the way to go. Here's the project: https://github.com/Altinity/clickhouse-operator

This is generally true for most databases these days. Use an operator if it's available. Helm can't handle the dynamic management required to run databases properly.

didip2 years ago

Thank you!

chalcolithic2 years ago

Could putting RedPanda/Kafka in front of ClickHouse make it the insert benchmark winner? Of course it means operational expenses, but I wonder if this route is worth exploring?

RhodesianHunter2 years ago

Yes that is exactly how you would get the best performance out of ClickHouse.

arunmu2 years ago

What is the difference w.r.t. the comparison done by Altinity of ClickHouse with Timescale? ClickHouse performed better there for the same test. What gives?

akulkarni2 years ago

If you are referring to this post: https://altinity.com/blog/clickhouse-for-time-series

That post was written in November 2018 - 3 years ago - when TimescaleDB was barely 1.0.

A lot has changed since then:

1. TimescaleDB launched native columnar compression in 2019, which completely changed its story around storage footprint and query performance [0]

2. TimescaleDB has gotten much better

3. PostgreSQL has also gotten better (which in turn makes TimescaleDB better)

In fact, IIRC Altinity used TSBS and contributed the ClickHouse support to it [1], which is also what this newer benchmark uses

(Disclaimer: TimescaleDB co-founder)

[0] https://blog.timescale.com/blog/building-columnar-compressio...

[1] https://github.com/timescale/tsbs

arunmu2 years ago

Thank you. My only nit is the way the ratio (CH/TS) is shown. What is the purpose of that? It will show a bigger percentage for cases in which TS is better, but a lower percentage for cases where CH gives better results. From a data representation perspective, I do not think that is fair.

ryanbooz2 years ago

(Post author)

The two big things, which we discuss at length in the post, are:

- Altinity (and others) did not enable compression in TimescaleDB (which converts data into columnar storage), which provides an improvement in querying historical data because it can retrieve individual columns in compressed format, similar to CH

- They didn't explore different batch sizes to help understand how each database is impacted at various batch sizes.

PeterZaitsev2 years ago

Have you from your side followed all Clickhouse best practices?

ClickHouse's design in particular suggests doing ingest requests approximately once per second; if you do much more than that, you're using it outside of its intended usage, and if you need that, you usually put some sort of queue between whatever produces the data and ClickHouse.

Note the ingest in small batches also can significantly affect query performance

ryanbooz2 years ago

Yep - it's all detailed in the post! The question is how it compares to TimescaleDB, which is an OLTP time-series database that has a lot of other possible use cases (and extensibility). I think it's very fair to explore how smaller batches work since others haven't ever actually shown that (as far as we can see) so that users that would normally be coming from a database like PostgreSQL can understand the impact something like small batches would have.

As for ingest queueing, TSBS does not queue results. We agree, and tell most users that they should queue and batch insert in larger numbers. Not every app is designed that way and so we wanted to understand what that would look like.

But CH did amazingly well regardless of that with batches above 1k-2k and lived up to its name as a really fast database for ingest!

pradeepchhetri2 years ago

How can I replicate the results of the benchmarks? I am interested in looking at the CH table schema you used.

jonatasdp2 years ago

You can run tsbs by yourself. Just check the options here: https://github.com/timescale/tsbs/blob/master/docs/clickhous...

The blog post benchmarks used `--use-case=cpu-only` case for data ingestion. You can see the table definition here: https://github.com/timescale/tsbs/blob/1eb7705ff921fd31784c0... coming from here: https://github.com/timescale/tsbs/blob/master/pkg/targets/cl...

ryanbooz2 years ago

(Post author)

Howdy! All of the details about our TSBS settings are in the performance section of the docs. Also, we'll be streaming a sample benchmark of the two databases next Wednesday at 10AM ET/4PM CET.

https://blog.timescale.com/blog/what-is-clickhouse-how-does-...

twitch.tv/timescaledb

pradeepchhetri2 years ago

Few comments:

- The CH table schema generated by TSBS isn't optimized for the queries. First of all, it doesn't use CODECs (https://altinity.com/blog/2019/7/new-encodings-to-improve-cl...) and many other optimizations CH has.

> We tried multiple batch sizes and found that in most cases there was little difference in overall insert efficiency

This is wrong in the CH world, where batch size matters a lot. I would recommend making it even higher, around 10x the current value.

Humble suggestion: there are many things about CH that aren't quite properly interpreted here, and reading through the blog it seems like you're focusing more on the areas where CH is lacking/missing. Please don't do that.
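
For reference, here's a rough sketch of what per-column codecs look like in a ClickHouse table definition (this is an illustrative table, not the schema TSBS generates):

    CREATE TABLE cpu_metrics
    (
        ts           DateTime CODEC(DoubleDelta, LZ4),  -- delta-of-delta suits regularly spaced timestamps
        host         LowCardinality(String),            -- dictionary-encodes repetitive tag values
        usage_user   Float64  CODEC(Gorilla, LZ4),      -- Gorilla suits slowly changing gauge values
        usage_system Float64  CODEC(Gorilla, LZ4)
    )
    ENGINE = MergeTree
    ORDER BY (host, ts);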

ryanbooz2 years ago

Two quick responses:

- The code that TSBS uses was contributed by Altinity[1]. If there is a better setup, please feel free to submit a PR. As stated elsewhere, we did have a former CH engineer review and even updated ClickHouse to the newest version __yesterday__ based on his suggestion to ensure we had the best numbers. (and some queries did improve after upgrading, which are the numbers we presented)

- It seems like you read the article (great job - it was long!!), so I'm sure you understand that we were trying to answer performance and feature questions at a deeper level than almost any benchmark we've seen to date. Many just show a few graphs and walk away. We fully acknowledged that smaller batches are not recommended by CH, but something many (normally OLTP) users would probably have. It matters and nobody (that we know of) has shown those numbers before. And in our test, larger batch sizes do work well, but not to some great magnitude in this one server setup. Did 10k or 20k rows maybe go a little faster for CH? Sometimes yes, sometimes negligible. The illustration was that we literally spent months and hundreds of benchmark cycles trying to understand the nuances.

I think we're pretty clear in the post that CH is a great database for the intended cases, but it has shortcomings just like TimescaleDB does and we tried to faithfully explore each side.

[1]: https://github.com/timescale/tsbs/pull/26

PeterZaitsev2 years ago

This is what tends to make all vendor benchmarks "benchmarketing" - while many of us fully intend to give other technologies a fair shot, we tend to know the best practices for our own software better than the "competition's".

jonatasdp2 years ago

It also puzzled me when I started benchmarking and helping/reviewing PRs on the tsbs project.

I even wrote an idea about promoting the "time series racing contest" a few months ago: https://gist.github.com/jonatas/a84b734645d288051fb861d9522f...

Orchestrating the race will require a lot of collaboration from all the players involved :)

Another small step would be to encourage DB companies to contribute their most optimized defaults to tsbs, or extra scripts and templates to tune the DB server.

carlotasoto2 years ago

(Timescale team member here)

We used the Time Series Benchmark Suite for all these tests https://github.com/timescale/tsbs. Also, Ryan (post author) will be giving all the config details in a Twitch stream happening next Wednesday. We'll be uploading the video to Youtube immediately afterwards too >>

twitch.tv/timescaledb youtube.com/timescaledb

fnord772 years ago

Apache Druid and Apache Pinot are two others to consider for time series. We're using Druid at scale and it works pretty well. Pinot appears to be faster for retrieval, but it is less mature.

nojito2 years ago

None of these ClickHouse queries are optimized.

It is very, very hard to beat ClickHouse in terms of performance if it is set up properly.

cachemiss2 years ago

Having used both TSDB and ClickHouse in anger I have some thoughts on this:

They are both fantastic engines, I really like that both have made very specific tradeoffs and can be very clear in what they are good and bad at. Having worked on database engines, I can appreciate the complexity that they are solving.

My most recent use is with ClickHouse, which is great and I think a complete game-changer for the company. However, there are a lot of issues (that are being worked on; the core team is great, though there are a few personalities that are a bit frosty to deal with). All of these comments come with love for the system.

1. Joins really need some work, both in the kinds of algorithms (pk aware, merge joins that don't do a full sort etc.), and in query optimizer work to make them better. We have analysts that use our system, and telling them to constantly write subqueries for simple joins is a total PITA. Not having PK aware joins is a massive blocker for higher utilization at our company, which really loves CH otherwise.

2. Some personalities will tell you that not having a query optimizer is a feature, and from an operational standpoint, it is nice to know that a query plan won't change, or try and force the optimizer to do the right thing. However, given #1, making joins performant (we have one huge table with trillions of rows, and a few smaller ones with billions) is really rough.

3. The operations story really needs some work, especially the distribution model. The model of local tables with a distributed table over them is difficult to work with, personally. It would be nice to just be able to plug servers in without a lot of work, like Scylla, and not have two tables whose schemas you have to keep consistent (a minimal sketch of this pattern is at the end of this comment). There's also just some odd behavior: if you insert async into a distributed table and only have a few shards, it'll only use a thread per shard to move that data over. It would be nice if there wasn't as much to think about.

4. Following #3, there are just too many knobs; maybe a tuning tool or something would help, but configuring thread pools is difficult to get right. I suspect CH could use a dedicated scheduler like Scylla's that could dispatch the work, instead of relying on the OS.

5. The storage system relies a lot on the underlying FS and on settings for when to fsync, etc. I suspect that if they had a more dedicated storage engine (controlled by the scheduler above), things could be more reliable. I still don't fully trust data being safe with CH.

6. Deduplication - this is a hard problem in general, and one that is really difficult to solve in CH. We solve it by having our inserters coordinate so that they always produce identical blocks, using replacing merge trees to catch stragglers (maybe), but it isn't perfect. A suggestion, if possible, is to try to put the same keys into the same parts, so they'll always get merged out by the replacing merge tree (I understand this is difficult).

The CH team is great, and these will be fixed in time, but these were the problems we ran into with CH.

TSDB was really solid, but we never used it at a scale where it would tip over. Our use case is really aligned with Yandex's, so a lot of the functionality they have built is useful to us in a way that TSDB's isn't. (Also, being able to page data to S3 is amazing.)
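
For anyone who hasn't run into point 3 above, this is roughly what the local-table-plus-distributed-table pattern looks like (the cluster, database, and table names here are hypothetical):

    -- each shard stores data in its own local MergeTree table
    CREATE TABLE events_local ON CLUSTER my_cluster
    (
        ts  DateTime,
        key UInt64,
        val Float64
    )
    ENGINE = MergeTree
    ORDER BY (key, ts);

    -- the Distributed table fans reads/writes out to the shards; its schema
    -- has to be kept in sync with the local table on every node
    CREATE TABLE events ON CLUSTER my_cluster AS events_local
    ENGINE = Distributed(my_cluster, default, events_local, rand());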

ryanbooz2 years ago

Thanks for your excellent contribution to this discussion. As the post author, I wholly agree with your approach: if a solution hits the sweet spot for you in the context of your requirements, that's the one you choose. Thank you for considering TimescaleDB alongside ClickHouse in what was obviously a well-thought-through assessment of these two excellent technologies.

ryanbooz2 years ago

Heh - somehow missed that I had already responded to this one, my apologies. (and no immediate way to edit after the fact).

ryanbooz2 years ago

(post author)

Thanks for the great, thoughtful feedback. We (Timescale) couldn't agree more that there is a lot to love about ClickHouse, especially where it truly excels.

Information like this is helpful for others currently in the "choose the right tool" part of the job and to the developers of the product. I can't imagine how different all of our offerings will look in a few more years! :-)

PeterZaitsev2 years ago

I think it is worth noting that while ClickHouse is often used as a time-series store, it is not particularly designed for this use case; it's aimed more at storing logs, events, and similar data. VictoriaMetrics would be an interesting comparable - it's inspired by the ClickHouse design but optimized specifically for time-series storage: https://victoriametrics.com/

cyber12 years ago

It would be very interesting to see a comparison of TimescaleDB vs VictoriaMetrics. Car vs car :)

mt42or2 years ago

Nobody is talking about VictoriaMetrics?

nojvek2 years ago

I’m surprised Timescale hasn’t given a comparison with SingleStoreDB.

I've found SingleStore column scans to be at parity with ClickHouse in speed. At the same time, SingleStore uses a hybrid skip-list/columnstore data structure in its universal storage (which is the default table format).

So you get high-throughput transactions as well as insanely fast aggregate scans.

Column stores are usually great at appends, not so much at updates and deletes.

mfreed2 years ago

Beyond being closed source, SingleStoreDB's license explicitly prohibits benchmarking:

https://www.singlestore.com/assets/contracts/singlestore-fre...

2. Restrictions. You acknowledge that the Software, and its structure, organization, and source code, constitute SingleStore’s and its suppliers’ valuable trade secrets, and the Software is subject to the following restrictions. Specifically, Customer shall not, and shall not ... conduct any competitive analysis, publish or share with any third party any results of any technical evaluation or benchmark tests performed on the Software, or disclose Software features, errors or bugs to a third party without SingleStore’s prior written consent (“Benchmarking”); or

PeterZaitsev2 years ago

Wow. I did not expect to see that in a modern database product.

What's funniest, though, is that I remember them back in the MemSQL days showing benchmarks vs Oracle (which, as I understand it, has the same restriction).

assface2 years ago

Just ignore it.