I found TerminuDB's white paper on graph compression interesting and I think could be useful to time series compression:
Succinct Data Structures and Delta Encoding for Modern Databases - https://assets.terminusdb.com/research/succinct-data-structu...
Navarro's book Compact Data Structures is an excellent survey of this literature! https://www.cambridge.org/core/books/compact-data-structures...
I love the perspective of computing the information theoretic lower bounds for a data structure and then trying to achieve that with as little overhead as possible while still supporting efficient operations on the structure.
Question for people who've implemented these kind of compression schemes: some of these mechanisms require local context (rle and s8b) how do you handle random access in these cases? Is the metadata embedded as a sort of key frame to which we have to advance before decoding the value? Or as a separate block in the headers?
RLE especially sounds like it could degenerate into a binary search per lookup, maybe aided by caching the bounds of the last run and assuming locality and linear-ish usage?
This might sound obvious, but wasn't mentioned in the article, you can also apply integer based compression schemes on dictionary encoded data.
And floating point values don't always need those 64bits when the use case and input data don't require that level of precision.
> (...) how do you handle random access in these cases?
Given we're discussing time series, random access is something that's not required at all. There is no use case where you're looking for a single, specific data point of a time series. Time series are used to store and retrieve batches of data points in order to process and/or analize/visualize in batches. Time series queries consists mainly of "get me all the data points of a time series reported between timestamps A and B".
> There is no use case where you're looking for a single, specific data point of a time series.
I'm no expert but it seems trivial to come up with some. The article talked of storing temperature and humidity, so in relation to that:
- "What was the hottest day in the past <variable time period>" will require you to start decompressing at $now-$period. Having to start at the dawn of time is kind of a pain
- "I was so warm last night that I even woke up from it at 5am, what were the temperature and humidity then?"
- Record highs are not going to be at night, so if you want to find that, we can skip a lot of data there. Now, temperatures don't change every millisecond so it's not a lot of data even if you would have to start decompressing 50 years back, but in general, this kind of optimization is only possible if you have random access. (A better example here might be if you have petabytes of LHC data compressed this way and want to find something in it.)
Random access is not the common case but not a bad feature to have, either. The comment kind of acknowledges this by mentioning batches, but starts by dismissing the problem
I was thinking the same thing as I was reading this. I doubt you could retain 100% random access. Instead, I think you would generally create "blocks" that are compressed in a time range that is compatible with your application (ie. 1 day blocks or 1 week blocks, etc.) and then when picking date/times you take X blocks and then further refine the query after compression (example: load May 5th block --> May 5th 8:10AM - 11:13AM). At least, that is what I have done in the past in my own home grown apps. Essentially each block then starts and terminates compression - # of blocks is a trade off in compression efficiency vs. granularity.
Most applications don't require true random access.
Instead you're typically reading a range of data, and then you can decompress just the blocks required for the data you want to see.
Caching of partial queries can also help substantially. For example, if many queries involve querying the max() of some per-second data grouped by minute, it is well worth caching that rather than reading the source data every time to calculate the max().
Typically the query engine can keep counts of every subquery and how frequently it's used and how many data points it involves to decide how long to cache it for. As far as I'm aware no opensource tsdb does this, despite it being a massive simple win, especially for alerting systems and dashboards that run very similar queries frequently.
For Wavefront I got this down to about 1.5 bytes per metric reported (on average across customers) where each metric was a tuple of:
name, source, timestamp (s), value (int) and a set of name/value pairs
Key to optimizing it was having a database that supported different compression strategies based on the data.
> Key to optimizing it was having a database that supported different compression strategies based on the data.
I keep wondering why automatic detection of a compression strategy is hardly ever used. Different algorithms are just better at different things.
I thought timsort was one of the few things that does this, but a quick look at Wikipedia actually indicates that I misremember. With that gone, off the top of my head I cannot think of any utility or library that does this sort of thing. It doesn't seem hard to just take a few kilobytes of data, try a handful of good algorithms, pick the best one, and repeat after a few megabytes. I imagine the gains should be double digit percentages for at least a fifth of the datasets out there, without necessarily losing out on (de)compression speed. Curious if anyone ever looked into this, though some uncertainty will always be about bias in the dataset selection.
E.g. 9 out of 10 times xz is clearly better than bzip2 (both at max compression setting), but then sometimes bzip2 actually beats xz with some margin. I don't know when or why, but it has me trying both options whenever I care about the final compressed size. I've also had that a lower level of gzip actually compressed better (I think it was -8 vs -9, but it also have been between -1 and -3 or so).
I've always liked this kind of thing. I've also done some experiments with automatically memorizing better starting dictionaries for a corpus eras where the data is small:
Another interesting case is using a compressed stream to indicate anomalies in the data when the compression ratio spikes down like in log analysis.
That actually sounds pretty bad to me, when many metrics will be constant or very rarely changing, or perfectly correlated with another metric.
Were you arithmetic coding stuff, or still doing bit-twiddling tricks?
If you think that sounds bad, I would love to see what you think is good and show your work then they can just add it to one of the options. It does basically all the tricks in the article and then some. This is a mean, not a median. The median is likely much lower but no one cares about the median when calculating how much space you need.
RE: Correlated with another metric. That would be extremely expensive one to calculate. You need an algorithm that can operate when ingesting millions of points per second 24/7.
Are there similar algorithms useful for compressing stuff like "a list of the top 100 scores in a game?" We already store these as diffs of the list, but we could probably do better than zstd-ing the list-diffs.
This discussion about a Barclay's Bank json list of 74,000 numbers might have some ideas: https://news.ycombinator.com/item?id=28326036
I was able to get it from a 1.3mb json file (uncompressed) to a 151k uncompressed but 4k compressed file using mostly deltas. https://news.ycombinator.com/item?id=28348965
This is great! I'll have to think about how to combine "storing deltas within the list" and "storing deltas of the list," but there should be some significant gains available
You could look into multiset sequence compression methods, where the multiset elements are i sequences max_score(i, t) for at time t for each rank i https://arxiv.org/abs/1401.6410
Simple-8b is not talked about in Wikipedia.
One notable omission from this piece is a technique to compress integer time series with both positive and negative values.
If you naively apply bit-packing using the Simple8b algorithm, you'll find that negative integers are not compressed. This is due to how signed integers are represented in modern computers: negative integers will have their most significant bit set .
Zigzag encoding is a neat transform that circumvents this issue. It works by mapping signed integers to unsigned integers so that numbers with a small absolute value can be encoded using a small number of bits. Put another way, it encodes negative numbers using the least significant bit for sign. 
If you're looking for a quick way to experiment with various time series compression algorithm I highly recommend Daniel Lemire's FastPFor repository  (as linked in the article). I've used the Python bindings  to quickly evaluate various compression algorithms with great success.
Finally I'd like to humbly mention my own tiny contribution , an adaptation of Lemire's C++ Simple8b implementation (including basic methods for delta & zigzag encoding/decoding).
I used C++ templates to make the encoding and decoding routines generic over integer bit-width, which expands support up to 64 bit integers, and offers efficient usage with smaller integers (eg 16 bit). I made a couple other minor tweaks including support for arrays up to 2^64 in length, and tweaking the API/method signatures so they can be used in a more functional style. This implementation is slightly simpler to invoke via FFI, and I intend to add examples showing how to compile for usage via JS (WebAssembly), Python, and C#. I threw my code up quickly in order to share with you all, hopefully someone finds it useful. I intend to expand on usage examples/test cases/etc, and am looking forward to any comments or contributions.
I love zigzag encoding. The way it uses shift and xor is beautiful. It gives it a performance edge of 3 cycles on my cpu versus signed-leb encoding. One thing I did once, to encode UNICODE lookup tables in a smaller form (to reduce Python binary size) is I found out it can be very profitable to (1) delta encode; then (2) zig zag encode it; and finally (3) apply deflate. For example:
It's so cool.
_PyUnicode_PhrasebookOffset2: size is 178,176 bytes _PyUnicode_PhrasebookOffset2: packed size is 100,224 bytes _PyUnicode_PhrasebookOffset2: rle size is 282,216 bytes _PyUnicode_PhrasebookOffset2: deflate size is 52,200 bytes _PyUnicode_PhrasebookOffset2: bz2 size is 76,876 bytes _PyUnicode_PhrasebookOffset2: δleb size is 47,198 bytes _PyUnicode_PhrasebookOffset2: δzd size is 12,748 bytes
For anyone else not familiar: δ is a lowercase delta (at least, I had to look it up)