Inverted Indexes: A Step-by-Step Implementation Guide

the_precipitate • 5 hours ago

To really appreciate inverted indexes, it’s worthwhile to study ISR (Inverted Stream Readers), a concept introduced by the great Mike Burrows. It’s also worth exploring encoding techniques like PForDelta. These elegant ideas demonstrate how true systems design masters can distill complex concepts into simple, powerful abstractions.

Edit: I stand corrected: it's called index stream readers (thanks atombender for pointing this out). For those who know Mike Burrows only for the Burrows-Wheeler transform (bzip), you might also want to know that he was one of the main developers of AltaVista, the first real search engine for the internet. He also designed the early versions of the Bing search engine. Eventually he worked for Google and designed their lock service, Chubby.

marginalia_nu • 4 hours ago

Another very nice algorithm in the space is this one [1] for intersecting postings lists in sublinear time, generally with very good cache characteristics to boot. It works with tree-based indexes as well as skip lists (though a more modern design might also use simple bloom filters to go with the skip pointers).

[1] https://nlp.stanford.edu/IR-book/html/htmledition/faster-pos...
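
A rough sketch of the skip-pointer idea, for anyone who wants to see it in code. This is a toy under my own assumptions, not the linked chapter's exact pseudocode: the name `intersect_with_skips` is made up, postings are plain sorted Python lists of doc IDs, and skip pointers are modeled as "every sqrt(n)-th position can jump sqrt(n) ahead" rather than stored explicitly.

```python
import math

def intersect_with_skips(a, b):
    """Intersect two sorted lists of distinct doc IDs.

    Skip pointers are simulated: positions that are multiples of
    sqrt(len) may jump ahead by sqrt(len) entries in one step.
    """
    skip_a = max(1, int(math.sqrt(len(a))))
    skip_b = max(1, int(math.sqrt(len(b))))
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Take the skip only if the target does not overshoot b[j].
            if i % skip_a == 0 and i + skip_a < len(a) and a[i + skip_a] <= b[j]:
                i += skip_a
            else:
                i += 1
        else:
            if j % skip_b == 0 and j + skip_b < len(b) and b[j + skip_b] <= a[i]:
                j += skip_b
            else:
                j += 1
    return result
```

The win comes when one list is much shorter than the other: long runs of the longer list get jumped over instead of being scanned entry by entry.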

atombender • 4 hours ago

I think you're thinking of index stream readers?

mrkeen • 4 hours ago

I have heard of neither. But the mention of Burrows leads me to Burrows-Wheeler, which is a compression algorithm (bzip).

I'm not 100% but I don't think you can directly query a BWT in the same way you'd query an inverted index (without the later discovery of wavelet trees and FM-indexes / succinct data structures, and all that jazz.) And that's mostly for genomics? Not sure if it applies to plain old document searches. Would love to be corrected though.
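
For the curious, a toy sketch of the backward-search trick that FM-indexes layer on top of a BWT. Everything here is an illustrative assumption: the function names are made up, the transform is built by sorting all rotations (quadratic), and the occurrence counts are recomputed by scanning instead of using wavelet trees or sampled rank tables.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (demo-sized inputs only)."""
    s = text + "\0"  # sentinel that sorts before every other character
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)

def count_occurrences(bwt_text, pattern):
    """Count how often pattern occurs in the original text, using only the BWT."""
    # first_pos[c] = index of the first row starting with c in the sorted rotations,
    # i.e. the number of characters in the text that are smaller than c.
    first_pos = {}
    for idx, ch in enumerate(sorted(bwt_text)):
        first_pos.setdefault(ch, idx)

    def occ(c, i):
        # Number of occurrences of c in bwt_text[:i] (naive scan).
        return bwt_text[:i].count(c)

    lo, hi = 0, len(bwt_text)  # half-open range of matching rotations
    for c in reversed(pattern):
        if c not in first_pos:
            return 0
        lo = first_pos[c] + occ(c, lo)
        hi = first_pos[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo
```

So the raw BWT string alone isn't enough to query; the `first_pos` table and fast `occ` lookups are the extra machinery that makes it an index.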

lazamar • 30 minutes ago

At Meta they are using FM indexes to power text search through the entire commit history of their monorepo.

SeanSullivan86 • 4 hours ago

I've sometimes been confused by the term "inverted index". The example in this post feels like what I would just call an "index", i.e. documents indexed by the words they contain. Feels about the same as the index in the back of a physical book.

Is the distinction that an index on a multi-valued attribute is called an inverted index?

atombender • 4 hours ago

Inverted indexes are what databases call indexes. The term is used in the IR field to differentiate them from forward indexes, which are less common, so you're right that we could just say "index".

But when we talk about inverted indexes, they are almost always term -> posting list, and most index data structures lay these out so that posting lists are sorted and compressed together. Traditional database indexes like B-trees are optimized for rapid insertion and deletion, while inverted indexes tend to be optimized for batch processing, because you typically deconstruct text into words for a large batch and then lazily integrate this batch into the main index.
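
To make the batch workflow concrete, here is a minimal sketch under my own assumptions: whitespace tokenization, integer doc IDs, and an in-memory dict standing in for the on-disk index. The names `index_batch`, `merge_into`, `to_gaps`, and `from_gaps` are hypothetical. The gap encoding at the end is the usual first step before a real compression scheme is applied to a sorted posting list.

```python
from collections import defaultdict
from itertools import accumulate

def index_batch(docs):
    """Tokenize one batch of {doc_id: text} into term -> sorted doc IDs."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def merge_into(main, batch):
    """Return a new index with the batch's posting lists folded into main."""
    merged = dict(main)
    for term, ids in batch.items():
        merged[term] = sorted(set(merged.get(term, ())) | set(ids))
    return merged

def to_gaps(doc_ids):
    """Represent a sorted posting list as gaps between consecutive doc IDs."""
    return [cur - prev for prev, cur in zip([0] + doc_ids[:-1], doc_ids)]

def from_gaps(gaps):
    """Recover the doc IDs from a gap-encoded posting list."""
    return list(accumulate(gaps))

main = index_batch({1: "red apple", 2: "green apple"})
batch = index_batch({5: "apple pie"})
main = merge_into(main, batch)  # the whole batch is integrated in one pass
```

Small gaps need fewer bits than raw doc IDs, which is why keeping posting lists sorted matters for compression.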

Part of this is about scale: a row in a database typically has a single column, or maybe 2-3 columns in a composite index, but a document's text may tokenize into thousands, hundreds of thousands, or millions of words. At this scale, the fine-grained nature of words means B-trees aren't as good a fit.

Another part of it is that inverted indexes aren't for point queries, which is what B-trees are optimized for; you typically search for many words at a time in order to rank your search results by some function like cosine similarity. You rarely want a single posting; you want the union or intersection of many posting lists, sorted by score.
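
A small sketch of what "union of many posting lists, sorted by score" can look like. The assumptions are mine: raw term counts as vector weights (no IDF), whitespace tokenization, and the hypothetical names `build_index` and `rank`.

```python
import math
from collections import Counter, defaultdict

def build_index(docs):
    """Return (term -> {doc_id: term count}, doc_id -> vector length)."""
    index = defaultdict(dict)
    norms = {}
    for doc_id, text in docs.items():
        counts = Counter(text.lower().split())
        norms[doc_id] = math.sqrt(sum(tf * tf for tf in counts.values()))
        for term, tf in counts.items():
            index[term][doc_id] = tf
    return index, norms

def rank(index, norms, query):
    """Score every doc in the union of the query terms' posting lists."""
    q_counts = Counter(query.lower().split())
    q_norm = math.sqrt(sum(v * v for v in q_counts.values()))
    dots = defaultdict(float)
    for term, q_tf in q_counts.items():
        # Walk one posting list per query term, accumulating dot products.
        for doc_id, tf in index.get(term, {}).items():
            dots[doc_id] += q_tf * tf
    scored = [(dot / (q_norm * norms[doc_id]), doc_id) for doc_id, dot in dots.items()]
    # Highest cosine similarity first; ties broken by doc ID.
    return [doc_id for score, doc_id in sorted(scored, key=lambda s: (-s[0], s[1]))]
```

Note that the query never touches documents that share no term with it; the posting lists do all the filtering.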

modulovalue • 4 hours ago

NIT: That's not quite correct if your first statement is meant to imply an equality rather than a subset relation.

The idea of an index is more general, as an index can be built for many different domains. For example, B-trees can index monoidal data and inverted indexes are just an instance of such a monoid that a B-tree can efficiently index.

Furthermore, metric spaces (e.g., Levenshtein distance) can also be efficiently indexed using other trees: metric trees. So calling inverted indexes just indexes would be really confusing, since string data is not the only kind of data that a database might want to support efficient indexes for.

atombender • 3 hours ago

My point is that all indexes are "inverted" in the sense that they map some searchable value to occurrences of said value. That is true even if the method of comparison is not strict equality.

giovannibonetti • 1 hour ago

___tom___ • 3 hours ago

This drove me up the wall, until I researched it.

A document can be viewed as an object with a set of pointers to the words it contains.

The inverse of that is a word object with a list of pointers to the documents it is found in. This was referred to as an inverted DOCUMENT index. This is what people would normally just call an index.

At some point, people dropped the "DOCUMENT" part, and started just calling it an "inverted index". This makes no sense, grammatically, as it's the document that is inverted, not the index, but it is what it is.

So, an inverted index is just an index.
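
In code, the "inversion" is just flipping the direction of the pointers. A minimal sketch, where the dict-of-sets layout and the name `invert` are my own illustrative choices:

```python
def invert(forward):
    """Turn document -> words into word -> documents."""
    inverted = {}
    for doc_id, words in forward.items():
        for word in words:
            inverted.setdefault(word, set()).add(doc_id)
    return inverted

# Forward view: each document points at the words it contains.
forward = {
    "doc1": {"search", "engine"},
    "doc2": {"search", "index"},
}
# Inverted view: each word points at the documents it is found in.
inverted = invert(forward)
```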

nzeid • 55 minutes ago

Love this take.

mrkeen • 4 hours ago

No, it's the same thing. With any book you have a built-in mechanism to go to a page number and see what words are there. An inverted index lets you do the inverse (words -> page numbers).

SeanSullivan86 • 4 hours ago

People (non-tech) don't tend to refer to "go to page 106" as using an index. The pages at the back of the book providing the word -> page numbers lookup are commonly known as the book's "index".

grg0 • 4 hours ago

"commonly" is an understatement; that's literally what a book index is by definition.

The only thing "inverted" here is the context. The author even admits themselves that the word->doc mapping is an index:

"If user wants to search by words - then words should be keys in our "database" (index)"

It's a pointless debate of semantics. An inverted map is still a map.

dvh • 3 hours ago

I recently used an inverted index (with ranked document retrieval) and it all took only 66 lines of JavaScript: https://github.com/dvhx/ngspicejs/blob/master/js/search.js and I'm kinda proud of that code. It's compact, with no dirty tricks and nothing overly clever. Well, except for using 1/term_frequency instead of logarithms: it's easier to debug (sums of fractions instead of random numbers produced by logarithms), so I just left it there, and it works fine.