Gin Index - Kludged.net

Gin, or Generalized Inverted Index, is commonly used for full text searching. Its name was inspired by the words “Jinn” or “Djinn” meaning “genie” rather than a drink.¹

The inverted index stores “key, posting list” pairs. In other words, it maps a word with a list of all the rows that contain that word. The actual structure is a B-tree built from the set of all keys. It works just like normal B-tree indexes except that it doesn’t have a left link pointer, so it doesn’t support scanning the index backwards. Also, tuples are never deleted from the tree because the list of distinct keys doesn’t change much on larger indexes, which greatly simplifies the process.

Each key is an element of indexed items such as lexemes for tsvector. A lexeme is a string or token that has been normalized. In this case, normalized means it has been slightly transformed; for example, it is converted to lowercase and different forms of the same word are made alike such as “jumped”, “jumps”, and “jumping” using a process called stemming. Tsvector is a data type in Postgres that is used to store results of a document after the words have been normalized.

Each tuple in a leaf page contains either a posting tree or a posting list, if the list is small enough to fit in a single B-tree page. The difference between a posting tree and a posting list is that a posting tree is a pointer to another B-tree containing item pointers (TIDs), and a posting list is a simple list of item pointers contained within the leaf page. While tuples are not deleted, TIDs in a list can be.

Full Text Search

Gin indexes are commonly used to split up strings into individual tokens. Each token becomes its own entry in the B-tree, which allows for full text searching.

JSONB

For the same reason full text searching works, JSON and JSONB columns can benefit from Gin indexes. It makes it possible to filter based on the JSON keys, values, and overall structure. However, at least in PG10, the query planner for JSONB columns isn’t extremely sophisticated. Instead of gathering statistics on each key, it assumes every key has a flat 1% selectivity, i.e. that each key will return 100 values.

This can cause the planner to create inefficient plans and therefore cause JSONB queries to run slower than necessary.

Trigrams

Later in this chapter, we’ll go over trigrams in detail. Trigrams are combinations of 3 letter sequences, which makes it possible to search partial works efficiently. column LIKE '%filter%' becomes much faster because it can use the index.

Creating a Gin Index

CREATE INDEX name ON table USING GIN(column); The column must be of tsvector type, which is pretty much the different string types such as text, json, jsonb, etc.

Gin Readme in PG10 Source: src/backend/access/gin/README ↩