Use GIN to index bit strings

Question

I'm trying to extend PostgreSQL to index bit strings up to 1000 bits. (These bit strings are created by quantization of high-dimensional vectors, so for each dimension up to 4 bits are assigned). Insertions are rather infrequent, whereas searches are the mostly used operation. In a search, I would like to get all the rows that exactly match the bit string.

It looks like a perfect job for GIN (in combination with my own data type), or what do you think?

Erwin Brandstetter · Accepted Answer · 2013-01-31 17:39:55Z

In a search, I would like to get all the rows that exactly match the bit string.

Use a B-Tree index, the default type. I don't see a case for a GIN index here.

Up to 1000 bits result in up to 133 bytes storage size on disc for a bit varying type.

SELECT pg_column_size(repeat('1', 1000)::varbit)  -- 133

Not that much. A plain B-Tree index should do. But maybe the column is big enough that the following trick(s) improves performance.

If a small part of the bitstring column is distinctive enough to narrow your search down to few hits, an index on an expression might give you better performance, because the smaller index can fit into RAM and is faster to process all around. Don't bother for small tables, but could make a big difference for big tables.

Example

Given table:

CREATE TABLE tbl(id serial PRIMARY KEY, b_col varbit);

If the first 10 bit are enough to narrow down a search to a few hits, you could create an index on the expression b_col::bit(10). Casting to bin(n) truncates the bitstring to n bit.

CREATE INDEX tbl_b_col10_idx ON tbl (b_col::bit(10))

Then, instead of the query

SELECT * FROM tbl WHERE b_col = '1111011110111101'::varbit;

You would use:

SELECT *
FROM   tbl
WHERE  b_col::bit(10) = '1111011110111101'::bit(10) -- utilize index
AND    b_col = '1111011110111101'::varbit;  -- narrow down to exact match.

Be aware that shorter values are padded with 0's to the right (lsb) when cast to bit(n).

Optimize further

Since most installations operate with a MAXALIGN of 8 bytes (64 bit OS) (more details here), your index size is the same for any data not exceeding 8 bytes. Effectively, per row:

 4 bytes item pointer
23 bytes for the tuple header
 1 byte padding for data alignment
   actual space for data
   padding to the nearest multiple of 8 bytes

Plus some overhead per page and index / table. Details in the manual or in this related answer on SO.

Therefore, you should be able to further optimize the above approach. Take the first 64 bit (or last or whatever is most distinctive and works for you), cast it to bigint and build an index on this expression.

CREATE INDEX tbl_b_col64_idx ON tbl (b_col::bit(64)::bigint)

I cast twice (::bit(64)::bigint) for there is no cast defined between varbit and bigint. The query to go along with that:

SELECT *
FROM   tbl
WHERE  b_col::bit(64)::bigint = '1111011110111101'::bit(64)::bigint -- utilize index
AND    b_col = '1111011110111101'::varbit;  -- narrow down to exact match

The resulting index should be just as big as the one in the first example, but queries should be considerably faster for three reasons:

The index returns much fewer hits (64 bit of information vs. 10 bit)
Postgres can work with integer arithmetic, which should be faster, even for a plain = operation. (Didn't test to verify that.)
The type integer has no overhead like varbit - 5 or 8 bytes. (In my installation 5 bytes for up to 960 bit, 8 bytes for more).
Effectively, to keep the index at its minimum size, you can only pack 24 bit into a varbit index - compared to 64 bit of information for a bigint index.

CLUSTER

In such a case CLUSTER should improve performance:

CLUSTER TABLE tbl USING tbl_b_col10_idx;

It's a one-time operation and has to be repeated at intervals of your design. Be sure to read the manual on CLUSTER if you want to use that.

If the first 64 bit of your values are unique most of the time, CLUSTER will barely help, since the index scan will return a single row in most cases. If not, CLUSTER will help a lot. Consequently, the effect will be far greater for the first example with the less optimized index.

Effectively, my solution is similar to a hash index, just with a very fast and simple hash algorithm (the cast) and tailored to Postgres' index capabilities.

asked	3 months ago
viewed	95 times
active	3 months ago

Use GIN to index bit strings

1 Answer

Example

Optimize further

CLUSTER

Your Answer

Not the answer you're looking for? Browse other questions tagged postgresql index index-tuning or ask your own question.

Use GIN to index bit strings

1 Answer

Example

Optimize further

CLUSTER

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged postgresql index index-tuning or ask your own question.

Related