In a search, I would like to get all the rows that exactly match the
bit string.
Use a B-Tree index, the default type. I don't see a case for a GIN index here.
Up to 1000 bits result in up to 133 bytes storage size on disc for a bit varying
type.
SELECT pg_column_size(repeat('1', 1000)::varbit) -- 133
Not that much. A plain B-Tree index should do. But maybe the column is big enough that the following trick(s) improves performance.
If a small part of the bitstring column is distinctive enough to narrow your search down to few hits, an index on an expression might give you better performance, because the smaller index can fit into RAM and is faster to process all around. Don't bother for small tables, but could make a big difference for big tables.
Example
Given table:
CREATE TABLE tbl(id serial PRIMARY KEY, b_col varbit);
If the first 10 bit are enough to narrow down a search to a few hits, you could create an index on the expression b_col::bit(10)
. Casting to bin(n)
truncates the bitstring
to n bit.
CREATE INDEX tbl_b_col10_idx ON tbl (b_col::bit(10))
Then, instead of the query
SELECT * FROM tbl WHERE b_col = '1111011110111101'::varbit;
You would use:
SELECT *
FROM tbl
WHERE b_col::bit(10) = '1111011110111101'::bit(10) -- utilize index
AND b_col = '1111011110111101'::varbit; -- narrow down to exact match.
Be aware that shorter values are padded with 0
's to the right (lsb) when cast to bit(n)
.
Optimize further
Since most installations operate with a MAXALIGN
of 8 bytes (64 bit OS) (more details here), your index size is the same for any data not exceeding 8 bytes. Effectively, per row:
4 bytes item pointer
23 bytes for the tuple header
1 byte padding for data alignment
actual space for data
padding to the nearest multiple of 8 bytes
Plus some overhead per page and index / table. Details in the manual or in this related answer on SO.
Therefore, you should be able to further optimize the above approach. Take the first 64 bit (or last or whatever is most distinctive and works for you), cast it to bigint
and build an index on this expression.
CREATE INDEX tbl_b_col64_idx ON tbl (b_col::bit(64)::bigint)
I cast twice (::bit(64)::bigint
) for there is no cast defined between varbit
and bigint
. The query to go along with that:
SELECT *
FROM tbl
WHERE b_col::bit(64)::bigint = '1111011110111101'::bit(64)::bigint -- utilize index
AND b_col = '1111011110111101'::varbit; -- narrow down to exact match
The resulting index should be just as big as the one in the first example, but queries should be considerably faster for three reasons:
The index returns much fewer hits (64 bit of information vs. 10 bit)
Postgres can work with integer arithmetic, which should be faster, even for a plain =
operation. (Didn't test to verify that.)
The type integer
has no overhead like varbit
- 5 or 8 bytes. (In my installation 5 bytes for up to 960 bit, 8 bytes for more).
Effectively, to keep the index at its minimum size, you can only pack 24 bit into a varbit
index - compared to 64 bit of information for a bigint
index.
CLUSTER
In such a case CLUSTER
should improve performance:
CLUSTER TABLE tbl USING tbl_b_col10_idx;
It's a one-time operation and has to be repeated at intervals of your design. Be sure to read the manual on CLUSTER
if you want to use that.
If the first 64 bit of your values are unique most of the time, CLUSTER
will barely help, since the index scan will return a single row in most cases. If not, CLUSTER
will help a lot. Consequently, the effect will be far greater for the first example with the less optimized index.