A good early paper on the topic of bitmap indexes is "Bit Transposed Files" by Wong et al, published in 1985.
In all the query-decoding examples, it is left to the reader to understand how each decoded query is derived. One of the examples is too complex for me to figure out.
In section 5.3.3.b, the paper shows how the BTF query processor would decode the query to find all dogs receiving more than 30 dosage units.
In BTF, the query would look like this:
dosage[>30]
We are told that
Attribute dosage is encoded as a Composite unary with 3 fields of 6 bits. Assume dosage 30 is encoded as 000111,000011,011111."
Thus the paper claims
the query can be translated to
dosage ((b14 & b7 & b4) | (b14 & b8) | b15)
and says no more about it.
How does that work?
By my understanding, the decoded query dosage (b14 & b7 & b4)
should alone select the correct values, if we interpret "more than" as "more than or equal to".
My attempt at understanding the solution follows.
Understanding the encoded value
I quote literal values in decimal unless specified otherwise.
We are told that the encoding represents the value 30.
We should remember a unary field that encodes value n+1 sets all the bits that encode n plus the next rightmost bit. A 0 is encoded by setting no bits.
To help me understand the derivation, I created a table that maps each bit of the attribute encoded value (u for unary) to its field value (d for decimal) and to its attribute ordinal position (n for n-th).
The table looks like this:
n 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
d 6 5 4 3 2 1 6 5 4 3 2 1 6 5 4 3 2 1
u 0 0 0 1 1 1 0 0 0 0 1 1 0 1 1 1 1 1
We can see from the table that the the values of the fields from left to right are (3, 2, and 5) respectively.
We know that 3 * 2 * 5 = 30, so we might assume that the product of the field values gives the attribute value.
If this is true, then:
- the order of the fields should be insignificant.
- 6 * 6 * 6 = 216 is the biggest encodable value.
- each value might have multiple encodings.
Section 4 (Bit Transposition) states that attribute dosage has 200 distinct values, so all attribute values are encodable.
Understanding the decoded query
The decoded query ORs together three expressions.
The first expression (b14 & b7 & b4)
selects attributes whose:
- first field encodes 3, 4, 5, or 6.
- second field encodes 2, 3, 4, 5, or 6
- third field encodes 5 or 6.
The expression selects dosage values greater than 30, but selects dosages of 30 as well. If the aim is to find values strictly greater than 30, then the expression would select values that don't match the BTF query.
The second expression (14 & b8)
and the third expression (b15)
don't select bits for some fields, so these fields could have any value.
If this is true, then the whole expression would select dogs who received a dose of 0 and many other values less than 30.
By my understanding, the decoded query would select many values that don't match the BTF query.
(b14 & b7 & b4)
meansabc where a>=3 and b>=2 and c>=5
, while(b14 & b8)
meansabc where a>=3 and b>=3
and(b15)
meansabc where a>=4
. So you have325
and33c
and4bc
, all greater (or equal) to325
. – ypercube May 8 at 9:3830 = 3 * 2 * 5
is just a coincidence. The number30
is encoded as325
where3
,2
and5
are just digits in a base-6 system. – ypercube May 8 at 10:025 * (6 ^ 0) + 2 * (6 ^ 1) + 3 * (6 ^ 2) = 5 + 12 + 108 = 125
So I don't see a direct relationship to a dosage value of 30. – Iain Elder May 9 at 12:14