PostgreSQL Multi-Column index join with comparison (“<” and “>”) operators

Question

I'm trying to take advantage of a multi-column btree index in PostgreSQL to perform an annoying join between two tables.

               Table "revision_main"
     Column     |          Type          | Modifiers 
----------------+------------------------+-----------
 revision_id    | integer                | 
 page_id        | integer                | 

Indexes:
    "revision_main_pkey" UNIQUE, btree (revision_id)
    "revision_main_cluster_idx" btree (page_id, "timestamp") CLUSTER

This table contains revisions (~ 300 Million rows) to pages in a wiki. There are more columns in my table, but I've discarded them for this example because they shouldn't matter.

               Table "revert"
       Column       |  Type   | Modifiers 
--------------------+---------+-----------
 page_id            | integer | 
 revision_id        | integer | 
 reverted_to        | integer | 
Indexes:
    "revert_page_between_idx" btree (page_id, reverted_to, revision_id) CLUSTER

This table contains reverting revisions (~22 million rows). If a revision has been reverted, it will have a row in the revision_main table, it will share the revert's page_id, and its revision_id will fall strictly between the revert's reverted_to and revision_id. (See http://en.wikipedia.org/wiki/Wikipedia:Revert if you are curious.)

Joining these two tables to get reverted revisions seems straightforward. Here is what I've come up with:

explain SELECT
    r.revision_id,
    rvt.revision_id
FROM revision_main r
INNER JOIN revert rvt 
    ON r.page_id = rvt.page_id 
    AND r.revision_id > rvt.reverted_to
    AND r.revision_id < rvt.revision_id;
                                       QUERY PLAN                                               
----------------------------------------------------------------------------------------------------
 Merge Join  (cost=4202878.87..15927491478.57 rows=88418194298 width=8)
   Merge Cond: (r.page_id = rvt.page_id)
   Join Filter: ((r.revision_id > rvt.reverted_to) AND (r.revision_id < rvt.revision_id))
   ->  Index Scan using revision_main_page_id_idx on revision_main r  (cost=0.00..9740790.61 rows=223163392 width=8)
   ->  Materialize  (cost=4201592.06..4536465.21 rows=26789852 width=12)
         ->  Sort  (cost=4201592.06..4268566.69 rows=26789852 width=12)
               Sort Key: rvt.page_id
               ->  Seq Scan on revert rvt  (cost=0.00..438534.52 rows=26789852 width=12)

Even though the clustered index on revert should be a Btree index (and thus support comparison operators like "<" and ">"), the query optimizer does not use the index for the join and "explain" predicts a total cost of over 15 billion (might be done next year).

Are comparison operators impossible to use with multi-column (btree) indexes? Am I just doing it wrong?

btilly · Answer 1 · 2011-02-03 21:52:08Z


It looks like the optimizer knows its job better than you do.

If you are selecting more than a small fraction of a table (what fraction is hardware dependent, let's say 5%), then it is faster to select and order the whole table than it is to use an index. If you were just selecting a few rows, then it should use the index. So it is giving you the correct query plan for your data.

As for the total cost, those numbers are all BS and are only useful when compared in relation to each other, within a single query. (The total costs produced by two very similar queries can be on a very different scale.) The time to execute and the query cost are pretty much unrelated.
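One way to check this against your own data, as a rough sketch: run the join for a single page with EXPLAIN ANALYZE, which executes the query and prints actual row counts and timings next to the planner's estimates. The page_id value 12345 is a placeholder, not a value taken from the question.

```sql
-- Sketch: restrict the join to one page so only a small fraction of each
-- table is needed. 12345 is a hypothetical page_id; substitute a real one.
EXPLAIN ANALYZE
SELECT
    r.revision_id,
    rvt.revision_id
FROM revision_main r
INNER JOIN revert rvt
    ON r.page_id = rvt.page_id
    AND r.revision_id > rvt.reverted_to
    AND r.revision_id < rvt.revision_id
WHERE r.page_id = 12345;
```

With so few rows requested, the planner will likely switch to index scans, though the exact plan depends on your statistics and settings.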

edited Feb 3 '11 at 21:52

answered Feb 3 '11 at 20:27


I can see how just sorting the whole table could be faster than using the index, but in my experience, the cost estimates tend to be a consistent reflection of execution time. On the other hand, I've never really known what the numbers mean so I'll concede to your understanding. Are you suggesting that I should just run the query and ignore the numbers? – halfak Feb 3 '11 at 22:09

@halfak: Let me look more closely. Databases like to start joins with smaller tables. It is possible that if you added an index on (page_id, revision_id) to revision_main that you'd get a more efficient query. It also might be worse. But if that fails, then the only way to get it to be much more efficient is to find a way to ask for less data. – btilly Feb 3 '11 at 22:46
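The index suggested in that comment could be created roughly as follows. The index name is made up for illustration, and there is no guarantee the planner will choose it for the full join.

```sql
-- Hypothetical index name; the columns are the ones suggested in the comment.
CREATE INDEX revision_main_page_revision_idx
    ON revision_main (page_id, revision_id);

-- Refresh planner statistics, then re-run EXPLAIN on the join to compare plans.
ANALYZE revision_main;
```

On a table of about 300 million rows the build may take a while, so it is worth trying on a copy or a subset first.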


Tom Morris · Answer 2 · 2014-05-08 16:22:54Z

Your query (based on the SQL) looks like it needs to read the entire revert table, and find the appropriate revision rows for each row in the revert table.

Since the entire revert table needs to be read, a sequential scan of it is appropriate. It seems to expect roughly the right number of rows.

Each revert row is then going to match a number of revisions, which it thinks will be best done through an index scan and merge join. It estimates that on average, each revert row will match roughly 3300 revisions, resulting in 88 billion rows.
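The 3300 figure comes from dividing the join's estimated row count by the estimated number of rows in revert, both taken from the plan in the question:

```sql
-- Estimated join rows divided by estimated revert rows (both from the EXPLAIN output).
SELECT round(88418194298::numeric / 26789852) AS est_revisions_per_revert;
-- returns 3300
```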

I don't know of any ways to select 88 billion rows quickly.

To get a more accurate estimate, you'll need a way of convincing PostgreSQL that each revert covers far fewer than 3300 revisions.

You say that you are after reverted revisions, indicating that each revision should appear only once, even if included within multiple reverts.

So try using an EXISTS (subquery) instead of an INNER JOIN.

Note that this won't return the reverting revision IDs, only the reverted ones:

EXPLAIN
SELECT
    r.revision_id
FROM revision_main r
WHERE EXISTS (SELECT 1 FROM revert rvt 
    WHERE r.page_id = rvt.page_id 
    AND r.revision_id > rvt.reverted_to
    AND r.revision_id < rvt.revision_id);

"each revert row will match roughly 3300 revisions, resulting in 88 billion rows." --- I see... In practice, each revert should match 1 revision for 99% of revert rows. Is there some way to make this evident and would it matter? — halfak, Feb 4 '11 at 15:41
You could find and store the revisions for a reverted page when the revert occurs. — Stephen Denne, Feb 6 '11 at 6:21

