How to reduce index scan time for a postgresql IN query with ordered results?

Question

I built a simple app for reading RSS feeds using rails and postgresql, but am running into performance issues when I try to query my feed_entries table for posts from more than one feed. An example query looks like this, to retrieve the 20 most recent entries for a given collection of feed ids:

SELECT * FROM feed_entries WHERE feed_id IN (19, 21, 383, 1867, 3103) ORDER BY published_at DESC LIMIT 20;

The feed_entries table has about 4 million rows in it, is hosted on Heroku Postgres with the Fugu plan, and it has a few indexes, including:

"index_feed_entries_on_feed_id_and_published_at" btree (feed_id, published_at)
"index_feed_entries_on_published_at" btree (published_at)

Here are the results of the query planner:

EXPLAIN ANALYZE SELECT * FROM feed_entries WHERE feed_id IN (19, 21, 383, 1867, 3103) ORDER BY published_at DESC LIMIT 20;

 Limit  (cost=4353.93..4353.94 rows=20 width=1016) (actual time=12172.275..12172.325 rows=20 loops=1)
   ->  Sort  (cost=4353.93..4355.07 rows=2286 width=1016) (actual time=12172.268..12172.284 rows=20 loops=1)
     Sort Key: published_at
     Sort Method: top-N heapsort  Memory: 52kB
     ->  Index Scan using index_feed_entries_on_feed_id_and_published_at on feed_entries  (cost=0.00..4341.76 rows=2286 width=1016) (actual time=8.612..12169.504 rows=630 loops=1)
           Index Cond: (feed_id = ANY ('{19,21,383,1867,3103}'::integer[]))
Total runtime: 12172.520 ms

The planner looks like it's using the appropriate index, yet scanning the index still takes ~12 seconds, which strikes me as too long for a table that has 4 million rows. If I repeat the query planner exactly as above, then the second time it tells me that the whole thing takes only 2 ms, maybe that's simply because the results of the first query are cached, but it's still confusing to me. I also tried running VACUUM ANALYZE before running the query, but it made little difference. Additionally, if I query the table for a single feed_id, then the query planner uses an Index Scan Backward using index_feed_entries_on_feed_id_and_published_at on feed_entries, and total execution time is much faster, on the order of 20ms.

Are there other strategies I could adopt to optimize the performance of this relatively simple IN query?

What is the beahviour with a OR instead of IN in your request ? — willome, Jun 27 at 15:34
the query plan looks different, it does a Bitmap Index Scan on index_feed_entries_on_feed_id_and_published_at for each of the feed_ids in the OR conditions. The total runtime decreased to ~3000 ms, though it's hard to know if that is somewhat related to cached results (I was experimenting with the old query not too long before trying this new one) — tws, Jun 27 at 15:51
Just run each query several times to exclude caching effects. — Erwin Brandstetter, Jun 27 at 16:24

Erwin Brandstetter · Accepted Answer · 2013-06-27 16:40:11Z

Another thing to try would be this alternative query form:

SELECT * 
FROM   feed_entries
JOIN  (unnest('{19,21,383,1867,3103}'::int[]) AS feed_id) sub USING (feed_id)
ORDER  BY published_at DESC
LIMIT  20;

Sort order of columns does matter in multi-column indexes, though. Use:

CREATE index_feed_entries_2 ON feed_entries (feed_id, published_at DESC)

If you CLUSTER your table according to this index, this might give you another little boost, but effectiveness deteriorates with a lot of updates. Read the last chapter of this related answer for more info:
Bitmap Heap Scan performance

Of course, all the usual advice on performance optimization applies, too.

Thanks! The CLUSTER seems to have made a meaningful improvement, though I'm a bit worried that will degrade over time as I add/remove rows from the table. Currently I've scheduled a weekly CLUSTER and VACUUM ANALYZE on the table. The DESC index though doesn't seem to make a difference: the query planner still uses the ASC index, even though the query is clearly sorting by descending — tws, Jul 1 at 2:50

bma · Answer 2 · 2013-06-27 15:50:13Z

up vote 0 down vote

Try creating an index with a DESC order. Eg.

create index feed_entries_published_at_desc_idx on feed_entries ( published_at desc ) with (fillfactor=100);

You could try a similar (compound) index as above on (feed_id, published_at desc) to see how that works too.

answered Jun 27 at 15:50

bma
1,88127

1

The query planner does use that index now, but the results are still slow. also, why fillfactor=100? My understanding was that if you have a table that will be updated frequently, you should use a lower value for fillfactor? – tws Jun 27 at 16:20

Oh, the data is UPDATEd a lot? Sorry, I didn't get that from the question. If that's the case, then yeah, drop the fillfactor. You say the index is already being used -- which one? Your original, or the one I suggested with a sort order of DESC? – bma Jun 27 at 16:24

@tws: Sort order on a single-column B-tree index has hardly any effect. Postgres can scan backwards at (almost) the same speed. – Erwin Brandstetter Jun 27 at 16:33

asked	4 months ago
viewed	107 times
active	4 months ago

How to reduce index scan time for a postgresql IN query with ordered results?

2 Answers

Your Answer

Not the answer you're looking for? Browse other questions tagged postgresql postgresql-9.2 or ask your own question.

Linked

How to reduce index scan time for a postgresql IN query with ordered results?

2 Answers

Your Answer

Sign up or login

Post as a guest

Not the answer you're looking for? Browse other questions tagged postgresql postgresql-9.2 or ask your own question.

Linked

Related