I built a simple app for reading RSS feeds using rails and postgresql, but am running into performance issues when I try to query my feed_entries
table for posts from more than one feed. An example query looks like this, to retrieve the 20 most recent entries for a given collection of feed ids:
SELECT * FROM feed_entries WHERE feed_id IN (19, 21, 383, 1867, 3103) ORDER BY published_at DESC LIMIT 20;
The feed_entries
table has about 4 million rows in it, is hosted on Heroku Postgres with the Fugu plan, and it has a few indexes, including:
"index_feed_entries_on_feed_id_and_published_at" btree (feed_id, published_at)
"index_feed_entries_on_published_at" btree (published_at)
Here are the results of the query planner:
EXPLAIN ANALYZE SELECT * FROM feed_entries WHERE feed_id IN (19, 21, 383, 1867, 3103) ORDER BY published_at DESC LIMIT 20;
Limit (cost=4353.93..4353.94 rows=20 width=1016) (actual time=12172.275..12172.325 rows=20 loops=1)
-> Sort (cost=4353.93..4355.07 rows=2286 width=1016) (actual time=12172.268..12172.284 rows=20 loops=1)
Sort Key: published_at
Sort Method: top-N heapsort Memory: 52kB
-> Index Scan using index_feed_entries_on_feed_id_and_published_at on feed_entries (cost=0.00..4341.76 rows=2286 width=1016) (actual time=8.612..12169.504 rows=630 loops=1)
Index Cond: (feed_id = ANY ('{19,21,383,1867,3103}'::integer[]))
Total runtime: 12172.520 ms
The planner looks like it's using the appropriate index, yet scanning the index still takes ~12 seconds, which strikes me as too long for a table that has 4 million rows. If I repeat the query planner exactly as above, then the second time it tells me that the whole thing takes only 2 ms, maybe that's simply because the results of the first query are cached, but it's still confusing to me. I also tried running VACUUM ANALYZE
before running the query, but it made little difference. Additionally, if I query the table for a single feed_id, then the query planner uses an Index Scan Backward using index_feed_entries_on_feed_id_and_published_at on feed_entries
, and total execution time is much faster, on the order of 20ms.
Are there other strategies I could adopt to optimize the performance of this relatively simple IN query?
Bitmap Index Scan on index_feed_entries_on_feed_id_and_published_at
for each of the feed_ids in the OR conditions. The total runtime decreased to ~3000 ms, though it's hard to know if that is somewhat related to cached results (I was experimenting with the old query not too long before trying this new one) – tws Jun 27 at 15:51