Postgres not using index when index scan is much better option

Question

I have a simple query to join two tables that's being really slow. I found out that the query plan does a seq scan on the large table email_activities (~10m rows) while I think using indexes doing nested loops will actually be faster.

I rewrote the query using a subquery in an attempt to force the use of index, then noticed something interesting. If you look at the two query plans below, you will see that when I limit the result set of subquery to 43k, query plan does use index on email_activities while setting the limit in subquery to even 44k will cause query plan to use seq scan on email_activities. One is clearly more efficient than the other, but Postgres doesn't seem to care.

What could cause this? Does it have a configs somewhere that forces the use of hash join if one of the set is larger than certain size?

explain analyze SELECT COUNT(DISTINCT "email_activities"."email_recipient_id") FROM "email_activities" where email_recipient_id in (select "email_recipients"."id" from email_recipients WHERE "email_recipients"."email_campaign_id" = 1607 limit 43000);
                                                                                            QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=118261.50..118261.50 rows=1 width=4) (actual time=224.556..224.556 rows=1 loops=1)
   ->  Nested Loop  (cost=3699.03..118147.99 rows=227007 width=4) (actual time=32.586..209.076 rows=40789 loops=1)
         ->  HashAggregate  (cost=3698.94..3827.94 rows=43000 width=4) (actual time=32.572..47.276 rows=43000 loops=1)
               ->  Limit  (cost=0.09..3548.44 rows=43000 width=4) (actual time=0.017..22.547 rows=43000 loops=1)
                     ->  Index Scan using index_email_recipients_on_email_campaign_id on email_recipients  (cost=0.09..5422.47 rows=65710 width=4) (actual time=0.017..19.168 rows=43000 loops=1)
                           Index Cond: (email_campaign_id = 1607)
         ->  Index Only Scan using index_email_activities_on_email_recipient_id on email_activities  (cost=0.09..2.64 rows=5 width=4) (actual time=0.003..0.003 rows=1 loops=43000)
               Index Cond: (email_recipient_id = email_recipients.id)
               Heap Fetches: 40789
 Total runtime: 224.675 ms

And:

explain analyze SELECT COUNT(DISTINCT "email_activities"."email_recipient_id") FROM "email_activities" where email_recipient_id in (select "email_recipients"."id" from email_recipients WHERE "email_recipients"."email_campaign_id" = 1607 limit 50000);
                                                                                            QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=119306.25..119306.25 rows=1 width=4) (actual time=3050.612..3050.613 rows=1 loops=1)
   ->  Hash Semi Join  (cost=4451.08..119174.27 rows=263962 width=4) (actual time=1831.673..3038.683 rows=47935 loops=1)
         Hash Cond: (email_activities.email_recipient_id = email_recipients.id)
         ->  Seq Scan on email_activities  (cost=0.00..107490.96 rows=9359988 width=4) (actual time=0.003..751.988 rows=9360039 loops=1)
         ->  Hash  (cost=4276.08..4276.08 rows=50000 width=4) (actual time=34.058..34.058 rows=50000 loops=1)
               Buckets: 8192  Batches: 1  Memory Usage: 1758kB
               ->  Limit  (cost=0.09..4126.08 rows=50000 width=4) (actual time=0.016..27.302 rows=50000 loops=1)
                     ->  Index Scan using index_email_recipients_on_email_campaign_id on email_recipients  (cost=0.09..5422.47 rows=65710 width=4) (actual time=0.016..22.244 rows=50000 loops=1)
                           Index Cond: (email_campaign_id = 1607)
 Total runtime: 3050.660 ms

The HashAggregate operation might require too much memory for 50k rows. Try to increase work_mem ? — Andomar, Dec 30 '15 at 21:12
Basic information is missing. Please consider instructions in the tag info for [postgresql-perfiormance]. Also, your 2nd query is for LIMIT 50000, not for 44k as stated above. Adds to the difference. — Erwin Brandstetter, 7 mins ago

Ctx · Answer 1 · 2015-12-30 21:01:19Z

up vote 1 down vote

A sequential scan can be more efficient, even when an index exists. In this case, postgres seems to estimate things rather wrong. An ANALYZE <TABLE> on all related tables can help in such cases. If it doesnt, you can set the variable enable_seqscan to OFF, to force postgres to use an index whenever technically possible, at the expense, that sometimes an index-scan will be used when a sequential scan would perform better.

answered Dec 30 '15 at 21:01

Ctx

1,35412

1

Agree with analyze but i really wouldn't recommend to set enable_seqscan to OFF. can cause slow in other queries – itai 2 days ago

add a comment |

asked	4 days ago
viewed	20 times
active	today

current community

your communities

more stack exchange communities

Postgres not using index when index scan is much better option

1 Answer 1

Your Answer

Not the answer you're looking for? Browse other questions tagged sql postgresql postgresql-performance database-indexes or ask your own question.

Hot Network Questions

current community

your communities

more stack exchange communities

Postgres not using index when index scan is much better option

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged sql postgresql postgresql-performance database-indexes or ask your own question.

Related

Hot Network Questions