Sign up ×
Stack Overflow is a community of 4.7 million programmers, just like you, helping each other. Join them; it only takes a minute:

I have a simple query to join two tables that's being really slow. I found out that the query plan does a seq scan on the large table email_activities (~10m rows) while I think using indexes doing nested loops will actually be faster.

I rewrote the query using a subquery in an attempt to force the use of index, then noticed something interesting. If you look at the two query plans below, you will see that when I limit the result set of subquery to 43k, query plan does use index on email_activities while setting the limit in subquery to even 44k will cause query plan to use seq scan on email_activities. One is clearly more efficient than the other, but Postgres doesn't seem to care.

What could cause this? Does it have a configs somewhere that forces the use of hash join if one of the set is larger than certain size?

explain analyze SELECT COUNT(DISTINCT "email_activities"."email_recipient_id") FROM "email_activities" where email_recipient_id in (select "email_recipients"."id" from email_recipients WHERE "email_recipients"."email_campaign_id" = 1607 limit 43000);
                                                                                            QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=118261.50..118261.50 rows=1 width=4) (actual time=224.556..224.556 rows=1 loops=1)
   ->  Nested Loop  (cost=3699.03..118147.99 rows=227007 width=4) (actual time=32.586..209.076 rows=40789 loops=1)
         ->  HashAggregate  (cost=3698.94..3827.94 rows=43000 width=4) (actual time=32.572..47.276 rows=43000 loops=1)
               ->  Limit  (cost=0.09..3548.44 rows=43000 width=4) (actual time=0.017..22.547 rows=43000 loops=1)
                     ->  Index Scan using index_email_recipients_on_email_campaign_id on email_recipients  (cost=0.09..5422.47 rows=65710 width=4) (actual time=0.017..19.168 rows=43000 loops=1)
                           Index Cond: (email_campaign_id = 1607)
         ->  Index Only Scan using index_email_activities_on_email_recipient_id on email_activities  (cost=0.09..2.64 rows=5 width=4) (actual time=0.003..0.003 rows=1 loops=43000)
               Index Cond: (email_recipient_id = email_recipients.id)
               Heap Fetches: 40789
 Total runtime: 224.675 ms

And:

explain analyze SELECT COUNT(DISTINCT "email_activities"."email_recipient_id") FROM "email_activities" where email_recipient_id in (select "email_recipients"."id" from email_recipients WHERE "email_recipients"."email_campaign_id" = 1607 limit 50000);
                                                                                            QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=119306.25..119306.25 rows=1 width=4) (actual time=3050.612..3050.613 rows=1 loops=1)
   ->  Hash Semi Join  (cost=4451.08..119174.27 rows=263962 width=4) (actual time=1831.673..3038.683 rows=47935 loops=1)
         Hash Cond: (email_activities.email_recipient_id = email_recipients.id)
         ->  Seq Scan on email_activities  (cost=0.00..107490.96 rows=9359988 width=4) (actual time=0.003..751.988 rows=9360039 loops=1)
         ->  Hash  (cost=4276.08..4276.08 rows=50000 width=4) (actual time=34.058..34.058 rows=50000 loops=1)
               Buckets: 8192  Batches: 1  Memory Usage: 1758kB
               ->  Limit  (cost=0.09..4126.08 rows=50000 width=4) (actual time=0.016..27.302 rows=50000 loops=1)
                     ->  Index Scan using index_email_recipients_on_email_campaign_id on email_recipients  (cost=0.09..5422.47 rows=65710 width=4) (actual time=0.016..22.244 rows=50000 loops=1)
                           Index Cond: (email_campaign_id = 1607)
 Total runtime: 3050.660 ms
share|improve this question
    
The HashAggregate operation might require too much memory for 50k rows. Try to increase work_mem ? – Andomar Dec 30 '15 at 21:12
    
Basic information is missing. Please consider instructions in the tag info for [postgresql-perfiormance]. Also, your 2nd query is for LIMIT 50000, not for 44k as stated above. Adds to the difference. – Erwin Brandstetter 7 mins ago

A sequential scan can be more efficient, even when an index exists. In this case, postgres seems to estimate things rather wrong. An ANALYZE <TABLE> on all related tables can help in such cases. If it doesnt, you can set the variable enable_seqscan to OFF, to force postgres to use an index whenever technically possible, at the expense, that sometimes an index-scan will be used when a sequential scan would perform better.

share|improve this answer
1  
Agree with analyze but i really wouldn't recommend to set enable_seqscan to OFF. can cause slow in other queries – itai 2 days ago

Your Answer

 
discard

By posting your answer, you agree to the privacy policy and terms of service.

Not the answer you're looking for? Browse other questions tagged or ask your own question.