EXPLAIN ANALYZE 
SELECT count(*) 
FROM "businesses" 
WHERE (
    source = 'facebook' 
    OR EXISTS( 
        SELECT * 
        FROM provider_business_map pbm 
        WHERE 
            pbm.hotstepper_business_id=businesses.id 
            AND pbm.provider_name='facebook' 
    )
);
PLAN                                                                                 
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=233538965.74..233538965.75 rows=1 width=0) (actual time=116169.720..116169.721 rows=1 loops=1)
   ->  Seq Scan on businesses  (cost=0.00..233521096.48 rows=7147706 width=0) (actual time=11.284..116165.646 rows=3693 loops=1)
         Filter: (((source)::text = 'facebook'::text) OR (alternatives: SubPlan 1 or hashed SubPlan 2))
         SubPlan 1
           ->  Index Scan using idx_provider_hotstepper_business on provider_business_map pbm  (cost=0.00..16.29 rows=1 width=0) (never executed)
                 Index Cond: (((provider_name)::text = 'facebook'::text) AND (hotstepper_business_id = businesses.id))
         SubPlan 2
           ->  Index Scan using idx_provider_hotstepper_business on provider_business_map pbm  (cost=0.00..16.28 rows=1 width=4) (actual time=0.045..5.685 rows=3858 loops=1)
                 Index Cond: ((provider_name)::text = 'facebook'::text)
 Total runtime: 116169.820 ms
(10 rows)

This query takes over a minute, and it's doing a count that comes out to only ~3000 rows. The bottleneck seems to be the sequential scan, but I'm not sure what index I would need on the database to optimize this. It's also worth noting that I haven't tuned Postgres, so if there's any tuning that would help, it may be worth considering. That said, my DB is 15GB and I don't plan on trying to fit all of it in memory anytime soon, so I'm not sure changing RAM-related settings would help much.

On PostgreSQL version .... ? –  Craig Ringer Jul 3 at 6:19
 
I'm on Postgres 9.1 –  wachutu Jul 3 at 14:36

3 Answers

Accepted answer

OR is notorious for lousy performance. Try splitting it into a union of two completely separate queries on the two tables:

SELECT COUNT(*) FROM (
    SELECT id
    FROM businesses 
    WHERE source = 'facebook'
    UNION   -- union makes the ids unique in the result
    SELECT hotstepper_business_id
    FROM provider_business_map
    WHERE provider_name = 'facebook'
    AND hotstepper_business_id IS NOT NULL
) x

If hotstepper_business_id can not be null, you may remove the line

AND hotstepper_business_id IS NOT NULL
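
If you're not sure whether the column allows NULLs, one quick way to check is to look it up in information_schema (this assumes the table lives in the public schema):

-- does hotstepper_business_id allow NULLs?
SELECT is_nullable
FROM information_schema.columns
WHERE table_schema = 'public'
  AND table_name = 'provider_business_map'
  AND column_name = 'hotstepper_business_id';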

If you want the whole business row, you could simply wrap the above query in an IN (...):

SELECT * FROM businesses
WHERE ID IN (
    -- above inner query
)
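
Spelled out in full (the union query from above simply pasted in as the subquery), that would be something like:

SELECT * FROM businesses
WHERE id IN (
    -- same union as before, now used as the IN list
    SELECT id
    FROM businesses
    WHERE source = 'facebook'
    UNION
    SELECT hotstepper_business_id
    FROM provider_business_map
    WHERE provider_name = 'facebook'
    AND hotstepper_business_id IS NOT NULL
);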

But a much better performing query would be to modify the above query to use a join:

SELECT *
FROM businesses 
WHERE source = 'facebook'
UNION
SELECT b.*
FROM provider_business_map m
JOIN businesses b
  ON b.id = m.hotstepper_business_id
WHERE provider_name = 'facebook'
 
This makes a lot of sense, because I was getting great performance doing this in code (pull both sets into an array and dedup), but it seemed like something I should be solving at the DB level. Thank you! –  wachutu Jul 3 at 14:38
 
Would my query make better sense if I wasn't doing the count though? Or should I use your methodology and do SELECT * from businesses where businesses.id IN ( YOUR SUBQUERY ) –  wachutu Jul 3 at 14:42
 
Try to avoid WHERE id IN (subquery) - it doesn't perform well. See the edited answer for the fastest way to get all business rows –  Bohemian Jul 3 at 21:36

I'd at least try rewriting the dependent subquery as:

SELECT COUNT(DISTINCT b.*)
FROM businesses b
LEFT JOIN provider_business_map pbm
  ON b.id=pbm.hotstepper_business_id
WHERE b.source = 'facebook'
  OR pbm.provider_name = 'facebook';

Unless I'm mis-reading something, an index on businesses.id exists, but make sure there are also indexes on provider_business_map.hotstepper_business_id, businesses.source and provider_business_map.provider_name for best performance.
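
Roughly something like this, if they don't already exist (the index names here are just placeholders):

-- one single-column index per column used in the join/filter
CREATE INDEX idx_pbm_hotstepper_business_id ON provider_business_map (hotstepper_business_id);
CREATE INDEX idx_pbm_provider_name ON provider_business_map (provider_name);
CREATE INDEX idx_businesses_source ON businesses (source);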

 
Wouldn't a multicolumn index on pbm.provider_name and pbm.hotstepper_business_id have the best performance? I have idx_provider_hotstepper_business btree (provider_name, hotstepper_business_id) –  wachutu Jul 3 at 14:33
 
@wachutu Only hotstepper_business_id is always used, so you'll probably want to reverse that to btree (hotstepper_business_id, provider_name) for this query to run fast. –  Joachim Isaksson Jul 3 at 14:37
 
Not following you. The cardinality is much lower on provider_name so shouldn't I have the multicolumn index limit by that first? Would Postgres not limit the provider_business_map set that it checks first by the provider_name? It seems like that would be the best way to limit the number of rows being checked on provider_business_map –  wachutu Jul 3 at 14:58
 
Hm, yes, I mis-read the schema from the query when commenting, placing source in pbm which is quite different. You're probably correct. –  Joachim Isaksson Jul 3 at 15:06

CREATE INDEX index_name ON businesses(source);

Since only 3,693 rows match out of more than 7 million, the planner will probably use the index. Do not forget to run

ANALYZE businesses;
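
Then you can check that the planner actually picks the new index with something like this (it only exercises the source = 'facebook' branch of the original query):

-- should show an index or bitmap scan on businesses instead of a seq scan
EXPLAIN ANALYZE
SELECT count(*)
FROM businesses
WHERE source = 'facebook';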
