EXPLAIN ANALYZE 
SELECT count(*) 
FROM "businesses" 
WHERE (
    source = 'facebook' 
    OR EXISTS( 
        SELECT * 
        FROM provider_business_map pbm 
        WHERE 
            pbm.hotstepper_business_id=businesses.id 
            AND pbm.provider_name='facebook' 
    )
);
PLAN                                                                                 
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=233538965.74..233538965.75 rows=1 width=0) (actual time=116169.720..116169.721 rows=1 loops=1)
   ->  Seq Scan on businesses  (cost=0.00..233521096.48 rows=7147706 width=0) (actual time=11.284..116165.646 rows=3693 loops=1)
         Filter: (((source)::text = 'facebook'::text) OR (alternatives: SubPlan 1 or hashed SubPlan 2))
         SubPlan 1
           ->  Index Scan using idx_provider_hotstepper_business on provider_business_map pbm  (cost=0.00..16.29 rows=1 width=0) (never executed)
                 Index Cond: (((provider_name)::text = 'facebook'::text) AND (hotstepper_business_id = businesses.id))
         SubPlan 2
           ->  Index Scan using idx_provider_hotstepper_business on provider_business_map pbm  (cost=0.00..16.28 rows=1 width=4) (actual time=0.045..5.685 rows=3858 loops=1)
                 Index Cond: ((provider_name)::text = 'facebook'::text)
 Total runtime: 116169.820 ms
(10 rows)

This query takes over a minute, and it's doing a count that comes out to only ~3000 rows. The bottleneck seems to be the sequential scan, but I'm not sure what index I would need on the database to optimize this. It's also worth noting that I haven't tuned Postgres, so if there's any tuning that would help, it may be worth considering. That said, my DB is 15GB and I don't plan on trying to fit all of it in memory anytime soon, so I'm not sure changing RAM-related settings would help much.

On PostgreSQL version .... ? –  Craig Ringer Jul 3 at 6:19
 
I'm on Postgres 9.1 –  wachutu Jul 3 at 14:36

3 Answers

Accepted answer

OR is notorious for lousy performance. Try splitting it into a union of two completely separate queries on the two tables:

SELECT COUNT(*) FROM (
    SELECT id
    FROM businesses 
    WHERE source = 'facebook'
    UNION   -- union makes the ids unique in the result
    SELECT hotstepper_business_id
    FROM provider_business_map
    WHERE provider_name = 'facebook'
    AND hotstepper_business_id IS NOT NULL
) x

If hotstepper_business_id can not be null, you may remove the line

AND hotstepper_business_id IS NOT NULL
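
If you're not sure whether the column allows NULLs, one quick way to check is to look it up in information_schema (this assumes the table lives in the public schema):

-- does hotstepper_business_id allow NULLs?
SELECT is_nullable
FROM information_schema.columns
WHERE table_schema = 'public'
  AND table_name = 'provider_business_map'
  AND column_name = 'hotstepper_business_id';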

If you want the whole business row, you could simply wrap the above query in an IN (...):

SELECT * FROM businesses
WHERE ID IN (
    -- above inner query
)
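
Spelled out in full (the union query from above simply pasted in as the subquery), that would be something like:

SELECT * FROM businesses
WHERE id IN (
    -- same union as before, now used as the IN list
    SELECT id
    FROM businesses
    WHERE source = 'facebook'
    UNION
    SELECT hotstepper_business_id
    FROM provider_business_map
    WHERE provider_name = 'facebook'
    AND hotstepper_business_id IS NOT NULL
);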

But a much better performing query would be to modify the above query to use a join:

SELECT *
FROM businesses 
WHERE source = 'facebook'
UNION
SELECT b.*
FROM provider_business_map m
JOIN businesses b
  ON b.id = m.hotstepper_business_id
WHERE provider_name = 'facebook'
 
This makes a lot of sense, because I was getting great performance doing this in code (pull both sets into an array and dedup), but it seemed like something I should be solving at the DB level. Thank you! –  wachutu Jul 3 at 14:38
 
Would my query make better sense if I wasn't doing the count though? Or should I use your methodology and do SELECT * from businesses where businesses.id IN ( YOUR SUBQUERY ) –  wachutu Jul 3 at 14:42
 
Try to avoid WHERE id IN (subquery) - it doesn't perform well. See the edited answer for the fastest way to get all business rows –  Bohemian Jul 3 at 21:36

I'd at least try rewriting the dependent subquery as:

SELECT COUNT(DISTINCT b.*)
FROM businesses b
LEFT JOIN provider_business_map pbm
  ON b.id=pbm.hotstepper_business_id
WHERE b.source = 'facebook'
  OR pbm.provider_name = 'facebook';

Unless I'm mis-reading something, an index on businesses.id exists, but make sure there are also indexes on provider_business_map.hotstepper_business_id, businesses.source and provider_business_map.provider_name for best performance.
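
Roughly something like this, if they don't already exist (the index names here are just placeholders):

-- one single-column index per column used in the join/filter
CREATE INDEX idx_pbm_hotstepper_business_id ON provider_business_map (hotstepper_business_id);
CREATE INDEX idx_pbm_provider_name ON provider_business_map (provider_name);
CREATE INDEX idx_businesses_source ON businesses (source);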

 
Wouldn't a multicolumn index on pbm.provider_name and pbm.hotstepper_business_id have the best performance? I have idx_provider_hotstepper_business btree (provider_name, hotstepper_business_id) –  wachutu Jul 3 at 14:33
 
@wachutu Only hotstepper_business_id is always used, so you'll probably want to reverse that to btree (hotstepper_business_id, provider_name) for this query to run fast. –  Joachim Isaksson Jul 3 at 14:37
 
Not following you. The cardinality is much lower on provider_name so shouldn't I have the multicolumn index limit by that first? Would Postgres not limit the provider_business_map set that it checks first by the provider_name? It seems like that would be the best way to limit the number of rows being checked on provider_business_map –  wachutu Jul 3 at 14:58
 
Hm, yes, I mis-read the schema from the query when commenting, placing source in pbm which is quite different. You're probably correct. –  Joachim Isaksson Jul 3 at 15:06

CREATE INDEX index_name ON businesses(source);

Since only 3,693 rows match out of more than 7 million, the planner will probably use the index. Do not forget to run

ANALYZE businesses;
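
Then you can check that the planner actually picks the new index with something like this (it only exercises the source = 'facebook' branch of the original query):

-- should show an index or bitmap scan on businesses instead of a seq scan
EXPLAIN ANALYZE
SELECT count(*)
FROM businesses
WHERE source = 'facebook';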
