I have a products table with approximately 17,000,000 records.

CREATE TABLE vendor_prices (
  id serial PRIMARY KEY,
  vendor integer NOT NULL,
  sku character varying(25) NOT NULL,
  category_name character varying(100) NOT NULL,
  price numeric(8,5) NOT NULL,
  effective_date timestamp without time zone,
  expiration_date timestamp without time zone DEFAULT (now() + '1 year'::interval)
);

Most of the records are redundant, so I keep a separate table holding only the relevant ones. I remove the duplicate records using this answer on dba.stackexchange.

Here is the core of the query (using select instead of delete):

SELECT *
FROM (
    SELECT id,
           -- mark a row for deletion when its price matches the previous
           -- row's price for the same sku and a later row for that sku exists
           price = lag(price) OVER w
           AND (lead(id) OVER w) IS NOT NULL AS del
    FROM vendor_prices
    WHERE vendor = 516
    WINDOW w AS (PARTITION BY sku ORDER BY effective_date, id)
) d
WHERE NOT d.del
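
For context, the DELETE form from the linked answer looks roughly like this; a sketch only, assuming the same window definition as above (I run the SELECT variant to preview which rows would survive):

DELETE FROM vendor_prices v
USING (
    SELECT id,
           price = lag(price) OVER w
           AND (lead(id) OVER w) IS NOT NULL AS del
    FROM vendor_prices
    WHERE vendor = 516
    WINDOW w AS (PARTITION BY sku ORDER BY effective_date, id)
) d
WHERE v.id = d.id
  AND d.del;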

The problem I'm having, though, is that the query takes far too long to execute, especially for vendors with lots of records. This particular vendor (WHERE vendor = 516) has some 3M rows, of which only around 80K are not redundant. What can I do to improve this query?

Here are the results of EXPLAIN ANALYZE:

Aggregate  (cost=987648.74..987648.75 rows=1 width=0) (actual time=38220.825..38220.825 rows=1 loops=1)
  ->  Subquery Scan on d  (cost=862040.12..983596.85 rows=1620756 width=0) (actual time=31758.342..38211.262 rows=84245 loops=1)
        Filter: (NOT d.del)
        Rows Removed by Filter: 3094780
        ->  WindowAgg  (cost=862040.12..951181.72 rows=3241513 width=25) (actual time=31758.220..37929.024 rows=3179025 loops=1)
              ->  Sort  (cost=862040.12..870143.90 rows=3241513 width=25) (actual time=31758.196..34952.249 rows=3179025 loops=1)
                    Sort Key: vendor_prices.sku, vendor_prices.effective_date, vendor_prices.id
                    Sort Method: external merge  Disk: 123448kB
                    ->  Bitmap Heap Scan on vendor_prices  (cost=60790.16..356386.08 rows=3241513 width=25) (actual time=350.911..1512.974 rows=3179025 loops=1)
                          Recheck Cond: (vendor = 516)
                          Heap Blocks: exact=47546
                          ->  Bitmap Index Scan on idx_vendor_number  (cost=0.00..59979.79 rows=3241513 width=0) (actual time=336.936..336.936 rows=3179025 loops=1)
                                Index Cond: (vendor = 516)

P.S. I have the multicolumn index suggested by @Erwin in his answer:

  • A [multicolumn index] on (vendor, sku, effective_date, id) would be perfect for this - in this particular order.

but the planner is using idx_vendor_number, which covers only the vendor column, as you can see in the EXPLAIN ANALYZE output.
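
The DDL for that index is essentially this (the index name is my own):

CREATE INDEX idx_vendor_sku_date_id
    ON vendor_prices (vendor, sku, effective_date, id);  -- multicolumn index per @Erwin's suggestion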

external merge Disk: 123448kB is your biggest problem. Try to increase work_mem until this is done in memory. – a_horse_with_no_name Jul 14 '16 at 20:07

@a_horse_with_no_name: Thank you – increasing work_mem to 512MB shaved a dozen or so seconds off the total by doing the merge in memory. It helps, but what we really need is at least an order-of-magnitude improvement. – Mosaic Jul 14 '16 at 20:39
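
For reference, the setting change tested in the comment above is a one-liner (512MB is simply the value tried, not a tuned figure):

SET work_mem = '512MB';  -- let the window sort run in memory rather than spilling to disk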
