I have a `vendor_prices` table with approximately 17,000,000 records.
```sql
CREATE TABLE vendor_prices (
    id              serial PRIMARY KEY,
    vendor          integer NOT NULL,
    sku             character varying(25) NOT NULL,
    category_name   character varying(100) NOT NULL,
    price           numeric(8,5) NOT NULL,
    effective_date  timestamp without time zone,
    expiration_date timestamp without time zone DEFAULT (now() + '1 year'::interval)
);
```
Most of the records are redundant, so I keep a separate table with only the relevant records. I remove the duplicate records using this answer on dba.stackexchange.
Here is the core of the query (using `SELECT` instead of `DELETE`):
```sql
SELECT *
FROM  (
   SELECT id,
          price = lag(price) OVER w
          AND (lead(id) OVER w) IS NOT NULL AS del
   FROM   vendor_prices
   WHERE  vendor = 516
   WINDOW w AS (PARTITION BY sku ORDER BY effective_date, id)
   ) d
WHERE NOT d.del;
```
The problem I'm having is that it takes way too long to execute, especially for vendors with lots of records. The vendor in this query (`vendor = 516`) has some 3M rows, of which only around 80K are not redundant. What can I do to improve this query?
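For context, the `DELETE` form would look roughly like this (a sketch based on the `SELECT` above; the linked answer's exact wording may differ):

```sql
-- Sketch: delete the rows flagged redundant (del = true) instead of
-- selecting the rows to keep. Assumes the same window definition as above.
DELETE FROM vendor_prices
WHERE  id IN (
   SELECT id
   FROM  (
      SELECT id,
             price = lag(price) OVER w
             AND (lead(id) OVER w) IS NOT NULL AS del
      FROM   vendor_prices
      WHERE  vendor = 516
      WINDOW w AS (PARTITION BY sku ORDER BY effective_date, id)
      ) d
   WHERE  d.del
);
```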
Here are the results of `EXPLAIN ANALYZE`:
```
Aggregate  (cost=987648.74..987648.75 rows=1 width=0) (actual time=38220.825..38220.825 rows=1 loops=1)
  ->  Subquery Scan on d  (cost=862040.12..983596.85 rows=1620756 width=0) (actual time=31758.342..38211.262 rows=84245 loops=1)
        Filter: (NOT d.del)
        Rows Removed by Filter: 3094780
        ->  WindowAgg  (cost=862040.12..951181.72 rows=3241513 width=25) (actual time=31758.220..37929.024 rows=3179025 loops=1)
              ->  Sort  (cost=862040.12..870143.90 rows=3241513 width=25) (actual time=31758.196..34952.249 rows=3179025 loops=1)
                    Sort Key: vendor_prices.sku, vendor_prices.effective_date, vendor_prices.id
                    Sort Method: external merge  Disk: 123448kB
                    ->  Bitmap Heap Scan on vendor_prices  (cost=60790.16..356386.08 rows=3241513 width=25) (actual time=350.911..1512.974 rows=3179025 loops=1)
                          Recheck Cond: (vendor = 516)
                          Heap Blocks: exact=47546
                          ->  Bitmap Index Scan on idx_vendor_number  (cost=0.00..59979.79 rows=3241513 width=0) (actual time=336.936..336.936 rows=3179025 loops=1)
                                Index Cond: (vendor = 516)
```
P.S. I have a multicolumn index as suggested by @Erwin in his answer:

> A multicolumn index on `(vendor, sku, effective_date, id)` would be perfect for this - in this particular order.
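For reference, that index would have been created with a statement along these lines (a sketch; the index name `idx_vendor_sku_date_id` is my assumption, not taken from the question):

```sql
-- Hypothetical DDL for the suggested multicolumn index,
-- columns in the order recommended in the quoted answer.
CREATE INDEX idx_vendor_sku_date_id
    ON vendor_prices (vendor, sku, effective_date, id);
```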
but it's using `idx_vendor_number` (which is only on the `vendor` column), as you can see in the `EXPLAIN ANALYZE` output.
`external merge Disk: 123448kB` is your biggest problem. Try to increase `work_mem` until this is done in memory. – a_horse_with_no_name Jul 14 '16 at 20:07
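As a sketch of that suggestion, `work_mem` can be raised for the current session only before rerunning the query. The `256MB` value is an illustrative assumption, chosen simply to exceed the ~123 MB spilled to disk in the plan above, not a recommendation:

```sql
-- Session-level setting only; does not change the server default.
SET work_mem = '256MB';

-- ...rerun the EXPLAIN ANALYZE / query here, then restore the default:
RESET work_mem;
```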