
So, I have a log table with about 8M records. Because of a programming error, there can be more than one record per company for the same date. What I need now is to delete all records from this log for each company and date except the latest one (the one with the max id). The number of records to be deleted is approximately 300K.

The fastest and easiest thing that I tried is this:

delete from indexing_log
where id not in (
    select max(id)
    from indexing_log
    group by company_id, "date"
);

But this query takes an enormous amount of time (about 3 days) on the production server (which for some reason doesn't have an SSD drive). I have tried every way I know and need some advice. How can it be made faster?

UPDATE: I decided to do it in buckets through a Celery task.
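For reference, a rough sketch of what one bucket can look like in plain SQL (the delete ... using form and the company_id bounds are illustrative assumptions; the real version iterates over ranges from a Celery task):

-- Hypothetical single bucket: dedupe one slice of company ids so each
-- statement touches a bounded part of the 8M-row table.
delete from indexing_log l
using indexing_log i
where i.company_id = l.company_id
  and i."date" = l."date"
  and i.id > l.id
  and l.company_id between 1 and 1000;  -- advance these bounds per task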

  • Add an explain plan. Can you define enormous time? Commented Aug 14, 2013 at 11:33
  • In addition to providing the explain plan what version of PostgreSQL are you on? Commented Aug 14, 2013 at 12:53

4 Answers


You can try:

delete from indexing_log as l
where
    exists
    (
        select *
        from indexing_log as i
        where i.id > l.id
          and i.company_id = l.company_id
          and i."date" = l."date"
    );
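If one doesn't exist already, a composite index should let the correlated exists probe by index instead of scanning the table for every row (the index name here is illustrative):

-- Assumed missing; covers exactly the columns the subquery correlates on.
create index indexing_log_company_date_id
    on indexing_log (company_id, "date", id);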



Dump the distinct rows to a temporary table:

create temporary table t as
select distinct on (company_id, "date") *
from indexing_log
order by company_id, "date", id desc;

Truncate the original:

truncate table indexing_log;

Since the table is now empty, use the opportunity to do an instantaneous vacuum:

vacuum full indexing_log;

Move the rows from the temporary table back to the original:

insert into indexing_log
select *
from t;
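If other sessions read or write the table in the meantime, a safer sketch runs the same steps inside one transaction so readers never see an empty table (vacuum full cannot run inside a transaction block, so it is omitted here; truncate still takes an exclusive lock):

begin;

create temporary table t as
select distinct on (company_id, "date") *
from indexing_log
order by company_id, "date", id desc;

truncate table indexing_log;

insert into indexing_log
select * from t;

commit;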

5 Comments

distinct on (id ...) is pointless if id is unique. Maybe distinct on (company_id, "date"). Also order by company_id, "date", id desc - otherwise it won't work
Don't you mean select max(id), company_id, date from ... group by company_id, date? Otherwise I have to read about distinct on
@Roman Yes, you and Igor are right. I corrected it. In this case the distinct on has the great advantage of returning all columns of the row.
OK, it seems to be fast. Now I'll try it on production.
The insert operation is very heavy, because the number of rows that remain is much bigger than the number deleted.

Truncate Table should be much quicker, but there you cannot say "delete everything except...". If it is possible with your data, you could write a procedure for that: save your max ids into a temp table, truncate the table, and write your temp table back. For PostgreSQL the syntax is slightly different (http://www.postgresql.org/docs/9.1/static/sql-selectinto.html):

SELECT *
INTO TEMPORARY temptable
FROM indexing_log
WHERE id IN (
    SELECT max(id)
    FROM indexing_log
    GROUP BY company_id, "date"
);



NOT EXISTS is sometimes faster than NOT IN:

delete from indexing_log
where not exists (
    select 1
    from (select max(id) as iid
          from indexing_log
          group by company_id, "date") mids
    where indexing_log.id = mids.iid
);
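Before committing to another multi-day run, it is cheap to inspect the plan first (explain without analyze does not execute the delete; plans vary by PostgreSQL version and table statistics, so this is a way to look, not a guarantee):

explain
delete from indexing_log
where not exists (
    select 1
    from (select max(id) as iid
          from indexing_log
          group by company_id, "date") mids
    where indexing_log.id = mids.iid
);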

6 Comments

Apart from a slight syntax error (select **something** from), this will delete nothing, as there will always exist an id.
It appeared to be even slower =(
@free2use now check the solution.
The MAX() in the subquery is not needed (except in mysql;-). See @Roman Pekar's solution.
"NOT EXISTS is always faster than NOT IN". That absolute statement is not accurate. Run enough benchmarks and you will see that your assertion is inaccurate.
