
So, I have a log table with about 8M records. Because of a programming error, there can be more than one record per company for the same date. What I need now is to delete all records from this log for each company and date except the latest one (the one with the max id). The number of records to be deleted is approximately 300K.

The fastest and easiest thing that I tried is this:

delete from indexing_log
where id not in (
    select max(id)
    from indexing_log
    group by company_id, "date"
);

But this query takes an enormous amount of time (about 3 days) on the production server (which for some reason doesn't have an SSD drive). I have tried every way I know and need some advice. How can it be made faster?

UPDATE: I decided to do it in buckets through a Celery task.
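For reference, a rough sketch of what one bucket can look like in plain SQL (the delete ... using form and the company_id bounds are illustrative assumptions; the real version iterates over ranges from a Celery task):

-- Hypothetical single bucket: dedupe one slice of company ids so each
-- statement touches a bounded part of the 8M-row table.
delete from indexing_log l
using indexing_log i
where i.company_id = l.company_id
  and i."date" = l."date"
  and i.id > l.id
  and l.company_id between 1 and 1000;  -- advance these bounds per task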

  • Add an explain plan. Can you define enormous time? Commented Aug 14, 2013 at 11:33
  • In addition to providing the explain plan what version of PostgreSQL are you on? Commented Aug 14, 2013 at 12:53

4 Answers


You can try:

delete from indexing_log as l
where
    exists
    (
        select *
        from indexing_log as i
        where i.id > l.id
          and i.company_id = l.company_id
          and i."date" = l."date"
    );
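If one doesn't exist already, a composite index should let the correlated exists probe by index instead of scanning the table for every row (the index name here is illustrative):

-- Assumed missing; covers exactly the columns the subquery correlates on.
create index indexing_log_company_date_id
    on indexing_log (company_id, "date", id);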



Dump the distinct rows to a temporary table:

create temporary table t as
select distinct on (company_id, "date") *
from indexing_log
order by company_id, "date", id desc;

Truncate the original:

truncate table indexing_log;

Since the table is now empty, use the opportunity to do an instantaneous vacuum:

vacuum full indexing_log;

Move the rows from the temporary table back to the original:

insert into indexing_log
select *
from t;
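If other sessions read or write the table in the meantime, a safer sketch runs the same steps inside one transaction so readers never see an empty table (vacuum full cannot run inside a transaction block, so it is omitted here; truncate still takes an exclusive lock):

begin;

create temporary table t as
select distinct on (company_id, "date") *
from indexing_log
order by company_id, "date", id desc;

truncate table indexing_log;

insert into indexing_log
select * from t;

commit;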

5 Comments

distinct on (id ...) is pointless if id is unique. Maybe distinct on (company_id, "date"). Also order by company_id, "date", id desc - otherwise it won't work
Don't you mean select max(id), company_id, date from ... group by company_id, date? Otherwise I have to read about distinct on
@Roman Yes, you and Igor are right. I corrected it. In this case the distinct on has the great advantage of returning all columns of the row.
OK, it seems to be fast. Now I'll try it on production.
The insert operation is very heavy, because the number of rows that remain is much bigger than the number deleted.

Truncate Table should be much quicker, but there you cannot say "delete everything except...". If it is possible with your data, you could write a procedure for that: save your max ids into a temp table, truncate the table, and write your temp table back. For PostgreSQL the syntax is slightly different (http://www.postgresql.org/docs/9.1/static/sql-selectinto.html):

SELECT *
INTO TEMPORARY temptable
FROM indexing_log
WHERE id IN (
    SELECT max(id)
    FROM indexing_log
    GROUP BY company_id, "date"
);



NOT EXISTS is sometimes faster than NOT IN:

delete from indexing_log
where not exists (
    select 1
    from (select max(id) as iid
          from indexing_log
          group by company_id, "date") mids
    where indexing_log.id = mids.iid
);
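Before committing to another multi-day run, it is cheap to inspect the plan first (explain without analyze does not execute the delete; plans vary by PostgreSQL version and table statistics, so this is a way to look, not a guarantee):

explain
delete from indexing_log
where not exists (
    select 1
    from (select max(id) as iid
          from indexing_log
          group by company_id, "date") mids
    where indexing_log.id = mids.iid
);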

6 Comments

Apart from a slight syntax error (select **something** from), this will delete nothing, as there will always exist an id.
It appeared to be even slower =(
@free2use now check the solution.
The MAX() in the subquery is not needed (except in mysql;-). See @Roman Pekar's solution.
"NOT EXISTS is always faster than NOT IN". That absolute statement is not accurate. Run enough benchmarks and you will see that your assertion is inaccurate.
