Tell me more ×
Stack Overflow is a question and answer site for professional and enthusiast programmers. It's 100% free, no registration required.

I have a table with the following fields:

id (Unique)
url (Unique)
title
company
site_id

Now, I need to remove rows having same title, company and site_id. One way to do it will be using the following SQL along with a script (PHP):

SELECT title, site_id, location, id, count( * ) 
FROM jobs
GROUP BY site_id, company, title, location
HAVING count( * ) >1

After running this query, I can remove duplicates using a server side script. But, I want to know if this can be done only using SQL query.

share|improve this question
Quick question: do always want duplicate (title, company, site_id) to not exist? If so, I'd set up a constraint in the database to enforce title, company, and site_id to be unique. Which would mean you wouldn't need a cleanup process. And it only takes a single line of SQL. – J. Polfer Jul 22 '10 at 18:25
Please refer this link of stackoverflow.It worked for me as a charm. – Mohsin Khan Nov 30 '12 at 12:11

5 Answers

up vote 44 down vote accepted

A really easy way to do this is to add a UNIQUE index on the 3 columns. When you write the ALTER statement, include the IGNORE keyword. Like so:

ALTER IGNORE TABLE jobs ADD UNIQUE INDEX idx_name (site_id, title, company );

This will drop all the duplicate rows. As an added benefit, future INSERTs that are duplicates will error out. As always, you may want to take a backup before running something like this...

share|improve this answer
1  
Interesting, but the assumptions the IGNORE clause makes for removing those duplicates is a concern that might not match needs. Incorrect values being truncated to the closest acceptable match sound good to you? – OMG Ponies Jul 22 '10 at 18:32
In this particular case, that's definitely true. The collation of the title and company columns definitely matter. What, exactly, does incorrect values mean? I smell another question... – Chris Henry Jul 22 '10 at 19:08
this did the job, thanks a lot! – Chetan Jul 22 '10 at 19:26
2  
Just for the record if your using InnoDB then you may have an issue with it, there is a known bug about using ALTER IGNORE TABLE with InnoDB databases. – DarkMantis Jan 7 at 16:57
1  
The aforementioned bug @DarkMantis referred to and it's solution. – Jordan Arseno Jan 23 at 20:47
show 5 more comments

MySQL has restrictions about referring to the table you are deleting from. You can work around that with a temporary table, like:

create temporary table tmpTable (id int);

insert  tmpTable
        (id)
select  id
from    YourTable yt
where   exists
        (
        select  *
        from    YourTabe yt2
        where   yt2.title = yt.title
                and yt2.company = yt.company
                and yt2.site_id = yt.site_id
                and yt2.id > yt.id
        );

delete  
from    YourTable
where   ID in (select id from tmpTable);
share|improve this answer
7  
+1: Your MySQL-fu is better than mine – OMG Ponies Jul 22 '10 at 18:35
@andomar, this works fine except when one of the fields in the where clause contain nulls. Example: sqlfiddle.com/#!2/983f3/1 – a coder Sep 12 '12 at 15:21
1  
this answer is should be the accepted one – Evan Feb 28 at 16:07
Is the Insert SQL an expensive one? I'm wondering because it times out in my MySQL database. – Cassio Jun 17 at 22:06

If IGNORE statement won't work like in my case, you can use:

CREATE TABLE your_table_deduped like your_table;
INSERT your_table_deduped SELECT * FROM your_table GROUP BY index1_id, index2_id;
RENAME TABLE your_table TO your_table_with_dupes;
RENAME TABLE your_table_deduped TO your_table;
#OPTIONAL
ALTER TABLE `your_table` ADD UNIQUE `unique_index` (`index1_id`, `index2_id`);
#OPTIONAL
DROP TABLE your_table_with_dupes;
share|improve this answer

I have this query snipet for SQLServer but I think It can be used in others DBMS with little changes:

DELETE
FROM Table
WHERE Table.idTable IN  (  
    SELECT MAX(idTable)
    FROM idTable
    GROUP BY field1, field2, field3
    HAVING COUNT(*) > 1)

I forgot to tell you that this query doesn't remove the row with the lowest id of the duplicated rows. If this works for you try this query:

DELETE
FROM jobs
WHERE jobs.id IN  (  
    SELECT MAX(id)
    FROM jobs
    GROUP BY site_id, company, title, location
    HAVING COUNT(*) > 1)
share|improve this answer
   
That won't work if there's more than two duplicates of a group. – OMG Ponies Jul 22 '10 at 18:23
4  
Unfortunately, MySQL does not allow you to select from the table you are deleting from ERROR 1093: You can't specify target table 'Table' for update in FROM clause – Andomar Jul 22 '10 at 18:29
OMG Ponies, I know that, this is just a snipet that I use sometimes and seemed to fit the question, thats why I said that It needed to be changed. Thanks for the comment. Andomar, I didn't know that. Thanks to you too. – eiefai Jul 22 '10 at 18:43

I like to be a bit more specific as to which records I delete so here is my solution:

delete
from jobs c1
where not c1.location = 'Paris'
and  c1.site_id > 64218
and exists 
(  
select * from jobs c2 
where c2.site_id = c1.site_id
and   c2.company = c1.company
and   c2.location = c1.location
and   c2.title = c1.title
and   c2.site_id > 63412
and   c2.site_id < 64219
)
share|improve this answer

Your Answer

 
discard

By posting your answer, you agree to the privacy policy and terms of service.

Not the answer you're looking for? Browse other questions tagged or ask your own question.