How to remove duplicates in combination of three columns in python with postgresql

Question

I would like to remove the duplicate data only if three columns (name, price and new price) matching with the same data. But in an other python script.

So the data can insert in to the database, but with an other python script, I want to delete this duplicate data by a cron job.

So in this case:

cur.execute("INSERT INTO cars VALUES(8,'Hummer',41400, 49747)")

cur.execute("INSERT INTO cars VALUES(9,'Volkswagen',21600, 36456)")

are duplicates. Example script with inserted data:

import psycopg2
import sys

con = None

try:
    con = psycopg2.connect(database='testdb', user='janbodnar')    
    cur = con.cursor()

    cur.execute("CREATE TABLE cars(id INT PRIMARY KEY, name VARCHAR(20), price INT, new price INT)")
    cur.execute("INSERT INTO cars VALUES(1,'Audi',52642, 98484)")
    cur.execute("INSERT INTO cars VALUES(2,'Mercedes',57127, 874897)")
    cur.execute("INSERT INTO cars VALUES(3,'Skoda',9000, 439788)")
    cur.execute("INSERT INTO cars VALUES(4,'Volvo',29000, 743878)")
    cur.execute("INSERT INTO cars VALUES(5,'Bentley',350000, 434684)")
    cur.execute("INSERT INTO cars VALUES(6,'Citroen',21000, 43874)")
    cur.execute("INSERT INTO cars VALUES(7,'Hummer',41400, 49747)")
    cur.execute("INSERT INTO cars VALUES(8,'Hummer',41400, 49747)")
    cur.execute("INSERT INTO cars VALUES(9,'Volkswagen',21600, 36456)")
    cur.execute("INSERT INTO cars VALUES(10,'Volkswagen',21600, 36456)")

    con.commit()

except psycopg2.DatabaseError, e:
    if con:
        con.rollback()

    print 'Error %s' % e    
    sys.exit(1

finally:    
    if con:
        con.close()

To clarify: You want to identify rows that are duplicates, and remove the duplicated data? — Martijn Pieters, Sep 2, 2012 at 13:54
What exactly do you mean with "tree tabels matching with the same data"? Do you actually mean "three columns matching the same data"? — a_horse_with_no_name, Sep 2, 2012 at 14:11
@a_horse_with_no_name: I've read it as such, certainly the provided example inserts show only columns. Edited the question to improve grammar, spelling, and that mistake. — Martijn Pieters, Sep 2, 2012 at 14:22

Erwin Brandstetter · Accepted Answer · 2012-09-02 14:34:32Z

3

You can do this in one statement without additional round-trips to the server.

DELETE FROM cars
USING (
    SELECT id, row_number() OVER (PARTITION BY name, price, new_price
                                  ORDER BY id) AS rn
    FROM   cars
    ) x
WHERE cars.id = x.id
AND   x.rn > 1;

Requires PostgreSQL 8.4 or later for the window function row_number().
Out of a set of dupes the smallest id survives.
Note that I changed "new price" to new_price.

Or use the EXISTS semi-join, that @wildplasser posted as comment to the same effect.

Or, to by special request of CTE-devotee @wildplasser, with a CTE instead of the subquery ... :)

WITH x AS (
    SELECT id, row_number() OVER (PARTITION BY name, price, new_price
                                  ORDER BY id) AS rn
    FROM   cars
    )
DELETE FROM cars
USING  x
WHERE  cars.id = x.id
AND    x.rn > 1;

Data modifying CTE requires Postgres 9.1 or later.
This form will perform about the same as the one with the subquery.

edited Sep 2, 2012 at 14:34

answered Sep 2, 2012 at 14:26

Erwin Brandstetter

601k145 gold badges1068 silver badges1224 bronze badges

@wildplasser: There you go, your daily fix. :)
– Erwin Brandstetter
Sep 2, 2012 at 14:34
I'll happily admit to not knowing about partitioning, the existence of row_number() in PostgreSQL or CTEs. :-) I'll be off reading up on those now..
– Martijn Pieters ♦
Sep 2, 2012 at 14:36
@MartijnPieters: Those are very cool (more or less new) features and I hope my links to the manual are of service. BTW, the PARTITION clause is just part of the window function syntax. Partitioning would refer to a wholly different area of database design.
– Erwin Brandstetter
Sep 2, 2012 at 14:37
@ErwinBrandstetter: the links are indeed useful, much reading to do.
– Martijn Pieters ♦
Sep 2, 2012 at 14:38
Cool, so a CTE is basically a query definition stored in a variable (oversimplified), and a OVER (PARTITION BY ...) creates a window similarly to a GROUP BY but with access to the individual rows in the group. Nifty indeed, added to my bag of tricks!
– Martijn Pieters ♦
Sep 2, 2012 at 14:49

| Show 1 more comment

Martijn Pieters · Accepted Answer · 2012-09-02 16:21:34Z

Use a GROUP BY SQL statement to identify the rows, together with the initial primary key:

duplicate_query = '''\
SELECT MIN(id), "name", price, "new price"
FROM cars
GROUP BY "name", price, "new price"
HAVING COUNT(ID) > 1
'''

The above query selects the lowest primary key id for each group of (name, price, "new price") rows where there is more than one primary key id. For your sample data, this will return:

7, 'Hummer', 41400, 49747
9, 'Volkswagen', 21600, 36456

You can then use the returned data to delete the duplicates:

delete_dupes = '''
DELETE
FROM cars
WHERE 
    "name"=%(name)s AND price=%(price)s AND "new price"=%(newprice)s AND
    id > %(id)s
'''

cur.execute(duplicate_query)
dupes = cur.fetchall()
cur.executemany(delete_dupes, [
    dict(name=r[1], price=r[2], newprice=r[3], id=r[0])
    for r in dupes])

Note that we delete any row where the primary key id is larger than the first id with the same 3 columns. For the first dupe, only the row with id 8 will match, for the second dupe the row with id 10 matches.

This does do a separate delete for each dupe found. You can combine this into one statement with a WHERE EXISTS sub-select query:

delete_dupes = '''\
DELETE FROM cars cdel
WHERE EXISTS (
    SELECT *
    FROM cars cex
    WHERE 
        cex."name" = cdel."name" AND 
        cex.price = cdel.price AND
        cex."new price" = cdel."new price" AND
        cex.id > cdel.id
)
'''

cur.execute(delete_dupes)

This instructs PostgreSQL to delete any row for which there are other rows with the same name, price and new price but with a primary key that is higher than the current row.

Though this might be correct python-wise it is terrible SQL-wise: you first construct a list of unique entries, transfer it to the client (python) side, which transforms that into a dict(or array, dunno), which in turn uses exactly the same list to perform the actual deletes. It can all be done in one SQL statement: DELETE FROM cars dd WHERE EXISTS (SELECT * FROM cars ex WHERE ex.name = dd.name ...AND more mathing fields... AND ex.id < dd.id); — wildplasser, Sep 2, 2012 at 14:25
@wildplasser: Always better, I'll update. :-) Or, just upvote the competing answer.. — Martijn Pieters, Sep 2, 2012 at 14:26
Please note that both "inner" and "outer" are reserved words in SQL. Most people stick with one- or two-letter correlation names anyway. (I often use "ex" for the exists() subquery (or "nx" for a NOT EXISTS subquery), and in this case "dd" for the DELETE main query). BTW: "name" is a non-reserved word in Postgres (which still means it is special to the parser. Better avoid it, too. — wildplasser, Sep 2, 2012 at 16:17
@wildplasser: Thanks for the pointers; improved my answer. You can probably tell I don't usually deal with SQL much; outside my domain usually. — Martijn Pieters, Sep 2, 2012 at 16:22

Collectives™ on Stack Overflow

How to remove duplicates in combination of three columns in python with postgresql

2 Answers 2

Your Answer

Not the answer you're looking for? Browse other questions tagged
python
database
postgresql
or ask your own question.

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged pythondatabasepostgresql or ask your own question.

Related

Not the answer you're looking for? Browse other questions tagged
python
database
postgresql
or ask your own question.