MySQLdb Python prevent duplicate and optimize multiple inserts

Question

I wrote this Python script to import a specific XLS file into MySQL. It works fine, but if it's run twice on the same data it creates duplicate entries. I'm pretty sure I need to use a MySQL JOIN, but I'm not clear on how to do that. Also, is executemany() going to have the same overhead as doing inserts in a loop? I'm obviously trying to avoid that.
Here's the code in question...

import MySQLdb

mailing_list = {}
for row in range(sheet.nrows):
    # name is in column index 0, email is in column index 4
    name = sheet.cell(row, 0).value
    email = sheet.cell(row, 4).value
    if name and email:
        mailing_list[name.lstrip()] = email.strip()

# Sorting the items yields a list of (name, email) tuples ordered by name.
rows = sorted(mailing_list.items())

db = MySQLdb.connect(host=host, user=user, db=dbname, passwd=pwd)
cursor = db.cursor()
cursor.executemany(
    "INSERT IGNORE INTO mailing_list (name, email) VALUES (%s, %s)", rows)
# MySQLdb does not autocommit by default; transactional tables need this.
db.commit()

CLARIFICATION...

I read here that...

To be sure, executemany() is effectively the same as simple iteration. However, it is typically faster. It provides an optimized means of affecting INSERT and REPLACE across multiple rows.

Also, I took Unode's suggestion and used the UNIQUE constraint. But the IGNORE keyword is better for my case than ON DUPLICATE KEY UPDATE because I want it to fail silently.

TL;DR

~~1. What's the best way to prevent duplicate inserts?~~
ANSWER 1: A UNIQUE constraint on the column, combined with INSERT IGNORE to skip duplicates silently, or with ON DUPLICATE KEY UPDATE to update the existing row instead.

Is executemany() as expensive as INSERT in a loop?
@Unode says it's not but my research tells me otherwise. I would like a definitive answer.
Is this the best way, or is it going to be really slow with bigger tables? How would I test to be sure?

1) Add a unique constraint on the column where you want to avoid duplicates.
– yedpodtrzitko, Commented Oct 23, 2012 at 12:13

unode · Accepted Answer · 2012-10-23 12:30:12Z

1 - What's the best way to prevent duplicate inserts?

Depending on what "preventing" means in your case, you have two strategies and one requirement.

The requirement is that you add a UNIQUE constraint on the column or columns that you want to be unique. This alone will cause an error if insertion of a duplicate entry is attempted. However, given that you are using executemany, the outcome may not be what you would expect.

Then as strategies you can do:

An initial filter step that runs a SELECT statement before inserting. This means running one SELECT statement per item in your rows to check whether it already exists. This strategy works but is inefficient.
Using ON DUPLICATE KEY UPDATE. This automatically triggers an update if the data already exists. For more information, refer to the official documentation.
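
As a rough sketch of the options discussed here, the snippet below keeps the SQL as Python strings and passes it to any DB-API cursor. The table and column names come from the question. The choice of `name` as the unique column, the index name `uniq_name`, and the helper `insert_rows` are assumptions made purely for illustration.

```python
# One-time schema change (assumption: `name` is the column that must be unique).
ADD_UNIQUE = "ALTER TABLE mailing_list ADD UNIQUE KEY uniq_name (name)"

# Option A: silently skip rows that would violate the UNIQUE constraint.
INSERT_IGNORE = (
    "INSERT IGNORE INTO mailing_list (name, email) VALUES (%s, %s)"
)

# Option B: update the email of the existing row instead of skipping it.
INSERT_OR_UPDATE = (
    "INSERT INTO mailing_list (name, email) VALUES (%s, %s) "
    "ON DUPLICATE KEY UPDATE email = VALUES(email)"
)


def insert_rows(cursor, rows, update_existing=False):
    """Hypothetical helper: insert (name, email) tuples with one executemany call.

    Returns the SQL statement that was used.
    """
    sql = INSERT_OR_UPDATE if update_existing else INSERT_IGNORE
    cursor.executemany(sql, rows)
    return sql
```

With a real connection this would be called as `insert_rows(db.cursor(), rows)` followed by `db.commit()`. Option A matches the "fail silently" behaviour; Option B is only useful if a re-import should refresh existing rows.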

2 - Is executemany() as expensive as INSERT in a loop?

No. executemany creates one query that inserts in bulk, while a for loop creates as many queries as there are elements in your rows.
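
To make the difference concrete, here is a minimal sketch of what "one query that inserts in bulk" means: a single multi-row INSERT with one placeholder group per row. This only illustrates the idea and is not MySQLdb's internal code; `build_bulk_insert` and the sample rows are hypothetical.

```python
def build_bulk_insert(rows):
    """Return (sql, params) for a single multi-row INSERT IGNORE statement."""
    if not rows:
        raise ValueError("rows must not be empty")
    # One "(%s, %s)" group per row, all inside a single VALUES clause.
    placeholders = ", ".join(["(%s, %s)"] * len(rows))
    sql = "INSERT IGNORE INTO mailing_list (name, email) VALUES " + placeholders
    # Flatten [(n1, e1), (n2, e2)] into [n1, e1, n2, e2] to match the placeholders.
    params = [value for row in rows for value in row]
    return sql, params


rows = [("Ann", "ann@example.com"), ("Bob", "bob@example.com")]
sql, params = build_bulk_insert(rows)
# A loop would send len(rows) separate statements; this is a single statement.
```

The result could be sent with one `cursor.execute(sql, params)` call, which costs one round trip to the server regardless of the number of rows.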

I updated my post. I'm not sure if you're correct about executemany().
– noel, Commented Oct 23, 2012 at 13:27

@shakabra keep reading that document you linked about executemany. It explains that only one query/request is sent to the server, with all the INSERT statements merged into it. Depending on the size of the data you pass to executemany, the server may reject the request because it exceeds max_allowed_packet.
– unode, Commented Oct 23, 2012 at 14:40

You're right, I should have kept reading. I'm taking a look at the source right now. I still don't know if there is a better way, though, especially in regard to limiting the packet size, although 16 MB is a lot.
– noel, Commented Oct 23, 2012 at 19:08

executemany is a trade-off: one big query vs. lots of tiny ones. If the big query is too big, you can simply break your rows into slices. I personally use slices of size 1000 or 10000, depending on the data types.
– unode, Commented Oct 24, 2012 at 10:55
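
The slicing idea from the last comment could look roughly like the sketch below. It calls executemany once per slice so that no single request grows past max_allowed_packet. The helper names `chunked` and `insert_in_chunks` are hypothetical, and the default of 1000 simply mirrors the slice size mentioned in the comment.

```python
def chunked(rows, size):
    """Yield consecutive slices of `rows` with at most `size` items each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def insert_in_chunks(cursor, sql, rows, size=1000):
    """Hypothetical helper: one executemany call per slice.

    Returns the number of executemany calls that were made.
    """
    calls = 0
    for chunk in chunked(rows, size):
        cursor.executemany(sql, chunk)
        calls += 1
    return calls
```

A suitable slice size depends on how wide the rows are, so it is worth trying a few values against the real data.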
