
So let's say I have csv files which I have no control over:

a.csv
b.csv
c.csv

and they all have different header names. I dumped all the data into tables a, b, and c. Now, if I get another a.csv with new values (same header fields), how can I insert only the rows from the new a.csv that aren't already in the old table?

For example: The a table has headers name and age:

'Bob'   25
'Mary'  50

And I get a new a.csv which parses out to:

'Bob'   25
'Susie' 60

How do I add only the lines that are unique to the current table (e.g. only add Susie, not Bob, to the a table)? I do not have a specific unique ID for each row, so I can't use a primary key. There are also multiple header fields, so if I try to use all of them as a composite primary key it fails with "Specified key was too long".

I need to check that the entire line is unique, and only then add it to the table. I tried INSERT IGNORE, but without a unique key I can't get it to work correctly. Any suggestions? I'll post any additional info if it would help.

Current attempt:

cursor.execute("ALTER TABLE temp ADD PRIMARY KEY" + uniqueline)
cursor.execute("INSERT IGNORE INTO " + tablename + " SELECT * FROM temp")

Where tablename is the name of the target table, temp is the table the CSV data was loaded into, and uniqueline is currently the first 5 fields, in the form (field1, field2, field3, field4, field5), or all the fields if there are fewer than 5.
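
Roughly, uniqueline is built like this (a sketch only; headers stands in for a hypothetical list of column names parsed from the CSV, which isn't shown above):

    # headers is hypothetical: the list of column names parsed from the CSV
    fields = headers[:5]  # at most the first five fields
    uniqueline = "(" + ", ".join(fields) + ")"
    cursor.execute("ALTER TABLE temp ADD PRIMARY KEY " + uniqueline)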

Thanks!

Edit:

cursor.execute("INSERT INTO " + tablename + " SELECT * FROM temp where " + uniqueline + " NOT IN (SELECT * FROM " + tablename + ")")

It works once (with empty tables), but if I run it again to test it, it essentially freezes and never finishes. Now I have these "phantom tables": if I try to drop one it says "unknown table", but if I try to create it, it says "table already exists". I also can't add or delete anything from the table without it freezing. I'm going to try giving it a unique index again. Thanks for all your help, though!

  • Is there a reason you can't use all the columns that need to be unique as your primary key? Commented Jun 17, 2013 at 16:11
  • The problem is that there are over 40 columns, so it says that "Primary Key is too long". Commented Jun 17, 2013 at 16:19

2 Answers


You may want to update your table to have a unique index, as in:

       ALTER IGNORE TABLE MyTable ADD UNIQUE INDEX idx_name (name, age);

Once done, it should automatically filter out duplicate rows on insertion. You may need to handle the resulting exceptions.

A workaround approach could be to drop the index before loading all the CSV files. Once the data is uploaded, re-apply the index to drop the duplicate records.
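
A minimal sketch of the whole flow in Python, assuming the MySQLdb driver (the traceback in the comments below mentions _mysql_exceptions) and hypothetical table/column names; the prefix length on the string column is one way to keep the combined key under the 1000-byte limit that comes up in the comments, and ALTER IGNORE only exists in MySQL 5.6 and earlier:

    # Sketch only: the driver and the table/column names (a, temp, name, age)
    # are assumptions, not from the answer. ALTER IGNORE is MySQL <= 5.6.
    import MySQLdb

    conn = MySQLdb.connect(host="localhost", user="user", passwd="pw", db="mydb")
    cursor = conn.cursor()

    # Deduplicate existing rows and reject future duplicates in one step.
    # The name(100) prefix keeps the index under the 1000-byte key limit.
    cursor.execute("ALTER IGNORE TABLE a ADD UNIQUE INDEX idx_all (name(100), age)")

    # With the unique index in place, INSERT IGNORE silently skips any row
    # from temp that collides with a row already in a.
    cursor.execute("INSERT IGNORE INTO a SELECT * FROM temp")
    conn.commit()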

4
  • I'll give it a try. Does this have a limit for the number of fields it can have? Commented Jun 17, 2013 at 16:22
  • @JonathanIbers: Any legitimate number of columns should be OK. More details could be found here Commented Jun 17, 2013 at 16:25
  • It's giving me an error: "_mysql_exceptions.OperationalError: (1071, 'Specified key was too long; max key length is 1000 bytes')". I tried using all the column names, but it appears that I can't really do that. Commented Jun 18, 2013 at 13:13
  • So I limited the number of bytes in the key to under 1000, and it worked, without adding duplicates! Thanks a bunch! Commented Jun 18, 2013 at 13:32

How about:

    insert into MyTable select * from temp where (tempcolumn1, tempcolumn2, ..., tempcolumnn) not in (select * from MyTable)
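
Run from Python, a minimal sketch of that query, assuming an open MySQLdb cursor and hypothetical tables a/temp with columns name and age; note the subquery's select list must have the same number of columns as the row constructor on the left, so SELECT * only lines up when the table has exactly those columns:

    # Sketch only: table and column names (a, temp, name, age) are assumptions.
    # Every compared column must appear in both the row constructor and the
    # subquery's select list, or MySQL rejects the comparison.
    # Note: rows with NULLs in compared columns are skipped, since NOT IN
    # evaluates to NULL for them.
    cursor.execute(
        "INSERT INTO a "
        "SELECT * FROM temp "
        "WHERE (name, age) NOT IN (SELECT name, age FROM a)"
    )
    conn.commit()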
  • I tried it with blank tables and it seemed to work. I ran the same files again, just to check that it didn't add any duplicates, and it's been running for 20 minutes without finishing. Do you know of any way to make it more efficient? But I think it worked. Commented Jun 17, 2013 at 19:43
  • That seems like a long time. ;-) I'm not very knowledgeable about MySQL performance issues/approaches, but another approach would be to replace the contents of the table with the SELECT DISTINCT .. UNION of the two tables (sketched after these comments). That might be more efficient. Commented Jun 17, 2013 at 19:57
  • Thanks, I'll give it a try. I don't know what I did, but I gave up and Ctrl-Z'd out of it. Now I'm trying to drop that table and start over, but it's taken almost half an hour with no success yet. I must have somehow multiplied the table size into something I can't even delete. :| Thanks for the suggestion though, I'll give it a shot. Commented Jun 17, 2013 at 20:01
  • I'd be curious to see the specific SQL you used for your attempt at my first suggestion. Commented Jun 17, 2013 at 20:07
  • Posted, with a little more information. Let me know if you see any errors. Commented Jun 18, 2013 at 13:05
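
A minimal sketch of the SELECT DISTINCT .. UNION idea from the comments above, assuming hypothetical tables a and temp with identical schemas; UNION (unlike UNION ALL) already removes duplicate rows, so the combined result can simply replace the original table:

    # Sketch only: the table names a/temp are assumptions. UNION deduplicates,
    # so a_merged holds each distinct row exactly once.
    cursor.execute("CREATE TABLE a_merged SELECT * FROM a UNION SELECT * FROM temp")
    cursor.execute("DROP TABLE a")
    cursor.execute("RENAME TABLE a_merged TO a")
    conn.commit()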
