
So let's say I have csv files which I have no control over:

a.csv
b.csv
c.csv

and they all have different header names. I dumped all the data into tables a, b, and c. Now, if I get another a.csv with new values (same header fields), how can I insert only the rows from the new a.csv that aren't already in the old table?

For example: The a table has headers name and age:

'Bob'   25
'Mary'  50

And I get a new a.csv which parses out to:

'Bob'   25
'Susie' 60

How do I add only the lines that are unique to the current table (e.g. only add Susie, not Bob, to the a table)? I do not have a specific unique ID for each row, so I can't use a primary key. There are also multiple header fields, so if I try to use all of them as a composite primary key it fails with "Specified key was too long".

I need to check that the entire line is unique, and only then add it to the table. I tried INSERT IGNORE, but without a unique key I can't get it to work correctly. Any suggestions? I'll post any additional info if it would help.

Current attempt:

cursor.execute("ALTER TABLE temp ADD PRIMARY KEY" + uniqueline)
cursor.execute("INSERT IGNORE INTO " + tablename + " SELECT * FROM temp")

Where tablename is the name of the target table, temp is the table the CSV data was loaded into, and uniqueline is currently the first 5 fields, in the form (field1, field2, field3, field4, field5), or all the fields if there are fewer than 5.
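
Roughly, uniqueline is built like this (a sketch only; headers stands in for a hypothetical list of column names parsed from the CSV, which isn't shown above):

    # headers is hypothetical: the list of column names parsed from the CSV
    fields = headers[:5]  # at most the first five fields
    uniqueline = "(" + ", ".join(fields) + ")"
    cursor.execute("ALTER TABLE temp ADD PRIMARY KEY " + uniqueline)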

Thanks!

Edit:

cursor.execute("INSERT INTO " + tablename + " SELECT * FROM temp where " + uniqueline + " NOT IN (SELECT * FROM " + tablename + ")")

It works once (with empty tables), but if I run it again to test it, it essentially freezes and never finishes. Now I have these "phantom tables": if I try to drop one it says "unknown table", but if I try to create it, it says "table already exists". I also can't add or delete anything from the table without it freezing. I'm going to try giving it a unique index again. Thanks for all your help, though!

  • Is there a reason you can't use all the columns that need to be unique as your primary key? Commented Jun 17, 2013 at 16:11
  • The problem is that there are over 40 columns, so it says that "Primary Key is too long". Commented Jun 17, 2013 at 16:19

2 Answers


You may want to update your table to have a unique index, as in:

       ALTER IGNORE TABLE MyTable ADD UNIQUE INDEX idx_name (name, age);

Once done, it should automatically filter out duplicate rows on insertion. You may need to handle the resulting exceptions.

A workaround approach could be to drop the index before loading all the CSV files. Once the data is uploaded, re-apply the index to drop the duplicate records.
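
A minimal sketch of the whole flow in Python, assuming the MySQLdb driver (the traceback in the comments below mentions _mysql_exceptions) and hypothetical table/column names; the prefix length on the string column is one way to keep the combined key under the 1000-byte limit that comes up in the comments, and ALTER IGNORE only exists in MySQL 5.6 and earlier:

    # Sketch only: the driver and the table/column names (a, temp, name, age)
    # are assumptions, not from the answer. ALTER IGNORE is MySQL <= 5.6.
    import MySQLdb

    conn = MySQLdb.connect(host="localhost", user="user", passwd="pw", db="mydb")
    cursor = conn.cursor()

    # Deduplicate existing rows and reject future duplicates in one step.
    # The name(100) prefix keeps the index under the 1000-byte key limit.
    cursor.execute("ALTER IGNORE TABLE a ADD UNIQUE INDEX idx_all (name(100), age)")

    # With the unique index in place, INSERT IGNORE silently skips any row
    # from temp that collides with a row already in a.
    cursor.execute("INSERT IGNORE INTO a SELECT * FROM temp")
    conn.commit()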

4
  • I'll give it a try. Does this have a limit for the number of fields it can have? Commented Jun 17, 2013 at 16:22
  • @JonathanIbers: Any legitimate number of columns should be OK. More details could be found here Commented Jun 17, 2013 at 16:25
  • It's giving me an error: "_mysql_exceptions.OperationalError: (1071, 'Specified key was too long; max key length is 1000 bytes')". I tried using all the column names, but it appears that I can't really do that. Commented Jun 18, 2013 at 13:13
  • So I limited the number of bytes in the key to under 1000, and it worked, without adding duplicates! Thanks a bunch! Commented Jun 18, 2013 at 13:32

How about:

    insert into MyTable select * from temp where (tempcolumn1, tempcolumn2, ..., tempcolumnn) not in (select * from MyTable)
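
Run from Python, a minimal sketch of that query, assuming an open MySQLdb cursor and hypothetical tables a/temp with columns name and age; note the subquery's select list must have the same number of columns as the row constructor on the left, so SELECT * only lines up when the table has exactly those columns:

    # Sketch only: table and column names (a, temp, name, age) are assumptions.
    # Every compared column must appear in both the row constructor and the
    # subquery's select list, or MySQL rejects the comparison.
    # Note: rows with NULLs in compared columns are skipped, since NOT IN
    # evaluates to NULL for them.
    cursor.execute(
        "INSERT INTO a "
        "SELECT * FROM temp "
        "WHERE (name, age) NOT IN (SELECT name, age FROM a)"
    )
    conn.commit()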
  • I tried it with blank tables and it seemed to work. I ran the same files again, just to check that it didn't add any duplicates, and it's been running for 20 minutes without finishing. Do you know of any way to make it more efficient? But I think it worked. Commented Jun 17, 2013 at 19:43
  • That seems like a long time. ;-) I'm not very knowledgeable about MySQL performance issues/approaches, but another approach would be to replace the contents of the table with the SELECT DISTINCT .. UNION of the two tables (sketched after these comments). That might be more efficient. Commented Jun 17, 2013 at 19:57
  • Thanks, I'll give it a try. I don't know what I did, but I gave up and Ctrl-Z'd out of it. Now I'm trying to drop that table and start over, but it's taken almost half an hour with no success yet. I must have somehow multiplied the table size into something I can't even delete. :| Thanks for the suggestion though, I'll give it a shot. Commented Jun 17, 2013 at 20:01
  • I'd be curious to see the specific SQL you used for your attempt at my first suggestion. Commented Jun 17, 2013 at 20:07
  • Posted, with a little more information. Let me know if you see any errors. Commented Jun 18, 2013 at 13:05
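
A minimal sketch of the SELECT DISTINCT .. UNION idea from the comments above, assuming hypothetical tables a and temp with identical schemas; UNION (unlike UNION ALL) already removes duplicate rows, so the combined result can simply replace the original table:

    # Sketch only: the table names a/temp are assumptions. UNION deduplicates,
    # so a_merged holds each distinct row exactly once.
    cursor.execute("CREATE TABLE a_merged SELECT * FROM a UNION SELECT * FROM temp")
    cursor.execute("DROP TABLE a")
    cursor.execute("RENAME TABLE a_merged TO a")
    conn.commit()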
