I have a database table with numerous fields, including "category". The source of this table is an XML file which the system receives occasionally. Each XML file contains data for one category. The table needs to be updated with the new category data from the XML file. The XML file includes all data for that category, not just the changes.

I see two possible ways of handling this:

  1. First delete all rows from the table where category = categoryID, then insert rows based on all the XML data (see the sketch after this list). Obviously, the delete and insert operations would be contained within a single transaction.

  2. For each record in the XML, first do a select to test whether that record exists. If it does exist, update it with the data from the XML. If it doesn't exist, insert it.
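
A minimal sketch of the first approach in T-SQL; the table name (Products), its columns, and the parameters are hypothetical stand-ins for your actual schema:

    BEGIN TRANSACTION;

    -- Remove every existing row for the category covered by this file.
    DELETE FROM Products
    WHERE Category = @CategoryID;

    -- Re-insert all rows parsed from the XML (executed once per record).
    INSERT INTO Products (Id, Category, Name, Price)
    VALUES (@Id, @CategoryID, @Name, @Price);

    COMMIT TRANSACTION;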

Obviously, the second approach avoids deleting the data first, but it involves many more database queries, although this could be mitigated by doing an initial select into a hash table and querying the hash instead. The big downside of the second approach is handling deletions, i.e. records that no longer appear in the XML and should be removed from the table; a sketch of one way to do that follows.
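
A sketch of the second approach that also covers the deletions, again with hypothetical names: stage the parsed XML records in a temp table, update and insert from it, then delete the rows for this category that are absent from the staged set.

    BEGIN TRANSACTION;

    -- Stage the records parsed from the XML file.
    CREATE TABLE #Incoming
    (
        Id    INT PRIMARY KEY,
        Name  NVARCHAR(100),
        Price DECIMAL(10, 2)
    );
    -- ... bulk-insert the parsed records into #Incoming here ...

    -- Update records that already exist.
    UPDATE p
    SET p.Name  = i.Name,
        p.Price = i.Price
    FROM Products AS p
    JOIN #Incoming AS i ON i.Id = p.Id;

    -- Insert records that do not exist yet.
    INSERT INTO Products (Id, Category, Name, Price)
    SELECT i.Id, @CategoryID, i.Name, i.Price
    FROM #Incoming AS i
    WHERE NOT EXISTS (SELECT 1 FROM Products AS p WHERE p.Id = i.Id);

    -- Delete records for this category that no longer appear in the XML.
    DELETE FROM Products
    WHERE Category = @CategoryID
      AND Id NOT IN (SELECT Id FROM #Incoming);

    COMMIT TRANSACTION;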

What is the best practice for handling this kind of operation?

Thanks.

It sounds like you're looking for an upsert or merge – Dan Pichelman Jul 10 '15 at 15:02
Recommended reading: Why is asking a question on “best practice” a bad thing? This is purely a wording issue: instead of appealing to the authority of "best practice," simply state what your requirements are and where you are not sure how to proceed. – Snowman Jul 10 '15 at 15:04
@DanPichelman At least in SQL Server, a MERGE statement can have unintended consequences (there are some lingering bugs with its implementation, and it also unconditionally updates all rows that it doesn't insert, which may not be intuitive at first sight). – mgw854 Jul 10 '15 at 18:33
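
For reference, a minimal sketch of the MERGE approach the comments mention (SQL Server syntax; it assumes the incoming records were staged in a hypothetical #Incoming temp table, and the other names are placeholders too). Note that, as mgw854 points out, the WHEN MATCHED branch updates every matched row unless you add a predicate comparing the columns:

    MERGE Products AS target
    USING #Incoming AS source
        ON target.Id = source.Id
    WHEN MATCHED THEN
        UPDATE SET target.Name  = source.Name,
                   target.Price = source.Price
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (Id, Category, Name, Price)
        VALUES (source.Id, @CategoryID, source.Name, source.Price)
    WHEN NOT MATCHED BY SOURCE AND target.Category = @CategoryID THEN
        DELETE;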

Since you have the primary key of each record, I highly recommend using the approach of examining each record individually.

  • Primary key lookups are extremely fast in any professional-grade database.

  • Updating a record based on primary key has a very granular lock level and is fast.

  • By inserting or updating rather than deleting and reinserting, you maintain the continuity of the data: at no time is it possible to query the category table and come up empty (unless you are querying a category whose records you have not gotten around to inserting yet).

In short, do not worry about performing extra queries here, because they will execute about as fast as possible. I am also assuming the number of categories is relatively fixed, as compared to, say, logging events or banking transactions, which continually increase the amount of data.
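
A sketch of that per-record pattern, with hypothetical names as before; each record parsed from the XML is looked up by primary key and then updated or inserted:

    -- Executed once per XML record, e.g. from application code or a stored procedure.
    IF EXISTS (SELECT 1 FROM Products WHERE Id = @Id)
    BEGIN
        UPDATE Products
        SET Name = @Name, Price = @Price
        WHERE Id = @Id;
    END
    ELSE
    BEGIN
        INSERT INTO Products (Id, Category, Name, Price)
        VALUES (@Id, @CategoryID, @Name, @Price);
    END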


I agree with Snowman and think you should do the individual queries. Besides the performance cost being negligible, you can gain some insight into the data changes that may benefit you and/or your users:

  1. New categories
  2. Removed categories
  3. Categories that have changed

Someone is eventually going to ask what happened, and "we got a new file" won't be enough of an answer.
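
One way to capture change sets like these at the record level while processing, assuming the same hypothetical #Incoming staging table and Products schema as in the sketches above:

    -- Records that are new in this file:
    SELECT i.Id FROM #Incoming AS i
    EXCEPT
    SELECT p.Id FROM Products AS p WHERE p.Category = @CategoryID;

    -- Records this file removes:
    SELECT p.Id FROM Products AS p WHERE p.Category = @CategoryID
    EXCEPT
    SELECT i.Id FROM #Incoming AS i;

    -- Records whose data has changed:
    SELECT i.Id, i.Name, i.Price
    FROM #Incoming AS i
    JOIN Products AS p ON p.Id = i.Id
    WHERE p.Name <> i.Name OR p.Price <> i.Price;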

