I am parsing a huge XML file. It contains a few million article entries like this one:
<article key="journals/cgf/HaeglerWAGM10" mdate="2010-11-12">
<author>Simon Haegler</author>
<author>Peter Wonka</author>
<author>Stefan Müller Arisona</author>
<author>Luc J. Van Gool</author>
<author>Pascal Müller</author>
<title>Grammar-based Encoding of Facades.</title>
<pages>1479-1487</pages>
<year>2010</year>
<volume>29</volume>
<journal>Comput. Graph. Forum</journal>
<number>4</number>
<ee>http://dx.doi.org/10.1111/j.1467-8659.2010.01745.x</ee>
<url>db/journals/cgf/cgf29.html#HaeglerWAGM10</url>
</article>
I step through the file and parse those articles with lxml. If I run the code without storing the items into my database, it parses about 1,000 entries in ~3 seconds. But if I activate the storage, which is done by the functions below, it drops to about 10 entries per second. Is this normal? I remember parsing the file once before, and back then the database was not such a bottleneck. But I had a different approach... (looking through my files to find it)
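For context, the streaming part looks roughly like this (a sketch, not my exact code; I use the stdlib ElementTree here so it runs anywhere, but lxml's etree.iterparse has the same interface plus a tag= filter):

```python
import io
import xml.etree.ElementTree as etree

def iter_articles(source):
    """Yield (paper_dict, author_list) per <article>, keeping memory flat."""
    for event, elem in etree.iterparse(source, events=("end",)):
        if elem.tag != "article":
            continue
        paper = dict(elem.attrib)  # key, mdate
        paper.update((c.tag, c.text) for c in elem if c.tag != "author")
        authors = [c.text for c in elem.findall("author")]
        yield paper, authors
        elem.clear()  # free the processed subtree

sample = io.BytesIO(b"<dblp><article key='a'>"
                    b"<author>X</author><title>T.</title></article></dblp>")
for paper, authors in iter_articles(sample):
    print(paper["key"], paper["title"], authors)
```

The elem.clear() after each yield is what keeps the parse at a few MB of RAM instead of loading all the millions of entries at once.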
def add_paper(paper, cursor):
    questionmarks = str(('?',)*len(paper)).replace("'", "")  # produces (?, ?, ?, ... ,?) for oursql query
    keys, values = paper.keys(), paper.values()
    keys = str(tuple(keys)).replace("'", "")  # produces (mdate, title, ... date, some_key)
    query_paper = '''INSERT INTO dblp2.papers {0} VALUES {1};'''.\
        format(keys, questionmarks)
    values = tuple(v.encode('utf8') for v in values)
    cursor.execute(query_paper, values)
    paper_id = cursor.lastrowid
    return paper_id
def populate_database(paper, authors, cursor):
    paper_id = add_paper(paper, cursor)
    query_author = """INSERT INTO dblp2.authors (name) VALUES (?) ON DUPLICATE KEY UPDATE id=LAST_INSERT_ID(id)"""
    query_link_table = "INSERT INTO dblp2.author_paper (author_id, paper_id) VALUES (?, ?)"
    for author in authors:
        cursor.execute(query_author, (author.encode('utf8'),))
        author_id = cursor.lastrowid
        cursor.execute(query_link_table, (author_id, paper_id))
Edit:
I was able to narrow the problem down to those three cursor.execute calls. Perhaps it is a database problem. I will ask over at Stack Overflow in case someone has an idea why it is that slow. Meanwhile I would be interested whether the code can be refactored to be more pythonic. Any ideas?
Edit 2:
If you use a similar approach to mine, storing the data row by row, don't use the InnoDB engine: it is slower by orders of magnitude. The code sped up again after I changed the engine.
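For what it's worth, a common reason InnoDB looks this slow in a row-by-row loop is that each statement is committed (and flushed to disk) on its own. Batching many inserts into one transaction usually closes most of the gap. A sketch of what that could look like (the helper and the connection/cursor names are illustrative, not from my code):

```python
def bulk_insert_authors(conn, cursor, names, batch_size=1000):
    # Hypothetical helper: insert rows in batches inside one transaction,
    # so the engine flushes once per batch instead of once per row.
    query = "INSERT INTO dblp2.authors (name) VALUES (?)"
    for start in range(0, len(names), batch_size):
        batch = [(n.encode("utf8"),) for n in names[start:start + batch_size]]
        cursor.executemany(query, batch)
        conn.commit()  # one flush per batch
```

With batches of a few hundred to a few thousand rows, InnoDB might well become competitive again without switching engines, though I have not benchmarked this against my exact setup.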