I am parsing a huge XML file. It contains a few million article entries like this one:
<article key="journals/cgf/HaeglerWAGM10" mdate="2010-11-12">
<author>Simon Haegler</author>
<author>Peter Wonka</author>
<author>Stefan Müller Arisona</author>
<author>Luc J. Van Gool</author>
<author>Pascal Müller</author>
<title>Grammar-based Encoding of Facades.</title>
<pages>1479-1487</pages>
<year>2010</year>
<volume>29</volume>
<journal>Comput. Graph. Forum</journal>
<number>4</number>
<ee>http://dx.doi.org/10.1111/j.1467-8659.2010.01745.x</ee>
<url>db/journals/cgf/cgf29.html#HaeglerWAGM10</url>
</article>
I step through the file and parse those articles with lxml. If I run the code without storing the items into my database, it parses about 1,000 entries in ~3 seconds. But if I activate the storage, which is done by the functions below, it drops to about 10 entries per second. Is this normal? I remember parsing the file once before, and back then the database was not such a bottleneck. But I had a different approach... (looking through my files to find it)
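For context, the streaming part looks roughly like this (a sketch, not my exact code; I use the stdlib ElementTree here so it runs anywhere, but lxml's etree.iterparse has the same interface plus a tag= filter):

```python
import io
import xml.etree.ElementTree as etree

def iter_articles(source):
    """Yield (paper_dict, author_list) per <article>, keeping memory flat."""
    for event, elem in etree.iterparse(source, events=("end",)):
        if elem.tag != "article":
            continue
        paper = dict(elem.attrib)  # key, mdate
        paper.update((c.tag, c.text) for c in elem if c.tag != "author")
        authors = [c.text for c in elem.findall("author")]
        yield paper, authors
        elem.clear()  # free the processed subtree

sample = io.BytesIO(b"<dblp><article key='a'>"
                    b"<author>X</author><title>T.</title></article></dblp>")
for paper, authors in iter_articles(sample):
    print(paper["key"], paper["title"], authors)
```

The elem.clear() after each yield is what keeps the parse at a few MB of RAM instead of loading all the millions of entries at once.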
def add_paper(paper, cursor):
    questionmarks = str(('?',)*len(paper)).replace("'", "")  # produces (?, ?, ?, ... ,?) for oursql query
    keys, values = paper.keys(), paper.values()
    keys = str(tuple(keys)).replace("'", "")  # produces (mdate, title, ... date, some_key)
    query_paper = '''INSERT INTO dblp2.papers {0} VALUES {1};'''.\
        format(keys, questionmarks)
    values = tuple(v.encode('utf8') for v in values)
    cursor.execute(query_paper, values)
    paper_id = cursor.lastrowid
    return paper_id
def populate_database(paper, authors, cursor):
    paper_id = add_paper(paper, cursor)
    query_author = """INSERT INTO dblp2.authors (name) VALUES (?) ON DUPLICATE KEY UPDATE id=LAST_INSERT_ID(id)"""
    query_link_table = "INSERT INTO dblp2.author_paper (author_id, paper_id) VALUES (?, ?)"
    for author in authors:
        cursor.execute(query_author, (author.encode('utf8'),))
        author_id = cursor.lastrowid
        cursor.execute(query_link_table, (author_id, paper_id))
Edit:
I was able to narrow the problem down to those three cursor.execute calls. Perhaps it is a database problem. I will ask over at Stack Overflow in case someone has an idea why it is that slow. Meanwhile I would be interested whether the code can be refactored to be more pythonic. Any ideas?
Edit 2:
If you use a similar approach to mine, storing the data row by row, don't use the InnoDB engine: it is slower by orders of magnitude. The code sped up again after I changed the engine.
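For what it's worth, a common reason InnoDB looks this slow in a row-by-row loop is that each statement is committed (and flushed to disk) on its own. Batching many inserts into one transaction usually closes most of the gap. A sketch of what that could look like (the helper and the connection/cursor names are illustrative, not from my code):

```python
def bulk_insert_authors(conn, cursor, names, batch_size=1000):
    # Hypothetical helper: insert rows in batches inside one transaction,
    # so the engine flushes once per batch instead of once per row.
    query = "INSERT INTO dblp2.authors (name) VALUES (?)"
    for start in range(0, len(names), batch_size):
        batch = [(n.encode("utf8"),) for n in names[start:start + batch_size]]
        cursor.executemany(query, batch)
        conn.commit()  # one flush per batch
```

With batches of a few hundred to a few thousand rows, InnoDB might well become competitive again without switching engines, though I have not benchmarked this against my exact setup.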