Checking for duplicate values in database with Python

Question

I'm working on a Python application where I read and extract data from an HTML file. The data is stored in a list, and the number of items in a list is, on average, 50,000+. The items from the list are added to a database. I cannot allow duplicate datetime values in one of the columns, and so, I have to check if a specific value exists in the database already, and if it does, keep incrementing it until there's no duplicates.

Here's what my code looks like (I'm using SQLAlchemy):

import re
from bs4 import BeautifulSoup
from datetime import datetime, timedelta
from models import db, Notes

def import_file(filename, user):
    file = open(filename, 'r')
    html = file.read()
    file.close()
    soup = BeautifulSoup(html, "lxml")
    for link in soup.find_all('a'):
        validurl = re.compile(
            r'^(?:[a-z0-9\.\-]*)://'
            r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}(?<!-)\.?)|'
            r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|'
            r'\[?[A-F0-9]*:[A-F0-9:]+\]?)'
            r'(?::\d+)?'
            r'(?:/?|[/?]\S+)$', re.IGNORECASE)
        url = validurl.match(link.get('href'))
        if url:
                url = link.get('href')
                if len(url) <= 2000:
                    if link.get('add_date'):
                        date = datetime.utcfromtimestamp(int(link.get('add_date')))
                        while Notes.query.filter_by(user=user, added_on=date).first():
                            date += timedelta(0, 1)
                        title = link.string
                        import_notes = Notes()
                        import_notes.main_url = url
                        import_notes.title = title
                        import_notes.user = user
                        import_notes.added_on = date
                        db.session.add(import_notes)
                        db.session.commit()

However, this code runs very slowly. I calculated how long it was taking per insert, and it came out at exactly 13 seconds per insert. I ran a profiler, and the filter_by query was taking the majority of the time. I figured it was because the script had to run a query so many times per item, and so I changed my code to use a static list instead:

import re
from bs4 import BeautifulSoup
from datetime import datetime, timedelta
from models import db, Notes

def import_file(filename, user):
    file = open(filename, 'r')
    html = file.read()
    file.close()
    soup = BeautifulSoup(html, "lxml")
    for link in soup.find_all('a'):
        validurl = re.compile(
            r'^(?:[a-z0-9\.\-]*)://'
            r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}(?<!-)\.?)|'
            r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|'
            r'\[?[A-F0-9]*:[A-F0-9:]+\]?)'
            r'(?::\d+)?'
            r'(?:/?|[/?]\S+)$', re.IGNORECASE)
        url = validurl.match(link.get('href'))
        if url:
                url = link.get('href')
                if len(url) <= 2000:
                    if link.get('add_date'):
                        date = datetime.utcfromtimestamp(int(link.get('add_date')))
                        while Notes.query.filter_by(user=user, added_on=date).first():
                            date += timedelta(0, 1)
                        title = link.string
                        import_notes = Notes()
                        import_notes.main_url = url
                        import_notes.title = title
                        import_notes.user = user
                        import_notes.added_on = date
                        db.session.add(import_notes)
                        db.session.commit()

I thought this might make the code run faster, but there was little to no change in runtime. For comparison, if I remove the duplicate check, I get ~300 inserts per second.

I cannot let duplicate datetime values into the database. What can I do to make my code faster?

Why do you have that requirement about no multiple dates though? — ferada, Feb 13 at 14:32
I am using the dates for pagination. I can't use ids since I allow imports, and the dates on imported items can be older, but the id would then be higher. If I have too many duplicate dates on a page, then the forward/backward buttons stop working(ie, they redirect to the same page). — Nik Barres, Feb 13 at 14:43
Maybe you should try using a set, checking if a value is inside a set is extremely fast. — Caridorc, Feb 13 at 18:59
@Caridorc your suggestion to use a set solved my problem. Down from 13 seconds per insert to 0.009. If you add your comment as an answer, I'll accept it. — Nik Barres, Feb 14 at 12:38

Caridorc · Accepted Answer · 2015-02-14 12:59:38Z

up vote 0 down vote accepted

In python checking membership of an item in a set is extremely fast.

I suggest you use sets.

answered Feb 14 at 12:59

Caridorc
7,9042946

add a comment |

asked	6 months ago
viewed	79 times
active	6 months ago

current community

your communities

more stack exchange communities

Checking for duplicate values in database with Python

1 Answer 1

Your Answer

Not the answer you're looking for? Browse other questions tagged python performance database or ask your own question.

Hot Network Questions

current community

your communities

more stack exchange communities

Checking for duplicate values in database with Python

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged python performance database or ask your own question.

Related

Hot Network Questions