Code Review Stack Exchange is a question and answer site for peer programmer code reviews.

I'm working on a Python application where I read and extract data from an HTML file. The data is stored in a list that, on average, holds 50,000+ items, and each item is added to a database. I cannot allow duplicate datetime values in one of the columns, so I have to check whether a specific value already exists in the database and, if it does, keep incrementing it by one second until there are no duplicates.
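In isolation, that collision rule looks like this; a minimal sketch, assuming the existing datetimes are already available in memory (the `bump_until_unique` name is mine, not part of the application):

```python
from datetime import datetime, timedelta

def bump_until_unique(date, taken):
    """Advance `date` one second at a time until it collides with
    nothing in `taken`."""
    while date in taken:
        date += timedelta(seconds=1)
    return date

taken = {datetime(2015, 2, 13, 12, 0, 0), datetime(2015, 2, 13, 12, 0, 1)}
unique = bump_until_unique(datetime(2015, 2, 13, 12, 0, 0), taken)
# unique is datetime(2015, 2, 13, 12, 0, 2), the first free second
```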

Here's what my code looks like (I'm using SQLAlchemy):

import re
from bs4 import BeautifulSoup
from datetime import datetime, timedelta
from models import db, Notes

def import_file(filename, user):
    file = open(filename, 'r')
    html = file.read()
    file.close()
    soup = BeautifulSoup(html, "lxml")
    for link in soup.find_all('a'):
        validurl = re.compile(
            r'^(?:[a-z0-9\.\-]*)://'
            r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}(?<!-)\.?)|'
            r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|'
            r'\[?[A-F0-9]*:[A-F0-9:]+\]?)'
            r'(?::\d+)?'
            r'(?:/?|[/?]\S+)$', re.IGNORECASE)
        url = validurl.match(link.get('href'))
        if url:
            url = link.get('href')
            if len(url) <= 2000:
                if link.get('add_date'):
                    date = datetime.utcfromtimestamp(int(link.get('add_date')))
                    while Notes.query.filter_by(user=user, added_on=date).first():
                        date += timedelta(0, 1)
                    title = link.string
                    import_notes = Notes()
                    import_notes.main_url = url
                    import_notes.title = title
                    import_notes.user = user
                    import_notes.added_on = date
                    db.session.add(import_notes)
                    db.session.commit()

However, this code runs very slowly. I calculated how long it was taking per insert, and it came out at exactly 13 seconds per insert. I ran a profiler, and the filter_by query was taking the majority of the time. I figured it was because the script had to run a query so many times per item, so I changed my code to check against a list held in memory instead:

import re
from bs4 import BeautifulSoup
from datetime import datetime, timedelta
from models import db, Notes

def import_file(filename, user):
    file = open(filename, 'r')
    html = file.read()
    file.close()
    soup = BeautifulSoup(html, "lxml")
    dates = [note.added_on for note in Notes.query.filter_by(user=user)]
    for link in soup.find_all('a'):
        validurl = re.compile(
            r'^(?:[a-z0-9\.\-]*)://'
            r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}(?<!-)\.?)|'
            r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}|'
            r'\[?[A-F0-9]*:[A-F0-9:]+\]?)'
            r'(?::\d+)?'
            r'(?:/?|[/?]\S+)$', re.IGNORECASE)
        url = validurl.match(link.get('href'))
        if url:
            url = link.get('href')
            if len(url) <= 2000:
                if link.get('add_date'):
                    date = datetime.utcfromtimestamp(int(link.get('add_date')))
                    while date in dates:
                        date += timedelta(0, 1)
                    dates.append(date)
                    title = link.string
                    import_notes = Notes()
                    import_notes.main_url = url
                    import_notes.title = title
                    import_notes.user = user
                    import_notes.added_on = date
                    db.session.add(import_notes)
                    db.session.commit()

I thought this might make the code run faster, but there was little to no change in runtime. For comparison, if I remove the duplicate check, I get ~300 inserts per second.
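That result matches how Python containers behave: `in` on a list scans element by element (O(n)), while `in` on a set hashes (O(1) on average), so with 50,000+ items the per-probe cost differs by orders of magnitude. A rough timing sketch (absolute numbers will vary by machine):

```python
import timeit

n = 50_000
data_list = list(range(n))
data_set = set(data_list)

# Probe a value near the end of the list, the slow case for a scan.
list_time = timeit.timeit(lambda: (n - 1) in data_list, number=100)
set_time = timeit.timeit(lambda: (n - 1) in data_set, number=100)

print(list_time > set_time)  # the list probes are far slower
```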

I cannot let duplicate datetime values into the database. What can I do to make my code faster?

Why do you have that requirement about no multiple dates though? – ferada Feb 13 at 14:32

I am using the dates for pagination. I can't use ids since I allow imports, and the dates on imported items can be older, but the id would then be higher. If I have too many duplicate dates on a page, then the forward/backward buttons stop working (i.e., they redirect to the same page). – Nik Barres Feb 13 at 14:43

Maybe you should try using a set; checking if a value is inside a set is extremely fast. – Caridorc Feb 13 at 18:59

I'll try that and report back. – Nik Barres Feb 13 at 19:17

@Caridorc your suggestion to use a set solved my problem. Down from 13 seconds per insert to 0.009. If you add your comment as an answer, I'll accept it. – Nik Barres Feb 14 at 12:38
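For the record, the pattern the comments converge on can be sketched with stand-ins for the app-specific pieces (`parsed_links` and `rows` below are hypothetical placeholders for the BeautifulSoup output and the `Notes`/`db.session` calls, which aren't available outside the application):

```python
from datetime import datetime, timedelta

# Hypothetical stand-ins: in the real app these come from BeautifulSoup
# and a one-time Notes.query.filter_by(user=user) query.
parsed_links = [
    ("http://example.com/a", 1423828800),
    ("http://example.com/b", 1423828800),  # same ADD_DATE as the first
    ("http://example.com/c", 1423828801),
]
seen = set()  # seed this from the database once, before the loop

rows = []
for url, add_date in parsed_links:
    date = datetime.utcfromtimestamp(add_date)
    while date in seen:          # O(1) average per probe, no SQL round trip
        date += timedelta(seconds=1)
    seen.add(date)
    rows.append((url, date))     # real code: db.session.add(Notes(...))
# real code: a single db.session.commit() after the loop

print([d.second for _, d in rows])  # [0, 1, 2] — each collision bumped once
```

Committing once after the loop instead of once per row is a further common speedup, since each commit is a database round trip of its own.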
