Take the 2-minute tour ×
Code Review Stack Exchange is a question and answer site for peer programmer code reviews. It's 100% free, no registration required.

I have a softmatch function (below) that takes a donor list and a new entry and looks to see if the given donor already exists. The data is not exact, so I have to use a softmatch to determine whether a given record exists (ex: Jon Doe at 123 Sesame St. is the same as John P. Doe at 123 Sesame Street). I have this implemented with difflib, and it works; it's just slow as Christmas. The program currently takes about two days to process 10mb of data. The profiler indicates it's the difflib operations in the softmatch function that is causing the slowness. Is there a way to optimize my matching function to work better?

def softMatch(self, new, donors):
    #Takes new, a contribution record from a DG report, and attempts to soft match it with a record in donors
    #Returns the position of the matched record or -1 if no match 
    #Methodology
        #Name 
        #Address
        #Affiliation,Occupation,Employer not analyzed   
    #Dependences cleanAddr(str), re(from cleanAddr), difflib

    match = -1

    name = new[0]
    address = self.cleanAddr(new[1] + " " + new[2] + " " + new[3] + " " + new[4] + " " + new[5][:5])

    while address.find("  ") != -1:
        address = address.replace("  "," ")

    diff = 0.6

    for x in range(0, len(donors)):
        ratio = difflib.SequenceMatcher(None, name, donors[x].getBestName()[0]).ratio() * difflib.SequenceMatcher(None, address, donors[x].getBestAddr()).ratio()**2
        if ratio > diff:
            diff = ratio
            match = x

    return match

And the profiler output:

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  3091394   53.737    0.000   66.798    0.000 difflib.py:353(find_longest_match)
   368770   10.030    0.000   80.702    0.000 difflib.py:463(get_matching_blocks)
 59454412    7.683    0.000    7.683    0.000 {method 'get' of 'dict' objects}
   368770    6.906    0.000   10.521    0.000 difflib.py:300(__chain_b)
  5210450    4.009    0.000    4.009    0.000 {built-in method __new__ of type object at 0x1001e14a0}
 16895496    3.056    0.000    3.056    0.000 {method 'append' of 'list' objects}
      756    2.863    0.004  104.674    0.138 june12.py:74(softMatch)
  3091394    1.875    0.000    1.875    0.000 {method 'pop' of 'list' objects}
 10299176    1.850    0.000    1.850    0.000 {method 'setdefault' of 'dict' objects}
  2487807    1.808    0.000    4.634    0.000 difflib.py:661(<genexpr>)
  3091394    1.620    0.000    4.345    0.000 <string>:12(__new__)
   368770    1.461    0.000   88.032    0.000 difflib.py:639(ratio)
  2119037    1.258    0.000    2.826    0.000 <string>:16(_make)
  8127502    1.032    0.000    1.032    0.000 {method '__contains__' of 'set' objects}
   184385    0.919    0.000    1.500    0.000 Donor.py:63(getBestAddr)
   368770    0.859    0.000    5.493    0.000 {built-in method sum}
   368770    0.554    0.000   12.131    0.000 difflib.py:154(__init__)
   368817    0.552    0.000    0.552    0.000 {method 'sort' of 'list' objects}
3965153/3965134    0.538    0.000    0.538    0.000 {built-in method len}
   368770    0.421    0.000   11.577    0.000 difflib.py:218(set_seqs)
   368770    0.403    0.000   10.924    0.000 difflib.py:256(set_seq2)
   337141    0.371    0.000    0.371    0.000 {method 'find' of 'str' objects}
   368770    0.281    0.000    0.281    0.000 difflib.py:41(_calculate_ratio)
   368770    0.232    0.000    0.232    0.000 difflib.py:230(set_seq1)
   159562    0.216    0.000    0.216    0.000 {method 'replace' of 'str' objects}
   184385    0.134    0.000    0.134    0.000 Donor.py:59(getBestName)
   1    0.017    0.017  104.718  104.718 june12.py:56(loadDonors)
    ...

(softmatch is called from loadDonors)

share|improve this question
add comment

Your Answer

 
discard

By posting your answer, you agree to the privacy policy and terms of service.

Browse other questions tagged or ask your own question.