I developed some code as part of a larger project. I ran into a problem with matching paragraphs and wasn't sure how to proceed, so I asked on Stack Overflow here. You can find an in-depth description of my problem there if you're curious.
Just to be clear, I am not reposting the same question here to get an answer.
I came up with a solution to my own problem, but I'm unsure of its limits and pitfalls, and this seems like the perfect place to ask about them.
The short version of the explanation is this: I have two strings, one of which is a revised version of the other. I want to generate markup and preserve the paragraph spacing. To do that, I need to correlate the lists of paragraphs in each string, match them, and then mark the remaining paragraphs as either new or deleted.
So I have a function (paraMatcher()) which matches paragraphs and returns a list of tuples as follows:
- (num1, num2) means that the best match for revised paragraph num1 is original paragraph num2
- (num, '+') means that there is no match for revised paragraph num, so it must be new (designated by the '+')
- (num, '-') means that no revised paragraph was matched to original paragraph num, so it must have been deleted (designated by the '-')
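To make the tuple format concrete, here is a minimal sketch of how a caller might separate the three kinds of entries. The helper name split_match_set and the sample list are hypothetical and written by hand for illustration; they are not part of my project and not real output from paraMatcher().

```python
def split_match_set(match_set):
    """Split paraMatcher-style tuples into matched, new, and deleted groups."""
    matched, new, deleted = [], [], []
    for first, second in match_set:
        if second == "+":
            new.append(first)          # revised paragraph with no original counterpart
        elif second == "-":
            deleted.append(first)      # original paragraph that nothing was matched to
        else:
            matched.append((first, second))  # (revised index, original index)
    return matched, new, deleted


# Hand-written sample in the format described above (an assumption, not real output).
sample = [(0, 0), (1, "+"), (2, 1), (2, "-")]
matched, new, deleted = split_match_set(sample)
```

Here matched holds the (revised, original) index pairs, while new and deleted hold the bare paragraph numbers.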
So without further ado, here is my function:
import difflib

def paraMatcher(orParas, revParas):
    THRESHOLD = 0.75
    matchSet = []
    shifter = 0  # number of original paragraphs popped so far
    for revPara in revParas:
        print "Checking revPara ", revParas.index(revPara)
        # (similarity ratio, index in the shrinking orParas list) for each remaining original
        matchTuples = [(difflib.SequenceMatcher(a=orPara,b=revPara).ratio(), orParas.index(orPara)) for orPara in orParas]
        print "MatchTuples: ", matchTuples
        if matchTuples:
            bestMatch = sorted(matchTuples, key = lambda tup: tup[0])[-1]
            print "Best Match: ", bestMatch
            if bestMatch[0] > THRESHOLD:
                # matched: remove the original so it cannot be matched again
                orParas.pop(bestMatch[1])
                print orParas
                matchSet.append((revParas.index(revPara), bestMatch[1] + shifter))
                shifter += 1
            else:
                matchSet.append((revParas.index(revPara), "+"))
        print ""
        print "---------------"
        print ""
    if orParas:
        # whatever is left in orParas was never matched, so mark it as deleted
        print "++++++++++++++dealing with extra paragraphs++++++++++++++"
        print orParas
        for orPara in orParas:
            matchSet.insert(orParas.index(orPara) + shifter, (orParas.index(orPara) + shifter, "-"))
    return matchSet
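
One side effect of the function above is that it pops items from orParas, so the list passed in by the caller shrinks. The following is only a minimal illustration of that behavior; pop_first is a hypothetical stand-in, not part of my code.

```python
def pop_first(paragraphs):
    """Hypothetical stand-in that removes an item from its argument, like orParas.pop(...)."""
    return paragraphs.pop(0)


original = ["first paragraph", "second paragraph"]

pop_first(original[:])               # only the slice copy is mutated
length_after_copy = len(original)    # the caller's list is untouched

pop_first(original)                  # the caller's own list loses an element
length_after_direct = len(original)
```

Passing a slice copy keeps the caller's list intact, while passing the list itself does not.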
While I definitely want review of general coding style, etc., here are a few issues that I'm really interested in getting feedback on:
- The function needs to be called with copies of the lists (paraMatcher(lst1[:], lst2[:]))
- How might this fail?
- How do I determine an appropriate value for THRESHOLD?
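
For the THRESHOLD question, one thing that might help is looking at the raw ratios for known pairs of paragraphs before settling on a cutoff. The following is a rough sketch under that assumption: ratio_table is a hypothetical helper, the sample paragraphs are made up, and it only reports the values from difflib.SequenceMatcher without deciding anything. It also does not modify its inputs.

```python
import difflib


def ratio_table(or_paras, rev_paras):
    """Return (revised index, original index, ratio) for every pair, without mutating inputs."""
    rows = []
    for rev_index, rev_para in enumerate(rev_paras):
        for or_index, or_para in enumerate(or_paras):
            ratio = difflib.SequenceMatcher(a=or_para, b=rev_para).ratio()
            rows.append((rev_index, or_index, ratio))
    return rows


# Made-up sample paragraphs, for illustration only.
originals = ["The quick brown fox jumps over the lazy dog.",
             "A paragraph that will be removed entirely."]
revised = ["The quick brown fox leaps over the lazy dog.",
           "Something completely new appears here."]

for rev_index, or_index, ratio in ratio_table(originals, revised):
    print("revised %d vs original %d: %.2f" % (rev_index, or_index, ratio))
```

Running something like this over real revisions would show how far apart the ratios for true matches and non-matches are, which is the gap any threshold has to fall into.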
Some Extra Notes:
- I've left the diagnostic printing in there in case any of you want to test it
- Due to other parts of the code, it's more convenient for this function to take lists of paragraphs as arguments, rather than the strings
- I don't think it matters, but this is 32-bit Python 2.7.6 running on 64-bit Windows 7