Take the 2-minute tour ×
Stack Overflow is a question and answer site for professional and enthusiast programmers. It's 100% free, no registration required.

I want to write a simple diff application in Python using Google's Diff Match Patch APIs. I'm quite new to Python, so I want an example of how to use the Diff Match Patch API for semantically comparing two paragraphs of text. I'm not too sure of how to go about using the diff_match_patch.py file and what to import to from it. Help will be much appreciated!

Additionally, I've tried using difflib, but I found it ineffective for comparing largely varied sentences. I'm using ubuntu 12.04 x64.

share|improve this question

1 Answer 1

up vote 5 down vote accepted

Google's diff-match-patch API is the same for all languages that it is implemented in (Java, JavaScript, Dart, C++, C#, Objective C, Lua and Python 2.x or python 3.x). Therefore one can typically use sample snippets in languages other than one's target language to figure out which particular API calls are needed for various diff/match/patch tasks .

In the case of a simple "semantic" comparison this is what you need

import diff_match_patch

textA = "the cat in the red hat"
textB = "the feline in the blue hat"

#create a diff_match_patch object
dmp = diff_match_patch.diff_match_patch()

# Depending on the kind of text you work with, in term of overall length
# and complexity, you may want to extend (or here suppress) the
# time_out feature
dmp.Diff_Timeout = 0   # or some other value, default is 1.0 seconds

# All 'diff' jobs start with invoking diff_main()
diffs = dmp.diff_main(textA, textB)

# diff_cleanupSemantic() is used to make the diffs array more "human" readable
dmp.diff_cleanupSemantic(diffs)

# and if you want the results as some ready to display HMTL snippet
htmlSnippet = dmp.diff_prettyHtml(diffs)


A word on "semantic" processing by diff-match-patch
Beware that such processing is useful to present the differences to a human viewer because it tends to produce a shorter list of differences by avoiding non-relevant resynchronization of the texts (when for example two distinct words happen to have common letters in their mid). The results produced however are far from perfect, as this processing is just simple heuristics based on the length of differences and surface patterns etc. rather than actual NLP processing based on lexicons and other semantic-level devices.
For example, the textA and textB values used above produce the following "before-and-after-diff_cleanupSemantic" values for the diffs array

[(0, 'the '), (-1, 'cat'), (1, 'feline'), (0, ' in the '), (-1, 'r'), (1, 'blu'), (0, 'e'), (-1, 'd'), (0, ' hat')]
[(0, 'the '), (-1, 'cat'), (1, 'feline'), (0, ' in the '), (-1, 'red'), (1, 'blue'), (0, ' hat')]

Nice! the letter 'e' that is common to red and blue causes the diff_main() to see this area of the text as four edits, but the cleanupSemantic() fixes as just two edits, nicely singling out the different sems 'blue' and 'red'.

However, if we have, for example

textA = "stackoverflow is cool"
textb = "so is very cool"

The before/after arrays produced are:

[(0, 's'), (-1, 'tack'), (0, 'o'), (-1, 'verflow'), (0, ' is'), (1, ' very'), (0, ' cool')]
[(0, 's'), (-1, 'tackoverflow is'), (1, 'o is very'), (0, ' cool')]

Which shows that the allegedly semantically improved after can be rather unduly "tortured" compared to the before. Note, for example, how the leading 's' is kept as a match and how the added 'very' word is mixed with parts of the 'is cool' expression. Ideally, we'd probably expect something like

[(-1, 'stackoverflow'), (1, 'so'), (0, ' is '), (-1, 'very'), (0, ' cool')]
share|improve this answer
    
Thank you for the quick reply! This is very useful and the algorithm is very good, but I don't find it exactly "readable". Is it possible to give a formatted output like this instead using Python and these APIs? And how would I go about it if it's possible? –  shortstheory Apr 18 '13 at 17:26
    
See edit, at the bottom of the first code section. Essentially you need call dmp.diff_prettyHtml(diffs) which produces an HTML snippet suitable to show the various edits (insertions and deletions) in coded color. –  mjv Apr 18 '13 at 19:42
    
Thanks! This is perfect and it's just what I was looking for! I appreciate all your help! Though, I also want to know if it's possible to give raw output too. I would've upvoted your answer, but I do not have enough reputation. –  shortstheory Apr 19 '13 at 4:13

Your Answer

 
discard

By posting your answer, you agree to the privacy policy and terms of service.

Not the answer you're looking for? Browse other questions tagged or ask your own question.