I've got two binary files. They look something like this, but the data is more random:

File A:

FF FF FF FF 00 00 00 00 FF FF 44 43 42 41 FF FF ...

File B:

41 42 43 44 00 00 00 00 44 43 42 41 40 39 38 37 ...

What I'd like is to call something like:

>>> someDiffLib.diff(file_a_data, file_b_data)

And receive something like:

[Match(pos=4, length=4)]

This indicates that both files contain the same 4 bytes starting at position 4. The sequence 44 43 42 41 would not match, because it sits at a different position in each file.

Is there a library that will do the diff for me? Or should I just write the loops to do the comparison?

docs.python.org/2/library/difflib.html - first result in google for "diff in python" –  Andrey Apr 3 '13 at 21:26
    
possible duplicate of difference between two strings in python/php –  Andrey Apr 3 '13 at 21:27
    
@Andrey thanks, I tried that, but it appears that get_matching_blocks() doesn't check whether the bytes are in the same spot in each file, just that the sequence exists in each file. Otherwise, yeah, that's pretty much what I want. –  omghai2u Apr 3 '13 at 21:28
    
So you want to get a list of every position where a match starts and the length of that match, and you don't care about sections of the file that would match if they were lined up properly? –  Kyle Strand Apr 3 '13 at 21:30
@KyleStrand yes, I think so. Although I'm not sure what "lined up properly" would mean in this case. In my example above, I do not want the 44 43 42 41 to match because they're in different positions; if that's what you mean. –  omghai2u Apr 3 '13 at 21:31

1 Answer

Accepted answer (score: 7)

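You can use itertools.groupby() for this: group the indices by whether the bytes at each index are equal, then record the start and length of every run of equal bytes. Here is an example: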

from itertools import groupby

# Set up some example byte strings (Python 3; the Python 2.x version is below).
# For real files you would use f1 = open('some_file', 'rb').read() instead.
f1 = bytes(int(b, 16) for b in 'FF FF FF FF 00 00 00 00 FF FF 44 43 42 41 FF FF'.split())
f2 = bytes(int(b, 16) for b in '41 42 43 44 00 00 00 00 44 43 42 41 40 39 38 37'.split())

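matches = []
# groupby yields runs of consecutive indices sharing the same key (True = bytes equal)
for k, g in groupby(range(min(len(f1), len(f2))), key=lambda i: f1[i] == f2[i]):
    if k:
        pos = next(g)              # first index of the matching run
        length = len(list(g)) + 1  # remaining indices, plus the one consumed by next()
        matches.append((pos, length))
# for the example data above, matches == [(4, 4)]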

Or the same thing as above using a list comprehension:

matches = [(next(g), len(list(g)) + 1)
           for k, g in groupby(range(min(len(f1), len(f2))), key=lambda i: f1[i] == f2[i])
           if k]
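
If you want the result in the shape the question sketched, `[Match(pos=4, length=4)]`, the same groupby idea can be wrapped in a small function. This is only a sketch for Python 3: the names `Match` and `positional_matches` are made up for illustration and are not part of any library.

```python
from collections import namedtuple
from itertools import groupby

# Hypothetical result type, named after the output sketched in the question.
Match = namedtuple("Match", ["pos", "length"])


def positional_matches(a, b):
    """Return runs where a and b hold equal bytes at the same offsets."""
    n = min(len(a), len(b))  # anything past the shorter input is ignored
    result = []
    for equal, group in groupby(range(n), key=lambda i: a[i] == b[i]):
        if equal:
            indices = list(group)  # consecutive offsets of one matching run
            result.append(Match(pos=indices[0], length=len(indices)))
    return result


file_a_data = bytes.fromhex("FF FF FF FF 00 00 00 00 FF FF 44 43 42 41 FF FF")
file_b_data = bytes.fromhex("41 42 43 44 00 00 00 00 44 43 42 41 40 39 38 37")
print(positional_matches(file_a_data, file_b_data))  # [Match(pos=4, length=4)]
```

Collecting each group into a list costs a little memory for long runs, but it avoids the `next(g)` plus `+ 1` bookkeeping.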

Here is the setup for the example if you are using Python 2.x:

f1 = ''.join(chr(int(b, 16)) for b in 'FF FF FF FF 00 00 00 00 FF FF 44 43 42 41 FF FF'.split())
f2 = ''.join(chr(int(b, 16)) for b in '41 42 43 44 00 00 00 00 44 43 42 41 40 39 38 37'.split())
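
The question also asks whether to just write the loops. For comparison, here is a minimal sketch of that alternative, with no groupby involved. The function name `matching_runs` is made up for this example, and it returns the same `(pos, length)` tuples as the code above.

```python
def matching_runs(a, b):
    """Plain-loop version: (pos, length) for equal bytes at equal offsets."""
    runs = []
    start = None  # offset where the current run of equal bytes began
    n = min(len(a), len(b))
    for i in range(n):
        if a[i] == b[i]:
            if start is None:
                start = i  # a new run begins here
        elif start is not None:
            runs.append((start, i - start))  # the run ended just before i
            start = None
    if start is not None:  # a run reaching the end of the shorter input
        runs.append((start, n - start))
    return runs


a = bytes.fromhex("FF FF FF FF 00 00 00 00 FF FF 44 43 42 41 FF FF")
b = bytes.fromhex("41 42 43 44 00 00 00 00 44 43 42 41 40 39 38 37")
print(matching_runs(a, b))  # [(4, 4)]
```

It is longer than the groupby version, but the run boundaries are explicit, which some people find easier to step through in a debugger.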
Hot. I'm loving what you're doing there. I was hoping for a beautiful answer like this. –  omghai2u Apr 3 '13 at 21:47
