Diffing Binary Files In Python

Question

I've got two binary files. They look something like this, but the data is more random:

File A:

FF FF FF FF 00 00 00 00 FF FF 44 43 42 41 FF FF ...

File B:

41 42 43 44 00 00 00 00 44 43 42 41 40 39 38 37 ...

What I'd like is to call something like:

>>> someDiffLib.diff(file_a_data, file_b_data)

And receive something like:

[Match(pos=4, length=4)]

This would indicate that both files contain the same 4 bytes starting at position 4. The sequence 44 43 42 41 should not match, because it sits at a different offset in each file.

Is there a library that will do the diff for me? Or should I just write the loops to do the comparison?

docs.python.org/2/library/difflib.html - first result in google for "diff in python" — Andrey, Apr 3 '13 at 21:26
possible duplicate of difference between two strings in python/php — Andrey, Apr 3 '13 at 21:27
@Andrey thanks, I tried that, but it appears that get_matching_blocks() doesn't check whether the bytes are in the same spot in each file, just that the sequence exists in both files. Otherwise, yeah, that's pretty much what I want. — omghai2u, Apr 3 '13 at 21:28
So you want to get a list of every position where a match starts and the length of that match, and you don't care about sections of the file that would match if they were lined up properly? — Kyle Strand, Apr 3 '13 at 21:30
@KyleStrand yes, I think so. Although I'm not sure what "lined up properly" would mean in this case. In my example above, I do not want the 44 43 42 41 to match because they're in different positions; if that's what you mean. — omghai2u, Apr 3 '13 at 21:31
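
The behaviour described in the comments above can be reproduced with a minimal sketch, assuming Python 3 and the sample bytes from the question. The names file_a_data, file_b_data and shifted are illustrative only. difflib.SequenceMatcher.get_matching_blocks() returns (a, b, size) triples, and a block whose a and b offsets differ is a sequence found at different positions in the two inputs:

```python
from difflib import SequenceMatcher

# Sample data from the question (bytes.fromhex ignores the spaces).
file_a_data = bytes.fromhex('FF FF FF FF 00 00 00 00 FF FF 44 43 42 41 FF FF')
file_b_data = bytes.fromhex('41 42 43 44 00 00 00 00 44 43 42 41 40 39 38 37')

matcher = SequenceMatcher(None, file_a_data, file_b_data, autojunk=False)
blocks = matcher.get_matching_blocks()

# Blocks that match content but start at different offsets in each input.
shifted = [blk for blk in blocks if blk.size and blk.a != blk.b]
for blk in shifted:
    print('a offset', blk.a, 'b offset', blk.b, 'size', blk.size)
```

Keeping only the blocks where a == b is not a reliable substitute for a position-by-position comparison, since difflib is built to find the longest common runs rather than every aligned run.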

Andrew Clark · Accepted Answer · 2013-04-03 21:49:54Z

You can use itertools.groupby() for this. Here is an example:

from itertools import groupby

# Set up some example byte strings (Python 3; the Python 2.x version is below).
# For real files you would instead use f1 = open('some_file', 'rb').read()
f1 = bytes(int(b, 16) for b in 'FF FF FF FF 00 00 00 00 FF FF 44 43 42 41 FF FF'.split())
f2 = bytes(int(b, 16) for b in '41 42 43 44 00 00 00 00 44 43 42 41 40 39 38 37'.split())

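matches = []
# Group consecutive indices by whether the bytes at that index are equal.
# Only the length of the shorter input is compared.
for k, g in groupby(range(min(len(f1), len(f2))), key=lambda i: f1[i] == f2[i]):
    if k:  # this group is a run of equal bytes
        pos = next(g)                # first index of the run
        length = len(list(g)) + 1    # remaining indices plus the one consumed above
        matches.append((pos, length))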

Or the same thing as above using a list comprehension:

matches = [(next(g), len(list(g)) + 1)
           for k, g in groupby(range(min(len(f1), len(f2))), key=lambda i: f1[i] == f2[i])
           if k]

Here is the setup for the example if you are using Python 2.x:

f1 = ''.join(chr(int(b, 16)) for b in 'FF FF FF FF 00 00 00 00 FF FF 44 43 42 41 FF FF'.split())
f2 = ''.join(chr(int(b, 16)) for b in '41 42 43 44 00 00 00 00 44 43 42 41 40 39 38 37'.split())
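
To get output in the shape the question asks for, the same groupby() approach can be wrapped in a function. This is only a sketch, assuming Python 3. The Match namedtuple and the function name positional_matches are hypothetical names chosen for illustration, not part of any library:

```python
from collections import namedtuple
from itertools import groupby

# Hypothetical result type mirroring the output shown in the question.
Match = namedtuple('Match', ['pos', 'length'])


def positional_matches(a, b):
    """Return runs where a and b hold equal bytes at the same index."""
    n = min(len(a), len(b))  # bytes past the shorter input are ignored
    matches = []
    for equal, group in groupby(range(n), key=lambda i: a[i] == b[i]):
        if equal:
            indices = list(group)
            matches.append(Match(pos=indices[0], length=len(indices)))
    return matches


file_a_data = bytes.fromhex('FF FF FF FF 00 00 00 00 FF FF 44 43 42 41 FF FF')
file_b_data = bytes.fromhex('41 42 43 44 00 00 00 00 44 43 42 41 40 39 38 37')

print(positional_matches(file_a_data, file_b_data))
# prints [Match(pos=4, length=4)]
```

Materialising each group with list() keeps the code simple, at the cost of holding one run's worth of indices in memory. For very large files, counting the group lazily or comparing in chunks may be preferable.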

Hot. I'm loving what you're doing there. I was hoping for a beautiful answer like this. — omghai2u, Apr 3 '13 at 21:47
