Is my algorithm slow because it has problems? Or, is any bitwise common prefix solution not going to give me the performance I'm looking for?

After profiling my algorithm, I found that over 60% of the time was spent on one line len(os.path.commonprefix((item1, item2))). I'm only looking for the length of the prefix

To solve this I tried to write a bitwise prefix solution

def bit_prefix(a, b):
    min_len = min(len(a), len(b))
    if min_len > 0:
        x = str(bin(
            int(a[::-1].encode('hex'), 16) ^ int(b[::-1].encode('hex'), 16)))
        y = x.strip('0')
        if len(y) is 1:
            return min_len
        else:
            return (len(x) - len(y)) / 8
    else:
        return 0

I've only gotten a marginal improvement in speed with long prefixes

a = 'a' * 1000000 + 'z'

b = 'a' * 900000 + 'z'

timeit.timeit(lambda: bit_prefix(a, b), number=100)
Out[34]: 6.340372534867129

timeit.timeit(lambda: len(os.path.commonprefix((a, b))), number=100)
Out[35]: 7.5483549056534684

print bit_prefix(a, b), len(os.path.commonprefix((a, b)))
900000 900000

And my algorithm performs more poorly with short prefixes

a = 'aaz'

b = 'az'

timeit.timeit(lambda: bit_prefix(a, b), number=1000000)
Out[42]: 3.968956086175467

timeit.timeit(lambda: len(os.path.commonprefix((a, b))), number=1000000)
Out[43]: 1.1592788235707303

print bit_prefix(a, b), len(os.path.commonprefix((a, b)))
1 1

If my algorithm isn't broken and a bitwise solution won't give me the performance boost I'm looking for, can you refer me to a common prefix solution that would outperform os.path.commonprefix?

Here's the bit_pefix profile

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    29                                           @profile
    30                                           def bit_prefix(a, b):
    31     36140        81099      2.2      4.2      min_len = min(len(a), len(b))
    32     36140        49386      1.4      2.5      if min_len > 0:
    33     36140        49739      1.4      2.6          x = str(
    34     36140        47846      1.3      2.5              bin(
    35     36140        47232      1.3      2.4              int(
    36     36140        89499      2.5      4.6              a[::-1]
    37     36140       242994      6.7     12.5              .encode('hex'),
    38     36140       164442      4.6      8.4              16)
    39                                                       ^ 
    40     36140        49601      1.4      2.5              int(
    41     36140        88745      2.5      4.6              b[::-1]
    42     36140       216488      6.0     11.1              .encode('hex'),
    43     36140       504425     14.0     25.9              16)))
    44     36140       187571      5.2      9.6          y = x.strip('0')
    45     36140        61027      1.7      3.1          if len(y) is 1:
    46                                                       return min_len
    47                                                   else:
    48     36140        67507      1.9      3.5              return (len(x) - len(y)) / 8
    49                                               else:
    50                                                   return 0

Update

I've created another algorithm that seems to be faster than others. Here's the Code Review for that.

share|improve this question
1  
I don't think you'll get anything that is much faster. I thought that using enumerate() with itertools.izip() would be blisteringly fast since they return generators, so a minimum of characters are iterated, but even that came out with just barely better or just barely worse: about the same. – zondo Mar 28 '16 at 12:14
    
@zondo Thanks for izip suggestion, it was better than everything else I found online! I tested it in another Code Review post. – mattkaeo Mar 29 '16 at 5:16

Your Answer

 
discard

By posting your answer, you agree to the privacy policy and terms of service.

Browse other questions tagged or ask your own question.