For a bioinformatic problem, I found myself wanting to create a list of strings within Hamming distance "k" of a reference sequence. I wanted to do so quickly and pythonically. I've implemented it in pure python and in cython, with and without type declarations. The time performance has been identical. (I also compared the compiled python version with an interpreted version defined in ipython, which also performed identically.)
cfi has been set as shorthand to chain.from_iterable
in order to reduce the number of dot operators being used as in the following module-level import and definition:
from itertools import chain
cfi = chain.from_iterable
@cython.returns(list)
def PermuteMotifOnce(cython.str motif, set alphabet={"A", "C", "G", "T"}):
"""
Gets all strings within hamming distance 1 of motif and returns it as a
list.
"""
return list(set(cfi([[
motif[:pos] + alpha + motif[pos + 1:] for
alpha in alphabet] for
pos in range(len(motif))])))
def PyPermuteMotifOnce(motif, alphabet={"A", "C", "G", "T"}):
"""
Gets all strings within hamming distance 1 of motif and returns it as a
list.
"""
return list(set(cfi([[
motif[:pos] + alpha + motif[pos + 1:] for
alpha in alphabet] for
pos in range(len(motif))])))
@cython.returns(list)
def PermuteMotifN(cython.str motif, cython.long n=-1):
assert n > 0
cdef set workingSet
cdef cython.long i
workingSet = {motif}
for i in range(n):
workingSet = set(cfi(map(PermuteMotifOnce, workingSet)))
return list(workingSet)
def PyPermuteMotifN(motif, n=-1):
assert n > 0
workingSet = {motif}
for i in range(n):
workingSet = set(cfi(map(PermuteMotifOnce, workingSet)))
return list(workingSet)
Results:
motif = "ACCTACTGAACT" %timeit -n 5 PermuteMotifN(motif, 6) 5 loops, best of 3: 6.93s per loop %timeit -n 5 PyPermuteMotifN(motif, 6) 5 loops, best of 3: 6.81s per loop %timeit -n 5000 PyPermuteMotifN(motif, 2) 5000 loops, best of 3: 589 microseconds per loop %timeit -n 5000 PermuteMotifN(motif, 2) 5000 loops, best of 3: 645 microseconds per loop
Is it just me, or does the pure Python seem even faster than the Cython? Is there a lot of time lost in extra type checking?