Optimizing NumPy code

Question

I'm try to implement SSIM method. This method has already been implemented in Python (source code), but my goal is implement it with using only Python and NumPy. Also my goal is to using this method on big images (1024x1024 and above). But filter2 works very slow (approx. 62 sec. for 1024x1024). cProfile gives me information that _methods.py:16(_sum), fromnumeric.py:1422(sum), and method 'reduce' of 'numpy.ufunc' objects eat main part time of run. Any help and advises welcome.

import numpy

def filter2(window, img):
    range1 = img.shape[0] - window.shape[0] + 1
    range2 = img.shape[1] - window.shape[1] + 1
    res = numpy.zeros((range1, range2), dtype=numpy.double)
    for i in range(range1):
        for j in range(range2):
            res[i,j] = numpy.sum(numpy.multiply(img[i:11+i, j:11+j], window))
    return res

def ssim(img1, img2):
    window = numpy.array([\
        [0.0000,    0.0000, 0.0000, 0.0001, 0.0002, 0.0003, 0.0002, 0.0001, 0.0000, 0.0000, 0.0000],\
        [0.0000,    0.0001, 0.0003, 0.0008, 0.0016, 0.0020, 0.0016, 0.0008, 0.0003, 0.0001, 0.0000],\
        [0.0000,    0.0003, 0.0013, 0.0039, 0.0077, 0.0096, 0.0077, 0.0039, 0.0013, 0.0003, 0.0000],\
        [0.0001,    0.0008, 0.0039, 0.0120, 0.0233, 0.0291, 0.0233, 0.0120, 0.0039, 0.0008, 0.0001],\
        [0.0002,    0.0016, 0.0077, 0.0233, 0.0454, 0.0567, 0.0454, 0.0233, 0.0077, 0.0016, 0.0002],\
        [0.0003,    0.0020, 0.0096, 0.0291, 0.0567, 0.0708, 0.0567, 0.0291, 0.0096, 0.0020, 0.0003],\
        [0.0002,    0.0016, 0.0077, 0.0233, 0.0454, 0.0567, 0.0454, 0.0233, 0.0077, 0.0016, 0.0002],\
        [0.0001,    0.0008, 0.0039, 0.0120, 0.0233, 0.0291, 0.0233, 0.0120, 0.0039, 0.0008, 0.0001],\
        [0.0000,    0.0003, 0.0013, 0.0039, 0.0077, 0.0096, 0.0077, 0.0039, 0.0013, 0.0003, 0.0000],\
        [0.0000,    0.0001, 0.0003, 0.0008, 0.0016, 0.0020, 0.0016, 0.0008, 0.0003, 0.0001, 0.0000],\
        [0.0000,    0.0000, 0.0000, 0.0001, 0.0002, 0.0003, 0.0002, 0.0001, 0.0000, 0.0000, 0.0000]\
    ], dtype=numpy.double)

    K = [0.01, 0.03]
    L = 65535

    C1 = (K[0] * L) ** 2
    C2 = (K[1] * L) ** 2

    mu1 = filter2(window, img1)
    mu2 = filter2(window, img2)

    mu1_sq = numpy.multiply(mu1, mu1)
    mu2_sq = numpy.multiply(mu2, mu2)
    mu1_mu2 = numpy.multiply(mu1, mu2)

    sigma1_sq = filter2(window, numpy.multiply(img1, img1)) - mu1_sq
    sigma2_sq = filter2(window, numpy.multiply(img2, img2)) - mu2_sq
    sigma12 = filter2(window, numpy.multiply(img1, img2)) - mu1_mu2

    ssim_map = numpy.divide(numpy.multiply((2*mu1_mu2 + C1), (2*sigma12 + C2)), numpy.multiply((mu1_sq + mu2_sq + C1),(sigma1_sq + sigma2_sq + C2)))
    return numpy.mean(ssim_map)

def calc_ssim():

    img1 = numpy.array(numpy.zeros((1024,1024)),dtype=numpy.double)
    img2 = numpy.array(numpy.zeros((1024,1024)),dtype=numpy.double)

    return ssim(img1, img2)

Update: After rewrite filter2 function with using as_strided, execute time reduced to ~42.7 sec. But anyway this is very slow in compare with matlab realisation.

def filter2(window, x):
range1 = x.shape[0] - window.shape[0] + 1
range2 = x.shape[1] - window.shape[1] + 1
res = numpy.zeros((range1, range2), dtype=numpy.double)
x1 = as_strided(x,((x.shape[0] - 10)/1 ,(x.shape[1] - 10)/1 ,11,11), (x.strides[0]*1,x.strides[1]*1,x.strides[0],x.strides[1])) * window
for i in xrange(range1):
    for j in xrange(range2):
         res[i,j] = x1[i,j].sum()
return res

hpaulj · Accepted Answer · 2013-09-26 06:07:29Z

In your strided filter2, x1 is (1014, 1014, 11, 11). You are iterating over the 1st 2 dimensions in order to sum on on the last 2. Let sum do all the work for you, res = x1.sum((2,3))

def filter2(window, x):
    range1 = x.shape[0] - window.shape[0] + 1
    range2 = x.shape[1] - window.shape[1] + 1
    x1 = as_strided(x,((x.shape[0] - 10)/1 ,(x.shape[1] - 10)/1 ,11,11), (x.strides[0]*1,x.strides[1]*1,x.strides[0],x.strides[1])) * window
    res = x1.sum((2,3))
    return res

In my tests this gives a 6x speed improvement.

With numpy iteration, especially nested ones over large dimensions like 1014 is a speed killer. You want to vectorize this kind of thing as much as possible.

Traditionally Matlab had the same speed problems, but newer versions recognize and compile loops like yours. That's why your numpy is so much slower.

thank you, for explanation! On my machine execute time reduced to ~4.3 s. — Renat Zaripov, Sep 30 '13 at 5:34

asked	5 months ago
viewed	166 times
active	5 months ago

current community

more stack exchange communities

Optimizing NumPy code

1 Answer

Your Answer

Not the answer you're looking for? Browse other questions tagged python numpy or ask your own question.

Hot Network Questions

current community

more stack exchange communities

Optimizing NumPy code

1 Answer

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged python numpy or ask your own question.

Related

Hot Network Questions