NumPy Scipy optimization

Question

I have an optimization issue and I'm not sure if I can improve the overall speed of my function.

The function draw_w is my actual implementation that gives me the right update for my vector w. You can see that the code is running for 1000 iterations in 9s. I had a first implementation (draw_w_fw) that was quiet good because, well-vectorized, that is an order of magnitude quicker by 0.8s. Unfortunately, this implementation is wrong because the cache[0] is not updated correctly.

I would like to know if someone knows a way to speed up draw_w from what I've wrote.

import scipy.sparse as sps
import numpy as np
import timeit


def draw_w(X, w, cache):

    X = X.tocsr()

    for i in xrange(w.shape[0]):
        x_li = X.getrow(i)
        Y = w[i] * x_li
        Y.data -= np.take(cache[0], Y.indices) 
        h = x_li.multiply(-Y)

        w_mean = h.sum()
        w_sigma_sqr = (x_li.multiply(x_li)).sum() 

        w_sigma_sqr = 1.0 / w_sigma_sqr
        w_mean = - w_sigma_sqr * w_mean

        # update w:
        w_old = np.copy(w[i])

        if np.isinf(w[i]):
            w[i] = 0
        elif np.isnan(w[i]):
            w[i] = 0
        else:
            w[i] = w_mean

        # update error:
        cache[0] -= (w_old - w[i]) * x_li

    #print 'draw_w', w
    #True result w [ 2.34125626  2.37726446  4.00792293  3.71059779  4.00792293  0.11100713
    # -0.28899287 -0.04393113  0.21429929]

####################


def draw_w_fw(X, w, cache):

    x_li = X.tocsr()
    nnz_per_row = np.diff(x_li.indptr)
    Y = sps.csr_matrix((x_li.data * np.repeat(w, nnz_per_row), x_li.indices, x_li.indptr), shape=x_li.shape)
    Y.data -= np.take(cache[0], Y.indices)   #not good because cache[0] is updated...
    h = x_li.multiply(-Y)

    w_mean = np.asarray(h.sum(axis=1).transpose())[0]
    w_sigma_sqr = np.asarray(( x_li.multiply(x_li) ).sum(axis=1).transpose())[0]

    w_sigma_sqr = 1.0 / w_sigma_sqr
    w_mean = - w_sigma_sqr * w_mean

    # update w:
    w_old = np.copy(w)

    w[~np.isnan(w) & ~np.isinf(w)] = w_mean[~np.isnan(w) & ~np.isinf(w)]

    w[np.isinf(w)] = 0.0
    w[np.isnan(w)] = 0.0

    # update error:
    cache[0] -= (w_old - w) * x_li

    #print 'draw_w_fw', w

##########################


def test():

    data = [ 1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.,1.]
    cols = [ 0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12,13,13,14,14]
    rows = [0,5,1,5,2,5,3,5,4,5,0,6,1,6,2,6,3,6,4,6,1,7,3,7,0,8,2,8,4,8]

    X = sps.coo_matrix((data, (rows, cols)), shape = (9,15))
    X = X.tocsr()

    w = np.array([ 0.16243454, -0.06117564, -0.05281718, -0.10729686,  0.08654076, -0.23015387,  0.17448118, -0.07612069,  0.03190391])

    cache = np.zeros((2, 15))
    cache[0] =  np.asarray([-5.06771933, -5.29132951, -4.28297104, -1.33745073, -2.14361311, -0.66308429,
     -0.88669446, -2.878336,  -4.93281569, -4.73897806, -1.13729633, -5.18341755,
     -0.80566155, -5.02091327, -4.88155533])
    #draw_w(X, w, cache)
    draw_w_fw(X, w, cache)

if __name__ == '__main__':
    print(timeit.timeit("test()", setup="from __main__ import test", number=1000))
    #draw_w:    9.26 for 1000 iterations
    #draw_w_fw: 0.80 for 1000 iterations

Jaime · Accepted Answer · 2013-11-02 15:40:34Z

Before streamlining the code, it almost always pays off to streamline the math. The h you calculate is the dot product of two vectors, X[i] and Y[i] = cache[0] - w[i]*X[i], so we can rewrite it as

h = X[i].cache[0] - w[i]*X[i].X[i]

When you go on to calculate w_mean, you are actually negating h and then dividing it by a number that turns out to be X[i].X[i], so you can calculate w_mean as

w_mean = w[i] - X[i].cache[0] / X[i].X[i]

Setting aside the np.isnan and np.isinf checks you are doing, you then update cache[0] with a value where w[i] cancels out, and you are basically doing:

cache[0] -= X[i] * (X[i].cache[0] / X[i].X[i])

Putting it all together into a single function, I got this very compact code:

from numpy.lib.stride_tricks import as_strided

def draw_w_bis(X, w, cache):
    X = X.tocsr()
    x_rows_sqr = np.add.reduceat(X.data*X.data, X.indptr[:-1])
    rows, cols = X.shape
    row_start_stop = as_strided(X.indptr, shape=(rows, 2),
                                strides=2*X.indptr.strides)
    for row, (start, stop) in enumerate(row_start_stop):
        data = X.data[start:stop]
        cols = X.indices[start:stop]
        delta = np.dot(data, cache[0, cols]) / x_rows_sqr[row]
        w[row] -= delta
        cache[0, cols] -= delta * data

If you are not comfortable with calling as_strided you could replace the definition of row_start_stop with:

row_start_stop = np.column_stack((X.indptr[:-1], X.indptr[1:]))

On my system your test code runs the 1000 iterations over an order of magnitude faster (6 sec vs. 0.4 sec) which seems to be what you were after. There still is a for loop in there, so if you are going to run this on large arrays, it is still going to be slow, and you may want to consider using things like Cython.

There are 2 parts to your speedup. 1) the last 3 lines that streamline the inner loop. 2) iteration using as_strided(X.indptr...). The streamlining saves about 20% when plugged into my fastest pure python version. Using strided instead of lil in my cython code drops the time from 0.43 to 0.41. — hpaulj, Nov 17 '13 at 23:33

hpaulj · Answer 2 · 2013-10-31 04:03:07Z

up vote 0 down vote

At least with your test X, draw_x_fw works with 2 calls. The rows of X fall into two groups, each with a unique set of columns. I don't know if that's a structural part of the problem, or just a random characteristic your chosen test.

    n = 5
    draw_w_fw(X[:n,:], w[:n], cache) # n items
    draw_w_fw(X[n:,:], w[n:], cache)

With simple slicing like this, w[:n] is a view, so changes to w inside the function are visible outside it. If we do a more general row selection, we have to explicitly set the new w values.

edited Oct 31 '13 at 4:03

answered Oct 30 '13 at 18:48

hpaulj
29635

This part is a coincidence. On real dataset, I can't just find some slicing like that but good point. – Thierry Silbermann Oct 31 '13 at 11:33

add a comment |

hpaulj · Answer 3 · 2013-11-01 22:01:28Z

This is a simplified version of draw_w, working with the data for one row.

def draw_np(Xdata, Xindices, w, cache):
    # Xdata, Xindices are from a csr row
    # w scalar
    # cache vector
    Ydata = w * Xdata
    Ydata -=  np.take(cache, Xindices)
    h = - Xdata * Ydata
    w_mean = h.sum()
    w_sigma_sqr = (Xdata*Xdata).sum()
    w_mean = - w_mean / w_sigma_sqr
    cache[Xindices] -= (w - w_mean) * Xdata
    return  w_mean

for i in xrange(w.size):
    xli = X.getrow(i)
    w[i] = draw_np(xli.data, xli.indices , w[i], cache[0])

Compared to the 9 sec for draw_w, this is 1.6 sec. I've omitted the isInf and isNan tests (which should apply to w_mean), but I don't think that's significant. My guess is the speed comes from working with the simpler numpy data vectors rather than the full sparse matrix.

I wrote this simplified Python version as a step toward a faster (hopefully) cython function:

def draw_cy(Xdata, Xindices, w, cache):
    cdef double w_mean
    cdef double w_sigma_sqr
    cdef double ydata
    cdef double value
    cdef int size
    cdef int ind
    size = Xdata.shape[0]
    w_mean = 0.0
    w_sigma_sqr = 0.0
    for i in range(size):
        value = Xdata[i]
        ind = Xindices[i]
        ydata = w * value
        ydata -= cache[ind]
        ydata = - value * ydata
        w_mean += ydata
        w_sigma_sqr += value * value
    w_mean = - w_mean / w_sigma_sqr
    for i in range(size):
        value = Xdata[i]
        ind = Xindices[i]
        cache[ind] -= (w - w_mean) * value
    return  w_mean

# requires a compile to .c and .so and import

for i in xrange(w.size):
    xli = X.getrow(i)
    w[i] = draw_cy(xli.data, xli.indices , w[i], cache[0])

This times at 1.1 sec, nearly as good as the fully vectorized (but erroneous) draw_w_fw. But the speedup compared to draw_np isn't that great.

more on timings. I pulled out various pieces from this last iteration, and deduced that

bare loop (pass) took 0.27 sec  (that includes setup)
9 calls to getrow took 0.64 sec
9 calls to draw_cy took 0.10 sec
9 calls to draw_np took 0.60 sec

It's significant that the X.getrow() took as long as the barebones numpy draw_np calculation. My impression is that scipy.sparse is a convenient way of setting up sparse matricies, and a good way of doing large linear algebra problems (e.g. FEM), but it is not a fast way of doing general matrix operations.

I tried a dense numpy iterator:

Xa = X.A
for i in xrange(w.size):
    x1 = Xa[i,:]; nz = x1.nonzero()[0]; x1=x1[nz]
    w[i] = cy.draw_cy(x1, nz, w[i], cache[0])

The time for this is 0.62 sec., 0.5 less than the getrow version. Using draw_np gives a similar time savings.

In other words, if X isn't so large that you can't store it as a dense array, it might be faster to skip the sparse module entirely. But I haven't explored how these timings vary with the size of X and w.

lil iteration is even a bit faster, since its data and rows are matching lists:

Xl = X.tolil()
for i, (xd, xr) in enumerate(zip(Xl.data, Xl.rows)):
    #w[i] = cy.draw_cy(np.array(xd), xr, w[i], cache[0])
    w[i] = cpy.draw_py(np.array(xd), xr , w[i], cache[0])

If the X.data are all 1 (as in the test case), the calculations can be simplified even more (only X.rows has useful information).

I tested it on my computer and don't have the same speed up. I have 3.75 so it's definitely an improve but still too slow for what I need. I'm impress by the speedup for so little change. I'll definitely take a look at Cython and give you the bounty if nobody else answer. — Thierry Silbermann, Oct 31 '13 at 16:13
Further testing shows that X.getrow() takes a nontrivial amount of time. — hpaulj, Nov 1 '13 at 4:26

asked	11 months ago
viewed	399 times
active	10 months ago

current community

your communities

more stack exchange communities

NumPy Scipy optimization

3 Answers 3

Your Answer

Not the answer you're looking for? Browse other questions tagged python optimization numpy or ask your own question.

Hot Network Questions

current community

your communities

more stack exchange communities

NumPy Scipy optimization

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged python optimization numpy or ask your own question.

Related

Hot Network Questions