Intersection between arrays

Question

I'm a Python beginner (more used to code in R) and I'd like to optimize a function.

2 arrays ares filled by integers from 0 to 100. A number can't appear twice in a row and they're stored in an ascending order.

Array1: nrow = 100 000; ncol = 5
Array2: nrow = 50 000; ncol = 5

For each row of Array1 and each row of Array2 I need to count the number of similar values and store this result in a 3rd array.

Array3: nrow = 100 000; ncol = 50 000

Here is the current function, with a smaller array2 (50 rows instead of 50 000)

array1= np.random.randint(0,100,(100000,5))
array2 = np.random.randint(0,100,(50,5))

def Intersection(array1, array2):
    Intersection = np.empty([ array1.shape[0]  , array2.shape[0] ], dtype=int8)
    for i in range(0, array1.shape[0]):
        for j in range(0, array2.shape[0]):
            Intersection[i,j] = len( set(array1[i,]).intersection(array2[j,]) )
    return Intersection

import time
start = time.time()
Intersection(array1,array2)
end = time.time()
print end - start

23.46 sec. So it should take hours if array2 has 50 000 rows.

How can I optimize this function by keeping it simple to understand?

It would be fair to point out you posted about the same question on Stack Overflow: stackoverflow.com/q/27133165 It saves people the trouble of coming up with solutions that are already posted. — moarningsun, Dec 1 '14 at 12:12
Hi moarningsun, Sorry if my question was unclear, but I wasn't expected a similar solution than the one jme posted. I've just wanted to improve my own solution in order to compare both approaches and improve my numpy skills and you helped with the 1st part of your answer. But you're right, it would have saved you time if I pointed out. My bad ! — sdata, Dec 11 '14 at 15:06

moarningsun · Accepted Answer · 2014-11-30 12:30:48Z

The main optimization I can think of is precomputing the sets. If array1 has shape [m,n] and array2 is [q,p], the original code is creating a set 2*m*q times. You can get this down to m+q.

In addition, you could follow the Python style guide in terms of naming conventions and whitespace, which will help working together with other Python programmers. Also it's nice to loop over the data structure itself instead of over a range of indices.

This would result in for example:

import numpy as np

def intersection(array1, array2):
    intersection = np.empty([array1.shape[0], array2.shape[0]], dtype=np.int8)
    array2_sets = map(set, array2)
    for i,row1 in enumerate(array1):
        set1 = set(row1)
        for j, set2 in enumerate(array2_sets):
            intersection[i,j] = len(set1.intersection(set2))
    return intersection

If you can forget about "keeping it simple to understand" you could also convert your data to sparse matrices and take the matrix product, e.g.:

from scipy.sparse import csr_matrix

def intersection2(A, B, sparse=True):
    """The rows of A and B must contain unique, relatively 
    small integers
    """
    assert min(A.shape[1], B.shape[1]) < 256  # to fit inside uint8
    z = max(A.max(), B.max()) + 1

    def sparsify(A):
        m, n = A.shape
        indptr = np.arange(0, m*n+1, n)
        data = np.ones(m*n, dtype=np.uint8)
        return csr_matrix((data, A.ravel(), indptr), shape=(m,z))

    intersection = sparsify(A) * sparsify(B).T
    return intersection if sparse else intersection.todense()

Thanks a lot moarningsun. I'm gonna test it right now! If I understand one a the bottleneck is the sets creation. For an alternative solution, could it be smart to: 1/ Avoid these set creations by using a function working on arrays like np.intersect1d ? 2/ Use the np.vectorize or the np.apply_along_axis function ? 3/ I forgot to mention in my data, each row of the arrays are in an ascending order, i.e 17-4-25-86-11 is NOT possible. Could I take avantage of this ? — sdata, Dec 1 '14 at 0:28
@sdata; No I don't think these ideas would speed it up much 1) the rows are too short 2) vectorize and apply_along_axis are just convenience functions 3) the range of values is to broad I think. — moarningsun, Dec 1 '14 at 12:09

asked	9 months ago
viewed	379 times
active	9 months ago

current community

your communities

more stack exchange communities

Intersection between arrays

1 Answer 1

Your Answer

Not the answer you're looking for? Browse other questions tagged python performance numpy or ask your own question.

Hot Network Questions

current community

your communities

more stack exchange communities

Intersection between arrays

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged python performance numpy or ask your own question.

Related

Hot Network Questions