Given the following 2-column array, I want to select items from the second column that correspond to "edges" in the first column. This is just an example, as in reality my a has potentially millions of rows. So, ideally I'd like to do this as fast as possible, and without creating intermediate results.

import numpy as np
a = np.array([[1,4],[1,2],[1,3],[2,6],[2,1],[2,8],[2,3],[2,1],
              [3,6],[3,7],[5,4],[5,9],[5,1],[5,3],[5,2],[8,2],
              [8,6],[8,8]])

i.e. I want to find the result,

desired = np.array([4,6,6,4,2])

which is entries in a[:,1] corresponding to where a[:,0] changes.

One solution is,

b = a[(a[1:,0]-a[:-1,0]).nonzero()[0]+1, 1]

which gives np.array([6,6,4,2]); I could simply prepend the first item, no problem. However, this creates an intermediate array of the indices of the first items. I could avoid the intermediate by using a list comprehension:

c = [a[i+1,1] for i,(x,y) in enumerate(zip(a[1:,0],a[:-1,0])) if x!=y]

This also gives [6,6,4,2]. Assuming a generator-based zip (true in Python 3), this doesn't need to create an intermediate representation and should be very memory efficient. However, the inner loop is not numpy, and it necessitates generating a list which must be subsequently turned back into a numpy array.

Can you come up with a numpy-only version with the memory efficiency of c but the speed efficiency of b? Ideally only one pass over a is needed.

(Note that measuring the speed won't help much here, unless a is very big, so I wouldn't bother with benchmarking this, I just want something that is theoretically fast and memory efficient. For example, you can assume rows in a are streamed from a file and are slow to access -- another reason to avoid the b solution, as it requires a second random-access pass over a.)
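One way to phrase the streaming version directly is itertools.groupby feeding np.fromiter: a single forward pass that never materialises an index array. This is a sketch, not the numpy-only answer being asked for, since the per-group work is still Python-level:

```python
import numpy as np
from itertools import groupby

a = np.array([[1,4],[1,2],[1,3],[2,6],[2,1],[2,8],[2,3],[2,1],
              [3,6],[3,7],[5,4],[5,9],[5,1],[5,3],[5,2],[8,2],
              [8,6],[8,8]])

def first_of_each_group(rows):
    """Yield the second column of the first row of each run of equal
    first-column values. Accepts any iterable of rows, so it can
    consume a file or memory-mapped array in one streaming pass."""
    for _, group in groupby(rows, key=lambda r: r[0]):
        yield next(group)[1]

d = np.fromiter(first_of_each_group(a), dtype=a.dtype)
print(d)  # [4 6 6 4 2]
```

The trade-off is the same as with the list comprehension: memory-light, but the inner loop runs in the interpreter rather than in numpy.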

Edit: a way to generate a large a matrix for testing:

from itertools import repeat
N, M = 100000, 100
a = np.array(list(zip([x for y in zip(*repeat(np.arange(N),M)) for x in y ], np.random.random(N*M))))
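As a sketch, the same test matrix can be built more directly with np.repeat and np.column_stack, avoiding the nested zips (and the Python-2/3 zip pitfalls):

```python
import numpy as np

N, M = 100000, 100
# First column: each value 0..N-1 repeated M times; second column: random.
a = np.column_stack((np.repeat(np.arange(N), M), np.random.random(N * M)))
print(a.shape)  # (10000000, 2)
```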
I guess another more general question is, simply, "How do I perform streaming (generator-like) operations over numpy arrays?" (without turning them into lists!) – Steve Jul 20 at 14:10
"...as fast as possible, and without creating intermediate results..." These are sometimes conflicting goals. Which is more important, best performance or minimizing memory use? – Warren Weckesser Jul 20 at 15:36
Well, it's not so much about not "wasting memory", but rather being able to operate on array sizes that might not fit into memory at all, while not sacrificing too much speed. (E.g. arrays from memory-mapped files.) It's a shame that converting via numpy.fromiter seems to mean a 10x sacrifice in speed. – Steve Jul 20 at 17:22

I am afraid that if you are looking to do this in a vectorized way, you can't avoid an intermediate array, as there is no built-in for it.

Now, let's look for vectorized approaches other than nonzero() that might be more performant. Following the same idea of differencing as the original code's (a[1:,0]-a[:-1,0]), we can use boolean indexing after looking for the non-zero differences, which correspond to "edges" or shifts.

Thus, we would have a vectorized approach like so -

a[np.append(True,np.diff(a[:,0])!=0),1]
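As a minimal check, reusing the sample array from the question: the mask is True at the first row of each run (including row 0, thanks to the prepended True), so the full desired result comes out in one boolean-indexing step:

```python
import numpy as np

a = np.array([[1,4],[1,2],[1,3],[2,6],[2,1],[2,8],[2,3],[2,1],
              [3,6],[3,7],[5,4],[5,9],[5,1],[5,3],[5,2],[8,2],
              [8,6],[8,8]])

mask = np.append(True, np.diff(a[:, 0]) != 0)  # True at the start of each run
print(a[mask, 1])  # [4 6 6 4 2]
```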

Runtime test

The original solution a[(a[1:,0]-a[:-1,0]).nonzero()[0]+1,1] would skip the first row, but let's say, for the sake of timing, that it's a valid result. Here are the runtimes for it against the proposed solution in this post -

In [118]: from itertools import repeat
     ...: N, M = 100000, 2
     ...: a = np.array(list(zip([x for y in zip(*repeat(np.arange(N),M))\
                                   for x in y ], np.random.random(N*M))))
     ...: 

In [119]: %timeit a[(a[1:,0]-a[:-1,0]).nonzero()[0]+1,1]
100 loops, best of 3: 6.31 ms per loop

In [120]: %timeit a[1:][np.diff(a[:,0])!=0,1]
100 loops, best of 3: 4.51 ms per loop

Now, let's say you want to include the first row too. The updated runtimes would look something like this -

In [123]: from itertools import repeat
     ...: N, M = 100000, 2
     ...: a = np.array(list(zip([x for y in zip(*repeat(np.arange(N),M))\
                                   for x in y ], np.random.random(N*M))))
     ...: 

In [124]: %timeit a[np.append(0,(a[1:,0]-a[:-1,0]).nonzero()[0]+1),1]
100 loops, best of 3: 6.8 ms per loop

In [125]: %timeit a[np.append(True,np.diff(a[:,0])!=0),1]
100 loops, best of 3: 5 ms per loop
Both of those solutions are very similar to my b, in that they create intermediate arrays. – Steve Jul 20 at 14:23
@Steve By any means would you know the unique elements in the first col beforehand? So, in the sample case, it would be [1,2,3,5,8]. – Divakar Jul 20 at 14:30
Nope, the whole point that they must be detected. However, if I were writing a C-like loop, I could easily do this without making extra copies of the array.. e.g. for x in a: y = x[1] iff x[0] != last_x[0]; last_x = x. I'm trying to figure out how to do this efficiently using numpy. – Steve Jul 20 at 14:36
@Steve I don't think you can avoid that intermediate with a vectorized NumPy way. Is this search for memory efficiency motivated by some memory error for your actual case with your nonzero approach? – Divakar Jul 20 at 14:46
@Steve Check out the edits please. – Divakar Jul 20 at 15:24

OK, actually I found a solution: I just learned about np.fromiter, which can build a numpy array from a generator:

d = np.fromiter((a[i+1,1] for i,(x,y) in enumerate(zip(a[1:,0],a[:-1,0])) if x!=y), int)

I think this does it: it generates a numpy array without any intermediate arrays. However, the caveat is that it does not seem to be all that efficient! Forgetting what I said in the question about testing:

t = [lambda a: a[(a[1:,0]-a[:-1,0]).nonzero()[0]+1, 1],
     lambda a: np.array([a[i+1,1] for i,(x,y) in enumerate(zip(a[1:,0],a[:-1,0])) if x!=y]),
     lambda a: np.fromiter((a[i+1,1] for i,(x,y) in enumerate(zip(a[1:,0],a[:-1,0])) if x!=y), int)]

from functools import partial
from timeit import Timer
# Timer needs a callable (or a statement string), not a precomputed result
[Timer(partial(f, a)).timeit(number=10) for f in t]

[0.16596235800034265, 1.811289312000099, 2.1662971739997374]

It seems the first solution is drastically faster! I assume this is because, even though it generates intermediate data, it can perform the inner loop completely in numpy, while the others run Python code for each item in the array.

Like I said, this is why I'm not sure this kind of benchmarking makes sense here -- if accesses to a were much slower, the benchmark wouldn't be CPU-bound. Thoughts?

Not "accepting" this answer since I am hoping someone can come up with something faster.


I think what you are looking for is the convolution operation, which np.convolve performs. The convolve function in most libraries is optimised to avoid explicit Python loops, using Cython and BLAS (I think it's the same with numpy, although I haven't seen the code). It appears much faster than the results in the answers above, though be aware the array being timed may not be the same size. Here is the code:

import numpy as np     
kernel = np.array([1, -1])
%timeit a[np.where(np.convolve(a[:,0], kernel, mode='same'))][:,1]
The slowest run took 7.26 times longer than the fastest. 
This could mean that an intermediate result is being cached.
100000 loops, best of 3: 10.4 µs per loop

Although it warns that intermediate results may be cached and the run times might be biased, it is still very fast compared to using explicit loops.
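For what it's worth, a quick sanity check of the convolution approach on the question's 18-row sample array (the µs timing above is presumably on an array of about this size, so it is not directly comparable to the ms timings on the 200k-row test matrix):

```python
import numpy as np

a = np.array([[1,4],[1,2],[1,3],[2,6],[2,1],[2,8],[2,3],[2,1],
              [3,6],[3,7],[5,4],[5,9],[5,1],[5,3],[5,2],[8,2],
              [8,6],[8,8]])

kernel = np.array([1, -1])
# With mode='same' and a length-2 kernel, output[0] is a[0, 0] itself and
# output[i] is a[i, 0] - a[i-1, 0], so the first row is picked up as long
# as a[0, 0] != 0.
edges = a[np.where(np.convolve(a[:, 0], kernel, mode='same'))][:, 1]
print(edges)  # [4 6 6 4 2]
```

Note the convolution output is itself an intermediate array of the same length as a, so this does not sidestep the memory concern from the question.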

That's exactly the same as using ediff1d. – Steve Jul 20 at 17:16

If memory efficiency is your concern, that can be addressed: the only intermediate of the same size-order as the input data can be made of type bool (a[1:,0] != a[:-1, 0]); and if your input data is int32, that is 8 times smaller than a itself. You can count the nonzeros of that boolean array to preallocate the output array as well, though that should not be very significant either if the output of the != is as sparse as your example suggests.
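A sketch of that idea on the question's sample array, combining the bool intermediate with an exact-size preallocated output (the variable names are illustrative):

```python
import numpy as np

a = np.array([[1,4],[1,2],[1,3],[2,6],[2,1],[2,8],[2,3],[2,1],
              [3,6],[3,7],[5,4],[5,9],[5,1],[5,3],[5,2],[8,2],
              [8,6],[8,8]])

# The only full-size intermediate is a bool array (1 byte per row)
changed = a[1:, 0] != a[:-1, 0]

# Preallocate the exact-size output: one slot per edge, plus the first row
out = np.empty(np.count_nonzero(changed) + 1, dtype=a.dtype)
out[0] = a[0, 1]
out[1:] = a[1:, 1][changed]
print(out)  # [4 6 6 4 2]
```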

I agree. I didn't go that way because it has the same big-O space, but it's true that it's quite a lot smaller to deal with boolean arrays. – Steve Jul 20 at 21:25
