
Given the following 2-column array, I want to select items from the second column that correspond to "edges" in the first column. This is just an example, as in reality my a has potentially millions of rows. So, ideally I'd like to do this as fast as possible, and without creating intermediate results.

import numpy as np
a = np.array([[1,4],[1,2],[1,3],[2,6],[2,1],[2,8],[2,3],[2,1],
              [3,6],[3,7],[5,4],[5,9],[5,1],[5,3],[5,2],[8,2],
              [8,6],[8,8]])

i.e. I want to find the result,

desired = np.array([4,6,6,4,2])

which is entries in a[:,1] corresponding to where a[:,0] changes.

One solution is,

b = a[(a[1:,0]-a[:-1,0]).nonzero()[0]+1, 1]

which gives np.array([6,6,4,2]); I could simply prepend the first item, no problem. However, this creates an intermediate array of the indexes where each new group starts.

c = [a[i+1,1] for i,(x,y) in enumerate(zip(a[1:,0],a[:-1,0])) if x!=y]

This also gives [6,6,4,2]. Assuming a generator-based zip (true in Python 3), this doesn't need to create an intermediate representation and should be very memory efficient. However, the inner loop is not numpy, and it necessitates generating a list which must be subsequently turned back into a numpy array.

Can you come up with a numpy-only version with the memory efficiency of c but the speed efficiency of b? Ideally only one pass over a is needed.

(Note that measuring the speed won't help much here unless a is very big, so I wouldn't bother benchmarking this; I just want something that is theoretically fast and memory efficient. For example, you can assume rows in a are streamed from a file and are slow to access, which is another reason to avoid the b solution, as it requires a second random-access pass over a.)

Edit: a way to generate a large a matrix for testing:

from itertools import repeat
N, M = 100000, 100
a = np.array(list(zip([x for y in zip(*repeat(np.arange(N),M)) for x in y], np.random.random(N*M))))  # list() needed in Python 3, where zip is lazy
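
For what it's worth, here is a sketch of an equivalent construction that skips the Python-level zip entirely; np.repeat produces the same "each value repeated M times" layout directly:

import numpy as np
N, M = 100000, 100
# column 0: each value 0..N-1 repeated M times; column 1: random floats
a = np.column_stack((np.repeat(np.arange(N), M), np.random.random(N * M)))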
  • I guess another more general question is, simply, "How do I perform streaming (generator-like) operations over numpy arrays?" (without turning them into lists!) Commented Jul 20, 2016 at 14:10
  • "...as fast as possible, and without creating intermediate results..." These are sometimes conflicting goals. Which is more important, best performance or minimizing memory use? Commented Jul 20, 2016 at 15:36
  • Well, it's not so much about not "wasting memory", but rather about being able to operate on array sizes that might not fit into memory at all, while not sacrificing too much speed. (E.g. arrays from memory-mapped files.) It's a shame that converting via numpy.fromiter seems to mean a 10x sacrifice in speed. Commented Jul 20, 2016 at 17:22

3 Answers


I'm afraid that if you are looking to do this in a vectorized way, you can't avoid an intermediate array; there is no built-in for it.

Now, let's look for vectorized approaches other than nonzero() that might be more performant. Following the same differencing idea as the original code's (a[1:,0]-a[:-1,0]), we can use boolean indexing with a mask of the non-zero differences, which correspond to "edges" or shifts.

Thus, we would have a vectorized approach like so -

a[np.append(True,np.diff(a[:,0])!=0),1]
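
On the sample array from the question, this reproduces the desired output, first row included:

a[np.append(True, np.diff(a[:,0]) != 0), 1]
# -> array([4, 6, 6, 4, 2])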

Runtime test

The original solution a[(a[1:,0]-a[:-1,0]).nonzero()[0]+1,1] skips the first row, but for the sake of timing let's treat it as a valid result. Here are the runtimes for it against the solution proposed in this post:

In [118]: from itertools import repeat
     ...: N, M = 100000, 2
     ...: a = np.array(zip([x for y in zip(*repeat(np.arange(N),M))\
                              for x in y ], np.random.random(N*M)))
     ...: 

In [119]: %timeit a[(a[1:,0]-a[:-1,0]).nonzero()[0]+1,1]
100 loops, best of 3: 6.31 ms per loop

In [120]: %timeit a[1:][np.diff(a[:,0])!=0,1]
100 loops, best of 3: 4.51 ms per loop

Now, let's say you want to include the first row too. With the same setup as above, the updated runtimes look like this:

In [124]: %timeit a[np.append(0,(a[1:,0]-a[:-1,0]).nonzero()[0]+1),1]
100 loops, best of 3: 6.8 ms per loop

In [125]: %timeit a[np.append(True,np.diff(a[:,0])!=0),1]
100 loops, best of 3: 5 ms per loop
  • Both of those solutions are very similar to my b, in that they create intermediate arrays. Commented Jul 20, 2016 at 14:23
  • @Steve By any means would you know the unique elements in the first col beforehand? So, in the sample case, it would be [1,2,3,5,8]. Commented Jul 20, 2016 at 14:30
  • Nope, the whole point is that they must be detected. However, if I were writing a C-like loop, I could easily do this without making extra copies of the array, e.g. for x in a: y = x[1] if x[0] != last_x[0]; last_x = x. I'm trying to figure out how to do this efficiently using numpy. Commented Jul 20, 2016 at 14:36
  • @Steve I don't think you can avoid that intermediate with a vectorized NumPy way. Is this search for memory efficiency motivated by some memory error for your actual case with your nonzero approach? Commented Jul 20, 2016 at 14:46
  • Ok, I'll accept "impossible" as the answer. Shame that numpy doesn't allow binary selection criteria without generating an intermediate array, but I guess that's the reality without adding a function to numpy. Ultimately such a function would need to be able to guess the final size of the array, so perhaps the idea is fundamentally incompatible with numpy. Commented Jul 20, 2016 at 17:20

OK, actually I found a solution; I just learned about np.fromiter, which can build a numpy array from a generator:

d = np.fromiter((a[i+1,1] for i,(x,y) in enumerate(zip(a[1:,0],a[:-1,0])) if x!=y), int)
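
(An aside, going by np.fromiter's documented signature: it also accepts an optional count argument, so if the number of edges were known up front the output could be preallocated exactly rather than grown. That wouldn't address the main cost here, though, which is the Python-level iterator itself.)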

I think this does it: it generates a numpy array without any intermediate arrays. The caveat, however, is that it does not seem to be all that efficient! Forgetting what I said in the question about not benchmarking:

t = [lambda a: a[(a[1:,0]-a[:-1,0]).nonzero()[0]+1, 1],
     lambda a: np.array([a[i+1,1] for i,(x,y) in enumerate(zip(a[1:,0],a[:-1,0])) if x!=y]),
     lambda a: np.fromiter((a[i+1,1] for i,(x,y) in enumerate(zip(a[1:,0],a[:-1,0])) if x!=y), int)]

from timeit import Timer
[Timer(lambda: x(a)).timeit(number=10) for x in t]  # wrap in a lambda: Timer needs a callable or a string

[0.16596235800034265, 1.811289312000099, 2.1662971739997374]

It seems the first solution is drastically faster! I assume this is because even though it generates intermediate data, it performs the inner loop entirely in numpy, while the other two run Python code for each item in the array.

Like I said, this is why I'm not sure this kind of benchmarking makes sense here -- if accesses to a were much slower, the workload wouldn't be CPU-bound. Thoughts?
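
One compromise I can think of, sketched below under the assumption that a is read sequentially in blocks (e.g. from a memory-mapped file): process fixed-size chunks with the vectorized expression, carrying the last key of each chunk over so edges on chunk boundaries aren't missed. Memory use is bounded by the chunk size and the inner loop stays in numpy:

import numpy as np

def edges_chunked(a, chunksize=65536):
    """Yield the a[:,1] values where a[:,0] changes, one chunk at a time."""
    prev = None  # last column-0 value of the previous chunk
    for start in range(0, len(a), chunksize):
        chunk = a[start:start + chunksize]
        mask = np.empty(len(chunk), dtype=bool)
        # the very first row of the whole array always counts as an "edge"
        mask[0] = prev is None or chunk[0, 0] != prev
        mask[1:] = chunk[1:, 0] != chunk[:-1, 0]
        prev = chunk[-1, 0]
        yield chunk[mask, 1]

d = np.concatenate(list(edges_chunked(a)))

All intermediates here are chunk-sized; the only full-size traversal is the sequential read itself.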

Not "accepting" this answer since I am hoping someone can come up with something faster.


If memory efficiency is your concern, that can be solved: the only intermediate of the same size-order as the input data can be of type bool (a[1:,0] != a[:-1,0]), and if your input data is int32, that mask is 8 times smaller than a itself. You can also count the nonzeros of that boolean array to preallocate the output array, though that should not be very significant if the output of the != is as sparse as your example suggests.
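
A sketch of that idea (assuming, as in the question, that the first row should be included in the output):

mask = a[1:, 0] != a[:-1, 0]      # bool mask: 1 byte per row, the only big intermediate
out = np.empty(np.count_nonzero(mask) + 1, dtype=a.dtype)
out[0] = a[0, 1]                  # the first row is always an "edge"
out[1:] = a[1:, 1][mask]          # the fancy-indexing temp here is only edge-sized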

  • I agree. I didn't go that way because it has the same big-O space, but it's true that it's quite a lot smaller to deal with boolean arrays. Commented Jul 20, 2016 at 21:25
