Given the following 2-column array, I want to select items from the second column that correspond to "edges" in the first column. This is just an example, as in reality my a
has potentially millions of rows. So, ideally I'd like to do this as fast as possible, and without creating intermediate results.
import numpy as np
a = np.array([[1,4],[1,2],[1,3],[2,6],[2,1],[2,8],[2,3],[2,1],
[3,6],[3,7],[5,4],[5,9],[5,1],[5,3],[5,2],[8,2],
[8,6],[8,8]])
i.e. I want to find the result,
desired = np.array([4,6,6,4,2])
which is entries in a[:,1]
corresponding to where a[:,0]
changes.
One solution is,
b = a[(a[1:,0]-a[:-1,0]).nonzero()[0]+1, 1]
which gives np.array([6,6,4,2])
, I could simply prepend the first item, no problem. However, this creates an intermediate array of the indexes of the first items. I could avoid the intermediate by using a list comprehension:
c = [a[i+1,1] for i,(x,y) in enumerate(zip(a[1:,0],a[:-1,0])) if x!=y]
This also gives [6,6,4,2]
. Assuming a generator-based zip
(true in Python 3), this doesn't need to create an intermediate representation and should be very memory efficient. However, the inner loop is not numpy, and it necessitates generating a list which must be subsequently turned back into a numpy array.
Can you come up with a numpy-only version with the memory efficiency of c
but the speed efficiency of b
? Ideally only one pass over a
is needed.
(Note that measuring the speed won't help much here, unless a
is very big, so I wouldn't bother with benchmarking this, I just want something that is theoretically fast and memory efficient. For example, you can assume rows in a
are streamed from a file and are slow to access -- another reason to avoid the b
solution, as it requires a second random-access pass over a
.)
Edit: a way to generate a large a
matrix for testing:
from itertools import repeat
N, M = 100000, 100
a = np.array(zip([x for y in zip(*repeat(np.arange(N),M)) for x in y ], np.random.random(N*M)))
numpy.fromiter
seems to mean a 10x sacrifice in speed. – Steve Jul 20 at 17:22