
Suppose I have a pair of numpy arrays X and I that look like this (X is 2D, I is 1D)

X               I
-----------------
3.4  9.13       0
3.5  3.43       1
3.6  2.01       2
3.7  6.11       0
3.8  4.95       1
3.9  7.02       2
4.0  4.41       3
4.1  0.23       0
4.2  0.99       1
4.3  1.02       0
4.4  5.61       1
4.5  7.55       2
4.6  8.10       0
4.7  0.33       2
4.8  0.80       1

I would like to do two things:

  1. Y = indexby(X,I,I0): Given a value I0, find the rows of X which have matching values in I. For example, if I0 is 2, I would want to find the following array:

    3.6  2.01  
    3.9  7.02  
    4.5  7.55  
    4.7  0.33  
    
  2. Y = indexby(X,I): Return a dictionary with all the possible keys k such that Y[k] == indexby(X,I,k). In my example data, this would produce the following:

    Y[0] = 
    3.4  9.13       
    3.7  6.11       
    4.1  0.23       
    4.3  1.02       
    4.6  8.10       
    
    Y[1] = 
    3.5  3.43      
    3.8  4.95      
    4.2  0.99      
    4.4  5.61       
    4.8  0.80  
    
    Y[2] = 
    3.6  2.01  
    3.9  7.02  
    4.5  7.55  
    4.7  0.33  
    
    Y[3] = 
    4.0  4.41
    

Are there numpy functions which do this? I'm not sure what to look for so it's hard to find them.

I know I can do this manually, but for performance reasons I would like to use a builtin numpy function, since the arrays in my application typically have row counts in the 100,000 to 1,000,000 range.

np.where is probably a good place to start. –  Gabriel Jul 25 '14 at 23:12

np.searchsorted is another nice one. –  Gabriel Jul 25 '14 at 23:52

These two are perfectly solved by pandas. –  Happy001 Jul 25 '14 at 23:53

@Happy001 ...could you elaborate? I'm already using pandas elsewhere + can use it here if it makes sense. –  Jason S Jul 26 '14 at 0:16

@JasonS see my answer below. I highly recommend pandas (especially if you've worked with SAS/R). It's based on numpy. –  Happy001 Jul 26 '14 at 2:09

3 Answers

There are some higher-level functions, but let's see how to do it using just the simplest stuff in the library, because you're going to need those simple functions every day.

>>> matches = (I == 2)
>>> matches
array([False, False,  True, False, False,  True, False, False, False,
       False, False,  True, False,  True, False], dtype=bool)    
>>> indices = np.nonzero(matches)
>>> indices
(array([ 2,  5, 11, 13]),)
>>> xvals = X[indices]
>>> xvals
array([[ 3.6 ,  2.01],
       [ 3.9 ,  7.02],
       [ 4.5 ,  7.55],
       [ 4.7 ,  0.33]])

The last step may look confusing. See Indexing in the tutorial for further information.

Once you understand how the == operator and nonzero work, look through the other functions in the same section as nonzero and you should find two shorter ways to do this.
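For instance (a sketch using the arrays from the question), the whole lookup collapses to a single boolean-mask expression, and np.flatnonzero gives the integer indices in one call:

```python
import numpy as np

X = np.array([[3.4, 9.13], [3.5, 3.43], [3.6, 2.01], [3.7, 6.11],
              [3.8, 4.95], [3.9, 7.02], [4.0, 4.41], [4.1, 0.23],
              [4.2, 0.99], [4.3, 1.02], [4.4, 5.61], [4.5, 7.55],
              [4.6, 8.10], [4.7, 0.33], [4.8, 0.80]])
I = np.array([0, 1, 2, 0, 1, 2, 3, 0, 1, 0, 1, 2, 0, 2, 1])

# Boolean-mask indexing in one step -- no explicit nonzero needed
xvals = X[I == 2]

# np.flatnonzero collapses nonzero()'s 1-tuple into a plain index array
indices = np.flatnonzero(I == 2)
```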


First I'll show a nice solution using structured arrays. The linked documentation has lots of good information on various ways to index, sort, and create them.

Let's define a subset of your data,

import numpy as np

X = np.array( [[3.4,9.13], [3.5,3.43], [3.6,2.01], [3.7,6.11], 
               [3.8,4.95], [3.9,7.02], [4.0,4.41]] )

I = np.array( [0,1,2,0,1,2,3], dtype=np.int32 )

Structured Array

If we make a structured array (i.e. an array of structs) from this data, the problem is trivial,

sa = np.zeros( len(X), dtype=[('I',np.int64),('X',np.float64,(2))] )

Here we've made an empty structured array. Each element of the array is a struct holding a 64 bit integer and a 2 element array of 64 bit floats. The list passed to dtype defines the struct, with each tuple representing a component of the struct. The tuples contain a label, a type, and a shape. The shape part is optional and defaults to a scalar entry.

Next we fill the structured array with your data,

sa['I'] = I
sa['X'] = X 

At this point you can access the records like so,

>>> sa['X'][sa['I']==2]
array([[ 3.6 ,  2.01],
       [ 3.9 ,  7.02]])

Here we've asked for all the 'X' records and indexed them using the bool array created by the statement sa['I']==2. The dictionary you want can then be constructed using a comprehension,

d = { i:sa['X'][sa['I']==i] for i in np.unique(sa['I']) }
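Putting those pieces together on the same subset of the data, the comprehension yields one 2D array per key:

```python
import numpy as np

X = np.array([[3.4, 9.13], [3.5, 3.43], [3.6, 2.01], [3.7, 6.11],
              [3.8, 4.95], [3.9, 7.02], [4.0, 4.41]])
I = np.array([0, 1, 2, 0, 1, 2, 3], dtype=np.int32)

# Build and fill the structured array as above
sa = np.zeros(len(X), dtype=[('I', np.int64), ('X', np.float64, (2,))])
sa['I'] = I
sa['X'] = X

# One entry per distinct key; each value is a 2D slice of X
d = {i: sa['X'][sa['I'] == i] for i in np.unique(sa['I'])}
```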

Next are two solutions using standard numpy arrays. The first uses np.where and leaves the arrays unmodified; the second sorts the arrays, which should be faster for large I.

Using np.where

The use of np.where is not strictly necessary as arrays can be indexed using the bool array produced from I==I0 below, but having the actual indices as ints is useful in some circumstances.

def indexby1( X,I,I0 ):
    indx = np.where( I==I0 )
    sub = X[indx[0],:]
    return sub

def indexby2( X,I ):
    d = {}
    I0max = I.max()
    for I0 in range(I0max+1):
        d[I0] = indexby1( X, I, I0 )
    return d

d = indexby2( X, I )

Sorting and pulling out chunks

Alternatively you can use the sorting solution mentioned and just return chunks,

def order_arrays( X, I ):
    indx = I.argsort()
    I = I[indx]
    X = X[indx,:]
    return X, I

def indexby(X, I, I0=None):
    if I0 is None:
        d = {}
        for I0 in range(I.max()+1):
            d[I0] = indexby( X, I, I0 )
        return d
    else:
        ii = I.searchsorted(I0)
        ff = I.searchsorted(I0+1)
        sub = X[ii:ff]
        return sub

X,I = order_arrays( X, I )
d = indexby( X, I )

Here I've combined the two previous functions into one recursive function as you described the signature in your question. This will of course modify the original arrays.
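One caveat with range(I.max()+1): if the keys in I are sparse (say 0, 7, 100), it creates many empty entries. A small variant (my own sketch, not part of the answer above) sorts once and then slices one contiguous chunk per key found by np.unique:

```python
import numpy as np

def indexby_sorted(X, I):
    """Sort once, then slice a contiguous chunk per unique key."""
    order = I.argsort(kind='mergesort')  # stable sort preserves row order within a key
    I_sorted = I[order]
    X_sorted = X[order]
    d = {}
    for key in np.unique(I_sorted):
        # searchsorted finds the chunk boundaries in the sorted key array
        lo = I_sorted.searchsorted(key, side='left')
        hi = I_sorted.searchsorted(key, side='right')
        d[key] = X_sorted[lo:hi]
    return d

X = np.array([[3.4, 9.13], [3.5, 3.43], [3.6, 2.01], [3.7, 6.11],
              [3.8, 4.95], [3.9, 7.02], [4.0, 4.41]])
I = np.array([0, 1, 2, 0, 1, 2, 3])
d = indexby_sorted(X, I)
```

This works on sorted copies, so the original X and I are left untouched.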


If you would like to try pandas, it's really powerful for grouping data. Here's how you can achieve what you need:

In [34]: import numpy as np

In [35]: import pandas as pd

# I defined your X and I already
In [36]: X
Out[36]: 
array([[ 3.4 ,  9.13],
       [ 3.5 ,  3.43],
       [ 3.6 ,  2.01],
       [ 3.7 ,  6.11],
       [ 3.8 ,  4.95],
       [ 3.9 ,  7.02],
       [ 4.  ,  4.41],
       [ 4.1 ,  0.23],
       [ 4.2 ,  0.99],
       [ 4.3 ,  1.02],
       [ 4.4 ,  5.61],
       [ 4.5 ,  7.55],
       [ 4.6 ,  8.1 ],
       [ 4.7 ,  0.33],
       [ 4.8 ,  0.8 ]])

In [37]: I
Out[37]: array([0, 1, 2, 0, 1, 2, 3, 0, 1, 0, 1, 2, 0, 2, 1], dtype=int64)

In [38]: dataframe=pd.DataFrame (data=X, index=I, columns=['X1','X2'])

In [39]: dataframe.index.name='I' #This is not necessary
In [40]: print dataframe
    X1    X2
I           
0  3.4  9.13
1  3.5  3.43
2  3.6  2.01
0  3.7  6.11
1  3.8  4.95
2  3.9  7.02
3  4.0  4.41
0  4.1  0.23
1  4.2  0.99
0  4.3  1.02
1  4.4  5.61
2  4.5  7.55
0  4.6  8.10
2  4.7  0.33
1  4.8  0.80

This defines a dataframe with I as index and X as data. Now if you need rows with I=2, you can do

In [42]: print dataframe.ix[2]
    X1    X2
I           
2  3.6  2.01
2  3.9  7.02
2  4.5  7.55
2  4.7  0.33

If you want to list all groups:

In [43]: for i, grouped_data in dataframe.groupby(level='I'): #without level=, you can group by a regular column like X1
   ....:     print i
   ....:     print grouped_data
   ....:     
0
    X1    X2
I           
0  3.4  9.13
0  3.7  6.11
0  4.1  0.23
0  4.3  1.02
0  4.6  8.10
1
    X1    X2
I           
1  3.5  3.43
1  3.8  4.95
1  4.2  0.99
1  4.4  5.61
1  4.8  0.80
2
    X1    X2
I           
2  3.6  2.01
2  3.9  7.02
2  4.5  7.55
2  4.7  0.33
3
   X1    X2
I          
3   4  4.41

If you just want to see statistics of each group, you can do

In [47]: print dataframe.groupby(level='I').sum() # try other funcs like mean, var, etc.
     X1     X2
I             
0  20.1  24.59
1  20.7  15.78
2  16.7  16.91
3   4.0   4.41
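And if the dictionary-of-arrays form from the question is the end goal, the groups can be pulled back out as plain numpy arrays with a comprehension (a sketch rebuilding the same dataframe, on the subset of keys 0-3):

```python
import numpy as np
import pandas as pd

X = np.array([[3.4, 9.13], [3.5, 3.43], [3.6, 2.01], [3.7, 6.11],
              [3.8, 4.95], [3.9, 7.02], [4.0, 4.41]])
I = np.array([0, 1, 2, 0, 1, 2, 3])

df = pd.DataFrame(data=X, index=I, columns=['X1', 'X2'])
df.index.name = 'I'

# One numpy array per key, matching indexby(X, I) from the question
Y = {k: g.values for k, g in df.groupby(level='I')}
```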
sweet, I've used pandas before, just can never remember the magic functions –  Jason S Jul 26 '14 at 2:34
