
Suppose I have a pair of numpy arrays X and I that look like this (X is 2D, I is 1D)

X               I
-----------------
3.4  9.13       0
3.5  3.43       1
3.6  2.01       2
3.7  6.11       0
3.8  4.95       1
3.9  7.02       2
4.0  4.41       3
4.1  0.23       0
4.2  0.99       1
4.3  1.02       0
4.4  5.61       1
4.5  7.55       2
4.6  8.10       0
4.7  0.33       2
4.8  0.80       1

I would like to do two things:

  1. Y = indexby(X,I,I0): Given a value I0, find the rows of X which have matching values in I. For example, if I0 is 2, I would want to find the following array:

    3.6  2.01  
    3.9  7.02  
    4.5  7.55  
    4.7  0.33  
    
  2. Y = indexby(X,I): Return a dictionary with all the possible keys k such that Y[k] == indexby(X,I,k). In my example data, this would produce the following:

    Y[0] = 
    3.4  9.13       
    3.7  6.11       
    4.1  0.23       
    4.3  1.02       
    4.6  8.10       
    
    Y[1] = 
    3.5  3.43      
    3.8  4.95      
    4.2  0.99      
    4.4  5.61       
    4.8  0.80  
    
    Y[2] = 
    3.6  2.01  
    3.9  7.02  
    4.5  7.55  
    4.7  0.33  
    
    Y[3] = 
    4.0  4.41
    

Are there numpy functions which do this? I'm not sure what to look for so it's hard to find them.

I know I can do this manually, but for performance reasons I would like to use a builtin numpy function, since the arrays in my application typically have row counts in the 100,000 to 1,000,000 range.

np.where is probably a good place to start. –  Gabriel Jul 25 '14 at 23:12

np.searchsorted is another nice one. –  Gabriel Jul 25 '14 at 23:52

These two are perfectly solved by pandas. –  Happy001 Jul 25 '14 at 23:53

@Happy001 ...could you elaborate? I'm already using pandas elsewhere + can use it here if it makes sense. –  Jason S Jul 26 '14 at 0:16

@JasonS see my answer below. I highly recommend pandas (especially if you've worked with SAS/R). It's based on numpy. –  Happy001 Jul 26 '14 at 2:09

3 Answers

There are some higher-level functions, but let's see how to do it using just the simplest stuff in the library, because you're going to need those simple functions every day.

>>> matches = (I == 2)
>>> matches
array([False, False,  True, False, False,  True, False, False, False,
       False, False,  True, False,  True, False], dtype=bool)    
>>> indices = np.nonzero(matches)
>>> indices
(array([ 2,  5, 11, 13]),)
>>> xvals = X[indices]
>>> xvals
array([[ 3.6 ,  2.01],
       [ 3.9 ,  7.02],
       [ 4.5 ,  7.55],
       [ 4.7 ,  0.33]])

The last step may look confusing. See Indexing in the tutorial for further information.

Once you understand how the == operator and nonzero work, look through the other functions in the same section as nonzero and you should find two shorter ways to do this.
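For instance (a sketch using the arrays from the question), the whole lookup collapses to a single boolean-mask expression, and np.flatnonzero gives the integer indices in one call:

```python
import numpy as np

X = np.array([[3.4, 9.13], [3.5, 3.43], [3.6, 2.01], [3.7, 6.11],
              [3.8, 4.95], [3.9, 7.02], [4.0, 4.41], [4.1, 0.23],
              [4.2, 0.99], [4.3, 1.02], [4.4, 5.61], [4.5, 7.55],
              [4.6, 8.10], [4.7, 0.33], [4.8, 0.80]])
I = np.array([0, 1, 2, 0, 1, 2, 3, 0, 1, 0, 1, 2, 0, 2, 1])

# Boolean-mask indexing in one step -- no explicit nonzero needed
xvals = X[I == 2]

# np.flatnonzero collapses nonzero()'s 1-tuple into a plain index array
indices = np.flatnonzero(I == 2)
```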


First I'll show a nice solution using structured arrays. The linked documentation has lots of good information on various ways to index, sort, and create them.

Let's define a subset of your data,

import numpy as np

X = np.array( [[3.4,9.13], [3.5,3.43], [3.6,2.01], [3.7,6.11], 
               [3.8,4.95], [3.9,7.02], [4.0,4.41]] )

I = np.array( [0,1,2,0,1,2,3], dtype=np.int32 )

Structured Array

If we make a structured array (i.e. an array of structs) from this data, the problem is trivial,

sa = np.zeros( len(X), dtype=[('I',np.int64),('X',np.float64,(2))] )

Here we've made an empty structured array. Each element of the array is a struct holding a 64 bit integer and a 2 element array of 64 bit floats. The list passed to dtype defines the struct, with each tuple representing a component of the struct. The tuples contain a label, a type, and a shape. The shape part is optional and defaults to a scalar entry.

Next we fill the structured array with your data,

sa['I'] = I
sa['X'] = X 

At this point you can access the records like so,

>>> sa['X'][sa['I']==2]
array([[ 3.6 ,  2.01],
       [ 3.9 ,  7.02]])

Here we've asked for all the 'X' records and indexed them using the bool array created by the statement sa['I']==2. The dictionary you want can then be constructed using a comprehension,

d = { i:sa['X'][sa['I']==i] for i in np.unique(sa['I']) }
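Putting those pieces together on the same subset of the data, the comprehension yields one 2D array per key:

```python
import numpy as np

X = np.array([[3.4, 9.13], [3.5, 3.43], [3.6, 2.01], [3.7, 6.11],
              [3.8, 4.95], [3.9, 7.02], [4.0, 4.41]])
I = np.array([0, 1, 2, 0, 1, 2, 3], dtype=np.int32)

# Build and fill the structured array as above
sa = np.zeros(len(X), dtype=[('I', np.int64), ('X', np.float64, (2,))])
sa['I'] = I
sa['X'] = X

# One entry per distinct key; each value is a 2D slice of X
d = {i: sa['X'][sa['I'] == i] for i in np.unique(sa['I'])}
```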

Next are two solutions using standard numpy arrays. The first uses np.where and leaves the arrays unmodified; the second sorts the arrays, which should be faster for large I.

Using np.where

The use of np.where is not strictly necessary as arrays can be indexed using the bool array produced from I==I0 below, but having the actual indices as ints is useful in some circumstances.

def indexby1( X,I,I0 ):
    indx = np.where( I==I0 )
    sub = X[indx[0],:]
    return sub

def indexby2( X,I ):
    d = {}
    I0max = I.max()
    for I0 in range(I0max+1):
        d[I0] = indexby1( X, I, I0 )
    return d

d = indexby2( X, I )

Sorting and pulling out chunks

Alternatively you can use the sorting solution mentioned and just return chunks,

def order_arrays( X, I ):
    indx = I.argsort()
    I = I[indx]
    X = X[indx,:]
    return X, I

def indexby(X, I, I0=None):
    if I0 is None:
        d = {}
        for I0 in range(I.max()+1):
            d[I0] = indexby( X, I, I0 )
        return d
    else:
        ii = I.searchsorted(I0)
        ff = I.searchsorted(I0+1)
        sub = X[ii:ff]
        return sub

X,I = order_arrays( X, I )
d = indexby( X, I )

Here I've combined the two previous functions into one recursive function as you described the signature in your question. This will of course modify the original arrays.
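One caveat with range(I.max()+1): if the keys in I are sparse (say 0, 7, 100), it creates many empty entries. A small variant (my own sketch, not part of the answer above) sorts once and then slices one contiguous chunk per key found by np.unique:

```python
import numpy as np

def indexby_sorted(X, I):
    """Sort once, then slice a contiguous chunk per unique key."""
    order = I.argsort(kind='mergesort')  # stable sort preserves row order within a key
    I_sorted = I[order]
    X_sorted = X[order]
    d = {}
    for key in np.unique(I_sorted):
        # searchsorted finds the chunk boundaries in the sorted key array
        lo = I_sorted.searchsorted(key, side='left')
        hi = I_sorted.searchsorted(key, side='right')
        d[key] = X_sorted[lo:hi]
    return d

X = np.array([[3.4, 9.13], [3.5, 3.43], [3.6, 2.01], [3.7, 6.11],
              [3.8, 4.95], [3.9, 7.02], [4.0, 4.41]])
I = np.array([0, 1, 2, 0, 1, 2, 3])
d = indexby_sorted(X, I)
```

This works on sorted copies, so the original X and I are left untouched.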


If you would like to try pandas, it's really powerful for grouping data. Here's how you can achieve what you need:

In [34]: import numpy as np

In [35]: import pandas as pd

# I defined your X and I already
In [36]: X
Out[36]: 
array([[ 3.4 ,  9.13],
       [ 3.5 ,  3.43],
       [ 3.6 ,  2.01],
       [ 3.7 ,  6.11],
       [ 3.8 ,  4.95],
       [ 3.9 ,  7.02],
       [ 4.  ,  4.41],
       [ 4.1 ,  0.23],
       [ 4.2 ,  0.99],
       [ 4.3 ,  1.02],
       [ 4.4 ,  5.61],
       [ 4.5 ,  7.55],
       [ 4.6 ,  8.1 ],
       [ 4.7 ,  0.33],
       [ 4.8 ,  0.8 ]])

In [37]: I
Out[37]: array([0, 1, 2, 0, 1, 2, 3, 0, 1, 0, 1, 2, 0, 2, 1], dtype=int64)

In [38]: dataframe=pd.DataFrame (data=X, index=I, columns=['X1','X2'])

In [39]: dataframe.index.name='I' #This is not necessary
In [40]: print dataframe
    X1    X2
I           
0  3.4  9.13
1  3.5  3.43
2  3.6  2.01
0  3.7  6.11
1  3.8  4.95
2  3.9  7.02
3  4.0  4.41
0  4.1  0.23
1  4.2  0.99
0  4.3  1.02
1  4.4  5.61
2  4.5  7.55
0  4.6  8.10
2  4.7  0.33
1  4.8  0.80

This defines a dataframe with I as index and X as data. Now if you need rows with I=2, you can do

In [42]: print dataframe.ix[2]
    X1    X2
I           
2  3.6  2.01
2  3.9  7.02
2  4.5  7.55
2  4.7  0.33

If you want to list all groups:

In [43]: for i, grouped_data in dataframe.groupby(level='I'): #without level=, you can group by a regular column like X1
   ....:     print i
   ....:     print grouped_data
   ....:     
0
    X1    X2
I           
0  3.4  9.13
0  3.7  6.11
0  4.1  0.23
0  4.3  1.02
0  4.6  8.10
1
    X1    X2
I           
1  3.5  3.43
1  3.8  4.95
1  4.2  0.99
1  4.4  5.61
1  4.8  0.80
2
    X1    X2
I           
2  3.6  2.01
2  3.9  7.02
2  4.5  7.55
2  4.7  0.33
3
   X1    X2
I          
3   4  4.41

If you just want to see statistics of each group, you can do

In [47]: print dataframe.groupby(level='I').sum() # try other funcs like mean, var, etc.
     X1     X2
I             
0  20.1  24.59
1  20.7  15.78
2  16.7  16.91
3   4.0   4.41
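And if the dictionary-of-arrays form from the question is the end goal, the groups can be pulled back out as plain numpy arrays with a comprehension (a sketch rebuilding the same dataframe, on the subset of keys 0-3):

```python
import numpy as np
import pandas as pd

X = np.array([[3.4, 9.13], [3.5, 3.43], [3.6, 2.01], [3.7, 6.11],
              [3.8, 4.95], [3.9, 7.02], [4.0, 4.41]])
I = np.array([0, 1, 2, 0, 1, 2, 3])

df = pd.DataFrame(data=X, index=I, columns=['X1', 'X2'])
df.index.name = 'I'

# One numpy array per key, matching indexby(X, I) from the question
Y = {k: g.values for k, g in df.groupby(level='I')}
```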
sweet, I've used pandas before, just can never remember the magic functions –  Jason S Jul 26 '14 at 2:34
