I have a NumPy 2D array as shown below:

data.dat

X1  X2  X3  X4
1   1   1   1
2   2   4   2
3   3   9   3
4   4   16  4
5   5   25  5
6   6   36  6
7   nan 49  7
8   nan 64  8
9   nan 81  nan
10  nan nan nan

Now how do I output the last non-nan element of each column of the array? I tried, without success:

A[~np.isnan(A)][-1]

Code used

import numpy as np
with open('data.dat', "r") as data:
    while True:
        line = data.readline()
        if not line.startswith('#'):
            break
    data_header = [i for i in line.strip().split('\t') if i]
A = np.genfromtxt('data.dat', names=data_header, dtype=float, delimiter='\t')
Please post the actual array or something that users can at least copy paste. – Ashwini Chaudhary May 24 '15 at 20:44
What are you doing in the while loop? Wouldn't arr = np.genfromtxt("data.dat",names=True,delimiter = '\t') do the same thing? – Padraic Cunningham May 24 '15 at 21:08

If A were a plain NumPy array of dtype 'float' (instead of a structured array) then you could use

import numpy as np
nan = np.nan
A = np.array([[  1.,   1.,   1.,   1.],
              [  2.,   2.,   4.,   2.],
              [  3.,   3.,   9.,   3.],
              [  4.,   4.,  16.,   4.],
              [  5.,   5.,  25.,   5.],
              [  6.,   6.,  36.,   6.],
              [  7.,  nan,  49.,   7.],
              [  8.,  nan,  64.,   8.],
              [  9.,  nan,  81.,  nan],
              [ 10.,  nan,  nan,  nan]])

print(A[(~np.isnan(A)).cumsum(axis=0).argmax(axis=0), np.arange(A.shape[1])])

which yields

array([ 10.,   6.,  81.,   8.])

Given a structured array, such as the one generated by

import numpy as np

# per Padraic Cunningham's suggestion (genfromtxt opens the file itself)
A = np.genfromtxt("data.dat", names=True, delimiter='\t')

I think the easiest way to obtain the desired result is to view the structured array as a plain NumPy array of dtype 'float':

B = A.view('float').reshape(A.shape[0], -1)

and then proceed as before:

print(B[(~np.isnan(B)).cumsum(axis=0).argmax(axis=0), np.arange(B.shape[1])])

How it works:

Given a plain NumPy array of dtype 'float', such as

In [357]: B
Out[357]: 
array([[ nan,  nan,  nan,  nan],
       [  1.,   1.,   1.,   1.],
       [  2.,   2.,   4.,   2.],
       [  3.,   3.,   9.,   3.],
       [  4.,   4.,  16.,   4.],
       [  5.,   5.,  25.,   5.],
       [  6.,   6.,  36.,   6.],
       [  7.,  nan,  49.,   7.],
       [  8.,  nan,  64.,   8.],
       [  9.,  nan,  81.,  nan],
       [ 10.,  nan,  nan,  nan]])

we can negate np.isnan to find where the non-nan values are:

In [358]: ~np.isnan(B)
Out[358]: 
array([[False, False, False, False],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True, False,  True,  True],
       [ True, False,  True,  True],
       [ True, False,  True, False],
       [ True, False, False, False]], dtype=bool)

Now we can use cumsum to compute a cumulative sum for each column (False is treated as 0, True as 1):

In [359]: (~np.isnan(B)).cumsum(axis=0)
Out[359]: 
array([[ 0,  0,  0,  0],
       [ 1,  1,  1,  1],
       [ 2,  2,  2,  2],
       [ 3,  3,  3,  3],
       [ 4,  4,  4,  4],
       [ 5,  5,  5,  5],
       [ 6,  6,  6,  6],
       [ 7,  6,  7,  7],
       [ 8,  6,  8,  8],
       [ 9,  6,  9,  8],
       [10,  6,  9,  8]])

Notice that the maximum value in each column is first reached at the row of the last True in that column.

Therefore, we can find the index of the first occurrence of the maximum value in each column by using np.argmax:

In [360]: (~np.isnan(B)).cumsum(axis=0).argmax(axis=0)
Out[360]: array([10,  6,  9,  8])

This gives, for each column, the row index of its last non-nan value. To find the corresponding values in the array we can then use:

In [361]: B[(~np.isnan(B)).cumsum(axis=0).argmax(axis=0), np.arange(B.shape[1])]
Out[361]: array([ 10.,   6.,  81.,   8.])
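One caveat (an assumption on my part, not addressed in the answer above): if a column contains no non-nan value at all, its cumulative sum stays at 0, argmax returns row 0, and the result is simply whatever sits in the first row. A sketch that marks such columns explicitly with np.where:

```python
import numpy as np

nan = np.nan
B = np.array([[ 1., nan],
              [ 2., nan],
              [nan, nan]])  # second column is all-nan

idx = (~np.isnan(B)).cumsum(axis=0).argmax(axis=0)
last = B[idx, np.arange(B.shape[1])]

# force nan for columns that contain no non-nan value at all
has_value = (~np.isnan(B)).any(axis=0)
last = np.where(has_value, last, np.nan)

print(last)  # [ 2. nan]
```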
Would it be better using pandas? df = pd.read_csv("data.dat", header=0, comment="#",delimiter="\t"), then use .values? – Padraic Cunningham May 24 '15 at 21:44
@PadraicCunningham: Yes, for large CSV, pd.read_csv is faster than np.genfromtxt. See wesmckinney.com/blog/… for some discussion and benchmarks. The benchmarks there do not use np.genfromtxt, but I did make a big CSV for this problem using df = pd.concat([pd.read_table('data', sep='\s+')]*10000, ignore_index=True) and found pd.read_csv to be about 5x faster than np.genfromtxt. – unutbu May 24 '15 at 21:57
Interesting, thanks. – Padraic Cunningham May 24 '15 at 22:01
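To sketch the pandas route from the comments (the data is inlined via io.StringIO here so the example is self-contained; the file layout is assumed to match the question's): forward-filling and taking the last row yields each column's last non-nan value.

```python
import io
import pandas as pd

data = io.StringIO(
    "X1\tX2\tX3\tX4\n"
    "1\t1\t1\t1\n"
    "7\tnan\t49\t7\n"
    "10\tnan\tnan\tnan\n"
)
df = pd.read_csv(data, sep="\t")

# forward-fill nan values; the last row then holds
# the last non-nan value seen in each column
last = df.ffill().iloc[-1]
print(last.values)
```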

You could try:

import numpy as np

ss = """1   1   1   1
        2   2   4   2
        3   3   9   3
        4   4   16  4
        5   5   25  5
        6   6   36  6
        7   nan 49  7
        8   nan 64  8
        9   nan 81  nan
        10  nan nan nan"""

# build matrix from the string:
AA = np.array([[float(c) for c in rw.split()]
               for rw in ss.splitlines()])

# extract the last finite element of each column:
cl_finite = np.array([cl[np.isfinite(cl)][-1] for cl in AA.T])
print(cl_finite)

This works fine, provided every column has at least one non-nan element.

In [36]: [l[~np.isnan(l)][-1] for l in A.transpose()]
Out[36]: [10.0, 6.0, 81.0, 8.0]
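For the all-nan case mentioned above, the comprehension would fail: boolean filtering leaves an empty array, and indexing it with [-1] raises an IndexError. A quick demonstration:

```python
import numpy as np

col = np.array([np.nan, np.nan])
vals = col[~np.isnan(col)]   # empty array for an all-nan column
print(vals.size)             # prints 0; vals[-1] would raise IndexError
```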
