I have a NumPy 2D array as shown below:

data.dat

X1  X2  X3  X4
1   1   1   1
2   2   4   2
3   3   9   3
4   4   16  4
5   5   25  5
6   6   36  6
7   nan 49  7
8   nan 64  8
9   nan 81  nan
10  nan nan nan

Now how do I output the last non-nan element of each column of the array? I tried, without success:

A[~np.isnan(A)][-1]

Code used

import numpy as np
with open('data.dat', "r") as data:
    while True:
        line = data.readline()
        if not line.startswith('#'):
            break
    data_header = [i for i in line.strip().split('\t') if i]
A = np.genfromtxt('data.dat', names=data_header, dtype=float, delimiter='\t')
Please post the actual array or something that users can at least copy paste. – Ashwini Chaudhary May 24 '15 at 20:44
What are you doing in the while loop? Wouldn't arr = np.genfromtxt("data.dat",names=True,delimiter = '\t') do the same thing? – Padraic Cunningham May 24 '15 at 21:08

If A were a plain NumPy array of dtype 'float' (instead of a structured array) then you could use

import numpy as np
nan = np.nan
A = np.array([[  1.,   1.,   1.,   1.],
              [  2.,   2.,   4.,   2.],
              [  3.,   3.,   9.,   3.],
              [  4.,   4.,  16.,   4.],
              [  5.,   5.,  25.,   5.],
              [  6.,   6.,  36.,   6.],
              [  7.,  nan,  49.,   7.],
              [  8.,  nan,  64.,   8.],
              [  9.,  nan,  81.,  nan],
              [ 10.,  nan,  nan,  nan]])

print(A[(~np.isnan(A)).cumsum(axis=0).argmax(axis=0), np.arange(A.shape[1])])

which yields

array([ 10.,   6.,  81.,   8.])

Given a structured array, such as the one generated by

import numpy as np

# per Padraic Cunningham's suggestion (genfromtxt opens the file itself)
A = np.genfromtxt("data.dat", names=True, delimiter='\t')

I think the easiest way to obtain the desired result is to view the structured array as a plain NumPy array of dtype 'float':

B = A.view('float').reshape(A.shape[0], -1)

and then proceed as before:

print(B[(~np.isnan(B)).cumsum(axis=0).argmax(axis=0), np.arange(B.shape[1])])

How it works:

Given a plain NumPy array of dtype 'float', such as

In [357]: B
Out[357]: 
array([[ nan,  nan,  nan,  nan],
       [  1.,   1.,   1.,   1.],
       [  2.,   2.,   4.,   2.],
       [  3.,   3.,   9.,   3.],
       [  4.,   4.,  16.,   4.],
       [  5.,   5.,  25.,   5.],
       [  6.,   6.,  36.,   6.],
       [  7.,  nan,  49.,   7.],
       [  8.,  nan,  64.,   8.],
       [  9.,  nan,  81.,  nan],
       [ 10.,  nan,  nan,  nan]])

we can negate np.isnan to find where the non-nan values are:

In [358]: ~np.isnan(B)
Out[358]: 
array([[False, False, False, False],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True, False,  True,  True],
       [ True, False,  True,  True],
       [ True, False,  True, False],
       [ True, False, False, False]], dtype=bool)

Now we can use cumsum to compute a cumulative sum for each column (False is treated as 0, True as 1):

In [359]: (~np.isnan(B)).cumsum(axis=0)
Out[359]: 
array([[ 0,  0,  0,  0],
       [ 1,  1,  1,  1],
       [ 2,  2,  2,  2],
       [ 3,  3,  3,  3],
       [ 4,  4,  4,  4],
       [ 5,  5,  5,  5],
       [ 6,  6,  6,  6],
       [ 7,  6,  7,  7],
       [ 8,  6,  8,  8],
       [ 9,  6,  9,  8],
       [10,  6,  9,  8]])

Notice that the maximum value in each column is first reached at the row of the last True in that column.

Therefore, we can find the index of the first occurrence of the maximum value in each column by using np.argmax:

In [360]: (~np.isnan(B)).cumsum(axis=0).argmax(axis=0)
Out[360]: array([10,  6,  9,  8])

This gives, for each column, the row index of its last non-nan value. To find the corresponding values in the array we can then use:

In [361]: B[(~np.isnan(B)).cumsum(axis=0).argmax(axis=0), np.arange(B.shape[1])]
Out[361]: array([ 10.,   6.,  81.,   8.])
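One caveat (an assumption on my part, not addressed in the answer above): if a column contains no non-nan value at all, its cumulative sum stays at 0, argmax returns row 0, and the result is simply whatever sits in the first row. A sketch that marks such columns explicitly with np.where:

```python
import numpy as np

nan = np.nan
B = np.array([[ 1., nan],
              [ 2., nan],
              [nan, nan]])  # second column is all-nan

idx = (~np.isnan(B)).cumsum(axis=0).argmax(axis=0)
last = B[idx, np.arange(B.shape[1])]

# force nan for columns that contain no non-nan value at all
has_value = (~np.isnan(B)).any(axis=0)
last = np.where(has_value, last, np.nan)

print(last)  # [ 2. nan]
```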
Would it be better using pandas? df = pd.read_csv("data.dat", header=0, comment="#",delimiter="\t"), then use .values? – Padraic Cunningham May 24 '15 at 21:44
@PadraicCunningham: Yes, for large CSV, pd.read_csv is faster than np.genfromtxt. See wesmckinney.com/blog/… for some discussion and benchmarks. The benchmarks there do not use np.genfromtxt, but I did make a big CSV for this problem using df = pd.concat([pd.read_table('data', sep='\s+')]*10000, ignore_index=True) and found pd.read_csv to be about 5x faster than np.genfromtxt. – unutbu May 24 '15 at 21:57
Interesting, thanks. – Padraic Cunningham May 24 '15 at 22:01
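To sketch the pandas route from the comments (the data is inlined via io.StringIO here so the example is self-contained; the file layout is assumed to match the question's): forward-filling and taking the last row yields each column's last non-nan value.

```python
import io
import pandas as pd

data = io.StringIO(
    "X1\tX2\tX3\tX4\n"
    "1\t1\t1\t1\n"
    "7\tnan\t49\t7\n"
    "10\tnan\tnan\tnan\n"
)
df = pd.read_csv(data, sep="\t")

# forward-fill nan values; the last row then holds
# the last non-nan value seen in each column
last = df.ffill().iloc[-1]
print(last.values)
```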

You could try:

import numpy as np

ss = """1   1   1   1
        2   2   4   2
        3   3   9   3
        4   4   16  4
        5   5   25  5
        6   6   36  6
        7   nan 49  7
        8   nan 64  8
        9   nan 81  nan
        10  nan nan nan"""

# build matrix from the string:
AA = np.array([[float(c) for c in rw.split()]
               for rw in ss.splitlines()])

# extract the last finite element of each column:
cl_finite = np.array([cl[np.isfinite(cl)][-1] for cl in AA.T])
print(cl_finite)

This works fine, provided every column has at least one non-nan element.

In [36]: [l[~np.isnan(l)][-1] for l in A.transpose()]
Out[36]: [10.0, 6.0, 81.0, 8.0]
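For the all-nan case mentioned above, the comprehension would fail: boolean filtering leaves an empty array, and indexing it with [-1] raises an IndexError. A quick demonstration:

```python
import numpy as np

col = np.array([np.nan, np.nan])
vals = col[~np.isnan(col)]   # empty array for an all-nan column
print(vals.size)             # prints 0; vals[-1] would raise IndexError
```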
