
I've got a numpy array filled mostly with real numbers, but there are a few NaN values in it as well.

How can I replace the nans with averages of columns where they are?


No loops required:

import numpy as np
import scipy.stats as stats

print(a)
[[ 0.93230948         nan  0.47773439  0.76998063]
 [ 0.94460779  0.87882456  0.79615838  0.56282885]
 [ 0.94272934  0.48615268  0.06196785         nan]
 [ 0.64940216  0.74414127         nan         nan]]

# Obtain the mean of each column; nanmean is just convenient
col_mean = stats.nanmean(a, axis=0)
print(col_mean)
[ 0.86726219  0.7030395   0.44528687  0.66640474]

# Find the indices that you need to replace
inds = np.where(np.isnan(a))

# Place column means at those indices; align the arrays using take
a[inds] = np.take(col_mean, inds[1])

print(a)
[[ 0.93230948  0.7030395   0.47773439  0.76998063]
 [ 0.94460779  0.87882456  0.79615838  0.56282885]
 [ 0.94272934  0.48615268  0.06196785  0.66640474]
 [ 0.64940216  0.74414127  0.44528687  0.66640474]]
Nice answer. I didn't know nanmean existed! (+1) – Hammer Sep 8 '13 at 22:54

any reason you use take instead of just indexing? – Hammer Sep 8 '13 at 22:58

@Hammer They are adding nanmean to numpy in 1.8. Should be interesting. I use take instead of fancy indexing due to this question. There is a lot of evidence that fancy indexing is ~5x slower than take. Plus this works in older versions also. – Daniel Sep 8 '13 at 23:00

@Jaime Can you elaborate on that some? – Daniel Sep 8 '13 at 23:10

You can now use numpy.nanmean() instead of importing scipy: docs.scipy.org/doc/numpy-dev/reference/generated/… – Roving Richard May 25 '16 at 21:06

If partial is your original data, and replace is an array of the same shape containing averaged values, then this code will use the value from partial wherever one exists.

complete = np.where(np.isnan(partial), replace, partial)
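For the common case where the averaged values are column means, replace can be built with np.nanmean broadcast back to the original shape (a minimal sketch with made-up data):

```python
import numpy as np

partial = np.array([[1.0, np.nan],
                    [3.0, 4.0]])

# Per-column NaN-ignoring means, stretched up to the full shape of partial
replace = np.broadcast_to(np.nanmean(partial, axis=0), partial.shape)

complete = np.where(np.isnan(partial), replace, partial)
# The NaN in column 1 becomes that column's mean, 4.0
```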
This is a much, much cleaner solution than any of the others presented. – naught101 Sep 6 '16 at 0:55

Except that it requires more memory, to hold the repeated mean values. – Benjamin Dec 19 '16 at 15:56

Using masked arrays

The standard way to do this using only numpy would be to use the masked array module.

Scipy is a pretty heavy package which relies on external libraries, so it's worth having a numpy-only method. This borrows from @DonaldHobson's answer.

Edit: np.nanmean is now a numpy function. However, it doesn't handle all-nan columns...

Suppose you have an array a:

>>> a
array([[  0.,  nan,  10.,  nan],
       [  1.,   6.,  nan,  nan],
       [  2.,   7.,  12.,  nan],
       [  3.,   8.,  nan,  nan],
       [ nan,   9.,  14.,  nan]])

>>> import numpy.ma as ma
>>> np.where(np.isnan(a), ma.array(a, mask=np.isnan(a)).mean(axis=0), a)    
array([[  0. ,   7.5,  10. ,   0. ],
       [  1. ,   6. ,  12. ,   0. ],
       [  2. ,   7. ,  12. ,   0. ],
       [  3. ,   8. ,  12. ,   0. ],
       [  1.5,   9. ,  14. ,   0. ]])

Note that the masked array's mean does not need to be the same shape as a, because we're taking advantage of the implicit broadcasting over rows.
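A minimal sketch of that broadcasting with made-up 2×2 data: the column means have shape (2,), and np.where stretches them across the rows:

```python
import numpy as np
import numpy.ma as ma

a = np.array([[0., np.nan],
              [2., 6.]])
col_mean = ma.array(a, mask=np.isnan(a)).mean(axis=0)  # shape (2,): one mean per column
filled = np.where(np.isnan(a), col_mean, a)            # (2, 2) vs (2,): broadcasts over rows
```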

Also note how the all-nan column is nicely handled. The mean is zero since you're taking the mean of zero elements. The method using nanmean doesn't handle all-nan columns:

>>> col_mean = np.nanmean(a, axis=0)
/home/praveen/.virtualenvs/numpy3-mkl/lib/python3.4/site-packages/numpy/lib/nanfunctions.py:675: RuntimeWarning: Mean of empty slice
  warnings.warn("Mean of empty slice", RuntimeWarning)
>>> inds = np.where(np.isnan(a))
>>> a[inds] = np.take(col_mean, inds[1])
>>> a
array([[  0. ,   7.5,  10. ,   nan],
       [  1. ,   6. ,  12. ,   nan],
       [  2. ,   7. ,  12. ,   nan],
       [  3. ,   8. ,  12. ,   nan],
       [  1.5,   9. ,  14. ,   nan]])

Explanation

Converting a into a masked array gives you

>>> ma.array(a, mask=np.isnan(a))
masked_array(data =
 [[0.0 --  10.0 --]
  [1.0 6.0 --   --]
  [2.0 7.0 12.0 --]
  [3.0 8.0 --   --]
  [--  9.0 14.0 --]],
             mask =
 [[False  True False  True]
 [False False  True  True]
 [False False False  True]
 [False False  True  True]
 [ True False False  True]],
       fill_value = 1e+20)

And taking the mean over columns gives you the correct answer, normalizing only over the non-masked values:

>>> ma.array(a, mask=np.isnan(a)).mean(axis=0)
masked_array(data = [1.5 7.5 12.0 --],
             mask = [False False False  True],
       fill_value = 1e+20)

Further, note how the mask nicely handles the column which is all-nan!

Finally, np.where does the job of replacement.


Row-wise mean

To replace nan values with row-wise mean instead of column-wise mean requires a tiny change for broadcasting to take effect nicely:

>>> a
array([[  0.,   1.,   2.,   3.,  nan],
       [ nan,   6.,   7.,   8.,   9.],
       [ 10.,  nan,  12.,  nan,  14.],
       [ nan,  nan,  nan,  nan,  nan]])

>>> np.where(np.isnan(a), ma.array(a, mask=np.isnan(a)).mean(axis=1), a)
ValueError: operands could not be broadcast together with shapes (4,5) (4,) (4,5)

>>> np.where(np.isnan(a), ma.array(a, mask=np.isnan(a)).mean(axis=1)[:, np.newaxis], a)
array([[  0. ,   1. ,   2. ,   3. ,   1.5],
       [  7.5,   6. ,   7. ,   8. ,   9. ],
       [ 10. ,  12. ,  12. ,  12. ,  14. ],
       [  0. ,   0. ,   0. ,   0. ,   0. ]])
IMO there's nothing wrong with having np.nan values as means for all-NaN column case. But it is indeed a neat case of use for masked arrays. – Vlas Sokolov Oct 24 '16 at 0:39

@VlasSokolov Well, having a mask is even better I think. i.e., making a into a masked array and keeping it masked even after applying the mean. Then you don't need to worry about performing operations on it, which might cause the nans to "spread" to the non-nan values. – Praveen Oct 24 '16 at 0:44

This isn't very clean, but I can't think of a way to do it other than iterating:

# Example
import numpy as np

a = np.arange(16, dtype=float).reshape(4, 4)
a[2, 2] = np.nan
a[3, 3] = np.nan

indices = np.where(np.isnan(a))  # a tuple of row-index and column-index arrays
for row, col in zip(*indices):
    a[row, col] = np.mean(a[~np.isnan(a[:, col]), col])
Thanks a lot for this! – piokuc Sep 8 '13 at 23:14

Alternative: Replacing NaNs with interpolation of columns.

def interpolate_nans(X):
    """Overwrite NaNs with column value interpolations."""
    for j in range(X.shape[1]):
        mask_j = np.isnan(X[:,j])
        X[mask_j,j] = np.interp(np.flatnonzero(mask_j), np.flatnonzero(~mask_j), X[~mask_j,j])
    return X

Example use:

X_incomplete = np.array([[10,     20,     30    ],
                         [np.nan, 30,     np.nan],
                         [np.nan, np.nan, 50    ],
                         [40,     50,     np.nan]])

X_complete = interpolate_nans(X_incomplete)

print(X_complete)
[[ 10.  20.  30.]
 [ 20.  30.  40.]
 [ 30.  40.  50.]
 [ 40.  50.  50.]]

I use this bit of code for time series data in particular, where columns are attributes and rows are time-ordered samples.
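One edge case worth noting: np.interp clamps outside the known range, so leading and trailing NaNs in a column get the first/last observed value rather than an extrapolation (a minimal sketch with made-up data):

```python
import numpy as np

col = np.array([np.nan, 10.0, np.nan, 30.0, np.nan])
mask = np.isnan(col)
col[mask] = np.interp(np.flatnonzero(mask), np.flatnonzero(~mask), col[~mask])
# Leading/trailing NaNs are clamped to the nearest known value; interior NaNs are interpolated
```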


To extend Donald's answer, here is a minimal example. Let's say a is an ndarray and we want to replace its zero values with the mean of the column.

In [231]: a
Out[231]:
array([[0, 3, 6],
       [2, 0, 0]])

In [232]: col_mean = np.nanmean(a, axis=0)

In [233]: col_mean
Out[233]: array([ 1. ,  1.5,  3. ])

In [234]: np.where(np.equal(a, 0), col_mean, a)
Out[234]:
array([[ 1. ,  3. ,  6. ],
       [ 2. ,  1.5,  3. ]])

If you don't have access to a numpy version with numpy.nanmean, here is a compact way to do it by indexing into the column means.

m = numpy.isnan(a)
# Per-column mean over the non-NaN entries; index it by each NaN's column index
a[m] = (numpy.nan_to_num(a).sum(axis=0) / (~m).sum(axis=0))[numpy.where(m)[1]]
import numpy as np
import pandas as pd

def replace_nan_with_mean(matrix_in):
    for i in np.arange(matrix_in.shape[1]):
        # Calculate the column mean, ignoring all the NaN values
        mean = np.mean(matrix_in[~pd.isnull(matrix_in[:, i]), i], axis=0)
        # Assign every NaN in this column the column mean
        matrix_in[pd.isnull(matrix_in[:, i]), i] = mean
    return matrix_in
While this code may answer the question, providing additional context regarding why and/or how this code answers the question improves its long-term value. Code-only answers are discouraged. – Ajean Dec 19 '16 at 18:50

You might want to try the built-in np.nan_to_num, which replaces NaN with zero (and infinities with large finite numbers) rather than the column mean:

x = np.array([np.inf, -np.inf, np.nan, -128, 128])
np.nan_to_num(x)
array([  1.79769313e+308,  -1.79769313e+308,   0.00000000e+000,
        -1.28000000e+002,   1.28000000e+002])
