
I've got a numpy array filled mostly with real numbers, but there are a few NaN values in it as well.

How can I replace the nans with averages of columns where they are?


No loops required:

import numpy as np
import scipy.stats as stats

print(a)
[[ 0.93230948         nan  0.47773439  0.76998063]
 [ 0.94460779  0.87882456  0.79615838  0.56282885]
 [ 0.94272934  0.48615268  0.06196785         nan]
 [ 0.64940216  0.74414127         nan         nan]]

# Obtain the mean of each column; nanmean is just convenient
col_mean = stats.nanmean(a, axis=0)
print(col_mean)
[ 0.86726219  0.7030395   0.44528687  0.66640474]

# Find the indices that you need to replace
inds = np.where(np.isnan(a))

# Place column means at those indices; align the arrays using take
a[inds] = np.take(col_mean, inds[1])

print(a)
[[ 0.93230948  0.7030395   0.47773439  0.76998063]
 [ 0.94460779  0.87882456  0.79615838  0.56282885]
 [ 0.94272934  0.48615268  0.06196785  0.66640474]
 [ 0.64940216  0.74414127  0.44528687  0.66640474]]
Nice answer. I didn't know nanmean existed! (+1) – Hammer Sep 8 '13 at 22:54

any reason you use take instead of just indexing? – Hammer Sep 8 '13 at 22:58

@Hammer They are adding nanmean to numpy in 1.8. Should be interesting. I use take instead of fancy indexing due to this question. There is a lot of evidence that fancy indexing is ~5x slower than take. Plus this works in older versions also. – Daniel Sep 8 '13 at 23:00

@Jaime Can you elaborate on that some? – Daniel Sep 8 '13 at 23:10

You can now use numpy.nanmean() instead of importing scipy: docs.scipy.org/doc/numpy-dev/reference/generated/… – Roving Richard May 25 '16 at 21:06

If partial is your original data, and replace is an array of the same shape containing averaged values, then this code will use the value from partial wherever one exists.

complete = np.where(np.isnan(partial), replace, partial)
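For the common case where the averaged values are column means, replace can be built with np.nanmean broadcast back to the original shape (a minimal sketch with made-up data):

```python
import numpy as np

partial = np.array([[1.0, np.nan],
                    [3.0, 4.0]])

# Per-column NaN-ignoring means, stretched up to the full shape of partial
replace = np.broadcast_to(np.nanmean(partial, axis=0), partial.shape)

complete = np.where(np.isnan(partial), replace, partial)
# The NaN in column 1 becomes that column's mean, 4.0
```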
This is a much, much cleaner solution than any of the others presented. – naught101 Sep 6 '16 at 0:55

Except that it requires more memory, to hold the repeated mean values. – Benjamin Dec 19 '16 at 15:56

Using masked arrays

The standard way to do this using only numpy would be to use the masked array module.

Scipy is a pretty heavy package which relies on external libraries, so it's worth having a numpy-only method. This borrows from @DonaldHobson's answer.

Edit: np.nanmean is now a numpy function. However, it doesn't handle all-nan columns...

Suppose you have an array a:

>>> a
array([[  0.,  nan,  10.,  nan],
       [  1.,   6.,  nan,  nan],
       [  2.,   7.,  12.,  nan],
       [  3.,   8.,  nan,  nan],
       [ nan,   9.,  14.,  nan]])

>>> import numpy.ma as ma
>>> np.where(np.isnan(a), ma.array(a, mask=np.isnan(a)).mean(axis=0), a)    
array([[  0. ,   7.5,  10. ,   0. ],
       [  1. ,   6. ,  12. ,   0. ],
       [  2. ,   7. ,  12. ,   0. ],
       [  3. ,   8. ,  12. ,   0. ],
       [  1.5,   9. ,  14. ,   0. ]])

Note that the masked array's mean does not need to be the same shape as a, because we're taking advantage of the implicit broadcasting over rows.
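A minimal sketch of that broadcasting with made-up 2×2 data: the column means have shape (2,), and np.where stretches them across the rows:

```python
import numpy as np
import numpy.ma as ma

a = np.array([[0., np.nan],
              [2., 6.]])
col_mean = ma.array(a, mask=np.isnan(a)).mean(axis=0)  # shape (2,): one mean per column
filled = np.where(np.isnan(a), col_mean, a)            # (2, 2) vs (2,): broadcasts over rows
```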

Also note how the all-nan column is nicely handled. The mean is zero since you're taking the mean of zero elements. The method using nanmean doesn't handle all-nan columns:

>>> col_mean = np.nanmean(a, axis=0)
/home/praveen/.virtualenvs/numpy3-mkl/lib/python3.4/site-packages/numpy/lib/nanfunctions.py:675: RuntimeWarning: Mean of empty slice
  warnings.warn("Mean of empty slice", RuntimeWarning)
>>> inds = np.where(np.isnan(a))
>>> a[inds] = np.take(col_mean, inds[1])
>>> a
array([[  0. ,   7.5,  10. ,   nan],
       [  1. ,   6. ,  12. ,   nan],
       [  2. ,   7. ,  12. ,   nan],
       [  3. ,   8. ,  12. ,   nan],
       [  1.5,   9. ,  14. ,   nan]])

Explanation

Converting a into a masked array gives you

>>> ma.array(a, mask=np.isnan(a))
masked_array(data =
 [[0.0 --  10.0 --]
  [1.0 6.0 --   --]
  [2.0 7.0 12.0 --]
  [3.0 8.0 --   --]
  [--  9.0 14.0 --]],
             mask =
 [[False  True False  True]
 [False False  True  True]
 [False False False  True]
 [False False  True  True]
 [ True False False  True]],
       fill_value = 1e+20)

And taking the mean over columns gives you the correct answer, normalizing only over the non-masked values:

>>> ma.array(a, mask=np.isnan(a)).mean(axis=0)
masked_array(data = [1.5 7.5 12.0 --],
             mask = [False False False  True],
       fill_value = 1e+20)

Further, note how the mask nicely handles the column which is all-nan!

Finally, np.where does the job of replacement.


Row-wise mean

To replace nan values with row-wise mean instead of column-wise mean requires a tiny change for broadcasting to take effect nicely:

>>> a
array([[  0.,   1.,   2.,   3.,  nan],
       [ nan,   6.,   7.,   8.,   9.],
       [ 10.,  nan,  12.,  nan,  14.],
       [ nan,  nan,  nan,  nan,  nan]])

>>> np.where(np.isnan(a), ma.array(a, mask=np.isnan(a)).mean(axis=1), a)
ValueError: operands could not be broadcast together with shapes (4,5) (4,) (4,5)

>>> np.where(np.isnan(a), ma.array(a, mask=np.isnan(a)).mean(axis=1)[:, np.newaxis], a)
array([[  0. ,   1. ,   2. ,   3. ,   1.5],
       [  7.5,   6. ,   7. ,   8. ,   9. ],
       [ 10. ,  12. ,  12. ,  12. ,  14. ],
       [  0. ,   0. ,   0. ,   0. ,   0. ]])
IMO there's nothing wrong with having np.nan values as means for all-NaN column case. But it is indeed a neat case of use for masked arrays. – Vlas Sokolov Oct 24 '16 at 0:39

@VlasSokolov Well, having a mask is even better I think. i.e., making a into a masked array and keeping it masked even after applying the mean. Then you don't need to worry about performing operations on it, which might cause the nans to "spread" to the non-nan values. – Praveen Oct 24 '16 at 0:44

This isn't very clean, but I can't think of a way to do it other than iterating:

# Example
import numpy as np

a = np.arange(16, dtype=float).reshape(4, 4)
a[2, 2] = np.nan
a[3, 3] = np.nan

indices = np.where(np.isnan(a))  # a tuple of row-index and column-index arrays
for row, col in zip(*indices):
    a[row, col] = np.mean(a[~np.isnan(a[:, col]), col])
Thanks a lot for this! – piokuc Sep 8 '13 at 23:14

Alternative: Replacing NaNs with interpolation of columns.

def interpolate_nans(X):
    """Overwrite NaNs with column value interpolations."""
    for j in range(X.shape[1]):
        mask_j = np.isnan(X[:,j])
        X[mask_j,j] = np.interp(np.flatnonzero(mask_j), np.flatnonzero(~mask_j), X[~mask_j,j])
    return X

Example use:

X_incomplete = np.array([[10,     20,     30    ],
                         [np.nan, 30,     np.nan],
                         [np.nan, np.nan, 50    ],
                         [40,     50,     np.nan]])

X_complete = interpolate_nans(X_incomplete)

print(X_complete)
[[ 10.  20.  30.]
 [ 20.  30.  40.]
 [ 30.  40.  50.]
 [ 40.  50.  50.]]

I use this bit of code for time series data in particular, where columns are attributes and rows are time-ordered samples.
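One edge case worth noting: np.interp clamps outside the known range, so leading and trailing NaNs in a column get the first/last observed value rather than an extrapolation (a minimal sketch with made-up data):

```python
import numpy as np

col = np.array([np.nan, 10.0, np.nan, 30.0, np.nan])
mask = np.isnan(col)
col[mask] = np.interp(np.flatnonzero(mask), np.flatnonzero(~mask), col[~mask])
# Leading/trailing NaNs are clamped to the nearest known value; interior NaNs are interpolated
```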


To extend Donald's answer, here is a minimal example. Let's say a is an ndarray and we want to replace its zero values with the mean of the column.

In [231]: a
Out[231]:
array([[0, 3, 6],
       [2, 0, 0]])

In [232]: col_mean = np.nanmean(a, axis=0)

In [233]: col_mean
Out[233]: array([ 1. ,  1.5,  3. ])

In [234]: np.where(np.equal(a, 0), col_mean, a)
Out[234]:
array([[ 1. ,  3. ,  6. ],
       [ 2. ,  1.5,  3. ]])

If you don't have access to a numpy version with numpy.nanmean, here is a compact way to do it by indexing into the column means.

m = numpy.isnan(a)
# Per-column mean over the non-NaN entries; index it by each NaN's column index
a[m] = (numpy.nan_to_num(a).sum(axis=0) / (~m).sum(axis=0))[numpy.where(m)[1]]
import numpy as np
import pandas as pd

def replace_nan_with_mean(matrix_in):
    for i in np.arange(matrix_in.shape[1]):
        # Calculate the column mean, ignoring all the NaN values
        mean = np.mean(matrix_in[~pd.isnull(matrix_in[:, i]), i], axis=0)
        # Assign every NaN in this column the column mean
        matrix_in[pd.isnull(matrix_in[:, i]), i] = mean
    return matrix_in
While this code may answer the question, providing additional context regarding why and/or how this code answers the question improves its long-term value. Code-only answers are discouraged. – Ajean Dec 19 '16 at 18:50

You might want to try the built-in np.nan_to_num, which replaces NaN with zero (and infinities with large finite numbers) rather than the column mean:

x = np.array([np.inf, -np.inf, np.nan, -128, 128])
np.nan_to_num(x)
array([  1.79769313e+308,  -1.79769313e+308,   0.00000000e+000,
        -1.28000000e+002,   1.28000000e+002])
