I've got a numpy array filled mostly with real numbers, but there are a few nan values in it as well.
How can I replace the nans with the averages of the columns where they are?
|
No loops required:
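The code for this answer is not shown here, so the following is only a minimal sketch of a loop-free approach, assuming a NumPy version that provides np.nanmean. The array `a` and the name `filled` are illustrative, not taken from the answer.

```python
import numpy as np

# Illustrative data: one NaN in column 1 and one in column 2.
a = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0, np.nan],
              [7.0, 8.0, 9.0]])

col_mean = np.nanmean(a, axis=0)      # per-column mean, ignoring NaNs
rows, cols = np.where(np.isnan(a))    # row and column index of every NaN

filled = a.copy()                     # keep the original untouched
filled[rows, cols] = np.take(col_mean, cols)  # look up each NaN's column mean
```

Each NaN is replaced by the mean of its own column because `cols` indexes directly into `col_mean`.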
|
If partial is your original data, and replace is an array of the same shape containing the averaged values, then this code will use the value from partial wherever one exists and fall back to the value from replace otherwise.
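The code itself is missing from this answer. A sketch of what the description suggests, assuming np.where is the selection mechanism, could look like this; the way `replace` is built here is a hypothetical choice for illustration.

```python
import numpy as np

partial = np.array([[1.0, np.nan],
                    [3.0, 4.0]])

# Hypothetical `replace`: each column's NaN-ignoring mean, repeated down the rows
# so that it has the same shape as `partial`.
replace = np.broadcast_to(np.nanmean(partial, axis=0), partial.shape)

# Take the value from `partial` where it exists, otherwise from `replace`.
complete = np.where(np.isnan(partial), replace, partial)
```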
|
Using masked arrays
The standard way to do this using only numpy would be to use the masked array module. Scipy is a pretty heavy package which relies on external libraries, so it's worth having a numpy-only method. This borrows from @DonaldHobson's answer.
Edit: Suppose you have an array
Note that the masked array's mean does not need to be the same shape as the original array, because it is broadcast across the rows.
Also note how the all-nan column is nicely handled. The mean is zero since you're taking the mean of zero elements. The method using
Explanation
Converting
And taking the mean over columns gives you the correct answer, normalizing only over the non-masked values:
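The original snippets are missing, so here is a minimal sketch of the masked-array steps described above. The array `a` is illustrative, and filling the all-nan column's mean with 0.0 is made explicit here as an assumption rather than relied on implicitly.

```python
import numpy as np

# Illustrative data: the last column is entirely NaN.
a = np.array([[1.0, np.nan, np.nan],
              [3.0, 5.0, np.nan],
              [5.0, 7.0, np.nan]])

masked = np.ma.masked_invalid(a)   # NaN entries become masked
col_mean = masked.mean(axis=0)     # averages only the unmasked values per column

# The all-NaN column has a masked mean; 0.0 is an explicit choice for it here.
col_mean = col_mean.filled(0.0)

# col_mean has shape (3,), and np.where broadcasts it across the rows.
result = np.where(np.isnan(a), col_mean, a)
```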
Further, note how the mask nicely handles the column which is all-nan!
Finally,
Row-wise mean
To replace
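For the row-wise case, the answer's code is not shown. One way it could be sketched, assuming the same masked-array approach, is to turn the row means into a column vector so that broadcasting lines them up with the rows. The names below are illustrative.

```python
import numpy as np

a = np.array([[1.0, np.nan, 3.0],
              [np.nan, 4.0, 6.0]])

masked = np.ma.masked_invalid(a)

# Mean over axis=1 gives one value per row; the extra axis makes the shape (2, 1)
# so it broadcasts along each row rather than along the columns.
row_mean = masked.mean(axis=1).filled(0.0)[:, np.newaxis]

result_rows = np.where(np.isnan(a), row_mean, a)
```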
|
This isn't very clean, but I can't think of a way to do it other than iterating.
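The loop from this answer is not shown, so the following is only a sketch of an iterative version, looping over columns and averaging the non-nan entries of each. The example data is made up for illustration.

```python
import numpy as np

a = np.arange(16, dtype=float).reshape(4, 4)
a[2, 2] = np.nan
a[3, 3] = np.nan

for col in range(a.shape[1]):
    column = a[:, col]                 # a view, so assignments modify `a`
    missing = np.isnan(column)
    # Skip columns with nothing to fill or nothing to average.
    if missing.any() and not missing.all():
        column[missing] = column[~missing].mean()
```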
|
Alternative: Replacing NaNs with interpolation of columns.
Example use:
I use this bit of code for time series data in particular, where columns are attributes and rows are time-ordered samples.
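The function and its example are missing from this answer. A rough sketch of column-wise linear interpolation, assuming np.interp and evenly spaced rows, might look like the following; `interpolate_nans` is a hypothetical name, not the answer's.

```python
import numpy as np

def interpolate_nans(x):
    """Return a copy of 2-D array x with NaNs linearly interpolated down each column."""
    out = np.array(x, dtype=float)
    idx = np.arange(out.shape[0])      # row positions, assumed evenly spaced
    for col in range(out.shape[1]):
        nan_mask = np.isnan(out[:, col])
        # Nothing to do for complete columns; all-NaN columns are left as they are.
        if nan_mask.any() and not nan_mask.all():
            out[nan_mask, col] = np.interp(
                idx[nan_mask], idx[~nan_mask], out[~nan_mask, col]
            )
    return out

# Rows are time-ordered samples, columns are attributes.
series = np.array([[1.0, 10.0],
                   [np.nan, 20.0],
                   [3.0, np.nan],
                   [4.0, 40.0]])
smoothed = interpolate_nans(series)
```

Note that np.interp holds the nearest known value constant for NaNs at the start or end of a column, which may or may not suit a given time series.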
|
To extend Donald's answer, I provide a minimal example. Let's say
|
If you don't have access to a numpy version with
|
You might want to try this built-in function:
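The function name is missing from this answer. As a guess only, one NumPy built-in that replaces nan values is np.nan_to_num; note that it substitutes a constant (0.0 by default) rather than a column average, so it does not answer the question as asked.

```python
import numpy as np

x = np.array([[1.0, np.nan],
              [np.nan, 4.0]])

# Assumption: np.nan_to_num is the intended built-in. NaN becomes 0.0 by default.
zeroed = np.nan_to_num(x)
```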