X is a DataFrame with about 90% missing values and around 10% actual values. My goal is to use NMF in a successive imputation loop to predict the actual values I have hidden. The mask, msk, selects a random 80% of the actual values (i.e., 80% of the 10% that are not missing). I initialize everything outside this 80% to 0 and begin to impute it. The line that builds msk looks odd because I couldn't find a direct way to select a random 80% (train set) of the values that weren't np.nan. So I exploited the fact that adding np.nan to a number leaves np.nan: after adding random noise and subtracting X.values back off, the only entries that are affected (i.e., compare as less-than) are the non-null values of the array. This lets me get a random 80% of the non-null values.
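The NaN-propagation trick described above can be sketched in isolation. This is a minimal illustration with made-up values (the arrays here are not from the question); it also shows an equivalent, more direct construction with np.isnan:

```python
import numpy as np

# small array with some missing entries (illustrative values)
x = np.array([0.2, np.nan, 0.7, np.nan, 0.4])

# fixed "noise" values for reproducibility (would be random in practice)
noise = np.array([0.1, 0.5, 0.9, 0.3, 0.75])

# adding noise and subtracting x back leaves NaN entries as NaN,
# and NaN < 0.8 evaluates to False, so missing entries drop out of the mask
trick_msk = (x + noise - x) < 0.8

# the same mask built directly: non-null AND below the threshold
direct_msk = ~np.isnan(x) & (noise < 0.8)

assert (trick_msk == direct_msk).all()
```

The np.isnan form avoids relying on NaN arithmetic and makes the intent explicit.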
import pandas as pd
from pandas import DataFrame
import numpy as np
from sklearn.decomposition import ProjectedGradientNMF
# toy example data, actual data is ~500 by ~ 250
customers = range(20)
features = range(15)
toy_vals = np.random.random(20*15).reshape((20,15))
toy_mask = toy_vals < 0.9
toy_vals[toy_mask] = np.nan
X = DataFrame(toy_vals, index=customers, columns=features)
# end toy example data gen.
# imputation w/ nmf loops
X_imputed = X.copy()
# NaN propagates through the addition/subtraction, so missing entries
# compare False; np.random.rand is uniform on [0, 1), so "< 0.8" keeps
# ~80% of the non-null entries as the train set
msk = (X.values + np.random.rand(*X.shape) - X.values) < 0.8
# zero out everything outside the train set (missing values and held-out 20%)
X_imputed.values[~msk] = 0
nmf_model = ProjectedGradientNMF(n_components = 5)
W = nmf_model.fit_transform(X_imputed.values)
H = nmf_model.components_
# refit until the squared reconstruction error is small enough
while nmf_model.reconstruction_err_**2 > 10:
    W = nmf_model.fit_transform(X_imputed.values)
    H = nmf_model.components_
    # overwrite only the entries outside the train set with the NMF reconstruction
    X_imputed.values[~msk] = W.dot(H)[~msk]
I'm pretty sure this can be written in fewer lines, but I'm not sure how to do it.
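One way the mask construction could be shortened is to build it directly with np.isnan instead of the NaN-arithmetic trick. This is a sketch of that alternative setup only (the NMF fitting loop is unchanged and omitted here), not necessarily the shortest possible version:

```python
import numpy as np
import pandas as pd

# toy data as in the question: ~90% of entries missing
vals = np.random.random((20, 15))
vals[vals < 0.9] = np.nan
X = pd.DataFrame(vals)

# train mask built directly: True on a random ~80% of the non-null entries
msk = ~np.isnan(X.values) & (np.random.rand(*X.shape) < 0.8)

# keep the train entries, set everything else (missing + held-out 20%) to 0
X_imputed = X.where(msk, 0.0)
```

DataFrame.where avoids writing through .values, which also sidesteps any questions about whether .values returns a view or a copy.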
The while loop starts only when I crank up the number of dimensions of the matrix to 200 and 150. I think for smaller matrices the bound you have for the approximation error is met after a single iteration. – Curt F. Jul 14 '15 at 16:18

The nmf_model.n_iter_ parameter isn't there, which seems to conflict with the documentation for the module you are using. Is this a scikit-learn bug? – Curt F. Jul 14 '15 at 16:19
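For what it's worth, the current sklearn.decomposition.NMF estimator (which superseded ProjectedGradientNMF in later scikit-learn releases) does set both n_iter_ and reconstruction_err_ after fitting. A small sketch on random data, assuming a modern scikit-learn install:

```python
import numpy as np
from sklearn.decomposition import NMF

# NMF requires a non-negative input matrix
X_dense = np.random.rand(20, 15)

model = NMF(n_components=5, init='random', max_iter=500, random_state=0)
W = model.fit_transform(X_dense)
H = model.components_

# both fitted attributes are available after fit_transform
print(model.n_iter_, model.reconstruction_err_)
```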