X is a DataFrame with about 90% missing values and around 10% actual values. My goal is to use NMF in a successive imputation loop to predict the actual values I have hidden. The mask, msk, selects a random 80% of the actual values (i.e., 80% of the 10% that are not missing). I initialize everything outside this 80% to 0 and begin to impute it. The line that builds msk looks odd because I couldn't find a direct way to select a random 80% (train set) of the values that weren't np.nan. So I exploited the fact that adding np.nan to a number leaves np.nan: after adding random noise and subtracting X.values back off, the only entries that are affected (i.e., compare as less-than) are the non-null values of the array. This lets me get a random 80% of the non-null values.
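The NaN-propagation trick described above can be sketched in isolation. This is a minimal illustration with made-up values (the arrays here are not from the question); it also shows an equivalent, more direct construction with np.isnan:

```python
import numpy as np

# small array with some missing entries (illustrative values)
x = np.array([0.2, np.nan, 0.7, np.nan, 0.4])

# fixed "noise" values for reproducibility (would be random in practice)
noise = np.array([0.1, 0.5, 0.9, 0.3, 0.75])

# adding noise and subtracting x back leaves NaN entries as NaN,
# and NaN < 0.8 evaluates to False, so missing entries drop out of the mask
trick_msk = (x + noise - x) < 0.8

# the same mask built directly: non-null AND below the threshold
direct_msk = ~np.isnan(x) & (noise < 0.8)

assert (trick_msk == direct_msk).all()
```

The np.isnan form avoids relying on NaN arithmetic and makes the intent explicit.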
import pandas as pd
from pandas import DataFrame
import numpy as np
from sklearn.decomposition import ProjectedGradientNMF
# toy example data, actual data is ~500 by ~ 250
customers = range(20)
features = range(15)
toy_vals = np.random.random(20*15).reshape((20,15))
toy_mask = toy_vals < 0.9
toy_vals[toy_mask] = np.nan
X = DataFrame(toy_vals, index=customers, columns=features)
# end toy example data gen.
# imputation w/ nmf loops
X_imputed = X.copy()
# NaN propagates through the addition/subtraction, so missing entries
# compare False; np.random.rand is uniform on [0, 1), so "< 0.8" keeps
# ~80% of the non-null entries as the train set
msk = (X.values + np.random.rand(*X.shape) - X.values) < 0.8
# zero out everything outside the train set (missing values and held-out 20%)
X_imputed.values[~msk] = 0
nmf_model = ProjectedGradientNMF(n_components = 5)
W = nmf_model.fit_transform(X_imputed.values)
H = nmf_model.components_
# refit until the squared reconstruction error is small enough
while nmf_model.reconstruction_err_**2 > 10:
    W = nmf_model.fit_transform(X_imputed.values)
    H = nmf_model.components_
    # overwrite only the entries outside the train set with the NMF reconstruction
    X_imputed.values[~msk] = W.dot(H)[~msk]
I'm pretty sure this can be written in fewer lines, but I'm not sure how to do it.
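One way the mask construction could be shortened is to build it directly with np.isnan instead of the NaN-arithmetic trick. This is a sketch of that alternative setup only (the NMF fitting loop is unchanged and omitted here), not necessarily the shortest possible version:

```python
import numpy as np
import pandas as pd

# toy data as in the question: ~90% of entries missing
vals = np.random.random((20, 15))
vals[vals < 0.9] = np.nan
X = pd.DataFrame(vals)

# train mask built directly: True on a random ~80% of the non-null entries
msk = ~np.isnan(X.values) & (np.random.rand(*X.shape) < 0.8)

# keep the train entries, set everything else (missing + held-out 20%) to 0
X_imputed = X.where(msk, 0.0)
```

DataFrame.where avoids writing through .values, which also sidesteps any questions about whether .values returns a view or a copy.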
The while loop starts only when I crank up the number of dimensions of the matrix to 200 and 150. I think for smaller matrices the bound you have for the approximation error is met after a single iteration. – Curt F. Jul 14 '15 at 16:18

The nmf_model.n_iter_ parameter isn't there, which seems to conflict with the documentation for the module you are using. Is this a scikit-learn bug? – Curt F. Jul 14 '15 at 16:19
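For what it's worth, the current sklearn.decomposition.NMF estimator (which superseded ProjectedGradientNMF in later scikit-learn releases) does set both n_iter_ and reconstruction_err_ after fitting. A small sketch on random data, assuming a modern scikit-learn install:

```python
import numpy as np
from sklearn.decomposition import NMF

# NMF requires a non-negative input matrix
X_dense = np.random.rand(20, 15)

model = NMF(n_components=5, init='random', max_iter=500, random_state=0)
W = model.fit_transform(X_dense)
H = model.components_

# both fitted attributes are available after fit_transform
print(model.n_iter_, model.reconstruction_err_)
```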