AttributeError: 'numpy.ndarray' object has no attribute 'toarray'

Question

I am extracting features out of a text corpus, and I am using a td-fidf vectorizer and truncated singular value decomposition from scikit-learn in order to achieve that. However, since the algorithm I want to try out requires dense matrices and the vectorizer returns sparse matrices I need to convert those matrices to dense arrays. But, whenever I try to convert those arrays I get an error telling me that my numpy array object has no atribute "toarray". What am I doing wrong?

The function:

def feature_extraction(train,train_test,test_set):
    vectorizer = TfidfVectorizer(min_df = 3,strip_accents = "unicode",analyzer = "word",token_pattern = r'\w{1,}',ngram_range = (1,2))        

    print("fitting Vectorizer")
    vectorizer.fit(train)

    print("transforming text")
    train = vectorizer.transform(train)
    train_test = vectorizer.transform(train_test)
    test_set = vectorizer.transform(test_set)

    print("Dimensionality reduction")
    svd = TruncatedSVD(n_components = 100)
    svd.fit(train)
    train = svd.transform(train)
    train_test = svd.transform(train_test)
    test_set = svd.transform(test_set)

    print("convert to dense array")
    train = train.toarray()
    test_set = test_set.toarray()
    train_test = train_test.toarray()

    print(train.shape)
    return train,train_test,test_set

traceback:

Traceback (most recent call last):
  File "C:\Users\Anonymous\workspace\final_submission\src\linearSVM.py", line 24, in <module>
    x_train,x_test,test_set = feature_extraction(x_train,x_test,test_set)
  File "C:\Users\Anonymous\workspace\final_submission\src\Preprocessing.py", line 57, in feature_extraction
    train = train.toarray()
AttributeError: 'numpy.ndarray' object has no attribute 'toarray'

Update: Willy pointed out that my assumption of the matrix being sparse might be wrong. So I tried feeding my data to my algorithm with dimensionality reduction and it actually worked without any conversion, however when I excluded dimensionality reduction, which gave me around 53k features I get the following error:

    Traceback (most recent call last):
  File "C:\Users\Anonymous\workspace\final_submission\src\linearSVM.py", line 28, in <module>
    result = bayesian_ridge(x_train,x_test,y_train,y_test,test_set)
  File "C:\Users\Anonymous\workspace\final_submission\src\Algorithms.py", line 84, in bayesian_ridge
    algo = algo.fit(x_train,y_train[:,i])
  File "C:\Python27\lib\site-packages\sklearn\linear_model\bayes.py", line 136, in fit
    dtype=np.float)
  File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 220, in check_arrays
    raise TypeError('A sparse matrix was passed, but dense '
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.

Can someone explain this?

Update2

As requested, I'll give all the code involved. Since it is scattered over different files I'll just post it in steps. For clarity I'll leave all the module imports out.

This is how I preprocess my code:

def regexp(data):
    for row in range(len(data)):
        data[row] = re.sub(r'[\W_]+'," ",data[row])
        return data

def clean_the_text(data):
    alist = []
    data = nltk.word_tokenize(data)
    for j in data:
        j = j.lower()
        alist.append(j.rstrip('\n'))
    alist = " ".join(alist)
    return alist
def loop_data(data):
    for i in range(len(data)):
        data[i] = clean_the_text(data[i])
    return data  


if __name__ == "__main__":
    print("loading train")
    train_text = porter_stemmer(loop_data(regexp(list(np.array(p.read_csv(os.path.join(dir,"train.csv")))[:,1]))))
    print("loading test_set")
    test_set = porter_stemmer(loop_data(regexp(list(np.array(p.read_csv(os.path.join(dir,"test.csv")))[:,1]))))

After splitting my train_set into a x_train and a x_test for cross_validation I transform my data using the feature_extraction function above.

x_train,x_test,test_set = feature_extraction(x_train,x_test,test_set)

Finally I feed them into my algorithm

def bayesian_ridge(x_train,x_test,y_train,y_test,test_set):
    algo = linear_model.BayesianRidge()
    algo = algo.fit(x_train,y_train)
    pred = algo.predict(x_test)
    error = pred - y_test
    result.append(algo.predict(test_set))
    print("Bayes_error: ",cross_val(error))
    return result

If train is already an ndarray, then your assumption about it returning a sparse matrix is incorrect. — willy, Nov 22 '13 at 18:26
you should include all the code, not just messages. ndarray is dense by definition, sparse matrices are represented in different objects, so there is rather an error in your code (which you did not attach) — lejlot, Nov 22 '13 at 19:50

larsmans · Accepted Answer · 2013-11-23 13:07:12Z

up vote 1 down vote accepted

TruncatedSVD.transform returns an array, not a sparse matrix. In fact, in the present version of scikit-learn, only the vectorizers return sparse matrices.

answered Nov 23 '13 at 13:07

larsmans
205k24363542

Thank you! I didn't know that. – Learner Nov 23 '13 at 15:09

@Learner: it's in the docstring for that method. – larsmans Nov 23 '13 at 16:00

add a comment |

asked	1 year ago
viewed	2350 times
active	1 year ago

current community

your communities

more stack exchange communities

AttributeError: 'numpy.ndarray' object has no attribute 'toarray'

1 Answer 1

Your Answer

Not the answer you're looking for? Browse other questions tagged python numpy machine-learning scikit-learn or ask your own question.

Hot Network Questions

current community

your communities

more stack exchange communities

AttributeError: 'numpy.ndarray' object has no attribute 'toarray'

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged python numpy machine-learning scikit-learn or ask your own question.

Related

Hot Network Questions