I am extracting features from a text corpus using a tf-idf vectorizer and truncated singular value decomposition from scikit-learn. Since the algorithm I want to try requires dense matrices and the vectorizer returns sparse matrices, I need to convert those matrices to dense arrays. But whenever I try to convert them, I get an error telling me that my numpy array object has no attribute "toarray". What am I doing wrong?

The function:

def feature_extraction(train,train_test,test_set):
    vectorizer = TfidfVectorizer(min_df = 3,strip_accents = "unicode",analyzer = "word",token_pattern = r'\w{1,}',ngram_range = (1,2))        

    print("fitting Vectorizer")
    vectorizer.fit(train)

    print("transforming text")
    train = vectorizer.transform(train)
    train_test = vectorizer.transform(train_test)
    test_set = vectorizer.transform(test_set)

    print("Dimensionality reduction")
    svd = TruncatedSVD(n_components = 100)
    svd.fit(train)
    train = svd.transform(train)
    train_test = svd.transform(train_test)
    test_set = svd.transform(test_set)

    print("convert to dense array")
    train = train.toarray()
    test_set = test_set.toarray()
    train_test = train_test.toarray()

    print(train.shape)
    return train,train_test,test_set

Traceback:

Traceback (most recent call last):
  File "C:\Users\Anonymous\workspace\final_submission\src\linearSVM.py", line 24, in <module>
    x_train,x_test,test_set = feature_extraction(x_train,x_test,test_set)
  File "C:\Users\Anonymous\workspace\final_submission\src\Preprocessing.py", line 57, in feature_extraction
    train = train.toarray()
AttributeError: 'numpy.ndarray' object has no attribute 'toarray'

Update: Willy pointed out that my assumption that the matrix is sparse might be wrong. I tried feeding my data to the algorithm with dimensionality reduction, and it worked without any conversion. However, when I skipped dimensionality reduction, which leaves me with around 53k features, I got the following error:

Traceback (most recent call last):
  File "C:\Users\Anonymous\workspace\final_submission\src\linearSVM.py", line 28, in <module>
    result = bayesian_ridge(x_train,x_test,y_train,y_test,test_set)
  File "C:\Users\Anonymous\workspace\final_submission\src\Algorithms.py", line 84, in bayesian_ridge
    algo = algo.fit(x_train,y_train[:,i])
  File "C:\Python27\lib\site-packages\sklearn\linear_model\bayes.py", line 136, in fit
    dtype=np.float)
  File "C:\Python27\lib\site-packages\sklearn\utils\validation.py", line 220, in check_arrays
    raise TypeError('A sparse matrix was passed, but dense '
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.

Can someone explain this?

Update 2:

As requested, I'll give all the code involved. Since it is scattered over different files I'll just post it in steps. For clarity I'll leave all the module imports out.

This is how I preprocess my text:

def regexp(data):
    for row in range(len(data)):
        data[row] = re.sub(r'[\W_]+'," ",data[row])
    return data

def clean_the_text(data):
    alist = []
    data = nltk.word_tokenize(data)
    for j in data:
        j = j.lower()
        alist.append(j.rstrip('\n'))
    alist = " ".join(alist)
    return alist

def loop_data(data):
    for i in range(len(data)):
        data[i] = clean_the_text(data[i])
    return data  


if __name__ == "__main__":
    print("loading train")
    train_text = porter_stemmer(loop_data(regexp(list(np.array(p.read_csv(os.path.join(dir,"train.csv")))[:,1]))))
    print("loading test_set")
    test_set = porter_stemmer(loop_data(regexp(list(np.array(p.read_csv(os.path.join(dir,"test.csv")))[:,1]))))

After splitting my train_set into an x_train and an x_test for cross-validation, I transform my data using the feature_extraction function above.

x_train,x_test,test_set = feature_extraction(x_train,x_test,test_set)

Finally, I feed them into my algorithm:

def bayesian_ridge(x_train,x_test,y_train,y_test,test_set):
    algo = linear_model.BayesianRidge()
    algo = algo.fit(x_train,y_train)
    pred = algo.predict(x_test)
    error = pred - y_test
    result.append(algo.predict(test_set))
    print("Bayes_error: ",cross_val(error))
    return result
If train is already an ndarray, then your assumption about it returning a sparse matrix is incorrect. – willy Nov 22 '13 at 18:26
    
You might be right, let me check that. – Learner Nov 22 '13 at 18:28
    
Checked it. Going to add an edit to my question right now. – Learner Nov 22 '13 at 18:33
    
you should include all the code, not just messages. ndarray is dense by definition, sparse matrices are represented in different objects, so there is rather an error in your code (which you did not attach) – lejlot Nov 22 '13 at 19:50
    
Ok, I'll add all the code involved. – Learner Nov 22 '13 at 20:22

1 Answer (accepted)

TruncatedSVD.transform returns an array, not a sparse matrix. In fact, in the present version of scikit-learn, only the vectorizers return sparse matrices.
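To make the difference concrete, here is a minimal sketch that compares the output types of the two steps. The four sample documents and the `to_dense` helper are hypothetical, made up for illustration, and are not part of the question's code. The helper calls `toarray()` only when `scipy.sparse.issparse` reports a sparse matrix, so it should work both with and without the SVD step.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer


def to_dense(X):
    """Hypothetical helper: return X as a dense ndarray, converting only if sparse."""
    return X.toarray() if sparse.issparse(X) else np.asarray(X)


# Tiny made-up corpus, for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
    "the log is on the mat",
]

# The vectorizer produces a scipy sparse matrix.
tfidf = TfidfVectorizer().fit_transform(docs)

# TruncatedSVD accepts the sparse input and produces a dense numpy array.
reduced = TruncatedSVD(n_components=2).fit_transform(tfidf)

print(type(tfidf), sparse.issparse(tfidf))
print(type(reduced), sparse.issparse(reduced))

# Safe in both cases: converts the tf-idf matrix, passes the SVD output through.
dense_tfidf = to_dense(tfidf)
dense_reduced = to_dense(reduced)
```

Keep in mind that a dense copy of the full tf-idf matrix stores every zero explicitly, so with around 53k features it may need a lot of memory.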

    
Thank you! I didn't know that. – Learner Nov 23 '13 at 15:09
    
@Learner: it's in the docstring for that method. – larsmans Nov 23 '13 at 16:00
