I'm working on text classification using scikit-learn. Things work well with a single feature, but introducing multiple features is giving me errors. I think the problem is that I'm not formatting the data in the way that the classifier expects.
For example, this works fine:
data = np.array(df['feature1'])
classes = label_encoder.transform(np.asarray(df['target']))
X_train, X_test, Y_train, Y_test = train_test_split(data, classes)
classifier = Pipeline(...)
classifier.fit(X_train, Y_train)
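For context, here is a self-contained sketch of that working single-feature case. The DataFrame contents and the `MultinomialNB` step standing in for the elided `Pipeline(...)` are assumptions for illustration; the import path uses the current `sklearn.model_selection` location of `train_test_split`.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the real DataFrame (hypothetical data and labels).
df = pd.DataFrame({
    'feature1': ['short english text', 'more english text',
                 'other english words', 'yet more words',
                 'additional english text', 'still more text',
                 'some english text', 'final english text'],
    'target':   ['granted', 'denied', 'granted', 'denied',
                 'granted', 'denied', 'granted', 'denied'],
})

label_encoder = LabelEncoder()
label_encoder.fit(df['target'])

data = np.array(df['feature1'])   # shape (n_samples,): one string per sample
classes = label_encoder.transform(np.asarray(df['target']))

X_train, X_test, Y_train, Y_test = train_test_split(data, classes, random_state=0)

# Any vectorizer-plus-classifier pipeline works here; this one is an assumption.
classifier = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', MultinomialNB()),
])
classifier.fit(X_train, Y_train)
```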
But this:
data = np.array(df[['feature1', 'feature2']])
classes = label_encoder.transform(np.asarray(df['target']))
X_train, X_test, Y_train, Y_test = train_test_split(data, classes)
classifier = Pipeline(...)
classifier.fit(X_train, Y_train)
dies with
Traceback (most recent call last):
File "/Users/jed/Dropbox/LegalMetric/LegalMetricML/motion_classifier.py", line 157, in <module>
classifier.fit(X_train, Y_train)
File "/Library/Python/2.7/site-packages/sklearn/pipeline.py", line 130, in fit
Xt, fit_params = self._pre_transform(X, y, **fit_params)
File "/Library/Python/2.7/site-packages/sklearn/pipeline.py", line 120, in _pre_transform
Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 780, in fit_transform
vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary)
File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 715, in _count_vocab
for feature in analyze(doc):
File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 229, in <lambda>
tokenize(preprocess(self.decode(doc))), stop_words)
File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 195, in <lambda>
return lambda x: strip_accents(x.lower())
AttributeError: 'numpy.ndarray' object has no attribute 'lower'
during the preprocessing stage after classifier.fit() is called. I think the problem is the way I'm formatting the data, but I can't figure out how to get it right.
feature1 and feature2 are both English text strings, as is the target. I'm using LabelEncoder() to encode target, which seems to work fine.
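The failure can be reproduced with the vectorizer alone: `CountVectorizer` iterates over its input and treats each element as one document, so iterating a 2D array hands it ndarray rows rather than strings. A minimal sketch (toy strings assumed):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# One feature: a 1D array of strings -- each element is a document.
one_col = np.array(['some short text', 'more short text'])
CountVectorizer().fit_transform(one_col)          # works

# Two features: a 2D array -- iterating it yields *rows* (ndarrays),
# and the preprocessor then calls .lower() on an ndarray.
two_col = np.array([['some short text', 'a paragraph of text'],
                    ['more short text', 'a second paragraph']])
try:
    CountVectorizer().fit_transform(two_col)
except AttributeError as e:
    print(e)   # 'numpy.ndarray' object has no attribute 'lower'
```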
Here's an example of what `print data` returns, to give you a sense of how it's formatted right now:
[['some short english text'
'a paragraph of english text']
['some more short english text'
'a second paragraph of english text']
['some more short english text'
'a third paragraph of english text']]
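One common way around this is to vectorize each text column separately and stack the resulting count matrices. The sketch below uses `ColumnTransformer`, which is an assumption here (it postdates the scikit-learn version in the traceback); the data and labels are hypothetical. Naming each column as a *string* hands the vectorizer a 1D Series of raw documents, which is what `CountVectorizer` expects.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the real DataFrame (hypothetical data and labels).
df = pd.DataFrame({
    'feature1': ['some short english text', 'some more short english text',
                 'other short english text', 'final short english text'],
    'feature2': ['a paragraph of english text', 'a second paragraph of english text',
                 'a third paragraph of english text', 'a fourth paragraph of english text'],
    'target':   ['granted', 'denied', 'granted', 'denied'],
})
y = LabelEncoder().fit_transform(df['target'])

# Each CountVectorizer receives one 1D column of strings; the two count
# matrices are stacked side by side before reaching the classifier.
preprocess = ColumnTransformer([
    ('f1', CountVectorizer(), 'feature1'),
    ('f2', CountVectorizer(), 'feature2'),
])
classifier = Pipeline([('vect', preprocess), ('clf', MultinomialNB())])
classifier.fit(df[['feature1', 'feature2']], y)
```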
Comments:

BrenBarn (Feb 5 at 21:48): … scikit functions and it works okay.

James Daily (Feb 5 at 22:11): … `train_test_split()` and I get the same error. `train_test_split(df['feature1'], label_encoder.transform(df['target']))` works fine. `train_test_split(df[['feature1', 'feature2']], label_encoder.transform(df['matches']))` doesn't.

EMS (Feb 6 at 14:31): … `X_train` looks like in each of the two cases.

James Daily (Feb 6 at 15:45): `X_train` looks the same as the `print data` example in the question (not literally the same, since it was split, of course). With one feature `X_train` looks like this: `['short english text' 'additional english text' 'more short english text' ..., 'still more short english text' 'yet more short english text' 'english text']`. So with two features it's an array of arrays of strings and with one feature it's an array of strings. Presumably that's the problem, but I don't know what `fit()` expects it to look like.

James Daily (Feb 6 at 15:56): The docs for `fit()` expect an `{array-like, sparse matrix}, shape = [n_samples, n_features]`. Printing `X_train.shape` with two features gives `(4630, 2)`. With one feature it's `(4630,)`. So that seems correct. Not sure what I'm missing.