I'm working on text classification using scikit-learn. Things work well with a single feature, but introducing multiple features is giving me errors. I think the problem is that I'm not formatting the data in the way that the classifier expects.
For example, this works fine:
data = np.array(df['feature1'])
classes = label_encoder.transform(np.asarray(df['target']))
X_train, X_test, Y_train, Y_test = train_test_split(data, classes)
classifier = Pipeline(...)
classifier.fit(X_train, Y_train)
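For context, here is a self-contained sketch of that working single-feature case. The DataFrame contents and the `MultinomialNB` step standing in for the elided `Pipeline(...)` are assumptions for illustration; the import path uses the current `sklearn.model_selection` location of `train_test_split`.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the real DataFrame (hypothetical data and labels).
df = pd.DataFrame({
    'feature1': ['short english text', 'more english text',
                 'other english words', 'yet more words',
                 'additional english text', 'still more text',
                 'some english text', 'final english text'],
    'target':   ['granted', 'denied', 'granted', 'denied',
                 'granted', 'denied', 'granted', 'denied'],
})

label_encoder = LabelEncoder()
label_encoder.fit(df['target'])

data = np.array(df['feature1'])   # shape (n_samples,): one string per sample
classes = label_encoder.transform(np.asarray(df['target']))

X_train, X_test, Y_train, Y_test = train_test_split(data, classes, random_state=0)

# Any vectorizer-plus-classifier pipeline works here; this one is an assumption.
classifier = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', MultinomialNB()),
])
classifier.fit(X_train, Y_train)
```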
But this:
data = np.array(df[['feature1', 'feature2']])
classes = label_encoder.transform(np.asarray(df['target']))
X_train, X_test, Y_train, Y_test = train_test_split(data, classes)
classifier = Pipeline(...)
classifier.fit(X_train, Y_train)
dies with
Traceback (most recent call last):
File "/Users/jed/Dropbox/LegalMetric/LegalMetricML/motion_classifier.py", line 157, in <module>
classifier.fit(X_train, Y_train)
File "/Library/Python/2.7/site-packages/sklearn/pipeline.py", line 130, in fit
Xt, fit_params = self._pre_transform(X, y, **fit_params)
File "/Library/Python/2.7/site-packages/sklearn/pipeline.py", line 120, in _pre_transform
Xt = transform.fit_transform(Xt, y, **fit_params_steps[name])
File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 780, in fit_transform
vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary)
File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 715, in _count_vocab
for feature in analyze(doc):
File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 229, in <lambda>
tokenize(preprocess(self.decode(doc))), stop_words)
File "/Library/Python/2.7/site-packages/sklearn/feature_extraction/text.py", line 195, in <lambda>
return lambda x: strip_accents(x.lower())
AttributeError: 'numpy.ndarray' object has no attribute 'lower'
during the preprocessing stage after classifier.fit() is called. I think the problem is the way I'm formatting the data, but I can't figure out how to get it right.
feature1 and feature2 are both English text strings, as is the target. I'm using LabelEncoder() to encode target, which seems to work fine.
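The failure can be reproduced with the vectorizer alone: `CountVectorizer` iterates over its input and treats each element as one document, so iterating a 2D array hands it ndarray rows rather than strings. A minimal sketch (toy strings assumed):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# One feature: a 1D array of strings -- each element is a document.
one_col = np.array(['some short text', 'more short text'])
CountVectorizer().fit_transform(one_col)          # works

# Two features: a 2D array -- iterating it yields *rows* (ndarrays),
# and the preprocessor then calls .lower() on an ndarray.
two_col = np.array([['some short text', 'a paragraph of text'],
                    ['more short text', 'a second paragraph']])
try:
    CountVectorizer().fit_transform(two_col)
except AttributeError as e:
    print(e)   # 'numpy.ndarray' object has no attribute 'lower'
```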
Here's an example of what `print data` returns, to give you a sense of how it's formatted right now:
[['some short english text'
'a paragraph of english text']
['some more short english text'
'a second paragraph of english text']
['some more short english text'
'a third paragraph of english text']]
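One common way around this is to vectorize each text column separately and stack the resulting count matrices. The sketch below uses `ColumnTransformer`, which is an assumption here (it postdates the scikit-learn version in the traceback); the data and labels are hypothetical. Naming each column as a *string* hands the vectorizer a 1D Series of raw documents, which is what `CountVectorizer` expects.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the real DataFrame (hypothetical data and labels).
df = pd.DataFrame({
    'feature1': ['some short english text', 'some more short english text',
                 'other short english text', 'final short english text'],
    'feature2': ['a paragraph of english text', 'a second paragraph of english text',
                 'a third paragraph of english text', 'a fourth paragraph of english text'],
    'target':   ['granted', 'denied', 'granted', 'denied'],
})
y = LabelEncoder().fit_transform(df['target'])

# Each CountVectorizer receives one 1D column of strings; the two count
# matrices are stacked side by side before reaching the classifier.
preprocess = ColumnTransformer([
    ('f1', CountVectorizer(), 'feature1'),
    ('f2', CountVectorizer(), 'feature2'),
])
classifier = Pipeline([('vect', preprocess), ('clf', MultinomialNB())])
classifier.fit(df[['feature1', 'feature2']], y)
```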
Comments:

BrenBarn (Feb 5 at 21:48): … scikit functions and it works okay.

James Daily (Feb 5 at 22:11): … `train_test_split()` and I get the same error. `train_test_split(df['feature1'], label_encoder.transform(df['target']))` works fine. `train_test_split(df[['feature1', 'feature2']], label_encoder.transform(df['matches']))` doesn't.

EMS (Feb 6 at 14:31): … `X_train` looks like in each of the two cases.

James Daily (Feb 6 at 15:45): `X_train` looks the same as the `print data` example in the question (not literally the same, since it was split, of course). With one feature `X_train` looks like this: `['short english text' 'additional english text' 'more short english text' ..., 'still more short english text' 'yet more short english text' 'english text']`. So with two features it's an array of arrays of strings and with one feature it's an array of strings. Presumably that's the problem, but I don't know what `fit()` expects it to look like.

James Daily (Feb 6 at 15:56): The docs for `fit()` expect an `{array-like, sparse matrix}, shape = [n_samples, n_features]`. Printing `X_train.shape` with two features gives `(4630, 2)`. With one feature it's `(4630,)`. So that seems correct. Not sure what I'm missing.