I am currently working on large-scale hierarchical text classification of ODP documents. The dataset I was given is in libSVM format. I am trying to train a linear-kernel SVM with Python's scikit-learn. Below are two sample lines from the training data:
29 9454:1 11742:1 18884:14 26840:1 35147:1 52782:1 72083:1 73244:1 78945:1 79913:1 79986:1 86710:3 117286:1 139820:1 142458:1 146315:1 151005:2 161454:3 172237:1 1091130:1 1113562:1 1133451:1 1139046:1 1157534:1 1180618:2 1182024:1 1187711:1 1194345:3
33 2474:1 8152:1 19529:2 35038:1 48104:1 59738:1 61854:3 67943:1 74093:1 78945:1 88558:1 90848:1 97087:1 113284:16 118917:1 122375:1 124939:1
The following is the code I used to build the linear SVM model:
from sklearn.datasets import load_svmlight_file
from sklearn import svm
X_train, y_train = load_svmlight_file("/path-to-file/train.txt")
X_test, y_test = load_svmlight_file("/path-to-file/test.txt")
clf = svm.SVC(kernel='linear')
clf.fit(X_train, y_train)
print clf.score(X_test,y_test)
Upon running clf.score(), I get the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-6-b285fbfb3efe> in <module>()
1 start_time = time.time()
----> 2 print clf.score(X_test,y_test)
3 print time.time() - start_time, "seconds"
/Users/abc/anaconda/lib/python2.7/site-packages/sklearn/base.pyc in score(self, X, y)
292 """
293 from .metrics import accuracy_score
--> 294 return accuracy_score(y, self.predict(X))
295
296
/Users/abc/anaconda/lib/python2.7/site-packages/sklearn/svm/base.pyc in predict(self, X)
464 Class labels for samples in X.
465 """
--> 466 y = super(BaseSVC, self).predict(X)
467 return self.classes_.take(y.astype(np.int))
468
/Users/abc/anaconda/lib/python2.7/site-packages/sklearn/svm/base.pyc in predict(self, X)
280 y_pred : array, shape (n_samples,)
281 """
--> 282 X = self._validate_for_predict(X)
283 predict = self._sparse_predict if self._sparse else self._dense_predict
284 return predict(X)
/Users/abc/anaconda/lib/python2.7/site-packages/sklearn/svm/base.pyc in _validate_for_predict(self, X)
402 raise ValueError("X.shape[1] = %d should be equal to %d, "
403 "the number of features at training time" %
--> 404 (n_features, self.shape_fit_[1]))
405 return X
406
ValueError: X.shape[1] = 1199847 should be equal to 1199830, the number of features at training time
Can someone please tell me what exactly is wrong with either this code or my data? Thanks in advance.
Below are the printed values of X_train, y_train, X_test, and y_test:
X_train:
(0, 9453) 1.0
(0, 11741) 1.0
(0, 18883) 14.0
(0, 26839) 1.0
(0, 35146) 1.0
(0, 52781) 1.0
(0, 72082) 1.0
(0, 73243) 1.0
(0, 78944) 1.0
(0, 79912) 1.0
(0, 79985) 1.0
(0, 86709) 3.0
(0, 117285) 1.0
(0, 139819) 1.0
(0, 142457) 1.0
(0, 146314) 1.0
(0, 151004) 2.0
(0, 161453) 3.0
(0, 172236) 1.0
(0, 187531) 2.0
(0, 202462) 1.0
(0, 210417) 1.0
(0, 250581) 1.0
(0, 251689) 1.0
(0, 296384) 2.0
: :
(4462, 735469) 1.0
(4462, 737059) 15.0
(4462, 740127) 1.0
(4462, 743798) 1.0
(4462, 766063) 1.0
(4462, 778958) 2.0
(4462, 784004) 4.0
(4462, 837264) 2.0
(4462, 839095) 22.0
(4462, 844735) 6.0
(4462, 859721) 2.0
(4462, 875267) 1.0
(4462, 910761) 1.0
(4462, 931244) 1.0
(4462, 945069) 6.0
(4462, 948728) 1.0
(4462, 948850) 2.0
(4462, 957682) 1.0
(4462, 975170) 1.0
(4462, 989192) 1.0
(4462, 1014294) 1.0
(4462, 1042424) 1.0
(4462, 1049027) 1.0
(4462, 1072931) 1.0
(4462, 1145790) 1.0
y_train:
[ 2.90000000e+01 3.30000000e+01 3.30000000e+01 ..., 1.65475000e+05
1.65518000e+05 1.65518000e+05]
X_test:
(0, 18573) 1.0
(0, 23501) 1.0
(0, 29954) 1.0
(0, 42112) 1.0
(0, 46402) 1.0
(0, 63041) 2.0
(0, 67942) 2.0
(0, 83522) 1.0
(0, 88413) 2.0
(0, 99454) 1.0
(0, 126041) 1.0
(0, 139819) 1.0
(0, 142678) 1.0
(0, 151004) 1.0
(0, 166351) 2.0
(0, 173794) 1.0
(0, 192162) 3.0
(0, 210417) 2.0
(0, 254468) 1.0
(0, 263895) 2.0
(0, 277567) 1.0
(0, 278419) 2.0
(0, 279181) 2.0
(0, 281319) 2.0
(0, 298898) 1.0
: :
(1857, 1100504) 3.0
(1857, 1103247) 1.0
(1857, 1105578) 1.0
(1857, 1108986) 2.0
(1857, 1118486) 1.0
(1857, 1120807) 9.0
(1857, 1129243) 2.0
(1857, 1131786) 1.0
(1857, 1134029) 2.0
(1857, 1134410) 5.0
(1857, 1134494) 1.0
(1857, 1139045) 25.0
(1857, 1142239) 3.0
(1857, 1142651) 1.0
(1857, 1144787) 1.0
(1857, 1151891) 1.0
(1857, 1152094) 1.0
(1857, 1157533) 1.0
(1857, 1159376) 1.0
(1857, 1178944) 1.0
(1857, 1181310) 2.0
(1857, 1182023) 1.0
(1857, 1187098) 1.0
(1857, 1194344) 2.0
(1857, 1195819) 9.0
y_test:
[ 2.90000000e+01 3.30000000e+01 1.56000000e+02 ..., 1.65434000e+05
1.65475000e+05 1.65518000e+05]
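
A possible explanation for the error above, which I have not confirmed on the full dataset: as far as I understand, load_svmlight_file infers the number of features from the largest feature index in the file it reads when n_features is not given, so two files loaded separately can end up with different widths (here 1199830 for training and 1199847 for test). The sketch below reproduces this with two tiny made-up files (the data and temporary paths are hypothetical, not the ODP set) and then loads both with load_svmlight_files, which is meant to give all returned matrices the same number of features.

```python
import os
import tempfile

from sklearn.datasets import load_svmlight_file, load_svmlight_files
from sklearn.svm import SVC

# Two tiny made-up libSVM files (hypothetical data, not the ODP set).
# The test file uses a larger feature index (7) than the training file (5).
tmp_dir = tempfile.mkdtemp()
train_path = os.path.join(tmp_dir, "train.txt")
test_path = os.path.join(tmp_dir, "test.txt")
with open(train_path, "w") as f:
    f.write("29 1:1 3:2\n33 2:1 5:1\n")
with open(test_path, "w") as f:
    f.write("29 1:1 7:3\n33 2:2 4:1\n")

# Loaded separately, each matrix takes its width from its own file,
# so the two widths do not have to agree.
X_tr_sep, _ = load_svmlight_file(train_path)
X_te_sep, _ = load_svmlight_file(test_path)
print(X_tr_sep.shape[1] == X_te_sep.shape[1])

# Loaded together, both matrices share one feature count.
X_train, y_train, X_test, y_test = load_svmlight_files([train_path, test_path])
print(X_train.shape[1] == X_test.shape[1])

# With matching widths, fitting and scoring no longer trip the shape check.
clf = SVC(kernel="linear")
clf.fit(X_train, y_train)
score = clf.score(X_test, y_test)
print(score)
```

With the real files, the same call would take the two paths from my code above. Passing n_features to load_svmlight_file looks like another option, but the value would have to be at least as large as the biggest feature index across both files, otherwise loading the wider file fails.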