I am currently working on large-scale hierarchical text classification of ODP documents. The dataset I was given is in libSVM format. I am trying to train a linear-kernel SVM with Python's scikit-learn. Below are two sample lines from the training data:

29 9454:1 11742:1 18884:14 26840:1 35147:1 52782:1 72083:1 73244:1 78945:1 79913:1 79986:1 86710:3 117286:1 139820:1 142458:1 146315:1 151005:2 161454:3 172237:1 1091130:1 1113562:1 1133451:1 1139046:1 1157534:1 1180618:2 1182024:1 1187711:1 1194345:3 

33 2474:1 8152:1 19529:2 35038:1 48104:1 59738:1 61854:3 67943:1 74093:1 78945:1 88558:1 90848:1 97087:1 113284:16 118917:1 122375:1 124939:1 

This is the code I used to build the linear SVM model:

from sklearn.datasets import load_svmlight_file
from sklearn import svm

# Load the sparse feature matrices and label vectors from the libSVM files
X_train, y_train = load_svmlight_file("/path-to-file/train.txt")
X_test, y_test = load_svmlight_file("/path-to-file/test.txt")

# Fit a linear-kernel SVM, then report accuracy on the test set
clf = svm.SVC(kernel='linear')
clf.fit(X_train, y_train)
print clf.score(X_test,y_test)

Upon running clf.score(), I get the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-b285fbfb3efe> in <module>()
      1 start_time = time.time()
----> 2 print clf.score(X_test,y_test)
      3 print time.time() - start_time, "seconds"

/Users/abc/anaconda/lib/python2.7/site-packages/sklearn/base.pyc in score(self, X, y)
    292         """
    293         from .metrics import accuracy_score
--> 294         return accuracy_score(y, self.predict(X))
    295 
    296 

/Users/abc/anaconda/lib/python2.7/site-packages/sklearn/svm/base.pyc in predict(self, X)
    464             Class labels for samples in X.
    465         """
--> 466         y = super(BaseSVC, self).predict(X)
    467         return self.classes_.take(y.astype(np.int))
    468 

/Users/abc/anaconda/lib/python2.7/site-packages/sklearn/svm/base.pyc in predict(self, X)
    280         y_pred : array, shape (n_samples,)
    281         """
--> 282         X = self._validate_for_predict(X)
    283         predict = self._sparse_predict if self._sparse else self._dense_predict
    284         return predict(X)

/Users/abc/anaconda/lib/python2.7/site-packages/sklearn/svm/base.pyc in _validate_for_predict(self, X)
    402             raise ValueError("X.shape[1] = %d should be equal to %d, "
    403                              "the number of features at training time" %
--> 404                              (n_features, self.shape_fit_[1]))
    405         return X
    406 

ValueError: X.shape[1] = 1199847 should be equal to 1199830, the number of features at training time

Can someone please tell me what exactly is wrong with either this code or my data? Thanks in advance.

Below are the values of X_train, y_train, X_test, and y_test:

X_train:

  (0, 9453)     1.0
  (0, 11741)    1.0
  (0, 18883)    14.0
  (0, 26839)    1.0
  (0, 35146)    1.0
  (0, 52781)    1.0
  (0, 72082)    1.0
  (0, 73243)    1.0
  (0, 78944)    1.0
  (0, 79912)    1.0
  (0, 79985)    1.0
  (0, 86709)    3.0
  (0, 117285)   1.0
  (0, 139819)   1.0
  (0, 142457)   1.0
  (0, 146314)   1.0
  (0, 151004)   2.0
  (0, 161453)   3.0
  (0, 172236)   1.0
  (0, 187531)   2.0
  (0, 202462)   1.0
  (0, 210417)   1.0
  (0, 250581)   1.0
  (0, 251689)   1.0
  (0, 296384)   2.0
  : :
  (4462, 735469)    1.0
  (4462, 737059)    15.0
  (4462, 740127)    1.0
  (4462, 743798)    1.0
  (4462, 766063)    1.0
  (4462, 778958)    2.0
  (4462, 784004)    4.0
  (4462, 837264)    2.0
  (4462, 839095)    22.0
  (4462, 844735)    6.0
  (4462, 859721)    2.0
  (4462, 875267)    1.0
  (4462, 910761)    1.0
  (4462, 931244)    1.0
  (4462, 945069)    6.0
  (4462, 948728)    1.0
  (4462, 948850)    2.0
  (4462, 957682)    1.0
  (4462, 975170)    1.0
  (4462, 989192)    1.0
  (4462, 1014294)   1.0
  (4462, 1042424)   1.0
  (4462, 1049027)   1.0
  (4462, 1072931)   1.0
  (4462, 1145790)   1.0

y_train:

[  2.90000000e+01   3.30000000e+01   3.30000000e+01 ...,   1.65475000e+05
   1.65518000e+05   1.65518000e+05]

X_test:

  (0, 18573)    1.0
  (0, 23501)    1.0
  (0, 29954)    1.0
  (0, 42112)    1.0
  (0, 46402)    1.0
  (0, 63041)    2.0
  (0, 67942)    2.0
  (0, 83522)    1.0
  (0, 88413)    2.0
  (0, 99454)    1.0
  (0, 126041)   1.0
  (0, 139819)   1.0
  (0, 142678)   1.0
  (0, 151004)   1.0
  (0, 166351)   2.0
  (0, 173794)   1.0
  (0, 192162)   3.0
  (0, 210417)   2.0
  (0, 254468)   1.0
  (0, 263895)   2.0
  (0, 277567)   1.0
  (0, 278419)   2.0
  (0, 279181)   2.0
  (0, 281319)   2.0
  (0, 298898)   1.0
  : :
  (1857, 1100504)   3.0
  (1857, 1103247)   1.0
  (1857, 1105578)   1.0
  (1857, 1108986)   2.0
  (1857, 1118486)   1.0
  (1857, 1120807)   9.0
  (1857, 1129243)   2.0
  (1857, 1131786)   1.0
  (1857, 1134029)   2.0
  (1857, 1134410)   5.0
  (1857, 1134494)   1.0
  (1857, 1139045)   25.0
  (1857, 1142239)   3.0
  (1857, 1142651)   1.0
  (1857, 1144787)   1.0
  (1857, 1151891)   1.0
  (1857, 1152094)   1.0
  (1857, 1157533)   1.0
  (1857, 1159376)   1.0
  (1857, 1178944)   1.0
  (1857, 1181310)   2.0
  (1857, 1182023)   1.0
  (1857, 1187098)   1.0
  (1857, 1194344)   2.0
  (1857, 1195819)   9.0

y_test:

[  2.90000000e+01   3.30000000e+01   1.56000000e+02 ...,   1.65434000e+05
   1.65475000e+05   1.65518000e+05]
Could you give the shapes of all your Xs and ys? –  tk. Mar 4 '14 at 9:04
    
@tk I have updated my question to post the values. –  user3377770 Mar 4 '14 at 10:20

1 Answer

The error message

ValueError: X.shape[1] = 1199847 should be equal to 1199830, the number of features at training time

explains itself: the test data has a different number of features (1199847) from the training data the model was fitted on (1199830). That is, X_train.shape[1] is not equal to X_test.shape[1].

You should check why they differ, because the two matrices must have the same number of columns.
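
One way to check is to scan each file for its largest feature index. The following is a minimal sketch using only the standard library; the helper name max_feature_index and the tiny sample files are made up for illustration and stand in for the real train.txt and test.txt.

```python
import os
import tempfile


def max_feature_index(path):
    """Return the largest feature index that appears in a libSVM-format file."""
    largest = 0
    with open(path) as handle:
        for line in handle:
            line = line.split("#", 1)[0].strip()  # drop trailing comments
            if not line:
                continue
            for token in line.split()[1:]:  # the first token is the label
                index, _, _value = token.partition(":")
                if index == "qid":  # optional query-id field, not a feature
                    continue
                largest = max(largest, int(index))
    return largest


# Hypothetical miniature files standing in for train.txt and test.txt.
tmp_dir = tempfile.mkdtemp()
train_path = os.path.join(tmp_dir, "train.txt")
test_path = os.path.join(tmp_dir, "test.txt")
with open(train_path, "w") as handle:
    handle.write("29 1:1 3:14 5:1\n33 2:1 4:2\n")
with open(test_path, "w") as handle:
    handle.write("29 2:1 7:3\n33 1:2 6:1\n")

train_max = max_feature_index(train_path)
test_max = max_feature_index(test_path)
print("largest index in train:", train_max)
print("largest index in test:", test_max)
```

If the two numbers differ on the real files, the loaded matrices will most likely have different widths as well.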

One possibility is that, because the data is loaded as sparse matrices, load_svmlight_file infers the number of features separately for each file from the largest feature index it finds. If the test data contains feature indices beyond any seen in the training data, the resulting X_test will have more columns than X_train. To avoid this, pass the same n_features argument to load_svmlight_file for both files, with a value at least as large as the largest feature index in either file.
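
As a rough sketch of that fix, the snippet below first reproduces the mismatch on two tiny made-up files (stand-ins for the real train.txt and test.txt), then shows two ways to get matching widths: loading both files in one call with load_svmlight_files, which constrains them to share a feature count, or passing an explicit n_features to load_svmlight_file. Taking n_features as the larger of the two inferred widths is an assumption for this toy data; on real data, use whatever the true feature-space size is.

```python
import os
import tempfile

from sklearn.datasets import load_svmlight_file, load_svmlight_files

# Hypothetical miniature files; the test file uses a higher index than train.
tmp_dir = tempfile.mkdtemp()
train_path = os.path.join(tmp_dir, "train.txt")
test_path = os.path.join(tmp_dir, "test.txt")
with open(train_path, "w") as handle:
    handle.write("29 1:1 3:14 5:1\n33 2:1 4:2\n")
with open(test_path, "w") as handle:
    handle.write("29 2:1 7:3\n33 1:2 6:1\n")

# Loaded separately, each matrix gets its width from its own largest index,
# so the two widths disagree (this is the situation behind the ValueError).
X_a, _ = load_svmlight_file(train_path)
X_b, _ = load_svmlight_file(test_path)
separate_shapes = (X_a.shape[1], X_b.shape[1])
print("separate widths:", separate_shapes)

# Option 1: load both files together so they share one feature count.
X_train, y_train, X_test, y_test = load_svmlight_files([train_path, test_path])
print("joint widths:", X_train.shape[1], X_test.shape[1])

# Option 2: pass the same explicit n_features to both calls.
n_features = max(separate_shapes)  # assumption: large enough for both files
X_train2, _ = load_svmlight_file(train_path, n_features=n_features)
X_test2, _ = load_svmlight_file(test_path, n_features=n_features)
print("explicit widths:", X_train2.shape[1], X_test2.shape[1])
```

Note that an n_features value smaller than what a file actually contains makes load_svmlight_file raise a ValueError, so reusing X_train.shape[1] for the test file only works when the training file already covers every index in the test file.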
