
I have a dataset, data, and an array of labels, target, which I use to build a supervised model in scikit-learn with the k-Nearest Neighbors algorithm.

from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier()
neigh.fit(data, target)

I can now classify my training set with this same model. To get the classification score:

neigh.score(data, target)


Now my problem is that this score depends on the type of the target object.

  • If it is a Python list, that is, created with list() and filled in with target.append(), the score method returns 0.68.
  • If it is a NumPy array, created with target = np.empty(shape=(length,1), dtype="S36") (it contains only 36-character strings) and filled in with target[k] = value, the score method returns 0.008.

To check whether the predictions actually differed, I wrote to text files the output of

for k in data:
    neigh.predict(k)

in each case. The results were the same.

What can explain the score difference?

 
Which NumPy version? –  larsmans Jul 16 '13 at 10:52
 
What happens if you specify the array's shape as (length) only? So that its shape will be (length,) and not (length, 1)? –  Harel Jul 16 '13 at 10:56
 
@Harel, Thank you! That solved the problem. But I don't really understand why, how did you think about it ? –  cardboard Jul 16 '13 at 11:35
 
Nothing like experience... there's a difference in numpy between two-dimensional arrays with one dimension size = 1 and one-dimensional arrays. While they are often used interchangeably, they are not identical, and in this case this slight difference produced a problem. Is it ethical to ask for an upvote on my original comment? :) –  Harel Jul 17 '13 at 12:29
 
@Harel, would love to, but not enough reputation to upvote..! –  cardboard Jul 18 '13 at 14:24

1 Answer


@Harel spotted the problem; here's the explanation:

np.empty(shape=(length, 1), dtype="S36")

creates an array of the wrong shape: it is two-dimensional, with shape (length, 1). scikit-learn estimators almost invariably want a 1-d target array, i.e. shape=(length,). The fact that this doesn't raise an exception is an oversight.
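
Here is a minimal sketch of one way such a gap can arise. It assumes the score is computed as the mean of an elementwise comparison between predictions and targets, which is an assumption about the implementation rather than something confirmed here. The labels are made up for illustration. Comparing a (length, 1) target against (length,) predictions broadcasts to a (length, length) matrix, so the mean counts matches over every pair instead of position by position:

```python
import numpy as np

# Hypothetical labels, for illustration only.
y_pred = np.array(["a", "b", "c", "a"])      # shape (4,), like the output of predict
y_true_1d = np.array(["a", "b", "c", "b"])   # shape (4,)
y_true_2d = y_true_1d.reshape(-1, 1)         # shape (4, 1), like the problematic target

# Position-by-position comparison: one boolean per sample.
match_1d = (y_true_1d == y_pred)             # shape (4,)

# Broadcasting compares every target with every prediction.
match_2d = (y_true_2d == y_pred)             # shape (4, 4)

acc_1d = match_1d.mean()                     # fraction of samples predicted correctly
acc_2d = match_2d.mean()                     # fraction of matching (target, prediction) pairs

# Flattening the target restores the expected 1-d shape.
y_true_fixed = y_true_2d.ravel()             # shape (4,)
acc_fixed = (y_true_fixed == y_pred).mean()

print(match_1d.shape, acc_1d)
print(match_2d.shape, acc_2d)
print(y_true_fixed.shape, acc_fixed)
```

With many distinct labels, the pairwise fraction is typically far smaller than the true accuracy, even though the predictions themselves are identical in both cases.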
