scikit-learn is a machine-learning library for Python that provides simple and efficient tools for data analysis and data mining. It is accessible to everybody and reusable in various contexts. It is built on NumPy, SciPy, and matplotlib. The project is open source and commercially usable (BSD ...


4
votes
0 answers
160 views

How to improve the performance of this tiny distance Python function

I'm running into a performance bottleneck when using a custom distance metric function for a clustering algorithm from sklearn. The result as shown by Run Snake Run is this: Clearly the problem is ...
4
votes
0 answers
181 views

How do you visualize a ward tree from sklearn.cluster.ward_tree?

In sklearn there is one agglomerative clustering algorithm implemented, the ward method minimizing variance. Usually sklearn is documented with lots of nice usage examples, but I couldn't find ...
3
votes
0 answers
60 views

Numpy View Reshape Without Copy (2d Moving/Sliding Window, Strides, Masked Memory Structures)

I have an image stored as a 2d numpy array (possibly multi-d). I can make a view onto that array that reflects a 2d sliding window, but when I reshape it so that each row is a flattened window (rows ...
3
votes
0 answers
48 views

multiprocessing.Pool hangs if child causes a segmentation fault

I want to apply a function in parallel using multiprocessing.Pool. The problem is that if one function call triggers a segmentation fault the Pool hangs forever. Has anybody an idea how I can make a ...
3
votes
0 answers
29 views

Scoring function for RidgeClassifierCV

I'm trying to implement a custom scoring function for RidgeClassifierCV in scikit-learn. This involves passing a custom scoring function as the score_func when initializing the RidgeClassifierCV ...
3
votes
0 answers
271 views

PyInstaller: a module is not included into --onefile, but works fine with --onedir

I'm using PyInstaller to bundle my application into one .exe file. The problem is that it works fine with --onedir option, but can't find a module when built with --onefile. Both --onedir and ...
3
votes
0 answers
1k views

How to weight classes in a RandomForest implementation

I am working on 3D point identification using the RandomForest method from scikit. One of the issues I keep running into is that certain classes are present more often than other classes. This means ...
3
votes
0 answers
491 views

K-Fold Cross Validation for Naive Bayes Classifier

I created a classifier using nltk that classifies reviews into 3 classes: pos, neg and neu. def get_feature(word): return dict([(word, True)]) def bag_of_words(words): return ...
2
votes
0 answers
51 views

scikit-learn svm module and predict function not working

I am trying to get an SVM to work using scikit-learn but cannot get the results I am expecting. I would like to use k-means to classify roughly 2-5 data clusters and then use an SVM to build a model ...
2
votes
0 answers
28 views

Creating a sklearn.linear_model.LogisticRegression instance from existing coefficients

Can one create such an instance based on existing coefficients which were calculated say in a different implementation (e.g. Java)? I tried creating an instance then setting coef_ and intercept_ ...
2
votes
0 answers
93 views

Saving scikit-learn classifier causes memory error

My machine has 16G RAM and the training program uses memory up to 2.6G. But when I want to save the classifier (trained using sklearn.svm.SVC from a large dataset) as pickle file, it consumes too much ...
2
votes
0 answers
62 views

scikit grid search over multiple classifiers python

I wanted to know if there is a better, more built-in way to do grid search and test multiple models in a single pipeline. Of course the parameters of the models would be different, which made is ...
2
votes
0 answers
86 views

code in multiprocessing.pool freezes, but works fine through a loop or joblib.Parallel

ok, I have the following code: pool = Pool(worker_amount) results = pool.imap(task_handler, tasks) for result in results: do_something(result) pool.close() pool.join() ... it never ...
2
votes
0 answers
17 views

return intercept from sklearn enet_path

When using functions like sklearn.linear_model.lasso_path, if return_models is set to False, the returned values are the alphas and the coefficients. However, the intercepts for the path are NOT ...
2
votes
0 answers
119 views

Can I make partial plots for DecisionTreeClassifier in scikit-learn (and R)

I have some old code using scikit-learn's DecisionTreeClassifier. I'd like to make partial plots based on this classifier. All the examples I've seen so far (such as ...
2
votes
0 answers
236 views

sklearn selectKbest: which variables were chosen?

I'm trying to get sklearn to select the best k variables (for example k=1) for a linear regression. This works and I can get the R-squared, but it doesn't tell me which variables were the best. How ...
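
A minimal sketch of one way to see which columns a fitted `SelectKBest` kept, using its `get_support` method. The synthetic data and the choice of `f_regression` as the score function are assumptions for illustration; the asker's data and scoring function are not shown in the excerpt.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data (an assumption for illustration): only column 2 drives y.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

# Keep the single best-scoring feature.
selector = SelectKBest(score_func=f_regression, k=1).fit(X, y)

# get_support(indices=True) returns the indices of the retained columns;
# without indices=True it returns a boolean mask over all columns.
selected = selector.get_support(indices=True)
print(selected)
```

If the columns have names, indexing the list of names with `selected` (or with the boolean mask) maps the result back to variable names.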
2
votes
0 answers
196 views

Factor Loadings using sklearn

I want the correlations between individual variables and principal components in Python. I am using PCA in sklearn. I don't understand how I can obtain the loading matrix after I have decomposed my ...
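
A hedged sketch of one direct way to get those correlations: project the data with `PCA`, then compute the Pearson correlation between each original column and each component score with `np.corrcoef`. The synthetic data and the choice of two components are assumptions for illustration, and this computes the correlations themselves rather than any particular textbook scaling of the loadings.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (an assumption for illustration) with two correlated columns.
rng = np.random.RandomState(0)
X = rng.normal(size=(300, 4))
X[:, 1] += 2.0 * X[:, 0]

pca = PCA(n_components=2)
scores = pca.fit_transform(X)  # shape (n_samples, n_components)

# Correlate every original variable with every component score.
# The stacked matrix has the 4 variables first, then the 2 components.
n_vars = X.shape[1]
corr = np.corrcoef(np.hstack([X, scores]), rowvar=False)
loadings = corr[:n_vars, n_vars:]  # rows: variables, columns: components
print(loadings)
```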
2
votes
0 answers
222 views

Color image segmentation with Ward clustering in sklearn

I'm trying to use the Ward method in sklearn to segment a color image. I've been working from the sklearn example that segments a grayscale image ...
2
votes
0 answers
149 views

sklearn OMP: Error #15 when fitting models

I have recently uninstalled a nicely working copy of Enthought Canopy 32-bit and installed Canopy version 1.1.0 (64 bit). When I try to use sklearn to fit a model my kernel crashes, and I get the ...
2
votes
0 answers
63 views

Int overflow using matthews_corrcoef on windows 64bit

I'm not really sure whether this is a numpy or a scikit-learn issue... or a Windows issue? When calculating the matthews_corrcoef using scikit-learn on the code below, I get an overflow warning. It is ...
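
The asker's code is not shown in the excerpt, so the cause is unknown. One plausible source (an assumption) is that products of confusion-matrix counts exceed a fixed-width 32-bit integer. The sketch below uses a hypothetical helper, `mcc_from_counts`, that computes the coefficient from raw counts with plain Python integers, which cannot overflow, as a cross-check.

```python
import math


def mcc_from_counts(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    # Python ints have arbitrary precision, so these products cannot overflow.
    denom_sq = (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    if denom_sq == 0:
        # Degenerate case (an empty row or column); 0.0 is a common convention.
        return 0.0
    return (tp * tn - fp * fn) / math.sqrt(denom_sq)


# 50000 * 50000 = 2.5e9 already exceeds the int32 maximum (about 2.147e9).
perfect = mcc_from_counts(tp=50000, tn=50000, fp=0, fn=0)
chance = mcc_from_counts(tp=10, tn=10, fp=10, fn=10)
print(perfect, chance)
```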
2
votes
0 answers
161 views

tf-idf - accessing a large sparse scipy matrix & getting the highest values

For the tfidf result matrix, I wanted to get the top tfidf values. I saw how one could set max features amount for the tfidf vectorizer, but that is for the words with the top tf count. I want to ...
2
votes
0 answers
440 views

Multilabel Classification with Feature Selection (scikit-learn)

I am using scikit-learn to solve a multi-label classification problem with a large number of labels. I followed the ideas from one of the core devs of the project (larsmans). It gives me a runtime ...
2
votes
0 answers
624 views

Is there a python implementation for the SMOTE algorithm?

I want to use the SMOTE (Synthetic Minority Over-sampling Technique) algorithm in Python. I've found implementations in Weka and R. Does anyone know of an implementation in Python? I've found an ...
2
votes
0 answers
205 views

scikit-learn: using sample_weight in grid_search

Is it possible to perform a grid_search (to get the best SVM's C) and yet specify the sample_weight with scikit-learn? Here's the error I'm confronted with: gs = GridSearchCV(svm.SVC(C=1), [{'kernel': ...
1
vote
0 answers
26 views

Python not installing sklearn

I am working with Ubuntu 14. I have downloaded the dpkg package for sklearn and unpacked it. I try to run sudo python setup.py install, but it seems to be stuck in a loop compiling C++ sources C ...
1
vote
0 answers
42 views

Runtime warnings when using scikit-learn

When running from sklearn import svm I am getting the following error: /Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/importlib/_bootstrap.py:321: RuntimeWarning: ...
1
vote
0 answers
17 views

Associating original data with Kmeans clusters

I am using scikit-learn. Suppose we have data as follows: a = [1, 2, 3, 4, 5, 4, 3, 2, 1] b = [2, 1, 3, 4, 6, 7, 7, 4, 2] c = [2, 3, 4, 3, 5, 6, 6, 6, 4] and we run the following: temp.append(a) ...
1
vote
0 answers
21 views

scikit learn: how to check coefficients significance

I tried to do an LR with sklearn for a rather large dataset with ~600 dummy variables and only a few interval variables (and 300K lines in my dataset), and the resulting confusion matrix looks suspicious. I wanted ...
1
vote
0 answers
150 views

Import errors after upgrading to sklearn 0.15

Using Ubuntu 13.10 64 bit and python 2.7.5. I've been using sklearn 0.14 for quite some time. After upgrading to version 0.15 via: pip install --upgrade scikit-learn I've encountered the following: ...
1
vote
0 answers
34 views

from sklearn import svm ImportError: cannot import name lsqr

I am getting the following error when I import svm modules. I have installed scipy as per the instructions. Here is the code and error. >>> from sklearn import svm Traceback (most recent ...
1
vote
0 answers
95 views

sklearn setting learning rate of SGDClassifier vs LogisticRegression

In sklearn, LogisticRegression (LR for short) has no direct method for solving weighted LR, so I switched to SGDClassifier (SGD). In my experiment, I generate data following an LR distribution with ...
1
vote
0 answers
66 views

Very few distinct prediction probabilities for CV instances with sparse SVM

I’m having an issue using the prediction probabilities for sparse SVM, where many of the predictions come out the same for my test instances. These probabilities are produced during cross validation, ...
1
vote
0 answers
268 views

Scikit-Learn grid search with custom CountVectorizer tokenizer

I am currently learning more about scikit learn and nltk and I am building a text classifier. I am not a python expert but I am learning as I go (I do have backgrounds in various other programming ...
1
vote
0 answers
42 views

How to identify the left (True) and right (False) branch

I exported a scikit-learn DecisionTree to a .dot file with export_graphviz. In a different module I want to load the tree from the .dot file and fill a different tree structure. Question: How do I ...
1
vote
0 answers
57 views

Scikit learn + Random forest - features of single trees

I have a very specific question regarding random forests and their implementation in scikit. I constructed a forest, and prediction works just fine so far. However, I need to know which particular ...
1
vote
0 answers
33 views

Pointing to source file from IDLE editor in python

I'm working from the book called "Building Machine Learning Systems with Python". I've downloaded some data from MLComp to use as a training set. The file I downloaded (20news-18828) is currently in ...
1
vote
0 answers
44 views

How to extend the ensemble methods in scikit-learn with a new learning algorithm

I have a new decision tree ensemble regression method algorithm I need to implement, and I would like to build on the infrastructure that the Python-based scikit-learn package provides if I can. I ...
1
vote
0 answers
72 views

Random forests: weighting individual observations when resampling

I'm currently using a random forest on a nationally representative dataset with probability weights incorporated for each observation, with the hope that I can use these weights in the bootstrapping ...
1
vote
0 answers
96 views

Comparing confidence scores from decision_function() for scikit-learn LinearSVC

I am using scikit-learn's LinearSVC SVM implementation to perform tagging of text. I have about 100 classifiers trained for different tags. I now want to rank the tags in order of their similarity ...
1
vote
0 answers
190 views

Content based recommender system with sklearn or numpy

I am trying to build a content-based recommender system in python/pandas/numpy/sklearn. Here are the matrices involved and their sizes: X: n_customers * n_features (contains the features of each ...
1
vote
0 answers
44 views

How to get tf-idf matrix of a large size corpus, where features are pre-specified?

I have a corpus consisting of 3,500,000 text documents. I want to construct a tf-idf matrix of (3,500,000 * 5,000) size. Here I have 5,000 distinct features (words). I am using scikit-learn (sklearn) in Python. ...
1
vote
0 answers
49 views

Using sklearn in android device

I am currently using sklearn to do machine learning on the sensor data I collected from an Android device. But the thing is I need to do prediction after the model is trained. Since there will be ...
1
vote
0 answers
192 views

kmeans scikit-learn tutorial

I'm trying out Python instead of R for data analysis and am having a bit of trouble. So I've been reading scikit-learn's documentation and tried running their kmeans example on my own but get this ...
1
vote
0 answers
100 views

iPython: cannot import module named sklearn

I am able to import sklearn using the python interpreter, but when I try to do the same in an iPython notebook, iPython throws an ImportError. Any idea what is causing this issue? I need to use a ...
1
vote
0 answers
43 views

IPython Notebook - Log messages from Scikit Parallel

I have a script that uses scikit-learn's parallel features (implemented by the joblib library). Typically I run it with higher verbosity, so that I can monitor the progress: grid = ...
1
vote
0 answers
98 views

unable to use neighbors.KNeighborsClassifier for multilabel sparse data

I have a file having data as: 1,2,3 4:5 6:7................ 11,12,13,14 15:16 17:18 19:20...... . . . I have loaded this file as X_train,Y_train = load_svmlight_file(filename, multilabel=True) ...
1
vote
0 answers
95 views

how to use python sklearn package in hadoop streaming

Hi: currently I'm running jobs using Hadoop streaming. In my mapper, I need to use the sklearn package as part of my program, but unfortunately the sklearn package is not installed on my Hadoop cluster nodes. ...
1
vote
0 answers
258 views

scikit-learn cross validation, negative values with mean squared error

When I use the following code with Data matrix X of size (952,144) and output vector y of size (952), mean_squared_error metric returns negative values, which is unexpected. Do you have any idea? ...
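
The asker's code is elided, so this is only a sketch of the likely behaviour, assuming a recent scikit-learn release where the scorer is named `neg_mean_squared_error`. Scorers follow a "greater is better" convention, so error metrics are reported negated; flipping the sign recovers the usual MSE. The synthetic data and the `LinearRegression` model are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data (an assumption for illustration).
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# The scorer returns the negated MSE so that higher scores mean better models.
scores = cross_val_score(
    LinearRegression(), X, y, cv=5, scoring="neg_mean_squared_error"
)

# Flip the sign to get the conventional, non-negative MSE per fold.
mse_per_fold = -scores
print(scores, mse_per_fold)
```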
1
vote
0 answers
71 views

dtype mismatch in sklearn on k-means

I am attempting to run the first answer to this question Python Relating k-means cluster to instance however I am getting the following error: Traceback (most recent call last): File "test.py", ...
1
vote
0 answers
75 views

non DAG task dependencies in LoadBalancedView in iPython

I want to train a lot of models in parallel using ipython parallel with LoadBalancedView. However, I want the constraint that after each task is done, the particular node must "check" with ...