sklearn_api `transform()` methods not compatible with generators #2825
Just sloppy programming, I think. I'd guess the code shouldn't expect a materialized list. A PR to clean this up is welcome, if you're up for it @straygar.
But, isn't the implied contract of sklearn's `transform()` that it returns a numpy array? I believe that's what downstream pipeline steps expect.
It'd be good to hear from an actual user what they expect from this wrapper.
My understanding is that these sklearn wrappers still expect gensim-like inputs (stream of sparse documents). But instead of using them through gensim-like interfaces, they allow creating the lego-like sklearn pipelines. So I imagined something like this:
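The "something like this" could be sketched roughly as below. These are hand-rolled, hypothetical stand-ins (the class name `StreamingTfidf` and its internals are made up for illustration), just to show sparse gensim-style documents flowing through sklearn-style `fit`/`transform` steps without materializing the corpus:

```python
# Hypothetical sketch: a tiny sklearn-style transformer that accepts a stream
# of sparse gensim-style documents (lists of (term_id, count) tuples)
# without holding the whole corpus in memory.
import math

class StreamingTfidf:
    """Toy TF-IDF step; real code would use gensim's TfidfModel."""

    def fit(self, docs, y=None):
        self.df = {}          # document frequency per term
        self.num_docs = 0
        for doc in docs:      # single pass over the stream
            self.num_docs += 1
            for term_id, _count in doc:
                self.df[term_id] = self.df.get(term_id, 0) + 1
        return self

    def transform(self, docs):
        # Return a lazy generator, not a list, so huge corpora still stream.
        return ([(t, c * math.log(self.num_docs / self.df[t]))
                 for t, c in doc] for doc in docs)

# Usage: two tiny "documents" as (term_id, count) pairs.
corpus = [[(0, 2), (1, 1)], [(1, 3), (2, 1)]]
tfidf = StreamingTfidf().fit(iter(corpus))     # works on a generator too
weighted = list(tfidf.transform(iter(corpus)))
```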
But I never used this myself. @straygar what's your use case?
Thanks for the history, @piskvorky! Didn't know it was an externally contributed bit.

I'm currently working on scaling up some text featurization code in a project. We currently have a pipeline of: preprocess(document) -> sklearn's TF-IDF -> SVD. I was going to start with replacing those sklearn steps with the gensim wrappers. I worked on this a bit more, and I think we could return a sparse CSR matrix from `transform()`. I'd be happy to work on a PR, but probably not until next week.
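For reference, returning a CSR matrix from a streamed bag-of-words corpus can be done in one pass. This is a sketch with scipy; `bow_stream_to_csr` is a hypothetical helper, not gensim API (gensim itself ships `matutils.corpus2csc` for the transposed layout):

```python
from scipy.sparse import csr_matrix

def bow_stream_to_csr(docs, num_terms):
    """Hypothetical helper: one pass over a stream of
    [(term_id, weight), ...] documents -> scipy CSR matrix."""
    data, indices, indptr = [], [], [0]
    for doc in docs:                      # docs can be a generator
        for term_id, weight in doc:
            indices.append(term_id)
            data.append(weight)
        indptr.append(len(indices))       # mark the row boundary
    return csr_matrix((data, indices, indptr),
                      shape=(len(indptr) - 1, num_terms))

X = bow_stream_to_csr(iter([[(0, 1.0), (2, 3.0)], [(1, 2.0)]]), num_terms=3)
```

Note that this still materializes the whole matrix in memory, so it only helps when the sparse result fits in RAM even though the intermediate documents don't.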
Your workflow is pretty similar to what I was thinking, thanks for confirming. Basically, you want to use Gensim's streaming capabilities but through an sklearn-compatible interface.
Well, a sparse CSR matrix is not a numpy array, so that wouldn't address @gojomo's type objection. And if you're getting OOMs, converting to an in-memory matrix (sparse or dense) is not a good idea anyway. I see two options:
A third option, converting between streamed (gensim) and in-memory (sklearn) data structures at each pipeline step, seems wasteful and defeats the purpose of using a wrapper.
I like your first option. I'm not familiar with Gensim's API; could you provide a code snippet chaining gensim's TF-IDF transform and SVD to produce a result like the sklearn pipeline? Or a link to some docs?
Example (from `TfidfTransformer`)
This method expects a list of tuples, instead of an iterable. This means that the entire corpus has to be stored as a list in memory, instead of just the TF-IDF matrix produced at the end. This is infeasible for large datasets.
Why do we need to create a list from `docs`, instead of just doing:
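The suggestion, as I read it, is roughly the following. Both the quoted wrapper pattern and the helper name `transform_lazy` are paraphrased for illustration, not the exact source:

```python
# The wrapper's transform() (paraphrased) materializes everything up front:
#
#     return [self.gensim_model[doc] for doc in docs]   # whole corpus in RAM
#
# whereas yielding document-by-document keeps memory flat for generators:

def transform_lazy(model, docs):
    """Apply `model` lazily; works for lists and generators alike."""
    for doc in docs:
        yield model[doc]

# Usage with a stand-in "model" that doubles each weight via item lookup,
# mimicking gensim's model[doc] interface:
class Doubler:
    def __getitem__(self, doc):
        return [(t, 2 * w) for t, w in doc]

stream = ([(0, 1.0)], [(1, 2.0)])            # any iterable of documents
out = list(transform_lazy(Doubler(), iter(stream)))
```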