
sklearn_api `transform()` methods not compatible with generators #2825

Open
straygar opened this issue May 4, 2020 · 6 comments
@straygar commented May 4, 2020

Example (from TfidfTransformer):

```python
if isinstance(docs[0], tuple):
    docs = [docs]
return [self.gensim_model[doc] for doc in docs]
```

This method expects an indexable list (note the `docs[0]`) rather than an arbitrary iterable. That means the entire corpus has to be stored in memory as a list, instead of just the TF-IDF matrix produced at the end, which is infeasible for large datasets.

Why do we need to create a list from `docs`, instead of just doing:

```python
return (self.gensim_model[doc] for doc in docs)
```
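A version that accepts any iterable and still catches the single-document case could look roughly like this (a sketch, not gensim's code: `model` stands in for `self.gensim_model`, and a document is assumed to be a list of `(id, value)` tuples):

```python
import itertools

def transform_stream(model, docs):
    """Lazily transform one document or a stream of documents.

    `model` is a stand-in for self.gensim_model: anything supporting
    model[doc] for a single bag-of-words document. Nothing is
    materialized except a wrapped single document.
    """
    docs = iter(docs)
    try:
        first = next(docs)
    except StopIteration:
        return iter(())  # empty input stream
    if isinstance(first, tuple):
        # `docs` was a single document: its elements are (id, value)
        # tuples, so rebuild the document and wrap it in a 1-doc stream.
        docs = [list(itertools.chain([first], docs))]
    else:
        # `docs` is a stream of documents: put `first` back in front.
        docs = itertools.chain([first], docs)
    return (model[doc] for doc in docs)
```

The caller gets a generator back, so a large corpus is consumed one document at a time instead of being held in memory.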
@piskvorky (Member) commented May 4, 2020

Just sloppy programming, I think. The whole sklearn_api subpackage was contributed externally and its quality is so-so.

> Why do we need to create a list from docs, instead of just doing:

I'd guess this `if` is trying to catch the case where `docs` is a single document rather than a sequence of documents. The code comment `# input as python lists` is not very illuminating, I agree.

The code shouldn't expect docs to be a list though, that's a bug (the documentation says iterable).

A PR to clean this up is welcome, if you're up for it @straygar.

@gojomo (Member) commented May 4, 2020

But, isn't the implied contract from scikit-learn even stricter, not accepting generators?

I believe the `transform()` X parameter should match the `fit_transform()` X, which in the TransformerMixin class (which this TfidfTransformer inherits from) is defined as a numpy array of shape [n_samples, n_features].

@piskvorky (Member) commented May 4, 2020

It'd be good to hear from an actual user, what they expect from this wrapper.

X cannot possibly be a dense numpy array. That would be too inefficient.

My understanding is that these sklearn wrappers still expect gensim-like inputs (stream of sparse documents). But instead of using them through gensim-like interfaces, they allow creating the lego-like sklearn pipelines.

So I imagined something like this:

  1. Large (sparse) corpus stream as input.
  2. Sklearn pipeline working over the stream: strings => tfidf => lsi…
  3. Exit point into sklearn world: at some point, convert the pipeline output into a numpy matrix that other sklearn modules understand.
  4. Use the output seamlessly for e.g. classification, within the same pipeline.
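The steps above can be sketched in plain Python (toy stand-ins for the real gensim models; only the exit point ever builds a matrix):

```python
import numpy as np

def stream_corpus():
    # 1. Large sparse corpus stream: each doc is [(token_id, value), ...]
    yield [(0, 2.0), (1, 1.0)]
    yield [(1, 3.0)]

def fake_tfidf(docs):
    # 2. A streaming pipeline step: one document in, one document out.
    #    A real step would be gensim's TfidfModel, LsiModel, etc.
    for doc in docs:
        yield [(i, v * 0.5) for i, v in doc]  # stand-in weighting

def to_dense(docs, num_terms):
    # 3. Exit point into the sklearn world: materialize a dense matrix
    #    only at the very end of the streamed part.
    rows = []
    for doc in docs:
        row = np.zeros(num_terms)
        for i, v in doc:
            row[i] = v
        rows.append(row)
    return np.vstack(rows)

# 4. X is a normal (n_docs, n_features) array any sklearn estimator accepts.
X = to_dense(fake_tfidf(stream_corpus()), num_terms=2)
```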

But I never used this myself. @straygar what's your use case?

@straygar (Author) commented May 5, 2020

Thanks for the history, @piskvorky! Didn't know it was an externally contributed bit.

I'm currently scaling up some text featurization code in one of my projects. Our pipeline is: preprocess(document) -> sklearn's TfidfVectorizer -> TruncatedSVD.

I was going to start by replacing TfidfVectorizer, which stores the corpus in memory (and led to OOM issues as our data grew), with a gensim Dictionary + TfidfTransformer. I chose the sklearn API to avoid manual data reshaping when using gensim's TfidfModel. Maybe there's a better way to achieve this?

I dug into this a bit more, and I think we could return a scipy.sparse.csr_matrix from that transform method, which is what sklearn's TfidfTransformer.transform() appears to do.
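That conversion is simple to sketch (illustrative helper, not gensim code; gensim's `matutils.corpus2csc` does something similar, and `num_terms` would come from the Dictionary):

```python
from scipy.sparse import csr_matrix

def bow_to_csr(docs, num_terms):
    """Build a CSR matrix from a bag-of-words stream in one pass."""
    data, indices, indptr = [], [], [0]
    for doc in docs:
        for term_id, value in doc:
            indices.append(term_id)
            data.append(value)
        indptr.append(len(indices))
    return csr_matrix((data, indices, indptr),
                      shape=(len(indptr) - 1, num_terms))

X = bow_to_csr([[(0, 1.0), (2, 0.5)], [(1, 2.0)]], num_terms=3)
```

Note that this still materializes the whole (sparse) matrix at the end, so it only helps when the sparse result fits in memory even if the dense one would not.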

I'd be happy to work on a PR, but probably not until next week.

@piskvorky (Member) commented May 5, 2020

Your workflow is pretty similar to what I was thinking, thanks for confirming. Basically, you want to use Gensim's streaming capabilities but through an sklearn-compatible interface.

> I think we could return a scipy.sparse.csr_matrix at the end of that transform method

Well, a sparse CSR matrix is not a numpy array, so that wouldn't address @gojomo's type objection.

And if you're getting OOMs, converting to an in-memory matrix (sparse or dense) is not a good idea anyway.

I see two options:

  1. Get the truncated SVD through the standard gensim API; forget about the sklearn API. Plug in the rest of your sklearn pipeline only after gensim gives you the SVD.
  2. Get gensim's sklearn wrapper to work on streamed data. That is, accept and return iterables backed by generators, not in-memory matrices. I thought that's what the wrapper did, but apparently not. I'm not sure how much work that is or what limitations there are – it may be trivial, or impossible.

A third option, converting between streamed (gensim) and in-memory (sklearn) data structures at each pipeline step, seems wasteful and defeats the purpose of using a wrapper.

@straygar (Author) commented May 14, 2020

I like your first option. I'm not familiar with Gensim's API; could you provide a code snippet showing how to chain gensim's TF-IDF transform and SVD to produce a result like the sklearn pipeline? Or a link to some docs?
