
sklearn_api `transform()` methods not compatible with generators #2825

Open
straygar opened this issue May 4, 2020 · 6 comments
@straygar commented May 4, 2020

Example (from TfidfTransformer):

```python
if isinstance(docs[0], tuple):
    docs = [docs]
return [self.gensim_model[doc] for doc in docs]
```

This method expects an indexable list (note the `docs[0]`) rather than an arbitrary iterable. That means the entire corpus has to be stored in memory as a list, instead of just the TF-IDF matrix produced at the end, which is infeasible for large datasets.

Why do we need to create a list from `docs`, instead of just doing:

```python
return (self.gensim_model[doc] for doc in docs)
```
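A version that accepts any iterable and still catches the single-document case could look roughly like this (a sketch, not gensim's code: `model` stands in for `self.gensim_model`, and a document is assumed to be a list of `(id, value)` tuples):

```python
import itertools

def transform_stream(model, docs):
    """Lazily transform one document or a stream of documents.

    `model` is a stand-in for self.gensim_model: anything supporting
    model[doc] for a single bag-of-words document. Nothing is
    materialized except a wrapped single document.
    """
    docs = iter(docs)
    try:
        first = next(docs)
    except StopIteration:
        return iter(())  # empty input stream
    if isinstance(first, tuple):
        # `docs` was a single document: its elements are (id, value)
        # tuples, so rebuild the document and wrap it in a 1-doc stream.
        docs = [list(itertools.chain([first], docs))]
    else:
        # `docs` is a stream of documents: put `first` back in front.
        docs = itertools.chain([first], docs)
    return (model[doc] for doc in docs)
```

The caller gets a generator back, so a large corpus is consumed one document at a time instead of being held in memory.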
@piskvorky (Member) commented May 4, 2020

Just sloppy programming, I think. The whole sklearn_api subpackage was contributed externally and its quality is so-so.

> Why do we need to create a list from docs, instead of just doing:

I'd guess this `if` is trying to catch the case where `docs` is a single document rather than a sequence of documents. The code comment `# input as python lists` is not very illuminating, I agree.

The code shouldn't expect docs to be a list though, that's a bug (the documentation says iterable).

A PR to clean this up is welcome, if you're up for it @straygar.

@gojomo (Member) commented May 4, 2020

But, isn't the implied contract from scikit-learn even stricter, not accepting generators?

I believe the `transform()` X parameter should match the `fit_transform()` X, which in the TransformerMixin class (which this TfidfTransformer inherits from) is defined as a numpy array of shape [n_samples, n_features].

@piskvorky (Member) commented May 4, 2020

It'd be good to hear from an actual user, what they expect from this wrapper.

X cannot possibly be a dense numpy array. That would be too inefficient.

My understanding is that these sklearn wrappers still expect gensim-like inputs (stream of sparse documents). But instead of using them through gensim-like interfaces, they allow creating the lego-like sklearn pipelines.

So I imagined something like this:

  1. Large (sparse) corpus stream as input.
  2. Sklearn pipeline working over the stream: strings => tfidf => lsi…
  3. Exit point into sklearn world: at some point, convert the pipeline output into a numpy matrix that other sklearn modules understand.
  4. Use the output seamlessly for e.g. classification, within the same pipeline.
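The steps above can be sketched in plain Python (toy stand-ins for the real gensim models; only the exit point ever builds a matrix):

```python
import numpy as np

def stream_corpus():
    # 1. Large sparse corpus stream: each doc is [(token_id, value), ...]
    yield [(0, 2.0), (1, 1.0)]
    yield [(1, 3.0)]

def fake_tfidf(docs):
    # 2. A streaming pipeline step: one document in, one document out.
    #    A real step would be gensim's TfidfModel, LsiModel, etc.
    for doc in docs:
        yield [(i, v * 0.5) for i, v in doc]  # stand-in weighting

def to_dense(docs, num_terms):
    # 3. Exit point into the sklearn world: materialize a dense matrix
    #    only at the very end of the streamed part.
    rows = []
    for doc in docs:
        row = np.zeros(num_terms)
        for i, v in doc:
            row[i] = v
        rows.append(row)
    return np.vstack(rows)

# 4. X is a normal (n_docs, n_features) array any sklearn estimator accepts.
X = to_dense(fake_tfidf(stream_corpus()), num_terms=2)
```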

But I never used this myself. @straygar what's your use case?

@straygar (Author) commented May 5, 2020

Thanks for the history, @piskvorky! Didn't know it was an externally contributed bit.

I'm currently scaling up some text featurization code in one of my projects. Our pipeline is: preprocess(document) -> sklearn's TfidfVectorizer -> TruncatedSVD.

I was going to start by replacing TfidfVectorizer, which stores the corpus in memory (and led to OOM issues as our data grew), with a gensim Dictionary + TfidfTransformer. I chose the sklearn API to avoid manual data reshaping when using gensim's TfidfModel. Maybe there's a better way to achieve this?

I dug into this a bit more, and I think we could return a scipy.sparse.csr_matrix from that transform method, which is what sklearn's TfidfTransformer.transform() appears to do.
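That conversion is simple to sketch (illustrative helper, not gensim code; gensim's `matutils.corpus2csc` does something similar, and `num_terms` would come from the Dictionary):

```python
from scipy.sparse import csr_matrix

def bow_to_csr(docs, num_terms):
    """Build a CSR matrix from a bag-of-words stream in one pass."""
    data, indices, indptr = [], [], [0]
    for doc in docs:
        for term_id, value in doc:
            indices.append(term_id)
            data.append(value)
        indptr.append(len(indices))
    return csr_matrix((data, indices, indptr),
                      shape=(len(indptr) - 1, num_terms))

X = bow_to_csr([[(0, 1.0), (2, 0.5)], [(1, 2.0)]], num_terms=3)
```

Note that this still materializes the whole (sparse) matrix at the end, so it only helps when the sparse result fits in memory even if the dense one would not.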

I'd be happy to work on a PR, but probably not until next week.

@piskvorky (Member) commented May 5, 2020

Your workflow is pretty similar to what I was thinking, thanks for confirming. Basically, you want to use Gensim's streaming capabilities but through an sklearn-compatible interface.

> I think we could return a scipy.sparse.csr_matrix at the end of that transform method

Well, a sparse CSR matrix is not a numpy array, so that wouldn't address @gojomo's type objection.

And if you're getting OOMs, converting to an in-memory matrix (sparse or dense) is not a good idea anyway.

I see two options:

  1. Get the truncated SVD through the standard gensim API; forget about the sklearn API. Plug in the rest of your sklearn pipeline only after gensim gives you the SVD.
  2. Get gensim's sklearn wrapper to work on streamed data. That is, accept and return iterables backed by generators, not in-memory matrices. I thought that's what the wrapper did, but apparently not. I'm not sure how much work that is or what limitations there are – it may be trivial, or impossible.

A third option, converting between streamed (gensim) and in-memory (sklearn) data structures at each pipeline step, seems wasteful and defeats the purpose of using a wrapper.

@straygar (Author) commented May 14, 2020

I like your first option. I'm not familiar with Gensim's API; could you provide a code snippet showing how to chain gensim's TF-IDF transform and SVD to produce a result like the sklearn pipeline? Or a link to some docs?
