Documentation needed on how to speed up the nlp.pipe() usage #5239

Open
RevanthRameshkumar opened this issue Mar 31, 2020 · 3 comments
Comments

RevanthRameshkumar commented Mar 31, 2020

There is documentation on how to use nlp.pipe() with a single process and the default batch size:
https://spacy.io/usage/processing-pipelines

And there is brief documentation on setting n_process and batch_size:
https://spacy.io/api/language#pipe

But I am finding it hard to get a clear answer on the relationship between batch_size and n_process for a simple use case like entity extraction. So far, vanilla nlp.pipe() is significantly faster than calling nlp() on each text, as expected. However, nlp.pipe(text, n_process=cpu_count()-1) is much slower than plain nlp.pipe(), even after scanning through batch_size values from 50 to 1000.
On a small dataset of 2000 sentences:
data = ["I 'm very happy .", "I want to say that I 'm very happy ."]*1000
nlp.pipe() takes ~2 seconds, whereas nlp.pipe(text, n_process=cpu_count()-1) takes up to 30 seconds and plain nlp() takes ~14 seconds.

It would be good to know how to set n_process and batch_size given the maximum cpu_count() from the multiprocessing library.

Additional info: I'm using Windows and have a 12-core CPU.
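For reference, the comparison I'm describing looks roughly like this (a minimal sketch; the en_core_web_sm model, the entity-counting step, and batch_size=200 are illustrative assumptions, not the exact script behind the numbers above):

```python
# Minimal benchmark sketch. Assumptions (not from the original report):
# the en_core_web_sm model and counting doc.ents as the per-document work.
import time
from multiprocessing import cpu_count

import spacy

nlp = spacy.load("en_core_web_sm")
data = ["I 'm very happy .", "I want to say that I 'm very happy ."] * 1000


def timed(label, docs):
    start = time.perf_counter()
    n_ents = sum(len(doc.ents) for doc in docs)
    print(f"{label}: {time.perf_counter() - start:.1f}s ({n_ents} entities)")


if __name__ == "__main__":  # guard is required on Windows (spawn start method)
    timed("nlp() per text", (nlp(text) for text in data))
    timed("nlp.pipe()", nlp.pipe(data))
    timed("nlp.pipe(n_process=cpu_count()-1)",
          nlp.pipe(data, n_process=cpu_count() - 1, batch_size=200))
```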

adrianeboyd (Member) commented Apr 1, 2020

The main issue is that multiprocessing has a lot of overhead when starting child processes, and this overhead is especially high on Windows, which uses spawn instead of fork. You might see improvements with multiprocessing for tasks that take much longer than a few seconds with one process, but it's not going to be helpful for short tasks.

You might also see some improvements by using a smaller number of processes for shorter tasks, but you'd have to experiment with this. It depends on the model and pipeline, how long your texts are, etc., and there's no single set of guidelines that will be optimal for every case.

See: https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods
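For example, you can check which start method your platform defaults to (illustrative snippet, not part of the linked docs):

```python
# Print the default multiprocessing start method for this platform.
# Typically "spawn" on Windows and macOS, "fork" on Linux.
import multiprocessing

if __name__ == "__main__":
    print(multiprocessing.get_start_method())
```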

RevanthRameshkumar (Author) commented Apr 2, 2020

> You might see improvements with multiprocessing for tasks that take much longer than a few seconds with one process, but it's not going to be helpful for short tasks.

In this case, are longer tasks equivalent to a much larger batch size? Are we creating a child process per batch in spaCy?

bgeneto commented Jun 17, 2020

I've tried everything I could, but I couldn't find a single example where n_process > 1 resulted in better performance. In fact, performance is terribly worse even with n_process = 2.
I think the developers should provide a minimal speed-up example; otherwise, many people will lose precious time testing and benchmarking just to find out that there is no way to parallelize this kind of job with spaCy/scispacy.
