Documentation needed on how to speed up the nlp.pipe() usage #5239

Open
RevanthRameshkumar opened this issue Mar 31, 2020 · 3 comments
Comments

RevanthRameshkumar commented Mar 31, 2020

There is documentation on how to use nlp.pipe() with a single process and the default batch size:
https://spacy.io/usage/processing-pipelines

And there is brief documentation on setting n_process and batch_size:
https://spacy.io/api/language#pipe

But I am finding it hard to get a clear answer on the relationship between batch_size and n_process for a simple use case like entity extraction. So far, vanilla nlp.pipe() is significantly faster than calling nlp() on each text, as expected. However, nlp.pipe(text, n_process=cpu_count()-1) is much slower than plain nlp.pipe(), even after scanning through batch_size values from 50 to 1000.
On a small dataset of 2000 sentences:
data = ["I 'm very happy .", "I want to say that I 'm very happy ."]*1000
nlp.pipe() takes ~2 seconds, whereas nlp.pipe(text, n_process=cpu_count()-1) takes up to 30 seconds and plain nlp() takes ~14 seconds.

It would be good to know how to set n_process and batch_size given the maximum cpu_count() from the multiprocessing library.

Additional info: I'm using Windows and have a 12-core CPU.
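For reference, the comparison I'm describing looks roughly like this (a minimal sketch; the en_core_web_sm model, the entity-counting step, and batch_size=200 are illustrative assumptions, not the exact script behind the numbers above):

```python
# Minimal benchmark sketch. Assumptions (not from the original report):
# the en_core_web_sm model and counting doc.ents as the per-document work.
import time
from multiprocessing import cpu_count

import spacy

nlp = spacy.load("en_core_web_sm")
data = ["I 'm very happy .", "I want to say that I 'm very happy ."] * 1000


def timed(label, docs):
    start = time.perf_counter()
    n_ents = sum(len(doc.ents) for doc in docs)
    print(f"{label}: {time.perf_counter() - start:.1f}s ({n_ents} entities)")


if __name__ == "__main__":  # guard is required on Windows (spawn start method)
    timed("nlp() per text", (nlp(text) for text in data))
    timed("nlp.pipe()", nlp.pipe(data))
    timed("nlp.pipe(n_process=cpu_count()-1)",
          nlp.pipe(data, n_process=cpu_count() - 1, batch_size=200))
```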

adrianeboyd (Member) commented Apr 1, 2020

The main issue is that multiprocessing has a lot of overhead when starting child processes, and this overhead is especially high on Windows, which uses spawn instead of fork. You might see improvements with multiprocessing for tasks that take much longer than a few seconds with one process, but it's not going to be helpful for short tasks.

You might also see some improvements by using a smaller number of processes for shorter tasks, but you'd have to experiment with this. It depends on the model and pipeline, how long your texts are, etc., and there's no single set of guidelines that will be optimal for every case.

See: https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods
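For example, you can check which start method your platform defaults to (illustrative snippet, not part of the linked docs):

```python
# Print the default multiprocessing start method for this platform.
# Typically "spawn" on Windows and macOS, "fork" on Linux.
import multiprocessing

if __name__ == "__main__":
    print(multiprocessing.get_start_method())
```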

RevanthRameshkumar (Author) commented Apr 2, 2020

> You might see improvements with multiprocessing for tasks that take much longer than a few seconds with one process, but it's not going to be helpful for short tasks.

In this case, are longer tasks equivalent to a much larger batch size? Are we creating a child process per batch in spaCy?

bgeneto commented Jun 17, 2020

I've tried everything I could, but I couldn't find a single example where n_process > 1 resulted in better performance. In fact, performance is terribly worse even with n_process = 2.
I think the developers should provide a minimal speed-up example; otherwise, many people will lose precious time testing and benchmarking just to find out that there is no way to parallelize this kind of job with spaCy/scispacy.
