Documentation needed on how to speed up the nlp.pipe() usage #5239
The main issue is that multiprocessing has a lot of overhead when starting child processes, and this overhead is especially high on Windows, which uses the spawn start method rather than fork. You might also see some improvement by using a smaller number of processes for shorter tasks, but you'd have to experiment with this. It depends on the model and pipeline, how long your texts are, etc., and there's no single set of guidelines that will be optimal for every case. See: https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods
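As a minimal sketch of why the start method matters (an illustration, not code from this thread): with spawn, each child process has to re-import modules and re-load the spaCy model before doing any work, which is why short jobs can get slower with `n_process > 1`.

```python
import multiprocessing

# "fork" on Linux, "spawn" on Windows (and macOS on Python 3.8+).
# With spawn, every child re-imports spaCy and re-loads the model,
# so the per-process startup cost can dominate short pipelines.
print(multiprocessing.get_start_method())
```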
In this case, are longer tasks equivalent to a much larger batch size? Are we creating a child process per batch in spaCy?
I've tried everything I could, but couldn't find a single example where `n_process > 1` resulted in better performance. In fact, performance is far worse even with `n_process=2`.
There is documentation on how to use `nlp.pipe()` with a single process and the default batch size:
https://spacy.io/usage/processing-pipelines
And there is brief documentation on setting `n_process` and `batch_size`:
https://spacy.io/api/language#pipe
But I am finding it hard to get a clear answer on the relationship between `batch_size` and `n_process` for a simple use case like entity extraction. So far, using the vanilla `nlp.pipe()` is significantly faster than `nlp()`, as expected. However, `nlp.pipe(text, n_process=cpu_count()-1)` is much slower than just `nlp.pipe()`, even after scanning through `batch_size` options (50 to 1000).

On a small dataset of 2000 sentences, `data = ["I 'm very happy .", "I want to say that I 'm very happy ."] * 1000`, `nlp.pipe()` takes ~2 seconds, whereas `nlp.pipe(text, n_process=cpu_count()-1)` takes up to 30 and just `nlp()` takes ~14.

It would be good to know how to set the parameters `n_process` and `batch_size` given a max `cpu_count()` from the `multiprocessing` library.
Additional info: I'm using Windows and have a 12-core CPU.
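For reference, a minimal script to reproduce the comparison described above might look like the following (the model name `en_core_web_sm` and `batch_size=100` are assumptions; the timings quoted above came from the original report, not from this sketch):

```python
import time
from multiprocessing import cpu_count

import spacy


def main():
    nlp = spacy.load("en_core_web_sm")  # assumed model; any pipeline works
    data = ["I 'm very happy .", "I want to say that I 'm very happy ."] * 1000

    # Single-process pipe (~2 seconds in the report above)
    start = time.time()
    list(nlp.pipe(data))
    print("nlp.pipe, single process:", time.time() - start)

    # Multi-process pipe; on Windows each child re-loads the model via spawn
    start = time.time()
    list(nlp.pipe(data, n_process=cpu_count() - 1, batch_size=100))
    print("nlp.pipe, n_process=cpu_count()-1:", time.time() - start)


if __name__ == "__main__":  # required on Windows for spawn-based multiprocessing
    main()
```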