Difference between loss.backward() and model_engine.backward(loss) ? #329
Comments
Thanks for the report @rsn870! I would think that we can simply add a `retain_graph` argument to our backward interface. @tjruwase / @samyam / @jeffra: do you have any thoughts on this?
Hi, if one does loss.backward() instead of model_engine.backward(loss), it will be wrong because gradients will not be averaged across GPUs. The weights on each GPU will eventually differ, so we end up training N separate models on N GPUs with batch_size/N samples per iteration. Please correct me if I am wrong.
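The divergence described above can be illustrated with a toy simulation (pure Python, no real GPUs or DeepSpeed API): each "worker" holds its own replica of a single scalar weight and computes a gradient on its own data shard. When the gradients are averaged (what an all-reduce does in data parallelism), the replicas stay identical; when each worker applies only its local gradient, the replicas drift apart. All names here are illustrative.

```python
def local_grad(weight, shard):
    # Gradient of the mean squared error 0.5 * (weight - x)^2 over the shard.
    return sum(weight - x for x in shard) / len(shard)

def train(shards, steps=10, lr=0.1, average=True):
    weights = [0.0] * len(shards)          # one weight replica per worker
    for _ in range(steps):
        grads = [local_grad(w, s) for w, s in zip(weights, shards)]
        if average:
            g = sum(grads) / len(grads)    # "all-reduce": same grad everywhere
            grads = [g] * len(grads)
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights

shards = [[1.0, 2.0], [5.0, 6.0]]          # different data on each "GPU"
synced = train(shards, average=True)       # replicas stay identical
drifted = train(shards, average=False)     # replicas diverge toward their shard means
print(synced)
print(drifted)
```

Skipping the averaging step is exactly the failure mode of calling plain loss.backward(): every replica optimizes against only its own shard.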
If model_engine() is based on DDP(), for example, shouldn't the computed loss be synchronized among all nodes?
Correct, you need model_engine.backward(loss) for the gradient averaging.
It is not based on DDP directly, so without model_engine.backward(loss) the gradients will not be averaged across devices.
Not currently; DeepSpeed needs to propagate additional arguments to our backward interface. I self-assigned, but will not be able to get to it until, hopefully, next week. PRs are very welcome if you would like to contribute :-). I would be happy to help guide you through the process.
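A hypothetical sketch (not the real DeepSpeed code) of what "propagating additional arguments to the backward interface" could look like: the engine's backward simply forwards keyword arguments such as retain_graph to the underlying loss.backward() before doing its own engine-specific work. ToyEngine and FakeLoss are stand-in names for illustration only.

```python
class ToyEngine:
    """Toy wrapper engine; not DeepSpeed's actual implementation."""

    def __init__(self, model):
        self.model = model

    def backward(self, loss, **bwd_kwargs):
        # Forward extra arguments (e.g. retain_graph=True) to autograd...
        loss.backward(**bwd_kwargs)
        # ...then do engine-specific work (all-reduce, loss scaling, etc.).
        self._average_gradients()

    def _average_gradients(self):
        pass  # placeholder for the gradient-averaging step

# A stand-in "loss" object that records what its backward() was called with:
class FakeLoss:
    def __init__(self):
        self.calls = []

    def backward(self, **kwargs):
        self.calls.append(kwargs)

loss = FakeLoss()
engine = ToyEngine(model=None)
engine.backward(loss, retain_graph=True)
print(loss.calls)   # the kwarg reached the underlying backward call
```

The design point is that the wrapper stays a thin pass-through for autograd options, so callers lose nothing relative to plain loss.backward().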
@ShadenSmith it would be great if we could add retain_graph support to model_engine.backward().
@ShadenSmith is this issue still open? If yes, can I pick it up?
@asit2898 yes it is; apologies for not getting to it more quickly. PRs are very much welcome.
Thanks a lot @ShadenSmith, I'll work on this issue. This would be my first commit, so any pointers would be truly helpful! |
Hi,
I have tried out both loss.backward() and model_engine.backward(loss) in my code and observed several subtle differences. For one, retain_graph=True does not work with model_engine.backward(loss). This is causing a problem, since for some reason buffers are not being retained between backward passes.
Please look into this if you can.
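For reference, this is the plain-PyTorch behavior the comment above expects: passing retain_graph=True to the first backward() keeps the autograd buffers alive so a second backward pass over the same graph is possible (gradients accumulate across calls). A minimal sketch:

```python
import torch

x = torch.tensor([2.0], requires_grad=True)
y = (x * x).sum()               # dy/dx = 2x = 4 at x = 2

y.backward(retain_graph=True)   # keep graph buffers for a second pass
first = x.grad.clone()          # gradient after the first backward: 4

y.backward()                    # allowed only because the graph was retained
print(x.grad)                   # gradients accumulate: 4 + 4 = 8
```

Without retain_graph=True on the first call, the second y.backward() raises a RuntimeError because the saved intermediate buffers have been freed; the issue here is that there is no way to request this behavior through model_engine.backward(loss).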