Difference between loss.backward() and model_engine.backward(loss) ? #329

Open
rsn870 opened this issue Aug 21, 2020 · 8 comments

@rsn870 rsn870 commented Aug 21, 2020

Hi,

I have tried out both loss.backward() and model_engine.backward(loss) in my code, and I have observed several subtle differences; for one, retain_graph=True does not work with model_engine.backward(loss). This is creating a problem because, for some reason, buffers are not being retained each time I run the backward pass.

Please look into this if you could.
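
For context, retain_graph=True matters when the same graph must be backpropagated through more than once; here is a plain-PyTorch example of that pattern (independent of DeepSpeed):

```python
# Plain-PyTorch illustration of why retain_graph matters (independent of
# DeepSpeed): backpropagating through the same graph twice requires
# retain_graph=True on the first call, otherwise the saved buffers are freed.
import torch

x = torch.randn(4, requires_grad=True)
y = (x * 2).sum()

y.backward(retain_graph=True)  # keeps the graph's buffers alive
y.backward()                   # second pass works because buffers were retained
# Without retain_graph=True on the first call, the second backward() would
# raise: "Trying to backward through the graph a second time ...".
```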

@ShadenSmith ShadenSmith (Contributor) commented Aug 21, 2020

Thanks for the report @rsn870! Our model_engine.backward() is a wrapper around loss.backward() that also manages gradient accumulation, gradient reduction for data parallelism, timing, etc. You are seeing differences because additional arguments like retain_graph are not passed along to loss.backward().

I think we could simply add a **backward_kwargs argument to model_engine.backward() that is forwarded to loss.backward(**backward_kwargs).
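
For illustration, a minimal sketch of that kind of passthrough; the class name and surrounding comments are placeholders, not DeepSpeed's actual engine code:

```python
# Hypothetical sketch of the proposed change, not DeepSpeed's implementation:
# extra keyword arguments (e.g. retain_graph=True) are forwarded from the
# engine's backward() down to torch's loss.backward().
class EngineSketch:
    def backward(self, loss, **backward_kwargs):
        # ... gradient-accumulation / loss-scaling bookkeeping would go here ...
        loss.backward(**backward_kwargs)
        # ... data-parallel gradient reduction would go here ...
```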

@tjruwase / @samyam / @jeffra: do you have any thoughts on this?


@aakash-saboo aakash-saboo commented Aug 21, 2020

Hi, if one does loss.backward() instead of model_engine.backward(loss), it would be wrong because gradients will not be averaged across GPUs; the weights on each GPU will therefore diverge, and we would effectively be training N separate models on N GPUs with batch_size/N samples per iteration.
Also, is there a workaround for retain_graph=True for now?

Please correct me if I am wrong
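
To make this concrete, here is a rough sketch (plain torch.distributed, not DeepSpeed code) of the gradient averaging that a bare loss.backward() skips under data parallelism; average_gradients is a made-up helper for illustration:

```python
import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module) -> None:
    """Average gradients across all data-parallel ranks (illustrative only)."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Sum the local gradients from every rank, then divide to average.
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad.div_(world_size)

# loss.backward()           # computes local gradients only
# average_gradients(model)  # the kind of reduction model_engine.backward() handles
```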

@rsn870 rsn870 (Author) commented Aug 21, 2020

If model_engine() is based on DDP(), for example, I would think the computed loss should be synchronized across all nodes?

@ShadenSmith ShadenSmith (Contributor) commented Aug 24, 2020

> Hi, if one does loss.backward() instead of model_engine.backward(loss), it would be wrong because gradients will not be averaged across GPUs; the weights on each GPU will therefore diverge, and we would effectively be training N separate models on N GPUs with batch_size/N samples per iteration.

Correct, you need model_engine.backward() to reduce gradients and perform other DeepSpeed bookkeeping.

> If model_engine() is based on DDP(), for example, I would think the computed loss should be synchronized across all nodes?

It is not based on DDP directly, so without model_engine.backward() you don't get synchronization.

> Also, is there a workaround for retain_graph=True for now?

Not currently; DeepSpeed needs to propagate additional arguments through our backward interface. I have self-assigned this, but I won't be able to get to it until next week at the earliest. We very much welcome PRs if you would like to contribute :-). I would be happy to help guide you through the process.
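
Once the arguments are forwarded, the intended usage would look roughly like the following (batch, labels, and criterion are placeholder names, and the retain_graph passthrough does not exist yet at the time of this comment):

```python
# Hypothetical usage after the kwargs passthrough lands (placeholder names).
outputs = model_engine(batch)                    # forward pass through the engine
loss = criterion(outputs, labels)                # any loss whose graph you keep
model_engine.backward(loss, retain_graph=True)   # retain_graph forwarded to torch
model_engine.step()                              # optimizer step via the engine
```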

@kracwarlock kracwarlock (Member) commented Aug 26, 2020

@ShadenSmith it would be great if we could add retain_graph and the other arguments.


@asit2898 asit2898 commented Nov 3, 2020

@ShadenSmith is this issue still open? If so, can I pick it up?

@ShadenSmith ShadenSmith (Contributor) commented Nov 3, 2020

@asit2898 yes, it is; apologies for not getting to it sooner. PRs are very much welcome.


@asit2898 asit2898 commented Nov 4, 2020

Thanks a lot @ShadenSmith, I'll work on this issue. This would be my first commit, so any pointers would be truly helpful!
