Difference between loss.backward() and model_engine.backward(loss) ? #329

Open
rsn870 opened this issue Aug 21, 2020 · 8 comments

@rsn870 rsn870 commented Aug 21, 2020

Hi,

I have tried out both loss.backward() and model_engine.backward(loss) in my code, and I have observed several subtle differences; for one, retain_graph=True does not work with model_engine.backward(loss). This is creating a problem because, for some reason, buffers are not being retained each time I run the backward pass.

Please look into this if you could.
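
For context, retain_graph=True matters when the same graph must be backpropagated through more than once; here is a plain-PyTorch example of that pattern (independent of DeepSpeed):

```python
# Plain-PyTorch illustration of why retain_graph matters (independent of
# DeepSpeed): backpropagating through the same graph twice requires
# retain_graph=True on the first call, otherwise the saved buffers are freed.
import torch

x = torch.randn(4, requires_grad=True)
y = (x * 2).sum()

y.backward(retain_graph=True)  # keeps the graph's buffers alive
y.backward()                   # second pass works because buffers were retained
# Without retain_graph=True on the first call, the second backward() would
# raise: "Trying to backward through the graph a second time ...".
```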

@ShadenSmith ShadenSmith (Contributor) commented Aug 21, 2020

Thanks for the report @rsn870! Our model_engine.backward() is a wrapper around loss.backward() that also manages gradient accumulation, gradient reduction for data parallelism, timing, etc. You are seeing differences because additional arguments like retain_graph are not passed along to loss.backward().

I think we could simply add a **backward_kwargs argument to model_engine.backward() that is forwarded to loss.backward(**backward_kwargs).
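
For illustration, a minimal sketch of that kind of passthrough; the class name and surrounding comments are placeholders, not DeepSpeed's actual engine code:

```python
# Hypothetical sketch of the proposed change, not DeepSpeed's implementation:
# extra keyword arguments (e.g. retain_graph=True) are forwarded from the
# engine's backward() down to torch's loss.backward().
class EngineSketch:
    def backward(self, loss, **backward_kwargs):
        # ... gradient-accumulation / loss-scaling bookkeeping would go here ...
        loss.backward(**backward_kwargs)
        # ... data-parallel gradient reduction would go here ...
```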

@tjruwase / @samyam / @jeffra: do you have any thoughts on this?


@aakash-saboo aakash-saboo commented Aug 21, 2020

Hi, if one does loss.backward() instead of model_engine.backward(loss), it would be wrong because gradients will not be averaged across GPUs; the weights on each GPU will therefore diverge, and we would effectively be training N separate models on N GPUs with batch_size/N samples per iteration.
Also, is there a workaround for retain_graph=True for now?

Please correct me if I am wrong
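
To make this concrete, here is a rough sketch (plain torch.distributed, not DeepSpeed code) of the gradient averaging that a bare loss.backward() skips under data parallelism; average_gradients is a made-up helper for illustration:

```python
import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module) -> None:
    """Average gradients across all data-parallel ranks (illustrative only)."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Sum the local gradients from every rank, then divide to average.
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad.div_(world_size)

# loss.backward()           # computes local gradients only
# average_gradients(model)  # the kind of reduction model_engine.backward() handles
```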

@rsn870 rsn870 (Author) commented Aug 21, 2020

If model_engine() is based on DDP(), for example, I would think the computed loss should be synchronized across all nodes?

@ShadenSmith ShadenSmith (Contributor) commented Aug 24, 2020

> Hi, if one does loss.backward() instead of model_engine.backward(loss), it would be wrong because gradients will not be averaged across GPUs; the weights on each GPU will therefore diverge, and we would effectively be training N separate models on N GPUs with batch_size/N samples per iteration.

Correct, you need model_engine.backward() to reduce gradients and perform other DeepSpeed bookkeeping.

> If model_engine() is based on DDP(), for example, I would think the computed loss should be synchronized across all nodes?

It is not based on DDP directly, so without model_engine.backward() you don't get synchronization.

> Also, is there a workaround for retain_graph=True for now?

Not currently; DeepSpeed needs to propagate additional arguments through our backward interface. I have self-assigned this, but I won't be able to get to it until next week at the earliest. We very much welcome PRs if you would like to contribute :-). I would be happy to help guide you through the process.
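
Once the arguments are forwarded, the intended usage would look roughly like the following (batch, labels, and criterion are placeholder names, and the retain_graph passthrough does not exist yet at the time of this comment):

```python
# Hypothetical usage after the kwargs passthrough lands (placeholder names).
outputs = model_engine(batch)                    # forward pass through the engine
loss = criterion(outputs, labels)                # any loss whose graph you keep
model_engine.backward(loss, retain_graph=True)   # retain_graph forwarded to torch
model_engine.step()                              # optimizer step via the engine
```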

@kracwarlock kracwarlock (Member) commented Aug 26, 2020

@ShadenSmith it would be great if we could add retain_graph and the other arguments.


@asit2898 asit2898 commented Nov 3, 2020

@ShadenSmith is this issue still open? If so, can I pick it up?

@ShadenSmith ShadenSmith (Contributor) commented Nov 3, 2020

@asit2898 yes, it is; apologies for not getting to it sooner. PRs are very much welcome.


@asit2898 asit2898 commented Nov 4, 2020

Thanks a lot @ShadenSmith, I'll work on this issue. This would be my first commit, so any pointers would be truly helpful!
