Hi all,
I ran DeepSpeedExamples/BingBertSquad using the docker image deepspeed/deepspeed:latest. I launched 16 ranks across two GPU servers, each equipped with 8 V100 GPUs.
However, I observed only about a 20% GPU memory saving (from 2.5 GB per GPU down to 1.9 GB per GPU) when I turned on zero_optimization and activation_checkpointing. Is this normal?
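(For reference, a minimal sketch of how per-GPU peak memory can be sampled from inside the training script; torch.cuda.max_memory_allocated is standard PyTorch, and the call placement is only illustrative, not necessarily how the numbers above were obtained.)

import torch

def report_peak_gpu_memory(tag):
    # Peak memory allocated by tensors on the current CUDA device, in GB.
    peak_gb = torch.cuda.max_memory_allocated() / (1024 ** 3)
    print(f"[{tag}] peak GPU memory: {peak_gb:.2f} GB")

# e.g. call report_peak_gpu_memory("step 100") every N training steps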
The config is attached below.
{
  "train_batch_size": 256,
  "train_micro_batch_size_per_gpu": 16,
  "steps_per_print": 32,
  "gradient_accumulation_steps": 1,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 3e-5,
      "weight_decay": 0.0,
      "bias_correction": false
    }
  },
  "gradient_clipping": 1.0,
  "fp16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 2
  },
  "activation_checkpointing": {
    "partition_activations": true,
    "cpu_checkpointing": false,
    "contiguous_memory_optimization": false,
    "number_checkpoints": null,
    "synchronize_checkpoint_boundary": false,
    "profile": true
  }
}

The network I used is BERT-large, with hidden_size 1024 and num_hidden_layers 24.
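For context, a back-of-envelope sketch of the per-GPU model-state memory that ZeRO stage 2 should leave with this setup (a minimal sketch; the ~340M parameter count for BERT-large and the 2-byte fp16 / 12-byte Adam-state sizes per parameter are assumptions, not measured values):

# Rough per-GPU model-state memory under ZeRO stage 2 across 16 ranks.
params = 340e6          # assumed BERT-large parameter count
world_size = 16

fp16_params_gb = params * 2 / 1e9                # replicated on every rank under stage 2
fp16_grads_gb = params * 2 / world_size / 1e9    # gradients partitioned by ZeRO-2
adam_states_gb = params * 12 / world_size / 1e9  # fp32 params + momentum + variance, partitioned

total_gb = fp16_params_gb + fp16_grads_gb + adam_states_gb
print(f"fp16 params : {fp16_params_gb:.2f} GB")
print(f"fp16 grads  : {fp16_grads_gb:.2f} GB")
print(f"Adam states : {adam_states_gb:.2f} GB")
print(f"model states: {total_gb:.2f} GB per GPU")

If that arithmetic is roughly right, model states would account for only about 1 GB per GPU, with the rest of the 1.9 GB going to activations and buffers; corrections welcome if this misreads how ZeRO-2 partitions things.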
Any suggestion or feedback is highly appreciated.
Thank you.