Hi all,
I ran DeepSpeedExamples/BingBertSquad using the docker image deepspeed/deepspeed:latest. I launched 16 ranks across two GPU servers, each equipped with 8 V100 GPUs.
However, I observed only about a 20% GPU memory saving (from 2.5 GB per GPU down to 1.9 GB per GPU) when I turned on zero_optimization and activation_checkpointing. Is this normal?
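(For reference, a minimal sketch of how per-GPU peak memory can be sampled from inside the training script; torch.cuda.max_memory_allocated is standard PyTorch, and the call placement is only illustrative, not necessarily how the numbers above were obtained.)

import torch

def report_peak_gpu_memory(tag):
    # Peak memory allocated by tensors on the current CUDA device, in GB.
    peak_gb = torch.cuda.max_memory_allocated() / (1024 ** 3)
    print(f"[{tag}] peak GPU memory: {peak_gb:.2f} GB")

# e.g. call report_peak_gpu_memory("step 100") every N training steps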
The config is attached below.
{
  "train_batch_size": 256,
  "train_micro_batch_size_per_gpu": 16,
  "steps_per_print": 32,
  "gradient_accumulation_steps": 1,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 3e-5,
      "weight_decay": 0.0,
      "bias_correction": false
    }
  },
  "gradient_clipping": 1.0,
  "fp16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 2
  },
  "activation_checkpointing": {
    "partition_activations": true,
    "cpu_checkpointing": false,
    "contiguous_memory_optimization": false,
    "number_checkpoints": null,
    "synchronize_checkpoint_boundary": false,
    "profile": true
  }
}

The network I used is BERT-large, with hidden_size 1024 and num_hidden_layers 24.
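For context, a back-of-envelope sketch of the per-GPU model-state memory that ZeRO stage 2 should leave with this setup (a minimal sketch; the ~340M parameter count for BERT-large and the 2-byte fp16 / 12-byte Adam-state sizes per parameter are assumptions, not measured values):

# Rough per-GPU model-state memory under ZeRO stage 2 across 16 ranks.
params = 340e6          # assumed BERT-large parameter count
world_size = 16

fp16_params_gb = params * 2 / 1e9                # replicated on every rank under stage 2
fp16_grads_gb = params * 2 / world_size / 1e9    # gradients partitioned by ZeRO-2
adam_states_gb = params * 12 / world_size / 1e9  # fp32 params + momentum + variance, partitioned

total_gb = fp16_params_gb + fp16_grads_gb + adam_states_gb
print(f"fp16 params : {fp16_params_gb:.2f} GB")
print(f"fp16 grads  : {fp16_grads_gb:.2f} GB")
print(f"Adam states : {adam_states_gb:.2f} GB")
print(f"model states: {total_gb:.2f} GB per GPU")

If that arithmetic is roughly right, model states would account for only about 1 GB per GPU, with the rest of the 1.9 GB going to activations and buffers; corrections welcome if this misreads how ZeRO-2 partitions things.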
Any suggestion or feedback is highly appreciated.
Thank you.