Conversation
@jagadish-amd jagadish-amd commented Oct 15, 2024

This commit avoids GPU 0 being set as the compute stream via torch.cuda.current_stream() during initialization across all GPUs. With the fix, the RunningAvgSamplesPerSec perf metric improves on a multi-GPU node (tested on AMD GPUs with the ROCm stack). Without it, GPU 0 takes on progressively more load than the other GPUs as the number of GPUs increases (see the sketch below).

The commit is needed to avoid GPU 0 being set as the
compute stream via torch.cuda.current_stream() during initialization
across all GPUs.

Signed-off-by: Jagadish Krishnamoorthy <[email protected]>
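
A minimal sketch of the pattern this commit describes (the LOCAL_RANK handling and exact call site here are assumptions for illustration, not the literal diff): pin each rank to its own GPU before the distributed backend initializes, so that torch.cuda.current_stream() resolves to the local device rather than cuda:0.

```python
# Minimal sketch, assuming the usual LOCAL_RANK launcher convention;
# not the exact change from this PR.
import os

import deepspeed
import torch

local_rank = int(os.environ.get("LOCAL_RANK", 0))

# Without this call, every rank's default CUDA device is cuda:0, so streams
# queried via torch.cuda.current_stream() during init all point at GPU 0.
torch.cuda.set_device(local_rank)

deepspeed.init_distributed()
```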
@jagadish-amd
ping @jeffdaily

Use device-agnostic accelerator API.

Signed-off-by: Jagadish Krishnamoorthy <[email protected]>
jagadish-amd commented Oct 17, 2024

@tjruwase, can you please review/merge?

@tjruwase
> @tjruwase, can you please review/merge?

@jagadish-amd, apologies for the delay. Done.

@tjruwase tjruwase merged commit 130fb58 into deepspeedai:master Oct 29, 2024
1 check passed
@jagadish-amd jagadish-amd deleted the cifar-set_device branch October 29, 2024 18:20
hwchen2017 pushed a commit that referenced this pull request Jun 8, 2025
Set cuda device during initialization of distributed backend. (#931)

* Set cuda device during initialization of distributed backend.

The commit is needed to avoid GPU 0 being set as the
compute stream via torch.cuda.current_stream() during initialization
across all GPUs.

Signed-off-by: Jagadish Krishnamoorthy <[email protected]>

* Use device-agnostic accelerator API.

Signed-off-by: Jagadish Krishnamoorthy <[email protected]>

---------

Signed-off-by: Jagadish Krishnamoorthy <[email protected]>
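
For the second commit, a hedged sketch of the device-agnostic pattern (assumed usage of DeepSpeed's accelerator abstraction; the exact call sites in the diff are not reproduced here): route device selection through get_accelerator() so the same code path works on both CUDA and ROCm builds.

```python
# Minimal sketch, assuming the standard LOCAL_RANK convention; not the
# literal diff from this PR.
import os

from deepspeed.accelerator import get_accelerator

local_rank = int(os.environ.get("LOCAL_RANK", 0))

# Device-agnostic equivalents of torch.cuda.set_device / current_stream.
get_accelerator().set_device(local_rank)
stream = get_accelerator().current_stream()
```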