
[WIP] Update metrics.py - fix for ogbg pytorch #871


Draft: wants to merge 6 commits into base: dev

Conversation

davidtweedle

The problem affecting the PyTorch ogbg workloads, which only shows up after they have run for some length of time, appears to come from JAX/XLA CPU compilation of the metrics computation. Converting the JAX arrays to NumPy before computing the metrics should avoid it. The next step is to test on schedule free and Shampoo, which I hope to do very soon.
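As a rough illustration of the conversion idea (not the actual diff in metrics.py), the sketch below moves inputs to host NumPy arrays before doing any metric arithmetic, so no JAX/XLA compilation is triggered for that step. The names `to_numpy` and `masked_mean` are hypothetical helpers introduced only for this example.

```python
import numpy as np


def to_numpy(x):
  """Return x as a host NumPy array (hypothetical helper).

  np.asarray accepts lists, NumPy arrays, and array-likes such as
  JAX arrays, so nothing JAX-specific is needed here.
  """
  return np.asarray(x)


def masked_mean(values, mask):
  """Mean of `values` over entries where `mask` is truthy, in plain NumPy."""
  values = to_numpy(values).astype(np.float64)
  mask = to_numpy(mask).astype(bool)
  if not mask.any():
    return float('nan')
  return float(values[mask].mean())
```

Doing the conversion once at the boundary keeps the rest of the metric code free of JAX calls, which is the property this PR relies on.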


github-actions bot commented Jun 5, 2025

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@davidtweedle
Author

With these changes, schedule free now completes the run. The log is here:
https://pastebin.com/RgqMqgkb

@davidtweedle
Author

Tested again, this time replacing only the call to JAX's sigmoid with a NumPy sigmoid. This also avoids the crash, so the problem seems to be calling jax.nn.sigmoid from the PyTorch workload.
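For reference, a NumPy sigmoid along these lines could look like the sketch below. It is an assumption about the shape of the replacement, not the exact code in this PR: the function name `sigmoid_np` matches the commit messages, but the numerically stable formulation and the cast to float64 are choices made for this example.

```python
import numpy as np


def sigmoid_np(x):
  """Elementwise logistic sigmoid computed with NumPy only.

  Uses exp(-|x|), which never overflows, and picks the matching
  algebraic form for non-negative and negative inputs.
  """
  x = np.asarray(x, dtype=np.float64)  # assumption: float64 is acceptable
  z = np.exp(-np.abs(x))
  # x >= 0: 1 / (1 + exp(-x));  x < 0: exp(x) / (1 + exp(x))
  return np.where(x >= 0, 1.0 / (1.0 + z), z / (1.0 + z))
```

If the metrics are expected to match the JAX path in float32, the cast above would need to be adjusted accordingly.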

The problem with torchrun and jax seems to be caused by jax.nn.sigmoid.
Changed from lambda expression which pylint doesn't like.
Defined np sigmoid inside use_pytorch_ddp
Added white space before and after sigmoid_np
Fix white space