@keshavb96
Contributor

This document presents a detailed tutorial on using Ray together with JAX to achieve fault-tolerant training.

Contributor

@gspschmid gspschmid left a comment

Tried this out and needed to make a few changes to get the container ready (see inline comments).

Once that was set up I ran into the following error on a viking node:

$ docker build -t ray_resiliency_example -f Dockerfile .
...
$ docker run --gpus=all --name resilient_jax --network=host --security-opt seccomp=unconfined --cap-add SYS_PTRACE -it --shm-size=50g --ulimit memlock=-1 ray_resiliency_example
root@viking-prod-283:/ray_resiliency_example# ./launch_ray_job.sh
...
redis.exceptions.ConnectionError: Error 111 connecting to 10.78.2.240:6380. Connection refused.

Full log here: https://gist.github.com/gspschmid/ff1d8e7873a5010d880cc8350bf314f1

Never mind: launch_ray_job.sh had the line that launches Redis commented out. It seems to work after uncommenting that line!
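
A "Connection refused" error like the one above usually means nothing is listening on the target port yet. As a minimal sketch (not part of the PR; the helper name `port_is_open` is hypothetical), a pre-flight check along these lines could be run before `launch_ray_job.sh` to confirm that Redis is reachable. It uses only the Python standard library and assumes a plain TCP connect is a good enough signal:

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers ConnectionRefusedError (errno 111 on Linux) and timeouts.
        return False


if __name__ == "__main__":
    # Address copied from the error message above; adjust for your own node.
    host, port = "10.78.2.240", 6380
    if port_is_open(host, port):
        print(f"{host}:{port} is accepting connections")
    else:
        print(f"nothing is listening on {host}:{port}; start Redis first")
```

This only confirms that something accepts TCP connections on the port; it does not verify that the listener is actually Redis.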

Contributor

@gspschmid gspschmid left a comment

One comment re triggering multiple non-rank-0 failures

@keshavb96 keshavb96 marked this pull request as ready for review March 12, 2025 02:13
@keshavb96 keshavb96 closed this Mar 18, 2025
@gspschmid
Contributor

Closed in favor of #1349

@gspschmid gspschmid mentioned this pull request Mar 18, 2025
gspschmid pushed a commit that referenced this pull request Mar 18, 2025
Adds a self-contained example of using Ray for a resilient JAX training loop.

Original PR: #1302
2 participants