[docs] Add the Verl RL guide #231
Conversation
Code Review
This pull request introduces a new guide for RL post-training using the verl framework on Kinetic. The guide covers building compatible GPU images, submitting jobs, and handling checkpoints. The review feedback correctly identifies several violations of the repository's naming conventions for infrastructure resources, such as Artifact Registry repositories and GCS buckets, which must be cluster-scoped. Additionally, it suggests improving script portability by avoiding hardcoded project IDs.

```shell
export KINETIC_VERL_REPO="us-docker.pkg.dev/${GOOGLE_CLOUD_PROJECT}/kinetic-verl"

gcloud artifacts repositories create kinetic-verl \
```
The Artifact Registry repository name should follow the repository's naming convention for consistency across clusters in the same project. According to the style guide, the convention for Artifact Registry repos is kn-{cluster_name}.

```diff
- export KINETIC_VERL_REPO="us-docker.pkg.dev/${GOOGLE_CLOUD_PROJECT}/kinetic-verl"
- gcloud artifacts repositories create kinetic-verl \
+ export KINETIC_VERL_REPO="us-docker.pkg.dev/${GOOGLE_CLOUD_PROJECT}/kn-your-cluster-name"
+ gcloud artifacts repositories create kn-your-cluster-name \
```
References
- All infrastructure resources must be cluster-scoped. The naming convention is kn-{cluster_name} for Artifact Registry repos. (link)
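
As a sketch of the convention, the repository name can be derived from the cluster name instead of being typed twice. The project and cluster values below are placeholders, not values from this guide:

```shell
# Placeholder values; substitute your real project and cluster.
GOOGLE_CLOUD_PROJECT="my-project"
CLUSTER_NAME="my-cluster"

# kn-{cluster_name} convention for the Artifact Registry repo
KINETIC_VERL_REPO="us-docker.pkg.dev/${GOOGLE_CLOUD_PROJECT}/kn-${CLUSTER_NAME}"
echo "${KINETIC_VERL_REPO}"  # prints us-docker.pkg.dev/my-project/kn-my-cluster
```

Deriving the name once keeps the `export` and the `gcloud artifacts repositories create` call in sync.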

```python
import kinetic

VERL_BASE_REPO = "us-docker.pkg.dev/your-project-id/kinetic-verl"
```
Hardcoding the project ID in the script makes it less portable and inconsistent with the shell examples provided earlier in the guide. It is better to use an environment variable or a placeholder that matches the previous steps.

```diff
- VERL_BASE_REPO = "us-docker.pkg.dev/your-project-id/kinetic-verl"
+ VERL_BASE_REPO = f"us-docker.pkg.dev/{os.environ['GOOGLE_CLOUD_PROJECT']}/kn-your-cluster-name"
```

```python
job = run_verl_gsm8k_ppo(
    prepared_data_dir=kinetic.Data(
        "gs://your-bucket/verl-data/gsm8k/",
```
The bucket naming convention should follow the repository's recommended pattern {project}-kn-{cluster_name}-{purpose} to ensure resources are cluster-scoped and independent.

```diff
-         "gs://your-bucket/verl-data/gsm8k/",
+         "gs://your-project-id-kn-your-cluster-name-data/gsm8k/",
```
References
- Every resource managed by the CLI must include the cluster name in its identifier. The naming convention is {project}-kn-{cluster_name}-{purpose} for buckets. (link)
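
A small sketch of the bucket convention; `bucket_name` is a hypothetical helper for illustration, not part of the Kinetic API:

```python
def bucket_name(project: str, cluster: str, purpose: str) -> str:
    """Build a cluster-scoped bucket name: {project}-kn-{cluster_name}-{purpose}."""
    return f"{project}-kn-{cluster}-{purpose}"

# e.g. the data bucket from the suggestion above
uri = f"gs://{bucket_name('your-project-id', 'your-cluster-name', 'data')}/gsm8k/"
print(uri)  # gs://your-project-id-kn-your-cluster-name-data/gsm8k/
```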

/gemini review
Code Review
This pull request introduces a new guide and example script for RL post-training with the verl framework on Kinetic, covering image building, job submission, and checkpoint management. The review feedback emphasizes passing `fuse=True` to `kinetic.Data` for checkpoint directories to ensure durability and prevent loss of model weights when a job completes. It also recommends using `os.getenv` instead of `os.environ` for the project ID in the example script, so a missing environment variable does not break the import.

```python
if __name__ == "__main__":
    job = run_verl_gsm8k_ppo(Data("gs://your-bucket/verl-checkpoints/"))
```
To ensure that checkpoints are durable and written back to Google Cloud Storage, you must use fuse=True. Without this, the Data object downloads the prefix to local ephemeral storage, and any writes to that directory inside the pod will be lost when the job completes.

```diff
-     job = run_verl_gsm8k_ppo(Data("gs://your-bucket/verl-checkpoints/"))
+     job = run_verl_gsm8k_ppo(Data("gs://your-bucket/verl-checkpoints/", fuse=True))
```
References
- Demand robustness and ensure that the implementation handles data durability correctly.
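
The failure mode can be sketched without Kinetic at all. This toy example uses a temporary directory to stand in for the pod's ephemeral storage; none of it is the kinetic API:

```python
import os
import shutil
import tempfile

# scratch stands in for the pod-local directory that a non-FUSE Data
# object downloads into.
scratch = tempfile.mkdtemp()
ckpt = os.path.join(scratch, "model.ckpt")
with open(ckpt, "w") as f:
    f.write("weights")  # trainer writes a checkpoint locally

# "Job completes": ephemeral pod storage is reclaimed, and nothing was
# synced back to GCS.
shutil.rmtree(scratch)
print(os.path.exists(ckpt))  # False: the checkpoint is lost
```

With a FUSE mount, the write above would go to the bucket directly, so tearing down the pod would not destroy it.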

```python
job = run_verl_gsm8k_ppo(
    checkpoint_dir=kinetic.Data("gs://your-bucket/verl-checkpoints/"),
```
The checkpoint_dir should use fuse=True to ensure that model weights written by the trainer are persisted back to GCS. Without FUSE, these files are written to the pod's ephemeral storage and will be lost upon job completion.

```diff
-     checkpoint_dir=kinetic.Data("gs://your-bucket/verl-checkpoints/"),
+     checkpoint_dir=kinetic.Data("gs://your-bucket/verl-checkpoints/", fuse=True),
```

```python
job = run_verl_gsm8k_ppo(
    checkpoint_dir=kinetic.Data("gs://your-bucket/verl-checkpoints/"),
```
Similar to the initial run example, the resume path must also use fuse=True so that the trainer can both read existing checkpoints and write new ones back to the persistent GCS bucket.

```diff
-     checkpoint_dir=kinetic.Data("gs://your-bucket/verl-checkpoints/"),
+     checkpoint_dir=kinetic.Data("gs://your-bucket/verl-checkpoints/", fuse=True),
```

```python
VERL_BASE_REPO = (
    f"us-docker.pkg.dev/{os.environ['GOOGLE_CLOUD_PROJECT']}/kn-your-cluster-name"
)
```
Accessing os.environ at the module level will cause a KeyError immediately upon import if the environment variable is missing. This is brittle for an example script. It is safer to use os.getenv with a placeholder or handle the missing variable gracefully.

```python
PROJECT_ID = os.getenv("GOOGLE_CLOUD_PROJECT", "your-project-id")
VERL_BASE_REPO = f"us-docker.pkg.dev/{PROJECT_ID}/kn-your-cluster-name"
```

References
- Catch user errors early and provide clear feedback instead of raw exceptions.
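
A minimal demonstration of the difference; the `pop` call below simulates an environment where the variable was never set:

```python
import os

os.environ.pop("GOOGLE_CLOUD_PROJECT", None)  # simulate a missing variable

# Module-level os.environ[...] raises KeyError the moment the script runs:
try:
    project = os.environ["GOOGLE_CLOUD_PROJECT"]
except KeyError:
    project = None

# os.getenv falls back to a placeholder instead of raising:
safe_project = os.getenv("GOOGLE_CLOUD_PROJECT", "your-project-id")
print(project, safe_project)  # None your-project-id
```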

@ChiragSW Thank you for this contribution. Do you mind adding a screenshot of a successful run to the PR description?

@divyashreepathihalli I don't currently have a working GKE setup to produce the screenshot. Could a maintainer with a configured environment verify the run?

So how are you testing your code? How do you know it works?

I validated the launcher locally for syntax issues and checked it against the documented Kinetic prebuilt-image workflow. Right now I cannot execute it end to end because an OOM error appears, both locally and on Colab. Even with reduced parameters, the run uses almost 12.85 GB of memory, which exceeds both my local machine and the Colab runtime. That's why I asked for help: someone with a configured setup could verify a successful run.
The Verl RL guide has been added, and the steps have been checked to work with Kinetic.