
[docs] Add the Verl RL guide #231

Open
ChiragSW wants to merge 3 commits into keras-team:main from ChiragSW:issue#230

Conversation

@ChiragSW

@ChiragSW ChiragSW commented May 9, 2026

This PR adds the verl RL guide. I have ensured that the steps work with Kinetic.

Contributor

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a new guide for RL post-training using the verl framework on Kinetic. The guide covers building compatible GPU images, submitting jobs, and handling checkpoints. The review feedback correctly identifies several violations of the repository's naming conventions for infrastructure resources, such as Artifact Registry repositories and GCS buckets, which must be cluster-scoped. Additionally, it suggests improving script portability by avoiding hardcoded project IDs.

Comment thread docs/examples/verl_rl.md Outdated
Comment on lines +86 to +88

```shell
export KINETIC_VERL_REPO="us-docker.pkg.dev/${GOOGLE_CLOUD_PROJECT}/kinetic-verl"

gcloud artifacts repositories create kinetic-verl \
```
Contributor


medium

The Artifact Registry repository name should follow the repository's naming convention for consistency across clusters in the same project. According to the style guide, the convention for Artifact Registry repos is kn-{cluster_name}.

Suggested change

```diff
-export KINETIC_VERL_REPO="us-docker.pkg.dev/${GOOGLE_CLOUD_PROJECT}/kinetic-verl"
-gcloud artifacts repositories create kinetic-verl \
+export KINETIC_VERL_REPO="us-docker.pkg.dev/${GOOGLE_CLOUD_PROJECT}/kn-your-cluster-name"
+gcloud artifacts repositories create kn-your-cluster-name \
```
References
  1. All infrastructure resources must be cluster-scoped. The naming convention is kn-{cluster_name} for Artifact Registry repos. (link)
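For illustration, the `kn-{cluster_name}` convention described above can be captured in a small helper. This is a sketch, not part of the guide: the function name, validation rule, and default host are assumptions made here.

```python
def artifact_repo_path(project: str, cluster_name: str,
                       host: str = "us-docker.pkg.dev") -> str:
    """Build a cluster-scoped Artifact Registry path per the kn-{cluster_name} convention."""
    if not cluster_name:
        # Refuse an empty cluster name so the repo cannot silently lose its scope.
        raise ValueError("cluster_name is required so the repo stays cluster-scoped")
    return f"{host}/{project}/kn-{cluster_name}"

print(artifact_repo_path("my-project", "demo"))  # us-docker.pkg.dev/my-project/kn-demo
```

Deriving the path from the cluster name, rather than hardcoding it, keeps every cluster in the same project pointing at its own repository.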

Comment thread docs/examples/verl_rl.md Outdated
```python
import kinetic


VERL_BASE_REPO = "us-docker.pkg.dev/your-project-id/kinetic-verl"
```
Contributor


medium

Hardcoding the project ID in the script makes it less portable and inconsistent with the shell examples provided earlier in the guide. It is better to use an environment variable or a placeholder that matches the previous steps.

Suggested change

```diff
-VERL_BASE_REPO = "us-docker.pkg.dev/your-project-id/kinetic-verl"
+VERL_BASE_REPO = f"us-docker.pkg.dev/{os.environ['GOOGLE_CLOUD_PROJECT']}/kn-your-cluster-name"
```

Comment thread docs/examples/verl_rl.md Outdated
```python
job = run_verl_gsm8k_ppo(
    prepared_data_dir=kinetic.Data(
        "gs://your-bucket/verl-data/gsm8k/",
```
Contributor


medium

The bucket naming convention should follow the repository's recommended pattern {project}-kn-{cluster_name}-{purpose} to ensure resources are cluster-scoped and independent.

Suggested change

```diff
-        "gs://your-bucket/verl-data/gsm8k/",
+        "gs://your-project-id-kn-your-cluster-name-data/gsm8k/",
```
References
  1. Every resource managed by the CLI must include the cluster name in its identifier. The naming convention is {project}-kn-{cluster_name}-{purpose} for buckets. (link)
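The `{project}-kn-{cluster_name}-{purpose}` bucket pattern can likewise be sketched as a helper. The function name and the length check are assumptions for illustration; the 63-character cap is the GCS limit for bucket names without dots.

```python
def bucket_uri(project: str, cluster_name: str, purpose: str) -> str:
    """Build a GCS bucket URI per the {project}-kn-{cluster_name}-{purpose} convention."""
    name = f"{project}-kn-{cluster_name}-{purpose}"
    # GCS bucket names without dots are limited to 63 characters.
    if len(name) > 63:
        raise ValueError(f"bucket name too long ({len(name)} chars): {name}")
    return f"gs://{name}/"

print(bucket_uri("your-project-id", "your-cluster-name", "data"))
```

A `data` bucket and a `checkpoints` bucket for the same cluster then differ only in the `purpose` segment, which keeps resources cluster-scoped and easy to audit.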

@ChiragSW
Author

/gemini review

Contributor

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a new guide and example script for RL post-training with the verl framework on Kinetic, covering image building, job submission, and checkpoint management. The review feedback emphasizes the importance of using `fuse=True` within `kinetic.Data` for checkpoint directories to ensure data durability and prevent loss of model weights upon job completion. Additionally, it is recommended to use `os.getenv` instead of `os.environ` for the project ID in the example script to improve robustness against missing environment variables.

Comment thread examples/verl_rl.py Outdated


```python
if __name__ == "__main__":
    job = run_verl_gsm8k_ppo(Data("gs://your-bucket/verl-checkpoints/"))
```
Contributor


high

To ensure that checkpoints are durable and written back to Google Cloud Storage, you must use fuse=True. Without this, the Data object downloads the prefix to local ephemeral storage, and any writes to that directory inside the pod will be lost when the job completes.

Suggested change

```diff
-    job = run_verl_gsm8k_ppo(Data("gs://your-bucket/verl-checkpoints/"))
+    job = run_verl_gsm8k_ppo(Data("gs://your-bucket/verl-checkpoints/", fuse=True))
```
References
  1. Demand robustness and ensure that the implementation handles data durability correctly.

Comment thread docs/examples/verl_rl.md Outdated

```python
job = run_verl_gsm8k_ppo(
    checkpoint_dir=kinetic.Data("gs://your-bucket/verl-checkpoints/"),
```
Contributor


high

The checkpoint_dir should use fuse=True to ensure that model weights written by the trainer are persisted back to GCS. Without FUSE, these files are written to the pod's ephemeral storage and will be lost upon job completion.

Suggested change

```diff
-    checkpoint_dir=kinetic.Data("gs://your-bucket/verl-checkpoints/"),
+    checkpoint_dir=kinetic.Data("gs://your-bucket/verl-checkpoints/", fuse=True),
```

Comment thread docs/examples/verl_rl.md Outdated

```python
job = run_verl_gsm8k_ppo(
    checkpoint_dir=kinetic.Data("gs://your-bucket/verl-checkpoints/"),
```
Contributor


high

Similar to the initial run example, the resume path must also use fuse=True so that the trainer can both read existing checkpoints and write new ones back to the persistent GCS bucket.

Suggested change

```diff
-    checkpoint_dir=kinetic.Data("gs://your-bucket/verl-checkpoints/"),
+    checkpoint_dir=kinetic.Data("gs://your-bucket/verl-checkpoints/", fuse=True),
```

Comment thread examples/verl_rl.py Outdated
Comment on lines +6 to +8

```python
VERL_BASE_REPO = (
    f"us-docker.pkg.dev/{os.environ['GOOGLE_CLOUD_PROJECT']}/kn-your-cluster-name"
)
```
Contributor


medium

Accessing os.environ at the module level will cause a KeyError immediately upon import if the environment variable is missing. This is brittle for an example script. It is safer to use os.getenv with a placeholder or handle the missing variable gracefully.

```python
PROJECT_ID = os.getenv("GOOGLE_CLOUD_PROJECT", "your-project-id")
VERL_BASE_REPO = f"us-docker.pkg.dev/{PROJECT_ID}/kn-your-cluster-name"
```
References
  1. Catch user errors early and provide clear feedback instead of raw exceptions.
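As a quick stdlib-only illustration of the difference (nothing Kinetic-specific here): `os.environ[...]` raises `KeyError` the moment the module is imported if the variable is unset, while `os.getenv` returns a fallback and keeps the example script importable.

```python
import os

# Simulate an unset variable so both access styles can be compared.
os.environ.pop("GOOGLE_CLOUD_PROJECT", None)

try:
    project = os.environ["GOOGLE_CLOUD_PROJECT"]  # raises KeyError when unset
except KeyError:
    project = None

# os.getenv substitutes the placeholder instead of raising.
fallback = os.getenv("GOOGLE_CLOUD_PROJECT", "your-project-id")
print(project, fallback)  # None your-project-id
```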

@divyashreepathihalli
Collaborator

@ChiragSW Thank you for this contribution. Do you mind adding a screenshot of a successful run to the PR description?

@ChiragSW
Author

@divyashreepathihalli I don’t currently have a working GKE setup to produce the screenshot. Could a maintainer with the configured environment verify the run?

@divyashreepathihalli
Collaborator

divyashreepathihalli commented May 13, 2026

So how are you testing your code? How do you know it works?

@ChiragSW
Author

ChiragSW commented May 14, 2026

I validated the launcher locally for syntax issues and checked it against the documented Kinetic prebuilt-image workflow.
The code I wrote is only the Kinetic wrapper around the upstream verl command: it submits that command as a GPU job, passes credentials through env capture, uses GCS FUSE for checkpoints, and reduces epochs for smoke-run defaults.
References I used:

  1. verl Docker installation docs: https://verl.readthedocs.io/en/latest/start/install.html
  2. verl PPO baselines: https://verl.readthedocs.io/en/v0.2.x/experiment/ppo.html
  3. Kinetic Python API docs: https://kinetic.readthedocs.io/en/stable/api.html

Right now I am not able to execute it because an OOM error appears. I have tried both locally and on Colab. Even with reduced params the error persists: the run uses almost 12.85 GB of memory, which exceeds both my local and Colab runtimes. That's why I need help; if someone has a configured setup, they could verify a successful run.
