[docs] Add the Verl RL guide #231
Conversation
Code Review
This pull request introduces a new guide for RL post-training using the verl framework on Kinetic. The guide covers building compatible GPU images, submitting jobs, and handling checkpoints. The review feedback correctly identifies several violations of the repository's naming conventions for infrastructure resources, such as Artifact Registry repositories and GCS buckets, which must be cluster-scoped. Additionally, it suggests improving script portability by avoiding hardcoded project IDs.

```shell
export KINETIC_VERL_REPO="us-docker.pkg.dev/${GOOGLE_CLOUD_PROJECT}/kinetic-verl"

gcloud artifacts repositories create kinetic-verl \
```
The Artifact Registry repository name should follow the repository's naming convention for consistency across clusters in the same project. According to the style guide, the convention for Artifact Registry repos is kn-{cluster_name}.

```diff
- export KINETIC_VERL_REPO="us-docker.pkg.dev/${GOOGLE_CLOUD_PROJECT}/kinetic-verl"
- gcloud artifacts repositories create kinetic-verl \
+ export KINETIC_VERL_REPO="us-docker.pkg.dev/${GOOGLE_CLOUD_PROJECT}/kn-your-cluster-name"
+ gcloud artifacts repositories create kn-your-cluster-name \
```
References
- All infrastructure resources must be cluster-scoped. The naming convention is kn-{cluster_name} for Artifact Registry repos. (link)
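
As a sketch of the convention, the repository name can be derived from the cluster name instead of being typed twice. The project and cluster values below are placeholders, not values from this guide:

```shell
# Placeholder values; substitute your real project and cluster.
GOOGLE_CLOUD_PROJECT="my-project"
CLUSTER_NAME="my-cluster"

# kn-{cluster_name} convention for the Artifact Registry repo
KINETIC_VERL_REPO="us-docker.pkg.dev/${GOOGLE_CLOUD_PROJECT}/kn-${CLUSTER_NAME}"
echo "${KINETIC_VERL_REPO}"  # prints us-docker.pkg.dev/my-project/kn-my-cluster
```

Deriving the name once keeps the `export` and the `gcloud artifacts repositories create` call in sync.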

```python
import kinetic

VERL_BASE_REPO = "us-docker.pkg.dev/your-project-id/kinetic-verl"
```
Hardcoding the project ID in the script makes it less portable and inconsistent with the shell examples provided earlier in the guide. It is better to use an environment variable or a placeholder that matches the previous steps.

```diff
- VERL_BASE_REPO = "us-docker.pkg.dev/your-project-id/kinetic-verl"
+ VERL_BASE_REPO = f"us-docker.pkg.dev/{os.environ['GOOGLE_CLOUD_PROJECT']}/kn-your-cluster-name"
```

```python
job = run_verl_gsm8k_ppo(
    prepared_data_dir=kinetic.Data(
        "gs://your-bucket/verl-data/gsm8k/",
```
The bucket naming convention should follow the repository's recommended pattern {project}-kn-{cluster_name}-{purpose} to ensure resources are cluster-scoped and independent.

```diff
-         "gs://your-bucket/verl-data/gsm8k/",
+         "gs://your-project-id-kn-your-cluster-name-data/gsm8k/",
```
References
- Every resource managed by the CLI must include the cluster name in its identifier. The naming convention is {project}-kn-{cluster_name}-{purpose} for buckets. (link)
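
A small sketch of the bucket convention; `bucket_name` is a hypothetical helper for illustration, not part of the Kinetic API:

```python
def bucket_name(project: str, cluster: str, purpose: str) -> str:
    """Build a cluster-scoped bucket name: {project}-kn-{cluster_name}-{purpose}."""
    return f"{project}-kn-{cluster}-{purpose}"

# e.g. the data bucket from the suggestion above
uri = f"gs://{bucket_name('your-project-id', 'your-cluster-name', 'data')}/gsm8k/"
print(uri)  # gs://your-project-id-kn-your-cluster-name-data/gsm8k/
```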

/gemini review
Code Review
This pull request introduces a new guide and example script for RL post-training with the verl framework on Kinetic, covering image building, job submission, and checkpoint management. The review feedback emphasizes passing `fuse=True` to `kinetic.Data` for checkpoint directories to ensure durability and prevent loss of model weights when a job completes. It also recommends using `os.getenv` instead of `os.environ` for the project ID in the example script, so a missing environment variable does not break the import.

```python
if __name__ == "__main__":
    job = run_verl_gsm8k_ppo(Data("gs://your-bucket/verl-checkpoints/"))
```
To ensure that checkpoints are durable and written back to Google Cloud Storage, you must use fuse=True. Without this, the Data object downloads the prefix to local ephemeral storage, and any writes to that directory inside the pod will be lost when the job completes.

```diff
-     job = run_verl_gsm8k_ppo(Data("gs://your-bucket/verl-checkpoints/"))
+     job = run_verl_gsm8k_ppo(Data("gs://your-bucket/verl-checkpoints/", fuse=True))
```
References
- Demand robustness and ensure that the implementation handles data durability correctly.
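
The failure mode can be sketched without Kinetic at all. This toy example uses a temporary directory to stand in for the pod's ephemeral storage; none of it is the kinetic API:

```python
import os
import shutil
import tempfile

# scratch stands in for the pod-local directory that a non-FUSE Data
# object downloads into.
scratch = tempfile.mkdtemp()
ckpt = os.path.join(scratch, "model.ckpt")
with open(ckpt, "w") as f:
    f.write("weights")  # trainer writes a checkpoint locally

# "Job completes": ephemeral pod storage is reclaimed, and nothing was
# synced back to GCS.
shutil.rmtree(scratch)
print(os.path.exists(ckpt))  # False: the checkpoint is lost
```

With a FUSE mount, the write above would go to the bucket directly, so tearing down the pod would not destroy it.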

```python
job = run_verl_gsm8k_ppo(
    checkpoint_dir=kinetic.Data("gs://your-bucket/verl-checkpoints/"),
```
The checkpoint_dir should use fuse=True to ensure that model weights written by the trainer are persisted back to GCS. Without FUSE, these files are written to the pod's ephemeral storage and will be lost upon job completion.

```diff
-     checkpoint_dir=kinetic.Data("gs://your-bucket/verl-checkpoints/"),
+     checkpoint_dir=kinetic.Data("gs://your-bucket/verl-checkpoints/", fuse=True),
```

```python
job = run_verl_gsm8k_ppo(
    checkpoint_dir=kinetic.Data("gs://your-bucket/verl-checkpoints/"),
```
Similar to the initial run example, the resume path must also use fuse=True so that the trainer can both read existing checkpoints and write new ones back to the persistent GCS bucket.

```diff
-     checkpoint_dir=kinetic.Data("gs://your-bucket/verl-checkpoints/"),
+     checkpoint_dir=kinetic.Data("gs://your-bucket/verl-checkpoints/", fuse=True),
```

```python
VERL_BASE_REPO = (
    f"us-docker.pkg.dev/{os.environ['GOOGLE_CLOUD_PROJECT']}/kn-your-cluster-name"
)
```
Accessing os.environ at the module level will cause a KeyError immediately upon import if the environment variable is missing. This is brittle for an example script. It is safer to use os.getenv with a placeholder or handle the missing variable gracefully.

```python
PROJECT_ID = os.getenv("GOOGLE_CLOUD_PROJECT", "your-project-id")
VERL_BASE_REPO = f"us-docker.pkg.dev/{PROJECT_ID}/kn-your-cluster-name"
```

References
- Catch user errors early and provide clear feedback instead of raw exceptions.
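
A minimal demonstration of the difference; the `pop` call below simulates an environment where the variable was never set:

```python
import os

os.environ.pop("GOOGLE_CLOUD_PROJECT", None)  # simulate a missing variable

# Module-level os.environ[...] raises KeyError the moment the script runs:
try:
    project = os.environ["GOOGLE_CLOUD_PROJECT"]
except KeyError:
    project = None

# os.getenv falls back to a placeholder instead of raising:
safe_project = os.getenv("GOOGLE_CLOUD_PROJECT", "your-project-id")
print(project, safe_project)  # None your-project-id
```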

@ChiragSW Thank you for this contribution. Do you mind adding a screenshot of a successful run to the PR description?

@divyashreepathihalli I don't currently have a working GKE setup to produce the screenshot. Could a maintainer with a configured environment verify the run?

So how are you testing your code? How do you know it works?

I validated the launcher locally for syntax issues and checked it against the documented Kinetic prebuilt-image workflow. Right now I cannot execute it end to end because an OOM error appears, both locally and on Colab. Even with reduced parameters, the run uses almost 12.85 GB of memory, which exceeds both my local machine and the Colab runtime. That's why I asked for help: someone with a configured setup could verify a successful run.
The Verl RL guide has been added, and the steps have been checked to work with Kinetic.