Config-driven batch job submission for OSMO RL training and evaluation.
# Using uv (recommended)
uv pip install -e .
# Or with pip
pip install -e .

# Preview jobs without submitting
osmo-deploy show experiments/my_experiment.yaml
# Submit jobs (dry run first)
osmo-deploy submit experiments/multi_task.yaml --dry-run
# Submit for real
osmo-deploy submit experiments/multi_task.yaml
# Check status
osmo-deploy status wf-abc123
# Or check all jobs from a manifest
osmo-deploy status --manifest experiments/manifests/my_experiment_20260119.json
# Cancel workflows
osmo-deploy cancel wf-abc123

| Command | Description |
|---|---|
| `osmo-deploy submit <config>` | Submit batch jobs from a sweep or eval config |
| `osmo-deploy show <config>` | Preview expanded job configurations |
| `osmo-deploy list experiments` | List experiment configs |
| `osmo-deploy list templates` | List available templates |
| `osmo-deploy list manifests` | List job manifests |
| `osmo-deploy status <wf-id>` | Check workflow status |
| `osmo-deploy cancel <wf-id>` | Cancel workflows |
| `osmo-deploy db init` | Initialize the evaluation database |
| `osmo-deploy db sync` | Sync evaluation results to database |
| `osmo-deploy db list` | List evaluation results |
| `osmo-deploy db summary` | Show database summary statistics |
| `osmo-deploy db export` | Export evaluation results to file |
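
The sweep configs in this README describe `sweep_params` as a Cartesian product applied on top of `base_params`. The sketch below illustrates that expansion in plain Python. It is not taken from the `osmo_deploy` package: the function name `expand_sweep` is hypothetical, and sweeping `max_iterations` alongside `seed` is an assumption made only to show a two-parameter product.

```python
from itertools import product


def expand_sweep(base_params, sweep_params):
    """Return one parameter dict per combination of sweep values.

    Hypothetical helper for illustration; the real expansion logic
    in osmo_deploy may differ.
    """
    keys = list(sweep_params)
    jobs = []
    # product() iterates with the last key varying fastest.
    for values in product(*(sweep_params[k] for k in keys)):
        params = dict(base_params)        # copy the fixed parameters
        params.update(zip(keys, values))  # overlay this combination
        jobs.append(params)
    return jobs


base = {"task": "Isaac-Cartpole-v0", "num_envs": 1024}
sweep = {"seed": [42, 123, 456], "max_iterations": [100, 500]}

jobs = expand_sweep(base, sweep)
print(len(jobs))  # 3 seeds x 2 iteration counts = 6 jobs
print(jobs[0])
```

With a single swept key, such as the five seeds in `experiments/seed_sweep.yaml`, the same logic yields one job per value.
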
# experiments/seed_sweep.yaml
name: cartpole_seed_sweep
template: single_gpu.yaml
# Fixed parameters for all runs
base_params:
  task: Isaac-Cartpole-v0
  project_name: osmo_rl_ablation
  num_envs: 1024
  max_iterations: 500
  num_gpu: 1
# Parameters to sweep (Cartesian product)
sweep_params:
  seed: [42, 123, 456, 789, 1024]
# Job submission settings
max_parallel: 3
tags:
  - seed-sweep
  - cartpole

# experiments/multi_node_train.yaml
name: multi_node_training
template: multi_node.yaml
base_params:
  num_gpu: 2
  project_name: osmo_rl_distributed
  experiment_name: cartpole_multi_node
sweep_params:
  run_name:
    - distributed_run_1
max_parallel: 1
tags:
  - multi-node
  - distributed

# experiments/eval_tasks.yaml
type: eval
name: eval_sweep
template: eval.yaml
base_params:
  output_dataset: robot-policy-eval-dataset
  num_envs: 64
  video_length: 200
sources:
  - manifest: experiments/manifests/multi_task_comparison_20260119.json
    task: Isaac-Cartpole-v0
max_parallel: 5

osmo-deploy submit <config> [OPTIONS]
Options:
  --dry-run           Preview jobs without submitting
  -p, --max-parallel  Override max parallel jobs
  -s, --set KEY=VAL   Override config values

osmo-deploy cancel <workflow_ids>... [OPTIONS]
Options:
  -m, --manifest  Cancel all workflows from a manifest file
  --dry-run       Preview without canceling

# Initialize database
osmo-deploy db init [--db-path PATH]
# Sync from OSMO datasets or manifests
osmo-deploy db sync <datasets>... [OPTIONS]
osmo-deploy db sync --manifest experiments/manifests/eval_sweep.json
osmo-deploy db sync --local ./robot-policy-eval-dataset
Options:
  -m, --manifest  Sync all eval jobs from a manifest file
  -l, --local     Scan local directory for eval_info.json files
  -p, --parallel  Max parallel fetches (default: 10)
# List evaluation results
osmo-deploy db list [OPTIONS]
Options:
  -t, --training-uid  Filter by training run UID
  --task              Filter by task name
  -n, --limit         Maximum results to show (default: 50)
# Show summary
osmo-deploy db summary
# Export results
osmo-deploy db export results.json
osmo-deploy db export results.csv --format csv

# Sweep with different learning rates
osmo-deploy submit experiments/lr_sweep.yaml
# Override parallelism
osmo-deploy submit experiments/big_sweep.yaml --max-parallel 10
# Dry run with config override
osmo-deploy submit experiments/sweep.yaml --dry-run -s max_iterations=100
# Multi-node distributed training
osmo-deploy submit experiments/multi_node_train.yaml --dry-run
osmo-deploy submit experiments/multi_node_train.yaml
# Cancel workflows from a manifest
osmo-deploy cancel --manifest experiments/manifests/sweep_20260119.json
# Check status of all jobs in a manifest
osmo-deploy status --manifest experiments/manifests/sweep_20260119.json

osmo-deploy/
├── osmo_deploy/              # Main package
│   ├── __init__.py
│   ├── cli.py                # Click CLI commands
│   ├── config.py             # Config schemas
│   ├── core.py               # Job creation & submission
│   ├── database.py           # Evaluation database
│   └── eval.py               # Eval job creation
├── experiments/              # Experiment configs
│   ├── seed_sweep.yaml
│   ├── multi_task.yaml
│   ├── multi_node_train.yaml
│   ├── eval_tasks.yaml
│   └── manifests/            # Job submission records
├── templates/                # OSMO workflow templates
│   └── reinforcement_learning/
│       ├── single_gpu.yaml
│       ├── multi_node.yaml
│       ├── eval.yaml
│       └── tune.yaml
├── docs/                     # Documentation
│   ├── batch-jobs.md
│   ├── hyperparameter-tuning.md
│   └── system-architecture.md
└── pyproject.toml
Additional documentation is available in the docs/ directory:
- Batch Jobs - Guide to batch job submission
- Hyperparameter Tuning - Sweep and tuning configurations
- System Architecture - Overview of the system design
Apache-2.0
