nv-sachdevkartik/osmo-deploy

osmo-deploy

Config-driven batch job submission for OSMO RL training and evaluation.

Architecture

[Figure: osmo-deploy architecture diagram]

Installation

# Using uv (recommended)
uv pip install -e .

# Or with pip
pip install -e .

Quick Start

# Preview jobs without submitting
osmo-deploy show experiments/my_experiment.yaml

# Submit jobs (dry run first)
osmo-deploy submit experiments/multi_task.yaml --dry-run

# Submit for real
osmo-deploy submit experiments/multi_task.yaml

# Check status
osmo-deploy status wf-abc123

# Or check all jobs from a manifest
osmo-deploy status --manifest experiments/manifests/my_experiment_20260119.json

# Cancel workflows
osmo-deploy cancel wf-abc123

Commands

Command                       Description
osmo-deploy submit <config>   Submit batch jobs from a sweep or eval config
osmo-deploy show <config>     Preview expanded job configurations
osmo-deploy list experiments  List experiment configs
osmo-deploy list templates    List available templates
osmo-deploy list manifests    List job manifests
osmo-deploy status <wf-id>    Check workflow status
osmo-deploy cancel <wf-id>    Cancel workflows
osmo-deploy db init           Initialize the evaluation database
osmo-deploy db sync           Sync evaluation results to database
osmo-deploy db list           List evaluation results
osmo-deploy db summary        Show database summary statistics
osmo-deploy db export         Export evaluation results to file

Experiment Config Format

Sweep Config

# experiments/seed_sweep.yaml
name: cartpole_seed_sweep
template: single_gpu.yaml

# Fixed parameters for all runs
base_params:
  task: Isaac-Cartpole-v0
  project_name: osmo_rl_ablation
  num_envs: 1024
  max_iterations: 500
  num_gpu: 1

# Parameters to sweep (Cartesian product)
sweep_params:
  seed: [42, 123, 456, 789, 1024]

# Job submission settings
max_parallel: 3
tags:
  - seed-sweep
  - cartpole
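
The Cartesian-product expansion mentioned in the sweep config can be illustrated with a short Python sketch. This is not the tool's actual implementation: the function name `expand_sweep` is hypothetical, and the sketch assumes each job simply receives `base_params` merged with one combination of `sweep_params` values.

```python
from itertools import product


def expand_sweep(base_params, sweep_params):
    """Return one parameter dict per combination of sweep values.

    Hypothetical helper: assumes a job is base_params plus one value
    chosen for every key in sweep_params.
    """
    keys = list(sweep_params)
    jobs = []
    # product() yields every combination, one value per swept key.
    for values in product(*(sweep_params[k] for k in keys)):
        params = dict(base_params)        # copy so jobs do not share state
        params.update(zip(keys, values))  # swept values override base values
        jobs.append(params)
    return jobs


base_params = {
    "task": "Isaac-Cartpole-v0",
    "num_envs": 1024,
    "max_iterations": 500,
    "num_gpu": 1,
}
sweep_params = {"seed": [42, 123, 456, 789, 1024]}

jobs = expand_sweep(base_params, sweep_params)
print(len(jobs))  # one job per seed
```

With a second swept key, the job count would be the product of the list lengths, so sweeps with several parameters can grow quickly; `osmo-deploy show` is the way to check what a real config expands to.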

Multi-Node Config

# experiments/multi_node_train.yaml
name: multi_node_training
template: multi_node.yaml

base_params:
  num_gpu: 2
  project_name: osmo_rl_distributed
  experiment_name: cartpole_multi_node

sweep_params:
  run_name:
    - distributed_run_1

max_parallel: 1
tags:
  - multi-node
  - distributed

Eval Config

# experiments/eval_tasks.yaml
type: eval
name: eval_sweep
template: eval.yaml

base_params:
  output_dataset: robot-policy-eval-dataset
  num_envs: 64
  video_length: 200

sources:
  - manifest: experiments/manifests/multi_task_comparison_20260119.json
    task: Isaac-Cartpole-v0

max_parallel: 5
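
Each config above sets `max_parallel`. One plausible reading is that it caps how many job submissions are in flight at once; the README does not spell out the exact semantics, so the Python sketch below is only an illustration of that idea. `submit_all` and `fake_submit` are hypothetical names, and `fake_submit` stands in for whatever call actually submits a workflow.

```python
from concurrent.futures import ThreadPoolExecutor


def submit_all(jobs, submit_one, max_parallel):
    """Run submit_one over jobs with at most max_parallel calls at a time.

    Hypothetical helper: results come back in the same order as jobs.
    """
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(submit_one, jobs))


def fake_submit(params):
    # Placeholder for a real submission call; returns a made-up workflow ID.
    return f"wf-{params['seed']}"


workflow_ids = submit_all(
    [{"seed": s} for s in (42, 123, 456)],
    fake_submit,
    max_parallel=3,
)
print(workflow_ids)
```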

CLI Options

Submit

osmo-deploy submit <config> [OPTIONS]

Options:
  --dry-run          Preview jobs without submitting
  -p, --max-parallel Override max parallel jobs
  -s, --set KEY=VAL  Override config values
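
The `--set KEY=VAL` option can be pictured as parsing each pair and merging it over the config's parameters. The Python sketch below is a guess at that behavior, not the tool's code: `parse_overrides` and `apply_overrides` are hypothetical names, and both the literal-parsing of values and the choice to merge into `base_params` are assumptions.

```python
import ast


def parse_overrides(pairs):
    """Turn ["key=value", ...] strings into a dict.

    Assumption: values that look like Python literals (100, 0.5, True)
    are converted; anything else is kept as a plain string.
    """
    overrides = {}
    for pair in pairs:
        key, sep, raw = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected KEY=VAL, got {pair!r}")
        try:
            value = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            value = raw  # not a literal, keep the raw string
        overrides[key] = value
    return overrides


def apply_overrides(base_params, overrides):
    """Return a copy of base_params with overrides applied on top."""
    merged = dict(base_params)
    merged.update(overrides)
    return merged


base = {"task": "Isaac-Cartpole-v0", "max_iterations": 500}
merged = apply_overrides(base, parse_overrides(["max_iterations=100"]))
print(merged["max_iterations"])
```

Combining `-s` with `--dry-run` shows how the real CLI interprets a given override before anything is submitted.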

Cancel

osmo-deploy cancel <workflow_ids>... [OPTIONS]

Options:
  -m, --manifest     Cancel all workflows from a manifest file
  --dry-run          Preview without canceling

Database

# Initialize database
osmo-deploy db init [--db-path PATH]

# Sync from OSMO datasets or manifests
osmo-deploy db sync <datasets>... [OPTIONS]
osmo-deploy db sync --manifest experiments/manifests/eval_sweep.json
osmo-deploy db sync --local ./robot-policy-eval-dataset

Options:
  -m, --manifest     Sync all eval jobs from a manifest file
  -l, --local        Scan local directory for eval_info.json files
  -p, --parallel     Max parallel fetches (default: 10)

# List evaluation results
osmo-deploy db list [OPTIONS]

Options:
  -t, --training-uid Filter by training run UID
  --task             Filter by task name
  -n, --limit        Maximum results to show (default: 50)

# Show summary
osmo-deploy db summary

# Export results
osmo-deploy db export results.json
osmo-deploy db export results.csv --format csv

Examples

# Sweep with different learning rates
osmo-deploy submit experiments/lr_sweep.yaml

# Override parallelism
osmo-deploy submit experiments/big_sweep.yaml --max-parallel 10

# Dry run with config override
osmo-deploy submit experiments/sweep.yaml --dry-run -s max_iterations=100

# Multi-node distributed training
osmo-deploy submit experiments/multi_node_train.yaml --dry-run
osmo-deploy submit experiments/multi_node_train.yaml

# Cancel workflows from a manifest
osmo-deploy cancel --manifest experiments/manifests/sweep_20260119.json

# Check status of all jobs in a manifest
osmo-deploy status --manifest experiments/manifests/sweep_20260119.json

Project Structure

osmo-deploy/
├── osmo_deploy/              # Main package
│   ├── __init__.py
│   ├── cli.py                # Click CLI commands
│   ├── config.py             # Config schemas
│   ├── core.py               # Job creation & submission
│   ├── database.py           # Evaluation database
│   └── eval.py               # Eval job creation
├── experiments/              # Experiment configs
│   ├── seed_sweep.yaml
│   ├── multi_task.yaml
│   ├── multi_node_train.yaml
│   ├── eval_tasks.yaml
│   └── manifests/            # Job submission records
├── templates/                # OSMO workflow templates
│   └── reinforcement_learning/
│       ├── single_gpu.yaml
│       ├── multi_node.yaml
│       ├── eval.yaml
│       └── tune.yaml
├── docs/                     # Documentation
│   ├── batch-jobs.md
│   ├── hyperparameter-tuning.md
│   └── system-architecture.md
└── pyproject.toml

Documentation

Additional documentation is available in the docs/ directory:

docs/batch-jobs.md
docs/hyperparameter-tuning.md
docs/system-architecture.md

License

Apache-2.0

About

A simple tool built on top of NVIDIA OSMO for parallel experimentation.
