Merged
@@ -61,7 +61,7 @@ export DATACACHE_DIR="" # path to data cache directory
########################################################
# CONTAINER
########################################################
-export CONTAINER_IMAGE="" # path to container image, e.g., nvcr.io/nvidia/pytorch:26.01-py3
+export CONTAINER_IMAGE="nvcr.io/nvidia/pytorch:26.01-py3" # path to container image or .sqsh file
export CONTAINER_MOUNTS="" # container mounts, e.g., "/lustre/fs1:/lustre/fs1"
export CONTAINER_WORKDIR="" # container work directory, e.g., "<path-to-modelopt>/Model-Optimizer/examples/llm_qad"
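
For context, these CONTAINER_* variables are the kind typically consumed by a Slurm launch with the pyxis/enroot plugins. A minimal sketch, assuming pyxis is installed on the cluster; the inner command is illustrative, not from this PR:

# Sketch of how the CONTAINER_* variables would be consumed by srun with
# pyxis; assumes pyxis/enroot are available. The launched command is
# illustrative only.
#
# Optionally pre-convert the NGC image to a .sqsh file so jobs skip the pull:
#   enroot import docker://nvcr.io#nvidia/pytorch:26.01-py3
# and point CONTAINER_IMAGE at the resulting .sqsh path instead.
srun --container-image="${CONTAINER_IMAGE}" \
     --container-mounts="${CONTAINER_MOUNTS}" \
     --container-workdir="${CONTAINER_WORKDIR}" \
     bash -c "echo 'running inside the container'"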

2 changes: 1 addition & 1 deletion examples/llm_qad/configs/qwen3-8b_template.conf
@@ -59,7 +59,7 @@ export DATACACHE_DIR="" # path to data cache directory
########################################################
# CONTAINER
########################################################
-export CONTAINER_IMAGE="" # path to container image, e.g., nvcr.io/nvidia/pytorch:26.01-py3
+export CONTAINER_IMAGE="nvcr.io/nvidia/pytorch:26.01-py3" # path to container image or .sqsh file
export CONTAINER_MOUNTS="" # container mounts, e.g., "/lustre/fs1:/lustre/fs1"
export CONTAINER_WORKDIR="" # container work directory

11 changes: 0 additions & 11 deletions examples/llm_qad/sbatch_qad.sh
@@ -58,19 +58,8 @@ if [[ -n "$CONFIG_FILE" ]]; then
fi
fi

-# === Default Paths (override in config) ===
Collaborator: Removing these seems too brutal. Keeping them with a sane default value is good, so that people know which env vars are expected.

Collaborator (Author): I just followed the approach that was already in place; the .conf file has many such env variables, and all of them are defined in that file.
-MLM_DIR="${MLM_DIR:-/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_modelopt/users/weimingc/workspace/Megatron-LM}"
-MODELOPT_DIR="${MODELOPT_DIR:-/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_modelopt/users/weimingc/workspace/TensorRT-Model-Optimizer}"
-MODELS_ROOT="${MODELS_ROOT:-/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_modelopt/users/weimingc/models}"
-QAD_CHECKPOINT_ROOT="${QAD_CHECKPOINT_ROOT:-/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_modelopt/users/weimingc/checkpoints}"
-DATACACHE_DIR="${DATACACHE_DIR:-/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_modelopt/users/weimingc/data_cache}"
-LOG_DIR="${LOG_DIR:-${QAD_CHECKPOINT_ROOT}/logs_slurm}"

-# Container settings
-CONTAINER_IMAGE="${CONTAINER_IMAGE:-/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_modelopt/users/weimingc/containers/pytorch_25.06-py3.sqsh}"
-CONTAINER_MOUNTS="${CONTAINER_MOUNTS:-/lustre/fs1:/lustre/fs1}"
-CONTAINER_WORKDIR="${CONTAINER_WORKDIR:-/lustre/fs1/portfolios/coreai/projects/coreai_dlalgo_modelopt/users/weimingc/workspace/TensorRT-Model-Optimizer/examples/llm_qad}"

# Parallelism (required from config)
TP_SIZE="${TP_SIZE:?ERROR: TP_SIZE must be set in config}"
MBS="${MBS:?ERROR: MBS must be set in config}"
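
The change hinges on two bash parameter expansions: the deleted lines used ":-" to substitute a hard-coded fallback, while the kept lines use ":?" to make the config mandatory. A minimal sketch of the difference, with illustrative variable names:

# ":-" substitutes a fallback when the variable is unset or empty:
CONTAINER_IMAGE="${CONTAINER_IMAGE:-nvcr.io/nvidia/pytorch:26.01-py3}"

# ":?" aborts the script with an error message instead, forcing the
# config file to provide the value:
TP_SIZE="${TP_SIZE:?ERROR: TP_SIZE must be set in config}"

echo "image=${CONTAINER_IMAGE} tp=${TP_SIZE}"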
examples/windows/torch_onnx/diffusers/qad_example/ltx2_qad.yaml
@@ -1,9 +1,9 @@
# LTX-2 QAD Training Configuration
model:
-  model_path: "/lustre/fsw/portfolios/adlr/projects/adlr_psx_numerics/users/ynankani/ComfyUI/models/checkpoints/ltx-av-step-1933500-split-new-vae.safetensors"
+  model_path: "/path/to/ltx2/checkpoint.safetensors"  # TODO: Set your LTX-2 checkpoint path
  training_mode: "full"
  load_checkpoint:
-    text_encoder_path: "/lustre/fsw/portfolios/adlr/users/dhutchins/models/gemma"
+  text_encoder_path: "/path/to/gemma"  # TODO: Set your Gemma text encoder path
Comment on lines 5 to +6

Contributor:
⚠️ Potential issue | 🔴 Critical

Fix YAML nesting under load_checkpoint to preserve schema.

text_encoder_path is currently aligned as a sibling of load_checkpoint instead of a child key. This changes the config structure and can break consumers expecting model.load_checkpoint.text_encoder_path.

Proposed fix
 model:
   model_path: "/path/to/ltx2/checkpoint.safetensors"  # TODO: Set your LTX-2 checkpoint path
   training_mode: "full"
   load_checkpoint:
-  text_encoder_path: "/path/to/gemma"  # TODO: Set your Gemma text encoder path
+    text_encoder_path: "/path/to/gemma"  # TODO: Set your Gemma text encoder path
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-  load_checkpoint:
-  text_encoder_path: "/lustre/fsw/portfolios/adlr/users/dhutchins/models/gemma"
-  text_encoder_path: "/path/to/gemma"  # TODO: Set your Gemma text encoder path
+  load_checkpoint:
+    text_encoder_path: "/path/to/gemma"  # TODO: Set your Gemma text encoder path
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `examples/windows/torch_onnx/diffusers/qad_example/ltx2_qad.yaml` around lines
5-6, the YAML key text_encoder_path is misindented as a sibling of
load_checkpoint rather than a child, breaking the expected
model.load_checkpoint.text_encoder_path schema; fix by nesting text_encoder_path
under load_checkpoint (adjust the indentation) so that the configuration key becomes
model.load_checkpoint.text_encoder_path and consumers of
load_checkpoint.text_encoder_path find the value correctly.
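
To see why the nesting matters, here is a hypothetical sanity check, assuming python3 with PyYAML is available and run from the directory containing ltx2_qad.yaml. It fails on the misindented file and passes once text_encoder_path is nested under load_checkpoint:

# Hypothetical check; assumes python3 + PyYAML.
python3 - <<'EOF'
import yaml

with open("ltx2_qad.yaml") as f:
    cfg = yaml.safe_load(f)

# With text_encoder_path misindented as a sibling, "load_checkpoint"
# parses as None and this lookup raises TypeError; with the suggested
# fix applied, it prints the configured path.
path = cfg["model"]["load_checkpoint"]["text_encoder_path"]
print("model.load_checkpoint.text_encoder_path =", path)
EOF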


training_strategy:
  name: "text_to_video"
@@ -26,7 +26,7 @@ acceleration:
  load_text_encoder_in_8bit: true

data:
-  preprocessed_data_root: "/lustre/fsw/portfolios/adlr/users/scavallari/ltx-qad/qad-dataset"
+  preprocessed_data_root: "/path/to/preprocessed"  # TODO: Set your preprocessed dataset path
  num_dataloader_workers: 2

validation:
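
Since the merged config now ships placeholder paths, a small hypothetical pre-flight check (the filename and grep pattern are illustrative) can catch configs where the TODOs were never filled in:

# Fail fast if placeholder paths remain in the config.
if grep -nE '"/path/to/' ltx2_qad.yaml; then
  echo "ERROR: replace the /path/to/... placeholders in ltx2_qad.yaml" >&2
  exit 1
fi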