mps: quickstart tutorial is SIGKILLed late in training #327

@lvyufeng

Summary

  • quickstart_tutorial.py with USE_CANDLE=1 on local MPS learns normally for several epochs, but the process is still killed by the host (Killed: 9) late in epoch 5.
  • Other tutorial blockers on this branch were fixed already (default_collate, MPS nll_loss backward, legacy _rebuild_tensor compatibility, and full-module Parameter pickling), so this appears to be the remaining MPS-specific tutorial/runtime issue.
  • This is no longer the previous zero-gradient bug: training semantics look reasonable before the process is killed.

Repro

source ~/miniconda3/etc/profile.d/conda.sh
USE_CANDLE=1 MPLBACKEND=Agg \
PYTHONPATH="/Users/lvyufeng/Projects/candle/.worktrees/test-pytorch-basics-mps:/Users/lvyufeng/Projects/candle/.worktrees/test-pytorch-basics-mps/src" \
conda run -n candle311-tutorials \
python /Users/lvyufeng/Projects/candle/.worktrees/test-pytorch-basics-mps/.tmp_tutorials/quickstart_tutorial.py

Observed tail:

Killed: 9
ERROR conda.cli.main_run:execute(127): `conda run python quickstart_tutorial.py` failed.
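
To separate a SIGKILL delivered from outside the process from a Python-level failure, the script can be launched through a small wrapper that inspects the child's return code. This is a minimal sketch assuming a POSIX host; `describe_exit` and `run_and_describe` are hypothetical helper names, not part of candle or conda.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a short human-readable label."""
    if returncode < 0:
        # On POSIX, subprocess reports "terminated by signal N" as -N.
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"


def run_and_describe(argv):
    """Run argv as a child process and describe how it ended."""
    completed = subprocess.run(argv, check=False)
    return describe_exit(completed.returncode)


# Hypothetical usage, bypassing `conda run` so the raw status is visible:
#   print(run_and_describe([sys.executable, "quickstart_tutorial.py"]))
```

A result of `killed by SIGKILL` would be consistent with the `Killed: 9` line above, since signal 9 is SIGKILL and cannot be caught or handled by the process itself.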

Evidence gathered

A 5-epoch diagnostic replica of the tutorial reached:

  • epoch 1: test loss 2.1940, acc 0.3491
  • epoch 2: test loss 1.9793, acc 0.5190
  • epoch 3: test loss 1.6272, acc 0.5989
  • epoch 4: test loss 1.3211, acc 0.6321
  • then the process was killed during epoch 5 after step 800

Tutorial-style object counting (train + test loop, no manual cleanup) showed:

  • baseline: tensors=12, nodes=0, saved=0, mps_storages=6, RSS ~288 MB
  • during training: tensors=106, nodes=18, saved=23, mps_storages=28, RSS ~344 MB
  • during test: tensors=108, nodes=18, saved=23, mps_storages=30
  • after epoch end: tensors=42, nodes=18, saved=23, mps_storages=30
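
The issue does not show how these counts were collected. One possible way to gather similar numbers, using only the standard library, is sketched below. The class names passed in are assumptions about how the runtime names its objects, and the RSS helper reports peak rather than current resident size, so it only approximates the RSS figures above.

```python
import gc
import resource
import sys
from collections import Counter


def count_live_objects(type_names):
    """Count gc-tracked objects whose class name is in type_names.

    Only objects visible to the garbage collector are seen; extension
    types without GC support would be missed.
    """
    wanted = set(type_names)
    counts = Counter({name: 0 for name in wanted})
    for obj in gc.get_objects():
        name = type(obj).__name__
        if name in wanted:
            counts[name] += 1
    return dict(counts)


def peak_rss_mb():
    """Peak resident set size of this process in MB (not current RSS)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


# Hypothetical usage at a checkpoint in the loop (class names are assumed):
#   print(count_live_objects(["Tensor", "Node"]), peak_rss_mb())
```

Calling the counter at the same four points (baseline, during training, during test, after epoch end) for every epoch would show whether the post-epoch numbers keep climbing across epochs.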

A separate train-only diagnostic with explicit per-step cleanup did not show unbounded growth:

  • CPU stayed around 7 live tensors
  • MPS stayed around 6 live tensors / 4 MPS storages

So this does not look like a simple linear leak in the core train step. It seems specific to the full tutorial-style long-running train+eval process on MPS.
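
The "no simple linear leak" observation can be made checkable by sampling a counter (live tensors, storages, or RSS) once per step and fitting a least-squares slope to the series. The sketch below is illustrative only: `growth_per_step` is a hypothetical helper, and the sample values in the usage comment are made up, not measurements from this issue.

```python
def growth_per_step(samples):
    """Least-squares slope of samples against their step index."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    # Covariance of (step index, sample) and variance of the step index.
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


# Hypothetical usage with made-up per-step live-tensor counts:
#   growth_per_step([6, 6, 7, 6, 6])  # near zero suggests no linear growth
```

A slope near zero in the train-only run combined with a clearly positive slope in the full train+eval run would point at the evaluation path rather than the core train step.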

Suspected area

The remaining problem appears to be an MPS-specific runtime or resource-management issue in the full quickstart train+eval loop. It may involve long-lived tensors around the tutorial evaluation path, or host/runtime limits that are reached only in the longer run.
