Summary
quickstart_tutorial.py with USE_CANDLE=1 on local MPS learns normally for several epochs, but the process is still killed by the host (Killed: 9) late in epoch 5.
- Other tutorial blockers on this branch were fixed already (default_collate, MPS nll_loss backward, legacy _rebuild_tensor compatibility, and full-module Parameter pickling), so this appears to be the remaining MPS-specific tutorial/runtime issue.
- This is no longer the previous zero-gradient bug: training semantics look reasonable before the process is killed.
Repro
source ~/miniconda3/etc/profile.d/conda.sh
USE_CANDLE=1 MPLBACKEND=Agg \
PYTHONPATH="/Users/lvyufeng/Projects/candle/.worktrees/test-pytorch-basics-mps:/Users/lvyufeng/Projects/candle/.worktrees/test-pytorch-basics-mps/src" \
conda run -n candle311-tutorials \
python /Users/lvyufeng/Projects/candle/.worktrees/test-pytorch-basics-mps/.tmp_tutorials/quickstart_tutorial.py
Observed tail:
Killed: 9
ERROR conda.cli.main_run:execute(127): `conda run python quickstart_tutorial.py` failed.
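To tell a signal kill apart from an ordinary non-zero exit when re-running the repro, a small wrapper along these lines may help. This is a minimal sketch: `describe_exit` and `run_and_report` are hypothetical helper names, not part of candle or conda. It probably needs to launch the interpreter directly rather than through `conda run`, since the wrapper is likely to report its own exit status instead of the child's signal.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Translate a subprocess return code into a readable status."""
    if returncode < 0:
        # On POSIX, a negative return code -N means the child was
        # terminated by signal N (9 is SIGKILL, shown as "Killed: 9").
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"


def run_and_report(cmd):
    """Run cmd (a list of arguments) and describe how it ended."""
    result = subprocess.run(cmd)
    return describe_exit(result.returncode)


if __name__ == "__main__":
    # Example: python this_script.py /path/to/quickstart_tutorial.py
    print(run_and_report([sys.executable, *sys.argv[1:]]))
```

The wrapper only confirms that the process received SIGKILL; it does not say who sent it or why.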
Evidence gathered
A 5-epoch diagnostic replica of the tutorial reached:
- epoch 1: test loss 2.1940, acc 0.3491
- epoch 2: test loss 1.9793, acc 0.5190
- epoch 3: test loss 1.6272, acc 0.5989
- epoch 4: test loss 1.3211, acc 0.6321
- then the process was killed during epoch 5 after step 800
Tutorial-style object counting (train + test loop, no manual cleanup) showed:
- baseline: tensors=12, nodes=0, saved=0, mps_storages=6, RSS ~288 MB
- during training: tensors=106, nodes=18, saved=23, mps_storages=28, RSS ~344 MB
- during test: tensors=108, nodes=18, saved=23, mps_storages=30
- after epoch end: tensors=42, nodes=18, saved=23, mps_storages=30
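For anyone wanting to reproduce this kind of counting, the sketch below shows one way to do it with the standard library only. It is not the script used for the numbers above. The class name `"Tensor"` is a placeholder assumption, and `count_live_objects`, `peak_rss_mb`, and `snapshot` are hypothetical helper names. It sees only objects tracked by Python's garbage collector, and it reports peak RSS rather than current RSS.

```python
import gc
import resource
import sys
from collections import Counter


def count_live_objects(type_names):
    """Count gc-tracked objects whose class name is in type_names."""
    wanted = set(type_names)
    counts = Counter({name: 0 for name in wanted})
    for obj in gc.get_objects():
        name = type(obj).__name__
        if name in wanted:
            counts[name] += 1
    return dict(counts)


def peak_rss_mb():
    """Peak resident set size of this process in MB (not the current RSS)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def snapshot(label, type_names=("Tensor",)):
    """Collect garbage, then print and return live-object counts."""
    gc.collect()
    counts = count_live_objects(type_names)
    print(f"{label}: {counts}, peak RSS ~{peak_rss_mb():.0f} MB")
    return counts
```

Calling `snapshot("baseline")` before training and again at the same point in each epoch (for example, after the test loop) gives comparable readings. Counts that rise from one epoch to the next at the same checkpoint would point to retained objects rather than transient ones.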
A separate train-only diagnostic with explicit per-step cleanup did not show unbounded growth:
- CPU stayed around 7 live tensors
- MPS stayed around 6 live tensors / 4 MPS storages
So this does not look like a simple linear leak in the core train step. It seems specific to the full tutorial-style long-running train+eval process on MPS.
Suspected area
Remaining MPS-specific runtime/resource-management issue in the full quickstart train+eval loop, possibly involving long-lived tensors around the tutorial evaluation path or host/runtime limits reached only in the longer run.