Medium track WR, ema on top of muon, includes PR124. #129
Smoothed Muon Updates (1% improvement, includes PR 124 changes, runs PR 124 and 128 as parallel baselines)
This submission adds a small EMA filter to the Muon update:
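A minimal sketch of the filter (my notation, not taken from the PR): with $u_t$ the orthogonalized Muon update and $\beta_t$ the EMA weight,

$$\tilde{u}_t = \beta_t\,\tilde{u}_{t-1} + (1 - \beta_t)\,u_t,$$

so the parameters are stepped with $\tilde{u}_t$ instead of $u_t$, and $\beta_t = 0$ recovers the unsmoothed baseline.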
Various notions of update smoothing are common in optimization literature; this submission tests the most minimal version I could think of that can be easily "tuned" to be close to a baseline of no-smoothing.
The EMA weight is rather small, starting at 0.5 and decaying to 0.2 over training.
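As a concrete illustration, here is a hedged Python sketch of how such a filter might sit on top of a Muon-style step. The function names, the linear decay shape, and the buffer handling are my assumptions, not the PR's exact code:

```python
import torch

def ema_weight(step: int, total_steps: int, start: float = 0.5, end: float = 0.2) -> float:
    """EMA weight schedule decaying from 0.5 to 0.2 over training.
    The linear shape is an assumption; the PR only states the endpoints."""
    frac = min(step / max(total_steps, 1), 1.0)
    return start + (end - start) * frac

def smooth_update(update: torch.Tensor, ema_buf: torch.Tensor, beta: float) -> torch.Tensor:
    """Blend the fresh (already orthogonalized) Muon update with a running
    average of past updates; beta=0 recovers the no-smoothing baseline."""
    ema_buf.mul_(beta).add_(update, alpha=1.0 - beta)
    return ema_buf
```

The parameter step would then consume the returned buffer in place of the raw orthogonalized update.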
Other changes:
See the README.md file for more detailed info and stats.
Overall, the final time on my machine is 23.52 min vs 23.72 min for PR124 (99.1% of PR124's time), and marginally faster than PR128 (23.52 min vs 23.59 min). However, PR128's validation loss was not sufficiently low in my replication (see stats below).
Note that the times on my machine are slightly slower than those reported by the previous baselines.
I don't know why this is, but my final time is still slightly faster than the reported baseline time, so I believe the speed improvement should be robust.
Simple stats:
I ran 80 trials for the baselines and ablations. For the update smoothing change, I ran four sets of only 40 runs each, to check that the p-value has a reasonable chance of being small after a moderate number of runs.
That said, while the p-value has a good chance of being <0.01 after 40 trials, that chance is not extremely high; I think it is quite likely to be small after 80 runs. I don't know what the standard should be for making reproducibility easy here. I did test a run with 5630 iters, which I think will be extremely reliable in this regard.
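For context on how p-values of this kind can be computed (the test choice, target constant, and file name below are placeholders, not taken from this PR), a one-sided one-sample t-test of per-run final validation losses against the track's target loss would look like:

```python
import numpy as np
from scipy import stats

TARGET_LOSS = 2.92  # placeholder; substitute the track's actual target loss

def runs_pvalue(losses: np.ndarray) -> float:
    # One-sided test of H0: mean loss >= TARGET_LOSS, so a small
    # p-value indicates the runs reliably beat the target.
    return stats.ttest_1samp(losses, TARGET_LOSS, alternative="less").pvalue

losses = np.loadtxt("val_losses.txt")  # hypothetical: one final loss per run
print(f"n={losses.size}, p={runs_pvalue(losses):.4g}")
```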
Baselines:
PR124:
PR128:
This p-value is pretty high. I'm not sure what's wrong and I haven't investigated. The method itself seems reasonable and somewhat similar to the one proposed here.
Update smoothing data
with EMA update smoothing (4 replicates of 40 runs each to ensure p-value has reasonable chance of being small):
So: one replicate has a high p-value (0.135), two are very close to 0.01 (0.0103 and 0.0086), and one has a moderate value (0.0245). If we group them into two sets of 80 runs, the p-values are 0.0125 and 0.00025.
Full stats over all 160 runs:
Slower run for a more reliable p-value:
To get a more reliable p-value, I increased iterations to 5630 (60 fewer than PR124). I also restored the Muon lr to 0.025. After 40 runs, this yields:
So: a slightly slower run, but much higher confidence.
Simple Ablation
Increase iters to 5940, remove update smoothing, keep other changes the same:
So it seems a little slower and doesn't hit the baseline. Not necessarily conclusive (better lr tuning might fix it), but it is at least suggestive that smoothing is helpful.
PyTorch/CUDA info
As copied from the output file: