
Conversation

@tomMoral (Member) commented Dec 6, 2025

Try the sinusoidal init from "Sinusoidal Initialization, Time for a New Start".
The implementation is taken from jmiravet/Sinusoidal-Initialization by @jmiravet.

It seems to consistently improve results for the AdamW solver but not for Scion.
Full interactive results are available here.

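For readers unfamiliar with the scheme, here is a minimal sketch of a sinusoidal-style initializer in PyTorch: each row of a weight matrix is filled with a sinusoid whose frequency and phase depend on the row index, and the matrix is then rescaled to a target std. This is only an illustration of the idea, not the implementation used in this PR or in jmiravet/Sinusoidal-Initialization.

```python
import math
import torch

def sinusoidal_init_(weight: torch.Tensor, target_std: float) -> torch.Tensor:
    """Illustrative sinusoidal-style init (not the reference implementation).

    Each output row is a deterministic sinusoid with a row-dependent frequency
    and phase; the whole matrix is then rescaled to match `target_std`.
    """
    fan_out, fan_in = weight.shape
    cols = torch.arange(fan_in, dtype=torch.float32)                # column index
    rows = torch.arange(fan_out, dtype=torch.float32).unsqueeze(1)  # row index
    freqs = 2 * math.pi * (rows + 1) / fan_in                       # row-dependent frequency
    phases = math.pi * rows / fan_out                               # row-dependent phase
    w = torch.sin(freqs * cols + phases)
    w = w * (target_std / w.std())                                  # match the requested std
    with torch.no_grad():
        weight.copy_(w)
    return weight

# Usage: match the He-normal std, as discussed later in the thread.
layer = torch.nn.Linear(768, 768)
sinusoidal_init_(layer.weight, target_std=math.sqrt(2.0 / layer.in_features))
```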

@jmiravet commented Dec 6, 2025

Love it! Thanks, @tomMoral, for testing it. I forked the project this morning. You probably know this benchmark much better than I do: the variance gain for each layer has already been optimized beyond the recommended initializations. I'm updating my repo to incorporate the fan_mode and gain parameters.
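A hypothetical signature for that extension, mirroring the conventions of torch.nn.init (a fan_mode argument plus calculate_gain) rather than the actual API of the repo, could look like this, delegating to the sinusoidal fill sketched above:

```python
import math
from torch.nn.init import calculate_gain

def sinusoidal_init_with_fan_(weight, fan_mode="fan_in", nonlinearity="relu"):
    """Hypothetical wrapper: derive the target std from fan_mode and gain,
    in the spirit of torch.nn.init.kaiming_*_, then fill sinusoidally."""
    fan_out, fan_in = weight.shape
    fan = fan_in if fan_mode == "fan_in" else fan_out
    gain = calculate_gain(nonlinearity)          # e.g. sqrt(2) for ReLU
    target_std = gain / math.sqrt(fan)           # He-style std
    return sinusoidal_init_(weight, target_std)  # from the sketch above
```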

@tomMoral (Member, Author)

Thanks for the feedback! If you see any misuse of the init, feel free to let me know how to fix it.

From the results, it seems to work well with Adam but does not provide an improvement with Scion (an optimizer with orthogonal updates). Does that match your observations?

@jmiravet

Hi. We have tested the sinusoidal initialization with SGD, Adam, and AdamW and observed an improvement over Xavier, He, Orthogonal, and LSUV initialization methods.

Xavier and He are both based on normal or uniform distributions, but the std value is quite important, so for the sinusoidal initialization we matched the He std values.

You got everything right. The only consideration is that the default GPT init from the NanoGPT benchmark defines custom std values for each layer, which can make a difference in terms of training speed.
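To make the std comparison concrete, here are the formulas behind the values being matched, plus an illustrative per-layer tweak of the kind GPT-style inits use (the numbers are examples, not the benchmark's actual values):

```python
import math

def he_std(fan_in: int) -> float:
    return math.sqrt(2.0 / fan_in)               # Kaiming/He normal std (ReLU gain)

def xavier_std(fan_in: int, fan_out: int) -> float:
    return math.sqrt(2.0 / (fan_in + fan_out))   # Glorot/Xavier normal std

d_model, n_layers = 768, 12
print(he_std(d_model), xavier_std(d_model, d_model))

# GPT-style inits typically go further and pick a per-layer std, e.g. shrinking
# the residual-branch projections by 1/sqrt(2 * n_layers) (illustrative only).
resid_proj_std = he_std(d_model) / math.sqrt(2 * n_layers)
```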

Regarding other optimizers, I haven’t tested any optimizer with orthogonal updates. I did experiment with a custom variant of Adam that modified the update to keep the mean of the weights equal to 0, but the results were worse and I stopped pursuing that direction.
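One possible reading of that mean-zero variant (a guess, not necessarily what was actually tried) is to re-center each weight matrix after the optimizer step:

```python
import torch

model = torch.nn.Linear(768, 768)
opt = torch.optim.Adam(model.parameters(), lr=3e-4)

def step_with_zero_mean(loss: torch.Tensor) -> None:
    opt.zero_grad()
    loss.backward()
    opt.step()
    # Re-center 2-D weights so each matrix keeps an exactly zero mean.
    with torch.no_grad():
        for p in model.parameters():
            if p.ndim == 2:
                p.sub_(p.mean())
```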

@tomMoral (Member, Author)

OK, cool! Thanks a lot for the feedback :) I will then keep the sinusoidal init for AdamW, where it improves results, and remove it for Scion.

For the custom init, are you referring, for instance, to this init with a custom uniform bound?
https://github.com/KellerJordan/modded-nanogpt/blob/master/train_gpt.py#L899

@jmiravet

Yes, exactly.

In that snippet the std is explicitly set first, and the uniform bounds are then computed from that std. My guess is that this choice is motivated by “edge-of-chaos” considerations: tuning the variance to improve information propagation and training behavior.
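Independent of the exact values on that line, the pattern is: pick a std, then convert it into a bound for a symmetric uniform distribution, since Uniform(-b, b) has std b / sqrt(3). A small sketch with an illustrative std (not the one from the linked train_gpt.py line):

```python
import math
import torch

def uniform_from_std_(weight: torch.Tensor, std: float) -> torch.Tensor:
    # Uniform(-bound, bound) has variance bound**2 / 3, so bound = sqrt(3) * std.
    bound = math.sqrt(3.0) * std
    with torch.no_grad():
        return weight.uniform_(-bound, bound)

w = torch.empty(768, 768)
uniform_from_std_(w, std=0.5 * 768 ** -0.5)  # illustrative std value
print(w.std())                               # close to the requested std
```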

