
Conversation

@tomMoral (Member) commented Dec 6, 2025

Try the sinusoidal init from "Sinusoidal Initialization, Time for a New Start".
The implementation is taken from jmiravet/Sinusoidal-Initialization by @jmiravet.

It seems to consistently improve results for the AdamW solver but not for Scion.
Full interactive results are available here.

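For readers unfamiliar with the scheme, here is a minimal sketch of a sinusoidal-style initializer in PyTorch: each row of a weight matrix is filled with a sinusoid whose frequency and phase depend on the row index, and the matrix is then rescaled to a target std. This is only an illustration of the idea, not the implementation used in this PR or in jmiravet/Sinusoidal-Initialization.

```python
import math
import torch

def sinusoidal_init_(weight: torch.Tensor, target_std: float) -> torch.Tensor:
    """Illustrative sinusoidal-style init (not the reference implementation).

    Each output row is a deterministic sinusoid with a row-dependent frequency
    and phase; the whole matrix is then rescaled to match `target_std`.
    """
    fan_out, fan_in = weight.shape
    cols = torch.arange(fan_in, dtype=torch.float32)                # column index
    rows = torch.arange(fan_out, dtype=torch.float32).unsqueeze(1)  # row index
    freqs = 2 * math.pi * (rows + 1) / fan_in                       # row-dependent frequency
    phases = math.pi * rows / fan_out                               # row-dependent phase
    w = torch.sin(freqs * cols + phases)
    w = w * (target_std / w.std())                                  # match the requested std
    with torch.no_grad():
        weight.copy_(w)
    return weight

# Usage: match the He-normal std, as discussed later in the thread.
layer = torch.nn.Linear(768, 768)
sinusoidal_init_(layer.weight, target_std=math.sqrt(2.0 / layer.in_features))
```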

@jmiravet commented Dec 6, 2025

Love it! Thanks, @tomMoral, for testing it. I forked the project this morning. You probably know this benchmark much better than I do: the variance gain for each layer has already been optimized beyond the recommended initializations. I'm updating my repo to incorporate the fan_mode and gain parameters.
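A hypothetical signature for that extension, mirroring the conventions of torch.nn.init (a fan_mode argument plus calculate_gain) rather than the actual API of the repo, could look like this, delegating to the sinusoidal fill sketched above:

```python
import math
from torch.nn.init import calculate_gain

def sinusoidal_init_with_fan_(weight, fan_mode="fan_in", nonlinearity="relu"):
    """Hypothetical wrapper: derive the target std from fan_mode and gain,
    in the spirit of torch.nn.init.kaiming_*_, then fill sinusoidally."""
    fan_out, fan_in = weight.shape
    fan = fan_in if fan_mode == "fan_in" else fan_out
    gain = calculate_gain(nonlinearity)          # e.g. sqrt(2) for ReLU
    target_std = gain / math.sqrt(fan)           # He-style std
    return sinusoidal_init_(weight, target_std)  # from the sketch above
```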

@tomMoral (Member, Author)

Thanks for the feedback! If you see any misuse of the init, feel free to let me know how to fix it.

From the results, it seems to work well with Adam but does not provide an improvement with Scion (an optimizer with orthogonal updates). Does that match your observations?

@jmiravet

Hi. We have tested the sinusoidal initialization with SGD, Adam, and AdamW and observed an improvement over Xavier, He, Orthogonal, and LSUV initialization methods.

Xavier and He are both based on normal or uniform distributions, but the std value is quite important, so for the sinusoidal initialization we matched the He std values.

You got everything right. The only consideration is that the default GPT init from the NanoGPT benchmark defines custom std values for each layer, which can make a difference in terms of training speed.
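To make the std comparison concrete, here are the formulas behind the values being matched, plus an illustrative per-layer tweak of the kind GPT-style inits use (the numbers are examples, not the benchmark's actual values):

```python
import math

def he_std(fan_in: int) -> float:
    return math.sqrt(2.0 / fan_in)               # Kaiming/He normal std (ReLU gain)

def xavier_std(fan_in: int, fan_out: int) -> float:
    return math.sqrt(2.0 / (fan_in + fan_out))   # Glorot/Xavier normal std

d_model, n_layers = 768, 12
print(he_std(d_model), xavier_std(d_model, d_model))

# GPT-style inits typically go further and pick a per-layer std, e.g. shrinking
# the residual-branch projections by 1/sqrt(2 * n_layers) (illustrative only).
resid_proj_std = he_std(d_model) / math.sqrt(2 * n_layers)
```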

Regarding other optimizers, I haven’t tested any optimizer with orthogonal updates. I did experiment with a custom variant of Adam that modified the update to keep the mean of the weights equal to 0, but the results were worse and I stopped pursuing that direction.
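One possible reading of that mean-zero variant (a guess, not necessarily what was actually tried) is to re-center each weight matrix after the optimizer step:

```python
import torch

model = torch.nn.Linear(768, 768)
opt = torch.optim.Adam(model.parameters(), lr=3e-4)

def step_with_zero_mean(loss: torch.Tensor) -> None:
    opt.zero_grad()
    loss.backward()
    opt.step()
    # Re-center 2-D weights so each matrix keeps an exactly zero mean.
    with torch.no_grad():
        for p in model.parameters():
            if p.ndim == 2:
                p.sub_(p.mean())
```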

@tomMoral (Member, Author)

OK, cool! Thanks a lot for the feedback :) I will then keep the sinusoidal init for AdamW, where it improves results, and remove it for Scion.

For the custom init, are you referring, for instance, to this init with a custom uniform bound?
https://github.com/KellerJordan/modded-nanogpt/blob/master/train_gpt.py#L899

@jmiravet

Yes, exactly.

In that snippet the std is explicitly set first, and the uniform bounds are then computed from that std. My guess is that this choice is motivated by “edge-of-chaos” considerations: tuning the variance to improve information propagation and training behavior.
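Independent of the exact values on that line, the pattern is: pick a std, then convert it into a bound for a symmetric uniform distribution, since Uniform(-b, b) has std b / sqrt(3). A small sketch with an illustrative std (not the one from the linked train_gpt.py line):

```python
import math
import torch

def uniform_from_std_(weight: torch.Tensor, std: float) -> torch.Tensor:
    # Uniform(-bound, bound) has variance bound**2 / 3, so bound = sqrt(3) * std.
    bound = math.sqrt(3.0) * std
    with torch.no_grad():
        return weight.uniform_(-bound, bound)

w = torch.empty(768, 768)
uniform_from_std_(w, std=0.5 * 768 ** -0.5)  # illustrative std value
print(w.std())                               # close to the requested std
```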

