Hey, thanks a ton for open-sourcing this — awesome work!
I was curious if you could share a bit about the compute setup you used to train the model:
- what kind of GPUs you ran on (and how many)
- roughly how long training took (in days or GPU-hours), both per stage and overall
Just trying to get a sense of the scale involved. Appreciate any details you can share!