Hey, thanks a ton for open-sourcing this — awesome work!
I was curious if you could share a bit about the compute setup you used to train the model:
- what kind of GPUs you ran on (and how many)
- roughly how long training took (in days or GPU-hours), both per stage and overall
Just trying to get a sense of the scale involved. Appreciate any details you can share!