Conversation

@leloykun (Contributor)

I'm building an RL env for NanoGPT speedrunning, and this was one of the better patches found so far. It reduces wallclock time by 1-2 seconds on my 8xH100s.

I'm still working on the env, but I'm dropping this here in case others wanna add it to their records.

@leloykun leloykun marked this pull request as draft September 11, 2025 17:27
@leloykun (Contributor, Author) commented Sep 11, 2025

Weird, switching to another cloud platform for the 8xH100s caused a regression... currently double-checking that I copy-pasted the right patch...

@Gusarich (Contributor)

It gets about 3.35 val loss in the end when I run it.

@leloykun (Contributor, Author)

Yeah, for some reason this works perfectly well on Modal Sandboxes, but not on PrimeIntellect machines (both SXM 8xH100s). This has been driving me crazy, tbh.

@ClassicLarry (Collaborator)

This direction looks promising to me, but it might require an Nsight profiler deep dive to fully understand when these streams get scheduled. My concern is that if the hardware is deciding when to prioritize this CPU-to-GPU data transfer stream, it could block the main GPU stream in hard-to-detect ways. I'm not sure of the best way to interleave this with the forward and backward passes, but ideally we can do it in a way that executes consistently across different GPU providers.
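
For reference, here is a minimal sketch of the kind of prioritized side-stream prefetch being discussed; it is not the actual patch, and the `prefetch`/`consume` helpers, the stream priority, and the event handling are illustrative assumptions only:

```python
# Hypothetical sketch (not the actual patch): stage the next batch's CPU-to-GPU
# copy on a dedicated high-priority stream, and have the main stream wait on a
# single event instead of a full device synchronization.
import torch

copy_stream = torch.cuda.Stream(priority=-1)  # -1 = higher priority than the default stream
copy_done = torch.cuda.Event()

def prefetch(batch_cpu: torch.Tensor) -> torch.Tensor:
    """Launch the host-to-device copy asynchronously; the returned GPU tensor
    is only valid after the main stream has waited on `copy_done`."""
    pinned = batch_cpu.pin_memory()  # pinned memory is required for a truly async copy
    # Don't start copying until the main stream is done reading the previous batch.
    copy_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(copy_stream):
        batch_gpu = pinned.to("cuda", non_blocking=True)
        copy_done.record(copy_stream)
    return batch_gpu

def consume(batch_gpu: torch.Tensor) -> torch.Tensor:
    # Make the main stream wait only on the copy event, not on the whole device.
    torch.cuda.current_stream().wait_event(copy_done)
    # Tell the caching allocator that the main stream now uses this memory.
    batch_gpu.record_stream(torch.cuda.current_stream())
    return batch_gpu
```

Whether the copy actually overlaps the forward/backward pass (rather than stalling it) depends on how the driver schedules the two streams, which is exactly what an Nsight Systems trace would need to confirm.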
