@Steboss
Contributor

@Steboss Steboss commented Nov 14, 2025

Fixes nsys-jax. When profiling NCCL operations, the data loader failed in _load_nvtx_gpu_proj_trace_single while trying to link thunks to their parent XLA modules.

Investigation of the input profiles revealed an unexpected hierarchy structure. The original code assumed the NVTX range hierarchy would contain only:

  • XlaModule ranges
  • Thunk ranges
  • Direct parent-child relationships between them

However, NCCL collective operations inject intermediate ranges into the hierarchy:

Expected:   Thunk -> XlaModule
            Thunk -> Thunk -> XlaModule

Actual:     Thunk -> ncclGroupEnd -> XlaModule
            Thunk -> ncclGroupStart -> XlaModule

The assertion assert (mask == mod_id_names.str.startswith(thunk_prefix)).all() failed because the code assumed that any non-module parent must be a thunk, but NCCL ranges such as ncclGroupEnd are neither modules nor thunks.
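To illustrate the idea of tolerating intermediate ranges, here is a minimal sketch that walks up the parent chain until it reaches a module, skipping anything in between. It is not the code in this PR: it uses plain dictionaries instead of the pandas data frames the loader works with, and the names find_module_ancestor, MODULE_PREFIX, THUNK_PREFIX, the prefix strings, and the example range names are all hypothetical.

```python
from typing import Dict, Optional

# Assumed prefixes, for illustration only; the real loader may use different strings.
MODULE_PREFIX = "XlaModule:"
THUNK_PREFIX = "Thunk:"


def find_module_ancestor(
    range_id: int,
    names: Dict[int, str],
    parents: Dict[int, Optional[int]],
) -> Optional[int]:
    """Return the ID of the nearest ancestor that is an XlaModule range.

    Intermediate ranges (other thunks, ncclGroupStart, ncclGroupEnd, ...)
    are skipped rather than asserted against. Returns None if no module
    ancestor exists.
    """
    current = parents.get(range_id)
    visited = set()  # guards against a malformed, cyclic hierarchy
    while current is not None and current not in visited:
        if names[current].startswith(MODULE_PREFIX):
            return current
        visited.add(current)
        current = parents.get(current)
    return None


# Hypothetical hierarchy: range ID -> name, and range ID -> parent ID.
names = {
    0: "XlaModule:jit_train_step",
    1: "Thunk:fusion.1",
    2: "ncclGroupEnd",
    3: "Thunk:all-reduce.1",
    4: "Thunk:while.1",
    5: "Thunk:add.2",
}
parents = {0: None, 1: 0, 2: 0, 3: 2, 4: 0, 5: 4}

# Map every thunk to its module, whatever sits between them.
module_of_thunk = {
    rid: find_module_ancestor(rid, names, parents)
    for rid, name in names.items()
    if name.startswith(THUNK_PREFIX)
}
```

In this example, Thunk:all-reduce.1 resolves to the module through ncclGroupEnd, and Thunk:add.2 resolves through its parent thunk.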

This PR changes:

  • the data loaders: how intermediate range types are handled, and the alignment with pandas indexes.
  • the analysis: adds overlap logic. Tasks that overlap a parallel region are now classified by where most of their execution time falls (before or after the parallel region).
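One possible reading of the overlap rule, as a rough sketch rather than the PR's actual implementation: measure how much of a task runs before the parallel region starts and how much runs after it ends, then pick the larger side. The function name classify_task, the half-open interval convention, the "inside" label for tasks that never leave the region, and the tie-break towards "before" are all assumptions.

```python
def classify_task(
    task_start: float,
    task_end: float,
    region_start: float,
    region_end: float,
) -> str:
    """Classify a task relative to a parallel region.

    Intervals are treated as half-open, [start, end). Returns "before",
    "after", or "inside" (the last one is an assumption for tasks that
    have no execution time outside the region).
    """
    # Portion of the task that runs before the region begins.
    time_before = max(0.0, min(task_end, region_start) - task_start)
    # Portion of the task that runs after the region ends.
    time_after = max(0.0, task_end - max(task_start, region_end))

    if time_before == 0.0 and time_after == 0.0:
        return "inside"
    # Assumed tie-break: equal time on both sides counts as "before".
    return "before" if time_before >= time_after else "after"


# Parallel region spans [10, 20).
mostly_before = classify_task(0.0, 12.0, 10.0, 20.0)  # 10 units before, 0 after
mostly_after = classify_task(8.0, 30.0, 10.0, 20.0)   # 2 units before, 10 after
contained = classify_task(12.0, 18.0, 10.0, 20.0)     # no time outside the region
```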

@Steboss Steboss requested a review from gpupuck November 14, 2025 17:09
@gpupuck
Contributor

gpupuck commented Nov 14, 2025

Is this new NCCL behavior, or has NCCL always injected these ranges?
