@Kapil-Shyam-Pawar
Contributor

@Kapil-Shyam-Pawar Kapil-Shyam-Pawar commented Nov 14, 2025

Details

Do not mention proprietary info or link to internal work items in this PR.

Work item: LWPCLPAT-620, LWPCLPAT-624

What were the changes?
Added replay_log_converter.py, which converts replay logs between BIN and JSON formats, generates standardized JSON output that standard JSON libraries can parse, and sanitizes JSON logs for easier comparison.

Why were the changes made?
JSON logs are human-readable, so users can open the file and see what is going on; however, the Replayer currently works only with .BIN input.

Additionally, comparing logs from different test runs is difficult due to variable pointer addresses and timestamps.

This tool lets users convert the generated BIN logs to JSON format (for the same run) so they can be viewed and analyzed, and normalize logs for comparison.

How was the outcome achieved?
The tool converts between the two formats, and standardizes or sanitizes JSON logs, using the following commands:
Binary to JSON: python3 replay_log_converter.py <basename> tojson
JSON to Binary: python3 replay_log_converter.py <basename> tobin
Standardize JSON: python3 replay_log_converter.py <basename> --standardize
Sanitize JSON: python3 replay_log_converter.py <basename> --sanitize
Sanitize JSON (No Timestamp): python3 replay_log_converter.py <basename> --sanitize --no-timestamp (or --nts), which sets all timestamps to 0.0
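
As a rough illustration of how the command forms above fit together, here is a minimal argparse sketch. It is not taken from replay_log_converter.py: the function name build_parser, the destination names, and the help strings are assumptions, and the real script may parse its arguments differently. Only the argument spellings (tojson, tobin, --standardize, --sanitize, --no-timestamp, --nts) come from the list above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser that mirrors the commands listed in this PR;
    # the actual replay_log_converter.py may be structured differently.
    parser = argparse.ArgumentParser(prog="replay_log_converter.py")
    parser.add_argument("basename", help="base name of the log file")
    # The conversion direction is optional so that the flag-only forms
    # (--standardize, --sanitize) can be used without it.
    parser.add_argument("direction", nargs="?", choices=["tojson", "tobin"],
                        help="conversion direction")
    parser.add_argument("--standardize", action="store_true",
                        help="write JSON that standard JSON libraries can parse")
    parser.add_argument("--sanitize", action="store_true",
                        help="normalize pointers and timestamps for comparison")
    parser.add_argument("--no-timestamp", "--nts", dest="no_timestamp",
                        action="store_true",
                        help="with --sanitize, set all timestamps to 0.0")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

With this sketch, `build_parser().parse_args(["run1", "--sanitize", "--nts"])` yields a namespace with sanitize and no_timestamp set and no conversion direction; "run1" is a placeholder base name.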

Additional Documentation:
Because the JSON logs generated by the recorder do not currently record GroupStart and GroupEnd calls, converting JSON logs to BIN format and executing RcclReplayer against them may not work as expected.

The --sanitize option normalizes logs for easier comparison by:
* Remapping pointers to readable identifiers (e.g., comm : 0x7fb680328010 → comm : comm_001)
* Normalizing timestamps relative to the first call (e.g., time : 1762969171532.248535 → time : 0.000000)
* Preserving relationships: same pointer values get the same sanitized identifier
* Sanitized fields: communicators (comm), unique IDs (uniqueID), streams (stream), buffer addresses (addr/base/ptr/acc), handles (handle), thread IDs (thread), and process IDs (pid)
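
The sanitization steps listed above can be sketched roughly as follows. This is an illustrative sketch, not the code in replay_log_converter.py: it assumes each logged call is already available as a Python dict, that timestamps live in a numeric "time" field, and that identifiers are numbered per field in order of first appearance. The names sanitize, SANITIZED_FIELDS, and zero_timestamps are hypothetical.

```python
from collections import defaultdict

# Field names from the list above; how the real tool groups them is not
# stated in the PR, so each field gets its own counter here (an assumption).
SANITIZED_FIELDS = ("comm", "uniqueID", "stream", "addr", "base", "ptr",
                    "acc", "handle", "thread", "pid")


def sanitize(calls, zero_timestamps=False):
    """Return sanitized copies of `calls`, a list of dicts (assumed layout)."""
    seen = defaultdict(dict)  # field name -> {original value: sanitized id}
    first_time = None
    result = []
    for call in calls:
        clean = dict(call)
        for field in SANITIZED_FIELDS:
            if field in clean:
                mapping = seen[field]
                original = clean[field]
                # The same original value always maps to the same identifier,
                # which preserves relationships between calls.
                if original not in mapping:
                    mapping[original] = f"{field}_{len(mapping) + 1:03d}"
                clean[field] = mapping[original]
        if "time" in clean:
            if first_time is None:
                first_time = clean["time"]
            # Either zero every timestamp or make it relative to the first call.
            clean["time"] = 0.0 if zero_timestamps else clean["time"] - first_time
        result.append(clean)
    return result
```

Two runs that issue the same sequence of calls would then produce matching identifiers even though the raw pointer values differ, so the sanitized files can be compared with an ordinary diff. Passing zero_timestamps=True corresponds to the behavior described for --no-timestamp.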

Approval Checklist

Do not approve until these items are satisfied.

  • Verify the CHANGELOG has been updated, if
    • there are any NCCL API version changes,
    • any changes impact library users, and/or
    • any changes impact any other ROCm library.

@nileshnegi nileshnegi added the noCI Disable Jenkins for this PR. label Nov 14, 2025
@nileshnegi
Contributor

@Kapil-Shyam-Pawar, just merge this when ready. CI does not exercise this code path.

@Kapil-Shyam-Pawar Kapil-Shyam-Pawar merged commit 5fd8602 into ROCm:develop Nov 24, 2025
14 of 17 checks passed
@Kapil-Shyam-Pawar Kapil-Shyam-Pawar deleted the PAT-620/Conv_tool branch November 24, 2025 18:00

Labels

noCI Disable Jenkins for this PR.

4 participants