@wsttiger wsttiger commented Nov 7, 2025

Add TensorRT Decoder Training Tutorial

Overview

This PR introduces comprehensive documentation for training and deploying neural network-based quantum error correction decoders using the TensorRT decoder plugin (trt_decoder), which is being released with CUDA-Q QEC v0.5.0.

Motivation

With the release of the TensorRT decoder, users need clear guidance on:

  • How to generate training data for QEC decoding tasks
  • How to train custom neural network decoders
  • How to export models to ONNX format
  • How to deploy trained models with TensorRT for accelerated GPU inference

This tutorial provides an end-to-end workflow demonstrating the complete pipeline from data generation to production deployment.
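The data-generation step of that pipeline can be sketched in a few lines. This is a hedged stand-in, not the tutorial's code: it uses a toy [7,4] Hamming parity-check matrix and i.i.d. bit-flip noise in NumPy, whereas the actual tutorial samples surface-code syndromes with Stim.

```python
import numpy as np

# Illustrative only: generate (syndrome, error) training pairs from a toy
# parity-check matrix. The real tutorial samples syndromes from a Stim
# surface-code circuit instead.
rng = np.random.default_rng(0)

# Parity-check matrix of the [7,4] Hamming code, standing in for a QEC code.
H = np.array([
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
], dtype=np.uint8)

def sample_batch(n_samples: int, p: float = 0.05):
    """Sample i.i.d. bit-flip errors and their syndromes s = H e mod 2."""
    errors = (rng.random((n_samples, H.shape[1])) < p).astype(np.uint8)
    syndromes = errors @ H.T % 2
    return syndromes, errors

syndromes, errors = sample_batch(1024)
print(syndromes.shape, errors.shape)  # (1024, 3) (1024, 7)
```

The (syndrome, error) pairs are exactly the supervised-learning inputs and labels a neural decoder trains on.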

Changes Made

New Documentation

  • training_ai_decoders.rst: Comprehensive tutorial covering:
    • Training neural network decoders with PyTorch and Stim
    • Surface code circuit generation and data sampling
    • MLP architecture design and training workflow
    • ONNX model export
    • TensorRT deployment with Python and C++ examples
    • Converting ONNX models to TensorRT engines using trtexec
    • Performance tuning and best practices
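The trtexec conversion step covered in the tutorial takes roughly this shape (a command-line fragment, not runnable without the TensorRT toolkit installed; the file names here are placeholders):

```shell
# Build a serialized TensorRT engine from an exported ONNX model.
# --fp16 enables half-precision kernels; other precision flags
# (e.g. --bf16, --int8) select the other modes the tutorial covers.
trtexec --onnx=mlp_decoder.onnx \
        --saveEngine=mlp_decoder.plan \
        --fp16
```

The resulting `.plan` engine file is what a pre-built TensorRT deployment loads at inference time.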

Training Script

  • train_mlp_decoder.py: Complete working example demonstrating:
    • Stim-based synthetic data generation for surface codes
    • PyTorch MLP decoder implementation
    • Training loop with validation and early stopping
    • ONNX export for TensorRT deployment
  • The script was moved from the unittest directory to docs/sphinx/examples/qec/python/
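The MLP architecture the script implements can be sketched as follows. This is an assumption-laden NumPy stand-in (toy layer sizes, hand-rolled ReLU and sigmoid), not the script itself, which builds the equivalent model in PyTorch:

```python
import numpy as np

# Hedged sketch of an MLP decoder's forward pass in plain NumPy.
# The tutorial's train_mlp_decoder.py builds the analogous model with
# PyTorch layers; sizes below are illustrative, not the surface-code ones.
rng = np.random.default_rng(1)

n_syndrome, n_hidden, n_error = 3, 32, 7  # toy dimensions

W1 = rng.normal(0.0, 0.1, (n_syndrome, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(0.0, 0.1, (n_hidden, n_error))
b2 = np.zeros(n_error)

def decode(s):
    """Map syndrome bits to per-bit error probabilities."""
    h = np.maximum(s @ W1 + b1, 0.0)              # ReLU hidden layer
    return 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))   # sigmoid output

probs = decode(np.array([1.0, 0.0, 1.0]))
pred = (probs > 0.5).astype(int)  # hard decision per error bit
print(probs.shape)  # (7,)
```

Training then minimizes a per-bit binary cross-entropy between `probs` and the sampled error labels.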

Documentation Updates

  • examples.rst: Added new tutorial to QEC examples table of contents
  • decoders.rst: Minor formatting cleanup

Tutorial Features

  • ✅ Complete end-to-end workflow from training to deployment
  • ✅ Both Python and C++ usage examples
  • ✅ Multiple precision modes (fp16, fp32, bf16, int8, fp8, tf32, best)
  • ✅ Production deployment guidance with pre-built TensorRT engines
  • ✅ Best practices and performance tuning tips
  • ✅ Clear dependency requirements

Target Audience

  • Users training custom neural network decoders for QEC
  • Researchers experimenting with ML-based decoding approaches
  • Developers deploying production QEC systems with TensorRT acceleration

Testing

The training script has been tested and successfully:

  • Generates training data using Stim
  • Trains an MLP decoder on surface code syndromes
  • Exports models to ONNX format compatible with TensorRT
  • Achieves convergence with validation accuracy monitoring

Related Components

- Add comprehensive tutorial for training neural network decoders
- Demonstrate PyTorch/Stim workflow for surface code decoding
- Include ONNX export and TensorRT deployment examples
- Move training script to examples directory

Signed-off-by: Scott Thornton <[email protected]>
@bmhowe23 bmhowe23 added the documentation Improvements or additions to documentation label Nov 7, 2025
Consolidate AI decoder training documentation into decoders.rst and enhance
PyTorch installation guidance:

- Merge training_ai_decoders.rst content into decoders.rst for better
  organization and discoverability
- Update terminology from "Neural Network" to "AI" decoders throughout
- Add "Optional Dependencies" section in installation guide with:
  - PyTorch requirements for Tensor Network Decoder, Generative Quantum
    Eigensolver (GQE), and AI decoder training
  - Link to PyTorch installation page
  - CUDA 12.8+ requirement note
- Add cross-reference links between installation guide and AI decoder
  deployment section
- Clarify that AI decoders don't use parity check matrix (placeholder
  only) in Python and C++ code examples
- Remove standalone training_ai_decoders.rst file
- Update examples.rst table of contents

These changes improve documentation clarity and help users understand
PyTorch dependencies and AI decoder workflows.

Signed-off-by: Scott Thornton <[email protected]>
@bmhowe23
Collaborator

@wsttiger regarding the CI failures, what is the plan? Do the CI runners need an extra package installed? If so, can you add the package to the runners as part of this PR?

wsttiger and others added 2 commits November 13, 2025 22:58
Add comprehensive tutorial for training and deploying AI decoders with
TensorRT, along with improved PyTorch installation guidance:

Documentation changes:
- Add train_mlp_decoder.py example showing complete workflow for training
  neural network decoders using PyTorch and Stim
- Update decoders.rst with AI decoder deployment documentation and
  TensorRT decoder usage examples
- Add "Optional Dependencies" section in installation guide with:
  - PyTorch requirements for Tensor Network Decoder (CPU only), Generative
    Quantum Eigensolver (GQE), and AI decoder training
  - Link to PyTorch installation page
  - Note for Blackwell architecture users requiring CUDA 12.8+
- Clarify that AI decoders don't use parity check matrix (placeholder only)
  in Python and C++ code examples
- Add cross-reference links between installation guide and AI decoder section

CI/CD changes:
- Update GitHub Actions workflows to support new documentation examples
- Modify all_libs.yaml, all_libs_release.yaml, lib_qec.yaml, and
  lib_solvers.yaml for proper build integration

These changes provide users with a complete guide for training custom AI
decoders and deploying them with optimized TensorRT inference.

Signed-off-by: Scott Thornton <[email protected]>
@wsttiger
Collaborator Author

/ok to test 3e3ecc1

@bmhowe23
Collaborator

Approved, but let's hold off on merging until the release is done.

@bmhowe23 bmhowe23 enabled auto-merge (squash) November 19, 2025 00:54
@bmhowe23 bmhowe23 merged commit 3079c21 into NVIDIA:main Nov 19, 2025
22 checks passed