
ILGPU V2.0: Faster ILGPUC Compile Times + NuGet Packaging #1585

Open
m4rs-mt wants to merge 14 commits into master from faster_compile_time


m4rs-mt (Owner) commented Apr 30, 2026

A follow-on to the V2.0 stack focused on two largely independent concerns: (1) cutting frontend compile time on the new AOT pipeline and (2) shipping ILGPUC as a real NuGet package with an end-to-end test harness that proves it.

A note on authorship

This PR was done with AI-assisted pair programming (Claude via Claude Code). Every real commit carries a Co-Authored-By: Claude trailer.

What's in this PR

1. Compile-time performance (Src/ILGPUC/Frontend/)

Profiling a trivial a[i] = b[i] * c[i] kernel showed the frontend dominating compile time at ~95 ms / iteration (~90% of total). Breakdown:

  • DisassembleMethods walk: ~66 ms wall (1720 methods, only ~9 ms of which was real disassembly — the rest was lock contention).
  • LoadDebugSymbols (PDB): ~28 ms wall (5 streams opened + parsed every iter).
  • Sequence-point attach: 0.05 ms.

Both disassembly and PDB load are backend-independent and identical across every kernel compiled by one KernelCompiler instance, but none of that work was being shared. Two commits and one new attribute address this:

  • Cached disassembly + PDB load across kernel compiles — a new ILFrontendCache shares disassembled methods and PDB streams across kernel compiles within a single KernelCompiler.
  • Lazy frontend disassembly with codegen-time intrinsic safety net — replaces the eager whole-assembly walk with on-demand disassembly. A codegen-time safety net makes sure intrinsic-bound calls are still resolved correctly even when the surrounding method was never disassembled.
  • New KernelLibraryAttribute (Src/ILGPU/KernelLibraryAttribute.cs) lets library assemblies opt in to having their kernel-relevant methods discovered without forcing a full assembly walk in user code.
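
The combination of a shared cache and on-demand disassembly can be pictured with a small, self-contained sketch. This is not the PR's ILFrontendCache or its frontend code: `DisassemblyCache`, `MethodBody`, `LazyFrontend`, and the string-keyed toy "assembly" are all hypothetical stand-ins. It only shows the general technique, assuming disassembly is a pure function of the method: memoize per method, and walk only what is reachable from the kernel entry point.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

// Hypothetical stand-in for a disassembled method: its name and the methods it calls.
public sealed record MethodBody(string Name, IReadOnlyList<string> Callees);

// Illustrative cache shared across kernel compiles (NOT the PR's ILFrontendCache).
public sealed class DisassemblyCache
{
    private readonly ConcurrentDictionary<string, Lazy<MethodBody>> _bodies = new();
    private readonly Func<string, MethodBody> _disassemble;
    private int _disassembleCalls;

    public DisassemblyCache(Func<string, MethodBody> disassemble) =>
        _disassemble = disassemble;

    // Number of times the underlying disassembler actually ran.
    public int DisassembleCalls => Volatile.Read(ref _disassembleCalls);

    // Lazy<T> (default thread-safety mode) makes the factory run at most once
    // per key, even if several threads race on GetOrAdd.
    public MethodBody Get(string method) =>
        _bodies.GetOrAdd(method, m => new Lazy<MethodBody>(() =>
        {
            Interlocked.Increment(ref _disassembleCalls);
            return _disassemble(m);
        })).Value;
}

public static class LazyFrontend
{
    // Worklist walk: disassemble only methods reachable from the entry point.
    public static HashSet<string> Compile(string entry, DisassemblyCache cache)
    {
        var visited = new HashSet<string>();
        var worklist = new Stack<string>();
        worklist.Push(entry);
        while (worklist.Count > 0)
        {
            var current = worklist.Pop();
            if (!visited.Add(current))
                continue; // already handled in this compile
            foreach (var callee in cache.Get(current).Callees)
                worklist.Push(callee);
        }
        return visited;
    }
}

public static class Demo
{
    public static void Main()
    {
        // Toy "assembly": four methods, one of which no kernel ever reaches.
        var assembly = new Dictionary<string, string[]>
        {
            ["KernelA"] = new[] { "Helper" },
            ["KernelB"] = new[] { "Helper" },
            ["Helper"] = Array.Empty<string>(),
            ["Unused"] = Array.Empty<string>(),
        };
        var cache = new DisassemblyCache(m => new MethodBody(m, assembly[m]));

        LazyFrontend.Compile("KernelA", cache);
        LazyFrontend.Compile("KernelB", cache);

        // KernelA, Helper, KernelB: Helper is reused, Unused is never touched.
        Console.WriteLine(cache.DisassembleCalls); // prints 3
    }
}
```

In this toy model an eager walk would disassemble all four methods on every compile (eight calls for two kernels); the lazy, cached version does three.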

2. Compile-time perf regression coverage (Src/ILGPUC.Tests/PerfTests/, Src/ILGPUC.Tests/IRTests/)

A new test layer keeps the wins above from regressing:

  • PerfTests/PerfTestBase.cs, PerfTests/CompilePerfRegressionTests.cs — perf budget assertions over representative kernels.
  • PerfTests/CompileBenchFacts.cs — checked-in profiling fact (xUnit [Fact]) so the breakdown above can be re-measured on demand.
  • Kernels/PerfRegressionKernels.cs — the kernel zoo the perf tests compile against.
  • IRTests/DeepCallStackIRTests.cs + Kernels/DeepCallStackKernels.cs — exercise the lazy walk with deep call graphs and add depth + negative assertions so a future regression can't quietly fall back to eager disassembly.
  • Framework/CompilationTestBase.cs, Framework/MsBuildRunner.cs — small framework additions to support the new layers.
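
For readers unfamiliar with perf budget assertions, the idea can be sketched as follows. This is a minimal illustration, not the contents of PerfTestBase.cs: the `PerfBudget` helper, its parameters, and the budget value are assumptions made up for the example. It times an action several times after a warmup, takes the median to dampen outliers, and fails if the median exceeds a fixed budget.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper; names and defaults are illustrative only.
public static class PerfBudget
{
    // Median wall time in milliseconds over `iterations` timed runs,
    // after `warmup` untimed runs (to absorb JIT and cold-cache effects).
    // For an even iteration count this returns the upper of the two middle samples.
    public static double MedianMilliseconds(
        Action action, int warmup = 3, int iterations = 15)
    {
        for (int i = 0; i < warmup; i++)
            action();

        var samples = new double[iterations];
        for (int i = 0; i < iterations; i++)
        {
            var stopwatch = Stopwatch.StartNew();
            action();
            stopwatch.Stop();
            samples[i] = stopwatch.Elapsed.TotalMilliseconds;
        }

        Array.Sort(samples);
        return samples[iterations / 2];
    }

    // Throws if the median run time exceeds the budget.
    public static void AssertWithinBudget(string name, Action action, double budgetMs)
    {
        var median = MedianMilliseconds(action);
        if (median > budgetMs)
        {
            throw new InvalidOperationException(
                $"{name}: median {median:F2} ms exceeds budget {budgetMs:F2} ms");
        }
    }
}

public static class PerfDemo
{
    public static void Main()
    {
        // Stand-in workload; a real test would compile a kernel here.
        PerfBudget.AssertWithinBudget(
            "toy-workload",
            () => { for (int i = 0; i < 1000; i++) { _ = Math.Sqrt(i); } },
            budgetMs: 1000);
        Console.WriteLine("within budget");
    }
}
```

Budgets of this kind are sensitive to runner hardware, so they are usually set with generous headroom and meant to catch large regressions rather than small drifts.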

3. NuGet packaging for ILGPUC (Src/ILGPUC/, Src/scripts/pack-ilgpuc.sh)

ILGPUC is now packaged as a NuGet package with R2R-compiled native binaries:

  • Src/ILGPUC/ILGPUC.csproj — packs the AOT-built compiler binaries per RID and wires up MSBuild integration props/targets.
  • Src/ILGPUC/PACKAGE.md — the README that ships in the package.
  • Src/ILGPUC/build/ILGPU.Kernels.targets — consumer-side MSBuild integration so referencing the package is enough to drive kernel compilation.
  • Src/scripts/pack-ilgpuc.sh — pack script that runs the R2R build + nupkg assembly across RIDs.
  • ILGPU is published as a transitive NuGet dependency of ILGPUC, so a consumer only references one package.

4. End-to-end NuGet consumer harness (EndToEndTest/, Samples/LocalNuGetConsumer/)

The previous Src/ILGPUC.Tests/IntegrationTests/NuGetIntegrationTests.cs (and its NuGetHello template scaffolding) was a transient, in-process test. It's replaced by a persistent, on-disk consumer project that exercises the real toolchain end-to-end:

  • EndToEndTest/HelloKernel/ — a standalone consumer project (.csproj + Program.cs) that pulls the packed NuGets, compiles a kernel, and runs it.
  • EndToEndTest/run.sh — driver script: pack ILGPUC locally, restore against the local feed, build, run.
  • EndToEndTest/NuGet.config.template, EndToEndTest/README.md — template feed config + docs.
  • Samples/LocalNuGetConsumer/ — a user-facing sample mirroring the same flow (with its own pack-local.sh and README), so consumers can copy the pattern.

5. KernelLibrary sample (Samples/KernelLibraryAttribute/)

A new sample showing how to ship a kernel-helper library decorated with [KernelLibrary] and consume it from a downstream project (MyKernelLib/ + Consumer/).

6. CI (.github/workflows/ci.yml)

  • New e2e-test job that runs EndToEndTest/run.sh against the freshly packed NuGets.
  • The four GPU compile jobs (CUDA / ROCm / OpenCL / Metal) are now gated on e2e-test succeeding, so a packaging regression fails fast before fanning out across GPU runners.

Scope boundary

  • No changes to backend code generation behavior — backends are touched only insofar as the ILGPUC.csproj packaging picks up their existing R2R outputs.
  • No new public surface beyond KernelLibraryAttribute and the EndToEndTest/ + Samples/LocalNuGetConsumer/ directories.
  • The replaced NuGetIntegrationTests.cs and its NuGetHello templates are deleted outright; their coverage moves to EndToEndTest/.

Depends on

PR #1584 (and the rest of the V2.0 stack: #1355, #1576, #1577, #1578, #1579, #1580).

Known limitations

  • The R2R native binaries shipped in the NuGet are per-RID; consumers on a RID that isn't packed will fall back to a managed-only path (or fail to restore, depending on the project's RID settings).

m4rs-mt and others added 13 commits April 30, 2026 23:44
The frontend dominates compile time at ~95 ms / iteration on a
trivial `a[i] = b[i] * c[i]` kernel (~90 % of total). Breakdown:

  DisassembleMethods walk : ~66 ms wall (1720 methods, only 9 ms is
                                         real disassembly — rest is
                                         lock contention)
  LoadDebugSymbols (PDB)  : ~28 ms wall (5 streams opened + parsed
                                         every iter)
  Sequence-point attach   :  0.05 ms

Both disassembly and PDB load are backend-independent and identical
across every kernel compiled by one KernelCompiler instance. None of
that work was being shared.

Co-Authored-By: Claude <noreply@anthropic.com>
m4rs-mt added this to the v2.0 milestone Apr 30, 2026
m4rs-mt force-pushed the faster_compile_time branch from c7098ae to e1c1c51 April 30, 2026 22:54
m4rs-mt force-pushed the faster_compile_time branch from e1c1c51 to c65779e April 30, 2026 23:02
m4rs-mt marked this pull request as ready for review May 1, 2026 10:07