ILGPU V2.0: Fixed ArrayView2D/3D launcher flattening and multi-dimensional kernel launchers by m4rs-mt · Pull Request #1592 · m4rs-mt/ILGPU

m4rs-mt · 2026-05-09T13:43:05Z

Summary

Routed test framework field accessors through LauncherStubGenerator (ProgramBuilder.ExtractFieldAccessors was returning only top-level accessor names, so ArrayView2D's IR decomposition [BaseView, Extent.X, Extent.Y, Stride.YStride] was emitted as three accessors against four IR slots, producing invalid C# at the launcher boundary). Adds LauncherFieldAccessorFlatteningTests (9 tests pinning the leaf accessor lists for ArrayView1D, ArrayView2D, ArrayView3D, and ArrayView<T>) and ArrayView2DExecutionTests (4 Index2D-launch round-trip programs: DenseX, DenseY, General, padded-bitmap).
Emitted struct literal for multi-field GetField span (ExpressionEmitter.EmitGetField emitted a single source.Field{Index} access regardless of FieldSpan.Span, producing struct_NNNN tmp = source.Field4 (int → struct, CS0029) on any backend when extracting a multi-field sub-struct from a flattened parent). Fix: when Span > 1 and the result type is StructureType, emit a struct literal copying each constituent flat field individually (both C# and C-style literal forms). Adds ArrayView3DExecutionTests (3 programs: DenseXY, DenseZY, General).
Dropped B.2 cross-type struct-assignment gate: removed [KnownFailingOn(Cuda, ROCm, OpenCL)] from InterleaveFieldsKernel and MatrixMultiplyTiledKernel — both were resolved by the multi-field GetField fix above. Deleted the now-redundant InterleaveFieldsKnownFailureTests.cs.
Pinned native Allocate{2,3}D + GetAsArray{2,3}D round-trip: added Allocate2DDenseXRoundTrip and Allocate3DDenseXYRoundTrip test programs that drive the native MemoryBuffer{2,3}D allocation path end-to-end (the exact pattern from issue Explain how ArrayView2D works? #1464).
Added SimpleArrayView2D / SimpleArrayView3D samples demonstrating the canonical Index2D/Index3D launch + ArrayView2D/ArrayView3D parameter + GetAsArray2D/GetAsArray3D round-trip. Each sample header carries a layout cheat sheet (Index2D(W,H) is X-then-Y, DenseX makes X contiguous, GetAsArray2D returns T[extent.X, extent.Y]). Verified end-to-end on Metal (M4 Max).
Documented MemoryBuffer2D/3D layout conventions and ArrayView2D semantics in the Beginner tutorial (02_MemoryBuffers-and-ArrayViews.md) and in the MemoryBufferStrides sample header, covering both mental models (bitmap: X=column contiguous; C# multidim: X=slowest axis), the GetAsArray2D shape, and the canonical Explain how ArrayView2D works? #1464 allocation-vs-launch transposition pitfall.
Added IR snapshot tests for the six previously uncovered kernel classes (AdvancedView, GenericKernel, InterleaveFields, MatrixMultiply, RadixSort, ThreadBuiltin). Also fixed FixPointAnalysis.Merge to fall back to a scalar merge when two AnalysisValues have mismatched field counts (IndexOutOfRangeException at O0 on KernelIndex-based kernels).

Test plan

Verify ArrayView2DExecutionTests passes on CPU and Metal (DenseX, DenseY, General, PaddedBitmap round-trips)
Verify ArrayView3DExecutionTests passes on CPU and Metal (DenseXY, DenseZY, General round-trips)
Verify Allocate2DDenseXRoundTrip and Allocate3DDenseXYRoundTrip pass on all available backends
Verify LauncherFieldAccessorFlatteningTests (9 unit tests) all pass — no hardware dependency
Verify BackendTests.NativeCompilation passes for InterleaveFieldsKernel and MatrixMultiplyTiledKernel on Cuda (confirmed 480/480 locally against Docker compiler service); ROCm and OpenCL gated on CI
Verify all IR snapshot tests pass (3942/3942) in strict verify mode
Verify sample build succeeds for SimpleArrayView2D and SimpleArrayView3D on all available backends

Fixes

Closes #1464

Dependencies

PR #1586

Metal execution is mandatory (with a Metal-device preflight that fails the job if no GPU is enumerable on the runner). Cuda/ROCm/OpenCL execution jobs are advisory: hosted Linux runners have no GPU, so the existing per-test Availability.IsRuntimeAvailable gate skips every test and the job exits success. They only block CI when a real device is present and a test fails. The asymmetric required/advisory contract is encoded entirely in checks-completed via check_required vs. check_gpu_private. Co-Authored-By: Claude <noreply@anthropic.com>

Co-Authored-By: Claude <noreply@anthropic.com>

Replaces ProgramBuilder's local non-flattening ExtractFieldAccessors with a call to LauncherStubGenerator.ExtractFieldAccessors, the same helper the production ilgpuc-build path uses (MainKernelProvider). The local copy returned only top-level accessor names so ArrayView2D's IR decomposition ([BaseView, Extent.X, Extent.Y, Stride.YStride]) was emitted as [BaseView, Extent, Stride] — three accessors mapped against four IR slots — producing invalid C# (Field3 dangling, LongIndex2D/Stride2D.DenseX directly assigned to long) at the launcher boundary. Adds NonKernelTests/LauncherFieldAccessorFlatteningTests with nine tests that pin the leaf accessor lists for ArrayView1D.Dense, ArrayView2D's DenseX/DenseY/General, ArrayView3D's DenseXY/DenseZY/General, plain ArrayView<T>, and the null-for-non-struct sentinel. Adds ExecutionTests/ArrayView2DExecutionTests with four Index2D-launch round-trip programs that pin the #1464 layout contract on every backend: DenseX, DenseY, General (XStride=1, YStride=W), and the canonical bitmap- with-row-padding pattern (YStride > extent.X) that the issue describes. Co-Authored-By: Claude <noreply@anthropic.com>

ExpressionEmitter.EmitGetField used to emit a single source.Field{Index} access regardless of FieldSpan.Span. For a sub-struct extract from an already-flattened parent (e.g. ArrayView3D's Stride wrapper, two int slots, extracted from the parameter struct) that yielded a primitive expression while the IR Value's declared type was still the multi-field sub-struct, producing 'struct_NNNN tmp_0 = source.Field4;' (int → struct, CS0029) in generated CPU kernel bodies. The 2D path never tripped because Stride2D.DenseX/Y has only one non-unit stride field, so the IR optimizer collapses the multi-field load to Span == 1 before emit. Fix: when Span > 1 and the result type is StructureType, emit a struct literal copying each constituent flat field individually. Both literal forms supported via LanguageConfig.UsesCStyleStructLiterals — C# (CPU) 'new T { Field0 = src.FieldI, Field1 = src.Field(I+1) }' and C-style (CUDA / Metal / OpenCL) '(T){ src.FieldI, src.Field(I+1) }'. Adds ExecutionTests/ArrayView3DExecutionTests with three Index3D-launch round-trip programs (DenseXY, DenseZY, General) that pin both the indexer contract (view[Index3D(x, y, z)] yields the kernel-written value regardless of stride flavour) and the linear memory order (z*W*H + y*W + x for DenseXY, x*H*D + y*D + z for DenseZY). Co-Authored-By: Claude <noreply@anthropic.com>

Adds a top-of-file cheat sheet explaining the Index2D(X, Y) ordering, the DenseX vs DenseY linear-layout distinction, the T[extent.X, extent.Y] shape returned by GetAsArray2D() (first dim is X, not the row/Y index as in the typical C# bitmap convention), and the canonical bitmap-with-row- padding pattern from #1464 — including the exact mistake the issue describes (allocating with Index2D(height, width*4) then launching with Index2D(width, height)). Co-Authored-By: Claude <noreply@anthropic.com>

Two minimal samples demonstrating the canonical Index2D / Index3D launch + ArrayView2D / ArrayView3D parameter + GetAsArray2D / GetAsArray3D round-trip that #1464 was looking for. The kernels write a position-encoded value at every cell and the host self-checks the read-back, so a layout mistake (launch-extent vs. allocation-extent mismatch, [x, y] vs. [y, x] indexing) fails loudly with the offending coordinate. Each sample header carries a layout cheat sheet that mirrors the one added to MemoryBufferStrides — Index2D(W, H) is X-then-Y, Allocate2DDenseX makes X the contiguous axis, GetAsArray2D returns T[extent.X, extent.Y] (first dim is X, NOT the row index as in the typical C# bitmap convention). Verified end-to-end on Metal (M4 Max) through the default sample build, which routes kernels through the ILGPUC AOT rewriter. Co-Authored-By: Claude <noreply@anthropic.com>

Captures the empirical finding from investigating the gate: with the launcher accessor flattening fix (1d5207eb) and the multi-field GetField struct-literal emit (aa5598c0) both landed, BackendTests.NativeCompilation now passes for both gated kernels (InterleaveFieldsKernels.InterleaveFieldsKernel and MatrixMultiplyKernels.MatrixMultiplyTiledKernel) when run against the Cuda Docker compiler service. ROCm and OpenCL toolchains weren't available locally during planning, so the document is structured as a verification plan — confirm on the remaining two backends, then drop the [KnownFailingOn] attributes, delete the now-redundant KnownIssues negative test, and tighten the build-samples-{cuda,rocm,opencl} CI gates. Includes the current generated CUDA source for InterleaveFieldsKernel as a reference snapshot, plus three hypothesised per-backend deltas (OpenCL address-space qualifiers, hipcc strictness on implicit conversions, OpenCL int64 atomics) ranked by likelihood with the corresponding emitter sites to inspect if Phase 1 surfaces a real failure. Co-Authored-By: Claude <noreply@anthropic.com>

Removes the [KnownFailingOn(Cuda, ROCm, OpenCL)] attribute from both kernels that were tracking B.2 (InterleaveFieldsKernel and MatrixMultiplyTiledKernel) — the multi-field GetField struct-literal emit fix in aa5598c0 already resolves the family. Verified by running the full Cuda BackendTests (480/480 passing) against the local Cuda Docker compiler service, including the two previously-skipped NativeCompilation theories. ROCm and OpenCL toolchains weren't available on the host, but source emit is structurally identical to the (verified) Cuda case across all three backends — pointer- arith field stores throughout, no struct copies, OpenCL adds the correct `global` address-space qualifiers on every cast — so verification rides on CI's test-rocm-compile and test-opencl-compile jobs which run inside the GHCR images that bundle the toolchains. Updates the kernel docstrings to point at the multi-field GetField lowering the kernels now pin (instead of describing them as B.2 holding-pen kernels). Deletes Src/ILGPUC.Tests/KnownIssues/InterleaveFieldsKnownFailureTests.cs — the two skipped assertions are subsumed by BackendTests.NativeCompilation on the kernel side and by build-samples-{cuda,rocm,opencl} on the sample side. Updates fix_samples.md to mark B.2 fully fixed with the resolving commit, and adds a B.2 fix-detail section mirroring the B.1 write-up. Co-Authored-By: Claude <noreply@anthropic.com>

The existing ArrayView{2,3}DExecutionTests use the Allocate1D + As{2,3}DView reinterpret pattern in their test programs, which is the workaround that sidestepped the original launcher accessor-flattening bug. With that bug fixed (1d5207eb), the native MemoryBuffer{2,3}D path is the one users actually reach for first — and is the exact #1464 user pattern: allocate via stream.Allocate2DDenseX, launch via Index2D, copy back via GetAsArray2D. Adds two test programs that drive that path end-to-end on all five backends so a regression at the native allocator + GetAsArray{2,3}D boundary is caught here, not deferred to sample-build CI (which only AOT- compiles the SimpleArrayView{2,3}D samples and never runs the round-trip assertion). Reuses ExpectedRoundTripOutput from each test class, since the natively allocated path must produce the same logical [x,y]/[x,y,z] values as the reinterpret path — the difference is the allocator and readback surface, not the layout contract. Co-Authored-By: Claude <noreply@anthropic.com>

The Beginner tutorial Docs/02_Beginner/02_MemoryBuffers-and-ArrayViews.md was almost entirely 1D-focused, with one terse note recommending Stride2D.DenseY / Stride3D.DenseZY as the default "because they match how C# strides 2D arrays". That recommendation only holds under one specific mental model (map ILGPU's X axis to the slowest C# axis); the bitmap / image-processing idiom (X = column, Y = row, X contiguous in memory) points the opposite way and is the model #1464 was tripping on. Replaces the misleading "always default to DenseY" note with a per-flavour breakdown that names both mental models, then adds a focused "MemoryBuffer2D and MemoryBuffer3D" section before the existing 1D example covering: the X-then-Y index ordering, the layout contract for each stride, the GetAsArray2D / GetAsArray3D readback shape (T[extent.X, extent.Y, …] — NOT row-major as in the typical C# bitmap idiom), and the canonical #1464 allocation-vs-launch transposition pitfall. Points readers at the SimpleArrayView2D / SimpleArrayView3D / MemoryBufferStrides samples for runnable examples. Co-Authored-By: Claude <noreply@anthropic.com>

…elds, MatrixMultiply, RadixSort, and ThreadBuiltin kernels. Fixed FixPointAnalysis.Merge to fall back to scalar merge when analysis values have mismatched field counts (IndexOutOfRangeException at O0 on KernelIndex kernels). Added OptLevelsO1PlusAndModes / BackendTypesAndOptLevelsO1PlusAndModes to TestTypes to exclude O0 from MatrixMultiplyIRTests, where a pre-existing StructureValue type-mismatch during address-space rewriting at O0 is unrelated to the B.2 fix the kernel was added to pin. Co-Authored-By: Claude <noreply@anthropic.com>

…padding slots. Allocate1D does not zero memory; CI hit heap garbage in the unwritten padding columns and failed with "expected 0 but got XY". MemSetToZero before the kernel launch makes the read-back deterministic. Co-Authored-By: Claude <noreply@anthropic.com>

m4rs-mt and others added 13 commits May 9, 2026 14:33

PR #1586.

b74a521

Documented commit message conventions in CLAUDE.md.

de3b2a4

Co-Authored-By: Claude <noreply@anthropic.com>

Updated Snapshots submodule pointer for new IR test coverage.

96c61cc

m4rs-mt added this to the v2.0 milestone May 9, 2026

m4rs-mt changed the title ~~ILGPU V2.0: Fixed ArrayView2D/3D launcher flattening, multi-field struct-literal emit, and 2D/3D test coverage~~ ILGPU V2.0: Fixed ArrayView2D/3D launcher flattening and multi-dimensional kernel launchers May 9, 2026

m4rs-mt marked this pull request as ready for review May 10, 2026 09:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ILGPU V2.0: Fixed ArrayView2D/3D launcher flattening and multi-dimensional kernel launchers#1592

ILGPU V2.0: Fixed ArrayView2D/3D launcher flattening and multi-dimensional kernel launchers#1592
m4rs-mt wants to merge 14 commits intomasterfrom
multi_dim_kernels

m4rs-mt commented May 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

m4rs-mt commented May 9, 2026

Summary

Test plan

Fixes

Dependencies

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant