Merge cleanups and fixes #3

Closed
jduprat wants to merge 16 commits into main from dev

Conversation

@jduprat
Contributor

@jduprat jduprat commented Mar 25, 2026

No description provided.

jduprat and others added 16 commits March 24, 2026 12:37
The strided access example used the default element_bytes=2 (fp16),
under which Layout(32, 2) fits in a single cache line (1 transaction).
The example implies fp32 — make element_bytes=4 explicit so the
documented 2 transactions and 0.5 efficiency are correct.
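
The arithmetic behind this fix can be checked with a small standalone sketch. It does not use the library's own functions: `count_transactions` and `strided_efficiency` are hypothetical helpers, and the sketch assumes 128-byte cache lines, a 32-thread warp, and efficiency defined as requested bytes divided by transferred bytes.

```python
def count_transactions(num_threads, stride, element_bytes, cache_line_bytes=128):
    """Count the distinct cache lines touched by one strided access per thread."""
    lines = set()
    for t in range(num_threads):
        first = t * stride * element_bytes   # first byte read by thread t
        last = first + element_bytes - 1     # last byte read by thread t
        lines.update(range(first // cache_line_bytes, last // cache_line_bytes + 1))
    return len(lines)


def strided_efficiency(num_threads, stride, element_bytes, cache_line_bytes=128):
    """Assumed definition: bytes requested / bytes moved in whole cache lines."""
    requested = num_threads * element_bytes
    moved = count_transactions(num_threads, stride, element_bytes, cache_line_bytes) * cache_line_bytes
    return requested / moved


# 32 threads, stride 2: the access pattern of Layout(32, 2)
fp16_lines = count_transactions(32, 2, element_bytes=2)
fp32_lines = count_transactions(32, 2, element_bytes=4)
fp32_eff = strided_efficiency(32, 2, element_bytes=4)
```

Under these assumptions the fp16 case stays inside one cache line, while the fp32 case spans two and moves twice as many bytes as it requests.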

explain(logical_product, ...) assumed B was always a Layout and called
compose(complement(A, bound), B) directly.  When B is a tuple tiler
like (2, 2), compose fails because the complement has fewer modes than
the tuple length.

Mirror the actual logical_product implementation: for tuple tilers,
show mode-by-mode decomposition instead of the single-layout formula.

_get_slice_highlight_mask_2d only handled tuple slice_spec for rank-2
layouts, silently returning an all-False mask for rank-0 and rank-1
layouts. Add an elif branch for r < 2 that unpacks the single-element
tuple and matches against the layout shape.

Layout.__init__ now validates that shape and stride arguments are
int or nested tuples of ints before calling normalize(). Invalid
types like strings, floats, or None produce a clear TypeError
naming the offending parameter (e.g. "Layout stride must be int
or tuple of ints, got str").
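
A minimal sketch of this kind of check, written as a standalone function rather than the actual Layout.__init__ code. The name `check_int_tuple` is hypothetical, and rejecting `bool` is an assumption of this sketch, not something the commit message states.

```python
def check_int_tuple(value, name):
    """Raise TypeError unless value is an int or a (nested) tuple of ints.

    `name` is the parameter being checked, e.g. "shape" or "stride".
    """
    # Assumption: bool is rejected even though it subclasses int.
    if isinstance(value, bool) or not isinstance(value, (int, tuple)):
        raise TypeError(
            f"Layout {name} must be int or tuple of ints, got {type(value).__name__}"
        )
    if isinstance(value, tuple):
        for item in value:
            check_int_tuple(item, name)  # recurse into nested tuples


check_int_tuple((4, (2, 2)), "shape")  # valid: returns without raising
try:
    check_int_tuple("4", "stride")
except TypeError as exc:
    message = str(exc)
```

For a bad element inside a nested tuple, this version reports the type of the innermost offending item.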

per_group_bank_conflicts and per_group_coalescing iterated over flat
indices, splitting a (32,4) TV layout into 4 groups of 32 — one per
value — instead of 1 group of 32 threads with all 4 values each.

Add _tv_dimensions() helper to extract (thread_count, value_count).
Group by the thread dimension (mode 0) and iterate all value modes per
thread using colexicographic indexing.  Rank-1 layouts are unchanged.
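
The grouping described above can be sketched as follows. This is an illustration, not the library's code: `tv_groups` is a hypothetical helper that takes flat shape and stride tuples, treats mode 0 as the thread mode, and assumes a group size of 32.

```python
from itertools import product


def tv_groups(shape, stride, group_size=32):
    """Return groups of threads; each thread is a list of the offsets it touches."""
    thread_count, value_shape = shape[0], shape[1:]
    thread_stride, value_stride = stride[0], stride[1:]

    # Colexicographic order: the leftmost value mode varies fastest.
    # product() varies its last argument fastest, so reverse in and out.
    value_coords = [
        tuple(reversed(c))
        for c in product(*(range(n) for n in reversed(value_shape)))
    ]

    groups = []
    for start in range(0, thread_count, group_size):
        group = []
        for t in range(start, min(start + group_size, thread_count)):
            group.append([
                t * thread_stride + sum(c * d for c, d in zip(coord, value_stride))
                for coord in value_coords
            ])
        groups.append(group)
    return groups


groups = tv_groups((32, 4), (4, 1))  # one group of 32 threads, 4 values each
```

A rank-1 layout has no value modes, so each thread ends up with exactly one offset.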

draw_composite and show_composite now accept flatten_hierarchical and
label_hierarchy_levels parameters (both as top-level defaults and as
per-panel overrides via the options dict). When flatten_hierarchical is
False, hierarchical panels render with nested coordinate labels and
tile boundary lines, matching draw_layout's existing behavior.

_build_composite_figure previously silently dropped panels that did not
fit into the grid. Now emits a UserWarning so users know data is being
omitted.
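
A minimal sketch of the warn-instead-of-drop pattern, assuming a fixed rows x cols grid. `place_panels` and the warning text are hypothetical and are not taken from _build_composite_figure.

```python
import warnings


def place_panels(panels, rows, cols):
    """Keep the panels that fit in a rows x cols grid, warning about the rest."""
    capacity = rows * cols
    if len(panels) > capacity:
        warnings.warn(
            f"{len(panels) - capacity} panel(s) do not fit in a "
            f"{rows}x{cols} grid and will be omitted",
            UserWarning,
            stacklevel=2,  # point the warning at the caller
        )
    return panels[:capacity]


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    kept = place_panels(["a", "b", "c", "d", "e"], rows=2, cols=2)
```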

bank_conflicts, coalescing_efficiency, and segment_analysis previously
treated each thread as issuing a single scalar access.  For TV layouts
where mode 0 is the thread dimension and mode 1+ are value dimensions,
the functions now iterate all values per thread, correctly modeling
vectorized loads (e.g., LDG.128, LDS.128).

Rank-1 layouts are unchanged (value_count=1).
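
To see why the scalar model understates vectorized loads, the sketch below compares the two for a (32, 4) : (4, 1) TV layout with fp32 elements. `efficiency_from_offsets` is a hypothetical helper, not the library's coalescing_efficiency. It assumes 128-byte cache lines and defines efficiency as distinct bytes requested divided by bytes moved in whole cache lines.

```python
def efficiency_from_offsets(thread_offsets, element_bytes, cache_line_bytes=128):
    """thread_offsets holds one list of element offsets per thread."""
    elements = set()
    lines = set()
    for offsets in thread_offsets:
        for off in offsets:
            elements.add(off)
            first = off * element_bytes
            last = first + element_bytes - 1
            lines.update(range(first // cache_line_bytes, last // cache_line_bytes + 1))
    requested = len(elements) * element_bytes
    return requested / (len(lines) * cache_line_bytes)


# Each thread t owns elements 4t .. 4t+3, as in a 128-bit fp32 load.
vectorized = [[4 * t + v for v in range(4)] for t in range(32)]
# Old model: only the first value of each thread is counted.
scalar = [[4 * t] for t in range(32)]

vector_eff = efficiency_from_offsets(vectorized, element_bytes=4)
scalar_eff = efficiency_from_offsets(scalar, element_bytes=4)
```

Both models touch the same four cache lines here, but only the per-value model counts all 512 requested bytes.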

element_bytes varies per use case (fp16=2, fp32=4, fp8=1) and should
be set explicitly on every call.  Hardware constants like warp_size,
num_banks, and cache_line_bytes rarely change and keep their defaults.

Reorder parameters to put element_bytes first (no default) across
bank_conflicts, coalescing_efficiency, segment_analysis, and their
per_group variants.

Add `if self is other: return True` as the first check in both
Layout.__eq__ and Swizzle.__eq__. This is a standard Python best
practice that avoids redundant field-by-field comparison when
testing an object against itself.

Addresses REVIEW_ANALYSIS.md Section 5 (Equality Short-Circuiting).
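
A sketch of the identity short-circuit on a stand-in class. `TinyLayout` is hypothetical and carries only shape and stride, so the field comparison is simpler than in the real Layout and Swizzle.

```python
class TinyLayout:
    """Stand-in for illustration only; not the library's Layout."""

    def __init__(self, shape, stride):
        self.shape = shape
        self.stride = stride

    def __eq__(self, other):
        if self is other:          # same object: skip the field comparison
            return True
        if not isinstance(other, TinyLayout):
            return NotImplemented  # let Python try the reflected comparison
        return self.shape == other.shape and self.stride == other.stride

    def __hash__(self):
        return hash((self.shape, self.stride))


a = TinyLayout((4, 2), (1, 4))
same_object = a == a                         # resolved by the identity check
same_fields = a == TinyLayout((4, 2), (1, 4))  # resolved field by field
```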

Split the string representation into two methods following Python
conventions:

- __repr__ now returns an eval-safe constructor string such as
  Layout((4, 2), (1, 4)) or Layout((8, 8), (8, 1), swizzle=Swizzle(3, 0, 3)).
  This satisfies the Python data model guideline that repr should,
  where feasible, return a string that can recreate the object via eval().

- __str__ retains the human-readable CuTe notation (4, 2) : (1, 4)
  used in print() and casual display.

Addresses REVIEW_ANALYSIS.md Section 5 (String Representations).
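
The split can be sketched with two stand-in classes. They reuse the names Layout and Swizzle only so that the repr strings match the examples above. The Swizzle argument names (bits, base, shift) are an assumption, and none of the real algebra is implemented.

```python
class Swizzle:
    """Stand-in; argument names are assumed for illustration."""

    def __init__(self, bits, base, shift):
        self.bits, self.base, self.shift = bits, base, shift

    def __repr__(self):
        return f"Swizzle({self.bits}, {self.base}, {self.shift})"


class Layout:
    """Stand-in that only stores its constructor arguments."""

    def __init__(self, shape, stride, swizzle=None):
        self.shape, self.stride, self.swizzle = shape, stride, swizzle

    def __repr__(self):
        # Constructor-style string that eval() can turn back into an object.
        args = f"{self.shape!r}, {self.stride!r}"
        if self.swizzle is not None:
            args += f", swizzle={self.swizzle!r}"
        return f"Layout({args})"

    def __str__(self):
        # Human-readable CuTe notation used by print().
        return f"{self.shape} : {self.stride}"


layout = Layout((8, 8), (8, 1), swizzle=Swizzle(3, 0, 3))
rebuilt = eval(repr(layout), {"Layout": Layout, "Swizzle": Swizzle})
```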

Introduces atoms_amx.py with MMAAtom definitions for Intel AMX
instructions (tdpbf16ps, tdpfp16ps, tdpbssd, tdpbsud, tdpbusd,
tdpbuud). AMX is a true tile matrix multiply (16x16 output) executed
by a single CPU core (T=1), making it the cleanest CPU-to-MMAAtom
mapping in the layout algebra framework.

The dataclass-generated __repr__ includes all fields (layouts, PTX
strings) producing 300+ character lines that are hard to scan in
REPL sessions and logs. Add a short __str__ that shows just the
atom name and shape:

  str(atom) -> MMAAtom('SM80_16x8x16_F32F16F16F32_TN', 16x8x16)
  str(copy) -> CopyAtom('SM75_U32x4_LDSM_N')

The verbose eval-safe __repr__ from @dataclass is unchanged.

Addresses REVIEW_ANALYSIS.md Section 5 (String Representations).
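
A sketch of adding a short __str__ to a dataclass while leaving the generated __repr__ alone. The field names (`name`, `shape_mnk`, `ptx`) are hypothetical, and the real MMAAtom has more fields than shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MMAAtom:
    """Stand-in with assumed fields; the real atom also carries layouts."""

    name: str
    shape_mnk: tuple
    ptx: str = ""  # placeholder for the PTX string

    def __str__(self):
        # Short form: atom name and MxNxK shape only.
        m, n, k = self.shape_mnk
        return f"MMAAtom({self.name!r}, {m}x{n}x{k})"


atom = MMAAtom("SM80_16x8x16_F32F16F16F32_TN", (16, 8, 16))
short = str(atom)    # compact, for REPL sessions and logs
verbose = repr(atom)  # dataclass-generated, lists every field
```

The @dataclass decorator generates __repr__ but not __str__, so the hand-written __str__ is kept as is.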

Ensure idx2crd matches NVIDIA CuTe behavior where strictly scalar shapes
always modulo-wrap the coordinate. Added an oracle differential test
validating our idx2crd implementation against NVIDIA's authoritative
pycute implementation across a range of shapes and indices.
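
The wrapping behavior can be sketched without either library. `idx2crd_sketch` is a hypothetical standalone function that assumes default compact (colexicographic) strides; it is not the implementation under test and takes no explicit stride argument.

```python
def size(shape):
    """Total number of elements in an int or nested-tuple shape."""
    if isinstance(shape, tuple):
        total = 1
        for mode in shape:
            total *= size(mode)
        return total
    return shape


def idx2crd_sketch(idx, shape):
    """Map a linear index to a coordinate, leftmost mode varying fastest."""
    if isinstance(shape, tuple):
        coord = []
        for mode in shape:
            coord.append(idx2crd_sketch(idx, mode))
            idx //= size(mode)  # strip the part consumed by this mode
        return tuple(coord)
    return idx % shape          # scalar shape: always wrap modulo the extent


wrapped = idx2crd_sketch(5, 4)          # index past a scalar shape wraps
nested = idx2crd_sketch(5, (2, (2, 2)))  # nested shapes recurse per mode
```

A differential test in the spirit of the commit would loop over shapes and indices and compare such a function against pycute's idx2crd. That comparison is omitted here because it needs pycute installed.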
@meta-cla meta-cla bot added the CLA Signed label Mar 25, 2026
@jduprat jduprat self-assigned this Mar 25, 2026
@jduprat jduprat closed this Mar 25, 2026
@jduprat jduprat reopened this Mar 25, 2026
@jduprat jduprat closed this Mar 31, 2026
@jduprat jduprat deleted the dev branch March 31, 2026 20:43