The strided access example used the default element_bytes=2 (fp16), under which Layout(32, 2) fits in a single cache line (1 transaction). The example implies fp32 — make element_bytes=4 explicit so the documented 2 transactions and 0.5 efficiency are correct.
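The arithmetic behind this fix can be checked with a small standalone model. The sketch below is not the library's coalescing_efficiency; coalescing_stats is a hypothetical helper. It assumes 128-byte cache lines, aligned elements that never straddle a line, and efficiency defined as useful bytes over transferred bytes.

```python
def coalescing_stats(offsets, element_bytes, cache_line_bytes=128):
    """Return (transactions, efficiency) for one warp's element offsets.

    Hypothetical helper for illustration only.
    """
    # Each access is charged to the cache line holding its first byte.
    lines = {(off * element_bytes) // cache_line_bytes for off in offsets}
    transactions = len(lines)
    # Useful bytes: distinct elements actually requested by the warp.
    useful_bytes = len(set(offsets)) * element_bytes
    efficiency = useful_bytes / (transactions * cache_line_bytes)
    return transactions, efficiency


# Layout(32, 2): thread t reads element 2 * t.
strided = [2 * t for t in range(32)]

fp16 = coalescing_stats(strided, element_bytes=2)  # bytes 0..124, one line
fp32 = coalescing_stats(strided, element_bytes=4)  # bytes 0..248, two lines
print(fp16, fp32)  # (1, 0.5) (2, 0.5)
```

Under these assumptions the fp16 case needs one transaction and the fp32 case needs two, which is why the element size has to be spelled out for the documented numbers to hold.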
explain(logical_product, ...) assumed B was always a Layout and called compose(complement(A, bound), B) directly. When B is a tuple tiler like (2, 2), compose fails because the complement has fewer modes than the tuple length. Mirror the actual logical_product implementation: for tuple tilers, show mode-by-mode decomposition instead of the single-layout formula.
_get_slice_highlight_mask_2d only handled tuple slice_spec for rank-2 layouts, silently returning an all-False mask for rank-0 and rank-1 layouts. Add an elif branch for r < 2 that unpacks the single-element tuple and matches against the layout shape.
Layout.__init__ now validates that shape and stride arguments are int or nested tuples of ints before calling normalize(). Invalid types like strings, floats, or None produce a clear TypeError naming the offending parameter (e.g. "Layout stride must be int or tuple of ints, got str").
per_group_bank_conflicts and per_group_coalescing iterated over flat indices, splitting a (32,4) TV layout into 4 groups of 32 — one per value — instead of 1 group of 32 threads with all 4 values each. Add _tv_dimensions() helper to extract (thread_count, value_count). Group by the thread dimension (mode 0) and iterate all value modes per thread using colexicographic indexing. Rank-1 layouts are unchanged.
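The difference between the two groupings can be sketched as follows. This is an illustration only: it assumes a rank-2 layout with flat integer modes, and both function names are hypothetical. It does not reproduce _tv_dimensions() or nested value modes.

```python
def flat_offsets(shape, stride):
    """Offsets in colexicographic order: flat index i = t + threads * v."""
    threads, values = shape
    t_stride, v_stride = stride
    return [
        (i % threads) * t_stride + (i // threads) * v_stride
        for i in range(threads * values)
    ]


def offsets_by_thread(shape, stride):
    """Group the same offsets by thread (mode 0), one list of values each."""
    threads, values = shape
    t_stride, v_stride = stride
    return [
        [t * t_stride + v * v_stride for v in range(values)]
        for t in range(threads)
    ]


shape, stride = (32, 4), (4, 1)

# Old behavior: chunking flat indices by 32 gives 4 groups, one per value.
flat = flat_offsets(shape, stride)
chunks = [flat[i:i + 32] for i in range(0, len(flat), 32)]

# New behavior: 32 threads, each holding all 4 of its values.
groups = offsets_by_thread(shape, stride)
print(len(chunks), len(groups), groups[1])  # 4 32 [4, 5, 6, 7]
```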
draw_composite and show_composite now accept flatten_hierarchical and label_hierarchy_levels parameters (both as top-level defaults and as per-panel overrides via the options dict). When flatten_hierarchical is False, hierarchical panels render with nested coordinate labels and tile boundary lines, matching draw_layout's existing behavior.
_build_composite_figure previously silently dropped panels that did not fit into the grid. Now emits a UserWarning so users know data is being omitted.
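The warning pattern can be sketched with the standard warnings module. The function fit_panels and its message text are hypothetical and only illustrate the idea; they are not taken from _build_composite_figure.

```python
import warnings


def fit_panels(panels, rows, cols):
    """Return the panels that fit a rows x cols grid, warning about the rest.

    Hypothetical helper for illustration.
    """
    capacity = rows * cols
    if len(panels) > capacity:
        warnings.warn(
            f"{len(panels) - capacity} panel(s) do not fit in a "
            f"{rows}x{cols} grid and will be omitted",
            UserWarning,
            stacklevel=2,  # attribute the warning to the caller
        )
    return panels[:capacity]


kept = fit_panels(["a", "b", "c", "d", "e"], rows=2, cols=2)
print(kept)  # ['a', 'b', 'c', 'd']
```

Using a warning rather than an exception keeps existing calls working while making the omission visible.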
bank_conflicts, coalescing_efficiency, and segment_analysis previously treated each thread as issuing a single scalar access. For TV layouts where mode 0 is the thread dimension and mode 1+ are value dimensions, the functions now iterate all values per thread, correctly modeling vectorized loads (e.g., LDG.128, LDS.128). Rank-1 layouts are unchanged (value_count=1).
element_bytes varies per use case (fp16=2, fp32=4, fp8=1) and should be set explicitly on every call. Hardware constants like warp_size, num_banks, and cache_line_bytes rarely change and keep their defaults. Reorder parameters to put element_bytes first (no default) across bank_conflicts, coalescing_efficiency, segment_analysis, and their per_group variants.
Add `if self is other: return True` as the first check in both Layout.__eq__ and Swizzle.__eq__. This is a standard Python best practice that avoids redundant field-by-field comparison when testing an object against itself. Addresses REVIEW_ANALYSIS.md Section 5 (Equality Short-Circuiting).
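A minimal sketch of the identity short-circuit on a stand-in class (MiniLayout is hypothetical and carries only the two fields needed for the example):

```python
class MiniLayout:
    """Stand-in class used only to illustrate the __eq__ pattern."""

    def __init__(self, shape, stride):
        self.shape = shape
        self.stride = stride

    def __eq__(self, other):
        if self is other:  # identity short-circuit: skip field comparison
            return True
        if not isinstance(other, MiniLayout):
            return NotImplemented
        return self.shape == other.shape and self.stride == other.stride

    def __hash__(self):
        # Defining __eq__ disables the default hash, so restore one.
        return hash((self.shape, self.stride))


a = MiniLayout((4, 2), (1, 4))
print(a == a, a == MiniLayout((4, 2), (1, 4)), a == MiniLayout((4, 2), (2, 1)))
# True True False
```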
Split the string representation into two methods following Python conventions:
- __repr__ now returns an eval-safe constructor string such as Layout((4, 2), (1, 4)) or Layout((8, 8), (8, 1), swizzle=Swizzle(3, 0, 3)). This satisfies the Python data model guideline that repr should, where feasible, return a string that can recreate the object via eval().
- __str__ retains the human-readable CuTe notation (4, 2) : (1, 4) used in print() and casual display.
Addresses REVIEW_ANALYSIS.md Section 5 (String Representations).
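The split can be illustrated on a stand-in class. ReprLayout is hypothetical and omits the swizzle argument; it only shows the round trip through eval() and the separate display form.

```python
class ReprLayout:
    """Stand-in class illustrating separate __repr__ and __str__."""

    def __init__(self, shape, stride):
        self.shape = shape
        self.stride = stride

    def __repr__(self):
        # Constructor-style string that eval() can turn back into an object.
        return f"ReprLayout({self.shape!r}, {self.stride!r})"

    def __str__(self):
        # Human-readable shape : stride notation for print().
        return f"{self.shape} : {self.stride}"


layout = ReprLayout((4, 2), (1, 4))
print(repr(layout))  # ReprLayout((4, 2), (1, 4))
print(layout)        # (4, 2) : (1, 4)

clone = eval(repr(layout))
print(clone.shape, clone.stride)  # (4, 2) (1, 4)
```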
Introduces atoms_amx.py with MMAAtom definitions for Intel AMX instructions (tdpbf16ps, tdpfp16ps, tdpbssd, tdpbsud, tdpbusd, tdpbuud). AMX is a true tile matrix multiply (16x16 output) executed by a single CPU core (T=1), making it the cleanest CPU-to-MMAAtom mapping in the layout algebra framework.
The dataclass-generated __repr__ includes all fields (layouts, PTX
strings) producing 300+ character lines that are hard to scan in
REPL sessions and logs. Add a short __str__ that shows just the
atom name and shape:
str(atom) -> MMAAtom('SM80_16x8x16_F32F16F16F32_TN', 16x8x16)
str(copy) -> CopyAtom('SM75_U32x4_LDSM_N')
The verbose eval-safe __repr__ from @dataclass is unchanged.
Addresses REVIEW_ANALYSIS.md Section 5 (String Representations).
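A sketch of the same idea on a stand-in dataclass. DemoAtom, its field names, and the placeholder PTX string are all hypothetical; the real MMAAtom and CopyAtom carry more fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DemoAtom:
    """Stand-in dataclass; field names are assumptions for illustration."""

    name: str
    shape_mnk: tuple
    ptx: str

    def __str__(self):
        # Short form: only the atom name and its MxNxK shape.
        m, n, k = self.shape_mnk
        return f"DemoAtom({self.name!r}, {m}x{n}x{k})"


atom = DemoAtom("SM80_16x8x16_F32F16F16F32_TN", (16, 8, 16), "<ptx placeholder>")
print(str(atom))   # DemoAtom('SM80_16x8x16_F32F16F16F32_TN', 16x8x16)
print(repr(atom))  # dataclass-generated repr listing every field
```

Because @dataclass generates __repr__ but not __str__, defining __str__ by hand leaves the generated __repr__ in place.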
Ensure idx2crd matches NVIDIA CuTe behavior where strictly scalar shapes always modulo-wrap the coordinate. Added an oracle differential test validating our idx2crd implementation against NVIDIA's authoritative pycute implementation across a range of shapes and indices.
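The wrapping rule for scalar shapes can be sketched as below. The function idx2crd_sketch is a hypothetical, simplified stand-in: it handles only an int shape or a flat tuple of ints, and it assumes every mode is wrapped with the same modulo rule, which the text above only states for scalar shapes.

```python
def idx2crd_sketch(idx, shape):
    """Map a flat index to a colexicographic coordinate.

    Simplified illustration: `shape` is an int or a flat tuple of ints.
    """
    if isinstance(shape, int):
        # Scalar shape: always wrap, even when idx >= shape.
        return idx % shape
    coord = []
    stride = 1  # running product of the extents seen so far
    for extent in shape:
        coord.append((idx // stride) % extent)
        stride *= extent
    return tuple(coord)


print(idx2crd_sketch(5, 4))       # 1
print(idx2crd_sketch(5, (4, 2)))  # (1, 1)
```

A differential test in the spirit of the one described would loop over a range of shapes and indices and compare each result against the reference implementation.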
No description provided.