Merge cleanups and fixes by jduprat · Pull Request #3 · facebookresearch/tensor-layouts

jduprat · 2026-03-25T01:28:27Z

No description provided.

The strided access example used the default element_bytes=2 (fp16), under which Layout(32, 2) fits in a single cache line (1 transaction). The example implies fp32 — make element_bytes=4 explicit so the documented 2 transactions and 0.5 efficiency are correct.
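The arithmetic behind this fix can be checked with a minimal sketch. The helper below is hypothetical and is not the library's coalescing_efficiency; it assumes a 128-byte cache line, one element per thread, and efficiency defined as useful bytes divided by transferred bytes.

```python
def strided_access_stats(num_threads, stride, element_bytes, cache_line_bytes=128):
    """Hypothetical helper: cache lines touched by one strided access per thread."""
    # Byte address of the element read by each thread.
    addresses = [t * stride * element_bytes for t in range(num_threads)]
    # Each distinct cache line costs one transaction (assumed model).
    lines = {addr // cache_line_bytes for addr in addresses}
    transactions = len(lines)
    useful_bytes = num_threads * element_bytes
    efficiency = useful_bytes / (transactions * cache_line_bytes)
    return transactions, efficiency


# Layout(32, 2): 32 threads, stride of 2 elements.
fp16 = strided_access_stats(32, 2, element_bytes=2)
fp32 = strided_access_stats(32, 2, element_bytes=4)
```

Under these assumptions the fp16 case touches one cache line, while the fp32 case touches two and yields an efficiency of 0.5, which is the figure the documentation quotes.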

explain(logical_product, ...) assumed B was always a Layout and called compose(complement(A, bound), B) directly. When B is a tuple tiler like (2, 2), compose fails because the complement has fewer modes than the tuple length. Mirror the actual logical_product implementation: for tuple tilers, show mode-by-mode decomposition instead of the single-layout formula.

_get_slice_highlight_mask_2d only handled tuple slice_spec for rank-2 layouts, silently returning an all-False mask for rank-0 and rank-1 layouts. Add an elif branch for r < 2 that unpacks the single-element tuple and matches against the layout shape.

Layout.__init__ now validates that shape and stride arguments are int or nested tuples of ints before calling normalize(). Invalid types like strings, floats, or None produce a clear TypeError naming the offending parameter (e.g. "Layout stride must be int or tuple of ints, got str").
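A validation step of this kind might look like the sketch below. The function name check_int_tuple is hypothetical, and rejecting bool (which is a subclass of int in Python) is an assumption rather than something the commit states.

```python
def check_int_tuple(value, name):
    """Hypothetical validator: accept an int or arbitrarily nested tuples of ints."""
    # bool is a subclass of int; rejecting it here is an assumption.
    if isinstance(value, bool) or not isinstance(value, (int, tuple)):
        raise TypeError(
            f"Layout {name} must be int or tuple of ints, got {type(value).__name__}"
        )
    if isinstance(value, tuple):
        # Recurse so that nested tuples are checked element by element.
        for item in value:
            check_int_tuple(item, name)


check_int_tuple((4, (2, 2)), "shape")  # valid: returns None without raising
```

Running the check before normalization means the error names the offending parameter instead of surfacing later as an unrelated failure.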

per_group_bank_conflicts and per_group_coalescing iterated over flat indices, splitting a (32,4) TV layout into 4 groups of 32 — one per value — instead of 1 group of 32 threads with all 4 values each. Add _tv_dimensions() helper to extract (thread_count, value_count). Group by the thread dimension (mode 0) and iterate all value modes per thread using colexicographic indexing. Rank-1 layouts are unchanged.
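The difference between the two grouping strategies can be illustrated with a small sketch. Both functions are hypothetical and work on (thread, value) coordinates of a rank-2 TV layout only; they do not reproduce _tv_dimensions() or the library's analysis code.

```python
def flat_index_groups(thread_count, value_count, group_size=32):
    """Old behavior (sketch): chunk the flat colexicographic index range."""
    # Flat index i maps to (thread, value) = (i % T, i // T).
    flat = [(i % thread_count, i // thread_count)
            for i in range(thread_count * value_count)]
    return [flat[s:s + group_size] for s in range(0, len(flat), group_size)]


def thread_groups(thread_count, value_count, group_size=32):
    """New behavior (sketch): group by thread, keep every value with its thread."""
    groups = []
    for start in range(0, thread_count, group_size):
        threads = range(start, min(start + group_size, thread_count))
        groups.append([(t, v) for t in threads for v in range(value_count)])
    return groups


old = flat_index_groups(32, 4)
new = thread_groups(32, 4)
```

For a (32, 4) layout, `old` contains 4 groups that each hold a single value index, while `new` contains 1 group holding all 4 values of all 32 threads.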

draw_composite and show_composite now accept flatten_hierarchical and label_hierarchy_levels parameters (both as top-level defaults and as per-panel overrides via the options dict). When flatten_hierarchical is False, hierarchical panels render with nested coordinate labels and tile boundary lines, matching draw_layout's existing behavior.

_build_composite_figure previously silently dropped panels that did not fit into the grid. Now emits a UserWarning so users know data is being omitted.
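The pattern is the standard-library warnings.warn call. The sketch below uses a hypothetical place_panels function as a stand-in for the grid placement logic; the message text is illustrative, not the library's.

```python
import warnings


def place_panels(panels, rows, cols):
    """Hypothetical stand-in: keep what fits in the grid and warn about the rest."""
    capacity = rows * cols
    if len(panels) > capacity:
        warnings.warn(
            f"{len(panels)} panels do not fit in a {rows}x{cols} grid; "
            f"omitting the last {len(panels) - capacity}",
            UserWarning,
            stacklevel=2,  # point the warning at the caller
        )
    return panels[:capacity]
```

Callers who prefer a hard failure can promote the warning to an exception with warnings.simplefilter("error", UserWarning).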

bank_conflicts, coalescing_efficiency, and segment_analysis previously treated each thread as issuing a single scalar access. For TV layouts where mode 0 is the thread dimension and mode 1+ are value dimensions, the functions now iterate all values per thread, correctly modeling vectorized loads (e.g., LDG.128, LDS.128). Rank-1 layouts are unchanged (value_count=1).

element_bytes varies per use case (fp16=2, fp32=4, fp8=1) and should be set explicitly on every call. Hardware constants like warp_size, num_banks, and cache_line_bytes rarely change and keep their defaults. Reorder parameters to put element_bytes first (no default) across bank_conflicts, coalescing_efficiency, segment_analysis, and their per_group variants.

Add `if self is other: return True` as the first check in both Layout.__eq__ and Swizzle.__eq__. This is a standard Python best practice that avoids redundant field-by-field comparison when testing an object against itself. Addresses REVIEW_ANALYSIS.md Section 5 (Equality Short-Circuiting).
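A minimal sketch of the short-circuit follows, using a toy class rather than the library's Layout. The shape and stride fields and the NotImplemented fallback are assumptions made for illustration.

```python
class ToyLayout:
    """Toy stand-in used only to illustrate the identity short-circuit."""

    def __init__(self, shape, stride):
        self.shape = shape
        self.stride = stride

    def __eq__(self, other):
        if self is other:  # same object: skip the field-by-field comparison
            return True
        if not isinstance(other, ToyLayout):
            return NotImplemented
        return self.shape == other.shape and self.stride == other.stride

    def __hash__(self):
        # Defining __eq__ disables the default hash, so restore a consistent one.
        return hash((self.shape, self.stride))


a = ToyLayout((4, 2), (1, 4))
b = ToyLayout((4, 2), (1, 4))
```

Here `a == a` returns through the identity check, while `a == b` still falls through to the field comparison.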

Split the string representation into two methods following Python conventions:
- __repr__ now returns an eval-safe constructor string such as Layout((4, 2), (1, 4)) or Layout((8, 8), (8, 1), swizzle=Swizzle(3, 0, 3)). This satisfies the Python data model guideline that repr should, where feasible, return a string that can recreate the object via eval().
- __str__ retains the human-readable CuTe notation (4, 2) : (1, 4) used in print() and casual display.

Addresses REVIEW_ANALYSIS.md Section 5 (String Representations).
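The split can be sketched with a toy class. MiniLayout is hypothetical and omits swizzle support; it only shows how an eval-safe __repr__ and a notation-style __str__ coexist.

```python
class MiniLayout:
    """Toy stand-in showing the __repr__ / __str__ split."""

    def __init__(self, shape, stride):
        self.shape = shape
        self.stride = stride

    def __repr__(self):
        # Constructor-style string that eval() can turn back into an object.
        return f"MiniLayout({self.shape!r}, {self.stride!r})"

    def __str__(self):
        # Compact shape : stride notation for print() and casual display.
        return f"{self.shape} : {self.stride}"

    def __eq__(self, other):
        if not isinstance(other, MiniLayout):
            return NotImplemented
        return (self.shape, self.stride) == (other.shape, other.stride)


layout = MiniLayout((4, 2), (1, 4))
round_trip = eval(repr(layout))  # rebuilds an equal object from the repr
```

With this split, containers and debuggers (which call repr) show the constructor form, while print(layout) shows the compact notation.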

Introduces atoms_amx.py with MMAAtom definitions for Intel AMX instructions (tdpbf16ps, tdpfp16ps, tdpbssd, tdpbsud, tdpbusd, tdpbuud). AMX is a true tile matrix multiply (16x16 output) executed by a single CPU core (T=1), making it the cleanest CPU-to-MMAAtom mapping in the layout algebra framework.

The dataclass-generated __repr__ includes all fields (layouts, PTX strings), producing 300+ character lines that are hard to scan in REPL sessions and logs. Add a short __str__ that shows just the atom name and shape:

    str(atom) -> MMAAtom('SM80_16x8x16_F32F16F16F32_TN', 16x8x16)
    str(copy) -> CopyAtom('SM75_U32x4_LDSM_N')

The verbose eval-safe __repr__ from @dataclass is unchanged. Addresses REVIEW_ANALYSIS.md Section 5 (String Representations).
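A dataclass keeps its generated __repr__ when the class defines only __str__, so the two can differ. The sketch below uses a hypothetical ToyMMAAtom with made-up fields (name, shape_mnk, ptx); the real MMAAtom has more fields and may format its string differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyMMAAtom:
    """Toy stand-in; field names here are assumptions, not the library's."""

    name: str
    shape_mnk: tuple
    ptx: str = ""

    def __str__(self):
        # Short form: class name, atom name, and MxNxK shape only.
        m, n, k = self.shape_mnk
        return f"{type(self).__name__}({self.name!r}, {m}x{n}x{k})"


atom = ToyMMAAtom("SM80_16x8x16_F32F16F16F32_TN", (16, 8, 16), ptx="<ptx omitted>")
short = str(atom)   # compact summary
full = repr(atom)   # dataclass-generated, lists every field
```

Because print() and f-strings use __str__ while containers and the REPL echo use __repr__, logs get the short form without losing the detailed one.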

Ensure idx2crd matches NVIDIA CuTe behavior, where a strictly scalar shape always modulo-wraps the coordinate. Add an oracle differential test that validates this idx2crd implementation against NVIDIA's authoritative pycute implementation across a range of shapes and indices.
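The wrapping rule can be sketched as follows. This is a simplified re-implementation for illustration, assuming colexicographic (leftmost-fastest) ordering and integer indices only; it is neither the library's idx2crd nor pycute's.

```python
def product(shape):
    """Number of elements in an int or nested-tuple shape."""
    if isinstance(shape, tuple):
        total = 1
        for s in shape:
            total *= product(s)
        return total
    return shape


def idx2crd(idx, shape, stride=1):
    """Sketch: map a flat index to a coordinate, leftmost mode fastest."""
    if isinstance(shape, tuple):
        coords = []
        for s in shape:
            coords.append(idx2crd(idx, s, stride))
            stride *= product(s)  # the next mode advances once per full sweep
        return tuple(coords)
    # Scalar shape: always wrap with modulo, even when idx is out of range.
    return (idx // stride) % shape


wrapped = idx2crd(5, 4)           # 5 wraps around a shape of 4
pair = idx2crd(5, (2, 3))         # flat index 5 in a 2x3 shape
nested = idx2crd(7, (2, (2, 2)))  # hierarchical shape
```

A differential test in this style would loop over shapes and indices and assert that the result equals the oracle's for each pair.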

jduprat and others added 16 commits March 24, 2026 12:37

Warn when composite figure panels exceed grid capacity

1c38385

_build_composite_figure previously silently dropped panels that did not fit into the grid. Now emits a UserWarning so users know data is being omitted.

Add missing copyright and license blocks to documentation files

092f92a

[NFC] Project is now lint clean

a872c2f

meta-cla bot added the CLA Signed label Mar 25, 2026

jduprat self-assigned this Mar 25, 2026

jduprat closed this Mar 25, 2026

jduprat reopened this Mar 25, 2026

jduprat closed this Mar 31, 2026

jduprat deleted the dev branch March 31, 2026 20:43

jduprat wants to merge 16 commits into main from dev

jduprat commented Mar 25, 2026
