updates by PhilipDeegan · Pull Request #14 · mkn/mkn.avx

PhilipDeegan · 2026-03-29T13:42:16Z

Summary by CodeRabbit

New Features
- Cross-platform CI workflow (Linux & Windows)
- New micro-benchmark executable suite and Timer utility
- Added aligned and unaligned AVX test programs
Improvements
- Streamlined vector/span handling with configurable alignment
- Expanded AVX support, safer SIMD helpers, and function-level counters for diagnostics
Documentation
- Alignment configuration docs updated to reflect dynamic defaults
Tests
- New and reorganized benchmark/test profiles and registrations

coderabbitai · 2026-03-29T13:42:28Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Reworks AVX core: removes hardcoded alignment macro in favor of Options::ALIGN(), converts Vector to an aligned std::vector alias, expands Span/SpanSet with aligned/unaligned variants and variadic make_spans, replaces function-pointer SIMD API with static callables, adds a dbg counter header, new benchmarks/tests, and a multi-platform CI workflow.

Changes

Cohort / File(s)	Summary
CI / Workflows `\.github/workflows/build\.yml`, `\.github/workflows/build_nix\.yml`	Added new multi-platform GitHub Actions workflow (`.github/workflows/build.yml`) and removed the prior Nix workflow (`.github/workflows/build_nix.yml`).
Top-level config `mkn\.yaml`	Switched run profile to `dbg`, replaced `bench` with `gbench`, and added/adjusted profiles: `test_all`, `test_lazy`, `test_aligned`, `test_unaligned`, `b0`, `b1`.
Alignment & Options `inc/mkn/avx/def.hpp`, `README.noformat`	Removed header-level default `MKN_AVX_ALIGN_AS`; added `Options::ALIGN()` logic, `mkn::avx::Exception`, and alignment/type-trait helpers; updated README to document conditional alignment semantics.
Span / Views `inc/mkn/avx/span.hpp`	Converted `Span` to fixed-size `Span<T,N>` with pointer/AVX union, added `UnSpan`/`UnSpanSet`, enforced matching `N` in SIMD ops, added alignment detection and variadic `make_spans`, and adjusted constructors/assignments.
Array API `inc/mkn/avx/array.hpp`	Switched defaults to `Options::ALIGN()`, simplified internal storage, reworked construction/copy/move/assignment to value-init + span-based population, changed `Span_t` to `Span<T,N>`, and added several free-function overloads.
SIMD Types & Ops `inc/mkn/avx/types.hpp`, `inc/mkn/avx/lazy.hpp`	Replaced function-pointer members with static callable members (`add/sub/mul/div/set/set_v/fma/unaligned_`), added `UnType`, `store`/`load`/`unaligned_` helpers, adjusted thread-local temporaries to use `Options::ALIGN()`, and minor eval signature tweaks.
Vector type `inc/mkn/avx/vector.hpp`, `inc/mkn/avx.hpp`	Removed custom `Vector` class and AVX-specific elementwise operators; introduced `using Vector = std::vector<T, kul::AlignedAllocator<T, Options::ALIGN()>>`; reduced includes and adjusted IWYU annotations.
Debug / Counters `inc/mkn/avx/dbg.hpp`	Added header-only `Counter` and `CountPoint` utilities and `MKN_AVX_FN_COUNTER` macro (enabled when `MKN_AVX_COUNT_FNS`), with formatted keyed counts stored in a process-local map.
Benchmarks / Utilities `test/bench.hpp`, `test/bench.cpp`, `test/betterbench.hpp`, `test/betterbench.cpp`, `test/betterbench_avx.cpp`	Added `Timer` utility and benchmark harness (SoA/AoS), refactored benchmark kernels to use `make_spans`, added `betterbench` scalar and AVX variants, and reorganized includes.
Tests `test/test_avx.cpp`, `test/test_lazy.cpp`, `test/test_aligned.cpp`, `test/test_unaligned.cpp`	Updated tests to use `Options::ALIGN()` and new span APIs, added aligned and unaligned validation tests, adjusted assertion/abort usage, and added counter logging in `test_avx`.
Misc `inc/mkn/avx/dbg.hpp`, `inc/mkn/avx/def.hpp`, `inc/mkn/avx/span.hpp` (additional)	Marked many small functions `inline`, adjusted include ordering, added IWYU keep annotations, and changed several API signatures/types as part of the refactor.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

arrays #11: Modifies the same AVX core headers (array/span/options/types/vector) and coordinates Array/Span/Options changes — strong code-level overlap.
lazy vectors ops #7: Touches lazy/vector/span/types subsystems and tests; overlaps the lazy/vector/span refactor and related tests.

Poem

🐇

I nudged the lanes to wiser align,
Swapped snug Vecs for std::vector fine,
Counters tick where functions leap,
Spans and UnSpans guard bytes in sleep,
Benchmarks hum — a carrot-dance, divine.

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 3.47% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check	❓ Inconclusive	The title 'updates' is too generic and does not convey meaningful information about the substantial changes in this pull request, which include major refactoring of the AVX library architecture, build workflows, and test infrastructure.	Provide a more descriptive title that captures the main change, such as 'Refactor AVX library architecture with dynamic alignment and simplified Vector type' or similar.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch next

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 16

🧹 Nitpick comments (9)

test/bench.hpp (1)

1-6: Missing standard library includes.

The code uses std::cout, std::setprecision, std::chrono::steady_clock, std::chrono::nanoseconds, and std::uint64_t but relies on transitive includes from "mkn/kul/time.hpp". For clarity and to avoid breakage if the dependency changes its includes, consider adding explicit includes.

Proposed fix

 `#ifndef` _MKN_AVX_TEST_BENCH_HPP_
 `#define` _MKN_AVX_TEST_BENCH_HPP_

 `#include` "mkn/kul/time.hpp"
+
+#include <chrono>
+#include <cstdint>
+#include <iomanip>
+#include <iostream>

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@test/bench.hpp` around lines 1 - 6, Add explicit standard headers instead of
relying on transitive includes: include <iostream> for std::cout, <iomanip> for
std::setprecision, <chrono> for std::chrono::steady_clock and
std::chrono::nanoseconds, and <cstdint> for std::uint64_t; modify test/bench.hpp
to add these includes near the top so usages in any functions or code
referencing std::cout, std::setprecision, steady_clock, nanoseconds, or uint64_t
(search for those symbols in this file) compile reliably even if
mkn/kul/time.hpp changes.

inc/mkn/avx/types.hpp (1)

289-315: Minor inconsistency in operator implementations.

Some operators access operations directly via Type<T, SIZE>::add (lines 293, 307), while others use Type<T, SIZE>::Super::impl_type::sub (lines 300, 314). Both resolve to the same implementation, but using a consistent style would improve readability.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@inc/mkn/avx/types.hpp` around lines 289 - 315, The operator overloads for
operator+, operator-, operator*, and operator/ mix two access styles (Type<T,
SIZE>::add / mul vs Type<T, SIZE>::Super::impl_type::sub / div); make them
consistent by using the same qualified access for all four operators (e.g.,
change operator+ and operator* to call Type<T, SIZE>::Super::impl_type::add and
::mul, or change operator- and operator/ to call Type<T, SIZE>::add and ::div)
so all functions uniformly reference either Type<T, SIZE>::add/mul/div or
Type<T, SIZE>::Super::impl_type::add/mul/div; update the bodies of operator+,
operator-, operator*, operator/ accordingly.

test/bench.cpp (1)

196-200: Consider adding a TODO comment explaining why std::uint32_t benchmarks are disabled.

The commented-out benchmarks lack context. A brief comment explaining whether this is temporary (pending implementation) or permanent (unsupported type) would help future maintainers.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/bench.cpp` around lines 196 - 200, Add a short TODO comment above the
commented BENCHMARK_TEMPLATE lines explaining why the std::uint32_t variants are
disabled; mention whether mul_avx_inplace, mul_avx_inplace_array,
mul_avx_inplace_single, add_avx_inplace and add_avx_inplace_single for
std::uint32_t are temporarily disabled (e.g., pending correctness/performance
fixes or failing tests) or permanently unsupported, and include a pointer to the
issue/PR or owner if applicable so future maintainers know the intent.

test/betterbench.cpp (1)

19-41: LGTM with minor nitpick: trailing semicolon after main().

The test logic correctly validates both SoA and AoS layouts against the scalar reference. The trailing semicolon on line 41 is syntactically valid but unconventional.
-};
+}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/betterbench.cpp` around lines 19 - 41, Remove the extraneous trailing
semicolon after the closing brace of main() — locate the function main (int
main(/*...*/)) and delete the semicolon following its closing '}' so the file
ends with the function definition only; no other changes needed.

test/betterbench_avx.cpp (3)

50-56: Loop variable i can be scoped to the loop.

The variable i is declared outside the loop but only used within it. Moving the declaration into the for-loop improves scoping.

Proposed fix

-    std::size_t i = 0;
     {
         MKN_KUL_DBG_FUNC_ENTER;

-        for (; i < arrs; ++i)
+        for (std::size_t i = 0; i < arrs; ++i)
             op(d0[i](), d1[i](), d2[i](), d3[i]());
     }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@test/betterbench_avx.cpp` around lines 50 - 56, The loop index variable i is
declared outside its loop but only used inside it; update the code that contains
MKN_KUL_DBG_FUNC_ENTER and the for-loop over arrs so that i is declared in the
for header (e.g., for (std::size_t i = 0; i < arrs; ++i)) and remove the prior
std::size_t i = 0; declaration, keeping the existing call to op(d0[i](),
d1[i](), d2[i](), d3[i]()) and preserving surrounding debug macro usage.

58-60: Trailing semicolon after main() function.

Same as in betterbench.cpp - the trailing semicolon is syntactically valid but unconventional.

-};
+}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@test/betterbench_avx.cpp` around lines 58 - 60, Remove the unnecessary
trailing semicolon after the main() function definition in betterbench_avx.cpp:
locate the end of the main() function (the block that ends with the check `if
(d3[arrs - 1][3] != b3) throw std::runtime_error("FAIL!");`) and delete the
semicolon immediately following the closing brace so the function ends with `}`
only.

25-26: Unused prefetcher lambda.

The prefetcher lambda is defined but never called. Either use it in the benchmark loop or remove it to avoid dead code.

Option 1: Remove unused code

-    auto const prefetcher
-        = [](auto i, auto&... args) { ((__builtin_prefetch(args.data() + i, 1, 2)), ...); };
-

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@test/betterbench_avx.cpp` around lines 25 - 26, The lambda named `prefetcher`
is defined but never used; either delete the `prefetcher` definition to remove
dead code, or invoke it inside the benchmark loop where data is accessed (call
`prefetcher(i, container1, container2, ...)` passing the loop index and the
containers/arrays whose `.data()` should be prefetched) so the prefetch actually
runs; update the benchmark body that iterates indices (e.g., the inner
loop/function that uses `i`) to call `prefetcher` before accessing the data and
ensure the passed arguments have a `.data()` method or pointer.

test/test_unaligned.cpp (1)

38-45: Minor: Test template defaults to float but main() only tests double.

The function template defaults to float but is only called with double. Consider either testing both types or adjusting the default.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@test/test_unaligned.cpp` around lines 38 - 45, The test currently only
invokes mkn::avx::test_unaligned<double>() while the template's default type is
float; either exercise the default or explicitly test both types—update main()
to call mkn::avx::test_unaligned<>() (no template arg) to verify the float
default, or add an additional call mkn::avx::test_unaligned<float>() alongside
the existing double invocation, ensuring both code paths are covered and the
template default is validated.

.github/workflows/build.yml (1)

26-28: Consider adding integrity verification for downloaded binary.

The mkn binary is downloaded from an external URL without checksum verification. While it's from the project's own releases, adding a hash check would improve supply chain security.

Example with checksum verification

run: |
  $CURL_GET -o mkn ${PATH_GET}/mkn_nix
  echo "<expected_sha256>  mkn" | sha256sum -c -
  chmod +x mkn
  KLOG=5 ./mkn clean build test run -p test,b0,b1 -OtKda "-std=c++20 -fPIC -DMKN_KUL_FORCE_TRACE" -l -pthread -g 0

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In @.github/workflows/build.yml around lines 26 - 28, Add checksum verification
after downloading the mkn binary to prevent supply-chain tampering: after the
existing download step that uses $CURL_GET and ${PATH_GET}/mkn_nix, verify the
file (mkn) against a known SHA256 value (provided via a workflow secret or
variable) using a sha256sum check and fail the job if it does not match; only
then proceed to chmod +x mkn and run the KLOG=5 ./mkn ... command. Reference the
mkn download step (CURL_GET, PATH_GET, mkn) and ensure the expected hash is
stored/managed securely (e.g., as an env var or secret) and used in the
verification before execution.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/build.yml:
- Around line 41-44: Remove the dangerous deletion of /bin/link and instead fix
tool resolution by adjusting PATH or invoking the intended linker explicitly:
stop using the `rm /bin/link` line, export or prepend the MSVC toolchain
directory to PATH (so the correct link.exe is found by the subsequent `./mkn`
invocation) or change the build invocation to call the full path to MSVC's
link.exe; ensure the change is applied before the `bash -c 'KLOG=3 ./mkn clean
build test run ...'` step so `./mkn` uses the correct linker rather than
deleting `/bin/link`.

In `@inc/mkn/avx.hpp`:
- Around line 34-37: The disabled sample uses mkn::avx::Vector<float> but the
header no longer includes "mkn/avx/vector.hpp"; restore the include or update
the sample to avoid referencing Vector. Fix by adding `#include`
"mkn/avx/vector.hpp" to the header (near the other IWYU pragma includes) so the
sample compiles/reads correctly, or alternatively change the sample text to
explicitly state that users must include vector.hpp when using mkn::avx::Vector.

In `@inc/mkn/avx/array.hpp`:
- Around line 221-229: The free operator* currently returns mkn::avx::Array<T1,
N> which can become const when called with Span<const T, N>, causing subsequent
writes (*ret = s0; ret *= s1;) to attempt mutation of const storage; change the
result type to decay T1 (e.g. Array<std::decay_t<T1>, N>) the same way operator-
does so the returned Array is unqualified and writable while preserving value
type semantics.

In `@inc/mkn/avx/dbg.hpp`:
- Around line 71-79: The macro MKN_AVX_FN_COUNTER currently pastes __LINE__
directly into the identifier (creating __mkn_avx_count_point##__LINE__) which
prevents __LINE__ from expanding; change the macro to use two-level indirection
(helper macros) so __LINE__ is expanded before token pasting: add a pair of
helper macros (e.g., MKN_AVX_PP_CONCAT(a,b) and MKN_AVX_PP_CONCAT_EXPAND(a,b))
and update MKN_AVX_FN_COUNTER to build the unique variable name with those
helpers when declaring __mkn_avx_count_point and keep calling
mkn::avx::Counter::I() as before so the counter point name becomes unique per
line.

In `@inc/mkn/avx/def.hpp`:
- Around line 138-142: is_aligned_pointer currently tests pointer_address % N
where N is element count (Options::N<T>()), but alignment must be in bytes;
update is_aligned_pointer(T* t) to use byte alignment instead—either set the
default template parameter to Options::ALIGN() or multiply the element count by
sizeof(T) when computing the modulus (i.e., use (address % Options::ALIGN()) ==
0 or (address % (Options::N<std::decay_t<T>>() * sizeof(T))) == 0); change the
modulus expression in is_aligned_pointer accordingly so it checks byte
alignment.

In `@inc/mkn/avx/lazy.hpp`:
- Line 263: The typedef AVXVec uses Options::ALIGN() without full qualification,
causing a compile error when this code is generated outside the mkn::avx
namespace; update the allocator template argument to reference the
fully-qualified Options symbol (e.g. mkn::avx::Options::ALIGN()) so the line
using AVXVec = std::vector<E, mkn::kul::AlignedAllocator<E,
mkn::avx::Options::ALIGN()>>; resolves correctly.

In `@inc/mkn/avx/span.hpp`:
- Around line 398-410: The compound-assignment overloads (operator+= for
Arr<T0,N> and for scalar T) only iterate full N-sized blocks and ignore the
final size()%N tail; extend each to compute fullBlocks = size()/N and remainder
= size()%N, keep the existing loop over i in [0, fullBlocks), then add a
leftover loop that applies the operation to the remaining elements by computing
baseIndex = fullBlocks * N and updating v0[baseIndex + r] with v1[r] (for array
overloads) or with scratch[r] (for scalar overloads); apply the same
tail-handling fix to the corresponding operator-=, operator*=, and operator/=
overloads and their Arr and scalar variants so no trailing elements are dropped.
- Around line 597-610: The explicit-N make_span overloads currently always
construct Span<...>, bypassing UnSpan for unaligned storage; change both
make_span templates (and the similar overloads at 638-649) to select the span
type via a helper/type trait instead of hardcoding Span: introduce or use an
existing selector (e.g., select_span_t<Container, N> or a trait like
mkn::kul::span_type_for<Container, N>) that yields UnSpan<...> when the
Container/pointer is potentially unaligned (or when
mkn::kul::is_unaligned_like_v<Container> is true) and Span<...> otherwise, and
return that selector-instantiated type from make_span and make_spans so the
unaligned path is chosen when appropriate.
- Around line 523-579: UnSpanSet is returned by make_span for unaligned storage
but only implements operator+=(UnSpanSet) and a non-const operator[], so many
ops (assignment, scalar ops, -=, *=, /=, add/sub/mul/div/fma) still fall back to
aligned Span/SpanSet via cast()/store(), and the const overload of make_span
still returns SpanSet. Fix by adding the missing operator overloads and const
accessors in UnSpanSet (implement operator=, operator-=, operator*=, operator/=,
scalar ops, add/sub/mul/div/fma methods) to mirror SpanSet behavior using AVX
unaligned_load/unaligned_store and AVX_t (refer to AVX_t, unaligned_load,
unaligned_store, modulo_leftover_idx, operator[] in UnSpanSet and UnSpan), and
add a const operator[] and ensure the const make_span overload returns UnSpanSet
(or the appropriate const UnSpanSet specialization) so read-only unaligned
inputs are correctly classified instead of falling back to Span/SpanSet.
- Around line 212-225: The is_span_like_v branch in the template operator=
attempts to call cast(that) directly, which only accepts Span<T,N> or
SpanSet<T,N> and thus fails for span-like types such as std::array/std::vector;
fix by first wrapping the incoming span-like value into the appropriate
Span/SpanSet wrapper (as done in operator-=) before calling cast, e.g. construct
a SpanSet<T0 const, N> (or Span<T0 const, N> as appropriate) from that and pass
the wrapper into cast, then use the returned tuple to call store(data(), v1[0])
so the branch compiles for non-Span inputs.

In `@inc/mkn/avx/types.hpp`:
- Around line 361-367: The load template is declared noexcept but calls
KEXCEPTION("load"), which will trigger std::terminate; fix by removing noexcept
from the function signature of template<typename T, std::size_t SIZE> auto
inline load(T const* __restrict a, Type<T, SIZE> const& __restrict b) so
exceptions can propagate, or alternatively replace KEXCEPTION("load") with a
non-throwing handling (e.g., assert/abort or returning a safe default via
Type<T, SIZE>::load), and update any corresponding declarations/usages to match
the changed exception behavior (references: load, Type<T, SIZE>, KEXCEPTION,
MKN_AVX_FN_COUNTER).
- Around line 267-272: Replace all usages of the removed function-pointer static
members (set_v_func_ptr, mul_func_ptr, add_func_ptr) with the new static member
names (set_v, mul, add) used on Type_<T,SIZE>; update any local variables or
calls that reference set_v_func_ptr, mul_func_ptr, or add_func_ptr in the test
to instead call or reference Type_<...>::set_v, Type_<...>::mul, and
Type_<...>::add (i.e., rename the identifiers and adjust any callsites to match
the new static member signatures).
- Around line 98-123: The AVX1/AVX2 specializations Type_<double, 2> and
Type_<double, 4> are missing unaligned_load and unaligned_store and thus fall
back to scalar loads; add unaligned operations using the proper intrinsics: for
Type_<double, 2> implement unaligned_load with _mm_loadu_pd and unaligned_store
with _mm_storeu_pd, and for Type_<double, 4> implement unaligned_load with
_mm256_loadu_pd and unaligned_store with _mm256_storeu_pd; place these as static
inline members named unaligned_load and unaligned_store alongside the existing
set/set_v/fma definitions so callers use the SIMD unaligned intrinsics instead
of scalar fallback.

In `@mkn.yaml`:
- Around line 12-14: The test pattern in the "test_all" profile uses a typod
regex "test/test_(w*).cpp" which matches literal 'w' chars; update the "test"
value in the test_all profile to use a word-character group "(\w*)" so it
matches filenames like test_aligned.cpp, test_avx.cpp, etc.; locate the
"test_all" profile (name: test_all) and replace the test string
"test/test_(w*).cpp" with "test/test_(\w*).cpp".

In `@test/betterbench.hpp`:
- Around line 98-107: The loops assume size() is a multiple of auto constexpr
static N = mkn::avx::Options::N<T>(); which causes out-of-bounds access for AoS
(the inner loop using v.data()[off] and op<T>) and for SoA (make_spans<N>(i,
...)); fix by stopping the vectorized loop at i + N <= size() (e.g. change the
for condition to i + N <= size()) and add a scalar tail loop after the
vectorized block to process remaining indices (using the same op<T> on
individual elements), or alternatively handle the tail before stepping by N;
update both the AoS loop (the block with v.data()[off]) and the SoA path that
calls make_spans<N>(i, ...) accordingly.

In `@test/test_aligned.cpp`:
- Around line 15-26: The test_aligned template currently uses a huge static SIZE
(100000000) causing v0 and v1 to allocate ~1.6GiB for double and making normal
test runs slow/OOM-prone; either reduce SIZE to a small default (e.g., a few
thousand) or gate the large allocation behind a benchmark-only path (check a
BUILD/TEST flag or an env var like RUN_BENCHMARKS) and only use the big SIZE
when that flag is set; update the static SIZE constant and/or the allocation in
test_aligned (referencing SIZE, test_aligned<T>(), and the vectors v0/v1) so
unit tests use the small size while benchmarks use the original large size.

---

Nitpick comments:
In @.github/workflows/build.yml:
- Around line 26-28: Add checksum verification after downloading the mkn binary
to prevent supply-chain tampering: after the existing download step that uses
$CURL_GET and ${PATH_GET}/mkn_nix, verify the file (mkn) against a known SHA256
value (provided via a workflow secret or variable) using a sha256sum check and
fail the job if it does not match; only then proceed to chmod +x mkn and run the
KLOG=5 ./mkn ... command. Reference the mkn download step (CURL_GET, PATH_GET,
mkn) and ensure the expected hash is stored/managed securely (e.g., as an env
var or secret) and used in the verification before execution.

In `@inc/mkn/avx/types.hpp`:
- Around line 289-315: The operator overloads for operator+, operator-,
operator*, and operator/ mix two access styles (Type<T, SIZE>::add / mul vs
Type<T, SIZE>::Super::impl_type::sub / div); make them consistent by using the
same qualified access for all four operators (e.g., change operator+ and
operator* to call Type<T, SIZE>::Super::impl_type::add and ::mul, or change
operator- and operator/ to call Type<T, SIZE>::add and ::div) so all functions
uniformly reference either Type<T, SIZE>::add/mul/div or Type<T,
SIZE>::Super::impl_type::add/mul/div; update the bodies of operator+, operator-,
operator*, operator/ accordingly.

In `@test/bench.cpp`:
- Around line 196-200: Add a short TODO comment above the commented
BENCHMARK_TEMPLATE lines explaining why the std::uint32_t variants are disabled;
mention whether mul_avx_inplace, mul_avx_inplace_array, mul_avx_inplace_single,
add_avx_inplace and add_avx_inplace_single for std::uint32_t are temporarily
disabled (e.g., pending correctness/performance fixes or failing tests) or
permanently unsupported, and include a pointer to the issue/PR or owner if
applicable so future maintainers know the intent.

In `@test/bench.hpp`:
- Around line 1-6: Add explicit standard headers instead of relying on
transitive includes: include <iostream> for std::cout, <iomanip> for
std::setprecision, <chrono> for std::chrono::steady_clock and
std::chrono::nanoseconds, and <cstdint> for std::uint64_t; modify test/bench.hpp
to add these includes near the top so usages in any functions or code
referencing std::cout, std::setprecision, steady_clock, nanoseconds, or uint64_t
(search for those symbols in this file) compile reliably even if
mkn/kul/time.hpp changes.

In `@test/betterbench_avx.cpp`:
- Around line 50-56: The loop index variable i is declared outside its loop but
only used inside it; update the code that contains MKN_KUL_DBG_FUNC_ENTER and
the for-loop over arrs so that i is declared in the for header (e.g., for
(std::size_t i = 0; i < arrs; ++i)) and remove the prior std::size_t i = 0;
declaration, keeping the existing call to op(d0[i](), d1[i](), d2[i](), d3[i]())
and preserving surrounding debug macro usage.
- Around line 58-60: Remove the unnecessary trailing semicolon after the main()
function definition in betterbench_avx.cpp: locate the end of the main()
function (the block that ends with the check `if (d3[arrs - 1][3] != b3) throw
std::runtime_error("FAIL!");`) and delete the semicolon immediately following
the closing brace so the function ends with `}` only.
- Around line 25-26: The lambda named `prefetcher` is defined but never used;
either delete the `prefetcher` definition to remove dead code, or invoke it
inside the benchmark loop where data is accessed (call `prefetcher(i,
container1, container2, ...)` passing the loop index and the containers/arrays
whose `.data()` should be prefetched) so the prefetch actually runs; update the
benchmark body that iterates indices (e.g., the inner loop/function that uses
`i`) to call `prefetcher` before accessing the data and ensure the passed
arguments have a `.data()` method or pointer.

In `@test/betterbench.cpp`:
- Around line 19-41: Remove the extraneous trailing semicolon after the closing
brace of main() — locate the function main (int main(/*...*/)) and delete the
semicolon following its closing '}' so the file ends with the function
definition only; no other changes needed.

In `@test/test_unaligned.cpp`:
- Around line 38-45: The test currently only invokes
mkn::avx::test_unaligned<double>() while the template's default type is float;
either exercise the default or explicitly test both types—update main() to call
mkn::avx::test_unaligned<>() (no template arg) to verify the float default, or
add an additional call mkn::avx::test_unaligned<float>() alongside the existing
double invocation, ensuring both code paths are covered and the template default
is validated.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b3f7960d-b632-49bc-b729-8b3d793f2a5e

📥 Commits

Reviewing files that changed from the base of the PR and between 3ac9325 and 52f51bf.

📒 Files selected for processing (21)

.github/workflows/build.yml
.github/workflows/build_nix.yml
README.noformat
inc/mkn/avx.hpp
inc/mkn/avx/array.hpp
inc/mkn/avx/dbg.hpp
inc/mkn/avx/def.hpp
inc/mkn/avx/lazy.hpp
inc/mkn/avx/span.hpp
inc/mkn/avx/types.hpp
inc/mkn/avx/vector.hpp
mkn.yaml
test/bench.cpp
test/bench.hpp
test/betterbench.cpp
test/betterbench.hpp
test/betterbench_avx.cpp
test/test_aligned.cpp
test/test_avx.cpp
test/test_lazy.cpp
test/test_unaligned.cpp

💤 Files with no reviewable changes (1)

.github/workflows/build_nix.yml

coderabbitai · 2026-03-29T13:58:44Z

+template<typename T0, typename T1, std::size_t N>
+auto inline operator*(mkn::avx::Span<T0, N> const& __restrict s0,
+                      mkn::avx::Span<T1, N> const& __restrict s1)
+{
+    mkn::avx::Array<T1, N> ret{std::nullopt};
+    *ret = s0;
+    ret *= s1;
+    return ret;
+}


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, let's examine the specific code at lines 221-229 sed -n '221,229p' inc/mkn/avx/array.hpp

Repository: mkn/mkn.avx

Length of output: 332

🏁 Script executed:

# Let's also see the operator+ implementation for comparison sed -n '210,220p' inc/mkn/avx/array.hpp

Repository: mkn/mkn.avx

Length of output: 274

🏁 Script executed:

# Check the Array template definition to understand const behavior sed -n '1,100p' inc/mkn/avx/array.hpp | head -80

Repository: mkn/mkn.avx

Length of output: 2410

🏁 Script executed:

# Look for any tests or usages of operator* with const spans rg "operator\*" inc/mkn/avx/ -A 5 -B 2

Repository: mkn/mkn.avx

Length of output: 5774

🏁 Script executed:

# Search for test files using operator* fd -e cpp -e hpp test | head -20

Repository: mkn/mkn.avx

Length of output: 137

🏁 Script executed:

# Look for any make_span calls with const containers rg "make_span" --type-list | head -20

Repository: mkn/mkn.avx

Length of output: 444

🏁 Script executed:

# Search for any usage of operator* with Span in tests or examples rg "operator\*.*Span" -B 3 -A 3 --type cpp

Repository: mkn/mkn.avx

Length of output: 1121

🏁 Script executed:

# Check the Span assignment operator to see if it can accept const sources sed -n '1,200p' inc/mkn/avx/span.hpp | rg "operator=" -A 8

Repository: mkn/mkn.avx

Length of output: 146

Decay T1 in operator* like the other free operators.

make_span(const container) can produce Span<const T, N>. With the current Array<T1, N> result, multiplying const spans instantiates a const-qualified Array, making its underlying array member const. The subsequent assignments *ret = s0; and ret *= s1; would then attempt to write to const data. operator- already handles this correctly with std::decay_t<T1>.

Suggested fix

template<typename T0, typename T1, std::size_t N> auto inline operator*(mkn::avx::Span<T0, N> const& __restrict s0, mkn::avx::Span<T1, N> const& __restrict s1) { - mkn::avx::Array<T1, N> ret{std::nullopt}; + mkn::avx::Array<std::decay_t<T1>, N> ret{std::nullopt}; *ret = s0; ret *= s1; return ret; }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

template<typename T0, typename T1, std::size_t N>

auto inline operator*(mkn::avx::Span<T0, N> const& __restrict s0,

mkn::avx::Span<T1, N> const& __restrict s1)

{

mkn::avx::Array<T1, N> ret{std::nullopt};

*ret = s0;

ret *= s1;

return ret;

}

template<typename T0, typename T1, std::size_t N>

auto inline operator*(mkn::avx::Span<T0, N> const& __restrict s0,

mkn::avx::Span<T1, N> const& __restrict s1)

{

mkn::avx::Array<std::decay_t<T1>, N> ret{std::nullopt};

*ret = s0;

ret *= s1;

return ret;

}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@inc/mkn/avx/array.hpp` around lines 221 - 229, The free operator* currently returns mkn::avx::Array<T1, N> which can become const when called with Span<const T, N>, causing subsequent writes (*ret = s0; ret *= s1;) to attempt mutation of const storage; change the result type to decay T1 (e.g. Array<std::decay_t<T1>, N>) the same way operator- does so the returned Array is unqualified and writable while preserving value type semantics.

coderabbitai · 2026-03-29T13:58:44Z

+#if defined(MKN_AVX_COUNT_FNS)
+#define MKN_AVX_FN_COUNTER                                                                         \
+    static mkn::avx::CountPoint __mkn_avx_count_point##__LINE__{__FILE__, __func__, __LINE__};     \
+    mkn::avx::Counter::I()(__FILE__, __func__, __LINE__);
+
+#else // !defined(MKN_AVX_FN_COUNTER)
+#define MKN_AVX_FN_COUNTER
+
+#endif // MKN_AVX_FN_COUNTER


⚠️ Potential issue | 🟡 Minor

Macro token pasting with __LINE__ may not expand correctly.

The macro __mkn_avx_count_point##__LINE__ uses direct token pasting with __LINE__, which typically does not expand __LINE__ to the actual line number before concatenation. This means all counter points would use the literal token __LINE__ instead of unique names.

To fix this, use a two-level macro indirection:

Proposed fix

`#if` defined(MKN_AVX_COUNT_FNS) +#define MKN_AVX_CONCAT_IMPL(x, y) x##y +#define MKN_AVX_CONCAT(x, y) MKN_AVX_CONCAT_IMPL(x, y) `#define` MKN_AVX_FN_COUNTER \ - static mkn::avx::CountPoint __mkn_avx_count_point##__LINE__{__FILE__, __func__, __LINE__}; \ + static mkn::avx::CountPoint MKN_AVX_CONCAT(__mkn_avx_count_point, __LINE__){__FILE__, __func__, __LINE__}; \ mkn::avx::Counter::I()(__FILE__, __func__, __LINE__);

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@inc/mkn/avx/dbg.hpp` around lines 71 - 79, The macro MKN_AVX_FN_COUNTER currently pastes __LINE__ directly into the identifier (creating __mkn_avx_count_point##__LINE__) which prevents __LINE__ from expanding; change the macro to use two-level indirection (helper macros) so __LINE__ is expanded before token pasting: add a pair of helper macros (e.g., MKN_AVX_PP_CONCAT(a,b) and MKN_AVX_PP_CONCAT_EXPAND(a,b)) and update MKN_AVX_FN_COUNTER to build the unique variable name with those helpers when declaring __mkn_avx_count_point and keep calling mkn::avx::Counter::I() as before so the counter point name becomes unique per line.

coderabbitai · 2026-03-29T13:58:44Z

+    auto const static inline add   = Type_<T, SIZE>::add;
+    auto const static inline sub   = Type_<T, SIZE>::sub;
+    auto const static inline mul   = Type_<T, SIZE>::mul;
+    auto const static inline div   = Type_<T, SIZE>::div;
+    auto const static inline set   = Type_<T, SIZE>::set;
+    auto const static inline set_v = Type_<T, SIZE>::set_v;


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Verify that test/betterbench_avx.cpp uses the old API names rg -n "func_ptr" --type=cpp

Repository: mkn/mkn.avx

Length of output: 1144

Breaking API change: test/betterbench_avx.cpp uses removed function pointer names and will fail to compile.

The refactor in inc/mkn/avx/types.hpp (lines 267-272) replaces *_func_ptr static members with direct member names (set_v, mul, add). However, test/betterbench_avx.cpp still uses the old API throughout: set_v_func_ptr (lines 31, 33, 34, 36, 41), mul_func_ptr (lines 37, 42, 45, 46, 47), and add_func_ptr (lines 38, 40, 43, 44). Update the test file to use the new API names.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@inc/mkn/avx/types.hpp` around lines 267 - 272, Replace all usages of the removed function-pointer static members (set_v_func_ptr, mul_func_ptr, add_func_ptr) with the new static member names (set_v, mul, add) used on Type_<T,SIZE>; update any local variables or calls that reference set_v_func_ptr, mul_func_ptr, or add_func_ptr in the test to instead call or reference Type_<...>::set_v, Type_<...>::mul, and Type_<...>::add (i.e., rename the identifiers and adjust any callsites to match the new static member signatures).

coderabbitai · 2026-03-29T13:58:44Z

+template<typename T, std::size_t SIZE>
+auto inline load(T const* __restrict a, Type<T, SIZE> const& __restrict b) noexcept
+{
+    MKN_AVX_FN_COUNTER;
+    KEXCEPTION("load");
+    // Type<T, SIZE>::load(const_cast<T*>(a), b());
+}


⚠️ Potential issue | 🟠 Major

load() is marked noexcept but throws an exception.

This function throws KEXCEPTION("load") but is declared noexcept. If this function is called, it will invoke std::terminate() instead of propagating the exception. Either remove noexcept or change the implementation to not throw (e.g., return a default value or use assert/abort).

Proposed fix - remove noexcept

template<typename T, std::size_t SIZE> -auto inline load(T const* __restrict a, Type<T, SIZE> const& __restrict b) noexcept +auto inline load(T const* __restrict a, Type<T, SIZE> const& __restrict b) { MKN_AVX_FN_COUNTER; KEXCEPTION("load"); // Type<T, SIZE>::load(const_cast<T*>(a), b()); }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

template<typename T, std::size_t SIZE>

auto inline load(T const* __restrict a, Type<T, SIZE> const& __restrict b) noexcept

{

MKN_AVX_FN_COUNTER;

KEXCEPTION("load");

// Type<T, SIZE>::load(const_cast<T*>(a), b());

}

template<typename T, std::size_t SIZE>

auto inline load(T const* __restrict a, Type<T, SIZE> const& __restrict b)

{

MKN_AVX_FN_COUNTER;

KEXCEPTION("load");

// Type<T, SIZE>::load(const_cast<T*>(a), b());

}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@inc/mkn/avx/types.hpp` around lines 361 - 367, The load template is declared noexcept but calls KEXCEPTION("load"), which will trigger std::terminate; fix by removing noexcept from the function signature of template<typename T, std::size_t SIZE> auto inline load(T const* __restrict a, Type<T, SIZE> const& __restrict b) so exceptions can propagate, or alternatively replace KEXCEPTION("load") with a non-throwing handling (e.g., assert/abort or returning a safe default via Type<T, SIZE>::load), and update any corresponding declarations/usages to match the changed exception behavior (references: load, Type<T, SIZE>, KEXCEPTION, MKN_AVX_FN_COUNTER).

coderabbitai · 2026-03-29T13:58:44Z

+        auto constexpr static N = mkn::avx::Options::N<T>();
+
+        // NO AVX FOR AOS
+        for (std::size_t i = 0; i < size(); i += N)
+        {
+            for (std::size_t j = 0; j < N; ++j)
+            {
+                std::size_t off = i + j;
+                op<T>(v.data()[off].v0, v.data()[off].v1, v.data()[off].v2, v.data()[off].v3);
+            }


⚠️ Potential issue | 🔴 Critical

Handle the tail before stepping by N.

Both vectorized paths assume size() is an exact multiple of N. With the current default s = 10, the final iteration will read/write past the end as soon as N > 2 in AoS, and make_spans<N>(i, ...) does the same in SoA. Please add a scalar tail path or stop the block loop at i + N <= size().

Also applies to: 123-130

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@test/betterbench.hpp` around lines 98 - 107, The loops assume size() is a multiple of auto constexpr static N = mkn::avx::Options::N<T>(); which causes out-of-bounds access for AoS (the inner loop using v.data()[off] and op<T>) and for SoA (make_spans<N>(i, ...)); fix by stopping the vectorized loop at i + N <= size() (e.g. change the for condition to i + N <= size()) and add a scalar tail loop after the vectorized block to process remaining indices (using the same op<T> on individual elements), or alternatively handle the tail before stepping by N; update both the AoS loop (the block with v.data()[off]) and the SoA path that calls make_spans<N>(i, ...) accordingly.

coderabbitai

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

inc/mkn/avx/array.hpp (1)

138-170: ⚠️ Potential issue | 🟠 Major

The fixed-size array overloads stop honoring this Array's N.

operator+ wraps arr with make_span<N>(arr), but operator-, *, and / fall back to plain make_span(arr), which reverts to Options::N<T>(). Any Array<T, N> where those differ will compile against the wrong extent or fail altogether.

💡 Proposed fix

     auto inline operator-(Arr::arr_t const& arr) const
     {
         Array ret{};
-        ret.sub(*this, make_span(arr));
+        ret.sub(*this, make_span<N>(arr));
         return ret;
     }
@@
     auto inline operator*(Arr::arr_t const& arr) const
     {
         Array ret{};
-        ret.mul(*this, make_span(arr));
+        ret.mul(*this, make_span<N>(arr));
         return ret;
     }
@@
     auto inline operator/(Arr::arr_t const& arr) const
     {
         Array ret{};
-        ret.div(*this, make_span(arr));
+        ret.div(*this, make_span<N>(arr));
         return ret;
     }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@inc/mkn/avx/array.hpp` around lines 138 - 170, The non-templated overloads
operator-(Arr::arr_t const&), operator*(Arr::arr_t const&), and
operator/(Arr::arr_t const&) ignore this Array's template extent N by calling
make_span(arr) which falls back to Options::N<T>(), so change those calls to use
the N-aware overload make_span<N>(arr) (i.e., replace make_span(arr) with
make_span<N>(arr)) so sub/mul/div are invoked with a span matching this Array<T,
N>; update the three methods accordingly (symbols: operator-, operator*,
operator/, make_span, Array::sub, Array::mul, Array::div).

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@inc/mkn/avx/span.hpp`:
- Around line 68-85: The union U and member u cause undefined behavior by
pointer-punning; remove U and store a single pointer member (e.g., T* dataPtr)
instead, update the Span(T* d, std::size_t s) constructor to initialize that
pointer (replace u{d} with dataPtr = d), and change both avx() overloads to
compute and return the AVX pointer via reinterpret_cast to mkn::avx::Type<R, N>*
(and the const-correct variant for avx() const) so you no longer write one union
member and read another.

In `@inc/mkn/avx/types.hpp`:
- Around line 63-64: The scalar fallback for unaligned_store currently assigns
to the pointer variable rather than the pointee; update the unaligned_store
lambda (and keep unaligned_load as-is) so it stores through the pointer (i.e.,
dereference the first parameter) and accept the value by
reference/const-reference as appropriate (e.g., change the signature used in the
lambda for unaligned_store so it does "*a = b" rather than "a = b") to fix the
SIZE==1 instantiation error in unaligned_store.
- Around line 347-350: The store and unaligned_store functions currently take T
const* and then write to it via const_cast, which is UB; change their parameter
types to T* (remove const) and remove the const_cast in Type<T, SIZE>::set calls
so they write directly to the provided pointer; for unaligned_load, ensure its
parameter remains T const* and remove any unnecessary const_cast when calling
Type<T, SIZE>::get so reads use the const pointer directly; update function
signatures for store, unaligned_store and unaligned_load (and any call-sites in
the diff) to match these types and keep noexcept and __restrict qualifiers as
appropriate.
- Around line 217-227: The mul lambdas in the Type_<std::int32_t, 4> and
Type_<std::int32_t, 8> specializations currently use _mm_mul_epi32 /
_mm256_mul_epi32 which only multiply even lanes and widen to 64-bit; change
those mul definitions to call _mm_mullo_epi32 and _mm256_mullo_epi32
respectively so all 32-bit lanes are multiplied elementwise and results remain
32-bit. Locate the mul lambda in Type_<std::int32_t,4> (internal_type = __m128i)
and the mul lambda in Type_<std::int32_t,8> (internal_type = __m256i) and
replace the intrinsic names accordingly while preserving the existing lambda
signature and argument forwarding.

In `@test/test_unaligned.cpp`:
- Around line 15-35: The test_unaligned() function uses a huge constexpr SIZE
(100000000) causing multi-GB allocations in v0 and v1; reduce memory pressure by
lowering SIZE to a small value suitable for CI (e.g. a few thousand) or make
SIZE conditional on a BENCHMARK or LARGE_TEST build flag and keep the large
value only for benchmarks; update references to SIZE, test_unaligned, and the
vectors v0/v1 accordingly so the same first/last-element assertions still run,
and ensure static_asserts (is_aligned, decltype(a)::N) remain valid under the
reduced or gated size.

---

Outside diff comments:
In `@inc/mkn/avx/array.hpp`:
- Around line 138-170: The non-templated overloads operator-(Arr::arr_t const&),
operator*(Arr::arr_t const&), and operator/(Arr::arr_t const&) ignore this
Array's template extent N by calling make_span(arr) which falls back to
Options::N<T>(), so change those calls to use the N-aware overload
make_span<N>(arr) (i.e., replace make_span(arr) with make_span<N>(arr)) so
sub/mul/div are invoked with a span matching this Array<T, N>; update the three
methods accordingly (symbols: operator-, operator*, operator/, make_span,
Array::sub, Array::mul, Array::div).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 48f7cfb1-9ef5-42c3-9821-5bc37a94486d

📥 Commits

Reviewing files that changed from the base of the PR and between 52f51bf and 4fe51a3.

📒 Files selected for processing (21)

.github/workflows/build.yml
.github/workflows/build_nix.yml
README.noformat
inc/mkn/avx.hpp
inc/mkn/avx/array.hpp
inc/mkn/avx/dbg.hpp
inc/mkn/avx/def.hpp
inc/mkn/avx/lazy.hpp
inc/mkn/avx/span.hpp
inc/mkn/avx/types.hpp
inc/mkn/avx/vector.hpp
mkn.yaml
test/bench.cpp
test/bench.hpp
test/betterbench.cpp
test/betterbench.hpp
test/betterbench_avx.cpp
test/test_aligned.cpp
test/test_avx.cpp
test/test_lazy.cpp
test/test_unaligned.cpp

💤 Files with no reviewable changes (1)

.github/workflows/build_nix.yml

✅ Files skipped from review due to trivial changes (1)

inc/mkn/avx/dbg.hpp

🚧 Files skipped from review as they are similar to previous changes (11)

test/test_lazy.cpp
inc/mkn/avx.hpp
test/bench.cpp
README.noformat
mkn.yaml
test/test_avx.cpp
.github/workflows/build.yml
test/betterbench.cpp
inc/mkn/avx/lazy.hpp
test/betterbench_avx.cpp
test/betterbench.hpp

coderabbitai · 2026-03-29T14:13:27Z

+    using internal_type          = __m128i;
+    auto const static inline add = [](auto&&... v) { return _mm_add_epi32(v...); };
+    auto const static inline mul = [](auto&&... v) { return _mm_mul_epi32(v...); };
+    // auto const static inline fma = _mm256_fmadd_ps;
 };
 template<>
 struct Type_<std::int32_t, 8>
 {
-    using internal_type                = __m256i;
-    auto constexpr static add_func_ptr = &_mm256_add_epi32;
-    auto constexpr static mul_func_ptr = &_mm256_mul_epi32;
-    // auto constexpr static fma_func_ptr = &_mm256_fmadd_ps;
+    using internal_type          = __m256i;
+    auto const static inline add = [](auto&&... v) { return _mm256_add_epi32(v...); };
+    auto const static inline mul = [](auto&&... v) { return _mm256_mul_epi32(v...); };


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

What do Intel intrinsics _mm_mul_epi32and_mm256_mul_epi32 compute, and which intrinsics perform elementwise 32-bit multiplication while keeping 32-bit lanes (_mm_mullo_epi32/_mm256_mullo_epi32)?

💡 Result:

_mm_mul_epi32 and _mm256_mul_epi32 compute the full signed 64-bit products of the low 32-bit signed integers from each packed 64-bit lane in the input vectors. For _mm_mul_epi32 (SSE4.1, PMULDQ), in a 128-bit vector with elements [63:0, 127:64], it multiplies the 32-bit low parts (bits 31:0 of each 64-bit lane) and places the 64-bit results in the output lanes. Similarly for _mm256_mul_epi32 (AVX2, VPMULDQ) on 256-bit vectors with four 64-bit lanes. The intrinsics that perform elementwise 32-bit to 32-bit signed multiplication (low 32 bits of the 64-bit product, operating on all four/eight 32-bit elements independently) are _mm_mullo_epi32 (SSE4.1, PMULLD) and _mm256_mullo_epi32 (AVX2, VPMULLD).

Citations:

1: https://stackoverflow.com/questions/19576515/mm-mul-epu32-vs-mm-mul-epi32

2: https://doc.rust-lang.org/beta/core/arch/x86_64/fn._mm_mul_epi32.html

3: https://google.github.io/zerocopy/zerocopy/util/macro_util/core_reexport/arch/x86_64/fn._mm_mul_epi32.html

4: http://www.physics.ntua.gr/~konstant/HetCluster/intel12.1/compiler_c/main_cls/intref_cls/common/intref_avx2_mul_epi.htm

5: https://doc.rust-lang.org/1.65.0/core/arch/x86/fn._mm_mullo_epi32.html

6: https://learn.microsoft.com/en-us/previous-versions/bb531409(v=vs.120)?redirectedfrom=MSDN

🏁 Script executed:

# Locate the file find . -name "types.hpp" -path "*/mkn/avx/*" 2>/dev/null

Repository: mkn/mkn.avx

Length of output: 78

🏁 Script executed:

# Check the current content at lines 217-227 if [ -f "inc/mkn/avx/types.hpp" ]; then sed -n '217,227p' inc/mkn/avx/types.hpp fi

Repository: mkn/mkn.avx

Length of output: 258

🏁 Script executed:

# Show more context to understand the structure if [ -f "inc/mkn/avx/types.hpp" ]; then sed -n '210,235p' inc/mkn/avx/types.hpp fi

Repository: mkn/mkn.avx

Length of output: 932

🏁 Script executed:

# Search for usage of Type_<std::int32_t with mul rg "Type_.*int32.*mul|\.mul\(\)" inc/mkn/avx/ -A 2 -B 2

Repository: mkn/mkn.avx

Length of output: 37

🏁 Script executed:

# Check for any test files that might exercise int32 multiplication find . -name "*test*" -o -name "*Test*" | head -20

Repository: mkn/mkn.avx

Length of output: 152

🏁 Script executed:

# Search for references to Type_ with int32 rg "Type_.*std::int32_t" --type-list | head -5

Repository: mkn/mkn.avx

Length of output: 444

🏁 Script executed:

# Check test files for int32 multiplication tests grep -r "int32_t\|int32" test/ --include="*.cpp" -B 2 -A 2

Repository: mkn/mkn.avx

Length of output: 643

🏁 Script executed:

# Search for any comments or issues related to mul, epi32 rg "mul|epi32|mullo" inc/mkn/avx/types.hpp -B 1 -A 1

Repository: mkn/mkn.avx

Length of output: 3649

Use _mm_mullo_epi32 / _mm256_mullo_epi32 for full elementwise multiplication.

_mm_mul_epi32 and _mm256_mul_epi32 only multiply even-indexed 32-bit lanes (indices 0, 2 in 4-element vectors; 0, 2, 4, 6 in 8-element vectors) and widen results to 64 bits. This breaks the API's elementwise semantics where all lanes should be processed identically with results staying in 32-bit format. Replace with _mm_mullo_epi32 / _mm256_mullo_epi32 which perform full elementwise 32-bit multiplication on all lanes.

Proposed fix

template<> struct Type_<std::int32_t, 4> { using internal_type = __m128i; auto const static inline add = [](auto&&... v) { return _mm_add_epi32(v...); }; - auto const static inline mul = [](auto&&... v) { return _mm_mul_epi32(v...); }; + auto const static inline mul = [](auto&&... v) { return _mm_mullo_epi32(v...); }; // auto const static inline fma = _mm256_fmadd_ps; }; template<> struct Type_<std::int32_t, 8> { using internal_type = __m256i; auto const static inline add = [](auto&&... v) { return _mm256_add_epi32(v...); }; - auto const static inline mul = [](auto&&... v) { return _mm256_mul_epi32(v...); }; + auto const static inline mul = [](auto&&... v) { return _mm256_mullo_epi32(v...); };

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@inc/mkn/avx/types.hpp` around lines 217 - 227, The mul lambdas in the Type_<std::int32_t, 4> and Type_<std::int32_t, 8> specializations currently use _mm_mul_epi32 / _mm256_mul_epi32 which only multiply even lanes and widen to 64-bit; change those mul definitions to call _mm_mullo_epi32 and _mm256_mullo_epi32 respectively so all 32-bit lanes are multiplied elementwise and results remain 32-bit. Locate the mul lambda in Type_<std::int32_t,4> (internal_type = __m128i) and the mul lambda in Type_<std::int32_t,8> (internal_type = __m256i) and replace the intrinsic names accordingly while preserving the existing lambda signature and argument forwarding.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (5)

test/test_aligned.cpp (1)

42-50: main() marked noexcept will terminate on exceptions.

Same consideration as test_unaligned.cpp - allocation failures won't produce exception diagnostics.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@test/test_aligned.cpp` around lines 42 - 50, The main() function is declared
noexcept which will call std::terminate on any thrown exception (hiding
allocation errors); remove the noexcept specifier from main() in
test/test_aligned.cpp (i.e., change int main() noexcept to int main()) so
exceptions from Timer, mkn::avx::test_aligned<double>(), or other allocations
can propagate and produce diagnostics; alternatively wrap the body in a
try/catch that logs the exception and returns a non-zero exit code if you prefer
explicit handling.

test/test_unaligned.cpp (1)

40-48: main() marked noexcept will terminate on exceptions.

If any exception escapes (e.g., from allocation failure), std::terminate() is called. This may be intentional given the test uses abort_if_not, but it prevents exception diagnostics.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@test/test_unaligned.cpp` around lines 40 - 48, Remove the noexcept specifier
from main (in test_unaligned.cpp) so that exceptions can propagate and produce
diagnostics, or wrap the body of main() (which constructs Timer and calls
mkn::avx::test_unaligned<double>()) in a try/catch that logs the exception and
returns a non-zero exit code; modify the function signature from int main()
noexcept to int main() (or add a catch(...) that calls std::terminate/logs) and
ensure error paths from Timer and mkn::avx::test_unaligned<double>() are
reported rather than causing silent termination.

inc/mkn/avx/def.hpp (1)

89-105: Consider using #else to avoid unreachable code warnings.

When MKN_AVX_FORCE_N is defined, the return at line 93 makes lines 96-104 unreachable. While functionally correct, some compilers may emit warnings.

Proposed refactor

     template<typename AT>
     std::uint16_t static constexpr N()
     {
 `#if` defined(MKN_AVX_FORCE_N)
         return MKN_AVX_FORCE_N;
-#endif
-
+#else
         using T = std::decay_t<AT>;
         if constexpr (AVX512)
             return 512 / 8 / sizeof(T);
         else if constexpr (AVX2)
             return 256 / 8 / sizeof(T);
         else if constexpr (AVX)
             return 128 / 8 / sizeof(T);
         else
             return 1;
+#endif
     }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@inc/mkn/avx/def.hpp` around lines 89 - 105, The template function N() returns
MKN_AVX_FORCE_N under the `#if` defined(MKN_AVX_FORCE_N) path which makes the
remaining constexpr branches (AVX512/AVX2/AVX/else) unreachable and can trigger
compiler warnings; change the preprocessor guard to use an `#if`
defined(MKN_AVX_FORCE_N) ... return MKN_AVX_FORCE_N; `#else` ... `#endif` (or an
`#elif` chain) so the AVX512/AVX2/AVX branches are compiled only when
MKN_AVX_FORCE_N is not defined, leaving the function N()’s constexpr logic
intact and eliminating unreachable-code diagnostics.

inc/mkn/avx/array.hpp (2)

35-37: Clarify or remove the duplicate span.hpp include.

Line 35 includes "mkn/avx/span.hpp" and line 37 includes "span.hpp". If both resolve to the same header, the second is redundant. If line 37 is intended to include <span> from the standard library, use angle brackets instead.

♻️ Suggested fix (if redundant)

 `#include` "mkn/avx/span.hpp"
 
-#include "span.hpp"
 
 `#include` <array>

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@inc/mkn/avx/array.hpp` around lines 35 - 37, The file currently includes
"mkn/avx/span.hpp" and also "span.hpp"; determine if "span.hpp" is the same
header or intended to be the C++ standard <span> header and then either remove
the redundant "span.hpp" include (if it resolves to the same mkn/avx/span.hpp)
or change it to `#include` <span> (angle brackets) if the intent was to include
the standard library span; update the include accordingly so only the correct,
non-duplicate header remains and rebuild to confirm no missing symbols from
mkn::avx::span or std::span.

124-169: Inconsistent make_span template argument usage.

Line 127 explicitly specifies make_span<N>(arr), while lines 141, 155, and 169 rely on deduction with make_span(arr). For consistency and to ensure the correct N is used, consider using explicit template arguments throughout.

♻️ Suggested fix

     auto inline operator-(Arr::arr_t const& arr) const
     {
         Array ret{};
-        ret.sub(*this, make_span(arr));
+        ret.sub(*this, make_span<N>(arr));
         return ret;
     }

Apply similarly to lines 155 and 169.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@inc/mkn/avx/array.hpp` around lines 124 - 169, The code uses explicit
template argument make_span<N>(arr) in operator+(Arr::arr_t const& arr) but
relies on deduction in the other analogous overloads; update the Arr::arr_t
overloads for operator*(Arr::arr_t const& arr) and operator/(Arr::arr_t const&
arr) (and any other arr overloads like the one in operator-) to call
make_span<N>(arr) instead of make_span(arr) so the fixed size N is used
consistently when constructing the temporary span for Array::mul, Array::div,
Array::sub, etc.; locate the overloads by their signatures (operator+(Arr::arr_t
const&), operator-(Arr::arr_t const&), operator*(Arr::arr_t const&),
operator/(Arr::arr_t const&)) and replace make_span(arr) with make_span<N>(arr).

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@inc/mkn/avx/types.hpp`:
- Around line 129-142: Add the missing fused multiply-add member to the AVX-512
double specialization Type_<double, 8> by defining a static inline fma callable
that forwards to the AVX-512 intrinsic _mm512_fmadd_pd (matching the pattern
used for add/sub/mul/div); e.g. add a line like: auto const static inline fma =
[](auto&&... v) { return _mm512_fmadd_pd(v...); }; so the free fma() that
delegates to Type_::fma for double vectors compiles.

---

Nitpick comments:
In `@inc/mkn/avx/array.hpp`:
- Around line 35-37: The file currently includes "mkn/avx/span.hpp" and also
"span.hpp"; determine if "span.hpp" is the same header or intended to be the C++
standard <span> header and then either remove the redundant "span.hpp" include
(if it resolves to the same mkn/avx/span.hpp) or change it to `#include` <span>
(angle brackets) if the intent was to include the standard library span; update
the include accordingly so only the correct, non-duplicate header remains and
rebuild to confirm no missing symbols from mkn::avx::span or std::span.
- Around line 124-169: The code uses explicit template argument
make_span<N>(arr) in operator+(Arr::arr_t const& arr) but relies on deduction in
the other analogous overloads; update the Arr::arr_t overloads for
operator*(Arr::arr_t const& arr) and operator/(Arr::arr_t const& arr) (and any
other arr overloads like the one in operator-) to call make_span<N>(arr) instead
of make_span(arr) so the fixed size N is used consistently when constructing the
temporary span for Array::mul, Array::div, Array::sub, etc.; locate the
overloads by their signatures (operator+(Arr::arr_t const&),
operator-(Arr::arr_t const&), operator*(Arr::arr_t const&), operator/(Arr::arr_t
const&)) and replace make_span(arr) with make_span<N>(arr).

In `@inc/mkn/avx/def.hpp`:
- Around line 89-105: The template function N() returns MKN_AVX_FORCE_N under
the `#if` defined(MKN_AVX_FORCE_N) path which makes the remaining constexpr
branches (AVX512/AVX2/AVX/else) unreachable and can trigger compiler warnings;
change the preprocessor guard to use an `#if` defined(MKN_AVX_FORCE_N) ... return
MKN_AVX_FORCE_N; `#else` ... `#endif` (or an `#elif` chain) so the AVX512/AVX2/AVX
branches are compiled only when MKN_AVX_FORCE_N is not defined, leaving the
function N()’s constexpr logic intact and eliminating unreachable-code
diagnostics.

In `@test/test_aligned.cpp`:
- Around line 42-50: The main() function is declared noexcept which will call
std::terminate on any thrown exception (hiding allocation errors); remove the
noexcept specifier from main() in test/test_aligned.cpp (i.e., change int main()
noexcept to int main()) so exceptions from Timer,
mkn::avx::test_aligned<double>(), or other allocations can propagate and produce
diagnostics; alternatively wrap the body in a try/catch that logs the exception
and returns a non-zero exit code if you prefer explicit handling.

In `@test/test_unaligned.cpp`:
- Around line 40-48: Remove the noexcept specifier from main (in
test_unaligned.cpp) so that exceptions can propagate and produce diagnostics, or
wrap the body of main() (which constructs Timer and calls
mkn::avx::test_unaligned<double>()) in a try/catch that logs the exception and
returns a non-zero exit code; modify the function signature from int main()
noexcept to int main() (or add a catch(...) that calls std::terminate/logs) and
ensure error paths from Timer and mkn::avx::test_unaligned<double>() are
reported rather than causing silent termination.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f098b528-2ed6-43d6-b119-fcd21fd5f3ec

📥 Commits

Reviewing files that changed from the base of the PR and between 4fe51a3 and d653f21.

📒 Files selected for processing (21)

.github/workflows/build.yml
.github/workflows/build_nix.yml
README.noformat
inc/mkn/avx.hpp
inc/mkn/avx/array.hpp
inc/mkn/avx/dbg.hpp
inc/mkn/avx/def.hpp
inc/mkn/avx/lazy.hpp
inc/mkn/avx/span.hpp
inc/mkn/avx/types.hpp
inc/mkn/avx/vector.hpp
mkn.yaml
test/bench.cpp
test/bench.hpp
test/betterbench.cpp
test/betterbench.hpp
test/betterbench_avx.cpp
test/test_aligned.cpp
test/test_avx.cpp
test/test_lazy.cpp
test/test_unaligned.cpp

💤 Files with no reviewable changes (1)

.github/workflows/build_nix.yml

✅ Files skipped from review due to trivial changes (1)

test/betterbench_avx.cpp

🚧 Files skipped from review as they are similar to previous changes (9)

test/test_lazy.cpp
inc/mkn/avx.hpp
README.noformat
test/test_avx.cpp
mkn.yaml
.github/workflows/build.yml
test/bench.cpp
test/betterbench.cpp
inc/mkn/avx/dbg.hpp

coderabbitai · 2026-03-29T14:20:42Z

 template<>
 struct Type_<double, 8>
 {
-    using internal_type                = __m512d;
-    auto constexpr static add_func_ptr = &_mm512_add_pd;
-    auto constexpr static sub_func_ptr = &_mm512_sub_pd;
-    auto constexpr static mul_func_ptr = &_mm512_mul_pd;
-    auto constexpr static div_func_ptr = &_mm512_div_pd;
-    // auto constexpr static fma_func_ptr = &_mm256_fmadd_pd;
+    using internal_type                      = __m512d;
+    auto const static inline add             = [](auto&&... v) { return _mm512_add_pd(v...); };
+    auto const static inline sub             = [](auto&&... v) { return _mm512_sub_pd(v...); };
+    auto const static inline mul             = [](auto&&... v) { return _mm512_mul_pd(v...); };
+    auto const static inline div             = [](auto&&... v) { return _mm512_div_pd(v...); };
+    auto const static inline set             = [](auto&&... v) { return _mm512_store_pd(v...); };
+    auto const static inline set_v           = [](auto&&... v) { return _mm512_set1_pd(v...); };
+    auto const static inline unaligned_load  = [](auto&&... v) { return _mm512_loadu_pd(v...); };
+    auto const static inline unaligned_store = [](auto&&... v) { return _mm512_storeu_pd(v...); };
+    // auto const static inline fma = _mm256_fmadd_pd;
 };


⚠️ Potential issue | 🟠 Major

Missing fma for Type_<double, 8> (AVX-512).

The AVX-512 double specialization has fma commented out, but the free fma() function at line 398 delegates to Type_::fma. Calls to fma with double vectors of size 8 will fail to compile.

Proposed fix

template<> struct Type_<double, 8> { using internal_type = __m512d; auto const static inline add = [](auto&&... v) { return _mm512_add_pd(v...); }; auto const static inline sub = [](auto&&... v) { return _mm512_sub_pd(v...); }; auto const static inline mul = [](auto&&... v) { return _mm512_mul_pd(v...); }; auto const static inline div = [](auto&&... v) { return _mm512_div_pd(v...); }; auto const static inline set = [](auto&&... v) { return _mm512_store_pd(v...); }; auto const static inline set_v = [](auto&&... v) { return _mm512_set1_pd(v...); }; auto const static inline unaligned_load = [](auto&&... v) { return _mm512_loadu_pd(v...); }; auto const static inline unaligned_store = [](auto&&... v) { return _mm512_storeu_pd(v...); }; - // auto const static inline fma = _mm256_fmadd_pd; + auto const static inline fma = _mm512_fmadd_pd; };

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@inc/mkn/avx/types.hpp` around lines 129 - 142, Add the missing fused multiply-add member to the AVX-512 double specialization Type_<double, 8> by defining a static inline fma callable that forwards to the AVX-512 intrinsic _mm512_fmadd_pd (matching the pattern used for add/sub/mul/div); e.g. add a line like: auto const static inline fma = [](auto&&... v) { return _mm512_fmadd_pd(v...); }; so the free fma() that delegates to Type_::fma for double vectors compiles.

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (12)

mkn.yaml (1)

12-14: ⚠️ Potential issue | 🟡 Minor

Regex typo: (w*) should be (\w*).

The test pattern test/test_(w*).cpp matches zero or more literal 'w' characters. It won't match test files like test_aligned.cpp, test_avx.cpp, test_lazy.cpp, or test_unaligned.cpp.
Proposed fix
 - name: test_all
   parent: headers
-  test: test/test_(w*).cpp
+  test: test/test_(\w*).cpp
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mkn.yaml` around lines 12 - 14, The test pattern under the test rule named
"test_all" uses the incorrect regex fragment `(w*)` that matches literal 'w'
characters; update the "test" value `test/test_(w*).cpp` to use a word-character
class `\w` (i.e., `test/test_(\w*).cpp`) so it matches names like
test_aligned.cpp, test_avx.cpp, etc.; ensure the backslash is properly escaped
if the YAML parser requires it.

inc/mkn/avx/dbg.hpp (1)

71-79: ⚠️ Potential issue | 🟡 Minor

Macro token pasting with __LINE__ requires two-level indirection.

The macro __mkn_avx_count_point##__LINE__ pastes __LINE__ directly without expanding it to the actual line number first. This means all uses will create the same variable name rather than unique names per line.

Proposed fix with two-level macro indirection

 `#if` defined(MKN_AVX_COUNT_FNS)
+#define MKN_AVX_CONCAT_IMPL(x, y) x##y
+#define MKN_AVX_CONCAT(x, y) MKN_AVX_CONCAT_IMPL(x, y)
 `#define` MKN_AVX_FN_COUNTER                                                                         \
-    static mkn::avx::CountPoint __mkn_avx_count_point##__LINE__{__FILE__, __func__, __LINE__};     \
+    static mkn::avx::CountPoint MKN_AVX_CONCAT(__mkn_avx_count_point, __LINE__){__FILE__, __func__, __LINE__};     \
     mkn::avx::Counter::I()(__FILE__, __func__, __LINE__);

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@inc/mkn/avx/dbg.hpp` around lines 71 - 79, The macro MKN_AVX_FN_COUNTER uses
token pasting with __LINE__ directly which prevents __LINE__ from being
expanded, causing duplicate variable names; fix by adding a two-level paste
helper (e.g. a MKN_AVX_PASTE and MKN_AVX_PASTE2 macro) and replace the direct
paste __mkn_avx_count_point##__LINE__ with the two-level expansion (e.g.
MKN_AVX_PASTE(__mkn_avx_count_point, __LINE__)) inside MKN_AVX_FN_COUNTER so
each expansion of the static mkn::avx::CountPoint variable is unique per line
while keeping the rest of the macro (the Counter::I() call and parameters)
unchanged.

inc/mkn/avx/types.hpp (3)

225-237: ⚠️ Potential issue | 🟠 Major

Use _mm*_mullo_epi32 for 32-bit lane-wise multiplication.

_mm_mul_epi32 and _mm256_mul_epi32 only multiply the even 32-bit lanes and widen the result. These specializations should use _mm_mullo_epi32 / _mm256_mullo_epi32 to preserve the library's elementwise 32-bit semantics.

🔧 Suggested update

 template<>
 struct Type_<std::int32_t, 4>
 {
     using internal_type          = __m128i;
     auto const static inline add = [](auto&&... v) { return _mm_add_epi32(v...); };
-    auto const static inline mul = [](auto&&... v) { return _mm_mul_epi32(v...); };
+    auto const static inline mul = [](auto&&... v) { return _mm_mullo_epi32(v...); };
     // auto const static inline fma = _mm256_fmadd_ps;
 };
 template<>
 struct Type_<std::int32_t, 8>
 {
     using internal_type          = __m256i;
     auto const static inline add = [](auto&&... v) { return _mm256_add_epi32(v...); };
-    auto const static inline mul = [](auto&&... v) { return _mm256_mul_epi32(v...); };
+    auto const static inline mul = [](auto&&... v) { return _mm256_mullo_epi32(v...); };
     // auto const static inline fma = _mm256_fmadd_ps;
 };

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@inc/mkn/avx/types.hpp` around lines 225 - 237, The mul lambdas in the
specializations Type_<std::int32_t, 4> and Type_<std::int32_t, 8> use
_mm_mul_epi32/_mm256_mul_epi32 which widen and only multiply even lanes; change
those to use the lane-wise 32-bit multiplication intrinsics (_mm_mullo_epi32 for
the __m128i specialization and _mm256_mullo_epi32 for the __m256i
specialization) so the mul function preserves elementwise 32-bit semantics while
leaving add and other members unchanged.

129-141: ⚠️ Potential issue | 🟠 Major

Wire up fma for the AVX-512 double specialization.

The free mkn::avx::fma(...) helper on Lines 395-398 now dispatches through impl_type::fma, but Type_<double, 8> still comments that member out on Line 141. Any Type<double, 8> FMA call will fail to compile.

🔧 Suggested update

 template<>
 struct Type_<double, 8>
 {
     using internal_type                      = __m512d;
     auto const static inline add             = [](auto&&... v) { return _mm512_add_pd(v...); };
     auto const static inline sub             = [](auto&&... v) { return _mm512_sub_pd(v...); };
     auto const static inline mul             = [](auto&&... v) { return _mm512_mul_pd(v...); };
     auto const static inline div             = [](auto&&... v) { return _mm512_div_pd(v...); };
     auto const static inline set             = [](auto&&... v) { return _mm512_store_pd(v...); };
     auto const static inline set_v           = [](auto&&... v) { return _mm512_set1_pd(v...); };
     auto const static inline unaligned_load  = [](auto&&... v) { return _mm512_loadu_pd(v...); };
     auto const static inline unaligned_store = [](auto&&... v) { return _mm512_storeu_pd(v...); };
-    // auto const static inline fma = _mm256_fmadd_pd;
+    auto const static inline fma             = [](auto&&... v) { return _mm512_fmadd_pd(v...); };
 };

Also applies to: 395-398

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@inc/mkn/avx/types.hpp` around lines 129 - 141, The AVX-512 double
specialization Type_<double, 8> has its fma member commented out, but
mkn::avx::fma dispatches to impl_type::fma; restore and wire fma by adding a
static inline lambda that forwards to the intrinsic (e.g. auto const static
inline fma = [](auto&&... v) { return _mm512_fmadd_pd(v...); };) so
Type_<double, 8>::fma is defined and matches the other operator lambdas.

357-360: ⚠️ Potential issue | 🟠 Major

Don't write through const_casted destinations.

Lines 357-360 and 388-391 take T const* and then cast away const before storing. That makes these helpers UB as soon as a genuinely const buffer reaches them; store/unaligned_store should take T*, while unaligned_load can keep T const* and drop the cast.

🔧 Suggested update

 template<typename T, std::size_t SIZE>
-void inline store(T const* __restrict a, Type<T, SIZE> const& __restrict b) noexcept
+void inline store(T* __restrict a, Type<T, SIZE> const& __restrict b) noexcept
 {
     MKN_AVX_FN_COUNTER;
-    Type<T, SIZE>::set(const_cast<T*>(a), b());
+    Type<T, SIZE>::set(a, b());
 }
 
 template<typename T, std::size_t SIZE>
 Type<T, SIZE> unaligned_load(T const* __restrict a) noexcept
 {
     MKN_AVX_FN_COUNTER;
-    return Type<T, SIZE>::Super::impl_type::unaligned_load(const_cast<T*>(a));
+    return Type<T, SIZE>::Super::impl_type::unaligned_load(a);
 }
 
 template<typename T, std::size_t SIZE>
-void unaligned_store(T const* __restrict a, Type<T, SIZE> const& b) noexcept
+void unaligned_store(T* __restrict a, Type<T, SIZE> const& b) noexcept
 {
     MKN_AVX_FN_COUNTER;
-    Type<T, SIZE>::Super::impl_type::unaligned_store(const_cast<T*>(a), b());
+    Type<T, SIZE>::Super::impl_type::unaligned_store(a, b());
 }

Also applies to: 381-391

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@inc/mkn/avx/types.hpp` around lines 357 - 360, The helpers store and
unaligned_store must not cast away const: change their parameter types from T
const* to T* and update their signatures (e.g., store(T* __restrict a, Type<T,
SIZE> const& __restrict b) noexcept and likewise for unaligned_store) so you can
pass the destination pointer directly to Type<T, SIZE>::set without using
const_cast; leave unaligned_load as T const* and remove any const_cast there as
well. Ensure all callers are updated to pass non-const pointers to
store/unaligned_store and that Type<T, SIZE>::set is invoked with the original
pointer (no const_cast) in the store/unaligned_store/unaligned_load
implementations.

inc/mkn/avx/array.hpp (1)

221-229: ⚠️ Potential issue | 🔴 Critical

Decay T1 in free operator* before writing into the result.

make_span(const container) can still produce Span<const T, N>. With Array<T1, N> on Line 225, the writes on Lines 226-227 target a const-qualified result. operator+ already fixes this with std::decay_t<T1>; operator* needs the same treatment.
🔧 Suggested update
 template<typename T0, typename T1, std::size_t N>
 auto inline operator*(mkn::avx::Span<T0, N> const& __restrict s0,
                       mkn::avx::Span<T1, N> const& __restrict s1)
 {
-    mkn::avx::Array<T1, N> ret{std::nullopt};
+    mkn::avx::Array<std::decay_t<T1>, N> ret{std::nullopt};
     *ret = s0;
     ret *= s1;
     return ret;
 }
Based on learnings, the separate T0/T1 parameters here are used for const/non-const variation rather than different element types.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@inc/mkn/avx/array.hpp` around lines 221 - 229, The free operator* uses
Array<T1, N> which can produce a const-qualified element type when T1 is a const
type; change the result container to use a decayed element type like
std::decay_t<T1> (mirror the fix in operator+) so writes to ret (in
operator*(mkn::avx::Span<T0, N> const& s0, mkn::avx::Span<T1, N> const& s1)) are
allowed; update the Array instantiation (and any related assignments) to
Array<std::decay_t<T1>, N> to ensure non-const storage while preserving existing
Span usage.

inc/mkn/avx/span.hpp (5)

523-579: ⚠️ Potential issue | 🔴 Critical

UnSpanSet still falls back to aligned machinery for most operations.

Only operator+=(UnSpanSet const&) is specialized here. Assignment, scalar ops, -=, *=, /=, add/sub/mul/div/fma, and const element access still inherit the aligned Span/SpanSet paths, and the const factory on Lines 628-631 still returns SpanSet, so read-only unaligned containers never reach the unaligned implementation.

Also applies to: 628-631

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@inc/mkn/avx/span.hpp` around lines 523 - 579, UnSpanSet only overrides
operator+=(UnSpanSet const&) and non-const operator[], so most operations still
use aligned Span/SpanSet implementations; update UnSpanSet and UnSpan to provide
unaligned implementations for all remaining operations (assignment, const
operator[], compound operators -=, *=, /=, scalar ops, add/sub/mul/div/fma
helpers) and ensure the const factory that currently returns SpanSet returns
UnSpanSet for unaligned/read-only containers; locate and modify the classes
UnSpan, UnSpanSet, their operator[]/operator overloads and the const factory
that constructs SpanSet to instead construct and return UnSpanSet so all code
paths use the unaligned AVX_t/unaligned_load/unaligned_store paths consistently.

584-626: ⚠️ Potential issue | 🔴 Critical

Centralize span selection on the actual pointer, not just the container type.

is_aligned(Container{}) only recognizes std::vector, so make_span(std::array) now has no viable alignment probe. On top of that, the explicit-N and sliced factories hard-code Span/SpanSet and never reconsider alignment after start, even though container.data() + start can be unaligned. This breaks the make_span(arr) calls added in inc/mkn/avx/array.hpp and still misclassifies offset views.

Also applies to: 633-655

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@inc/mkn/avx/span.hpp` around lines 584 - 626, The current alignment probing
uses is_aligned(Container{}) which only detects vector+AlignedAllocator and
ignores std::array and per-offset alignment; update alignment logic to probe the
actual data pointer (container.data() or pointer argument) plus the start offset
and make decisions from that pointer's alignment rather than the container type.
Specifically, add or overload an is_aligned_from_ptr(fn) that accepts a raw
pointer and/or pointer+offset and checks reinterpret_cast<std::uintptr_t>(ptr +
start) % alignment == 0 (or uses kul::AlignedAllocator trait when available),
then change make_span<std::size_t N, Container> and the non-size templates to
compute auto* p = container.data() + start (or container + start for raw
pointers/arrays) and choose Span vs UnSpan (and SpanSet vs UnSpanSet) based on
is_aligned_from_ptr(p) so std::array and offset slices are classified correctly;
ensure the const overloads mirror the same pointer-based check and preserve the
N template behavior.

158-163: ⚠️ Potential issue | 🔴 Critical

Normalize span-like inputs before feeding them into cast().

Lines 160/177/193 build Span<T0 const>{arr}, but Span only exposes pointer/copy/move constructors in this header. Lines 217-218 have the same problem in reverse: cast() only has caster() overloads for Span/SpanSet, so passing an arbitrary span-like that still fails to instantiate.

Also applies to: 175-180, 191-196, 212-225

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@inc/mkn/avx/span.hpp` around lines 158 - 163, Normalize any span-like inputs
before calling cast(): replace ad-hoc brace construction like Span<T0
const>{arr} with an explicit conversion to a real Span (e.g., call a small
helper normalize_to_span(arr) or use Span's pointer/size constructor) so cast()
and its caster() overloads for Span/SpanSet can instantiate; update operator+=
(the inline operator+= that builds Span<T0 const>{arr}), plus the other
occurrences that build temporary Span from arbitrary span-like types, to first
produce a concrete Span or add caster overloads that accept the span-like
wrapper and forward to Span/SpanSet.

399-410: ⚠️ Potential issue | 🟠 Major

Handle the remainder in the fixed-array and scalar compound ops.

These overloads stop at size() / N full vectors, so any size() % N tail is left unchanged. The scalar paths on Lines 406-410, 449-453, and 472-476 inherit the same bug because they delegate through the broken fixed-array overloads.

Also applies to: 423-429, 442-453, 465-476

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@inc/mkn/avx/span.hpp` around lines 399 - 410, The vectorized
compound-assignment overloads (auto operator+=(Arr<T0, N> const& arr) noexcept
and auto operator+=(T const& val) noexcept) only process full N-sized blocks
(size() / N) and leave the size() % N remainder unchanged; fix by computing
full_blocks = size() / N, run the existing vectorized loop over full_blocks,
then handle the tail with a scalar loop from full_blocks * N to size() applying
elementwise addition (for Arr<T0,N> use the original element indices from the
input arr/span, for T const& val add the scalar). Update both
operator+=(Arr<T0,N>) and operator+=(T const&) (and any other compound ops that
delegate to them) to include the remainder loop; use the existing SpanSet/cast
results for the block loop and plain indexing for the tail.

68-85: ⚠️ Potential issue | 🔴 Critical

Drop the union-based pointer punning in Span.

Line 84 initializes u.arr, then Lines 80-81 read back through u.i. Standard C++ does not define that union member switch, so the aligned SIMD view here depends on UB and also sidesteps const-correctness for Span<const T, N>.

🔧 Suggested direction

-    union U
-    {
-        T* arr;
-        mkn::avx::Type<R, N>* i;
-
-        auto data() { return arr; }
-        auto data() const { return arr; }
-        auto constexpr static size() { return N; }
-    };
-
-    U u;
-
-    auto& avx() { return u.i; }
-    auto& avx() const { return u.i; }
+    T* ptr;
+
+    auto avx() noexcept
+    {
+        return reinterpret_cast<mkn::avx::Type<R, N>*>(ptr);
+    }
+
+    auto avx() const noexcept
+    {
+        using avx_ptr_t =
+            std::conditional_t<std::is_const_v<T>, mkn::avx::Type<R, N> const*, mkn::avx::Type<R, N>*>;
+        return reinterpret_cast<avx_ptr_t>(ptr);
+    }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@inc/mkn/avx/span.hpp` around lines 68 - 85, The union-based pointer punning
(union U with members arr and i, instance u) is undefined behavior and breaks
const-correctness when reading avx() after initializing u.arr in Span(T* d,
std::size_t const s). Replace the union with a single data pointer member (e.g.
a T* data_ field), initialize data_ in the Span constructor, and implement avx()
/ avx() const to return a reinterpret_cast (or std::bit_cast wrapper) of data_
to mkn::avx::Type<R, N>* / const mkn::avx::Type<R, N>* respectively; remove
reads/writes of u.arr and u.i and ensure all uses reference the new data_ member
for correct, well-defined behavior and const-correctness.

test/betterbench_avx.cpp (1)

31-47: ⚠️ Potential issue | 🔴 Critical

Switch this benchmark to the renamed Type_ callables.

Lines 31-47 still reference set_v_func_ptr, mul_func_ptr, and add_func_ptr, but inc/mkn/avx/types.hpp now only exposes set_v, mul, and add. This file will not build until those callsites are updated.

🔧 Suggested update

-        T const t0 = AVX_t::set_v_func_ptr(2);
+        T const t0 = AVX_t::set_v(2);

-        a = AVX_t::set_v_func_ptr(10);
-        c = AVX_t::set_v_func_ptr(400);
+        a = AVX_t::set_v(10);
+        c = AVX_t::set_v(400);

-        T const t1 = AVX_t::set_v_func_ptr(2);
-        T const t3 = AVX_t::mul_func_ptr(t0, t1);
-        b          = AVX_t::add_func_ptr(b, t3);
+        T const t1 = AVX_t::set_v(2);
+        T const t3 = AVX_t::mul(t0, t1);
+        b          = AVX_t::add(b, t3);

-        c    = AVX_t::add_func_ptr(c, t1);
-        T t4 = AVX_t::set_v_func_ptr(100);
-        t4   = AVX_t::mul_func_ptr(t4, t3); // t4 *= t3;
-        d    = AVX_t::add_func_ptr(a, AVX_t::mul_func_ptr(b, c));
-        d    = AVX_t::add_func_ptr(d, t4); // d += t4;
-        d    = AVX_t::mul_func_ptr(d, a);  // d *= a;
-        d    = AVX_t::mul_func_ptr(d, b);  // d *= b;
-        d    = AVX_t::mul_func_ptr(d, c);  // d *= c;
+        c    = AVX_t::add(c, t1);
+        T t4 = AVX_t::set_v(100);
+        t4   = AVX_t::mul(t4, t3); // t4 *= t3;
+        d    = AVX_t::add(a, AVX_t::mul(b, c));
+        d    = AVX_t::add(d, t4); // d += t4;
+        d    = AVX_t::mul(d, a);  // d *= a;
+        d    = AVX_t::mul(d, b);  // d *= b;
+        d    = AVX_t::mul(d, c);  // d *= c;

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@test/betterbench_avx.cpp` around lines 31 - 47, The test is using old
callable names (set_v_func_ptr, mul_func_ptr, add_func_ptr) that were renamed;
update all calls on AVX_t to the new symbols set_v, mul, and add (e.g., replace
AVX_t::set_v_func_ptr(...) → AVX_t::set_v(...), AVX_t::mul_func_ptr(...) →
AVX_t::mul(...), AVX_t::add_func_ptr(...) → AVX_t::add(...)) for the expressions
that construct t0, t1, t3, t4 and the uses in a, b, c, d so the file builds with
the renamed Type_ callables.

🧹 Nitpick comments (1)

inc/mkn/avx/def.hpp (1)

89-105: Missing return or conditional structure issue with MKN_AVX_FORCE_N.

When MKN_AVX_FORCE_N is defined, the preprocessor includes the return MKN_AVX_FORCE_N; statement, but the rest of the function body (lines 96-105) is still compiled and becomes unreachable code. This pattern works but generates compiler warnings about unreachable code.

Consider wrapping the remaining logic in an #else block for clarity:

Proposed fix

     template<typename AT>
     std::uint16_t static constexpr N()
     {
 `#if` defined(MKN_AVX_FORCE_N)
         return MKN_AVX_FORCE_N;
-#endif
-
+#else
         using T = std::decay_t<AT>;
         if constexpr (AVX512)
             return 512 / 8 / sizeof(T);
         else if constexpr (AVX2)
             return 256 / 8 / sizeof(T);
         else if constexpr (AVX)
             return 128 / 8 / sizeof(T);
         else
             return 1;
+#endif
     }

The same pattern applies to ALIGN() at lines 107-121.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@inc/mkn/avx/def.hpp` around lines 89 - 105, The preprocessor early-return in
template function N() (and similarly ALIGN()) leaves the remaining logic
compiled and causes unreachable-code warnings; change the pattern by wrapping
the fallback logic in an `#else` ... `#endif` block so that when MKN_AVX_FORCE_N is
defined the function contains only the return MKN_AVX_FORCE_N; and when it is
not defined the existing using T = std::decay_t<AT>; if constexpr (AVX512) ...
else ... logic is compiled; apply the same `#if/`#else/#endif structure around the
ALIGN() implementation to eliminate unreachable branches.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/test_unaligned.cpp`:
- Around line 31-40: The test constructs both UnSpanSet<T, 4> a and b from v0 so
b += a mutates v0 while the assertions check v1; fix by constructing b from v1
(replace UnSpanSet<T, 4> b{v0} with UnSpanSet<T, 4> b{v1}) so b += a operates on
v1 as intended, or alternatively change the abort_if_not checks to assert
v0.front()/back() if the intent was to mutate v0; update the code referencing
UnSpanSet, a, b, v0, v1 and the b += a operation accordingly.

---

Duplicate comments:
In `@inc/mkn/avx/array.hpp`:
- Around line 221-229: The free operator* uses Array<T1, N> which can produce a
const-qualified element type when T1 is a const type; change the result
container to use a decayed element type like std::decay_t<T1> (mirror the fix in
operator+) so writes to ret (in operator*(mkn::avx::Span<T0, N> const& s0,
mkn::avx::Span<T1, N> const& s1)) are allowed; update the Array instantiation
(and any related assignments) to Array<std::decay_t<T1>, N> to ensure non-const
storage while preserving existing Span usage.

In `@inc/mkn/avx/dbg.hpp`:
- Around line 71-79: The macro MKN_AVX_FN_COUNTER uses token pasting with
__LINE__ directly which prevents __LINE__ from being expanded, causing duplicate
variable names; fix by adding a two-level paste helper (e.g. a MKN_AVX_PASTE and
MKN_AVX_PASTE2 macro) and replace the direct paste
__mkn_avx_count_point##__LINE__ with the two-level expansion (e.g.
MKN_AVX_PASTE(__mkn_avx_count_point, __LINE__)) inside MKN_AVX_FN_COUNTER so
each expansion of the static mkn::avx::CountPoint variable is unique per line
while keeping the rest of the macro (the Counter::I() call and parameters)
unchanged.

In `@inc/mkn/avx/span.hpp`:
- Around line 523-579: UnSpanSet only overrides operator+=(UnSpanSet const&) and
non-const operator[], so most operations still use aligned Span/SpanSet
implementations; update UnSpanSet and UnSpan to provide unaligned
implementations for all remaining operations (assignment, const operator[],
compound operators -=, *=, /=, scalar ops, add/sub/mul/div/fma helpers) and
ensure the const factory that currently returns SpanSet returns UnSpanSet for
unaligned/read-only containers; locate and modify the classes UnSpan, UnSpanSet,
their operator[]/operator overloads and the const factory that constructs
SpanSet to instead construct and return UnSpanSet so all code paths use the
unaligned AVX_t/unaligned_load/unaligned_store paths consistently.
- Around line 584-626: The current alignment probing uses
is_aligned(Container{}) which only detects vector+AlignedAllocator and ignores
std::array and per-offset alignment; update alignment logic to probe the actual
data pointer (container.data() or pointer argument) plus the start offset and
make decisions from that pointer's alignment rather than the container type.
Specifically, add or overload an is_aligned_from_ptr(fn) that accepts a raw
pointer and/or pointer+offset and checks reinterpret_cast<std::uintptr_t>(ptr +
start) % alignment == 0 (or uses kul::AlignedAllocator trait when available),
then change make_span<std::size_t N, Container> and the non-size templates to
compute auto* p = container.data() + start (or container + start for raw
pointers/arrays) and choose Span vs UnSpan (and SpanSet vs UnSpanSet) based on
is_aligned_from_ptr(p) so std::array and offset slices are classified correctly;
ensure the const overloads mirror the same pointer-based check and preserve the
N template behavior.
- Around line 158-163: Normalize any span-like inputs before calling cast():
replace ad-hoc brace construction like Span<T0 const>{arr} with an explicit
conversion to a real Span (e.g., call a small helper normalize_to_span(arr) or
use Span's pointer/size constructor) so cast() and its caster() overloads for
Span/SpanSet can instantiate; update operator+= (the inline operator+= that
builds Span<T0 const>{arr}), plus the other occurrences that build temporary
Span from arbitrary span-like types, to first produce a concrete Span or add
caster overloads that accept the span-like wrapper and forward to Span/SpanSet.
- Around line 399-410: The vectorized compound-assignment overloads (auto
operator+=(Arr<T0, N> const& arr) noexcept and auto operator+=(T const& val)
noexcept) only process full N-sized blocks (size() / N) and leave the size() % N
remainder unchanged; fix by computing full_blocks = size() / N, run the existing
vectorized loop over full_blocks, then handle the tail with a scalar loop from
full_blocks * N to size() applying elementwise addition (for Arr<T0,N> use the
original element indices from the input arr/span, for T const& val add the
scalar). Update both operator+=(Arr<T0,N>) and operator+=(T const&) (and any
other compound ops that delegate to them) to include the remainder loop; use the
existing SpanSet/cast results for the block loop and plain indexing for the
tail.
- Around line 68-85: The union-based pointer punning (union U with members arr
and i, instance u) is undefined behavior and breaks const-correctness when
reading avx() after initializing u.arr in Span(T* d, std::size_t const s).
Replace the union with a single data pointer member (e.g. a T* data_ field),
initialize data_ in the Span constructor, and implement avx() / avx() const to
return a reinterpret_cast (or std::bit_cast wrapper) of data_ to
mkn::avx::Type<R, N>* / const mkn::avx::Type<R, N>* respectively; remove
reads/writes of u.arr and u.i and ensure all uses reference the new data_ member
for correct, well-defined behavior and const-correctness.

In `@inc/mkn/avx/types.hpp`:
- Around line 225-237: The mul lambdas in the specializations
Type_<std::int32_t, 4> and Type_<std::int32_t, 8> use
_mm_mul_epi32/_mm256_mul_epi32 which widen and only multiply even lanes; change
those to use the lane-wise 32-bit multiplication intrinsics (_mm_mullo_epi32 for
the __m128i specialization and _mm256_mullo_epi32 for the __m256i
specialization) so the mul function preserves elementwise 32-bit semantics while
leaving add and other members unchanged.
- Around line 129-141: The AVX-512 double specialization Type_<double, 8> has
its fma member commented out, but mkn::avx::fma dispatches to impl_type::fma;
restore and wire fma by adding a static inline lambda that forwards to the
intrinsic (e.g. auto const static inline fma = [](auto&&... v) { return
_mm512_fmadd_pd(v...); };) so Type_<double, 8>::fma is defined and matches the
other operator lambdas.
- Around line 357-360: The helpers store and unaligned_store must not cast away
const: change their parameter types from T const* to T* and update their
signatures (e.g., store(T* __restrict a, Type<T, SIZE> const& __restrict b)
noexcept and likewise for unaligned_store) so you can pass the destination
pointer directly to Type<T, SIZE>::set without using const_cast; leave
unaligned_load as T const* and remove any const_cast there as well. Ensure all
callers are updated to pass non-const pointers to store/unaligned_store and that
Type<T, SIZE>::set is invoked with the original pointer (no const_cast) in the
store/unaligned_store/unaligned_load implementations.

In `@mkn.yaml`:
- Around line 12-14: The test pattern under the test rule named "test_all" uses
the incorrect regex fragment `(w*)` that matches literal 'w' characters; update
the "test" value `test/test_(w*).cpp` to use a word-character class `\w` (i.e.,
`test/test_(\w*).cpp`) so it matches names like test_aligned.cpp, test_avx.cpp,
etc.; ensure the backslash is properly escaped if the YAML parser requires it.

In `@test/betterbench_avx.cpp`:
- Around line 31-47: The test is using old callable names (set_v_func_ptr,
mul_func_ptr, add_func_ptr) that were renamed; update all calls on AVX_t to the
new symbols set_v, mul, and add (e.g., replace AVX_t::set_v_func_ptr(...) →
AVX_t::set_v(...), AVX_t::mul_func_ptr(...) → AVX_t::mul(...),
AVX_t::add_func_ptr(...) → AVX_t::add(...)) for the expressions that construct
t0, t1, t3, t4 and the uses in a, b, c, d so the file builds with the renamed
Type_ callables.

---

Nitpick comments:
In `@inc/mkn/avx/def.hpp`:
- Around line 89-105: The preprocessor early-return in template function N()
(and similarly ALIGN()) leaves the remaining logic compiled and causes
unreachable-code warnings; change the pattern by wrapping the fallback logic in
an `#else` ... `#endif` block so that when MKN_AVX_FORCE_N is defined the function
contains only the return MKN_AVX_FORCE_N; and when it is not defined the
existing using T = std::decay_t<AT>; if constexpr (AVX512) ... else ... logic is
compiled; apply the same `#if/`#else/#endif structure around the ALIGN()
implementation to eliminate unreachable branches.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d1f0e7ef-5edc-4443-9106-3c2c0bb70d05

📥 Commits

Reviewing files that changed from the base of the PR and between d653f21 and d2e6993.

📒 Files selected for processing (21)

.github/workflows/build.yml
.github/workflows/build_nix.yml
README.noformat
inc/mkn/avx.hpp
inc/mkn/avx/array.hpp
inc/mkn/avx/dbg.hpp
inc/mkn/avx/def.hpp
inc/mkn/avx/lazy.hpp
inc/mkn/avx/span.hpp
inc/mkn/avx/types.hpp
inc/mkn/avx/vector.hpp
mkn.yaml
test/bench.cpp
test/bench.hpp
test/betterbench.cpp
test/betterbench.hpp
test/betterbench_avx.cpp
test/test_aligned.cpp
test/test_avx.cpp
test/test_lazy.cpp
test/test_unaligned.cpp

💤 Files with no reviewable changes (1)

.github/workflows/build_nix.yml

✅ Files skipped from review due to trivial changes (2)

.github/workflows/build.yml
test/betterbench.hpp

🚧 Files skipped from review as they are similar to previous changes (6)

inc/mkn/avx.hpp
test/test_lazy.cpp
README.noformat
inc/mkn/avx/lazy.hpp
test/test_avx.cpp
test/bench.cpp

coderabbitai

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

inc/mkn/avx/span.hpp (1)
158-195: ⚠️ Potential issue | 🔴 Critical

Fix ill-formed std::array overloads that cannot construct Span from a container.

The std::array overloads at lines 166-171, 181-186, and 196-201 attempt Span<T0 const> that{arr};, but this is invalid. Span has only pointer-based and copy/move constructors—no constructor accepts container arguments. This code will not compile.

Use SpanSet<T0 const, N>{arr} instead, which provides the span-like container constructor and properly preserves the fixed size N.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@inc/mkn/avx/span.hpp` around lines 158 - 195, The std::array overloads of
operator+=, operator-= and operator*= use invalid construction Span<T0 const>
that{arr}; — replace those constructions with SpanSet<T0 const, N>{arr} so the
container-based constructor exists and preserves fixed size N; update the three
sites referenced by the array overloads (the operator+= overload taking
std::array, the operator-=(std::array...) overload, and the
operator*=(std::array...) overload) to create that via SpanSet<T0 const, N>{arr}
and then keep the existing cast(*this, that) usage and subsequent element ops
unchanged.

♻️ Duplicate comments (12)

inc/mkn/avx/span.hpp (5)

213-218: ⚠️ Potential issue | 🔴 Critical

The is_span_like_v assignment branch still won't compile for containers.

cast(that) only accepts Span<T0, N> inputs. When that is a std::vector, std::array, or other non-Span span-like type, this branch is selected and then fails during overload resolution. Wrap that in a SpanSet<..., N> before calling cast.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@inc/mkn/avx/span.hpp` around lines 213 - 218, The is_span_like_v branch in
operator=(T0 const& that) selects for non-Span containers but calls cast(that)
which only accepts Span<T0, N>; wrap the incoming that in a SpanSet (e.g.,
SpanSet<T0, N>(that)) before calling cast so overload resolution succeeds for
std::vector/std::array and other span-like types, then proceed to extract v1 and
call store(data(), v1[0]) as before (update the cast call site inside operator=
accordingly).

543-579: ⚠️ Potential issue | 🟠 Major

UnSpanSet still falls back to aligned code for most operations.

Only operator+=(UnSpanSet const&) uses unaligned_load/unaligned_store. Assignment, scalar ops, -=, *=, /=, and even const element access still inherit the aligned SpanSet implementation.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@inc/mkn/avx/span.hpp` around lines 543 - 579, UnSpanSet currently only
overrides operator+= and still inherits aligned implementations from SpanSet for
assignment, scalar ops, operator-=, operator*=, operator/= and const element
access; add unaligned-aware overrides (e.g., operator=, operator-=, operator*=,
operator/=, scalar add/mul/etc. and const operator[]) that mirror the vectorized
loop in UnSpanSet::operator+=: iterate by N-sized blocks, use
unaligned_load<R,N> to read AVX_t values, perform the operation using AVX_t
(AVX_t alias provided), then write back with unaligned_store, and handle
leftover elements with the modulo_leftover_idx() fallback to element-wise ops —
this makes all core operations use unaligned memory paths rather than falling
back to SpanSet's aligned code.

68-85: ⚠️ Potential issue | 🟠 Major

Avoid union-based pointer punning here.

The constructor initializes u.arr, but the AVX accessors read u.i. That active-member swap is undefined C++, and it also drops constness for Span<const T, N>.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@inc/mkn/avx/span.hpp` around lines 68 - 85, The union U with members arr and
i causes undefined pointer-punning and loses constness when Span(T* d,
std::size_t s) initializes u.arr but avx() reads u.i; replace the union with
explicit separate members (e.g. T* arr; mkn::avx::Type<R, N>* i;) or, better,
remove the stored avx pointer entirely and make avx()/avx() const compute/return
a correctly cast pointer from arr (using reinterpret_cast or std::launder as
appropriate) so the active member is not swapped and const-correctness is
preserved; update the Span constructor to initialize both members if kept, or
leave only arr and let avx() produce the AVX pointer on demand, and ensure avx()
const returns a pointer-to-const for Span<const T, N>.

584-655: ⚠️ Potential issue | 🔴 Critical

The new factory layer is still inconsistent for non-vector and explicit-N inputs.

make_span(container) decides alignment via is_aligned<Container>(), but that overload set only handles std::vector, so other span-like containers fail there. This already affects the std::array call sites in inc/mkn/avx/array.hpp at Lines 141, 155, and 169. The explicit-N, const, and sliced overloads then hardcode Span/SpanSet, so supported unaligned call sites bypass UnSpan entirely.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@inc/mkn/avx/span.hpp` around lines 584 - 655, is_aligned only recognizes
std::vector with kul::AlignedAllocator so make_span(Container) misclassifies
other span-like types (e.g. std::array) and the explicit-N/const/sliced
overloads then hardcode Span/SpanSet instead of using UnSpan/UnSpanSet. Fix by
broadening is_aligned to detect span-like and array/pointer-backed containers
(use mkn::kul::is_span_like_v, std::is_array/std::is_pointer, and a
std::array<T,N> overload) and/or provide a generic fallback that inspects
Container::allocator_type when present; then change the explicit-N overloads and
the const/sliced make_span templates (the templates named make_span(Container&),
make_span(Container const&), and make_span<std::size_t N, typename Container>)
to choose between Span/SpanSet and UnSpan/UnSpanSet based on
is_aligned<Container>() instead of hardcoding Span/SpanSet.

398-476: ⚠️ Potential issue | 🟠 Major

The array/scalar compound ops still skip the tail.

These overloads only process size() / N full blocks. For non-multiple lengths, the final size() % N elements are left unchanged in +=, -=, *=, and /=.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@inc/mkn/avx/span.hpp` around lines 398 - 476, The Arr-based compound
operators (operator+=(Arr<T0,N>), operator-=(Arr<T0,N>), operator*=(Arr<T0,N>),
operator/=(Arr<T0,N>)) only iterate full SIMD blocks (size()/N) and therefore
skip the final leftover elements; update each of these functions (the ones that
construct SpanSet<T0 const, N> that and call cast(*this, that)) to add a tail
loop after the block loop that runs from modulo_leftover_idx() to size() and
performs element-wise ops using (*this)[i] and that[i] (mirror the leftover
handling used in the SpanSet overloads), leaving the scalar overloads
(operator+=(T), operator*=(T), operator/=(T)) as-is since they delegate to these
Arr overloads. Ensure you reference cast, SpanSet, modulo_leftover_idx(), and
the specific operator overload names when making the change.

inc/mkn/avx/types.hpp (4)

225-237: ⚠️ Potential issue | 🟠 Major

These int32_t multiplies use the wrong intrinsics.

_mm_mul_epi32 and _mm256_mul_epi32 only multiply even 32-bit lanes and widen to 64-bit. The elementwise API here needs _mm_mullo_epi32 and _mm256_mullo_epi32.

Suggested fix

 template<>
 struct Type_<std::int32_t, 4>
 {
     using internal_type          = __m128i;
     auto const static inline add = [](auto&&... v) { return _mm_add_epi32(v...); };
-    auto const static inline mul = [](auto&&... v) { return _mm_mul_epi32(v...); };
+    auto const static inline mul = [](auto&&... v) { return _mm_mullo_epi32(v...); };
     // auto const static inline fma = _mm256_fmadd_ps;
 };
 template<>
 struct Type_<std::int32_t, 8>
 {
     using internal_type          = __m256i;
     auto const static inline add = [](auto&&... v) { return _mm256_add_epi32(v...); };
-    auto const static inline mul = [](auto&&... v) { return _mm256_mul_epi32(v...); };
+    auto const static inline mul = [](auto&&... v) { return _mm256_mullo_epi32(v...); };
     // auto const static inline fma = _mm256_fmadd_ps;
 };

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@inc/mkn/avx/types.hpp` around lines 225 - 237, The mul lambdas in
Type_<std::int32_t, 4> and Type_<std::int32_t, 8> use the widening intrinsics
_mm_mul_epi32 and _mm256_mul_epi32 which only multiply even lanes; change those
to the elementwise, low-lane intrinsics _mm_mullo_epi32 and _mm256_mullo_epi32
respectively so the mul function performs per-element 32-bit multiplication;
update the mul definitions inside the Type_<std::int32_t,4> and
Type_<std::int32_t,8> structs to call the correct intrinsics.

371-376: ⚠️ Potential issue | 🟠 Major

load() cannot stay noexcept while it throws.

If KEXCEPTION("load") runs here, noexcept turns it into std::terminate(). Drop noexcept or make the failure path non-throwing.

Suggested fix

 template<typename T, std::size_t SIZE>
-auto inline load(T const* __restrict a, Type<T, SIZE> const& __restrict b) noexcept
+auto inline load(T const* __restrict a, Type<T, SIZE> const& __restrict b)
 {
     MKN_AVX_FN_COUNTER;
     KEXCEPTION("load");
     // Type<T, SIZE>::load(const_cast<T*>(a), b());
 }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@inc/mkn/avx/types.hpp` around lines 371 - 376, The function template load(T
const* __restrict a, Type<T, SIZE> const& __restrict b) is marked noexcept but
calls KEXCEPTION("load") which can throw, causing std::terminate; remove
noexcept from the function signature (or change the error path to a non-throwing
alternative such as logging/assertion) so the function's exception behavior
matches its declaration; update the declaration of load and any related
overloads/usages of Type<T,SIZE>::load and references to KEXCEPTION in this
context to keep behavior consistent.

129-142: ⚠️ Potential issue | 🟠 Major

Add the missing AVX-512 double fma member.

The free mkn::avx::fma(...) helper dispatches to Type_<T, SIZE>::fma. With Type_<double, 8>::fma still commented out, any AVX-512 double fma call is a compile break.

Suggested fix

 template<>
 struct Type_<double, 8>
 {
     using internal_type                      = __m512d;
     auto const static inline add             = [](auto&&... v) { return _mm512_add_pd(v...); };
     auto const static inline sub             = [](auto&&... v) { return _mm512_sub_pd(v...); };
     auto const static inline mul             = [](auto&&... v) { return _mm512_mul_pd(v...); };
     auto const static inline div             = [](auto&&... v) { return _mm512_div_pd(v...); };
     auto const static inline set             = [](auto&&... v) { return _mm512_store_pd(v...); };
     auto const static inline set_v           = [](auto&&... v) { return _mm512_set1_pd(v...); };
     auto const static inline unaligned_load  = [](auto&&... v) { return _mm512_loadu_pd(v...); };
     auto const static inline unaligned_store = [](auto&&... v) { return _mm512_storeu_pd(v...); };
-    // auto const static inline fma = _mm256_fmadd_pd;
+    auto const static inline fma             = [](auto&&... v) { return _mm512_fmadd_pd(v...); };
 };

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@inc/mkn/avx/types.hpp` around lines 129 - 142, Type_<double, 8> is missing
the fma member so calls to mkn::avx::fma(...) for AVX-512 doubles fail; add a
static inline fma entry matching the pattern of the other ops in the
Type_<double, 8> specialization (use the AVX-512 intrinsic _mm512_fmadd_pd) —
e.g. add an auto const static inline fma = [](auto&&... v) { return
_mm512_fmadd_pd(v...); }; to the struct so fma dispatches correctly.

357-360: ⚠️ Potential issue | 🟠 Major

Do not take T const* on the write paths.

store and unaligned_store both cast away const and then write through the pointer. If the caller passes genuinely const storage, that's UB; these overloads should take T*, while the read path on Line 384 can forward the const pointer directly.

Suggested fix

 template<typename T, std::size_t SIZE>
-void inline store(T const* __restrict a, Type<T, SIZE> const& __restrict b) noexcept
+void inline store(T* __restrict a, Type<T, SIZE> const& __restrict b) noexcept
 {
     MKN_AVX_FN_COUNTER;
-    Type<T, SIZE>::set(const_cast<T*>(a), b());
+    Type<T, SIZE>::set(a, b());
 }
@@
 template<typename T, std::size_t SIZE>
 Type<T, SIZE> unaligned_load(T const* __restrict a) noexcept
 {
     MKN_AVX_FN_COUNTER;
-    return Type<T, SIZE>::Super::impl_type::unaligned_load(const_cast<T*>(a));
+    return Type<T, SIZE>::Super::impl_type::unaligned_load(a);
 }
 
 template<typename T, std::size_t SIZE>
-void unaligned_store(T const* __restrict a, Type<T, SIZE> const& b) noexcept
+void unaligned_store(T* __restrict a, Type<T, SIZE> const& b) noexcept
 {
     MKN_AVX_FN_COUNTER;
-    Type<T, SIZE>::Super::impl_type::unaligned_store(const_cast<T*>(a), b());
+    Type<T, SIZE>::Super::impl_type::unaligned_store(a, b());
 }

Also applies to: 381-391

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@inc/mkn/avx/types.hpp` around lines 357 - 360, The write-path overloads store
and unaligned_store currently take T const* and perform a const_cast to write
via Type<T, SIZE>::set, which can cause UB if callers pass genuinely const
storage; change their pointer parameter types from T const* to T* (for both
store and unaligned_store) and remove the const_cast so set writes to a T*
directly, while leaving the read-path overload (the load/reader that accepts T
const*) unchanged so it can forward const pointers without modification.

inc/mkn/avx/array.hpp (1)

221-229: ⚠️ Potential issue | 🔴 Critical

Decay T1 in the Span * Span result.

When either operand is Span<const T, N>, Array<T1, N> becomes const-qualified here and *ret = s0; ret *= s1; then tries to write into const storage. operator+ already avoids this with std::decay_t<T1>.

Suggested fix

 template<typename T0, typename T1, std::size_t N>
 auto inline operator*(mkn::avx::Span<T0, N> const& __restrict s0,
                       mkn::avx::Span<T1, N> const& __restrict s1)
 {
-    mkn::avx::Array<T1, N> ret{std::nullopt};
+    mkn::avx::Array<std::decay_t<T1>, N> ret{std::nullopt};
     *ret = s0;
     ret *= s1;
     return ret;
 }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@inc/mkn/avx/array.hpp` around lines 221 - 229, The operator* template
constructs an Array using T1 which can become const when operands are Span<const
T, N>, causing writes to const storage; change the constructed array type to
decay T1 (e.g., use mkn::avx::Array<std::decay_t<T1>, N> ret{std::nullopt}) so
ret is not const-qualified, matching the operator+ fix and allowing *ret = s0;
ret *= s1; to modify the storage safely; update the template return context if
needed to preserve deduction.

inc/mkn/avx/dbg.hpp (1)

71-74: ⚠️ Potential issue | 🟡 Minor

Expand __LINE__ before concatenation.

__mkn_avx_count_point##__LINE__ does not produce a line-expanded identifier, so repeated uses in one scope will still collide. A two-step concat helper fixes that.

Suggested fix

+#define MKN_AVX_PP_CAT_IMPL(a, b) a##b
+#define MKN_AVX_PP_CAT(a, b) MKN_AVX_PP_CAT_IMPL(a, b)
+
 `#define` MKN_AVX_FN_COUNTER                                                                         \
-    static mkn::avx::CountPoint __mkn_avx_count_point##__LINE__{__FILE__, __func__, __LINE__};     \
+    static mkn::avx::CountPoint MKN_AVX_PP_CAT(__mkn_avx_count_point, __LINE__){                  \
+        __FILE__, __func__, __LINE__};                                                             \
     mkn::avx::Counter::I()(__FILE__, __func__, __LINE__);

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@inc/mkn/avx/dbg.hpp` around lines 71 - 74, The macro MKN_AVX_FN_COUNTER uses
token pasting __mkn_avx_count_point##__LINE__ which doesn’t expand __LINE__,
causing identifier collisions; change it to use a two-step concatenation helper
(e.g., define a helper like MKN_AVX_CAT(a,b) -> MKN_AVX_CAT2(a,b) and
MKN_AVX_CAT2(a,b) -> a##b) and then replace __mkn_avx_count_point##__LINE__ with
MKN_AVX_CAT(__mkn_avx_count_point, __LINE__) inside the MKN_AVX_FN_COUNTER macro
so the line number is expanded before concatenation, leaving the rest of the
macro (the CountPoint construction and Counter::I() invocation) intact.

test/betterbench.hpp (1)

101-107: ⚠️ Potential issue | 🔴 Critical

Handle non-multiple-of-N tails to avoid out-of-bounds access.

At Line 101 and Line 126, the loops advance by N under i < size(), so the last chunk can overrun when size() % N != 0. Limit the SIMD block loop to full chunks and add a scalar tail pass.

Proposed fix

 template<typename T>
 void AoS<T>::operator()()
 {
@@
     else
     {
         auto constexpr static N = mkn::avx::Options::N<T>();
 
         // NO AVX FOR AOS
-        for (std::size_t i = 0; i < size(); i += N)
+        std::size_t i = 0;
+        for (; i + N <= size(); i += N)
         {
             for (std::size_t j = 0; j < N; ++j)
             {
                 std::size_t off = i + j;
                 op<T>(v.data()[off].v0, v.data()[off].v1, v.data()[off].v2, v.data()[off].v3);
             }
         }
+        for (; i < size(); ++i)
+            op<T>(v.data()[i].v0, v.data()[i].v1, v.data()[i].v2, v.data()[i].v3);
     }
 }
@@
 template<typename T, typename V>
 void SoA<T, V>::operator()()
 {
@@
     else
     {
         auto constexpr static N = mkn::avx::Options::N<V>();
 
         // AVX!!
-        for (std::size_t i = 0; i < size(); i += N)
+        std::size_t i = 0;
+        for (; i + N <= size(); i += N)
         {
             auto&& [a0, a1, a2, a3] = mkn::avx::make_spans<N>(i, v0(), v1(), v2(), v3());
             op<T>(a0, a1, a2, a3);
         }
+        for (; i < size(); ++i)
+            op<V>(v0().data()[i], v1().data()[i], v2().data()[i], v3().data()[i]);
     }
 }

Also applies to: 126-130

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@test/betterbench.hpp` around lines 101 - 107, The SIMD-style loop over chunks
can run past the end when size() % N != 0; change the chunk loop in
test/betterbench.hpp to iterate only full N-sized blocks (e.g., use i + N <=
size() or compute full_chunks = size() / N and loop full_chunks * N) and then
add a scalar tail loop that processes the remaining indices one-by-one calling
op<T> with v.data()[off].v0..v3 (the same call used inside the chunk) to avoid
out-of-bounds access; apply the same fix for the second occurrence of the
chunked loop referenced in the comment.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@inc/mkn/avx/span.hpp`:
- Around line 90-93: The Span<T, N> constructor sets the inner span's runtime
extent to 1 (span{d, 1}) which conflicts with the template extent N; change the
initializer for the backing mkn::kul::Span (the member named span in the inline
Span(T* d) noexcept constructor) to use N instead of 1 so the view represents N
scalar elements consistently with the rest of the class.

In `@test/betterbench_avx.cpp`:
- Around line 58-59: The test currently throws a std::runtime_error from main
when the check fails (if d3[arrs - 1][3] != b3) which can produce uncaught
exceptions; change this to return a non-zero exit code instead: modify the
failing branch in the main/test routine (the block referencing d3, arrs and b3)
to print a clear error message and return 1 (or another non-zero) rather than
throwing, ensuring main exits cleanly for CI.

In `@test/betterbench.cpp`:
- Around line 31-33: In main(), replace both occurrences of throwing
std::runtime_error("FAIL!") (used after the soa.v3 check and the other failure
check) with explicit non-zero returns (return 1;) so the test exits
deterministically, and remove the now-unused `#include` <stdexcept> header; ensure
the function still returns 0 at successful end and update any related comments
if present.

---

Outside diff comments:
In `@inc/mkn/avx/span.hpp`:
- Around line 158-195: The std::array overloads of operator+=, operator-= and
operator*= use invalid construction Span<T0 const> that{arr}; — replace those
constructions with SpanSet<T0 const, N>{arr} so the container-based constructor
exists and preserves fixed size N; update the three sites referenced by the
array overloads (the operator+= overload taking std::array, the
operator-=(std::array...) overload, and the operator*=(std::array...) overload)
to create that via SpanSet<T0 const, N>{arr} and then keep the existing
cast(*this, that) usage and subsequent element ops unchanged.

---

Duplicate comments:
In `@inc/mkn/avx/array.hpp`:
- Around line 221-229: The operator* template constructs an Array using T1 which
can become const when operands are Span<const T, N>, causing writes to const
storage; change the constructed array type to decay T1 (e.g., use
mkn::avx::Array<std::decay_t<T1>, N> ret{std::nullopt}) so ret is not
const-qualified, matching the operator+ fix and allowing *ret = s0; ret *= s1;
to modify the storage safely; update the template return context if needed to
preserve deduction.

In `@inc/mkn/avx/dbg.hpp`:
- Around line 71-74: The macro MKN_AVX_FN_COUNTER uses token pasting
__mkn_avx_count_point##__LINE__ which doesn’t expand __LINE__, causing
identifier collisions; change it to use a two-step concatenation helper (e.g.,
define a helper like MKN_AVX_CAT(a,b) -> MKN_AVX_CAT2(a,b) and MKN_AVX_CAT2(a,b)
-> a##b) and then replace __mkn_avx_count_point##__LINE__ with
MKN_AVX_CAT(__mkn_avx_count_point, __LINE__) inside the MKN_AVX_FN_COUNTER macro
so the line number is expanded before concatenation, leaving the rest of the
macro (the CountPoint construction and Counter::I() invocation) intact.

In `@inc/mkn/avx/span.hpp`:
- Around line 213-218: The is_span_like_v branch in operator=(T0 const& that)
selects for non-Span containers but calls cast(that) which only accepts Span<T0,
N>; wrap the incoming that in a SpanSet (e.g., SpanSet<T0, N>(that)) before
calling cast so overload resolution succeeds for std::vector/std::array and
other span-like types, then proceed to extract v1 and call store(data(), v1[0])
as before (update the cast call site inside operator= accordingly).
- Around line 543-579: UnSpanSet currently only overrides operator+= and still
inherits aligned implementations from SpanSet for assignment, scalar ops,
operator-=, operator*=, operator/= and const element access; add unaligned-aware
overrides (e.g., operator=, operator-=, operator*=, operator/=, scalar
add/mul/etc. and const operator[]) that mirror the vectorized loop in
UnSpanSet::operator+=: iterate by N-sized blocks, use unaligned_load<R,N> to
read AVX_t values, perform the operation using AVX_t (AVX_t alias provided),
then write back with unaligned_store, and handle leftover elements with the
modulo_leftover_idx() fallback to element-wise ops — this makes all core
operations use unaligned memory paths rather than falling back to SpanSet's
aligned code.
- Around line 68-85: The union U with members arr and i causes undefined
pointer-punning and loses constness when Span(T* d, std::size_t s) initializes
u.arr but avx() reads u.i; replace the union with explicit separate members
(e.g. T* arr; mkn::avx::Type<R, N>* i;) or, better, remove the stored avx
pointer entirely and make avx()/avx() const compute/return a correctly cast
pointer from arr (using reinterpret_cast or std::launder as appropriate) so the
active member is not swapped and const-correctness is preserved; update the Span
constructor to initialize both members if kept, or leave only arr and let avx()
produce the AVX pointer on demand, and ensure avx() const returns a
pointer-to-const for Span<const T, N>.
- Around line 584-655: is_aligned only recognizes std::vector with
kul::AlignedAllocator so make_span(Container) misclassifies other span-like
types (e.g. std::array) and the explicit-N/const/sliced overloads then hardcode
Span/SpanSet instead of using UnSpan/UnSpanSet. Fix by broadening is_aligned to
detect span-like and array/pointer-backed containers (use
mkn::kul::is_span_like_v, std::is_array/std::is_pointer, and a std::array<T,N>
overload) and/or provide a generic fallback that inspects
Container::allocator_type when present; then change the explicit-N overloads and
the const/sliced make_span templates (the templates named make_span(Container&),
make_span(Container const&), and make_span<std::size_t N, typename Container>)
to choose between Span/SpanSet and UnSpan/UnSpanSet based on
is_aligned<Container>() instead of hardcoding Span/SpanSet.
- Around line 398-476: The Arr-based compound operators (operator+=(Arr<T0,N>),
operator-=(Arr<T0,N>), operator*=(Arr<T0,N>), operator/=(Arr<T0,N>)) only
iterate full SIMD blocks (size()/N) and therefore skip the final leftover
elements; update each of these functions (the ones that construct SpanSet<T0
const, N> that and call cast(*this, that)) to add a tail loop after the block
loop that runs from modulo_leftover_idx() to size() and performs element-wise
ops using (*this)[i] and that[i] (mirror the leftover handling used in the
SpanSet overloads), leaving the scalar overloads (operator+=(T), operator*=(T),
operator/=(T)) as-is since they delegate to these Arr overloads. Ensure you
reference cast, SpanSet, modulo_leftover_idx(), and the specific operator
overload names when making the change.

In `@inc/mkn/avx/types.hpp`:
- Around line 225-237: The mul lambdas in Type_<std::int32_t, 4> and
Type_<std::int32_t, 8> use the widening intrinsics _mm_mul_epi32 and
_mm256_mul_epi32 which only multiply even lanes; change those to the
elementwise, low-lane intrinsics _mm_mullo_epi32 and _mm256_mullo_epi32
respectively so the mul function performs per-element 32-bit multiplication;
update the mul definitions inside the Type_<std::int32_t,4> and
Type_<std::int32_t,8> structs to call the correct intrinsics.
- Around line 371-376: The function template load(T const* __restrict a, Type<T,
SIZE> const& __restrict b) is marked noexcept but calls KEXCEPTION("load") which
can throw, causing std::terminate; remove noexcept from the function signature
(or change the error path to a non-throwing alternative such as
logging/assertion) so the function's exception behavior matches its declaration;
update the declaration of load and any related overloads/usages of
Type<T,SIZE>::load and references to KEXCEPTION in this context to keep behavior
consistent.
- Around line 129-142: Type_<double, 8> is missing the fma member so calls to
mkn::avx::fma(...) for AVX-512 doubles fail; add a static inline fma entry
matching the pattern of the other ops in the Type_<double, 8> specialization
(use the AVX-512 intrinsic _mm512_fmadd_pd) — e.g. add an auto const static
inline fma = [](auto&&... v) { return _mm512_fmadd_pd(v...); }; to the struct so
fma dispatches correctly.
- Around line 357-360: The write-path overloads store and unaligned_store
currently take T const* and perform a const_cast to write via Type<T,
SIZE>::set, which can cause UB if callers pass genuinely const storage; change
their pointer parameter types from T const* to T* (for both store and
unaligned_store) and remove the const_cast so set writes to a T* directly, while
leaving the read-path overload (the load/reader that accepts T const*) unchanged
so it can forward const pointers without modification.

In `@test/betterbench.hpp`:
- Around line 101-107: The SIMD-style loop over chunks can run past the end when
size() % N != 0; change the chunk loop in test/betterbench.hpp to iterate only
full N-sized blocks (e.g., use i + N <= size() or compute full_chunks = size() /
N and loop full_chunks * N) and then add a scalar tail loop that processes the
remaining indices one-by-one calling op<T> with v.data()[off].v0..v3 (the same
call used inside the chunk) to avoid out-of-bounds access; apply the same fix
for the second occurrence of the chunked loop referenced in the comment.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 598b241e-4214-45a8-87fe-49382cc30dca

📥 Commits

Reviewing files that changed from the base of the PR and between d2e6993 and cced5c2.

📒 Files selected for processing (21)

.github/workflows/build.yml
.github/workflows/build_nix.yml
README.noformat
inc/mkn/avx.hpp
inc/mkn/avx/array.hpp
inc/mkn/avx/dbg.hpp
inc/mkn/avx/def.hpp
inc/mkn/avx/lazy.hpp
inc/mkn/avx/span.hpp
inc/mkn/avx/types.hpp
inc/mkn/avx/vector.hpp
mkn.yaml
test/bench.cpp
test/bench.hpp
test/betterbench.cpp
test/betterbench.hpp
test/betterbench_avx.cpp
test/test_aligned.cpp
test/test_avx.cpp
test/test_lazy.cpp
test/test_unaligned.cpp

💤 Files with no reviewable changes (1)

.github/workflows/build_nix.yml

✅ Files skipped from review due to trivial changes (3)

README.noformat
inc/mkn/avx.hpp
.github/workflows/build.yml

🚧 Files skipped from review as they are similar to previous changes (4)

test/test_lazy.cpp
test/test_avx.cpp
inc/mkn/avx/lazy.hpp
test/bench.cpp

coderabbitai · 2026-03-29T17:14:55Z

+    if (d3[arrs - 1][3] != b3)
+        throw std::runtime_error("FAIL!");


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Check if the file exists and read it if [ -f "test/betterbench_avx.cpp" ]; then echo "=== File exists, checking size ===" wc -l test/betterbench_avx.cpp echo "" echo "=== Reading full file ===" cat -n test/betterbench_avx.cpp else echo "File not found" fi

Repository: mkn/mkn.avx

Length of output: 2253

Avoid uncaught exceptions from main in benchmark/test executables.

At lines 58-59, throwing from the entry point can terminate abruptly; returning a non-zero exit code is a cleaner CI failure signal.

Proposed fix

-#include <stdexcept> - if (d3[arrs - 1][3] != b3) - throw std::runtime_error("FAIL!"); + if (d3[arrs - 1][3] != b3) + return 1;

🧰 Tools

🪛 Cppcheck (2.20.0)

[error] 59-59: Unhandled exception thrown in function that is an entry point.

(throwInEntryPoint)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@test/betterbench_avx.cpp` around lines 58 - 59, The test currently throws a std::runtime_error from main when the check fails (if d3[arrs - 1][3] != b3) which can produce uncaught exceptions; change this to return a non-zero exit code instead: modify the failing branch in the main/test routine (the block referencing d3, arrs and b3) to print a clear error message and return 1 (or another non-zero) rather than throwing, ensuring main exits cleanly for CI.

coderabbitai · 2026-03-29T17:14:55Z

+        if (soa.v3.data()[soa.v3.size() - 1] != b3)
+            throw std::runtime_error("FAIL!");
+    }


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, let's check the file exists and get its size ls -la test/betterbench.cpp wc -l test/betterbench.cpp

Repository: mkn/mkn.avx

Length of output: 147

🏁 Script executed:

# Read the full file to understand context cat -n test/betterbench.cpp

Repository: mkn/mkn.avx

Length of output: 1188

Replace exception throws in main() with explicit return codes.

The test file throws std::runtime_error at lines 32 and 39 without catching them. Test executables should return non-zero exit codes explicitly for better CI integration and determinism. Replace both throw std::runtime_error("FAIL!") with return 1; and remove the #include <stdexcept> header that becomes unused.

Proposed fix

-#include <stdexcept> @@ - if (soa.v3.data()[soa.v3.size() - 1] != b3) - throw std::runtime_error("FAIL!"); + if (soa.v3.data()[soa.v3.size() - 1] != b3) + return 1; @@ - if (aos.data()[aos.size() - 1].v3 != b3) - throw std::runtime_error("FAIL!"); + if (aos.data()[aos.size() - 1].v3 != b3) + return 1;

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if (soa.v3.data()[soa.v3.size() - 1] != b3)

throw std::runtime_error("FAIL!");

}

if (soa.v3.data()[soa.v3.size() - 1] != b3)

return 1;

}

🧰 Tools

🪛 Cppcheck (2.20.0)

[error] 32-32: Unhandled exception thrown in function that is an entry point.

(throwInEntryPoint)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@test/betterbench.cpp` around lines 31 - 33, In main(), replace both occurrences of throwing std::runtime_error("FAIL!") (used after the soa.v3 check and the other failure check) with explicit non-zero returns (return 1;) so the test exits deterministically, and remove the now-unused `#include` <stdexcept> header; ensure the function still returns 0 at successful end and update any related comments if present.

PhilipDeegan force-pushed the next branch from 52f51bf to 4fe51a3 Compare March 29, 2026 13:53

coderabbitai Bot reviewed Mar 29, 2026

View reviewed changes

PhilipDeegan force-pushed the next branch 2 times, most recently from 4608eeb to d653f21 Compare March 29, 2026 14:05

coderabbitai Bot reviewed Mar 29, 2026

View reviewed changes

PhilipDeegan force-pushed the next branch from d653f21 to 8e715f1 Compare March 29, 2026 14:16

coderabbitai Bot reviewed Mar 29, 2026

View reviewed changes

PhilipDeegan force-pushed the next branch from 8e715f1 to d2e6993 Compare March 29, 2026 16:45

coderabbitai Bot reviewed Mar 29, 2026

View reviewed changes

Comment thread test/test_unaligned.cpp Outdated

PhilipDeegan force-pushed the next branch 3 times, most recently from 62dfdab to 0a16e4f Compare March 29, 2026 17:06

coderabbitai Bot reviewed Mar 29, 2026

View reviewed changes

PhilipDeegan force-pushed the next branch 3 times, most recently from e9fd65e to 609eccd Compare March 29, 2026 20:50

more more

609eccd

PhilipDeegan force-pushed the next branch from 9dafbf7 to da96d14 Compare March 30, 2026 18:17

++

b9455f8

PhilipDeegan force-pushed the next branch from da96d14 to b9455f8 Compare March 30, 2026 19:27

		if (d3[arrs - 1][3] != b3)
		throw std::runtime_error("FAIL!");

Conversation

PhilipDeegan commented Mar 29, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Mar 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning, 1 inconclusive)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot Mar 29, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Mar 29, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot Mar 29, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Mar 29, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot Mar 29, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot Mar 29, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Mar 29, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot Mar 29, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Mar 29, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

PhilipDeegan commented Mar 29, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Mar 29, 2026 •

edited

Loading