
Performance improvements #127

Open
jmg049 wants to merge 1 commit into HEnquist:master from jmg049:performance

Conversation

@jmg049

@jmg049 jmg049 commented Apr 30, 2026

1. replace Box<dyn SincInterpolator> with AnyInterpolator enum

The hot path in InnerSinc::process called get_sinc_interpolated through a vtable. Switching to a concrete enum lets the compiler resolve the target statically, removes the indirect call, and is a prerequisite for other ideas (const-generic sinc lengths, multi-frame ILP batching) that rely on the callee being visible to the optimiser.

Measured 1.5-3% improvement on AVX cubic/linear/nearest f32 and AVX cubic f64 with sinc_len=256, oversampling=256; neutral elsewhere.
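The dispatch pattern can be sketched roughly as below. This is illustrative only: the trait method body, the ScalarInterpolator type, and the variant set are placeholders, not the crate's real implementations.

```rust
// Minimal sketch of enum dispatch vs. Box<dyn Trait>. Names other than
// SincInterpolator / AnyInterpolator / get_sinc_interpolated are hypothetical.
trait SincInterpolator {
    fn get_sinc_interpolated(&self, wave: &[f32], index: usize) -> f32;
}

struct ScalarInterpolator;
impl SincInterpolator for ScalarInterpolator {
    fn get_sinc_interpolated(&self, wave: &[f32], index: usize) -> f32 {
        wave[index] // placeholder body for illustration
    }
}

// Enum dispatch: the match compiles to a direct (and inlinable) call,
// whereas Box<dyn SincInterpolator> always goes through a vtable.
enum AnyInterpolator {
    Scalar(ScalarInterpolator),
    // Avx(...), Sse(...), Neon(...) variants in the real code
}

impl AnyInterpolator {
    fn get_sinc_interpolated(&self, wave: &[f32], index: usize) -> f32 {
        match self {
            AnyInterpolator::Scalar(s) => s.get_sinc_interpolated(wave, index),
        }
    }
}
```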

2. prefetch upcoming sinc rows on AVX/SSE

Adds a prefetch_sinc hook to SincInterpolator and uses _mm_prefetch on x86_64 to warm L1 with the sinc rows that the next inner loop will read. Default impl is a no-op so non-SIMD interpolators are unaffected.

For sinc_len=128, oversampling=512 the full sinc table is ~256KB which sits at the L2 boundary on many CPUs; bringing the next row to L1 while the current row is being processed hides the L2 latency.

Measured on top of the AnyInterpolator change:
  AVX cubic   f32  -2.6%      AVX cubic   f64  -2.4%
  AVX linear  f32  -3.2%      AVX linear  f64  -2.3%
  AVX nearest f32  ~neutral   AVX nearest f64  ~neutral
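The overlap idea looks roughly like this sketch, assuming the hook receives the row the next iteration will read; the function names and loop structure here are illustrative, not the PR's actual code.

```rust
// Sketch of prefetching the *next* sinc row while the current one is processed.
// On non-x86_64 targets this compiles to a no-op, matching the default impl.
#[inline(always)]
fn prefetch_row(sinc_row: &[f32]) {
    #[cfg(target_arch = "x86_64")]
    unsafe {
        use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
        // One prefetch per 64-byte cache line (16 f32s), T0 = into L1.
        for chunk in sinc_row.chunks(16) {
            _mm_prefetch::<_MM_HINT_T0>(chunk.as_ptr() as *const i8);
        }
    }
    #[cfg(not(target_arch = "x86_64"))]
    {
        let _ = sinc_row;
    }
}

fn dot_with_prefetch(wave: &[f32], sinc: &[f32], next_sinc: &[f32]) -> f32 {
    // Issue the prefetch first so it overlaps with the current row's work.
    prefetch_row(next_sinc);
    wave.iter().zip(sinc).map(|(w, s)| w * s).sum()
}
```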

3. unroll FMA loop via const-generic specialisations

Splits the AVX dot product into a const-generic dot_avx_{f32,f64}_n<LEN> plus a runtime fallback. get_sinc_interpolated_unsafe dispatches on the sinc length (64, 128, 256) to a fully-unrolled variant. The dispatch branch is per-call and trivially predicted because the resampler's sinc_len does not change after construction.

Cumulative with the earlier perf changes vs master:
  AVX cubic   f32  -5.0%      AVX cubic   f64  -2.1%
  AVX linear  f32  -2.7%      AVX linear  f64  -4.8%
  AVX nearest f32  -1.6%      AVX nearest f64  ~neutral
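The shape of the dispatch can be sketched in portable scalar Rust; the PR's dot_avx_{f32,f64}_n<LEN> functions are AVX-intrinsics versions of the same idea, and the names below are otherwise hypothetical.

```rust
// With LEN known at compile time, the compiler can fully unroll the loop.
#[inline(always)]
fn dot_n<const LEN: usize>(wave: &[f32], sinc: &[f32]) -> f32 {
    let mut acc = 0.0f32;
    for i in 0..LEN {
        acc += wave[i] * sinc[i];
    }
    acc
}

fn dot_dispatch(wave: &[f32], sinc: &[f32]) -> f32 {
    // sinc_len is fixed at construction, so this branch is trivially predicted.
    match sinc.len() {
        64 => dot_n::<64>(wave, sinc),
        128 => dot_n::<128>(wave, sinc),
        256 => dot_n::<256>(wave, sinc),
        // Runtime fallback for any other length.
        _ => wave.iter().zip(sinc).map(|(w, s)| w * s).sum(),
    }
}
```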

4. widen FMA chains to 4 parallel accumulators

The const-generic AVX kernels now run four independent FMA accumulator chains. AVX FMA has 4-cycle latency and 2-per-cycle throughput, so a single-chain reduction caps the kernel at 1/8 of peak throughput; four chains cover the latency window and let both FMA ports stay busy.

Implemented a form of "multi-frame ILP batching": within-evaluation chains capture the same benefit (parallel FMA chains hiding latency) without requiring a restructure of the outer process loop.

f32 paths benefit the most (the original kernel had only one chain); f64 was already running two chains so the gain there is small.

Cumulative vs master:
  AVX cubic   f32  -9.6%      AVX cubic   f64  -2.2%
  AVX linear  f32  -9.7%      AVX linear  f64  -4.4%
  AVX nearest f32  ~neutral   AVX nearest f64  ~neutral
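A scalar sketch of the multi-chain pattern is below; the PR applies the same restructuring with _mm256_fmadd_ps, and the function name is hypothetical.

```rust
// Four independent accumulators break the loop-carried dependency: the next
// multiply-add can issue while earlier ones are still in flight, instead of
// each iteration waiting the full FMA latency on a single accumulator.
fn dot_four_chains(wave: &[f32], sinc: &[f32]) -> f32 {
    assert_eq!(wave.len(), sinc.len());
    assert_eq!(wave.len() % 4, 0);
    let (mut a0, mut a1, mut a2, mut a3) = (0.0f32, 0.0f32, 0.0f32, 0.0f32);
    for i in (0..wave.len()).step_by(4) {
        a0 += wave[i] * sinc[i];
        a1 += wave[i + 1] * sinc[i + 1];
        a2 += wave[i + 2] * sinc[i + 2];
        a3 += wave[i + 3] * sinc[i + 3];
    }
    // Single reduction at the end: its latency is paid once, not per iteration.
    (a0 + a1) + (a2 + a3)
}
```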

All tests passing

@HEnquist
Owner

HEnquist commented May 2, 2026

Thanks for the PR! These are quite useful improvements. This PR reminded me of some changes I have been planning for some time but never finished. See #128 for a draft PR. I think some of the changes here are not quite compatible with some changes I needed to do there. I have not checked in detail yet but I hope these two improvements can be joined without too much effort.

@jmg049
Author

jmg049 commented May 3, 2026

Happy to contribute. I use Rubato for audio_samples, and anything I can do to make things faster, I will.

Happy to help out with merging with other stuff you are or have been working on that might conflict.

Makes me wish I had kept the 4 items as separate commits. Would be easier to disentangle, but sure look.

@HEnquist
Owner

HEnquist commented May 3, 2026

I let Claude take a look:

PR #127 vs smarter-dot-products — Compatibility Report

Overview

Both branches modify the same 6–7 core files and make incompatible changes to the SincInterpolator trait. The current branch refactored the trait interface; the PR still targets the original one. However, the conflicts are mostly at the trait boundary, and the PR's four optimizations are individually assessable.


Change 1: AnyInterpolator enum (replaces Box<dyn SincInterpolator>)

Status: Compatible with minor adaptation

The current branch still stores the interpolator as Box<dyn SincInterpolator<T>> in InnerSinc, but refactored the trait to require get_sinc_dot_product and get_sincs as abstract methods instead of get_sinc_interpolated. The PR's AnyInterpolator delegates to the old method.

Adoption requires:

  • Adding get_sinc_dot_product and get_sincs delegations to AnyInterpolator's trait impl
  • Changing InnerSinc.interpolator from Box<dyn SincInterpolator<T>> to AnyInterpolator<T>

Effort: Low–medium. High value (removes vtable indirection on every sample).


Change 2: prefetch_sinc hook

Status: Compatible as-is

The defaulted no-op method can be added to the trait directly with no conflicts. The AVX/SSE _mm_prefetch implementations carry over unchanged.

The call sites in asynchro_sinc.rs need placement adjustment (the current branch reorganized the inner loops around combined-sinc building), but the prefetch can still fire before the per-channel loop for the next output sample.

Effort: Low. Moderate value (targets L2 boundary for large sinc tables).


Change 3: Const-generic FMA loop unrolling

Status: Requires non-trivial adaptation

The PR's dot_avx_f32_n<const LEN: usize>() / dot_avx_f64_n<const LEN: usize>() take sincs as &[__m256] — packed vectors. The current branch removed packing entirely: sincs are stored as Vec<Vec<T>> and get_sinc_dot_product_unsafe takes a plain &[T] slice.

The const-generic approach is still applicable (_mm256_loadu_ps from &[f32] works fine and a compile-time LEN allows full unrolling regardless of storage format), but the PR's functions cannot be dropped in unchanged — they need to be rewritten to take sinc: &[T] instead of sinc: &[__m256]. The dispatch logic (matching on sinc_len to select <64>, <128>, <256> specializations) can be preserved exactly.

Effort: Medium. Moderate value (the combined-sinc path in the current branch already reduces how often the dot product is called per output sample, so proportional gains are smaller than on master).


Change 4: 4 parallel FMA accumulator chains

Status: Compatible with minor adaptation

The current branch's get_sinc_dot_product_unsafe in AVX performs a single-chain reduction over a &[T] slice. The 4-accumulator pattern from the PR applies directly to that function with no architectural conflict — it is purely a loop restructuring inside an unsafe SIMD function. This is the easiest high-value item to adopt.

Effort: Low–medium. High value (f32 paths especially; the current branch has a single accumulator chain).


Summary

  PR change                  Conflicts with current branch?       Adoptable?            Effort
  AnyInterpolator enum       Trait interface mismatch             Yes, with additions   Low–med
  prefetch_sinc              Call-site placement only             Yes                   Low
  Const-generic unrolling    Storage format (&[__m256] vs &[T])   Yes, with rewrite     Medium
  4-accumulator FMA chains   None (pure loop restructuring)       Yes, drop-in          Low–med

Recommended order of adoption: (4) 4-accumulator chains → (1) AnyInterpolator → (2) prefetch → (3) const-generic unrolling. Items 4 and 2 are the most straightforward wins; item 1 is architecturally clean once the trait delegations are added; item 3 requires the most rewriting but is still fully compatible with the unpacked sinc storage.

@HEnquist
Owner

HEnquist commented May 3, 2026

see #129

@HEnquist
Owner

HEnquist commented May 3, 2026

btw, you mentioned seeking max performance. I notice you're only using the async sinc resampler. Have you considered the FFT-based synchronous resampler? For offline/batch resampling it's typically significantly faster (O(log N) per sample vs O(sinc_len) per sample). The quality tradeoff actually goes slightly in the FFT resampler's favour too. The main thing you'd give up is the direct filter parameter control (sinc_len, f_cutoff, window) that lets you precisely match resampy/soxr quality presets, which looks intentional given your comments. Is that the reason for the choice, or is it worth benchmarking the FFT path? If it's important, then f_cutoff and windows could be made configurable for the Fft resampler as well.

