Performance improvements #127
1. replace Box<dyn SincInterpolator> with AnyInterpolator enum
The hot path in InnerSinc::process called get_sinc_interpolated through a vtable.
Switching to a concrete enum lets the compiler resolve the target statically, removes the indirect call, and is a prerequisite for other ideas (const-generic sinc lengths, multi-frame ILP batching) that rely on the callee being visible to the optimiser.
Measured 1.5-3% improvement on AVX cubic/linear/nearest f32 and AVX cubic f64 with sinc_len=256, oversampling=256; neutral elsewhere.
2. prefetch upcoming sinc rows on AVX/SSE
Adds a prefetch_sinc hook to SincInterpolator and uses _mm_prefetch on x86_64 to warm L1 with the sinc rows that the next inner loop will read.
Default impl is a no-op so non-SIMD interpolators are unaffected.
For sinc_len=128, oversampling=512 the full sinc table is ~256 KB, which sits at the L2 boundary on many CPUs;
bringing the next row to L1 while the current row is being processed hides the L2 latency.
Measured on top of the AnyInterpolator change:
AVX cubic f32 -2.6% AVX cubic f64 -2.4%
AVX linear f32 -3.2% AVX linear f64 -2.3%
AVX nearest f32 ~neutral AVX nearest f64 ~neutral
3. unroll FMA loop via const-generic specialisations
Splits the AVX dot product into a const-generic dot_avx_{f32,f64}_n<LEN> plus a runtime fallback. get_sinc_interpolated_unsafe dispatches on the sinc length (64, 128, 256) to a fully-unrolled variant.
The dispatch branch is per-call and trivially predicted because the resampler's sinc_len does not change after construction.
Cumulative with the earlier perf changes vs master:
AVX cubic f32 -5.0% AVX cubic f64 -2.1%
AVX linear f32 -2.7% AVX linear f64 -4.8%
AVX nearest f32 -1.6% AVX nearest f64 ~neutral
4. widen FMA chains to 4 parallel accumulators
The const-generic AVX kernels now run four independent FMA accumulator chains.
AVX FMA has 4-cycle latency and 2-per-cycle throughput, so a single chain reduction caps the kernel at 1/8 of peak throughput;
four chains cover the latency window and let both FMA ports stay busy.
This implements a form of "multi-frame ILP batching": within-evaluation chains capture the same benefit (parallel FMA chains hiding latency) without requiring a restructure of the outer process loop.
f32 paths benefit the most (the original kernel had only one chain);
f64 was already running two chains so the gain there is small.
Cumulative vs master:
AVX cubic f32 -9.6% AVX cubic f64 -2.2%
AVX linear f32 -9.7% AVX linear f64 -4.4%
AVX nearest f32 ~neutral AVX nearest f64 ~neutral
---
Thanks for the PR! These are quite useful improvements. This PR reminded me of some changes I have been planning for some time but never finished; see #128 for a draft PR. I think some of the changes here are not quite compatible with some changes I needed to make there. I have not checked in detail yet, but I hope these two sets of improvements can be joined without too much effort.
---
Happy to contribute. I use Rubato for audio_samples, and anything I can do to make things faster, I will. Happy to help out with merging the other stuff you are or have been working on that might conflict. Makes me wish I had kept the 4 items as separate commits; it would be easier to disentangle, but sure look.
---
I let Claude take a look at PR #127 vs the current branch:
| PR Change | Conflicts with current branch? | Adoptable? | Effort |
|---|---|---|---|
| AnyInterpolator enum | Trait interface mismatch | Yes, with additions | Low–med |
| prefetch_sinc | Call-site placement only | Yes | Low |
| Const-generic unrolling | Storage format (`&[__m256]` vs `&[T]`) | Yes, with rewrite | Medium |
| 4-accumulator FMA chains | None — pure loop restructuring | Yes, drop-in | Low–med |
Recommended order of adoption: (4) 4-accumulator chains → (1) AnyInterpolator → (2) prefetch → (3) const-generic unrolling. Items 4 and 2 are the most straightforward wins; item 1 is architecturally clean once the trait delegations are added; item 3 requires the most rewriting but is still fully compatible with the unpacked sinc storage.
---
See #129.
---
btw, you mentioned seeking max performance. I notice you're only using the async sinc resampler. Have you considered the FFT-based synchronous resampler? For offline/batch resampling it's typically significantly faster (O(log N) per sample vs O(sinc_len) per sample). The quality tradeoff actually goes slightly in the FFT resampler's favour too. The main thing you'd give up is the direct filter parameter control (sinc_len, f_cutoff, window) that lets you precisely match resampy/soxr quality presets, which looks intentional given your comments. Is that the reason for the choice, or is it worth benchmarking the FFT path? If it's important, f_cutoff and the window could be made configurable for the Fft resampler as well.