Performance improvements #127
1. replace Box<dyn SincInterpolator> with AnyInterpolator enum
The hot path in InnerSinc::process called get_sinc_interpolated through a vtable.
Switching to a concrete enum lets the compiler resolve the target statically, removes the indirect call, and is a prerequisite for other ideas (const-generic sinc lengths, multi-frame ILP batching) that rely on the callee being visible to the optimiser.
Measured 1.5-3% improvement on AVX cubic/linear/nearest f32 and AVX cubic f64 with sinc_len=256, oversampling=256; neutral elsewhere.
2. prefetch upcoming sinc rows on AVX/SSE
Adds a prefetch_sinc hook to SincInterpolator and uses _mm_prefetch on x86_64 to warm L1 with the sinc rows that the next inner loop will read.
Default impl is a no-op so non-SIMD interpolators are unaffected.
For sinc_len=128, oversampling=512 the full sinc table is ~256 KB, which sits at the L2 boundary on many CPUs;
bringing the next row to L1 while the current row is being processed hides the L2 latency.
Measured on top of the AnyInterpolator change:
AVX cubic f32 -2.6% AVX cubic f64 -2.4%
AVX linear f32 -3.2% AVX linear f64 -2.3%
AVX nearest f32 ~neutral AVX nearest f64 ~neutral
3. unroll FMA loop via const-generic specialisations
Splits the AVX dot product into a const-generic dot_avx_{f32,f64}_n<LEN> plus a runtime fallback. get_sinc_interpolated_unsafe dispatches on the sinc length (64, 128, 256) to a fully-unrolled variant.
The dispatch branch is per-call and trivially predicted because the resampler's sinc_len does not change after construction.
Cumulative with the earlier perf changes vs master:
AVX cubic f32 -5.0% AVX cubic f64 -2.1%
AVX linear f32 -2.7% AVX linear f64 -4.8%
AVX nearest f32 -1.6% AVX nearest f64 ~neutral
4. widen FMA chains to 4 parallel accumulators
The const-generic AVX kernels now run four independent FMA accumulator chains.
AVX FMA has 4-cycle latency and 2-per-cycle throughput, so a single chain reduction caps the kernel at 1/8 of peak throughput;
four chains cover the latency window and let both FMA ports stay busy.
This implements a form of "multi-frame ILP batching": within-evaluation chains capture the same benefit (parallel FMA chains hiding latency) without requiring a restructure of the outer process loop.
f32 paths benefit the most (the original kernel had only one chain);
f64 was already running two chains so the gain there is small.
Cumulative vs master:
AVX cubic f32 -9.6% AVX cubic f64 -2.2%
AVX linear f32 -9.7% AVX linear f64 -4.4%
AVX nearest f32 ~neutral AVX nearest f64 ~neutral
---
Thanks for the PR! These are quite useful improvements. This PR reminded me of some changes I have been planning for some time but never finished; see #128 for a draft PR. I think some of the changes here are not quite compatible with some changes I needed to make there. I have not checked in detail yet, but I hope these two sets of improvements can be joined without too much effort.
---
Happy to contribute. I use Rubato for audio_samples, and anything I can do to make things faster, I will. Happy to help out with merging the other stuff you are or have been working on that might conflict. Makes me wish I had kept the 4 items as separate commits; it would be easier to disentangle, but sure look.
---
I let Claude take a look at PR #127 vs the current branch:
| PR Change | Conflicts with current branch? | Adoptable? | Effort |
|---|---|---|---|
| AnyInterpolator enum | Trait interface mismatch | Yes, with additions | Low–med |
| prefetch_sinc | Call-site placement only | Yes | Low |
| Const-generic unrolling | Storage format (`&[__m256]` vs `&[T]`) | Yes, with rewrite | Medium |
| 4-accumulator FMA chains | None — pure loop restructuring | Yes, drop-in | Low–med |
Recommended order of adoption: (4) 4-accumulator chains → (1) AnyInterpolator → (2) prefetch → (3) const-generic unrolling. Items 4 and 2 are the most straightforward wins; item 1 is architecturally clean once the trait delegations are added; item 3 requires the most rewriting but is still fully compatible with the unpacked sinc storage.
---
See #129.
---
btw, you mentioned seeking max performance. I notice you're only using the async sinc resampler. Have you considered the FFT-based synchronous resampler? For offline/batch resampling it's typically significantly faster (O(log N) per sample vs O(sinc_len) per sample). The quality tradeoff actually goes slightly in the FFT resampler's favour too. The main thing you'd give up is the direct filter parameter control (sinc_len, f_cutoff, window) that lets you precisely match resampy/soxr quality presets, which looks intentional given your comments. Is that the reason for the choice, or is it worth benchmarking the FFT path? If it's important, f_cutoff and the window could be made configurable for the Fft resampler as well.