Problem
Current benchmarks only measure single-MSM performance at various input sizes (2^10 through 2^24). In real proving systems, MSM is called many times in parallel — for example, a 2^14-point MSM invoked 2048 times concurrently during a single proving pass.
The GPU may significantly outperform a multi-threaded CPU in these batched scenarios due to:
- Better utilization of parallel compute units when saturated with concurrent work
- Amortization of setup overhead (buffer allocation, shader compilation) across batches
- Different memory access patterns under concurrent load
Without batched benchmarks, we may be underestimating the GPU's real-world advantage (or missing optimization opportunities).
Current State
All existing benchmarks run a single MSM computation per iteration:
- `benches/e2e.rs` — Criterion benchmark, one MSM per sample
- `tests/cuzk/e2e.rs` — end-to-end test, single MSM execution
There is no batched or concurrent MSM benchmarking anywhere in the codebase.
Proposed Work
1. Batched MSM Benchmarks
Add benchmarks that measure throughput when running multiple MSMs concurrently:
- Varying batch sizes: e.g., 1, 4, 16, 64, 256, 1024, 2048 concurrent MSMs
- Varying MSM sizes within batches: e.g., batches of 2^14-point MSMs (a size common in real provers)
- Metrics: total wall-clock time, throughput (MSMs/sec), per-MSM latency under load
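The harness could be sketched roughly as below, independent of the existing Criterion setup. This is only an illustration of the measurement structure: `run_msm` is a hypothetical stand-in for the library's real MSM entry point, and the thread-based fan-out is one assumed way to issue concurrent work (it also assumes `threads` divides `batch` evenly).

```rust
use std::thread;
use std::time::Instant;

// Hypothetical stand-in for a single MSM over `size` points; the real
// benchmark would call the library's MSM entry point here instead.
fn run_msm(size: usize) -> u64 {
    (0..size as u64).fold(0u64, |acc, x| acc.wrapping_add(x.wrapping_mul(x)))
}

/// Run `batch` MSMs of `size` points concurrently across `threads` workers,
/// returning (throughput in MSMs/sec, mean per-MSM latency in seconds).
/// Assumes `threads` divides `batch` evenly.
fn bench_batch(batch: usize, size: usize, threads: usize) -> (f64, f64) {
    let per_thread = batch / threads;
    let start = Instant::now();
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            thread::spawn(move || {
                for _ in 0..per_thread {
                    // black_box keeps the compiler from optimizing the work away.
                    std::hint::black_box(run_msm(size));
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let secs = start.elapsed().as_secs_f64();
    (batch as f64 / secs, secs / batch as f64)
}

fn main() {
    // Sweep batch sizes at a fixed 2^14 MSM size, capping worker threads at 8.
    for &batch in &[1usize, 4, 16, 64] {
        let (throughput, latency) = bench_batch(batch, 1 << 14, batch.min(8));
        println!("batch={batch:>4} throughput={throughput:.1} MSMs/s latency={latency:.6}s");
    }
}
```

A real version would plug the GPU and CPU MSM implementations into `run_msm` and report all three metrics per (batch size, MSM size) pair.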
2. Batched MSM API (stretch)
Consider whether a dedicated batched MSM API could improve performance by:
- Sharing Metal command buffers across MSMs in a batch
- Pipelining GPU work (overlap data transfer with computation)
- Reusing allocated buffers across MSMs of the same size
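The buffer-reuse idea can be prototyped independently of Metal. The sketch below uses a hypothetical `Buffer` placeholder (a plain byte vector standing in for a GPU buffer handle) and a pool keyed by size, so repeated MSMs of the same size within a batch skip reallocation:

```rust
use std::collections::HashMap;

/// Hypothetical placeholder for a GPU buffer handle; in the real code this
/// would wrap a Metal buffer allocated from the device.
struct Buffer {
    data: Vec<u8>,
}

/// Pool that hands out buffers keyed by size. A release followed by an
/// acquire of the same size returns the cached buffer instead of allocating.
#[derive(Default)]
struct BufferPool {
    free: HashMap<usize, Vec<Buffer>>,
}

impl BufferPool {
    fn acquire(&mut self, size: usize) -> Buffer {
        self.free
            .get_mut(&size)
            .and_then(|bufs| bufs.pop())
            .unwrap_or_else(|| Buffer { data: vec![0u8; size] })
    }

    fn release(&mut self, buf: Buffer) {
        self.free.entry(buf.data.len()).or_default().push(buf);
    }
}

fn main() {
    let mut pool = BufferPool::default();
    // First acquire allocates; release + acquire of the same size reuses.
    let b = pool.acquire(1 << 14);
    pool.release(b);
    let b2 = pool.acquire(1 << 14);
    println!("got buffer of {} bytes from pool", b2.data.len());
}
```

Whether this wins in practice depends on allocator behavior under concurrent load, which is exactly what the batched benchmarks above would surface.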