My team aims to develop a heterogeneous inference framework using ExecuTorch, but we are currently grappling with challenges in heterogeneous quantization.
Consider this scenario: For a single model, we plan to simultaneously delegate computations to NPU-A, NPU-B, and CPU backends. The CPU will utilize XNNPACKQuantizer, while NPU-A and NPU-B require custom quantization algorithms. How can we apply these three distinct quantization methods to their respective partitions before the graph is partitioned?
My understanding of ExecuTorch's workflow is: quantization at ATen IR → partitioning at Edge IR.
If the graph is partitioned into:
P1 (executed on NPU-A)
P2 (executed on NPU-B)
P3 (executed on CPU via XNNPACK)
how can we ensure that the NPU-A-specific quantization algorithm is applied to P1 during the ATen IR quantization stage, given that partitioning only happens later at the Edge IR stage?
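For concreteness, here is a minimal sketch of the flow I am describing, based on the PT2E quantization APIs. Import paths and some function names differ across PyTorch/ExecuTorch versions, the model and inputs are just placeholders, and NpuAPartitioner / NpuBPartitioner are hypothetical names for our custom backends:

```python
import torch
from torch.export import export, export_for_training
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
# In newer ExecuTorch releases this quantizer lives under
# executorch.backends.xnnpack.quantizer instead.
from torch.ao.quantization.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)
from executorch.exir import to_edge_transform_and_lower
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

# Placeholder model and inputs, just to make the sketch runnable.
model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU()).eval()
example_inputs = (torch.randn(1, 3, 224, 224),)

# Step 1: quantization at the ATen IR level. prepare_pt2e takes a single
# quantizer object that annotates the *whole* graph -- this is where we would
# somehow need three different quantizers (NPU-A, NPU-B, XNNPACK).
captured = export_for_training(model, example_inputs).module()
quantizer = XNNPACKQuantizer().set_global(get_symmetric_quantization_config())
prepared = prepare_pt2e(captured, quantizer)
prepared(*example_inputs)  # calibration
quantized = convert_pt2e(prepared)

# Step 2: partitioning at the Edge IR level. Only here do the partitioners
# decide which subgraph (P1/P2/P3) goes to which backend.
edge = to_edge_transform_and_lower(
    export(quantized, example_inputs),
    partitioner=[XnnpackPartitioner()],  # + NpuAPartitioner(), NpuBPartitioner()
)
executorch_program = edge.to_executorch()
```

In other words, the single-quantizer annotation happens before any backend has claimed its subgraph, which is the ordering problem we are trying to solve.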