My team aims to develop a heterogeneous inference framework using ExecuTorch, but we are currently grappling with challenges in heterogeneous quantization.
Consider this scenario: For a single model, we plan to simultaneously delegate computations to NPU-A, NPU-B, and CPU backends. The CPU will utilize XNNPACKQuantizer, while NPU-A and NPU-B require custom quantization algorithms. How can we apply these three distinct quantization methods to their respective partitions before the graph is partitioned?
My understanding of ExecuTorch's workflow is: quantization at ATen IR → partitioning at Edge IR.
If the graph is partitioned into:
P1 (executed on NPU-A)
P2 (executed on NPU-B)
P3 (executed on CPU via XNNPACK)
how can we ensure that the NPU-A-specific quantization algorithm is applied to P1 during the ATen IR quantization stage, given that partitioning only happens later at the Edge IR stage?
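For concreteness, here is a minimal sketch of the flow I am describing, based on the PT2E quantization APIs. Import paths and some function names differ across PyTorch/ExecuTorch versions, the model and inputs are just placeholders, and NpuAPartitioner / NpuBPartitioner are hypothetical names for our custom backends:

```python
import torch
from torch.export import export, export_for_training
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
# In newer ExecuTorch releases this quantizer lives under
# executorch.backends.xnnpack.quantizer instead.
from torch.ao.quantization.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)
from executorch.exir import to_edge_transform_and_lower
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

# Placeholder model and inputs, just to make the sketch runnable.
model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU()).eval()
example_inputs = (torch.randn(1, 3, 224, 224),)

# Step 1: quantization at the ATen IR level. prepare_pt2e takes a single
# quantizer object that annotates the *whole* graph -- this is where we would
# somehow need three different quantizers (NPU-A, NPU-B, XNNPACK).
captured = export_for_training(model, example_inputs).module()
quantizer = XNNPACKQuantizer().set_global(get_symmetric_quantization_config())
prepared = prepare_pt2e(captured, quantizer)
prepared(*example_inputs)  # calibration
quantized = convert_pt2e(prepared)

# Step 2: partitioning at the Edge IR level. Only here do the partitioners
# decide which subgraph (P1/P2/P3) goes to which backend.
edge = to_edge_transform_and_lower(
    export(quantized, example_inputs),
    partitioner=[XnnpackPartitioner()],  # + NpuAPartitioner(), NpuBPartitioner()
)
executorch_program = edge.to_executorch()
```

In other words, the single-quantizer annotation happens before any backend has claimed its subgraph, which is the ordering problem we are trying to solve.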