Embedding model ONNX export fails due to torchaudio preprocessing #1929
Closed
altunenes
started this conversation in
Development
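I successfully converted both models to ONNX. For the embedding model, I exported only the ResNet backbone and kept fbank extraction outside the model:

import torch
from pyannote.audio import Model  # assuming Model is pyannote.audio's Model class


class EmbeddingResNet(torch.nn.Module):
    """Wraps only the ResNet backbone so the export skips the torchaudio fbank code."""

    def __init__(self, wespeaker_model):
        super().__init__()
        self.resnet = wespeaker_model.resnet

    # Must be named forward() so torch.onnx.export can trace the module.
    def forward(self, fbank_features):
        # Input: (batch, num_frames, 80) fbank features
        # Output: (batch, 256) embeddings
        o = self.resnet(fbank_features)
        # Keep the last element when the backbone returns a tuple.
        return o[-1] if isinstance(o, tuple) else o


fm = Model.from_pretrained("models/embedding").eval()
on = EmbeddingResNet(fm).eval()

torch.onnx.export(
    on,
    torch.randn(1, 200, 80),  # dummy input: 1 utterance, 200 frames, 80 mel bins
    "embedding_model.onnx",
    input_names=["fbank_features"],
    output_names=["embeddings"],
    dynamic_axes={"fbank_features": {0: "batch_size", 1: "num_frames"}},
)

So extract fbank features separately before inference, using any fbank implementation (torchaudio, kaldi, etc.). I ran a verification (cosine similarity: 1.0000001192).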
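A cosine similarity a hair above 1, like the value quoted above, is consistent with float32 rounding: the excess of about 1.19e-7 matches float32 machine epsilon (2^-23). The sketch below shows one way such a check could be written in plain Python. The embedding values and the tolerance of four epsilons are hypothetical, chosen only for illustration; real embeddings would come from the PyTorch model and the ONNX session.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length sequences of floats."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical stand-ins for a PyTorch embedding and its ONNX counterpart.
torch_embedding = [0.12, -0.53, 0.98, 0.07]
onnx_embedding = [0.12, -0.53, 0.98, 0.07]

similarity = cosine_similarity(torch_embedding, onnx_embedding)

# float32 machine epsilon: the gap between 1.0 and the next float32 value.
FLOAT32_EPS = 2.0 ** -23

# Hypothetical tolerance: accept results within a few float32 epsilons of 1.
matches = abs(similarity - 1.0) <= 4 * FLOAT32_EPS
```

With float32 outputs, a result that overshoots 1 by one epsilon still passes this check, so there is no need to clamp before comparing.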
Hi, thank you for the new release; it looks really cool! I ran into some errors while trying to convert your models to ONNX and wanted to consult with you.
https://huggingface.co/pyannote/speaker-diarization-community-1
I successfully converted segmentation-community-1 to ONNX using the standard approach:
This works perfectly and produces identical outputs between PyTorch and ONNX...
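To make "identical outputs" concrete, an element-wise tolerance check is one option. This is a minimal sketch in plain Python: the helper names, the sample values, and the default tolerance of 1e-5 are all hypothetical, and the inputs are assumed to be the flattened outputs of the PyTorch model and the ONNX session for the same input.

```python
def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between two flat sequences."""
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same number of elements")
    if len(reference) == 0:
        raise ValueError("outputs must not be empty")
    return max(abs(r - c) for r, c in zip(reference, candidate))


def outputs_match(reference, candidate, atol=1e-5):
    """True when every element differs by at most atol (hypothetical default)."""
    return max_abs_diff(reference, candidate) <= atol


# Hypothetical flattened outputs from PyTorch and from the exported ONNX model.
pytorch_output = [0.25, 0.5, 0.25]
onnx_output = [0.250001, 0.499999, 0.25]

close_enough = outputs_match(pytorch_output, onnx_output)
strictly_equal = outputs_match(pytorch_output, onnx_output, atol=1e-7)
```

Exact equality is rarely achievable across runtimes, so reporting the maximum difference alongside the pass/fail result tends to be more informative than a bare boolean.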
However, the embedding model (new wespeaker) cannot be exported using the same method:
Error: RuntimeError: Unsupported value kind: Tensor in torchaudio/compliance/kaldi.py
The model appears to have internal fbank extraction using torchaudio operations that don't support ONNX export. I verified the model outputs 256-dimensional embeddings in PyTorch, which matches the PLDA files provided.
Question: Is there a supported way to export the embedding model to ONNX? The segmentation model exports cleanly... Wespeaker models export by accepting pre-computed fbank features as input - does pyannote's embedding model support a similar inference path that skips the internal audio preprocessing?
I need the embedding model in ONNX format to use with the PLDA files for VBx clustering in a Rust deployment.
Note: I also tried accessing the ResNet backbone directly, but it requires specific internal preprocessing between fbank extraction and the ResNet forward pass that I cannot replicate externally. Direct ResNet export fails with shape mismatches...
all the best...