Embedding model ONNX export fails due to torchaudio preprocessing #1929

altunenes · 2025-10-01T07:35:01Z

Hi, thank you for the new release, it looks really cool! I ran into some errors while converting your models to ONNX and wanted to ask for your advice.

https://huggingface.co/pyannote/speaker-diarization-community-1

I successfully converted segmentation-community-1 to ONNX using the standard approach:

segmentation_model = Model.from_pretrained('models').eval()
dummy_input = torch.zeros(2, 1, 160000)
torch.onnx.export(segmentation_model, dummy_input, "segmentation.onnx", ...)

This works perfectly and produces identical outputs between PyTorch and ONNX...

However, the embedding model (new wespeaker) cannot be exported using the same method:

Error: RuntimeError: Unsupported value kind: Tensor in torchaudio/compliance/kaldi.py

The model appears to have internal fbank extraction using torchaudio operations that don't support ONNX export. I verified the model outputs 256-dimensional embeddings in PyTorch, which matches the PLDA files provided.

Question: Is there a supported way to export the embedding model to ONNX? The segmentation model exports cleanly... WeSpeaker models export by accepting pre-computed fbank features as input. Does pyannote's embedding model support a similar inference path that skips the internal audio preprocessing?
I need the embedding model in ONNX format so I can use it with the PLDA files for VBx clustering in a Rust deployment.

Note: I also tried accessing the ResNet backbone directly, but it requires specific internal preprocessing between fbank extraction and the ResNet forward pass that I could not replicate externally. Direct ResNet export fails with shape mismatches...

all the best...

altunenes (Author) · 2025-10-01T09:50:53Z

I successfully converted both models to ONNX.
Solution: export only the ResNet backbone, which accepts pre-computed fbank features:

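import torch
from pyannote.audio import Model


class ResNetEmbeddingWrapper(torch.nn.Module):
    """Wraps only the ResNet backbone so the export skips the torchaudio fbank step."""

    def __init__(self, wespeaker_model):
        super().__init__()
        self.resnet = wespeaker_model.resnet

    # Must be named `forward`: torch.onnx.export traces the module's forward().
    def forward(self, fbank_features):
        # Input: (batch, num_frames, 80) fbank features
        # Output: (batch, 256) embeddings
        o = self.resnet(fbank_features)
        # Keep only the last element when the backbone returns a tuple.
        return o[-1] if isinstance(o, tuple) else o


embedding_model = Model.from_pretrained("models/embedding").eval()
export_module = ResNetEmbeddingWrapper(embedding_model).eval()

torch.onnx.export(
    export_module,
    torch.randn(1, 200, 80),  # dummy input: 1 utterance, 200 frames, 80 mel bins
    "embedding_model.onnx",
    input_names=["fbank_features"],
    output_names=["embeddings"],
    # Allow variable batch size and variable number of frames at inference time.
    dynamic_axes={"fbank_features": {0: "batch_size", 1: "num_frames"}},
)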

So the fbank features have to be extracted separately before inference, using any fbank implementation (torchaudio, Kaldi, etc.). I ran a verification check (cosine similarity: 1.0000001192).
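
For anyone repeating the verification step, here is a minimal sketch of a cosine-similarity check in plain Python. The helper names (`cosine_similarity`, `embeddings_match`), the tolerance, and the placeholder vectors are illustrative assumptions, not part of pyannote or of the script above. It also shows why a value slightly above 1 is not alarming: 1.0000001192 equals 1.0 plus 2^-23, the smallest float32 step above 1.0, so it is consistent with rounding, assuming the similarity was computed in float32.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length sequences of floats."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def embeddings_match(a, b, tol=1e-5):
    """True when the cosine similarity is within `tol` of 1.0 (tol is an assumed threshold)."""
    return abs(cosine_similarity(a, b) - 1.0) <= tol


# Smallest float32 step above 1.0; 1.0 + FLOAT32_EPS is about 1.0000001192.
FLOAT32_EPS = 2.0 ** -23

# Placeholder vectors standing in for the PyTorch and ONNX embeddings
# (real embeddings from this model would have 256 values each).
torch_embedding = [0.12, -0.48, 0.33, 0.90]
onnx_embedding = [0.12, -0.48, 0.33, 0.90]

print(embeddings_match(torch_embedding, onnx_embedding))
```

In practice the two lists would be replaced by the flattened outputs of the PyTorch model and the ONNX session for the same fbank input.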
