# pyannote-coreml

This Core ML port of the Hugging Face `pyannote/speaker-diarization-community-1` pipeline was produced primarily by the Mobius coding agent. The directory is laid out so another agent can pick it up and run end-to-end, while still giving power users a clear manual path through the convert → compare → quantize toolchain.

## What Lives Here

- `convert-coreml.py`, `compare-models.py`, `quantize-models.py` — scripted pipeline for export, parity checks, and post-export optimizations.
- `coreml_models/` — default output folder for `.mlpackage` bundles plus resource JSON.
- `docs/` — background notes (`docs/plda-coreml.md`, conversion guides, optimization results).
- `coreml_wrappers.py`, `embedding_io.py`, `plda_module.py` — importable helpers for wrapping Core ML bundles inside PyTorch pipelines.
- `pyproject.toml`, `uv.lock` — reproducible Python 3.10.12 environment pinned to Torch 2.4, coremltools 7.2, and pyannote-audio 4.0.0.
- Sample clips (`yc_first_10s.wav`, `yc_first_minute.wav`, `../../../../longconvo-30m*.wav`) for smoke tests and benchmarking.

## Agent-Oriented Workflow

Mobius (or any compatible coding agent) can operate this toolkit by chaining three scripts:

1. `convert-coreml.py` exports the FBANK, segmentation, embedding, and PLDA components to Core ML (with optional selective FP16).
2. `compare-models.py` runs PyTorch vs Core ML parity tests, reports timing and DER/JER metrics, and refreshes plots under `plots/`.
3. `quantize-models.py` generates INT8/INT4/palettized variants, benchmarks latency and memory, and emits comparison charts.

All scripts write machine-readable summaries to disk so an agent can decide what to ship or flag regressions. Automation typically runs them in that order inside this directory with `uv run`.
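
As an illustration, a regression gate over those summaries might look like the sketch below. The key names (`pytorch_der`, `coreml_der`) and the tolerance are assumptions for illustration, not the scripts' actual output schema:

```python
# Hypothetical shipping gate for an agent: compare Core ML diarization
# error against the PyTorch baseline from a parity summary. Key names
# and tolerance are illustrative, not the scripts' real schema.
DER_TOLERANCE = 0.02

def should_ship(summary: dict) -> bool:
    """Return True when the Core ML export stays within tolerance."""
    delta = abs(summary["coreml_der"] - summary["pytorch_der"])
    return delta <= DER_TOLERANCE

summary = {"pytorch_der": 0.171, "coreml_der": 0.188}  # stand-in values
print(should_ship(summary))  # → True (delta ≈ 0.017, within 0.02)
```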

## Manual Pipeline

Prerequisites: macOS 14+, Xcode 15+, [uv](https://github.com/astral-sh/uv), and access to the gated Hugging Face repo. Accept the user agreement on [huggingface.co/pyannote/speaker-diarization-community-1](https://huggingface.co/pyannote/speaker-diarization-community-1) before attempting to download the checkpoints, then fetch the assets into `pyannote-speaker-diarization-community-1/` (run `git lfs pull` if necessary).

```bash
# 1. Create or refresh the local environment
uv sync

# 2. Convert PyTorch checkpoints to Core ML
uv run python convert-coreml.py --model-root ./pyannote-speaker-diarization-community-1 \
    --output-dir ./coreml_models
# Optional: add --selective-fp16 for mixed-precision exports

# 3. Compare PyTorch vs Core ML outputs, generate plots/metrics
uv run python compare-models.py --audio-path ../../../../longconvo-30m-last5m.wav \
    --model-root ./pyannote-speaker-diarization-community-1 \
    --coreml-dir ./coreml_models

# 4. Produce quantized variants and benchmark them (uses convert+compare outputs)
uv run python quantize-models.py --audio-path ../../../../longconvo-30m.wav \
    --coreml-dir ./coreml_models
# Add --skip-generation to benchmark existing variants only
```

Key artifacts land under `coreml_models/` (FP32/FP16 exports, the PLDA Core ML bundle, resource JSON files) and `plots/` (latency and accuracy reports). The scripts emit timing summaries and DER/JER results directly to stdout for quick inspection.
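
A quick way to confirm the conversion step produced everything the later steps need is a small existence check. This is a sketch; the bundle names follow the ones used elsewhere in this README:

```python
from pathlib import Path

# Bundles referenced elsewhere in this README; extend as needed.
EXPECTED_BUNDLES = [
    "embedding-community-1.mlpackage",
    "fbank-community-1.mlpackage",
    "plda-community-1.mlpackage",
]

def missing_bundles(coreml_dir: Path) -> list[str]:
    """Return the expected .mlpackage names absent from coreml_dir."""
    return [name for name in EXPECTED_BUNDLES if not (coreml_dir / name).exists()]

print(missing_bundles(Path("coreml_models")))
```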

## Using the Wrappers from Python

`coreml_wrappers.py` exposes helpers to drop the converted models into an existing pyannote pipeline. The snippet below loads the FBANK and embedding bundles, mirrors the PyTorch interface, and emits embeddings for a local clip.

```python
from pathlib import Path

import coremltools as ct
import torch
import torchaudio
from pyannote.audio import Model

from coreml_wrappers import CoreMLEmbeddingModule
from embedding_io import SEGMENTATION_FRAMES

root = Path(__file__).resolve().parent
embedding_ml = ct.models.MLModel(str(root / "coreml_models" / "embedding-community-1.mlpackage"))
fbank_ml = ct.models.MLModel(str(root / "coreml_models" / "fbank-community-1.mlpackage"))
prototype = Model.from_pretrained(str(root / "pyannote-speaker-diarization-community-1" / "embedding"))

# Mirror the PyTorch embedding interface on top of the Core ML bundles.
wrapper = CoreMLEmbeddingModule(embedding_ml, fbank_ml, prototype, output_key="embedding")

waveform, _ = torchaudio.load(root / "yc_first_10s.wav")
waveform = waveform.unsqueeze(0) if waveform.ndim == 1 else waveform  # ensure (channel, samples)
weights = torch.ones(1, SEGMENTATION_FRAMES)  # uniform per-frame weights
embedding = wrapper(waveform.unsqueeze(0), weights)  # add a batch dimension
print(embedding.shape)
```

Call `wrap_pipeline_with_coreml` to swap the segmentation and embedding stages inside a full PyTorch diarization pipeline while keeping the VBx/PLDA logic on-device.

## Status & Known Limitations

- ✅ Conversion, comparison, and quantization scripts are in place and agent-friendly.
- ✅ PLDA parameters now ship as a Core ML model (`plda-community-1.mlpackage`) with precise dtype handling (see `docs/plda-coreml.md`).
- ⚠️ Fixed 5 s embedding windows introduce mild oscillations around speaker transitions versus the variable-length PyTorch baseline (DER ~0.017–0.018). Plots under `plots/` illustrate the difference.
- 🔍 Further tuning ideas: adjust VBx thresholds, add post-processing to merge short segments, and investigate weighted-pooling exports once coremltools supports variable-length inputs.
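
The short-segment merging idea above can be sketched with plain `(start, end, speaker)` tuples. This is illustrative smoothing logic, not code from the repo:

```python
def absorb_short_segments(segments, min_duration=0.5):
    """Relabel segments shorter than min_duration with the preceding
    speaker, then fuse adjacent segments that share a speaker.

    segments: list of (start, end, speaker) tuples sorted by start time.
    """
    merged = []
    for start, end, speaker in segments:
        if merged and (end - start) < min_duration:
            speaker = merged[-1][2]  # adopt the preceding speaker label
        if merged and merged[-1][2] == speaker:
            merged[-1] = (merged[-1][0], end, speaker)  # fuse with previous
        else:
            merged.append((start, end, speaker))
    return merged

# A brief speaker flicker around a transition collapses into one segment.
segments = [(0.0, 2.0, "A"), (2.0, 2.2, "B"), (2.2, 5.0, "A")]
print(absorb_short_segments(segments))  # → [(0.0, 5.0, 'A')]
```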

## References

- Hugging Face pipeline: `pyannote/speaker-diarization-community-1`
- VBx clustering background: [VBx: Variational Bayes HMM Clustering](https://arxiv.org/abs/2012.14952)
- Additional notes and deep dives live in `docs/` (start with `docs/plda-coreml.md` and `ANE_OPTIMIZATION_RESULTS.md`).