
Commit 33fd6ea

pyannote/community-1 (#8)

* add agents doc
* clean up
* clean up more

1 parent f7519ee

File tree: 49 files changed (+8800 / -1 lines)

.gitignore

Lines changed: 3 additions & 1 deletion

```diff
@@ -1,3 +1,5 @@
 __pycache__
 .DS_Store
-.venv
+.venv
+
+*.wav
```

models/AGENTS.md

Lines changed: 37 additions & 0 deletions
# Repository Guidelines

## Project Structure & Module Organization

- Code lives under `models/{class}/{model}/{target}`; mirror existing patterns like `vad/silero-vad/coreml`.
- Each target directory is self-contained: `pyproject.toml`, `uv.lock`, conversion scripts, docs, and sample assets.
- Keep `README.md`/`CITATION.cff` next to the model. Push large binaries to Hugging Face and reference them here.

## Build, Test, and Development Commands

Run these from the target directory (Python 3.10.12):

- `uv sync` — create/refresh the env defined by `pyproject.toml`.
- `uv run python convert-coreml.py --output-dir ./build/<name>` — run conversion and emit Core ML bundles.
- `uv run python compare-models.py --audio-file <path> --coreml-dir <dir>` — benchmark converted models (if present).
- `uv run python test.py` — execute the model-specific smoke test.

## Deployment Targets & Runtime Tips

- Trace with `.CpuOnly`; target iOS 17+ and macOS 14+ (see the conversion sketch after this list).
- Use `uv` for reproducible installs; avoid system Python.
- Keep bundles small; prefer float16 where supported.
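A minimal sketch of how these targets map onto a conversion script, assuming coremltools 7.x; `TinyNet` and the input shape are placeholders, not a model from this repo:

```python
import coremltools as ct
import torch


class TinyNet(torch.nn.Module):
    """Stand-in for a real segmentation/embedding model."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x.mean(dim=-1)


example = torch.randn(1, 1, 16000)  # hypothetical (batch, channel, samples) input
traced = torch.jit.trace(TinyNet().eval(), example)

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="audio", shape=example.shape)],
    compute_units=ct.ComputeUnit.CPU_ONLY,      # the ".CpuOnly" guidance above
    minimum_deployment_target=ct.target.iOS17,  # iOS 17+ / macOS 14+
    compute_precision=ct.precision.FLOAT16,     # prefer float16 where supported
)
mlmodel.save("tiny.mlpackage")
```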
## Coding Style & Naming Conventions

- 4-space indentation, type hints when practical, and double-quoted strings.
- Lowercase-kebab-case for files/dirs; mirror upstream model names and runtime targets (`coreml`, `onnx`, etc.).
- When packaging libraries, place importable code under `src/<package>` and expose CLIs via `if __name__ == "__main__": main()` (a minimal skeleton follows this list).
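A minimal skeleton of that convention; the package path and option names are illustrative, not from this repo:

```python
# src/example_tools/cli.py (hypothetical package path)
import argparse


def main() -> None:
    parser = argparse.ArgumentParser(description="Convert a checkpoint to Core ML.")
    parser.add_argument("--output-dir", default="./build", help="where bundles are written")
    args = parser.parse_args()
    print(f"writing bundles to {args.output_dir}")


if __name__ == "__main__":
    main()
```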
## Testing Guidelines

- Ship a runnable sanity check using bundled assets (e.g., `yc_first_minute.wav`) and verify end-to-end output; see the sketch after this list.
- Prefer deterministic assertions or concise summary prints; record expected metrics/speedups for benchmarking utilities.
- Document prerequisites such as `git lfs install` before fetching large checkpoints.
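A sketch of such a sanity check; the 16 kHz mono expectation is an assumption about the bundled asset, not a documented property:

```python
import torchaudio


def smoke_test() -> None:
    # Bundled sample asset; sample rate and channel count are assumptions.
    waveform, sample_rate = torchaudio.load("yc_first_minute.wav")
    assert sample_rate == 16000, f"unexpected sample rate: {sample_rate}"
    assert waveform.ndim == 2 and waveform.shape[0] == 1  # (channels, samples)
    print(f"loaded {waveform.shape[1] / sample_rate:.1f}s of mono audio")


if __name__ == "__main__":
    smoke_test()
```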
## Commit & Pull Request Guidelines

- Commits: concise, imperative subjects; append issue numbers when relevant (e.g., `Move parakeet to the right folder (#4)`).
- Pull requests: describe the model, destination runtime, conversion steps, and validation evidence (logs, plots, or HF links). Call out deviations, new dependencies, and follow-up work.

## Model Assets & Distribution

- Store heavy weights, notebooks, and rendered plots externally (Hugging Face Hub). Include download instructions or automation scripts.
- Verify upstream license compliance before redistribution.

models/speaker-diarization/pyannote-community-1/LICENSE

Lines changed: 426 additions & 0 deletions
Large diffs are not rendered by default.
Lines changed: 38 additions & 0 deletions

# pyannote/speaker-diarization-community-1

Made possible by: [speaker-diarization-community-1](https://huggingface.co/pyannote/speaker-diarization-community-1)

```text
@inproceedings{
  author={Fluid Inference},
  title={{Speaker diarization via Core ML}},
  year=2025,
}

Speaker segmentation model
@inproceedings{Plaquet23,
  author={Alexis Plaquet and Hervé Bredin},
  title={{Powerset multi-class cross entropy loss for neural speaker diarization}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
}

Speaker embedding model
@inproceedings{Wang2023,
  title={Wespeaker: A research and production oriented speaker embedding learning toolkit},
  author={Wang, Hongji and Liang, Chengdong and Wang, Shuai and Chen, Zhengyang and Zhang, Binbin and Xiang, Xu and Deng, Yanlei and Qian, Yanmin},
  booktitle={ICASSP 2023, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={1--5},
  year={2023},
  organization={IEEE}
}

Speaker clustering
@article{Landini2022,
  author={Landini, Federico and Profant, J{\'a}n and Diez, Mireia and Burget, Luk{\'a}{\v{s}}},
  title={{Bayesian HMM clustering of x-vector sequences (VBx) in speaker diarization: theory, implementation and analysis on standard tasks}},
  year={2022},
  journal={Computer Speech \& Language},
}
```
Lines changed: 5 additions & 0 deletions

```text
pyannote-speaker-diarization-community-1/
coreml_models/
.matplotlib_cache/

build/
```
Lines changed: 92 additions & 0 deletions

# pyannote-coreml

This Core ML port of the Hugging Face `pyannote/speaker-diarization-community-1` pipeline was produced primarily by the Mobius coding agent. The directory is laid out so another agent can pick it up and run end-to-end, while still giving power users a clear manual path through the convert → compare → quantize toolchain.

## What Lives Here

- `convert-coreml.py`, `compare-models.py`, `quantize-models.py` — scripted pipeline for export, parity checks, and post-export optimizations.
- `coreml_models/` — default output folder for `.mlpackage` bundles plus resource JSON.
- `docs/` — background notes (`docs/plda-coreml.md`, conversion guides, optimization results).
- `coreml_wrappers.py`, `embedding_io.py`, `plda_module.py` — importable helpers for wrapping Core ML bundles inside PyTorch pipelines.
- `pyproject.toml`, `uv.lock` — reproducible Python 3.10.12 environment pinned to Torch 2.4, coremltools 7.2, pyannote-audio 4.0.0.
- Sample clips (`yc_first_10s.wav`, `yc_first_minute.wav`, `../../../../longconvo-30m*.wav`) for smoke tests and benchmarking.

## Agent-Oriented Workflow

Mobius (or any compatible coding agent) can operate this toolkit by chaining three scripts:

1. `convert-coreml.py` exports FBANK, segmentation, embedding, and PLDA components to Core ML (with optional selective FP16).
2. `compare-models.py` runs PyTorch vs Core ML parity tests, reports timing and DER/JER metrics, and refreshes plots under `plots/`.
3. `quantize-models.py` generates INT8/INT4/palettized variants, benchmarks latency and memory, and emits comparison charts.

All scripts write machine-readable summaries to disk so an agent can decide what to ship or flag regressions. Automation typically runs them in that order inside this directory with `uv run`.
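For orientation, a sketch of the coremltools 7.x machinery a step like `quantize-models.py` can build on; this illustrates the technique, not the script's actual internals, and the output paths are placeholders:

```python
import coremltools as ct
from coremltools.optimize.coreml import (
    OpLinearQuantizerConfig,
    OpPalettizerConfig,
    OptimizationConfig,
    linear_quantize_weights,
    palettize_weights,
)

model = ct.models.MLModel("coreml_models/embedding-community-1.mlpackage")

# INT8 weight quantization
int8_config = OptimizationConfig(global_config=OpLinearQuantizerConfig(mode="linear_symmetric"))
linear_quantize_weights(model, config=int8_config).save("coreml_models/embedding-int8.mlpackage")

# 4-bit palettization (k-means weight lookup tables)
pal_config = OptimizationConfig(global_config=OpPalettizerConfig(mode="kmeans", nbits=4))
palettize_weights(model, config=pal_config).save("coreml_models/embedding-palettized4.mlpackage")
```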
## Manual Pipeline

Prerequisites: macOS 14+, Xcode 15+, [uv](https://github.com/astral-sh/uv), and access to the gated Hugging Face repo. Accept the user agreement on [huggingface.co/pyannote/speaker-diarization-community-1](https://huggingface.co/pyannote/speaker-diarization-community-1) before attempting to download the checkpoints, then fetch the assets into `pyannote-speaker-diarization-community-1/` (run `git lfs pull` if necessary).

```bash
# 1. Create or refresh the local environment
uv sync

# 2. Convert PyTorch checkpoints to Core ML
uv run python convert-coreml.py --model-root ./pyannote-speaker-diarization-community-1 \
    --output-dir ./coreml_models
# Optional: add --selective-fp16 for mixed-precision exports

# 3. Compare PyTorch vs Core ML outputs, generate plots/metrics
uv run python compare-models.py --audio-path ../../../../longconvo-30m-last5m.wav \
    --model-root ./pyannote-speaker-diarization-community-1 \
    --coreml-dir ./coreml_models

# 4. Produce quantized variants and benchmark them (uses convert+compare outputs)
uv run python quantize-models.py --audio-path ../../../../longconvo-30m.wav \
    --coreml-dir ./coreml_models
# Add --skip-generation to benchmark existing variants only
```

Key artifacts land under `coreml_models/` (FP32/FP16 exports, the PLDA Core ML bundle, resource JSON files) and `plots/` (latency and accuracy reports). The scripts emit timing summaries and DER/JER results directly to stdout for quick inspection.

## Using the Wrappers from Python

`coreml_wrappers.py` exposes helpers to drop the converted models into an existing pyannote pipeline. The snippet below loads the FBANK and embedding bundles, mirrors the PyTorch interface, and emits embeddings for a local clip.

```python
from pathlib import Path

import coremltools as ct
import torch
import torchaudio
from pyannote.audio import Model

from coreml_wrappers import CoreMLEmbeddingModule
from embedding_io import SEGMENTATION_FRAMES

root = Path(__file__).resolve().parent
# MLModel expects a string path, so convert the Path objects explicitly.
embedding_ml = ct.models.MLModel(str(root / "coreml_models" / "embedding-community-1.mlpackage"))
fbank_ml = ct.models.MLModel(str(root / "coreml_models" / "fbank-community-1.mlpackage"))
prototype = Model.from_pretrained(str(root / "pyannote-speaker-diarization-community-1" / "embedding"))

wrapper = CoreMLEmbeddingModule(embedding_ml, fbank_ml, prototype, output_key="embedding")

waveform, _ = torchaudio.load(root / "yc_first_10s.wav")
# Ensure a (channels, samples) layout before adding the batch dimension.
waveform = waveform.unsqueeze(0) if waveform.ndim == 1 else waveform
weights = torch.ones(1, SEGMENTATION_FRAMES)
embedding = wrapper(waveform.unsqueeze(0), weights)
print(embedding.shape)
```

Call `wrap_pipeline_with_coreml` to swap the segmentation and embedding stages inside a full PyTorch diarization pipeline while keeping the VBx/PLDA logic on-device.
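A hypothetical invocation; the actual signature lives in `coreml_wrappers.py`, so the `coreml_dir` keyword shown here is an assumption:

```python
from pathlib import Path

from pyannote.audio import Pipeline

from coreml_wrappers import wrap_pipeline_with_coreml

root = Path(__file__).resolve().parent
pipeline = Pipeline.from_pretrained(str(root / "pyannote-speaker-diarization-community-1"))

# Hypothetical keyword argument; check coreml_wrappers.py for the real parameters.
pipeline = wrap_pipeline_with_coreml(pipeline, coreml_dir=root / "coreml_models")
print(pipeline(str(root / "yc_first_10s.wav")))
```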
## Status & Known Limitations

- ✅ Conversion, comparison, and quantization scripts are in place and agent-friendly.
- ✅ PLDA parameters now ship as a Core ML model (`plda-community-1.mlpackage`) with precise dtype handling (see `docs/plda-coreml.md`).
- ⚠️ Fixed 5 s embedding windows introduce mild oscillations around speaker transitions versus the variable-length PyTorch baseline (DER ~0.017–0.018). Plots under `plots/` illustrate the difference.
- 🔍 Further tuning ideas: adjust VBx thresholds, add post-processing to merge short segments, and investigate weighted-pooling exports once coremltools supports variable-length inputs.

## References

- Hugging Face pipeline: `pyannote/speaker-diarization-community-1`
- VBx clustering background: [VBx: Variational Bayes HMM Clustering](https://arxiv.org/abs/2012.14952)
- Additional notes and deep dives live in `docs/` (start with `docs/plda-coreml.md` and `ANE_OPTIMIZATION_RESULTS.md`).
