
Commit 9a5a902

feat: preparing for pyannote.audio 3.0.0 (#1470)
1 parent 76d86fa commit 9a5a902

File tree: 4 files changed (+67, -78 lines)

- CHANGELOG.md
- README.md
- requirements.txt
- version.txt

CHANGELOG.md

Lines changed: 18 additions & 25 deletions
@@ -1,24 +1,30 @@
  # Changelog

- ## Version 3.0 (xxxx-xx-xx)
+ ## Version 3.0.0 (2023-09-26)

- ### Highlights
+ ### Features and improvements

- - *"Harder"*. Fixed [major reproducibility issue](https://github.com/pyannote/pyannote-audio/issues/1370) with Ampere (A100) NVIDIA GPUs.
-   In case you tried `pyannote.audio` pretrained pipelines in the past on Ampere (A100) NVIDIA GPUs
-   and were disappointed by the accuracy, please give it another try with this new version.
- - "Better".
- - "Faster".
- - "Stronger".
+ - feat(pipeline): send pipeline to device with `pipeline.to(device)`
+ - feat(pipeline): add `return_embeddings` option to `SpeakerDiarization` pipeline
+ - feat(pipeline): make `segmentation_batch_size` and `embedding_batch_size` mutable in `SpeakerDiarization` pipeline (they now default to `1`)
+ - feat(pipeline): add progress hook to pipelines
+ - feat(task): add [powerset](https://www.isca-speech.org/archive/interspeech_2023/plaquet23_interspeech.html) support to `SpeakerDiarization` task
+ - feat(task): add support for multi-task models
+ - feat(task): add support for label scope in speaker diarization task
+ - feat(task): add support for missing classes in multi-label segmentation task
+ - feat(model): add segmentation model based on torchaudio self-supervised representation
+ - feat(pipeline): check version compatibility at load time
+ - improve(task): load metadata as tensors rather than pyannote.core instances
+ - improve(task): improve error message on missing specifications

  ### Breaking changes

  - BREAKING(task): rename `Segmentation` task to `SpeakerDiarization`
- - BREAKING(task): remove support for variable chunk duration for segmentation tasks
  - BREAKING(pipeline): pipeline defaults to CPU (use `pipeline.to(device)`)
  - BREAKING(pipeline): remove `SpeakerSegmentation` pipeline (use `SpeakerDiarization` pipeline)
- - BREAKING(pipeline): remove support for `FINCHClustering` and `HiddenMarkovModelClustering`
  - BREAKING(pipeline): remove `segmentation_duration` parameter from `SpeakerDiarization` pipeline (defaults to `duration` of segmentation model)
+ - BREAKING(task): remove support for variable chunk duration for segmentation tasks
+ - BREAKING(pipeline): remove support for `FINCHClustering` and `HiddenMarkovModelClustering`
  - BREAKING(setup): drop support for Python 3.7
  - BREAKING(io): channels are now 0-indexed (used to be 1-indexed)
  - BREAKING(io): multi-channel audio is no longer downmixed to mono by default.
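
Taken together, the new pipeline-level entries above amount to the following usage pattern. A minimal sketch, assuming pyannote.audio 3.0, a valid Hugging Face access token, and a local `audio.wav` (token and file name are placeholders):

```python
import torch
from pyannote.audio import Pipeline

# load the pretrained 3.0 pipeline (token string is a placeholder)
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.0",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")

# pipelines now default to CPU; move to GPU explicitly
pipeline.to(torch.device("cuda"))

# batch sizes are now mutable and default to 1;
# plain attribute assignment is assumed here
pipeline.segmentation_batch_size = 32
pipeline.embedding_batch_size = 32

# optionally get one embedding per returned speaker
diarization, embeddings = pipeline("audio.wav", return_embeddings=True)
```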
@@ -29,21 +35,8 @@
  - BREAKING(model): get rid of (flaky) `Model.introspection`
    If, for some weird reason, you wrote some custom code based on that,
    you should instead rely on `Model.example_output`.
+ - BREAKING(interactive): remove support for Prodigy recipes

- ### Features and improvements
-
- - feat(task): add [powerset](https://www.isca-speech.org/archive/interspeech_2023/plaquet23_interspeech.html) support to `SpeakerDiarization` task
- - feat(task): add support for multi-task models
- - feat(task): add support for label scope in speaker diarization task
- - feat(task): add support for missing classes in multi-label segmentation task
- - feat(model): add segmentation model based on torchaudio self-supervised representation
- - feat(pipeline): send pipeline to device with `pipeline.to(device)`
- - feat(pipeline): add `return_embeddings` option to `SpeakerDiarization` pipeline
- - feat(pipeline): make `segmentation_batch_size` and `embedding_batch_size` mutable in `SpeakerDiarization` pipeline (they now default to `1`)
- - feat(pipeline): add progress hook to pipelines
- - feat(pipeline): check version compatibility at load time
- - improve(task): load metadata as tensors rather than pyannote.core instances
- - improve(task): improve error message on missing specifications

  ### Fixes and improvements

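For code migrating off the removed `Model.introspection`, a minimal sketch of the `Model.example_output` replacement mentioned above (the checkpoint name is illustrative, and the printed structure is not spelled out in this diff, so it is simply printed):

```python
from pyannote.audio import Model

# any segmentation checkpoint works; this name is illustrative
model = Model.from_pretrained("pyannote/segmentation-3.0")

# summarizes what the model produces on an example chunk
# (e.g. number of frames and their layout)
print(model.example_output)
```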
@@ -54,7 +47,7 @@
  - fix(task): fix support for "balance" option
  - improve(task): shorten and improve structure of Tensorboard tags

- ### Dependencies
+ ### Dependencies update

  - setup: switch to torch 2.0+, torchaudio 2.0+, soundfile 0.12+, lightning 2.0+, torchmetrics 0.11+
  - setup: switch to pyannote.core 5.0+, pyannote.database 5.0+, and pyannote.pipeline 3.0+
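
A quick post-upgrade check that the installed stack meets these new floors. A sketch using only the standard library, with distribution names assumed to match their PyPI listings:

```python
import importlib.metadata as md

# minimum versions copied from the two changelog entries above
floors = {
    "torch": "2.0", "torchaudio": "2.0", "soundfile": "0.12",
    "lightning": "2.0", "torchmetrics": "0.11",
    "pyannote.core": "5.0", "pyannote.database": "5.0",
    "pyannote.pipeline": "3.0",
}
for package, floor in floors.items():
    print(f"{package}: installed {md.version(package)} (needs >= {floor})")
```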

README.md

Lines changed: 47 additions & 52 deletions
@@ -1,30 +1,37 @@
- > [!IMPORTANT]
- > I propose (paid) scientific [consulting services](https://herve.niderb.fr/consulting.html) to companies willing to make the most of their data and open-source speech processing toolkits (and `pyannote` in particular).
+ Using the `pyannote.audio` open-source toolkit in production?
+ Make the most of it thanks to our [consulting services](https://herve.niderb.fr/consulting.html).

- # Speaker diarization with `pyannote.audio`
+ # `pyannote.audio` speaker diarization toolkit

- `pyannote.audio` is an open-source toolkit written in Python for speaker diarization. Based on the [PyTorch](pytorch.org) machine learning framework, it provides a set of trainable end-to-end neural building blocks that can be combined and jointly optimized to build speaker diarization pipelines.
+ `pyannote.audio` is an open-source toolkit written in Python for speaker diarization. Based on the [PyTorch](pytorch.org) machine learning framework, it comes with state-of-the-art [pretrained models and pipelines](https://hf.co/pyannote) that can be further fine-tuned to your own data for even better performance.

  <p align="center">
   <a href="https://www.youtube.com/watch?v=37R_R82lfwA"><img src="https://img.youtube.com/vi/37R_R82lfwA/0.jpg"></a>
  </p>


- ## TL;DR [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pyannote/pyannote-audio/blob/develop/tutorials/intro.ipynb)
+ ## TL;DR
+
+ 1. Install [`pyannote.audio`](https://github.com/pyannote/pyannote-audio) `3.0` with `pip install pyannote.audio`
+ 2. Accept [`pyannote/segmentation-3.0`](https://hf.co/pyannote/segmentation-3.0) user conditions
+ 3. Accept [`pyannote/speaker-diarization-3.0`](https://hf.co/pyannote/speaker-diarization-3.0) user conditions
+ 4. Create an access token at [`hf.co/settings/tokens`](https://hf.co/settings/tokens).


  ```python
- # 1. visit hf.co/pyannote/speaker-diarization and hf.co/pyannote/segmentation and accept user conditions (only if requested)
- # 2. visit hf.co/settings/tokens to create an access token (only if you had to go through 1.)
- # 3. instantiate pretrained speaker diarization pipeline
  from pyannote.audio import Pipeline
- pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization",
-                                     use_auth_token="ACCESS_TOKEN_GOES_HERE")
+ pipeline = Pipeline.from_pretrained(
+     "pyannote/speaker-diarization-3.0",
+     use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")
+
+ # send pipeline to GPU (when available)
+ import torch
+ pipeline.to(torch.device("cuda"))

- # 4. apply pretrained pipeline
+ # apply pretrained pipeline
  diarization = pipeline("audio.wav")

- # 5. print the result
+ # print the result
  for turn, _, speaker in diarization.itertracks(yield_label=True):
      print(f"start={turn.start:.1f}s stop={turn.end:.1f}s speaker_{speaker}")
  # start=0.2s stop=1.5s speaker_0
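
The changelog also adds a progress hook to pipelines, which is useful on long files. A sketch reusing the `pipeline` object from the TL;DR snippet above, assuming the `ProgressHook` helper ships under `pyannote.audio.pipelines.utils.hook` in 3.0 (treat the import path as an assumption):

```python
from pyannote.audio.pipelines.utils.hook import ProgressHook

# wrap the call in a hook that reports per-step progress
with ProgressHook() as hook:
    diarization = pipeline("audio.wav", hook=hook)
```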
@@ -39,16 +46,7 @@ for turn, _, speaker in diarization.itertracks(yield_label=True):
  - :exploding_head: state-of-the-art performance (see [Benchmark](#benchmark))
  - :snake: Python-first API
  - :zap: multi-GPU training with [pytorch-lightning](https://pytorchlightning.ai/)
- - :control_knobs: data augmentation with [torch-audiomentations](https://github.com/asteroid-team/torch-audiomentations)
-
- ## Installation

- Only Python 3.8+ is supported.
-
- ```bash
- # install from develop branch
- pip install -qq https://github.com/pyannote/pyannote-audio/archive/refs/heads/develop.zip
- ```

  ## Documentation

@@ -72,53 +70,50 @@ pip install -qq https://github.com/pyannote/pyannote-audio/archive/refs/heads/de
  - 2022-12-02 > ["How I reached 1st place at Ego4D 2022, 1st place at Albayzin 2022, and 6th place at VoxSRC 2022 speaker diarization challenges"](tutorials/adapting_pretrained_pipeline.ipynb)
  - 2022-10-23 > ["One speaker segmentation model to rule them all"](https://herve.niderb.fr/fastpages/2022/10/23/One-speaker-segmentation-model-to-rule-them-all)
  - 2021-08-05 > ["Streaming voice activity detection with pyannote.audio"](https://herve.niderb.fr/fastpages/2021/08/05/Streaming-voice-activity-detection-with-pyannote.html)
- - Miscellaneous
-   - [Training with `pyannote-audio-train` command line tool](tutorials/training_with_cli.md)
-   - [Speaker verification](tutorials/speaker_verification.ipynb)
- - Visualization and debugging
+ - Videos
+   - [Introduction to speaker diarization](https://umotion.univ-lemans.fr/video/9513-speech-segmentation-and-speaker-diarization/) / JSALT 2023 summer school / 90 min
+   - [Speaker segmentation model](https://www.youtube.com/watch?v=wDH2rvkjymY) / Interspeech 2021 / 3 min
+   - [First release of pyannote.audio](https://www.youtube.com/watch?v=37R_R82lfwA) / ICASSP 2020 / 8 min

  ## Benchmark

- Out of the box, the `pyannote.audio` default speaker diarization [pipeline](https://hf.co/pyannote/speaker-diarization) is expected to be much better (and faster) in v2.x than in v1.1. Those numbers are diarization error rates (in %)
-
- | Dataset \ Version      | v1.1 | v2.0 | v2.1.1 (finetuned) |
- | ---------------------- | ---- | ---- | ------------------ |
- | AISHELL-4              | -    | 14.6 | 14.1 (14.5)        |
- | AliMeeting (channel 1) | -    | -    | 27.4 (23.8)        |
- | AMI (IHM)              | 29.7 | 18.2 | 18.9 (18.5)        |
- | AMI (SDM)              | -    | 29.0 | 27.1 (22.2)        |
- | CALLHOME (part2)       | -    | 30.2 | 32.4 (29.3)        |
- | DIHARD 3 (full)        | 29.2 | 21.0 | 26.9 (21.9)        |
- | VoxConverse (v0.3)     | 21.5 | 12.6 | 11.2 (10.7)        |
- | REPERE (phase2)        | -    | 12.6 | 8.2 ( 8.3)         |
- | This American Life     | -    | -    | 20.8 (15.2)        |
+ Out of the box, the `pyannote.audio` speaker diarization [pipeline](https://hf.co/pyannote/speaker-diarization-3.0) v3.0 is expected to be much better (and faster) than v2.x.
+ Those numbers are diarization error rates (in %):
+
+ | Dataset \ Version      | v1.1 | v2.0 | [v2.1](https://hf.co/pyannote/speaker-diarization-2.1) | [v3.0](https://hf.co/pyannote/speaker-diarization-3.0) | <a href="mailto:herve-at-niderb-dot-fr?subject=Premium pyannote.audio pipeline&body=Looks like I got your attention! Drop me an email for more details. Hervé.">Premium</a> |
+ | ---------------------- | ---- | ---- | ------ | ------ | --------- |
+ | AISHELL-4              | -    | 14.6 | 14.1   | 12.3   | 12.3      |
+ | AliMeeting (channel 1) | -    | -    | 27.4   | 24.3   | 19.4      |
+ | AMI (IHM)              | 29.7 | 18.2 | 18.9   | 19.0   | 16.7      |
+ | AMI (SDM)              | -    | 29.0 | 27.1   | 22.2   | 20.1      |
+ | AVA-AVD                | -    | -    | -      | 49.1   | 42.7      |
+ | DIHARD 3 (full)        | 29.2 | 21.0 | 26.9   | 21.7   | 17.0      |
+ | MSDWild                | -    | -    | -      | 24.6   | 20.4      |
+ | REPERE (phase2)        | -    | 12.6 | 8.2    | 7.8    | 7.8       |
+ | VoxConverse (v0.3)     | 21.5 | 12.6 | 11.2   | 11.3   | 9.5       |

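For reference, the diarization error rate (DER) reported in both tables is the standard metric (its definition is not part of this diff): the fraction of total speech duration that is wrongly detected or attributed,

```latex
\mathrm{DER} =
  \frac{\text{false alarm} + \text{missed detection} + \text{speaker confusion}}
       {\text{total duration of speech}}
```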
  ## Citations

  If you use `pyannote.audio` please use the following citations:

  ```bibtex
- @inproceedings{Bredin2020,
-   Title = {{pyannote.audio: neural building blocks for speaker diarization}},
-   Author = {{Bredin}, Herv{\'e} and {Yin}, Ruiqing and {Coria}, Juan Manuel and {Gelly}, Gregory and {Korshunov}, Pavel and {Lavechin}, Marvin and {Fustes}, Diego and {Titeux}, Hadrien and {Bouaziz}, Wassim and {Gill}, Marie-Philippe},
-   Booktitle = {ICASSP 2020, IEEE International Conference on Acoustics, Speech, and Signal Processing},
-   Year = {2020},
+ @inproceedings{Plaquet23,
+   author={Alexis Plaquet and Hervé Bredin},
+   title={{Powerset multi-class cross entropy loss for neural speaker diarization}},
+   year=2023,
+   booktitle={Proc. INTERSPEECH 2023},
  }
  ```

  ```bibtex
- @inproceedings{Bredin2021,
-   Title = {{End-to-end speaker segmentation for overlap-aware resegmentation}},
-   Author = {{Bredin}, Herv{\'e} and {Laurent}, Antoine},
-   Booktitle = {Proc. Interspeech 2021},
-   Year = {2021},
+ @inproceedings{Bredin23,
+   author={Hervé Bredin},
+   title={{pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe}},
+   year=2023,
+   booktitle={Proc. INTERSPEECH 2023},
  }
  ```

- ## Support
-
- For commercial enquiries and scientific consulting, please contact [me](mailto:[email protected]).
-
  ## Development

  The commands below will set up pre-commit hooks and packages needed for developing the `pyannote.audio` library.

requirements.txt

Lines changed: 1 addition & 0 deletions
@@ -3,6 +3,7 @@ einops >=0.6.0
  huggingface_hub >= 0.13.0
  lightning >= 2.0.1
  omegaconf >=2.1,<3.0
+ onnxruntime >= 1.16.0
  pyannote.core >= 5.0.0
  pyannote.database >= 5.0.1
  pyannote.metrics >= 3.2

version.txt

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
- 2.1.1
+ 3.0.0
