
Commit 9a5a902

feat: preparing for pyannote.audio 3.0.0 (#1470)
1 parent 76d86fa commit 9a5a902

File tree: 4 files changed (+67, -78 lines)

- CHANGELOG.md
- README.md
- requirements.txt
- version.txt

CHANGELOG.md

Lines changed: 18 additions & 25 deletions
@@ -1,24 +1,30 @@
  # Changelog

- ## Version 3.0 (xxxx-xx-xx)
+ ## Version 3.0.0 (2023-09-26)

- ### Highlights
+ ### Features and improvements

- - *"Harder"*. Fixed [major reproducibility issue](https://github.com/pyannote/pyannote-audio/issues/1370) with Ampere (A100) NVIDIA GPUs.
-   In case you tried `pyannote.audio` pretrained pipelines in the past on Ampere (A100) NVIDIA GPUs
-   and were disappointed by the accuracy, please give it another try with this new version.
- - "Better".
- - "Faster".
- - "Stronger".
+ - feat(pipeline): send pipeline to device with `pipeline.to(device)`
+ - feat(pipeline): add `return_embeddings` option to `SpeakerDiarization` pipeline
+ - feat(pipeline): make `segmentation_batch_size` and `embedding_batch_size` mutable in `SpeakerDiarization` pipeline (they now default to `1`)
+ - feat(pipeline): add progress hook to pipelines
+ - feat(task): add [powerset](https://www.isca-speech.org/archive/interspeech_2023/plaquet23_interspeech.html) support to `SpeakerDiarization` task
+ - feat(task): add support for multi-task models
+ - feat(task): add support for label scope in speaker diarization task
+ - feat(task): add support for missing classes in multi-label segmentation task
+ - feat(model): add segmentation model based on torchaudio self-supervised representation
+ - feat(pipeline): check version compatibility at load time
+ - improve(task): load metadata as tensors rather than pyannote.core instances
+ - improve(task): improve error message on missing specifications

  ### Breaking changes

  - BREAKING(task): rename `Segmentation` task to `SpeakerDiarization`
- - BREAKING(task): remove support for variable chunk duration for segmentation tasks
  - BREAKING(pipeline): pipeline defaults to CPU (use `pipeline.to(device)`)
  - BREAKING(pipeline): remove `SpeakerSegmentation` pipeline (use `SpeakerDiarization` pipeline)
- - BREAKING(pipeline): remove support for `FINCHClustering` and `HiddenMarkovModelClustering`
  - BREAKING(pipeline): remove `segmentation_duration` parameter from `SpeakerDiarization` pipeline (defaults to `duration` of segmentation model)
+ - BREAKING(task): remove support for variable chunk duration for segmentation tasks
+ - BREAKING(pipeline): remove support for `FINCHClustering` and `HiddenMarkovModelClustering`
  - BREAKING(setup): drop support for Python 3.7
  - BREAKING(io): channels are now 0-indexed (used to be 1-indexed)
  - BREAKING(io): multi-channel audio is no longer downmixed to mono by default.
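
Taken together, the new pipeline-level entries above amount to the following usage pattern. A minimal sketch, assuming pyannote.audio 3.0, a valid Hugging Face access token, and a local `audio.wav` (token and file name are placeholders):

```python
import torch
from pyannote.audio import Pipeline

# load the pretrained 3.0 pipeline (token string is a placeholder)
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.0",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")

# pipelines now default to CPU; move to GPU explicitly
pipeline.to(torch.device("cuda"))

# batch sizes are now mutable and default to 1;
# plain attribute assignment is assumed here
pipeline.segmentation_batch_size = 32
pipeline.embedding_batch_size = 32

# optionally get one embedding per returned speaker
diarization, embeddings = pipeline("audio.wav", return_embeddings=True)
```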
@@ -29,21 +35,8 @@
  - BREAKING(model): get rid of (flaky) `Model.introspection`
    If, for some weird reason, you wrote some custom code based on that,
    you should instead rely on `Model.example_output`.
+ - BREAKING(interactive): remove support for Prodigy recipes

- ### Features and improvements
-
- - feat(task): add [powerset](https://www.isca-speech.org/archive/interspeech_2023/plaquet23_interspeech.html) support to `SpeakerDiarization` task
- - feat(task): add support for multi-task models
- - feat(task): add support for label scope in speaker diarization task
- - feat(task): add support for missing classes in multi-label segmentation task
- - feat(model): add segmentation model based on torchaudio self-supervised representation
- - feat(pipeline): send pipeline to device with `pipeline.to(device)`
- - feat(pipeline): add `return_embeddings` option to `SpeakerDiarization` pipeline
- - feat(pipeline): make `segmentation_batch_size` and `embedding_batch_size` mutable in `SpeakerDiarization` pipeline (they now default to `1`)
- - feat(pipeline): add progress hook to pipelines
- - feat(pipeline): check version compatibility at load time
- - improve(task): load metadata as tensors rather than pyannote.core instances
- - improve(task): improve error message on missing specifications

  ### Fixes and improvements

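For code migrating off the removed `Model.introspection`, a minimal sketch of the `Model.example_output` replacement mentioned above (the checkpoint name is illustrative, and the printed structure is not spelled out in this diff, so it is simply printed):

```python
from pyannote.audio import Model

# any segmentation checkpoint works; this name is illustrative
model = Model.from_pretrained("pyannote/segmentation-3.0")

# summarizes what the model produces on an example chunk
# (e.g. number of frames and their layout)
print(model.example_output)
```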
@@ -54,7 +47,7 @@
  - fix(task): fix support for "balance" option
  - improve(task): shorten and improve structure of Tensorboard tags

- ### Dependencies
+ ### Dependencies update

  - setup: switch to torch 2.0+, torchaudio 2.0+, soundfile 0.12+, lightning 2.0+, torchmetrics 0.11+
  - setup: switch to pyannote.core 5.0+, pyannote.database 5.0+, and pyannote.pipeline 3.0+
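
A quick post-upgrade check that the installed stack meets these new floors. A sketch using only the standard library, with distribution names assumed to match their PyPI listings:

```python
import importlib.metadata as md

# minimum versions copied from the two changelog entries above
floors = {
    "torch": "2.0", "torchaudio": "2.0", "soundfile": "0.12",
    "lightning": "2.0", "torchmetrics": "0.11",
    "pyannote.core": "5.0", "pyannote.database": "5.0",
    "pyannote.pipeline": "3.0",
}
for package, floor in floors.items():
    print(f"{package}: installed {md.version(package)} (needs >= {floor})")
```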

README.md

Lines changed: 47 additions & 52 deletions
@@ -1,30 +1,37 @@
- > [!IMPORTANT]
- > I propose (paid) scientific [consulting services](https://herve.niderb.fr/consulting.html) to companies willing to make the most of their data and open-source speech processing toolkits (and `pyannote` in particular).
+ Using the `pyannote.audio` open-source toolkit in production?
+ Make the most of it thanks to our [consulting services](https://herve.niderb.fr/consulting.html).

- # Speaker diarization with `pyannote.audio`
+ # `pyannote.audio` speaker diarization toolkit

- `pyannote.audio` is an open-source toolkit written in Python for speaker diarization. Based on the [PyTorch](pytorch.org) machine learning framework, it provides a set of trainable end-to-end neural building blocks that can be combined and jointly optimized to build speaker diarization pipelines.
+ `pyannote.audio` is an open-source toolkit written in Python for speaker diarization. Based on the [PyTorch](pytorch.org) machine learning framework, it comes with state-of-the-art [pretrained models and pipelines](https://hf.co/pyannote) that can be further fine-tuned to your own data for even better performance.

  <p align="center">
   <a href="https://www.youtube.com/watch?v=37R_R82lfwA"><img src="https://img.youtube.com/vi/37R_R82lfwA/0.jpg"></a>
  </p>


- ## TL;DR [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pyannote/pyannote-audio/blob/develop/tutorials/intro.ipynb)
+ ## TL;DR
+
+ 1. Install [`pyannote.audio`](https://github.com/pyannote/pyannote-audio) `3.0` with `pip install pyannote.audio`
+ 2. Accept [`pyannote/segmentation-3.0`](https://hf.co/pyannote/segmentation-3.0) user conditions
+ 3. Accept [`pyannote/speaker-diarization-3.0`](https://hf.co/pyannote/speaker-diarization-3.0) user conditions
+ 4. Create an access token at [`hf.co/settings/tokens`](https://hf.co/settings/tokens).


  ```python
- # 1. visit hf.co/pyannote/speaker-diarization and hf.co/pyannote/segmentation and accept user conditions (only if requested)
- # 2. visit hf.co/settings/tokens to create an access token (only if you had to go through 1.)
- # 3. instantiate pretrained speaker diarization pipeline
  from pyannote.audio import Pipeline
- pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization",
-                                     use_auth_token="ACCESS_TOKEN_GOES_HERE")
+ pipeline = Pipeline.from_pretrained(
+     "pyannote/speaker-diarization-3.0",
+     use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")
+
+ # send pipeline to GPU (when available)
+ import torch
+ pipeline.to(torch.device("cuda"))

- # 4. apply pretrained pipeline
+ # apply pretrained pipeline
  diarization = pipeline("audio.wav")

- # 5. print the result
+ # print the result
  for turn, _, speaker in diarization.itertracks(yield_label=True):
      print(f"start={turn.start:.1f}s stop={turn.end:.1f}s speaker_{speaker}")
  # start=0.2s stop=1.5s speaker_0
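
The changelog also adds a progress hook to pipelines, which is useful on long files. A sketch reusing the `pipeline` object from the TL;DR snippet above, assuming the `ProgressHook` helper ships under `pyannote.audio.pipelines.utils.hook` in 3.0 (treat the import path as an assumption):

```python
from pyannote.audio.pipelines.utils.hook import ProgressHook

# wrap the call in a hook that reports per-step progress
with ProgressHook() as hook:
    diarization = pipeline("audio.wav", hook=hook)
```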
@@ -39,16 +46,7 @@ for turn, _, speaker in diarization.itertracks(yield_label=True):
  - :exploding_head: state-of-the-art performance (see [Benchmark](#benchmark))
  - :snake: Python-first API
  - :zap: multi-GPU training with [pytorch-lightning](https://pytorchlightning.ai/)
- - :control_knobs: data augmentation with [torch-audiomentations](https://github.com/asteroid-team/torch-audiomentations)
-
- ## Installation

- Only Python 3.8+ is supported.
-
- ```bash
- # install from develop branch
- pip install -qq https://github.com/pyannote/pyannote-audio/archive/refs/heads/develop.zip
- ```

  ## Documentation

@@ -72,53 +70,50 @@ pip install -qq https://github.com/pyannote/pyannote-audio/archive/refs/heads/de
  - 2022-12-02 > ["How I reached 1st place at Ego4D 2022, 1st place at Albayzin 2022, and 6th place at VoxSRC 2022 speaker diarization challenges"](tutorials/adapting_pretrained_pipeline.ipynb)
  - 2022-10-23 > ["One speaker segmentation model to rule them all"](https://herve.niderb.fr/fastpages/2022/10/23/One-speaker-segmentation-model-to-rule-them-all)
  - 2021-08-05 > ["Streaming voice activity detection with pyannote.audio"](https://herve.niderb.fr/fastpages/2021/08/05/Streaming-voice-activity-detection-with-pyannote.html)
- - Miscellaneous
-   - [Training with `pyannote-audio-train` command line tool](tutorials/training_with_cli.md)
-   - [Speaker verification](tutorials/speaker_verification.ipynb)
- - Visualization and debugging
+ - Videos
+   - [Introduction to speaker diarization](https://umotion.univ-lemans.fr/video/9513-speech-segmentation-and-speaker-diarization/) / JSALT 2023 summer school / 90 min
+   - [Speaker segmentation model](https://www.youtube.com/watch?v=wDH2rvkjymY) / Interspeech 2021 / 3 min
+   - [First release of pyannote.audio](https://www.youtube.com/watch?v=37R_R82lfwA) / ICASSP 2020 / 8 min

  ## Benchmark

- Out of the box, the `pyannote.audio` default speaker diarization [pipeline](https://hf.co/pyannote/speaker-diarization) is expected to be much better (and faster) in v2.x than in v1.1. Those numbers are diarization error rates (in %)
-
- | Dataset \ Version      | v1.1 | v2.0 | v2.1.1 (finetuned) |
- | ---------------------- | ---- | ---- | ------------------ |
- | AISHELL-4              | -    | 14.6 | 14.1 (14.5)        |
- | AliMeeting (channel 1) | -    | -    | 27.4 (23.8)        |
- | AMI (IHM)              | 29.7 | 18.2 | 18.9 (18.5)        |
- | AMI (SDM)              | -    | 29.0 | 27.1 (22.2)        |
- | CALLHOME (part2)       | -    | 30.2 | 32.4 (29.3)        |
- | DIHARD 3 (full)        | 29.2 | 21.0 | 26.9 (21.9)        |
- | VoxConverse (v0.3)     | 21.5 | 12.6 | 11.2 (10.7)        |
- | REPERE (phase2)        | -    | 12.6 | 8.2 ( 8.3)         |
- | This American Life     | -    | -    | 20.8 (15.2)        |
+ Out of the box, the `pyannote.audio` speaker diarization [pipeline](https://hf.co/pyannote/speaker-diarization-3.0) v3.0 is expected to be much better (and faster) than v2.x.
+ Those numbers are diarization error rates (in %):
+
+ | Dataset \ Version      | v1.1 | v2.0 | [v2.1](https://hf.co/pyannote/speaker-diarization-2.1) | [v3.0](https://hf.co/pyannote/speaker-diarization-3.0) | <a href="mailto:herve-at-niderb-dot-fr?subject=Premium pyannote.audio pipeline&body=Looks like I got your attention! Drop me an email for more details. Hervé.">Premium</a> |
+ | ---------------------- | ---- | ---- | ------ | ------ | --------- |
+ | AISHELL-4              | -    | 14.6 | 14.1   | 12.3   | 12.3      |
+ | AliMeeting (channel 1) | -    | -    | 27.4   | 24.3   | 19.4      |
+ | AMI (IHM)              | 29.7 | 18.2 | 18.9   | 19.0   | 16.7      |
+ | AMI (SDM)              | -    | 29.0 | 27.1   | 22.2   | 20.1      |
+ | AVA-AVD                | -    | -    | -      | 49.1   | 42.7      |
+ | DIHARD 3 (full)        | 29.2 | 21.0 | 26.9   | 21.7   | 17.0      |
+ | MSDWild                | -    | -    | -      | 24.6   | 20.4      |
+ | REPERE (phase2)        | -    | 12.6 | 8.2    | 7.8    | 7.8       |
+ | VoxConverse (v0.3)     | 21.5 | 12.6 | 11.2   | 11.3   | 9.5       |

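For reference, the diarization error rate (DER) reported in both tables is the standard metric (its definition is not part of this diff): the fraction of total speech duration that is wrongly detected or attributed,

```latex
\mathrm{DER} =
  \frac{\text{false alarm} + \text{missed detection} + \text{speaker confusion}}
       {\text{total duration of speech}}
```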
  ## Citations

  If you use `pyannote.audio` please use the following citations:

  ```bibtex
- @inproceedings{Bredin2020,
-   Title = {{pyannote.audio: neural building blocks for speaker diarization}},
-   Author = {{Bredin}, Herv{\'e} and {Yin}, Ruiqing and {Coria}, Juan Manuel and {Gelly}, Gregory and {Korshunov}, Pavel and {Lavechin}, Marvin and {Fustes}, Diego and {Titeux}, Hadrien and {Bouaziz}, Wassim and {Gill}, Marie-Philippe},
-   Booktitle = {ICASSP 2020, IEEE International Conference on Acoustics, Speech, and Signal Processing},
-   Year = {2020},
+ @inproceedings{Plaquet23,
+   author={Alexis Plaquet and Hervé Bredin},
+   title={{Powerset multi-class cross entropy loss for neural speaker diarization}},
+   year=2023,
+   booktitle={Proc. INTERSPEECH 2023},
  }
  ```

  ```bibtex
- @inproceedings{Bredin2021,
-   Title = {{End-to-end speaker segmentation for overlap-aware resegmentation}},
-   Author = {{Bredin}, Herv{\'e} and {Laurent}, Antoine},
-   Booktitle = {Proc. Interspeech 2021},
-   Year = {2021},
+ @inproceedings{Bredin23,
+   author={Hervé Bredin},
+   title={{pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe}},
+   year=2023,
+   booktitle={Proc. INTERSPEECH 2023},
  }
  ```

- ## Support
-
- For commercial enquiries and scientific consulting, please contact [me](mailto:[email protected]).
-
  ## Development

  The commands below will set up pre-commit hooks and packages needed for developing the `pyannote.audio` library.

requirements.txt

Lines changed: 1 addition & 0 deletions
@@ -3,6 +3,7 @@ einops >=0.6.0
  huggingface_hub >= 0.13.0
  lightning >= 2.0.1
  omegaconf >=2.1,<3.0
+ onnxruntime >= 1.16.0
  pyannote.core >= 5.0.0
  pyannote.database >= 5.0.1
  pyannote.metrics >= 3.2

version.txt

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
- 2.1.1
+ 3.0.0
