Commit adaf770

Merge branch 'release/3.3.0'
2 parents 70a8507 + d260ba0

File tree

16 files changed: +2438 −47 lines changed

CHANGELOG.md

Lines changed: 31 additions & 2 deletions
@@ -1,5 +1,33 @@
 # Changelog
 
+## Version 3.3.0 (2024-06-14)
+
+### TL;DR
+
+`pyannote.audio` does [speech separation](https://hf.co/pyannote/speech-separation-ami-1.0): multi-speaker audio in, one audio channel per speaker out!
+
+```bash
+pip install pyannote.audio[separation]==3.3.0
+```
+
+### New features
+
+- feat(task): add `PixIT` joint speaker diarization and speech separation task (with [@joonaskalda](https://github.com/joonaskalda/))
+- feat(model): add `ToTaToNet` joint speaker diarization and speech separation model (with [@joonaskalda](https://github.com/joonaskalda/))
+- feat(pipeline): add `SpeechSeparation` pipeline (with [@joonaskalda](https://github.com/joonaskalda/))
+- feat(io): add option to select torchaudio `backend`
+
+### Fixes
+
+- fix(task): fix wrong train/development split when training with (some) meta-protocols ([#1709](https://github.com/pyannote/pyannote-audio/issues/1709))
+- fix(task): fix metadata preparation with missing validation subset ([@clement-pages](https://github.com/clement-pages/))
+
+### Improvements
+
+- improve(io): when available, default to using `soundfile` backend
+- improve(pipeline): do not extract embeddings when `max_speakers` is set to 1
+- improve(pipeline): optimize memory usage of most pipelines ([#1713](https://github.com/pyannote/pyannote-audio/pull/1713) by [@benniekiss](https://github.com/benniekiss/))
+
 ## Version 3.2.0 (2024-05-08)
 
 ### New features
@@ -18,6 +46,7 @@
 - fix(task): fix estimation of training set size (with [@FrenchKrab](https://github.com/FrenchKrab))
 - fix(hook): fix `torch.Tensor` support in `ArtifactHook`
 - fix(doc): fix typo in `Powerset` docstring (with [@lukasstorck](https://github.com/lukasstorck))
+- fix(doc): remove mention of unsupported `numpy.ndarray` waveform (with [@Purfview](https://github.com/Purfview))
 
 ### Improvements
 
@@ -26,12 +55,12 @@
 - improve(io): switch to `torchaudio >= 2.2.0`
 - improve(doc): update tutorials (with [@clement-pages](https://github.com/clement-pages/))
 
-## Breaking changes
+### Breaking changes
 
 - BREAKING(model): get rid of `Model.example_output` in favor of `num_frames` method, `receptive_field` property, and `dimension` property
 - BREAKING(task): custom tasks need to be updated (see "Add your own task" tutorial)
 
-## Community contributions
+### Community contributions
 
 - community: add tutorial for offline use of `pyannote/speaker-diarization-3.1` (by [@simonottenhauskenbun](https://github.com/simonottenhauskenbun))
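For orientation, here is a minimal sketch of how the new `SpeechSeparation` pipeline announced above might be used. `Pipeline.from_pretrained` and the checkpoint name come from the changelog; the two-value return, the 16 kHz sample rate, and the `scipy`-based saving loop are assumptions meant only to illustrate "one audio channel per speaker out", not verified 3.3.0 API.

```python
# Minimal sketch: checkpoint name from the changelog above; the return values
# and the saving loop are illustrative assumptions, not verified 3.3.0 API.
import scipy.io.wavfile

from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speech-separation-ami-1.0",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN",  # placeholder token
)

# multi-speaker audio in ...
diarization, sources = pipeline("meeting.wav")  # assumed two-value return: diarization + separated sources

# ... one audio channel per speaker out (assumed 16 kHz sample rate)
for s, speaker in enumerate(diarization.labels()):
    scipy.io.wavfile.write(f"{speaker}.wav", 16000, sources.data[:, s])
```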

README.md

Lines changed: 3 additions & 3 deletions
@@ -1,5 +1,5 @@
-Using `pyannote.audio` open-source toolkit in production?
-Make the most of it thanks to our [consulting services](https://herve.niderb.fr/consulting.html).
+Using `pyannote.audio` open-source toolkit in production?
+Consider switching to [pyannoteAI](https://www.pyannote.ai) for better and faster options.
 
 # `pyannote.audio` speaker diarization toolkit
 
@@ -79,7 +79,7 @@ for turn, _, speaker in diarization.itertracks(yield_label=True):
 Out of the box, `pyannote.audio` speaker diarization [pipeline](https://hf.co/pyannote/speaker-diarization-3.1) v3.1 is expected to be much better (and faster) than v2.x.
 Those numbers are diarization error rates (in %):
 
-| Benchmark | [v2.1](https://hf.co/pyannote/speaker-diarization-2.1) | [v3.1](https://hf.co/pyannote/speaker-diarization-3.1) | [Premium](https://forms.office.com/e/GdqwVgkZ5C) |
+| Benchmark | [v2.1](https://hf.co/pyannote/speaker-diarization-2.1) | [v3.1](https://hf.co/pyannote/speaker-diarization-3.1) | [pyannoteAI](https://www.pyannote.ai) |
 | --------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------ | ------------------------------------------------------ | ------------------------------------------------ |
 | [AISHELL-4](https://arxiv.org/abs/2104.03603) | 14.1 | 12.2 | 11.9 |
 | [AliMeeting](https://www.openslr.org/119/) (channel 1) | 27.4 | 24.4 | 22.5 |

pyannote/audio/core/inference.py

Lines changed: 4 additions & 5 deletions
@@ -559,9 +559,6 @@ def aggregate(
             step=frames.step,
         )
 
-        masks = 1 - np.isnan(scores)
-        scores.data = np.nan_to_num(scores.data, copy=True, nan=0.0)
-
         # Hamming window used for overlap-add aggregation
         hamming_window = (
             np.hamming(num_frames_per_chunk).reshape(-1, 1)
@@ -613,11 +610,13 @@ def aggregate(
         )
 
         # loop on the scores of sliding chunks
-        for (chunk, score), (_, mask) in zip(scores, masks):
+        for chunk, score in scores:
            # chunk ~ Segment
            # score ~ (num_frames_per_chunk, num_classes)-shaped np.ndarray
            # mask ~ (num_frames_per_chunk, num_classes)-shaped np.ndarray
-
+            mask = 1 - np.isnan(score)
+            np.nan_to_num(score, copy=False, nan=0.0)
+
             start_frame = frames.closest_frame(chunk.start + 0.5 * frames.duration)
 
             aggregated_output[start_frame : start_frame + num_frames_per_chunk] += (
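This refactoring computes the NaN mask one chunk at a time instead of materializing a `masks` array covering every sliding window up front, which is the memory optimization referenced by the changelog entry ([#1713]). A standalone numpy sketch of the per-chunk pattern follows; the shapes and the final normalization step are illustrative and are not the full `Inference.aggregate` implementation.

```python
# Standalone sketch of per-chunk masking + Hamming-weighted overlap-add.
# Shapes and the normalization step are illustrative assumptions.
import numpy as np

num_frames_per_chunk, num_classes = 293, 4

# fake scores for one sliding chunk; NaN marks frames without a prediction
score = np.random.rand(num_frames_per_chunk, num_classes)
score[:10] = np.nan

# per-chunk masking, as in the updated loop (no full-size `masks` array is ever built)
mask = 1 - np.isnan(score)
np.nan_to_num(score, copy=False, nan=0.0)

# Hamming-weighted overlap-add accumulation
hamming_window = np.hamming(num_frames_per_chunk).reshape(-1, 1)
aggregated_output = np.zeros((num_frames_per_chunk, num_classes))
overlapping_chunk_count = np.zeros((num_frames_per_chunk, num_classes))

aggregated_output += score * mask * hamming_window
overlapping_chunk_count += mask * hamming_window

# normalize by accumulated weights (guarding frames that nothing contributed to)
average = aggregated_output / np.maximum(overlapping_chunk_count, 1e-12)
```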

pyannote/audio/core/io.py

Lines changed: 44 additions & 10 deletions
@@ -48,21 +48,41 @@
 - a "IOBase" instance with "read" and "seek" support: open("audio.wav", "rb")
 - a "Mapping" with any of the above as "audio" key: {"audio": ...}
 - a "Mapping" with both "waveform" and "sample_rate" key:
-    {"waveform": (channel, time) numpy.ndarray or torch.Tensor, "sample_rate": 44100}
+    {"waveform": (channel, time) torch.Tensor, "sample_rate": 44100}
 
 For last two options, an additional "channel" key can be provided as a zero-indexed
 integer to load a specific channel: {"audio": "stereo.wav", "channel": 0}
 """
 
 
-def get_torchaudio_info(file: AudioFile):
+def get_torchaudio_info(
+    file: AudioFile, backend: str = None
+) -> torchaudio.AudioMetaData:
     """Protocol preprocessor used to cache output of torchaudio.info
 
     This is useful to speed future random access to this file, e.g.
     in dataloaders using Audio.crop a lot....
+
+    Parameters
+    ----------
+    file : AudioFile
+    backend : str
+        torchaudio backend to use. Defaults to 'soundfile' if available,
+        or the first available backend.
+
+    Returns
+    -------
+    info : torchaudio.AudioMetaData
+        Audio file metadata
     """
 
-    info = torchaudio.info(file["audio"])
+    if not backend:
+        backends = (
+            torchaudio.list_audio_backends()
+        )  # e.g ['ffmpeg', 'soundfile', 'sox']
+        backend = "soundfile" if "soundfile" in backends else backends[0]
+
+    info = torchaudio.info(file["audio"], backend=backend)
 
     # rewind if needed
     if isinstance(file["audio"], IOBase):
@@ -82,6 +102,9 @@ class Audio:
         In case of multi-channel audio, convert to single-channel audio
         using one of the following strategies: select one channel at
         'random' or 'downmix' by averaging all channels.
+    backend : str
+        torchaudio backend to use. Defaults to 'soundfile' if available,
+        or the first available backend.
 
     Usage
     -----
@@ -126,7 +149,7 @@ def validate_file(file: AudioFile) -> Mapping:
         -------
         validated_file : Mapping
             {"audio": str, "uri": str, ...}
-            {"waveform": array or tensor, "sample_rate": int, "uri": str, ...}
+            {"waveform": tensor, "sample_rate": int, "uri": str, ...}
             {"audio": file, "uri": "stream"} if `file` is an IOBase instance
 
         Raises
@@ -148,7 +171,7 @@ def validate_file(file: AudioFile) -> Mapping:
             raise ValueError(AudioFileDocString)
 
         if "waveform" in file:
-            waveform: Union[np.ndarray, Tensor] = file["waveform"]
+            waveform: Tensor = file["waveform"]
            if len(waveform.shape) != 2 or waveform.shape[0] > waveform.shape[1]:
                raise ValueError(
                    "'waveform' must be provided as a (channel, time) torch Tensor."
@@ -179,11 +202,19 @@ def validate_file(file: AudioFile) -> Mapping:
 
         return file
 
-    def __init__(self, sample_rate=None, mono=None):
+    def __init__(self, sample_rate: int = None, mono=None, backend: str = None):
         super().__init__()
         self.sample_rate = sample_rate
         self.mono = mono
 
+        if not backend:
+            backends = (
+                torchaudio.list_audio_backends()
+            )  # e.g ['ffmpeg', 'soundfile', 'sox']
+            backend = "soundfile" if "soundfile" in backends else backends[0]
+
+        self.backend = backend
+
     def downmix_and_resample(self, waveform: Tensor, sample_rate: int) -> Tensor:
         """Downmix and resample
 
@@ -244,7 +275,7 @@ def get_duration(self, file: AudioFile) -> float:
         if "torchaudio.info" in file:
             info = file["torchaudio.info"]
         else:
-            info = get_torchaudio_info(file)
+            info = get_torchaudio_info(file, backend=self.backend)
 
         frames = info.num_frames
         sample_rate = info.sample_rate
@@ -291,7 +322,7 @@ def __call__(self, file: AudioFile) -> Tuple[Tensor, int]:
             sample_rate = file["sample_rate"]
 
         elif "audio" in file:
-            waveform, sample_rate = torchaudio.load(file["audio"])
+            waveform, sample_rate = torchaudio.load(file["audio"], backend=self.backend)
 
             # rewind if needed
             if isinstance(file["audio"], IOBase):
@@ -349,7 +380,7 @@ def crop(
             sample_rate = info.sample_rate
 
         else:
-            info = get_torchaudio_info(file)
+            info = get_torchaudio_info(file, backend=self.backend)
             frames = info.num_frames
             sample_rate = info.sample_rate
 
@@ -401,7 +432,10 @@
         else:
             try:
                 data, _ = torchaudio.load(
-                    file["audio"], frame_offset=start_frame, num_frames=num_frames
+                    file["audio"],
+                    frame_offset=start_frame,
+                    num_frames=num_frames,
+                    backend=self.backend,
                 )
                 # rewind if needed
                 if isinstance(file["audio"], IOBase):
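The diff above threads an optional torchaudio `backend` argument through `get_torchaudio_info` and `Audio`, defaulting to `soundfile` when it is installed. A short sketch of how a caller might rely on the new default or pin a backend explicitly; the file name and the `"ffmpeg"` choice are placeholders, while the constructor keywords mirror the `__init__` signature shown in the diff.

```python
# Sketch of the new `backend` handling; "example.wav" is a placeholder file.
import torchaudio

from pyannote.audio.core.io import Audio

# backends actually available in this environment, e.g. ['ffmpeg', 'soundfile', 'sox']
print(torchaudio.list_audio_backends())

# rely on the new default: 'soundfile' when installed, otherwise the first listed backend
audio = Audio(sample_rate=16000, mono="downmix")

# or pin a backend explicitly via the keyword added in this release
audio = Audio(sample_rate=16000, mono="downmix", backend="ffmpeg")

waveform, sample_rate = audio({"audio": "example.wav"})  # (channel, time) tensor + int
```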

pyannote/audio/core/task.py

Lines changed: 9 additions & 5 deletions
@@ -362,12 +362,13 @@ def prepare_data(self):
 
         if self.has_validation:
             files_iter = itertools.chain(
-                self.protocol.train(), self.protocol.development()
+                zip(itertools.repeat("train"), self.protocol.train()),
+                zip(itertools.repeat("development"), self.protocol.development()),
             )
         else:
-            files_iter = self.protocol.train()
+            files_iter = zip(itertools.repeat("train"), self.protocol.train())
 
-        for file_id, file in enumerate(files_iter):
+        for file_id, (subset, file) in enumerate(files_iter):
             # gather metadata and update metadata_unique_values so that each metadatum
             # (e.g. source database or label) is represented by an integer.
             metadatum = dict()
@@ -378,7 +379,8 @@ def prepare_data(self):
             metadatum["database"] = metadata_unique_values["database"].index(
                 file["database"]
             )
-            metadatum["subset"] = Subsets.index(file["subset"])
+
+            metadatum["subset"] = Subsets.index(subset)
 
             # keep track of label scope (file, database, or global)
             metadatum["scope"] = Scopes.index(file["scope"])
@@ -593,7 +595,9 @@ def prepare_data(self):
         prepared_data["metadata-labels"] = np.array(unique_labels, dtype=np.str_)
         unique_labels.clear()
 
-        self.prepare_validation(prepared_data)
+        if self.has_validation:
+            self.prepare_validation(prepared_data)
+
         self.post_prepare_data(prepared_data)
 
         # save prepared data on the disk
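The fix above pairs every file with the subset iterator it was actually drawn from, rather than trusting the `subset` key stored on the file itself, which could point at the wrong split for some meta-protocols (#1709). A self-contained sketch of the pairing pattern; `DummyProtocol` is a hypothetical stand-in for a pyannote.database protocol exposing `train()` / `development()` iterators.

```python
# Self-contained sketch of the subset-pairing fix; DummyProtocol is hypothetical.
import itertools


class DummyProtocol:
    def train(self):
        yield {"uri": "file_A", "subset": "train"}
        yield {"uri": "file_B", "subset": "train"}

    def development(self):
        yield {"uri": "file_C", "subset": "development"}


protocol, has_validation = DummyProtocol(), True

if has_validation:
    # pair each file with the subset iterator it actually comes from
    files_iter = itertools.chain(
        zip(itertools.repeat("train"), protocol.train()),
        zip(itertools.repeat("development"), protocol.development()),
    )
else:
    files_iter = zip(itertools.repeat("train"), protocol.train())

for file_id, (subset, file) in enumerate(files_iter):
    # `subset` now comes from the iterator, not from file["subset"]
    print(file_id, subset, file["uri"])
```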
