Releases: pyannote/pyannote-audio

Version 4.0.2

19 Nov 20:45

What's Changed

  • BREAKING(util): make Binarize.__call__ return string tracks (instead of int) (@benniekiss)
  • fix(torch): pin torch, torchcodec, and torchaudio versions to avoid segmentation fault
  • fix(pyannoteAI): update pyannoteAI wrapper to return both regular and exclusive diarization
  • feat(pipeline): add Pipeline.cuda() convenience method (@tkanarsky)
  • feat(pipeline): add preload option to base Pipeline.__call__ to force preloading audio in memory (@antoinelaurent)
  • feat(cli): add option to apply pipeline on a directory of audio files
  • improve(util): make permutate faster thanks to vectorized cost function
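
Applying a pipeline to a directory of audio files can be pictured with the sketch below. It does not reproduce the actual CLI or its flags: the helper names `iter_audio_files` and `apply_to_directory`, the extension set, and the `apply` callable (standing in for a loaded pipeline) are all hypothetical.

```python
from pathlib import Path
from typing import Callable, Dict, Iterator

# Assumption: which extensions count as audio is illustrative only.
AUDIO_EXTENSIONS = {".wav", ".flac", ".mp3", ".ogg"}


def iter_audio_files(directory: str) -> Iterator[Path]:
    """Yield audio files found directly in `directory`, sorted by name."""
    for path in sorted(Path(directory).iterdir()):
        if path.is_file() and path.suffix.lower() in AUDIO_EXTENSIONS:
            yield path


def apply_to_directory(apply: Callable[[str], object], directory: str) -> Dict[str, object]:
    """Run `apply` (e.g. a loaded pipeline) on each audio file, keyed by file name."""
    return {path.name: apply(str(path)) for path in iter_audio_files(directory)}
```

With a real pipeline, `apply` would simply be the pipeline object itself, since pipelines are called with a path to an audio file.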

Full Changelog: 4.0.1...4.0.2

Version 4.0.1

10 Oct 12:26

pyannote.audio 4.0

29 Sep 12:04

Version 4.0.0

TL;DR

Improved speaker assignment and counting

The pyannote/speaker-diarization-community-1 pretrained pipeline relies on VBx clustering instead of agglomerative hierarchical clustering (as suggested by BUT Speech@FIT researchers Petr Pálka and Jiangyu Han).

Exclusive speaker diarization

The pyannote/speaker-diarization-community-1 pretrained pipeline returns a new exclusive speaker diarization, on top of the regular speaker diarization.
This feature, backported from our latest commercial model, simplifies the reconciliation between fine-grained speaker diarization timestamps and (sometimes not so precise) transcription timestamps.

from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-community-1", token="huggingface-access-token")
output = pipeline("/path/to/conversation.wav")
print(output.speaker_diarization)            # regular speaker diarization
print(output.exclusive_speaker_diarization)  # exclusive speaker diarization
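
To illustrate the reconciliation with transcription timestamps mentioned above, here is a minimal sketch that labels each transcribed word with the speaker whose turn overlaps it the most. It works on plain tuples rather than on pyannote objects; the `assign_speakers` helper and the sample data are hypothetical, and the turns are assumed to have been extracted from `output.exclusive_speaker_diarization` beforehand.

```python
from typing import List, Optional, Tuple

Turn = Tuple[float, float, str]  # (start, end, speaker)
Word = Tuple[float, float, str]  # (start, end, text)


def assign_speakers(words: List[Word], turns: List[Turn]) -> List[Tuple[str, Optional[str]]]:
    """Label each word with the speaker whose turn overlaps it the most."""
    labeled = []
    for w_start, w_end, text in words:
        best_speaker, best_overlap = None, 0.0
        for t_start, t_end, speaker in turns:
            # Duration shared by the word and the turn (negative if disjoint).
            overlap = min(w_end, t_end) - max(w_start, t_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labeled.append((text, best_speaker))
    return labeled


# Hypothetical data: exclusive turns never overlap each other.
turns = [(0.0, 2.0, "SPEAKER_00"), (2.0, 4.5, "SPEAKER_01")]
words = [(0.2, 0.6, "hello"), (1.8, 2.5, "there"), (5.0, 5.3, "bye")]
print(assign_speakers(words, turns))
```

Because exclusive turns do not overlap, a word can only straddle a boundary between two consecutive turns, which keeps the "largest overlap" rule unambiguous in most cases. Words that fall outside every turn are left with `None`.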

Faster training

Metadata caching and optimized dataloaders make training on large-scale datasets much faster.
This led to a 15x speedup on pyannoteAI internal large-scale training.

pyannoteAI premium speaker diarization

Change one line of code to use pyannoteAI premium models and enjoy more accurate speaker diarization.

from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained(
-    "pyannote/speaker-diarization-community-1", token="huggingface-access-token")
+    "pyannote/speaker-diarization-precision-2", token="pyannoteAI-api-key")
diarization = pipeline("/path/to/conversation.wav")

Offline (air-gapped) use

Pipelines can now be stored alongside their internal models in the same repository, streamlining fully offline use.

  1. Accept pyannote/speaker-diarization-community-1 pipeline user agreement

  2. Clone the pipeline repository from Hugging Face (if prompted for a password, use a Hugging Face access token with the correct permissions)

    $ git lfs install
    $ git clone https://hf.co/pyannote/speaker-diarization-community-1 /path/to/directory/pyannote-speaker-diarization-community-1

  3. Enjoy!

    # load pipeline from disk (works without internet connection)
    from pyannote.audio import Pipeline
    pipeline = Pipeline.from_pretrained('/path/to/directory/pyannote-speaker-diarization-community-1')
    
    # run the pipeline locally on your computer
    diarization = pipeline("audio.wav")

Telemetry

With the optional telemetry feature in pyannote.audio, you can choose to send anonymous usage metrics to help the pyannote team improve the library.

Breaking changes

  • BREAKING(io): remove support for sox and soundfile audio I/O backends (only ffmpeg or in-memory audio is supported)
  • BREAKING(setup): drop support for Python < 3.10
  • BREAKING(hub): rename use_auth_token to token
  • BREAKING(hub): drop support for {pipeline_name}@{revision} syntax in Model.from_pretrained(...) and Pipeline.from_pretrained(...) -- use new revision keyword argument instead
  • BREAKING(task): remove OverlappedSpeechDetection task (part of SpeakerDiarization task)
  • BREAKING(pipeline): remove OverlappedSpeechDetection and Resegmentation unmaintained pipelines (part of SpeakerDiarization)
  • BREAKING(cache): rely on huggingface_hub caching directory (PYANNOTE_CACHE is no longer used)
  • BREAKING(inference): Inference now only supports already instantiated models
  • BREAKING(task): drop support for multilabel training in SpeakerDiarization task
  • BREAKING(task): drop support for warm_up option in SpeakerDiarization task
  • BREAKING(task): drop support for weigh_by_cardinality option in SpeakerDiarization task
  • BREAKING(task): drop support for vad_loss option in SpeakerDiarization task
  • BREAKING(chore): switch to native namespace package
  • BREAKING(cli): remove deprecated pyannote-audio-train CLI
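
The two hub-related changes above (use_auth_token renamed to token, and the {pipeline_name}@{revision} syntax replaced by a revision keyword argument) can be handled when migrating 3.x call sites with a small helper such as the sketch below. The `migrate_args` function is hypothetical and not part of pyannote.audio; it also assumes the identifier is a hub name rather than a local path containing "@".

```python
from typing import Dict, Optional, Tuple


def migrate_args(legacy_name: str, use_auth_token: Optional[str] = None) -> Tuple[str, Dict[str, str]]:
    """Translate 3.x-style arguments into a name plus 4.x-style keyword arguments."""
    name, separator, revision = legacy_name.partition("@")
    kwargs: Dict[str, str] = {}
    if separator and revision:
        kwargs["revision"] = revision  # replaces the old "name@revision" syntax
    if use_auth_token is not None:
        kwargs["token"] = use_auth_token  # replaces the old use_auth_token keyword
    return name, kwargs


name, kwargs = migrate_args("pyannote/speaker-diarization-3.1@main", use_auth_token="huggingface-access-token")
# The result would then be passed on as: Pipeline.from_pretrained(name, **kwargs)
```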

New features

  • feat(io): switch from torchaudio to torchcodec for audio I/O
  • feat(pipeline): add support for VBx clustering (@Selesnyan and @jyhan03)
  • feat(pyannoteAI): add wrapper around pyannoteAI SDK
  • improve(hub): add support for pipeline repos that also include underlying models
  • feat(clustering): add support for k-means clustering
  • feat(model): add wav2vec_frozen option to freeze/unfreeze wav2vec in SSeRiouSS architecture
  • feat(task): add support for manual optimization in SpeakerDiarization task
  • feat(utils): add hidden option to ProgressHook
  • feat(utils): add FilterByNumberOfSpeakers protocol files filter
  • feat(core): add Calibration class to calibrate logits/distances into probabilities
  • feat(metric): add DetectionErrorRate, SegmentationErrorRate, DiarizationPrecision, and DiarizationRecall metrics
  • feat(cli): add CLI to download, apply, benchmark, and optimize pipelines
  • feat(cli): add CLI to strip checkpoints to their bare inference minimum
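
As background for the Calibration entry above, turning raw distances into probabilities is commonly done by fitting a logistic function on labeled examples. The sketch below shows that general idea in plain Python. It is not the library's Calibration class, whose interface and fitting method are not described here; `fit_logistic` and the toy data are assumptions made for illustration.

```python
import math
from typing import List, Tuple


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def fit_logistic(scores: List[float], labels: List[int],
                 lr: float = 0.5, steps: int = 2000) -> Tuple[float, float]:
    """Fit p = sigmoid(a * score + b) by gradient descent on the log loss."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            error = sigmoid(a * s + b) - y  # derivative of the log loss w.r.t. the logit
            grad_a += error * s
            grad_b += error
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


# Toy data: small embedding distances mean "same speaker" (label 1).
distances = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
same_speaker = [1, 1, 1, 0, 0, 0]
a, b = fit_logistic(distances, same_speaker)


def calibrate(distance: float) -> float:
    """Map a raw distance to an estimated same-speaker probability."""
    return sigmoid(a * distance + b)
```

Since smaller distances indicate the same speaker, the fitted slope `a` comes out negative, so `calibrate` decreases as the distance grows.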

Improvements

  • improve(model): improve WavLM (un)freezing support for SSeRiouSS architecture (@clement-pages)
  • improve(task): improve SpeakerDiarization training with manual optimization (@clement-pages)
  • improve(train): speed up dataloaders
  • improve(setup): switch to uv
  • improve(setup): switch to lightning from pytorch-lightning
  • improve(utils): improve dependency check when loading pretrained models and/or pipeline
  • improve(utils): add option to skip dependency check
  • improve(utils): add option to load a pretrained model checkpoint from an io.BytesIO buffer
  • improve(pipeline): add option to load a pretrained pipeline from a dict (@benniekiss)

Fixes

  • fix(model): improve WavLM (un)freezing support for ToTaToNet architecture (@clement-pages)
  • fix(separation): fix clipping issue in speech separation pipeline (@joonaskalda)
  • fix(separation): fix alignment between separated sources and diarization (@Lebourdais and @clement-pages)
  • fix(separation): prevent leakage removal collar from being applied to diarization (@clement-pages)
  • fix(separation): fix PixIT training with manual optimization (@clement-pages)
  • fix(doc): fix link to pytorch (@emmanuel-ferdman)
  • fix(task): fix corner case with small (<9) number of validation samples (@antoinelaurent)
  • fix(doc): fix default embedding in SpeechSeparation and SpeakerDiarization docstring (@razi-tm)

Version 3.4.0

09 Sep 07:11

Maintenance release

Upcoming major releases of pyannote.{core,database,metrics,pipeline} dependencies will break 3.x branch.
Version 3.4.0 pins those dependencies to compatible versions.

Version 3.3.1

23 Jun 00:30

Breaking changes

  • setup: drop support for Python 3.8

Version 3.3.0

14 Jun 08:41

TL;DR

pyannote.audio does speech separation: multi-speaker audio in, one audio channel per speaker out!

pip install "pyannote.audio[separation]==3.3.0"

New features

  • feat(task): add PixIT joint speaker diarization and speech separation task (with @joonaskalda)
  • feat(model): add ToTaToNet joint speaker diarization and speech separation model (with @joonaskalda)
  • feat(pipeline): add SpeechSeparation pipeline (with @joonaskalda)
  • feat(io): add option to select torchaudio backend

Fixes

  • fix(task): fix wrong train/development split when training with (some) meta-protocols (#1709)
  • fix(task): fix metadata preparation with missing validation subset (@clement-pages)

Improvements

  • improve(io): when available, default to using soundfile backend
  • improve(pipeline): do not extract embeddings when max_speakers is set to 1
  • improve(pipeline): optimize memory usage of most pipelines (#1713 by @benniekiss)

Version 3.2.0

08 May 09:51

New features

  • feat(task): add option to cache task training metadata to speed up training (with @clement-pages)
  • feat(model): add receptive_field, num_frames and dimension to models (with @Bilal-Rahou)
  • feat(model): add fbank_only property to WeSpeaker models
  • feat(util): add Powerset.permutation_mapping to help with permutation in powerset space (with @FrenchKrab)
  • feat(sample): add sample file at pyannote.audio.sample.SAMPLE_FILE
  • feat(metric): add reduce option to diarization_error_rate metric (with @Bilal-Rahou)
  • feat(pipeline): add Waveform and SampleRate preprocessors
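
To make "permutation in powerset space" concrete: in a powerset encoding, each class stands for a set of simultaneously active speakers, so relabeling speakers also reorders the classes. The sketch below enumerates the classes and computes that reordering. It illustrates the concept only; the function names, the class ordering, and the direction of the mapping are assumptions and may differ from what Powerset.permutation_mapping actually returns.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def powerset_classes(num_speakers: int, max_set_size: int) -> List[Tuple[int, ...]]:
    """Enumerate speaker subsets of size 0..max_set_size, smallest sets first."""
    classes: List[Tuple[int, ...]] = []
    for size in range(max_set_size + 1):
        classes.extend(combinations(range(num_speakers), size))
    return classes


def permutation_mapping(num_speakers: int, max_set_size: int,
                        permutation: Sequence[int]) -> List[int]:
    """Map each powerset class index to its index after relabeling speakers.

    `permutation[i]` is the new label given to speaker i.
    """
    classes = powerset_classes(num_speakers, max_set_size)
    index = {c: i for i, c in enumerate(classes)}
    return [index[tuple(sorted(permutation[s] for s in c))] for c in classes]


# 3 speakers, at most 2 active at once: 1 + 3 + 3 = 7 classes.
print(powerset_classes(3, 2))
# Swapping speakers 0 and 1 swaps classes {0}/{1} and {0,2}/{1,2}.
print(permutation_mapping(3, 2, (1, 0, 2)))
```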

Fixes

  • fix(task): fix random generators and their reproducibility (with @FrenchKrab)
  • fix(task): fix estimation of training set size (with @FrenchKrab)
  • fix(hook): fix torch.Tensor support in ArtifactHook
  • fix(doc): fix typo in Powerset docstring (with @lukasstorck)

Improvements

  • improve(metric): add support for number of speakers mismatch in diarization_error_rate metric
  • improve(pipeline): track both Model and nn.Module attributes in Pipeline.to(device)
  • improve(io): switch to torchaudio >= 2.2.0
  • improve(doc): update tutorials (with @clement-pages)

Breaking changes

  • BREAKING(model): get rid of Model.example_output in favor of num_frames method, receptive_field property, and dimension property
  • BREAKING(task): custom tasks need to be updated (see "Add your own task" tutorial)

Version 3.1.1

01 Dec 13:26

TL;DR

Providing num_speakers to pyannote/speaker-diarization-3.1 now works as expected.

Version 3.1.0

16 Nov 12:37

TL;DR

pyannote/speaker-diarization-3.1 no longer requires the unpopular ONNX runtime.

Full changelog

New features

  • feat(model): add WeSpeaker embedding wrapper based on PyTorch
  • feat(model): add support for multi-speaker statistics pooling
  • feat(pipeline): add TimingHook for profiling processing time
  • feat(pipeline): add ArtifactHook for saving internal steps
  • feat(pipeline): add support for list of hooks with Hooks
  • feat(utils): add "soft" option to Powerset.to_multilabel

Fixes

  • fix(pipeline): add missing "embedding" hook call in SpeakerDiarization
  • fix(pipeline): fix AgglomerativeClustering to honor num_clusters when provided
  • fix(pipeline): fix frame-wise speaker count exceeding max_speakers or detected num_speakers in SpeakerDiarization pipeline

Improvements

  • improve(pipeline): compute fbank on GPU when requested

Breaking changes

  • BREAKING(pipeline): rename WeSpeakerPretrainedSpeakerEmbedding to ONNXWeSpeakerPretrainedSpeakerEmbedding
  • BREAKING(setup): remove onnxruntime dependency.
    You can still use ONNX hbredin/wespeaker-voxceleb-resnet34-LM but you will have to install onnxruntime yourself.
  • BREAKING(pipeline): remove logging_hook (use ArtifactHook instead)
  • BREAKING(pipeline): remove onset and offset parameters in SpeakerDiarizationMixin.speaker_count
    You should now binarize segmentations before passing them to speaker_count.
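
The last item asks callers to binarize segmentation scores themselves. As a rough picture of what onset/offset binarization followed by counting involves, here is a plain-Python sketch using hysteresis thresholding. Both functions are hypothetical stand-ins written for illustration; they are not the library's Binarize utility or SpeakerDiarizationMixin.speaker_count, and they work on lists rather than on pyannote data structures.

```python
from typing import List


def binarize(scores: List[float], onset: float = 0.5, offset: float = 0.5) -> List[bool]:
    """Hysteresis thresholding: switch on above `onset`, switch off below `offset`."""
    active = False
    decisions = []
    for score in scores:
        if active:
            active = score >= offset  # stay on until the score drops below offset
        else:
            active = score > onset  # switch on only once the score exceeds onset
        decisions.append(active)
    return decisions


def count_speakers(binarized: List[List[bool]]) -> List[int]:
    """Count active speakers per frame from per-speaker binary activity."""
    return [sum(frame) for frame in zip(*binarized)]


# Hypothetical per-speaker scores over five frames.
speaker_scores = [
    [0.1, 0.7, 0.5, 0.3, 0.2],
    [0.8, 0.8, 0.5, 0.5, 0.7],
]
binarized = [binarize(s, onset=0.6, offset=0.4) for s in speaker_scores]
print(count_speakers(binarized))
```

Using a lower `offset` than `onset` keeps a speaker active through brief dips in the score, which avoids chopping one turn into several short ones.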

Version 3.0.1

28 Sep 19:47

TL;DR

pyannote/speaker-diarization-3.0 is now much faster when sent to GPU.

import torch
from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.0")
pipeline.to(torch.device("cuda"))

Full changelog

Fixes and improvements

  • fix: fix WeSpeakerPretrainedSpeakerEmbedding GPU support

Dependencies update

  • setup: switch from onnxruntime to onnxruntime-gpu