@desh2608
Collaborator

This workflow shows how we can use SpeechBrain x-vectors + sklearn agglomerative clustering to perform crude speaker diarization. It can be used on top of the whisper workflow to obtain speaker-attributed transcripts.
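
For readers who want a feel for the clustering step, here is a minimal sketch. It is not the code from this PR: the embeddings are synthetic stand-ins for per-segment speaker embeddings, `cluster_speakers` is a hypothetical helper name, and the number of speakers is assumed to be known in advance. The vectors are L2-normalized so that Euclidean distance orders pairs the same way cosine distance would.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering


def cluster_speakers(embeddings: np.ndarray, num_speakers: int) -> np.ndarray:
    """Assign a speaker index to each segment embedding (hypothetical helper)."""
    # L2-normalize each row; on unit vectors, Euclidean distance is
    # monotonic in cosine distance.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    # Average-linkage agglomerative clustering with a fixed number of clusters.
    clusterer = AgglomerativeClustering(n_clusters=num_speakers, linkage="average")
    return clusterer.fit_predict(unit)


# Synthetic stand-ins for real embeddings: two "speakers", five segments each.
rng = np.random.default_rng(0)
centers = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
embeddings = np.vstack([c + 0.05 * rng.standard_normal((5, 3)) for c in centers])

labels = cluster_speakers(embeddings, num_speakers=2)
print(labels)
```

In a real pipeline the rows of `embeddings` would come from a speaker embedding model run over short speech segments, and the number of speakers would usually have to be estimated (for example with a distance threshold) rather than given.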

@pzelasko
Collaborator

This is cool, what is the reason you don't want to merge it?

@desh2608
Collaborator Author

> This is cool, what is the reason you don't want to merge it?

Mainly because this approach isn't really benchmarked on anything, and I am not sure how well the ECAPA-TDNN embeddings would work with agglomerative clustering.

@flyingleafe
Contributor

@desh2608 pyannote.audio is basically ECAPA-TDNN + agglomerative clustering, and it is benchmarked quite well.
(https://github.com/pyannote/pyannote-audio)
Why not use it directly?

@desh2608
Collaborator Author

> @desh2608 pyannote.audio is basically ECAPA-TDNN + agglomerative clustering, and it is benchmarked quite well. (https://github.com/pyannote/pyannote-audio) Why not use it directly?

I think that was in the older Pyannote, if I'm not mistaken? Pyannote 2.0 uses end-to-end segmentation, which performs much better. In any case, this was just a quick DIY workflow. It should be relatively easy for folks to just use Pyannote to create RTTMs and then use SupervisionSet.from_rttm() to create Lhotse manifests.
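
A rough sketch of the RTTM-to-manifest route might look like the following. The `Turn` tuple, the helper names, and the file name `rec1.rttm` are all hypothetical, and the hard-coded turns stand in for whatever a diarization system would output. Only `SupervisionSet.from_rttm()` is taken from the comment above; the import is kept inside a function so the RTTM-writing part runs without Lhotse installed.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List, NamedTuple


class Turn(NamedTuple):
    """One speaker turn (hypothetical container for diarization output)."""
    recording_id: str
    start: float     # seconds
    duration: float  # seconds
    speaker: str


def to_rttm_lines(turns: Iterable[Turn], channel: int = 1) -> List[str]:
    """Format turns as RTTM SPEAKER records; unused fields are written as <NA>."""
    return [
        f"SPEAKER {t.recording_id} {channel} {t.start:.3f} {t.duration:.3f} "
        f"<NA> <NA> {t.speaker} <NA> <NA>"
        for t in turns
    ]


def load_supervisions(rttm_path):
    """Build Lhotse supervisions from an RTTM file (requires lhotse)."""
    from lhotse import SupervisionSet  # imported lazily; optional dependency
    return SupervisionSet.from_rttm(rttm_path)


# Made-up turns standing in for real diarization output.
turns = [Turn("rec1", 0.0, 2.5, "spk0"), Turn("rec1", 2.5, 1.75, "spk1")]
rttm_lines = to_rttm_lines(turns)

# Write the RTTM file to a temporary directory.
rttm_path = Path(tempfile.mkdtemp()) / "rec1.rttm"
rttm_path.write_text("\n".join(rttm_lines) + "\n")
print(rttm_path.read_text())
```

With Lhotse installed, `load_supervisions(rttm_path)` would then give a `SupervisionSet` whose segments carry the speaker labels from the RTTM file.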

@flyingleafe
Contributor

> > @desh2608 pyannote.audio is basically ECAPA-TDNN + agglomerative clustering, and it is benchmarked quite well. (https://github.com/pyannote/pyannote-audio) Why not use it directly?
>
> I think that was in the older Pyannote, if I'm not mistaken? Pyannote 2.0 uses end-to-end segmentation, which performs much better. In any case, this was just a quick DIY workflow. It should be relatively easy for folks to just use Pyannote to create RTTMs and then use SupervisionSet.from_rttm() to create Lhotse manifests.

Well, not quite: the segmentation model in Pyannote 2.0 is only the first step; the assignment of speakers to the segments is still done with ECAPA-TDNN + clustering. But whatever.

3 participants