# Speaker Anonymization
This repository contains the speaker anonymization system developed at the Institute for Natural Language Processing
(IMS) at the University of Stuttgart, Germany.

This branch contains the code for our paper [Anonymizing Speech with Generative Adversarial Networks to Preserve Speaker Privacy](https://arxiv.org/abs/2210.07002), which we will present soon at [SLT 2022](https://slt2022.org/). It is an extension of our first anonymization system described in [Speaker Anonymization with Phonetic Intermediate Representations](https://www.isca-speech.org/archive/interspeech_2022/meyer22b_interspeech.html), but uses a Wasserstein Generative Adversarial Network to generate artificial target speakers.

If you want to see a list of all papers and implementations within this project, please visit the [main branch](https://github.com/DigitalPhonetics/speaker-anonymization/tree/main).

[comment]: <> (**Check out the live demo to this code on Hugging Face: [https://huggingface.co/spaces/sarinam/speaker-anonymization](https://huggingface.co/spaces/sarinam/speaker-anonymization)**)

This implementation is similar to [our submission](https://www.voiceprivacychallenge.org/results-2022/docs/3___T04.pdf) to the [Voice Privacy Challenge 2022](https://www.voiceprivacychallenge.org/results-2022/).

The system is based on the Voice Privacy Challenge 2020, which is included as a submodule. It uses the basic idea of
speaker embedding anonymization with neural synthesis, and uses the data and evaluation framework of the challenge.
For detailed descriptions of the system, please read our papers linked above.

### Added Features
The system is an extension of our first anonymization pipeline described in [our Interspeech paper](https://www.isca-speech.org/archive/interspeech_2022/meyer22b_interspeech.html). Given an input audio, two kinds of information are extracted: (a) the linguistic content in the form of phone sequences, using a custom Automatic Speech Recognition (ASR) model, and (b) a vector encoding the speaker information as a speaker embedding, formed by a concatenation of x-vector and ECAPA-TDNN embeddings. Using a previously trained Wasserstein GAN that converts random noise into natural-like yet artificial speaker vectors, we randomly sample a new target speaker embedding. If this target vector is dissimilar enough from the original speaker embedding (b), measured by cosine distance, we use this target speaker and the phone sequence (a) to resynthesize the utterance with a custom Speech Synthesis model, resulting in an anonymous version of the original audio.
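
To make this flow concrete, here is a minimal sketch of the steps described above. All component names (`asr_model`, `embedding_extractor`, `gan`, `tts_model`) are illustrative placeholders and not the actual interfaces of this repository; see the pipeline code for the real entry points.

```python
# Illustrative sketch of the anonymization flow (hypothetical component names,
# not this repository's actual API).
from scipy.spatial.distance import cosine

def anonymize(audio, asr_model, embedding_extractor, gan, tts_model, min_distance=0.3):
    # (a) linguistic content: phone sequence from the custom ASR model
    phones = asr_model.recognize(audio)

    # (b) speaker information: concatenated x-vector + ECAPA-TDNN embedding
    source_embedding = embedding_extractor(audio)

    # sample artificial speaker embeddings from the trained WGAN until one is
    # dissimilar enough from the source speaker (cosine distance threshold)
    target_embedding = gan.sample()
    while cosine(source_embedding, target_embedding) < min_distance:
        target_embedding = gan.sample()

    # resynthesize the utterance with the custom speech synthesis model
    return tts_model.synthesize(phones, target_embedding)
```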

In the extension described in [this paper](https://arxiv.org/abs/2210.07002), we mainly use the same pipeline and models as in the first version of the system (see the branch [phonetic_representations](https://github.com/DigitalPhonetics/speaker-anonymization/tree/phonetic_representations)). However, the ASR model has been further improved by hyperparameter optimization, and the GAN-based anonymization has been introduced as a novel speaker selection/generation method.

The current code on this branch expects the models of release v1.2. Please make sure to download these models before running the code.

### Wasserstein GAN
The Generative Adversarial Network (GAN) is trained to minimize the Wasserstein distance between the original and the generated distribution, and is hence called a Wasserstein GAN, or WGAN. It consists of two parts: a generator that converts random noise into a vector of the same shape as our speaker embeddings, and a critic that estimates the distance between the real and generated distributions. Unlike in vanilla GANs, our discriminator (the critic) does not decide between real and fake data points but instead compares their distributions, which reduces the chance of common problems like mode collapse and the imitation of training data points. To further increase the chance of convergence (another common problem with GANs), we train the model with Quadratic Transport Cost.
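
As a rough illustration of this setup, the sketch below shows a generator and critic pair with the standard WGAN losses. The noise dimension, embedding dimension and layer sizes are assumptions for the example, and the plain Wasserstein loss is shown instead of the quadratic transport cost used in the actual training; the real models are ResNets as described in the paper.

```python
# Minimal WGAN sketch in PyTorch. Dimensions, layer sizes and the plain
# Wasserstein loss are illustrative only; the actual system uses ResNet
# generator/critic models and the quadratic transport cost.
import torch
import torch.nn as nn

NOISE_DIM, EMBED_DIM = 16, 704   # assumed sizes, not taken from this repository

generator = nn.Sequential(       # noise -> vector shaped like a speaker embedding
    nn.Linear(NOISE_DIM, 256), nn.ReLU(),
    nn.Linear(256, EMBED_DIM),
)
critic = nn.Sequential(          # embedding -> unbounded realness score (no sigmoid)
    nn.Linear(EMBED_DIM, 256), nn.ReLU(),
    nn.Linear(256, 1),
)

def critic_loss(real_embeddings, noise):
    # the critic widens the score gap between real and generated embeddings,
    # which yields an estimate of the Wasserstein distance
    fake_embeddings = generator(noise).detach()
    return critic(fake_embeddings).mean() - critic(real_embeddings).mean()

def generator_loss(noise):
    # the generator tries to make its outputs score like real embeddings,
    # i.e. to shrink the estimated distance
    return -critic(generator(noise)).mean()
```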

During training, we compare the vectors generated by our network to the speaker embeddings of real speakers, which were extracted on the utterance level, meaning that one speaker is represented in this speaker pool as many times as we consider utterances from this speaker. In this way, we increase the number of data points in our training data. However, GANs are usually trained on larger datasets than the ones considered in the framework of the Voice Privacy Challenge, which we follow, and on image data with higher dimensions than speaker embeddings. Therefore, we reduced the size of the input noise and the size of the generator and critic ResNet models. More information about this and our hyperparameter selection is given in our paper.

During inference, we simply generate a vector using our GAN (or sample one of a set of pre-generated vectors) and compare it to the speaker embedding of the input speaker. If both vectors have a cosine distance of at least 0.3, we consider the two vectors (i.e., speakers) as dissimilar enough and take the generated one as the target speaker embedding.
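
A small sketch of this selection step, assuming the pre-generated GAN vectors are stored as a 2-D tensor; the function and variable names are illustrative and not the repository's actual interface.

```python
# Pick a target speaker vector from a pool of pre-generated GAN embeddings that
# is dissimilar enough (cosine distance >= 0.3) from the source speaker.
# Names are illustrative, not this repository's actual API.
import torch
import torch.nn.functional as F

def select_target(source_embedding, pregenerated_vectors, min_distance=0.3):
    for idx in torch.randperm(len(pregenerated_vectors)):
        candidate = pregenerated_vectors[idx]
        distance = 1.0 - F.cosine_similarity(source_embedding, candidate, dim=0)
        if distance >= min_distance:
            return candidate
    raise RuntimeError("no sufficiently dissimilar target vector in the pool")
```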

If you want to know more about the specifics of this GAN architecture and training, read the original papers about [GANs](https://arxiv.org/abs/1406.2661), [Wasserstein GANs](https://proceedings.mlr.press/v70/arjovsky17a.html) and their [improvements](https://arxiv.org/abs/1704.00028), and [Wasserstein GANs with Quadratic Transport Cost](https://ieeexplore.ieee.org/document/9009084).

### ASR optimization
We achieved an improvement of our ASR model by applying the following changes:

1. The proportion of the Encoder CTC loss during training is 0.6 instead of 0.3, and the proportion of the Decoder cross-entropy loss is 0.4 instead of 0.7, giving more priority to phone-discriminating representations in the Encoder's output.

2. Gradient accumulation during training covers 8 steps instead of 4, meaning a larger virtual batch size.

3. The proportion of the CTC score during inference is 0.2 instead of 0.4, meaning that slightly less information is used from the input representation produced by the Encoder and slightly more from the language patterns learned by the Decoder.

4. Inference uses an average of the 10 best checkpoints (ranked by the Decoder's accuracy on validation data) instead of a single checkpoint, which usually makes the model less biased towards the validation set and results in better generalization on unseen data (a sketch of this averaging follows the list).
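
The checkpoint averaging in item 4 can be illustrated with a short sketch. The ASR toolkit we build on ships its own averaging utility, so the helper and file paths below are only an assumed example of the idea, not the actual command used in this repository.

```python
# Illustration of averaging model checkpoints (item 4 above). The helper and
# paths are assumptions for the example, not the toolkit's own utility.
import torch

def average_checkpoints(paths):
    averaged = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        if averaged is None:
            averaged = {name: value.clone().float() for name, value in state.items()}
        else:
            for name, value in state.items():
                averaged[name] += value.float()
    return {name: value / len(paths) for name, value in averaged.items()}

# e.g. the 10 checkpoints with the best validation accuracy (illustrative paths)
best_checkpoints = [f"exp/asr/checkpoint_{i}.pth" for i in range(10)]
averaged_state = average_checkpoints(best_checkpoints)
```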

Download the models [from the release page (v1.2)](https://github.com/DigitalPhonetics/speaker-anonymization/releases/tag/v1.2), unzip the folders and place them into a *models*
folder as stated in the release notes. Make sure to not unzip the single ASR models, only the outer folder.