
Commit e494e84

Author: Sarina Meyer
Added code to paper "Anonymizing Speech with Generative Adversarial Networks to Preserve Speaker Privacy"
1 parent 318a5ef · commit e494e84

File tree: 11 files changed (+122 / -227 lines)


README.md

Lines changed: 88 additions & 46 deletions
@@ -1,52 +1,84 @@
 # Speaker Anonymization
 
 This repository contains the speaker anonymization system developed at the Institute for Natural Language Processing
-(IMS) at the University of Stuttgart, Germany. The system is described in the following papers:
+(IMS) at the University of Stuttgart, Germany.
 
-| Paper | Published at | Branch | Demo |
-|-------|--------------|--------|------|
-| [Speaker Anonymization with Phonetic Intermediate Representations](https://www.isca-speech.org/archive/interspeech_2022/meyer22b_interspeech.html) | [Interspeech 2022](https://www.interspeech2022.org/) | [phonetic_representations](https://github.com/DigitalPhonetics/speaker-anonymization/tree/phonetic_representations) | [https://huggingface.co/spaces/sarinam/speaker-anonymization](https://huggingface.co/spaces/sarinam/speaker-anonymization) |
-| [Anonymizing Speech with Generative Adversarial Networks to Preserve Speaker Privacy](https://arxiv.org/abs/2210.07002) | Soon at [SLT 2022](https://slt2022.org/) | coming soon | coming soon |
+This branch contains the code for our paper [Anonymizing Speech with Generative Adversarial Networks to Preserve
+Speaker Privacy](https://arxiv.org/abs/2210.07002), which we will present soon at [SLT 2022](https://slt2022.org/).
+It is an extension of our first anonymization system described in [Speaker Anonymization with Phonetic
+Intermediate Representations](https://www.isca-speech.org/archive/interspeech_2022/meyer22b_interspeech.html) but
+uses a Wasserstein Generative Adversarial Network to generate artificial target speakers.
 
-If you want to see the code for the respective papers, go to the branch referenced in the table. The latest version
-of our system can be found here on the main branch.
+If you want to see a list of all papers and implementations within this project, please visit the [main branch](https://github.com/DigitalPhonetics/speaker-anonymization/tree/main).
 
-**Check out our live demo on Hugging Face: [https://huggingface.co/spaces/sarinam/speaker-anonymization](https://huggingface.co/spaces/sarinam/speaker-anonymization)**
+[comment]: <> (**Check out the live demo of this code on Hugging Face: [https://huggingface.co/spaces/sarinam/speaker-anonymization]&#40;https://huggingface.co/spaces/sarinam/speaker-anonymization&#41;**)
 
-**Also check out [our contribution](https://www.voiceprivacychallenge.org/results-2022/docs/3___T04.pdf) to the [Voice Privacy Challenge 2022](https://www.voiceprivacychallenge.org/results-2022/)!**
+This implementation is similar to [our submission](https://www.voiceprivacychallenge.org/results-2022/docs/3___T04.pdf)
+to the [Voice Privacy Challenge 2022](https://www.voiceprivacychallenge.org/results-2022/).
 
 
 ## System Description
 The system is based on the Voice Privacy Challenge 2020, which is included as a submodule. It uses the basic idea of
 speaker embedding anonymization with neural synthesis, and uses the data and evaluation framework of the challenge.
-For a detailed description of the system, please read our Interspeech paper linked above.
+For detailed descriptions of the system, please read our papers linked above.
 
 ### Added Features
-Since the publication of the first paper, some features have been added. The new structure of the pipeline and its
-capabilities include:
-* **GAN-based speaker anonymization**: We show in [this paper](https://arxiv.org/abs/2210.07002) that a Wasserstein
-GAN can be trained to generate artificial speaker embeddings that resemble real ones but are not connected to any
-known speaker -- in our opinion, a crucial condition for anonymization. The current GAN model in the latest
-release v2.0 has been trained to generate a custom type of 128-dimensional speaker embeddings (also included in our
-speech synthesis toolkit [IMSToucan](https://github.com/DigitalPhonetics/IMS-Toucan)) instead of x-vectors or ECAPA-TDNN
-embeddings.
-* **Prosody cloning**: We now provide an option to transfer the original prosody to the anonymized audio via [prosody
-cloning](https://arxiv.org/abs/2206.12229)! If you want to avoid an exact cloning but modify it slightly (yet
-randomly, to avoid reversibility), use the random offset thresholds. They are given as lower and upper thresholds,
-as a percentage in relation to the modification. For instance, if you give these thresholds as (80, 120), you
-will modify the pitch and energy values of each phone by multiplying them with a random value between 80% and 120%
-(leading to either weakening or amplifying the signal).
-* **ASR**: Our ASR now uses a [Branchformer](https://arxiv.org/abs/2207.02971) encoder and includes word
-boundaries and stress markers in its output.
+The system is an extension of our first anonymization pipeline described in [our Interspeech paper](https://www.isca-speech.org/archive/interspeech_2022/meyer22b_interspeech.html).
+Given an input audio, two kinds of information are extracted: (a) the linguistic content in the form of phone sequences,
+using a custom Speech Recognition (ASR) model, and (b) a vector encoding the speaker information as a speaker
+embedding, formed by a concatenation of x-vector and ECAPA-TDNN embeddings. Using a
+previously trained Wasserstein GAN that converts random noise into natural-like yet artificial speaker vectors, we
+randomly sample a new target speaker embedding. If this target vector is dissimilar enough from the original speaker
+embedding (b), measured by cosine distance, we use this target speaker and the phone sequence (a) to resynthesize
+the utterance with a custom Speech Synthesis model, resulting in an anonymous version of the original audio.
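
To make the data flow above concrete, here is a minimal sketch of the selection-and-resynthesis loop. The names `asr.transcribe`, `embed_extractor` and `tts.synthesize` are placeholders for the ASR, speaker-embedding and TTS components, not the repository's actual API; only `sample_generator` follows the signature visible in `anonymization/WGAN/wgan_qc.py`.

```python
import torch
import torch.nn.functional as F

def anonymize_utterance(audio, asr, embed_extractor, wgan, tts, distance_threshold=0.3):
    # (a) linguistic content as a phone sequence (placeholder ASR call)
    phones = asr.transcribe(audio)
    # (b) original speaker embedding, e.g. concatenated x-vector + ECAPA-TDNN
    orig_embedding = embed_extractor(audio)

    # sample artificial target speakers until one is dissimilar enough
    while True:
        target = wgan.sample_generator(num_samples=1, nograd=True).squeeze(0)
        cosine_distance = 1 - F.cosine_similarity(orig_embedding, target, dim=0)
        if cosine_distance >= distance_threshold:
            break

    # resynthesize the utterance with the artificial target speaker
    return tts.synthesize(phones, speaker_embedding=target)
```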
+
+In the extension described in [this paper](https://arxiv.org/abs/2210.07002), we mainly use the same pipeline and
+models as in the first version of the system (see the branch [phonetic_representations](https://github.com/DigitalPhonetics/speaker-anonymization/tree/phonetic_representations)). However, the ASR model has been further
+improved by hyperparameter optimization, and the GAN-based anonymization has been introduced as a novel speaker
+selection/generation method.
 
 ![architecture](figures/architecture.png)
 
-The current code on the main branch expects the models of release v2.0. If you want to use the pipeline as presented at
-Interspeech 2022, please go to the
-[phonetic_representations branch](https://github.com/DigitalPhonetics/speaker-anonymization/tree/phonetic_representations).
+The current code on this branch expects the models of release v1.2. Please make sure to download these models
+before running the code.
+
+### Wasserstein GAN
+The Generative Adversarial Network (GAN) is trained to minimize the Wasserstein distance between the original and
+the generated distribution, hence it is called a Wasserstein GAN, or WGAN. It consists of two parts: a generator that converts
+random noise into a vector of the same shape as our speaker embeddings, and a critic whose job is to estimate this distance.
+Unlike in vanilla GANs, our discriminator (the critic) does not decide between real and
+fake data points but instead compares their distributions, which reduces the chance of common problems like mode collapse and
+the imitation of training data points. To further increase the chance of convergence (another common problem with GANs),
+we train the model with a Quadratic Transport Cost.
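
As a schematic illustration of the two components described above (not the repository's actual ResNet-based `ResNet_G`/`ResNet_D` models; the layer sizes and noise dimensionality are illustrative):

```python
import torch
import torch.nn as nn

EMBED_DIM = 704   # dimensionality of the concatenated x-vector + ECAPA-TDNN speaker embedding
NOISE_DIM = 32    # illustrative size of the input noise

# Generator: random noise -> a vector shaped like a speaker embedding
generator = nn.Sequential(
    nn.Linear(NOISE_DIM, 256), nn.ReLU(),
    nn.Linear(256, EMBED_DIM),
)

# Critic: embedding-shaped vector -> unbounded scalar score; the gap between its
# mean scores on real and generated batches is used to estimate the distance
critic = nn.Sequential(
    nn.Linear(EMBED_DIM, 256), nn.ReLU(),
    nn.Linear(256, 1),
)

fake = generator(torch.randn(8, NOISE_DIM))
print(fake.shape, critic(fake).shape)  # torch.Size([8, 704]) torch.Size([8, 1])
```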
+
+During training, we compare the vectors generated by our network to the speaker embeddings of real speakers, which
+were extracted at the utterance level, meaning that one speaker is represented in this speaker pool as many times as we
+consider utterances from this speaker. In this way, we increase the number of data points in our training data.
+However, GANs are usually trained on larger datasets than the ones considered in the framework of the Voice Privacy
+Challenge, which we follow, and on image data with higher dimensions than speaker embeddings. Therefore, we reduced
+the size of the input noise and the size of the generator and critic ResNet models. More information about this and
+our hyperparameter selection is given in our paper.
+
+During inference, we simply generate a vector using our GAN (or sample one of a set of pre-generated vectors) and
+compare it to the speaker embedding of the input speaker. If both vectors have a cosine distance of at least 0.3, we
+consider the two vectors (i.e., speakers) dissimilar enough and take the generated one as the target speaker embedding.
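
A minimal sketch of this acceptance check, assuming `gan_vectors` is a tensor of pre-generated embeddings (e.g. from `EmbeddingsGenerator.generate_embeddings`) and `orig_embedding` is the input speaker's embedding; the helper name is illustrative, not the repository's exact implementation.

```python
import torch
import torch.nn.functional as F

def pick_target_embedding(orig_embedding, gan_vectors, min_distance=0.3):
    """Return a GAN-generated embedding whose cosine distance to the
    original speaker embedding is at least `min_distance`."""
    # cosine distance = 1 - cosine similarity, computed against all candidates at once
    distances = 1 - F.cosine_similarity(gan_vectors, orig_embedding.unsqueeze(0), dim=1)
    candidates = (distances >= min_distance).nonzero(as_tuple=True)[0]
    if len(candidates) == 0:
        raise RuntimeError("No sufficiently dissimilar GAN vector found; sample more.")
    # pick one of the valid candidates at random
    choice = candidates[torch.randint(len(candidates), (1,))].item()
    return gan_vectors[choice]
```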
+
+If you want to know more about the specifics of this GAN architecture and training, read the original papers on
+[GANs](https://arxiv.org/abs/1406.2661), [Wasserstein GANs](https://proceedings.mlr.press/v70/arjovsky17a.html) and
+their [improvements](https://arxiv.org/abs/1704.00028), and [Wasserstein GANs with Quadratic Transport Cost](https://ieeexplore.ieee.org/document/9009084).
+
+
+### ASR optimization
+We improved our ASR model by applying the following changes:
+
+1. The proportion of the Encoder CTC loss during training is 0.6 instead of 0.3, and the proportion of the Decoder cross-entropy loss is 0.4 instead of 0.7, giving more priority to phone-discriminating representations in the Encoder's output.
+
+2. Gradient accumulation during training is 8 steps instead of 4, meaning a larger virtual batch size.
+
+3. The proportion of the CTC score during inference is 0.2 instead of 0.4, meaning that slightly less information is used from the input represented by the Encoder and slightly more information is used from the language patterns learned by the Decoder.
+
+4. Inference uses an average of the 10 best checkpoints (by the Decoder's accuracy on validation data) instead of 1; this usually makes the model less biased towards the validation set, resulting in better generalization to unseen data (see the averaging sketch below).
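
Point 4 can be made concrete with a short sketch of checkpoint averaging. This is a generic state-dict averaging routine under the assumption that all checkpoints share the same parameter keys; it is not the specific averaging script of the ASR toolkit used in this repository.

```python
import torch

def average_checkpoints(paths):
    """Average the parameters of several model checkpoints (state dicts)."""
    avg_state = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        if avg_state is None:
            avg_state = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg_state[k] += v.float()
    # divide the accumulated parameters by the number of checkpoints
    return {k: v / len(paths) for k, v in avg_state.items()}
```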
 
 ## Installation
 ### 1. Clone repository
@@ -56,14 +88,14 @@ git clone --recurse-submodules https://github.com/DigitalPhonetics/speaker-anony
 ```
 
 ### 2. Download models
-Download the models [from the release page (v2.0)](https://github.com/DigitalPhonetics/speaker-anonymization/releases/tag/v2.0), unzip the folders and place them into a *models*
+Download the models [from the release page (v1.2)](https://github.com/DigitalPhonetics/speaker-anonymization/releases/tag/v1.2), unzip the folders and place them into a *models*
 folder as stated in the release notes. Make sure not to unzip the individual ASR models, only the outer folder.
 ```
 cd speaker-anonymization
 mkdir models
 cd models
 for file in anonymization asr tts; do
-    wget https://github.com/DigitalPhonetics/speaker-anonymization/releases/download/v2.0/${file}.zip
+    wget https://github.com/DigitalPhonetics/speaker-anonymization/releases/download/v1.2/${file}.zip
     unzip ${file}.zip
     rm ${file}.zip
 done
@@ -129,7 +161,7 @@ on CPU (not recommended).
 The script will anonymize the development and test data of LibriSpeech and VCTK in three steps:
 1. ASR: Recognition of the linguistic content, output in the form of text or phone sequences
 2. Anonymization: Modification of speaker embeddings, output as torch vectors
-3. TTS: Synthesis based on recognized transcription, extracted prosody and anonymized speaker embedding, output as
+3. TTS: Synthesis based on recognized transcription and anonymized speaker embedding, output as
 audio files (wav)
 
 Each module produces intermediate results that are saved to disk. A module is only executed if previous intermediate
@@ -151,14 +183,24 @@ Finally, for clarity, the most important parts of the evaluation results as well
 the [results](results) directory.
 
 
-## Citation
-```
-@inproceedings{meyer22b_interspeech,
-  author={Sarina Meyer and Florian Lux and Pavel Denisov and Julia Koch and Pascal Tilli and Ngoc Thang Vu},
-  title={{Speaker Anonymization with Phonetic Intermediate Representations}},
-  year=2022,
-  booktitle={Proc. Interspeech 2022},
-  pages={4925--4929},
-  doi={10.21437/Interspeech.2022-10703}
-}
-```
+[comment]: <> (## Citation)
+
+[comment]: <> (```)
+
+[comment]: <> (@inproceedings{meyer_2023_anonymizing,)
+
+[comment]: <> (  author={Sarina Meyer and Pascal Tilli and Pavel Denisov and Florian Lux and Julia Koch and Ngoc Thang Vu},)
+
+[comment]: <> (  title={{Anonymizing Speech with Generative Adversarial Networks to Preserve Speaker Privacy}},)
+
+[comment]: <> (  year=2023,)
+
+[comment]: <> (  booktitle={Proc. SLT 2022},)
+
+[comment]: <> (  pages={},)
+
+[comment]: <> (  doi={})
+
+[comment]: <> (})
+
+[comment]: <> (```)

anonymization/WGAN/embeddings_generator.py

Lines changed: 7 additions & 1 deletion
@@ -16,7 +16,8 @@ def __init__(self, gan_path, device):
         self._load_model(self.gan_path)
 
     def generate_embeddings(self, n=1000):
-        return self.wgan.sample_generator(num_samples=n, nograd=True, return_intermediate=False)
+        samples = self.wgan.sample_generator(num_samples=n, nograd=True)
+        return self._inverse_normalize(samples, self.mean, self.std)
 
     def _load_model(self, path):
         gan_checkpoint = torch.load(path, map_location="cpu")
@@ -28,3 +29,8 @@ def _load_model(self, path):
         self.mean = gan_checkpoint["mean"]
         self.std = gan_checkpoint["std"]
 
+    def _inverse_normalize(self, tensor, mean, std):
+        for t, m, s in zip(tensor, mean, std):
+            t.mul_(s).add_(m)
+        return tensor

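For context on the new `_inverse_normalize` step: the checkpoint stores per-dimension `mean` and `std`, so the GAN is presumably trained on standardized embeddings, and generated samples have to be mapped back to the original embedding scale before use. A self-contained sketch of the underlying de-normalization formula (not the repository's exact loop):

```python
import torch

def inverse_normalize(samples: torch.Tensor, mean: torch.Tensor, std: torch.Tensor) -> torch.Tensor:
    """Undo z-score normalization: x_denorm = x * std + mean."""
    return samples * std + mean

# toy example with a 4-dimensional "embedding"
mean = torch.tensor([0.5, -1.0, 2.0, 0.0])
std = torch.tensor([1.0, 0.5, 2.0, 3.0])
normalized = torch.randn(2, 4)          # what a generator would output
print(inverse_normalize(normalized, mean, std))
```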
anonymization/WGAN/init_wgan.py

Lines changed: 2 additions & 2 deletions
@@ -36,9 +36,9 @@ def create_wgan(parameters, device, optimizer='adam'):
 
 
 def init_resnet(parameters):
-    critic = ResNet_D(parameters['data_dim'][-1], parameters['size'], nfilter=parameters['nfilter'],
+    critic = ResNet_D(parameters['z_dim'], parameters['size'], nfilter=parameters['nfilter'],
                       nfilter_max=parameters['nfilter_max'])
-    generator = ResNet_G(parameters['data_dim'][-1], parameters['z_dim'], parameters['size'],
+    generator = ResNet_G(parameters['z_dim'], parameters['size'],
                          nfilter=parameters['nfilter'], nfilter_max=parameters['nfilter_max'])
 
     generator.apply(weights_init_G)

anonymization/WGAN/resnet_1.py

Lines changed: 5 additions & 9 deletions
@@ -7,7 +7,7 @@
 
 class ResNet_G(nn.Module):
 
-    def __init__(self, data_dim, z_dim, size, nfilter=64, nfilter_max=512, bn=True, res_ratio=0.1, **kwargs):
+    def __init__(self, z_dim, size, nfilter=64, nfilter_max=512, bn=True, res_ratio=0.1, **kwargs):
         super().__init__()
         self.input_dim = z_dim
         self.output_dim = z_dim
@@ -47,16 +47,14 @@ def __init__(self, data_dim, z_dim, size, nfilter=64, nfilter_max=512, bn=True,
         self.resnet = nn.Sequential(*blocks)
         self.conv_img = nn.Conv2d(nf, 3, 3, padding=1)
 
-        self.fc_out = nn.Linear(3 * size * size, data_dim)
+        self.fc_out = nn.Linear(3 * size * size, 704)
 
-    def forward(self, z, return_intermediate=False):
+    def forward(self, z):
         batch_size = z.size(0)
         out = self.fc(z)
         if self.bn:
             out = self.bn1d(out)
         out = self.relu(out)
-        if return_intermediate:
-            l_1 = out.detach().clone()
         out = out.view(batch_size, self.nf0, self.s0, self.s0)
 
         out = self.resnet(out)
@@ -66,8 +64,6 @@ def forward(self, z, return_intermediate=False):
         out.flatten(1)
         out = self.fc_out(out.flatten(1))
 
-        if return_intermediate:
-            return out, l_1
         return out
 
     def sample_latent(self, n_samples, z_size):
@@ -76,7 +72,7 @@
 
 class ResNet_D(nn.Module):
 
-    def __init__(self, data_dim, size, nfilter=64, nfilter_max=512, res_ratio=0.1):
+    def __init__(self, z_dim, size, nfilter=64, nfilter_max=512, res_ratio=0.1):
         super().__init__()
         s0 = self.s0 = 4
         nf = self.nf = nfilter
@@ -94,7 +90,7 @@ def __init__(self, data_dim, size, nfilter=64, nfilter_max=512, res_ratio=0.1):
             ResNetBlock(nf0, nf1, bn=False, res_ratio=res_ratio)
         ]
 
-        self.fc_input = nn.Linear(data_dim, 3 * size * size)
+        self.fc_input = nn.Linear(704, 3 * size * size)
 
         for i in range(1, nlayers + 1):
             nf0 = min(nf * 2 ** i, nf_max)

anonymization/WGAN/wgan_qc.py

Lines changed: 2 additions & 4 deletions
@@ -235,7 +235,7 @@ def train(self, data_loader, writer, experiment=None):
 
         return self
 
-    def sample_generator(self, num_samples, nograd=False, return_intermediate=False):
+    def sample_generator(self, num_samples, nograd=False):
         self.G.eval()
         if isinstance(self.G, torch.nn.parallel.DataParallel):
             latent_samples = self.G.module.sample_latent(num_samples, self.G.module.z_dim)
@@ -244,12 +244,10 @@ def sample_generator(self, num_samples, nograd=False, return_intermediate=False)
         latent_samples = latent_samples.to(self.device)
         if nograd:
             with torch.no_grad():
-                generated_data = self.G(latent_samples, return_intermediate=return_intermediate)
+                generated_data = self.G(latent_samples)
         else:
             generated_data = self.G(latent_samples)
         self.G.train()
-        if return_intermediate:
-            return generated_data[0].detach(), generated_data[1], latent_samples
         return generated_data.detach()
 
     def sample(self, num_samples):

anonymization/gan_anonymizer.py

Lines changed: 0 additions & 2 deletions
@@ -29,7 +29,6 @@ def load_parameters(self, model_dir: Path):
             settings = json.load(f)
             self.vec_type = settings.get('vec_type', self.vec_type)
             self.vectors_file = settings.get('vectors_file', self.vectors_file)
-            self.embed_model_path = settings.get('embed_model_path', None)
 
         if (model_dir / self.vectors_file).is_file():
             self.gan_vectors = torch.load(model_dir / self.vectors_file, map_location=self.device)
@@ -44,7 +43,6 @@ def save_parameters(self, model_dir: Path):
         settings = {
             'vec_type': self.vec_type,
             'vectors_file': self.vectors_file,
-            'embed_model_path': self.embed_model_path,
             'gan_model_name': self.gan_model_name,
             'num_sampled': self.n
         }
