
Commit 3bc32c9

Author: Sarina Meyer (committed)
Updated code to latest version using prosody cloning and GAN embeddings
1 parent e6aeac6 commit 3bc32c9

19 files changed (+1055, -174 lines)

README.md

Lines changed: 39 additions & 26 deletions
@@ -1,22 +1,52 @@
  # Speaker Anonymization
  
  This repository contains the speaker anonymization system developed at the Institute for Natural Language Processing
- (IMS) at the University of Stuttgart, Germany. The system is described in our paper [*Speaker Anonymization with
- Phonetic Intermediate Representations*](https://www.isca-speech.org/archive/interspeech_2022/meyer22b_interspeech.html).
+ (IMS) at the University of Stuttgart, Germany. The system is described in the following papers:
+ 
+ | Paper | Published at | Branch | Demo |
+ |-------|--------------|--------|------|
+ | [Speaker Anonymization with Phonetic Intermediate Representations](https://www.isca-speech.org/archive/interspeech_2022/meyer22b_interspeech.html) | [Interspeech 2022](https://www.interspeech2022.org/) | [phonetic_representations](https://github.com/DigitalPhonetics/speaker-anonymization/tree/phonetic_representations) | [https://huggingface.co/spaces/sarinam/speaker-anonymization](https://huggingface.co/spaces/sarinam/speaker-anonymization) |
+ | [Anonymizing Speech with Generative Adversarial Networks to Preserve Speaker Privacy](https://arxiv.org/abs/2210.07002) | Soon at [SLT 2022](https://slt2022.org/) | coming soon | coming soon |
+ 
+ If you want to see the code for the respective papers, go to the branch referenced in the table. The latest version
+ of our system can be found here on the main branch.
  
  **Check out our live demo on Hugging Face: [https://huggingface.co/spaces/sarinam/speaker-anonymization](https://huggingface.co/spaces/sarinam/speaker-anonymization)**
  
  **Also check out [our contribution](https://www.voiceprivacychallenge.org/results-2022/docs/3___T04.pdf) to the [Voice Privacy Challenge 2022](https://www.voiceprivacychallenge.org/results-2022/)!**
  
- **The code and live demo for our latest paper [Anonymizing Speech with Generative Adversarial Networks to Preserve Speaker Privacy](https://arxiv.org/abs/2210.07002) are going to be added soon.**
  
  ## System Description
  The system is based on the Voice Privacy Challenge 2020, which is included as a submodule. It uses the basic idea of
  speaker embedding anonymization with neural synthesis, and uses the data and evaluation framework of the challenge.
- For a detailed description of the system, please read our paper linked above.
+ For a detailed description of the system, please read our Interspeech paper linked above.
+ 
+ ### Added Features
+ Since the publication of the first paper, some features have been added. The new structure of the pipeline and its
+ capabilities include:
+ * **GAN-based speaker anonymization**: We show in [this paper](https://arxiv.org/abs/2210.07002) that a Wasserstein
+   GAN can be trained to generate artificial speaker embeddings that resemble real ones but are not connected to any
+   known speaker -- in our opinion, a crucial condition for anonymization. The current GAN model in the latest
+   release v2.0 has been trained to generate a custom type of 128-dimensional speaker embeddings (also included in
+   our speech synthesis toolkit [IMSToucan](https://github.com/DigitalPhonetics/IMS-Toucan)) instead of x-vectors or
+   ECAPA-TDNN embeddings.
+ * **Prosody cloning**: We now provide an option to transfer the original prosody to the anonymized audio via
+   [prosody cloning](https://arxiv.org/abs/2206.12229)! If you want to avoid exact cloning and instead modify the
+   prosody slightly (but randomly, to avoid reversibility), use the random offset thresholds. They are given as a
+   lower and an upper threshold, as percentages of the original value. For instance, with thresholds of (80, 120),
+   the pitch and energy values of each phone are multiplied by a random value between 80% and 120% (either weakening
+   or amplifying the signal); see the sketch after this list.
+ * **ASR**: Our ASR now uses a [Branchformer](https://arxiv.org/abs/2207.02971) encoder and includes word
+   boundaries and stress markers in its output.
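
A minimal sketch of the random-offset idea, assuming per-phone pitch and energy arrays; the function and its interface are illustrative, not the repository's actual API:

```python
import numpy as np

def apply_random_offsets(pitch, energy, low=80, high=120, seed=None):
    """Scale each phone's pitch and energy by a random factor drawn from [low%, high%]."""
    rng = np.random.default_rng(seed)
    pitch_scale = rng.uniform(low, high, size=len(pitch)) / 100.0
    energy_scale = rng.uniform(low, high, size=len(energy)) / 100.0
    return pitch * pitch_scale, energy * energy_scale

# Thresholds (80, 120): each phone keeps between 80% and 120% of its original pitch and energy.
pitch, energy = apply_random_offsets(np.array([120.0, 95.0, 101.5]), np.array([0.7, 0.4, 0.9]))
```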
  
  ![architecture](figures/architecture.png)
  
+ The current code on the main branch expects the models of release v2.0. If you want to use the pipeline as presented
+ at Interspeech 2022, please go to the
+ [phonetic_representations branch](https://github.com/DigitalPhonetics/speaker-anonymization/tree/phonetic_representations).
  
  ## Installation
  ### 1. Clone repository
@@ -26,13 +56,14 @@ git clone --recurse-submodules https://github.com/DigitalPhonetics/speaker-anony
  ```
  
  ### 2. Download models
- Download the models [from the release page (v1.0)](https://github.com/DigitalPhonetics/speaker-anonymization/releases/tag/v1.0), unzip the folders and place them into a *models* folder as stated in the release notes. Make sure not to unzip the single ASR models, only the outer folder.
+ Download the models [from the release page (v2.0)](https://github.com/DigitalPhonetics/speaker-anonymization/releases/tag/v2.0), unzip the folders and place them into a *models* folder as stated in the release notes. Make sure not to unzip the single ASR models, only the outer folder.
  ```
  cd speaker-anonymization
  mkdir models
  cd models
  for file in anonymization asr tts; do
- wget https://github.com/DigitalPhonetics/speaker-anonymization/releases/download/v1.0/${file}.zip
+ wget https://github.com/DigitalPhonetics/speaker-anonymization/releases/download/v2.0/${file}.zip
  unzip ${file}.zip
  rm ${file}.zip
  done
@@ -98,7 +129,8 @@ on CPU (not recommended).
  The script will anonymize the development and test data of LibriSpeech and VCTK in three steps:
  1. ASR: Recognition of the linguistic content, output in the form of text or phone sequences
  2. Anonymization: Modification of speaker embeddings, output as torch vectors
- 3. TTS: Synthesis based on the recognized transcription and the anonymized speaker embedding, output as audio files (wav)
+ 3. TTS: Synthesis based on the recognized transcription, the extracted prosody and the anonymized speaker embedding,
+   output as audio files (wav)
  
  Each module produces intermediate results that are saved to disk. A module is only executed if previous intermediate
  results for the dependent pipeline combination do not exist or if recomputation is forced. Otherwise, the previous
@@ -119,25 +151,6 @@ Finally, for clarity, the most important parts of the evaluation results as well
  the [results](results) directory.
  
  
- ## Models
- The following table lists all models for each module that are reported in the paper and are included in this
- repository. Each model is given by its name in the directory and the name used in the paper. In the *settings*
- dictionary in [run_inference.py](run_inference.py), the model name should be used. The *x* in the *Default* column
- marks the models that are used in the main configuration of the system.
- 
- | Module | Default | Model name | Name in paper |
- |--------|---------|------------|---------------|
- | ASR | x | asr_tts-phn_en.zip | phones |
- | | | asr_stt_en | STT |
- | | | asr_tts_en.zip | TTS |
- | Anonymization | x | pool_minmax_ecapa+xvector | pool |
- | | | pool_raw_ecapa+xvector | pool raw |
- | | | random_in-scale_ecapa+xvector | random |
- | TTS | x | trained_on_ground_truth_phonemes.pt | Libri100 |
- | | | trained_on_asr_phoneme_outputs.pt | Libri100 + finetuned |
- | | | trained_on_libri600_asr_phoneme_outputs.pt | Libri600 |
- | | | trained_on_libri600_ground_truth_phonemes.pt | Libri600 + finetuned |
  ## Citation
  ```
  @inproceedings{meyer22b_interspeech,

anonymization/WGAN/README.md

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
# Wasserstein GAN with Quadratic Transport Cost for the Generation of Artificial Speaker Embeddings

This model is also used in our [IMS Toucan toolkit](https://github.com/DigitalPhonetics/IMS-Toucan/tree/ControllableMultilingual) to control voices in multi-speaker speech synthesis.
Check it out!

anonymization/WGAN/__init__.py

Whitespace-only changes.

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
import torch

from .init_wgan import create_wgan


class EmbeddingsGenerator:
    """Samples artificial speaker embeddings from a trained Wasserstein GAN."""

    def __init__(self, gan_path, device):
        self.device = device
        self.gan_path = gan_path

        self.mean = None
        self.std = None
        self.wgan = None

        self._load_model(self.gan_path)

    def generate_embeddings(self, n=1000):
        # Draw n embeddings from the generator without tracking gradients.
        return self.wgan.sample_generator(num_samples=n, nograd=True, return_intermediate=False)

    def _load_model(self, path):
        # Rebuild the GAN from the hyperparameters stored in the checkpoint,
        # then restore the generator and critic weights.
        gan_checkpoint = torch.load(path, map_location="cpu")

        self.wgan = create_wgan(parameters=gan_checkpoint['model_parameters'], device=self.device)
        self.wgan.G.load_state_dict(gan_checkpoint['generator_state_dict'])
        self.wgan.D.load_state_dict(gan_checkpoint['critic_state_dict'])

        self.mean = gan_checkpoint["mean"]
        self.std = gan_checkpoint["std"]

anonymization/WGAN/init_wgan.py

Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
import torch
import torch.nn as nn

from .wgan_qc import WassersteinGanQuadraticCost
from .resnet_1 import ResNet_D, ResNet_G


def create_wgan(parameters, device, optimizer='adam'):
    # Build the generator/critic pair; only the ResNet architecture is implemented.
    if parameters['model'] == 'resnet':
        generator, discriminator = init_resnet(parameters)
    else:
        raise NotImplementedError

    if optimizer == 'adam':
        optimizer_g = torch.optim.Adam(generator.parameters(), lr=parameters['learning_rate'], betas=parameters['betas'])
        optimizer_d = torch.optim.Adam(discriminator.parameters(), lr=parameters['learning_rate'], betas=parameters['betas'])
    elif optimizer == 'rmsprop':
        optimizer_g = torch.optim.RMSprop(generator.parameters(), lr=parameters['learning_rate'])
        # The critic's optimizer must update the critic's own parameters.
        optimizer_d = torch.optim.RMSprop(discriminator.parameters(), lr=parameters['learning_rate'])
    else:
        raise NotImplementedError

    criterion = torch.nn.MSELoss()

    gan = WassersteinGanQuadraticCost(generator,
                                      discriminator,
                                      optimizer_g,
                                      optimizer_d,
                                      criterion=criterion,
                                      data_dimensions=parameters['data_dim'],
                                      epochs=parameters['epochs'],
                                      batch_size=parameters['batch_size'],
                                      device=device,
                                      n_max_iterations=parameters['n_max_iterations'],
                                      gamma=parameters['gamma'])

    return gan


def init_resnet(parameters):
    critic = ResNet_D(parameters['data_dim'][-1], parameters['size'], nfilter=parameters['nfilter'],
                      nfilter_max=parameters['nfilter_max'])
    generator = ResNet_G(parameters['data_dim'][-1], parameters['z_dim'], parameters['size'],
                         nfilter=parameters['nfilter'], nfilter_max=parameters['nfilter_max'])

    generator.apply(weights_init_G)
    critic.apply(weights_init_D)

    return generator, critic


def weights_init_D(m):
    # Kaiming initialization for the critic's convolutions (fan-out, leaky ReLU).
    classname = m.__class__.__name__
    if classname.find('Conv') != -1:
        nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='leaky_relu')
    elif classname.find('BatchNorm') != -1:
        nn.init.constant_(m.weight, 1)
        nn.init.constant_(m.bias, 0)


def weights_init_G(m):
    # Kaiming initialization for the generator's convolutions (fan-in, leaky ReLU).
    classname = m.__class__.__name__
    if classname.find('Conv') != -1:
        nn.init.kaiming_normal_(m.weight, mode='fan_in', nonlinearity='leaky_relu')
    elif classname.find('BatchNorm') != -1:
        nn.init.constant_(m.weight, 1)
        nn.init.constant_(m.bias, 0)
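
For reference, a minimal sketch of the `parameters` dictionary that `create_wgan` expects. All values below are illustrative assumptions; the released checkpoints carry the real values in their `model_parameters` entry, so a dictionary like this is only needed when training a new GAN:

```python
import torch

# Illustrative only: the released checkpoints carry their own 'model_parameters'.
parameters = {
    'model': 'resnet',            # only the ResNet generator/critic pair is implemented
    'data_dim': [128],            # dimensionality of the speaker embeddings
    'z_dim': 32,                  # latent noise dimension (assumed)
    'size': 32,                   # internal feature-map size, a power of two >= 4 (assumed)
    'nfilter': 64,                # base number of filters (assumed)
    'nfilter_max': 512,           # cap on the number of filters (assumed)
    'learning_rate': 1e-4,        # (assumed)
    'betas': (0.5, 0.9),          # Adam betas (assumed)
    'epochs': 100,                # (assumed)
    'batch_size': 64,             # (assumed)
    'n_max_iterations': 100000,   # (assumed)
    'gamma': 0.1,                 # regularization strength of the quadratic-cost objective (assumed)
}

wgan = create_wgan(parameters, device=torch.device('cpu'))
```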

anonymization/WGAN/resnet_1.py

Lines changed: 175 additions & 0 deletions
@@ -0,0 +1,175 @@
import numpy as np
import torch
from torch import nn


class ResNet_G(nn.Module):
    """Generator: maps a latent vector z to a speaker embedding of size data_dim."""

    def __init__(self, data_dim, z_dim, size, nfilter=64, nfilter_max=512, bn=True, res_ratio=0.1, **kwargs):
        super().__init__()
        self.input_dim = z_dim
        self.output_dim = z_dim
        self.dropout_rate = 0

        s0 = self.s0 = 4
        nf = self.nf = nfilter
        nf_max = self.nf_max = nfilter_max
        self.bn = bn
        self.z_dim = z_dim

        # Submodules
        nlayers = int(np.log2(size / s0))
        self.nf0 = min(nf_max, nf * 2 ** (nlayers + 1))

        self.fc = nn.Linear(z_dim, self.nf0 * s0 * s0)
        if self.bn:
            self.bn1d = nn.BatchNorm1d(self.nf0 * s0 * s0)
        self.relu = nn.LeakyReLU(0.2, inplace=True)

        # Stack of residual blocks, upsampling from s0 x s0 to size x size.
        blocks = []
        for i in range(nlayers, 0, -1):
            nf0 = min(nf * 2 ** (i + 1), nf_max)
            nf1 = min(nf * 2 ** i, nf_max)
            blocks += [
                ResNetBlock(nf0, nf1, bn=self.bn, res_ratio=res_ratio),
                nn.Upsample(scale_factor=2)
            ]

        nf0 = min(nf * 2, nf_max)
        nf1 = min(nf, nf_max)
        blocks += [
            ResNetBlock(nf0, nf1, bn=self.bn, res_ratio=res_ratio),
            ResNetBlock(nf1, nf1, bn=self.bn, res_ratio=res_ratio)
        ]

        self.resnet = nn.Sequential(*blocks)
        self.conv_img = nn.Conv2d(nf, 3, 3, padding=1)

        # Final projection from the flattened 3 x size x size map to the embedding.
        self.fc_out = nn.Linear(3 * size * size, data_dim)

    def forward(self, z, return_intermediate=False):
        batch_size = z.size(0)
        out = self.fc(z)
        if self.bn:
            out = self.bn1d(out)
        out = self.relu(out)
        if return_intermediate:
            l_1 = out.detach().clone()
        out = out.view(batch_size, self.nf0, self.s0, self.s0)

        out = self.resnet(out)

        out = self.conv_img(out)
        out = self.relu(out)
        out = self.fc_out(out.flatten(1))

        if return_intermediate:
            return out, l_1
        return out

    def sample_latent(self, n_samples, z_size):
        return torch.randn((n_samples, z_size))


class ResNet_D(nn.Module):
    """Critic: maps a speaker embedding to a scalar score."""

    def __init__(self, data_dim, size, nfilter=64, nfilter_max=512, res_ratio=0.1):
        super().__init__()
        s0 = self.s0 = 4
        nf = self.nf = nfilter
        nf_max = self.nf_max = nfilter_max
        self.size = size

        # Submodules
        nlayers = int(np.log2(size / s0))
        self.nf0 = min(nf_max, nf * 2 ** nlayers)

        nf0 = min(nf, nf_max)
        nf1 = min(nf * 2, nf_max)
        blocks = [
            ResNetBlock(nf0, nf0, bn=False, res_ratio=res_ratio),
            ResNetBlock(nf0, nf1, bn=False, res_ratio=res_ratio)
        ]

        # Lift the 1D embedding to a 3 x size x size feature map.
        self.fc_input = nn.Linear(data_dim, 3 * size * size)

        for i in range(1, nlayers + 1):
            nf0 = min(nf * 2 ** i, nf_max)
            nf1 = min(nf * 2 ** (i + 1), nf_max)
            blocks += [
                nn.AvgPool2d(3, stride=2, padding=1),
                ResNetBlock(nf0, nf1, bn=False, res_ratio=res_ratio),
            ]

        self.conv_img = nn.Conv2d(3, 1 * nf, 3, padding=1)
        self.relu = nn.LeakyReLU(0.2, inplace=True)
        self.resnet = nn.Sequential(*blocks)

        self.fc = nn.Linear(self.nf0 * s0 * s0, 1)

    def forward(self, x):
        batch_size = x.size(0)

        out = self.fc_input(x)
        out = self.relu(out).view(batch_size, 3, self.size, self.size)

        out = self.relu(self.conv_img(out))
        out = self.resnet(out)
        out = out.view(batch_size, self.nf0 * self.s0 * self.s0)
        out = self.fc(out)

        return out


class ResNetBlock(nn.Module):
    """Residual block with a scaled residual branch (res_ratio) and optional batch norm."""

    def __init__(self, fin, fout, fhidden=None, bn=True, res_ratio=0.1):
        super().__init__()
        # Attributes
        self.bn = bn
        self.is_bias = not bn
        self.learned_shortcut = (fin != fout)
        self.fin = fin
        self.fout = fout
        if fhidden is None:
            self.fhidden = min(fin, fout)
        else:
            self.fhidden = fhidden
        self.res_ratio = res_ratio

        # Submodules
        self.conv_0 = nn.Conv2d(self.fin, self.fhidden, 3, stride=1, padding=1, bias=self.is_bias)
        if self.bn:
            self.bn2d_0 = nn.BatchNorm2d(self.fhidden)
        self.conv_1 = nn.Conv2d(self.fhidden, self.fout, 3, stride=1, padding=1, bias=self.is_bias)
        if self.bn:
            self.bn2d_1 = nn.BatchNorm2d(self.fout)
        if self.learned_shortcut:
            self.conv_s = nn.Conv2d(self.fin, self.fout, 1, stride=1, padding=0, bias=False)
            if self.bn:
                self.bn2d_s = nn.BatchNorm2d(self.fout)
        self.relu = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, x):
        x_s = self._shortcut(x)
        dx = self.conv_0(x)
        if self.bn:
            dx = self.bn2d_0(dx)
        dx = self.relu(dx)
        dx = self.conv_1(dx)
        if self.bn:
            dx = self.bn2d_1(dx)
        out = self.relu(x_s + self.res_ratio * dx)
        return out

    def _shortcut(self, x):
        if self.learned_shortcut:
            x_s = self.conv_s(x)
            if self.bn:
                x_s = self.bn2d_s(x_s)
        else:
            x_s = x
        return x_s
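
To sanity-check the tensor shapes, a minimal smoke test of the generator/critic pair. The hyperparameters are deliberately tiny, illustrative assumptions, chosen so that `nfilter * size / 4` reaches `nfilter_max` (which this implementation needs for the critic's final channel count to match its fully connected layer):

```python
# Tiny, illustrative hyperparameters -- not the released model's configuration.
g = ResNet_G(data_dim=128, z_dim=32, size=8, nfilter=8, nfilter_max=16)
d = ResNet_D(data_dim=128, size=8, nfilter=8, nfilter_max=16)

z = g.sample_latent(n_samples=4, z_size=32)  # (4, 32) standard normal noise
embeddings = g(z)                            # (4, 128) artificial speaker embeddings
scores = d(embeddings)                       # (4, 1) critic scores
print(embeddings.shape, scores.shape)
```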
