
Eta_WavLM_implementation

Implementation of the Eta-WavLM paper to obtain speaker-independent features from WavLM representations. The Eta-WavLM representations of the source utterance are used during inference with a kNN-VC voice conversion model, and the pipeline is deployed with LitServe.

Setting Up the Environment

  1. Clone the repository.
git clone https://github.com/rooshil-bhatia/Eta_WavLM_implementation.git
  2. Create a conda environment with Python 3.10 or higher and activate it.
conda create -n eta_wavlm python=3.10
conda activate eta_wavlm
  3. Install the dependencies.
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
cd Eta_WavLM_implementation
pip install -r requirements.txt

Folder Structure

Eta_WavLM_implementation/
├── data/                      --> LibriSpeech train-clean-100 dataset
├── inference_output/          --> Features produced by simple_inference.py
├── models/                    --> Trained model files are saved here (.pkl)
├── pretrained_models/         --> ECAPA-TDNN model is downloaded here automatically when training starts
├── client.py                  --> Example API client
├── inference_eta_wavlm.py     --> Full inference wrapper
├── knnvc.py                   --> kNN-VC inference using Eta-WavLM features for the source utterance
├── model.py                   --> Core Eta-WavLM implementation
├── server.py                  --> LitServe voice conversion API server
├── simple_inference.py        --> Inference example to extract features from an audio file
└── train_eta_wavlm.py         --> Training script

Training

Run the training script:

python train_eta_wavlm.py

This will:

1) Download the LibriSpeech train-clean-100 dataset automatically to ./data (~5.95 GB)

2) Extract WavLM features from the 15th layer (1024 dimensions)

3) Extract ECAPA-TDNN embeddings for all utterances and apply PCA reduction (192 → 128 dims)

4) Learn the linear transformation (A*, b*) with the closed-form pseudo-inverse (Moore-Penrose) solution, as sketched below

5) Save the trained model to ./models/eta_wavlm_transform.pkl
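
Conceptually, step 4 reduces to a single least-squares solve. Here is a minimal sketch assuming NumPy arrays with illustrative names (not necessarily those used in train_eta_wavlm.py): `wavlm_feats` holds the stacked layer-15 features (N × 1024) and `spk_pca` holds the PCA-reduced speaker embedding replicated for each of the N frames (N × 128).

```python
import numpy as np

def fit_linear_transform(wavlm_feats: np.ndarray, spk_pca: np.ndarray):
    """Closed-form least-squares fit of (A*, b*) via the Moore-Penrose pseudo-inverse.

    wavlm_feats: (N, 1024) stacked WavLM layer-15 features
    spk_pca:     (N, 128) PCA-reduced speaker embeddings (one row per frame)
    """
    n_frames = wavlm_feats.shape[0]
    # Append a column of ones so the bias b* is estimated jointly with A*.
    design = np.hstack([spk_pca, np.ones((n_frames, 1))])   # (N, 129)
    # Pseudo-inverse gives the minimum-norm least-squares solution.
    solution = np.linalg.pinv(design) @ wavlm_feats         # (129, 1024)
    A_star, b_star = solution[:-1], solution[-1]            # (128, 1024), (1024,)
    return A_star, b_star
```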

Training Configuration

The training script uses these paper-compliant settings (collected into a config sketch after this list):

  • WavLM Model: microsoft/wavlm-large (15th layer)
  • Speaker Encoder: ECAPA-TDNN from SpeechBrain
  • PCA Components: 128 (optimal per the paper's ablation)
  • Subsampled Frames (L in the paper): 50 frames per utterance (adjustable)
  • Audio Duration: 1–6 seconds (a simple filter that can be adjusted)
  • Max Training Files: 200 LibriSpeech utterances (adjustable)
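
For orientation, these settings could be gathered into a single block like the sketch below; the constant names and the SpeechBrain model id are illustrative assumptions and may differ from what train_eta_wavlm.py actually uses.

```python
# Hypothetical constants mirroring the settings listed above; the actual
# names in train_eta_wavlm.py may differ.
WAVLM_MODEL = "microsoft/wavlm-large"
WAVLM_LAYER = 15                     # layer whose hidden states are used
SPEAKER_ENCODER = "speechbrain/spkrec-ecapa-voxceleb"  # assumed model id
PCA_COMPONENTS = 128                 # 192-dim ECAPA embedding -> 128 dims
SUBSAMPLE_FRAMES = 50                # 'L' in the paper: frames kept per utterance
MIN_DURATION_S, MAX_DURATION_S = 1.0, 6.0
MAX_TRAINING_FILES = 200
```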

Inference

Run the inference script:

python simple_inference.py

Make sure the trained Eta-WavLM checkpoint is loaded (./models/eta_wavlm_transform.pkl).

This will produce 5 files in ./inference_output (see the loading sketch after this list):

  • audio_output_eta_features.npy
  • audio_output_original_features.npy
  • audio_output_results.json
  • audio_output_speaker_component.npy
  • audio_output_speaker_embedding.npy
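
As a quick sanity check, the saved arrays and JSON can be inspected like this; a minimal sketch assuming the audio_output_* file names above and (T, 1024)-shaped feature arrays:

```python
import json
import numpy as np

out_dir = "./inference_output"

# Speaker-independent (eta) and original WavLM features: (T, 1024) arrays.
eta = np.load(f"{out_dir}/audio_output_eta_features.npy")
orig = np.load(f"{out_dir}/audio_output_original_features.npy")

# Per-frame cosine similarity between the original and eta features.
cos = np.sum(eta * orig, axis=1) / (
    np.linalg.norm(eta, axis=1) * np.linalg.norm(orig, axis=1)
)
print("mean cosine similarity:", cos.mean())

# The analysis block of the JSON summarises the same comparison.
with open(f"{out_dir}/audio_output_results.json") as f:
    print(json.dumps(json.load(f)["analysis"], indent=2))
```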

The content of audio_output_results.json will look like this:

{
  "audio_path": "/speech/suma/rooshil/sample1.wav",
  "duration_seconds": 7.25,
  "sequence_length": 300,
  "feature_dimension": 1024,
  "speaker_embedding_dimension": 192,
  "analysis": {
    "speaker_component_norm": 228.8349,
    "speaker_embedding_norm": 322.0620,
    "speaker_contribution_ratio": 0.8512,
    "original_feature_norm_mean": 268.8370,
    "eta_feature_norm_mean": 249.3773,
    "cosine_similarity_mean": 0.5927,
    "cosine_similarity_std": 0.1325,
    "original_feature_variance": 50.2814,
    "eta_feature_variance": 50.2814,
    "variance_retention_ratio": 1.0,
    "speaker_removal_effectiveness": 0.4073
  },
  "model_info": {
    "wavlm_model": "microsoft/wavlm-large",
    "wavlm_layer": 15,
    "speaker_encoder": "ECAPA-TDNN",
    "A_star_shape": [128, 1024],
    "b_star_shape": [1024]
  }
}

LitServe Deployment Details

  1. Start the server; it will run at http://localhost:8000.
python server.py

The input format is

{
  "audio1": "base64_encoded_source_audio_wav",
  "audio2": "base64_encoded_reference_audio_wav"
}

Field Descriptions:

audio1: Source speech audio (the content you want to convert; it is passed through Eta-WavLM to obtain speaker-independent features)

audio2: Reference speech audio (the target speaker voice characteristics)

Format: WAV files encoded as base64 strings. In client.py you can simply provide the path to a WAV file and it will be encoded automatically.

Requirements: Any sample rate (auto-resampled to 16kHz), mono or stereo

The output format is

{
  "output_wav_b64": "base64_encoded_converted_audio_wav"
}

This is the voice-converted audio; the client decodes it and saves it to the desired path.

  2. To query the running server, use the example client.py: set src_wav_path and ref_wav_path to the paths of your WAV files, and the voice-converted output WAV will be saved to the path you set in the file. A minimal illustrative client is sketched below.
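
For orientation, a client along these lines could talk to the server; it assumes the LitServe default /predict route and uses illustrative file paths, so refer to client.py for the exact request logic used in this repository.

```python
import base64
import requests

# Illustrative paths; replace with your own files.
src_wav_path = "source.wav"
ref_wav_path = "reference.wav"
out_path = "converted.wav"

def encode_wav(path: str) -> str:
    """Read a WAV file and return it as a base64 string."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

payload = {
    "audio1": encode_wav(src_wav_path),   # source content
    "audio2": encode_wav(ref_wav_path),   # target speaker reference
}

# LitServe exposes /predict by default on the port server.py starts with.
resp = requests.post("http://localhost:8000/predict", json=payload)
resp.raise_for_status()

# Decode the converted audio and save it.
with open(out_path, "wb") as f:
    f.write(base64.b64decode(resp.json()["output_wav_b64"]))
print(f"Saved converted audio to {out_path}")
```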

Thought Process & Understanding

  1. The Eta-WavLM paper proposes that speaker-specific characteristics in WavLM representations can be removed through a linear transformation.

  2. The training process reduces to a well-defined linear regression problem, S = D^T A + 1_N b^T, where S stacks the WavLM features, D holds the PCA-reduced speaker embeddings, and the goal is to learn parameters A* and b* that best explain the WavLM features in terms of the speaker embeddings.

  3. This formulation turns speaker-identity removal from a complex neural network problem into a tractable linear algebra problem (see the removal sketch after this list).

  4. After some research, including the results of the paper "Comparing the Moore-Penrose Pseudoinverse and Gradient Descent for Solving Linear Regression Problems: A Performance Analysis", I concluded that the pseudo-inverse is a faster and more efficient way to solve this problem than gradient descent.

  5. WavLM was chosen over HuBERT because it is trained on a larger dataset, is more robust to noise, and handles overlapping speech better.

  6. I also analysed the speaker-independent versus the original speaker-dependent features to see the effect of training and the change in the nature of the embeddings, unlike the paper, which demonstrates its claim by improving a voice conversion system.
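
To make items 2 and 3 concrete, here is a minimal sketch of the removal step once A* and b* are known, reusing the illustrative variable names from the training sketch above:

```python
import numpy as np

def remove_speaker_component(wavlm_feats: np.ndarray,
                             spk_pca: np.ndarray,
                             A_star: np.ndarray,
                             b_star: np.ndarray) -> np.ndarray:
    """Subtract the predicted speaker component from the WavLM features.

    wavlm_feats: (T, 1024) layer-15 WavLM features of one utterance
    spk_pca:     (128,) PCA-reduced ECAPA-TDNN embedding of the same utterance
    A_star:      (128, 1024), b_star: (1024,)
    """
    # The speaker component is the same for every frame of the utterance.
    speaker_component = spk_pca @ A_star + b_star    # (1024,)
    # Broadcasting subtracts it from each frame: the eta (speaker-independent) features.
    return wavlm_feats - speaker_component
```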

Trade Offs

  1. Used the LibriSpeech train-clean-100 subset due to storage constraints.

  2. The paper states that their model was trained on 1000 hours of data, which covers more speakers and variability than my configuration, since I used only 200 files to train my Eta-WavLM model.

  3. The paper describes the voice conversion architecture they used, but rather than training it from scratch I used a pre-trained kNN-VC checkpoint and replaced the source-utterance WavLM features with Eta-WavLM features, to test the conversion quality and get the whole pipeline working as a proof of concept (see the sketch after this list).
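
A minimal sketch of that feature swap is shown below; it assumes the publicly documented torch.hub interface of bshall/knn-vc and takes pre-computed Eta-WavLM query features as input. knnvc.py is the actual implementation in this repository.

```python
import torch

def convert_with_eta_features(eta_query_seq: torch.Tensor,
                              ref_wav_paths: list[str]) -> torch.Tensor:
    """Run kNN-VC with Eta-WavLM features for the source utterance.

    eta_query_seq: (T, 1024) Eta-WavLM features of the source utterance
                   (WavLM features with the speaker component removed).
    ref_wav_paths: reference WAV files of the target speaker.
    """
    # Pre-trained kNN-VC pipeline (WavLM encoder + HiFi-GAN vocoder).
    knn_vc = torch.hub.load("bshall/knn-vc", "knn_vc",
                            prematched=True, trust_repo=True, device="cpu")
    # The reference matching set keeps its ordinary WavLM features; only the
    # query (source) side is swapped for Eta-WavLM features.
    matching_set = knn_vc.get_matching_set(ref_wav_paths)
    # kNN regression + vocoding, exactly as in plain kNN-VC.
    return knn_vc.match(eta_query_seq, matching_set, topk=4)
```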

Future Plans

  1. Train Eta-WavLM on a bigger dataset.
  2. Train an end-to-end voice conversion model from scratch using Eta-WavLM representations to get the best output.
