Implementation of the Eta-WavLM paper to obtain speaker-independent features from WavLM representations. The Eta-WavLM features are used for the source utterance when running inference with a kNN-VC voice conversion model, served through a LitServe deployment.

- To clone the repository:

```bash
git clone https://github.com/rooshil-bhatia/Eta_WavLM_implementation.git
```

- Create a conda environment with Python >= 3.10 and activate it:

```bash
conda create -n eta_wavlm python=3.10
conda activate eta_wavlm
```

- Install all the dependencies:

```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
cd Eta_WavLM_implementation
pip install -r requirements.txt
```

Repository structure:

```
Eta_WavLM_implementation/
├── data/                    --> LibriSpeech train-clean-100 dataset
├── inference_output/        --> Features produced by simple_inference.py
├── models/                  --> Trained model files are saved here (.pkl)
├── pretrained_models/       --> The ECAPA-TDNN model is downloaded here automatically when training starts
├── client.py                --> Example API client
├── inference_eta_wavlm.py   --> Full inference wrapper
├── knnvc.py                 --> kNN-VC inference with Eta-WavLM features for the source utterance
├── model.py                 --> Core Eta-WavLM implementation
├── server.py                --> LitServe voice conversion API server
├── simple_inference.py      --> Inference example to extract features from an audio file
└── train_eta_wavlm.py       --> Training script
```
After running

```bash
python train_eta_wavlm.py
```

the script will:
1) Download the LibriSpeech train-clean-100 dataset automatically to ./data (~5.95 GB)
2) Extract WavLM features from the 15th layer (1024 dimensions)
3) Extract ECAPA-TDNN embeddings for all utterances and apply PCA reduction (192 → 128 dims)
4) Learn the linear transformation (A*, b*) in closed form via the Moore-Penrose pseudo-inverse (see the sketch after the settings list below)
5) Save the trained model to ./models/eta_wavlm_transform.pkl
The training script uses these paper-compliant settings:
- WavLM Model: microsoft/wavlm-large (15th layer)
- Speaker Encoder: ECAPA-TDNN from SpeechBrain
- PCA Components: 128 (optimal per paper's ablation)
- Subsampled Frames ('L' in the paper): 50 frames per utterance (adjustable)
- Audio Duration: 1-6 seconds (a simple filtering criterion; adjustable)
- Max Training Files: 200 LibriSpeech utterances (adjustable)
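A minimal sketch of how step 4 can be solved in closed form. This only illustrates the least-squares formulation with the Moore-Penrose pseudo-inverse; the array names, shapes, and random placeholder data are assumptions for the example, not a copy of model.py:

```python
import numpy as np

# Assumed shapes (illustrative): S holds WavLM layer-15 features for all
# subsampled frames (N x 1024); E holds the PCA-reduced ECAPA-TDNN speaker
# embedding of the corresponding utterance, repeated per frame (N x 128).
N = 10_000
S = np.random.randn(N, 1024).astype(np.float32)   # stand-in for real features
E = np.random.randn(N, 128).astype(np.float32)    # stand-in for real embeddings

# Append a column of ones so the bias b* is learned jointly with A*.
E_aug = np.concatenate([E, np.ones((N, 1), dtype=np.float32)], axis=1)  # (N, 129)

# Closed-form least-squares solution via the Moore-Penrose pseudo-inverse:
# [A*; b*] = pinv([E, 1]) @ S
W = np.linalg.pinv(E_aug) @ S        # (129, 1024)
A_star, b_star = W[:-1], W[-1]       # (128, 1024) and (1024,)

# Speaker-independent (eta) features: subtract the predicted speaker part.
eta = S - (E @ A_star + b_star)
```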
After running

```bash
python simple_inference.py
```

(make sure the script loads your trained Eta-WavLM checkpoint), five files are written to ./inference_output:
- audio_output_eta_features.npy
- audio_output_original_features.npy
- audio_output_results.json
- audio_output_speaker_component.npy
- audio_output_speaker_embedding.npy
The content of audio_output_results.json will look like this:

```json
{
"audio_path": "/speech/suma/rooshil/sample1.wav",
"duration_seconds": 7.25,
"sequence_length": 300,
"feature_dimension": 1024,
"speaker_embedding_dimension": 192,
"analysis": {
"speaker_component_norm": 228.8349,
"speaker_embedding_norm": 322.0620,
"speaker_contribution_ratio": 0.8512,
"original_feature_norm_mean": 268.8370,
"eta_feature_norm_mean": 249.3773,
"cosine_similarity_mean": 0.5927,
"cosine_similarity_std": 0.1325,
"original_feature_variance": 50.2814,
"eta_feature_variance": 50.2814,
"variance_retention_ratio": 1.0,
"speaker_removal_effectiveness": 0.4073
},
"model_info": {
"wavlm_model": "microsoft/wavlm-large",
"wavlm_layer": 15,
"speaker_encoder": "ECAPA-TDNN",
"A_star_shape": [128, 1024],
"b_star_shape": [1024]
}
}
```
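A small sketch of how the saved arrays can be inspected after running simple_inference.py. File names follow the output list above; the relation original = eta + speaker component mirrors the paper's decomposition and is an assumption about how the arrays are stored:

```python
import json
import numpy as np

out_dir = "inference_output"

# Arrays written by simple_inference.py (names from the output list above).
eta = np.load(f"{out_dir}/audio_output_eta_features.npy")                 # (T, 1024)
orig = np.load(f"{out_dir}/audio_output_original_features.npy")           # (T, 1024)
spk_component = np.load(f"{out_dir}/audio_output_speaker_component.npy")
spk_embedding = np.load(f"{out_dir}/audio_output_speaker_embedding.npy")  # ECAPA-TDNN, 192-d

with open(f"{out_dir}/audio_output_results.json") as f:
    results = json.load(f)

print("frames:", orig.shape[0], "feature dim:", orig.shape[1])
print("speaker embedding dim:", spk_embedding.shape[-1])

# Assumed decomposition from the paper: original = eta + speaker component.
print("max |orig - (eta + spk_component)|:",
      np.abs(orig - (eta + spk_component)).max())

# Per-frame cosine similarity between original and eta features,
# comparable to "cosine_similarity_mean" in the results JSON.
cos = np.sum(orig * eta, axis=-1) / (
    np.linalg.norm(orig, axis=-1) * np.linalg.norm(eta, axis=-1) + 1e-8)
print("cosine similarity mean:", cos.mean(), "std:", cos.std())
```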
- To start the server (it will listen at http://localhost:8000):

```bash
python server.py
```

The input format is:

```json
{
"audio1": "base64_encoded_source_audio_wav",
"audio2": "base64_encoded_reference_audio_wav"
}
```
Field Descriptions:
audio1: Source speech audio (the content you want to convert; this audio is passed through Eta-WavLM to obtain speaker-independent features)
audio2: Reference speech audio (provides the target speaker's voice characteristics)
Format: WAV files encoded as base64 strings. The client can take a path to a WAV file and encode it automatically.
Requirements: Any sample rate (auto-resampled to 16kHz), mono or stereo
The output format is:

```json
{
"output_wav_b64": "base64_encoded_converted_audio_wav"
}
```
This is the voice-converted audio; it is decoded on the client side and saved to the desired path.
- To run inference against the running server, there is an example client.py file: set src_wav_path and ref_wav_path to your WAV files, and the voice-converted WAV will be saved to the output path you set in the file. A minimal client sketch is shown below.
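A minimal client sketch following the request/response format above. The field names match the API described here, but the /predict endpoint (LitServe's default route) and the file paths are assumptions; client.py in the repository is the authoritative example:

```python
import base64
import requests

# Hypothetical paths; replace with your own files.
src_wav_path = "source.wav"
ref_wav_path = "reference.wav"
out_wav_path = "converted.wav"

def wav_to_b64(path: str) -> str:
    """Read a WAV file and return its contents as a base64 string."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

payload = {
    "audio1": wav_to_b64(src_wav_path),   # source speech (content)
    "audio2": wav_to_b64(ref_wav_path),   # reference speech (target speaker)
}

# LitServe exposes a /predict endpoint by default; adjust if server.py differs.
resp = requests.post("http://localhost:8000/predict", json=payload, timeout=300)
resp.raise_for_status()

# Decode the base64 WAV returned by the server and save it.
with open(out_wav_path, "wb") as f:
    f.write(base64.b64decode(resp.json()["output_wav_b64"]))
print(f"Saved converted audio to {out_wav_path}")
```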
- The Eta-WavLM paper proposes that speaker-specific characteristics in WavLM representations can be removed through a linear transformation.
- The training process reduces to solving a well-defined linear regression problem, S = D^T A + 1_N b^T, where the goal is to learn parameters A* and b* that best explain the WavLM features in terms of the PCA-reduced speaker embeddings.
- This formulation turns speaker identity removal from a complex neural network problem into a tractable linear algebra problem.
- After some research, including the results of the paper "Comparing the Moore-Penrose Pseudoinverse and Gradient Descent for Solving Linear Regression Problems: A Performance Analysis", I concluded that the pseudo-inverse is the faster and more efficient way to solve this problem.
- Choosing WavLM over HuBERT is a good choice: it is trained on a larger dataset, is more robust to noise, and can handle overlapping speech.
- I did some analysis comparing speaker-dependent and speaker-independent features to see the effect of training and the change in the nature of the embeddings, unlike the paper, which demonstrates its method by improving the performance of a voice conversion system.
- Used the LibriSpeech train-clean-100 subset of the data due to storage constraints.
- The paper states that their model was trained on 1000 hours of data, which likely covers more speakers and more variability than my configuration, as I used 200 files to train my Eta_WavLM model.
- The paper describes the voice conversion architecture they used, but rather than training it from scratch I used a pretrained kNN-VC checkpoint and replaced the source utterance's WavLM features with Eta_WavLM features to test the voice conversion quality and get the whole pipeline working, hence making a proof of concept (a sketch of this swap follows the list below).
- To train Eta_WavLM on a bigger dataset.
- To train an end-to-end voice conversion model using Eta_WavLM representations from scratch to get the best output.
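A rough sketch of the feature-swap idea mentioned above, using the public torch.hub interface of the pretrained bshall/knn-vc checkpoint. The comment marks where this repo substitutes Eta-WavLM features; the actual glue code lives in knnvc.py, and the file paths here are placeholders:

```python
import torch
import torchaudio

# Load the pretrained kNN-VC pipeline (WavLM encoder + HiFi-GAN vocoder)
# from the public bshall/knn-vc torch.hub entry point.
knn_vc = torch.hub.load('bshall/knn-vc', 'knn_vc',
                        prematched=True, trust_repo=True, device='cuda')

# Matching set: WavLM features extracted from the reference (target speaker) audio.
matching_set = knn_vc.get_matching_set(['reference.wav'])

# Query features for the source utterance. Vanilla kNN-VC uses plain WavLM
# features as below; in this repo they are replaced by Eta-WavLM
# (speaker-independent) features of the same shape (T, 1024) before matching
# (see knnvc.py for the actual substitution).
query_seq = knn_vc.get_features('source.wav')

# kNN regression against the matching set plus vocoding yields the converted
# waveform at 16 kHz.
out_wav = knn_vc.match(query_seq, matching_set, topk=4)
torchaudio.save('converted.wav', out_wav[None].cpu(), 16000)
```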