This repository implements a Speaker Verification System using Gaussian Mixture Models (GMMs) with MFCC, Delta, and Double Delta features. The system is trained and tested on a subset of a speaker recognition dataset containing 1-second WAV files per speaker.
Speaker verification is an essential component in audio-based security and authentication systems. This project uses GMMs to learn speaker-specific acoustic patterns and verify the identity of a speaker based on short utterances.
- Preprocessing includes noise addition, resampling, and pre-emphasis.
- Feature extraction based on MFCC, delta, and double-delta coefficients.
- Model training using Gaussian Mixture Models with Expectation-Maximization.
- Equal Error Rate (EER) used for evaluation; the best model achieves EER = 0.177.
- Includes speaker pair comparisons and real-world applicability demonstrations.
.
├── Audio/          # Contains 1-second audio files for each speaker
├── Noise/          # Contains noise samples used for robustness
├── test_pairs.txt  # File containing speaker comparison test cases
├── ml-end.ipynb    # Main notebook for training, evaluation, and testing
└── README.md       # You're reading it

- Silence and Noise Removal
- Resampling to 16kHz
- Pre-Emphasis Filtering
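
A minimal sketch of the resampling and pre-emphasis steps above, assuming `librosa` is used for audio loading; the pre-emphasis coefficient of 0.97 and the helper name `preprocess` are illustrative assumptions, not taken from the notebook:

```python
import numpy as np
import librosa

def preprocess(path, target_sr=16000, pre_emphasis=0.97):
    # Load the 1-second clip and resample to the target rate (16 kHz)
    signal, sr = librosa.load(path, sr=target_sr)
    # Pre-emphasis filter: y[t] = x[t] - alpha * x[t-1], boosting high frequencies
    emphasized = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])
    return emphasized, sr
```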
- MFCC (Mel-Frequency Cepstral Coefficients)
- MFCC Delta
- MFCC Double Delta
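
The per-frame feature vector stacks the static MFCCs with their delta and double-delta coefficients. A rough sketch using `librosa`, where the choice of 20 coefficients and the helper name `extract_features` are assumptions:

```python
import numpy as np
import librosa

def extract_features(signal, sr=16000, n_mfcc=20):
    # Static MFCCs: shape (n_mfcc, n_frames)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    # First- and second-order temporal derivatives (delta, double delta)
    delta = librosa.feature.delta(mfcc)
    delta2 = librosa.feature.delta(mfcc, order=2)
    # Frame-level feature matrix of shape (n_frames, 3 * n_mfcc)
    return np.vstack([mfcc, delta, delta2]).T
```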
- Trained a GMM per speaker using `sklearn.mixture.GaussianMixture`
- Parameters: `n_components=4`, `max_iter=160`
- Log-likelihood comparison for speaker prediction
- Equal Error Rate (EER) calculated for performance evaluation
- Also supports speaker pair comparison
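
A minimal sketch of the per-speaker training and log-likelihood scoring described above, using `sklearn.mixture.GaussianMixture` with the listed parameters; the diagonal covariance type, the `features_by_speaker` dict, and the helper names are illustrative assumptions:

```python
from sklearn.mixture import GaussianMixture

def train_speaker_models(features_by_speaker, n_components=4, max_iter=160):
    # Fit one GMM per speaker (via EM) on that speaker's frame-level features
    models = {}
    for speaker, feats in features_by_speaker.items():
        gmm = GaussianMixture(n_components=n_components, max_iter=max_iter,
                              covariance_type="diag", random_state=0)
        gmm.fit(feats)
        models[speaker] = gmm
    return models

def predict_speaker(models, test_feats):
    # Average per-frame log-likelihood under each speaker model;
    # the highest-scoring model gives the predicted speaker
    scores = {spk: gmm.score(test_feats) for spk, gmm in models.items()}
    return max(scores, key=scores.get), scores
```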
| Model Version | EER |
|---|---|
| MFCC + Delta + Double Delta | 0.177 |
| MFCC only | Higher |
A lower EER indicates better performance (a better balance between false accepts and false rejects).
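
One common way to compute the EER from verification trial scores is to find the operating point where the false accept and false reject rates cross on the ROC curve. The sketch below uses `sklearn.metrics.roc_curve` and is an assumed procedure, not necessarily the exact one in the notebook:

```python
import numpy as np
from sklearn.metrics import roc_curve

def compute_eer(labels, scores):
    # labels: 1 for same-speaker trials, 0 for different-speaker trials
    # scores: similarity / log-likelihood score for each trial
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr  # false reject rate
    # EER: point where false accept rate equals false reject rate
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2
```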
Key papers referenced include:
- Reynolds & Rose (1995): Robust text-independent speaker identification using GMM
- Dehak et al. (2007): Modeling Prosodic Features with Joint Factor Analysis
- Jadhav et al. (2018): GMM + MFCC + EM-based speaker recognition
See the full reference list in the report.
- Train on larger datasets with more speakers
- Explore i-vector and x-vector embeddings
- Apply deep learning approaches (e.g., LSTM, CNN) for feature extraction
- Integrate with real-time APIs or voice assistants