https://github.com/thinhlx1993/vietnamese_asr
I collected data from many different sources for training. The training set contains over 10,000 hours of speech from the sources below:
- Common Voice dataset
- VIVOS dataset
- AN4 database audio files
- Vietnamese Speech recognition
- YouTube public dataset
- Vietnamese Dialogue Telephony speech dataset
- Travel Call Center Speech Data
- LibriSpeech
The model uses a SentencePieceTokenizer with a vocabulary of 128 tokens and has 121 M total parameters.
| # | Name | Type |
|---|---|---|
| 0 | preprocessor | AudioToMelSpectrogramPreprocessor |
| 1 | encoder | ConformerEncoder |
| 2 | decoder | ConvASRDecoder |
| 3 | loss | CTCLoss |
| 4 | spec_augmentation | SpectrogramAugmentation |
| 5 | wer | WER |
| Decoding | WER (%) | CER (%) |
|---|---|---|
| without ngram LM | 10.71 | 12.21 |
| with ngram LM | 9.15 | 10.2 |
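For reference, WER is the word-level Levenshtein edit distance between the hypothesis and the reference, divided by the reference length; CER is the same computation over characters. A minimal stdlib sketch (function name and example strings are illustrative, not from this repo):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,       # deletion
                           dp[i][j - 1] + 1,       # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / len(ref)

# One substitution out of three reference words:
print(wer("toi di hoc", "toi di choi"))  # → 0.3333...
```

Computing CER is identical, except the inputs are split into characters instead of words.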
https://drive.google.com/drive/folders/1SVNibfeMshfVkmatIU90LYok_Mf0zMD0?usp=sharing
https://github.com/NVIDIA/NeMo
I created a free-to-use API server for submitting inference data.
The input file should have a 16,000 Hz sample rate to avoid hidden bugs.
File duration must be under 10 seconds.
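Before uploading, both constraints can be checked locally with Python's built-in `wave` module. A sketch assuming plain PCM WAV input (the function name is illustrative):

```python
import wave

def check_input(path: str) -> None:
    """Validate a WAV file against the API's assumed constraints:
    a 16,000 Hz sample rate and a duration under 10 seconds."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        duration = wf.getnframes() / rate
    if rate != 16000:
        raise ValueError(f"expected 16000 Hz, got {rate} Hz")
    if duration >= 10.0:
        raise ValueError(f"expected duration < 10 s, got {duration:.2f} s")
```

If your audio does not meet the constraints, resample it (for example with `ffmpeg -i in.wav -ar 16000 out.wav`) and split it into shorter segments before uploading.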
```python
import subprocess
import json

# Upload a WAV file (16 kHz, < 10 s) to the API via curl.
command = [
    "curl", "--location", "https://api.voicesplitter.com/api/v1/uploads",
    "--form", "file=@/path/to/your/wav_file.wav",
]
result = subprocess.run(command, capture_output=True, text=True, check=True)

# json.loads decodes any \uXXXX escapes in the response correctly;
# the unicode_escape round-trip mangles Vietnamese (non-Latin-1) text.
print(json.loads(result.stdout))
```