
Scaling Multilingual Visual Speech Recognition

This repository contains the code for our paper: Scaling Multilingual Visual Speech Recognition.
Authors: K R Prajwal*, Sindhu Hegde*, Andrew Zisserman

📝 Paper 📑 Project Page 📦 MultiVSR Dataset 🛠 Demo Video

We introduce MultiVSR - a large-scale dataset for multilingual visual speech recognition. MultiVSR comprises ~12,000 hours of video data paired with word-aligned transcripts from 13 languages. We design a multi-task Transformer-based encoder-decoder model, which can simultaneously perform two tasks: (i) language identification and (ii) visual speech recognition from silent lip videos. Our model is jointly trained across all languages using a sequence-to-sequence framework.
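As a rough sketch of this idea (this is not the released implementation; the class name, layer sizes, and the language-token-first decoding convention below are illustrative assumptions), a single encoder-decoder can emit the language ID as its first decoded token, followed by the transcription:

import torch
import torch.nn as nn

# Minimal multi-task sketch: one Transformer encoder-decoder where the
# first decoded token is a language ID (e.g. <en>) and the remaining
# tokens are the transcription, so both tasks share one seq2seq model.
class MultiTaskVSR(nn.Module):
    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True,
        )
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, lip_feats, tgt_tokens):
        # lip_feats: (B, T, d_model) features from a visual front-end
        # tgt_tokens: (B, L) language token first, then text tokens
        tgt = self.embed(tgt_tokens)
        mask = nn.Transformer.generate_square_subsequent_mask(
            tgt_tokens.size(1)).to(lip_feats.device)
        out = self.transformer(lip_feats, tgt, tgt_mask=mask)
        return self.proj(out)

model = MultiTaskVSR(vocab_size=1000)
feats = torch.randn(1, 100, 512)    # e.g. 100 frames of lip features
tokens = torch.tensor([[1, 5, 9]])  # hypothetical <en> + subword ids
logits = model(feats, tokens)       # shape: (1, 3, 1000)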

News 🚀🚀🚀

  • [2025.08.15] 🎬 Integrated the video pre-processing code from SyncNet
  • [2025.06.24] 🔥 Real-world video inference code released
  • [2025.06.23] 🧬 Pre-trained checkpoints released
  • [2025.04.10] 🎥 The MultiVSR dataset released

Dataset

Refer to the dataset section for details on downloading and pre-processing the data.

Installation

Clone the repository:

git clone https://github.com/Sindhu-Hegde/multivsr.git

Install the required packages (it is recommended to create a new environment):

python -m venv env_multivsr
source env_multivsr/bin/activate
pip install -r requirements.txt

Note: The code has been tested with Python 3.13.5
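A quick way to sanity-check the environment (PyTorch is assumed to be pulled in by requirements.txt, since the inference code operates on torch tensors):

import sys
import torch

# Confirms the interpreter version and that PyTorch imports correctly.
print("Python:", sys.version.split()[0])
print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())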

Checkpoints

Download the trained models and save them in the checkpoints folder

Model | Download Link
VTP Feature Extractor | Link
Lip-reading Transformer | Link
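The example run later in this README expects the checkpoints at the paths below (the file names come from that example command; adjust if you save them differently):

checkpoints/
├── feature_extractor.pth    # VTP feature extractor
└── model.pth                # lip-reading transformer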

Inference on a real-world video

Step-1: Pre-process the video

The first step is to extract and pre-process the face tracks using the run_pipeline.py script, adapted from the syncnet_python repository. Run the following command to pre-process the video:

cd preprocess
python run_pipeline.py --videofile <path-to-video-file> --reference <name-of-the-result-folder> --data_dir <folder-path-to-save-the-results>
cd ..

The processed face tracks are saved in <data_dir>/<reference>/pycrop/*.avi. Once the face tracks are extracted, the lip-reading script in Step-2 can be used to transcribe them.
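For example, the extracted tracks can be listed from Python to confirm that pre-processing succeeded (<data_dir> and <reference> are the placeholder values passed to run_pipeline.py above):

from glob import glob

# Collect the cropped face tracks produced by run_pipeline.py
tracks = sorted(glob("<data_dir>/<reference>/pycrop/*.avi"))
print(f"Found {len(tracks)} face track(s)")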

Step-2: Lip-reading inference

python inference.py --ckpt_path <lipreading-transformer-checkpoint> --visual_encoder_ckpt_path <feature-extractor-checkpoint> --fpath <path-to-the-processed-video-file>

The following pre-processed sample videos are available for a quick test. Note: Step-1 should be skipped for these sample videos, since they are already pre-processed.

Video path | Language | GT Transcription
samples/GBfc471SoSo-00000.mp4 | English (en) | So this is part of the Paradise Papers that came out this weekend
samples/GgQ9IGGSQ0I-00003.mp4 | Italian (it) | In questo video vi abbiamo parlato soltanto dei verbi principali, cioè dei verbi più usati in italiano, soprattutto per quanto riguarda
samples/hxn8clTtMTo-00001.mp4 | French (fr) | ça va jouer sur nos pratiques, ça va jouer sur nos comportements. Donc on va avoir des
samples/LNjqg9qEu0Y-00008.mp4 | German (de) | weil die Kompetenzen und Kapazitäten zwischen den Geschlechtern unterschiedlich verteilt sind.
samples/RGI2GUiiL6o-00003.mp4 | Portuguese (pt) | magia simbólica específica nesse sentido. Quando um caboclo, por exemplo, vai riscar

Example run:

python inference.py --ckpt_path checkpoints/model.pth --visual_encoder_ckpt_path checkpoints/feature_extractor.pth --fpath samples/GBfc471SoSo-00000.mp4

Output of the above run:

Following models are loaded successfully:
/mnt/models/multivsr/model.pth
/mnt/models/multivsr/feature_extractor.pth

Extracted frames from the input video:  torch.Size([1, 3, 100, 96, 96])

Running inference...
-------------------------------------------------------------------
-------------------------------------------------------------------
Language: en
Transcription: and so this is part of the paradox papers that came out in this weekend
-------------------------------------------------------------------
-------------------------------------------------------------------
Additional options that can be set if needed:
--start <start-second> 
--end <end-second> 
--lang_id <two-letter-lang-code>

Setting the language code can lead to more accurate results, as errors in language identification can be avoided.
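For example, the French sample above can be transcribed with the language fixed and, optionally, a time window selected (the 0-5 second range here is purely illustrative):

python inference.py --ckpt_path checkpoints/model.pth --visual_encoder_ckpt_path checkpoints/feature_extractor.pth --fpath samples/hxn8clTtMTo-00001.mp4 --lang_id fr --start 0 --end 5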

Note: If you get a UnicodeEncodeError, run export PYTHONIOENCODING=utf-8 in your terminal session.

Citation

If you find this work useful for your research, please consider citing our paper:

@InProceedings{prajwal2025multivsr,
      author       = "Prajwal, K R and Hegde, Sindhu and Zisserman, Andrew",
      title        = "Scaling Multilingual Visual Speech Recognition",
      booktitle    = "IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)", 
      pages        = "1-5",
      year         = "2025",
}
