This project proposes a multimodal empathy modeling framework that learns from dyadic interactions: it encodes a speaker's audio and facial dynamics (3DMM coefficients) and generates an empathizer's context-aware, dynamic audiovisual responses, going beyond static reactions.
Demo_video.mp4
This code is composed of five groups:
- `Deep3DFaceRecon_pytorch`: used to extract 3DMM coefficients. Mainly from sicxu/Deep3DFaceRecon, modified following RenYurui/PIRender.
- `preprocess`: scripts for making the dataset compatible with our method.
- `vico`: our method proposed in the paper Responsive Listening Head Generation: A Benchmark Dataset and Baseline (arXiv).
- `PIRender`: renders 3DMM coefficients to video. Mainly from RenYurui/PIRender with minor modifications.
- `evaluation`: quantitative analysis of generations, including SSIM, CPBD, PSNR, FID, CSIM, etc.
  - code for CSIM is mainly from deepinsight/insightface
  - code for lip sync evaluation is mainly from joonson/syncnet_python
  - in Challenge 2023, we use cleardusk/3DDFA_V2 to extract landmarks for LipLMD and 3DMM reconstruction
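As a quick illustration of two of the metrics listed above (SSIM and PSNR), the sketch below compares a generated clip with its ground truth frame by frame using OpenCV and scikit-image. The file names are placeholders, and the scripts in `evaluation/` remain the reference implementation.

```python
# A minimal sketch (not the official evaluation code): frame-wise SSIM/PSNR
# between a generated video and its ground truth. File names are placeholders.
import cv2
import numpy as np
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def gray_frames(path):
    cap = cv2.VideoCapture(path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        yield cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    cap.release()

ssim_vals, psnr_vals = [], []
for gen, ref in zip(gray_frames("generated.mp4"), gray_frames("ground_truth.mp4")):
    ssim_vals.append(structural_similarity(gen, ref))
    psnr_vals.append(peak_signal_noise_ratio(ref, gen))

print(f"SSIM {np.mean(ssim_vals):.4f} | PSNR {np.mean(psnr_vals):.2f} dB")
```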
For end-to-end inference, this repo may be useful.
- create a workspace

  ```bash
  mkdir vico-workspace
  cd vico-workspace
  ```

- download the dataset from this link and unzip `listening_head.zip` to the folder `data/`

  ```bash
  unzip listening_head.zip -d data/
  ```

- reorganize the `data/` folder to meet the requirements of PIRender

  ```bash
  mkdir -p data/listening_head/videos/test
  mv data/listening_head/videos/*.mp4 data/listening_head/videos/test
  ```

- clone the baseline code

  ```bash
  git clone https://github.com/dc3ea9f/vico_challenge_baseline.git
  ```
- extract 3D coefficients for videos ([reference])

  - change directory to `vico_challenge_baseline/Deep3DFaceRecon_pytorch/`

    ```bash
    cd vico_challenge_baseline/Deep3DFaceRecon_pytorch/
    ```

  - prepare the environment following this
  - prepare `BFM/` and `checkpoints/` following these instructions
  - extract facial landmarks from videos

    ```bash
    python extract_kp_videos.py \
        --input_dir ../../data/listening_head/videos/ \
        --output_dir ../../data/listening_head/keypoints/ \
        --device_ids 0,1,2,3 \
        --workers 12
    ```

  - extract coefficients for videos

    ```bash
    python face_recon_videos.py \
        --input_dir ../../data/listening_head/videos/ \
        --keypoint_dir ../../data/listening_head/keypoints/ \
        --output_dir ../../data/listening_head/recons/ \
        --inference_batch_size 128 \
        --name=face_recon_feat0.2_augment \
        --epoch=20 \
        --model facerecon
    ```
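  The exact layout of the reconstructed coefficients is defined by `face_recon_videos.py`; assuming Deep3DFaceRecon-style `.mat` output with the usual BFM coefficient groups (identity, expression, texture, pose, lighting), one hedged way to peek at a file is:

  ```python
  # A rough sketch for inspecting a reconstructed coefficient file.
  # Keys and paths are assumptions; check what face_recon_videos.py actually writes.
  from scipy.io import loadmat

  coeffs = loadmat("../../data/listening_head/recons/<some_video>/<some_frame>.mat")
  for key, value in coeffs.items():
      if not key.startswith("__"):          # skip MATLAB metadata entries
          print(key, getattr(value, "shape", type(value)))
  ```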
- extract audio features

  - change directory to `vico_challenge_baseline/preprocess`

    ```bash
    cd ../preprocess
    ```

  - install the python packages `librosa`, `torchaudio` and `soundfile`
  - extract audio features (see the illustrative sketch after this group)

    ```bash
    python extract_audio_features.py \
        --input_audio_folder ../../data/listening_head/audios/ \
        --input_recons_folder ../../data/listening_head/recons/ \
        --output_folder ../../data/listening_head/example/features/audio_feats
    ```

  - reorganize video features

    ```bash
    python rearrange_recon_coeffs.py \
        --input_folder ../../data/listening_head/recons/ \
        --output_folder ../../data/listening_head/example/features/video_feats
    ```
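  The actual features are defined by `extract_audio_features.py`; as a hedged illustration of the kind of frame-level acoustic features `librosa` can produce, and of aligning them to the video frame rate, with purely illustrative parameters:

  ```python
  # A minimal sketch (not the baseline's extractor): frame-level MFCCs whose hop
  # length is aligned to an assumed 30 fps video. Paths and parameters are
  # illustrative; extract_audio_features.py defines the real features.
  import librosa

  wav, sr = librosa.load("../../data/listening_head/audios/<some_clip>.wav", sr=16000)
  fps = 30
  hop = sr // fps                              # one acoustic frame per video frame
  mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=13, hop_length=hop)
  print(mfcc.shape)                            # (13, ~number_of_video_frames)
  ```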
- organize data

  - compute mean and std for features

    ```bash
    python statistics_mean_std.py ../../data/listening_head/example/features
    ```
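    These statistics are presumably used to normalize the features; a hedged sketch of that idea over a folder of `.npy` feature files (the real file layout and output format are whatever `statistics_mean_std.py` defines):

    ```python
    # A rough sketch of per-dimension mean/std over .npy feature files for
    # normalization. File layout and output names are assumptions; the real
    # logic lives in statistics_mean_std.py.
    from pathlib import Path
    import numpy as np

    feat_dir = Path("../../data/listening_head/example/features/audio_feats")
    feats = [np.load(p) for p in sorted(feat_dir.glob("*.npy"))]
    stacked = np.concatenate(feats, axis=0)     # assumed shape: (frames, feat_dim)
    np.save(feat_dir / "mean.npy", stacked.mean(axis=0))
    np.save(feat_dir / "std.npy", stacked.std(axis=0))
    ```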
  - organize for training

    ```bash
    mkdir ../../data/listening_head/example/metadata
    cp ../../data/listening_head/train.csv ../../data/listening_head/example/metadata/data.csv
    cd ../vico
    ln -s ../../data/listening_head/example/ ./data
    ```
- train the baseline

  ```bash
  python -m torch.distributed.launch --nproc_per_node 4 --master_port 22345 train.py \
      --batch_size 4 \
      --time_size 90 \
      --max_epochs 500 \
      --lr 0.002 \
      --task listener \
      --output_path saved/baseline_listener
  ```

  For users who wish to run simple inference without training, we provide a pretrained checkpoint, which can be downloaded from Google Drive.
- inference

  ```bash
  python eval.py \
      --batch_size 4 \
      --output_path saved/baseline_listener_E500 \
      --resume saved/baseline_listener/checkpoints/Epoch_500.bin \
      --task listener
  ```
- change directory to render

  ```bash
  cd ../PIRender
  ```

- prepare the environment for PIRender following this
- download the trained weights of PIRender following this
- prepare vox lmdb

  ```bash
  python scripts/prepare_vox_lmdb.py \
      --path ../../data/listening_head/videos/ \
      --coeff_3dmm_path ../vico/saved/baseline_listener_E500/recon_coeffs/ \
      --out ../vico/saved/baseline_listener_E500/vox_lmdb/
  ```

- render to videos

  ```bash
  python -m torch.distributed.launch --nproc_per_node=1 --master_port 42345 inference_avarmerg.py \
      --config ./config/face_demo.yaml \
      --name face \
      --no_resume \
      --input ../vico/saved/baseline_listener_E500/vox_lmdb/ \
      --output_dir ./vox_result/baseline_listener_E500
  ```
- Change directory to AnyGPT

  ```bash
  git clone https://github.com/OpenMOSS/AnyGPT
  cd AnyGPT
  ```

- Set up the environment

  ```bash
  conda create --name AnyGPT python=3.9
  conda activate AnyGPT
  pip install -r requirements.txt
  ```
- Download pre-trained models

  - Check the AnyGPT-base weights in fnlp/AnyGPT-base
  - Check the AnyGPT-chat weights in fnlp/AnyGPT-chat
  - Check the SpeechTokenizer and Soundstorm weights in fnlp/AnyGPT-speech-modules
  - Check the SEED tokenizer weights in AILab-CVC/seed-tokenizer-2
The SpeechTokenizer is used for tokenizing and reconstructing speech, Soundstorm is responsible for completing paralinguistic information, and SEED-tokenizer is used for tokenizing images.
The model weights of unCLIP SD-UNet, which is used to reconstruct images, and of Encodec-32k, which is used to tokenize and reconstruct music, will be downloaded automatically.
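As a rough sketch of what the speech tokenization step looks like, assuming the `speechtokenizer` package exposes the `load_from_checkpoint`/`encode`/`decode` interface shown in its upstream README (verify against the installed version; the audio path is a placeholder, and the checkpoint paths are taken from the example inference command below):

```python
# A hedged sketch of tokenizing and reconstructing speech with SpeechTokenizer.
# API usage follows the upstream fnlp/SpeechTokenizer README; verify against the
# installed version. The speech file path is a placeholder.
import torch
import torchaudio
from speechtokenizer import SpeechTokenizer

model = SpeechTokenizer.load_from_checkpoint(
    "models/speechtokenizer/config.json", "models/speechtokenizer/ckpt.dev"
)
model.eval()

wav, sr = torchaudio.load("example_speech.wav")
wav = wav[:1, :]                                  # keep a single channel
if sr != model.sample_rate:                       # resample to the tokenizer's rate
    wav = torchaudio.functional.resample(wav, sr, model.sample_rate)

with torch.no_grad():
    codes = model.encode(wav.unsqueeze(0))        # (n_quantizers, batch, timesteps)
    recon = model.decode(codes)                   # waveform reconstruction
```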
- Generate Audio (Inference)

  ```bash
  python anygpt/src/infer/cli_infer_chat_model.py \
      --model-name-or-path 'path/to/model' \
      --image-tokenizer-path 'path/to/model' \
      --speech-tokenizer-path 'path/to/model' \
      --speech-tokenizer-config 'path/to/config' \
      --soundstorm-path 'path/to/model' \
      --output-dir "infer_output/chat"
  ```

  for example

  ```bash
  python anygpt/src/infer/cli_infer_chat_model.py \
      --model-name-or-path models/anygpt/chat \
      --image-tokenizer-path models/seed-tokenizer-2/seed_quantizer.pt \
      --speech-tokenizer-path models/speechtokenizer/ckpt.dev \
      --speech-tokenizer-config models/speechtokenizer/config.json \
      --soundstorm-path models/soundstorm/speechtokenizer_soundstorm_mls.pt \
      --output-dir "infer_output/chat"
  ```
- Change directory to Wav2Lip

  ```bash
  git clone https://github.com/Rudrabha/Wav2Lip
  cd Wav2Lip
  ```

- Set up the environment

  ```bash
  conda create -n wav2lip python=3.6
  conda activate wav2lip
  sudo apt-get install ffmpeg
  pip install -r requirements.txt
  ```
- Download pre-trained models

  - The face detection pre-trained model should be downloaded to `face_detection/detection/sfd/s3fd.pth`. Alternative link if the above does not work.

  | Model | Description | Link to the model |
  | --- | --- | --- |
  | Wav2Lip | Highly accurate lip-sync | Link |
  | Wav2Lip + GAN | Slightly inferior lip-sync, but better visual quality | Link |

  - Place all your checkpoints (`.pth` files) in `./checkpoints`.
- Lip-syncing videos using the pre-trained models (Inference)

  ```bash
  python inference.py --checkpoint_path ./checkpoints/<ckpt> --face <video.mp4> --audio <an-audio-source>
  ```
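  Before running inference, it can help to confirm that the face video and the driving audio cover similar time spans; a small hedged check (file names are placeholders for the rendered face video and the generated speech):

  ```python
  # A hedged sanity check: compare the durations of the face video and the
  # driving audio before lip-syncing. File names are placeholders.
  import cv2
  import soundfile as sf

  cap = cv2.VideoCapture("rendered_face.mp4")
  video_sec = cap.get(cv2.CAP_PROP_FRAME_COUNT) / cap.get(cv2.CAP_PROP_FPS)
  cap.release()

  audio_sec = sf.info("generated_speech.wav").duration
  print(f"video: {video_sec:.2f}s  audio: {audio_sec:.2f}s")
  ```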
- Change directory to ESRGAN

  ```bash
  git clone https://github.com/xinntao/ESRGAN
  cd ESRGAN
  ```

- Set up the environment

  ```bash
  conda create -n esrgan python=3.6
  pip install numpy opencv-python
  ```

  Dependencies:
  - Python 3
  - PyTorch >= 1.0 (CUDA version >= 7.5 if installing with CUDA. More details)

- Download pre-trained models from Google Drive or Baidu Drive and place them in `./models`. We provide two models with high perceptual quality and high PSNR performance (see model list).

- Inference

  ```bash
  python test.py
  ```

  The results are in the `./results` folder.
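  ESRGAN's `test.py` operates on image files rather than videos, so one hedged way to glue it into this pipeline is to dump the lip-synced video to frames, upscale them, and reassemble the result. The sketch below assumes the stock `test.py`, which reads images from `LR/` and writes upscaled ones to `results/`; input/output names are placeholders, and audio has to be re-muxed separately (for example with ffmpeg).

  ```python
  # A hedged sketch for running ESRGAN on a video: split the lip-synced clip into
  # frames for test.py, then rebuild a video from the upscaled frames.
  # File names are placeholders; audio must be re-muxed separately.
  import glob
  import os
  import cv2

  # 1) video -> frames in LR/
  os.makedirs("LR", exist_ok=True)
  cap = cv2.VideoCapture("wav2lip_output.mp4")
  fps = cap.get(cv2.CAP_PROP_FPS)
  idx = 0
  while True:
      ok, frame = cap.read()
      if not ok:
          break
      cv2.imwrite(f"LR/{idx:06d}.png", frame)
      idx += 1
  cap.release()

  # ... run `python test.py` here ...

  # 2) upscaled results/ -> video
  paths = sorted(glob.glob("results/*.png"))
  h, w = cv2.imread(paths[0]).shape[:2]
  writer = cv2.VideoWriter("upscaled.mp4", cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
  for p in paths:
      writer.write(cv2.imread(p))
  writer.release()
  ```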
- This work is heavily based on vico_challenge_baseline, Deep3DFaceRecon_pytorch, PIRender, AnyGPT, Wav2Lip, ESRGAN, and AvaMERG. Thanks to all the authors for their great work.