This project proposes a multimodal empathy modeling framework that learns from dyadic interactions: it encodes a speaker's audio and facial dynamics (3DMM coefficients) and generates an empathizer's context-aware, dynamic audiovisual responses, going beyond static reactions.
Demo_video.mp4
This code is composed of five groups:
- `Deep3DFaceRecon_pytorch`: used to extract 3DMM coefficients. Mainly from sicxu/Deep3DFaceRecon, modified following RenYurui/PIRender.
- `preprocess`: scripts for making the dataset compatible with our method.
- `vico`: our method proposed in the paper Responsive Listening Head Generation: A Benchmark Dataset and Baseline (arXiv).
- `PIRender`: renders 3DMM coefficients to video. Mainly from RenYurui/PIRender with minor modifications.
- `evaluation`: quantitative analysis of generations, including SSIM, CPBD, PSNR, FID, CSIM, etc.
  - code for CSIM is mainly from deepinsight/insightface
  - code for lip sync evaluation is mainly from joonson/syncnet_python
  - in Challenge 2023, we use cleardusk/3DDFA_V2 to extract landmarks for LipLMD and 3DMM reconstruction
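As a quick illustration of two of the metrics listed above (SSIM and PSNR), the sketch below compares a generated clip with its ground truth frame by frame using OpenCV and scikit-image. The file names are placeholders, and the scripts in `evaluation/` remain the reference implementation.

```python
# A minimal sketch (not the official evaluation code): frame-wise SSIM/PSNR
# between a generated video and its ground truth. File names are placeholders.
import cv2
import numpy as np
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def gray_frames(path):
    cap = cv2.VideoCapture(path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        yield cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    cap.release()

ssim_vals, psnr_vals = [], []
for gen, ref in zip(gray_frames("generated.mp4"), gray_frames("ground_truth.mp4")):
    ssim_vals.append(structural_similarity(gen, ref))
    psnr_vals.append(peak_signal_noise_ratio(ref, gen))

print(f"SSIM {np.mean(ssim_vals):.4f} | PSNR {np.mean(psnr_vals):.2f} dB")
```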
For end-to-end inference, this repo may be useful.
- create a workspace

  ```bash
  mkdir vico-workspace
  cd vico-workspace
  ```

- download the dataset from this link and unzip `listening_head.zip` to the folder `data/`

  ```bash
  unzip listening_head.zip -d data/
  ```

- reorganize the `data/` folder to meet the requirements of PIRender

  ```bash
  mkdir -p data/listening_head/videos/test
  mv data/listening_head/videos/*.mp4 data/listening_head/videos/test
  ```

- clone the baseline code

  ```bash
  git clone https://github.com/dc3ea9f/vico_challenge_baseline.git
  ```
- extract 3D coefficients for videos ([reference])

  - change directory to `vico_challenge_baseline/Deep3DFaceRecon_pytorch/`

    ```bash
    cd vico_challenge_baseline/Deep3DFaceRecon_pytorch/
    ```

  - prepare the environment following this
  - prepare `BFM/` and `checkpoints/` following these instructions
  - extract facial landmarks from videos

    ```bash
    python extract_kp_videos.py \
        --input_dir ../../data/listening_head/videos/ \
        --output_dir ../../data/listening_head/keypoints/ \
        --device_ids 0,1,2,3 \
        --workers 12
    ```

  - extract coefficients for videos

    ```bash
    python face_recon_videos.py \
        --input_dir ../../data/listening_head/videos/ \
        --keypoint_dir ../../data/listening_head/keypoints/ \
        --output_dir ../../data/listening_head/recons/ \
        --inference_batch_size 128 \
        --name=face_recon_feat0.2_augment \
        --epoch=20 \
        --model facerecon
    ```
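  The exact layout of the reconstructed coefficients is defined by `face_recon_videos.py`; assuming Deep3DFaceRecon-style `.mat` output with the usual BFM coefficient groups (identity, expression, texture, pose, lighting), one hedged way to peek at a file is:

  ```python
  # A rough sketch for inspecting a reconstructed coefficient file.
  # Keys and paths are assumptions; check what face_recon_videos.py actually writes.
  from scipy.io import loadmat

  coeffs = loadmat("../../data/listening_head/recons/<some_video>/<some_frame>.mat")
  for key, value in coeffs.items():
      if not key.startswith("__"):          # skip MATLAB metadata entries
          print(key, getattr(value, "shape", type(value)))
  ```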
- extract audio features

  - change directory to `vico_challenge_baseline/preprocess`

    ```bash
    cd ../preprocess
    ```

  - install the python packages `librosa`, `torchaudio` and `soundfile`
  - extract audio features (see the illustrative sketch after this group)

    ```bash
    python extract_audio_features.py \
        --input_audio_folder ../../data/listening_head/audios/ \
        --input_recons_folder ../../data/listening_head/recons/ \
        --output_folder ../../data/listening_head/example/features/audio_feats
    ```

  - reorganize video features

    ```bash
    python rearrange_recon_coeffs.py \
        --input_folder ../../data/listening_head/recons/ \
        --output_folder ../../data/listening_head/example/features/video_feats
    ```
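  The actual features are defined by `extract_audio_features.py`; as a hedged illustration of the kind of frame-level acoustic features `librosa` can produce, and of aligning them to the video frame rate, with purely illustrative parameters:

  ```python
  # A minimal sketch (not the baseline's extractor): frame-level MFCCs whose hop
  # length is aligned to an assumed 30 fps video. Paths and parameters are
  # illustrative; extract_audio_features.py defines the real features.
  import librosa

  wav, sr = librosa.load("../../data/listening_head/audios/<some_clip>.wav", sr=16000)
  fps = 30
  hop = sr // fps                              # one acoustic frame per video frame
  mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=13, hop_length=hop)
  print(mfcc.shape)                            # (13, ~number_of_video_frames)
  ```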
- organize data

  - compute mean and std for features

    ```bash
    python statistics_mean_std.py ../../data/listening_head/example/features
    ```
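    These statistics are presumably used to normalize the features; a hedged sketch of that idea over a folder of `.npy` feature files (the real file layout and output format are whatever `statistics_mean_std.py` defines):

    ```python
    # A rough sketch of per-dimension mean/std over .npy feature files for
    # normalization. File layout and output names are assumptions; the real
    # logic lives in statistics_mean_std.py.
    from pathlib import Path
    import numpy as np

    feat_dir = Path("../../data/listening_head/example/features/audio_feats")
    feats = [np.load(p) for p in sorted(feat_dir.glob("*.npy"))]
    stacked = np.concatenate(feats, axis=0)     # assumed shape: (frames, feat_dim)
    np.save(feat_dir / "mean.npy", stacked.mean(axis=0))
    np.save(feat_dir / "std.npy", stacked.std(axis=0))
    ```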
  - organize for training

    ```bash
    mkdir ../../data/listening_head/example/metadata
    cp ../../data/listening_head/train.csv ../../data/listening_head/example/metadata/data.csv
    cd ../vico
    ln -s ../../data/listening_head/example/ ./data
    ```
- train the baseline

  ```bash
  python -m torch.distributed.launch --nproc_per_node 4 --master_port 22345 train.py \
      --batch_size 4 \
      --time_size 90 \
      --max_epochs 500 \
      --lr 0.002 \
      --task listener \
      --output_path saved/baseline_listener
  ```

  For users who wish to run simple inference without training, we provide a pretrained checkpoint, which can be downloaded from Google Drive.
- inference

  ```bash
  python eval.py \
      --batch_size 4 \
      --output_path saved/baseline_listener_E500 \
      --resume saved/baseline_listener/checkpoints/Epoch_500.bin \
      --task listener
  ```
- change directory to render

  ```bash
  cd ../PIRender
  ```

- prepare the environment for PIRender following this
- download the trained weights of PIRender following this
- prepare vox lmdb

  ```bash
  python scripts/prepare_vox_lmdb.py \
      --path ../../data/listening_head/videos/ \
      --coeff_3dmm_path ../vico/saved/baseline_listener_E500/recon_coeffs/ \
      --out ../vico/saved/baseline_listener_E500/vox_lmdb/
  ```

- render to videos

  ```bash
  python -m torch.distributed.launch --nproc_per_node=1 --master_port 42345 inference_avarmerg.py \
      --config ./config/face_demo.yaml \
      --name face \
      --no_resume \
      --input ../vico/saved/baseline_listener_E500/vox_lmdb/ \
      --output_dir ./vox_result/baseline_listener_E500
  ```
- Change directory to AnyGPT

  ```bash
  git clone https://github.com/OpenMOSS/AnyGPT
  cd AnyGPT
  ```

- Set up the environment

  ```bash
  conda create --name AnyGPT python=3.9
  conda activate AnyGPT
  pip install -r requirements.txt
  ```
- Download pre-trained models

  - Check the AnyGPT-base weights in fnlp/AnyGPT-base
  - Check the AnyGPT-chat weights in fnlp/AnyGPT-chat
  - Check the SpeechTokenizer and Soundstorm weights in fnlp/AnyGPT-speech-modules
  - Check the SEED tokenizer weights in AILab-CVC/seed-tokenizer-2
The SpeechTokenizer is used for tokenizing and reconstructing speech, Soundstorm is responsible for completing paralinguistic information, and SEED-tokenizer is used for tokenizing images.
The model weights of unCLIP SD-UNet, which is used to reconstruct images, and of Encodec-32k, which is used to tokenize and reconstruct music, will be downloaded automatically.
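As a rough sketch of what the speech tokenization step looks like, assuming the `speechtokenizer` package exposes the `load_from_checkpoint`/`encode`/`decode` interface shown in its upstream README (verify against the installed version; the audio path is a placeholder, and the checkpoint paths are taken from the example inference command below):

```python
# A hedged sketch of tokenizing and reconstructing speech with SpeechTokenizer.
# API usage follows the upstream fnlp/SpeechTokenizer README; verify against the
# installed version. The speech file path is a placeholder.
import torch
import torchaudio
from speechtokenizer import SpeechTokenizer

model = SpeechTokenizer.load_from_checkpoint(
    "models/speechtokenizer/config.json", "models/speechtokenizer/ckpt.dev"
)
model.eval()

wav, sr = torchaudio.load("example_speech.wav")
wav = wav[:1, :]                                  # keep a single channel
if sr != model.sample_rate:                       # resample to the tokenizer's rate
    wav = torchaudio.functional.resample(wav, sr, model.sample_rate)

with torch.no_grad():
    codes = model.encode(wav.unsqueeze(0))        # (n_quantizers, batch, timesteps)
    recon = model.decode(codes)                   # waveform reconstruction
```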
- Generate Audio (Inference)

  ```bash
  python anygpt/src/infer/cli_infer_chat_model.py \
      --model-name-or-path 'path/to/model' \
      --image-tokenizer-path 'path/to/model' \
      --speech-tokenizer-path 'path/to/model' \
      --speech-tokenizer-config 'path/to/config' \
      --soundstorm-path 'path/to/model' \
      --output-dir "infer_output/chat"
  ```

  for example

  ```bash
  python anygpt/src/infer/cli_infer_chat_model.py \
      --model-name-or-path models/anygpt/chat \
      --image-tokenizer-path models/seed-tokenizer-2/seed_quantizer.pt \
      --speech-tokenizer-path models/speechtokenizer/ckpt.dev \
      --speech-tokenizer-config models/speechtokenizer/config.json \
      --soundstorm-path models/soundstorm/speechtokenizer_soundstorm_mls.pt \
      --output-dir "infer_output/chat"
  ```
- Change directory to Wav2Lip

  ```bash
  git clone https://github.com/Rudrabha/Wav2Lip
  cd Wav2Lip
  ```

- Set up the environment

  ```bash
  conda create -n wav2lip python=3.6
  conda activate wav2lip
  sudo apt-get install ffmpeg
  pip install -r requirements.txt
  ```
- Download pre-trained models

  - The face detection pre-trained model should be downloaded to `face_detection/detection/sfd/s3fd.pth`. Alternative link if the above does not work.

  | Model | Description | Link to the model |
  | --- | --- | --- |
  | Wav2Lip | Highly accurate lip-sync | Link |
  | Wav2Lip + GAN | Slightly inferior lip-sync, but better visual quality | Link |

  - Place all your checkpoints (`.pth` files) in `./checkpoints`.
- Lip-syncing videos using the pre-trained models (Inference)

  ```bash
  python inference.py --checkpoint_path ./checkpoints/<ckpt> --face <video.mp4> --audio <an-audio-source>
  ```
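  Before running inference, it can help to confirm that the face video and the driving audio cover similar time spans; a small hedged check (file names are placeholders for the rendered face video and the generated speech):

  ```python
  # A hedged sanity check: compare the durations of the face video and the
  # driving audio before lip-syncing. File names are placeholders.
  import cv2
  import soundfile as sf

  cap = cv2.VideoCapture("rendered_face.mp4")
  video_sec = cap.get(cv2.CAP_PROP_FRAME_COUNT) / cap.get(cv2.CAP_PROP_FPS)
  cap.release()

  audio_sec = sf.info("generated_speech.wav").duration
  print(f"video: {video_sec:.2f}s  audio: {audio_sec:.2f}s")
  ```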
- Change directory to ESRGAN

  ```bash
  git clone https://github.com/xinntao/ESRGAN
  cd ESRGAN
  ```

- Set up the environment

  ```bash
  conda create -n esrgan python=3.6
  pip install numpy opencv-python
  ```

  Dependencies:
  - Python 3
  - PyTorch >= 1.0 (CUDA version >= 7.5 if installing with CUDA. More details)

- Download pre-trained models from Google Drive or Baidu Drive and place them in `./models`. We provide two models with high perceptual quality and high PSNR performance (see model list).

- Inference

  ```bash
  python test.py
  ```

  The results are in the `./results` folder.
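  ESRGAN's `test.py` operates on image files rather than videos, so one hedged way to glue it into this pipeline is to dump the lip-synced video to frames, upscale them, and reassemble the result. The sketch below assumes the stock `test.py`, which reads images from `LR/` and writes upscaled ones to `results/`; input/output names are placeholders, and audio has to be re-muxed separately (for example with ffmpeg).

  ```python
  # A hedged sketch for running ESRGAN on a video: split the lip-synced clip into
  # frames for test.py, then rebuild a video from the upscaled frames.
  # File names are placeholders; audio must be re-muxed separately.
  import glob
  import os
  import cv2

  # 1) video -> frames in LR/
  os.makedirs("LR", exist_ok=True)
  cap = cv2.VideoCapture("wav2lip_output.mp4")
  fps = cap.get(cv2.CAP_PROP_FPS)
  idx = 0
  while True:
      ok, frame = cap.read()
      if not ok:
          break
      cv2.imwrite(f"LR/{idx:06d}.png", frame)
      idx += 1
  cap.release()

  # ... run `python test.py` here ...

  # 2) upscaled results/ -> video
  paths = sorted(glob.glob("results/*.png"))
  h, w = cv2.imread(paths[0]).shape[:2]
  writer = cv2.VideoWriter("upscaled.mp4", cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
  for p in paths:
      writer.write(cv2.imread(p))
  writer.release()
  ```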
- This work is heavily based on vico_challenge_baseline, Deep3DFaceRecon_pytorch, PIRender, AnyGPT, Wav2Lip, ESRGAN, and AvaMERG. Thanks to all the authors for their great work.