MEG: Multi-signal Empathy Generation

This project proposes a multimodal empathy modeling framework that learns from dyadic interactions: it encodes a speaker’s audio and facial dynamics (3DMM coefficients) to generate an empathizer’s context-aware, dynamic audiovisual responses that go beyond static reactions.

Demo video: Demo_video.mp4

Empathetic response generation

This code is composed of five components:

  • Deep3DFaceRecon_pytorch: used to extract 3DMM coefficients. Mainly from sicxu/Deep3DFaceRecon, modified following RenYurui/PIRender.
  • preprocess: scripts for making the dataset compatible with our method.
  • vico: our method, proposed in the paper Responsive Listening Head Generation: A Benchmark Dataset and Baseline (arXiv).
  • PIRender: renders 3DMM coefficients to video. Mainly from RenYurui/PIRender, with minor modifications.
  • evaluation: quantitative analysis of the generated results, including SSIM, CPBD, PSNR, FID, CSIM, etc.

For end-to-end inference, this repo may be useful.
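
For reference, these five components correspond to folders in the baseline repository cloned during data preparation (vico_challenge_baseline); the sketch below is an approximate layout, not an exhaustive listing.

    vico_challenge_baseline/
    ├── Deep3DFaceRecon_pytorch/   # 3DMM coefficient extraction
    ├── preprocess/                # dataset preparation scripts
    ├── vico/                      # training and inference for response generation
    ├── PIRender/                  # renders 3DMM coefficients to video
    └── evaluation/                # quantitative metrics (SSIM, CPBD, PSNR, FID, CSIM)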

Train Baseline

Data Preparation

  1. create a workspace

    mkdir vico-workspace
    cd vico-workspace
  2. download the dataset from this link and unzip listening_head.zip into data/

    unzip listening_head.zip -d data/
  3. reorganize the data/ folder to meet the requirements of PIRender

    mkdir -p data/listening_head/videos/test
    mv data/listening_head/videos/*.mp4 data/listening_head/videos/test
  4. clone baseline code

    git clone https://github.com/dc3ea9f/vico_challenge_baseline.git
  5. extract 3D coefficients from the videos (reference)

    1. change directory to vico_challenge_baseline/Deep3DFaceRecon_pytorch/

      cd vico_challenge_baseline/Deep3DFaceRecon_pytorch/
    2. prepare the environment following this

    3. prepare BFM/ and checkpoints/ following these instructions

    4. extract facial landmarks from videos

      python extract_kp_videos.py \
        --input_dir ../../data/listening_head/videos/ \
        --output_dir ../../data/listening_head/keypoints/ \
        --device_ids 0,1,2,3 \
        --workers 12
    5. extract coefficients for videos

      python face_recon_videos.py \
        --input_dir ../../data/listening_head/videos/ \
        --keypoint_dir ../../data/listening_head/keypoints/ \
        --output_dir ../../data/listening_head/recons/ \
        --inference_batch_size 128 \
        --name=face_recon_feat0.2_augment \
        --epoch=20 \
        --model facerecon
  6. extract audio features

    1. change directory to vico_challenge_baseline/preprocess

      cd ../preprocess
    2. install the Python packages librosa, torchaudio, and soundfile

    3. extract audio features

      python extract_audio_features.py \
        --input_audio_folder ../../data/listening_head/audios/ \
        --input_recons_folder ../../data/listening_head/recons/ \
        --output_folder ../../data/listening_head/example/features/audio_feats
  7. reorganize video features

    python rearrange_recon_coeffs.py \
      --input_folder ../../data/listening_head/recons/ \
      --output_folder ../../data/listening_head/example/features/video_feats
  8. organize data

    1. compute mean and std for features

      python statistics_mean_std.py ../../data/listening_head/example/features
    2. organize for training

      mkdir ../../data/listening_head/example/metadata
      cp ../../data/listening_head/train.csv ../../data/listening_head/example/metadata/data.csv
      cd ../vico
      ln -s ../../data/listening_head/example/ ./data
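
If the steps above completed successfully, the workspace data should look roughly like this (directory names are taken from the commands above; the mean/std statistics written by statistics_mean_std.py under example/features are omitted because their exact file names are not listed here):

    data/listening_head/
    ├── videos/test/            # *.mp4 clips
    ├── audios/
    ├── keypoints/              # from extract_kp_videos.py
    ├── recons/                 # raw 3DMM coefficients from face_recon_videos.py
    ├── train.csv
    └── example/                # symlinked as vico/data
        ├── features/
        │   ├── audio_feats/    # from extract_audio_features.py
        │   └── video_feats/    # from rearrange_recon_coeffs.py
        └── metadata/
            └── data.csv        # copy of train.csv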

Train and Inference

Empathizer Head Generation

  1. train baseline

    python -m torch.distributed.launch --nproc_per_node 4 --master_port 22345 train.py \
      --batch_size 4 \
      --time_size 90 \
      --max_epochs 500 \
      --lr 0.002 \
      --task listener \
      --output_path saved/baseline_listener
  • For users who wish to run simple inference without training, we provide a pretrained checkpoint, which can be downloaded from Google Drive.
  2. inference

    python eval.py \
      --batch_size 4 \
      --output_path saved/baseline_listener_E500 \
      --resume saved/baseline_listener/checkpoints/Epoch_500.bin \
      --task listener
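
  • If you use the pretrained checkpoint instead of training, pass its path to --resume; the checkpoint path below is only a placeholder for wherever you saved the downloaded file, and keeping the same --output_path keeps the rendering paths in the next section valid.

    python eval.py \
      --batch_size 4 \
      --output_path saved/baseline_listener_E500 \
      --resume path/to/downloaded_pretrained_checkpoint.bin \
      --task listener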

Render to Videos

  1. change directory to PIRender

    cd ../PIRender
  2. prepare the environment for PIRender following this

  3. download the trained weights of PIRender following this

Empathizer Head

  1. prepare vox lmdb

    python scripts/prepare_vox_lmdb.py \
      --path ../../data/listening_head/videos/ \
      --coeff_3dmm_path ../vico/saved/baseline_listener_E500/recon_coeffs/ \
      --out ../vico/saved/baseline_listener_E500/vox_lmdb/
  2. render to videos

    python -m torch.distributed.launch --nproc_per_node=1 --master_port 42345 inference_avarmerg.py \
      --config ./config/face_demo.yaml \
      --name face \
      --no_resume \
      --input ../vico/saved/baseline_listener_E500/vox_lmdb/ \
      --output_dir ./vox_result/baseline_listener_E500

Processing Empathic Multi-Signal

Generating Empathetic Audio

  1. Change directory to AnyGPT

    git clone https://github.com/OpenMOSS/AnyGPT
    cd AnyGPT
    
  2. Set up the environment

    conda create --name AnyGPT python=3.9
    conda activate AnyGPT
    pip install -r requirements.txt
    
  3. Download pre-trained models

    The SpeechTokenizer is used for tokenizing and reconstructing speech, Soundstorm is responsible for completing paralinguistic information, and SEED-tokenizer is used for tokenizing images.

    The model weights of unCLIP SD-UNet, which is used to reconstruct images, and Encodec-32k, which is used to tokenize and reconstruct music, are downloaded automatically.

  4. Generate Audio (Inference)

    python anygpt/src/infer/cli_infer_chat_model.py \
        --model-name-or-path 'path/to/model' \
        --image-tokenizer-path 'path/to/model' \
        --speech-tokenizer-path 'path/to/model' \
        --speech-tokenizer-config 'path/to/config' \
        --soundstorm-path 'path/to/model' \
        --output-dir "infer_output/chat"

    For example:

    python anygpt/src/infer/cli_infer_chat_model.py \
        --model-name-or-path models/anygpt/chat \
        --image-tokenizer-path models/seed-tokenizer-2/seed_quantizer.pt \
        --speech-tokenizer-path models/speechtokenizer/ckpt.dev \
        --speech-tokenizer-config models/speechtokenizer/config.json \
        --soundstorm-path models/soundstorm/speechtokenizer_soundstorm_mls.pt \
        --output-dir "infer_output/chat"
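
    The example above assumes the downloaded weights are arranged under a models/ folder along these lines (the paths simply mirror the example command; adjust them to wherever you store the checkpoints):

    models/
    ├── anygpt/chat/                            # AnyGPT chat model
    ├── seed-tokenizer-2/
    │   └── seed_quantizer.pt                   # SEED image tokenizer
    ├── speechtokenizer/
    │   ├── ckpt.dev                            # SpeechTokenizer checkpoint
    │   └── config.json
    └── soundstorm/
        └── speechtokenizer_soundstorm_mls.pt   # SoundStorm weights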

Alignment

  1. Change directory to Wav2Lip

    git clone https://github.com/Rudrabha/Wav2Lip
    cd Wav2Lip
    
  2. Set up the environment

    conda create -n wav2lip python=3.6
    conda activate wav2lip
    sudo apt-get install ffmpeg
    pip install -r requirements.txt
    
  3. Download pre-trained models

    • The face detection pre-trained model should be downloaded to face_detection/detection/sfd/s3fd.pth (use the alternative link if the above does not work).
    • Pre-trained checkpoints:

      Model         | Description                                            | Link to the model
      Wav2Lip       | Highly accurate lip-sync                               | Link
      Wav2Lip + GAN | Slightly inferior lip-sync, but better visual quality  | Link

    • Place all your checkpoints (.pth files) in ./checkpoints.
  4. Lip-syncing videos using the pre-trained models (Inference)

    python inference.py --checkpoint_path ./checkpoints/<ckpt> --face <video.mp4> --audio <an-audio-source>
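
    For example, assuming the Wav2Lip + GAN checkpoint was saved as wav2lip_gan.pth, the face video is the PIRender output from the rendering step, and the audio is the AnyGPT output (the file names below are placeholders; use your actual paths):

    python inference.py \
      --checkpoint_path ./checkpoints/wav2lip_gan.pth \
      --face path/to/vox_result/baseline_listener_E500/<clip>.mp4 \
      --audio path/to/infer_output/chat/<response>.wav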
    

Super-Resolution

  1. Change directory to ESRGAN

    git clone https://github.com/xinntao/ESRGAN
    cd ESRGAN
    
  2. Set up the environment

    conda create -n esrgan python=3.6
    conda activate esrgan
    pip install torch numpy opencv-python
    
  3. Download the pre-trained models from Google Drive or Baidu Drive and place them in ./models. Two models are provided: one with high perceptual quality and one with high PSNR performance (see the model list).

  4. Inference

    python test.py
    
  5. The results are written to the ./results folder.
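
To upscale the lip-synced video rather than individual images, one option is to split it into frames, run ESRGAN on them, and reassemble the result. This is a sketch under a few assumptions: the upstream test.py reads inputs from ./LR and writes *_rlt.png files to ./results (check the script before relying on these names), Wav2Lip wrote its default output to results/result_voice.mp4 in a sibling clone, and the source frame rate is 25 fps; adjust the paths and frame rate to your setup.

    # extract frames from the lip-synced video into ESRGAN's input folder
    mkdir -p LR
    ffmpeg -i ../Wav2Lip/results/result_voice.mp4 LR/frame_%05d.png

    # upscale all frames in ./LR
    python test.py

    # reassemble the upscaled frames and copy the original audio track back
    ffmpeg -framerate 25 -i results/frame_%05d_rlt.png -i ../Wav2Lip/results/result_voice.mp4 \
        -map 0:v -map 1:a -c:v libx264 -pix_fmt yuv420p -shortest upscaled.mp4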

Acknowledgments
