RealTalk: Realistic Emotion-Aware Lifelike Talking-Head Synthesis

Wenqing Wang, Yun Fu
Northeastern University
ICCVW 2025

Abstract

Emotion is a critical component of artificial social intelligence. However, while current methods excel in lip synchronization and image quality, they often fail to generate accurate and controllable emotional expressions while preserving the subject’s identity. To address this challenge, we introduce RealTalk, a novel framework for synthesizing emotional talking heads with high emotion accuracy, enhanced emotion controllability, and robust identity preservation. RealTalk employs a variational autoencoder (VAE) to generate 3D facial landmarks from driving audio; these landmarks are concatenated with emotion-label embeddings and passed through a ResNet-based landmark deformation model (LDM) to produce emotional landmarks. The emotional landmarks and facial blendshape coefficients jointly condition a novel tri-plane attention Neural Radiance Field (NeRF) to synthesize highly realistic emotional talking heads. Extensive experiments demonstrate that RealTalk outperforms existing methods in emotion accuracy, controllability, and identity preservation, advancing the development of socially intelligent AI systems.

Pipeline

Pipeline Image
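The sketch below illustrates the first two stages of the pipeline described in the abstract: an audio-conditioned VAE predicting neutral 3D landmarks, followed by a ResNet-style landmark deformation model (LDM) that consumes the landmarks concatenated with an emotion-label embedding. All module names, layer sizes, landmark counts, and interfaces are illustrative assumptions for exposition, not the authors' implementation; the tri-plane attention NeRF renderer is deliberately omitted.

  # Minimal PyTorch-style sketch of the two landmark stages described above.
  # Dimensions and module names are assumptions, not the released code.
  import torch
  import torch.nn as nn

  NUM_LANDMARKS = 68      # assumed 3D facial landmark count
  AUDIO_DIM = 80          # assumed per-frame audio feature size
  EMOTION_CLASSES = 8     # assumed number of discrete emotion labels
  LATENT_DIM = 64

  class AudioToLandmarkVAE(nn.Module):
      """VAE mapping per-frame audio features to neutral 3D landmarks."""
      def __init__(self):
          super().__init__()
          self.encoder = nn.Sequential(nn.Linear(AUDIO_DIM, 256), nn.ReLU(),
                                       nn.Linear(256, 2 * LATENT_DIM))
          self.decoder = nn.Sequential(nn.Linear(LATENT_DIM, 256), nn.ReLU(),
                                       nn.Linear(256, NUM_LANDMARKS * 3))

      def forward(self, audio_feat):
          mu, logvar = self.encoder(audio_feat).chunk(2, dim=-1)
          z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
          landmarks = self.decoder(z).view(-1, NUM_LANDMARKS, 3)
          return landmarks, mu, logvar

  class LandmarkDeformationModel(nn.Module):
      """Residual (ResNet-style) MLP deforming neutral landmarks toward an emotion."""
      def __init__(self):
          super().__init__()
          self.emotion_embed = nn.Embedding(EMOTION_CLASSES, 32)
          self.block = nn.Sequential(nn.Linear(NUM_LANDMARKS * 3 + 32, 256), nn.ReLU(),
                                     nn.Linear(256, NUM_LANDMARKS * 3))

      def forward(self, landmarks, emotion_label):
          flat = landmarks.flatten(1)                      # (B, 68*3)
          emo = self.emotion_embed(emotion_label)          # (B, 32)
          x = torch.cat([flat, emo], dim=-1)               # concatenate with emotion embedding
          delta = self.block(x).view_as(landmarks)         # residual deformation
          return landmarks + delta                         # emotional landmarks

  if __name__ == "__main__":
      vae, ldm = AudioToLandmarkVAE(), LandmarkDeformationModel()
      audio = torch.randn(2, AUDIO_DIM)                    # two audio frames
      emotion = torch.tensor([3, 5])                       # assumed emotion label indices
      neutral, _, _ = vae(audio)
      emotional = ldm(neutral, emotion)
      print(emotional.shape)                               # torch.Size([2, 68, 3])

In the full system, the emotional landmarks and facial blendshape coefficients jointly condition the tri-plane attention NeRF to render the final video frames.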

Quantitative Results

Baseline Comparisons

Quantitative Comparison Image

Qualitative Results

Baseline Comparisons

Qualitative Comparison Image

Comparison with out-of-domain video generation.

Out-of-Domain Comparison Image

User Study

User Study Image

Mean Opinion Score (MOS) results from 20 participants. Participants were instructed to rate each video on four criteria: 1) emotion accuracy; 2) lip synchronization; 3) video realism; and 4) video quality.
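For clarity, a small sketch of how per-criterion MOS values could be aggregated from the 20 participants' ratings. The 1-to-5 rating scale, variable names, and confidence-interval computation are assumptions for illustration, not details from the paper.

  # Aggregate per-criterion Mean Opinion Scores from participant ratings.
  # The 1-5 scale and random data are placeholders (assumptions).
  import numpy as np

  criteria = ["emotion accuracy", "lip synchronization", "video realism", "video quality"]
  # ratings[p, c] = rating from participant p on criterion c (shape: 20 x 4)
  ratings = np.random.randint(1, 6, size=(20, len(criteria)))

  mos = ratings.mean(axis=0)                                   # average over participants
  ci95 = 1.96 * ratings.std(axis=0, ddof=1) / np.sqrt(ratings.shape[0])
  for name, m, c in zip(criteria, mos, ci95):
      print(f"{name}: {m:.2f} +/- {c:.2f}")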

Video Demos

Neutral

Surprise

BibTeX

@article{wang2025realtalk,
  title={RealTalk: Realistic Emotion-Aware Lifelike Talking-Head Synthesis},
  author={Wang, Wenqing and Fu, Yun},
  journal={arXiv preprint arXiv:2508.12163},
  year={2025}
}