Audio-driven talking-head generation is a key technology for virtual human interaction and filmmaking. While recent advances have focused on improving image fidelity and lip synchronization, generating accurate emotional expressions remains underexplored. In this paper, we introduce EmoGene, a novel framework for synthesizing high-fidelity, audio-driven video portraits with accurate emotional expressions. Our approach employs a variational autoencoder (VAE)-based audio-to-motion module to generate facial landmarks, which are concatenated with an emotional embedding in a motion-to-emotion module to produce emotional landmarks. These landmarks drive a Neural Radiance Fields (NeRF)-based emotion-to-video module that renders realistic emotional talking-head videos. Additionally, we propose a pose sampling method to generate natural idle-state (non-speaking) videos for silent audio inputs. Extensive experiments demonstrate that EmoGene outperforms previous methods in generating high-fidelity emotional talking-head videos.
1) The Audio-to-Motion module converts input audio features into neutral facial landmarks. 2) The Motion-to-Emotion module transforms these landmarks into emotional landmarks based on the emotion label. 3) The Emotion-to-Video module generates the emotional talking-head video conditioned on the emotional landmarks.
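The data flow between the three modules can be sketched as below. This is only an illustration of how the stages chain together, not the authors' implementation: the landmark count, the emotion labels, the embedding values, and every function body are hypothetical placeholders standing in for the VAE, the motion-to-emotion network, and the NeRF renderer.

```python
from typing import Dict, List, Sequence

Frame = List[List[float]]  # landmarks for one frame: rows of (x, y, z)

NUM_LANDMARKS = 68  # assumed; the source does not state a landmark count

# Hypothetical emotion labels and embeddings, for illustration only.
EMOTION_EMBEDDINGS: Dict[str, List[float]] = {
    "neutral": [0.0, 0.0, 0.0],
    "happy": [0.1, 0.2, 0.0],
    "sad": [-0.1, -0.2, 0.0],
}


def audio_to_motion(audio_features: Sequence[Sequence[float]]) -> List[Frame]:
    """Stage 1 stand-in: one set of neutral landmarks per audio frame."""
    frames = []
    for feat in audio_features:
        level = sum(feat) / len(feat)  # placeholder for the VAE-based generator
        frames.append([[level, level, 0.0] for _ in range(NUM_LANDMARKS)])
    return frames


def motion_to_emotion(neutral: List[Frame], emotion: str) -> List[Frame]:
    """Stage 2 stand-in: condition neutral landmarks on an emotion embedding."""
    emb = EMOTION_EMBEDDINGS[emotion]
    out = []
    for frame in neutral:
        # Concatenate landmarks and embedding; a learned network would consume this.
        flat = [v for point in frame for v in point] + emb
        coords, e = flat[:-len(emb)], flat[-len(emb):]
        # Placeholder "network": shift each coordinate by the matching embedding entry.
        shifted = [c + e[i % 3] for i, c in enumerate(coords)]
        out.append([shifted[i:i + 3] for i in range(0, len(shifted), 3)])
    return out


def emotion_to_video(landmarks: List[Frame]) -> List[dict]:
    """Stage 3 stand-in: one output frame per landmark set (no NeRF rendering here)."""
    return [{"index": i, "landmarks": frame} for i, frame in enumerate(landmarks)]


def generate(audio_features: Sequence[Sequence[float]], emotion: str) -> List[dict]:
    """Chain the three stages: audio -> neutral landmarks -> emotional landmarks -> frames."""
    neutral = audio_to_motion(audio_features)
    emotional = motion_to_emotion(neutral, emotion)
    return emotion_to_video(emotional)


video = generate([[0.2, 0.4], [0.0, 0.0], [1.0, 0.5]], "happy")
print(len(video), len(video[0]["landmarks"]))
```

Keeping each stage behind its own function mirrors the modular description: the emotion label only enters at the second stage, so the same neutral landmarks could in principle be reused for different emotions.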
Mean opinion score (MOS) results from 20 participants. Participants were instructed to rate each video on four criteria: 1) emotional accuracy; 2) lip synchronization; 3) video realism; and 4) video quality.
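A mean opinion score is the arithmetic mean of the participants' ratings, computed per criterion. The sketch below shows that aggregation under stated assumptions: the 1 to 5 rating scale and the three toy ratings are hypothetical and are not the paper's data, and `mean_opinion_scores` is an illustrative helper name.

```python
from statistics import mean
from typing import Dict, List

CRITERIA = (
    "emotional accuracy",
    "lip synchronization",
    "video realism",
    "video quality",
)


def mean_opinion_scores(ratings: List[Dict[str, float]]) -> Dict[str, float]:
    """Average the ratings for one method, separately for each criterion.

    Each entry in `ratings` is one participant's scores for one video,
    keyed by criterion name.
    """
    return {c: mean(r[c] for r in ratings) for c in CRITERIA}


# Toy ratings on an assumed 1-5 scale (made up for illustration).
toy_ratings = [
    {"emotional accuracy": 4.0, "lip synchronization": 3.0, "video realism": 5.0, "video quality": 4.0},
    {"emotional accuracy": 5.0, "lip synchronization": 4.0, "video realism": 4.0, "video quality": 4.0},
    {"emotional accuracy": 3.0, "lip synchronization": 5.0, "video realism": 3.0, "video quality": 4.0},
]

mos = mean_opinion_scores(toy_ratings)
for criterion in CRITERIA:
    print(f"{criterion}: {mos[criterion]:.2f}")
```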
@INPROCEEDINGS{11099460,
author={Wang, Wenqing and Fu, Yun},
booktitle={2025 IEEE 19th International Conference on Automatic Face and Gesture Recognition (FG)},
title={EmoGene: Audio-Driven Emotional 3D Talking-Head Generation},
year={2025},
pages={1-10},
keywords={Deformable models;Accuracy;Three-dimensional displays;Lips;Gesture recognition;Sampling methods;Neural radiance field;Rendering (computer graphics);Synchronization;Videos},
doi={10.1109/FG61629.2025.11099460}}