MODA: Mapping-Once Audio-driven Portrait Animation with Dual Attentions

Yunfei Liu1, Lijian Lin1, Fei Yu2, Changyin Zhou2, Yu Li1
1International Digital Economy Academy (IDEA), Shenzhen, China,   2Vistring Inc., Hangzhou, China

We propose a mapping-once system with dual-attention for multimodal and high-fidelity portrait video animation.

We propose a unified system for multi-person, diverse, and high-fidelity talking portrait generation. Extensive evaluations demonstrate that the proposed system produces more natural and realistic video portraits than previous methods.


Given the subject figure and arbitrary audio, the proposed system generates the audio-driven portrait video.

The proposed method is a three-stage system:

  1. Mapping-Once network with Dual Attentions (MODA) generates a talking representation from the given audio. In MODA, we design a dual-attention module to encode accurate mouth movements and diverse modalities.
  2. Facial composer network generates dense and detailed face landmarks.
  3. Temporal-guided renderer synthesizes stable videos.
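
The data flow through the three stages can be sketched as a simple function composition. This is a minimal illustration only: every function name, data shape, and placeholder computation below is an assumption made for the example and does not reflect the authors' implementation. The four motion keys follow the representation described under Network architecture.

```python
from typing import Dict, List

# One frame of motion, keyed by motion type (hypothetical layout).
Frame = Dict[str, List[float]]


def moda_stage(audio: List[float], subject: List[float]) -> List[Frame]:
    """Stage 1 (placeholder): map audio + subject condition to per-frame motion."""
    bias = sum(subject) / max(len(subject), 1)
    return [
        {"lip": [a + bias], "eye": [0.0], "head_pose": [0.0, 0.0, 0.0], "torso": [0.0]}
        for a in audio
    ]


def facial_composer_stage(motion: List[Frame], dense_points: int = 8) -> List[List[float]]:
    """Stage 2 (placeholder): expand each frame's motion into a dense landmark vector."""
    dense = []
    for frame in motion:
        values = frame["lip"] + frame["eye"] + frame["head_pose"] + frame["torso"]
        # Tile the sparse values until the requested landmark count is reached.
        dense.append([values[i % len(values)] for i in range(dense_points)])
    return dense


def renderer_stage(landmarks: List[List[float]], momentum: float = 0.5) -> List[List[float]]:
    """Stage 3 (placeholder): emit one 'image' per frame, blended with the previous one."""
    frames: List[List[float]] = []
    previous = None
    for current in landmarks:
        if previous is None:
            image = list(current)
        else:
            # Temporal guidance is imitated here by simple exponential smoothing.
            image = [momentum * p + (1.0 - momentum) * c for p, c in zip(previous, current)]
        frames.append(image)
        previous = image
    return frames


def animate(audio: List[float], subject: List[float]) -> List[List[float]]:
    """Run the three stages in order: audio -> motion -> landmarks -> frames."""
    motion = moda_stage(audio, subject)
    landmarks = facial_composer_stage(motion)
    return renderer_stage(landmarks)


video = animate([0.1, 0.2, 0.3], [1.0, 0.0])
```

Each stage consumes only the output of the previous one, so in a sketch like this any stage can be swapped out without touching the others.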

Network architecture

Given the driving audio A and subject condition S, MODA maps them into a talking representation R (consisting of lip movement, eye blinking, head pose, and torso motion) in a single forward pass.

Architecture of the MODA network. Given an audio clip and a subject condition, MODA generates four types of motion in a single forward pass.

The proposed DualAttn simultaneously learns a one-to-one mapping (SpecAttn) for lip-sync and a one-to-many mapping (ProbAttn) for other movements.
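
The difference between a one-to-one and a one-to-many mapping can be illustrated with a small sketch. It assumes, purely for illustration, that the deterministic branch is plain scaled dot-product attention and that the stochastic branch adds sampled Gaussian noise to the same attention output. The function names, the noise model, and the `sigma` parameter are hypothetical and are not taken from the paper.

```python
import math
import random
from typing import Dict, List

Matrix = List[List[float]]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Plain scaled dot-product attention over sequences of vectors."""
    dim = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        out.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return out


def spec_attn(audio_feats: Matrix, subject_feats: Matrix) -> Matrix:
    """Deterministic branch: the same input always yields the same output."""
    return attention(audio_feats, subject_feats, subject_feats)


def prob_attn(
    audio_feats: Matrix, subject_feats: Matrix, rng: random.Random, sigma: float = 0.1
) -> Matrix:
    """Stochastic branch: sampled noise lets one input map to many outputs."""
    attended = attention(audio_feats, subject_feats, subject_feats)
    return [[x + rng.gauss(0.0, sigma) for x in row] for row in attended]


def dual_attn(audio_feats: Matrix, subject_feats: Matrix, rng: random.Random) -> Dict[str, Matrix]:
    """One call returns both branches: lip features and other-motion features."""
    return {
        "lip": spec_attn(audio_feats, subject_feats),
        "other": prob_attn(audio_feats, subject_feats, rng),
    }


audio = [[1.0, 0.0], [0.0, 1.0]]
subject = [[0.5, 0.5], [1.0, -1.0]]
run_a = dual_attn(audio, subject, random.Random(0))
run_b = dual_attn(audio, subject, random.Random(1))
```

With different random seeds, the `lip` outputs of the two runs are identical while the `other` outputs differ, which mirrors the intent of keeping lip-sync deterministic and allowing variation in the remaining movements.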

Numerical results

Comparisons with state-of-the-art methods. † denotes our results generated at 256x256 resolution with a small renderer. Best results are highlighted in bold. Underlined numbers denote the second-best results.

Visual results. Visual comparison of 5 state-of-the-art methods.


How to cite


  title={MODA: Mapping-Once Audio-driven Portrait Animation with Dual Attentions},
  author={Liu, Yunfei and Lin, Lijian and Yu, Fei and Zhou, Changyin and Li, Yu},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},