Audio-Driven Co-Speech Gesture Video Generation

Xian Liu¹, Qianyi Wu², Hang Zhou¹, Yuanqi Du³, Wayne Wu⁴, Dahua Lin¹,⁴, Ziwei Liu⁵

¹Multimedia Laboratory, The Chinese University of Hong Kong ²Monash University
³Cornell University ⁴Shanghai AI Laboratory ⁵S-Lab, Nanyang Technological University

[Paper] [Code] [Dataset]

Overview

This paper formally defines and studies the challenging problem of audio-driven co-speech gesture video generation in the image domain. Our key insight is that co-speech gestures can be decomposed into common motion patterns and subtle rhythmic dynamics. To this end, we propose a novel framework, ANGIE, to effectively capture both the reusable co-speech gesture patterns and the fine-grained rhythmic movements. To achieve high-fidelity image sequence generation, we leverage an unsupervised motion representation instead of a structural human body prior (e.g., 2D skeletons). Specifically, 1) we propose a vector quantized motion extractor (VQ-Motion Extractor) to summarize common co-speech gesture patterns from the implicit motion representation into codebooks; 2) we devise a co-speech gesture GPT with motion refinement (Co-Speech GPT) to complement the subtle prosodic motion details. Extensive experiments demonstrate that our framework renders realistic and vivid co-speech gesture videos.
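To make the codebook idea concrete, the snippet below is a minimal sketch of generic nearest-neighbour vector quantization, the general technique that a vector quantized extractor builds on. It is not the ANGIE implementation: the function names (`quantize`, `dequantize`), the 2-D toy features, and the three-entry codebook are all hypothetical and chosen only for illustration. In practice the codebook would be learned and the features would come from a motion representation rather than being written by hand.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each feature vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(f, codebook[k]))
        for f in features
    ]


def dequantize(indices: Sequence[int], codebook: Sequence[Vector]) -> List[List[float]]:
    """Replace each code index with the codebook vector it refers to."""
    return [list(codebook[i]) for i in indices]


# Hypothetical toy data (assumption): three "motion pattern" codes in 2-D.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

# Hypothetical per-frame features (assumption): noisy points near the codes.
features = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8], [1.1, 0.1]]

codes = quantize(features, codebook)
reconstructed = dequantize(codes, codebook)

print(codes)          # [0, 1, 2, 1]
print(reconstructed)  # [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
```

In this sketch the second and fourth frames map to the same code, which loosely mirrors the notion of a reusable pattern: different inputs that are close in feature space share one discrete codebook entry, and the residual detail lost by quantization is what a separate refinement step would need to recover.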

Results

Problem setting illustration. Given an image and speech audio, we generate an aligned speaker image sequence.

Qualitative image sequence results on the PATS Image dataset.

We validate that the codebooks contain meaningful motion patterns. Driving the same image with different VQ codes leads to different gestures (left); driving different images with the same VQ code produces the same motion (right).

Demo Video

We present the demo video for better visualization of our qualitative results.

BibTeX

@article{liu2022audio,
    title={Audio-Driven Co-Speech Gesture Video Generation},
    author={Liu, Xian and Wu, Qianyi and Zhou, Hang and Du, Yuanqi and Wu, Wayne and Lin, Dahua and Liu, Ziwei},
    journal={Advances in Neural Information Processing Systems},
    year={2022}
}

Related Work

Shiry Ginosar et al. Learning Individual Styles of Conversational Gesture. CVPR, 2019.
Comment: The first work to use a deep learning framework with an adversarial training scheme (GAN) for co-speech gesture generation.

Youngwoo Yoon et al. Speech Gesture Generation from the Trimodal Context of Text, Audio, and Speaker Identity. SIGGRAPH Asia, 2020.
Comment: Considers three input modalities (text, audio, and speaker identity) as stimuli for co-speech gesture generation.

Xian Liu et al. Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation. CVPR, 2022.
Comment: Explores hierarchical cross-modal associations at multiple granularities between multi-level audio features and human skeletons.