Audio-Driven Co-Speech Gesture Video Generation
Advances in Neural Information Processing Systems (NeurIPS), 2022.
Xian Liu1, Qianyi Wu2, Hang Zhou1, Yuanqi Du3, Wayne Wu4, Dahua Lin1,4, Ziwei Liu5
1Multimedia Laboratory, The Chinese University of Hong Kong    2Monash University
3Cornell University    4Shanghai AI Laboratory    5S-Lab, Nanyang Technological University
This paper formally defines and studies the challenging problem of audio-driven co-speech gesture video generation in the image domain. Our key insight is that co-speech gestures can be decomposed into common motion patterns and subtle rhythmic dynamics. To this end, we propose a novel framework, ANGIE, that captures both the reusable co-speech gesture patterns and the fine-grained rhythmic movements. To achieve high-fidelity image sequence generation, we leverage an unsupervised motion representation instead of a structural human body prior (e.g., 2D skeletons). Specifically, 1) we propose a vector quantized motion extractor (VQ-Motion Extractor) that summarizes common co-speech gesture patterns from the implicit motion representation into codebooks; 2) we devise a co-speech gesture GPT with motion refinement (Co-Speech GPT) to complement the subtle prosodic motion details. Extensive experiments demonstrate that our framework renders realistic and vivid co-speech gesture videos.
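The codebook idea in the abstract can be illustrated with a generic vector quantization lookup. The sketch below is not the ANGIE implementation: the 2-D toy vectors, the fixed three-entry codebook, and the names `quantize` and `squared_distance` are assumptions made for illustration only. It shows how each motion feature is replaced by its nearest codebook entry (the shared "common pattern"), and how the leftover residual holds the fine detail that a codebook alone discards.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    features: Sequence[Vector], codebook: Sequence[Vector]
) -> Tuple[List[int], List[List[float]]]:
    """Map each feature to the index and value of its nearest codebook entry."""
    indices: List[int] = []
    quantized: List[List[float]] = []
    for feat in features:
        # Nearest-neighbour search over the codebook entries.
        idx = min(
            range(len(codebook)),
            key=lambda k: squared_distance(feat, codebook[k]),
        )
        indices.append(idx)
        quantized.append(list(codebook[idx]))
    return indices, quantized


# Hypothetical toy data: three "gesture pattern" codes and four motion features.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
features = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8], [0.05, 0.0]]

indices, quantized = quantize(features, codebook)

# Residual = original feature minus its quantized version; this is the
# fine-grained part that the discrete code does not capture.
residuals = [
    [f - q for f, q in zip(feat, quant)]
    for feat, quant in zip(features, quantized)
]

print(indices)  # [0, 1, 2, 0]
```

In a learned system the codebook entries would be trained rather than fixed, and the features would come from a motion encoder; here both are hard-coded so the lookup itself is easy to follow.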
Problem setting illustration. Given an image and speech audio, we generate an aligned speaker image sequence.
Qualitative image sequence results on the PATS Image dataset.
We validate that the codebooks contain meaningful motion patterns. Driving the same image with different VQ codes leads to different gestures (left); driving different images with the same VQ code produces the same motion (right).
    title={Audio-Driven Co-Speech Gesture Video Generation},
    author={Liu, Xian and Wu, Qianyi and Zhou, Hang and Du, Yuanqi and Wu, Wayne and Lin, Dahua and Liu, Ziwei},
    journal={Advances in Neural Information Processing Systems},
Related Work
Shiry Ginosar et al. Learning Individual Styles of Conversational Gesture. CVPR, 2019.
Comment: The first work to use a deep learning framework with an adversarial training scheme (GAN) for the task of co-speech gesture generation.
Xian Liu et al. Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation. CVPR, 2022.
Comment: Explores hierarchical cross-modal associations at multiple granularities between multi-level audio features and human skeletons.