HumanGaussian: Text-Driven 3D Human Generation with Gaussian Splatting
Xian Liu1    Xiaohang Zhan2    Jiaxiang Tang3   
Ying Shan2    Gang Zeng3    Dahua Lin1    Xihui Liu4    Ziwei Liu5
1CUHK    2Tencent AI Lab    3PKU    4HKU    5NTU
Avatar Gallery
Demo Video
Abstract
Realistic 3D human generation from text prompts is a desirable yet challenging task. Existing methods optimize 3D representations like mesh or neural fields via score distillation sampling (SDS), which suffers from inadequate fine details or excessive training time. In this paper, we propose an efficient yet effective framework, HumanGaussian, that generates high-quality 3D humans with fine-grained geometry and realistic appearance. Our key insight is that 3D Gaussian Splatting is an efficient renderer with periodic Gaussian shrinkage or growing, where such adaptive density control can be naturally guided by intrinsic human structures. Specifically, 1) we first propose a Structure-Aware SDS that simultaneously optimizes human appearance and geometry. The multi-modal score function from both RGB and depth space is leveraged to distill the Gaussian densification and pruning process. 2) Moreover, we devise an Annealed Negative Prompt Guidance by decomposing SDS into a noisier generative score and a cleaner classifier score, which well addresses the over-saturation issue. The floating artifacts are further eliminated based on Gaussian size in a prune-only phase to enhance generation smoothness. Extensive experiments demonstrate the superior efficiency and competitive quality of our framework, rendering vivid 3D humans under diverse scenarios.
We propose HumanGaussian, an efficient yet effective framework that generates high-quality 3D humans with fine-grained geometry and realistic appearance. Our method adapts 3D Gaussian Splatting to text-driven 3D human generation with novel designs.
Framework Overview
Overview of the proposed HumanGaussian Framework. We generate high-quality 3D humans from text prompts with the neural representation of 3D Gaussian Splatting (3DGS). In Structure-Aware SDS, we start from the SMPL-X prior, densely sampling Gaussians on the human mesh surface as initial center positions. Then, a Texture-Structure Joint Model is trained to simultaneously denoise the image x and depth d conditioned on pose skeleton p. Based on this, we design a dual-branch SDS to jointly optimize human appearance and geometry, where the 3DGS density is adaptively controlled by distilling from both the RGB and depth space. In Annealed Negative Prompt Guidance, we use the cleaner classifier score with an annealed negative score to regularize the high-variance stochastic SDS gradient. Floating artifacts are further eliminated based on Gaussian size in a prune-only phase to enhance generation smoothness.
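To make the guidance design above concrete, here is a minimal numpy sketch of the score decomposition. With classifier-free guidance written as ε̂ = ε_pos + (w − 1)(ε_pos − ε_neg), the SDS residual ε̂ − ε splits into a noisy generative score (ε_pos − ε) and a cleaner classifier score (ε_pos − ε_neg); the negative-prompt weight is then annealed. The function names, the weight `neg_w`, and the linear annealing schedule are our illustrative choices, not the paper's exact formulation.

```python
import numpy as np

def decomposed_sds_score(eps_pos, eps_neg, eps, w=7.5, neg_w=1.0):
    """Sketch of the decomposed SDS score (weight names are ours).

    eps_pos: noise prediction conditioned on the positive prompt
    eps_neg: noise prediction conditioned on the negative prompt
    eps:     the noise actually injected at this timestep
    """
    generative = eps_pos - eps        # noisy generative score, high variance
    classifier = eps_pos - eps_neg    # cleaner, prompt-aligned classifier score
    return generative + neg_w * (w - 1.0) * classifier

def annealed_neg_weight(step, total_steps, w_max=1.0):
    # Linearly decay the negative-prompt weight over optimization
    # (an illustrative schedule, not necessarily the paper's).
    return w_max * (1.0 - step / total_steps)
```

Setting `neg_w=0` recovers the plain generative score, which makes the role of the annealed classifier term easy to inspect in isolation.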
Qualitative Comparisons
Visual Comparisons with Text-to-3D and 3D Human Models. We compare with recent state-of-the-art baselines on five different prompts, each shown from two camera views. Note that textural unrealism and blurriness are highlighted with yellow arrows; geometric artifacts are highlighted with green rectangles. Please zoom in for the best view and refer to the demo video for more results.
Ablation Study
Ablation Studies on HumanGaussian Module Design. We present generation results of the human frontal view under five ablation settings for clearer visualization and comparison: (A) baseline; (B) +SMPL-X, Pose-Cond.; (C) +Neg. Guidance, CFG=7.5; (D) +Dual-Branch SDS; (E) +Size-based Prune. The detailed ablation settings and result analysis are elaborated in Sec. 4.3.
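The size-based pruning of setting (E) can be sketched as a simple mask over the Gaussians: floating artifacts tend to grow into large, semi-transparent blobs, so Gaussians whose scale exceeds a threshold (or whose opacity falls below one) are dropped in the prune-only phase. The thresholds and function name below are illustrative assumptions, not the paper's exact values.

```python
import numpy as np

def prune_by_size(scales, opacities, max_scale=0.008, min_opacity=0.05):
    """Prune-only phase sketch (thresholds are illustrative, not the paper's).

    scales:    (N, 3) per-axis Gaussian scales
    opacities: (N,)   Gaussian opacities
    Returns a boolean keep-mask over the N Gaussians.
    """
    too_big = scales.max(axis=1) > max_scale   # floating blobs tend to be large
    too_faint = opacities < min_opacity        # near-transparent Gaussians
    return ~(too_big | too_faint)
```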
Zero-Shot Animation
Although the HumanGaussian framework is trained on a single body pose, the resulting avatar can be animated with unseen pose sequences in a zero-shot manner, i.e., a sequence of SMPL-X pose parameters can drive the pre-trained avatar without further finetuning.
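Because each Gaussian is anchored on the SMPL-X surface, such zero-shot animation can be sketched as linear blend skinning of the Gaussian centers: each center inherits skinning weights (e.g. from its nearest SMPL-X vertex, an assumption on our part) and is transformed by the blended per-joint rigid transforms of the target pose. This is a minimal sketch of the idea, not the exact pipeline.

```python
import numpy as np

def animate_gaussian_centers(centers, skin_weights, joint_transforms):
    """Drive canonical Gaussian centers with SMPL-X-style linear blend skinning.

    centers:          (N, 3) canonical-pose Gaussian centers
    skin_weights:     (N, J) per-Gaussian skinning weights (rows sum to 1),
                      e.g. copied from the nearest SMPL-X vertex (our assumption)
    joint_transforms: (J, 4, 4) rigid transforms for the target pose
    """
    homo = np.concatenate([centers, np.ones((len(centers), 1))], axis=1)  # (N, 4)
    # Blend the per-joint transforms by the skinning weights: (N, 4, 4)
    blended = np.einsum('nj,jab->nab', skin_weights, joint_transforms)
    posed = np.einsum('nab,nb->na', blended, homo)                        # (N, 4)
    return posed[:, :3]
```

Applying a sequence of SMPL-X poses amounts to calling this per frame with the corresponding joint transforms, with no further optimization of the avatar.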
BibTeX
@article{liu2023humangaussian,
  title={HumanGaussian: Text-Driven 3D Human Generation with Gaussian Splatting},
  author={Liu, Xian and Zhan, Xiaohang and Tang, Jiaxiang and Shan, Ying and Zeng, Gang and Lin, Dahua and Liu, Xihui and Liu, Ziwei},
  journal={arXiv preprint arXiv:2311.17061},
  year={2023}
}
Related Work
Jiaxiang Tang et al. DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation. arXiv preprint arXiv:2309.16653, 2023.
Comment: The first work to adapt Gaussian Splatting to the text-to-3D generation problem.
Xian Liu et al. HyperHuman: Hyper-Realistic Human Generation with Latent Structural Diffusion. arXiv preprint arXiv:2310.08579, 2023.
Comment: An in-the-wild human generation foundation model that simultaneously denoises the RGB, depth, and surface-normal to capture the joint distribution in a unified framework.