Abstract

We propose GaussianTalker, a novel framework for real-time generation of pose-controllable talking heads. It leverages the fast rendering capabilities of 3D Gaussian Splatting (3DGS) while addressing the challenges of directly controlling 3DGS with speech audio. GaussianTalker constructs a canonical 3DGS representation of the head and deforms it in sync with the audio. A key insight is to encode the 3D Gaussian attributes into a shared implicit feature representation, where they are merged with audio features to manipulate each Gaussian attribute. This design exploits spatially aware features and enforces interactions between neighboring points. The feature embeddings are then fed to a spatial-audio attention module, which predicts frame-wise offsets for the attributes of each Gaussian. This attention-based design is more stable than previous concatenation or multiplication approaches when manipulating the numerous Gaussians and their intricate parameters. Experimental results showcase GaussianTalker's superiority in facial fidelity, lip synchronization accuracy, and rendering speed compared to previous methods. Specifically, GaussianTalker achieves a remarkable rendering speed of 120 FPS, surpassing previous benchmarks.
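
To make this pipeline concrete, below is a minimal PyTorch-style sketch, under our own assumptions rather than the released implementation, of how a canonical set of Gaussian attributes could be deformed per frame by audio-driven offsets; the class and attribute names are illustrative.

    import torch

    class CanonicalGaussians(torch.nn.Module):
        """Learnable canonical 3DGS attributes for a single identity (illustrative)."""
        def __init__(self, num_points: int):
            super().__init__()
            self.xyz = torch.nn.Parameter(torch.randn(num_points, 3))       # positions
            self.rotation = torch.nn.Parameter(torch.randn(num_points, 4))  # quaternions
            self.scale = torch.nn.Parameter(torch.randn(num_points, 3))     # log scales
            self.opacity = torch.nn.Parameter(torch.zeros(num_points, 1))   # logit opacities

        def deform(self, offsets: dict) -> dict:
            """Apply frame-wise offsets predicted from the audio-conditioned attention."""
            return {
                "xyz": self.xyz + offsets["xyz"],
                "rotation": self.rotation + offsets["rotation"],
                "scale": self.scale + offsets["scale"],
                "opacity": self.opacity,  # kept fixed in this sketch
            }

    # Each video frame: predict offsets from the audio feature, deform the canonical
    # Gaussians, then splat them with a standard 3DGS rasterizer (not shown here).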

Overall Framework

GaussianTalker utilizes a multi-resolution triplane to capture features at different scales that depict a canonical 3D head. These features are fed into a spatial-audio attention module along with the audio feature to predict per-frame deformations, enabling fast and reliable talking head synthesis.
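
As an illustration of this pipeline, the following sketch (our own assumptions, not the authors' released code) queries per-Gaussian features from a multi-resolution triplane by projecting canonical positions onto three axis-aligned planes, and fuses them with an audio feature via cross-attention to predict per-frame attribute offsets. Plane resolutions, feature widths, and the attention configuration are illustrative choices.

    import torch
    import torch.nn.functional as F

    class TriplaneField(torch.nn.Module):
        """Multi-resolution triplane holding canonical head features (illustrative)."""
        def __init__(self, resolutions=(32, 64, 128), channels=32):
            super().__init__()
            # Three axis-aligned planes (XY, XZ, YZ) per resolution level.
            self.planes = torch.nn.ParameterList([
                torch.nn.Parameter(torch.randn(3, channels, r, r) * 0.01)
                for r in resolutions
            ])

        def forward(self, xyz: torch.Tensor) -> torch.Tensor:
            # xyz: (N, 3) canonical Gaussian positions, normalized to [-1, 1]
            coords = [xyz[:, [0, 1]], xyz[:, [0, 2]], xyz[:, [1, 2]]]  # plane projections
            feats = []
            for level_planes in self.planes:
                level = []
                for plane, uv in zip(level_planes, coords):
                    grid = uv.view(1, -1, 1, 2)                         # (1, N, 1, 2)
                    sampled = F.grid_sample(plane.unsqueeze(0), grid, align_corners=True)
                    level.append(sampled.view(plane.shape[0], -1).t())  # (N, C)
                feats.append(torch.stack(level).sum(0))                 # fuse the three planes
            return torch.cat(feats, dim=-1)                             # (N, C * num_levels)

    class SpatialAudioAttention(torch.nn.Module):
        """Cross-attention between per-Gaussian spatial features and audio tokens."""
        def __init__(self, spatial_dim: int, audio_dim: int, hidden: int = 128):
            super().__init__()
            self.to_q = torch.nn.Linear(spatial_dim, hidden)
            self.to_kv = torch.nn.Linear(audio_dim, 2 * hidden)
            self.attn = torch.nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)
            self.offset_head = torch.nn.Linear(hidden, 3 + 4 + 3)  # Δxyz, Δrotation, Δscale

        def forward(self, spatial_feat: torch.Tensor, audio_feat: torch.Tensor) -> torch.Tensor:
            # spatial_feat: (N, spatial_dim) triplane features, one per Gaussian
            # audio_feat:   (T, audio_dim) audio tokens for the current frame window
            q = self.to_q(spatial_feat).unsqueeze(0)                  # (1, N, hidden)
            k, v = self.to_kv(audio_feat).unsqueeze(0).chunk(2, -1)   # (1, T, hidden) each
            fused, _ = self.attn(q, k, v)
            return self.offset_head(fused.squeeze(0))                 # (N, 10) per-Gaussian offsets

    # Example usage (shapes only): with 3 levels of 32-channel planes, spatial_dim = 96.
    # field = TriplaneField(); attention = SpatialAudioAttention(spatial_dim=96, audio_dim=768)
    # offsets = attention(field(canonical_xyz), audio_tokens)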

Comparison with Baseline Models

Fidelity, lip synchronization, and inference time comparison between existing 3D talking face synthesis models and ours. Our method, GaussianTalker, achieves results on par with or better than previous methods at a much higher FPS. Note that we also include GaussianTalker∗, a more efficient and faster variant. The size of each bubble represents the inference time per frame of each method.

Qualitative Experiments

Self-Driven Results

Cross-Driven Results

Importance of our Spatial-Audio Attention Module

Speech-related Motion Disentanglement

Our spatial-audio attention module effectively disentangles speech-related motion by conditioning unrelated facial motion and scene variations on other input conditions. This separates the speech-related motion from the rest of the video, allowing the model to better capture the correspondence between the input speech audio and the resulting facial motion.

Stabilization of Scene Variations

By conditioning the spatial-audio attention module on the facial viewpoint, we effectively control scene variations that do not correlate with the speech audio, such as hair motion and skin illumination at certain angles.
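
As a hypothetical illustration of how such conditioning could enter the module (our own assumption about the mechanism, not the released code), non-speech conditions such as an embedded camera viewpoint can be appended to the audio tokens that serve as keys and values, so that viewpoint-dependent effects need not be explained by the speech features:

    import torch

    hidden = 128
    attention = torch.nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)

    spatial_q = torch.randn(1, 4096, hidden)   # per-Gaussian query features
    audio_kv  = torch.randn(1, 16, hidden)     # audio tokens for the current window
    view_kv   = torch.randn(1, 1, hidden)      # embedded camera viewpoint condition

    # The viewpoint token joins the audio tokens as an extra key/value, letting the
    # attention attribute viewpoint-dependent variations (hair, illumination) to it
    # rather than forcing them onto the speech features.
    keys_values = torch.cat([audio_kv, view_kv], dim=1)
    fused, weights = attention(spatial_q, keys_values, keys_values)   # (1, 4096, hidden)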

Citation

If you find our work useful in your research, please cite our work as:
    @misc{cho2024gaussiantalker,
        title={GaussianTalker: Real-Time High-Fidelity Talking Head Synthesis with Audio-Driven 3D Gaussian Splatting}, 
        author={Kyusun Cho and Joungbin Lee and Heeji Yoon and Yeobin Hong and Jaehoon Ko and Sangjun Ahn and Seungryong Kim},
        year={2024},
        eprint={xxxx.xxxxx},
        archivePrefix={arXiv},
        primaryClass={cs.CV}
    }

Acknowledgements

The website template was borrowed from Michaël Gharbi.