Let 2D Diffusion Model Know 3D-Consistency
for Robust Text-to-3D Generation

Junyoung Seo*1, Wooseok Jang*1, Min-Seop Kwak*1, Jaehoon Ko1, Hyeonsu Kim1,
Junho Kim2, Jin-Hwa Kim†2, Jiyoung Lee†2, Seungryong Kim†1.
1Korea University, 2NAVER AI Lab
*Equal contribution, †Co-corresponding author

ICLR 2024

3DFuse (Ours)
"a cute little kitten"
"a photo of cute hippo"
"a cute elephant with a long trunk and ivory tusks"
"a cute pig with a pink snout and curly tail"
"cute sheep with white fur"
"a gentle deer with a spotted coat and a peaceful expression"
"a cozy cabin in the woods with a chimney and a porch"
"a stack of pancakes with maple syrup and butter"
"a photo of comfortable bed"
"a fantastical wizard's tower with a spiral staircase and mysterious artifacts"


Text-to-3D generation has progressed rapidly with the advent of score distillation, a technique that uses pretrained text-to-2D diffusion models to optimize a neural radiance field (NeRF) in a zero-shot setting. However, the lack of 3D awareness in 2D diffusion models destabilizes score distillation-based methods and prevents them from reconstructing a plausible 3D scene. To address this issue, we propose 3DFuse, a novel framework that incorporates 3D awareness into pretrained 2D diffusion models, enhancing the robustness and 3D consistency of score distillation-based methods. We realize this by first constructing a coarse 3D structure from a given text prompt and then using a projected, view-specific depth map as a condition for the diffusion model. Additionally, we introduce a training strategy that enables the 2D diffusion model to learn to handle the errors and sparsity within the coarse 3D structure for robust generation, as well as a method for ensuring semantic consistency across all viewpoints of the scene. Our framework overcomes the limitations of prior art and has significant implications for 3D-consistent generation with 2D diffusion models.
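To make the "projected, view-specific depth map" concrete, here is a minimal NumPy sketch. It assumes the coarse 3D structure is available as a point cloud and that the camera is a simple pinhole model; the function name `render_sparse_depth`, the 4x4 `world_to_cam` extrinsic matrix, and the `focal` parameter are illustrative choices, not taken from the 3DFuse code.

```python
import numpy as np

def render_sparse_depth(points_world, world_to_cam, focal, height, width):
    """Project a point cloud into a sparse depth map with a pinhole camera.

    Pixels that no point lands on stay 0, which is what makes the map sparse.
    """
    # Homogeneous transform into the camera frame (camera looks down +z).
    ones = np.ones((points_world.shape[0], 1))
    pts_cam = (world_to_cam @ np.hstack([points_world, ones]).T).T[:, :3]

    # Keep only points in front of the camera.
    pts_cam = pts_cam[pts_cam[:, 2] > 1e-6]
    x, y, z = pts_cam[:, 0], pts_cam[:, 1], pts_cam[:, 2]

    # Pinhole projection with the principal point at the image centre.
    u = np.round(focal * x / z + (width - 1) / 2).astype(int)
    v = np.round(focal * y / z + (height - 1) / 2).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z = u[inside], v[inside], z[inside]

    # Z-buffer: the nearest point wins when several fall on one pixel.
    depth = np.full((height, width), np.inf)
    np.minimum.at(depth, (v, u), z)
    depth[np.isinf(depth)] = 0.0
    return depth

# Hypothetical toy cloud: two points on the optical axis, one off-axis,
# and one behind the camera (which is dropped).
points = np.array([
    [0.0, 0.0, 2.0],
    [0.0, 0.0, 3.0],
    [0.5, 0.0, 2.0],
    [0.0, 0.0, -1.0],
])
depth = render_sparse_depth(points, np.eye(4), focal=4.0, height=5, width=5)
```

Calling the function once per camera pose (a different `world_to_cam` each time) gives one depth map per view. Most pixels remain empty, which is the sparsity the diffusion model has to learn to tolerate.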


(a) Naive score distillation

(b) 3D-aware score distillation (Ours)

(a) Previous methods use only noisy rendered images and the prompt itself for score distillation through the diffusion model, resulting in poor 3D coherence. (b) Our 3DFuse addresses this issue and robustly recovers a 3D-consistent scene.
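The difference between (a) and (b) comes down to what the noise predictor is conditioned on. The sketch below is a toy NumPy illustration of one score distillation update, not the 3DFuse implementation: `toy_noise_predictor` is a hypothetical stand-in for the diffusion U-Net, the `cond` dictionary is an assumed container for the conditioning signals, and the image `x` is optimized directly instead of back-propagating through a NeRF renderer.

```python
import numpy as np

def sds_gradient(x, predict_noise, cond, alpha_bar, rng):
    """One-sample estimate of the score distillation gradient w.r.t. the rendering x."""
    eps = rng.standard_normal(x.shape)
    # Forward diffusion: noise the rendered image to the sampled noise level.
    x_t = np.sqrt(alpha_bar) * x + np.sqrt(1.0 - alpha_bar) * eps
    eps_hat = predict_noise(x_t, cond, alpha_bar)
    weight = 1.0 - alpha_bar  # w(t) = sigma_t^2, one common choice
    return weight * (eps_hat - eps)

def toy_noise_predictor(x_t, cond, alpha_bar):
    """Hypothetical stand-in for the U-Net: the exact noise prediction when the
    data distribution is the single image cond["target"]."""
    return (x_t - np.sqrt(alpha_bar) * cond["target"]) / np.sqrt(1.0 - alpha_bar)

rng = np.random.default_rng(0)
target = rng.uniform(size=(8, 8))  # what the conditioned model "wants" for this view
x = np.zeros((8, 8))               # stands in for the NeRF rendering

# In (a) the condition would be the prompt embedding alone; in (b) it would
# also carry the view-specific depth map. The toy collapses both into "target".
cond = {"target": target}

for _ in range(200):
    alpha_bar = rng.uniform(0.1, 0.9)  # random noise level per step
    x = x - 0.5 * sds_gradient(x, toy_noise_predictor, cond, alpha_bar, rng)

final_error = float(np.abs(x - target).max())
```

In this toy the update pulls `x` toward whatever image the conditioned predictor favors. If the conditioning does not pin down the viewpoint, different camera poses can be pulled toward mutually inconsistent images, which is the failure mode the depth condition is meant to prevent.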

3DFuse Framework

In the framework, a semantic code is sampled to reduce the ambiguity of the text prompt: an image is generated from the text prompt, and the prompt’s embedding is then optimized to match the generated image. Our consistency injection module receives this semantic code to synthesize view-specific depth maps as a condition for the diffusion U-Net. The module also includes a sparse depth injector to implicitly incorporate 3D awareness by utilizing an external 3D prior, and LoRA layers to maintain semantic consistency.
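For readers unfamiliar with LoRA, the following NumPy sketch shows the generic low-rank adaptation of a single linear layer. The text above does not say where the LoRA layers are attached or which rank is used, so the class name `LoRALinear`, the layer size, `rank=4`, and `alpha=8.0` are all illustrative assumptions.

```python
import numpy as np

class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update: W + (alpha / r) * B @ A."""

    def __init__(self, weight, rank, alpha, rng):
        out_dim, in_dim = weight.shape
        self.weight = weight                                   # pretrained, kept frozen
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))    # trainable
        self.B = np.zeros((out_dim, rank))                     # trainable, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        # x: (batch, in_dim). Base path plus low-rank residual path.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def num_trainable(self):
        return self.A.size + self.B.size

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))  # hypothetical pretrained weight
layer = LoRALinear(W, rank=4, alpha=8.0, rng=rng)

x = rng.normal(size=(2, 64))
out = layer(x)
unchanged_at_init = bool(np.allclose(out, x @ W.T))
```

Because `B` starts at zero, the adapted layer reproduces the pretrained layer exactly before any training, and only `A` and `B` (512 values here, versus 4096 in `W`) need to be updated.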

View-dependent 2D Generation

(a) Results of view-augmented prompting, and (b) our results with the 3DFuse framework.

Image-conditional Generation

Instead of generating the initial image from a text prompt, we can directly supply an input image, which effectively reconfigures our framework for an image-conditional setting.


  title={Let 2D Diffusion Model Know 3D-Consistency for Robust Text-to-3D Generation},
  author={Seo, Junyoung and Jang, Wooseok and Kwak, Min-Seop and Ko, Jaehoon and Kim, Hyeonsu and Kim, Junho and Kim, Jin-Hwa and Lee, Jiyoung and Kim, Seungryong},
  journal={arXiv preprint arXiv:2303.07937},