DreamMatcher: Appearance Matching Self-Attention
for Semantically-Consistent Text-to-Image Personalization

Jisu Nam1, Heesu Kim2, DongJae Lee2, Siyoon Jin1, Seungryong Kim†1, Seunggyu Chang†2
1Korea University, 2Naver Cloud
†Co-corresponding authors

CVPR 2024


DreamMatcher enables semantically-consistent Text-to-Image (T2I) personalization. Our method is designed as a plug-in compatible with any existing personalized T2I model. When integrated, DreamMatcher significantly enhances subject appearance, including colors, textures, and shapes, while accurately preserving the target structure guided by the target prompt. Notably, our plug-in method requires no additional training or fine-tuning.


Abstract

The objective of Text-to-Image (T2I) personalization is to customize a diffusion model to a user-provided reference concept, generating diverse images of the concept aligned with target prompts. Conventional methods for T2I personalization represent the reference concept using unique text embeddings; however, they often fail to accurately mimic the appearance of the reference. One solution is to explicitly condition the target denoising process on the reference images, a technique called key-value replacement. However, prior works are constrained to local editing because they disrupt the structure path of the pre-trained T2I model, leading to suboptimal matching. To overcome this, we propose a novel plug-in method, DreamMatcher, which reformulates T2I personalization as semantic matching. Specifically, DreamMatcher replaces the target values with reference values aligned by semantic matching, while leaving the structure path unchanged to preserve the versatile capability of pre-trained T2I models to generate diverse structures. We also introduce a semantically-consistent masking strategy that isolates the personalized concept from irrelevant regions, such as the background or objects newly introduced by the target prompts. DreamMatcher is compatible with any existing T2I personalized model, showing significant improvements in complex personalization scenarios. Extensive experiments and analyses demonstrate the effectiveness of our approach.
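The "reference values aligned by semantic matching" step in the abstract can be illustrated with a toy correspondence computation. The sketch below is illustrative only: `semantic_correspondence` is a hypothetical helper that matches flattened reference and target feature maps by cosine similarity and returns, for each target location, its best-matching reference location; the actual method builds a more reliable matching with consistency modeling.

```python
import torch
import torch.nn.functional as F

def semantic_correspondence(feat_ref: torch.Tensor, feat_tgt: torch.Tensor) -> torch.Tensor:
    """Toy dense semantic matching (hypothetical helper, not the paper's exact method).

    feat_ref, feat_tgt: (tokens, dim) feature maps flattened over spatial locations.
    Returns a (tokens_tgt,) index tensor mapping each target location to the
    reference location with the highest cosine similarity.
    """
    # Cosine similarity = dot product of L2-normalized features
    sim = F.normalize(feat_tgt, dim=-1) @ F.normalize(feat_ref, dim=-1).T
    return sim.argmax(dim=-1)  # correspondence field from target to reference
```

The resulting index field can then be used to warp reference appearance features toward the target structure before they enter the denoising process.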


Appearance Matching Self-Attention

Comparison between (a) key-value replacement (MasaCtrl [Cao et al., ICCV’23]) and (b) appearance matching self-attention (AMA): AMA aligns the reference appearance path toward the fixed target structure path through explicit semantic matching and consistency modeling.

Intuition of DreamMatcher: (a) reference image, (b) disrupted target structure path by key-value replacement, (c) generated image by (b), (d) target structure path in pre-trained T2I model, and (e) generated image by DreamMatcher. For visualization, principal component analysis (PCA) is applied to the structure path. Key-value replacement disrupts target structure, yielding sub-optimal personalized results, whereas DreamMatcher preserves target structure, producing high-fidelity subject images aligned with target prompts.


Overall Framework


Overall network architecture of DreamMatcher. Given a reference image, appearance matching self-attention (AMA) aligns the reference appearance with the fixed target structure in the self-attention module of the pre-trained personalized model. This is achieved by explicitly leveraging reliable semantic matching from reference to target. Furthermore, semantic matching guidance enhances the fine-grained details of the subject in the generated images.
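To make the AMA idea concrete, here is a minimal, hypothetical PyTorch sketch; the function name and the `warp_idx`/`mask` inputs are assumptions for illustration, not the paper's exact implementation. The target structure path (queries and keys) is left untouched; reference values are warped by a precomputed semantic correspondence, and a mask confines the replacement to the subject region.

```python
import torch

def appearance_matching_self_attention(q_tgt, k_tgt, v_tgt, v_ref, warp_idx, mask):
    """Sketch of AMA (illustrative, assumed signature).

    q_tgt, k_tgt, v_tgt, v_ref: (heads, tokens, dim) self-attention tensors.
    warp_idx: (tokens,) long tensor mapping each target token to its
              semantically matched reference token.
    mask: (tokens, 1) soft mask in [0, 1] isolating the subject region.
    """
    # Warp reference values toward the target structure via semantic matching
    v_warped = v_ref[:, warp_idx, :]                       # (heads, tokens, dim)
    # Keep the original target values outside the matched subject region
    v_mixed = mask * v_warped + (1 - mask) * v_tgt
    # Standard scaled dot-product attention on the UNCHANGED structure path
    scale = q_tgt.shape[-1] ** 0.5
    attn = torch.softmax(q_tgt @ k_tgt.transpose(-2, -1) / scale, dim=-1)
    return attn @ v_mixed
```

Note the contrast with key-value replacement: here only the values are substituted (after alignment), so the attention map, and hence the target structure, is computed exactly as in the pre-trained model.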


Personalization Results

Qualitative comparison with baselines for live objects: We compare DreamMatcher with three baselines: Textual Inversion, DreamBooth, and CustomDiffusion.

Qualitative comparison with baselines for non-live objects: We compare DreamMatcher with three baselines: Textual Inversion, DreamBooth, and CustomDiffusion.

Qualitative comparison with previous works for live objects: For this comparison, DreamBooth serves as the baseline for MasaCtrl, FreeU, MagicFusion, and DreamMatcher.

Qualitative comparison with previous works for non-live objects: For this comparison, DreamBooth serves as the baseline for MasaCtrl, FreeU, MagicFusion, and DreamMatcher.

Qualitative comparison with the baseline for multi-subject personalization: We use CustomDiffusion as the baseline for personalizing multiple subjects.

BibTeX

@misc{nam2024dreammatcher,
      title={DreamMatcher: Appearance Matching Self-Attention for Semantically-Consistent Text-to-Image Personalization}, 
      author={Jisu Nam and Heesu Kim and DongJae Lee and Siyoon Jin and Seungryong Kim and Seunggyu Chang},
      year={2024},
      eprint={2402.09812},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}