Qualitative comparisons between unguided (baseline) and perturbed-attention-guided (PAG) diffusion samples. Without any external conditions (e.g., class labels or text prompts) or additional training, PAG dramatically improves the quality of diffusion samples, even in unconditional generation where classifier-free guidance (CFG) is inapplicable. Our guidance also enhances baseline performance in various downstream tasks, such as ControlNet with an empty prompt and image restoration tasks such as inpainting and deblurring.

Abstract

Recent studies show that diffusion models can generate high-quality samples, but their quality is often highly reliant on sampling guidance techniques such as classifier guidance (CG) and classifier-free guidance (CFG), which are inapplicable in unconditional generation and in various downstream tasks such as image restoration. In this paper, we propose a novel diffusion sampling guidance, called Perturbed-Attention Guidance (PAG), which improves sample quality in both unconditional and conditional settings without requiring further training or the integration of external modules. PAG is designed to progressively enhance the structure of synthesized samples throughout the denoising process, exploiting the ability of self-attention mechanisms to capture structural information. It generates intermediate samples with degraded structure by substituting selected self-attention maps in the diffusion U-Net with an identity matrix, and guides the denoising process away from these degraded samples.
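In sketch form, the resulting guidance update mirrors CFG but replaces the unconditional branch with the prediction of the perturbed network. Below is a minimal numpy illustration; the variable names (`eps`, `eps_perturbed`) and the helper function are ours, not the paper's code, and the toy arrays stand in for U-Net noise predictions:

```python
import numpy as np

def pag_guidance(eps, eps_perturbed, s):
    """Steer the denoising direction away from the structurally degraded
    prediction: eps_hat = eps + s * (eps - eps_perturbed)."""
    return eps + s * (eps - eps_perturbed)

# Toy noise predictions standing in for U-Net outputs at one timestep.
eps = np.array([0.2, -0.1, 0.4])
eps_perturbed = np.array([0.5, -0.3, 0.1])

eps_hat = pag_guidance(eps, eps_perturbed, s=3.0)
```

With guidance scale `s = 0` the update reduces to the baseline prediction, so the sketch degrades gracefully to unguided sampling.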


Guided Diffusion Results

← Hover over the image to compare results! (baseline vs. ours) →

Unconditional Generation

Conditional Generation

Stable Diffusion Results

Unconditional Generation

Conditional Generation (CFG vs CFG + Ours)

Overall Framework

Conceptual comparison between CFG and PAG. CFG employs a jointly trained unconditional model as the undesirable path, whereas PAG utilizes perturbed self-attention for the same purpose. \(\mathbf{A}_t\) denotes the self-attention map; PAG perturbs it by replacing it with an identity matrix \(\mathbf{I}\).
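Since \(\mathbf{I} \mathbf{V} = \mathbf{V}\), replacing the attention map with the identity makes each token attend only to itself, discarding the structural information the map normally carries. A minimal sketch of this perturbation (numpy stand-in for one single-head attention layer; shapes and names are ours):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V, perturb=False):
    """Scaled dot-product self-attention. With perturb=True, the
    attention map A_t is replaced by the identity matrix, so the
    output collapses to V (each token attends only to itself)."""
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d))
    if perturb:
        A = np.eye(A.shape[0])  # PAG: A_t <- I
    return A @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
out_perturbed = self_attention(Q, K, V, perturb=True)
# With the identity map, the perturbed output equals V exactly.
```

Only the attention map is perturbed; the query, key, and value projections are left untouched, so the perturbed branch still runs through the same U-Net weights.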

Application to Downstream Tasks

Image Restoration (PSLD)

ControlNet with Empty Prompt (ControlNet)

🧨Diffusers Pipelines and Community Implementation for GUI Interfaces

Thanks to the exceptional efforts of @v0xie, @pamparamm and @multimodalart, you can now easily incorporate PAG into your custom pipelines or workflows.
🧨Diffusers pipeline for SD: hyoungwoncho/sd_perturbed_attention_guidance
🧨Diffusers pipeline for SDXL: multimodalart/sdxl_perturbed_attention_guidance
SD WebUI (Automatic1111) extension: v0xie/sd-webui-incantations
ComfyUI node / SD WebUI Forge extension: pamparamm/sd-perturbed-attention

Try a demo for SDXL here!

Citation

If you find our work useful in your research, please cite our work as:
@article{ahn2024selfrectifying,
    title={Self-Rectifying Diffusion Sampling with Perturbed-Attention Guidance},
    author={Donghoon Ahn and Hyoungwon Cho and Jaewon Min and Wooseok Jang and Jungwoo Kim and SeonHwa Kim and Hyun Hee Park and Kyong Hwan Jin and Seungryong Kim},
    journal={arXiv preprint arXiv:2403.17377},
    year={2024}
}

Acknowledgements

The website template was borrowed from Michaël Gharbi.