LANIT: Language-Driven Image-to-Image Translation for Unlabeled Data
Jihye Park*1
Sunwoo Kim*1
Soohyun Kim*1
Seokju Cho1
Jaejun Yoo2
Youngjung Uh3
Seungryong Kim†1
Korea University1
UNIST2
Yonsei University3
*Equal contribution
†Corresponding author
CVPR 2023
[Paper | GitHub]

LANIT is the first to address multiple attributes in a single sample for image-to-image translation tasks.

Abstract

Existing techniques for image-to-image translation commonly suffer from two critical problems: heavy reliance on per-sample domain annotation and/or the inability to handle multiple attributes per image. Recent methods adopt clustering approaches to easily provide per-sample annotations in an unsupervised manner. However, they cannot account for the real-world setting in which one sample may have multiple attributes, and the semantics of the clusters are not easily coupled to human understanding. To overcome these issues, we present a LANguage-driven Image-to-image Translation model, dubbed LANIT. We leverage easy-to-obtain candidate domain annotations given in texts for a dataset and jointly optimize them during training. The target style is specified by aggregating multi-domain style vectors according to the multi-hot domain assignments. As the initial candidate domain texts might be inaccurate, we set the candidate domain texts to be learnable and jointly fine-tune them during training. Furthermore, we introduce a slack domain to cover samples that are not covered by the candidate domains. Experiments on several standard benchmarks demonstrate that LANIT achieves comparable or superior performance to existing models.
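As a rough illustration of the style aggregation described above, the sketch below (PyTorch, illustrative names only, not the official implementation) averages per-domain style vectors according to a multi-hot domain assignment.

```python
# Minimal sketch: combine per-domain style vectors into one target style
# according to a multi-hot domain assignment. Names and shapes are illustrative.
import torch

def aggregate_style(style_vectors: torch.Tensor, domain_assignment: torch.Tensor) -> torch.Tensor:
    """Average the style vectors of the assigned domains.

    style_vectors:     (num_domains, style_dim)  one style code per candidate domain
    domain_assignment: (num_domains,)             multi-hot vector, 1 for each active domain
    """
    weights = domain_assignment / domain_assignment.sum().clamp(min=1)  # normalize over active domains
    return weights @ style_vectors                                      # (style_dim,)

# Example: 5 candidate domains, sample assigned to domains 1 and 3.
styles = torch.randn(5, 64)
assignment = torch.tensor([0., 1., 0., 1., 0.])
target_style = aggregate_style(styles, assignment)  # averaged style of domains 1 and 3
```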


Levels of Supervision

For unpaired image-to-image translation, (a) conventional methods (CycleGAN, MUNIT, FUNIT, StarGAN, SEMIT) require at least per-sample domain supervision, which is often hard to collect. To overcome this, (b) unsupervised methods (TUNIT, Style-aware Discriminator) learn an image translation model from the dataset itself without any supervision, but they show limited performance and lack a semantic understanding of each cluster, which limits their applicability. Unlike these, (c) we present a novel framework for image translation that only requires a dataset with possible textual domain descriptions (i.e., dataset-level annotation), and it achieves comparable or even better performance than previous methods.
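The snippet below is a hypothetical illustration of the difference in annotation cost between per-sample labels and a dataset-level list of candidate domain texts; all file names and prompts are made up.

```python
# (a) Per-sample supervision: one label set for every image in the dataset.
per_sample_labels = {
    "img_0001.jpg": ["smiling"],
    "img_0002.jpg": ["blond hair", "smiling"],  # one image may have multiple attributes
}

# (c) Dataset-level supervision (ours): a single list of candidate domain texts
# shared by the whole dataset; no individual image needs to be labeled.
candidate_domains = [
    "a face with blond hair",
    "a smiling face",
    "a face with eyeglasses",
]
```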


Network Configuration

We propose a language-driven image-to-image translation framework with candidate "dataset-level" domain annotations, which is more practical than fully unsupervised methods in real-world settings.
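As a rough sketch of how multi-hot domain assignments can be obtained from candidate texts, the example below scores an image against the candidate prompts with CLIP and thresholds the similarities; the threshold rule and the slack-domain handling here are simplified assumptions for illustration, not the exact procedure in the paper.

```python
# Sketch: derive a per-sample multi-hot domain assignment from CLIP image-text
# similarities. File name, prompts, and threshold are illustrative only.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

candidate_domains = [
    "a photo of a smiling face",
    "a photo of a face with blond hair",
    "a photo of a face with eyeglasses",
]  # dataset-level candidate domain texts

image = preprocess(Image.open("sample.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(candidate_domains).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(text)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    sims = (image_feat @ text_feat.T).squeeze(0)  # (num_domains,) cosine similarities

threshold = 0.25                          # illustrative value, not from the paper
assignment = (sims > threshold).float()   # multi-hot pseudo domain assignment
if assignment.sum() == 0:
    print("No candidate domain matches: route this sample to the slack domain.")
```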


Qualitative Results

Controlling the number of attributes to translate

Latent-guided translation: AnimalFaces-10, Food-10

Reference-guided translation: LSUN-Church, LSUN-Car, MetFace, Anime



Paper and Supplementary Material

J. Park, S. Kim, S. Kim, S. Cho, J. Yoo, Y. Uh, S. Kim
LANIT: Language-Driven Image-to-Image Translation for Unlabeled Data.
(hosted on arXiv)


[Bibtex]


Acknowledgements

This template was originally made by Phillip Isola and Richard Zhang for a colorful ECCV project; the code can be found here.