For open-vocabulary semantic segmentation, we propose to aggregate the matching cost computed between dense CLIP image and text embeddings, achieving state-of-the-art performance across all benchmarks.

Abstract

Existing works on open-vocabulary semantic segmentation utilize large-scale vision-language models, such as CLIP, to leverage their exceptional open-vocabulary recognition capabilities. However, transferring these capabilities, which are learned from image-level supervision, to the pixel-level task of segmentation, while also handling arbitrary unseen categories at inference, makes this task challenging. To address these issues, we aim to attentively relate objects within an image to given categories by leveraging relational information among class categories and visual semantics through aggregation, while also adapting the CLIP representations to the pixel-level task. However, we observe that directly optimizing the CLIP embeddings can harm CLIP's open-vocabulary capabilities. We therefore propose an alternative approach: optimizing the image-text similarity map, i.e., the cost map, with a novel cost-aggregation-based method. Our framework, CAT-Seg, achieves state-of-the-art performance across all benchmarks. We provide extensive ablation studies to validate our choices.
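To make the cost map concrete, below is a minimal sketch (an illustration, not the released implementation; the function name and tensor shapes are assumptions) that builds the cost map as the cosine similarity between dense per-patch CLIP image embeddings and per-class text embeddings:

```python
import torch
import torch.nn.functional as F

def compute_cost_map(image_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
    """Cosine-similarity cost map between dense image and text embeddings.

    image_feats: (H, W, D) per-patch embeddings from the CLIP image encoder.
    text_feats:  (N, D) embeddings of the N candidate class names.
    Returns:     (H, W, N) cost map, one similarity score per patch per class.
    """
    image_feats = F.normalize(image_feats, dim=-1)  # cosine similarity =
    text_feats = F.normalize(text_feats, dim=-1)    # dot product of unit vectors
    return torch.einsum("hwd,nd->hwn", image_feats, text_feats)

# Example: 24x24 patch grid, 512-dim CLIP embeddings, 150 ADE20K classes.
cost = compute_cost_map(torch.randn(24, 24, 512), torch.randn(150, 512))
print(cost.shape)  # torch.Size([24, 24, 150])
```

It is this (H, W, N) map, rather than the CLIP embeddings themselves, that our aggregation modules refine.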

HuggingFace Demo

Qualitative Results

Qualitative results of ADE20K with 150 categories.

Qualitative results of ADE20K with 847 categories.

Quantitative Results

The best-performing results are presented in bold, while the second-best results are underlined. Improvements over the second-best methods are highlighted in green. mIoU is used as the evaluation metric. †: Re-implemented and trained on the full COCO-Stuff dataset. *: Model trained on the LAION-2B dataset.

Motivation of Cost Aggregation

To validate our framework, we consider two approaches: direct optimization of the CLIP embeddings through feature aggregation, and indirect optimization through cost aggregation. Left: Both approaches achieve performance gains on seen classes from fine-tuning the CLIP image encoder. Right: Feature aggregation fails to generalize to unseen classes, while cost aggregation achieves large performance gains, highlighting its effectiveness for open-vocabulary segmentation.

Main Architecture

Our network consists of a cost computation and embedding module, an aggregation module composed of spatial aggregation and inter-class aggregation, and an upsampling decoder.
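The sketch below illustrates the two-stage aggregation over an embedded cost volume, using plain PyTorch transformer encoder layers as stand-ins for the paper's Swin-based spatial blocks and class-attention blocks; the module name, dimensions, and shapes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class CostAggregation(nn.Module):
    """Two-stage aggregation over an embedded cost volume of shape (B, H, W, N, D)."""

    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        # Stand-ins for the Swin-based spatial blocks and the class-attention blocks.
        self.spatial = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.inter_class = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, H, W, N, D = x.shape
        # Spatial aggregation: attend over the H*W tokens of each class slice.
        x = x.permute(0, 3, 1, 2, 4).reshape(B * N, H * W, D)
        x = self.spatial(x)
        # Inter-class aggregation: attend over the N class tokens at each location.
        x = x.reshape(B, N, H * W, D).permute(0, 2, 1, 3).reshape(B * H * W, N, D)
        x = self.inter_class(x)
        return x.reshape(B, H, W, N, D)

# Example: 16x16 patch grid, 20 candidate classes, 128-dim cost embedding.
out = CostAggregation()(torch.randn(1, 16, 16, 20, 128))
print(out.shape)  # torch.Size([1, 16, 16, 20, 128])
```

Factorizing attention along the spatial and class axes keeps each attention computation small, so the full H×W×N cost volume can be refined without attending over all of its entries at once.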

In-the-Wild Segmentation Results

The remarkable performance of our model is not limited to photos but extends to pixel art, illustrations, and game scenes, demonstrating its exceptional generalization ability.


Application: Image Editing with Stable Diffusion

Using CAT-Seg's predictions as masks, we can effortlessly edit images with Stable Diffusion inpainting.
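As a rough illustration with the diffusers library (the checkpoint name and file paths are placeholders; the binary mask is assumed to come from CAT-Seg's prediction for the class being edited):

```python
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

# Load an off-the-shelf inpainting pipeline (checkpoint name is illustrative).
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

image = Image.open("scene.png").convert("RGB").resize((512, 512))
# Binary mask from CAT-Seg: white where the target class was predicted.
mask = Image.open("catseg_car_mask.png").convert("L").resize((512, 512))

# Regenerate only the masked region according to the text prompt.
edited = pipe(prompt="a red sports car", image=image, mask_image=mask).images[0]
edited.save("edited.png")
```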

Citation

Acknowledgements

The website template was borrowed from Michaël Gharbi.