MasQCLIP for Open-Vocabulary Universal Image Segmentation
ICCV 2023

1Peking University, 2Tsinghua University, 3UC San Diego
*Equal Contribution

MasQCLIP can segment objects of arbitrary classes specified by the user and distinguish subtle differences among them.

Abstract

We present a new method for open-vocabulary universal image segmentation, which is capable of performing instance, semantic, and panoptic segmentation under a unified framework. Our approach, called MasQCLIP, seamlessly integrates with a pre-trained CLIP model by utilizing its dense features, thereby circumventing the need for extensive parameter training.

MasQCLIP emphasizes two new aspects when building an image segmentation method with a CLIP model: (1) a student-teacher module that handles masks of novel (unseen) classes by distilling information from the base (seen) classes; (2) a fine-tuning process that updates model parameters for the queries \(Q\) within the CLIP model. Thanks to these two simple and intuitive designs, MasQCLIP achieves state-of-the-art performance, surpassing competing methods by a large margin across all three tasks: open-vocabulary instance, semantic, and panoptic segmentation.

Method


Figure: MasQCLIP consists of a class-agnostic mask proposal network and a CLIP-based mask classification module. In the mask proposal network, we apply progressive distillation to segment masks beyond the base classes. Once we obtain an open-world mask proposal network, the predicted masks are sent to the classification module to obtain labels. To efficiently utilize the dense CLIP features, we propose MasQ-Tuning: we add new query projections \(f_{Q}^{\prime}\) for the Mask Class Tokens to obtain better attention weights, and the \(f_Q^\prime\) at each layer are the only learnable parameters.
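
In pseudocode, inference proceeds in two stages: propose masks, then label them with CLIP. The sketch below is purely illustrative; proposal_net, encode_mask_class_tokens, and encode_class_names are placeholder names we introduce here, not the released MasQCLIP API.

    masks = proposal_net(image)                          # class-agnostic mask proposals
    mask_feats = encode_mask_class_tokens(image, masks)  # Mask Class Tokens through frozen CLIP
    text_feats = encode_class_names(class_names)         # CLIP text embeddings of candidate labels
    logits = mask_feats @ text_feats.T                   # similarity of each mask to each class
    labels = logits.argmax(dim=-1)                       # one open-vocabulary label per mask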

Progressive Distillation

Figure: the progressive distillation scheme for training the class-agnostic mask proposal network.
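
As a rough sketch of the student-teacher idea: a teacher trained on base-class annotations supplies confident pseudo-masks beyond those annotations, a student is trained on the merged targets, and the student then becomes the next teacher. All names (MaskProposalNet, merge_targets, mask_loss) and the exact schedule are our own placeholders, not the paper's training recipe.

    teacher = MaskProposalNet()   # hypothetical class; initially trained on base classes
    student = MaskProposalNet()

    for round in range(num_rounds):
        for images, base_gt in loader:
            with torch.no_grad():
                pseudo = teacher(images)               # proposals beyond base annotations
            targets = merge_targets(base_gt, pseudo)   # keep confident masks in novel regions
            loss = mask_loss(student(images), targets) # standard mask losses (placeholder)
            optimizer.zero_grad(); loss.backward(); optimizer.step()
        teacher = student                              # promote the student to teacher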

MasQ-Tuning

To enhance adaptation from image classification to mask classification while maintaining the generalization ability of CLIP, we apply a new query projection \(f_Q^\prime\) at each cross-attention layer for the Mask Class Tokens, i.e.,

$$\text{CrossAttn}(\cdot) = \text{softmax}(\mathbf{Q}_{\text{mask}}^\prime K_{\text{img}}^T + \mathcal{M}_{\text{mask}}) \cdot V_{\text{img}}$$

$$\mathbf{Q}_{\text{mask}}^\prime,\, K_{\text{img}},\, V_{\text{img}} = f_Q^\prime(x_\text{mask}),\, f_K(x_\text{img}),\, f_V(x_\text{img})$$
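
For concreteness, here is a minimal PyTorch sketch of one such cross-attention layer. It is an illustration under our own assumptions (tensor shapes, and that the frozen CLIP layer exposes linear projections f_K and f_V), not the released implementation.

    import torch
    import torch.nn as nn

    class MasQCrossAttention(nn.Module):
        """Hedged sketch of a MasQ-Tuning cross-attention layer.

        Only the new query projection f_Q_prime is trainable; the key and
        value projections are reused from the frozen CLIP layer. Shapes and
        wiring are illustrative assumptions, not the official code.
        """

        def __init__(self, dim: int, clip_f_K: nn.Linear, clip_f_V: nn.Linear):
            super().__init__()
            # New learnable query projection for the Mask Class Tokens.
            self.f_Q_prime = nn.Linear(dim, dim)
            # Frozen CLIP projections, reused as-is.
            self.f_K = clip_f_K
            self.f_V = clip_f_V
            for p in list(self.f_K.parameters()) + list(self.f_V.parameters()):
                p.requires_grad = False

        def forward(self, x_mask, x_img, attn_mask):
            # x_mask:    (B, N_masks, dim)   Mask Class Tokens
            # x_img:     (B, N_patches, dim) CLIP image tokens
            # attn_mask: (B, N_masks, N_patches) additive mask M_mask,
            #            0 inside each predicted mask, -inf outside it.
            Q = self.f_Q_prime(x_mask)
            K = self.f_K(x_img)
            V = self.f_V(x_img)
            # Mirrors the equation above (no extra scaling term shown there).
            attn = torch.softmax(Q @ K.transpose(-2, -1) + attn_mask, dim=-1)
            return attn @ V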

Main Results


Figure: main quantitative results across open-vocabulary instance, semantic, and panoptic segmentation benchmarks.

Qualitative Results


Semantic Segmentation

Instance Segmentation

Panoptic Segmentation

BibTeX

@inproceedings{xu2023masqclip,
  author    = {Xu, Xin and Xiong, Tianyi and Ding, Zheng and Tu, Zhuowen},
  title     = {MasQCLIP for Open-Vocabulary Universal Image Segmentation},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  month     = {October},
  year      = {2023},
  pages     = {887--898}
}