Title: Hierarchical Open-vocabulary Universal Image Segmentation

URL Source: https://arxiv.org/html/2307.00764

Xudong Wang¹* Shufan Li¹* Konstantinos Kallidromitis²* Yusuke Kato² Kazuki Kozuka² Trevor Darrell¹

¹Berkeley AI Research, UC Berkeley ²Panasonic AI Research

project page: http://people.eecs.berkeley.edu/~xdwang/projects/HIPIE

Abstract

Open-vocabulary image segmentation aims to partition an image into semantic regions according to arbitrary text descriptions. However, complex visual scenes can be naturally decomposed into simpler parts and abstracted at multiple levels of granularity, introducing inherent segmentation ambiguity. Unlike existing methods that typically sidestep this ambiguity and treat it as an external factor, our approach actively incorporates a hierarchical representation encompassing different semantic levels into the learning process. We also propose a decoupled text-image fusion mechanism and representation learning modules for both “things” and “stuff”.¹

¹The terms things (countable objects, typically foreground) and stuff (non-object, non-countable, typically background)[1] are commonly used to distinguish between objects that have a well-defined geometry and are countable, e.g. people, cars, and animals, and surfaces or regions that lack a fixed geometry and are primarily identified by their texture and/or material, e.g. the sky, road, and water body.

*: equal contribution

Additionally, we systematically examine the differences in the textual and visual features between these types of categories. Our resulting model, named HIPIE, tackles HIerarchical, oPen-vocabulary, and unIvErsal segmentation tasks within a unified framework. Benchmarked on over 40 datasets, e.g., ADE20K, COCO, Pascal-VOC Part, RefCOCO/RefCOCOg, ODinW and SeginW, HIPIE achieves state-of-the-art results at various levels of image comprehension, including semantic-level (e.g., semantic segmentation), instance-level (e.g., panoptic/referring segmentation and object detection), and part-level (e.g., part/subpart segmentation) tasks.

1 Introduction


Figure 1: HIPIE is a unified framework which, given an image and a set of arbitrary text descriptions, provides hierarchical semantic-, instance-, part-, and subpart-level image segmentations. This includes open-vocabulary semantic (e.g., crowds and sky), instance/panoptic (e.g., person and cat), part (e.g., head and torso), subpart (e.g., ear and nose) and referring-expression (e.g., umbrella with a white pole) masks. HIPIE outperforms previous methods and establishes new SOTAs on these tasks regardless of their granularity or task specificity. Bottom images: our method can seamlessly integrate with SAM to enable class-aware image segmentation on SA-1B.

Image segmentation is a fundamental task in computer vision, enabling a wide range of applications such as object recognition, scene understanding, and image manipulation[51, 14, 43, 7, 38]. Recent advancements in large language models pave the way for open-vocabulary image segmentation, where models can handle a wide variety of object classes using text prompts. However, there is no single “correct” way to segment an image: the inherent ambiguity in segmentation stems from the fact that the interpretation of boundaries and regions within an image depends on the specific task at hand.

Existing methods for open-vocabulary image segmentation typically address the ambiguity in image segmentation by considering it as an external factor beyond the modeling process. In contrast, we adopt a different approach by embracing this ambiguity and present HIPIE, as illustrated in Fig. 1, a novel HIerarchical, oPen-vocabulary and unIvErsal image segmentation and detection model. This includes semantic-level segmentation, which focuses on segmenting objects based on their semantic meaning, as well as instance-level segmentation, which involves segmenting individual instances of objects or groups of objects (e.g., instance and referring segmentation).


Figure 2: Noticeable discrepancies exist in the between-class similarities of visual and textual features between stuff and thing classes. We propose a decoupled representation learning approach that effectively generates more discriminative visual and textual features. We extract similarity matrices for the visual features, obtained through a pretrained MAE[18] or our fine-tuned one, and for the text features, produced using a pretrained BERT[6] or a fine-tuned one. We report results on COCO-Panoptic[24] and measure the mean similarity (μ).

Additionally, our model captures finer details by incorporating part-level segmentation, which involves segmenting object parts/subparts. By encompassing different granularity, HIPIE allows for a more comprehensive and nuanced analysis of images, enabling a richer understanding of their contents.

To design HIPIE, we begin by investigating the design choices for open-vocabulary image segmentation (OIS). Existing methods on OIS typically adopt a text-image fusion mechanism, and employ a shared representation learning module for both stuff and thing classes[4, 63, 59, 10, 57]. Fig. 2 shows the similarity matrices of visual and textual features between stuff and thing classes. On this basis, we can draw several conclusions:

- Noticeable discrepancies exist in the between-class similarities of textual and visual features between stuff and thing classes.

- Stuff classes exhibit significantly higher levels of similarity in text features than thing classes.

This observation suggests that integrating textual features may yield more significant benefits in generating discriminative features for thing classes than for stuff classes. Consequently, for thing classes, we adopt an early image-text fusion approach to fully leverage the benefits of discriminative textual features. Conversely, for stuff classes, we utilize a late image-text fusion strategy to mitigate the potential negative effects introduced by non-discriminative textual features. Furthermore, the discrepancies in the visual and textual features between stuff and thing classes, along with the inherent differences in their characteristics (stuff classes require better capture of texture and materials, while thing classes often have well-defined geometry and require better capture of shape information), indicate the need to decouple the representation learning modules that produce stuff and thing masks.

In addition to instance/semantic-level segmentation, our model is capable of open-vocabulary hierarchical segmentation. Instead of treating part classes, like ‘dog leg’, as standard multi-word labels, we concatenate class names from different granularity as prompts. During training, we supervise the classification head using both part labels, such as ‘tail’, and instance labels, such as ‘dog’, and we explicitly contrast a mask embedding with both instance-level and part-level labels. In the inference stage, we perform two separate forward passes using the same image but different prompts to generate instance and part segmentation. This design choice empowers open-vocabulary hierarchical segmentation, allowing us to perform part segmentation on novel part classes by randomly combining classes from various granularity, such as ‘giraffe’ and ‘leg’, even if they have never been seen during training. By eliminating the constraints of predefined object classes and granularity, HIPIE offers a more flexible and adaptable solution for image segmentation.

We extensively benchmark HIPIE on various popular datasets to validate its effectiveness, including MSCOCO, ADE20K, Pascal Panoptic Part, and RefCOCO/RefCOCOg. HIPIE achieves state-of-the-art performance across all these datasets that cover a variety of tasks and granularity.

To the best of our knowledge, HIPIE is the first hierarchical, open-vocabulary and universal image segmentation and detection model (see Table 1). By decoupling representation learning and text-image fusion mechanisms for things vs. stuff classes, HIPIE overcomes the limitations of existing approaches and achieves state-of-the-art performance on various benchmarks.

2 Related Works

| Method | Open Vocab. | Instance Seg. | Semantic Seg. | Panoptic Seg. | Referring Seg. | Cls-agnostic Part Seg. | Cls-aware Part Seg. | Object Detection |
|---|---|---|---|---|---|---|---|---|
| SAM[25] | ✗ | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | * |
| SEEM[68] | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ | * |
| ODISE[57] | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ | * |
| UNINEXT[59]† | ✓ | ✓ | ✗ | ✗ | ✓ | ✗ | ✗ | ✓ |
| X-Decoder[67] | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ | * |
| G-DINO[37] | ✓ | ✓ | ✗ | ✗ | ✓ | ✗ | ✗ | ✓ |
| PPS[5] | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ | ✗ |
| HIPIE | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| vs. prev. SOTA | - | +5.1 | +2.0 | +1.3 | +0.5 | - | +5.2 | +3.2 |

Table 1: Our HIPIE is capable of performing all the listed segmentation and detection tasks and achieves state-of-the-art performance using a unified framework. We present performance comparisons with SOTA methods on a range of benchmark datasets: AP^mask for instance segmentation on MSCOCO[35], AP^box for object detection on MSCOCO, oIoU for referring segmentation on RefCOCO+[62], mIoU for semantic segmentation on Pascal Context[65], and mIoU_PartS for part segmentation on Pascal-Panoptic-Parts[5]. The second best performing method for each task is underlined. *: object detection can be conducted by generating bounding boxes from instance segmentation masks. ‘Seg.’ denotes segmentation. †: In principle, UNINEXT can take arbitrary texts as labels; however, the original work focused on closed-set performance and did not explore open-vocabulary inference.

Open-Vocabulary Semantic Segmentation[2, 54, 27, 16, 45, 33, 55, 56, 17] aims to segment an image into semantic regions indicated by text descriptions that may not have been seen during training. ZS3Net[2] combines a deep visual segmentation model with an approach to generate visual representations from semantic word embeddings to learn pixel-wise classifiers for novel categories. LSeg[27] uses CLIP’s text encoder[44] to generate the corresponding semantic class’s text embedding, which it then aligns with the pixel embeddings. OpenSeg[16] adopts a grouping strategy for pixels prior to learning visual-semantic alignments. By aligning each word in a caption to one or a few predicted masks, it can scale up the dataset and vocabulary sizes. GroupViT[55] is trained on a large-scale image-text dataset using contrastive losses. With text supervision alone, the model learns to group semantic regions together. OVSegmentor[56] uses learnable group tokens to cluster image pixels, aligning them with the corresponding caption embeddings.

Open-Vocabulary Panoptic Segmentation (OPS) unifies semantic and instance segmentation, and aims to perform these two tasks for arbitrary categories of text-based descriptions during inference time[10, 57, 67, 68, 59]. MaskCLIP[10] first predicts class-agnostic masks using a mask proposal network. Then, it refines the mask features through Relative Mask Attention interactions with the CLIP visual model and integrates the CLIP text embeddings for open-vocabulary classification. ODISE[57] unifies Stable Diffusion[48], a pre-trained text-image diffusion model, with text-image discriminative models, e.g. CLIP[44], to perform open-vocabulary panoptic segmentation. X-Decoder[67] takes two types of queries as input: generic non-semantic queries that aim to decode segmentation masks for universal segmentation, and textual queries to make the decoder language-aware for various open-vocabulary vision tasks. UNINEXT[59] unifies diverse instance perception tasks into an object discovery and retrieval paradigm, enabling flexible perception of open-vocabulary objects by changing the input prompts.

Referring Segmentation learns valid multimodal features between visual and linguistic modalities to segment the target object described by a given natural language expression[20, 61, 21, 23, 13, 60, 53, 36, 64]. It can be divided into two main categories: 1) Decoder-fusion based method[8, 52, 64, 36] first extracts vision features and language features, respectively, and then fuses them with a multi-modal design. 2) Encoder-fusion based method[13, 60, 31] fuses the language features into the vision features early in the vision encoder.

Parts Segmentation learns to segment instances into more fine-grained masks. PPP[5] established a baseline for hierarchical understanding of images by combining a scene-level panoptic segmentation model and a part-level segmentation model. JPPF[22] improved this baseline by introducing a joint Panoptic-Part Fusion module that achieves comparable performance with significantly smaller models.

Promptable Segmentation. The Segment Anything Model (SAM)[25] is an approach for building a fully automatic promptable image segmentation model that can incorporate various types of human interventions, such as texts, masks, and points. SEEM[68] proposes a unified prompting scheme that encodes user intents into prompts in a joint visual-semantic space. This approach enables SEEM to generalize to unseen prompts for segmentation, achieving open-vocabulary and zero-shot capabilities. Referring segmentation can also be considered as promptable segmentation with text prompts.

Comparison with Previous Work. Table 1 compares our HIPIE method with previous work in terms of key properties. Notably, HIPIE is the only method that supports open-vocabulary universal image segmentation and detection, enabling object detection and instance-, semantic-, panoptic-, hierarchical- (whole instance, part, subpart), and referring-segmentation tasks, all within a single unified framework.


Figure 3: Diagram of HIPIE for hierarchical, universal and open-vocabulary image segmentation and detection. The image and text prompts are first passed to the image and text encoders to obtain visual features F_v and text features F_t. Early fusion is then applied to merge image and text features to get F_v′ and F_t′. Two independent decoders are used for thing (foreground) classes and stuff (background) classes.

3 Method

We consider all relevant tasks under the unified framework of language-guided segmentation, which performs open-vocabulary segmentation and detection tasks for arbitrary text-based descriptions.

3.1 Overall Framework

The proposed HIPIE model comprises three main components, as illustrated in Fig. 3:

  1. Text-image feature extraction and information fusion (detailed in Secs. 3.2, 3.3 and 3.4): We first generate a text prompt T from labels or referring expressions. Then, we extract image (I) and text (T) features F_v = Enc_v(I), F_t = Enc_t(T) using an image encoder Enc_v and a text encoder Enc_t, respectively. We then perform feature fusion to obtain fused features F_v′, F_t′ = FeatureFusion(F_v, F_t).

  2. Foreground (referred to as things) and background (referred to as stuff) mask generation (detailed in Sec. 3.5): Each decoder takes a set of image features and text features and returns sets of masks, bounding boxes, and object embeddings (M, B, E). We compute the foreground and background proposals and concatenate them to obtain the final proposals and masks as follows:

Stuff: (M_2, B_2, E_2) = StuffDecoder(F_v, F_t)
Thing: (M_1, B_1, E_1) = ThingDecoder(FeatureFusion(F_v, F_t))
Overall: (M, B, E) = (M_1 ⊕ M_2, B_1 ⊕ B_2, E_1 ⊕ E_2)   (1)

where ⊕ denotes the concatenation operation.
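The decode-and-concatenate step of Eq. 1 can be sketched with stand-in decoders; the decoder internals and the mask/box/embedding shapes here are hypothetical (only the 900/300 query counts come from the paper):

```python
import numpy as np

def stand_in_decoder(n_props, h=8, w=8, d=16):
    """Hypothetical decoder output: masks M, boxes B, object embeddings E."""
    M = np.zeros((n_props, h, w))   # (N, H, W) mask logits
    B = np.zeros((n_props, 4))      # (N, 4) box coordinates
    E = np.zeros((n_props, d))      # (N, d) object embeddings
    return M, B, E

# Thing and stuff proposals come from two independent decoders ...
M1, B1, E1 = stand_in_decoder(n_props=900)   # thing decoder (fused features)
M2, B2, E2 = stand_in_decoder(n_props=300)   # stuff decoder

# ... and are concatenated (the ⊕ operation) into the final proposal set.
M = np.concatenate([M1, M2], axis=0)
B = np.concatenate([B1, B2], axis=0)
E = np.concatenate([E1, E2], axis=0)
print(M.shape, B.shape, E.shape)   # → (1200, 8, 8) (1200, 4) (1200, 16)
```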

  3. Proposal and mask retrieval using text prompts (detailed in Sec. 3.6): To assign class labels to these proposals, we compute the cosine similarity between the object embedding E and the corresponding embedding E_i′ of class i ∈ {1, 2, …, c}. For a set of category names, the expression is a concatenated string containing all categories. We obtain E_i′ by pooling tokens corresponding to each label from the encoded sequence F_t. For referring expressions, we take the [CLS] token from the BERT output as E_i′.
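The cosine-similarity label assignment above can be sketched as follows; `assign_labels`, the array shapes, and the toy embeddings are illustrative stand-ins, not the actual HIPIE implementation:

```python
import numpy as np

def assign_labels(E, E_cls):
    """Assign each of N proposals to one of c classes by cosine similarity.

    E:     (N, d) object embeddings from the decoders.
    E_cls: (c, d) per-class text embeddings pooled from the encoded prompt.
    Returns the index of the most similar class for each proposal.
    """
    E_n = E / np.linalg.norm(E, axis=1, keepdims=True)
    C_n = E_cls / np.linalg.norm(E_cls, axis=1, keepdims=True)
    sim = E_n @ C_n.T   # (N, c) cosine similarities
    return sim.argmax(axis=1)

# Toy check: each proposal embedding is a noisy copy of one class embedding.
E_cls = np.eye(3)                  # 3 classes, d = 3
E = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.2, 0.8]])
print(assign_labels(E, E_cls))     # → [0 2]
```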

3.2 Text Prompts

Text prompting is a common approach used in open-vocabulary segmentation models[20, 61, 58, 59].

For open-vocabulary instance segmentation, panoptic segmentation, and semantic segmentation, the set of all labels C is concatenated into a single text prompt T_i using a “.” delimiter. Given an image I and a set of text prompts T, the model aims to classify N masks in the label space C ∪ {“other”}, where N is the maximum number of mask proposals generated by the model.
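A minimal sketch of this prompt construction (the exact delimiter spacing used by the actual implementation may differ):

```python
def build_prompt(labels):
    """Concatenate category names into one text prompt with a '.' delimiter,
    mirroring the label-to-prompt construction described above."""
    return ".".join(labels)

labels = ["person", "cat", "sky"]
print(build_prompt(labels))   # → person.cat.sky
```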

For referring expressions, the text prompt is simply the sentence itself. The goal is to locate one mask in the image corresponding to the language expression.

3.3 Image and Text Feature Extraction

We employ a pretrained BERT model[6] to extract features for text prompts. Because the BERT-base model can only process input sequences up to 512 tokens, we divide longer sequences into segments of 512 tokens and encode each segment individually. The resulting features are then concatenated to obtain features of the original sequence length.
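The chunked encoding can be sketched as below, with a stand-in `encoder` in place of the real BERT model (a real encoder would return one feature vector per token):

```python
def encode_long_sequence(tokens, encoder, max_len=512):
    """Encode a token sequence longer than the encoder's 512-token limit by
    splitting it into max_len segments, encoding each segment independently,
    and concatenating the per-token features back to full length."""
    features = []
    for start in range(0, len(tokens), max_len):
        segment = tokens[start:start + max_len]
        features.extend(encoder(segment))
    return features

# Stand-in encoder: one placeholder feature per token.
dummy_encoder = lambda seg: [f"feat({t})" for t in seg]
tokens = list(range(1200))             # 1200 tokens > 512-token limit
out = encode_long_sequence(tokens, dummy_encoder)
print(len(out))                        # → 1200
```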

We utilize ResNet-50[19] and Vision Transformer (ViT)[11] as base architectures for image encoding. In the case of ResNet-50, we extract multiscale features from the last three blocks and denote them as F v subscript 𝐹 𝑣 F_{v}italic_F start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT. For ViT, we use the output features from blocks 8, 16, and 32 as the multiscale features F v subscript 𝐹 𝑣 F_{v}italic_F start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT.


Figure 4: Various design choices for generating thing and stuff masks with arbitrary text descriptions. In version a), we use a single decoder for all masks, with early fusion applied. In version b), two independent decoders are used for thing and stuff classes, with early fusion adopted for both decoders. Version c) is identical to version b), except that the stuff decoder does not use early fusion.

3.4 Text-Image Feature Fusion

We explored several design choices for text-image feature fusion and mask generation modules, as shown in Fig. 4 and Sec. 4.2, and discovered that Fig. 4c) gives the optimal performance. We adopt bi-directional cross-attention (Bi-Xattn) to extract text-guided visual features F_t2v and image-guided text features F_v2t. These attentive features are then integrated with the vanilla text features F_t and image features F_v through residual connections, as shown below:

F_t2v, F_v2t = Bi-Xattn(F_v, F_t)
(F_v′, F_t′) = (F_v + F_t2v, F_t + F_v2t)   (2)

where F_v and F_t represent the visual and text-prompt features, respectively.
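The residual fusion above can be sketched as follows; the single-head, projection-free attention and the toy feature shapes are simplifications of the actual Bi-Xattn module:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(Q, K, V):
    """Scaled dot-product attention from queries Q onto keys/values K, V."""
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def bi_xattn_fuse(F_v, F_t):
    """Bi-directional cross-attention with residual connections:
    F_v' = F_v + F_t2v and F_t' = F_t + F_v2t."""
    F_t2v = cross_attention(F_v, F_t, F_t)   # text-guided visual features
    F_v2t = cross_attention(F_t, F_v, F_v)   # image-guided text features
    return F_v + F_t2v, F_t + F_v2t

F_v = np.random.randn(10, 64)   # 10 visual tokens, 64-dim (toy sizes)
F_t = np.random.randn(5, 64)    # 5 text tokens
F_v_p, F_t_p = bi_xattn_fuse(F_v, F_t)
print(F_v_p.shape, F_t_p.shape)   # → (10, 64) (5, 64)
```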

3.5 Thing and Stuff Mask Generation

We then generate masks and proposals for the thing and stuff classes by utilizing the fused features F_v′ and F_t′ obtained in Sec. 3.4.

Model Architecture. While architectures such as Mask2Former and MaskDINO[4, 29] can perform instance, semantic and panoptic segmentation simultaneously, models trained jointly show inferior performance compared with the same model trained for a specific task (e.g. instance segmentation only). We hypothesize that this may result from the different distribution of spatial location and geometry of foreground instance masks and background semantic masks. For example, instance masks are more likely to be connected, convex shapes constrained by a bounding box, whereas semantic masks may be disjoint, irregular shapes spanning across the whole image.

To address this issue, in stark contrast to previous approaches[59, 37, 58] that use a unified decoder for both stuff and things, we decouple the stuff and thing mask prediction using two separate decoders. For the thing decoder, we adopt Deformable DETR[66] with a mask head following the UNINEXT[59] architecture and incorporate the denoising procedures proposed by DINO[63]. For the stuff decoder, we use the architecture of MaskDINO[29].

Proposal and Ground-Truth Matching Mechanisms. We make the following distinctions between the two heads. For the thing decoder, we adopt simOTA[15] to perform many-to-one matching between box proposals and ground truth when calculating the loss. We also use box-IoU-based NMS to remove duplicate predictions. For the stuff decoder, we adopt one-to-one Hungarian matching[26]. Additionally, we disable the box loss for stuff masks. We set the number of queries to 900 for things and 300 for stuff.
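The stuff decoder's one-to-one Hungarian matching can be illustrated with SciPy; the cost matrix here is a toy stand-in for the real matching cost, which combines classification, box, and mask terms:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy cost matrix: 4 proposals x 3 ground-truth masks. Hungarian matching
# picks the one-to-one assignment with minimum total cost; the surplus
# proposal is left unmatched (treated as "no object").
cost = np.array([[0.1, 0.9, 0.8],
                 [0.7, 0.2, 0.9],
                 [0.8, 0.9, 0.3],
                 [0.5, 0.5, 0.5]])
rows, cols = linear_sum_assignment(cost)
pairs = [(int(r), int(c)) for r, c in zip(rows, cols)]
print(pairs)   # → [(0, 0), (1, 1), (2, 2)]
```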

Loss Functions. For both decoders, we calculate the class logits as the normalized dot product between mask embeddings (M) and text embeddings (F_t′). We adopt Focal Loss[34] for classification outputs, L1 loss and GIoU loss[46] for box predictions, and pixel-wise binary classification loss and DICE loss[50] for mask predictions. Given predictions (M_1, B_1, E_1), (M_2, B_2, E_2), ground-truth labels (M′, B′, C) and their foreground and background subsets (M_f′, B_f′, C_f) and (M_b′, B_b′, C_b), the final loss is computed as

L_thing = λ_cls L_cls(E_1, C_f′) + λ_mask L_mask(M_1, M_f′) + λ_box L_box(B_1, B_f′)
L_stuff = λ_cls L_cls(E_2, C′) + λ_mask L_mask(M_2, M′) + λ_box L_box(B_2, B_b′)
L = L_thing + L_stuff   (3)

where $\mathcal{L}_{\text{box}} = \lambda_{L1}\mathcal{L}_{L1} + \lambda_{\text{giou}}\mathcal{L}_{\text{giou}}$, $\mathcal{L}_{\text{mask}} = \lambda_{\text{ce}}\mathcal{L}_{\text{ce}} + \lambda_{\text{dice}}\mathcal{L}_{\text{dice}}$, and $\mathcal{L}_{\text{cls}} = \mathcal{L}_{\text{focal}}$. Note that although we do not use the stuff decoder for thing prediction, we still match its predictions with things and compute the class and box losses during training. We find that this auxiliary loss makes the stuff decoder aware of the thing distribution and improves final performance.
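The structure of Eq. 3 can be sketched as follows. All lambda weights here are illustrative placeholders (not the paper's tuned values), and the per-term losses are assumed to be precomputed scalars; in the real model they are Hungarian-matched set losses over predictions and targets.

```python
# Sketch of the decoupled thing/stuff loss of Eq. 3.
# All weights are illustrative defaults, not the paper's hyperparameters.

def box_loss(l1, giou, lam_l1=1.0, lam_giou=1.0):
    # L_box = lambda_L1 * L_L1 + lambda_giou * L_giou
    return lam_l1 * l1 + lam_giou * giou

def mask_loss(ce, dice, lam_ce=1.0, lam_dice=1.0):
    # L_mask = lambda_ce * L_ce + lambda_dice * L_dice
    return lam_ce * ce + lam_dice * dice

def branch_loss(focal, ce, dice, l1, giou,
                lam_cls=1.0, lam_mask=1.0, lam_box=1.0):
    # One row of Eq. 3: lambda_cls*L_cls + lambda_mask*L_mask + lambda_box*L_box
    return (lam_cls * focal
            + lam_mask * mask_loss(ce, dice)
            + lam_box * box_loss(l1, giou))

# L = L_thing + L_stuff, with toy scalar loss values
thing = branch_loss(0.5, 0.2, 0.3, 0.1, 0.2)
stuff = branch_loss(0.4, 0.1, 0.2, 0.1, 0.1)
total = thing + stuff
```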

3.6 Open-Vocabulary Universal Segmentation

In the closed-set setting, we simply merge the outputs of the two decoders and apply the standard post-processing of UNINEXT[59] and MaskDINO[29] to obtain the final output.

In the zero-shot open-vocabulary setting, we follow ODISE[57] and combine our classification logits with those of a text-image discriminative model, e.g., CLIP[44]. Specifically, given a mask $M$ on image $I$, its features $E$, and test classes $C_{\text{test}}$, we first compute the probability $p_1(E, C_{\text{test}}) = \mathbb{P}(C_{\text{test}} \mid E)$ in the standard way described above. We additionally compute mask-pooled features of $M$ from the CLIP vision encoder $\mathcal{V}$ as $E_{\text{CLIP}} = \text{MaskPooling}(M, \mathcal{V}(I))$. We then compute the CLIP logits $p_2(E, C_{\text{test}}) = \mathbb{P}(C_{\text{test}} \mid E_{\text{CLIP}})$ as the similarity between the CLIP text features and $E_{\text{CLIP}}$. Finally, we combine the two predictions as

$$p_{\text{final}}(E, C_{\text{test}}) \propto p_1(E, C_{\text{test}})^{\lambda}\, p_2(E, C_{\text{test}})^{1-\lambda} \qquad (4)$$

where $\lambda$ is a balancing factor. Empirically, we find that this combination outperforms relying solely on either the CLIP features or the closed-set logits.
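A minimal sketch of the geometric interpolation in Eq. 4, assuming both branches already produce per-class probabilities; the value of `lam` and the `eps` smoothing are illustrative, not the paper's tuned settings.

```python
import numpy as np

def combine_logits(p1, p2, lam=0.6, eps=1e-8):
    """Fuse in-vocabulary probabilities p1 and CLIP probabilities p2
    via the geometric interpolation of Eq. 4, then renormalize.
    lam and eps are illustrative values."""
    p = (p1 + eps) ** lam * (p2 + eps) ** (1.0 - lam)
    return p / p.sum(axis=-1, keepdims=True)

# Toy example: the closed-set head and CLIP disagree on a 3-class vocabulary.
p_det  = np.array([0.7, 0.2, 0.1])   # p1: closed-set classification head
p_clip = np.array([0.3, 0.6, 0.1])   # p2: mask-pooled CLIP similarity
p_final = combine_logits(p_det, p_clip)
```

Because the interpolation is multiplicative, a class must score reasonably under both models to dominate, which matches the intuition that neither source alone is reliable for unseen categories.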


Figure 5: Hierarchical segmentation pipeline. We concatenate the instance class names and part class names as labels. During the training process, we supervise the classification head using both part labels and instance labels. During inference, we perform two separate forward passes using the same image but different prompts to generate instance and part segmentations. By combining the part segmentation and instance segmentation of the same image, we obtain hierarchical segmentation results on the right side.

3.7 Hierarchical segmentation

In addition to instance-level segmentation, we can also perform part-aware hierarchical segmentation. We concatenate instance class names and part class names to form labels, e.g., "human ear" and "cat head". During training, we supervise the classification head with both part labels and instance labels; specifically, we replace $\mathcal{L}_{\text{cls}}$ with $\mathcal{L}_{\text{clsPart}} + \mathcal{L}_{\text{clsThing}}$ in Eq.3. We combine the part segmentation and instance segmentation of the same image to obtain part-aware instance segmentation. Additional layers of hierarchy are obtained by grouping parts; for example, the "head" consists of ears, hair, eyes, nose, etc. Fig.5 illustrates this process, and Fig.A1 highlights the differences between our approach and other methods.
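The label construction and part grouping can be sketched as below; the class names and the single "head" group are illustrative examples, not the full part taxonomy used in the paper.

```python
# Sketch of hierarchical label construction (illustrative class names).

instance_classes = ["human", "cat"]
part_classes = ["ear", "hair", "eye", "nose", "torso"]

# Concatenated "instance part" labels used to supervise the part head,
# e.g. "human ear", "cat nose".
part_labels = [f"{inst} {part}"
               for inst in instance_classes
               for part in part_classes]

# An extra hierarchy level is obtained by grouping parts: a "head"
# label aggregates the masks of its member parts.
groups = {"head": {"ear", "hair", "eye", "nose"}}

def group_label(part: str) -> str:
    """Map a part to its group label; ungrouped parts keep their own name."""
    for group, members in groups.items():
        if part in members:
            return group
    return part
```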

3.8 Class-aware part segmentation with SAM

We can also perform class-aware hierarchical segmentation by combining our semantic output with the class-agnostic masks produced by SAM[25]. Specifically, given semantic masks $M$, their class probabilities $P_M$, and SAM-generated part masks $S$, we compute the class probability of mask $S_i \in S$ with respect to class $j$ as

$$P_S(S_i, j) \propto \sum_{M_k \in M} P_M(M_k, j)\,|M_k \cap S_i| \qquad (5)$$

where $|M_k \cap S_i|$ is the area of the intersection between masks $M_k$ and $S_i$. We combine our semantic output with SAM because our pretraining datasets contain only object-centric masks, whereas the SA-1B dataset used by SAM contains many local segments and object parts.
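Eq. 5 amounts to an intersection-area-weighted vote: each SAM mask inherits a mixture of the semantic masks' class distributions. A vectorized sketch, assuming boolean mask arrays and a precomputed probability matrix:

```python
import numpy as np

def classify_sam_masks(sam_masks, sem_masks, sem_probs):
    """Assign class probabilities to class-agnostic SAM masks via Eq. 5.

    sam_masks: (S, H, W) boolean part masks from SAM
    sem_masks: (K, H, W) boolean semantic masks M
    sem_probs: (K, C) per-mask class probabilities P_M
    Returns (S, C) probabilities P_S, rows renormalized to sum to 1.
    """
    S, H, W = sam_masks.shape
    K = sem_masks.shape[0]
    # |M_k ∩ S_i| for all pairs -> (S, K)
    inter = (sam_masks.reshape(S, 1, H * W) &
             sem_masks.reshape(1, K, H * W)).sum(-1)
    # Weighted mixture of class distributions -> (S, C)
    scores = inter.astype(float) @ sem_probs
    norm = scores.sum(-1, keepdims=True)
    return np.divide(scores, norm,
                     out=np.zeros_like(scores), where=norm > 0)
```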

4 Experiments

We comprehensively evaluate HIPIE through quantitative and qualitative analyses to demonstrate its effectiveness in performing various types of open-vocabulary segmentation and detection tasks. The implementation details of HIPIE are explained in Sec.4.1. Sec.4.2 presents the evaluation results of HIPIE. Additionally, we conduct an ablation study of various design choices in Sec.4.3.

4.1 Implementation Details

Model learning settings can be found in the appendix.


Figure 6: Qualitative analysis of open-vocabulary hierarchical segmentation. Thanks to our hierarchical design, our model produces higher-quality masks. In particular, it can generalize to novel hierarchies that do not exist in part segmentation datasets.

Evaluation Metrics. Semantic segmentation performance is evaluated using the mean Intersection-over-Union (mIoU) metric. For part segmentation, we report mIoU_PartS, the mean IoU for part segmentation on grouped part classes [5]. Object detection and instance segmentation results are measured using the COCO-style evaluation metric, mean average precision (AP) [35]. Panoptic segmentation is evaluated using the Panoptic Quality (PQ) metric [24]. Referring image segmentation (RIS) [20, 61] is evaluated with overall IoU (oIoU).
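For reference, the mIoU metric reduces to per-class intersection-over-union averaged over the classes present. A minimal sketch (not the official COCO/ADE20K evaluator, which additionally handles ignore labels and dataset-specific class lists):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Minimal mIoU sketch: per-class IoU averaged over classes that
    appear in the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        p, g = (pred == c), (gt == c)
        union = (p | g).sum()
        if union == 0:
            continue  # class absent from both; skip it
        ious.append((p & g).sum() / union)
    return float(np.mean(ious))
```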

4.2 Results

Table 2: Open-vocabulary panoptic segmentation (PQ), instance segmentation (AP^mask), semantic segmentation (mIoU), part segmentation (mIoU_PartS), and object detection (AP^box). N/A: not applicable. -: not reported.

| Method | Backbone | COCO PQ | COCO AP^mask | COCO AP^box | COCO mIoU | ADE20K PQ | ADE20K AP^mask | ADE20K AP^box | ADE20K mIoU | PAS-P mIoU_PartS |
|---|---|---|---|---|---|---|---|---|---|---|
| MaskCLIP [10] | ViT16 | - | - | - | - | 15.1 | 6.0 | - | 23.7 | - |
| X-Decoder [67] | FocalT | 52.6 | 41.3 | - | 62.4 | 18.8 | 9.8 | - | 25.0 | - |
| X-Decoder | DaViT-B | 56.2 | 45.8 | - | 66.0 | 21.1 | 11.7 | - | 27.2 | - |
| SEEM [68] | FocalT | 50.6 | 39.5 | - | 61.2 | - | - | - | - | - |
| SEEM | DaViT-B | 56.2 | 46.8 | - | 65.3 | - | - | - | - | - |
| ODISE [57] | ViT-H+SD | 55.4 | 46.0 | 46.1 | 65.2 | 22.6 | 14.4 | 15.8 | 29.9 | - |
| JPPF [22] | EffNet-b5 | - | - | - | - | - | - | - | - | 54.4 |
| PPS [5] | RNST269 | - | - | - | - | - | - | - | - | 58.6 |
| HIPIE | RN50 | 52.7 | 45.9 | 53.9 | 59.5 | 18.4 | 13.0 | 16.2 | 26.8 | 57.2 |
| HIPIE | ViT-H | 58.0 | 51.9 | 61.3 | 66.8 | 20.6 | 15.0 | 18.7 | 29.0 | 63.8 |

Table 3: Open-vocabulary panoptic segmentation (PQ), instance segmentation (AP^mask), semantic segmentation (mIoU), part segmentation (mIoU_PartS), and object detection (AP^box). N/A: not applicable. -: not reported.

| Method | COCO PQ | COCO AP^mask | COCO AP^box | COCO mIoU | ADE20K PQ | ADE20K AP^mask | ADE20K AP^box | ADE20K mIoU | PAS-P mIoU_PartS | RefCOCO oIoU | RefCOCO+ oIoU | RefCOCOg oIoU | A-847 mIoU |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| prev. SOTA | 56.2 | 46.8 | 46.1 | 66.0 | 22.6 | 14.4 | 15.8 | 29.9 | 58.6 | 82.2 | 72.5 | 74.7 | 9.2 |
| HIPIE | 58.0 | 51.9 | 61.3 | 66.8 | 22.9 | 19.0 | 22.9 | 29.0 | 63.8 | 82.6 | 73.0 | 75.3 | 9.7 |

Table 4: Open-Vocabulary Universal Segmentation. We compare against other universal multi-task segmentation models. (*) denotes pretraining dataset of representations.

| Method | Data | A-150 PQ | A-150 AP^mask | A-150 AP^box | A-150 mIoU | A-847 mIoU | CTX459 mIoU | SeginW AP^mask |
|---|---|---|---|---|---|---|---|---|
| OpenSeed | O365, COCO | 19.7 | 15.0 | 17.7 | 23.4 | - | - | 36.1 |
| X-Decoder | COCO, CC3M, SBU-C, VG, COCO-Caption, (Florence) | 21.8 | 13.1 | - | 29.6 | 9.2 | 16.1 | 32.2 |
| UNINEXT | O365, COCO, RefCOCO | 8.9 | 14.9 | 11.9 | 6.4 | 1.8 | 5.8 | 42.1 |
| HIPIE w/o CLIP | O365, COCO, RefCOCO, PACO | 18.1 | 16.7 | 20.2 | 19.8 | 4.8 | 12.2 | 41.0 |
| HIPIE w/ CLIP | + (CLIP) | 22.9 | 19.0 | 22.9 | 29.0 | 9.7 | 14.4 | 41.6 |

Table 5: Comparison on open-vocabulary semantic segmentation. Baseline results are copied from [57].

Table 6: An ablation study on different decoder and text-image fusion designs, as depicted in Fig.4. We report PQ for panoptic segmentation on MSCOCO, AP^mask for instance segmentation on MSCOCO, and oIoU for referring segmentation on RefCOCO's validation set. Our final choice is highlighted in gray.

| Method | A-150 | PC-59 | PAS-21 | COCO |
|---|---|---|---|---|
| ZS3Net [2] | - | 19.4 | 38.3 | - |
| LSeg+ [27, 16] | 18.0 | 46.5 | - | 55.1 |
| HIPIE | 26.8 | 53.6 | 75.7 | 59.5 |
| vs. prev. SOTA | +7.1 | +10.7 | +28.3 | +4.4 |
| GroupViT [55] | 10.6 | 25.9 | 50.7 | 21.1 |
| OpenSeg [16] | 21.1 | 42.1 | - | 36.1 |
| MaskCLIP [10] | 23.7 | 45.9 | - | - |
| ODISE [57] | 29.9 | 57.3 | 84.6 | 65.2 |
| HIPIE | 29.0 | 59.3 | 83.3 | 66.8 |
| vs. prev. SOTA | -0.9 | +2.0 | -1.3 | +1.6 |

| Decoder | Fusion (things) | Fusion (stuff) | PQ | AP^mask | oIoU |
|---|---|---|---|---|---|
| Unified | | | 45.1 | 42.9 | 67.1 |
| Decoupled | | | 50.6 | 43.6 | 67.6 |
| Unified (Fig.4a) | ✓ | ✓ | 44.6 | 42.5 | 66.8 |
| Decoupled (Fig.4b) | ✓ | ✓ | 50.0 | 44.4 | 77.1 |
| Decoupled (Fig.4c) | ✓ | | 51.3 | 44.4 | 77.3 |

Panoptic Segmentation. We examine Panoptic Quality (PQ) on MSCOCO[35] for the closed-set setting and on ADE20K[65] for open-set zero-shot transfer. As shown in Sec.4.2, our model outperforms the previous closed-set state of the art with a ViT-H backbone by +1.8 PQ. In addition, we match the best open-set PQ results while supporting more tasks and using a simpler backbone than ODISE [57].

Semantic Segmentation. The evaluation of our model's performance on various open-vocabulary semantic segmentation datasets is presented in Sec.4.2. These datasets include: 1) A-150, comprising 150 common classes from ADE20K[65]; 2) A-847, including all 847 classes from ADE20K[65]; 3) PC-59, consisting of 59 common classes from Pascal Context[40]; 4) PC-459, encompassing the full 459 classes of Pascal Context[40]; and 5) PAS-21, the vanilla Pascal VOC dataset[12], containing 20 foreground classes and 1 background class. These diverse datasets enable a comprehensive evaluation across settings with varying vocabulary sizes and dataset complexities, demonstrating our model's effectiveness and versatility in detecting and segmenting a wide range of object categories in real-world scenarios.

Part Segmentation. We evaluate our model's performance on the Pascal-Panoptic-Parts dataset [5] and report mIoU_PartS in Sec.4.2, following the standard grouping from [5]. Our model outperforms the state of the art by +5.2 on this metric. We also provide qualitative comparisons with Grounding DINO + SAM in Fig.7. Our findings reveal that the results of Grounded-SAM are heavily constrained by the detection performance of Grounding DINO; as a result, it cannot fully leverage SAM's ability to produce accurate and fine-grained part segmentation masks.


Figure 7: Results of merging HIPIE with SAM for class-aware image segmentation on SA-1B dataset. Grounded-SAM (Grounding DINO + SAM)[30, 25] cannot fully leverage the benefits of SAM in producing accurate and fine-grained part segmentation masks. Our method demonstrates fewer misclassifications and overlooked masks across the SA-1B dataset compared to the Grounded-SAM approach.

Table 7: Comparisons on the instance segmentation and object detection tasks. We evaluate model performance on the validation set of MSCOCO.

Table 8: Comparison on the referring image segmentation (RIS) task. We evaluate the model performance on the validation sets of RefCOCO, RefCOCO+, and RefCOCOg datasets using overall IoU (oIoU) metrics.

| Method | Backbone | AP | AP_S | AP_M | AP_L |
|---|---|---|---|---|---|
| Deform. DETR [66] | RN50 | 46.9 | 29.6 | 50.1 | 61.6 |
| DN-DETR [28] | RN50 | 48.6 | 31.0 | 52.0 | 63.7 |
| UNINEXT [59] | RN50 | 51.3 | 32.6 | 55.7 | 66.5 |
| HIPIE | RN50 | 53.9 | 37.5 | 58.0 | 68.0 |
| vs. prev. SOTA | | +2.6 | +4.9 | +2.3 | +1.5 |
| Cas. Mask-RCNN [3] | CNeXtL | 54.8 | - | - | - |
| ViTDet-H [32] | ViT-H | 58.7 | - | - | - |
| UNINEXT [59] | ViT-H | 58.1 | 40.7 | 62.5 | 73.6 |
| HIPIE | ViT-H | 61.3 | 45.8 | 65.7 | 75.9 |
| vs. prev. SOTA | | +3.2 | +5.1 | +3.2 | +2.3 |

| Method | Backbone | RefCOCO oIoU | RefCOCO+ oIoU | RefCOCOg oIoU |
|---|---|---|---|---|
| MAttNet [61] | RN101 | 56.5 | 46.7 | 47.6 |
| VLT [9] | Dark56 | 65.7 | 55.5 | 53.0 |
| RefTR [41] | RN101 | 74.3 | 66.8 | 64.7 |
| UNINEXT [59] | RN50 | 77.9 | 66.2 | 70.0 |
| UNINEXT [59] | ViT-H | 82.2 | 72.5 | 74.7 |
| HIPIE | RN50 | 78.3 | 66.2 | 69.8 |
| HIPIE | ViT-H | 82.6 | 73.0 | 75.3 |
| vs. prev. SOTA | | +0.4 | +0.5 | +0.6 |

Object Detection and Instance Segmentation. We evaluate our model's object detection and instance segmentation capabilities following previous works[29, 68, 57]. On the MSCOCO[35] and ADE20K[65] datasets, HIPIE achieves gains of +5.1 and +0.6 AP^mask, respectively. Detailed comparisons are provided in Sec.4.2, which demonstrate state-of-the-art results with both ResNet and ViT architectures consistently across all average precision metrics.

Referring Segmentation. Referring image segmentation (RIS) tasks are examined using the RefCOCO, RefCOCO+, and RefCOCOg datasets. Our model outperforms all the other alternatives by an average of +0.5 in overall IoU (oIoU).

