lorebianchi98 committed
Commit da070a4 · verified · 1 Parent(s): 81a24d7

Push model using huggingface_hub.

Files changed (3)
  1. README.md +4 -118
  2. config.json +0 -6
  3. model.safetensors +1 -1
README.md CHANGED
@@ -1,124 +1,10 @@
  ---
- license: other
- license_name: dinov3-license
- pipeline_tag: image-segmentation
- library_name: Pytorch
  tags:
  - model_hub_mixin
  - pytorch_model_hub_mixin
- - DINOv3
- - CLIP
- - open-vocabulary segmentation
  ---

- <div align="center">
- <h1>
- Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation (ICCV 2025)
- </h1>
-
- <h3>
- <a href="https://www.linkedin.com/in/luca-barsellotti/">Luca Barsellotti*</a>&ensp;
- <a href="https://www.linkedin.com/in/lorenzo-bianchi-893bb225a/">Lorenzo Bianchi*</a>&ensp;
- <a href="https://www.linkedin.com/in/nicola-messina-a33848164/">Nicola Messina</a>&ensp;
- <a href="https://www.linkedin.com/in/fabio-carrara-b28a2b111/">Fabio Carrara</a>&ensp;
- <a href="https://aimagelab.ing.unimore.it/imagelab/person.asp?idpersona=90">Marcella Cornia</a>&ensp;
- <a href="https://www.lorenzobaraldi.com/">Lorenzo Baraldi</a>&ensp;
- <a href="https://fabriziofalchi.it">Fabrizio Falchi</a>&ensp;
- <a href="https://www.linkedin.com/in/rita-cucchiara-a4653a13/">Rita Cucchiara</a>
- </h3>
-
- [Project Page](https://lorebianchi98.github.io/Talk2DINO/) | [Paper](http://arxiv.org/abs/2411.19331) | [Code](https://github.com/lorebianchi98/Talk2DINO)
-
- </div>
-
- <div align="center">
- <figure>
- <img alt="Overview of Talk2DINO" src="./assets/overview.png" width="90%">
- </figure>
- </div>
-
- ## About
- Open-Vocabulary Segmentation (OVS) aims at segmenting images from free-form textual concepts without predefined training classes. While existing vision-language models such as CLIP can generate segmentation masks by leveraging coarse spatial information from Vision Transformers, they face challenges in spatial localization due to their global alignment of image and text features. Conversely, self-supervised visual models like DINO excel in fine-grained visual encoding but lack integration with language. To bridge this gap, we present Talk2DINO, a novel hybrid approach that combines the spatial accuracy of DINOv2 with the language understanding of CLIP. Our approach aligns the textual embeddings of CLIP to the patch-level features of DINOv2 through a learned mapping function without the need to fine-tune the underlying backbones. At training time, we exploit the attention maps of DINOv2 to selectively align local visual patches with textual embeddings. We show that the powerful semantic and localization abilities of Talk2DINO can enhance the segmentation process, resulting in more natural and less noisy segmentations, and that our approach can also effectively distinguish foreground objects from the background. Experimental results demonstrate that Talk2DINO achieves state-of-the-art performance across several unsupervised OVS benchmarks.
-
- ## Sample Usage
-
- ### Mapping CLIP Text Embeddings to DINOv3 Space with Talk2DINO
- We can use Talk2DINO to map CLIP text embeddings into the DINOv3 patch embedding space.
- ```python
- import torch
- from transformers import AutoModel
- from torchvision.io import read_image
-
- # Device setup
- device = 'cuda' if torch.cuda.is_available() else 'cpu'
-
- # Model loading
- model = AutoModel.from_pretrained("lorebianchi98/Talk2DINO_v3-ViTB").to(device).eval()
-
- # Load an input image as a tensor (e.g., one of the sample assets in this repository)
- image = read_image("assets/pikachu.png").to(device)
-
- # Embedding generation
- with torch.no_grad():
-     text_embed = model.encode_text("a pikachu")
-     image_embed = model.encode_image(image)
-
- # Normalize the features to compute cosine similarity
- text_embed = text_embed / text_embed.norm(dim=-1, keepdim=True)
- image_embed = image_embed / image_embed.norm(dim=-1, keepdim=True)
-
- similarity = (image_embed @ text_embed.T).squeeze(0, -1).cpu().numpy()
- ```
-
- ### Demo
- In `demo.ipynb`, we provide a simple example of how to use Talk2DINO for inference on a given image with custom textual categories.
- Result:
- <div align="center">
- <table><tr><td><figure>
- <img alt="" src="./assets/pikachu.png" width=300>
- </figure></td><td><figure>
- <img alt="" src="./assets/pikachu_seg.png" width=300>
- </figure></td></tr></table>
- </div>
-
- ## Installation
-
- To use the **Hugging Face interface** for inference:
-
- ```bash
- # Clone the repository
- git clone https://huggingface.co/lorebianchi98/Talk2DINO-ViTB
- cd Talk2DINO-ViTB
-
- # Install dependencies
- pip install -r requirements.txt
-
- # Install PyTorch and torchvision with the appropriate CUDA version
- pip install torch torchvision --index-url https://download.pytorch.org/whl/cu126
- ```
-
- For the **full MMCV interface** to perform evaluation on segmentation benchmarks, please refer to the [original Talk2DINO repository](https://github.com/lorebianchi98/Talk2DINO).
-
- <details>
- <summary>Qualitative Results</summary>
-
- | **Image** | **Ground Truth** | **FreeDA** | **ProxyCLIP** | **CLIP-DINOiser** | **Ours (Talk2DINO)** |
- |-----------|------------------|------------|---------------|-------------------|----------------------|
- | ![Image](assets/qualitatives/voc/2_img.jpg) | ![Ground Truth](assets/qualitatives/voc/2_gt.png) | ![FreeDA](assets/qualitatives/voc/2_freeda.png) | ![ProxyCLIP](assets/qualitatives/voc/2_proxy.png) | ![CLIP-DINOiser](assets/qualitatives/voc/2_clipdinoiser.png) | ![Ours](assets/qualitatives/voc/2_talk2dino.png) |
- | ![Image](assets/qualitatives/object/2r_img.png) | ![Ground Truth](assets/qualitatives/object/2r_gt.png) | ![FreeDA](assets/qualitatives/object/2r_freeda.png) | ![ProxyCLIP](assets/qualitatives/object/2r_proxy.png) | ![CLIP-DINOiser](assets/qualitatives/object/2r_clipdinoiser.png) | ![Ours](assets/qualitatives/object/2r_talk2dino.png) |
- | ![Image](assets/qualitatives/cityscapes/1r_image.png) | ![Ground Truth](assets/qualitatives/cityscapes/1r_gt.png) | ![FreeDA](assets/qualitatives/cityscapes/1r_freeda.png) | ![ProxyCLIP](assets/qualitatives/cityscapes/1r_proxyclip.png) | ![CLIP-DINOiser](assets/qualitatives/cityscapes/1r_clipdinoiser.png) | ![Ours](assets/qualitatives/cityscapes/1r_talk2dino.png) |
- | ![Image](assets/qualitatives/context/1r_img.png) | ![Ground Truth](assets/qualitatives/context/1r_gt.png) | ![FreeDA](assets/qualitatives/context/1r_freeda.png) | ![ProxyCLIP](assets/qualitatives/context/1r_proxy.png) | ![CLIP-DINOiser](assets/qualitatives/context/1r_clipdinoiser.png) | ![Ours](assets/qualitatives/context/1r_talk2dino.png) |
- </details>
-
- ## Reference
- If you find this code useful, please cite the following paper:
- ```
- @misc{barsellotti2024talkingdinobridgingselfsupervised,
-       title={Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation},
-       author={Luca Barsellotti and Lorenzo Bianchi and Nicola Messina and Fabio Carrara and Marcella Cornia and Lorenzo Baraldi and Fabrizio Falchi and Rita Cucchiara},
-       year={2024},
-       eprint={2411.19331},
-       archivePrefix={arXiv},
-       primaryClass={cs.CV},
-       url={https://arxiv.org/abs/2411.19331},
- }
- ```
 
  ---
  tags:
  - model_hub_mixin
  - pytorch_model_hub_mixin
  ---

+ This model has been pushed to the Hub using the [PytorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration:
+ - Code: [More Information Needed]
+ - Paper: [More Information Needed]
+ - Docs: [More Information Needed]
config.json CHANGED
@@ -1,10 +1,4 @@
  {
- "architectures": ["Talk2DINO"],
- "model_type": "talk2dino",
- "auto_map": {
-     "AutoConfig": "configuration_talk2dino.Talk2DINOConfig",
-     "AutoModel": "modeling_talk2dino.Talk2DINO"
- },
  "avg_self_attn_token": false,
  "clip_model_name": "ViT-B/16",
  "disentangled_self_attn_token": true,
 
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:32eb653db8e3fcdc48a5d64ccdf472a175ddee413a3f5d3ca1ac6d1c92982e24
+ oid sha256:03c38bb3444dd80ae6d2af9900c07513248babe625c7b6a175e5663fe8df2d53
  size 696907428