nielsr (HF Staff) committed · verified
Commit 7a67cd3 · 1 Parent(s): bd75627

Improve model card: Update to RICE paper details, add usage, and refine metadata


This PR significantly updates the model card for the `rice-vit-large-patch14-560` model to accurately reflect its association with the paper **"Region-based Cluster Discrimination for Visual Representation Learning" (RICE)**.

Key changes include:
* Updating the primary paper link to the correct RICE paper ([https://huggingface.co/papers/2507.20025](https://huggingface.co/papers/2507.20025)).
* Adding `library_name: transformers` to enable the "How to use" widget, as the model is compatible with the Transformers library.
* Refining the `pipeline_tag` from `feature-extraction` to `image-feature-extraction` for better discoverability.
* Incorporating comprehensive sections from the official GitHub repository's RICE section, including "Highlights", "Experiments" (with the relevant image, replacing outdated MLCD tables), "How to use" (sample code), "Visualize Semantic Features", "ModelZoo", and "Citation".
* Adding the paper abstract for a detailed overview.

This ensures the model card provides precise and up-to-date information for users.
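For reference, the updated YAML front matter that results from these metadata changes reads as follows (values taken directly from the diff in this commit):

```yaml
---
datasets:
- laion/laion400m
- kakaobrain/coyo-700m
license: apache-2.0
pipeline_tag: image-feature-extraction
tags:
- Vision
- LLaVA
library_name: transformers
---
```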

Files changed (1)
  1. README.md +135 -68
README.md CHANGED
@@ -1,94 +1,161 @@
  ---
- license: apache-2.0
  datasets:
  - laion/laion400m
  - kakaobrain/coyo-700m
- pipeline_tag: feature-extraction
  tags:
  - Vision
  - LLaVA
  ---

- [[Paper]](https://arxiv.org/abs/2407.17331) [[GitHub]](https://github.com/deepglint/unicom)
- ## Model
- We used the same Vision Transformer architecture [ViT-L/14@336px as CLIP](https://huggingface.co/openai/clip-vit-large-patch14-336).

  ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6478679d7b370854241b2ad8/8n_jBobanaLNAQjM5eZeg.png)

  ## Data
- Our model was trained on publicly available image-caption data from the [LAION400M](https://arxiv.org/abs/2111.02114) and [COYO700M](https://github.com/kakaobrain/coyo-dataset) datasets.

  ## Performance and Limitations

- ### A. MLLMs Evaluation Results
- In our experiments, we replaced the CLIP model in [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT) with the MLCD model to demonstrate the performance of the MLCD model in Multimodal Large Language Models (MLLMs). For the language model, we used [Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B). The evaluation results show that the modified model performs exceptionally well across multiple benchmarks, validating the effectiveness of the MLCD model within MLLMs.
-
- | Vision Tower | MLCD (ViT_L_14_336px) | CLIP (ViT_L_14_336px) |
- |:----------------|:----------------------|:----------------------|
- | LLM | Qwen2.5-7B | Qwen2.5-7B |
- | AI2D | <span style="color:red">76.98</span> | 73.15 |
- | ScienceQA_img | <span style="color:red">78.09</span> | 76.35 |
- | GQA | <span style="color:red">64.17</span> | 63.31 |
- | InfoVQA_val | <span style="color:red">43.48</span> | 38.88 |
- | MMBench_cn_dev | <span style="color:red">74.83</span> | 72.51 |
- | MMBench_en_dev | <span style="color:red">76.37</span> | 74.57 |
- | MME(cognition) | <span style="color:red">432</span> | 384 |
- | MME(perception) | <span style="color:red">1598</span> | 1512 |
- | SeedBench | <span style="color:red">68.20</span> | 66.80 |
- | SeedBench_img | <span style="color:red">73.75</span> | 72.72 |
- | MMStar | <span style="color:red">50.98</span> | 48.98 |
- | MMMU | <span style="color:red">44.30</span> | 44.20 |
- | OCRBench | <span style="color:red">531.00</span> | 525.00 |
- | ChartQA | <span style="color:red">67.84</span> | 66.52 |
- | DocVQA_val | <span style="color:red">76.46</span> | 75.21 |
- | POPE | 88.69 | <span style="color:red">88.83</span> |
- | TextVQA_val | 61.69 | <span style="color:red">62.47</span> |
-
- ### B. Linear Probe Evaluation Results
- This table presents the results of linear probe evaluations comparing CLIP and MLCD models on the ViT_L_14_336px architecture across various datasets. The linear probe test freezes the pre-trained model's weights and trains a linear classifier on top to assess how well the model's representations generalize to different tasks.
-
- | Dataset | MLCD (ViT_L_14_336px) | CLIP (ViT_L_14_336px) |
- |:---------------|:----------------------|:----------------------|
- | AVG | <span style="color:red">87.15</span> | 85.35 |
- | Food101 | <span style="color:red">96.21</span> | 95.90 |
- | CIFAR-10 | <span style="color:red">99.36</span> | 97.90 |
- | CIFAR-100 | <span style="color:red">93.69</span> | 87.40 |
- | Birdsnap | <span style="color:red">88.18</span> | 79.90 |
- | SUN397 | <span style="color:red">87.96</span> | 82.20 |
- | Stanford Cars | <span style="color:red">95.16</span> | 91.50 |
- | FGVC Aircraft | <span style="color:red">86.38</span> | 71.60 |
- | Describable Textures Dataset | <span style="color:red">86.70</span> | 83.00 |
- | Oxford-IIIT Pets | <span style="color:red">96.27</span> | 95.10 |
- | Caltech-101 | <span style="color:red">97.92</span> | 96.00 |
- | Flowers102 | <span style="color:red">99.58</span> | 99.20 |
- | MNIST | 98.67 | <span style="color:red">99.20</span> |
- | STL-10 | 99.28 | <span style="color:red">99.70</span> |
- | EuroSAT | <span style="color:red">99.06</span> | 98.10 |
- | RESISC45 | <span style="color:red">95.48</span> | 94.90 |
- | GTSRB | 92.32 | <span style="color:red">92.40</span> |
- | KITTI | <span style="color:red">75.39</span> | 69.20 |
- | Country211 | 38.12 | <span style="color:red">46.40</span> |
- | PatchCamelyon | <span style="color:red">88.00</span> | 85.60 |
- | UCF101 | <span style="color:red">92.86</span> | 92.00 |
- | Kinetics-700 | <span style="color:red">73.35</span> | 73.00 |
- | CLEVR | <span style="color:red">64.40</span> | 60.30 |
- | Hateful Memes | 72.00 | <span style="color:red">77.30</span> |
- | SST-2 | 76.33 | <span style="color:red">80.50</span> |
- | ImageNet | <span style="color:red">86.30</span> | 85.40 |
-
- ### C. Limitations

  Models with higher resolution are more friendly to OCR results. We are currently training such models and will soon make them available.

  ## Acknowledgments

- We would like to express our gratitude to [Xie Yin](https://huggingface.co/Yin-Xie) and [Yumeng Wang](https://huggingface.co/devymex) for their significant contributions to the experimental validation in MLLMs.
  ---
  datasets:
  - laion/laion400m
  - kakaobrain/coyo-700m
+ license: apache-2.0
+ pipeline_tag: image-feature-extraction
  tags:
  - Vision
  - LLaVA
+ library_name: transformers
  ---

+ # Region-based Cluster Discrimination for Visual Representation Learning (RICE)

+ [[Paper]](https://huggingface.co/papers/2507.20025) [[GitHub]](https://github.com/deepglint/unicom)

+ ## Abstract
+ Region-Aware Cluster Discrimination (RICE) is a novel method that enhances region-level visual and OCR capabilities. We first construct a billion-scale candidate region dataset and propose a Region Transformer layer to extract rich regional semantics. We further design a unified region cluster discrimination loss that jointly supports object and OCR learning within a single classification framework, enabling efficient and scalable distributed training on large-scale data. Extensive experiments show that RICE consistently outperforms previous methods on tasks including segmentation, dense detection, and visual perception for Multimodal Large Language Models (MLLMs).

+ ## Model Overview
+ We used the Vision Transformer architecture [ViT-L/14@336px as CLIP](https://huggingface.co/openai/clip-vit-large-patch14-336).

  ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6478679d7b370854241b2ad8/8n_jBobanaLNAQjM5eZeg.png)

+ ## Highlights
+ ![470695215-38e89eea-8a73-4e3f-b43a-fa1ea6e32f0f](https://github.com/user-attachments/assets/e0de38b3-b20a-491e-9382-1839e9968481)
+
+ RICE efficiently processes diverse semantic regions within the image using a single forward pass. The model jointly captures both general visual semantics (objects) and OCR semantics (texts), seamlessly integrating them into a unified representation.

  ## Data
+ Our model was trained on publicly available image-caption data from the [LAION400M](https://arxiv.org/abs/2111.02114) and [COYO700M](https://github.com/kakaobrain/coyo-dataset) datasets.

  ## Performance and Limitations

+ ### Experiments
+ This table presents a comprehensive performance comparison of RICE with state-of-the-art vision encoders. For all experiments within the LLaVA-NeXT framework, we adopt a high-resolution tiling strategy: each input image is divided into a 2×2+1 grid of crops, where each crop matches the pre-training resolution of the backbone model (e.g., 336px, 378px, or 560px).
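Editor's note: the 2×2+1 tiling strategy described in the added text can be sketched as below. This is an illustrative example, not code from the RICE repository; the function name and the box arithmetic are assumptions. Each returned box would then be cropped out and resized to the backbone's pre-training resolution (560px for this model; 336px or 378px for the other backbones compared).

```python
# Hypothetical sketch of the 2x2+1 tiling used in the LLaVA-NeXT
# experiments: four quadrant crops plus one global view of the image.
def tile_boxes_2x2_plus_1(width: int, height: int):
    """Return crop boxes (left, top, right, bottom): 4 quadrants + full image."""
    w2, h2 = width // 2, height // 2
    quadrants = [
        (0, 0, w2, h2), (w2, 0, width, h2),
        (0, h2, w2, height), (w2, h2, width, height),
    ]
    return quadrants + [(0, 0, width, height)]  # last box = global view

boxes = tile_boxes_2x2_plus_1(1120, 840)
print(len(boxes))  # 5
```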

+ ![470696193-65b351ac-9399-4dac-8999-b4412286731a](https://github.com/user-attachments/assets/cd66223f-1757-4ff4-859c-19dd25f1246d)
+
+ ### Limitations
  Models with higher resolution are more friendly to OCR results. We are currently training such models and will soon make them available.

+ ## How to use
+
+ #### 1. Standard Usage
+
+ ```python
+ # Install dependencies
+ # pip install torch transformers
+ # git clone https://github.com/deepglint/unicom
+ # cd unicom/mlcd
+
+ from vit_rope2d_hf import MLCDVisionModel
+ from transformers import CLIPImageProcessor
+ from PIL import Image
+ import requests
+ import torch
+
+ # Load model and processor
+ model = MLCDVisionModel.from_pretrained("DeepGlint-AI/rice-vit-large-patch14-560")
+ processor = CLIPImageProcessor.from_pretrained("DeepGlint-AI/rice-vit-large-patch14-560")
+
+ # Load and process an image
+ url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+ image = Image.open(requests.get(url, stream=True).raw)
+ inputs = processor(images=image, return_tensors="pt")
+
+ # Extract visual features
+ with torch.no_grad():
+     outputs = model(**inputs)
+     features = outputs.last_hidden_state
+
+ print(f"Extracted features shape: {features.shape}")
+ ```
+
+ #### 2. Using HuggingFace Transformers >= 4.51.3
+
+ ```python
+ # pip install torch transformers>=4.51.3
+
+ from transformers import AutoProcessor, AutoModel
+ from PIL import Image
+ import requests
+ import torch
+
+ # Load model and processor
+ model = AutoModel.from_pretrained("DeepGlint-AI/rice-vit-large-patch14-560")
+ processor = AutoProcessor.from_pretrained("DeepGlint-AI/rice-vit-large-patch14-560")
+
+ # Load and process an image
+ url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+ image = Image.open(requests.get(url, stream=True).raw)
+ inputs = processor(images=image, return_tensors="pt")
+
+ # Extract visual features
+ with torch.no_grad():
+     outputs = model(**inputs)
+     features = outputs.last_hidden_state[0]
+
+ print(f"Extracted features shape: {features.shape}")
+ ```
+
+ ### Visualize Semantic Features
+
+ ![screenshot-20250725-232729](https://github.com/user-attachments/assets/0ff3b764-c5b6-4a10-a63c-89ccbc99d06b)
+
+ Using 2048-resolution images as input to a ViT-B/16 model, we project token features onto RGB channels via PCA to visualize the semantic structure. Sequential frames (arranged vertically) illustrate the evolution of model attention, consistently highlighting salient objects across time. The visualization reveals stable color patterns for tracked entities such as ice skaters, deer, motorcyclists, and cyclists, demonstrating the model's ability to maintain semantic focus throughout the sequence.
+
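Editor's note: a minimal sketch of the PCA-to-RGB projection described in this section, assuming per-patch token features of shape `[num_patches, hidden_dim]` on a square patch grid. This is an editor's illustration using a NumPy SVD, not the authors' visualization code; the random array stands in for real model output.

```python
import numpy as np

def features_to_rgb(features: np.ndarray, grid: int) -> np.ndarray:
    """Project features onto their top 3 principal components,
    min-max scale each channel to [0, 1], reshape to an [H, W, 3] image."""
    centered = features - features.mean(axis=0)
    # Right singular vectors of the centered data are the principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    rgb = centered @ vt[:3].T                                # [N, 3]
    rgb = (rgb - rgb.min(0)) / (rgb.max(0) - rgb.min(0) + 1e-8)
    return rgb.reshape(grid, grid, 3)

# 40x40 patch grid (e.g. a 560px input with 14px patches), random stand-in features
demo = features_to_rgb(np.random.rand(40 * 40, 64), grid=40)
print(demo.shape)  # (40, 40, 3)
```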
+ ## ModelZoo
+
+ | Model | Download |
+ |-------|----------|
+ | RICE-ViT-L-14-560px | [huggingface](https://huggingface.co/DeepGlint-AI/rice-vit-large-patch14-560) |
+ | MLCD-ViT-bigG-14-448px | [huggingface](https://huggingface.co/DeepGlint-AI/mlcd-vit-bigG-patch14-448) |
+ | MLCD-ViT-L-14-336px | [huggingface](https://huggingface.co/DeepGlint-AI/mlcd-vit-large-patch14-336) |
+ | MLCD-ViT-B-32-224px | [huggingface](https://huggingface.co/DeepGlint-AI/mlcd-vit-base-patch32-224) |

  ## Acknowledgments

+ We would like to express our gratitude to [Xie Yin](https://huggingface.co/Yin-Xie) and [Yumeng Wang](https://huggingface.co/devymex) for their significant contributions to the experimental validation in MLLMs.
+ The authors are from the DeepGlint team and the Huawei London Research Institute.
+
+ ## Citation
+
+ If you find our work helpful or inspiring, please feel free to cite it.
+
+ ```latex
+ @inproceedings{yinxie_2025_rice,
+   title={Region-based Cluster Discrimination for Visual Representation Learning},
+   author={Xie, Yin and Yang, Kaicheng and An, Xiang and Wu, Kun and Zhao, Yongle and Deng, Weimo and Ran, Zimin and Wang, Yumeng and Feng, Ziyong and Miles, Roy and Elezi, Ismail and Deng, Jiankang},
+   booktitle={ICCV},
+   year={2025}
+ }
+ @inproceedings{anxiang_2024_mlcd,
+   title={Multi-label Cluster Discrimination for Visual Representation Learning},
+   author={An, Xiang and Yang, Kaicheng and Dai, Xiangzi and Feng, Ziyong and Deng, Jiankang},
+   booktitle={ECCV},
+   year={2024}
+ }
+ @inproceedings{anxiang_2023_unicom,
+   title={Unicom: Universal and Compact Representation Learning for Image Retrieval},
+   author={An, Xiang and Deng, Jiankang and Yang, Kaicheng and Li, Jiawei and Feng, Ziyong and Guo, Jia and Yang, Jing and Liu, Tongliang},
+   booktitle={ICLR},
+   year={2023}
+ }
+ @inproceedings{anxiang_2022_partialfc,
+   author={An, Xiang and Deng, Jiankang and Guo, Jia and Feng, Ziyong and Zhu, XuHan and Yang, Jing and Liu, Tongliang},
+   title={Killing Two Birds With One Stone: Efficient and Robust Training of Face Recognition CNNs by Partial FC},
+   booktitle={CVPR},
+   year={2022}
+ }
+ @inproceedings{deng_2019_arcface,
+   title={Arcface: Additive angular margin loss for deep face recognition},
+   author={Deng, Jiankang and Guo, Jia and Xue, Niannan and Zafeiriou, Stefanos},
+   booktitle={CVPR},
+   year={2019}
+ }
+ ```