---
datasets:
- laion/laion400m
- kakaobrain/coyo-700m
license: apache-2.0
pipeline_tag: image-feature-extraction
tags:
- Vision
- LLaVA
library_name: transformers
---
# Region-based Cluster Discrimination for Visual Representation Learning (RICE)
[[Paper]](https://huggingface.co/papers/2507.20025) [[GitHub]](https://github.com/deepglint/unicom)
## Abstract
Region-Aware Cluster Discrimination (RICE) is a novel method that enhances region-level visual and OCR capabilities. We first construct a billion-scale candidate region dataset and propose a Region Transformer layer to extract rich regional semantics. We further design a unified region cluster discrimination loss that jointly supports object and OCR learning within a single classification framework, enabling efficient and scalable distributed training on large-scale data. Extensive experiments show that RICE consistently outperforms previous methods on downstream tasks, including segmentation, dense detection, and visual perception for Multimodal Large Language Models (MLLMs).
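
The cluster discrimination objective can be pictured as classification over a large bank of cluster centers. The sketch below is a heavily simplified, hypothetical illustration of that idea (a plain temperature-scaled softmax over learnable centroids); it is **not** the paper's exact region-aware loss, and the dimensions and temperature are placeholders.

```python
import torch
import torch.nn.functional as F

class ClusterDiscriminationHead(torch.nn.Module):
    """Toy cluster-discrimination head: classify features against learnable
    cluster centers with a temperature-scaled softmax (illustration only)."""
    def __init__(self, feature_dim: int, num_clusters: int, temperature: float = 0.07):
        super().__init__()
        self.centers = torch.nn.Parameter(torch.randn(num_clusters, feature_dim))
        self.temperature = temperature

    def forward(self, features: torch.Tensor, cluster_ids: torch.Tensor) -> torch.Tensor:
        # Cosine similarity between L2-normalized features and centers,
        # followed by a standard cross-entropy over cluster assignments.
        logits = F.normalize(features, dim=-1) @ F.normalize(self.centers, dim=-1).T
        return F.cross_entropy(logits / self.temperature, cluster_ids)

# Toy usage: 8 region features of dimension 1024, 1000 hypothetical clusters.
head = ClusterDiscriminationHead(feature_dim=1024, num_clusters=1000)
loss = head(torch.randn(8, 1024), torch.randint(0, 1000, (8,)))
```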
## Model Overview
We use the same Vision Transformer architecture as [CLIP ViT-L/14@336px](https://huggingface.co/openai/clip-vit-large-patch14-336).

## Highlights

RICE efficiently processes diverse semantic regions within the image using a single forward pass. The model jointly captures both general visual semantics (objects) and OCR semantics (texts), seamlessly integrating them into a unified representation.
## Data
Our model was trained on publicly available image-caption data from the [LAION400M](https://arxiv.org/abs/2111.02114) and [COYO700M](https://github.com/kakaobrain/coyo-dataset) datasets.
## Performance and Limitations
### Experiments
We compare RICE against state-of-the-art vision encoders; the full comparison table is available in the [paper](https://huggingface.co/papers/2507.20025). For all experiments within the LLaVA-NeXT framework, we adopt a high-resolution tiling strategy: each input image is divided into a 2×2+1 grid of crops (one global view plus a 2×2 grid of local views), where each crop matches the pre-training resolution of the backbone model (e.g., 336px, 378px, or 560px), as sketched below.
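
The following hypothetical helper illustrates the 2×2+1 tiling described above: one global crop plus a 2×2 grid of local crops, each at the backbone's pre-training resolution (560px is assumed here). The actual LLaVA-NeXT implementation differs in details such as aspect-ratio handling and padding.

```python
from PIL import Image

def tile_2x2_plus_1(image: Image.Image, crop_size: int = 560) -> list:
    """Return [global view, top-left, top-right, bottom-left, bottom-right] crops."""
    crops = [image.resize((crop_size, crop_size))]           # 1 global crop
    grid = image.resize((2 * crop_size, 2 * crop_size))      # canvas for the 2x2 local crops
    for row in range(2):
        for col in range(2):
            box = (col * crop_size, row * crop_size,
                   (col + 1) * crop_size, (row + 1) * crop_size)
            crops.append(grid.crop(box))
    return crops
```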

### Limitations
Models that operate at a higher input resolution produce better OCR results. We are currently training such higher-resolution models and will release them soon.
## How to use
#### 1. Standard Usage
```python
# Install dependencies
# pip install torch transformers
# git clone https://github.com/deepglint/unicom
# cd unicom/mlcd
from vit_rope2d_hf import MLCDVisionModel
from transformers import CLIPImageProcessor
from PIL import Image
import requests
import torch
# Load model and processor
model = MLCDVisionModel.from_pretrained("DeepGlint-AI/rice-vit-large-patch14-560")
processor = CLIPImageProcessor.from_pretrained("DeepGlint-AI/rice-vit-large-patch14-560")
# Load and process an image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, return_tensors="pt")
# Extract visual features
with torch.no_grad():
    outputs = model(**inputs)
features = outputs.last_hidden_state
print(f"Extracted features shape: {features.shape}")
```
#### 2. Using HuggingFace Transformers >= 4.51.3
```python
# pip install torch "transformers>=4.51.3"
from transformers import AutoProcessor, AutoModel
from PIL import Image
import requests
import torch
# Load model and processor
model = AutoModel.from_pretrained("DeepGlint-AI/rice-vit-large-patch14-560")
processor = AutoProcessor.from_pretrained("DeepGlint-AI/rice-vit-large-patch14-560")
# Load and process an image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, return_tensors="pt")
# Extract visual features
with torch.no_grad():
    outputs = model(**inputs)
features = outputs.last_hidden_state[0]
print(f"Extracted features shape: {features.shape}")
```
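
As a small follow-up, one simple way to turn the token features above into a single image-level embedding is mean pooling followed by L2 normalization. This is an illustrative convention, not the evaluation protocol used in the paper.

```python
import torch.nn.functional as F

# features: (num_tokens, hidden_dim) tensor from the snippet above.
image_embedding = F.normalize(features.mean(dim=0), dim=-1)
print(f"Image embedding shape: {image_embedding.shape}")
```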
### Visualize Semantic Features

Using 2048-resolution images as input to a ViT-B/16 model, we project token features onto RGB channels via PCA to visualize the semantic structure. Sequential frames (arranged vertically) illustrate the evolution of model attention, consistently highlighting salient objects across time. The visualization reveals stable color patterns for tracked entities such as ice skaters, deer, motorcyclists, and cyclists, demonstrating the model’s ability to maintain semantic focus throughout the sequence.
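
Below is a minimal sketch of such a visualization, assuming the model and processor are loaded as in the usage examples above, a 560px input (40×40 patch grid), and a class token at index 0; both assumptions may need adjusting for other checkpoints, and `torch.pca_lowrank` stands in for whatever PCA implementation was used originally.

```python
import numpy as np
import torch
from PIL import Image

def pca_rgb(patch_tokens: torch.Tensor, grid_h: int, grid_w: int) -> np.ndarray:
    """Project per-patch features onto their top-3 principal components and
    rescale to [0, 255] so they can be viewed as an RGB image."""
    feats = patch_tokens - patch_tokens.mean(dim=0, keepdim=True)
    _, _, v = torch.pca_lowrank(feats, q=3)             # top-3 principal directions
    rgb = feats @ v                                      # (num_patches, 3)
    rgb = (rgb - rgb.min(0).values) / (rgb.max(0).values - rgb.min(0).values + 1e-6)
    return (rgb.reshape(grid_h, grid_w, 3).numpy() * 255).astype(np.uint8)

with torch.no_grad():
    tokens = model(**inputs).last_hidden_state[0]        # (1 + 40*40, dim) assumed
rgb_map = pca_rgb(tokens[1:].float(), grid_h=40, grid_w=40)  # drop the assumed class token
Image.fromarray(rgb_map).resize((560, 560), Image.NEAREST).save("pca_rgb.png")
```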
## ModelZoo
| Model | Download |
|-------|-------------|
| RICE-ViT-L-14-560px | [huggingface](https://huggingface.co/DeepGlint-AI/rice-vit-large-patch14-560) |
| MLCD-ViT-bigG-14-448px | [huggingface](https://huggingface.co/DeepGlint-AI/mlcd-vit-bigG-patch14-448) |
| MLCD-ViT-L-14-336px | [huggingface](https://huggingface.co/DeepGlint-AI/mlcd-vit-large-patch14-336) |
| MLCD-ViT-B-32-224px | [huggingface](https://huggingface.co/DeepGlint-AI/mlcd-vit-base-patch32-224) |
## Acknowledgments
We would like to express our gratitude to [Xie Yin](https://huggingface.co/Yin-Xie) and [Yumeng Wang](https://huggingface.co/devymex) for their significant contributions to the experimental validation in MLLMs.
The authors are from the DeepGlint team and Huawei London Research Institute.
## Citation
If you find our work helpful or inspiring, please feel free to cite it.
```latex
@inproceedings{yinxie_2025_rice,
title={Region-based Cluster Discrimination for Visual Representation Learning},
author={Xie, Yin and Yang, Kaicheng and An, Xiang and Wu, Kun and Zhao, Yongle and Deng, Weimo and Ran, Zimin and Wang, Yumeng and Feng, Ziyong and Miles, Roy and Elezi, Ismail and Deng, Jiankang},
booktitle={ICCV},
year={2025}
}
@inproceedings{anxiang_2024_mlcd,
title={Multi-label Cluster Discrimination for Visual Representation Learning},
author={An, Xiang and Yang, Kaicheng and Dai, Xiangzi and Feng, Ziyong and Deng, Jiankang},
booktitle={ECCV},
year={2024}
}
@inproceedings{anxiang_2023_unicom,
title={Unicom: Universal and Compact Representation Learning for Image Retrieval},
author={An, Xiang and Deng, Jiankang and Yang, Kaicheng and Li, Jiawei and Feng, Ziyong and Guo, Jia and Yang, Jing and Liu, Tongliang},
booktitle={ICLR},
year={2023}
}
@inproceedings{anxiang_2022_partialfc,
author={An, Xiang and Deng, Jiankang and Guo, Jia and Feng, Ziyong and Zhu, XuHan and Yang, Jing and Liu, Tongliang},
title={Killing Two Birds With One Stone: Efficient and Robust Training of Face Recognition CNNs by Partial FC},
booktitle={CVPR},
year={2022},
}
@inproceedings{deng_2019_arcface,
title={Arcface: Additive angular margin loss for deep face recognition},
author={Deng, Jiankang and Guo, Jia and Xue, Niannan and Zafeiriou, Stefanos},
booktitle={CVPR},
year={2019}
}
```