Improve model card: Add library_name, relevant tags, and update paper info for RICE

#1
by nielsr (HF Staff) - opened
Files changed (1)
  1. README.md +23 -8
README.md CHANGED
@@ -1,15 +1,31 @@
  ---
  license: mit
  pipeline_tag: image-feature-extraction
  ---
- ## MLCD-ViT-bigG Model Card

  > [!TIP]
  > **[LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT) and [transformers](https://github.com/huggingface/transformers/releases/tag/v4.51.3-MLCD-preview) now support MLCD-ViT-bigG-14-448px.**
  >
  >

- MLCD-ViT-bigG is a state-of-the-art vision transformer model enhanced with 2D Rotary Position Embedding (RoPE2D), achieving superior performance on document understanding and visual question answering tasks. Developed by DeepGlint AI, this model demonstrates exceptional capabilities in processing complex visual-language interactions.

  We adopted the official [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT) and the official training dataset [LLaVA-NeXT-Data](https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Data) for evaluating the foundational visual models.
  The language model is Qwen2.5-7B.
@@ -60,12 +76,11 @@ print(f"Extracted features shape: {features.shape}")

  ## Citation

-
  ```latex
- @inproceedings{anxiang_2024_mlcd,
- title={Multi-label Cluster Discrimination for Visual Representation Learning},
- author={An, Xiang and Yang, Kaicheng and Dai, Xiangzi and Feng, Ziyong and Deng, Jiankang},
- booktitle={ECCV},
- year={2024}
  }
  ```
  ---
  license: mit
  pipeline_tag: image-feature-extraction
+ library_name: transformers
+ tags:
+ - vision-transformer
+ - ocr
+ - object-detection
+ - semantic-segmentation
+ - multimodal-llm
+ - general-purpose
  ---
+
+ ## MLCD-ViT-bigG Model Card for Region-based Cluster Discrimination (RICE)

  > [!TIP]
  > **[LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT) and [transformers](https://github.com/huggingface/transformers/releases/tag/v4.51.3-MLCD-preview) now support MLCD-ViT-bigG-14-448px.**
  >
  >

+ This `mlcd-vit-bigG-patch14-448` model is a state-of-the-art vision transformer enhanced with 2D Rotary Position Embedding (RoPE2D). It is a key component of the **Region-Aware Cluster Discrimination (RICE)** framework presented in the paper [Region-based Cluster Discrimination for Visual Representation Learning](https://huggingface.co/papers/2507.20025).
+
+ **Paper Abstract:**
+ Learning visual representations is foundational for a broad spectrum of downstream tasks. Although recent vision-language contrastive models, such as CLIP and SigLIP, have achieved impressive zero-shot performance via large-scale vision-language alignment, their reliance on global representations constrains their effectiveness for dense prediction tasks, such as grounding, OCR, and segmentation. To address this gap, we introduce Region-Aware Cluster Discrimination (RICE), a novel method that enhances region-level visual and OCR capabilities. We first construct a billion-scale candidate region dataset and propose a Region Transformer layer to extract rich regional semantics. We further design a unified region cluster discrimination loss that jointly supports object and OCR learning within a single classification framework, enabling efficient and scalable distributed training on large-scale data. Extensive experiments show that RICE consistently outperforms previous methods on tasks including segmentation, dense detection, and visual perception for Multimodal Large Language Models (MLLMs).
+
+ **Code / GitHub Repository:** [https://github.com/deepglint/unicom](https://github.com/deepglint/unicom)
+
+ Developed by DeepGlint AI, this model demonstrates exceptional capabilities in processing complex visual-language interactions.

  We adopted the official [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT) and the official training dataset [LLaVA-NeXT-Data](https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Data) for evaluating the foundational visual models.
  The language model is Qwen2.5-7B.
 
  ## Citation

  ```latex
+ @inproceedings{yinxie_2025_rice,
+ title={Region-based Cluster Discrimination for Visual Representation Learning},
+ author={Xie, Yin and Yang, Kaicheng and An, Xiang and Wu, Kun and Zhao, Yongle and Deng, Weimo and Ran, Zimin and Wang, Yumeng and Feng, Ziyong and Miles, Roy and Elezi, Ismail and Deng, Jiankang},
+ booktitle={ICCV},
+ year={2025}
  }
  ```