GreatBird nielsr HF Staff committed on
Commit 2d26fd2 · verified · 1 Parent(s): 8aba460

Add metadata and improve model card (#1)

- Add metadata and improve model card (60fb4762ca4b20b24426bd9c7c36b1fc51306aee)


Co-authored-by: Niels Rogge <nielsr@users.noreply.huggingface.co>

Files changed (1)
  1. README.md +54 -12
README.md CHANGED
@@ -1,23 +1,65 @@
- Visual Instruction Pretraining for Domain-Specific Foundation Models
  <p align="center">
  <a href="http://arxiv.org/abs/2509.17562"><img src="https://img.shields.io/badge/arXiv-Paper-b31b1b?logo=Arxiv"></a>
  </p>
- # Introduction

- Modern computer vision is converging on a closed loop in which perception, reasoning and generation mutually reinforce each other. However, this loop remains incomplete: the top-down influence of high-level reasoning on the foundational learning of low-level perceptual features is not yet underexplored. This paper addresses this gap by proposing a new paradigm for pretraining foundation models in downstream domains. We introduce **V**isual **i**ns**T**ruction **P**retraining (**ViTP**), a novel approach that directly leverages reasoning to enhance perception. ViTP embeds a Vision Transformer (ViT) backbone within a Vision-Language Model and pretrains it end-to-end using a rich corpus of visual instruction data curated from target downstream domains. ViTP is powered by our proposed Visual Robustness Learning (VRL), which compels the ViT to learn robust and domain-relevant features from a sparse set of visual tokens. Extensive experiments on 16 challenging remote sensing and medical imaging benchmarks demonstrate that ViTP establishes new state-of-the-art performance across a diverse range of downstream tasks. The code is available at [GitHub](https://github.com/zcablii/ViTP).
-
- ----

- ![image/png](docs/loop_radar.png)
- The synergistic relationship between perception, generation, and reasoning in modern CV. Our proposed ViTP forges a novel link from high-level reasoning to low-level perception, a previously underexplored connection. ViTP sets new SOTA performance across a diverse range of downstream tasks in medical imaging and remote sensing.

- ----

- ![image/png](docs/vitp.png)
- A conceptual illustration of the ViTP framework. A ViT backbone is embedded within a large VLM and then pretrained with domain-specific instruction following objective and Visual Robustness Learning (VRL). This process instils high-level semantic understanding into the ViT. The resulting weights are then used to initialize models for various downstream perception tasks.

- ----

  ```bibtex

- ```

+ ---
+ license: cc-by-nc-4.0
+ pipeline_tag: image-feature-extraction
+ tags:
+ - remote-sensing
+ - medical-imaging
+ - vision-transformer
+ ---
+
+ # Visual Instruction Pretraining for Domain-Specific Foundation Models
+
+ Official model weights and documentation for **ViTP** (Visual insTruction Pretraining), a novel paradigm for pretraining foundation models in downstream domains like Remote Sensing and Medical Imaging.
+
  <p align="center">
  <a href="http://arxiv.org/abs/2509.17562"><img src="https://img.shields.io/badge/arXiv-Paper-b31b1b?logo=Arxiv"></a>
+ <a href="https://github.com/zcablii/ViTP"><img src="https://img.shields.io/badge/GitHub-Code-blue?logo=github"></a>
  </p>

+ ## Introduction
+
+ Modern computer vision is converging on a closed loop in which perception, reasoning and generation mutually reinforce each other. However, the top-down influence of high-level reasoning on the foundational learning of low-level perceptual features remains underexplored.
+
+ ViTP addresses this gap by directly leveraging reasoning to enhance perception. It embeds a Vision Transformer (ViT) backbone within a Vision-Language Model and pretrains it end-to-end using a rich corpus of visual instruction data curated from target downstream domains. ViTP is powered by Visual Robustness Learning (VRL), which compels the ViT to learn robust and domain-relevant features from a sparse set of visual tokens.
+
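+ The model card only summarizes VRL at a high level ("a sparse set of visual tokens"), so the snippet below is a toy, illustrative sketch of that general idea rather than the official VRL implementation (which lives in the GitHub repository): it simply retains a random subset of ViT patch tokens before they would be handed to the language model.
+
+ ```python
+ # Toy sketch only: NOT the official VRL objective or token-selection strategy.
+ import torch
+
+ def sparse_visual_tokens(patch_tokens: torch.Tensor, keep_ratio: float = 0.25) -> torch.Tensor:
+     """patch_tokens: (batch, num_patches, dim) features from the ViT backbone."""
+     b, n, d = patch_tokens.shape
+     num_keep = max(1, int(n * keep_ratio))
+     scores = torch.rand(b, n, device=patch_tokens.device)    # random per-sample scores
+     keep_idx = scores.topk(num_keep, dim=1).indices           # pick a sparse subset
+     keep_idx = keep_idx.sort(dim=1).values                    # keep spatial order
+     return patch_tokens.gather(1, keep_idx.unsqueeze(-1).expand(-1, -1, d))
+
+ # e.g. 576 patch tokens from a ViT-L at 336px (patch size 14), keeping 25%
+ sparse = sparse_visual_tokens(torch.randn(2, 576, 1024))
+ print(sparse.shape)  # torch.Size([2, 144, 1024])
+ ```
+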
+ ---
+
+ ![Framework](docs/vitp.png)
+ *A conceptual illustration of the ViTP framework. A ViT backbone is embedded within a large VLM and pretrained with domain-specific instruction following and Visual Robustness Learning (VRL).*
+
+ ![Synergy](docs/loop_radar.png)
+ *ViTP forges a novel link from high-level reasoning to low-level perception, establishing new state-of-the-art performance across 16 challenging benchmarks.*
+
+ ---
+
+ ## Pretrained Backbones
+
+ The following ViT-Large (300M) backbones are available in the repository:
+
+ | Model | Pretrain Domain | Weights |
+ | :--- | :--- | :--- |
+ | **ViTP_ViT_L_rs** | Remote Sensing | [Download](ckpts/ViTP_ViT_L_300M_rs.safetensors) |
+ | **ViTP_ViT_L_med** | Medical Imaging | [Download](ckpts/ViTP_ViT_L_300M_med.safetensors) |
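+
+ As a quick sanity check, the checkpoints can be opened with `safetensors`. The sketch below is illustrative only: it assumes a timm-style ViT-Large definition and key naming, whereas the configs in the official repository handle the exact backbone variant and key mapping.
+
+ ```python
+ # Illustrative sketch (not from the official repo). The architecture variant
+ # (patch size, input resolution) and state-dict key names are assumptions.
+ import timm
+ from safetensors.torch import load_file
+
+ state_dict = load_file("ckpts/ViTP_ViT_L_300M_rs.safetensors")
+ print(len(state_dict), "tensors; sample keys:", list(state_dict)[:5])
+
+ model = timm.create_model("vit_large_patch16_224", pretrained=False, num_classes=0)
+ missing, unexpected = model.load_state_dict(state_dict, strict=False)
+ print(f"missing: {len(missing)}, unexpected: {len(unexpected)}")
+ ```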
 
+ These weights are designed to be used as initializations for various downstream tasks, including:
+ - **Object Detection** (via MMRotate)
+ - **Semantic Segmentation** (via MMSegmentation)
+ - **Change Detection** (via OpenCD)
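+
+ In the MM-series frameworks above, the backbone weights are usually wired in through the config's `init_cfg`. The fragment below is a hedged, illustrative sketch (the real, tested configs are in the official repository); it assumes the `.safetensors` file has been converted to a regular PyTorch `.pth` state dict, and the converted path shown is hypothetical.
+
+ ```python
+ # Illustrative MMSegmentation-style backbone fragment; values are assumptions,
+ # not the official ViTP configuration.
+ backbone = dict(
+     type="VisionTransformer",
+     img_size=(512, 512),          # assumption; use the official config's value
+     patch_size=16,                # assumption
+     embed_dims=1024,              # ViT-Large
+     num_layers=24,
+     num_heads=16,
+     out_indices=(7, 11, 15, 23),
+     init_cfg=dict(type="Pretrained",
+                   checkpoint="ckpts/ViTP_ViT_L_300M_rs.pth"),  # hypothetical converted path
+ )
+ ```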
 
+ For detailed installation and usage instructions, please refer to the [official GitHub repository](https://github.com/zcablii/ViTP).
 
+ ## Citation

+ If you use this work in your research, please cite:

  ```bibtex
+ @article{Li_2025_ViTP,
+   title={Visual Instruction Pretraining for Domain-Specific Foundation Models},
+   author={Li, Yuxuan and Zhang, Yicheng and Tang, Wenhao and Dai, Yimian and Cheng, Ming-Ming and Li, Xiang and Yang, Jian},
+   journal={arXiv},
+   year={2025}
+ }
+ ```

+ ## License
+ Licensed under the [Creative Commons Attribution-NonCommercial 4.0 International](https://creativecommons.org/licenses/by-nc/4.0/) license, for non-commercial use only. For commercial use, please obtain formal permission from the authors.