---
license: cc-by-nc-4.0
pipeline_tag: image-feature-extraction
tags:
- remote-sensing
- medical-imaging
- vision-transformer
---

# Visual Instruction Pretraining for Domain-Specific Foundation Models

Official model weights and documentation for **ViTP** (Visual insTruction Pretraining), a novel paradigm for pretraining foundation models in downstream domains like Remote Sensing and Medical Imaging.

<p align="center">
<a href="http://arxiv.org/abs/2509.17562"><img src="https://img.shields.io/badge/arXiv-Paper-b31b1b?logo=Arxiv"></a>
<a href="https://github.com/zcablii/ViTP"><img src="https://img.shields.io/badge/GitHub-Code-blue?logo=github"></a>
</p>

## Introduction

Modern computer vision is converging on a closed loop in which perception, reasoning, and generation mutually reinforce each other. However, the top-down influence of high-level reasoning on the foundational learning of low-level perceptual features remains largely underexplored.

ViTP addresses this gap by directly leveraging reasoning to enhance perception. It embeds a Vision Transformer (ViT) backbone within a Vision-Language Model and pretrains it end-to-end using a rich corpus of visual instruction data curated from target downstream domains. ViTP is powered by Visual Robustness Learning (VRL), which compels the ViT to learn robust and domain-relevant features from a sparse set of visual tokens.
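To make the VRL idea concrete, here is a minimal sketch of one plausible way to keep only a random, sparse subset of visual tokens before they reach the language model. The function name, the `keep_ratio` default, and the integration point are illustrative assumptions; the actual masking strategy is defined in the official codebase.

```python
import torch

def sparse_visual_tokens(tokens: torch.Tensor, keep_ratio: float = 0.25) -> torch.Tensor:
    """Keep a random subset of visual tokens (hypothetical VRL-style sparsification).

    tokens: (batch, num_tokens, dim) output tokens from the ViT backbone.
    Returns (batch, kept, dim) with the original relative token order preserved.
    """
    b, n, d = tokens.shape
    keep = max(1, int(n * keep_ratio))
    # Sample a random permutation per batch element; keep the first `keep` indices.
    idx = torch.rand(b, n, device=tokens.device).argsort(dim=1)[:, :keep]
    idx, _ = idx.sort(dim=1)  # restore original ordering among kept tokens
    return tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, d))

# Example: 256 patch tokens from a ViT-Large, only a quarter passed on to the VLM.
vis = torch.randn(2, 256, 1024)
print(sparse_visual_tokens(vis).shape)  # torch.Size([2, 64, 1024])
```

Forcing the VLM to solve instruction-following tasks from such a sparse token set is what, per the paper, pushes the ViT toward robust, domain-relevant features.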

---

![Framework](docs/vitp.png)
*A conceptual illustration of the ViTP framework. A ViT backbone is embedded within a large VLM and pretrained with domain-specific instruction following and Visual Robustness Learning (VRL).*

![Synergy](docs/loop_radar.png) 
*ViTP forges a novel link from high-level reasoning to low-level perception, establishing new state-of-the-art performance across 16 challenging benchmarks.*

---

## Pretrained Backbones

The following ViT-Large (300M) backbones are available in the repository:

| Model | Pretrain Domain | Weights |
| :--- | :--- | :--- |
| **ViTP_ViT_L_rs** | Remote Sensing | [Download](ckpts/ViTP_ViT_L_300M_rs.safetensors) |
| **ViTP_ViT_L_med** | Medical Imaging | [Download](ckpts/ViTP_ViT_L_300M_med.safetensors) |
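
As a loading sketch, the snippet below reads a checkpoint with `safetensors` and applies it to a `timm` ViT-Large. The `vit_large_patch16_224` layout is an assumption; if the shipped state dict uses different key names, remap them as documented in the official repository.

```python
import timm
from safetensors.torch import load_file

# Build a ViT-Large/16 backbone (assumption: checkpoint keys match timm's
# vit_large_patch16_224 layout; consult the official repo if they differ).
model = timm.create_model("vit_large_patch16_224", pretrained=False, num_classes=0)

state_dict = load_file("ckpts/ViTP_ViT_L_300M_rs.safetensors")
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print(f"missing: {len(missing)}, unexpected: {len(unexpected)}")
```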

These weights are designed to be used as backbone initializations for various downstream tasks (see the config sketch after this list), including:
- **Object Detection** (via MMRotate)
- **Semantic Segmentation** (via MMSegmentation)
- **Change Detection** (via OpenCD)
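
For OpenMMLab toolboxes such as MMSegmentation, the backbone is typically initialized through the standard `init_cfg` mechanism. The fragment below is a hedged sketch only: the backbone class, embedding size, and checkpoint path are assumptions, and the repository ships the exact configs.

```python
# Fragment of an MMSegmentation-style config (sketch; the official repo
# provides full configs with the correct backbone type and hyperparameters).
model = dict(
    backbone=dict(
        type='VisionTransformer',   # assumed backbone class
        img_size=512,
        patch_size=16,
        embed_dims=1024,            # ViT-Large width
        num_layers=24,
        num_heads=16,
        init_cfg=dict(
            type='Pretrained',
            # Note: the loader may require converting the .safetensors file
            # to a .pth state dict first.
            checkpoint='ckpts/ViTP_ViT_L_300M_rs.safetensors',
        ),
    ),
)
```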

For detailed installation and usage instructions, please refer to the [official GitHub repository](https://github.com/zcablii/ViTP).

## Citation

If you use this work in your research, please cite:

```bibtex
@article{Li_2025_ViTP,
  title={Visual Instruction Pretraining for Domain-Specific Foundation Models},
  author={Li, Yuxuan and Zhang, Yicheng and Tang, Wenhao and Dai, Yimian and Cheng, Ming-Ming and Li, Xiang and Yang, Jian},
  journal={arXiv preprint arXiv:2509.17562},
  year={2025}
}
```

## License
This work is licensed under a [Creative Commons Attribution-NonCommercial 4.0 International](https://creativecommons.org/licenses/by-nc/4.0/) license for non-commercial use only. For commercial use, please obtain formal permission from the authors.