---
license: cc-by-nc-4.0
pipeline_tag: image-feature-extraction
tags:
- remote-sensing
- medical-imaging
- vision-transformer
---

# Visual Instruction Pretraining for Domain-Specific Foundation Models

Official model weights and documentation for **ViTP** (Visual insTruction Pretraining), a novel paradigm for pretraining foundation models in downstream domains such as Remote Sensing and Medical Imaging.

## Introduction

Modern computer vision is converging on a closed loop in which perception, reasoning, and generation reinforce one another. However, the top-down influence of high-level reasoning on the learning of low-level perceptual features remains underexplored. ViTP addresses this gap by directly leveraging reasoning to enhance perception: it embeds a Vision Transformer (ViT) backbone within a Vision-Language Model (VLM) and pretrains it end-to-end on a rich corpus of visual instruction data curated from the target downstream domains. ViTP is powered by Visual Robustness Learning (VRL), which compels the ViT to learn robust, domain-relevant features from a sparse set of visual tokens (a conceptual sketch appears in the appendix at the end of this card).

---

![Framework](docs/vitp.png)
*A conceptual illustration of the ViTP framework. A ViT backbone is embedded within a large VLM and pretrained with domain-specific instruction following and Visual Robustness Learning (VRL).*

![Synergy](docs/loop_radar.png)
*ViTP forges a novel link from high-level reasoning to low-level perception, establishing new state-of-the-art performance across 16 challenging benchmarks.*

---

## Pretrained Backbones

The following ViT-Large (300M) backbones are available in the repository:

| Model | Pretrain Domain | Weights |
| :--- | :--- | :--- |
| **ViTP_ViT_L_rs** | Remote Sensing | [Download](ckpts/ViTP_ViT_L_300M_rs.safetensors) |
| **ViTP_ViT_L_med** | Medical Imaging | [Download](ckpts/ViTP_ViT_L_300M_med.safetensors) |

These weights are designed to serve as initializations for various downstream tasks, including:

- **Object Detection** (via MMRotate)
- **Semantic Segmentation** (via MMSegmentation)
- **Change Detection** (via OpenCD)

For detailed installation and usage instructions, please refer to the [official GitHub repository](https://github.com/zcablii/ViTP); a minimal weight-loading sketch is also given in the appendix below.

## Citation

If you use this work in your research, please cite:

```bibtex
@article{Li_2025_ViTP,
  title={Visual Instruction Pretraining for Domain-Specific Foundation Models},
  author={Li, Yuxuan and Zhang, Yicheng and Tang, Wenhao and Dai, Yimian and Cheng, Ming-Ming and Li, Xiang and Yang, Jian},
  journal={arXiv},
  year={2025}
}
```

## License

This work is licensed under the [Creative Commons Attribution-NonCommercial 4.0 International](https://creativecommons.org/licenses/by-nc/4.0/) license for non-commercial use only. Any commercial use requires formal permission from the authors.
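## Appendix: Illustrative Code Sketches

This card only summarizes Visual Robustness Learning at a high level, so the snippet below is a minimal, non-official PyTorch sketch of the core idea: only a sparse random subset of the ViT's visual tokens is handed to the language model, so the instruction-following loss pushes the backbone to pack robust, domain-relevant information into every token. The function name `sparse_visual_tokens`, the `keep_ratio` value, and the random sampling scheme are illustrative assumptions, not the paper's exact recipe.

```python
import torch

def sparse_visual_tokens(tokens: torch.Tensor, keep_ratio: float = 0.25) -> torch.Tensor:
    """Randomly retain a sparse subset of visual tokens (conceptual VRL sketch).

    tokens: (batch, num_tokens, dim) patch embeddings from the ViT backbone.
    Returns the retained tokens, shape (batch, max(1, int(num_tokens * keep_ratio)), dim).
    NOTE: keep_ratio and per-sample random sampling are assumptions for illustration.
    """
    b, n, d = tokens.shape
    n_keep = max(1, int(n * keep_ratio))
    # Draw an independent random permutation of token indices for each sample
    # and keep the first n_keep of them.
    idx = torch.rand(b, n, device=tokens.device).argsort(dim=1)[:, :n_keep]
    return tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, d))

# In end-to-end pretraining, the retained tokens would be projected into the
# VLM's embedding space and interleaved with instruction text tokens; the
# next-token prediction loss then backpropagates into the ViT.
```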
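The downstream pipelines (MMRotate, MMSegmentation, OpenCD) consume these weights through their own config systems, documented in the GitHub repository. As a quick sanity check outside those frameworks, the sketch below loads a checkpoint with the `safetensors` library; the assumption that the file holds a plain ViT-Large state dict, and the commented `timm` mapping, are illustrative and may require key renaming.

```python
from safetensors.torch import load_file

# Load the remote-sensing backbone checkpoint (path as listed in the table above).
state_dict = load_file("ckpts/ViTP_ViT_L_300M_rs.safetensors")

# Inspect a few parameter names/shapes to plan the mapping onto your backbone.
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape))

# Hypothetical mapping onto a timm ViT-Large (key renaming may be required):
# import timm
# model = timm.create_model("vit_large_patch16_224", pretrained=False)
# missing, unexpected = model.load_state_dict(state_dict, strict=False)
```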