---
license: cc-by-nc-4.0
pipeline_tag: image-feature-extraction
tags:
- remote-sensing
- medical-imaging
- vision-transformer
---
|
|
|
|
|
# Visual Instruction Pretraining for Domain-Specific Foundation Models |
|
|
|
|
|
Official model weights and documentation for **ViTP** (Visual insTruction Pretraining), a novel paradigm for pretraining foundation models in downstream domains such as Remote Sensing and Medical Imaging.
|
|
|
|
|
<p align="center">
  <a href="http://arxiv.org/abs/2509.17562"><img src="https://img.shields.io/badge/arXiv-Paper-b31b1b?logo=Arxiv"></a>
  <a href="https://github.com/zcablii/ViTP"><img src="https://img.shields.io/badge/GitHub-Code-blue?logo=github"></a>
</p>
|
|
|
|
|
## Introduction |
|
|
|
|
|
Modern computer vision is converging on a closed loop in which perception, reasoning, and generation mutually reinforce one another. However, the top-down influence of high-level reasoning on the learning of low-level perceptual features remains underexplored.
|
|
|
|
|
ViTP addresses this gap by directly leveraging reasoning to enhance perception. It embeds a Vision Transformer (ViT) backbone within a Vision-Language Model and pretrains it end-to-end using a rich corpus of visual instruction data curated from target downstream domains. ViTP is powered by Visual Robustness Learning (VRL), which compels the ViT to learn robust and domain-relevant features from a sparse set of visual tokens. |
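For intuition, here is a minimal, hypothetical sketch of the token-sparsification idea behind VRL, assuming a ViT that emits a `(batch, tokens, dim)` tensor of patch embeddings. The function name and the `keep_ratio` value are illustrative choices, not the paper's actual procedure or hyperparameters:

```python
import torch

def sparsify_visual_tokens(tokens: torch.Tensor, keep_ratio: float = 0.25) -> torch.Tensor:
    """Keep a random sparse subset of visual tokens (illustrative only).

    tokens: (batch, num_tokens, dim) patch embeddings from a ViT backbone.
    Training the VLM on only the surviving subset pressures the backbone
    to pack robust, domain-relevant information into every token.
    """
    b, n, d = tokens.shape
    n_keep = max(1, int(n * keep_ratio))
    # Draw an independent random permutation per sample; keep n_keep indices.
    idx = torch.rand(b, n, device=tokens.device).argsort(dim=1)[:, :n_keep]
    return tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, d))
```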
|
|
|
|
|
--- |
|
|
|
|
|
 |
|
|
*A conceptual illustration of the ViTP framework. A ViT backbone is embedded within a large VLM and pretrained with domain-specific instruction following and Visual Robustness Learning (VRL).* |
|
|
|
|
|
 |
|
|
*ViTP forges a novel link from high-level reasoning to low-level perception, establishing new state-of-the-art performance across 16 challenging benchmarks.* |
|
|
|
|
|
--- |
|
|
|
|
|
## Pretrained Backbones |
|
|
|
|
|
The following ViT-Large (300M-parameter) backbones are available in the repository:
|
|
|
|
|
| Model | Pretraining Domain | Weights |
| :--- | :--- | :--- |
| **ViTP_ViT_L_rs** | Remote Sensing | [Download](ckpts/ViTP_ViT_L_300M_rs.safetensors) |
| **ViTP_ViT_L_med** | Medical Imaging | [Download](ckpts/ViTP_ViT_L_300M_med.safetensors) |
|
|
|
|
|
These weights are designed to be used as initializations for various downstream tasks, including:

- **Object Detection** (via MMRotate)
- **Semantic Segmentation** (via MMSegmentation)
- **Change Detection** (via OpenCD)
|
|
|
|
|
For detailed installation and usage instructions, please refer to the [official GitHub repository](https://github.com/zcablii/ViTP). |
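As a quick-start illustration (not the official loading code), the checkpoints can be read with the `safetensors` library and loaded non-strictly into a generic ViT-L. Here a timm model stands in for whichever backbone implementation your downstream framework uses; the checkpoint's key names may need remapping:

```python
import timm
from safetensors.torch import load_file

# Read the remote-sensing checkpoint listed in the table above.
state_dict = load_file("ckpts/ViTP_ViT_L_300M_rs.safetensors")

# Illustration only: a generic timm ViT-L/16 feature extractor. The
# checkpoint's key layout may differ from timm's, so load non-strictly
# and check how many tensors actually matched.
model = timm.create_model("vit_large_patch16_224", pretrained=False, num_classes=0)
result = model.load_state_dict(state_dict, strict=False)
print("missing:", len(result.missing_keys), "unexpected:", len(result.unexpected_keys))
```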
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this work in your research, please cite: |
|
|
|
|
|
```bibtex
@article{Li_2025_ViTP,
  title={Visual Instruction Pretraining for Domain-Specific Foundation Models},
  author={Li, Yuxuan and Zhang, Yicheng and Tang, Wenhao and Dai, Yimian and Cheng, Ming-Ming and Li, Xiang and Yang, Jian},
  journal={arXiv preprint arXiv:2509.17562},
  year={2025}
}
```
|
|
|
|
|
## License |
|
|
Licensed under the [Creative Commons Attribution-NonCommercial 4.0 International](https://creativecommons.org/licenses/by-nc/4.0/) license for non-commercial use only. For commercial use, please obtain formal permission from the authors.