Add metadata and improve model card (#1)
- Add metadata and improve model card (60fb4762ca4b20b24426bd9c7c36b1fc51306aee)
Co-authored-by: Niels Rogge <nielsr@users.noreply.huggingface.co>
README.md
CHANGED
---
license: cc-by-nc-4.0
pipeline_tag: image-feature-extraction
tags:
- remote-sensing
- medical-imaging
- vision-transformer
---

# Visual Instruction Pretraining for Domain-Specific Foundation Models

Official model weights and documentation for **ViTP** (Visual insTruction Pretraining), a novel paradigm for pretraining foundation models in downstream domains such as remote sensing and medical imaging.

<p align="center">
  <a href="http://arxiv.org/abs/2509.17562"><img src="https://img.shields.io/badge/arXiv-Paper-b31b1b?logo=Arxiv"></a>
  <a href="https://github.com/zcablii/ViTP"><img src="https://img.shields.io/badge/GitHub-Code-blue?logo=github"></a>
</p>

## Introduction

Modern computer vision is converging on a closed loop in which perception, reasoning, and generation mutually reinforce one another. However, the top-down influence of high-level reasoning on the learning of low-level perceptual features remains underexplored.

ViTP addresses this gap by directly leveraging reasoning to enhance perception. It embeds a Vision Transformer (ViT) backbone within a Vision-Language Model (VLM) and pretrains it end-to-end on a rich corpus of visual instruction data curated from target downstream domains. ViTP is powered by Visual Robustness Learning (VRL), which compels the ViT to learn robust, domain-relevant features from a sparse set of visual tokens.
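
The mechanics of VRL are detailed in the paper rather than in this card, so the snippet below is only a minimal sketch of the "sparse set of visual tokens" idea: a random subset of the ViT's output tokens is retained before being handed to the language model. The `keep_ratio` value and the uniform sampling scheme are illustrative assumptions, not the paper's settings.

```python
import torch

def sparsify_visual_tokens(tokens: torch.Tensor, keep_ratio: float = 0.25) -> torch.Tensor:
    """Keep a random subset of visual tokens (illustrative sketch, not the paper's exact VRL).

    tokens: (batch, num_tokens, dim) features produced by the ViT backbone.
    Returns (batch, num_kept, dim) retained tokens.
    """
    b, n, d = tokens.shape
    num_kept = max(1, int(n * keep_ratio))  # keep_ratio is an assumed hyperparameter
    # Draw a random permutation per sample and keep the first num_kept positions.
    idx = torch.argsort(torch.rand(b, n, device=tokens.device), dim=1)[:, :num_kept]
    return torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, d))
```
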
---

*Figure: A conceptual illustration of the ViTP framework. A ViT backbone is embedded within a large VLM and pretrained with domain-specific instruction following and Visual Robustness Learning (VRL).*

*Figure: ViTP forges a novel link from high-level reasoning to low-level perception, establishing new state-of-the-art performance across 16 challenging benchmarks.*

---

## Pretrained Backbones

The following ViT-Large (300M) backbones are available in this repository:

| Model | Pretraining Domain | Weights |
| :--- | :--- | :--- |
| **ViTP_ViT_L_rs** | Remote Sensing | [Download](ckpts/ViTP_ViT_L_300M_rs.safetensors) |
| **ViTP_ViT_L_med** | Medical Imaging | [Download](ckpts/ViTP_ViT_L_300M_med.safetensors) |

These weights are designed to serve as initializations for various downstream tasks, including the following (see the config sketch after this list):

- **Object Detection** (via MMRotate)
- **Semantic Segmentation** (via MMSegmentation)
- **Change Detection** (via OpenCD)
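
As one illustration, MMLab-style frameworks such as MMSegmentation typically point the backbone at pretrained weights through an `init_cfg` block. The snippet below is a hypothetical sketch, not a config from the ViTP repo: the backbone hyperparameters are standard ViT-Large values, and the checkpoint path assumes the table above (converting the `.safetensors` file to a `.pth` state dict may be required depending on the framework version).

```python
# Hypothetical MMSegmentation-style backbone config (illustrative only; the
# official downstream configs live in the ViTP GitHub repository).
model = dict(
    backbone=dict(
        type='VisionTransformer',  # mmseg's ViT backbone
        img_size=(512, 512),       # assumed crop size
        patch_size=16,
        embed_dims=1024,           # standard ViT-Large width
        num_layers=24,
        num_heads=16,
        init_cfg=dict(
            type='Pretrained',
            checkpoint='ckpts/ViTP_ViT_L_300M_rs.safetensors',  # may need .pth conversion
        ),
    ),
)
```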

For detailed installation and usage instructions, please refer to the [official GitHub repository](https://github.com/zcablii/ViTP).
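
For a quick standalone check of the weights outside those frameworks, one possible loading path is via `safetensors` and `timm`. This is a sketch under assumptions: the `timm` architecture name and the exact state-dict key layout are not specified in this card, so `strict=False` and some key remapping may be needed.

```python
import timm
import torch
from safetensors.torch import load_file

# Load the released weights (path from the table above).
state_dict = load_file("ckpts/ViTP_ViT_L_300M_rs.safetensors")

# 'vit_large_patch16_224' is an assumed architecture match for ViT-Large.
model = timm.create_model("vit_large_patch16_224", pretrained=False, num_classes=0)
result = model.load_state_dict(state_dict, strict=False)
print(f"missing keys: {len(result.missing_keys)}, unexpected: {len(result.unexpected_keys)}")

# Use the backbone as a feature extractor (matches the image-feature-extraction tag).
with torch.no_grad():
    feats = model(torch.randn(1, 3, 224, 224))
print(feats.shape)  # pooled features, (1, 1024) for ViT-Large
```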

## Citation

If you use this work in your research, please cite:

```bibtex
@article{Li_2025_ViTP,
  title={Visual Instruction Pretraining for Domain-Specific Foundation Models},
  author={Li, Yuxuan and Zhang, Yicheng and Tang, Wenhao and Dai, Yimian and Cheng, Ming-Ming and Li, Xiang and Yang, Jian},
  journal={arXiv preprint arXiv:2509.17562},
  year={2025}
}
```

## License

This model is released under the [Creative Commons Attribution-NonCommercial 4.0 International](https://creativecommons.org/licenses/by-nc/4.0/) license and is for non-commercial use only. Commercial use requires formal permission from the authors.