GreatBird nielsr HF Staff committed on
Commit 2d26fd2 · verified · 1 Parent(s): 8aba460

Add metadata and improve model card (#1)

- Add metadata and improve model card (60fb4762ca4b20b24426bd9c7c36b1fc51306aee)


Co-authored-by: Niels Rogge <nielsr@users.noreply.huggingface.co>

Files changed (1)
  1. README.md +54 -12
README.md CHANGED
@@ -1,23 +1,65 @@
- Visual Instruction Pretraining for Domain-Specific Foundation Models
  <p align="center">
  <a href="http://arxiv.org/abs/2509.17562"><img src="https://img.shields.io/badge/arXiv-Paper-b31b1b?logo=Arxiv"></a>
  </p>
- # Introduction

- Modern computer vision is converging on a closed loop in which perception, reasoning and generation mutually reinforce each other. However, this loop remains incomplete: the top-down influence of high-level reasoning on the foundational learning of low-level perceptual features is not yet underexplored. This paper addresses this gap by proposing a new paradigm for pretraining foundation models in downstream domains. We introduce **V**isual **i**ns**T**ruction **P**retraining (**ViTP**), a novel approach that directly leverages reasoning to enhance perception. ViTP embeds a Vision Transformer (ViT) backbone within a Vision-Language Model and pretrains it end-to-end using a rich corpus of visual instruction data curated from target downstream domains. ViTP is powered by our proposed Visual Robustness Learning (VRL), which compels the ViT to learn robust and domain-relevant features from a sparse set of visual tokens. Extensive experiments on 16 challenging remote sensing and medical imaging benchmarks demonstrate that ViTP establishes new state-of-the-art performance across a diverse range of downstream tasks. The code is available at [GitHub](https://github.com/zcablii/ViTP).
-
- ----

- ![image/png](docs/loop_radar.png)
- The synergistic relationship between perception, generation, and reasoning in modern CV. Our proposed ViTP forges a novel link from high-level reasoning to low-level perception, a previously underexplored connection. ViTP sets new SOTA performance across a diverse range of downstream tasks in medical imaging and remote sensing.

- ----

- ![image/png](docs/vitp.png)
- A conceptual illustration of the ViTP framework. A ViT backbone is embedded within a large VLM and then pretrained with domain-specific instruction following objective and Visual Robustness Learning (VRL). This process instils high-level semantic understanding into the ViT. The resulting weights are then used to initialize models for various downstream perception tasks.

- ----

  ```bibtex

- ```

+ ---
+ license: cc-by-nc-4.0
+ pipeline_tag: image-feature-extraction
+ tags:
+ - remote-sensing
+ - medical-imaging
+ - vision-transformer
+ ---
+
+ # Visual Instruction Pretraining for Domain-Specific Foundation Models
+
+ Official model weights and documentation for **ViTP** (Visual insTruction Pretraining), a novel paradigm for pretraining foundation models in downstream domains like Remote Sensing and Medical Imaging.
+
  <p align="center">
  <a href="http://arxiv.org/abs/2509.17562"><img src="https://img.shields.io/badge/arXiv-Paper-b31b1b?logo=Arxiv"></a>
+ <a href="https://github.com/zcablii/ViTP"><img src="https://img.shields.io/badge/GitHub-Code-blue?logo=github"></a>
  </p>

+ ## Introduction
+
+ Modern computer vision is converging on a closed loop in which perception, reasoning and generation mutually reinforce each other. However, the top-down influence of high-level reasoning on the foundational learning of low-level perceptual features remains underexplored.
+
+ ViTP addresses this gap by directly leveraging reasoning to enhance perception. It embeds a Vision Transformer (ViT) backbone within a Vision-Language Model and pretrains it end-to-end using a rich corpus of visual instruction data curated from target downstream domains. ViTP is powered by Visual Robustness Learning (VRL), which compels the ViT to learn robust and domain-relevant features from a sparse set of visual tokens.
+
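+ The model card only summarizes VRL at a high level ("a sparse set of visual tokens"), so the snippet below is a toy, illustrative sketch of that general idea rather than the official VRL implementation (which lives in the GitHub repository): it simply retains a random subset of ViT patch tokens before they would be handed to the language model.
+
+ ```python
+ # Toy sketch only: NOT the official VRL objective or token-selection strategy.
+ import torch
+
+ def sparse_visual_tokens(patch_tokens: torch.Tensor, keep_ratio: float = 0.25) -> torch.Tensor:
+     """patch_tokens: (batch, num_patches, dim) features from the ViT backbone."""
+     b, n, d = patch_tokens.shape
+     num_keep = max(1, int(n * keep_ratio))
+     scores = torch.rand(b, n, device=patch_tokens.device)    # random per-sample scores
+     keep_idx = scores.topk(num_keep, dim=1).indices           # pick a sparse subset
+     keep_idx = keep_idx.sort(dim=1).values                    # keep spatial order
+     return patch_tokens.gather(1, keep_idx.unsqueeze(-1).expand(-1, -1, d))
+
+ # e.g. 576 patch tokens from a ViT-L at 336px (patch size 14), keeping 25%
+ sparse = sparse_visual_tokens(torch.randn(2, 576, 1024))
+ print(sparse.shape)  # torch.Size([2, 144, 1024])
+ ```
+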
+ ---
+
+ ![Framework](docs/vitp.png)
+ *A conceptual illustration of the ViTP framework. A ViT backbone is embedded within a large VLM and pretrained with domain-specific instruction following and Visual Robustness Learning (VRL).*
+
+ ![Synergy](docs/loop_radar.png)
+ *ViTP forges a novel link from high-level reasoning to low-level perception, establishing new state-of-the-art performance across 16 challenging benchmarks.*
+
+ ---
+
+ ## Pretrained Backbones
+
+ The following ViT-Large (300M) backbones are available in the repository:
+
+ | Model | Pretrain Domain | Weights |
+ | :--- | :--- | :--- |
+ | **ViTP_ViT_L_rs** | Remote Sensing | [Download](ckpts/ViTP_ViT_L_300M_rs.safetensors) |
+ | **ViTP_ViT_L_med** | Medical Imaging | [Download](ckpts/ViTP_ViT_L_300M_med.safetensors) |
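+
+ As a quick sanity check, the checkpoints can be opened with `safetensors`. The sketch below is illustrative only: it assumes a timm-style ViT-Large definition and key naming, whereas the configs in the official repository handle the exact backbone variant and key mapping.
+
+ ```python
+ # Illustrative sketch (not from the official repo). The architecture variant
+ # (patch size, input resolution) and state-dict key names are assumptions.
+ import timm
+ from safetensors.torch import load_file
+
+ state_dict = load_file("ckpts/ViTP_ViT_L_300M_rs.safetensors")
+ print(len(state_dict), "tensors; sample keys:", list(state_dict)[:5])
+
+ model = timm.create_model("vit_large_patch16_224", pretrained=False, num_classes=0)
+ missing, unexpected = model.load_state_dict(state_dict, strict=False)
+ print(f"missing: {len(missing)}, unexpected: {len(unexpected)}")
+ ```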
 
+ These weights are designed to be used as initializations for various downstream tasks, including:
+ - **Object Detection** (via MMRotate)
+ - **Semantic Segmentation** (via MMSegmentation)
+ - **Change Detection** (via OpenCD)
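+
+ In the MM-series frameworks above, the backbone weights are usually wired in through the config's `init_cfg`. The fragment below is a hedged, illustrative sketch (the real, tested configs are in the official repository); it assumes the `.safetensors` file has been converted to a regular PyTorch `.pth` state dict, and the converted path shown is hypothetical.
+
+ ```python
+ # Illustrative MMSegmentation-style backbone fragment; values are assumptions,
+ # not the official ViTP configuration.
+ backbone = dict(
+     type="VisionTransformer",
+     img_size=(512, 512),          # assumption; use the official config's value
+     patch_size=16,                # assumption
+     embed_dims=1024,              # ViT-Large
+     num_layers=24,
+     num_heads=16,
+     out_indices=(7, 11, 15, 23),
+     init_cfg=dict(type="Pretrained",
+                   checkpoint="ckpts/ViTP_ViT_L_300M_rs.pth"),  # hypothetical converted path
+ )
+ ```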
 
+ For detailed installation and usage instructions, please refer to the [official GitHub repository](https://github.com/zcablii/ViTP).
 
+ ## Citation

+ If you use this work in your research, please cite:

  ```bibtex
+ @article{Li_2025_ViTP,
+   title={Visual Instruction Pretraining for Domain-Specific Foundation Models},
+   author={Li, Yuxuan and Zhang, Yicheng and Tang, Wenhao and Dai, Yimian and Cheng, Ming-Ming and Li, Xiang and Yang, Jian},
+   journal={arXiv},
+   year={2025}
+ }
+ ```

+ ## License
+ Licensed under the [Creative Commons Attribution-NonCommercial 4.0 International](https://creativecommons.org/licenses/by-nc/4.0/) license, for non-commercial use only. For commercial use, please obtain formal permission from the authors.