---
license: cc-by-nc-4.0
tags:
- vision
- image-classification
- vit
- ViTP
- InternVL
- domain-adaptation
- general
language:
- en
library_name: transformers
pipeline_tag: image-feature-extraction
base_model:
- GreatBird/ViTP
---

# ViTP-InternVL-1B-General

ViTP (Visual Instruction Pretraining) vision backbone, the **InternVL 1B** variant pretrained on **general**-domain visual instruction data. Compatible with `InternVisionModel` from [InternVL](https://github.com/OpenGVLab/InternVL).
## Model Details

- **Architecture**: InternVisionModel (24 layers, 1024 hidden size, 16 attention heads)
- **Image size**: 448×448
- **Patch size**: 14
- **Domain**: General
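With a 448×448 input and 14×14 patches, the backbone tokenizes each image into a 32×32 grid of patch tokens. Assuming the standard ViT convention of a prepended CLS token, the sequence length works out as:

```python
image_size, patch_size = 448, 14

grid = image_size // patch_size  # 32 patches per side
num_patches = grid ** 2          # 1024 patch tokens
seq_len = num_patches + 1        # +1 for the CLS token

print(num_patches, seq_len)  # 1024 1025
```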
## Usage

The model repo includes the modeling code, so it loads with `transformers` alone (no ViTP repo needed):
```python
from transformers import AutoModel, AutoImageProcessor
from PIL import Image
import torch

device = "cuda"
model = AutoModel.from_pretrained(
    "BiliSakura/ViTP-InternVL-1B-General",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map=device,
).eval()

processor = AutoImageProcessor.from_pretrained("BiliSakura/ViTP-InternVL-1B-General")

# The processor expects a PIL image (or array), not a file path
image = Image.open("image.jpg").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values.to(device, model.dtype)

with torch.no_grad():
    outputs = model(pixel_values=pixel_values)

# Pooled CLS token: (1, 1024)
features = outputs.pooler_output
# Or the full token sequence: outputs.last_hidden_state
```
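The pooled features can be used directly for retrieval or image similarity. A minimal sketch with random stand-in vectors (in practice, substitute `outputs.pooler_output` from two images; numpy is used here only to keep the example self-contained):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-ins for pooled CLS features from two images, shape (1024,)
rng = np.random.default_rng(0)
feat_a = rng.standard_normal(1024)
feat_b = rng.standard_normal(1024)

print(cosine_similarity(feat_a, feat_a))  # identical features -> ~1.0
print(cosine_similarity(feat_a, feat_b))
```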

## Citation
```bibtex
@article{Li_2025_ViTP,
  title={Visual Instruction Pretraining for Domain-Specific Foundation Models},
  author={Li, Yuxuan and Zhang, Yicheng and Tang, Wenhao and Dai, Yimian and Cheng, Ming-Ming and Li, Xiang and Yang, Jian},
  journal={arXiv},
  year={2025}
}
```