---
license: cc-by-nc-4.0
tags:
- vision
- image-classification
- vit
- ViTP
- InternVL
- domain-adaptation
- general
language:
- en
library_name: transformers
pipeline_tag: image-feature-extraction
base_model:
- GreatBird/ViTP
---

# ViTP-InternVL-1B-General

ViTP (Visual Instruction Pretraining) vision backbone: the **InternVL 1B** variant, pretrained on **general**-domain visual instruction data. Compatible with `InternVisionModel` from [InternVL](https://github.com/OpenGVLab/InternVL).

## Model Details

- **Architecture**: InternVisionModel (24 layers, 1024 hidden, 16 heads)
- **Image size**: 448×448
- **Patch size**: 14
- **Domain**: General

## Usage

The model repo includes the modeling code. Load with `transformers` (no ViTP repo needed):

```python
from PIL import Image
from transformers import AutoModel, AutoImageProcessor
import torch

device = "cuda"

# trust_remote_code=True is required because the modeling code ships with the model repo.
model = AutoModel.from_pretrained(
    "BiliSakura/ViTP-InternVL-1B-General",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map=device,
).eval()
processor = AutoImageProcessor.from_pretrained("BiliSakura/ViTP-InternVL-1B-General")

# The processor expects image data, not a file path, so open the file first.
image = Image.open("image.jpg").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values.to(device, model.dtype)

with torch.no_grad():
    outputs = model(pixel_values=pixel_values)

# Pooled CLS token: (1, 1024)
features = outputs.pooler_output
# Or full sequence: outputs.last_hidden_state
```

## Citation

```bibtex
@article{Li_2025_ViTP,
  title={Visual Instruction Pretraining for Domain-Specific Foundation Models},
  author={Li, Yuxuan and Zhang, Yicheng and Tang, Wenhao and Dai, Yimian and Cheng, Ming-Ming and Li, Xiang and Yang, Jian},
  journal={arXiv},
  year={2025}
}
```
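
## Appendix: expected sequence length

The image size (448) and patch size (14) listed under Model Details determine how many tokens `outputs.last_hidden_state` should contain. The sketch below works through that arithmetic. It assumes a single class token is prepended to the patch tokens, which the pooled CLS output in the usage example suggests but this card does not state outright. The helper name `vit_sequence_length` is hypothetical and not part of the model's code.

```python
def vit_sequence_length(image_size: int, patch_size: int, num_prefix_tokens: int = 1) -> int:
    """Return the token count for a square image split into square patches.

    num_prefix_tokens=1 is an assumption: one CLS token ahead of the patch tokens.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size  # 448 // 14 = 32
    return patches_per_side ** 2 + num_prefix_tokens  # 32 * 32 + 1


seq_len = vit_sequence_length(448, 14)
print(seq_len)  # 1025
```

Under that assumption, a single 448×448 image would yield a `last_hidden_state` of shape `(1, 1025, 1024)`. Checking `outputs.last_hidden_state.shape` after a forward pass is a quick way to confirm it.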