---
license: cc-by-nc-4.0
tags:
- vision
- image-classification
- vit
- ViTP
- InternVL
- domain-adaptation
- general
language:
- en
library_name: transformers
pipeline_tag: image-feature-extraction
base_model:
- GreatBird/ViTP
---

# ViTP-InternVL-1B-General

ViTP (Visual Instruction Pretraining) vision backbone, the **InternVL 1B** variant pretrained on **general**-domain visual instruction data. Compatible with `InternVisionModel` from [InternVL](https://github.com/OpenGVLab/InternVL).
## Model Details

- **Architecture**: InternVisionModel (24 layers, 1024 hidden size, 16 attention heads)
- **Image size**: 448×448
- **Patch size**: 14
- **Domain**: General
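With a 448×448 input and 14×14 patches, the backbone tokenizes each image into a 32×32 grid of patch tokens. Assuming the standard ViT convention of a prepended CLS token, the sequence length works out as:

```python
image_size, patch_size = 448, 14

grid = image_size // patch_size  # 32 patches per side
num_patches = grid ** 2          # 1024 patch tokens
seq_len = num_patches + 1        # +1 for the CLS token

print(num_patches, seq_len)  # 1024 1025
```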
## Usage

The model repo includes the modeling code, so it loads with `transformers` alone (no ViTP repo needed):
```python
from transformers import AutoModel, AutoImageProcessor
from PIL import Image
import torch

device = "cuda"
model = AutoModel.from_pretrained(
    "BiliSakura/ViTP-InternVL-1B-General",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map=device,
).eval()

processor = AutoImageProcessor.from_pretrained("BiliSakura/ViTP-InternVL-1B-General")

# The processor expects a PIL image (or array), not a file path
image = Image.open("image.jpg").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values.to(device, model.dtype)

with torch.no_grad():
    outputs = model(pixel_values=pixel_values)

# Pooled CLS token: (1, 1024)
features = outputs.pooler_output
# Or the full token sequence: outputs.last_hidden_state
```
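The pooled features can be used directly for retrieval or image similarity. A minimal sketch with random stand-in vectors (in practice, substitute `outputs.pooler_output` from two images; numpy is used here only to keep the example self-contained):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-ins for pooled CLS features from two images, shape (1024,)
rng = np.random.default_rng(0)
feat_a = rng.standard_normal(1024)
feat_b = rng.standard_normal(1024)

print(cosine_similarity(feat_a, feat_a))  # identical features -> ~1.0
print(cosine_similarity(feat_a, feat_b))
```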

## Citation
```bibtex
@article{Li_2025_ViTP,
  title={Visual Instruction Pretraining for Domain-Specific Foundation Models},
  author={Li, Yuxuan and Zhang, Yicheng and Tang, Wenhao and Dai, Yimian and Cheng, Ming-Ming and Li, Xiang and Yang, Jian},
  journal={arXiv},
  year={2025}
}
```