BiliSakura committed on
Commit cb4e5b7 · verified · 1 Parent(s): 5709afa

Update README.md

Files changed (1):
  1. README.md (+66 −64)
README.md CHANGED
@@ -1,64 +1,66 @@
*(Previous revision omitted: it is identical to the updated file below except that it lacked the `base_model` front-matter entry.)*
---
license: cc-by-nc-4.0
tags:
- vision
- image-classification
- vit
- ViTP
- InternVL
- domain-adaptation
- general
language:
- en
library_name: transformers
pipeline_tag: image-feature-extraction
base_model:
- GreatBird/ViTP
---

# ViTP-InternVL-1B-General

ViTP (Visual Instruction Pretraining) vision backbone: the **InternVL 1B** variant, pretrained on **general**-domain visual instruction data. Compatible with `InternVisionModel` from [InternVL](https://github.com/OpenGVLab/InternVL).

## Model Details

- **Architecture**: InternVisionModel (24 layers, hidden size 1024, 16 attention heads)
- **Image size**: 448×448
- **Patch size**: 14
- **Domain**: General

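With a 448×448 input and 14×14 patches, the backbone tokenizes each image into a 32×32 grid. A quick sanity check of the sequence length (the `+ 1` assumes a single CLS token is prepended, as the pooled output below suggests):

```python
image_size, patch_size = 448, 14

# Patches per side and total patch tokens for the ViT backbone
grid = image_size // patch_size   # 32 patches per side
num_patches = grid * grid         # 1024 patch tokens
seq_len = num_patches + 1         # assumed: +1 for the CLS token

print(grid, num_patches, seq_len)  # → 32 1024 1025
```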
## Usage

The model repo ships its own modeling code, so it loads directly with `transformers` (no ViTP repo needed):

```python
from PIL import Image
import torch
from transformers import AutoModel, AutoImageProcessor

device = "cuda"
model = AutoModel.from_pretrained(
    "BiliSakura/ViTP-InternVL-1B-General",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map=device,
).eval()

processor = AutoImageProcessor.from_pretrained("BiliSakura/ViTP-InternVL-1B-General")
image = Image.open("image.jpg").convert("RGB")  # the processor expects a PIL image, not a path
pixel_values = processor(images=image, return_tensors="pt").pixel_values.to(device, model.dtype)

with torch.no_grad():
    outputs = model(pixel_values=pixel_values)

# Pooled CLS token: (1, 1024)
features = outputs.pooler_output
# Or the full patch sequence: outputs.last_hidden_state
```
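The pooled features are convenient for retrieval or clustering. A minimal cosine-similarity sketch, in pure Python for clarity (in practice you would call `torch.nn.functional.cosine_similarity` on `pooler_output` tensors):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors (e.g. pooled CLS embeddings)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 2-d vectors standing in for two images' 1024-dim pooled features
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # identical direction → 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal → 0.0
```

Scores near 1.0 indicate visually similar images under the backbone's representation; ranking a gallery by similarity to a query embedding gives a simple nearest-neighbor image search.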
## Citation

```bibtex
@article{Li_2025_ViTP,
  title={Visual Instruction Pretraining for Domain-Specific Foundation Models},
  author={Li, Yuxuan and Zhang, Yicheng and Tang, Wenhao and Dai, Yimian and Cheng, Ming-Ming and Li, Xiang and Yang, Jian},
  journal={arXiv},
  year={2025}
}
```