---
license: mit
pipeline_tag: image-text-to-text
library_name: transformers
base_model:
- OpenGVLab/InternViT-6B-448px-V1-2
- NousResearch/Nous-Hermes-2-Yi-34B
base_model_relation: merge
language:
- multilingual
tags:
- internvl
- vision
- ocr
- multi-image
- video
- custom_code
---

# InternVL-Chat-V1-2-Plus
InternVL-Chat-V1-2-Plus uses the same model architecture as [InternVL-Chat-V1-2](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-2).

- **Training Strategy:**

  - Pre-training Stage
    - Learnable Component: ViT + MLP
    - Data: Trained on 8192 x 4800 = 39.3M samples, including COYO, LAION, CC12M, CC3M, SBU, Wukong, GRIT, Objects365, OpenImages, and OCR data.
    - Note: In this stage, we first load the pre-trained weights of [InternViT-6B-448px-V1-0](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-0) and connect them to Nous-Hermes-2-Yi-34B. After pre-training, the extracted ViT is published as [InternViT-6B-448px-V1-2](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2). To reduce the number of visual tokens, we apply a pixel shuffle that merges the 1024 tokens into 256.
  - Supervised Fine-tuning Stage
    - Learnable Component: ViT + MLP + LLM
    - Data: 12 million SFT samples.
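The pixel-shuffle step mentioned in the pre-training note can be sketched as follows. This is an illustrative NumPy version, not the model's actual implementation: it assumes a 2x downscale over a 32x32 token grid (1024 tokens), and the channel width of 3200 is only a placeholder. Each 2x2 neighborhood of tokens is folded into one token with 4x the channels, so the token count drops by 4x while no information is discarded.

```python
import numpy as np

def pixel_shuffle(tokens: np.ndarray, scale: int = 2) -> np.ndarray:
    """Merge each (scale x scale) patch of a (H, W, C) token grid into one
    token, returning a (H/scale, W/scale, C*scale^2) grid."""
    h, w, c = tokens.shape
    assert h % scale == 0 and w % scale == 0
    x = tokens.reshape(h // scale, scale, w // scale, scale, c)
    x = x.transpose(0, 2, 1, 3, 4)  # gather each scale x scale neighborhood
    return x.reshape(h // scale, w // scale, c * scale * scale)

# 32x32 grid = 1024 visual tokens; channel size 3200 is illustrative.
grid = np.random.randn(32, 32, 3200).astype(np.float32)
merged = pixel_shuffle(grid)
print(merged.shape)  # (16, 16, 12800): 256 tokens with 4x the channels
```

Because the operation is a pure reshape/transpose, the sequence fed to the LLM shrinks from 1024 to 256 tokens at the cost of a wider per-token feature, which the MLP projector then maps into the LLM's embedding space.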