kawhiiiileo committed on
Commit f986c12 · verified · 1 Parent(s): 95cc085

Update README.md

Files changed (1)
  1. README.md +3 -5
README.md CHANGED
@@ -11,16 +11,15 @@ pipeline_tag: text-generation
 
 ## Model Summary
 
-**Innovator-VL-8B-Instruct** is a multimodal instruction-following large language model designed for scientific understanding and reasoning.
-The model integrates strong general-purpose vision-language capabilities with enhanced scientific multimodal alignment, while maintaining a fully transparent and reproducible training pipeline.
+**Innovator-VL-8B-Instruct** is a multimodal instruction-following large language model designed for scientific understanding and reasoning. The model integrates strong general-purpose vision-language capabilities with enhanced scientific multimodal alignment, while maintaining a fully transparent and reproducible training pipeline.
 
 Unlike approaches that rely on large-scale domain-specific pretraining, Innovator-VL-8B-Instruct achieves competitive scientific performance using high-quality instruction tuning, without additional scientific text continued pretraining.
 
----
+--
 
 ## Model Architecture
 
-![Innovator-VL Architecture](assets/innovator_vl_architecture.png)
+<img src="assets/innovator_vl_architecture.png" width="600"/>
 
 - **Vision Encoder**: RICE-ViT (region-aware visual representation)
 - **Projector**: PatchMerger for visual token compression
@@ -29,7 +28,6 @@ Unlike approaches that rely on large-scale domain-specific pretraining, Innovato
 
 The model supports native-resolution multi-image inputs and is suitable for complex scientific visual analysis.
 
-
 ## Training Overview
 
 - **Multimodal Alignment**: LLaVA-1.5 (558K)