kawhiiiileo committed on
Commit 3822129 · verified · 1 Parent(s): 3d69a3d

Update README.md

  1. README.md +65 -3
README.md CHANGED
@@ -1,3 +1,65 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ language:
+ - en
+ - zh
+ pipeline_tag: text-generation
+ ---
+
+
+ # Innovator-VL-8B-Instruct
+
+ ## Model Summary
+
+ **Innovator-VL-8B-Instruct** is a multimodal instruction-following large language model designed for scientific understanding and reasoning.
+ The model integrates strong general-purpose vision-language capabilities with enhanced scientific multimodal alignment, while maintaining a fully transparent and reproducible training pipeline.
+
+ Unlike approaches that rely on large-scale domain-specific pretraining, Innovator-VL-8B-Instruct achieves competitive scientific performance using high-quality instruction tuning, without additional continued pretraining on scientific text.
+
+ ---
+
+ ## Model Architecture
+
+ ![Innovator-VL Architecture](assets/innovator_vl_architecture.png)
+
+ - **Vision Encoder**: RICE-ViT (region-aware visual representation)
+ - **Projector**: PatchMerger for visual token compression
+ - **Language Model**: Qwen3-8B-Base
+ - **Model Size**: 8B parameters
+
+ The model supports native-resolution multi-image inputs and is suitable for complex scientific visual analysis.
+
+
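The projector bullet above describes PatchMerger only as visual token compression. The following is a minimal Python sketch of one common way such compression can work, assuming a 2x2 spatial merge that concatenates neighboring patch embeddings. The README does not state the merge factor or the projection that follows, and `merge_patches` and its parameters are hypothetical names used for illustration only.

```python
from typing import List

Vector = List[float]


def merge_patches(grid: List[List[Vector]], factor: int = 2) -> List[List[Vector]]:
    """Concatenate each factor x factor block of patch embeddings into one token.

    Assumption: a 2x2 merge, as used by some vision-language projectors.
    The actual PatchMerger configuration is not given in the README.
    """
    height, width = len(grid), len(grid[0])
    if height % factor or width % factor:
        raise ValueError("grid height and width must be divisible by factor")

    merged = []
    for top in range(0, height, factor):
        row = []
        for left in range(0, width, factor):
            token: Vector = []
            # Walk the block in row-major order and concatenate its embeddings.
            for dy in range(factor):
                for dx in range(factor):
                    token.extend(grid[top + dy][left + dx])
            row.append(token)
        merged.append(row)
    return merged


# A 4x4 grid of 3-dimensional patch embeddings: 16 visual tokens.
patches = [[[float(r), float(c), 1.0] for c in range(4)] for r in range(4)]
tokens = merge_patches(patches)
num_tokens = sum(len(row) for row in tokens)  # 4 tokens, each of dimension 12
```

With a factor of 2, the token count drops by 4x while each token's dimension grows by 4x. A real projector would typically follow the concatenation with a learned projection into the language model's embedding space, which this sketch omits.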
+ ## Training Overview
+
+ - **Multimodal Alignment**: LLaVA-1.5 (558K)
+ - **Mid-training**: LLaVA-OneVision-1.5 (85M)
+ - **Instruction Tuning**: High-quality multimodal and scientific instruction data (~46M)
+
+ No additional continued pretraining on scientific text is applied.
+
+ ---
+
+ ## Intended Use
+
+ - Scientific image understanding and question answering
+ - Multimodal reasoning and analysis
+ - Interpretation of scientific figures, charts, and experimental results
+ - General-purpose vision-language instruction following
+
+ ---
+
+ ## Limitations
+
+ - The Instruct version is not explicitly optimized for efficient long-chain reasoning.
+ - For tasks requiring structured or token-efficient reasoning, a dedicated Thinking or RL-aligned model is recommended.
+
+ ---
+
+ ## Citation
+
+ ```bibtex
+ @article{innovator-vl,
+ title={Innovator-VL: A Multimodal Large Language Model for Scientific Discovery},
+ year={2025}
+ }