InnovatorLab/Innovator-VL-8B-Instruct

Text Generation

kawhiiiileo committed on Jan 24

Commit f986c12 · verified · 1 Parent(s): 95cc085

Update README.md

Files changed (1)

README.md +3 -5

@@ -11,16 +11,15 @@ pipeline_tag: text-generation
 ## Model Summary
-**Innovator-VL-8B-Instruct** is a multimodal instruction-following large language model designed for scientific understanding and reasoning.
-The model integrates strong general-purpose vision-language capabilities with enhanced scientific multimodal alignment, while maintaining a fully transparent and reproducible training pipeline.
 Unlike approaches that rely on large-scale domain-specific pretraining, Innovator-VL-8B-Instruct achieves competitive scientific performance using high-quality instruction tuning, without additional scientific text continued pretraining.
----
 ## Model Architecture
-![Innovator-VL Architecture](assets/innovator_vl_architecture.png)
 - **Vision Encoder**: RICE-ViT (region-aware visual representation)
 - **Projector**: PatchMerger for visual token compression
@@ -29,7 +28,6 @@ Unlike approaches that rely on large-scale domain-specific pretraining, Innovato
 The model supports native-resolution multi-image inputs and is suitable for complex scientific visual analysis.
 ## Training Overview
 - **Multimodal Alignment**: LLaVA-1.5 (558K)

 ## Model Summary
+**Innovator-VL-8B-Instruct** is a multimodal instruction-following large language model designed for scientific understanding and reasoning. The model integrates strong general-purpose vision-language capabilities with enhanced scientific multimodal alignment, while maintaining a fully transparent and reproducible training pipeline.
 Unlike approaches that rely on large-scale domain-specific pretraining, Innovator-VL-8B-Instruct achieves competitive scientific performance using high-quality instruction tuning, without additional scientific text continued pretraining.
+--
 ## Model Architecture
+<img src="assets/innovator_vl_architecture.png" width="600"/>
 - **Vision Encoder**: RICE-ViT (region-aware visual representation)
 - **Projector**: PatchMerger for visual token compression
 The model supports native-resolution multi-image inputs and is suitable for complex scientific visual analysis.
 ## Training Overview
 - **Multimodal Alignment**: LLaVA-1.5 (558K)