InnovatorLab/Innovator-VL-8B-Instruct


kawhiiiileo committed on Jan 24

Commit 3822129 · verified · 1 Parent(s): 3d69a3d

Update README.md

Files changed (1)

README.md +65 -3


@@ -1,3 +1,65 @@
----
-license: apache-2.0
----

+---
+license: apache-2.0
+language:
+- en
+- zh
+pipeline_tag: text-generation
+---
+# Innovator-VL-8B-Instruct
+## Model Summary
+**Innovator-VL-8B-Instruct** is a multimodal instruction-following large language model designed for scientific understanding and reasoning.
+The model integrates strong general-purpose vision-language capabilities with enhanced scientific multimodal alignment, while maintaining a fully transparent and reproducible training pipeline.
+Unlike approaches that rely on large-scale domain-specific pretraining, Innovator-VL-8B-Instruct achieves competitive scientific performance through high-quality instruction tuning alone, with no additional continued pretraining on scientific text.
+---
+## Model Architecture
+![Innovator-VL Architecture](assets/innovator_vl_architecture.png)
+- **Vision Encoder**: RICE-ViT (region-aware visual representation)
+- **Projector**: PatchMerger for visual token compression
+- **Language Model**: Qwen3-8B-Base
+- **Model Size**: 8B parameters
+The model supports native-resolution multi-image inputs and is suitable for complex scientific visual analysis.
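
The "visual token compression" attributed to the projector above can be illustrated with a small sketch. The card does not describe how PatchMerger works internally, so this assumes a common scheme in which each 2x2 neighborhood of patch tokens is concatenated into a single wider token, reducing the token count by 4x before any learned projection. The function name `merge_patches` and its parameters are hypothetical and are not taken from the Innovator-VL code.

```python
def merge_patches(tokens, grid_h, grid_w, merge=2):
    """Concatenate each merge x merge block of patch tokens into one token.

    tokens: list of grid_h * grid_w feature vectors in row-major order.
    Assumption: this mirrors a generic patch-merging scheme, not the
    actual PatchMerger implementation used by Innovator-VL.
    """
    if len(tokens) != grid_h * grid_w:
        raise ValueError("token count must equal grid_h * grid_w")
    if grid_h % merge or grid_w % merge:
        raise ValueError("grid dimensions must be divisible by merge")

    merged = []
    for top in range(0, grid_h, merge):
        for left in range(0, grid_w, merge):
            group = []
            # Walk the block row by row and append each patch's features.
            for dr in range(merge):
                for dc in range(merge):
                    group.extend(tokens[(top + dr) * grid_w + (left + dc)])
            merged.append(group)
    return merged


# Toy 4 x 6 patch grid with 4-dimensional features per patch.
tokens = [[float(i)] * 4 for i in range(24)]
merged = merge_patches(tokens, grid_h=4, grid_w=6)
print(len(tokens), "->", len(merged))  # 24 -> 6
```

In a real projector, a learned layer would typically map each concatenated vector to the language model's hidden size. That step is omitted here because the card gives no details about it.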
+## Training Overview
+- **Multimodal Alignment**: LLaVA-1.5 (558K)
+- **Mid-training**: LLaVA-OneVision-1.5 (85M)
+- **Instruction Tuning**: High-quality multimodal and scientific instruction data (~46M)
+No additional continued pretraining on scientific text is applied.
+---
+## Intended Use
+- Scientific image understanding and question answering
+- Multimodal reasoning and analysis
+- Interpretation of scientific figures, charts, and experimental results
+- General-purpose vision-language instruction following
+---
+## Limitations
+- The Instruct version is not explicitly optimized for efficient long-chain reasoning.
+- For tasks that require structured or token-efficient reasoning, a dedicated Thinking or RL-aligned model is recommended.
+---
+## Citation
+```bibtex
+@article{innovator-vl,
+  title={Innovator-VL: A Multimodal Large Language Model for Scientific Discovery},
+  year={2025}
+}