datasets:
- lmms-lab/LLaVA-One-Vision-1.5-Mid-Training-85M
- lmms-lab/LLaVA-OneVision-1.5-Insturct-Data
- HuggingFaceM4/FineVision
library_name: transformers
license: apache-2.0
pipeline_tag: image-text-to-text
Project Page: [https://huggingface.co/spaces/lmms-lab/LLaVA-OneVision-1.5](https://huggingface.co/spaces/lmms-lab/LLaVA-OneVision-1.5)

Code: [https://github.com/EvolvingLMMs-Lab/LLaVA-OneVision-1.5](https://github.com/EvolvingLMMs-Lab/LLaVA-OneVision-1.5)

## Introduction
**LLaVA-OneVision-1.5** introduces a novel family of **fully open-source** Large Multimodal Models (LMMs) that achieves **state-of-the-art performance** with substantially **lower cost** through training on **native resolution** images.
#### **Superior Performance**
- The model leads on multiple multimodal benchmarks and generally surpasses Qwen2.5-VL.
- Training on native-resolution images significantly improves its visual understanding.
#### **High-Quality Data at Scale**
- The pretraining corpus comprises large-scale, concept-balanced, diverse, and high-quality captions curated with strict filtering and quality control.
- The instruction-tuning dataset is comprehensive and covers a wide range of tasks.
#### **Ultra-Efficient Training Framework**
- The end-to-end training cost is about $16,000 on A100 GPUs at roughly $0.60 per GPU-hour.
- The system is built on Megatron-LM with support for MoE, FP8, and long-sequence parallelism, and the codebase is optimized for cost-effective scaling.
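As a quick sanity check, the quoted budget pins down the total compute. The $16,000 total and the $0.60-per-GPU-hour rate come from the bullet above; the 256-GPU cluster size below is a hypothetical illustration, not a reported figure:

```python
# Back-of-envelope check of the training budget quoted above.
total_cost_usd = 16_000        # end-to-end training cost (from the card)
rate_usd_per_gpu_hour = 0.60   # quoted A100 rental rate

gpu_hours = total_cost_usd / rate_usd_per_gpu_hour
print(f"total compute: {gpu_hours:,.0f} A100 GPU-hours")  # ~26,667

# On a hypothetical 256-GPU cluster, that budget corresponds to roughly:
days = gpu_hours / 256 / 24
print(f"wall-clock on 256 GPUs: ~{days:.1f} days")  # ~4.3 days
```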
#### **Fully Open Framework**
- The project releases high-quality pretraining and SFT datasets along with the complete training framework, configurations, and recipes.
- It also provides detailed training logs and metrics to enable reproducibility and community adoption.
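Since the card declares `library_name: transformers` and `pipeline_tag: image-text-to-text`, inference can be sketched with the generic `pipeline` API. This is a minimal, untested sketch: the checkpoint id `lmms-lab/LLaVA-OneVision-1.5-8B-Instruct` and the image URL are placeholders/assumptions, not confirmed by this card.

```python
# Hedged sketch: chat-style inference through transformers' generic
# "image-text-to-text" pipeline. Checkpoint id and image URL are placeholders.

def build_messages(image_url: str, question: str) -> list:
    """One user turn pairing an image with a text question, in the
    chat-message format the image-text-to-text pipeline accepts."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": image_url},
                {"type": "text", "text": question},
            ],
        }
    ]

if __name__ == "__main__":
    from transformers import pipeline  # downloads the checkpoint on first use

    pipe = pipeline(
        "image-text-to-text",
        model="lmms-lab/LLaVA-OneVision-1.5-8B-Instruct",  # assumed repo id
    )
    messages = build_messages("https://example.com/cat.png", "What is in this image?")
    out = pipe(text=messages, max_new_tokens=64)
    print(out[0]["generated_text"])
```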
## Models