datasets:
- lmms-lab/LLaVA-One-Vision-1.5-Mid-Training-85M
- lmms-lab/LLaVA-OneVision-1.5-Insturct-Data
- HuggingFaceM4/FineVision
library_name: transformers
license: apache-2.0
pipeline_tag: image-text-to-text
Project Page: [https://huggingface.co/spaces/lmms-lab/LLaVA-OneVision-1.5](https://huggingface.co/spaces/lmms-lab/LLaVA-OneVision-1.5)

Code: [https://github.com/EvolvingLMMs-Lab/LLaVA-OneVision-1.5](https://github.com/EvolvingLMMs-Lab/LLaVA-OneVision-1.5)

## Introduction
**LLaVA-OneVision-1.5** introduces a novel family of **fully open-source** Large Multimodal Models (LMMs) that achieves **state-of-the-art performance** with substantially **lower cost** through training on **native resolution** images.
#### **Superior Performance**
- The model leads on multiple multimodal benchmarks and generally surpasses Qwen2.5-VL.
- Training on native-resolution images significantly improves its visual understanding.
#### **High-Quality Data at Scale**
- The pretraining corpus comprises large-scale, concept-balanced, diverse, and high-quality captions curated with strict filtering and quality control.
- The instruction-tuning dataset is comprehensive and covers a wide range of tasks.
#### **Ultra-Efficient Training Framework**
- The end-to-end training cost is about $16,000 on A100 GPUs at roughly $0.60 per GPU-hour.
- The system is built on Megatron-LM with support for MoE, FP8, and long-sequence parallelism, and the codebase is optimized for cost-effective scaling.
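As a quick sanity check, the quoted budget pins down the total compute. The $16,000 total and the $0.60-per-GPU-hour rate come from the bullet above; the 256-GPU cluster size below is a hypothetical illustration, not a reported figure:

```python
# Back-of-envelope check of the training budget quoted above.
total_cost_usd = 16_000        # end-to-end training cost (from the card)
rate_usd_per_gpu_hour = 0.60   # quoted A100 rental rate

gpu_hours = total_cost_usd / rate_usd_per_gpu_hour
print(f"total compute: {gpu_hours:,.0f} A100 GPU-hours")  # ~26,667

# On a hypothetical 256-GPU cluster, that budget corresponds to roughly:
days = gpu_hours / 256 / 24
print(f"wall-clock on 256 GPUs: ~{days:.1f} days")  # ~4.3 days
```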
#### **Fully Open Framework**
- The project releases high-quality pretraining and SFT datasets along with the complete training framework, configurations, and recipes.
- It also provides detailed training logs and metrics to enable reproducibility and community adoption.
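Since the card declares `library_name: transformers` and `pipeline_tag: image-text-to-text`, inference can be sketched with the generic `pipeline` API. This is a minimal, untested sketch: the checkpoint id `lmms-lab/LLaVA-OneVision-1.5-8B-Instruct` and the image URL are placeholders/assumptions, not confirmed by this card.

```python
# Hedged sketch: chat-style inference through transformers' generic
# "image-text-to-text" pipeline. Checkpoint id and image URL are placeholders.

def build_messages(image_url: str, question: str) -> list:
    """One user turn pairing an image with a text question, in the
    chat-message format the image-text-to-text pipeline accepts."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": image_url},
                {"type": "text", "text": question},
            ],
        }
    ]

if __name__ == "__main__":
    from transformers import pipeline  # downloads the checkpoint on first use

    pipe = pipeline(
        "image-text-to-text",
        model="lmms-lab/LLaVA-OneVision-1.5-8B-Instruct",  # assumed repo id
    )
    messages = build_messages("https://example.com/cat.png", "What is in this image?")
    out = pipe(text=messages, max_new_tokens=64)
    print(out[0]["generated_text"])
```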
## Models