Tags: Image-Text-to-Text · Transformers · TensorBoard · Safetensors · feature-extraction · conversational · custom_code
xiangan committed · Commit 35f4ece · verified · 1 Parent(s): 2e2c9d4

Update README.md

Files changed (1):
1. README.md (+15 −19)
README.md CHANGED

@@ -5,6 +5,7 @@ base_model:
 datasets:
 - lmms-lab/LLaVA-One-Vision-1.5-Mid-Training-85M
 - lmms-lab/LLaVA-OneVision-1.5-Insturct-Data
+- HuggingFaceM4/FineVision
 library_name: transformers
 license: apache-2.0
 pipeline_tag: image-text-to-text
@@ -19,29 +20,24 @@ Project Page: [https://huggingface.co/spaces/lmms-lab/LLaVA-OneVision-1.5](https://huggingface.co/spaces/lmms-lab/LLaVA-OneVision-1.5)
 Code: [https://github.com/EvolvingLMMs-Lab/LLaVA-OneVision-1.5](https://github.com/EvolvingLMMs-Lab/LLaVA-OneVision-1.5)
 
 
-**LLaVA-OneVision1.5** introduces a novel family of **fully open-source** Large Multimodal Models (LMMs) that achieves **state-of-the-art performance** with substantially **lower cost** through training on **native resolution** images.
+## Introduction
+**LLaVA-OneVision-1.5** introduces a novel family of **fully open-source** Large Multimodal Models (LMMs) that achieves **state-of-the-art performance** with substantially **lower cost** through training on **native resolution** images.
 
-- **Superior Performance**
-A family of fully open-source large multimodal models demonstrating
-  - Superior performance across multiple multimodal benchmarks
-  - outperforming **Qwen2.5-VL** in most evaluation tasks.
+#### **Superior Performance**
+- The model leads on multiple multimodal benchmarks and generally surpasses Qwen2.5-VL.
+- Training on native-resolution images significantly improves its visual understanding.
 
-- **High-Quality Data at Scale**
-Meticulously curated **pre-training and SFT data** with rigorous filtering and quality control, achieving **superior data efficiency** with only **64B tokens**.
-  - Concept-balanced, highly diverse, high-quality caption data
-  - Comprehensive instruction fine-tuning data covering a wide range of tasks
+#### **High-Quality Data at Scale**
+- The pretraining corpus comprises large-scale, concept-balanced, diverse, and high-quality captions curated with strict filtering and quality control.
+- The instruction-tuning dataset is comprehensive and covers a wide range of tasks.
 
-- **Ultra-Efficient Training Framework** Complete end-to-end training framework designed for maximum efficiency:
-  - $16000 total budget for full model training on A100 GPUs ($0.6 per GPU/Hour)
-  - Built on **MegatronLM** with support for **MoE**, **FP8**, and **long sequence parallelization**
-  - Optimized codebase for cost-effective scaling
+#### **Ultra-Efficient Training Framework**
+- The end-to-end training cost is about $16,000 on A100 GPUs at roughly $0.60 per GPU-hour.
+- The system is built on Megatron-LM with support for MoE, FP8, and long-sequence parallelism, and the codebase is optimized for cost-effective scaling.
 
-
-- **Fully Open Framework** for community access and reproducibility:
-  - High-quality pre-training & SFT data
-  - Complete training framework & code
-  - Training recipes & configurations
-  - Comprehensive training logs & metrics
+#### **Fully Open Framework**
+- The project releases high-quality pretraining and SFT datasets along with the complete training framework, configurations, and recipes.
+- It also provides detailed training logs and metrics to enable reproducibility and community adoption.
 
 
 ## Models
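
For context on what the card metadata implies for users, here is a minimal sketch of loading the model through the `image-text-to-text` pipeline declared by `pipeline_tag`, under `library_name: transformers` and the `custom_code` tag. The repo id `lmms-lab/LLaVA-OneVision-1.5-8B-Instruct` is an assumption based on the project naming; check the hub for the exact id.

```python
from transformers import pipeline

# Sketch based on the card metadata: library_name: transformers,
# pipeline_tag: image-text-to-text, plus the custom_code tag.
pipe = pipeline(
    "image-text-to-text",
    model="lmms-lab/LLaVA-OneVision-1.5-8B-Instruct",  # assumed repo id
    trust_remote_code=True,  # implied by the custom_code tag
)

# The image-text-to-text pipeline accepts chat-style messages
# whose content mixes image references and text.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/photo.jpg"},  # placeholder image URL
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
print(pipe(text=messages, max_new_tokens=64))
```

`trust_remote_code=True` is needed whenever a repo ships custom modeling code, which the `custom_code` tag on this card suggests.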