---
license: apache-2.0
language:
- en
base_model:
- lmms-lab/llava-onevision-qwen2-0.5b-si
---

This is a student project that walks through the LLaVA-OneVision training process using only a single 16GB graphics card. The code and configuration have been modified slightly to fit within the card's memory. Details can be found in the repository: https://github.com/Meur3ault/LLaVA-NeXT

The training includes:

1. Stage-1: pre-training of the projector (**about 21 h, 558K samples**)
2. Stage-1.5: mid-stage training (**about 3 h, 20K samples**)
3. Stage-2: single-image training (**about 10 h, 73K samples**)
4. Stage-2: OneVision training (**about 6 h, 9.9K multi-image samples and 0.34K video samples**)

Evaluation results compared with llava-hf/llava-onevision-qwen2-0.5b-ov-hf:

**Images**

| Benchmark | llava-hf/llava-onevision-qwen2-0.5b-ov-hf | Ours |
|---|---|---|
| AI2D | 57.1% | 46.86% |
| ChartQA | 61.4% | 7.88% |
| DocVQA (val) | 70.0% | 15.90% |
| SeedBench (image) | 65.5% | 49.18% |

**Videos**

| Benchmark | llava-hf/llava-onevision-qwen2-0.5b-ov-hf | Ours |
|---|---|---|
| SeedBench (video) | 44.2% | 42.03% |
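For readers curious what "modified to fit in the card" typically means in practice, the following is a minimal sketch of a Stage-1 launch with common memory-saving overrides (ZeRO-3 offloading, gradient checkpointing, a small micro-batch with gradient accumulation). The script path and flag names follow the upstream LLaVA training entry point; the exact flags and values used in the linked fork may differ, so treat every name and number below as illustrative, not as the project's actual configuration.

```shell
# Hypothetical sketch, NOT the fork's exact config: the kind of overrides
# used to fit Stage-1 projector pre-training onto one 16GB GPU.
deepspeed llava/train/train_mem.py \
    --deepspeed scripts/zero3_offload.json \
    --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
    --version plain \
    --tune_mm_mlp_adapter True \
    --bf16 True \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --gradient_checkpointing True \
    --model_max_length 2048
```

The key trade-off is throughput for memory: ZeRO-3 CPU offloading and gradient checkpointing slow each step down, which is consistent with the long wall-clock times reported above.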