---
license: apache-2.0
language:
- en
base_model:
- lmms-lab/llava-onevision-qwen2-0.5b-si
---

This is a student project that walks through the LLaVA-OneVision training process using only a single 16GB graphics card. The code and configuration have been modified slightly to fit within the card's memory. Details can be found in the repository: https://github.com/Meur3ault/LLaVA-NeXT

The training includes:

1. Stage-1: pre-training of the projector (**about 21 h, 558K samples**)
2. Stage-1.5: mid-stage training (**about 3 h, 20K samples**)
3. Stage-2: single-image training (**about 10 h, 73K samples**)
4. Stage-2: OneVision training (**about 6 h, 9.9K multi-image samples and 0.34K video samples**)

Evaluation results compared with llava-hf/llava-onevision-qwen2-0.5b-ov-hf:

**Images**

| Benchmark | llava-hf/llava-onevision-qwen2-0.5b-ov-hf | Ours |
|---|---|---|
| AI2D | 57.1% | 46.86% |
| ChartQA | 61.4% | 7.88% |
| DocVQA (val) | 70.0% | 15.90% |
| SeedBench (image) | 65.5% | 49.18% |

**Videos**

| Benchmark | llava-hf/llava-onevision-qwen2-0.5b-ov-hf | Ours |
|---|---|---|
| SeedBench (video) | 44.2% | 42.03% |
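For readers curious what "modified to fit in the card" typically means in practice, the following is a minimal sketch of a Stage-1 launch with common memory-saving overrides (ZeRO-3 offloading, gradient checkpointing, a small micro-batch with gradient accumulation). The script path and flag names follow the upstream LLaVA training entry point; the exact flags and values used in the linked fork may differ, so treat every name and number below as illustrative, not as the project's actual configuration.

```shell
# Hypothetical sketch, NOT the fork's exact config: the kind of overrides
# used to fit Stage-1 projector pre-training onto one 16GB GPU.
deepspeed llava/train/train_mem.py \
    --deepspeed scripts/zero3_offload.json \
    --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
    --version plain \
    --tune_mm_mlp_adapter True \
    --bf16 True \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --gradient_checkpointing True \
    --model_max_length 2048
```

The key trade-off is throughput for memory: ZeRO-3 CPU offloading and gradient checkpointing slow each step down, which is consistent with the long wall-clock times reported above.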