# LLaVA OneVision
## Model Details
LLaVA OneVision is a multimodal model capable of processing images, text, image-text interleaved inputs, and videos. The model is trained in multiple stages:
1. Stage-1: Initial training on 558K samples from the LCS dataset.
2. Stage-1.5: Training on 4M high-quality samples with detailed captions, OCR, and knowledge data.
3. Stage-2:
   - Single-Image: Training on 3.2M instruction-following image samples.
   - OneVision: Training on 1.6M single-image, multi-image, and video samples with instructions.
Key features:
- Supports input resolutions up to 2304 × 2304 pixels.
- A single image is represented by at most 729 × (9 + 1) tokens under the `anyres_max_9` mode.
- Supports multi-image and video inputs: multi-image inputs use 729 tokens per image, and video inputs use 196 tokens per frame.
- Available in three sizes (0.5B, 7B, and 72B parameters) to fit different memory and inference-latency requirements.
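The token budgets above reduce to simple arithmetic, sketched below (treating 729 as one encoder grid per image view is an assumption for illustration):

```python
# Back-of-the-envelope visual token budgets from the model card above.
BASE_IMAGE_TOKENS = 729          # tokens for one image view from the vision encoder
MAX_CROPS = 9                    # anyres_max_9: up to 9 crops plus 1 base view
VIDEO_TOKENS_PER_FRAME = 196     # pooled per-frame representation

def visual_tokens(kind, n=1):
    """Rough upper bound on visual tokens for n images or n frames."""
    if kind == "single_image":
        return BASE_IMAGE_TOKENS * (MAX_CROPS + 1)   # at most 729 * 10 = 7290
    if kind == "multi_image":
        return BASE_IMAGE_TOKENS * n
    if kind == "video":
        return VIDEO_TOKENS_PER_FRAME * n
    raise ValueError(f"unknown input kind: {kind}")
```

For example, a 32-frame video costs 196 × 32 = 6272 visual tokens, noticeably cheaper per view than the multi-image path.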
Implementation details:
- Trained with AdamW, using separate learning rates for the vision-specific components (2e-6) and the language model (1e-5).
- Each stage is trained for one epoch.
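A minimal sketch of the two learning-rate groups described above, assuming vision parameters can be identified by a `vision_tower` substring in their names (the actual grouping in the training scripts may differ):

```python
# Hypothetical sketch: split named parameters into two optimizer groups,
# mirroring the vision-specific (2e-6) and language-model (1e-5) rates above.
def build_param_groups(named_params, vision_lr=2e-6, base_lr=1e-5):
    vision, language = [], []
    for name, param in named_params:
        # Matching on "vision_tower" is an assumption for illustration.
        (vision if "vision_tower" in name else language).append(param)
    return [
        {"params": vision, "lr": vision_lr},
        {"params": language, "lr": base_lr},
    ]
```

Such a list of group dicts could then be passed to `torch.optim.AdamW` in place of a flat parameter list.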
The model uses [SO400M](https://huggingface.co/collections/google/siglip-659d5e62f0ae1a57ae0e83ba) as the vision encoder and [Qwen-2.0](https://huggingface.co/docs/transformers/model_doc/qwen2) as the language model, with trainable components that include the projector and, in later stages, the full model.
We recommend using the scripts in [training](../scripts/) for the details of the training process.
## Inference Guidance
We recommend following the [tutorial](./LLaVA_OneVision_Tutorials.ipynb) to get started with our most basic 0.5B model on image, text, image-text interleaved, and video inputs. The 0.5B version, used as the example throughout, runs on a GPU with as little as 4GB of memory, and the examples show it performs surprisingly well at understanding images, interleaved image-text, and video. Tiny but mighty!
## Evaluation Guidance
We use the [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval) toolkit to evaluate our models. Ensure you have installed the LLaVA-NeXT model files as per the instructions in the main README.md.
Install lmms-eval:
```bash
pip install git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git
```
### Reproducing Evaluation Results
Our models' evaluation results can be fully reproduced with the lmms-eval toolkit. After installing lmms-eval and llava, you can run the evaluations with the commands below.
Note: These commands require flash-attn. If you prefer not to install it, disable it by adding `attn_implementation=None` to the `--model_args` parameter.
Important: Different torch versions may cause slight variations in results. By default, `lmms-eval` requires the latest torch version, while the `llava` repo pins torch to `2.1.2`. Torch `2.1.2` is stable for both `llava` and `lmms-eval`.
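As a sketch, a run without flash-attn appends the flag inside `--model_args` (the `mme` task here is just a small example; substitute any task list):

```bash
# Same invocation style as the commands below, with flash-attn disabled
# via attn_implementation=None inside --model_args.
accelerate launch --num_processes=8 \
  -m lmms_eval \
  --model llava_onevision \
  --model_args pretrained=lmms-lab/llava-onevision-qwen2-0.5b-si,conv_template=qwen_1_5,model_name=llava_qwen,attn_implementation=None \
  --tasks mme \
  --batch_size 1 \
  --output_path ./logs/
```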
### Evaluating LLaVA-OneVision on multiple datasets
We encourage developers and researchers to evaluate the models on a broad range of datasets to get a comprehensive understanding of their performance in different scenarios. We therefore provide a comprehensive list of evaluation datasets, and contributions of additional evaluation tasks are welcome. Please refer to [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval) for more details.
Task: single-image tasks.
```bash
# image tasks
accelerate launch --num_processes=8 \
  -m lmms_eval \
  --model llava_onevision \
  --model_args pretrained=lmms-lab/llava-onevision-qwen2-0.5b-si,conv_template=qwen_1_5,model_name=llava_qwen \
  --tasks ai2d,chartqa,docvqa_val,infovqa_val,mme,realworldqa,mathvista_testmini,llava_in_the_wild,mmvet,mmbench_en_dev,ocrbench,mmmu,mathverse_testmini_vision_intensive,mathverse_testmini_vision_only,seedbench,scienceqa_img,mmstar \
  --batch_size 1 \
  --log_samples \
  --log_samples_suffix llava_onevision \
  --output_path ./logs/
```
Task: video tasks. The video tasks are more computationally expensive; we recommend running them on a machine with a GPU with at least 16GB of memory.
```bash
# video tasks
accelerate launch --num_processes=8 \
  -m lmms_eval \
  --model llava_onevision \
  --model_args pretrained=lmms-lab/llava-onevision-qwen2-0.5b-ov,conv_template=qwen_1_5,model_name=llava_qwen \
  --tasks activitynetqa,videochatgpt,nextqa_mc_test,egoschema,video_dc499,videomme,videomme_w_subtitle,perceptiontest_val_mc \
  --batch_size 1 \
  --log_samples \
  --log_samples_suffix llava_onevision \
  --output_path ./logs/
```
Task: interleave tasks (`llava-interleave-bench` already covers most existing image-text tasks). `mmmu_test` contains both single-image and multi-image inputs; running the model produces a submission file, which you must submit to the [leaderboard](https://eval.ai/web/challenges/challenge-page/1700/overview) to obtain the MMMU (multi-image) accuracy.
```bash
accelerate launch --num_processes=8 \
  -m lmms_eval \
  --model llava_onevision \
  --model_args pretrained=lmms-lab/llava-onevision-qwen2-0.5b-ov,conv_template=qwen_1_5,model_name=llava_qwen \
  --tasks llava-interleave-bench,muirbench,mmmu_test \
  --batch_size 1 \
  --log_samples \
  --log_samples_suffix llava_onevision \
  --output_path ./logs/
```