---
library_name: transformers
license: apache-2.0
datasets:
- deepvk/LLaVA-Instruct-ru
- Lin-Chen/ShareGPT4V
- deepvk/GQA-ru
language:
- ru
- en
base_model: IlyaGusev/saiga_llama3_8b
pipeline_tag: image-text-to-text
---

# LLaVA-Saiga-8b

LLaVA-Saiga-8b is a Vision-Language Model (VLM) based on the [`IlyaGusev/saiga_llama3_8b`](https://huggingface.co/IlyaGusev/saiga_llama3_8b) model
and trained in the original LLaVA setup. The model is primarily adapted to work in Russian, but remains capable of working in English.

## Usage

The model can be used directly through the `transformers` API:

```python
import requests

from PIL import Image
from transformers import AutoProcessor, AutoTokenizer, LlavaForConditionalGeneration

model_name = "deepvk/llava-saiga-8b"

# Load the model together with its image processor and tokenizer
model = LlavaForConditionalGeneration.from_pretrained(model_name)
processor = AutoProcessor.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Download an example image
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
img = Image.open(requests.get(url, stream=True).raw)

# "Опиши картинку несколькими словами." = "Describe the picture in a few words."
messages = [
    {"role": "user", "content": "<image>\nОпиши картинку несколькими словами."}
]

# Render the conversation with the chat template and prepare model inputs
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(images=[img], text=text, return_tensors="pt")

# Generate and decode only the newly produced tokens
generate_ids = model.generate(**inputs, max_new_tokens=30)
answer = tokenizer.decode(generate_ids[0, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(answer)
```

Use the `<image>` tag to mark where an image appears in the text, and follow the chat template for multi-turn conversations.
The model can also chat without any images or work with multiple images in a conversation, but this behavior has not been tested.
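
Although multi-turn behavior has not been extensively tested (see the note above), a conversation can be assembled with the same chat template. The sketch below reuses `model`, `processor`, `tokenizer`, and `img` from the example above; the assistant reply and the follow-up question are purely illustrative:

```python
# Minimal multi-turn sketch; the assistant reply and follow-up question are illustrative only.
messages = [
    {"role": "user", "content": "<image>\nОпиши картинку несколькими словами."},
    {"role": "assistant", "content": "Знак «стоп» перед китайской аркой."},
    {"role": "user", "content": "Какого цвета знак?"},  # "What color is the sign?"
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# The number of images passed must match the number of <image> tags in the conversation.
inputs = processor(images=[img], text=text, return_tensors="pt")

generate_ids = model.generate(**inputs, max_new_tokens=30)
answer = tokenizer.decode(generate_ids[0, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(answer)
```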

The model format allows it to be used directly in popular frameworks;
for example, you can evaluate the model with [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval), see the Results section for details.

## Train

To train this model, we follow the original LLaVA pipeline and reuse the [`haotian-liu/LLaVA`](https://github.com/haotian-liu/LLaVA) framework.

The model was trained in two stages:
1. The adapter was trained on pre-training data from [`ShareGPT4V`](https://github.com/InternLM/InternLM-XComposer/tree/main/projects/ShareGPT4V).
2. Instruction tuning then trained both the LLM and the adapter, using:
   * [`deepvk/LLaVA-Instruct-ru`](https://huggingface.co/datasets/deepvk/LLaVA-Instruct-ru) - our new dataset of VLM instructions in Russian.
   * [`deepvk/GQA-ru`](https://huggingface.co/datasets/deepvk/GQA-ru) - the training part of the popular GQA benchmark translated into Russian; each question was formatted with the post-prompt "Ответь одним словом." ("Answer in one word."), as sketched below this list.
   * Instruction data from ShareGPT4V.
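
A rough sketch of how a GQA-ru question could be wrapped into a LLaVA-style conversation with this post-prompt is shown below. The split and column names (`train`, `question`, `answer`) are assumptions for illustration and may not match the actual `deepvk/GQA-ru` schema:

```python
from datasets import load_dataset

# Illustrative only: split and field names are assumptions, not the verified GQA-ru schema.
gqa_ru = load_dataset("deepvk/GQA-ru", split="train")

POST_PROMPT = "Ответь одним словом."  # "Answer in one word."


def to_llava_conversation(sample: dict) -> list:
    """Wrap a QA pair into a two-turn, LLaVA-style conversation with the post-prompt."""
    return [
        {"role": "user", "content": f"<image>\n{sample['question']} {POST_PROMPT}"},
        {"role": "assistant", "content": sample["answer"]},
    ]


print(to_llava_conversation(gqa_ru[0]))
```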

The entire training process took 3-4 days on 8 x A100 80GB GPUs.

## Results

The model's performance was evaluated with the [`lmms-eval`](https://github.com/EvolvingLMMs-Lab/lmms-eval/tree/main) framework:

```bash
accelerate launch -m lmms_eval --model llava_hf --model_args pretrained="deepvk/llava-saiga-8b" \
  --tasks gqa-ru,mmbench_ru_dev,gqa,mmbench_en_dev --batch_size 1 \
  --log_samples --log_samples_suffix llava-saiga-8b --output_path ./logs/
```

| Model | GQA | GQA-ru | MMBench | MMBench-ru |
| --- |:---:|:---:|:---:|:---:|
| [`deepvk/llava-gemma-2b-lora`](https://huggingface.co/deepvk/llava-gemma-2b-lora) | 56.39 | 46.37 | 51.72 | 40.19 |
| [`Intel/llava-gemma-2b`](https://huggingface.co/Intel/llava-gemma-2b) | 59.80 | 0.20 | 39.40 | 28.30 |
| `deepvk/llava-saiga-8b` [this model] | 62.00 | **51.44** | 64.26 | **56.65** |
| [`llava-hf/llava-1.5-7b-hf`](https://huggingface.co/llava-hf/llava-1.5-7b-hf) | 61.31 | 28.39 | 62.97 | 52.25 |
| [`llava-hf/llava-v1.6-mistral-7b-hf`](https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf) | **64.65** | 6.65 | **67.70** | 48.80 |

*Note*: for MMBench, we did not use the OpenAI API to match the generated string to one of the answer options. The score is therefore effectively an exact-match score, as in the GQA benchmark.
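
As a rough illustration (this is not the lmms-eval implementation, and the normalization shown is an assumption), exact-match-style scoring amounts to the following:

```python
def _normalize(text: str) -> str:
    """Light normalization before comparison (illustrative)."""
    return text.strip().lower().rstrip(".")


def exact_match(prediction: str, reference: str) -> bool:
    """Count an answer as correct only if it matches the reference after normalization."""
    return _normalize(prediction) == _normalize(reference)


print(exact_match("Да.", "да"))                        # True
print(exact_match("На фото знак стоп.", "знак стоп"))  # False: free-form answers are not mapped to options
```
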
## Citation

```
@misc{liu2023llava,
    title={Visual Instruction Tuning},
    author={Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae},
    publisher={NeurIPS},
    year={2023},
}
```

```
@misc{deepvk2024llava-saiga-8b,
    title={LLaVA-Saiga-8b},
    author={Belopolskih, Daniil and Spirin, Egor},
    url={https://huggingface.co/deepvk/llava-saiga-8b},
    publisher={Hugging Face},
    year={2024},
}
```