---
license: mit
base_model: unsloth/Qwen2-VL-2B-Instruct
tags:
- vision-language-model
- chart-qa
- qwen2-vl
- lora
- finetuned
- unsloth
datasets:
- weijiezz/chartqa_split_test
language:
- en
pipeline_tag: image-text-to-text
---

# alan-vlm: ChartQA Vision Language Model

A finetuned version of **Qwen2-VL 2B Instruct** trained to answer natural-language questions about charts and graphs.

Finetuned on the [ChartQA dataset](https://huggingface.co/datasets/weijiezz/chartqa_split_test) using [Unsloth](https://github.com/unslothai/unsloth) on a free Google Colab T4 GPU.

---

## Model Details

| Setting | Value |
|---|---|
| **Base Model** | Qwen2-VL-2B-Instruct |
| **Finetuning Method** | LoRA (r=8, alpha=8) |
| **Training Data** | 2,000 chart QA pairs |
| **Training Steps** | 500 |
| **Batch Size** | 8 effective (2 per device × 4 gradient accumulation steps) |
| **Trainable Parameters** | 9,232,384 (0.42% of total) |
| **Precision** | fp16 |
| **Hardware** | Google Colab T4 (15 GB VRAM) |
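
The batch size and step count above imply how much data the run saw. Here is a quick arithmetic sketch, assuming a single GPU (as on Colab) and that all 2,000 pairs were used for training. The variable names are illustrative only.

```python
# Figures taken from the Model Details table
per_device_batch = 2
grad_accum_steps = 4
training_steps = 500
num_pairs = 2_000
trainable_params = 9_232_384
trainable_fraction = 0.0042  # 0.42% as reported (a rounded value)

# Assumption: one GPU, so effective batch = per-device batch x accumulation steps
effective_batch = per_device_batch * grad_accum_steps
samples_seen = training_steps * effective_batch
passes_over_data = samples_seen / num_pairs

# Total parameter count implied by the rounded percentage (approximate)
implied_total_params = trainable_params / trainable_fraction

print(effective_batch, samples_seen, passes_over_data)
print(f"{implied_total_params / 1e9:.2f}B parameters (approx.)")
```

Under these assumptions, 500 steps correspond to roughly two passes over the training pairs.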

---

## Quick Start

```python
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from PIL import Image
import torch

# Load model
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "alanjoshua2005/alan-vlm",
    torch_dtype=torch.float16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("alanjoshua2005/alan-vlm")

# Run inference
def ask(image_path, question):
    image = Image.open(image_path).convert("RGB")
    messages = [{"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": question},
    ]}]
    text_prompt = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=False,
    )
    inputs = processor(
        text=text_prompt,
        images=image,
        return_tensors="pt",
    )
    # Follow the device chosen by device_map instead of hardcoding "cuda"
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=64)
    # Skip the prompt tokens so only the generated answer is decoded
    input_len = inputs["input_ids"].shape[1]
    return processor.decode(output[0][input_len:], skip_special_tokens=True)

# Example
answer = ask("chart.png", "What is the value of the highest bar?")
print(answer)
```

---

## Gradio Demo

```python
import gradio as gr
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from PIL import Image
import torch

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "alanjoshua2005/alan-vlm",
    torch_dtype=torch.float16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("alanjoshua2005/alan-vlm")

def answer_chart_question(image, question):
    # Guard against an empty upload or a blank question
    if image is None or not question.strip():
        return "Please provide both an image and a question."
    image = image.convert("RGB")
    messages = [{"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": question},
    ]}]
    text_prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
    inputs = processor(text=text_prompt, images=image, return_tensors="pt")
    # Follow the device chosen by device_map instead of hardcoding "cuda"
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=64)
    input_len = inputs["input_ids"].shape[1]
    return processor.decode(output[0][input_len:], skip_special_tokens=True)

gr.Interface(
    fn=answer_chart_question,
    inputs=[gr.Image(type="pil"), gr.Textbox(label="Question")],
    outputs=gr.Textbox(label="Answer"),
    title="ChartQA - alan-vlm",
).launch()
```

---

## Dataset

Trained on [weijiezz/chartqa_split_test](https://huggingface.co/datasets/weijiezz/chartqa_split_test), a 2,000-row dataset of chart images paired with questions and answers. It contains two types of questions:

- `human_test`: questions written by human annotators
- `augmented_test`: questions generated via data augmentation
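
To train a chat model on rows like these, each image/question/answer triple usually has to be turned into a user/assistant message pair that mirrors the prompt format used at inference time. The sketch below shows one way this might look. The column names (`image`, `query`, `label`) and the `to_conversation` helper are assumptions for illustration, not taken from this dataset or the training script.

```python
def to_conversation(sample, question_key="query", answer_key="label", image_key="image"):
    """Turn one dataset row into a user/assistant message pair.

    The key names are hypothetical defaults; check the dataset's actual
    column names before relying on them.
    """
    answer = sample[answer_key]
    # Handle the case where the answer is stored as a one-element list
    if isinstance(answer, (list, tuple)):
        answer = answer[0]
    return {
        "messages": [
            {"role": "user", "content": [
                {"type": "image", "image": sample[image_key]},
                {"type": "text", "text": sample[question_key]},
            ]},
            {"role": "assistant", "content": [
                {"type": "text", "text": str(answer)},
            ]},
        ]
    }


# Placeholder row; a real row would carry a PIL image instead of a string
row = {"image": "<PIL.Image>", "query": "What is the highest value?", "label": ["42"]}
conversation = to_conversation(row)
print(conversation["messages"][1]["content"][0]["text"])
```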

---

## Training Details

Training was done using **Unsloth** for optimized LoRA finetuning:

```python
from unsloth import FastVisionModel

# Load the base model in 4-bit to fit in T4 memory
model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Qwen2-VL-2B-Instruct",
    load_in_4bit=True,
)

# Attach LoRA adapters
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=True,
    finetune_language_layers=True,
    finetune_attention_modules=True,
    finetune_mlp_modules=True,
    r=8,
    lora_alpha=8,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```
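
For intuition about the trainable-parameter figure: a rank-`r` LoRA adapter on a linear layer with `d_in` inputs and `d_out` outputs adds `r * (d_in + d_out)` weights. The sketch below applies that to the seven target modules. It assumes the Qwen2-VL-2B language decoder has 28 layers, hidden size 1536, MLP size 8960, and 256-dimensional key/value projections. These dimensions are an assumption, not stated in this card, so verify them against the model's `config.json`.

```python
R = 8  # LoRA rank from the training config above

# Assumed decoder dimensions (not taken from this card; check config.json)
HIDDEN, KV_DIM, MLP, NUM_LAYERS = 1536, 256, 8960, 28

# (in_features, out_features) for each targeted projection in one decoder layer
target_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_params(d_in, d_out, r):
    # LoRA adds two matrices: A with shape (r, d_in) and B with shape (d_out, r)
    return r * (d_in + d_out)


per_layer = sum(lora_params(d_in, d_out, R) for d_in, d_out in target_shapes.values())
total = per_layer * NUM_LAYERS
print(f"{per_layer:,} per layer, {total:,} in total")
```

Under these assumptions the total comes to 9,232,384, which matches the figure reported in Model Details. The sketch counts only the language decoder's projections and makes no claim about the vision tower.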

---

## Limitations

- Trained on only 2,000 samples; this is a learning/experimental project
- May struggle with complex multi-series charts or heavily annotated graphs
- Not yet evaluated on the full ChartQA benchmark
- Best suited for simple bar, pie, and line chart questions
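
As a starting point for the missing evaluation: ChartQA results are commonly reported as relaxed accuracy, where a numeric prediction counts as correct if it is within 5% of the reference and any other answer needs an exact match. Below is a minimal sketch of that metric. The helper names and the simple string normalization are assumptions, and this is not an official scorer.

```python
def _to_float(text):
    """Parse a number, tolerating thousands separators and a trailing %."""
    try:
        return float(text.strip().rstrip("%").replace(",", ""))
    except ValueError:
        return None


def relaxed_match(prediction, target, tolerance=0.05):
    pred_num, target_num = _to_float(prediction), _to_float(target)
    if pred_num is not None and target_num is not None:
        if target_num == 0:
            return pred_num == 0
        # Numeric answers: allow a relative error up to the tolerance
        return abs(pred_num - target_num) / abs(target_num) <= tolerance
    # Non-numeric answers: case-insensitive exact match
    return prediction.strip().lower() == target.strip().lower()


def relaxed_accuracy(predictions, targets):
    pairs = list(zip(predictions, targets))
    if not pairs:
        return 0.0
    return sum(relaxed_match(p, t) for p, t in pairs) / len(pairs)


# Toy predictions and references, for illustration only
preds = ["41", "Blue", "1,200"]
refs = ["42", "blue", "1500"]
print(relaxed_accuracy(preds, refs))
```

Predictions could come from the `ask` function in Quick Start, run over a held-out set of chart questions.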

---

## Acknowledgements

- [Unsloth](https://github.com/unslothai/unsloth) for making VLM finetuning feasible on free Colab GPUs
- [Qwen2-VL](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct) by Alibaba Cloud
- [ChartQA dataset](https://huggingface.co/datasets/weijiezz/chartqa_split_test) by weijiezz