---
language:
- en
license: apache-2.0
base_model: Qwen/Qwen2-VL-2B-Instruct
datasets:
- HuggingFaceM4/ChartQA
tags:
- vision-language
- chart-qa
- qwen2-vl
- unsloth
- multimodal
---

# The Orange Problem: Qwen2-VL-2B ChartQA

Fine-tuned version of Qwen/Qwen2-VL-2B-Instruct on ChartQA, trained with LoRA on fp16 base weights (not 4-bit) via Unsloth.
The model answers questions about charts with a number or short phrase.
This is a fully merged model, so no adapter loading is required.

Training: Unsloth + TRL SFTTrainer on 2x Kaggle T4 GPUs, about 155 minutes.

---

## Results (300 test samples)

| Metric | Base Model | Fine-Tuned | Delta (pp) |
|---|---|---|---|
| Relaxed Accuracy (primary) | 54.33% | 59.67% | +5.33 |
| Exact Match | 50.00% | 54.67% | +4.67 |
| ROUGE-L | 54.74% | 59.17% | +4.42 |

Deltas are in percentage points. Relaxed Accuracy is the primary ChartQA metric: a prediction counts as correct if it matches the reference string exactly or, for numeric answers, is within 5% of the reference value.
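
The metric described above can be sketched as follows. This is a minimal illustration, not the evaluation script used for the table: the function names are hypothetical, and the case-insensitive comparison and the handling of commas and `%` are assumptions.

```python
def _to_float(text):
    """Parse a numeric answer, tolerating thousands separators and a trailing %."""
    cleaned = text.strip().replace(",", "").rstrip("%").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


def relaxed_match(prediction, reference, tolerance=0.05):
    """True on an exact string match or a numeric match within `tolerance` (relative)."""
    # Assumption: string comparison ignores case and surrounding whitespace.
    if prediction.strip().lower() == reference.strip().lower():
        return True
    pred_num, ref_num = _to_float(prediction), _to_float(reference)
    if pred_num is None or ref_num is None:
        return False
    if ref_num == 0:
        # Relative error is undefined for a zero reference; require an exact zero.
        return pred_num == 0
    return abs(pred_num - ref_num) / abs(ref_num) <= tolerance


def relaxed_accuracy(predictions, references):
    """Fraction of prediction/reference pairs that pass relaxed_match."""
    pairs = list(zip(predictions, references))
    return sum(relaxed_match(p, r) for p, r in pairs) / len(pairs)
```

For example, `relaxed_match("102", "100")` passes (2% off), while `relaxed_match("106", "100")` does not.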

---

## Quickstart

Install the dependencies (`accelerate` is required for `device_map="auto"`, `pillow` for image loading):

`pip install transformers accelerate qwen-vl-utils pillow torch`

```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
from PIL import Image
import torch

MODEL_ID = "Sriram1806/NLP_Orange_Problem-How_I_Met_You"

# Load the merged fp16 weights; device_map="auto" places them on the available GPU(s).
model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(MODEL_ID)


def ask_chart(image, question):
    """Ask a question about a chart image and return the extracted answer."""
    msgs = [
        {"role": "system", "content": [{"type": "text", "text": (
            "You are a chart analysis assistant. "
            "When given a chart and a question, briefly identify the relevant data, "
            "then output your final answer after '<ANSWER>:'. "
            "The answer must be a number, percentage, or short phrase only."
        )}]},
        {"role": "user", "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": (
                "Look at the chart carefully.\n"
                f"Question: {question}\n\n"
                "Briefly identify the relevant data point, then give your answer as:\n"
                "<ANSWER>: [value]"
            )},
        ]},
    ]
    text = processor.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
    image_inputs, _ = process_vision_info(msgs)  # second value is video inputs (unused)
    inputs = processor(text=[text], images=image_inputs, return_tensors="pt").to(model.device)
    with torch.inference_mode():
        # Greedy decoding; sampling parameters are unset so generate() does not warn.
        out = model.generate(**inputs, max_new_tokens=96, do_sample=False,
                             temperature=None, top_p=None)
    # Keep only the newly generated tokens, dropping the prompt.
    gen = out[0][inputs["input_ids"].shape[-1]:]
    raw = processor.decode(gen, skip_special_tokens=True).strip()
    # Return the text after the last answer marker, or the raw output if it is missing.
    return raw.split("<ANSWER>:")[-1].strip() if "<ANSWER>:" in raw else raw


image = Image.open("your_chart.png").convert("RGB")
print(ask_chart(image, "What is the value in 2020?"))
```

---

## Training Details

| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen2-VL-2B-Instruct |
| Dataset | HuggingFaceM4/ChartQA |
| Training samples | 6,000 |
| Epochs | 2 |
| Steps completed | 750 |
| Train time | 155.1 min |
| LoRA rank | 32 |
| LoRA alpha | 32 |
| LoRA dropout | 0.0 |
| LoRA targets | Language attention + MLP layers |
| Learning rate | 2e-4 |
| LR schedule | Cosine |
| Effective batch size | 16 (2 per device x 8 grad accum) |
| Precision | float16 |
| Max sequence length | 640 tokens |
| Optimizer | AdamW 8-bit |
| Warmup steps | 50 |
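
As a sanity check, the step count in the table follows from the sample count, epochs, and effective batch size. The short calculation below is only illustrative; the variable names are made up, and it assumes the effective batch size is per device batch times gradient accumulation, as the table states.

```python
import math

train_samples = 6_000
per_device_batch = 2
grad_accum_steps = 8
epochs = 2

# Effective batch size as given in the table: 2 per device x 8 accumulation steps.
effective_batch = per_device_batch * grad_accum_steps

# One optimizer step consumes one effective batch.
steps_per_epoch = math.ceil(train_samples / effective_batch)
total_steps = steps_per_epoch * epochs

print(effective_batch, steps_per_epoch, total_steps)  # 16 375 750
```

The result, 750 steps, matches the "Steps completed" row, so the 50 warmup steps cover roughly the first 7% of training.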

### Key Decisions

**fp16 instead of 4-bit:** Unsloth's compiled `VisionMlp_forward` triggers a bitsandbytes
`Linear4bit` recursion error when the ViT MLP blocks are quantised. Loading in fp16 avoids this,
and two T4s (31 GB total) hold the 2B model comfortably.

**`finetune_vision_layers=False`:** ViT MLP blocks cannot be safely LoRA-wrapped alongside
Unsloth gradient checkpointing when quantised. Language layers are fine-tuned instead.
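
For reference, the LoRA settings from the Training Details table could be collected as keyword arguments like this. It is a sketch, not the original training script: `build_lora_kwargs` is a hypothetical helper, and the commented usage assumes Unsloth's `FastVisionModel.get_peft_model` accepts these keywords.

```python
def build_lora_kwargs():
    """Hypothetical helper: LoRA settings from this card as keyword arguments."""
    return dict(
        finetune_vision_layers=False,     # leave the ViT blocks untouched
        finetune_language_layers=True,    # adapt the language model only
        finetune_attention_modules=True,  # "Language attention" in the table
        finetune_mlp_modules=True,        # "MLP layers" in the table
        r=32,                             # LoRA rank
        lora_alpha=32,
        lora_dropout=0.0,
    )


# Assumed usage (needs a GPU and the unsloth package, so it is left commented out):
# from unsloth import FastVisionModel
# model, tokenizer = FastVisionModel.from_pretrained(
#     "Qwen/Qwen2-VL-2B-Instruct", load_in_4bit=False
# )
# model = FastVisionModel.get_peft_model(model, **build_lora_kwargs())
```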

**CoT prompt with `<ANSWER>:` marker:** The model emits a brief reasoning step before a
structured answer tag, which makes the final answer easier to extract reliably at inference time.

**`normalise_answer` before EM scoring:** Commas and `%` are stripped before comparison, so
"44%" matches "44" and "1,000" matches "1000".
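
A normalisation step of this kind might look like the sketch below. The real `normalise_answer` is not shown in this card, so the body here is an assumption; in particular, the lowercasing and whitespace trimming are additions beyond the comma and `%` stripping described above, and `exact_match` is a hypothetical wrapper.

```python
def normalise_answer(text):
    """Trim, lowercase, and drop commas and percent signs before comparison."""
    # Assumption: lowercasing and trimming are extras not stated in the card.
    return text.strip().lower().replace(",", "").replace("%", "").strip()


def exact_match(prediction, reference):
    """Exact-match score on normalised strings."""
    return normalise_answer(prediction) == normalise_answer(reference)
```

With this, `exact_match("44%", "44")` and `exact_match("1,000", "1000")` both hold, while `exact_match("44", "45")` does not.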