# The Orange Problem: Qwen2-VL-2B ChartQA

Fine-tuned version of Qwen/Qwen2-VL-2B-Instruct on ChartQA using QLoRA via Unsloth. The model answers questions about charts with a number or short phrase. This is a fully merged model, so no adapter loading is required.
Training: Unsloth + TRL `SFTTrainer` on Kaggle (2x T4), 155 min.
## Results (300 test samples)
| Metric | Base Model | Fine-Tuned | Delta |
|---|---|---|---|
| Relaxed Accuracy (primary) | 54.33% | 59.67% | +5.33% |
| Exact Match | 50.00% | 54.67% | +4.67% |
| ROUGE-L | 54.74% | 59.17% | +4.42% |
Relaxed Accuracy is the primary ChartQA metric: a prediction counts as correct on an exact string match, or, for numeric answers, when it is within 5% of the gold value.
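A minimal sketch of that metric as described above (exact match, or numeric within 5%); the official ChartQA scorer may differ in edge cases:

```python
def relaxed_accuracy(pred: str, gold: str, tol: float = 0.05) -> bool:
    """Correct on exact string match, or when both parse as numbers
    and the prediction is within `tol` relative error of the gold value."""
    pred, gold = pred.strip(), gold.strip()
    if pred == gold:
        return True
    try:
        p, g = float(pred), float(gold)
    except ValueError:
        return False  # non-numeric and not an exact match
    if g == 0:
        return p == 0
    return abs(p - g) / abs(g) <= tol

relaxed_accuracy("44.1", "44")  # True: relative error ~0.2%
relaxed_accuracy("50", "44")    # False: relative error ~13.6%
```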
## Quickstart

```bash
pip install transformers qwen-vl-utils torch
```
```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
from PIL import Image
import torch

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Sriram1806/NLP_Orange_Problem-How_I_Met_You",
    torch_dtype=torch.float16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Sriram1806/NLP_Orange_Problem-How_I_Met_You")

def ask_chart(image, question):
    msgs = [
        {"role": "system", "content": [{"type": "text", "text": (
            "You are a chart analysis assistant. "
            "When given a chart and a question, briefly identify the relevant data, "
            "then output your final answer after '<ANSWER>:'. "
            "The answer must be a number, percentage, or short phrase only."
        )}]},
        {"role": "user", "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": (
                "Look at the chart carefully.\n"
                f"Question: {question}\n\n"
                "Briefly identify the relevant data point, then give your answer as:\n"
                "<ANSWER>: [value]"
            )},
        ]},
    ]
    text = processor.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
    image_inputs, _ = process_vision_info(msgs)
    inputs = processor(text=[text], images=image_inputs, return_tensors="pt").to(model.device)
    with torch.inference_mode():
        out = model.generate(**inputs, max_new_tokens=96, do_sample=False,
                             temperature=None, top_p=None)
    # Decode only the newly generated tokens, then strip the answer tag.
    gen = out[0][inputs["input_ids"].shape[-1]:]
    raw = processor.decode(gen, skip_special_tokens=True).strip()
    return raw.split("<ANSWER>:")[-1].strip() if "<ANSWER>:" in raw else raw

image = Image.open("your_chart.png")
print(ask_chart(image, "What is the value in 2020?"))
```
## Training Details
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen2-VL-2B-Instruct |
| Dataset | HuggingFaceM4/ChartQA |
| Training samples | 6,000 |
| Epochs | 2 |
| Steps completed | 750 |
| Train time | 155.1 min |
| LoRA rank | 32 |
| LoRA alpha | 32 |
| LoRA dropout | 0.0 |
| LoRA targets | Language attention + MLP layers |
| Learning rate | 2e-4 |
| LR schedule | Cosine |
| Effective batch size | 16 (2 per device x 8 grad accum) |
| Precision | float16 |
| Max sequence length | 640 tokens |
| Optimizer | AdamW 8-bit |
| Warmup steps | 50 |
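The step count in the table follows directly from the other hyperparameters:

```python
samples, epochs, effective_batch = 6_000, 2, 16
steps = samples * epochs // effective_batch
print(steps)  # 750, matching "Steps completed" above
```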
## Key Decisions
- **fp16 instead of 4-bit:** Unsloth's compiled `VisionMlp_forward` triggers a bitsandbytes `Linear4bit` recursion error when the ViT MLP blocks are quantised; fp16 avoids this. Two T4s (31 GB total) hold the 2B model comfortably.
- **`finetune_vision_layers=False`:** ViT MLP blocks cannot be safely LoRA-wrapped alongside Unsloth gradient checkpointing when quantised, so only the language layers are fine-tuned.
- **CoT prompt with `<ANSWER>:` marker:** The model emits a brief reasoning step before a structured answer tag, which makes answer extraction at inference time more reliable.
- **`normalise_answer` before EM scoring:** Strips commas and `%` before comparison, so `"44%"` matches `"44"` and `"1,000"` matches `"1000"`.
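A minimal sketch of that normalisation (the exact helper used during evaluation may differ):

```python
def normalise_answer(ans: str) -> str:
    """Drop thousands separators and a trailing percent sign before exact-match scoring."""
    return ans.strip().replace(",", "").rstrip("%").strip()

normalise_answer("44%")    # -> "44"
normalise_answer("1,000")  # -> "1000"
```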