---
language:
- en
license: apache-2.0
base_model: Qwen/Qwen2-VL-2B-Instruct
datasets:
- HuggingFaceM4/ChartQA
tags:
- vision-language
- chart-qa
- qwen2-vl
- unsloth
- multimodal
---

# The Orange Problem — Qwen2-VL-2B ChartQA

Fine-tuned version of Qwen/Qwen2-VL-2B-Instruct on ChartQA, trained with LoRA in fp16 via Unsloth. The model answers questions about charts with a number or short phrase. This is a fully merged model — no adapter loading required.

Training: Unsloth + TRL `SFTTrainer` on Kaggle T4 x 2, 155 min.

---

## Results (300 test samples)

| Metric | Base Model | Fine-Tuned | Delta |
|---|---|---|---|
| Relaxed Accuracy (primary) | 54.33% | 59.67% | +5.33% |
| Exact Match | 50.00% | 54.67% | +4.67% |
| ROUGE-L | 54.74% | 59.17% | +4.42% |

Relaxed Accuracy is the primary ChartQA metric: a prediction counts as correct if it is an exact string match, or a number within 5% of the reference value.

---

## Quickstart

```bash
pip install transformers qwen-vl-utils torch accelerate
```

```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
from PIL import Image
import torch

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Sriram1806/NLP_Orange_Problem-How_I_Met_You",
    torch_dtype=torch.float16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Sriram1806/NLP_Orange_Problem-How_I_Met_You")

def ask_chart(image, question):
    msgs = [
        {"role": "system", "content": [{"type": "text", "text": (
            "You are a chart analysis assistant. "
            "When given a chart and a question, briefly identify the relevant data, "
            "then output your final answer after 'ANSWER:'. "
            "The answer must be a number, percentage, or short phrase only."
        )}]},
        {"role": "user", "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": (
                "Look at the chart carefully.\n"
                f"Question: {question}\n\n"
                "Briefly identify the relevant data point, then give your answer as:\n"
                "ANSWER: [value]"
            )},
        ]},
    ]
    text = processor.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
    image_inputs, _ = process_vision_info(msgs)
    inputs = processor(text=[text], images=image_inputs, return_tensors="pt").to(model.device)
    with torch.inference_mode():
        out = model.generate(**inputs, max_new_tokens=96, do_sample=False, temperature=None, top_p=None)
    gen = out[0][inputs["input_ids"].shape[-1]:]
    raw = processor.decode(gen, skip_special_tokens=True).strip()
    return raw.split("ANSWER:")[-1].strip() if "ANSWER:" in raw else raw

image = Image.open("your_chart.png")
print(ask_chart(image, "What is the value in 2020?"))
```

---

## Training Details

| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen2-VL-2B-Instruct |
| Dataset | HuggingFaceM4/ChartQA |
| Training samples | 6,000 |
| Epochs | 2 |
| Steps completed | 750 |
| Train time | 155.1 min |
| LoRA rank | 32 |
| LoRA alpha | 32 |
| LoRA dropout | 0.0 |
| LoRA targets | Language attention + MLP layers |
| Learning rate | 2e-4 |
| LR schedule | Cosine |
| Effective batch size | 16 (2 per device x 8 grad accum) |
| Precision | float16 |
| Max sequence length | 640 tokens |
| Optimizer | AdamW 8-bit |
| Warmup steps | 50 |

### Key Decisions

**fp16 instead of 4-bit:** Unsloth's compiled `VisionMlp_forward` triggers a bitsandbytes `Linear4bit` recursion error when the ViT MLP blocks are quantised. fp16 avoids this, and two T4s (31 GB total) hold the 2B model comfortably.

**`finetune_vision_layers=False`:** ViT MLP blocks cannot be safely LoRA-wrapped alongside Unsloth gradient checkpointing when quantised. Language layers are fine-tuned instead.
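For reference, the configuration in the Training Details table maps onto Unsloth's vision fine-tuning API roughly as follows. This is an untested sketch, not the exact training script: the hyperparameter values come from the table above, but the argument names (e.g. `finetune_attention_modules`, `optim="adamw_8bit"`) are assumptions about the Unsloth and TRL APIs.

```python
# Untested sketch of the training setup; values from the Training Details
# table, argument names assumed from Unsloth / TRL documentation.
from unsloth import FastVisionModel
from trl import SFTConfig

model, tokenizer = FastVisionModel.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",
    load_in_4bit=False,  # fp16 rather than 4-bit -- see "fp16 instead of 4-bit"
)

model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=False,     # ViT MLP blocks stay frozen (see above)
    finetune_language_layers=True,
    finetune_attention_modules=True,  # language attention + MLP LoRA targets
    finetune_mlp_modules=True,
    r=32,
    lora_alpha=32,
    lora_dropout=0.0,
)

args = SFTConfig(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,    # effective batch size 16
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_steps=50,
    num_train_epochs=2,
    max_seq_length=640,
    optim="adamw_8bit",
    fp16=True,
)
```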
**CoT prompt with `ANSWER:` marker:** The model emits a brief reasoning step before a structured answer tag, improving extraction reliability at inference time.

**`normalise_answer` before EM scoring:** Strips commas and `%` before comparison so "44%" matches "44" and "1,000" matches "1000".
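The two evaluation helpers described above can be sketched as follows. The name `relaxed_match` and the exact normalisation rules are assumptions based on the descriptions in this card and the standard ChartQA relaxed-accuracy definition (exact match, or numeric within 5%); only `normalise_answer` is named in the card itself.

```python
# Hedged sketch of the evaluation helpers; not the exact scoring code.
def normalise_answer(s: str) -> str:
    """Strip commas, a trailing '%', and whitespace, and lowercase,
    so that "44%" matches "44" and "1,000" matches "1000"."""
    return s.strip().replace(",", "").rstrip("%").strip().lower()

def relaxed_match(pred: str, gold: str, tol: float = 0.05) -> bool:
    """ChartQA relaxed accuracy: exact string match after normalisation,
    or a numeric answer within `tol` (5%) of the reference value."""
    p, g = normalise_answer(pred), normalise_answer(gold)
    if p == g:
        return True
    try:
        p_num, g_num = float(p), float(g)
    except ValueError:
        return False  # non-numeric and not an exact match
    if g_num == 0:
        return p_num == 0
    return abs(p_num - g_num) / abs(g_num) <= tol
```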