The Orange Problem: Qwen2-VL-2B ChartQA

Fine-tuned version of Qwen/Qwen2-VL-2B-Instruct on ChartQA, trained with LoRA via Unsloth (in fp16 rather than 4-bit QLoRA; see Key Decisions). The model answers questions about charts with a number or short phrase. This is a fully merged model: no adapter loading is required.

Training: Unsloth + TRL SFTTrainer on Kaggle T4 x 2, 155 min.


Results (300 test samples)

| Metric | Base Model | Fine-Tuned | Delta (percentage points) |
|---|---|---|---|
| Relaxed Accuracy (primary) | 54.33% | 59.67% | +5.33 |
| Exact Match | 50.00% | 54.67% | +4.67 |
| ROUGE-L | 54.74% | 59.17% | +4.42 |

Relaxed Accuracy is the primary ChartQA metric: a prediction counts as correct if it matches the reference string exactly or, for numeric answers, if it is within 5% of the reference value.
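
As a rough illustration of this definition, the check can be sketched in Python. The helper names `to_float`, `relaxed_match` and `relaxed_accuracy` are hypothetical and are not taken from the evaluation script. The sketch assumes that the 5% tolerance is measured relative to the reference value, that string comparison ignores case and surrounding whitespace, and that thousands separators and a trailing percent sign are ignored when parsing numbers.

```python
def to_float(text):
    """Parse a number, ignoring commas and a trailing '%'; return None if not numeric."""
    cleaned = text.strip().replace(",", "").rstrip("%").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


def relaxed_match(prediction, reference, tolerance=0.05):
    """True on a string match, or on a numeric match within `tolerance` (relative)."""
    # Assumption: case-insensitive comparison after trimming whitespace.
    if prediction.strip().lower() == reference.strip().lower():
        return True
    pred, ref = to_float(prediction), to_float(reference)
    if pred is None or ref is None:
        return False
    if ref == 0:
        # Relative error is undefined for a zero reference; require an exact zero.
        return pred == 0
    return abs(pred - ref) / abs(ref) <= tolerance


def relaxed_accuracy(predictions, references):
    """Fraction of predictions that pass the relaxed check."""
    hits = sum(relaxed_match(p, r) for p, r in zip(predictions, references))
    return hits / len(references)
```

Under these assumptions, a prediction of "104" against a reference of "100" counts as correct, while "106" does not.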


Quickstart

pip install transformers accelerate qwen-vl-utils torch

from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
from PIL import Image
import torch

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Sriram1806/NLP_Orange_Problem-How_I_Met_You",
    torch_dtype=torch.float16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Sriram1806/NLP_Orange_Problem-How_I_Met_You")

def ask_chart(image, question):
    msgs = [
        {"role": "system", "content": [{"type": "text", "text": (
            "You are a chart analysis assistant. "
            "When given a chart and a question, briefly identify the relevant data, "
            "then output your final answer after '<ANSWER>:'. "
            "The answer must be a number, percentage, or short phrase only."
        )}]},
        {"role": "user", "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": (
                "Look at the chart carefully.\n"
                f"Question: {question}\n\n"
                "Briefly identify the relevant data point, then give your answer as:\n"
                "<ANSWER>: [value]"
            )},
        ]},
    ]
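    # Render the chat template to a prompt string and collect the image inputs.
    text = processor.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
    image_inputs, _ = process_vision_info(msgs)
    inputs = processor(text=[text], images=image_inputs, return_tensors="pt").to(model.device)
    # Greedy decoding (do_sample=False); the sampling parameters are unset.
    with torch.inference_mode():
        out = model.generate(**inputs, max_new_tokens=96, do_sample=False,
                             temperature=None, top_p=None)
    # Drop the prompt tokens so that only the newly generated text is decoded.
    gen = out[0][inputs["input_ids"].shape[-1]:]
    raw = processor.decode(gen, skip_special_tokens=True).strip()
    # Return the text after the answer tag, or the raw output if the tag is missing.
    return raw.split("<ANSWER>:")[-1].strip() if "<ANSWER>:" in raw else raw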

image = Image.open("your_chart.png")
print(ask_chart(image, "What is the value in 2020?"))

Training Details

| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen2-VL-2B-Instruct |
| Dataset | HuggingFaceM4/ChartQA |
| Training samples | 6,000 |
| Epochs | 2 |
| Steps completed | 750 |
| Train time | 155.1 min |
| LoRA rank | 32 |
| LoRA alpha | 32 |
| LoRA dropout | 0.0 |
| LoRA targets | Language attention + MLP layers |
| Learning rate | 2e-4 |
| LR schedule | Cosine |
| Effective batch size | 16 (2 per device x 8 grad accum) |
| Precision | float16 |
| Max sequence length | 640 tokens |
| Optimizer | AdamW 8-bit |
| Warmup steps | 50 |
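
The step count and timing in this table can be cross-checked with a little arithmetic. The sketch below only uses the figures listed above; the variable names are illustrative, and it assumes the effective batch size is the per-device batch multiplied by the gradient accumulation steps, as the table states.

```python
import math

train_samples = 6_000
epochs = 2
per_device_batch = 2
grad_accum_steps = 8
train_minutes = 155.1

# Effective batch as listed in the table: 2 per device x 8 accumulation steps.
effective_batch = per_device_batch * grad_accum_steps
steps_per_epoch = math.ceil(train_samples / effective_batch)
total_steps = steps_per_epoch * epochs
seconds_per_step = train_minutes * 60 / total_steps

print(effective_batch, steps_per_epoch, total_steps, round(seconds_per_step, 1))
# 16 375 750 12.4
```

The result matches the 750 completed steps reported above and works out to roughly 12.4 seconds per optimizer step.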

Key Decisions

fp16 instead of 4-bit: Unsloth's compiled VisionMlp_forward triggers a bitsandbytes Linear4bit recursion error when the ViT MLP blocks are quantised. fp16 avoids this. Two T4s (31 GB total) hold the 2B model comfortably.

finetune_vision_layers=False: ViT MLP blocks cannot be safely LoRA-wrapped alongside Unsloth gradient checkpointing when quantised. Language layers are fine-tuned instead.

CoT prompt with <ANSWER>: marker: The model emits a brief reasoning step before a structured answer tag, which makes the final answer easier to extract reliably at inference time.

normalise_answer before EM scoring: Strips commas and % before comparison so "44%" matches "44" and "1,000" matches "1000".
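
The actual `normalise_answer` implementation is not shown in this card. A minimal sketch of what such a function might look like is given below, assuming it also lower-cases and trims whitespace (an assumption beyond the comma and % stripping described above); `exact_match` is a hypothetical helper added for illustration.

```python
def normalise_answer(text):
    """Strip commas and percent signs; lower-casing and trimming are assumptions."""
    return text.strip().lower().replace(",", "").replace("%", "").strip()


def exact_match(prediction, reference):
    """Compare two answers after normalisation."""
    return normalise_answer(prediction) == normalise_answer(reference)


print(exact_match("44%", "44"))      # True
print(exact_match("1,000", "1000"))  # True
print(exact_match("44", "45"))       # False
```

One trade-off of this approach is that commas are removed from every answer, so a phrase such as "Asia, Europe" is compared as "asia europe".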
