---
language:
- en
license: apache-2.0
base_model: Qwen/Qwen2-VL-2B-Instruct
datasets:
- HuggingFaceM4/ChartQA
tags:
- vision-language
- chart-qa
- qwen2-vl
- unsloth
- multimodal
---

# The Orange Problem — Qwen2-VL-2B ChartQA

Fine-tuned version of Qwen/Qwen2-VL-2B-Instruct on ChartQA, trained with LoRA in fp16 via Unsloth. The model answers questions about charts with a number or short phrase. This is a fully merged model — no adapter loading required.

Training: Unsloth + TRL `SFTTrainer` on Kaggle T4 x 2, 155 min.

---

## Results (300 test samples)

| Metric | Base Model | Fine-Tuned | Delta |
|---|---|---|---|
| Relaxed Accuracy (primary) | 54.33% | 59.67% | +5.33% |
| Exact Match | 50.00% | 54.67% | +4.67% |
| ROUGE-L | 54.74% | 59.17% | +4.42% |

Relaxed Accuracy is the primary ChartQA metric: a prediction counts as correct if it is an exact string match, or a number within 5% of the reference value.

---

## Quickstart

```bash
pip install transformers qwen-vl-utils torch accelerate
```

```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
from PIL import Image
import torch

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Sriram1806/NLP_Orange_Problem-How_I_Met_You",
    torch_dtype=torch.float16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Sriram1806/NLP_Orange_Problem-How_I_Met_You")

def ask_chart(image, question):
    msgs = [
        {"role": "system", "content": [{"type": "text", "text": (
            "You are a chart analysis assistant. "
            "When given a chart and a question, briefly identify the relevant data, "
            "then output your final answer after 'ANSWER:'. "
            "The answer must be a number, percentage, or short phrase only."
        )}]},
        {"role": "user", "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": (
                "Look at the chart carefully.\n"
                f"Question: {question}\n\n"
                "Briefly identify the relevant data point, then give your answer as:\n"
                "ANSWER: [value]"
            )},
        ]},
    ]
    text = processor.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
    image_inputs, _ = process_vision_info(msgs)
    inputs = processor(text=[text], images=image_inputs, return_tensors="pt").to(model.device)
    with torch.inference_mode():
        out = model.generate(**inputs, max_new_tokens=96, do_sample=False, temperature=None, top_p=None)
    gen = out[0][inputs["input_ids"].shape[-1]:]
    raw = processor.decode(gen, skip_special_tokens=True).strip()
    return raw.split("ANSWER:")[-1].strip() if "ANSWER:" in raw else raw

image = Image.open("your_chart.png")
print(ask_chart(image, "What is the value in 2020?"))
```

---

## Training Details

| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen2-VL-2B-Instruct |
| Dataset | HuggingFaceM4/ChartQA |
| Training samples | 6,000 |
| Epochs | 2 |
| Steps completed | 750 |
| Train time | 155.1 min |
| LoRA rank | 32 |
| LoRA alpha | 32 |
| LoRA dropout | 0.0 |
| LoRA targets | Language attention + MLP layers |
| Learning rate | 2e-4 |
| LR schedule | Cosine |
| Effective batch size | 16 (2 per device x 8 grad accum) |
| Precision | float16 |
| Max sequence length | 640 tokens |
| Optimizer | AdamW 8-bit |
| Warmup steps | 50 |

### Key Decisions

**fp16 instead of 4-bit:** Unsloth's compiled `VisionMlp_forward` triggers a bitsandbytes `Linear4bit` recursion error when the ViT MLP blocks are quantised. fp16 avoids this, and two T4s (31 GB total) hold the 2B model comfortably.

**`finetune_vision_layers=False`:** ViT MLP blocks cannot be safely LoRA-wrapped alongside Unsloth gradient checkpointing when quantised. Language layers are fine-tuned instead.
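For reference, the configuration in the Training Details table maps onto Unsloth's vision fine-tuning API roughly as follows. This is an untested sketch, not the exact training script: the hyperparameter values come from the table above, but the argument names (e.g. `finetune_attention_modules`, `optim="adamw_8bit"`) are assumptions about the Unsloth and TRL APIs.

```python
# Untested sketch of the training setup; values from the Training Details
# table, argument names assumed from Unsloth / TRL documentation.
from unsloth import FastVisionModel
from trl import SFTConfig

model, tokenizer = FastVisionModel.from_pretrained(
    "Qwen/Qwen2-VL-2B-Instruct",
    load_in_4bit=False,  # fp16 rather than 4-bit -- see "fp16 instead of 4-bit"
)

model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=False,     # ViT MLP blocks stay frozen (see above)
    finetune_language_layers=True,
    finetune_attention_modules=True,  # language attention + MLP LoRA targets
    finetune_mlp_modules=True,
    r=32,
    lora_alpha=32,
    lora_dropout=0.0,
)

args = SFTConfig(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,    # effective batch size 16
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_steps=50,
    num_train_epochs=2,
    max_seq_length=640,
    optim="adamw_8bit",
    fp16=True,
)
```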
**CoT prompt with `ANSWER:` marker:** The model emits a brief reasoning step before a structured answer tag, improving extraction reliability at inference time.

**`normalise_answer` before EM scoring:** Strips commas and `%` before comparison so "44%" matches "44" and "1,000" matches "1000".
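The two evaluation helpers described above can be sketched as follows. The name `relaxed_match` and the exact normalisation rules are assumptions based on the descriptions in this card and the standard ChartQA relaxed-accuracy definition (exact match, or numeric within 5%); only `normalise_answer` is named in the card itself.

```python
# Hedged sketch of the evaluation helpers; not the exact scoring code.
def normalise_answer(s: str) -> str:
    """Strip commas, a trailing '%', and whitespace, and lowercase,
    so that "44%" matches "44" and "1,000" matches "1000"."""
    return s.strip().replace(",", "").rstrip("%").strip().lower()

def relaxed_match(pred: str, gold: str, tol: float = 0.05) -> bool:
    """ChartQA relaxed accuracy: exact string match after normalisation,
    or a numeric answer within `tol` (5%) of the reference value."""
    p, g = normalise_answer(pred), normalise_answer(gold)
    if p == g:
        return True
    try:
        p_num, g_num = float(p), float(g)
    except ValueError:
        return False  # non-numeric and not an exact match
    if g_num == 0:
        return p_num == 0
    return abs(p_num - g_num) / abs(g_num) <= tol
```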