VisionTriage – Config C (image-only QLoRA)

QLoRA adapter on Qwen2.5-VL-7B-Instruct for UI bug severity classification from screenshots alone. Best performer in the VisionTriage 5-config ablation.

Key result

Image-only fine-tuning (this model) significantly outperforms text-only on multiclass severity (McNemar p=0.00807 on n=265 paired disagreements) and matches or beats the full multimodal variant: adding synthetic bug text on top of the screenshot gives no measurable gain, because the synthetic text is largely redundant with the visual signal.

Config                        Binary Acc   Binary F1   MCC     Multiclass Acc
B – text-only                 0.674        0.782       0.184   0.562
C – image-only (this)         0.695        0.801       0.232   0.618
D – multimodal (image+text)   0.683        0.784       0.220   0.595
E – zero-shot multimodal      0.672        0.800       0.104   0.353

Evaluated on a held-out 555-sample synthetic test split shared across B/C/D/E. Full per-sample predictions, confusion matrices, and McNemar breakdown are in the repo under results/.
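For reference, a minimal sketch of the exact McNemar test behind the headline p-value, assuming you have per-sample multiclass correctness arrays for configs B and C (the .npy paths below are hypothetical; the actual per-sample breakdown lives in results/):

import numpy as np
from scipy.stats import binomtest

# Hypothetical paths: boolean per-sample correctness on the shared 555-sample split.
correct_b = np.load("results/config_b_correct.npy")
correct_c = np.load("results/config_c_correct.npy")

# McNemar's test looks only at discordant pairs (n=265 in the reported run).
b_only = int(np.sum(correct_b & ~correct_c))  # B right, C wrong
c_only = int(np.sum(~correct_b & correct_c))  # C right, B wrong
n_disagree = b_only + c_only

# Exact two-sided test: under H0, discordant "wins" are Binomial(n, 0.5).
print(binomtest(b_only, n_disagree, 0.5).pvalue)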

Input / output

  • Input: a UI screenshot (PNG / JPG; square or portrait Android UI).
  • Output: one severity token from {blocker, critical, major, minor, trivial}.

The adapter is trained with a fixed prompt template that does not include the bug-report text; only the image is fed in.
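Because generation is free-form text, it can be worth normalizing the output back onto the closed label set. A minimal sketch (parse_severity is a hypothetical helper, not part of the repo):

from typing import Optional

SEVERITIES = ("blocker", "critical", "major", "minor", "trivial")

def parse_severity(generated: str) -> Optional[str]:
    """Map a raw generation onto the label set; None if nothing matches."""
    text = generated.strip().lower()
    for label in SEVERITIES:
        if label in text:
            return label
    return None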

Training

  • Base model: Qwen/Qwen2.5-VL-7B-Instruct (loaded in 4-bit NF4 via bitsandbytes).
  • Adapter: LoRA, r=32, α=64, dropout 0.05, bias none.
  • Target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj.
  • Epochs: 3.
  • Dataset: tathadn/visiontriage-multimodal train split (4,441 samples; image-only prompt format).
  • Hardware: 1× NVIDIA H100 NVL (~30 GB peak VRAM with the 4-bit base).
  • Framework stack: transformers, peft 0.18.1, trl, bitsandbytes.

See configs/sft_config_c.yaml and src/models/qlora_sft.py in the project repo for the exact training recipe.
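For orientation, a minimal sketch of the quantization and adapter configuration implied by the bullets above (hyperparameters copied from this card; the compute dtype is an assumption, and the repo YAML is authoritative):

import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization of the frozen base model (the QLoRA setup).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumed, not stated on the card
)

# LoRA adapter hyperparameters as listed above.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)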

How to use

from peft import PeftModel
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch

BASE  = "Qwen/Qwen2.5-VL-7B-Instruct"
ADAPT = "tathadn/visiontriage-config-c"

# Load the processor and bf16 base model, then attach the LoRA adapter.
proc  = AutoProcessor.from_pretrained(BASE)
base  = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    BASE, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, ADAPT)

img = Image.open("screenshot.png").convert("RGB")
messages = [{"role": "user", "content": [
    {"type": "image", "image": img},
    {"type": "text",  "text": "What is the severity of the bug shown in this screenshot? Answer with one of: blocker, critical, major, minor, trivial."},
]}]
text = proc.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = proc(text=[text], images=[img], return_tensors="pt").to(model.device)

# Greedy decode; the model emits a single severity label.
out = model.generate(**inputs, max_new_tokens=8, do_sample=False)
print(proc.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
# e.g. "critical"
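To serve without peft at inference time, the adapter can be folded into the base weights with PEFT's merge_and_unload (a sketch; the output directory name is arbitrary):

# Fold the LoRA deltas into the base weights and drop the PEFT wrappers.
merged = model.merge_and_unload()
merged.save_pretrained("visiontriage-config-c-merged")
proc.save_pretrained("visiontriage-config-c-merged")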

Intended use

  • Prototyping automated triage from UI screenshots in bug reports where the screenshot itself is informative.
  • Studying image-vs-text contributions to severity classification (ablation baseline).

Out-of-scope

  • Production triage of real bug reports: the training distribution is synthetic (deterministic mutators plus LLM-generated text), so expect degradation on real-world reports without domain adaptation.
  • Non-UI bugs (backend crashes, API contract violations, logic bugs with no visual surface).
  • Safety-critical or high-stakes triage decisions.

Limitations

  • Synthetic training signal: every "bug" is produced by one of 5 deterministic mutators; real bugs are far more varied.
  • Severity-label coupling: each bug type maps 1:1 to a severity, so the model learns bug_type → severity rather than independent severity reasoning.
  • Fine-grained visual bugs: subtle_offset (trivial) recall is ~0 across all configs; the 7B VLM is insensitive to few-pixel shifts.
  • English prompt template only.

License

Apache-2.0 (matching the base model). Base model weights are not redistributed; this repository contains only the LoRA adapter.

Citation

@misc{visiontriage2026,
  title  = {VisionTriage: Multimodal Severity Prediction for UI Bug Reports},
  author = {Debnath, Tathagata},
  year   = {2026},
  url    = {https://github.com/tathadn/visiontriage}
}
