# VisionTriage – Config C (image-only QLoRA)
QLoRA adapter on Qwen2.5-VL-7B-Instruct for UI bug severity classification from screenshots alone. Best performer in the VisionTriage 5-config ablation.
- Project: https://github.com/tathadn/visiontriage
- Paired dataset: https://huggingface.co/datasets/tathadn/visiontriage-multimodal
- Base model: Qwen/Qwen2.5-VL-7B-Instruct
## Key result
Image-only fine-tuning (this model) significantly outperforms text-only on multiclass severity (McNemar p = 0.00807 on n = 265 paired disagreements) and matches or beats the full multimodal variant: adding synthetic bug text on top of the screenshot gives no measurable gain, because the synthetic text is largely redundant with the visual signal.
| Config | Binary Acc | Binary F1 | MCC | Multiclass Acc |
|---|---|---|---|---|
| B – text-only | 0.674 | 0.782 | 0.184 | 0.562 |
| C – image-only (this model) | 0.695 | 0.801 | 0.232 | 0.618 |
| D – multimodal (image+text) | 0.683 | 0.784 | 0.220 | 0.595 |
| E – zero-shot multimodal | 0.672 | 0.800 | 0.104 | 0.353 |
Evaluated on a held-out 555-sample synthetic test split shared across B/C/D/E. Full per-sample predictions, confusion matrices, and the McNemar breakdown are in the repo under `results/`.
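For reference, the paired McNemar test compares two configs only on the samples where exactly one of them is correct. A minimal sketch, assuming hypothetical parallel prediction lists (the actual file format under `results/` may differ):

```python
# Minimal sketch of the paired McNemar test between two configs.
# preds_a, preds_b, gold are hypothetical parallel label lists over the
# shared 555-sample test split (not the repo's actual results/ format).
from statsmodels.stats.contingency_tables import mcnemar

def paired_mcnemar_pvalue(preds_a, preds_b, gold):
    # Count discordant pairs: samples where exactly one config is correct.
    n01 = sum(pa == g and pb != g for pa, pb, g in zip(preds_a, preds_b, gold))
    n10 = sum(pa != g and pb == g for pa, pb, g in zip(preds_a, preds_b, gold))
    # Concordant cells are zeroed; they do not influence the statistic.
    table = [[0, n01], [n10, 0]]
    return mcnemar(table, exact=True).pvalue
```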
## Input / output
- Input: a UI screenshot (PNG / JPG; square or portrait Android UI).
- Output: one severity token from `{blocker, critical, major, minor, trivial}`.
The adapter is trained with a fixed prompt template that does not include the bug-report text; only the image is fed in.
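Because decoding is unconstrained, the raw generation can carry stray whitespace or punctuation. An illustrative post-processing sketch that normalizes the output to the label set (`parse_severity` is a hypothetical helper, not part of the released adapter):

```python
# Illustrative output normalization; not part of the released adapter or repo.
SEVERITIES = {"blocker", "critical", "major", "minor", "trivial"}

def parse_severity(generated: str) -> str | None:
    # Strip whitespace/punctuation and lowercase before matching.
    token = generated.strip().lower().strip(".!\"'")
    return token if token in SEVERITIES else None
```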
## Training
- Base model: `Qwen/Qwen2.5-VL-7B-Instruct` (loaded 4-bit NF4 via bitsandbytes).
- Adapter: LoRA with `r=32`, `lora_alpha=64`, dropout `0.05`, bias `none`.
- Target modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`.
- Epochs: 3.
- Dataset: `tathadn/visiontriage-multimodal` train split (4,441 samples; image-only prompt format).
- Hardware: 1× NVIDIA H100 NVL (~30 GB peak VRAM with the 4-bit base).
- Framework versions: transformers + peft 0.18.1 + trl + bitsandbytes.
See `configs/sft_config_c.yaml` and `src/models/qlora_sft.py` in the project repo for the exact training recipe.
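For orientation, the hyperparameters above correspond to roughly the following bitsandbytes + PEFT configuration. This is an illustrative reconstruction; `configs/sft_config_c.yaml` in the repo is authoritative.

```python
# Illustrative reconstruction of the quantization + LoRA setup from the
# hyperparameters listed above; see configs/sft_config_c.yaml for the real recipe.
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit base weights
    bnb_4bit_quant_type="nf4",              # NF4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
)

lora = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
```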
## How to use
```python
from peft import PeftModel
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch

BASE = "Qwen/Qwen2.5-VL-7B-Instruct"
ADAPT = "tathadn/visiontriage-config-c"

# Load the processor and the bf16 base model, then attach the LoRA adapter.
proc = AutoProcessor.from_pretrained(BASE)
base = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    BASE, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, ADAPT)

# Build the fixed image-only prompt (no bug-report text).
img = Image.open("screenshot.png")
messages = [{"role": "user", "content": [
    {"type": "image", "image": img},
    {"type": "text", "text": "What is the severity of the bug shown in this screenshot? Answer with one of: blocker, critical, major, minor, trivial."},
]}]
text = proc.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = proc(text=[text], images=[img], return_tensors="pt").to(model.device)

# Greedy decoding; the answer is a single severity token.
out = model.generate(**inputs, max_new_tokens=8, do_sample=False)
print(proc.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
# e.g. "critical"
```
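Optionally, for repeated inference the adapter can be folded into the base weights, a standard PEFT pattern; the snippet above loads the base in bf16, where merging is straightforward:

```python
# Optional: merge the LoRA weights into the base model for faster inference.
model = model.merge_and_unload()
```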
## Intended use
- Prototyping automated triage from UI screenshots in bug reports where the screenshot itself is informative.
- Studying image-vs-text contributions to severity classification (ablation baseline).
## Out-of-scope
- Production triage of real bug reports: the training distribution is synthetic (deterministic mutators + LLM-generated text). Expect degradation on real-world reports without domain adaptation.
- Non-UI bugs (backend crashes, API contract violations, logic bugs with no visual surface).
- Safety-critical or high-stakes triage decisions.
## Limitations
- Synthetic training signal: every "bug" is one of 5 deterministic mutators; real bugs are more varied.
- Severity-label coupling: each bug type maps 1:1 to a severity, so the model learns `bug_type → severity`, not independent severity reasoning.
- Fine-grained visual bugs: `subtle_offset` (trivial) recall is ~0 across all configs; the 7B VLM is insensitive to few-pixel shifts.
- English prompt template only.
## License
Apache-2.0 (matching the base model). Base model weights are not redistributed; this repository contains the LoRA adapter only.
## Citation
```bibtex
@misc{visiontriage2026,
  title  = {VisionTriage: Multimodal Severity Prediction for UI Bug Reports},
  author = {Debnath, Tathagata},
  year   = {2026},
  url    = {https://github.com/tathadn/visiontriage}
}
```
## Framework versions
- PEFT 0.18.1