Commit d68dfa6 (verified) by tathadn · 1 parent: ea620b4

initial upload: Config C image-only QLoRA adapter on Qwen2.5-VL-7B

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,123 @@
+ ---
+ license: apache-2.0
+ base_model: Qwen/Qwen2.5-VL-7B-Instruct
+ library_name: peft
+ pipeline_tag: image-text-to-text
+ language:
+ - en
+ tags:
+ - qlora
+ - lora
+ - peft
+ - vision-language
+ - bug-triage
+ - severity-classification
+ - qwen2.5-vl
+ datasets:
+ - tathadn/visiontriage-multimodal
+ ---
+
+ # VisionTriage — Config C (image-only QLoRA)
+
+ **QLoRA adapter on Qwen2.5-VL-7B-Instruct for UI bug severity classification from screenshots alone. Best performer in the VisionTriage 5-config ablation.**
+
+ - **Project:** https://github.com/tathadn/visiontriage
+ - **Paired dataset:** https://huggingface.co/datasets/tathadn/visiontriage-multimodal
+ - **Base model:** [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)
+
+ ## Key result
+
+ Image-only fine-tuning (this model) **significantly outperforms text-only** on multiclass severity (McNemar p=0.00807 on n=265 paired disagreements) and matches or beats the full multimodal variant — adding synthetic bug text on top of the screenshot gives no measurable gain, because the synthetic text is largely redundant with the visual signal.
+
+ | Config | Binary Acc | Binary F1 | MCC | Multiclass Acc |
+ |----------------------------|------------|-----------|-------|----------------|
+ | B — text-only | 0.674 | 0.782 | 0.184 | 0.562 |
+ | **C — image-only (this)** | **0.695** | **0.801** | **0.232** | **0.618** |
+ | D — multimodal (image+text)| 0.683 | 0.784 | 0.220 | 0.595 |
+ | E — zero-shot multimodal | 0.672 | 0.800 | 0.104 | 0.353 |
+
+ Evaluated on a held-out 555-sample synthetic test split shared across B/C/D/E. Full per-sample predictions, confusion matrices, and McNemar breakdown are in the repo under `results/`.
+
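For readers who want to recompute the table's binary metrics or the McNemar comparison from per-sample predictions, the following is a minimal sketch. It assumes the exact (binomial) form of McNemar's test; the README does not say whether the exact or the chi-square variant was used, and the function names `binary_metrics` and `mcnemar_exact` are illustrative, not taken from the project repo.

```python
from math import comb, sqrt


def binary_metrics(tp, fp, fn, tn):
    """Accuracy, F1 and Matthews correlation from a 2x2 confusion matrix."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    f1_denom = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denom if f1_denom else 0.0
    mcc_denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0
    return {"accuracy": accuracy, "f1": f1, "mcc": mcc}


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b = samples model A got right and model B got wrong,
    c = samples model B got right and model A got wrong.
    Under the null hypothesis the discordant pairs split 50/50.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Only the discordant pairs enter the test, which is why the README reports n as the number of paired disagreements rather than the full 555-sample test set. F1 and MCC can diverge sharply on imbalanced labels, which may explain why Config E has a high binary F1 but a low MCC.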
+ ## Input / output
+
+ - **Input:** a UI screenshot (PNG / JPG; square or portrait Android UI).
+ - **Output:** one severity token from `{blocker, critical, major, minor, trivial}`.
+
+ The adapter is trained with a fixed prompt template that does **not** include the bug-report text — only the image is fed in.
+
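Because generation returns free text rather than a class index, callers will usually want to normalise the decoded string to one of the five labels. A minimal sketch follows; `parse_severity` is a hypothetical helper, not part of the project repo, and returning `None` for unrecognised output is an assumption about how a caller would want to handle it.

```python
import re
from typing import Optional

SEVERITIES = ("blocker", "critical", "major", "minor", "trivial")
_PATTERN = re.compile(r"\b(" + "|".join(SEVERITIES) + r")\b")


def parse_severity(generated: str) -> Optional[str]:
    """Return the first known severity label in the decoded text, or None."""
    match = _PATTERN.search(generated.lower())
    return match.group(1) if match else None
```

For example, `parse_severity(" Critical.\n")` yields `"critical"`, while text containing none of the five labels yields `None` so the caller can decide whether to retry or fall back.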
+ ## Training
+
+ - **Base model:** `Qwen/Qwen2.5-VL-7B-Instruct` (loaded 4-bit NF4 via bitsandbytes).
+ - **Adapter:** LoRA, `r=32`, `α=64`, dropout `0.05`, bias `none`.
+ - **Target modules:** `q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj`.
+ - **Epochs:** 3.
+ - **Dataset:** `tathadn/visiontriage-multimodal` train split (4,441 samples; image-only prompt format).
+ - **Hardware:** 1× NVIDIA H100 NVL (~30 GB peak VRAM with 4-bit base).
+ - **Framework versions:** transformers + peft 0.18.1 + trl + bitsandbytes.
+
+ See `configs/sft_config_c.yaml` and `src/models/qlora_sft.py` in the project repo for the exact training recipe.
+
+ ## How to use
+
+ ```python
+ from peft import PeftModel
+ from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
+ from PIL import Image
+ import torch
+
+ BASE = "Qwen/Qwen2.5-VL-7B-Instruct"
+ ADAPT = "tathadn/visiontriage-config-c"
+
+ proc = AutoProcessor.from_pretrained(BASE)
+ base = Qwen2_5_VLForConditionalGeneration.from_pretrained(BASE, torch_dtype=torch.bfloat16, device_map="auto")
+ model = PeftModel.from_pretrained(base, ADAPT)
+
+ img = Image.open("screenshot.png")
+ messages = [{"role": "user", "content": [
+     {"type": "image", "image": img},
+     {"type": "text", "text": "What is the severity of the bug shown in this screenshot? Answer with one of: blocker, critical, major, minor, trivial."},
+ ]}]
+ text = proc.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ inputs = proc(text=[text], images=[img], return_tensors="pt").to(model.device)
+
+ out = model.generate(**inputs, max_new_tokens=8, do_sample=False)
+ print(proc.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
+ # → e.g. "critical"
+ ```
+
+ ## Intended use
+
+ - Prototyping automated triage from UI screenshots in bug reports where the screenshot itself is informative.
+ - Studying image-vs-text contributions to severity classification (ablation baseline).
+
+ ## Out-of-scope
+
+ - Production triage of real bug reports — the training distribution is synthetic (deterministic mutators + LLM-generated text). Expect degradation on real-world reports without domain adaptation.
+ - Non-UI bugs (backend crashes, API contract violations, logic bugs with no visual surface).
+ - Safety-critical or high-stakes triage decisions.
+
+ ## Limitations
+
+ - **Synthetic training signal** — every "bug" is one of 5 deterministic mutators. Real bugs are more varied.
+ - **Severity-label coupling** — each bug type maps 1:1 to a severity, so the model learns `bug_type → severity`, not independent severity reasoning.
+ - **Fine-grained visual bugs** — `subtle_offset` (trivial) recall is ~0 across all configs; the 7B VLM is insensitive to few-pixel shifts.
+ - **English prompt template only.**
+
+ ## License
+
+ Apache-2.0 (matching the base model). Base model weights are not redistributed — this repository contains the LoRA adapter only.
+
+ ## Citation
+
+ ```bibtex
+ @misc{visiontriage2026,
+   title = {VisionTriage: Multimodal Severity Prediction for UI Bug Reports},
+   author = {Debnath, Tathagata},
+   year = {2026},
+   url = {https://github.com/tathadn/visiontriage}
+ }
+ ```
+
+ ### Framework versions
+
+ - PEFT 0.18.1
adapter_config.json ADDED
@@ -0,0 +1,46 @@
+ {
+   "alora_invocation_tokens": null,
+   "alpha_pattern": {},
+   "arrow_config": null,
+   "auto_mapping": null,
+   "base_model_name_or_path": "Qwen/Qwen2.5-VL-7B-Instruct",
+   "bias": "none",
+   "corda_config": null,
+   "ensure_weight_tying": false,
+   "eva_config": null,
+   "exclude_modules": null,
+   "fan_in_fan_out": false,
+   "inference_mode": true,
+   "init_lora_weights": true,
+   "layer_replication": null,
+   "layers_pattern": null,
+   "layers_to_transform": null,
+   "loftq_config": {},
+   "lora_alpha": 64,
+   "lora_bias": false,
+   "lora_dropout": 0.05,
+   "megatron_config": null,
+   "megatron_core": "megatron.core",
+   "modules_to_save": null,
+   "peft_type": "LORA",
+   "peft_version": "0.18.1",
+   "qalora_group_size": 16,
+   "r": 32,
+   "rank_pattern": {},
+   "revision": null,
+   "target_modules": [
+     "o_proj",
+     "gate_proj",
+     "v_proj",
+     "down_proj",
+     "q_proj",
+     "k_proj",
+     "up_proj"
+   ],
+   "target_parameters": null,
+   "task_type": "CAUSAL_LM",
+   "trainable_token_indices": null,
+   "use_dora": false,
+   "use_qalora": false,
+   "use_rslora": false
+ }
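The `r`, `lora_alpha`, and `use_rslora` fields together determine how strongly the low-rank update is scaled when it is added to the frozen weights. A minimal sketch of that arithmetic, using only the three values from the config above (the helper name `lora_scale` is illustrative):

```python
from math import sqrt

# Values copied from adapter_config.json in this commit.
ADAPTER_CONFIG = {"r": 32, "lora_alpha": 64, "use_rslora": False}


def lora_scale(cfg):
    """Scaling applied to the LoRA update B @ A.

    Standard LoRA uses alpha / r; rank-stabilised LoRA uses alpha / sqrt(r).
    """
    r, alpha = cfg["r"], cfg["lora_alpha"]
    return alpha / sqrt(r) if cfg.get("use_rslora") else alpha / r


scale = lora_scale(ADAPTER_CONFIG)
print(scale)  # 2.0
```

With `use_rslora` set to `false`, the update is scaled by 64 / 32 = 2.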
adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a43cb198e32f0e89695fd97137f36a0f1bd67da299fce9b1f84f8779b30613c1
+ size 380800528
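As a rough sanity check on the adapter size, the LoRA parameter count can be estimated from the rank and the shapes of the targeted linear layers. The sketch below rests on several assumptions that are not stated in this commit: the layer counts and dimensions are the commonly published Qwen2.5-VL-7B values, the `gate_proj`/`up_proj`/`down_proj` names are assumed to also match the vision tower's MLP blocks, and the weights are assumed to be stored in fp32.

```python
R = 32  # LoRA rank from adapter_config.json

# ASSUMED (in_features, out_features) per target module in the language model.
LLM_LAYERS = 28
LLM_SHAPES = {
    "q_proj": (3584, 3584),
    "k_proj": (3584, 512),
    "v_proj": (3584, 512),
    "o_proj": (3584, 3584),
    "gate_proj": (3584, 18944),
    "up_proj": (3584, 18944),
    "down_proj": (18944, 3584),
}

# ASSUMED shapes for the vision-tower MLP, matched by module name only.
VISION_BLOCKS = 32
VISION_SHAPES = {
    "gate_proj": (1280, 3420),
    "up_proj": (1280, 3420),
    "down_proj": (3420, 1280),
}


def lora_params(shapes, r):
    """Each adapted Linear adds A (r x in) and B (out x r): r * (in + out)."""
    return sum(r * (fan_in + fan_out) for fan_in, fan_out in shapes.values())


total_params = (LLM_LAYERS * lora_params(LLM_SHAPES, R)
                + VISION_BLOCKS * lora_params(VISION_SHAPES, R))
fp32_bytes = total_params * 4
FILE_SIZE = 380_800_528  # size field of the LFS pointer above

print(total_params, fp32_bytes, FILE_SIZE - fp32_bytes)
```

Under these assumptions the estimate lands within about 85 KB of the reported file size, a gap that would be plausible for the safetensors header. This is consistent with, but does not prove, the adapter covering both the language model and the vision MLP layers.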
chat_template.jinja ADDED
@@ -0,0 +1,7 @@
+ {% set image_count = namespace(value=0) %}{% set video_count = namespace(value=0) %}{% for message in messages %}{% if loop.first and message['role'] != 'system' %}<|im_start|>system
+ You are a helpful assistant.<|im_end|>
+ {% endif %}<|im_start|>{{ message['role'] }}
+ {% if message['content'] is string %}{{ message['content'] }}<|im_end|>
+ {% else %}{% for content in message['content'] %}{% if content['type'] == 'image' or 'image' in content or 'image_url' in content %}{% set image_count.value = image_count.value + 1 %}{% if add_vision_id %}Picture {{ image_count.value }}: {% endif %}<|vision_start|><|image_pad|><|vision_end|>{% elif content['type'] == 'video' or 'video' in content %}{% set video_count.value = video_count.value + 1 %}{% if add_vision_id %}Video {{ video_count.value }}: {% endif %}<|vision_start|><|video_pad|><|vision_end|>{% elif 'text' in content %}{{ content['text'] }}{% endif %}{% endfor %}<|im_end|>
+ {% endif %}{% endfor %}{% if add_generation_prompt %}<|im_start|>assistant
+ {% endif %}
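To make the template's output concrete, here is a simplified plain-Python re-implementation. It is a sketch, not the Jinja template itself: it handles only string, image, and text content, omits the video branch and the `add_vision_id` option, and assumes the usual whitespace handling so that each turn ends with a single newline. The name `render_prompt` is illustrative.

```python
def render_prompt(messages, add_generation_prompt=True):
    """Approximate the chat template for image and text content only."""
    out = []
    for index, message in enumerate(messages):
        # A default system turn is injected when the conversation
        # does not start with a system message.
        if index == 0 and message["role"] != "system":
            out.append("<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n")
        out.append(f"<|im_start|>{message['role']}\n")
        content = message["content"]
        if isinstance(content, str):
            out.append(content)
        else:
            for item in content:
                is_image = (item.get("type") == "image"
                            or "image" in item or "image_url" in item)
                if is_image:
                    # One placeholder per image; the processor later expands
                    # it to the actual number of visual tokens.
                    out.append("<|vision_start|><|image_pad|><|vision_end|>")
                elif "text" in item:
                    out.append(item["text"])
        out.append("<|im_end|>\n")
    if add_generation_prompt:
        out.append("<|im_start|>assistant\n")
    return "".join(out)
```

For a single user turn containing one image followed by a question, this produces a default system turn, a user turn whose body is the image placeholder immediately followed by the question text, and an open assistant turn for generation.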
processor_config.json ADDED
@@ -0,0 +1,61 @@
+ {
+   "image_processor": {
+     "do_convert_rgb": true,
+     "do_normalize": true,
+     "do_rescale": true,
+     "do_resize": true,
+     "image_mean": [
+       0.48145466,
+       0.4578275,
+       0.40821073
+     ],
+     "image_processor_type": "Qwen2VLImageProcessor",
+     "image_std": [
+       0.26862954,
+       0.26130258,
+       0.27577711
+     ],
+     "max_pixels": 602112,
+     "merge_size": 2,
+     "min_pixels": 200704,
+     "patch_size": 14,
+     "resample": 3,
+     "rescale_factor": 0.00392156862745098,
+     "size": {
+       "longest_edge": 12845056,
+       "shortest_edge": 3136
+     },
+     "temporal_patch_size": 2
+   },
+   "processor_class": "Qwen2_5_VLProcessor",
+   "video_processor": {
+     "do_convert_rgb": true,
+     "do_normalize": true,
+     "do_rescale": true,
+     "do_resize": true,
+     "do_sample_frames": false,
+     "image_mean": [
+       0.48145466,
+       0.4578275,
+       0.40821073
+     ],
+     "image_std": [
+       0.26862954,
+       0.26130258,
+       0.27577711
+     ],
+     "max_frames": 768,
+     "merge_size": 2,
+     "min_frames": 4,
+     "patch_size": 14,
+     "resample": 3,
+     "rescale_factor": 0.00392156862745098,
+     "return_metadata": false,
+     "size": {
+       "longest_edge": 12845056,
+       "shortest_edge": 3136
+     },
+     "temporal_patch_size": 2
+   }
+ }
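The pixel limits in the image processor translate directly into a visual-token budget, since each 14-pixel patch is merged 2×2 before reaching the language model. The sketch below does that arithmetic for the values in this file. Note that `min_pixels`/`max_pixels` and `size` give different bounds; which pair takes effect may depend on the installed transformers version, so both are shown without assuming either.

```python
PATCH_SIZE = 14
MERGE_SIZE = 2
# One merged visual token covers a (14 * 2) x (14 * 2) pixel region.
PIXELS_PER_TOKEN = (PATCH_SIZE * MERGE_SIZE) ** 2


def merged_tokens(pixels):
    """Number of visual tokens for an image resized to `pixels` total pixels."""
    return pixels // PIXELS_PER_TOKEN


# Pixel budgets copied from processor_config.json in this commit.
budgets = {
    "min_pixels": 200704,
    "max_pixels": 602112,
    "size.shortest_edge": 3136,
    "size.longest_edge": 12845056,
}

token_budgets = {name: merged_tokens(px) for name, px in budgets.items()}
for name, tokens in token_budgets.items():
    print(f"{name}: {tokens} tokens")
```

If the `min_pixels`/`max_pixels` pair applies, each screenshot is resized to between 256 and 768 visual tokens, which bounds both memory use and the spatial detail available to the model.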
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c47a17a5ec1c2cdadb68a727e1fa12b6ff89fd89a67b136eda88b4c91d267714
+ size 11422172
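The three added lines are a Git LFS pointer, not the tokenizer itself: the repository stores only the spec version, a SHA-256 digest, and the byte size, while the real file lives in LFS storage (this is what the new `.gitattributes` rule enables). A minimal sketch of reading such a pointer, with `parse_lfs_pointer` as an illustrative helper name:

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into its version, hash, and size fields."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    hash_algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "hash_algo": hash_algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


# Pointer contents copied from the tokenizer.json diff in this commit.
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:c47a17a5ec1c2cdadb68a727e1fa12b6ff89fd89a67b136eda88b4c91d267714
size 11422172
"""

pointer = parse_lfs_pointer(POINTER)
print(pointer["hash_algo"], pointer["size"])
```

After downloading the real file, comparing its SHA-256 and byte length against these fields is a simple integrity check.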
tokenizer_config.json ADDED
@@ -0,0 +1,30 @@
+ {
+   "add_prefix_space": false,
+   "backend": "tokenizers",
+   "bos_token": null,
+   "clean_up_tokenization_spaces": false,
+   "eos_token": "<|im_end|>",
+   "errors": "replace",
+   "extra_special_tokens": [
+     "<|im_start|>",
+     "<|im_end|>",
+     "<|object_ref_start|>",
+     "<|object_ref_end|>",
+     "<|box_start|>",
+     "<|box_end|>",
+     "<|quad_start|>",
+     "<|quad_end|>",
+     "<|vision_start|>",
+     "<|vision_end|>",
+     "<|vision_pad|>",
+     "<|image_pad|>",
+     "<|video_pad|>"
+   ],
+   "is_local": false,
+   "model_max_length": 131072,
+   "pad_token": "<|endoftext|>",
+   "processor_class": "Qwen2_5_VLProcessor",
+   "split_special_tokens": false,
+   "tokenizer_class": "Qwen2Tokenizer",
+   "unk_token": null
+ }