---

license: apache-2.0
language:
- en
tags:
- vision
- hotel
- cleanliness-detection
- lora
- qlora
- unsloth
- qwen3-vl
---


# RoomAudit LoRA Adapters

QLoRA adapters for hotel room cleanliness detection, fine-tuned on Qwen3-VL-4B-Instruct. Part of the roomaudit project.

Three adapters are included here, each from a different training approach. All were trained on the same synthetic dataset: 218 clean hotel room images with defects painted in using SAM3 + FLUX.1 Fill inpainting.

---

## Adapters

### `lora_adapter` — primary adapter, use this one

Single-turn format. Takes a room image, returns a JSON verdict with clean/messy classification and a defect list.

| Metric | Score |
|---|---|
| Accuracy | 0.714 |
| Precision | 0.676 |
| Recall | 0.906 |
| F1 | 0.774 |
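Model output can include stray text around the JSON object (the Quickstart below extracts it with a regex for the same reason), so parsing defensively helps. A minimal sketch — the `parse_verdict` helper and the sample string are illustrative, not part of the adapters:

```python
import json
import re

def parse_verdict(raw: str) -> dict:
    """Extract and minimally validate the JSON verdict from raw model output."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    verdict = json.loads(match.group())
    if not isinstance(verdict.get("clean"), bool):
        raise ValueError("missing or non-boolean 'clean' field")
    if not isinstance(verdict.get("defects"), list):
        raise ValueError("missing or non-list 'defects' field")
    return verdict

sample = '{"clean": false, "defects": [{"object": "bed", "type": "stain", "description": "dark stain on duvet"}]}'
verdict = parse_verdict(sample)
print(verdict["clean"], len(verdict["defects"]))  # False 1
```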

### `lora_adapter_agent` — agentic (two-turn) adapter

Two-turn format: Round 1 selects 1-2 regions to inspect, Round 2 gives the final verdict after seeing the crops. Scores below the single-turn adapter on the current synthetic dataset. Included as a reference for the agentic training approach.

| Metric | Score |
|---|---|
| Accuracy | 0.663 |
| Precision | 0.622 |
| Recall | 0.902 |
| F1 | 0.736 |
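The crop step between the two rounds can be sketched with PIL. The region schema here (a `box` of pixel coordinates) is an assumption for illustration only — the adapter's actual round-1 output format is documented in its README:

```python
from PIL import Image

# Hypothetical round-1 response: the model picks 1-2 regions to inspect.
# The field names and box format are assumptions, not the adapter's schema.
round1 = {"regions": [{"box": [120, 80, 420, 360], "reason": "possible stain on carpet"}]}

image = Image.new("RGB", (640, 480), "white")  # stand-in for the real room photo
crops = [image.crop(tuple(r["box"])) for r in round1["regions"]]

# Round 2 would append each crop as an extra image content part in the
# user message before asking for the final JSON verdict.
print(len(crops), crops[0].size)  # 1 (300, 280)
```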



### `lora_adapter_vit` — ViT + LLM adapter

Same single-turn format as the primary adapter, but with LoRA applied to the vision encoder as well as the language layers. Performs worse than LLM-only training: the ViT adapters learn to detect FLUX inpainting artefacts rather than actual room defects. Included as a reference.

| Metric | Score |
|---|---|
| Accuracy | 0.587 |
| Precision | 0.568 |
| Recall | 0.991 |
| F1 | 0.722 |



---



## Quickstart



```python
from huggingface_hub import snapshot_download
from unsloth import FastVisionModel
from peft import PeftModel
from PIL import Image
import json, re
from qwen_vl_utils import process_vision_info

# Download just the primary adapter from the Hub
snapshot_download(
    "RanenSim/RoomAudit-Lora",
    allow_patterns="lora_adapter/*",
    local_dir="outputs/",
)

# Load the 4-bit base model and attach the adapter
model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Qwen3-VL-4B-Instruct-unsloth-bnb-4bit",
    load_in_4bit=True,
)
model = PeftModel.from_pretrained(model, "outputs/lora_adapter")
FastVisionModel.for_inference(model)

image = Image.open("room.jpg").convert("RGB")
messages = [
    {"role": "system", "content": [{"type": "text", "text": "You are a hotel room cleanliness inspector. Respond ONLY with valid JSON."}]},
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": '{"clean": true/false, "defects": [{"object": "...", "type": "...", "description": "..."}]}'},
    ]},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, _ = process_vision_info(messages)
inputs = tokenizer(text=[text], images=image_inputs, padding=True, return_tensors="pt").to("cuda")

out_ids = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.1)
output = tokenizer.decode(out_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
result = json.loads(re.search(r"\{.*\}", output, re.DOTALL).group())
```


See each adapter's README for full usage instructions, training config, and results.

---

Source code, training notebooks, and data generation pipeline: [github.com/Razorbird360/roomaudit](https://github.com/Razorbird360/roomaudit)