---
license: apache-2.0
language:
- en
tags:
- vision
- hotel
- cleanliness-detection
- lora
- qlora
- unsloth
- qwen3-vl
---

# RoomAudit LoRA Adapters

QLoRA adapters for hotel room cleanliness detection, fine-tuned from Qwen3-VL-4B-Instruct. Part of the roomaudit project.

Three adapters are included here, each from a different training approach. All were trained on the same synthetic dataset: 218 clean hotel room images with defects painted in using SAM3 + FLUX.1 Fill inpainting.

---

## Adapters

### `lora_adapter`: primary adapter, use this one

Single-turn format. Takes a room image, returns a JSON verdict with clean/messy classification and a defect list.

| Metric | Score |
|---|---|
| Accuracy | 0.714 |
| Precision | 0.676 |
| Recall | 0.906 |
| F1 | 0.774 |

### `lora_adapter_agent`: agentic (two-turn) adapter

Two-turn format: round 1 selects one or two regions to inspect, and round 2 gives the final verdict after seeing the crops. It scores below the single-turn adapter on the current synthetic dataset and is included as a reference for the agentic training approach.

| Metric | Score |
|---|---|
| Accuracy | 0.663 |
| Precision | 0.622 |
| Recall | 0.902 |
| F1 | 0.736 |

### `lora_adapter_vit`: ViT + LLM adapter

Same single-turn format as the primary adapter, but with LoRA applied to the vision encoder as well as the language layers. It performs worse than LLM-only training: the ViT adapters learn to detect FLUX inpainting artefacts rather than actual room defects. Included as a reference.

| Metric | Score |
|---|---|
| Accuracy | 0.587 |
| Precision | 0.568 |
| Recall | 0.991 |
| F1 | 0.722 |

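
F1 is the harmonic mean of precision and recall, so each F1 value reported for the three adapters can be recomputed from the other two rows of its table. The sketch below does that with the rounded scores from this card. The helper name `f1_score` and the `reported` dictionary are illustrative and are not taken from the roomaudit code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the adapter tables in this card
reported = {
    "lora_adapter": (0.676, 0.906, 0.774),
    "lora_adapter_agent": (0.622, 0.902, 0.736),
    "lora_adapter_vit": (0.568, 0.991, 0.722),
}

for name, (p, r, f1) in reported.items():
    # Inputs are rounded to three decimals, so compare at the same precision
    print(f"{name}: reported={f1:.3f} recomputed={f1_score(p, r):.3f}")
```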
---

## Quickstart

```python
from huggingface_hub import snapshot_download
from unsloth import FastVisionModel
from peft import PeftModel
from PIL import Image
import json, re
from qwen_vl_utils import process_vision_info

# Download only the primary adapter; it lands in outputs/lora_adapter
snapshot_download(
    "RanenSim/RoomAudit-Lora",
    allow_patterns="lora_adapter/*",
    local_dir="outputs/",
)

# Load the 4-bit base model, then attach the LoRA adapter on top of it
model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Qwen3-VL-4B-Instruct-unsloth-bnb-4bit",
    load_in_4bit=True,
)
model = PeftModel.from_pretrained(model, "outputs/lora_adapter")
FastVisionModel.for_inference(model)

# Build a single-turn chat: system instruction, then image plus the JSON schema
image = Image.open("room.jpg").convert("RGB")
messages = [
    {"role": "system", "content": [{"type": "text", "text": "You are a hotel room cleanliness inspector. Respond ONLY with valid JSON."}]},
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": '{"clean": true/false, "defects": [{"object": "...", "type": "...", "description": "..."}]}'},
    ]},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, _ = process_vision_info(messages)
inputs = tokenizer(text=[text], images=image_inputs, padding=True, return_tensors="pt").to("cuda")

# Generate, then decode only the newly generated tokens (skip the prompt)
out_ids = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.1)
output = tokenizer.decode(out_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)

# Parse the {...} span of the reply; raises if no JSON object is found or it is malformed
result = json.loads(re.search(r"\{.*\}", output, re.DOTALL).group())
```
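
The final parsing line in the Quickstart assumes the model's reply contains a well-formed JSON object and raises otherwise. A more defensive parser can return a sentinel instead. The following is a minimal sketch, not the project's own parsing code: `parse_verdict` is a hypothetical helper, and returning `None` on failure is an assumption about how a caller would want to handle bad output.

```python
import json
import re
from typing import Optional


def parse_verdict(output: str) -> Optional[dict]:
    """Return the parsed verdict dict, or None if no valid verdict is found."""
    # Greedy match from the first "{" to the last "}" so nested defect objects are kept
    match = re.search(r"\{.*\}", output, re.DOTALL)
    if match is None:
        return None
    try:
        verdict = json.loads(match.group())
    except json.JSONDecodeError:
        return None
    # "clean" must be a boolean, as requested in the prompt schema
    if not isinstance(verdict, dict) or not isinstance(verdict.get("clean"), bool):
        return None
    # "defects" must be a list when present; default to an empty list when absent
    if not isinstance(verdict.get("defects", []), list):
        return None
    verdict.setdefault("defects", [])
    return verdict


reply = 'Sure: {"clean": false, "defects": [{"object": "bed", "type": "unmade", "description": "sheets rumpled"}]}'
print(parse_verdict(reply))           # parsed dict
print(parse_verdict("no json here"))  # None
```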

See each adapter's README for full usage instructions, training config, and results.

---

Source code, training notebooks, and data generation pipeline: [github.com/Razorbird360/roomaudit](https://github.com/Razorbird360/roomaudit)