	---
	license: apache-2.0
	language:
	- en
	tags:
	- vision
	- hotel
	- cleanliness-detection
	- lora
	- qlora
	- unsloth
	- qwen3-vl
	---

	# RoomAudit LoRA Adapters

	QLoRA adapters for hotel room cleanliness detection, fine-tuned on Qwen3-VL-4B-Instruct. Part of the roomaudit project.

	Three adapters are included here, each from a different training approach. All were trained on the same synthetic dataset: 218 clean hotel room images with defects painted in using SAM3 + FLUX.1 Fill inpainting.

	---

	## Adapters

	### `lora_adapter` — primary adapter, use this one

	Single-turn format. Takes a room image, returns a JSON verdict with clean/messy classification and a defect list.

	| Metric | Score |
	|---|---|
	| Accuracy | 0.714 |
	| Precision | 0.676 |
	| Recall | 0.906 |
	| F1 | 0.774 |
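
As a quick consistency check, F1 is the harmonic mean of precision and recall, so the F1 row can be recomputed from the two rows above it. The helper name `f1_from` is illustrative and not part of the roomaudit code.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the table above.
primary_f1 = f1_from(0.676, 0.906)
print(round(primary_f1, 3))  # 0.774
```

The gap between recall (0.906) and precision (0.676) suggests the adapter catches most messy rooms but also raises a fair number of false alarms.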

	### `lora_adapter_agent` — agentic (two-turn) adapter

	Two-turn format: Round 1 selects 1-2 regions to inspect, and Round 2 gives the final verdict after seeing the crops. This adapter scores below the single-turn adapter on the current synthetic dataset and is included as a reference for the agentic training approach.

	| Metric | Score |
	|---|---|
	| Accuracy | 0.663 |
	| Precision | 0.622 |
	| Recall | 0.902 |
	| F1 | 0.736 |
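
The two-round flow can be sketched as plain control logic. This is only an illustration: the `ask` callable, the round names, the `regions` key, and the normalized `(x0, y0, x1, y1)` box format are all assumptions made for the sketch, not the adapter's actual prompt or output format (the adapter's own README is the reference for that).

```python
import json
from typing import Callable, List, Sequence, Tuple

PixelBox = Tuple[int, int, int, int]


def to_pixel_box(box: Sequence[float], width: int, height: int) -> PixelBox:
    """Clamp a normalized (x0, y0, x1, y1) box to [0, 1] and scale it to pixels."""
    x0, y0, x1, y1 = (min(max(float(v), 0.0), 1.0) for v in box)
    return (round(x0 * width), round(y0 * height),
            round(x1 * width), round(y1 * height))


def two_round_verdict(
    ask: Callable[[str, List[PixelBox]], str],
    width: int,
    height: int,
    max_regions: int = 2,
) -> dict:
    """Run a hypothetical two-round inspection and return the parsed verdict."""
    # Round 1: the model sees the full image and proposes regions (assumed format).
    round1 = json.loads(ask("select_regions", []))
    boxes = [to_pixel_box(b, width, height)
             for b in round1["regions"][:max_regions]]
    # Round 2: the model sees the selected crops and returns the final verdict.
    return json.loads(ask("final_verdict", boxes))
```

With Pillow, each returned pixel box could be passed directly to `Image.crop` to produce the Round 2 inputs.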

	### `lora_adapter_vit` — ViT + LLM adapter

	Same single-turn format as the primary adapter, but with LoRA applied to the vision encoder as well as the language layers. It performs worse than LLM-only training (accuracy 0.587 vs. 0.714 for the primary adapter): the ViT adapters learn to detect FLUX inpainting artefacts rather than actual room defects. Included as a reference.

	| Metric | Score |
	|---|---|
	| Accuracy | 0.587 |
	| Precision | 0.568 |
	| Recall | 0.991 |
	| F1 | 0.722 |
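
The difference between the LLM-only and ViT + LLM runs comes down to which parts of the model receive LoRA weights. With Unsloth this is typically controlled by the `finetune_*` flags of `FastVisionModel.get_peft_model`. The values below are a sketch of how the two setups might differ, not the project's actual training config; rank, alpha, and other hyperparameters are left out because they are not stated here.

```python
# Flags assumed to be shared by both setups (not taken from the roomaudit config).
SHARED_FLAGS = {
    "finetune_language_layers": True,
    "finetune_attention_modules": True,
    "finetune_mlp_modules": True,
}

# LLM-only training: the vision encoder stays frozen.
LLM_ONLY = {**SHARED_FLAGS, "finetune_vision_layers": False}

# ViT + LLM training: LoRA is also applied to the vision encoder.
VIT_PLUS_LLM = {**SHARED_FLAGS, "finetune_vision_layers": True}


def changed_flags(a: dict, b: dict) -> list:
    """Return the sorted keys whose values differ between two flag sets."""
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


print(changed_flags(LLM_ONLY, VIT_PLUS_LLM))  # ['finetune_vision_layers']
```

Either dictionary could then be unpacked into the call, for example `FastVisionModel.get_peft_model(model, **LLM_ONLY)`, alongside whatever rank and alpha the training notebooks use.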

	---

	## Quickstart

	```python
	# Import unsloth first so its patches are applied before peft is loaded.
	from unsloth import FastVisionModel
	from huggingface_hub import snapshot_download
	from peft import PeftModel
	from PIL import Image
	import json
	import re
	from qwen_vl_utils import process_vision_info

	# Download only the primary adapter into outputs/lora_adapter.
	snapshot_download(
	    "RanenSim/RoomAudit-Lora",
	    allow_patterns="lora_adapter/*",
	    local_dir="outputs/",
	)

	# Load the 4-bit base model, then attach the LoRA adapter.
	model, tokenizer = FastVisionModel.from_pretrained(
	    "unsloth/Qwen3-VL-4B-Instruct-unsloth-bnb-4bit",
	    load_in_4bit=True,
	)
	model = PeftModel.from_pretrained(model, "outputs/lora_adapter")
	FastVisionModel.for_inference(model)

	# Build the single-turn prompt: system instruction, image, and output schema.
	image = Image.open("room.jpg").convert("RGB")
	messages = [
	    {"role": "system", "content": [{"type": "text", "text": "You are a hotel room cleanliness inspector. Respond ONLY with valid JSON."}]},
	    {"role": "user", "content": [
	        {"type": "image", "image": image},
	        {"type": "text", "text": '{"clean": true/false, "defects": [{"object": "...", "type": "...", "description": "..."}]}'},
	    ]},
	]
	text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
	image_inputs, _ = process_vision_info(messages)
	inputs = tokenizer(text=[text], images=image_inputs, padding=True, return_tensors="pt").to("cuda")

	# Generate, then decode only the newly generated tokens.
	out_ids = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.1)
	output = tokenizer.decode(out_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)

	# Extract the JSON object; fail clearly if the model returned none.
	match = re.search(r"\{.*\}", output, re.DOTALL)
	if match is None:
	    raise ValueError(f"No JSON object found in model output: {output!r}")
	result = json.loads(match.group())
	```
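
Because the model can emit malformed or incomplete JSON, it may be worth validating the parsed verdict before acting on it. The sketch below assumes the output schema shown in the quickstart prompt (`clean` plus a `defects` list whose items have `object`, `type`, and `description`); the `parse_verdict` helper is an illustrative addition, not part of roomaudit.

```python
import json
import re

DEFECT_KEYS = ("object", "type", "description")


def parse_verdict(output: str) -> dict:
    """Extract and validate a verdict dict from raw model output.

    Raises ValueError if no JSON object is found or the schema is violated.
    """
    match = re.search(r"\{.*\}", output, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    verdict = json.loads(match.group())  # JSONDecodeError is a ValueError

    if not isinstance(verdict.get("clean"), bool):
        raise ValueError("'clean' must be true or false")

    defects = verdict.get("defects", [])
    if not isinstance(defects, list):
        raise ValueError("'defects' must be a list")
    for defect in defects:
        if not isinstance(defect, dict):
            raise ValueError("each defect must be an object")
        missing = [key for key in DEFECT_KEYS if key not in defect]
        if missing:
            raise ValueError(f"defect is missing keys: {missing}")

    return {"clean": verdict["clean"], "defects": defects}
```

In the quickstart, `result = parse_verdict(output)` could stand in for the final regex and `json.loads` steps.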

	See each adapter's README for full usage instructions, training config, and results.

	---

	Source code, training notebooks, and data generation pipeline: [github.com/Razorbird360/roomaudit](https://github.com/Razorbird360/roomaudit)