
	# FloorplanVLM Training

	Fine-tune Qwen2.5-VL-3B to extract wall, door, and window geometry from floor plan images as structured JSON.

	Based on [FloorplanVLM (arXiv:2602.06507)](https://arxiv.org/abs/2602.06507), training runs in two stages:
	1. SFT on CubiCasa5K (5000 real floor plans)
	2. GRPO with geometric reward functions (wall IoU, room IoU, JSON validity)

	## Quick Start

	```bash
	# Install dependencies
	pip install torch torchvision transformers trl peft datasets accelerate shapely Pillow lxml numpy tqdm huggingface_hub

	# Optional (faster attention on GPU)
	pip install flash-attn

	# Log in to Hugging Face
	huggingface-cli login

	# Stage 1: SFT Training
	python train_floorplan_vlm.py

	# Stage 2: GRPO Training (after SFT completes)
	python train_floorplan_grpo.py
	```

	## What it does

	- Downloads the CubiCasa5K dataset (~5 GB) from Zenodo automatically
	- Converts the SVG floor plan annotations into structured JSON (walls with coordinates, doors, windows, rooms)
	- Trains Qwen2.5-VL-3B with LoRA to predict this JSON from floor plan images
	- Pushes the trained model to the Hugging Face Hub
	- Auto-detects GPU vs CPU (GPU recommended for full training)
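The SVG-to-JSON conversion step can be sketched as follows. This is a minimal illustration, not the code in the training scripts: it assumes each wall is a `<g>` element whose `class` contains `Wall` and which holds one `<polygon>`, it only handles axis-aligned walls, and it uses the standard-library `xml.etree.ElementTree` instead of `lxml`. `SAMPLE_SVG`, `parse_points`, `polygon_to_wall`, and `svg_to_walls` are hypothetical names.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"

# Hypothetical, simplified annotation: a single axis-aligned wall polygon.
SAMPLE_SVG = """<svg xmlns="http://www.w3.org/2000/svg">
  <g id="Wall" class="Wall External">
    <polygon points="120,72.5 520,72.5 520,87.5 120,87.5"/>
  </g>
</svg>"""


def parse_points(points_attr):
    """Turn an SVG 'x,y x,y ...' attribute into a list of (x, y) floats."""
    return [tuple(float(v) for v in pair.split(",")) for pair in points_attr.split()]


def polygon_to_wall(points, wall_id):
    """Approximate an axis-aligned wall polygon by its centerline and thickness."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    width, height = max(xs) - min(xs), max(ys) - min(ys)
    if width >= height:  # horizontal wall: the centerline runs along x
        mid = (min(ys) + max(ys)) / 2
        start, end, thickness = [min(xs), mid], [max(xs), mid], height
    else:  # vertical wall: the centerline runs along y
        mid = (min(xs) + max(xs)) / 2
        start, end, thickness = [mid, min(ys)], [mid, max(ys)], width
    return {"id": wall_id, "start": start, "end": end,
            "thickness": thickness, "curvature": 0, "openings": []}


def svg_to_walls(svg_text):
    """Collect every wall group in the SVG into the walls/rooms JSON layout."""
    root = ET.fromstring(svg_text)
    walls = []
    for group in root.iter(f"{SVG_NS}g"):
        # Assumption: wall groups carry "Wall" as one of their class tokens.
        if "Wall" not in group.get("class", "").split():
            continue
        polygon = group.find(f"{SVG_NS}polygon")
        if polygon is None:
            continue
        points = parse_points(polygon.get("points", ""))
        if len(points) < 3:  # skip degenerate polygons
            continue
        walls.append(polygon_to_wall(points, f"wall_{len(walls) + 1}"))
    return {"walls": walls, "rooms": []}


plan = svg_to_walls(SAMPLE_SVG)
```

Doors, windows, and room labels would need the same kind of traversal; slanted or curved walls need more than a bounding box.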

	## Configuration

	Edit the top of each script:

	| Setting | Default | Description |
	|---|---|---|
	| `MAX_SAMPLES` | `None` (all) | Set to `100` for a quick test run |
	| `NUM_EPOCHS` | `2` | Training epochs |
	| `PUSH_TO_HUB` | `True` | Push model to HF Hub |
	| `HUB_MODEL_ID` | `manitocross/floorplan-vlm-sft` | Your model repo |
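For orientation, the settings might look roughly like this at the top of a script. The constant names and defaults are taken from the table; the layout and the `limit_samples` helper are assumptions for illustration, and the actual scripts may apply `MAX_SAMPLES` differently.

```python
# Settings from the configuration table, with their documented defaults.
MAX_SAMPLES = None  # None uses the full dataset; set to 100 for a quick test run
NUM_EPOCHS = 2
PUSH_TO_HUB = True
HUB_MODEL_ID = "manitocross/floorplan-vlm-sft"  # replace with your own repo


def limit_samples(samples, max_samples=MAX_SAMPLES):
    """Hypothetical helper: keep the first max_samples items, or all if None."""
    items = list(samples)
    return items if max_samples is None else items[:max_samples]
```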

	## Hardware Requirements

	| Mode | Memory | Time (full dataset) |
	|---|---|---|
	| GPU (A100 80GB) | ~20GB VRAM | ~4-6 hours |
	| GPU (RTX 3090/4090) | ~20GB VRAM | ~8-12 hours |
	| CPU | ~14GB RAM | Days (testing only) |

	## Output JSON Schema

	```json
	{
	  "walls": [
	    {
	      "id": "wall_1",
	      "start": [120, 80],
	      "end": [520, 80],
	      "thickness": 15,
	      "curvature": 0,
	      "openings": [
	        {"type": "door", "center": 320, "width": 90},
	        {"type": "window", "center": 450, "width": 60}
	      ]
	    }
	  ],
	  "rooms": [
	    {"label": "bedroom", "walls": ["wall_1", "wall_2", "wall_3", "wall_4"]}
	  ]
	}
	```
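A model's output can be checked against this layout with a small validator. The sketch below is an assumption about what a reasonable check looks like, not the check used in the training scripts: the field names come from the example above, while `validate_floorplan` and the specific rules (numeric coordinates, `door`/`window` as the only opening types, rooms referencing known wall ids) are hypothetical.

```python
import json

OPENING_TYPES = {"door", "window"}  # the only opening types shown in the example


def _is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def _is_point(value):
    return isinstance(value, list) and len(value) == 2 and all(_is_number(v) for v in value)


def validate_floorplan(text):
    """Return a list of problems; an empty list means the output passed."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict) or not isinstance(data.get("walls"), list):
        return ["top level must be an object with a 'walls' list"]

    problems, wall_ids = [], set()
    for i, wall in enumerate(data["walls"]):
        if not isinstance(wall, dict):
            problems.append(f"walls[{i}] is not an object")
            continue
        if isinstance(wall.get("id"), str):
            wall_ids.add(wall["id"])
        else:
            problems.append(f"walls[{i}].id must be a string")
        for key in ("start", "end"):
            if not _is_point(wall.get(key)):
                problems.append(f"walls[{i}].{key} must be [x, y]")
        if not _is_number(wall.get("thickness")):
            problems.append(f"walls[{i}].thickness must be a number")
        openings = wall.get("openings", [])
        if not isinstance(openings, list):
            problems.append(f"walls[{i}].openings must be a list")
            continue
        for j, opening in enumerate(openings):
            ok = (isinstance(opening, dict)
                  and opening.get("type") in OPENING_TYPES
                  and _is_number(opening.get("center"))
                  and _is_number(opening.get("width")))
            if not ok:
                problems.append(f"walls[{i}].openings[{j}] is malformed")

    rooms = data.get("rooms", [])
    if not isinstance(rooms, list):
        return problems + ["'rooms' must be a list"]
    for k, room in enumerate(rooms):
        refs = room.get("walls") if isinstance(room, dict) else None
        if not isinstance(refs, list):
            problems.append(f"rooms[{k}].walls must be a list")
            continue
        for ref in refs:
            if not isinstance(ref, str) or ref not in wall_ids:
                problems.append(f"rooms[{k}] references unknown wall {ref!r}")
    return problems
```

The example above is abbreviated (it defines only `wall_1` while the room also lists `wall_2` to `wall_4`), so this validator would flag it as written; a complete plan defines every referenced wall.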

	## GRPO Reward Functions

	Stage 2 uses geometric rewards from the FloorplanVLM paper:
	- R_val (0.1 weight): JSON validity + schema compliance
	- R_ext (0.5 weight): External wall boundary IoU (Shapely polygon comparison)
	- R_int (0.4 weight): Room IoU, gated by α when external walls are wrong
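One way to combine the three terms is a weighted sum using the weights listed above. The gating rule in this sketch is an assumption, since the README does not spell it out: here `R_int` is scaled by α whenever `R_ext` falls below a threshold. `combined_reward`, `alpha=0.5`, and `ext_threshold=0.5` are hypothetical names and values, and each input is assumed to be a score already computed in [0, 1].

```python
W_VAL, W_EXT, W_INT = 0.1, 0.5, 0.4  # weights from the list above


def combined_reward(r_val, r_ext, r_int, alpha=0.5, ext_threshold=0.5):
    """Weighted sum of the three rewards with a gated room term.

    Assumption: the room reward is scaled by alpha when the external-wall
    IoU is below ext_threshold, and left unchanged otherwise.
    """
    gate = 1.0 if r_ext >= ext_threshold else alpha
    return W_VAL * r_val + W_EXT * r_ext + W_INT * gate * r_int


# Perfect prediction versus a valid prediction with a poor outer boundary.
perfect = combined_reward(1.0, 1.0, 1.0)
poor_boundary = combined_reward(1.0, 0.2, 1.0)
```

Under this reading, a model cannot earn full credit for room layout while the external boundary is wrong, which pushes it to get the outer shape right first.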

	## References

	- [FloorplanVLM: A Vision-Language Model for Floorplan Vectorization](https://arxiv.org/abs/2602.06507)
	- [CubiCasa5K: A Dataset for Floorplan Image Analysis](https://arxiv.org/abs/1904.01920)
	- [TRL: Transformer Reinforcement Learning](https://huggingface.co/docs/trl)