# FloorplanVLM Training

Fine-tune **Qwen2.5-VL-3B** to extract wall, door, and window geometry from floor plan images as structured JSON.

Based on [FloorplanVLM (arXiv:2602.06507)](https://arxiv.org/abs/2602.06507), with two-stage training:
1. **SFT** on CubiCasa5K (5,000 real floor plans)
2. **GRPO** with geometric reward functions (wall IoU, room IoU, JSON validity)

## Quick Start

```bash
# Install dependencies
pip install torch torchvision transformers trl peft datasets accelerate shapely Pillow lxml numpy tqdm huggingface_hub

# Optional (faster attention on GPU)
pip install flash-attn

# Log in to Hugging Face
huggingface-cli login

# Stage 1: SFT training
python train_floorplan_vlm.py

# Stage 2: GRPO training (after SFT completes)
python train_floorplan_grpo.py
```

## What it does

- **Downloads** the CubiCasa5K dataset (~5 GB) from Zenodo automatically
- **Converts** SVG floor plan annotations → structured JSON (walls with coordinates, doors, windows, rooms)
- **Trains** Qwen2.5-VL-3B with LoRA to predict this JSON from floor plan images
- **Pushes** the model to the Hugging Face Hub
- **Auto-detects** GPU vs CPU (GPU recommended for full training)

## Configuration

Edit the top of each script:

| Setting | Default | Description |
|---|---|---|
| `MAX_SAMPLES` | `None` (all) | Set to `100` for a quick test run |
| `NUM_EPOCHS` | `2` | Training epochs |
| `PUSH_TO_HUB` | `True` | Push model to HF Hub |
| `HUB_MODEL_ID` | `manitocross/floorplan-vlm-sft` | Your model repo |

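As a rough illustration of how these settings might sit at the top of a script, the sketch below defines them as module-level constants and applies `MAX_SAMPLES` to a list of examples. Only the setting names and defaults come from the table above; the `limit_samples` helper is hypothetical and is not taken from the training scripts.

```python
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")

# Settings from the table above, with their documented defaults.
MAX_SAMPLES: Optional[int] = None  # None = full dataset; e.g. 100 for a quick test
NUM_EPOCHS = 2
PUSH_TO_HUB = True
HUB_MODEL_ID = "manitocross/floorplan-vlm-sft"  # replace with your own repo


def limit_samples(examples: Sequence[T], max_samples: Optional[int]) -> List[T]:
    """Hypothetical helper: keep the first `max_samples` items, or all if None."""
    if max_samples is None:
        return list(examples)
    return list(examples[:max_samples])
```

With `MAX_SAMPLES = 100`, a call such as `limit_samples(dataset_rows, MAX_SAMPLES)` would keep only the first 100 rows, which is usually enough to confirm that the pipeline runs end to end before committing to a full run.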
## Hardware Requirements

| Hardware | Memory | Time (full dataset) |
|---|---|---|
| GPU (A100 80 GB) | ~20 GB VRAM | ~4-6 hours |
| GPU (RTX 3090/4090) | ~20 GB VRAM | ~8-12 hours |
| CPU | ~14 GB RAM | Days (for testing only) |

## Output JSON Schema

```json
{
  "walls": [
    {
      "id": "wall_1",
      "start": [120, 80],
      "end": [520, 80],
      "thickness": 15,
      "curvature": 0,
      "openings": [
        {"type": "door", "center": 320, "width": 90},
        {"type": "window", "center": 450, "width": 60}
      ]
    }
  ],
  "rooms": [
    {"label": "bedroom", "walls": ["wall_1", "wall_2", "wall_3", "wall_4"]}
  ]
}
```

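A prediction in this format can be checked with a few lines of standard-library Python. The sketch below encodes one assumed notion of "valid" (parseable JSON, `walls` and `rooms` lists, two-number `start`/`end` points, openings typed as door or window, and rooms that reference known wall ids). The names `validate_floorplan` and `_is_point` are hypothetical, and the actual training scripts may check more or less than this.

```python
import json


def _is_point(value) -> bool:
    """True for a two-element list of numbers such as [120, 80]."""
    return (
        isinstance(value, list)
        and len(value) == 2
        and all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in value)
    )


def validate_floorplan(text: str) -> list:
    """Return a list of problems; an empty list means the assumed checks passed."""
    try:
        plan = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(plan, dict):
        return ["top-level value must be an object"]

    problems = []
    wall_ids = set()
    walls = plan.get("walls")
    if not isinstance(walls, list):
        problems.append("'walls' must be a list")
        walls = []
    for i, wall in enumerate(walls):
        if not isinstance(wall, dict):
            problems.append(f"walls[{i}] must be an object")
            continue
        if isinstance(wall.get("id"), str):
            wall_ids.add(wall["id"])
        else:
            problems.append(f"walls[{i}] needs a string 'id'")
        for key in ("start", "end"):
            if not _is_point(wall.get(key)):
                problems.append(f"walls[{i}].{key} must be [x, y]")
        openings = wall.get("openings", [])
        if not isinstance(openings, list):
            problems.append(f"walls[{i}].openings must be a list")
            openings = []
        for j, opening in enumerate(openings):
            if not isinstance(opening, dict) or opening.get("type") not in ("door", "window"):
                problems.append(f"walls[{i}].openings[{j}] must have type 'door' or 'window'")

    rooms = plan.get("rooms")
    if not isinstance(rooms, list):
        problems.append("'rooms' must be a list")
        rooms = []
    for i, room in enumerate(rooms):
        if not isinstance(room, dict):
            problems.append(f"rooms[{i}] must be an object")
            continue
        refs = room.get("walls", [])
        if not isinstance(refs, list):
            problems.append(f"rooms[{i}].walls must be a list")
            continue
        for ref in refs:
            # Every room must point at walls that were actually declared.
            if not isinstance(ref, str) or ref not in wall_ids:
                problems.append(f"rooms[{i}] references unknown wall {ref!r}")
    return problems
```

Note that the abbreviated example above would be flagged by this check, because its room lists `wall_2` through `wall_4` while only `wall_1` is shown; a complete prediction would define all four walls.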
## GRPO Reward Functions

Stage 2 uses geometric rewards from the FloorplanVLM paper:
- **R_val** (0.1 weight): JSON validity + schema compliance
- **R_ext** (0.5 weight): External wall boundary IoU (Shapely polygon comparison)
- **R_int** (0.4 weight): Room IoU, gated by α when external walls are wrong

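The weighting above can be sketched as a single scalar reward. The exact gating rule is not spelled out here, so the sketch assumes one plausible form: when R_ext falls below a threshold, the R_int term is scaled by α. The function name, the `alpha` and `ext_threshold` defaults, and the gating form itself are assumptions for illustration, not values from the paper or the training scripts. The three inputs are expected to be computed elsewhere (the IoU terms, for example, with Shapely).

```python
def combined_reward(r_val, r_ext, r_int, alpha=0.5, ext_threshold=0.5,
                    w_val=0.1, w_ext=0.5, w_int=0.4):
    """Weighted sum of the three reward terms, each expected in [0, 1].

    ASSUMPTION: the room-IoU term is multiplied by `alpha` whenever the
    external-wall IoU is below `ext_threshold`; otherwise it counts in full.
    """
    for name, value in (("r_val", r_val), ("r_ext", r_ext), ("r_int", r_int)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")

    # Down-weight interior accuracy when the outer boundary is unreliable.
    gate = 1.0 if r_ext >= ext_threshold else alpha
    return w_val * r_val + w_ext * r_ext + w_int * gate * r_int
```

Under these assumed defaults, a prediction with valid JSON but a poor outer boundary earns only part of its room-IoU credit, which discourages the model from fitting interior rooms inside a wrong outline.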
## References

- [FloorplanVLM: A Vision-Language Model for Floorplan Vectorization](https://arxiv.org/abs/2602.06507)
- [CubiCasa5K: A Dataset for Floorplan Image Analysis](https://arxiv.org/abs/1904.01920)
- [TRL: Transformer Reinforcement Learning](https://huggingface.co/docs/trl)