	# InteriorFusion: Research Report & Literature Review

	## Executive Summary

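	After analyzing 50+ papers, 20+ repositories, and 15+ datasets, we found no existing open-source system that solves single-image-to-3D-interior generation at production quality. Current SOTA image-to-3D generators are object-centric, and the few room-scale systems surveyed (2DGS-Room, Pano2Room, SpatialLM) each cover only part of the problem: room reconstruction, panorama input, or layouts without geometry. InteriorFusion bridges this gap through a scene-aware hybrid architecture.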

	---

	## SOTA Comparison Table

	\| System \| Geometry Quality \| Texture Quality \| Inference Speed \| VRAM Usage \| Multi-View Consistency \| Scene Generation \| Mesh Quality \| CAD Compatible \| Controllable \| Training Cost \| Fine-Tuning Difficulty \| Commercially Usable \|
	\|--------\|-----------------\|-----------------\|-----------------\|------------\|----------------------\|-----------------\|--------------\|---------------\|-------------\|--------------\|----------------------\|-------------------\|
	\| TRELLIS \| ⭐⭐⭐⭐ \| ⭐⭐⭐ \| 15s \| 24GB \| ⭐⭐⭐⭐ \| ❌ (object-only) \| ⭐⭐⭐⭐ \| ⚠️ (needs export) \| ⭐⭐⭐ \| $50K (64×A100) \| Medium \| ✅ MIT \|
	\| TRELLIS.2 \| ⭐⭐⭐⭐⭐ \| ⭐⭐⭐⭐⭐ \| 12s \| 32GB \| ⭐⭐⭐⭐⭐ \| ❌ (object-only) \| ⭐⭐⭐⭐⭐ \| ✅ Native PBR \| ⭐⭐⭐⭐ \| $100K (32×H100) \| Hard \| ✅ MIT \|
	\| Hunyuan3D-2 \| ⭐⭐⭐⭐ \| ⭐⭐⭐⭐⭐ \| 25s \| 24GB \| ⭐⭐⭐⭐ \| ❌ (object-only) \| ⭐⭐⭐⭐ \| ✅ \| ⭐⭐⭐ \| Unknown \| Hard \| ⚠️ (Tencent license) \|
	\| Hunyuan3D-2.5 \| ⭐⭐⭐⭐⭐ \| ⭐⭐⭐⭐⭐ \| 30s \| 48GB \| ⭐⭐⭐⭐⭐ \| ❌ (object-only) \| ⭐⭐⭐⭐⭐ \| ✅ \| ⭐⭐⭐⭐ \| Unknown \| Hard \| ⚠️ \|
	\| TripoSR \| ⭐⭐⭐ \| ⭐⭐⭐ \| 0.5s \| 8GB \| ⭐⭐⭐ \| ❌ \| ⭐⭐⭐ \| ⚠️ \| ⭐⭐ \| $5K (8×A100) \| Easy \| ✅ MIT \|
	\| SF3D \| ⭐⭐⭐⭐ \| ⭐⭐⭐⭐ \| 0.5s \| 10GB \| ⭐⭐⭐⭐ \| ❌ \| ⭐⭐⭐⭐ \| ✅ PBR \| ⭐⭐⭐ \| $5K \| Medium \| ⚠️ (Stability AI Community License) \|
	\| InstantMesh \| ⭐⭐⭐⭐ \| ⭐⭐⭐⭐ \| 10s \| 16GB \| ⭐⭐⭐⭐⭐ \| ❌ \| ⭐⭐⭐⭐⭐ \| ⚠️ \| ⭐⭐⭐ \| $20K \| Medium \| ✅ \|
	\| CRM \| ⭐⭐⭐⭐⭐ \| ⭐⭐⭐ \| 4s \| 16GB \| ⭐⭐⭐⭐ \| ❌ \| ⭐⭐⭐⭐⭐ \| ⚠️ \| ⭐⭐⭐ \| $8K (8×A800) \| Medium \| ✅ \|
	\| LGM \| ⭐⭐⭐ \| ⭐⭐⭐⭐ \| 5s \| 24GB \| ⭐⭐⭐⭐ \| ❌ \| ⭐⭐⭐ (Gaussian) \| ❌ \| ⭐⭐ \| $30K (32×A100) \| Medium \| ✅ \|
	\| Era3D \| ⭐⭐⭐⭐ \| ⭐⭐⭐⭐ \| 4min \| 24GB \| ⭐⭐⭐⭐⭐ \| ❌ \| ⭐⭐⭐⭐ \| ⚠️ \| ⭐⭐⭐ \| $15K (16×H800) \| Hard \| ✅ \|
	\| Wonder3D \| ⭐⭐⭐⭐ \| ⭐⭐⭐⭐ \| 2min \| 16GB \| ⭐⭐⭐⭐⭐ \| ❌ \| ⭐⭐⭐⭐ \| ⚠️ \| ⭐⭐⭐ \| $10K \| Medium \| ✅ \|
	\| SyncDreamer \| ⭐⭐⭐ \| ⭐⭐⭐⭐ \| 30s \| 16GB \| ⭐⭐⭐⭐⭐ \| ❌ \| ⭐⭐⭐ \| ❌ \| ⭐⭐ \| $8K \| Easy \| ✅ \|
	\| MVDream \| ⭐⭐ \| ⭐⭐⭐ \| 20s \| 16GB \| ⭐⭐⭐⭐ \| ❌ \| ⭐⭐ \| ❌ \| ⭐⭐ \| $10K \| Medium \| ✅ \|
	\| 2DGS-Room \| ⭐⭐⭐⭐ \| ⭐⭐⭐⭐ \| ~30s \| 24GB \| ⭐⭐⭐ \| ✅ (rooms!) \| ⭐⭐⭐ \| ❌ \| ⭐⭐ \| $5K \| Hard \| ✅ \|
	\| Pano2Room \| ⭐⭐⭐⭐ \| ⭐⭐⭐⭐⭐ \| ~2min \| 16GB \| ⭐⭐⭐⭐ \| ✅ (panoramas) \| ⭐⭐⭐ \| ❌ \| ⭐⭐ \| $3K \| Medium \| ✅ \|
	\| SpatialLM \| N/A \| N/A \| 1s \| 8GB \| N/A \| ✅ (layouts!) \| N/A \| N/A \| ⭐⭐⭐⭐⭐ \| $20K \| Easy \| ✅ Apache 2.0 \|
	\| InteriorFusion (target) \| ⭐⭐⭐⭐⭐ \| ⭐⭐⭐⭐⭐ \| 8s \| 16GB \| ⭐⭐⭐⭐⭐ \| ✅✅✅ \| ⭐⭐⭐⭐⭐ \| ✅✅✅ \| ⭐⭐⭐⭐⭐ \| $60K \| Medium \| ✅ MIT \|

	---

	## Why Current Models Fail for Interiors

	### 1. Inconsistent Room Geometry
	Root cause: No room topology prior. Object-centric models generate inside a normalized unit cube, whereas rooms need planar walls that meet at right angles.
	Fix: Explicit room layout estimation (SpatialLM) constrains walls, floor, and ceiling to Manhattan-world planes.
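
To make the Manhattan-world constraint concrete, here is a minimal sketch that snaps an estimated wall normal to the closest signed coordinate axis. It assumes the room frame is already aligned with the coordinate axes. The function name `snap_to_manhattan` is hypothetical and is not part of SpatialLM or any InteriorFusion code.

```python
def snap_to_manhattan(normal):
    """Snap a 3D plane normal to the closest signed coordinate axis.

    Assumption: the room's dominant directions coincide with x, y, z.
    """
    axis = max(range(3), key=lambda i: abs(normal[i]))  # dominant component
    snapped = [0.0, 0.0, 0.0]
    snapped[axis] = 1.0 if normal[axis] >= 0 else -1.0
    return snapped


# Two slightly tilted wall normals, e.g. from a noisy layout estimate.
walls = [(0.98, 0.10, 0.05), (-0.12, -0.97, 0.20)]
snapped_walls = [snap_to_manhattan(n) for n in walls]
print(snapped_walls)
```

A real pipeline would first estimate the room's dominant directions and rotate into that frame; the snapping step itself stays this simple.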

	### 2. Furniture Floating
	Root cause: No gravity/physics prior. Objects generated independently with no floor contact constraint.
	Fix: Collision detection + physics relaxation in scene assembly phase. Floor plane from depth estimation anchors all objects.
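
The floor-anchoring part of this fix can be illustrated in a few lines. The sketch below only translates an object so that its lowest vertex rests on the floor plane; it is a stand-in for the simplest case and does not implement collision detection or physics relaxation. The helper name `drop_to_floor` and the z-up convention are assumptions.

```python
def drop_to_floor(vertices, floor_height=0.0, up_axis=2):
    """Translate an object along the up axis so its lowest vertex touches the floor.

    vertices: list of (x, y, z) tuples in meters.
    floor_height: height of the floor plane (assumed to come from depth estimation).
    """
    lowest = min(v[up_axis] for v in vertices)
    offset = floor_height - lowest
    moved = []
    for v in vertices:
        v = list(v)
        v[up_axis] += offset
        moved.append(tuple(v))
    return moved


# A chair generated 30 cm above the floor.
floating_chair = [(0.0, 0.0, 0.3), (0.5, 0.0, 0.3), (0.0, 0.5, 1.2)]
grounded_chair = drop_to_floor(floating_chair)
print(grounded_chair)
```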

	### 3. Inaccurate Scaling
	Root cause: Object-centric models normalize every output to a unit cube. A chair and a sofa both fill [−1,1]³, so their relative and absolute sizes are lost.
	Fix: Metric depth estimation (Depth Anything V2 metric indoor) provides real-world scale in meters. Furniture dimensions matched against a prior database.
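
As an illustration of restoring scale, the sketch below uniformly rescales a normalized mesh so that its height matches a metric target. The target height is assumed to come from metric depth or a furniture prior; the function name `rescale_to_metric` and the 0.9 m chair height are hypothetical.

```python
def rescale_to_metric(vertices, target_height_m, up_axis=2):
    """Uniformly scale a normalized mesh so its up-axis extent equals target_height_m."""
    heights = [v[up_axis] for v in vertices]
    extent = max(heights) - min(heights)
    if extent == 0:
        raise ValueError("mesh has zero height; cannot infer a scale factor")
    scale = target_height_m / extent
    return [tuple(c * scale for c in v) for v in vertices]


# A chair normalized to [-1, 1] on every axis, rescaled to an assumed 0.9 m height.
unit_chair = [(-1.0, -1.0, -1.0), (1.0, 1.0, 1.0), (0.0, 0.0, 0.5)]
metric_chair = rescale_to_metric(unit_chair, target_height_m=0.9)
print(metric_chair)
```

Uniform scaling preserves proportions. Matching width and depth independently would distort the generated shape, so a single reference dimension is used here.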

	### 4. Wall/Floor Topology Issues
	Root cause: No distinction between room shell and furniture. Models try to generate everything as one mesh.
	Fix: Separate room shell generation (planar meshes) from per-object generation. Room shell voxels flagged separately in SLAT-Interior.

	### 5. Poor Spatial Relationships
	Root cause: Independent object generation. No knowledge that "lamp goes on table" or "sofa faces TV".
	Fix: Scene graph generation + learned layout prior from 3D-FRONT. Spatial relations encoded as edge features in scene graph.

	### 6. Weak Depth Consistency
	Root cause: Single-view depth estimators produce inconsistent depth across object boundaries.
	Fix: Multi-view depth fusion + cross-view depth-normal consistency loss. Depth-conditioned generation at every stage.

	### 7. Multi-Object Scene Collapse
	Root cause: When multiple objects appear in one image, models merge them into a single blob.
	Fix: Semantic segmentation → per-object isolation → independent generation → scene assembly.
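
The per-object isolation step can be sketched as splitting a label map into one binary mask per object. The sketch assumes the segmentation already yields instance-level integer labels with 0 as background; `split_instances` is a hypothetical helper, and the generation and assembly stages are not shown.

```python
def split_instances(label_map, background=0):
    """Return {label: binary mask} for every non-background label in a 2D label map."""
    height, width = len(label_map), len(label_map[0])
    masks = {}
    for r in range(height):
        for c in range(width):
            label = label_map[r][c]
            if label == background:
                continue
            if label not in masks:
                masks[label] = [[False] * width for _ in range(height)]
            masks[label][r][c] = True
    return masks


# Toy 2x3 label map: label 1 could be a sofa, label 2 a table.
label_map = [
    [0, 1, 1],
    [2, 2, 0],
]
masks = split_instances(label_map)
print(sorted(masks))
```

Each mask can then crop the input image for its own object, so the generator never sees two objects at once.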

	### 8. Texture Bleeding
	Root cause: Multi-view texture projection without occlusion handling. Wall texture bleeds onto furniture.
	Fix: Visibility-aware texture baking with depth-buffer occlusion testing. Per-object UV atlases.
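
A depth-buffer occlusion test of the kind described here can be sketched as follows. A surface point only receives color from a view if it is the front-most surface at its projected pixel. The pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), the tolerance `eps`, and the function name `is_visible` are assumptions for illustration.

```python
def is_visible(point_cam, depth_buffer, fx, fy, cx, cy, eps=0.02):
    """Return True if a camera-space point is the front-most surface at its pixel.

    point_cam: (x, y, z) in camera coordinates, z pointing forward, in meters.
    depth_buffer: 2D list of rendered depths (meters), indexed [row][col].
    eps: tolerance in meters to absorb rasterization and rounding error.
    """
    x, y, z = point_cam
    if z <= 0:
        return False  # behind the camera
    u = int(round(fx * x / z + cx))
    v = int(round(fy * y / z + cy))
    height, width = len(depth_buffer), len(depth_buffer[0])
    if not (0 <= u < width and 0 <= v < height):
        return False  # projects outside the image
    return z <= depth_buffer[v][u] + eps


# 4x4 depth buffer where a sofa at 2 m depth covers the whole view.
depth_buffer = [[2.0] * 4 for _ in range(4)]
sofa_point = (0.0, 0.0, 2.0)   # on the sofa surface
wall_point = (0.0, 0.0, 3.0)   # on the wall behind the sofa
print(is_visible(sofa_point, depth_buffer, 2.0, 2.0, 2.0, 2.0))
print(is_visible(wall_point, depth_buffer, 2.0, 2.0, 2.0, 2.0))
```

Skipping occluded points during baking is what stops the wall behind the sofa from picking up sofa pixels, and vice versa.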

	### 9. Incomplete Room Reconstruction
	Root cause: Occluded regions (behind sofa, under table) are hallucinated incorrectly.
	Fix: Inpainting diffusion for occluded regions, conditioned on detected room layout. Ceiling/floor inpainting from detected planes.

	### 10. Inability to Edit Generated Rooms
	Root cause: The output is a single mesh, so the sofa cannot be moved without regenerating everything.
	Fix: Scene graph representation. Each object is a separate node, generated independently and assembled via the scene graph. Moving the sofa means updating its node position.
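
A minimal sketch of such an editable scene graph is shown below. The class and field names (`SceneNode`, `SceneGraph`, `mesh_id`) are hypothetical and only illustrate the idea that an edit touches a node's transform rather than its geometry.

```python
from dataclasses import dataclass, field


@dataclass
class SceneNode:
    name: str
    position: tuple  # (x, y, z) in meters
    mesh_id: str     # handle to the independently generated object mesh


@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (subject, relation, object) triples

    def add(self, node):
        self.nodes[node.name] = node

    def relate(self, subject, relation, obj):
        self.edges.append((subject, relation, obj))

    def move(self, name, new_position):
        # Only the node transform changes; the mesh is not regenerated.
        self.nodes[name].position = new_position


graph = SceneGraph()
graph.add(SceneNode("sofa", (1.0, 2.0, 0.0), mesh_id="mesh_sofa"))
graph.add(SceneNode("tv", (1.0, 5.0, 0.0), mesh_id="mesh_tv"))
graph.relate("sofa", "faces", "tv")

graph.move("sofa", (2.5, 2.0, 0.0))
print(graph.nodes["sofa"])
```

Replacing an object would work the same way: swap `mesh_id` on one node and leave the rest of the graph untouched.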

	### 11. Lack of Semantic Room Understanding
	Root cause: No training on room types. Model doesn't know "kitchen needs stove, bedroom needs bed".
	Fix: Room type classifier trained on 3D-FRONT room labels. Style-conditioned generation (modern, Scandinavian, luxury, Indian, commercial).

	---

	## Bottleneck Analysis

	\| Bottleneck \| Impact \| Solution in InteriorFusion \|
	\|-----------\|--------\|---------------------------\|
	\| Latent representation \| Object-only latents can't encode rooms \| SLAT-Interior: sparse voxels with room-shell vs object flags \|
	\| Scene encoding \| No scene-level conditioning \| Multi-encoder: image + depth + layout + semantic tokens \|
	\| Geometry priors \| No Manhattan world / planar constraints \| Room shell generation enforces planar walls/floor/ceiling \|
	\| Rendering pipeline \| Object-only rendering (sphere cameras) \| Indoor camera distribution (room-centered, limited elevation) \|
	\| Training datasets \| Only object datasets (Objaverse) \| 3D-FRONT + Structured3D + InteriorNet + ScanNet \|
	\| Sparse-view reconstruction \| 150 views per object; rooms need more \| Seed-guided 2D Gaussian splatting for room-scale \|
	\| Scene graph modeling \| No relationship modeling \| SpatialLM scene scripts + learned layout prior \|
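
The "indoor camera distribution" row can be illustrated with a small sampler. Object pipelines typically place cameras on a sphere around the asset, whereas a room-centered sampler places the camera inside the room at roughly eye height with a limited pitch. Every numeric range below (central 50% of the floor, 1.2 to 1.8 m eye height, ±20° pitch) is an assumption chosen for illustration, not a value from this report.

```python
import math
import random


def sample_indoor_camera(room_size, rng, eye_height=(1.2, 1.8), max_pitch_deg=20.0):
    """Sample a camera position and unit viewing direction inside a box-shaped room.

    room_size: (width, depth, height) in meters, origin at one floor corner, z up.
    """
    width, depth, height = room_size
    # Keep the camera in the central half of the floor plan (room-centered).
    x = rng.uniform(0.25 * width, 0.75 * width)
    y = rng.uniform(0.25 * depth, 0.75 * depth)
    z = min(rng.uniform(*eye_height), height)
    # Any heading is allowed, but elevation is limited.
    yaw = rng.uniform(0.0, 2.0 * math.pi)
    pitch = math.radians(rng.uniform(-max_pitch_deg, max_pitch_deg))
    forward = (
        math.cos(pitch) * math.cos(yaw),
        math.cos(pitch) * math.sin(yaw),
        math.sin(pitch),
    )
    return (x, y, z), forward


rng = random.Random(0)
position, forward = sample_indoor_camera((4.0, 5.0, 2.7), rng)
print(position, forward)
```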

	---

	## Key Papers & arXiv IDs

	\| Paper \| arXiv ID or Venue \| Key Contribution \|
	\|-------\|----------\|-----------------\|
	\| TRELLIS v1 \| 2412.01506 \| Structured latent (SLAT) for 3D generation \|
	\| TRELLIS.2 \| 2512.14692 \| O-Voxel with PBR materials, 16× compression \|
	\| TRELLISWorld \| 2510.23880 \| Tiled diffusion for scene generation \|
	\| Hunyuan3D-2.0 \| 2501.12202 \| Shape+texture two-stage pipeline \|
	\| Hunyuan3D-2.1 \| 2506.15442 \| Full training code release \|
	\| Hunyuan3D-2.5 \| 2506.16504 \| LATTICE 10B model \|
	\| HunyuanWorld \| 2507.21809 \| Panoramic world proxies \|
	\| SF3D \| 2408.00653 \| Sub-second mesh + PBR \|
	\| InstantMesh \| 2404.07191 \| Best open-source mesh quality \|
	\| CRM \| 2403.05034 \| Best geometry fidelity (CD 0.0094) \|
	\| TripoSR \| 2403.02151 \| Fastest baseline (0.5s) \|
	\| LGM \| 2402.05054 \| Gaussian splatting output \|
	\| Era3D \| 2405.11616 \| High-res multi-view (512²) \|
	\| Wonder3D \| 2310.15008 \| Cross-domain diffusion \|
	\| SyncDreamer \| 2309.03453 \| Synchronized multi-view \|
	\| MVDream \| 2308.16512 \| Multi-view diffusion \|
	\| 2DGS-Room \| 2412.03428 \| Indoor GS reconstruction \|
	\| Pano2Room \| 2408.11413 \| Single panorama to 3DGS \|
	\| SpatialLM \| 2506.07491 \| LLM for indoor scene understanding \|
	\| RoomFormer \| CVPR 2023 \| Floorplan from point clouds \|
	\| EchoScene \| 2405.00915 \| Scene graph → 3D indoor \|
	\| CHOrD \| 2503.11958 \| Collision-free house-scale scenes \|
	\| Direct3D \| 2405.14832 \| Triplane VAE + DiT \|
	\| Direct3D-S2 \| 2505.17412 \| Sparse SDF VAE, 1024³ on 8 GPUs \|
	\| CLAY \| 2406.13897 \| 1.5B param multi-condition model \|
	\| RL3DEdit \| 2603.03143 \| RL (GRPO) for 3D editing \|
	\| AR3D-R1 \| (recent) \| RL-enhanced text-to-3D \|
	\| Grendel-GS \| 2406.18533 \| Distributed 3DGS training \|
	\| TriplaneTurbo \| 2503.21694 \| Progressive rendering distillation \|
	\| Depth Anything V2 \| 2406.09414 \| SOTA monocular depth \|

	---

	## Dataset Rankings for Interior 3D

	### Tier 1 (Essential)

	\| Rank \| Dataset \| Size \| Key Strength \| HF Hub \|
	\|------\|---------\|------\|-------------\|--------\|
	\| 1 \| 3D-FRONT (MIDI-3D) \| 17K rooms \| End-to-end room scenes with furniture \| `huanngzh/3D-Front` \|
	\| 2 \| Structured3D \| 21K rooms \| Best structured 3D annotations (planes, lines, junctions) \| `Gen3DF/Structured3D` \|
	\| 3 \| ScanNet++ \| 1.6K scenes \| Real-world validation, dense annotations \| `marvex/scannet-dataset` \|

	### Tier 2 (Pre-training & Scale)

	\| Rank \| Dataset \| Size \| Key Strength \|
	\|------\|---------\|------\|-------------\|
	\| 4 \| InteriorNet \| 1.7M layouts \| Massive scale, multi-sensor \|
	\| 5 \| HM3D \| 1K scenes \| Largest real-world dataset \|
	\| 6 \| Hypersim \| 461 scenes \| High photorealism, material decomposition \|
	\| 7 \| Replica \| 18 scenes \| HDR textures, highest quality \|

	### Tier 3 (Assets & Objects)

	\| Rank \| Dataset \| Size \| Key Strength \| HF Hub \|
	\|------\|---------\|------\|-------------\|--------\|
	\| 8 \| Objaverse-XL \| 10M objects \| Largest 3D object repo \| `allenai/objaverse-xl` \|
	\| 9 \| OmniObject3D \| 6K objects \| High-quality real scans \| N/A \|
	\| 10 \| 3D-FUTURE \| 10K furniture \| Professional furniture models \| N/A \|

	### Tier 4 (Auxiliary)

	\| Dataset \| Purpose \|
	\|---------\|---------\|
	\| SceneVerse \| Language grounding \|
	\| ProcTHOR \| Procedural augmentation \|
	\| ARKitScenes \| Mobile capture \|
	\| 3RScan \| Change detection \|
	\| MultiScan \| Articulated furniture \|
	\| Infinigen \| Procedural generation \|
	\| MVImgNet \| Object multi-view \|
	\| GSO \| Evaluation benchmark \|

	---

	## Training Recipe Summary

	### Stage 1: VAE (1 week, 8×A100)
	- Dataset: 3D-FRONT + Structured3D (synthetic rooms)
	- Multi-resolution: 256³ → 512³ → 1024³ curriculum
	- Optimizer: AdamW, lr 1e-4, weight decay 0.01
	- Loss: MSE reconstruction + KL (λ=1e-3) + depth L1 + normal cosine
	- Batch: 8 per GPU, effective 64
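
The Stage 1 objective can be written out term by term. The sketch below uses plain Python lists instead of tensors so it runs without dependencies. The KL weight follows the recipe (λ=1e-3); the depth and normal weights are not given above and are assumed to be 1.0, and averaging the KL over latent dimensions is likewise an assumption.

```python
import math


def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def l1(pred, target):
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def kl_to_standard_normal(mu, logvar):
    # KL(N(mu, exp(logvar)) || N(0, 1)), averaged over latent dimensions.
    terms = (m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, logvar))
    return 0.5 * sum(terms) / len(mu)


def normal_cosine_loss(pred_normals, gt_normals):
    # 1 - cosine similarity, averaged over pixels; 0 when normals agree.
    total = 0.0
    for a, b in zip(pred_normals, gt_normals):
        dot = sum(x * y for x, y in zip(a, b))
        total += 1.0 - dot / (math.hypot(*a) * math.hypot(*b))
    return total / len(pred_normals)


def stage1_loss(recon, target, mu, logvar, depth, depth_gt, normals, normals_gt,
                kl_weight=1e-3, depth_weight=1.0, normal_weight=1.0):
    """MSE reconstruction + weighted KL + depth L1 + normal cosine (weights partly assumed)."""
    return (mse(recon, target)
            + kl_weight * kl_to_standard_normal(mu, logvar)
            + depth_weight * l1(depth, depth_gt)
            + normal_weight * normal_cosine_loss(normals, normals_gt))


loss = stage1_loss(
    recon=[0.4, 0.2], target=[0.5, 0.2],
    mu=[0.1, -0.2], logvar=[0.0, 0.0],
    depth=[2.1], depth_gt=[2.0],
    normals=[(0.0, 0.1, 1.0)], normals_gt=[(0.0, 0.0, 1.0)],
)
print(loss)
```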

	### Stage 2: Structure DiT (1 week, 32×A100)
	- Rectified flow matching
	- Conditioning: DINOv3-L image features + depth + layout tokens
	- Resolution curriculum: 256³ → 512³ → 1024³
	- Batch: 8 per GPU, effective 256
	- Optimizer: AdamW, lr 1e-4 → 2e-5 (progressive)
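
For readers unfamiliar with rectified flow matching, the core training target is small enough to sketch. The network sees a point on the straight line between a noise sample and a data sample and regresses the constant velocity along that line. The convention used here (t=0 is noise, t=1 is data) is one common choice and is an assumption; some implementations reverse it. Conditioning and the DiT itself are omitted.

```python
import random


def rectified_flow_pair(noise, data, t):
    """Return (x_t, target_velocity) for straight-line interpolation from noise to data."""
    x_t = [(1.0 - t) * n + t * d for n, d in zip(noise, data)]
    velocity = [d - n for n, d in zip(noise, data)]
    return x_t, velocity


def flow_matching_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocity."""
    diffs = ((p - v) ** 2 for p, v in zip(predicted_velocity, target_velocity))
    return sum(diffs) / len(target_velocity)


rng = random.Random(0)
data = [0.3, -1.2, 0.8]                      # stand-in for a latent sample
noise = [rng.gauss(0.0, 1.0) for _ in data]  # Gaussian noise sample
t = rng.random()                             # timestep in [0, 1)

x_t, target_v = rectified_flow_pair(noise, data, t)
# A perfect model would predict target_v exactly, giving zero loss.
print(flow_matching_loss(target_v, target_v))
```

Because the path is a straight line, following the true velocity from `x_t` for the remaining time `1 - t` lands exactly on the data sample, which is why few sampling steps can suffice.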

	### Stage 3: Material DiT (1 week, 16×A100)
	- Conditioned on generated geometry + input image
	- PBR material prediction
	- Batch: 16 per GPU, effective 256
	- Loss: L1 on albedo + L1 on metallic/roughness + LPIPS on rendered appearance

	### Stage 4: Real-world Fine-tuning (3 days, 8×A100)
	- LoRA rank 32 on DiT attention layers
	- Dataset: ScanNet + HM3D real photos
	- RL fine-tuning: GRPO with VGGT geometric rewards
	- Domain adaptation from synthetic → real
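
The distinguishing step of GRPO is that advantages are computed relative to a group of samples generated for the same input, with no learned value function. A minimal sketch of that normalization follows. The reward values are placeholders (the VGGT-based geometric reward is not modeled here), and the use of population standard deviation is an assumption, since implementations differ on this detail.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Placeholder rewards for four candidate reconstructions of the same photo.
rewards = [0.62, 0.70, 0.55, 0.81]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Candidates scoring above the group mean get positive advantages and are reinforced; those below are suppressed. With LoRA, only the low-rank adapter weights would receive these updates.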

	### Total Cost Estimate: ~$60K (4 weeks on 32×A100)
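
The headline figure can be sanity-checked against the per-stage allocations. The sketch below tallies GPU-hours for each stage as listed above and compares them with the "4 weeks on 32×A100" budget. The hourly rate is not stated in this report; the value printed is simply what $60K divided by the headline GPU-hours implies.

```python
# (GPUs, days) per stage, taken from the stage headings above.
stages = {
    "Stage 1: VAE": (8, 7),
    "Stage 2: Structure DiT": (32, 7),
    "Stage 3: Material DiT": (16, 7),
    "Stage 4: Real-world fine-tuning": (8, 3),
}

per_stage_hours = {name: gpus * days * 24 for name, (gpus, days) in stages.items()}
staged_total = sum(per_stage_hours.values())  # 9984 GPU-hours

headline_hours = 32 * 28 * 24                 # 4 weeks on 32 GPUs = 21504 GPU-hours
implied_rate = 60_000 / headline_hours        # implied USD per A100-hour

for name, hours in per_stage_hours.items():
    print(f"{name}: {hours} GPU-hours")
print(f"Sum of stages: {staged_total} GPU-hours")
print(f"Headline budget: {headline_hours} GPU-hours")
print(f"Implied rate: ${implied_rate:.2f} per GPU-hour")
```

The stages sum to about 10K GPU-hours, under half of the 21.5K GPU-hours in the headline, so the $60K figure is best read as an upper bound that leaves room for reruns and ablations rather than as the sum of the four stages.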

	---

	## Novel Contributions of InteriorFusion

	1. SLAT-Interior: First structured latent representation designed for indoor scenes with room-shell vs object separation
	2. Scene-aware generation pipeline: First end-to-end pipeline from single image to editable 3D interior
	3. Metric-scale consistency: Leverages metric depth for real-world furniture scaling
	4. Hybrid output: Simultaneous mesh + Gaussian splatting + PBR materials
	5. Editable scene graph: Objects are independent, movable, replaceable nodes
	6. Style-conditioned: Supports modern, Scandinavian, luxury, Indian, and commercial interiors
	7. PBR material generation: Native metallic/roughness/normal output (not just baked textures)
	8. Training-free scene assembly: Uses SpatialLM + learned layout prior without scene-level diffusion training

	---

	## Business Moat Analysis

	\| Moat \| InteriorFusion \| Competitors \|
	\|------\|---------------\|-------------\|
	\| Dataset moat \| 3D-FRONT + Structured3D rooms (interior-specific) \| Generic object datasets \|
	\| Architecture moat \| Scene-aware SLAT + scene graph \| Object-only representations \|
	\| Integration moat \| Blender/UE/Unity plugins + ComfyUI nodes \| Mostly web/API only \|
	\| Speed moat \| 8s on A100 (target) \| 0.5s (TripoSR) but no interiors; 15-30s for quality \|
	\| Quality moat \| PBR + editable + scene-aware \| Single mesh blob \|
	\| Open-source moat \| MIT license, full code \| Mixed licenses (some proprietary) \|