# InteriorFusion: Research Report & Literature Review

## Executive Summary

After analyzing 50+ papers, 20+ repositories, and 15+ datasets, we found that **no existing open-source system solves single-image-to-3D-interior at production quality**. Current SOTA image-to-3D generators are object-centric, and the few scene-level systems in the comparison below (2DGS-Room, Pano2Room, SpatialLM) do not produce CAD-compatible meshes. InteriorFusion bridges this gap through a scene-aware hybrid architecture.

---

## SOTA Comparison Table

| System | Geometry Quality | Texture Quality | Inference Speed | VRAM Usage | Multi-View Consistency | Scene Generation | Mesh Quality | CAD Compatible | Controllable | Training Cost | Fine-Tuning Difficulty | Commercial Usable |
|--------|-----------------|-----------------|-----------------|------------|----------------------|-----------------|--------------|---------------|-------------|--------------|----------------------|-------------------|
| **TRELLIS** | ⭐⭐⭐⭐ | ⭐⭐⭐ | 15s | 24GB | ⭐⭐⭐⭐ | ❌ (object-only) | ⭐⭐⭐⭐ | ⚠️ (needs export) | ⭐⭐⭐ | $50K (64×A100) | Medium | ✅ MIT |
| **TRELLIS.2** | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | 12s | 32GB | ⭐⭐⭐⭐⭐ | ❌ (object-only) | ⭐⭐⭐⭐⭐ | ✅ Native PBR | ⭐⭐⭐⭐ | $100K (32×H100) | Hard | ✅ MIT |
| **Hunyuan3D-2** | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | 25s | 24GB | ⭐⭐⭐⭐ | ❌ (object-only) | ⭐⭐⭐⭐ | ✅ | ⭐⭐⭐ | Unknown | Hard | ⚠️ (Tencent license) |
| **Hunyuan3D-2.5** | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | 30s | 48GB | ⭐⭐⭐⭐⭐ | ❌ (object-only) | ⭐⭐⭐⭐⭐ | ✅ | ⭐⭐⭐⭐ | Unknown | Hard | ⚠️ |
| **TripoSR** | ⭐⭐⭐ | ⭐⭐⭐ | 0.5s | 8GB | ⭐⭐⭐ | ❌ | ⭐⭐⭐ | ⚠️ | ⭐⭐ | $5K (8×A100) | Easy | ✅ MIT |
| **SF3D** | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | 0.5s | 10GB | ⭐⭐⭐⭐ | ❌ | ⭐⭐⭐⭐ | ✅ PBR | ⭐⭐⭐ | $5K | Medium | ✅ MIT |
| **InstantMesh** | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | 10s | 16GB | ⭐⭐⭐⭐⭐ | ❌ | ⭐⭐⭐⭐⭐ | ⚠️ | ⭐⭐⭐ | $20K | Medium | ✅ |
| **CRM** | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | 4s | 16GB | ⭐⭐⭐⭐ | ❌ | ⭐⭐⭐⭐⭐ | ⚠️ | ⭐⭐⭐ | $8K (8×A800) | Medium | ✅ |
| **LGM** | ⭐⭐⭐ | ⭐⭐⭐⭐ | 5s | 24GB | ⭐⭐⭐⭐ | ❌ | ⭐⭐⭐ (Gaussian) | ❌ | ⭐⭐ | $30K (32×A100) | Medium | ✅ |
| **Era3D** | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | 4min | 24GB | ⭐⭐⭐⭐⭐ | ❌ | ⭐⭐⭐⭐ | ⚠️ | ⭐⭐⭐ | $15K (16×H800) | Hard | ✅ |
| **Wonder3D** | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | 2min | 16GB | ⭐⭐⭐⭐⭐ | ❌ | ⭐⭐⭐⭐ | ⚠️ | ⭐⭐⭐ | $10K | Medium | ✅ |
| **SyncDreamer** | ⭐⭐⭐ | ⭐⭐⭐⭐ | 30s | 16GB | ⭐⭐⭐⭐⭐ | ❌ | ⭐⭐⭐ | ❌ | ⭐⭐ | $8K | Easy | ✅ |
| **MVDream** | ⭐⭐ | ⭐⭐⭐ | 20s | 16GB | ⭐⭐⭐⭐ | ❌ | ⭐⭐ | ❌ | ⭐⭐ | $10K | Medium | ✅ |
| **2DGS-Room** | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ~30s | 24GB | ⭐⭐⭐ | ✅ (rooms!) | ⭐⭐⭐ | ❌ | ⭐⭐ | $5K | Hard | ✅ |
| **Pano2Room** | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ~2min | 16GB | ⭐⭐⭐⭐ | ✅ (panoramas) | ⭐⭐⭐ | ❌ | ⭐⭐ | $3K | Medium | ✅ |
| **SpatialLM** | N/A | N/A | 1s | 8GB | N/A | ✅ (layouts!) | N/A | N/A | ⭐⭐⭐⭐⭐ | $20K | Easy | ✅ Apache 2.0 |
| **InteriorFusion (target)** | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | **8s** | **16GB** | ⭐⭐⭐⭐⭐ | ✅✅✅ | ⭐⭐⭐⭐⭐ | ✅✅✅ | ⭐⭐⭐⭐⭐ | **$60K** | Medium | ✅ MIT |

---

## Why Current Models Fail for Interiors

### 1. Inconsistent Room Geometry
**Root cause**: No room topology prior. Object-centric models generate inside a normalized cube, whereas rooms need planar walls that meet at right angles.
**Fix in InteriorFusion**: Explicit room layout estimation (SpatialLM) constrains wall/floor/ceiling to Manhattan-world planes.
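
One simple way to impose a Manhattan-world constraint is to snap each estimated plane normal to its nearest coordinate axis. The sketch below assumes the room has already been rotated so that its dominant directions align with the x/y/z axes; `snap_to_manhattan` is a hypothetical helper, not part of SpatialLM or of any InteriorFusion code.

```python
import numpy as np


def snap_to_manhattan(normals: np.ndarray) -> np.ndarray:
    """Snap each normal in an (N, 3) array to the closest signed coordinate axis."""
    normals = np.asarray(normals, dtype=float)
    rows = np.arange(len(normals))
    dominant = np.argmax(np.abs(normals), axis=1)   # index of the largest component
    snapped = np.zeros_like(normals)
    snapped[rows, dominant] = np.sign(normals[rows, dominant])
    return snapped


# Slightly noisy wall/floor normals, e.g. from a plane-fitting step.
noisy = np.array([
    [0.98, 0.05, -0.10],
    [-0.03, 0.02, 0.99],
    [0.10, -0.97, 0.05],
])
walls = snap_to_manhattan(noisy)
```

After snapping, every pair of distinct planes is either parallel or exactly perpendicular, which is the property the room shell needs.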

### 2. Furniture Floating
**Root cause**: No gravity/physics prior. Objects generated independently with no floor contact constraint.
**Fix**: Collision detection + physics relaxation in scene assembly phase. Floor plane from depth estimation anchors all objects.
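
The floor-anchoring part of this fix can be illustrated with a few lines of NumPy. This is a minimal sketch that only enforces floor contact (it is not the collision detection or physics relaxation step); the function name, the z-up convention, and the floor height argument are assumptions made for the example.

```python
import numpy as np


def drop_to_floor(vertices: np.ndarray, floor_height: float = 0.0, up_axis: int = 2) -> np.ndarray:
    """Translate an object along the up axis so its lowest vertex rests on the floor."""
    vertices = np.asarray(vertices, dtype=float).copy()
    gap = vertices[:, up_axis].min() - floor_height  # > 0 floating, < 0 penetrating
    vertices[:, up_axis] -= gap
    return vertices


# A chair bounding box floating 0.3 m above a floor at z = 0.
chair = np.array([[0.0, 0.0, 0.3], [0.5, 0.5, 0.3], [0.0, 0.0, 1.2], [0.5, 0.5, 1.2]])
grounded = drop_to_floor(chair)
```

The same call also lifts objects that sink below the floor, because `gap` is negative in that case.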

### 3. Inaccurate Scaling
**Root cause**: Object-centric models normalize every output to the same canonical cube, so a chair and a sofa both fill [−1,1]³ and their relative size is lost.
**Fix**: Metric depth estimation (Depth Anything V2 metric indoor) provides real-world scale in meters. Furniture dimensions matched against a prior database.
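
To make the scaling step concrete, the sketch below recovers an object's physical height from its pixel height and metric depth using the standard pinhole relation (height = pixel height × depth / focal length), then rescales a normalized mesh to match. The function names and the example numbers are illustrative assumptions; the prior-database lookup is not shown.

```python
import numpy as np


def estimate_height_m(bbox_height_px: float, depth_m: float, focal_px: float) -> float:
    """Pinhole-camera estimate of an object's physical height in meters."""
    return bbox_height_px * depth_m / focal_px


def rescale_to_metric(vertices: np.ndarray, target_height_m: float, up_axis: int = 2) -> np.ndarray:
    """Uniformly scale a normalized mesh so its extent along the up axis equals the target."""
    vertices = np.asarray(vertices, dtype=float)
    current = vertices[:, up_axis].max() - vertices[:, up_axis].min()
    if current <= 0:
        raise ValueError("mesh has no extent along the up axis")
    return vertices * (target_height_m / current)


# Example values (assumed): a 400 px tall object at 3.0 m, focal length 1200 px.
height_m = estimate_height_m(bbox_height_px=400, depth_m=3.0, focal_px=1200.0)

# Corners of the canonical [-1, 1]^3 cube standing in for a generated mesh.
cube = np.array([[x, y, z] for x in (-1, 1) for y in (-1, 1) for z in (-1, 1)], dtype=float)
metric_mesh = rescale_to_metric(cube, height_m)
```

A uniform scale keeps the generated proportions intact; only the overall size changes.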

### 4. Wall/Floor Topology Issues
**Root cause**: No distinction between room shell and furniture. Models try to generate everything as one mesh.
**Fix**: Separate room shell generation (planar meshes) from per-object generation. Room shell voxels flagged separately in SLAT-Interior.

### 5. Poor Spatial Relationships
**Root cause**: Independent object generation. No knowledge that "lamp goes on table" or "sofa faces TV".
**Fix**: Scene graph generation + learned layout prior from 3D-FRONT. Spatial relations encoded as edge features in scene graph.

### 6. Weak Depth Consistency
**Root cause**: Single-view depth estimators produce inconsistent depth across object boundaries.
**Fix**: Multi-view depth fusion + cross-view depth-normal consistency loss. Depth-conditioned generation at every stage.

### 7. Multi-Object Scene Collapse
**Root cause**: When multiple objects appear in one image, models merge them into a single blob.
**Fix**: Semantic segmentation → per-object isolation → independent generation → scene assembly.
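
The per-object isolation step might look like the following. It is a sketch that assumes an instance mask is already available from some segmentation model (integer IDs, with 0 as background); `isolate_objects` is a hypothetical helper.

```python
import numpy as np


def isolate_objects(image: np.ndarray, instance_mask: np.ndarray) -> dict:
    """Return {instance_id: crop}, with pixels outside the instance zeroed out."""
    crops = {}
    for inst_id in np.unique(instance_mask):
        if inst_id == 0:                      # skip background
            continue
        ys, xs = np.nonzero(instance_mask == inst_id)
        y0, y1 = ys.min(), ys.max() + 1
        x0, x1 = xs.min(), xs.max() + 1
        crop = image[y0:y1, x0:x1].copy()
        crop[instance_mask[y0:y1, x0:x1] != inst_id] = 0  # mask out neighbours
        crops[int(inst_id)] = crop
    return crops


# Toy 6x6 RGB image with two instances.
image = np.arange(6 * 6 * 3).reshape(6, 6, 3)
mask = np.zeros((6, 6), dtype=int)
mask[1:3, 1:4] = 1
mask[1, 1] = 0        # make instance 1 non-rectangular
mask[4, 4] = 2
crops = isolate_objects(image, mask)
```

Each crop can then be passed to the object generator on its own, which is what prevents neighbouring objects from being merged.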

### 8. Texture Bleeding
**Root cause**: Multi-view texture projection without occlusion handling. Wall texture bleeds onto furniture.
**Fix**: Visibility-aware texture baking with depth-buffer occlusion testing. Per-object UV atlases.
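
A depth-buffer occlusion test can be sketched as follows: a surface point receives color from a view only if its camera-space depth matches the depth buffer at the pixel it projects to. The function name, the tolerance `eps`, and the toy intrinsics are assumptions for illustration.

```python
import numpy as np


def visible_mask(points_cam: np.ndarray, depth_buffer: np.ndarray, K: np.ndarray, eps: float = 0.02) -> np.ndarray:
    """Boolean mask of points (N, 3, camera coords, z forward) that pass the depth test."""
    pts = np.asarray(points_cam, dtype=float)
    z = pts[:, 2]
    visible = np.zeros(len(pts), dtype=bool)

    front = z > 0                                   # discard points behind the camera
    uvw = (K @ pts[front].T).T
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)

    h, w = depth_buffer.shape
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    idx = np.flatnonzero(front)[inside]
    # Visible only if nothing closer was rendered at that pixel.
    visible[idx] = z[idx] <= depth_buffer[v[inside], u[inside]] + eps
    return visible


K = np.array([[100.0, 0.0, 2.0], [0.0, 100.0, 2.0], [0.0, 0.0, 1.0]])
depth_buffer = np.full((4, 4), 2.0)                 # nearest surface is 2 m away everywhere
points = np.array([
    [0.0, 0.0, 2.0],    # on the nearest surface (e.g. furniture)
    [0.0, 0.0, 3.0],    # behind it (e.g. the wall)
    [0.0, 0.0, -1.0],   # behind the camera
    [1.0, 0.0, 1.0],    # projects outside the image
])
mask = visible_mask(points, depth_buffer, K)
```

Only points that pass the test are textured from that view, so the wall pixel hidden behind the furniture never contributes color to it or vice versa.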

### 9. Incomplete Room Reconstruction
**Root cause**: Occluded regions (behind sofa, under table) are hallucinated incorrectly.
**Fix**: Inpainting diffusion for occluded regions, conditioned on detected room layout. Ceiling/floor inpainting from detected planes.

### 10. Inability to Edit Generated Rooms
**Root cause**: The output is a single mesh, so moving the sofa means regenerating everything.
**Fix**: Scene graph representation in which each object is a separate, independently generated node. Moving the sofa only requires updating that node's position in the scene graph.
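
A minimal data structure for such an editable scene graph might look like this. The class and field names are hypothetical and only illustrate the idea that an edit touches one node while leaving the rest of the scene untouched.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectNode:
    name: str
    position: tuple          # (x, y, z) in meters
    mesh_path: str = ""      # hypothetical pointer to the per-object mesh


@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)   # (subject, relation, object) triples

    def add(self, node: ObjectNode) -> None:
        self.nodes[node.name] = node

    def relate(self, subject: str, relation: str, obj: str) -> None:
        self.edges.append((subject, relation, obj))

    def move(self, name: str, new_position: tuple) -> None:
        """Editing an object only updates its own node."""
        self.nodes[name].position = new_position


scene = SceneGraph()
scene.add(ObjectNode("sofa", (1.0, 2.0, 0.0)))
scene.add(ObjectNode("tv", (1.0, 5.0, 0.5)))
scene.relate("sofa", "faces", "tv")
scene.move("sofa", (2.5, 2.0, 0.0))
```

No mesh is regenerated by `move`; a renderer or exporter would simply re-read the node positions.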

### 11. Lack of Semantic Room Understanding
**Root cause**: No training on room types. Model doesn't know "kitchen needs stove, bedroom needs bed".
**Fix**: Room type classifier trained on 3D-FRONT room labels. Style-conditioned generation (modern, Scandinavian, luxury, Indian, commercial).

---

## Bottleneck Analysis

| Bottleneck | Impact | Solution in InteriorFusion |
|-----------|--------|---------------------------|
| **Latent representation** | Object-only latents can't encode rooms | SLAT-Interior: sparse voxels with room-shell vs object flags |
| **Scene encoding** | No scene-level conditioning | Multi-encoder: image + depth + layout + semantic tokens |
| **Geometry priors** | No Manhattan world / planar constraints | Room shell generation enforces planar walls/floor/ceiling |
| **Rendering pipeline** | Object-only rendering (sphere cameras) | Indoor camera distribution (room-centered, limited elevation) |
| **Training datasets** | Only object datasets (Objaverse) | 3D-FRONT + Structured3D + InteriorNet + ScanNet |
| **Sparse-view reconstruction** | 150 views per object; rooms need more | Seed-guided 2D Gaussian splatting for room-scale |
| **Scene graph modeling** | No relationship modeling | SpatialLM scene scripts + learned layout prior |
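
As an illustration of the "indoor camera distribution" row, the sketch below samples cameras inside the room at roughly eye height with a bounded elevation angle, in contrast to cameras placed on a sphere around an object. All numeric ranges (central 60% of the floor, 1.2 to 1.8 m eye height, ±20° elevation) are assumptions chosen for the example, not values from the report.

```python
import numpy as np


def sample_indoor_cameras(room_size, n, max_elev_deg=20.0, eye_height=(1.2, 1.8), seed=0):
    """Sample n camera positions and unit view directions inside a box-shaped room (z up)."""
    rng = np.random.default_rng(seed)
    width, depth, _height = room_size

    # Positions: central region of the floor plan, at a plausible eye height.
    xy = rng.uniform([0.2 * width, 0.2 * depth], [0.8 * width, 0.8 * depth], size=(n, 2))
    z = rng.uniform(eye_height[0], eye_height[1], size=(n, 1))
    positions = np.hstack([xy, z])

    # Directions: any heading, but elevation limited to +/- max_elev_deg.
    yaw = rng.uniform(0.0, 2.0 * np.pi, size=n)
    elev = np.deg2rad(rng.uniform(-max_elev_deg, max_elev_deg, size=n))
    directions = np.stack(
        [np.cos(elev) * np.cos(yaw), np.cos(elev) * np.sin(yaw), np.sin(elev)], axis=1
    )
    return positions, directions


positions, directions = sample_indoor_cameras(room_size=(4.0, 5.0, 2.7), n=100)
```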

---

## Key Papers & arXiv IDs

| Paper | arXiv ID / Venue | Key Contribution |
|-------|----------|-----------------|
| TRELLIS v1 | 2412.01506 | Structured latent (SLAT) for 3D generation |
| TRELLIS.2 | 2512.14692 | O-Voxel with PBR materials, 16× compression |
| TRELLISWorld | 2510.23880 | Tiled diffusion for scene generation |
| Hunyuan3D-2.0 | 2501.12202 | Shape+texture two-stage pipeline |
| Hunyuan3D-2.1 | 2506.15442 | Full training code release |
| Hunyuan3D-2.5 | 2506.16504 | LATTICE 10B model |
| HunyuanWorld | 2507.21809 | Panoramic world proxies |
| SF3D | 2408.00653 | Sub-second mesh + PBR |
| InstantMesh | 2404.07191 | Best open-source mesh quality |
| CRM | 2403.05034 | Best geometry fidelity (CD 0.0094) |
| TripoSR | 2403.02151 | Fastest baseline (0.5s) |
| LGM | 2402.05054 | Gaussian splatting output |
| Era3D | 2405.11616 | High-res multi-view (512²) |
| Wonder3D | 2310.15008 | Cross-domain diffusion |
| SyncDreamer | 2309.03453 | Synchronized multi-view |
| MVDream | 2308.16512 | Multi-view diffusion |
| 2DGS-Room | 2412.03428 | Indoor GS reconstruction |
| Pano2Room | 2408.11413 | Single panorama to 3DGS |
| SpatialLM | 2506.07491 | LLM for indoor scene understanding |
| RoomFormer | CVPR 2023 | Floorplan from point clouds |
| EchoScene | 2405.00915 | Scene graph → 3D indoor |
| CHOrD | 2503.11958 | Collision-free house-scale scenes |
| Direct3D | 2405.14832 | Triplane VAE + DiT |
| Direct3D-S2 | 2505.17412 | Sparse SDF VAE, 1024³ on 8 GPUs |
| CLAY | 2406.13897 | 1.5B param multi-condition model |
| RL3DEdit | 2603.03143 | RL (GRPO) for 3D editing |
| AR3D-R1 | (recent) | RL-enhanced text-to-3D |
| Grendel-GS | 2406.18533 | Distributed 3DGS training |
| TriplaneTurbo | 2503.21694 | Progressive rendering distillation |
| Depth Anything V2 | 2406.09414 | SOTA monocular depth |

---

## Dataset Rankings for Interior 3D

### Tier 1 (Essential)

| Rank | Dataset | Size | Key Strength | HF Hub |
|------|---------|------|-------------|--------|
| 1 | **3D-FRONT (MIDI-3D)** | 17K rooms | End-to-end room scenes with furniture | `huanngzh/3D-Front` |
| 2 | **Structured3D** | 21K rooms | Best structured 3D annotations (planes, lines, junctions) | `Gen3DF/Structured3D` |
| 3 | **ScanNet++** | 1.6K scenes | Real-world validation, dense annotations | `marvex/scannet-dataset` |

### Tier 2 (Pre-training & Scale)

| Rank | Dataset | Size | Key Strength |
|------|---------|------|-------------|
| 4 | **InteriorNet** | 1.7M layouts | Massive scale, multi-sensor |
| 5 | **HM3D** | 1K scenes | Largest real-world dataset |
| 6 | **Hypersim** | 461 scenes | High photorealism, material decomposition |
| 7 | **Replica** | 18 scenes | HDR textures, highest quality |

### Tier 3 (Assets & Objects)

| Rank | Dataset | Size | Key Strength | HF Hub |
|------|---------|------|-------------|--------|
| 8 | **Objaverse-XL** | 10M objects | Largest 3D object repo | `allenai/objaverse-xl` |
| 9 | **OmniObject3D** | 6K objects | High-quality real scans | N/A |
| 10 | **3D-FUTURE** | 10K furniture | Professional furniture models | N/A |

### Tier 4 (Auxiliary)

| Dataset | Purpose |
|---------|---------|
| SceneVerse | Language grounding |
| ProcTHOR | Procedural augmentation |
| ARKitScenes | Mobile capture |
| 3RScan | Change detection |
| MultiScan | Articulated furniture |
| Infinigen | Procedural generation |
| MVImgNet | Object multi-view |
| GSO | Evaluation benchmark |

---

## Training Recipe Summary

### Stage 1: VAE (1 week, 8×A100)
- Dataset: 3D-FRONT + Structured3D (synthetic rooms)
- Multi-resolution: 256³ → 512³ → 1024³ curriculum
- Optimizer: AdamW, lr 1e-4, weight decay 0.01
- Loss: MSE reconstruction + KL (λ=1e-3) + depth L1 + normal cosine
- Batch: 8 per GPU, effective 64
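
The Stage 1 objective can be written out as a plain sum of its four terms. The sketch below uses NumPy for clarity rather than a training framework; the KL weight of 1e-3 comes from the recipe above, while the unit weights on the depth and normal terms and the function signature are assumptions.

```python
import numpy as np


def vae_loss(recon, target, mu, logvar, depth_pred, depth_gt, normal_pred, normal_gt, kl_weight=1e-3):
    """MSE reconstruction + weighted KL + depth L1 + normal cosine loss."""
    mse = np.mean((recon - target) ** 2)

    # KL divergence between N(mu, exp(logvar)) and the standard normal prior.
    kl = -0.5 * np.mean(1.0 + logvar - mu ** 2 - np.exp(logvar))

    depth_l1 = np.mean(np.abs(depth_pred - depth_gt))

    # Cosine loss: 0 when predicted and ground-truth normals are parallel.
    dot = np.sum(normal_pred * normal_gt, axis=-1)
    norms = np.linalg.norm(normal_pred, axis=-1) * np.linalg.norm(normal_gt, axis=-1)
    normal_cos = np.mean(1.0 - dot / (norms + 1e-8))

    return mse + kl_weight * kl + depth_l1 + normal_cos


rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8))
depth = rng.uniform(0.5, 5.0, size=(2, 4, 4))
normals = rng.standard_normal((2, 4, 4, 3))

perfect = vae_loss(x, x, np.zeros((2, 4)), np.zeros((2, 4)), depth, depth, normals, normals)
worse = vae_loss(x + 0.5, x, np.ones((2, 4)), np.zeros((2, 4)), depth + 0.1, depth, -normals, normals)
```

With the effective batch of 64 stated above (8 per GPU × 8 GPUs), each term is averaged over the batch before the weighted sum.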

### Stage 2: Structure DiT (1 week, 32×A100)
- Rectified flow matching
- Conditioning: DINOv3-L image features + depth + layout tokens
- Resolution curriculum: 256³ → 512³ → 1024³
- Batch: 8 per GPU, effective 256
- Optimizer: AdamW, lr 1e-4 → 2e-5 (progressive)
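
For readers unfamiliar with rectified flow matching, the training target is the constant velocity along a straight line between a noise sample and a data sample. The sketch below uses the convention that t = 0 is noise and t = 1 is data (some codebases use the opposite direction); the function names are hypothetical, and the conditioning inputs are omitted.

```python
import numpy as np


def rectified_flow_pair(x1: np.ndarray, rng: np.random.Generator):
    """Build one training example: interpolated latent x_t, time t, and target velocity."""
    x0 = rng.standard_normal(x1.shape)                       # noise endpoint
    t = rng.uniform(size=(x1.shape[0],) + (1,) * (x1.ndim - 1))
    x_t = (1.0 - t) * x0 + t * x1                            # straight-line interpolation
    v_target = x1 - x0                                       # constant velocity along the line
    return x_t, t, v_target


def flow_matching_loss(v_pred: np.ndarray, v_target: np.ndarray) -> float:
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((v_pred - v_target) ** 2))


rng = np.random.default_rng(0)
latents = rng.standard_normal((4, 16, 8))                    # stand-in for structure latents
x_t, t, v_target = rectified_flow_pair(latents, rng)
loss_if_perfect = flow_matching_loss(v_target, v_target)
```

Following the target velocity from `x_t` for the remaining time `1 - t` lands exactly on the data sample, which is why a well-trained model can sample in few steps.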

### Stage 3: Material DiT (1 week, 16×A100)
- Conditioned on generated geometry + input image
- PBR material prediction
- Batch: 16 per GPU, effective 256
- Loss: L1 on albedo + L1 on metallic/roughness + LPIPS on rendered appearance

### Stage 4: Real-world Fine-tuning (3 days, 8×A100)
- LoRA rank 32 on DiT attention layers
- Dataset: ScanNet + HM3D real photos
- RL fine-tuning: GRPO with VGGT geometric rewards
- Domain adaptation from synthetic → real
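
The core of GRPO is a group-relative advantage: several outputs are sampled for the same input, and each one's reward is normalized against its own group instead of a learned value baseline. The sketch below shows only that normalization; the reward values are placeholders, and how the VGGT-based geometric reward is computed is not specified here.

```python
import numpy as np


def group_relative_advantages(rewards, eps: float = 1e-8) -> np.ndarray:
    """Normalize rewards within each group: (r - group mean) / (group std + eps)."""
    r = np.asarray(rewards, dtype=float)
    mean = r.mean(axis=-1, keepdims=True)
    std = r.std(axis=-1, keepdims=True)
    return (r - mean) / (std + eps)


# One group of four sampled generations for the same input image (placeholder rewards).
rewards = [[0.2, 0.4, 0.6, 0.8]]
advantages = group_relative_advantages(rewards)
```

Samples that score above their group's average receive positive advantages and are reinforced; those below average are pushed down.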

### Total Cost Estimate: ~$60K (about 3.5 weeks of sequential training, peak 32×A100)
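
The estimate can be sanity-checked by summing GPU-hours over the four stages listed above. The report does not state a price per GPU-hour, so the sketch only derives the rate implied by the ~$60K figure; it assumes the stages run back to back and that "1 week" means 7 full days.

```python
# (number of A100s, duration in days) per stage, taken from the recipe above.
stages = {
    "vae": (8, 7),
    "structure_dit": (32, 7),
    "material_dit": (16, 7),
    "real_world_finetune": (8, 3),
}

gpu_hours = {name: gpus * days * 24 for name, (gpus, days) in stages.items()}
total_gpu_hours = sum(gpu_hours.values())            # 1344 + 5376 + 2688 + 576 = 9984
wall_clock_days = sum(days for _, days in stages.values())
implied_rate = 60_000 / total_gpu_hours              # USD per A100-hour implied by ~$60K

print(f"{total_gpu_hours} GPU-hours over {wall_clock_days} days, "
      f"implied rate ${implied_rate:.2f}/GPU-hour")
```

The Structure DiT stage accounts for more than half of the GPU-hours, so it is the first place to look when trimming the budget.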

---

## Novel Contributions of InteriorFusion

1. **SLAT-Interior**: First structured latent representation designed for indoor scenes with room-shell vs object separation
2. **Scene-aware generation pipeline**: First end-to-end pipeline from single image to editable 3D interior
3. **Metric-scale consistency**: Leverages metric depth for real-world furniture scaling
4. **Hybrid output**: Simultaneous mesh + Gaussian splatting + PBR materials
5. **Editable scene graph**: Objects are independent, movable, replaceable nodes
6. **Style-conditioned**: Supports modern, Scandinavian, luxury, Indian, commercial interiors
7. **PBR material generation**: Native metallic/roughness/normal output (not just baked textures)
8. **Diffusion-free scene assembly**: Uses SpatialLM + a learned layout prior, with no scene-level diffusion training

---

## Business Moat Analysis

| Moat | InteriorFusion | Competitors |
|------|---------------|-------------|
| **Dataset moat** | 3D-FRONT + Structured3D rooms (interior-specific) | Generic object datasets |
| **Architecture moat** | Scene-aware SLAT + scene graph | Object-only representations |
| **Integration moat** | Blender/UE/Unity plugins + ComfyUI nodes | Mostly web/API only |
| **Speed moat** | 8s on A100 | 0.5s (TripoSR) but no interiors; 15-30s for quality |
| **Quality moat** | PBR + editable + scene-aware | Single mesh blob |
| **Open-source moat** | MIT license, full code | Mixed licenses (some proprietary) |