# InteriorFusion: Research Report & Literature Review
## Executive Summary
After analyzing 50+ papers, 20+ repositories, and 15+ datasets, we found that **no existing open-source system solves single-image-to-3D-interior at production quality**. The leading single-image generators are object-centric, and the room-scale systems surveyed (2DGS-Room, Pano2Room, SpatialLM) cover only reconstruction, panoramas, or layouts. InteriorFusion bridges this gap through a scene-aware hybrid architecture.
---
## SOTA Comparison Table
| System | Geometry Quality | Texture Quality | Inference Speed | VRAM Usage | Multi-View Consistency | Scene Generation | Mesh Quality | CAD Compatible | Controllable | Training Cost | Fine-Tuning Difficulty | Commercial Usable |
|--------|-----------------|-----------------|-----------------|------------|----------------------|-----------------|--------------|---------------|-------------|--------------|----------------------|-------------------|
| **TRELLIS** | ⭐⭐⭐⭐ | ⭐⭐⭐ | 15s | 24GB | ⭐⭐⭐⭐ | ❌ (object-only) | ⭐⭐⭐⭐ | ⚠️ (needs export) | ⭐⭐⭐ | $50K (64×A100) | Medium | ✅ MIT |
| **TRELLIS.2** | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | 12s | 32GB | ⭐⭐⭐⭐⭐ | ❌ (object-only) | ⭐⭐⭐⭐⭐ | ✅ Native PBR | ⭐⭐⭐⭐ | $100K (32×H100) | Hard | ✅ MIT |
| **Hunyuan3D-2** | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | 25s | 24GB | ⭐⭐⭐⭐ | ❌ (object-only) | ⭐⭐⭐⭐ | ✅ | ⭐⭐⭐ | Unknown | Hard | ⚠️ (Tencent license) |
| **Hunyuan3D-2.5** | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | 30s | 48GB | ⭐⭐⭐⭐⭐ | ❌ (object-only) | ⭐⭐⭐⭐⭐ | ✅ | ⭐⭐⭐⭐ | Unknown | Hard | ⚠️ |
| **TripoSR** | ⭐⭐⭐ | ⭐⭐⭐ | 0.5s | 8GB | ⭐⭐⭐ | ❌ | ⭐⭐⭐ | ⚠️ | ⭐⭐ | $5K (8×A100) | Easy | ✅ MIT |
| **SF3D** | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | 0.5s | 10GB | ⭐⭐⭐⭐ | ❌ | ⭐⭐⭐⭐ | ✅ PBR | ⭐⭐⭐ | $5K | Medium | ✅ MIT |
| **InstantMesh** | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | 10s | 16GB | ⭐⭐⭐⭐⭐ | ❌ | ⭐⭐⭐⭐⭐ | ⚠️ | ⭐⭐⭐ | $20K | Medium | ✅ |
| **CRM** | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | 4s | 16GB | ⭐⭐⭐⭐ | ❌ | ⭐⭐⭐⭐⭐ | ⚠️ | ⭐⭐⭐ | $8K (8×A800) | Medium | ✅ |
| **LGM** | ⭐⭐⭐ | ⭐⭐⭐⭐ | 5s | 24GB | ⭐⭐⭐⭐ | ❌ | ⭐⭐⭐ (Gaussian) | ❌ | ⭐⭐ | $30K (32×A100) | Medium | ✅ |
| **Era3D** | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | 4min | 24GB | ⭐⭐⭐⭐⭐ | ❌ | ⭐⭐⭐⭐ | ⚠️ | ⭐⭐⭐ | $15K (16×H800) | Hard | ✅ |
| **Wonder3D** | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | 2min | 16GB | ⭐⭐⭐⭐⭐ | ❌ | ⭐⭐⭐⭐ | ⚠️ | ⭐⭐⭐ | $10K | Medium | ✅ |
| **SyncDreamer** | ⭐⭐⭐ | ⭐⭐⭐⭐ | 30s | 16GB | ⭐⭐⭐⭐⭐ | ❌ | ⭐⭐⭐ | ❌ | ⭐⭐ | $8K | Easy | ✅ |
| **MVDream** | ⭐⭐ | ⭐⭐⭐ | 20s | 16GB | ⭐⭐⭐⭐ | ❌ | ⭐⭐ | ❌ | ⭐⭐ | $10K | Medium | ✅ |
| **2DGS-Room** | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ~30s | 24GB | ⭐⭐⭐ | ✅ (rooms!) | ⭐⭐⭐ | ❌ | ⭐⭐ | $5K | Hard | ✅ |
| **Pano2Room** | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ~2min | 16GB | ⭐⭐⭐⭐ | ✅ (panoramas) | ⭐⭐⭐ | ❌ | ⭐⭐ | $3K | Medium | ✅ |
| **SpatialLM** | N/A | N/A | 1s | 8GB | N/A | ✅ (layouts!) | N/A | N/A | ⭐⭐⭐⭐⭐ | $20K | Easy | ✅ Apache 2.0 |
| **InteriorFusion (target)** | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | **8s** | **16GB** | ⭐⭐⭐⭐⭐ | ✅✅✅ | ⭐⭐⭐⭐⭐ | ✅✅✅ | ⭐⭐⭐⭐⭐ | **$60K** | Medium | ✅ MIT |
---
## Why Current Models Fail for Interiors
### 1. Inconsistent Room Geometry
**Root cause**: No room topology prior. Object-centric models generate inside a normalized unit cube, whereas rooms need planar walls meeting at right angles.
**Fix in InteriorFusion**: Explicit room layout estimation (SpatialLM) constrains wall/floor/ceiling to Manhattan-world planes.
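As a rough illustration of what a Manhattan-world constraint means in practice, the sketch below snaps a noisy wall normal to the nearest coordinate axis. It assumes the room's dominant directions are already aligned with x/y/z; the function name `snap_to_manhattan` is hypothetical and not part of SpatialLM.

```python
import math

def snap_to_manhattan(normal):
    """Snap a plane normal to the nearest signed coordinate axis."""
    # The dominant axis is the component with the largest magnitude.
    axis = max(range(3), key=lambda i: abs(normal[i]))
    snapped = [0.0, 0.0, 0.0]
    snapped[axis] = math.copysign(1.0, normal[axis])
    return tuple(snapped)

# A wall normal tilted a few degrees off the x axis (illustrative values).
noisy_wall = (0.97, 0.05, -0.12)
wall_normal = snap_to_manhattan(noisy_wall)
```

A real pipeline would first estimate the dominant room frame and rotate into it; this sketch only shows the final snapping step.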
### 2. Furniture Floating
**Root cause**: No gravity/physics prior. Objects generated independently with no floor contact constraint.
**Fix**: Collision detection + physics relaxation in scene assembly phase. Floor plane from depth estimation anchors all objects.
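The floor-anchoring part of this fix can be sketched as a vertical translation. This is a simplified stand-in for the physics relaxation step, assuming a horizontal floor, a z-up coordinate frame, and a known world-space bounding box; `snap_to_floor` is a hypothetical helper name.

```python
def snap_to_floor(position, bbox_min_z, floor_z=0.0):
    """Shift an object vertically so the bottom of its bounding box touches the floor.

    position   -- object origin (x, y, z) in meters, z pointing up
    bbox_min_z -- lowest point of the object's world-space bounding box
    floor_z    -- height of the estimated floor plane
    """
    gap = bbox_min_z - floor_z  # positive: floating, negative: sunk into the floor
    x, y, z = position
    return (x, y, z - gap)

# A chair whose lowest point hovers 15 cm above the floor (illustrative values).
grounded = snap_to_floor(position=(1.25, 2.25, 0.60), bbox_min_z=0.15)
```

The same function also lifts objects that have sunk below the floor, since the gap is then negative.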
### 3. Inaccurate Scaling
**Root cause**: Object-centric models normalize to unit cube. A chair and a sofa both fit in [-1, 1]³.
**Fix**: Metric depth estimation (Depth Anything V2 metric indoor) provides real-world scale in meters. Furniture dimensions matched against a prior database.
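A minimal sketch of the rescaling step, assuming the object was generated in [-1, 1]³ (so its normalized height is 2.0) and that a metric height has already been measured from the depth map. The prior table and its ranges are invented placeholders, not values from an actual furniture database.

```python
# Hypothetical prior: plausible height ranges in meters per furniture class.
HEIGHT_PRIOR_M = {
    "chair": (0.7, 1.1),
    "sofa": (0.6, 1.0),
    "table": (0.4, 0.9),
}

def metric_scale(category, observed_height_m, normalized_height=2.0):
    """Return the factor that maps a normalized mesh to real-world meters."""
    low, high = HEIGHT_PRIOR_M[category]
    # Clamp the depth-derived height to the prior range to reject outliers.
    height = min(max(observed_height_m, low), high)
    return height / normalized_height

chair_scale = metric_scale("chair", observed_height_m=0.9)
outlier_scale = metric_scale("chair", observed_height_m=3.0)  # clamped to 1.1 m
```

Clamping is only one way to combine the two signals; a weighted blend of observation and prior would be an equally reasonable choice.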
### 4. Wall/Floor Topology Issues
**Root cause**: No distinction between room shell and furniture. Models try to generate everything as one mesh.
**Fix**: Separate room shell generation (planar meshes) from per-object generation. Room shell voxels flagged separately in SLAT-Interior.
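The report does not specify the memory layout of SLAT-Interior, so the following is only one plausible way to flag room-shell voxels separately from object voxels in a sparse grid. All class and field names here are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class Voxel:
    latent: List[float]  # per-voxel feature vector
    is_shell: bool       # True for wall/floor/ceiling voxels
    object_id: int = -1  # -1 for shell voxels, otherwise the owning object

SparseGrid = Dict[Tuple[int, int, int], Voxel]

def split_shell_and_objects(grid: SparseGrid):
    """Separate shell voxels from object voxels, grouping the latter by object."""
    shell = {k: v for k, v in grid.items() if v.is_shell}
    objects: Dict[int, SparseGrid] = {}
    for coord, voxel in grid.items():
        if not voxel.is_shell:
            objects.setdefault(voxel.object_id, {})[coord] = voxel
    return shell, objects

grid = {
    (0, 0, 0): Voxel([0.1], is_shell=True),
    (0, 1, 0): Voxel([0.2], is_shell=True),
    (3, 3, 1): Voxel([0.7], is_shell=False, object_id=0),
    (5, 2, 1): Voxel([0.9], is_shell=False, object_id=1),
}
shell, objects = split_shell_and_objects(grid)
```

Keeping the flag on each voxel lets the shell be decoded as planar meshes while each object group is decoded on its own.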
### 5. Poor Spatial Relationships
**Root cause**: Independent object generation. No knowledge that "lamp goes on table" or "sofa faces TV".
**Fix**: Scene graph generation + learned layout prior from 3D-FRONT. Spatial relations encoded as edge features in scene graph.
### 6. Weak Depth Consistency
**Root cause**: Single-view depth estimators produce inconsistent depth across object boundaries.
**Fix**: Multi-view depth fusion + cross-view depth-normal consistency loss. Depth-conditioned generation at every stage.
### 7. Multi-Object Scene Collapse
**Root cause**: When multiple objects appear in one image, models merge them into a single blob.
**Fix**: Semantic segmentation → per-object isolation → independent generation → scene assembly.
### 8. Texture Bleeding
**Root cause**: Multi-view texture projection without occlusion handling. Wall texture bleeds onto furniture.
**Fix**: Visibility-aware texture baking with depth-buffer occlusion testing. Per-object UV atlases.
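The core of a depth-buffer occlusion test is a single comparison, sketched below. It assumes a depth buffer has already been rendered from the source camera; the function name, the tolerance value, and the toy 2×2 buffer are illustrative assumptions.

```python
def is_visible(surface_depth, depth_buffer, pixel, eps=0.01):
    """Return True if a surface point is the nearest surface at its pixel.

    surface_depth -- distance from the camera to the point being textured
    depth_buffer  -- 2D list of nearest-surface depths rendered from that camera
    pixel         -- (column, row) the point projects to
    eps           -- tolerance in meters to absorb rasterization error
    """
    col, row = pixel
    return surface_depth <= depth_buffer[row][col] + eps

# 2x2 depth buffer: a sofa at 2.0 m covers the left column, the wall at 4.0 m the right.
depth_buffer = [[2.0, 4.0],
                [2.0, 4.0]]

sofa_texel = is_visible(2.0, depth_buffer, pixel=(0, 0))   # sofa surface itself
hidden_wall = is_visible(4.0, depth_buffer, pixel=(0, 1))  # wall behind the sofa
open_wall = is_visible(4.0, depth_buffer, pixel=(1, 0))    # unobstructed wall
```

Only texels that pass the test receive color from that view, which is what keeps sofa pixels from being baked onto the wall behind it and vice versa.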
### 9. Incomplete Room Reconstruction
**Root cause**: Occluded regions (behind sofa, under table) are hallucinated incorrectly.
**Fix**: Inpainting diffusion for occluded regions, conditioned on detected room layout. Ceiling/floor inpainting from detected planes.
### 10. Inability to Edit Generated Rooms
**Root cause**: Single output mesh. Can't move sofa without regenerating everything.
**Fix**: Scene graph representation. Each object is a separate node. Objects generated independently, assembled via scene graph. Move sofa = update scene graph node position.
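A minimal sketch of why a scene graph makes editing cheap: moving an object touches one node's transform and nothing else. The classes and node names below are hypothetical and are not taken from an InteriorFusion codebase.

```python
from dataclasses import dataclass, field

@dataclass
class SceneNode:
    name: str
    category: str
    position: tuple  # (x, y, z) in meters

@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)
    relations: list = field(default_factory=list)  # (subject, relation, object)

    def add(self, node):
        self.nodes[node.name] = node

    def relate(self, subject, relation, obj):
        self.relations.append((subject, relation, obj))

    def move(self, name, new_position):
        # Only this node's transform changes; no mesh is regenerated.
        self.nodes[name].position = new_position

graph = SceneGraph()
graph.add(SceneNode("sofa_0", "sofa", (1.0, 2.0, 0.0)))
graph.add(SceneNode("tv_0", "tv", (1.0, 5.0, 0.8)))
graph.relate("sofa_0", "faces", "tv_0")
graph.move("sofa_0", (2.5, 2.0, 0.0))
```

After a move, stored relations such as `faces` can be re-validated or used to adjust orientation, which this sketch leaves out.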
### 11. Lack of Semantic Room Understanding
**Root cause**: No training on room types. Model doesn't know "kitchen needs stove, bedroom needs bed".
**Fix**: Room type classifier trained on 3D-FRONT room labels. Style-conditioned generation (modern, Scandinavian, luxury, Indian, commercial).
---
## Bottleneck Analysis
| Bottleneck | Impact | Solution in InteriorFusion |
|-----------|--------|---------------------------|
| **Latent representation** | Object-only latents can't encode rooms | SLAT-Interior: sparse voxels with room-shell vs object flags |
| **Scene encoding** | No scene-level conditioning | Multi-encoder: image + depth + layout + semantic tokens |
| **Geometry priors** | No Manhattan world / planar constraints | Room shell generation enforces planar walls/floor/ceiling |
| **Rendering pipeline** | Object-only rendering (sphere cameras) | Indoor camera distribution (room-centered, limited elevation) |
| **Training datasets** | Only object datasets (Objaverse) | 3D-FRONT + Structured3D + InteriorNet + ScanNet |
| **Sparse-view reconstruction** | 150 views per object; rooms need more | Seed-guided 2D Gaussian splatting for room-scale |
| **Scene graph modeling** | No relationship modeling | SpatialLM scene scripts + learned layout prior |
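To make the "room-centered, limited elevation" camera distribution concrete, here is a hedged sketch of how such cameras might be sampled for rendering training views. Every numeric range (eye height, pitch limit, jitter around the room center) is an assumption chosen for illustration, not a value from the report.

```python
import math
import random

def sample_indoor_camera(room_size, rng, max_pitch_deg=20.0, eye_height=(1.2, 1.8)):
    """Sample a camera pose inside a room of size (width, depth, height) in meters."""
    width, depth, height = room_size
    # Position: jittered around the room center rather than on an enclosing sphere.
    x = width * (0.5 + rng.uniform(-0.25, 0.25))
    y = depth * (0.5 + rng.uniform(-0.25, 0.25))
    z = min(rng.uniform(*eye_height), height - 0.1)
    # Orientation: any heading, but pitch stays close to horizontal.
    yaw = rng.uniform(0.0, 2.0 * math.pi)
    pitch = math.radians(rng.uniform(-max_pitch_deg, max_pitch_deg))
    return (x, y, z), yaw, pitch

rng = random.Random(0)
cameras = [sample_indoor_camera((4.0, 5.0, 2.7), rng) for _ in range(100)]
```

Object-centric pipelines instead place cameras on a sphere looking inward, which never produces the inside-out viewpoints that interior photos have.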
---
## Key Papers & arXiv IDs
| Paper | arXiv ID | Key Contribution |
|-------|----------|-----------------|
| TRELLIS v1 | 2412.01506 | Structured latent (SLAT) for 3D generation |
| TRELLIS.2 | 2512.14692 | O-Voxel with PBR materials, 16× compression |
| TRELLISWorld | 2510.23880 | Tiled diffusion for scene generation |
| Hunyuan3D-2.0 | 2501.12202 | Shape+texture two-stage pipeline |
| Hunyuan3D-2.1 | 2506.15442 | Full training code release |
| Hunyuan3D-2.5 | 2506.16504 | LATTICE 10B model |
| HunyuanWorld | 2507.21809 | Panoramic world proxies |
| SF3D | 2408.00653 | Sub-second mesh + PBR |
| InstantMesh | 2404.07191 | Best open-source mesh quality |
| CRM | 2403.05034 | Best geometry fidelity (CD 0.0094) |
| TripoSR | 2403.02151 | Fastest baseline (0.5s) |
| LGM | 2402.05054 | Gaussian splatting output |
| Era3D | 2405.11616 | High-res multi-view (512²) |
| Wonder3D | 2310.15008 | Cross-domain diffusion |
| SyncDreamer | 2309.03453 | Synchronized multi-view |
| MVDream | 2308.16512 | Multi-view diffusion |
| 2DGS-Room | 2412.03428 | Indoor GS reconstruction |
| Pano2Room | 2408.11413 | Single panorama to 3DGS |
| SpatialLM | 2506.07491 | LLM for indoor scene understanding |
| RoomFormer | CVPR 2023 | Floorplan from point clouds |
| EchoScene | 2405.00915 | Scene graph → 3D indoor |
| CHOrD | 2503.11958 | Collision-free house-scale scenes |
| Direct3D | 2405.14832 | Triplane VAE + DiT |
| Direct3D-S2 | 2505.17412 | Sparse SDF VAE, 1024³ on 8 GPUs |
| CLAY | 2406.13897 | 1.5B param multi-condition model |
| RL3DEdit | 2603.03143 | RL (GRPO) for 3D editing |
| AR3D-R1 | (recent) | RL-enhanced text-to-3D |
| Grendel-GS | 2406.18533 | Distributed 3DGS training |
| TriplaneTurbo | 2503.21694 | Progressive rendering distillation |
| Depth Anything V2 | 2406.09414 | SOTA monocular depth |
---
## Dataset Rankings for Interior 3D
### Tier 1 (Essential)
| Rank | Dataset | Size | Key Strength | HF Hub |
|------|---------|------|-------------|--------|
| 1 | **3D-FRONT (MIDI-3D)** | 17K rooms | End-to-end room scenes with furniture | `huanngzh/3D-Front` |
| 2 | **Structured3D** | 21K rooms | Best structured 3D annotations (planes, lines, junctions) | `Gen3DF/Structured3D` |
| 3 | **ScanNet++** | 1.6K scenes | Real-world validation, dense annotations | `marvex/scannet-dataset` |
### Tier 2 (Pre-training & Scale)
| Rank | Dataset | Size | Key Strength |
|------|---------|------|-------------|
| 4 | **InteriorNet** | 1.7M layouts | Massive scale, multi-sensor |
| 5 | **HM3D** | 1K scenes | Largest real-world dataset |
| 6 | **Hypersim** | 461 scenes | High photorealism, material decomposition |
| 7 | **Replica** | 18 scenes | HDR textures, highest quality |
### Tier 3 (Assets & Objects)
| Rank | Dataset | Size | Key Strength | HF Hub |
|------|---------|------|-------------|--------|
| 8 | **Objaverse-XL** | 10M objects | Largest 3D object repo | `allenai/objaverse-xl` |
| 9 | **OmniObject3D** | 6K objects | High-quality real scans | N/A |
| 10 | **3D-FUTURE** | 10K furniture | Professional furniture models | N/A |
### Tier 4 (Auxiliary)
| Dataset | Purpose |
|---------|---------|
| SceneVerse | Language grounding |
| ProcTHOR | Procedural augmentation |
| ARKitScenes | Mobile capture |
| 3RScan | Change detection |
| MultiScan | Articulated furniture |
| Infinigen | Procedural generation |
| MVImgNet | Object multi-view |
| GSO | Evaluation benchmark |
---
## Training Recipe Summary
### Stage 1: VAE (1 week, 8×A100)
- Dataset: 3D-FRONT + Structured3D (synthetic rooms)
- Multi-resolution: 256³ → 512³ → 1024³ curriculum
- Optimizer: AdamW, lr 1e-4, weight decay 0.01
- Loss: MSE reconstruction + KL (λ=1e-3) + depth L1 + normal cosine
- Batch: 8 per GPU, effective 64
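The Stage 1 loss can be written out as a sum of four terms. The sketch below uses plain Python lists so it runs without a deep learning framework; only the KL weight of 1e-3 comes from the recipe, while the depth and normal weights of 1.0 and the assumption of unit-length normals are placeholders.

```python
import math

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def vae_loss(recon, target, mu, logvar, depth_pred, depth_gt,
             normals_pred, normals_gt,
             kl_weight=1e-3, depth_weight=1.0, normal_weight=1.0):
    """MSE reconstruction + weighted KL + depth L1 + normal cosine distance."""
    mse = mean((r - t) ** 2 for r, t in zip(recon, target))
    # KL divergence between N(mu, exp(logvar)) and the standard normal prior.
    kl = -0.5 * mean(1 + lv - m ** 2 - math.exp(lv) for m, lv in zip(mu, logvar))
    depth_l1 = mean(abs(p - g) for p, g in zip(depth_pred, depth_gt))
    # Cosine distance; assumes both normal sets are already unit length.
    normal_cos = mean(1 - sum(a * b for a, b in zip(p, g))
                      for p, g in zip(normals_pred, normals_gt))
    return mse + kl_weight * kl + depth_weight * depth_l1 + normal_weight * normal_cos

perfect = vae_loss(
    recon=[0.5, 0.2], target=[0.5, 0.2],
    mu=[0.0, 0.0], logvar=[0.0, 0.0],
    depth_pred=[2.0, 3.0], depth_gt=[2.0, 3.0],
    normals_pred=[(0.0, 0.0, 1.0)], normals_gt=[(0.0, 0.0, 1.0)],
)
```

A perfect reconstruction with a latent that matches the prior yields zero loss, which is a quick sanity check for any real implementation of the same terms.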
### Stage 2: Structure DiT (1 week, 32×A100)
- Rectified flow matching
- Conditioning: DINOv3-L image features + depth + layout tokens
- Resolution curriculum: 256³ → 512³ → 1024³
- Batch: 8 per GPU, effective 256
- Optimizer: AdamW, lr 1e-4 → 2e-5 (progressive)
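For readers unfamiliar with rectified flow matching, the training target is simple enough to sketch without a framework. The convention assumed here puts noise at t=0 and data at t=1; some implementations use the reverse, and the report does not say which one InteriorFusion follows.

```python
def rectified_flow_pair(noise, data, t):
    """Return the interpolated sample x_t and the constant velocity target.

    Convention assumed here: t=0 is pure noise, t=1 is the data sample.
    """
    x_t = [(1.0 - t) * n + t * d for n, d in zip(noise, data)]
    velocity = [d - n for n, d in zip(noise, data)]
    return x_t, velocity

def flow_matching_loss(predicted_velocity, target_velocity):
    """Mean squared error between the model's velocity and the target."""
    diffs = [(p - q) ** 2 for p, q in zip(predicted_velocity, target_velocity)]
    return sum(diffs) / len(diffs)

noise = [0.0, 1.0, -1.0]
data = [2.0, 1.0, 3.0]
x_half, target = rectified_flow_pair(noise, data, t=0.5)
loss = flow_matching_loss(predicted_velocity=target, target_velocity=target)
```

In training, the DiT would receive `x_t`, `t`, and the conditioning tokens, and its output would take the place of `predicted_velocity`.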
### Stage 3: Material DiT (1 week, 16×A100)
- Conditioned on generated geometry + input image
- PBR material prediction
- Batch: 16 per GPU, effective 256
- Loss: L1 on albedo + L1 on metallic/roughness + LPIPS on rendered appearance
### Stage 4: Real-world Fine-tuning (3 days, 8×A100)
- LoRA rank 32 on DiT attention layers
- Dataset: ScanNet + HM3D real photos
- RL fine-tuning: GRPO with VGGT geometric rewards
- Domain adaptation from synthetic → real
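The distinguishing step of GRPO is that advantages are computed relative to a group of samples generated for the same input, rather than from a learned value function. A minimal sketch of that normalization follows; the reward values stand in for VGGT-based geometric scores and are made up for illustration.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical geometric rewards for four rooms generated from the same photo.
rewards = [0.62, 0.71, 0.55, 0.80]
advantages = group_relative_advantages(rewards)
```

Samples scoring above the group mean get positive advantages and are reinforced; the `eps` term keeps the division defined when every sample in a group scores the same.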
### Total Cost Estimate: ~$60K (4 weeks on 32×A100)
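As a sanity check on the headline figure, the per-stage durations and GPU counts listed above can be summed directly. This sketch assumes the stages run sequentially and that one week means seven full days; the implied hourly rate is derived from the $60K estimate, not quoted from any provider.

```python
# (stage, days, GPUs) taken from the four stages in the recipe.
stages = [
    ("VAE", 7, 8),
    ("Structure DiT", 7, 32),
    ("Material DiT", 7, 16),
    ("Real-world fine-tuning", 3, 8),
]

gpu_hours = sum(days * 24 * gpus for _, days, gpus in stages)  # 9984
wall_clock_days = sum(days for _, days, _ in stages)           # 24
upper_bound_gpu_hours = 4 * 7 * 24 * 32                        # 4 weeks on 32 GPUs
implied_rate = 60_000 / gpu_hours                              # dollars per GPU-hour

print(gpu_hours, wall_clock_days, upper_bound_gpu_hours, round(implied_rate, 2))
```

Read this way, "4 weeks on 32×A100" is an upper bound of 21,504 GPU-hours, while the stages as listed add up to 9,984 GPU-hours over roughly 24 days, which leaves headroom for restarts and ablations.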
---
## Novel Contributions of InteriorFusion
1. **SLAT-Interior**: First structured latent representation designed for indoor scenes with room-shell vs object separation
2. **Scene-aware generation pipeline**: First end-to-end pipeline from single image to editable 3D interior
3. **Metric-scale consistency**: Leverages metric depth for real-world furniture scaling
4. **Hybrid output**: Simultaneous mesh + Gaussian splatting + PBR materials
5. **Editable scene graph**: Objects are independent, movable, replaceable nodes
6. **Style-conditioned**: Supports modern, Scandinavian, luxury, Indian, commercial interiors
7. **PBR material generation**: Native metallic/roughness/normal output (not just baked textures)
8. **Training-free scene assembly**: Uses SpatialLM + learned layout prior without scene-level diffusion training
---
## Business Moat Analysis
| Moat | InteriorFusion | Competitors |
|------|---------------|-------------|
| **Dataset moat** | 3D-FRONT + Structured3D rooms (interior-specific) | Generic object datasets |
| **Architecture moat** | Scene-aware SLAT + scene graph | Object-only representations |
| **Integration moat** | Blender/UE/Unity plugins + ComfyUI nodes | Mostly web/API only |
| **Speed moat** | 8s on A100 | 0.5s (TripoSR) but no interiors; 15-30s for quality |
| **Quality moat** | PBR + editable + scene-aware | Single mesh blob |
| **Open-source moat** | MIT license, full code | Mixed licenses (some proprietary) |