InteriorFusion: Research Report & Literature Review
Executive Summary
After analyzing 50+ papers, 20+ repositories, and 15+ datasets, we found no open-source system that solves single-image-to-3D-interior generation at production quality. The leading SOTA models are object-centric. InteriorFusion aims to bridge this gap through a scene-aware hybrid architecture.
SOTA Comparison Table
| System | Geometry Quality | Texture Quality | Inference Speed | VRAM Usage | Multi-View Consistency | Scene Generation | Mesh Quality | CAD Compatible | Controllable | Training Cost | Fine-Tuning Difficulty | Commercial Usable |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TRELLIS | ★★★★ | ★★★ | 15s | 24GB | ★★★★ | ❌ (object-only) | ★★★★ | ⚠️ (needs export) | ★★★ | $50K (64×A100) | Medium | ✅ MIT |
| TRELLIS.2 | ★★★★★ | ★★★★★ | 12s | 32GB | ★★★★★ | ❌ (object-only) | ★★★★★ | ✅ Native PBR | ★★★★ | $100K (32×H100) | Hard | ✅ MIT |
| Hunyuan3D-2 | ★★★★ | ★★★★★ | 25s | 24GB | ★★★★ | ❌ (object-only) | ★★★★ | ? | ★★★ | Unknown | Hard | ⚠️ (Tencent license) |
| Hunyuan3D-2.5 | ★★★★★ | ★★★★★ | 30s | 48GB | ★★★★★ | ❌ (object-only) | ★★★★★ | ? | ★★★★ | Unknown | Hard | ⚠️ |
| TripoSR | ★★★ | ★★★ | 0.5s | 8GB | ★★★ | ❌ | ★★★ | ⚠️ | ★★ | $5K (8×A100) | Easy | ✅ MIT |
| SF3D | ★★★★ | ★★★★ | 0.5s | 10GB | ★★★★ | ❌ | ★★★★ | ✅ PBR | ★★★ | $5K | Medium | ✅ MIT |
| InstantMesh | ★★★★ | ★★★★ | 10s | 16GB | ★★★★★ | ❌ | ★★★★★ | ⚠️ | ★★★ | $20K | Medium | ? |
| CRM | ★★★★★ | ★★★ | 4s | 16GB | ★★★★ | ❌ | ★★★★★ | ⚠️ | ★★★ | $8K (8×A800) | Medium | ? |
| LGM | ★★★ | ★★★★ | 5s | 24GB | ★★★★ | ❌ | ★★★ (Gaussian) | ? | ★★ | $30K (32×A100) | Medium | ? |
| Era3D | ★★★★ | ★★★★ | 4min | 24GB | ★★★★★ | ❌ | ★★★★ | ⚠️ | ★★★ | $15K (16×H800) | Hard | ? |
| Wonder3D | ★★★★ | ★★★★ | 2min | 16GB | ★★★★★ | ❌ | ★★★★ | ⚠️ | ★★★ | $10K | Medium | ? |
| SyncDreamer | ★★★ | ★★★★ | 30s | 16GB | ★★★★★ | ❌ | ★★★ | ? | ★★ | $8K | Easy | ? |
| MVDream | ★★ | ★★★ | 20s | 16GB | ★★★★ | ❌ | ★★ | ? | ★★ | $10K | Medium | ? |
| 2DGS-Room | ★★★★ | ★★★★ | ~30s | 24GB | ★★★ | ✅ (rooms!) | ★★★ | ? | ★★ | $5K | Hard | ? |
| Pano2Room | ★★★★ | ★★★★★ | ~2min | 16GB | ★★★★ | ✅ (panoramas) | ★★★ | ? | ★★ | $3K | Medium | ? |
| SpatialLM | N/A | N/A | 1s | 8GB | N/A | ✅ (layouts!) | N/A | N/A | ★★★★★ | $20K | Easy | ✅ Apache 2.0 |
| InteriorFusion (target) | ★★★★★ | ★★★★★ | 8s | 16GB | ★★★★★ | ✅✅✅ | ★★★★★ | ✅✅✅ | ★★★★★ | $60K | Medium | ✅ MIT |
Why Current Models Fail for Interiors
1. Inconsistent Room Geometry
Root cause: No room topology prior. Object models generate in unit cube; rooms need planar walls with right angles. Fix in InteriorFusion: Explicit room layout estimation (SpatialLM) constrains wall/floor/ceiling to Manhattan-world planes.
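A minimal sketch of the Manhattan-world constraint, assuming the room's dominant horizontal axis has already been estimated: each wall normal is snapped to the nearest of the four axis-aligned directions. The function name `snap_wall_to_manhattan` and the `dominant_angle` parameter are illustrative and are not part of SpatialLM or InteriorFusion.

```python
import math


def snap_wall_to_manhattan(normal_xy, dominant_angle=0.0):
    """Snap a wall's horizontal normal to the nearest Manhattan direction.

    normal_xy: (nx, ny) horizontal components of the estimated wall normal.
    dominant_angle: rotation of the room's main axis in radians (assumed known).
    Returns a unit vector aligned with one of the four room axes.
    """
    # Express the normal's heading relative to the room's main axis.
    angle = math.atan2(normal_xy[1], normal_xy[0]) - dominant_angle
    # Round to the nearest multiple of 90 degrees.
    quarter_turns = round(angle / (math.pi / 2))
    snapped = quarter_turns * (math.pi / 2) + dominant_angle
    return (math.cos(snapped), math.sin(snapped))
```

A noisy normal such as `(0.98, 0.17)` snaps to the +x wall direction, so adjacent walls meet at right angles after snapping.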
2. Furniture Floating
Root cause: No gravity/physics prior. Objects generated independently with no floor contact constraint. Fix: Collision detection + physics relaxation in scene assembly phase. Floor plane from depth estimation anchors all objects.
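The floor-contact and collision-detection parts of this fix can be sketched with axis-aligned boxes. This is an illustration under simplifying assumptions (z is up, every object stands on the floor, boxes are axis-aligned); it does not attempt the physics relaxation step, and the names `Box`, `drop_to_floor`, and `overlaps` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned bounding box in metres; z is assumed to point up."""
    name: str
    min_xyz: tuple
    max_xyz: tuple


def drop_to_floor(box, floor_z=0.0):
    """Translate the box vertically so its bottom face rests on the floor plane."""
    dz = floor_z - box.min_xyz[2]
    lo = (box.min_xyz[0], box.min_xyz[1], box.min_xyz[2] + dz)
    hi = (box.max_xyz[0], box.max_xyz[1], box.max_xyz[2] + dz)
    return Box(box.name, lo, hi)


def overlaps(a, b):
    """True if the two boxes intersect with positive volume on all three axes."""
    return all(a.min_xyz[i] < b.max_xyz[i] and b.min_xyz[i] < a.max_xyz[i]
               for i in range(3))
```

Objects that rest on other furniture (a lamp on a table) would need a support surface other than the floor, which this sketch does not model.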
3. Inaccurate Scaling
Root cause: Object-centric models normalize to unit cube. A chair and a sofa both fit in [-1, 1]³. Fix: Metric depth estimation (Depth Anything V2 metric indoor) provides real-world scale in meters. Furniture dimensions matched against a prior database.
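One way to picture the rescaling step: take the object's width in normalized coordinates, take the width observed from metric depth, clamp the observation to a plausible range from the prior database, and derive a uniform scale factor. The sketch below assumes a per-category width prior; the dictionary values are placeholders chosen for illustration, not figures from this report, and `metric_scale` is a hypothetical helper.

```python
# Placeholder width ranges in metres, for illustration only.
FURNITURE_WIDTH_PRIOR_M = {
    "chair": (0.4, 0.7),
    "sofa": (1.5, 2.8),
}


def metric_scale(category, normalized_width, observed_width_m):
    """Return the uniform scale that maps a normalized mesh to metres.

    normalized_width: object width in the generator's normalized space.
    observed_width_m: width measured from metric depth, in metres.
    The observation is clamped to the category prior before scaling.
    """
    lo, hi = FURNITURE_WIDTH_PRIOR_M[category]
    width_m = min(max(observed_width_m, lo), hi)
    return width_m / normalized_width
```

Clamping keeps a bad depth estimate from producing a three-metre chair; a production system would more likely flag such outliers than silently clamp them.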
4. Wall/Floor Topology Issues
Root cause: No distinction between room shell and furniture. Models try to generate everything as one mesh. Fix: Separate room shell generation (planar meshes) from per-object generation. Room shell voxels flagged separately in SLAT-Interior.
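The report does not specify the SLAT-Interior data layout, so the following is only a guess at what "flagged separately" could look like: each sparse voxel carries a room-shell flag and, for furniture, an object id. `Voxel` and `split_shell_and_objects` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Voxel:
    """One active cell of a sparse voxel grid (hypothetical layout)."""
    ijk: tuple              # integer grid coordinates
    is_room_shell: bool     # True for wall, floor, or ceiling voxels
    object_id: int = -1     # furniture instance id; -1 for shell voxels


def split_shell_and_objects(voxels):
    """Separate shell voxels from furniture voxels grouped by object id."""
    shell = [v for v in voxels if v.is_room_shell]
    objects = {}
    for v in voxels:
        if not v.is_room_shell:
            objects.setdefault(v.object_id, []).append(v)
    return shell, objects
```

With this split, the shell can be meshed as planes while each object group goes through per-object generation.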
5. Poor Spatial Relationships
Root cause: Independent object generation. No knowledge that "lamp goes on table" or "sofa faces TV". Fix: Scene graph generation + learned layout prior from 3D-FRONT. Spatial relations encoded as edge features in scene graph.
6. Weak Depth Consistency
Root cause: Single-view depth estimators produce inconsistent depth across object boundaries. Fix: Multi-view depth fusion + cross-view depth-normal consistency loss. Depth-conditioned generation at every stage.
7. Multi-Object Scene Collapse
Root cause: When multiple objects appear in one image, models merge them into a single blob. Fix: Semantic segmentation → per-object isolation → independent generation → scene assembly.
8. Texture Bleeding
Root cause: Multi-view texture projection without occlusion handling. Wall texture bleeds onto furniture. Fix: Visibility-aware texture baking with depth-buffer occlusion testing. Per-object UV atlases.
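Depth-buffer occlusion testing can be illustrated for a single texel: a view contributes colour only if the surface point's depth in that view matches the depth buffer within a tolerance. This is a simplified sketch, assuming each view supplies the point's own depth and the depth-buffer value at its projected pixel; `bake_texel` and the `eps` tolerance are illustrative.

```python
def bake_texel(samples, eps=0.01):
    """Average the colours of one texel over the views that actually see it.

    samples: list of (color, point_depth, buffer_depth), one entry per view.
      color        - (r, g, b) sampled from the view's image
      point_depth  - depth of the surface point in that view, in metres
      buffer_depth - nearest depth rendered at the same pixel, in metres
    Returns the mean colour of unoccluded views, or None if no view sees it.
    """
    # A point is visible when nothing is rendered in front of it.
    visible = [color for color, d, z in samples if d <= z + eps]
    if not visible:
        return None
    return tuple(sum(channel) / len(visible) for channel in zip(*visible))
```

In the second sample of a call like `bake_texel([((200, 180, 160), 2.0, 2.0), ((90, 90, 90), 3.5, 1.2)])`, something sits 1.2 m from the camera in front of a point at 3.5 m, so that view's colour is discarded instead of bleeding onto the surface.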
9. Incomplete Room Reconstruction
Root cause: Occluded regions (behind sofa, under table) are hallucinated incorrectly. Fix: Inpainting diffusion for occluded regions, conditioned on detected room layout. Ceiling/floor inpainting from detected planes.
10. Inability to Edit Generated Rooms
Root cause: Single output mesh. Can't move sofa without regenerating everything. Fix: Scene graph representation. Each object is a separate node. Objects generated independently, assembled via scene graph. Move sofa = update scene graph node position.
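The editing model described here can be sketched as a small scene graph in which moving an object touches only that node's transform. The class and field names below are hypothetical and stand in for whatever representation InteriorFusion ends up using.

```python
from dataclasses import dataclass, field


@dataclass
class SceneNode:
    """One independently generated object."""
    name: str
    position: tuple   # (x, y, z) in metres
    mesh_id: str      # handle to the object's own mesh; never regenerated on move


@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (subject, relation, object) triples

    def add(self, node):
        self.nodes[node.name] = node

    def relate(self, subject, relation, obj):
        self.edges.append((subject, relation, obj))

    def move(self, name, new_position):
        """Editing a room is a metadata update, not a regeneration."""
        self.nodes[name].position = new_position


graph = SceneGraph()
graph.add(SceneNode("sofa", (1.0, 2.0, 0.0), "mesh_sofa"))
graph.add(SceneNode("tv", (1.0, 5.0, 0.0), "mesh_tv"))
graph.relate("sofa", "faces", "tv")
graph.move("sofa", (2.5, 2.0, 0.0))
```

After the move, the sofa's mesh handle and the TV node are untouched; only the sofa's position changed.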
11. Lack of Semantic Room Understanding
Root cause: No training on room types. Model doesn't know "kitchen needs stove, bedroom needs bed". Fix: Room type classifier trained on 3D-FRONT room labels. Style-conditioned generation (modern, Scandinavian, luxury, Indian, commercial).
Bottleneck Analysis
| Bottleneck | Impact | Solution in InteriorFusion |
|---|---|---|
| Latent representation | Object-only latents can't encode rooms | SLAT-Interior: sparse voxels with room-shell vs object flags |
| Scene encoding | No scene-level conditioning | Multi-encoder: image + depth + layout + semantic tokens |
| Geometry priors | No Manhattan world / planar constraints | Room shell generation enforces planar walls/floor/ceiling |
| Rendering pipeline | Object-only rendering (sphere cameras) | Indoor camera distribution (room-centered, limited elevation) |
| Training datasets | Only object datasets (Objaverse) | 3D-FRONT + Structured3D + InteriorNet + ScanNet |
| Sparse-view reconstruction | 150 views per object; rooms need more | Seed-guided 2D Gaussian splatting for room-scale |
| Scene graph modeling | No relationship modeling | SpatialLM scene scripts + learned layout prior |
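To make the "indoor camera distribution" row concrete, the sketch below samples camera poses near the room centre at roughly eye height with a limited pitch, in contrast to the sphere-of-cameras setup used for single objects. All numeric defaults (pitch limit, eye-height range, wall margin) are assumptions for illustration, and `sample_indoor_camera` is a hypothetical helper.

```python
import math
import random


def sample_indoor_camera(room_min, room_max, rng, max_pitch_deg=20.0,
                         eye_height=(1.2, 1.8), margin=0.25):
    """Sample one camera pose inside an axis-aligned room (metres, z up).

    margin is the fraction of each horizontal extent kept free next to the
    walls, which biases positions toward the room centre.
    """
    span_x = room_max[0] - room_min[0]
    span_y = room_max[1] - room_min[1]
    x = rng.uniform(room_min[0] + margin * span_x, room_max[0] - margin * span_x)
    y = rng.uniform(room_min[1] + margin * span_y, room_max[1] - margin * span_y)
    z = room_min[2] + rng.uniform(*eye_height)
    yaw = rng.uniform(0.0, 2.0 * math.pi)            # look in any horizontal direction
    pitch = math.radians(rng.uniform(-max_pitch_deg, max_pitch_deg))  # limited elevation
    return {"position": (x, y, z), "yaw": yaw, "pitch": pitch}
```

Passing a seeded `random.Random` makes the rendered training views reproducible.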
Key Papers & arXiv IDs
| Paper | arXiv ID | Key Contribution |
|---|---|---|
| TRELLIS v1 | 2412.01506 | Structured latent (SLAT) for 3D generation |
| TRELLIS.2 | 2512.14692 | O-Voxel with PBR materials, 16× compression |
| TRELLISWorld | 2510.23880 | Tiled diffusion for scene generation |
| Hunyuan3D-2.0 | 2501.12202 | Shape+texture two-stage pipeline |
| Hunyuan3D-2.1 | 2506.15442 | Full training code release |
| Hunyuan3D-2.5 | 2506.16504 | LATTICE 10B model |
| HunyuanWorld | 2507.21809 | Panoramic world proxies |
| SF3D | 2408.00653 | Sub-second mesh + PBR |
| InstantMesh | 2404.07191 | Best open-source mesh quality |
| CRM | 2403.05034 | Best geometry fidelity (CD 0.0094) |
| TripoSR | 2403.02151 | Fastest baseline (0.5s) |
| LGM | 2402.05054 | Gaussian splatting output |
| Era3D | 2405.11616 | High-res multi-view (512²) |
| Wonder3D | 2310.15008 | Cross-domain diffusion |
| SyncDreamer | 2309.03453 | Synchronized multi-view |
| MVDream | 2308.16512 | Multi-view diffusion |
| 2DGS-Room | 2412.03428 | Indoor GS reconstruction |
| Pano2Room | 2408.11413 | Single panorama to 3DGS |
| SpatialLM | 2506.07491 | LLM for indoor scene understanding |
| RoomFormer | CVPR 2023 | Floorplan from point clouds |
| EchoScene | 2405.00915 | Scene graph → 3D indoor |
| CHOrD | 2503.11958 | Collision-free house-scale scenes |
| Direct3D | 2405.14832 | Triplane VAE + DiT |
| Direct3D-S2 | 2505.17412 | Sparse SDF VAE, 1024³ on 8 GPUs |
| CLAY | 2406.13897 | 1.5B param multi-condition model |
| RL3DEdit | 2603.03143 | RL (GRPO) for 3D editing |
| AR3D-R1 | (recent) | RL-enhanced text-to-3D |
| Grendel-GS | 2406.18533 | Distributed 3DGS training |
| TriplaneTurbo | 2503.21694 | Progressive rendering distillation |
| Depth Anything V2 | 2406.09414 | SOTA monocular depth |
Dataset Rankings for Interior 3D
Tier 1 (Essential)
| Rank | Dataset | Size | Key Strength | HF Hub |
|---|---|---|---|---|
| 1 | 3D-FRONT (MIDI-3D) | 17K rooms | End-to-end room scenes with furniture | huanngzh/3D-Front |
| 2 | Structured3D | 21K rooms | Best structured 3D annotations (planes, lines, junctions) | Gen3DF/Structured3D |
| 3 | ScanNet++ | 1.6K scenes | Real-world validation, dense annotations | marvex/scannet-dataset |
Tier 2 (Pre-training & Scale)
| Rank | Dataset | Size | Key Strength |
|---|---|---|---|
| 4 | InteriorNet | 1.7M layouts | Massive scale, multi-sensor |
| 5 | HM3D | 1K scenes | Largest real-world dataset |
| 6 | Hypersim | 461 scenes | High photorealism, material decomposition |
| 7 | Replica | 18 scenes | HDR textures, highest quality |
Tier 3 (Assets & Objects)
| Rank | Dataset | Size | Key Strength | HF Hub |
|---|---|---|---|---|
| 8 | Objaverse-XL | 10M objects | Largest 3D object repo | allenai/objaverse-xl |
| 9 | OmniObject3D | 6K objects | High-quality real scans | N/A |
| 10 | 3D-FUTURE | 10K furniture | Professional furniture models | N/A |
Tier 4 (Auxiliary)
| Dataset | Purpose |
|---|---|
| SceneVerse | Language grounding |
| ProcTHOR | Procedural augmentation |
| ARKitScenes | Mobile capture |
| 3RScan | Change detection |
| MultiScan | Articulated furniture |
| Infinigen | Procedural generation |
| MVImgNet | Object multi-view |
| GSO | Evaluation benchmark |
Training Recipe Summary
Stage 1: VAE (1 week, 8×A100)
- Dataset: 3D-FRONT + Structured3D (synthetic rooms)
- Multi-resolution: 256³ → 512³ → 1024³ curriculum
- Optimizer: AdamW, lr 1e-4, weight decay 0.01
- Loss: MSE reconstruction + KL (λ=1e-3) + depth L1 + normal cosine
- Batch: 8 per GPU, effective 64
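The Stage 1 objective can be written out as a plain function to show how the four terms combine. Only the KL weight (1e-3) comes from the recipe above; the unit weights on the depth and normal terms, the assumption that normals are unit length, and the name `vae_loss` are illustrative. A real implementation would operate on tensors rather than Python lists.

```python
import math


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def vae_loss(recon, target, mu, logvar, depth_pred, depth_gt,
             normals_pred, normals_gt, kl_weight=1e-3,
             depth_weight=1.0, normal_weight=1.0):
    """MSE reconstruction + weighted KL + depth L1 + normal cosine distance."""
    mse = mean((r - t) ** 2 for r, t in zip(recon, target))
    # KL divergence between N(mu, exp(logvar)) and a standard normal.
    kl = -0.5 * mean(1 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))
    depth_l1 = mean(abs(p - g) for p, g in zip(depth_pred, depth_gt))
    # Cosine distance: 1 - dot product of unit normals.
    normal_cos = mean(1 - sum(a * b for a, b in zip(n, g))
                      for n, g in zip(normals_pred, normals_gt))
    return (mse + kl_weight * kl
            + depth_weight * depth_l1 + normal_weight * normal_cos)
```

With a perfect reconstruction and a latent that already matches the standard normal, every term vanishes and the loss is zero.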
Stage 2: Structure DiT (1 week, 32×A100)
- Rectified flow matching
- Conditioning: DINOv3-L image features + depth + layout tokens
- Resolution curriculum: 256³ → 512³ → 1024³
- Batch: 8 per GPU, effective 256
- Optimizer: AdamW, lr 1e-4 → 2e-5 (progressive)
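For readers unfamiliar with rectified flow matching, the training target is simple: interpolate linearly between a noise sample and a data sample, and regress the constant velocity along that line. The sketch below assumes the convention that `t = 0` is noise and `t = 1` is data (some codebases use the reverse), and `rectified_flow_pair` is an illustrative helper, not code from any of the cited systems.

```python
def rectified_flow_pair(x0, x1, t):
    """Build one training example for rectified flow matching.

    x0: noise sample, x1: data sample (equal-length lists), t in [0, 1].
    Returns (x_t, v) where x_t lies on the straight line from x0 to x1
    and v = x1 - x0 is the velocity the network is trained to predict.
    """
    x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    v = [b - a for a, b in zip(x0, x1)]
    return x_t, v


def flow_matching_loss(v_pred, v_target):
    """Mean squared error between predicted and target velocity."""
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_target)) / len(v_target)
```

Because the target path is a straight line, a model that predicts the velocity well can be sampled with few integration steps.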
Stage 3: Material DiT (1 week, 16×A100)
- Conditioned on generated geometry + input image
- PBR material prediction
- Batch: 16 per GPU, effective 256
- Loss: L1 on albedo + L1 on metallic/roughness + LPIPS on rendered appearance
Stage 4: Real-world Fine-tuning (3 days, 8×A100)
- LoRA rank 32 on DiT attention layers
- Dataset: ScanNet + HM3D real photos
- RL fine-tuning: GRPO with VGGT geometric rewards
- Domain adaptation from synthetic → real
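The core of GRPO is that advantages are computed relative to a group of samples generated for the same input, with no learned value function. A minimal sketch of that normalization follows; the reward values would come from the geometric reward model, which is not specified here, and `group_relative_advantages` is an illustrative name.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same input.

    Each advantage is (reward - group mean) / (group std + eps), so samples
    better than their siblings get positive advantages and worse ones negative.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

If every sample in a group earns the same reward, all advantages are zero and that group contributes no gradient signal.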
Total Cost Estimate: ~$60K (about 3.5 weeks of staged training on 8 to 32 A100s, roughly 10K A100-hours)
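As a sanity check on the estimate, the stage durations and GPU counts listed above can be summed into GPU-hours. The per-hour rate printed at the end is simply what the ~$60K figure implies; it is derived here, not quoted from a price list.

```python
# (stage, days, number of A100 GPUs) taken from the four stages above.
STAGES = [
    ("VAE", 7, 8),
    ("Structure DiT", 7, 32),
    ("Material DiT", 7, 16),
    ("Real-world fine-tuning", 3, 8),
]

total_gpu_hours = sum(days * 24 * gpus for _, days, gpus in STAGES)
total_days = sum(days for _, days, _ in STAGES)
implied_rate = 60_000 / total_gpu_hours  # USD per A100-hour implied by ~$60K

print(f"{total_days} days, {total_gpu_hours} A100-hours, "
      f"${implied_rate:.2f} per A100-hour")
```

The stages add up to 24 days and 9,984 A100-hours, so the budget corresponds to roughly $6 per A100-hour.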
Novel Contributions of InteriorFusion
- SLAT-Interior: First structured latent representation designed for indoor scenes with room-shell vs object separation
- Scene-aware generation pipeline: First end-to-end pipeline from single image to editable 3D interior
- Metric-scale consistency: Leverages metric depth for real-world furniture scaling
- Hybrid output: Simultaneous mesh + Gaussian splatting + PBR materials
- Editable scene graph: Objects are independent, movable, replaceable nodes
- Style-conditioned: Supports modern, Scandinavian, luxury, Indian, commercial interiors
- PBR material generation: Native metallic/roughness/normal output (not just baked textures)
- Training-free scene assembly: Uses SpatialLM + learned layout prior without scene-level diffusion training
Business Moat Analysis
| Moat | InteriorFusion | Competitors |
|---|---|---|
| Dataset moat | 3D-FRONT + Structured3D rooms (interior-specific) | Generic object datasets |
| Architecture moat | Scene-aware SLAT + scene graph | Object-only representations |
| Integration moat | Blender/UE/Unity plugins + ComfyUI nodes | Mostly web/API only |
| Speed moat | 8s on A100 | 0.5s (TripoSR) but no interiors; 15-30s for quality |
| Quality moat | PBR + editable + scene-aware | Single mesh blob |
| Open-source moat | MIT license, full code | Mixed licenses (some proprietary) |