InteriorFusion: Research Report & Literature Review

Executive Summary

After analyzing 50+ papers, 20+ repositories, and 15+ datasets, we found no existing open-source system that solves single-image-to-3D-interior at production quality. Most current SOTA generators are object-centric, and the scene-level systems surveyed (2DGS-Room, Pano2Room, SpatialLM) cover only room reconstruction, panorama input, or layout prediction. InteriorFusion bridges this gap through a scene-aware hybrid architecture.


SOTA Comparison Table

| System | Geometry Quality | Texture Quality | Inference Speed | VRAM Usage | Multi-View Consistency | Scene Generation | Mesh Quality | CAD Compatible | Controllable | Training Cost | Fine-Tuning Difficulty | Commercial Usable |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TRELLIS | ⭐⭐⭐⭐ | ⭐⭐⭐ | 15s | 24GB | ⭐⭐⭐⭐ | ❌ (object-only) | ⭐⭐⭐⭐ | ⚠️ (needs export) | ⭐⭐⭐ | $50K (64×A100) | Medium | ✅ MIT |
| TRELLIS.2 | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | 12s | 32GB | ⭐⭐⭐⭐⭐ | ❌ (object-only) | ⭐⭐⭐⭐⭐ | ✅ Native PBR | ⭐⭐⭐⭐ | $100K (32×H100) | Hard | ✅ MIT |
| Hunyuan3D-2 | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | 25s | 24GB | ⭐⭐⭐⭐ | ❌ (object-only) | ⭐⭐⭐⭐ | ✅ | ⭐⭐⭐ | Unknown | Hard | ⚠️ (Tencent license) |
| Hunyuan3D-2.5 | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | 30s | 48GB | ⭐⭐⭐⭐⭐ | ❌ (object-only) | ⭐⭐⭐⭐⭐ | ✅ | ⭐⭐⭐⭐ | Unknown | Hard | ⚠️ |
| TripoSR | ⭐⭐⭐ | ⭐⭐⭐ | 0.5s | 8GB | ⭐⭐⭐ | ❌ | ⭐⭐⭐ | ⚠️ | ⭐⭐ | $5K (8×A100) | Easy | ✅ MIT |
| SF3D | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | 0.5s | 10GB | ⭐⭐⭐⭐ | ❌ | ⭐⭐⭐⭐ | ✅ PBR | ⭐⭐⭐ | $5K | Medium | ✅ MIT |
| InstantMesh | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | 10s | 16GB | ⭐⭐⭐⭐⭐ | ❌ | ⭐⭐⭐⭐⭐ | ⚠️ | ⭐⭐⭐ | $20K | Medium | ✅ |
| CRM | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | 4s | 16GB | ⭐⭐⭐⭐ | ❌ | ⭐⭐⭐⭐⭐ | ⚠️ | ⭐⭐⭐ | $8K (8×A800) | Medium | ✅ |
| LGM | ⭐⭐⭐ | ⭐⭐⭐⭐ | 5s | 24GB | ⭐⭐⭐⭐ | ❌ | ⭐⭐⭐ (Gaussian) | ❌ | ⭐⭐ | $30K (32×A100) | Medium | ✅ |
| Era3D | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | 4min | 24GB | ⭐⭐⭐⭐⭐ | ❌ | ⭐⭐⭐⭐ | ⚠️ | ⭐⭐⭐ | $15K (16×H800) | Hard | ✅ |
| Wonder3D | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | 2min | 16GB | ⭐⭐⭐⭐⭐ | ❌ | ⭐⭐⭐⭐ | ⚠️ | ⭐⭐⭐ | $10K | Medium | ✅ |
| SyncDreamer | ⭐⭐⭐ | ⭐⭐⭐⭐ | 30s | 16GB | ⭐⭐⭐⭐⭐ | ❌ | ⭐⭐⭐ | ❌ | ⭐⭐ | $8K | Easy | ✅ |
| MVDream | ⭐⭐ | ⭐⭐⭐ | 20s | 16GB | ⭐⭐⭐⭐ | ❌ | ⭐⭐ | ❌ | ⭐⭐ | $10K | Medium | ✅ |
| 2DGS-Room | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ~30s | 24GB | ⭐⭐⭐ | ✅ (rooms!) | ⭐⭐⭐ | ❌ | ⭐⭐ | $5K | Hard | ✅ |
| Pano2Room | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ~2min | 16GB | ⭐⭐⭐⭐ | ✅ (panoramas) | ⭐⭐⭐ | ❌ | ⭐⭐ | $3K | Medium | ✅ |
| SpatialLM | N/A | N/A | 1s | 8GB | N/A | ✅ (layouts!) | N/A | N/A | ⭐⭐⭐⭐⭐ | $20K | Easy | ✅ Apache 2.0 |
| InteriorFusion (target) | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | 8s | 16GB | ⭐⭐⭐⭐⭐ | ✅✅✅ | ⭐⭐⭐⭐⭐ | ✅✅✅ | ⭐⭐⭐⭐⭐ | $60K | Medium | ✅ MIT |

Why Current Models Fail for Interiors

1. Inconsistent Room Geometry

Root cause: No room topology prior. Object models generate inside a unit cube; rooms need planar walls with right angles. Fix: Explicit room layout estimation (SpatialLM) constrains walls, floor, and ceiling to Manhattan-world planes.
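One way to picture the Manhattan-world constraint is to snap each estimated wall normal to the nearest axis of a room-aligned frame. The sketch below is illustrative only: `snap_to_manhattan` is a hypothetical helper (not part of SpatialLM or InteriorFusion), and it assumes the normals are already expressed in a frame aligned with the room.

```python
import math

# Canonical axes of the (assumed) room-aligned frame.
AXES = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]

def snap_to_manhattan(normal):
    """Return the signed axis direction closest to `normal`."""
    length = math.sqrt(sum(c * c for c in normal))
    if length == 0.0:
        raise ValueError("zero-length normal")
    unit = [c / length for c in normal]
    # The closest axis is the one with the largest absolute component.
    best = max(range(3), key=lambda i: abs(unit[i]))
    sign = 1.0 if unit[best] >= 0 else -1.0
    return tuple(sign * a for a in AXES[best])

noisy_wall_normal = (0.97, 0.05, -0.12)  # slightly tilted wall estimate
snapped = snap_to_manhattan(noisy_wall_normal)
```

In a real pipeline the room frame itself would first have to be estimated, which this sketch does not attempt.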

2. Furniture Floating

Root cause: No gravity/physics prior. Objects generated independently with no floor contact constraint. Fix: Collision detection + physics relaxation in scene assembly phase. Floor plane from depth estimation anchors all objects.
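The floor-anchoring step can be sketched as a vertical translation that puts each object's lowest vertex on the floor plane. This is a minimal illustration under stated assumptions: a z-up coordinate system, a horizontal floor at a known height, and a hypothetical helper name `drop_to_floor`. Collision detection and physics relaxation are not shown.

```python
def drop_to_floor(vertices, floor_height=0.0):
    """Translate vertices along z so the lowest one touches the floor.

    `vertices` is a list of (x, y, z) tuples in metres (z-up is assumed).
    """
    lowest = min(v[2] for v in vertices)
    offset = floor_height - lowest
    return [(x, y, z + offset) for x, y, z in vertices]

# A chair generated 30 cm above the floor.
floating_chair = [(0.0, 0.0, 0.30), (0.5, 0.0, 0.30), (0.5, 0.5, 1.20)]
grounded_chair = drop_to_floor(floating_chair)
```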

3. Inaccurate Scaling

Root cause: Object-centric models normalize to unit cube. A chair and a sofa both fit in [−1, 1]³. Fix: Metric depth estimation (Depth Anything V2 metric indoor) provides real-world scale in meters. Furniture dimensions matched against a prior database.
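As a rough illustration of metric rescaling, the sketch below computes the uniform scale that maps a unit-cube mesh to a target real-world height. The function names and the 0.9 m chair height are assumptions for the example; how the target height is derived from metric depth or a furniture prior is outside this sketch.

```python
def metric_scale_factor(unit_vertices, target_height_m):
    """Uniform scale that makes the mesh `target_height_m` metres tall (z-up)."""
    zs = [v[2] for v in unit_vertices]
    extent = max(zs) - min(zs)
    if extent <= 0.0:
        raise ValueError("mesh has no vertical extent")
    return target_height_m / extent

def apply_scale(vertices, scale):
    """Scale every vertex about the origin."""
    return [(x * scale, y * scale, z * scale) for x, y, z in vertices]

# A chair normalized to [-1, 1] along z, with an assumed real height of 0.9 m.
unit_chair = [(-0.5, -0.5, -1.0), (0.5, 0.5, 1.0)]
scale = metric_scale_factor(unit_chair, target_height_m=0.9)
metric_chair = apply_scale(unit_chair, scale)
```

A sofa normalized to the same cube would get a different factor, which is what restores the relative sizes lost by unit-cube normalization.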

4. Wall/Floor Topology Issues

Root cause: No distinction between room shell and furniture. Models try to generate everything as one mesh. Fix: Separate room shell generation (planar meshes) from per-object generation. Room shell voxels flagged separately in SLAT-Interior.

5. Poor Spatial Relationships

Root cause: Independent object generation. No knowledge that "lamp goes on table" or "sofa faces TV". Fix: Scene graph generation + learned layout prior from 3D-FRONT. Spatial relations encoded as edge features in scene graph.

6. Weak Depth Consistency

Root cause: Single-view depth estimators produce inconsistent depth across object boundaries. Fix: Multi-view depth fusion + cross-view depth-normal consistency loss. Depth-conditioned generation at every stage.

7. Multi-Object Scene Collapse

Root cause: When multiple objects appear in one image, models merge them into a single blob. Fix: Semantic segmentation → per-object isolation → independent generation → scene assembly.
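The per-object isolation step can be sketched as cropping the image to a mask's bounding box and blanking everything outside the mask. The example uses nested lists in place of real image arrays, and `mask_bbox` and `isolate_object` are hypothetical names; the segmentation model that would produce the mask is assumed, not shown.

```python
def mask_bbox(mask):
    """Return (top, bottom, left, right) of the non-zero region, inclusive."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    if not rows:
        raise ValueError("empty mask")
    return rows[0], rows[-1], cols[0], cols[-1]

def isolate_object(image, mask, background=0):
    """Crop `image` to the mask's bounding box and blank unmasked pixels."""
    top, bottom, left, right = mask_bbox(mask)
    return [
        [image[r][c] if mask[r][c] else background
         for c in range(left, right + 1)]
        for r in range(top, bottom + 1)
    ]

image = [[10, 11, 12, 13],
         [20, 21, 22, 23],
         [30, 31, 32, 33]]
sofa_mask = [[0, 0, 0, 0],
             [0, 1, 1, 0],
             [0, 1, 0, 0]]
sofa_crop = isolate_object(image, sofa_mask)
```

Each crop would then be passed to the object generator on its own, so neighbouring furniture cannot be fused into one mesh.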

8. Texture Bleeding

Root cause: Multi-view texture projection without occlusion handling. Wall texture bleeds onto furniture. Fix: Visibility-aware texture baking with depth-buffer occlusion testing. Per-object UV atlases.
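Depth-buffer occlusion testing can be illustrated with a single comparison: a surface point may receive colour from a pixel only if its depth matches the nearest depth recorded at that pixel. The helper name `is_visible`, the tolerance value, and the toy buffer are assumptions for the sketch.

```python
def is_visible(point_depth, depth_buffer, pixel, eps=0.01):
    """True if the point is the nearest surface seen at `pixel`.

    `depth_buffer[row][col]` holds the closest depth (metres) rendered
    at each pixel; `eps` absorbs small numerical differences.
    """
    row, col = pixel
    return point_depth <= depth_buffer[row][col] + eps

# Toy 2x2 buffer: the wall is at 2.0 m, a sofa at 1.2 m covers pixel (1, 0).
depth_buffer = [[2.0, 2.0],
                [1.2, 2.0]]

wall_gets_texture = is_visible(2.0, depth_buffer, (1, 0))
sofa_gets_texture = is_visible(1.2, depth_buffer, (1, 0))
```

Skipping the occluded wall point at pixel (1, 0) is what keeps sofa colours off the wall, and the same test applied from the wall's side keeps wall colours off the sofa.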

9. Incomplete Room Reconstruction

Root cause: Occluded regions (behind sofa, under table) are hallucinated incorrectly. Fix: Inpainting diffusion for occluded regions, conditioned on detected room layout. Ceiling/floor inpainting from detected planes.

10. Inability to Edit Generated Rooms

Root cause: Single output mesh. Can't move sofa without regenerating everything. Fix: Scene graph representation. Each object is a separate node. Objects generated independently, assembled via scene graph. Move sofa = update scene graph node position.
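The editing model described above can be sketched with a small scene graph in which moving an object only touches its own node. The class and field names here are hypothetical and do not reflect InteriorFusion's actual data structures.

```python
from dataclasses import dataclass, field

@dataclass
class SceneNode:
    name: str
    position: tuple  # (x, y, z) in metres
    mesh_id: str     # reference to the independently generated mesh

@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (subject, relation, object)

    def add(self, node):
        self.nodes[node.name] = node

    def relate(self, subject, relation, obj):
        self.edges.append((subject, relation, obj))

    def move(self, name, new_position):
        # Editing is a node update; no mesh is regenerated.
        self.nodes[name].position = new_position

graph = SceneGraph()
graph.add(SceneNode("sofa", (1.0, 2.0, 0.0), "mesh_sofa"))
graph.add(SceneNode("tv", (1.0, 5.0, 0.0), "mesh_tv"))
graph.relate("sofa", "faces", "tv")
graph.move("sofa", (2.5, 2.0, 0.0))
```

After the move, the TV node and both mesh references are unchanged, which is the property a single fused mesh cannot offer.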

11. Lack of Semantic Room Understanding

Root cause: No training on room types. Model doesn't know "kitchen needs stove, bedroom needs bed". Fix: Room type classifier trained on 3D-FRONT room labels. Style-conditioned generation (modern, Scandinavian, luxury, Indian, commercial).


Bottleneck Analysis

| Bottleneck | Impact | Solution in InteriorFusion |
|---|---|---|
| Latent representation | Object-only latents can't encode rooms | SLAT-Interior: sparse voxels with room-shell vs object flags |
| Scene encoding | No scene-level conditioning | Multi-encoder: image + depth + layout + semantic tokens |
| Geometry priors | No Manhattan world / planar constraints | Room shell generation enforces planar walls/floor/ceiling |
| Rendering pipeline | Object-only rendering (sphere cameras) | Indoor camera distribution (room-centered, limited elevation) |
| Training datasets | Only object datasets (Objaverse) | 3D-FRONT + Structured3D + InteriorNet + ScanNet |
| Sparse-view reconstruction | 150 views per object; rooms need more | Seed-guided 2D Gaussian splatting for room-scale |
| Scene graph modeling | No relationship modeling | SpatialLM scene scripts + learned layout prior |

Key Papers & arXiv IDs

| Paper | arXiv ID | Key Contribution |
|---|---|---|
| TRELLIS v1 | 2412.01506 | Structured latent (SLAT) for 3D generation |
| TRELLIS.2 | 2512.14692 | O-Voxel with PBR materials, 16× compression |
| TRELLISWorld | 2510.23880 | Tiled diffusion for scene generation |
| Hunyuan3D-2.0 | 2501.12202 | Shape+texture two-stage pipeline |
| Hunyuan3D-2.1 | 2506.15442 | Full training code release |
| Hunyuan3D-2.5 | 2506.16504 | LATTICE 10B model |
| HunyuanWorld | 2507.21809 | Panoramic world proxies |
| SF3D | 2408.00653 | Sub-second mesh + PBR |
| InstantMesh | 2404.07191 | Best open-source mesh quality |
| CRM | 2403.05034 | Best geometry fidelity (CD 0.0094) |
| TripoSR | 2403.02151 | Fastest baseline (0.5s) |
| LGM | 2402.05054 | Gaussian splatting output |
| Era3D | 2405.11616 | High-res multi-view (512²) |
| Wonder3D | 2310.15008 | Cross-domain diffusion |
| SyncDreamer | 2309.03453 | Synchronized multi-view |
| MVDream | 2308.16512 | Multi-view diffusion |
| 2DGS-Room | 2412.03428 | Indoor GS reconstruction |
| Pano2Room | 2408.11413 | Single panorama to 3DGS |
| SpatialLM | 2506.07491 | LLM for indoor scene understanding |
| RoomFormer | CVPR 2023 | Floorplan from point clouds |
| EchoScene | 2405.00915 | Scene graph → 3D indoor |
| CHOrD | 2503.11958 | Collision-free house-scale scenes |
| Direct3D | 2405.14832 | Triplane VAE + DiT |
| Direct3D-S2 | 2505.17412 | Sparse SDF VAE, 1024³ on 8 GPUs |
| CLAY | 2406.13897 | 1.5B param multi-condition model |
| RL3DEdit | 2603.03143 | RL (GRPO) for 3D editing |
| AR3D-R1 | (recent) | RL-enhanced text-to-3D |
| Grendel-GS | 2406.18533 | Distributed 3DGS training |
| TriplaneTurbo | 2503.21694 | Progressive rendering distillation |
| Depth Anything V2 | 2406.09414 | SOTA monocular depth |

Dataset Rankings for Interior 3D

Tier 1 (Essential)

| Rank | Dataset | Size | Key Strength | HF Hub |
|---|---|---|---|---|
| 1 | 3D-FRONT (MIDI-3D) | 17K rooms | End-to-end room scenes with furniture | huanngzh/3D-Front |
| 2 | Structured3D | 21K rooms | Best structured 3D annotations (planes, lines, junctions) | Gen3DF/Structured3D |
| 3 | ScanNet++ | 1.6K scenes | Real-world validation, dense annotations | marvex/scannet-dataset |

Tier 2 (Pre-training & Scale)

| Rank | Dataset | Size | Key Strength |
|---|---|---|---|
| 4 | InteriorNet | 1.7M layouts | Massive scale, multi-sensor |
| 5 | HM3D | 1K scenes | Largest real-world dataset |
| 6 | Hypersim | 461 scenes | High photorealism, material decomposition |
| 7 | Replica | 18 scenes | HDR textures, highest quality |

Tier 3 (Assets & Objects)

| Rank | Dataset | Size | Key Strength | HF Hub |
|---|---|---|---|---|
| 8 | Objaverse-XL | 10M objects | Largest 3D object repo | allenai/objaverse-xl |
| 9 | OmniObject3D | 6K objects | High-quality real scans | N/A |
| 10 | 3D-FUTURE | 10K furniture | Professional furniture models | N/A |

Tier 4 (Auxiliary)

| Dataset | Purpose |
|---|---|
| SceneVerse | Language grounding |
| ProcTHOR | Procedural augmentation |
| ARKitScenes | Mobile capture |
| 3RScan | Change detection |
| MultiScan | Articulated furniture |
| Infinigen | Procedural generation |
| MVImgNet | Object multi-view |
| GSO | Evaluation benchmark |

Training Recipe Summary

Stage 1: VAE (1 week, 8×A100)

  • Dataset: 3D-FRONT + Structured3D (synthetic rooms)
  • Multi-resolution: 256³ → 512³ → 1024³ curriculum
  • Optimizer: AdamW, lr 1e-4, weight decay 0.01
  • Loss: MSE reconstruction + KL (λ=1e-3) + depth L1 + normal cosine
  • Batch: 8 per GPU, effective 64
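The Stage 1 objective can be sketched as a weighted sum of the four listed terms. Only the KL weight (1e-3) comes from the recipe; the depth and normal weights of 1.0, the function names, and the flat-list inputs are assumptions for illustration, not the actual training code.

```python
import math

def mse(pred, target):
    """Mean squared error over flat lists of floats."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def l1(pred, target):
    """Mean absolute error over flat lists of floats."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, exp(logvar)) || N(0, 1)), averaged over latent dimensions."""
    return sum(0.5 * (m * m + math.exp(lv) - 1.0 - lv)
               for m, lv in zip(mu, logvar)) / len(mu)

def normal_cosine(pred, target):
    """Mean of (1 - cosine similarity) between paired 3D normals."""
    total = 0.0
    for p, t in zip(pred, target):
        dot = sum(a * b for a, b in zip(p, t))
        norms = (math.sqrt(sum(a * a for a in p))
                 * math.sqrt(sum(b * b for b in t)))
        total += 1.0 - dot / norms
    return total / len(pred)

def vae_loss(recon, target, mu, logvar, depth, depth_gt, normals, normals_gt,
             kl_weight=1e-3, depth_weight=1.0, normal_weight=1.0):
    # depth_weight and normal_weight are assumed; the recipe does not state them.
    return (mse(recon, target)
            + kl_weight * kl_to_standard_normal(mu, logvar)
            + depth_weight * l1(depth, depth_gt)
            + normal_weight * normal_cosine(normals, normals_gt))

loss = vae_loss(
    recon=[0.1, 0.4], target=[0.0, 0.5],
    mu=[0.0, 0.0], logvar=[0.0, 0.0],
    depth=[2.0, 3.1], depth_gt=[2.0, 3.0],
    normals=[(0.0, 0.0, 1.0)], normals_gt=[(0.0, 0.0, 1.0)],
)
```

The effective batch size in the recipe follows from per-GPU batch times GPU count: 8 × 8 = 64.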

Stage 2: Structure DiT (1 week, 32×A100)

  • Rectified flow matching
  • Conditioning: DINOv3-L image features + depth + layout tokens
  • Resolution curriculum: 256³ → 512³ → 1024³
  • Batch: 8 per GPU, effective 256
  • Optimizer: AdamW, lr 1e-4 → 2e-5 (progressive)
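For readers unfamiliar with rectified flow matching, the sketch below builds one training pair under a common convention: clean data at t = 0, Gaussian noise at t = 1, a straight-line interpolation between them, and a constant velocity target (noise minus data). The convention, the function names, and the toy latent are assumptions; the document does not specify which parameterization InteriorFusion uses.

```python
import random

def flow_matching_pair(x0, t, rng):
    """Return (x_t, target_velocity) for one latent vector.

    x_t = (1 - t) * x0 + t * eps, and the regression target is eps - x0.
    """
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [(1.0 - t) * a + t * e for a, e in zip(x0, eps)]
    target_v = [e - a for a, e in zip(x0, eps)]
    return x_t, target_v

def flow_matching_loss(pred_v, target_v):
    """Mean squared error between predicted and target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred_v, target_v)) / len(pred_v)

rng = random.Random(0)
x0 = [0.2, -0.5, 1.0]   # toy clean latent
t = rng.random()         # timestep sampled uniformly in [0, 1)
x_t, target_v = flow_matching_pair(x0, t, rng)

# Stepping back along the target velocity recovers the clean latent.
recovered = [xt - t * v for xt, v in zip(x_t, target_v)]
```

A network would be trained to predict `target_v` from `x_t`, `t`, and the conditioning tokens; the network itself is omitted here.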

Stage 3: Material DiT (1 week, 16×A100)

  • Conditioned on generated geometry + input image
  • PBR material prediction
  • Batch: 16 per GPU, effective 256
  • Loss: L1 on albedo + L1 on metallic/roughness + LPIPS on rendered appearance

Stage 4: Real-world Fine-tuning (3 days, 8×A100)

  • LoRA rank 32 on DiT attention layers
  • Dataset: ScanNet + HM3D real photos
  • RL fine-tuning: GRPO with VGGT geometric rewards
  • Domain adaptation from synthetic → real
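Two pieces of Stage 4 lend themselves to a small worked example: the group-relative advantage at the core of GRPO, and the number of trainable parameters a rank-32 LoRA adapter adds to one linear layer. The reward values, the use of population standard deviation, and the 1024-wide layer are assumptions for illustration; the reward model (VGGT) is not invoked.

```python
import statistics

def grpo_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std is an assumption here
    return [(r - mean) / (std + eps) for r in rewards]

def lora_param_count(d_in, d_out, rank=32):
    """Trainable parameters of a LoRA pair: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)

# Hypothetical geometric rewards for four generations of the same input image.
group_rewards = [0.2, 0.5, 0.8, 0.5]
advantages = grpo_advantages(group_rewards)

# Assumed 1024 x 1024 attention projection.
adapter_params = lora_param_count(1024, 1024)
full_params = 1024 * 1024
```

Under these assumptions the adapter trains 65,536 parameters against 1,048,576 in the full projection, which is why the fine-tuning stage fits in a 3-day budget on fewer GPUs.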

Total Cost Estimate: ~$60K (4 weeks on 32×A100)
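The headline figure can be cross-checked with simple GPU-hour arithmetic. The sketch below assumes 1 week = 7 days and 24-hour utilization. It compares the sum of the per-stage allocations with a flat 4-week reservation of 32 GPUs, and derives the per-GPU-hour rate that $60K would imply for the flat reservation. No cloud price is assumed.

```python
# (stage name, days, GPU count) taken from the recipe above.
STAGES = [
    ("Stage 1: VAE", 7, 8),
    ("Stage 2: Structure DiT", 7, 32),
    ("Stage 3: Material DiT", 7, 16),
    ("Stage 4: Real-world fine-tuning", 3, 8),
]

def gpu_hours(days, gpus):
    """GPU-hours for a run of `days` on `gpus` devices at full utilization."""
    return days * 24 * gpus

per_stage_total = sum(gpu_hours(days, gpus) for _, days, gpus in STAGES)
flat_reservation = gpu_hours(28, 32)            # 4 weeks on 32 GPUs
implied_rate = 60_000 / flat_reservation        # USD per GPU-hour

print(per_stage_total, flat_reservation, round(implied_rate, 2))
```

The per-stage plan uses 9,984 A100-hours, less than half of the 21,504 A100-hours in a flat 4-week reservation, so the ~$60K estimate reads as an upper bound that leaves room for reruns and ablations.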


Novel Contributions of InteriorFusion

  1. SLAT-Interior: First structured latent representation designed for indoor scenes with room-shell vs object separation
  2. Scene-aware generation pipeline: First end-to-end pipeline from single image to editable 3D interior
  3. Metric-scale consistency: Leverages metric depth for real-world furniture scaling
  4. Hybrid output: Simultaneous mesh + Gaussian splatting + PBR materials
  5. Editable scene graph: Objects are independent, movable, replaceable nodes
  6. Style-conditioned: Supports modern, Scandinavian, luxury, Indian, commercial interiors
  7. PBR material generation: Native metallic/roughness/normal output (not just baked textures)
  8. Training-free scene assembly: Uses SpatialLM + learned layout prior without scene-level diffusion training

Business Moat Analysis

| Moat | InteriorFusion | Competitors |
|---|---|---|
| Dataset moat | 3D-FRONT + Structured3D rooms (interior-specific) | Generic object datasets |
| Architecture moat | Scene-aware SLAT + scene graph | Object-only representations |
| Integration moat | Blender/UE/Unity plugins + ComfyUI nodes | Mostly web/API only |
| Speed moat | 8s on A100 | 0.5s (TripoSR) but no interiors; 15-30s for quality |
| Quality moat | PBR + editable + scene-aware | Single mesh blob |
| Open-source moat | MIT license, full code | Mixed licenses (some proprietary) |