stevee00/InteriorFusion


stevee00 committed 9 days ago

Commit

708fe64

verified ·

1 Parent(s): 8af6a60

Upload docs/RESEARCH_REPORT.md

Files changed (1)

docs/RESEARCH_REPORT.md +228 -0

docs/RESEARCH_REPORT.md ADDED

	@@ -0,0 +1,228 @@

+# InteriorFusion: Research Report & Literature Review
+## Executive Summary
+After analyzing 50+ papers, 20+ repositories, and 15+ datasets, we found **no existing open-source system that solves single-image-to-3D-interior at production quality**. Nearly all current SOTA generators are object-centric; the scene-level systems in the comparison below (2DGS-Room, Pano2Room, SpatialLM) handle only reconstruction, panoramas, or layouts. InteriorFusion bridges this gap through a scene-aware hybrid architecture.
+---
+## SOTA Comparison Table
+| System | Geometry Quality | Texture Quality | Inference Speed | VRAM Usage | Multi-View Consistency | Scene Generation | Mesh Quality | CAD Compatible | Controllable | Training Cost | Fine-Tuning Difficulty | Commercial Usable |
+|--------|-----------------|-----------------|-----------------|------------|----------------------|-----------------|--------------|---------------|-------------|--------------|----------------------|-------------------|
+| **TRELLIS** | ⭐⭐⭐⭐ | ⭐⭐⭐ | 15s | 24GB | ⭐⭐⭐⭐ | ❌ (object-only) | ⭐⭐⭐⭐ | ⚠️ (needs export) | ⭐⭐⭐ | $50K (64×A100) | Medium | ✅ MIT |
+| **TRELLIS.2** | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | 12s | 32GB | ⭐⭐⭐⭐⭐ | ❌ (object-only) | ⭐⭐⭐⭐⭐ | ✅ Native PBR | ⭐⭐⭐⭐ | $100K (32×H100) | Hard | ✅ MIT |
+| **Hunyuan3D-2** | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | 25s | 24GB | ⭐⭐⭐⭐ | ❌ (object-only) | ⭐⭐⭐⭐ | ✅ | ⭐⭐⭐ | Unknown | Hard | ⚠️ (Tencent license) |
+| **Hunyuan3D-2.5** | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | 30s | 48GB | ⭐⭐⭐⭐⭐ | ❌ (object-only) | ⭐⭐⭐⭐⭐ | ✅ | ⭐⭐⭐⭐ | Unknown | Hard | ⚠️ |
+| **TripoSR** | ⭐⭐⭐ | ⭐⭐⭐ | 0.5s | 8GB | ⭐⭐⭐ | ❌ | ⭐⭐⭐ | ⚠️ | ⭐⭐ | $5K (8×A100) | Easy | ✅ MIT |
+| **SF3D** | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | 0.5s | 10GB | ⭐⭐⭐⭐ | ❌ | ⭐⭐⭐⭐ | ✅ PBR | ⭐⭐⭐ | $5K | Medium | ✅ MIT |
+| **InstantMesh** | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | 10s | 16GB | ⭐⭐⭐⭐⭐ | ❌ | ⭐⭐⭐⭐⭐ | ⚠️ | ⭐⭐⭐ | $20K | Medium | ✅ |
+| **CRM** | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | 4s | 16GB | ⭐⭐⭐⭐ | ❌ | ⭐⭐⭐⭐⭐ | ⚠️ | ⭐⭐⭐ | $8K (8×A800) | Medium | ✅ |
+| **LGM** | ⭐⭐⭐ | ⭐⭐⭐⭐ | 5s | 24GB | ⭐⭐⭐⭐ | ❌ | ⭐⭐⭐ (Gaussian) | ❌ | ⭐⭐ | $30K (32×A100) | Medium | ✅ |
+| **Era3D** | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | 4min | 24GB | ⭐⭐⭐⭐⭐ | ❌ | ⭐⭐⭐⭐ | ⚠️ | ⭐⭐⭐ | $15K (16×H800) | Hard | ✅ |
+| **Wonder3D** | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | 2min | 16GB | ⭐⭐⭐⭐⭐ | ❌ | ⭐⭐⭐⭐ | ⚠️ | ⭐⭐⭐ | $10K | Medium | ✅ |
+| **SyncDreamer** | ⭐⭐⭐ | ⭐⭐⭐⭐ | 30s | 16GB | ⭐⭐⭐⭐⭐ | ❌ | ⭐⭐⭐ | ❌ | ⭐⭐ | $8K | Easy | ✅ |
+| **MVDream** | ⭐⭐ | ⭐⭐⭐ | 20s | 16GB | ⭐⭐⭐⭐ | ❌ | ⭐⭐ | ❌ | ⭐⭐ | $10K | Medium | ✅ |
+| **2DGS-Room** | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ~30s | 24GB | ⭐⭐⭐ | ✅ (rooms!) | ⭐⭐⭐ | ❌ | ⭐⭐ | $5K | Hard | ✅ |
+| **Pano2Room** | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ~2min | 16GB | ⭐⭐⭐⭐ | ✅ (panoramas) | ⭐⭐⭐ | ❌ | ⭐⭐ | $3K | Medium | ✅ |
+| **SpatialLM** | N/A | N/A | 1s | 8GB | N/A | ✅ (layouts!) | N/A | N/A | ⭐⭐⭐⭐⭐ | $20K | Easy | ✅ Apache 2.0 |
+| **InteriorFusion (target)** | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | **8s** | **16GB** | ⭐⭐⭐⭐⭐ | ✅✅✅ | ⭐⭐⭐⭐⭐ | ✅✅✅ | ⭐⭐⭐⭐⭐ | **$60K** | Medium | ✅ MIT |
+---
+## Why Current Models Fail for Interiors
+### 1. Inconsistent Room Geometry
+**Root cause**: No room topology prior. Object models generate inside a unit cube, whereas rooms need planar walls that meet at right angles.
+**Fix in InteriorFusion**: Explicit room layout estimation (SpatialLM) constrains wall/floor/ceiling to Manhattan-world planes.
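A minimal sketch of the Manhattan-world constraint, assuming plane normals have already been estimated and the room is roughly aligned with the coordinate axes. The function name `snap_to_manhattan` and the sample normals are illustrative and not part of SpatialLM's API.

```python
def snap_to_manhattan(normal):
    """Snap a 3D plane normal to the nearest signed coordinate axis."""
    # Pick the component with the largest magnitude as the dominant axis.
    axis = max(range(3), key=lambda i: abs(normal[i]))
    snapped = [0.0, 0.0, 0.0]
    snapped[axis] = 1.0 if normal[axis] >= 0 else -1.0
    return tuple(snapped)


# Hypothetical noisy estimates for one wall and the floor.
noisy_wall = (0.97, 0.05, -0.22)
noisy_floor = (0.02, -0.10, 0.99)

print(snap_to_manhattan(noisy_wall))
print(snap_to_manhattan(noisy_floor))
```

Snapping normals this way forces every wall, floor, and ceiling plane to be mutually parallel or perpendicular, which is the property the object-centric generators lack.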
+### 2. Furniture Floating
+**Root cause**: No gravity/physics prior. Objects generated independently with no floor contact constraint.
+**Fix**: Collision detection + physics relaxation in scene assembly phase. Floor plane from depth estimation anchors all objects.
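The floor-anchoring part of this fix can be sketched as a vertical translation of each object's bounding box. This is only an illustration under assumed names (`Box`, `drop_to_floor`) with z as the up axis; it does not cover collision detection or physics relaxation.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Vertical extent of one object's bounding box, in meters (z is up)."""
    name: str
    z_min: float
    z_max: float


def drop_to_floor(box, floor_z=0.0):
    """Translate the box vertically so its bottom rests on the floor plane."""
    offset = floor_z - box.z_min
    return Box(box.name, box.z_min + offset, box.z_max + offset)


# Hypothetical chair generated 12 cm above the estimated floor plane.
chair = Box("chair", z_min=0.12, z_max=1.02)
grounded = drop_to_floor(chair)
print(grounded)
```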
+### 3. Inaccurate Scaling
+**Root cause**: Object-centric models normalize to unit cube. A chair and a sofa both fit in [−1,1]³.
+**Fix**: Metric depth estimation (Depth Anything V2 metric indoor) provides real-world scale in meters. Furniture dimensions matched against a prior database.
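To illustrate how metric depth can restore scale, the sketch below uses the standard pinhole relation (width = pixel width × depth / focal length) to size two objects that would otherwise both span the same normalized cube. The pixel widths, depth, and focal length are made-up values, and the report does not state that InteriorFusion uses this exact formula.

```python
def metric_width(pixel_width, depth_m, focal_px):
    """Real-world width of a fronto-parallel object under a pinhole camera."""
    return pixel_width * depth_m / focal_px


def scale_factor(target_width_m, normalized_width=2.0):
    """Scale that maps an object spanning [-1, 1] (width 2) to its metric width."""
    return target_width_m / normalized_width


# Hypothetical detections at the same depth with an assumed 900 px focal length.
sofa_w = metric_width(pixel_width=600, depth_m=3.0, focal_px=900)
chair_w = metric_width(pixel_width=150, depth_m=3.0, focal_px=900)

print(sofa_w, scale_factor(sofa_w))
print(chair_w, scale_factor(chair_w))
```

The two objects end up with different scale factors even though both were generated in the same normalized cube.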
+### 4. Wall/Floor Topology Issues
+**Root cause**: No distinction between room shell and furniture. Models try to generate everything as one mesh.
+**Fix**: Separate room shell generation (planar meshes) from per-object generation. Room shell voxels flagged separately in SLAT-Interior.
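One way to picture the room-shell flag is a sparse voxel dictionary in which each occupied cell carries a label. The layout below is a toy assumption for illustration; the report does not specify how SLAT-Interior actually stores the flag.

```python
ROOM_SHELL = "room_shell"
OBJECT = "object"

# Sparse grid: only occupied cells are stored, keyed by integer (x, y, z).
voxels = {
    (0, 0, 0): ROOM_SHELL,  # floor cell
    (1, 0, 0): ROOM_SHELL,  # floor cell
    (0, 0, 1): OBJECT,      # furniture cell resting on the floor
    (0, 0, 2): OBJECT,
}


def split_by_flag(grid):
    """Return (shell_cells, object_cells) so each group can be meshed separately."""
    shell = sorted(c for c, flag in grid.items() if flag == ROOM_SHELL)
    objects = sorted(c for c, flag in grid.items() if flag == OBJECT)
    return shell, objects


shell_cells, object_cells = split_by_flag(voxels)
print(shell_cells)
print(object_cells)
```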
+### 5. Poor Spatial Relationships
+**Root cause**: Independent object generation. No knowledge that "lamp goes on table" or "sofa faces TV".
+**Fix**: Scene graph generation + learned layout prior from 3D-FRONT. Spatial relations encoded as edge features in scene graph.
+### 6. Weak Depth Consistency
+**Root cause**: Single-view depth estimators produce inconsistent depth across object boundaries.
+**Fix**: Multi-view depth fusion + cross-view depth-normal consistency loss. Depth-conditioned generation at every stage.
+### 7. Multi-Object Scene Collapse
+**Root cause**: When multiple objects appear in one image, models merge them into a single blob.
+**Fix**: Semantic segmentation → per-object isolation → independent generation → scene assembly.
+### 8. Texture Bleeding
+**Root cause**: Multi-view texture projection without occlusion handling. Wall texture bleeds onto furniture.
+**Fix**: Visibility-aware texture baking with depth-buffer occlusion testing. Per-object UV atlases.
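The depth-buffer test at the core of visibility-aware baking can be sketched as follows: a surface point only receives color from a view if its depth matches the nearest surface rendered at that pixel. The buffer values, tolerance `eps`, and function name are illustrative assumptions.

```python
def is_visible(point_depth, depth_buffer, u, v, eps=0.02):
    """True if the point at pixel (u, v) is the nearest surface, within eps meters."""
    return point_depth <= depth_buffer[v][u] + eps


# Toy 2x2 depth buffer in meters: pixel (u=0, v=1) sees furniture at 2.5 m,
# every other pixel sees the wall at 4.0 m.
depth_buffer = [
    [4.0, 4.0],
    [2.5, 4.0],
]

# A wall point at 4.0 m that projects behind the furniture must be skipped,
# otherwise it would be painted with the furniture's color.
print(is_visible(4.0, depth_buffer, u=0, v=1))
print(is_visible(2.5, depth_buffer, u=0, v=1))
```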
+### 9. Incomplete Room Reconstruction
+**Root cause**: Occluded regions (behind sofa, under table) are hallucinated incorrectly.
+**Fix**: Inpainting diffusion for occluded regions, conditioned on detected room layout. Ceiling/floor inpainting from detected planes.
+### 10. Inability to Edit Generated Rooms
+**Root cause**: The output is a single mesh, so moving a sofa means regenerating the entire scene.
+**Fix**: Scene graph representation. Each object is a separate node that is generated independently and assembled via the scene graph, so moving a sofa only requires updating that node's position.
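A minimal sketch of such an editable scene graph, using hypothetical `ObjectNode` and `SceneGraph` classes (the report does not define a concrete data structure). Moving an object touches one node and leaves every generated mesh untouched.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectNode:
    name: str
    position: tuple  # (x, y, z) in meters
    mesh_id: str     # handle to the independently generated mesh


@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (subject, relation, object)

    def add(self, node):
        self.nodes[node.name] = node

    def relate(self, subject, relation, obj):
        self.edges.append((subject, relation, obj))

    def move(self, name, new_position):
        """Edit a single node; no mesh is regenerated."""
        self.nodes[name].position = new_position


scene = SceneGraph()
scene.add(ObjectNode("sofa", (1.0, 2.0, 0.0), mesh_id="mesh_sofa"))
scene.add(ObjectNode("tv", (1.0, 5.0, 0.0), mesh_id="mesh_tv"))
scene.relate("sofa", "faces", "tv")

scene.move("sofa", (2.5, 2.0, 0.0))
print(scene.nodes["sofa"])
```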
+### 11. Lack of Semantic Room Understanding
+**Root cause**: No training on room types. Model doesn't know "kitchen needs stove, bedroom needs bed".
+**Fix**: Room type classifier trained on 3D-FRONT room labels. Style-conditioned generation (modern, Scandinavian, luxury, Indian, commercial).
+---
+## Bottleneck Analysis
+| Bottleneck | Impact | Solution in InteriorFusion |
+|-----------|--------|---------------------------|
+| **Latent representation** | Object-only latents can't encode rooms | SLAT-Interior: sparse voxels with room-shell vs object flags |
+| **Scene encoding** | No scene-level conditioning | Multi-encoder: image + depth + layout + semantic tokens |
+| **Geometry priors** | No Manhattan world / planar constraints | Room shell generation enforces planar walls/floor/ceiling |
+| **Rendering pipeline** | Object-only rendering (sphere cameras) | Indoor camera distribution (room-centered, limited elevation) |
+| **Training datasets** | Only object datasets (Objaverse) | 3D-FRONT + Structured3D + InteriorNet + ScanNet |
+| **Sparse-view reconstruction** | 150 views per object; rooms need more | Seed-guided 2D Gaussian splatting for room-scale |
+| **Scene graph modeling** | No relationship modeling | SpatialLM scene scripts + learned layout prior |
+---
+## Key Papers & arXiv IDs
+| Paper | arXiv ID | Key Contribution |
+|-------|----------|-----------------|
+| TRELLIS v1 | 2412.01506 | Structured latent (SLAT) for 3D generation |
+| TRELLIS.2 | 2512.14692 | O-Voxel with PBR materials, 16× compression |
+| TRELLISWorld | 2510.23880 | Tiled diffusion for scene generation |
+| Hunyuan3D-2.0 | 2501.12202 | Shape+texture two-stage pipeline |
+| Hunyuan3D-2.1 | 2506.15442 | Full training code release |
+| Hunyuan3D-2.5 | 2506.16504 | LATTICE 10B model |
+| HunyuanWorld | 2507.21809 | Panoramic world proxies |
+| SF3D | 2408.00653 | Sub-second mesh + PBR |
+| InstantMesh | 2404.07191 | Best open-source mesh quality |
+| CRM | 2403.05034 | Best geometry fidelity (CD 0.0094) |
+| TripoSR | 2403.02151 | Fastest baseline (0.5s) |
+| LGM | 2402.05054 | Gaussian splatting output |
+| Era3D | 2405.11616 | High-res multi-view (512²) |
+| Wonder3D | 2310.15008 | Cross-domain diffusion |
+| SyncDreamer | 2309.03453 | Synchronized multi-view |
+| MVDream | 2308.16512 | Multi-view diffusion |
+| 2DGS-Room | 2412.03428 | Indoor GS reconstruction |
+| Pano2Room | 2408.11413 | Single panorama to 3DGS |
+| SpatialLM | 2506.07491 | LLM for indoor scene understanding |
+| RoomFormer | CVPR 2023 | Floorplan from point clouds |
+| EchoScene | 2405.00915 | Scene graph → 3D indoor |
+| CHOrD | 2503.11958 | Collision-free house-scale scenes |
+| Direct3D | 2405.14832 | Triplane VAE + DiT |
+| Direct3D-S2 | 2505.17412 | Sparse SDF VAE, 1024³ on 8 GPUs |
+| CLAY | 2406.13897 | 1.5B param multi-condition model |
+| RL3DEdit | 2603.03143 | RL (GRPO) for 3D editing |
+| AR3D-R1 | (recent) | RL-enhanced text-to-3D |
+| Grendel-GS | 2406.18533 | Distributed 3DGS training |
+| TriplaneTurbo | 2503.21694 | Progressive rendering distillation |
+| Depth Anything V2 | 2406.09414 | SOTA monocular depth |
+---
+## Dataset Rankings for Interior 3D
+### Tier 1 (Essential)
+| Rank | Dataset | Size | Key Strength | HF Hub |
+|------|---------|------|-------------|--------|
+| 1 | **3D-FRONT (MIDI-3D)** | 17K rooms | End-to-end room scenes with furniture | `huanngzh/3D-Front` |
+| 2 | **Structured3D** | 21K rooms | Best structured 3D annotations (planes, lines, junctions) | `Gen3DF/Structured3D` |
+| 3 | **ScanNet++** | 1.6K scenes | Real-world validation, dense annotations | `marvex/scannet-dataset` |
+### Tier 2 (Pre-training & Scale)
+| Rank | Dataset | Size | Key Strength |
+|------|---------|------|-------------|
+| 4 | **InteriorNet** | 1.7M layouts | Massive scale, multi-sensor |
+| 5 | **HM3D** | 1K scenes | Largest real-world dataset |
+| 6 | **Hypersim** | 461 scenes | High photorealism, material decomposition |
+| 7 | **Replica** | 18 scenes | HDR textures, highest quality |
+### Tier 3 (Assets & Objects)
+| Rank | Dataset | Size | Key Strength | HF Hub |
+|------|---------|------|-------------|--------|
+| 8 | **Objaverse-XL** | 10M objects | Largest 3D object repo | `allenai/objaverse-xl` |
+| 9 | **OmniObject3D** | 6K objects | High-quality real scans | N/A |
+| 10 | **3D-FUTURE** | 10K furniture | Professional furniture models | N/A |
+### Tier 4 (Auxiliary)
+| Dataset | Purpose |
+|---------|---------|
+| SceneVerse | Language grounding |
+| ProcTHOR | Procedural augmentation |
+| ARKitScenes | Mobile capture |
+| 3RScan | Change detection |
+| MultiScan | Articulated furniture |
+| Infinigen | Procedural generation |
+| MVImgNet | Object multi-view |
+| GSO | Evaluation benchmark |
+---
+## Training Recipe Summary
+### Stage 1: VAE (1 week, 8×A100)
+- Dataset: 3D-FRONT + Structured3D (synthetic rooms)
+- Multi-resolution: 256³ → 512³ → 1024³ curriculum
+- Optimizer: AdamW, lr 1e-4, weight decay 0.01
+- Loss: MSE reconstruction + KL (λ=1e-3) + depth L1 + normal cosine
+- Batch: 8 per GPU, effective 64
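One possible way to write the Stage 1 objective is shown below. The recipe lists the four terms and gives only the KL weight, so the unit weights on the depth and normal terms, the standard-normal prior, and the exact form of the normal term (one minus cosine similarity) are assumptions.

```latex
\mathcal{L}_{\text{VAE}}
  = \lVert \hat{x} - x \rVert_2^2
  + \lambda_{\text{KL}}\, D_{\text{KL}}\!\big(q(z \mid x) \,\Vert\, \mathcal{N}(0, I)\big)
  + \lVert \hat{d} - d \rVert_1
  + \big(1 - \cos(\hat{n}, n)\big),
\qquad \lambda_{\text{KL}} = 10^{-3}
```

Here $\hat{x}$, $\hat{d}$, and $\hat{n}$ denote the reconstructed volume, rendered depth, and rendered normals, and $x$, $d$, $n$ their ground-truth counterparts.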
+### Stage 2: Structure DiT (1 week, 32×A100)
+- Rectified flow matching
+- Conditioning: DINOv3-L image features + depth + layout tokens
+- Resolution curriculum: 256³ → 512³ → 1024³
+- Batch: 8 per GPU, effective 256
+- Optimizer: AdamW, lr 1e-4 → 2e-5 (progressive)
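For readers unfamiliar with rectified flow matching, the sketch below builds one training pair on toy 2-element vectors: the sample is a straight-line interpolation between noise and data, and the regression target is the constant velocity along that line. The time convention (t=0 is noise, t=1 is data) is an assumption, since papers differ on the direction, and the real model operates on latent voxel features rather than plain lists.

```python
def rf_training_pair(x_data, noise, t):
    """Return the interpolated sample x_t and the target velocity."""
    x_t = [(1.0 - t) * n + t * d for n, d in zip(noise, x_data)]
    v_target = [d - n for n, d in zip(noise, x_data)]
    return x_t, v_target


def rf_loss(v_pred, v_target):
    """Mean squared error between predicted and target velocity."""
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_target)) / len(v_target)


x_data = [1.0, 2.0]   # stand-in for a clean latent
noise = [0.0, -2.0]   # stand-in for a Gaussian sample
x_t, v_target = rf_training_pair(x_data, noise, t=0.5)

# With the exact velocity, a single Euler step of size 1 maps noise to data.
recovered = [n + v for n, v in zip(noise, v_target)]
print(x_t, v_target, recovered)
```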
+### Stage 3: Material DiT (1 week, 16×A100)
+- Conditioned on generated geometry + input image
+- PBR material prediction
+- Batch: 16 per GPU, effective 256
+- Loss: L1 on albedo + L1 on metallic/roughness + LPIPS on rendered appearance
+### Stage 4: Real-world Fine-tuning (3 days, 8×A100)
+- LoRA rank 32 on DiT attention layers
+- Dataset: ScanNet + HM3D real photos
+- RL fine-tuning: GRPO with VGGT geometric rewards
+- Domain adaptation from synthetic → real
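To give a sense of how small a rank-32 LoRA adapter is, the sketch below counts trainable parameters for one attention projection. The hidden size of 1024 is an assumed value for illustration; the report does not state the DiT width.

```python
def lora_param_count(d_in, d_out, rank):
    """Parameters in one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * d_in + d_out * rank


HIDDEN = 1024  # assumed projection width, not taken from the report
RANK = 32      # LoRA rank from the Stage 4 recipe

full_params = HIDDEN * HIDDEN
lora_params = lora_param_count(HIDDEN, HIDDEN, RANK)
fraction = lora_params / full_params

print(f"{lora_params} trainable vs {full_params} frozen ({fraction:.2%})")
```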
+### Total Cost Estimate: ~$60K (about 3.5 weeks if the stages run sequentially, peaking at 32×A100; roughly 10K A100-hours)
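The per-stage durations and GPU counts in the recipe can be tallied directly. The sketch assumes the stages run back to back, that "1 week" means 7 full days of training, and that the $60K figure covers GPU time only; the implied hourly rate is derived from those assumptions rather than quoted from the report.

```python
# (stage name, days, number of A100s), taken from the stage headings.
STAGES = [
    ("Stage 1: VAE", 7, 8),
    ("Stage 2: Structure DiT", 7, 32),
    ("Stage 3: Material DiT", 7, 16),
    ("Stage 4: Real-world fine-tuning", 3, 8),
]


def gpu_hours(days, gpus):
    return days * 24 * gpus


per_stage = {name: gpu_hours(days, gpus) for name, days, gpus in STAGES}
total_gpu_hours = sum(per_stage.values())
total_days = sum(days for _, days, _ in STAGES)
implied_rate = 60_000 / total_gpu_hours  # USD per A100-hour

for name, hours in per_stage.items():
    print(f"{name}: {hours} A100-hours")
print(f"Total: {total_gpu_hours} A100-hours over {total_days} days")
print(f"Implied rate: ${implied_rate:.2f} per A100-hour")
```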
+---
+## Novel Contributions of InteriorFusion
+1. **SLAT-Interior**: First structured latent representation designed for indoor scenes with room-shell vs object separation
+2. **Scene-aware generation pipeline**: First end-to-end pipeline from single image to editable 3D interior
+3. **Metric-scale consistency**: Leverages metric depth for real-world furniture scaling
+4. **Hybrid output**: Simultaneous mesh + Gaussian splatting + PBR materials
+5. **Editable scene graph**: Objects are independent, movable, replaceable nodes
+6. **Style-conditioned**: Supports modern, Scandinavian, luxury, Indian, and commercial interiors
+7. **PBR material generation**: Native metallic/roughness/normal output (not just baked textures)
+8. **Training-free scene assembly**: Uses SpatialLM + learned layout prior without scene-level diffusion training
+---
+## Business Moat Analysis
+| Moat | InteriorFusion | Competitors |
+|------|---------------|-------------|
+| **Dataset moat** | 3D-FRONT + Structured3D rooms (interior-specific) | Generic object datasets |
+| **Architecture moat** | Scene-aware SLAT + scene graph | Object-only representations |
+| **Integration moat** | Blender/UE/Unity plugins + ComfyUI nodes | Mostly web/API only |
+| **Speed moat** | 8s on A100 (target) | 0.5s (TripoSR) but no interiors; 12-30s for high-quality object models |
+| **Quality moat** | PBR + editable + scene-aware | Single mesh blob |
+| **Open-source moat** | MIT license, full code | Mixed licenses (some proprietary) |