# InteriorFusion Architecture Design

## Design Philosophy

InteriorFusion is built on a critical insight: **interior scenes are fundamentally different from single objects**. Current SOTA models (TRELLIS, Hunyuan3D-2, TripoSR, SF3D) are trained on object-centric datasets (Objaverse) and produce unit-cube-scaled assets. They have no concept of:

- Room topology (walls, floors, ceilings)
- Spatial relationships (table NEAR sofa, lamp ON nightstand)
- Real-world scale (meters, not arbitrary units)
- Multi-object coherence (furniture doesn't float)
- Semantic room understanding (kitchen vs bedroom vs office)

InteriorFusion addresses all of these through a **5-phase hybrid pipeline**.

---

## Phase 1: Scene Understanding

### 1.1 Metric Depth Estimation
**Model**: `depth-anything/Depth-Anything-V2-Metric-Indoor-Large-hf`

Why the metric indoor variant? It predicts depth in **real-world meters** (fine-tuned on Hypersim), which is essential for correct furniture scaling. Non-metric depth estimators produce only relative depth, which breaks room reconstruction.

### 1.2 Room Layout Estimation
**Model**: `manycore-research/SpatialLM-Llama-1B` (or the Qwen-0.5B variant for an Apache 2.0 license)

SpatialLM processes point clouds derived from depth + camera intrinsics to produce structured scene scripts:
```python
from dataclasses import dataclass
from typing import List

# Plane, Doorway, Window, and ObjectBBox are geometry types
# defined elsewhere in the pipeline.

@dataclass
class RoomLayout:
    walls: List[Plane]          # Wall planes with normals
    floor: Plane                # Floor plane
    ceiling: Plane              # Ceiling plane
    doors: List[Doorway]        # Doorway locations
    windows: List[Window]       # Window locations
    objects: List[ObjectBBox]   # Furniture bounding boxes
```
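
The point cloud that SpatialLM consumes can be derived by back-projecting the metric depth map through the camera intrinsics. A minimal sketch of that step, assuming a simple pinhole model (the intrinsics values below are illustrative):

```python
import numpy as np

def depth_to_point_cloud(depth: np.ndarray, fx: float, fy: float,
                         cx: float, cy: float) -> np.ndarray:
    """Back-project a metric depth map (H, W), in meters, to an (N, 3) point cloud."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx          # pinhole model: X = (u - cx) * Z / fx
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] > 0]      # drop invalid (zero-depth) pixels

# Example: a flat surface 2 m from the camera fills a tiny 4x4 frame.
depth = np.full((4, 4), 2.0)
cloud = depth_to_point_cloud(depth, fx=2.0, fy=2.0, cx=2.0, cy=2.0)
```

Because the depth is metric, the resulting points are directly in meters, which is what makes the downstream layout estimation scale-correct.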

### 1.3 Semantic Segmentation
**Model**: Mask2Former / OneFormer with indoor-trained heads

Segments the input image into:
- Wall regions (with material type: paint, wallpaper, brick)
- Floor regions (wood, tile, carpet)
- Ceiling region
- Per-furniture instances (sofa, table, lamp, etc.)
- Decorative elements (plants, paintings, curtains)

### 1.4 Multi-Object Detection & Isolation
Using SAM (Segment Anything Model) with indoor priors:
- Segment each furniture piece
- Extract per-object crops with alpha masks
- Remove background context for clean object generation
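
Given a SAM mask, the isolation step reduces to cropping to the mask's bounding box and using the mask as the alpha channel. A minimal sketch (SAM itself is not invoked here; the arrays stand in for its output):

```python
import numpy as np

def isolate_object(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """image: (H, W, 3) uint8; mask: (H, W) bool. Returns an RGBA crop of the masked object."""
    ys, xs = np.where(mask)
    y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
    crop = image[y0:y1, x0:x1]
    alpha = (mask[y0:y1, x0:x1] * 255).astype(np.uint8)
    return np.dstack([crop, alpha])   # background pixels get alpha = 0

img = np.full((8, 8, 3), 200, dtype=np.uint8)
m = np.zeros((8, 8), dtype=bool)
m[2:5, 3:7] = True                    # a 3x4 "furniture" region
rgba = isolate_object(img, m)
```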

---

## Phase 2: Multi-View Generation

### 2.1 Per-Object Multi-View Diffusion
**Model**: `stabilityai/stable-zero123` or the Zero123++ community pipeline

For each segmented furniture object:
- Generate 6 consistent novel views (0°, 60°, 120°, 180°, 240°, 300° azimuth)
- Condition on the original crop + depth edge map
- Use a depth-conditioned ControlNet for geometric consistency

### 2.2 Room Shell Multi-View
For walls, floor, and ceiling:
- Generate panoramic-style extended views from the single input image
- Use depth-guided inpainting for occluded regions
- Produce ceiling, floor, and wall texture atlases

### 2.3 Depth-Conditioned View Synthesis
All multi-view generation is conditioned on the metric depth map:
- Depth acts as a geometric prior that prevents shape hallucination
- Cross-view depth consistency is enforced via a depth-normal consistency loss
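
The depth-normal consistency term compares predicted normals against normals differentiated from the depth map. A minimal sketch under simplifying assumptions (unit pixel spacing, no intrinsics; the cosine-based penalty is one common choice, not necessarily the exact loss used here):

```python
import numpy as np

def normals_from_depth(depth: np.ndarray) -> np.ndarray:
    """Unit surface normals from a depth map via finite differences."""
    dz_dv, dz_du = np.gradient(depth)
    n = np.stack([-dz_du, -dz_dv, np.ones_like(depth)], axis=-1)
    return n / np.linalg.norm(n, axis=-1, keepdims=True)

def depth_normal_consistency(depth: np.ndarray, pred_normals: np.ndarray) -> float:
    """1 - mean cosine similarity between predicted and depth-derived normals."""
    n = normals_from_depth(depth)
    cos = np.sum(n * pred_normals, axis=-1)
    return float(1.0 - cos.mean())
```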

---

## Phase 3: 3D Reconstruction

### 3.1 Room Shell Reconstruction
Walls, floor, and ceiling are reconstructed as **planar meshes** with UV atlases:
- Walls: extruded from detected wall planes + depth boundaries
- Floor: planar mesh with a UV-mapped texture
- Ceiling: planar mesh textured from the inpainted ceiling view
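
Each planar shell piece reduces to a UV-mapped rectangle. A minimal sketch for the floor (dimensions are illustrative; walls and ceiling follow the same pattern with different planes):

```python
def make_floor_quad(width_m: float, depth_m: float):
    """Axis-aligned floor rectangle at y=0: vertices, UVs, and two triangles."""
    w, d = width_m / 2.0, depth_m / 2.0
    vertices = [(-w, 0.0, -d), (w, 0.0, -d), (w, 0.0, d), (-w, 0.0, d)]
    uvs = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
    faces = [(0, 1, 2), (0, 2, 3)]   # two triangles covering the quad
    return vertices, uvs, faces

verts, uvs, faces = make_floor_quad(4.0, 3.0)
```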

### 3.2 Per-Object 3D Generation
Each furniture object is reconstructed using a **hybrid approach**:

- **Small objects** (lamps, vases, decor): TRELLIS.2-4B → mesh with PBR
- **Medium objects** (chairs, tables): TRELLIS.2-4B or InteriorFusion-L native
- **Large objects** (sofas, beds, wardrobes): InteriorFusion-L with spatial constraints

The key innovation is **Spatial Constraint Injection**:
- Object position is constrained by the room layout from Phase 1
- Object scale is constrained by metric depth
- Object orientation is constrained by the floor plane normal
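
The three constraints can be sketched as simple geometric operations, assuming axis-aligned layout bounds and a (possibly unnormalized) floor normal. All helper names are illustrative, not the actual injection mechanism inside the model:

```python
import numpy as np

def apply_spatial_constraints(position, layout_min, layout_max,
                              depth_height_m, floor_normal):
    """Clamp position into the layout box, take metric scale from depth,
    and align the object's up axis with the floor normal."""
    position = np.clip(position, layout_min, layout_max)   # room-layout constraint
    scale = float(depth_height_m)                          # metric scale from depth
    up = floor_normal / np.linalg.norm(floor_normal)       # orientation constraint
    return position, scale, up

pos, s, up = apply_spatial_constraints(
    np.array([5.0, 0.0, 0.0]),
    np.array([-2.0, 0.0, -2.0]), np.array([2.0, 3.0, 2.0]),
    0.8, np.array([0.0, 2.0, 0.0]))
```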

### 3.3 Gaussian Splatting Layer
For the entire scene, we maintain a parallel **3D Gaussian Splatting representation**:
- Fast novel view synthesis for interactive preview
- Per-object Gaussian subsets for editing
- Global scene Gaussians for background/room shell

---

## Phase 4: Scene Assembly

### 4.1 Layout Optimization
Using SpatialLM's scene graph + a learned layout prior:
- Place objects at the positions detected in Phase 1
- Resolve collisions using physics-based relaxation
- Ensure objects rest on the floor (gravity constraint)
- Ensure objects don't intersect walls
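
The relaxation step can be sketched as iteratively pushing overlapping boxes apart while snapping them to the floor. This is a toy one-dimensional version for illustration, not the full 3D solver:

```python
def relax_layout(boxes, floor_y=0.0, iters=50):
    """boxes: list of [x_center, y_bottom, half_width].
    Separates x-overlaps symmetrically and rests every box on the floor."""
    boxes = [list(b) for b in boxes]
    for _ in range(iters):
        for b in boxes:
            b[1] = floor_y                       # gravity: rest on the floor
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                a, b = boxes[i], boxes[j]
                overlap = (a[2] + b[2]) - abs(a[0] - b[0])
                if overlap > 0:                  # push the colliding pair apart
                    push = overlap / 2.0
                    if a[0] <= b[0]:
                        a[0] -= push; b[0] += push
                    else:
                        a[0] += push; b[0] -= push
    return boxes

out = relax_layout([[0.0, 0.5, 1.0], [0.5, 0.2, 1.0]])
```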

### 4.2 Scale Normalization
All objects are normalized to metric scale using:
- Known furniture dimensions (e.g., a standard chair seat height of ~45 cm)
- Depth consistency to resolve ambiguous scales
- Human-scale references from detected people/artifacts
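
When a known dimension is available, normalization reduces to a ratio between the real-world size and the reconstructed one (using the ~45 cm seat-height prior from above):

```python
def metric_scale_factor(known_dim_m: float, reconstructed_dim: float) -> float:
    """Factor that rescales an object so the chosen dimension matches reality."""
    if reconstructed_dim <= 0:
        raise ValueError("reconstructed dimension must be positive")
    return known_dim_m / reconstructed_dim

# A chair seat that reconstructs at 0.9 units but should sit 0.45 m high:
factor = metric_scale_factor(0.45, 0.9)
```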

### 4.3 Scene Graph Construction
```python
from dataclasses import dataclass
from typing import Dict, List

# SceneNode and SpatialRelation are defined elsewhere in the pipeline.

@dataclass
class SceneGraph:
    nodes: Dict[str, SceneNode]    # Objects + room shell
    edges: List[SpatialRelation]   # "on", "next to", "in front of", etc.
    room_type: str                 # "modern_living_room", "scandinavian_kitchen"
    style: str                     # "modern", "scandinavian", "luxury", "indian"
```

---

## Phase 5: Material & Texture

### 5.1 PBR Material Generation
For each surface:
- Base color/albedo (diffuse)
- Metallic map
- Roughness map
- Normal map (bump)
- Ambient occlusion (optional)

**Model**: Custom material diffusion network fine-tuned on Hypersim + InteriorNet

### 5.2 Texture Baking
- Project multi-view generated textures onto UV atlases
- Visibility-aware blending (occlusion handling)
- Seamless tiling for large surfaces (walls, floors)
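
Visibility-aware blending is, at its core, a per-texel weighted average over the candidate views, with weights proportional to each view's visibility. A minimal sketch (the weighting scheme is illustrative; real bakers also factor in view angle and resolution):

```python
import numpy as np

def blend_views(textures: np.ndarray, visibility: np.ndarray) -> np.ndarray:
    """textures: (V, H, W, 3) per-view colors; visibility: (V, H, W) weights in [0, 1]."""
    w = visibility[..., None]
    total = w.sum(axis=0)
    total = np.where(total > 0, total, 1.0)   # avoid division by zero for unseen texels
    return (textures * w).sum(axis=0) / total

# Two views: a texel fully visible in view 0 and fully occluded in view 1.
tex = np.stack([np.ones((1, 1, 3)), np.zeros((1, 1, 3))])
vis = np.stack([np.ones((1, 1)), np.zeros((1, 1))])
out = blend_views(tex, vis)   # the occluded view contributes nothing
```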

### 5.3 Lighting Estimation
Estimate scene lighting from the input image:
- HDR environment map extraction
- Key light / fill light / ambient light decomposition
- IBL (Image-Based Lighting) setup for game engines

---

## Core Model: InteriorFusion-L (4B Parameters)

### Encoder
- **Image encoder**: DINOv3-L (frozen, feature extraction)
- **Depth encoder**: custom CNN processing the metric depth map
- **Layout encoder**: transformer processing SpatialLM scene graph tokens
- **Semantic encoder**: Mask2Former feature pyramid

### Latent Representation: SLAT-Interior
An extension of TRELLIS SLAT optimized for indoor scenes:
- Sparse 3D voxel grid, resolution 1024³
- Active voxels only on surfaces (walls, furniture)
- Per-voxel features: shape + material + semantic class
- Room-shell voxels flagged separately from object voxels
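
The sparse layout can be pictured as a coordinate-indexed feature map with a shell/object flag per active voxel. A minimal sketch of that data shape (field names are illustrative, not the actual SLAT-Interior tensors):

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class SparseVoxelGrid:
    """Active-surface voxels of a 1024^3 grid, keyed by integer coordinates."""
    resolution: int = 1024
    features: Dict[Tuple[int, int, int], List[float]] = field(default_factory=dict)
    is_shell: Dict[Tuple[int, int, int], bool] = field(default_factory=dict)

    def activate(self, coord, feat, shell=False):
        assert all(0 <= c < self.resolution for c in coord)
        self.features[coord] = feat
        self.is_shell[coord] = shell

grid = SparseVoxelGrid()
grid.activate((0, 0, 0), [0.1, 0.2], shell=True)   # a wall voxel
grid.activate((10, 4, 7), [0.3, 0.4])              # a furniture voxel
```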

### Decoder
Three parallel decoders:
1. **Mesh decoder**: produces watertight or arbitrary-topology meshes (from O-Voxel)
2. **Gaussian decoder**: produces per-voxel Gaussian parameters
3. **Material decoder**: produces PBR material parameters per surface

### Generation Pipeline
Two-stage rectified flow (following the TRELLIS pattern):
1. **Structure generation**: dense occupancy grid → sparse structure
2. **Latent generation**: per-active-voxel features → shape + material

Conditioned on: DINOv3 image features + depth map + room layout tokens + semantic segmentation tokens.
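
Rectified-flow sampling integrates a learned velocity field from noise (t=0) to data (t=1). A minimal Euler-integration sketch, with a toy constant velocity standing in for the conditioned DiT:

```python
import numpy as np

def sample_rectified_flow(velocity_fn, x0: np.ndarray, steps: int = 10) -> np.ndarray:
    """Euler-integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (sample)."""
    x, dt = x0.copy(), 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x

# Stand-in for the DiT: rectified flow's ideal velocity is the constant
# straight-line direction from the noise sample to the data sample.
noise = np.zeros(3)
data = np.array([1.0, 2.0, 3.0])
velocity = lambda x, t: data - noise
sample = sample_rectified_flow(velocity, noise)
```

Because the toy field is exactly the straight-line velocity, Euler integration lands on the target; a learned field only approximates this, which is why multiple steps are still used.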

---

## Training Strategy

### Stage 1: VAE Pre-training (1 week, 8×A100)
- Train the SLAT-Interior VAE on 3D-FRONT + Structured3D rooms
- Multi-resolution curriculum: 256³ → 512³ → 1024³
- Loss: MSE reconstruction + KL divergence + depth consistency + normal consistency
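
The first two loss terms have standard closed forms. A minimal sketch with a diagonal-Gaussian KL (the `kl_weight` value is illustrative; the depth and normal terms are omitted):

```python
import numpy as np

def vae_loss(x, x_hat, mu, logvar, kl_weight=1e-3):
    """MSE reconstruction + KL(q(z|x) || N(0, I)) for a diagonal-Gaussian posterior."""
    mse = np.mean((x - x_hat) ** 2)
    kl = -0.5 * np.mean(1.0 + logvar - mu ** 2 - np.exp(logvar))
    return mse + kl_weight * kl

# A perfect reconstruction with the posterior exactly at the prior has zero loss.
x = np.ones(8)
loss = vae_loss(x, x, np.zeros(4), np.zeros(4))
```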

### Stage 2: Flow-Matching DiT (2 weeks, 32×A100)
- Train the rectified-flow transformer for structure generation
- Curriculum: 256³ → 512³ → 1024³
- Conditioning: image + depth + layout

### Stage 3: Material DiT (1 week, 16×A100)
- Train the material-generation DiT conditioned on geometry + the input image
- PBR material prediction: albedo, metallic, roughness, normal

### Stage 4: Fine-tuning (3 days, 8×A100)
- LoRA fine-tuning on real interior photos (ScanNet + HM3D)
- Domain adaptation from synthetic to real
- Reinforcement learning for geometry consistency (GRPO-style)

### Total Training: ~4.5 weeks of sequential stages, peaking at 32×A100

---

## Inference Optimization

### RTX 4090 (24 GB VRAM)
- Model quantization: INT8 via GPTQ
- Gradient checkpointing disabled (inference only)
- Gaussian splatting for real-time preview
- Full mesh generation: ~15 seconds

### A100 (80 GB VRAM)
- FP16 inference
- Batched generation for multiple objects
- Full pipeline: ~8 seconds

### H100 (80 GB VRAM)
- BF16 inference
- ~5 seconds for full generation

### Edge / Mobile
- Core depth + layout estimation only (~2 seconds)
- Cloud-based 3D generation with streaming
- Reduced mesh quality (decimated, lower texture resolution)

---

## Export Formats

| Format | Use Case | Features |
|--------|----------|----------|
| **GLB** | Web, AR, Unity, Godot | PBR materials, animations, all data |
| **FBX** | Unreal Engine, Maya, 3ds Max | Full rigging support, PBR |
| **OBJ** | Legacy compatibility | Basic materials (MTL) |
| **USDZ** | iOS AR (ARKit) | Apple's native format |
| **3DGS (.ply)** | Real-time viewing | Gaussian splatting render |
| **BLEND** | Blender native | Full editability, nodes |
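
For the simplest target, OBJ, geometry export fits in a few lines. A minimal sketch (no MTL/material handling; the file name is illustrative):

```python
def write_obj(path, vertices, faces):
    """Write a minimal Wavefront OBJ: 1-indexed faces, no materials or UVs."""
    with open(path, "w") as f:
        for x, y, z in vertices:
            f.write(f"v {x} {y} {z}\n")
        for a, b, c in faces:
            f.write(f"f {a + 1} {b + 1} {c + 1}\n")

# Export a unit-ish floor quad as two triangles.
write_obj("floor.obj",
          [(-1, 0, -1), (1, 0, -1), (1, 0, 1), (-1, 0, 1)],
          [(0, 1, 2), (0, 2, 3)])
```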