# InteriorFusion Architecture Design

## Design Philosophy

InteriorFusion is built on a critical insight: **interior scenes are fundamentally different from single objects**. Current SOTA models (TRELLIS, Hunyuan3D-2, TripoSR, SF3D) are trained on object-centric datasets (Objaverse) and produce unit-cube-scaled assets. They have no concept of:

- Room topology (walls, floors, ceilings)
- Spatial relationships (table NEAR sofa, lamp ON nightstand)
- Real-world scale (meters, not arbitrary units)
- Multi-object coherence (furniture doesn't float)
- Semantic room understanding (kitchen vs bedroom vs office)

InteriorFusion addresses all of these through a **5-phase hybrid pipeline**.

---

## Phase 1: Scene Understanding

### 1.1 Metric Depth Estimation

**Model**: `depth-anything/Depth-Anything-V2-Metric-Indoor-Large-hf`

Why the metric indoor variant? It predicts depth in **real-world meters** (trained on Hypersim), which is essential for correct furniture scaling. Non-metric estimators produce only relative depth, which breaks room reconstruction. (A minimal inference sketch appears after §1.3.)

### 1.2 Room Layout Estimation

**Model**: `manycore-research/SpatialLM-Llama-1B` (or Qwen-0.5B for Apache 2.0)

SpatialLM processes point clouds derived from depth + camera intrinsics and produces structured scene scripts:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class RoomLayout:
    # Plane, Doorway, Window, ObjectBBox are geometry types defined elsewhere.
    walls: List[Plane]         # Wall planes with normals
    floor: Plane               # Floor plane
    ceiling: Plane             # Ceiling plane
    doors: List[Doorway]       # Doorway locations
    windows: List[Window]      # Window locations
    objects: List[ObjectBBox]  # Furniture bounding boxes
```

### 1.3 Semantic Segmentation

**Model**: Mask2Former / OneFormer with indoor-trained heads

Segments the input image into:

- Wall regions (with material type: paint, wallpaper, brick)
- Floor regions (wood, tile, carpet)
- Ceiling region
- Per-furniture instances (sofa, table, lamp, etc.)
- Decorative elements (plants, paintings, curtains)
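Phase 1 is easy to prototype end to end. Below is a minimal sketch of §1.1 depth inference plus the back-projection that produces the point cloud SpatialLM consumes in §1.2, assuming the Hugging Face `transformers` depth-estimation pipeline; the input path and the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`) are illustrative placeholders, not calibrated values.

```python
import numpy as np
from PIL import Image
from transformers import pipeline

# Metric depth in meters via the indoor metric variant (sketch, not tuned code).
depth_estimator = pipeline(
    "depth-estimation",
    model="depth-anything/Depth-Anything-V2-Metric-Indoor-Large-hf",
)
image = Image.open("living_room.jpg").convert("RGB")  # placeholder input
depth_m = depth_estimator(image)["predicted_depth"].squeeze().numpy()

# Back-project pixels to a camera-frame point cloud for SpatialLM (Phase 1.2).
h, w = depth_m.shape
fx = fy = 600.0            # assumed focal length in pixels
cx, cy = w / 2.0, h / 2.0  # assumed principal point
u, v = np.meshgrid(np.arange(w), np.arange(h))
points = np.stack(
    [(u - cx) * depth_m / fx, (v - cy) * depth_m / fy, depth_m], axis=-1
).reshape(-1, 3)           # (H*W, 3) points in meters, camera frame
```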
### 1.4 Multi-Object Detection & Isolation

Using SAM (Segment Anything Model) with indoor priors:

- Segment each furniture piece
- Extract per-object crops with alpha masks
- Remove background context for clean object generation

---

## Phase 2: Multi-View Generation

### 2.1 Per-Object Multi-View Diffusion

**Model**: `stabilityai/stable-zero123` or the Zero123++ community pipeline

For each segmented furniture object:

- Generate 6 consistent orthographic views (0°, 60°, 120°, 180°, 240°, 300° azimuth)
- Condition on the original crop + a depth edge map
- Use a depth-conditioned ControlNet for geometric consistency

### 2.2 Room Shell Multi-View

For walls, floor, and ceiling:

- Generate panoramic-style extended views from the single image
- Use depth-guided inpainting for occluded regions
- Produce ceiling, floor, and wall texture atlases

### 2.3 Depth-Conditioned View Synthesis

Condition all multi-view generation on the metric depth map:

- Depth acts as a geometric prior that prevents shape hallucination
- Cross-view depth consistency is enforced via a depth-normal consistency loss

---

## Phase 3: 3D Reconstruction

### 3.1 Room Shell Reconstruction

Walls, floor, and ceiling are reconstructed as **planar meshes** with UV atlases:

- Walls: extruded from detected wall planes + depth boundaries
- Floor: planar mesh with a UV-mapped texture
- Ceiling: planar mesh textured from the inpainted ceiling view

### 3.2 Per-Object 3D Generation

Each furniture object is reconstructed using a **hybrid approach**:

- **Small objects** (lamps, vases, decor): TRELLIS.2-4B → mesh with PBR
- **Medium objects** (chairs, tables): TRELLIS.2-4B or InteriorFusion-L native
- **Large objects** (sofas, beds, wardrobes): InteriorFusion-L with spatial constraints

The key innovation is **Spatial Constraint Injection**:

- Object position is constrained by the room layout from Phase 1
- Object scale is constrained by metric depth
- Object orientation is constrained by the floor plane normal

### 3.3 Gaussian Splatting Layer

For the entire scene, we maintain a parallel **3D Gaussian Splatting representation**:

- Fast novel view synthesis for interactive preview
- Per-object Gaussian subsets for editing
- Global scene Gaussians for background/room shell

---

## Phase 4: Scene Assembly

### 4.1 Layout Optimization

Using SpatialLM's scene graph + a learned layout prior:

- Place objects at the positions detected in Phase 1
- Resolve collisions using physics-based relaxation
- Ensure objects rest on the floor (gravity constraint)
- Ensure objects don't intersect walls

### 4.2 Scale Normalization

All objects are normalized to metric scale (see the sketch after this list):

- Use known furniture dimensions (e.g., standard chair seat height ~45 cm)
- Use depth consistency to resolve ambiguous scales
- Use human-scale references from detected people/artifacts
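To make these two constraints concrete, here is a minimal sketch of how the §4.1 gravity constraint and the §4.2 prior-based rescaling could compose. `KNOWN_HEIGHTS_M` and `normalize_and_ground` are hypothetical names; the prior values and the 50/50 blend between the class prior and the depth-derived height are illustrative assumptions, not the project's tuned logic.

```python
import numpy as np

# Hypothetical class-height priors in meters (illustrative values).
KNOWN_HEIGHTS_M = {"chair": 0.45, "table": 0.75, "sofa": 0.85, "wardrobe": 2.0}

def normalize_and_ground(obj_class, bbox_min, bbox_max, depth_height_m, floor_z=0.0):
    """Rescale an object toward a blend of its class-height prior and the
    depth-derived height, then rest it on the floor plane (gravity)."""
    bbox_min = np.asarray(bbox_min, dtype=float)
    bbox_max = np.asarray(bbox_max, dtype=float)
    height = bbox_max[2] - bbox_min[2]
    prior = KNOWN_HEIGHTS_M.get(obj_class, depth_height_m)
    scale = (0.5 * prior + 0.5 * depth_height_m) / height  # assumed 50/50 blend
    center = 0.5 * (bbox_min + bbox_max)
    bbox_min = center + scale * (bbox_min - center)        # scale about the center
    bbox_max = center + scale * (bbox_max - center)
    lift = floor_z - bbox_min[2]                           # gravity: snap to floor
    bbox_min[2] += lift
    bbox_max[2] += lift
    return bbox_min, bbox_max

# A detected chair whose depth evidence says it is 0.50 m tall:
print(normalize_and_ground("chair", [0, 0, 0.1], [0.5, 0.5, 0.9], 0.50))
```

Collision resolution and the wall constraint would then run as a separate relaxation pass over the adjusted boxes.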
### 4.3 Scene Graph Construction

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class SceneGraph:
    # SceneNode and SpatialRelation are defined elsewhere in the pipeline.
    nodes: Dict[str, SceneNode]   # Objects + room shell
    edges: List[SpatialRelation]  # "on", "next to", "in front of", etc.
    room_type: str                # "modern_living_room", "scandinavian_kitchen"
    style: str                    # "modern", "scandinavian", "luxury", "indian"
```

---

## Phase 5: Material & Texture

### 5.1 PBR Material Generation

For each surface:

- Base color/albedo (diffuse)
- Metallic map
- Roughness map
- Normal map (bump)
- Ambient occlusion (optional)

**Model**: custom material diffusion network fine-tuned on Hypersim + InteriorNet

### 5.2 Texture Baking

- Project multi-view generated textures onto UV atlases
- Visibility-aware blending (occlusion handling)
- Seamless tiling for large surfaces (walls, floors)

### 5.3 Lighting Estimation

Estimate scene lighting from the input image:

- HDR environment map extraction
- Key light / fill light / ambient light decomposition
- IBL (Image-Based Lighting) setup for game engines

---

## Core Model: InteriorFusion-L (4B Parameters)

### Encoder

- **Image encoder**: DINOv3-L (frozen, feature extraction)
- **Depth encoder**: custom CNN processing the metric depth map
- **Layout encoder**: Transformer processing SpatialLM scene-graph tokens
- **Semantic encoder**: Mask2Former feature pyramid

### Latent Representation: SLAT-Interior

An extension of TRELLIS SLAT optimized for indoor scenes (a minimal layout sketch follows this list):

- Sparse 3D voxel grid, resolution 1024³
- Active voxels only on surfaces (walls, furniture)
- Per-voxel features: shape + material + semantic class
- Room-shell voxels flagged separately from object voxels
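These properties map naturally onto a coordinates-plus-features sparse-tensor layout. A minimal sketch, with the `SLATInterior` class and its field names assumed for illustration:

```python
from dataclasses import dataclass

import torch

@dataclass
class SLATInterior:
    """Sparse latent for N active voxels out of the 1024^3 grid (sketch)."""
    coords: torch.Tensor         # (N, 3) int32 voxel indices, surfaces only
    features: torch.Tensor       # (N, C) shape + material + semantic features
    is_room_shell: torch.Tensor  # (N,) bool, shell voxels vs. object voxels

    def objects_only(self) -> "SLATInterior":
        """Select the furniture subset, e.g. for per-object decoding/editing."""
        keep = ~self.is_room_shell
        return SLATInterior(self.coords[keep], self.features[keep],
                            self.is_room_shell[keep])
```

Keeping the shell flag per voxel makes the per-object vs. room-shell split from §3.3 a cheap boolean mask rather than a separate representation.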
### Decoder

Three parallel decoders:

1. **Mesh decoder**: produces watertight or arbitrary-topology meshes (from O-Voxel)
2. **Gaussian decoder**: produces per-voxel Gaussian parameters
3. **Material decoder**: produces PBR material parameters per surface

### Generation Pipeline

Two-stage rectified flow (following the TRELLIS pattern):

1. **Structure generation**: dense occupancy grid → sparse structure
2. **Latent generation**: per-active-voxel features → shape + material

Conditioned on: DINOv3 image features + depth map + room layout tokens + semantic segmentation tokens.

---

## Training Strategy

### Stage 1: VAE Pre-training (1 week, 8×A100)

- Train the SLAT-Interior VAE on 3D-FRONT + Structured3D rooms
- Multi-resolution curriculum: 256³ → 512³ → 1024³
- Loss: MSE reconstruction + KL divergence + depth consistency + normal consistency

### Stage 2: Flow-Matching DiT (2 weeks, 32×A100)

- Train the rectified-flow transformer for structure generation
- Curriculum: 256³ → 512³ → 1024³
- Conditioning: image + depth + layout

### Stage 3: Material DiT (1 week, 16×A100)

- Train a material-generation DiT conditioned on geometry + the input image
- PBR material prediction: albedo, metallic, roughness, normal

### Stage 4: Fine-tuning (3 days, 8×A100)

- LoRA fine-tuning on real interior photos (ScanNet + HM3D)
- Domain adaptation from synthetic to real
- Reinforcement learning for geometry consistency (GRPO-style)

### Total Training: ~4 weeks on 32×A100

---

## Inference Optimization

### RTX 4090 (24 GB VRAM)

- Model quantization: INT8 via GPTQ
- Gradient checkpointing disabled (inference only)
- Gaussian splatting for real-time preview
- Full mesh generation: ~15 seconds

### A100 (80 GB VRAM)

- FP16 inference
- Batched generation for multiple objects
- Full pipeline: ~8 seconds

### H100 (80 GB VRAM)

- BF16 inference
- ~5 seconds for full generation

### Edge / Mobile

- Core depth + layout estimation only (~2 seconds)
- Cloud-based 3D generation with streaming
- Reduced mesh quality (decimated, lower texture resolution)

---

## Export Formats

| Format | Use Case | Features |
|--------|----------|----------|
| **GLB** | Web, AR, Unity, Godot | PBR materials, animations, all data |
| **FBX** | Unreal Engine, Maya, 3ds Max | Full rigging support, PBR |
| **OBJ** | Legacy compatibility | Basic materials (MTL) |
| **USDZ** | iOS AR (ARKit) | Apple's native format |
| **3DGS (.ply)** | Real-time viewing | Gaussian splatting render |
| **BLEND** | Blender native | Full editability, nodes |
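As a concrete endpoint for the table above, a minimal GLB export sketch, assuming `trimesh` for scene assembly; the box meshes and the transform are placeholders standing in for Phase 4 output (all dimensions in meters).

```python
import trimesh

# Placeholder geometry standing in for generated assets.
room_shell = trimesh.creation.box(extents=(5.0, 4.0, 2.7))  # 5 m x 4 m, 2.7 m ceiling
sofa = trimesh.creation.box(extents=(2.0, 0.9, 0.85))       # sofa-sized box

scene = trimesh.Scene()
scene.add_geometry(room_shell, node_name="room_shell")
scene.add_geometry(
    sofa,
    node_name="sofa",
    transform=trimesh.transformations.translation_matrix([1.0, 0.5, 0.425]),
)
scene.export("interior.glb")  # GLB preserves the node hierarchy and materials
```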