stevee00/InteriorFusion

stevee00 committed 23 days ago

Commit 8af6a60 (verified) · 1 parent: b61be7d

Upload ARCHITECTURE.md

Files changed (1)

ARCHITECTURE.md +248 -0

ARCHITECTURE.md ADDED

	@@ -0,0 +1,248 @@

+# InteriorFusion Architecture Design
+## Design Philosophy
+InteriorFusion is built on a critical insight: **interior scenes are fundamentally different from single objects**. Current SOTA models (TRELLIS, Hunyuan3D-2, TripoSR, SF3D) are trained on object-centric datasets (Objaverse) and produce unit-cube-scaled assets. They have no concept of:
+- Room topology (walls, floors, ceilings)
+- Spatial relationships (table NEAR sofa, lamp ON nightstand)
+- Real-world scale (meters, not arbitrary units)
+- Multi-object coherence (furniture doesn't float)
+- Semantic room understanding (kitchen vs bedroom vs office)
+InteriorFusion addresses all of these through a **5-phase hybrid pipeline**.
+---
+## Phase 1: Scene Understanding
+### 1.1 Metric Depth Estimation
+**Model**: `depth-anything/Depth-Anything-V2-Metric-Indoor-Large-hf`
+Why the metric indoor variant? It predicts depth in **real-world meters** (trained on Hypersim), which is essential for correct furniture scaling. Relative (non-metric) depth estimators recover depth only up to an unknown scale and shift, which breaks room reconstruction.
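To illustrate why metric depth matters, here is a minimal sketch of back-projecting a depth map into a camera-space point cloud. It assumes a pinhole camera and depth measured along the optical axis. The function names and the intrinsics `fx`, `fy`, `cx`, `cy` are illustrative and are not taken from the InteriorFusion code.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]

def backproject(u: float, v: float, depth_m: float,
                fx: float, fy: float, cx: float, cy: float) -> Point:
    """Lift pixel (u, v) with metric depth (meters along the optical axis) to camera-space XYZ."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)

def depth_to_points(depth: List[List[float]],
                    fx: float, fy: float, cx: float, cy: float) -> List[Point]:
    """Convert a row-major depth map to a list of 3D points, skipping non-positive depths."""
    return [backproject(u, v, d, fx, fy, cx, cy)
            for v, row in enumerate(depth)
            for u, d in enumerate(row) if d > 0.0]
```

Because the depth is in meters, the resulting points are in meters too. With relative depth, the same code would produce a cloud with an arbitrary scale.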
+### 1.2 Room Layout Estimation
+**Model**: `manycore-research/SpatialLM-Llama-1B` (or the Qwen-0.5B variant for an Apache 2.0 license)
+SpatialLM processes point clouds from depth + camera intrinsics to produce structured scene scripts:
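+```python
+from __future__ import annotations  # postpone annotation evaluation; Plane, Doorway, Window, ObjectBBox are defined elsewhere
+from dataclasses import dataclass
+from typing import List
+@dataclass
+class RoomLayout:
+    walls: List[Plane]          # Wall planes with normals
+    floor: Plane                # Floor plane
+    ceiling: Plane              # Ceiling plane
+    doors: List[Doorway]        # Doorway locations
+    windows: List[Window]       # Window locations
+    objects: List[ObjectBBox]   # Furniture bounding boxes
+```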
+### 1.3 Semantic Segmentation
+**Model**: Mask2Former / OneFormer with indoor-trained heads
+Segments the input image into:
+- Wall regions (with material type: paint, wallpaper, brick)
+- Floor regions (wood, tile, carpet)
+- Ceiling region
+- Per-furniture instances (sofa, table, lamp, etc.)
+- Decorative elements (plants, paintings, curtains)
+### 1.4 Multi-Object Detection & Isolation
+Using SAM (Segment Anything Model) with indoor priors:
+- Segment each furniture piece
+- Extract per-object crops with alpha masks
+- Remove background context for clean object generation
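The crop-and-mask step can be sketched as follows, assuming the segmenter has already produced a boolean mask per object. Images are plain nested lists here to keep the example dependency-free, and `crop_with_alpha` is a hypothetical helper name.

```python
from typing import List, Tuple

RGB = Tuple[int, int, int]
RGBA = Tuple[int, int, int, int]

def crop_with_alpha(image: List[List[RGB]], mask: List[List[bool]]) -> List[List[RGBA]]:
    """Return the tight crop around `mask`, with alpha 255 inside the mask and 0 outside."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        return []  # empty mask: nothing to crop
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    r0, r1, c0, c1 = rows[0], rows[-1], cols[0], cols[-1]
    return [[(*image[r][c], 255 if mask[r][c] else 0) for c in range(c0, c1 + 1)]
            for r in range(r0, r1 + 1)]
```

Pixels inside the bounding box but outside the mask keep their color and get alpha 0, so the downstream generator sees the object without its background.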
+---
+## Phase 2: Multi-View Generation
+### 2.1 Per-Object Multi-View Diffusion
+**Model**: `stabilityai/stable-zero123` or Zero123++ community pipeline
+For each segmented furniture object:
+- Generate 6 consistent orthographic views (0°, 60°, 120°, 180°, 240°, 300° azimuth)
+- Condition on the original crop + depth edge map
+- Use depth-conditioned ControlNet for geometric consistency
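As a rough illustration of the view layout, the sketch below computes camera positions for evenly spaced azimuths around an object at the origin. The radius, the elevation, the Y-up convention, and the name `orbit_cameras` are assumptions made for the example.

```python
import math
from typing import List, Tuple

Vec3 = Tuple[float, float, float]

def orbit_cameras(radius: float, elevation_deg: float = 0.0,
                  n_views: int = 6) -> List[Tuple[float, Vec3]]:
    """Return (azimuth_deg, xyz) for cameras evenly spaced around the origin (Y up)."""
    elev = math.radians(elevation_deg)
    cams = []
    for i in range(n_views):
        az_deg = i * 360.0 / n_views
        az = math.radians(az_deg)
        pos = (radius * math.cos(elev) * math.cos(az),
               radius * math.sin(elev),
               radius * math.cos(elev) * math.sin(az))
        cams.append((az_deg, pos))
    return cams
```

With the default `n_views=6`, the azimuths are 0°, 60°, 120°, 180°, 240°, and 300°, matching the list above.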
+### 2.2 Room Shell Multi-View
+For walls, floor, ceiling:
+- Generate panoramic-style extended views from the single image
+- Use depth-guided inpainting for occluded regions
+- Produce ceiling, floor, and wall texture atlases
+### 2.3 Depth-Conditioned View Synthesis
+Condition all multi-view generation on the metric depth map:
+- Depth acts as a geometric prior preventing shape hallucination
+- Cross-view depth consistency is enforced via a depth-normal consistency loss
+---
+## Phase 3: 3D Reconstruction
+### 3.1 Room Shell Reconstruction
+Walls, floor, ceiling are reconstructed as **planar meshes** with UV atlases:
+- Walls: Extruded from detected wall planes + depth boundaries
+- Floor: Planar mesh with UV-mapped texture
+- Ceiling: Planar mesh with texture from inpainted ceiling view
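A planar floor mesh with UVs can be as small as one quad. The sketch below is one possible construction, assuming an axis-aligned room footprint, Y up, and a texture that repeats every `tile_m` meters. These conventions and the name `floor_quad` are illustrative.

```python
def floor_quad(x_min, x_max, z_min, z_max, y=0.0, tile_m=1.0):
    """Planar floor mesh (Y up): 4 vertices, 2 triangles, UVs repeating every `tile_m` meters."""
    verts = [(x_min, y, z_min), (x_max, y, z_min), (x_max, y, z_max), (x_min, y, z_max)]
    uvs = [((x - x_min) / tile_m, (z - z_min) / tile_m) for x, _, z in verts]
    faces = [(0, 2, 1), (0, 3, 2)]  # counter-clockwise seen from above, so normals point +Y
    return verts, uvs, faces
```

Walls and the ceiling could be built the same way from their detected planes, with the winding flipped where the normal must face into the room.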
+### 3.2 Per-Object 3D Generation
+Each furniture object is reconstructed using a **hybrid approach**:
+- **Small objects** (lamps, vases, decor): TRELLIS.2-4B → mesh with PBR
+- **Medium objects** (chairs, tables): TRELLIS.2-4B or InteriorFusion-L native
+- **Large objects** (sofas, beds, wardrobes): InteriorFusion-L with spatial constraints
+The key innovation: **Spatial Constraint Injection**
+- Object position is constrained by the room layout from Phase 1
+- Object scale is constrained by metric depth
+- Object orientation is constrained by floor plane normal
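The scale constraint can be illustrated with the pinhole relation between pixel extent, depth, and focal length. This is a sketch of the idea rather than the project's implementation, and the function names are hypothetical.

```python
def metric_size(pixel_extent: float, depth_m: float, focal_px: float) -> float:
    """Pinhole relation: real size (m) = pixel extent * depth (m) / focal length (px)."""
    return pixel_extent * depth_m / focal_px

def scale_to_metric(unit_height: float, pixel_height: float,
                    depth_m: float, focal_px: float) -> float:
    """Uniform scale factor mapping a unit-cube asset's height to its metric height."""
    return metric_size(pixel_height, depth_m, focal_px) / unit_height
```

For example, an object spanning 200 px at 3 m depth with a 600 px focal length is 1 m tall, so a generated asset of height 0.5 would be scaled by 2.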
+### 3.3 Gaussian Splatting Layer
+For the entire scene, we maintain a parallel **3D Gaussian Splatting representation**:
+- Fast novel view synthesis for interactive preview
+- Per-object Gaussian subsets for editing
+- Global scene Gaussians for background/room shell
+---
+## Phase 4: Scene Assembly
+### 4.1 Layout Optimization
+Using SpatialLM's scene graph + learned layout prior:
+- Place objects at detected positions from Phase 1
+- Resolve collisions using physics-based relaxation
+- Ensure objects rest on floor (gravity constraint)
+- Ensure objects don't intersect walls
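The constraints above can be sketched on axis-aligned boxes. Note the substitution: instead of a physics-based relaxation, this example uses a simple geometric push along the axis of least penetration, and it ignores vertical overlap. The `Box` type and all function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Box:
    """Axis-aligned box: center (x, y, z) and half-extents (hx, hy, hz), Y up, meters."""
    x: float
    y: float
    z: float
    hx: float
    hy: float
    hz: float

def snap_to_floor(b: Box, floor_y: float = 0.0) -> None:
    b.y = floor_y + b.hy  # bottom face rests on the floor

def clamp_to_room(b: Box, x_min: float, x_max: float, z_min: float, z_max: float) -> None:
    b.x = min(max(b.x, x_min + b.hx), x_max - b.hx)  # keep the box inside the walls
    b.z = min(max(b.z, z_min + b.hz), z_max - b.hz)

def separate(a: Box, b: Box) -> None:
    """Push two overlapping boxes apart along the horizontal axis of least penetration."""
    px = (a.hx + b.hx) - abs(a.x - b.x)
    pz = (a.hz + b.hz) - abs(a.z - b.z)
    if px <= 0 or pz <= 0:
        return  # no overlap in the floor plane
    if px < pz:
        s = 1 if a.x <= b.x else -1
        a.x -= s * px / 2
        b.x += s * px / 2
    else:
        s = 1 if a.z <= b.z else -1
        a.z -= s * pz / 2
        b.z += s * pz / 2
```

A full solver would iterate `separate` over all pairs and re-apply `clamp_to_room` until nothing moves, since pushing one pair apart can create a new overlap elsewhere.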
+### 4.2 Scale Normalization
+All objects normalized to metric scale:
+- Use known furniture dimensions (e.g., standard chair seat height ~45 cm)
+- Use depth consistency to resolve ambiguous scales
+- Use detected people or objects of known size as a human-scale reference
+### 4.3 Scene Graph Construction
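+```python
+from __future__ import annotations  # postpone annotation evaluation; SceneNode and SpatialRelation are defined elsewhere
+from dataclasses import dataclass
+from typing import Dict, List
+@dataclass
+class SceneGraph:
+    nodes: Dict[str, SceneNode]     # Objects + room shell
+    edges: List[SpatialRelation]    # "on", "next to", "in front of", etc.
+    room_type: str                   # "modern_living_room", "scandinavian_kitchen"
+    style: str                       # "modern", "scandinavian", "luxury", "indian"
+```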
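As a hedged illustration of how "on" edges might be derived from geometry, the sketch below defines minimal stand-ins for `SceneNode` and `SpatialRelation` and applies a simple contact test. The fields, the tolerance, and the helper names are assumptions, not the project's definitions.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class SceneNode:  # hypothetical fields
    label: str
    center: Tuple[float, float, float]  # meters, Y up
    size: Tuple[float, float, float]    # full extents in meters

@dataclass
class SpatialRelation:  # hypothetical fields
    subject: str
    predicate: str
    target: str

def rests_on(a: SceneNode, b: SceneNode, tol: float = 0.02) -> bool:
    """True if a's bottom touches b's top (within tol) and a's center lies over b's footprint."""
    a_bottom = a.center[1] - a.size[1] / 2
    b_top = b.center[1] + b.size[1] / 2
    over_x = abs(a.center[0] - b.center[0]) <= b.size[0] / 2
    over_z = abs(a.center[2] - b.center[2]) <= b.size[2] / 2
    return abs(a_bottom - b_top) <= tol and over_x and over_z

def infer_on_edges(nodes: Dict[str, SceneNode]) -> List[SpatialRelation]:
    """Emit an "on" edge for every ordered pair that passes the contact test."""
    return [SpatialRelation(i, "on", j)
            for i, a in nodes.items() for j, b in nodes.items()
            if i != j and rests_on(a, b)]
```

A lamp whose base sits at the height of a nightstand's top, centered over it, would yield the edge `lamp on nightstand`.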
+---
+## Phase 5: Material & Texture
+### 5.1 PBR Material Generation
+For each surface:
+- Base color/albedo (diffuse)
+- Metallic map
+- Roughness map
+- Normal map (bump)
+- Ambient occlusion (optional)
+**Model**: Custom material diffusion network fine-tuned on Hypersim + InteriorNet
+### 5.2 Texture Baking
+- Project multi-view generated textures onto UV atlases
+- Visibility-aware blending (occlusion handling)
+- Seamless tiling for large surfaces (walls, floors)
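Visibility-aware blending can be sketched per texel as a weighted average over views. The weighting used here, the cosine between the surface normal and the direction to the camera with occluded views zeroed, is a common choice and is an assumption about this pipeline. `blend_texel` is a hypothetical name.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]

def blend_texel(normal: Vec3, samples: Sequence[Tuple[Vec3, Vec3, bool]]) -> Vec3:
    """Blend per-view colors for one texel.

    Each sample is (color, unit direction from surface to camera, visible flag).
    The weight is max(0, normal . view_dir); occluded views get zero weight.
    """
    total = 0.0
    acc = [0.0, 0.0, 0.0]
    for color, view_dir, visible in samples:
        w = max(0.0, sum(n * d for n, d in zip(normal, view_dir))) if visible else 0.0
        total += w
        for i in range(3):
            acc[i] += w * color[i]
    if total == 0.0:
        return (0.0, 0.0, 0.0)  # texel seen by no view; left for inpainting
    return (acc[0] / total, acc[1] / total, acc[2] / total)
```

Views that see the surface head-on dominate the result, while grazing or occluded views contribute little or nothing.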
+### 5.3 Lighting Estimation
+Estimate scene lighting from the input image:
+- HDR environment map extraction
+- Key light / fill light / ambient light decomposition
+- IBL (Image-Based Lighting) setup for game engines
+---
+## Core Model: InteriorFusion-L (4B Parameters)
+### Encoder
+- **Image encoder**: DINOv3-L (frozen, feature extraction)
+- **Depth encoder**: Custom CNN processing metric depth map
+- **Layout encoder**: Transformer processing SpatialLM scene graph tokens
+- **Semantic encoder**: Mask2Former feature pyramid
+### Latent Representation: SLAT-Interior
+Extension of TRELLIS SLAT optimized for indoor scenes:
+- Sparse 3D voxel grid, resolution 1024³
+- Active voxels only on surfaces (wall, furniture)
+- Per-voxel features: shape + material + semantic class
+- Room-shell voxels flagged separately from object voxels
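A sparse grid of this kind can be modeled as a dictionary keyed by voxel index, so that only surface voxels consume memory. The sketch below is a toy data structure. The `Voxel` fields and the method names are assumptions and do not describe the actual SLAT-Interior layout.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Index = Tuple[int, int, int]

@dataclass
class Voxel:  # hypothetical per-voxel payload
    features: List[float]        # shape + material latent
    semantic_class: int
    is_room_shell: bool = False  # walls, floor, ceiling vs. furniture

@dataclass
class SparseGrid:
    resolution: int = 1024
    voxels: Dict[Index, Voxel] = field(default_factory=dict)

    def activate(self, idx: Index, voxel: Voxel) -> None:
        if not all(0 <= i < self.resolution for i in idx):
            raise IndexError(f"{idx} outside {self.resolution}^3 grid")
        self.voxels[idx] = voxel

    def occupancy(self) -> float:
        """Fraction of the dense grid that is active."""
        return len(self.voxels) / self.resolution ** 3

    def objects_only(self) -> Dict[Index, Voxel]:
        return {i: v for i, v in self.voxels.items() if not v.is_room_shell}
```

A dense 1024³ grid has 1,073,741,824 cells, which is why storing only the active surface voxels matters at this resolution.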
+### Decoder
+Three parallel decoders:
+1. **Mesh decoder**: Produces watertight or arbitrary-topology meshes (from O-Voxel)
+2. **Gaussian decoder**: Produces per-voxel Gaussian parameters
+3. **Material decoder**: Produces PBR material parameters per surface
+### Generation Pipeline
+Two-stage rectified flow (following TRELLIS pattern):
+1. **Structure generation**: Dense occupancy grid → sparse structure
+2. **Latent generation**: Per-active-voxel features → shape + material
+Conditioned on: DINOv3 image features + depth map + room layout tokens + semantic segmentation tokens
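The two-stage sampling loop can be sketched with a plain Euler integrator over a learned velocity field. Everything here is a toy under stated assumptions: time runs from 0 (noise) to 1 (sample), conditioning is omitted, vectors are flat lists, and the `> 0.0` occupancy threshold and function names are hypothetical.

```python
from typing import Callable, List, Tuple

Vector = List[float]
VelocityFn = Callable[[Vector, float], Vector]

def euler_sample(velocity: VelocityFn, x0: Vector, steps: int = 25) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

def generate(structure_velocity: VelocityFn, latent_velocity: VelocityFn,
             noise_structure: Vector, noise_latent: Vector,
             steps: int = 25) -> Tuple[List[int], Vector]:
    """Stage 1 selects the active voxels; stage 2 samples features only for those voxels."""
    occupancy = euler_sample(structure_velocity, noise_structure, steps)
    active = [i for i, o in enumerate(occupancy) if o > 0.0]  # hypothetical threshold
    latents = euler_sample(latent_velocity, [noise_latent[i] for i in active], steps)
    return active, latents
```

In a real model the velocity functions would be the trained transformers, called with the image, depth, layout, and segmentation conditioning listed above.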
+---
+## Training Strategy
+### Stage 1: VAE Pre-training (1 week, 8×A100)
+- Train SLAT-Interior VAE on 3D-FRONT + Structured3D rooms
+- Multi-resolution: 256³ → 512³ → 1024³ curriculum
+- Loss: MSE reconstruction + KL divergence + depth consistency + normal consistency
+### Stage 2: Flow-Matching DiT (2 weeks, 32×A100)
+- Train rectified flow transformer for structure generation
+- Curriculum: 256³ → 512³ → 1024³
+- Conditioning: image + depth + layout
+### Stage 3: Material DiT (1 week, 16×A100)
+- Train material generation DiT conditioned on geometry + input image
+- PBR material prediction: albedo, metallic, roughness, normal
+### Stage 4: Fine-tuning (3 days, 8×A100)
+- LoRA fine-tuning on real interior photos (ScanNet + HM3D)
+- Domain adaptation from synthetic to real
+- Reinforcement learning for geometry consistency (GRPO-style)
+### Total Training: ~4.5 weeks end to end, peaking at 32×A100
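The compute budget can be checked with a few lines of arithmetic. The sketch assumes the four stages run back to back, that a week is seven full days, and that every GPU is busy for the whole stage.

```python
# (stage, days, gpus) as listed in the training stages above
STAGES = [
    ("VAE pre-training", 7, 8),
    ("Flow-matching DiT", 14, 32),
    ("Material DiT", 7, 16),
    ("Fine-tuning", 3, 8),
]

def gpu_hours(days: int, gpus: int) -> int:
    return days * 24 * gpus

total_days = sum(days for _, days, _ in STAGES)
total_gpu_hours = sum(gpu_hours(days, gpus) for _, days, gpus in STAGES)
peak_gpus = max(gpus for _, _, gpus in STAGES)
print(total_days, total_gpu_hours, peak_gpus)  # 31 15360 32
```

Under these assumptions the schedule is 31 days (about 4.4 weeks) and 15,360 A100-hours, with only Stage 2 needing the full 32 GPUs.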
+---
+## Inference Optimization
+### RTX 4090 (24GB VRAM)
+- Model quantization: INT8 via GPTQ
+- No gradient checkpointing (a training-only memory technique; inference runs without gradients)
+- Gaussian splatting for real-time preview
+- Full mesh generation: ~15 seconds
+### A100 (80GB VRAM)
+- FP16 inference
+- Batch generation for multiple objects
+- Full pipeline: ~8 seconds
+### H100 (80GB VRAM)
+- BF16 inference
+- ~5 seconds full generation
+### Edge / Mobile
+- Core depth + layout estimation only (~2 seconds)
+- Cloud-based 3D generation with streaming
+- Reduced mesh quality (decimated, lower texture resolution)
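A quick estimate of weight memory shows why the precision choices above differ by GPU. The sketch counts weights only for a 4B-parameter model. Activations, sparse-voxel buffers, and quantization metadata such as scales are extra and are not modeled.

```python
BYTES_PER_PARAM = {"int8": 1, "fp16": 2, "bf16": 2}

def weight_gib(n_params: float, dtype: str) -> float:
    """Memory for the weights alone, in GiB."""
    return n_params * BYTES_PER_PARAM[dtype] / 2 ** 30

for dtype in ("int8", "fp16", "bf16"):
    print(f"{dtype}: {weight_gib(4e9, dtype):.2f} GiB")
```

This prints roughly 3.73 GiB for INT8 and 7.45 GiB for FP16 or BF16, which leaves more headroom for activations on a 24 GB card when the weights are quantized.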
+---
+## Export Formats
+| Format | Use Case | Features |
+|--------|----------|----------|
+| **GLB** | Web, AR, Unity, Godot | PBR materials, animations, single self-contained binary file |
+| **FBX** | Unreal Engine, Maya, 3ds Max | Full rigging support, PBR |
+| **OBJ** | Legacy compatibility | Basic materials (MTL) |
+| **USDZ** | iOS AR (ARKit) | Apple's native format |
+| **3DGS (.ply)** | Real-time viewing | Gaussian splatting render |
+| **BLEND** | Blender native | Full editability, nodes |
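As an example of the simplest export path, the sketch below serializes a triangle mesh to Wavefront OBJ text. It assumes one UV per vertex and omits the MTL material file. `to_obj` is a hypothetical helper, not part of the project's exporter.

```python
from typing import List, Sequence, Tuple

def to_obj(verts: Sequence[Tuple[float, float, float]],
           uvs: Sequence[Tuple[float, float]],
           faces: Sequence[Tuple[int, int, int]]) -> str:
    """Serialize a triangle mesh with one UV per vertex to Wavefront OBJ text."""
    lines: List[str] = []
    lines += [f"v {x} {y} {z}" for x, y, z in verts]
    lines += [f"vt {u} {v}" for u, v in uvs]
    # OBJ indices are 1-based; each face corner is written as vertex/texcoord
    lines += ["f " + " ".join(f"{i + 1}/{i + 1}" for i in face) for face in faces]
    return "\n".join(lines) + "\n"
```

Formats such as GLB or USDZ carry PBR materials and binary buffers, so in practice they would be written with a dedicated library rather than by hand.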