# InteriorFusion Architecture Design
## Design Philosophy
InteriorFusion is built on a critical insight: **interior scenes are fundamentally different from single objects**. Current SOTA models (TRELLIS, Hunyuan3D-2, TripoSR, SF3D) are trained on object-centric datasets (Objaverse) and produce unit-cube-scaled assets. They have no concept of:
- Room topology (walls, floors, ceilings)
- Spatial relationships (table NEAR sofa, lamp ON nightstand)
- Real-world scale (meters, not arbitrary units)
- Multi-object coherence (furniture doesn't float)
- Semantic room understanding (kitchen vs bedroom vs office)
InteriorFusion addresses all of these through a **5-phase hybrid pipeline**.
---
## Phase 1: Scene Understanding
### 1.1 Metric Depth Estimation
**Model**: `depth-anything/Depth-Anything-V2-Metric-Indoor-Large-hf`
Why the metric indoor variant? It predicts depth in **real-world meters** (trained on Hypersim), which is essential for correct furniture scaling; non-metric estimators produce only relative depth, which breaks room reconstruction.
### 1.2 Room Layout Estimation
**Model**: `manycore-research/SpatialLM-Llama-1B` (or Qwen-0.5B for Apache 2.0)
SpatialLM processes point clouds from depth + camera intrinsics to produce structured scene scripts:
```python
from dataclasses import dataclass
from typing import List

@dataclass
class RoomLayout:
    walls: List[Plane]         # Wall planes with normals
    floor: Plane               # Floor plane
    ceiling: Plane             # Ceiling plane
    doors: List[Doorway]       # Doorway locations
    windows: List[Window]      # Window locations
    objects: List[ObjectBBox]  # Furniture bounding boxes
```
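SpatialLM's point-cloud input can be produced by back-projecting the metric depth map through the pinhole camera model. A minimal sketch (simple intrinsics, no lens distortion assumed):

```python
import numpy as np

def depth_to_pointcloud(depth: np.ndarray, fx: float, fy: float,
                        cx: float, cy: float) -> np.ndarray:
    """Back-project an HxW metric depth map (meters) to an (H*W, 3) point cloud."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (u - cx) * z / fx   # pinhole model: X = (u - cx) * Z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# A flat surface 2 m from the camera maps to points with z == 2.0
pts = depth_to_pointcloud(np.full((4, 4), 2.0), fx=2.0, fy=2.0, cx=2.0, cy=2.0)
```

Because depth is already metric, the resulting cloud is in meters and can be handed to the layout estimator without rescaling.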
### 1.3 Semantic Segmentation
**Model**: Mask2Former / OneFormer with indoor-trained heads
Segments the input image into:
- Wall regions (with material type: paint, wallpaper, brick)
- Floor regions (wood, tile, carpet)
- Ceiling region
- Per-furniture instances (sofa, table, lamp, etc.)
- Decorative elements (plants, paintings, curtains)
### 1.4 Multi-Object Detection & Isolation
Using SAM (Segment Anything Model) with indoor priors:
- Segment each furniture piece
- Extract per-object crops with alpha masks
- Remove background context for clean object generation
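Per-object isolation reduces to attaching each SAM mask as an alpha channel and cropping to the mask's bounding box. A sketch (array-level only; the SAM call itself is omitted):

```python
import numpy as np

def isolate_object(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Crop an HxWx3 image to the mask's bounding box and attach the mask as alpha."""
    ys, xs = np.nonzero(mask)
    y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
    crop = image[y0:y1, x0:x1]
    alpha = (mask[y0:y1, x0:x1] * 255).astype(np.uint8)
    return np.dstack([crop, alpha])  # HxWx4 RGBA; background has alpha = 0

img = np.full((8, 8, 3), 200, dtype=np.uint8)
m = np.zeros((8, 8), dtype=bool)
m[2:5, 3:6] = True                  # a 3x3 "object"
rgba = isolate_object(img, m)
```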
---
## Phase 2: Multi-View Generation
### 2.1 Per-Object Multi-View Diffusion
**Model**: `stabilityai/stable-zero123` or Zero123++ community pipeline
For each segmented furniture object:
- Generate 6 consistent novel views (0°, 60°, 120°, 180°, 240°, 300° azimuth)
- Condition on the original crop + depth edge map
- Use depth-conditioned ControlNet for geometric consistency
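The six-view camera ring can be expressed as rotations of one camera about the object's up axis. A sketch of the azimuth rotations (a +Y-up, world-to-camera convention is an assumption):

```python
import numpy as np

def ring_rotations(n_views: int = 6) -> list[np.ndarray]:
    """Rotation matrices for n_views cameras evenly spaced in azimuth about +Y."""
    mats = []
    for k in range(n_views):
        a = 2 * np.pi * k / n_views   # 0°, 60°, 120°, 180°, 240°, 300°
        mats.append(np.array([
            [ np.cos(a), 0.0, np.sin(a)],
            [ 0.0,       1.0, 0.0      ],
            [-np.sin(a), 0.0, np.cos(a)],
        ]))
    return mats

views = ring_rotations(6)
```

Each matrix is fed to the view-conditioned diffusion model as the relative camera pose for that view.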
### 2.2 Room Shell Multi-View
For walls, floor, ceiling:
- Generate panoramic-style extended views from the single image
- Use depth-guided inpainting for occluded regions
- Produce ceiling, floor, and wall texture atlases
### 2.3 Depth-Conditioned View Synthesis
Condition all multi-view generation on the metric depth map:
- Depth acts as a geometric prior preventing shape hallucination
- Cross-view depth consistency enforced via depth-normal consistency loss
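One way to implement the consistency term is to derive normals from depth gradients and penalize disagreement between aligned views; the finite-difference formulation below is a sketch, not the exact training loss:

```python
import numpy as np

def normals_from_depth(depth: np.ndarray) -> np.ndarray:
    """Estimate per-pixel surface normals from a depth map via finite differences."""
    dz_dy, dz_dx = np.gradient(depth)
    n = np.stack([-dz_dx, -dz_dy, np.ones_like(depth)], axis=-1)
    return n / np.linalg.norm(n, axis=-1, keepdims=True)

def normal_consistency_loss(n_a: np.ndarray, n_b: np.ndarray) -> float:
    """1 - mean cosine similarity between two aligned normal maps (0 = identical)."""
    return float(1.0 - np.mean(np.sum(n_a * n_b, axis=-1)))

flat = np.full((8, 8), 3.0)   # planar depth -> all normals point at the camera
loss = normal_consistency_loss(normals_from_depth(flat), normals_from_depth(flat))
```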
---
## Phase 3: 3D Reconstruction
### 3.1 Room Shell Reconstruction
Walls, floor, ceiling are reconstructed as **planar meshes** with UV atlases:
- Walls: Extruded from detected wall planes + depth boundaries
- Floor: Planar mesh with UV-mapped texture
- Ceiling: Planar mesh with texture from inpainted ceiling view
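A planar room-shell element is essentially a textured quad; a minimal sketch of a floor-mesh builder (the vertex/UV layout is an assumption):

```python
import numpy as np

def floor_quad(width_m: float, depth_m: float):
    """Axis-aligned floor quad at y=0: 4 vertices, 2 triangles, unit-square UVs."""
    verts = np.array([
        [0.0,     0.0, 0.0],
        [width_m, 0.0, 0.0],
        [width_m, 0.0, depth_m],
        [0.0,     0.0, depth_m],
    ])
    faces = np.array([[0, 1, 2], [0, 2, 3]])                    # two triangles
    uvs = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)
    return verts, faces, uvs

v, f, uv = floor_quad(4.0, 3.0)   # a 4 m x 3 m floor plane
```

Walls follow the same pattern, extruded along the detected wall plane between floor and ceiling heights.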
### 3.2 Per-Object 3D Generation
Each furniture object is reconstructed using a **hybrid approach**:
- **Small objects** (lamps, vases, decor): TRELLIS.2-4B → mesh with PBR
- **Medium objects** (chairs, tables): TRELLIS.2-4B or InteriorFusion-L native
- **Large objects** (sofas, beds, wardrobes): InteriorFusion-L with spatial constraints
The key innovation: **Spatial Constraint Injection**
- Object position is constrained by the room layout from Phase 1
- Object scale is constrained by metric depth
- Object orientation is constrained by floor plane normal
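In code, the three constraints reduce to overwriting a generated object's pose with quantities measured in Phase 1. A sketch (all names hypothetical):

```python
import numpy as np

def apply_spatial_constraints(pos, room_min, room_max, floor_y, metric_h, unit_h):
    """Constrain a generated object's pose with Phase-1 measurements:
    clamp position inside the room shell, rest it on the floor plane,
    and rescale the generator's unit-cube asset to its metric height."""
    pos = np.clip(np.asarray(pos, dtype=float), room_min, room_max)
    pos[1] = floor_y                # gravity: object rests on the floor
    scale = metric_h / unit_h       # unit-cube asset -> meters
    return pos, scale

# An object predicted outside a 4x3x3 m room gets pulled back in and grounded.
p, s = apply_spatial_constraints([5.0, 1.0, 2.0], [0, 0, 0], [4, 3, 3], 0.0, 0.9, 1.0)
```

Orientation is handled analogously by rotating the object's up axis onto the floor plane normal.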
### 3.3 Gaussian Splatting Layer
For the entire scene, we maintain a parallel **3D Gaussian Splatting representation**:
- Fast novel view synthesis for interactive preview
- Per-object Gaussian subsets for editing
- Global scene Gaussians for background/room shell
---
## Phase 4: Scene Assembly
### 4.1 Layout Optimization
Using SpatialLM's scene graph + learned layout prior:
- Place objects at detected positions from Phase 1
- Resolve collisions using physics-based relaxation
- Ensure objects rest on floor (gravity constraint)
- Ensure objects don't intersect walls
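The relaxation step can be sketched as iteratively pushing overlapping floor-plane footprints apart (2-D axis-aligned boxes, fixed step size; a toy stand-in for the physics-based solver):

```python
import numpy as np

def relax_layout(centers, half_sizes, iters=50, step=0.5):
    """Separate overlapping 2-D axis-aligned footprints by symmetric pushes."""
    c = np.asarray(centers, dtype=float)
    h = np.asarray(half_sizes, dtype=float)
    for _ in range(iters):
        for i in range(len(c)):
            for j in range(i + 1, len(c)):
                d = c[j] - c[i]
                overlap = h[i] + h[j] - np.abs(d)    # per-axis penetration depth
                if np.all(overlap > 0):              # boxes intersect
                    axis = int(np.argmin(overlap))   # push along least-penetrated axis
                    push = step * overlap[axis] * np.sign(d[axis] or 1.0)
                    c[i][axis] -= push / 2
                    c[j][axis] += push / 2
    return c

# Two 1 m x 1 m footprints overlapping by 0.5 m drift apart until they just touch.
out = relax_layout([[0.0, 0.0], [0.5, 0.0]], [[0.5, 0.5], [0.5, 0.5]])
```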
### 4.2 Scale Normalization
All objects normalized to metric scale:
- Use known furniture dimensions (e.g., standard chair seat height ~45 cm)
- Use depth consistency to resolve ambiguous scales
- Human-scale reference from detected people/artifacts
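Known-dimension normalization is a ratio between a real-world prior and the same dimension measured on the generated asset. A sketch (the 45 cm seat-height prior is from the text; the other entries and helper names are illustrative assumptions):

```python
# Real-world dimension priors in meters (illustrative values).
KNOWN_HEIGHTS_M = {"chair_seat": 0.45, "table_top": 0.75, "door": 2.03}

def metric_scale(category: str, measured_unit_height: float) -> float:
    """Scale factor taking the asset's unit-space height to metric height."""
    return KNOWN_HEIGHTS_M[category] / measured_unit_height

s = metric_scale("chair_seat", 0.5)   # seat measured at 0.5 units -> scale by 0.9
```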
### 4.3 Scene Graph Construction
```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class SceneGraph:
    nodes: Dict[str, SceneNode]   # Objects + room shell
    edges: List[SpatialRelation]  # "on", "next to", "in front of", etc.
    room_type: str                # "modern_living_room", "scandinavian_kitchen"
    style: str                    # "modern", "scandinavian", "luxury", "indian"
```
---
## Phase 5: Material & Texture
### 5.1 PBR Material Generation
For each surface:
- Base color/albedo (diffuse)
- Metallic map
- Roughness map
- Normal map (bump)
- Ambient occlusion (optional)
**Model**: Custom material diffusion network fine-tuned on Hypersim + InteriorNet
### 5.2 Texture Baking
- Project multi-view generated textures onto UV atlases
- Visibility-aware blending (occlusion handling)
- Seamless tiling for large surfaces (walls, floors)
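Visibility-aware blending is a per-texel weighted average in which each view contributes in proportion to how directly it sees the surface. A sketch using clamped cosine (normal · view direction) weights, which is one common choice rather than the pipeline's exact scheme:

```python
import numpy as np

def blend_views(colors: np.ndarray, cosines: np.ndarray) -> np.ndarray:
    """Blend per-view texel colors (V, H, W, 3) with visibility weights (V, H, W).

    Weights are the clamped cosine between surface normal and view direction;
    occluded or back-facing texels get weight 0 and do not contribute."""
    w = np.clip(cosines, 0.0, None)[..., None]       # (V, H, W, 1)
    total = w.sum(axis=0)
    return (colors * w).sum(axis=0) / np.maximum(total, 1e-8)

cols = np.stack([np.full((2, 2, 3), 1.0), np.full((2, 2, 3), 0.0)])
cosw = np.stack([np.full((2, 2), 1.0), np.full((2, 2), -0.5)])  # view 2 back-facing
tex = blend_views(cols, cosw)   # only the front-facing view survives
```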
### 5.3 Lighting Estimation
Estimate scene lighting from the input image:
- HDR environment map extraction
- Key light / fill light / ambient light decomposition
- IBL (Image-Based Lighting) setup for game engines
---
## Core Model: InteriorFusion-L (4B Parameters)
### Encoder
- **Image encoder**: DINOv3-L (frozen, feature extraction)
- **Depth encoder**: Custom CNN processing metric depth map
- **Layout encoder**: Transformer processing SpatialLM scene graph tokens
- **Semantic encoder**: Mask2Former feature pyramid
### Latent Representation: SLAT-Interior
Extension of TRELLIS SLAT optimized for indoor scenes:
- Sparse 3D voxel grid, resolution 1024³
- Active voxels only on surfaces (wall, furniture)
- Per-voxel features: shape + material + semantic class
- Room-shell voxels flagged separately from object voxels
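A minimal sketch of the sparse-voxel bookkeeping: only surface voxels are stored, each carrying a feature vector and a room-shell/object flag (the record layout is an assumption, not the actual SLAT-Interior encoding):

```python
import numpy as np

class SparseVoxelGrid:
    """Sparse grid: store only active (surface) voxels in a coord -> record dict."""

    def __init__(self, resolution: int = 1024, feat_dim: int = 8):
        self.resolution = resolution
        self.feat_dim = feat_dim
        self.voxels = {}   # (i, j, k) -> {"feat": ndarray, "is_room_shell": bool}

    def activate(self, coord, feat, is_room_shell=False):
        assert all(0 <= c < self.resolution for c in coord)
        self.voxels[tuple(coord)] = {
            "feat": np.asarray(feat, dtype=np.float32),
            "is_room_shell": is_room_shell,
        }

    def density(self) -> float:
        """Fraction of the dense grid that is active (tiny for surface-only scenes)."""
        return len(self.voxels) / self.resolution ** 3

g = SparseVoxelGrid(resolution=64, feat_dim=4)
g.activate((0, 0, 0), [1, 0, 0, 0], is_room_shell=True)   # a wall voxel
g.activate((10, 2, 7), [0, 1, 0, 0])                      # a furniture voxel
```

Keeping shell and object voxels in one structure with a flag is what lets the decoders treat them separately.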
### Decoder
Three parallel decoders:
1. **Mesh decoder**: Produces watertight or arbitrary-topology meshes (from O-Voxel)
2. **Gaussian decoder**: Produces per-voxel Gaussian parameters
3. **Material decoder**: Produces PBR material parameters per surface
### Generation Pipeline
Two-stage rectified flow (following TRELLIS pattern):
1. **Structure generation**: Dense occupancy grid → sparse structure
2. **Latent generation**: Per-active-voxel features → shape + material
Conditioned on: DINOv3 image features + depth map + room layout tokens + semantic segmentation tokens
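Rectified-flow sampling integrates a learned velocity field from noise (t=0) to data (t=1) along near-straight paths. A minimal Euler sampler; the velocity model here is a stand-in closure, not the real conditioned DiT:

```python
import numpy as np

def rectified_flow_sample(velocity_fn, x_noise, n_steps=10):
    """Euler-integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (data)."""
    x = np.asarray(x_noise, dtype=float)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t)   # one Euler step along the flow
    return x

# Stand-in field: constant velocity toward a fixed target (a straight-line flow).
target = np.array([1.0, 2.0, 3.0])
x0 = np.zeros(3)
out = rectified_flow_sample(lambda x, t: target - x0, x0, n_steps=4)
```

In the actual pipeline the same loop runs twice: once over the dense occupancy grid (structure) and once over per-active-voxel latents (shape + material).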
---
## Training Strategy
### Stage 1: VAE Pre-training (1 week, 8×A100)
- Train SLAT-Interior VAE on 3D-FRONT + Structured3D rooms
- Multi-resolution: 256³ → 512³ → 1024³ curriculum
- Loss: MSE reconstruction + KL divergence + depth consistency + normal consistency
### Stage 2: Flow-Matching DiT (2 weeks, 32×A100)
- Train rectified flow transformer for structure generation
- Curriculum: 256³ → 512³ → 1024³
- Conditioning: image + depth + layout
### Stage 3: Material DiT (1 week, 16×A100)
- Train material generation DiT conditioned on geometry + input image
- PBR material prediction: albedo, metallic, roughness, normal
### Stage 4: Fine-tuning (3 days, 8×A100)
- LoRA fine-tuning on real interior photos (ScanNet + HM3D)
- Domain adaptation from synthetic to real
- Reinforcement learning for geometry consistency (GRPO-style)
### Total Training: ~4 weeks on 32×A100
---
## Inference Optimization
### RTX 4090 (24GB VRAM)
- Model quantization: INT8 via GPTQ
- Gradient checkpointing disabled (inference only)
- Gaussian splatting for real-time preview
- Full mesh generation: ~15 seconds
### A100 (80GB VRAM)
- FP16 inference
- Batch generation for multiple objects
- Full pipeline: ~8 seconds
### H100 (80GB VRAM)
- BF16 inference
- ~5 seconds full generation
### Edge / Mobile
- Core depth + layout estimation only (~2 seconds)
- Cloud-based 3D generation with streaming
- Reduced mesh quality (decimated, lower texture resolution)
---
## Export Formats
| Format | Use Case | Features |
|--------|----------|----------|
| **GLB** | Web, AR, Unity, Godot | PBR materials, animations, all data |
| **FBX** | Unreal Engine, Maya, 3ds Max | Full rigging support, PBR |
| **OBJ** | Legacy compatibility | Basic materials (MTL) |
| **USDZ** | iOS AR (ARKit) | Apple's native format |
| **3DGS (.ply)** | Real-time viewing | Gaussian splatting render |
| **BLEND** | Blender native | Full editability, nodes |