# InteriorFusion Architecture Design

## Design Philosophy

InteriorFusion is built on a critical insight: **interior scenes are fundamentally different from single objects**. Current SOTA models (TRELLIS, Hunyuan3D-2, TripoSR, SF3D) are trained on object-centric datasets (Objaverse) and produce unit-cube-scaled assets. They have no concept of:

- Room topology (walls, floors, ceilings)
- Spatial relationships (table NEAR sofa, lamp ON nightstand)
- Real-world scale (meters, not arbitrary units)
- Multi-object coherence (furniture doesn't float)
- Semantic room understanding (kitchen vs bedroom vs office)

InteriorFusion addresses all of these through a **5-phase hybrid pipeline**.

---

## Phase 1: Scene Understanding

### 1.1 Metric Depth Estimation
**Model**: `depth-anything/Depth-Anything-V2-Metric-Indoor-Large-hf`

Why the metric indoor variant? It predicts depth in **real-world meters** (trained on Hypersim), which is essential for correct furniture scaling. Non-metric estimators produce only relative depth, which breaks room reconstruction.
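
A minimal inference sketch using the Hugging Face `depth-estimation` pipeline (the input file name is a placeholder):

```python
# Hedged sketch: metric depth inference with the HF depth-estimation pipeline.
from transformers import pipeline
from PIL import Image

depth_estimator = pipeline(
    "depth-estimation",
    model="depth-anything/Depth-Anything-V2-Metric-Indoor-Large-hf",
)

image = Image.open("living_room.jpg")  # placeholder input photo
result = depth_estimator(image)
depth_m = result["predicted_depth"]    # dense depth map; meters for the metric variant
print(depth_m.shape, float(depth_m.min()), float(depth_m.max()))
```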

### 1.2 Room Layout Estimation
**Model**: `manycore-research/SpatialLM-Llama-1B` (or Qwen-0.5B for Apache 2.0)

SpatialLM consumes point clouds back-projected from the metric depth map and camera intrinsics (see the sketch after the snippet) and produces structured scene scripts:
```python
from dataclasses import dataclass
from typing import List

# Plane, Doorway, Window, and ObjectBBox are geometry primitives
# defined elsewhere in the codebase.
@dataclass
class RoomLayout:
    walls: List[Plane]          # Wall planes with normals
    floor: Plane                # Floor plane
    ceiling: Plane              # Ceiling plane
    doors: List[Doorway]        # Doorway locations
    windows: List[Window]       # Window locations
    objects: List[ObjectBBox]   # Furniture bounding boxes
```
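
The back-projection itself is the standard pinhole model; a minimal sketch (`fx`, `fy`, `cx`, `cy` are the camera intrinsics):

```python
# Sketch: lift a metric depth map to a camera-space point cloud (pinhole model).
import numpy as np

def depth_to_point_cloud(depth_m: np.ndarray, fx: float, fy: float,
                         cx: float, cy: float) -> np.ndarray:
    """depth_m: (H, W) depth in meters. Returns (N, 3) points in meters."""
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    points = np.stack([x, y, depth_m], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # drop invalid (zero-depth) pixels
```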

### 1.3 Semantic Segmentation
**Model**: Mask2Former / OneFormer with indoor-trained heads

Segments the input image into the following regions (an inference sketch follows the list):
- Wall regions (with material type: paint, wallpaper, brick)
- Floor regions (wood, tile, carpet)
- Ceiling region
- Per-furniture instances (sofa, table, lamp, etc.)
- Decorative elements (plants, paintings, curtains)

### 1.4 Multi-Object Detection & Isolation
Using SAM (Segment Anything Model) with indoor priors (sketched after this list):
- Segment each furniture piece
- Extract per-object crops with alpha masks
- Remove background context for clean object generation
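
A sketch of this step with the reference `segment_anything` package (checkpoint path and input are placeholders; the indoor-prior filtering is elided):

```python
# Hedged sketch: automatic masks + per-object RGBA crops with SAM.
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

image = cv2.cvtColor(cv2.imread("living_room.jpg"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)  # dicts with "segmentation", "bbox", "area", ...

crops = []
for m in masks:
    x, y, w, h = (int(v) for v in m["bbox"])                   # XYWH box
    alpha = (m["segmentation"][y:y+h, x:x+w] * 255).astype(np.uint8)
    crops.append(np.dstack([image[y:y+h, x:x+w], alpha]))      # RGBA crop
```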

---

## Phase 2: Multi-View Generation

### 2.1 Per-Object Multi-View Diffusion
**Model**: `stabilityai/stable-zero123` or Zero123++ community pipeline

For each segmented furniture object (camera poses for the six views are sketched after this list):
- Generate 6 consistent orthographic views (0°, 60°, 120°, 180°, 240°, 300° azimuth)
- Condition on the original crop + depth edge map
- Use depth-conditioned ControlNet for geometric consistency
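
For reference, a small sketch of the camera poses the multi-view model is conditioned on (radius, elevation, and look-at convention are assumptions):

```python
# Sketch: camera-to-world poses at 6 evenly spaced azimuths, looking at the
# object center (OpenGL convention: camera looks down -Z).
import numpy as np

def azimuth_camera_poses(radius=1.5, elevation_deg=15.0, n_views=6):
    elev = np.deg2rad(elevation_deg)
    poses = []
    for k in range(n_views):
        az = np.deg2rad(k * 360.0 / n_views)  # 0, 60, 120, 180, 240, 300 deg
        eye = radius * np.array([np.cos(elev) * np.sin(az),
                                 np.sin(elev),
                                 np.cos(elev) * np.cos(az)])
        f = -eye / np.linalg.norm(eye)                            # toward origin
        s = np.cross(f, [0.0, 1.0, 0.0]); s /= np.linalg.norm(s)  # right
        u = np.cross(s, f)                                        # true up
        pose = np.eye(4)
        pose[:3, :3] = np.stack([s, u, -f], axis=1)
        pose[:3, 3] = eye
        poses.append(pose)
    return poses
```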

### 2.2 Room Shell Multi-View
For walls, floor, ceiling:
- Generate panoramic-style extended views from the single image
- Use depth-guided inpainting for occluded regions
- Produce ceiling, floor, and wall texture atlases

### 2.3 Depth-Conditioned View Synthesis
Condition all multi-view generation on the metric depth map:
- Depth acts as a geometric prior preventing shape hallucination
- Cross-view depth consistency enforced via a depth-normal consistency loss (one formulation sketched below)
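
One plausible formulation of that loss, as a sketch (first-order normal approximation; not the exact training objective):

```python
# Sketch: penalize disagreement between normals derived from predicted depth
# and directly predicted normals (1 - cosine similarity).
import torch
import torch.nn.functional as F

def depth_to_normals(depth: torch.Tensor, fx: float, fy: float) -> torch.Tensor:
    """depth: (B, 1, H, W). First-order normals from central differences."""
    dz_dx = F.pad((depth[..., :, 2:] - depth[..., :, :-2]) / 2.0, (1, 1, 0, 0))
    dz_dy = F.pad((depth[..., 2:, :] - depth[..., :-2, :]) / 2.0, (0, 0, 1, 1))
    n = torch.cat([-dz_dx * fx, -dz_dy * fy, torch.ones_like(depth)], dim=1)
    return F.normalize(n, dim=1)

def depth_normal_consistency_loss(pred_depth, pred_normals, fx, fy):
    derived = depth_to_normals(pred_depth, fx, fy)
    return (1.0 - (derived * pred_normals).sum(dim=1)).mean()
```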

---

## Phase 3: 3D Reconstruction

### 3.1 Room Shell Reconstruction
Walls, floor, ceiling are reconstructed as **planar meshes** with UV atlases:
- Walls: Extruded from detected wall planes + depth boundaries
- Floor: Planar mesh with UV-mapped texture
- Ceiling: Planar mesh with texture from inpainted ceiling view

### 3.2 Per-Object 3D Generation
Each furniture object is reconstructed using a **hybrid approach**:

- **Small objects** (lamps, vases, decor): TRELLIS.2-4B → mesh with PBR
- **Medium objects** (chairs, tables): TRELLIS.2-4B or InteriorFusion-L native
- **Large objects** (sofas, beds, wardrobes): InteriorFusion-L with spatial constraints

The key innovation is **Spatial Constraint Injection** (a sketch of the conditioning record follows the list):
- Object position is constrained by the room layout from Phase 1
- Object scale is constrained by metric depth
- Object orientation is constrained by floor plane normal
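
A hypothetical sketch of the conditioning record (field names and the `bbox`/`floor` attributes are illustrative, not the actual interface):

```python
# Hypothetical conditioning record injected into per-object generation.
from dataclasses import dataclass
import numpy as np

@dataclass
class SpatialConstraints:
    position: np.ndarray   # (3,) object center from the Phase 1 layout, meters
    scale: np.ndarray      # (3,) bbox extents derived from metric depth, meters
    up_axis: np.ndarray    # (3,) floor plane normal constraining orientation

def constraints_from_layout(bbox, floor):
    # bbox: an ObjectBBox from RoomLayout; floor: the floor Plane
    # (.center, .extents, .normal are assumed attribute names)
    return SpatialConstraints(position=bbox.center,
                              scale=bbox.extents,
                              up_axis=floor.normal)
```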

### 3.3 Gaussian Splatting Layer
For the entire scene, we maintain a parallel **3D Gaussian Splatting representation**:
- Fast novel view synthesis for interactive preview
- Per-object Gaussian subsets for editing
- Global scene Gaussians for background/room shell

---

## Phase 4: Scene Assembly

### 4.1 Layout Optimization
Using SpatialLM's scene graph + a learned layout prior (a toy relaxation pass is sketched after this list):
- Place objects at detected positions from Phase 1
- Resolve collisions using physics-based relaxation
- Ensure objects rest on floor (gravity constraint)
- Ensure objects don't intersect walls
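
A toy version of the relaxation pass (real collision handling uses full 3D meshes; this shows the idea on 2D floor footprints):

```python
# Sketch: push apart overlapping axis-aligned floor-plane footprints.
import numpy as np

def relax_layout(centers, half_extents, iters=100, step=0.5):
    """centers, half_extents: (N, 2) arrays in meters. Wall and gravity
    constraints are applied in separate passes."""
    centers = centers.copy()
    for _ in range(iters):
        moved = False
        for i in range(len(centers)):
            for j in range(i + 1, len(centers)):
                delta = centers[j] - centers[i]
                overlap = (half_extents[i] + half_extents[j]) - np.abs(delta)
                if np.all(overlap > 0):              # footprints intersect
                    axis = int(np.argmin(overlap))   # least-penetration axis
                    direction = 1.0 if delta[axis] >= 0 else -1.0
                    push = step * overlap[axis] * direction
                    centers[i][axis] -= push / 2
                    centers[j][axis] += push / 2
                    moved = True
        if not moved:
            break
    return centers
```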

### 4.2 Scale Normalization
All objects normalized to metric scale:
- Use known furniture dimensions (e.g., standard chair seat height ≈ 45 cm)
- Use depth consistency to resolve ambiguous scales
- Human-scale reference from detected people/artifacts

### 4.3 Scene Graph Construction
```python
from dataclasses import dataclass
from typing import Dict, List

# SceneNode and SpatialRelation are defined elsewhere in the codebase.
@dataclass
class SceneGraph:
    nodes: Dict[str, SceneNode]     # Objects + room shell
    edges: List[SpatialRelation]    # "on", "next to", "in front of", etc.
    room_type: str                  # e.g. "modern_living_room", "scandinavian_kitchen"
    style: str                      # "modern", "scandinavian", "luxury", "indian"
```

---

## Phase 5: Material & Texture

### 5.1 PBR Material Generation
For each surface:
- Base color/albedo (diffuse)
- Metallic map
- Roughness map
- Normal map (bump)
- Ambient occlusion (optional)

**Model**: Custom material diffusion network fine-tuned on Hypersim + InteriorNet
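
The output per surface can be pictured as a glTF-style metallic-roughness texture set; a sketch of the record (field layout is illustrative):

```python
# Sketch: per-surface PBR texture set in the metallic-roughness workflow.
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class PBRMaterial:
    albedo: np.ndarray                      # (H, W, 3) base color, sRGB
    metallic: np.ndarray                    # (H, W) 0 = dielectric, 1 = metal
    roughness: np.ndarray                   # (H, W) 0 = mirror, 1 = fully diffuse
    normal: np.ndarray                      # (H, W, 3) tangent-space normal map
    occlusion: Optional[np.ndarray] = None  # (H, W) ambient occlusion, optional
```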

### 5.2 Texture Baking
- Project multi-view generated textures onto UV atlases
- Visibility-aware blending (occlusion handling)
- Seamless tiling for large surfaces (walls, floors)

### 5.3 Lighting Estimation
Estimate scene lighting from the input image:
- HDR environment map extraction
- Key light / fill light / ambient light decomposition
- IBL (Image-Based Lighting) setup for game engines

---

## Core Model: InteriorFusion-L (4B Parameters)

### Encoder
- **Image encoder**: DINOv3-L (frozen, feature extraction)
- **Depth encoder**: Custom CNN processing metric depth map
- **Layout encoder**: Transformer processing SpatialLM scene graph tokens
- **Semantic encoder**: Mask2Former feature pyramid

### Latent Representation: SLAT-Interior
Extension of TRELLIS SLAT optimized for indoor scenes:
- Sparse 3D voxel grid, resolution 1024³
- Active voxels only on surfaces (wall, furniture)
- Per-voxel features: shape + material + semantic class
- Room-shell voxels flagged separately from object voxels
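
A sketch of the sparse layout (a coordinate/feature list rather than a dense 1024³ grid; names are illustrative):

```python
# Sketch: SLAT-Interior stores features only for active surface voxels.
import torch

class SparseInteriorLatent:
    def __init__(self, coords: torch.Tensor, feats: torch.Tensor,
                 is_room_shell: torch.Tensor, resolution: int = 1024):
        assert coords.shape[0] == feats.shape[0] == is_room_shell.shape[0]
        self.coords = coords                # (N, 3) int indices in [0, resolution)
        self.feats = feats                  # (N, C) shape + material + semantics
        self.is_room_shell = is_room_shell  # (N,) bool: shell vs object voxel
        self.resolution = resolution
```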

### Decoder
Three parallel decoders:
1. **Mesh decoder**: Produces watertight or arbitrary-topology meshes (from O-Voxel)
2. **Gaussian decoder**: Produces per-voxel Gaussian parameters
3. **Material decoder**: Produces PBR material parameters per surface

### Generation Pipeline
Two-stage rectified flow (following TRELLIS pattern):
1. **Structure generation**: Dense occupancy grid → sparse structure
2. **Latent generation**: Per-active-voxel features → shape + material

Conditioned on: DINOv3 image features + depth map + room layout tokens + semantic segmentation tokens
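
The sampling loop for each stage is a plain ODE integration of the learned velocity field; a generic sketch (the `velocity_net(x, t, cond)` interface is an assumption, not TRELLIS's actual API):

```python
# Sketch: Euler sampler for a rectified-flow model, noise (t=0) -> data (t=1).
import torch

@torch.no_grad()
def rectified_flow_sample(velocity_net, latent_shape, cond, steps=25, device="cuda"):
    x = torch.randn(latent_shape, device=device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((latent_shape[0],), i * dt, device=device)
        x = x + velocity_net(x, t, cond) * dt  # one Euler step along the flow
    return x
```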

---

## Training Strategy

### Stage 1: VAE Pre-training (1 week, 8×A100)
- Train SLAT-Interior VAE on 3D-FRONT + Structured3D rooms
- Multi-resolution: 256³ → 512³ → 1024³ curriculum
- Loss: MSE reconstruction + KL divergence + depth consistency + normal consistency

### Stage 2: Flow-Matching DiT (2 weeks, 32×A100)
- Train rectified flow transformer for structure generation
- Curriculum: 256³ → 512³ → 1024³
- Conditioning: image + depth + layout

### Stage 3: Material DiT (1 week, 16×A100)
- Train material generation DiT conditioned on geometry + input image
- PBR material prediction: albedo, metallic, roughness, normal

### Stage 4: Fine-tuning (3 days, 8×A100)
- LoRA fine-tuning on real interior photos (ScanNet + HM3D)
- Domain adaptation from synthetic to real
- Reinforcement learning for geometry consistency (GRPO-style)
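
A sketch of the adapter setup with `peft` (the `target_modules` names are placeholders for the DiT's attention projections):

```python
# Hedged sketch: LoRA adapters for Stage 4 domain adaptation.
from peft import LoraConfig, get_peft_model

def add_stage4_lora(dit_model):
    config = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # placeholder names
    )
    return get_peft_model(dit_model, config)
```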

### Total Training: ~4.5 weeks end-to-end (peak 32×A100)

---

## Inference Optimization

### RTX 4090 (24GB VRAM)
- Model quantization: INT8 via GPTQ
- No gradient computation or checkpointing (inference only)
- Gaussian splatting for real-time preview
- Full mesh generation: ~15 seconds

### A100 (80GB VRAM)
- FP16 inference
- Batch generation for multiple objects
- Full pipeline: ~8 seconds

### H100 (80GB VRAM)
- BF16 inference
- ~5 seconds full generation

### Edge / Mobile
- Core depth + layout estimation only (~2 seconds)
- Cloud-based 3D generation with streaming
- Reduced mesh quality (decimated, lower texture resolution)

---

## Export Formats

| Format | Use Case | Features |
|--------|----------|----------|
| **GLB** | Web, AR, Unity, Godot | PBR materials, animations, all data |
| **FBX** | Unreal Engine, Maya, 3ds Max | Full rigging support, PBR |
| **OBJ** | Legacy compatibility | Basic materials (MTL) |
| **USDZ** | iOS AR (ARKit) | Apple's native format |
| **3DGS (.ply)** | Real-time viewing | Gaussian splatting render |
| **BLEND** | Blender native | Full editability, nodes |
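
For GLB (and OBJ/PLY), `trimesh` covers the mechanics; a minimal sketch with a placeholder box standing in for an assembled scene (the real exporter also bakes PBR textures):

```python
# Sketch: export an assembled scene to GLB with trimesh.
import trimesh

scene = trimesh.Scene()
scene.add_geometry(trimesh.creation.box(extents=[2.0, 0.9, 0.9]), node_name="sofa")
scene.export("interior_scene.glb")  # extension selects the format (.obj, .ply, ...)
```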