# InteriorFusion Architecture Design
## Design Philosophy
InteriorFusion is built on a critical insight: **interior scenes are fundamentally different from single objects**. Current SOTA models (TRELLIS, Hunyuan3D-2, TripoSR, SF3D) are trained on object-centric datasets (Objaverse) and produce unit-cube-scaled assets. They have no concept of:
- Room topology (walls, floors, ceilings)
- Spatial relationships (table NEAR sofa, lamp ON nightstand)
- Real-world scale (meters, not arbitrary units)
- Multi-object coherence (furniture doesn't float)
- Semantic room understanding (kitchen vs bedroom vs office)
InteriorFusion addresses all of these through a **5-phase hybrid pipeline**.
---
## Phase 1: Scene Understanding
### 1.1 Metric Depth Estimation
**Model**: `depth-anything/Depth-Anything-V2-Metric-Indoor-Large-hf`
Why the metric indoor variant? It predicts depth in **real-world meters** (fine-tuned on Hypersim), which is essential for correct furniture scaling. Non-metric estimators produce only relative depth, which breaks metric room reconstruction.
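A minimal sketch of this stage via the Hugging Face `transformers` depth-estimation pipeline (the input filename is a placeholder):

```python
from PIL import Image
from transformers import pipeline

# Metric indoor checkpoint named above: per-pixel depth in meters.
depth_estimator = pipeline(
    "depth-estimation",
    model="depth-anything/Depth-Anything-V2-Metric-Indoor-Large-hf",
)

image = Image.open("living_room.jpg")  # placeholder input photo
result = depth_estimator(image)
depth_m = result["predicted_depth"]    # torch.Tensor of metric depth
```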
### 1.2 Room Layout Estimation
**Model**: `manycore-research/SpatialLM-Llama-1B` (or the Qwen-0.5B variant where an Apache-2.0 license is required)
SpatialLM processes point clouds from depth + camera intrinsics to produce structured scene scripts:
```python
from __future__ import annotations  # Plane, Doorway, etc. are defined elsewhere

from dataclasses import dataclass
from typing import List

@dataclass
class RoomLayout:
    walls: List[Plane]         # Wall planes with normals
    floor: Plane               # Floor plane
    ceiling: Plane             # Ceiling plane
    doors: List[Doorway]       # Doorway locations
    windows: List[Window]      # Window locations
    objects: List[ObjectBBox]  # Furniture bounding boxes
```
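The point cloud that SpatialLM consumes can be built from Phase 1.1's metric depth with standard pinhole unprojection; a sketch, assuming intrinsics `fx`, `fy`, `cx`, `cy` are known from the camera (the function name is illustrative):

```python
import numpy as np

def depth_to_pointcloud(depth_m: np.ndarray, fx: float, fy: float,
                        cx: float, cy: float) -> np.ndarray:
    """Back-project an (H, W) metric depth map into an (N, 3) point
    cloud in camera coordinates via the pinhole model."""
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth_m
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)
```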
### 1.3 Semantic Segmentation
**Model**: Mask2Former / OneFormer with indoor-trained heads
Segments the input image into:
- Wall regions (with material type: paint, wallpaper, brick)
- Floor regions (wood, tile, carpet)
- Ceiling region
- Per-furniture instances (sofa, table, lamp, etc.)
- Decorative elements (plants, paintings, curtains)
### 1.4 Multi-Object Detection & Isolation
Using SAM (Segment Anything Model) with indoor priors:
- Segment each furniture piece
- Extract per-object crops with alpha masks
- Remove background context for clean object generation
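A sketch of this step with the reference `segment_anything` package; the checkpoint is the public SAM ViT-H release, and the input filename is a placeholder:

```python
import numpy as np
from PIL import Image
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

image = np.asarray(Image.open("living_room.jpg").convert("RGB"))
masks = mask_generator.generate(image)  # one dict per segmented region

# Extract per-object RGBA crops: the mask becomes the alpha channel.
object_crops = []
for m in masks:
    x, y, w, h = (int(v) for v in m["bbox"])
    alpha = (m["segmentation"] * 255).astype(np.uint8)
    rgba = np.dstack([image, alpha])
    object_crops.append(rgba[y:y + h, x:x + w])
```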
---
## Phase 2: Multi-View Generation
### 2.1 Per-Object Multi-View Diffusion
**Model**: `stabilityai/stable-zero123` or the Zero123++ community pipeline
For each segmented furniture object:
- Generate six consistent views at fixed azimuths (0°, 60°, 120°, 180°, 240°, 300°)
- Condition on the original crop + depth edge map
- Use depth-conditioned ControlNet for geometric consistency
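A sketch of the Zero123++ route through the `diffusers` custom pipeline (note that Zero123++ emits its six views as a 3×2 image grid at its own fixed camera poses; the conditioning crop filename is a placeholder):

```python
import torch
from PIL import Image
from diffusers import DiffusionPipeline, EulerAncestralDiscreteScheduler

pipe = DiffusionPipeline.from_pretrained(
    "sudo-ai/zero123plus-v1.2",
    custom_pipeline="sudo-ai/zero123plus-pipeline",
    torch_dtype=torch.float16,
)
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(
    pipe.scheduler.config, timestep_spacing="trailing"
)
pipe.to("cuda")

cond = Image.open("sofa_crop.png")  # RGBA crop from Phase 1.4
views = pipe(cond, num_inference_steps=36).images[0]  # 3x2 grid of views
```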
### 2.2 Room Shell Multi-View
For walls, floor, ceiling:
- Generate panoramic-style extended views from the single image
- Use depth-guided inpainting for occluded regions
- Produce ceiling, floor, and wall texture atlases
### 2.3 Depth-Conditioned View Synthesis
Condition all multi-view generation on the metric depth map:
- Depth acts as a geometric prior preventing shape hallucination
- Cross-view depth consistency enforced via depth-normal consistency loss
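One plausible form of the consistency term, sketched in PyTorch with pixel-unit finite differences (the trained loss may use intrinsics-calibrated gradients instead):

```python
import torch
import torch.nn.functional as F

def normals_from_depth(depth: torch.Tensor) -> torch.Tensor:
    """First-order normal estimate from a (B, 1, H, W) depth map via
    central differences in pixel units."""
    dz_dx = F.pad((depth[..., :, 2:] - depth[..., :, :-2]) * 0.5, (1, 1, 0, 0))
    dz_dy = F.pad((depth[..., 2:, :] - depth[..., :-2, :]) * 0.5, (0, 0, 1, 1))
    n = torch.cat([-dz_dx, -dz_dy, torch.ones_like(depth)], dim=1)
    return F.normalize(n, dim=1)

def depth_normal_consistency_loss(depth, pred_normals):
    # Penalize disagreement between normals implied by the predicted depth
    # and the normals predicted directly by the network.
    implied = normals_from_depth(depth)
    return 1.0 - F.cosine_similarity(implied, pred_normals, dim=1).mean()
```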
---
## Phase 3: 3D Reconstruction
### 3.1 Room Shell Reconstruction
Walls, floor, ceiling are reconstructed as **planar meshes** with UV atlases:
- Walls: Extruded from detected wall planes + depth boundaries
- Floor: Planar mesh with UV-mapped texture
- Ceiling: Planar mesh with texture from inpainted ceiling view
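A sketch of the planar-shell representation with `trimesh`, building a UV-mapped floor quad; the room extents are illustrative values from the Phase 1 layout:

```python
import numpy as np
import trimesh

w, d = 4.2, 3.6  # illustrative room width/depth in meters (from Phase 1)
vertices = np.array([[0, 0, 0], [w, 0, 0], [w, 0, d], [0, 0, d]], float)
faces = np.array([[0, 2, 1], [0, 3, 2]])             # upward-facing quad
uv = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], float)

floor = trimesh.Trimesh(vertices=vertices, faces=faces, process=False)
floor.visual = trimesh.visual.TextureVisuals(uv=uv)  # texture baked in Phase 5
```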
### 3.2 Per-Object 3D Generation
Each furniture object is reconstructed using a **hybrid approach**:
**Small objects** (lamps, vases, decor): TRELLIS.2-4B → mesh with PBR
**Medium objects** (chairs, tables): TRELLIS.2-4B or InteriorFusion-L native
**Large objects** (sofas, beds, wardrobes): InteriorFusion-L with spatial constraints
The key innovation: **Spatial Constraint Injection**
- Object position is constrained by the room layout from Phase 1
- Object scale is constrained by metric depth
- Object orientation is constrained by floor plane normal
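A post-hoc sketch of what the scale and gravity constraints amount to; the function and its arguments are illustrative, while in the model the same quantities condition generation directly:

```python
import numpy as np

def apply_spatial_constraints(verts: np.ndarray, bbox_extent_m: np.ndarray,
                              floor_normal: np.ndarray,
                              floor_offset: float) -> np.ndarray:
    """Rescale a unit-scaled mesh to its metric bounding box from Phase 1,
    then translate it along the floor normal so it rests on the floor."""
    v = verts.astype(float).copy()
    extent = v.max(axis=0) - v.min(axis=0)
    v *= bbox_extent_m / np.maximum(extent, 1e-8)       # metric scale
    heights = v @ floor_normal                          # signed distance to plane
    v -= floor_normal * (heights.min() - floor_offset)  # gravity snap
    return v
```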
### 3.3 Gaussian Splatting Layer
For the entire scene, we maintain a parallel **3D Gaussian Splatting representation**:
- Fast novel view synthesis for interactive preview
- Per-object Gaussian subsets for editing
- Global scene Gaussians for background/room shell
---
## Phase 4: Scene Assembly
### 4.1 Layout Optimization
Using SpatialLM's scene graph + learned layout prior:
- Place objects at detected positions from Phase 1
- Resolve collisions using physics-based relaxation
- Ensure objects rest on floor (gravity constraint)
- Ensure objects don't intersect walls
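A toy version of the physics-based relaxation over axis-aligned bounding boxes (the production solver additionally enforces the wall constraint):

```python
import numpy as np

def relax_collisions(centers: np.ndarray, half_sizes: np.ndarray,
                     iters: int = 50, step: float = 0.5) -> np.ndarray:
    """Push apart any pair of overlapping axis-aligned boxes along the
    axis of least penetration until no intersections remain."""
    c = centers.astype(float).copy()
    for _ in range(iters):
        moved = False
        for i in range(len(c)):
            for j in range(i + 1, len(c)):
                overlap = (half_sizes[i] + half_sizes[j]) - np.abs(c[i] - c[j])
                if np.all(overlap > 0):             # boxes intersect
                    axis = int(np.argmin(overlap))  # cheapest escape axis
                    direction = np.sign(c[i, axis] - c[j, axis]) or 1.0
                    shift = direction * overlap[axis] * step * 0.5
                    c[i, axis] += shift
                    c[j, axis] -= shift
                    moved = True
        if not moved:
            break
    return c
```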
### 4.2 Scale Normalization
All objects normalized to metric scale:
- Use known furniture dimensions (e.g., a standard chair seat height of ~45 cm)
- Use depth consistency to resolve ambiguous scales
- Human-scale reference from detected people/artifacts
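As arithmetic, the known-dimension rule reduces to a single ratio; the numbers below are illustrative:

```python
# Illustrative: correct an ambiguous scale from a known reference dimension.
observed_seat_height = 0.31  # meters, read off the metric depth map
standard_seat_height = 0.45  # ~45 cm standard chair seat height
scale_correction = standard_seat_height / observed_seat_height  # ~1.45x
```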
### 4.3 Scene Graph Construction
```python
from __future__ import annotations  # SceneNode, SpatialRelation defined elsewhere

from dataclasses import dataclass
from typing import Dict, List

@dataclass
class SceneGraph:
    nodes: Dict[str, SceneNode]   # Objects + room shell
    edges: List[SpatialRelation]  # "on", "next to", "in front of", etc.
    room_type: str                # e.g. "modern_living_room", "scandinavian_kitchen"
    style: str                    # e.g. "modern", "scandinavian", "luxury", "indian"
```
---
## Phase 5: Material & Texture
### 5.1 PBR Material Generation
For each surface:
- Base color/albedo (diffuse)
- Metallic map
- Roughness map
- Normal map (bump)
- Ambient occlusion (optional)
**Model**: Custom material diffusion network fine-tuned on Hypersim + InteriorNet
### 5.2 Texture Baking
- Project multi-view generated textures onto UV atlases
- Visibility-aware blending (occlusion handling)
- Seamless tiling for large surfaces (walls, floors)
### 5.3 Lighting Estimation
Estimate scene lighting from the input image:
- HDR environment map extraction
- Key light / fill light / ambient light decomposition
- IBL (Image-Based Lighting) setup for game engines
---
## Core Model: InteriorFusion-L (4B Parameters)
### Encoder
- **Image encoder**: DINOv3-L (frozen, feature extraction)
- **Depth encoder**: Custom CNN processing metric depth map
- **Layout encoder**: Transformer processing SpatialLM scene graph tokens
- **Semantic encoder**: Mask2Former feature pyramid
### Latent Representation: SLAT-Interior
Extension of TRELLIS SLAT optimized for indoor scenes:
- Sparse 3D voxel grid, resolution 1024³
- Active voxels only on surfaces (wall, furniture)
- Per-voxel features: shape + material + semantic class
- Room-shell voxels flagged separately from object voxels
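Schematically, SLAT-Interior reduces to coordinate/feature pairs plus a shell flag; the dataclass below is an illustration, not the actual implementation:

```python
from dataclasses import dataclass

import torch

@dataclass
class SLATInterior:
    """Schematic view of the sparse latent: only surface voxels exist."""
    coords: torch.Tensor         # (N, 3) int indices into the 1024^3 grid
    feats: torch.Tensor          # (N, C) fused shape + material + semantics
    is_room_shell: torch.Tensor  # (N,) bool: room-shell vs object voxel
```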
### Decoder
Three parallel decoders:
1. **Mesh decoder**: Produces watertight or arbitrary-topology meshes (from O-Voxel)
2. **Gaussian decoder**: Produces per-voxel Gaussian parameters
3. **Material decoder**: Produces PBR material parameters per surface
### Generation Pipeline
Two-stage rectified flow (following TRELLIS pattern):
1. **Structure generation**: Dense occupancy grid → sparse structure
2. **Latent generation**: Per-active-voxel features → shape + material
Conditioned on: DINOv3 image features + depth map + room layout tokens + semantic segmentation tokens
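At inference, each stage amounts to integrating the learned velocity field from noise to data; a minimal Euler sampler sketch, where `model` and `cond` stand in for the stage's DiT and its fused conditioning:

```python
import torch

@torch.no_grad()
def rectified_flow_sample(model, cond, shape, steps: int = 25,
                          device: str = "cuda") -> torch.Tensor:
    """Euler integration of a rectified-flow velocity field v(x_t, t, cond)
    from noise at t=0 to a sample at t=1."""
    x = torch.randn(shape, device=device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), i * dt, device=device)
        x = x + model(x, t, cond) * dt  # follow the straightened path
    return x
```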
---
## Training Strategy
### Stage 1: VAE Pre-training (1 week, 8×A100)
- Train SLAT-Interior VAE on 3D-FRONT + Structured3D rooms
- Multi-resolution: 256³ → 512³ → 1024³ curriculum
- Loss: MSE reconstruction + KL divergence + depth consistency + normal consistency
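Combining the four terms listed above, the Stage 1 objective might look like the following sketch (loss weights are illustrative, not the trained values):

```python
import torch
import torch.nn.functional as F

def slat_vae_loss(recon, target, mu, logvar,
                  depth_pred, depth_gt, normal_pred, normal_gt,
                  kl_w=1e-4, depth_w=0.5, normal_w=0.5):
    """Stage 1 objective: MSE + KL + depth + normal consistency."""
    mse = F.mse_loss(recon, target)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    depth = F.l1_loss(depth_pred, depth_gt)
    normal = 1.0 - F.cosine_similarity(normal_pred, normal_gt, dim=1).mean()
    return mse + kl_w * kl + depth_w * depth + normal_w * normal
```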
### Stage 2: Flow-Matching DiT (2 weeks, 32×A100)
- Train rectified flow transformer for structure generation
- Curriculum: 256³ → 512³ → 1024³
- Conditioning: image + depth + layout
### Stage 3: Material DiT (1 week, 16×A100)
- Train material generation DiT conditioned on geometry + input image
- PBR material prediction: albedo, metallic, roughness, normal
### Stage 4: Fine-tuning (3 days, 8×A100)
- LoRA fine-tuning on real interior photos (ScanNet + HM3D)
- Domain adaptation from synthetic to real
- Reinforcement learning for geometry consistency (GRPO-style)
### Total Training: ~4 weeks on 32×A100
---
## Inference Optimization
### RTX 4090 (24GB VRAM)
- Model quantization: INT8 via GPTQ
- Gradient checkpointing disabled (inference only)
- Gaussian splatting for real-time preview
- Full mesh generation: ~15 seconds
### A100 (80GB VRAM)
- FP16 inference
- Batch generation for multiple objects
- Full pipeline: ~8 seconds
### H100 (80GB VRAM)
- BF16 inference
- ~5 seconds full generation
### Edge / Mobile
- Core depth + layout estimation only (~2 seconds)
- Cloud-based 3D generation with streaming
- Reduced mesh quality (decimated, lower texture resolution)
---
## Export Formats
| Format | Use Case | Features |
|--------|----------|----------|
| **GLB** | Web, AR, Unity, Godot | PBR materials, animations, all data |
| **FBX** | Unreal Engine, Maya, 3ds Max | Full rigging support, PBR |
| **OBJ** | Legacy compatibility | Basic materials (MTL) |
| **USDZ** | iOS AR (ARKit) | Apple's native format |
| **3DGS (.ply)** | Real-time viewing | Gaussian splatting render |
| **BLEND** | Blender native | Full editability, nodes |