# InteriorFusion Architecture Design

## Design Philosophy

InteriorFusion is built on a critical insight: **interior scenes are fundamentally different from single objects**. Current SOTA models (TRELLIS, Hunyuan3D-2, TripoSR, SF3D) are trained on object-centric datasets (Objaverse) and produce unit-cube-scaled assets. They have no concept of:

- Room topology (walls, floors, ceilings)
- Spatial relationships (table NEAR sofa, lamp ON nightstand)
- Real-world scale (meters, not arbitrary units)
- Multi-object coherence (furniture doesn't float)
- Semantic room understanding (kitchen vs bedroom vs office)

InteriorFusion addresses all of these through a **5-phase hybrid pipeline**.

---

## Phase 1: Scene Understanding

### 1.1 Metric Depth Estimation
**Model**: `depth-anything/Depth-Anything-V2-Metric-Indoor-Large-hf`

Why the metric indoor variant? It predicts depth in **real-world meters** (trained on Hypersim), which is essential for correct furniture scaling. Non-metric estimators produce only relative depth, which breaks room reconstruction.
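
A minimal inference sketch using the Hugging Face `depth-estimation` pipeline (the input file name is a placeholder):

```python
# Hedged sketch: metric depth inference with the HF depth-estimation pipeline.
from transformers import pipeline
from PIL import Image

depth_estimator = pipeline(
    "depth-estimation",
    model="depth-anything/Depth-Anything-V2-Metric-Indoor-Large-hf",
)

image = Image.open("living_room.jpg")  # placeholder input photo
result = depth_estimator(image)
depth_m = result["predicted_depth"]    # dense depth map; meters for the metric variant
print(depth_m.shape, float(depth_m.min()), float(depth_m.max()))
```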

### 1.2 Room Layout Estimation
**Model**: `manycore-research/SpatialLM-Llama-1B` (or Qwen-0.5B for Apache 2.0)

SpatialLM consumes point clouds back-projected from the metric depth map and camera intrinsics (see the sketch after the snippet) and produces structured scene scripts:
```python
from dataclasses import dataclass
from typing import List

# Plane, Doorway, Window, and ObjectBBox are geometry primitives
# defined elsewhere in the codebase.
@dataclass
class RoomLayout:
    walls: List[Plane]          # Wall planes with normals
    floor: Plane                # Floor plane
    ceiling: Plane              # Ceiling plane
    doors: List[Doorway]        # Doorway locations
    windows: List[Window]       # Window locations
    objects: List[ObjectBBox]   # Furniture bounding boxes
```
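
The back-projection itself is the standard pinhole model; a minimal sketch (`fx`, `fy`, `cx`, `cy` are the camera intrinsics):

```python
# Sketch: lift a metric depth map to a camera-space point cloud (pinhole model).
import numpy as np

def depth_to_point_cloud(depth_m: np.ndarray, fx: float, fy: float,
                         cx: float, cy: float) -> np.ndarray:
    """depth_m: (H, W) depth in meters. Returns (N, 3) points in meters."""
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    points = np.stack([x, y, depth_m], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # drop invalid (zero-depth) pixels
```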

### 1.3 Semantic Segmentation
**Model**: Mask2Former / OneFormer with indoor-trained heads

Segments the input image into the following regions (an inference sketch follows the list):
- Wall regions (with material type: paint, wallpaper, brick)
- Floor regions (wood, tile, carpet)
- Ceiling region
- Per-furniture instances (sofa, table, lamp, etc.)
- Decorative elements (plants, paintings, curtains)

### 1.4 Multi-Object Detection & Isolation
Using SAM (Segment Anything Model) with indoor priors (sketched after this list):
- Segment each furniture piece
- Extract per-object crops with alpha masks
- Remove background context for clean object generation
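
A sketch of this step with the reference `segment_anything` package (checkpoint path and input are placeholders; the indoor-prior filtering is elided):

```python
# Hedged sketch: automatic masks + per-object RGBA crops with SAM.
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

image = cv2.cvtColor(cv2.imread("living_room.jpg"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)  # dicts with "segmentation", "bbox", "area", ...

crops = []
for m in masks:
    x, y, w, h = (int(v) for v in m["bbox"])                   # XYWH box
    alpha = (m["segmentation"][y:y+h, x:x+w] * 255).astype(np.uint8)
    crops.append(np.dstack([image[y:y+h, x:x+w], alpha]))      # RGBA crop
```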

---

## Phase 2: Multi-View Generation

### 2.1 Per-Object Multi-View Diffusion
**Model**: `stabilityai/stable-zero123` or Zero123++ community pipeline

For each segmented furniture object (camera poses for the six views are sketched after this list):
- Generate 6 consistent orthographic views (0°, 60°, 120°, 180°, 240°, 300° azimuth)
- Condition on the original crop + depth edge map
- Use depth-conditioned ControlNet for geometric consistency
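
For reference, a small sketch of the camera poses the multi-view model is conditioned on (radius, elevation, and look-at convention are assumptions):

```python
# Sketch: camera-to-world poses at 6 evenly spaced azimuths, looking at the
# object center (OpenGL convention: camera looks down -Z).
import numpy as np

def azimuth_camera_poses(radius=1.5, elevation_deg=15.0, n_views=6):
    elev = np.deg2rad(elevation_deg)
    poses = []
    for k in range(n_views):
        az = np.deg2rad(k * 360.0 / n_views)  # 0, 60, 120, 180, 240, 300 deg
        eye = radius * np.array([np.cos(elev) * np.sin(az),
                                 np.sin(elev),
                                 np.cos(elev) * np.cos(az)])
        f = -eye / np.linalg.norm(eye)                            # toward origin
        s = np.cross(f, [0.0, 1.0, 0.0]); s /= np.linalg.norm(s)  # right
        u = np.cross(s, f)                                        # true up
        pose = np.eye(4)
        pose[:3, :3] = np.stack([s, u, -f], axis=1)
        pose[:3, 3] = eye
        poses.append(pose)
    return poses
```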

### 2.2 Room Shell Multi-View
For walls, floor, ceiling:
- Generate panoramic-style extended views from the single image
- Use depth-guided inpainting for occluded regions
- Produce ceiling, floor, and wall texture atlases

### 2.3 Depth-Conditioned View Synthesis
Condition all multi-view generation on the metric depth map:
- Depth acts as a geometric prior preventing shape hallucination
- Cross-view depth consistency enforced via a depth-normal consistency loss (one formulation sketched below)
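
One plausible formulation of that loss, as a sketch (first-order normal approximation; not the exact training objective):

```python
# Sketch: penalize disagreement between normals derived from predicted depth
# and directly predicted normals (1 - cosine similarity).
import torch
import torch.nn.functional as F

def depth_to_normals(depth: torch.Tensor, fx: float, fy: float) -> torch.Tensor:
    """depth: (B, 1, H, W). First-order normals from central differences."""
    dz_dx = F.pad((depth[..., :, 2:] - depth[..., :, :-2]) / 2.0, (1, 1, 0, 0))
    dz_dy = F.pad((depth[..., 2:, :] - depth[..., :-2, :]) / 2.0, (0, 0, 1, 1))
    n = torch.cat([-dz_dx * fx, -dz_dy * fy, torch.ones_like(depth)], dim=1)
    return F.normalize(n, dim=1)

def depth_normal_consistency_loss(pred_depth, pred_normals, fx, fy):
    derived = depth_to_normals(pred_depth, fx, fy)
    return (1.0 - (derived * pred_normals).sum(dim=1)).mean()
```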

---

## Phase 3: 3D Reconstruction

### 3.1 Room Shell Reconstruction
Walls, floor, ceiling are reconstructed as **planar meshes** with UV atlases:
- Walls: Extruded from detected wall planes + depth boundaries
- Floor: Planar mesh with UV-mapped texture
- Ceiling: Planar mesh with texture from inpainted ceiling view

### 3.2 Per-Object 3D Generation
Each furniture object is reconstructed using a **hybrid approach**:

- **Small objects** (lamps, vases, decor): TRELLIS.2-4B → mesh with PBR
- **Medium objects** (chairs, tables): TRELLIS.2-4B or InteriorFusion-L native
- **Large objects** (sofas, beds, wardrobes): InteriorFusion-L with spatial constraints

The key innovation is **Spatial Constraint Injection** (a sketch of the conditioning record follows the list):
- Object position is constrained by the room layout from Phase 1
- Object scale is constrained by metric depth
- Object orientation is constrained by floor plane normal
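
A hypothetical sketch of the conditioning record (field names and the `bbox`/`floor` attributes are illustrative, not the actual interface):

```python
# Hypothetical conditioning record injected into per-object generation.
from dataclasses import dataclass
import numpy as np

@dataclass
class SpatialConstraints:
    position: np.ndarray   # (3,) object center from the Phase 1 layout, meters
    scale: np.ndarray      # (3,) bbox extents derived from metric depth, meters
    up_axis: np.ndarray    # (3,) floor plane normal constraining orientation

def constraints_from_layout(bbox, floor):
    # bbox: an ObjectBBox from RoomLayout; floor: the floor Plane
    # (.center, .extents, .normal are assumed attribute names)
    return SpatialConstraints(position=bbox.center,
                              scale=bbox.extents,
                              up_axis=floor.normal)
```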

### 3.3 Gaussian Splatting Layer
For the entire scene, we maintain a parallel **3D Gaussian Splatting representation**:
- Fast novel view synthesis for interactive preview
- Per-object Gaussian subsets for editing
- Global scene Gaussians for background/room shell

---

## Phase 4: Scene Assembly

### 4.1 Layout Optimization
Using SpatialLM's scene graph + a learned layout prior (a toy relaxation pass is sketched after this list):
- Place objects at detected positions from Phase 1
- Resolve collisions using physics-based relaxation
- Ensure objects rest on floor (gravity constraint)
- Ensure objects don't intersect walls
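
A toy version of the relaxation pass (real collision handling uses full 3D meshes; this shows the idea on 2D floor footprints):

```python
# Sketch: push apart overlapping axis-aligned floor-plane footprints.
import numpy as np

def relax_layout(centers, half_extents, iters=100, step=0.5):
    """centers, half_extents: (N, 2) arrays in meters. Wall and gravity
    constraints are applied in separate passes."""
    centers = centers.copy()
    for _ in range(iters):
        moved = False
        for i in range(len(centers)):
            for j in range(i + 1, len(centers)):
                delta = centers[j] - centers[i]
                overlap = (half_extents[i] + half_extents[j]) - np.abs(delta)
                if np.all(overlap > 0):              # footprints intersect
                    axis = int(np.argmin(overlap))   # least-penetration axis
                    direction = 1.0 if delta[axis] >= 0 else -1.0
                    push = step * overlap[axis] * direction
                    centers[i][axis] -= push / 2
                    centers[j][axis] += push / 2
                    moved = True
        if not moved:
            break
    return centers
```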

### 4.2 Scale Normalization
All objects normalized to metric scale:
- Use known furniture dimensions (e.g., standard chair seat height ≈ 45 cm)
- Use depth consistency to resolve ambiguous scales
- Human-scale reference from detected people/artifacts

### 4.3 Scene Graph Construction
```python
from dataclasses import dataclass
from typing import Dict, List

# SceneNode and SpatialRelation are defined elsewhere in the codebase.
@dataclass
class SceneGraph:
    nodes: Dict[str, SceneNode]     # Objects + room shell
    edges: List[SpatialRelation]    # "on", "next to", "in front of", etc.
    room_type: str                  # e.g. "modern_living_room", "scandinavian_kitchen"
    style: str                      # "modern", "scandinavian", "luxury", "indian"
```

---

## Phase 5: Material & Texture

### 5.1 PBR Material Generation
For each surface:
- Base color/albedo (diffuse)
- Metallic map
- Roughness map
- Normal map (bump)
- Ambient occlusion (optional)

**Model**: Custom material diffusion network fine-tuned on Hypersim + InteriorNet
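
The output per surface can be pictured as a glTF-style metallic-roughness texture set; a sketch of the record (field layout is illustrative):

```python
# Sketch: per-surface PBR texture set in the metallic-roughness workflow.
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class PBRMaterial:
    albedo: np.ndarray                      # (H, W, 3) base color, sRGB
    metallic: np.ndarray                    # (H, W) 0 = dielectric, 1 = metal
    roughness: np.ndarray                   # (H, W) 0 = mirror, 1 = fully diffuse
    normal: np.ndarray                      # (H, W, 3) tangent-space normal map
    occlusion: Optional[np.ndarray] = None  # (H, W) ambient occlusion, optional
```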

### 5.2 Texture Baking
- Project multi-view generated textures onto UV atlases
- Visibility-aware blending (occlusion handling)
- Seamless tiling for large surfaces (walls, floors)

### 5.3 Lighting Estimation
Estimate scene lighting from the input image:
- HDR environment map extraction
- Key light / fill light / ambient light decomposition
- IBL (Image-Based Lighting) setup for game engines

---

## Core Model: InteriorFusion-L (4B Parameters)

### Encoder
- **Image encoder**: DINOv3-L (frozen, feature extraction)
- **Depth encoder**: Custom CNN processing metric depth map
- **Layout encoder**: Transformer processing SpatialLM scene graph tokens
- **Semantic encoder**: Mask2Former feature pyramid

### Latent Representation: SLAT-Interior
Extension of TRELLIS SLAT optimized for indoor scenes:
- Sparse 3D voxel grid, resolution 1024³
- Active voxels only on surfaces (wall, furniture)
- Per-voxel features: shape + material + semantic class
- Room-shell voxels flagged separately from object voxels
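
A sketch of the sparse layout (a coordinate/feature list rather than a dense 1024³ grid; names are illustrative):

```python
# Sketch: SLAT-Interior stores features only for active surface voxels.
import torch

class SparseInteriorLatent:
    def __init__(self, coords: torch.Tensor, feats: torch.Tensor,
                 is_room_shell: torch.Tensor, resolution: int = 1024):
        assert coords.shape[0] == feats.shape[0] == is_room_shell.shape[0]
        self.coords = coords                # (N, 3) int indices in [0, resolution)
        self.feats = feats                  # (N, C) shape + material + semantics
        self.is_room_shell = is_room_shell  # (N,) bool: shell vs object voxel
        self.resolution = resolution
```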

### Decoder
Three parallel decoders:
1. **Mesh decoder**: Produces watertight or arbitrary-topology meshes (from O-Voxel)
2. **Gaussian decoder**: Produces per-voxel Gaussian parameters
3. **Material decoder**: Produces PBR material parameters per surface

### Generation Pipeline
Two-stage rectified flow (following TRELLIS pattern):
1. **Structure generation**: Dense occupancy grid → sparse structure
2. **Latent generation**: Per-active-voxel features → shape + material

Conditioned on: DINOv3 image features + depth map + room layout tokens + semantic segmentation tokens
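
The sampling loop for each stage is a plain ODE integration of the learned velocity field; a generic sketch (the `velocity_net(x, t, cond)` interface is an assumption, not TRELLIS's actual API):

```python
# Sketch: Euler sampler for a rectified-flow model, noise (t=0) -> data (t=1).
import torch

@torch.no_grad()
def rectified_flow_sample(velocity_net, latent_shape, cond, steps=25, device="cuda"):
    x = torch.randn(latent_shape, device=device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((latent_shape[0],), i * dt, device=device)
        x = x + velocity_net(x, t, cond) * dt  # one Euler step along the flow
    return x
```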

---

## Training Strategy

### Stage 1: VAE Pre-training (1 week, 8×A100)
- Train SLAT-Interior VAE on 3D-FRONT + Structured3D rooms
- Multi-resolution: 256³ → 512³ → 1024³ curriculum
- Loss: MSE reconstruction + KL divergence + depth consistency + normal consistency

### Stage 2: Flow-Matching DiT (2 weeks, 32×A100)
- Train rectified flow transformer for structure generation
- Curriculum: 256³ → 512³ → 1024³
- Conditioning: image + depth + layout

### Stage 3: Material DiT (1 week, 16×A100)
- Train material generation DiT conditioned on geometry + input image
- PBR material prediction: albedo, metallic, roughness, normal

### Stage 4: Fine-tuning (3 days, 8×A100)
- LoRA fine-tuning on real interior photos (ScanNet + HM3D)
- Domain adaptation from synthetic to real
- Reinforcement learning for geometry consistency (GRPO-style)
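
A sketch of the adapter setup with `peft` (the `target_modules` names are placeholders for the DiT's attention projections):

```python
# Hedged sketch: LoRA adapters for Stage 4 domain adaptation.
from peft import LoraConfig, get_peft_model

def add_stage4_lora(dit_model):
    config = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # placeholder names
    )
    return get_peft_model(dit_model, config)
```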

### Total Training: ~4.5 weeks end-to-end (peak 32×A100)

---

## Inference Optimization

### RTX 4090 (24GB VRAM)
- Model quantization: INT8 via GPTQ
- No gradient computation or checkpointing (inference only)
- Gaussian splatting for real-time preview
- Full mesh generation: ~15 seconds

### A100 (80GB VRAM)
- FP16 inference
- Batch generation for multiple objects
- Full pipeline: ~8 seconds

### H100 (80GB VRAM)
- BF16 inference
- ~5 seconds full generation

### Edge / Mobile
- Core depth + layout estimation only (~2 seconds)
- Cloud-based 3D generation with streaming
- Reduced mesh quality (decimated, lower texture resolution)

---

## Export Formats

| Format | Use Case | Features |
|--------|----------|----------|
| **GLB** | Web, AR, Unity, Godot | PBR materials, animations, all data |
| **FBX** | Unreal Engine, Maya, 3ds Max | Full rigging support, PBR |
| **OBJ** | Legacy compatibility | Basic materials (MTL) |
| **USDZ** | iOS AR (ARKit) | Apple's native format |
| **3DGS (.ply)** | Real-time viewing | Gaussian splatting render |
| **BLEND** | Blender native | Full editability, nodes |
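
For GLB (and OBJ/PLY), `trimesh` covers the mechanics; a minimal sketch with a placeholder box standing in for an assembled scene (the real exporter also bakes PBR textures):

```python
# Sketch: export an assembled scene to GLB with trimesh.
import trimesh

scene = trimesh.Scene()
scene.add_geometry(trimesh.creation.box(extents=[2.0, 0.9, 0.9]), node_name="sofa")
scene.export("interior_scene.glb")  # extension selects the format (.obj, .ply, ...)
```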