stevee00 committed on
Commit
8af6a60
·
verified ·
1 Parent(s): b61be7d

Upload ARCHITECTURE.md

  1. ARCHITECTURE.md +248 -0
ARCHITECTURE.md ADDED
# InteriorFusion Architecture Design

## Design Philosophy

InteriorFusion is built on one observation: **interior scenes are fundamentally different from single objects**. Current state-of-the-art models (TRELLIS, Hunyuan3D-2, TripoSR, SF3D) are trained on object-centric datasets such as Objaverse and produce assets normalized to a unit cube. They have no concept of:

- Room topology (walls, floors, ceilings)
- Spatial relationships (table NEAR sofa, lamp ON nightstand)
- Real-world scale (meters, not arbitrary units)
- Multi-object coherence (furniture doesn't float)
- Semantic room understanding (kitchen vs. bedroom vs. office)

InteriorFusion addresses all of these through a **5-phase hybrid pipeline**.

---

## Phase 1: Scene Understanding

### 1.1 Metric Depth Estimation
**Model**: `depth-anything/Depth-Anything-V2-Metric-Indoor-Large-hf`

Why the metric indoor variant? It predicts depth in **real-world meters** (it is fine-tuned on Hypersim), which is essential for correct furniture scaling. Non-metric estimators produce only relative depth, defined up to an unknown scale and shift, so room dimensions cannot be recovered from them.
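
To make the scaling argument concrete, here is a minimal sketch of how a pixel extent becomes a length in meters under the pinhole camera model. The function name and the numbers in the example are illustrative assumptions, not part of InteriorFusion.

```python
def pixel_extent_to_meters(pixel_width: float, depth_m: float, focal_px: float) -> float:
    """Pinhole model: an extent of `pixel_width` pixels seen at `depth_m` meters
    spans pixel_width * depth_m / focal_px meters (fronto-parallel assumption)."""
    if focal_px <= 0:
        raise ValueError("focal length must be positive")
    return pixel_width * depth_m / focal_px


# Hypothetical numbers: a sofa 400 px wide, 3 m away, focal length 600 px.
sofa_width_m = pixel_extent_to_meters(400, 3.0, 600.0)
```

With relative depth, `depth_m` would be known only up to an unknown scale, so the same call would return a width in arbitrary units.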

### 1.2 Room Layout Estimation
**Model**: `manycore-research/SpatialLM-Llama-1B` (or the Qwen-0.5B variant for Apache 2.0 licensing)

SpatialLM processes point clouds, here back-projected from the depth map using the camera intrinsics, and produces structured scene scripts:
```python
from __future__ import annotations  # lets the dataclass reference types defined elsewhere

from dataclasses import dataclass
from typing import List

# Plane, Doorway, Window, and ObjectBBox are assumed to be defined elsewhere.
@dataclass
class RoomLayout:
    walls: List[Plane]          # Wall planes with normals
    floor: Plane                # Floor plane
    ceiling: Plane              # Ceiling plane
    doors: List[Doorway]        # Doorway locations
    windows: List[Window]       # Window locations
    objects: List[ObjectBBox]   # Furniture bounding boxes
```
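
The point cloud that feeds the layout model can be obtained by back-projecting the metric depth map through the camera intrinsics. The following is a minimal pure-Python sketch of that step; the function name `backproject_depth` and the intrinsics in the example are assumptions made for illustration.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]


def backproject_depth(depth: List[List[float]], fx: float, fy: float,
                      cx: float, cy: float) -> List[Point3D]:
    """Back-project a metric depth map (meters) into camera-space 3D points.

    Pixel (u, v) with depth z maps to ((u - cx) * z / fx, (v - cy) * z / fy, z).
    Pixels with non-positive depth are treated as invalid and skipped.
    """
    points: List[Point3D] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0.0:
                continue
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


# Tiny 2x2 depth map with one invalid pixel; the intrinsics are made-up values.
cloud = backproject_depth([[2.0, 2.0], [0.0, 4.0]], fx=2.0, fy=2.0, cx=0.5, cy=0.5)
```

A production version would vectorize this over the whole image, but the per-pixel formula is the same.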

### 1.3 Semantic Segmentation
**Model**: Mask2Former / OneFormer with indoor-trained heads

Segments the input image into:
- Wall regions (with material type: paint, wallpaper, brick)
- Floor regions (wood, tile, carpet)
- Ceiling region
- Per-furniture instances (sofa, table, lamp, etc.)
- Decorative elements (plants, paintings, curtains)

### 1.4 Multi-Object Detection & Isolation
Using SAM (Segment Anything Model) with indoor priors:
- Segment each furniture piece
- Extract per-object crops with alpha masks
- Remove background context for clean object generation
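
The crop-and-mask step can be sketched without any imaging library. The example below assumes a binary mask (as a segmenter such as SAM would provide) and an RGB image stored as nested lists; `mask_bbox` and `crop_rgba` are hypothetical helper names.

```python
from typing import List, Optional, Tuple

RGB = Tuple[int, int, int]


def mask_bbox(mask: List[List[int]]) -> Optional[Tuple[int, int, int, int]]:
    """Return inclusive (top, left, bottom, right) of the nonzero pixels, or None."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    width = len(mask[0]) if mask else 0
    cols = [c for c in range(width) if any(row[c] for row in mask)]
    if not rows or not cols:
        return None
    return rows[0], cols[0], rows[-1], cols[-1]


def crop_rgba(image: List[List[RGB]], mask: List[List[int]]) -> List[List[Tuple[int, ...]]]:
    """Crop the image to the mask's bounding box and attach an alpha channel:
    255 where the mask is set, 0 where it is background."""
    box = mask_bbox(mask)
    if box is None:
        return []
    top, left, bottom, right = box
    return [
        [(*image[r][c], 255 if mask[r][c] else 0) for c in range(left, right + 1)]
        for r in range(top, bottom + 1)
    ]


# 3x3 uniform image with an L-shaped object mask in the lower-right corner.
image = [[(10, 20, 30)] * 3 for _ in range(3)]
mask = [[0, 0, 0], [0, 1, 1], [0, 0, 1]]
crop = crop_rgba(image, mask)
```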

---

## Phase 2: Multi-View Generation

### 2.1 Per-Object Multi-View Diffusion
**Model**: `stabilityai/stable-zero123` or Zero123++ community pipeline

For each segmented furniture object:
- Generate 6 consistent orthographic views at 60° azimuth steps (0°, 60°, 120°, 180°, 240°, 300°)
- Condition on the original crop + depth edge map
- Use depth-conditioned ControlNet for geometric consistency
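
The six azimuths listed above correspond to cameras evenly spaced on a circle around the object. The sketch below computes such camera centers; the radius, the zero elevation, and the y-up convention are assumptions chosen for illustration, since the source does not specify them.

```python
import math
from typing import List, Tuple


def orbit_camera_positions(num_views: int = 6, radius: float = 2.0,
                           elevation_deg: float = 0.0) -> List[Tuple[float, float, float]]:
    """Camera centers evenly spaced in azimuth on a circle around the origin (y up)."""
    el = math.radians(elevation_deg)
    positions = []
    for i in range(num_views):
        az = math.radians(i * 360.0 / num_views)
        positions.append((radius * math.cos(el) * math.sin(az),
                          radius * math.sin(el),
                          radius * math.cos(el) * math.cos(az)))
    return positions


azimuths = [i * 360 // 6 for i in range(6)]  # [0, 60, 120, 180, 240, 300]
cameras = orbit_camera_positions()
```

Each camera looks at the origin, so every view sees the object from the same distance.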

### 2.2 Room Shell Multi-View
For walls, floor, ceiling:
- Generate panoramic-style extended views from the single image
- Use depth-guided inpainting for occluded regions
- Produce ceiling, floor, and wall texture atlases

### 2.3 Depth-Conditioned View Synthesis
Condition all multi-view generation on the metric depth map:
- Depth acts as a geometric prior that discourages shape hallucination
- Cross-view consistency is enforced via a depth-normal consistency loss

---

## Phase 3: 3D Reconstruction

### 3.1 Room Shell Reconstruction
Walls, floor, and ceiling are reconstructed as **planar meshes** with UV atlases:
- Walls: Extruded from detected wall planes + depth boundaries
- Floor: Planar mesh with UV-mapped texture
- Ceiling: Planar mesh with texture from inpainted ceiling view

### 3.2 Per-Object 3D Generation
Each furniture object is reconstructed using a **hybrid approach**:

- **Small objects** (lamps, vases, decor): TRELLIS.2-4B → mesh with PBR
- **Medium objects** (chairs, tables): TRELLIS.2-4B or InteriorFusion-L native
- **Large objects** (sofas, beds, wardrobes): InteriorFusion-L with spatial constraints

The key innovation is **Spatial Constraint Injection**:
- Object position is constrained by the room layout from Phase 1
- Object scale is constrained by metric depth
- Object orientation is constrained by the floor plane normal
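
One way to picture the scale and position constraints is as a similarity transform applied to a unit-normalized asset. The sketch below assumes the asset's longest side has length 1 and is centered at the origin, and that Phase 1 supplies a metric axis-aligned bounding box. `fit_unit_asset_to_bbox` is a hypothetical helper; InteriorFusion-L may well inject these constraints during generation rather than after it.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]


def fit_unit_asset_to_bbox(bbox_min: Vec3, bbox_max: Vec3) -> Tuple[float, Vec3]:
    """Return (uniform_scale, translation) that maps a unit-normalized asset
    (longest side 1, centered at the origin) onto a metric bounding box."""
    sizes = tuple(hi - lo for lo, hi in zip(bbox_min, bbox_max))
    if min(sizes) <= 0:
        raise ValueError("bounding box must have positive extent on every axis")
    scale = max(sizes)  # the asset's longest side matches the box's longest side
    center = tuple((lo + hi) / 2 for lo, hi in zip(bbox_min, bbox_max))
    return scale, center


# Hypothetical sofa box from Phase 1: 2.0 m wide, 0.8 m tall, 1.0 m deep.
scale, center = fit_unit_asset_to_bbox((1.0, 0.0, 2.0), (3.0, 0.8, 3.0))
```

Applying `vertex * scale + center` to every vertex places the asset in room coordinates; orientation about the floor normal would be handled by a separate rotation.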

### 3.3 Gaussian Splatting Layer
For the entire scene, we maintain a parallel **3D Gaussian Splatting representation**:
- Fast novel view synthesis for interactive preview
- Per-object Gaussian subsets for editing
- Global scene Gaussians for background/room shell

---

## Phase 4: Scene Assembly

### 4.1 Layout Optimization
Using SpatialLM's scene graph + learned layout prior:
- Place objects at detected positions from Phase 1
- Resolve collisions using physics-based relaxation
- Ensure objects rest on floor (gravity constraint)
- Ensure objects don't intersect walls
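
The gravity and wall constraints can be illustrated for the simplest case: axis-aligned boxes in a rectangular room. This is a sketch under those assumptions only; the `Box` type and `settle_in_room` are hypothetical names, and object-object collision resolution (the physics-based relaxation) is not shown.

```python
from dataclasses import dataclass


@dataclass
class Box:
    x: float  # center, meters
    y: float  # y is up
    z: float
    w: float  # full extent along x
    h: float  # full extent along y
    d: float  # full extent along z


def settle_in_room(box: Box, room_w: float, room_d: float, floor_y: float = 0.0) -> Box:
    """Drop the box onto the floor and clamp it inside a rectangular room
    spanning [0, room_w] x [0, room_d] in the x-z plane.
    Assumes the box is smaller than the room."""
    half_w, half_d = box.w / 2, box.d / 2
    x = min(max(box.x, half_w), room_w - half_w)
    z = min(max(box.z, half_d), room_d - half_d)
    y = floor_y + box.h / 2  # bottom face rests on the floor
    return Box(x, y, z, box.w, box.h, box.d)


# A sofa detected floating 0.3 m above the floor and poking through two walls.
sofa = settle_in_room(Box(x=-0.2, y=0.7, z=3.9, w=2.0, h=0.8, d=1.0),
                      room_w=5.0, room_d=4.0)
```

Clamping moves each object by the smallest displacement that satisfies the walls, which keeps it as close as possible to the position detected in Phase 1.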

### 4.2 Scale Normalization
All objects are normalized to metric scale:
- Use known furniture dimensions (e.g., standard chair seat height ~45 cm)
- Use depth consistency to resolve ambiguous scales
- Use human-scale references from detected people or other objects of known size
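
A simple way to combine several size priors is to compute one scale ratio per reference object and take the median, which tolerates a single bad detection. The prior values and measured heights below are illustrative assumptions, and `global_scale` is a hypothetical helper.

```python
from statistics import median
from typing import Dict

# Illustrative size priors in meters (assumed values, not from the source).
HEIGHT_PRIORS_M: Dict[str, float] = {"chair_seat": 0.45, "dining_table": 0.75, "door": 2.0}


def global_scale(measured: Dict[str, float], priors: Dict[str, float]) -> float:
    """Median of expected/measured height ratios over objects that have a prior."""
    ratios = [priors[name] / h for name, h in measured.items()
              if name in priors and h > 0]
    if not ratios:
        raise ValueError("no reference object with a known size prior")
    return median(ratios)


# Heights measured in the reconstruction's arbitrary units.
scene_scale = global_scale({"chair_seat": 0.9, "dining_table": 1.5, "door": 4.1},
                           HEIGHT_PRIORS_M)
```

Multiplying every vertex coordinate by `scene_scale` converts the scene to meters.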

### 4.3 Scene Graph Construction
```python
from __future__ import annotations  # SceneNode and SpatialRelation are assumed to be defined elsewhere

from dataclasses import dataclass
from typing import Dict, List

@dataclass
class SceneGraph:
    nodes: Dict[str, SceneNode]    # Objects + room shell
    edges: List[SpatialRelation]   # "on", "next to", "in front of", etc.
    room_type: str                 # "modern_living_room", "scandinavian_kitchen"
    style: str                     # "modern", "scandinavian", "luxury", "indian"
```
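
As an illustration of how edges such as "on" might be derived, the sketch below infers support relations from axis-aligned bounding boxes. The `Node` type, the 2 cm tolerance, and the y-up convention are assumptions for this example and are not taken from the source.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Node:
    name: str
    min_corner: Vec3  # (x, y, z) in meters, y up
    max_corner: Vec3


def overlaps_xz(a: Node, b: Node) -> bool:
    """True if the two footprints overlap in the horizontal (x-z) plane."""
    return (a.min_corner[0] < b.max_corner[0] and b.min_corner[0] < a.max_corner[0]
            and a.min_corner[2] < b.max_corner[2] and b.min_corner[2] < a.max_corner[2])


def infer_on_edges(nodes: List[Node], tol: float = 0.02) -> List[Tuple[str, str, str]]:
    """Emit (subject, "on", support) when the subject's bottom face touches the
    support's top face within `tol` meters and their footprints overlap."""
    edges = []
    for a in nodes:
        for b in nodes:
            if a is not b and overlaps_xz(a, b) \
                    and abs(a.min_corner[1] - b.max_corner[1]) <= tol:
                edges.append((a.name, "on", b.name))
    return edges


nodes = [
    Node("nightstand", (0.0, 0.0, 0.0), (0.5, 0.6, 0.5)),
    Node("lamp", (0.1, 0.6, 0.1), (0.3, 1.0, 0.3)),
]
edges = infer_on_edges(nodes)
```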

---

## Phase 5: Material & Texture

### 5.1 PBR Material Generation
For each surface:
- Base color/albedo (diffuse)
- Metallic map
- Roughness map
- Normal map (bump)
- Ambient occlusion (optional)

**Model**: Custom material diffusion network fine-tuned on Hypersim + InteriorNet

### 5.2 Texture Baking
- Project multi-view generated textures onto UV atlases
- Visibility-aware blending (occlusion handling)
- Seamless tiling for large surfaces (walls, floors)
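
Visibility-aware blending is commonly implemented as a weighted average in which each view's weight depends on how directly it sees the surface, with occluded views receiving zero weight. The sketch below shows that idea for a single texel; the cosine weighting and the `blend_texel` name are assumptions, since the source does not state the exact scheme.

```python
from typing import Sequence, Tuple

Color = Tuple[float, float, float]


def blend_texel(samples: Sequence[Tuple[Color, float, bool]]) -> Color:
    """Blend one texel from several views.

    Each sample is (color, cos_angle, visible), where cos_angle is the cosine
    between the surface normal and the direction to the camera. Occluded or
    back-facing samples contribute zero weight.
    """
    total = 0.0
    acc = [0.0, 0.0, 0.0]
    for color, cos_angle, visible in samples:
        weight = max(cos_angle, 0.0) if visible else 0.0
        total += weight
        for i in range(3):
            acc[i] += weight * color[i]
    if total == 0.0:
        return (0.0, 0.0, 0.0)  # no view sees this texel; leave it for inpainting
    return tuple(c / total for c in acc)


texel = blend_texel([
    ((1.0, 0.0, 0.0), 0.75, True),   # red, seen fairly head-on
    ((0.0, 1.0, 0.0), 0.25, True),   # green, seen at a grazing angle
    ((0.0, 0.0, 1.0), 1.00, False),  # blue, occluded in this view
])
```

Here the occluded blue sample is ignored and the result is a 3:1 mix of red and green.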

### 5.3 Lighting Estimation
Estimate scene lighting from the input image:
- HDR environment map extraction
- Key light / fill light / ambient light decomposition
- IBL (Image-Based Lighting) setup for game engines

---

## Core Model: InteriorFusion-L (4B Parameters)

### Encoder
- **Image encoder**: DINOv3-L (frozen, feature extraction)
- **Depth encoder**: Custom CNN processing metric depth map
- **Layout encoder**: Transformer processing SpatialLM scene graph tokens
- **Semantic encoder**: Mask2Former feature pyramid

### Latent Representation: SLAT-Interior
An extension of TRELLIS's SLAT (structured latent) representation, optimized for indoor scenes:
- Sparse 3D voxel grid, resolution 1024³
- Active voxels only on surfaces (wall, furniture)
- Per-voxel features: shape + material + semantic class
- Room-shell voxels flagged separately from object voxels
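
The reason for sparsity is easy to see numerically: a dense 1024³ grid has over a billion cells, while only surface cells need features. The sketch below stores active voxels in a dictionary keyed by integer index. The 8 m room size and the label encoding (0 for room shell, 1 for object) are assumptions made for the example.

```python
from typing import Dict, Iterable, Tuple

VoxelIndex = Tuple[int, int, int]


def voxelize_sparse(points: Iterable[Tuple[float, float, float, int]],
                    room_size_m: float, resolution: int = 1024) -> Dict[VoxelIndex, int]:
    """Map labeled surface points (x, y, z, label) with coordinates in
    [0, room_size_m] to active voxel indices. Later points overwrite earlier
    ones that fall into the same voxel."""
    voxels: Dict[VoxelIndex, int] = {}
    for x, y, z, label in points:
        idx = tuple(min(int(c / room_size_m * resolution), resolution - 1)
                    for c in (x, y, z))
        voxels[idx] = label
    return voxels


dense_cells = 1024 ** 3  # 1,073,741,824 cells if stored densely

# Three surface samples; the last two land in the same voxel.
surface = voxelize_sparse(
    [(0.0, 0.0, 0.0, 0), (4.001, 1.2, 3.999, 1), (4.002, 1.2, 3.999, 1)],
    room_size_m=8.0,
)
```

At this resolution an 8 m room gives voxels of about 7.8 mm per side.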

### Decoder
Three parallel decoders:
1. **Mesh decoder**: Produces watertight or arbitrary-topology meshes (from O-Voxel)
2. **Gaussian decoder**: Produces per-voxel Gaussian parameters
3. **Material decoder**: Produces PBR material parameters per surface

### Generation Pipeline
Two-stage rectified flow (following the TRELLIS pattern):
1. **Structure generation**: Dense occupancy grid → sparse structure
2. **Latent generation**: Per-active-voxel features → shape + material

Conditioned on: DINOv3 image features + depth map + room layout tokens + semantic segmentation tokens
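
Sampling from a rectified flow amounts to integrating an ordinary differential equation whose velocity field is predicted by the network. The toy sketch below replaces the network with an analytic velocity so the integration loop can be run on its own. The time convention (t = 0 is noise, t = 1 is data) and all names are assumptions for this example; TRELLIS or InteriorFusion-L may parameterize time differently.

```python
from typing import Callable, List

Vector = List[float]


def euler_sample(x0: Vector, velocity: Callable[[Vector, float], Vector],
                 steps: int = 10) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


TARGET = [1.0, -2.0, 0.5]  # stands in for a data sample (e.g. one latent vector)


def toy_velocity(x: Vector, t: float) -> Vector:
    """Exact velocity of the straight path x_t = (1 - t) * noise + t * TARGET,
    which equals (TARGET - x_t) / (1 - t). A trained DiT would predict this."""
    return [(ti - xi) / (1.0 - t) for ti, xi in zip(TARGET, x)]


sample = euler_sample([0.3, 0.3, 0.3], toy_velocity, steps=10)
```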

---

## Training Strategy

### Stage 1: VAE Pre-training (1 week, 8×A100)
- Train SLAT-Interior VAE on 3D-FRONT + Structured3D rooms
- Multi-resolution: 256³ → 512³ → 1024³ curriculum
- Loss: MSE reconstruction + KL divergence + depth consistency + normal consistency

### Stage 2: Flow-Matching DiT (2 weeks, 32×A100)
- Train rectified flow transformer for structure generation
- Curriculum: 256³ → 512³ → 1024³
- Conditioning: image + depth + layout

### Stage 3: Material DiT (1 week, 16×A100)
- Train material generation DiT conditioned on geometry + input image
- PBR material prediction: albedo, metallic, roughness, normal

### Stage 4: Fine-tuning (3 days, 8×A100)
- LoRA fine-tuning on real interior photos (ScanNet + HM3D)
- Domain adaptation from synthetic to real
- Reinforcement learning for geometry consistency (GRPO-style)

### Total Training: ~4.5 weeks run sequentially, peaking at 32×A100
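
Summing the per-stage budgets listed above gives the overall wall-clock time and GPU-hours. The short calculation below assumes 7-day weeks and stages run back to back with no overlap.

```python
# Stage name -> (duration in days, number of A100 GPUs), taken from the stage headings.
stages = {
    "vae_pretraining": (7, 8),
    "flow_matching_dit": (14, 32),
    "material_dit": (7, 16),
    "fine_tuning": (3, 8),
}

total_days = sum(days for days, _ in stages.values())                    # 31 days
gpu_hours = sum(days * 24 * gpus for days, gpus in stages.values())      # 15,360 GPU-hours
peak_gpus = max(gpus for _, gpus in stages.values())                     # 32
```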

---

## Inference Optimization

### RTX 4090 (24GB VRAM)
- Model quantization: INT8 via GPTQ
- No gradient tracking or gradient checkpointing (inference only)
- Gaussian splatting for real-time preview
- Full mesh generation: ~15 seconds
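
A rough weight-memory estimate shows why quantization helps on a 24 GB card. The calculation below counts only the 4B parameters of the core model; activations, sparse voxel features, and the auxiliary Phase 1 models are extra and are not estimated here.

```python
BYTES_PER_PARAM = {"int8": 1, "fp16": 2, "bf16": 2, "fp32": 4}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Memory for the weights alone, in GiB (2**30 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


int8_gib = weight_memory_gib(4e9, "int8")   # about 3.7 GiB
fp16_gib = weight_memory_gib(4e9, "fp16")   # about 7.5 GiB
```

INT8 halves the weight footprint relative to FP16, leaving more headroom for activations and the Gaussian preview.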

### A100 (80GB VRAM)
- FP16 inference
- Batch generation for multiple objects
- Full pipeline: ~8 seconds

### H100 (80GB VRAM)
- BF16 inference
- Full generation: ~5 seconds

### Edge / Mobile
- Core depth + layout estimation only (~2 seconds)
- Cloud-based 3D generation with streaming
- Reduced mesh quality (decimated, lower texture resolution)

---

## Export Formats

| Format | Use Case | Features |
|--------|----------|----------|
| **GLB** | Web, AR, Unity, Godot | PBR materials, animations, all data in a single binary file |
| **FBX** | Unreal Engine, Maya, 3ds Max | Full rigging support, PBR |
| **OBJ** | Legacy compatibility | Basic materials (MTL) |
| **USDZ** | iOS AR (ARKit) | Apple's native format |
| **3DGS (.ply)** | Real-time viewing | Gaussian splatting render |
| **BLEND** | Blender native | Full editability, nodes |
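
Of the formats above, OBJ is simple enough to write by hand, which makes it a useful smoke test for the geometry. The sketch below serializes a rectangular floor as two triangles; `to_obj` is a hypothetical helper that emits geometry only (no MTL materials, normals, or UVs).

```python
from typing import Sequence, Tuple


def to_obj(vertices: Sequence[Tuple[float, float, float]],
           faces: Sequence[Sequence[int]]) -> str:
    """Serialize a mesh to Wavefront OBJ text.

    `faces` holds 0-based vertex indices; OBJ face indices are 1-based.
    """
    lines = [f"v {x} {y} {z}" for x, y, z in vertices]
    lines += ["f " + " ".join(str(i + 1) for i in face) for face in faces]
    return "\n".join(lines) + "\n"


# A 4 m x 3 m floor at y = 0, split into two triangles.
floor_obj = to_obj(
    [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (4.0, 0.0, 3.0), (0.0, 0.0, 3.0)],
    [(0, 1, 2), (0, 2, 3)],
)
```

Writing `floor_obj` to a `.obj` file gives a mesh that most 3D tools can open; the richer formats in the table would normally be produced through a dedicated exporter.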