# InteriorFusion Architecture Design
## Design Philosophy
InteriorFusion is built on a critical insight: **interior scenes are fundamentally different from single objects**. Current SOTA models (TRELLIS, Hunyuan3D-2, TripoSR, SF3D) are trained on object-centric datasets (Objaverse) and produce unit-cube-scaled assets. They have no concept of:
- Room topology (walls, floors, ceilings)
- Spatial relationships (table NEAR sofa, lamp ON nightstand)
- Real-world scale (meters, not arbitrary units)
- Multi-object coherence (furniture doesn't float)
- Semantic room understanding (kitchen vs bedroom vs office)
InteriorFusion addresses all of these through a **5-phase hybrid pipeline**.
---
## Phase 1: Scene Understanding
### 1.1 Metric Depth Estimation
**Model**: `depth-anything/Depth-Anything-V2-Metric-Indoor-Large-hf`
Why the metric indoor variant? It predicts depth in **real-world meters** (trained on Hypersim), which is essential for correct furniture scaling; non-metric estimators produce only relative depth, which breaks room reconstruction.
### 1.2 Room Layout Estimation
**Model**: `manycore-research/SpatialLM-Llama-1B` (or Qwen-0.5B for Apache 2.0)
SpatialLM processes point clouds from depth + camera intrinsics to produce structured scene scripts:
```python
from dataclasses import dataclass
from typing import List

@dataclass
class RoomLayout:
    walls: List[Plane]         # Wall planes with normals
    floor: Plane               # Floor plane
    ceiling: Plane             # Ceiling plane
    doors: List[Doorway]       # Doorway locations
    windows: List[Window]      # Window locations
    objects: List[ObjectBBox]  # Furniture bounding boxes
```
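SpatialLM's point-cloud input can be produced by back-projecting the metric depth map through the pinhole camera model. A minimal sketch (simple intrinsics, no lens distortion assumed):

```python
import numpy as np

def depth_to_pointcloud(depth: np.ndarray, fx: float, fy: float,
                        cx: float, cy: float) -> np.ndarray:
    """Back-project an HxW metric depth map (meters) to an (H*W, 3) point cloud."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (u - cx) * z / fx   # pinhole model: X = (u - cx) * Z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# A flat surface 2 m from the camera maps to points with z == 2.0
pts = depth_to_pointcloud(np.full((4, 4), 2.0), fx=2.0, fy=2.0, cx=2.0, cy=2.0)
```

Because depth is already metric, the resulting cloud is in meters and can be handed to the layout estimator without rescaling.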
### 1.3 Semantic Segmentation
**Model**: Mask2Former / OneFormer with indoor-trained heads
Segments the input image into:
- Wall regions (with material type: paint, wallpaper, brick)
- Floor regions (wood, tile, carpet)
- Ceiling region
- Per-furniture instances (sofa, table, lamp, etc.)
- Decorative elements (plants, paintings, curtains)
### 1.4 Multi-Object Detection & Isolation
Using SAM (Segment Anything Model) with indoor priors:
- Segment each furniture piece
- Extract per-object crops with alpha masks
- Remove background context for clean object generation
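Per-object isolation reduces to attaching each SAM mask as an alpha channel and cropping to the mask's bounding box. A sketch (array-level only; the SAM call itself is omitted):

```python
import numpy as np

def isolate_object(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Crop an HxWx3 image to the mask's bounding box and attach the mask as alpha."""
    ys, xs = np.nonzero(mask)
    y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
    crop = image[y0:y1, x0:x1]
    alpha = (mask[y0:y1, x0:x1] * 255).astype(np.uint8)
    return np.dstack([crop, alpha])  # HxWx4 RGBA; background has alpha = 0

img = np.full((8, 8, 3), 200, dtype=np.uint8)
m = np.zeros((8, 8), dtype=bool)
m[2:5, 3:6] = True                  # a 3x3 "object"
rgba = isolate_object(img, m)
```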
---
## Phase 2: Multi-View Generation
### 2.1 Per-Object Multi-View Diffusion
**Model**: `stabilityai/stable-zero123` or Zero123++ community pipeline
For each segmented furniture object:
- Generate 6 consistent novel views (0°, 60°, 120°, 180°, 240°, 300° azimuth)
- Condition on the original crop + depth edge map
- Use depth-conditioned ControlNet for geometric consistency
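The six-view camera ring can be expressed as rotations of one camera about the object's up axis. A sketch of the azimuth rotations (a +Y-up, world-to-camera convention is an assumption):

```python
import numpy as np

def ring_rotations(n_views: int = 6) -> list[np.ndarray]:
    """Rotation matrices for n_views cameras evenly spaced in azimuth about +Y."""
    mats = []
    for k in range(n_views):
        a = 2 * np.pi * k / n_views   # 0°, 60°, 120°, 180°, 240°, 300°
        mats.append(np.array([
            [ np.cos(a), 0.0, np.sin(a)],
            [ 0.0,       1.0, 0.0      ],
            [-np.sin(a), 0.0, np.cos(a)],
        ]))
    return mats

views = ring_rotations(6)
```

Each matrix is fed to the view-conditioned diffusion model as the relative camera pose for that view.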
### 2.2 Room Shell Multi-View
For walls, floor, ceiling:
- Generate panoramic-style extended views from the single image
- Use depth-guided inpainting for occluded regions
- Produce ceiling, floor, and wall texture atlases
### 2.3 Depth-Conditioned View Synthesis
Condition all multi-view generation on the metric depth map:
- Depth acts as a geometric prior preventing shape hallucination
- Cross-view depth consistency enforced via depth-normal consistency loss
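One way to implement the consistency term is to derive normals from depth gradients and penalize disagreement between aligned views; the finite-difference formulation below is a sketch, not the exact training loss:

```python
import numpy as np

def normals_from_depth(depth: np.ndarray) -> np.ndarray:
    """Estimate per-pixel surface normals from a depth map via finite differences."""
    dz_dy, dz_dx = np.gradient(depth)
    n = np.stack([-dz_dx, -dz_dy, np.ones_like(depth)], axis=-1)
    return n / np.linalg.norm(n, axis=-1, keepdims=True)

def normal_consistency_loss(n_a: np.ndarray, n_b: np.ndarray) -> float:
    """1 - mean cosine similarity between two aligned normal maps (0 = identical)."""
    return float(1.0 - np.mean(np.sum(n_a * n_b, axis=-1)))

flat = np.full((8, 8), 3.0)   # planar depth -> all normals point at the camera
loss = normal_consistency_loss(normals_from_depth(flat), normals_from_depth(flat))
```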
---
## Phase 3: 3D Reconstruction
### 3.1 Room Shell Reconstruction
Walls, floor, ceiling are reconstructed as **planar meshes** with UV atlases:
- Walls: Extruded from detected wall planes + depth boundaries
- Floor: Planar mesh with UV-mapped texture
- Ceiling: Planar mesh with texture from inpainted ceiling view
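A planar room-shell element is essentially a textured quad; a minimal sketch of a floor-mesh builder (the vertex/UV layout is an assumption):

```python
import numpy as np

def floor_quad(width_m: float, depth_m: float):
    """Axis-aligned floor quad at y=0: 4 vertices, 2 triangles, unit-square UVs."""
    verts = np.array([
        [0.0,     0.0, 0.0],
        [width_m, 0.0, 0.0],
        [width_m, 0.0, depth_m],
        [0.0,     0.0, depth_m],
    ])
    faces = np.array([[0, 1, 2], [0, 2, 3]])                    # two triangles
    uvs = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)
    return verts, faces, uvs

v, f, uv = floor_quad(4.0, 3.0)   # a 4 m x 3 m floor plane
```

Walls follow the same pattern, extruded along the detected wall plane between floor and ceiling heights.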
### 3.2 Per-Object 3D Generation
Each furniture object is reconstructed using a **hybrid approach**:
- **Small objects** (lamps, vases, decor): TRELLIS.2-4B → mesh with PBR
- **Medium objects** (chairs, tables): TRELLIS.2-4B or InteriorFusion-L native
- **Large objects** (sofas, beds, wardrobes): InteriorFusion-L with spatial constraints
The key innovation: **Spatial Constraint Injection**
- Object position is constrained by the room layout from Phase 1
- Object scale is constrained by metric depth
- Object orientation is constrained by floor plane normal
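In code, the three constraints reduce to overwriting a generated object's pose with quantities measured in Phase 1. A sketch (all names hypothetical):

```python
import numpy as np

def apply_spatial_constraints(pos, room_min, room_max, floor_y, metric_h, unit_h):
    """Constrain a generated object's pose with Phase-1 measurements:
    clamp position inside the room shell, rest it on the floor plane,
    and rescale the generator's unit-cube asset to its metric height."""
    pos = np.clip(np.asarray(pos, dtype=float), room_min, room_max)
    pos[1] = floor_y                # gravity: object rests on the floor
    scale = metric_h / unit_h       # unit-cube asset -> meters
    return pos, scale

# An object predicted outside a 4x3x3 m room gets pulled back in and grounded.
p, s = apply_spatial_constraints([5.0, 1.0, 2.0], [0, 0, 0], [4, 3, 3], 0.0, 0.9, 1.0)
```

Orientation is handled analogously by rotating the object's up axis onto the floor plane normal.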
### 3.3 Gaussian Splatting Layer
For the entire scene, we maintain a parallel **3D Gaussian Splatting representation**:
- Fast novel view synthesis for interactive preview
- Per-object Gaussian subsets for editing
- Global scene Gaussians for background/room shell
---
## Phase 4: Scene Assembly
### 4.1 Layout Optimization
Using SpatialLM's scene graph + learned layout prior:
- Place objects at detected positions from Phase 1
- Resolve collisions using physics-based relaxation
- Ensure objects rest on floor (gravity constraint)
- Ensure objects don't intersect walls
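The relaxation step can be sketched as iteratively pushing overlapping floor-plane footprints apart (2-D axis-aligned boxes, fixed step size; a toy stand-in for the physics-based solver):

```python
import numpy as np

def relax_layout(centers, half_sizes, iters=50, step=0.5):
    """Separate overlapping 2-D axis-aligned footprints by symmetric pushes."""
    c = np.asarray(centers, dtype=float)
    h = np.asarray(half_sizes, dtype=float)
    for _ in range(iters):
        for i in range(len(c)):
            for j in range(i + 1, len(c)):
                d = c[j] - c[i]
                overlap = h[i] + h[j] - np.abs(d)    # per-axis penetration depth
                if np.all(overlap > 0):              # boxes intersect
                    axis = int(np.argmin(overlap))   # push along least-penetrated axis
                    push = step * overlap[axis] * np.sign(d[axis] or 1.0)
                    c[i][axis] -= push / 2
                    c[j][axis] += push / 2
    return c

# Two 1 m x 1 m footprints overlapping by 0.5 m drift apart until they just touch.
out = relax_layout([[0.0, 0.0], [0.5, 0.0]], [[0.5, 0.5], [0.5, 0.5]])
```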
### 4.2 Scale Normalization
All objects normalized to metric scale:
- Use known furniture dimensions (e.g., standard chair seat height ~45 cm)
- Use depth consistency to resolve ambiguous scales
- Human-scale reference from detected people/artifacts
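Known-dimension normalization is a ratio between a real-world prior and the same dimension measured on the generated asset. A sketch (the 45 cm seat-height prior is from the text; the other entries and helper names are illustrative assumptions):

```python
# Real-world dimension priors in meters (illustrative values).
KNOWN_HEIGHTS_M = {"chair_seat": 0.45, "table_top": 0.75, "door": 2.03}

def metric_scale(category: str, measured_unit_height: float) -> float:
    """Scale factor taking the asset's unit-space height to metric height."""
    return KNOWN_HEIGHTS_M[category] / measured_unit_height

s = metric_scale("chair_seat", 0.5)   # seat measured at 0.5 units -> scale by 0.9
```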
### 4.3 Scene Graph Construction
```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class SceneGraph:
    nodes: Dict[str, SceneNode]   # Objects + room shell
    edges: List[SpatialRelation]  # "on", "next to", "in front of", etc.
    room_type: str                # "modern_living_room", "scandinavian_kitchen"
    style: str                    # "modern", "scandinavian", "luxury", "indian"
```
---
## Phase 5: Material & Texture
### 5.1 PBR Material Generation
For each surface:
- Base color/albedo (diffuse)
- Metallic map
- Roughness map
- Normal map (bump)
- Ambient occlusion (optional)
**Model**: Custom material diffusion network fine-tuned on Hypersim + InteriorNet
### 5.2 Texture Baking
- Project multi-view generated textures onto UV atlases
- Visibility-aware blending (occlusion handling)
- Seamless tiling for large surfaces (walls, floors)
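Visibility-aware blending is a per-texel weighted average in which each view contributes in proportion to how directly it sees the surface. A sketch using clamped cosine (normal · view direction) weights, which is one common choice rather than the pipeline's exact scheme:

```python
import numpy as np

def blend_views(colors: np.ndarray, cosines: np.ndarray) -> np.ndarray:
    """Blend per-view texel colors (V, H, W, 3) with visibility weights (V, H, W).

    Weights are the clamped cosine between surface normal and view direction;
    occluded or back-facing texels get weight 0 and do not contribute."""
    w = np.clip(cosines, 0.0, None)[..., None]       # (V, H, W, 1)
    total = w.sum(axis=0)
    return (colors * w).sum(axis=0) / np.maximum(total, 1e-8)

cols = np.stack([np.full((2, 2, 3), 1.0), np.full((2, 2, 3), 0.0)])
cosw = np.stack([np.full((2, 2), 1.0), np.full((2, 2), -0.5)])  # view 2 back-facing
tex = blend_views(cols, cosw)   # only the front-facing view survives
```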
### 5.3 Lighting Estimation
Estimate scene lighting from the input image:
- HDR environment map extraction
- Key light / fill light / ambient light decomposition
- IBL (Image-Based Lighting) setup for game engines
---
## Core Model: InteriorFusion-L (4B Parameters)
### Encoder
- **Image encoder**: DINOv3-L (frozen, feature extraction)
- **Depth encoder**: Custom CNN processing metric depth map
- **Layout encoder**: Transformer processing SpatialLM scene graph tokens
- **Semantic encoder**: Mask2Former feature pyramid
### Latent Representation: SLAT-Interior
Extension of TRELLIS SLAT optimized for indoor scenes:
- Sparse 3D voxel grid, resolution 1024³
- Active voxels only on surfaces (wall, furniture)
- Per-voxel features: shape + material + semantic class
- Room-shell voxels flagged separately from object voxels
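A minimal sketch of the sparse-voxel bookkeeping: only surface voxels are stored, each carrying a feature vector and a room-shell/object flag (the record layout is an assumption, not the actual SLAT-Interior encoding):

```python
import numpy as np

class SparseVoxelGrid:
    """Sparse grid: store only active (surface) voxels in a coord -> record dict."""

    def __init__(self, resolution: int = 1024, feat_dim: int = 8):
        self.resolution = resolution
        self.feat_dim = feat_dim
        self.voxels = {}   # (i, j, k) -> {"feat": ndarray, "is_room_shell": bool}

    def activate(self, coord, feat, is_room_shell=False):
        assert all(0 <= c < self.resolution for c in coord)
        self.voxels[tuple(coord)] = {
            "feat": np.asarray(feat, dtype=np.float32),
            "is_room_shell": is_room_shell,
        }

    def density(self) -> float:
        """Fraction of the dense grid that is active (tiny for surface-only scenes)."""
        return len(self.voxels) / self.resolution ** 3

g = SparseVoxelGrid(resolution=64, feat_dim=4)
g.activate((0, 0, 0), [1, 0, 0, 0], is_room_shell=True)   # a wall voxel
g.activate((10, 2, 7), [0, 1, 0, 0])                      # a furniture voxel
```

Keeping shell and object voxels in one structure with a flag is what lets the decoders treat them separately.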
### Decoder
Three parallel decoders:
1. **Mesh decoder**: Produces watertight or arbitrary-topology meshes (from O-Voxel)
2. **Gaussian decoder**: Produces per-voxel Gaussian parameters
3. **Material decoder**: Produces PBR material parameters per surface
### Generation Pipeline
Two-stage rectified flow (following TRELLIS pattern):
1. **Structure generation**: Dense occupancy grid → sparse structure
2. **Latent generation**: Per-active-voxel features → shape + material
Conditioned on: DINOv3 image features + depth map + room layout tokens + semantic segmentation tokens
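Rectified-flow sampling integrates a learned velocity field from noise (t=0) to data (t=1) along near-straight paths. A minimal Euler sampler; the velocity model here is a stand-in closure, not the real conditioned DiT:

```python
import numpy as np

def rectified_flow_sample(velocity_fn, x_noise, n_steps=10):
    """Euler-integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (data)."""
    x = np.asarray(x_noise, dtype=float)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t)   # one Euler step along the flow
    return x

# Stand-in field: constant velocity toward a fixed target (a straight-line flow).
target = np.array([1.0, 2.0, 3.0])
x0 = np.zeros(3)
out = rectified_flow_sample(lambda x, t: target - x0, x0, n_steps=4)
```

In the actual pipeline the same loop runs twice: once over the dense occupancy grid (structure) and once over per-active-voxel latents (shape + material).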
---
## Training Strategy
### Stage 1: VAE Pre-training (1 week, 8×A100)
- Train SLAT-Interior VAE on 3D-FRONT + Structured3D rooms
- Multi-resolution: 256³ → 512³ → 1024³ curriculum
- Loss: MSE reconstruction + KL divergence + depth consistency + normal consistency
### Stage 2: Flow-Matching DiT (2 weeks, 32×A100)
- Train rectified flow transformer for structure generation
- Curriculum: 256³ → 512³ → 1024³
- Conditioning: image + depth + layout
### Stage 3: Material DiT (1 week, 16×A100)
- Train material generation DiT conditioned on geometry + input image
- PBR material prediction: albedo, metallic, roughness, normal
### Stage 4: Fine-tuning (3 days, 8×A100)
- LoRA fine-tuning on real interior photos (ScanNet + HM3D)
- Domain adaptation from synthetic to real
- Reinforcement learning for geometry consistency (GRPO-style)
### Total Training: ~4 weeks on 32×A100
---
## Inference Optimization
### RTX 4090 (24GB VRAM)
- Model quantization: INT8 via GPTQ
- Gradient checkpointing disabled (inference only)
- Gaussian splatting for real-time preview
- Full mesh generation: ~15 seconds
### A100 (80GB VRAM)
- FP16 inference
- Batch generation for multiple objects
- Full pipeline: ~8 seconds
### H100 (80GB VRAM)
- BF16 inference
- ~5 seconds full generation
### Edge / Mobile
- Core depth + layout estimation only (~2 seconds)
- Cloud-based 3D generation with streaming
- Reduced mesh quality (decimated, lower texture resolution)
---
## Export Formats
| Format | Use Case | Features |
|--------|----------|----------|
| **GLB** | Web, AR, Unity, Godot | PBR materials, animations, all data |
| **FBX** | Unreal Engine, Maya, 3ds Max | Full rigging support, PBR |
| **OBJ** | Legacy compatibility | Basic materials (MTL) |
| **USDZ** | iOS AR (ARKit) | Apple's native format |
| **3DGS (.ply)** | Real-time viewing | Gaussian splatting render |
| **BLEND** | Blender native | Full editability, nodes |