
InteriorFusion Architecture Design

Design Philosophy

InteriorFusion is built on a critical insight: interior scenes are fundamentally different from single objects. Current SOTA models (TRELLIS, Hunyuan3D-2, TripoSR, SF3D) are trained on object-centric datasets (Objaverse) and produce unit-cube-scaled assets. They have no concept of:

  • Room topology (walls, floors, ceilings)
  • Spatial relationships (table NEAR sofa, lamp ON nightstand)
  • Real-world scale (meters, not arbitrary units)
  • Multi-object coherence (furniture doesn't float)
  • Semantic room understanding (kitchen vs bedroom vs office)

InteriorFusion addresses all of these through a 5-phase hybrid pipeline.


Phase 1: Scene Understanding

1.1 Metric Depth Estimation

Model: depth-anything/Depth-Anything-V2-Metric-Indoor-Large-hf

Why the metric indoor variant? It predicts depth in real-world meters (trained on Hypersim), which is essential for correct furniture scaling. Non-metric depth estimators produce only relative depth, which breaks room reconstruction.
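
A minimal inference sketch via the Hugging Face transformers depth-estimation pipeline (the input filename is illustrative; the metric checkpoints return per-pixel depth in meters in predicted_depth):

# Minimal sketch: metric depth inference with the transformers pipeline.
from transformers import pipeline
from PIL import Image

depth_pipe = pipeline(
    "depth-estimation",
    model="depth-anything/Depth-Anything-V2-Metric-Indoor-Large-hf",
)

image = Image.open("living_room.jpg")
result = depth_pipe(image)

# For the metric variants, predicted_depth holds depth in meters.
depth_m = result["predicted_depth"]  # torch.Tensor of per-pixel depth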

1.2 Room Layout Estimation

Model: manycore-research/SpatialLM-Llama-1B (or Qwen-0.5B for Apache 2.0)

SpatialLM processes point clouds from depth + camera intrinsics to produce structured scene scripts:

# Plane, Doorway, Window, and ObjectBBox are project-defined geometry types.
from dataclasses import dataclass
from typing import List

@dataclass
class RoomLayout:
    walls: List[Plane]          # Wall planes with normals
    floor: Plane                # Floor plane
    ceiling: Plane              # Ceiling plane
    doors: List[Doorway]        # Doorway locations
    windows: List[Window]       # Window locations
    objects: List[ObjectBBox]   # Furniture bounding boxes
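
For reference, a minimal sketch of the standard pinhole back-projection that turns the metric depth map into the point cloud SpatialLM consumes (fx, fy, cx, cy are the camera intrinsics):

import numpy as np

def backproject_depth(depth_m: np.ndarray, fx: float, fy: float,
                      cx: float, cy: float) -> np.ndarray:
    """Back-project a metric depth map (H, W) in meters to an (N, 3) point cloud."""
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth_m
    x = (u - cx) * z / fx   # pinhole model: X = (u - cx) * Z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]   # drop invalid (zero-depth) pixels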

1.3 Semantic Segmentation

Model: Mask2Former / OneFormer with indoor-trained heads

Segments the input image into:

  • Wall regions (with material type: paint, wallpaper, brick)
  • Floor regions (wood, tile, carpet)
  • Ceiling region
  • Per-furniture instances (sofa, table, lamp, etc.)
  • Decorative elements (plants, paintings, curtains)

1.4 Multi-Object Detection & Isolation

Using SAM (Segment Anything Model) with indoor priors:

  • Segment each furniture piece
  • Extract per-object crops with alpha masks
  • Remove background context for clean object generation
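
A hedged sketch of this isolation step using SAM's automatic mask generator (the checkpoint name is the upstream release; the indoor-prior prompting is omitted here):

# Sketch of per-object isolation with SAM (assumes the segment-anything
# package and a downloaded ViT-H checkpoint).
import numpy as np
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

masks = mask_generator.generate(image_rgb)    # image_rgb: uint8 (H, W, 3)

crops = []
for m in masks:
    seg = m["segmentation"]                   # bool (H, W)
    x, y, w, h = (int(v) for v in m["bbox"])  # XYWH in pixels
    rgba = np.dstack([image_rgb, (seg * 255).astype(np.uint8)])
    crops.append(rgba[y:y + h, x:x + w])      # alpha-masked object crop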

Phase 2: Multi-View Generation

2.1 Per-Object Multi-View Diffusion

Model: stabilityai/stable-zero123 or Zero123++ community pipeline

For each segmented furniture object:

  • Generate 6 consistent orthographic views (0°, 60°, 120°, 180°, 240°, 300° azimuth)
  • Condition on the original crop + depth edge map
  • Use depth-conditioned ControlNet for geometric consistency
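
For the Zero123++ route, a hedged sketch via the community pipeline on diffusers (model id and step count follow the upstream README; note that Zero123++'s camera set is baked into the model and returned as a single 3×2 image grid, which differs from the azimuth list above):

# Sketch: six-view generation via the Zero123++ community pipeline.
import torch
from diffusers import DiffusionPipeline
from PIL import Image

pipe = DiffusionPipeline.from_pretrained(
    "sudo-ai/zero123plus-v1.2",
    custom_pipeline="sudo-ai/zero123plus-pipeline",
    torch_dtype=torch.float16,
).to("cuda")

cond = Image.open("sofa_crop.png")                    # alpha-masked crop from Phase 1.4
grid = pipe(cond, num_inference_steps=75).images[0]   # 3x2 tile of 6 novel views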

2.2 Room Shell Multi-View

For walls, floor, ceiling:

  • Generate panoramic-style extended views from the single image
  • Use depth-guided inpainting for occluded regions
  • Produce ceiling, floor, and wall texture atlases
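
A stock latent-diffusion inpainting pipeline is a plausible building block for the occluded-region fill. A minimal sketch with diffusers (checkpoint choice, prompt, and input variables are illustrative; the depth-guided variant would stack a depth ControlNet on top):

import torch
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

filled = pipe(
    prompt="empty ceiling, uniform white paint, photorealistic",
    image=ceiling_view,         # PIL image of the reprojected ceiling
    mask_image=occlusion_mask,  # white where furniture occluded the surface
).images[0]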

2.3 Depth-Conditioned View Synthesis

Condition all multi-view generation on the metric depth map:

  • Depth acts as a geometric prior preventing shape hallucination
  • Cross-view depth consistency enforced via depth-normal consistency loss
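
A minimal sketch of one plausible depth-normal consistency term, treating depth as a heightfield (the exact loss used in training is not specified here; a full version would fold in the camera intrinsics):

import torch
import torch.nn.functional as F

def normals_from_depth(depth: torch.Tensor) -> torch.Tensor:
    """Approximate normals from a (B, 1, H, W) depth map via finite differences."""
    dzdu = (depth[..., :, 2:] - depth[..., :, :-2]) / 2.0   # horizontal gradient
    dzdv = (depth[..., 2:, :] - depth[..., :-2, :]) / 2.0   # vertical gradient
    dzdu = F.pad(dzdu, (1, 1, 0, 0))   # pad width back to W
    dzdv = F.pad(dzdv, (0, 0, 1, 1))   # pad height back to H
    n = torch.cat([-dzdu, -dzdv, torch.ones_like(depth)], dim=1)
    return F.normalize(n, dim=1)

def depth_normal_consistency(depth: torch.Tensor,
                             pred_normals: torch.Tensor) -> torch.Tensor:
    """1 - cosine similarity between depth-derived and predicted normals."""
    n_from_depth = normals_from_depth(depth)
    return (1.0 - F.cosine_similarity(n_from_depth, pred_normals, dim=1)).mean()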

Phase 3: 3D Reconstruction

3.1 Room Shell Reconstruction

Walls, floor, and ceiling are reconstructed as planar meshes with UV atlases:

  • Walls: Extruded from detected wall planes + depth boundaries
  • Floor: Planar mesh with UV-mapped texture
  • Ceiling: Planar mesh with texture from inpainted ceiling view
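
A minimal sketch of the planar shell construction using trimesh (function name and UV convention are illustrative):

import numpy as np
import trimesh

def floor_mesh(extent_x: float, extent_y: float, height: float = 0.0) -> trimesh.Trimesh:
    """Planar floor quad in meters, with unit-square UVs for the texture atlas."""
    vertices = np.array([
        [0, 0, height], [extent_x, 0, height],
        [extent_x, extent_y, height], [0, extent_y, height],
    ], dtype=np.float64)
    faces = np.array([[0, 1, 2], [0, 2, 3]])
    uv = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=np.float64)
    mesh = trimesh.Trimesh(vertices=vertices, faces=faces, process=False)
    mesh.visual = trimesh.visual.TextureVisuals(uv=uv)
    return mesh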

3.2 Per-Object 3D Generation

Each furniture object is reconstructed using a hybrid approach:

  • Small objects (lamps, vases, decor): TRELLIS.2-4B → mesh with PBR
  • Medium objects (chairs, tables): TRELLIS.2-4B or InteriorFusion-L native
  • Large objects (sofas, beds, wardrobes): InteriorFusion-L with spatial constraints

The key innovation: Spatial Constraint Injection

  • Object position is constrained by the room layout from Phase 1
  • Object scale is constrained by metric depth
  • Object orientation is constrained by floor plane normal
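
A minimal sketch of what constraint injection can look like at assembly time (the helper name and conventions are illustrative, with +Z assumed up):

import numpy as np

def constrain_object(vertices: np.ndarray, target_height_m: float,
                     floor_normal: np.ndarray, floor_height: float) -> np.ndarray:
    """Scale to metric height, align up-axis to the floor normal, rest on the floor."""
    # 1. Scale: metric height comes from depth / layout, not the unit cube.
    h = vertices[:, 2].max() - vertices[:, 2].min()
    v = vertices * (target_height_m / h)
    # 2. Orient: rotate the object's +Z onto the detected floor normal.
    z = np.array([0.0, 0.0, 1.0])
    axis = np.cross(z, floor_normal)
    s, c = np.linalg.norm(axis), float(np.dot(z, floor_normal))
    if s > 1e-8:
        k = axis / s
        K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
        R = np.eye(3) + s * K + (1 - c) * (K @ K)   # Rodrigues' rotation formula
        v = v @ R.T
    # 3. Gravity: translate so the lowest vertex sits on the floor plane.
    v[:, 2] += floor_height - v[:, 2].min()
    return v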

3.3 Gaussian Splatting Layer

For the entire scene, we maintain a parallel 3D Gaussian Splatting representation:

  • Fast novel view synthesis for interactive preview
  • Per-object Gaussian subsets for editing
  • Global scene Gaussians for background/room shell

Phase 4: Scene Assembly

4.1 Layout Optimization

Using SpatialLM's scene graph + learned layout prior:

  • Place objects at detected positions from Phase 1
  • Resolve collisions using physics-based relaxation
  • Ensure objects rest on floor (gravity constraint)
  • Ensure objects don't intersect walls
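
A hedged sketch of the physics-based relaxation as iterative AABB separation on the floor plane (the production solver may be more sophisticated):

import numpy as np

def relax_collisions(centers: np.ndarray, half_extents: np.ndarray,
                     iters: int = 50, step: float = 0.5) -> np.ndarray:
    """Iteratively push apart overlapping axis-aligned boxes in the XY plane."""
    c = centers.copy()
    for _ in range(iters):
        moved = False
        for i in range(len(c)):
            for j in range(i + 1, len(c)):
                delta = c[j, :2] - c[i, :2]
                overlap = (half_extents[i, :2] + half_extents[j, :2]) - np.abs(delta)
                if (overlap > 0).all():                 # AABBs intersect in XY
                    axis = int(np.argmin(overlap))      # separate along min-overlap axis
                    push = overlap[axis] * step
                    sign = 1.0 if delta[axis] >= 0 else -1.0
                    c[i, axis] -= sign * push / 2
                    c[j, axis] += sign * push / 2
                    moved = True
        if not moved:
            break
    return c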

4.2 Scale Normalization

All objects are normalized to metric scale:

  • Use known furniture dimensions (e.g., standard chair height ~45cm)
  • Use depth consistency to resolve ambiguous scales
  • Human-scale reference from detected people/artifacts
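
A minimal sketch of prior-guided scale normalization (the prior values and the blending rule are illustrative, not shipped constants):

# Illustrative dimension priors (typical values, not from a dataset spec).
FURNITURE_HEIGHT_PRIORS_M = {
    "chair": 0.45,          # seat height
    "dining_table": 0.75,
    "sofa": 0.85,
    "nightstand": 0.55,
}

def metric_scale_factor(label: str, observed_height_m: float) -> float:
    """Blend the class prior with the depth-derived observation (simple average)."""
    prior = FURNITURE_HEIGHT_PRIORS_M.get(label)
    if prior is None:
        return 1.0                      # no prior: trust the depth estimate as-is
    target = 0.5 * (prior + observed_height_m)
    return target / observed_height_m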

4.3 Scene Graph Construction

# SceneNode and SpatialRelation are project-defined types built in Phases 1 and 4.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class SceneGraph:
    nodes: Dict[str, SceneNode]      # Objects + room shell
    edges: List[SpatialRelation]     # "on", "next to", "in front of", etc.
    room_type: str                   # e.g., "modern_living_room", "scandinavian_kitchen"
    style: str                       # "modern", "scandinavian", "luxury", "indian"

Phase 5: Material & Texture

5.1 PBR Material Generation

For each surface:

  • Base color/albedo (diffuse)
  • Metallic map
  • Roughness map
  • Normal map (bump)
  • Ambient occlusion (optional)

Model: Custom material diffusion network fine-tuned on Hypersim + InteriorNet

5.2 Texture Baking

  • Project multi-view generated textures onto UV atlases
  • Visibility-aware blending (occlusion handling)
  • Seamless tiling for large surfaces (walls, floors)
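
A minimal sketch of the visibility-aware blending weights (array conventions are illustrative):

import numpy as np

def view_blend_weights(normals: np.ndarray, view_dirs: np.ndarray,
                       visible: np.ndarray) -> np.ndarray:
    """Per-texel blend weights for V views: cos(angle) gated by visibility.

    normals: (T, 3) texel normals; view_dirs: (V, 3) unit vectors toward cameras;
    visible: (V, T) bool occlusion mask from depth testing.
    """
    cos = normals @ view_dirs.T                # (T, V) facing term
    w = np.clip(cos, 0.0, None).T * visible    # (V, T); zero for back-facing/occluded
    total = w.sum(axis=0, keepdims=True)
    return np.divide(w, total, out=np.zeros_like(w), where=total > 0)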

5.3 Lighting Estimation

Estimate scene lighting from the input image:

  • HDR environment map extraction
  • Key light / fill light / ambient light decomposition
  • IBL (Image-Based Lighting) setup for game engines

Core Model: InteriorFusion-L (4B Parameters)

Encoder

  • Image encoder: DINOv3-L (frozen, feature extraction)
  • Depth encoder: Custom CNN processing metric depth map
  • Layout encoder: Transformer processing SpatialLM scene graph tokens
  • Semantic encoder: Mask2Former feature pyramid

Latent Representation: SLAT-Interior

Extension of TRELLIS SLAT optimized for indoor scenes:

  • Sparse 3D voxel grid, resolution 1024³
  • Active voxels only on surfaces (wall, furniture)
  • Per-voxel features: shape + material + semantic class
  • Room-shell voxels flagged separately from object voxels
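
A hedged sketch of what the SLAT-Interior container can look like (field names are illustrative, not the shipped schema):

from dataclasses import dataclass
import torch

@dataclass
class SlatInterior:
    coords: torch.Tensor       # (N, 3) int32 active-voxel indices in the 1024³ grid
    features: torch.Tensor     # (N, C) per-voxel latent: shape + material + semantics
    semantic_id: torch.Tensor  # (N,) class label per voxel
    is_shell: torch.Tensor     # (N,) bool: True for wall/floor/ceiling voxels

    def objects_only(self) -> "SlatInterior":
        m = ~self.is_shell
        return SlatInterior(self.coords[m], self.features[m],
                            self.semantic_id[m], self.is_shell[m])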

Decoder

Three parallel decoders:

  1. Mesh decoder: Produces watertight or arbitrary-topology meshes (from O-Voxel)
  2. Gaussian decoder: Produces per-voxel Gaussian parameters
  3. Material decoder: Produces PBR material parameters per surface

Generation Pipeline

Two-stage rectified flow (following TRELLIS pattern):

  1. Structure generation: Dense occupancy grid → sparse structure
  2. Latent generation: Per-active-voxel features → shape + material

Conditioned on: DINOv3 image features + depth map + room layout tokens + semantic segmentation tokens
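
For reference, a minimal sketch of the standard rectified-flow objective that both stages build on (this is the generic formulation, assuming a model with signature model(x_t, t, cond), not InteriorFusion's exact training loop):

import torch
import torch.nn.functional as F

def rectified_flow_loss(model, x1: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
    x0 = torch.randn_like(x1)                      # noise endpoint
    t = torch.rand(x1.shape[0], device=x1.device)  # uniform time in [0, 1)
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))
    x_t = (1 - t_) * x0 + t_ * x1                  # straight-line interpolant
    v_target = x1 - x0                             # constant velocity field
    v_pred = model(x_t, t, cond)                   # DiT predicts velocity
    return F.mse_loss(v_pred, v_target)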


Training Strategy

Stage 1: VAE Pre-training (1 week, 8×A100)

  • Train SLAT-Interior VAE on 3D-FRONT + Structured3D rooms
  • Multi-resolution: 256³ → 512³ → 1024³ curriculum
  • Loss: MSE reconstruction + KL divergence + depth consistency + normal consistency
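
A minimal sketch of the Stage 1 composite objective (the loss weights here are placeholders, not tuned values):

import torch
import torch.nn.functional as F

def vae_loss(recon, target, mu, logvar, depth_pred, depth_gt,
             normal_pred, normal_gt, w_kl=1e-4, w_d=0.1, w_n=0.1):
    """MSE reconstruction + KL + depth consistency + normal consistency."""
    rec = F.mse_loss(recon, target)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    depth = F.l1_loss(depth_pred, depth_gt)
    normal = (1 - F.cosine_similarity(normal_pred, normal_gt, dim=1)).mean()
    return rec + w_kl * kl + w_d * depth + w_n * normal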

Stage 2: Flow-Matching DiT (2 weeks, 32×A100)

  • Train rectified flow transformer for structure generation
  • Curriculum: 256³ → 512³ → 1024³
  • Conditioning: image + depth + layout

Stage 3: Material DiT (1 week, 16×A100)

  • Train material generation DiT conditioned on geometry + input image
  • PBR material prediction: albedo, metallic, roughness, normal

Stage 4: Fine-tuning (3 days, 8×A100)

  • LoRA fine-tuning on real interior photos (ScanNet + HM3D)
  • Domain adaptation from synthetic to real
  • Reinforcement learning for geometry consistency (GRPO-style)

Total training: ~4.5 weeks wall-clock for the four sequential stages, peaking at 32×A100 during Stage 2


Inference Optimization

RTX 4090 (24GB VRAM)

  • Model quantization: INT8 via GPTQ
  • Gradient checkpointing disabled (inference only)
  • Gaussian splatting for real-time preview
  • Full mesh generation: ~15 seconds

A100 (80GB VRAM)

  • FP16 inference
  • Batch generation for multiple objects
  • Full pipeline: ~8 seconds

H100 (80GB VRAM)

  • BF16 inference
  • ~5 seconds full generation

Edge / Mobile

  • Core depth + layout estimation only (~2 seconds)
  • Cloud-based 3D generation with streaming
  • Reduced mesh quality (decimated, lower texture resolution)

Export Formats

| Format | Use Case | Features |
|---|---|---|
| GLB | Web, AR, Unity, Godot | PBR materials, animations, all data |
| FBX | Unreal Engine, Maya, 3ds Max | Full rigging support, PBR |
| OBJ | Legacy compatibility | Basic materials (MTL) |
| USDZ | iOS AR (ARKit) | Apple's native format |
| 3DGS (.ply) | Real-time viewing | Gaussian splatting render |
| BLEND | Blender native | Full editability, nodes |
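
A minimal GLB export sketch using trimesh (variable names are illustrative; USDZ, FBX, and BLEND typically go through a converter or DCC tool rather than trimesh):

import trimesh

scene = trimesh.Scene()
scene.add_geometry(room_shell_mesh, node_name="room_shell")
for name, mesh in furniture_meshes.items():   # assembled in Phase 4
    scene.add_geometry(mesh, node_name=name)

scene.export("interior.glb")                  # GLB embeds PBR materials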