
InteriorFusion Architecture Design

Design Philosophy

InteriorFusion is built on a critical insight: interior scenes are fundamentally different from single objects. Current SOTA models (TRELLIS, Hunyuan3D-2, TripoSR, SF3D) are trained on object-centric datasets (Objaverse) and produce unit-cube-scaled assets. They have no concept of:

  • Room topology (walls, floors, ceilings)
  • Spatial relationships (table NEAR sofa, lamp ON nightstand)
  • Real-world scale (meters, not arbitrary units)
  • Multi-object coherence (furniture doesn't float)
  • Semantic room understanding (kitchen vs bedroom vs office)

InteriorFusion addresses all of these through a 5-phase hybrid pipeline.


Phase 1: Scene Understanding

1.1 Metric Depth Estimation

Model: depth-anything/Depth-Anything-V2-Metric-Indoor-Large-hf

Why the metric indoor variant? It predicts depth in real-world meters (trained on Hypersim), which is essential for correct furniture scaling. Non-metric depth estimators produce only relative depth, which breaks room reconstruction.
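
A minimal inference sketch via the Hugging Face transformers depth-estimation pipeline (the input filename is illustrative; the metric checkpoints return per-pixel depth in meters in predicted_depth):

# Minimal sketch: metric depth inference with the transformers pipeline.
from transformers import pipeline
from PIL import Image

depth_pipe = pipeline(
    "depth-estimation",
    model="depth-anything/Depth-Anything-V2-Metric-Indoor-Large-hf",
)

image = Image.open("living_room.jpg")
result = depth_pipe(image)

# For the metric variants, predicted_depth holds depth in meters.
depth_m = result["predicted_depth"]  # torch.Tensor of per-pixel depth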

1.2 Room Layout Estimation

Model: manycore-research/SpatialLM-Llama-1B (or Qwen-0.5B for Apache 2.0)

SpatialLM processes point clouds from depth + camera intrinsics to produce structured scene scripts:

# Plane, Doorway, Window, and ObjectBBox are project-defined geometry types.
from dataclasses import dataclass
from typing import List

@dataclass
class RoomLayout:
    walls: List[Plane]          # Wall planes with normals
    floor: Plane                # Floor plane
    ceiling: Plane              # Ceiling plane
    doors: List[Doorway]        # Doorway locations
    windows: List[Window]       # Window locations
    objects: List[ObjectBBox]   # Furniture bounding boxes
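
For reference, a minimal sketch of the standard pinhole back-projection that turns the metric depth map into the point cloud SpatialLM consumes (fx, fy, cx, cy are the camera intrinsics):

import numpy as np

def backproject_depth(depth_m: np.ndarray, fx: float, fy: float,
                      cx: float, cy: float) -> np.ndarray:
    """Back-project a metric depth map (H, W) in meters to an (N, 3) point cloud."""
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth_m
    x = (u - cx) * z / fx   # pinhole model: X = (u - cx) * Z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]   # drop invalid (zero-depth) pixels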

1.3 Semantic Segmentation

Model: Mask2Former / OneFormer with indoor-trained heads

Segments the input image into:

  • Wall regions (with material type: paint, wallpaper, brick)
  • Floor regions (wood, tile, carpet)
  • Ceiling region
  • Per-furniture instances (sofa, table, lamp, etc.)
  • Decorative elements (plants, paintings, curtains)

1.4 Multi-Object Detection & Isolation

Using SAM (Segment Anything Model) with indoor priors:

  • Segment each furniture piece
  • Extract per-object crops with alpha masks
  • Remove background context for clean object generation
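
A hedged sketch of this isolation step using SAM's automatic mask generator (the checkpoint name is the upstream release; the indoor-prior prompting is omitted here):

# Sketch of per-object isolation with SAM (assumes the segment-anything
# package and a downloaded ViT-H checkpoint).
import numpy as np
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

masks = mask_generator.generate(image_rgb)    # image_rgb: uint8 (H, W, 3)

crops = []
for m in masks:
    seg = m["segmentation"]                   # bool (H, W)
    x, y, w, h = (int(v) for v in m["bbox"])  # XYWH in pixels
    rgba = np.dstack([image_rgb, (seg * 255).astype(np.uint8)])
    crops.append(rgba[y:y + h, x:x + w])      # alpha-masked object crop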

Phase 2: Multi-View Generation

2.1 Per-Object Multi-View Diffusion

Model: stabilityai/stable-zero123 or Zero123++ community pipeline

For each segmented furniture object:

  • Generate 6 consistent orthographic views (0°, 60°, 120°, 180°, 240°, 300° azimuth)
  • Condition on the original crop + depth edge map
  • Use depth-conditioned ControlNet for geometric consistency
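
For the Zero123++ route, a hedged sketch via the community pipeline on diffusers (model id and step count follow the upstream README; note that Zero123++'s camera set is baked into the model and returned as a single 3×2 image grid, which differs from the azimuth list above):

# Sketch: six-view generation via the Zero123++ community pipeline.
import torch
from diffusers import DiffusionPipeline
from PIL import Image

pipe = DiffusionPipeline.from_pretrained(
    "sudo-ai/zero123plus-v1.2",
    custom_pipeline="sudo-ai/zero123plus-pipeline",
    torch_dtype=torch.float16,
).to("cuda")

cond = Image.open("sofa_crop.png")                    # alpha-masked crop from Phase 1.4
grid = pipe(cond, num_inference_steps=75).images[0]   # 3x2 tile of 6 novel views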

2.2 Room Shell Multi-View

For walls, floor, ceiling:

  • Generate panoramic-style extended views from the single image
  • Use depth-guided inpainting for occluded regions
  • Produce ceiling, floor, and wall texture atlases
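
A stock latent-diffusion inpainting pipeline is a plausible building block for the occluded-region fill. A minimal sketch with diffusers (checkpoint choice, prompt, and input variables are illustrative; the depth-guided variant would stack a depth ControlNet on top):

import torch
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

filled = pipe(
    prompt="empty ceiling, uniform white paint, photorealistic",
    image=ceiling_view,         # PIL image of the reprojected ceiling
    mask_image=occlusion_mask,  # white where furniture occluded the surface
).images[0]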

2.3 Depth-Conditioned View Synthesis

Condition all multi-view generation on the metric depth map:

  • Depth acts as a geometric prior preventing shape hallucination
  • Cross-view depth consistency enforced via depth-normal consistency loss
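
A minimal sketch of one plausible depth-normal consistency term, treating depth as a heightfield (the exact loss used in training is not specified here; a full version would fold in the camera intrinsics):

import torch
import torch.nn.functional as F

def normals_from_depth(depth: torch.Tensor) -> torch.Tensor:
    """Approximate normals from a (B, 1, H, W) depth map via finite differences."""
    dzdu = (depth[..., :, 2:] - depth[..., :, :-2]) / 2.0   # horizontal gradient
    dzdv = (depth[..., 2:, :] - depth[..., :-2, :]) / 2.0   # vertical gradient
    dzdu = F.pad(dzdu, (1, 1, 0, 0))   # pad width back to W
    dzdv = F.pad(dzdv, (0, 0, 1, 1))   # pad height back to H
    n = torch.cat([-dzdu, -dzdv, torch.ones_like(depth)], dim=1)
    return F.normalize(n, dim=1)

def depth_normal_consistency(depth: torch.Tensor,
                             pred_normals: torch.Tensor) -> torch.Tensor:
    """1 - cosine similarity between depth-derived and predicted normals."""
    n_from_depth = normals_from_depth(depth)
    return (1.0 - F.cosine_similarity(n_from_depth, pred_normals, dim=1)).mean()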

Phase 3: 3D Reconstruction

3.1 Room Shell Reconstruction

Walls, floor, and ceiling are reconstructed as planar meshes with UV atlases:

  • Walls: Extruded from detected wall planes + depth boundaries
  • Floor: Planar mesh with UV-mapped texture
  • Ceiling: Planar mesh with texture from inpainted ceiling view
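
A minimal sketch of the planar shell construction using trimesh (function name and UV convention are illustrative):

import numpy as np
import trimesh

def floor_mesh(extent_x: float, extent_y: float, height: float = 0.0) -> trimesh.Trimesh:
    """Planar floor quad in meters, with unit-square UVs for the texture atlas."""
    vertices = np.array([
        [0, 0, height], [extent_x, 0, height],
        [extent_x, extent_y, height], [0, extent_y, height],
    ], dtype=np.float64)
    faces = np.array([[0, 1, 2], [0, 2, 3]])
    uv = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=np.float64)
    mesh = trimesh.Trimesh(vertices=vertices, faces=faces, process=False)
    mesh.visual = trimesh.visual.TextureVisuals(uv=uv)
    return mesh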

3.2 Per-Object 3D Generation

Each furniture object is reconstructed using a hybrid approach:

  • Small objects (lamps, vases, decor): TRELLIS.2-4B → mesh with PBR
  • Medium objects (chairs, tables): TRELLIS.2-4B or InteriorFusion-L native
  • Large objects (sofas, beds, wardrobes): InteriorFusion-L with spatial constraints

The key innovation: Spatial Constraint Injection

  • Object position is constrained by the room layout from Phase 1
  • Object scale is constrained by metric depth
  • Object orientation is constrained by floor plane normal
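
A minimal sketch of what constraint injection can look like at assembly time (the helper name and conventions are illustrative, with +Z assumed up):

import numpy as np

def constrain_object(vertices: np.ndarray, target_height_m: float,
                     floor_normal: np.ndarray, floor_height: float) -> np.ndarray:
    """Scale to metric height, align up-axis to the floor normal, rest on the floor."""
    # 1. Scale: metric height comes from depth / layout, not the unit cube.
    h = vertices[:, 2].max() - vertices[:, 2].min()
    v = vertices * (target_height_m / h)
    # 2. Orient: rotate the object's +Z onto the detected floor normal.
    z = np.array([0.0, 0.0, 1.0])
    axis = np.cross(z, floor_normal)
    s, c = np.linalg.norm(axis), float(np.dot(z, floor_normal))
    if s > 1e-8:
        k = axis / s
        K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
        R = np.eye(3) + s * K + (1 - c) * (K @ K)   # Rodrigues' rotation formula
        v = v @ R.T
    # 3. Gravity: translate so the lowest vertex sits on the floor plane.
    v[:, 2] += floor_height - v[:, 2].min()
    return v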

3.3 Gaussian Splatting Layer

For the entire scene, we maintain a parallel 3D Gaussian Splatting representation:

  • Fast novel view synthesis for interactive preview
  • Per-object Gaussian subsets for editing
  • Global scene Gaussians for background/room shell

Phase 4: Scene Assembly

4.1 Layout Optimization

Using SpatialLM's scene graph + learned layout prior:

  • Place objects at detected positions from Phase 1
  • Resolve collisions using physics-based relaxation
  • Ensure objects rest on floor (gravity constraint)
  • Ensure objects don't intersect walls
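
A hedged sketch of the physics-based relaxation as iterative AABB separation on the floor plane (the production solver may be more sophisticated):

import numpy as np

def relax_collisions(centers: np.ndarray, half_extents: np.ndarray,
                     iters: int = 50, step: float = 0.5) -> np.ndarray:
    """Iteratively push apart overlapping axis-aligned boxes in the XY plane."""
    c = centers.copy()
    for _ in range(iters):
        moved = False
        for i in range(len(c)):
            for j in range(i + 1, len(c)):
                delta = c[j, :2] - c[i, :2]
                overlap = (half_extents[i, :2] + half_extents[j, :2]) - np.abs(delta)
                if (overlap > 0).all():                 # AABBs intersect in XY
                    axis = int(np.argmin(overlap))      # separate along min-overlap axis
                    push = overlap[axis] * step
                    sign = 1.0 if delta[axis] >= 0 else -1.0
                    c[i, axis] -= sign * push / 2
                    c[j, axis] += sign * push / 2
                    moved = True
        if not moved:
            break
    return c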

4.2 Scale Normalization

All objects are normalized to metric scale:

  • Use known furniture dimensions (e.g., standard chair height ~45cm)
  • Use depth consistency to resolve ambiguous scales
  • Human-scale reference from detected people/artifacts
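
A minimal sketch of prior-guided scale normalization (the prior values and the blending rule are illustrative, not shipped constants):

# Illustrative dimension priors (typical values, not from a dataset spec).
FURNITURE_HEIGHT_PRIORS_M = {
    "chair": 0.45,          # seat height
    "dining_table": 0.75,
    "sofa": 0.85,
    "nightstand": 0.55,
}

def metric_scale_factor(label: str, observed_height_m: float) -> float:
    """Blend the class prior with the depth-derived observation (simple average)."""
    prior = FURNITURE_HEIGHT_PRIORS_M.get(label)
    if prior is None:
        return 1.0                      # no prior: trust the depth estimate as-is
    target = 0.5 * (prior + observed_height_m)
    return target / observed_height_m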

4.3 Scene Graph Construction

# SceneNode and SpatialRelation are project-defined types built in Phases 1 and 4.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class SceneGraph:
    nodes: Dict[str, SceneNode]      # Objects + room shell
    edges: List[SpatialRelation]     # "on", "next to", "in front of", etc.
    room_type: str                   # e.g., "modern_living_room", "scandinavian_kitchen"
    style: str                       # "modern", "scandinavian", "luxury", "indian"

Phase 5: Material & Texture

5.1 PBR Material Generation

For each surface:

  • Base color/albedo (diffuse)
  • Metallic map
  • Roughness map
  • Normal map (bump)
  • Ambient occlusion (optional)

Model: Custom material diffusion network fine-tuned on Hypersim + InteriorNet

5.2 Texture Baking

  • Project multi-view generated textures onto UV atlases
  • Visibility-aware blending (occlusion handling)
  • Seamless tiling for large surfaces (walls, floors)
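
A minimal sketch of the visibility-aware blending weights (array conventions are illustrative):

import numpy as np

def view_blend_weights(normals: np.ndarray, view_dirs: np.ndarray,
                       visible: np.ndarray) -> np.ndarray:
    """Per-texel blend weights for V views: cos(angle) gated by visibility.

    normals: (T, 3) texel normals; view_dirs: (V, 3) unit vectors toward cameras;
    visible: (V, T) bool occlusion mask from depth testing.
    """
    cos = normals @ view_dirs.T                # (T, V) facing term
    w = np.clip(cos, 0.0, None).T * visible    # (V, T); zero for back-facing/occluded
    total = w.sum(axis=0, keepdims=True)
    return np.divide(w, total, out=np.zeros_like(w), where=total > 0)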

5.3 Lighting Estimation

Estimate scene lighting from the input image:

  • HDR environment map extraction
  • Key light / fill light / ambient light decomposition
  • IBL (Image-Based Lighting) setup for game engines

Core Model: InteriorFusion-L (4B Parameters)

Encoder

  • Image encoder: DINOv3-L (frozen, feature extraction)
  • Depth encoder: Custom CNN processing metric depth map
  • Layout encoder: Transformer processing SpatialLM scene graph tokens
  • Semantic encoder: Mask2Former feature pyramid

Latent Representation: SLAT-Interior

Extension of TRELLIS SLAT optimized for indoor scenes:

  • Sparse 3D voxel grid, resolution 1024³
  • Active voxels only on surfaces (wall, furniture)
  • Per-voxel features: shape + material + semantic class
  • Room-shell voxels flagged separately from object voxels
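
A hedged sketch of what the SLAT-Interior container can look like (field names are illustrative, not the shipped schema):

from dataclasses import dataclass
import torch

@dataclass
class SlatInterior:
    coords: torch.Tensor       # (N, 3) int32 active-voxel indices in the 1024³ grid
    features: torch.Tensor     # (N, C) per-voxel latent: shape + material + semantics
    semantic_id: torch.Tensor  # (N,) class label per voxel
    is_shell: torch.Tensor     # (N,) bool: True for wall/floor/ceiling voxels

    def objects_only(self) -> "SlatInterior":
        m = ~self.is_shell
        return SlatInterior(self.coords[m], self.features[m],
                            self.semantic_id[m], self.is_shell[m])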

Decoder

Three parallel decoders:

  1. Mesh decoder: Produces watertight or arbitrary-topology meshes (from O-Voxel)
  2. Gaussian decoder: Produces per-voxel Gaussian parameters
  3. Material decoder: Produces PBR material parameters per surface

Generation Pipeline

Two-stage rectified flow (following TRELLIS pattern):

  1. Structure generation: Dense occupancy grid → sparse structure
  2. Latent generation: Per-active-voxel features → shape + material

Conditioned on: DINOv3 image features + depth map + room layout tokens + semantic segmentation tokens
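
For reference, a minimal sketch of the standard rectified-flow objective that both stages build on (this is the generic formulation, assuming a model with signature model(x_t, t, cond), not InteriorFusion's exact training loop):

import torch
import torch.nn.functional as F

def rectified_flow_loss(model, x1: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
    x0 = torch.randn_like(x1)                      # noise endpoint
    t = torch.rand(x1.shape[0], device=x1.device)  # uniform time in [0, 1)
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))
    x_t = (1 - t_) * x0 + t_ * x1                  # straight-line interpolant
    v_target = x1 - x0                             # constant velocity field
    v_pred = model(x_t, t, cond)                   # DiT predicts velocity
    return F.mse_loss(v_pred, v_target)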


Training Strategy

Stage 1: VAE Pre-training (1 week, 8×A100)

  • Train SLAT-Interior VAE on 3D-FRONT + Structured3D rooms
  • Multi-resolution: 256³ → 512³ → 1024³ curriculum
  • Loss: MSE reconstruction + KL divergence + depth consistency + normal consistency
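
A minimal sketch of the Stage 1 composite objective (the loss weights here are placeholders, not tuned values):

import torch
import torch.nn.functional as F

def vae_loss(recon, target, mu, logvar, depth_pred, depth_gt,
             normal_pred, normal_gt, w_kl=1e-4, w_d=0.1, w_n=0.1):
    """MSE reconstruction + KL + depth consistency + normal consistency."""
    rec = F.mse_loss(recon, target)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    depth = F.l1_loss(depth_pred, depth_gt)
    normal = (1 - F.cosine_similarity(normal_pred, normal_gt, dim=1)).mean()
    return rec + w_kl * kl + w_d * depth + w_n * normal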

Stage 2: Flow-Matching DiT (2 weeks, 32×A100)

  • Train rectified flow transformer for structure generation
  • Curriculum: 256³ → 512³ → 1024³
  • Conditioning: image + depth + layout

Stage 3: Material DiT (1 week, 16×A100)

  • Train material generation DiT conditioned on geometry + input image
  • PBR material prediction: albedo, metallic, roughness, normal

Stage 4: Fine-tuning (3 days, 8×A100)

  • LoRA fine-tuning on real interior photos (ScanNet + HM3D)
  • Domain adaptation from synthetic to real
  • Reinforcement learning for geometry consistency (GRPO-style)

Total training: ~4.5 weeks wall-clock for the four sequential stages, peaking at 32×A100 during Stage 2


Inference Optimization

RTX 4090 (24GB VRAM)

  • Model quantization: INT8 via GPTQ
  • Gradient checkpointing disabled (inference only)
  • Gaussian splatting for real-time preview
  • Full mesh generation: ~15 seconds

A100 (80GB VRAM)

  • FP16 inference
  • Batch generation for multiple objects
  • Full pipeline: ~8 seconds

H100 (80GB VRAM)

  • BF16 inference
  • ~5 seconds full generation

Edge / Mobile

  • Core depth + layout estimation only (~2 seconds)
  • Cloud-based 3D generation with streaming
  • Reduced mesh quality (decimated, lower texture resolution)

Export Formats

| Format | Use Case | Features |
|---|---|---|
| GLB | Web, AR, Unity, Godot | PBR materials, animations, all data |
| FBX | Unreal Engine, Maya, 3ds Max | Full rigging support, PBR |
| OBJ | Legacy compatibility | Basic materials (MTL) |
| USDZ | iOS AR (ARKit) | Apple's native format |
| 3DGS (.ply) | Real-time viewing | Gaussian splatting render |
| BLEND | Blender native | Full editability, nodes |
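
A minimal GLB export sketch using trimesh (variable names are illustrative; USDZ, FBX, and BLEND typically go through a converter or DCC tool rather than trimesh):

import trimesh

scene = trimesh.Scene()
scene.add_geometry(room_shell_mesh, node_name="room_shell")
for name, mesh in furniture_meshes.items():   # assembled in Phase 4
    scene.add_geometry(mesh, node_name=name)

scene.export("interior.glb")                  # GLB embeds PBR materials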