InteriorFusion Dataset Strategy

Core Training Dataset: InteriorFusion-Train

We curate a composite dataset from multiple sources, processed into a unified format.

Dataset Composition

| Source | Split | Rooms/Scenes | Images | Purpose | Weight |
|---|---|---|---|---|---|
| 3D-FRONT (HF MIDI-3D) | train | 14,000 | ~500K | Primary training | 40% |
| Structured3D | train | 18,000 | ~360K | Layout structure | 25% |
| InteriorNet | train | 50,000 | ~1M | Scale pre-training | 20% |
| ScanNet++ | train | 1,200 | ~50K | Real-world adaptation | 10% |
| HM3D | train | 800 | ~30K | Real-world adaptation | 5% |

Total: ~85K rooms, ~2M training images
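
The weights in the table are not proportional to image counts (InteriorNet supplies roughly half of the images but only 20% of the mix), so they read most naturally as per-source sampling probabilities. A minimal sketch of that interpretation follows; the `"interiornet"` and `"hm3d"` labels and the `sample_sources` helper are illustrative assumptions, not part of the InteriorFusion codebase.

```python
import random
from collections import Counter

# Sampling weights from the composition table (fractions of the training mix).
SOURCE_WEIGHTS = {
    "3dfront": 0.40,
    "structured3d": 0.25,
    "interiornet": 0.20,   # assumed label
    "scannet": 0.10,       # ScanNet++
    "hm3d": 0.05,          # assumed label
}


def sample_sources(n, seed=0):
    """Draw n source names with probability proportional to their weight."""
    rng = random.Random(seed)
    names = list(SOURCE_WEIGHTS)
    weights = [SOURCE_WEIGHTS[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


# Each training step would first pick a source, then a sample within it.
counts = Counter(sample_sources(10_000))
```

With this scheme the small real-world sets (ScanNet++, HM3D) are revisited far more often per image than the large synthetic ones, which is the usual trade-off when up-weighting scarce data.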

Unified Data Format

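from dataclasses import dataclass
from typing import List

import torch
import trimesh

# RoomLayout, SceneGraph, ObjectData, GaussianCloud, PBRMaterial, and CameraPose
# are assumed to be defined elsewhere in the project.

@dataclass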
class InteriorSample:
    # Input
    image: torch.Tensor           # [3, H, W] — single interior photo
    depth: torch.Tensor           # [1, H, W] — metric depth in meters
    normal: torch.Tensor          # [3, H, W] — surface normals
    
    # Scene understanding
    room_layout: RoomLayout       # Walls, floor, ceiling planes
    room_type: str                # "living_room", "bedroom", "kitchen"
    style: str                    # "modern", "scandinavian", "luxury"
    scene_graph: SceneGraph       # Object nodes + spatial relations
    
    # Per-object data
    objects: List[ObjectData]     # Individual furniture items
    
    # 3D ground truth
    room_mesh: trimesh.Trimesh     # Full room mesh (walls + floor + ceiling)
    object_meshes: List[trimesh.Trimesh]  # Per-object meshes
    gaussian_cloud: GaussianCloud  # 3D Gaussian representation
    
    # Materials
    materials: List[PBRMaterial]   # Per-object PBR materials
    wall_material: PBRMaterial
    floor_material: PBRMaterial
    
    # Camera
    camera_pose: CameraPose       # Intrinsics + extrinsics
    fov: float
    
    # Metadata
    source: str                   # "3dfront", "structured3d", "scannet"
    caption: str                  # Natural language description
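
The shape comments on `image`, `depth`, and `normal` imply an invariant that is cheap to check before packaging. The helper below is a hypothetical sketch (the name `check_shapes` is not from the project); it works on plain shape tuples so it can be tried without any tensor library.

```python
def check_shapes(image_shape, depth_shape, normal_shape):
    """Return a list of problems; an empty list means the shapes agree."""
    problems = []
    expected = (("image", image_shape, 3),
                ("depth", depth_shape, 1),
                ("normal", normal_shape, 3))
    for name, shape, channels in expected:
        if len(shape) != 3 or shape[0] != channels:
            problems.append(f"{name} must be [{channels}, H, W], got {list(shape)}")
    # All three maps must share the same spatial size (H, W).
    spatial = {tuple(shape[1:]) for _, shape, _ in expected if len(shape) == 3}
    if len(spatial) > 1:
        problems.append("image, depth, and normal must share H and W")
    return problems
```

For a populated sample this would be called as `check_shapes(s.image.shape, s.depth.shape, s.normal.shape)`, since a tensor shape behaves like a tuple.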

Preprocessing Pipeline

Raw Dataset → Filter → Render Views → Compute Depth →
    Segment Objects → Extract Layout →
    Generate Multi-View → Create SLAT →
    Validate → Package → Upload to HF
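
One way to organize such a pipeline is as an ordered list of stage functions, where any stage can reject a record. The sketch below is an assumption about structure only: `run_pipeline` and the two placeholder stages are hypothetical and stand in for the real rendering, depth, segmentation, and SLAT steps.

```python
from typing import Callable, Iterable, Iterator, Optional, Sequence

# A stage takes a record and returns the updated record, or None to drop it.
Stage = Callable[[dict], Optional[dict]]


def run_pipeline(records: Iterable[dict], stages: Sequence[Stage]) -> Iterator[dict]:
    """Pass each record through the stages in order; drop it if a stage returns None."""
    for record in records:
        for stage in stages:
            record = stage(record)
            if record is None:
                break          # rejected: skip the remaining stages
        else:
            yield record       # survived every stage


def filter_stage(record):
    # Placeholder for "Filter": keep scenes with at least 2 furniture objects.
    return record if record["num_objects"] >= 2 else None


def validate_stage(record):
    # Placeholder for "Validate": mark the record as checked.
    return {**record, "validated": True}


raw = [{"id": "a", "num_objects": 3}, {"id": "b", "num_objects": 1}]
packaged = list(run_pipeline(raw, [filter_stage, validate_stage]))
```

Because `run_pipeline` is a generator, records stream through one at a time, which matters when each one carries rendered views and meshes.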

Filtering Criteria

  1. Quality filter: Minimum resolution 512×512
  2. Content filter: Must contain at least 2 furniture objects
  3. Occlusion filter: Main objects must be >30% visible
  4. Room type filter: Exclude bathrooms, garages, and outdoor scenes
  5. Lighting filter: Exclude extremely dark or overexposed scenes
  6. Duplicate filter: Perceptual hash deduplication
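
The six criteria can be sketched as a per-scene predicate plus a deduplication pass. Everything named here is hypothetical (`SceneMeta`, `passes_filters`, `deduplicate`); the luminance thresholds and the Hamming-distance cutoff are illustrative assumptions, since the criteria above do not give numeric values for them, and the perceptual hash is assumed to be computed elsewhere.

```python
from dataclasses import dataclass
from typing import List

EXCLUDED_ROOM_TYPES = {"bathroom", "garage", "outdoor"}


@dataclass
class SceneMeta:
    width: int
    height: int
    room_type: str
    num_furniture: int
    main_visibility: List[float]  # visible fraction (0..1) of each main object
    mean_luminance: float         # mean pixel luminance in [0, 1]
    phash: int                    # 64-bit perceptual hash, computed elsewhere


def passes_filters(m, dark=0.05, bright=0.95):
    """Apply criteria 1-5; `dark` and `bright` are assumed thresholds."""
    if min(m.width, m.height) < 512:
        return False
    if m.num_furniture < 2:
        return False
    if any(v <= 0.30 for v in m.main_visibility):
        return False
    if m.room_type in EXCLUDED_ROOM_TYPES:
        return False
    if not dark < m.mean_luminance < bright:
        return False
    return True


def hamming(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def deduplicate(metas, max_distance=4):
    """Criterion 6: keep a scene only if its hash is far from every kept hash."""
    kept = []
    for m in metas:
        if all(hamming(m.phash, k.phash) > max_distance for k in kept):
            kept.append(m)
    return kept
```

The pairwise comparison in `deduplicate` is quadratic; at the scale of millions of images an index over the hashes would likely be needed instead.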

Augmentation Pipeline

  1. Color jitter: brightness ±0.2, contrast ±0.2, saturation ±0.2, hue ±0.1
  2. Random crop: 0.8–1.0 scale, maintain aspect ratio
  3. Horizontal flip: 50% probability
  4. Perspective warp: Simulate different camera angles (±15° pitch, ±20° yaw)
  5. Synthetic occlusion: Add random rectangles simulating foreground objects
  6. Depth noise: Add Gaussian noise to depth map (σ=0.05m) for robustness
  7. Lighting variation: Re-render with different HDRI environments
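
The numeric ranges above translate directly into a parameter sampler. The sketch below also covers one detail that is easy to miss: a horizontal flip must be applied to the normal map as well, and the x-component of each normal has to be negated. This assumes camera-space normals with x pointing right; all function names are hypothetical, and plain Python lists stand in for tensors.

```python
import random


def sample_augmentation(rng):
    """Draw one set of augmentation parameters within the listed ranges."""
    return {
        "brightness": rng.uniform(-0.2, 0.2),
        "contrast": rng.uniform(-0.2, 0.2),
        "saturation": rng.uniform(-0.2, 0.2),
        "hue": rng.uniform(-0.1, 0.1),
        "crop_scale": rng.uniform(0.8, 1.0),
        "hflip": rng.random() < 0.5,
        "pitch_deg": rng.uniform(-15.0, 15.0),
        "yaw_deg": rng.uniform(-20.0, 20.0),
    }


def flip_normal_row(row):
    """Mirror one row of (nx, ny, nz) normals: reverse pixel order, negate nx."""
    return [(-nx, ny, nz) for nx, ny, nz in reversed(row)]


def add_depth_noise(depths, rng, sigma=0.05):
    """Add zero-mean Gaussian noise (sigma in meters), clamping at zero."""
    return [max(0.0, d + rng.gauss(0.0, sigma)) for d in depths]


rng = random.Random(0)
params = sample_augmentation(rng)
noisy = add_depth_noise([1.0, 2.5, 4.0], rng)
```

Geometric augmentations (crop, flip, perspective warp) have to be applied identically to the image, depth, and normal maps, and reflected in the camera pose, or the ground truth no longer matches the input.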

Captioning Strategy

Captions are generated automatically, Cap3D-style, and cover:

  • Room type: "a modern living room with a gray sofa and wooden coffee table"
  • Style: "scandinavian minimalist interior with natural light"
  • Objects: "contains: sofa, coffee table, floor lamp, bookshelf"
  • Materials: "wooden floor, white walls, leather sofa"
  • Spatial: "sofa against back wall, coffee table centered, lamp in corner"

Manual review: 10% random sample reviewed by interior designers for quality.
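
To illustrate how the caption facets fit together, here is a template-based stand-in. It is not Cap3D and not a model-based captioner; it only shows the five facets being combined from structured fields, along with a reproducible way to draw the 10% review sample. Both function names are hypothetical.

```python
import random


def build_caption(room_type, style, objects, materials, relations):
    """Join the caption facets into one string, skipping empty facets."""
    parts = [f"a {style} {room_type.replace('_', ' ')}"]
    if objects:
        parts.append("contains: " + ", ".join(objects))
    if materials:
        parts.append(", ".join(materials))
    if relations:
        parts.append(", ".join(relations))
    return "; ".join(parts)


def pick_review_sample(sample_ids, fraction=0.10, seed=0):
    """Choose a reproducible random subset of IDs for manual review."""
    rng = random.Random(seed)
    ids = list(sample_ids)
    return rng.sample(ids, round(len(ids) * fraction))


caption = build_caption(
    "living_room", "modern",
    ["sofa", "coffee table"],
    ["wooden floor", "white walls"],
    ["sofa against back wall"],
)
review_ids = pick_review_sample(range(1000))
```

Fixing the seed keeps the review set stable across runs, so reviewer feedback can be tied back to specific samples.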

Synthetic Data Generation

Using ProcTHOR with the AI2-THOR simulator:

  1. Generate 100K additional procedural rooms
  2. Randomize: furniture placement, materials, lighting, camera position
  3. Render 20 views per room
  4. Add to training mix with 15% weight
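
Two pieces of arithmetic follow from these steps: the synthetic set adds 100,000 × 20 = 2,000,000 images, and a 15% share has to come out of the existing mix, whose source weights already sum to 100%. The document does not say how the shares are rebalanced; the sketch below assumes proportional rescaling, and the `"procthor"`, `"interiornet"`, and `"hm3d"` labels and the `add_source` helper are hypothetical.

```python
SYNTHETIC_ROOMS = 100_000
VIEWS_PER_ROOM = 20
synthetic_images = SYNTHETIC_ROOMS * VIEWS_PER_ROOM  # 2,000,000 rendered views

# Weights from the Dataset Composition table.
BASE_WEIGHTS = {
    "3dfront": 0.40,
    "structured3d": 0.25,
    "interiornet": 0.20,
    "scannet": 0.10,
    "hm3d": 0.05,
}


def add_source(weights, name, share):
    """Give `name` a fixed share and scale the existing weights to fill the rest."""
    scaled = {key: value * (1.0 - share) for key, value in weights.items()}
    scaled[name] = share
    return scaled


# Assumed rebalancing: every existing source keeps its relative proportion.
mix = add_source(BASE_WEIGHTS, "procthor", 0.15)
```

Under this assumption 3D-FRONT drops from 40% to 34% of the mix and the other sources shrink by the same factor of 0.85.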

Data Splits

| Split | Rooms | Images | Purpose |
|---|---|---|---|
| Train | 75,000 | 1,800,000 | Model training |
| Val | 5,000 | 120,000 | Hyperparameter tuning |
| Test | 5,000 | 120,000 | Final evaluation |
| Benchmark | 500 | 12,000 | Leaderboard / comparison |
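
Every split in the table averages 24 images per room, which suggests splitting by room rather than by image so that views of one room never leak across splits. A minimal sketch of a deterministic, room-level assignment follows. It is a hypothetical scheme, not the project's actual procedure; it reproduces the 75,000 / 5,000 / 5,000 proportions only approximately and does not cover the Benchmark set, whose relation to the other splits is not stated.

```python
import hashlib

# Cumulative bucket bounds matching 75,000 / 5,000 / 5,000 out of 85,000 rooms.
SPLIT_BOUNDS = [("train", 75), ("val", 80), ("test", 85)]


def assign_split(room_id):
    """Map a room ID to a split via a stable hash, so all its views stay together."""
    digest = hashlib.sha256(room_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 85
    for name, upper in SPLIT_BOUNDS:
        if bucket < upper:
            return name
    raise AssertionError("bucket is always below 85")


# Images per room implied by the table.
images_per_room = 1_800_000 // 75_000
```

Hashing the room ID instead of shuffling means the assignment does not depend on dataset order and stays stable when new rooms are added.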