# InteriorFusion Dataset Strategy

## Core Training Dataset: InteriorFusion-Train
We curate a composite dataset from multiple sources, processed into a unified format.
### Dataset Composition
| Source | Split | Rooms/Scenes | Images | Purpose | Weight |
|---|---|---|---|---|---|
| 3D-FRONT (HF MIDI-3D) | train | 14,000 | ~500K | Primary training | 40% |
| Structured3D | train | 18,000 | ~360K | Layout structure | 25% |
| InteriorNet | train | 50,000 | ~1M | Scale pre-training | 20% |
| ScanNet++ | train | 1,200 | ~50K | Real-world adaptation | 10% |
| HM3D | train | 800 | ~30K | Real-world adaptation | 5% |
Total: ~85K rooms, ~2M training images
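The Weight column can be realized at load time with a weighted source sampler. A minimal sketch (the short source keys and the `sample_source` helper are illustrative, not fixed identifiers):

```python
import random

# Mixing weights from the composition table (sum to 1.0).
SOURCE_WEIGHTS = {
    "3dfront": 0.40,
    "structured3d": 0.25,
    "interiornet": 0.20,
    "scannetpp": 0.10,
    "hm3d": 0.05,
}

def sample_source(rng: random.Random) -> str:
    """Draw a source name with probability proportional to its weight."""
    names = list(SOURCE_WEIGHTS)
    weights = list(SOURCE_WEIGHTS.values())
    return rng.choices(names, weights=weights, k=1)[0]

# Sanity check: empirical frequencies track the target weights.
rng = random.Random(0)
counts = {name: 0 for name in SOURCE_WEIGHTS}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
```

In a real loader the same idea applies per batch: pick a source by weight, then draw the next sample from that source's shard.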
### Unified Data Format

```python
from dataclasses import dataclass
from typing import List

import torch
import trimesh

# RoomLayout, SceneGraph, ObjectData, GaussianCloud, PBRMaterial, and
# CameraPose are project-defined types, referenced as forward declarations.


@dataclass
class InteriorSample:
    # Input
    image: torch.Tensor                    # [3, H, W] — single interior photo
    depth: torch.Tensor                    # [1, H, W] — metric depth in meters
    normal: torch.Tensor                   # [3, H, W] — surface normals

    # Scene understanding
    room_layout: "RoomLayout"              # Walls, floor, ceiling planes
    room_type: str                         # "living_room", "bedroom", "kitchen"
    style: str                             # "modern", "scandinavian", "luxury"
    scene_graph: "SceneGraph"              # Object nodes + spatial relations

    # Per-object data
    objects: List["ObjectData"]            # Individual furniture items

    # 3D ground truth
    room_mesh: trimesh.Trimesh             # Full room mesh (walls + floor + ceiling)
    object_meshes: List[trimesh.Trimesh]   # Per-object meshes
    gaussian_cloud: "GaussianCloud"        # 3D Gaussian representation

    # Materials
    materials: List["PBRMaterial"]         # Per-object PBR materials
    wall_material: "PBRMaterial"
    floor_material: "PBRMaterial"

    # Camera
    camera_pose: "CameraPose"              # Intrinsics + extrinsics
    fov: float                             # Field of view

    # Metadata
    source: str                            # "3dfront", "structured3d", "scannet"
    caption: str                           # Natural language description
```
### Preprocessing Pipeline

```
Raw Dataset → Filter → Render Views → Compute Depth →
Segment Objects → Extract Layout →
Generate Multi-View → Create SLAT →
Validate → Package → Upload to HF
```
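The stage chain above can be sketched as a generic pipeline runner, where each stage either transforms a scene record or rejects it. A sketch under assumptions: the dict-based scene record and the toy stages are illustrative, not the production stage implementations.

```python
from typing import Callable, Iterable, List, Optional

# Hypothetical stage signature: each stage takes a scene record and either
# returns a (possibly updated) record, or None to drop the sample.
Stage = Callable[[dict], Optional[dict]]

def run_pipeline(raw_scenes: Iterable[dict], stages: List[Stage]) -> List[dict]:
    """Push every raw scene through the ordered stages, dropping rejects."""
    out = []
    for scene in raw_scenes:
        for stage in stages:
            scene = stage(scene)
            if scene is None:
                break          # stage rejected the sample
        else:
            out.append(scene)  # survived every stage
    return out

# Toy stages: keep scenes with >= 2 objects, then tag them as validated.
keep_furnished = lambda s: s if s["n_objects"] >= 2 else None
mark_valid = lambda s: {**s, "validated": True}

processed = run_pipeline(
    [{"n_objects": 3}, {"n_objects": 1}, {"n_objects": 5}],
    [keep_furnished, mark_valid],
)
```

Keeping stages as plain functions makes it easy to reorder them or run a subset (e.g. skip rendering when only re-captioning).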
### Filtering Criteria
- Quality filter: Minimum resolution 512×512
- Content filter: Must contain at least 2 furniture objects
- Occlusion filter: Main objects must be >30% visible
- Room type filter: Exclude bathrooms, garages, outdoor
- Lighting filter: Exclude extremely dark or overexposed scenes
- Duplicate filter: Perceptual hash deduplication
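All of these except the perceptual-hash deduplication reduce to simple threshold checks. A sketch, assuming per-scene statistics are available (`SceneStats` and its fields are hypothetical, and the luminance bounds are illustrative):

```python
from dataclasses import dataclass

@dataclass
class SceneStats:
    # Hypothetical per-scene statistics gathered during preprocessing.
    width: int
    height: int
    n_furniture: int
    main_visibility: float   # fraction of the main objects' area that is visible
    room_type: str
    mean_luma: float         # average luminance in [0, 1]

EXCLUDED_ROOMS = {"bathroom", "garage", "outdoor"}

def passes_filters(s: SceneStats) -> bool:
    """Apply the threshold-based filtering criteria from the list above."""
    return (
        s.width >= 512 and s.height >= 512     # quality filter
        and s.n_furniture >= 2                 # content filter
        and s.main_visibility > 0.30           # occlusion filter
        and s.room_type not in EXCLUDED_ROOMS  # room type filter
        and 0.05 < s.mean_luma < 0.95          # lighting filter (assumed bounds)
    )
```

Deduplication runs as a separate pass, since it compares scenes against each other rather than checking one scene in isolation.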
### Augmentation Pipeline
- Color jitter: brightness ±0.2, contrast ±0.2, saturation ±0.2, hue ±0.1
- Random crop: 0.8–1.0 scale, maintain aspect ratio
- Horizontal flip: 50% probability
- Perspective warp: Simulate different camera angles (±15° pitch, ±20° yaw)
- Synthetic occlusion: Add random rectangles simulating foreground objects
- Depth noise: Add Gaussian noise to depth map (σ=0.05m) for robustness
- Lighting variation: Re-render with different HDRI environments
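Two of the augmentations above, the horizontal flip and the depth-noise injection, can be sketched in NumPy; the photometric transforms map onto standard torchvision transforms such as `ColorJitter`. Function names here are illustrative:

```python
import numpy as np

def jitter_depth(depth: np.ndarray, sigma: float = 0.05, seed: int = 0) -> np.ndarray:
    """Add Gaussian noise (sigma in meters) to a metric depth map,
    clamping at zero so no depth becomes negative."""
    rng = np.random.default_rng(seed)
    noisy = depth + rng.normal(0.0, sigma, size=depth.shape)
    return np.clip(noisy, 0.0, None)

def random_hflip(image: np.ndarray, depth: np.ndarray, p: float = 0.5, seed: int = 0):
    """Flip an image [3, H, W] and its depth [1, H, W] together with
    probability p — paired modalities must always flip jointly."""
    rng = np.random.default_rng(seed)
    if rng.random() < p:
        return image[..., ::-1].copy(), depth[..., ::-1].copy()
    return image, depth
```

Note that geometric augmentations (flip, crop, perspective warp) must be applied identically to image, depth, and normals, and flipping additionally negates the x-component of the normal map.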
### Captioning Strategy

Automatic captions are produced with a Cap3D-style generation pipeline:
- Room type: "a modern living room with a gray sofa and wooden coffee table"
- Style: "scandinavian minimalist interior with natural light"
- Objects: "contains: sofa, coffee table, floor lamp, bookshelf"
- Materials: "wooden floor, white walls, leather sofa"
- Spatial: "sofa against back wall, coffee table centered, lamp in corner"
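The caption fields above can be assembled into a single training caption with a small template function; the field order and separators here are assumptions, not a fixed spec:

```python
from typing import List

def build_caption(style: str, room_type: str, objects: List[str],
                  materials: List[str], spatial: List[str]) -> str:
    """Join the captioning fields into one natural-language string."""
    parts = [
        f"{style} {room_type.replace('_', ' ')}",  # style + room type
        "contains: " + ", ".join(objects),          # object inventory
        ", ".join(materials),                       # material summary
        "; ".join(spatial),                         # spatial relations
    ]
    return ". ".join(p for p in parts if p)

caption = build_caption(
    "modern", "living_room",
    ["sofa", "coffee table", "floor lamp"],
    ["wooden floor", "white walls"],
    ["sofa against back wall"],
)
```

Randomly dropping or reordering fields during training is a common way to keep the model robust to partial prompts.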
Manual review: 10% random sample reviewed by interior designers for quality.
### Synthetic Data Generation

Using ProcTHOR with the AI2-THOR simulator:
- Generate 100K additional procedural rooms
- Randomize: furniture placement, materials, lighting, camera position
- Render 20 views per room
- Add to training mix with 15% weight
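The camera-randomization step can be sketched as a per-view sampling routine. The pitch range reuses the ±15° from the augmentation list; all other ranges, the HDRI names, and the function itself are illustrative, and the actual ProcTHOR/AI2-THOR calls are omitted:

```python
import random
from typing import List

def sample_render_config(room_id: int, n_views: int = 20, seed: int = 0) -> List[dict]:
    """Sample deterministic per-view camera placements for one procedural room."""
    rng = random.Random(seed * 100_003 + room_id)  # stable per (seed, room)
    views = []
    for v in range(n_views):
        views.append({
            "room_id": room_id,
            "view": v,
            "camera_height_m": rng.uniform(1.2, 1.8),  # assumed eye-level band
            "yaw_deg": rng.uniform(0.0, 360.0),
            "pitch_deg": rng.uniform(-15.0, 15.0),
            "hdri": rng.choice(["studio", "daylight", "evening"]),  # placeholder names
        })
    return views
```

Seeding per room keeps renders reproducible, so a room can be re-rendered later (e.g. at higher resolution) with identical camera poses.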
### Data Splits
| Split | Rooms | Images | Purpose |
|---|---|---|---|
| Train | 75,000 | 1,800,000 | Model training |
| Val | 5,000 | 120,000 | Hyperparameter tuning |
| Test | 5,000 | 120,000 | Final evaluation |
| Benchmark | 500 | 12,000 | Leaderboard / comparison |
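To keep room-level splits stable across reprocessing runs, assignment can be done by hashing the room ID rather than shuffling. A sketch matching the 75K/5K/5K room proportions (the 500-room benchmark set is assumed to be carved out of test separately):

```python
import hashlib
from typing import List, Tuple

# Cumulative split boundaries matching the table (85K rooms total).
SPLIT_BOUNDS: List[Tuple[str, float]] = [
    ("train", 75_000 / 85_000),
    ("val", 80_000 / 85_000),
    ("test", 1.0),
]

def assign_split(room_id: str) -> str:
    """Deterministically assign a room to a split by hashing its ID,
    so the split is identical across runs, machines, and dataset versions."""
    digest = hashlib.sha256(room_id.encode()).digest()
    u = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    for name, bound in SPLIT_BOUNDS:
        if u < bound:
            return name
    return "test"
```

Hashing by room (not by image) also guarantees that all views of one room land in the same split, avoiding train/test leakage through shared geometry.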