InteriorFusion: Research Report & Literature Review

Executive Summary

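After analyzing 50+ papers, 20+ repositories, and 15+ datasets, we found no existing open-source system that solves single-image-to-3D-interior generation at production quality. Current SOTA image-to-3D generators are object-centric, and the scene-level systems surveyed (2DGS-Room, Pano2Room, SpatialLM) cover indoor reconstruction, panorama input, or layout estimation rather than full room generation from a single image. InteriorFusion bridges this gap through a scene-aware hybrid architecture.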

SOTA Comparison Table

System	Geometry Quality	Texture Quality	Inference Speed	VRAM Usage	Multi-View Consistency	Scene Generation	Mesh Quality	CAD Compatible	Controllable	Training Cost	Fine-Tuning Difficulty	Commercial Usable
TRELLIS	⭐⭐⭐⭐	⭐⭐⭐	15s	24GB	⭐⭐⭐⭐	❌ (object-only)	⭐⭐⭐⭐	⚠️ (needs export)	⭐⭐⭐	$50K (64×A100)	Medium	✅ MIT
TRELLIS.2	⭐⭐⭐⭐⭐	⭐⭐⭐⭐⭐	12s	32GB	⭐⭐⭐⭐⭐	❌ (object-only)	⭐⭐⭐⭐⭐	✅ Native PBR	⭐⭐⭐⭐	$100K (32×H100)	Hard	✅ MIT
Hunyuan3D-2	⭐⭐⭐⭐	⭐⭐⭐⭐⭐	25s	24GB	⭐⭐⭐⭐	❌ (object-only)	⭐⭐⭐⭐	✅	⭐⭐⭐	Unknown	Hard	⚠️ (Tencent license)
Hunyuan3D-2.5	⭐⭐⭐⭐⭐	⭐⭐⭐⭐⭐	30s	48GB	⭐⭐⭐⭐⭐	❌ (object-only)	⭐⭐⭐⭐⭐	✅	⭐⭐⭐⭐	Unknown	Hard	⚠️
TripoSR	⭐⭐⭐	⭐⭐⭐	0.5s	8GB	⭐⭐⭐	❌	⭐⭐⭐	⚠️	⭐⭐	$5K (8×A100)	Easy	✅ MIT
SF3D	⭐⭐⭐⭐	⭐⭐⭐⭐	0.5s	10GB	⭐⭐⭐⭐	❌	⭐⭐⭐⭐	✅ PBR	⭐⭐⭐	$5K	Medium	✅ MIT
InstantMesh	⭐⭐⭐⭐	⭐⭐⭐⭐	10s	16GB	⭐⭐⭐⭐⭐	❌	⭐⭐⭐⭐⭐	⚠️	⭐⭐⭐	$20K	Medium	✅
CRM	⭐⭐⭐⭐⭐	⭐⭐⭐	4s	16GB	⭐⭐⭐⭐	❌	⭐⭐⭐⭐⭐	⚠️	⭐⭐⭐	$8K (8×A800)	Medium	✅
LGM	⭐⭐⭐	⭐⭐⭐⭐	5s	24GB	⭐⭐⭐⭐	❌	⭐⭐⭐ (Gaussian)	❌	⭐⭐	$30K (32×A100)	Medium	✅
Era3D	⭐⭐⭐⭐	⭐⭐⭐⭐	4min	24GB	⭐⭐⭐⭐⭐	❌	⭐⭐⭐⭐	⚠️	⭐⭐⭐	$15K (16×H800)	Hard	✅
Wonder3D	⭐⭐⭐⭐	⭐⭐⭐⭐	2min	16GB	⭐⭐⭐⭐⭐	❌	⭐⭐⭐⭐	⚠️	⭐⭐⭐	$10K	Medium	✅
SyncDreamer	⭐⭐⭐	⭐⭐⭐⭐	30s	16GB	⭐⭐⭐⭐⭐	❌	⭐⭐⭐	❌	⭐⭐	$8K	Easy	✅
MVDream	⭐⭐	⭐⭐⭐	20s	16GB	⭐⭐⭐⭐	❌	⭐⭐	❌	⭐⭐	$10K	Medium	✅
2DGS-Room	⭐⭐⭐⭐	⭐⭐⭐⭐	~30s	24GB	⭐⭐⭐	✅ (rooms!)	⭐⭐⭐	❌	⭐⭐	$5K	Hard	✅
Pano2Room	⭐⭐⭐⭐	⭐⭐⭐⭐⭐	~2min	16GB	⭐⭐⭐⭐	✅ (panoramas)	⭐⭐⭐	❌	⭐⭐	$3K	Medium	✅
SpatialLM	N/A	N/A	1s	8GB	N/A	✅ (layouts!)	N/A	N/A	⭐⭐⭐⭐⭐	$20K	Easy	✅ Apache 2.0
InteriorFusion (target)	⭐⭐⭐⭐⭐	⭐⭐⭐⭐⭐	8s	16GB	⭐⭐⭐⭐⭐	✅✅✅	⭐⭐⭐⭐⭐	✅✅✅	⭐⭐⭐⭐⭐	$60K	Medium	✅ MIT

Why Current Models Fail for Interiors

1. Inconsistent Room Geometry

Root cause: No room topology prior. Object-centric models generate inside a unit cube, whereas rooms need planar walls that meet at right angles. Fix: Explicit room layout estimation (SpatialLM) constrains walls, floor, and ceiling to Manhattan-world planes.
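
A minimal sketch of what a Manhattan-world constraint can look like, assuming wall normals have already been estimated and the room is aligned with the world axes. `snap_to_manhattan` is a hypothetical helper for illustration, not part of SpatialLM or of the InteriorFusion code.

```python
import numpy as np

def snap_to_manhattan(normals: np.ndarray) -> np.ndarray:
    """Snap each estimated plane normal to the closest signed coordinate axis."""
    normals = np.asarray(normals, dtype=float)
    # Index of the dominant component of each normal.
    dominant = np.argmax(np.abs(normals), axis=1)
    snapped = np.zeros_like(normals)
    rows = np.arange(len(normals))
    # Keep only the dominant component, with its original sign.
    snapped[rows, dominant] = np.sign(normals[rows, dominant])
    return snapped

# Two slightly tilted wall normals (illustrative values).
noisy_wall_normals = np.array([[0.98, 0.05, 0.19], [-0.10, 0.02, -0.99]])
snapped_normals = snap_to_manhattan(noisy_wall_normals)
```

After snapping, every wall, floor, and ceiling plane is exactly axis-aligned, which is the property the room shell relies on.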

2. Furniture Floating

Root cause: No gravity/physics prior. Objects generated independently with no floor contact constraint. Fix: Collision detection + physics relaxation in scene assembly phase. Floor plane from depth estimation anchors all objects.
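
The floor-contact part of this fix can be sketched as a single translation along the up axis. This is an illustrative simplification that assumes a horizontal floor at a known height and a y-up convention; it does not cover collision detection or physics relaxation, and `drop_to_floor` is a hypothetical name.

```python
import numpy as np

def drop_to_floor(vertices: np.ndarray, floor_height: float, up_axis: int = 1) -> np.ndarray:
    """Translate an object along the up axis so its lowest vertex touches the floor."""
    vertices = np.asarray(vertices, dtype=float)
    # Signed distance between the floor and the lowest point of the object.
    offset = floor_height - vertices[:, up_axis].min()
    moved = vertices.copy()
    moved[:, up_axis] += offset
    return moved

# A chair whose lowest point floats 0.3 m above the floor (illustrative values).
floating_chair = np.array([[0.0, 0.3, 0.0], [0.5, 1.2, 0.5]])
grounded_chair = drop_to_floor(floating_chair, floor_height=0.0)
```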

3. Inaccurate Scaling

Root cause: Object-centric models normalize every output to a unit cube, so a chair and a sofa both fill [−1,1]³. Fix: Metric depth estimation (the metric indoor variant of Depth Anything V2) provides real-world scale in meters, and furniture dimensions are matched against a prior database.
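
As a rough illustration of the rescaling step, the sketch below uniformly scales a normalized mesh so that its height matches a metric target. It assumes the target height has already been obtained (from metric depth or a furniture prior); the function name and the 0.9 m value are hypothetical.

```python
import numpy as np

def rescale_to_metric(vertices: np.ndarray, target_height_m: float, up_axis: int = 1) -> np.ndarray:
    """Uniformly scale a normalized mesh so its height matches a metric target."""
    vertices = np.asarray(vertices, dtype=float)
    current_height = vertices[:, up_axis].max() - vertices[:, up_axis].min()
    scale = target_height_m / current_height
    # Uniform scaling preserves the object's proportions.
    return vertices * scale

# Bounding corners of a chair generated in [-1, 1]^3.
unit_chair = np.array([[-1.0, -1.0, -1.0], [1.0, 1.0, 1.0]])
# 0.9 m is an assumed chair height, used only for this example.
metric_chair = rescale_to_metric(unit_chair, target_height_m=0.9)
```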

4. Wall/Floor Topology Issues

Root cause: No distinction between room shell and furniture. Models try to generate everything as one mesh. Fix: Separate room shell generation (planar meshes) from per-object generation. Room shell voxels flagged separately in SLAT-Interior.

5. Poor Spatial Relationships

Root cause: Independent object generation. No knowledge that "lamp goes on table" or "sofa faces TV". Fix: Scene graph generation + learned layout prior from 3D-FRONT. Spatial relations encoded as edge features in scene graph.

6. Weak Depth Consistency

Root cause: Single-view depth estimators produce inconsistent depth across object boundaries. Fix: Multi-view depth fusion + cross-view depth-normal consistency loss. Depth-conditioned generation at every stage.

7. Multi-Object Scene Collapse

Root cause: When multiple objects appear in one image, models merge them into a single blob. Fix: Semantic segmentation → per-object isolation → independent generation → scene assembly.

8. Texture Bleeding

Root cause: Multi-view texture projection without occlusion handling. Wall texture bleeds onto furniture. Fix: Visibility-aware texture baking with depth-buffer occlusion testing. Per-object UV atlases.
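
A depth-buffer occlusion test can be sketched as follows. The sketch assumes surface points have already been projected into the view (integer pixel coordinates plus camera-space depth); the function name, array layout, and tolerance are illustrative assumptions.

```python
import numpy as np

def visible_mask(point_depths, pixel_coords, depth_buffer, tolerance=0.01):
    """Return True for surface points that are not occluded in this view.

    point_depths: (N,) camera-space depth of each surface point
    pixel_coords: (N, 2) integer (row, col) each point projects to
    depth_buffer: (H, W) nearest depth rendered at each pixel
    """
    rows, cols = pixel_coords[:, 0], pixel_coords[:, 1]
    nearest = depth_buffer[rows, cols]
    # A point is visible only if it is (almost) the nearest surface at its pixel.
    return point_depths <= nearest + tolerance

# A sofa at 1.0 m covers one pixel in front of a wall at 2.0 m.
depth_buffer = np.array([[2.0, 2.0], [1.0, 2.0]])
point_depths = np.array([2.0, 2.0, 1.0])            # wall, wall behind sofa, sofa
pixel_coords = np.array([[0, 0], [1, 0], [1, 0]])
mask = visible_mask(point_depths, pixel_coords, depth_buffer)
```

Only points with a `True` mask would receive color from this view, so the wall point hidden behind the sofa does not pick up the sofa's texture, and vice versa.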

9. Incomplete Room Reconstruction

Root cause: Occluded regions (behind sofa, under table) are hallucinated incorrectly. Fix: Inpainting diffusion for occluded regions, conditioned on detected room layout. Ceiling/floor inpainting from detected planes.

10. Inability to Edit Generated Rooms

Root cause: Single output mesh. The sofa can't be moved without regenerating everything. Fix: Scene graph representation. Each object is generated independently and stored as a separate node, so moving the sofa only updates that node's position in the scene graph.
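
The editing model can be illustrated with a small data structure. This is a hedged sketch: the class names, fields, and relation triples are hypothetical and are not taken from the InteriorFusion code.

```python
from dataclasses import dataclass, field

@dataclass
class SceneNode:
    name: str
    position: tuple[float, float, float]  # metres, room coordinates
    mesh_path: str = ""                   # hypothetical reference to the per-object mesh

@dataclass
class SceneGraph:
    nodes: dict[str, SceneNode] = field(default_factory=dict)
    # (subject, relation, object) triples such as ("sofa", "faces", "tv").
    relations: list[tuple[str, str, str]] = field(default_factory=list)

    def add(self, node: SceneNode) -> None:
        self.nodes[node.name] = node

    def move(self, name: str, new_position: tuple[float, float, float]) -> None:
        # Only the node transform changes; no mesh is regenerated.
        self.nodes[name].position = new_position

scene = SceneGraph()
scene.add(SceneNode("sofa", (1.0, 0.0, 2.0)))
scene.add(SceneNode("tv", (1.0, 0.0, 0.0)))
scene.relations.append(("sofa", "faces", "tv"))
scene.move("sofa", (2.5, 0.0, 2.0))
```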

11. Lack of Semantic Room Understanding

Root cause: No training on room types. The model doesn't know that "kitchen needs stove, bedroom needs bed". Fix: Room type classifier trained on 3D-FRONT room labels. Style-conditioned generation (modern, Scandinavian, luxury, Indian, commercial).

Bottleneck Analysis

Bottleneck	Impact	Solution in InteriorFusion
Latent representation	Object-only latents can't encode rooms	SLAT-Interior: sparse voxels with room-shell vs object flags
Scene encoding	No scene-level conditioning	Multi-encoder: image + depth + layout + semantic tokens
Geometry priors	No Manhattan world / planar constraints	Room shell generation enforces planar walls/floor/ceiling
Rendering pipeline	Object-only rendering (sphere cameras)	Indoor camera distribution (room-centered, limited elevation)
Training datasets	Only object datasets (Objaverse)	3D-FRONT + Structured3D + InteriorNet + ScanNet
Sparse-view reconstruction	150 views per object; rooms need more	Seed-guided 2D Gaussian splatting for room-scale
Scene graph modeling	No relationship modeling	SpatialLM scene scripts + learned layout prior
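
To make the "indoor camera distribution" row concrete, the sketch below samples cameras inside a room at roughly eye height with a limited pitch range. All ranges (central half of the floor plan, 1.2 to 1.8 m height, ±20° pitch) are assumptions chosen for illustration, not values from the report.

```python
import numpy as np

def sample_indoor_cameras(room_size, n, max_pitch_deg=20.0, seed=0):
    """Sample camera positions inside a room and viewing angles with limited pitch.

    room_size: (width, height, depth) of the room in metres.
    Returns positions (n, 3), yaw in degrees (n,), pitch in degrees (n,).
    """
    rng = np.random.default_rng(seed)
    width, height, depth = room_size
    # Stay in the central half of the floor plan, at roughly eye height.
    x = rng.uniform(0.25 * width, 0.75 * width, n)
    z = rng.uniform(0.25 * depth, 0.75 * depth, n)
    y = rng.uniform(1.2, min(1.8, height), n)
    # Any heading is allowed, but the camera never looks steeply up or down.
    yaw = rng.uniform(0.0, 360.0, n)
    pitch = rng.uniform(-max_pitch_deg, max_pitch_deg, n)
    return np.stack([x, y, z], axis=1), yaw, pitch

positions, yaw, pitch = sample_indoor_cameras((4.0, 2.7, 5.0), n=8)
```

This contrasts with object-centric rendering, where cameras are placed on a sphere looking inward at the object.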

Key Papers & arXiv IDs

Paper	arXiv ID	Key Contribution
TRELLIS v1	2412.01506	Structured latent (SLAT) for 3D generation
TRELLIS.2	2512.14692	O-Voxel with PBR materials, 16× compression
TRELLISWorld	2510.23880	Tiled diffusion for scene generation
Hunyuan3D-2.0	2501.12202	Shape+texture two-stage pipeline
Hunyuan3D-2.1	2506.15442	Full training code release
Hunyuan3D-2.5	2506.16504	LATTICE 10B model
HunyuanWorld	2507.21809	Panoramic world proxies
SF3D	2408.00653	Sub-second mesh + PBR
InstantMesh	2404.07191	Best open-source mesh quality
CRM	2403.05034	Best geometry fidelity (CD 0.0094)
TripoSR	2403.02151	Fastest baseline (0.5s)
LGM	2402.05054	Gaussian splatting output
Era3D	2405.11616	High-res multi-view (512²)
Wonder3D	2310.15008	Cross-domain diffusion
SyncDreamer	2309.03453	Synchronized multi-view
MVDream	2308.16512	Multi-view diffusion
2DGS-Room	2412.03428	Indoor GS reconstruction
Pano2Room	2408.11413	Single panorama to 3DGS
SpatialLM	2506.07491	LLM for indoor scene understanding
RoomFormer	CVPR 2023	Floorplan from point clouds
EchoScene	2405.00915	Scene graph → 3D indoor
CHOrD	2503.11958	Collision-free house-scale scenes
Direct3D	2405.14832	Triplane VAE + DiT
Direct3D-S2	2505.17412	Sparse SDF VAE, 1024³ on 8 GPUs
CLAY	2406.13897	1.5B param multi-condition model
RL3DEdit	2603.03143	RL (GRPO) for 3D editing
AR3D-R1	(recent)	RL-enhanced text-to-3D
Grendel-GS	2406.18533	Distributed 3DGS training
TriplaneTurbo	2503.21694	Progressive rendering distillation
Depth Anything V2	2406.09414	SOTA monocular depth

Dataset Rankings for Interior 3D

Tier 1 (Essential)

Rank	Dataset	Size	Key Strength	HF Hub
1	3D-FRONT (MIDI-3D)	17K rooms	End-to-end room scenes with furniture	`huanngzh/3D-Front`
2	Structured3D	21K rooms	Best structured 3D annotations (planes, lines, junctions)	`Gen3DF/Structured3D`
3	ScanNet++	1.6K scenes	Real-world validation, dense annotations	`marvex/scannet-dataset`

Tier 2 (Pre-training & Scale)

Rank	Dataset	Size	Key Strength
4	InteriorNet	1.7M layouts	Massive scale, multi-sensor
5	HM3D	1K scenes	Largest real-world dataset
6	Hypersim	461 scenes	High photorealism, material decomposition
7	Replica	18 scenes	HDR textures, highest quality

Tier 3 (Assets & Objects)

Rank	Dataset	Size	Key Strength	HF Hub
8	Objaverse-XL	10M objects	Largest 3D object repo	`allenai/objaverse-xl`
9	OmniObject3D	6K objects	High-quality real scans	N/A
10	3D-FUTURE	10K furniture	Professional furniture models	N/A

Tier 4 (Auxiliary)

Dataset	Purpose
SceneVerse	Language grounding
ProcTHOR	Procedural augmentation
ARKitScenes	Mobile capture
3RScan	Change detection
MultiScan	Articulated furniture
Infinigen	Procedural generation
MVImgNet	Object multi-view
GSO	Evaluation benchmark

Training Recipe Summary

Stage 1: VAE (1 week, 8×A100)

Dataset: 3D-FRONT + Structured3D (synthetic rooms)
Multi-resolution: 256³ → 512³ → 1024³ curriculum
Optimizer: AdamW, lr 1e-4, weight decay 0.01
Loss: MSE reconstruction + KL (λ=1e-3) + depth L1 + normal cosine
Batch: 8 per GPU, effective 64
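
The Stage 1 loss can be written out as below. Only the KL weight (1e-3) comes from the recipe; the depth and normal weights default to 1.0 as an assumption, the latent posterior is assumed to be a diagonal Gaussian, and normals are assumed to be unit length. NumPy is used for readability rather than a training framework.

```python
import numpy as np

def vae_loss(recon, target, mu, logvar, depth_pred, depth_gt, normal_pred, normal_gt,
             kl_weight=1e-3, depth_weight=1.0, normal_weight=1.0):
    """MSE reconstruction + weighted KL + depth L1 + normal cosine loss."""
    mse = np.mean((recon - target) ** 2)
    # KL divergence between N(mu, exp(logvar)) and a standard normal prior.
    kl = -0.5 * np.mean(1.0 + logvar - mu ** 2 - np.exp(logvar))
    depth_l1 = np.mean(np.abs(depth_pred - depth_gt))
    # Cosine term: 0 when predicted and ground-truth unit normals coincide.
    cosine = np.sum(normal_pred * normal_gt, axis=-1)
    normal_cos = np.mean(1.0 - cosine)
    return mse + kl_weight * kl + depth_weight * depth_l1 + normal_weight * normal_cos

x = np.ones((2, 4))
n = np.tile([0.0, 0.0, 1.0], (2, 1))
perfect = vae_loss(x, x, np.zeros(3), np.zeros(3), x, x, n, n)
off_by_one = vae_loss(x + 1.0, x, np.zeros(3), np.zeros(3), x, x, n, n)
```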

Stage 2: Structure DiT (1 week, 32×A100)

Rectified flow matching
Conditioning: DINOv3-L image features + depth + layout tokens
Resolution curriculum: 256³ → 512³ → 1024³
Batch: 8 per GPU, effective 256
Optimizer: AdamW, lr 1e-4 → 2e-5 (progressive)
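
For readers unfamiliar with rectified flow matching, the training target can be sketched as follows. The sketch assumes the convention that t=0 is noise and t=1 is data (some implementations use the reverse), and it uses random arrays as stand-ins for structure latents; conditioning and the DiT itself are omitted.

```python
import numpy as np

def rectified_flow_pair(x_data, rng):
    """Build one training pair: interpolated sample x_t and target velocity."""
    noise = rng.standard_normal(x_data.shape)
    # One time value per sample, broadcast over the remaining dimensions.
    t = rng.uniform(0.0, 1.0, size=(x_data.shape[0],) + (1,) * (x_data.ndim - 1))
    # Straight-line path from noise (t=0) to data (t=1).
    x_t = (1.0 - t) * noise + t * x_data
    # The velocity along a straight line is constant.
    target_velocity = x_data - noise
    return x_t, t, target_velocity

def flow_matching_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocity."""
    return np.mean((predicted_velocity - target_velocity) ** 2)

rng = np.random.default_rng(0)
latents = rng.standard_normal((4, 16))   # stand-in for structure latents
x_t, t, v = rectified_flow_pair(latents, rng)
```

A model trained this way predicts `v` from `(x_t, t)` and the conditioning tokens; following the predicted velocity from `x_t` for the remaining time `1 - t` recovers the data sample.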

Stage 3: Material DiT (1 week, 16×A100)

Conditioned on generated geometry + input image
PBR material prediction
Batch: 16 per GPU, effective 256
Loss: L1 on albedo + L1 on metallic/roughness + LPIPS on rendered appearance

Stage 4: Real-world Fine-tuning (3 days, 8×A100)

LoRA rank 32 on DiT attention layers
Dataset: ScanNet + HM3D real photos
RL fine-tuning: GRPO with VGGT geometric rewards
Domain adaptation from synthetic → real
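
The core of GRPO is that each sample's reward is normalized against the other samples generated for the same input. A minimal sketch of that step, assuming scalar rewards are already available; the reward values are hypothetical, and how VGGT would produce them is not specified in this report.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same input."""
    rewards = np.asarray(rewards, dtype=float)
    # Subtract the group mean and divide by the group standard deviation.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Four generations for one input image, scored by a geometric reward (made-up values).
geometry_rewards = [0.2, 0.5, 0.8, 0.5]
advantages = group_relative_advantages(geometry_rewards)
```

Samples that score above the group mean get a positive advantage and are reinforced; no separate value network is needed.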

Total Cost Estimate: ~$60K (about 3.5 weeks if the four stages run back to back, peaking at 32×A100 in Stage 2)
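
The GPU-hour budget implied by the four stages can be tallied directly. The durations and GPU counts below are the ones listed per stage; the hourly rate is left as a parameter because the report does not state one, and any gap between the product and the quoted total would have to come from costs not itemized in the recipe.

```python
# (days, number of A100 GPUs) per stage, taken from the recipe above.
stages = {
    "VAE": (7, 8),
    "Structure DiT": (7, 32),
    "Material DiT": (7, 16),
    "Real-world fine-tuning": (3, 8),
}

gpu_hours = {name: days * 24 * gpus for name, (days, gpus) in stages.items()}
total_gpu_hours = sum(gpu_hours.values())

def estimated_cost(rate_per_gpu_hour: float) -> float:
    """Compute cost for a given A100 hourly rate (the rate is an assumption)."""
    return total_gpu_hours * rate_per_gpu_hour
```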

Novel Contributions of InteriorFusion

SLAT-Interior: First structured latent representation designed for indoor scenes with room-shell vs object separation
Scene-aware generation pipeline: First end-to-end pipeline from single image to editable 3D interior
Metric-scale consistency: Leverages metric depth for real-world furniture scaling
Hybrid output: Simultaneous mesh + Gaussian splatting + PBR materials
Editable scene graph: Objects are independent, movable, replaceable nodes
Style-conditioned: Supports modern, Scandinavian, luxury, Indian, and commercial interiors
PBR material generation: Native metallic/roughness/normal output (not just baked textures)
Scene assembly without scene-level diffusion training: Uses SpatialLM + a learned layout prior, so no scene-level diffusion model has to be trained

Business Moat Analysis

Moat	InteriorFusion	Competitors
Dataset moat	3D-FRONT + Structured3D rooms (interior-specific)	Generic object datasets
Architecture moat	Scene-aware SLAT + scene graph	Object-only representations
Integration moat	Blender/UE/Unity plugins + ComfyUI nodes	Mostly web/API only
Speed moat	8s on A100	0.5s (TripoSR) but no interiors; 15-30s for quality
Quality moat	PBR + editable + scene-aware	Single mesh blob
Open-source moat	MIT license, full code	Mixed licenses (some proprietary)