Commit d78cc54 by stevee00 (verified)
1 Parent(s): c8a3327

Upload docs/INFERENCE_OPTIMIZATION.md

Files changed (1)
  1. docs/INFERENCE_OPTIMIZATION.md +178 -0
docs/INFERENCE_OPTIMIZATION.md ADDED
@@ -0,0 +1,178 @@
 
# InteriorFusion Inference Optimization Guide

## Target Platforms

### RTX 4090 (24GB VRAM): Consumer Desktop
```bash
# FP16 inference in preview mode (PBR disabled)
python -m interiorfusion.infer \
  --image room.jpg \
  --output ./output/ \
  --model-size L \
  --device cuda \
  --dtype float16 \
  --no-pbr  # Disable PBR for faster generation

# Expected: ~12s for full scene with GLB+PLY output
```

**Optimizations**:
- FP16 inference throughout pipeline
- Skip material generation for preview mode
- Use `torch.compile()` on DiT forward pass
- Flash Attention 2 for transformer attention
- Batch multi-view generation (6 views simultaneously)
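
Timing figures such as the `~12s` estimate are easiest to reproduce with a harness that discards warm-up runs, because `torch.compile()` pays a one-time compilation cost on the first call. The following is a minimal standard-library sketch, not part of InteriorFusion: `run_pipeline` is a hypothetical stand-in for whatever callable wraps a full scene generation.

```python
import statistics
import time
from typing import Callable


def benchmark(fn: Callable[[], object], warmup: int = 1, runs: int = 3) -> dict:
    """Time `fn`, discarding `warmup` initial calls (e.g. compilation)."""
    for _ in range(warmup):
        fn()  # not timed: absorbs one-time costs such as compilation
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return {
        "runs": runs,
        "mean_s": statistics.mean(samples),
        "min_s": min(samples),
        "max_s": max(samples),
    }


def run_pipeline() -> None:
    """Hypothetical placeholder for a full scene generation call."""
    time.sleep(0.01)


if __name__ == "__main__":
    print(benchmark(run_pipeline))
```

When the timed callable launches CUDA work, it should synchronize the device (for example with `torch.cuda.synchronize()`) before returning, otherwise the timer only measures kernel launch time.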

### A100 (80GB VRAM): Cloud / Datacenter
```bash
# Full quality generation
python -m interiorfusion.infer \
  --image room.jpg \
  --output ./output/ \
  --model-size XL \
  --device cuda \
  --dtype bfloat16

# Expected: ~8s for full scene with all formats
```

**Optimizations**:
- BF16 precision (better numerical stability than FP16)
- Batch size 4 for parallel room generation
- CUDA Graphs for repeated operations
- Persistent CUDA cache
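
The "batch size 4" item amounts to grouping input images into chunks of four before each forward pass. A minimal sketch of that grouping step, assuming the inputs are simply a list of image paths (the file names are hypothetical, and how InteriorFusion consumes a batch is not shown in this guide):

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batched(items: Iterable[str], batch_size: int = 4) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    rooms = [f"room_{i}.jpg" for i in range(10)]  # hypothetical file names
    for batch in batched(rooms):
        print(batch)  # each batch would go through one forward pass
```

The last batch may be smaller than four, so any downstream code should not assume a fixed batch dimension.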

### H100 (80GB VRAM): Latest Datacenter
```bash
# Maximum quality with Transformer Engine
python -m interiorfusion.infer \
  --image room.jpg \
  --output ./output/ \
  --model-size XL \
  --device cuda \
  --dtype bfloat16

# Expected: ~5s full pipeline
```

**Optimizations**:
- FP8 via Transformer Engine
- Hardware-accelerated attention
- NVLink for multi-GPU distribution

### Apple Silicon (MLX)
```bash
# MLX-optimized inference
python -m interiorfusion.infer \
  --image room.jpg \
  --output ./output/ \
  --model-size S \
  --device mps \
  --dtype float32

# Expected: ~30s on M3 Max (36GB unified memory)
```

**Optimizations**:
- MLX graph compilation
- Unified memory avoids CPU-GPU copies
- Model quantization to 4-bit via GPTQ

### Edge / Mobile
```bash
# Core pipeline only (depth + layout)
python -m interiorfusion.infer \
  --image room.jpg \
  --output ./output/ \
  --model-size S \
  --device cpu \
  --no-pbr --no-gaussian \
  --formats glb

# Expected: ~5s depth+layout, scene sent to cloud for 3D generation
```

**Optimizations**:
- Core inference on-device (depth + segmentation)
- Cloud offloading for 3D generation
- Streaming mesh chunks
- Aggressive quantization (INT4)
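
The per-platform commands above differ only in a handful of flags, so they can be collected into presets. The sketch below builds the argument list for `python -m interiorfusion.infer` using only flags that appear in this guide; the preset keys (`rtx4090`, `edge`, and so on) and the `build_command` helper are hypothetical names introduced for illustration.

```python
import shlex
from typing import Dict, List

# Flag sets copied from the per-platform commands in this guide.
# The dictionary keys are hypothetical labels, not CLI options.
PRESETS: Dict[str, List[str]] = {
    "rtx4090": ["--model-size", "L", "--device", "cuda",
                "--dtype", "float16", "--no-pbr"],
    "a100": ["--model-size", "XL", "--device", "cuda", "--dtype", "bfloat16"],
    "h100": ["--model-size", "XL", "--device", "cuda", "--dtype", "bfloat16"],
    "apple": ["--model-size", "S", "--device", "mps", "--dtype", "float32"],
    "edge": ["--model-size", "S", "--device", "cpu",
             "--no-pbr", "--no-gaussian", "--formats", "glb"],
}


def build_command(platform: str, image: str, output: str = "./output/") -> List[str]:
    """Return the argv list for one inference run on the given platform."""
    if platform not in PRESETS:
        raise ValueError(
            f"unknown platform {platform!r}; choose from {sorted(PRESETS)}"
        )
    return ["python", "-m", "interiorfusion.infer",
            "--image", image, "--output", output, *PRESETS[platform]]


if __name__ == "__main__":
    # Print a shell-quoted command line; pass the list to subprocess.run to execute.
    print(shlex.join(build_command("edge", "room.jpg")))
```

Returning a list rather than a string keeps the command safe to hand to `subprocess.run` without shell quoting issues.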

## Quantization Strategies

| Method | Model Size (vs. FP32) | Speedup | Quality Impact | VRAM Usage (vs. FP32) |
|--------|-----------|---------|---------------|---------------|
| FP32 (baseline) | 100% | 1× | n/a | 100% |
| FP16 | 50% | 1.8× | Minimal | 50% |
| BF16 | 50% | 1.8× | Minimal | 50% |
| INT8 (SmoothQuant) | 25% | 2.5× | Low | 25% |
| FP8 (TE) | 25% | 3× | Low | 25% |
| GPTQ-4bit | 12.5% | 3.5× | Medium | 12.5% |
| AWQ-4bit | 12.5% | 3.2× | Low | 12.5% |
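
The size percentages in the table follow directly from bits per weight divided by 32. The sketch below reproduces that arithmetic; the one-billion parameter count is a made-up example, not a published InteriorFusion figure, and the estimate covers weights only (activations, and the scale or zero-point metadata that 4-bit formats store, are ignored).

```python
from typing import Dict

# Bits used to store one weight under each method in the table.
BITS_PER_WEIGHT: Dict[str, int] = {
    "FP32": 32,
    "FP16": 16,
    "BF16": 16,
    "INT8": 8,
    "FP8": 8,
    "GPTQ-4bit": 4,
    "AWQ-4bit": 4,
}


def relative_size(method: str) -> float:
    """Weight storage as a fraction of the FP32 baseline."""
    return BITS_PER_WEIGHT[method] / BITS_PER_WEIGHT["FP32"]


def weight_bytes(num_params: int, method: str) -> int:
    """Approximate bytes needed for the weights alone."""
    return num_params * BITS_PER_WEIGHT[method] // 8


if __name__ == "__main__":
    example_params = 1_000_000_000  # hypothetical parameter count
    for method in BITS_PER_WEIGHT:
        gib = weight_bytes(example_params, method) / 2**30
        print(f"{method:10s} {relative_size(method):6.1%} {gib:6.2f} GiB")
```

Real VRAM usage is higher than these numbers because activations, attention buffers, and framework overhead are not counted.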

## Export Formats

| Format | Size | Viewer | Game Engine | AR/VR | Notes |
|--------|------|--------|------------|-------|-------|
| **GLB** | ~5-50MB | ✅ (Web) | ✅ (UE/Unity) | ✅ (WebXR) | Recommended default |
| **FBX** | ~10-100MB | ⚠️ (Limited) | ✅ (UE/Unity/Maya) | ⚠️ | For animation/rigging |
| **OBJ** | ~5-30MB | ✅ (Universal) | ✅ (All) | ⚠️ | Legacy, no PBR |
| **USDZ** | ~5-50MB | ✅ (iOS AR) | ⚠️ (UE via plugin) | ✅ (ARKit) | Apple's format |
| **PLY (3DGS)** | ~10-500MB | ✅ (Gaussian viewers) | ⚠️ (UE5 plugin) | ⚠️ | For splatting render |
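
The table can be read as a lookup from use case to format. A small sketch of that lookup follows; the use-case keys are hypothetical labels chosen here, and the function returns format names as written in the table rather than values for the `--formats` flag, since this guide only shows `glb` for that flag.

```python
from typing import Dict

# Use case -> format, read off the table above.
# The keys are hypothetical labels introduced for this sketch.
FORMAT_FOR_USE_CASE: Dict[str, str] = {
    "web_viewer": "GLB",
    "game_engine": "GLB",
    "animation_rigging": "FBX",
    "legacy_tools": "OBJ",
    "ios_ar": "USDZ",
    "gaussian_splatting": "PLY",
}


def pick_format(use_case: str, default: str = "GLB") -> str:
    """Return the suggested export format, falling back to the default."""
    return FORMAT_FOR_USE_CASE.get(use_case, default)


if __name__ == "__main__":
    for use_case in ("ios_ar", "gaussian_splatting", "something_else"):
        print(use_case, "->", pick_format(use_case))
```

Falling back to GLB mirrors the table's "Recommended default" note.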

## ComfyUI Integration

Install the custom nodes:
```bash
cd ComfyUI/custom_nodes
git clone https://github.com/stevee00/ComfyUI-InteriorFusion
```

Available nodes:
- `InteriorFusion: Generate Scene`: Full pipeline
- `InteriorFusion: Generate Object`: Single furniture
- `InteriorFusion: Apply Material`: PBR material
- `InteriorFusion: Export Mesh`: Format conversion

## Blender Integration

Install the addon:
```bash
# In Blender: Edit > Preferences > Add-ons > Install
# Select blender_plugin/interiorfusion_blender.py
```

Features:
- Generate 3D scene from reference image
- Import with PBR materials
- Interactive object editing
- Export to game engines

## Unreal Engine Integration

1. Export GLB from InteriorFusion
2. Import via glTF importer (UE5 built-in)
3. Materials auto-convert to Unreal PBR
4. Use Gaussian Splatting plugin for real-time preview

Plugins needed:
- `glTFRuntime` for runtime GLB loading
- `MLSLabsGaussianSplattingRenderer` for 3DGS

## Unity Integration

1. Export GLB or FBX from InteriorFusion
2. Import into Unity project
3. Materials map to Unity Standard/URP/HDRP
4. Use GaussianSplatting package for 3DGS

## Performance Targets

| Platform | Target Time | Target VRAM | Output Quality |
|----------|------------|-------------|---------------|
| RTX 4090 | < 15s | < 20GB | Production |
| A100 | < 8s | < 72GB | Maximum |
| H100 | < 5s | < 72GB | Maximum |
| M3 Max | < 30s | < 36GB | Production |
| RTX 3060 | < 60s | < 10GB | Preview |
| Edge (CPU) | < 10s (depth only) | < 4GB | Core only |
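
These targets can be turned into an automated check against measured runs. The sketch below encodes the table as data and compares a measured time and peak memory against it; the `Target` class and `meets_target` helper are hypothetical names, and the measurements passed in are assumed to come from your own benchmarking.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Target:
    max_seconds: float  # upper bound on wall-clock time
    max_vram_gb: float  # upper bound on peak memory


# Values copied from the Performance Targets table.
TARGETS: Dict[str, Target] = {
    "RTX 4090": Target(15, 20),
    "A100": Target(8, 72),
    "H100": Target(5, 72),
    "M3 Max": Target(30, 36),
    "RTX 3060": Target(60, 10),
    "Edge (CPU)": Target(10, 4),  # depth-only time target
}


def meets_target(platform: str, seconds: float, vram_gb: float) -> bool:
    """True if both measurements are strictly below the platform's limits."""
    target = TARGETS[platform]
    return seconds < target.max_seconds and vram_gb < target.max_vram_gb


if __name__ == "__main__":
    # Hypothetical measurements, for illustration only.
    print(meets_target("RTX 4090", seconds=12.0, vram_gb=18.5))
    print(meets_target("A100", seconds=8.0, vram_gb=40.0))
```

The comparisons are strict, matching the `<` signs in the table, so a run that takes exactly 8s on an A100 does not pass.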