WaveCut committed · Commit 23f4b7d · verified · 1 Parent(s): 08fc650

Add model CPU offload cold and warm benchmark

Files changed (2)
  1. README.md +23 -0
  2. model_cpu_offload_benchmark.json +283 -0
README.md CHANGED
@@ -111,6 +111,29 @@ Hardware: RunPod NVIDIA H100 80GB HBM3, PyTorch 2.8.0 CUDA 12.8 container, local
 
  Transformer-only footprint is computed from safetensors tensor storage for the denoising transformer parameter tensors only; it excludes allocator overhead and non-transformer components. The original transformer tensors are F32; the corrected SDNQ transformer stores quantized tensors as U8 plus the excluded modulation layers as BF16.
 
+ ### Model CPU Offload Benchmark
+
+ Same hardware and 10 prompts, using `pipe.enable_model_cpu_offload()`. The reported load time uses a warm local Hugging Face cache on the container disk, so model download time is excluded. Each model was measured in a fresh Python process. `Cold generation` is P01, the first generation immediately after load/offload setup; `warm generation` aggregates P02-P10.
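
The cold/warm split described above can be sketched roughly as follows. This is not the script used for the benchmark; it is a minimal illustration under stated assumptions. `measure`, `run_cold_warm` and the `generate` callable are hypothetical names, the `torch.cuda` calls are only used when CUDA is actually available, and the byte-to-GB divisor is a guess because the source does not say which unit convention it uses.

```python
import statistics
import time

try:  # torch is optional so the sketch also runs on a CPU-only machine
    import torch
except ImportError:
    torch = None

# Assumption: the source does not say whether "GB" means 1e9 or 2**30 bytes.
BYTES_PER_GB = 1024 ** 3


def _cuda_available():
    return torch is not None and torch.cuda.is_available()


def measure(generate, prompt):
    """Time one generation call and, when CUDA is present, record peak VRAM."""
    use_cuda = _cuda_available()
    if use_cuda:
        torch.cuda.reset_peak_memory_stats()  # peaks are per call, not cumulative
        torch.cuda.synchronize()
    start = time.perf_counter()
    generate(prompt)
    if use_cuda:
        torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
    record = {"time_s": time.perf_counter() - start}
    if use_cuda:
        record["peak_allocated_gb"] = torch.cuda.max_memory_allocated() / BYTES_PER_GB
        record["peak_reserved_gb"] = torch.cuda.max_memory_reserved() / BYTES_PER_GB
    return record


def run_cold_warm(generate, prompts):
    """Treat the first prompt as the cold run and aggregate the rest as warm."""
    records = [measure(generate, p) for p in prompts]
    warm_times = [r["time_s"] for r in records[1:]]
    return {
        "cold_time_s": records[0]["time_s"],
        "warm_avg_time_s": statistics.mean(warm_times),
        "warm_median_time_s": statistics.median(warm_times),
        "records": records,
    }


# Hypothetical stand-in for a real pipeline call; it only records the prompt.
calls = []
summary = run_cold_warm(calls.append, ["P01", "P02", "P03"])
```

In a real run, `generate` would wrap the pipeline call made after `pipe.enable_model_cpu_offload()`, and the whole script would be started once per model in a fresh Python process, as the section describes.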
+
+ | Metric | Original Lens-Turbo | SDNQ uint4 static fixed |
+ | --- | ---: | ---: |
+ | Offload setup/load time, seconds | 15.411 | 12.371 |
+ | Offload setup peak allocated VRAM, GB | 12.582 | 12.582 |
+ | Offload setup peak reserved VRAM, GB | 13.881 | 13.881 |
+ | Cold generation time, seconds | 8.434 | 8.440 |
+ | Cold generation peak allocated VRAM, GB | 18.945 | 15.085 |
+ | Cold generation peak reserved VRAM, GB | 19.262 | 15.238 |
+ | Warm generation average time, seconds | 5.731 | 4.976 |
+ | Warm generation median time, seconds | 5.141 | 3.855 |
+ | Warm generation average peak allocated VRAM, GB | 18.945 | 15.084 |
+ | Warm generation average peak reserved VRAM, GB | 19.267 | 15.249 |
+ | Warm generation max peak allocated VRAM, GB | 18.968 | 15.104 |
+ | Warm generation max peak reserved VRAM, GB | 19.290 | 15.280 |
+
+ Raw offload benchmark data: [`model_cpu_offload_benchmark.json`](model_cpu_offload_benchmark.json).
+
+ In `model_cpu_offload` mode the setup/load VRAM peak is dominated by non-transformer components, so the load peak is effectively unchanged. During generation, where the denoising transformer is active, the SDNQ variant saves about 3.861 GB peak allocated VRAM on the warm prompts, a 20.4% reduction versus the original model.
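
The saving and percentage quoted above follow directly from the warm-generation average peak allocated VRAM figures in the table. A short check, with variable names chosen here for illustration:

```python
# Warm generation average peak allocated VRAM, GB, taken from the table above.
original_gb = 18.945  # Original Lens-Turbo
quant_gb = 15.084     # SDNQ uint4 static fixed

saved_gb = round(original_gb - quant_gb, 3)
reduction_pct = round(100 * saved_gb / original_gb, 1)

print(f"saved {saved_gb} GB, a {reduction_pct}% reduction")
```

The percentage is relative to the original model's peak, so it describes how much smaller the quantized peak is, not how much larger the original is.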
+
  ## 10-Prompt Matrix
 
  | ID | Scenario | Seed | Original time, s | Quant time, s | Delta | Original peak allocated VRAM, GB | Quant peak allocated VRAM, GB |
model_cpu_offload_benchmark.json ADDED
@@ -0,0 +1,283 @@
+ {
+   "benchmark": "model_cpu_offload_cold_warm",
+   "hardware": "RunPod NVIDIA H100 80GB HBM3 (H100 SXM)",
+   "mode": "Diffusers enable_model_cpu_offload()",
+   "cache_state": "warm local HF cache; no download time included",
+   "process_isolation": "single model per fresh Python process",
+   "base_resolution": 1024,
+   "aspect_ratio": "1:1",
+   "num_inference_steps": 4,
+   "guidance_scale": 1.0,
+   "dtype": "torch.bfloat16",
+   "definitions": {
+     "load_time_s": "Pipeline load plus enable_model_cpu_offload setup from warm local HF cache; download time excluded.",
+     "cold_generation": "P01, first generation immediately after fresh process load/offload setup.",
+     "warm_generation": "P02-P10 after the cold P01 generation."
+   },
+   "models": {
+     "base": {
+       "hardware": "RunPod NVIDIA H100 80GB HBM3 (H100 SXM)",
+       "mode": "Diffusers enable_model_cpu_offload()",
+       "cache_state": "warm local HF cache; no download time included",
+       "process_isolation": "single model per fresh Python process",
+       "base_resolution": 1024,
+       "aspect_ratio": "1:1",
+       "num_inference_steps": 4,
+       "guidance_scale": 1.0,
+       "dtype": "torch.bfloat16",
+       "kind": "base",
+       "load": {
+         "load_time_s": 15.411,
+         "peak_allocated_gb": 12.582,
+         "peak_reserved_gb": 13.881,
+         "end_allocated_gb": 10.185,
+         "end_reserved_gb": 10.679
+       },
+       "summary": {
+         "cold_time_s": 8.434,
+         "cold_peak_allocated_gb": 18.945,
+         "cold_peak_reserved_gb": 19.262,
+         "warm_avg_time_s": 5.731,
+         "warm_median_time_s": 5.141,
+         "warm_avg_peak_allocated_gb": 18.945,
+         "warm_avg_peak_reserved_gb": 19.267,
+         "warm_max_peak_allocated_gb": 18.968,
+         "warm_max_peak_reserved_gb": 19.29
+       },
+       "prompts": [
+         {
+           "id": "P01",
+           "title": "Orbital Night Market",
+           "seed": 101,
+           "time_s": 8.434,
+           "peak_allocated_gb": 18.945,
+           "peak_reserved_gb": 19.262,
+           "end_allocated_gb": 10.221,
+           "end_reserved_gb": 10.895
+         },
+         {
+           "id": "P02",
+           "title": "Arctic Research Desk",
+           "seed": 102,
+           "time_s": 5.613,
+           "peak_allocated_gb": 18.949,
+           "peak_reserved_gb": 19.258,
+           "end_allocated_gb": 10.221,
+           "end_reserved_gb": 10.893
+         },
+         {
+           "id": "P03",
+           "title": "Victorian Automaton Repair",
+           "seed": 103,
+           "time_s": 4.006,
+           "peak_allocated_gb": 18.944,
+           "peak_reserved_gb": 19.271,
+           "end_allocated_gb": 10.221,
+           "end_reserved_gb": 10.792
+         },
+         {
+           "id": "P04",
+           "title": "Mars Greenhouse Control Room",
+           "seed": 104,
+           "time_s": 7.509,
+           "peak_allocated_gb": 18.938,
+           "peak_reserved_gb": 19.258,
+           "end_allocated_gb": 10.221,
+           "end_reserved_gb": 10.872
+         },
+         {
+           "id": "P05",
+           "title": "Lost Railway Poster Wall",
+           "seed": 105,
+           "time_s": 4.658,
+           "peak_allocated_gb": 18.939,
+           "peak_reserved_gb": 19.258,
+           "end_allocated_gb": 10.221,
+           "end_reserved_gb": 10.872
+         },
+         {
+           "id": "P06",
+           "title": "Miniature Courtroom Diorama",
+           "seed": 106,
+           "time_s": 5.141,
+           "peak_allocated_gb": 18.942,
+           "peak_reserved_gb": 19.271,
+           "end_allocated_gb": 10.221,
+           "end_reserved_gb": 10.792
+         },
+         {
+           "id": "P07",
+           "title": "Rainy Seoul Book Cafe",
+           "seed": 107,
+           "time_s": 5.134,
+           "peak_allocated_gb": 18.942,
+           "peak_reserved_gb": 19.271,
+           "end_allocated_gb": 10.221,
+           "end_reserved_gb": 10.792
+         },
+         {
+           "id": "P08",
+           "title": "Oceanographic Expedition Map",
+           "seed": 108,
+           "time_s": 4.657,
+           "peak_allocated_gb": 18.942,
+           "peak_reserved_gb": 19.271,
+           "end_allocated_gb": 10.221,
+           "end_reserved_gb": 10.792
+         },
+         {
+           "id": "P09",
+           "title": "Renaissance Lab Notebook",
+           "seed": 109,
+           "time_s": 5.944,
+           "peak_allocated_gb": 18.939,
+           "peak_reserved_gb": 19.258,
+           "end_allocated_gb": 10.221,
+           "end_reserved_gb": 10.872
+         },
+         {
+           "id": "P10",
+           "title": "Russian Provincial Print Shop",
+           "seed": 110,
+           "time_s": 8.916,
+           "peak_allocated_gb": 18.968,
+           "peak_reserved_gb": 19.29,
+           "end_allocated_gb": 10.221,
+           "end_reserved_gb": 10.798
+         }
+       ]
+     },
+     "quant": {
+       "hardware": "RunPod NVIDIA H100 80GB HBM3 (H100 SXM)",
+       "mode": "Diffusers enable_model_cpu_offload()",
+       "cache_state": "warm local HF cache; no download time included",
+       "process_isolation": "single model per fresh Python process",
+       "base_resolution": 1024,
+       "aspect_ratio": "1:1",
+       "num_inference_steps": 4,
+       "guidance_scale": 1.0,
+       "dtype": "torch.bfloat16",
+       "kind": "quant",
+       "load": {
+         "load_time_s": 12.371,
+         "peak_allocated_gb": 12.582,
+         "peak_reserved_gb": 13.881,
+         "end_allocated_gb": 10.185,
+         "end_reserved_gb": 10.679
+       },
+       "summary": {
+         "cold_time_s": 8.44,
+         "cold_peak_allocated_gb": 15.085,
+         "cold_peak_reserved_gb": 15.238,
+         "warm_avg_time_s": 4.976,
+         "warm_median_time_s": 3.855,
+         "warm_avg_peak_allocated_gb": 15.084,
+         "warm_avg_peak_reserved_gb": 15.249,
+         "warm_max_peak_allocated_gb": 15.104,
+         "warm_max_peak_reserved_gb": 15.28
+       },
+       "prompts": [
+         {
+           "id": "P01",
+           "title": "Orbital Night Market",
+           "seed": 101,
+           "time_s": 8.44,
+           "peak_allocated_gb": 15.085,
+           "peak_reserved_gb": 15.238,
+           "end_allocated_gb": 10.221,
+           "end_reserved_gb": 10.872
+         },
+         {
+           "id": "P02",
+           "title": "Arctic Research Desk",
+           "seed": 102,
+           "time_s": 6.726,
+           "peak_allocated_gb": 15.089,
+           "peak_reserved_gb": 15.261,
+           "end_allocated_gb": 10.221,
+           "end_reserved_gb": 10.878
+         },
+         {
+           "id": "P03",
+           "title": "Victorian Automaton Repair",
+           "seed": 103,
+           "time_s": 8.244,
+           "peak_allocated_gb": 15.081,
+           "peak_reserved_gb": 15.246,
+           "end_allocated_gb": 10.221,
+           "end_reserved_gb": 10.775
+         },
+         {
+           "id": "P04",
+           "title": "Mars Greenhouse Control Room",
+           "seed": 104,
+           "time_s": 4.033,
+           "peak_allocated_gb": 15.079,
+           "peak_reserved_gb": 15.238,
+           "end_allocated_gb": 10.221,
+           "end_reserved_gb": 10.874
+         },
+         {
+           "id": "P05",
+           "title": "Lost Railway Poster Wall",
+           "seed": 105,
+           "time_s": 3.836,
+           "peak_allocated_gb": 15.08,
+           "peak_reserved_gb": 15.238,
+           "end_allocated_gb": 10.221,
+           "end_reserved_gb": 10.876
+         },
+         {
+           "id": "P06",
+           "title": "Miniature Courtroom Diorama",
+           "seed": 106,
+           "time_s": 3.845,
+           "peak_allocated_gb": 15.08,
+           "peak_reserved_gb": 15.246,
+           "end_allocated_gb": 10.221,
+           "end_reserved_gb": 10.775
+         },
+         {
+           "id": "P07",
+           "title": "Rainy Seoul Book Cafe",
+           "seed": 107,
+           "time_s": 3.855,
+           "peak_allocated_gb": 15.08,
+           "peak_reserved_gb": 15.246,
+           "end_allocated_gb": 10.221,
+           "end_reserved_gb": 10.773
+         },
+         {
+           "id": "P08",
+           "title": "Oceanographic Expedition Map",
+           "seed": 108,
+           "time_s": 3.841,
+           "peak_allocated_gb": 15.08,
+           "peak_reserved_gb": 15.246,
+           "end_allocated_gb": 10.221,
+           "end_reserved_gb": 10.773
+         },
+         {
+           "id": "P09",
+           "title": "Renaissance Lab Notebook",
+           "seed": 109,
+           "time_s": 3.852,
+           "peak_allocated_gb": 15.08,
+           "peak_reserved_gb": 15.238,
+           "end_allocated_gb": 10.221,
+           "end_reserved_gb": 10.876
+         },
+         {
+           "id": "P10",
+           "title": "Russian Provincial Print Shop",
+           "seed": 110,
+           "time_s": 6.556,
+           "peak_allocated_gb": 15.104,
+           "peak_reserved_gb": 15.28,
+           "end_allocated_gb": 10.221,
+           "end_reserved_gb": 10.781
+         }
+       ]
+     }
+   }
+ }
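
The `summary` fields in this file can be cross-checked against the per-prompt entries. The sketch below does this for the `base` model, assuming the warm aggregates are a plain mean, median and max over P02-P10 rounded to three decimals; that reading matches the `definitions` block but is not stated as a formula in the source. The tuple layout and the `warm_summary` helper are illustrative names, and the values are copied from the `base.prompts` array above.

```python
import statistics

# (id, time_s, peak_allocated_gb) copied from models.base.prompts in the JSON above.
BASE_PROMPTS = [
    ("P01", 8.434, 18.945),
    ("P02", 5.613, 18.949),
    ("P03", 4.006, 18.944),
    ("P04", 7.509, 18.938),
    ("P05", 4.658, 18.939),
    ("P06", 5.141, 18.942),
    ("P07", 5.134, 18.942),
    ("P08", 4.657, 18.942),
    ("P09", 5.944, 18.939),
    ("P10", 8.916, 18.968),
]


def warm_summary(prompts):
    """Aggregate every prompt after the first one, which is the cold run."""
    warm = prompts[1:]
    times = [time_s for _, time_s, _ in warm]
    peaks = [peak_gb for _, _, peak_gb in warm]
    return {
        "warm_avg_time_s": round(statistics.mean(times), 3),
        "warm_median_time_s": round(statistics.median(times), 3),
        "warm_avg_peak_allocated_gb": round(statistics.mean(peaks), 3),
        "warm_max_peak_allocated_gb": max(peaks),
    }


base_summary = warm_summary(BASE_PROMPTS)
print(base_summary)
```

The same helper could be applied to the `quant` entry, or to tuples built from the file itself after reading it with the standard `json` module. Note that the warm runs are not uniform: P10 is slow for both models, and P02 and P03 are slow for the quantized model, which is why the median sits well below the mean.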