# Cold Start Optimization Implementation Guide (HF Spaces GPU)

## Goal

Reduce end-to-end cold start time for the backend on Hugging Face Spaces GPU while preserving inference quality and endpoint behavior.

This guide is focused only on cold start optimization for the current FastAPI architecture.

## Baseline From Current Logs

Source log window:
- Build queued at 2026-04-20 04:23:34
- Application startup begins at 2026-04-20 04:24:02
- Models loaded successfully at 2026-04-20 04:25:36

### Baseline Timing Summary

| Segment | Start | End | Duration | Notes |
|---|---:|---:|---:|---|
| Queue/build to app startup | 04:23:34 | 04:24:02 | 28s | Includes scheduling, build finalization, image start |
| App startup to model-ready | 04:24:02 | 04:25:36 | 94s | Time from uvicorn start message to models loaded |
| API model load phase | 04:25:15 | 04:25:36 | 21s | From "Starting DeepFake Detector API..." to "Models loaded successfully!" |

### Build Stage Durations Visible In Logs

| Build Stage | Duration |
|---|---:|
| Restoring cache | 19.5s |
| COPY source to /app | 0.0s |
| mkdir/chown/chmod step | 0.1s |
| Pushing image | 0.7s |
| Exporting cache | 0.1s |
| Total visible timed stages | 20.4s |

Note:
- Several Docker steps were cache hits and reported as CACHED without explicit timing.
- "Application startup complete" appears immediately after model load logs; no explicit timestamp is printed, so 04:25:36 is used as the practical ready time.

### Model Load Breakdown (Current)

| Model | Start | End | Duration | Observation |
|---|---:|---:|---:|---|
| Fusion repo config | 04:25:15 | 04:25:16 | 1s | Fast |
| cnn-transfer-final | 04:25:16 | 04:25:17 | 1s | Fast |
| vit-base-final | 04:25:17 | 04:25:30 | 13s | Dominant bottleneck |
| deit-distilled-final | 04:25:30 | 04:25:35 | 5s | Moderate |
| gradfield-cnn-final | 04:25:35 | 04:25:35 | <1s | Fast |
| fusion model load | 04:25:35 | 04:25:36 | 1s | Fast |
| Total model load | 04:25:15 | 04:25:36 | 21s | Sequential loading |

## Current Bottlenecks

1. The dominant delay now occurs before Python module import begins (pre-app startup).
2. Build-time prefetch adds extra build wall time when cache layers miss.
3. Model loading is no longer dominant (~4s with the current cache and bounded parallel loading).
4. Remaining cold-start variance likely comes from platform scheduling/provisioning overhead.

## Implementation Plan

## Phase 1: Remove Runtime Model Downloads (Highest Impact)

### 1.1 Add model prefetch script

Create file: app/scripts/prefetch_models.py



Purpose:

- Download fusion repo and all submodel repos at build time into HF_CACHE_DIR.
- Ensure cold start does not wait on remote model downloads.

Implementation:

```python
import asyncio

from app.core.config import settings
from app.services.model_registry import get_model_registry


async def main() -> None:
    registry = get_model_registry()
    await registry.load_from_fusion_repo(settings.HF_FUSION_REPO_ID, force_reload=True)


if __name__ == "__main__":
    asyncio.run(main())
```


### 1.2 Update Dockerfile for build-time prefetch

Target file: Dockerfile

Key changes:
1. Keep dependency installation in a stable cache layer.
2. Copy only application code needed for prefetch before full source copy.
3. Run prefetch script during build with HF cache directory set.
4. Keep ownership and permissions for user uid 1000.

Implementation sketch:

```dockerfile
FROM python:3.11-slim

WORKDIR /app

ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1 \
    PIP_NO_CACHE_DIR=1 \
    PIP_DISABLE_PIP_VERSION_CHECK=1 \
    PORT=7860 \
    HF_CACHE_DIR=/app/.hf_cache

RUN apt-get update && apt-get install -y --no-install-recommends \
    curl \
    git \
    && rm -rf /var/lib/apt/lists/*

RUN useradd -m -u 1000 user
ENV PATH="/home/user/.local/bin:$PATH"

COPY requirements.txt .
RUN pip install --no-cache-dir --upgrade -r requirements.txt

# Copy app code required for prefetch
COPY app /app/app
COPY start.sh /app/start.sh

RUN mkdir -p /app/.hf_cache

# Build-time model prefetch (requires public repos or HF token in build env)
RUN python -m app.scripts.prefetch_models

RUN chown -R user:user /app && chmod +x /app/start.sh
USER user

EXPOSE 7860
CMD ["./start.sh"]
```

Notes:
- If private model repos are used, the build needs HF_TOKEN.
- This increases image size but removes the startup wait caused by downloads.

### 1.3 Verify HF cache is reused at runtime

Target file: app/services/hf_hub_service.py

Behavior to enforce:
- Keep deterministic local_dir path under /app/.hf_cache.
- Log cache hits clearly before any download attempt.

Add logic before the snapshot_download call:

```python
cached = self.get_cached_path(repo_id)
if cached and not force_download:
    logger.info(f"Using cached repo for {repo_id}: {cached}")
    return cached
```

## Phase 2: Parallelize Submodel Loading

Target file: app/services/model_registry.py

Current behavior:
- Submodels are loaded one by one.

New behavior:
- Load submodels concurrently with bounded parallelism.

Implementation steps:
1. Add a semaphore, for example max concurrency 2.
2. Replace the sequential loop with asyncio.gather.
3. Keep deterministic final registration and clear error propagation.

Implementation sketch:

```python
# Inside the registry's load routine (sketch): bound concurrency with a semaphore.
sem = asyncio.Semaphore(2)

async def _load_with_limit(repo_id: str) -> None:
    async with sem:
        await self._load_submodel(repo_id)

tasks = [_load_with_limit(repo_id) for repo_id in submodel_repos]
results = await asyncio.gather(*tasks, return_exceptions=True)

errors = [r for r in results if isinstance(r, Exception)]
if errors:
    raise RuntimeError(f"Failed to load one or more submodels: {errors}")
```



Reason for bounded parallelism:
- Reduces startup time without overwhelming memory/network in GPU Space containers.



## Phase 3: Add Startup Instrumentation For Reliable Comparisons

Target file: app/main.py

Add timing markers:
- App startup begin timestamp.
- Model loading start and end.
- Total lifespan startup duration.

Implementation sketch:

```python
import time

startup_t0 = time.perf_counter()
...
model_t0 = time.perf_counter()
await registry.load_from_fusion_repo(settings.HF_FUSION_REPO_ID)
model_dt = time.perf_counter() - model_t0
logger.info(f"Model load duration_seconds={model_dt:.3f}")
...
startup_dt = time.perf_counter() - startup_t0
logger.info(f"Startup total duration_seconds={startup_dt:.3f}")
```



## Phase 4: Use Persistent Storage Cache (/data)

Goal:
- Make /data the primary cache location so model artifacts survive container rebuilds/restarts.

Target files:
- app/core/config.py
- start.sh
- app/services/hf_hub_service.py

Plan:
1. Prefer /data-backed cache paths when available:
    - HF_HOME=/data/.cache/huggingface
    - HF_CACHE_DIR=/data/.hf_cache
2. Keep the fallback to /app/.hf_cache when /data is unavailable.
3. Ensure startup creates/chowns cache directories safely.
4. Keep cache-hit logging so verification remains explicit in logs.

Expected impact:
- Faster warm boots across deploys.
- Lower risk of repeated network fetch for large model files.
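
Plan steps 1-2 can be sketched as a small resolver (a sketch; `resolve_cache_dir` and the write-probe approach are hypothetical, not the current app/core/config.py code):

```python
from pathlib import Path


def resolve_cache_dir(persistent: str = "/data/.hf_cache",
                      fallback: str = "/app/.hf_cache") -> str:
    """Prefer the /data-backed cache when the persistent volume is writable."""
    for candidate in (persistent, fallback):
        try:
            path = Path(candidate)
            path.mkdir(parents=True, exist_ok=True)
            # Probe writability explicitly: /data can exist but be read-only.
            probe = path / ".write_probe"
            probe.touch()
            probe.unlink()
            return candidate
        except OSError:
            continue  # volume absent or read-only: try the next candidate
    raise RuntimeError("No writable cache directory available")
```

The same resolved path would feed both HF_CACHE_DIR and the deterministic local_dir used by the hub service, so cache-hit logging keeps working unchanged.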



## Phase 5: Decouple Build From Prefetch

Goal:
- Reduce the rebuild penalty from model prefetch while keeping runtime fast.

Target file:
- Dockerfile

Plan:
1. Make build-time prefetch optional via an ARG/ENV flag.
2. Default to skipping build prefetch when the persistent /data cache is enabled.
3. Keep a one-time warm path at runtime (guarded by a cache/sentinel file in /data).

Expected impact:
- Faster image rebuild/push cycles.
- Better developer iteration speed without sacrificing warmed production startup.
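
Step 3's sentinel guard could look like this (a sketch under assumed names; `warm_cache_once` and the sentinel location are illustrative, while the real prefetch entry point is app/scripts/prefetch_models.py):

```python
from pathlib import Path


async def warm_cache_once(registry, fusion_repo_id: str,
                          sentinel: Path = Path("/data/.hf_cache/.prefetch_complete")) -> bool:
    """Run the one-time model warm-up unless a previous boot already did it."""
    if sentinel.exists():
        return False  # persistent cache already warmed on a prior boot
    await registry.load_from_fusion_repo(fusion_repo_id, force_reload=True)
    sentinel.parent.mkdir(parents=True, exist_ok=True)
    sentinel.touch()  # record completion so later boots skip the warm-up
    return True
```

Writing the sentinel only after a successful load means a crashed warm-up retries on the next boot instead of leaving a half-populated cache marked complete.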



## Phase 6: Platform/GPU Startup Characterization

Goal:
- Quantify how much of the remaining cold start is platform provisioning vs app code.

Plan:
1. Run repeated cold starts with the identical image on the current T4.
2. If available, test one higher-tier GPU Space and compare only the Phase 3 markers.
3. Record variance in:
    - Application Startup -> module_import_start
    - module_import_complete -> startup complete

Notes:
- GPU type can affect model initialization time.
- The measured dominant delay currently occurs before app module import, so platform scheduling/provisioning is likely a bigger lever than model code tuning.
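
Recorded variance can be summarized with a small helper (a sketch; the example durations in the test below are the app-startup-to-model-ready values already recorded in this file, not new measurements):

```python
import statistics


def summarize(durations_s: list[float]) -> dict[str, float]:
    """Mean, population stdev, and range for repeated cold-start timings."""
    return {
        "mean": statistics.mean(durations_s),
        "stdev": statistics.pstdev(durations_s),
        "min": min(durations_s),
        "max": max(durations_s),
    }
```

Comparing the stdev of the pre-import segment against the post-import segment makes the platform-vs-app attribution quantitative rather than anecdotal.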



## Phase 7: Runtime Hygiene (Completed)

Completed changes:
1. Set a valid OMP default in start.sh.
2. Pin scikit-learn to 1.6.1 for pickle compatibility.



## Validation and Benchmark Protocol

Use the same procedure before and after changes.

1. Force a cold deployment in the HF Space.
2. Record these timestamps from logs:
   - Build queued time
   - Application startup time
   - Starting DeepFake Detector API
   - Models loaded successfully
   - Application startup complete
3. Compute:
   - Queue/build to app startup
   - App startup to model-ready
   - API model load phase
4. Capture per-model load durations from logs.
5. Save a comparison table in this file.
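
The deltas in step 3 can be computed mechanically from the recorded timestamps (a sketch; `seconds_between` is a hypothetical helper name, and it assumes same-day HH:MM:SS values as they appear in the logs):

```python
from datetime import datetime


def seconds_between(start_hms: str, end_hms: str) -> int:
    """Delta in whole seconds between two same-day HH:MM:SS log timestamps."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end_hms, fmt) - datetime.strptime(start_hms, fmt)
    return int(delta.total_seconds())


# Baseline segments recorded in this file:
# seconds_between("04:23:34", "04:24:02")  -> queue/build to app startup
# seconds_between("04:24:02", "04:25:36")  -> app startup to model-ready
```

Computing the segments the same way for every run removes off-by-one arithmetic errors from the comparison tables.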



## Phase 1 Results From Latest Logs

Source log window:
- Build queued at 2026-04-20 05:04:31
- Application startup begins at 2026-04-20 05:05:07
- Models loaded successfully at 2026-04-20 05:06:46

### Phase 1 Timing Summary

| Segment | Start | End | Duration | Notes |
|---|---:|---:|---:|---|
| Queue/build to app startup | 05:04:31 | 05:05:07 | 36s | Includes scheduling, build finalization, image start |
| App startup to model-ready | 05:05:07 | 05:06:46 | 99s | Time from uvicorn start message to models loaded |
| API model load phase | 05:06:41 | 05:06:46 | 5s | From "Starting DeepFake Detector API..." to "Models loaded successfully!" |

### Phase 1 Observations

- All Hugging Face repos were served from cache at runtime, confirming the build-time prefetch is working.
- The previous runtime download cost was eliminated from startup.
- The remaining startup time is now dominated by model wrapper initialization and import/init overhead rather than repo downloads.



## Phase 2 Results From Latest Logs

Source log window:
- Build queued at 2026-04-20 05:46:19
- Application startup begins at 2026-04-20 05:48:18
- Models loaded successfully at 2026-04-20 05:49:56

### Phase 2 Timing Summary

| Segment | Start | End | Duration | Notes |
|---|---:|---:|---:|---|
| Queue/build to app startup | 05:46:19 | 05:48:18 | 119s | Includes scheduling, build finalization, image start |
| App startup to model-ready | 05:48:18 | 05:49:56 | 98s | Time from uvicorn start message to models loaded |
| API model load phase | 05:49:52 | 05:49:56 | 4s | From "Starting DeepFake Detector API..." to "Models loaded successfully!" |

### Phase 2 Observations

- Submodel loading now overlaps in runtime logs (bounded parallel local initialization is active).
- The runtime API model load phase improved slightly (5s -> 4s).
- End-to-end startup remained dominated by pre-lifespan/init time (98s, still much larger than the model load slice).
- Runtime hygiene warnings no longer appeared in this run (no OMP warning and no sklearn pickle version warning).



## Phase 3 Results and Bottleneck Attribution

Source log window:
- Build queued at 2026-04-20 06:01:57
- Application startup begins at 2026-04-20 06:02:56
- Models loaded successfully at 2026-04-20 06:04:37

### Phase 3 Timing Summary

| Segment | Start | End | Duration | Notes |
|---|---:|---:|---:|---|
| Queue/build to app startup | 06:01:57 | 06:02:56 | 59s | Includes scheduling, build finalization, image start |
| App startup to model-ready | 06:02:56 | 06:04:37 | 101s | End-to-end startup from Space startup marker |
| API model load phase | 06:04:33 | 06:04:37 | 4s | From app startup handler to models loaded |

### Phase 3 Instrumentation Breakdown (Container Runtime)

| Marker | Duration |
|---|---:|
| module_import_complete | 4.050s |
| startup_model_load_duration_seconds | 3.967s |
| startup_lifespan_total_duration_seconds | 3.967s |
| load_from_fusion_repo_total_duration_seconds | 3.967s |

### Bottleneck Attribution

- The dominant gap is before module import:
    - 06:02:56 (Application Startup) -> 06:04:29 (module_import_start) = 93s.
- App code after import is no longer the main problem:
    - import + lifespan + model load is about 8s total.
- Conclusion:
    - Remaining cold start is primarily platform/container readiness overhead, not model download/load logic.



## Comparison Template (Fill After Implementation)

| Metric | Baseline (2026-04-20) | After Phase 1 | After Phase 2 | After Phase 3 | Final |
|---|---:|---:|---:|---:|---:|
| Queue/build to app startup | 28s | 36s | 119s | 59s |  |
| App startup to model-ready | 94s | 99s | 98s | 101s |  |
| API model load phase | 21s | 5s | 4s | 4s |  |
| vit-base load | 13s | 1s | 2s | 2s |  |
| deit-distilled load | 5s | 2s | 2s | 2s |  |
| Total visible build timed stages | 20.4s | 28.0s | 112.7s | 33.6s |  |
| Phase 3 module import duration | n/a | n/a | n/a | 4.050s |  |
| Phase 3 model registry total duration | n/a | n/a | n/a | 3.967s |  |



## Expected Outcome

Primary expected wins:
1. Reduced startup latency by avoiding runtime model downloads.
2. Reduced model load wall-clock via parallel submodel loads.
3. Stable and comparable timing data for iterative tuning.

Secondary expected wins:
1. Cleaner startup logs (no OMP warning).
2. Lower risk from sklearn deserialization mismatch.



## Rollback Plan

If anything regresses:
1. Revert parallel loading only and keep build-time prefetch.
2. Revert build-time prefetch and restore the runtime download flow.
3. Keep instrumentation to retain comparability.



## Notes

- This plan intentionally keeps the current FastAPI inference architecture unchanged.
- Triton feasibility can be revisited after cold start metrics improve and stabilize.