# Cold Start Optimization Implementation Guide (HF Spaces GPU)
## Goal
Reduce end-to-end cold start time for the backend on Hugging Face Spaces GPU while preserving inference quality and endpoint behavior.
This guide focuses solely on cold start optimization for the current FastAPI architecture.
## Baseline From Current Logs
Source log window:
- Build queued at 2026-04-20 04:23:34
- Application startup begins at 2026-04-20 04:24:02
- Models loaded successfully at 2026-04-20 04:25:36
### Baseline Timing Summary
| Segment | Start | End | Duration | Notes |
|---|---:|---:|---:|---|
| Queue/build to app startup | 04:23:34 | 04:24:02 | 28s | Includes scheduling, build finalization, image start |
| App startup to model-ready | 04:24:02 | 04:25:36 | 94s | Time from uvicorn start message to models loaded |
| API model load phase | 04:25:15 | 04:25:36 | 21s | From "Starting DeepFake Detector API..." to "Models loaded successfully!" |
### Build Stage Durations Visible In Logs
| Build Stage | Duration |
|---|---:|
| Restoring cache | 19.5s |
| COPY source to /app | 0.0s |
| mkdir/chown/chmod step | 0.1s |
| Pushing image | 0.7s |
| Exporting cache | 0.1s |
| Total visible timed stages | 20.4s |
Note:
- Several Docker steps were cache hits and reported as CACHED without explicit timing.
- "Application startup complete" appears immediately after model load logs; no explicit timestamp is printed, so 04:25:36 is used as the practical ready time.
### Model Load Breakdown (Current)
| Model | Start | End | Duration | Observation |
|---|---:|---:|---:|---|
| Fusion repo config | 04:25:15 | 04:25:16 | 1s | Fast |
| cnn-transfer-final | 04:25:16 | 04:25:17 | 1s | Fast |
| vit-base-final | 04:25:17 | 04:25:30 | 13s | Dominant bottleneck |
| deit-distilled-final | 04:25:30 | 04:25:35 | 5s | Moderate |
| gradfield-cnn-final | 04:25:35 | 04:25:35 | <1s | Fast |
| fusion model load | 04:25:35 | 04:25:36 | 1s | Fast |
| Total model load | 04:25:15 | 04:25:36 | 21s | Sequential loading |
## Current Bottlenecks
Reflecting the latest measured runs:
1. The dominant delay is pre-app startup, before Python module import begins.
2. Build-time prefetch adds wall time when cache layers miss.
3. Model loading is no longer dominant (~4s with the build-time cache and bounded parallel load).
4. Cold-start variance likely stems from platform scheduling/provisioning overhead.
## Implementation Plan
## Phase 1: Remove Runtime Model Downloads (Highest Impact)
### 1.1 Add model prefetch script
Create file: app/scripts/prefetch_models.py
Purpose:
- Download fusion repo and all submodel repos at build time into HF_CACHE_DIR.
- Ensure cold start does not wait on remote model downloads.
Implementation:
```python
import asyncio

from app.core.config import settings
from app.services.model_registry import get_model_registry


async def main() -> None:
    registry = get_model_registry()
    await registry.load_from_fusion_repo(settings.HF_FUSION_REPO_ID, force_reload=True)


if __name__ == "__main__":
    asyncio.run(main())
```
### 1.2 Update Dockerfile for build-time prefetch
Target file: Dockerfile
Key changes:
1. Keep dependency installation in a stable cache layer.
2. Copy only application code needed for prefetch before full source copy.
3. Run prefetch script during build with HF cache directory set.
4. Keep ownership and permissions for user uid 1000.
Implementation sketch:
```dockerfile
FROM python:3.11-slim

WORKDIR /app

ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1 \
    PIP_NO_CACHE_DIR=1 \
    PIP_DISABLE_PIP_VERSION_CHECK=1 \
    PORT=7860 \
    HF_CACHE_DIR=/app/.hf_cache

RUN apt-get update && apt-get install -y --no-install-recommends \
        curl \
        git \
    && rm -rf /var/lib/apt/lists/*

RUN useradd -m -u 1000 user
ENV PATH="/home/user/.local/bin:$PATH"

COPY requirements.txt .
RUN pip install --no-cache-dir --upgrade -r requirements.txt

# Copy app code required for prefetch
COPY app /app/app
COPY start.sh /app/start.sh

RUN mkdir -p /app/.hf_cache

# Build-time model prefetch (requires public repos or an HF token in the build env)
RUN python -m app.scripts.prefetch_models

RUN chown -R user:user /app && chmod +x /app/start.sh

USER user
EXPOSE 7860
CMD ["./start.sh"]
```
Notes:
- If private model repos are used, the build requires an HF_TOKEN available at build time.
- This increases image size but removes the startup wait caused by runtime downloads.
### 1.3 Verify HF cache is reused at runtime
Target file: app/services/hf_hub_service.py
Behavior to enforce:
- Keep deterministic local_dir path under /app/.hf_cache.
- Log cache hits clearly before download attempt.
Add logic before snapshot_download call:
```python
cached = self.get_cached_path(repo_id)
if cached and not force_download:
logger.info(f"Using cached repo for {repo_id}: {cached}")
return cached
```
## Phase 2: Parallelize Submodel Loading
Target file: app/services/model_registry.py
Current behavior:
- Submodels are loaded one by one.
New behavior:
- Load submodels concurrently with bounded parallelism.
Implementation steps:
1. Add a semaphore to bound concurrency (for example, a limit of 2).
2. Replace sequential loop with asyncio.gather.
3. Keep deterministic final registration and clear error propagation.
Implementation sketch:
```python
sem = asyncio.Semaphore(2)

async def _load_with_limit(repo_id: str) -> None:
    async with sem:
        await self._load_submodel(repo_id)

tasks = [_load_with_limit(repo_id) for repo_id in submodel_repos]
results = await asyncio.gather(*tasks, return_exceptions=True)

errors = [r for r in results if isinstance(r, Exception)]
if errors:
    raise RuntimeError(f"Failed to load one or more submodels: {errors}")
```
Reason for bounded parallelism:
- Reduces startup time without overwhelming memory/network in GPU Space containers.
## Phase 3: Add Startup Instrumentation For Reliable Comparisons
Target file: app/main.py
Add timing markers:
- App startup begin timestamp.
- Model loading start and end.
- Total lifespan startup duration.
Implementation sketch:
```python
import time
startup_t0 = time.perf_counter()
...
model_t0 = time.perf_counter()
await registry.load_from_fusion_repo(settings.HF_FUSION_REPO_ID)
model_dt = time.perf_counter() - model_t0
logger.info(f"Model load duration_seconds={model_dt:.3f}")
...
startup_dt = time.perf_counter() - startup_t0
logger.info(f"Startup total duration_seconds={startup_dt:.3f}")
```
## Phase 4: Use Persistent Storage Cache (/data)
Goal:
- Make /data the primary cache location so model artifacts survive container rebuilds/restarts.
Target files:
- app/core/config.py
- start.sh
- app/services/hf_hub_service.py
Plan:
1. Prefer /data-backed cache paths when available:
- HF_HOME=/data/.cache/huggingface
- HF_CACHE_DIR=/data/.hf_cache
2. Keep fallback to /app/.hf_cache when /data is unavailable.
3. Ensure startup creates/chowns cache directories safely.
4. Keep cache-hit logging so verification remains explicit in logs.
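A minimal sketch of the path-selection logic for app/core/config.py. The function name, defaults, and writability check are assumptions based on this plan, not the current implementation:

```python
import os
from pathlib import Path


def resolve_hf_cache_dir(data_root: str = "/data",
                         fallback: str = "/app/.hf_cache") -> str:
    """Prefer a persistent /data-backed cache; fall back to the image-local path.

    Hypothetical helper: names and defaults follow the plan above.
    """
    root = Path(data_root)
    if root.is_dir() and os.access(root, os.W_OK):
        cache = root / ".hf_cache"
    else:
        cache = Path(fallback)
    # Safe if the directory already exists; creates it on first warm boot.
    cache.mkdir(parents=True, exist_ok=True)
    return str(cache)
```

start.sh could then export HF_CACHE_DIR from this value before launching uvicorn, so the hub service and prefetch script agree on one location.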
Expected impact:
- Faster warm boots across deploys.
- Lower risk of repeated network fetch for large model files.
## Phase 5: Decouple Build From Prefetch
Goal:
- Reduce rebuild penalty from model prefetch while keeping runtime fast.
Target file:
- Dockerfile
Plan:
1. Make build-time prefetch optional via ARG/ENV flag.
2. Default to skipping build prefetch when persistent /data cache is enabled.
3. Keep one-time warm path at runtime (guarded by cache/sentinel file in /data).
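A Dockerfile sketch of the opt-in prefetch. The ARG name is a hypothetical choice, not an existing flag in the repo:

```dockerfile
# Sketch: opt-in build-time prefetch (PREFETCH_MODELS is a hypothetical ARG name).
# Enable with: docker build --build-arg PREFETCH_MODELS=1 .
ARG PREFETCH_MODELS=0
RUN if [ "$PREFETCH_MODELS" = "1" ]; then \
        python -m app.scripts.prefetch_models; \
    else \
        echo "Skipping build-time model prefetch"; \
    fi
```

With the default of 0, rebuilds stay fast and the persistent /data cache (Phase 4) absorbs the one-time warm cost at runtime.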
Expected impact:
- Faster image rebuild/push cycles.
- Better developer iteration speed without sacrificing warmed production startup.
## Phase 6: Platform/GPU Startup Characterization
Goal:
- Quantify how much of remaining cold start is platform provisioning vs app code.
Plan:
1. Run repeated cold starts with identical image on current T4.
2. If available, test one higher-tier GPU Space and compare only the Phase 3 instrumentation markers.
3. Record variance in:
- Application Startup -> module_import_start
- module_import_complete -> startup complete
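The per-run gaps can be summarized with simple statistics; the sample values below are hypothetical placeholders, not measured data:

```python
from statistics import mean, stdev

# Hypothetical per-run gaps (seconds) from "Application Startup" to
# module_import_start; replace with measured values from repeated cold starts.
pre_import_gaps = [93, 87, 101, 95]

print(f"pre-import gap: mean={mean(pre_import_gaps):.1f}s "
      f"stdev={stdev(pre_import_gaps):.2f}s")
```

A large stdev relative to the mean would support the platform-provisioning hypothesis, since app code is identical across runs.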
Notes:
- GPU type can affect model initialization time.
- The measured dominant delay currently occurs before app module import, so platform scheduling/provisioning is likely the bigger lever than model code tuning.
## Phase 7: Runtime Hygiene (Completed)
Completed changes:
1. Set valid OMP default in start.sh.
2. Pin scikit-learn to 1.6.1 for pickle compatibility.
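The OMP change can be sketched as a guarded default in start.sh. The specific value (1) is an assumption; the logs only confirm the warning disappeared:

```shell
# Provide a valid OMP_NUM_THREADS default without overriding an explicit setting.
# The value 1 is an assumed default; tune per container CPU allocation.
export OMP_NUM_THREADS="${OMP_NUM_THREADS:-1}"
echo "OMP_NUM_THREADS=${OMP_NUM_THREADS}"
```

Using the `${VAR:-default}` form keeps any value set by the Space's environment configuration authoritative.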
## Validation and Benchmark Protocol
Use the same procedure before and after changes.
1. Force a cold deployment in HF Space.
2. Record these timestamps from logs:
- Build queued time
- Application startup time
- Starting DeepFake Detector API
- Models loaded successfully
- Application startup complete
3. Compute:
- Queue/build to app startup
- App startup to model-ready
- API model load phase
4. Capture per-model load durations from logs.
5. Save a comparison table in this file.
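Step 3's segment math can be done with a small helper; the timestamps below are the baseline values already recorded in this guide:

```python
from datetime import datetime


def seconds_between(start: str, end: str) -> int:
    """Duration in whole seconds between two HH:MM:SS log timestamps (same day)."""
    fmt = "%H:%M:%S"
    return int((datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds())


# Baseline values from the tables above.
queue_to_startup = seconds_between("04:23:34", "04:24:02")  # 28
startup_to_ready = seconds_between("04:24:02", "04:25:36")  # 94
model_load_phase = seconds_between("04:25:15", "04:25:36")  # 21
```

Note the helper assumes both timestamps fall on the same day; a cold start spanning midnight would need full datetimes from the logs.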
## Phase 1 Results From Latest Logs
Source log window:
- Build queued at 2026-04-20 05:04:31
- Application startup begins at 2026-04-20 05:05:07
- Models loaded successfully at 2026-04-20 05:06:46
### Phase 1 Timing Summary
| Segment | Start | End | Duration | Notes |
|---|---:|---:|---:|---|
| Queue/build to app startup | 05:04:31 | 05:05:07 | 36s | Includes scheduling, build finalization, image start |
| App startup to model-ready | 05:05:07 | 05:06:46 | 99s | Time from uvicorn start message to models loaded |
| API model load phase | 05:06:41 | 05:06:46 | 5s | From "Starting DeepFake Detector API..." to "Models loaded successfully!" |
### Phase 1 Observations
- All Hugging Face repos were served from cache at runtime, confirming the build-time prefetch is working.
- The previous runtime download cost was eliminated from startup.
- The remaining startup time is now dominated by model wrapper initialization and import/init overhead rather than repo downloads.
## Phase 2 Results From Latest Logs
Source log window:
- Build queued at 2026-04-20 05:46:19
- Application startup begins at 2026-04-20 05:48:18
- Models loaded successfully at 2026-04-20 05:49:56
### Phase 2 Timing Summary
| Segment | Start | End | Duration | Notes |
|---|---:|---:|---:|---|
| Queue/build to app startup | 05:46:19 | 05:48:18 | 119s | Includes scheduling, build finalization, image start |
| App startup to model-ready | 05:48:18 | 05:49:56 | 98s | Time from uvicorn start message to models loaded |
| API model load phase | 05:49:52 | 05:49:56 | 4s | From "Starting DeepFake Detector API..." to "Models loaded successfully!" |
### Phase 2 Observations
- Submodel loading now overlaps in runtime logs (bounded parallel local initialization is active).
- Runtime API model load phase improved slightly (5s -> 4s).
- End-to-end startup remained dominated by pre-lifespan/init time (98s still much larger than model load slice).
- Runtime hygiene warnings no longer appeared in this run (no OMP warning and no sklearn pickle version warning).
## Phase 3 Results and Bottleneck Attribution
Source log window:
- Build queued at 2026-04-20 06:01:57
- Application startup begins at 2026-04-20 06:02:56
- Models loaded successfully at 2026-04-20 06:04:37
### Phase 3 Timing Summary
| Segment | Start | End | Duration | Notes |
|---|---:|---:|---:|---|
| Queue/build to app startup | 06:01:57 | 06:02:56 | 59s | Includes scheduling, build finalization, image start |
| App startup to model-ready | 06:02:56 | 06:04:37 | 101s | End-to-end startup from Space startup marker |
| API model load phase | 06:04:33 | 06:04:37 | 4s | From app startup handler to models loaded |
### Phase 3 Instrumentation Breakdown (Container Runtime)
| Marker | Duration |
|---|---:|
| module_import_complete | 4.050s |
| startup_model_load_duration_seconds | 3.967s |
| startup_lifespan_total_duration_seconds | 3.967s |
| load_from_fusion_repo_total_duration_seconds | 3.967s |
### Bottleneck Attribution
- Dominant gap is before module import:
- 06:02:56 (Application Startup) -> 06:04:29 (module_import_start) = 93s.
- App code after import is no longer the main problem:
- import + lifespan + model load is about 8s total.
- Conclusion:
- Remaining cold start is primarily platform/container readiness overhead, not model download/load logic.
## Comparison Table (Updated After Each Phase; Final Column Pending)
| Metric | Baseline (2026-04-20) | After Phase 1 | After Phase 2 | After Phase 3 | Final |
|---|---:|---:|---:|---:|---:|
| Queue/build to app startup | 28s | 36s | 119s | 59s | |
| App startup to model-ready | 94s | 99s | 98s | 101s | |
| API model load phase | 21s | 5s | 4s | 4s | |
| vit-base load | 13s | 1s | 2s | 2s | |
| deit-distilled load | 5s | 2s | 2s | 2s | |
| Total visible build timed stages | 20.4s | 28.0s | 112.7s | 33.6s | |
| Phase 3 module import duration | n/a | n/a | n/a | 4.050s | |
| Phase 3 model registry total duration | n/a | n/a | n/a | 3.967s | |
## Expected Outcome
Primary expected wins:
1. Reduced startup latency by avoiding runtime model downloads.
2. Reduced model load wall-clock via parallel submodel loads.
3. Stable and comparable timing data for iterative tuning.
Secondary expected wins:
1. Cleaner startup logs (no OMP warning).
2. Lower risk from sklearn deserialization mismatch.
## Rollback Plan
If anything regresses:
1. Revert parallel loading only and keep build-time prefetch.
2. Revert build-time prefetch and restore runtime download flow.
3. Keep instrumentation to retain comparability.
## Notes
- This plan intentionally keeps current FastAPI inference architecture unchanged.
- Triton feasibility can be revisited after cold start metrics improve and stabilize.