
Cold Start Optimization Implementation Guide (HF Spaces GPU)

Goal

Reduce end-to-end cold start time for the backend on Hugging Face Spaces GPU while preserving inference quality and endpoint behavior.

This guide is focused only on cold start optimization for the current FastAPI architecture.

Baseline From Current Logs

Source log window:

  • Build queued at 2026-04-20 04:23:34
  • Application startup begins at 2026-04-20 04:24:02
  • Models loaded successfully at 2026-04-20 04:25:36

Baseline Timing Summary

| Segment | Start | End | Duration | Notes |
|---|---|---|---|---|
| Queue/build to app startup | 04:23:34 | 04:24:02 | 28s | Includes scheduling, build finalization, image start |
| App startup to model-ready | 04:24:02 | 04:25:36 | 94s | Time from uvicorn start message to models loaded |
| API model load phase | 04:25:15 | 04:25:36 | 21s | From "Starting DeepFake Detector API..." to "Models loaded successfully!" |

Build Stage Durations Visible In Logs

| Build Stage | Duration |
|---|---|
| Restoring cache | 19.5s |
| COPY source to /app | 0.0s |
| mkdir/chown/chmod step | 0.1s |
| Pushing image | 0.7s |
| Exporting cache | 0.1s |
| Total visible timed stages | 20.4s |

Note:

  • Several Docker steps were cache hits and reported as CACHED without explicit timing.
  • "Application startup complete" appears immediately after model load logs; no explicit timestamp is printed, so 04:25:36 is used as the practical ready time.

Model Load Breakdown (Current)

| Model | Start | End | Duration | Observation |
|---|---|---|---|---|
| Fusion repo config | 04:25:15 | 04:25:16 | 1s | Fast |
| cnn-transfer-final | 04:25:16 | 04:25:17 | 1s | Fast |
| vit-base-final | 04:25:17 | 04:25:30 | 13s | Dominant bottleneck |
| deit-distilled-final | 04:25:30 | 04:25:35 | 5s | Moderate |
| gradfield-cnn-final | 04:25:35 | 04:25:35 | <1s | Fast |
| fusion model load | 04:25:35 | 04:25:36 | 1s | Fast |
| Total model load | 04:25:15 | 04:25:36 | 21s | Sequential loading |

Current Bottlenecks

  1. Dominant pre-app startup delay before Python module import begins.
  2. Build-time prefetch cost when cache layers miss (extra build wall time).
  3. Model loading is no longer dominant (~4s with current cache and bounded parallel load).
  4. Cold-start variance likely includes platform scheduling/provisioning overhead.

Implementation Plan

Phase 1: Remove Runtime Model Downloads (Highest Impact)

1.1 Add model prefetch script

Create file: app/scripts/prefetch_models.py

Purpose:

  • Download fusion repo and all submodel repos at build time into HF_CACHE_DIR.
  • Ensure cold start does not wait on remote model downloads.

Implementation:

import asyncio

from app.core.config import settings
from app.services.model_registry import get_model_registry


async def main() -> None:
    # Reuse the runtime loading path so the build-time cache layout matches
    # exactly what the app expects at startup.
    registry = get_model_registry()
    await registry.load_from_fusion_repo(settings.HF_FUSION_REPO_ID, force_reload=True)


if __name__ == "__main__":
    asyncio.run(main())

1.2 Update Dockerfile for build-time prefetch

Target file: Dockerfile

Key changes:

  1. Keep dependency installation in a stable cache layer.
  2. Copy only application code needed for prefetch before full source copy.
  3. Run prefetch script during build with HF cache directory set.
  4. Keep ownership and permissions for user uid 1000.

Implementation sketch:

FROM python:3.11-slim

WORKDIR /app

ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1 \
    PIP_NO_CACHE_DIR=1 \
    PIP_DISABLE_PIP_VERSION_CHECK=1 \
    PORT=7860 \
    HF_CACHE_DIR=/app/.hf_cache

RUN apt-get update && apt-get install -y --no-install-recommends \
    curl \
    git \
    && rm -rf /var/lib/apt/lists/*

RUN useradd -m -u 1000 user
ENV PATH="/home/user/.local/bin:$PATH"

COPY requirements.txt .
RUN pip install --no-cache-dir --upgrade -r requirements.txt

# Copy app code required for prefetch
COPY app /app/app
COPY start.sh /app/start.sh

RUN mkdir -p /app/.hf_cache

# Build-time model prefetch (requires public repos or HF token in build env)
RUN python -m app.scripts.prefetch_models

RUN chown -R user:user /app && chmod +x /app/start.sh
USER user

EXPOSE 7860
CMD ["./start.sh"]

Notes:

  • If private model repos are used, build needs HF_TOKEN.
  • This increases image size but reduces startup wait caused by downloads.

1.3 Verify HF cache is reused at runtime

Target file: app/services/hf_hub_service.py

Behavior to enforce:

  • Keep deterministic local_dir path under /app/.hf_cache.
  • Log cache hits clearly before download attempt.

Add logic before snapshot_download call:

cached = self.get_cached_path(repo_id)
if cached and not force_download:
    logger.info(f"Using cached repo for {repo_id}: {cached}")
    return cached
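
For reference, a minimal standalone sketch of this cache-aware path, assuming the service wraps huggingface_hub.snapshot_download and that get_cached_path checks a deterministic per-repo directory. The module-level function names here are illustrative; the real service is class-based.

import logging
from pathlib import Path

from huggingface_hub import snapshot_download

logger = logging.getLogger(__name__)

HF_CACHE_DIR = Path("/app/.hf_cache")


def _local_dir(repo_id: str) -> Path:
    # Deterministic layout: one folder per repo under the cache root.
    return HF_CACHE_DIR / repo_id.replace("/", "__")


def get_cached_path(repo_id: str) -> str | None:
    local_dir = _local_dir(repo_id)
    if local_dir.is_dir() and any(local_dir.iterdir()):
        return str(local_dir)
    return None


def download_repo(repo_id: str, force_download: bool = False) -> str:
    cached = get_cached_path(repo_id)
    if cached and not force_download:
        logger.info(f"Using cached repo for {repo_id}: {cached}")
        return cached
    logger.info(f"Cache miss for {repo_id}, downloading snapshot")
    return snapshot_download(repo_id=repo_id, local_dir=str(_local_dir(repo_id)))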

Phase 2: Parallelize Submodel Loading

Target file: app/services/model_registry.py

Current behavior:

  • Submodels are loaded one by one.

New behavior:

  • Load submodels concurrently with bounded parallelism.

Implementation steps:

  1. Add a semaphore, for example max concurrency 2.
  2. Replace sequential loop with asyncio.gather.
  3. Keep deterministic final registration and clear error propagation.

Implementation sketch:

# Inside the registry's fusion-repo load flow: bound concurrency so a limited
# number of submodels initialize at once instead of strictly one-by-one.
sem = asyncio.Semaphore(2)

async def _load_with_limit(repo_id: str) -> None:
    async with sem:
        await self._load_submodel(repo_id)

tasks = [_load_with_limit(repo_id) for repo_id in submodel_repos]
results = await asyncio.gather(*tasks, return_exceptions=True)

# Surface every failure at once rather than stopping at the first error.
errors = [r for r in results if isinstance(r, Exception)]
if errors:
    raise RuntimeError(f"Failed to load one or more submodels: {errors}")

Reason for bounded parallelism:

  • Reduces startup time without overwhelming memory/network in GPU Space containers.

Phase 3: Add Startup Instrumentation For Reliable Comparisons

Target file: app/main.py

Add timing markers:

  • App startup begin timestamp.
  • Model loading start and end.
  • Total lifespan startup duration.

Implementation sketch:

import time

startup_t0 = time.perf_counter()
...
model_t0 = time.perf_counter()
await registry.load_from_fusion_repo(settings.HF_FUSION_REPO_ID)
model_dt = time.perf_counter() - model_t0
logger.info(f"Model load duration_seconds={model_dt:.3f}")
...
startup_dt = time.perf_counter() - startup_t0
logger.info(f"Startup total duration_seconds={startup_dt:.3f}")

Phase 4: Use Persistent Storage Cache (/data)

Goal:

  • Make /data the primary cache location so model artifacts survive container rebuilds/restarts.

Target files:

  • app/core/config.py
  • start.sh
  • app/services/hf_hub_service.py

Plan:

  1. Prefer /data-backed cache paths when available:
    • HF_HOME=/data/.cache/huggingface
    • HF_CACHE_DIR=/data/.hf_cache
  2. Keep fallback to /app/.hf_cache when /data is unavailable.
  3. Ensure startup creates/chowns cache directories safely.
  4. Keep cache-hit logging so verification remains explicit in logs.
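
A minimal sketch of the /data-preferring selection for app/core/config.py, assuming the cache root is resolved at import time; the function name below is illustrative, not existing code.

import os
from pathlib import Path


def resolve_hf_cache_dir() -> str:
    """Prefer a /data-backed cache when persistent storage is mounted and writable."""
    data_cache = Path("/data/.hf_cache")
    try:
        data_cache.mkdir(parents=True, exist_ok=True)
        if os.access(data_cache, os.W_OK):
            return str(data_cache)
    except OSError:
        pass  # /data missing or read-only: fall back to the image-local cache
    fallback = Path("/app/.hf_cache")
    fallback.mkdir(parents=True, exist_ok=True)
    return str(fallback)


# The resolved value would back settings.HF_CACHE_DIR; when /data is selected,
# HF_HOME can be pointed at /data/.cache/huggingface as listed in the plan above.
HF_CACHE_DIR = resolve_hf_cache_dir()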

Expected impact:

  • Faster warm boots across deploys.
  • Lower risk of repeated network fetch for large model files.

Phase 5: Decouple Build From Prefetch

Goal:

  • Reduce rebuild penalty from model prefetch while keeping runtime fast.

Target file:

  • Dockerfile

Plan:

  1. Make build-time prefetch optional via ARG/ENV flag.
  2. Default to skipping build prefetch when persistent /data cache is enabled.
  3. Keep one-time warm path at runtime (guarded by cache/sentinel file in /data).
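
For plan item 3, a sketch of the one-time runtime warm path guarded by a sentinel file in the persistent cache; the sentinel name and helper below are illustrative, not existing code.

import asyncio
from pathlib import Path

from app.core.config import settings
from app.services.model_registry import get_model_registry

# Sentinel marking that a full prefetch has already populated the persistent cache.
_SENTINEL = Path(settings.HF_CACHE_DIR) / ".prefetch_complete"


async def warm_cache_if_needed() -> None:
    if _SENTINEL.exists():
        return  # a previous boot already warmed the cache
    registry = get_model_registry()
    await registry.load_from_fusion_repo(settings.HF_FUSION_REPO_ID, force_reload=True)
    _SENTINEL.touch()


if __name__ == "__main__":
    asyncio.run(warm_cache_if_needed())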

Expected impact:

  • Faster image rebuild/push cycles.
  • Better developer iteration speed without sacrificing warmed production startup.

Phase 6: Platform/GPU Startup Characterization

Goal:

  • Quantify how much of remaining cold start is platform provisioning vs app code.

Plan:

  1. Run repeated cold starts with identical image on current T4.
  2. If available, test one higher-tier GPU Space and compare only the Phase 3 instrumentation markers.
  3. Record variance in:
    • Application Startup -> module_import_start
    • module_import_complete -> startup complete
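
Once several cold starts are recorded, a small helper like the following can summarize each segment's variance (the sample values are placeholders showing the shape of the data, not measurements).

from statistics import mean, stdev

# Durations in seconds per segment across repeated cold starts (placeholder values).
samples = {
    "application_startup_to_module_import_start": [93.0, 88.5, 101.2],
    "module_import_to_startup_complete": [8.1, 7.8, 8.4],
}

for name, values in samples.items():
    print(f"{name}: mean={mean(values):.1f}s stdev={stdev(values):.1f}s n={len(values)}")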

Notes:

  • GPU type can affect model initialization time.
  • The measured dominant delay currently occurs before app module import, so platform scheduling/provisioning is likely a bigger lever than further model-code tuning.

Phase 7: Runtime Hygiene (Completed)

Completed changes:

  1. Set valid OMP default in start.sh.
  2. Pin scikit-learn to 1.6.1 for pickle compatibility.

Validation and Benchmark Protocol

Use the same procedure before and after changes.

  1. Force a cold deployment in HF Space.
  2. Record these timestamps from logs:
    • Build queued time
    • Application startup time
    • Starting DeepFake Detector API
    • Models loaded successfully
    • Application startup complete
  3. Compute:
    • Queue/build to app startup
    • App startup to model-ready
    • API model load phase
  4. Capture per-model load durations from logs.
  5. Save a comparison table in this file.
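
As a convenience for step 3, a small helper that turns the copied HH:MM:SS log timestamps into the three segment durations (illustrative; the printed values reproduce the baseline table above).

from datetime import datetime


def seconds_between(start: str, end: str) -> int:
    fmt = "%H:%M:%S"
    return int((datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds())


# Baseline example from the table above.
print("Queue/build to app startup:", seconds_between("04:23:34", "04:24:02"), "s")  # 28
print("App startup to model-ready:", seconds_between("04:24:02", "04:25:36"), "s")  # 94
print("API model load phase:", seconds_between("04:25:15", "04:25:36"), "s")        # 21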

Phase 1 Results From Latest Logs

Source log window:

  • Build queued at 2026-04-20 05:04:31
  • Application startup begins at 2026-04-20 05:05:07
  • Models loaded successfully at 2026-04-20 05:06:46

Phase 1 Timing Summary

| Segment | Start | End | Duration | Notes |
|---|---|---|---|---|
| Queue/build to app startup | 05:04:31 | 05:05:07 | 36s | Includes scheduling, build finalization, image start |
| App startup to model-ready | 05:05:07 | 05:06:46 | 99s | Time from uvicorn start message to models loaded |
| API model load phase | 05:06:41 | 05:06:46 | 5s | From "Starting DeepFake Detector API..." to "Models loaded successfully!" |

Phase 1 Observations

  • All Hugging Face repos were served from cache at runtime, confirming the build-time prefetch is working.
  • The previous runtime download cost was eliminated from startup.
  • The remaining startup time is now dominated by model wrapper initialization and import/init overhead rather than repo downloads.

Phase 2 Results From Latest Logs

Source log window:

  • Build queued at 2026-04-20 05:46:19
  • Application startup begins at 2026-04-20 05:48:18
  • Models loaded successfully at 2026-04-20 05:49:56

Phase 2 Timing Summary

| Segment | Start | End | Duration | Notes |
|---|---|---|---|---|
| Queue/build to app startup | 05:46:19 | 05:48:18 | 119s | Includes scheduling, build finalization, image start |
| App startup to model-ready | 05:48:18 | 05:49:56 | 98s | Time from uvicorn start message to models loaded |
| API model load phase | 05:49:52 | 05:49:56 | 4s | From "Starting DeepFake Detector API..." to "Models loaded successfully!" |

Phase 2 Observations

  • Submodel loading now overlaps in runtime logs (bounded parallel local initialization is active).
  • Runtime API model load phase improved slightly (5s -> 4s).
  • End-to-end startup remained dominated by pre-lifespan/init time (98s still much larger than model load slice).
  • Runtime hygiene warnings no longer appeared in this run (no OMP warning and no sklearn pickle version warning).

Phase 3 Results and Bottleneck Attribution

Source log window:

  • Build queued at 2026-04-20 06:01:57
  • Application startup begins at 2026-04-20 06:02:56
  • Models loaded successfully at 2026-04-20 06:04:37

Phase 3 Timing Summary

| Segment | Start | End | Duration | Notes |
|---|---|---|---|---|
| Queue/build to app startup | 06:01:57 | 06:02:56 | 59s | Includes scheduling, build finalization, image start |
| App startup to model-ready | 06:02:56 | 06:04:37 | 101s | End-to-end startup from Space startup marker |
| API model load phase | 06:04:33 | 06:04:37 | 4s | From app startup handler to models loaded |

Phase 3 Instrumentation Breakdown (Container Runtime)

| Marker | Duration |
|---|---|
| module_import_complete | 4.050s |
| startup_model_load_duration_seconds | 3.967s |
| startup_lifespan_total_duration_seconds | 3.967s |
| load_from_fusion_repo_total_duration_seconds | 3.967s |

Bottleneck Attribution

  • Dominant gap is before module import:
    • 06:02:56 (Application Startup) -> 06:04:29 (module_import_start) = 93s.
  • App code after import is no longer the main problem:
    • import + lifespan + model load is about 8s total.
  • Conclusion:
    • Remaining cold start is primarily platform/container readiness overhead, not model download/load logic.

Comparison Table (Final Column To Be Filled)

| Metric | Baseline (2026-04-20) | After Phase 1 | After Phase 2 | After Phase 3 | Final |
|---|---|---|---|---|---|
| Queue/build to app startup | 28s | 36s | 119s | 59s | |
| App startup to model-ready | 94s | 99s | 98s | 101s | |
| API model load phase | 21s | 5s | 4s | 4s | |
| vit-base load | 13s | 1s | 2s | 2s | |
| deit-distilled load | 5s | 2s | 2s | 2s | |
| Total visible build timed stages | 20.4s | 28.0s | 112.7s | 33.6s | |
| Phase 3 module import duration | n/a | n/a | n/a | 4.050s | |
| Phase 3 model registry total duration | n/a | n/a | n/a | 3.967s | |

Expected Outcome

Primary expected wins:

  1. Reduced startup latency by avoiding runtime model downloads.
  2. Reduced model load wall-clock via parallel submodel loads.
  3. Stable and comparable timing data for iterative tuning.

Secondary expected wins:

  1. Cleaner startup logs (no OMP warning).
  2. Lower risk from sklearn deserialization mismatch.

Rollback Plan

If anything regresses:

  1. Revert parallel loading only and keep build-time prefetch.
  2. Revert build-time prefetch and restore runtime download flow.
  3. Keep instrumentation to retain comparability.

Notes

  • This plan intentionally keeps current FastAPI inference architecture unchanged.
  • Triton feasibility can be revisited after cold start metrics improve and stabilize.