Cold Start Optimization Implementation Guide (HF Spaces GPU)
Goal
Reduce end-to-end cold start time for the backend on Hugging Face Spaces GPU while preserving inference quality and endpoint behavior.
This guide is focused only on cold start optimization for the current FastAPI architecture.
Baseline From Current Logs
Source log window:
- Build queued at 2026-04-20 04:23:34
- Application startup begins at 2026-04-20 04:24:02
- Models loaded successfully at 2026-04-20 04:25:36
Baseline Timing Summary
| Segment | Start | End | Duration | Notes |
|---|---|---|---|---|
| Queue/build to app startup | 04:23:34 | 04:24:02 | 28s | Includes scheduling, build finalization, image start |
| App startup to model-ready | 04:24:02 | 04:25:36 | 94s | Time from uvicorn start message to models loaded |
| API model load phase | 04:25:15 | 04:25:36 | 21s | From "Starting DeepFake Detector API..." to "Models loaded successfully!" |
Build Stage Durations Visible In Logs
| Build Stage | Duration |
|---|---|
| Restoring cache | 19.5s |
| COPY source to /app | 0.0s |
| mkdir/chown/chmod step | 0.1s |
| Pushing image | 0.7s |
| Exporting cache | 0.1s |
| Total visible timed stages | 20.4s |
Note:
- Several Docker steps were cache hits and reported as CACHED without explicit timing.
- "Application startup complete" appears immediately after model load logs; no explicit timestamp is printed, so 04:25:36 is used as the practical ready time.
Model Load Breakdown (Current)
| Model | Start | End | Duration | Observation |
|---|---|---|---|---|
| Fusion repo config | 04:25:15 | 04:25:16 | 1s | Fast |
| cnn-transfer-final | 04:25:16 | 04:25:17 | 1s | Fast |
| vit-base-final | 04:25:17 | 04:25:30 | 13s | Dominant bottleneck |
| deit-distilled-final | 04:25:30 | 04:25:35 | 5s | Moderate |
| gradfield-cnn-final | 04:25:35 | 04:25:35 | <1s | Fast |
| fusion model load | 04:25:35 | 04:25:36 | 1s | Fast |
| Total model load | 04:25:15 | 04:25:36 | 21s | Sequential loading |
Current Bottlenecks (Updated After Phases 1-3)
- The dominant delay now occurs before Python module import begins (pre-app startup).
- Build-time prefetch adds extra build wall time whenever Docker cache layers miss.
- Model loading is no longer dominant (~4s with the build-time cache and bounded parallel load).
- Remaining cold-start variance likely stems from platform scheduling/provisioning overhead.
Implementation Plan
Phase 1: Remove Runtime Model Downloads (Highest Impact)
1.1 Add model prefetch script
Create file: app/scripts/prefetch_models.py
Purpose:
- Download fusion repo and all submodel repos at build time into HF_CACHE_DIR.
- Ensure cold start does not wait on remote model downloads.
Implementation:
import asyncio

from app.core.config import settings
from app.services.model_registry import get_model_registry

async def main() -> None:
    registry = get_model_registry()
    await registry.load_from_fusion_repo(settings.HF_FUSION_REPO_ID, force_reload=True)

if __name__ == "__main__":
    asyncio.run(main())
1.2 Update Dockerfile for build-time prefetch
Target file: Dockerfile
Key changes:
- Keep dependency installation in a stable cache layer.
- Copy only application code needed for prefetch before full source copy.
- Run prefetch script during build with HF cache directory set.
- Keep ownership and permissions for user uid 1000.
Implementation sketch:
FROM python:3.11-slim

WORKDIR /app

ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1 \
    PIP_NO_CACHE_DIR=1 \
    PIP_DISABLE_PIP_VERSION_CHECK=1 \
    PORT=7860 \
    HF_CACHE_DIR=/app/.hf_cache

RUN apt-get update && apt-get install -y --no-install-recommends \
    curl \
    git \
    && rm -rf /var/lib/apt/lists/*

RUN useradd -m -u 1000 user
ENV PATH="/home/user/.local/bin:$PATH"

COPY requirements.txt .
RUN pip install --no-cache-dir --upgrade -r requirements.txt

# Copy app code required for prefetch
COPY app /app/app
COPY start.sh /app/start.sh

RUN mkdir -p /app/.hf_cache

# Build-time model prefetch (requires public repos or HF token in build env)
RUN python -m app.scripts.prefetch_models

RUN chown -R user:user /app && chmod +x /app/start.sh

USER user
EXPOSE 7860

CMD ["./start.sh"]
Notes:
- If private model repos are used, build needs HF_TOKEN.
- This increases image size but reduces startup wait caused by downloads.
1.3 Verify HF cache is reused at runtime
Target file: app/services/hf_hub_service.py
Behavior to enforce:
- Keep deterministic local_dir path under /app/.hf_cache.
- Log cache hits clearly before download attempt.
Add logic before snapshot_download call:
cached = self.get_cached_path(repo_id)
if cached and not force_download:
    logger.info(f"Using cached repo for {repo_id}: {cached}")
    return cached
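For context, a minimal free-function sketch of the deterministic lookup get_cached_path is expected to perform (the directory naming scheme and the non-empty check here are assumptions, not the service's confirmed implementation):

from pathlib import Path

# Hypothetical mirror of the cache layout; the real value comes from settings.HF_CACHE_DIR.
HF_CACHE_DIR = Path("/app/.hf_cache")

def get_cached_path(repo_id: str) -> str | None:
    # One directory per repo, derived only from the repo id, so build-time
    # prefetch and runtime lookup always agree on the same location.
    local_dir = HF_CACHE_DIR / repo_id.replace("/", "--")
    if local_dir.is_dir() and any(local_dir.iterdir()):
        return str(local_dir)
    return None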
Phase 2: Parallelize Submodel Loading
Target file: app/services/model_registry.py
Current behavior:
- Submodels are loaded one by one.
New behavior:
- Load submodels concurrently with bounded parallelism.
Implementation steps:
- Add a semaphore, for example max concurrency 2.
- Replace sequential loop with asyncio.gather.
- Keep deterministic final registration and clear error propagation.
Implementation sketch:
sem = asyncio.Semaphore(2)

async def _load_with_limit(repo_id: str) -> None:
    async with sem:
        await self._load_submodel(repo_id)

tasks = [_load_with_limit(repo_id) for repo_id in submodel_repos]
results = await asyncio.gather(*tasks, return_exceptions=True)
errors = [r for r in results if isinstance(r, Exception)]
if errors:
    raise RuntimeError(f"Failed to load one or more submodels: {errors}")
Reason for bounded parallelism:
- Reduces startup wall-clock without overwhelming memory or network in GPU Space containers.
- The overlap mainly helps the I/O-bound portions of each load; GIL- and CUDA-bound initialization still largely serializes, which is why a small limit (2) captures most of the benefit.
Phase 3: Add Startup Instrumentation For Reliable Comparisons
Target file: app/main.py
Add timing markers:
- App startup begin timestamp.
- Model loading start and end.
- Total lifespan startup duration.
Implementation sketch:
import time
startup_t0 = time.perf_counter()
...
model_t0 = time.perf_counter()
await registry.load_from_fusion_repo(settings.HF_FUSION_REPO_ID)
model_dt = time.perf_counter() - model_t0
logger.info(f"Model load duration_seconds={model_dt:.3f}")
...
startup_dt = time.perf_counter() - startup_t0
logger.info(f"Startup total duration_seconds={startup_dt:.3f}")
Phase 4: Use Persistent Storage Cache (/data)
Goal:
- Make /data the primary cache location so model artifacts survive container rebuilds/restarts.
Target files:
- app/core/config.py
- start.sh
- app/services/hf_hub_service.py
Plan:
- Prefer /data-backed cache paths when available:
- HF_HOME=/data/.cache/huggingface
- HF_CACHE_DIR=/data/.hf_cache
- Keep fallback to /app/.hf_cache when /data is unavailable (see the sketch after this list).
- Ensure startup creates/chowns cache directories safely.
- Keep cache-hit logging so verification remains explicit in logs.
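A minimal sketch of the fallback selection described above, e.g. for app/core/config.py (the helper name and the writability probe are assumptions):

import os
from pathlib import Path

def resolve_hf_cache_dir() -> str:
    # Prefer the persistent /data volume so artifacts survive rebuilds/restarts.
    data_cache = Path("/data/.hf_cache")
    try:
        data_cache.mkdir(parents=True, exist_ok=True)
        probe = data_cache / ".write_probe"  # /data may be absent or read-only
        probe.touch()
        probe.unlink()
        return str(data_cache)
    except OSError:
        return "/app/.hf_cache"

HF_CACHE_DIR = resolve_hf_cache_dir()
if HF_CACHE_DIR.startswith("/data"):
    # Point the HF client libraries at the persistent volume as well.
    os.environ.setdefault("HF_HOME", "/data/.cache/huggingface")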
Expected impact:
- Faster warm boots across deploys.
- Lower risk of repeated network fetch for large model files.
Phase 5: Decouple Build From Prefetch
Goal:
- Reduce rebuild penalty from model prefetch while keeping runtime fast.
Target file:
- Dockerfile
Plan:
- Make build-time prefetch optional via ARG/ENV flag.
- Default to skipping build prefetch when persistent /data cache is enabled.
- Keep a one-time warm path at runtime, guarded by a cache/sentinel file in /data (see the sketch after this list).
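For the runtime warm path, a sketch of the sentinel guard (the sentinel filename is an assumption; the load call reuses the Phase 1 prefetch logic). On the Dockerfile side, the existing prefetch RUN step would be gated behind the ARG/ENV flag so rebuilds can skip it when the /data cache is already warm:

import asyncio
from pathlib import Path

from app.core.config import settings
from app.services.model_registry import get_model_registry

# Hypothetical sentinel marking that /data already holds a complete prefetch.
SENTINEL = Path("/data/.hf_cache/.prefetch_complete")

async def warm_cache_once() -> None:
    # Download model repos only on the first boot against an empty /data cache.
    if SENTINEL.exists():
        return
    registry = get_model_registry()
    await registry.load_from_fusion_repo(settings.HF_FUSION_REPO_ID, force_reload=True)
    SENTINEL.parent.mkdir(parents=True, exist_ok=True)
    SENTINEL.touch()

if __name__ == "__main__":
    asyncio.run(warm_cache_once())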
Expected impact:
- Faster image rebuild/push cycles.
- Better developer iteration speed without sacrificing warmed production startup.
Phase 6: Platform/GPU Startup Characterization
Goal:
- Quantify how much of remaining cold start is platform provisioning vs app code.
Plan:
- Run repeated cold starts with identical image on current T4.
- If available, test one higher-tier GPU Space and compare only the Phase 3 markers.
- Record variance in these windows (a computation sketch follows the notes below):
- Application Startup -> module_import_start
- module_import_complete -> startup complete
Notes:
- GPU type can affect model initialization time.
- The measured dominant delay currently occurs before app module import, so platform scheduling/provisioning is likely the bigger lever than model code tuning.
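To summarize run-to-run variance across repeated cold starts, a small computation sketch (the sample durations below are placeholders, not measured values):

import statistics

# Durations in seconds for each window across repeated cold starts (placeholders).
windows = {
    "startup_to_module_import_start": [93.0, 88.5, 97.2],
    "module_import_to_startup_complete": [8.0, 8.3, 7.9],
}

for name, samples in windows.items():
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)  # requires at least two samples
    print(f"{name}: mean={mean:.1f}s stdev={stdev:.1f}s n={len(samples)}")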
Phase 7: Runtime Hygiene (Completed)
Completed changes:
- Set a valid OMP default in start.sh.
- Pin scikit-learn to 1.6.1 for pickle compatibility.
Validation and Benchmark Protocol
Use the same procedure before and after changes.
- Force a cold deployment in HF Space.
- Record these timestamps from logs:
- Build queued time
- Application startup time
- "Starting DeepFake Detector API..." time
- "Models loaded successfully!" time
- "Application startup complete" time
- Compute (a parsing sketch follows this list):
- Queue/build to app startup
- App startup to model-ready
- API model load phase
- Capture per-model load durations from logs.
- Save a comparison table in this file.
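To keep the arithmetic consistent between runs, the segment durations can be derived mechanically. A sketch using the baseline timestamps from this guide (the event keys are shorthand for the log lines listed above):

from datetime import datetime

# Timestamps copied from the Space logs for one cold deploy (HH:MM:SS, same day).
events = {
    "build_queued": "04:23:34",
    "application_startup": "04:24:02",
    "api_start": "04:25:15",      # "Starting DeepFake Detector API..."
    "models_loaded": "04:25:36",  # "Models loaded successfully!"
}

def seconds_between(start: str, end: str) -> int:
    fmt = "%H:%M:%S"
    return int((datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds())

print("Queue/build to app startup:", seconds_between(events["build_queued"], events["application_startup"]), "s")
print("App startup to model-ready:", seconds_between(events["application_startup"], events["models_loaded"]), "s")
print("API model load phase:", seconds_between(events["api_start"], events["models_loaded"]), "s")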
Phase 1 Results From Latest Logs
Source log window:
- Build queued at 2026-04-20 05:04:31
- Application startup begins at 2026-04-20 05:05:07
- Models loaded successfully at 2026-04-20 05:06:46
Phase 1 Timing Summary
| Segment | Start | End | Duration | Notes |
|---|---|---|---|---|
| Queue/build to app startup | 05:04:31 | 05:05:07 | 36s | Includes scheduling, build finalization, image start |
| App startup to model-ready | 05:05:07 | 05:06:46 | 99s | Time from uvicorn start message to models loaded |
| API model load phase | 05:06:41 | 05:06:46 | 5s | From "Starting DeepFake Detector API..." to "Models loaded successfully!" |
Phase 1 Observations
- All Hugging Face repos were served from cache at runtime, confirming the build-time prefetch is working.
- The previous runtime download cost was eliminated from startup.
- The remaining startup time is now dominated by model wrapper initialization and import/init overhead rather than repo downloads.
Phase 2 Results From Latest Logs
Source log window:
- Build queued at 2026-04-20 05:46:19
- Application startup begins at 2026-04-20 05:48:18
- Models loaded successfully at 2026-04-20 05:49:56
Phase 2 Timing Summary
| Segment | Start | End | Duration | Notes |
|---|---|---|---|---|
| Queue/build to app startup | 05:46:19 | 05:48:18 | 119s | Includes scheduling, build finalization, image start |
| App startup to model-ready | 05:48:18 | 05:49:56 | 98s | Time from uvicorn start message to models loaded |
| API model load phase | 05:49:52 | 05:49:56 | 4s | From "Starting DeepFake Detector API..." to "Models loaded successfully!" |
Phase 2 Observations
- Submodel loading now overlaps in runtime logs (bounded parallel local initialization is active).
- Runtime API model load phase improved slightly (5s -> 4s).
- End-to-end startup remained dominated by pre-lifespan/init time (the 98s window dwarfs the 4s model load slice).
- Runtime hygiene warnings no longer appeared in this run (no OMP warning and no sklearn pickle version warning).
Phase 3 Results and Bottleneck Attribution
Source log window:
- Build queued at 2026-04-20 06:01:57
- Application startup begins at 2026-04-20 06:02:56
- Models loaded successfully at 2026-04-20 06:04:37
Phase 3 Timing Summary
| Segment | Start | End | Duration | Notes |
|---|---|---|---|---|
| Queue/build to app startup | 06:01:57 | 06:02:56 | 59s | Includes scheduling, build finalization, image start |
| App startup to model-ready | 06:02:56 | 06:04:37 | 101s | End-to-end startup from Space startup marker |
| API model load phase | 06:04:33 | 06:04:37 | 4s | From app startup handler to models loaded |
Phase 3 Instrumentation Breakdown (Container Runtime)
| Marker | Duration |
|---|---|
| module_import_complete | 4.050s |
| startup_model_load_duration_seconds | 3.967s |
| startup_lifespan_total_duration_seconds | 3.967s |
| load_from_fusion_repo_total_duration_seconds | 3.967s |
Bottleneck Attribution
- Dominant gap is before module import:
- 06:02:56 (Application Startup) -> 06:04:29 (module_import_start) = 93s.
- App code after import is no longer the main problem:
- import + lifespan + model load is about 8s total.
- Conclusion:
- Remaining cold start is primarily platform/container readiness overhead, not model download/load logic.
Comparison Table (Final Column To Be Filled After Remaining Work)
| Metric | Baseline (2026-04-20) | After Phase 1 | After Phase 2 | After Phase 3 | Final |
|---|---|---|---|---|---|
| Queue/build to app startup | 28s | 36s | 119s | 59s | |
| App startup to model-ready | 94s | 99s | 98s | 101s | |
| API model load phase | 21s | 5s | 4s | 4s | |
| vit-base load | 13s | 1s | 2s | 2s | |
| deit-distilled load | 5s | 2s | 2s | 2s | |
| Total visible build timed stages | 20.4s | 28.0s | 112.7s | 33.6s | |
| Phase 3 module import duration | n/a | n/a | n/a | 4.050s | |
| Phase 3 model registry total duration | n/a | n/a | n/a | 3.967s | |
Expected Outcome
Primary expected wins:
- Reduced startup latency by avoiding runtime model downloads.
- Reduced model load wall-clock via parallel submodel loads.
- Stable and comparable timing data for iterative tuning.
Secondary expected wins:
- Cleaner startup logs (no OMP warning).
- Lower risk from sklearn deserialization mismatch.
Rollback Plan
If anything regresses:
- Revert parallel loading only and keep build-time prefetch.
- Revert build-time prefetch and restore runtime download flow.
- Keep instrumentation to retain comparability.
Notes
- This plan intentionally keeps current FastAPI inference architecture unchanged.
- Triton feasibility can be revisited after cold start metrics improve and stabilize.