Update Dockerfile and README.md to set default for DOCKER_BUILD_SKIP_LLM_WARM
- Changed the default value of DOCKER_BUILD_SKIP_LLM_WARM to 1 in the Dockerfile to prevent out-of-memory issues during build.
- Updated README.md to clarify the behavior of the DOCKER_BUILD_SKIP_LLM_WARM variable and its impact on model warming during the build process.
- Dockerfile +1 -1
- README.md +1 -1
- scripts/docker_build_assets.py +10 -5
Dockerfile
CHANGED
@@ -35,7 +35,7 @@ ENV OMP_NUM_THREADS=2 \
 
 ARG HF_TOKEN=
 ARG HUGGING_FACE_HUB_TOKEN=
-ARG DOCKER_BUILD_SKIP_LLM_WARM=
+ARG DOCKER_BUILD_SKIP_LLM_WARM=1
 ENV HF_TOKEN=${HF_TOKEN}
 ENV HUGGING_FACE_HUB_TOKEN=${HUGGING_FACE_HUB_TOKEN}
 ENV DOCKER_BUILD_SKIP_LLM_WARM=${DOCKER_BUILD_SKIP_LLM_WARM}
README.md
CHANGED
@@ -103,7 +103,7 @@ docker compose up --build -d
 
 Default compose maps **`7860:7860`**. The image bakes **`/code/data/business_catalog_embedded.jsonl`** and **`/code/data/task_a_reviews_embedded.jsonl`** at build time (or stubs if Yelp JSON is missing). Override with a bind mount, e.g. `./data:/code/data`, if you rebuild those files locally.
 
-The Docker image sets **`HF_HUB_OFFLINE=1`** and **`TRANSFORMERS_OFFLINE=1`** so the running container does not call the Hugging Face Hub (models must be fully
+The Docker image sets **`HF_HUB_OFFLINE=1`** and **`TRANSFORMERS_OFFLINE=1`** so the running container does not call the Hugging Face Hub (models must be fully present in `/models/huggingface` from **`snapshot_download` during build**). By default **`DOCKER_BUILD_SKIP_LLM_WARM=1`**: the image build does **not** load the causal LM into RAM (avoids **OOM / exit 137** on small HF builders). Weights are still downloaded to disk; **startup prewarm** loads them when the Space starts. Set build-arg **`DOCKER_BUILD_SKIP_LLM_WARM=0`** only on a machine with several GB of spare RAM if you want a full CPU warm during build.
 
 On startup, **`STARTUP_PREWARM`** (default **`user_modeling`**) loads that task’s embedder + optional RAG index + LLM before serving traffic (`all` = Task A and Task B, uses ~2× LLM RAM). Disable with **`SKIP_STARTUP_PREWARM=1`**.
 
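The README change says a full CPU warm is opted into with a build-arg. As a rough sketch of what that invocation looks like, the helper below only assembles the `docker build` argument list and does not run it. The function name `docker_build_command` and the image tag `my-space:local` are hypothetical and not part of this repository; `--build-arg` is the standard Docker CLI flag for setting an `ARG`.

```python
import shlex
from typing import List


def docker_build_command(full_warm: bool = False, tag: str = "my-space:local") -> List[str]:
    """Assemble a `docker build` command line (hypothetical helper, illustration only)."""
    cmd = ["docker", "build", "-t", tag]
    if full_warm:
        # Override the Dockerfile default (ARG DOCKER_BUILD_SKIP_LLM_WARM=1)
        # so the build performs the in-RAM model warm.
        cmd += ["--build-arg", "DOCKER_BUILD_SKIP_LLM_WARM=0"]
    cmd.append(".")  # build context: current directory
    return cmd


if __name__ == "__main__":
    # Print both variants as shell-quoted strings for copy and paste.
    print(shlex.join(docker_build_command()))
    print(shlex.join(docker_build_command(full_warm=True)))
```

With the new default, the plain command needs no extra arguments; the build-arg is only required on a builder with enough spare RAM for the warm.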
scripts/docker_build_assets.py
CHANGED
@@ -77,7 +77,16 @@ def prefetch_hub_files_only() -> None:
 
 
 def warm_runtime_models() -> None:
-
+    raw = os.environ.get("DOCKER_BUILD_SKIP_LLM_WARM", "1").strip().lower()
+    skip = raw not in ("0", "false", "no")
+    if skip:
+        print(
+            "docker_build_assets: skipping in-RAM LLM warm (DOCKER_BUILD_SKIP_LLM_WARM default 1). "
+            "Weights are on disk from snapshot_download + stub encodes; uvicorn prewarm loads them at runtime."
+        )
+        return
+
+    print("docker_build_assets: full model warm (CPU) — DOCKER_BUILD_SKIP_LLM_WARM=0; needs several GB RAM.")
     import gc
 
     emb_key = os.environ.get("TASK_B_LOCAL_EMBEDDING_MODEL", "all-MiniLM-L6-v2")
@@ -88,10 +97,6 @@ def warm_runtime_models() -> None:
     del st
     gc.collect()
 
-    if os.environ.get("DOCKER_BUILD_SKIP_LLM_WARM", "").strip().lower() in ("1", "true", "yes"):
-        print("docker_build_assets: DOCKER_BUILD_SKIP_LLM_WARM set — skipping causal LM warm.")
-        return
-
     import torch  # type: ignore[import-untyped]
     from transformers import AutoModelForCausalLM, AutoTokenizer  # type: ignore[import-untyped]
 
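The script change flips the flag from opt-in to opt-out: previously the warm was skipped only for an explicit truthy value, and now it is skipped unless the value is explicitly falsy. A minimal sketch of the two rules, restated from the conditions in the diff as standalone functions; the names `old_skip` and `new_skip` are illustrative and do not exist in `scripts/docker_build_assets.py`, and `None` stands in for an unset environment variable.

```python
from typing import Optional


def old_skip(raw: Optional[str]) -> bool:
    """Pre-change rule: skip only when the flag is explicitly truthy."""
    value = "" if raw is None else raw  # old default was the empty string
    return value.strip().lower() in ("1", "true", "yes")


def new_skip(raw: Optional[str]) -> bool:
    """Post-change rule: skip unless the flag is explicitly falsy."""
    value = "1" if raw is None else raw  # new default is "1"
    return value.strip().lower() not in ("0", "false", "no")


if __name__ == "__main__":
    # Compare both rules over a few representative flag values.
    for raw in (None, "", "1", "yes", "0", "false", "maybe"):
        print(f"{raw!r:>8}  old_skip={old_skip(raw)}  new_skip={new_skip(raw)}")
```

One consequence visible in the comparison: an unset, empty, or unrecognized value used to trigger the warm and now skips it, so only `0`, `false`, or `no` requests the full warm. In the diff the new check also sits at the top of `warm_runtime_models`, ahead of the embedding-model code, rather than immediately before the `torch` and `transformers` imports.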