Commit 3cdb77e by nexusbert · 1 Parent(s): 10bc91f

Update Dockerfile and README.md to set default for DOCKER_BUILD_SKIP_LLM_WARM

- Changed the default value of DOCKER_BUILD_SKIP_LLM_WARM to 1 in the Dockerfile to prevent out-of-memory issues during build.
- Updated README.md to clarify the behavior of the DOCKER_BUILD_SKIP_LLM_WARM variable and its impact on model warming during the build process.

Files changed (3)
  1. Dockerfile +1 -1
  2. README.md +1 -1
  3. scripts/docker_build_assets.py +10 -5
Dockerfile CHANGED
@@ -35,7 +35,7 @@ ENV OMP_NUM_THREADS=2 \
 
 ARG HF_TOKEN=
 ARG HUGGING_FACE_HUB_TOKEN=
-ARG DOCKER_BUILD_SKIP_LLM_WARM=
+ARG DOCKER_BUILD_SKIP_LLM_WARM=1
 ENV HF_TOKEN=${HF_TOKEN}
 ENV HUGGING_FACE_HUB_TOKEN=${HUGGING_FACE_HUB_TOKEN}
 ENV DOCKER_BUILD_SKIP_LLM_WARM=${DOCKER_BUILD_SKIP_LLM_WARM}
README.md CHANGED
@@ -103,7 +103,7 @@ docker compose up --build -d
 
 Default compose maps **`7860:7860`**. The image bakes **`/code/data/business_catalog_embedded.jsonl`** and **`/code/data/task_a_reviews_embedded.jsonl`** at build time (or stubs if Yelp JSON is missing). Override with a bind mount, e.g. `./data:/code/data`, if you rebuild those files locally.
 
-The Docker image sets **`HF_HUB_OFFLINE=1`** and **`TRANSFORMERS_OFFLINE=1`** so the running container does not call the Hugging Face Hub (models must be fully cached during `docker build`). `scripts/docker_build_assets.py` runs **`warm_runtime_models()`** after data JSONL: one SentenceTransformer forward and one causal LM forward on CPU (set build-arg **`DOCKER_BUILD_SKIP_LLM_WARM=1`** if the builder OOMs).
+The Docker image sets **`HF_HUB_OFFLINE=1`** and **`TRANSFORMERS_OFFLINE=1`** so the running container does not call the Hugging Face Hub (models must be fully present in `/models/huggingface` from **`snapshot_download` during build**). By default **`DOCKER_BUILD_SKIP_LLM_WARM=1`**: the image build does **not** load the causal LM into RAM (avoids **OOM / exit 137** on small HF builders). Weights are still downloaded to disk; **startup prewarm** loads them when the Space starts. Set build-arg **`DOCKER_BUILD_SKIP_LLM_WARM=0`** only on a machine with several GB of spare RAM if you want a full CPU warm during build.
 
 On startup, **`STARTUP_PREWARM`** (default **`user_modeling`**) loads that task’s embedder + optional RAG index + LLM before serving traffic (`all` = Task A and Task B, uses ~2× LLM RAM). Disable with **`SKIP_STARTUP_PREWARM=1`**.
 
scripts/docker_build_assets.py CHANGED
@@ -77,7 +77,16 @@ def prefetch_hub_files_only() -> None:
 
 
 def warm_runtime_models() -> None:
-    print("docker_build_assets: warming models for runtime (CPU, one forward each)...")
+    raw = os.environ.get("DOCKER_BUILD_SKIP_LLM_WARM", "1").strip().lower()
+    skip = raw not in ("0", "false", "no")
+    if skip:
+        print(
+            "docker_build_assets: skipping in-RAM LLM warm (DOCKER_BUILD_SKIP_LLM_WARM default 1). "
+            "Weights are on disk from snapshot_download + stub encodes; uvicorn prewarm loads them at runtime."
+        )
+        return
+
+    print("docker_build_assets: full model warm (CPU) — DOCKER_BUILD_SKIP_LLM_WARM=0; needs several GB RAM.")
     import gc
 
     emb_key = os.environ.get("TASK_B_LOCAL_EMBEDDING_MODEL", "all-MiniLM-L6-v2")
@@ -88,10 +97,6 @@ def warm_runtime_models() -> None:
     del st
     gc.collect()
 
-    if os.environ.get("DOCKER_BUILD_SKIP_LLM_WARM", "").strip().lower() in ("1", "true", "yes"):
-        print("docker_build_assets: DOCKER_BUILD_SKIP_LLM_WARM set — skipping causal LM warm.")
-        return
-
     import torch  # type: ignore[import-untyped]
     from transformers import AutoModelForCausalLM, AutoTokenizer  # type: ignore[import-untyped]
 
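
The change in `warm_runtime_models()` flips the flag from opt-in to opt-out, which also changes how unset, empty, and unrecognized values are read. The sketch below compares the two rules side by side. The helper names `skip_old` and `skip_new` are hypothetical and do not exist in the repository. They take the raw value as an argument (`None` meaning unset) instead of reading `os.environ`, but the membership checks are copied from the diff above.

```python
from typing import Optional


def skip_old(value: Optional[str]) -> bool:
    """Pre-commit rule: skip only when the value is explicitly truthy."""
    # Hypothetical helper; mirrors os.environ.get(..., "") from the removed lines.
    raw = ("" if value is None else value).strip().lower()
    return raw in ("1", "true", "yes")


def skip_new(value: Optional[str]) -> bool:
    """Post-commit rule: skip unless the value is explicitly falsy."""
    # Hypothetical helper; mirrors os.environ.get(..., "1") from the added lines.
    raw = ("1" if value is None else value).strip().lower()
    return raw not in ("0", "false", "no")


if __name__ == "__main__":
    # Print both interpretations for a handful of representative values.
    for value in (None, "", "1", "yes", "0", "false", "off"):
        print(f"{value!r:>8}  old skip={skip_old(value)}  new skip={skip_new(value)}")
```

The two rules agree on the explicit values (`1`, `yes`, `0`, `false`) and disagree on everything else: unset, empty, and unrecognized strings such as `off` used to run the warm and now skip it. The scope of the skip also differs in the diff itself: the removed check sat after the embedder warm and skipped only the causal LM, while the added check returns at the top of the function.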