Commit 6b080ac (verified) by satya007 · 1 parent: e295d89

Add no-weights Docker image build path

.dockerignore ADDED
@@ -0,0 +1,8 @@
+.cache/
+patches/
+results/
+HANDOFF.md
+*.log
+__pycache__/
+**/__pycache__/
+*.pyc
.hfignore CHANGED
@@ -1,3 +1,5 @@
 HANDOFF.md
 patches/**
 .cache/**
+__pycache__/**
+*.pyc
README.md CHANGED
@@ -25,6 +25,8 @@ This is an experimental reproducibility release, not a production-ready model. I
 - `scripts/setup_repro_from_hf.sh`: one-command setup for a new machine.
 - `scripts/serve_phase2_eagle.sh`: OpenAI-compatible vLLM server launcher.
 - `scripts/bench_tokens_sec_phase2_eagle.sh`: smoke/benchmark runner.
+- `scripts/build_docker_image.sh`: builds a no-weights runtime image.
+- `docker/`: Dockerfile and entrypoint for the no-weights runtime image.
 - `scripts/test_triton_codebook_match.py`: isolated kernel equivalence harness.
 - `scripts/measure_kv_cache_compression.py`: live KV-cache measurement helper.
 - `results/`: selected validation outputs.
@@ -55,6 +57,76 @@ export HF_TOKEN=...
 
 Do not bake tokens into Docker images or committed files.
 
+## No-Weights Docker Image
+
+This is the simplest hosting path if you are willing to build an image. The image bakes in:
+
+```text
+vLLM Spectral fork at 008dd7f87fb9de185e536ad30b4d524024ed9b9f
+GemmaCut launcher entrypoint
+Spectral sidecar artifacts/spectral_sidecar_chat_v2.pt
+git/cmake/ninja build tools for inspection and follow-up work
+```
+
+It does **not** bake in model weights. `Intel/gemma-4-31B-it-int4-AutoRound` and `RedHatAI/gemma-4-31B-it-speculator.eagle3` are downloaded at runtime into the mounted Hugging Face cache.
+
+Build:
+
+```bash
+hf download satya007/gemmacut-spectral \
+  .dockerignore \
+  docker/Dockerfile \
+  docker/entrypoint.sh \
+  docker/download_sidecar.py \
+  scripts/build_docker_image.sh \
+  --local-dir ./gemmacut-spectral-image
+
+cd ./gemmacut-spectral-image
+chmod +x ./scripts/build_docker_image.sh
+IMAGE=gemmacut-spectral:008dd7f87 ./scripts/build_docker_image.sh
+```
+
+Smoke:
+
+```bash
+mkdir -p "$PWD/hf-cache" "$PWD/results"
+
+docker run --rm --gpus all --ipc=host \
+  -e HF_TOKEN \
+  -v "$PWD/hf-cache:/root/.cache/huggingface" \
+  -v "$PWD/results:/workspace/results_bench" \
+  gemmacut-spectral:008dd7f87 smoke
+```
+
+Serve:
+
+```bash
+docker run --rm --gpus all --ipc=host \
+  -p 8000:8000 \
+  -e HF_TOKEN \
+  -e MAX_MODEL_LEN=512 \
+  -e MAX_NUM_BATCHED_TOKENS=512 \
+  -e MAX_NUM_SEQS=2 \
+  -e GPU_MEMORY_UTILIZATION=0.8 \
+  -v "$PWD/hf-cache:/root/.cache/huggingface" \
+  gemmacut-spectral:008dd7f87 serve
+```
+
+Optional: build without the sidecar and mount it yourself.
+
+```bash
+IMAGE=gemmacut-spectral:008dd7f87-nosidecar \
+  ./scripts/build_docker_image.sh --build-arg INCLUDE_SIDECAR=0
+
+docker run --rm --gpus all --ipc=host \
+  -p 8000:8000 \
+  -e HF_TOKEN \
+  -e SPECTRAL_SIDECAR=/workspace/spectral_sidecar_chat_v2.pt \
+  -v "$PWD/hf-cache:/root/.cache/huggingface" \
+  -v "$PWD/spectral_sidecar_chat_v2.pt:/workspace/spectral_sidecar_chat_v2.pt:ro" \
+  gemmacut-spectral:008dd7f87-nosidecar serve
+```
+
 ## One-Command Setup
 
 Pick a host directory. The setup script creates this layout:
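
Note on the `Serve` example in the new README section: once the container is listening on port 8000, the OpenAI-compatible endpoint can be exercised with the Python standard library alone. The following is a minimal sketch, assuming the default served model name `gemmacut-spectral` and the address `http://127.0.0.1:8000`. The helper names `build_chat_request` and `ask` are hypothetical and not part of the repository; the payload shape mirrors the smoke client in `docker/entrypoint.sh`.

```python
import json
import urllib.request


def build_chat_request(prompt, model="gemmacut-spectral",
                       base_url="http://127.0.0.1:8000"):
    """Build a POST request for /v1/chat/completions (hypothetical helper)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 16,
        "temperature": 0,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, **kwargs):
    """Send the prompt to a running server and return the reply text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        data = json.load(response)
    return data["choices"][0]["message"]["content"].strip()
```

With the server running, `ask("What is 2+2? Answer with just the number.")` should return the model's reply as a string. Building the request separately from sending it keeps the payload easy to inspect without a GPU host.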
docker/Dockerfile ADDED
@@ -0,0 +1,56 @@
+ARG BASE_IMAGE=vllm/vllm-openai:gemma4-cu130
+FROM ${BASE_IMAGE}
+
+ARG VLLM_REPO=https://github.com/bluecopa/vllm-spectral.git
+ARG VLLM_BRANCH=spectral-codebook-docker
+ARG VLLM_COMMIT=008dd7f87fb9de185e536ad30b4d524024ed9b9f
+ARG HF_REPO_ID=satya007/gemmacut-spectral
+ARG SIDECAR_SHA256=e47a36c13467cbedf720e7f782b976df3dcda2d989c727113a8315008661a3e4
+ARG INCLUDE_SIDECAR=1
+
+LABEL org.opencontainers.image.title="gemmacut-spectral"
+LABEL org.opencontainers.image.description="GemmaCut SpectralQuant Phase 2 + Eagle3 vLLM runtime; model weights are not baked into the image."
+LABEL org.opencontainers.image.source="https://github.com/bluecopa/vllm-spectral"
+LABEL org.opencontainers.image.revision="${VLLM_COMMIT}"
+
+ENV VLLM_SOURCE=/opt/vllm-spectral \
+    GEMMACUT_HOME=/opt/gemmacut \
+    SPECTRAL_SIDECAR=/opt/gemmacut/artifacts/spectral_sidecar_chat_v2.pt \
+    HF_HUB_DISABLE_XET=1 \
+    SPECTRAL_TRITON_COMPRESS=1 \
+    SPECTRAL_TRITON_DEQUANT=1 \
+    SPECTRAL_CUDA_GRAPH=1 \
+    SPECTRAL_VERIFY=0 \
+    DISABLE_HYBRID_KV_CACHE_MANAGER=0
+
+SHELL ["/bin/bash", "-o", "pipefail", "-c"]
+
+RUN apt-get update && \
+    apt-get install -y --no-install-recommends \
+        ca-certificates \
+        cmake \
+        git \
+        ninja-build && \
+    rm -rf /var/lib/apt/lists/*
+
+RUN git clone --branch "${VLLM_BRANCH}" "${VLLM_REPO}" "${VLLM_SOURCE}" && \
+    git -C "${VLLM_SOURCE}" checkout "${VLLM_COMMIT}" && \
+    git -C "${VLLM_SOURCE}" log --oneline -1
+
+COPY docker/download_sidecar.py /tmp/download_sidecar.py
+RUN mkdir -p "${GEMMACUT_HOME}/artifacts" && \
+    if [[ "${INCLUDE_SIDECAR}" == "1" ]]; then \
+        HF_REPO_ID="${HF_REPO_ID}" \
+        SIDECAR_SHA256="${SIDECAR_SHA256}" \
+        python3 /tmp/download_sidecar.py; \
+    else \
+        echo "INCLUDE_SIDECAR=0; mount or set SPECTRAL_SIDECAR at runtime"; \
+    fi && \
+    rm -f /tmp/download_sidecar.py
+
+COPY docker/entrypoint.sh /usr/local/bin/gemmacut-spectral
+RUN chmod +x /usr/local/bin/gemmacut-spectral
+
+EXPOSE 8000
+ENTRYPOINT ["gemmacut-spectral"]
+CMD ["serve"]
docker/download_sidecar.py ADDED
@@ -0,0 +1,20 @@
+import hashlib
+import os
+import shutil
+
+from huggingface_hub import hf_hub_download
+
+
+repo_id = os.environ["HF_REPO_ID"]
+expected = os.environ["SIDECAR_SHA256"]
+target = "/opt/gemmacut/artifacts/spectral_sidecar_chat_v2.pt"
+path = hf_hub_download(
+    repo_id=repo_id,
+    filename="artifacts/spectral_sidecar_chat_v2.pt",
+    repo_type="model",
+)
+shutil.copyfile(path, target)
+actual = hashlib.sha256(open(target, "rb").read()).hexdigest()
+if actual != expected:
+    raise SystemExit(f"sidecar sha256 mismatch: expected {expected}, got {actual}")
+print(f"sidecar ready: {target} sha256={actual}")
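
Note on `docker/download_sidecar.py`: the script hashes the sidecar by reading the whole file into memory and never closes the file handle explicitly. That is fine for a build step, but a chunked hash is a possible alternative if the artifact grows. The following is a minimal sketch, not the committed code; `sha256_of_file` and its `chunk_size` parameter are hypothetical names.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 of a file, reading it in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The result is the same digest that `hashlib.sha256(data).hexdigest()` produces for the full contents, so it could be compared against `SIDECAR_SHA256` in the same way.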
docker/entrypoint.sh ADDED
@@ -0,0 +1,255 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+COMMAND="${1:-serve}"
+if [ "$#" -gt 0 ]; then
+  shift
+fi
+
+MODEL="${MODEL:-Intel/gemma-4-31B-it-int4-AutoRound}"
+DRAFT="${DRAFT:-RedHatAI/gemma-4-31B-it-speculator.eagle3}"
+SERVED_MODEL_NAME="${SERVED_MODEL_NAME:-gemmacut-spectral}"
+SPECTRAL_SIDECAR="${SPECTRAL_SIDECAR:-/opt/gemmacut/artifacts/spectral_sidecar_chat_v2.pt}"
+VLLM_SOURCE="${VLLM_SOURCE:-/opt/vllm-spectral}"
+PORT="${PORT:-8000}"
+MAX_MODEL_LEN="${MAX_MODEL_LEN:-512}"
+MAX_NUM_BATCHED_TOKENS="${MAX_NUM_BATCHED_TOKENS:-512}"
+MAX_NUM_SEQS="${MAX_NUM_SEQS:-2}"
+GPU_MEMORY_UTILIZATION="${GPU_MEMORY_UTILIZATION:-0.8}"
+NUM_SPEC_TOKENS="${NUM_SPEC_TOKENS:-3}"
+SPECTRAL_CUDA_GRAPH="${SPECTRAL_CUDA_GRAPH:-1}"
+VLLM_LOGGING_LEVEL="${VLLM_LOGGING_LEVEL:-INFO}"
+DISABLE_HYBRID_KV_CACHE_MANAGER="${DISABLE_HYBRID_KV_CACHE_MANAGER:-0}"
+RESULTS_ROOT="${RESULTS_ROOT:-/workspace/results_bench}"
+
+export VLLM_LOGGING_LEVEL
+export SPECTRAL_CUDA_GRAPH
+export SPECTRAL_TRITON_COMPRESS="${SPECTRAL_TRITON_COMPRESS:-1}"
+export SPECTRAL_TRITON_DEQUANT="${SPECTRAL_TRITON_DEQUANT:-1}"
+export SPECTRAL_VERIFY="${SPECTRAL_VERIFY:-0}"
+export HF_HUB_DISABLE_XET="${HF_HUB_DISABLE_XET:-1}"
+unset SPECTRAL_SHARED_ALLOC
+
+if [ "${HF_HUB_OFFLINE:-0}" = "1" ]; then
+  export HF_HUB_OFFLINE=1
+else
+  unset HF_HUB_OFFLINE
+fi
+
+prepare_overlay() {
+  local run_src="${SPECTRAL_RUN_SRC:-/tmp/vllm-spectral-run}"
+  local site
+
+  if [ ! -d "$VLLM_SOURCE" ]; then
+    echo "Missing VLLM_SOURCE: $VLLM_SOURCE" >&2
+    exit 1
+  fi
+  if [ ! -f "$SPECTRAL_SIDECAR" ]; then
+    echo "Missing SPECTRAL_SIDECAR: $SPECTRAL_SIDECAR" >&2
+    exit 1
+  fi
+
+  site="$(python3 - <<'PY'
+import pathlib
+import vllm
+print(pathlib.Path(vllm.__file__).resolve().parent)
+PY
+)"
+
+  rm -rf "$run_src"
+  cp -a "$VLLM_SOURCE" "$run_src"
+
+  shopt -s nullglob
+  for f in "$site"/_C*.so "$site"/_moe_C*.so "$site"/_flashmla*.so "$site"/cumem_allocator*.so; do
+    ln -sf "$f" "$run_src/vllm/"
+  done
+  mkdir -p "$run_src/vllm/vllm_flash_attn"
+  for f in "$site"/vllm_flash_attn/_vllm_fa2_C*.so "$site"/vllm_flash_attn/_vllm_fa3_C*.so; do
+    ln -sf "$f" "$run_src/vllm/vllm_flash_attn/"
+  done
+  ln -sfn "$site/vllm_flash_attn/cute" "$run_src/vllm/vllm_flash_attn/cute"
+  ln -sfn "$site/vllm_flash_attn/layers" "$run_src/vllm/vllm_flash_attn/layers"
+  mkdir -p "$run_src/vllm/third_party" "$run_src/vllm/third_party/flashmla"
+  ln -sfn "$site/third_party/triton_kernels" "$run_src/vllm/third_party/triton_kernels"
+  ln -sf "$site/third_party/flashmla/flash_mla_interface.py" "$run_src/vllm/third_party/flashmla/"
+  shopt -u nullglob
+
+  export PYTHONPATH="$run_src:$run_src/vllm/third_party${PYTHONPATH:+:$PYTHONPATH}"
+}
+
+server_args() {
+  local args=(
+    --host "${HOST:-0.0.0.0}"
+    --port "$PORT"
+    --model "$MODEL"
+    --served-model-name "$SERVED_MODEL_NAME"
+    --spectral-calibration "$SPECTRAL_SIDECAR"
+    --spectral-quantize
+    --kv-cache-dtype fp8_e4m3
+    --max-model-len "$MAX_MODEL_LEN"
+    --max-num-batched-tokens "$MAX_NUM_BATCHED_TOKENS"
+    --max-num-seqs "$MAX_NUM_SEQS"
+    --gpu-memory-utilization "$GPU_MEMORY_UTILIZATION"
+    --compilation-config "{\"compile_sizes\": []}"
+    --speculative-config "{\"model\":\"$DRAFT\",\"num_speculative_tokens\":$NUM_SPEC_TOKENS,\"method\":\"eagle3\"}"
+  )
+  if [ "$DISABLE_HYBRID_KV_CACHE_MANAGER" = "1" ]; then
+    args+=(--disable-hybrid-kv-cache-manager)
+  fi
+  printf '%s\0' "${args[@]}"
+}
+
+run_server() {
+  prepare_overlay
+  local args=()
+  while IFS= read -r -d '' item; do
+    args+=("$item")
+  done < <(server_args)
+  exec python3 -m vllm.entrypoints.openai.api_server "${args[@]}" "$@"
+}
+
+wait_for_server() {
+  python3 - <<PY
+import os
+import sys
+import time
+import urllib.request
+
+pid = int(os.environ["SERVER_PID"])
+port = int(os.environ["PORT"])
+deadline = time.time() + int(os.environ.get("SERVER_TIMEOUT", "300"))
+url = f"http://127.0.0.1:{port}/v1/models"
+while time.time() < deadline:
+    try:
+        os.kill(pid, 0)
+    except OSError:
+        raise SystemExit("server exited early")
+    try:
+        with urllib.request.urlopen(url, timeout=2) as response:
+            if response.status == 200:
+                print("SERVER_READY", flush=True)
+                raise SystemExit(0)
+    except Exception:
+        time.sleep(1)
+raise SystemExit("server did not become ready")
+PY
+}
+
+start_background_server() {
+  prepare_overlay
+  local args=()
+  HOST=127.0.0.1
+  export HOST
+  while IFS= read -r -d '' item; do
+    args+=("$item")
+  done < <(server_args)
+  python3 -m vllm.entrypoints.openai.api_server "${args[@]}" > "$SERVER_LOG" 2>&1 &
+  SERVER_PID=$!
+  export SERVER_PID PORT
+  trap 'kill "$SERVER_PID" >/dev/null 2>&1 || true; wait "$SERVER_PID" >/dev/null 2>&1 || true' EXIT
+  wait_for_server
+}
+
+run_smoke_client() {
+  python3 - <<PY
+import json
+import urllib.request
+
+model = "${SERVED_MODEL_NAME}"
+url = "http://127.0.0.1:${PORT}/v1/chat/completions"
+checks = [
+    ("What is 2+2? Answer with just the number.", "4"),
+    ("Paris is the capital of which country? Answer with one word.", "France"),
+]
+
+for prompt, expected in checks:
+    payload = {
+        "model": model,
+        "messages": [{"role": "user", "content": prompt}],
+        "max_tokens": 16,
+        "temperature": 0,
+    }
+    request = urllib.request.Request(
+        url,
+        data=json.dumps(payload).encode("utf-8"),
+        headers={"Content-Type": "application/json"},
+        method="POST",
+    )
+    with urllib.request.urlopen(request, timeout=120) as response:
+        data = json.load(response)
+    text = data["choices"][0]["message"]["content"].strip()
+    print(f"{prompt} => {text}", flush=True)
+    if expected.lower() not in text.lower():
+        raise SystemExit(
+            f"semantic smoke failed: expected {expected!r} in response {text!r}")
+
+print("SMOKE_PROMPTS_OK", flush=True)
+PY
+}
+
+run_smoke() {
+  RUN_ID="${RUN_ID:-smoke_$(date +%Y%m%d_%H%M%S)}"
+  OUT="${RESULTS_DIR:-$RESULTS_ROOT/$RUN_ID}"
+  mkdir -p "$OUT"
+  SERVER_LOG="$OUT/server.log"
+  start_background_server
+  run_smoke_client | tee "$OUT/smoke_outputs.txt"
+  echo "SMOKE_OUT=$OUT"
+}
+
+run_bench() {
+  RUN_ID="${RUN_ID:-tokens_sec_phase2_eagle_$(date +%Y%m%d_%H%M%S)}"
+  OUT="${RESULTS_DIR:-$RESULTS_ROOT/$RUN_ID}"
+  mkdir -p "$OUT"
+  SERVER_LOG="$OUT/server.log"
+  start_background_server
+
+  if [ "${RUN_SMOKE:-0}" = "1" ]; then
+    run_smoke_client | tee "$OUT/smoke_outputs.txt"
+  fi
+  if [ "${SMOKE_ONLY:-0}" = "1" ]; then
+    echo "SMOKE_ONLY=1; skipping benchmark"
+    echo "BENCH_OUT=$OUT"
+    exit 0
+  fi
+
+  python3 -m vllm.entrypoints.cli.main bench serve \
+    --backend openai-chat \
+    --base-url "http://127.0.0.1:$PORT" \
+    --endpoint /v1/chat/completions \
+    --model "$SERVED_MODEL_NAME" \
+    --tokenizer "$MODEL" \
+    --dataset-name random \
+    --random-input-len "${INPUT_LEN:-128}" \
+    --random-output-len "${OUTPUT_LEN:-32}" \
+    --num-prompts "${NUM_PROMPTS:-8}" \
+    --num-warmups "${NUM_WARMUPS:-1}" \
+    --request-rate "${REQUEST_RATE:-inf}" \
+    --temperature 0 \
+    --ignore-eos \
+    --disable-tqdm \
+    --save-result \
+    --result-dir "$OUT" \
+    --result-filename bench.json \
+    2>&1 | tee "$OUT/bench.log"
+
+  echo "BENCH_OUT=$OUT"
+}
+
+case "$COMMAND" in
+  serve)
+    run_server "$@"
+    ;;
+  smoke)
+    run_smoke
+    ;;
+  bench)
+    run_bench
+    ;;
+  bash|sh)
+    exec "$COMMAND" "$@"
+    ;;
+  *)
+    exec "$COMMAND" "$@"
+    ;;
+esac
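
Note on `server_args` in `docker/entrypoint.sh`: the `--compilation-config` and `--speculative-config` values are hand-escaped JSON strings, which would break if `DRAFT` ever contained a quote or backslash. The following sketch shows the same argument list assembled with `json.dumps`. It is an illustration under stated assumptions, not the committed launcher: `build_server_args` is a hypothetical name, and the flags and defaults are copied from the script above.

```python
import json


def build_server_args(env):
    """Return the vLLM server arguments for an environment mapping (sketch)."""
    get = env.get
    speculative = {
        "model": get("DRAFT", "RedHatAI/gemma-4-31B-it-speculator.eagle3"),
        "num_speculative_tokens": int(get("NUM_SPEC_TOKENS", "3")),
        "method": "eagle3",
    }
    args = [
        "--host", get("HOST", "0.0.0.0"),
        "--port", get("PORT", "8000"),
        "--model", get("MODEL", "Intel/gemma-4-31B-it-int4-AutoRound"),
        "--served-model-name", get("SERVED_MODEL_NAME", "gemmacut-spectral"),
        "--spectral-calibration",
        get("SPECTRAL_SIDECAR",
            "/opt/gemmacut/artifacts/spectral_sidecar_chat_v2.pt"),
        "--spectral-quantize",
        "--kv-cache-dtype", "fp8_e4m3",
        "--max-model-len", get("MAX_MODEL_LEN", "512"),
        "--max-num-batched-tokens", get("MAX_NUM_BATCHED_TOKENS", "512"),
        "--max-num-seqs", get("MAX_NUM_SEQS", "2"),
        "--gpu-memory-utilization", get("GPU_MEMORY_UTILIZATION", "0.8"),
        # json.dumps handles quoting, so no manual backslash escapes are needed.
        "--compilation-config", json.dumps({"compile_sizes": []}),
        "--speculative-config", json.dumps(speculative),
    ]
    if get("DISABLE_HYBRID_KV_CACHE_MANAGER", "0") == "1":
        args.append("--disable-hybrid-kv-cache-manager")
    return args
```

Passing `os.environ` as `env` would reproduce the defaults the shell script applies, and the returned list can be handed to `subprocess` without any shell quoting.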
manifest.json CHANGED
@@ -23,9 +23,18 @@
     "scripts/setup_repro_from_hf.sh",
     "scripts/serve_phase2_eagle.sh",
     "scripts/bench_tokens_sec_phase2_eagle.sh",
+    "scripts/build_docker_image.sh",
     "scripts/test_triton_codebook_match.py",
     "scripts/measure_kv_cache_compression.py"
   ],
+  "docker_image_build": {
+    "dockerfile": "docker/Dockerfile",
+    "entrypoint": "docker/entrypoint.sh",
+    "downloads_model_weights_at_runtime": true,
+    "includes_sidecar_by_default": true,
+    "optional_no_sidecar_build_arg": "INCLUDE_SIDECAR=0",
+    "default_image_tag": "gemmacut-spectral:008dd7f87"
+  },
   "recommended_runtime_env": {
     "SPECTRAL_CUDA_GRAPH": "1",
     "SPECTRAL_TRITON_COMPRESS": "1",
scripts/build_docker_image.sh ADDED
@@ -0,0 +1,17 @@
+#!/usr/bin/env bash
+# Build the no-weights GemmaCut SpectralQuant runtime image.
+
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+BUNDLE_DIR="$(cd "$SCRIPT_DIR/.." && pwd)"
+
+IMAGE="${IMAGE:-gemmacut-spectral:008dd7f87}"
+
+docker build \
+  -f "$BUNDLE_DIR/docker/Dockerfile" \
+  -t "$IMAGE" \
+  "$@" \
+  "$BUNDLE_DIR"
+
+echo "Built $IMAGE"
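
Note on `scripts/build_docker_image.sh`: extra arguments such as `--build-arg INCLUDE_SIDECAR=0` are forwarded via `"$@"` and land between the `-t` tag and the build context. The sketch below spells out that ordering as a plain argument list. `docker_build_command` is a hypothetical helper for illustration only, and `/tmp/bundle` is an assumed path.

```python
import os
import shlex


def docker_build_command(bundle_dir, image="gemmacut-spectral:008dd7f87",
                         extra_args=()):
    """Assemble the docker build invocation the script would run (sketch)."""
    return [
        "docker", "build",
        "-f", os.path.join(bundle_dir, "docker", "Dockerfile"),
        "-t", image,
        *extra_args,   # forwarded flags, e.g. --build-arg INCLUDE_SIDECAR=0
        bundle_dir,    # the build context always comes last
    ]


cmd = docker_build_command(
    "/tmp/bundle",
    image="gemmacut-spectral:008dd7f87-nosidecar",
    extra_args=["--build-arg", "INCLUDE_SIDECAR=0"],
)
print(shlex.join(cmd))
```

Because the context is the bundle root, the `COPY docker/...` paths in the Dockerfile resolve relative to that directory, and `.dockerignore` at the root applies to the build.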