Add no-weights Docker image build path
- .dockerignore +8 -0
- .hfignore +2 -0
- README.md +72 -0
- docker/Dockerfile +56 -0
- docker/download_sidecar.py +20 -0
- docker/entrypoint.sh +255 -0
- manifest.json +9 -0
- scripts/build_docker_image.sh +17 -0
.dockerignore
ADDED
|
@@ -0,0 +1,8 @@
|
| 1 |
+
.cache/
|
| 2 |
+
patches/
|
| 3 |
+
results/
|
| 4 |
+
HANDOFF.md
|
| 5 |
+
*.log
|
| 6 |
+
__pycache__/
|
| 7 |
+
**/__pycache__/
|
| 8 |
+
*.pyc
|
.hfignore
CHANGED
|
@@ -1,3 +1,5 @@
|
|
| 1 |
HANDOFF.md
|
| 2 |
patches/**
|
| 3 |
.cache/**
|
| 1 |
HANDOFF.md
|
| 2 |
patches/**
|
| 3 |
.cache/**
|
| 4 |
+
__pycache__/**
|
| 5 |
+
*.pyc
|
README.md
CHANGED
|
@@ -25,6 +25,8 @@ This is an experimental reproducibility release, not a production-ready model. I
|
|
| 25 |
- `scripts/setup_repro_from_hf.sh`: one-command setup for a new machine.
|
| 26 |
- `scripts/serve_phase2_eagle.sh`: OpenAI-compatible vLLM server launcher.
|
| 27 |
- `scripts/bench_tokens_sec_phase2_eagle.sh`: smoke/benchmark runner.
|
| 28 |
- `scripts/test_triton_codebook_match.py`: isolated kernel equivalence harness.
|
| 29 |
- `scripts/measure_kv_cache_compression.py`: live KV-cache measurement helper.
|
| 30 |
- `results/`: selected validation outputs.
|
|
@@ -55,6 +57,76 @@ export HF_TOKEN=...
|
|
| 55 |
|
| 56 |
Do not bake tokens into Docker images or committed files.
|
| 57 |
|
| 58 |
## One-Command Setup
|
| 59 |
|
| 60 |
Pick a host directory. The setup script creates this layout:
|
|
|
|
| 25 |
- `scripts/setup_repro_from_hf.sh`: one-command setup for a new machine.
|
| 26 |
- `scripts/serve_phase2_eagle.sh`: OpenAI-compatible vLLM server launcher.
|
| 27 |
- `scripts/bench_tokens_sec_phase2_eagle.sh`: smoke/benchmark runner.
|
| 28 |
+
- `scripts/build_docker_image.sh`: builds a no-weights runtime image.
|
| 29 |
+
- `docker/`: Dockerfile and entrypoint for the no-weights runtime image.
|
| 30 |
- `scripts/test_triton_codebook_match.py`: isolated kernel equivalence harness.
|
| 31 |
- `scripts/measure_kv_cache_compression.py`: live KV-cache measurement helper.
|
| 32 |
- `results/`: selected validation outputs.
|
|
|
|
| 57 |
|
| 58 |
Do not bake tokens into Docker images or committed files.
|
| 59 |
|
| 60 |
+
## No-Weights Docker Image
|
| 61 |
+
|
| 62 |
+
This is the simplest hosting path if you are willing to build an image. The image bakes in:
|
| 63 |
+
|
| 64 |
+
```text
|
| 65 |
+
vLLM Spectral fork at 008dd7f87fb9de185e536ad30b4d524024ed9b9f
|
| 66 |
+
GemmaCut launcher entrypoint
|
| 67 |
+
Spectral sidecar artifacts/spectral_sidecar_chat_v2.pt
|
| 68 |
+
git/cmake/ninja build tools for inspection and follow-up work
|
| 69 |
+
```
|
| 70 |
+
|
| 71 |
+
It does **not** bake in model weights. `Intel/gemma-4-31B-it-int4-AutoRound` and `RedHatAI/gemma-4-31B-it-speculator.eagle3` are downloaded at runtime into the mounted Hugging Face cache.
|
| 72 |
+
|
| 73 |
+
Build:
|
| 74 |
+
|
| 75 |
+
```bash
|
| 76 |
+
hf download satya007/gemmacut-spectral \
|
| 77 |
+
.dockerignore \
|
| 78 |
+
docker/Dockerfile \
|
| 79 |
+
docker/entrypoint.sh \
|
| 80 |
+
docker/download_sidecar.py \
|
| 81 |
+
scripts/build_docker_image.sh \
|
| 82 |
+
--local-dir ./gemmacut-spectral-image
|
| 83 |
+
|
| 84 |
+
cd ./gemmacut-spectral-image
|
| 85 |
+
chmod +x ./scripts/build_docker_image.sh
|
| 86 |
+
IMAGE=gemmacut-spectral:008dd7f87 ./scripts/build_docker_image.sh
|
| 87 |
+
```
|
| 88 |
+
|
| 89 |
+
Smoke:
|
| 90 |
+
|
| 91 |
+
```bash
|
| 92 |
+
mkdir -p "$PWD/hf-cache" "$PWD/results"
|
| 93 |
+
|
| 94 |
+
docker run --rm --gpus all --ipc=host \
|
| 95 |
+
-e HF_TOKEN \
|
| 96 |
+
-v "$PWD/hf-cache:/root/.cache/huggingface" \
|
| 97 |
+
-v "$PWD/results:/workspace/results_bench" \
|
| 98 |
+
gemmacut-spectral:008dd7f87 smoke
|
| 99 |
+
```
|
| 100 |
+
|
| 101 |
+
Serve:
|
| 102 |
+
|
| 103 |
+
```bash
|
| 104 |
+
docker run --rm --gpus all --ipc=host \
|
| 105 |
+
-p 8000:8000 \
|
| 106 |
+
-e HF_TOKEN \
|
| 107 |
+
-e MAX_MODEL_LEN=512 \
|
| 108 |
+
-e MAX_NUM_BATCHED_TOKENS=512 \
|
| 109 |
+
-e MAX_NUM_SEQS=2 \
|
| 110 |
+
-e GPU_MEMORY_UTILIZATION=0.8 \
|
| 111 |
+
-v "$PWD/hf-cache:/root/.cache/huggingface" \
|
| 112 |
+
gemmacut-spectral:008dd7f87 serve
|
| 113 |
+
```
|
| 114 |
+
|
| 115 |
+
Optional: build without the sidecar and mount it yourself.
|
| 116 |
+
|
| 117 |
+
```bash
|
| 118 |
+
IMAGE=gemmacut-spectral:008dd7f87-nosidecar \
|
| 119 |
+
./scripts/build_docker_image.sh --build-arg INCLUDE_SIDECAR=0
|
| 120 |
+
|
| 121 |
+
docker run --rm --gpus all --ipc=host \
|
| 122 |
+
-p 8000:8000 \
|
| 123 |
+
-e HF_TOKEN \
|
| 124 |
+
-e SPECTRAL_SIDECAR=/workspace/spectral_sidecar_chat_v2.pt \
|
| 125 |
+
-v "$PWD/hf-cache:/root/.cache/huggingface" \
|
| 126 |
+
-v "$PWD/spectral_sidecar_chat_v2.pt:/workspace/spectral_sidecar_chat_v2.pt:ro" \
|
| 127 |
+
gemmacut-spectral:008dd7f87-nosidecar serve
|
| 128 |
+
```
|
| 129 |
+
|
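Once the `serve` container from the section above is running, it can be queried like any OpenAI-compatible endpoint. The following is a minimal client sketch rather than part of the commit: it assumes the server is reachable at `http://localhost:8000` (matching the `-p 8000:8000` mapping) and uses the default served model name `gemmacut-spectral` from `docker/entrypoint.sh`. The helper names `build_chat_request` and `ask` are hypothetical.

```python
import json
import urllib.request


def build_chat_request(prompt, model="gemmacut-spectral",
                       base_url="http://localhost:8000"):
    """Build (but do not send) a chat-completions request.

    base_url is an assumption; adjust it to wherever the container is exposed.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 16,
        "temperature": 0,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        data = json.load(response)
    return data["choices"][0]["message"]["content"].strip()
```

Calling `ask("What is 2+2? Answer with just the number.")` against a running container mirrors the first prompt the image's own smoke check sends.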
| 130 |
## One-Command Setup
|
| 131 |
|
| 132 |
Pick a host directory. The setup script creates this layout:
|
docker/Dockerfile
ADDED
|
@@ -0,0 +1,56 @@
|
| 1 |
+
ARG BASE_IMAGE=vllm/vllm-openai:gemma4-cu130
|
| 2 |
+
FROM ${BASE_IMAGE}
|
| 3 |
+
|
| 4 |
+
ARG VLLM_REPO=https://github.com/bluecopa/vllm-spectral.git
|
| 5 |
+
ARG VLLM_BRANCH=spectral-codebook-docker
|
| 6 |
+
ARG VLLM_COMMIT=008dd7f87fb9de185e536ad30b4d524024ed9b9f
|
| 7 |
+
ARG HF_REPO_ID=satya007/gemmacut-spectral
|
| 8 |
+
ARG SIDECAR_SHA256=e47a36c13467cbedf720e7f782b976df3dcda2d989c727113a8315008661a3e4
|
| 9 |
+
ARG INCLUDE_SIDECAR=1
|
| 10 |
+
|
| 11 |
+
LABEL org.opencontainers.image.title="gemmacut-spectral"
|
| 12 |
+
LABEL org.opencontainers.image.description="GemmaCut SpectralQuant Phase 2 + Eagle3 vLLM runtime; model weights are not baked into the image."
|
| 13 |
+
LABEL org.opencontainers.image.source="https://github.com/bluecopa/vllm-spectral"
|
| 14 |
+
LABEL org.opencontainers.image.revision="${VLLM_COMMIT}"
|
| 15 |
+
|
| 16 |
+
ENV VLLM_SOURCE=/opt/vllm-spectral \
|
| 17 |
+
GEMMACUT_HOME=/opt/gemmacut \
|
| 18 |
+
SPECTRAL_SIDECAR=/opt/gemmacut/artifacts/spectral_sidecar_chat_v2.pt \
|
| 19 |
+
HF_HUB_DISABLE_XET=1 \
|
| 20 |
+
SPECTRAL_TRITON_COMPRESS=1 \
|
| 21 |
+
SPECTRAL_TRITON_DEQUANT=1 \
|
| 22 |
+
SPECTRAL_CUDA_GRAPH=1 \
|
| 23 |
+
SPECTRAL_VERIFY=0 \
|
| 24 |
+
DISABLE_HYBRID_KV_CACHE_MANAGER=0
|
| 25 |
+
|
| 26 |
+
SHELL ["/bin/bash", "-o", "pipefail", "-c"]
|
| 27 |
+
|
| 28 |
+
RUN apt-get update && \
|
| 29 |
+
apt-get install -y --no-install-recommends \
|
| 30 |
+
ca-certificates \
|
| 31 |
+
cmake \
|
| 32 |
+
git \
|
| 33 |
+
ninja-build && \
|
| 34 |
+
rm -rf /var/lib/apt/lists/*
|
| 35 |
+
|
| 36 |
+
RUN git clone --branch "${VLLM_BRANCH}" "${VLLM_REPO}" "${VLLM_SOURCE}" && \
|
| 37 |
+
git -C "${VLLM_SOURCE}" checkout "${VLLM_COMMIT}" && \
|
| 38 |
+
git -C "${VLLM_SOURCE}" log --oneline -1
|
| 39 |
+
|
| 40 |
+
COPY docker/download_sidecar.py /tmp/download_sidecar.py
|
| 41 |
+
RUN mkdir -p "${GEMMACUT_HOME}/artifacts" && \
|
| 42 |
+
if [[ "${INCLUDE_SIDECAR}" == "1" ]]; then \
|
| 43 |
+
HF_REPO_ID="${HF_REPO_ID}" \
|
| 44 |
+
SIDECAR_SHA256="${SIDECAR_SHA256}" \
|
| 45 |
+
python3 /tmp/download_sidecar.py; \
|
| 46 |
+
else \
|
| 47 |
+
echo "INCLUDE_SIDECAR=0; mount or set SPECTRAL_SIDECAR at runtime"; \
|
| 48 |
+
fi && \
|
| 49 |
+
rm -f /tmp/download_sidecar.py
|
| 50 |
+
|
| 51 |
+
COPY docker/entrypoint.sh /usr/local/bin/gemmacut-spectral
|
| 52 |
+
RUN chmod +x /usr/local/bin/gemmacut-spectral
|
| 53 |
+
|
| 54 |
+
EXPOSE 8000
|
| 55 |
+
ENTRYPOINT ["gemmacut-spectral"]
|
| 56 |
+
CMD ["serve"]
|
docker/download_sidecar.py
ADDED
|
@@ -0,0 +1,20 @@
|
| 1 |
+
import hashlib
|
| 2 |
+
import os
|
| 3 |
+
import shutil
|
| 4 |
+
|
| 5 |
+
from huggingface_hub import hf_hub_download
|
| 6 |
+
|
| 7 |
+
|
| 8 |
+
repo_id = os.environ["HF_REPO_ID"]
|
| 9 |
+
expected = os.environ["SIDECAR_SHA256"]
|
| 10 |
+
target = "/opt/gemmacut/artifacts/spectral_sidecar_chat_v2.pt"
|
| 11 |
+
path = hf_hub_download(
|
| 12 |
+
repo_id=repo_id,
|
| 13 |
+
filename="artifacts/spectral_sidecar_chat_v2.pt",
|
| 14 |
+
repo_type="model",
|
| 15 |
+
)
|
| 16 |
+
shutil.copyfile(path, target)
|
| 17 |
+
actual = hashlib.sha256(open(target, "rb").read()).hexdigest()
|
| 18 |
+
if actual != expected:
|
| 19 |
+
raise SystemExit(f"sidecar sha256 mismatch: expected {expected}, got {actual}")
|
| 20 |
+
print(f"sidecar ready: {target} sha256={actual}")
|
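The checksum step in `docker/download_sidecar.py` reads the whole sidecar into memory and leaves the file handle to be closed by the garbage collector. If the artifact grows, a chunked variant may be preferable. The sketch below is an alternative, not the committed code; `sha256_of_file`, `verify_sha256`, and the 1 MiB chunk size are illustrative choices.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 of a file, reading it in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_sha256(path, expected):
    """Exit with an error message if the file's digest differs from expected."""
    actual = sha256_of_file(path)
    if actual != expected:
        raise SystemExit(f"sha256 mismatch: expected {expected}, got {actual}")
    return actual
```

The digest is identical to hashing the full byte string at once, so the pinned `SIDECAR_SHA256` value would not need to change.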
docker/entrypoint.sh
ADDED
|
@@ -0,0 +1,255 @@
|
| 1 |
+
#!/usr/bin/env bash
|
| 2 |
+
set -euo pipefail
|
| 3 |
+
|
| 4 |
+
COMMAND="${1:-serve}"
|
| 5 |
+
if [ "$#" -gt 0 ]; then
|
| 6 |
+
shift
|
| 7 |
+
fi
|
| 8 |
+
|
| 9 |
+
MODEL="${MODEL:-Intel/gemma-4-31B-it-int4-AutoRound}"
|
| 10 |
+
DRAFT="${DRAFT:-RedHatAI/gemma-4-31B-it-speculator.eagle3}"
|
| 11 |
+
SERVED_MODEL_NAME="${SERVED_MODEL_NAME:-gemmacut-spectral}"
|
| 12 |
+
SPECTRAL_SIDECAR="${SPECTRAL_SIDECAR:-/opt/gemmacut/artifacts/spectral_sidecar_chat_v2.pt}"
|
| 13 |
+
VLLM_SOURCE="${VLLM_SOURCE:-/opt/vllm-spectral}"
|
| 14 |
+
PORT="${PORT:-8000}"
|
| 15 |
+
MAX_MODEL_LEN="${MAX_MODEL_LEN:-512}"
|
| 16 |
+
MAX_NUM_BATCHED_TOKENS="${MAX_NUM_BATCHED_TOKENS:-512}"
|
| 17 |
+
MAX_NUM_SEQS="${MAX_NUM_SEQS:-2}"
|
| 18 |
+
GPU_MEMORY_UTILIZATION="${GPU_MEMORY_UTILIZATION:-0.8}"
|
| 19 |
+
NUM_SPEC_TOKENS="${NUM_SPEC_TOKENS:-3}"
|
| 20 |
+
SPECTRAL_CUDA_GRAPH="${SPECTRAL_CUDA_GRAPH:-1}"
|
| 21 |
+
VLLM_LOGGING_LEVEL="${VLLM_LOGGING_LEVEL:-INFO}"
|
| 22 |
+
DISABLE_HYBRID_KV_CACHE_MANAGER="${DISABLE_HYBRID_KV_CACHE_MANAGER:-0}"
|
| 23 |
+
RESULTS_ROOT="${RESULTS_ROOT:-/workspace/results_bench}"
|
| 24 |
+
|
| 25 |
+
export VLLM_LOGGING_LEVEL
|
| 26 |
+
export SPECTRAL_CUDA_GRAPH
|
| 27 |
+
export SPECTRAL_TRITON_COMPRESS="${SPECTRAL_TRITON_COMPRESS:-1}"
|
| 28 |
+
export SPECTRAL_TRITON_DEQUANT="${SPECTRAL_TRITON_DEQUANT:-1}"
|
| 29 |
+
export SPECTRAL_VERIFY="${SPECTRAL_VERIFY:-0}"
|
| 30 |
+
export HF_HUB_DISABLE_XET="${HF_HUB_DISABLE_XET:-1}"
|
| 31 |
+
unset SPECTRAL_SHARED_ALLOC
|
| 32 |
+
|
| 33 |
+
if [ "${HF_HUB_OFFLINE:-0}" = "1" ]; then
|
| 34 |
+
export HF_HUB_OFFLINE=1
|
| 35 |
+
else
|
| 36 |
+
unset HF_HUB_OFFLINE
|
| 37 |
+
fi
|
| 38 |
+
|
| 39 |
+
prepare_overlay() {
|
| 40 |
+
local run_src="${SPECTRAL_RUN_SRC:-/tmp/vllm-spectral-run}"
|
| 41 |
+
local site
|
| 42 |
+
|
| 43 |
+
if [ ! -d "$VLLM_SOURCE" ]; then
|
| 44 |
+
echo "Missing VLLM_SOURCE: $VLLM_SOURCE" >&2
|
| 45 |
+
exit 1
|
| 46 |
+
fi
|
| 47 |
+
if [ ! -f "$SPECTRAL_SIDECAR" ]; then
|
| 48 |
+
echo "Missing SPECTRAL_SIDECAR: $SPECTRAL_SIDECAR" >&2
|
| 49 |
+
exit 1
|
| 50 |
+
fi
|
| 51 |
+
|
| 52 |
+
site="$(python3 - <<'PY'
|
| 53 |
+
import pathlib
|
| 54 |
+
import vllm
|
| 55 |
+
print(pathlib.Path(vllm.__file__).resolve().parent)
|
| 56 |
+
PY
|
| 57 |
+
)"
|
| 58 |
+
|
| 59 |
+
rm -rf "$run_src"
|
| 60 |
+
cp -a "$VLLM_SOURCE" "$run_src"
|
| 61 |
+
|
| 62 |
+
shopt -s nullglob
|
| 63 |
+
for f in "$site"/_C*.so "$site"/_moe_C*.so "$site"/_flashmla*.so "$site"/cumem_allocator*.so; do
|
| 64 |
+
ln -sf "$f" "$run_src/vllm/"
|
| 65 |
+
done
|
| 66 |
+
mkdir -p "$run_src/vllm/vllm_flash_attn"
|
| 67 |
+
for f in "$site"/vllm_flash_attn/_vllm_fa2_C*.so "$site"/vllm_flash_attn/_vllm_fa3_C*.so; do
|
| 68 |
+
ln -sf "$f" "$run_src/vllm/vllm_flash_attn/"
|
| 69 |
+
done
|
| 70 |
+
ln -sfn "$site/vllm_flash_attn/cute" "$run_src/vllm/vllm_flash_attn/cute"
|
| 71 |
+
ln -sfn "$site/vllm_flash_attn/layers" "$run_src/vllm/vllm_flash_attn/layers"
|
| 72 |
+
mkdir -p "$run_src/vllm/third_party" "$run_src/vllm/third_party/flashmla"
|
| 73 |
+
ln -sfn "$site/third_party/triton_kernels" "$run_src/vllm/third_party/triton_kernels"
|
| 74 |
+
ln -sf "$site/third_party/flashmla/flash_mla_interface.py" "$run_src/vllm/third_party/flashmla/"
|
| 75 |
+
shopt -u nullglob
|
| 76 |
+
|
| 77 |
+
export PYTHONPATH="$run_src:$run_src/vllm/third_party${PYTHONPATH:+:$PYTHONPATH}"
|
| 78 |
+
}
|
| 79 |
+
|
| 80 |
+
server_args() {
|
| 81 |
+
local args=(
|
| 82 |
+
--host "${HOST:-0.0.0.0}"
|
| 83 |
+
--port "$PORT"
|
| 84 |
+
--model "$MODEL"
|
| 85 |
+
--served-model-name "$SERVED_MODEL_NAME"
|
| 86 |
+
--spectral-calibration "$SPECTRAL_SIDECAR"
|
| 87 |
+
--spectral-quantize
|
| 88 |
+
--kv-cache-dtype fp8_e4m3
|
| 89 |
+
--max-model-len "$MAX_MODEL_LEN"
|
| 90 |
+
--max-num-batched-tokens "$MAX_NUM_BATCHED_TOKENS"
|
| 91 |
+
--max-num-seqs "$MAX_NUM_SEQS"
|
| 92 |
+
--gpu-memory-utilization "$GPU_MEMORY_UTILIZATION"
|
| 93 |
+
--compilation-config "{\"compile_sizes\": []}"
|
| 94 |
+
--speculative-config "{\"model\":\"$DRAFT\",\"num_speculative_tokens\":$NUM_SPEC_TOKENS,\"method\":\"eagle3\"}"
|
| 95 |
+
)
|
| 96 |
+
if [ "$DISABLE_HYBRID_KV_CACHE_MANAGER" = "1" ]; then
|
| 97 |
+
args+=(--disable-hybrid-kv-cache-manager)
|
| 98 |
+
fi
|
| 99 |
+
printf '%s\0' "${args[@]}"
|
| 100 |
+
}
|
| 101 |
+
|
| 102 |
+
run_server() {
|
| 103 |
+
prepare_overlay
|
| 104 |
+
local args=()
|
| 105 |
+
while IFS= read -r -d '' item; do
|
| 106 |
+
args+=("$item")
|
| 107 |
+
done < <(server_args)
|
| 108 |
+
exec python3 -m vllm.entrypoints.openai.api_server "${args[@]}" "$@"
|
| 109 |
+
}
|
| 110 |
+
|
| 111 |
+
wait_for_server() {
|
| 112 |
+
python3 - <<PY
|
| 113 |
+
import os
|
| 114 |
+
import sys
|
| 115 |
+
import time
|
| 116 |
+
import urllib.request
|
| 117 |
+
|
| 118 |
+
pid = int(os.environ["SERVER_PID"])
|
| 119 |
+
port = int(os.environ["PORT"])
|
| 120 |
+
deadline = time.time() + int(os.environ.get("SERVER_TIMEOUT", "300"))
|
| 121 |
+
url = f"http://127.0.0.1:{port}/v1/models"
|
| 122 |
+
while time.time() < deadline:
|
| 123 |
+
try:
|
| 124 |
+
os.kill(pid, 0)
|
| 125 |
+
except OSError:
|
| 126 |
+
raise SystemExit("server exited early")
|
| 127 |
+
try:
|
| 128 |
+
with urllib.request.urlopen(url, timeout=2) as response:
|
| 129 |
+
if response.status == 200:
|
| 130 |
+
print("SERVER_READY", flush=True)
|
| 131 |
+
raise SystemExit(0)
|
| 132 |
+
except Exception:
|
| 133 |
+
time.sleep(1)
|
| 134 |
+
raise SystemExit("server did not become ready")
|
| 135 |
+
PY
|
| 136 |
+
}
|
| 137 |
+
|
| 138 |
+
start_background_server() {
|
| 139 |
+
prepare_overlay
|
| 140 |
+
local args=()
|
| 141 |
+
HOST=127.0.0.1
|
| 142 |
+
export HOST
|
| 143 |
+
while IFS= read -r -d '' item; do
|
| 144 |
+
args+=("$item")
|
| 145 |
+
done < <(server_args)
|
| 146 |
+
python3 -m vllm.entrypoints.openai.api_server "${args[@]}" > "$SERVER_LOG" 2>&1 &
|
| 147 |
+
SERVER_PID=$!
|
| 148 |
+
export SERVER_PID PORT
|
| 149 |
+
trap 'kill "$SERVER_PID" >/dev/null 2>&1 || true; wait "$SERVER_PID" >/dev/null 2>&1 || true' EXIT
|
| 150 |
+
wait_for_server
|
| 151 |
+
}
|
| 152 |
+
|
| 153 |
+
run_smoke_client() {
|
| 154 |
+
python3 - <<PY
|
| 155 |
+
import json
|
| 156 |
+
import urllib.request
|
| 157 |
+
|
| 158 |
+
model = "${SERVED_MODEL_NAME}"
|
| 159 |
+
url = "http://127.0.0.1:${PORT}/v1/chat/completions"
|
| 160 |
+
checks = [
|
| 161 |
+
("What is 2+2? Answer with just the number.", "4"),
|
| 162 |
+
("Paris is the capital of which country? Answer with one word.", "France"),
|
| 163 |
+
]
|
| 164 |
+
|
| 165 |
+
for prompt, expected in checks:
|
| 166 |
+
payload = {
|
| 167 |
+
"model": model,
|
| 168 |
+
"messages": [{"role": "user", "content": prompt}],
|
| 169 |
+
"max_tokens": 16,
|
| 170 |
+
"temperature": 0,
|
| 171 |
+
}
|
| 172 |
+
request = urllib.request.Request(
|
| 173 |
+
url,
|
| 174 |
+
data=json.dumps(payload).encode("utf-8"),
|
| 175 |
+
headers={"Content-Type": "application/json"},
|
| 176 |
+
method="POST",
|
| 177 |
+
)
|
| 178 |
+
with urllib.request.urlopen(request, timeout=120) as response:
|
| 179 |
+
data = json.load(response)
|
| 180 |
+
text = data["choices"][0]["message"]["content"].strip()
|
| 181 |
+
print(f"{prompt} => {text}", flush=True)
|
| 182 |
+
if expected.lower() not in text.lower():
|
| 183 |
+
raise SystemExit(
|
| 184 |
+
f"semantic smoke failed: expected {expected!r} in response {text!r}")
|
| 185 |
+
|
| 186 |
+
print("SMOKE_PROMPTS_OK", flush=True)
|
| 187 |
+
PY
|
| 188 |
+
}
|
| 189 |
+
|
| 190 |
+
run_smoke() {
|
| 191 |
+
RUN_ID="${RUN_ID:-smoke_$(date +%Y%m%d_%H%M%S)}"
|
| 192 |
+
OUT="${RESULTS_DIR:-$RESULTS_ROOT/$RUN_ID}"
|
| 193 |
+
mkdir -p "$OUT"
|
| 194 |
+
SERVER_LOG="$OUT/server.log"
|
| 195 |
+
start_background_server
|
| 196 |
+
run_smoke_client | tee "$OUT/smoke_outputs.txt"
|
| 197 |
+
echo "SMOKE_OUT=$OUT"
|
| 198 |
+
}
|
| 199 |
+
|
| 200 |
+
run_bench() {
|
| 201 |
+
RUN_ID="${RUN_ID:-tokens_sec_phase2_eagle_$(date +%Y%m%d_%H%M%S)}"
|
| 202 |
+
OUT="${RESULTS_DIR:-$RESULTS_ROOT/$RUN_ID}"
|
| 203 |
+
mkdir -p "$OUT"
|
| 204 |
+
SERVER_LOG="$OUT/server.log"
|
| 205 |
+
start_background_server
|
| 206 |
+
|
| 207 |
+
if [ "${RUN_SMOKE:-0}" = "1" ]; then
|
| 208 |
+
run_smoke_client | tee "$OUT/smoke_outputs.txt"
|
| 209 |
+
fi
|
| 210 |
+
if [ "${SMOKE_ONLY:-0}" = "1" ]; then
|
| 211 |
+
echo "SMOKE_ONLY=1; skipping benchmark"
|
| 212 |
+
echo "BENCH_OUT=$OUT"
|
| 213 |
+
exit 0
|
| 214 |
+
fi
|
| 215 |
+
|
| 216 |
+
python3 -m vllm.entrypoints.cli.main bench serve \
|
| 217 |
+
--backend openai-chat \
|
| 218 |
+
--base-url "http://127.0.0.1:$PORT" \
|
| 219 |
+
--endpoint /v1/chat/completions \
|
| 220 |
+
--model "$SERVED_MODEL_NAME" \
|
| 221 |
+
--tokenizer "$MODEL" \
|
| 222 |
+
--dataset-name random \
|
| 223 |
+
--random-input-len "${INPUT_LEN:-128}" \
|
| 224 |
+
--random-output-len "${OUTPUT_LEN:-32}" \
|
| 225 |
+
--num-prompts "${NUM_PROMPTS:-8}" \
|
| 226 |
+
--num-warmups "${NUM_WARMUPS:-1}" \
|
| 227 |
+
--request-rate "${REQUEST_RATE:-inf}" \
|
| 228 |
+
--temperature 0 \
|
| 229 |
+
--ignore-eos \
|
| 230 |
+
--disable-tqdm \
|
| 231 |
+
--save-result \
|
| 232 |
+
--result-dir "$OUT" \
|
| 233 |
+
--result-filename bench.json \
|
| 234 |
+
2>&1 | tee "$OUT/bench.log"
|
| 235 |
+
|
| 236 |
+
echo "BENCH_OUT=$OUT"
|
| 237 |
+
}
|
| 238 |
+
|
| 239 |
+
case "$COMMAND" in
|
| 240 |
+
serve)
|
| 241 |
+
run_server "$@"
|
| 242 |
+
;;
|
| 243 |
+
smoke)
|
| 244 |
+
run_smoke
|
| 245 |
+
;;
|
| 246 |
+
bench)
|
| 247 |
+
run_bench
|
| 248 |
+
;;
|
| 249 |
+
bash|sh)
|
| 250 |
+
exec "$COMMAND" "$@"
|
| 251 |
+
;;
|
| 252 |
+
*)
|
| 253 |
+
exec "$COMMAND" "$@"
|
| 254 |
+
;;
|
| 255 |
+
esac
|
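`server_args` in the entrypoint above builds the `--speculative-config` value by interpolating `$DRAFT` and `$NUM_SPEC_TOKENS` into a JSON string, which works for the defaults but would produce invalid JSON if a value contained a quote or `NUM_SPEC_TOKENS` were not an integer. As a hedged illustration of the same payload built with a serializer, here is a small Python sketch; the function name `speculative_config` is hypothetical and not part of the commit.

```python
import json


def speculative_config(draft_model, num_speculative_tokens, method="eagle3"):
    """Serialize the --speculative-config value as compact JSON.

    int() rejects non-numeric token counts early, and json.dumps escapes any
    characters in the model name that would otherwise break the JSON.
    """
    return json.dumps(
        {
            "model": draft_model,
            "num_speculative_tokens": int(num_speculative_tokens),
            "method": method,
        },
        separators=(",", ":"),
    )
```

With the script's default values this yields the same string the shell interpolation produces.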
manifest.json
CHANGED
|
@@ -23,9 +23,18 @@
|
|
| 23 |
"scripts/setup_repro_from_hf.sh",
|
| 24 |
"scripts/serve_phase2_eagle.sh",
|
| 25 |
"scripts/bench_tokens_sec_phase2_eagle.sh",
|
| 26 |
"scripts/test_triton_codebook_match.py",
|
| 27 |
"scripts/measure_kv_cache_compression.py"
|
| 28 |
],
|
| 29 |
"recommended_runtime_env": {
|
| 30 |
"SPECTRAL_CUDA_GRAPH": "1",
|
| 31 |
"SPECTRAL_TRITON_COMPRESS": "1",
|
|
|
|
| 23 |
"scripts/setup_repro_from_hf.sh",
|
| 24 |
"scripts/serve_phase2_eagle.sh",
|
| 25 |
"scripts/bench_tokens_sec_phase2_eagle.sh",
|
| 26 |
+
"scripts/build_docker_image.sh",
|
| 27 |
"scripts/test_triton_codebook_match.py",
|
| 28 |
"scripts/measure_kv_cache_compression.py"
|
| 29 |
],
|
| 30 |
+
"docker_image_build": {
|
| 31 |
+
"dockerfile": "docker/Dockerfile",
|
| 32 |
+
"entrypoint": "docker/entrypoint.sh",
|
| 33 |
+
"downloads_model_weights_at_runtime": true,
|
| 34 |
+
"includes_sidecar_by_default": true,
|
| 35 |
+
"optional_no_sidecar_build_arg": "INCLUDE_SIDECAR=0",
|
| 36 |
+
"default_image_tag": "gemmacut-spectral:008dd7f87"
|
| 37 |
+
},
|
| 38 |
"recommended_runtime_env": {
|
| 39 |
"SPECTRAL_CUDA_GRAPH": "1",
|
| 40 |
"SPECTRAL_TRITON_COMPRESS": "1",
|
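The `docker_image_build` entry added to `manifest.json` above is machine-readable, so tooling could derive a build command from it. The sketch below is only an illustration under stated assumptions: the inline fragment copies just the keys shown in this hunk, `docker_build_command` is a hypothetical helper, and the `-nosidecar` tag suffix follows the README example rather than anything the manifest itself specifies.

```python
import json

# Inline copy of the keys shown in the hunk; a real script would json.load
# the full manifest.json instead.
MANIFEST_FRAGMENT = """
{
  "docker_image_build": {
    "dockerfile": "docker/Dockerfile",
    "entrypoint": "docker/entrypoint.sh",
    "downloads_model_weights_at_runtime": true,
    "includes_sidecar_by_default": true,
    "optional_no_sidecar_build_arg": "INCLUDE_SIDECAR=0",
    "default_image_tag": "gemmacut-spectral:008dd7f87"
  }
}
"""


def docker_build_command(manifest, include_sidecar=True, context="."):
    """Return the docker build argv implied by the manifest entry."""
    build = manifest["docker_image_build"]
    tag = build["default_image_tag"]
    command = ["docker", "build", "-f", build["dockerfile"]]
    if not include_sidecar:
        # Tag suffix is an assumption borrowed from the README example.
        tag += "-nosidecar"
        command += ["--build-arg", build["optional_no_sidecar_build_arg"]]
    command += ["-t", tag, context]
    return command
```

The returned list can be passed to `subprocess.run` from the bundle root, which is what `scripts/build_docker_image.sh` does in shell form.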
scripts/build_docker_image.sh
ADDED
|
@@ -0,0 +1,17 @@
|
| 1 |
+
#!/usr/bin/env bash
|
| 2 |
+
# Build the no-weights GemmaCut SpectralQuant runtime image.
|
| 3 |
+
|
| 4 |
+
set -euo pipefail
|
| 5 |
+
|
| 6 |
+
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
| 7 |
+
BUNDLE_DIR="$(cd "$SCRIPT_DIR/.." && pwd)"
|
| 8 |
+
|
| 9 |
+
IMAGE="${IMAGE:-gemmacut-spectral:008dd7f87}"
|
| 10 |
+
|
| 11 |
+
docker build \
|
| 12 |
+
-f "$BUNDLE_DIR/docker/Dockerfile" \
|
| 13 |
+
-t "$IMAGE" \
|
| 14 |
+
"$@" \
|
| 15 |
+
"$BUNDLE_DIR"
|
| 16 |
+
|
| 17 |
+
echo "Built $IMAGE"
|