Upload Kaiju Coder 7 runtime quantization recipe

Files changed (6)

PUBLIC_TESTING_QUICKSTART.md +149 -0
README.md +83 -0
scripts/run-gojira-b-vllm-serving-benchmark.sh +44 -0
scripts/start-qwen36-merged-vllm.sh +76 -0
scripts/stop-qwen36-merged-sglang.sh +18 -0
scripts/stop-qwen36-merged-vllm.sh +18 -0

PUBLIC_TESTING_QUICKSTART.md ADDED

	@@ -0,0 +1,149 @@

+# Kaiju Coder 7 Public Testing Quickstart
+Kaiju Coder 7 is the public model name. The OpenAI-compatible model id is:
+```text
+kaiju-coder-7
+```
+Use this guide for serious public testing. It avoids internal checkpoint names
+and keeps the current limitations clear.
+## Pick A Test Path
+### Path 1: OpenCode Against An Existing Endpoint
+Use this if you already have Kaiju Coder 7 served at an OpenAI-compatible
+`/v1` endpoint.
+```bash
+git clone https://huggingface.co/RMDWLLC/kaiju-coder-7-opencode
+cd kaiju-coder-7-opencode
+python3 scripts/install_kaiju_opencode_profile.py --base-url http://127.0.0.1:18083/v1
+```
+Then run OpenCode inside the project you want to edit:
+```bash
+opencode -m kaiju/kaiju-coder-7 --agent kaiju-coder-7
+```
+For a bounded smoke test:
+```bash
+mkdir -p /tmp/kaiju-public-smoke
+opencode run -m kaiju/kaiju-coder-7 --agent kaiju-coder-7 \
+  --dir /tmp/kaiju-public-smoke \
+  "Create hello.txt with exactly: Kaiju Coder 7 is ready"
+```
+Or run the packaged verifier, which checks the installer, live model endpoint,
+OpenCode binary, actual file creation, and wrong-directory behavior:
+```bash
+python3 scripts/run_kaiju_public_opencode_smoke.py
+```
+The helper installer adds:
+- the `kaiju` OpenAI-compatible provider
+- the lean `kaiju-coder-7` OpenCode agent
+- a scoped no-autocontinue plugin that prevents false completion loops after
+  compaction or output limits
+### Path 2: Full Local Weights
+Use this if the full `RMDWLLC/kaiju-coder-7` Hugging Face repo has been
+uploaded and you have suitable local GPU hardware.
+```bash
+hf download RMDWLLC/kaiju-coder-7 --local-dir ./kaiju-coder-7
+```
+Serve the downloaded folder with an OpenAI-compatible local server. Configure
+the server to expose:
+```text
+model id: kaiju-coder-7
+base URL: http://127.0.0.1:18083/v1
+context: 16384
+```
+Then install the OpenCode helper with:
+```bash
+git clone https://huggingface.co/RMDWLLC/kaiju-coder-7-opencode
+cd kaiju-coder-7-opencode
+python3 scripts/install_kaiju_opencode_profile.py --base-url http://127.0.0.1:18083/v1
+```
+### Path 3: Runtime-Quantized Local Candidate
+Use this only if you are comfortable with advanced serving setups. The current
+working quantized option is a runtime bitsandbytes recipe, not a separate
+persisted quantized weights repo.
+```bash
+git clone https://huggingface.co/RMDWLLC/kaiju-coder-7-quantized-runtime
+cd kaiju-coder-7-quantized-runtime
+```
+Read `README.md` in that repo before serving. This path can reduce model memory
+at runtime, but it still depends on access to the full Kaiju Coder 7 weights.
+## Recommended Test Prompt
+Run this from an empty project folder:
+```text
+Build a launch-ready local service business website and operating pack. Include
+index.html, a Stripe checkout safety plan, a CSV parser with tests, a simple CRM
+schema, a weekly money report, and a safety/provenance note. Write the files,
+not just advice.
+```
+Expected result:
+- files are written in the requested project folder
+- `index.html` is complete HTML
+- business docs start with Markdown H1 headings
+- code includes a test or smoke-check command where practical
+- no fake API keys, OAuth tokens, payment secrets, or private customer data
+## Current Recommended Defaults
+- Public model id: `kaiju-coder-7`
+- OpenCode context: `16384`
+- Output cap for public testing: `2500`
+- Current reliable product path: model plus deterministic business-owner
+  harness plus verifier
+- Raw multi-file OpenCode generation: still too slow for broad paid API claims
+- Paid API: not public until launch preflight passes
+## What Not To Claim Yet
+Do not claim:
+- that raw model weights alone reliably build every business-owner artifact
+- that a paid hosted API is generally available
+- that persisted quantized weights exist
+- that 32k context is the current live default
+Do claim:
+- Kaiju Coder 7 has a working local/OpenCode release candidate
+- the current tested OpenCode default is 16k context
+- the helper package includes a lean agent and compaction loop guard
+- the paid API scaffold has tests and a launch preflight, but is not yet public
+- the packaged public smoke test verifies a fresh OpenCode one-file write before
+  public claims are refreshed
+## Current Blockers Before Public Release
+- Hugging Face repo creation still requires a write-capable token or namespace.
+- Full merged model upload has not completed; the merged folder must first have
+  the metadata packet synced by `prepare_hf_merged_model_metadata.sh`.
+- Public paid API launch needs real Cloudflare D1/KV/R2 bindings, Wrangler
+  secret verification, Stripe webhook staging evidence, staging traffic, latency
+  evidence, and rollback proof.
+- Human review is still required before public upload.
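
Review note on the quickstart's "Expected result" list: the checks on `index.html`, Markdown H1 headings, and secrets can be approximated with a small shell helper. This is a minimal sketch, not the packaged verifier. `check_project_outputs` is a hypothetical name, and the secret pattern only covers Stripe-style `sk_live_`/`sk_test_` prefixes, which is an assumption about what a leaked payment key would look like.

```bash
#!/usr/bin/env bash
# Sketch: check a generated project folder against the quickstart's expected results.
# check_project_outputs is a hypothetical helper, not part of the Kaiju repos.
check_project_outputs() {
  local dir="$1" status=0 doc first_line

  # index.html should exist and contain both an opening and a closing html tag.
  if ! grep -qi '<html' "${dir}/index.html" 2>/dev/null ||
     ! grep -qi '</html>' "${dir}/index.html" 2>/dev/null; then
    echo "index.html missing or incomplete" >&2
    status=1
  fi

  # Every Markdown file in the folder should start with an H1 heading.
  for doc in "${dir}"/*.md; do
    [[ -e "${doc}" ]] || continue
    first_line=""
    IFS= read -r first_line < "${doc}" || true
    if [[ "${first_line}" != "# "* ]]; then
      echo "no H1 heading at top of ${doc}" >&2
      status=1
    fi
  done

  # Crude secret scan (assumed pattern): Stripe-style secret key prefixes only.
  if grep -rEq 'sk_(live|test)_[0-9A-Za-z]+' "${dir}"; then
    echo "possible payment secret found in ${dir}" >&2
    status=1
  fi

  return "${status}"
}
```

Calling `check_project_outputs /tmp/kaiju-public-smoke` after a run returns non-zero if any check fails and prints the reason to stderr. It does not cover the test-command or private-data items in the list, which need human review.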

README.md ADDED

	@@ -0,0 +1,83 @@

+# Kaiju Coder 7 Runtime-Quantized Local Candidate
+This is the current working local quantized variant for Kaiju Coder 7. It is a
+runtime bitsandbytes vLLM serving path, not a separate persisted quantized
+weight artifact yet.
+## Status
+- Model id: `kaiju-coder-7`
+- Runtime: `gojira/vllm-openai-ray:nightly`
+- Quantization mode: vLLM `--quantization bitsandbytes`
+- Load format: vLLM `--load-format bitsandbytes`
+- Required launch mode: `--language-model-only`
+- Required OpenCode launch flag: `--enable-auto-tool-choice`
+- Required preinstall in this image: `pandas`
+- Tested contexts: `8192`, `16384`
+- OpenCode smoke: passed
+- Persisted quantized Hugging Face weights: pending
+## Run
+Use the guarded benchmark script from the repo root:
+```bash
+KAIJU_VLLM_CONTEXT=16384 \
+KAIJU_VLLM_READY_TIMEOUT=1200 \
+KAIJU_VLLM_QUANTIZATION=bitsandbytes \
+KAIJU_VLLM_LOAD_FORMAT=bitsandbytes \
+  ./scripts/run-gojira-b-vllm-serving-benchmark.sh
+```
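+The script stops the merged SGLang service and any running vLLM session, starts
+vLLM on port `18084`, waits for the endpoint to list the model, and runs the
+benchmark. On exit it restores the recommended SGLang service on port `18083`;
+set `KAIJU_VLLM_KEEP_RUNNING=1` to skip the restore and leave vLLM running. By
+default it runs the `identity` and `code_patch` prompts with a 128-token cap;
+override these with `KAIJU_VLLM_PROMPTS` and `KAIJU_VLLM_MAX_TOKENS`.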
+## Evidence
+Runs:
+- `runs/benchmarks/20260603T153257Z-kaiju-coder-7-serving/summary.md`
+- `runs/benchmarks/20260603T154450Z-kaiju-coder-7-serving/summary.md`
+- `runs/benchmarks/20260603T161316Z-kaiju-coder-7-serving/summary.md`
+- `runs/benchmarks/20260603T165512Z-kaiju-coder-7-serving/summary.md`
+| Runtime | Context | Prompt | OK | Seconds | Chars | Chars/s |
+| --- | ---: | --- | --- | ---: | ---: | ---: |
+| vLLM bitsandbytes | 8192 | identity | True | 21.19 | 26 | 1.227 |
+| vLLM bitsandbytes | 8192 | code_patch | True | 11.31 | 424 | 37.489 |
+| vLLM bitsandbytes | 16384 | identity | True | 19.51 | 26 | 1.333 |
+| vLLM bitsandbytes | 16384 | code_patch | True | 11.30 | 416 | 36.814 |
+| vLLM bitsandbytes | 16384 | business_doc | True | 53.44 | 1610 | 30.127 |
+| vLLM bitsandbytes | 16384 | identity | True | 19.65 | 26 | 1.323 |
+Gojira-B log evidence recorded model load at about `17.8 GiB` memory for both
+8k and 16k bitsandbytes runs. That is roughly 65% less than the full bfloat16
+vLLM model load, which reported about `50.22 GiB`.
+The 16k business-document task passed after the wrapper restored the default
+SGLang service.
+OpenCode one-file smoke also passed through the runtime-quantized endpoint:
+```bash
+bash scripts/run_kaiju_quantized_opencode_smoke.sh
+```
+Result:
+- Workdir: `/tmp/kaiju-opencode-quantized-smoke`
+- File: `hello.txt`
+- Exact content: `Kaiju Coder 7 quantized runtime ok`
+- OpenCode config: isolated temporary `HOME`, no global config edit
+- Permission mode: `--dangerously-skip-permissions` inside the temporary smoke
+  harness only
+## Release Interpretation
+This is a working quantized local runtime candidate. It is useful for internal
+testing, serious GPU users, and the next paid API speed experiments. It is not
+yet a standalone public quantized weights repo because the artifact is still the
+full merged model loaded through bitsandbytes at runtime.
+The next release step is to produce a persisted quantized artifact, or package
+this runtime path as an advanced serving recipe while clearly saying it still
+requires access to the full Kaiju Coder 7 merged weights.
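
Review note on the Evidence section: the Chars/s column and the memory comparison are plain ratios of the reported numbers. A minimal sketch for recomputing them, assuming `awk` is available; both function names are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: recompute the derived numbers in the README's Evidence section.
# Both helpers are hypothetical and only do arithmetic on values passed in.

# Characters per second, rounded to three decimals like the table.
chars_per_sec() {
  awk -v chars="$1" -v secs="$2" 'BEGIN { printf "%.3f\n", chars / secs }'
}

# Percentage reduction of the quantized load relative to the full load.
pct_reduction() {
  awk -v full="$1" -v quant="$2" 'BEGIN { printf "%.1f\n", 100 * (full - quant) / full }'
}
```

For example, `chars_per_sec 1610 53.44` reproduces the `30.127` in the business_doc row, and `pct_reduction 50.22 17.8` gives `64.6` for the bitsandbytes load against the bfloat16 load.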

scripts/run-gojira-b-vllm-serving-benchmark.sh ADDED

	@@ -0,0 +1,44 @@

+#!/usr/bin/env bash
+set -euo pipefail
+ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
+PORT="${KAIJU_VLLM_PORT:-18084}"
+MODEL="${KAIJU_VLLM_MODEL_NAME:-kaiju-coder-7}"
+CONTEXT="${KAIJU_VLLM_CONTEXT:-16384}"
+READY_TIMEOUT="${KAIJU_VLLM_READY_TIMEOUT:-900}"
+KEEP_VLLM="${KAIJU_VLLM_KEEP_RUNNING:-0}"
+PROMPTS="${KAIJU_VLLM_PROMPTS:-identity code_patch}"
+MAX_TOKENS="${KAIJU_VLLM_MAX_TOKENS:-128}"
+TIMEOUT="${KAIJU_VLLM_PROMPT_TIMEOUT:-300}"
+BASE_URL="http://100.109.109.14:${PORT}/v1"
+restore_sglang() {
+  if [[ "${KEEP_VLLM}" == "1" ]]; then
+    return
+  fi
+  "${ROOT}/scripts/stop-qwen36-merged-vllm.sh" >/dev/null 2>&1 || true
+  KAIJU_QWEN36_MERGED_CONTEXT="${KAIJU_QWEN36_MERGED_CONTEXT:-32768}" \
+    "${ROOT}/scripts/start-qwen36-merged-sglang.sh" >/dev/null 2>&1 || true
+}
+trap restore_sglang EXIT
+"${ROOT}/scripts/stop-qwen36-merged-sglang.sh"
+"${ROOT}/scripts/stop-qwen36-merged-vllm.sh"
+KAIJU_VLLM_CONTEXT="${CONTEXT}" "${ROOT}/scripts/start-qwen36-merged-vllm.sh"
+deadline=$((SECONDS + READY_TIMEOUT))
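+# Capture the response before matching: under pipefail, grep -q exiting early
+# can make the curl side of a pipe fail even when the model id is present.
+until models="$(curl -fsSL --max-time 10 "${BASE_URL}/models")" && grep -q "\"${MODEL}\"" <<<"${models}"; do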
+  if (( SECONDS >= deadline )); then
+    echo "vLLM endpoint did not become ready at ${BASE_URL}" >&2
+    exit 1
+  fi
+  sleep 10
+done
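+# Split the space-separated prompt list into an array so each name is passed
+# as its own argument without glob expansion.
+read -r -a PROMPT_ARGS <<<"${PROMPTS}"
+python3 "${ROOT}/scripts/benchmark_kaiju_serving.py" \
+  --base-url "${BASE_URL}" \
+  --model "${MODEL}" \
+  --contexts "${CONTEXT}" \
+  --prompts "${PROMPT_ARGS[@]}" \
+  --max-tokens "${MAX_TOKENS}" \
+  --timeout "${TIMEOUT}"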
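
Review note on the readiness loop above: the deadline pattern built on bash's `SECONDS` counter can be factored into a reusable helper. A minimal sketch, assuming bash; `wait_until` is a hypothetical name and is not defined anywhere in this commit.

```bash
#!/usr/bin/env bash
# Sketch: poll a command until it succeeds or a timeout (in seconds) elapses.
# wait_until is a hypothetical helper mirroring the benchmark script's loop.
wait_until() {
  local timeout="$1" interval="$2"
  shift 2
  # SECONDS is a bash builtin counting seconds since the shell started.
  local deadline=$((SECONDS + timeout))
  until "$@"; do
    if (( SECONDS >= deadline )); then
      return 1
    fi
    sleep "${interval}"
  done
}
```

The wrapped command is passed as separate arguments, so the script's check would need to live in a small function of its own before calling something like `wait_until "${READY_TIMEOUT}" 10 endpoint_ready`, where `endpoint_ready` is likewise a hypothetical name.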

scripts/start-qwen36-merged-vllm.sh ADDED

	@@ -0,0 +1,76 @@

+#!/usr/bin/env bash
+set -euo pipefail
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+# shellcheck source=scripts/gojira-b-ssh-lib.sh
+source "${SCRIPT_DIR}/gojira-b-ssh-lib.sh"
+kaiju_gojira_b_init
+PORT="${KAIJU_VLLM_PORT:-18084}"
+SESSION="${KAIJU_VLLM_SESSION:-kaiju_qwen36_v18_merged_vllm}"
+# Defaults to 32768; README.md lists only 8192 and 16384 as tested contexts.
+CONTEXT_LENGTH="${KAIJU_VLLM_CONTEXT:-32768}"
+GPU_UTIL="${KAIJU_VLLM_GPU_UTIL:-0.90}"
+MODEL_REMOTE="${KAIJU_VLLM_MODEL_REMOTE:-/home/richardecholsai5/kaiju-coder/models/Kaiju-Coder-Qwen3.6-27B-v1.8-merged}"
+SERVED_MODEL="${KAIJU_VLLM_MODEL_NAME:-kaiju-coder-7}"
+IMAGE="${KAIJU_VLLM_IMAGE:-gojira/vllm-openai-ray:nightly}"
+PREINSTALL="${KAIJU_VLLM_PREINSTALL:-pandas}"
+QUANTIZATION="${KAIJU_VLLM_QUANTIZATION:-}"
+LOAD_FORMAT="${KAIJU_VLLM_LOAD_FORMAT:-}"
+KV_CACHE_DTYPE="${KAIJU_VLLM_KV_CACHE_DTYPE:-}"
+ENABLE_AUTO_TOOL_CHOICE="${KAIJU_VLLM_ENABLE_AUTO_TOOL_CHOICE:-1}"
+CONTAINER="qwen36-merged-vllm-${PORT}"
+EXTRA_ARGS=()
+if [[ -n "${QUANTIZATION}" ]]; then
+  EXTRA_ARGS+=(--quantization "${QUANTIZATION}")
+fi
+if [[ -n "${LOAD_FORMAT}" ]]; then
+  EXTRA_ARGS+=(--load-format "${LOAD_FORMAT}")
+fi
+if [[ -n "${KV_CACHE_DTYPE}" ]]; then
+  EXTRA_ARGS+=(--kv-cache-dtype "${KV_CACHE_DTYPE}")
+fi
+if [[ "${ENABLE_AUTO_TOOL_CHOICE}" == "1" ]]; then
+  EXTRA_ARGS+=(--enable-auto-tool-choice)
+fi
+EXTRA_ARGS_Q=""
+if ((${#EXTRA_ARGS[@]})); then
+  printf -v EXTRA_ARGS_Q "%q " "${EXTRA_ARGS[@]}"
+fi
+kaiju_gojira_b_ssh "
+  set -euo pipefail
+  test -d '${MODEL_REMOTE}' || { echo 'Missing merged model: ${MODEL_REMOTE}' >&2; exit 2; }
+  mkdir -p ~/kaiju-coder/logs ~/hf-cache
+  LOG=~/kaiju-coder/logs/qwen36-merged-vllm-${PORT}.log
+  if tmux has-session -t '${SESSION}' 2>/dev/null; then
+    echo 'session already running: ${SESSION}'
+  else
+    sudo docker rm -f '${CONTAINER}' >/dev/null 2>&1 || true
+    rm -f \"\${LOG}\"
+    tmux new-session -d -s '${SESSION}' \"set -euo pipefail; sudo docker run --rm --gpus all --network host --ipc=host \
+      -v '${MODEL_REMOTE}':/models/kaiju-merged:ro \
+      -v ~/hf-cache:/root/.cache/huggingface \
+      --name '${CONTAINER}' \
+      --entrypoint bash \
+      '${IMAGE}' \
+      -lc 'if [[ -n \"${PREINSTALL}\" ]]; then python3 -m pip install -q ${PREINSTALL}; fi; python3 -m vllm.entrypoints.openai.api_server \
+        --model /models/kaiju-merged \
+        --served-model-name '${SERVED_MODEL}' \
+        --host 0.0.0.0 \
+        --port '${PORT}' \
+        --max-model-len '${CONTEXT_LENGTH}' \
+        --gpu-memory-utilization '${GPU_UTIL}' \
+        --trust-remote-code \
+        --language-model-only \
+        --dtype bfloat16 \
+        --tool-call-parser qwen3_coder \
+        --reasoning-parser qwen3 \
+        ${EXTRA_ARGS_Q} \
+        --uvicorn-log-level info' 2>&1 | tee \${LOG}\"
+    echo 'started session: ${SESSION}'
+  fi
+  echo 'log:' \"\${LOG}\"
+  echo 'model: ${SERVED_MODEL}'
+"
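
Review note on the `EXTRA_ARGS_Q` handling above: optional flags are collected in an array and then flattened with `printf '%q '` so they survive being embedded in the remote command string. A minimal sketch of that quoting step in isolation, assuming bash; `quote_args` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Sketch: shell-quote a list of arguments into one string that is safe to
# splice into another shell command. quote_args is a hypothetical helper.
quote_args() {
  local quoted=""
  # With no arguments, printf '%q ' would still emit an empty quoted word,
  # so guard the call the same way the start script does.
  if (( $# )); then
    printf -v quoted '%q ' "$@"
  fi
  printf '%s' "${quoted}"
}
```

Arguments without special characters pass through unchanged, while an argument containing a space comes back escaped, so re-parsing the string yields the original words.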

scripts/stop-qwen36-merged-sglang.sh ADDED

	@@ -0,0 +1,18 @@

+#!/usr/bin/env bash
+set -euo pipefail
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+# shellcheck source=scripts/gojira-b-ssh-lib.sh
+source "${SCRIPT_DIR}/gojira-b-ssh-lib.sh"
+kaiju_gojira_b_init
+PORT="${KAIJU_QWEN36_MERGED_PORT:-18083}"
+SESSION="${KAIJU_QWEN36_MERGED_SESSION:-kaiju_qwen36_v18_merged_sglang}"
+CONTAINER="qwen36-merged-sglang-${PORT}"
+kaiju_gojira_b_ssh "
+  set -euo pipefail
+  tmux kill-session -t '${SESSION}' >/dev/null 2>&1 || true
+  sudo docker rm -f '${CONTAINER}' >/dev/null 2>&1 || true
+  echo 'stopped ${SESSION} / ${CONTAINER}'
+"

scripts/stop-qwen36-merged-vllm.sh ADDED

	@@ -0,0 +1,18 @@

+#!/usr/bin/env bash
+set -euo pipefail
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+# shellcheck source=scripts/gojira-b-ssh-lib.sh
+source "${SCRIPT_DIR}/gojira-b-ssh-lib.sh"
+kaiju_gojira_b_init
+PORT="${KAIJU_VLLM_PORT:-18084}"
+SESSION="${KAIJU_VLLM_SESSION:-kaiju_qwen36_v18_merged_vllm}"
+CONTAINER="qwen36-merged-vllm-${PORT}"
+kaiju_gojira_b_ssh "
+  set -euo pipefail
+  tmux kill-session -t '${SESSION}' >/dev/null 2>&1 || true
+  sudo docker rm -f '${CONTAINER}' >/dev/null 2>&1 || true
+  echo 'stopped ${SESSION} / ${CONTAINER}'
+"
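
Review note on the two stop scripts: they differ only in the runtime name and default port used to build the tmux session and container names. A minimal sketch of that naming convention as one function, assuming the default session names shown above; `kaiju_stop_targets` is a hypothetical name and is not part of `gojira-b-ssh-lib.sh`.

```bash
#!/usr/bin/env bash
# Sketch: derive the default tmux session and container names that the stop
# scripts use, from a runtime name and port. kaiju_stop_targets is hypothetical.
kaiju_stop_targets() {
  local runtime="$1" port="$2"
  printf 'kaiju_qwen36_v18_merged_%s qwen36-merged-%s-%s\n' \
    "${runtime}" "${runtime}" "${port}"
}
```

A shared stop script could read the two names with `read -r session container < <(kaiju_stop_targets vllm 18084)` and then run the same `tmux kill-session` and `docker rm -f` steps for either runtime.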