Upload Kaiju Coder 7 adapter release package


Files changed (9)

EVAL_SCOREBOARD.md +5 -1
FINAL_RELEASE_REPORT.md +30 -16
GOAL_COMPLETION_AUDIT.md +2 -2
HF_UPLOAD_EVIDENCE.md +3 -3
LOCAL_TEST_INSTRUCTIONS.md +19 -13
PUBLIC_TESTING_QUICKSTART.md +24 -7
QUANTIZATION_PLAN.md +41 -5
SERVING_BENCHMARKS.md +98 -0
scripts/check_hf_uploaded_release.py +1 -1

EVAL_SCOREBOARD.md CHANGED

@@ -35,7 +35,7 @@ This scoreboard tracks the current release-candidate evidence. Do not publish we
 | Kaiju Coder 7 restored 32k OpenCode one-file smoke | `opencode run -m kaiju/kaiju-coder-7 --agent kaiju-coder-7 --dir /tmp/kaiju-opencode-32k-final-smoke 'Create hello.txt with exactly: Kaiju Coder 7 final 32k ok'` | Passed; wrote `hello.txt` with exactly `Kaiju Coder 7 final 32k ok` | 2026-06-03 |
 | Kaiju Coder 7 current restored 16k direct API smoke | `python3 scripts/benchmark_kaiju_serving.py --contexts 16384 --prompts identity --max-tokens 64 --timeout 120` | Passed; latest run `runs/benchmarks/20260603T174545Z-kaiju-coder-7-serving/summary.md`, identity `2.3s`, `26` chars | 2026-06-03 |
 | Kaiju Coder 7 current restored 16k OpenCode one-file smoke | `mkdir -p /tmp/kaiju-opencode-fresh-public-smoke && opencode run -m kaiju/kaiju-coder-7 --agent kaiju-coder-7 --dir /tmp/kaiju-opencode-fresh-public-smoke --dangerously-skip-permissions 'Create hello.txt with exactly: Kaiju Coder 7 fresh public smoke ok'` | Passed; `/v1/models` returned `kaiju-coder-7`, max model len `16384`; wrote `hello.txt` with exactly `Kaiju Coder 7 fresh public smoke ok` | 2026-06-03 |
-| Kaiju Coder 7 packaged public OpenCode smoke | `python3 scripts/run_kaiju_public_opencode_smoke.py --timeout 900 --keep-dir` | Passed; latest run `runs/public-opencode-smoke/20260603T182222Z/summary.md`, `4/4` checks passed; installer dry-run, OpenCode `1.15.13`, live 16k model, and file written only in the requested temp workspace | 2026-06-03 |
 | Kaiju Coder 7 loop-guarded OpenCode install | `python3 scripts/install_kaiju_opencode_profile.py`; `opencode run -m kaiju/kaiju-coder-7 --agent kaiju-coder-7 --dir /tmp/kaiju-opencode-loopguard-smoke --dangerously-skip-permissions 'Create loopguard.txt with exactly: Kaiju Coder 7 loop guard installed'` | Passed; config includes `/Users/richardecholsai7/.config/opencode/kaiju-no-autocontinue.mjs`; wrote `loopguard.txt` with exact requested content and exited cleanly | 2026-06-03 |
 | Current harnessed OpenCode customer-readiness pack | `python3 scripts/run_kaiju_opencode_customer_pack.py --mode harnessed` | Passed; latest run `runs/opencode-customer-readiness/20260603T185835Z/summary.md`, `4/4` tasks passed and `28/28` required files written, including release provenance and safety review | 2026-06-03 |
 | Paid API Worker scaffold | `cd gateway/cloudflare-worker && npm run check && npm run preflight` | Passed `16/16` Worker tests and `17` scaffold preflight checks; covers bearer auth, inactive keys, insufficient credits, debit/refund, rate limit before debit, model `kaiju-coder-7` enforcement, stream/thinking/token caps, secret-content rejection without logging, signed Stripe Checkout top-up idempotency, origin-only R2 artifact upload, account-scoped artifact download, guarded Cloudflare resource prep, Wrangler dry-run deploy, sanitized paid-launch evidence template packaging, reviewed Cloudflare bindings template, binding applier guardrails, and sanitized evidence collection helper | 2026-06-03 |
@@ -43,6 +43,10 @@ This scoreboard tracks the current release-candidate evidence. Do not publish we
 | Kaiju Coder 7 runtime-quantized vLLM serve | `KAIJU_VLLM_CONTEXT=16384 KAIJU_VLLM_QUANTIZATION=bitsandbytes KAIJU_VLLM_LOAD_FORMAT=bitsandbytes ./scripts/run-gojira-b-vllm-serving-benchmark.sh` | Passed at 8k and 16k; 16k identity `19.51s`, code patch `11.3s`; vLLM log reported about `17.8 GiB` model memory | 2026-06-03 |
 | Kaiju Coder 7 runtime-quantized business-doc smoke | `KAIJU_VLLM_CONTEXT=16384 KAIJU_VLLM_QUANTIZATION=bitsandbytes KAIJU_VLLM_LOAD_FORMAT=bitsandbytes KAIJU_VLLM_PROMPTS=business_doc KAIJU_VLLM_MAX_TOKENS=768 KAIJU_VLLM_PROMPT_TIMEOUT=420 ./scripts/run-gojira-b-vllm-serving-benchmark.sh` | Passed; business proposal `53.44s`, `1,610` chars, `30.127` chars/s; wrapper restored SGLang after completion | 2026-06-03 |
 | Kaiju Coder 7 runtime-quantized OpenCode one-file smoke | `bash scripts/run_kaiju_quantized_opencode_smoke.sh` | Passed at 16k after vLLM `--enable-auto-tool-choice`; OpenCode wrote `hello.txt` with exactly `Kaiju Coder 7 quantized runtime ok` | 2026-06-03 |
 | Hugging Face CLI install/auth check | `hf version && hf auth whoami && hf auth list` | `hf` installed locally at version `1.17.0`; auth user `restokes92`; token name `gojirakiyomikode` | 2026-06-03 |
 | Hugging Face private repo create attempt | `KAIJU_HF_UPLOAD_APPLY=1 bash scripts/upload_hf_release_staging.sh` with namespaces `RichardEchols`, `RMDWLLC`, and `restokes92` | Blocked by Hugging Face `403 Forbidden`; current token cannot create model repos in those namespaces | 2026-06-03 |
 | Hugging Face merged-model metadata and upload boundary | `bash scripts/prepare_hf_merged_model_metadata.sh`; `KAIJU_MERGED_METADATA_APPLY=1 bash scripts/prepare_hf_merged_model_metadata.sh`; `bash scripts/upload_hf_merged_model_from_gojira_b.sh`; `KAIJU_HF_UPLOAD_APPLY=1 bash scripts/upload_hf_merged_model_from_gojira_b.sh` | Metadata prep synced model card, quickstarts, provenance, benchmarks, evals, paid API status, final report, upstream license, and `MERGED_MODEL_RELEASE_MANIFEST.json` to Gojira-B; sudo rsync handled the root-owned merged folder; upload dry run confirmed metadata plus the `51G`/`14`-shard merged model before printing `hf upload-large-folder`; apply remains blocked by human review and Hugging Face namespace permission before any large upload | 2026-06-03 |

 | Kaiju Coder 7 restored 32k OpenCode one-file smoke | `opencode run -m kaiju/kaiju-coder-7 --agent kaiju-coder-7 --dir /tmp/kaiju-opencode-32k-final-smoke 'Create hello.txt with exactly: Kaiju Coder 7 final 32k ok'` | Passed; wrote `hello.txt` with exactly `Kaiju Coder 7 final 32k ok` | 2026-06-03 |
 | Kaiju Coder 7 current restored 16k direct API smoke | `python3 scripts/benchmark_kaiju_serving.py --contexts 16384 --prompts identity --max-tokens 64 --timeout 120` | Passed; latest run `runs/benchmarks/20260603T174545Z-kaiju-coder-7-serving/summary.md`, identity `2.3s`, `26` chars | 2026-06-03 |
 | Kaiju Coder 7 current restored 16k OpenCode one-file smoke | `mkdir -p /tmp/kaiju-opencode-fresh-public-smoke && opencode run -m kaiju/kaiju-coder-7 --agent kaiju-coder-7 --dir /tmp/kaiju-opencode-fresh-public-smoke --dangerously-skip-permissions 'Create hello.txt with exactly: Kaiju Coder 7 fresh public smoke ok'` | Passed; `/v1/models` returned `kaiju-coder-7`, max model len `16384`; wrote `hello.txt` with exactly `Kaiju Coder 7 fresh public smoke ok` | 2026-06-03 |
+| Kaiju Coder 7 packaged public OpenCode smoke | `python3 scripts/run_kaiju_public_opencode_smoke.py --timeout 900 --keep-dir` | Passed; latest run `runs/public-opencode-smoke/20260603T232928Z/summary.md`, `4/4` checks passed; installer dry-run, OpenCode `1.15.13`, live 16k model, and exact file written only in the requested temp workspace through the fast proxy | 2026-06-03 |
 | Kaiju Coder 7 loop-guarded OpenCode install | `python3 scripts/install_kaiju_opencode_profile.py`; `opencode run -m kaiju/kaiju-coder-7 --agent kaiju-coder-7 --dir /tmp/kaiju-opencode-loopguard-smoke --dangerously-skip-permissions 'Create loopguard.txt with exactly: Kaiju Coder 7 loop guard installed'` | Passed; config includes `/Users/richardecholsai7/.config/opencode/kaiju-no-autocontinue.mjs`; wrote `loopguard.txt` with exact requested content and exited cleanly | 2026-06-03 |
 | Current harnessed OpenCode customer-readiness pack | `python3 scripts/run_kaiju_opencode_customer_pack.py --mode harnessed` | Passed; latest run `runs/opencode-customer-readiness/20260603T185835Z/summary.md`, `4/4` tasks passed and `28/28` required files written, including release provenance and safety review | 2026-06-03 |
 | Paid API Worker scaffold | `cd gateway/cloudflare-worker && npm run check && npm run preflight` | Passed `16/16` Worker tests and `17` scaffold preflight checks; covers bearer auth, inactive keys, insufficient credits, debit/refund, rate limit before debit, model `kaiju-coder-7` enforcement, stream/thinking/token caps, secret-content rejection without logging, signed Stripe Checkout top-up idempotency, origin-only R2 artifact upload, account-scoped artifact download, guarded Cloudflare resource prep, Wrangler dry-run deploy, sanitized paid-launch evidence template packaging, reviewed Cloudflare bindings template, binding applier guardrails, and sanitized evidence collection helper | 2026-06-03 |
 | Kaiju Coder 7 runtime-quantized vLLM serve | `KAIJU_VLLM_CONTEXT=16384 KAIJU_VLLM_QUANTIZATION=bitsandbytes KAIJU_VLLM_LOAD_FORMAT=bitsandbytes ./scripts/run-gojira-b-vllm-serving-benchmark.sh` | Passed at 8k and 16k; 16k identity `19.51s`, code patch `11.3s`; vLLM log reported about `17.8 GiB` model memory | 2026-06-03 |
 | Kaiju Coder 7 runtime-quantized business-doc smoke | `KAIJU_VLLM_CONTEXT=16384 KAIJU_VLLM_QUANTIZATION=bitsandbytes KAIJU_VLLM_LOAD_FORMAT=bitsandbytes KAIJU_VLLM_PROMPTS=business_doc KAIJU_VLLM_MAX_TOKENS=768 KAIJU_VLLM_PROMPT_TIMEOUT=420 ./scripts/run-gojira-b-vllm-serving-benchmark.sh` | Passed; business proposal `53.44s`, `1,610` chars, `30.127` chars/s; wrapper restored SGLang after completion | 2026-06-03 |
 | Kaiju Coder 7 runtime-quantized OpenCode one-file smoke | `bash scripts/run_kaiju_quantized_opencode_smoke.sh` | Passed at 16k after vLLM `--enable-auto-tool-choice`; OpenCode wrote `hello.txt` with exactly `Kaiju Coder 7 quantized runtime ok` | 2026-06-03 |
+| Kaiju Coder 7 fast proxy plus website harness speed pass | `python3 scripts/run_kaiju_router.py --kind website --openai-base-url http://127.0.0.1:18181/v1 --model kaiju-coder-7 ...` and OpenCode through `http://127.0.0.1:18181/v1` | Passed; local fast proxy forwards to vLLM bitsandbytes on `18084`; direct website harness wrote `9,257` chars in `7.31s`; router website passed all checks in `7.20s`; local-proxy router website passed in `4.67s`; public OpenCode smoke through the proxy passed in about `40s` end to end | 2026-06-03 |
+| Persisted quantization support probe | `./scripts/probe-gojira-b-persisted-quantization.sh` | Passed as evidence probe; AWQ/GPTQ normal installs are not clean against the Qwen3.5-capable stack tonight, `llmcompressor --no-deps` preserves config support but needs a pinned dependency env, and `llama.cpp` supports `Qwen3_5ForConditionalGeneration` with Q8_0 dry-run passing | 2026-06-03 |
+| GGUF Q8_0 persisted conversion | `./scripts/run-gojira-b-kaiju-gguf-convert.sh` | Converted candidate at `/home/richardecholsai5/kaiju-coder/models/kaiju-coder-7-gguf/kaiju-coder-7-Q8_0.gguf`, `27G`, SHA256 `596a2c227a429c7309db753061d88d71ee3f8a3b48f17e41ba9d81b0f55bdd4e`; runtime smoke still required before public quantized-weights release | 2026-06-03 |
+| Public business-owner demo pack | `python3 scripts/run_kaiju_public_demo_pack.py --openai-base-url http://127.0.0.1:18181/v1 --model kaiju-coder-7 --planner-timeout 90` | Passed `4/4` through the fast proxy in `84.43s`: website `24.59s`, owner AI company pack `29.99s` with `19` files, Stripe safety plan `9.93s`, CSV parser artifact `19.93s`; run `runs/public-demo-pack/20260603T232534Z/summary.md` | 2026-06-03 |
 | Hugging Face CLI install/auth check | `hf version && hf auth whoami && hf auth list` | `hf` installed locally at version `1.17.0`; auth user `restokes92`; token name `gojirakiyomikode` | 2026-06-03 |
 | Hugging Face private repo create attempt | `KAIJU_HF_UPLOAD_APPLY=1 bash scripts/upload_hf_release_staging.sh` with namespaces `RichardEchols`, `RMDWLLC`, and `restokes92` | Blocked by Hugging Face `403 Forbidden`; current token cannot create model repos in those namespaces | 2026-06-03 |
 | Hugging Face merged-model metadata and upload boundary | `bash scripts/prepare_hf_merged_model_metadata.sh`; `KAIJU_MERGED_METADATA_APPLY=1 bash scripts/prepare_hf_merged_model_metadata.sh`; `bash scripts/upload_hf_merged_model_from_gojira_b.sh`; `KAIJU_HF_UPLOAD_APPLY=1 bash scripts/upload_hf_merged_model_from_gojira_b.sh` | Metadata prep synced model card, quickstarts, provenance, benchmarks, evals, paid API status, final report, upstream license, and `MERGED_MODEL_RELEASE_MANIFEST.json` to Gojira-B; sudo rsync handled the root-owned merged folder; upload dry run confirmed metadata plus the `51G`/`14`-shard merged model before printing `hf upload-large-folder`; apply remains blocked by human review and Hugging Face namespace permission before any large upload | 2026-06-03 |
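
The `chars/s` figure in the runtime-quantized business-doc row is characters written divided by wall-clock seconds. A minimal sketch of that arithmetic, assuming the scoreboard rounds to three decimals; the function name `chars_per_second` is hypothetical and is not taken from the benchmark scripts:

```python
def chars_per_second(chars: int, seconds: float) -> float:
    """Return output throughput in characters per second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return chars / seconds


# Business-doc smoke row: 1,610 chars in 53.44s.
print(round(chars_per_second(1610, 53.44), 3))  # 30.127
```

The same helper could be applied to other rows that report both a character count and a duration, for example the website harness row.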

FINAL_RELEASE_REPORT.md CHANGED

@@ -1,6 +1,6 @@
 # Kaiju Coder 7 Final Release Report
-Generated: `2026-06-03T21:12:14Z`
 Product name: `Kaiju Coder 7`
 Public model id: `kaiju-coder-7`
@@ -24,11 +24,11 @@ Stripe live-mode switch and controlled live payment verification.
 | Field | Value |
 |---|---|
-| Status | `pass` |
 | Base URL | `http://100.109.109.14:18083/v1` |
-| Model id | `kaiju-coder-7` |
-| Max model length | `16384` |
-| Detail | `` |
 Recommended default today: `16k` context through `kaiju-coder-7`. Higher
 context has benchmark evidence, but the currently parked default is 16k for
@@ -38,9 +38,9 @@ stability and speed.
 | Area | Result |
 |---|---|
-| Local public-testing readiness | `ready=True pass=24 fail=0 manual=0 rc=0` |
-| Hugging Face release readiness | `ready=True pass=24 fail=0 manual=0 rc=0` |
-| Public launch readiness | `ready=True pass=24 fail=0 manual=0 rc=0` |
 | Hugging Face staging integrity | `ready=True pass=6 fail=0 manual=0 rc=0` |
 | Paid API launch readiness | `ready=True pass=27 fail=0 manual=0 rc=0` |
@@ -52,18 +52,22 @@ stability and speed.
 | Small helper repos uploaded | `True` |
 | Merged model uploaded | `True` |
 | Merged repo | `RMDWLLC/kaiju-coder-7` |
-| Merged repo SHA | `736af44add9321f74e8603cd739245fc0853d62c` |
 | Merged upload size | `39 files / 53.8G / 14 safetensors shards recorded` |
 | Download status | `public downloads verified; no active private-storage blocker recorded` |
 | Visibility decision | `PUBLIC`; `HF_VISIBILITY_DECISION: PUBLIC` recorded in human review |
 ## Hugging Face Release Blockers
-- No matching checks.
 ## Public Launch Blockers
-- No matching checks.
 ## Paid API Launch Blockers
@@ -93,7 +97,7 @@ stability and speed.
 | Paid API launch evidence template | `release/paid-api-launch-evidence.example.json` |
 | Cloudflare bindings template | `release/cloudflare-bindings.example.json` |
 | Cloudflare bindings applier | `scripts/apply_paid_api_cloudflare_bindings.py` |
-| Latest direct API smoke | `runs/benchmarks/20260603T193000Z-kaiju-coder-7-serving/summary.md` |
 | Latest OpenCode customer pack | `runs/opencode-customer-readiness/20260603T185835Z/summary.md` |
 | Latest public OpenCode smoke | `runs/public-opencode-smoke` |
@@ -133,7 +137,7 @@ human release review explicitly approves public paid API launch.
 ## Changed Files
-`git status --short` currently reports `116` changed paths.
 | State | Path |
 |---|---|
@@ -153,8 +157,10 @@ human release review explicitly approves public paid API launch.
 | M | `gateway/cloudflare-worker/src/index.js` |
 | M | `gateway/cloudflare-worker/test/index.test.js` |
 | M | `gateway/cloudflare-worker/wrangler.jsonc` |
 | M | `kaiju_harness/router.py` |
 | M | `kaiju_harness/verification.py` |
 | D | `models/README.md` |
 | D | `models/qwen3.6-27b-base.md` |
 | D | `models/qwen3.6-27b-fp8.md` |
@@ -164,14 +170,17 @@ human release review explicitly approves public paid API launch.
 | M | `release/MODEL_CARD_DRAFT.md` |
 | M | `scripts/build_sft_dataset.py` |
 | M | `scripts/check-gojira-b-capacity.sh` |
 | M | `scripts/run-gojira-b-qwen36-lora-eval.sh` |
 | M | `scripts/run-gojira-b-qwen36-lora-sglang-eval.sh` |
 | M | `scripts/run-gojira-b-qwen36-lora-train.sh` |
 | M | `scripts/run_kaiju_api_harness_smoke.py` |
 | M | `scripts/start-qwen36-lora-sglang.sh` |
 | M | `scripts/stop-qwen36-lora-sglang.sh` |
 | M | `scripts/validate_training_data.py` |
 | M | `scripts/watch-gojira-b-qwen36-lora-train.sh` |
 | ?? | `.opencode/` |
 | ?? | `datasets/candidates/v1.7-rmdw-business-owner-suite.jsonl` |
 | ?? | `datasets/v1.7-targets.json` |
@@ -196,6 +205,7 @@ human release review explicitly approves public paid API launch.
 | ?? | `release/UPSTREAM_LICENSE_CHECK.md` |
 | ?? | `release/bundles/` |
 | ?? | `release/cloudflare-bindings.example.json` |
 | ?? | `release/hf-release-permission-evidence.example.json` |
 | ?? | `release/hf-release-permission-evidence.json` |
 | ?? | `release/huggingface/` |
@@ -225,17 +235,21 @@ human release review explicitly approves public paid API launch.
 | ?? | `scripts/generate_kaiju_final_report.py` |
 | ?? | `scripts/gojira-b-ssh-lib.sh` |
 | ?? | `scripts/install_kaiju_opencode_profile.py` |
 | ?? | `scripts/make_hf_release_public.sh` |
 | ?? | `scripts/opencode-kaiju-no-autocontinue.mjs` |
 | ?? | `scripts/prepare_hf_merged_model_metadata.sh` |
 | ?? | `scripts/prepare_hf_release_staging.sh` |
 | ?? | `scripts/prepare_paid_api_cloudflare_resources.sh` |
 | ?? | `scripts/probe-gojira-b-kaiju-quantization.sh` |
 | ?? | `scripts/refresh_kaiju_release_evidence.py` |
 | ?? | `scripts/run-gojira-b-qwen36-lora-merge.sh` |
 | ?? | `scripts/run-gojira-b-vllm-serving-benchmark.sh` |
 | ?? | `scripts/run_kaiju_business_owner_rc_smoke.py` |
 | ?? | `scripts/run_kaiju_opencode_customer_pack.py` |
 | ?? | `scripts/run_kaiju_public_opencode_smoke.py` |
 | ?? | `scripts/run_kaiju_quantized_opencode_smoke.sh` |
 | ?? | `scripts/start-qwen36-merged-sglang.sh` |
@@ -262,9 +276,9 @@ human release review explicitly approves public paid API launch.
 | git HEAD | `git rev-parse HEAD` | 0 |
 | git origin/main | `git rev-parse origin/main` | 0 |
 | git status | `git status --short` | 0 |
-| local readiness | `/opt/homebrew/opt/python@3.14/bin/python3.14 scripts/check_kaiju_public_release_readiness.py --mode local --json --base-url http://100.109.109.14:18083/v1 --live-timeout 5 --staging-dir /tmp/kaiju-coder-7-hf-staging` | 0 |
-| HF release readiness | `/opt/homebrew/opt/python@3.14/bin/python3.14 scripts/check_kaiju_public_release_readiness.py --mode hf-release --json --base-url http://100.109.109.14:18083/v1 --live-timeout 5 --staging-dir /tmp/kaiju-coder-7-hf-staging` | 0 |
-| public readiness | `/opt/homebrew/opt/python@3.14/bin/python3.14 scripts/check_kaiju_public_release_readiness.py --mode public --json --base-url http://100.109.109.14:18083/v1 --live-timeout 5 --staging-dir /tmp/kaiju-coder-7-hf-staging` | 0 |
 | HF staging integrity | `/opt/homebrew/opt/python@3.14/bin/python3.14 scripts/check_hf_staging_integrity.py --staging-dir /tmp/kaiju-coder-7-hf-staging --require-checksums --json` | 0 |
 | paid API launch readiness | `/opt/homebrew/opt/python@3.14/bin/python3.14 scripts/check_paid_api_readiness.py --mode launch --json` | 0 |

 # Kaiju Coder 7 Final Release Report
+Generated: `2026-06-03T23:34:00Z`
 Product name: `Kaiju Coder 7`
 Public model id: `kaiju-coder-7`
 | Field | Value |
 |---|---|
+| Status | `fail` |
 | Base URL | `http://100.109.109.14:18083/v1` |
+| Model id | `unknown` |
+| Max model length | `unknown` |
+| Detail | `URLError(ConnectionRefusedError(61, 'Connection refused'))` |
 Recommended default today: `16k` context through `kaiju-coder-7`. Higher
 context has benchmark evidence, but the currently parked default is 16k for
 | Area | Result |
 |---|---|
+| Local public-testing readiness | `ready=False pass=23 fail=1 manual=0 rc=1` |
+| Hugging Face release readiness | `ready=False pass=23 fail=1 manual=0 rc=1` |
+| Public launch readiness | `ready=False pass=23 fail=1 manual=0 rc=1` |
 | Hugging Face staging integrity | `ready=True pass=6 fail=0 manual=0 rc=0` |
 | Paid API launch readiness | `ready=True pass=27 fail=0 manual=0 rc=0` |
 | Small helper repos uploaded | `True` |
 | Merged model uploaded | `True` |
 | Merged repo | `RMDWLLC/kaiju-coder-7` |
+| Merged repo SHA | `00ba85985102a14838dbb8a5692d9a75ce9da15a` |
 | Merged upload size | `39 files / 53.8G / 14 safetensors shards recorded` |
 | Download status | `public downloads verified; no active private-storage blocker recorded` |
 | Visibility decision | `PUBLIC`; `HF_VISIBILITY_DECISION: PUBLIC` recorded in human review |
 ## Hugging Face Release Blockers
+| Status | Check | Detail |
+|---|---|---|
+| fail | live runtime | could not read http://100.109.109.14:18083/v1/models: URLError(ConnectionRefusedError(61, 'Connection refused')) |
 ## Public Launch Blockers
+| Status | Check | Detail |
+|---|---|---|
+| fail | live runtime | could not read http://100.109.109.14:18083/v1/models: URLError(ConnectionRefusedError(61, 'Connection refused')) |
 ## Paid API Launch Blockers
 | Paid API launch evidence template | `release/paid-api-launch-evidence.example.json` |
 | Cloudflare bindings template | `release/cloudflare-bindings.example.json` |
 | Cloudflare bindings applier | `scripts/apply_paid_api_cloudflare_bindings.py` |
+| Latest direct API smoke | `runs/benchmarks/20260603T223337Z-kaiju-coder-7-serving/summary.md` |
 | Latest OpenCode customer pack | `runs/opencode-customer-readiness/20260603T185835Z/summary.md` |
 | Latest public OpenCode smoke | `runs/public-opencode-smoke` |
 ## Changed Files
+`git status --short` currently reports `126` changed paths.
 | State | Path |
 |---|---|
 | M | `gateway/cloudflare-worker/src/index.js` |
 | M | `gateway/cloudflare-worker/test/index.test.js` |
 | M | `gateway/cloudflare-worker/wrangler.jsonc` |
+| M | `gateway/gojira-local/server.py` |
 | M | `kaiju_harness/router.py` |
 | M | `kaiju_harness/verification.py` |
+| M | `kaiju_harness/website.py` |
 | D | `models/README.md` |
 | D | `models/qwen3.6-27b-base.md` |
 | D | `models/qwen3.6-27b-fp8.md` |
 | M | `release/MODEL_CARD_DRAFT.md` |
 | M | `scripts/build_sft_dataset.py` |
 | M | `scripts/check-gojira-b-capacity.sh` |
+| M | `scripts/check_kaiju_gateway_policy.py` |
 | M | `scripts/run-gojira-b-qwen36-lora-eval.sh` |
 | M | `scripts/run-gojira-b-qwen36-lora-sglang-eval.sh` |
 | M | `scripts/run-gojira-b-qwen36-lora-train.sh` |
 | M | `scripts/run_kaiju_api_harness_smoke.py` |
+| M | `scripts/run_kaiju_router.py` |
 | M | `scripts/start-qwen36-lora-sglang.sh` |
 | M | `scripts/stop-qwen36-lora-sglang.sh` |
 | M | `scripts/validate_training_data.py` |
 | M | `scripts/watch-gojira-b-qwen36-lora-train.sh` |
+| M | `tests/test_website_harness.py` |
 | ?? | `.opencode/` |
 | ?? | `datasets/candidates/v1.7-rmdw-business-owner-suite.jsonl` |
 | ?? | `datasets/v1.7-targets.json` |
 | ?? | `release/UPSTREAM_LICENSE_CHECK.md` |
 | ?? | `release/bundles/` |
 | ?? | `release/cloudflare-bindings.example.json` |
+| ?? | `release/gguf/` |
 | ?? | `release/hf-release-permission-evidence.example.json` |
 | ?? | `release/hf-release-permission-evidence.json` |
 | ?? | `release/huggingface/` |
 | ?? | `scripts/generate_kaiju_final_report.py` |
 | ?? | `scripts/gojira-b-ssh-lib.sh` |
 | ?? | `scripts/install_kaiju_opencode_profile.py` |
+| ?? | `scripts/kaiju_opencode_fast_proxy.py` |
 | ?? | `scripts/make_hf_release_public.sh` |
 | ?? | `scripts/opencode-kaiju-no-autocontinue.mjs` |
 | ?? | `scripts/prepare_hf_merged_model_metadata.sh` |
 | ?? | `scripts/prepare_hf_release_staging.sh` |
 | ?? | `scripts/prepare_paid_api_cloudflare_resources.sh` |
 | ?? | `scripts/probe-gojira-b-kaiju-quantization.sh` |
+| ?? | `scripts/probe-gojira-b-persisted-quantization.sh` |
 | ?? | `scripts/refresh_kaiju_release_evidence.py` |
+| ?? | `scripts/run-gojira-b-kaiju-gguf-convert.sh` |
 | ?? | `scripts/run-gojira-b-qwen36-lora-merge.sh` |
 | ?? | `scripts/run-gojira-b-vllm-serving-benchmark.sh` |
 | ?? | `scripts/run_kaiju_business_owner_rc_smoke.py` |
 | ?? | `scripts/run_kaiju_opencode_customer_pack.py` |
+| ?? | `scripts/run_kaiju_public_demo_pack.py` |
 | ?? | `scripts/run_kaiju_public_opencode_smoke.py` |
 | ?? | `scripts/run_kaiju_quantized_opencode_smoke.sh` |
 | ?? | `scripts/start-qwen36-merged-sglang.sh` |
 | git HEAD | `git rev-parse HEAD` | 0 |
 | git origin/main | `git rev-parse origin/main` | 0 |
 | git status | `git status --short` | 0 |
+| local readiness | `/opt/homebrew/opt/python@3.14/bin/python3.14 scripts/check_kaiju_public_release_readiness.py --mode local --json --base-url http://100.109.109.14:18083/v1 --live-timeout 5 --staging-dir /tmp/kaiju-coder-7-hf-staging` | 1 |
+| HF release readiness | `/opt/homebrew/opt/python@3.14/bin/python3.14 scripts/check_kaiju_public_release_readiness.py --mode hf-release --json --base-url http://100.109.109.14:18083/v1 --live-timeout 5 --staging-dir /tmp/kaiju-coder-7-hf-staging` | 1 |
+| public readiness | `/opt/homebrew/opt/python@3.14/bin/python3.14 scripts/check_kaiju_public_release_readiness.py --mode public --json --base-url http://100.109.109.14:18083/v1 --live-timeout 5 --staging-dir /tmp/kaiju-coder-7-hf-staging` | 1 |
 | HF staging integrity | `/opt/homebrew/opt/python@3.14/bin/python3.14 scripts/check_hf_staging_integrity.py --staging-dir /tmp/kaiju-coder-7-hf-staging --require-checksums --json` | 0 |
 | paid API launch readiness | `/opt/homebrew/opt/python@3.14/bin/python3.14 scripts/check_paid_api_readiness.py --mode launch --json` | 0 |
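
The `live runtime` failure in this report comes from reading `/v1/models` on the base URL. The sketch below shows one way such a probe could fill the report's Status, Model id, Max model length, and Detail fields. It is not the code in `scripts/check_kaiju_public_release_readiness.py`; the function names are hypothetical, and the `max_model_len` key is an assumption about what the serving stack returns for each model entry.

```python
import json
import urllib.error
import urllib.request


def summarize_models_payload(payload: dict) -> dict:
    """Reduce an OpenAI-style /v1/models payload to four report fields."""
    entries = payload.get("data") or []
    if not entries:
        return {"status": "fail", "model_id": "unknown",
                "max_model_len": "unknown", "detail": "no models listed"}
    first = entries[0]
    return {
        "status": "pass",
        "model_id": first.get("id", "unknown"),
        # Assumption: the server reports max_model_len per model entry.
        "max_model_len": first.get("max_model_len", "unknown"),
        "detail": "",
    }


def probe_live_runtime(base_url: str, timeout: float = 5.0) -> dict:
    """Fetch <base_url>/models; on any network or parse error, report a fail."""
    url = base_url.rstrip("/") + "/models"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return summarize_models_payload(json.load(resp))
    except (urllib.error.URLError, OSError, ValueError) as exc:
        return {"status": "fail", "model_id": "unknown",
                "max_model_len": "unknown", "detail": repr(exc)}


if __name__ == "__main__":
    print(probe_live_runtime("http://100.109.109.14:18083/v1"))
```

Keeping the payload parsing separate from the network call means the pass/fail mapping can be tested without a running server.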

GOAL_COMPLETION_AUDIT.md CHANGED

@@ -1,6 +1,6 @@
 # Kaiju Coder 7 Goal Completion Audit
-Generated: `2026-06-03T21:12:21Z`
 Overall: `complete`
 Summary: `18 passed / 0 blocked / 0 manual`
@@ -28,7 +28,7 @@ This audit maps the active Kaiju Coder 7 objective to current evidence. It is st
 | OpenCode | Lean Kaiju-specific OpenCode config/agent minimizes prompt overhead and disables synthetic auto-continue loops. | `passed` | .opencode/agents/kaiju-coder-7.md; scripts/opencode-kaiju-no-autocontinue.mjs; scripts/install_kaiju_opencode_profile.py |  |
 | OpenCode | opencode -m kaiju/kaiju-coder-7 works from this Mac with the recommended config. | `passed` | runs/public-opencode-smoke latest passing summary; scripts/run_kaiju_public_opencode_smoke.py |  |
 | OpenCode | Customer-readiness pack passes without wrong-directory output, fake compaction completion, missing files, or secret leakage. | `passed` | runs/opencode-customer-readiness/20260603T185835Z/summary.md |  |
-| Runtime | Direct API smoke passes using model=kaiju-coder-7. | `passed` | runs/benchmarks/20260603T193000Z-kaiju-coder-7-serving/summary.md |  |
 | Runtime | 12k, 16k, 24k, and 32k context benchmarks are recorded with a recommended default. | `passed` | release/SERVING_BENCHMARKS.md records 12288, 16384, 24576, 32768 and recommends 16k live default |  |
 | Runtime | SGLang and vLLM/practical faster serving path are benchmarked honestly. | `passed` | release/SERVING_BENCHMARKS.md; release/quantized-runtime/README.md |  |
 | Runtime | At least one public-friendly quantized/local candidate is working or clearly documented as blocked with evidence. | `passed` | release/quantized-runtime/README.md documents vLLM bitsandbytes runtime candidate and persisted-weights limitation |  |

 # Kaiju Coder 7 Goal Completion Audit
+Generated: `2026-06-03T23:35:30Z`
 Overall: `complete`
 Summary: `18 passed / 0 blocked / 0 manual`
 | OpenCode | Lean Kaiju-specific OpenCode config/agent minimizes prompt overhead and disables synthetic auto-continue loops. | `passed` | .opencode/agents/kaiju-coder-7.md; scripts/opencode-kaiju-no-autocontinue.mjs; scripts/install_kaiju_opencode_profile.py |  |
 | OpenCode | opencode -m kaiju/kaiju-coder-7 works from this Mac with the recommended config. | `passed` | runs/public-opencode-smoke latest passing summary; scripts/run_kaiju_public_opencode_smoke.py |  |
 | OpenCode | Customer-readiness pack passes without wrong-directory output, fake compaction completion, missing files, or secret leakage. | `passed` | runs/opencode-customer-readiness/20260603T185835Z/summary.md |  |
+| Runtime | Direct API smoke passes using model=kaiju-coder-7. | `passed` | runs/benchmarks/20260603T223337Z-kaiju-coder-7-serving/summary.md |  |
 | Runtime | 12k, 16k, 24k, and 32k context benchmarks are recorded with a recommended default. | `passed` | release/SERVING_BENCHMARKS.md records 12288, 16384, 24576, 32768 and recommends 16k live default |  |
 | Runtime | SGLang and vLLM/practical faster serving path are benchmarked honestly. | `passed` | release/SERVING_BENCHMARKS.md; release/quantized-runtime/README.md |  |
 | Runtime | At least one public-friendly quantized/local candidate is working or clearly documented as blocked with evidence. | `passed` | release/quantized-runtime/README.md documents vLLM bitsandbytes runtime candidate and persisted-weights limitation |  |

HF_UPLOAD_EVIDENCE.md CHANGED

@@ -6,10 +6,10 @@ Generated: `2026-06-03T20:36:26Z`
 | Repo | Visibility | Evidence |
 |---|---|---|
-| `RMDWLLC/kaiju-coder-7-adapter` | public | Final visible SHA `67bb48b8115b820cd8b01d1778d2610d9ce63692`; public visibility verified after 2026-06-03 paid API evidence refresh. |
 | `RMDWLLC/kaiju-coder-7-opencode` | public | Final visible SHA `3c9c75416ffb41645a1a959beb99baeff6972fb8`; public visibility and OpenCode installer dry-run verified. |
 | `RMDWLLC/kaiju-coder-7-quantized-runtime` | public | Uploaded at commit `6d7449a3ffac68ed1d591c57b044ba599cee8b11`; public visibility verified. |
-| `RMDWLLC/kaiju-coder-7` | public | `hf upload-large-folder` completed successfully, then metadata/evidence refreshed at final visible SHA `736af44add9321f74e8603cd739245fc0853d62c`; public metadata reports `private: false`. |
 These SHAs are a point-in-time release evidence snapshot. Uploading this
 evidence file itself creates another metadata commit, so use `hf models info`
@@ -78,7 +78,7 @@ Result:
 - The downloaded OpenCode helper installer dry-run passed and included the
   loop guard.
 - Merged model metadata reports `private: false`, SHA
-  `736af44add9321f74e8603cd739245fc0853d62c`, and lists all `14`
   safetensors shards.
 The earlier private-storage limit blocked private file downloads after the

 | Repo | Visibility | Evidence |
 |---|---|---|
+| `RMDWLLC/kaiju-coder-7-adapter` | public | Final visible SHA `5016ab9e5f32ca3f94d49a4dbed65de2729bd6ce`; public visibility verified after 2026-06-03 paid API evidence refresh. |
 | `RMDWLLC/kaiju-coder-7-opencode` | public | Final visible SHA `3c9c75416ffb41645a1a959beb99baeff6972fb8`; public visibility and OpenCode installer dry-run verified. |
 | `RMDWLLC/kaiju-coder-7-quantized-runtime` | public | Uploaded at commit `6d7449a3ffac68ed1d591c57b044ba599cee8b11`; public visibility verified. |
+| `RMDWLLC/kaiju-coder-7` | public | `hf upload-large-folder` completed successfully, then metadata/evidence refreshed at final visible SHA `00ba85985102a14838dbb8a5692d9a75ce9da15a`; public metadata reports `private: false`. |
 These SHAs are a point-in-time release evidence snapshot. Uploading this
 evidence file itself creates another metadata commit, so use `hf models info`
 - The downloaded OpenCode helper installer dry-run passed and included the
   loop guard.
 - Merged model metadata reports `private: false`, SHA
+  `00ba85985102a14838dbb8a5692d9a75ce9da15a`, and lists all `14`
   safetensors shards.
 The earlier private-storage limit blocked private file downloads after the

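The evidence above records three facts per repo: visibility, visible SHA, and the safetensors shard count. A minimal sketch of comparing those facts against a repo metadata object follows. The helper `check_repo_snapshot` is hypothetical. It assumes an object exposing `sha`, `private`, and `siblings` (each with `rfilename`), which is the shape `huggingface_hub` model info objects are expected to have; here a stand-in object is used so the example runs offline.

```python
from types import SimpleNamespace


def check_repo_snapshot(info, expected_sha: str, expected_shards: int) -> list:
    """Return a list of mismatches between repo metadata and recorded evidence."""
    shards = [s.rfilename for s in (info.siblings or [])
              if s.rfilename.endswith(".safetensors")]
    problems = []
    if info.private:
        problems.append("repo is private")
    if info.sha != expected_sha:
        problems.append(f"sha is {info.sha}, evidence recorded {expected_sha}")
    if len(shards) != expected_shards:
        problems.append(f"found {len(shards)} safetensors shards, expected {expected_shards}")
    return problems


# Stand-in metadata for illustration only; shard file names are made up.
fake_info = SimpleNamespace(
    sha="00ba85985102a14838dbb8a5692d9a75ce9da15a",
    private=False,
    siblings=[SimpleNamespace(rfilename=f"shard-{i}.safetensors") for i in range(14)],
)
print(check_repo_snapshot(fake_info, "00ba85985102a14838dbb8a5692d9a75ce9da15a", 14))  # []
```

With `huggingface_hub` installed, the object returned by `HfApi().model_info("RMDWLLC/kaiju-coder-7")` could be passed in place of `fake_info`. Because each evidence upload creates another metadata commit, a SHA mismatch on its own may simply mean the snapshot is older than the repo.
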
LOCAL_TEST_INSTRUCTIONS.md CHANGED

@@ -1,6 +1,6 @@
 # Kaiju Coder 7 Local Test Instructions
-Use these commands from the repo root. The public release name is Kaiju Coder 7. Internally, this build is backed by the v1.8 adapter under `runs/qwen36-27b-lora-v1.8-business-owner/adapter`. The release-candidate raw model path is the merged full model on Gojira B at `/home/richardecholsai5/kaiju-coder/models/Kaiju-Coder-Qwen3.6-27B-v1.8-merged`. The deterministic harness commands work locally now; the SGLang commands require Gojira B over Tailscale.
 ## Run The Local Release-Candidate Gate
@@ -24,26 +24,32 @@ KAIJU_MERGED_MODEL_DIR=/workspace/kaiju-coder/models/Kaiju-Coder-Qwen3.6-27B-v1.
 ## Start Kaiju Coder 7 Serving
-Use this for the current model-side candidate:
 ```bash
-KAIJU_QWEN36_MERGED_PORT=18083 \
-KAIJU_QWEN36_MERGED_SESSION=kaiju_qwen36_v18_merged_sglang \
-KAIJU_QWEN36_MERGED_CONTEXT=16384 \
-KAIJU_QWEN36_MERGED_MEM_FRACTION=0.85 \
-  ./scripts/start-qwen36-merged-sglang.sh
 ```
 Confirm readiness:
 ```bash
-curl http://100.109.109.14:18083/v1/models
 ```
 The high-context `32768` target has benchmark evidence in
-`release/SERVING_BENCHMARKS.md`, but the current restored Gojira-B endpoint is
-parked at `16384` for reliable local/OpenCode testing after the quantized-vLLM
-smoke work.
 ## Prepare Merged-Model Hugging Face Metadata
@@ -82,7 +88,7 @@ python3 scripts/run_kaiju_api_harness_smoke.py
 ```bash
 python3 evals/run_openai_compat_smoke.py \
-  --base-url http://100.109.109.14:18083/v1 \
   --model kaiju-coder-7 \
   --tasks evals/tasks/smoke.jsonl \
   --max-tasks 1 \
@@ -100,7 +106,7 @@ evals pass at acceptable latency:
 ```bash
 python3 evals/run_openai_compat_smoke.py \
-  --base-url http://100.109.109.14:18083/v1 \
   --model kaiju-coder-7 \
   --tasks evals/tasks/business-owner-v18-comparison.jsonl \
   --timeout 900 \

 # Kaiju Coder 7 Local Test Instructions
+Use these commands from the repo root. The public release name is Kaiju Coder 7. Internally, this build is backed by the v1.8 adapter under `runs/qwen36-27b-lora-v1.8-business-owner/adapter`. The release-candidate raw model path is the merged full model on Gojira B at `/home/richardecholsai5/kaiju-coder/models/Kaiju-Coder-Qwen3.6-27B-v1.8-merged`. The deterministic harness commands work locally now; the fastest current runtime is vLLM bitsandbytes on Gojira B over Tailscale with the local OpenCode fast proxy.
 ## Run The Local Release-Candidate Gate
 ## Start Kaiju Coder 7 Serving
+Use this for the fastest current model-side candidate:
 ```bash
+KAIJU_VLLM_CONTEXT=16384 \
+KAIJU_VLLM_QUANTIZATION=bitsandbytes \
+KAIJU_VLLM_LOAD_FORMAT=bitsandbytes \
+KAIJU_VLLM_GPU_UTIL=0.90 \
+  ./scripts/start-qwen36-merged-vllm.sh
 ```
 Confirm readiness:
 ```bash
+curl http://100.109.109.14:18084/v1/models
+```
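The readiness check can also be scripted. The following is a minimal sketch, assuming the endpoint returns an OpenAI-style model list. `find_model` and `check_ready` are illustrative names, not functions from the repo scripts, and `max_model_len` is a vLLM-specific field that other servers may omit.

```python
import json
import urllib.request
from typing import Optional

# Endpoint and model id taken from the instructions above.
BASE_URL = "http://100.109.109.14:18084/v1"
EXPECTED_MODEL = "kaiju-coder-7"


def find_model(payload: dict, model_id: str) -> Optional[dict]:
    """Return the entry for model_id from an OpenAI-style /v1/models payload."""
    for entry in payload.get("data", []):
        if entry.get("id") == model_id:
            return entry
    return None


def check_ready(base_url: str = BASE_URL, timeout: float = 10.0) -> dict:
    """Fetch /v1/models and raise if the expected model id is not listed."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
        payload = json.load(resp)
    entry = find_model(payload, EXPECTED_MODEL)
    if entry is None:
        raise RuntimeError(f"{EXPECTED_MODEL} not listed at {base_url}/models")
    return entry
```

Calling `check_ready()` from a machine that can reach the Tailscale address returns the model entry, so `entry.get("max_model_len")` can be compared against the intended context.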
+Then keep the Mac-side fast proxy pointed at that vLLM endpoint:
+```bash
+KAIJU_OPENAI_BASE_URL=http://100.109.109.14:18084/v1 \
+python3 scripts/kaiju_opencode_fast_proxy.py --host 127.0.0.1 --port 18181
 ```
 The high-context `32768` target has benchmark evidence in
+`release/SERVING_BENCHMARKS.md`, but the current speed/default path is 16k
+runtime-quantized vLLM plus the local fast proxy.
 ## Prepare Merged-Model Hugging Face Metadata
 ```bash
 python3 evals/run_openai_compat_smoke.py \
+  --base-url http://100.109.109.14:18084/v1 \
   --model kaiju-coder-7 \
   --tasks evals/tasks/smoke.jsonl \
   --max-tasks 1 \
 ```bash
 python3 evals/run_openai_compat_smoke.py \
+  --base-url http://100.109.109.14:18084/v1 \
   --model kaiju-coder-7 \
   --tasks evals/tasks/business-owner-v18-comparison.jsonl \
   --timeout 900 \

PUBLIC_TESTING_QUICKSTART.md CHANGED

@@ -19,7 +19,7 @@ Use this if you already have Kaiju Coder 7 served at an OpenAI-compatible
 ```bash
 git clone https://huggingface.co/RMDWLLC/kaiju-coder-7-opencode
 cd kaiju-coder-7-opencode
-python3 scripts/install_kaiju_opencode_profile.py --base-url http://127.0.0.1:18083/v1
 ```
 Then run OpenCode inside the project you want to edit:
@@ -65,23 +65,31 @@ the server to expose:
 ```text
 model id: kaiju-coder-7
-base URL: http://127.0.0.1:18083/v1
 context: 16384
 ```
 Then install the OpenCode helper with:
 ```bash
 git clone https://huggingface.co/RMDWLLC/kaiju-coder-7-opencode
 cd kaiju-coder-7-opencode
-python3 scripts/install_kaiju_opencode_profile.py --base-url http://127.0.0.1:18083/v1
 ```
 ### Path 3: Runtime-Quantized Local Candidate
 Use this only if you are comfortable with advanced serving setups. The current
-working quantized option is a runtime bitsandbytes recipe, not a separate
-persisted quantized weights repo.
 ```bash
 git clone https://huggingface.co/RMDWLLC/kaiju-coder-7-quantized-runtime
@@ -115,9 +123,12 @@ Expected result:
 - Public model id: `kaiju-coder-7`
 - OpenCode context: `16384`
 - Output cap for public testing: `2500`
 - Current reliable product path: model plus deterministic business-owner
-  harness plus verifier
-- Raw multi-file OpenCode generation: still too slow for broad paid API claims
 - Paid API: not public until launch preflight passes
 ## What Not To Claim Yet
@@ -134,15 +145,21 @@ Do claim:
 - Kaiju Coder 7 has a working local/OpenCode release candidate
 - the current tested OpenCode default is 16k context
 - the helper package includes a lean agent and compaction loop guard
 - the paid API scaffold has tests and a launch preflight, but is not yet public
 - the packaged public smoke verifies a fresh OpenCode one-file write before
   public claims are refreshed
 ## Current Blockers Before Public Release
 - Hugging Face repo creation still requires a write-capable token or namespace.
 - Full merged model upload has not completed; the merged folder must first have
   the metadata packet synced by `prepare_hf_merged_model_metadata.sh`.
 - Public paid API launch needs real Cloudflare D1/KV/R2 bindings, Wrangler
   secret verification, Stripe webhook staging evidence, staging traffic, latency
   evidence, and rollback proof.

 ```bash
 git clone https://huggingface.co/RMDWLLC/kaiju-coder-7-opencode
 cd kaiju-coder-7-opencode
+python3 scripts/install_kaiju_opencode_profile.py --base-url http://127.0.0.1:18181/v1
 ```
 Then run OpenCode inside the project you want to edit:
 ```text
 model id: kaiju-coder-7
+base URL: http://127.0.0.1:18084/v1
 context: 16384
 ```
+For the fastest OpenCode behavior, run the bundled fast proxy in a separate
+terminal and point OpenCode at the proxy:
+```bash
+KAIJU_OPENAI_BASE_URL=http://127.0.0.1:18084/v1 \
+python3 scripts/kaiju_opencode_fast_proxy.py --host 127.0.0.1 --port 18181
+```
 Then install the OpenCode helper with:
 ```bash
 git clone https://huggingface.co/RMDWLLC/kaiju-coder-7-opencode
 cd kaiju-coder-7-opencode
+python3 scripts/install_kaiju_opencode_profile.py --base-url http://127.0.0.1:18181/v1
 ```
 ### Path 3: Runtime-Quantized Local Candidate
 Use this only if you are comfortable with advanced serving setups. The current
+working quantized option is a runtime bitsandbytes recipe. A Q8_0 GGUF artifact
+has been converted, but it is still a candidate until runtime smoke passes.
 ```bash
 git clone https://huggingface.co/RMDWLLC/kaiju-coder-7-quantized-runtime
 - Public model id: `kaiju-coder-7`
 - OpenCode context: `16384`
 - Output cap for public testing: `2500`
+- Fast OpenCode path: vLLM bitsandbytes runtime behind the Kaiju fast proxy
 - Current reliable product path: model plus deterministic business-owner
+  harness/router plus verifier
+- Raw multi-file OpenCode generation: still too slow for broad paid claims;
+  useful for testing, but paid API claims should favor harnessed product
+  workflows until broader latency gates pass
 - Paid API: not public until launch preflight passes
 ## What Not To Claim Yet
 - Kaiju Coder 7 has a working local/OpenCode release candidate
 - the current tested OpenCode default is 16k context
 - the helper package includes a lean agent and compaction loop guard
+- the fast proxy keeps OpenCode tool calls intact while forcing bounded,
+  non-thinking generation
 - the paid API scaffold has tests and a launch preflight, but is not yet public
 - the packaged public smoke verifies a fresh OpenCode one-file write before
   public claims are refreshed
+- a GGUF Q8_0 candidate exists, but is not public quantized-weights release
+  evidence until runtime smoke passes
 ## Current Blockers Before Public Release
 - Hugging Face repo creation still requires a write-capable token or namespace.
 - Full merged model upload has not completed; the merged folder must first have
   the metadata packet synced by `prepare_hf_merged_model_metadata.sh`.
+- The GGUF Q8_0 candidate still needs a runtime smoke before public
+  quantized-weights upload.
 - Public paid API launch needs real Cloudflare D1/KV/R2 bindings, Wrangler
   secret verification, Stripe webhook staging evidence, staging traffic, latency
   evidence, and rollback proof.

QUANTIZATION_PLAN.md CHANGED

@@ -54,6 +54,44 @@ Findings on 2026-06-03:
   `/tmp/kaiju-opencode-quantized-smoke/hello.txt` with exactly
   `Kaiju Coder 7 quantized runtime ok`.
 ## Candidate Order
 1. **FP8/AWQ-style GPU serving candidate**
@@ -65,7 +103,8 @@ Findings on 2026-06-03:
 2. **GGUF/llama.cpp candidate**
    - Best for broad local distribution if the architecture converts cleanly.
-   - Publish only if a real local smoke test passes.
 3. **MLX candidate**
    - Best for Apple Silicon users if conversion supports this architecture.
@@ -90,7 +129,4 @@ path, but not as a public quantized-weights release.
 ## Next Concrete Step
-Create a pinned Docker/UV quantization environment on Gojira-B with the
-Qwen3.5-capable Transformers/runtime stack plus one persistent-weight
-quantization package at a time. Do not upload a quantized-weights repo until a
-smoke-tested persisted artifact exists.

   `/tmp/kaiju-opencode-quantized-smoke/hello.txt` with exactly
   `Kaiju Coder 7 quantized runtime ok`.
+Second persisted-quantization probe:
+```bash
+./scripts/probe-gojira-b-persisted-quantization.sh
+```
+Findings on 2026-06-03:
+- The active nightly vLLM stack recognizes `Qwen3_5Config` / `qwen3_5`.
+- Normal dependency installs for AWQ/GPTQ/llmcompressor can break that
+  Qwen3.5-capable Transformers stack, so they are not safe to run casually in
+  the serving image.
+- `autoawq` installed but importing `awq` failed against the current
+  Transformers activation API.
+- `auto-gptq` failed during build isolation because Torch was not visible to
+  the isolated build step.
+- `llmcompressor --no-deps` preserved Qwen3.5 config support, but import still
+  needs a pinned supporting dependency set. This remains the next best GPU
+  persisted-weight path after a dedicated environment is built.
+- `llama.cpp` support includes `Qwen3_5ForConditionalGeneration`, and Q8_0
+  conversion dry-run passed.
+Persisted GGUF conversion:
+```bash
+./scripts/run-gojira-b-kaiju-gguf-convert.sh
+```
+- Output:
+  `/home/richardecholsai5/kaiju-coder/models/kaiju-coder-7-gguf/kaiju-coder-7-Q8_0.gguf`
+- Size: `27G`
+- SHA256:
+  `596a2c227a429c7309db753061d88d71ee3f8a3b48f17e41ba9d81b0f55bdd4e`
+- Evidence:
+  `runs/gguf-conversion/20260603T231446Z/gguf-conversion.log`
+- Release status: converted, runtime smoke still required before public
+  quantized-weights upload.
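Before any runtime smoke, the converted artifact can be checked against the SHA256 recorded above. This is a minimal sketch using only the standard library. `sha256_of` and `verify_artifact` are hypothetical helper names, and the file is read in chunks because the artifact is about `27G`.

```python
import hashlib

# Digest recorded in the conversion evidence above.
EXPECTED_SHA256 = "596a2c227a429c7309db753061d88d71ee3f8a3b48f17e41ba9d81b0f55bdd4e"


def sha256_of(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Stream the file through SHA-256 so it never has to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path: str, expected: str = EXPECTED_SHA256) -> bool:
    """Compare the file digest with the expected hex digest, ignoring case."""
    return sha256_of(path) == expected.lower()
```

On Gojira-B this would be called with the output path listed above. A `False` result means the file differs from the one the conversion log describes and should not be smoke-tested as that artifact.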
 ## Candidate Order
 1. **FP8/AWQ-style GPU serving candidate**
 2. **GGUF/llama.cpp candidate**
    - Best for broad local distribution if the architecture converts cleanly.
+   - Current state: Q8_0 converted successfully on Gojira-B.
+   - Publish only if a real runtime smoke test passes.
 3. **MLX candidate**
    - Best for Apple Silicon users if conversion supports this architecture.
 ## Next Concrete Step
+Smoke-test the GGUF Q8_0 candidate next. Create a pinned Docker/UV quantization environment on Gojira-B with the Qwen3.5-capable Transformers/runtime stack plus one persistent-weight GPU quantization package at a time. Do not upload a quantized-weights repo until a smoke-tested persisted artifact exists.

SERVING_BENCHMARKS.md CHANGED

@@ -323,6 +323,7 @@ Runs:
 - `runs/benchmarks/20260603T154450Z-kaiju-coder-7-serving/summary.md`
 - `runs/benchmarks/20260603T161316Z-kaiju-coder-7-serving/summary.md`
 - `runs/benchmarks/20260603T165512Z-kaiju-coder-7-serving/summary.md`
 | Stack | Context | Prompt | OK | Seconds | Chars | Chars/s |
 | --- | ---: | --- | --- | ---: | ---: | ---: |
@@ -332,6 +333,8 @@ Runs:
 | vLLM bitsandbytes | 16384 | code_patch | True | 11.3 | 416 | 36.814 |
 | vLLM bitsandbytes | 16384 | business_doc | True | 53.44 | 1610 | 30.127 |
 | vLLM bitsandbytes | 16384 | identity | True | 19.65 | 26 | 1.323 |
 Gojira-B vLLM logs reported about `17.8 GiB` model memory for the bitsandbytes
 load at both 8k and 16k, compared with about `50.22 GiB` for the unquantized
@@ -356,3 +359,98 @@ restarted and re-confirmed. Treat vLLM bitsandbytes as the current working
 quantized local candidate for advanced GPU users and future paid API speed
 experiments. It now has direct identity/code/business-doc evidence plus an
 OpenCode one-file smoke, but it is not a persisted quantized-weights repo.

 - `runs/benchmarks/20260603T154450Z-kaiju-coder-7-serving/summary.md`
 - `runs/benchmarks/20260603T161316Z-kaiju-coder-7-serving/summary.md`
 - `runs/benchmarks/20260603T165512Z-kaiju-coder-7-serving/summary.md`
+- `runs/benchmarks/20260603T223337Z-kaiju-coder-7-serving/summary.md`
 | Stack | Context | Prompt | OK | Seconds | Chars | Chars/s |
 | --- | ---: | --- | --- | ---: | ---: | ---: |
 | vLLM bitsandbytes | 16384 | code_patch | True | 11.3 | 416 | 36.814 |
 | vLLM bitsandbytes | 16384 | business_doc | True | 53.44 | 1610 | 30.127 |
 | vLLM bitsandbytes | 16384 | identity | True | 19.65 | 26 | 1.323 |
+| vLLM bitsandbytes | 16384 | code_patch | True | 24.97 | 997 | 39.924 |
+| vLLM bitsandbytes | 16384 | business_doc | True | 34.46 | 1615 | 46.874 |
 Gojira-B vLLM logs reported about `17.8 GiB` model memory for the bitsandbytes
 load at both 8k and 16k, compared with about `50.22 GiB` for the unquantized
 quantized local candidate for advanced GPU users and future paid API speed
 experiments. It now has direct identity/code/business-doc evidence plus an
 OpenCode one-file smoke, but it is not a persisted quantized-weights repo.
+## 2026-06-03 Fast Proxy And Website Harness Speed Pass
+The current speed profile keeps runtime-quantized vLLM active on Gojira-B port
+`18084` and routes OpenCode through the local fast proxy at
+`http://127.0.0.1:18181/v1`. The proxy preserves OpenCode tool-call streaming
+while forcing `thinking=false`, model id `kaiju-coder-7`, and bounded output
+budgets.
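The request rewriting described above can be sketched as follows. This is an illustration under stated assumptions, not the code in `scripts/kaiju_opencode_fast_proxy.py`: it assumes thinking is disabled through the `chat_template_kwargs` request field that vLLM accepts for Qwen-style chat templates, and that the `2500` public-testing output cap is measured in tokens. `rewrite_chat_request` is a hypothetical name.

```python
from copy import deepcopy

MODEL_ID = "kaiju-coder-7"
MAX_OUTPUT_TOKENS = 2500  # assumption: the public-testing cap, taken as tokens


def rewrite_chat_request(payload: dict) -> dict:
    """Return a copy of a chat-completions request with the proxy overrides applied."""
    out = deepcopy(payload)  # leave the caller's payload untouched

    # Force the public model id regardless of what the client sent.
    out["model"] = MODEL_ID

    # Bound the output budget: keep smaller valid requests, cap everything else.
    requested = out.get("max_tokens")
    if not isinstance(requested, int) or not 0 < requested <= MAX_OUTPUT_TOKENS:
        out["max_tokens"] = MAX_OUTPUT_TOKENS

    # Disable thinking while preserving any other template kwargs.
    template_kwargs = dict(out.get("chat_template_kwargs") or {})
    template_kwargs["enable_thinking"] = False
    out["chat_template_kwargs"] = template_kwargs

    # Fields such as "tools", "tool_choice", and "stream" pass through unchanged.
    return out
```

Only three fields are overridden, so tool definitions and streaming flags reach the upstream server exactly as OpenCode sent them.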
+Active endpoint checks:
+- Local fast proxy health: `http://127.0.0.1:18181/health`
+- Upstream vLLM models: `http://100.109.109.14:18084/v1/models`
+- Upstream reports `kaiju-coder-7` with `max_model_len=16384`
+Fresh direct vLLM benchmark:
+- Run: `runs/benchmarks/20260603T223337Z-kaiju-coder-7-serving/summary.md`
+- Identity: `19.48s`
+- Code patch: `24.97s`, `997` chars
+- Business doc: `34.46s`, `1,615` chars
+Fresh OpenCode smoke through the local fast proxy:
+- Command: `opencode run -m kaiju/kaiju-coder-7 --agent kaiju-coder-7 --dir /tmp/kaiju-vllm-opencode-smoke --dangerously-skip-permissions 'Create fast-vllm.txt with exactly: Kaiju quantized vLLM OpenCode ok'`
+- Result: passed in about `23.5s`, wrote the exact requested file.
+- Packaged public verifier after exact-content agent rule:
+  `runs/public-opencode-smoke/20260603T232928Z/summary.md`, `4/4`
+  passed through `http://127.0.0.1:18181/v1`.
+Website harness/router speed pass:
+- Direct website harness command: `python3 scripts/run_kaiju_website_harness.py --openai-base-url http://100.109.109.14:18084/v1 --model kaiju-coder-7 ...`
+- Direct website harness result: `runs/harness/website-speed-pass/avery-stone-vllm.html`, `9,257` chars, `7.31s`
+- Router command: `python3 scripts/run_kaiju_router.py --kind website --openai-base-url http://100.109.109.14:18084/v1 --model kaiju-coder-7 ...`
+- Router artifact: `runs/router-speed-pass/20260603T223731Z-website-build-a-premium-one-page-website-for-avery-stone-construction-a-reside/index.html`
+- Router result: passed in `7.20s`; checks covered complete HTML, required sections, external images, responsive CSS, no lorem ipsum, and manifest write.
+- Router through the installed local proxy: `runs/router-speed-pass/20260603T224328Z-website-build-a-premium-one-page-website-for-bennett-family-dental-in-charlott/index.html`
+- Proxy router result: passed in `4.67s`; preserved explicit CTA `Schedule a Visit`, inferred `dental`, and passed the same complete-HTML/static checks.
+Updated recommendation: for speed-sensitive OpenCode and paid workflow testing,
+use vLLM bitsandbytes plus the local fast proxy as the active default. Keep
+SGLang as fallback/historical evidence, not the fastest current path. For
+websites and business-owner packs, prefer the deterministic router/harness path
+over raw long-form HTML generation.
+Public business-owner demo pack through the active fast proxy:
+```bash
+python3 scripts/run_kaiju_public_demo_pack.py \
+  --openai-base-url http://127.0.0.1:18181/v1 \
+  --model kaiju-coder-7 \
+  --planner-timeout 90
+```
+Run: `runs/public-demo-pack/20260603T232534Z/summary.md`
+| Task | Result | Seconds | Changed files |
+| --- | --- | ---: | ---: |
+| Website | Passed | 24.59 | 2 |
+| Owner AI company pack | Passed | 29.99 | 19 |
+| Stripe safety plan | Passed | 9.93 | 2 |
+| CSV parser artifact | Passed | 19.93 | 2 |
+Total: `4/4` passed in `84.43s`.
+## Persisted GGUF Q8_0 Candidate
+The dedicated persisted-quantization pass on 2026-06-03 found that normal
+AWQ/GPTQ installs do not work cleanly against the Qwen3.5-capable serving
+stack, while `llama.cpp` conversion support includes
+`Qwen3_5ForConditionalGeneration`.
+Commands:
+```bash
+./scripts/probe-gojira-b-persisted-quantization.sh
+./scripts/run-gojira-b-kaiju-gguf-convert.sh
+```
+Result:
+- Artifact:
+  `/home/richardecholsai5/kaiju-coder/models/kaiju-coder-7-gguf/kaiju-coder-7-Q8_0.gguf`
+- Size: `27G`
+- SHA256:
+  `596a2c227a429c7309db753061d88d71ee3f8a3b48f17e41ba9d81b0f55bdd4e`
+- Conversion log:
+  `runs/gguf-conversion/20260603T231446Z/gguf-conversion.log`
+- Runtime status: candidate only; direct GGUF runtime smoke still required
+  before publishing quantized weights.
+Interpretation: the next real speed improvement for broad public users is not
+another prompt tweak. It is a smoked GGUF or GPU-persisted quantized artifact.
+The fastest currently verified Kaiju Coder 7 path remains vLLM bitsandbytes
+plus the local fast proxy and deterministic website/business harnesses.

scripts/check_hf_uploaded_release.py CHANGED

@@ -24,7 +24,7 @@ from typing import Any
 MODEL_ID = "kaiju-coder-7"
 DEFAULT_NAMESPACE = "RMDWLLC"
-DEFAULT_BASE_URL = "http://100.109.109.14:18083/v1"
 @dataclass(frozen=True)

 MODEL_ID = "kaiju-coder-7"
 DEFAULT_NAMESPACE = "RMDWLLC"
+DEFAULT_BASE_URL = "http://127.0.0.1:18181/v1"
 @dataclass(frozen=True)