---
license: apache-2.0
language:
- en
tags:
- kaiju-coder-7
- quantization
- vllm
- bitsandbytes
- local-ai
- opencode
---

# Kaiju Coder 7 Runtime-Quantized Local Candidate

This is the current working local quantized variant of Kaiju Coder 7. It is a
runtime serving path (vLLM with bitsandbytes quantization applied at load
time), not yet a separate persisted quantized-weight artifact.

## Status

- Model id: `kaiju-coder-7`
- Runtime: `gojira/vllm-openai-ray:nightly`
- Quantization mode: vLLM `--quantization bitsandbytes`
- Load format: vLLM `--load-format bitsandbytes`
- Required launch mode: `--language-model-only`
- Required OpenCode launch flag: `--enable-auto-tool-choice`
- Required preinstall in this image: `pandas`
- Tested contexts: `8192`, `16384`
- OpenCode smoke: passed through the local fast proxy
- Persisted quantized Hugging Face weights: GGUF Q8_0 converted, runtime smoke
  pending before public upload

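The settings above can be collected into one launch command. The sketch below is a hypothetical helper, not the repo's launch script. It assumes the image accepts standard `vllm serve` arguments, that the tested contexts map to vLLM's `--max-model-len`, and that the merged weights sit at a placeholder path. `--enable-auto-tool-choice` is typically paired with a `--tool-call-parser` value; this document does not state which parser is used, so it is left out here.

```python
import shlex

# Contexts listed as tested in the Status section.
TESTED_CONTEXTS = (8192, 16384)


def build_vllm_args(model_path: str, context: int = 16384, port: int = 18084) -> list[str]:
    """Assemble a `vllm serve` argument list from the Status settings.

    `model_path` is a placeholder for wherever the merged weights live.
    Mapping the tested context to --max-model-len is an assumption.
    """
    if context not in TESTED_CONTEXTS:
        raise ValueError(f"context {context} is not one of the tested values {TESTED_CONTEXTS}")
    return [
        "vllm", "serve", model_path,
        "--served-model-name", "kaiju-coder-7",
        "--quantization", "bitsandbytes",
        "--load-format", "bitsandbytes",
        "--language-model-only",
        "--enable-auto-tool-choice",
        "--max-model-len", str(context),
        "--port", str(port),
    ]


def render_command(args: list[str]) -> str:
    """Render the argument list as a single shell-quoted command line."""
    return shlex.join(args)
```

`render_command(build_vllm_args("/path/to/merged-weights"))` gives a line that can be compared against what the benchmark script actually launches.
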
## Run

Run the guarded benchmark script from the repo root:

```bash
KAIJU_VLLM_CONTEXT=16384 \
KAIJU_VLLM_READY_TIMEOUT=1200 \
KAIJU_VLLM_QUANTIZATION=bitsandbytes \
KAIJU_VLLM_LOAD_FORMAT=bitsandbytes \
./scripts/run-gojira-b-vllm-serving-benchmark.sh
```

The script stops the merged SGLang service, starts vLLM on port `18084`, runs
the benchmark, and then restores SGLang unless `KAIJU_VLLM_KEEP_RUNNING=1` is
set. For the current fast OpenCode setup, set `KAIJU_VLLM_KEEP_RUNNING=1` so
that vLLM stays up, then point the fast proxy at port `18084`:

```bash
KAIJU_OPENAI_BASE_URL=http://127.0.0.1:18084/v1 \
python3 scripts/kaiju_opencode_fast_proxy.py --host 127.0.0.1 --port 18181
```

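Before pointing OpenCode at the proxy, it can help to confirm that the vLLM endpoint reports the expected model id. The following is a minimal standard-library sketch, assuming the server exposes the usual OpenAI-compatible `/v1/models` route on port `18084`; the function names are illustrative and not part of the repo.

```python
import json
import urllib.request

# Base URL taken from the proxy launch command above.
DEFAULT_BASE_URL = "http://127.0.0.1:18084/v1"


def models_url(base_url: str) -> str:
    """Build the model-listing URL from an OpenAI-style base URL."""
    return base_url.rstrip("/") + "/models"


def served_model_ids(payload: dict) -> list[str]:
    """Extract model ids from a `/v1/models` response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def endpoint_serves(model_id: str, base_url: str = DEFAULT_BASE_URL, timeout: float = 10.0) -> bool:
    """Return True if the endpoint lists `model_id` among its served models."""
    with urllib.request.urlopen(models_url(base_url), timeout=timeout) as response:
        payload = json.load(response)
    return model_id in served_model_ids(payload)
```

With vLLM kept running, `endpoint_serves("kaiju-coder-7")` should return `True`; a connection error suggests the script has already restored SGLang.
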
## Evidence

Benchmark run summaries:

- `runs/benchmarks/20260603T153257Z-kaiju-coder-7-serving/summary.md`
- `runs/benchmarks/20260603T154450Z-kaiju-coder-7-serving/summary.md`
- `runs/benchmarks/20260603T161316Z-kaiju-coder-7-serving/summary.md`
- `runs/benchmarks/20260603T165512Z-kaiju-coder-7-serving/summary.md`
- `runs/benchmarks/20260603T223337Z-kaiju-coder-7-serving/summary.md`

| Runtime | Context | Prompt | OK | Seconds | Chars | Chars/s |
| --- | ---: | --- | --- | ---: | ---: | ---: |
| vLLM bitsandbytes | 8192 | identity | True | 21.19 | 26 | 1.227 |
| vLLM bitsandbytes | 8192 | code_patch | True | 11.31 | 424 | 37.489 |
| vLLM bitsandbytes | 16384 | identity | True | 19.51 | 26 | 1.333 |
| vLLM bitsandbytes | 16384 | code_patch | True | 11.30 | 416 | 36.814 |
| vLLM bitsandbytes | 16384 | business_doc | True | 53.44 | 1610 | 30.127 |
| vLLM bitsandbytes | 16384 | identity | True | 19.65 | 26 | 1.323 |
| vLLM bitsandbytes | 16384 | code_patch | True | 24.97 | 997 | 39.924 |
| vLLM bitsandbytes | 16384 | business_doc | True | 34.46 | 1615 | 46.874 |

Gojira-B logs recorded a model load of about `17.8 GiB` for both the 8k and
16k bitsandbytes runs. That is a meaningful local-serving improvement over the
full bfloat16 vLLM model load, which reported about `50.22 GiB` (roughly 65%
less memory). The 16k business-document task passed, and the current speed
pass keeps the runtime-quantized vLLM service active for OpenCode through the
local proxy.

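The derived figures in this section can be cross-checked with a few lines of arithmetic. The sketch below recomputes `Chars/s` as `Chars / Seconds` for a few table rows and the memory saving from the two reported load sizes. The helper names are illustrative. Recomputed throughput can differ from the table in the last digits, presumably because the `Seconds` column is rounded.

```python
# (context, prompt, seconds, chars, reported chars/s), copied from the table.
ROWS = [
    (8192, "identity", 21.19, 26, 1.227),
    (8192, "code_patch", 11.31, 424, 37.489),
    (16384, "business_doc", 53.44, 1610, 30.127),
    (16384, "code_patch", 24.97, 997, 39.924),
]


def chars_per_second(chars: int, seconds: float) -> float:
    """Throughput as output characters divided by wall-clock seconds."""
    return chars / seconds


def memory_saving(quantized_gib: float, full_gib: float) -> tuple[float, float]:
    """Return (GiB saved, fraction of the full-precision load saved)."""
    saved = full_gib - quantized_gib
    return saved, saved / full_gib


for context, prompt, seconds, chars, reported in ROWS:
    recomputed = chars_per_second(chars, seconds)
    print(f"{context:>5} {prompt:<12} reported={reported:.3f} recomputed={recomputed:.3f}")

saved_gib, fraction = memory_saving(17.8, 50.22)
# prints: saved about 32.42 GiB (64.6% of the bfloat16 load)
print(f"saved about {saved_gib:.2f} GiB ({fraction:.1%} of the bfloat16 load)")
```
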
The dedicated website harness/router speed pass produced a complete checked
website in about `7.2s` through vLLM bitsandbytes:

- Direct website harness: `runs/harness/website-speed-pass/avery-stone-vllm.html`
- Router artifact: `runs/router-speed-pass/20260603T223731Z-website-build-a-premium-one-page-website-for-avery-stone-construction-a-reside/index.html`
- Local-proxy router artifact: `runs/router-speed-pass/20260603T224328Z-website-build-a-premium-one-page-website-for-bennett-family-dental-in-charlott/index.html`
- Router checks: complete HTML, required sections, external images,
  responsive CSS, no lorem ipsum, manifest write

OpenCode one-file smoke also passed through the runtime-quantized endpoint:

```bash
bash scripts/run_kaiju_quantized_opencode_smoke.sh
```

Result:

- Workdir: `/tmp/kaiju-opencode-quantized-smoke`
- File: `hello.txt`
- Exact content: `Kaiju Coder 7 quantized runtime ok`
- OpenCode config: isolated temporary `HOME`, no global config edit
- Permission mode: `--dangerously-skip-permissions` inside the temporary smoke
  harness only

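The smoke result above reduces to one file-content comparison. The sketch below is a hypothetical re-check, not the logic of `run_kaiju_quantized_opencode_smoke.sh`; in particular, tolerating a single trailing newline is an assumption, since the document only gives the exact content string.

```python
from pathlib import Path

# Expected file content from the Result list above.
EXPECTED_CONTENT = "Kaiju Coder 7 quantized runtime ok"


def smoke_file_ok(workdir: str, filename: str = "hello.txt", expected: str = EXPECTED_CONTENT) -> bool:
    """Return True if the smoke file exists and holds the expected text.

    Trailing newlines are ignored (an assumption; the real script may be stricter).
    """
    target = Path(workdir) / filename
    if not target.is_file():
        return False
    return target.read_text(encoding="utf-8").rstrip("\n") == expected
```

For the run recorded here, the call would be `smoke_file_ok("/tmp/kaiju-opencode-quantized-smoke")`.
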
## Persisted GGUF Candidate

A Q8_0 GGUF candidate now exists on Gojira-B:

```text
/home/richardecholsai5/kaiju-coder/models/kaiju-coder-7-gguf/kaiju-coder-7-Q8_0.gguf
```

- Size: `27G`
- SHA256:
  `596a2c227a429c7309db753061d88d71ee3f8a3b48f17e41ba9d81b0f55bdd4e`
- Conversion evidence:
  `runs/gguf-conversion/20260603T231446Z/gguf-conversion.log`
- Local docs: `release/gguf/README.md`

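The recorded digest can be re-verified after the file is copied or before any upload. The following is a minimal sketch using Python's `hashlib`. It reads the file in chunks so that a file of this size is never held in memory at once. The function names are illustrative.

```python
import hashlib
from pathlib import Path

# SHA256 digest recorded in the list above.
EXPECTED_SHA256 = "596a2c227a429c7309db753061d88d71ee3f8a3b48f17e41ba9d81b0f55bdd4e"


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Compute the SHA256 hex digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: Path, expected: str = EXPECTED_SHA256) -> bool:
    """Return True if the file's digest equals the expected hex digest."""
    return sha256_of(path) == expected.lower()
```

Calling `matches_expected(Path(...))` with the GGUF path shown above should return `True` for an intact copy.
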
This is not yet release evidence for public quantized weights. It still needs
a runtime smoke test that proves identity, business-owner output, and the
intended OpenCode/router path under an actual GGUF runtime.

## Release Interpretation

This is a working quantized local runtime candidate. It is useful for internal
testing, serious GPU users, and the next paid API speed experiments. It is not
yet a standalone public quantized-weights repo because the only fully
smoke-tested path is still the full merged model loaded through bitsandbytes
at runtime.

The next release step is either to smoke-test the GGUF candidate or to package
this runtime path as an advanced serving recipe, while stating clearly that
the recipe still requires access to the full Kaiju Coder 7 merged weights.