---
license: apache-2.0
language:
- en
tags:
- kaiju-coder-7
- quantization
- vllm
- bitsandbytes
- local-ai
- opencode
---
# Kaiju Coder 7 Runtime-Quantized Local Candidate

This is the current working local quantized variant of Kaiju Coder 7. It is a
runtime serving path (vLLM loading the full merged model through bitsandbytes),
not yet a separate persisted quantized-weight artifact.
## Status
- Model id: `kaiju-coder-7`
- Runtime: `gojira/vllm-openai-ray:nightly`
- Quantization mode: vLLM `--quantization bitsandbytes`
- Load format: vLLM `--load-format bitsandbytes`
- Required launch mode: `--language-model-only`
- Required OpenCode launch flag: `--enable-auto-tool-choice`
- Required preinstall in this image: `pandas`
- Tested contexts: `8192`, `16384`
- OpenCode smoke: passed through the local fast proxy
- Persisted quantized Hugging Face weights: GGUF Q8_0 converted, runtime smoke
pending before public upload
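
As a rough illustration of how the settings above fit together, the sketch below collects them into one flag list. It only prints the flags and does not start a server. The helper name `kaiju_vllm_flags` and the mapping of the tested context length to `--max-model-len` are assumptions, not taken from the repo's launch script. In upstream vLLM, `--enable-auto-tool-choice` is normally paired with a `--tool-call-parser` value; this document does not name one, so it is left out here.

```bash
# Sketch only: kaiju_vllm_flags is a hypothetical helper, not part of the repo.
# Prints the flags from the Status list, one token per line.
kaiju_vllm_flags() {
  local context="${1:-16384}"   # tested contexts: 8192, 16384
  printf '%s\n' \
    --quantization bitsandbytes \
    --load-format bitsandbytes \
    --language-model-only \
    --enable-auto-tool-choice \
    --max-model-len "$context"   # assumed mapping for the context length
}

# Show the flag list for the 8k context on a single line.
kaiju_vllm_flags 8192 | tr '\n' ' '; echo
```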
## Run
Use the guarded benchmark script from the repo root:
```bash
KAIJU_VLLM_CONTEXT=16384 \
KAIJU_VLLM_READY_TIMEOUT=1200 \
KAIJU_VLLM_QUANTIZATION=bitsandbytes \
KAIJU_VLLM_LOAD_FORMAT=bitsandbytes \
./scripts/run-gojira-b-vllm-serving-benchmark.sh
```
The script stops the merged SGLang service, starts vLLM on port `18084`, runs
the benchmark, then restores SGLang unless `KAIJU_VLLM_KEEP_RUNNING=1` is set.
For the current fast OpenCode setup, keep vLLM running and point the fast proxy
at port `18084`.
```bash
KAIJU_OPENAI_BASE_URL=http://127.0.0.1:18084/v1 \
python3 scripts/kaiju_opencode_fast_proxy.py --host 127.0.0.1 --port 18181
```
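
Before starting the proxy, it can help to confirm that vLLM is answering on port `18084`. The following is a minimal sketch, assuming the vLLM OpenAI-compatible server exposes its usual `GET /v1/models` route; `wait_for_models_endpoint` is a hypothetical helper, and the default timeout simply mirrors the `KAIJU_VLLM_READY_TIMEOUT=1200` value used above.

```bash
# Sketch only: wait_for_models_endpoint is a hypothetical helper.
# Polls GET <base_url>/models until it answers or the timeout expires.
wait_for_models_endpoint() {
  local base_url="$1"          # e.g. http://127.0.0.1:18084/v1
  local timeout="${2:-1200}"   # seconds
  local deadline=$(( $(date +%s) + timeout ))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    # -f makes curl fail on HTTP errors; -s silences progress output.
    if curl -fs --max-time 5 "$base_url/models" >/dev/null 2>&1; then
      return 0
    fi
    sleep 1
  done
  return 1
}
```

A call such as `wait_for_models_endpoint http://127.0.0.1:18084/v1 60 && echo ready` returns as soon as the endpoint responds and exits non-zero if it never does.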
## Evidence
Runs:
- `runs/benchmarks/20260603T153257Z-kaiju-coder-7-serving/summary.md`
- `runs/benchmarks/20260603T154450Z-kaiju-coder-7-serving/summary.md`
- `runs/benchmarks/20260603T161316Z-kaiju-coder-7-serving/summary.md`
- `runs/benchmarks/20260603T165512Z-kaiju-coder-7-serving/summary.md`
- `runs/benchmarks/20260603T223337Z-kaiju-coder-7-serving/summary.md`

| Runtime | Context | Prompt | OK | Seconds | Chars | Chars/s |
| --- | ---: | --- | --- | ---: | ---: | ---: |
| vLLM bitsandbytes | 8192 | identity | True | 21.19 | 26 | 1.227 |
| vLLM bitsandbytes | 8192 | code_patch | True | 11.31 | 424 | 37.489 |
| vLLM bitsandbytes | 16384 | identity | True | 19.51 | 26 | 1.333 |
| vLLM bitsandbytes | 16384 | code_patch | True | 11.30 | 416 | 36.814 |
| vLLM bitsandbytes | 16384 | business_doc | True | 53.44 | 1610 | 30.127 |
| vLLM bitsandbytes | 16384 | identity | True | 19.65 | 26 | 1.323 |
| vLLM bitsandbytes | 16384 | code_patch | True | 24.97 | 997 | 39.924 |
| vLLM bitsandbytes | 16384 | business_doc | True | 34.46 | 1615 | 46.874 |
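
The `Chars/s` column appears to be `Chars` divided by `Seconds`. The sketch below recomputes it for two rows as a sanity check; `chars_per_second` is a hypothetical helper, not part of the benchmark script. A couple of rows (for example `997` chars in `24.97` seconds) differ from the table in the last digits, which would be consistent with `Seconds` having been rounded after the rate was computed, though that explanation is a guess.

```bash
# Sketch only: chars_per_second is a hypothetical helper.
# Divides generated characters by wall-clock seconds, to three decimals.
chars_per_second() {
  awk -v chars="$1" -v seconds="$2" 'BEGIN { printf "%.3f\n", chars / seconds }'
}

chars_per_second 424 11.31    # 8192-context code_patch row, prints 37.489
chars_per_second 1610 53.44   # 16384-context business_doc row, prints 30.127
```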

Gojira-B logs recorded a model load of about `17.8 GiB` for both the 8k and 16k
bitsandbytes runs. That is roughly 65% less than the full bfloat16 vLLM model
load, which reported about `50.22 GiB`, and a meaningful improvement for local
serving.

The 16k business-document task passed, and the current speed pass keeps the
runtime-quantized vLLM service active for OpenCode through the local proxy.

The dedicated website harness/router speed pass produced a complete, checked
website in about `7.2s` through vLLM bitsandbytes:
- Direct website harness: `runs/harness/website-speed-pass/avery-stone-vllm.html`
- Router artifact: `runs/router-speed-pass/20260603T223731Z-website-build-a-premium-one-page-website-for-avery-stone-construction-a-reside/index.html`
- Local-proxy router artifact: `runs/router-speed-pass/20260603T224328Z-website-build-a-premium-one-page-website-for-bennett-family-dental-in-charlott/index.html`
- Router checks: complete HTML, required sections, external images,
  responsive CSS, no lorem ipsum, manifest write

OpenCode one-file smoke also passed through the runtime-quantized endpoint:
```bash
bash scripts/run_kaiju_quantized_opencode_smoke.sh
```
Result:
- Workdir: `/tmp/kaiju-opencode-quantized-smoke`
- File: `hello.txt`
- Exact content: `Kaiju Coder 7 quantized runtime ok`
- OpenCode config: isolated temporary `HOME`, no global config edit
- Permission mode: `--dangerously-skip-permissions` inside the temporary smoke
harness only
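
The result above can be re-checked by hand after a run. This is a minimal sketch; `check_smoke_file` is a hypothetical helper, and the smoke script itself presumably performs its own verification, which this does not replace.

```bash
# Sketch only: check_smoke_file is a hypothetical helper.
# Succeeds when <workdir>/hello.txt holds exactly the expected line.
check_smoke_file() {
  local workdir="${1:-/tmp/kaiju-opencode-quantized-smoke}"
  local expected='Kaiju Coder 7 quantized runtime ok'
  # $(...) strips trailing newlines, so a final newline in the file is tolerated.
  [ -f "$workdir/hello.txt" ] && [ "$(cat "$workdir/hello.txt")" = "$expected" ]
}
```

Running `check_smoke_file && echo "smoke file ok"` checks the default workdir listed above.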
## Persisted GGUF Candidate
A Q8_0 GGUF candidate now exists on Gojira-B:
```text
/home/richardecholsai5/kaiju-coder/models/kaiju-coder-7-gguf/kaiju-coder-7-Q8_0.gguf
```
- Size: `27G`
- SHA256:
`596a2c227a429c7309db753061d88d71ee3f8a3b48f17e41ba9d81b0f55bdd4e`
- Conversion evidence:
`runs/gguf-conversion/20260603T231446Z/gguf-conversion.log`
- Local docs: `release/gguf/README.md`
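
Anyone who copies the GGUF file elsewhere can compare it against the digest above. The following is a minimal sketch using coreutils `sha256sum`; `verify_sha256` is a hypothetical helper, not a script from the repo.

```bash
# Sketch only: verify_sha256 is a hypothetical helper built on coreutils sha256sum.
# Succeeds when the file's SHA-256 digest equals the expected hex string.
verify_sha256() {
  local file="$1" expected="$2" actual
  # sha256sum prints "<digest>  <path>"; keep only the digest.
  actual="$(sha256sum "$file" | awk '{ print $1 }')"
  [ "$actual" = "$expected" ]
}

# Usage (path and digest copied from this section):
# verify_sha256 \
#   /home/richardecholsai5/kaiju-coder/models/kaiju-coder-7-gguf/kaiju-coder-7-Q8_0.gguf \
#   596a2c227a429c7309db753061d88d71ee3f8a3b48f17e41ba9d81b0f55bdd4e && echo "digest ok"
```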

This does not yet count as evidence for a public quantized-weights release. It
still needs a runtime smoke test that proves identity, business-owner output,
and the intended OpenCode/router path under an actual GGUF runtime.
## Release Interpretation
This is a working quantized local runtime candidate. It is useful for internal
testing, serious GPU users, and the next paid API speed experiments. It is not
yet a standalone public quantized weights repo because the only fully smoked
path is still the full merged model loaded through bitsandbytes at runtime.

The next release step is to smoke-test the GGUF candidate or package this
runtime path as an advanced serving recipe while clearly saying it still
requires access to the full Kaiju Coder 7 merged weights.