	---
	license: apache-2.0
	language:
	- en
	tags:
	- kaiju-coder-7
	- quantization
	- vllm
	- bitsandbytes
	- local-ai
	- opencode
	---

	# Kaiju Coder 7 Runtime-Quantized Local Candidate

	![RMDW logo](assets/RMDWlogo.png)

	This is the current working local quantized variant of Kaiju Coder 7. It
	quantizes the full merged weights with bitsandbytes when vLLM loads them; it
	is not yet a separate persisted quantized weight artifact.

	## Status

	- Model id: `kaiju-coder-7`
	- Runtime: `gojira/vllm-openai-ray:nightly`
	- Quantization mode: vLLM `--quantization bitsandbytes`
	- Load format: vLLM `--load-format bitsandbytes`
	- Required launch mode: `--language-model-only`
	- Required OpenCode launch flag: `--enable-auto-tool-choice`
	- Required preinstall in this image: `pandas`
	- Tested contexts: `8192`, `16384`
	- OpenCode smoke: passed through the local fast proxy
	- Persisted quantized Hugging Face weights: GGUF Q8_0 converted; runtime smoke
	test still pending before public upload
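
	As a rough illustration of how the flags in this list fit together, the
	sketch below assembles a `vllm serve` argument list and prints it rather than
	running it. The model path, the `build_vllm_args` helper, and the use of
	`--served-model-name`, `--max-model-len`, and `--port` are assumptions for
	illustration; the guarded script in the Run section may build its command
	differently. vLLM normally expects a `--tool-call-parser` value alongside
	`--enable-auto-tool-choice`; this card does not say which parser is used, so
	it is left out here.

	```bash
	#!/usr/bin/env bash
	# Sketch only: collect the flags named in the Status list.
	# The model path and default values below are hypothetical placeholders.
	build_vllm_args() {
		local model_path="$1" context="${2:-16384}" port="${3:-18084}"
		printf '%s\n' \
			serve "$model_path" \
			--served-model-name kaiju-coder-7 \
			--quantization bitsandbytes \
			--load-format bitsandbytes \
			--language-model-only \
			--enable-auto-tool-choice \
			--max-model-len "$context" \
			--port "$port"
	}

	# Print the arguments on one line for review instead of launching anything.
	build_vllm_args /path/to/kaiju-coder-7-merged 16384 | tr '\n' ' '
	echo
	```

	Printing first makes it easy to compare the flags against the list above
	before prefixing the output with `vllm` inside the container.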

	## Run

	Use the guarded benchmark script from the repo root:

	```bash
	KAIJU_VLLM_CONTEXT=16384 \
	KAIJU_VLLM_READY_TIMEOUT=1200 \
	KAIJU_VLLM_QUANTIZATION=bitsandbytes \
	KAIJU_VLLM_LOAD_FORMAT=bitsandbytes \
	./scripts/run-gojira-b-vllm-serving-benchmark.sh
	```

	The script stops the merged SGLang service, starts vLLM on port `18084`, runs
	the benchmark, then restores SGLang unless `KAIJU_VLLM_KEEP_RUNNING=1` is set.
	For the current fast OpenCode setup, keep vLLM running and point the fast proxy
	at port `18084`.

	```bash
	KAIJU_OPENAI_BASE_URL=http://127.0.0.1:18084/v1 \
	python3 scripts/kaiju_opencode_fast_proxy.py --host 127.0.0.1 --port 18181
	```
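
	Before pointing the proxy at vLLM, it can help to confirm the endpoint is
	answering. The sketch below polls the OpenAI-compatible `/models` route with
	`curl`; the `wait_for_endpoint` helper and its attempt limit are hypothetical
	and not part of the repo scripts.

	```bash
	# Sketch: poll an OpenAI-compatible base URL until /models responds.
	# max_attempts is a retry count, with about one second between tries.
	wait_for_endpoint() {
		local base_url="$1" max_attempts="${2:-60}" attempts=0
		until curl -fsS --max-time 5 "$base_url/models" >/dev/null 2>&1; do
			attempts=$((attempts + 1))
			if [ "$attempts" -ge "$max_attempts" ]; then
				echo "not ready after $max_attempts attempts: $base_url" >&2
				return 1
			fi
			sleep 1
		done
		echo "ready: $base_url"
	}

	# Example (not executed here):
	#   wait_for_endpoint http://127.0.0.1:18084/v1 1200
	```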

	## Evidence

	Runs:

	- `runs/benchmarks/20260603T153257Z-kaiju-coder-7-serving/summary.md`
	- `runs/benchmarks/20260603T154450Z-kaiju-coder-7-serving/summary.md`
	- `runs/benchmarks/20260603T161316Z-kaiju-coder-7-serving/summary.md`
	- `runs/benchmarks/20260603T165512Z-kaiju-coder-7-serving/summary.md`
	- `runs/benchmarks/20260603T223337Z-kaiju-coder-7-serving/summary.md`

	| Runtime | Context | Prompt | OK | Seconds | Chars | Chars/s |
	| --- | ---: | --- | --- | ---: | ---: | ---: |
	| vLLM bitsandbytes | 8192 | identity | True | 21.19 | 26 | 1.227 |
	| vLLM bitsandbytes | 8192 | code_patch | True | 11.31 | 424 | 37.489 |
	| vLLM bitsandbytes | 16384 | identity | True | 19.51 | 26 | 1.333 |
	| vLLM bitsandbytes | 16384 | code_patch | True | 11.3 | 416 | 36.814 |
	| vLLM bitsandbytes | 16384 | business_doc | True | 53.44 | 1610 | 30.127 |
	| vLLM bitsandbytes | 16384 | identity | True | 19.65 | 26 | 1.323 |
	| vLLM bitsandbytes | 16384 | code_patch | True | 24.97 | 997 | 39.924 |
	| vLLM bitsandbytes | 16384 | business_doc | True | 34.46 | 1615 | 46.874 |

	Gojira-B logs recorded model load at about `17.8 GiB` of memory for both the
	8k and 16k bitsandbytes runs. That is roughly 65% less than the full bfloat16
	vLLM model load, which reported about `50.22 GiB`. The 16k business-document
	task passed, and the current speed pass keeps the runtime-quantized vLLM
	service active for OpenCode through the local proxy.
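
	The size of that saving follows directly from the two load figures quoted
	above. A small sketch of the arithmetic, where `memory_saving` is a
	hypothetical helper name:

	```bash
	# Compute absolute and relative memory saving from two load sizes in GiB.
	memory_saving() {
		awk -v full="$1" -v quant="$2" 'BEGIN {
			printf "saved %.2f GiB (%.1f%% less)\n", full - quant, (full - quant) / full * 100
		}'
	}

	# bfloat16 load vs. bitsandbytes load, as reported in the Gojira-B logs.
	memory_saving 50.22 17.8
	# prints: saved 32.42 GiB (64.6% less)
	```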

	The dedicated website harness/router speed pass produced a complete website
	that passed the router checks in about `7.2s` through vLLM bitsandbytes:

	- Direct website harness: `runs/harness/website-speed-pass/avery-stone-vllm.html`
	- Router artifact: `runs/router-speed-pass/20260603T223731Z-website-build-a-premium-one-page-website-for-avery-stone-construction-a-reside/index.html`
	- Local-proxy router artifact: `runs/router-speed-pass/20260603T224328Z-website-build-a-premium-one-page-website-for-bennett-family-dental-in-charlott/index.html`
	- Router checks: complete HTML, required sections, external images,
	responsive CSS, no lorem ipsum, manifest write
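
	The router's actual check implementation is not shown in this card. As a
	rough approximation of what several of the listed checks could look like,
	the sketch below uses `grep` on a generated file. The `check_site_html`
	helper and its patterns are assumptions; the required-sections and
	manifest-write checks are omitted because their rules are not stated here.

	```bash
	# Sketch of router-style HTML checks (hypothetical, not the real router code).
	check_site_html() {
		local file="$1" failed=0
		grep -qi '<!doctype html' "$file" || { echo "missing doctype"; failed=1; }
		grep -qi '</html>' "$file" || { echo "missing closing html tag"; failed=1; }
		grep -qiE '<img[^>]+src="https?://' "$file" || { echo "no external image"; failed=1; }
		grep -qi '@media' "$file" || { echo "no responsive CSS"; failed=1; }
		if grep -qi 'lorem ipsum' "$file"; then
			echo "placeholder text found"
			failed=1
		fi
		return "$failed"
	}

	# Example (not executed here):
	#   check_site_html runs/harness/website-speed-pass/avery-stone-vllm.html
	```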

	OpenCode one-file smoke also passed through the runtime-quantized endpoint:

	```bash
	bash scripts/run_kaiju_quantized_opencode_smoke.sh
	```

	Result:

	- Workdir: `/tmp/kaiju-opencode-quantized-smoke`
	- File: `hello.txt`
	- Exact content: `Kaiju Coder 7 quantized runtime ok`
	- OpenCode config: isolated temporary `HOME`, no global config edit
	- Permission mode: `--dangerously-skip-permissions` inside the temporary smoke
	harness only
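
	The result can be re-checked by hand after a run. The sketch below compares
	the file content with the expected string; `check_smoke_file` is a
	hypothetical helper, not part of the smoke script.

	```bash
	# Sketch: confirm the smoke run wrote the expected file content.
	# Command substitution strips trailing newlines, so a final newline is tolerated.
	check_smoke_file() {
		local workdir="${1:-/tmp/kaiju-opencode-quantized-smoke}"
		local expected="Kaiju Coder 7 quantized runtime ok"
		local actual
		actual="$(cat "$workdir/hello.txt" 2>/dev/null)" || {
			echo "hello.txt missing in $workdir" >&2
			return 1
		}
		[ "$actual" = "$expected" ] || {
			echo "unexpected content: $actual" >&2
			return 1
		}
		echo "smoke file ok"
	}

	# Example (not executed here):
	#   check_smoke_file /tmp/kaiju-opencode-quantized-smoke
	```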

	## Persisted GGUF Candidate

	A Q8_0 GGUF candidate now exists on Gojira-B:

	```text
	/home/richardecholsai5/kaiju-coder/models/kaiju-coder-7-gguf/kaiju-coder-7-Q8_0.gguf
	```

	- Size: `27G`
	- SHA256:
	`596a2c227a429c7309db753061d88d71ee3f8a3b48f17e41ba9d81b0f55bdd4e`
	- Conversion evidence:
	`runs/gguf-conversion/20260603T231446Z/gguf-conversion.log`
	- Local docs: `release/gguf/README.md`
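
	Anyone copying the GGUF file between machines may want to confirm it against
	the SHA256 listed above. A minimal sketch, assuming GNU `sha256sum` is
	available (on macOS, `shasum -a 256` is the usual substitute);
	`verify_sha256` is a hypothetical helper name:

	```bash
	# Sketch: compare a file's SHA256 digest with an expected hex string.
	verify_sha256() {
		local file="$1" expected="$2" actual
		actual="$(sha256sum "$file" | awk '{print $1}')"
		if [ "$actual" = "$expected" ]; then
			echo "checksum ok: $file"
		else
			echo "checksum mismatch: $file" >&2
			return 1
		fi
	}

	# Example (not executed here):
	#   verify_sha256 kaiju-coder-7-Q8_0.gguf \
	#     596a2c227a429c7309db753061d88d71ee3f8a3b48f17e41ba9d81b0f55bdd4e
	```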

	This does not yet count as evidence for a public quantized-weights release.
	The candidate still needs a runtime smoke test that proves identity,
	business-owner output, and the intended OpenCode/router path under an actual
	GGUF runtime.

	## Release Interpretation

	This is a working quantized local runtime candidate. It is useful for internal
	testing, serious GPU users, and the next paid API speed experiments. It is not
	yet a standalone public quantized weights repo because the only fully smoked
	path is still the full merged model loaded through bitsandbytes at runtime.

	The next release step is either to smoke-test the GGUF candidate or to
	package this runtime path as an advanced serving recipe, stating clearly that
	it still requires access to the full Kaiju Coder 7 merged weights.