	# Kaiju Coder 7 Serving Benchmarks

	This file records serving evidence for public download and paid API decisions.
	The model id must remain `kaiju-coder-7`.

	## Current Live Runtime

	- Host: Gojira-B over Tailscale
	- Local OpenCode base URL: `http://127.0.0.1:18181/v1`
	- Upstream base URL: `http://100.109.109.14:18084/v1`
	- Serving stack: vLLM bitsandbytes runtime quantization behind the Kaiju fast
	proxy
	- Current verified context: `16384`
	- Tested high-context target: `32768`
	- Current container: `qwen36-merged-vllm-18084`
	- Current caveat: direct raw generation is still slow for multi-file OpenCode
	work; use the deterministic router/harness for public business-owner demos.

	## Benchmark Command

	For current-context latency without restart:

	```bash
	python3 scripts/benchmark_kaiju_serving.py \
	--contexts 12288 \
	--prompts identity business_doc code_patch \
	--max-tokens 768 \
	--timeout 420
	```

	For context restart benchmarking:

	```bash
	python3 scripts/benchmark_kaiju_serving.py \
	--restart \
	--contexts 12288 16384 24576 32768 \
	--prompts identity business_doc \
	--max-tokens 768 \
	--timeout 420 \
	--ready-timeout 1200
	```

	Use `--contexts 16384` for the current restored Gojira-B endpoint. Use
	`32768` when explicitly testing the high-context target; it has passed earlier
	benchmarks but should be re-confirmed after a fresh restart before calling it
	the live default.

	## Current 12k Direct API Benchmark

	Command:

	```bash
	python3 scripts/benchmark_kaiju_serving.py \
	--contexts 12288 \
	--prompts identity code_patch \
	--max-tokens 256 \
	--timeout 300
	```

	Run: `runs/benchmarks/20260603T135017Z-kaiju-coder-7-serving/summary.md`

	| Context | Prompt | OK | Seconds | Chars | Chars/s |
	| --- | --- | --- | ---: | ---: | ---: |
	| 12288 | identity | True | 2.41 | 26 | 10.788 |
	| 12288 | code_patch | True | 57.61 | 860 | 14.928 |

	Interpretation: direct API calls are usable for short tasks, but latency is too
	high for a paid raw-code API unless outputs are streamed and route-specific
	limits are enforced.
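
The `Chars/s` column is consistent with response length divided by wall-clock seconds. A minimal sketch of that arithmetic, using the two rows from the table above; the function name is illustrative and is not taken from `benchmark_kaiju_serving.py`:

```python
def chars_per_second(chars: int, seconds: float) -> float:
    """Return output throughput in characters per second, rounded to 3 places."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return round(chars / seconds, 3)


# Rows from the 12k direct API table: (prompt, seconds, chars).
rows = [("identity", 2.41, 26), ("code_patch", 57.61, 860)]
for prompt, seconds, chars in rows:
    print(prompt, chars_per_second(chars, seconds))
```

Because the metric counts characters rather than tokens, very short replies such as `identity` are dominated by fixed request overhead and understate decode speed.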

	## 16k Context Benchmark

	16k was tested to reduce OpenCode compaction pressure.

	Commands:

	```bash
	python3 scripts/benchmark_kaiju_serving.py \
	--restart \
	--contexts 16384 \
	--prompts identity \
	--max-tokens 128 \
	--timeout 300 \
	--ready-timeout 1200

	python3 scripts/benchmark_kaiju_serving.py \
	--contexts 16384 \
	--prompts code_patch \
	--max-tokens 128 \
	--timeout 300
	```

	Runs:

	- `runs/benchmarks/20260603T135651Z-kaiju-coder-7-serving/summary.md`
	- `runs/benchmarks/20260603T140318Z-kaiju-coder-7-serving/summary.md`

	| Context | Prompt | OK | Load Wait (s) | Seconds | Chars | Chars/s |
	| --- | --- | --- | ---: | ---: | ---: | ---: |
	| 16384 | identity | True | 354.16 | 14.9 | 26 | 1.745 |
	| 16384 | code_patch | True | n/a | 28.99 | 416 | 14.35 |

	Interpretation: `16384` is a stable lower-load fallback and still leaves more
	room above OpenCode's prompt/tool overhead than the original 12k setting.

	## 24k And 32k Context Benchmarks

	24k and 32k were tested after 16k proved stable. Both loaded and returned the
	same code-patch latency profile as 16k on the short patch benchmark.

	Commands:

	```bash
	python3 scripts/benchmark_kaiju_serving.py \
	--restart \
	--contexts 24576 \
	--prompts identity \
	--max-tokens 128 \
	--timeout 300 \
	--ready-timeout 1200

	python3 scripts/benchmark_kaiju_serving.py \
	--contexts 24576 \
	--prompts code_patch \
	--max-tokens 128 \
	--timeout 300

	python3 scripts/benchmark_kaiju_serving.py \
	--restart \
	--contexts 32768 \
	--prompts identity \
	--max-tokens 64 \
	--timeout 300 \
	--ready-timeout 1200

	python3 scripts/benchmark_kaiju_serving.py \
	--contexts 32768 \
	--prompts code_patch \
	--max-tokens 128 \
	--timeout 300
	```

	Runs:

	- `runs/benchmarks/20260603T141559Z-kaiju-coder-7-serving/summary.md`
	- `runs/benchmarks/20260603T142354Z-kaiju-coder-7-serving/summary.md`
	- `runs/benchmarks/20260603T142439Z-kaiju-coder-7-serving/summary.md`
	- `runs/benchmarks/20260603T143256Z-kaiju-coder-7-serving/summary.md`

	| Context | Prompt | OK | Load Wait (s) | Seconds | Chars | Chars/s |
	| --- | --- | --- | ---: | ---: | ---: | ---: |
	| 24576 | identity | True | 439.54 | 16.84 | 26 | 1.544 |
	| 24576 | code_patch | True | n/a | 29.03 | 416 | 14.33 |
	| 32768 | identity | True | 386.53 | 16.27 | 26 | 1.598 |
	| 32768 | code_patch | True | n/a | 28.99 | 416 | 14.35 |

	Interpretation: `32768` is a proven high-context target from this benchmark set,
	but it is not the current live endpoint, which was parked at `16384` after the
	later quantized-runtime testing. The current Gojira-B/OpenCode profile should
	stay at `16384` until `32768` is freshly restarted and re-confirmed. Keep
	`12288` for direct API smoke tests and constrained hardware.

	Restored-service 32k direct API smoke after vLLM testing:

	- Run: `runs/benchmarks/20260603T155233Z-kaiju-coder-7-serving/summary.md`
	- `/v1/models`: `kaiju-coder-7`, max model len `32768`

	| Context | Prompt | OK | Seconds | Chars | Chars/s |
	| --- | --- | --- | ---: | ---: | ---: |
	| 32768 | identity | True | 2.92 | 26 | 8.904 |
	| 32768 | business_doc | True | 94.28 | 1737 | 18.424 |

	Interpretation: the restored default endpoint is usable for business-owner
	document work, but a long proposal response still takes about 94 seconds. Paid
	routes must stream, cap output, queue carefully, and prefer verified
	artifact routes over raw open-ended generation.
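
To illustrate the "stream and cap output" requirement, here is a minimal sketch of a client-side reader for an OpenAI-compatible streaming response. It assumes the usual server-sent-event framing (`data: {...}` lines ending with `data: [DONE]`). The function name and the `max_chars` cap are hypothetical and are not part of the Kaiju proxy.

```python
import json
from typing import Iterable


def collect_stream(lines: Iterable[str], max_chars: int = 2000) -> str:
    """Join streamed delta text, stopping at [DONE] or at max_chars."""
    parts = []
    total = 0
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        text = (choices[0].get("delta") or {}).get("content") or ""
        if total + len(text) > max_chars:
            parts.append(text[: max_chars - total])
            break  # enforce the output cap and stop reading
        parts.append(text)
        total += len(text)
    return "".join(parts)


# Illustrative stream, not a captured log.
sample = [
    'data: {"choices": [{"delta": {"content": "Kaiju "}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Coder 7"}}]}',
    "data: [DONE]",
]
print(collect_stream(sample))
```

A client-side cap only limits what is shown to the user. To save decode time on a slow endpoint, the request's `max_tokens` also has to be bounded on the server side.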

	## OpenCode Customer-Readiness Evidence

	Final restored-service small OpenCode smoke:

	```bash
	opencode run -m kaiju/kaiju-coder-7 --agent kaiju-coder-7 \
	--dir /tmp/kaiju-opencode-32k-final-smoke \
	'Create hello.txt with exactly: Kaiju Coder 7 final 32k ok'
	```

	Result: passed. OpenCode wrote `hello.txt` with exactly
	`Kaiju Coder 7 final 32k ok`.

	Current restored 16k OpenCode smoke after quantized-vLLM testing:

	```bash
	mkdir -p /tmp/kaiju-opencode-fresh-public-smoke
	opencode run -m kaiju/kaiju-coder-7 --agent kaiju-coder-7 \
	--dir /tmp/kaiju-opencode-fresh-public-smoke \
	--dangerously-skip-permissions \
	'Create hello.txt with exactly: Kaiju Coder 7 fresh public smoke ok'
	```

	Result: passed. OpenCode wrote `hello.txt` with exactly
	`Kaiju Coder 7 fresh public smoke ok` in
	`/tmp/kaiju-opencode-fresh-public-smoke`, and `/v1/models` returned
	`kaiju-coder-7` with max model len `16384`.

	Current restored 16k direct API identity smoke:

	- Run: `runs/benchmarks/20260603T174545Z-kaiju-coder-7-serving/summary.md`
	- `/v1/models`: `kaiju-coder-7`, max model len `16384`

	| Context | Prompt | OK | Seconds | Chars | Chars/s |
	| --- | --- | --- | ---: | ---: | ---: |
	| 16384 | identity | True | 2.3 | 26 | 11.304 |

	Command:

	```bash
	python3 scripts/run_kaiju_opencode_customer_pack.py
	```

	Latest harnessed product-path result on 2026-06-03:

	- Run: `runs/opencode-customer-readiness/20260603T185835Z/summary.md`
	- Mode: `harnessed`
	- Status: `4/4` passed
	- Tasks:
	  - `fade-flow-service-site`
	  - `kiyomi-owner-operating-pack`
	  - `paid-api-safety-scaffold`
	  - `release-provenance-safety-review`
	- Required files written: `28/28`
	- Forbidden secret-looking tokens: none found by verifier
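
The verifier's actual rules are not reproduced in this file. As a rough sketch of the two checks named above (required files present, no secret-looking tokens), something like the following could be used. The regex patterns, function name, and file names are all hypothetical.

```python
import re
import tempfile
from pathlib import Path

# Hypothetical patterns for secret-looking tokens; the real verifier may differ.
SECRET_PATTERNS = [
    re.compile(r"sk_(live|test)_[A-Za-z0-9]{16,}"),
    re.compile(r"hf_[A-Za-z0-9]{20,}"),
    re.compile(r"AKIA[0-9A-Z]{16}"),
]


def verify_workspace(root: Path, required: list[str]) -> dict:
    """Report missing required files and files containing secret-looking tokens."""
    missing = [name for name in required if not (root / name).is_file()]
    flagged = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        if any(pattern.search(text) for pattern in SECRET_PATTERNS):
            flagged.append(str(path.relative_to(root)))
    return {"missing": missing, "flagged": flagged, "ok": not missing and not flagged}


# Demo on a throwaway workspace with one of two required files present.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "index.html").write_text("<html></html>", encoding="utf-8")
    print(verify_workspace(root, ["index.html", "README.md"]))
```

Running the check outside the model loop keeps the pass/fail result deterministic, which matches the "model plus deterministic harness and verifier" framing used in this file.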

	Loop-guarded OpenCode install smoke:

	- Command: `python3 scripts/install_kaiju_opencode_profile.py`, then
	`opencode run -m kaiju/kaiju-coder-7 --agent kaiju-coder-7 --dir /tmp/kaiju-opencode-loopguard-smoke --dangerously-skip-permissions 'Create loopguard.txt with exactly: Kaiju Coder 7 loop guard installed'`
	- Result: passed. OpenCode wrote `loopguard.txt` in the requested directory with
	exactly `Kaiju Coder 7 loop guard installed` and exited cleanly.
	- Installed guard: `/Users/richardecholsai7/.config/opencode/kaiju-no-autocontinue.mjs`

	Raw OpenCode-agent result on 2026-06-03:

	- Task: `fade-flow-service-site`
	- Status: timed out after `900s`
	- Required files written: `0`
	- Observed Gojira-B decode throughput while running: about `4.4` tokens/sec
	- Follow-up runner fix: workspaces now run outside the repo and pass
	  `opencode run --dir <workspace>` explicitly.
	- Structured follow-up run:
	  `runs/opencode-customer-readiness/20260603T135520Z/results.jsonl`
	  timed out after `60s`, wrote `0` files, and recorded `pwd` as the intended
	  temp workspace.
	- 16k/stricter-agent follow-up runs:
	  - `runs/opencode-customer-readiness/20260603T140650Z/results.jsonl`
	    timed out after `120s`, wrote `0` files, and recorded the intended temp
	    workspace.
	  - `runs/opencode-customer-readiness/20260603T140908Z/results.jsonl`
	    timed out after `120s`, wrote `0` files after adding stricter "write first
	    file immediately" prompt guidance.
	- Interpretation: the lean OpenCode agent fits and can write small files.
	Harnessed file-plan delivery passes the customer pack. Current raw multi-file
	OpenCode generation is still not public/API ready, so public and paid claims
	must describe the reliable product path as model plus deterministic harness
	and verifier.

	## Recommendation Until Faster Serving Is Proven

	- Public local release can proceed only with clear speed/hardware caveats.
	- Paid API should route business-owner deliverables through deterministic
	harnesses and verifiers, not raw OpenCode multi-file generation.
	- Quantized candidates and/or a smaller distilled variant are required for
	broad public OpenCode usability.

	## vLLM Serving Probe

	vLLM was tested as the practical alternative serving path after SGLang. The
	standard `vllm/vllm-openai:latest` image cannot read the merged checkpoint's
	`qwen3_5` config. The Gojira nightly image can read it, but needed two launch
	fixes for this checkpoint:

	- preinstall `pandas`, because the Qwen3.5 model path imports it in this image
	- pass `--language-model-only`, because the merged text-serving checkpoint does
	not include the visual encoder weights expected by the multimodal config

	Guarded benchmark command:

	```bash
	KAIJU_VLLM_CONTEXT=16384 KAIJU_VLLM_READY_TIMEOUT=900 \
	./scripts/run-gojira-b-vllm-serving-benchmark.sh
	```

	Run: `runs/benchmarks/20260603T151244Z-kaiju-coder-7-serving/summary.md`

	| Stack | Context | Prompt | OK | Seconds | Chars | Chars/s |
	| --- | ---: | --- | --- | ---: | ---: | ---: |
	| vLLM nightly | 16384 | identity | True | 19.99 | 26 | 1.301 |
	| vLLM nightly | 16384 | code_patch | True | 28.8 | 416 | 14.444 |

	Interpretation: unquantized vLLM now runs Kaiju Coder 7 at 16k, but it was not
	clearly faster than SGLang on these smoke prompts: code_patch took `28.8s`
	here versus `28.99s` in the earlier 16k benchmark, and identity took `19.99s`
	versus `14.9s`. This is historical fallback evidence. The later bitsandbytes
	vLLM path plus fast proxy is the active speed path. Keep the live/default
	OpenCode profile at 16k until 32k is freshly re-confirmed.

	## vLLM bitsandbytes Runtime-Quantized Candidate

	The first working quantized local variant is a runtime bitsandbytes vLLM path.
	It does not create separate quantized weights yet; it loads the full merged
	model through vLLM's bitsandbytes loader.

	Command:

	```bash
	KAIJU_VLLM_CONTEXT=16384 \
	KAIJU_VLLM_READY_TIMEOUT=1200 \
	KAIJU_VLLM_QUANTIZATION=bitsandbytes \
	KAIJU_VLLM_LOAD_FORMAT=bitsandbytes \
	./scripts/run-gojira-b-vllm-serving-benchmark.sh
	```

	Runs:

	- `runs/benchmarks/20260603T153257Z-kaiju-coder-7-serving/summary.md`
	- `runs/benchmarks/20260603T154450Z-kaiju-coder-7-serving/summary.md`
	- `runs/benchmarks/20260603T161316Z-kaiju-coder-7-serving/summary.md`
	- `runs/benchmarks/20260603T165512Z-kaiju-coder-7-serving/summary.md`
	- `runs/benchmarks/20260603T223337Z-kaiju-coder-7-serving/summary.md`

	| Stack | Context | Prompt | OK | Seconds | Chars | Chars/s |
	| --- | ---: | --- | --- | ---: | ---: | ---: |
	| vLLM bitsandbytes | 8192 | identity | True | 21.19 | 26 | 1.227 |
	| vLLM bitsandbytes | 8192 | code_patch | True | 11.31 | 424 | 37.489 |
	| vLLM bitsandbytes | 16384 | identity | True | 19.51 | 26 | 1.333 |
	| vLLM bitsandbytes | 16384 | code_patch | True | 11.3 | 416 | 36.814 |
	| vLLM bitsandbytes | 16384 | business_doc | True | 53.44 | 1610 | 30.127 |
	| vLLM bitsandbytes | 16384 | identity | True | 19.65 | 26 | 1.323 |
	| vLLM bitsandbytes | 16384 | code_patch | True | 24.97 | 997 | 39.924 |
	| vLLM bitsandbytes | 16384 | business_doc | True | 34.46 | 1615 | 46.874 |

	Gojira-B vLLM logs reported about `17.8 GiB` model memory for the bitsandbytes
	load at both 8k and 16k, compared with about `50.22 GiB` for the unquantized
	vLLM load. Code-patch latency improved materially on this smoke prompt.
	Business-document latency improved versus the restored 32k SGLang business-doc
	smoke (`53.44s` at 16k vLLM bitsandbytes versus `94.28s` at 32k SGLang).
	Identity latency remains slower than SGLang.
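
The comparisons above can be put in relative terms with a few lines of arithmetic. This sketch only restates figures already quoted in this file. The business-doc runs differ in context length, serving stack, and output length, so treat the ratios as indicative rather than a controlled comparison.

```python
# Model memory reported in the Gojira-B vLLM logs (GiB).
unquantized_gib = 50.22
bitsandbytes_gib = 17.8
memory_saving = 1 - bitsandbytes_gib / unquantized_gib

# business_doc: restored 32k SGLang smoke versus 16k vLLM bitsandbytes.
sglang_32k_doc_s = 94.28
bnb_16k_doc_s = 53.44
doc_speedup = sglang_32k_doc_s / bnb_16k_doc_s

# code_patch at 16k: unquantized vLLM nightly versus vLLM bitsandbytes.
unquantized_patch_s = 28.8
bnb_patch_s = 11.3
patch_speedup = unquantized_patch_s / bnb_patch_s

print(f"memory saving: {memory_saving:.1%}")          # 64.6%
print(f"business_doc speedup: {doc_speedup:.2f}x")    # 1.76x
print(f"code_patch speedup: {patch_speedup:.2f}x")    # 2.55x
```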

	Quantized OpenCode one-file smoke passed after launching vLLM with
	`--enable-auto-tool-choice` plus `--tool-call-parser qwen3_coder` and running:

	```bash
	bash scripts/run_kaiju_quantized_opencode_smoke.sh
	```

	Result: OpenCode wrote `/tmp/kaiju-opencode-quantized-smoke/hello.txt` with
	exactly `Kaiju Coder 7 quantized runtime ok`.

	Recommendation: use vLLM bitsandbytes behind the local fast proxy as the
	current public/OpenCode speed path and keep the installed OpenCode profile at
	16k unless the 32k target has just been restarted and re-confirmed. Treat
	SGLang as fallback and historical high-context evidence. vLLM bitsandbytes has
	direct identity/code/business-doc evidence plus an OpenCode one-file smoke, but
	it is not a persisted quantized-weights repo.

	## 2026-06-03 Fast Proxy And Website Harness Speed Pass

	The current speed profile keeps runtime-quantized vLLM active on Gojira-B port
	`18084` and routes OpenCode through the local fast proxy at
	`http://127.0.0.1:18181/v1`. The proxy preserves OpenCode tool-call streaming
	while forcing `thinking=false`, model id `kaiju-coder-7`, and bounded output
	budgets.
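
The proxy's implementation is not included in this file. The following is only a sketch of the kind of request normalization described above: pin the model id, disable thinking, and clamp the output budget. Passing `enable_thinking` through `chat_template_kwargs` is an assumption about how `thinking=false` reaches the upstream server, and the `768` token budget is a hypothetical value borrowed from the benchmark commands.

```python
import copy

MODEL_ID = "kaiju-coder-7"
MAX_OUTPUT_TOKENS = 768  # hypothetical budget; the real proxy limit is not documented here


def normalize_request(body: dict) -> dict:
    """Return a copy of a chat-completions body with the proxy policy applied."""
    out = copy.deepcopy(body)
    out["model"] = MODEL_ID  # always serve under the fixed model id
    requested = out.get("max_tokens")
    if not isinstance(requested, int) or requested <= 0:
        requested = MAX_OUTPUT_TOKENS
    out["max_tokens"] = min(requested, MAX_OUTPUT_TOKENS)
    # Assumption: thinking is disabled through chat-template kwargs upstream.
    kwargs = dict(out.get("chat_template_kwargs") or {})
    kwargs["enable_thinking"] = False
    out["chat_template_kwargs"] = kwargs
    return out


incoming = {
    "model": "some-alias",
    "max_tokens": 4096,
    "stream": True,
    "messages": [{"role": "user", "content": "hi"}],
}
print(normalize_request(incoming))
```

Fields the policy does not touch, such as `stream` and `messages`, pass through unchanged, so tool-call streaming behavior is left to the upstream server.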

	Active endpoint checks:

	- Local fast proxy health: `http://127.0.0.1:18181/health`
	- Upstream vLLM models: `http://100.109.109.14:18084/v1/models`
	- Upstream reports `kaiju-coder-7` with `max_model_len=16384`
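
A small helper can make the `max_model_len` check repeatable after each restart. This sketch assumes the `/v1/models` body follows the OpenAI list shape with a `max_model_len` field on each entry. The sample body is illustrative, not a captured response.

```python
import json
from typing import Optional


def max_model_len(models_json: str, model_id: str = "kaiju-coder-7") -> Optional[int]:
    """Return max_model_len for model_id from a /v1/models body, or None if absent."""
    body = json.loads(models_json)
    for entry in body.get("data", []):
        if entry.get("id") == model_id:
            return entry.get("max_model_len")
    return None


# Illustrative response body, not a captured log.
sample = '{"object": "list", "data": [{"id": "kaiju-coder-7", "max_model_len": 16384}]}'
print(max_model_len(sample))
```

In practice the body would come from an HTTP GET against the upstream `/v1/models` URL listed above, for example with `urllib.request.urlopen`.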

	Fresh direct vLLM benchmark:

	- Run: `runs/benchmarks/20260603T223337Z-kaiju-coder-7-serving/summary.md`
	- Identity: `19.48s`
	- Code patch: `24.97s`, `997` chars
	- Business doc: `34.46s`, `1,615` chars

	Fresh OpenCode smoke through the local fast proxy:

	- Command: `opencode run -m kaiju/kaiju-coder-7 --agent kaiju-coder-7 --dir /tmp/kaiju-vllm-opencode-smoke --dangerously-skip-permissions 'Create fast-vllm.txt with exactly: Kaiju quantized vLLM OpenCode ok'`
	- Result: passed in about `23.5s`, wrote the exact requested file.
	- Packaged public verifier after exact-content agent rule:
	`runs/public-opencode-smoke/20260603T235002Z/summary.md`, `4/4`
	passed through `http://127.0.0.1:18181/v1`.

	Website harness/router speed pass:

	- Direct website harness command: `python3 scripts/run_kaiju_website_harness.py --openai-base-url http://100.109.109.14:18084/v1 --model kaiju-coder-7 ...`
	- Direct website harness result: `runs/harness/website-speed-pass/avery-stone-vllm.html`, `9,257` chars, `7.31s`
	- Router command: `python3 scripts/run_kaiju_router.py --kind website --openai-base-url http://100.109.109.14:18084/v1 --model kaiju-coder-7 ...`
	- Router artifact: `runs/router-speed-pass/20260603T223731Z-website-build-a-premium-one-page-website-for-avery-stone-construction-a-reside/index.html`
	- Router result: passed in `7.20s`; checks covered complete HTML, required sections, external images, responsive CSS, no lorem ipsum, and manifest write.
	- Router through the installed local proxy: `runs/router-speed-pass/20260603T224328Z-website-build-a-premium-one-page-website-for-bennett-family-dental-in-charlott/index.html`
	- Proxy router result: passed in `4.67s`; preserved explicit CTA `Schedule a Visit`, inferred `dental`, and passed the same complete-HTML/static checks.
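
The router's static checks are not shown in this file. A minimal sketch of what checks of this kind might look like follows. The function name, section names, and heuristics are hypothetical, and the manifest-write check is omitted because it depends on the router's output layout.

```python
import re


def check_website(html: str, required_sections: list[str]) -> dict:
    """Run simple static checks on a generated one-page site."""
    lower = html.lower()
    return {
        "complete_html": "<html" in lower and "<body" in lower and "</html>" in lower,
        "required_sections": all(name.lower() in lower for name in required_sections),
        "external_images": bool(re.search(r'<img[^>]+src=["\']https?://', lower)),
        "responsive_css": "@media" in lower or 'name="viewport"' in lower,
        "no_lorem_ipsum": "lorem ipsum" not in lower,
    }


# Illustrative page, not a router artifact.
sample = """<!doctype html>
<html><head><meta name="viewport" content="width=device-width, initial-scale=1">
<style>@media (max-width: 600px) { nav { display: none; } }</style></head>
<body><section id="services">Services</section>
<img src="https://example.com/hero.jpg" alt="hero">
<a href="#contact">Schedule a Visit</a><section id="contact">Contact</section>
</body></html>"""
results = check_website(sample, ["services", "contact"])
print(results, all(results.values()))
```

String checks like these are cheap enough to run on every artifact, but they only confirm structure. They say nothing about visual quality or copy accuracy.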

	Updated recommendation: for speed-sensitive OpenCode and paid workflow testing,
	use vLLM bitsandbytes plus the local fast proxy as the active default. Keep
	SGLang as fallback/historical evidence, not the fastest current path. For
	websites and business-owner packs, prefer the deterministic router/harness path
	over raw long-form HTML generation.

	Public business-owner demo pack through the active fast proxy:

	```bash
	python3 scripts/run_kaiju_public_demo_pack.py \
	--openai-base-url http://127.0.0.1:18181/v1 \
	--model kaiju-coder-7 \
	--planner-timeout 90
	```

	Run: `runs/public-demo-pack/20260603T235009Z/summary.md`

	| Task | Result | Seconds | Changed files |
	| --- | --- | ---: | ---: |
	| Website | Passed | 4.73 | 2 |
	| Owner AI company pack | Passed | 29.85 | 19 |
	| Stripe safety plan | Passed | 9.99 | 2 |
	| CSV parser artifact | Passed | 19.97 | 2 |

	Total: `4/4` passed in `64.529s`.

	## Persisted GGUF Q8_0 Candidate

	The dedicated persisted-quantization pass found that standard AWQ/GPTQ installs
	were not clean against the Qwen3.5-capable serving stack at the time of this
	pass (2026-06-03), while `llama.cpp` conversion support includes
	`Qwen3_5ForConditionalGeneration`.

	Command:

	```bash
	./scripts/probe-gojira-b-persisted-quantization.sh
	./scripts/run-gojira-b-kaiju-gguf-convert.sh
	```

	Result:

	- Artifact:
	`/home/richardecholsai5/kaiju-coder/models/kaiju-coder-7-gguf/kaiju-coder-7-Q8_0.gguf`
	- Size: `27G`
	- SHA256:
	`596a2c227a429c7309db753061d88d71ee3f8a3b48f17e41ba9d81b0f55bdd4e`
	- Conversion log:
	`runs/gguf-conversion/20260603T231446Z/gguf-conversion.log`
	- Runtime status: candidate only; direct GGUF runtime smoke still required
	before publishing quantized weights.
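
Before any runtime smoke, a copied or downloaded artifact can be checked against the SHA256 recorded above. A minimal sketch using only the standard library; the helper names are illustrative:

```python
import hashlib
from pathlib import Path

# SHA256 recorded for kaiju-coder-7-Q8_0.gguf in this file.
EXPECTED_SHA256 = "596a2c227a429c7309db753061d88d71ee3f8a3b48f17e41ba9d81b0f55bdd4e"


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a 27G artifact never has to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_recorded_hash(path: Path) -> bool:
    """Compare a local file against the SHA256 recorded above."""
    return sha256_file(path) == EXPECTED_SHA256
```

Calling `matches_recorded_hash(Path("kaiju-coder-7-Q8_0.gguf"))` on a local copy returns `True` only if the file is byte-identical to the converted artifact.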

	Interpretation: the next real speed improvement for broad public users is not
	another prompt tweak but a smoke-tested GGUF or GPU-persisted quantized
	artifact. The fastest currently verified Kaiju Coder 7 path remains vLLM
	bitsandbytes plus the local fast proxy and deterministic website/business
	harnesses.