Kaiju Coder 7 Serving Benchmarks

This file records serving evidence for public download and paid API decisions. The model id must remain kaiju-coder-7.

Current Live Runtime

Host: Gojira-B over Tailscale
Local OpenCode base URL: http://127.0.0.1:18181/v1
Upstream base URL: http://100.109.109.14:18084/v1
Serving stack: vLLM bitsandbytes runtime quantization behind the Kaiju fast proxy
Current verified context: 16384
Tested high-context target: 32768
Current container: qwen36-merged-vllm-18084
Current caveat: direct raw generation is still slow for multi-file OpenCode work; use the deterministic router/harness for public business-owner demos.

Benchmark Command

To measure latency at the currently loaded context without restarting the server:

python3 scripts/benchmark_kaiju_serving.py \
  --contexts 12288 \
  --prompts identity business_doc code_patch \
  --max-tokens 768 \
  --timeout 420

For context restart benchmarking:

python3 scripts/benchmark_kaiju_serving.py \
  --restart \
  --contexts 12288 16384 24576 32768 \
  --prompts identity business_doc \
  --max-tokens 768 \
  --timeout 420 \
  --ready-timeout 1200

Use --contexts 16384 for the current restored Gojira-B endpoint. Use 32768 when explicitly testing the high-context target; it has passed earlier benchmarks but should be re-confirmed after a fresh restart before calling it the live default.

Current 12k Direct API Benchmark

Command:

python3 scripts/benchmark_kaiju_serving.py \
  --contexts 12288 \
  --prompts identity code_patch \
  --max-tokens 256 \
  --timeout 300

Run: runs/benchmarks/20260603T135017Z-kaiju-coder-7-serving/summary.md

Context	Prompt	OK	Seconds	Chars	Chars/s
12288	identity	True	2.41	26	10.788
12288	code_patch	True	57.61	860	14.928

Interpretation: direct API calls are usable for short tasks, but latency is too high for a paid raw-code API unless outputs are streamed and route-specific limits are enforced.
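The Chars/s column appears to be output characters divided by wall-clock seconds. A minimal sketch that reproduces the two rows above under that assumption follows; the `BenchRow` class is illustrative and is not part of scripts/benchmark_kaiju_serving.py.

```python
from dataclasses import dataclass


@dataclass
class BenchRow:
    context: int
    prompt: str
    seconds: float
    chars: int

    @property
    def chars_per_second(self) -> float:
        # Throughput as assumed here: output characters / wall-clock seconds.
        return self.chars / self.seconds


# Values copied from the 12k direct API table.
rows = [
    BenchRow(12288, "identity", 2.41, 26),
    BenchRow(12288, "code_patch", 57.61, 860),
]

for row in rows:
    print(f"{row.context}\t{row.prompt}\t{row.chars_per_second:.3f}")
```

Because the identity prompt returns only 26 characters, its Chars/s figure mostly reflects fixed request overhead rather than decode speed, so the code_patch row is probably the better throughput indicator.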

16k Context Benchmark

16k was tested to reduce OpenCode compaction pressure.

Commands:

python3 scripts/benchmark_kaiju_serving.py \
  --restart \
  --contexts 16384 \
  --prompts identity \
  --max-tokens 128 \
  --timeout 300 \
  --ready-timeout 1200

python3 scripts/benchmark_kaiju_serving.py \
  --contexts 16384 \
  --prompts code_patch \
  --max-tokens 128 \
  --timeout 300

Runs:

runs/benchmarks/20260603T135651Z-kaiju-coder-7-serving/summary.md
runs/benchmarks/20260603T140318Z-kaiju-coder-7-serving/summary.md

Context	Prompt	OK	Load Wait (s)	Seconds	Chars	Chars/s
16384	identity	True	354.16	14.9	26	1.745
16384	code_patch	True	n/a	28.99	416	14.35

Interpretation: 16384 is a stable lower-load fallback and still leaves more room above OpenCode's prompt/tool overhead than the original 12k setting.

24k And 32k Context Benchmarks

24k and 32k were tested after 16k proved stable. Both loaded and returned the same code-patch latency profile as 16k on the short patch benchmark.

Commands:

python3 scripts/benchmark_kaiju_serving.py \
  --restart \
  --contexts 24576 \
  --prompts identity \
  --max-tokens 128 \
  --timeout 300 \
  --ready-timeout 1200

python3 scripts/benchmark_kaiju_serving.py \
  --contexts 24576 \
  --prompts code_patch \
  --max-tokens 128 \
  --timeout 300

python3 scripts/benchmark_kaiju_serving.py \
  --restart \
  --contexts 32768 \
  --prompts identity \
  --max-tokens 64 \
  --timeout 300 \
  --ready-timeout 1200

python3 scripts/benchmark_kaiju_serving.py \
  --contexts 32768 \
  --prompts code_patch \
  --max-tokens 128 \
  --timeout 300

Runs:

runs/benchmarks/20260603T141559Z-kaiju-coder-7-serving/summary.md
runs/benchmarks/20260603T142354Z-kaiju-coder-7-serving/summary.md
runs/benchmarks/20260603T142439Z-kaiju-coder-7-serving/summary.md
runs/benchmarks/20260603T143256Z-kaiju-coder-7-serving/summary.md

Context	Prompt	OK	Load Wait (s)	Seconds	Chars	Chars/s
24576	identity	True	439.54	16.84	26	1.544
24576	code_patch	True	n/a	29.03	416	14.33
32768	identity	True	386.53	16.27	26	1.598
32768	code_patch	True	n/a	28.99	416	14.35

Interpretation: 32768 is a proven high-context target in this benchmark set, but it is not the endpoint left running after the later quantized-runtime testing. The current Gojira-B/OpenCode profile should stay at 16384 until 32768 is freshly restarted and re-confirmed. Keep 12288 for direct API smoke tests and constrained hardware.

Restored-service 32k SGLang direct API smoke after vLLM testing:

Run: runs/benchmarks/20260603T155233Z-kaiju-coder-7-serving/summary.md
/v1/models: kaiju-coder-7, max model len 32768

Context	Prompt	OK	Seconds	Chars	Chars/s
32768	identity	True	2.92	26	8.904
32768	business_doc	True	94.28	1737	18.424

Interpretation: the restored default endpoint is usable for business-owner document work, but a long proposal response still takes about 94 seconds. Paid routes must stream, cap output, queue carefully, and prefer verified artifact routes over raw open-ended generation.
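One way a paid route could stream and cap output is to consume the OpenAI-compatible server-sent-event stream and stop once a character budget is spent. The sketch below assumes the usual `data: {...}` / `data: [DONE]` chunk format; `iter_stream_text`, `max_chars`, and the sample lines are hypothetical and are not taken from the Kaiju proxy.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str], max_chars: int) -> Iterator[str]:
    """Yield text deltas from OpenAI-style SSE lines, stopping at max_chars."""
    emitted = 0
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        choices = json.loads(payload).get("choices") or []
        if not choices:
            continue
        delta = (choices[0].get("delta") or {}).get("content") or ""
        if not delta:
            continue  # e.g. the first chunk that only carries the role
        piece = delta[: max_chars - emitted]
        emitted += len(piece)
        yield piece
        if emitted >= max_chars:
            break  # output budget spent; the caller should close the connection


# Hypothetical sample of a streamed response.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    'data: {"choices": [{"delta": {"content": "Kaiju "}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Coder 7"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_text(sample, max_chars=10)))
```

A client-side cap like this only bounds what the user sees. Bounding server cost would still need a `max_tokens` limit on the request itself.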

OpenCode Customer-Readiness Evidence

Final restored-service small OpenCode smoke:

opencode run -m kaiju/kaiju-coder-7 --agent kaiju-coder-7 \
  --dir /tmp/kaiju-opencode-32k-final-smoke \
  'Create hello.txt with exactly: Kaiju Coder 7 final 32k ok'

Result: passed. OpenCode wrote hello.txt with exactly Kaiju Coder 7 final 32k ok.

Current restored 16k OpenCode smoke after quantized-vLLM testing:

mkdir -p /tmp/kaiju-opencode-fresh-public-smoke
opencode run -m kaiju/kaiju-coder-7 --agent kaiju-coder-7 \
  --dir /tmp/kaiju-opencode-fresh-public-smoke \
  --dangerously-skip-permissions \
  'Create hello.txt with exactly: Kaiju Coder 7 fresh public smoke ok'

Result: passed. OpenCode wrote hello.txt with exactly Kaiju Coder 7 fresh public smoke ok in /tmp/kaiju-opencode-fresh-public-smoke, and /v1/models returned kaiju-coder-7 with max model len 16384.
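The "with exactly" condition in these smokes can be sketched as a strict file comparison. This is an assumed check, not the project's own verifier; tolerating a single trailing newline is a choice made here for illustration.

```python
import tempfile
from pathlib import Path


def has_exact_content(path: Path, expected: str) -> bool:
    """True if the file holds exactly `expected`, allowing one trailing newline."""
    if not path.is_file():
        return False
    text = path.read_text(encoding="utf-8")
    return text == expected or text == expected + "\n"


# Demo against a throwaway workspace instead of the real /tmp smoke directory.
with tempfile.TemporaryDirectory() as workspace:
    target = Path(workspace) / "hello.txt"
    target.write_text("Kaiju Coder 7 fresh public smoke ok\n", encoding="utf-8")
    exact_ok = has_exact_content(target, "Kaiju Coder 7 fresh public smoke ok")
    near_miss = has_exact_content(target, "Kaiju Coder 7 fresh public smoke")
    missing = has_exact_content(Path(workspace) / "absent.txt", "anything")

print(exact_ok, near_miss, missing)
```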

Current restored 16k direct API identity smoke:

Run: runs/benchmarks/20260603T174545Z-kaiju-coder-7-serving/summary.md
/v1/models: kaiju-coder-7, max model len 16384

Context	Prompt	OK	Seconds	Chars	Chars/s
16384	identity	True	2.3	26	11.304

Customer-pack command:

python3 scripts/run_kaiju_opencode_customer_pack.py

Latest harnessed product-path result on 2026-06-03:

Run: runs/opencode-customer-readiness/20260603T185835Z/summary.md
Mode: harnessed
Status: 4/4 passed
Tasks:
- fade-flow-service-site
- kiyomi-owner-operating-pack
- paid-api-safety-scaffold
- release-provenance-safety-review
Required files written: 28/28
Forbidden secret-looking tokens: none found by verifier
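The "forbidden secret-looking tokens" check could work along the lines below. The patterns and the `find_secret_like_tokens` helper are illustrative assumptions; the real verifier's rules are not documented in this file.

```python
import re
from typing import List

# Illustrative patterns only. They match common key shapes, not real credentials.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),                # OpenAI-style key shape
    re.compile(r"sk_(?:live|test)_[A-Za-z0-9]{16,}"),  # Stripe-style key shape
    re.compile(r"hf_[A-Za-z0-9]{30,}"),                # Hugging Face token shape
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),  # PEM private key header
]


def find_secret_like_tokens(text: str) -> List[str]:
    """Return every substring that matches one of the secret-shaped patterns."""
    hits: List[str] = []
    for pattern in SECRET_PATTERNS:
        hits.extend(match.group(0) for match in pattern.finditer(text))
    return hits


# The fake key is assembled at runtime so no key-shaped literal sits in the file.
fake_key = "sk_test_" + "A" * 24
clean_hits = find_secret_like_tokens("STRIPE_SECRET_KEY=<set in environment>")
dirty_hits = find_secret_like_tokens(f"STRIPE_SECRET_KEY={fake_key}")
print(clean_hits, len(dirty_hits))
```

A scan like this would run over every file a task writes, and the task would fail if any hit is returned.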

Loop-guarded OpenCode install smoke:

Command:

python3 scripts/install_kaiju_opencode_profile.py
opencode run -m kaiju/kaiju-coder-7 --agent kaiju-coder-7 \
  --dir /tmp/kaiju-opencode-loopguard-smoke \
  --dangerously-skip-permissions \
  'Create loopguard.txt with exactly: Kaiju Coder 7 loop guard installed'

Result: passed. OpenCode wrote loopguard.txt in the requested directory with exactly Kaiju Coder 7 loop guard installed and exited cleanly.
Installed guard: /Users/richardecholsai7/.config/opencode/kaiju-no-autocontinue.mjs

Raw OpenCode-agent result on 2026-06-03:

Task: fade-flow-service-site
Status: timed out after 900s
Required files written: 0
Observed Gojira-B decode throughput while running: about 4.4 tokens/sec
Follow-up runner fix: workspaces now run outside the repo and pass opencode run --dir <workspace> explicitly.
Structured follow-up run: runs/opencode-customer-readiness/20260603T135520Z/results.jsonl timed out after 60s, wrote 0 files, and recorded pwd as the intended temp workspace.
16k/stricter-agent follow-up runs:
- runs/opencode-customer-readiness/20260603T140650Z/results.jsonl timed out after 120s, wrote 0 files, and recorded the intended temp workspace.
- runs/opencode-customer-readiness/20260603T140908Z/results.jsonl timed out after 120s, wrote 0 files after adding stricter "write first file immediately" prompt guidance.
Interpretation: the lean OpenCode agent fits and can write small files. Harnessed file-plan delivery passes the customer pack. Current raw multi-file OpenCode generation is still not public/API ready, so public and paid claims must describe the reliable product path as model plus deterministic harness and verifier.

Recommendation Until Faster Serving Is Proven

Public local release can proceed only with clear speed/hardware caveats.
Paid API should route business-owner deliverables through deterministic harnesses and verifiers, not raw OpenCode multi-file generation.
Quantized candidates and/or a smaller distilled variant are required for broad public OpenCode usability.

vLLM Serving Probe

vLLM was tested as the practical alternative serving path after SGLang. The standard vllm/vllm-openai:latest image cannot read the merged checkpoint's qwen3_5 config. The Gojira nightly image can read it, but needed two launch fixes for this checkpoint:

- preinstall pandas, because the Qwen3.5 model path imports it in this image
- pass --language-model-only, because the merged text-serving checkpoint does not include the visual encoder weights expected by the multimodal config

Guarded benchmark command:

KAIJU_VLLM_CONTEXT=16384 KAIJU_VLLM_READY_TIMEOUT=900 \
  ./scripts/run-gojira-b-vllm-serving-benchmark.sh

Run: runs/benchmarks/20260603T151244Z-kaiju-coder-7-serving/summary.md

Stack	Context	Prompt	OK	Seconds	Chars	Chars/s
vLLM nightly	16384	identity	True	19.99	26	1.301
vLLM nightly	16384	code_patch	True	28.8	416	14.444

Interpretation: unquantized vLLM now runs Kaiju Coder 7 at 16k, but it was not clearly faster than SGLang on these smoke prompts. This is historical fallback evidence. The later bitsandbytes vLLM path plus fast proxy is the active speed path. Keep the live/default OpenCode profile at 16k until 32k is freshly re-confirmed.

vLLM bitsandbytes Runtime-Quantized Candidate

The first working quantized local variant is a runtime bitsandbytes vLLM path. It does not create separate quantized weights yet; it loads the full merged model through vLLM's bitsandbytes loader.

Command:

KAIJU_VLLM_CONTEXT=16384 \
KAIJU_VLLM_READY_TIMEOUT=1200 \
KAIJU_VLLM_QUANTIZATION=bitsandbytes \
KAIJU_VLLM_LOAD_FORMAT=bitsandbytes \
  ./scripts/run-gojira-b-vllm-serving-benchmark.sh

Runs:

runs/benchmarks/20260603T153257Z-kaiju-coder-7-serving/summary.md
runs/benchmarks/20260603T154450Z-kaiju-coder-7-serving/summary.md
runs/benchmarks/20260603T161316Z-kaiju-coder-7-serving/summary.md
runs/benchmarks/20260603T165512Z-kaiju-coder-7-serving/summary.md
runs/benchmarks/20260603T223337Z-kaiju-coder-7-serving/summary.md

Stack	Context	Prompt	OK	Seconds	Chars	Chars/s
vLLM bitsandbytes	8192	identity	True	21.19	26	1.227
vLLM bitsandbytes	8192	code_patch	True	11.31	424	37.489
vLLM bitsandbytes	16384	identity	True	19.51	26	1.333
vLLM bitsandbytes	16384	code_patch	True	11.3	416	36.814
vLLM bitsandbytes	16384	business_doc	True	53.44	1610	30.127
vLLM bitsandbytes	16384	identity	True	19.65	26	1.323
vLLM bitsandbytes	16384	code_patch	True	24.97	997	39.924
vLLM bitsandbytes	16384	business_doc	True	34.46	1615	46.874

Gojira-B vLLM logs reported about 17.8 GiB model memory for the bitsandbytes load at both 8k and 16k, compared with about 50.22 GiB for the unquantized vLLM load. Code-patch latency improved materially on this smoke prompt. Business-document latency improved versus the restored 32k SGLang business-doc smoke (53.44s at 16k vLLM bitsandbytes versus 94.28s at 32k SGLang). Identity latency remains slower than SGLang.
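The size of these improvements can be worked out directly from the figures quoted above. The helper name `reduction_pct` is illustrative. The business-doc comparison is not like-for-like, since it crosses both serving stack and context size.

```python
def reduction_pct(before: float, after: float) -> float:
    """Percentage reduction going from `before` to `after`."""
    return (before - after) / before * 100.0


# Model memory in GiB: unquantized vLLM load vs bitsandbytes load.
memory_saving = reduction_pct(50.22, 17.8)

# code_patch seconds at 16k: unquantized vLLM (28.8) vs bitsandbytes (11.3),
# both returning 416 characters.
code_patch_saving = reduction_pct(28.8, 11.3)

# business_doc seconds: 32k SGLang (94.28) vs 16k vLLM bitsandbytes (53.44).
business_doc_saving = reduction_pct(94.28, 53.44)

print(f"memory: {memory_saving:.1f}%")
print(f"code_patch latency: {code_patch_saving:.1f}%")
print(f"business_doc latency: {business_doc_saving:.1f}%")
```

On these numbers the quantized load uses roughly a third of the unquantized memory, and the code-patch smoke finishes in well under half the time.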

Quantized OpenCode one-file smoke passed after launching vLLM with --enable-auto-tool-choice plus --tool-call-parser qwen3_coder and running:

bash scripts/run_kaiju_quantized_opencode_smoke.sh

Result: OpenCode wrote /tmp/kaiju-opencode-quantized-smoke/hello.txt with exactly Kaiju Coder 7 quantized runtime ok.
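For orientation, here is a hedged sketch of how the KAIJU_VLLM_* environment variables might translate into a vLLM launch command. The mapping, the `build_vllm_args` helper, and the model path are assumptions; the actual logic in run-gojira-b-vllm-serving-benchmark.sh is not shown in this file. The flags themselves are standard vLLM server options plus the two tool-calling options named above.

```python
import shlex
from typing import List, Mapping


def build_vllm_args(model_path: str, env: Mapping[str, str],
                    enable_tools: bool = True) -> List[str]:
    """Build a `vllm serve` argv from KAIJU_VLLM_* style settings (assumed mapping)."""
    args = [
        "vllm", "serve", model_path,
        "--served-model-name", "kaiju-coder-7",  # keep the public model id fixed
        "--max-model-len", env.get("KAIJU_VLLM_CONTEXT", "16384"),
    ]
    quantization = env.get("KAIJU_VLLM_QUANTIZATION")
    if quantization:
        args += ["--quantization", quantization]
    load_format = env.get("KAIJU_VLLM_LOAD_FORMAT")
    if load_format:
        args += ["--load-format", load_format]
    if enable_tools:
        # Needed for the OpenCode one-file smoke, per the note above.
        args += ["--enable-auto-tool-choice", "--tool-call-parser", "qwen3_coder"]
    return args


# Hypothetical model path; the environment values mirror the command above.
settings = {
    "KAIJU_VLLM_CONTEXT": "16384",
    "KAIJU_VLLM_QUANTIZATION": "bitsandbytes",
    "KAIJU_VLLM_LOAD_FORMAT": "bitsandbytes",
}
launch_args = build_vllm_args("/models/kaiju-coder-7-merged", settings)
print(shlex.join(launch_args))
```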

Recommendation: use vLLM bitsandbytes behind the local fast proxy as the current public/OpenCode speed path and keep the installed OpenCode profile at 16k unless the 32k target has just been restarted and re-confirmed. Treat SGLang as fallback and historical high-context evidence. vLLM bitsandbytes has direct identity/code/business-doc evidence plus an OpenCode one-file smoke, but it is not a persisted quantized-weights repo.

2026-06-03 Fast Proxy And Website Harness Speed Pass

The current speed profile keeps runtime-quantized vLLM active on Gojira-B port 18084 and routes OpenCode through the local fast proxy at http://127.0.0.1:18181/v1. The proxy preserves OpenCode tool-call streaming while forcing thinking=false, model id kaiju-coder-7, and bounded output budgets.
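The request normalization described here can be pictured as a small rewrite step applied to each chat-completions payload before it is forwarded upstream. The sketch below is not the proxy's code: `normalize_request` and the 768-token cap are hypothetical, and disabling thinking via `chat_template_kwargs.enable_thinking` is an assumption about how a Qwen-style chat template on vLLM is controlled.

```python
import copy

MODEL_ID = "kaiju-coder-7"
MAX_OUTPUT_TOKENS = 768  # hypothetical budget; the proxy's real limits are not listed here


def normalize_request(payload: dict) -> dict:
    """Return a copy of the payload with model id, output budget, and thinking forced."""
    out = copy.deepcopy(payload)  # never mutate the caller's request
    out["model"] = MODEL_ID
    requested = out.get("max_tokens")
    if requested is None:
        out["max_tokens"] = MAX_OUTPUT_TOKENS
    else:
        out["max_tokens"] = min(int(requested), MAX_OUTPUT_TOKENS)
    template_kwargs = dict(out.get("chat_template_kwargs") or {})
    template_kwargs["enable_thinking"] = False  # assumed switch for thinking=false
    out["chat_template_kwargs"] = template_kwargs
    return out


incoming = {
    "model": "some-other-id",
    "max_tokens": 4096,
    "stream": True,
    "messages": [{"role": "user", "content": "Create hello.txt"}],
}
forwarded = normalize_request(incoming)
print(forwarded["model"], forwarded["max_tokens"], forwarded["chat_template_kwargs"])
```

Fields such as `stream`, `tools`, and `messages` pass through untouched, which is what lets tool-call streaming survive the rewrite.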

Active endpoint checks:

Local fast proxy health: http://127.0.0.1:18181/health
Upstream vLLM models: http://100.109.109.14:18084/v1/models
Upstream reports kaiju-coder-7 with max_model_len=16384
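The models check can be scripted. The sketch below assumes the upstream returns an OpenAI-style model list whose entries carry a `max_model_len` field, as the check above suggests; the helper names and the sample payload are hypothetical.

```python
import json
import urllib.request
from typing import Dict, Optional


def fetch_models(base_url: str, timeout: float = 10.0) -> dict:
    """GET <base_url>/models and decode the JSON body (not called in this demo)."""
    with urllib.request.urlopen(f"{base_url.rstrip('/')}/models", timeout=timeout) as resp:
        return json.load(resp)


def model_limits(models_payload: dict) -> Dict[str, Optional[int]]:
    """Map each served model id to its reported max_model_len, if any."""
    return {
        entry["id"]: entry.get("max_model_len")
        for entry in models_payload.get("data", [])
    }


def endpoint_matches(models_payload: dict, model_id: str, context: int) -> bool:
    return model_limits(models_payload).get(model_id) == context


# Hypothetical response shaped like the upstream report above.
sample_payload = {
    "object": "list",
    "data": [{"id": "kaiju-coder-7", "object": "model", "max_model_len": 16384}],
}
print(endpoint_matches(sample_payload, "kaiju-coder-7", 16384))
```

Against the live endpoint, the payload would come from `fetch_models("http://100.109.109.14:18084/v1")` instead of the sample.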

Fresh direct vLLM benchmark:

Run: runs/benchmarks/20260603T223337Z-kaiju-coder-7-serving/summary.md
Identity: 19.48s
Code patch: 24.97s, 997 chars
Business doc: 34.46s, 1,615 chars

Fresh OpenCode smoke through the local fast proxy:

Command: opencode run -m kaiju/kaiju-coder-7 --agent kaiju-coder-7 --dir /tmp/kaiju-vllm-opencode-smoke --dangerously-skip-permissions 'Create fast-vllm.txt with exactly: Kaiju quantized vLLM OpenCode ok'
Result: passed in about 23.5s, wrote the exact requested file.
Packaged public verifier after exact-content agent rule: runs/public-opencode-smoke/20260603T235002Z/summary.md, 4/4 passed through http://127.0.0.1:18181/v1.

Website harness/router speed pass:

Direct website harness command: python3 scripts/run_kaiju_website_harness.py --openai-base-url http://100.109.109.14:18084/v1 --model kaiju-coder-7 ...
Direct website harness result: runs/harness/website-speed-pass/avery-stone-vllm.html, 9,257 chars, 7.31s
Router command: python3 scripts/run_kaiju_router.py --kind website --openai-base-url http://100.109.109.14:18084/v1 --model kaiju-coder-7 ...
Router artifact: runs/router-speed-pass/20260603T223731Z-website-build-a-premium-one-page-website-for-avery-stone-construction-a-reside/index.html
Router result: passed in 7.20s; checks covered complete HTML, required sections, external images, responsive CSS, no lorem ipsum, and manifest write.
Router through the installed local proxy: runs/router-speed-pass/20260603T224328Z-website-build-a-premium-one-page-website-for-bennett-family-dental-in-charlott/index.html
Proxy router result: passed in 4.67s; preserved explicit CTA Schedule a Visit, inferred dental, and passed the same complete-HTML/static checks.
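The static checks listed for the router (complete HTML, required sections, external images, responsive CSS, no lorem ipsum) might look roughly like this. It is a simplified sketch with assumed heuristics and a hypothetical `check_html` helper, not the router's actual verifier, and it omits the manifest-write check.

```python
import re
from typing import Dict, List


def check_html(html: str, required_sections: List[str]) -> Dict[str, bool]:
    """Run simple string-level checks on a generated page."""
    lower = html.lower()
    return {
        "complete_html": lower.lstrip().startswith("<!doctype html") and "</html>" in lower,
        # Assumes each required section is marked with a matching id attribute.
        "required_sections": all(f'id="{name.lower()}"' in lower for name in required_sections),
        "external_images": re.search(r'<img[^>]+src="https?://', lower) is not None,
        "responsive_css": "@media" in lower and 'name="viewport"' in lower,
        "no_lorem_ipsum": "lorem ipsum" not in lower,
    }


# Hypothetical page used only to exercise the checks.
sample_page = """<!DOCTYPE html>
<html><head><meta name="viewport" content="width=device-width, initial-scale=1">
<style>@media (max-width: 600px) { body { font-size: 16px; } }</style></head>
<body><section id="services"><img src="https://example.com/a.jpg" alt=""></section>
<section id="contact">Schedule a Visit</section></body></html>"""

results = check_html(sample_page, ["services", "contact"])
print(results)
print("passed" if all(results.values()) else "failed")
```

String-level checks like these are cheap and deterministic, which suits a harness whose aim is repeatable pass/fail evidence rather than visual review.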

Updated recommendation: for speed-sensitive OpenCode and paid workflow testing, use vLLM bitsandbytes plus the local fast proxy as the active default. Keep SGLang as fallback/historical evidence, not the fastest current path. For websites and business-owner packs, prefer the deterministic router/harness path over raw long-form HTML generation.

Public business-owner demo pack through the active fast proxy:

python3 scripts/run_kaiju_public_demo_pack.py \
  --openai-base-url http://127.0.0.1:18181/v1 \
  --model kaiju-coder-7 \
  --planner-timeout 90

Run: runs/public-demo-pack/20260603T235009Z/summary.md

Task	Result	Seconds	Changed files
Website	Passed	4.73	2
Owner AI company pack	Passed	29.85	19
Stripe safety plan	Passed	9.99	2
CSV parser artifact	Passed	19.97	2

Total: 4/4 passed in 64.529s.

Persisted GGUF Q8_0 Candidate

The dedicated persisted-quantization pass on 2026-06-03 found that standard AWQ/GPTQ tooling does not install cleanly alongside the Qwen3.5-capable serving stack, while llama.cpp conversion support includes Qwen3_5ForConditionalGeneration.

Commands:

./scripts/probe-gojira-b-persisted-quantization.sh
./scripts/run-gojira-b-kaiju-gguf-convert.sh

Result:

Artifact: /home/richardecholsai5/kaiju-coder/models/kaiju-coder-7-gguf/kaiju-coder-7-Q8_0.gguf
Size: 27G
SHA256: 596a2c227a429c7309db753061d88d71ee3f8a3b48f17e41ba9d81b0f55bdd4e
Conversion log: runs/gguf-conversion/20260603T231446Z/gguf-conversion.log
Runtime status: candidate only; direct GGUF runtime smoke still required before publishing quantized weights.
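Before any runtime smoke, a copy of the artifact can be checked against the recorded SHA256. This is a minimal sketch using only the standard library; the helper names are illustrative, and the demo hashes a small temporary file rather than the 27G GGUF.

```python
import hashlib
import tempfile
from pathlib import Path

# Digest recorded above for kaiju-coder-7-Q8_0.gguf.
RECORDED_SHA256 = "596a2c227a429c7309db753061d88d71ee3f8a3b48f17e41ba9d81b0f55bdd4e"


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large artifacts never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_recorded_hash(path: Path) -> bool:
    return sha256_of(path) == RECORDED_SHA256


# Demo on a throwaway file; a real check would point at the GGUF path.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "demo.bin"
    demo_file.write_bytes(b"abc")
    demo_digest = sha256_of(demo_file)
    demo_matches = matches_recorded_hash(demo_file)

print(demo_digest, demo_matches)
```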

Interpretation: the next real speed improvement for broad public users is not another prompt tweak. It is a smoke-tested GGUF or GPU-persisted quantized artifact. The fastest currently verified Kaiju Coder 7 path remains vLLM bitsandbytes plus the local fast proxy and deterministic website/business harnesses.