# Kaiju Coder 7 Serving Benchmarks

This file records serving evidence for public download and paid API decisions.
The model id must remain `kaiju-coder-7`.

## Current Live Runtime

- Host: Gojira-B over Tailscale
- Local OpenCode base URL: `http://127.0.0.1:18181/v1`
- Upstream base URL: `http://100.109.109.14:18084/v1`
- Serving stack: vLLM bitsandbytes runtime quantization behind the Kaiju fast
  proxy
- Current verified context: `16384`
- Tested high-context target: `32768`
- Current container: `qwen36-merged-vllm-18084`
- Current caveat: direct raw generation is still slow for multi-file OpenCode
  work; use the deterministic router/harness for public business-owner demos.

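
One way to confirm which context an endpoint is actually serving is to read `/v1/models`. The sketch below assumes the response uses the OpenAI-style list shape with a `max_model_len` field per model, which is what the endpoint checks in this file report. The helper names and the sample payload are hypothetical.

```python
import json
import urllib.request


def served_context(models_payload: dict, model_id: str = "kaiju-coder-7"):
    """Return max_model_len for model_id, or None if the model is not listed."""
    for entry in models_payload.get("data", []):
        if entry.get("id") == model_id:
            return entry.get("max_model_len")
    return None


def fetch_models(base_url: str, timeout: float = 10.0) -> dict:
    """GET <base_url>/models and decode the JSON body."""
    url = f"{base_url.rstrip('/')}/models"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


# Offline example payload; the shape is an assumption, not captured output.
sample = {"object": "list", "data": [{"id": "kaiju-coder-7", "max_model_len": 16384}]}
print(served_context(sample))
```

Against a live server, `served_context(fetch_models("http://127.0.0.1:18181/v1"))` would perform the same check, provided the proxy forwards `/v1/models` unchanged.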
## Benchmark Command

For current-context latency without restart:

```bash
python3 scripts/benchmark_kaiju_serving.py \
  --contexts 12288 \
  --prompts identity business_doc code_patch \
  --max-tokens 768 \
  --timeout 420
```

For context restart benchmarking:

```bash
python3 scripts/benchmark_kaiju_serving.py \
  --restart \
  --contexts 12288 16384 24576 32768 \
  --prompts identity business_doc \
  --max-tokens 768 \
  --timeout 420 \
  --ready-timeout 1200
```

Use `--contexts 16384` for the current restored Gojira-B endpoint. Use
`32768` when explicitly testing the high-context target; it has passed earlier
benchmarks but should be re-confirmed after a fresh restart before calling it
the live default.

## Current 12k Direct API Benchmark

Command:

```bash
python3 scripts/benchmark_kaiju_serving.py \
  --contexts 12288 \
  --prompts identity code_patch \
  --max-tokens 256 \
  --timeout 300
```

Run: `runs/benchmarks/20260603T135017Z-kaiju-coder-7-serving/summary.md`

| Context | Prompt | OK | Seconds | Chars | Chars/s |
| --- | --- | --- | ---: | ---: | ---: |
| 12288 | identity | True | 2.41 | 26 | 10.788 |
| 12288 | code_patch | True | 57.61 | 860 | 14.928 |

Interpretation: direct API calls are usable for short tasks, but latency is too
high for a paid raw-code API unless outputs are streamed and route-specific
limits are enforced.

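
The `Chars/s` column can be reproduced from the `Chars` and `Seconds` columns. The sketch below is not the benchmark script itself: `measure` and `chars_per_second` are hypothetical helpers, and the three-decimal rounding is inferred from the table rather than taken from the script.

```python
import time


def chars_per_second(chars: int, seconds: float) -> float:
    """Throughput as characters divided by wall-clock seconds, rounded to 3 places."""
    return round(chars / seconds, 3)


def measure(generate, prompt: str) -> dict:
    """Time one call to generate(prompt) and report the table columns."""
    start = time.perf_counter()
    text = generate(prompt)
    seconds = time.perf_counter() - start
    rate = len(text) / seconds if seconds > 0 else 0.0
    return {"ok": bool(text), "seconds": seconds, "chars": len(text), "chars_per_s": rate}


# Recompute the two rows above from their Chars and Seconds columns.
identity_rate = chars_per_second(26, 2.41)
code_patch_rate = chars_per_second(860, 57.61)
print(identity_rate, code_patch_rate)
```

Because the metric counts characters over total wall time, a short reply such as the 26-character identity answer is dominated by fixed request overhead, which is why its rate looks low even when the call returns quickly.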
## 16k Context Benchmark

16k was tested to reduce OpenCode compaction pressure.

Commands:

```bash
python3 scripts/benchmark_kaiju_serving.py \
  --restart \
  --contexts 16384 \
  --prompts identity \
  --max-tokens 128 \
  --timeout 300 \
  --ready-timeout 1200

python3 scripts/benchmark_kaiju_serving.py \
  --contexts 16384 \
  --prompts code_patch \
  --max-tokens 128 \
  --timeout 300
```

Runs:

- `runs/benchmarks/20260603T135651Z-kaiju-coder-7-serving/summary.md`
- `runs/benchmarks/20260603T140318Z-kaiju-coder-7-serving/summary.md`

| Context | Prompt | OK | Load Wait | Seconds | Chars | Chars/s |
| --- | --- | --- | ---: | ---: | ---: | ---: |
| 16384 | identity | True | 354.16 | 14.9 | 26 | 1.745 |
| 16384 | code_patch | True | n/a | 28.99 | 416 | 14.35 |

Interpretation: `16384` is a stable lower-load fallback and still leaves more
room above OpenCode's prompt/tool overhead than the original 12k setting.

## 24k And 32k Context Benchmarks

24k and 32k were tested after 16k proved stable. Both loaded and returned the
same code-patch latency profile as 16k on the short patch benchmark.

Commands:

```bash
python3 scripts/benchmark_kaiju_serving.py \
  --restart \
  --contexts 24576 \
  --prompts identity \
  --max-tokens 128 \
  --timeout 300 \
  --ready-timeout 1200

python3 scripts/benchmark_kaiju_serving.py \
  --contexts 24576 \
  --prompts code_patch \
  --max-tokens 128 \
  --timeout 300

python3 scripts/benchmark_kaiju_serving.py \
  --restart \
  --contexts 32768 \
  --prompts identity \
  --max-tokens 64 \
  --timeout 300 \
  --ready-timeout 1200

python3 scripts/benchmark_kaiju_serving.py \
  --contexts 32768 \
  --prompts code_patch \
  --max-tokens 128 \
  --timeout 300
```

Runs:

- `runs/benchmarks/20260603T141559Z-kaiju-coder-7-serving/summary.md`
- `runs/benchmarks/20260603T142354Z-kaiju-coder-7-serving/summary.md`
- `runs/benchmarks/20260603T142439Z-kaiju-coder-7-serving/summary.md`
- `runs/benchmarks/20260603T143256Z-kaiju-coder-7-serving/summary.md`

| Context | Prompt | OK | Load Wait | Seconds | Chars | Chars/s |
| --- | --- | --- | ---: | ---: | ---: | ---: |
| 24576 | identity | True | 439.54 | 16.84 | 26 | 1.544 |
| 24576 | code_patch | True | n/a | 29.03 | 416 | 14.33 |
| 32768 | identity | True | 386.53 | 16.27 | 26 | 1.598 |
| 32768 | code_patch | True | n/a | 28.99 | 416 | 14.35 |

Interpretation: `32768` is a proven high-context target from this benchmark set,
but it is not the currently parked live endpoint after the later
quantized-runtime testing. The current Gojira-B/OpenCode profile should stay at
`16384` until `32768` is freshly restarted and re-confirmed. Keep `12288` for
direct API smoke tests and constrained hardware.

Restored-service 32k direct API smoke after vLLM testing:

- Run: `runs/benchmarks/20260603T155233Z-kaiju-coder-7-serving/summary.md`
- `/v1/models`: `kaiju-coder-7`, max model len `32768`

| Context | Prompt | OK | Seconds | Chars | Chars/s |
| --- | --- | --- | ---: | ---: | ---: |
| 32768 | identity | True | 2.92 | 26 | 8.904 |
| 32768 | business_doc | True | 94.28 | 1737 | 18.424 |

Interpretation: the restored default endpoint is usable for business-owner
document work, but a long proposal response still takes about 94 seconds. Paid
routes must stream, cap output, queue carefully, and prefer verified
artifact routes over raw open-ended generation.

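
The "stream and cap output" requirement can be sketched as a request builder for an OpenAI-compatible `/v1/chat/completions` endpoint. The route names reuse the benchmark prompt names, but the cap values and the function name are hypothetical and would need tuning from measured latency.

```python
from typing import Optional

# Hypothetical per-route output caps; not values taken from the deployed service.
ROUTE_MAX_TOKENS = {"identity": 64, "code_patch": 256, "business_doc": 768}


def build_chat_request(route: str, user_text: str,
                       requested_max_tokens: Optional[int] = None) -> dict:
    """Build a streaming chat request whose output budget never exceeds the route cap."""
    cap = ROUTE_MAX_TOKENS[route]  # unknown routes raise KeyError on purpose
    if requested_max_tokens is None:
        max_tokens = cap
    else:
        max_tokens = min(requested_max_tokens, cap)
    return {
        "model": "kaiju-coder-7",
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": max_tokens,
        "stream": True,  # always stream so long generations show progress
    }


request = build_chat_request("business_doc", "Draft a one-page proposal.", 4000)
print(request["max_tokens"], request["stream"])
```

Rejecting unknown routes outright, rather than falling back to a default cap, keeps open-ended generation from slipping through an unreviewed path.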
## OpenCode Customer-Readiness Evidence

Final restored-service small OpenCode smoke:

```bash
opencode run -m kaiju/kaiju-coder-7 --agent kaiju-coder-7 \
  --dir /tmp/kaiju-opencode-32k-final-smoke \
  'Create hello.txt with exactly: Kaiju Coder 7 final 32k ok'
```

Result: passed. OpenCode wrote `hello.txt` with exactly
`Kaiju Coder 7 final 32k ok`.

Current restored 16k OpenCode smoke after quantized-vLLM testing:

```bash
mkdir -p /tmp/kaiju-opencode-fresh-public-smoke
opencode run -m kaiju/kaiju-coder-7 --agent kaiju-coder-7 \
  --dir /tmp/kaiju-opencode-fresh-public-smoke \
  --dangerously-skip-permissions \
  'Create hello.txt with exactly: Kaiju Coder 7 fresh public smoke ok'
```

Result: passed. OpenCode wrote `hello.txt` with exactly
`Kaiju Coder 7 fresh public smoke ok` in
`/tmp/kaiju-opencode-fresh-public-smoke`, and `/v1/models` returned
`kaiju-coder-7` with max model len `16384`.

Current restored 16k direct API identity smoke:

- Run: `runs/benchmarks/20260603T174545Z-kaiju-coder-7-serving/summary.md`
- `/v1/models`: `kaiju-coder-7`, max model len `16384`

| Context | Prompt | OK | Seconds | Chars | Chars/s |
| --- | --- | --- | ---: | ---: | ---: |
| 16384 | identity | True | 2.3 | 26 | 11.304 |

Customer-readiness pack command:

```bash
python3 scripts/run_kaiju_opencode_customer_pack.py
```

Latest harnessed product-path result on 2026-06-03:

- Run: `runs/opencode-customer-readiness/20260603T185835Z/summary.md`
- Mode: `harnessed`
- Status: `4/4` passed
- Tasks:
  - `fade-flow-service-site`
  - `kiyomi-owner-operating-pack`
  - `paid-api-safety-scaffold`
  - `release-provenance-safety-review`
- Required files written: `28/28`
- Forbidden secret-looking tokens: none found by verifier

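
The two verifier outcomes above (required files written, no secret-looking tokens) can be illustrated with a small workspace check. This is a sketch under stated assumptions, not the project's verifier: the regex patterns, the function name, and the file names in the demo are all hypothetical.

```python
import re
import tempfile
from pathlib import Path

# Hypothetical patterns for "secret-looking" tokens; the real verifier's rules are not shown here.
SECRET_PATTERNS = [
    re.compile(r"sk_(live|test)_[A-Za-z0-9]{16,}"),
    re.compile(r"hf_[A-Za-z0-9]{20,}"),
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
]


def verify_workspace(root: Path, required: list) -> dict:
    """Count required files that exist and flag files containing secret-looking text."""
    missing = [rel for rel in required if not (root / rel).is_file()]
    flagged = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            text = path.read_text(encoding="utf-8", errors="ignore")
            if any(pattern.search(text) for pattern in SECRET_PATTERNS):
                flagged.append(str(path.relative_to(root)))
    return {
        "written": len(required) - len(missing),
        "required": len(required),
        "missing": missing,
        "flagged": flagged,
        "passed": not missing and not flagged,
    }


# Demo on a throwaway directory with one of two required files present.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "index.html").write_text("<html></html>", encoding="utf-8")
    report = verify_workspace(root, ["index.html", "README.md"])

print(report["written"], report["required"], report["missing"])
```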
Loop-guarded OpenCode install smoke:

- Command: `python3 scripts/install_kaiju_opencode_profile.py`, then
  `opencode run -m kaiju/kaiju-coder-7 --agent kaiju-coder-7 --dir /tmp/kaiju-opencode-loopguard-smoke --dangerously-skip-permissions 'Create loopguard.txt with exactly: Kaiju Coder 7 loop guard installed'`
- Result: passed. OpenCode wrote `loopguard.txt` in the requested directory with
  exactly `Kaiju Coder 7 loop guard installed` and exited cleanly.
- Installed guard: `/Users/richardecholsai7/.config/opencode/kaiju-no-autocontinue.mjs`

Raw OpenCode-agent result on 2026-06-03:

- Task: `fade-flow-service-site`
- Status: timed out after `900s`
- Required files written: `0`
- Observed Gojira-B decode throughput while running: about `4.4` tokens/sec
- Follow-up runner fix: workspaces now run outside the repo and pass
  `opencode run --dir <workspace>` explicitly.
- Structured follow-up run:
  `runs/opencode-customer-readiness/20260603T135520Z/results.jsonl`
  timed out after `60s`, wrote `0` files, and recorded `pwd` as the intended
  temp workspace.
- 16k/stricter-agent follow-up runs:
  - `runs/opencode-customer-readiness/20260603T140650Z/results.jsonl`
    timed out after `120s`, wrote `0` files, and recorded the intended temp
    workspace.
  - `runs/opencode-customer-readiness/20260603T140908Z/results.jsonl`
    timed out after `120s`, wrote `0` files after adding stricter "write first
    file immediately" prompt guidance.
- Interpretation: the lean OpenCode agent fits and can write small files.
  Harnessed file-plan delivery passes the customer pack. Current raw multi-file
  OpenCode generation is still not public/API ready, so public and paid claims
  must describe the reliable product path as model plus deterministic harness
  and verifier.

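
The timed-out runs above depend on the runner enforcing a hard wall-clock budget per task. A minimal sketch of that pattern with `subprocess.run` follows; the function name and status strings are hypothetical, and a sleeping child process stands in for a slow `opencode run` call.

```python
import subprocess
import sys
import time


def run_with_timeout(argv: list, timeout_s: float) -> dict:
    """Run a command, killing it and reporting 'timed_out' once the budget is spent."""
    start = time.monotonic()
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
        status = "passed" if proc.returncode == 0 else "failed"
    except subprocess.TimeoutExpired:
        status = "timed_out"
    return {"status": status, "seconds": round(time.monotonic() - start, 2)}


# Stand-in child process that sleeps longer than the budget.
slow = run_with_timeout([sys.executable, "-c", "import time; time.sleep(5)"], timeout_s=0.5)
print(slow["status"])
```

Recording the status alongside the elapsed seconds lets a results file distinguish a task that failed quickly from one that exhausted its budget.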
## Recommendation Until Faster Serving Is Proven

- Public local release can proceed only with clear speed/hardware caveats.
- Paid API should route business-owner deliverables through deterministic
  harnesses and verifiers, not raw OpenCode multi-file generation.
- Quantized candidates and/or a smaller distilled variant are required for
  broad public OpenCode usability.

## vLLM Serving Probe

vLLM was tested as the practical alternative serving path after SGLang. The
standard `vllm/vllm-openai:latest` image cannot read the merged checkpoint's
`qwen3_5` config. The Gojira nightly image can read it, but needed two launch
fixes for this checkpoint:

- preinstall `pandas`, because the Qwen3.5 model path imports it in this image
- pass `--language-model-only`, because the merged text-serving checkpoint does
  not include the visual encoder weights expected by the multimodal config

Guarded benchmark command:

```bash
KAIJU_VLLM_CONTEXT=16384 KAIJU_VLLM_READY_TIMEOUT=900 \
  ./scripts/run-gojira-b-vllm-serving-benchmark.sh
```

Run: `runs/benchmarks/20260603T151244Z-kaiju-coder-7-serving/summary.md`

| Stack | Context | Prompt | OK | Seconds | Chars | Chars/s |
| --- | ---: | --- | --- | ---: | ---: | ---: |
| vLLM nightly | 16384 | identity | True | 19.99 | 26 | 1.301 |
| vLLM nightly | 16384 | code_patch | True | 28.8 | 416 | 14.444 |

Interpretation: unquantized vLLM now runs Kaiju Coder 7 at 16k, but it was not
clearly faster than SGLang on these smoke prompts. This is historical fallback
evidence. The later bitsandbytes vLLM path plus fast proxy is the active speed
path. Keep the live/default OpenCode profile at 16k until 32k is freshly
re-confirmed.

## vLLM bitsandbytes Runtime-Quantized Candidate

The first working quantized local variant is a runtime bitsandbytes vLLM path.
It does not create separate quantized weights yet; it loads the full merged
model through vLLM's bitsandbytes loader.

Command:

```bash
KAIJU_VLLM_CONTEXT=16384 \
KAIJU_VLLM_READY_TIMEOUT=1200 \
KAIJU_VLLM_QUANTIZATION=bitsandbytes \
KAIJU_VLLM_LOAD_FORMAT=bitsandbytes \
./scripts/run-gojira-b-vllm-serving-benchmark.sh
```

Runs:

- `runs/benchmarks/20260603T153257Z-kaiju-coder-7-serving/summary.md`
- `runs/benchmarks/20260603T154450Z-kaiju-coder-7-serving/summary.md`
- `runs/benchmarks/20260603T161316Z-kaiju-coder-7-serving/summary.md`
- `runs/benchmarks/20260603T165512Z-kaiju-coder-7-serving/summary.md`
- `runs/benchmarks/20260603T223337Z-kaiju-coder-7-serving/summary.md`

| Stack | Context | Prompt | OK | Seconds | Chars | Chars/s |
| --- | ---: | --- | --- | ---: | ---: | ---: |
| vLLM bitsandbytes | 8192 | identity | True | 21.19 | 26 | 1.227 |
| vLLM bitsandbytes | 8192 | code_patch | True | 11.31 | 424 | 37.489 |
| vLLM bitsandbytes | 16384 | identity | True | 19.51 | 26 | 1.333 |
| vLLM bitsandbytes | 16384 | code_patch | True | 11.3 | 416 | 36.814 |
| vLLM bitsandbytes | 16384 | business_doc | True | 53.44 | 1610 | 30.127 |
| vLLM bitsandbytes | 16384 | identity | True | 19.65 | 26 | 1.323 |
| vLLM bitsandbytes | 16384 | code_patch | True | 24.97 | 997 | 39.924 |
| vLLM bitsandbytes | 16384 | business_doc | True | 34.46 | 1615 | 46.874 |

Gojira-B vLLM logs reported about `17.8 GiB` model memory for the bitsandbytes
load at both 8k and 16k, compared with about `50.22 GiB` for the unquantized
vLLM load. Code-patch latency improved materially on this smoke prompt.
Business-document latency improved versus the restored 32k SGLang business-doc
smoke (`53.44s` at 16k vLLM bitsandbytes versus `94.28s` at 32k SGLang).
Identity latency remains slower than SGLang.

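
The size of these improvements can be worked out directly from the figures quoted in this file. The snippet below only does arithmetic on those reported numbers; the variable and function names are illustrative.

```python
def speedup(before_s: float, after_s: float) -> float:
    """How many times faster the 'after' measurement is, rounded to 2 places."""
    return round(before_s / after_s, 2)


# Model memory: bitsandbytes load as a fraction of the unquantized load (GiB).
memory_fraction = round(17.8 / 50.22, 3)

# code_patch at 16k: unquantized vLLM nightly (28.8s) vs vLLM bitsandbytes (11.3s).
code_patch_speedup = speedup(28.8, 11.3)

# business_doc: restored 32k SGLang smoke (94.28s) vs 16k vLLM bitsandbytes (53.44s).
business_doc_speedup = speedup(94.28, 53.44)

print(memory_fraction, code_patch_speedup, business_doc_speedup)
```

In round terms, the runtime-quantized load uses about 35% of the unquantized model memory, and the smoke prompts finish roughly 2.5x faster for the code patch and 1.8x faster for the business document. The business-document comparison crosses both serving stacks and context sizes, so it is indicative rather than a controlled measurement.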
Quantized OpenCode one-file smoke passed after launching vLLM with
`--enable-auto-tool-choice` plus `--tool-call-parser qwen3_coder` and running:

```bash
bash scripts/run_kaiju_quantized_opencode_smoke.sh
```

Result: OpenCode wrote `/tmp/kaiju-opencode-quantized-smoke/hello.txt` with
exactly `Kaiju Coder 7 quantized runtime ok`.

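
For orientation, the launch flags named in this file can be collected into a single `vllm serve` argument list. This is a sketch, not the contents of `run-gojira-b-vllm-serving-benchmark.sh`: the model path is hypothetical, and the mapping from the `KAIJU_VLLM_*` variables to `--max-model-len`, `--quantization`, and `--load-format` is an assumption.

```python
def build_vllm_argv(model_path: str, context: int = 16384, quantized: bool = True) -> list:
    """Assemble a vllm serve command line from the flags this file mentions."""
    argv = [
        "vllm", "serve", model_path,
        "--served-model-name", "kaiju-coder-7",  # the model id must stay fixed
        "--max-model-len", str(context),
        "--language-model-only",                 # checkpoint ships no visual encoder weights
        "--enable-auto-tool-choice",
        "--tool-call-parser", "qwen3_coder",
    ]
    if quantized:
        # Runtime quantization: the full merged weights are loaded through bitsandbytes.
        argv += ["--quantization", "bitsandbytes", "--load-format", "bitsandbytes"]
    return argv


# "/models/kaiju-coder-7-merged" is a placeholder path, not the real checkpoint location.
argv = build_vllm_argv("/models/kaiju-coder-7-merged")
print(" ".join(argv))
```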
Recommendation: use vLLM bitsandbytes behind the local fast proxy as the
current public/OpenCode speed path and keep the installed OpenCode profile at
16k unless the 32k target has just been restarted and re-confirmed. Treat
SGLang as fallback and historical high-context evidence. vLLM bitsandbytes has
direct identity/code/business-doc evidence plus an OpenCode one-file smoke, but
it is not a persisted quantized-weights repo.

## 2026-06-03 Fast Proxy And Website Harness Speed Pass

The current speed profile keeps runtime-quantized vLLM active on Gojira-B port
`18084` and routes OpenCode through the local fast proxy at
`http://127.0.0.1:18181/v1`. The proxy preserves OpenCode tool-call streaming
while forcing `thinking=false`, model id `kaiju-coder-7`, and bounded output
budgets.

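
The request rewriting described above can be sketched as a pure function over the JSON body of a chat-completions call. This is not the proxy's actual code. It assumes thinking is disabled through vLLM's `chat_template_kwargs` request field with `enable_thinking` set to false, which depends on the checkpoint's chat template honoring that key, and the `2048` token budget is a hypothetical value.

```python
import copy

MODEL_ID = "kaiju-coder-7"
MAX_OUTPUT_TOKENS = 2048  # hypothetical budget; the proxy's real limit is not stated


def rewrite_request(body: dict) -> dict:
    """Return a copy of the request with model id, thinking, and output budget forced."""
    out = copy.deepcopy(body)
    out["model"] = MODEL_ID
    requested = out.get("max_tokens")
    if requested is None:
        out["max_tokens"] = MAX_OUTPUT_TOKENS
    else:
        out["max_tokens"] = min(int(requested), MAX_OUTPUT_TOKENS)
    template_kwargs = dict(out.get("chat_template_kwargs") or {})
    template_kwargs["enable_thinking"] = False  # assumed switch for thinking=false
    out["chat_template_kwargs"] = template_kwargs
    return out  # "stream" and "tools" pass through untouched


incoming = {"model": "anything", "stream": True, "max_tokens": 99999,
            "messages": [{"role": "user", "content": "Create hello.txt"}]}
forwarded = rewrite_request(incoming)
print(forwarded["model"], forwarded["max_tokens"], forwarded["stream"])
```

Leaving `stream` and any `tools` fields untouched is what keeps tool-call streaming intact; only the three forced fields change.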
Active endpoint checks:

- Local fast proxy health: `http://127.0.0.1:18181/health`
- Upstream vLLM models: `http://100.109.109.14:18084/v1/models`
- Upstream reports `kaiju-coder-7` with `max_model_len=16384`

Fresh direct vLLM benchmark:

- Run: `runs/benchmarks/20260603T223337Z-kaiju-coder-7-serving/summary.md`
- Identity: `19.48s`
- Code patch: `24.97s`, `997` chars
- Business doc: `34.46s`, `1,615` chars

Fresh OpenCode smoke through the local fast proxy:

- Command: `opencode run -m kaiju/kaiju-coder-7 --agent kaiju-coder-7 --dir /tmp/kaiju-vllm-opencode-smoke --dangerously-skip-permissions 'Create fast-vllm.txt with exactly: Kaiju quantized vLLM OpenCode ok'`
- Result: passed in about `23.5s`, wrote the exact requested file.
- Packaged public verifier after exact-content agent rule:
  `runs/public-opencode-smoke/20260603T235002Z/summary.md`, `4/4`
  passed through `http://127.0.0.1:18181/v1`.

Website harness/router speed pass:

- Direct website harness command: `python3 scripts/run_kaiju_website_harness.py --openai-base-url http://100.109.109.14:18084/v1 --model kaiju-coder-7 ...`
- Direct website harness result: `runs/harness/website-speed-pass/avery-stone-vllm.html`, `9,257` chars, `7.31s`
- Router command: `python3 scripts/run_kaiju_router.py --kind website --openai-base-url http://100.109.109.14:18084/v1 --model kaiju-coder-7 ...`
- Router artifact: `runs/router-speed-pass/20260603T223731Z-website-build-a-premium-one-page-website-for-avery-stone-construction-a-reside/index.html`
- Router result: passed in `7.20s`; checks covered complete HTML, required sections, external images, responsive CSS, no lorem ipsum, and manifest write.
- Router through the installed local proxy: `runs/router-speed-pass/20260603T224328Z-website-build-a-premium-one-page-website-for-bennett-family-dental-in-charlott/index.html`
- Proxy router result: passed in `4.67s`; preserved explicit CTA `Schedule a Visit`, inferred `dental`, and passed the same complete-HTML/static checks.

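
The static checks listed for the router (complete HTML, required sections, external images, responsive CSS, no lorem ipsum) can be approximated with plain string tests. The sketch below is illustrative only: the real checks are not shown in this file, so each rule here (for example, treating `@media` as evidence of responsive CSS and `id="..."` as a section marker) is an assumption.

```python
import re


def check_html(html: str, required_sections: list) -> dict:
    """Run simple static checks on a generated page and return one flag per check."""
    lower = html.lower()
    checks = {
        "complete_html": lower.lstrip().startswith("<!doctype html") and "</html>" in lower,
        "responsive_css": "@media" in lower,
        "no_lorem_ipsum": "lorem ipsum" not in lower,
        "external_images": bool(re.search(r'<img[^>]+src=["\']https?://', lower)),
    }
    for name in required_sections:
        checks[f"section:{name}"] = f'id="{name.lower()}"' in lower
    checks["passed"] = all(checks.values())
    return checks


sample = """<!DOCTYPE html>
<html><head><style>@media (max-width: 600px) { body { margin: 0; } }</style></head>
<body><section id="services"><img src="https://example.com/a.jpg" alt=""></section>
<section id="contact">Schedule a Visit</section></body></html>"""

result = check_html(sample, ["services", "contact"])
print(result["passed"])
```

Checks like these are cheap and deterministic, which is what allows the harness to reject an incomplete page without asking the model to grade its own output.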
Updated recommendation: for speed-sensitive OpenCode and paid workflow testing,
use vLLM bitsandbytes plus the local fast proxy as the active default. Keep
SGLang as fallback/historical evidence, not the fastest current path. For
websites and business-owner packs, prefer the deterministic router/harness path
over raw long-form HTML generation.

Public business-owner demo pack through the active fast proxy:

```bash
python3 scripts/run_kaiju_public_demo_pack.py \
  --openai-base-url http://127.0.0.1:18181/v1 \
  --model kaiju-coder-7 \
  --planner-timeout 90
```

Run: `runs/public-demo-pack/20260603T235009Z/summary.md`

| Task | Result | Seconds | Changed files |
| --- | --- | ---: | ---: |
| Website | Passed | 4.73 | 2 |
| Owner AI company pack | Passed | 29.85 | 19 |
| Stripe safety plan | Passed | 9.99 | 2 |
| CSV parser artifact | Passed | 19.97 | 2 |

Total: `4/4` passed in `64.529s`.

## Persisted GGUF Q8_0 Candidate

The dedicated persisted-quantization pass found that normal AWQ/GPTQ installs
are not clean against the Qwen3.5-capable serving stack at the time of this
pass (2026-06-03), while `llama.cpp` conversion support includes
`Qwen3_5ForConditionalGeneration`.

Command:

```bash
./scripts/probe-gojira-b-persisted-quantization.sh
./scripts/run-gojira-b-kaiju-gguf-convert.sh
```

Result:

- Artifact:
  `/home/richardecholsai5/kaiju-coder/models/kaiju-coder-7-gguf/kaiju-coder-7-Q8_0.gguf`
- Size: `27G`
- SHA256:
  `596a2c227a429c7309db753061d88d71ee3f8a3b48f17e41ba9d81b0f55bdd4e`
- Conversion log:
  `runs/gguf-conversion/20260603T231446Z/gguf-conversion.log`
- Runtime status: candidate only; direct GGUF runtime smoke still required
  before publishing quantized weights.

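
Before any runtime smoke, the recorded SHA256 can be checked against the artifact on disk. The sketch below hashes in chunks so that a file of this size is never loaded into memory at once. The function names are illustrative, and the demo uses a small temporary file rather than the real GGUF.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 in 1 MiB chunks and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: Path, expected_hex: str) -> bool:
    """Compare the file's digest with a recorded value, ignoring case and whitespace."""
    return sha256_file(path) == expected_hex.strip().lower()


# Demo on a throwaway file; point `path` at the GGUF artifact for a real check.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.bin"
    demo.write_bytes(b"kaiju")
    demo_ok = matches(demo, hashlib.sha256(b"kaiju").hexdigest())
    demo_bad = matches(demo, "0" * 64)

print(demo_ok, demo_bad)
```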
Interpretation: the next real speed improvement for broad public users is not
another prompt tweak. It is a smoke-tested GGUF or GPU-persisted quantized
artifact. The fastest currently verified Kaiju Coder 7 path remains vLLM
bitsandbytes plus the local fast proxy and deterministic website/business
harnesses.