---
license: apache-2.0
language:
  - en
tags:
  - kaiju-coder-7
  - quantization
  - vllm
  - bitsandbytes
  - local-ai
  - opencode
---

# Kaiju Coder 7 Runtime-Quantized Local Candidate

![RMDW logo](assets/RMDWlogo.png)

This is the current working local quantized variant of Kaiju Coder 7. It is a
vLLM serving path that applies bitsandbytes quantization at runtime, not yet a
separate persisted quantized-weight artifact.

## Status

- Model id: `kaiju-coder-7`
- Runtime: `gojira/vllm-openai-ray:nightly`
- Quantization mode: vLLM `--quantization bitsandbytes`
- Load format: vLLM `--load-format bitsandbytes`
- Required launch mode: `--language-model-only`
- Required OpenCode launch flag: `--enable-auto-tool-choice`
- Required preinstall in this image: `pandas`
- Tested contexts: `8192`, `16384`
- OpenCode smoke: passed through the local fast proxy
- Persisted quantized Hugging Face weights: GGUF Q8_0 converted, runtime smoke
  pending before public upload

## Run

Use the guarded benchmark script from the repo root:

```bash
KAIJU_VLLM_CONTEXT=16384 \
KAIJU_VLLM_READY_TIMEOUT=1200 \
KAIJU_VLLM_QUANTIZATION=bitsandbytes \
KAIJU_VLLM_LOAD_FORMAT=bitsandbytes \
  ./scripts/run-gojira-b-vllm-serving-benchmark.sh
```

The script stops the merged SGLang service, starts vLLM on port `18084`, runs
the benchmark, then restores SGLang unless `KAIJU_VLLM_KEEP_RUNNING=1` is set.
For the current fast OpenCode setup, keep vLLM running and point the fast proxy
at port `18084`.

```bash
KAIJU_OPENAI_BASE_URL=http://127.0.0.1:18084/v1 \
python3 scripts/kaiju_opencode_fast_proxy.py --host 127.0.0.1 --port 18181
```
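
Before pointing OpenCode at the proxy, it can help to confirm that the vLLM
endpoint answers at all. The following is a minimal sketch, not part of the
repo: it assumes the server exposes the usual OpenAI-compatible `/v1/models`
route, and the helper names `parse_model_ids` and `list_served_models` are
hypothetical.

```python
import json
import urllib.request

# Port taken from the run script description above; adjust if yours differs.
BASE_URL = "http://127.0.0.1:18084/v1"


def parse_model_ids(payload: dict) -> list[str]:
    """Extract model ids from an OpenAI-style /models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_served_models(base_url: str = BASE_URL, timeout: float = 5.0) -> list[str]:
    """Query the OpenAI-compatible /models route and return the served ids."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
        return parse_model_ids(json.load(resp))
```

Calling `list_served_models()` while vLLM is up should return a list that
includes the served model id. A connection error usually means the service is
still loading or has already been replaced by SGLang.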

## Evidence

Runs:

- `runs/benchmarks/20260603T153257Z-kaiju-coder-7-serving/summary.md`
- `runs/benchmarks/20260603T154450Z-kaiju-coder-7-serving/summary.md`
- `runs/benchmarks/20260603T161316Z-kaiju-coder-7-serving/summary.md`
- `runs/benchmarks/20260603T165512Z-kaiju-coder-7-serving/summary.md`
- `runs/benchmarks/20260603T223337Z-kaiju-coder-7-serving/summary.md`

| Runtime | Context | Prompt | OK | Seconds | Chars | Chars/s |
| --- | ---: | --- | --- | ---: | ---: | ---: |
| vLLM bitsandbytes | 8192 | identity | True | 21.19 | 26 | 1.227 |
| vLLM bitsandbytes | 8192 | code_patch | True | 11.31 | 424 | 37.489 |
| vLLM bitsandbytes | 16384 | identity | True | 19.51 | 26 | 1.333 |
| vLLM bitsandbytes | 16384 | code_patch | True | 11.30 | 416 | 36.814 |
| vLLM bitsandbytes | 16384 | business_doc | True | 53.44 | 1610 | 30.127 |
| vLLM bitsandbytes | 16384 | identity | True | 19.65 | 26 | 1.323 |
| vLLM bitsandbytes | 16384 | code_patch | True | 24.97 | 997 | 39.924 |
| vLLM bitsandbytes | 16384 | business_doc | True | 34.46 | 1615 | 46.874 |
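
The `Chars/s` column is the response length divided by wall-clock seconds. The
sketch below recomputes it for a few rows copied from the table. The function
name is hypothetical and this is not the benchmark script's own code. Recomputed
values can differ in the last digit for some rows, presumably because the
`Seconds` column is already rounded.

```python
def chars_per_second(chars: int, seconds: float) -> float:
    """Throughput as characters of output per wall-clock second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return chars / seconds


# (context, prompt, seconds, chars) copied from the table above.
ROWS = [
    (8192, "identity", 21.19, 26),
    (8192, "code_patch", 11.31, 424),
    (16384, "business_doc", 53.44, 1610),
]

for context, prompt, seconds, chars in ROWS:
    rate = chars_per_second(chars, seconds)
    print(f"{context:>6} {prompt:<13} {rate:.3f} chars/s")
```

The `identity` rows look slow on this metric mainly because the reply is only
26 characters, so fixed request overhead dominates the measurement.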

Gojira-B logs recorded a model load of about `17.8 GiB` for both the 8k and 16k
bitsandbytes runs. That is roughly a 65% reduction from the full bfloat16 vLLM
model load, which reported about `50.22 GiB`, and a meaningful improvement for
local serving. The 16k business-document task passed, and the current speed
pass keeps the runtime-quantized vLLM service active for OpenCode through the
local proxy.
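
The size of that saving follows directly from the two logged figures. A short
sketch of the arithmetic, using only the numbers quoted above (the function
name is hypothetical):

```python
BF16_LOAD_GIB = 50.22  # full bfloat16 vLLM model load, from the logs
BNB_LOAD_GIB = 17.8    # runtime bitsandbytes model load, from the logs


def memory_reduction(baseline_gib: float, quantized_gib: float) -> float:
    """Fraction of load memory saved relative to the baseline."""
    return 1.0 - quantized_gib / baseline_gib


saved = memory_reduction(BF16_LOAD_GIB, BNB_LOAD_GIB)
ratio = BF16_LOAD_GIB / BNB_LOAD_GIB
print(f"saved {saved:.1%} of load memory, {ratio:.2f}x smaller")
```

These figures cover model load only. KV cache and activation memory at 8k or
16k context come on top and are not part of this comparison.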

The dedicated website harness/router speed pass produced a complete checked
website in about `7.2s` through vLLM bitsandbytes:

- Direct website harness: `runs/harness/website-speed-pass/avery-stone-vllm.html`
- Router artifact: `runs/router-speed-pass/20260603T223731Z-website-build-a-premium-one-page-website-for-avery-stone-construction-a-reside/index.html`
- Local-proxy router artifact: `runs/router-speed-pass/20260603T224328Z-website-build-a-premium-one-page-website-for-bennett-family-dental-in-charlott/index.html`
- Router checks: complete HTML, required sections, external images,
  responsive CSS, no lorem ipsum, manifest write
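
The router checks listed above can be approximated with simple string tests.
This is an illustrative sketch only, not the router's actual implementation:
`check_page` and the `required_sections` parameter are hypothetical, the real
section names are not given here, and the manifest-write check is left out
because it depends on the router's file layout.

```python
import re


def check_page(html: str, required_sections: list[str]) -> dict[str, bool]:
    """Run rough structural checks on a generated one-page website."""
    lowered = html.lower()
    return {
        # Both the opening and closing root tags must be present.
        "complete_html": "<html" in lowered and "</html>" in lowered,
        # Every required section name must appear somewhere in the page.
        "required_sections": all(s.lower() in lowered for s in required_sections),
        # At least one <img> must point at an http(s) URL.
        "external_images": re.search(r'<img[^>]+src=["\']https?://', lowered) is not None,
        # A media query is used here as a proxy for responsive CSS.
        "responsive_css": "@media" in lowered,
        # Placeholder copy must not survive into the final page.
        "no_lorem_ipsum": "lorem ipsum" not in lowered,
    }
```

A page passes when every value in the returned dictionary is `True`. Keeping
the result keyed by check name makes it easy to report which check failed.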

OpenCode one-file smoke also passed through the runtime-quantized endpoint:

```bash
bash scripts/run_kaiju_quantized_opencode_smoke.sh
```

Result:

- Workdir: `/tmp/kaiju-opencode-quantized-smoke`
- File: `hello.txt`
- Exact content: `Kaiju Coder 7 quantized runtime ok`
- OpenCode config: isolated temporary `HOME`, no global config edit
- Permission mode: `--dangerously-skip-permissions` inside the temporary smoke
  harness only
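
The pass condition for this smoke reduces to one file with one expected line.
A minimal sketch of that check follows. It is an assumption about how the
result could be verified, not the contents of
`run_kaiju_quantized_opencode_smoke.sh`. The function name is hypothetical, and
trailing whitespace is tolerated here, which is a design choice rather than
something the smoke script is known to do.

```python
from pathlib import Path

EXPECTED = "Kaiju Coder 7 quantized runtime ok"


def smoke_file_ok(workdir: str, name: str = "hello.txt",
                  expected: str = EXPECTED) -> bool:
    """Return True if the smoke file exists and holds the expected text."""
    path = Path(workdir) / name
    if not path.is_file():
        return False
    # strip() tolerates a trailing newline written by the editor tool.
    return path.read_text(encoding="utf-8").strip() == expected
```

For example, `smoke_file_ok("/tmp/kaiju-opencode-quantized-smoke")` checks the
workdir reported above.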

## Persisted GGUF Candidate

A Q8_0 GGUF candidate now exists on Gojira-B:

```text
/home/richardecholsai5/kaiju-coder/models/kaiju-coder-7-gguf/kaiju-coder-7-Q8_0.gguf
```

- Size: `27G`
- SHA256:
  `596a2c227a429c7309db753061d88d71ee3f8a3b48f17e41ba9d81b0f55bdd4e`
- Conversion evidence:
  `runs/gguf-conversion/20260603T231446Z/gguf-conversion.log`
- Local docs: `release/gguf/README.md`
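
Anyone copying the GGUF file between machines can confirm it against the
SHA256 above. The sketch below streams the file in chunks so a 27G artifact
does not need to fit in memory. The function names are hypothetical; only
Python's standard `hashlib` is used.

```python
import hashlib

# Digest recorded for the Q8_0 candidate in this section.
EXPECTED_SHA256 = "596a2c227a429c7309db753061d88d71ee3f8a3b48f17e41ba9d81b0f55bdd4e"


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA256 hex digest of a file, reading 1 MiB at a time."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: str, expected: str = EXPECTED_SHA256) -> bool:
    """Compare a file's digest with the recorded one, ignoring hex case."""
    return sha256_of(path).lower() == expected.lower()
```

Run `matches_expected(...)` with the GGUF path shown above. A mismatch points
to an incomplete copy or a different conversion run.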

This is not public quantized-weights release evidence yet. It still needs a
runtime smoke that proves identity, business-owner output, and the intended
OpenCode/router path under an actual GGUF runtime.

## Release Interpretation

This is a working quantized local runtime candidate. It is useful for internal
testing, serious GPU users, and the next paid API speed experiments. It is not
yet a standalone public quantized weights repo because the only fully smoked
path is still the full merged model loaded through bitsandbytes at runtime.

The next release step is either to smoke-test the GGUF candidate or to package
this runtime path as an advanced serving recipe, stating clearly that it still
requires access to the full Kaiju Coder 7 merged weights.