restokes92 commited on
Commit
6d7449a
·
verified ·
1 Parent(s): 1703b88

Upload Kaiju Coder 7 runtime quantization recipe

Browse files
PUBLIC_TESTING_QUICKSTART.md ADDED
@@ -0,0 +1,149 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Kaiju Coder 7 Public Testing Quickstart
2
+
3
+ Kaiju Coder 7 is the public model name. The OpenAI-compatible model id is:
4
+
5
+ ```text
6
+ kaiju-coder-7
7
+ ```
8
+
9
+ Use this guide for serious public testing. It avoids internal checkpoint names
10
+ and keeps the current limitations clear.
11
+
12
+ ## Pick A Test Path
13
+
14
+ ### Path 1: OpenCode Against An Existing Endpoint
15
+
16
+ Use this if you already have Kaiju Coder 7 served at an OpenAI-compatible
17
+ `/v1` endpoint.
18
+
19
+ ```bash
20
+ git clone https://huggingface.co/RMDWLLC/kaiju-coder-7-opencode
21
+ cd kaiju-coder-7-opencode
22
+ python3 scripts/install_kaiju_opencode_profile.py --base-url http://127.0.0.1:18083/v1
23
+ ```
24
+
25
+ Then run OpenCode inside the project you want to edit:
26
+
27
+ ```bash
28
+ opencode -m kaiju/kaiju-coder-7 --agent kaiju-coder-7
29
+ ```
30
+
31
+ For a bounded smoke test:
32
+
33
+ ```bash
34
+ mkdir -p /tmp/kaiju-public-smoke
35
+ opencode run -m kaiju/kaiju-coder-7 --agent kaiju-coder-7 \
36
+ --dir /tmp/kaiju-public-smoke \
37
+ "Create hello.txt with exactly: Kaiju Coder 7 is ready"
38
+ ```
39
+
40
+ Or run the packaged verifier, which checks the installer, live model endpoint,
41
+ OpenCode binary, actual file creation, and wrong-directory behavior:
42
+
43
+ ```bash
44
+ python3 scripts/run_kaiju_public_opencode_smoke.py
45
+ ```
46
+
47
+ The helper installer adds:
48
+
49
+ - the `kaiju` OpenAI-compatible provider
50
+ - the lean `kaiju-coder-7` OpenCode agent
51
+ - a scoped no-autocontinue plugin that prevents false completion loops after
52
+ compaction or output limits
53
+
54
+ ### Path 2: Full Local Weights
55
+
56
+ Use this if the full `RMDWLLC/kaiju-coder-7` Hugging Face repo has been
57
+ uploaded and you have suitable local GPU hardware.
58
+
59
+ ```bash
60
+ hf download RMDWLLC/kaiju-coder-7 --local-dir ./kaiju-coder-7
61
+ ```
62
+
63
+ Serve the downloaded folder with an OpenAI-compatible local server. Configure
64
+ the server to expose:
65
+
66
+ ```text
67
+ model id: kaiju-coder-7
68
+ base URL: http://127.0.0.1:18083/v1
69
+ context: 16384
70
+ ```
71
+
72
+ Then install the OpenCode helper with:
73
+
74
+ ```bash
75
+ git clone https://huggingface.co/RMDWLLC/kaiju-coder-7-opencode
76
+ cd kaiju-coder-7-opencode
77
+ python3 scripts/install_kaiju_opencode_profile.py --base-url http://127.0.0.1:18083/v1
78
+ ```
79
+
80
+ ### Path 3: Runtime-Quantized Local Candidate
81
+
82
+ Use this only if you are comfortable with advanced serving setups. The current
83
+ working quantized option is a runtime bitsandbytes recipe, not a separate
84
+ persisted quantized weights repo.
85
+
86
+ ```bash
87
+ git clone https://huggingface.co/RMDWLLC/kaiju-coder-7-quantized-runtime
88
+ cd kaiju-coder-7-quantized-runtime
89
+ ```
90
+
91
+ Read `README.md` in that repo before serving. This path can reduce model memory
92
+ at runtime, but it still depends on access to the full Kaiju Coder 7 weights.
93
+
94
+ ## Recommended Test Prompt
95
+
96
+ Run this from an empty project folder:
97
+
98
+ ```text
99
+ Build a launch-ready local service business website and operating pack. Include
100
+ index.html, a Stripe checkout safety plan, a CSV parser with tests, a simple CRM
101
+ schema, a weekly money report, and a safety/provenance note. Write the files,
102
+ not just advice.
103
+ ```
104
+
105
+ Expected result:
106
+
107
+ - files are written in the requested project folder
108
+ - `index.html` is complete HTML
109
+ - business docs start with Markdown H1 headings
110
+ - code includes a test or smoke-check command where practical
111
+ - no fake API keys, OAuth tokens, payment secrets, or private customer data
112
+
113
+ ## Current Recommended Defaults
114
+
115
+ - Public model id: `kaiju-coder-7`
116
+ - OpenCode context: `16384`
117
+ - Output cap for public testing: `2500`
118
+ - Current reliable product path: model plus deterministic business-owner
119
+ harness plus verifier
120
+ - Raw multi-file OpenCode generation: still too slow for broad paid API claims
121
+ - Paid API: not public until launch preflight passes
122
+
123
+ ## What Not To Claim Yet
124
+
125
+ Do not claim:
126
+
127
+ - that raw model weights alone reliably build every business-owner artifact
128
+ - that a paid hosted API is generally available
129
+ - that persisted quantized weights exist
130
+ - that 32k context is the current live default
131
+
132
+ Do claim:
133
+
134
+ - Kaiju Coder 7 has a working local/OpenCode release candidate
135
+ - the current tested OpenCode default is 16k context
136
+ - the helper package includes a lean agent and compaction loop guard
137
+ - the paid API scaffold has tests and a launch preflight, but is not yet public
138
+ - the packaged public smoke verifies a fresh OpenCode one-file write before
139
+ public claims are refreshed
140
+
141
+ ## Current Blockers Before Public Release
142
+
143
+ - Hugging Face repo creation still requires a write-capable token or namespace.
144
+ - Full merged model upload has not completed; the merged folder must first have
145
+ the metadata packet synced by `prepare_hf_merged_model_metadata.sh`.
146
+ - Public paid API launch needs real Cloudflare D1/KV/R2 bindings, Wrangler
147
+ secret verification, Stripe webhook staging evidence, staging traffic, latency
148
+ evidence, and rollback proof.
149
+ - Human review is still required before public upload.
README.md ADDED
@@ -0,0 +1,83 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Kaiju Coder 7 Runtime-Quantized Local Candidate
2
+
3
+ This is the current working local quantized variant for Kaiju Coder 7. It is a
4
+ runtime bitsandbytes vLLM serving path, not a separate persisted quantized
5
+ weight artifact yet.
6
+
7
+ ## Status
8
+
9
+ - Model id: `kaiju-coder-7`
10
+ - Runtime: `gojira/vllm-openai-ray:nightly`
11
+ - Quantization mode: vLLM `--quantization bitsandbytes`
12
+ - Load format: vLLM `--load-format bitsandbytes`
13
+ - Required launch mode: `--language-model-only`
14
+ - Required OpenCode launch flag: `--enable-auto-tool-choice`
15
+ - Required preinstall in this image: `pandas`
16
+ - Tested contexts: `8192`, `16384`
17
+ - OpenCode smoke: passed
18
+ - Persisted quantized Hugging Face weights: pending
19
+
20
+ ## Run
21
+
22
+ Use the guarded benchmark script from the repo root:
23
+
24
+ ```bash
25
+ KAIJU_VLLM_CONTEXT=16384 \
26
+ KAIJU_VLLM_READY_TIMEOUT=1200 \
27
+ KAIJU_VLLM_QUANTIZATION=bitsandbytes \
28
+ KAIJU_VLLM_LOAD_FORMAT=bitsandbytes \
29
+ ./scripts/run-gojira-b-vllm-serving-benchmark.sh
30
+ ```
31
+
32
+ The script stops the merged SGLang service, starts vLLM on port `18084`, runs
33
+ the benchmark, then restores the recommended SGLang service on port `18083`.
34
+
35
+ ## Evidence
36
+
37
+ Runs:
38
+
39
+ - `runs/benchmarks/20260603T153257Z-kaiju-coder-7-serving/summary.md`
40
+ - `runs/benchmarks/20260603T154450Z-kaiju-coder-7-serving/summary.md`
41
+ - `runs/benchmarks/20260603T161316Z-kaiju-coder-7-serving/summary.md`
42
+ - `runs/benchmarks/20260603T165512Z-kaiju-coder-7-serving/summary.md`
43
+
44
+ | Runtime | Context | Prompt | OK | Seconds | Chars | Chars/s |
45
+ | --- | ---: | --- | --- | ---: | ---: | ---: |
46
+ | vLLM bitsandbytes | 8192 | identity | True | 21.19 | 26 | 1.227 |
47
+ | vLLM bitsandbytes | 8192 | code_patch | True | 11.31 | 424 | 37.489 |
48
+ | vLLM bitsandbytes | 16384 | identity | True | 19.51 | 26 | 1.333 |
49
+ | vLLM bitsandbytes | 16384 | code_patch | True | 11.3 | 416 | 36.814 |
50
+ | vLLM bitsandbytes | 16384 | business_doc | True | 53.44 | 1610 | 30.127 |
51
+ | vLLM bitsandbytes | 16384 | identity | True | 19.65 | 26 | 1.323 |
52
+
53
+ Gojira-B log evidence recorded model load at about `17.8 GiB` memory for both
54
+ 8k and 16k bitsandbytes runs. This is a meaningful local-serving improvement
55
+ over the full bfloat16 vLLM model load, which reported about `50.22 GiB`.
56
+ The 16k business-document task passed after the wrapper restored the default
57
+ SGLang service.
58
+
59
+ OpenCode one-file smoke also passed through the runtime-quantized endpoint:
60
+
61
+ ```bash
62
+ bash scripts/run_kaiju_quantized_opencode_smoke.sh
63
+ ```
64
+
65
+ Result:
66
+
67
+ - Workdir: `/tmp/kaiju-opencode-quantized-smoke`
68
+ - File: `hello.txt`
69
+ - Exact content: `Kaiju Coder 7 quantized runtime ok`
70
+ - OpenCode config: isolated temporary `HOME`, no global config edit
71
+ - Permission mode: `--dangerously-skip-permissions` inside the temporary smoke
72
+ harness only
73
+
74
+ ## Release Interpretation
75
+
76
+ This is a working quantized local runtime candidate. It is useful for internal
77
+ testing, serious GPU users, and the next paid API speed experiments. It is not
78
+ yet a standalone public quantized weights repo because the artifact is still the
79
+ full merged model loaded through bitsandbytes at runtime.
80
+
81
+ The next release step is to produce a persisted quantized artifact, or package
82
+ this runtime path as an advanced serving recipe while clearly saying it still
83
+ requires access to the full Kaiju Coder 7 merged weights.
scripts/run-gojira-b-vllm-serving-benchmark.sh ADDED
@@ -0,0 +1,44 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env bash
2
+ set -euo pipefail
3
+
4
+ ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
5
+ PORT="${KAIJU_VLLM_PORT:-18084}"
6
+ MODEL="${KAIJU_VLLM_MODEL_NAME:-kaiju-coder-7}"
7
+ CONTEXT="${KAIJU_VLLM_CONTEXT:-16384}"
8
+ READY_TIMEOUT="${KAIJU_VLLM_READY_TIMEOUT:-900}"
9
+ KEEP_VLLM="${KAIJU_VLLM_KEEP_RUNNING:-0}"
10
+ PROMPTS="${KAIJU_VLLM_PROMPTS:-identity code_patch}"
11
+ MAX_TOKENS="${KAIJU_VLLM_MAX_TOKENS:-128}"
12
+ TIMEOUT="${KAIJU_VLLM_PROMPT_TIMEOUT:-300}"
13
+ BASE_URL="http://100.109.109.14:${PORT}/v1"
14
+
15
+ restore_sglang() {
16
+ if [[ "${KEEP_VLLM}" == "1" ]]; then
17
+ return
18
+ fi
19
+ "${ROOT}/scripts/stop-qwen36-merged-vllm.sh" >/dev/null 2>&1 || true
20
+ KAIJU_QWEN36_MERGED_CONTEXT="${KAIJU_QWEN36_MERGED_CONTEXT:-32768}" \
21
+ "${ROOT}/scripts/start-qwen36-merged-sglang.sh" >/dev/null 2>&1 || true
22
+ }
23
+ trap restore_sglang EXIT
24
+
25
+ "${ROOT}/scripts/stop-qwen36-merged-sglang.sh"
26
+ "${ROOT}/scripts/stop-qwen36-merged-vllm.sh"
27
+ KAIJU_VLLM_CONTEXT="${CONTEXT}" "${ROOT}/scripts/start-qwen36-merged-vllm.sh"
28
+
29
+ deadline=$((SECONDS + READY_TIMEOUT))
30
+ until curl -fsSL "${BASE_URL}/models" | grep -q "\"${MODEL}\""; do
31
+ if (( SECONDS >= deadline )); then
32
+ echo "vLLM endpoint did not become ready at ${BASE_URL}" >&2
33
+ exit 1
34
+ fi
35
+ sleep 10
36
+ done
37
+
38
+ python3 "${ROOT}/scripts/benchmark_kaiju_serving.py" \
39
+ --base-url "${BASE_URL}" \
40
+ --model "${MODEL}" \
41
+ --contexts "${CONTEXT}" \
42
+ --prompts ${PROMPTS} \
43
+ --max-tokens "${MAX_TOKENS}" \
44
+ --timeout "${TIMEOUT}"
scripts/start-qwen36-merged-vllm.sh ADDED
@@ -0,0 +1,76 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env bash
2
+ set -euo pipefail
3
+
4
+ SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
5
+ # shellcheck source=scripts/gojira-b-ssh-lib.sh
6
+ source "${SCRIPT_DIR}/gojira-b-ssh-lib.sh"
7
+ kaiju_gojira_b_init
8
+
9
+ PORT="${KAIJU_VLLM_PORT:-18084}"
10
+ SESSION="${KAIJU_VLLM_SESSION:-kaiju_qwen36_v18_merged_vllm}"
11
+ CONTEXT_LENGTH="${KAIJU_VLLM_CONTEXT:-32768}"
12
+ GPU_UTIL="${KAIJU_VLLM_GPU_UTIL:-0.90}"
13
+ MODEL_REMOTE="${KAIJU_VLLM_MODEL_REMOTE:-/home/richardecholsai5/kaiju-coder/models/Kaiju-Coder-Qwen3.6-27B-v1.8-merged}"
14
+ SERVED_MODEL="${KAIJU_VLLM_MODEL_NAME:-kaiju-coder-7}"
15
+ IMAGE="${KAIJU_VLLM_IMAGE:-gojira/vllm-openai-ray:nightly}"
16
+ PREINSTALL="${KAIJU_VLLM_PREINSTALL:-pandas}"
17
+ QUANTIZATION="${KAIJU_VLLM_QUANTIZATION:-}"
18
+ LOAD_FORMAT="${KAIJU_VLLM_LOAD_FORMAT:-}"
19
+ KV_CACHE_DTYPE="${KAIJU_VLLM_KV_CACHE_DTYPE:-}"
20
+ ENABLE_AUTO_TOOL_CHOICE="${KAIJU_VLLM_ENABLE_AUTO_TOOL_CHOICE:-1}"
21
+ CONTAINER="qwen36-merged-vllm-${PORT}"
22
+
23
+ EXTRA_ARGS=()
24
+ if [[ -n "${QUANTIZATION}" ]]; then
25
+ EXTRA_ARGS+=(--quantization "${QUANTIZATION}")
26
+ fi
27
+ if [[ -n "${LOAD_FORMAT}" ]]; then
28
+ EXTRA_ARGS+=(--load-format "${LOAD_FORMAT}")
29
+ fi
30
+ if [[ -n "${KV_CACHE_DTYPE}" ]]; then
31
+ EXTRA_ARGS+=(--kv-cache-dtype "${KV_CACHE_DTYPE}")
32
+ fi
33
+ if [[ "${ENABLE_AUTO_TOOL_CHOICE}" == "1" ]]; then
34
+ EXTRA_ARGS+=(--enable-auto-tool-choice)
35
+ fi
36
+
37
+ EXTRA_ARGS_Q=""
38
+ if ((${#EXTRA_ARGS[@]})); then
39
+ printf -v EXTRA_ARGS_Q "%q " "${EXTRA_ARGS[@]}"
40
+ fi
41
+
42
+ kaiju_gojira_b_ssh "
43
+ set -euo pipefail
44
+ test -d '${MODEL_REMOTE}' || { echo 'Missing merged model: ${MODEL_REMOTE}' >&2; exit 2; }
45
+ mkdir -p ~/kaiju-coder/logs ~/hf-cache
46
+ LOG=~/kaiju-coder/logs/qwen36-merged-vllm-${PORT}.log
47
+ if tmux has-session -t '${SESSION}' 2>/dev/null; then
48
+ echo 'session already running: ${SESSION}'
49
+ else
50
+ sudo docker rm -f '${CONTAINER}' >/dev/null 2>&1 || true
51
+ rm -f \"\${LOG}\"
52
+ tmux new-session -d -s '${SESSION}' \"set -euo pipefail; sudo docker run --rm --gpus all --network host --ipc=host \
53
+ -v '${MODEL_REMOTE}':/models/kaiju-merged:ro \
54
+ -v ~/hf-cache:/root/.cache/huggingface \
55
+ --name '${CONTAINER}' \
56
+ --entrypoint bash \
57
+ '${IMAGE}' \
58
+ -lc 'if [[ -n \"${PREINSTALL}\" ]]; then python3 -m pip install -q ${PREINSTALL}; fi; python3 -m vllm.entrypoints.openai.api_server \
59
+ --model /models/kaiju-merged \
60
+ --served-model-name '${SERVED_MODEL}' \
61
+ --host 0.0.0.0 \
62
+ --port '${PORT}' \
63
+ --max-model-len '${CONTEXT_LENGTH}' \
64
+ --gpu-memory-utilization '${GPU_UTIL}' \
65
+ --trust-remote-code \
66
+ --language-model-only \
67
+ --dtype bfloat16 \
68
+ --tool-call-parser qwen3_coder \
69
+ --reasoning-parser qwen3 \
70
+ ${EXTRA_ARGS_Q} \
71
+ --uvicorn-log-level info' 2>&1 | tee \${LOG}\"
72
+ echo 'started session: ${SESSION}'
73
+ fi
74
+ echo 'log:' \"\${LOG}\"
75
+ echo 'model: ${SERVED_MODEL}'
76
+ "
scripts/stop-qwen36-merged-sglang.sh ADDED
@@ -0,0 +1,18 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env bash
2
+ set -euo pipefail
3
+
4
+ SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
5
+ # shellcheck source=scripts/gojira-b-ssh-lib.sh
6
+ source "${SCRIPT_DIR}/gojira-b-ssh-lib.sh"
7
+ kaiju_gojira_b_init
8
+
9
+ PORT="${KAIJU_QWEN36_MERGED_PORT:-18083}"
10
+ SESSION="${KAIJU_QWEN36_MERGED_SESSION:-kaiju_qwen36_v18_merged_sglang}"
11
+ CONTAINER="qwen36-merged-sglang-${PORT}"
12
+
13
+ kaiju_gojira_b_ssh "
14
+ set -euo pipefail
15
+ tmux kill-session -t '${SESSION}' >/dev/null 2>&1 || true
16
+ sudo docker rm -f '${CONTAINER}' >/dev/null 2>&1 || true
17
+ echo 'stopped ${SESSION} / ${CONTAINER}'
18
+ "
scripts/stop-qwen36-merged-vllm.sh ADDED
@@ -0,0 +1,18 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env bash
2
+ set -euo pipefail
3
+
4
+ SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
5
+ # shellcheck source=scripts/gojira-b-ssh-lib.sh
6
+ source "${SCRIPT_DIR}/gojira-b-ssh-lib.sh"
7
+ kaiju_gojira_b_init
8
+
9
+ PORT="${KAIJU_VLLM_PORT:-18084}"
10
+ SESSION="${KAIJU_VLLM_SESSION:-kaiju_qwen36_v18_merged_vllm}"
11
+ CONTAINER="qwen36-merged-vllm-${PORT}"
12
+
13
+ kaiju_gojira_b_ssh "
14
+ set -euo pipefail
15
+ tmux kill-session -t '${SESSION}' >/dev/null 2>&1 || true
16
+ sudo docker rm -f '${CONTAINER}' >/dev/null 2>&1 || true
17
+ echo 'stopped ${SESSION} / ${CONTAINER}'
18
+ "