Upload Kaiju Coder 7 runtime quantization recipe
- GGUF_CANDIDATE.md +51 -0
- PUBLIC_TESTING_QUICKSTART.md +24 -7
- README.md +48 -9
- scripts/kaiju_opencode_fast_proxy.py +234 -0
- scripts/probe-gojira-b-persisted-quantization.sh +185 -0
- scripts/run-gojira-b-kaiju-gguf-convert.sh +190 -0
GGUF_CANDIDATE.md
ADDED
|
@@ -0,0 +1,51 @@
+# Kaiju Coder 7 GGUF Candidate
+
+This folder documents the persisted GGUF candidate for Kaiju Coder 7. The
+artifact exists on Gojira-B, but it should stay marked as a candidate until a
+runtime smoke test passes.
+
+## Artifact
+
+- Format: GGUF
+- Outtype: `q8_0`
+- Remote path:
+  `/home/richardecholsai5/kaiju-coder/models/kaiju-coder-7-gguf/kaiju-coder-7-Q8_0.gguf`
+- Remote size: `27G`
+- SHA256:
+  `596a2c227a429c7309db753061d88d71ee3f8a3b48f17e41ba9d81b0f55bdd4e`
+- Source model:
+  `/home/richardecholsai5/kaiju-coder/models/Kaiju-Coder-Qwen3.6-27B-v1.8-merged`
+- Conversion evidence:
+  `runs/gguf-conversion/20260603T231446Z/gguf-conversion.log`
+
+## Status
+
+Converted successfully on 2026-06-03. Runtime smoke is still required before
+public upload or a Hugging Face quantized-weights claim.
+
+The conversion path is promising because the current `llama.cpp`
+`convert_hf_to_gguf.py` support list includes `Qwen3_5ForConditionalGeneration`
+and the Q8_0 dry run completed before the real conversion.
+
+## Recreate
+
+```bash
+./scripts/probe-gojira-b-persisted-quantization.sh
+./scripts/run-gojira-b-kaiju-gguf-convert.sh
+```
+
+The conversion script stops the active vLLM runtime to free RAM, writes the GGUF
+artifact, records a checksum and manifest, then restarts the fast vLLM runtime.
+
+## Release Rule
+
+Do not publish this as public quantized weights until all of these pass:
+
+- runtime loads the GGUF with model id `kaiju-coder-7`
+- direct identity smoke passes
+- direct business-owner document smoke passes
+- OpenCode or router smoke passes through the intended runtime
+- README/model card states exact runtime, context, memory, and quality caveats
+
+Until then, the public quantized path remains `kaiju-coder-7-quantized-runtime`,
+which documents the already-smoked vLLM bitsandbytes setup.
|
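The SHA256 recorded in GGUF_CANDIDATE.md can be checked against a local copy of the artifact before any runtime smoke is attempted. The following is a minimal sketch and not part of the commit: the helper names `sha256_of_file` and `matches_recorded_checksum` are hypothetical, and the only value taken from the document is the recorded digest.

```python
import hashlib

# Digest recorded in GGUF_CANDIDATE.md for kaiju-coder-7-Q8_0.gguf.
RECORDED_SHA256 = "596a2c227a429c7309db753061d88d71ee3f8a3b48f17e41ba9d81b0f55bdd4e"


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in fixed-size chunks so a 27G artifact is never held in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"" at EOF.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_recorded_checksum(path, recorded=RECORDED_SHA256):
    """Return True only if the file's digest equals the recorded digest."""
    return sha256_of_file(path) == recorded.strip().lower()
```

A mismatch would suggest the local copy is not the artifact described in the conversion log, so the release-rule smokes should not be run against it.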
PUBLIC_TESTING_QUICKSTART.md
CHANGED
|
@@ -19,7 +19,7 @@ Use this if you already have Kaiju Coder 7 served at an OpenAI-compatible
|
|
| 19 |
```bash
|
| 20 |
git clone https://huggingface.co/RMDWLLC/kaiju-coder-7-opencode
|
| 21 |
cd kaiju-coder-7-opencode
|
| 22 |
-
python3 scripts/install_kaiju_opencode_profile.py --base-url http://127.0.0.1:
|
| 23 |
```
|
| 24 |
|
| 25 |
Then run OpenCode inside the project you want to edit:
|
|
@@ -65,23 +65,31 @@ the server to expose:
|
|
| 65 |
|
| 66 |
```text
|
| 67 |
model id: kaiju-coder-7
|
| 68 |
-
base URL: http://127.0.0.1:
|
| 69 |
context: 16384
|
| 70 |
```
|
| 71 |
|
| 72 |
Then install the OpenCode helper with:
|
| 73 |
|
| 74 |
```bash
|
| 75 |
git clone https://huggingface.co/RMDWLLC/kaiju-coder-7-opencode
|
| 76 |
cd kaiju-coder-7-opencode
|
| 77 |
-
python3 scripts/install_kaiju_opencode_profile.py --base-url http://127.0.0.1:
|
| 78 |
```
|
| 79 |
|
| 80 |
### Path 3: Runtime-Quantized Local Candidate
|
| 81 |
|
| 82 |
Use this only if you are comfortable with advanced serving setups. The current
|
| 83 |
-
working quantized option is a runtime bitsandbytes recipe
|
| 84 |
-
|
| 85 |
|
| 86 |
```bash
|
| 87 |
git clone https://huggingface.co/RMDWLLC/kaiju-coder-7-quantized-runtime
|
|
@@ -115,9 +123,12 @@ Expected result:
|
|
| 115 |
- Public model id: `kaiju-coder-7`
|
| 116 |
- OpenCode context: `16384`
|
| 117 |
- Output cap for public testing: `2500`
|
|
|
|
| 118 |
- Current reliable product path: model plus deterministic business-owner
|
| 119 |
-
harness plus verifier
|
| 120 |
-
- Raw multi-file OpenCode generation: still too slow for broad paid
|
|
|
|
|
|
|
| 121 |
- Paid API: not public until launch preflight passes
|
| 122 |
|
| 123 |
## What Not To Claim Yet
|
|
@@ -134,15 +145,21 @@ Do claim:
|
|
| 134 |
- Kaiju Coder 7 has a working local/OpenCode release candidate
|
| 135 |
- the current tested OpenCode default is 16k context
|
| 136 |
- the helper package includes a lean agent and compaction loop guard
|
|
|
|
|
|
|
| 137 |
- the paid API scaffold has tests and a launch preflight, but is not yet public
|
| 138 |
- the packaged public smoke verifies a fresh OpenCode one-file write before
|
| 139 |
public claims are refreshed
|
|
|
|
|
|
|
| 140 |
|
| 141 |
## Current Blockers Before Public Release
|
| 142 |
|
| 143 |
- Hugging Face repo creation still requires a write-capable token or namespace.
|
| 144 |
- Full merged model upload has not completed; the merged folder must first have
|
| 145 |
the metadata packet synced by `prepare_hf_merged_model_metadata.sh`.
|
|
|
|
|
|
|
| 146 |
- Public paid API launch needs real Cloudflare D1/KV/R2 bindings, Wrangler
|
| 147 |
secret verification, Stripe webhook staging evidence, staging traffic, latency
|
| 148 |
evidence, and rollback proof.
|
|
|
|
| 19 |
```bash
|
| 20 |
git clone https://huggingface.co/RMDWLLC/kaiju-coder-7-opencode
|
| 21 |
cd kaiju-coder-7-opencode
|
| 22 |
+
python3 scripts/install_kaiju_opencode_profile.py --base-url http://127.0.0.1:18181/v1
|
| 23 |
```
|
| 24 |
|
| 25 |
Then run OpenCode inside the project you want to edit:
|
|
|
|
| 65 |
|
| 66 |
```text
|
| 67 |
model id: kaiju-coder-7
|
| 68 |
+
base URL: http://127.0.0.1:18084/v1
|
| 69 |
context: 16384
|
| 70 |
```
|
| 71 |
|
| 72 |
+
For the fastest OpenCode behavior, run the bundled fast proxy in a separate
|
| 73 |
+
terminal and point OpenCode at the proxy:
|
| 74 |
+
|
| 75 |
+
```bash
|
| 76 |
+
KAIJU_OPENAI_BASE_URL=http://127.0.0.1:18084/v1 \
|
| 77 |
+
python3 scripts/kaiju_opencode_fast_proxy.py --host 127.0.0.1 --port 18181
|
| 78 |
+
```
|
| 79 |
+
|
| 80 |
Then install the OpenCode helper with:
|
| 81 |
|
| 82 |
```bash
|
| 83 |
git clone https://huggingface.co/RMDWLLC/kaiju-coder-7-opencode
|
| 84 |
cd kaiju-coder-7-opencode
|
| 85 |
+
python3 scripts/install_kaiju_opencode_profile.py --base-url http://127.0.0.1:18181/v1
|
| 86 |
```
|
| 87 |
|
| 88 |
### Path 3: Runtime-Quantized Local Candidate
|
| 89 |
|
| 90 |
Use this only if you are comfortable with advanced serving setups. The current
|
| 91 |
+
working quantized option is a runtime bitsandbytes recipe. A Q8_0 GGUF artifact
|
| 92 |
+
has been converted, but it is still a candidate until runtime smoke passes.
|
| 93 |
|
| 94 |
```bash
|
| 95 |
git clone https://huggingface.co/RMDWLLC/kaiju-coder-7-quantized-runtime
|
|
|
|
| 123 |
- Public model id: `kaiju-coder-7`
|
| 124 |
- OpenCode context: `16384`
|
| 125 |
- Output cap for public testing: `2500`
|
| 126 |
+
- Fast OpenCode path: vLLM bitsandbytes runtime behind the Kaiju fast proxy
|
| 127 |
- Current reliable product path: model plus deterministic business-owner
|
| 128 |
+
harness/router plus verifier
|
| 129 |
+
- Raw multi-file OpenCode generation: still too slow for broad paid claims;
|
| 130 |
+
useful for testing, but paid API claims should favor harnessed product
|
| 131 |
+
workflows until broader latency gates pass
|
| 132 |
- Paid API: not public until launch preflight passes
|
| 133 |
|
| 134 |
## What Not To Claim Yet
|
|
|
|
| 145 |
- Kaiju Coder 7 has a working local/OpenCode release candidate
|
| 146 |
- the current tested OpenCode default is 16k context
|
| 147 |
- the helper package includes a lean agent and compaction loop guard
|
| 148 |
+
- the fast proxy keeps OpenCode tool calls intact while forcing bounded,
|
| 149 |
+
non-thinking generation
|
| 150 |
- the paid API scaffold has tests and a launch preflight, but is not yet public
|
| 151 |
- the packaged public smoke verifies a fresh OpenCode one-file write before
|
| 152 |
public claims are refreshed
|
| 153 |
+
- a GGUF Q8_0 candidate exists, but is not public quantized-weights release
|
| 154 |
+
evidence until runtime smoke passes
|
| 155 |
|
| 156 |
## Current Blockers Before Public Release
|
| 157 |
|
| 158 |
- Hugging Face repo creation still requires a write-capable token or namespace.
|
| 159 |
- Full merged model upload has not completed; the merged folder must first have
|
| 160 |
the metadata packet synced by `prepare_hf_merged_model_metadata.sh`.
|
| 161 |
+
- The GGUF Q8_0 candidate still needs a runtime smoke before public
|
| 162 |
+
quantized-weights upload.
|
| 163 |
- Public paid API launch needs real Cloudflare D1/KV/R2 bindings, Wrangler
|
| 164 |
secret verification, Stripe webhook staging evidence, staging traffic, latency
|
| 165 |
evidence, and rollback proof.
|
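Before installing the OpenCode profile against port `18181`, it can help to confirm that the fast proxy is up and reporting the expected model id. The sketch below is an illustration rather than part of the repository. It assumes the proxy exposes the `/health` route defined in `scripts/kaiju_opencode_fast_proxy.py`, and the helper names `fetch_proxy_health` and `health_problems` are hypothetical.

```python
import json
import urllib.request


def fetch_proxy_health(proxy_root, timeout=5.0):
    """GET <proxy_root>/health and return the decoded JSON object.

    proxy_root is the proxy address without the /v1 suffix,
    for example http://127.0.0.1:18181 (assumed from the quickstart).
    """
    url = proxy_root.rstrip("/") + "/health"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def health_problems(payload, expected_model="kaiju-coder-7"):
    """Return a list of human-readable problems; an empty list means healthy."""
    problems = []
    if payload.get("ok") is not True:
        problems.append("proxy did not report ok=true")
    if payload.get("model") != expected_model:
        problems.append("unexpected model id: %r" % (payload.get("model"),))
    return problems
```

With the proxy running locally, `health_problems(fetch_proxy_health("http://127.0.0.1:18181"))` would be expected to return an empty list. The check only covers the proxy itself and says nothing about whether the upstream vLLM service is reachable.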
README.md
CHANGED
|
@@ -14,8 +14,9 @@ weight artifact yet.
|
|
| 14 |
- Required OpenCode launch flag: `--enable-auto-tool-choice`
|
| 15 |
- Required preinstall in this image: `pandas`
|
| 16 |
- Tested contexts: `8192`, `16384`
|
| 17 |
-
- OpenCode smoke: passed
|
| 18 |
-
- Persisted quantized Hugging Face weights:
|
|
|
|
| 19 |
|
| 20 |
## Run
|
| 21 |
|
|
@@ -30,7 +31,14 @@ KAIJU_VLLM_LOAD_FORMAT=bitsandbytes \
|
|
| 30 |
```
|
| 31 |
|
| 32 |
The script stops the merged SGLang service, starts vLLM on port `18084`, runs
|
| 33 |
-
the benchmark, then restores
|
| 34 |
|
| 35 |
## Evidence
|
| 36 |
|
|
@@ -40,6 +48,7 @@ Runs:
|
|
| 40 |
- `runs/benchmarks/20260603T154450Z-kaiju-coder-7-serving/summary.md`
|
| 41 |
- `runs/benchmarks/20260603T161316Z-kaiju-coder-7-serving/summary.md`
|
| 42 |
- `runs/benchmarks/20260603T165512Z-kaiju-coder-7-serving/summary.md`
|
|
|
|
| 43 |
|
| 44 |
| Runtime | Context | Prompt | OK | Seconds | Chars | Chars/s |
|
| 45 |
| --- | ---: | --- | --- | ---: | ---: | ---: |
|
|
@@ -49,12 +58,23 @@ Runs:
|
|
| 49 |
| vLLM bitsandbytes | 16384 | code_patch | True | 11.3 | 416 | 36.814 |
|
| 50 |
| vLLM bitsandbytes | 16384 | business_doc | True | 53.44 | 1610 | 30.127 |
|
| 51 |
| vLLM bitsandbytes | 16384 | identity | True | 19.65 | 26 | 1.323 |
|
|
|
|
|
|
|
| 52 |
|
| 53 |
Gojira-B log evidence recorded model load at about `17.8 GiB` memory for both
|
| 54 |
8k and 16k bitsandbytes runs. This is a meaningful local-serving improvement
|
| 55 |
over the full bfloat16 vLLM model load, which reported about `50.22 GiB`.
|
| 56 |
-
The 16k business-document task passed
|
| 57 |
-
|
| 58 |
|
| 59 |
OpenCode one-file smoke also passed through the runtime-quantized endpoint:
|
| 60 |
|
|
@@ -71,13 +91,32 @@ Result:
|
|
| 71 |
- Permission mode: `--dangerously-skip-permissions` inside the temporary smoke
|
| 72 |
harness only
|
| 73 |
|
| 74 |
## Release Interpretation
|
| 75 |
|
| 76 |
This is a working quantized local runtime candidate. It is useful for internal
|
| 77 |
testing, serious GPU users, and the next paid API speed experiments. It is not
|
| 78 |
-
yet a standalone public quantized weights repo because the
|
| 79 |
-
full merged model loaded through bitsandbytes at runtime.
|
| 80 |
|
| 81 |
-
The next release step is to
|
| 82 |
-
|
| 83 |
requires access to the full Kaiju Coder 7 merged weights.
|
|
|
|
| 14 |
- Required OpenCode launch flag: `--enable-auto-tool-choice`
|
| 15 |
- Required preinstall in this image: `pandas`
|
| 16 |
- Tested contexts: `8192`, `16384`
|
| 17 |
+
- OpenCode smoke: passed through the local fast proxy
|
| 18 |
+
- Persisted quantized Hugging Face weights: GGUF Q8_0 converted, runtime smoke
|
| 19 |
+
pending before public upload
|
| 20 |
|
| 21 |
## Run
|
| 22 |
|
|
|
|
| 31 |
```
|
| 32 |
|
| 33 |
The script stops the merged SGLang service, starts vLLM on port `18084`, runs
|
| 34 |
+
the benchmark, then restores SGLang unless `KAIJU_VLLM_KEEP_RUNNING=1` is set.
|
| 35 |
+
For the current fast OpenCode setup, keep vLLM running and point the fast proxy
|
| 36 |
+
at port `18084`.
|
| 37 |
+
|
| 38 |
+
```bash
|
| 39 |
+
KAIJU_OPENAI_BASE_URL=http://127.0.0.1:18084/v1 \
|
| 40 |
+
python3 scripts/kaiju_opencode_fast_proxy.py --host 127.0.0.1 --port 18181
|
| 41 |
+
```
|
| 42 |
|
| 43 |
## Evidence
|
| 44 |
|
|
|
|
| 48 |
- `runs/benchmarks/20260603T154450Z-kaiju-coder-7-serving/summary.md`
|
| 49 |
- `runs/benchmarks/20260603T161316Z-kaiju-coder-7-serving/summary.md`
|
| 50 |
- `runs/benchmarks/20260603T165512Z-kaiju-coder-7-serving/summary.md`
|
| 51 |
+
- `runs/benchmarks/20260603T223337Z-kaiju-coder-7-serving/summary.md`
|
| 52 |
|
| 53 |
| Runtime | Context | Prompt | OK | Seconds | Chars | Chars/s |
|
| 54 |
| --- | ---: | --- | --- | ---: | ---: | ---: |
|
|
|
|
| 58 |
| vLLM bitsandbytes | 16384 | code_patch | True | 11.3 | 416 | 36.814 |
|
| 59 |
| vLLM bitsandbytes | 16384 | business_doc | True | 53.44 | 1610 | 30.127 |
|
| 60 |
| vLLM bitsandbytes | 16384 | identity | True | 19.65 | 26 | 1.323 |
|
| 61 |
+
| vLLM bitsandbytes | 16384 | code_patch | True | 24.97 | 997 | 39.924 |
|
| 62 |
+
| vLLM bitsandbytes | 16384 | business_doc | True | 34.46 | 1615 | 46.874 |
|
| 63 |
|
| 64 |
Gojira-B log evidence recorded model load at about `17.8 GiB` memory for both
|
| 65 |
8k and 16k bitsandbytes runs. This is a meaningful local-serving improvement
|
| 66 |
over the full bfloat16 vLLM model load, which reported about `50.22 GiB`.
|
| 67 |
+
The 16k business-document task passed, and the current speed pass keeps the
|
| 68 |
+
runtime-quantized vLLM service active for OpenCode through the local proxy.
|
| 69 |
+
|
| 70 |
+
The dedicated website harness/router speed pass produced a complete checked
|
| 71 |
+
website in about `7.2s` through vLLM bitsandbytes:
|
| 72 |
+
|
| 73 |
+
- Direct website harness: `runs/harness/website-speed-pass/avery-stone-vllm.html`
|
| 74 |
+
- Router artifact: `runs/router-speed-pass/20260603T223731Z-website-build-a-premium-one-page-website-for-avery-stone-construction-a-reside/index.html`
|
| 75 |
+
- Local-proxy router artifact: `runs/router-speed-pass/20260603T224328Z-website-build-a-premium-one-page-website-for-bennett-family-dental-in-charlott/index.html`
|
| 76 |
+
- Router checks: complete HTML, required sections, external images,
|
| 77 |
+
responsive CSS, no lorem ipsum, manifest write
|
| 78 |
|
| 79 |
OpenCode one-file smoke also passed through the runtime-quantized endpoint:
|
| 80 |
|
|
|
|
| 91 |
- Permission mode: `--dangerously-skip-permissions` inside the temporary smoke
|
| 92 |
harness only
|
| 93 |
|
| 94 |
+
## Persisted GGUF Candidate
|
| 95 |
+
|
| 96 |
+
A Q8_0 GGUF candidate now exists on Gojira-B:
|
| 97 |
+
|
| 98 |
+
```text
|
| 99 |
+
/home/richardecholsai5/kaiju-coder/models/kaiju-coder-7-gguf/kaiju-coder-7-Q8_0.gguf
|
| 100 |
+
```
|
| 101 |
+
|
| 102 |
+
- Size: `27G`
|
| 103 |
+
- SHA256:
|
| 104 |
+
`596a2c227a429c7309db753061d88d71ee3f8a3b48f17e41ba9d81b0f55bdd4e`
|
| 105 |
+
- Conversion evidence:
|
| 106 |
+
`runs/gguf-conversion/20260603T231446Z/gguf-conversion.log`
|
| 107 |
+
- Local docs: `release/gguf/README.md`
|
| 108 |
+
|
| 109 |
+
This is not public quantized-weights release evidence yet. It still needs a
|
| 110 |
+
runtime smoke that proves identity, business-owner output, and the intended
|
| 111 |
+
OpenCode/router path under an actual GGUF runtime.
|
| 112 |
+
|
| 113 |
## Release Interpretation
|
| 114 |
|
| 115 |
This is a working quantized local runtime candidate. It is useful for internal
|
| 116 |
testing, serious GPU users, and the next paid API speed experiments. It is not
|
| 117 |
+
yet a standalone public quantized weights repo because the only fully smoked
|
| 118 |
+
path is still the full merged model loaded through bitsandbytes at runtime.
|
| 119 |
|
| 120 |
+
The next release step is to smoke-test the GGUF candidate or package this
|
| 121 |
+
runtime path as an advanced serving recipe while clearly saying it still
|
| 122 |
requires access to the full Kaiju Coder 7 merged weights.
|
scripts/kaiju_opencode_fast_proxy.py
ADDED
|
@@ -0,0 +1,234 @@
+#!/usr/bin/env python3
+"""Tool-safe OpenAI-compatible fast proxy for Kaiju Coder 7 OpenCode.
+
+The normal Gojira gateway is product/API oriented and aggregates content. OpenCode
+needs raw tool-call chunks preserved, so this proxy only patches serving knobs
+and then passes upstream responses through unchanged.
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import os
+import time
+import urllib.error
+import urllib.request
+from http import HTTPStatus
+from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
+from typing import Any
+
+
+DEFAULT_HOST = "127.0.0.1"
+DEFAULT_PORT = int(os.environ.get("KAIJU_OPENCODE_FAST_PROXY_PORT", "18181"))
+UPSTREAM_BASE_URL = os.environ.get("KAIJU_OPENAI_BASE_URL", "http://100.109.109.14:18084/v1")
+DEFAULT_MODEL = os.environ.get("KAIJU_DEFAULT_MODEL", "kaiju-coder-7")
+API_KEY = os.environ.get("KAIJU_OPENAI_API_KEY", "")
+NORMAL_MAX_TOKENS = int(os.environ.get("KAIJU_NORMAL_MAX_TOKENS", "384"))
+WORK_MAX_TOKENS = int(os.environ.get("KAIJU_WORK_MAX_TOKENS", "1536"))
+ARTIFACT_MAX_TOKENS = int(os.environ.get("KAIJU_ARTIFACT_MAX_TOKENS", "4096"))
+MAX_REQUEST_BYTES = int(os.environ.get("KAIJU_MAX_REQUEST_BYTES", "2097152"))
+
+
+def normalize_messages(messages: Any) -> list[dict[str, Any]]:
+    if not isinstance(messages, list):
+        return []
+    return [message for message in messages if isinstance(message, dict)]
+
+
+def message_text(messages: list[dict[str, Any]]) -> str:
+    parts: list[str] = []
+    for message in messages:
+        content = message.get("content", "")
+        if isinstance(content, str):
+            parts.append(content)
+        else:
+            parts.append(json.dumps(content, ensure_ascii=False))
+    return "\n".join(parts).lower()
+
+
+def classify_job(messages: list[dict[str, Any]]) -> str:
+    text = message_text(messages)
+    artifact_terms = (
+        "complete html",
+        "html file",
+        "one-file website",
+        "landing page",
+        "build a website",
+        "make a website",
+        "full file",
+    )
+    work_terms = (
+        "create ",
+        "write ",
+        "edit ",
+        "implement",
+        "debug",
+        "fix",
+        "refactor",
+        "test",
+        "repo",
+        "file",
+    )
+    if any(term in text for term in artifact_terms):
+        return "artifact"
+    if any(term in text for term in work_terms):
+        return "work"
+    return "normal"
+
+
+def target_tokens(job_class: str) -> int:
+    if job_class == "artifact":
+        return ARTIFACT_MAX_TOKENS
+    if job_class == "work":
+        return WORK_MAX_TOKENS
+    return NORMAL_MAX_TOKENS
+
+
+def patch_chat_payload(body: dict[str, Any]) -> dict[str, Any]:
+    patched = dict(body)
+    patched["model"] = DEFAULT_MODEL
+    messages = normalize_messages(patched.get("messages"))
+    job_class = classify_job(messages)
+    patched["max_tokens"] = target_tokens(job_class)
+    patched["chat_template_kwargs"] = {
+        **(patched.get("chat_template_kwargs") if isinstance(patched.get("chat_template_kwargs"), dict) else {}),
+        "enable_thinking": False,
+        "thinking": False,
+    }
+    return patched
+
+
+class Handler(BaseHTTPRequestHandler):
+    server_version = "KaijuOpenCodeFastProxy/0.1"
+
+    def log_message(self, fmt: str, *args: Any) -> None:
+        print(f"{time.strftime('%Y-%m-%d %H:%M:%S')} {self.address_string()} - {fmt % args}", flush=True)
+
+    def _json(self, status: int, payload: dict[str, Any]) -> None:
+        data = json.dumps(payload).encode("utf-8")
+        self.send_response(status)
+        self.send_header("content-type", "application/json; charset=utf-8")
+        self.send_header("cache-control", "no-store")
+        self.send_header("content-length", str(len(data)))
+        self.end_headers()
+        self.wfile.write(data)
+
+    def _read_json(self) -> dict[str, Any]:
+        length = int(self.headers.get("content-length", "0"))
+        if length > MAX_REQUEST_BYTES:
+            raise ValueError("request body too large")
+        raw = self.rfile.read(length)
+        if not raw:
+            return {}
+        value = json.loads(raw.decode("utf-8"))
+        if not isinstance(value, dict):
+            raise ValueError("request body must be a JSON object")
+        return value
+
+    def do_GET(self) -> None:  # noqa: N802 - BaseHTTPRequestHandler API.
+        if self.path == "/health":
+            self._json(
+                HTTPStatus.OK,
+                {
+                    "ok": True,
+                    "model": DEFAULT_MODEL,
+                    "upstream": UPSTREAM_BASE_URL,
+                    "normal_max_tokens": NORMAL_MAX_TOKENS,
+                    "work_max_tokens": WORK_MAX_TOKENS,
+                    "artifact_max_tokens": ARTIFACT_MAX_TOKENS,
+                },
+            )
+            return
+        if self.path == "/v1/models":
+            self._forward_get("/models")
+            return
+        self._json(HTTPStatus.NOT_FOUND, {"error": {"message": "Not found", "type": "not_found"}})
+
+    def do_POST(self) -> None:  # noqa: N802 - BaseHTTPRequestHandler API.
+        if self.path != "/v1/chat/completions":
+            self._json(HTTPStatus.NOT_FOUND, {"error": {"message": "Not found", "type": "not_found"}})
+            return
+        try:
+            body = patch_chat_payload(self._read_json())
+        except Exception as error:  # noqa: BLE001 - return request parse failures.
+            self._json(HTTPStatus.BAD_REQUEST, {"error": {"message": str(error), "type": "bad_request"}})
+            return
+        self._forward_post("/chat/completions", body)
+
+    def _headers(self) -> dict[str, str]:
+        headers = {"content-type": "application/json"}
+        if API_KEY:
+            headers["authorization"] = f"Bearer {API_KEY}"
+        return headers
+
+    def _forward_get(self, suffix: str) -> None:
+        request = urllib.request.Request(
+            f"{UPSTREAM_BASE_URL.rstrip('/')}{suffix}",
+            headers=self._headers(),
+            method="GET",
+        )
+        try:
+            with urllib.request.urlopen(request, timeout=30) as upstream:
+                data = upstream.read()
+                self.send_response(upstream.status)
+                self.send_header("content-type", upstream.headers.get("content-type", "application/json"))
+                self.send_header("cache-control", "no-store")
+                self.send_header("content-length", str(len(data)))
+                self.end_headers()
+                self.wfile.write(data)
+        except urllib.error.HTTPError as error:
+            self._json(error.code, {"error": {"message": error.read().decode("utf-8", errors="replace")[:500]}})
+        except Exception as error:  # noqa: BLE001 - proxy health should surface upstream failures.
+            self._json(HTTPStatus.BAD_GATEWAY, {"error": {"message": str(error), "type": "upstream_error"}})
+
+    def _forward_post(self, suffix: str, body: dict[str, Any]) -> None:
+        data = json.dumps(body).encode("utf-8")
+        request = urllib.request.Request(
+            f"{UPSTREAM_BASE_URL.rstrip('/')}{suffix}",
+            data=data,
+            headers=self._headers(),
+            method="POST",
+        )
+        try:
+            timeout = 1200 if classify_job(normalize_messages(body.get("messages"))) == "artifact" else 600
+            with urllib.request.urlopen(request, timeout=timeout) as upstream:
+                content_type = upstream.headers.get("content-type", "application/json")
+                if body.get("stream") is True:
+                    self.send_response(upstream.status)
+                    self.send_header("content-type", content_type)
+                    self.send_header("cache-control", "no-store, no-transform")
+                    self.send_header("connection", "close")
+                    self.end_headers()
+                    for chunk in upstream:
+                        self.wfile.write(chunk)
+                        self.wfile.flush()
+                    return
+                response = upstream.read()
+                self.send_response(upstream.status)
+                self.send_header("content-type", content_type)
+                self.send_header("cache-control", "no-store")
+                self.send_header("content-length", str(len(response)))
+                self.end_headers()
+                self.wfile.write(response)
+        except urllib.error.HTTPError as error:
+            detail = error.read().decode("utf-8", errors="replace")[:500]
+            self._json(error.code, {"error": {"message": detail, "type": "upstream_error"}})
+        except Exception as error:  # noqa: BLE001 - proxy should report upstream failures.
+            self._json(HTTPStatus.BAD_GATEWAY, {"error": {"message": str(error), "type": "upstream_error"}})
+
+
+def main() -> int:
+    parser = argparse.ArgumentParser(description=__doc__)
+    parser.add_argument("--host", default=DEFAULT_HOST)
+    parser.add_argument("--port", type=int, default=DEFAULT_PORT)
+    args = parser.parse_args()
+    server = ThreadingHTTPServer((args.host, args.port), Handler)
+    print(f"Kaiju OpenCode fast proxy listening on http://{args.host}:{args.port}", flush=True)
+    print(f"Upstream: {UPSTREAM_BASE_URL}", flush=True)
+    server.serve_forever()
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
|
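The proxy chooses its `max_tokens` cap by substring-matching the lowercased conversation text. The condensed sketch below restates that rule outside the HTTP handler so it can be tried in isolation. The term lists and default caps are copied from `scripts/kaiju_opencode_fast_proxy.py`; the names `classify_text` and `cap_for` are hypothetical, and the environment-variable overrides are deliberately left out.

```python
# Term lists copied from classify_job() in the proxy script.
ARTIFACT_TERMS = (
    "complete html", "html file", "one-file website", "landing page",
    "build a website", "make a website", "full file",
)
WORK_TERMS = (
    "create ", "write ", "edit ", "implement", "debug",
    "fix", "refactor", "test", "repo", "file",
)

# Default caps from the script (overridable there via KAIJU_*_MAX_TOKENS).
TOKEN_CAPS = {"normal": 384, "work": 1536, "artifact": 4096}


def classify_text(text):
    """Return 'artifact', 'work', or 'normal' using the proxy's precedence."""
    lowered = text.lower()
    # Artifact terms are checked first, so they win over work terms.
    if any(term in lowered for term in ARTIFACT_TERMS):
        return "artifact"
    if any(term in lowered for term in WORK_TERMS):
        return "work"
    return "normal"


def cap_for(text):
    """Return the default max_tokens cap the proxy would apply to this text."""
    return TOKEN_CAPS[classify_text(text)]
```

Because the match is by substring rather than by word, a prompt containing "prefix" matches the `fix` term and is classified as `work`. That is worth keeping in mind when a short question unexpectedly receives the larger cap.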
scripts/probe-gojira-b-persisted-quantization.sh
ADDED
|
@@ -0,0 +1,185 @@
| 1 |
+
#!/usr/bin/env bash
|
| 2 |
+
set -euo pipefail
|
| 3 |
+
|
| 4 |
+
ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
|
| 5 |
+
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
| 6 |
+
# shellcheck source=scripts/gojira-b-ssh-lib.sh
|
| 7 |
+
source "${SCRIPT_DIR}/gojira-b-ssh-lib.sh"
|
| 8 |
+
kaiju_gojira_b_init
|
| 9 |
+
|
| 10 |
+
STAMP="$(date -u +%Y%m%dT%H%M%SZ)"
|
| 11 |
+
RUN_DIR="${ROOT}/runs/quantization-probes/${STAMP}"
|
| 12 |
+
LOG="${RUN_DIR}/persisted-quantization-probe.log"
|
| 13 |
+
SUMMARY="${RUN_DIR}/summary.md"
|
| 14 |
+
MODEL_REMOTE="${KAIJU_QUANT_MODEL_REMOTE:-/home/richardecholsai5/kaiju-coder/models/Kaiju-Coder-Qwen3.6-27B-v1.8-merged}"
|
| 15 |
+
VLLM_IMAGE="${KAIJU_QUANT_VLLM_IMAGE:-gojira/vllm-openai-ray:nightly}"
|
| 16 |
+
LLAMA_DIR="${KAIJU_LLAMA_CPP_REMOTE:-/home/richardecholsai5/tools/llama.cpp}"
|
| 17 |
+
|
| 18 |
+
mkdir -p "${RUN_DIR}"
|
| 19 |
+
printf -v MODEL_REMOTE_Q "%q" "${MODEL_REMOTE}"
|
| 20 |
+
printf -v VLLM_IMAGE_Q "%q" "${VLLM_IMAGE}"
|
| 21 |
+
printf -v LLAMA_DIR_Q "%q" "${LLAMA_DIR}"
|
| 22 |
+
|
| 23 |
+
set +e
|
| 24 |
+
kaiju_gojira_b_ssh "MODEL_REMOTE=${MODEL_REMOTE_Q} VLLM_IMAGE=${VLLM_IMAGE_Q} LLAMA_DIR=${LLAMA_DIR_Q} bash -s" <<'REMOTE' 2>&1 | tee "${LOG}"
|
| 25 |
+
set -euo pipefail
|
| 26 |
+
|
| 27 |
+
echo "== Host and model =="
|
| 28 |
+
test -d "${MODEL_REMOTE}" || { echo "missing model: ${MODEL_REMOTE}" >&2; exit 2; }
|
| 29 |
+
du -sh "${MODEL_REMOTE}"
|
| 30 |
+
df -h /home | tail -1
|
| 31 |
+
free -h | sed -n '1,3p'
|
| 32 |
+
nvidia-smi --query-gpu=name,memory.total,memory.used,memory.free --format=csv,noheader || true
|
| 33 |
+
docker ps --format "{{.Names}} {{.Status}} {{.Image}}" | grep -Ei "qwen|kaiju|sglang|vllm" || true
|
| 34 |
+
|
| 35 |
+
echo
|
| 36 |
+
echo "== Model config =="
|
| 37 |
+
MODEL_REMOTE="${MODEL_REMOTE}" python3 - <<'PY'
|
| 38 |
+
import json
|
| 39 |
+
import os
|
| 40 |
+
from pathlib import Path
|
| 41 |
+
|
| 42 |
+
config = json.loads((Path(os.environ["MODEL_REMOTE"]) / "config.json").read_text())
|
| 43 |
+
text = config.get("text_config") or {}
|
| 44 |
+
print("model_type:", config.get("model_type"))
|
| 45 |
+
print("architectures:", config.get("architectures"))
|
| 46 |
+
print("text_model_type:", text.get("model_type"))
|
| 47 |
+
print("layers:", text.get("num_hidden_layers"))
|
| 48 |
+
print("layer_types:", ",".join(sorted(set(text.get("layer_types") or []))))
|
| 49 |
+
PY
|
| 50 |
+
|
| 51 |
+
echo
|
| 52 |
+
echo "== vLLM/Qwen3.5-capable Python stack =="
|
| 53 |
+
docker run --rm --entrypoint bash -v "${MODEL_REMOTE}":/models/kaiju:ro "${VLLM_IMAGE}" -lc '
|
| 54 |
+
set -euo pipefail
|
| 55 |
+
python3 - <<PY
|
| 56 |
+
from transformers import AutoConfig
|
| 57 |
+
cfg = AutoConfig.from_pretrained("/models/kaiju", trust_remote_code=True)
|
| 58 |
+
print("AutoConfig:", type(cfg).__name__, getattr(cfg, "model_type", None))
|
| 59 |
+
PY
|
| 60 |
+
python3 - <<PY
|
| 61 |
+
for mod in ["torch", "transformers", "safetensors", "vllm", "huggingface_hub"]:
|
| 62 |
+
    m = __import__(mod)
|
| 63 |
+
    version = getattr(m, "__version__", "installed")
|
| 64 |
+
    print(mod + ": " + str(version))
|
| 65 |
+
PY
|
| 66 |
+
'
|
| 67 |
+
|
| 68 |
+
echo
|
| 69 |
+
echo "== Persistent quantization package import probe =="
|
| 70 |
+
docker run --rm --entrypoint bash -v "${MODEL_REMOTE}":/models/kaiju:ro "${VLLM_IMAGE}" -lc '
|
| 71 |
+
set -euo pipefail
|
| 72 |
+
for pkg in llmcompressor autoawq auto-gptq; do
|
| 73 |
+
echo "-- pip install ${pkg}"
|
| 74 |
+
if python3 -m pip install -q --no-cache-dir "${pkg}" >/tmp/kaiju-pip-${pkg}.log 2>&1; then
|
| 75 |
+
echo "${pkg}: install ok"
|
| 76 |
+
else
|
| 77 |
+
echo "${pkg}: install failed"
|
| 78 |
+
sed -n "1,120p" "/tmp/kaiju-pip-${pkg}.log"
|
| 79 |
+
fi
|
| 80 |
+
done
|
| 81 |
+
python3 - <<PY
|
| 82 |
+
mods = [("llmcompressor", "llmcompressor"), ("autoawq", "awq"), ("auto-gptq", "auto_gptq")]
|
| 83 |
+
for label, mod in mods:
|
| 84 |
+
    try:
|
| 85 |
+
        m = __import__(mod)
|
| 86 |
+
        version = getattr(m, "__version__", "installed")
|
| 87 |
+
        print(label + ": import ok: " + str(version))
|
| 88 |
+
    except Exception as exc:
|
| 89 |
+
        print(f"{label}: import failed: {type(exc).__name__}: {exc}")
|
| 90 |
+
PY
|
| 91 |
+
python3 - <<PY
|
| 92 |
+
from transformers import AutoConfig
|
| 93 |
+
try:
|
| 94 |
+
    cfg = AutoConfig.from_pretrained("/models/kaiju", trust_remote_code=True)
|
| 95 |
+
    print("post-install AutoConfig:", type(cfg).__name__, getattr(cfg, "model_type", None))
|
| 96 |
+
except Exception as exc:
|
| 97 |
+
    print("post-install AutoConfig failed:", type(exc).__name__, exc)
|
| 98 |
+
PY
|
| 99 |
+
'
|
| 100 |
+
|
| 101 |
+
echo
|
| 102 |
+
echo "== LLM Compressor no-deps stack-preservation probe =="
|
| 103 |
+
docker run --rm --entrypoint bash -v "${MODEL_REMOTE}":/models/kaiju:ro "${VLLM_IMAGE}" -lc '
|
| 104 |
+
set -euo pipefail
|
| 105 |
+
python3 -m pip install -q --no-cache-dir --no-deps llmcompressor >/tmp/kaiju-pip-llmcompressor-nodeps.log 2>&1 || {
|
| 106 |
+
echo "llmcompressor --no-deps install failed"
|
| 107 |
+
sed -n "1,120p" /tmp/kaiju-pip-llmcompressor-nodeps.log
|
| 108 |
+
}
|
| 109 |
+
python3 - <<PY
|
| 110 |
+
try:
|
| 111 |
+
    import llmcompressor
|
| 112 |
+
    print("llmcompressor no-deps import:", getattr(llmcompressor, "__version__", "installed"))
|
| 113 |
+
except Exception as exc:
|
| 114 |
+
    print("llmcompressor no-deps import failed:", type(exc).__name__, exc)
|
| 115 |
+
from transformers import AutoConfig
|
| 116 |
+
cfg = AutoConfig.from_pretrained("/models/kaiju", trust_remote_code=True)
|
| 117 |
+
print("no-deps AutoConfig:", type(cfg).__name__, getattr(cfg, "model_type", None))
|
| 118 |
+
PY
|
| 119 |
+
'
|
| 120 |
+
|
| 121 |
+
echo
|
| 122 |
+
echo "== llama.cpp GGUF support probe =="
|
| 123 |
+
mkdir -p "$(dirname "${LLAMA_DIR}")"
|
| 124 |
+
if [[ -d "${LLAMA_DIR}/.git" ]]; then
|
| 125 |
+
git -C "${LLAMA_DIR}" fetch --depth 1 origin master >/dev/null 2>&1 || true
|
| 126 |
+
git -C "${LLAMA_DIR}" checkout -q FETCH_HEAD >/dev/null 2>&1 || true
|
| 127 |
+
else
|
| 128 |
+
rm -rf "${LLAMA_DIR}"
|
| 129 |
+
git clone --depth 1 https://github.com/ggml-org/llama.cpp "${LLAMA_DIR}" >/dev/null
|
| 130 |
+
fi
|
| 131 |
+
docker run --rm --entrypoint bash \
|
| 132 |
+
-v "${MODEL_REMOTE}":/models/kaiju:ro \
|
| 133 |
+
-v "${LLAMA_DIR}":/llama.cpp:ro \
|
| 134 |
+
"${VLLM_IMAGE}" -lc '
|
| 135 |
+
set -euo pipefail
|
| 136 |
+
cd /llama.cpp
|
| 137 |
+
python3 convert_hf_to_gguf.py --print-supported-models 2>&1 | grep -Ei "qwen3_5|qwen3.5|qwen35|qwen3" | head -40 || true
|
| 138 |
+
python3 convert_hf_to_gguf.py --help | grep -E -- "--dry-run|--outtype|--vocab-only" || true
|
| 139 |
+
set +e
|
| 140 |
+
python3 convert_hf_to_gguf.py \
|
| 141 |
+
--dry-run \
|
| 142 |
+
--outtype q8_0 \
|
| 143 |
+
--outfile /tmp/kaiju-coder-7-q8_0-dry-run.gguf \
|
| 144 |
+
/models/kaiju 2>&1 | sed -n "1,220p"
|
| 145 |
+
DRY_STATUS=${PIPESTATUS[0]}
|
| 146 |
+
set -e
|
| 147 |
+
echo "gguf_dry_run_exit: ${DRY_STATUS}"
|
| 148 |
+
exit 0
|
| 149 |
+
'
|
| 150 |
+
REMOTE
|
| 151 |
+
STATUS=${PIPESTATUS[0]}
|
| 152 |
+
set -e
|
| 153 |
+
|
| 154 |
+
{
|
| 155 |
+
echo "# Kaiju Coder 7 Persisted Quantization Probe"
|
| 156 |
+
echo
|
| 157 |
+
echo "- Timestamp: \`${STAMP}\`"
|
| 158 |
+
echo "- Model: \`${MODEL_REMOTE}\`"
|
| 159 |
+
echo "- vLLM image: \`${VLLM_IMAGE}\`"
|
| 160 |
+
echo "- llama.cpp path: \`${LLAMA_DIR}\`"
|
| 161 |
+
echo "- Exit code: \`${STATUS}\`"
|
| 162 |
+
echo "- Log: \`${LOG}\`"
|
| 163 |
+
echo
|
| 164 |
+
echo "## Interpretation"
|
| 165 |
+
echo
|
| 166 |
+
if grep -qi "QWEN35" "${LOG}"; then
|
| 167 |
+
echo "- GGUF conversion support probe found Qwen3.5/QWEN35 handling."
|
| 168 |
+
else
|
| 169 |
+
echo "- GGUF conversion support is not proven by this probe."
|
| 170 |
+
fi
|
| 171 |
+
if grep -q "AutoConfig: Qwen3_5Config" "${LOG}"; then
|
| 172 |
+
echo "- The configured vLLM image recognizes Kaiju's Qwen3.5 config."
|
| 173 |
+
else
|
| 174 |
+
echo "- The configured vLLM image did not recognize Kaiju's config."
|
| 175 |
+
fi
|
| 176 |
+
if grep -q "llmcompressor:" "${LOG}"; then
|
| 177 |
+
echo "- LLM Compressor package import was probed."
|
| 178 |
+
fi
|
| 179 |
+
echo
|
| 180 |
+
echo "Do not claim a persisted quantized artifact exists unless a later run writes"
|
| 181 |
+
echo "and verifies the quantized weights."
|
| 182 |
+
} > "${SUMMARY}"
|
| 183 |
+
|
| 184 |
+
echo "Summary: ${SUMMARY}"
|
| 185 |
+
exit "${STATUS}"
|
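The probe script above captures the remote session's exit code by turning off `set -e`, piping the output through `tee`, and reading `PIPESTATUS[0]` immediately afterwards. A minimal sketch of that pattern as a reusable function follows. The name `run_logged` is hypothetical and does not exist in this repository; it only illustrates why the exit code is taken from `PIPESTATUS` rather than from `$?`, which would report the status of `tee`.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not part of the repository): run a command, mirror its
# combined stdout/stderr into a log file, and return the command's own exit
# code instead of the exit code of tee.
run_logged() {
  local log="$1"
  shift
  local status

  # Disable errexit so a failing command does not abort before we can read
  # its status; this mirrors the set +e / set -e bracket in the scripts.
  set +e
  "$@" 2>&1 | tee "${log}"
  # PIPESTATUS must be read right after the pipeline, before any other command.
  status=${PIPESTATUS[0]}
  set -e

  return "${status}"
}
```

A caller would use it as `run_logged "${LOG}" some_command || STATUS=$?`, which keeps the log complete even when the command fails.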
scripts/run-gojira-b-kaiju-gguf-convert.sh
ADDED
|
@@ -0,0 +1,190 @@
|
| 1 |
+
#!/usr/bin/env bash
|
| 2 |
+
set -euo pipefail
|
| 3 |
+
|
| 4 |
+
ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
|
| 5 |
+
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
| 6 |
+
# shellcheck source=scripts/gojira-b-ssh-lib.sh
|
| 7 |
+
source "${SCRIPT_DIR}/gojira-b-ssh-lib.sh"
|
| 8 |
+
kaiju_gojira_b_init
|
| 9 |
+
|
| 10 |
+
MODEL_REMOTE="${KAIJU_GGUF_MODEL_REMOTE:-/home/richardecholsai5/kaiju-coder/models/Kaiju-Coder-Qwen3.6-27B-v1.8-merged}"
|
| 11 |
+
OUT_DIR="${KAIJU_GGUF_OUT_DIR:-/home/richardecholsai5/kaiju-coder/models/kaiju-coder-7-gguf}"
|
| 12 |
+
OUTTYPE="${KAIJU_GGUF_OUTTYPE:-q8_0}"
|
| 13 |
+
OUTTYPE_UPPER="$(printf "%s" "${OUTTYPE}" | tr "[:lower:]" "[:upper:]")"
|
| 14 |
+
OUTFILE="${KAIJU_GGUF_OUTFILE:-kaiju-coder-7-${OUTTYPE_UPPER}.gguf}"
|
| 15 |
+
VLLM_IMAGE="${KAIJU_GGUF_VLLM_IMAGE:-gojira/vllm-openai-ray:nightly}"
|
| 16 |
+
LLAMA_DIR="${KAIJU_LLAMA_CPP_REMOTE:-/home/richardecholsai5/tools/llama.cpp}"
|
| 17 |
+
FORCE="${KAIJU_GGUF_FORCE:-0}"
|
| 18 |
+
STOP_VLLM="${KAIJU_GGUF_STOP_VLLM:-1}"
|
| 19 |
+
RESTART_VLLM="${KAIJU_GGUF_RESTART_VLLM:-1}"
|
| 20 |
+
STAMP="$(date -u +%Y%m%dT%H%M%SZ)"
|
| 21 |
+
RUN_DIR="${ROOT}/runs/gguf-conversion/${STAMP}"
|
| 22 |
+
LOG="${RUN_DIR}/gguf-conversion.log"
|
| 23 |
+
SUMMARY="${RUN_DIR}/summary.md"
|
| 24 |
+
|
| 25 |
+
mkdir -p "${RUN_DIR}"
|
| 26 |
+
|
| 27 |
+
printf -v MODEL_REMOTE_Q "%q" "${MODEL_REMOTE}"
|
| 28 |
+
printf -v OUT_DIR_Q "%q" "${OUT_DIR}"
|
| 29 |
+
printf -v OUTFILE_Q "%q" "${OUTFILE}"
|
| 30 |
+
printf -v OUTTYPE_Q "%q" "${OUTTYPE}"
|
| 31 |
+
printf -v VLLM_IMAGE_Q "%q" "${VLLM_IMAGE}"
|
| 32 |
+
printf -v LLAMA_DIR_Q "%q" "${LLAMA_DIR}"
|
| 33 |
+
printf -v FORCE_Q "%q" "${FORCE}"
|
| 34 |
+
printf -v STOP_VLLM_Q "%q" "${STOP_VLLM}"
|
| 35 |
+
printf -v RESTART_VLLM_Q "%q" "${RESTART_VLLM}"
|
| 36 |
+
|
| 37 |
+
set +e
|
| 38 |
+
kaiju_gojira_b_ssh "MODEL_REMOTE=${MODEL_REMOTE_Q} OUT_DIR=${OUT_DIR_Q} OUTFILE=${OUTFILE_Q} OUTTYPE=${OUTTYPE_Q} VLLM_IMAGE=${VLLM_IMAGE_Q} LLAMA_DIR=${LLAMA_DIR_Q} FORCE=${FORCE_Q} STOP_VLLM=${STOP_VLLM_Q} RESTART_VLLM=${RESTART_VLLM_Q} bash -s" <<'REMOTE' 2>&1 | tee "${LOG}"
|
| 39 |
+
set -euo pipefail
|
| 40 |
+
|
| 41 |
+
VLLM_SESSION="${KAIJU_VLLM_SESSION:-kaiju_qwen36_v18_merged_vllm}"
|
| 42 |
+
VLLM_CONTAINER="${KAIJU_VLLM_CONTAINER:-qwen36-merged-vllm-18084}"
|
| 43 |
+
OUT_PATH="${OUT_DIR}/${OUTFILE}"
|
| 44 |
+
|
| 45 |
+
restart_vllm() {
|
| 46 |
+
if [[ "${RESTART_VLLM}" != "1" ]]; then
|
| 47 |
+
return
|
| 48 |
+
fi
|
| 49 |
+
if tmux has-session -t "${VLLM_SESSION}" 2>/dev/null; then
|
| 50 |
+
return
|
| 51 |
+
fi
|
| 52 |
+
echo "Restarting vLLM fast runtime on 18084"
|
| 53 |
+
mkdir -p /home/richardecholsai5/kaiju-coder/logs /home/richardecholsai5/hf-cache
|
| 54 |
+
sudo docker rm -f "${VLLM_CONTAINER}" >/dev/null 2>&1 || true
|
| 55 |
+
LOG=/home/richardecholsai5/kaiju-coder/logs/qwen36-merged-vllm-18084.log
|
| 56 |
+
rm -f "${LOG}"
|
| 57 |
+
tmux new-session -d -s "${VLLM_SESSION}" "set -euo pipefail; sudo docker run --rm --gpus all --network host --ipc=host \
|
| 58 |
+
-v '${MODEL_REMOTE}':/models/kaiju-merged:ro \
|
| 59 |
+
-v /home/richardecholsai5/hf-cache:/root/.cache/huggingface \
|
| 60 |
+
--name '${VLLM_CONTAINER}' \
|
| 61 |
+
--entrypoint bash \
|
| 62 |
+
'${VLLM_IMAGE}' \
|
| 63 |
+
-lc 'python3 -m pip install -q pandas; python3 -m vllm.entrypoints.openai.api_server \
|
| 64 |
+
--model /models/kaiju-merged \
|
| 65 |
+
--served-model-name kaiju-coder-7 \
|
| 66 |
+
--host 0.0.0.0 \
|
| 67 |
+
--port 18084 \
|
| 68 |
+
--max-model-len 16384 \
|
| 69 |
+
--gpu-memory-utilization 0.90 \
|
| 70 |
+
--trust-remote-code \
|
| 71 |
+
--language-model-only \
|
| 72 |
+
--dtype bfloat16 \
|
| 73 |
+
--tool-call-parser qwen3_coder \
|
| 74 |
+
--reasoning-parser qwen3 \
|
| 75 |
+
--quantization bitsandbytes \
|
| 76 |
+
--load-format bitsandbytes \
|
| 77 |
+
--enable-auto-tool-choice \
|
| 78 |
+
--uvicorn-log-level info' 2>&1 | tee '${LOG}'"
|
| 79 |
+
}
|
| 80 |
+
trap restart_vllm EXIT
|
| 81 |
+
|
| 82 |
+
echo "== GGUF conversion request =="
|
| 83 |
+
echo "model: ${MODEL_REMOTE}"
|
| 84 |
+
echo "out: ${OUT_PATH}"
|
| 85 |
+
echo "outtype: ${OUTTYPE}"
|
| 86 |
+
test -d "${MODEL_REMOTE}" || { echo "missing model: ${MODEL_REMOTE}" >&2; exit 2; }
|
| 87 |
+
mkdir -p "${OUT_DIR}" "$(dirname "${LLAMA_DIR}")"
|
| 88 |
+
du -sh "${MODEL_REMOTE}"
|
| 89 |
+
df -h /home | tail -1
|
| 90 |
+
free -h | sed -n '1,3p'
|
| 91 |
+
|
| 92 |
+
if [[ "${STOP_VLLM}" == "1" ]]; then
|
| 93 |
+
echo "Stopping active vLLM runtime to free RAM"
|
| 94 |
+
tmux kill-session -t "${VLLM_SESSION}" >/dev/null 2>&1 || true
|
| 95 |
+
sudo docker rm -f "${VLLM_CONTAINER}" >/dev/null 2>&1 || true
|
| 96 |
+
sleep 3
|
| 97 |
+
free -h | sed -n '1,3p'
|
| 98 |
+
fi
|
| 99 |
+
|
| 100 |
+
if [[ -s "${OUT_PATH}" && "${FORCE}" != "1" ]]; then
|
| 101 |
+
echo "Existing GGUF found, skipping conversion: ${OUT_PATH}"
|
| 102 |
+
else
|
| 103 |
+
if [[ -d "${LLAMA_DIR}/.git" ]]; then
|
| 104 |
+
git -C "${LLAMA_DIR}" fetch --depth 1 origin master >/dev/null 2>&1 || true
|
| 105 |
+
git -C "${LLAMA_DIR}" checkout -q FETCH_HEAD >/dev/null 2>&1 || true
|
| 106 |
+
else
|
| 107 |
+
rm -rf "${LLAMA_DIR}"
|
| 108 |
+
git clone --depth 1 https://github.com/ggml-org/llama.cpp "${LLAMA_DIR}" >/dev/null
|
| 109 |
+
fi
|
| 110 |
+
rm -f "${OUT_PATH}.tmp" "${OUT_PATH}"
|
| 111 |
+
docker run --rm --entrypoint bash \
|
| 112 |
+
-v "${MODEL_REMOTE}":/models/kaiju:ro \
|
| 113 |
+
-v "${OUT_DIR}":/out \
|
| 114 |
+
-v "${LLAMA_DIR}":/llama.cpp:ro \
|
| 115 |
+
"${VLLM_IMAGE}" -lc "
|
| 116 |
+
set -euo pipefail
|
| 117 |
+
cd /llama.cpp
|
| 118 |
+
python3 convert_hf_to_gguf.py \
|
| 119 |
+
--outtype '${OUTTYPE}' \
|
| 120 |
+
--outfile '/out/${OUTFILE}.tmp' \
|
| 121 |
+
/models/kaiju
|
| 122 |
+
"
|
| 123 |
+
mv "${OUT_PATH}.tmp" "${OUT_PATH}"
|
| 124 |
+
fi
|
| 125 |
+
|
| 126 |
+
echo
|
| 127 |
+
echo "== GGUF artifact =="
|
| 128 |
+
ls -lh "${OUT_PATH}"
|
| 129 |
+
sha256sum "${OUT_PATH}" | tee "${OUT_PATH}.sha256"
|
| 130 |
+
OUT_PATH_PY="${OUT_PATH}" \
|
| 131 |
+
OUT_DIR_PY="${OUT_DIR}" \
|
| 132 |
+
OUTTYPE_PY="${OUTTYPE}" \
|
| 133 |
+
MODEL_REMOTE_PY="${MODEL_REMOTE}" \
|
| 134 |
+
LLAMA_DIR_PY="${LLAMA_DIR}" \
|
| 135 |
+
python3 - <<'PY'
|
| 136 |
+
import json
|
| 137 |
+
import os
|
| 138 |
+
from pathlib import Path
|
| 139 |
+
|
| 140 |
+
out = Path(os.environ["OUT_PATH_PY"])
|
| 141 |
+
out_dir = Path(os.environ["OUT_DIR_PY"])
|
| 142 |
+
outtype = os.environ["OUTTYPE_PY"]
|
| 143 |
+
model_remote = os.environ["MODEL_REMOTE_PY"]
|
| 144 |
+
llama_dir = os.environ["LLAMA_DIR_PY"]
|
| 145 |
+
manifest = {
|
| 146 |
+
"product": "Kaiju Coder 7",
|
| 147 |
+
"model_id": "kaiju-coder-7",
|
| 148 |
+
"format": "GGUF",
|
| 149 |
+
"outtype": outtype,
|
| 150 |
+
"artifact": str(out),
|
| 151 |
+
"sha256_file": str(out) + ".sha256",
|
| 152 |
+
"source_model": model_remote,
|
| 153 |
+
"converter": llama_dir,
|
| 154 |
+
"status": "converted_pending_runtime_smoke",
|
| 155 |
+
}
|
| 156 |
+
(out_dir / "GGUF_RELEASE_MANIFEST.json").write_text(json.dumps(manifest, indent=2) + "\n", encoding="utf-8")
|
| 157 |
+
(out_dir / "README.md").write_text(
|
| 158 |
+
"# Kaiju Coder 7 GGUF Candidate\n\n"
|
| 159 |
+
"This is a persisted GGUF candidate converted from the merged Kaiju Coder 7 model.\n"
|
| 160 |
+
"It is not public release-ready until a runtime smoke test passes.\n\n"
|
| 161 |
+
f"- Artifact: `{out.name}`\n"
|
| 162 |
+
f"- Outtype: `{outtype}`\n"
|
| 163 |
+
f"- Source: `{model_remote}`\n",
|
| 164 |
+
encoding="utf-8",
|
| 165 |
+
)
|
| 166 |
+
PY
|
| 167 |
+
REMOTE
|
| 168 |
+
STATUS=${PIPESTATUS[0]}
|
| 169 |
+
set -e
|
| 170 |
+
|
| 171 |
+
{
|
| 172 |
+
echo "# Kaiju Coder 7 GGUF Conversion"
|
| 173 |
+
echo
|
| 174 |
+
echo "- Timestamp: \`${STAMP}\`"
|
| 175 |
+
echo "- Exit code: \`${STATUS}\`"
|
| 176 |
+
echo "- Model: \`${MODEL_REMOTE}\`"
|
| 177 |
+
echo "- Out dir: \`${OUT_DIR}\`"
|
| 178 |
+
echo "- Out file: \`${OUTFILE}\`"
|
| 179 |
+
echo "- Out type: \`${OUTTYPE}\`"
|
| 180 |
+
echo "- Log: \`${LOG}\`"
|
| 181 |
+
echo
|
| 182 |
+
if grep -q "GGUF artifact" "${LOG}" && grep -qE "^[0-9a-f]{64}[[:space:]]+${OUT_DIR}/${OUTFILE}$" "${LOG}"; then
|
| 183 |
+
echo "Status: converted; runtime smoke still required before public release."
|
| 184 |
+
else
|
| 185 |
+
echo "Status: conversion incomplete or failed."
|
| 186 |
+
fi
|
| 187 |
+
} > "${SUMMARY}"
|
| 188 |
+
|
| 189 |
+
echo "Summary: ${SUMMARY}"
|
| 190 |
+
exit "${STATUS}"
|
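The conversion script writes a `.sha256` sidecar next to the GGUF file and marks the manifest as `converted_pending_runtime_smoke`. Before running any smoke test, it may be worth confirming that the artifact still matches the recorded hash. The sketch below assumes GNU `sha256sum` and the sidecar layout produced by the script (`<hash>  <path>` on one line). `verify_gguf_checksum` is a hypothetical helper, not an existing script in this repository.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not part of the repository): compare a GGUF artifact
# against the hash stored in its "<artifact>.sha256" sidecar.
# Returns 0 on match, 1 on mismatch, 2 if either file is missing or empty.
verify_gguf_checksum() {
  local artifact="$1"
  local sidecar="${artifact}.sha256"
  local recorded=""
  local actual=""

  [[ -s "${artifact}" && -s "${sidecar}" ]] || return 2

  # Only the hash field is compared, so the check still works if the artifact
  # was copied to a different directory after conversion.
  read -r recorded _ < "${sidecar}" || true
  actual="$(sha256sum "${artifact}" | cut -d' ' -f1)"

  [[ "${recorded}" == "${actual}" ]]
}
```

Comparing only the hash field, rather than calling `sha256sum -c`, is a deliberate choice in this sketch: the sidecar records the absolute path on Gojira-B, which would not resolve on another host.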