restokes92 committed
Commit 785f3d7 · verified · 1 Parent(s): 6d7449a

Upload Kaiju Coder 7 runtime quantization recipe

GGUF_CANDIDATE.md ADDED
@@ -0,0 +1,51 @@
+ # Kaiju Coder 7 GGUF Candidate
+
+ This folder documents the persisted GGUF candidate for Kaiju Coder 7. The
+ artifact exists on Gojira-B, but it should stay marked as a candidate until a
+ runtime smoke test passes.
+
+ ## Artifact
+
+ - Format: GGUF
+ - Outtype: `q8_0`
+ - Remote path:
+   `/home/richardecholsai5/kaiju-coder/models/kaiju-coder-7-gguf/kaiju-coder-7-Q8_0.gguf`
+ - Remote size: `27G`
+ - SHA256:
+   `596a2c227a429c7309db753061d88d71ee3f8a3b48f17e41ba9d81b0f55bdd4e`
+ - Source model:
+   `/home/richardecholsai5/kaiju-coder/models/Kaiju-Coder-Qwen3.6-27B-v1.8-merged`
+ - Conversion evidence:
+   `runs/gguf-conversion/20260603T231446Z/gguf-conversion.log`
+
+ ## Status
+
+ Converted successfully on 2026-06-03. Runtime smoke is still required before
+ public upload or a Hugging Face quantized-weights claim.
+
+ The conversion path is promising because the current `llama.cpp`
+ `convert_hf_to_gguf.py` support list includes `Qwen3_5ForConditionalGeneration`
+ and the Q8_0 dry run completed before the real conversion.
+
+ ## Recreate
+
+ ```bash
+ ./scripts/probe-gojira-b-persisted-quantization.sh
+ ./scripts/run-gojira-b-kaiju-gguf-convert.sh
+ ```
+
+ The conversion script stops the active vLLM runtime to free RAM, writes the GGUF
+ artifact, records a checksum and manifest, then restarts the fast vLLM runtime.
+
+ ## Release Rule
+
+ Do not publish this as public quantized weights until all of these pass:
+
+ - runtime loads the GGUF with model id `kaiju-coder-7`
+ - direct identity smoke passes
+ - direct business-owner document smoke passes
+ - OpenCode or router smoke passes through the intended runtime
+ - README/model card states exact runtime, context, memory, and quality caveats
+
+ Until then, the public quantized path remains `kaiju-coder-7-quantized-runtime`,
+ which documents the already-smoked vLLM bitsandbytes setup.
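
The SHA256 recorded in `GGUF_CANDIDATE.md` can be re-checked before any upload. A minimal sketch, assuming the GGUF file is readable from wherever the check runs; the helper names `sha256_of_file` and `matches_recorded_digest` are illustrative and are not part of this repo:

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA256 of a file, reading it in chunks.

    Chunked reads keep memory flat even for a multi-gigabyte GGUF file.
    """
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_recorded_digest(path, expected_hex):
    """Compare a file's digest with a recorded one, ignoring case and whitespace."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For the candidate above, `expected_hex` would be the digest listed under Artifact, and `path` the remote GGUF path on Gojira-B.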
PUBLIC_TESTING_QUICKSTART.md CHANGED
@@ -19,7 +19,7 @@ Use this if you already have Kaiju Coder 7 served at an OpenAI-compatible
19
  ```bash
20
  git clone https://huggingface.co/RMDWLLC/kaiju-coder-7-opencode
21
  cd kaiju-coder-7-opencode
22
- python3 scripts/install_kaiju_opencode_profile.py --base-url http://127.0.0.1:18083/v1
23
  ```
24
 
25
  Then run OpenCode inside the project you want to edit:
@@ -65,23 +65,31 @@ the server to expose:
65
 
66
  ```text
67
  model id: kaiju-coder-7
68
- base URL: http://127.0.0.1:18083/v1
69
  context: 16384
70
  ```
71
 
72
  Then install the OpenCode helper with:
73
 
74
  ```bash
75
  git clone https://huggingface.co/RMDWLLC/kaiju-coder-7-opencode
76
  cd kaiju-coder-7-opencode
77
- python3 scripts/install_kaiju_opencode_profile.py --base-url http://127.0.0.1:18083/v1
78
  ```
79
 
80
  ### Path 3: Runtime-Quantized Local Candidate
81
 
82
  Use this only if you are comfortable with advanced serving setups. The current
83
- working quantized option is a runtime bitsandbytes recipe, not a separate
84
- persisted quantized weights repo.
85
 
86
  ```bash
87
  git clone https://huggingface.co/RMDWLLC/kaiju-coder-7-quantized-runtime
@@ -115,9 +123,12 @@ Expected result:
115
  - Public model id: `kaiju-coder-7`
116
  - OpenCode context: `16384`
117
  - Output cap for public testing: `2500`
 
118
  - Current reliable product path: model plus deterministic business-owner
119
- harness plus verifier
120
- - Raw multi-file OpenCode generation: still too slow for broad paid API claims
 
 
121
  - Paid API: not public until launch preflight passes
122
 
123
  ## What Not To Claim Yet
@@ -134,15 +145,21 @@ Do claim:
134
  - Kaiju Coder 7 has a working local/OpenCode release candidate
135
  - the current tested OpenCode default is 16k context
136
  - the helper package includes a lean agent and compaction loop guard
 
 
137
  - the paid API scaffold has tests and a launch preflight, but is not yet public
138
  - the packaged public smoke verifies a fresh OpenCode one-file write before
139
  public claims are refreshed
 
 
140
 
141
  ## Current Blockers Before Public Release
142
 
143
  - Hugging Face repo creation still requires a write-capable token or namespace.
144
  - Full merged model upload has not completed; the merged folder must first have
145
  the metadata packet synced by `prepare_hf_merged_model_metadata.sh`.
 
 
146
  - Public paid API launch needs real Cloudflare D1/KV/R2 bindings, Wrangler
147
  secret verification, Stripe webhook staging evidence, staging traffic, latency
148
  evidence, and rollback proof.
 
19
  ```bash
20
  git clone https://huggingface.co/RMDWLLC/kaiju-coder-7-opencode
21
  cd kaiju-coder-7-opencode
22
+ python3 scripts/install_kaiju_opencode_profile.py --base-url http://127.0.0.1:18181/v1
23
  ```
24
 
25
  Then run OpenCode inside the project you want to edit:
 
65
 
66
  ```text
67
  model id: kaiju-coder-7
68
+ base URL: http://127.0.0.1:18084/v1
69
  context: 16384
70
  ```
71
 
72
+ For the fastest OpenCode behavior, run the bundled fast proxy in a separate
73
+ terminal and point OpenCode at the proxy:
74
+
75
+ ```bash
76
+ KAIJU_OPENAI_BASE_URL=http://127.0.0.1:18084/v1 \
77
+ python3 scripts/kaiju_opencode_fast_proxy.py --host 127.0.0.1 --port 18181
78
+ ```
79
+
80
  Then install the OpenCode helper with:
81
 
82
  ```bash
83
  git clone https://huggingface.co/RMDWLLC/kaiju-coder-7-opencode
84
  cd kaiju-coder-7-opencode
85
+ python3 scripts/install_kaiju_opencode_profile.py --base-url http://127.0.0.1:18181/v1
86
  ```
87
 
88
  ### Path 3: Runtime-Quantized Local Candidate
89
 
90
  Use this only if you are comfortable with advanced serving setups. The current
91
+ working quantized option is a runtime bitsandbytes recipe. A Q8_0 GGUF artifact
92
+ has been converted, but it is still a candidate until runtime smoke passes.
93
 
94
  ```bash
95
  git clone https://huggingface.co/RMDWLLC/kaiju-coder-7-quantized-runtime
 
123
  - Public model id: `kaiju-coder-7`
124
  - OpenCode context: `16384`
125
  - Output cap for public testing: `2500`
126
+ - Fast OpenCode path: vLLM bitsandbytes runtime behind the Kaiju fast proxy
127
  - Current reliable product path: model plus deterministic business-owner
128
+ harness/router plus verifier
129
+ - Raw multi-file OpenCode generation: still too slow for broad paid claims;
130
+ useful for testing, but paid API claims should favor harnessed product
131
+ workflows until broader latency gates pass
132
  - Paid API: not public until launch preflight passes
133
 
134
  ## What Not To Claim Yet
 
145
  - Kaiju Coder 7 has a working local/OpenCode release candidate
146
  - the current tested OpenCode default is 16k context
147
  - the helper package includes a lean agent and compaction loop guard
148
+ - the fast proxy keeps OpenCode tool calls intact while forcing bounded,
149
+ non-thinking generation
150
  - the paid API scaffold has tests and a launch preflight, but is not yet public
151
  - the packaged public smoke verifies a fresh OpenCode one-file write before
152
  public claims are refreshed
153
+ - a GGUF Q8_0 candidate exists, but is not public quantized-weights release
154
+ evidence until runtime smoke passes
155
 
156
  ## Current Blockers Before Public Release
157
 
158
  - Hugging Face repo creation still requires a write-capable token or namespace.
159
  - Full merged model upload has not completed; the merged folder must first have
160
  the metadata packet synced by `prepare_hf_merged_model_metadata.sh`.
161
+ - The GGUF Q8_0 candidate still needs a runtime smoke before public
162
+ quantized-weights upload.
163
  - Public paid API launch needs real Cloudflare D1/KV/R2 bindings, Wrangler
164
  secret verification, Stripe webhook staging evidence, staging traffic, latency
165
  evidence, and rollback proof.
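
Before pointing OpenCode at the fast proxy, it may help to confirm the proxy is up and configured as expected. The proxy script in this commit serves a `/health` route that returns `ok`, `model`, `upstream`, and the three token caps. A minimal sketch of a check against that payload, assuming the proxy listens on `127.0.0.1:18181` as in the quickstart; `fetch_health` and `health_problems` are hypothetical helper names:

```python
import json
import urllib.request

# Keys the proxy's /health handler returns in this commit.
EXPECTED_KEYS = {
    "ok",
    "model",
    "upstream",
    "normal_max_tokens",
    "work_max_tokens",
    "artifact_max_tokens",
}


def health_problems(payload, expected_model="kaiju-coder-7"):
    """Return a list of human-readable problems found in a /health payload."""
    problems = []
    missing = sorted(EXPECTED_KEYS - payload.keys())
    if missing:
        problems.append("missing keys: " + ", ".join(missing))
    if payload.get("ok") is not True:
        problems.append("proxy did not report ok")
    if payload.get("model") != expected_model:
        problems.append(f"unexpected model id: {payload.get('model')!r}")
    return problems


def fetch_health(base_url="http://127.0.0.1:18181", timeout=5.0):
    """GET the proxy's /health route and decode the JSON body."""
    url = f"{base_url.rstrip('/')}/health"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

Calling `health_problems(fetch_health())` with the proxy running should return an empty list when the model id matches `kaiju-coder-7`.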
README.md CHANGED
@@ -14,8 +14,9 @@ weight artifact yet.
14
  - Required OpenCode launch flag: `--enable-auto-tool-choice`
15
  - Required preinstall in this image: `pandas`
16
  - Tested contexts: `8192`, `16384`
17
- - OpenCode smoke: passed
18
- - Persisted quantized Hugging Face weights: pending
 
19
 
20
  ## Run
21
 
@@ -30,7 +31,14 @@ KAIJU_VLLM_LOAD_FORMAT=bitsandbytes \
30
  ```
31
 
32
  The script stops the merged SGLang service, starts vLLM on port `18084`, runs
33
- the benchmark, then restores the recommended SGLang service on port `18083`.
34
 
35
  ## Evidence
36
 
@@ -40,6 +48,7 @@ Runs:
40
  - `runs/benchmarks/20260603T154450Z-kaiju-coder-7-serving/summary.md`
41
  - `runs/benchmarks/20260603T161316Z-kaiju-coder-7-serving/summary.md`
42
  - `runs/benchmarks/20260603T165512Z-kaiju-coder-7-serving/summary.md`
 
43
 
44
  | Runtime | Context | Prompt | OK | Seconds | Chars | Chars/s |
45
  | --- | ---: | --- | --- | ---: | ---: | ---: |
@@ -49,12 +58,23 @@ Runs:
49
  | vLLM bitsandbytes | 16384 | code_patch | True | 11.3 | 416 | 36.814 |
50
  | vLLM bitsandbytes | 16384 | business_doc | True | 53.44 | 1610 | 30.127 |
51
  | vLLM bitsandbytes | 16384 | identity | True | 19.65 | 26 | 1.323 |
 
 
52
 
53
  Gojira-B log evidence recorded model load at about `17.8 GiB` memory for both
54
  8k and 16k bitsandbytes runs. This is a meaningful local-serving improvement
55
  over the full bfloat16 vLLM model load, which reported about `50.22 GiB`.
56
- The 16k business-document task passed after the wrapper restored the default
57
- SGLang service.
58
 
59
  OpenCode one-file smoke also passed through the runtime-quantized endpoint:
60
 
@@ -71,13 +91,32 @@ Result:
71
  - Permission mode: `--dangerously-skip-permissions` inside the temporary smoke
72
  harness only
73
 
74
  ## Release Interpretation
75
 
76
  This is a working quantized local runtime candidate. It is useful for internal
77
  testing, serious GPU users, and the next paid API speed experiments. It is not
78
- yet a standalone public quantized weights repo because the artifact is still the
79
- full merged model loaded through bitsandbytes at runtime.
80
 
81
- The next release step is to produce a persisted quantized artifact, or package
82
- this runtime path as an advanced serving recipe while clearly saying it still
83
  requires access to the full Kaiju Coder 7 merged weights.
 
14
  - Required OpenCode launch flag: `--enable-auto-tool-choice`
15
  - Required preinstall in this image: `pandas`
16
  - Tested contexts: `8192`, `16384`
17
+ - OpenCode smoke: passed through the local fast proxy
18
+ - Persisted quantized Hugging Face weights: GGUF Q8_0 converted, runtime smoke
19
+ pending before public upload
20
 
21
  ## Run
22
 
 
31
  ```
32
 
33
  The script stops the merged SGLang service, starts vLLM on port `18084`, runs
34
+ the benchmark, then restores SGLang unless `KAIJU_VLLM_KEEP_RUNNING=1` is set.
35
+ For the current fast OpenCode setup, keep vLLM running and point the fast proxy
36
+ at port `18084`.
37
+
38
+ ```bash
39
+ KAIJU_OPENAI_BASE_URL=http://127.0.0.1:18084/v1 \
40
+ python3 scripts/kaiju_opencode_fast_proxy.py --host 127.0.0.1 --port 18181
41
+ ```
42
 
43
  ## Evidence
44
 
 
48
  - `runs/benchmarks/20260603T154450Z-kaiju-coder-7-serving/summary.md`
49
  - `runs/benchmarks/20260603T161316Z-kaiju-coder-7-serving/summary.md`
50
  - `runs/benchmarks/20260603T165512Z-kaiju-coder-7-serving/summary.md`
51
+ - `runs/benchmarks/20260603T223337Z-kaiju-coder-7-serving/summary.md`
52
 
53
  | Runtime | Context | Prompt | OK | Seconds | Chars | Chars/s |
54
  | --- | ---: | --- | --- | ---: | ---: | ---: |
 
58
  | vLLM bitsandbytes | 16384 | code_patch | True | 11.3 | 416 | 36.814 |
59
  | vLLM bitsandbytes | 16384 | business_doc | True | 53.44 | 1610 | 30.127 |
60
  | vLLM bitsandbytes | 16384 | identity | True | 19.65 | 26 | 1.323 |
61
+ | vLLM bitsandbytes | 16384 | code_patch | True | 24.97 | 997 | 39.924 |
62
+ | vLLM bitsandbytes | 16384 | business_doc | True | 34.46 | 1615 | 46.874 |
63
 
64
  Gojira-B log evidence recorded model load at about `17.8 GiB` memory for both
65
  8k and 16k bitsandbytes runs. This is a meaningful local-serving improvement
66
  over the full bfloat16 vLLM model load, which reported about `50.22 GiB`.
67
+ The 16k business-document task passed, and the current speed pass keeps the
68
+ runtime-quantized vLLM service active for OpenCode through the local proxy.
69
+
70
+ The dedicated website harness/router speed pass produced a complete checked
71
+ website in about `7.2s` through vLLM bitsandbytes:
72
+
73
+ - Direct website harness: `runs/harness/website-speed-pass/avery-stone-vllm.html`
74
+ - Router artifact: `runs/router-speed-pass/20260603T223731Z-website-build-a-premium-one-page-website-for-avery-stone-construction-a-reside/index.html`
75
+ - Local-proxy router artifact: `runs/router-speed-pass/20260603T224328Z-website-build-a-premium-one-page-website-for-bennett-family-dental-in-charlott/index.html`
76
+ - Router checks: complete HTML, required sections, external images,
77
+ responsive CSS, no lorem ipsum, manifest write
78
 
79
  OpenCode one-file smoke also passed through the runtime-quantized endpoint:
80
 
 
91
  - Permission mode: `--dangerously-skip-permissions` inside the temporary smoke
92
  harness only
93
 
94
+ ## Persisted GGUF Candidate
95
+
96
+ A Q8_0 GGUF candidate now exists on Gojira-B:
97
+
98
+ ```text
99
+ /home/richardecholsai5/kaiju-coder/models/kaiju-coder-7-gguf/kaiju-coder-7-Q8_0.gguf
100
+ ```
101
+
102
+ - Size: `27G`
103
+ - SHA256:
104
+ `596a2c227a429c7309db753061d88d71ee3f8a3b48f17e41ba9d81b0f55bdd4e`
105
+ - Conversion evidence:
106
+ `runs/gguf-conversion/20260603T231446Z/gguf-conversion.log`
107
+ - Local docs: `release/gguf/README.md`
108
+
109
+ This is not public quantized-weights release evidence yet. It still needs a
110
+ runtime smoke that proves identity, business-owner output, and the intended
111
+ OpenCode/router path under an actual GGUF runtime.
112
+
113
  ## Release Interpretation
114
 
115
  This is a working quantized local runtime candidate. It is useful for internal
116
  testing, serious GPU users, and the next paid API speed experiments. It is not
117
+ yet a standalone public quantized weights repo because the only fully smoked
118
+ path is still the full merged model loaded through bitsandbytes at runtime.
119
 
120
+ The next release step is to smoke-test the GGUF candidate or package this
121
+ runtime path as an advanced serving recipe while clearly saying it still
122
  requires access to the full Kaiju Coder 7 merged weights.
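
The `Chars/s` column in the README evidence table is characters divided by wall-clock seconds, and the memory comparison is a simple ratio of the two reported load sizes. A small sketch of both calculations; the function names are chosen here for illustration, and some table rows may differ in the last digits, presumably because the logged seconds were rounded for display:

```python
def chars_per_second(chars, seconds):
    """Throughput as reported in the table: output characters per second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return chars / seconds


def memory_saving_fraction(quantized_gib, baseline_gib):
    """Fraction of load memory saved relative to the unquantized baseline."""
    if baseline_gib <= 0:
        raise ValueError("baseline_gib must be positive")
    return 1.0 - quantized_gib / baseline_gib
```

With the figures reported above, `416` chars in `11.3` s gives about `36.8` chars/s, and `17.8 GiB` against `50.22 GiB` is roughly a 65% smaller model load.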
scripts/kaiju_opencode_fast_proxy.py ADDED
@@ -0,0 +1,234 @@
+ #!/usr/bin/env python3
+ """Tool-safe OpenAI-compatible fast proxy for Kaiju Coder 7 OpenCode.
+
+ The normal Gojira gateway is product/API oriented and aggregates content. OpenCode
+ needs raw tool-call chunks preserved, so this proxy only patches serving knobs
+ and then passes upstream responses through unchanged.
+ """
+
+ from __future__ import annotations
+
+ import argparse
+ import json
+ import os
+ import time
+ import urllib.error
+ import urllib.request
+ from http import HTTPStatus
+ from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
+ from typing import Any
+
+
+ DEFAULT_HOST = "127.0.0.1"
+ DEFAULT_PORT = int(os.environ.get("KAIJU_OPENCODE_FAST_PROXY_PORT", "18181"))
+ UPSTREAM_BASE_URL = os.environ.get("KAIJU_OPENAI_BASE_URL", "http://100.109.109.14:18084/v1")
+ DEFAULT_MODEL = os.environ.get("KAIJU_DEFAULT_MODEL", "kaiju-coder-7")
+ API_KEY = os.environ.get("KAIJU_OPENAI_API_KEY", "")
+ NORMAL_MAX_TOKENS = int(os.environ.get("KAIJU_NORMAL_MAX_TOKENS", "384"))
+ WORK_MAX_TOKENS = int(os.environ.get("KAIJU_WORK_MAX_TOKENS", "1536"))
+ ARTIFACT_MAX_TOKENS = int(os.environ.get("KAIJU_ARTIFACT_MAX_TOKENS", "4096"))
+ MAX_REQUEST_BYTES = int(os.environ.get("KAIJU_MAX_REQUEST_BYTES", "2097152"))
+
+
+ def normalize_messages(messages: Any) -> list[dict[str, Any]]:
+     if not isinstance(messages, list):
+         return []
+     return [message for message in messages if isinstance(message, dict)]
+
+
+ def message_text(messages: list[dict[str, Any]]) -> str:
+     parts: list[str] = []
+     for message in messages:
+         content = message.get("content", "")
+         if isinstance(content, str):
+             parts.append(content)
+         else:
+             parts.append(json.dumps(content, ensure_ascii=False))
+     return "\n".join(parts).lower()
+
+
+ def classify_job(messages: list[dict[str, Any]]) -> str:
+     text = message_text(messages)
+     artifact_terms = (
+         "complete html",
+         "html file",
+         "one-file website",
+         "landing page",
+         "build a website",
+         "make a website",
+         "full file",
+     )
+     work_terms = (
+         "create ",
+         "write ",
+         "edit ",
+         "implement",
+         "debug",
+         "fix",
+         "refactor",
+         "test",
+         "repo",
+         "file",
+     )
+     if any(term in text for term in artifact_terms):
+         return "artifact"
+     if any(term in text for term in work_terms):
+         return "work"
+     return "normal"
+
+
+ def target_tokens(job_class: str) -> int:
+     if job_class == "artifact":
+         return ARTIFACT_MAX_TOKENS
+     if job_class == "work":
+         return WORK_MAX_TOKENS
+     return NORMAL_MAX_TOKENS
+
+
+ def patch_chat_payload(body: dict[str, Any]) -> dict[str, Any]:
+     patched = dict(body)
+     patched["model"] = DEFAULT_MODEL
+     messages = normalize_messages(patched.get("messages"))
+     job_class = classify_job(messages)
+     patched["max_tokens"] = target_tokens(job_class)
+     patched["chat_template_kwargs"] = {
+         **(patched.get("chat_template_kwargs") if isinstance(patched.get("chat_template_kwargs"), dict) else {}),
+         "enable_thinking": False,
+         "thinking": False,
+     }
+     return patched
+
+
+ class Handler(BaseHTTPRequestHandler):
+     server_version = "KaijuOpenCodeFastProxy/0.1"
+
+     def log_message(self, fmt: str, *args: Any) -> None:
+         print(f"{time.strftime('%Y-%m-%d %H:%M:%S')} {self.address_string()} - {fmt % args}", flush=True)
+
+     def _json(self, status: int, payload: dict[str, Any]) -> None:
+         data = json.dumps(payload).encode("utf-8")
+         self.send_response(status)
+         self.send_header("content-type", "application/json; charset=utf-8")
+         self.send_header("cache-control", "no-store")
+         self.send_header("content-length", str(len(data)))
+         self.end_headers()
+         self.wfile.write(data)
+
+     def _read_json(self) -> dict[str, Any]:
+         length = int(self.headers.get("content-length", "0"))
+         if length > MAX_REQUEST_BYTES:
+             raise ValueError("request body too large")
+         raw = self.rfile.read(length)
+         if not raw:
+             return {}
+         value = json.loads(raw.decode("utf-8"))
+         if not isinstance(value, dict):
+             raise ValueError("request body must be a JSON object")
+         return value
+
+     def do_GET(self) -> None:  # noqa: N802 - BaseHTTPRequestHandler API.
+         if self.path == "/health":
+             self._json(
+                 HTTPStatus.OK,
+                 {
+                     "ok": True,
+                     "model": DEFAULT_MODEL,
+                     "upstream": UPSTREAM_BASE_URL,
+                     "normal_max_tokens": NORMAL_MAX_TOKENS,
+                     "work_max_tokens": WORK_MAX_TOKENS,
+                     "artifact_max_tokens": ARTIFACT_MAX_TOKENS,
+                 },
+             )
+             return
+         if self.path == "/v1/models":
+             self._forward_get("/models")
+             return
+         self._json(HTTPStatus.NOT_FOUND, {"error": {"message": "Not found", "type": "not_found"}})
+
+     def do_POST(self) -> None:  # noqa: N802 - BaseHTTPRequestHandler API.
+         if self.path != "/v1/chat/completions":
+             self._json(HTTPStatus.NOT_FOUND, {"error": {"message": "Not found", "type": "not_found"}})
+             return
+         try:
+             body = patch_chat_payload(self._read_json())
+         except Exception as error:  # noqa: BLE001 - return request parse failures.
+             self._json(HTTPStatus.BAD_REQUEST, {"error": {"message": str(error), "type": "bad_request"}})
+             return
+         self._forward_post("/chat/completions", body)
+
+     def _headers(self) -> dict[str, str]:
+         headers = {"content-type": "application/json"}
+         if API_KEY:
+             headers["authorization"] = f"Bearer {API_KEY}"
+         return headers
+
+     def _forward_get(self, suffix: str) -> None:
+         request = urllib.request.Request(
+             f"{UPSTREAM_BASE_URL.rstrip('/')}{suffix}",
+             headers=self._headers(),
+             method="GET",
+         )
+         try:
+             with urllib.request.urlopen(request, timeout=30) as upstream:
+                 data = upstream.read()
+                 self.send_response(upstream.status)
+                 self.send_header("content-type", upstream.headers.get("content-type", "application/json"))
+                 self.send_header("cache-control", "no-store")
+                 self.send_header("content-length", str(len(data)))
+                 self.end_headers()
+                 self.wfile.write(data)
+         except urllib.error.HTTPError as error:
+             self._json(error.code, {"error": {"message": error.read().decode("utf-8", errors="replace")[:500]}})
+         except Exception as error:  # noqa: BLE001 - proxy health should surface upstream failures.
+             self._json(HTTPStatus.BAD_GATEWAY, {"error": {"message": str(error), "type": "upstream_error"}})
+
+     def _forward_post(self, suffix: str, body: dict[str, Any]) -> None:
+         data = json.dumps(body).encode("utf-8")
+         request = urllib.request.Request(
+             f"{UPSTREAM_BASE_URL.rstrip('/')}{suffix}",
+             data=data,
+             headers=self._headers(),
+             method="POST",
+         )
+         try:
+             timeout = 1200 if classify_job(normalize_messages(body.get("messages"))) == "artifact" else 600
+             with urllib.request.urlopen(request, timeout=timeout) as upstream:
+                 content_type = upstream.headers.get("content-type", "application/json")
+                 if body.get("stream") is True:
+                     self.send_response(upstream.status)
+                     self.send_header("content-type", content_type)
+                     self.send_header("cache-control", "no-store, no-transform")
+                     self.send_header("connection", "close")
+                     self.end_headers()
+                     for chunk in upstream:
+                         self.wfile.write(chunk)
+                         self.wfile.flush()
+                     return
+                 response = upstream.read()
+                 self.send_response(upstream.status)
+                 self.send_header("content-type", content_type)
+                 self.send_header("cache-control", "no-store")
+                 self.send_header("content-length", str(len(response)))
+                 self.end_headers()
+                 self.wfile.write(response)
+         except urllib.error.HTTPError as error:
+             detail = error.read().decode("utf-8", errors="replace")[:500]
+             self._json(error.code, {"error": {"message": detail, "type": "upstream_error"}})
+         except Exception as error:  # noqa: BLE001 - proxy should report upstream failures.
+             self._json(HTTPStatus.BAD_GATEWAY, {"error": {"message": str(error), "type": "upstream_error"}})
+
+
+ def main() -> int:
+     parser = argparse.ArgumentParser(description=__doc__)
+     parser.add_argument("--host", default=DEFAULT_HOST)
+     parser.add_argument("--port", type=int, default=DEFAULT_PORT)
+     args = parser.parse_args()
+     server = ThreadingHTTPServer((args.host, args.port), Handler)
+     print(f"Kaiju OpenCode fast proxy listening on http://{args.host}:{args.port}", flush=True)
+     print(f"Upstream: {UPSTREAM_BASE_URL}", flush=True)
+     server.serve_forever()
+     return 0
+
+
+ if __name__ == "__main__":
+     raise SystemExit(main())
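
Review note on `classify_job`: the proxy picks a token cap by plain substring matching over the lowercased text of every message, so short terms such as `test`, `fix`, and `file` also match inside longer words (`latest`, `prefix`, `profile`), and a system prompt that mentions files would likely push most requests into the `work` class. The sketch below reproduces the committed matching in a self-contained form and contrasts it with a hypothetical variant that only matches at the start of a word. `classify_word_start` and `WORK_PATTERN` are illustrations, not code from this commit, and the caps are the script's defaults:

```python
import re

# Term lists and default caps copied from kaiju_opencode_fast_proxy.py.
ARTIFACT_TERMS = (
    "complete html", "html file", "one-file website", "landing page",
    "build a website", "make a website", "full file",
)
WORK_TERMS = (
    "create ", "write ", "edit ", "implement", "debug",
    "fix", "refactor", "test", "repo", "file",
)
TOKEN_CAPS = {"normal": 384, "work": 1536, "artifact": 4096}


def classify_substring(text):
    """Same rule as the committed classify_job: plain substring checks."""
    lowered = text.lower()
    if any(term in lowered for term in ARTIFACT_TERMS):
        return "artifact"
    if any(term in lowered for term in WORK_TERMS):
        return "work"
    return "normal"


# Hypothetical alternative: a term only counts at the start of a word,
# so "latest" no longer matches "test" but "implementing" still matches.
WORK_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(term.strip()) for term in WORK_TERMS) + r")"
)


def classify_word_start(text):
    """Illustrative variant that anchors work terms at a word boundary."""
    lowered = text.lower()
    if any(term in lowered for term in ARTIFACT_TERMS):
        return "artifact"
    if WORK_PATTERN.search(lowered):
        return "work"
    return "normal"
```

The trade-off is recall versus precision: the substring rule errs toward the larger `work` cap, while the word-start variant keeps short chat turns on the `normal` cap at the cost of missing terms embedded mid-word.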
scripts/probe-gojira-b-persisted-quantization.sh ADDED
@@ -0,0 +1,185 @@
1
+ #!/usr/bin/env bash
2
+ set -euo pipefail
3
+
4
+ ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
5
+ SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
6
+ # shellcheck source=scripts/gojira-b-ssh-lib.sh
7
+ source "${SCRIPT_DIR}/gojira-b-ssh-lib.sh"
8
+ kaiju_gojira_b_init
9
+
10
+ STAMP="$(date -u +%Y%m%dT%H%M%SZ)"
11
+ RUN_DIR="${ROOT}/runs/quantization-probes/${STAMP}"
12
+ LOG="${RUN_DIR}/persisted-quantization-probe.log"
13
+ SUMMARY="${RUN_DIR}/summary.md"
14
+ MODEL_REMOTE="${KAIJU_QUANT_MODEL_REMOTE:-/home/richardecholsai5/kaiju-coder/models/Kaiju-Coder-Qwen3.6-27B-v1.8-merged}"
15
+ VLLM_IMAGE="${KAIJU_QUANT_VLLM_IMAGE:-gojira/vllm-openai-ray:nightly}"
16
+ LLAMA_DIR="${KAIJU_LLAMA_CPP_REMOTE:-/home/richardecholsai5/tools/llama.cpp}"
17
+
18
+ mkdir -p "${RUN_DIR}"
19
+ printf -v MODEL_REMOTE_Q "%q" "${MODEL_REMOTE}"
20
+ printf -v VLLM_IMAGE_Q "%q" "${VLLM_IMAGE}"
21
+ printf -v LLAMA_DIR_Q "%q" "${LLAMA_DIR}"
22
+
23
+ set +e
24
+ kaiju_gojira_b_ssh "MODEL_REMOTE=${MODEL_REMOTE_Q} VLLM_IMAGE=${VLLM_IMAGE_Q} LLAMA_DIR=${LLAMA_DIR_Q} bash -s" <<'REMOTE' 2>&1 | tee "${LOG}"
25
+ set -euo pipefail
26
+
27
+ echo "== Host and model =="
28
+ test -d "${MODEL_REMOTE}" || { echo "missing model: ${MODEL_REMOTE}" >&2; exit 2; }
29
+ du -sh "${MODEL_REMOTE}"
30
+ df -h /home | tail -1
31
+ free -h | sed -n '1,3p'
32
+ nvidia-smi --query-gpu=name,memory.total,memory.used,memory.free --format=csv,noheader || true
33
+ docker ps --format "{{.Names}} {{.Status}} {{.Image}}" | grep -Ei "qwen|kaiju|sglang|vllm" || true
34
+
35
+ echo
36
+ echo "== Model config =="
37
+ MODEL_REMOTE="${MODEL_REMOTE}" python3 - <<'PY'
38
+ import json
39
+ import os
40
+ from pathlib import Path
41
+
42
+ config = json.loads((Path(os.environ["MODEL_REMOTE"]) / "config.json").read_text())
43
+ text = config.get("text_config") or {}
44
+ print("model_type:", config.get("model_type"))
45
+ print("architectures:", config.get("architectures"))
46
+ print("text_model_type:", text.get("model_type"))
47
+ print("layers:", text.get("num_hidden_layers"))
48
+ print("layer_types:", ",".join(sorted(set(text.get("layer_types") or []))))
49
+ PY
50
+
51
+ echo
52
+ echo "== vLLM/Qwen3.5-capable Python stack =="
53
+ docker run --rm --entrypoint bash -v "${MODEL_REMOTE}":/models/kaiju:ro "${VLLM_IMAGE}" -lc '
54
+ set -euo pipefail
55
+ python3 - <<PY
56
+ from transformers import AutoConfig
57
+ cfg = AutoConfig.from_pretrained("/models/kaiju", trust_remote_code=True)
58
+ print("AutoConfig:", type(cfg).__name__, getattr(cfg, "model_type", None))
59
+ PY
60
+ python3 - <<PY
61
+ for mod in ["torch", "transformers", "safetensors", "vllm", "huggingface_hub"]:
62
+ m = __import__(mod)
63
+ version = getattr(m, "__version__", "installed")
64
+ print(mod + ": " + str(version))
65
+ PY
66
+ '
67
+
68
+ echo
69
+ echo "== Persistent quantization package import probe =="
70
+ docker run --rm --entrypoint bash -v "${MODEL_REMOTE}":/models/kaiju:ro "${VLLM_IMAGE}" -lc '
71
+ set -euo pipefail
72
+ for pkg in llmcompressor autoawq auto-gptq; do
73
+ echo "-- pip install ${pkg}"
74
+ if python3 -m pip install -q --no-cache-dir "${pkg}" >/tmp/kaiju-pip-${pkg}.log 2>&1; then
75
+ echo "${pkg}: install ok"
76
+ else
77
+ echo "${pkg}: install failed"
78
+ sed -n "1,120p" "/tmp/kaiju-pip-${pkg}.log"
79
+ fi
80
+ done
81
+ python3 - <<PY
82
+ mods = [("llmcompressor", "llmcompressor"), ("autoawq", "awq"), ("auto-gptq", "auto_gptq")]
83
+ for label, mod in mods:
84
+ try:
85
+ m = __import__(mod)
86
+ version = getattr(m, "__version__", "installed")
87
+ print(label + ": import ok: " + str(version))
88
+ except Exception as exc:
89
+ print(f"{label}: import failed: {type(exc).__name__}: {exc}")
90
+ PY
91
+ python3 - <<PY
92
+ from transformers import AutoConfig
93
+ try:
94
+ cfg = AutoConfig.from_pretrained("/models/kaiju", trust_remote_code=True)
95
+ print("post-install AutoConfig:", type(cfg).__name__, getattr(cfg, "model_type", None))
96
+ except Exception as exc:
97
+ print("post-install AutoConfig failed:", type(exc).__name__, exc)
98
+ PY
99
+ '
100
+
101
+ echo
102
+ echo "== LLM Compressor no-deps stack-preservation probe =="
103
+ docker run --rm --entrypoint bash -v "${MODEL_REMOTE}":/models/kaiju:ro "${VLLM_IMAGE}" -lc '
104
+ set -euo pipefail
105
+ python3 -m pip install -q --no-cache-dir --no-deps llmcompressor >/tmp/kaiju-pip-llmcompressor-nodeps.log 2>&1 || {
106
+ echo "llmcompressor --no-deps install failed"
107
+ sed -n "1,120p" /tmp/kaiju-pip-llmcompressor-nodeps.log
108
+ }
109
+ python3 - <<PY
110
+ try:
111
+ import llmcompressor
112
+ print("llmcompressor no-deps import:", getattr(llmcompressor, "__version__", "installed"))
113
+ except Exception as exc:
114
+ print("llmcompressor no-deps import failed:", type(exc).__name__, exc)
115
+ from transformers import AutoConfig
116
+ cfg = AutoConfig.from_pretrained("/models/kaiju", trust_remote_code=True)
117
+ print("no-deps AutoConfig:", type(cfg).__name__, getattr(cfg, "model_type", None))
118
+ PY
119
+ '
120
+
121
+ echo
122
+ echo "== llama.cpp GGUF support probe =="
123
+ mkdir -p "$(dirname "${LLAMA_DIR}")"
124
+ if [[ -d "${LLAMA_DIR}/.git" ]]; then
125
+ git -C "${LLAMA_DIR}" fetch --depth 1 origin master >/dev/null 2>&1 || true
126
+ git -C "${LLAMA_DIR}" checkout -q FETCH_HEAD >/dev/null 2>&1 || true
127
+ else
128
+   rm -rf "${LLAMA_DIR}"
+   git clone --depth 1 https://github.com/ggml-org/llama.cpp "${LLAMA_DIR}" >/dev/null
+ fi
+ docker run --rm --entrypoint bash \
+   -v "${MODEL_REMOTE}":/models/kaiju:ro \
+   -v "${LLAMA_DIR}":/llama.cpp:ro \
+   "${VLLM_IMAGE}" -lc '
+     set -euo pipefail
+     cd /llama.cpp
+     python3 convert_hf_to_gguf.py --print-supported-models 2>&1 | grep -Ei "qwen3_5|qwen3.5|qwen35|qwen3" | head -40 || true
+     python3 convert_hf_to_gguf.py --help | grep -E -- "--dry-run|--outtype|--vocab-only" || true
+     set +e
+     python3 convert_hf_to_gguf.py \
+       --dry-run \
+       --outtype q8_0 \
+       --outfile /tmp/kaiju-coder-7-q8_0-dry-run.gguf \
+       /models/kaiju 2>&1 | sed -n "1,220p"
+     DRY_STATUS=${PIPESTATUS[0]}
+     set -e
+     echo "gguf_dry_run_exit: ${DRY_STATUS}"
+     exit 0
+   '
+ REMOTE
+ STATUS=${PIPESTATUS[0]}
+ set -e
+
+ {
+   echo "# Kaiju Coder 7 Persisted Quantization Probe"
+   echo
+   echo "- Timestamp: \`${STAMP}\`"
+   echo "- Model: \`${MODEL_REMOTE}\`"
+   echo "- vLLM image: \`${VLLM_IMAGE}\`"
+   echo "- llama.cpp path: \`${LLAMA_DIR}\`"
+   echo "- Exit code: \`${STATUS}\`"
+   echo "- Log: \`${LOG}\`"
+   echo
+   echo "## Interpretation"
+   echo
+   if grep -q "Model architecture: QWEN35" "${LOG}" || grep -qi "QWEN35" "${LOG}"; then
+     echo "- GGUF conversion support probe found Qwen3.5/QWEN35 handling."
+   else
+     echo "- GGUF conversion support is not proven by this probe."
+   fi
+   if grep -q "AutoConfig: Qwen3_5Config" "${LOG}"; then
+     echo "- The pinned vLLM nightly stack recognizes Kaiju's Qwen3.5 config."
+   else
+     echo "- The pinned vLLM nightly stack did not recognize Kaiju's config."
+   fi
+   if grep -q "llmcompressor:" "${LOG}"; then
+     echo "- LLM Compressor package import was probed."
+   fi
+   echo
+   echo "Do not claim a persisted quantized artifact exists unless a later run writes"
+   echo "and verifies the quantized weights."
+ } > "${SUMMARY}"
+
+ echo "Summary: ${SUMMARY}"
+ exit "${STATUS}"
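The probe script above reads `${PIPESTATUS[0]}` right after piping the converter through `sed`, so the recorded status is the converter's and not the filter's. A minimal sketch of that pattern follows; `fake_converter` and `first_stage_status` are hypothetical names used only for illustration and are not part of the commit.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for convert_hf_to_gguf.py: prints one line, then fails.
fake_converter() {
  echo "dry run output"
  return 3
}

# Print the exit status of the FIRST pipeline stage, even though the
# trailing `sed` stage succeeds.
first_stage_status() {
  local status
  fake_converter 2>&1 | sed -n "1,220p" >/dev/null
  # PIPESTATUS must be read immediately; any later command overwrites it.
  status=${PIPESTATUS[0]}
  echo "${status}"
}
```

Calling `first_stage_status` prints the status returned by `fake_converter`. Declaring `local status` before the pipeline matters here, because a `local` statement placed between the pipeline and the read would itself reset `PIPESTATUS`.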
scripts/run-gojira-b-kaiju-gguf-convert.sh ADDED
@@ -0,0 +1,190 @@
+ #!/usr/bin/env bash
+ set -euo pipefail
+
+ ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
+ SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+ # shellcheck source=scripts/gojira-b-ssh-lib.sh
+ source "${SCRIPT_DIR}/gojira-b-ssh-lib.sh"
+ kaiju_gojira_b_init
+
+ MODEL_REMOTE="${KAIJU_GGUF_MODEL_REMOTE:-/home/richardecholsai5/kaiju-coder/models/Kaiju-Coder-Qwen3.6-27B-v1.8-merged}"
+ OUT_DIR="${KAIJU_GGUF_OUT_DIR:-/home/richardecholsai5/kaiju-coder/models/kaiju-coder-7-gguf}"
+ OUTTYPE="${KAIJU_GGUF_OUTTYPE:-q8_0}"
+ OUTTYPE_UPPER="$(printf "%s" "${OUTTYPE}" | tr "[:lower:]" "[:upper:]")"
+ OUTFILE="${KAIJU_GGUF_OUTFILE:-kaiju-coder-7-${OUTTYPE_UPPER}.gguf}"
+ VLLM_IMAGE="${KAIJU_GGUF_VLLM_IMAGE:-gojira/vllm-openai-ray:nightly}"
+ LLAMA_DIR="${KAIJU_LLAMA_CPP_REMOTE:-/home/richardecholsai5/tools/llama.cpp}"
+ FORCE="${KAIJU_GGUF_FORCE:-0}"
+ STOP_VLLM="${KAIJU_GGUF_STOP_VLLM:-1}"
+ RESTART_VLLM="${KAIJU_GGUF_RESTART_VLLM:-1}"
+ STAMP="$(date -u +%Y%m%dT%H%M%SZ)"
+ RUN_DIR="${ROOT}/runs/gguf-conversion/${STAMP}"
+ LOG="${RUN_DIR}/gguf-conversion.log"
+ SUMMARY="${RUN_DIR}/summary.md"
+
+ mkdir -p "${RUN_DIR}"
+
+ printf -v MODEL_REMOTE_Q "%q" "${MODEL_REMOTE}"
+ printf -v OUT_DIR_Q "%q" "${OUT_DIR}"
+ printf -v OUTFILE_Q "%q" "${OUTFILE}"
+ printf -v OUTTYPE_Q "%q" "${OUTTYPE}"
+ printf -v VLLM_IMAGE_Q "%q" "${VLLM_IMAGE}"
+ printf -v LLAMA_DIR_Q "%q" "${LLAMA_DIR}"
+ printf -v FORCE_Q "%q" "${FORCE}"
+ printf -v STOP_VLLM_Q "%q" "${STOP_VLLM}"
+ printf -v RESTART_VLLM_Q "%q" "${RESTART_VLLM}"
+
+ set +e
+ kaiju_gojira_b_ssh "MODEL_REMOTE=${MODEL_REMOTE_Q} OUT_DIR=${OUT_DIR_Q} OUTFILE=${OUTFILE_Q} OUTTYPE=${OUTTYPE_Q} VLLM_IMAGE=${VLLM_IMAGE_Q} LLAMA_DIR=${LLAMA_DIR_Q} FORCE=${FORCE_Q} STOP_VLLM=${STOP_VLLM_Q} RESTART_VLLM=${RESTART_VLLM_Q} bash -s" <<'REMOTE' 2>&1 | tee "${LOG}"
+ set -euo pipefail
+
+ VLLM_SESSION="${KAIJU_VLLM_SESSION:-kaiju_qwen36_v18_merged_vllm}"
+ VLLM_CONTAINER="${KAIJU_VLLM_CONTAINER:-qwen36-merged-vllm-18084}"
+ OUT_PATH="${OUT_DIR}/${OUTFILE}"
+
+ restart_vllm() {
+   if [[ "${RESTART_VLLM}" != "1" ]]; then
+     return
+   fi
+   if tmux has-session -t "${VLLM_SESSION}" 2>/dev/null; then
+     return
+   fi
+   echo "Restarting vLLM fast runtime on 18084"
+   mkdir -p /home/richardecholsai5/kaiju-coder/logs /home/richardecholsai5/hf-cache
+   sudo docker rm -f "${VLLM_CONTAINER}" >/dev/null 2>&1 || true
+   LOG=/home/richardecholsai5/kaiju-coder/logs/qwen36-merged-vllm-18084.log
+   rm -f "${LOG}"
+   tmux new-session -d -s "${VLLM_SESSION}" "set -euo pipefail; sudo docker run --rm --gpus all --network host --ipc=host \
+     -v '${MODEL_REMOTE}':/models/kaiju-merged:ro \
+     -v /home/richardecholsai5/hf-cache:/root/.cache/huggingface \
+     --name '${VLLM_CONTAINER}' \
+     --entrypoint bash \
+     '${VLLM_IMAGE}' \
+     -lc 'python3 -m pip install -q pandas; python3 -m vllm.entrypoints.openai.api_server \
+       --model /models/kaiju-merged \
+       --served-model-name kaiju-coder-7 \
+       --host 0.0.0.0 \
+       --port 18084 \
+       --max-model-len 16384 \
+       --gpu-memory-utilization 0.90 \
+       --trust-remote-code \
+       --language-model-only \
+       --dtype bfloat16 \
+       --tool-call-parser qwen3_coder \
+       --reasoning-parser qwen3 \
+       --quantization bitsandbytes \
+       --load-format bitsandbytes \
+       --enable-auto-tool-choice \
+       --uvicorn-log-level info' 2>&1 | tee '${LOG}'"
+ }
+ trap restart_vllm EXIT
+
+ echo "== GGUF conversion request =="
+ echo "model: ${MODEL_REMOTE}"
+ echo "out: ${OUT_PATH}"
+ echo "outtype: ${OUTTYPE}"
+ test -d "${MODEL_REMOTE}" || { echo "missing model: ${MODEL_REMOTE}" >&2; exit 2; }
+ mkdir -p "${OUT_DIR}" "$(dirname "${LLAMA_DIR}")"
+ du -sh "${MODEL_REMOTE}"
+ df -h /home | tail -1
+ free -h | sed -n '1,3p'
+
+ if [[ "${STOP_VLLM}" == "1" ]]; then
+   echo "Stopping active vLLM runtime to free RAM"
+   tmux kill-session -t "${VLLM_SESSION}" >/dev/null 2>&1 || true
+   sudo docker rm -f "${VLLM_CONTAINER}" >/dev/null 2>&1 || true
+   sleep 3
+   free -h | sed -n '1,3p'
+ fi
+
+ if [[ -s "${OUT_PATH}" && "${FORCE}" != "1" ]]; then
+   echo "Existing GGUF found, skipping conversion: ${OUT_PATH}"
+ else
+   if [[ -d "${LLAMA_DIR}/.git" ]]; then
+     git -C "${LLAMA_DIR}" fetch --depth 1 origin master >/dev/null 2>&1 || true
+     git -C "${LLAMA_DIR}" checkout -q FETCH_HEAD >/dev/null 2>&1 || true
+   else
+     rm -rf "${LLAMA_DIR}"
+     git clone --depth 1 https://github.com/ggml-org/llama.cpp "${LLAMA_DIR}" >/dev/null
+   fi
+   rm -f "${OUT_PATH}.tmp" "${OUT_PATH}"
+   docker run --rm --entrypoint bash \
+     -v "${MODEL_REMOTE}":/models/kaiju:ro \
+     -v "${OUT_DIR}":/out \
+     -v "${LLAMA_DIR}":/llama.cpp:ro \
+     "${VLLM_IMAGE}" -lc "
+       set -euo pipefail
+       cd /llama.cpp
+       python3 convert_hf_to_gguf.py \
+         --outtype '${OUTTYPE}' \
+         --outfile '/out/${OUTFILE}.tmp' \
+         /models/kaiju
+     "
+   mv "${OUT_PATH}.tmp" "${OUT_PATH}"
+ fi
+
+ echo
+ echo "== GGUF artifact =="
+ ls -lh "${OUT_PATH}"
+ sha256sum "${OUT_PATH}" | tee "${OUT_PATH}.sha256"
+ OUT_PATH_PY="${OUT_PATH}" \
+ OUT_DIR_PY="${OUT_DIR}" \
+ OUTTYPE_PY="${OUTTYPE}" \
+ MODEL_REMOTE_PY="${MODEL_REMOTE}" \
+ LLAMA_DIR_PY="${LLAMA_DIR}" \
+ python3 - <<'PY'
+ import json
+ import os
+ from pathlib import Path
+
+ out = Path(os.environ["OUT_PATH_PY"])
+ out_dir = Path(os.environ["OUT_DIR_PY"])
+ outtype = os.environ["OUTTYPE_PY"]
+ model_remote = os.environ["MODEL_REMOTE_PY"]
+ llama_dir = os.environ["LLAMA_DIR_PY"]
+ manifest = {
+     "product": "Kaiju Coder 7",
+     "model_id": "kaiju-coder-7",
+     "format": "GGUF",
+     "outtype": outtype,
+     "artifact": str(out),
+     "sha256_file": str(out) + ".sha256",
+     "source_model": model_remote,
+     "converter": llama_dir,
+     "status": "converted_pending_runtime_smoke",
+ }
+ (out_dir / "GGUF_RELEASE_MANIFEST.json").write_text(json.dumps(manifest, indent=2) + "\n")
+ (out_dir / "README.md").write_text(
+     "# Kaiju Coder 7 GGUF Candidate\n\n"
+     "This is a persisted GGUF candidate converted from the merged Kaiju Coder 7 model.\n"
+     "It is not public release-ready until a runtime smoke test passes.\n\n"
+     f"- Artifact: `{out.name}`\n"
+     f"- Outtype: `{outtype}`\n"
+     f"- Source: `{model_remote}`\n",
+     encoding="utf-8",
+ )
+ PY
+ REMOTE
+ STATUS=${PIPESTATUS[0]}
+ set -e
+
+ {
+   echo "# Kaiju Coder 7 GGUF Conversion"
+   echo
+   echo "- Timestamp: \`${STAMP}\`"
+   echo "- Exit code: \`${STATUS}\`"
+   echo "- Model: \`${MODEL_REMOTE}\`"
+   echo "- Out dir: \`${OUT_DIR}\`"
+   echo "- Out file: \`${OUTFILE}\`"
+   echo "- Out type: \`${OUTTYPE}\`"
+   echo "- Log: \`${LOG}\`"
+   echo
+   if grep -q "GGUF artifact" "${LOG}" && grep -qE "^[0-9a-f]{64}[[:space:]]+${OUT_DIR}/${OUTFILE}$" "${LOG}"; then
+     echo "Status: converted; runtime smoke still required before public release."
+   else
+     echo "Status: conversion incomplete or failed."
+   fi
+ } > "${SUMMARY}"
+
+ echo "Summary: ${SUMMARY}"
+ exit "${STATUS}"
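The conversion script above runs each setting through `printf -v … "%q"` before splicing it into the remote command line, so a value containing spaces or shell metacharacters reaches the remote shell as one literal word. A minimal sketch of that round trip follows, assuming a local `bash -c` as a stand-in for the SSH hop; `roundtrip` and `VALUE` are hypothetical names, not part of the commit.

```shell
#!/usr/bin/env bash
# Quote a value with %q, pass it as an environment assignment on a command
# line that a second shell re-parses, and print what that shell received.
# The local `bash -c` stands in for the ssh hop (an assumption for illustration).
roundtrip() {
  local value="$1" quoted
  # %q emits the value escaped so that bash re-reads it as a single word.
  printf -v quoted "%q" "${value}"
  # The outer shell parses "VALUE=<quoted> bash -c '...'"; the inner shell
  # reads VALUE from its environment and prints it unchanged.
  bash -c "VALUE=${quoted} bash -c 'printf %s \"\${VALUE}\"'"
}
```

For example, `roundtrip 'a b; echo pwned'` hands the whole string to the inner shell as data. Without `%q`, the second shell would treat `; echo pwned` as a separate command.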