Upload Kaiju Coder 7 runtime quantization recipe


Files changed (6)

GGUF_CANDIDATE.md +51 -0
PUBLIC_TESTING_QUICKSTART.md +24 -7
README.md +48 -9
scripts/kaiju_opencode_fast_proxy.py +234 -0
scripts/probe-gojira-b-persisted-quantization.sh +185 -0
scripts/run-gojira-b-kaiju-gguf-convert.sh +190 -0

GGUF_CANDIDATE.md ADDED

	@@ -0,0 +1,51 @@

+# Kaiju Coder 7 GGUF Candidate
+This folder documents the persisted GGUF candidate for Kaiju Coder 7. The
+artifact exists on Gojira-B, but it should stay marked as a candidate until a
+runtime smoke test passes.
+## Artifact
+- Format: GGUF
+- Outtype: `q8_0`
+- Remote path:
+  `/home/richardecholsai5/kaiju-coder/models/kaiju-coder-7-gguf/kaiju-coder-7-Q8_0.gguf`
+- Remote size: `27G`
+- SHA256:
+  `596a2c227a429c7309db753061d88d71ee3f8a3b48f17e41ba9d81b0f55bdd4e`
+- Source model:
+  `/home/richardecholsai5/kaiju-coder/models/Kaiju-Coder-Qwen3.6-27B-v1.8-merged`
+- Conversion evidence:
+  `runs/gguf-conversion/20260603T231446Z/gguf-conversion.log`
+## Status
+Converted successfully on 2026-06-03. Runtime smoke is still required before
+public upload or a Hugging Face quantized-weights claim.
+The conversion path is promising because the current `llama.cpp`
+`convert_hf_to_gguf.py` support list includes `Qwen3_5ForConditionalGeneration`
+and the Q8_0 dry run completed before the real conversion.
+## Recreate
+```bash
+./scripts/probe-gojira-b-persisted-quantization.sh
+./scripts/run-gojira-b-kaiju-gguf-convert.sh
+```
+The conversion script stops the active vLLM runtime to free RAM, writes the GGUF
+artifact, records a checksum and manifest, then restarts the fast vLLM runtime.
+## Release Rule
+Do not publish this as public quantized weights until all of these pass:
+- runtime loads the GGUF with model id `kaiju-coder-7`
+- direct identity smoke passes
+- direct business-owner document smoke passes
+- OpenCode or router smoke passes through the intended runtime
+- README/model card states exact runtime, context, memory, and quality caveats
+Until then, the public quantized path remains `kaiju-coder-7-quantized-runtime`,
+which documents the already-smoked vLLM bitsandbytes setup.
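
A minimal sketch, not part of this commit, of how the recorded SHA256 could be re-checked before any smoke test runs against the candidate. The helper names `sha256_of_file` and `matches_recorded_checksum` are hypothetical; only the checksum value comes from GGUF_CANDIDATE.md.

```python
import hashlib
from pathlib import Path

# Checksum recorded in GGUF_CANDIDATE.md for kaiju-coder-7-Q8_0.gguf.
EXPECTED_SHA256 = "596a2c227a429c7309db753061d88d71ee3f8a3b48f17e41ba9d81b0f55bdd4e"


def sha256_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in fixed-size chunks so a multi-gigabyte GGUF never sits in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        while chunk := handle.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def matches_recorded_checksum(path: Path, expected: str = EXPECTED_SHA256) -> bool:
    """Return True when the file's SHA256 equals the recorded value."""
    return sha256_of_file(path) == expected.lower()
```

On Gojira-B this would be called with the remote path listed under Artifact. A mismatch would mean the file changed after the conversion run and the candidate should be reconverted or re-recorded before smoke testing.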

PUBLIC_TESTING_QUICKSTART.md CHANGED

@@ -19,7 +19,7 @@ Use this if you already have Kaiju Coder 7 served at an OpenAI-compatible
 ```bash
 git clone https://huggingface.co/RMDWLLC/kaiju-coder-7-opencode
 cd kaiju-coder-7-opencode
-python3 scripts/install_kaiju_opencode_profile.py --base-url http://127.0.0.1:18083/v1
 ```
 Then run OpenCode inside the project you want to edit:
@@ -65,23 +65,31 @@ the server to expose:
 ```text
 model id: kaiju-coder-7
-base URL: http://127.0.0.1:18083/v1
 context: 16384
 ```
 Then install the OpenCode helper with:
 ```bash
 git clone https://huggingface.co/RMDWLLC/kaiju-coder-7-opencode
 cd kaiju-coder-7-opencode
-python3 scripts/install_kaiju_opencode_profile.py --base-url http://127.0.0.1:18083/v1
 ```
 ### Path 3: Runtime-Quantized Local Candidate
 Use this only if you are comfortable with advanced serving setups. The current
-working quantized option is a runtime bitsandbytes recipe, not a separate
-persisted quantized weights repo.
 ```bash
 git clone https://huggingface.co/RMDWLLC/kaiju-coder-7-quantized-runtime
@@ -115,9 +123,12 @@ Expected result:
 - Public model id: `kaiju-coder-7`
 - OpenCode context: `16384`
 - Output cap for public testing: `2500`
 - Current reliable product path: model plus deterministic business-owner
-  harness plus verifier
-- Raw multi-file OpenCode generation: still too slow for broad paid API claims
 - Paid API: not public until launch preflight passes
 ## What Not To Claim Yet
@@ -134,15 +145,21 @@ Do claim:
 - Kaiju Coder 7 has a working local/OpenCode release candidate
 - the current tested OpenCode default is 16k context
 - the helper package includes a lean agent and compaction loop guard
 - the paid API scaffold has tests and a launch preflight, but is not yet public
 - the packaged public smoke verifies a fresh OpenCode one-file write before
   public claims are refreshed
 ## Current Blockers Before Public Release
 - Hugging Face repo creation still requires a write-capable token or namespace.
 - Full merged model upload has not completed; the merged folder must first have
   the metadata packet synced by `prepare_hf_merged_model_metadata.sh`.
 - Public paid API launch needs real Cloudflare D1/KV/R2 bindings, Wrangler
   secret verification, Stripe webhook staging evidence, staging traffic, latency
   evidence, and rollback proof.

 ```bash
 git clone https://huggingface.co/RMDWLLC/kaiju-coder-7-opencode
 cd kaiju-coder-7-opencode
+python3 scripts/install_kaiju_opencode_profile.py --base-url http://127.0.0.1:18181/v1
 ```
 Then run OpenCode inside the project you want to edit:
 ```text
 model id: kaiju-coder-7
+base URL: http://127.0.0.1:18084/v1
 context: 16384
 ```
+For the fastest OpenCode behavior, run the bundled fast proxy in a separate
+terminal and point OpenCode at the proxy:
+```bash
+KAIJU_OPENAI_BASE_URL=http://127.0.0.1:18084/v1 \
+python3 scripts/kaiju_opencode_fast_proxy.py --host 127.0.0.1 --port 18181
+```
 Then install the OpenCode helper with:
 ```bash
 git clone https://huggingface.co/RMDWLLC/kaiju-coder-7-opencode
 cd kaiju-coder-7-opencode
+python3 scripts/install_kaiju_opencode_profile.py --base-url http://127.0.0.1:18181/v1
 ```
 ### Path 3: Runtime-Quantized Local Candidate
 Use this only if you are comfortable with advanced serving setups. The current
+working quantized option is a runtime bitsandbytes recipe. A Q8_0 GGUF artifact
+has been converted, but it is still a candidate until runtime smoke passes.
 ```bash
 git clone https://huggingface.co/RMDWLLC/kaiju-coder-7-quantized-runtime
 - Public model id: `kaiju-coder-7`
 - OpenCode context: `16384`
 - Output cap for public testing: `2500`
+- Fast OpenCode path: vLLM bitsandbytes runtime behind the Kaiju fast proxy
 - Current reliable product path: model plus deterministic business-owner
+  harness/router plus verifier
+- Raw multi-file OpenCode generation: still too slow for broad paid claims;
+  useful for testing, but paid API claims should favor harnessed product
+  workflows until broader latency gates pass
 - Paid API: not public until launch preflight passes
 ## What Not To Claim Yet
 - Kaiju Coder 7 has a working local/OpenCode release candidate
 - the current tested OpenCode default is 16k context
 - the helper package includes a lean agent and compaction loop guard
+- the fast proxy keeps OpenCode tool calls intact while forcing bounded,
+  non-thinking generation
 - the paid API scaffold has tests and a launch preflight, but is not yet public
 - the packaged public smoke verifies a fresh OpenCode one-file write before
   public claims are refreshed
+- a GGUF Q8_0 candidate exists, but is not public quantized-weights release
+  evidence until runtime smoke passes
 ## Current Blockers Before Public Release
 - Hugging Face repo creation still requires a write-capable token or namespace.
 - Full merged model upload has not completed; the merged folder must first have
   the metadata packet synced by `prepare_hf_merged_model_metadata.sh`.
+- The GGUF Q8_0 candidate still needs a runtime smoke before public
+  quantized-weights upload.
 - Public paid API launch needs real Cloudflare D1/KV/R2 bindings, Wrangler
   secret verification, Stripe webhook staging evidence, staging traffic, latency
   evidence, and rollback proof.
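
The quickstart expects the server to expose model id `kaiju-coder-7`. A minimal sketch, not part of this commit, of checking that against the body returned by `GET /v1/models`, assuming the usual OpenAI-compatible list shape (`{"object": "list", "data": [{"id": ...}]}`). The function names are hypothetical.

```python
import json
from typing import Any

EXPECTED_MODEL_ID = "kaiju-coder-7"


def served_model_ids(models_payload: dict[str, Any]) -> list[str]:
    """Collect the string ids from an OpenAI-style /v1/models payload."""
    data = models_payload.get("data")
    if not isinstance(data, list):
        return []
    return [
        entry["id"]
        for entry in data
        if isinstance(entry, dict) and isinstance(entry.get("id"), str)
    ]


def exposes_expected_model(raw_body: str, expected: str = EXPECTED_MODEL_ID) -> bool:
    """Return True when the raw JSON body lists the expected model id."""
    payload = json.loads(raw_body)
    return isinstance(payload, dict) and expected in served_model_ids(payload)
```

The raw body can come from either endpoint in the quickstart, for example the output of `curl http://127.0.0.1:18181/v1/models` when the fast proxy is running, since the proxy forwards that route upstream.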

README.md CHANGED

@@ -14,8 +14,9 @@ weight artifact yet.
 - Required OpenCode launch flag: `--enable-auto-tool-choice`
 - Required preinstall in this image: `pandas`
 - Tested contexts: `8192`, `16384`
-- OpenCode smoke: passed
-- Persisted quantized Hugging Face weights: pending
 ## Run
@@ -30,7 +31,14 @@ KAIJU_VLLM_LOAD_FORMAT=bitsandbytes \
 ```
 The script stops the merged SGLang service, starts vLLM on port `18084`, runs
-the benchmark, then restores the recommended SGLang service on port `18083`.
 ## Evidence
@@ -40,6 +48,7 @@ Runs:
 - `runs/benchmarks/20260603T154450Z-kaiju-coder-7-serving/summary.md`
 - `runs/benchmarks/20260603T161316Z-kaiju-coder-7-serving/summary.md`
 - `runs/benchmarks/20260603T165512Z-kaiju-coder-7-serving/summary.md`
 | Runtime | Context | Prompt | OK | Seconds | Chars | Chars/s |
 | --- | ---: | --- | --- | ---: | ---: | ---: |
@@ -49,12 +58,23 @@ Runs:
 | vLLM bitsandbytes | 16384 | code_patch | True | 11.3 | 416 | 36.814 |
 | vLLM bitsandbytes | 16384 | business_doc | True | 53.44 | 1610 | 30.127 |
 | vLLM bitsandbytes | 16384 | identity | True | 19.65 | 26 | 1.323 |
 Gojira-B log evidence recorded model load at about `17.8 GiB` memory for both
 8k and 16k bitsandbytes runs. This is a meaningful local-serving improvement
 over the full bfloat16 vLLM model load, which reported about `50.22 GiB`.
-The 16k business-document task passed after the wrapper restored the default
-SGLang service.
 OpenCode one-file smoke also passed through the runtime-quantized endpoint:
@@ -71,13 +91,32 @@ Result:
 - Permission mode: `--dangerously-skip-permissions` inside the temporary smoke
   harness only
 ## Release Interpretation
 This is a working quantized local runtime candidate. It is useful for internal
 testing, serious GPU users, and the next paid API speed experiments. It is not
-yet a standalone public quantized weights repo because the artifact is still the
-full merged model loaded through bitsandbytes at runtime.
-The next release step is to produce a persisted quantized artifact, or package
-this runtime path as an advanced serving recipe while clearly saying it still
 requires access to the full Kaiju Coder 7 merged weights.

 - Required OpenCode launch flag: `--enable-auto-tool-choice`
 - Required preinstall in this image: `pandas`
 - Tested contexts: `8192`, `16384`
+- OpenCode smoke: passed through the local fast proxy
+- Persisted quantized Hugging Face weights: GGUF Q8_0 converted, runtime smoke
+  pending before public upload
 ## Run
 ```
 The script stops the merged SGLang service, starts vLLM on port `18084`, runs
+the benchmark, then restores SGLang unless `KAIJU_VLLM_KEEP_RUNNING=1` is set.
+For the current fast OpenCode setup, keep vLLM running and point the fast proxy
+at port `18084`.
+```bash
+KAIJU_OPENAI_BASE_URL=http://127.0.0.1:18084/v1 \
+python3 scripts/kaiju_opencode_fast_proxy.py --host 127.0.0.1 --port 18181
+```
 ## Evidence
 - `runs/benchmarks/20260603T154450Z-kaiju-coder-7-serving/summary.md`
 - `runs/benchmarks/20260603T161316Z-kaiju-coder-7-serving/summary.md`
 - `runs/benchmarks/20260603T165512Z-kaiju-coder-7-serving/summary.md`
+- `runs/benchmarks/20260603T223337Z-kaiju-coder-7-serving/summary.md`
 | Runtime | Context | Prompt | OK | Seconds | Chars | Chars/s |
 | --- | ---: | --- | --- | ---: | ---: | ---: |
 | vLLM bitsandbytes | 16384 | code_patch | True | 11.3 | 416 | 36.814 |
 | vLLM bitsandbytes | 16384 | business_doc | True | 53.44 | 1610 | 30.127 |
 | vLLM bitsandbytes | 16384 | identity | True | 19.65 | 26 | 1.323 |
+| vLLM bitsandbytes | 16384 | code_patch | True | 24.97 | 997 | 39.924 |
+| vLLM bitsandbytes | 16384 | business_doc | True | 34.46 | 1615 | 46.874 |
 Gojira-B log evidence recorded model load at about `17.8 GiB` memory for both
 8k and 16k bitsandbytes runs. This is a meaningful local-serving improvement
 over the full bfloat16 vLLM model load, which reported about `50.22 GiB`.
+The 16k business-document task passed, and the current speed pass keeps the
+runtime-quantized vLLM service active for OpenCode through the local proxy.
+The dedicated website harness/router speed pass produced a complete checked
+website in about `7.2s` through vLLM bitsandbytes:
+- Direct website harness: `runs/harness/website-speed-pass/avery-stone-vllm.html`
+- Router artifact: `runs/router-speed-pass/20260603T223731Z-website-build-a-premium-one-page-website-for-avery-stone-construction-a-reside/index.html`
+- Local-proxy router artifact: `runs/router-speed-pass/20260603T224328Z-website-build-a-premium-one-page-website-for-bennett-family-dental-in-charlott/index.html`
+- Router checks: complete HTML, required sections, external images,
+  responsive CSS, no lorem ipsum, manifest write
 OpenCode one-file smoke also passed through the runtime-quantized endpoint:
 - Permission mode: `--dangerously-skip-permissions` inside the temporary smoke
   harness only
+## Persisted GGUF Candidate
+A Q8_0 GGUF candidate now exists on Gojira-B:
+```text
+/home/richardecholsai5/kaiju-coder/models/kaiju-coder-7-gguf/kaiju-coder-7-Q8_0.gguf
+```
+- Size: `27G`
+- SHA256:
+  `596a2c227a429c7309db753061d88d71ee3f8a3b48f17e41ba9d81b0f55bdd4e`
+- Conversion evidence:
+  `runs/gguf-conversion/20260603T231446Z/gguf-conversion.log`
+- Local docs: `release/gguf/README.md`
+This is not public quantized-weights release evidence yet. It still needs a
+runtime smoke that proves identity, business-owner output, and the intended
+OpenCode/router path under an actual GGUF runtime.
 ## Release Interpretation
 This is a working quantized local runtime candidate. It is useful for internal
 testing, serious GPU users, and the next paid API speed experiments. It is not
+yet a standalone public quantized weights repo because the only fully smoked
+path is still the full merged model loaded through bitsandbytes at runtime.
+The next release step is to smoke-test the GGUF candidate or package this
+runtime path as an advanced serving recipe while clearly saying it still
 requires access to the full Kaiju Coder 7 merged weights.
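
A short sketch, not part of this commit, of the arithmetic behind the Evidence table. It assumes `Chars/s` is simply `Chars / Seconds`, which reproduces the first three 16k rows; recomputing from the rounded `Seconds` column can differ in the last digits for other rows. The memory helper restates the `17.8 GiB` versus `50.22 GiB` comparison as a fraction.

```python
def chars_per_second(chars: int, seconds: float) -> float:
    """Throughput as reported in the Chars/s column (assumed to be chars / seconds)."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return chars / seconds


def memory_reduction(quantized_gib: float, baseline_gib: float) -> float:
    """Fraction of model-load memory saved relative to the baseline load."""
    if baseline_gib <= 0:
        raise ValueError("baseline_gib must be positive")
    return 1.0 - quantized_gib / baseline_gib


# (prompt, chars, seconds) taken from the 16384-context rows of the table.
ROWS = [
    ("code_patch", 416, 11.3),
    ("business_doc", 1610, 53.44),
    ("identity", 26, 19.65),
]

for prompt, chars, seconds in ROWS:
    print(f"{prompt}: {chars_per_second(chars, seconds):.3f} chars/s")

# bitsandbytes load (~17.8 GiB) versus the bfloat16 load (~50.22 GiB).
print(f"memory saved: {memory_reduction(17.8, 50.22):.1%}")
```

Because the identity prompt returns only 26 characters, its throughput figure mostly reflects fixed latency and is not comparable with the longer generations.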

scripts/kaiju_opencode_fast_proxy.py ADDED

	@@ -0,0 +1,234 @@

+#!/usr/bin/env python3
+"""Tool-safe OpenAI-compatible fast proxy for Kaiju Coder 7 OpenCode.
+The normal Gojira gateway is product/API oriented and aggregates content. OpenCode
+needs raw tool-call chunks preserved, so this proxy only patches serving knobs
+and then passes upstream responses through unchanged.
+"""
+from __future__ import annotations
+import argparse
+import json
+import os
+import time
+import urllib.error
+import urllib.request
+from http import HTTPStatus
+from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
+from typing import Any
+DEFAULT_HOST = "127.0.0.1"
+DEFAULT_PORT = int(os.environ.get("KAIJU_OPENCODE_FAST_PROXY_PORT", "18181"))
+UPSTREAM_BASE_URL = os.environ.get("KAIJU_OPENAI_BASE_URL", "http://100.109.109.14:18084/v1")
+DEFAULT_MODEL = os.environ.get("KAIJU_DEFAULT_MODEL", "kaiju-coder-7")
+API_KEY = os.environ.get("KAIJU_OPENAI_API_KEY", "")
+NORMAL_MAX_TOKENS = int(os.environ.get("KAIJU_NORMAL_MAX_TOKENS", "384"))
+WORK_MAX_TOKENS = int(os.environ.get("KAIJU_WORK_MAX_TOKENS", "1536"))
+ARTIFACT_MAX_TOKENS = int(os.environ.get("KAIJU_ARTIFACT_MAX_TOKENS", "4096"))
+MAX_REQUEST_BYTES = int(os.environ.get("KAIJU_MAX_REQUEST_BYTES", "2097152"))
+def normalize_messages(messages: Any) -> list[dict[str, Any]]:
+    if not isinstance(messages, list):
+        return []
+    return [message for message in messages if isinstance(message, dict)]
+def message_text(messages: list[dict[str, Any]]) -> str:
+    parts: list[str] = []
+    for message in messages:
+        content = message.get("content", "")
+        if isinstance(content, str):
+            parts.append(content)
+        else:
+            parts.append(json.dumps(content, ensure_ascii=False))
+    return "\n".join(parts).lower()
+def classify_job(messages: list[dict[str, Any]]) -> str:
+    text = message_text(messages)
+    artifact_terms = (
+        "complete html",
+        "html file",
+        "one-file website",
+        "landing page",
+        "build a website",
+        "make a website",
+        "full file",
+    )
+    work_terms = (
+        "create ",
+        "write ",
+        "edit ",
+        "implement",
+        "debug",
+        "fix",
+        "refactor",
+        "test",
+        "repo",
+        "file",
+    )
+    if any(term in text for term in artifact_terms):
+        return "artifact"
+    if any(term in text for term in work_terms):
+        return "work"
+    return "normal"
+def target_tokens(job_class: str) -> int:
+    if job_class == "artifact":
+        return ARTIFACT_MAX_TOKENS
+    if job_class == "work":
+        return WORK_MAX_TOKENS
+    return NORMAL_MAX_TOKENS
+def patch_chat_payload(body: dict[str, Any]) -> dict[str, Any]:
+    patched = dict(body)
+    patched["model"] = DEFAULT_MODEL
+    messages = normalize_messages(patched.get("messages"))
+    job_class = classify_job(messages)
+    patched["max_tokens"] = target_tokens(job_class)
+    patched["chat_template_kwargs"] = {
+        **(patched.get("chat_template_kwargs") if isinstance(patched.get("chat_template_kwargs"), dict) else {}),
+        "enable_thinking": False,
+        "thinking": False,
+    }
+    return patched
+class Handler(BaseHTTPRequestHandler):
+    server_version = "KaijuOpenCodeFastProxy/0.1"
+    def log_message(self, fmt: str, *args: Any) -> None:
+        print(f"{time.strftime('%Y-%m-%d %H:%M:%S')} {self.address_string()} - {fmt % args}", flush=True)
+    def _json(self, status: int, payload: dict[str, Any]) -> None:
+        data = json.dumps(payload).encode("utf-8")
+        self.send_response(status)
+        self.send_header("content-type", "application/json; charset=utf-8")
+        self.send_header("cache-control", "no-store")
+        self.send_header("content-length", str(len(data)))
+        self.end_headers()
+        self.wfile.write(data)
+    def _read_json(self) -> dict[str, Any]:
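+        length = int(self.headers.get("content-length", "0"))
+        if length < 0:
+            # A negative length would make rfile.read() block until the client disconnects.
+            raise ValueError("invalid content-length")
+        if length > MAX_REQUEST_BYTES:
+            raise ValueError("request body too large")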
+        raw = self.rfile.read(length)
+        if not raw:
+            return {}
+        value = json.loads(raw.decode("utf-8"))
+        if not isinstance(value, dict):
+            raise ValueError("request body must be a JSON object")
+        return value
+    def do_GET(self) -> None:  # noqa: N802 - BaseHTTPRequestHandler API.
+        if self.path == "/health":
+            self._json(
+                HTTPStatus.OK,
+                {
+                    "ok": True,
+                    "model": DEFAULT_MODEL,
+                    "upstream": UPSTREAM_BASE_URL,
+                    "normal_max_tokens": NORMAL_MAX_TOKENS,
+                    "work_max_tokens": WORK_MAX_TOKENS,
+                    "artifact_max_tokens": ARTIFACT_MAX_TOKENS,
+                },
+            )
+            return
+        if self.path == "/v1/models":
+            self._forward_get("/models")
+            return
+        self._json(HTTPStatus.NOT_FOUND, {"error": {"message": "Not found", "type": "not_found"}})
+    def do_POST(self) -> None:  # noqa: N802 - BaseHTTPRequestHandler API.
+        if self.path != "/v1/chat/completions":
+            self._json(HTTPStatus.NOT_FOUND, {"error": {"message": "Not found", "type": "not_found"}})
+            return
+        try:
+            body = patch_chat_payload(self._read_json())
+        except Exception as error:  # noqa: BLE001 - return request parse failures.
+            self._json(HTTPStatus.BAD_REQUEST, {"error": {"message": str(error), "type": "bad_request"}})
+            return
+        self._forward_post("/chat/completions", body)
+    def _headers(self) -> dict[str, str]:
+        headers = {"content-type": "application/json"}
+        if API_KEY:
+            headers["authorization"] = f"Bearer {API_KEY}"
+        return headers
+    def _forward_get(self, suffix: str) -> None:
+        request = urllib.request.Request(
+            f"{UPSTREAM_BASE_URL.rstrip('/')}{suffix}",
+            headers=self._headers(),
+            method="GET",
+        )
+        try:
+            with urllib.request.urlopen(request, timeout=30) as upstream:
+                data = upstream.read()
+                self.send_response(upstream.status)
+                self.send_header("content-type", upstream.headers.get("content-type", "application/json"))
+                self.send_header("cache-control", "no-store")
+                self.send_header("content-length", str(len(data)))
+                self.end_headers()
+                self.wfile.write(data)
+        except urllib.error.HTTPError as error:
+            self._json(error.code, {"error": {"message": error.read().decode("utf-8", errors="replace")[:500]}})
+        except Exception as error:  # noqa: BLE001 - proxy health should surface upstream failures.
+            self._json(HTTPStatus.BAD_GATEWAY, {"error": {"message": str(error), "type": "upstream_error"}})
+    def _forward_post(self, suffix: str, body: dict[str, Any]) -> None:
+        data = json.dumps(body).encode("utf-8")
+        request = urllib.request.Request(
+            f"{UPSTREAM_BASE_URL.rstrip('/')}{suffix}",
+            data=data,
+            headers=self._headers(),
+            method="POST",
+        )
+        # Once the response has started, an error can no longer be reported as a JSON body.
+        headers_sent = False
+        try:
+            timeout = 1200 if classify_job(normalize_messages(body.get("messages"))) == "artifact" else 600
+            with urllib.request.urlopen(request, timeout=timeout) as upstream:
+                content_type = upstream.headers.get("content-type", "application/json")
+                if body.get("stream") is True:
+                    headers_sent = True
+                    self.send_response(upstream.status)
+                    self.send_header("content-type", content_type)
+                    self.send_header("cache-control", "no-store, no-transform")
+                    self.send_header("connection", "close")
+                    self.end_headers()
+                    # Relay upstream chunks unchanged so tool-call deltas stay intact.
+                    for chunk in upstream:
+                        self.wfile.write(chunk)
+                        self.wfile.flush()
+                    return
+                response = upstream.read()
+                headers_sent = True
+                self.send_response(upstream.status)
+                self.send_header("content-type", content_type)
+                self.send_header("cache-control", "no-store")
+                self.send_header("content-length", str(len(response)))
+                self.end_headers()
+                self.wfile.write(response)
+        except urllib.error.HTTPError as error:
+            detail = error.read().decode("utf-8", errors="replace")[:500]
+            self._json(error.code, {"error": {"message": detail, "type": "upstream_error"}})
+        except Exception as error:  # noqa: BLE001 - proxy should report upstream failures.
+            if headers_sent:
+                # Writing a second status line into a partly sent response would corrupt it.
+                self.log_error("response aborted after headers were sent: %s", error)
+                self.close_connection = True
+                return
+            self._json(HTTPStatus.BAD_GATEWAY, {"error": {"message": str(error), "type": "upstream_error"}})
+def main() -> int:
+    parser = argparse.ArgumentParser(description=__doc__)
+    parser.add_argument("--host", default=DEFAULT_HOST)
+    parser.add_argument("--port", type=int, default=DEFAULT_PORT)
+    args = parser.parse_args()
+    server = ThreadingHTTPServer((args.host, args.port), Handler)
+    print(f"Kaiju OpenCode fast proxy listening on http://{args.host}:{args.port}", flush=True)
+    print(f"Upstream: {UPSTREAM_BASE_URL}", flush=True)
+    server.serve_forever()
+    return 0
+if __name__ == "__main__":
+    raise SystemExit(main())
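
A condensed standalone copy of the proxy's job classification, offered as a sketch rather than a replacement for `classify_job`. It uses the same term lists and the default token budgets (`384`, `1536`, `4096`) to show how a prompt maps to a `max_tokens` cap. The name `classify_text` and the `BUDGETS` table are hypothetical.

```python
ARTIFACT_TERMS = (
    "complete html", "html file", "one-file website", "landing page",
    "build a website", "make a website", "full file",
)
WORK_TERMS = (
    "create ", "write ", "edit ", "implement", "debug",
    "fix", "refactor", "test", "repo", "file",
)
# Default budgets from the proxy's environment fallbacks.
BUDGETS = {"artifact": 4096, "work": 1536, "normal": 384}


def classify_text(text: str) -> str:
    """Plain substring matching; artifact terms are checked before work terms."""
    lowered = text.lower()
    if any(term in lowered for term in ARTIFACT_TERMS):
        return "artifact"
    if any(term in lowered for term in WORK_TERMS):
        return "work"
    return "normal"


for prompt in (
    "Build a website for my bakery",
    "Refactor the parser module",
    "What is the latest version?",
    "Hello there",
):
    job = classify_text(prompt)
    print(f"{prompt!r} -> {job} ({BUDGETS[job]} tokens)")
```

Because matching is by substring, a word such as "latest" contains "test" and "profile" contains "file", so those prompts receive the larger work budget. That errs toward a higher cap rather than truncation, which may be acceptable for a speed-oriented proxy.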

scripts/probe-gojira-b-persisted-quantization.sh ADDED

	@@ -0,0 +1,185 @@

+#!/usr/bin/env bash
+set -euo pipefail
+ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+# shellcheck source=scripts/gojira-b-ssh-lib.sh
+source "${SCRIPT_DIR}/gojira-b-ssh-lib.sh"
+kaiju_gojira_b_init
+STAMP="$(date -u +%Y%m%dT%H%M%SZ)"
+RUN_DIR="${ROOT}/runs/quantization-probes/${STAMP}"
+LOG="${RUN_DIR}/persisted-quantization-probe.log"
+SUMMARY="${RUN_DIR}/summary.md"
+MODEL_REMOTE="${KAIJU_QUANT_MODEL_REMOTE:-/home/richardecholsai5/kaiju-coder/models/Kaiju-Coder-Qwen3.6-27B-v1.8-merged}"
+VLLM_IMAGE="${KAIJU_QUANT_VLLM_IMAGE:-gojira/vllm-openai-ray:nightly}"
+LLAMA_DIR="${KAIJU_LLAMA_CPP_REMOTE:-/home/richardecholsai5/tools/llama.cpp}"
+mkdir -p "${RUN_DIR}"
+printf -v MODEL_REMOTE_Q "%q" "${MODEL_REMOTE}"
+printf -v VLLM_IMAGE_Q "%q" "${VLLM_IMAGE}"
+printf -v LLAMA_DIR_Q "%q" "${LLAMA_DIR}"
+set +e
+kaiju_gojira_b_ssh "MODEL_REMOTE=${MODEL_REMOTE_Q} VLLM_IMAGE=${VLLM_IMAGE_Q} LLAMA_DIR=${LLAMA_DIR_Q} bash -s" <<'REMOTE' 2>&1 | tee "${LOG}"
+set -euo pipefail
+echo "== Host and model =="
+test -d "${MODEL_REMOTE}" || { echo "missing model: ${MODEL_REMOTE}" >&2; exit 2; }
+du -sh "${MODEL_REMOTE}"
+df -h /home | tail -1
+free -h | sed -n '1,3p'
+nvidia-smi --query-gpu=name,memory.total,memory.used,memory.free --format=csv,noheader || true
+docker ps --format "{{.Names}} {{.Status}} {{.Image}}" | grep -Ei "qwen|kaiju|sglang|vllm" || true
+echo
+echo "== Model config =="
+MODEL_REMOTE="${MODEL_REMOTE}" python3 - <<'PY'
+import json
+import os
+from pathlib import Path
+config = json.loads((Path(os.environ["MODEL_REMOTE"]) / "config.json").read_text())
+text = config.get("text_config") or {}
+print("model_type:", config.get("model_type"))
+print("architectures:", config.get("architectures"))
+print("text_model_type:", text.get("model_type"))
+print("layers:", text.get("num_hidden_layers"))
+print("layer_types:", ",".join(sorted(set(text.get("layer_types") or []))))
+PY
+echo
+echo "== vLLM/Qwen3.5-capable Python stack =="
+docker run --rm --entrypoint bash -v "${MODEL_REMOTE}":/models/kaiju:ro "${VLLM_IMAGE}" -lc '
+set -euo pipefail
+python3 - <<PY
+from transformers import AutoConfig
+cfg = AutoConfig.from_pretrained("/models/kaiju", trust_remote_code=True)
+print("AutoConfig:", type(cfg).__name__, getattr(cfg, "model_type", None))
+PY
+python3 - <<PY
+for mod in ["torch", "transformers", "safetensors", "vllm", "huggingface_hub"]:
+    m = __import__(mod)
+    version = getattr(m, "__version__", "installed")
+    print(mod + ": " + str(version))
+PY
+'
+echo
+echo "== Persistent quantization package import probe =="
+docker run --rm --entrypoint bash -v "${MODEL_REMOTE}":/models/kaiju:ro "${VLLM_IMAGE}" -lc '
+set -euo pipefail
+for pkg in llmcompressor autoawq auto-gptq; do
+  echo "-- pip install ${pkg}"
+  if python3 -m pip install -q --no-cache-dir "${pkg}" >/tmp/kaiju-pip-${pkg}.log 2>&1; then
+    echo "${pkg}: install ok"
+  else
+    echo "${pkg}: install failed"
+    sed -n "1,120p" "/tmp/kaiju-pip-${pkg}.log"
+  fi
+done
+python3 - <<PY
+mods = [("llmcompressor", "llmcompressor"), ("autoawq", "awq"), ("auto-gptq", "auto_gptq")]
+for label, mod in mods:
+    try:
+        m = __import__(mod)
+        version = getattr(m, "__version__", "installed")
+        print(label + ": import ok: " + str(version))
+    except Exception as exc:
+        print(f"{label}: import failed: {type(exc).__name__}: {exc}")
+PY
+python3 - <<PY
+from transformers import AutoConfig
+try:
+    cfg = AutoConfig.from_pretrained("/models/kaiju", trust_remote_code=True)
+    print("post-install AutoConfig:", type(cfg).__name__, getattr(cfg, "model_type", None))
+except Exception as exc:
+    print("post-install AutoConfig failed:", type(exc).__name__, exc)
+PY
+'
+echo
+echo "== LLM Compressor no-deps stack-preservation probe =="
+docker run --rm --entrypoint bash -v "${MODEL_REMOTE}":/models/kaiju:ro "${VLLM_IMAGE}" -lc '
+set -euo pipefail
+python3 -m pip install -q --no-cache-dir --no-deps llmcompressor >/tmp/kaiju-pip-llmcompressor-nodeps.log 2>&1 || {
+  echo "llmcompressor --no-deps install failed"
+  sed -n "1,120p" /tmp/kaiju-pip-llmcompressor-nodeps.log
+}
+python3 - <<PY
+try:
+    import llmcompressor
+    print("llmcompressor no-deps import:", getattr(llmcompressor, "__version__", "installed"))
+except Exception as exc:
+    print("llmcompressor no-deps import failed:", type(exc).__name__, exc)
+from transformers import AutoConfig
+cfg = AutoConfig.from_pretrained("/models/kaiju", trust_remote_code=True)
+print("no-deps AutoConfig:", type(cfg).__name__, getattr(cfg, "model_type", None))
+PY
+'
+echo
+echo "== llama.cpp GGUF support probe =="
+mkdir -p "$(dirname "${LLAMA_DIR}")"
+if [[ -d "${LLAMA_DIR}/.git" ]]; then
+  git -C "${LLAMA_DIR}" fetch --depth 1 origin master >/dev/null 2>&1 || true
+  git -C "${LLAMA_DIR}" checkout -q FETCH_HEAD >/dev/null 2>&1 || true
+else
+  rm -rf "${LLAMA_DIR}"
+  git clone --depth 1 https://github.com/ggml-org/llama.cpp "${LLAMA_DIR}" >/dev/null
+fi
+docker run --rm --entrypoint bash \
+  -v "${MODEL_REMOTE}":/models/kaiju:ro \
+  -v "${LLAMA_DIR}":/llama.cpp:ro \
+  "${VLLM_IMAGE}" -lc '
+set -euo pipefail
+cd /llama.cpp
+python3 convert_hf_to_gguf.py --print-supported-models 2>&1 | grep -Ei "qwen3_5|qwen3.5|qwen35|qwen3" | head -40 || true
+python3 convert_hf_to_gguf.py --help | grep -E -- "--dry-run|--outtype|--vocab-only" || true
+set +e
+python3 convert_hf_to_gguf.py \
+  --dry-run \
+  --outtype q8_0 \
+  --outfile /tmp/kaiju-coder-7-q8_0-dry-run.gguf \
+  /models/kaiju 2>&1 | sed -n "1,220p"
+DRY_STATUS=${PIPESTATUS[0]}
+set -e
+echo "gguf_dry_run_exit: ${DRY_STATUS}"
+exit 0
+'
+REMOTE
+STATUS=${PIPESTATUS[0]}
+set -e
+{
+  echo "# Kaiju Coder 7 Persisted Quantization Probe"
+  echo
+  echo "- Timestamp: \`${STAMP}\`"
+  echo "- Model: \`${MODEL_REMOTE}\`"
+  echo "- vLLM image: \`${VLLM_IMAGE}\`"
+  echo "- llama.cpp path: \`${LLAMA_DIR}\`"
+  echo "- Exit code: \`${STATUS}\`"
+  echo "- Log: \`${LOG}\`"
+  echo
+  echo "## Interpretation"
+  echo
+  if grep -qi "QWEN35" "${LOG}"; then
+    echo "- GGUF conversion support probe found Qwen3.5/QWEN35 handling."
+  else
+    echo "- GGUF conversion support is not proven by this probe."
+  fi
+  if grep -q "AutoConfig: Qwen3_5Config" "${LOG}"; then
+    echo "- The pinned vLLM nightly stack recognizes Kaiju's Qwen3.5 config."
+  else
+    echo "- The pinned vLLM nightly stack did not recognize Kaiju's config."
+  fi
+  if grep -q "llmcompressor:" "${LOG}"; then
+    echo "- LLM Compressor package import was probed."
+  fi
+  echo
+  echo "Do not claim a persisted quantized artifact exists unless a later run writes"
+  echo "and verifies the quantized weights."
+} > "${SUMMARY}"
+echo "Summary: ${SUMMARY}"
+exit "${STATUS}"
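
The probe script shell-quotes each value with `printf -v ... "%q"` before embedding it in the remote command line. A minimal Python sketch of the same idea, not part of this commit, uses `shlex.quote`, which plays the same role for POSIX shells although its output format differs from `%q`. The function name `remote_env_command` is hypothetical.

```python
import shlex


def remote_env_command(env: dict[str, str], command: str = "bash -s") -> str:
    """Build 'NAME=value ... command' with every value quoted for a POSIX shell."""
    assignments = " ".join(
        f"{name}={shlex.quote(value)}" for name, value in env.items()
    )
    return f"{assignments} {command}".strip()


example = remote_env_command(
    {
        "MODEL_REMOTE": "/models/my model",  # illustrative path containing a space
        "VLLM_IMAGE": "gojira/vllm-openai-ray:nightly",
    }
)
print(example)
```

Quoting on the local side matters because the remote shell re-parses the whole string; an unquoted space or metacharacter in a path would otherwise split or alter the assignment.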

scripts/run-gojira-b-kaiju-gguf-convert.sh ADDED

	@@ -0,0 +1,190 @@

+#!/usr/bin/env bash
+set -euo pipefail
+ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+# shellcheck source=scripts/gojira-b-ssh-lib.sh
+source "${SCRIPT_DIR}/gojira-b-ssh-lib.sh"
+kaiju_gojira_b_init
+MODEL_REMOTE="${KAIJU_GGUF_MODEL_REMOTE:-/home/richardecholsai5/kaiju-coder/models/Kaiju-Coder-Qwen3.6-27B-v1.8-merged}"
+OUT_DIR="${KAIJU_GGUF_OUT_DIR:-/home/richardecholsai5/kaiju-coder/models/kaiju-coder-7-gguf}"
+OUTTYPE="${KAIJU_GGUF_OUTTYPE:-q8_0}"
+OUTTYPE_UPPER="$(printf "%s" "${OUTTYPE}" | tr "[:lower:]" "[:upper:]")"
+OUTFILE="${KAIJU_GGUF_OUTFILE:-kaiju-coder-7-${OUTTYPE_UPPER}.gguf}"
+VLLM_IMAGE="${KAIJU_GGUF_VLLM_IMAGE:-gojira/vllm-openai-ray:nightly}"
+LLAMA_DIR="${KAIJU_LLAMA_CPP_REMOTE:-/home/richardecholsai5/tools/llama.cpp}"
+FORCE="${KAIJU_GGUF_FORCE:-0}"
+STOP_VLLM="${KAIJU_GGUF_STOP_VLLM:-1}"
+RESTART_VLLM="${KAIJU_GGUF_RESTART_VLLM:-1}"
+STAMP="$(date -u +%Y%m%dT%H%M%SZ)"
+RUN_DIR="${ROOT}/runs/gguf-conversion/${STAMP}"
+LOG="${RUN_DIR}/gguf-conversion.log"
+SUMMARY="${RUN_DIR}/summary.md"
+mkdir -p "${RUN_DIR}"
+printf -v MODEL_REMOTE_Q "%q" "${MODEL_REMOTE}"
+printf -v OUT_DIR_Q "%q" "${OUT_DIR}"
+printf -v OUTFILE_Q "%q" "${OUTFILE}"
+printf -v OUTTYPE_Q "%q" "${OUTTYPE}"
+printf -v VLLM_IMAGE_Q "%q" "${VLLM_IMAGE}"
+printf -v LLAMA_DIR_Q "%q" "${LLAMA_DIR}"
+printf -v FORCE_Q "%q" "${FORCE}"
+printf -v STOP_VLLM_Q "%q" "${STOP_VLLM}"
+printf -v RESTART_VLLM_Q "%q" "${RESTART_VLLM}"
+set +e
+kaiju_gojira_b_ssh "MODEL_REMOTE=${MODEL_REMOTE_Q} OUT_DIR=${OUT_DIR_Q} OUTFILE=${OUTFILE_Q} OUTTYPE=${OUTTYPE_Q} VLLM_IMAGE=${VLLM_IMAGE_Q} LLAMA_DIR=${LLAMA_DIR_Q} FORCE=${FORCE_Q} STOP_VLLM=${STOP_VLLM_Q} RESTART_VLLM=${RESTART_VLLM_Q} bash -s" <<'REMOTE' 2>&1 | tee "${LOG}"
+set -euo pipefail
+VLLM_SESSION="${KAIJU_VLLM_SESSION:-kaiju_qwen36_v18_merged_vllm}"
+VLLM_CONTAINER="${KAIJU_VLLM_CONTAINER:-qwen36-merged-vllm-18084}"
+OUT_PATH="${OUT_DIR}/${OUTFILE}"
+restart_vllm() {
+  if [[ "${RESTART_VLLM}" != "1" ]]; then
+    return
+  fi
+  if tmux has-session -t "${VLLM_SESSION}" 2>/dev/null; then
+    return
+  fi
+  echo "Restarting vLLM fast runtime on 18084"
+  mkdir -p /home/richardecholsai5/kaiju-coder/logs /home/richardecholsai5/hf-cache
+  sudo docker rm -f "${VLLM_CONTAINER}" >/dev/null 2>&1 || true
+  local LOG=/home/richardecholsai5/kaiju-coder/logs/qwen36-merged-vllm-18084.log
+  rm -f "${LOG}"
+  tmux new-session -d -s "${VLLM_SESSION}" "set -euo pipefail; sudo docker run --rm --gpus all --network host --ipc=host \
+    -v '${MODEL_REMOTE}':/models/kaiju-merged:ro \
+    -v /home/richardecholsai5/hf-cache:/root/.cache/huggingface \
+    --name '${VLLM_CONTAINER}' \
+    --entrypoint bash \
+    '${VLLM_IMAGE}' \
+    -lc 'python3 -m pip install -q pandas; python3 -m vllm.entrypoints.openai.api_server \
+      --model /models/kaiju-merged \
+      --served-model-name kaiju-coder-7 \
+      --host 0.0.0.0 \
+      --port 18084 \
+      --max-model-len 16384 \
+      --gpu-memory-utilization 0.90 \
+      --trust-remote-code \
+      --language-model-only \
+      --dtype bfloat16 \
+      --tool-call-parser qwen3_coder \
+      --reasoning-parser qwen3 \
+      --quantization bitsandbytes \
+      --load-format bitsandbytes \
+      --enable-auto-tool-choice \
+      --uvicorn-log-level info' 2>&1 | tee '${LOG}'"
+}
+# Bring vLLM back on every exit path, including a failed conversion.
+trap restart_vllm EXIT
+echo "== GGUF conversion request =="
+echo "model: ${MODEL_REMOTE}"
+echo "out: ${OUT_PATH}"
+echo "outtype: ${OUTTYPE}"
+test -d "${MODEL_REMOTE}" || { echo "missing model: ${MODEL_REMOTE}" >&2; exit 2; }
+mkdir -p "${OUT_DIR}" "$(dirname "${LLAMA_DIR}")"
+du -sh "${MODEL_REMOTE}"
+df -h /home | tail -1
+free -h | sed -n '1,3p'
+if [[ "${STOP_VLLM}" == "1" ]]; then
+  echo "Stopping active vLLM runtime to free RAM"
+  tmux kill-session -t "${VLLM_SESSION}" >/dev/null 2>&1 || true
+  sudo docker rm -f "${VLLM_CONTAINER}" >/dev/null 2>&1 || true
+  sleep 3
+  free -h | sed -n '1,3p'
+fi
+if [[ -s "${OUT_PATH}" && "${FORCE}" != "1" ]]; then
+  echo "Existing GGUF found, skipping conversion: ${OUT_PATH}"
+else
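+  # Track upstream master at depth 1. Fetch and checkout failures are ignored, so an
+  # offline host falls back to whatever llama.cpp checkout is already present.
+  if [[ -d "${LLAMA_DIR}/.git" ]]; then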
+    git -C "${LLAMA_DIR}" fetch --depth 1 origin master >/dev/null 2>&1 || true
+    git -C "${LLAMA_DIR}" checkout -q FETCH_HEAD >/dev/null 2>&1 || true
+  else
+    rm -rf "${LLAMA_DIR}"
+    git clone --depth 1 https://github.com/ggml-org/llama.cpp "${LLAMA_DIR}" >/dev/null
+  fi
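+  # Clear only a stale temp file. An existing artifact stays in place until the
+  # new one is complete and the mv below replaces it.
+  rm -f "${OUT_PATH}.tmp"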
+  docker run --rm --entrypoint bash \
+    -v "${MODEL_REMOTE}":/models/kaiju:ro \
+    -v "${OUT_DIR}":/out \
+    -v "${LLAMA_DIR}":/llama.cpp:ro \
+    "${VLLM_IMAGE}" -lc "
+set -euo pipefail
+cd /llama.cpp
+python3 convert_hf_to_gguf.py \
+  --outtype '${OUTTYPE}' \
+  --outfile '/out/${OUTFILE}.tmp' \
+  /models/kaiju
+"
+  mv "${OUT_PATH}.tmp" "${OUT_PATH}"
+fi
+echo
+echo "== GGUF artifact =="
+ls -lh "${OUT_PATH}"
+sha256sum "${OUT_PATH}" | tee "${OUT_PATH}.sha256"
+OUT_PATH_PY="${OUT_PATH}" \
+OUT_DIR_PY="${OUT_DIR}" \
+OUTTYPE_PY="${OUTTYPE}" \
+MODEL_REMOTE_PY="${MODEL_REMOTE}" \
+LLAMA_DIR_PY="${LLAMA_DIR}" \
+python3 - <<'PY'
+import json
+import os
+from pathlib import Path
+out = Path(os.environ["OUT_PATH_PY"])
+out_dir = Path(os.environ["OUT_DIR_PY"])
+outtype = os.environ["OUTTYPE_PY"]
+model_remote = os.environ["MODEL_REMOTE_PY"]
+llama_dir = os.environ["LLAMA_DIR_PY"]
+manifest = {
+    "product": "Kaiju Coder 7",
+    "model_id": "kaiju-coder-7",
+    "format": "GGUF",
+    "outtype": outtype,
+    "artifact": str(out),
+    "sha256_file": str(out) + ".sha256",
+    "source_model": model_remote,
+    "converter": llama_dir,
+    "status": "converted_pending_runtime_smoke",
+}
+(out_dir / "GGUF_RELEASE_MANIFEST.json").write_text(
+    json.dumps(manifest, indent=2) + "\n",
+    encoding="utf-8",
+)
+(out_dir / "README.md").write_text(
+    "# Kaiju Coder 7 GGUF Candidate\n\n"
+    "This is a persisted GGUF candidate converted from the merged Kaiju Coder 7 model.\n"
+    "It is not public release-ready until a runtime smoke test passes.\n\n"
+    f"- Artifact: `{out.name}`\n"
+    f"- Outtype: `{outtype}`\n"
+    f"- Source: `{model_remote}`\n",
+    encoding="utf-8",
+)
+PY
+REMOTE
+STATUS=${PIPESTATUS[0]}
+set -e
+{
+  echo "# Kaiju Coder 7 GGUF Conversion"
+  echo
+  echo "- Timestamp: \`${STAMP}\`"
+  echo "- Exit code: \`${STATUS}\`"
+  echo "- Model: \`${MODEL_REMOTE}\`"
+  echo "- Out dir: \`${OUT_DIR}\`"
+  echo "- Out file: \`${OUTFILE}\`"
+  echo "- Out type: \`${OUTTYPE}\`"
+  echo "- Log: \`${LOG}\`"
+  echo
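+  # Report success only when the remote run exited cleanly and the log contains the checksum line.
+  if [[ "${STATUS}" -eq 0 ]] && grep -q "GGUF artifact" "${LOG}" && grep -qE "^[0-9a-f]{64}[[:space:]]+${OUT_DIR}/${OUTFILE}$" "${LOG}"; then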
+    echo "Status: converted; runtime smoke still required before public release."
+  else
+    echo "Status: conversion incomplete or failed."
+  fi
+} > "${SUMMARY}"
+echo "Summary: ${SUMMARY}"
+exit "${STATUS}"
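
The script above records the artifact hash with `sha256sum "${OUT_PATH}" | tee "${OUT_PATH}.sha256"`, so the checksum file stores the absolute path that was hashed. A minimal sketch of how a reviewer might re-verify the artifact follows. It is not part of the commit: the `verify_artifact` helper name is hypothetical, and a throwaway temp file stands in for the real GGUF.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not in the commit): check an artifact against the
# "<artifact>.sha256" file written next to it by `sha256sum <path>`.
verify_artifact() {
  local artifact="$1"
  # sha256sum -c re-hashes the path recorded inside the checksum file,
  # so the artifact must still live at the path that was originally hashed.
  sha256sum -c --quiet "${artifact}.sha256"
}

# Demo with a stand-in file; on the host, pass the real GGUF path instead.
demo_dir="$(mktemp -d)"
trap 'rm -rf "${demo_dir}"' EXIT
printf 'stand-in for GGUF bytes\n' > "${demo_dir}/demo.gguf"
sha256sum "${demo_dir}/demo.gguf" > "${demo_dir}/demo.gguf.sha256"

if verify_artifact "${demo_dir}/demo.gguf"; then
  echo "checksum ok"
fi
```

Because the recorded path is absolute, this check only works where the artifact sits at the same path, for example on Gojira-B itself. After copying the file elsewhere, comparing the 64-character hash by hand is the safer route.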