kaiju-coder-7 / EVAL_SCOREBOARD.md
restokes92's picture
Add files using upload-large-folder tool
4ca1eb4 verified

Kaiju Coder 7 Business-Owner Eval Scoreboard

This scoreboard tracks the current release-candidate evidence. Do not publish weights or paid API claims until every required row has a dated result and reviewer.

Completed Local Gates

Gate Command Result Date
Source inventory refresh python3 scripts/build_source_inventory.py Passed 2026-06-03
Candidate validation python3 scripts/validate_training_data.py --min-examples 350 1,689 examples / passed 2026-06-03
v1.7 category targets python3 scripts/check_dataset_targets.py --targets datasets/v1.7-targets.json Passed 2026-06-03
Business-owner SFT build python3 scripts/build_v17_business_owner_sft_dataset.py 1,881 rows / 192 repeats 2026-06-03
Router hard harness python3 evals/run_router_harness_eval.py --tasks evals/tasks/router-hard-harness.jsonl 23/23 2026-06-03
Router static checks python3 evals/run_router_static_checks.py runs/evals/20260603T103915Z-kaiju_router_harness/results.jsonl 23/23 2026-06-03
Business-suite prompts Included in router hard harness 2/2 2026-06-03
Deterministic API harness smoke python3 scripts/run_kaiju_api_harness_smoke.py Passed: website + business-suite API artifacts 2026-06-03
Direct business-suite artifact python3 scripts/run_kaiju_router.py --prompt "...Kiyomi 7.7.7 AI company operating pack..." --print-manifest 19 files / passed 2026-06-03
Full local RC smoke gate python3 scripts/run_kaiju_business_owner_rc_smoke.py Passed; latest router/static run 20260603T103915Z-kaiju_router_harness 2026-06-03
v1.7 LoRA train ./scripts/run-gojira-b-qwen36-lora-train.sh Finished; runtime 1663.7101s, train loss 1.7260706673065822, adapter present 2026-06-03
v1.7 SGLang serve ./scripts/start-qwen36-lora-sglang.sh with KAIJU_QWEN36_LORA_CONTEXT=4096, KAIJU_QWEN36_LORA_MEM_FRACTION=0.90 /v1/models returned kaiju_v17_business_owner 2026-06-03
Raw served adapter smoke: website python3 evals/run_openai_compat_smoke.py --base-url http://100.109.109.14:18083/v1 --model kaiju_v17_business_owner --tasks evals/tasks/smoke.jsonl --max-tasks 1 --disable-thinking Passed; 20260603T031300Z-kaiju_v17_business_owner, 2,726 chars in 174.49s 2026-06-03
Raw served adapter smoke: proposal python3 evals/run_openai_compat_smoke.py --base-url http://100.109.109.14:18083/v1 --model kaiju_v17_business_owner --tasks /tmp/kaiju-proposal-smoke.jsonl --system-prompt-file prompts/kaiju-coder-api-system.md --disable-thinking Passed; 20260603T032107Z-kaiju_v17_business_owner, 4,306 chars in 232.27s 2026-06-03
Raw served adapter quality: website python3 evals/score_quality_gate.py runs/evals/20260603T033825Z-kaiju_v17_business_owner/results.jsonl Failed paid-ready: 3.71/4.0, missing complete HTML after 12,706 chars / 793.96s 2026-06-03
Raw served adapter quality: proposal python3 evals/score_quality_gate.py runs/evals/20260603T032107Z-kaiju_v17_business_owner/results.jsonl Passed paid-ready: 4.0/4.0 2026-06-03
Raw served adapter quality: Jah credits python3 evals/score_quality_gate.py runs/evals/20260603T035612Z-kaiju_v17_business_owner/results.jsonl Passed paid-ready: 4.0/4.0 2026-06-03
Base Qwen comparison: proposal python3 evals/compare_quality_runs.py runs/quality-gates/20260603T035200Z-qwen36-27b/scores.jsonl runs/quality-gates/20260603T032107Z-kaiju_v17_business_owner/scores.jsonl Tie: base 4.0/4.0, Kaiju v1.7 4.0/4.0 2026-06-03
Base Qwen comparison: Jah credits python3 evals/compare_quality_runs.py runs/quality-gates/20260603T040140Z-qwen36-27b/scores.jsonl runs/quality-gates/20260603T035612Z-kaiju_v17_business_owner/scores.jsonl Tie: base 4.0/4.0, Kaiju v1.7 4.0/4.0; deterministic outputs were byte-identical 2026-06-03
Raw adapter differentiation probe Identity and Jah probes comparing qwen36-27b to kaiju_v17_business_owner Current v1.7 SGLang outputs can be byte-identical to base on deterministic prompts; 24-step v1.7 is too weak as a raw-weight differentiator 2026-06-03
v1.8 stronger LoRA train KAIJU_LORA_CONFIG=training/configs/qwen36-27b-lora-v1.8-business-owner.example.json KAIJU_SFT_DATASET=datasets/build/kaiju-sft-v1.7-business-owner-oversampled.jsonl KAIJU_LORA_RUN_DIR=runs/qwen36-27b-lora-v1.8-business-owner KAIJU_MIN_TRAIN_EXAMPLES=350 KAIJU_SKIP_DATASET_BUILD=1 KAIJU_TRAIN_BACKGROUND=1 ./scripts/run-gojira-b-qwen36-lora-train.sh Finished; runtime 11666.7564s, train loss 0.9281658741335074, adapter present 2026-06-03
v1.8 SGLang dynamic LoRA serve ./scripts/start-qwen36-lora-sglang.sh with v1.8 adapter, KAIJU_QWEN36_LORA_CONTEXT=8192, KAIJU_QWEN36_LORA_MEM_FRACTION=0.90 Historical only: /v1/models listed kaiju_v18_business_owner, but adapter-name-only output can be base-equivalent; not release evidence 2026-06-03
Corrected v1.8 dynamic LoRA selector Model selector qwen36-27b:kaiju_v18_business_owner under SGLang with fused target modules Fails: LoRA buffer shape torch.Size([8192, 16]) does not match weight shape torch.Size([14336, 16]); dynamic LoRA is not the release path 2026-06-03
v1.8 LoRA merge KAIJU_LORA_ADAPTER=/workspace/kaiju-coder/runs/qwen36-27b-lora-v1.8-business-owner/adapter ./scripts/run-gojira-b-qwen36-lora-merge.sh Passed; merged full model at /home/richardecholsai5/kaiju-coder/models/Kaiju-Coder-Qwen3.6-27B-v1.8-merged, 51G, 14 shards 2026-06-03
Kaiju Coder 7 merged SGLang serve ./scripts/start-qwen36-merged-sglang.sh with KAIJU_QWEN36_MERGED_CONTEXT=32768, KAIJU_QWEN36_MERGED_MEM_FRACTION=0.90 /v1/models returned kaiju-coder-7, max model len 32768; 12k/16k/24k/32k evidence is recorded in release/SERVING_BENCHMARKS.md 2026-06-03
Kaiju Coder 7 restored 32k direct API smoke python3 scripts/benchmark_kaiju_serving.py --contexts 32768 --prompts identity business_doc --max-tokens 768 --timeout 420 Passed; /v1/models returned kaiju-coder-7, max model len 32768; identity 2.92s; business proposal 94.28s, 1,737 chars 2026-06-03
Kaiju Coder 7 restored 32k OpenCode one-file smoke opencode run -m kaiju/kaiju-coder-7 --agent kaiju-coder-7 --dir /tmp/kaiju-opencode-32k-final-smoke 'Create hello.txt with exactly: Kaiju Coder 7 final 32k ok' Passed; wrote hello.txt with exactly Kaiju Coder 7 final 32k ok 2026-06-03
Kaiju Coder 7 current restored 16k direct API smoke python3 scripts/benchmark_kaiju_serving.py --contexts 16384 --prompts identity --max-tokens 64 --timeout 120 Passed; latest run runs/benchmarks/20260603T174545Z-kaiju-coder-7-serving/summary.md, identity 2.3s, 26 chars 2026-06-03
Kaiju Coder 7 current restored 16k OpenCode one-file smoke mkdir -p /tmp/kaiju-opencode-fresh-public-smoke && opencode run -m kaiju/kaiju-coder-7 --agent kaiju-coder-7 --dir /tmp/kaiju-opencode-fresh-public-smoke --dangerously-skip-permissions 'Create hello.txt with exactly: Kaiju Coder 7 fresh public smoke ok' Passed; /v1/models returned kaiju-coder-7, max model len 16384; wrote hello.txt with exactly Kaiju Coder 7 fresh public smoke ok 2026-06-03
Kaiju Coder 7 packaged public OpenCode smoke python3 scripts/run_kaiju_public_opencode_smoke.py --base-url http://127.0.0.1:18181/v1 --timeout 900 Passed; latest run runs/public-opencode-smoke/20260603T235002Z/summary.md, 4/4 checks passed; installer dry-run, OpenCode 1.15.13, live 16k model, and exact file written only in the requested temp workspace through the fast proxy 2026-06-03
Kaiju Coder 7 loop-guarded OpenCode install python3 scripts/install_kaiju_opencode_profile.py; opencode run -m kaiju/kaiju-coder-7 --agent kaiju-coder-7 --dir /tmp/kaiju-opencode-loopguard-smoke --dangerously-skip-permissions 'Create loopguard.txt with exactly: Kaiju Coder 7 loop guard installed' Passed; config includes /Users/richardecholsai7/.config/opencode/kaiju-no-autocontinue.mjs; wrote loopguard.txt with exact requested content and exited cleanly 2026-06-03
Current harnessed OpenCode customer-readiness pack python3 scripts/run_kaiju_opencode_customer_pack.py --mode harnessed Passed; latest run runs/opencode-customer-readiness/20260603T185835Z/summary.md, 4/4 tasks passed and 28/28 required files written, including release provenance and safety review 2026-06-03
Paid API Worker scaffold cd gateway/cloudflare-worker && npm run check && npm run preflight Passed 16/16 Worker tests and 17 scaffold preflight checks; covers bearer auth, inactive keys, insufficient credits, debit/refund, rate limit before debit, model kaiju-coder-7 enforcement, stream/thinking/token caps, secret-content rejection without logging, signed Stripe Checkout top-up idempotency, origin-only R2 artifact upload, account-scoped artifact download, guarded Cloudflare resource prep, Wrangler dry-run deploy, sanitized paid-launch evidence template packaging, reviewed Cloudflare bindings template, binding applier guardrails, and sanitized evidence collection helper 2026-06-03
Kaiju Coder 7 merged vLLM serve KAIJU_VLLM_CONTEXT=16384 ./scripts/run-gojira-b-vllm-serving-benchmark.sh Passed at 16k with Gojira nightly vLLM after pandas preinstall and --language-model-only; identity 19.99s, code patch 28.8s; not faster enough to replace SGLang 2026-06-03
Kaiju Coder 7 runtime-quantized vLLM serve KAIJU_VLLM_CONTEXT=16384 KAIJU_VLLM_QUANTIZATION=bitsandbytes KAIJU_VLLM_LOAD_FORMAT=bitsandbytes ./scripts/run-gojira-b-vllm-serving-benchmark.sh Passed at 8k and 16k; 16k identity 19.51s, code patch 11.3s; vLLM log reported about 17.8 GiB model memory 2026-06-03
Kaiju Coder 7 runtime-quantized business-doc smoke KAIJU_VLLM_CONTEXT=16384 KAIJU_VLLM_QUANTIZATION=bitsandbytes KAIJU_VLLM_LOAD_FORMAT=bitsandbytes KAIJU_VLLM_PROMPTS=business_doc KAIJU_VLLM_MAX_TOKENS=768 KAIJU_VLLM_PROMPT_TIMEOUT=420 ./scripts/run-gojira-b-vllm-serving-benchmark.sh Passed; business proposal 53.44s, 1,610 chars, 30.127 chars/s; wrapper restored SGLang after completion 2026-06-03
Kaiju Coder 7 runtime-quantized OpenCode one-file smoke bash scripts/run_kaiju_quantized_opencode_smoke.sh Passed at 16k after vLLM --enable-auto-tool-choice; OpenCode wrote hello.txt with exactly Kaiju Coder 7 quantized runtime ok 2026-06-03
Kaiju Coder 7 fast proxy plus website harness speed pass python3 scripts/run_kaiju_router.py --kind website --openai-base-url http://127.0.0.1:18181/v1 --model kaiju-coder-7 ... and OpenCode through http://127.0.0.1:18181/v1 Passed; local fast proxy forwards to vLLM bitsandbytes on 18084; direct website harness wrote 9,257 chars in 7.31s; router website passed all checks in 7.20s; local-proxy router website passed in 4.67s; public OpenCode smoke through the proxy passed in about 40s end to end 2026-06-03
Persisted quantization support probe ./scripts/probe-gojira-b-persisted-quantization.sh Passed as evidence probe; AWQ/GPTQ normal installs are not clean against the Qwen3.5-capable stack tonight, llmcompressor --no-deps preserves config support but needs a pinned dependency env, and llama.cpp supports Qwen3_5ForConditionalGeneration with Q8_0 dry-run passing 2026-06-03
GGUF Q8_0 persisted conversion ./scripts/run-gojira-b-kaiju-gguf-convert.sh Converted candidate at /home/richardecholsai5/kaiju-coder/models/kaiju-coder-7-gguf/kaiju-coder-7-Q8_0.gguf, 27G, SHA256 596a2c227a429c7309db753061d88d71ee3f8a3b48f17e41ba9d81b0f55bdd4e; runtime smoke still required before public quantized-weights release 2026-06-03
Public business-owner demo pack python3 scripts/run_kaiju_public_demo_pack.py --openai-base-url http://127.0.0.1:18181/v1 --model kaiju-coder-7 --planner-timeout 90 Passed 4/4 through the fast proxy in 64.529s: website 4.73s, owner AI company pack 29.85s with 19 files, Stripe safety plan 9.99s, CSV parser artifact 19.97s; run runs/public-demo-pack/20260603T235009Z/summary.md 2026-06-03
Hugging Face CLI install/auth check hf version && hf auth whoami && hf auth list hf installed locally at version 1.17.0; auth user restokes92; token name gojirakiyomikode 2026-06-03
Hugging Face public helper repos python3 scripts/check_hf_uploaded_release.py --namespace RMDWLLC --apply --require-public Passed 17/17; public downloads verified for adapter, OpenCode helper, and runtime helper, including installer dry-run, demo runner, and GGUF candidate note 2026-06-03
Hugging Face merged-model upload KAIJU_HF_NAMESPACE=RMDWLLC KAIJU_HF_UPLOAD_APPLY=1 bash scripts/upload_hf_merged_model_from_gojira_b.sh Uploaded public repo RMDWLLC/kaiju-coder-7; hf upload-large-folder processed 53.8G/53.8G, 39 files, 14 safetensors shards; metadata reports private: false 2026-06-03
v1.8 merged endpoint probe Direct OpenAI-compatible chat request with top-level chat_template_kwargs disabling thinking Passed; 1,155 visible chars in 60.17s, normal content response 2026-06-03
Kaiju Coder 7 merged focused proposal eval python3 evals/run_openai_compat_smoke.py --model kaiju-coder-7 --tasks evals/tasks/business-owner-v18-comparison.jsonl --max-tasks 1 --max-tokens 1800 ... then python3 evals/score_quality_gate.py <results.jsonl> Passed: 1/1 paid-ready, 4.0/4.0, 4,014 chars, 212.72s 2026-06-03
Kaiju Coder 7 merged focused Jah credits eval python3 evals/run_openai_compat_smoke.py --model kaiju-coder-7 --tasks evals/tasks/business-owner-v18-comparison.jsonl ... then python3 evals/score_quality_gate.py <results.jsonl> Passed: 4.0/4.0, 9,718 chars, 566.36s 2026-06-03
Full local RC smoke gate python3 scripts/run_kaiju_business_owner_rc_smoke.py Passed; latest router/static run 20260603T103915Z-kaiju_router_harness 2026-06-03

Required Before Release

Gate Required result Status
v1.7 LoRA train Finished metrics and adapter under runs/qwen36-27b-lora-v1.7-business-owner Passed
v1.8 stronger LoRA train Finished metrics and adapter under runs/qwen36-27b-lora-v1.8-business-owner Passed
v1.8 merged focused smoke python3 evals/run_openai_compat_smoke.py --tasks evals/tasks/business-owner-v18-comparison.jsonl --model kaiju-coder-7 ... then python3 evals/score_quality_gate.py Passed for proposal rerun and Jah credits backend; broader sweep pending
Direct commercial eval No critical failures, scored summary attached Passed for targeted high-value tasks when using the product harness plus 8k raw website mode; broader task sweep still pending
Base Qwen comparison Kaiju beats base Qwen on RMDW/Kiyomi practical tasks Not yet: raw deterministic identity still matches base; compare broader tasks before model-level improvement claims
GLM comparison Kaiju is near or above GLM on highest-value business-owner tasks Pending; required only before superiority claims
Local inference smoke OpenAI-compatible endpoint returns usable business-owner artifact Passed for v1.8 merged SGLang endpoint and product harness
Human review Richard reviews artifacts for usefulness, privacy, and sellability Approved for public HF visibility and paid API launch preflight on 2026-06-03
Release package Model card, provenance, license notes, eval summary, limitations, Hugging Face draft, completion audit, and run instructions complete Staged, bundled, uploaded to public HF repos, and verified with public downloads

Decision Rule

The v1.8 adapter is a completed local checkpoint and the merged full model is the current served raw-model path. The business-owner product should be published honestly as Kaiju Coder 7 plus deterministic harness plus verifier, with vLLM bitsandbytes plus the fast proxy as the current speed path. Do not claim raw-weight superiority until broader base/GLM and raw website comparisons pass.