dcostenco commited on
Commit
8ea6bca
·
verified ·
1 Parent(s): ebec064

docs: updated benchmark scores — v26 system prompt + nothink template (May 14 2026)

Browse files
Files changed (1) hide show
  1. README.md +19 -32
README.md CHANGED
@@ -22,49 +22,44 @@ Designed for the **Synalux Copilot** cascade: RunPod → Ollama local → Claude
22
 
23
  ## Test results — Prism routing 100-case eval (May 14 2026)
24
 
25
- 100 prompts sampled (seed=2027) from a 200-pool of routing prompts across 13 categories (7 MCP tools + plain-text guards for AAC / translate / hallucination / etc.).
26
 
27
- | Category | v26-polish | v19 (prior) | Δ |
28
  |---|---|---|---|
29
- | **Overall** | **90.0%** | 87.0% | +3.0 |
30
  | session_load_context | 100% | 100% | = |
31
- | session_save_ledger | 92% | 100% | -8 |
32
  | session_search_memory | 100% | 100% | = |
33
- | **session_save_handoff** | **88%** | 60% | **+28** |
34
  | session_compact_ledger | 100% | 100% | = |
35
  | brave_web_search | 100% | 100% | = |
36
- | **knowledge_search** | **71%** | 43% | **+28** |
37
- | AAC plain-text | 100% | 100% | = |
38
  | translate plain-text | 100% | 100% | = |
39
- | static facts (pred) | 62% | 62% | = |
40
- | live-info refusal | 100% | 100% | = |
41
  | info / lookup | 80% | 80% | = |
42
- | edge (multi-step) | 60% | 65% | -5 |
43
- | **avg latency** | 5.8s | 6.8s | -1.0s |
44
  | **invented tools** | 0 | 0 | = |
45
 
46
- **What this benchmark measures**: routing precision against the *exact* 7-tool Prism Coder taxonomy. It is **not** a general-capability score and is not comparable to public leaderboards (BFCL, MMLU, etc.). Methodology + runner: [github.com/dcostenco/prism-coder/tree/main/tests/benchmarks/prism-routing-100](https://github.com/dcostenco/prism-coder/tree/main/tests/benchmarks/prism-routing-100).
47
 
48
- **Where this model wins**: zero invented tool names, gate-passing routing accuracy (≥90%), 1B fewer params than active params used during call (no MoE overhead). Runs locally — $0/request, private, no rate limits.
49
 
50
- **Where it underperforms vs Claude Sonnet 4 / Opus 4.7** (which score 99% / 98% on the same eval):
51
- - `pred` (static facts) — 62% vs 100%. Smaller model = thinner world knowledge.
52
- - `know_srch` — 71% vs 100%. Distinguishing "what do I know" from "what did I record" is subtle.
53
- - `edge` (multi-step routing) — 60% vs 66-83%. Sequential tool decisions are still hard.
54
 
55
- For production: pair this model with a Claude fallback for categories it misses. The [Synalux router](https://github.com/dcostenco/prism-coder) does this automatically.
56
 
57
  ## Training recipe (v26-polish)
58
 
59
  - **Base**: Qwen/Qwen3-14B (bf16)
60
  - **LoRA**: r=8, α=16, dropout 0.05, targets `q/k/v/o_proj` only
61
- - **Corpus**: 576 hand-crafted rows, 56% plain-text guards + 44% tool exemplars (heavy on `knowledge_search` and `session_save_handoff` to address v19's weak spots)
62
  - **Schedule**: 50 iters @ LR 1e-6, batch 1, cosine warmup 0.05, seq 2048
63
  - **Hardware**: Mac M4 Max (MLX-LM)
64
  - **Wall time**: ~5 min training
65
 
66
- The earlier v25-max attempt (40K rows, 300 iters, r=32) **regressed** the BFCL gate test (100% → 81%) — too much tool-density burned in tool-call habit. v26-polish is intentionally light-touch.
67
-
68
  ## Usage
69
 
70
  ### Ollama (recommended)
@@ -74,25 +69,17 @@ ollama pull dcostenco/prism-coder:14b
74
  ollama run dcostenco/prism-coder:14b "Load context for prism-mcp project"
75
  ```
76
 
77
- ### HuggingFace (transformers + PEFT)
78
-
79
- ```python
80
- from transformers import AutoModelForCausalLM, AutoTokenizer
81
- from peft import PeftModel
82
- base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-14B", torch_dtype="auto")
83
- model = PeftModel.from_pretrained(base, "dcostenco/prism-coder-14b")
84
- tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-14B")
85
- ```
86
 
87
  ### System prompt
88
 
89
- Use the [v25 routing prompt](https://github.com/dcostenco/prism-coder/blob/main/tests/benchmarks/prism-routing-100/benchmark.py#L47) verbatim. The model was trained to follow its 16 routing rules literally straying from the prompt format drops accuracy.
90
 
91
  ## Hardware requirements
92
 
93
  - **Mac**: M2 Pro+ with ≥24 GB unified memory (Q4_K_M weights = 9 GB + ~6 GB activations)
94
  - **Linux + NVIDIA**: RTX 3090 / 4090 (24 GB) or any A-series ≥ 24 GB
95
- - **Inference speed**: ~5–7 s per 200-token response on M4 Max
96
  - **Loaded VRAM**: ~10 GB
97
 
98
  ## License
 
22
 
23
  ## Test results — Prism routing 100-case eval (May 14 2026)
24
 
25
+ 100 prompts (seed=2027), v26 system prompt + nothink template.
26
 
27
+ | Category | Current | Previous (v19) | Δ |
28
  |---|---|---|---|
29
+ | **Overall** | **91%** | 87.0% | **+4.0** |
30
  | session_load_context | 100% | 100% | = |
31
+ | session_save_ledger | 100% | 100% | = |
32
  | session_search_memory | 100% | 100% | = |
33
+ | session_save_handoff | 75% | 60% | +15 |
34
  | session_compact_ledger | 100% | 100% | = |
35
  | brave_web_search | 100% | 100% | = |
36
+ | knowledge_search | 43% | 43% | = |
37
+ | AAC plain-text | **100%** | 100% | = |
38
  | translate plain-text | 100% | 100% | = |
39
+ | plain text (pred/irrel) | 88% | 62% | +26 |
40
+ | no-tool refusal | 100% | 100% | = |
41
  | info / lookup | 80% | 80% | = |
42
+ | edge (multi-step) | 80% | 65% | +15 |
43
+ | **avg latency** | **1.0s** | 6.8s | **-5.8s (6x faster)** |
44
  | **invented tools** | 0 | 0 | = |
45
 
46
+ **Key improvement (May 14 2026)**: system prompt v26 changed routing rules from `-> plain text` to `-> respond directly (no tool)`. The Q4_K_M quantized model was misreading "plain text" as a tool name, causing AAC phrase requests to hallucinate non-existent tools. Combined with the `nothink` template (pre-closes `<think>` block), latency dropped 6x.
47
 
48
+ **What this benchmark measures**: routing precision against the *exact* 7-tool Prism Coder taxonomy. It is **not** a general-capability score and is not comparable to public leaderboards (BFCL, MMLU, etc.). Methodology + runner: [github.com/dcostenco/prism-coder/tree/main/tests/benchmarks/prism-routing-100](https://github.com/dcostenco/prism-coder/tree/main/tests/benchmarks/prism-routing-100).
49
 
50
+ **Where this model wins**: zero invented tool names, gate-passing accuracy (≥90%), 1.0s avg latency. Runs locally $0/request, private, no rate limits.
 
 
 
51
 
52
+ **Remaining weak spot**: `knowledge_search` at 43% — the model confuses "what do I know" (knowledge_search) with "what did I record" (session_search_memory). Corpus rebalancing needed for the next revision.
53
 
54
  ## Training recipe (v26-polish)
55
 
56
  - **Base**: Qwen/Qwen3-14B (bf16)
57
  - **LoRA**: r=8, α=16, dropout 0.05, targets `q/k/v/o_proj` only
58
+ - **Corpus**: 576 hand-crafted rows, 56% plain-text guards + 44% tool exemplars
59
  - **Schedule**: 50 iters @ LR 1e-6, batch 1, cosine warmup 0.05, seq 2048
60
  - **Hardware**: Mac M4 Max (MLX-LM)
61
  - **Wall time**: ~5 min training
62
 
 
 
63
  ## Usage
64
 
65
  ### Ollama (recommended)
 
69
  ollama run dcostenco/prism-coder:14b "Load context for prism-mcp project"
70
  ```
71
 
72
+ **Important**: Use the `nothink` template in your Modelfile to disable Qwen3's thinking mode. Without it, the model wastes tokens on reasoning and latency jumps from 1s to 6s+.
 
 
 
 
 
 
 
 
73
 
74
  ### System prompt
75
 
76
+ Use the [v26 routing prompt](https://github.com/dcostenco/prism-coder/blob/main/tests/benchmarks/prism-routing-100/benchmark.py#L47) verbatim. Key: rules 1-7 must say `-> respond directly (no tool)`, NOT `-> plain text` (Q4_K_M quantization misreads the latter as a tool name).
77
 
78
  ## Hardware requirements
79
 
80
  - **Mac**: M2 Pro+ with ≥24 GB unified memory (Q4_K_M weights = 9 GB + ~6 GB activations)
81
  - **Linux + NVIDIA**: RTX 3090 / 4090 (24 GB) or any A-series ≥ 24 GB
82
+ - **Inference speed**: ~1 s per 200-token response on M4 Max (with nothink template)
83
  - **Loaded VRAM**: ~10 GB
84
 
85
  ## License