prism-coder-32b / README.md
dcostenco's picture
Upload README.md with huggingface_hub
58778b6 verified
metadata
language: en
license: apache-2.0
base_model: Qwen/Qwen3-30B-A3B
tags:
  - tool-calling
  - routing
  - aac
  - qwen3
  - moe
  - gguf

prism-coder:32b β€” Tool Routing Model (Desktop Quality Tier)

Fine-tuned Qwen3-30B-A3B (MoE) for 6-tool routing in the Prism AAC system. Quality escalation tier in the desktop cascade: 14B β†’ 32B β†’ cloud Claude.

v5 (May 2026): Switched base from dense Qwen3-32B to Qwen3-30B-A3B (MoE). Same accuracy, 9 GB smaller, ~4Γ— faster inference (only ~3B params active per token).

BFCL Routing Benchmark β€” v7 (Current)

Mean: 100.0% PERFECT (3-seed average, seeds 2027/2028/2029, 102 cases each)

Category Count Description Accuracy
aac 12 AAC phrase requests β†’ plain text 100%
cmpct 6 Ledger compaction 100%
edge 6 Multi-step / compound requests 100%
hand 8 Agent handoff / relay 100%
info 5 General facts β†’ plain text 100%
irrel 10 Irrelevant / live queries β†’ plain text 100%
know 7 Knowledge base search 100%
load 9 Session context loading 100%
pred 8 Factual / knowledge queries β†’ plain text 100%
save 13 Session ledger save 100%
smem 12 Session memory search 100%
tran 6 Translation requests β†’ plain text 100%

All 12 categories at 100%. No remaining failures.

Eval: MLX inference + thinking, temperature=0, 3-seed mean. Gate: β‰₯90% = deploy.

Full Cascade Benchmark (May 2026)

Individual BFCL scores (MLX, 3 seeds):

Model BFCL Size Tier
prism-coder:8b v36 100.0% PERFECT 4.7 GB Desktop / Mobile tier
prism-coder:14b v36 100.0% PERFECT 8.4 GB Desktop primary tier
prism-coder:32b v7 100.0% PERFECT 16 GB Desktop quality tier

Cascade eval: 14b β†’ 32b β†’ Claude Opus (102 cases Γ— 3 seeds)

Metric Result
Cascade accuracy 100.0% (mean, 3 seeds)
Opus-solo etalon 98.3%
Ξ” vs Opus +1.7%
Traffic served by 14b 99% (101/102 cases avg)
Traffic escalated to 32b 1% (1/102 avg) β€” catches save live state β†’ handoff edge case
Traffic reaching Opus API 0%

Fine-tuned cascade outperforms Claude Opus on edge (+16.7%) and know (+14.3%).

Version History

Version Base BFCL Notes
v7 (current) Qwen3-30B-A3B MoE 100.0% PERFECT Fixed: "what do I know + search memory" compound β†’ knowledge_search
v6 Qwen3-30B-A3B MoE 99.0% Fixed MoE merge (BF16 safetensors + correct MLX→HF key mapping)
v5 Qwen3-30B-A3B MoE 97.1% 18Γ— density fix; 9GB smaller, 4Γ— faster vs dense
v4 Qwen3-30B-A3B MoE 92.2% rank=32 experiment β€” regressed vs v3
v3 Qwen3-30B-A3B MoE 92.5% 20Γ— reps + LR=1e-5 β€” hit rank bottleneck
v2 Qwen3-30B-A3B MoE 92.5% v34 corpus + 1400 iters
v33 (dense) Qwen3-32B dense 99.0% Prior generation β€” larger/slower

Tools

The model routes between exactly 6 tools:

  1. session_load_context β€” load/fetch/resume project context
  2. session_save_ledger β€” note/log/remember/record progress
  3. session_save_handoff β€” handoff/relay to next agent/session
  4. session_compact_ledger β€” compact/archive/shrink ledger
  5. session_search_memory β€” recall past sessions/conversations
  6. knowledge_search β€” search stored notes/knowledge base

Files

File Size Use
qwen3-30b-a3b-v7-iq4nl.gguf 16 GB Current β€” recommended
qwen3-30b-a3b-v6-iq4nl.gguf 17 GB Previous (99.0%)
qwen3-30b-a3b-v5-iq4nl.gguf 17 GB Previous (97.1%)
qwen3-32b-v33-q6k.gguf 25 GB Dense predecessor (99.0%, legacy)

Usage (Ollama)

ollama run dcostenco/prism-coder:32b

Training

  • Base: Qwen/Qwen3-30B-A3B (HF BF16, ~57 GB)
  • Adapters: v6 LoRA (rank=8, scale=10, 8 layers, LR=1e-5)
  • Merge: Direct safetensors merge on HF BF16 base; delta = (scale/rank) Γ— B^T A^T for attn/gate; delta[i] = (scale/rank) Γ— B[i] A[i] for MoE experts (128 experts stacked)
  • Key fix: v5 merge used wrong base (MLX 4-bit, can't apply float LoRA delta) and uppercase regex lora_[AB] vs actual lowercase lora_a/lora_b adapter keys
  • Hardware: Apple Silicon (M-series, 64 GB RAM)