Commit 7d5ab03 (verified), committed by dcostenco · 1 parent: 117256f

Add README.md

Files changed (1)
  1. README.md +151 -0
README.md ADDED
@@ -0,0 +1,151 @@
+ ---
+ language: en
+ license: apache-2.0
+ tags:
+ - tool-routing
+ - function-calling
+ - prism-memory
+ - prism-aac
+ - qwen3
+ - gguf
+ base_model: Qwen/Qwen3-4B
+ ---
+
+ # prism-coder:4b — Full Prism Memory Router (Mid-Tier)
+
+ Fine-tuned Qwen3-4B for 17-tool Prism Memory routing in the [Prism AAC](https://github.com/dcostenco/prism-aac) system.
+ Primary deployment: **Mac / PC / high-memory mobile** via Ollama or llama.cpp GGUF — for devices with ≥8 GB free RAM.
+
+ ## BFCL Routing Benchmark — v43 (Current)
+
+ **100.0%** (64/64 strict, 8 categories)
+
+ | Category | Count | Description | Accuracy |
+ |----------|------:|-------------|:--------:|
+ | simple | 10 | Direct single-tool invocations | 100% |
+ | relevance_detection | 10 | No-tool abstention for off-topic prompts | 100% |
+ | hallucination | 10 | Reject fabricated / nonexistent tools | 100% |
+ | disambiguation | 8 | Pick correct tool from near-neighbors | 100% |
+ | format_sensitivity | 5 | Varied natural phrasing for same intent | 100% |
+ | ast_parameter | 5 | Correct argument extraction | 100% |
+ | edge_case | 8 | Boundary and adversarial inputs | 100% |
+ | multi_turn_chain | 8 | Two-step tool sequences | 100% |
+
+ Eval: Ollama inference, temperature=0, greedy decode.
+ Gate: ≥90% = deploy.
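The deploy gate above can be sketched as a small check over per-category pass counts. This is an illustration only: the function names are hypothetical, and it assumes the gate applies to overall accuracy rather than to each category separately, which the README does not state.

```python
# Category sizes copied from the BFCL table above.
BFCL_COUNTS = {
    "simple": 10,
    "relevance_detection": 10,
    "hallucination": 10,
    "disambiguation": 8,
    "format_sensitivity": 5,
    "ast_parameter": 5,
    "edge_case": 8,
    "multi_turn_chain": 8,
}
DEPLOY_GATE = 0.90  # "Gate: >=90% = deploy"


def overall_accuracy(passed, totals):
    """Fraction of all tests passed, pooled across categories (assumed pooling)."""
    return sum(passed[c] for c in totals) / sum(totals.values())


def passes_gate(passed, totals, gate=DEPLOY_GATE):
    """True when pooled accuracy meets or exceeds the deploy gate."""
    return overall_accuracy(passed, totals) >= gate
```

With these sizes, 58 of 64 passing tests clears the gate and 57 of 64 does not.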
+
+ ## SWE Bench Blind Eval — v43
+
+ **100.0%** (68/68 strict, 7 categories) — held-out test set, no overlap with training data.
+
+ | Category | Count | Accuracy |
+ |----------|------:|:--------:|
+ | adversarial_trap | 15 | 100% |
+ | cascade | 10 | 100% |
+ | disambiguation | 8 | 100% |
+ | edge_case | 8 | 100% |
+ | multi_intent | 4 | 100% |
+ | natural_phrasing | 15 | 100% |
+ | verifier | 8 | 100% |
+
+ ## eval-300 — v43
+
+ **100.0%** (300/300 strict, 5 shuffled runs, 0 flaky tests)
+
+ | Category | Count | Accuracy |
+ |----------|------:|:--------:|
+ | abstention | 20 | 100% |
+ | adversarial_trap | 70 | 100% |
+ | cascade | 25 | 100% |
+ | disambiguation | 40 | 100% |
+ | edge_case | 25 | 100% |
+ | multi_intent | 20 | 100% |
+ | natural_phrasing | 50 | 100% |
+ | param_extraction | 25 | 100% |
+ | verifier | 25 | 100% |
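One way to read "5 shuffled runs, 0 flaky tests" is that every test is executed once per run in a freshly shuffled order, and a test counts as flaky when its outcome differs between runs. The sketch below follows that reading. It is not the logic of `eval_300.py`; `run_test` is a hypothetical callable that returns a truthy value on a pass.

```python
import random


def run_shuffled(test_ids, run_test, runs=5, seed=0):
    """Run every test once per run, in a new random order each time."""
    rng = random.Random(seed)
    results = {t: [] for t in test_ids}
    for _ in range(runs):
        order = list(test_ids)
        rng.shuffle(order)
        for t in order:
            results[t].append(bool(run_test(t)))
    return results


def flaky_tests(results):
    """Tests whose outcome was not identical across all runs."""
    return sorted(t for t, outcomes in results.items() if len(set(outcomes)) > 1)


def strict_pass_count(results):
    """Tests that passed in every single run."""
    return sum(1 for outcomes in results.values() if all(outcomes))
```

Shuffling matters mainly when the harness keeps state between prompts, so an order-dependent failure shows up as a flaky test rather than a hidden one.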
+
+ ## Version History
+
+ | Version | BFCL | SWE Bench | eval-300 | Notes |
+ |---------|------|-----------|----------|-------|
+ | v43 | **100%** | **100%** | **100%** | Qwen3-4B base, 17-tool full router, Layer 3 inference-time remapping, 5 surgical patches |
+
+ ## Tools
+
+ The model routes to 17 Prism Memory tools:
+
+ | Tool | Trigger |
+ |------|---------|
+ | `session_load_context` | Load / resume / catch me up on project context |
+ | `session_save_ledger` | Jot down / log / note / record what we did |
+ | `session_save_experience` | Log milestone / achievement / success event |
+ | `session_save_handoff` | Save state for next agent / shift change |
+ | `session_search_memory` | Recall / remind me / find what we decided |
+ | `session_forget_memory` | Delete a specific memory entry by ID |
+ | `session_export_memory` | Export session to file (JSON / Markdown) |
+ | `session_compact_ledger` | Compact / prune old session entries |
+ | `session_health_check` | Check session integrity |
+ | `session_synthesize_edges` | Verify / rebuild session link graph |
+ | `session_backfill_links` | Reconnect / patch missing session links |
+ | `session_task_route` | Route a task to the right agent tier |
+ | `knowledge_search` | Search knowledge base / accumulated docs |
+ | `knowledge_forget` | Delete knowledge entries / wipe records |
+ | `knowledge_upvote` | Upvote / boost / increase rank of entry |
+ | `knowledge_downvote` | Downvote / lower rank of entry |
+ | `knowledge_set_retention` | Set TTL / auto-expire / retention policy |
+
+ Plain text (no tool) for: greetings, general questions, math, code help, weather, CS concepts.
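A caller that consumes the router's output may want to reject any tool name outside the table above, which mirrors the "hallucination" benchmark category on the client side. The following is a minimal sketch: `resolve_tool` is a hypothetical helper, not part of Prism, and how the tool name is extracted from the model's raw output is left out because the README does not specify the output format.

```python
# The 17 tool names listed in the table above.
KNOWN_TOOLS = frozenset({
    "session_load_context",
    "session_save_ledger",
    "session_save_experience",
    "session_save_handoff",
    "session_search_memory",
    "session_forget_memory",
    "session_export_memory",
    "session_compact_ledger",
    "session_health_check",
    "session_synthesize_edges",
    "session_backfill_links",
    "session_task_route",
    "knowledge_search",
    "knowledge_forget",
    "knowledge_upvote",
    "knowledge_downvote",
    "knowledge_set_retention",
})


def resolve_tool(name):
    """Return the tool name if it is one of the 17 listed tools, else None.

    None means "treat the reply as plain text", which also covers
    fabricated tool names.
    """
    if name is None:
        return None
    cleaned = name.strip()
    return cleaned if cleaned in KNOWN_TOOLS else None
```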
+
+ ## Model Details
+
+ - **Base**: Qwen/Qwen3-4B
+ - **Format**: GGUF Q4_K_M (~2.3 GB)
+ - **Context**: 32,768 tokens
+ - **Training**: MLX LoRA on Apple Silicon, rank=32, alpha=64, 16/36 layers, LR=1e-4 (full) → 3e-5 (surgical patches), 5 patch rounds
+ - **Corpus**: ~30K rows — 36% tool-use, 40% AAC/clinical, 12% abstention, 12% safety
+ - **Merge**: direct safetensors delta merge (`delta = (alpha/rank) × B.T @ A.T`) — mlx_lm.fuse not used (silently drops LoRA weights)
+ - **Quantization**: llama.cpp F16 → Q4_K_M
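The merge formula in the list above can be checked on a toy example. The sketch below is not `merge_4b_v43.py`. It uses plain Python lists so it runs without dependencies, and it assumes that `A` has shape `(in_features, rank)`, `B` has shape `(rank, out_features)`, and the base weight has shape `(out_features, in_features)`, so that `B.T @ A.T` matches the weight's shape.

```python
def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(row) for row in zip(*m)]


def matmul(x, y):
    """Multiply two matrices stored as lists of rows."""
    cols = transpose(y)
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in x]


def merge_lora(weight, lora_a, lora_b, alpha=64.0, rank=32):
    """Return weight + (alpha / rank) * B.T @ A.T without modifying the inputs.

    Assumed shapes: lora_a is (in, rank), lora_b is (rank, out),
    weight is (out, in). The defaults alpha=64, rank=32 give scale 2.0.
    """
    scale = alpha / rank
    delta = matmul(transpose(lora_b), transpose(lora_a))
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]
```

For example, with `in=2`, `out=3`, `rank=1`, `alpha=2`, a zero base weight, `A = [[1], [2]]` and `B = [[1, 0, -1]]`, the merged weight is `[[2, 4], [0, 0], [-2, -4]]`.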
+
+ ## Usage
+
+ ```bash
+ ollama pull dcostenco/prism-coder:4b-v43
+ ollama run dcostenco/prism-coder:4b-v43
+ ```
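Once the model has been pulled, it can also be queried programmatically through Ollama's local HTTP API. The sketch below assumes the default endpoint `http://localhost:11434` and sets `temperature` to 0 to match the eval settings reported in this README. Whether the router expects a particular system prompt or tool schema is not stated here, so the example sends a bare prompt.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL = "dcostenco/prism-coder:4b-v43"


def build_payload(prompt, model=MODEL):
    """Request body for a single non-streaming, temperature-0 completion."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": 0},
    }


def generate(prompt, url=OLLAMA_URL, timeout=60):
    """Send the prompt to a running Ollama server and return the reply text."""
    data = json.dumps(build_payload(prompt)).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

Calling `generate("catch me up on the project")` requires the Ollama server to be running locally with the model already pulled.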
+
+ Or drop the GGUF into any llama.cpp-compatible runtime (LM Studio, Jan, llama-server).
+
+ In [Prism AAC](https://github.com/dcostenco/prism-aac) the app loads this model automatically on devices with ≥8 GB free RAM.
+
+ ## Training Scripts
+
+ The `training/` folder in this repo contains the full v43 training pipeline:
+
+ | Script | Purpose |
+ |--------|---------|
+ | `build_4b_v43_corpus.py` | Full v43 corpus builder (~30K rows) |
+ | `build_4b_v43_patch.py` | Patch 1 — initial BFCL failures |
+ | `build_4b_v43_patch2.py` | Patch 2 — param extraction + format |
+ | `build_4b_v43_patch4.py` | Patch 4 — task_route + casual phrasing |
+ | `build_4b_v43_swe_patch.py` | Patch 5 — SWE bench targeted |
+ | `combine_4b_swe_corpus.py` | Merge base + SWE patch corpus |
+ | `train_4b_v43_local.sh` | MLX LoRA training (Apple Silicon) |
+ | `train_4b_v43_swe_patch.sh` | Surgical SWE patch training run |
+ | `merge_4b_v43.py` | Safe LoRA merge (delta = scale × B.T @ A.T) |
+ | `export_4b_v43_gguf.sh` | HF safetensors → GGUF F16 → Q4_K_M → Ollama |
+ | `orchestrate_4b_to_100.sh` | Autonomous patch→train→eval loop |
+ | `bfcl_eval.py` | 64-test BFCL eval harness with Layer 3 |
+ | `swe_bench_test.py` | 68-test SWE blind eval harness |
+ | `eval_300.py` | 300-test standard eval (9 categories) |
+ | `analyze_swe_failures.py` | Parse failures → patch targets |
+ | `TRAINING_DECISIONS_4B_V43.md` | Hyperparams, corpus ratios, lessons learned |
+
+ ## Model Family
+
+ | Model | GGUF | RAM | Tools | Repo |
+ |-------|------|-----|-------|------|
+ | prism-coder:1b7 | 1.2 GB | ≥3 GB | 6 | [dcostenco/prism-coder-1.7b](https://huggingface.co/dcostenco/prism-coder-1.7b) |
+ | **prism-coder:4b** | **2.3 GB** | **≥8 GB** | **17** | **this repo** |
+ | prism-coder:8b | 4.9 GB | ≥16 GB | 6 | [dcostenco/prism-coder-8b](https://huggingface.co/dcostenco/prism-coder-8b) |
+ | prism-coder:14b | 8.4 GB | ≥24 GB | 6 + TypeScript | [dcostenco/prism-coder-14b](https://huggingface.co/dcostenco/prism-coder-14b) |
+ | prism-coder:32b | 16 GB | ≥48 GB | 6 | [dcostenco/prism-coder-32b](https://huggingface.co/dcostenco/prism-coder-32b) |