dcostenco committed on
Commit
db4f2d8
·
verified ·
1 Parent(s): 81c234c

Add training/TRAINING_DECISIONS_4B_V43.md

Files changed (1)
  1. training/TRAINING_DECISIONS_4B_V43.md +149 -0
training/TRAINING_DECISIONS_4B_V43.md ADDED
@@ -0,0 +1,149 @@
+ # Prism Coder 4B v43: Training Decisions & Reuse Guide
+
+ > Apply these decisions to 8B, 14B, and 32B training runs. The patterns are size-agnostic.
+
+ ## Architecture: Tiered Model Deployment
+
+ | Model | Free device RAM | Role | Verifier |
+ |-------|-----------------|------|----------|
+ | 1.7B | ≥3 GB | Primary agent (low-mem) | Self (or none) |
+ | 4B | ≥8 GB | Primary agent (mid-tier) | 4B self-verifier |
+ | 8B | ≥16 GB | Primary agent (high-tier) | 4B or self |
+ | 14B/32B | ≥24 GB | Primary agent (pro) | 4B |
+
+ **Key decision**: 1.7B stays on low-memory devices (phones, older Macs). 4B and above all serve the same Prism Memory tool-calling purpose: a larger model handles edge cases better but trains on the same corpus shape.
+
+ **Verifier tier**: Configured via the `PRISM_VERIFIER_MODEL` env var (default: `prism-coder:1b7`). Set it to `prism-coder:4b` on devices with ≥8 GB free for higher accuracy at roughly 3× the latency. The verifier call runs after the draft in `chat-verifier.ts::verifyOrRefuse()`.
+
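The tier table and the verifier setting can be read as a small lookup. The sketch below is only an illustration in Python (the real verifier lives in `chat-verifier.ts`): the function names are hypothetical, and the `prism-coder:8b` and `prism-coder:14b` tags are assumed by analogy with the two tags named above.

```python
import os
from typing import Mapping, Optional

# (minimum free RAM in GB, model tag), largest tier first, from the table above.
# ASSUMPTION: the 8b and 14b tags are hypothetical; only 4b and 1b7 appear in this guide.
TIERS = [
    (24.0, "prism-coder:14b"),
    (16.0, "prism-coder:8b"),
    (8.0, "prism-coder:4b"),
    (3.0, "prism-coder:1b7"),
]

DEFAULT_VERIFIER = "prism-coder:1b7"


def pick_primary_model(free_ram_gb: float) -> Optional[str]:
    """Return the largest model whose RAM floor fits, or None below 3 GB."""
    for min_gb, tag in TIERS:
        if free_ram_gb >= min_gb:
            return tag
    return None


def pick_verifier_model(env: Optional[Mapping[str, str]] = None) -> str:
    """Read PRISM_VERIFIER_MODEL, falling back to the documented default."""
    env = os.environ if env is None else env
    return env.get("PRISM_VERIFIER_MODEL") or DEFAULT_VERIFIER
```

Passing the environment as an argument keeps the lookup easy to exercise without touching the real process environment.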
+ ---
+
+ ## Training Hyperparameters (Validated for 4B, Scale for Others)
+
+ | Param | 4B v43 value | 8B guidance | 14B/32B guidance |
+ |-------|-------------|-------------|-----------------|
+ | Base model | Qwen/Qwen3-4B | Qwen/Qwen3-8B | Qwen/Qwen3-14B / Qwen/Qwen3-32B |
+ | LoRA rank | 32 | 32 | 32 (or 16 for speed) |
+ | LoRA alpha | 64 (scale=2.0) | 64 | 64 |
+ | LoRA layers | 16 of 36 | 16 of 36 | 16 of 40 (14B) / 16 of 64 (32B) |
+ | Batch size | 2 | 2 | 1 (grad-checkpoint) |
+ | Grad checkpoint | yes | yes | yes |
+ | Seq length | 2048 | 2048 | 2048 |
+ | LR (initial) | 1e-4 | 1e-4 | 5e-5 |
+ | LR (surgical patch) | 3e-5 | 2e-5 | 1e-5 |
+ | Iters (full run) | 2000 | 2000 | 1500 |
+ | Iters (patch) | 250 | 200 | 150 |
+ | Val batches | 25 | 25 | 25 |
+ | Save every | 200 | 200 | 200 |
+
+ **Critical**: Training a surgical patch (a small delta corpus appended to a large existing corpus) at the full-run LR of 1e-4 causes catastrophic interference. Validated fix: 3e-5 for 4B. Scale down further for larger models, as in the table above (2e-5 for 8B, 1e-5 for 14B/32B).
+
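One way to keep the full-run and patch settings from being mixed up is to store both per model size and select them explicitly. This is a minimal sketch with values copied from the hyperparameter table; `RunConfig`, `CONFIGS`, and `schedule` are hypothetical names, not part of the training scripts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    base_model: str
    batch_size: int
    lr_full: float
    lr_patch: float
    iters_full: int
    iters_patch: int
    lora_rank: int = 32
    lora_alpha: int = 64
    seq_length: int = 2048

    @property
    def lora_scale(self) -> float:
        # scale = alpha / rank, e.g. 64 / 32 = 2.0
        return self.lora_alpha / self.lora_rank


# Values from the table above; the 14B row uses the "14B/32B guidance" column.
CONFIGS = {
    "4b": RunConfig("Qwen/Qwen3-4B", 2, 1e-4, 3e-5, 2000, 250),
    "8b": RunConfig("Qwen/Qwen3-8B", 2, 1e-4, 2e-5, 2000, 200),
    "14b": RunConfig("Qwen/Qwen3-14B", 1, 5e-5, 1e-5, 1500, 150),
}


def schedule(size: str, surgical_patch: bool) -> tuple:
    """Return (learning_rate, iters) for a full run or a surgical patch."""
    cfg = CONFIGS[size]
    if surgical_patch:
        return cfg.lr_patch, cfg.iters_patch
    return cfg.lr_full, cfg.iters_full
```

Making `surgical_patch` a required argument forces every caller to state which regime it is in, which is the mistake the note above warns about.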
+ ---
+
+ ## Corpus Format (MUST match for all model sizes)
+
+ All rows must use the `"text"` key with pre-rendered ChatML strings:
+ ```json
+ {"text": "<|im_start|>system\n...<|im_end|>\n<|im_start|>user\n...<|im_end|>\n<|im_start|>assistant\n...<|im_end|>"}
+ ```
+
+ **Never use** the `"messages"` key format: mlx_lm auto-detects the format from the first row and crashes on mixed files.
+
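A pre-flight check over the JSONL file can catch a stray `"messages"` row before training starts. The following is a sketch under the assumption that a valid row is a JSON object with a string `"text"` value containing ChatML markers; `check_row` and `check_corpus` are hypothetical helpers, not part of the corpus builders.

```python
import json

REQUIRED_MARKERS = ("<|im_start|>", "<|im_end|>")


def check_row(line: str) -> list:
    """Return a list of problems for one JSONL line (empty list means OK)."""
    try:
        row = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(row, dict):
        return ["row is not a JSON object"]
    problems = []
    if "messages" in row:
        problems.append('uses the "messages" key')
    text = row.get("text")
    if not isinstance(text, str):
        problems.append('missing string "text" key')
    else:
        for marker in REQUIRED_MARKERS:
            if marker not in text:
                problems.append(f"text lacks {marker}")
    return problems


def check_corpus(lines) -> dict:
    """Map 1-based line numbers to problems, skipping blank lines and clean rows."""
    report = {}
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue
        problems = check_row(line)
        if problems:
            report[number] = problems
    return report
```

`check_corpus` accepts any iterable of lines, so an open file handle can be passed directly.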
+ Tool call format in training data:
+ ```
+ <tool_call>
+ {"name": "tool_name", "arguments": {...}}
+ </tool_call>
+ ```
+ **No pipes**: use `<tool_call>`, not `<|tool_call|>`. The eval harnesses (`bfcl_eval.py`, `swe_bench_test.py`) must use the same format.
+
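To keep the corpus builders and the eval harness on the same format, rendering and parsing can share one definition of the tag. This sketch is not taken from `bfcl_eval.py`; `render_tool_call` and `extract_tool_calls` are hypothetical names, and the parser deliberately ignores the piped variant.

```python
import json
import re

# Matches the no-pipe form only; "<|tool_call|>" is intentionally not accepted.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def render_tool_call(name: str, arguments: dict) -> str:
    """Render one call in the training-data layout (tag, JSON, tag on separate lines)."""
    payload = json.dumps({"name": name, "arguments": arguments})
    return "<tool_call>\n" + payload + "\n</tool_call>"


def extract_tool_calls(text: str) -> list:
    """Return every well-formed {"name", "arguments"} call found in text."""
    calls = []
    for body in TOOL_CALL_RE.findall(text):
        try:
            obj = json.loads(body)
        except json.JSONDecodeError:
            continue  # malformed JSON counts as "no call"
        if (
            isinstance(obj, dict)
            and isinstance(obj.get("name"), str)
            and isinstance(obj.get("arguments"), dict)
        ):
            calls.append(obj)
    return calls
```

A round trip through both functions gives a quick consistency check whenever either side changes.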
+ Think block format:
+ ```
+ <|synalux_think|>reasoning here</|synalux_think|>
+ ```
+ Use string literals for both the open and close tags. Do NOT use f-string escaping for the close tag.
+
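A minimal sketch of the string-literal approach: both tags are plain constants and are joined by concatenation, so no escaping is involved. `wrap_think` and `split_think` are hypothetical helper names.

```python
# Both tags are plain string literals, copied from the format shown above.
THINK_OPEN = "<|synalux_think|>"
THINK_CLOSE = "</|synalux_think|>"


def wrap_think(reasoning: str) -> str:
    """Wrap reasoning text in the think tags using plain concatenation."""
    return THINK_OPEN + reasoning + THINK_CLOSE


def split_think(text: str) -> tuple:
    """Return (reasoning, remainder); reasoning is None when no complete block exists."""
    before, open_tag, rest = text.partition(THINK_OPEN)
    if not open_tag:
        return None, text
    reasoning, close_tag, after = rest.partition(THINK_CLOSE)
    if not close_tag:
        return None, text
    return reasoning, before + after
```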
+ ---
+
+ ## Merge: mlx_lm.fuse is Broken for GGUF
+
+ `mlx_lm.fuse` silently loses LoRA weights during GGUF conversion. Use the `merge_4b_v43.py` pattern instead:
+
+ ```python
+ # scale = alpha / rank (pre-computed in adapter_config.json)
+ # A_matrix @ B_matrix must have the same shape as base_weight;
+ # transpose the product if the adapter stores its factors the other way round.
+ delta = scale * (A_matrix @ B_matrix)
+ merged_weight = base_weight + delta
+ ```
+
+ Script: `merge_4b_v43.py`. Adapt it for 8B/14B/32B by changing the base model path.
+
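The arithmetic of the merge can be checked on toy matrices. The sketch below is dependency-free on purpose and uses nested lists; it is not the contents of `merge_4b_v43.py`, which presumably operates on real tensors. `matmul`, `lora_scale`, and `merge_lora` are hypothetical names, and the shape check assumes the product `A @ B` is already oriented like the base weight.

```python
def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def lora_scale(alpha: float, rank: int) -> float:
    """scale = alpha / rank, e.g. 64 / 32 = 2.0 for the 4B v43 run."""
    return alpha / rank


def merge_lora(base_weight, a_matrix, b_matrix, scale):
    """Return base_weight + scale * (A @ B) as a new matrix."""
    product = matmul(a_matrix, b_matrix)
    if len(product) != len(base_weight) or len(product[0]) != len(base_weight[0]):
        # ASSUMPTION: the caller transposes beforehand if the adapter layout differs.
        raise ValueError("A @ B does not match the base weight shape")
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(base_weight, product)
    ]
```

Raising on a shape mismatch is a deliberate choice: a silent broadcast or skip is the kind of failure that would drop adapter weights without any warning.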
+ ---
+
+ ## Layer 3 (Inference-Time Remapping): Apply Before Training Patches
+
+ Before writing a corpus patch, check whether the failure is fixable by Layer 3 rules in `bfcl_eval.py::apply_layer3()`. Layer 3 fixes:
+ - Tool name remapping (semantic similarity false positives)
+ - Format normalization (pipe vs no-pipe)
+ - Context-based disambiguation (`backfill_links` vs `synthesize_edges`)
+ - Abstention for general programming/CS questions
+
+ Layer 3 is **zero-cost** (no training needed) and **regression-proof**. Use training patches only for failures that Layer 3 cannot fix.
+
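The real rules live in `bfcl_eval.py::apply_layer3()` and are not reproduced here. As a rough illustration of what such a post-processing step can look like, the sketch below covers three of the four rule types: format normalization, name remapping, and abstention. Every name in it is hypothetical, including the remap entry and the `should_abstain` callback.

```python
import re

# Accepts "<|tool_call|>", "</|tool_call|>" and "<|/tool_call|>".
PIPED_TAG_RE = re.compile(r"<(/?)\|(/?)tool_call\|>")

# ASSUMPTION: illustrative entry only; the real remap table is in bfcl_eval.py.
NAME_REMAP = {"search_memory": "memory_search"}


def normalize_tool_tags(raw: str) -> str:
    """Rewrite piped tool-call tags to the no-pipe form used in training."""
    def fix(match):
        closing = match.group(1) or match.group(2)
        return "</tool_call>" if closing else "<tool_call>"
    return PIPED_TAG_RE.sub(fix, raw)


def apply_layer3_sketch(call, prompt, should_abstain=None):
    """Return a corrected copy of the call, or None when no tool should be called."""
    if call is None:
        return None
    if should_abstain is not None and should_abstain(prompt):
        return None  # e.g. a general programming/CS question
    fixed = dict(call)  # leave the caller's dict untouched
    fixed["name"] = NAME_REMAP.get(call["name"], call["name"])
    return fixed
```

Because these rules run only on model output, they can be added or removed without retraining.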
+ ---
+
+ ## Corpus Mix Ratios (v2, Verified)
+
+ | Category | Target % | Notes |
+ |----------|----------|-------|
+ | Tool-use (Prism Memory) | ~36% | All 29 tools, param extraction, multi-turn |
+ | AAC / clinical | ~40% | Critical: prevents mode collapse |
+ | Abstention | ~12% | CS/general questions → no tool call |
+ | Safety / refusal | ~12% | Edge cases, PII, etc. |
+
+ Minimum thresholds for the quality gate: tool_calls ≥ 5000, AAC rows ≥ 40%, safety/refusal ≥ 10%.
+
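The quality gate reduces to three comparisons over per-category row counts. This sketch assumes that `tool_calls ≥ 5000` refers to the number of tool-use rows and that the two percentages are measured against the whole corpus. The category keys and function names are hypothetical.

```python
def mix_percentages(counts: dict) -> dict:
    """Return each category's share of the corpus in percent."""
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: 100.0 * n / total for name, n in counts.items()}


def passes_quality_gate(counts: dict) -> bool:
    """Check the three minimums: tool-use rows, AAC share, safety/refusal share."""
    total = sum(counts.values())
    if total == 0:
        return False
    # Integer arithmetic avoids float rounding right at the 40% and 10% boundaries.
    return (
        counts.get("tool_use", 0) >= 5000
        and counts.get("aac", 0) * 100 >= 40 * total
        and counts.get("safety", 0) * 100 >= 10 * total
    )
```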
+ ---
+
+ ## Patch Strategy (Surgical vs Full Retrain)
+
+ **Surgical patch** (preferred):
+ - Identify failing categories from `swe_bench_test.py` or BFCL
+ - Build targeted JSONL (30–100 rows per failure group)
+ - Oversample 3×, then append to the existing `train.jsonl`
+ - Train at reduced LR (3e-5 for 4B, lower for larger models)
+ - Iters: 200–300 (enough to reinforce without overwriting)
+
+ **Full retrain** (when needed): catastrophic regression, base model upgrade, or a corpus shape change requiring >20% new data.
+
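The oversample-and-append step above can be sketched over in-memory JSONL lines. This is an assumption about how the patch builders combine rows, not a copy of them; `oversample_patch` and `append_patch` are hypothetical names, and whether the real scripts shuffle afterwards is not stated here.

```python
def oversample_patch(patch_lines, factor: int = 3) -> list:
    """Repeat every patch row `factor` times, keeping the original order."""
    if factor < 1:
        raise ValueError("factor must be at least 1")
    return [line for line in patch_lines for _ in range(factor)]


def append_patch(train_lines, patch_lines, factor: int = 3) -> list:
    """Return the existing corpus followed by the oversampled patch rows."""
    return list(train_lines) + oversample_patch(patch_lines, factor)
```

With 30 to 100 rows per failure group and a base corpus in the tens of thousands, the patch stays a small fraction of the file even after 3× oversampling.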
+ ---
+
+ ## BFCL Gate: 100% Required Before Push
+
+ The gate is enforced in `bfcl_eval.py`. Run it with 3 seeds before any Ollama Hub push:
+ ```bash
+ python3 bfcl_eval.py --model prism-coder:4b-v43 --seeds 2027 2028 2029
+ ```
+ All seeds must show 100%. If any seed falls short, do not push.
+
+ SWE bench (`swe_bench_test.py`) is the secondary blind eval. The target is 100% strict; because this suite is harder, ≥95% strict is acceptable when BFCL is at 100%.
+
+ ---
+
+ ## Files in This Directory
+
+ | File | Purpose |
+ |------|---------|
+ | `build_4b_v43_corpus.py` | Full v43 corpus builder (28,454 base rows) |
+ | `build_4b_v43_patch.py` | Patch 1: initial BFCL failures |
+ | `build_4b_v43_patch2.py` | Patch 2: param extraction + format |
+ | `build_4b_v43_patch3.py` | Patch 3: regressed (LR too high), abandoned |
+ | `build_4b_v43_patch4.py` | Patch 4: task_route implicit + param extraction from casual phrasing |
+ | `build_4b_v43_swe_patch.py` | SWE bench targeted patch |
+ | `train_4b_v43_local.sh` | MLX LoRA training script (Apple Silicon) |
+ | `merge_4b_v43.py` | Safe merge: delta = scale × (A @ B) |
+ | `export_4b_v43_gguf.sh` | HF → GGUF F16 → Q4_K_M → Ollama register |
+ | `bfcl_eval.py` | 64-test BFCL suite with Layer 3 |
+ | `swe_bench_test.py` | 68-test blind SWE suite |
+ | `orchestrate_4b_to_100.sh` | Autonomous patch→train→eval loop |
+ | `analyze_swe_failures.py` | Parse swe_bench output → failure categories |
+
+ For 8B/14B/32B: copy the `build_*` and `train_*` scripts, then update the model name and hyperparameters per the hyperparameter table above.