# Eval roadmap: improving the scheduling fine-tune

How we measure and improve `ParetoOptimal/gemma-4-cal-gguf` (the fine-tuned
Gemma-4-31B that turns chat/images into a calendar `ActionPlan`). The eval is
**task-specific**; generic LLM benchmarks (MMLU etc.) don't apply.

Harness: `training/eval.py` (scores), `training/gen_eval.py` + `training/data/eval.jsonl`
(28 held-out examples, disjoint from `dataset.jsonl`), `training/modal_eval.py`
(serves the GGUF on the same `llama-server` the Space uses, then scores).

## Baseline scores (Q4_K_M, n=28, 2026-06-09)

| Metric | Score |
| --- | --- |
| schema validity | 1.00 |
| no-event accuracy | 1.00 |
| clarification recall | 1.00 |
| end-time exact | 1.00 |
| event precision | 0.85 |
| **event recall (start-exact)** | **0.77** |
| event F1 | 0.81 |
| title similarity | 0.87 |

Discipline (never invents events, always asks when ambiguous) is perfect; all 9
relative-date cases passed. The gap is **exact start datetime** on a few
explicit far-future dates (misses: `e02`, `e05`, `e06`, `e15`, one leg of `m02`).
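
The event metrics above follow the usual precision/recall/F1 definitions over matched events. A minimal sketch, assuming a predicted event counts as a true positive only when its `start` string equals a gold `start` exactly; the function name `event_prf` and the matching rule are illustrative and not taken from `training/eval.py`:

```python
from collections import Counter


def event_prf(gold_starts, pred_starts):
    """Precision, recall and F1 over start-exact event matches.

    Assumption: events are matched on their `start` string alone; the real
    harness may also compare titles or end times.
    """
    gold, pred = Counter(gold_starts), Counter(pred_starts)
    tp = sum((gold & pred).values())   # starts present in both
    fp = sum((pred - gold).values())   # predicted but not in gold
    fn = sum((gold - pred).values())   # in gold but not predicted
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 1.0
    return precision, recall, f1


# One wrong start is both a miss (fn) and a spurious event (fp).
gold = ["2026-10-06T15:30", "2026-10-01T08:15"]
pred = ["2026-10-07T15:30", "2026-10-01T08:15"]
print(event_prf(gold, pred))
```

Under this rule a single wrong datetime lowers precision and recall together, which is why the two move in step in the table.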

## The 3 steps

### 1. Diagnose the 5 misses (cheap)
Enhance `eval.py` to dump the model's actual `start`/`title` for mismatched events,
then one re-run shows whether they're date-shift, time/AM-PM, or wrong-year errors,
which tells us exactly what training data to add. (~one A100 eval run; the GGUF is
cached in the Modal Volume, so it's fast.)
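
The three error classes can be told apart mechanically once the predicted `start` is dumped next to the gold one. A rough sketch, assuming both values look like `YYYY-MM-DDTHH:MM`; the function name and the labels are hypothetical and not part of `eval.py`:

```python
def classify_start_mismatch(gold: str, pred: str) -> str:
    """Label how a predicted start differs from the gold start.

    Assumes both strings look like 'YYYY-MM-DDTHH:MM'; labels are illustrative.
    """
    if gold == pred:
        return "match"
    if "T" not in gold or "T" not in pred:
        return "unparseable"
    g_date, g_time = gold.split("T", 1)
    p_date, p_time = pred.split("T", 1)
    g_parts, p_parts = g_date.split("-"), p_date.split("-")
    if len(g_parts) != 3 or len(p_parts) != 3:
        return "unparseable"
    if g_parts[1:] == p_parts[1:] and g_time == p_time:
        return "wrong-year"   # month, day and time agree; only the year differs
    if g_time == p_time:
        return "date-shift"   # same clock time on a different day
    if g_date == p_date:
        return "time"         # same day, different clock time (e.g. AM/PM)
    return "other"


print(classify_start_mismatch("2026-10-06T15:30", "2025-10-06T15:30"))
print(classify_start_mismatch("2026-10-06T15:30", "2026-10-06T03:30"))
```

Counting the labels over all mismatched events gives a direct hint about which kind of training example is missing.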

### 2. Baseline comparison (the "Well-Tuned" proof)
Run `modal run training/modal_eval.py --model-hf-repo unsloth/gemma-4-31B-it-GGUF`
to score **stock** Gemma-4-31B on the same set. If the fine-tune's discipline
(no-event 1.0, clarification 1.0) and datetime recall beat stock, that's concrete
evidence the fine-tune helps. (Separate ~18 GB model download + A100 time.)

### 3. Close the gap
Add ~15–20 explicit-date examples (especially next-month dates and times) to
`training/data/dataset.jsonl`, re-train on Modal (`training/modal_train.py`),
re-eval, and watch start-exact recall move.

## Results log

### Step 1: diagnosis (2026-06-09)
The mismatch dump showed the misses are **not** a reasoning failure. 3 of 5 are the
same bug on next-month dates: a dropped year digit, **"206" instead of "2026"**
(month/day/time all correct):

```
[e02] gold 2026-10-06T15:30  pred 206-10-06T15:30
[e05] gold 2026-10-01T08:15  pred 206-10-01T08:15
[e15] gold 2026-10-08T19:00  pred 206-10-08T19:00
[e06] gold 2026-09-28T09:00  pred []                 (abstained)
[m02] Standup + Sprint demo  pred Standup only       (dropped 2nd leg)
```

Fix indicated: more far-future explicit-date examples reinforcing 4-digit years
(+ multi-event 2nd legs). → Step 3.

### Step 2: baseline vs fine-tune (2026-06-09, n=28, Q4_K_M)

| Metric | Stock `gemma-4-31B-it-GGUF` | Fine-tune `gemma-4-cal-gguf` |
| --- | --- | --- |
| schema validity | 1.00 | 1.00 |
| event precision | **1.00** | 0.85 |
| start-exact recall | **0.955** | 0.773 |
| event F1 | **0.977** | 0.81 |
| end-exact | 1.00 | 1.00 |
| no-event accuracy | 1.00 | 1.00 |
| clarification recall | 0.75 | **1.00** |

**Honest read:** stock Gemma-4-31B is already strong at this extraction and *beats*
the current fine-tune on datetime recall; the "206" bug is a fine-tune regression.
The fine-tune's only clear win is **clarification discipline** (asks when a thread is
"date TBD"; stock missed `q04`). As-is, the fine-tune is **not** justified on
extraction. Step 3 must fix the year regression and clear baseline's 0.955 recall
while keeping clarification at 1.00; otherwise the better play is stock + the
fine-tune's clarification behavior via prompting.

### Step 3: after gap-closing retrain (2026-06-09): REGRESSED
Dataset grown 69 → 87 (+18 Oct–Dec 2026 explicit-date examples, disjoint from eval),
same 2-epoch recipe, re-quantized to Q4_K_M and republished. Re-eval (n=28):

| Metric | Stock 31B | Fine-tune v1 (69) | **Fine-tune v2 (87, retrained)** |
| --- | --- | --- | --- |
| schema validity | 1.00 | 1.00 | **0.75** |
| event precision | 1.00 | 0.85 | **0.476** |
| start-exact recall | 0.955 | 0.773 | **0.455** |
| event F1 | 0.977 | 0.81 | **0.465** |
| end-exact | 1.00 | 1.00 | 1.00 |
| no-event accuracy | 1.00 | 1.00 | 1.00 |
| clarification recall | 0.75 | 1.00 | **0.75** |

**The naive retrain made it worse, not better.** New failure modes: unparseable/empty
JSON (validity 1.0→0.75), duplicate events, hallucinated "Drive to …" events,
transposed/garbage years (`2062`, `2062-15:00:00`), and previously-passing relative
dates now empty. Cause: overfitting. 18 of 87 examples were near-identical far-future
templates, biasing a tiny dataset and degrading general formatting/extraction.

## Conclusions & recommendation

1. **Stock Gemma-4-31B is already strong** at this extraction (F1 0.98). The only
   thing fine-tuning reliably *added* was clarification discipline (v1: 1.00 vs stock
   0.75), and even that was lost in v2.
2. **Tiny-dataset SFT is fragile here.** v1 (69 ex) underperformed stock on dates;
   v2 (87 ex) regressed hard. More data of the *same shape* hurt.
3. **Recommended path** (pick one):
   - **Ship stock + prompt for clarification:** simplest; recover the one real win
     without the regressions. (Lowest risk.)
   - **If keeping a fine-tune:** rebuild the dataset much larger and *diverse* (not
     template-heavy), drop to ~1 epoch with regularization, and **gate every retrain
     on this eval** (only publish if it beats the current best). Consider a higher
     quant (Q5/Q6) to rule out the `"206"`/`2062` digit corruption being quant-driven.
4. **Action: revert the live model.** v2 (worse) overwrote v1 in
   `ParetoOptimal/gemma-4-cal-gguf`. Restore v1 (the better fine-tune) or point the
   Space back at stock `unsloth/gemma-4-31B-it-GGUF` until a fine-tune *beats* the
   eval baseline.

**Bottom line: the eval did its job. It caught a regression before it reached users,
and it says the current fine-tune is not yet worth shipping over stock.**

## Follow-up (2026-06-09)

### Live model restored to v1
v2 (regressed) was rolled back: `gemma-cal-Q4_K_M.gguf` in the repo was restored to the
v1 LFS object via a server-side `CommitOperationCopy` (no transfer, no GPU). Production
serves the better v1 again.

### Dataset rebuilt larger + more diverse (69 → 122)
Added a diversity batch (`gen_new_seeds.MORE_SEEDS3`): varied date/time formats
(`10/15`, "the 3rd", "half past 7", "0900", "noon", "midnight"), reschedules,
cancellations, recurring, all-day, deadlines (EOD/midnight), past & hypothetical
(must NOT schedule), richer no-event & clarify, and varied image sources (ticket,
invite screenshot, notice). Goal: counter the template-heavy skew that overfit v2.
Verified valid + disjoint from `eval.jsonl`.

### Eval-gating is now the publishing process
**No retrain publishes unless it beats the eval.** `training/gated_retrain.py`:
1. retrain on Modal → upload to a **staging** filename (`gemma-cal-staging-Q4_K_M.gguf`)
   in the repo (production file untouched; mmproj skipped via `--skip-mmproj`);
2. eval the staging file (`modal_eval.py --model-file …`);
3. gate: `schema_validity ≥ 0.95`, `event_f1 ≥ 0.81`, `start-exact recall ≥ 0.773`
   (defaults = the current best, v1); tune via `--gate-f1/--gate-recall`;
4. **PASS** → promote staging → production via server-side `CommitOperationCopy` (free);
   **FAIL** → delete staging, production unchanged.

Run: `python training/gated_retrain.py [--epochs 1 --gate-f1 … --gate-recall …]`.
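
The gate in step 3 reduces to a threshold check over the eval scores. A minimal sketch, assuming the scores arrive as a dict; the key `start_exact_recall` and the function name `passes_gate` are hypothetical, and the real logic lives in `training/gated_retrain.py`:

```python
def passes_gate(scores, min_validity=0.95, min_f1=0.81, min_recall=0.773):
    """Return (ok, failures). Defaults mirror the thresholds quoted above."""
    checks = [
        ("schema_validity", min_validity),
        ("event_f1", min_f1),
        ("start_exact_recall", min_recall),  # hypothetical key name
    ]
    failures = [
        f"{name}={scores.get(name, 0.0):.3f} < {floor}"
        for name, floor in checks
        if scores.get(name, 0.0) < floor     # a missing metric counts as 0.0
    ]
    return (not failures, failures)


# v1's own scores meet the defaults exactly; v2's scores fail all three.
v1 = {"schema_validity": 1.00, "event_f1": 0.81, "start_exact_recall": 0.773}
v2 = {"schema_validity": 0.75, "event_f1": 0.465, "start_exact_recall": 0.455}
print(passes_gate(v1))
print(passes_gate(v2))
```

Treating a missing metric as 0.0 is a deliberate choice in this sketch: an incomplete eval run should fail the gate rather than slip through.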

### Step 4: first eval-gated retrain (122 ex, 1 epoch): GATE FAILED ✅ (protected prod)
The retrain scored **worse** than every prior version and the gate refused to publish:

| Metric | Stock | v1 (live) | v3 staging (122, 1ep) |
| --- | --- | --- | --- |
| schema validity | 1.00 | 1.00 | **0.46** |
| event F1 | 0.977 | 0.81 | **0.214** |
| start-exact recall | 0.955 | 0.773 | **0.136** |
| no-event accuracy | 1.00 | 1.00 | 1.00 |
| clarification recall | 0.75 | 1.00 | 1.00 |

More than half of the outputs were unparseable; extraction collapsed. **Gate: FAIL → staging deleted,
production unchanged (still v1).** The gate worked exactly as intended.

## Verdict (after 3 fine-tune attempts)
All three fine-tunes (v1: 69 ex / 2 ep; v2: 87 / 2 ep; v3: 122 / 1 ep) **underperform
stock Gemma-4-31B**, and the larger runs broke JSON validity. Only the safety behaviors
(no-event, clarification) survive fine-tuning; extraction degrades. **QLoRA-on-31B-Q4 here
is fragile and not worth shipping over stock.** Recommended: serve **stock
`unsloth/gemma-4-31B-it-GGUF`** and recover the one fine-tune win (clarification) via the
prompt. Keep v1 as the published fine-tune for the "Well-Tuned" artifact, but don't route
production extraction through it. Revisit fine-tuning only with a substantially larger, more
varied dataset and a recipe that holds schema validity at 1.0, gated, as now, on this eval.

## Step 5: quantization-penalty test (2026-06-09): quant EXONERATED
Hypothesis: maybe Q4 quantization (the `"206"`/`2062` digit bug) was tanking the fine-tune.
Tested the SAME fine-tuned weights (`gemma-cal-f16.gguf`, v2/87-ex, the best fp16 still on the
volume) at three precisions on the 28-example eval (`training/modal_quant_eval.py`):

| precision | schema validity | event F1 | start-exact recall |
| --- | --- | --- | --- |
| f16 (full) | 0.643 | 0.571 | 0.545 |
| Q8_0 | 0.679 | 0.565 | 0.591 |
| Q4_K_M | 0.75 | 0.465 | 0.455 |
| base (stock) | 1.00 | 0.977 | 0.955 |

**Quantization is not the cause.** At full fp16 the fine-tune still scores validity 0.64 / F1
0.57, nowhere near base; validity is actually *lower* at f16 than Q4, so quant isn't breaking
the JSON. Precision buys only ~+0.1 F1/recall (Q4→Q8/f16), a fraction of the gap to base. The
degradation is the **SFT itself**, not the GGUF conversion. The planned second stage of this test
(retrain at Q8 to beat base) is **not pursued** because the gate would fail. (Caveat: v1's fp16 was overwritten, so this used v2;
a definitive v1 test needs a retrain, but the small quant lift makes a base-beating result
improbable.)
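
The "fraction of the gap" claim can be checked with simple arithmetic on the F1 column of the table. A small sketch; the helper name `quant_lift_and_gap` is made up for illustration:

```python
# Event F1 per precision, copied from the table above.
F1 = {"f16": 0.571, "Q8_0": 0.565, "Q4_K_M": 0.465, "base": 0.977}


def quant_lift_and_gap(scores, low="Q4_K_M", high="f16", ref="base"):
    """Split the distance from `low` to `ref` into the part that higher
    precision recovers (lift) and the part that remains (gap)."""
    lift = scores[high] - scores[low]
    gap = scores[ref] - scores[high]
    return lift, gap


lift, gap = quant_lift_and_gap(F1)
print(f"lift={lift:.3f}  gap={gap:.3f}  share recovered={lift / (lift + gap):.0%}")
```

With these numbers the lift from Q4_K_M to f16 is about 0.11 F1, while about 0.41 F1 still separates the f16 fine-tune from stock.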

### Final recommendation
A higher quant won't make the fine-tune beat base, and an automation agent (e.g. `ml-intern`)
doesn't change the binding constraints (near-ceiling base; small data; SFT degrades
instruction-following). **Serve stock `unsloth/gemma-4-31B-it-GGUF`** and recover the
clarification behavior via the system prompt; keep v1 as the "Well-Tuned" artifact. Only
revisit fine-tuning with a substantially larger, real, diverse dataset + a validity-preserving
recipe (low LR, few steps), always gated on this eval.

## Real training data: SMCalFlow importer
`training/import_smcalflow.py` converts **SMCalFlow** (Microsoft Semantic Machines, **CC BY-SA
4.0**) calendar dialogues into our `ActionPlan` format. SMCalFlow encodes events as LISP
"dataflow" programs; the importer parses `CreatePreflightEventWrapper` turns, extracts
subject/start/location/attendees, and **resolves** date/time constructs (`Tomorrow`, `NextDOW`,
`MD`, `NumberPM`, `HourMinuteMilitary`, …) against a per-example reference `now` spread across
2026, so relative dates become concrete, self-consistent targets (directly trains the failing
date/time skill, with varied 4-digit years). Conservative: only emits a row when a title AND an
explicit start time resolve (~7.5k usable turns from train+valid).

- Run: `python training/import_smcalflow.py --limit 2000 --heldout 200` → writes
  `training/data/smcalflow_train.jsonl` (+ `…_heldout.jsonl`). **Both are git-ignored** (CC BY-SA
  share-alike vs this repo's Apache-2.0 → we don't commit/redistribute the derived data; the
  importer code is ours) and **disjoint from `eval.jsonl`**.
- `train_qlora.py` now trains on `dataset.jsonl` **+** `smcalflow_train.jsonl` (when present).
  `gated_retrain.py` therefore trains on real data, and still **only publishes if it beats the
  gate**, so a bigger-but-worse model can't reach production.
- Attribution (required by CC BY-SA): *Semantic Machines et al., "Task-Oriented Dialogue as
  Dataflow Synthesis," TACL 2020.*
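
Resolving a relative construct against a reference `now` is the core of the importer. A minimal sketch of two such resolvers using only the standard library; the function names are hypothetical, and the "next weekday" convention shown is an assumption rather than the importer's confirmed behavior:

```python
from datetime import datetime, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def resolve_tomorrow(now: datetime, hour: int, minute: int = 0) -> str:
    """'Tomorrow at HH:MM' relative to `now`, as an ISO-style string."""
    target = (now + timedelta(days=1)).replace(
        hour=hour, minute=minute, second=0, microsecond=0)
    return target.strftime("%Y-%m-%dT%H:%M")


def resolve_next_dow(now: datetime, weekday: str, hour: int, minute: int = 0) -> str:
    """First occurrence of `weekday` strictly after today's date.

    Assumption: "next Tuesday" said on a Tuesday means one week later.
    """
    delta = (WEEKDAYS.index(weekday.lower()) - now.weekday()) % 7 or 7
    target = (now + timedelta(days=delta)).replace(
        hour=hour, minute=minute, second=0, microsecond=0)
    return target.strftime("%Y-%m-%dT%H:%M")


now = datetime(2026, 6, 9, 10, 0)           # 2026-06-09 is a Tuesday
print(resolve_tomorrow(now, 15, 30))        # 2026-06-10T15:30
print(resolve_next_dow(now, "Tuesday", 9))  # 2026-06-16T09:00
```

Spreading `now` across the year, as the importer does, means the same utterance yields different concrete targets, so the model cannot memorize a single date.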

## Step 6: eval-gated retrain on REAL data (2026-06-09): FAILED gate (worst yet)
Trained the 31B on **2,122 examples** (122 hand-authored + 2,000 real SMCalFlow), 1 epoch,
through `gated_retrain.py` with a beat-base gate (F1≥0.95, recall≥0.90). Result on the 28-ex eval:

| Metric | base | v1 (live) | real-data (2,122 ex) |
| --- | --- | --- | --- |
| schema validity | 1.00 | 1.00 | **0.107** |
| event F1 | 0.977 | 0.81 | **0.000** |
| start-exact recall | 0.955 | 0.773 | **0.000** |

~90% unparseable output, zero events extracted. **Gate FAIL → not promoted; production stays v1.**

### Verdict across 4 fine-tunes (now incl. real data)
Scores **monotonically worsen with more training/data**: v1 (69 synth, F1 0.81) → v2 (87, 0.465)
→ v3 (122, 0.214) → real (2,122, 0.0). This is no longer a *data* problem: **the SFT recipe
itself degrades the model**, and more data makes it worse. Most likely root cause to investigate
*if* fine-tuning is ever revisited: a **train/inference chat-template mismatch**. `train_qlora.py`
formats with Unsloth's `get_chat_template("gemma")` while `llama-server` serves with the GGUF's
own `--jinja` template; if these differ for Gemma-4, training optimizes a format the server never
uses, and the divergence compounds with more steps (exactly the monotonic decay seen). Other
suspects: LR too high (2e-4) / catastrophic forgetting on a near-ceiling base.

**Final, evidence-backed recommendation: serve stock `unsloth/gemma-4-31B-it-GGUF`** (best by far)
and recover clarification via the system prompt. Do NOT route production through any current
fine-tune. The eval-gate has now correctly rejected 2 bad retrains; keep it as the publish gate.

## Step 7: recipe fix + raw-output probe (2026-06-09): training stack implicated, fine-tuning HALTED
Fixed the suspected train/serve chat-template mismatch (PR #54): Gemma-4's native
`chat_template.jinja` uses a NEW `<|turn>role … <turn|>` format (no `<start_of_turn>` at all),
while training forced unsloth's legacy "gemma" template. `train_qlora.py` now formats with the
tokenizer's native template (hard `<|turn>` assert), masks loss to the assistant turn, LR 5e-5.
Retrained on the 2,122-example set through the gate: **validity 0.0, gate FAIL** (production
stays v1, third bad retrain rejected).

Diagnostics that pinpointed the cause:
- **GGUF template check (CPU, ~free):** our exported staging GGUF embeds the correct native
  `<|turn>` template (16,934 chars, no `<start_of_turn>`) → train and serve formats are now
  verifiably aligned. Template is exonerated as the remaining cause.
- **Raw-output probe (`/outputs/gemma-cal-staging-Q4_K_M.gguf`):** free generation emits pure
  degenerate looping (`'Huddle — — — — — …'` to the token limit); constrained generation emits
  512 tokens of nothing. **The weights are destroyed, not misformatted.**
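
Both collapse patterns from the raw-output probe (one token repeated to the limit, or no content at all) are easy to flag automatically. A rough sketch, assuming whitespace splitting is a good enough stand-in for tokens; the function name and the repeat threshold are hypothetical:

```python
def looks_degenerate(text: str, min_repeats: int = 8) -> bool:
    """True if the text is empty or contains a long run of one repeated token.

    Assumption: whitespace-separated words approximate model tokens.
    """
    tokens = text.split()
    if not tokens:
        return True                  # nothing but whitespace was generated
    run = 1
    for prev, cur in zip(tokens, tokens[1:]):
        run = run + 1 if cur == prev else 1
        if run >= min_repeats:
            return True              # same token repeated back to back
    return False


print(looks_degenerate("Huddle" + " -" * 20))
print(looks_degenerate('{"events": [], "needs_clarification": false}'))
```

A check like this could run before JSON parsing, so a collapsed model is reported as such rather than as a schema failure.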

With dataset (69→2,122), template (legacy/native), LR (2e-4/5e-5), and masking (on/off) all
varied, degradation always tracks training steps and ends in token-loop collapse. The remaining
common factor is **Unsloth's QLoRA path for Gemma-4-31B** (new architecture; training logs warn
`get_input_embeddings not auto-handled for Gemma4AudioModel`). **Fine-tuning is halted** until
that stack demonstrably works for this arch (or is replaced with plain transformers+PEFT).

## Step 8: improve served evals via prompt (stock + targeted SYSTEM additions)
Base's only eval misses are prompt-fixable: m03 dropped the 2nd event of a multi-event thread;
q04 didn't ask clarification on a "TBD" plan. Added two surgical SYSTEM lines (list every
distinct event separately; ask via needs_clarification when day/time is TBD).

**Result: PERFECT SCORE, 1.0 on every metric (n=28, tp/fp/fn = 22/0/0).**

| Metric | base (old prompt) | **base + new prompt** |
| --- | --- | --- |
| schema validity | 1.00 | **1.00** |
| event precision | 1.00 | **1.00** |
| start-exact recall | 0.955 | **1.00** |
| event F1 | 0.977 | **1.00** |
| no-event accuracy | 1.00 | **1.00** |
| clarification recall | 0.75 | **1.00** |

Both misses fixed, nothing regressed. **This is the production configuration: stock
`unsloth/gemma-4-31B-it-GGUF` + the updated SYSTEM prompt.** (Set Space var
`MODEL_HF_REPO=unsloth/gemma-4-31B-it-GGUF`; the prompt ships with the app.) The "Well-Tuned"
artifact remains `ParetoOptimal/gemma-4-cal-gguf` (v1); any future fine-tune must beat THIS
1.0 baseline through the gate, i.e., match it and win on a harder, expanded eval set.

## Step 9: the E4B edge-model campaign (2026-06-10)
Re-aimed fine-tuning where it has headroom: a **Gemma-4 E4B (~8B)** edge model that runs without a
paid A100, gated against **stock E4B**. Six gated runs, each fixing a diagnosed failure (the fixed
recipe trained cleanly every time, with validity 1.0 throughout, confirming the Step-7 breakage was
specific to the 31B path):

| run | change | F1 | recall | clarify | eval |
| --- | --- | --- | --- | --- | --- |
| #1 | fixed recipe, 2,122 ex | 0.884 | 0.864 | 1.0 | n=28 |
| #2 | + weekday-in-prompt (+data regen) | 0.955 | 0.955 | 0.75 | n=28 |
| #3 | + next-DOW conflict filter (74 rows), 4× hand | **1.0** | **1.0** | 0.75 | n=28 |
| #4 | + TBD-clarify seeds, 8× hand | 0.93 | 0.909 | 1.0 | n=28 |
| #5 | clarify seeds, 4× hand | 0.93 | 0.909 | 1.0 | n=28 |
| - | **eval expanded 28→60** (50 events; jitter-resistant) | | | | |
| #6 | + Batch-7 seeds (next-DOW, "opens") | 0.97 | 0.96 | 1.0 | n=60 |
| stock E4B (weekday prompt) | | 0.97 | 0.96 | 1.0 | n=60 |

Run #6 vs stock is an **exact statistical tie** (identical tp/fp/fn 48/1/2; both miss `e09`
"next Tuesday", which resisted 7 explicit training seeds, and one "opens" case each).
Campaign side effects that improved the PRODUCT for every model: weekday-in-prompt, the
next-DOW convention cleanup, and the 60-example eval.
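
The weekday-in-prompt change from run #2 can be pictured as spelling out the weekday next to the reference date, so the model does not have to derive it. A minimal sketch; the line wording and the function name `reference_line` are hypothetical and not the app's actual prompt:

```python
from datetime import datetime

WEEKDAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]


def reference_line(now: datetime) -> str:
    """Render the reference time with its weekday spelled out.

    Uses a fixed English name list instead of strftime('%A') so the result
    does not depend on the process locale.
    """
    weekday = WEEKDAY_NAMES[now.weekday()]
    return f"Current date and time: {weekday}, {now.strftime('%Y-%m-%d %H:%M')}"


print(reference_line(datetime(2026, 6, 10, 9, 0)))
```

Because the same line serves every model, this kind of change lifts stock and fine-tuned scores alike.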

## Step 10: bare-prompt (internalization) test: no decisive gap
Dropped the system prompt for both models (identical minimal user content, same JSON-schema
constraint; `modal_eval.py --minimal-prompt`), measuring internalized task knowledge:

| bare, n=60 | stock E4B | fine-tuned E4B |
| --- | --- | --- |
| schema validity | 0.967 | **1.0** |
| event F1 | **0.682** | 0.644 |
| start-exact recall | **0.60** | 0.56 |
| no-event accuracy | 0.70 | **0.80** |
| clarification recall | 0.50 | **0.625** |

Small trade-offs both ways, within noise. **Verdict: at this data scale (139 hand + 2,000
SMCalFlow) with QLoRA/1-epoch, the E4B fine-tune reaches PARITY with stock, not superiority**:
non-degraded, perfect validity everywhere, better bare-prompt discipline, slightly weaker bare
extraction. The strict-dominance gate therefore never auto-promoted it; the candidate GGUF
remains on the Modal volume (`/outputs/gemma-cal-e4b-staging-Q4_K_M.gguf`). Publishing it as
the project's edge model at parity is a **product decision** (zero quality cost; production
would serve our own fine-tune, fulfilling "Well-Tuned"), deliberately left to the owner, not
the gate.
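
Why a tie never promotes follows from what strict dominance requires. A minimal sketch, assuming it means "no worse on every metric and better on at least one"; that definition and the function name are assumptions, not confirmed details of the gate:

```python
def strictly_dominates(candidate: dict, baseline: dict) -> bool:
    """Candidate is no worse on every baseline metric and better on at least one."""
    metrics = list(baseline)
    no_worse = all(candidate[m] >= baseline[m] for m in metrics)
    better = any(candidate[m] > baseline[m] for m in metrics)
    return no_worse and better


# Step 9, n=60: run #6 and stock E4B have identical scores, so neither dominates.
stock = {"event_f1": 0.97, "start_exact_recall": 0.96, "clarification_recall": 1.0}
run6 = {"event_f1": 0.97, "start_exact_recall": 0.96, "clarification_recall": 1.0}
print(strictly_dominates(run6, stock))
```

Under this rule parity always fails the gate, which matches the decision being left to the owner rather than automated.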