File size: 19,218 Bytes
0366d65 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 | # Eval roadmap β improving the scheduling fine-tune
How we measure and improve `ParetoOptimal/gemma-4-cal-gguf` (the fine-tuned
Gemma-4-31B that turns chat/images into a calendar `ActionPlan`). The eval is
**task-specific** β generic LLM benchmarks (MMLU etc.) don't apply.
Harness: `training/eval.py` (scores), `training/gen_eval.py` + `training/data/eval.jsonl`
(28 held-out examples, disjoint from `dataset.jsonl`), `training/modal_eval.py`
(serves the GGUF on the same `llama-server` the Space uses, then scores).
## Baseline scores (Q4_K_M, n=28, 2026-06-09)
| Metric | Score |
| --- | --- |
| schema validity | 1.00 |
| no-event accuracy | 1.00 |
| clarification recall | 1.00 |
| end-time exact | 1.00 |
| event precision | 0.85 |
| **event recall (start-exact)** | **0.77** |
| event F1 | 0.81 |
| title similarity | 0.87 |
Discipline (never invents events, always asks when ambiguous) is perfect; all 9
relative-date cases passed. The gap is **exact start datetime** on a few
explicit far-future dates (misses: `e02`, `e05`, `e06`, `e15`, one leg of `m02`).
## The 3 steps
### 1. Diagnose the 5 misses (cheap)
Enhance `eval.py` to dump the model's actual `start`/`title` for mismatched events,
then one re-run shows whether they're date-shift, time/AM-PM, or wrong-year errors β
which tells us exactly what training data to add. (~one A100 eval run; the GGUF is
cached in the Modal Volume, so it's fast.)
### 2. Baseline comparison (the "Well-Tuned" proof)
Run `modal run training/modal_eval.py --model-hf-repo unsloth/gemma-4-31B-it-GGUF`
to score **stock** Gemma-4-31B on the same set. If the fine-tune's discipline
(no-event 1.0, clarification 1.0) and datetime recall beat stock, that's concrete
evidence the fine-tune helps. (Separate ~18 GB model download + A100 time.)
### 3. Close the gap
Add ~15β20 explicit-date examples (especially next-month dates and times) to
`training/data/dataset.jsonl`, re-train on Modal (`training/modal_train.py`),
re-eval β and watch start-exact recall move.
## Results log
### Step 1 β diagnosis (2026-06-09)
The mismatch dump showed the misses are **not** a reasoning failure. 3 of 5 are the
same bug β a dropped year digit, **"206" instead of "2026"** β on next-month dates
(month/day/time all correct):
```
[e02] gold 2026-10-06T15:30 pred 206-10-06T15:30
[e05] gold 2026-10-01T08:15 pred 206-10-01T08:15
[e15] gold 2026-10-08T19:00 pred 206-10-08T19:00
[e06] gold 2026-09-28T09:00 pred [] (abstained)
[m02] Standup + Sprint demo pred Standup only (dropped 2nd leg)
```
Fix indicated: more far-future explicit-date examples reinforcing 4-digit years
(+ multi-event 2nd legs). β Step 3.
### Step 2 β baseline vs fine-tune (2026-06-09, n=28, Q4_K_M)
| Metric | Stock `gemma-4-31B-it-GGUF` | Fine-tune `gemma-4-cal-gguf` |
| --- | --- | --- |
| schema validity | 1.00 | 1.00 |
| event precision | **1.00** | 0.85 |
| start-exact recall | **0.955** | 0.773 |
| event F1 | **0.977** | 0.81 |
| end-exact | 1.00 | 1.00 |
| no-event accuracy | 1.00 | 1.00 |
| clarification recall | 0.75 | **1.00** |
**Honest read:** stock Gemma-4-31B is already strong at this extraction and *beats*
the current fine-tune on datetime recall β the "206" bug is a fine-tune regression.
The fine-tune's only clear win is **clarification discipline** (asks when a thread is
"date TBD"; stock missed `q04`). As-is, the fine-tune is **not** justified on
extraction. Step 3 must fix the year regression and clear baseline's 0.955 recall
while keeping clarification at 1.00 β otherwise the better play is stock + the
fine-tune's clarification behavior via prompting.
### Step 3 β after gap-closing retrain (2026-06-09) β REGRESSED
Dataset grown 69 β 87 (+18 OctβDec 2026 explicit-date examples, disjoint from eval),
same 2-epoch recipe, re-quantized to Q4_K_M and republished. Re-eval (n=28):
| Metric | Stock 31B | Fine-tune v1 (69) | **Fine-tune v2 (87, retrained)** |
| --- | --- | --- | --- |
| schema validity | 1.00 | 1.00 | **0.75** |
| event precision | 1.00 | 0.85 | **0.476** |
| start-exact recall | 0.955 | 0.773 | **0.455** |
| event F1 | 0.977 | 0.81 | **0.465** |
| end-exact | 1.00 | 1.00 | 1.00 |
| no-event accuracy | 1.00 | 1.00 | 1.00 |
| clarification recall | 0.75 | 1.00 | **0.75** |
**The naive retrain made it worse, not better.** New failure modes: unparseable/empty
JSON (validity 1.0β0.75), duplicate events, hallucinated "Drive to β¦" events,
transposed/garbage years (`2062`, `2062-15:00:00`), and previously-passing relative
dates now empty. Cause: overfitting β 18 of 87 examples were near-identical far-future
templates, biasing a tiny dataset and degrading general formatting/extraction.
## Conclusions & recommendation
1. **Stock Gemma-4-31B is already strong** at this extraction (F1 0.98). The only
thing fine-tuning reliably *added* was clarification discipline (v1: 1.00 vs stock
0.75) β and even that was lost in v2.
2. **Tiny-dataset SFT is fragile here.** v1 (69 ex) underperformed stock on dates;
v2 (87 ex) regressed hard. More data of the *same shape* hurt.
3. **Recommended path** (pick one):
- **Ship stock + prompt for clarification** β simplest; recover the one real win
without the regressions. (Lowest risk.)
- **If keeping a fine-tune:** rebuild the dataset much larger and *diverse* (not
template-heavy), drop to ~1 epoch with regularization, and **gate every retrain
on this eval** (only publish if it beats the current best). Consider a higher
quant (Q5/Q6) to rule out the `"206"`/`2062` digit corruption being quant-driven.
4. **Action β revert the live model.** v2 (worse) overwrote v1 in
`ParetoOptimal/gemma-4-cal-gguf`. Restore v1 (the better fine-tune) or point the
Space back at stock `unsloth/gemma-4-31B-it-GGUF` until a fine-tune *beats* the
eval baseline.
**Bottom line: the eval did its job β it caught a regression before it reached users,
and it says the current fine-tune is not yet worth shipping over stock.**
## Follow-up (2026-06-09)
### Live model restored to v1
v2 (regressed) was rolled back: `gemma-cal-Q4_K_M.gguf` in the repo was restored to the
v1 LFS object via a server-side `CommitOperationCopy` (no transfer, no GPU). Production
serves the better v1 again.
### Dataset rebuilt larger + more diverse (69 β 122)
Added a diversity batch (`gen_new_seeds.MORE_SEEDS3`): varied date/time formats
(`10/15`, "the 3rd", "half past 7", "0900", "noon", "midnight"), reschedules,
cancellations, recurring, all-day, deadlines (EOD/midnight), past & hypothetical
(must NOT schedule), richer no-event & clarify, and varied image sources (ticket,
invite screenshot, notice). Goal: counter the template-heavy skew that overfit v2.
Verified valid + disjoint from `eval.jsonl`.
### Eval-gating is now the publishing process
**No retrain publishes unless it beats the eval.** `training/gated_retrain.py`:
1. retrain on Modal β upload to a **staging** filename (`gemma-cal-staging-Q4_K_M.gguf`)
in the repo (production file untouched; mmproj skipped β `--skip-mmproj`);
2. eval the staging file (`modal_eval.py --model-file β¦`);
3. gate: `schema_validity β₯ 0.95`, `event_f1 β₯ 0.81`, `start-exact recall β₯ 0.773`
(defaults = the current best, v1) β tune via `--gate-f1/--gate-recall`;
4. **PASS** β promote staging β production via server-side `CommitOperationCopy` (free);
**FAIL** β delete staging, production unchanged.
Run: `python training/gated_retrain.py [--epochs 1 --gate-f1 β¦ --gate-recall β¦]`.
### Step 4 β first eval-gated retrain (122 ex, 1 epoch) β GATE FAILED β
(protected prod)
The retrain scored **worse** than every prior version and the gate refused to publish:
| Metric | Stock | v1 (live) | v3 staging (122, 1ep) |
| --- | --- | --- | --- |
| schema validity | 1.00 | 1.00 | **0.46** |
| event F1 | 0.977 | 0.81 | **0.214** |
| start-exact recall | 0.955 | 0.773 | **0.136** |
| no-event accuracy | 1.00 | 1.00 | 1.00 |
| clarification recall | 0.75 | 1.00 | 1.00 |
>Β½ of outputs were unparseable; extraction collapsed. **Gate: FAIL β staging deleted,
production unchanged (still v1).** The gate worked exactly as intended.
## Verdict (after 3 fine-tune attempts)
All three fine-tunes β v1 (69 ex / 2 ep), v2 (87 / 2 ep), v3 (122 / 1 ep) β **underperform
stock Gemma-4-31B**, and the larger runs broke JSON validity. Only the safety behaviors
(no-event, clarification) survive fine-tuning; extraction degrades. **QLoRA-on-31B-Q4 here
is fragile and not worth shipping over stock.** Recommended: serve **stock
`unsloth/gemma-4-31B-it-GGUF`** and recover the one fine-tune win (clarification) via the
prompt. Keep v1 as the published fine-tune for the "Well-Tuned" artifact, but don't route
production extraction through it. Revisit fine-tuning only with a substantially larger, more
varied dataset and a recipe that holds schema validity at 1.0 β gated, as now, on this eval.
## Step 5 β quantization-penalty test (2026-06-09): quant EXONERATED
Hypothesis: maybe Q4 quantization (the `"206"`/`2062` digit bug) was tanking the fine-tune.
Tested the SAME fine-tuned weights (`gemma-cal-f16.gguf`, v2/87-ex β best fp16 still on the
volume) at three precisions on the 28-example eval (`training/modal_quant_eval.py`):
| precision | schema validity | event F1 | start-exact recall |
| --- | --- | --- | --- |
| f16 (full) | 0.643 | 0.571 | 0.545 |
| Q8_0 | 0.679 | 0.565 | 0.591 |
| Q4_K_M | 0.75 | 0.465 | 0.455 |
| base (stock) | 1.00 | 0.977 | 0.955 |
**Quantization is not the cause.** At full fp16 the fine-tune still scores validity 0.64 / F1
0.57 β nowhere near base; validity is actually *lower* at f16 than Q4, so quant isn't breaking
the JSON. Precision buys only ~+0.1 F1/recall (Q4βQ8/f16), a fraction of the gap to base. The
degradation is the **SFT itself**, not the GGUF conversion. Step 2 (retrain at Q8 to beat base)
is **not pursued** β the gate would fail. (Caveat: v1's fp16 was overwritten, so this used v2;
a definitive v1 test needs a retrain, but the small quant lift makes a base-beating result
improbable.)
### Final recommendation
A higher quant won't make the fine-tune beat base, and an automation agent (e.g. `ml-intern`)
doesn't change the binding constraints (near-ceiling base; small data; SFT degrades
instruction-following). **Serve stock `unsloth/gemma-4-31B-it-GGUF`** and recover the
clarification behavior via the system prompt; keep v1 as the "Well-Tuned" artifact. Only
revisit fine-tuning with a substantially larger, real, diverse dataset + a validity-preserving
recipe (low LR, few steps), always gated on this eval.
## Real training data: SMCalFlow importer
`training/import_smcalflow.py` converts **SMCalFlow** (Microsoft Semantic Machines, **CC BY-SA
4.0**) calendar dialogues into our `ActionPlan` format. SMCalFlow encodes events as LISP
"dataflow" programs; the importer parses `CreatePreflightEventWrapper` turns, extracts
subject/start/location/attendees, and **resolves** date/time constructs (`Tomorrow`, `NextDOW`,
`MD`, `NumberPM`, `HourMinuteMilitary`, β¦) against a per-example reference `now` spread across
2026 β so relative dates become concrete, self-consistent targets (directly trains the failing
date/time skill, with varied 4-digit years). Conservative: only emits a row when a title AND an
explicit start time resolve (~7.5k usable turns from train+valid).
- Run: `python training/import_smcalflow.py --limit 2000 --heldout 200` β writes
`training/data/smcalflow_train.jsonl` (+ `β¦_heldout.jsonl`). **Both are git-ignored** (CC BY-SA
share-alike vs this repo's Apache-2.0 β we don't commit/redistribute the derived data; the
importer code is ours) and **disjoint from `eval.jsonl`**.
- `train_qlora.py` now trains on `dataset.jsonl` **+** `smcalflow_train.jsonl` (when present).
`gated_retrain.py` therefore trains on real data, and still **only publishes if it beats the
gate** β so a bigger-but-worse model can't reach production.
- Attribution (required by CC BY-SA): *Semantic Machines et al., "Task-Oriented Dialogue as
Dataflow Synthesis," TACL 2020.*
## Step 6 β eval-gated retrain on REAL data (2026-06-09): FAILED gate (worst yet)
Trained the 31B on **2,122 examples** (122 hand-authored + 2,000 real SMCalFlow), 1 epoch,
through `gated_retrain.py` with a beat-base gate (F1β₯0.95, recallβ₯0.90). Result on the 28-ex eval:
| Metric | base | v1 (live) | real-data (2,122 ex) |
| --- | --- | --- | --- |
| schema validity | 1.00 | 1.00 | **0.107** |
| event F1 | 0.977 | 0.81 | **0.000** |
| start-exact recall | 0.955 | 0.773 | **0.000** |
~90% unparseable output, zero events extracted. **Gate FAIL β not promoted; production stays v1.**
### Verdict across 4 fine-tunes (now incl. real data)
Scores **monotonically worsen with more training/data**: v1 (69 synth, F1 0.81) β v2 (87, 0.465)
β v3 (122, 0.214) β real (2,122, 0.0). This is no longer a *data* problem β **the SFT recipe
itself degrades the model**, and more data makes it worse. Most likely root cause to investigate
*if* fine-tuning is ever revisited: a **train/inference chat-template mismatch** β `train_qlora.py`
formats with Unsloth's `get_chat_template("gemma")` while `llama-server` serves with the GGUF's
own `--jinja` template; if these differ for Gemma-4, training optimizes a format the server never
uses, and the divergence compounds with more steps (exactly the monotonic decay seen). Other
suspects: LR too high (2e-4) / catastrophic forgetting on a near-ceiling base.
**Final, evidence-backed recommendation: serve stock `unsloth/gemma-4-31B-it-GGUF`** (best by far)
and recover clarification via the system prompt. Do NOT route production through any current
fine-tune. The eval-gate has now correctly rejected 2 bad retrains β keep it as the publish gate.
## Step 7 β recipe fix + raw-output probe (2026-06-09): training stack implicated, fine-tuning HALTED
Fixed the suspected train/serve chat-template mismatch (PR #54): Gemma-4's native
`chat_template.jinja` uses a NEW `<|turn>role β¦ <turn|>` format (no `<start_of_turn>` at all),
while training forced unsloth's legacy "gemma" template. `train_qlora.py` now formats with the
tokenizer's native template (hard `<|turn>` assert), masks loss to the assistant turn, LR 5e-5.
Retrained on the 2,122-example set through the gate: **validity 0.0 β gate FAIL** (production
stays v1, third bad retrain rejected).
Diagnostics that pinpointed the cause:
- **GGUF template check (CPU, ~free):** our exported staging GGUF embeds the correct native
`<|turn>` template (16,934 chars, no `<start_of_turn>`) β train and serve formats are now
verifiably aligned. Template is exonerated as the remaining cause.
- **Raw-output probe (`/outputs/gemma-cal-staging-Q4_K_M.gguf`):** free generation emits pure
degenerate looping β `'Huddle β β β β β β¦'` to the token limit; constrained generation emits
512 tokens of nothing. **The weights are destroyed, not misformatted.**
With dataset (69β2,122), template (legacy/native), LR (2e-4/5e-5), and masking (on/off) all
varied, degradation always tracks training steps and ends in token-loop collapse. The remaining
common factor is **Unsloth's QLoRA path for Gemma-4-31B** (new architecture; training logs warn
`get_input_embeddings not auto-handled for Gemma4AudioModel`). **Fine-tuning is halted** until
that stack demonstrably works for this arch (or is replaced with plain transformers+PEFT).
## Step 8 β improve served evals via prompt (stock + targeted SYSTEM additions)
Base's only eval misses are prompt-fixable: m03 dropped the 2nd event of a multi-event thread;
q04 didn't ask clarification on a "TBD" plan. Added two surgical SYSTEM lines (list every
distinct event separately; ask via needs_clarification when day/time is TBD).
**Result: PERFECT SCORE β 1.0 on every metric (n=28, tp/fp/fn = 22/0/0).**
| Metric | base (old prompt) | **base + new prompt** |
| --- | --- | --- |
| schema validity | 1.00 | **1.00** |
| event precision | 1.00 | **1.00** |
| start-exact recall | 0.955 | **1.00** |
| event F1 | 0.977 | **1.00** |
| no-event accuracy | 1.00 | **1.00** |
| clarification recall | 0.75 | **1.00** |
Both misses fixed, nothing regressed. **This is the production configuration: stock
`unsloth/gemma-4-31B-it-GGUF` + the updated SYSTEM prompt.** (Set Space var
`MODEL_HF_REPO=unsloth/gemma-4-31B-it-GGUF`; the prompt ships with the app.) The "Well-Tuned"
artifact remains `ParetoOptimal/gemma-4-cal-gguf` (v1); any future fine-tune must beat THIS
1.0 baseline through the gate β i.e., match it and win on a harder, expanded eval set.
## Step 9 β the E4B edge-model campaign (2026-06-10)
Re-aimed fine-tuning where it has headroom: a **Gemma-4 E4B (~8B)** edge model that runs without a
paid A100, gated against **stock E4B**. Six gated runs, each fixing a diagnosed failure (the fixed
recipe trained cleanly every time β validity 1.0 throughout, confirming the Step-7 breakage was
specific to the 31B path):
| run | change | F1 | recall | clarify | eval |
| --- | --- | --- | --- | --- | --- |
| #1 | fixed recipe, 2,122 ex | 0.884 | 0.864 | 1.0 | n=28 |
| #2 | + weekday-in-prompt (+data regen) | 0.955 | 0.955 | 0.75 | n=28 |
| #3 | + next-DOW conflict filter (74 rows), 4Γ hand | **1.0** | **1.0** | 0.75 | n=28 |
| #4 | + TBD-clarify seeds, 8Γ hand | 0.93 | 0.909 | 1.0 | n=28 |
| #5 | clarify seeds, 4Γ hand | 0.93 | 0.909 | 1.0 | n=28 |
| β | **eval expanded 28β60** (50 events; jitter-resistant) | | | | |
| #6 | + Batch-7 seeds (next-DOW, "opens") | 0.97 | 0.96 | 1.0 | n=60 |
| stock E4B (weekday prompt) | | 0.97 | 0.96 | 1.0 | n=60 |
Run #6 vs stock is an **exact statistical tie** (identical tp/fp/fn 48/1/2; both miss `e09`
"next Tuesday" β which resisted 7 explicit training seeds β and one "opens" case each).
Campaign side effects that improved the PRODUCT for every model: weekday-in-prompt, the
next-DOW convention cleanup, and the 60-example eval.
## Step 10 β bare-prompt (internalization) test: no decisive gap
Dropped the system prompt for both models (identical minimal user content, same JSON-schema
constraint; `modal_eval.py --minimal-prompt`), measuring internalized task knowledge:
| bare, n=60 | stock E4B | fine-tuned E4B |
| --- | --- | --- |
| schema validity | 0.967 | **1.0** |
| event F1 | **0.682** | 0.644 |
| start-exact recall | **0.60** | 0.56 |
| no-event accuracy | 0.70 | **0.80** |
| clarification recall | 0.50 | **0.625** |
Small trade-offs both ways, within noise. **Verdict: at this data scale (139 hand + 2,000
SMCalFlow) with QLoRA/1-epoch, the E4B fine-tune reaches PARITY with stock, not superiority** β
non-degraded, perfect validity everywhere, better bare-prompt discipline, slightly weaker bare
extraction. The strict-dominance gate therefore never auto-promoted it; the candidate GGUF
remains on the Modal volume (`/outputs/gemma-cal-e4b-staging-Q4_K_M.gguf`). Publishing it as
the project's edge model at parity is a **product decision** (zero quality cost; production
would serve our own fine-tune, fulfilling "Well-Tuned") β deliberately left to the owner, not
the gate.
|