# Eval roadmap — improving the scheduling fine-tune

How we measure and improve `ParetoOptimal/gemma-4-cal-gguf` (the fine-tuned
Gemma-4-31B that turns chat/images into a calendar `ActionPlan`). The eval is
**task-specific** — generic LLM benchmarks (MMLU etc.) don't apply.

Harness: `training/eval.py` (scores), `training/gen_eval.py` + `training/data/eval.jsonl`
(28 held-out examples, disjoint from `dataset.jsonl`), `training/modal_eval.py`
(serves the GGUF on the same `llama-server` the Space uses, then scores).

## Baseline scores (Q4_K_M, n=28, 2026-06-09)

| Metric | Score |
| --- | --- |
| schema validity | 1.00 |
| no-event accuracy | 1.00 |
| clarification recall | 1.00 |
| end-time exact | 1.00 |
| event precision | 0.85 |
| **event recall (start-exact)** | **0.77** |
| event F1 | 0.81 |
| title similarity | 0.87 |

Discipline (never invents events, always asks when ambiguous) is perfect; all 9
relative-date cases passed. The gap is **exact start datetime** on a few
explicit far-future dates (misses: `e02`, `e05`, `e06`, `e15`, one leg of `m02`).

## The 3 steps

### 1. Diagnose the 5 misses (cheap)
Enhance `eval.py` to dump the model's actual `start`/`title` for mismatched events,
then one re-run shows whether they're date-shift, time/AM-PM, or wrong-year errors —
which tells us exactly what training data to add. (~one A100 eval run; the GGUF is
cached in the Modal Volume, so it's fast.)

### 2. Baseline comparison (the "Well-Tuned" proof)
Run `modal run training/modal_eval.py --model-hf-repo unsloth/gemma-4-31B-it-GGUF`
to score **stock** Gemma-4-31B on the same set. If the fine-tune's discipline
(no-event 1.0, clarification 1.0) and datetime recall beat stock, that's concrete
evidence the fine-tune helps. (Separate ~18 GB model download + A100 time.)

### 3. Close the gap
Add ~15–20 explicit-date examples (especially next-month dates and times) to
`training/data/dataset.jsonl`, re-train on Modal (`training/modal_train.py`),
re-eval — and watch start-exact recall move.

## Results log

### Step 1 — diagnosis (2026-06-09)
The mismatch dump showed the misses are **not** a reasoning failure. 3 of 5 are the
same bug — a dropped year digit, **"206" instead of "2026"** — on next-month dates
(month/day/time all correct):

```
[e02] gold 2026-10-06T15:30  pred 206-10-06T15:30
[e05] gold 2026-10-01T08:15  pred 206-10-01T08:15
[e15] gold 2026-10-08T19:00  pred 206-10-08T19:00
[e06] gold 2026-09-28T09:00  pred []                 (abstained)
[m02] Standup + Sprint demo  pred Standup only       (dropped 2nd leg)
```

Fix indicated: more far-future explicit-date examples reinforcing 4-digit years
(+ multi-event 2nd legs). → Step 3.
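
A minimal sketch of the bucketing this dump supports, assuming gold and predicted starts are ISO-like strings as shown above. The helper name `classify_miss` and its category labels are illustrative and are not taken from `eval.py`.

```python
import re
from typing import Optional

# Accept any digit count for the year so "206-..." still parses and can be flagged.
ISO_LIKE = re.compile(r"^(\d+)-(\d{2})-(\d{2})T(\d{2}):(\d{2})")


def classify_miss(gold: str, pred: Optional[str]) -> str:
    """Label how a predicted start differs from the gold start (hypothetical helper)."""
    if not pred:
        return "abstained"
    g, p = ISO_LIKE.match(gold), ISO_LIKE.match(pred)
    if g is None:
        raise ValueError(f"gold start is not ISO-like: {gold!r}")
    if p is None:
        return "unparseable"
    g_year, g_month, g_day, g_hour, g_min = g.groups()
    p_year, p_month, p_day, p_hour, p_min = p.groups()
    if len(p_year) != 4:
        return "malformed-year"
    if p_year != g_year:
        return "wrong-year"
    if (p_month, p_day) != (g_month, g_day):
        return "date-shift"
    if (p_hour, p_min) != (g_hour, g_min):
        return "time-error"
    return "match"


# Two cases copied from the dump above.
for case_id, gold, pred in [
    ("e02", "2026-10-06T15:30", "206-10-06T15:30"),
    ("e06", "2026-09-28T09:00", None),
]:
    print(case_id, classify_miss(gold, pred))
```

Checking the year's digit count before comparing values keeps the dropped-digit bug separate from a genuinely wrong year, which is the distinction that decides what training data to add.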

### Step 2 — baseline vs fine-tune (2026-06-09, n=28, Q4_K_M)

| Metric | Stock `gemma-4-31B-it-GGUF` | Fine-tune `gemma-4-cal-gguf` |
| --- | --- | --- |
| schema validity | 1.00 | 1.00 |
| event precision | **1.00** | 0.85 |
| start-exact recall | **0.955** | 0.773 |
| event F1 | **0.977** | 0.81 |
| end-exact | 1.00 | 1.00 |
| no-event accuracy | 1.00 | 1.00 |
| clarification recall | 0.75 | **1.00** |

**Honest read:** stock Gemma-4-31B is already strong at this extraction and *beats*
the current fine-tune on datetime recall — the "206" bug is a fine-tune regression.
The fine-tune's only clear win is **clarification discipline** (asks when a thread is
"date TBD"; stock missed `q04`). As-is, the fine-tune is **not** justified on
extraction. Step 3 must fix the year regression and clear baseline's 0.955 recall
while keeping clarification at 1.00 — otherwise the better play is stock + the
fine-tune's clarification behavior via prompting.

### Step 3 — after gap-closing retrain (2026-06-09) — REGRESSED
Dataset grown 69 → 87 (+18 Oct–Dec 2026 explicit-date examples, disjoint from eval),
same 2-epoch recipe, re-quantized to Q4_K_M and republished. Re-eval (n=28):

| Metric | Stock 31B | Fine-tune v1 (69) | **Fine-tune v2 (87, retrained)** |
| --- | --- | --- | --- |
| schema validity | 1.00 | 1.00 | **0.75** |
| event precision | 1.00 | 0.85 | **0.476** |
| start-exact recall | 0.955 | 0.773 | **0.455** |
| event F1 | 0.977 | 0.81 | **0.465** |
| end-exact | 1.00 | 1.00 | 1.00 |
| no-event accuracy | 1.00 | 1.00 | 1.00 |
| clarification recall | 0.75 | 1.00 | **0.75** |

**The naive retrain made it worse, not better.** New failure modes: unparseable/empty
JSON (validity 1.0→0.75), duplicate events, hallucinated "Drive to …" events,
transposed/garbage years (`2062`, `2062-15:00:00`), and previously-passing relative
dates now empty. Suspected cause: overfitting. 18 of 87 examples were near-identical far-future
templates, biasing a tiny dataset and degrading general formatting/extraction.

## Conclusions & recommendation

1. **Stock Gemma-4-31B is already strong** at this extraction (F1 0.98). The only
   thing fine-tuning reliably *added* was clarification discipline (v1: 1.00 vs stock
   0.75) — and even that was lost in v2.
2. **Tiny-dataset SFT is fragile here.** v1 (69 ex) underperformed stock on dates;
   v2 (87 ex) regressed hard. More data of the *same shape* hurt.
3. **Recommended path** (pick one):
   - **Ship stock + prompt for clarification** — simplest; recover the one real win
     without the regressions. (Lowest risk.)
   - **If keeping a fine-tune:** rebuild the dataset much larger and *diverse* (not
     template-heavy), drop to ~1 epoch with regularization, and **gate every retrain
     on this eval** (only publish if it beats the current best). Consider a higher
     quant (Q5/Q6) to rule out the `"206"`/`2062` digit corruption being quant-driven.
4. **Action — revert the live model.** v2 (worse) overwrote v1 in
   `ParetoOptimal/gemma-4-cal-gguf`. Restore v1 (the better fine-tune) or point the
   Space back at stock `unsloth/gemma-4-31B-it-GGUF` until a fine-tune *beats* the
   eval baseline.

**Bottom line: the eval did its job. It caught the regression, and it says the current
fine-tune is not yet worth shipping over stock.**

## Follow-up (2026-06-09)

### Live model restored to v1
v2 (regressed) was rolled back: `gemma-cal-Q4_K_M.gguf` in the repo was restored to the
v1 LFS object via a server-side `CommitOperationCopy` (no transfer, no GPU). Production
serves the better v1 again.

### Dataset rebuilt larger + more diverse (69 → 122)
Added a diversity batch (`gen_new_seeds.MORE_SEEDS3`): varied date/time formats
(`10/15`, "the 3rd", "half past 7", "0900", "noon", "midnight"), reschedules,
cancellations, recurring, all-day, deadlines (EOD/midnight), past & hypothetical
(must NOT schedule), richer no-event & clarify, and varied image sources (ticket,
invite screenshot, notice). Goal: counter the template-heavy skew that overfit v2.
Verified valid + disjoint from `eval.jsonl`.
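
A disjointness check of this kind can be sketched as below. This is an assumption-laden illustration: the `input` field name, the normalization rule, and the helper names are hypothetical, and the sample rows are invented for the demo rather than drawn from `dataset.jsonl` or `eval.jsonl`.

```python
import json
from typing import Iterable, Set


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not hide overlap."""
    return " ".join(text.lower().split())


def find_overlap(train_lines: Iterable[str], eval_lines: Iterable[str], key: str = "input") -> Set[str]:
    """Return normalized inputs that appear in both JSONL streams (`key` is an assumed field name)."""
    def inputs(lines: Iterable[str]) -> Set[str]:
        return {normalize(json.loads(line)[key]) for line in lines if line.strip()}

    return inputs(train_lines) & inputs(eval_lines)


# Illustrative rows only; real rows would be read from the two JSONL files.
train = ['{"input": "Dinner at noon on 10/15"}', '{"input": "Standup  at 0900"}']
held_out = ['{"input": "standup at 0900"}', '{"input": "Lunch on the 3rd"}']
print(find_overlap(train, held_out))
```

An empty result is what "disjoint" means here; any returned string is a held-out example that leaked into training under this normalization.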

### Eval-gating is now the publishing process
**No retrain publishes unless it beats the eval.** `training/gated_retrain.py`:
1. retrain on Modal → upload to a **staging** filename (`gemma-cal-staging-Q4_K_M.gguf`)
   in the repo (production file untouched; mmproj skipped — `--skip-mmproj`);
2. eval the staging file (`modal_eval.py --model-file …`);
3. gate: `schema_validity ≥ 0.95`, `event_f1 ≥ 0.81`, `start-exact recall ≥ 0.773`
   (defaults = the current best, v1) — tune via `--gate-f1/--gate-recall`;
4. **PASS** → promote staging → production via server-side `CommitOperationCopy` (free);
   **FAIL** → delete staging, production unchanged.

Run: `python training/gated_retrain.py [--epochs 1 --gate-f1 … --gate-recall …]`.
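
The gate decision in step 3 can be sketched as a simple threshold check. This is a hedged illustration, not the code in `gated_retrain.py`: the function name `check_gate` and the `start_exact_recall` key are assumptions, while the thresholds and the sample scores are the ones reported in this document.

```python
from typing import Dict, List, Tuple

# Default floors copied from the gate description (the v1 scores).
DEFAULT_GATE: Dict[str, float] = {
    "schema_validity": 0.95,
    "event_f1": 0.81,
    "start_exact_recall": 0.773,  # assumed key name for start-exact recall
}


def check_gate(scores: Dict[str, float], gate: Dict[str, float] = DEFAULT_GATE) -> Tuple[bool, List[str]]:
    """Return (passed, failed metric names); a missing metric counts as a failure."""
    failures = [name for name, floor in gate.items() if scores.get(name, 0.0) < floor]
    return (not failures, failures)


# Scores from the Step 3 table: v1 (current best) and v2 (regressed retrain).
v1 = {"schema_validity": 1.00, "event_f1": 0.81, "start_exact_recall": 0.773}
v2 = {"schema_validity": 0.75, "event_f1": 0.465, "start_exact_recall": 0.455}
print(check_gate(v1))
print(check_gate(v2))
```

With `≥` floors, a retrain that merely ties v1 passes; treating a missing metric as a failure means a crashed or partial eval cannot promote a model by accident.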

### Step 4 — first eval-gated retrain (122 ex, 1 epoch) — GATE FAILED ✅ (protected prod)
The retrain scored **worse** than every prior version and the gate refused to publish:

| Metric | Stock | v1 (live) | v3 staging (122, 1ep) |
| --- | --- | --- | --- |
| schema validity | 1.00 | 1.00 | **0.46** |
| event F1 | 0.977 | 0.81 | **0.214** |
| start-exact recall | 0.955 | 0.773 | **0.136** |
| no-event accuracy | 1.00 | 1.00 | 1.00 |
| clarification recall | 0.75 | 1.00 | 1.00 |

More than half of the outputs were unparseable, and extraction collapsed. **Gate: FAIL → staging
deleted, production unchanged (still v1).** The gate worked exactly as intended.

## Verdict (after 3 fine-tune attempts)
All three fine-tunes — v1 (69 ex / 2 ep), v2 (87 / 2 ep), v3 (122 / 1 ep) — **underperform
stock Gemma-4-31B**, and the larger runs broke JSON validity. Only the safety behaviors
(no-event, clarification) survive fine-tuning; extraction degrades. **QLoRA-on-31B-Q4 here
is fragile and not worth shipping over stock.** Recommended: serve **stock
`unsloth/gemma-4-31B-it-GGUF`** and recover the one fine-tune win (clarification) via the
prompt. Keep v1 as the published fine-tune for the "Well-Tuned" artifact, but don't route
production extraction through it. Revisit fine-tuning only with a substantially larger, more
varied dataset and a recipe that holds schema validity at 1.0 — gated, as now, on this eval.

## Step 5 — quantization-penalty test (2026-06-09): quant EXONERATED
Hypothesis: Q4 quantization might be causing the `"206"`/`2062` digit corruption and tanking the
fine-tune. Tested the SAME fine-tuned weights (`gemma-cal-f16.gguf`, v2/87-ex, the best fp16 still
on the volume) at three precisions on the 28-example eval (`training/modal_quant_eval.py`):

| precision | schema validity | event F1 | start-exact recall |
| --- | --- | --- | --- |
| f16 (full) | 0.643 | 0.571 | 0.545 |
| Q8_0 | 0.679 | 0.565 | 0.591 |
| Q4_K_M | 0.75 | 0.465 | 0.455 |
| base (stock) | 1.00 | 0.977 | 0.955 |

**Quantization is not the cause.** At full fp16 the fine-tune still scores validity 0.64 / F1
0.57 — nowhere near base; validity is actually *lower* at f16 than Q4, so quant isn't breaking
the JSON. Precision buys only ~+0.1 F1/recall (Q4→Q8/f16), a fraction of the gap to base. The
degradation is the **SFT itself**, not the GGUF conversion. The follow-up phase of this test
(retrain at Q8 to beat base) is **not pursued** because the gate would fail. (Caveat: v1's fp16 was
overwritten, so this used v2; a definitive v1 test needs a retrain, but the small quant lift makes
a base-beating result improbable.)
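
To make "a fraction of the gap to base" concrete, the arithmetic can be sketched as follows. The scores are copied from the table above; the helper name `gap_closed` and the short `recall` key are illustrative.

```python
# Scores copied from the Step 5 table (same fine-tuned weights at three precisions, plus base).
scores = {
    "f16": {"event_f1": 0.571, "recall": 0.545},
    "Q8_0": {"event_f1": 0.565, "recall": 0.591},
    "Q4_K_M": {"event_f1": 0.465, "recall": 0.455},
    "base": {"event_f1": 0.977, "recall": 0.955},
}


def gap_closed(metric: str) -> float:
    """Share of the Q4-to-base gap recovered by the best higher-precision variant."""
    q4 = scores["Q4_K_M"][metric]
    best = max(scores["f16"][metric], scores["Q8_0"][metric])
    return (best - q4) / (scores["base"][metric] - q4)


for metric in ("event_f1", "recall"):
    print(metric, round(gap_closed(metric), 2))
```

Under these numbers, higher precision recovers about 0.21 of the F1 gap and 0.27 of the recall gap, so most of the shortfall against base remains even with no quantization at all.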

### Final recommendation
A higher quant won't make the fine-tune beat base, and an automation agent (e.g. `ml-intern`)
doesn't change the binding constraints (near-ceiling base; small data; SFT degrades
instruction-following). **Serve stock `unsloth/gemma-4-31B-it-GGUF`** and recover the
clarification behavior via the system prompt; keep v1 as the "Well-Tuned" artifact. Only
revisit fine-tuning with a substantially larger, real, diverse dataset + a validity-preserving
recipe (low LR, few steps), always gated on this eval.

## Real training data: SMCalFlow importer
`training/import_smcalflow.py` converts **SMCalFlow** (Microsoft Semantic Machines, **CC BY-SA
4.0**) calendar dialogues into our `ActionPlan` format. SMCalFlow encodes events as LISP
"dataflow" programs; the importer parses `CreatePreflightEventWrapper` turns, extracts
subject/start/location/attendees, and **resolves** date/time constructs (`Tomorrow`, `NextDOW`,
`MD`, `NumberPM`, `HourMinuteMilitary`, …) against a per-example reference `now` spread across
2026 — so relative dates become concrete, self-consistent targets (directly trains the failing
date/time skill, with varied 4-digit years). Conservative: only emits a row when a title AND an
explicit start time resolve (~7.5k usable turns from train+valid).

- Run: `python training/import_smcalflow.py --limit 2000 --heldout 200` → writes
  `training/data/smcalflow_train.jsonl` (+ `…_heldout.jsonl`). **Both are git-ignored** (CC BY-SA
  share-alike vs this repo's Apache-2.0 → we don't commit/redistribute the derived data; the
  importer code is ours) and **disjoint from `eval.jsonl`**.
- `train_qlora.py` now trains on `dataset.jsonl` **+** `smcalflow_train.jsonl` (when present).
  `gated_retrain.py` therefore trains on real data, and still **only publishes if it beats the
  gate** — so a bigger-but-worse model can't reach production.
- Attribution (required by CC BY-SA): *Semantic Machines et al., "Task-Oriented Dialogue as
  Dataflow Synthesis," TACL 2020.*
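
As an illustration of resolving relative constructs against a reference `now`, here is a minimal sketch. It is not the importer's code: the function names are hypothetical, and the "strictly after today" reading of `NextDOW` is an assumption that may differ from the convention the importer actually uses.

```python
from datetime import date, timedelta

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def resolve_tomorrow(now: date) -> date:
    """`Tomorrow` resolves to the day after the reference date."""
    return now + timedelta(days=1)


def resolve_next_dow(now: date, weekday: str) -> date:
    """Next occurrence of `weekday` strictly after `now`, i.e. 1 to 7 days ahead.

    ASSUMPTION: this convention is illustrative, not confirmed by the importer.
    """
    target = WEEKDAYS.index(weekday)
    days_ahead = (target - now.weekday() - 1) % 7 + 1
    return now + timedelta(days=days_ahead)


now = date(2026, 6, 9)  # example per-row reference date
print(resolve_tomorrow(now).isoformat())
print(resolve_next_dow(now, "Friday").isoformat())
```

Because each row carries its own `now`, the same utterance yields a different concrete target per row, which is what keeps the resolved dates self-consistent with the prompt.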

## Step 6 — eval-gated retrain on REAL data (2026-06-09): FAILED gate (worst yet)
Trained the 31B on **2,122 examples** (122 hand-authored + 2,000 real SMCalFlow), 1 epoch,
through `gated_retrain.py` with a beat-base gate (F1≥0.95, recall≥0.90). Result on the 28-ex eval:

| Metric | base | v1 (live) | real-data (2,122 ex) |
| --- | --- | --- | --- |
| schema validity | 1.00 | 1.00 | **0.107** |
| event F1 | 0.977 | 0.81 | **0.000** |
| start-exact recall | 0.955 | 0.773 | **0.000** |

~90% unparseable output, zero events extracted. **Gate FAIL → not promoted; production stays v1.**

### Verdict across 4 fine-tunes (now incl. real data)
Scores **monotonically worsen with more training/data**: v1 (69 synth, F1 0.81) → v2 (87, 0.465)
→ v3 (122, 0.214) → real (2,122, 0.0). This is no longer a *data* problem — **the SFT recipe
itself degrades the model**, and more data makes it worse. Most likely root cause to investigate
*if* fine-tuning is ever revisited: a **train/inference chat-template mismatch** — `train_qlora.py`
formats with Unsloth's `get_chat_template("gemma")` while `llama-server` serves with the GGUF's
own `--jinja` template; if these differ for Gemma-4, training optimizes a format the server never
uses, and the divergence compounds with more steps (exactly the monotonic decay seen). Other
suspects: LR too high (2e-4) / catastrophic forgetting on a near-ceiling base.

**Final, evidence-backed recommendation: serve stock `unsloth/gemma-4-31B-it-GGUF`** (best by far)
and recover clarification via the system prompt. Do NOT route production through any current
fine-tune. The eval-gate has now correctly rejected 2 bad retrains — keep it as the publish gate.

## Step 7 — recipe fix + raw-output probe (2026-06-09): training stack implicated, fine-tuning HALTED
Fixed the suspected train/serve chat-template mismatch (PR #54): Gemma-4's native
`chat_template.jinja` uses a NEW `<|turn>role … <turn|>` format (no `<start_of_turn>` at all),
while training forced Unsloth's legacy "gemma" template. `train_qlora.py` now formats with the
tokenizer's native template (with a hard `<|turn>` assert), masks loss to the assistant turn, and
lowers the LR to 5e-5. Retrained on the 2,122-example set through the gate: **validity 0.0, gate
FAIL** (production stays v1, third bad retrain rejected).

Diagnostics that pinpointed the cause:
- **GGUF template check (CPU, ~free):** our exported staging GGUF embeds the correct native
  `<|turn>` template (16,934 chars, no `<start_of_turn>`) → train and serve formats are now
  verifiably aligned. Template is exonerated as the remaining cause.
- **Raw-output probe (`/outputs/gemma-cal-staging-Q4_K_M.gguf`):** free generation emits pure
  degenerate looping — `'Huddle — — — — — …'` to the token limit; constrained generation emits
  512 tokens of nothing. **The weights are destroyed, not misformatted.**
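
The template check above reduces to looking for turn markers in the template text. The sketch below assumes the template has already been extracted as a string; the helper names are hypothetical, the two markers are the ones named in this section, and the sample strings are toy fragments rather than real templates.

```python
NATIVE_MARKER = "<|turn>"          # Gemma-4 native turn opener, as described above
LEGACY_MARKER = "<start_of_turn>"  # marker used by the legacy "gemma" template


def template_kind(template_text: str) -> str:
    """Classify a chat template by which turn markers it contains."""
    has_native = NATIVE_MARKER in template_text
    has_legacy = LEGACY_MARKER in template_text
    if has_native and not has_legacy:
        return "native"
    if has_legacy and not has_native:
        return "legacy"
    return "mixed" if has_native else "unknown"


def assert_aligned(train_template: str, serve_template: str) -> None:
    """Raise if training and serving would not both use the native format."""
    kinds = (template_kind(train_template), template_kind(serve_template))
    if kinds != ("native", "native"):
        raise AssertionError(f"train/serve template mismatch: {kinds}")


# Toy fragments for illustration only.
native_like = "<|turn>user\n{{ content }}<turn|>"
legacy_like = "<start_of_turn>user\n{{ content }}<end_of_turn>"
print(template_kind(native_like), template_kind(legacy_like))
```

Running a check like this on both the training-side and the exported template before spending GPU time is cheap, which is why it makes sense as a hard assert rather than a post-hoc diagnostic.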

With dataset (69→2,122), template (legacy/native), LR (2e-4/5e-5), and masking (on/off) all
varied, degradation always tracks training steps and ends in token-loop collapse. The remaining
common factor is **Unsloth's QLoRA path for Gemma-4-31B** (new architecture; training logs warn
`get_input_embeddings not auto-handled for Gemma4AudioModel`). **Fine-tuning is halted** until
that stack demonstrably works for this arch (or is replaced with plain transformers+PEFT).

## Step 8 — improve served evals via prompt (stock + targeted SYSTEM additions)
Base's only eval misses are prompt-fixable: m03 dropped the 2nd event of a multi-event thread;
q04 didn't ask clarification on a "TBD" plan. Added two surgical SYSTEM lines (list every
distinct event separately; ask via needs_clarification when day/time is TBD).

**Result: PERFECT SCORE — 1.0 on every metric (n=28, tp/fp/fn = 22/0/0).**

| Metric | base (old prompt) | **base + new prompt** |
| --- | --- | --- |
| schema validity | 1.00 | **1.00** |
| event precision | 1.00 | **1.00** |
| start-exact recall | 0.955 | **1.00** |
| event F1 | 0.977 | **1.00** |
| no-event accuracy | 1.00 | **1.00** |
| clarification recall | 0.75 | **1.00** |

Both misses fixed, nothing regressed. **This is the production configuration: stock
`unsloth/gemma-4-31B-it-GGUF` + the updated SYSTEM prompt.** (Set Space var
`MODEL_HF_REPO=unsloth/gemma-4-31B-it-GGUF`; the prompt ships with the app.) The "Well-Tuned"
artifact remains `ParetoOptimal/gemma-4-cal-gguf` (v1); any future fine-tune must beat THIS
1.0 baseline through the gate — i.e., match it and win on a harder, expanded eval set.

## Step 9 — the E4B edge-model campaign (2026-06-10)
Re-aimed fine-tuning where it has headroom: a **Gemma-4 E4B (~8B)** edge model that runs without a
paid A100, gated against **stock E4B**. Six gated runs, each fixing a diagnosed failure (the fixed
recipe trained cleanly every time — validity 1.0 throughout, confirming the Step-7 breakage was
specific to the 31B path):

| run | change | F1 | recall | clarify | eval |
| --- | --- | --- | --- | --- | --- |
| #1 | fixed recipe, 2,122 ex | 0.884 | 0.864 | 1.0 | n=28 |
| #2 | + weekday-in-prompt (+data regen) | 0.955 | 0.955 | 0.75 | n=28 |
| #3 | + next-DOW conflict filter (74 rows), 4× hand | **1.0** | **1.0** | 0.75 | n=28 |
| #4 | + TBD-clarify seeds, 8× hand | 0.93 | 0.909 | 1.0 | n=28 |
| #5 | clarify seeds, 4× hand | 0.93 | 0.909 | 1.0 | n=28 |
| — | **eval expanded 28→60** (50 events; jitter-resistant) | | | | |
| #6 | + Batch-7 seeds (next-DOW, "opens") | 0.97 | 0.96 | 1.0 | n=60 |
| stock E4B (weekday prompt) | | 0.97 | 0.96 | 1.0 | n=60 |

Run #6 vs stock is an **exact tie** (identical tp/fp/fn of 48/1/2; both miss `e09`, a
"next Tuesday" case that resisted 7 explicit training seeds, plus one "opens" case each).
Campaign side effects that improved the PRODUCT for every model: weekday-in-prompt, the
next-DOW convention cleanup, and the 60-example eval.
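
For reference, the scores in the run #6 row follow from its tp/fp/fn counts by the standard definitions. The sketch below uses a hypothetical helper `prf1`; returning 0.0 for empty denominators is an assumed convention, not necessarily what `eval.py` does.

```python
from typing import Tuple


def prf1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 from event-level counts (0.0 when undefined, by assumption)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return precision, recall, f1


# Counts reported for run #6 and stock E4B on the 60-example eval.
p, r, f1 = prf1(tp=48, fp=1, fn=2)
print(round(p, 2), round(r, 2), round(f1, 2))
```

This prints `0.98 0.96 0.97`, matching the table. Since both models share the same counts, every derived metric is identical, not merely close.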

## Step 10 — bare-prompt (internalization) test: no decisive gap
Dropped the system prompt for both models (identical minimal user content, same JSON-schema
constraint; `modal_eval.py --minimal-prompt`), measuring internalized task knowledge:

| bare, n=60 | stock E4B | fine-tuned E4B |
| --- | --- | --- |
| schema validity | 0.967 | **1.0** |
| event F1 | **0.682** | 0.644 |
| start-exact recall | **0.60** | 0.56 |
| no-event accuracy | 0.70 | **0.80** |
| clarification recall | 0.50 | **0.625** |

Small trade-offs both ways, within noise. **Verdict: at this data scale (139 hand + 2,000
SMCalFlow) with QLoRA/1-epoch, the E4B fine-tune reaches PARITY with stock, not superiority** —
non-degraded, perfect validity everywhere, better bare-prompt discipline, slightly weaker bare
extraction. The strict-dominance gate therefore never auto-promoted it; the candidate GGUF
remains on the Modal volume (`/outputs/gemma-cal-e4b-staging-Q4_K_M.gguf`). Publishing it as
the project's edge model at parity is a **product decision** (zero quality cost; production
would serve our own fine-tune, fulfilling "Well-Tuned") — deliberately left to the owner, not
the gate.
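
A strict-dominance rule of the kind described here can be sketched as follows. This is an assumption about the rule's shape (no metric worse, at least one strictly better); the function name and metric keys are hypothetical, and the scores are copied from the Step 9 and Step 10 tables.

```python
from typing import Dict


def strictly_dominates(candidate: Dict[str, float], baseline: Dict[str, float]) -> bool:
    """True only if the candidate is no worse on every shared metric and better on at least one."""
    shared = candidate.keys() & baseline.keys()
    no_worse = all(candidate[m] >= baseline[m] for m in shared)
    better_somewhere = any(candidate[m] > baseline[m] for m in shared)
    return bool(shared) and no_worse and better_somewhere


# Full-prompt eval, n=60 (Step 9): identical scores, so a tie.
stock_full = {"event_f1": 0.97, "recall": 0.96, "clarify": 1.0}
tuned_full = {"event_f1": 0.97, "recall": 0.96, "clarify": 1.0}

# Bare-prompt eval, n=60 (Step 10): wins and losses on both sides.
stock_bare = {"schema_validity": 0.967, "event_f1": 0.682, "recall": 0.60}
tuned_bare = {"schema_validity": 1.0, "event_f1": 0.644, "recall": 0.56}

print(strictly_dominates(tuned_full, stock_full))
print(strictly_dominates(tuned_bare, stock_bare))
```

Both calls return `False`: a tie has no strictly better metric, and the bare-prompt result is worse on F1 and recall. That is why promotion at parity falls outside the gate and becomes an owner decision.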