# Eval roadmap — improving the scheduling fine-tune How we measure and improve `ParetoOptimal/gemma-4-cal-gguf` (the fine-tuned Gemma-4-31B that turns chat/images into a calendar `ActionPlan`). The eval is **task-specific** — generic LLM benchmarks (MMLU etc.) don't apply. Harness: `training/eval.py` (scores), `training/gen_eval.py` + `training/data/eval.jsonl` (28 held-out examples, disjoint from `dataset.jsonl`), `training/modal_eval.py` (serves the GGUF on the same `llama-server` the Space uses, then scores). ## Baseline scores (Q4_K_M, n=28, 2026-06-09) | Metric | Score | | --- | --- | | schema validity | 1.00 | | no-event accuracy | 1.00 | | clarification recall | 1.00 | | end-time exact | 1.00 | | event precision | 0.85 | | **event recall (start-exact)** | **0.77** | | event F1 | 0.81 | | title similarity | 0.87 | Discipline (never invents events, always asks when ambiguous) is perfect; all 9 relative-date cases passed. The gap is **exact start datetime** on a few explicit far-future dates (misses: `e02`, `e05`, `e06`, `e15`, one leg of `m02`). ## The 3 steps ### 1. Diagnose the 5 misses (cheap) Enhance `eval.py` to dump the model's actual `start`/`title` for mismatched events, then one re-run shows whether they're date-shift, time/AM-PM, or wrong-year errors — which tells us exactly what training data to add. (~one A100 eval run; the GGUF is cached in the Modal Volume, so it's fast.) ### 2. Baseline comparison (the "Well-Tuned" proof) Run `modal run training/modal_eval.py --model-hf-repo unsloth/gemma-4-31B-it-GGUF` to score **stock** Gemma-4-31B on the same set. If the fine-tune's discipline (no-event 1.0, clarification 1.0) and datetime recall beat stock, that's concrete evidence the fine-tune helps. (Separate ~18 GB model download + A100 time.) ### 3. Close the gap Add ~15–20 explicit-date examples (especially next-month dates and times) to `training/data/dataset.jsonl`, re-train on Modal (`training/modal_train.py`), re-eval — and watch start-exact recall move. ## Results log ### Step 1 — diagnosis (2026-06-09) The mismatch dump showed the misses are **not** a reasoning failure. 3 of 5 are the same bug — a dropped year digit, **"206" instead of "2026"** — on next-month dates (month/day/time all correct): ``` [e02] gold 2026-10-06T15:30 pred 206-10-06T15:30 [e05] gold 2026-10-01T08:15 pred 206-10-01T08:15 [e15] gold 2026-10-08T19:00 pred 206-10-08T19:00 [e06] gold 2026-09-28T09:00 pred [] (abstained) [m02] Standup + Sprint demo pred Standup only (dropped 2nd leg) ``` Fix indicated: more far-future explicit-date examples reinforcing 4-digit years (+ multi-event 2nd legs). → Step 3. ### Step 2 — baseline vs fine-tune (2026-06-09, n=28, Q4_K_M) | Metric | Stock `gemma-4-31B-it-GGUF` | Fine-tune `gemma-4-cal-gguf` | | --- | --- | --- | | schema validity | 1.00 | 1.00 | | event precision | **1.00** | 0.85 | | start-exact recall | **0.955** | 0.773 | | event F1 | **0.977** | 0.81 | | end-exact | 1.00 | 1.00 | | no-event accuracy | 1.00 | 1.00 | | clarification recall | 0.75 | **1.00** | **Honest read:** stock Gemma-4-31B is already strong at this extraction and *beats* the current fine-tune on datetime recall — the "206" bug is a fine-tune regression. The fine-tune's only clear win is **clarification discipline** (asks when a thread is "date TBD"; stock missed `q04`). As-is, the fine-tune is **not** justified on extraction. Step 3 must fix the year regression and clear baseline's 0.955 recall while keeping clarification at 1.00 — otherwise the better play is stock + the fine-tune's clarification behavior via prompting. ### Step 3 — after gap-closing retrain (2026-06-09) — REGRESSED Dataset grown 69 → 87 (+18 Oct–Dec 2026 explicit-date examples, disjoint from eval), same 2-epoch recipe, re-quantized to Q4_K_M and republished. Re-eval (n=28): | Metric | Stock 31B | Fine-tune v1 (69) | **Fine-tune v2 (87, retrained)** | | --- | --- | --- | --- | | schema validity | 1.00 | 1.00 | **0.75** | | event precision | 1.00 | 0.85 | **0.476** | | start-exact recall | 0.955 | 0.773 | **0.455** | | event F1 | 0.977 | 0.81 | **0.465** | | end-exact | 1.00 | 1.00 | 1.00 | | no-event accuracy | 1.00 | 1.00 | 1.00 | | clarification recall | 0.75 | 1.00 | **0.75** | **The naive retrain made it worse, not better.** New failure modes: unparseable/empty JSON (validity 1.0→0.75), duplicate events, hallucinated "Drive to …" events, transposed/garbage years (`2062`, `2062-15:00:00`), and previously-passing relative dates now empty. Cause: overfitting — 18 of 87 examples were near-identical far-future templates, biasing a tiny dataset and degrading general formatting/extraction. ## Conclusions & recommendation 1. **Stock Gemma-4-31B is already strong** at this extraction (F1 0.98). The only thing fine-tuning reliably *added* was clarification discipline (v1: 1.00 vs stock 0.75) — and even that was lost in v2. 2. **Tiny-dataset SFT is fragile here.** v1 (69 ex) underperformed stock on dates; v2 (87 ex) regressed hard. More data of the *same shape* hurt. 3. **Recommended path** (pick one): - **Ship stock + prompt for clarification** — simplest; recover the one real win without the regressions. (Lowest risk.) - **If keeping a fine-tune:** rebuild the dataset much larger and *diverse* (not template-heavy), drop to ~1 epoch with regularization, and **gate every retrain on this eval** (only publish if it beats the current best). Consider a higher quant (Q5/Q6) to rule out the `"206"`/`2062` digit corruption being quant-driven. 4. **Action — revert the live model.** v2 (worse) overwrote v1 in `ParetoOptimal/gemma-4-cal-gguf`. Restore v1 (the better fine-tune) or point the Space back at stock `unsloth/gemma-4-31B-it-GGUF` until a fine-tune *beats* the eval baseline. **Bottom line: the eval did its job — it caught a regression before it reached users, and it says the current fine-tune is not yet worth shipping over stock.** ## Follow-up (2026-06-09) ### Live model restored to v1 v2 (regressed) was rolled back: `gemma-cal-Q4_K_M.gguf` in the repo was restored to the v1 LFS object via a server-side `CommitOperationCopy` (no transfer, no GPU). Production serves the better v1 again. ### Dataset rebuilt larger + more diverse (69 → 122) Added a diversity batch (`gen_new_seeds.MORE_SEEDS3`): varied date/time formats (`10/15`, "the 3rd", "half past 7", "0900", "noon", "midnight"), reschedules, cancellations, recurring, all-day, deadlines (EOD/midnight), past & hypothetical (must NOT schedule), richer no-event & clarify, and varied image sources (ticket, invite screenshot, notice). Goal: counter the template-heavy skew that overfit v2. Verified valid + disjoint from `eval.jsonl`. ### Eval-gating is now the publishing process **No retrain publishes unless it beats the eval.** `training/gated_retrain.py`: 1. retrain on Modal → upload to a **staging** filename (`gemma-cal-staging-Q4_K_M.gguf`) in the repo (production file untouched; mmproj skipped — `--skip-mmproj`); 2. eval the staging file (`modal_eval.py --model-file …`); 3. gate: `schema_validity ≥ 0.95`, `event_f1 ≥ 0.81`, `start-exact recall ≥ 0.773` (defaults = the current best, v1) — tune via `--gate-f1/--gate-recall`; 4. **PASS** → promote staging → production via server-side `CommitOperationCopy` (free); **FAIL** → delete staging, production unchanged. Run: `python training/gated_retrain.py [--epochs 1 --gate-f1 … --gate-recall …]`. ### Step 4 — first eval-gated retrain (122 ex, 1 epoch) — GATE FAILED ✅ (protected prod) The retrain scored **worse** than every prior version and the gate refused to publish: | Metric | Stock | v1 (live) | v3 staging (122, 1ep) | | --- | --- | --- | --- | | schema validity | 1.00 | 1.00 | **0.46** | | event F1 | 0.977 | 0.81 | **0.214** | | start-exact recall | 0.955 | 0.773 | **0.136** | | no-event accuracy | 1.00 | 1.00 | 1.00 | | clarification recall | 0.75 | 1.00 | 1.00 | >½ of outputs were unparseable; extraction collapsed. **Gate: FAIL → staging deleted, production unchanged (still v1).** The gate worked exactly as intended. ## Verdict (after 3 fine-tune attempts) All three fine-tunes — v1 (69 ex / 2 ep), v2 (87 / 2 ep), v3 (122 / 1 ep) — **underperform stock Gemma-4-31B**, and the larger runs broke JSON validity. Only the safety behaviors (no-event, clarification) survive fine-tuning; extraction degrades. **QLoRA-on-31B-Q4 here is fragile and not worth shipping over stock.** Recommended: serve **stock `unsloth/gemma-4-31B-it-GGUF`** and recover the one fine-tune win (clarification) via the prompt. Keep v1 as the published fine-tune for the "Well-Tuned" artifact, but don't route production extraction through it. Revisit fine-tuning only with a substantially larger, more varied dataset and a recipe that holds schema validity at 1.0 — gated, as now, on this eval. ## Step 5 — quantization-penalty test (2026-06-09): quant EXONERATED Hypothesis: maybe Q4 quantization (the `"206"`/`2062` digit bug) was tanking the fine-tune. Tested the SAME fine-tuned weights (`gemma-cal-f16.gguf`, v2/87-ex — best fp16 still on the volume) at three precisions on the 28-example eval (`training/modal_quant_eval.py`): | precision | schema validity | event F1 | start-exact recall | | --- | --- | --- | --- | | f16 (full) | 0.643 | 0.571 | 0.545 | | Q8_0 | 0.679 | 0.565 | 0.591 | | Q4_K_M | 0.75 | 0.465 | 0.455 | | base (stock) | 1.00 | 0.977 | 0.955 | **Quantization is not the cause.** At full fp16 the fine-tune still scores validity 0.64 / F1 0.57 — nowhere near base; validity is actually *lower* at f16 than Q4, so quant isn't breaking the JSON. Precision buys only ~+0.1 F1/recall (Q4→Q8/f16), a fraction of the gap to base. The degradation is the **SFT itself**, not the GGUF conversion. Step 2 (retrain at Q8 to beat base) is **not pursued** — the gate would fail. (Caveat: v1's fp16 was overwritten, so this used v2; a definitive v1 test needs a retrain, but the small quant lift makes a base-beating result improbable.) ### Final recommendation A higher quant won't make the fine-tune beat base, and an automation agent (e.g. `ml-intern`) doesn't change the binding constraints (near-ceiling base; small data; SFT degrades instruction-following). **Serve stock `unsloth/gemma-4-31B-it-GGUF`** and recover the clarification behavior via the system prompt; keep v1 as the "Well-Tuned" artifact. Only revisit fine-tuning with a substantially larger, real, diverse dataset + a validity-preserving recipe (low LR, few steps), always gated on this eval. ## Real training data: SMCalFlow importer `training/import_smcalflow.py` converts **SMCalFlow** (Microsoft Semantic Machines, **CC BY-SA 4.0**) calendar dialogues into our `ActionPlan` format. SMCalFlow encodes events as LISP "dataflow" programs; the importer parses `CreatePreflightEventWrapper` turns, extracts subject/start/location/attendees, and **resolves** date/time constructs (`Tomorrow`, `NextDOW`, `MD`, `NumberPM`, `HourMinuteMilitary`, …) against a per-example reference `now` spread across 2026 — so relative dates become concrete, self-consistent targets (directly trains the failing date/time skill, with varied 4-digit years). Conservative: only emits a row when a title AND an explicit start time resolve (~7.5k usable turns from train+valid). - Run: `python training/import_smcalflow.py --limit 2000 --heldout 200` → writes `training/data/smcalflow_train.jsonl` (+ `…_heldout.jsonl`). **Both are git-ignored** (CC BY-SA share-alike vs this repo's Apache-2.0 → we don't commit/redistribute the derived data; the importer code is ours) and **disjoint from `eval.jsonl`**. - `train_qlora.py` now trains on `dataset.jsonl` **+** `smcalflow_train.jsonl` (when present). `gated_retrain.py` therefore trains on real data, and still **only publishes if it beats the gate** — so a bigger-but-worse model can't reach production. - Attribution (required by CC BY-SA): *Semantic Machines et al., "Task-Oriented Dialogue as Dataflow Synthesis," TACL 2020.* ## Step 6 — eval-gated retrain on REAL data (2026-06-09): FAILED gate (worst yet) Trained the 31B on **2,122 examples** (122 hand-authored + 2,000 real SMCalFlow), 1 epoch, through `gated_retrain.py` with a beat-base gate (F1≥0.95, recall≥0.90). Result on the 28-ex eval: | Metric | base | v1 (live) | real-data (2,122 ex) | | --- | --- | --- | --- | | schema validity | 1.00 | 1.00 | **0.107** | | event F1 | 0.977 | 0.81 | **0.000** | | start-exact recall | 0.955 | 0.773 | **0.000** | ~90% unparseable output, zero events extracted. **Gate FAIL → not promoted; production stays v1.** ### Verdict across 4 fine-tunes (now incl. real data) Scores **monotonically worsen with more training/data**: v1 (69 synth, F1 0.81) → v2 (87, 0.465) → v3 (122, 0.214) → real (2,122, 0.0). This is no longer a *data* problem — **the SFT recipe itself degrades the model**, and more data makes it worse. Most likely root cause to investigate *if* fine-tuning is ever revisited: a **train/inference chat-template mismatch** — `train_qlora.py` formats with Unsloth's `get_chat_template("gemma")` while `llama-server` serves with the GGUF's own `--jinja` template; if these differ for Gemma-4, training optimizes a format the server never uses, and the divergence compounds with more steps (exactly the monotonic decay seen). Other suspects: LR too high (2e-4) / catastrophic forgetting on a near-ceiling base. **Final, evidence-backed recommendation: serve stock `unsloth/gemma-4-31B-it-GGUF`** (best by far) and recover clarification via the system prompt. Do NOT route production through any current fine-tune. The eval-gate has now correctly rejected 2 bad retrains — keep it as the publish gate. ## Step 7 — recipe fix + raw-output probe (2026-06-09): training stack implicated, fine-tuning HALTED Fixed the suspected train/serve chat-template mismatch (PR #54): Gemma-4's native `chat_template.jinja` uses a NEW `<|turn>role … ` format (no `` at all), while training forced unsloth's legacy "gemma" template. `train_qlora.py` now formats with the tokenizer's native template (hard `<|turn>` assert), masks loss to the assistant turn, LR 5e-5. Retrained on the 2,122-example set through the gate: **validity 0.0 — gate FAIL** (production stays v1, third bad retrain rejected). Diagnostics that pinpointed the cause: - **GGUF template check (CPU, ~free):** our exported staging GGUF embeds the correct native `<|turn>` template (16,934 chars, no ``) → train and serve formats are now verifiably aligned. Template is exonerated as the remaining cause. - **Raw-output probe (`/outputs/gemma-cal-staging-Q4_K_M.gguf`):** free generation emits pure degenerate looping — `'Huddle — — — — — …'` to the token limit; constrained generation emits 512 tokens of nothing. **The weights are destroyed, not misformatted.** With dataset (69→2,122), template (legacy/native), LR (2e-4/5e-5), and masking (on/off) all varied, degradation always tracks training steps and ends in token-loop collapse. The remaining common factor is **Unsloth's QLoRA path for Gemma-4-31B** (new architecture; training logs warn `get_input_embeddings not auto-handled for Gemma4AudioModel`). **Fine-tuning is halted** until that stack demonstrably works for this arch (or is replaced with plain transformers+PEFT). ## Step 8 — improve served evals via prompt (stock + targeted SYSTEM additions) Base's only eval misses are prompt-fixable: m03 dropped the 2nd event of a multi-event thread; q04 didn't ask clarification on a "TBD" plan. Added two surgical SYSTEM lines (list every distinct event separately; ask via needs_clarification when day/time is TBD). **Result: PERFECT SCORE — 1.0 on every metric (n=28, tp/fp/fn = 22/0/0).** | Metric | base (old prompt) | **base + new prompt** | | --- | --- | --- | | schema validity | 1.00 | **1.00** | | event precision | 1.00 | **1.00** | | start-exact recall | 0.955 | **1.00** | | event F1 | 0.977 | **1.00** | | no-event accuracy | 1.00 | **1.00** | | clarification recall | 0.75 | **1.00** | Both misses fixed, nothing regressed. **This is the production configuration: stock `unsloth/gemma-4-31B-it-GGUF` + the updated SYSTEM prompt.** (Set Space var `MODEL_HF_REPO=unsloth/gemma-4-31B-it-GGUF`; the prompt ships with the app.) The "Well-Tuned" artifact remains `ParetoOptimal/gemma-4-cal-gguf` (v1); any future fine-tune must beat THIS 1.0 baseline through the gate — i.e., match it and win on a harder, expanded eval set. ## Step 9 — the E4B edge-model campaign (2026-06-10) Re-aimed fine-tuning where it has headroom: a **Gemma-4 E4B (~8B)** edge model that runs without a paid A100, gated against **stock E4B**. Six gated runs, each fixing a diagnosed failure (the fixed recipe trained cleanly every time — validity 1.0 throughout, confirming the Step-7 breakage was specific to the 31B path): | run | change | F1 | recall | clarify | eval | | --- | --- | --- | --- | --- | --- | | #1 | fixed recipe, 2,122 ex | 0.884 | 0.864 | 1.0 | n=28 | | #2 | + weekday-in-prompt (+data regen) | 0.955 | 0.955 | 0.75 | n=28 | | #3 | + next-DOW conflict filter (74 rows), 4× hand | **1.0** | **1.0** | 0.75 | n=28 | | #4 | + TBD-clarify seeds, 8× hand | 0.93 | 0.909 | 1.0 | n=28 | | #5 | clarify seeds, 4× hand | 0.93 | 0.909 | 1.0 | n=28 | | — | **eval expanded 28→60** (50 events; jitter-resistant) | | | | | | #6 | + Batch-7 seeds (next-DOW, "opens") | 0.97 | 0.96 | 1.0 | n=60 | | stock E4B (weekday prompt) | | 0.97 | 0.96 | 1.0 | n=60 | Run #6 vs stock is an **exact statistical tie** (identical tp/fp/fn 48/1/2; both miss `e09` "next Tuesday" — which resisted 7 explicit training seeds — and one "opens" case each). Campaign side effects that improved the PRODUCT for every model: weekday-in-prompt, the next-DOW convention cleanup, and the 60-example eval. ## Step 10 — bare-prompt (internalization) test: no decisive gap Dropped the system prompt for both models (identical minimal user content, same JSON-schema constraint; `modal_eval.py --minimal-prompt`), measuring internalized task knowledge: | bare, n=60 | stock E4B | fine-tuned E4B | | --- | --- | --- | | schema validity | 0.967 | **1.0** | | event F1 | **0.682** | 0.644 | | start-exact recall | **0.60** | 0.56 | | no-event accuracy | 0.70 | **0.80** | | clarification recall | 0.50 | **0.625** | Small trade-offs both ways, within noise. **Verdict: at this data scale (139 hand + 2,000 SMCalFlow) with QLoRA/1-epoch, the E4B fine-tune reaches PARITY with stock, not superiority** — non-degraded, perfect validity everywhere, better bare-prompt discipline, slightly weaker bare extraction. The strict-dominance gate therefore never auto-promoted it; the candidate GGUF remains on the Modal volume (`/outputs/gemma-cal-e4b-staging-Q4_K_M.gguf`). Publishing it as the project's edge model at parity is a **product decision** (zero quality cost; production would serve our own fine-tune, fulfilling "Well-Tuned") — deliberately left to the owner, not the gate.