| # Eval roadmap: improving the scheduling fine-tune |
|
|
| How we measure and improve `ParetoOptimal/gemma-4-cal-gguf` (the fine-tuned |
| Gemma-4-31B that turns chat/images into a calendar `ActionPlan`). The eval is |
| **task-specific**: generic LLM benchmarks (MMLU etc.) don't apply. |
|
|
| Harness: `training/eval.py` (scores), `training/gen_eval.py` + `training/data/eval.jsonl` |
| (28 held-out examples, disjoint from `dataset.jsonl`), `training/modal_eval.py` |
| (serves the GGUF on the same `llama-server` the Space uses, then scores). |
|
|
| ## Baseline scores (Q4_K_M, n=28, 2026-06-09) |
|
|
| | Metric | Score | |
| | --- | --- | |
| | schema validity | 1.00 | |
| | no-event accuracy | 1.00 | |
| | clarification recall | 1.00 | |
| | end-time exact | 1.00 | |
| | event precision | 0.85 | |
| | **event recall (start-exact)** | **0.77** | |
| | event F1 | 0.81 | |
| | title similarity | 0.87 | |
|
|
| Discipline (never invents events, always asks when ambiguous) is perfect; all 9 |
| relative-date cases passed. The gap is **exact start datetime** on a few |
| explicit far-future dates (misses: `e02`, `e05`, `e06`, `e15`, one leg of `m02`). |
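For reference, the event-level scores in the table follow from match counts in the usual way. The sketch below is not taken from `training/eval.py`; the helper name `prf` and the counts (17 matched, 3 spurious, 5 missed, inferred from the table and the 22 gold events in the eval set) are assumptions for illustration.

```python
def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from event-level match counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Assumed counts: 22 gold events, 5 missed (fn), 20 predicted (17 tp + 3 fp).
p, r, f1 = prf(tp=17, fp=3, fn=5)
print(f"precision={p:.2f} recall={r:.3f} f1={f1:.2f}")
# precision=0.85 recall=0.773 f1=0.81
```

A predicted event with the right title but a wrong start counts as both a false positive and a false negative under start-exact matching, which is why a single bad year digit lowers precision and recall together.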
|
|
| ## The 3 steps |
|
|
| ### 1. Diagnose the 5 misses (cheap) |
| Enhance `eval.py` to dump the model's actual `start`/`title` for mismatched events, |
| then one re-run shows whether they're date-shift, time/AM-PM, or wrong-year errors, |
| which tells us exactly what training data to add. (~one A100 eval run; the GGUF is |
| cached in the Modal Volume, so it's fast.) |
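One way to bucket the dumped mismatches is sketched below. It is an illustration only, not the code in `eval.py`: the function name `classify_mismatch`, the label strings, and the assumption that starts are ISO-like `YYYY-MM-DDTHH:MM` strings are all hypothetical.

```python
import re
from typing import Optional

# Accept any number of year digits so a malformed year can be classified
# instead of being rejected by a strict datetime parser.
START_RE = re.compile(r"^(\d+)-(\d{2})-(\d{2})T(\d{2}):(\d{2})")


def classify_mismatch(gold: str, pred: Optional[str]) -> str:
    """Return a coarse label for how a predicted start differs from gold."""
    if not pred:
        return "missing"
    g, p = START_RE.match(gold), START_RE.match(pred)
    if g is None or p is None:
        return "unparseable"
    gy, gmo, gd, gh, gmi = g.groups()
    py, pmo, pd_, ph, pmi = p.groups()
    same_date = (gy, gmo, gd) == (py, pmo, pd_)
    same_time = (gh, gmi) == (ph, pmi)
    if same_date and same_time:
        return "match"
    if gy != py and (gmo, gd) == (pmo, pd_) and same_time:
        return "wrong-year"
    if same_date and gmi == pmi and abs(int(gh) - int(ph)) == 12:
        return "am-pm"
    if same_time:
        return "date-shift"
    return "other"


print(classify_mismatch("2026-10-06T15:30", "2026-10-07T15:30"))  # date-shift
print(classify_mismatch("2026-10-06T15:30", "2026-10-06T03:30"))  # am-pm
print(classify_mismatch("2026-10-06T15:30", None))                # missing
```

Counting these labels over the misses gives a direct pointer to which kind of training example is underrepresented.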
|
|
| ### 2. Baseline comparison (the "Well-Tuned" proof) |
| Run `modal run training/modal_eval.py --model-hf-repo unsloth/gemma-4-31B-it-GGUF` |
| to score **stock** Gemma-4-31B on the same set. If the fine-tune's discipline |
| (no-event 1.0, clarification 1.0) and datetime recall beat stock, that's concrete |
| evidence the fine-tune helps. (Separate ~18 GB model download + A100 time.) |
|
|
| ### 3. Close the gap |
| Add ~15–20 explicit-date examples (especially next-month dates and times) to |
| `training/data/dataset.jsonl`, re-train on Modal (`training/modal_train.py`), |
| re-eval, and watch start-exact recall move. |
|
|
| ## Results log |
|
|
| ### Step 1 - diagnosis (2026-06-09) |
| The mismatch dump showed the misses are **not** a reasoning failure. 3 of 5 are the |
| same bug on next-month dates, a dropped year digit (**"206" instead of "2026"**), with |
| month, day, and time all correct: |
|
|
| ``` |
| [e02] gold 2026-10-06T15:30 pred 206-10-06T15:30 |
| [e05] gold 2026-10-01T08:15 pred 206-10-01T08:15 |
| [e15] gold 2026-10-08T19:00 pred 206-10-08T19:00 |
| [e06] gold 2026-09-28T09:00 pred [] (abstained) |
| [m02] Standup + Sprint demo pred Standup only (dropped 2nd leg) |
| ``` |
|
|
| Fix indicated: more far-future explicit-date examples reinforcing 4-digit years |
| (+ multi-event 2nd legs). → Step 3. |
|
|
| ### Step 2 - baseline vs fine-tune (2026-06-09, n=28, Q4_K_M) |
|
|
| | Metric | Stock `gemma-4-31B-it-GGUF` | Fine-tune `gemma-4-cal-gguf` | |
| | --- | --- | --- | |
| | schema validity | 1.00 | 1.00 | |
| | event precision | **1.00** | 0.85 | |
| | start-exact recall | **0.955** | 0.773 | |
| | event F1 | **0.977** | 0.81 | |
| | end-exact | 1.00 | 1.00 | |
| | no-event accuracy | 1.00 | 1.00 | |
| | clarification recall | 0.75 | **1.00** | |
|
|
| **Honest read:** stock Gemma-4-31B is already strong at this extraction and *beats* |
| the current fine-tune on datetime recall; the "206" bug is a fine-tune regression. |
| The fine-tune's only clear win is **clarification discipline** (asks when a thread is |
| "date TBD"; stock missed `q04`). As-is, the fine-tune is **not** justified on |
| extraction. Step 3 must fix the year regression and clear stock's 0.955 recall |
| while keeping clarification at 1.00. Otherwise the better play is stock + the |
| fine-tune's clarification behavior via prompting. |
|
|
| ### Step 3 - after gap-closing retrain (2026-06-09): REGRESSED |
| Dataset grown 69 → 87 (+18 Oct–Dec 2026 explicit-date examples, disjoint from eval), |
| same 2-epoch recipe, re-quantized to Q4_K_M and republished. Re-eval (n=28): |
|
|
| | Metric | Stock 31B | Fine-tune v1 (69) | **Fine-tune v2 (87, retrained)** | |
| | --- | --- | --- | --- | |
| | schema validity | 1.00 | 1.00 | **0.75** | |
| | event precision | 1.00 | 0.85 | **0.476** | |
| | start-exact recall | 0.955 | 0.773 | **0.455** | |
| | event F1 | 0.977 | 0.81 | **0.465** | |
| | end-exact | 1.00 | 1.00 | 1.00 | |
| | no-event accuracy | 1.00 | 1.00 | 1.00 | |
| | clarification recall | 0.75 | 1.00 | **0.75** | |
|
|
| **The naive retrain made it worse, not better.** New failure modes: unparseable/empty |
| JSON (validity 1.0→0.75), duplicate events, hallucinated "Drive to …" events, |
| transposed/garbage years (`2062`, `2062-15:00:00`), and previously-passing relative |
| dates now empty. Cause: overfitting. 18 of 87 examples were near-identical far-future |
| templates, biasing a tiny dataset and degrading general formatting/extraction. |
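A quick way to quantify this kind of template skew before training is to reduce each input to a "skeleton" and measure how much of the dataset shares the most common one. The sketch below is a hypothetical heuristic, not something that was run for this log; `skeleton`, `template_share`, and the sample texts are made up for illustration.

```python
import re
from collections import Counter

# Month names and common abbreviations. Note that the modal verb "may" also
# matches, which is acceptable for a rough skew estimate.
MONTH_RE = re.compile(
    r"\b(?:jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?|"
    r"aug(?:ust)?|sep(?:t(?:ember)?)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)\b"
)


def skeleton(text: str) -> str:
    """Replace month names and digit runs with placeholders."""
    t = MONTH_RE.sub("<month>", text.lower())
    t = re.sub(r"\d+", "<n>", t)
    return re.sub(r"\s+", " ", t).strip()


def template_share(texts: list[str]) -> float:
    """Fraction of examples that share the single most common skeleton."""
    counts = Counter(skeleton(t) for t in texts)
    return counts.most_common(1)[0][1] / len(texts)


texts = [
    "Dentist on Oct 6, 2026 at 15:30",
    "Dentist on Nov 12, 2026 at 09:15",
    "Dentist on Dec 1, 2026 at 08:00",
    "lunch tomorrow?",
]
print(template_share(texts))  # 0.75
```

A high share from one skeleton suggests the new batch adds volume without adding variety.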
|
|
| ## Conclusions & recommendation |
|
|
| 1. **Stock Gemma-4-31B is already strong** at this extraction (F1 0.98). The only |
|    thing fine-tuning reliably *added* was clarification discipline (v1: 1.00 vs stock |
|    0.75), and even that was lost in v2. |
| 2. **Tiny-dataset SFT is fragile here.** v1 (69 ex) underperformed stock on dates; |
|    v2 (87 ex) regressed hard. More data of the *same shape* hurt. |
| 3. **Recommended path** (pick one): |
|    - **Ship stock + prompt for clarification:** simplest; recovers the one real win |
|      without the regressions. (Lowest risk.) |
|    - **If keeping a fine-tune:** rebuild the dataset much larger and *diverse* (not |
|      template-heavy), drop to ~1 epoch with regularization, and **gate every retrain |
|      on this eval** (only publish if it beats the current best). Consider a higher |
|      quant (Q5/Q6) to rule out the `"206"`/`2062` digit corruption being quant-driven. |
| 4. **Action: revert the live model.** v2 (worse) overwrote v1 in |
|    `ParetoOptimal/gemma-4-cal-gguf`. Restore v1 (the better fine-tune) or point the |
|    Space back at stock `unsloth/gemma-4-31B-it-GGUF` until a fine-tune *beats* the |
|    eval baseline. |
| |
| **Bottom line: the eval did its job. It caught a regression before it reached users, |
| and it says the current fine-tune is not yet worth shipping over stock.** |
|
|
| ## Follow-up (2026-06-09) |
|
|
| ### Live model restored to v1 |
| v2 (regressed) was rolled back: `gemma-cal-Q4_K_M.gguf` in the repo was restored to the |
| v1 LFS object via a server-side `CommitOperationCopy` (no transfer, no GPU). Production |
| serves the better v1 again. |
|
|
| ### Dataset rebuilt larger + more diverse (69 → 122) |
| Added a diversity batch (`gen_new_seeds.MORE_SEEDS3`): varied date/time formats |
| (`10/15`, "the 3rd", "half past 7", "0900", "noon", "midnight"), reschedules, |
| cancellations, recurring, all-day, deadlines (EOD/midnight), past & hypothetical |
| (must NOT schedule), richer no-event & clarify, and varied image sources (ticket, |
| invite screenshot, notice). Goal: counter the template-heavy skew that overfit v2. |
| Verified valid + disjoint from `eval.jsonl`. |
|
|
| ### Eval-gating is now the publishing process |
| **No retrain publishes unless it beats the eval.** `training/gated_retrain.py`: |
| 1. retrain on Modal → upload to a **staging** filename (`gemma-cal-staging-Q4_K_M.gguf`) |
|    in the repo (production file untouched; mmproj skipped via `--skip-mmproj`); |
| 2. eval the staging file (`modal_eval.py --model-file …`); |
| 3. gate: `schema_validity ≥ 0.95`, `event_f1 ≥ 0.81`, `start-exact recall ≥ 0.773` |
|    (defaults = the current best, v1); tune via `--gate-f1/--gate-recall`; |
| 4. **PASS** → promote staging → production via server-side `CommitOperationCopy` (free); |
|    **FAIL** → delete staging, production unchanged. |
|
|
| Run: `python training/gated_retrain.py [--epochs 1 --gate-f1 … --gate-recall …]`. |
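The gate decision itself reduces to three threshold comparisons. The following is a minimal sketch of that check, assuming the metrics arrive as a dict; the `Gate` class, `passes_gate`, and the `start_recall` key are hypothetical names, not the actual interface of `gated_retrain.py`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    # Defaults mirror the thresholds listed above (the v1 scores).
    schema_validity: float = 0.95
    event_f1: float = 0.81
    start_recall: float = 0.773


def passes_gate(metrics: dict[str, float], gate: Gate = Gate()) -> bool:
    """True only if every gated metric meets or exceeds its threshold."""
    return (
        metrics["schema_validity"] >= gate.schema_validity
        and metrics["event_f1"] >= gate.event_f1
        and metrics["start_recall"] >= gate.start_recall
    )


v1 = {"schema_validity": 1.00, "event_f1": 0.81, "start_recall": 0.773}
v2 = {"schema_validity": 0.75, "event_f1": 0.465, "start_recall": 0.455}
print(passes_gate(v1))  # True
print(passes_gate(v2))  # False
```

Because the comparisons use `>=`, a candidate that exactly matches v1 would pass at the defaults; raising the thresholds slightly would enforce a strict improvement.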
|
|
| ### Step 4 - first eval-gated retrain (122 ex, 1 epoch): GATE FAILED ✅ (protected prod) |
| The retrain scored **worse** than every prior version and the gate refused to publish: |
|
|
| | Metric | Stock | v1 (live) | v3 staging (122, 1ep) | |
| | --- | --- | --- | --- | |
| | schema validity | 1.00 | 1.00 | **0.46** | |
| | event F1 | 0.977 | 0.81 | **0.214** | |
| | start-exact recall | 0.955 | 0.773 | **0.136** | |
| | no-event accuracy | 1.00 | 1.00 | 1.00 | |
| | clarification recall | 0.75 | 1.00 | 1.00 | |
|
|
| More than half of outputs were unparseable; extraction collapsed. **Gate: FAIL → staging deleted, |
| production unchanged (still v1).** The gate worked exactly as intended. |
|
|
| ## Verdict (after 3 fine-tune attempts) |
| All three fine-tunes, v1 (69 ex / 2 ep), v2 (87 / 2 ep), and v3 (122 / 1 ep), **underperform |
| stock Gemma-4-31B**, and the larger runs broke JSON validity. Only the safety behaviors |
| (no-event, clarification) survive fine-tuning; extraction degrades. **QLoRA-on-31B-Q4 here |
| is fragile and not worth shipping over stock.** Recommended: serve **stock |
| `unsloth/gemma-4-31B-it-GGUF`** and recover the one fine-tune win (clarification) via the |
| prompt. Keep v1 as the published fine-tune for the "Well-Tuned" artifact, but don't route |
| production extraction through it. Revisit fine-tuning only with a substantially larger, more |
| varied dataset and a recipe that holds schema validity at 1.0, gated, as now, on this eval. |
|
|
| ## Step 5 - quantization-penalty test (2026-06-09): quant EXONERATED |
| Hypothesis: maybe Q4 quantization (the `"206"`/`2062` digit bug) was tanking the fine-tune. |
| Tested the SAME fine-tuned weights (`gemma-cal-f16.gguf`, v2/87-ex, the best fp16 still on the |
| volume) at three precisions on the 28-example eval (`training/modal_quant_eval.py`): |
|
|
| | precision | schema validity | event F1 | start-exact recall | |
| | --- | --- | --- | --- | |
| | f16 (full) | 0.643 | 0.571 | 0.545 | |
| | Q8_0 | 0.679 | 0.565 | 0.591 | |
| | Q4_K_M | 0.75 | 0.465 | 0.455 | |
| | base (stock) | 1.00 | 0.977 | 0.955 | |
| |
| **Quantization is not the cause.** At full fp16 the fine-tune still scores validity 0.64 / F1 |
| 0.57, nowhere near base; validity is actually *lower* at f16 than Q4, so quant isn't breaking |
| the JSON. Precision buys only ~+0.1 F1/recall (Q4→Q8/f16), a fraction of the gap to base. The |
| degradation is the **SFT itself**, not the GGUF conversion. The follow-on step of this test |
| (retrain at Q8 to beat base) is **not pursued** because the gate would fail. (Caveat: v1's fp16 |
| was overwritten, so this used v2; a definitive v1 test needs a retrain, but the small quant lift |
| makes a base-beating result improbable.) |
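To put "a fraction of the gap" in numbers, the arithmetic below uses only the values from the table above; the helper name `gap_closed` is hypothetical.

```python
def gap_closed(low: float, high: float, base: float) -> float:
    """Share of the (base - low) gap recovered by moving from low to high."""
    return (high - low) / (base - low)


# Best higher-precision score for each metric versus Q4_K_M and stock.
f1_share = gap_closed(low=0.465, high=0.571, base=0.977)      # Q4_K_M -> f16
recall_share = gap_closed(low=0.455, high=0.591, base=0.955)  # Q4_K_M -> Q8_0
print(f"F1: {f1_share:.0%} of the gap, recall: {recall_share:.0%} of the gap")
# F1: 21% of the gap, recall: 27% of the gap
```

Even in the most favorable reading, higher precision recovers roughly a quarter of the distance to stock, which is why the quant hypothesis was dropped.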
| |
| ### Final recommendation |
| A higher quant won't make the fine-tune beat base, and an automation agent (e.g. `ml-intern`) |
| doesn't change the binding constraints (near-ceiling base; small data; SFT degrades |
| instruction-following). **Serve stock `unsloth/gemma-4-31B-it-GGUF`** and recover the |
| clarification behavior via the system prompt; keep v1 as the "Well-Tuned" artifact. Only |
| revisit fine-tuning with a substantially larger, real, diverse dataset + a validity-preserving |
| recipe (low LR, few steps), always gated on this eval. |
| |
| ## Real training data: SMCalFlow importer |
| `training/import_smcalflow.py` converts **SMCalFlow** (Microsoft Semantic Machines, **CC BY-SA |
| 4.0**) calendar dialogues into our `ActionPlan` format. SMCalFlow encodes events as LISP |
| "dataflow" programs; the importer parses `CreatePreflightEventWrapper` turns, extracts |
| subject/start/location/attendees, and **resolves** date/time constructs (`Tomorrow`, `NextDOW`, |
| `MD`, `NumberPM`, `HourMinuteMilitary`, …) against a per-example reference `now` spread across |
| 2026, so relative dates become concrete, self-consistent targets (directly trains the failing |
| date/time skill, with varied 4-digit years). Conservative: only emits a row when a title AND an |
| explicit start time resolve (~7.5k usable turns from train+valid). |
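The resolution step can be pictured with a small sketch. This is not the code in `import_smcalflow.py`: the function names are hypothetical, and the conventions shown (in particular that `NextDOW` means the next occurrence strictly after today) are assumptions that the real importer may handle differently.

```python
from datetime import date, datetime, time, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def resolve_tomorrow(now: datetime) -> date:
    return now.date() + timedelta(days=1)


def resolve_next_dow(now: datetime, weekday: str) -> date:
    """Next occurrence of `weekday` strictly after today (assumed convention)."""
    target = WEEKDAYS.index(weekday.lower())
    days_ahead = (target - now.weekday() - 1) % 7 + 1  # always in 1..7
    return now.date() + timedelta(days=days_ahead)


def number_pm(hour: int) -> time:
    """Map a 12-hour PM value to 24-hour time, e.g. 3 -> 15:00, 12 -> 12:00."""
    return time(hour % 12 + 12)


now = datetime(2026, 6, 9, 10, 0)  # a Tuesday, used as the reference `now`
start = datetime.combine(resolve_next_dow(now, "Friday"), number_pm(3))
print(start.isoformat(timespec="minutes"))  # 2026-06-12T15:00
```

Pinning every example to its own reference `now` is what lets a relative phrase map to exactly one gold start.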
|
|
| - Run: `python training/import_smcalflow.py --limit 2000 --heldout 200` → writes |
| `training/data/smcalflow_train.jsonl` (+ `…_heldout.jsonl`). **Both are git-ignored** (CC BY-SA |
| share-alike vs this repo's Apache-2.0; we don't commit/redistribute the derived data; the |
| importer code is ours) and **disjoint from `eval.jsonl`**. |
| - `train_qlora.py` now trains on `dataset.jsonl` **+** `smcalflow_train.jsonl` (when present). |
| `gated_retrain.py` therefore trains on real data, and still **only publishes if it beats the |
| gate**, so a bigger-but-worse model can't reach production. |
| - Attribution (required by CC BY-SA): *Semantic Machines et al., "Task-Oriented Dialogue as |
| Dataflow Synthesis," TACL 2020.* |
|
|
| ## Step 6 - eval-gated retrain on REAL data (2026-06-09): FAILED gate (worst yet) |
| Trained the 31B on **2,122 examples** (122 hand-authored + 2,000 real SMCalFlow), 1 epoch, |
| through `gated_retrain.py` with a beat-base gate (F1 ≥ 0.95, recall ≥ 0.90). Result on the 28-ex eval: |
|
|
| | Metric | base | v1 (live) | real-data (2,122 ex) | |
| | --- | --- | --- | --- | |
| | schema validity | 1.00 | 1.00 | **0.107** | |
| | event F1 | 0.977 | 0.81 | **0.000** | |
| | start-exact recall | 0.955 | 0.773 | **0.000** | |
|
|
| ~90% unparseable output, zero events extracted. **Gate FAIL → not promoted; production stays v1.** |
|
|
| ### Verdict across 4 fine-tunes (now incl. real data) |
| Scores **monotonically worsen with more training/data**: v1 (69 synth, F1 0.81) → v2 (87, 0.465) |
| → v3 (122, 0.214) → real (2,122, 0.0). This is no longer a *data* problem: **the SFT recipe |
| itself degrades the model**, and more data makes it worse. Most likely root cause to investigate |
| *if* fine-tuning is ever revisited: a **train/inference chat-template mismatch**. `train_qlora.py` |
| formats with Unsloth's `get_chat_template("gemma")` while `llama-server` serves with the GGUF's |
| own `--jinja` template; if these differ for Gemma-4, training optimizes a format the server never |
| uses, and the divergence compounds with more steps (exactly the monotonic decay seen). Other |
| suspects: LR too high (2e-4) / catastrophic forgetting on a near-ceiling base. |
|
|
| **Final, evidence-backed recommendation: serve stock `unsloth/gemma-4-31B-it-GGUF`** (best by far) |
| and recover clarification via the system prompt. Do NOT route production through any current |
| fine-tune. The eval-gate has now correctly rejected 2 bad retrains β keep it as the publish gate. |
|
|
| ## Step 7 - recipe fix + raw-output probe (2026-06-09): training stack implicated, fine-tuning HALTED |
| Fixed the suspected train/serve chat-template mismatch (PR #54): Gemma-4's native |
| `chat_template.jinja` uses a NEW `<|turn>role … <turn|>` format (no `<start_of_turn>` at all), |
| while training forced Unsloth's legacy "gemma" template. `train_qlora.py` now formats with the |
| tokenizer's native template (hard `<|turn>` assert), masks loss to the assistant turn, and uses LR 5e-5. |
| Retrained on the 2,122-example set through the gate: **validity 0.0 → gate FAIL** (production |
| stays v1, third bad retrain rejected). |
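A hard template assert of the kind described can be as small as a pair of substring checks on one rendered training example. The sketch below is an assumption about its shape, not the code in `train_qlora.py`; the function name and the simplified sample strings are hypothetical, and in practice the rendered string would come from the tokenizer's own chat template.

```python
def assert_native_gemma4_template(rendered: str) -> None:
    """Fail fast if a rendered example does not use the native turn markers."""
    if "<|turn>" not in rendered:
        raise AssertionError("native '<|turn>' marker missing from rendered example")
    if "<start_of_turn>" in rendered:
        raise AssertionError("legacy '<start_of_turn>' marker found in rendered example")


# Simplified, hypothetical renderings used only to exercise the check.
native = "<|turn>user\nLunch Friday at noon?<turn|>\n<|turn>model\n"
legacy = "<start_of_turn>user\nLunch Friday at noon?<end_of_turn>\n"

assert_native_gemma4_template(native)  # passes silently
try:
    assert_native_gemma4_template(legacy)
except AssertionError as err:
    print(f"rejected: {err}")
```

Running the check once on the first formatted example, before any GPU time is spent, keeps a silent template regression from costing a full training run.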
|
|
| Diagnostics that pinpointed the cause: |
| - **GGUF template check (CPU, ~free):** our exported staging GGUF embeds the correct native |
| `<|turn>` template (16,934 chars, no `<start_of_turn>`), so train and serve formats are now |
| verifiably aligned. The template is exonerated as the remaining cause. |
| - **Raw-output probe (`/outputs/gemma-cal-staging-Q4_K_M.gguf`):** free generation emits pure |
| degenerate looping, `'Huddle — — — — — …'` to the token limit; constrained generation emits |
| 512 tokens of nothing. **The weights are destroyed, not misformatted.** |
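The two failure signatures above (one symbol repeated to the token limit, or empty output) are easy to flag automatically. The heuristic below is a rough sketch, not part of the existing probe: `looks_degenerate`, its thresholds, the `events` field name, and the use of whitespace-separated pieces as a stand-in for model tokens are all assumptions.

```python
from collections import Counter


def looks_degenerate(text: str, min_tokens: int = 20,
                     max_top_share: float = 0.5) -> bool:
    """Flag empty output or output dominated by one repeated piece."""
    tokens = text.split()
    if not tokens:
        return True  # nothing generated counts as degenerate
    if len(tokens) < min_tokens:
        return False  # too short to judge repetition
    top_count = Counter(tokens).most_common(1)[0][1]
    return top_count / len(tokens) > max_top_share


print(looks_degenerate("Huddle " + "- " * 200))                         # True
print(looks_degenerate('{"events": [], "needs_clarification": true}'))  # False
print(looks_degenerate(""))                                             # True
```

A check like this could run on a handful of free generations right after training, before the slower full eval.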
|
|
| With dataset (69→2,122), template (legacy/native), LR (2e-4/5e-5), and masking (on/off) all |
| varied, degradation always tracks training steps and ends in token-loop collapse. The remaining |
| common factor is **Unsloth's QLoRA path for Gemma-4-31B** (new architecture; training logs warn |
| `get_input_embeddings not auto-handled for Gemma4AudioModel`). **Fine-tuning is halted** until |
| that stack demonstrably works for this arch (or is replaced with plain transformers+PEFT). |
|
|
| ## Step 8 - improve served evals via prompt (stock + targeted SYSTEM additions) |
| Base's only eval misses are prompt-fixable: m03 dropped the 2nd event of a multi-event thread; |
| q04 didn't ask clarification on a "TBD" plan. Added two surgical SYSTEM lines (list every |
| distinct event separately; ask via needs_clarification when day/time is TBD). |
| |
| **Result: PERFECT SCORE, 1.0 on every metric (n=28, tp/fp/fn = 22/0/0).** |
| |
| | Metric | base (old prompt) | **base + new prompt** | |
| | --- | --- | --- | |
| | schema validity | 1.00 | **1.00** | |
| | event precision | 1.00 | **1.00** | |
| | start-exact recall | 0.955 | **1.00** | |
| | event F1 | 0.977 | **1.00** | |
| | no-event accuracy | 1.00 | **1.00** | |
| | clarification recall | 0.75 | **1.00** | |
| |
| Both misses fixed, nothing regressed. **This is the production configuration: stock |
| `unsloth/gemma-4-31B-it-GGUF` + the updated SYSTEM prompt.** (Set Space var |
| `MODEL_HF_REPO=unsloth/gemma-4-31B-it-GGUF`; the prompt ships with the app.) The "Well-Tuned" |
| artifact remains `ParetoOptimal/gemma-4-cal-gguf` (v1); any future fine-tune must beat THIS |
| 1.0 baseline through the gate, i.e., match it and win on a harder, expanded eval set. |
| |
| ## Step 9 - the E4B edge-model campaign (2026-06-10) |
| Re-aimed fine-tuning where it has headroom: a **Gemma-4 E4B (~8B)** edge model that runs without a |
| paid A100, gated against **stock E4B**. Six gated runs, each fixing a diagnosed failure (the fixed |
| recipe trained cleanly every time, with validity 1.0 throughout, confirming the Step-7 breakage was |
| specific to the 31B path): |
| |
| | run | change | F1 | recall | clarify | eval | |
| | --- | --- | --- | --- | --- | --- | |
| | #1 | fixed recipe, 2,122 ex | 0.884 | 0.864 | 1.0 | n=28 | |
| | #2 | + weekday-in-prompt (+data regen) | 0.955 | 0.955 | 0.75 | n=28 | |
| | #3 | + next-DOW conflict filter (74 rows), 4Γ hand | **1.0** | **1.0** | 0.75 | n=28 | |
| | #4 | + TBD-clarify seeds, 8Γ hand | 0.93 | 0.909 | 1.0 | n=28 | |
| | #5 | clarify seeds, 4Γ hand | 0.93 | 0.909 | 1.0 | n=28 | |
| | β | **eval expanded 28β60** (50 events; jitter-resistant) | | | | | |
| | #6 | + Batch-7 seeds (next-DOW, "opens") | 0.97 | 0.96 | 1.0 | n=60 | |
| | stock E4B (weekday prompt) | | 0.97 | 0.96 | 1.0 | n=60 | |
| |
| Run #6 vs stock is an **exact tie** (identical tp/fp/fn 48/1/2; both miss `e09` |
| "next Tuesday", which resisted 7 explicit training seeds, and one "opens" case each). |
| Campaign side effects that improved the PRODUCT for every model: weekday-in-prompt, the |
| next-DOW convention cleanup, and the 60-example eval. |
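The weekday-in-prompt change amounts to stating the weekday next to the reference date so the model does not have to derive it. A minimal sketch follows; the wording of the line and the function name are hypothetical, not the app's actual prompt text.

```python
from datetime import datetime

# Explicit English names avoid depending on the process locale.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def reference_line(now: datetime) -> str:
    """Render the reference time with its weekday spelled out."""
    return f"Current date/time: {WEEKDAYS[now.weekday()]}, {now.strftime('%Y-%m-%d %H:%M')}"


print(reference_line(datetime(2026, 6, 9, 10, 0)))
# Current date/time: Tuesday, 2026-06-09 10:00
```

With the weekday given, phrases such as "next Tuesday" become a counting problem rather than a calendar lookup, which is plausibly why the change helped stock and fine-tuned models alike.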
| |
| ## Step 10 - bare-prompt (internalization) test: no decisive gap |
| Dropped the system prompt for both models (identical minimal user content, same JSON-schema |
| constraint; `modal_eval.py --minimal-prompt`), measuring internalized task knowledge: |
|
|
| | bare, n=60 | stock E4B | fine-tuned E4B | |
| | --- | --- | --- | |
| | schema validity | 0.967 | **1.0** | |
| | event F1 | **0.682** | 0.644 | |
| | start-exact recall | **0.60** | 0.56 | |
| | no-event accuracy | 0.70 | **0.80** | |
| | clarification recall | 0.50 | **0.625** | |
|
|
| Small trade-offs both ways, within noise. **Verdict: at this data scale (139 hand + 2,000 |
| SMCalFlow) with QLoRA/1-epoch, the E4B fine-tune reaches PARITY with stock, not superiority**: |
| non-degraded, perfect validity everywhere, better bare-prompt discipline, slightly weaker bare |
| extraction. The strict-dominance gate therefore never auto-promoted it; the candidate GGUF |
| remains on the Modal volume (`/outputs/gemma-cal-e4b-staging-Q4_K_M.gguf`). Publishing it as |
| the project's edge model at parity is a **product decision** (zero quality cost; production |
| would serve our own fine-tune, fulfilling "Well-Tuned"), deliberately left to the owner, not |
| the gate. |
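One plausible reading of "strict dominance" is the Pareto sense: no metric worse than the baseline and at least one strictly better. The sketch below encodes that reading as an assumption; the function name and metric keys are hypothetical, and the actual gate may be defined differently.

```python
def strictly_dominates(candidate: dict[str, float],
                       baseline: dict[str, float]) -> bool:
    """True if candidate is no worse on every metric and better on at least one."""
    no_worse = all(candidate[k] >= baseline[k] for k in baseline)
    better = any(candidate[k] > baseline[k] for k in baseline)
    return no_worse and better


# Run #6 vs stock E4B with the full prompt (n=60): identical scores.
stock_full = {"event_f1": 0.97, "start_recall": 0.96, "clarify_recall": 1.0}
run6_full = {"event_f1": 0.97, "start_recall": 0.96, "clarify_recall": 1.0}

# Bare-prompt scores (n=60): the fine-tune wins some metrics and loses others.
stock_bare = {"schema_validity": 0.967, "event_f1": 0.682, "start_recall": 0.60}
tuned_bare = {"schema_validity": 1.0, "event_f1": 0.644, "start_recall": 0.56}

print(strictly_dominates(run6_full, stock_full))   # False (tie)
print(strictly_dominates(tuned_bare, stock_bare))  # False (mixed)
```

Under this definition both a tie and a mixed result fail, which matches the outcome that the candidate was never auto-promoted.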
|
|