Eval roadmap: improving the scheduling fine-tune
How we measure and improve ParetoOptimal/gemma-4-cal-gguf (the fine-tuned
Gemma-4-31B that turns chat/images into a calendar ActionPlan). The eval is
task-specific; generic LLM benchmarks (MMLU etc.) don't apply.
Harness: training/eval.py (scores), training/gen_eval.py + training/data/eval.jsonl
(28 held-out examples, disjoint from dataset.jsonl), training/modal_eval.py
(serves the GGUF on the same llama-server the Space uses, then scores).
Baseline scores (Q4_K_M, n=28, 2026-06-09)
| Metric | Score |
|---|---|
| schema validity | 1.00 |
| no-event accuracy | 1.00 |
| clarification recall | 1.00 |
| end-time exact | 1.00 |
| event precision | 0.85 |
| event recall (start-exact) | 0.77 |
| event F1 | 0.81 |
| title similarity | 0.87 |
Discipline (never invents events, always asks when ambiguous) is perfect; all 9
relative-date cases passed. The gap is exact start datetime on a few
explicit far-future dates (misses: e02, e05, e06, e15, one leg of m02).
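As a rough illustration of how the event metrics above relate to match counts, the sketch below micro-averages exact start-datetime matches into precision, recall, and F1. The function names and the input shape (lists of start strings per example) are assumptions for illustration; training/eval.py may score differently.

```python
from collections import Counter

def match_counts(gold_starts, pred_starts):
    """Count (tp, fp, fn) for one example using exact start-string matching."""
    gold, pred = Counter(gold_starts), Counter(pred_starts)
    tp = sum((gold & pred).values())  # multiset intersection: each gold event matches once
    fp = sum(pred.values()) - tp      # predicted events with no gold counterpart
    fn = sum(gold.values()) - tp      # gold events the model did not reproduce
    return tp, fp, fn

def micro_prf(counts):
    """Aggregate per-example (tp, fp, fn) tuples into precision, recall, F1."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Toy examples (illustrative, not rows from eval.jsonl).
examples = [
    (["2026-10-06T15:30"], ["206-10-06T15:30"]),   # malformed year: 1 fp + 1 fn
    (["2026-09-28T09:00"], []),                     # abstained: 1 fn
    (["2026-06-10T10:00"], ["2026-06-10T10:00"]),   # exact match: 1 tp
]
counts = [match_counts(g, p) for g, p in examples]
precision, recall, f1 = micro_prf(counts)
```

Under this accounting a prediction with a malformed year costs both precision and recall, while an abstention costs only recall.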
The 3 steps
1. Diagnose the 5 misses (cheap)
Enhance eval.py to dump the model's actual start/title for mismatched events;
one re-run then shows whether they're date-shift, time/AM-PM, or wrong-year errors,
which tells us exactly what training data to add. (~one A100 eval run; the GGUF is
cached in the Modal Volume, so it's fast.)
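One way the mismatch dump could bucket errors is sketched below. The function name, the category labels, and the assumption that starts are ISO-like strings (YYYY-MM-DDTHH:MM) are illustrative, not taken from eval.py. The year group deliberately accepts any digit count so that malformed years can still be compared field by field.

```python
import re

# Year is \d+ (not \d{4}) so truncated years still parse for comparison.
ISO_RE = re.compile(r"^(\d+)-(\d{2})-(\d{2})T(\d{2}):(\d{2})")

def classify_miss(gold, pred):
    """Label how a predicted start differs from the gold start."""
    if pred is None:
        return "missing"
    g, p = ISO_RE.match(gold), ISO_RE.match(pred)
    if not g or not p:
        return "unparseable"
    gy, gmo, gd, gh, gmi = g.groups()
    py, pmo, pdy, ph, pmi = p.groups()
    if g.groups() == p.groups():
        return "match"
    if (gmo, gd, gh, gmi) == (pmo, pdy, ph, pmi):
        return "wrong-year"          # only the year field differs
    if (gy, gmo, gd) == (py, pmo, pdy):
        if gmi == pmi and abs(int(gh) - int(ph)) == 12:
            return "am-pm"           # same date, hour off by exactly 12
        return "time"
    if (gh, gmi) == (ph, pmi):
        return "date-shift"          # same clock time, different date
    return "other"
```

A dump line per miss could then print the example id, gold, prediction, and this label, which is enough to decide what kind of training data to add.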
2. Baseline comparison (the "Well-Tuned" proof)
Run modal run training/modal_eval.py --model-hf-repo unsloth/gemma-4-31B-it-GGUF
to score stock Gemma-4-31B on the same set. If the fine-tune's discipline
(no-event 1.0, clarification 1.0) and datetime recall beat stock, that's concrete
evidence the fine-tune helps. (Separate ~18 GB model download + A100 time.)
3. Close the gap
Add ~15-20 explicit-date examples (especially next-month dates and times) to
training/data/dataset.jsonl, re-train on Modal (training/modal_train.py),
re-eval, and watch start-exact recall move.
Results log
Step 1: diagnosis (2026-06-09)
The mismatch dump showed the misses are not a reasoning failure. Three of the five are the same bug, a dropped year digit ("206" instead of "2026") on next-month dates, with month, day, and time all correct:
[e02] gold 2026-10-06T15:30 pred 206-10-06T15:30
[e05] gold 2026-10-01T08:15 pred 206-10-01T08:15
[e15] gold 2026-10-08T19:00 pred 206-10-08T19:00
[e06] gold 2026-09-28T09:00 pred [] (abstained)
[m02] Standup + Sprint demo pred Standup only (dropped 2nd leg)
Fix indicated: more far-future explicit-date examples reinforcing 4-digit years (+ multi-event 2nd legs). → Step 3.
Step 2: baseline vs fine-tune (2026-06-09, n=28, Q4_K_M)
| Metric | Stock gemma-4-31B-it-GGUF | Fine-tune gemma-4-cal-gguf |
|---|---|---|
| schema validity | 1.00 | 1.00 |
| event precision | 1.00 | 0.85 |
| start-exact recall | 0.955 | 0.773 |
| event F1 | 0.977 | 0.81 |
| end-exact | 1.00 | 1.00 |
| no-event accuracy | 1.00 | 1.00 |
| clarification recall | 0.75 | 1.00 |
Honest read: stock Gemma-4-31B is already strong at this extraction and beats
the current fine-tune on datetime recall; the "206" bug is a fine-tune regression.
The fine-tune's only clear win is clarification discipline (asks when a thread is
"date TBD"; stock missed q04). As-is, the fine-tune is not justified on
extraction. Step 3 must fix the year regression and clear baseline's 0.955 recall
while keeping clarification at 1.00; otherwise the better play is stock + the
fine-tune's clarification behavior via prompting.
Step 3: after gap-closing retrain (2026-06-09): REGRESSED
Dataset grown 69 → 87 (+18 Oct-Dec 2026 explicit-date examples, disjoint from eval), same 2-epoch recipe, re-quantized to Q4_K_M and republished. Re-eval (n=28):
| Metric | Stock 31B | Fine-tune v1 (69) | Fine-tune v2 (87, retrained) |
|---|---|---|---|
| schema validity | 1.00 | 1.00 | 0.75 |
| event precision | 1.00 | 0.85 | 0.476 |
| start-exact recall | 0.955 | 0.773 | 0.455 |
| event F1 | 0.977 | 0.81 | 0.465 |
| end-exact | 1.00 | 1.00 | 1.00 |
| no-event accuracy | 1.00 | 1.00 | 1.00 |
| clarification recall | 0.75 | 1.00 | 0.75 |
The naive retrain made it worse, not better. New failure modes: unparseable/empty
JSON (validity 1.0 → 0.75), duplicate events, hallucinated "Drive to …" events,
transposed/garbage years (2062, 2062-15:00:00), and previously-passing relative
dates now empty. Cause: overfitting. 18 of 87 examples were near-identical far-future
templates, biasing a tiny dataset and degrading general formatting/extraction.
Conclusions & recommendation
- Stock Gemma-4-31B is already strong at this extraction (F1 0.98). The only thing fine-tuning reliably added was clarification discipline (v1: 1.00 vs stock 0.75), and even that was lost in v2.
- Tiny-dataset SFT is fragile here. v1 (69 ex) underperformed stock on dates; v2 (87 ex) regressed hard. More data of the same shape hurt.
- Recommended path (pick one):
  - Ship stock + prompt for clarification: simplest; recovers the one real win without the regressions. (Lowest risk.)
  - If keeping a fine-tune: rebuild the dataset much larger and more diverse (not template-heavy), drop to ~1 epoch with regularization, and gate every retrain on this eval (only publish if it beats the current best). Consider a higher quant (Q5/Q6) to rule out the "206"/2062 digit corruption being quant-driven.
- Action: revert the live model. v2 (worse) overwrote v1 in ParetoOptimal/gemma-4-cal-gguf. Restore v1 (the better fine-tune) or point the Space back at stock unsloth/gemma-4-31B-it-GGUF until a fine-tune beats the eval baseline.
Bottom line: the eval did its job. It caught a regression before it reached users, and it says the current fine-tune is not yet worth shipping over stock.
Follow-up (2026-06-09)
Live model restored to v1
v2 (regressed) was rolled back: gemma-cal-Q4_K_M.gguf in the repo was restored to the
v1 LFS object via a server-side CommitOperationCopy (no transfer, no GPU). Production
serves the better v1 again.
Dataset rebuilt larger + more diverse (69 → 122)
Added a diversity batch (gen_new_seeds.MORE_SEEDS3): varied date/time formats
(10/15, "the 3rd", "half past 7", "0900", "noon", "midnight"), reschedules,
cancellations, recurring, all-day, deadlines (EOD/midnight), past & hypothetical
(must NOT schedule), richer no-event & clarify, and varied image sources (ticket,
invite screenshot, notice). Goal: counter the template-heavy skew that overfit v2.
Verified valid + disjoint from eval.jsonl.
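A disjointness check of this kind could look like the sketch below. The JSONL field name "input" and the normalization rule are assumptions for illustration; the real files may use a different key, and the project's own check may be stricter.

```python
import json

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())

def overlap(train_lines, eval_lines, key="input"):
    """Return normalized inputs present in both JSONL line collections."""
    def inputs(lines):
        # Skip blank lines; each remaining line is one JSON object.
        return {normalize(json.loads(line)[key]) for line in lines if line.strip()}
    return inputs(train_lines) & inputs(eval_lines)

# Toy rows (illustrative only, not from dataset.jsonl or eval.jsonl).
train_rows = ['{"input": "Lunch  Friday at noon?"}', '{"input": "Dentist 10/15 3pm"}']
eval_rows = ['{"input": "lunch friday at noon?"}', ""]
leaked = overlap(train_rows, eval_rows)
```

An empty result is the condition to enforce before training; any non-empty set names the rows that leak into the eval.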
Eval-gating is now the publishing process
No retrain publishes unless it beats the eval. training/gated_retrain.py:
- retrain on Modal → upload to a staging filename (gemma-cal-staging-Q4_K_M.gguf) in the repo (production file untouched; mmproj skipped via --skip-mmproj);
- eval the staging file (modal_eval.py --model-file …);
- gate: schema_validity ≥ 0.95, event_f1 ≥ 0.81, start-exact recall ≥ 0.773 (defaults = the current best, v1); tune via --gate-f1 / --gate-recall;
- PASS → promote staging → production via server-side CommitOperationCopy (free); FAIL → delete staging, production unchanged.
Run: python training/gated_retrain.py [--epochs 1 --gate-f1 … --gate-recall …].
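The gate decision itself reduces to a threshold comparison. Below is a minimal sketch, assuming the eval produces a metrics dict; the key start_exact_recall and the function name are hypothetical, while the floors are the defaults listed above.

```python
# Default floors = the current best fine-tune (v1), as listed above.
DEFAULT_GATE = {
    "schema_validity": 0.95,
    "event_f1": 0.81,
    "start_exact_recall": 0.773,  # hypothetical key name for start-exact recall
}

def check_gate(metrics, gate=DEFAULT_GATE):
    """Return (passed, failures); a missing metric counts as 0.0 and fails."""
    failures = [
        f"{name}: {metrics.get(name, 0.0):.3f} < {floor:.3f}"
        for name, floor in gate.items()
        if metrics.get(name, 0.0) < floor
    ]
    return (not failures), failures

# v1 meets its own floors; the v2 scores from the Step 3 table do not.
v1 = {"schema_validity": 1.00, "event_f1": 0.81, "start_exact_recall": 0.773}
v2 = {"schema_validity": 0.75, "event_f1": 0.465, "start_exact_recall": 0.455}
v1_ok, _ = check_gate(v1)
v2_ok, v2_failures = check_gate(v2)
```

Returning the list of failed floors, and not only a boolean, makes the FAIL log self-explanatory when staging is deleted.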
Step 4: first eval-gated retrain (122 ex, 1 epoch): GATE FAILED (protected prod)
The retrain scored worse than every prior version and the gate refused to publish:
| Metric | Stock | v1 (live) | v3 staging (122, 1ep) |
|---|---|---|---|
| schema validity | 1.00 | 1.00 | 0.46 |
| event F1 | 0.977 | 0.81 | 0.214 |
| start-exact recall | 0.955 | 0.773 | 0.136 |
| no-event accuracy | 1.00 | 1.00 | 1.00 |
| clarification recall | 0.75 | 1.00 | 1.00 |
About half of outputs were unparseable; extraction collapsed. Gate: FAIL → staging deleted, production unchanged (still v1). The gate worked exactly as intended.
Verdict (after 3 fine-tune attempts)
All three fine-tunes (v1: 69 ex / 2 ep; v2: 87 / 2 ep; v3: 122 / 1 ep) underperform
stock Gemma-4-31B, and the larger runs broke JSON validity. Only the safety behaviors
(no-event, clarification) survive fine-tuning; extraction degrades. QLoRA-on-31B-Q4 here
is fragile and not worth shipping over stock. Recommended: serve stock
unsloth/gemma-4-31B-it-GGUF and recover the one fine-tune win (clarification) via the
prompt. Keep v1 as the published fine-tune for the "Well-Tuned" artifact, but don't route
production extraction through it. Revisit fine-tuning only with a substantially larger, more
varied dataset and a recipe that holds schema validity at 1.0, gated, as now, on this eval.
Step 5: quantization-penalty test (2026-06-09): quant EXONERATED
Hypothesis: maybe Q4 quantization (the "206"/2062 digit bug) was tanking the fine-tune.
Tested the SAME fine-tuned weights (gemma-cal-f16.gguf, v2/87-ex, the best fp16 still on the
volume) at three precisions on the 28-example eval (training/modal_quant_eval.py):
| precision | schema validity | event F1 | start-exact recall |
|---|---|---|---|
| f16 (full) | 0.643 | 0.571 | 0.545 |
| Q8_0 | 0.679 | 0.565 | 0.591 |
| Q4_K_M | 0.75 | 0.465 | 0.455 |
| base (stock) | 1.00 | 0.977 | 0.955 |
Quantization is not the cause. At full fp16 the fine-tune still scores validity 0.64 / F1 0.57, nowhere near base; validity is actually lower at f16 than Q4, so quant isn't breaking the JSON. Precision buys only ~+0.1 F1/recall (Q4 → Q8/f16), a fraction of the gap to base. The degradation is the SFT itself, not the GGUF conversion. The second stage of this test (retrain at Q8 to beat base) is not pursued, since the gate would fail. (Caveat: v1's fp16 was overwritten, so this used v2; a definitive v1 test needs a retrain, but the small quant lift makes a base-beating result improbable.)
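To make "a fraction of the gap" concrete, the arithmetic below uses only the numbers in the table above; the helper name is an illustrative choice.

```python
def gap_closed(low, high, target):
    """Fraction of the (target - low) gap recovered by moving from low to high."""
    return (high - low) / (target - low)

# Q4_K_M -> f16 for the same fine-tuned weights, measured against stock base.
f1_share = gap_closed(0.465, 0.571, 0.977)      # event F1
recall_share = gap_closed(0.455, 0.545, 0.955)  # start-exact recall
```

Both shares come out at roughly a fifth (about 0.21 for F1 and 0.18 for recall), so even removing quantization entirely leaves about four fifths of the deficit in place.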
Final recommendation
A higher quant won't make the fine-tune beat base, and an automation agent (e.g. ml-intern)
doesn't change the binding constraints (near-ceiling base; small data; SFT degrades
instruction-following). Serve stock unsloth/gemma-4-31B-it-GGUF and recover the
clarification behavior via the system prompt; keep v1 as the "Well-Tuned" artifact. Only
revisit fine-tuning with a substantially larger, real, diverse dataset + a validity-preserving
recipe (low LR, few steps), always gated on this eval.
Real training data: SMCalFlow importer
training/import_smcalflow.py converts SMCalFlow (Microsoft Semantic Machines, CC BY-SA
4.0) calendar dialogues into our ActionPlan format. SMCalFlow encodes events as LISP
"dataflow" programs; the importer parses CreatePreflightEventWrapper turns, extracts
subject/start/location/attendees, and resolves date/time constructs (Tomorrow, NextDOW,
MD, NumberPM, HourMinuteMilitary, …) against a per-example reference "now" spread across
2026, so relative dates become concrete, self-consistent targets (this directly trains the failing
date/time skill, with varied 4-digit years). Conservative: only emits a row when a title AND an
explicit start time resolve (~7.5k usable turns from train+valid).
- Run: python training/import_smcalflow.py --limit 2000 --heldout 200 → writes training/data/smcalflow_train.jsonl (+ …_heldout.jsonl). Both are git-ignored (CC BY-SA share-alike vs this repo's Apache-2.0: we don't commit/redistribute the derived data; the importer code is ours) and disjoint from eval.jsonl.
- train_qlora.py now trains on dataset.jsonl + smcalflow_train.jsonl (when present). gated_retrain.py therefore trains on real data, and still only publishes if it beats the gate, so a bigger-but-worse model can't reach production.
- Attribution (required by CC BY-SA): Semantic Machines et al., "Task-Oriented Dialogue as Dataflow Synthesis," TACL 2020.
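Resolving relative constructs against a reference "now" can be sketched as below. This is not the importer's code: the function names are hypothetical, and the NextDOW convention used here (the next occurrence strictly after the reference date) is an assumption that import_smcalflow.py may not share.

```python
from datetime import datetime, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]

def resolve_tomorrow(now, hour, minute=0):
    """Concrete datetime for 'tomorrow at hour:minute' relative to now."""
    day = (now + timedelta(days=1)).date()
    return datetime(day.year, day.month, day.day, hour, minute)

def resolve_next_dow(now, weekday, hour, minute=0):
    """Next occurrence of weekday strictly after now's date (assumed convention)."""
    target = WEEKDAYS.index(weekday.lower())
    delta = (target - now.weekday() - 1) % 7 + 1  # always 1..7 days ahead
    day = (now + timedelta(days=delta)).date()
    return datetime(day.year, day.month, day.day, hour, minute)

# Reference "now": 2026-06-09 is a Tuesday.
now = datetime(2026, 6, 9, 12, 0)
tomorrow = resolve_tomorrow(now, 15, 30).isoformat(timespec="minutes")
friday = resolve_next_dow(now, "Friday", 9).isoformat(timespec="minutes")
tuesday = resolve_next_dow(now, "Tuesday", 9).isoformat(timespec="minutes")
```

With this convention "next Tuesday" said on a Tuesday resolves a full week ahead; whichever convention is chosen, training targets and eval gold labels need to agree on it.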
Step 6: eval-gated retrain on REAL data (2026-06-09): FAILED gate (worst yet)
Trained the 31B on 2,122 examples (122 hand-authored + 2,000 real SMCalFlow), 1 epoch,
through gated_retrain.py with a beat-base gate (F1 ≥ 0.95, recall ≥ 0.90). Result on the 28-ex eval:
| Metric | base | v1 (live) | real-data (2,122 ex) |
|---|---|---|---|
| schema validity | 1.00 | 1.00 | 0.107 |
| event F1 | 0.977 | 0.81 | 0.000 |
| start-exact recall | 0.955 | 0.773 | 0.000 |
~90% unparseable output, zero events extracted. Gate FAIL → not promoted; production stays v1.
Verdict across 4 fine-tunes (now incl. real data)
Scores worsen monotonically with more training/data: v1 (69 synth, F1 0.81) → v2 (87, 0.465)
→ v3 (122, 0.214) → real (2,122, 0.0). This is no longer a data problem: the SFT recipe
itself degrades the model, and more data makes it worse. Most likely root cause to investigate
if fine-tuning is ever revisited: a train/inference chat-template mismatch. train_qlora.py
formats with Unsloth's get_chat_template("gemma") while llama-server serves with the GGUF's
own --jinja template; if these differ for Gemma-4, training optimizes a format the server never
uses, and the divergence compounds with more steps (exactly the monotonic decay seen). Other
suspects: LR too high (2e-4) / catastrophic forgetting on a near-ceiling base.
Final, evidence-backed recommendation: serve stock unsloth/gemma-4-31B-it-GGUF (best by far)
and recover clarification via the system prompt. Do NOT route production through any current
fine-tune. The eval-gate has now correctly rejected 2 bad retrains; keep it as the publish gate.
Step 7: recipe fix + raw-output probe (2026-06-09): training stack implicated, fine-tuning HALTED
Fixed the suspected train/serve chat-template mismatch (PR #54): Gemma-4's native
chat_template.jinja uses a NEW <|turn>role … <turn|> format (no <start_of_turn> at all),
while training forced unsloth's legacy "gemma" template. train_qlora.py now formats with the
tokenizer's native template (hard <|turn> assert), masks loss to the assistant turn, and uses LR 5e-5.
Retrained on the 2,122-example set through the gate: validity 0.0 → gate FAIL (production
stays v1, third bad retrain rejected).
Diagnostics that pinpointed the cause:
- GGUF template check (CPU, ~free): our exported staging GGUF embeds the correct native <|turn> template (16,934 chars, no <start_of_turn>), so train and serve formats are now verifiably aligned. Template is exonerated as the remaining cause.
- Raw-output probe (/outputs/gemma-cal-staging-Q4_K_M.gguf): free generation emits pure degenerate looping ('Huddle' followed by one repeated symbol, up to the token limit); constrained generation emits 512 tokens of nothing. The weights are destroyed, not misformatted.
With dataset (69 → 2,122), template (legacy/native), LR (2e-4/5e-5), and masking (on/off) all
varied, degradation always tracks training steps and ends in token-loop collapse. The remaining
common factor is Unsloth's QLoRA path for Gemma-4-31B (new architecture; training logs warn
get_input_embeddings not auto-handled for Gemma4AudioModel). Fine-tuning is halted until
that stack demonstrably works for this arch (or is replaced with plain transformers+PEFT).
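A raw-output probe can flag this kind of token-loop collapse automatically. The heuristic below is a sketch with arbitrary illustrative thresholds, not the probe used above: it checks whether the output ends in a short pattern repeated many times.

```python
def is_degenerate_loop(tokens, min_repeats=8, max_period=4):
    """True if tokens end in a pattern of length <= max_period repeated
    at least min_repeats times in a row."""
    for period in range(1, max_period + 1):
        needed = period * min_repeats
        if len(tokens) < needed:
            continue
        tail = tokens[-needed:]
        pattern = tail[:period]
        # Every position in the tail must follow the repeating pattern.
        if all(tail[i] == pattern[i % period] for i in range(needed)):
            return True
    return False

# Illustrative token lists: one word followed by a repeated symbol vs. varied output.
looping = ["Huddle"] + ["x"] * 20
healthy = [str(i) for i in range(20)]
```

Running such a check on a handful of free generations before the full eval would catch destroyed weights cheaply, before spending a scored eval run on them.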
Step 8: improve served evals via prompt (stock + targeted SYSTEM additions)
Base's only eval misses are prompt-fixable: m03 dropped the 2nd event of a multi-event thread; q04 didn't ask clarification on a "TBD" plan. Added two surgical SYSTEM lines (list every distinct event separately; ask via needs_clarification when day/time is TBD).
Result: PERFECT SCORE, 1.0 on every metric (n=28, tp/fp/fn = 22/0/0).
| Metric | base (old prompt) | base + new prompt |
|---|---|---|
| schema validity | 1.00 | 1.00 |
| event precision | 1.00 | 1.00 |
| start-exact recall | 0.955 | 1.00 |
| event F1 | 0.977 | 1.00 |
| no-event accuracy | 1.00 | 1.00 |
| clarification recall | 0.75 | 1.00 |
Both misses fixed, nothing regressed. This is the production configuration: stock
unsloth/gemma-4-31B-it-GGUF + the updated SYSTEM prompt. (Set Space var
MODEL_HF_REPO=unsloth/gemma-4-31B-it-GGUF; the prompt ships with the app.) The "Well-Tuned"
artifact remains ParetoOptimal/gemma-4-cal-gguf (v1); any future fine-tune must beat THIS
1.0 baseline through the gate, i.e., match it and win on a harder, expanded eval set.
Step 9: the E4B edge-model campaign (2026-06-10)
Re-aimed fine-tuning where it has headroom: a Gemma-4 E4B (~8B) edge model that runs without a paid A100, gated against stock E4B. Six gated runs, each fixing a diagnosed failure (the fixed recipe trained cleanly every time, with validity 1.0 throughout, confirming the Step-7 breakage was specific to the 31B path):
| run | change | F1 | recall | clarify | eval |
|---|---|---|---|---|---|
| #1 | fixed recipe, 2,122 ex | 0.884 | 0.864 | 1.0 | n=28 |
| #2 | + weekday-in-prompt (+data regen) | 0.955 | 0.955 | 0.75 | n=28 |
| #3 | + next-DOW conflict filter (74 rows), 4× hand | 1.0 | 1.0 | 0.75 | n=28 |
| #4 | + TBD-clarify seeds, 8× hand | 0.93 | 0.909 | 1.0 | n=28 |
| #5 | clarify seeds, 4× hand | 0.93 | 0.909 | 1.0 | n=28 |
| - | eval expanded 28 → 60 (50 events; jitter-resistant) | | | | |
| #6 | + Batch-7 seeds (next-DOW, "opens") | 0.97 | 0.96 | 1.0 | n=60 |
| stock E4B | weekday prompt | 0.97 | 0.96 | 1.0 | n=60 |
Run #6 vs stock is an exact tie (identical tp/fp/fn 48/1/2; both miss e09,
"next Tuesday", which resisted 7 explicit training seeds, and one "opens" case each).
Campaign side effects that improved the PRODUCT for every model: weekday-in-prompt, the
next-DOW convention cleanup, and the 60-example eval.
Step 10: bare-prompt (internalization) test: no decisive gap
Dropped the system prompt for both models (identical minimal user content, same JSON-schema
constraint; modal_eval.py --minimal-prompt), measuring internalized task knowledge:
| bare, n=60 | stock E4B | fine-tuned E4B |
|---|---|---|
| schema validity | 0.967 | 1.0 |
| event F1 | 0.682 | 0.644 |
| start-exact recall | 0.60 | 0.56 |
| no-event accuracy | 0.70 | 0.80 |
| clarification recall | 0.50 | 0.625 |
Small trade-offs both ways, within noise. Verdict: at this data scale (139 hand + 2,000
SMCalFlow) with QLoRA/1-epoch, the E4B fine-tune reaches PARITY with stock, not superiority:
non-degraded, perfect validity everywhere, better bare-prompt discipline, slightly weaker bare
extraction. The strict-dominance gate therefore never auto-promoted it; the candidate GGUF
remains on the Modal volume (/outputs/gemma-cal-e4b-staging-Q4_K_M.gguf). Publishing it as
the project's edge model at parity is a product decision (zero quality cost; production
would serve our own fine-tune, fulfilling "Well-Tuned"), deliberately left to the owner, not
the gate.
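A strict-dominance rule of the kind described can be sketched as follows: the candidate must be no worse on every metric and better on at least one. The function name, metric keys, and tolerance are assumptions, and the real gate may compare a different metric set; the numbers are the bare-prompt scores from the Step 10 table.

```python
def strictly_dominates(candidate, baseline, eps=1e-9):
    """True if candidate >= baseline on every metric and > on at least one."""
    no_worse = all(candidate[k] >= baseline[k] - eps for k in baseline)
    better = any(candidate[k] > baseline[k] + eps for k in baseline)
    return no_worse and better

stock = {"schema_validity": 0.967, "event_f1": 0.682, "start_exact_recall": 0.60,
         "no_event_accuracy": 0.70, "clarification_recall": 0.50}
tuned = {"schema_validity": 1.0, "event_f1": 0.644, "start_exact_recall": 0.56,
         "no_event_accuracy": 0.80, "clarification_recall": 0.625}

tuned_wins = strictly_dominates(tuned, stock)   # loses on F1 and recall
stock_wins = strictly_dominates(stock, tuned)   # loses on validity and discipline
tie_wins = strictly_dominates(stock, stock)     # identical scores never dominate
```

Neither model dominates the other on these scores, and an exact tie such as run #6 vs stock fails the "better on at least one" clause, which is why promotion falls to a human decision.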