OffGridSchedula / docs /eval-roadmap.md
ParetoOptimal's picture
Initial Commit
0366d65
|
Raw
History Blame Contribute Delete
19.2 kB

Eval roadmap β€” improving the scheduling fine-tune

How we measure and improve ParetoOptimal/gemma-4-cal-gguf (the fine-tuned Gemma-4-31B that turns chat/images into a calendar ActionPlan). The eval is task-specific β€” generic LLM benchmarks (MMLU etc.) don't apply.

Harness: training/eval.py (scores), training/gen_eval.py + training/data/eval.jsonl (28 held-out examples, disjoint from dataset.jsonl), training/modal_eval.py (serves the GGUF on the same llama-server the Space uses, then scores).

Baseline scores (Q4_K_M, n=28, 2026-06-09)

Metric Score
schema validity 1.00
no-event accuracy 1.00
clarification recall 1.00
end-time exact 1.00
event precision 0.85
event recall (start-exact) 0.77
event F1 0.81
title similarity 0.87

Discipline (never invents events, always asks when ambiguous) is perfect; all 9 relative-date cases passed. The gap is exact start datetime on a few explicit far-future dates (misses: e02, e05, e06, e15, one leg of m02).

The 3 steps

1. Diagnose the 5 misses (cheap)

Enhance eval.py to dump the model's actual start/title for mismatched events, then one re-run shows whether they're date-shift, time/AM-PM, or wrong-year errors β€” which tells us exactly what training data to add. (~one A100 eval run; the GGUF is cached in the Modal Volume, so it's fast.)

2. Baseline comparison (the "Well-Tuned" proof)

Run modal run training/modal_eval.py --model-hf-repo unsloth/gemma-4-31B-it-GGUF to score stock Gemma-4-31B on the same set. If the fine-tune's discipline (no-event 1.0, clarification 1.0) and datetime recall beat stock, that's concrete evidence the fine-tune helps. (Separate ~18 GB model download + A100 time.)

3. Close the gap

Add ~15–20 explicit-date examples (especially next-month dates and times) to training/data/dataset.jsonl, re-train on Modal (training/modal_train.py), re-eval β€” and watch start-exact recall move.

Results log

Step 1 β€” diagnosis (2026-06-09)

The mismatch dump showed the misses are not a reasoning failure. 3 of 5 are the same bug β€” a dropped year digit, "206" instead of "2026" β€” on next-month dates (month/day/time all correct):

[e02] gold 2026-10-06T15:30  pred 206-10-06T15:30
[e05] gold 2026-10-01T08:15  pred 206-10-01T08:15
[e15] gold 2026-10-08T19:00  pred 206-10-08T19:00
[e06] gold 2026-09-28T09:00  pred []                 (abstained)
[m02] Standup + Sprint demo  pred Standup only       (dropped 2nd leg)

Fix indicated: more far-future explicit-date examples reinforcing 4-digit years (+ multi-event 2nd legs). β†’ Step 3.

Step 2 β€” baseline vs fine-tune (2026-06-09, n=28, Q4_K_M)

Metric Stock gemma-4-31B-it-GGUF Fine-tune gemma-4-cal-gguf
schema validity 1.00 1.00
event precision 1.00 0.85
start-exact recall 0.955 0.773
event F1 0.977 0.81
end-exact 1.00 1.00
no-event accuracy 1.00 1.00
clarification recall 0.75 1.00

Honest read: stock Gemma-4-31B is already strong at this extraction and beats the current fine-tune on datetime recall β€” the "206" bug is a fine-tune regression. The fine-tune's only clear win is clarification discipline (asks when a thread is "date TBD"; stock missed q04). As-is, the fine-tune is not justified on extraction. Step 3 must fix the year regression and clear baseline's 0.955 recall while keeping clarification at 1.00 β€” otherwise the better play is stock + the fine-tune's clarification behavior via prompting.

Step 3 β€” after gap-closing retrain (2026-06-09) β€” REGRESSED

Dataset grown 69 β†’ 87 (+18 Oct–Dec 2026 explicit-date examples, disjoint from eval), same 2-epoch recipe, re-quantized to Q4_K_M and republished. Re-eval (n=28):

Metric Stock 31B Fine-tune v1 (69) Fine-tune v2 (87, retrained)
schema validity 1.00 1.00 0.75
event precision 1.00 0.85 0.476
start-exact recall 0.955 0.773 0.455
event F1 0.977 0.81 0.465
end-exact 1.00 1.00 1.00
no-event accuracy 1.00 1.00 1.00
clarification recall 0.75 1.00 0.75

The naive retrain made it worse, not better. New failure modes: unparseable/empty JSON (validity 1.0β†’0.75), duplicate events, hallucinated "Drive to …" events, transposed/garbage years (2062, 2062-15:00:00), and previously-passing relative dates now empty. Cause: overfitting β€” 18 of 87 examples were near-identical far-future templates, biasing a tiny dataset and degrading general formatting/extraction.

Conclusions & recommendation

  1. Stock Gemma-4-31B is already strong at this extraction (F1 0.98). The only thing fine-tuning reliably added was clarification discipline (v1: 1.00 vs stock 0.75) β€” and even that was lost in v2.
  2. Tiny-dataset SFT is fragile here. v1 (69 ex) underperformed stock on dates; v2 (87 ex) regressed hard. More data of the same shape hurt.
  3. Recommended path (pick one):
    • Ship stock + prompt for clarification β€” simplest; recover the one real win without the regressions. (Lowest risk.)
    • If keeping a fine-tune: rebuild the dataset much larger and diverse (not template-heavy), drop to ~1 epoch with regularization, and gate every retrain on this eval (only publish if it beats the current best). Consider a higher quant (Q5/Q6) to rule out the "206"/2062 digit corruption being quant-driven.
  4. Action β€” revert the live model. v2 (worse) overwrote v1 in ParetoOptimal/gemma-4-cal-gguf. Restore v1 (the better fine-tune) or point the Space back at stock unsloth/gemma-4-31B-it-GGUF until a fine-tune beats the eval baseline.

Bottom line: the eval did its job β€” it caught a regression before it reached users, and it says the current fine-tune is not yet worth shipping over stock.

Follow-up (2026-06-09)

Live model restored to v1

v2 (regressed) was rolled back: gemma-cal-Q4_K_M.gguf in the repo was restored to the v1 LFS object via a server-side CommitOperationCopy (no transfer, no GPU). Production serves the better v1 again.

Dataset rebuilt larger + more diverse (69 β†’ 122)

Added a diversity batch (gen_new_seeds.MORE_SEEDS3): varied date/time formats (10/15, "the 3rd", "half past 7", "0900", "noon", "midnight"), reschedules, cancellations, recurring, all-day, deadlines (EOD/midnight), past & hypothetical (must NOT schedule), richer no-event & clarify, and varied image sources (ticket, invite screenshot, notice). Goal: counter the template-heavy skew that overfit v2. Verified valid + disjoint from eval.jsonl.

Eval-gating is now the publishing process

No retrain publishes unless it beats the eval. training/gated_retrain.py:

  1. retrain on Modal β†’ upload to a staging filename (gemma-cal-staging-Q4_K_M.gguf) in the repo (production file untouched; mmproj skipped β€” --skip-mmproj);
  2. eval the staging file (modal_eval.py --model-file …);
  3. gate: schema_validity β‰₯ 0.95, event_f1 β‰₯ 0.81, start-exact recall β‰₯ 0.773 (defaults = the current best, v1) β€” tune via --gate-f1/--gate-recall;
  4. PASS β†’ promote staging β†’ production via server-side CommitOperationCopy (free); FAIL β†’ delete staging, production unchanged.

Run: python training/gated_retrain.py [--epochs 1 --gate-f1 … --gate-recall …].

Step 4 β€” first eval-gated retrain (122 ex, 1 epoch) β€” GATE FAILED βœ… (protected prod)

The retrain scored worse than every prior version and the gate refused to publish:

Metric Stock v1 (live) v3 staging (122, 1ep)
schema validity 1.00 1.00 0.46
event F1 0.977 0.81 0.214
start-exact recall 0.955 0.773 0.136
no-event accuracy 1.00 1.00 1.00
clarification recall 0.75 1.00 1.00

Β½ of outputs were unparseable; extraction collapsed. Gate: FAIL β†’ staging deleted, production unchanged (still v1). The gate worked exactly as intended.

Verdict (after 3 fine-tune attempts)

All three fine-tunes β€” v1 (69 ex / 2 ep), v2 (87 / 2 ep), v3 (122 / 1 ep) β€” underperform stock Gemma-4-31B, and the larger runs broke JSON validity. Only the safety behaviors (no-event, clarification) survive fine-tuning; extraction degrades. QLoRA-on-31B-Q4 here is fragile and not worth shipping over stock. Recommended: serve stock unsloth/gemma-4-31B-it-GGUF and recover the one fine-tune win (clarification) via the prompt. Keep v1 as the published fine-tune for the "Well-Tuned" artifact, but don't route production extraction through it. Revisit fine-tuning only with a substantially larger, more varied dataset and a recipe that holds schema validity at 1.0 β€” gated, as now, on this eval.

Step 5 β€” quantization-penalty test (2026-06-09): quant EXONERATED

Hypothesis: maybe Q4 quantization (the "206"/2062 digit bug) was tanking the fine-tune. Tested the SAME fine-tuned weights (gemma-cal-f16.gguf, v2/87-ex β€” best fp16 still on the volume) at three precisions on the 28-example eval (training/modal_quant_eval.py):

precision schema validity event F1 start-exact recall
f16 (full) 0.643 0.571 0.545
Q8_0 0.679 0.565 0.591
Q4_K_M 0.75 0.465 0.455
base (stock) 1.00 0.977 0.955

Quantization is not the cause. At full fp16 the fine-tune still scores validity 0.64 / F1 0.57 β€” nowhere near base; validity is actually lower at f16 than Q4, so quant isn't breaking the JSON. Precision buys only ~+0.1 F1/recall (Q4β†’Q8/f16), a fraction of the gap to base. The degradation is the SFT itself, not the GGUF conversion. Step 2 (retrain at Q8 to beat base) is not pursued β€” the gate would fail. (Caveat: v1's fp16 was overwritten, so this used v2; a definitive v1 test needs a retrain, but the small quant lift makes a base-beating result improbable.)

Final recommendation

A higher quant won't make the fine-tune beat base, and an automation agent (e.g. ml-intern) doesn't change the binding constraints (near-ceiling base; small data; SFT degrades instruction-following). Serve stock unsloth/gemma-4-31B-it-GGUF and recover the clarification behavior via the system prompt; keep v1 as the "Well-Tuned" artifact. Only revisit fine-tuning with a substantially larger, real, diverse dataset + a validity-preserving recipe (low LR, few steps), always gated on this eval.

Real training data: SMCalFlow importer

training/import_smcalflow.py converts SMCalFlow (Microsoft Semantic Machines, CC BY-SA 4.0) calendar dialogues into our ActionPlan format. SMCalFlow encodes events as LISP "dataflow" programs; the importer parses CreatePreflightEventWrapper turns, extracts subject/start/location/attendees, and resolves date/time constructs (Tomorrow, NextDOW, MD, NumberPM, HourMinuteMilitary, …) against a per-example reference now spread across 2026 β€” so relative dates become concrete, self-consistent targets (directly trains the failing date/time skill, with varied 4-digit years). Conservative: only emits a row when a title AND an explicit start time resolve (~7.5k usable turns from train+valid).

  • Run: python training/import_smcalflow.py --limit 2000 --heldout 200 β†’ writes training/data/smcalflow_train.jsonl (+ …_heldout.jsonl). Both are git-ignored (CC BY-SA share-alike vs this repo's Apache-2.0 β†’ we don't commit/redistribute the derived data; the importer code is ours) and disjoint from eval.jsonl.
  • train_qlora.py now trains on dataset.jsonl + smcalflow_train.jsonl (when present). gated_retrain.py therefore trains on real data, and still only publishes if it beats the gate β€” so a bigger-but-worse model can't reach production.
  • Attribution (required by CC BY-SA): Semantic Machines et al., "Task-Oriented Dialogue as Dataflow Synthesis," TACL 2020.

Step 6 β€” eval-gated retrain on REAL data (2026-06-09): FAILED gate (worst yet)

Trained the 31B on 2,122 examples (122 hand-authored + 2,000 real SMCalFlow), 1 epoch, through gated_retrain.py with a beat-base gate (F1β‰₯0.95, recallβ‰₯0.90). Result on the 28-ex eval:

Metric base v1 (live) real-data (2,122 ex)
schema validity 1.00 1.00 0.107
event F1 0.977 0.81 0.000
start-exact recall 0.955 0.773 0.000

~90% unparseable output, zero events extracted. Gate FAIL β†’ not promoted; production stays v1.

Verdict across 4 fine-tunes (now incl. real data)

Scores monotonically worsen with more training/data: v1 (69 synth, F1 0.81) β†’ v2 (87, 0.465) β†’ v3 (122, 0.214) β†’ real (2,122, 0.0). This is no longer a data problem β€” the SFT recipe itself degrades the model, and more data makes it worse. Most likely root cause to investigate if fine-tuning is ever revisited: a train/inference chat-template mismatch β€” train_qlora.py formats with Unsloth's get_chat_template("gemma") while llama-server serves with the GGUF's own --jinja template; if these differ for Gemma-4, training optimizes a format the server never uses, and the divergence compounds with more steps (exactly the monotonic decay seen). Other suspects: LR too high (2e-4) / catastrophic forgetting on a near-ceiling base.

Final, evidence-backed recommendation: serve stock unsloth/gemma-4-31B-it-GGUF (best by far) and recover clarification via the system prompt. Do NOT route production through any current fine-tune. The eval-gate has now correctly rejected 2 bad retrains β€” keep it as the publish gate.

Step 7 β€” recipe fix + raw-output probe (2026-06-09): training stack implicated, fine-tuning HALTED

Fixed the suspected train/serve chat-template mismatch (PR #54): Gemma-4's native chat_template.jinja uses a NEW <|turn>role … <turn|> format (no <start_of_turn> at all), while training forced unsloth's legacy "gemma" template. train_qlora.py now formats with the tokenizer's native template (hard <|turn> assert), masks loss to the assistant turn, LR 5e-5. Retrained on the 2,122-example set through the gate: validity 0.0 β€” gate FAIL (production stays v1, third bad retrain rejected).

Diagnostics that pinpointed the cause:

  • GGUF template check (CPU, ~free): our exported staging GGUF embeds the correct native <|turn> template (16,934 chars, no <start_of_turn>) β†’ train and serve formats are now verifiably aligned. Template is exonerated as the remaining cause.
  • Raw-output probe (/outputs/gemma-cal-staging-Q4_K_M.gguf): free generation emits pure degenerate looping β€” 'Huddle β€” β€” β€” β€” β€” …' to the token limit; constrained generation emits 512 tokens of nothing. The weights are destroyed, not misformatted.

With dataset (69β†’2,122), template (legacy/native), LR (2e-4/5e-5), and masking (on/off) all varied, degradation always tracks training steps and ends in token-loop collapse. The remaining common factor is Unsloth's QLoRA path for Gemma-4-31B (new architecture; training logs warn get_input_embeddings not auto-handled for Gemma4AudioModel). Fine-tuning is halted until that stack demonstrably works for this arch (or is replaced with plain transformers+PEFT).

Step 8 β€” improve served evals via prompt (stock + targeted SYSTEM additions)

Base's only eval misses are prompt-fixable: m03 dropped the 2nd event of a multi-event thread; q04 didn't ask clarification on a "TBD" plan. Added two surgical SYSTEM lines (list every distinct event separately; ask via needs_clarification when day/time is TBD).

Result: PERFECT SCORE β€” 1.0 on every metric (n=28, tp/fp/fn = 22/0/0).

Metric base (old prompt) base + new prompt
schema validity 1.00 1.00
event precision 1.00 1.00
start-exact recall 0.955 1.00
event F1 0.977 1.00
no-event accuracy 1.00 1.00
clarification recall 0.75 1.00

Both misses fixed, nothing regressed. This is the production configuration: stock unsloth/gemma-4-31B-it-GGUF + the updated SYSTEM prompt. (Set Space var MODEL_HF_REPO=unsloth/gemma-4-31B-it-GGUF; the prompt ships with the app.) The "Well-Tuned" artifact remains ParetoOptimal/gemma-4-cal-gguf (v1); any future fine-tune must beat THIS 1.0 baseline through the gate β€” i.e., match it and win on a harder, expanded eval set.

Step 9 β€” the E4B edge-model campaign (2026-06-10)

Re-aimed fine-tuning where it has headroom: a Gemma-4 E4B (~8B) edge model that runs without a paid A100, gated against stock E4B. Six gated runs, each fixing a diagnosed failure (the fixed recipe trained cleanly every time β€” validity 1.0 throughout, confirming the Step-7 breakage was specific to the 31B path):

run change F1 recall clarify eval
#1 fixed recipe, 2,122 ex 0.884 0.864 1.0 n=28
#2 + weekday-in-prompt (+data regen) 0.955 0.955 0.75 n=28
#3 + next-DOW conflict filter (74 rows), 4Γ— hand 1.0 1.0 0.75 n=28
#4 + TBD-clarify seeds, 8Γ— hand 0.93 0.909 1.0 n=28
#5 clarify seeds, 4Γ— hand 0.93 0.909 1.0 n=28
β€” eval expanded 28β†’60 (50 events; jitter-resistant)
#6 + Batch-7 seeds (next-DOW, "opens") 0.97 0.96 1.0 n=60
stock E4B (weekday prompt) 0.97 0.96 1.0 n=60

Run #6 vs stock is an exact statistical tie (identical tp/fp/fn 48/1/2; both miss e09 "next Tuesday" β€” which resisted 7 explicit training seeds β€” and one "opens" case each). Campaign side effects that improved the PRODUCT for every model: weekday-in-prompt, the next-DOW convention cleanup, and the 60-example eval.

Step 10 β€” bare-prompt (internalization) test: no decisive gap

Dropped the system prompt for both models (identical minimal user content, same JSON-schema constraint; modal_eval.py --minimal-prompt), measuring internalized task knowledge:

bare, n=60 stock E4B fine-tuned E4B
schema validity 0.967 1.0
event F1 0.682 0.644
start-exact recall 0.60 0.56
no-event accuracy 0.70 0.80
clarification recall 0.50 0.625

Small trade-offs both ways, within noise. Verdict: at this data scale (139 hand + 2,000 SMCalFlow) with QLoRA/1-epoch, the E4B fine-tune reaches PARITY with stock, not superiority β€” non-degraded, perfect validity everywhere, better bare-prompt discipline, slightly weaker bare extraction. The strict-dominance gate therefore never auto-promoted it; the candidate GGUF remains on the Modal volume (/outputs/gemma-cal-e4b-staging-Q4_K_M.gguf). Publishing it as the project's edge model at parity is a product decision (zero quality cost; production would serve our own fine-tune, fulfilling "Well-Tuned") β€” deliberately left to the owner, not the gate.