Eval roadmap — improving the scheduling fine-tune

How we measure and improve ParetoOptimal/gemma-4-cal-gguf (the fine-tuned Gemma-4-31B that turns chat/images into a calendar ActionPlan). The eval is task-specific — generic LLM benchmarks (MMLU etc.) don't apply.

Harness: training/eval.py (scores), training/gen_eval.py + training/data/eval.jsonl (28 held-out examples, disjoint from dataset.jsonl), training/modal_eval.py (serves the GGUF on the same llama-server the Space uses, then scores).

Baseline scores (Q4_K_M, n=28, 2026-06-09)

Metric	Score
schema validity	1.00
no-event accuracy	1.00
clarification recall	1.00
end-time exact	1.00
event precision	0.85
event recall (start-exact)	0.77
event F1	0.81
title similarity	0.87

Discipline (never invents events, always asks when ambiguous) is perfect; all 9 relative-date cases passed. The gap is exact start datetime on a few explicit far-future dates (misses: e02, e05, e06, e15, one leg of m02).
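As a sanity check on the table above, the three event metrics can be reproduced from event-level counts. The counts used here (tp=17, fp=3, fn=5) are not printed anywhere in this roadmap; they are inferred from the 5 misses, the 22 gold events implied by the Step 8 tally, and the reported precision, so treat them as an assumption. The helper name `prf` is hypothetical and is not claimed to match training/eval.py.

```python
from typing import Tuple


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) from event-level counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return precision, recall, f1


# Assumed counts, inferred from the reported scores (not eval.py output).
precision, recall, f1 = prf(tp=17, fp=3, fn=5)
print(f"{precision:.2f} {recall:.3f} {f1:.2f}")  # 0.85 0.773 0.81
```

Under this reading, the three false positives would be the three wrong-year predictions listed in Step 1, each counted both as a missed gold event and as a spurious one.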

The 3 steps

1. Diagnose the 5 misses (cheap)

Enhance eval.py to dump the model's actual start/title for mismatched events, then one re-run shows whether they're date-shift, time/AM-PM, or wrong-year errors — which tells us exactly what training data to add. (~one A100 eval run; the GGUF is cached in the Modal Volume, so it's fast.)

2. Baseline comparison (the "Well-Tuned" proof)

Run modal run training/modal_eval.py --model-hf-repo unsloth/gemma-4-31B-it-GGUF to score stock Gemma-4-31B on the same set. If the fine-tune's discipline (no-event 1.0, clarification 1.0) and datetime recall beat stock, that's concrete evidence the fine-tune helps. (Separate ~18 GB model download + A100 time.)

3. Close the gap

Add ~15–20 explicit-date examples (especially next-month dates and times) to training/data/dataset.jsonl, re-train on Modal (training/modal_train.py), re-eval — and watch start-exact recall move.

Results log

Step 1 — diagnosis (2026-06-09)

The mismatch dump showed the misses are not a reasoning failure. 3 of 5 are the same bug — a dropped year digit, "206" instead of "2026" — on next-month dates (month/day/time all correct):

[e02] gold 2026-10-06T15:30  pred 206-10-06T15:30
[e05] gold 2026-10-01T08:15  pred 206-10-01T08:15
[e15] gold 2026-10-08T19:00  pred 206-10-08T19:00
[e06] gold 2026-09-28T09:00  pred []                 (abstained)
[m02] Standup + Sprint demo  pred Standup only       (dropped 2nd leg)

Fix indicated: more far-future explicit-date examples reinforcing 4-digit years (+ multi-event 2nd legs). → Step 3.
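The error classes from the diagnosis plan (wrong year, date shift, time) can be separated with plain string comparison. The sketch below is illustrative only: `classify_miss` is a hypothetical helper, not the code in eval.py. It deliberately avoids datetime parsing, since a three-digit year such as 206 should be rejected by `datetime.fromisoformat`.

```python
from typing import Optional


def classify_miss(gold: str, pred: Optional[str]) -> str:
    """Label a start-datetime mismatch by the first field that differs."""
    if not pred:
        return "abstained"
    gold_date, _, gold_time = gold.partition("T")
    pred_date, _, pred_time = pred.partition("T")
    gold_year, _, gold_md = gold_date.partition("-")
    pred_year, _, pred_md = pred_date.partition("-")
    if gold_year != pred_year:
        return "wrong-year"
    if gold_md != pred_md:
        return "date-shift"
    if gold_time != pred_time:
        return "time"
    return "match"


# Gold/pred pairs copied from the mismatch dump above.
misses = {
    "e02": ("2026-10-06T15:30", "206-10-06T15:30"),
    "e05": ("2026-10-01T08:15", "206-10-01T08:15"),
    "e15": ("2026-10-08T19:00", "206-10-08T19:00"),
    "e06": ("2026-09-28T09:00", None),
}
labels = {case: classify_miss(gold, pred) for case, (gold, pred) in misses.items()}
print(labels)
```

On these four cases the sketch labels e02, e05, and e15 as wrong-year and e06 as abstained, matching the reading given above.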

Step 2 — baseline vs fine-tune (2026-06-09, n=28, Q4_K_M)

Metric	Stock `gemma-4-31B-it-GGUF`	Fine-tune `gemma-4-cal-gguf`
schema validity	1.00	1.00
event precision	1.00	0.85
start-exact recall	0.955	0.773
event F1	0.977	0.81
end-exact	1.00	1.00
no-event accuracy	1.00	1.00
clarification recall	0.75	1.00

Honest read: stock Gemma-4-31B is already strong at this extraction and beats the current fine-tune on datetime recall — the "206" bug is a fine-tune regression. The fine-tune's only clear win is clarification discipline (asks when a thread is "date TBD"; stock missed q04). As-is, the fine-tune is not justified on extraction. Step 3 must fix the year regression and clear baseline's 0.955 recall while keeping clarification at 1.00 — otherwise the better play is stock + the fine-tune's clarification behavior via prompting.

Step 3 — after gap-closing retrain (2026-06-09) — REGRESSED

Dataset grown 69 → 87 (+18 Oct–Dec 2026 explicit-date examples, disjoint from eval), same 2-epoch recipe, re-quantized to Q4_K_M and republished. Re-eval (n=28):

Metric	Stock 31B	Fine-tune v1 (69)	Fine-tune v2 (87, retrained)
schema validity	1.00	1.00	0.75
event precision	1.00	0.85	0.476
start-exact recall	0.955	0.773	0.455
event F1	0.977	0.81	0.465
end-exact	1.00	1.00	1.00
no-event accuracy	1.00	1.00	1.00
clarification recall	0.75	1.00	0.75

The naive retrain made it worse, not better. New failure modes: unparseable/empty JSON (validity 1.0→0.75), duplicate events, hallucinated "Drive to …" events, transposed/garbage years (2062, 2062-15:00:00), and previously-passing relative dates now empty. Cause: overfitting — 18 of 87 examples were near-identical far-future templates, biasing a tiny dataset and degrading general formatting/extraction.
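For readers unfamiliar with the schema-validity metric, here is a minimal sketch of how such a score might be computed. The actual ActionPlan schema is not shown in this roadmap, so the `events` key and the toy outputs are assumptions for illustration; only `needs_clarification` is a field name mentioned elsewhere in this document.

```python
import json
from typing import List


def is_valid_plan(raw: str) -> bool:
    """True if raw parses as a JSON object with a list under 'events' (assumed key)."""
    try:
        plan = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(plan, dict) and isinstance(plan.get("events"), list)


def schema_validity(outputs: List[str]) -> float:
    """Fraction of model outputs that pass the structural check."""
    return sum(is_valid_plan(o) for o in outputs) / len(outputs)


# Toy outputs, not real model generations.
toy_outputs = [
    '{"events": [{"title": "Standup", "start": "2026-10-06T09:00"}]}',
    '{"events": []}',
    '{"events": [], "needs_clarification": true}',
    "",  # an empty generation counts as invalid
]
print(schema_validity(toy_outputs))  # 0.75
```

A real check would validate the full schema (field types, datetime formats); this sketch only shows why empty or unparseable generations pull the score down.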

Conclusions & recommendation

1. Stock Gemma-4-31B is already strong at this extraction (F1 0.98). The only thing fine-tuning reliably added was clarification discipline (v1: 1.00 vs stock 0.75), and even that was lost in v2.
2. Tiny-dataset SFT is fragile here. v1 (69 ex) underperformed stock on dates; v2 (87 ex) regressed hard. More data of the same shape hurt.
3. Recommended path (pick one):
   - Ship stock + prompt for clarification: simplest; recovers the one real win without the regressions. (Lowest risk.)
   - If keeping a fine-tune: rebuild the dataset to be much larger and more diverse (not template-heavy), drop to ~1 epoch with regularization, and gate every retrain on this eval (only publish if it beats the current best). Consider a higher quant (Q5/Q6) to rule out the "206"/2062 digit corruption being quant-driven.
4. Action: revert the live model. v2 (worse) overwrote v1 in ParetoOptimal/gemma-4-cal-gguf. Restore v1 (the better fine-tune) or point the Space back at stock unsloth/gemma-4-31B-it-GGUF until a fine-tune beats the eval baseline.

Bottom line: the eval did its job — it caught a regression before it reached users, and it says the current fine-tune is not yet worth shipping over stock.

Follow-up (2026-06-09)

Live model restored to v1

v2 (regressed) was rolled back: gemma-cal-Q4_K_M.gguf in the repo was restored to the v1 LFS object via a server-side CommitOperationCopy (no transfer, no GPU). Production serves the better v1 again.

Dataset rebuilt larger + more diverse (69 → 122)

Added a diversity batch (gen_new_seeds.MORE_SEEDS3): varied date/time formats (10/15, "the 3rd", "half past 7", "0900", "noon", "midnight"), reschedules, cancellations, recurring, all-day, deadlines (EOD/midnight), past & hypothetical (must NOT schedule), richer no-event & clarify, and varied image sources (ticket, invite screenshot, notice). Goal: counter the template-heavy skew that overfit v2. Verified valid + disjoint from eval.jsonl.
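A disjointness check of this kind can be sketched as a set intersection over normalized inputs. The JSONL field name `input` and the sample rows are assumptions (the real dataset.jsonl layout is not shown here), and exact matching after normalization only catches verbatim duplicates, not paraphrases.

```python
import json
from typing import Set


def load_inputs(jsonl_text: str, key: str = "input") -> Set[str]:
    """Collect case- and whitespace-normalized input texts from JSONL content."""
    inputs = set()
    for line in jsonl_text.splitlines():
        if line.strip():
            row = json.loads(line)
            inputs.add(" ".join(row[key].lower().split()))
    return inputs


def overlap(train_text: str, eval_text: str, key: str = "input") -> Set[str]:
    """Inputs that appear in both the training and the held-out set."""
    return load_inputs(train_text, key) & load_inputs(eval_text, key)


# Toy rows standing in for dataset.jsonl and eval.jsonl.
train = '{"input": "Dentist on 10/15 at noon"}\n{"input": "Dinner at half past 7"}\n'
held_out = '{"input": "Standup  tomorrow at 0900"}\n'
print(overlap(train, held_out))  # set()
```

In practice the two strings would be read from training/data/dataset.jsonl and training/data/eval.jsonl, and a non-empty result would block the retrain.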

Eval-gating is now the publishing process

No retrain publishes unless it beats the eval. training/gated_retrain.py:

1. Retrain on Modal, then upload to a staging filename (gemma-cal-staging-Q4_K_M.gguf) in the repo. The production file is untouched, and mmproj is skipped (--skip-mmproj).
2. Eval the staging file (modal_eval.py --model-file …).
3. Gate: schema_validity ≥ 0.95, event_f1 ≥ 0.81, start-exact recall ≥ 0.773 (defaults = the current best, v1); tune via --gate-f1/--gate-recall.
4. PASS → promote staging → production via server-side CommitOperationCopy (free); FAIL → delete staging, production unchanged.

Run: python training/gated_retrain.py [--epochs 1 --gate-f1 … --gate-recall …].
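The gate decision itself reduces to three threshold comparisons. The sketch below is a minimal illustration, not the contents of gated_retrain.py: the function name and the `start_exact_recall` key are hypothetical, while the default thresholds and the v1/v2 scores are taken from this document.

```python
from typing import Dict


def passes_gate(
    metrics: Dict[str, float],
    gate_validity: float = 0.95,
    gate_f1: float = 0.81,
    gate_recall: float = 0.773,
) -> bool:
    """A candidate is promoted only if every metric clears its threshold."""
    return (
        metrics["schema_validity"] >= gate_validity
        and metrics["event_f1"] >= gate_f1
        and metrics["start_exact_recall"] >= gate_recall
    )


# Scores copied from the Step 3 table.
v1 = {"schema_validity": 1.00, "event_f1": 0.81, "start_exact_recall": 0.773}
v2 = {"schema_validity": 0.75, "event_f1": 0.465, "start_exact_recall": 0.455}

print(passes_gate(v1))  # True
print(passes_gate(v2))  # False
```

Because the comparisons are non-strict, a candidate that merely ties the current best would pass with the default thresholds; raising the thresholds slightly would make the gate require a strict improvement.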

Step 4 — first eval-gated retrain (122 ex, 1 epoch) — GATE FAILED ✅ (protected prod)

The retrain scored worse than every prior version and the gate refused to publish:

Metric	Stock	v1 (live)	v3 staging (122, 1ep)
schema validity	1.00	1.00	0.46
event F1	0.977	0.81	0.214
start-exact recall	0.955	0.773	0.136
no-event accuracy	1.00	1.00	1.00
clarification recall	0.75	1.00	1.00

Over half of the outputs (54%) were unparseable; extraction collapsed. Gate: FAIL → staging deleted, production unchanged (still v1). The gate worked exactly as intended.

Verdict (after 3 fine-tune attempts)

All three fine-tunes — v1 (69 ex / 2 ep), v2 (87 / 2 ep), v3 (122 / 1 ep) — underperform stock Gemma-4-31B, and the larger runs broke JSON validity. Only the safety behaviors (no-event, clarification) survive fine-tuning; extraction degrades. QLoRA-on-31B-Q4 here is fragile and not worth shipping over stock. Recommended: serve stock unsloth/gemma-4-31B-it-GGUF and recover the one fine-tune win (clarification) via the prompt. Keep v1 as the published fine-tune for the "Well-Tuned" artifact, but don't route production extraction through it. Revisit fine-tuning only with a substantially larger, more varied dataset and a recipe that holds schema validity at 1.0 — gated, as now, on this eval.

Step 5 — quantization-penalty test (2026-06-09): quant EXONERATED

Hypothesis: maybe Q4 quantization (the "206"/2062 digit bug) was tanking the fine-tune. Tested the SAME fine-tuned weights (gemma-cal-f16.gguf, v2/87-ex — best fp16 still on the volume) at three precisions on the 28-example eval (training/modal_quant_eval.py):

weight precision	schema validity	event F1	start-exact recall
f16 (full)	0.643	0.571	0.545
Q8_0	0.679	0.565	0.591
Q4_K_M	0.75	0.465	0.455
base (stock)	1.00	0.977	0.955

Quantization is not the cause. At full fp16 the fine-tune still scores validity 0.64 and F1 0.57, nowhere near base; validity is actually lower at f16 than at Q4_K_M, so quantization isn't what breaks the JSON. Higher precision buys only about +0.1 F1 and +0.09 to +0.14 recall (Q4_K_M → Q8_0/f16), a fraction of the roughly 0.5 gap to base. The degradation comes from the SFT itself, not the GGUF conversion. The follow-up "Step 2" of this test (retrain at Q8 to beat base; not the Step 2 baseline comparison above) is not pursued, because the gate would fail. (Caveat: v1's fp16 was overwritten, so this used v2; a definitive v1 test needs a retrain, but the small quant lift makes a base-beating result improbable.)
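The "fraction of the gap" reading can be checked with a few lines of arithmetic on the F1 column of the table above. This is just a worked calculation on the reported numbers; the variable names are illustrative.

```python
# Event F1 per precision for the same fine-tuned weights (from the table above).
f1_by_precision = {"f16": 0.571, "Q8_0": 0.565, "Q4_K_M": 0.465}
base_f1 = 0.977

gap_to_base = base_f1 - f1_by_precision["Q4_K_M"]

# Share of the Q4-to-base gap recovered by moving to a higher precision.
fractions = {}
for name in ("Q8_0", "f16"):
    lift = f1_by_precision[name] - f1_by_precision["Q4_K_M"]
    fractions[name] = lift / gap_to_base
    print(f"{name}: lift {lift:+.3f}, {fractions[name]:.0%} of the gap to base")
```

Either higher precision recovers only about a fifth of the F1 gap, which is why quantization is ruled out as the main cause.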

Final recommendation

A higher quant won't make the fine-tune beat base, and an automation agent (e.g. ml-intern) doesn't change the binding constraints (near-ceiling base; small data; SFT degrades instruction-following). Serve stock unsloth/gemma-4-31B-it-GGUF and recover the clarification behavior via the system prompt; keep v1 as the "Well-Tuned" artifact. Only revisit fine-tuning with a substantially larger, real, diverse dataset + a validity-preserving recipe (low LR, few steps), always gated on this eval.

Real training data: SMCalFlow importer

training/import_smcalflow.py converts SMCalFlow (Microsoft Semantic Machines, CC BY-SA 4.0) calendar dialogues into our ActionPlan format. SMCalFlow encodes events as LISP "dataflow" programs; the importer parses CreatePreflightEventWrapper turns, extracts subject/start/location/attendees, and resolves date/time constructs (Tomorrow, NextDOW, MD, NumberPM, HourMinuteMilitary, …) against a per-example reference now spread across 2026 — so relative dates become concrete, self-consistent targets (directly trains the failing date/time skill, with varied 4-digit years). Conservative: only emits a row when a title AND an explicit start time resolve (~7.5k usable turns from train+valid).

- Run: python training/import_smcalflow.py --limit 2000 --heldout 200 → writes training/data/smcalflow_train.jsonl (+ …_heldout.jsonl). Both are git-ignored (CC BY-SA share-alike vs this repo's Apache-2.0 → we don't commit/redistribute the derived data; the importer code is ours) and disjoint from eval.jsonl.
- train_qlora.py now trains on dataset.jsonl + smcalflow_train.jsonl (when present). gated_retrain.py therefore trains on real data, and still only publishes if it beats the gate, so a bigger-but-worse model can't reach production.
- Attribution (required by CC BY-SA): Semantic Machines et al., "Task-Oriented Dialogue as Dataflow Synthesis," TACL 2020.
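To illustrate what resolving a relative construct against a reference "now" means, here is a minimal sketch for two of the constructs named above (Tomorrow and NextDOW). It is not the importer's code: the function names are hypothetical, and the NextDOW convention used here (the first matching weekday strictly after the reference date) is an assumption, since the convention the project settled on is not spelled out in this document.

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def resolve_tomorrow(now: date) -> date:
    """'Tomorrow' is the day after the reference date."""
    return now + timedelta(days=1)


def resolve_next_dow(now: date, weekday: str) -> date:
    """First occurrence of `weekday` strictly after `now` (assumed convention)."""
    target = WEEKDAYS.index(weekday.lower())
    days_ahead = (target - now.weekday()) % 7 or 7
    return now + timedelta(days=days_ahead)


now = date(2026, 6, 9)  # a Tuesday
print(resolve_tomorrow(now))             # 2026-06-10
print(resolve_next_dow(now, "Friday"))   # 2026-06-12
print(resolve_next_dow(now, "Tuesday"))  # 2026-06-16
```

Under a different reading of "next Tuesday" (for example, the Tuesday of the following week even when asked on a Monday), the targets would change, which is the kind of ambiguity that makes a single documented convention important for both training data and eval.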

Step 6 — eval-gated retrain on REAL data (2026-06-09): FAILED gate (worst yet)

Trained the 31B on 2,122 examples (122 hand-authored + 2,000 real SMCalFlow), 1 epoch, through gated_retrain.py with a beat-base gate (F1≥0.95, recall≥0.90). Result on the 28-ex eval:

Metric	base	v1 (live)	real-data (2,122 ex)
schema validity	1.00	1.00	0.107
event F1	0.977	0.81	0.000
start-exact recall	0.955	0.773	0.000

~90% unparseable output, zero events extracted. Gate FAIL → not promoted; production stays v1.

Verdict across 4 fine-tunes (now incl. real data)

Scores monotonically worsen with more training/data: v1 (69 synth, F1 0.81) → v2 (87, 0.465) → v3 (122, 0.214) → real (2,122, 0.0). This is no longer a data problem — the SFT recipe itself degrades the model, and more data makes it worse. Most likely root cause to investigate if fine-tuning is ever revisited: a train/inference chat-template mismatch — train_qlora.py formats with Unsloth's get_chat_template("gemma") while llama-server serves with the GGUF's own --jinja template; if these differ for Gemma-4, training optimizes a format the server never uses, and the divergence compounds with more steps (exactly the monotonic decay seen). Other suspects: LR too high (2e-4) / catastrophic forgetting on a near-ceiling base.

Final, evidence-backed recommendation: serve stock unsloth/gemma-4-31B-it-GGUF (best by far) and recover clarification via the system prompt. Do NOT route production through any current fine-tune. The eval-gate has now correctly rejected 2 bad retrains — keep it as the publish gate.

Step 7 — recipe fix + raw-output probe (2026-06-09): training stack implicated, fine-tuning HALTED

Fixed the suspected train/serve chat-template mismatch (PR #54): Gemma-4's native chat_template.jinja uses a NEW <|turn>role … <turn|> format (no <start_of_turn> at all), while training forced unsloth's legacy "gemma" template. train_qlora.py now formats with the tokenizer's native template (hard <|turn> assert), masks loss to the assistant turn, LR 5e-5. Retrained on the 2,122-example set through the gate: validity 0.0 — gate FAIL (production stays v1, third bad retrain rejected).

Diagnostics that pinpointed the cause:

- GGUF template check (CPU, ~free): our exported staging GGUF embeds the correct native <|turn> template (16,934 chars, no <start_of_turn>) → train and serve formats are now verifiably aligned. Template is exonerated as the remaining cause.
- Raw-output probe (/outputs/gemma-cal-staging-Q4_K_M.gguf): free generation emits pure degenerate looping ('Huddle — — — — — …' to the token limit); constrained generation emits 512 tokens of nothing. The weights are destroyed, not misformatted.

With dataset (69→2,122), template (legacy/native), LR (2e-4/5e-5), and masking (on/off) all varied, degradation always tracks training steps and ends in token-loop collapse. The remaining common factor is Unsloth's QLoRA path for Gemma-4-31B (new architecture; training logs warn get_input_embeddings not auto-handled for Gemma4AudioModel). Fine-tuning is halted until that stack demonstrably works for this arch (or is replaced with plain transformers+PEFT).

Step 8 — improve served evals via prompt (stock + targeted SYSTEM additions)

Base's only eval misses are prompt-fixable: m03 dropped the 2nd event of a multi-event thread; q04 didn't ask clarification on a "TBD" plan. Added two surgical SYSTEM lines (list every distinct event separately; ask via needs_clarification when day/time is TBD).

Result: PERFECT SCORE — 1.0 on every metric (n=28, tp/fp/fn = 22/0/0).

Metric	base (old prompt)	base + new prompt
schema validity	1.00	1.00
event precision	1.00	1.00
start-exact recall	0.955	1.00
event F1	0.977	1.00
no-event accuracy	1.00	1.00
clarification recall	0.75	1.00

Both misses fixed, nothing regressed. This is the production configuration: stock unsloth/gemma-4-31B-it-GGUF + the updated SYSTEM prompt. (Set Space var MODEL_HF_REPO=unsloth/gemma-4-31B-it-GGUF; the prompt ships with the app.) The "Well-Tuned" artifact remains ParetoOptimal/gemma-4-cal-gguf (v1); any future fine-tune must beat THIS 1.0 baseline through the gate — i.e., match it and win on a harder, expanded eval set.

Step 9 — the E4B edge-model campaign (2026-06-10)

Re-aimed fine-tuning where it has headroom: a Gemma-4 E4B (~8B) edge model that runs without a paid A100, gated against stock E4B. Six gated runs, each fixing a diagnosed failure (the fixed recipe trained cleanly every time — validity 1.0 throughout, confirming the Step-7 breakage was specific to the 31B path):

run	change	F1	recall	clarify	eval
#1	fixed recipe, 2,122 ex	0.884	0.864	1.0	n=28
#2	+ weekday-in-prompt (+data regen)	0.955	0.955	0.75	n=28
#3	+ next-DOW conflict filter (74 rows), 4× hand	1.0	1.0	0.75	n=28
#4	+ TBD-clarify seeds, 8× hand	0.93	0.909	1.0	n=28
#5	clarify seeds, 4× hand	0.93	0.909	1.0	n=28
—	eval expanded 28→60 (50 events; jitter-resistant)
#6	+ Batch-7 seeds (next-DOW, "opens")	0.97	0.96	1.0	n=60
stock E4B (weekday prompt)		0.97	0.96	1.0	n=60

Run #6 vs stock is an exact tie on this eval: identical tp/fp/fn of 48/1/2. Both miss e09 ("next Tuesday"), which resisted 7 explicit training seeds, and each misses one "opens" case. Campaign side effects that improved the product for every model: weekday-in-prompt, the next-DOW convention cleanup, and the 60-example eval.
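The weekday-in-prompt change amounts to stating the reference date together with its weekday, so the model does not have to derive the weekday itself. The exact wording used by the app is not shown in this roadmap; the sentence format and the helper name below are assumptions for illustration.

```python
from datetime import date

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def reference_line(today: date) -> str:
    """Build a prompt line that names the weekday alongside the ISO date."""
    return f"Today is {DAY_NAMES[today.weekday()]}, {today.isoformat()}."


print(reference_line(date(2026, 6, 10)))  # Today is Wednesday, 2026-06-10.
```

A fixed list of day names is used instead of `strftime("%A")` so the output does not depend on the process locale.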

Step 10 — bare-prompt (internalization) test: no decisive gap

Dropped the system prompt for both models (identical minimal user content, same JSON-schema constraint; modal_eval.py --minimal-prompt), measuring internalized task knowledge:

bare, n=60	stock E4B	fine-tuned E4B
schema validity	0.967	1.0
event F1	0.682	0.644
start-exact recall	0.60	0.56
no-event accuracy	0.70	0.80
clarification recall	0.50	0.625

Small trade-offs both ways, within noise. Verdict: at this data scale (139 hand + 2,000 SMCalFlow) with QLoRA/1-epoch, the E4B fine-tune reaches PARITY with stock, not superiority — non-degraded, perfect validity everywhere, better bare-prompt discipline, slightly weaker bare extraction. The strict-dominance gate therefore never auto-promoted it; the candidate GGUF remains on the Modal volume (/outputs/gemma-cal-e4b-staging-Q4_K_M.gguf). Publishing it as the project's edge model at parity is a product decision (zero quality cost; production would serve our own fine-tune, fulfilling "Well-Tuned") — deliberately left to the owner, not the gate.
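One rough way to see why a 0.60 vs 0.56 recall difference counts as noise is to compare it with the binomial standard error at this sample size. This is a back-of-the-envelope sketch, not part of the eval harness: it assumes recall is measured over the 50 gold events of the expanded eval and treats events as independent, which ignores the fact that both models are scored on the same items.

```python
import math


def binomial_se(p: float, n: int) -> float:
    """Standard error of a proportion p estimated from n independent trials."""
    return math.sqrt(p * (1 - p) / n)


n_events = 50                    # gold events in the 60-example eval
stock_recall, tuned_recall = 0.60, 0.56

diff = stock_recall - tuned_recall
se = binomial_se(stock_recall, n_events)
print(f"difference {diff:.2f} vs one standard error ~{se:.3f}")
```

The 0.04 difference is smaller than a single standard error (about 0.07), so on this estimate neither model can be said to be ahead on bare-prompt recall.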