	# Eval roadmap — improving the scheduling fine-tune

	How we measure and improve `ParetoOptimal/gemma-4-cal-gguf` (the fine-tuned
	Gemma-4-31B that turns chat/images into a calendar `ActionPlan`). The eval is
	task-specific — generic LLM benchmarks (MMLU etc.) don't apply.

	Harness: `training/eval.py` (scores), `training/gen_eval.py` + `training/data/eval.jsonl`
	(28 held-out examples, disjoint from `dataset.jsonl`), `training/modal_eval.py`
	(serves the GGUF on the same `llama-server` the Space uses, then scores).

	## Baseline scores (Q4_K_M, n=28, 2026-06-09)

	| Metric | Score |
	| --- | --- |
	| schema validity | 1.00 |
	| no-event accuracy | 1.00 |
	| clarification recall | 1.00 |
	| end-time exact | 1.00 |
	| event precision | 0.85 |
	| event recall (start-exact) | 0.77 |
	| event F1 | 0.81 |
	| title similarity | 0.87 |

	Discipline (never invents events, always asks when ambiguous) is perfect; all 9
	relative-date cases passed. The gap is exact start datetime on a few
	explicit far-future dates (misses: `e02`, `e05`, `e06`, `e15`, one leg of `m02`).
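
The event metrics above follow from true-positive, false-positive and false-negative counts. A minimal sketch, assuming a predicted event counts as a hit only when its `start` string equals a gold `start` exactly; the function name and the matching rule are illustrative assumptions, not the code in `training/eval.py`:

```python
from collections import Counter


def score_events(gold_starts, pred_starts):
    """Return (precision, recall, f1) using exact-match start datetimes.

    Assumption: events are compared by start string only, as a multiset.
    Empty denominators are scored as 1.0 (also an assumption).
    """
    gold = Counter(gold_starts)
    pred = Counter(pred_starts)
    tp = sum((gold & pred).values())  # multiset intersection
    fp = sum(pred.values()) - tp
    fn = sum(gold.values()) - tp
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 1.0
    return precision, recall, f1
```

Under this rule a prediction with a corrupted start counts as both a false positive and a false negative, so a single bad datetime lowers precision and recall together.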

	## The 3 steps

	### 1. Diagnose the 5 misses (cheap)
	Enhance `eval.py` to dump the model's actual `start`/`title` for mismatched events,
	then one re-run shows whether they're date-shift, time/AM-PM, or wrong-year errors —
	which tells us exactly what training data to add. (~one A100 eval run; the GGUF is
	cached in the Modal Volume, so it's fast.)

	### 2. Baseline comparison (the "Well-Tuned" proof)
	Run `modal run training/modal_eval.py --model-hf-repo unsloth/gemma-4-31B-it-GGUF`
	to score stock Gemma-4-31B on the same set. If the fine-tune's discipline
	(no-event 1.0, clarification 1.0) and datetime recall beat stock, that's concrete
	evidence the fine-tune helps. (Separate ~18 GB model download + A100 time.)

	### 3. Close the gap
	Add ~15–20 explicit-date examples (especially next-month dates and times) to
	`training/data/dataset.jsonl`, re-train on Modal (`training/modal_train.py`),
	re-eval — and watch start-exact recall move.

	## Results log

	### Step 1 — diagnosis (2026-06-09)
	The mismatch dump showed the misses are not a reasoning failure. 3 of 5 are the
	same bug — a dropped year digit, "206" instead of "2026" — on next-month dates
	(month/day/time all correct):

	```
	[e02] gold 2026-10-06T15:30 pred 206-10-06T15:30
	[e05] gold 2026-10-01T08:15 pred 206-10-01T08:15
	[e15] gold 2026-10-08T19:00 pred 206-10-08T19:00
	[e06] gold 2026-09-28T09:00 pred [] (abstained)
	[m02] Standup + Sprint demo pred Standup only (dropped 2nd leg)
	```

	Fix indicated: more far-future explicit-date examples reinforcing 4-digit years
	(+ multi-event 2nd legs). → Step 3.
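
The triage in this step (date-shift vs. time vs. year errors) can be automated. A rough sketch, assuming starts are `YYYY-MM-DDTHH:MM` strings as in the dump above; `classify_miss` and its category labels are hypothetical, not part of `eval.py`:

```python
import re

# Year is matched loosely (\d+) so that a truncated year like "206" still parses.
ISO_MINUTES = re.compile(r"(\d+)-(\d{2})-(\d{2})T(\d{2}):(\d{2})")


def classify_miss(gold, pred):
    """Label a start-datetime mismatch; the categories are illustrative."""
    if not pred:
        return "abstained"
    g = ISO_MINUTES.fullmatch(gold)
    p = ISO_MINUTES.fullmatch(pred)
    if g is None or p is None:
        return "unparseable"
    if g.groups() == p.groups():
        return "match"
    if g.groups()[1:] == p.groups()[1:]:
        # Month, day and time agree; only the year differs.
        return "short-year" if len(p.group(1)) < 4 else "wrong-year"
    if g.groups()[:3] == p.groups()[:3]:
        return "time"
    return "date-shift"
```

For example, `classify_miss("2026-10-06T15:30", "206-10-06T15:30")` returns `"short-year"`, the pattern seen in `e02`, `e05` and `e15`.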

	### Step 2 — baseline vs fine-tune (2026-06-09, n=28, Q4_K_M)

	| Metric | Stock `gemma-4-31B-it-GGUF` | Fine-tune `gemma-4-cal-gguf` |
	| --- | --- | --- |
	| schema validity | 1.00 | 1.00 |
	| event precision | 1.00 | 0.85 |
	| start-exact recall | 0.955 | 0.773 |
	| event F1 | 0.977 | 0.81 |
	| end-exact | 1.00 | 1.00 |
	| no-event accuracy | 1.00 | 1.00 |
	| clarification recall | 0.75 | 1.00 |

	Honest read: stock Gemma-4-31B is already strong at this extraction and beats
	the current fine-tune on datetime recall — the "206" bug is a fine-tune regression.
	The fine-tune's only clear win is clarification discipline (asks when a thread is
	"date TBD"; stock missed `q04`). As-is, the fine-tune is not justified on
	extraction. Step 3 must fix the year regression and clear baseline's 0.955 recall
	while keeping clarification at 1.00 — otherwise the better play is stock + the
	fine-tune's clarification behavior via prompting.

	### Step 3 — after gap-closing retrain (2026-06-09) — REGRESSED
	Dataset grown 69 → 87 (+18 Oct–Dec 2026 explicit-date examples, disjoint from eval),
	same 2-epoch recipe, re-quantized to Q4_K_M and republished. Re-eval (n=28):

	| Metric | Stock 31B | Fine-tune v1 (69) | Fine-tune v2 (87, retrained) |
	| --- | --- | --- | --- |
	| schema validity | 1.00 | 1.00 | 0.75 |
	| event precision | 1.00 | 0.85 | 0.476 |
	| start-exact recall | 0.955 | 0.773 | 0.455 |
	| event F1 | 0.977 | 0.81 | 0.465 |
	| end-exact | 1.00 | 1.00 | 1.00 |
	| no-event accuracy | 1.00 | 1.00 | 1.00 |
	| clarification recall | 0.75 | 1.00 | 0.75 |

	The naive retrain made it worse, not better. New failure modes: unparseable/empty
	JSON (validity 1.0→0.75), duplicate events, hallucinated "Drive to …" events,
	transposed/garbage years (`2062`, `2062-15:00:00`), and previously-passing relative
	dates now empty. Cause: overfitting — 18 of 87 examples were near-identical far-future
	templates, biasing a tiny dataset and degrading general formatting/extraction.
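
One way to spot this kind of template skew before training is to bucket examples by a normalized form and look at the largest bucket. A minimal sketch, assuming the training inputs are available as plain strings; both helper names are hypothetical and the normalization is deliberately crude:

```python
import re
from collections import Counter

MONTHS = (
    r"\b(jan(uary)?|feb(ruary)?|mar(ch)?|apr(il)?|may|june?|july?|aug(ust)?"
    r"|sep(t(ember)?)?|oct(ober)?|nov(ember)?|dec(ember)?)\b"
)


def template_key(text):
    """Collapse month names and digits so near-identical templates collide."""
    key = text.lower()
    key = re.sub(MONTHS, "<month>", key)
    key = re.sub(r"\d+", "<n>", key)
    return re.sub(r"\s+", " ", key).strip()


def largest_template_share(texts):
    """Fraction of examples that fall into the most common template bucket."""
    if not texts:
        return 0.0
    counts = Counter(template_key(t) for t in texts)
    return max(counts.values()) / len(texts)
```

A high share suggests the dataset is dominated by one phrasing. The heuristic will also treat the verb "may" as a month, so it is a screening aid rather than a measurement.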

	## Conclusions & recommendation

	1. Stock Gemma-4-31B is already strong at this extraction (F1 0.98). The only
	thing fine-tuning reliably added was clarification discipline (v1: 1.00 vs stock
	0.75) — and even that was lost in v2.
	2. Tiny-dataset SFT is fragile here. v1 (69 ex) underperformed stock on dates;
	v2 (87 ex) regressed hard. More data of the same shape hurt.
	3. Recommended path (pick one):
	- Ship stock + prompt for clarification — simplest; recover the one real win
	without the regressions. (Lowest risk.)
	- If keeping a fine-tune: rebuild the dataset to be much larger and more diverse (not
	template-heavy), drop to ~1 epoch with regularization, and **gate every retrain
	on this eval** (only publish if it beats the current best). Consider a higher
	quant (Q5/Q6) to rule out the `"206"`/`2062` digit corruption being quant-driven.
	4. Action — revert the live model. v2 (worse) overwrote v1 in
	`ParetoOptimal/gemma-4-cal-gguf`. Restore v1 (the better fine-tune) or point the
	Space back at stock `unsloth/gemma-4-31B-it-GGUF` until a fine-tune beats the
	eval baseline.

	**Bottom line: the eval did its job — it caught a regression before it reached users,
	and it says the current fine-tune is not yet worth shipping over stock.**

	## Follow-up (2026-06-09)

	### Live model restored to v1
	v2 (regressed) was rolled back: `gemma-cal-Q4_K_M.gguf` in the repo was restored to the
	v1 LFS object via a server-side `CommitOperationCopy` (no transfer, no GPU). Production
	serves the better v1 again.

	### Dataset rebuilt larger + more diverse (69 → 122)
	Added a diversity batch (`gen_new_seeds.MORE_SEEDS3`): varied date/time formats
	(`10/15`, "the 3rd", "half past 7", "0900", "noon", "midnight"), reschedules,
	cancellations, recurring, all-day, deadlines (EOD/midnight), past & hypothetical
	(must NOT schedule), richer no-event & clarify, and varied image sources (ticket,
	invite screenshot, notice). Goal: counter the template-heavy skew that overfit v2.
	Verified valid + disjoint from `eval.jsonl`.
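
The disjointness check can be as simple as comparing normalized input text. A sketch under the assumption that each example's input is available as a string; the helper names are hypothetical and the real JSONL field names are not shown here:

```python
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivial edits do not hide overlap."""
    return re.sub(r"\s+", " ", text.strip().lower())


def overlap(train_inputs, eval_inputs):
    """Return the eval inputs that also appear (after normalization) in training."""
    seen = {normalize(t) for t in train_inputs}
    return [e for e in eval_inputs if normalize(e) in seen]
```

An empty result means no exact (normalized) leak; it does not rule out paraphrased near-duplicates.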

	### Eval-gating is now the publishing process
	No retrain publishes unless it beats the eval. `training/gated_retrain.py`:
	1. retrain on Modal → upload to a staging filename (`gemma-cal-staging-Q4_K_M.gguf`)
	in the repo (production file untouched; mmproj skipped — `--skip-mmproj`);
	2. eval the staging file (`modal_eval.py --model-file …`);
	3. gate: `schema_validity ≥ 0.95`, `event_f1 ≥ 0.81`, `start-exact recall ≥ 0.773`
	(defaults = the current best, v1) — tune via `--gate-f1/--gate-recall`;
	4. PASS → promote staging → production via server-side `CommitOperationCopy` (free);
	FAIL → delete staging, production unchanged.

	Run: `python training/gated_retrain.py [--epochs 1 --gate-f1 … --gate-recall …]`.
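
The gate decision in step 3 reduces to a threshold comparison. A minimal sketch using the default thresholds quoted above; the `Gate` class, `passes_gate`, and the `start_exact_recall` key name are assumptions for illustration, not the code in `gated_retrain.py`:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    # Default thresholds as quoted in the list above.
    schema_validity: float = 0.95
    event_f1: float = 0.81
    start_exact_recall: float = 0.773


def passes_gate(scores, gate=Gate()):
    """Return (ok, failures); a missing metric counts as a failure."""
    failures = []
    for name in ("schema_validity", "event_f1", "start_exact_recall"):
        threshold = getattr(gate, name)
        value = scores.get(name)
        if value is None or value < threshold:
            failures.append(f"{name}={value} < {threshold}")
    return (not failures, failures)
```

Returning the list of failed checks alongside the boolean makes a FAIL self-explanatory in logs. Because the comparison is `>=`, a model that merely ties the current best passes; a stricter gate would compare with `>`.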

	### Step 4 — first eval-gated retrain (122 ex, 1 epoch) — GATE FAILED ✅ (protected prod)
	The retrain scored worse than every prior version and the gate refused to publish:

	| Metric | Stock | v1 (live) | v3 staging (122, 1ep) |
	| --- | --- | --- | --- |
	| schema validity | 1.00 | 1.00 | 0.46 |
	| event F1 | 0.977 | 0.81 | 0.214 |
	| start-exact recall | 0.955 | 0.773 | 0.136 |
	| no-event accuracy | 1.00 | 1.00 | 1.00 |
	| clarification recall | 0.75 | 1.00 | 1.00 |

	More than half of outputs were unparseable; extraction collapsed. **Gate: FAIL → staging deleted,
	production unchanged (still v1).** The gate worked exactly as intended.

	## Verdict (after 3 fine-tune attempts)
	All three fine-tunes — v1 (69 ex / 2 ep), v2 (87 / 2 ep), v3 (122 / 1 ep) — **underperform
	stock Gemma-4-31B**, and the larger runs broke JSON validity. Only the safety behaviors
	(no-event, clarification) survive fine-tuning; extraction degrades. **QLoRA-on-31B-Q4 here
	is fragile and not worth shipping over stock. Recommended: serve stock
	`unsloth/gemma-4-31B-it-GGUF`** and recover the one fine-tune win (clarification) via the
	prompt. Keep v1 as the published fine-tune for the "Well-Tuned" artifact, but don't route
	production extraction through it. Revisit fine-tuning only with a substantially larger, more
	varied dataset and a recipe that holds schema validity at 1.0 — gated, as now, on this eval.

	## Step 5 — quantization-penalty test (2026-06-09): quant EXONERATED
	Hypothesis: Q4 quantization causes the `"206"`/`2062` digit corruption and is tanking the fine-tune.
	Tested the SAME fine-tuned weights (`gemma-cal-f16.gguf`, v2/87-ex — best fp16 still on the
	volume) at three precisions on the 28-example eval (`training/modal_quant_eval.py`):

	| precision | schema validity | event F1 | start-exact recall |
	| --- | --- | --- | --- |
	| f16 (full) | 0.643 | 0.571 | 0.545 |
	| Q8_0 | 0.679 | 0.565 | 0.591 |
	| Q4_K_M | 0.75 | 0.465 | 0.455 |
	| base (stock) | 1.00 | 0.977 | 0.955 |

	Quantization is not the cause. At full fp16 the fine-tune still scores validity 0.64 / F1
	0.57 — nowhere near base; validity is actually lower at f16 than Q4, so quant isn't breaking
	the JSON. Precision buys only ~+0.1 F1/recall (Q4→Q8/f16), a fraction of the gap to base. The
	degradation is the SFT itself, not the GGUF conversion. The planned second phase of this test
	(retrain at Q8 to beat base) is not pursued, because the gate would fail. (Caveat: v1's fp16 was overwritten, so this used v2;
	a definitive v1 test needs a retrain, but the small quant lift makes a base-beating result
	improbable.)

	### Final recommendation
	A higher quant won't make the fine-tune beat base, and an automation agent (e.g. `ml-intern`)
	doesn't change the binding constraints (near-ceiling base; small data; SFT degrades
	instruction-following). Serve stock `unsloth/gemma-4-31B-it-GGUF` and recover the
	clarification behavior via the system prompt; keep v1 as the "Well-Tuned" artifact. Only
	revisit fine-tuning with a substantially larger, real, diverse dataset + a validity-preserving
	recipe (low LR, few steps), always gated on this eval.

	## Real training data: SMCalFlow importer
	`training/import_smcalflow.py` converts SMCalFlow (Microsoft Semantic Machines, **CC BY-SA
	4.0**) calendar dialogues into our `ActionPlan` format. SMCalFlow encodes events as LISP
	"dataflow" programs; the importer parses `CreatePreflightEventWrapper` turns, extracts
	subject/start/location/attendees, and resolves date/time constructs (`Tomorrow`, `NextDOW`,
	`MD`, `NumberPM`, `HourMinuteMilitary`, …) against a per-example reference `now` spread across
	2026 — so relative dates become concrete, self-consistent targets (directly trains the failing
	date/time skill, with varied 4-digit years). Conservative: only emits a row when a title AND an
	explicit start time resolve (~7.5k usable turns from train+valid).

	- Run: `python training/import_smcalflow.py --limit 2000 --heldout 200` → writes
	`training/data/smcalflow_train.jsonl` (+ `…_heldout.jsonl`). Both are git-ignored (CC BY-SA
	share-alike vs this repo's Apache-2.0 → we don't commit/redistribute the derived data; the
	importer code is ours) and disjoint from `eval.jsonl`.
	- `train_qlora.py` now trains on `dataset.jsonl` + `smcalflow_train.jsonl` (when present).
	`gated_retrain.py` therefore trains on real data, and still **only publishes if it beats the
	gate** — so a bigger-but-worse model can't reach production.
	- Attribution (required by CC BY-SA): *Semantic Machines et al., "Task-Oriented Dialogue as
	Dataflow Synthesis," TACL 2020.*
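
Resolving relative constructs against a reference `now` can be sketched as follows. This is an illustration only: the function names are hypothetical, and the "strictly after today" reading of `NextDOW` is an assumed convention, not necessarily the one `import_smcalflow.py` uses:

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]


def resolve_tomorrow(now):
    """Concrete date for a `Tomorrow` construct."""
    return now + timedelta(days=1)


def resolve_next_dow(now, weekday):
    """Next occurrence of `weekday` strictly after `now` (assumed convention)."""
    target = WEEKDAYS.index(weekday.lower())
    days_ahead = (target - now.weekday() - 1) % 7 + 1  # always in 1..7
    return now + timedelta(days=days_ahead)
```

With `now = date(2026, 6, 9)` (a Tuesday), `resolve_next_dow(now, "Friday")` gives `date(2026, 6, 12)` and `resolve_next_dow(now, "Tuesday")` gives `date(2026, 6, 16)`. Whichever convention is chosen, the training targets and the eval gold labels need to agree on it.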

	## Step 6 — eval-gated retrain on REAL data (2026-06-09): FAILED gate (worst yet)
	Trained the 31B on 2,122 examples (122 hand-authored + 2,000 real SMCalFlow), 1 epoch,
	through `gated_retrain.py` with a beat-base gate (F1≥0.95, recall≥0.90). Result on the 28-ex eval:

	| Metric | base | v1 (live) | real-data (2,122 ex) |
	| --- | --- | --- | --- |
	| schema validity | 1.00 | 1.00 | 0.107 |
	| event F1 | 0.977 | 0.81 | 0.000 |
	| start-exact recall | 0.955 | 0.773 | 0.000 |

	~90% unparseable output, zero events extracted. Gate FAIL → not promoted; production stays v1.

	### Verdict across 4 fine-tunes (now incl. real data)
	Scores monotonically worsen with more training/data: v1 (69 synth, F1 0.81) → v2 (87, 0.465)
	→ v3 (122, 0.214) → real (2,122, 0.0). This is no longer a data problem — **the SFT recipe
	itself degrades the model**, and more data makes it worse. Most likely root cause to investigate
	if fine-tuning is ever revisited: a train/inference chat-template mismatch — `train_qlora.py`
	formats with Unsloth's `get_chat_template("gemma")` while `llama-server` serves with the GGUF's
	own `--jinja` template; if these differ for Gemma-4, training optimizes a format the server never
	uses, and the divergence compounds with more steps (exactly the monotonic decay seen). Other
	suspects: LR too high (2e-4) / catastrophic forgetting on a near-ceiling base.

	Final, evidence-backed recommendation: serve stock `unsloth/gemma-4-31B-it-GGUF` (best by far)
	and recover clarification via the system prompt. Do NOT route production through any current
	fine-tune. The eval-gate has now correctly rejected 2 bad retrains — keep it as the publish gate.

	## Step 7 — recipe fix + raw-output probe (2026-06-09): training stack implicated, fine-tuning HALTED
	Fixed the suspected train/serve chat-template mismatch (PR #54): Gemma-4's native
	`chat_template.jinja` uses a NEW `<|turn>role … <turn|>` format (no `<start_of_turn>` at all),
	while training forced Unsloth's legacy "gemma" template. `train_qlora.py` now formats with the
	tokenizer's native template (hard `<|turn>` assert), masks loss to the assistant turn, and lowers the LR to 5e-5.
	Retrained on the 2,122-example set through the gate: validity 0.0 — gate FAIL (production
	stays v1, third bad retrain rejected).
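
A hard template assert of the kind described can be a plain string check on the rendered training text. A minimal sketch, assuming the prompt has already been rendered to a string; the function name is hypothetical and only the two markers named above are checked:

```python
def assert_native_template(rendered):
    """Fail fast if a rendered training prompt is not in the native turn format."""
    if "<start_of_turn>" in rendered:
        raise ValueError("legacy <start_of_turn> marker found in rendered prompt")
    if "<|turn>" not in rendered:
        raise ValueError("native <|turn> marker missing from rendered prompt")
    return rendered
```

Running this on every formatted example before training turns a silent train/serve mismatch into an immediate error.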

	Diagnostics that pinpointed the cause:
	- GGUF template check (CPU, ~free): our exported staging GGUF embeds the correct native
	`<|turn>` template (16,934 chars, no `<start_of_turn>`) → train and serve formats are now
	verifiably aligned. Template is exonerated as the remaining cause.
	- Raw-output probe (`/outputs/gemma-cal-staging-Q4_K_M.gguf`): free generation emits pure
	degenerate looping — `'Huddle — — — — — …'` to the token limit; constrained generation emits
	512 tokens of nothing. The weights are destroyed, not misformatted.
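
Looping collapse like this is easy to flag automatically in a probe. A heuristic sketch, assuming whitespace tokenization is good enough for a smoke test; the function names and thresholds are illustrative, not taken from the repo:

```python
from collections import Counter


def dominant_token_share(text):
    """Share of whitespace-separated tokens taken by the most frequent one."""
    tokens = text.split()
    if not tokens:
        return 1.0  # empty output is treated as fully degenerate
    return Counter(tokens).most_common(1)[0][1] / len(tokens)


def looks_degenerate(text, threshold=0.8, min_tokens=20):
    """Heuristic: empty output, or one token dominating a long output."""
    tokens = text.split()
    if not tokens:
        return True
    return len(tokens) >= min_tokens and dominant_token_share(text) >= threshold
```

For instance, `looks_degenerate("Huddle " + "\u2014 " * 100)` is `True`, while a short valid JSON reply is not flagged because it falls under `min_tokens`.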

	With dataset (69→2,122), template (legacy/native), LR (2e-4/5e-5), and masking (on/off) all
	varied, degradation always tracks training steps and ends in token-loop collapse. The remaining
	common factor is Unsloth's QLoRA path for Gemma-4-31B (new architecture; training logs warn
	`get_input_embeddings not auto-handled for Gemma4AudioModel`). Fine-tuning is halted until
	that stack demonstrably works for this arch (or is replaced with plain transformers+PEFT).

	## Step 8 — improve served evals via prompt (stock + targeted SYSTEM additions)
	Base's only eval misses are prompt-fixable: m03 dropped the 2nd event of a multi-event thread;
	q04 didn't ask clarification on a "TBD" plan. Added two surgical SYSTEM lines (list every
	distinct event separately; ask via needs_clarification when day/time is TBD).

	Result: PERFECT SCORE — 1.0 on every metric (n=28, tp/fp/fn = 22/0/0).

	| Metric | base (old prompt) | base + new prompt |
	| --- | --- | --- |
	| schema validity | 1.00 | 1.00 |
	| event precision | 1.00 | 1.00 |
	| start-exact recall | 0.955 | 1.00 |
	| event F1 | 0.977 | 1.00 |
	| no-event accuracy | 1.00 | 1.00 |
	| clarification recall | 0.75 | 1.00 |

	Both misses fixed, nothing regressed. **This is the production configuration: stock
	`unsloth/gemma-4-31B-it-GGUF` + the updated SYSTEM prompt.** (Set Space var
	`MODEL_HF_REPO=unsloth/gemma-4-31B-it-GGUF`; the prompt ships with the app.) The "Well-Tuned"
	artifact remains `ParetoOptimal/gemma-4-cal-gguf` (v1); any future fine-tune must beat THIS
	1.0 baseline through the gate — i.e., match it and win on a harder, expanded eval set.

	## Step 9 — the E4B edge-model campaign (2026-06-10)
	Re-aimed fine-tuning where it has headroom: a Gemma-4 E4B (~8B) edge model that runs without a
	paid A100, gated against stock E4B. Six gated runs, each fixing a diagnosed failure (the fixed
	recipe trained cleanly every time — validity 1.0 throughout, confirming the Step-7 breakage was
	specific to the 31B path):

	| run | change | F1 | recall | clarify | eval |
	| --- | --- | --- | --- | --- | --- |
	| #1 | fixed recipe, 2,122 ex | 0.884 | 0.864 | 1.0 | n=28 |
	| #2 | + weekday-in-prompt (+data regen) | 0.955 | 0.955 | 0.75 | n=28 |
	| #3 | + next-DOW conflict filter (74 rows), 4× hand | 1.0 | 1.0 | 0.75 | n=28 |
	| #4 | + TBD-clarify seeds, 8× hand | 0.93 | 0.909 | 1.0 | n=28 |
	| #5 | clarify seeds, 4× hand | 0.93 | 0.909 | 1.0 | n=28 |
	| — | eval expanded 28→60 (50 events; jitter-resistant) | | | | |
	| #6 | + Batch-7 seeds (next-DOW, "opens") | 0.97 | 0.96 | 1.0 | n=60 |
	| stock E4B (weekday prompt) | | 0.97 | 0.96 | 1.0 | n=60 |

	Run #6 vs stock is an exact statistical tie (identical tp/fp/fn 48/1/2; both miss `e09`
	"next Tuesday" — which resisted 7 explicit training seeds — and one "opens" case each).
	Campaign side effects that improved the PRODUCT for every model: weekday-in-prompt, the
	next-DOW convention cleanup, and the 60-example eval.
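
The weekday-in-prompt change amounts to stating the day name next to the reference date, so the model does not have to derive it. A sketch with hypothetical wording and a hypothetical helper name; the app's actual prompt line is not shown in this document:

```python
from datetime import datetime

# Fixed English names avoid locale-dependent strftime("%A") output.
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def reference_line(now):
    """Build a prompt line that states the weekday explicitly (wording is hypothetical)."""
    return f"Current date/time: {DAY_NAMES[now.weekday()]}, {now.strftime('%Y-%m-%d %H:%M')}"
```

`reference_line(datetime(2026, 6, 9, 14, 5))` yields `Current date/time: Tuesday, 2026-06-09 14:05`.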

	## Step 10 — bare-prompt (internalization) test: no decisive gap
	Dropped the system prompt for both models (identical minimal user content, same JSON-schema
	constraint; `modal_eval.py --minimal-prompt`), measuring internalized task knowledge:

	| bare, n=60 | stock E4B | fine-tuned E4B |
	| --- | --- | --- |
	| schema validity | 0.967 | 1.0 |
	| event F1 | 0.682 | 0.644 |
	| start-exact recall | 0.60 | 0.56 |
	| no-event accuracy | 0.70 | 0.80 |
	| clarification recall | 0.50 | 0.625 |

	Small trade-offs both ways, within noise. **Verdict: at this data scale (139 hand + 2,000
	SMCalFlow) with QLoRA/1-epoch, the E4B fine-tune reaches PARITY with stock, not superiority** —
	non-degraded, perfect validity everywhere, better bare-prompt discipline, slightly weaker bare
	extraction. The strict-dominance gate therefore never auto-promoted it; the candidate GGUF
	remains on the Modal volume (`/outputs/gemma-cal-e4b-staging-Q4_K_M.gguf`). Publishing it as
	the project's edge model at parity is a product decision (zero quality cost; production
	would serve our own fine-tune, fulfilling "Well-Tuned") — deliberately left to the owner, not
	the gate.
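
As a rough sanity check on "within noise", the bare-prompt start-exact recall gap (0.60 vs 0.56 over the 50 gold events) can be put through a two-proportion z-test. This is a back-of-envelope sketch, not part of the harness: it treats the two runs as independent samples, whereas both models were scored on the same events, so a paired test would be the more rigorous choice.

```python
import math


def two_proportion_z(p1, p2, n1, n2):
    """z statistic for the difference of two independent proportions (pooled SE)."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se


# Stock vs fine-tuned E4B, bare-prompt start-exact recall on 50 gold events.
z = two_proportion_z(0.60, 0.56, 50, 50)
print(f"z = {z:.2f}")
```

The statistic comes out well below the conventional 1.96 threshold, which is consistent with reading the gap as a tie at this eval size.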