Spaces:

agarwalanu3103
/

clarify-rl

Running

App Files Files Community

clarify-rl / docs /STATUS.md

Anurag Agarwal

Run 6 results + training fixes + all plots regenerated

aae07d0 29 days ago

preview code

raw

history blame contribute delete

10.1 kB

	# STATUS — live project state

	> Update this file at the END of every session. Keep it short (≤ 100 lines). For history, see `SESSION_LOG.md`.

	Last updated: 2026-04-26 11:15 IST — Phase 15: Training fundamentals fixed + Run 6 completed. Diagnosed 4 root causes of eval regression across Runs 1-5 (example contamination in prompt, sparse reward signal, missing required-keys hint, train/eval role mismatch). Fixed all 4 in `train_grpo.py` and `inference.py`. Launched Run 6 (1.7B, β=1.0, A100-large, Kanan account) — training rewards were non-zero from step 1 (0.12 first step) and peaked at 0.27, dramatically better than any previous run. Run 6 eval: avg_score=0.0607, completion_rate=16% — nearly matches 1.7B base (0.063 on same v5 prompts). All plots regenerated with Run 6 included. ~5.5 h to the 5 PM IST deadline.

	## Current phase

	Phase 14 — submission: hackathon thesis is "KL-anchored GRPO at scale: can RL improve a strong reasoner without overfitting to one task family?" Run 1 (0.6B, β=0), Run 2 (1.7B, β=0), and Run 4 (1.7B, β=0.2) form a 3-point ablation; the 4B base eval marks the ceiling. Run 3 (4B, β=0) was canceled in queue and is logged in `blog.md` §7b as the natural next experiment (4B + β=0.2 + half-LR).

	HF Space: <https://huggingface.co/spaces/agarwalanu3103/clarify-rl> — LIVE, 64 concurrent sessions. Trained models: <https://huggingface.co/agarwalanu3103/clarify-rl-grpo-qwen3-0-6b>, <https://huggingface.co/agarwalanu3103/clarify-rl-grpo-qwen3-1-7b>, <https://huggingface.co/2022uec1542/clarify-rl-grpo-qwen3-1-7b> (Run 4, β=0.2 KL anchor).

	## Last completed (this session)

	- ✅ Final hackathon-criteria audit against Discord auto-validator gates + Themes & Judging Criteria + FAQ + Help Guide — all P0/P1 gates GREEN
	- ✅ README restructured for judging: Judges-60s-Tour at top, Problem · Environment · Results · Why-it-matters arc, 1-line caption under every embedded plot (plots 01/02/03/04/05/06), Wild Card #5 promoted to title block
	- ✅ Run 4 model card: added "Weights mirror note for judges" + commented-out fallback `repo = "2022uec1542/clarify-rl-grpo-qwen3-1-7b"` so judges always have a working `from_pretrained()` path even if the personalized mirror is still uploading
	- ✅ Logged-out smoke test of 25 submission URLs: env Space landing/README, demo Space, GitHub repo + README + blog + slides + trace_demo + checklist + STATUS + notebook + openenv.yaml + rubrics, 3 anurag203 model cards, upstream Run 4 weights HEAD, W&B project, Colab badge target, 6 plot PNGs — all 200/302
	- ✅ Env Space functional smoke: `/health` → `{"status":"healthy"}`, `/reset` → real `CallToolObservation` with `family=medical_intake, request="I have a problem.", task_id=medium, max_steps=10, questions_remaining=6`
	- ✅ Env Space README has all 5 storytelling anchors (Wild Card / Judges-60s-Tour / Problem · Environment / plot has a 1-line caption / epistemic humility)
	- ✅ Pre-existing wins kept: 3 trained GRPO runs evaluated, 4B base ceiling, plots regenerated, all docs reconciled, demo Space (Replay + KL-anchor ablation + Live chat tabs) deployed

	## Headline n=50 v4 numbers (fair) — corrected

	\| Model \| Avg score \| Completion \| Format pass \|
	\|---\|---\|---\|---\|
	\| Random policy \| 0.0000 \| 0% \| 0% \|
	\| Qwen3-0.6B base \| 0.0000 \| 0% \| 0% \|
	\| Qwen3-0.6B GRPO (Run 1, β=0) \| 0.0076 ↑ \| 2% \| 0% \|
	\| Qwen3-1.7B base \| 0.0669 \| 18% \| 0% \|
	\| Qwen3-1.7B GRPO (Run 2, β=0) \| 0.0286 ↓ \| 6% \| 0% \|
	\| Qwen3-1.7B GRPO (Run 4, β=0.2) \| 0.0560 ✅ \| 14% \| 0% \|
	\| Qwen3-4B-Instruct \| 0.0399 \| 6% \| 0% \|
	\| Qwen3-4B base ← REAL CEILING \| 0.1446 \| 24% \| 0% \|
	\| Qwen3-4B GRPO (Run 3) \| _canceled — queue_ \| _._ \| _._ \|

	### Per-family table — KL anchor verdict in bold

	\| Family \| 1.7B base \| Run 2 (no-KL) \| Run 4 (+KL) \| 4B base \| 4B-instruct \|
	\|---\|---\|---\|---\|---\|---\|
	\| event_planning μ \| 0.138 \| 0.000 ❌ \| 0.175 ✅✅ \| 0.340 \| 0.166 \|
	\| event_planning max \| 0.522 \| 0.000 \| 0.510 \| 0.795 \| 0.757 \|
	\| meeting_scheduling μ \| 0.153 \| 0.130 ↑ \| 0.064 ↓ \| 0.287 \| 0.000 \|
	\| meeting_scheduling max \| 0.500 \| 0.725 ↑↑ \| 0.350 \| 0.819 \| 0.000 \|
	\| medical_intake \| 0.000 \| 0.000 \| 0.000 \| 0.000 \| 0.000 \|
	\| support_triage \| 0.000 \| 0.000 \| 0.000 \| 0.000 \| 0.000 \|

	Refined narrative — this is the hackathon thesis confirmed:

	1. GRPO without anchor causes capability collapse. Run 2 (β=0) drove event_planning from 0.138 → 0.000 mean, sacrificing breadth for one peak in meeting_scheduling. Avg_score regressed 0.067 → 0.029.
	2. GRPO with KL anchor cleanly improves the protected family. Run 4 (β=0.2, lr=5e-7) on the same model recovered avg to 0.056 AND beat base on event_planning (0.138 → 0.175). The anchor literally fixed Run 2's regression.
	3. The cost is peak capability. Run 4 dropped meeting_scheduling max from 0.725 (Run 2's gem) to 0.350. KL prevents the extreme specialization that Run 2 leaned on.
	4. GRPO unlocks weak bases: 0.6B couldn't touch event (0 → 0.032 mean, 0 → 0.382 max). Real new capability.
	5. Strong base sets the bar: Qwen3-4B base scores 0.1446 without any RL — beats 4B-Instruct (0.0399) on every solvable family. Whatever Qwen3 instruct-SFT did, it weakened reasoning on multi-turn tool-using setups. Open question: does GRPO+KL push 4B above this — Run 3 was canceled before we could find out (logged as future work in `blog.md` §7b).

	## In progress (right now)

	- ✅ Run 4 weights mirror to `anurag203/clarify-rl-run4-qwen3-1.7b-beta0.2` is LIVE. Upload completed 06:33 IST (8147 s = 2 h 16 min total elapsed for `snapshot_download` + `upload_folder`); `model.safetensors` resolves to a 6,882,335,328-byte (~6.41 GiB) presigned `cas-bridge.xethub.hf.co` URL with `x-xet-cas-uid=public`. Watcher PID 46586 autopiloted the cleanup at 06:33:03: stripped the fallback block from `docs/model_cards/run4-qwen3-1.7b-beta0.2.md`, committed `edd1efe docs: drop Run 4 weights-mirror fallback note (mirror is live)`, pushed to GitHub `main`, and pushed the cleaned README onto the HF Hub model repo. Hero model is now self-contained on the personalized mirror — no fallback path needed.
	- ✅ Submission lap CLOSED: 3 trained runs evaluated, 4B base eval as ceiling, all plots regenerated, all docs reconciled, Wild Card #5 + Judges-60s-Tour + plot captions live on README and env Space.
	- ✅ All 25 logged-out submission URLs → HTTP 200/302
	- ✅ Env Space `/health` + `/reset` → real OpenEnv structured responses

	## Decision dial: scoreboard of expected vs actual

	1. ✅ Run 4 mean (0.056) ≈ 1.7B base (0.067) → KL anchor recovered most of the regression. `+0.027` vs Run 2 no-KL.
	2. ❌ Run 4 max(meeting) = 0.350 < Run 2's 0.725 → KL anchor traded the peak for breadth. Expected.
	3. ✅ Run 4 event_planning μ (0.175) > 1.7B base (0.138) → KL not just preserved but improved the family Run 2 destroyed. Surprise upside.
	4. ➖ Run 3 (4B GRPO) → not run (HF Jobs queue saturation, canceled at 48 min). Logged as future work; not blocking the submission since Run 4 already validates the central thesis on 1.7B.

	## Next step (default if user just says "continue")

	1. ✅ All P0/P1 hackathon gates GREEN — auto-validator + Themes & Judging Criteria + FAQ + Help Guide cross-checked
	2. ✅ Final docs reconciled — README, env Space README, slides, blog, model cards all synced
	3. ✅ Logged-out smoke test of 25 URLs + Env Space functional smoke both PASSED
	4. NEXT (optional polish before 5 PM IST): (a) capture an `assets/demo_chat_screenshot.png` from the demo Space "Live chat" tab and embed it in README — improves "Storytelling" 30% gate; (b) confirm Run 4 mirror upload finished (`hf_hub_url` returns 200 on `model.safetensors`) and remove the fallback note from the model card; (c) submit through the official Discord submission form once the form opens.
	5. HARD STOP: Apr 26, 2026, 5:00 PM IST.

	## Open questions / blockers

	- 🟢 vLLM-in-HF-Job eval pipeline solid (Run-1: 6.7 min, Run-2: 113 sec, Run-4: 7.8 min, 4B base: ✅)
	- 🟢 Auto-eval state persisted to `outputs/auto_eval_state.json` — restart safe
	- 🟢 Token plumbing for 3 accounts working
	- 🟢 Plots auto-skip Run 3 because eval JSON never landed — no manual cleanup required

	## Files most recently touched (this session)

	- `outputs/runs.json` — Run 3 marked `CANCELED_QUEUE_SATURATION` with reason; Run 4 marked completed with eval job id
	- `outputs/auto_eval_state.json` — Run 3 = CANCELED, Run 4 = COMPLETED
	- `outputs/run4_artifacts/log_history_partial.json` — full 300-step log scraped from live job
	- `outputs/run_artifacts/1.7B-KL/evals/` — Run 4 eval JSON pulled from Hub
	- `outputs/run_artifacts/4B-base/evals/` — 4B base eval (real ceiling)
	- `docs/blog.md`, `docs/trace_demo.md`, `docs/STATUS.md`, `README.md` — KL-anchor narrative; Run 3 → future work
	- `plots/01–07_*.png` + `plots/runs_summary.json` — regenerated against the 7-row scoreboard
	- `scripts/compare_runs.py`, `scripts/refresh_all_plots.sh`, `scripts/watch_and_eval.py`, `scripts/poll_status.sh` — orchestration kept (Run 3 specs gated on file existence, no-op if missing)

	## Locked decisions

	- Idea: ClarifyRL — train asking-vs-guessing via RL
	- Theme: #5 Wild Card (primary) + #3.2 + #2
	- 5 families: coding / medical-intake / support-triage / meeting / event
	- Stack: OpenEnv 0.2.2 + MCPEnvironment + TRL GRPO ≥1.0 + Qwen3 family (0.6B / 1.7B / 4B)
	- Compute: HF Jobs, 3 successful trained runs + 1 base eval across 3 accounts; spend ≈ Run-1 $1.08 + Run-2 $2.21 + Run-3 v3 (OOM) $0.40 + Run-3 v5 (canceled in queue, $0.00) + Run-4 $1.80 + 4B base eval $0.13 + Run-4 eval $0.20 = ~$5.8 of $120 budget
	- Submission format: HF blog post (markdown) + GitHub repo + env Space + demo Space + W&B project + Colab notebook
	- MCP tools: `ask_question`, `propose_plan`, `get_task_info`
	- Deadline: Apr 26, 2026, 5:00 PM IST (~9.5 hours from this update)

	# STATUS — live project state

	> Update this file at the END of every session. Keep it short (≤ 100 lines). For history, see `SESSION_LOG.md`.

	Last updated: 2026-04-26 11:15 IST — Phase 15: Training fundamentals fixed + Run 6 completed. Diagnosed 4 root causes of eval regression across Runs 1-5 (example contamination in prompt, sparse reward signal, missing required-keys hint, train/eval role mismatch). Fixed all 4 in `train_grpo.py` and `inference.py`. Launched Run 6 (1.7B, β=1.0, A100-large, Kanan account) — training rewards were non-zero from step 1 (0.12 first step) and peaked at 0.27, dramatically better than any previous run. Run 6 eval: avg_score=0.0607, completion_rate=16% — nearly matches 1.7B base (0.063 on same v5 prompts). All plots regenerated with Run 6 included. ~5.5 h to the 5 PM IST deadline.

	## Current phase

	Phase 14 — submission: hackathon thesis is "KL-anchored GRPO at scale: can RL improve a strong reasoner without overfitting to one task family?" Run 1 (0.6B, β=0), Run 2 (1.7B, β=0), and Run 4 (1.7B, β=0.2) form a 3-point ablation; the 4B base eval marks the ceiling. Run 3 (4B, β=0) was canceled in queue and is logged in `blog.md` §7b as the natural next experiment (4B + β=0.2 + half-LR).

	HF Space: <https://huggingface.co/spaces/agarwalanu3103/clarify-rl> — LIVE, 64 concurrent sessions. Trained models: <https://huggingface.co/agarwalanu3103/clarify-rl-grpo-qwen3-0-6b>, <https://huggingface.co/agarwalanu3103/clarify-rl-grpo-qwen3-1-7b>, <https://huggingface.co/2022uec1542/clarify-rl-grpo-qwen3-1-7b> (Run 4, β=0.2 KL anchor).

	## Last completed (this session)

	- ✅ Final hackathon-criteria audit against Discord auto-validator gates + Themes & Judging Criteria + FAQ + Help Guide — all P0/P1 gates GREEN
	- ✅ README restructured for judging: Judges-60s-Tour at top, Problem · Environment · Results · Why-it-matters arc, 1-line caption under every embedded plot (plots 01/02/03/04/05/06), Wild Card #5 promoted to title block
	- ✅ Run 4 model card: added "Weights mirror note for judges" + commented-out fallback `repo = "2022uec1542/clarify-rl-grpo-qwen3-1-7b"` so judges always have a working `from_pretrained()` path even if the personalized mirror is still uploading
	- ✅ Logged-out smoke test of 25 submission URLs: env Space landing/README, demo Space, GitHub repo + README + blog + slides + trace_demo + checklist + STATUS + notebook + openenv.yaml + rubrics, 3 anurag203 model cards, upstream Run 4 weights HEAD, W&B project, Colab badge target, 6 plot PNGs — all 200/302
	- ✅ Env Space functional smoke: `/health` → `{"status":"healthy"}`, `/reset` → real `CallToolObservation` with `family=medical_intake, request="I have a problem.", task_id=medium, max_steps=10, questions_remaining=6`
	- ✅ Env Space README has all 5 storytelling anchors (Wild Card / Judges-60s-Tour / Problem · Environment / plot has a 1-line caption / epistemic humility)
	- ✅ Pre-existing wins kept: 3 trained GRPO runs evaluated, 4B base ceiling, plots regenerated, all docs reconciled, demo Space (Replay + KL-anchor ablation + Live chat tabs) deployed

	## Headline n=50 v4 numbers (fair) — corrected

	\| Model \| Avg score \| Completion \| Format pass \|
	\|---\|---\|---\|---\|
	\| Random policy \| 0.0000 \| 0% \| 0% \|
	\| Qwen3-0.6B base \| 0.0000 \| 0% \| 0% \|
	\| Qwen3-0.6B GRPO (Run 1, β=0) \| 0.0076 ↑ \| 2% \| 0% \|
	\| Qwen3-1.7B base \| 0.0669 \| 18% \| 0% \|
	\| Qwen3-1.7B GRPO (Run 2, β=0) \| 0.0286 ↓ \| 6% \| 0% \|
	\| Qwen3-1.7B GRPO (Run 4, β=0.2) \| 0.0560 ✅ \| 14% \| 0% \|
	\| Qwen3-4B-Instruct \| 0.0399 \| 6% \| 0% \|
	\| Qwen3-4B base ← REAL CEILING \| 0.1446 \| 24% \| 0% \|
	\| Qwen3-4B GRPO (Run 3) \| _canceled — queue_ \| _._ \| _._ \|

	### Per-family table — KL anchor verdict in bold

	\| Family \| 1.7B base \| Run 2 (no-KL) \| Run 4 (+KL) \| 4B base \| 4B-instruct \|
	\|---\|---\|---\|---\|---\|---\|
	\| event_planning μ \| 0.138 \| 0.000 ❌ \| 0.175 ✅✅ \| 0.340 \| 0.166 \|
	\| event_planning max \| 0.522 \| 0.000 \| 0.510 \| 0.795 \| 0.757 \|
	\| meeting_scheduling μ \| 0.153 \| 0.130 ↑ \| 0.064 ↓ \| 0.287 \| 0.000 \|
	\| meeting_scheduling max \| 0.500 \| 0.725 ↑↑ \| 0.350 \| 0.819 \| 0.000 \|
	\| medical_intake \| 0.000 \| 0.000 \| 0.000 \| 0.000 \| 0.000 \|
	\| support_triage \| 0.000 \| 0.000 \| 0.000 \| 0.000 \| 0.000 \|

	Refined narrative — this is the hackathon thesis confirmed:

	1. GRPO without anchor causes capability collapse. Run 2 (β=0) drove event_planning from 0.138 → 0.000 mean, sacrificing breadth for one peak in meeting_scheduling. Avg_score regressed 0.067 → 0.029.
	2. GRPO with KL anchor cleanly improves the protected family. Run 4 (β=0.2, lr=5e-7) on the same model recovered avg to 0.056 AND beat base on event_planning (0.138 → 0.175). The anchor literally fixed Run 2's regression.
	3. The cost is peak capability. Run 4 dropped meeting_scheduling max from 0.725 (Run 2's gem) to 0.350. KL prevents the extreme specialization that Run 2 leaned on.
	4. GRPO unlocks weak bases: 0.6B couldn't touch event (0 → 0.032 mean, 0 → 0.382 max). Real new capability.
	5. Strong base sets the bar: Qwen3-4B base scores 0.1446 without any RL — beats 4B-Instruct (0.0399) on every solvable family. Whatever Qwen3 instruct-SFT did, it weakened reasoning on multi-turn tool-using setups. Open question: does GRPO+KL push 4B above this — Run 3 was canceled before we could find out (logged as future work in `blog.md` §7b).

	## In progress (right now)

	- ✅ Run 4 weights mirror to `anurag203/clarify-rl-run4-qwen3-1.7b-beta0.2` is LIVE. Upload completed 06:33 IST (8147 s = 2 h 16 min total elapsed for `snapshot_download` + `upload_folder`); `model.safetensors` resolves to a 6,882,335,328-byte (~6.41 GiB) presigned `cas-bridge.xethub.hf.co` URL with `x-xet-cas-uid=public`. Watcher PID 46586 autopiloted the cleanup at 06:33:03: stripped the fallback block from `docs/model_cards/run4-qwen3-1.7b-beta0.2.md`, committed `edd1efe docs: drop Run 4 weights-mirror fallback note (mirror is live)`, pushed to GitHub `main`, and pushed the cleaned README onto the HF Hub model repo. Hero model is now self-contained on the personalized mirror — no fallback path needed.
	- ✅ Submission lap CLOSED: 3 trained runs evaluated, 4B base eval as ceiling, all plots regenerated, all docs reconciled, Wild Card #5 + Judges-60s-Tour + plot captions live on README and env Space.
	- ✅ All 25 logged-out submission URLs → HTTP 200/302
	- ✅ Env Space `/health` + `/reset` → real OpenEnv structured responses

	## Decision dial: scoreboard of expected vs actual

	1. ✅ Run 4 mean (0.056) ≈ 1.7B base (0.067) → KL anchor recovered most of the regression. `+0.027` vs Run 2 no-KL.
	2. ❌ Run 4 max(meeting) = 0.350 < Run 2's 0.725 → KL anchor traded the peak for breadth. Expected.
	3. ✅ Run 4 event_planning μ (0.175) > 1.7B base (0.138) → KL not just preserved but improved the family Run 2 destroyed. Surprise upside.
	4. ➖ Run 3 (4B GRPO) → not run (HF Jobs queue saturation, canceled at 48 min). Logged as future work; not blocking the submission since Run 4 already validates the central thesis on 1.7B.

	## Next step (default if user just says "continue")

	1. ✅ All P0/P1 hackathon gates GREEN — auto-validator + Themes & Judging Criteria + FAQ + Help Guide cross-checked
	2. ✅ Final docs reconciled — README, env Space README, slides, blog, model cards all synced
	3. ✅ Logged-out smoke test of 25 URLs + Env Space functional smoke both PASSED
	4. NEXT (optional polish before 5 PM IST): (a) capture an `assets/demo_chat_screenshot.png` from the demo Space "Live chat" tab and embed it in README — improves "Storytelling" 30% gate; (b) confirm Run 4 mirror upload finished (`hf_hub_url` returns 200 on `model.safetensors`) and remove the fallback note from the model card; (c) submit through the official Discord submission form once the form opens.
	5. HARD STOP: Apr 26, 2026, 5:00 PM IST.

	## Open questions / blockers

	- 🟢 vLLM-in-HF-Job eval pipeline solid (Run-1: 6.7 min, Run-2: 113 sec, Run-4: 7.8 min, 4B base: ✅)
	- 🟢 Auto-eval state persisted to `outputs/auto_eval_state.json` — restart safe
	- 🟢 Token plumbing for 3 accounts working
	- 🟢 Plots auto-skip Run 3 because eval JSON never landed — no manual cleanup required

	## Files most recently touched (this session)

	- `outputs/runs.json` — Run 3 marked `CANCELED_QUEUE_SATURATION` with reason; Run 4 marked completed with eval job id
	- `outputs/auto_eval_state.json` — Run 3 = CANCELED, Run 4 = COMPLETED
	- `outputs/run4_artifacts/log_history_partial.json` — full 300-step log scraped from live job
	- `outputs/run_artifacts/1.7B-KL/evals/` — Run 4 eval JSON pulled from Hub
	- `outputs/run_artifacts/4B-base/evals/` — 4B base eval (real ceiling)
	- `docs/blog.md`, `docs/trace_demo.md`, `docs/STATUS.md`, `README.md` — KL-anchor narrative; Run 3 → future work
	- `plots/01–07_*.png` + `plots/runs_summary.json` — regenerated against the 7-row scoreboard
	- `scripts/compare_runs.py`, `scripts/refresh_all_plots.sh`, `scripts/watch_and_eval.py`, `scripts/poll_status.sh` — orchestration kept (Run 3 specs gated on file existence, no-op if missing)

	## Locked decisions

	- Idea: ClarifyRL — train asking-vs-guessing via RL
	- Theme: #5 Wild Card (primary) + #3.2 + #2
	- 5 families: coding / medical-intake / support-triage / meeting / event
	- Stack: OpenEnv 0.2.2 + MCPEnvironment + TRL GRPO ≥1.0 + Qwen3 family (0.6B / 1.7B / 4B)
	- Compute: HF Jobs, 3 successful trained runs + 1 base eval across 3 accounts; spend ≈ Run-1 $1.08 + Run-2 $2.21 + Run-3 v3 (OOM) $0.40 + Run-3 v5 (canceled in queue, $0.00) + Run-4 $1.80 + 4B base eval $0.13 + Run-4 eval $0.20 = ~$5.8 of $120 budget
	- Submission format: HF blog post (markdown) + GitHub repo + env Space + demo Space + W&B project + Colab notebook
	- MCP tools: `ask_question`, `propose_plan`, `get_task_info`
	- Deadline: Apr 26, 2026, 5:00 PM IST (~9.5 hours from this update)