# SESSION LOG - append-only history
> Newest entries on TOP. Each agent appends one entry at session end. Keep entries to 5-7 bullets max.
---
## 2026-04-26 07:25 IST - Claude (Cursor) - Phase 11: submission lap CLOSED + final polish
- **Run 4 weights mirror to `anurag203/clarify-rl-run4-qwen3-1.7b-beta0.2` is LIVE** (started 04:17, completed 06:33, 2h16min total for `snapshot_download` + `upload_folder` of the 6.88 GB model). `model.safetensors` resolves to a 6,882,335,328-byte presigned `cas-bridge.xethub.hf.co` URL with `X-Xet-Cas-Uid=public`, so judges can download without auth. All 3 mirrors (Run 1, Run 2, Run 4) confirmed `private=False` via HF API.
- **Auto-watcher script** (`/tmp/clarify-rl-mirror/await_mirror_and_finalize.py`) self-piloted the cleanup at 06:33:03: stripped the fallback block from `docs/model_cards/run4-qwen3-1.7b-beta0.2.md`, committed `edd1efe`, pushed to GitHub `main` and HF Hub model repo. Hero model is now self-contained on the personalized mirror.
- **README restructured for judges** (commit `9240521` and earlier): added `## ⏱ Judges - 60-second tour` at top with 5-step flow + Problem/Environment/Results/Why-it-matters arc + 1-line caption under every plot + Wild Card #5 promoted into the title block + embedded `assets/demo_replay_screenshot.png` (Replay viewer tab showing a Run 4 rollout with per-rubric breakdown, 1440x1100 / 187 KB, captured via headless Chrome).
- **Run 4 model card YAML cleanup** (commit `2e31357`): removed the `datasets: agarwalanu3103/clarify-rl` pill that HF Hub was rendering as a disabled grey tag (it was the env Space, not a dataset). Pushed cleaned card to GitHub + HF Hub. Confirmed `cardData.datasets: ABSENT` and `tags with dataset prefix: []` on the API.
- **Final logged-out smoke test**: 15/15 README-referenced URLs return HTTP 200 (env Space, demo Space, all 3 model mirrors, Colab notebook, GitHub README, raw plots, blog, model cards). Env Space `POST /reset` returns a real `CallToolObservation` (`meeting_scheduling` family, 6-question budget, proper instructions). Submission auto-validator gates all GREEN.
- **Files touched this phase**: `README.md` (60s tour + screenshot embed + Wild Card promotion), `docs/STATUS.md` (mirror progress + final state), `docs/model_cards/run4-qwen3-1.7b-beta0.2.md` (fallback note removed by watcher + datasets pill removed), `docs/slides.md` (path-typo fix `server/rubric/` → `server/rubrics.py`), `assets/demo_replay_screenshot.png` (new).
- **Final spend**: ≈ $5.8 of $120 HF Jobs budget across 3 trained runs + 1 base eval + 1 Run 4 eval. ~9.5 h to the 5 PM IST deadline at log close. All deliverable links live and public.
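
A logged-out smoke test like the one above can be sketched with the standard library alone. This is a minimal illustration, not the script used in this session: `http_status`, `check_urls`, and `all_ok` are hypothetical names, and the fetcher is injectable so the logic can be exercised without network access. Real usage would pass the README-referenced URLs to `check_urls` and gate on `all_ok`.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status of an unauthenticated GET (no token, no cookies)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses arrive as exceptions; report their status code.
        # Non-HTTP failures (DNS, TLS, timeout) are left to propagate.
        return err.code


def check_urls(
    urls: Iterable[str], fetch: Callable[[str], int] = http_status
) -> Dict[str, int]:
    """Map each URL to the status code returned by `fetch`."""
    return {url: fetch(url) for url in urls}


def all_ok(results: Dict[str, int]) -> bool:
    """True only when every checked URL answered with HTTP 200."""
    return all(code == 200 for code in results.values())
```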
---
## 2026-04-25 23:55 IST - Cascade (Cursor) - Phase 9: TWO root-cause bugs found and fixed
- **Run 2 launched** (Qwen3-1.7B / 400 steps / a100-large): currently downloading deps + bootstrapping vLLM. Expected to run to completion in ~3h.
- **Root cause #1: parser couldn't parse what the trained model emits.** Trained 0.6B emits `function_call(arg="value")` form (e.g. `ask_question(question="What is your budget? (in USD)")`), and the OLD parser used a naive regex that broke on **nested parentheses** in the question text. Rewrote `parse_tool_call` with:
  - `_find_balanced_func_call(text)`: walks the string with a paren+quote depth counter to extract `(name, body)` even when `body` contains `(...)` and JSON.
  - `_parse_positional_args(...)`: handles `key="value"`, `key={json}`, and bare positional forms.
  - `_parse_prefixed_call(...)`: NEW. Handles `ASK: {...}`, `PROPOSE: {...}`, `Q: text`, `PLAN: text` shapes that the trained 0.6B uses ~20% of the time.
- **Root cause #2: eval system prompt's example was being copied verbatim.** Old eval `SYSTEM_PROMPT` hard-coded `propose_plan(plan='{"stack": "python+fastapi", "scale": "1k users"}')` as an example. Inspected the Qwen3-1.7B base eval (n=50): **50/50** scenarios produced exactly `{"stack": "python+fastapi", "scale": "1k users"}` for **event_planning** and **medical_intake** tasks. The base model literally just copied the system-prompt example. Aligned eval `SYSTEM_PROMPT` character-for-character with `train_grpo.py:PROMPT` so the trained model has zero distribution shift between train and eval.
- **Re-launch plan**: launched 3 fresh `n50_v3` evals (0.6B-base, 1.7B-base, 0.6B-trained) with the prompt-aligned, prefix-tolerant parser. These are the first FAIR measurements of the trained-vs-untrained gap.
- **Pre-fix table is misleading**: the 0/50 scores reflect the eval bug, not the model's true behaviour. Trained 0.6B was scoring some non-zero rewards mid-trajectory (0.02–0.05 per question that revealed a field) before submitting a copied-example plan that scored 0. Post-fix numbers will be the headline.
- **Files touched**: `inference.py` (new `_PREFIX_TO_TOOL` table, `_parse_prefixed_call`, `_find_balanced_func_call`, `_parse_positional_args`, aligned `SYSTEM_PROMPT`); `scripts/make_plots.py` (multi-run `--log-history`/`--eval` support); `scripts/poll_training.py` (new robust monitor with retry-on-empty); `scripts/refresh_all_plots.sh` (new orchestrator).
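
The balanced-call extraction described under root cause #1 can be sketched as below. This is an illustrative reconstruction, not the code in `inference.py`: `find_balanced_call` and `_scan_body` are hypothetical names, and the only behaviour assumed is what the entry states (a depth counter that ignores parentheses inside quoted strings).

```python
from typing import Optional, Tuple


def _scan_body(text: str, open_idx: int) -> Optional[str]:
    """Return the text between the '(' at open_idx and its matching ')'."""
    depth = 0
    quote = None  # active quote character while inside a string literal
    j = open_idx
    while j < len(text):
        ch = text[j]
        if quote is not None:
            if ch == "\\":
                j += 2  # skip the escaped character inside the string
                continue
            if ch == quote:
                quote = None
        elif ch in ("'", '"'):
            quote = ch
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                return text[open_idx + 1 : j]
        j += 1
    return None  # unbalanced: no matching close paren


def find_balanced_call(text: str) -> Optional[Tuple[str, str]]:
    """Return (name, body) for the first well-formed `name(...)` in text."""
    i, n = 0, len(text)
    while i < n:
        if text[i].isalpha() or text[i] == "_":
            start = i
            while i < n and (text[i].isalnum() or text[i] == "_"):
                i += 1
            if i < n and text[i] == "(":
                body = _scan_body(text, i)
                if body is not None:
                    return text[start:i], body
            continue  # i already points past the identifier
        i += 1
    return None
```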
---
## 2026-04-25 23:25 IST - Cascade (Cursor) - Phase 8: run 1 evaluated end-to-end
- **Run 1 finished**: Qwen3-0.6B / 300 steps / a100-large / 38 min wall / pushed `agarwalanu3103/clarify-rl-grpo-qwen3-0-6b` to Hub. Reward grew **7×** (0.006 → 0.045) over 300 steps, so the pipeline is working.
- **HF Inference Router refuses fine-tuned uploads** (`model_not_supported` 400). Built `scripts/eval_with_vllm.py` + `scripts/launch_eval_job.sh` to host vLLM ourselves inside an HF Job (snapshot-downloads the project Space for `inference.py` + scenarios; runs `scripts/run_eval.py` against local vLLM; uploads results to model repo's `evals/` folder).
- **Qwen3 `<think>` token-waste fix**: trained model burned its full 300-token budget inside `<think>` blocks during eval, never reaching TOOL/ARGS. Patched `inference.py` to pass `extra_body={"chat_template_kwargs": {"enable_thinking": False}}` (matches train_grpo.py) and bumped `MAX_TOKENS=800`. Eval runtime dropped from never-completes → 33.6s for 50 scenarios.
- **Headline numbers (run 1)**: trained avg=0.00 / format_pass=0% / 0 winners on 50 held-out, but avg-questions dropped to **3.32** (vs 5.06 for Qwen3-8B base, 5.87 for Qwen3-4B-Instruct, 4.0 for random policy). All untrained models also fail FormatCheck (best: 4B-Instruct 2/30 with max=0.81).
- **Failure mode identified**: trained 0.6B submits empty `{}` plans or wrong-family schemas (`{"stack": ..., "scale": ...}` for an event_planning task). The family→required-keys mapping wasn't learned in 300 steps; it needs more steps or a bigger base.
- **All 5 plots generated** from real run-1 data (`plots/01_..._05_..png`). `docs/trace_demo_run1.md` written with 3 illustrative scenarios. `outputs/baselines.json` refreshed with comparative table. Total spend so far: **$1.08 of $120 budget**.
- **Next**: as soon as Anurag pastes HF_TOKEN_2/3 + GitHub URL, launch runs 2/3 (1.7B + 4B) in parallel; the larger bases are expected to clear FormatCheck.
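
The `<think>` fix above amounts to one extra field on each chat request. A minimal sketch of the request payload, assuming an OpenAI-compatible vLLM endpoint; `build_chat_request` is a hypothetical helper, not a function from `inference.py`. The resulting dict would be unpacked into `client.chat.completions.create(**request)`, where `extra_body` is the OpenAI Python client's pass-through for non-standard fields.

```python
from typing import Any, Dict, List

MAX_TOKENS = 800  # value quoted in the entry above


def build_chat_request(model: str, messages: List[Dict[str, str]]) -> Dict[str, Any]:
    """Build kwargs for a chat completion with Qwen3 thinking disabled."""
    return {
        "model": model,
        "messages": messages,
        "max_tokens": MAX_TOKENS,
        # Forwarded verbatim to the server; the chat template then skips
        # the <think> block so the budget goes to the tool call itself.
        "extra_body": {"chat_template_kwargs": {"enable_thinking": False}},
    }
```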
---
## 2026-04-25 22:00 IST - Cascade (Cursor) - Phase 7: smoke green + run 1 launched
- **SMOKE TEST PASSED** (job `69ece82ad70108f37acde9b8`): Qwen3-0.6B / a10g-small / 5 steps / train_loss=-0.099 / 67s wall / model + Trackio bucket saved cleanly
- Resolved a 7-issue dependency cascade discovered by the smoke test (in chronological order):
  1. `vllm_ascend`: TRL unconditionally imports it on CUDA → monkey-patch `importlib.util.find_spec` to return None
  2. `mergekit`: TRL <1.0 eagerly imports it; pinned `trl[vllm]>=1.0` in launcher to drop the dependency entirely (mergekit also has incompatible pydantic constraints with vllm 0.10.2+)
  3. `llm_blender`: TRL eagerly imports it; `llm_blender` itself imports `TRANSFORMERS_CACHE` (removed in transformers 5.x). Solution: stub via `sys.modules` + `_BLOCKED_PACKAGES` extension
  4. `peft`: TRL callbacks need PeftModel; added explicit `--with peft`
  5. `transformers 5.x compatibility`: keeping `git+https://...transformers@main` for vllm 0.17 + trl 1.0 cohabitation
  6. `chat_template_kwargs` only exists in trl >= 1.0: added `trl[vllm]>=1.0` pin AND defensive `_grpo_kwargs` filter via `dataclasses.fields(GRPOConfig)` so we survive future TRL upgrades
  7. Old `llm_blender.__spec__ is None`: extended monkey-patch to override `transformers.utils.import_utils._is_package_available` for blocked packages
- **Defensive `GRPOConfig` init**: `_grpo_kwargs` dict + filter against installed TRL's dataclass fields → drops unsupported kwargs with a warning instead of crashing
- **macOS truststore**: patched `.venv/bin/hf` entry-point to inject `truststore.inject_into_ssl()` for corporate proxy SSL (was blocking `hf jobs logs`)
- **Production run 1 launched**: Qwen3-0.6B / a100-large / 300 steps (job `69ece976d2c8bd8662bcdf48`) on account 1 (`agarwalanu3103`); ETA ~50 min wall, ~$2.50
- **`docs/ANURAG_TODO.md` rewritten**: clean GUI-vs-terminal split for the user. They need to (A) provide HF tokens for accounts 2+3, (B) click Qwen3 license accept, (C) create a GitHub repo, (D) flip model cards to public after training. I handle everything else.
- **Decisions locked this session**: TRL pinned `>=1.0` (drops mergekit need); flavor matrix updated to a100-large for 0.6B+1.7B and h200 for 4B; compute strategy unchanged (3 parallel runs, ~$20 projected)
- **Next**: monitor run 1 logs; await user to paste HF tokens 2+3 + GitHub URL; launch runs 2+3 + bake baseline evals while runs train
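
Two of the defensive patterns from this entry, sketched under stated assumptions: filtering kwargs against a config dataclass's fields, and making `importlib.util.find_spec` report selected packages as missing. `filter_supported_kwargs`, `hide_packages`, and `DemoConfig` are hypothetical names; `DemoConfig` stands in for `GRPOConfig` so the example runs without TRL installed. The real launcher patches permanently, while this sketch uses a context manager so the original function is restored.

```python
import dataclasses
import importlib.util
import warnings
from contextlib import contextmanager
from typing import Any, Dict, Iterator, Set


def filter_supported_kwargs(config_cls: type, kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only kwargs that name a field of the dataclass `config_cls`."""
    supported = {f.name for f in dataclasses.fields(config_cls)}
    dropped = sorted(set(kwargs) - supported)
    if dropped:
        warnings.warn(f"Dropping unsupported config kwargs: {dropped}")
    return {k: v for k, v in kwargs.items() if k in supported}


@contextmanager
def hide_packages(blocked: Set[str]) -> Iterator[None]:
    """Temporarily make find_spec return None for top-level names in `blocked`.

    Only callers that look up `importlib.util.find_spec` at call time see
    the patch; code that bound the function earlier via
    `from importlib.util import find_spec` is unaffected.
    """
    original = importlib.util.find_spec

    def patched(name, package=None):
        if name.split(".")[0] in blocked:
            return None
        return original(name, package)

    importlib.util.find_spec = patched
    try:
        yield
    finally:
        importlib.util.find_spec = original


@dataclasses.dataclass
class DemoConfig:
    """Hypothetical stand-in for GRPOConfig, used only for this example."""

    learning_rate: float = 1e-6
    num_generations: int = 8
```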
---
## 2026-04-25 20:30 IST - Cascade (Cursor) - Phase 6: budget-unlocked rewrite
- User confirmed **full $120 HF Jobs budget is available**, so strategy switched from "save money" to "buy reliability + quality"
- **Root-cause fix for `0.000000` loss pathology**: `NUM_GENERATIONS=2` produces zero advantage whenever both rollouts in a group earn the same reward, which is common early in training when they agree on tokens. Default now auto-tunes by GPU tier: 4/8/8 for 24GB/40GB/80GB.
- **vllm_gpu_memory_utilization**: raised from flat 0.40 → 0.55 on 40+ GB tiers. Was leaving half of A100/H100 idle.
- **`training/train_grpo.py` upgrades**: SMOKE_TEST mode, OOM trap, RESUME_FROM_CKPT, auto-resume from existing checkpoints, FAILED marker
- **New scripts**: `scripts/launch_all.sh` (parallel multi-account launcher), `scripts/preflight.sh` (concurrent WS probe), `scripts/run_post_train_eval.sh` (post-training eval orchestrator)
- **Server change**: `SUPPORTS_CONCURRENT_SESSIONS=True` + `max_concurrent_envs=8`, which enables 3-4 parallel HF Jobs against one Space
- **Compute plan locked**: 0.6B/a10g-large/500 + 1.7B/a100-large/400 + 4B/h100-large/250, ~$70 total; optional $25 insurance run on 1.7B seed=84
- **Llama / Qwen2.5-Instruct rejected**: chat template fails TRL `add_response_schema`; not retesting under time pressure
- **Tokens received**: agarwalanu3103, Kanan, mnit (3 accounts), held in env vars only, not committed
- **Doc updates**: `docs/11-submission-plan.md` rewritten with phased rollout, `docs/blog.md` skeleton drafted with `<PLACEHOLDER>` for post-eval numbers, `README.md` refreshed with Colab badge + plot embeds + model card links
- **Next**: Anurag pushes server/ to Space (~5 min), runs preflight, runs smoke ($0.50), then launch_all.sh fires production runs
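
Why a group of 2 can stall training: in the usual group-normalized form of the GRPO advantage, each reward is centred on its group mean, so identical rewards give exactly zero advantage and no gradient signal. The sketch below illustrates that arithmetic together with the tier defaults quoted above. `group_advantages` and `defaults_for_vram` are hypothetical helpers, the `eps` value is an assumption, and treating tiers under 24 GB like the 24 GB tier is also an assumption.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def group_advantages(rewards: List[float], eps: float = 1e-4) -> List[float]:
    """Group-normalized advantages: (r - mean) / (std + eps) per rollout."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def defaults_for_vram(vram_gb: int) -> Tuple[int, float]:
    """Return (num_generations, vllm_gpu_memory_utilization) for a GPU tier.

    Mirrors the numbers in the entry above: 4 generations at 0.40 on the
    24 GB tier, 8 generations at 0.55 on 40 GB and larger.
    """
    if vram_gb >= 40:
        return 8, 0.55
    return 4, 0.40


# Two rollouts with the same reward: every advantage is exactly zero.
print(group_advantages([0.0, 0.0]))
# A larger group is more likely to contain differing rewards.
print(group_advantages([0.0, 0.05, 0.0, 0.02]))
```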
---
## 2026-04-25 14:50 IST - Cascade (Windsurf) - Phase 4: deploy + baseline eval
- **inference.py rewritten** to submission format: OpenAI client, WebSocket env communication, `[START]/[STEP]/[END]` structured logs, `BASELINE_MODE=policy/hybrid/llm`, policy fallback
- **HF Space deployed**: `agarwalanu3103/clarify-rl` (Docker build); `/health` + `/reset` verified live
- Fixed Dockerfile (removed lockfile dependency), added `.dockerignore`, `SUBMISSION_CHECKLIST.md`, `truststore` SSL fix
- **Policy baseline ran**: all 3 tasks (easy/medium/hard) complete end-to-end, scores 0.00 (expected: empty plan). Per-question rewards confirmed (0.02-0.05)
- `pyproject.toml` updated: added `openai`, `websockets` deps
- Decisions locked this session: HF Space account = `agarwalanu3103`; inference uses OpenAI client (not huggingface_hub)
- **Next**: run hybrid/LLM baseline with HF credits, then start GRPO training
---
## 2026-04-25 15:30 IST - Cascade (Windsurf) - gap analysis + regression tests
- Full gap analysis: cross-referenced specs 03/04/05 against all implementation code
- **Fix**: Added `_guard_episode_done()` so tool calls after the episode ends are now blocked (previously unguarded)
- **Fix**: `ask_question` truncates to 200 chars per spec 03; `05-scenario-design.md` difficulty table synced with code
- **Tests**: Created 4 test modules with 48 unit tests (all green locally): scenarios (reproducibility, field coverage, required keys), grader (parse_plan, rewards), user_simulator (keyword coverage, all-family reachability), plus integration test shells for env + rubrics
- **Infra**: `__init__.py` import-safe, root `conftest.py`, pytest config in `pyproject.toml`
- Decisions locked this session: none
- **Next**: Phase 4 (push to HF Space, write `inference.py`, generate eval set)
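
The two fixes in this entry (the post-episode guard and the 200-character question cap) can be sketched on a toy environment. `MiniEnv` and `EpisodeDoneError` are hypothetical stand-ins for the real server classes; raising an exception is an assumption, since the real environment may instead return an error observation.

```python
from typing import List

MAX_QUESTION_CHARS = 200  # limit quoted in the entry above (spec 03)


class EpisodeDoneError(RuntimeError):
    """Raised when a tool is called after the episode has ended."""


class MiniEnv:
    """Hypothetical stand-in for the real environment class."""

    def __init__(self) -> None:
        self.done = False
        self.questions: List[str] = []

    def _guard_episode_done(self) -> None:
        # Called first by every tool so nothing runs on a finished episode.
        if self.done:
            raise EpisodeDoneError("episode already finished; reset first")

    def ask_question(self, question: str) -> str:
        self._guard_episode_done()
        question = question[:MAX_QUESTION_CHARS]  # truncate, do not reject
        self.questions.append(question)
        return question

    def propose_plan(self, plan: str) -> None:
        self._guard_episode_done()
        self.done = True  # submitting a plan ends the episode
```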
---
## 2026-04-25 13:50 IST - Cascade (Windsurf) - code audit + critical bug fixes
- Full audit of all server modules against design specs (scenarios, user_simulator, rubrics, grader, clarify_environment, app)
- **Critical fix**: Added `step_async()` override; reward/done/step_count were silently broken on the WebSocket client path (training would have gotten no reward signal)
- **Bug fix**: Required keys now always included in profile for medium/hard; hard range adjusted to (6,7) from (8,12) to match field pool sizes
- Added `_patch_obs()` helper + max_steps enforcement in env; oracle now scores 0.887 (was 0.0 due to FormatCheck gate failure)
- Created root `README.md` + `.gitignore`; updated `08-timeline.md` (phases 1-3 done), `09-risks.md` (R7 verified, R8 mitigated), `10-positioning.md`
- Both smoke tests pass: `smoke_env.py` (direct) + `smoke_client.py` (WebSocket) with correct step_count=5 and reward propagation
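
The `step_async()` bug follows a general pattern: when a base class exposes separate sync and async entry points, bookkeeping added only to the sync one is skipped on the async path. The sketch below illustrates that pattern with hypothetical classes; `BaseEnv` is not the OpenEnv base class, whose internals may differ, and the 0.02 reward is a placeholder value.

```python
import asyncio
from typing import Any, Dict


class BaseEnv:
    """Hypothetical base class with independent sync and async steps."""

    def _execute(self, action: str) -> Dict[str, Any]:
        return {"action": action, "reward": None, "done": False}

    def step(self, action: str) -> Dict[str, Any]:
        return self._execute(action)

    async def step_async(self, action: str) -> Dict[str, Any]:
        # Does not route through step(), so a subclass that overrides
        # only step() is bypassed here.
        return self._execute(action)


class RewardEnv(BaseEnv):
    """Applies the same observation patch on both entry points."""

    def __init__(self) -> None:
        self.step_count = 0

    def _patch_obs(self, obs: Dict[str, Any]) -> Dict[str, Any]:
        self.step_count += 1
        obs["reward"] = 0.02  # placeholder per-question reward
        obs["step_count"] = self.step_count
        return obs

    def step(self, action: str) -> Dict[str, Any]:
        return self._patch_obs(super().step(action))

    async def step_async(self, action: str) -> Dict[str, Any]:
        # Without this override the async path would return reward=None.
        return self._patch_obs(await super().step_async(action))
```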
---
## 2026-04-25 12:05 IST - Cascade (Windsurf) - onboarding system + plan audit
- Audited ClarifyRL plan against all 6 official hackathon docs (FAQs, Help Guide, Themes, Travel Guide, Opening-Ceremony PDF, Resources)
- Patched `06-training-plan.md`: HF Jobs `t4-small` primary path, fork TRL `openenv_wordle_grpo.ipynb`, sample-inspection cadence every 50 steps, LoRA save warning
- Patched `08-timeline.md`: Setup phase adds `hf auth login` + `hf jobs hardware`; Phase 5 now "fork wordle notebook"; Phase 6 adds eyeball-4-rollouts step
- Patched `10-positioning.md`: added "Direct alignment with judges' own example ideas" section citing slide 50 ("Realistic customer engagement / frustrated customers" → our `support_triage` family)
- Created multi-session handoff system: `.windsurf/rules/clarify-rl.md` (auto-loaded), `docs/AGENT_ONBOARDING.md` (manual paste), `docs/STATUS.md` (live state), `docs/SESSION_LOG.md` (this file)
- **Decisions locked this session**: HF Jobs is primary compute (was Colab); starter notebook is `openenv_wordle_grpo.ipynb`
- **Next**: implement `server/scenarios.py` per `docs/05-scenario-design.md`
---
## 2026-04-25 ~11:30 IST - Cascade (Windsurf) - positioning sharpening
- Reframed pitch from "personal-assistant booking demo" to "AI safety / hallucination" framing
- Rewrote `00-overview.md` to lead with Air Canada / lawyer / Cursor anchors
- Expanded task families from 5 personal to 3 high-stakes + 2 personal: `coding_requirements`, `medical_intake`, `support_triage`, `meeting_scheduling`, `event_planning`
- Promoted hallucination rate (90% → 3%) to headline metric, demoted plan satisfaction to secondary
- Created `10-positioning.md` with safety/alignment framing for judges
- Updated `01-requirements.md` F2.2 / F6.3 to match new family names + added hallucination acceptance criterion
- **Decisions locked this session**: Wild Card #5 as primary theme pitch; new 5-family split; hallucination as headline metric
---
## Pre-2026-04-25 - design + scaffolding (multiple sessions, summarized)
- Locked idea: ClarifyRL / AskBeforeYouAct (train LLM to ask before acting)
- Created design doc set 00-09 in `clarify-rl/docs/`
- Set up scaffolding: `pyproject.toml`, `Dockerfile`, `openenv.yaml`, `models.py`, `client.py`, `server/__init__.py`
- Confirmed stack: OpenEnv 0.2.2 + MCPEnvironment + Unsloth + TRL GRPO + Qwen2.5-1.5B-Instruct
- Confirmed compute: free Colab T4 + M3 Pro 18GB (later updated to add HF Jobs as primary)
- Confirmed team: Bhole Chature (Anurag + Kanan)
---
## Entry template (copy-paste for new sessions)
```markdown
## YYYY-MM-DD HH:MM IST - <agent / chat tag> - <one-line summary>
- Did: <bullet>
- Did: <bullet>
- Did: <bullet>
- Decisions locked this session: <or "none">
- Next: <what the next agent should pick up>
---
```