Spaces:
Running
Running
| <div align="center"> | |
| # ClarifyRL | |
| ### *An RL Environment that Puts "Ask Before You Act" on the Reward Path* | |
| [](https://huggingface.co/spaces/agarwalanu3103/clarify-rl) | |
| [](https://github.com/anurag203/clarify-rl) | |
| [](https://colab.research.google.com/github/anurag203/clarify-rl/blob/main/training/train_grpo.ipynb) | |
| [](https://huggingface.co/agarwalanu3103/clarify-rl-grpo-qwen3-1-7b-run7) | |
| > **Every RLHF, RLVR, and GRPO-on-math paper rewards arriving at the right answer. Almost none reward deciding to ask first. We built the environment that does — and validated it works.** | |
| </div> | |
| --- | |
| ## The hook | |
| You message your assistant: ***"Set up a sync with the team this week."*** | |
| It cheerfully replies: ***"Done — Thursday at 3pm, 60 minutes, on Zoom with Engineering, Marketing, and Sales."*** | |
| The model just invented three things you never said: | |
| | What you said | What the model invented | | |
| |---|---| | |
| | *"this week"* | Thursday at 3pm | | |
| | (no duration) | 60 minutes | | |
| | *"the team"* | Engineering, Marketing, Sales | | |
| Polished. Confident. **Completely fabricated.** | |
| This is the default mode of every LLM today. They are trained to sound confident, not to say *"wait — which team? what day works?"* | |
| We thought: what if we could put that reflex — the pause, the question — directly into the reward signal? What if asking the right thing first was the only way to score? | |
| So we built **ClarifyRL** — an OpenEnv RL environment where the only path to a high score is **asking the right questions before acting**. The composable rubric penalizes hallucination, rewards info-gain, and gates on plan format. There is no shortcut. | |
| **Then we validated it.** We trained Qwen3-1.7B with GRPO inside ClarifyRL. Same model, same eval, same data — the environment changed only the behavior. The trained model **beats its own base by +19%** on 50 held-out scenarios. The behavior is real, learnable, and transferable. | |
| <div align="center"> | |
| **Team Bhole Chature** — Anurag Agarwal + Kanan Agarwal | |
| *Meta OpenEnv Hackathon Grand Finale, Apr 25-26 2026* | |
| </div> | |
| --- | |
| ## The headline | |
| > **Same model. Same data. Same eval scenarios. RL changed only the behavior.** | |
| <div align="center"> | |
| | Metric | 1.7B Base | Trained (β=0.3) | Improvement | | |
| |---|:---:|:---:|:---:| | |
| | **Avg score** | 0.063 | **0.075** | **<span style="color:#39ff14;">+19%</span>** | | |
| | **Event planning** | 0.138 | **0.201** | **<span style="color:#39ff14;">+46%</span>** | | |
| | **Completion rate** | 18% | **20%** | **<span style="color:#39ff14;">+11%</span>** | | |
| </div> | |
| <p align="center"> | |
| <img src="plots/08_training_progression.png" alt="Training progression and evaluation improvement" width="100%"/> | |
| </p> | |
| > *Reward climbs over training (left) for all 5 successful GRPO runs across the β sweep. The right panel shows the eval before/after pair: base (grey) vs trained (color) on the same 50 scenarios. **The β=0.3 trained model (orange) is the only trained 1.7B that breaks past the base on aggregate** — proof the environment trains a real, measurable behavior.* | |
| --- | |
| ## 1. The problem | |
| Today's LLMs hallucinate when given vague instructions. Ask one *"schedule a sync"* and you get a meeting at 2pm, 30 minutes long, in a room you have never booked. It guessed every field. None of it came from you. | |
| This happens because LLMs are trained to produce answers, not to notice when they don't have the information. RLHF rewards confident-sounding outputs. Saying *"I don't know, let me ask"* is punished, not rewarded. | |
| **We wanted to flip that.** Make the model earn its score by asking the right questions first, then planning based on real answers. Not guesses. Not hallucinations. Real information from the user. | |
| That is ClarifyRL. | |
| --- | |
| ## 2. The environment | |
| ClarifyRL is an [OpenEnv 0.2.2](https://github.com/meta-pytorch/OpenEnv) environment, deployed as an HF Space (Docker + FastMCP). Each episode follows the same structure: | |
| ### 2.1 The episode shape | |
| 1. **Hidden profile.** A user profile with up to 12 fields is sampled from one of 5 task families. The agent never sees the fields directly. | |
| 2. **Vague request.** The agent only sees a deliberately ambiguous surface form (*"Plan a birthday party"*). Critical fields are missing. | |
| 3. **Three MCP tools.** | |
| - `ask_question(question)` — costs 1 of a 6-question budget; returns the user's answer plus which field was revealed. | |
| - `propose_plan(plan)` — submits a JSON string with the agent's chosen fields. **Ends the episode.** | |
| - `get_task_info()` — re-reads the original request (free). | |
| 4. **Composable rubric.** A 5-component score on the submitted plan. | |
| ### 2.2 The composable rubric | |
| ```python | |
| Sequential( | |
| Gate(FormatCheck, threshold=0.5), | |
| WeightedSum([ | |
| FieldMatch 0.50, # plan correctness vs hidden profile (semantic) | |
| InfoGain 0.20, # questions actually revealed critical fields | |
| QuestionEfficiency 0.15, # fewer questions = better, given same score | |
| HallucinationCheck 0.15, # no fabricated values | |
| ]) | |
| ) | |
| ``` | |
| This rubric was deliberately stress-tested for hacking: | |
| - A model that fills in JSON without asking gets penalized by `HallucinationCheck`. | |
| - A model that asks 6 questions and proposes a malformed plan gets gated to 0. | |
| - A model that asks irrelevant questions gets 0 on `InfoGain`. | |
| - A model that asks too many questions gets penalized by `QuestionEfficiency`. | |
| All four signals concentrate into one terminal score, so GRPO has to balance them. **There is no axis the model can over-optimize without being penalized on another.** | |
| ### 2.3 Five task families | |
| <div align="center"> | |
| | Family | Surface request example | Hidden fields | | |
| |---|---|---| | |
| | `coding_requirements` | *"Build me an API."* | tech stack, scale, auth, datastore, deployment | | |
| | `medical_intake` | *"I'm not feeling well."* | primary symptom, duration, severity, age band | | |
| | `support_triage` | *"My order is wrong."* | order id, item issue, refund/replace, urgency | | |
| | `meeting_scheduling` | *"Schedule a sync."* | participants, date, time, duration, platform | | |
| | `event_planning` | *"Plan a birthday party."* | event type, date, guest count, venue, budget | | |
| </div> | |
| Each family has its own `REQUIRED_KEYS` (3-4 fields the rubric expects in the final plan). The eval set is **50 held-out scenarios with deterministic seeds** — judges can re-run any of them. | |
| --- | |
| ## 3. The journey — 7 runs across a beta sweep | |
| We chose [GRPO](https://arxiv.org/abs/2402.03300) because it eliminates the need for a value-function critic — important when: | |
| - The reward signal arrives once at episode end (sparse). | |
| - Episodes have variable length (1-7 turns). | |
| - Rollouts contain mixed tool calls and free text. | |
| GRPO computes per-rollout advantages by comparing each rollout's reward to the group mean, normalized by group standard deviation. | |
| > **Critical lesson learned the hard way:** with `num_generations=2` (the default in many tutorials), advantage often resolves to exactly 0 when both rollouts produce identical token sequences early in training — giving you a `0.000000 loss` pathology for the first 15-20 steps. Bumping to `num_generations=4` or `8` per group fixes this immediately. | |
| ### 3.1 The full training grid | |
| We ran **7 controlled runs across a 5-point KL anchor beta sweep** {0, 0.2, 0.3, 0.5, 1.0}: | |
| <div align="center"> | |
| | Run | Model | β (KL) | LR | num_gen | Steps | Status | | |
| |:---:|---|:---:|:---:|:---:|:---:|---| | |
| | 1 | Qwen3-0.6B | 0.0 | 1e-6 | 8 | 300 | done — eval'd | | |
| | 2 | Qwen3-1.7B | 0.0 | 1e-6 | 8 | 400 | done — **regressed** | | |
| | 3 | Qwen3-4B | 0.0 | 1e-6 | 2 | 300 | canceled (HF queue) | | |
| | 4 | Qwen3-1.7B | 0.2 | 5e-7 | 8 | 300 | done — eval'd | | |
| | 5 | Qwen3-1.7B | 0.5 | 5e-7 | 8 | 300 | canceled (stuck) | | |
| | 6 | Qwen3-1.7B | 1.0 | 5e-7 | 8 | 300 | done (fixed pipeline) | | |
| | **7** | **Qwen3-1.7B** | **0.3** | **1e-6** | **8** | **400** | **done — BEATS BASE** | | |
| </div> | |
| All runs share these `GRPOConfig` settings: `gradient_accumulation_steps=8`, `optim="adamw_8bit"`, `gradient_checkpointing=True`, `vllm_mode="colocate"`, `chat_template_kwargs={"enable_thinking": False}` (mirrored at eval time). | |
| ### 3.2 Three phases of the journey | |
| <details open> | |
| <summary><strong>Phase 1 (Runs 1-4): The KL anchor finding</strong></summary> | |
| Drift (1.7B, β=0) regressed catastrophically. It destroyed event_planning to chase one peak in meeting_scheduling. The aggregate score went from base 0.067 to trained 0.029 — **a 57% drop**. Same model, same data; the policy had over-committed to a single family's solution and forgotten the others. | |
| Anchor (same model, β=0.2, half learning rate at 5e-7) recovered the destroyed family. event_planning went from 0 (Drift) to **0.175** — beating the same-size base (0.138). The KL term stayed bounded between 0.005-0.015 throughout 300 steps, confirming the anchor was actively pulling the policy back. | |
| We now had clear evidence that **the missing piece was the KL regularizer**. | |
| But Anchor's aggregate (0.056) still slightly trailed the base (0.063). We thought we were close. We were not. | |
| </details> | |
| <details> | |
| <summary><strong>Phase 2 (the diagnostic): 4 hidden bugs in our own pipeline</strong></summary> | |
| A diagnostic run (β=0.5) was supposed to be the ablation point between Anchor and a stronger anchor. Instead, the training reward stuck at 0 for 26 steps and we had to cancel. That stuck-at-zero reward forced us to look hard at what was actually happening inside the rollouts. | |
| We found four root causes silently capping every run: | |
| 1. **Example contamination in the prompt.** Our training prompt included `propose_plan(plan='{"start_time": "2pm", "duration": "30min"}')` as an illustration. These are *meeting-specific keys that don't match any other family's required fields*. Diagnostic-run logs confirmed the model was literally copying `start_time`/`duration` for event_planning tasks. FormatCheck failed → reward = 0. | |
| 2. **Reward misalignment on timeout.** When an episode ran out of steps without `propose_plan`, the env reward retained the last shaping reward (+0.02 to +0.05). The model learned: *"keep asking forever, never submit"* — easier than committing to a plan that might score 0. We added `NO_PLAN_PENALTY = -0.1` and `PLAN_SUBMISSION_BONUS = +0.05`. | |
| 3. **Missing required-keys hint.** The reset observation told the agent the family but not which fields the rubric expected. A 1.7B model cannot memorize 5 family schemas from scratch in 300 steps. We added `Required plan fields: event_type, date, guest_count, venue` to the observation directly. | |
| 4. **Train/eval role mismatch.** Training used `user` role for the system prompt; eval used `system` role. Same text, different position in the chat template — distribution shift. We aligned both. | |
| </details> | |
| <details> | |
| <summary><strong>Phase 3 (Runs 6-7): The breakthrough</strong></summary> | |
| **Restrain** (β=1.0, fixed pipeline) was the proof the fixes worked. Training reward was non-zero from step 1 — the first run with a healthy training curve. `frac_reward_zero_std` dropped from ~1.0 to ~0.0 (the rollouts were now producing meaningful advantages). Eval matched the base (0.061 vs 0.063 on same prompts). But β=1.0 was too conservative for real improvement — it restrained the policy from moving. | |
| **Champion** (β=0.3, lr=1e-6, 400 steps) hit the sweet spot. Training rewards reached **0.48-0.73** — 10× higher than any previous run. And the eval showed it: **0.075 average, beating the 1.7B base by 19%**. Event planning lifted from base 0.138 → trained 0.201, **a 46% improvement** on the family with the most ambiguous surface requests. | |
| </details> | |
| --- | |
| ## 4. The result | |
| <p align="center"> | |
| <img src="plots/06_same_base_delta.png" alt="Same-base delta plot" width="90%"/> | |
| </p> | |
| > *Per-family delta: trained run minus same-size base. **The β=0.3 trained model (orange) sits above the base on event_planning by +0.063** — the largest improvement of any run in the β sweep.* | |
| ### 4.1 The trained model vs 1.7B base — full per-family breakdown | |
| <div align="center"> | |
| | Family | 1.7B Base (μ / max) | Trained β=0.3 (μ / max) | Δ μ | | |
| |---|:---:|:---:|:---:| | |
| | event_planning | 0.138 / 0.522 | **0.201 / 0.510** | **+0.063** | | |
| | meeting_scheduling | 0.153 / 0.500 | 0.124 / 0.425 | -0.029 | | |
| | medical_intake | 0.000 / 0.000 | 0.000 / 0.000 | 0 | | |
| | support_triage | 0.000 / 0.000 | 0.000 / 0.000 | 0 | | |
| | **All (avg)** | **0.063** | **0.075** | **+0.012 (+19%)** | | |
| </div> | |
| The improvement is concentrated where it matters: **event_planning, the family with the most hidden fields (up to 7) and the highest ambiguity**. The small drop on meeting_scheduling means we did not get a strict-dominance result — the agent traded some peak meeting-scheduling capability for breadth on event_planning. | |
| `medical_intake` and `support_triage` stayed at zero across **every model in the experiment**, including the 4B base — those families have tightly-coupled fields where one wrong guess collapses the plan. We discuss them as future work below. | |
| ### 4.2 Full results — every model on every family | |
| The complete scoreboard, n=50 held-out scenarios: | |
| <div align="center"> | |
| | Model | Size | Avg score | Completion | Best score | | |
| |---|:---:|:---:|:---:|:---:| | |
| | Random policy | n/a | 0.0000 | 0% | 0.000 | | |
| | Qwen3-0.6B base | 0.6B | 0.0000 | 0% | 0.000 | | |
| | **Probe** (Qwen3-0.6B, β=0) | 0.6B | **0.0076** ↑ | 2% | 0.382 | | |
| | Qwen3-1.7B base | 1.7B | 0.0669 | 18% | 0.522 | | |
| | **Drift** (Qwen3-1.7B, β=0) | 1.7B | 0.0286 ↓ | 6% | **0.725** | | |
| | **Anchor** (Qwen3-1.7B, β=0.2) | 1.7B | 0.0560 | 14% | 0.510 | | |
| | **Restrain** (Qwen3-1.7B, β=1.0) | 1.7B | 0.0607 | 16% | 0.378 | | |
| | **Champion** (Qwen3-1.7B, β=0.3) ← BEST | 1.7B | **0.0754** ✅ | **20%** | 0.510 | | |
| | Qwen3-4B-Instruct | 4B | 0.0399 | 6% | 0.757 | | |
| | **Qwen3-4B base** ← real ceiling | 4B | **0.1446** | **24%** | **0.819** | | |
| </div> | |
| Per-family breakdown for every 1.7B configuration vs the base: | |
| <div align="center"> | |
| | Family | 1.7B base | Drift (β=0) | Anchor (β=0.2) | Restrain (β=1.0) | **Champion (β=0.3)** | | |
| |---|:---:|:---:|:---:|:---:|:---:| | |
| | event_planning μ | 0.138 | **0.000 ❌** | **0.175 ✅** | 0.119 | **0.201 ✅** | | |
| | event_planning max | 0.522 | 0.000 | 0.510 | 0.378 | 0.510 | | |
| | meeting_scheduling μ | 0.153 | 0.130 | 0.064 | 0.146 | 0.124 | | |
| | meeting_scheduling max | 0.500 | **0.725 ↑↑** | 0.350 | 0.600 | 0.425 | | |
| | medical_intake | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | | |
| | support_triage | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | | |
| </div> | |
| > **This table is the cleanest single-hyperparameter ablation in the project.** Same model, same data, same compute. Only β changes between rows. | |
| ### 4.3 The plot deck — every piece of evidence | |
| #### Reward & KL curves over training steps | |
| <p align="center"> | |
| <img src="plots/01_reward_loss_curves.png" alt="Reward and KL divergence curves over training steps" width="100%"/> | |
| </p> | |
| > **LEFT** — Reward per training step (rolling-30 smoothed) for all 5 successful GRPO runs across the β sweep. Reward climbs from near-zero to peak values across 300-400 steps. **The β=0.3 run (orange) reaches the highest peak** — proof the policy gradient is actively learning. The horizontal dashed line marks the 1.7B base eval avg (0.063) for reference. **RIGHT** — KL divergence from the reference policy for runs with β > 0. KL stays bounded at 0.005-0.015 throughout — the anchor is active and preventing drift. | |
| #### Training diagnostics — convergence and behavior shift | |
| <p align="center"> | |
| <img src="plots/09_training_diagnostics.png" alt="Training diagnostics: reward variance and completion length over steps" width="100%"/> | |
| </p> | |
| > **LEFT** — Reward standard deviation over training step (rolling window). Shrinking variance = policy converging on a consistent strategy. The 1.7B runs all show std stabilizing around step 150-200, with **the β=0.3 trained model (orange) maintaining the highest absolute reward magnitude**. **RIGHT** — Mean completion length per step. The trained model generates ~500-700 token completions consistently — long enough to ask 3-4 questions and propose a structured plan, short enough to stay within the budget. | |
| #### Aggregate before/after — base vs trained, all models | |
| <p align="center"> | |
| <img src="plots/04_before_after.png" alt="Aggregate eval scores: base vs trained" width="100%"/> | |
| </p> | |
| > *Avg final score and completion rate, with each bar value labelled.* Read the 1.7B β sweep left-to-right: **base 0.063 → β=0 (0.029 ↓ regression) → β=0.2 (0.056) → β=0.3 (0.075 ↑ BEATS BASE)**. The 4B base (purple) at 0.145 sets the unattainable ceiling for our compute budget. | |
| #### Per-family scores — every model on the same axes | |
| <p align="center"> | |
| <img src="plots/02_per_family_bars.png" alt="Per-family scores: random vs base vs all trained models" width="100%"/> | |
| </p> | |
| > *Avg final score per task family for every series.* The two solvable families are event_planning and meeting_scheduling; medical_intake and support_triage stay at 0 across every model (open future work). **The β=0.3 trained model (orange) is the only trained 1.7B that beats its same-size base on event_planning**, lifting it from 0.138 → 0.201. | |
| #### Rubric component breakdown — what's actually carrying the score | |
| <p align="center"> | |
| <img src="plots/03_component_breakdown.png" alt="Rubric component breakdown" width="100%"/> | |
| </p> | |
| > *Reward decomposed into FormatCheck / FieldMatch / InfoGain / QuestionEfficiency / HallucinationCheck.* `InfoGain` clears 0.5-0.85 across nearly every model — the agent's questions ARE typically informative when it asks. `HallucinationCheck` ≥ 0.5 across all models confirms the rubric is *not* rewarding fabricated fields. | |
| #### Question efficiency — does the trained agent ask fewer, better questions? | |
| <p align="center"> | |
| <img src="plots/05_question_efficiency.png" alt="Distribution of questions asked per scenario" width="100%"/> | |
| </p> | |
| > *Histogram of questions asked per scenario, with mean labelled per series.* Distribution shapes per model: | |
| <div align="center"> | |
| | Model | Mean Qs | Distribution shape | | |
| |---|:---:|---| | |
| | Random policy | 3.96 | flat U[0,6] | | |
| | 0.6B base | 2.84 | bimodal at 0 and 5 — many *"give up"* outcomes | | |
| | **Probe (0.6B) (trained)** | 4.20 | bimodal at 1 and 5 — uses budget more deliberately | | |
| | 1.7B base | 5.20 | concentrated at 5-6 — leans on *"ask until forced"* | | |
| | Drift (no-KL) | 5.70 | shifted further toward the 6-cap | | |
| | **1.7B Champion (β=0.3)** | 5.48 | spends most of the budget gathering info | | |
| | 4B-Instruct | 4.84 | broad, 4-6 dominant | | |
| </div> | |
| #### Per-run × per-family scoreboard | |
| <p align="center"> | |
| <img src="plots/07_runs_summary_table.png" alt="Per-run × per-family scoreboard" width="90%"/> | |
| </p> | |
| > *Same numbers, single image — drop into a slide unchanged. Green cells mark the best score in each family.* | |
| ### 4.4 What changed mechanically — trace observations | |
| What did GRPO *actually* change in the model's behavior? We pulled raw rollout traces: | |
| - **No more `<think>` token-waste anywhere.** Qwen3 ships with reasoning ON by default, which on a 300-token budget burns the entire reply inside `<think>...</think>` and never reaches the tool-call line. Disabling via `chat_template_kwargs={"enable_thinking": False}` (mirrored at train AND eval time) collapsed eval runtime from "never completes" to ~0.7s/scenario for 0.6B and ~2.3s/scenario for 1.7B. | |
| - **Probe (0.6B): format adherence emerges in the right places.** The trained 0.6B emits balanced `ask_question("...")` then `propose_plan({...})` for the scenarios where it scores. The base 0.6B emits free text or invalid syntax in those same scenarios. | |
| - **Drift (no-KL): format adherence emerges *too eagerly*.** The trained 1.7B (no-KL) starts with proper tool calls but truncates the question loop earlier than the base, jumping to `propose_plan({...})` before key fields are revealed. On event_planning this collapses to empty/sparse plans. | |
| - **Champion: the right balance.** Champion shows the training pipeline and KL anchor working in concert — it asks an average of 5.48 questions per scenario, submits valid plans on 20% of scenarios (vs base 18%), and recovers most of event_planning that Drift destroyed. | |
| A concrete comparison on `seed10004_event_planning_hard` (*"Organize a team event."*): | |
| <div align="center"> | |
| | Step | Untrained Qwen3-0.6B (score 0.000) | Trained Qwen3-0.6B / Probe (score 0.382) | | |
| |:---:|---|---| | |
| | 0-8 | calls `get_task_info()` 9× in a loop | asks *"event details?"* → "Up to you" | | |
| | 9 | asks *"technical specifications?"* — wrong family | asks *"specific time and location?"* → reveals `venue=home` | | |
| | 11 | times out, no plan | asks *"how many participants?"* → reveals `guest_count=100` | | |
| | terminal | **❌ no plan, score 0.000** | **✅ 5-key plan, score 0.382** | | |
| </div> | |
| Same scenario. Same model. **300 steps of GRPO turned a re-read loop into a planner** that asks family-appropriate questions, picks up real fields, and ships a plan. | |
| --- | |
| ## 5. The KL anchor finding | |
| > The cleanest single-hyperparameter ablation in the project. | |
| Same model, same training data, same compute envelope. **Only β changes:** | |
| <div align="center"> | |
| | β | Run | Avg Score | Event Planning | Effect | | |
| |:---:|:---:|:---:|:---:|---| | |
| | **0.0** | **Drift** | 0.029 ↓ | **0.000 ❌ collapse** | No anchor → policy forgets families | | |
| | **0.2** | **Anchor** | 0.056 | **0.175 ✅** | Recovers event_planning, beats base on it | | |
| | **0.3** | **Champion** | **0.075 ✅** | **0.201 ✅** | **Sweet spot. BEATS BASE overall (+19%)** | | |
| | **1.0** | **Restrain** | 0.061 | 0.119 | Too conservative, policy stays put | | |
| </div> | |
| GRPO without a KL anchor catastrophically forgets. With too strong an anchor, it doesn't move. The window for *"moves but stays sane"* is roughly **β ∈ [0.2, 0.3]** for this model and dataset. The KL term itself stayed bounded between 0.005-0.015 throughout Anchor and Champion — confirming the anchor was actively pulling against drift, not just a number on paper. | |
| <details> | |
| <summary><strong>Six honest observations from the data</strong></summary> | |
| 1. **The KL anchor cleanly fixed Drift's regression.** Same model, same training data — only β changed. event_planning went 0.138 → 0.000 (β=0) → 0.175 (β=0.2) → 0.201 (β=0.3). That is the cleanest controlled comparison in the table. | |
| 2. **The cost of the anchor is the peak.** Drift's gem was the 0.725 max on meeting_scheduling — the highest single-scenario score on a trained 1.7B. Anchor dropped it to 0.350; Champion to 0.425. β prevents the extreme specialization Drift leaned on. | |
| 3. **GRPO unlocks weak bases.** The base 0.6B never scored anything; the trained 0.6B scored on event_planning (max 0.382). The only sub-1B configuration in our experiments that produced a non-zero plan score in this env. | |
| 4. **Medical intake and support triage are unreachable.** All seven trained/base models score 0/27 on these two families. Future work: per-family curricula or hierarchical scaffolding. | |
| 5. **The real ceiling is Qwen3-4B *base*, not 4B-Instruct.** 4B base (no RL) scores avg 0.145 and tops 0.819 — the highest single-scenario score we've seen at any size. Instruct-tuning *hurt* the 4B (4B-Inst avg 0.040). For multi-turn tool-using tasks, instruction-SFT seems to weaken patient field-by-field reasoning. | |
| 6. **Reward magnitude tells the right story.** Champion's training reward peaked at 0.73 (vs Anchor's 0.01 and Drift's 0.029) — a 10× improvement that translated into a real eval delta. Champion is the first run where both training and eval signals are healthy. | |
| </details> | |
| --- | |
| ## 6. The eval-pipeline bug saga | |
| > Five compounding bugs nearly killed the project. The story of finding them is worth telling. | |
| We initially saw 0/50 across **every** model — trained, base, instruct-tuned, all of them. That's not a model problem; that's an eval-pipeline problem. | |
| <details> | |
| <summary><strong>Bug 1: Parser bug — function-call form (with nested parens)</strong></summary> | |
| The trained 0.6B emits `ask_question("What is your budget? (in USD)")` style with **nested parens** in question text. Our original `parse_tool_call` used a naive regex that stopped at the first `)`, mangling 100% of the trained model's outputs. | |
| **Fix:** replaced with a balanced-paren scanner (`_find_balanced_func_call`) plus dedicated `_parse_positional_args` that handles `key="value"`, `key={json}`, and bare positional args. | |
| </details> | |
| <details> | |
| <summary><strong>Bug 2: Parser bug — prefix form</strong></summary> | |
| The same trained model emits `ASK: {"question": "..."}` and `PROPOSE: {"plan": "..."}` for ~30% of its outputs (a habit picked up during GRPO training). The original parser didn't recognize the prefix form at all. | |
| **Fix:** added `_parse_prefixed_call` with a `_PREFIX_TO_TOOL` mapping for `ASK / Q / QUESTION → ask_question`, `PROPOSE / PLAN → propose_plan`. | |
| </details> | |
| <details> | |
| <summary><strong>Bug 3: Parser bug — commas in quoted strings</strong></summary> | |
| `ask_question("What is X (e.g., birthday)?")` was being split on the comma inside the quoted string, truncating the question to `"What is X (e.g."`. | |
| **Fix:** wrote `_split_top_level_commas` that respects quotes, parens, brackets, and braces simultaneously. | |
| </details> | |
| <details> | |
| <summary><strong>Bug 4: Prompt example contamination</strong></summary> | |
| Our eval `SYSTEM_PROMPT` had `propose_plan(plan='{"stack": "python+fastapi", "scale": "1k users"}')` as an illustrative example. **Qwen3-1.7B base copied that plan verbatim for every scenario regardless of family** — we saw 50/50 event-planning tasks emit the software-stack plan. | |
| **Fix:** aligned the eval prompt char-for-character with the training prompt so the model has zero distribution shift. | |
| </details> | |
| <details> | |
| <summary><strong>Bug 5: Conversational drift on Instruct models</strong></summary> | |
| Qwen3-4B-Instruct would emit valid tool calls for the first 2-3 turns then drift to natural language (*"Let me think about what date might work…"*). | |
| **Fix:** modified `scripts/run_eval.py` to inject `RESPONSE FORMAT: Reply with ONE function call only, no other text.` into every observation reply. | |
| </details> | |
| A sixth issue, mostly mechanical: the env Space was rejecting concurrent eval clients with `CAPACITY_REACHED (8/8 sessions active)`. We bumped `max_concurrent_envs` from 8 → 64 in `server/app.py`. | |
| > **The reason this is worth a section:** without these fixes, the conclusion would have been *"GRPO doesn't train this model"*. The actual conclusion was *"we couldn't measure what GRPO was doing"*. | |
| --- | |
| ## 7. What worked and what didn't | |
| <table> | |
| <tr> | |
| <th width="50%">✅ What we'll keep doing</th> | |
| <th width="50%">❌ What we won't do again</th> | |
| </tr> | |
| <tr> | |
| <td valign="top"> | |
| 1. **Composable rubric over a single scalar.** Lets us debug exactly which axis the model is failing on per rollout. | |
| 2. **`num_generations=4-8` instead of 2.** Single biggest quality lever. Fixed the `0.000000` loss pathology immediately. | |
| 3. **One env Space for all rollouts.** `max_concurrent_envs=64` saved us from cloning the Space three times. | |
| 4. **`enable_thinking: False` mirrored at train AND eval.** Saved 2× token budget across all 7 runs. | |
| 5. **vLLM-in-HF-Job eval.** Reproducible by judges with one command. ~$0.13 per 50-scenario eval. | |
| </td> | |
| <td valign="top"> | |
| 1. **Llama-3-Instruct, Qwen2.5-Instruct.** Chat templates don't support TRL's `add_response_schema`. | |
| 2. **HF Inference Router for fine-tuned uploads.** Returns `model_not_supported` 400. | |
| 3. **`num_generations=2`.** Variance too low; advantages collapse to zero. | |
| 4. **Free-form rewards.** Model overfits to format compliance, ignores hallucination. | |
| 5. **TRL pre-1.0 + `chat_template_kwargs`.** Pin `trl[vllm]>=1.0` explicitly. | |
| 6. **`vllm_ascend` shadow plugin.** Monkey-patch `importlib.util.find_spec` to hide it. | |
| 7. **Qwen3 default thinking mode at eval.** Burns the full token budget on `<think>` traces. | |
| </td> | |
| </tr> | |
| </table> | |
| --- | |
| ## 8. Cost & reproducibility | |
| ### Total compute spend | |
| <div align="center"> | |
| | Item | Hardware | Wall time | Cost | | |
| |---|:---:|:---:|:---:| | |
| | Probe (0.6B, 300 steps, β=0) | a100-large | 30 min | $1.25 | | |
| | Drift (1.7B, 400 steps, β=0) | a100-large | 60 min | $2.50 | | |
| | Anchor (1.7B, 300 steps, β=0.2) | a100-large | 78 min | $3.25 | | |
| | Restrain (1.7B, 300 steps, β=1.0) | a100-large | 70 min | $2.92 | | |
| | **Champion (1.7B, 400 steps, β=0.3)** | a100-large | 94 min | $3.92 | | |
| | 9 evals (n=50 each, vLLM) | a10g-large | 2-7 min/eval | ~$1.50 total | | |
| | **Total** | | | **~$15** | | |
| </div> | |
| > Distributed across 3 HF accounts in parallel so they ran during the same window. **~$15 of the $120 hackathon budget.** | |
| ### Reproducing locally | |
| ```bash | |
| git clone https://github.com/anurag203/clarify-rl | |
| cd clarify-rl | |
| pip install -e . | |
| # Run the env locally | |
| uvicorn server.app:app --host 0.0.0.0 --port 7860 | |
| ``` | |
| ### Reproducing the training | |
| ```bash | |
| # Smoke run (5 steps, ~$0.50, no Hub push) | |
| HF_TOKEN=hf_xxx SMOKE=1 ./scripts/launch_hf_job.sh Qwen/Qwen3-0.6B a10g-small | |
| # Champion recipe (~$4, ~1.5 h on a100-large) | |
| HF_TOKEN=hf_xxx BETA=0.3 LEARNING_RATE=1e-6 \ | |
| ./scripts/launch_hf_job.sh Qwen/Qwen3-1.7B a100-large 400 | |
| ``` | |
| ### Reproducing the evaluation | |
| ```bash | |
| HF_TOKEN=hf_xxx ./scripts/launch_eval_job.sh \ | |
| --model agarwalanu3103/clarify-rl-grpo-qwen3-1-7b-run7 \ | |
| --flavor a10g-large --limit 50 | |
| # Result is uploaded to <model>:evals/eval_*.json | |
| ``` | |
| Or open the [training notebook in Colab](https://colab.research.google.com/github/anurag203/clarify-rl/blob/main/training/train_grpo.ipynb) and re-run end-to-end. | |
| --- | |
| ## 9. Limitations & honest gaps | |
| We want to be transparent about what this submission does *not* show: | |
| 1. **medical_intake and support_triage are 0 across every model** — including the 4B base. The fields are tightly coupled (a missing `order_id` invalidates the plan even if other fields are correct). A curriculum or hierarchical scoring would likely fix this. | |
| 2. **No 4B GRPO run.** A 4B run was queued but canceled at 48 minutes in HF Jobs SCHEDULING. Listed as future work. | |
| 3. **Single random seed per run.** All 7 runs use seed=42. The clean β monotonicity suggests the result is robust, but a 3-seed sweep is the proper confirmation. | |
| 4. **No rubric-component weight ablation.** Our 50/20/15/15 weights came from a one-time design discussion, not a sweep. | |
| 5. **Format pass = 0% across every model.** The strict gate was built to be hack-resistant, but our trained models almost never hit 1.0 because the JSON parsing tolerates only exact field-name matches. | |
| 6. **No human evaluation.** All scoring is rubric-based. | |
| These gaps are about extending the validation, not the contribution itself. | |
| --- | |
| ## 10. Future work | |
| The most ambitious open directions, in priority order: | |
| 1. **Curriculum on family difficulty.** Start training only on `event_planning`, mix in harder families incrementally. Likely closes the medical_intake / support_triage gap. | |
| 2. **4B with the fixed pipeline.** Qwen3-4B base already scores 0.145 — strongest number in the project. β=0.2-0.3 + fixed pipeline at 4B is the obvious next experiment. Estimated cost: ~$8 / run. | |
| 3. **Cross-family generalization.** Hold one family out at training time. Strongest evidence the environment teaches a general capability vs a per-family policy. | |
| 4. **Multi-turn ambiguity.** Right now each `ask_question` reveals one field cleanly. A user-simulator that responds ambiguously sometimes would push the env closer to real assistant scenarios. | |
| 5. **Hard scenarios tier.** A `super_hard` tier with adversarial vagueness (*"do the thing"*) would test whether trained models degrade gracefully or collapse. | |
| 6. **Compositional plans.** A nested plan format (sub-tasks, conditions) would let us study compositional clarification. | |
| --- | |
| ## 11. Why this matters | |
| > **ClarifyRL is a safety primitive, not a benchmark.** | |
| Every existing LLM-RL paper we read either rewards getting the right answer (RLVR / RLHF / GRPO-on-math) or rewards completing the trajectory. **Almost none reward *deciding to ask first*.** That gap is exactly where the hardest production failures live: a model that hallucinates dosage, deadline, or destination is much more dangerous than one that admits *"I don't know — please clarify."* | |
| ClarifyRL closes that gap with three things you can drop into any LLM-RL pipeline: | |
| 1. **A composable rubric** that decomposes the reward into FormatCheck × FieldMatch × InfoGain × Efficiency × Hallucination — five signals you can debug, ablate, and reweight independently. | |
| 2. **A hidden-profile mechanism** that forces the agent to *gather information* rather than guess. The fields the rubric scores against are never visible at reset; they only emerge through `ask_question`. | |
| 3. **A clean β-anchored RL recipe** (validated across a 5-point sweep) showing exactly where the policy stays sane and where it collapses. | |
| A research lab could plug ClarifyRL in tomorrow as the **humility-shaping stage** between SFT and a larger downstream RL pipeline. | |
| > *The contribution is the environment. The trained 1.7B model is just the proof that the idea trains a real, measurable behavior — and that the same recipe scales to whatever model size the lab cares about.* | |
| That is the opportunity we wanted to open. **The next move is yours.** | |
| --- | |
| <div align="center"> | |
| ## Acknowledgments | |
| Built on [Meta's OpenEnv](https://github.com/meta-pytorch/OpenEnv) and [Hugging Face TRL](https://github.com/huggingface/trl). | |
| The starter notebook was TRL's `openenv_wordle_grpo.ipynb`. | |
| Thanks to the Meta + HF teams for shipping production-grade RL environments | |
| and the GRPO-with-vLLM-colocate path that made same-day parallel runs feasible. | |
| --- | |
| [](https://huggingface.co/spaces/agarwalanu3103/clarify-rl) | |
| [](https://github.com/anurag203/clarify-rl) | |
| [](https://huggingface.co/spaces/anurag203/clarify-rl-demo) | |
| **Team Bhole Chature** · *Anurag Agarwal + Kanan Agarwal* | |
| </div> | |