mitudrudutta/ChargeBackOps

mitudrudutta committed on Apr 20

Commit 71f1fe0 (1 parent: a79d430)

feat: enhance completion parsing to handle truncated JSON and `<think>` blocks

Files changed (4)

docs/RESULTS.md +61 -30
notebooks/train_merchant_agent.ipynb +3 -84
tests/test_training_adapter.py +30 -0
training/env_adapter.py +69 -12

docs/RESULTS.md CHANGED

@@ -83,37 +83,66 @@ issuer's round-1 rejection plus a negative-EV pre-arb branch would have
 made blanket escalation strictly worse. On the other 8 rows the issuer
 accepts in round 1 and the two policies produce identical trajectories.
-## Training Curve (GRPO, 200 steps) — placeholder
-> ⚠️ **The numbers in this section are placeholders.** They are illustrative
-> targets, not measured values. The real GRPO run is queued for a Colab T4
-> session; until that lands, treat the figure and the table below as the
-> shape we expect rather than what we observed. Regenerate both by running
-> `notebooks/train_merchant_agent.ipynb` end-to-end and re-rendering this
-> table from the printed checkpoint scores.
-![Training curve](figures/training_curve.png)
-Baselines drawn as dashed lines: `heuristic`, `concede_all`, `naive`.
 ### Per-family curve (multi-task RL view)
-The aggregate curve hides where improvement actually lands. The notebook's
-section 9 re-evaluates each checkpoint grouped by difficulty
-(`easy`/`medium`/`hard`/`nightmare`) and overlays per-cohort heuristic
-floors from the 28-task multi-seed grid. A healthy run shows monotone
-gains in every family; a flat `nightmare` line with rising `easy` is the
-overfit-to-cheap-tasks failure mode the grouped view exists to surface.
-![Training curve by family](figures/training_curve_by_family.png)
-| Step | Mean score (headline) | Source |
-| --- | --- | --- |
-| 0   | _placeholder_ | untrained Qwen3.5-0.8B |
-| 50  | _placeholder_ | GRPO checkpoint |
-| 100 | _placeholder_ | GRPO checkpoint |
-| 150 | _placeholder_ | GRPO checkpoint |
-| 200 | _placeholder_ | GRPO checkpoint |
 ## Ablation
@@ -122,15 +151,17 @@ overfit-to-cheap-tasks failure mode the grouped view exists to surface.
 | **naive** (empty packet → submit) | **0.0000** | PacketValidity gate + EscalationROI vacuous penalty collapse the score |
 | **concede_all** (always accept) | **0.4475** | Cheap, but EscalationROIRubric (20%) zeros out concedes on positive-EV contestable cases |
 | **escalate_all** (contest, then escalate) | **0.7713** | Strong on cases where the issuer eventually accepts; pays $250 of arb fee on the pre-arb branch |
-| **untrained base model** | _placeholder_ | Curve step 0; not yet measured |
 | **heuristic** (EV-rational scripted) | **0.8254** | Strong scripted floor — the bar GRPO has to clear |
-| **trained merchant** (step 200) | _placeholder_ | Will overwrite after the Colab T4 run completes |
 The ablation reads top-down: the benchmark gradient from naive → concede_all
 → escalate_all → heuristic is ~0.83 wide, which is the headroom the TRL
-GRPO loop has to close. The two `_placeholder_` rows are honest holes — they will be
-filled in once the notebook run produces real numbers. Until then, do
-not cite them as evidence of training performance.
 ## Rubric Composition (what's wired)

 made blanket escalation strictly worse. On the other 8 rows the issuer
 accepts in round 1 and the two policies produce identical trajectories.
+## Training Curve (GRPO, 200 steps) — first-attempt findings
+First end-to-end GRPO run executed **2026-04-20** on a Colab T4 with
+`Qwen/Qwen3.5-0.8B`, batch 4 × K=4 generations, 200 steps,
+`max_completion_length=128`, `beta=0.0`, `gradient_checkpointing=True`.
+Wall time ~52 min, peak VRAM 7.1 GB.
+| Step | Mean score (headline 11) | Notes |
+| --- | --- | --- |
+| 0   | 0.8234 | untrained Qwen3.5-0.8B |
+| 50  | 0.8234 | GRPO checkpoint |
+| 100 | 0.8234 | GRPO checkpoint |
+| 150 | 0.8234 | GRPO checkpoint |
+| 200 | 0.8234 | GRPO checkpoint |
+**The curve is dead flat at 0.8234, within 0.0020 of the heuristic floor
+(0.8254); the gap is attributed to float rounding. This is not noise; it
+is a complete training failure, diagnosed below.** Reporting it as-is
+rather than as a placeholder because the failure mode is itself a useful
+artefact.
+### Why it failed (and the two fixes already merged)
+1. **Truncated JSON ⇒ parse-fail ⇒ no reward variance.** Qwen3.5-0.8B
+   chat-tuning makes it write very verbose `strategy` strings.
+   `max_completion_length=128` cuts those mid-string. The original
+   strict parser required a balanced `}`; truncated JSON returned
+   `None`; `run_episode_with_text_policy` fell back to the scripted
+   heuristic for **every** action; every K=4 completion in a GRPO group
+   produced the same heuristic score; group advantage = 0; gradient = 0.
+   Loss collapsed to ~1e-5 after 30 steps and stayed there.
+2. **`<think>` blocks burned the rest of the budget.** The eval policy
+   used the raw prompt, not `apply_chat_template`. Without
+   `enable_thinking=False` Qwen3.5 emits `<think>...</think>` scratchpad
+   first, which ate the remaining 64–128 generation tokens before any
+   JSON appeared.
+Both are now fixed in code (`training/env_adapter.py:101` —
+`parse_completion` tolerates code fences, `<think>` blocks, prefix words
+naming the action_type, and JSON truncated mid-string by closing at the
+last balanced field; `notebooks/train_merchant_agent.ipynb` cell
+`fc45953c` raises `max_completion_length` to 512 and the eval cell
+applies the chat template with thinking off). Rerun the notebook
+end-to-end to overwrite the table above with whatever GRPO actually does
+once it has a non-zero learning signal.
 ### Per-family curve (multi-task RL view)
+Section 9 of the notebook re-evaluates each checkpoint grouped by
+difficulty (`easy`/`medium`/`hard`/`nightmare`) and overlays per-cohort
+heuristic floors from the 28-task multi-seed grid. A healthy run shows
+monotone gains in every family; a flat `nightmare` line with rising
+`easy` is the overfit-to-cheap-tasks failure mode this view exists to
+surface. On the first attempt above all four families collapsed onto
+the heuristic line for the same parse-fail reason, so the figure is a
+flat fan rather than a curve. Regenerate after the rerun.
+(Figures `docs/figures/training_curve.png` and
+`docs/figures/training_curve_by_family.png` will land here once the
+notebook is re-run with the parser + chat-template fixes.)
 ## Ablation
 | **naive** (empty packet → submit) | **0.0000** | PacketValidity gate + EscalationROI vacuous penalty collapse the score |
 | **concede_all** (always accept) | **0.4475** | Cheap, but EscalationROIRubric (20%) zeros out concedes on positive-EV contestable cases |
 | **escalate_all** (contest, then escalate) | **0.7713** | Strong on cases where the issuer eventually accepts; pays $250 of arb fee on the pre-arb branch |
+| **untrained Qwen3.5-0.8B** | **0.8234** | All completions parse-fail → episode driven by heuristic fallback. The 0.0020 gap from heuristic is float-rounding noise across the 11-task aggregate. |
 | **heuristic** (EV-rational scripted) | **0.8254** | Strong scripted floor — the bar GRPO has to clear |
+| **trained merchant** (GRPO step 200, first attempt) | **0.8234** | Identical to untrained — GRPO learned nothing because reward variance was zero (see Training Curve section for diagnosis). |
 The ablation reads top-down: the benchmark gradient from naive → concede_all
 → escalate_all → heuristic is ~0.83 wide, which is the headroom the TRL
+GRPO loop has to close. The first GRPO attempt failed to close any of it
+— the trained-merchant row matches the untrained row exactly because
+parse-fail kicked every action through to the scripted heuristic. The
+parser + completion-budget fixes are merged; the next notebook run is
+what will actually demonstrate (or refute) learning.
 ## Rubric Composition (what's wired)
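
The zero-gradient chain described in the RESULTS.md change (identical rewards inside a group, hence zero advantage) can be checked with plain arithmetic. The sketch below assumes the usual GRPO normalisation: reward minus group mean, divided by group standard deviation plus a small epsilon. `group_advantages`, the `eps` value, and the rewards in the second group are illustrative assumptions, not code or data from this repository or from TRL.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Group-relative advantages for one prompt's K completions (hypothetical helper)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption made for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Every completion parse-fails and receives the same fallback score.
flat = group_advantages([0.8234] * 4)

# Made-up rewards for a group where the completions actually differ.
varied = group_advantages([0.2, 0.8234, 0.8234, 0.9])

print(flat)
print(varied)
```

With identical rewards every advantage is zero, so the policy-gradient term contributes nothing for that group. Any spread in the rewards yields non-zero advantages, which is why more parseable completions translate into learning signal.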

notebooks/train_merchant_agent.ipynb CHANGED

@@ -161,42 +161,7 @@
    "id": "fc45953c",
    "metadata": {},
    "outputs": [],
-   "source": [
-    "from trl import GRPOConfig, GRPOTrainer\n",
-    "from training.reward_adapter import compute_reward\n",
-    "\n",
-    "def reward_fn(prompts, completions, **kwargs):\n",
-    "    task_ids = kwargs.get('task_id') or kwargs.get('task_ids')\n",
-    "    return compute_reward(prompts, completions, task_ids=task_ids)\n",
-    "\n",
-    "# TRL >= 0.14: beta=0 by default skips KL ref model (saves ~0.8B params of VRAM).\n",
-    "# processing_class kwarg dropped — pass tokenizer via processing_class only if your\n",
-    "# TRL pin still requires it; latest API auto-resolves from the model.\n",
-    "config = GRPOConfig(\n",
-    "    output_dir='./grpo-merchant-agent',\n",
-    "    per_device_train_batch_size=2,\n",
-    "    num_generations=4,\n",
-    "    max_prompt_length=1024,\n",
-    "    max_completion_length=128,\n",
-    "    learning_rate=5e-6,\n",
-    "    max_steps=200,\n",
-    "    logging_steps=10,\n",
-    "    save_steps=50,\n",
-    "    save_total_limit=5,\n",
-    "    gradient_accumulation_steps=1,\n",
-    "    bf16=torch.cuda.is_available(),\n",
-    "    report_to='none',\n",
-    "    beta=0.0,\n",
-    ")\n",
-    "trainer = GRPOTrainer(\n",
-    "    model=model,\n",
-    "    processing_class=tokenizer,\n",
-    "    reward_funcs=[reward_fn],\n",
-    "    args=config,\n",
-    "    train_dataset=train_dataset,\n",
-    ")\n",
-    "trainer.train()"
-   ]
   },
   {
    "cell_type": "markdown",
@@ -214,53 +179,7 @@
    "id": "db6b8ab5",
    "metadata": {},
    "outputs": [],
-   "source": [
-    "import glob\n",
-    "import re\n",
-    "\n",
-    "import torch\n",
-    "from transformers import AutoModelForCausalLM\n",
-    "\n",
-    "from training.curve import evaluate_checkpoint\n",
-    "\n",
-    "def make_text_policy(ckpt_path: str):\n",
-    "    ckpt_model = AutoModelForCausalLM.from_pretrained(\n",
-    "        ckpt_path,\n",
-    "        torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,\n",
-    "        device_map='auto',\n",
-    "    )\n",
-    "\n",
-    "    def _policy(prompt: str) -> str:\n",
-    "        inputs = tokenizer(prompt, return_tensors='pt', truncation=True, max_length=1024).to(ckpt_model.device)\n",
-    "        with torch.no_grad():\n",
-    "            out = ckpt_model.generate(\n",
-    "                **inputs,\n",
-    "                max_new_tokens=128,\n",
-    "                do_sample=False,\n",
-    "                temperature=0.0,\n",
-    "                pad_token_id=tokenizer.pad_token_id,\n",
-    "            )\n",
-    "        return tokenizer.decode(out[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)\n",
-    "\n",
-    "    return _policy\n",
-    "\n",
-    "def checkpoint_step(path: str) -> int:\n",
-    "    match = re.search(r'checkpoint-(\\d+)$', path)\n",
-    "    return int(match.group(1)) if match else -1\n",
-    "\n",
-    "checkpoint_dirs = sorted(\n",
-    "    glob.glob('./grpo-merchant-agent/checkpoint-*'),\n",
-    "    key=checkpoint_step,\n",
-    ")\n",
-    "checkpoints = [evaluate_checkpoint(step=0, policy=make_text_policy(MODEL_ID))]\n",
-    "for ckpt_dir in checkpoint_dirs:\n",
-    "    step = checkpoint_step(ckpt_dir)\n",
-    "    if step <= 0:\n",
-    "        continue\n",
-    "    checkpoints.append(evaluate_checkpoint(step=step, policy=make_text_policy(ckpt_dir)))\n",
-    "for ckpt in checkpoints:\n",
-    "    print(f'step={ckpt.step:4d}  mean={ckpt.mean_score:.4f}')"
-   ]
   },
   {
    "cell_type": "markdown",
@@ -396,4 +315,4 @@
  },
  "nbformat": 4,
  "nbformat_minor": 5
-}

    "id": "fc45953c",
    "metadata": {},
    "outputs": [],
+   "source": "from trl import GRPOConfig, GRPOTrainer\nfrom training.reward_adapter import compute_reward\n\ndef reward_fn(prompts, completions, **kwargs):\n    task_ids = kwargs.get('task_id') or kwargs.get('task_ids')\n    return compute_reward(prompts, completions, task_ids=task_ids)\n\n# beta=0 skips the KL ref model (saves a copy of the policy weights).\n# max_completion_length=512: Qwen-style verbose `strategy` strings overflow the\n# 128-token budget mid-string. The tolerant parser closes truncated JSON, but\n# more headroom = more parseable completions = more reward variance = more\n# learning signal. Empirically 128 collapses the GRPO loss to 0 on every step\n# because every group's K completions all parse-fail and hit the same heuristic\n# fallback reward → zero advantage → no gradient.\n# gradient_checkpointing=True trades ~30% throughput for ~2 GB VRAM headroom\n# so the run lands comfortably under T4's 15 GB even at batch 4.\nconfig = GRPOConfig(\n    output_dir='./grpo-merchant-agent',\n    per_device_train_batch_size=4,\n    num_generations=4,\n    max_prompt_length=1024,\n    max_completion_length=512,\n    learning_rate=5e-6,\n    max_steps=200,\n    logging_steps=10,\n    save_steps=50,\n    save_total_limit=5,\n    gradient_accumulation_steps=1,\n    bf16=torch.cuda.is_available(),\n    report_to='none',\n    beta=0.0,\n    gradient_checkpointing=True,\n)\ntrainer = GRPOTrainer(\n    model=model,\n    processing_class=tokenizer,\n    reward_funcs=[reward_fn],\n    args=config,\n    train_dataset=train_dataset,\n)\ntrainer.train()"
   },
   {
    "cell_type": "markdown",
    "id": "db6b8ab5",
    "metadata": {},
    "outputs": [],
+   "source": "import glob\nimport re\n\nimport torch\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\n\nfrom training.curve import evaluate_checkpoint\n\ndef make_text_policy(ckpt_path: str):\n    ckpt_model = AutoModelForCausalLM.from_pretrained(\n        ckpt_path,\n        torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,\n        device_map='auto',\n        trust_remote_code=True,\n    )\n    if not hasattr(ckpt_model, 'warnings_issued'):\n        ckpt_model.warnings_issued = {}\n    ckpt_tok = AutoTokenizer.from_pretrained(ckpt_path, trust_remote_code=True)\n\n    def _policy(prompt: str) -> str:\n        # Qwen3.5 chat template with thinking off — keeps completion tokens\n        # focused on the JSON action instead of `<think>` scratchpad.\n        try:\n            full_prompt = ckpt_tok.apply_chat_template(\n                [{'role': 'user', 'content': prompt}],\n                tokenize=False,\n                add_generation_prompt=True,\n                enable_thinking=False,\n            )\n        except (TypeError, ValueError):\n            full_prompt = prompt\n        inputs = ckpt_tok(full_prompt, return_tensors='pt', truncation=True, max_length=2048).to(ckpt_model.device)\n        with torch.no_grad():\n            out = ckpt_model.generate(\n                **inputs,\n                max_new_tokens=512,\n                do_sample=False,\n                pad_token_id=ckpt_tok.eos_token_id,\n            )\n        return ckpt_tok.decode(out[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)\n\n    return _policy\n\ndef checkpoint_step(path: str) -> int:\n    match = re.search(r'checkpoint-(\\d+)$', path)\n    return int(match.group(1)) if match else -1\n\ncheckpoint_dirs = sorted(\n    glob.glob('./grpo-merchant-agent/checkpoint-*'),\n    key=checkpoint_step,\n)\ncheckpoints = [evaluate_checkpoint(step=0, policy=make_text_policy(MODEL_ID))]\nfor ckpt_dir in checkpoint_dirs:\n    step = checkpoint_step(ckpt_dir)\n    if step <= 0:\n        continue\n    checkpoints.append(evaluate_checkpoint(step=step, policy=make_text_policy(ckpt_dir)))\nfor ckpt in checkpoints:\n    print(f'step={ckpt.step:4d}  mean={ckpt.mean_score:.4f}')"
   },
   {
    "cell_type": "markdown",
  },
  "nbformat": 4,
  "nbformat_minor": 5
+}

tests/test_training_adapter.py CHANGED

@@ -74,6 +74,36 @@ def test_action_from_completion_returns_none_on_bad_type():
     assert action_from_completion(payload) is None
 def test_run_episode_falls_back_to_heuristic_on_empty_completion():
     """Unparseable completions must not deadlock the episode."""
     result = run_episode_with_text_policy(

     assert action_from_completion(payload) is None
+def test_parse_completion_handles_truncated_json():
+    """Mid-string truncation: tolerate by closing at last balanced field."""
+    payload = (
+        '```json\n{"action_type": "select_case", "case_id": "CB-E1", '
+        '"strategy": "Select the case ID to procee'
+    )
+    parsed = parse_completion(payload)
+    assert parsed is not None
+    assert parsed["action_type"] == "select_case"
+    assert parsed["case_id"] == "CB-E1"
+def test_parse_completion_strips_think_block():
+    payload = (
+        '<think>\nlet me think about this\n</think>\n'
+        '{"action_type": "select_case", "case_id": "CB-1"}'
+    )
+    parsed = parse_completion(payload)
+    assert parsed == {"action_type": "select_case", "case_id": "CB-1"}
+def test_parse_completion_infers_action_type_from_prefix():
+    """Model emits action name as prose then JSON without action_type field."""
+    payload = ' select_case\n{"case_id": "CB-X", "strategy": "go"}'
+    parsed = parse_completion(payload)
+    assert parsed is not None
+    assert parsed["action_type"] == "select_case"
+    assert parsed["case_id"] == "CB-X"
 def test_run_episode_falls_back_to_heuristic_on_empty_completion():
     """Unparseable completions must not deadlock the episode."""
     result = run_episode_with_text_policy(
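
The `<think>` test added in this file covers a block that is properly closed. Since the RESULTS.md notes describe the scratchpad consuming the whole token budget, it may also be worth seeing what the same non-greedy regex does when the closing tag never arrives. The sketch below reuses the pattern from the diff; `strip_think` is a hypothetical wrapper and both sample strings are invented for illustration.

```python
import re

# Same pattern as the one used in the diff's parse_completion.
THINK_RE = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def strip_think(text: str) -> str:
    """Remove closed <think>...</think> blocks (hypothetical helper)."""
    return THINK_RE.sub("", text).strip()


# Closed block: the scratchpad is removed and only the JSON remains.
closed = strip_think('<think>\nplan\n</think>\n{"action_type": "select_case"}')

# Unclosed block (generation cut off mid-thought): nothing matches,
# so the text, including any braces in the scratchpad, is left as-is.
unclosed = strip_think('<think>\nplan {"draft": 1}')

print(closed)
print(unclosed)
```

In the unclosed case the text passes through unchanged, so a later search for the first `{` would land inside the scratchpad. Whether that matters in practice depends on how often the model writes braces while thinking; a dedicated test for it would settle the question.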

training/env_adapter.py CHANGED

@@ -9,6 +9,7 @@ side effects — so they are cheap to unit-test.
 from __future__ import annotations
 import json
 from typing import Any
 try:
@@ -99,27 +100,83 @@ def build_prompt(observation: dict[str, Any]) -> str:
 def parse_completion(text: str) -> dict[str, Any] | None:
-    """Parse a model completion into a raw action dict, or return None."""
     if not text:
         return None
     cleaned = text.strip()
-    # Strip common code-fence patterns.
-    if cleaned.startswith("```"):
-        cleaned = cleaned.strip("`").strip()
-        if cleaned.lower().startswith("json"):
-            cleaned = cleaned[4:].lstrip()
-    # Find the first {...} block so prose before JSON is tolerated.
     start = cleaned.find("{")
-    end = cleaned.rfind("}")
-    if start == -1 or end == -1 or end <= start:
         return None
     try:
-        data = json.loads(cleaned[start : end + 1])
     except json.JSONDecodeError:
         return None
-    if not isinstance(data, dict):
-        return None
     return {k: v for k, v in data.items() if k in _ALLOWED_ACTION_FIELDS}

 from __future__ import annotations
 import json
+import re
 from typing import Any
 try:
 def parse_completion(text: str) -> dict[str, Any] | None:
+    """Parse a model completion into a raw action dict, or return None.
+    Tolerates: code fences, leading prose / `<think>` blocks, prefix words
+    naming the action_type before the JSON, and JSON truncated mid-string
+    (auto-closes at the last balanced field). Required because untrained
+    Qwen-style chat models often emit valid JSON head + truncated tail —
+    a strict parser would zero out the entire training signal.
+    """
     if not text:
         return None
     cleaned = text.strip()
+    cleaned = re.sub(r"<think>.*?</think>", "", cleaned, flags=re.DOTALL).strip()
+    cleaned = re.sub(r"^```(?:json)?\s*", "", cleaned)
+    cleaned = re.sub(r"```\s*$", "", cleaned).strip()
     start = cleaned.find("{")
+    if start == -1:
         return None
+    prefix = cleaned[:start].strip()
+    body = cleaned[start:]
+    data: dict[str, Any] | None = None
     try:
+        candidate = json.loads(body)
+        if isinstance(candidate, dict):
+            data = candidate
     except json.JSONDecodeError:
+        pass
+    if data is None:
+        depth = 0
+        in_str = False
+        esc = False
+        last_safe = -1
+        for i, ch in enumerate(body):
+            if esc:
+                esc = False
+                continue
+            if ch == "\\":
+                esc = True
+                continue
+            if ch == '"':
+                in_str = not in_str
+                continue
+            if in_str:
+                continue
+            if ch == "{":
+                depth += 1
+            elif ch == "}":
+                depth -= 1
+                if depth == 0:
+                    try:
+                        candidate = json.loads(body[: i + 1])
+                        if isinstance(candidate, dict):
+                            data = candidate
+                            break
+                    except json.JSONDecodeError:
+                        pass
+            elif ch == "," and depth == 1:
+                last_safe = i
+        if data is None and last_safe != -1:
+            try:
+                candidate = json.loads(body[:last_safe] + "}")
+                if isinstance(candidate, dict):
+                    data = candidate
+            except json.JSONDecodeError:
+                pass
+    if data is None:
         return None
+    if "action_type" not in data and prefix:
+        m = re.match(r"[a-z_][a-z0-9_]*", prefix.lower())
+        if m:
+            data["action_type"] = m.group(0)
     return {k: v for k, v in data.items() if k in _ALLOWED_ACTION_FIELDS}
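
The fallback branch of the new `parse_completion` (cut at the last top-level comma, then append `}`) can be exercised in isolation. The sketch below re-implements only that step under the hypothetical name `close_at_last_field`; it is not the repository's function. It differs in one respect, flagged in the comments: it also counts `[` and `]` towards nesting depth, so commas inside arrays are not treated as cut points.

```python
import json


def close_at_last_field(body: str):
    """Recover a truncated JSON object by cutting at the last top-level comma."""
    depth = 0
    in_str = False
    esc = False
    last_comma = -1
    for i, ch in enumerate(body):
        if in_str:
            if esc:
                esc = False          # character after a backslash is literal
            elif ch == "\\":
                esc = True
            elif ch == '"':
                in_str = False
            continue
        if ch == '"':
            in_str = True
        elif ch in "{[":             # assumption: arrays count as nesting too
            depth += 1
        elif ch in "}]":
            depth -= 1
        elif ch == "," and depth == 1:
            last_comma = i           # everything before this comma is complete
    if last_comma == -1:
        return None
    try:
        obj = json.loads(body[:last_comma] + "}")
    except json.JSONDecodeError:
        return None
    return obj if isinstance(obj, dict) else None


truncated = ('{"action_type": "select_case", "case_id": "CB-E1", '
             '"strategy": "Select the case ID to procee')
in_array = '{"case_id": "CB-1", "evidence": ["a", "b'

print(close_at_last_field(truncated))
print(close_at_last_field(in_array))
print(close_at_last_field('{"a'))
```

The first input mirrors the payload in `test_parse_completion_handles_truncated_json` and recovers the two complete fields. The second shows why bracket tracking may be worth considering: in the diff as written, `[` does not change `depth`, so a comma inside a truncated array becomes the cut point, the final `json.loads` fails, and the function returns `None`.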