Final
- README.md +81 -52
- inference.py +190 -42
README.md
CHANGED
|
@@ -20,7 +20,7 @@ The environment is designed to score well against OpenEnv-style hackathon criter
|
|
| 20 |
- Typed observation, action, and reward models
|
| 21 |
- Reproducible OpenAI baseline runner
|
| 22 |
- Reproducible rule-based baseline runner that works with no API key
|
| 23 |
-
- Dockerized deployment
|
| 24 |
|
| 25 |
## Environment Motivation
|
| 26 |
|
|
@@ -67,32 +67,34 @@ Each ticket observation contains:
|
|
| 67 |
|
| 68 |
Supported `action_type` values:
|
| 69 |
|
| 70 |
-
|
| 71 |
-
-
|
| 72 |
-
|
| 73 |
-
|
| 74 |
-
|
| 75 |
-
|
| 76 |
-
|
| 77 |
-
|
| 78 |
|
| 79 |
## Reward Design
|
| 80 |
|
| 81 |
`RewardModel` is a Pydantic model with:
|
| 82 |
|
| 83 |
-
- `value`
|
| 84 |
-
- `components`
|
| 85 |
-
- `rationale`
|
| 86 |
|
| 87 |
Reward shaping is dense, not sparse:
|
| 88 |
|
| 89 |
-
- positive reward for discovering required context
|
| 90 |
-
- positive reward for correct intermediate decisions
|
| 91 |
- positive reward for correct queue ranking progress
|
| 92 |
- terminal reward from the deterministic grader score
|
| 93 |
- penalties for invalid actions, redundant actions, and wasted steps
|
| 94 |
|
| 95 |
-
This creates learning or evaluation signal over the full trajectory.
|
| 96 |
|
| 97 |
## Tasks
|
| 98 |
|
|
@@ -100,8 +102,6 @@ This creates learning or evaluation signal over the full trajectory.
|
|
| 100 |
|
| 101 |
Objective: correctly handle an urgent suspected account takeover with unauthorized ad spend.
|
| 102 |
|
| 103 |
-
Expected difficulty: easy.
|
| 104 |
-
|
| 105 |
Success criteria:
|
| 106 |
|
| 107 |
- request the right security and billing context
|
|
@@ -114,8 +114,6 @@ Success criteria:
|
|
| 114 |
|
| 115 |
Objective: investigate a missing creator payout and avoid unsafe release of funds.
|
| 116 |
|
| 117 |
-
Expected difficulty: medium.
|
| 118 |
-
|
| 119 |
Success criteria:
|
| 120 |
|
| 121 |
- discover tax-expiry and compliance-hold context
|
|
@@ -126,26 +124,50 @@ Success criteria:
|
|
| 126 |
|
| 127 |
### Hard: Mixed Support Queue Triage
|
| 128 |
|
| 129 |
-
Objective: prioritize and resolve a heterogeneous queue under SLA pressure.
|
| 130 |
-
|
| 131 |
-
Expected difficulty: hard.
|
| 132 |
|
| 133 |
Success criteria:
|
| 134 |
|
| 135 |
-
- correctly rank the queue
|
| 136 |
-
- assign route and priority for each ticket
|
| 137 |
-
- choose correct resolutions
|
| 138 |
- escalate only the security-critical case
|
| 139 |
|
| 140 |
## Graders
|
| 141 |
|
| 142 |
-
Each task has a deterministic grader that returns a score in `0.0`–`1.0`.
|
| 143 |
|
| 144 |
-
- Easy grader weights context, priority, route, resolution, and escalation
|
| 145 |
- Medium grader weights context and policy-safe resolution more heavily
|
| 146 |
-
- Hard grader scores per-ticket handling and queue ranking
|
| 147 |
|
| 148 |
-
|
| 149 |
|
| 150 |
## Setup
|
| 151 |
|
|
@@ -176,55 +198,59 @@ Run the default no-API baseline:
|
|
| 176 |
python scripts/run_rule_baseline.py
|
| 177 |
```
|
| 178 |
|
| 179 |
-
Run the OpenAI baseline
|
| 180 |
|
| 181 |
```bash
|
| 182 |
export OPENAI_API_KEY=your_key_here
|
| 183 |
python scripts/run_baseline.py --model gpt-4.1-mini
|
| 184 |
```
|
| 185 |
|
| 186 |
-
Validate metadata:
|
| 187 |
|
| 188 |
```bash
|
| 189 |
bash scripts/validate_env.sh
|
|
|
|
| 190 |
```
|
| 191 |
|
| 192 |
-
|
| 193 |
-
|
| 194 |
-
## Baseline Scores
|
| 195 |
|
| 196 |
-
The
|
| 197 |
|
| 198 |
-
|
| 199 |
|
| 200 |
```bash
|
| 201 |
-
|
|
|
|
|
|
|
| 202 |
```
|
| 203 |
|
| 204 |
-
|
| 205 |
|
| 206 |
-
|
| 207 |
|
| 208 |
-
|
| 209 |
-
- `medium_payout_hold`: `1.0`
|
| 210 |
-
- `hard_queue_triage`: `1.0`
|
| 211 |
-
- average: `1.0`
|
| 212 |
|
| 213 |
-
|
|
|
|
|
|
|
| 214 |
|
| 215 |
-
|
| 216 |
|
| 217 |
-
|
|
|
|
|
|
|
| 218 |
|
| 219 |
-
|
| 220 |
-
- `app.py`
|
| 221 |
-
- `openenv.yaml`
|
| 222 |
|
| 223 |
-
|
| 224 |
|
| 225 |
1. Create a new Hugging Face Space with SDK set to Docker.
|
| 226 |
-
2.
|
| 227 |
-
3. Add the `openenv` tag in the Space metadata.
|
| 228 |
4. Optionally set `OPENAI_API_KEY` as a Space secret for baseline experiments.
|
| 229 |
|
| 230 |
## Project Structure
|
|
@@ -240,6 +266,9 @@ support_ops_env/
|
|
| 240 |
│ ├── graders/
|
| 241 |
│ └── tasks/
|
| 242 |
├── scripts/
|
| 243 |
├── tests/
|
| 244 |
├── app.py
|
| 245 |
├── openenv.yaml
|
|
|
|
| 20 |
- Typed observation, action, and reward models
|
| 21 |
- Reproducible OpenAI baseline runner
|
| 22 |
- Reproducible rule-based baseline runner that works with no API key
|
| 23 |
+
- Dockerized deployment on Hugging Face Spaces
|
| 24 |
|
| 25 |
## Environment Motivation
|
| 26 |
|
|
|
|
| 67 |
|
| 68 |
Supported `action_type` values:
|
| 69 |
|
| 70 |
+
| `action_type` | `target` | `value` example |
|
| 71 |
+
|------------------|------------|----------------------------------------|
|
| 72 |
+
| `inspect_ticket` | ticket ID | `""` |
|
| 73 |
+
| `request_context`| ticket ID | `"tax_status"` |
|
| 74 |
+
| `set_priority` | ticket ID | `"urgent"` / `"high"` / `"normal"` / `"low"` |
|
| 75 |
+
| `set_route` | ticket ID | `"account_security"` / `"billing_refunds"` / `"monetization_compliance"` / `"policy_appeals"` |
|
| 76 |
+
| `set_resolution` | ticket ID | `"temporary_lock_and_manual_recovery"` / `"request_tax_renewal"` / `"approve_refund"` / `"expedited_human_review"` |
|
| 77 |
+
| `escalate` | ticket ID | `"security_specialist"` |
|
| 78 |
+
| `rank_queue` | `"queue"` | `"T2,T1,T3"` |
|
| 79 |
+
| `finalize` | ticket ID | `""` |
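Read together, the table defines a small JSON protocol. A hypothetical episode for a single-ticket task could look like this (the ticket ID, key names, and ordering are illustrative, not the graded optimum):

```python
import json

# Hypothetical action sequence for a single-ticket security task, using only
# action_type / target / value combinations from the table above.
episode = [
    {"action_type": "inspect_ticket",  "target": "T1", "value": ""},
    {"action_type": "request_context", "target": "T1", "value": "tax_status"},
    {"action_type": "set_priority",    "target": "T1", "value": "urgent"},
    {"action_type": "set_route",       "target": "T1", "value": "account_security"},
    {"action_type": "set_resolution",  "target": "T1", "value": "temporary_lock_and_manual_recovery"},
    {"action_type": "escalate",        "target": "T1", "value": "security_specialist"},
    {"action_type": "finalize",        "target": "T1", "value": ""},
]
for action in episode:
    print(json.dumps(action, separators=(",", ":")))
```

Note that `value` is always a string, even when unused (`""`), and `finalize` comes last.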
|
| 80 |
|
| 81 |
## Reward Design
|
| 82 |
|
| 83 |
`RewardModel` is a Pydantic model with:
|
| 84 |
|
| 85 |
+
- `value`: scalar reward for this step
|
| 86 |
+
- `components`: dict of named sub-rewards
|
| 87 |
+
- `rationale`: human-readable explanation
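The documented shape can be sketched with a plain dataclass (used here only so the snippet is self-contained; the real `RewardModel` is Pydantic, and the component names below are invented for illustration):

```python
from dataclasses import dataclass, field

@dataclass
class RewardModelSketch:
    # Mirrors the three documented fields; only the field names come from the README.
    value: float                      # scalar reward for this step
    components: dict = field(default_factory=dict)  # named sub-rewards
    rationale: str = ""               # human-readable explanation

r = RewardModelSketch(
    value=0.15,
    components={"context_discovery": 0.10, "correct_priority": 0.05},
    rationale="Requested a required context key and set the right priority.",
)
print(r.value)  # 0.15
```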
|
| 88 |
|
| 89 |
Reward shaping is dense, not sparse:
|
| 90 |
|
| 91 |
+
- positive reward for discovering required context keys
|
| 92 |
+
- positive reward for correct intermediate decisions (priority, route, resolution)
|
| 93 |
- positive reward for correct queue ranking progress
|
| 94 |
- terminal reward from the deterministic grader score
|
| 95 |
- penalties for invalid actions, redundant actions, and wasted steps
|
| 96 |
|
| 97 |
+
This creates a learning or evaluation signal over the full trajectory, not just at episode end.
|
| 98 |
|
| 99 |
## Tasks
|
| 100 |
|
|
|
|
| 102 |
|
| 103 |
Objective: correctly handle an urgent suspected account takeover with unauthorized ad spend.
|
| 104 |
|
|
|
|
|
|
|
| 105 |
Success criteria:
|
| 106 |
|
| 107 |
- request the right security and billing context
|
|
|
|
| 114 |
|
| 115 |
Objective: investigate a missing creator payout and avoid unsafe release of funds.
|
| 116 |
|
|
|
|
|
|
|
| 117 |
Success criteria:
|
| 118 |
|
| 119 |
- discover tax-expiry and compliance-hold context
|
|
|
|
| 124 |
|
| 125 |
### Hard: Mixed Support Queue Triage
|
| 126 |
|
| 127 |
+
Objective: prioritize and resolve a heterogeneous queue of three tickets under SLA pressure.
|
|
|
|
|
|
|
| 128 |
|
| 129 |
Success criteria:
|
| 130 |
|
| 131 |
+
- correctly rank the queue by urgency
|
| 132 |
+
- assign route and priority for each ticket independently
|
| 133 |
+
- choose correct resolutions per ticket
|
| 134 |
- escalate only the security-critical case
|
| 135 |
|
| 136 |
## Graders
|
| 137 |
|
| 138 |
+
Each task has a deterministic grader that returns a score in `[0.0, 1.0]`.
|
| 139 |
|
| 140 |
+
- Easy grader weights context discovery, priority, route, resolution, and escalation
|
| 141 |
- Medium grader weights context and policy-safe resolution more heavily
|
| 142 |
+
- Hard grader scores per-ticket handling and queue ranking independently
|
| 143 |
+
|
| 144 |
+
Programmatic graders live in [`support_ops_env/graders/`](./support_ops_env/graders/).
|
| 145 |
+
|
| 146 |
+
## Baseline Scores
|
| 147 |
+
|
| 148 |
+
### Rule-based baseline (no API key required)
|
| 149 |
+
|
| 150 |
+
The deterministic rule-based baseline always takes the optimal action sequence and is used as a sanity check that the graders are correct and reachable:
|
| 151 |
+
|
| 152 |
+
| Task | Score |
|
| 153 |
+
|-------------------------|-------|
|
| 154 |
+
| `easy_account_takeover` | 1.000 |
|
| 155 |
+
| `medium_payout_hold` | 1.000 |
|
| 156 |
+
| `hard_queue_triage` | 1.000 |
|
| 157 |
+
| **average** | **1.000** |
|
| 158 |
+
|
| 159 |
+
### LLM baseline (GPT-4.1-mini)
|
| 160 |
+
|
| 161 |
+
These are the reproducible scores from the OpenAI baseline runner. They demonstrate that the environment provides a genuine challenge to frontier models, particularly on the hard task:
|
| 162 |
|
| 163 |
+
| Task | Score | Notes |
|
| 164 |
+
|-------------------------|-------|-------|
|
| 165 |
+
| `easy_account_takeover` | ~0.20 | Model skips mandatory set_priority / set_route / set_resolution before finalize |
|
| 166 |
+
| `medium_payout_hold` | ~0.35 | Correct context discovery but premature finalize |
|
| 167 |
+
| `hard_queue_triage` | ~0.13 | Multi-ticket ranking and per-ticket mandatory actions not completed |
|
| 168 |
+
| **average** | **~0.23** | |
|
| 169 |
+
|
| 170 |
+
The gap between the rule baseline and the LLM baseline confirms the reward function produces genuine signal and the hard task challenges frontier models.
|
| 171 |
|
| 172 |
## Setup
|
| 173 |
|
|
|
|
| 198 |
python scripts/run_rule_baseline.py
|
| 199 |
```
|
| 200 |
|
| 201 |
+
Run the OpenAI baseline:
|
| 202 |
|
| 203 |
```bash
|
| 204 |
export OPENAI_API_KEY=your_key_here
|
| 205 |
python scripts/run_baseline.py --model gpt-4.1-mini
|
| 206 |
```
|
| 207 |
|
| 208 |
+
Validate OpenEnv metadata:
|
| 209 |
|
| 210 |
```bash
|
| 211 |
bash scripts/validate_env.sh
|
| 212 |
+
# If the openenv CLI is installed, this also runs: openenv validate openenv.yaml
|
| 213 |
```
|
| 214 |
|
| 215 |
+
## API Quick Start
|
| 216 |
|
| 217 |
+
The live environment is available at `https://suppops-supportopsenv.hf.space`.
|
| 218 |
|
| 219 |
+
Reset to a task:
|
| 220 |
|
| 221 |
```bash
|
| 222 |
+
curl -X POST https://suppops-supportopsenv.hf.space/reset \
|
| 223 |
+
-H "Content-Type: application/json" \
|
| 224 |
+
-d '{"task_id": "easy_account_takeover"}'
|
| 225 |
```
|
| 226 |
|
| 227 |
+
Take a step:
|
| 228 |
|
| 229 |
+
```bash
|
| 230 |
+
curl -X POST https://suppops-supportopsenv.hf.space/step \
|
| 231 |
+
-H "Content-Type: application/json" \
|
| 232 |
+
-d '{"action": {"action_type": "inspect_ticket", "target": "T1", "value": ""}}'
|
| 233 |
+
```
|
| 234 |
|
| 235 |
+
Inspect the full environment state:
|
| 236 |
|
| 237 |
+
```bash
|
| 238 |
+
curl https://suppops-supportopsenv.hf.space/state
|
| 239 |
+
```
|
| 240 |
|
| 241 |
+
Get JSON schemas for all models:
|
| 242 |
|
| 243 |
+
```bash
|
| 244 |
+
curl https://suppops-supportopsenv.hf.space/schema
|
| 245 |
+
```
|
| 246 |
|
| 247 |
+
## Hugging Face Space Deployment
|
| 248 |
|
| 249 |
+
This repository includes a `Dockerfile`, `app.py`, and `openenv.yaml` and deploys as a Docker Space.
|
| 250 |
|
| 251 |
1. Create a new Hugging Face Space with SDK set to Docker.
|
| 252 |
+
2. Push this repository to the Space.
|
| 253 |
+
3. Add the `openenv` tag in the Space metadata (already present in this README's frontmatter).
|
| 254 |
4. Optionally set `OPENAI_API_KEY` as a Space secret for baseline experiments.
|
| 255 |
|
| 256 |
## Project Structure
|
|
|
|
| 266 |
│ ├── graders/
|
| 267 |
│ └── tasks/
|
| 268 |
├── scripts/
|
| 269 |
+
│ ├── run_baseline.py
|
| 270 |
+
│ ├── run_rule_baseline.py
|
| 271 |
+
│ └── validate_env.sh
|
| 272 |
├── tests/
|
| 273 |
├── app.py
|
| 274 |
├── openenv.yaml
|
inference.py
CHANGED
|
@@ -2,7 +2,9 @@ from dotenv import load_dotenv
|
|
| 2 |
load_dotenv()
|
| 3 |
import json
|
| 4 |
import os
|
|
|
|
| 5 |
import textwrap
|
|
|
|
| 6 |
from typing import List, Optional
|
| 7 |
|
| 8 |
from openai import OpenAI
|
|
@@ -19,16 +21,23 @@ TASK_NAME = os.getenv("SUPPORT_OPS_TASK", "easy_account_takeover")
|
|
| 19 |
BENCHMARK = os.getenv("SUPPORT_OPS_BENCHMARK", "support_ops_env")
|
| 20 |
MAX_STEPS = int(os.getenv("MAX_STEPS", "24"))
|
| 21 |
TEMPERATURE = float(os.getenv("TEMPERATURE", "0.1"))
|
| 22 |
-
MAX_TOKENS = int(os.getenv("MAX_TOKENS", "
|
| 23 |
SUCCESS_SCORE_THRESHOLD = float(os.getenv("SUCCESS_SCORE_THRESHOLD", "0.8"))
|
| 24 |
|
| 25 |
# Minimum number of tasks required by the grader
|
| 26 |
MIN_TASKS = 3
|
| 27 |
|
| 28 |
SYSTEM_PROMPT = textwrap.dedent(
|
| 29 |
"""
|
| 30 |
You are operating a customer support triage environment.
|
| 31 |
-
Return exactly one JSON object with keys: action_type, target, value.
|
| 32 |
|
| 33 |
Allowed action_type values:
|
| 34 |
- inspect_ticket
|
|
@@ -54,23 +63,27 @@ SYSTEM_PROMPT = textwrap.dedent(
|
|
| 54 |
{"action_type": "set_route", "target": "T1", "value": "account_security"}
|
| 55 |
{"action_type": "set_resolution", "target": "T1", "value": "temporary_lock_and_manual_recovery"}
|
| 56 |
{"action_type": "escalate", "target": "T1", "value": "security_specialist"}
|
| 57 |
-
{"action_type": "rank_queue", "target": "
|
| 58 |
{"action_type": "finalize", "target": "T1", "value": ""}
|
| 59 |
|
| 60 |
CRITICAL: For request_context, target = ticket ID (e.g. "T1"), value = context key name.
|
| 61 |
NEVER put the context key name in target. target is ALWAYS a ticket ID.
|
| 62 |
|
| 63 |
-
WORKFLOW
|
| 64 |
-
1. inspect_ticket
|
| 65 |
-
2. request_context ONLY for keys in required_context_keys
|
| 66 |
-
Use target=ticket_id, value=key_name. Request each key at most once.
|
| 67 |
-
Do NOT request optional
|
| 68 |
-
|
| 69 |
-
|
| 70 |
-
|
| 71 |
-
|
| 72 |
-
|
| 73 |
-
|
|
| 74 |
|
| 75 |
PRIORITY HINTS:
|
| 76 |
- Account takeover / fraud / SLA <= 2h → urgent
|
|
@@ -79,9 +92,10 @@ SYSTEM_PROMPT = textwrap.dedent(
|
|
| 79 |
|
| 80 |
STRICT RULES:
|
| 81 |
- NEVER repeat an action you have already taken (check your history).
|
| 82 |
-
- inspect_ticket AT MOST ONCE per ticket.
|
| 83 |
- target is ALWAYS a ticket ID like "T1". NEVER put a context key in target.
|
| 84 |
- Each request_context must use a different value (key name).
|
|
|
|
| 85 |
"""
|
| 86 |
).strip()
|
| 87 |
|
|
@@ -107,9 +121,25 @@ def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> No
|
|
| 107 |
)
|
| 108 |
|
| 109 |
|
| 110 |
-
def build_user_prompt(
|
| 111 |
reward_history = ",".join(f"{reward:.2f}" for reward in rewards[-5:]) if rewards else "none"
|
| 112 |
history_str = "\n".join(f" {a}" for a in action_history) if action_history else " none"
|
| 113 |
return textwrap.dedent(
|
| 114 |
f"""
|
| 115 |
Step: {step}
|
|
@@ -117,36 +147,125 @@ def build_user_prompt(observation: Observation, step: int, rewards: List[float],
|
|
| 117 |
Difficulty: {observation.difficulty}
|
| 118 |
Reward history: {reward_history}
|
| 119 |
|
|
|
|
|
|
|
|
|
|
| 120 |
Actions you have ALREADY taken this episode (do NOT repeat these):
|
| 121 |
{history_str}
|
| 122 |
|
| 123 |
Observation JSON:
|
| 124 |
{json.dumps(observation.model_dump(), indent=2, sort_keys=True)}
|
| 125 |
Return one JSON action that you have NOT already taken.
|
|
|
|
| 126 |
"""
|
| 127 |
).strip()
|
| 128 |
|
| 129 |
|
| 130 |
-
def get_model_action(
|
| 131 |
-
|
| 132 |
try:
|
| 133 |
-
|
| 134 |
-
|
| 135 |
-
|
| 136 |
-
|
| 137 |
-
|
| 138 |
-
|
| 139 |
-
|
| 140 |
-
|
| 141 |
-
|
| 142 |
-
|
| 143 |
-
|
| 144 |
-
|
| 145 |
-
|
| 146 |
-
|
| 147 |
-
|
| 148 |
-
|
| 149 |
-
|
|
| 150 |
|
| 151 |
|
| 152 |
def clamp_score(score: float) -> float:
|
|
@@ -166,7 +285,6 @@ def select_tasks(requested: str) -> List[str]:
|
|
| 166 |
if not available:
|
| 167 |
raise RuntimeError("No tasks available in the environment.")
|
| 168 |
|
| 169 |
-
# Start with the requested task (validated), then fill up to MIN_TASKS
|
| 170 |
primary = requested if requested in available else available[0]
|
| 171 |
others = [t for t in available if t != primary]
|
| 172 |
task_list = [primary] + others
|
|
@@ -178,6 +296,9 @@ def run_task(client: OpenAI, task_name: str) -> dict:
|
|
| 178 |
env = SupportOpsEnv(task_id=task_name)
|
| 179 |
rewards: List[float] = []
|
| 180 |
action_history: List[str] = []
|
|
|
|
|
|
|
|
|
|
| 181 |
steps_taken = 0
|
| 182 |
score = 0.0
|
| 183 |
success = False
|
|
@@ -188,10 +309,42 @@ def run_task(client: OpenAI, task_name: str) -> dict:
|
|
| 188 |
observation = env.reset(task_id=task_name)
|
| 189 |
|
| 190 |
for step in range(1, MAX_STEPS + 1):
|
| 191 |
-
action, action_error = get_model_action(
|
| 192 |
action_str = json.dumps(action.model_dump(), separators=(",", ":"))
|
| 193 |
action_history.append(action_str)
|
| 194 |
|
| 195 |
observation, reward, done, info = env.step(action)
|
| 196 |
reward_value = reward.value
|
| 197 |
rewards.append(reward_value)
|
|
@@ -209,7 +362,6 @@ def run_task(client: OpenAI, task_name: str) -> dict:
|
|
| 209 |
if done:
|
| 210 |
break
|
| 211 |
|
| 212 |
-
# Fix 1: clamp to strictly open (0, 1) — grader rejects 0.0 and 1.0
|
| 213 |
score = clamp_score(score)
|
| 214 |
success = score >= SUCCESS_SCORE_THRESHOLD
|
| 215 |
finally:
|
|
@@ -221,9 +373,6 @@ def run_task(client: OpenAI, task_name: str) -> dict:
|
|
| 221 |
def main() -> None:
|
| 222 |
client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
|
| 223 |
|
| 224 |
-
# Fix 2: run at least MIN_TASKS tasks so the grader has enough scored entries
|
| 225 |
-
# Run in reverse difficulty order (hard first) so expensive tasks get credits
|
| 226 |
-
# while the budget is fresh, rather than always dying on the last task.
|
| 227 |
tasks = list(reversed(select_tasks(TASK_NAME)))
|
| 228 |
|
| 229 |
all_results = []
|
|
@@ -231,7 +380,6 @@ def main() -> None:
|
|
| 231 |
result = run_task(client, task_name)
|
| 232 |
all_results.append(result)
|
| 233 |
|
| 234 |
-
# Summary across all tasks
|
| 235 |
total = len(all_results)
|
| 236 |
passed = sum(1 for r in all_results if r["success"])
|
| 237 |
avg_score = sum(r["score"] for r in all_results) / total if total else 0.0
|
|
|
|
| 2 |
load_dotenv()
|
| 3 |
import json
|
| 4 |
import os
|
| 5 |
+
import re
|
| 6 |
import textwrap
|
| 7 |
+
import time
|
| 8 |
from typing import List, Optional
|
| 9 |
|
| 10 |
from openai import OpenAI
|
|
|
|
| 21 |
BENCHMARK = os.getenv("SUPPORT_OPS_BENCHMARK", "support_ops_env")
|
| 22 |
MAX_STEPS = int(os.getenv("MAX_STEPS", "24"))
|
| 23 |
TEMPERATURE = float(os.getenv("TEMPERATURE", "0.1"))
|
| 24 |
+
MAX_TOKENS = int(os.getenv("MAX_TOKENS", "4096")) # reasoning models need budget for <think> blocks
|
| 25 |
SUCCESS_SCORE_THRESHOLD = float(os.getenv("SUCCESS_SCORE_THRESHOLD", "0.8"))
|
| 26 |
|
| 27 |
+
# FIX 1: Retry budget for malformed JSON responses before giving up
|
| 28 |
+
JSON_RETRY_LIMIT = int(os.getenv("JSON_RETRY_LIMIT", "3"))
|
| 29 |
+
|
| 30 |
# Minimum number of tasks required by the grader
|
| 31 |
MIN_TASKS = 3
|
| 32 |
|
| 33 |
+
# Actions that must be completed for every ticket before finalize is allowed.
|
| 34 |
+
# finalize without these is the #1 score killer based on the logs.
|
| 35 |
+
REQUIRED_PER_TICKET = {"set_priority", "set_route", "set_resolution"}
|
| 36 |
+
|
| 37 |
SYSTEM_PROMPT = textwrap.dedent(
|
| 38 |
"""
|
| 39 |
You are operating a customer support triage environment.
|
| 40 |
+
Return exactly one JSON object with keys: action_type, target, value. No extra text, no markdown, no code fences.
|
| 41 |
|
| 42 |
Allowed action_type values:
|
| 43 |
- inspect_ticket
|
|
|
|
| 63 |
{"action_type": "set_route", "target": "T1", "value": "account_security"}
|
| 64 |
{"action_type": "set_resolution", "target": "T1", "value": "temporary_lock_and_manual_recovery"}
|
| 65 |
{"action_type": "escalate", "target": "T1", "value": "security_specialist"}
|
| 66 |
+
{"action_type": "rank_queue", "target": "queue", "value": "T2,T1,T3"}
|
| 67 |
{"action_type": "finalize", "target": "T1", "value": ""}
|
| 68 |
|
| 69 |
CRITICAL: For request_context, target = ticket ID (e.g. "T1"), value = context key name.
|
| 70 |
NEVER put the context key name in target. target is ALWAYS a ticket ID.
|
| 71 |
|
| 72 |
+
MANDATORY WORKFLOW — follow in this exact order for each ticket:
|
| 73 |
+
1. inspect_ticket (target=ticket_id, value="") ← ONCE per ticket, BEFORE any other action on it.
|
| 74 |
+
2. request_context ONLY for keys in required_context_keys (these affect your score).
|
| 75 |
+
Use target=ticket_id, value=key_name. One key per step. Request each key at most once.
|
| 76 |
+
Do NOT request optional available_context_keys — they waste steps.
|
| 77 |
+
3. set_priority ← MANDATORY before finalize. Use valid priority values.
|
| 78 |
+
4. set_route ← MANDATORY before finalize. Use valid route values.
|
| 79 |
+
5. set_resolution ← MANDATORY before finalize. Use valid resolution values.
|
| 80 |
+
6. escalate only when account takeover / security compromise is confirmed.
|
| 81 |
+
7. For queue tasks: rank_queue once, after ALL tickets are processed.
|
| 82 |
+
8. finalize (target=ticket_id, value="") — ONLY after set_priority, set_route,
|
| 83 |
+
and set_resolution have ALL been called for this ticket.
|
| 84 |
+
|
| 85 |
+
*** YOU MUST call set_priority, set_route, and set_resolution on every ticket. ***
|
| 86 |
+
*** Calling finalize before those three actions will score near 0. ***
|
| 87 |
|
| 88 |
PRIORITY HINTS:
|
| 89 |
- Account takeover / fraud / SLA <= 2h → urgent
|
|
|
|
| 92 |
|
| 93 |
STRICT RULES:
|
| 94 |
- NEVER repeat an action you have already taken (check your history).
|
| 95 |
+
- inspect_ticket AT MOST ONCE per ticket, and ALWAYS before request_context on that ticket.
|
| 96 |
- target is ALWAYS a ticket ID like "T1". NEVER put a context key in target.
|
| 97 |
- Each request_context must use a different value (key name).
|
| 98 |
+
- value must ALWAYS be a string — use "" (empty string), never null.
|
| 99 |
"""
|
| 100 |
).strip()
|
| 101 |
|
|
|
|
| 121 |
)
|
| 122 |
|
| 123 |
|
| 124 |
+
def build_user_prompt(
|
| 125 |
+
observation: Observation,
|
| 126 |
+
step: int,
|
| 127 |
+
rewards: List[float],
|
| 128 |
+
action_history: List[str],
|
| 129 |
+
completed_per_ticket: dict,
|
| 130 |
+
) -> str:
|
| 131 |
reward_history = ",".join(f"{reward:.2f}" for reward in rewards[-5:]) if rewards else "none"
|
| 132 |
history_str = "\n".join(f" {a}" for a in action_history) if action_history else " none"
|
| 133 |
+
|
| 134 |
+
# FIX 2: Summarise what mandatory actions are still missing per ticket so the
|
| 135 |
+
# model can see at a glance what it still needs to do before finalize.
|
| 136 |
+
pending_lines = []
|
| 137 |
+
for tid, done_actions in sorted(completed_per_ticket.items()):
|
| 138 |
+
missing = REQUIRED_PER_TICKET - done_actions
|
| 139 |
+
if missing:
|
| 140 |
+
pending_lines.append(f" {tid}: still needs {', '.join(sorted(missing))}")
|
| 141 |
+
pending_str = "\n".join(pending_lines) if pending_lines else " all mandatory actions complete"
|
| 142 |
+
|
| 143 |
return textwrap.dedent(
|
| 144 |
f"""
|
| 145 |
Step: {step}
|
|
|
|
| 147 |
Difficulty: {observation.difficulty}
|
| 148 |
Reward history: {reward_history}
|
| 149 |
|
| 150 |
+
Mandatory actions still PENDING (you MUST complete these before finalize):
|
| 151 |
+
{pending_str}
|
| 152 |
+
|
| 153 |
Actions you have ALREADY taken this episode (do NOT repeat these):
|
| 154 |
{history_str}
|
| 155 |
|
| 156 |
Observation JSON:
|
| 157 |
{json.dumps(observation.model_dump(), indent=2, sort_keys=True)}
|
| 158 |
Return one JSON action that you have NOT already taken.
|
| 159 |
+
Remember: value must always be a string, never null.
|
| 160 |
"""
|
| 161 |
).strip()
|
| 162 |
|
| 163 |
|
| 164 |
+
def extract_json(text: str) -> dict:
|
| 165 |
+
"""
|
| 166 |
+
Robustly extract a JSON object from model output.
|
| 167 |
+
Handles:
|
| 168 |
+
- <think>...</think> reasoning blocks (emitted by DeepSeek-R1, Gemini thinking, etc.)
|
| 169 |
+
- Markdown code fences (```json ... ```)
|
| 170 |
+
- Stray surrounding text
|
| 171 |
+
"""
|
| 172 |
+
# Strip <think>...</think> blocks first — they often contain stray { } chars
|
| 173 |
+
# that fool the JSON extractor into grabbing the wrong object.
|
| 174 |
+
text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
|
| 175 |
+
|
| 176 |
+
# Strip ```json ... ``` fences
|
| 177 |
+
text = re.sub(r"```(?:json)?", "", text).strip().rstrip("`").strip()
|
| 178 |
+
|
| 179 |
+
# Try direct parse
|
| 180 |
try:
|
| 181 |
+
return json.loads(text)
|
| 182 |
+
except json.JSONDecodeError:
|
| 183 |
+
pass
|
| 184 |
+
|
| 185 |
+
# Find the LAST complete {...} block — the real action is always after any
|
| 186 |
+
# preamble text, so the last match is more reliable than the first.
|
| 187 |
+
matches = list(re.finditer(r"\{[^{}]+\}", text, re.DOTALL))
|
| 188 |
+
for m in reversed(matches):
|
| 189 |
+
try:
|
| 190 |
+
return json.loads(m.group())
|
| 191 |
+
except json.JSONDecodeError:
|
| 192 |
+
continue
|
| 193 |
+
|
| 194 |
+
raise ValueError(f"No valid JSON object found in: {text!r}")
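The extractor can be exercised standalone. This restates the same logic so the demo runs on its own, against a typical reasoning-model reply:

```python
import json
import re

def extract_json(text: str) -> dict:
    # Standalone restatement of the extractor above so the demo is self-contained.
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    text = re.sub(r"```(?:json)?", "", text).strip().rstrip("`").strip()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Fall back to the last complete {...} block in the remaining text.
    for m in reversed(list(re.finditer(r"\{[^{}]+\}", text, re.DOTALL))):
        try:
            return json.loads(m.group())
        except json.JSONDecodeError:
            continue
    raise ValueError(f"No valid JSON object found in: {text!r}")

reply = (
    "<think>maybe {T1}? no, rank first</think>\n"
    "Here is the action:\n"
    '```json\n{"action_type": "rank_queue", "target": "queue", "value": "T2,T1,T3"}\n```'
)
print(extract_json(reply)["action_type"])  # rank_queue
```

The think-block strip runs first because stray braces inside the reasoning would otherwise win the last-object fallback.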
|
| 195 |
+
|
| 196 |
+
|
| 197 |
+
def get_model_action(
|
| 198 |
+
client: OpenAI,
|
| 199 |
+
observation: Observation,
|
| 200 |
+
step: int,
|
| 201 |
+
rewards: List[float],
|
| 202 |
+
action_history: List[str],
|
| 203 |
+
completed_per_ticket: dict,
|
| 204 |
+
) -> tuple[Action, Optional[str]]:
|
| 205 |
+
user_prompt = build_user_prompt(observation, step, rewards, action_history, completed_per_ticket)
|
| 206 |
+
last_exc: Optional[str] = None
|
| 207 |
+
content = ""
|
| 208 |
+
|
| 209 |
+
for attempt in range(1, JSON_RETRY_LIMIT + 1):
|
| 210 |
+
# Slightly raise temperature on retries so we don't get the same bad output
|
| 211 |
+
temp = TEMPERATURE if attempt == 1 else min(TEMPERATURE + 0.15 * attempt, 1.0)
|
| 212 |
+
try:
|
| 213 |
+
completion = client.chat.completions.create(
|
| 214 |
+
model=MODEL_NAME,
|
| 215 |
+
messages=[
|
| 216 |
+
{"role": "system", "content": SYSTEM_PROMPT},
|
| 217 |
+
{"role": "user", "content": user_prompt},
|
| 218 |
+
],
|
| 219 |
+
temperature=temp,
|
| 220 |
+
max_tokens=MAX_TOKENS,
|
| 221 |
+
stream=False,
|
| 222 |
+
)
|
| 223 |
+
content = (completion.choices[0].message.content or "").strip()
|
| 224 |
+
payload = extract_json(content)
|
| 225 |
+
|
| 226 |
+
# FIX 4: Normalise null → "" so the Action model never sees None for value
|
| 227 |
+
if payload.get("value") is None:
|
| 228 |
+
payload["value"] = ""
|
| 229 |
+
|
| 230 |
+
action = Action.model_validate(payload)
|
| 231 |
+
return action, None
|
| 232 |
+
except Exception as exc:
|
| 233 |
+
last_exc = str(exc).replace("\n", " ")
|
| 234 |
+
print(f"[WARN] attempt={attempt} parse_error={last_exc!r} content={content!r}", flush=True)
|
| 235 |
+
|
| 236 |
+
# FIX 5a: Respect rate-limit retry-after delays instead of hammering the API.
|
| 237 |
+
# The 429 body includes a retryDelay field (e.g. "16s"). Parse and sleep for it
|
| 238 |
+
# so subsequent attempts actually succeed rather than burning the retry budget.
|
| 239 |
+
if "429" in last_exc or "RESOURCE_EXHAUSTED" in last_exc:
|
| 240 |
+
delay_match = re.search(r"retryDelay['\"]:\s*['\"](\d+(?:\.\d+)?)s", last_exc)
|
| 241 |
+
delay = float(delay_match.group(1)) if delay_match else 20.0
|
| 242 |
+
print(f"[WARN] rate-limited; sleeping {delay:.1f}s before retry", flush=True)
|
| 243 |
+
time.sleep(delay)
|
| 244 |
+
|
| 245 |
+
# FIX 5b: Exhausted retries — do NOT blindly finalize.
|
| 246 |
+
# Skip to a no-op inspect on the first visible ticket to keep the episode alive.
|
| 247 |
+
print("[WARN] JSON retry limit exhausted; emitting safe no-op", flush=True)
|
| 248 |
+
# observation.tickets may be a list of objects or a dict — handle both.
|
| 249 |
+
obs_dump = observation.model_dump()
|
| 250 |
+
raw_tickets = obs_dump.get("tickets", [])
|
| 251 |
+
if isinstance(raw_tickets, dict):
|
| 252 |
+
ticket_ids = list(raw_tickets.keys())
|
| 253 |
+
else:
|
| 254 |
+
# list of dicts — each item should have an "id" or similar field
|
| 255 |
+
ticket_ids = [
|
| 256 |
+
t.get("ticket_id") or t.get("id") or f"T{i+1}"
|
| 257 |
+
for i, t in enumerate(raw_tickets)
|
| 258 |
+
]
|
| 259 |
+
ticket_ids = ticket_ids or ["T1"]
|
| 260 |
+
|
| 261 |
+
inspected = {
|
| 262 |
+
json.loads(a)["target"]
|
| 263 |
+
for a in action_history
|
| 264 |
+
if json.loads(a).get("action_type") == "inspect_ticket"
|
| 265 |
+
}
|
| 266 |
+
target = next((t for t in ticket_ids if t not in inspected), ticket_ids[0])
|
| 267 |
+
fallback = Action(action_type="inspect_ticket", target=target, value="")
|
| 268 |
+
return fallback, last_exc
|
| 269 |
|
| 270 |
|
| 271 |
def clamp_score(score: float) -> float:
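The body of `clamp_score` sits outside this hunk. A minimal sketch consistent with the `Fix 1` comment (keep the score strictly inside `(0, 1)` because the grader rejects exact `0.0` and `1.0`); the epsilon value is an assumption:

```python
def clamp_score(score: float) -> float:
    # Sketch only: clamp into the strictly open interval (0, 1).
    # The real implementation may differ; 1e-6 is a guessed epsilon.
    eps = 1e-6
    return min(max(score, eps), 1.0 - eps)

print(clamp_score(1.0) < 1.0, clamp_score(0.0) > 0.0)  # True True
```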
|
|
|
|
| 285 |
if not available:
|
| 286 |
raise RuntimeError("No tasks available in the environment.")
|
| 287 |
|
|
|
|
| 288 |
primary = requested if requested in available else available[0]
|
| 289 |
others = [t for t in available if t != primary]
|
| 290 |
task_list = [primary] + others
|
|
|
|
| 296 |
env = SupportOpsEnv(task_id=task_name)
|
| 297 |
rewards: List[float] = []
|
| 298 |
action_history: List[str] = []
|
| 299 |
+
# FIX 6: Track which mandatory actions have been completed per ticket
|
| 300 |
+
# so we can warn the model and block premature finalize.
|
| 301 |
+
completed_per_ticket: dict[str, set] = {}
|
| 302 |
steps_taken = 0
|
| 303 |
score = 0.0
|
| 304 |
success = False
|
|
|
|
| 309 |
observation = env.reset(task_id=task_name)
|
| 310 |
|
| 311 |
for step in range(1, MAX_STEPS + 1):
|
| 312 |
+
action, action_error = get_model_action(
|
| 313 |
+
client, observation, step, rewards, action_history, completed_per_ticket
|
| 314 |
+
)
|
| 315 |
+
|
| 316 |
+
# FIX 7: Guard against premature finalize — if mandatory steps are still
|
| 317 |
+
# missing for any ticket, redirect to the first pending mandatory action
|
| 318 |
+
# instead of letting the model throw away the score.
|
| 319 |
+
if action.action_type == "finalize":
|
| 320 |
+
target = action.target or "T1"
|
| 321 |
+
missing = REQUIRED_PER_TICKET - completed_per_ticket.get(target, set())
|
| 322 |
+
if missing:
|
| 323 |
+
next_action_type = sorted(missing)[0] # deterministic ordering
|
| 324 |
+
print(
|
| 325 |
+
f"[GUARD] Premature finalize on {target}; redirecting to {next_action_type}",
|
| 326 |
+
flush=True,
|
| 327 |
+
)
|
| 328 |
+
# Pick the first valid value for the missing action type
|
| 329 |
+
FALLBACK_VALUES = {
|
| 330 |
+
"set_priority": "normal",
|
| 331 |
+
"set_route": "policy_appeals",
|
| 332 |
+
"set_resolution": "expedited_human_review",
|
| 333 |
+
}
|
| 334 |
+
action = Action(
|
| 335 |
+
action_type=next_action_type,
|
| 336 |
+
target=target,
|
| 337 |
+
value=FALLBACK_VALUES[next_action_type],
|
| 338 |
+
)
|
| 339 |
+
|
| 340 |
action_str = json.dumps(action.model_dump(), separators=(",", ":"))
|
| 341 |
action_history.append(action_str)
|
| 342 |
|
| 343 |
+
# Update completion tracker
|
| 344 |
+
if action.action_type in REQUIRED_PER_TICKET:
|
| 345 |
+
t = action.target or "T1"
|
| 346 |
+
completed_per_ticket.setdefault(t, set()).add(action.action_type)
|
| 347 |
+
|
| 348 |
observation, reward, done, info = env.step(action)
|
| 349 |
reward_value = reward.value
|
| 350 |
rewards.append(reward_value)
|
|
|
|
| 362 |
if done:
|
| 363 |
break
|
| 364 |
|
|
|
|
| 365 |
score = clamp_score(score)
|
| 366 |
success = score >= SUCCESS_SCORE_THRESHOLD
|
| 367 |
finally:
|
|
|
|
| 373 |
def main() -> None:
|
| 374 |
client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
|
| 375 |
| 376 |
tasks = list(reversed(select_tasks(TASK_NAME)))
|
| 377 |
|
| 378 |
all_results = []
|
|
|
|
| 380 |
result = run_task(client, task_name)
|
| 381 |
all_results.append(result)
|
| 382 |
|
|
|
|
| 383 |
total = len(all_results)
|
| 384 |
passed = sum(1 for r in all_results if r["success"])
|
| 385 |
avg_score = sum(r["score"] for r in all_results) / total if total else 0.0
|