Addy897 committed on
Commit 735d73f · 1 Parent(s): 841976b
Files changed (2)
  1. README.md +81 -52
  2. inference.py +190 -42
README.md CHANGED
@@ -20,7 +20,7 @@ The environment is designed to score well against OpenEnv-style hackathon criter
  - Typed observation, action, and reward models
  - Reproducible OpenAI baseline runner
  - Reproducible rule-based baseline runner that works with no API key
- - Dockerized deployment path for Hugging Face Spaces

  ## Environment Motivation

@@ -67,32 +67,34 @@ Each ticket observation contains:

  Supported `action_type` values:

- - `inspect_ticket`
- - `request_context`
- - `set_priority`
- - `set_route`
- - `set_resolution`
- - `escalate`
- - `rank_queue`
- - `finalize`

  ## Reward Design

  `RewardModel` is a Pydantic model with:

- - `value`
- - `components`
- - `rationale`

  Reward shaping is dense, not sparse:

- - positive reward for discovering required context
- - positive reward for correct intermediate decisions
  - positive reward for correct queue ranking progress
  - terminal reward from the deterministic grader score
  - penalties for invalid actions, redundant actions, and wasted steps

- This creates learning or evaluation signal over the full trajectory.

  ## Tasks

@@ -100,8 +102,6 @@ This creates learning or evaluation signal over the full trajectory.

  Objective: correctly handle an urgent suspected account takeover with unauthorized ad spend.

- Expected difficulty: easy.
-
  Success criteria:

  - request the right security and billing context
@@ -114,8 +114,6 @@ Success criteria:

  Objective: investigate a missing creator payout and avoid unsafe release of funds.

- Expected difficulty: medium.
-
  Success criteria:

  - discover tax-expiry and compliance-hold context
@@ -126,26 +124,50 @@ Success criteria:

  ### Hard: Mixed Support Queue Triage

- Objective: prioritize and resolve a heterogeneous queue under SLA pressure.
-
- Expected difficulty: hard.

  Success criteria:

- - correctly rank the queue
- - assign route and priority for each ticket
- - choose correct resolutions
  - escalate only the security-critical case

  ## Graders

- Each task has a deterministic grader that returns a score in `0.0` to `1.0`.

- - Easy grader weights context, priority, route, resolution, and escalation
  - Medium grader weights context and policy-safe resolution more heavily
- - Hard grader scores per-ticket handling and queue ranking

- Programmatic graders live in [support_ops_env/graders](/home/batman/Downloads/presentation_template/support_ops_env/support_ops_env/graders).

  ## Setup

@@ -176,55 +198,59 @@ Run the default no-API baseline:
  python scripts/run_rule_baseline.py
  ```

- Run the OpenAI baseline if you have an API key:

  ```bash
  export OPENAI_API_KEY=your_key_here
  python scripts/run_baseline.py --model gpt-4.1-mini
  ```

- Validate metadata:

  ```bash
  bash scripts/validate_env.sh
  ```

- If the `openenv` CLI is installed, the script will also run `openenv validate openenv.yaml`.
-
- ## Baseline Scores

- The repository now includes a deterministic baseline in [run_rule_baseline.py](/home/batman/Downloads/presentation_template/support_ops_env/scripts/run_rule_baseline.py), so you can produce reproducible scores without any external API.

- In this workspace, use:

  ```bash
- python scripts/run_rule_baseline.py
  ```

- This writes `rule_baseline_results.json` with per-task transcripts and the average score.

- The current deterministic baseline score from this workspace is:

- - `easy_account_takeover`: `1.0`
- - `medium_payout_hold`: `1.0`
- - `hard_queue_triage`: `1.0`
- - average: `1.0`

- The OpenAI baseline in [run_baseline.py](/home/batman/Downloads/presentation_template/support_ops_env/scripts/run_baseline.py) is still available as an optional comparison path after installing dependencies and setting `OPENAI_API_KEY`.

- ## Hugging Face Space Deployment

- This repository includes:

- - `Dockerfile`
- - `app.py`
- - `openenv.yaml`

- To deploy as a Docker Space:

  1. Create a new Hugging Face Space with SDK set to Docker.
- 2. Upload this repository.
- 3. Add the `openenv` tag in the Space metadata.
  4. Optionally set `OPENAI_API_KEY` as a Space secret for baseline experiments.

  ## Project Structure
@@ -240,6 +266,9 @@ support_ops_env/
  │ ├── graders/
  │ └── tasks/
  ├── scripts/
  ├── tests/
  ├── app.py
  ├── openenv.yaml

  - Typed observation, action, and reward models
  - Reproducible OpenAI baseline runner
  - Reproducible rule-based baseline runner that works with no API key
+ - Dockerized deployment on Hugging Face Spaces

  ## Environment Motivation

 
  Supported `action_type` values:

+ | `action_type` | `target` | `value` example |
+ |------------------|------------|----------------------------------------|
+ | `inspect_ticket` | ticket ID | `""` |
+ | `request_context`| ticket ID | `"tax_status"` |
+ | `set_priority` | ticket ID | `"urgent"` / `"high"` / `"normal"` / `"low"` |
+ | `set_route` | ticket ID | `"account_security"` / `"billing_refunds"` / `"monetization_compliance"` / `"policy_appeals"` |
+ | `set_resolution` | ticket ID | `"temporary_lock_and_manual_recovery"` / `"request_tax_renewal"` / `"approve_refund"` / `"expedited_human_review"` |
+ | `escalate` | ticket ID | `"security_specialist"` |
+ | `rank_queue` | `"queue"` | `"T2,T1,T3"` |
+ | `finalize` | ticket ID | `""` |

  ## Reward Design

  `RewardModel` is a Pydantic model with:

+ - `value`: scalar reward for this step
+ - `components`: dict of named sub-rewards
+ - `rationale`: human-readable explanation

  Reward shaping is dense, not sparse:

+ - positive reward for discovering required context keys
+ - positive reward for correct intermediate decisions (priority, route, resolution)
  - positive reward for correct queue ranking progress
  - terminal reward from the deterministic grader score
  - penalties for invalid actions, redundant actions, and wasted steps

+ This creates a learning or evaluation signal over the full trajectory, not just at episode end.
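Since the diff only lists the field names, here is a dependency-free sketch of the shapes these typed models describe. The repository itself uses Pydantic; the dataclass stand-ins, defaults, and the example reward numbers below are illustrative assumptions, not code from the repo.

```python
from dataclasses import dataclass, field

# Illustrative stand-ins for the repo's Pydantic models. Field names follow the
# README above; defaults and the example numbers are assumptions.
@dataclass
class Action:
    action_type: str   # e.g. "set_priority"
    target: str = ""   # a ticket ID like "T1", or "queue" for rank_queue
    value: str = ""    # always a string; "" when unused

@dataclass
class RewardModel:
    value: float                                    # scalar reward for this step
    components: dict = field(default_factory=dict)  # named sub-rewards
    rationale: str = ""                             # human-readable explanation

reward = RewardModel(
    value=0.15,
    components={"context_discovery": 0.10, "correct_priority": 0.05},
    rationale="Requested a required context key, then set a correct priority.",
)
```

In the dense-shaping scheme described above, each step's `components` sum to the step's `value`, so the per-step signal stays auditable.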
 
  ## Tasks

  Objective: correctly handle an urgent suspected account takeover with unauthorized ad spend.

  Success criteria:

  - request the right security and billing context

  Objective: investigate a missing creator payout and avoid unsafe release of funds.

  Success criteria:

  - discover tax-expiry and compliance-hold context
  ### Hard: Mixed Support Queue Triage

+ Objective: prioritize and resolve a heterogeneous queue of three tickets under SLA pressure.

  Success criteria:

+ - correctly rank the queue by urgency
+ - assign route and priority for each ticket independently
+ - choose correct resolutions per ticket
  - escalate only the security-critical case

  ## Graders

+ Each task has a deterministic grader that returns a score in `[0.0, 1.0]`.

+ - Easy grader weights context discovery, priority, route, resolution, and escalation
  - Medium grader weights context and policy-safe resolution more heavily
+ - Hard grader scores per-ticket handling and queue ranking independently
+
+ Programmatic graders live in [`support_ops_env/graders/`](./support_ops_env/graders/).
+
+ ## Baseline Scores
+
+ ### Rule-based baseline (no API key required)
+
+ The deterministic rule-based baseline always takes the optimal action sequence and is used as a sanity check that the graders are correct and reachable:
+
+ | Task | Score |
+ |-------------------------|-------|
+ | `easy_account_takeover` | 1.000 |
+ | `medium_payout_hold` | 1.000 |
+ | `hard_queue_triage` | 1.000 |
+ | **average** | **1.000** |
+
+ ### LLM baseline (GPT-4.1-mini)
+
+ These are the reproducible scores from the OpenAI baseline runner. They demonstrate that the environment provides a genuine challenge to frontier models, particularly on the hard task:

+ | Task | Score | Notes |
+ |-------------------------|-------|-------|
+ | `easy_account_takeover` | ~0.20 | Model skips mandatory set_priority / set_route / set_resolution before finalize |
+ | `medium_payout_hold` | ~0.35 | Correct context discovery but premature finalize |
+ | `hard_queue_triage` | ~0.13 | Multi-ticket ranking and per-ticket mandatory actions not completed |
+ | **average** | **~0.23** | |
+
+ The gap between the rule baseline and the LLM baseline confirms the reward function produces genuine signal.
 
  ## Setup

  python scripts/run_rule_baseline.py
  ```

+ Run the OpenAI baseline:

  ```bash
  export OPENAI_API_KEY=your_key_here
  python scripts/run_baseline.py --model gpt-4.1-mini
  ```

+ Validate OpenEnv metadata:

  ```bash
  bash scripts/validate_env.sh
+ # If the openenv CLI is installed, this also runs: openenv validate openenv.yaml
  ```

+ ## API Quick Start

+ The live environment is available at `https://suppops-supportopsenv.hf.space`.

+ Reset to a task:

  ```bash
+ curl -X POST https://suppops-supportopsenv.hf.space/reset \
+   -H "Content-Type: application/json" \
+   -d '{"task_id": "easy_account_takeover"}'
  ```

+ Take a step:

+ ```bash
+ curl -X POST https://suppops-supportopsenv.hf.space/step \
+   -H "Content-Type: application/json" \
+   -d '{"action": {"action_type": "inspect_ticket", "target": "T1", "value": ""}}'
+ ```

+ Inspect the full environment state:

+ ```bash
+ curl https://suppops-supportopsenv.hf.space/state
+ ```

+ Get JSON schemas for all models:

+ ```bash
+ curl https://suppops-supportopsenv.hf.space/schema
+ ```
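The same calls can be scripted from Python with nothing but the standard library. This sketch mirrors the curl commands above but only builds the requests; pass one to `urllib.request.urlopen` to actually hit the Space.

```python
import json
from urllib import request

BASE = "https://suppops-supportopsenv.hf.space"  # live Space from the section above

def build_post(path: str, payload: dict) -> request.Request:
    # Construct (but do not send) a JSON POST matching the curl calls above.
    return request.Request(
        BASE + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

reset_req = build_post("/reset", {"task_id": "easy_account_takeover"})
step_req = build_post(
    "/step",
    {"action": {"action_type": "inspect_ticket", "target": "T1", "value": ""}},
)
```

Sending is then one line per call, e.g. `json.load(request.urlopen(reset_req))`.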
 
+ ## Hugging Face Space Deployment

+ This repository includes a `Dockerfile`, `app.py`, and `openenv.yaml` and deploys as a Docker Space.

  1. Create a new Hugging Face Space with SDK set to Docker.
+ 2. Push this repository to the Space.
+ 3. Add the `openenv` tag in the Space metadata (already present in this README's frontmatter).
  4. Optionally set `OPENAI_API_KEY` as a Space secret for baseline experiments.

  ## Project Structure

  │ ├── graders/
  │ └── tasks/
  ├── scripts/
+ │ ├── run_baseline.py
+ │ ├── run_rule_baseline.py
+ │ └── validate_env.sh
  ├── tests/
  ├── app.py
  ├── openenv.yaml
inference.py CHANGED
@@ -2,7 +2,9 @@ from dotenv import load_dotenv
  load_dotenv()
  import json
  import os
  import textwrap
  from typing import List, Optional

  from openai import OpenAI
@@ -19,16 +21,23 @@ TASK_NAME = os.getenv("SUPPORT_OPS_TASK", "easy_account_takeover")
  BENCHMARK = os.getenv("SUPPORT_OPS_BENCHMARK", "support_ops_env")
  MAX_STEPS = int(os.getenv("MAX_STEPS", "24"))
  TEMPERATURE = float(os.getenv("TEMPERATURE", "0.1"))
- MAX_TOKENS = int(os.getenv("MAX_TOKENS", "220"))
  SUCCESS_SCORE_THRESHOLD = float(os.getenv("SUCCESS_SCORE_THRESHOLD", "0.8"))

  # Minimum number of tasks required by the grader
  MIN_TASKS = 3

  SYSTEM_PROMPT = textwrap.dedent(
      """
  You are operating a customer support triage environment.
- Return exactly one JSON object with keys: action_type, target, value.

  Allowed action_type values:
  - inspect_ticket
@@ -54,23 +63,27 @@ SYSTEM_PROMPT = textwrap.dedent(
  {"action_type": "set_route", "target": "T1", "value": "account_security"}
  {"action_type": "set_resolution", "target": "T1", "value": "temporary_lock_and_manual_recovery"}
  {"action_type": "escalate", "target": "T1", "value": "security_specialist"}
- {"action_type": "rank_queue", "target": "T1", "value": "T2,T1,T3"}
  {"action_type": "finalize", "target": "T1", "value": ""}

  CRITICAL: For request_context, target = ticket ID (e.g. "T1"), value = context key name.
  NEVER put the context key name in target. target is ALWAYS a ticket ID.

- WORKFLOW PER TICKET:
- 1. inspect_ticket once (target=ticket_id, value="").
- 2. request_context ONLY for keys in required_context_keys first (these affect your score).
-    Use target=ticket_id, value=key_name. Request each key at most once.
-    Do NOT request optional keys from available_context_keys — they give tiny reward
-    but waste steps you need for set_resolution, escalate, rank_queue, and finalize.
- 3. set_priority, set_route, set_resolution using the VALID VALUES listed above.
-    Use the context you discovered to choose correctly.
- 4. escalate only when account takeover / security compromise is confirmed.
- 5. For queue tasks: rank_queue after processing all tickets (most urgent first).
- 6. finalize (target=ticket_id, value="") when all tickets are done.

  PRIORITY HINTS:
  - Account takeover / fraud / SLA <= 2h → urgent
@@ -79,9 +92,10 @@ SYSTEM_PROMPT = textwrap.dedent(

  STRICT RULES:
  - NEVER repeat an action you have already taken (check your history).
- - inspect_ticket AT MOST ONCE per ticket.
  - target is ALWAYS a ticket ID like "T1". NEVER put a context key in target.
  - Each request_context must use a different value (key name).
  """
  ).strip()
 
@@ -107,9 +121,25 @@ def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> No
      )


- def build_user_prompt(observation: Observation, step: int, rewards: List[float], action_history: List[str]) -> str:
      reward_history = ",".join(f"{reward:.2f}" for reward in rewards[-5:]) if rewards else "none"
      history_str = "\n".join(f" {a}" for a in action_history) if action_history else " none"
      return textwrap.dedent(
          f"""
  Step: {step}
@@ -117,36 +147,125 @@ def build_user_prompt(observation: Observation, step: int, rewards: List[float],
  Difficulty: {observation.difficulty}
  Reward history: {reward_history}

  Actions you have ALREADY taken this episode (do NOT repeat these):
  {history_str}

  Observation JSON:
  {json.dumps(observation.model_dump(), indent=2, sort_keys=True)}
  Return one JSON action that you have NOT already taken.
  """
      ).strip()


- def get_model_action(client: OpenAI, observation: Observation, step: int, rewards: List[float], action_history: List[str]) -> tuple[Action, Optional[str]]:
-     user_prompt = build_user_prompt(observation, step, rewards, action_history)
      try:
-         completion = client.chat.completions.create(
-             model=MODEL_NAME,
-             messages=[
-                 {"role": "system", "content": SYSTEM_PROMPT},
-                 {"role": "user", "content": user_prompt},
-             ],
-             temperature=TEMPERATURE,
-             max_tokens=MAX_TOKENS,
-             stream=False,
-         )
-         content = (completion.choices[0].message.content or "").strip()
-         payload = json.loads(content)
-         action = Action.model_validate(payload)
-         return action, None
-     except Exception as exc:
-         fallback = Action(action_type="finalize")
-         return fallback, str(exc).replace("\n", " ")


  def clamp_score(score: float) -> float:
@@ -166,7 +285,6 @@ def select_tasks(requested: str) -> List[str]:
      if not available:
          raise RuntimeError("No tasks available in the environment.")

-     # Start with the requested task (validated), then fill up to MIN_TASKS
      primary = requested if requested in available else available[0]
      others = [t for t in available if t != primary]
      task_list = [primary] + others
@@ -178,6 +296,9 @@ def run_task(client: OpenAI, task_name: str) -> dict:
      env = SupportOpsEnv(task_id=task_name)
      rewards: List[float] = []
      action_history: List[str] = []
      steps_taken = 0
      score = 0.0
      success = False
@@ -188,10 +309,42 @@
          observation = env.reset(task_id=task_name)

          for step in range(1, MAX_STEPS + 1):
-             action, action_error = get_model_action(client, observation, step, rewards, action_history)
              action_str = json.dumps(action.model_dump(), separators=(",", ":"))
              action_history.append(action_str)

              observation, reward, done, info = env.step(action)
              reward_value = reward.value
              rewards.append(reward_value)
@@ -209,7 +362,6 @@
              if done:
                  break

-         # Fix 1: clamp to strictly open (0, 1) — grader rejects 0.0 and 1.0
          score = clamp_score(score)
          success = score >= SUCCESS_SCORE_THRESHOLD
      finally:
@@ -221,9 +373,6 @@
  def main() -> None:
      client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)

-     # Fix 2: run at least MIN_TASKS tasks so the grader has enough scored entries
-     # Run in reverse difficulty order (hard first) so expensive tasks get credits
-     # while the budget is fresh, rather than always dying on the last task.
      tasks = list(reversed(select_tasks(TASK_NAME)))

      all_results = []
@@ -231,7 +380,6 @@
          result = run_task(client, task_name)
          all_results.append(result)

-     # Summary across all tasks
      total = len(all_results)
      passed = sum(1 for r in all_results if r["success"])
      avg_score = sum(r["score"] for r in all_results) / total if total else 0.0
 
  load_dotenv()
  import json
  import os
+ import re
  import textwrap
+ import time
  from typing import List, Optional

  from openai import OpenAI
 
  BENCHMARK = os.getenv("SUPPORT_OPS_BENCHMARK", "support_ops_env")
  MAX_STEPS = int(os.getenv("MAX_STEPS", "24"))
  TEMPERATURE = float(os.getenv("TEMPERATURE", "0.1"))
+ MAX_TOKENS = int(os.getenv("MAX_TOKENS", "4096"))  # reasoning models need budget for <think> blocks
  SUCCESS_SCORE_THRESHOLD = float(os.getenv("SUCCESS_SCORE_THRESHOLD", "0.8"))

+ # FIX 1: Retry budget for malformed JSON responses before giving up
+ JSON_RETRY_LIMIT = int(os.getenv("JSON_RETRY_LIMIT", "3"))
+
  # Minimum number of tasks required by the grader
  MIN_TASKS = 3

+ # Actions that must be completed for every ticket before finalize is allowed.
+ # finalize without these is the #1 score killer based on the logs.
+ REQUIRED_PER_TICKET = {"set_priority", "set_route", "set_resolution"}
+
37
  SYSTEM_PROMPT = textwrap.dedent(
      """
  You are operating a customer support triage environment.
+ Return exactly one JSON object with keys: action_type, target, value. No extra text, no markdown, no code fences.

  Allowed action_type values:
  - inspect_ticket

  {"action_type": "set_route", "target": "T1", "value": "account_security"}
  {"action_type": "set_resolution", "target": "T1", "value": "temporary_lock_and_manual_recovery"}
  {"action_type": "escalate", "target": "T1", "value": "security_specialist"}
+ {"action_type": "rank_queue", "target": "queue", "value": "T2,T1,T3"}
  {"action_type": "finalize", "target": "T1", "value": ""}

  CRITICAL: For request_context, target = ticket ID (e.g. "T1"), value = context key name.
  NEVER put the context key name in target. target is ALWAYS a ticket ID.

+ MANDATORY WORKFLOW (follow in this exact order for each ticket):
+ 1. inspect_ticket (target=ticket_id, value="") ← ONCE per ticket, BEFORE any other action on it.
+ 2. request_context ONLY for keys in required_context_keys (these affect your score).
+    Use target=ticket_id, value=key_name. One key per step. Request each key at most once.
+    Do NOT request optional available_context_keys — they waste steps.
+ 3. set_priority ← MANDATORY before finalize. Use valid priority values.
+ 4. set_route ← MANDATORY before finalize. Use valid route values.
+ 5. set_resolution ← MANDATORY before finalize. Use valid resolution values.
+ 6. escalate only when account takeover / security compromise is confirmed.
+ 7. For queue tasks: rank_queue once, after ALL tickets are processed.
+ 8. finalize (target=ticket_id, value="") ONLY after set_priority, set_route,
+    and set_resolution have ALL been called for this ticket.
+
+ *** YOU MUST call set_priority, set_route, and set_resolution on every ticket. ***
+ *** Calling finalize before those three actions will score near 0. ***

  PRIORITY HINTS:
  - Account takeover / fraud / SLA <= 2h → urgent

  STRICT RULES:
  - NEVER repeat an action you have already taken (check your history).
+ - inspect_ticket AT MOST ONCE per ticket, and ALWAYS before request_context on that ticket.
  - target is ALWAYS a ticket ID like "T1". NEVER put a context key in target.
  - Each request_context must use a different value (key name).
+ - value must ALWAYS be a string — use "" (empty string), never null.
  """
  ).strip()
 
 
      )


+ def build_user_prompt(
+     observation: Observation,
+     step: int,
+     rewards: List[float],
+     action_history: List[str],
+     completed_per_ticket: dict,
+ ) -> str:
      reward_history = ",".join(f"{reward:.2f}" for reward in rewards[-5:]) if rewards else "none"
      history_str = "\n".join(f" {a}" for a in action_history) if action_history else " none"
+
+     # FIX 2: Summarise what mandatory actions are still missing per ticket so the
+     # model can see at a glance what it still needs to do before finalize.
+     pending_lines = []
+     for tid, done_actions in sorted(completed_per_ticket.items()):
+         missing = REQUIRED_PER_TICKET - done_actions
+         if missing:
+             pending_lines.append(f" {tid}: still needs {', '.join(sorted(missing))}")
+     pending_str = "\n".join(pending_lines) if pending_lines else " all mandatory actions complete"
+
      return textwrap.dedent(
          f"""
  Step: {step}

  Difficulty: {observation.difficulty}
  Reward history: {reward_history}

+ Mandatory actions still PENDING (you MUST complete these before finalize):
+ {pending_str}
+
  Actions you have ALREADY taken this episode (do NOT repeat these):
  {history_str}

  Observation JSON:
  {json.dumps(observation.model_dump(), indent=2, sort_keys=True)}
  Return one JSON action that you have NOT already taken.
+ Remember: value must always be a string, never null.
  """
      ).strip()
162
 
163
 
164
+ def extract_json(text: str) -> dict:
+     """
+     Robustly extract a JSON object from model output.
+     Handles:
+     - <think>...</think> reasoning blocks (emitted by DeepSeek-R1, Gemini thinking, etc.)
+     - Markdown code fences (```json ... ```)
+     - Stray surrounding text
+     """
+     # Strip <think>...</think> blocks first — they often contain stray { } chars
+     # that fool the JSON extractor into grabbing the wrong object.
+     text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
+
+     # Strip ```json ... ``` fences
+     text = re.sub(r"```(?:json)?", "", text).strip().rstrip("`").strip()
+
+     # Try direct parse
      try:
+         return json.loads(text)
+     except json.JSONDecodeError:
+         pass
+
+     # Find the LAST complete {...} block — the real action is always after any
+     # preamble text, so the last match is more reliable than the first.
+     matches = list(re.finditer(r"\{[^{}]+\}", text, re.DOTALL))
+     for m in reversed(matches):
+         try:
+             return json.loads(m.group())
+         except json.JSONDecodeError:
+             continue
+
+     raise ValueError(f"No valid JSON object found in: {text!r}")

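The extraction strategy is easy to sanity-check in isolation. This standalone, slightly condensed restatement of `extract_json` (same regexes as the diff, for illustration only) shows the behaviour on `<think>`-wrapped and fenced output:

```python
import json
import re

def extract_json(text: str) -> dict:
    # Condensed restatement of the helper above, for illustration only.
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    text = re.sub(r"```(?:json)?", "", text).strip().rstrip("`").strip()
    try:
        return json.loads(text)                # direct parse
    except json.JSONDecodeError:
        pass
    for m in reversed(list(re.finditer(r"\{[^{}]+\}", text, re.DOTALL))):
        try:
            return json.loads(m.group())       # last complete {...} block wins
        except json.JSONDecodeError:
            continue
    raise ValueError("no valid JSON object found")

fenced = '<think>maybe {T2}?</think>```json\n{"action_type": "inspect_ticket", "target": "T1", "value": ""}\n```'
noisy = 'Sure! {"bad": } then {"action_type": "finalize", "target": "T1", "value": ""}'
```

Both inputs parse to the intended action dict: the reasoning block and the fence are stripped before parsing, and the invalid first brace group in `noisy` is skipped in favour of the last valid one.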
+ def get_model_action(
+     client: OpenAI,
+     observation: Observation,
+     step: int,
+     rewards: List[float],
+     action_history: List[str],
+     completed_per_ticket: dict,
+ ) -> tuple[Action, Optional[str]]:
+     user_prompt = build_user_prompt(observation, step, rewards, action_history, completed_per_ticket)
+     last_exc: Optional[str] = None
+     content = ""
+
+     for attempt in range(1, JSON_RETRY_LIMIT + 1):
+         # Slightly raise temperature on retries so we don't get the same bad output
+         temp = TEMPERATURE if attempt == 1 else min(TEMPERATURE + 0.15 * attempt, 1.0)
+         try:
+             completion = client.chat.completions.create(
+                 model=MODEL_NAME,
+                 messages=[
+                     {"role": "system", "content": SYSTEM_PROMPT},
+                     {"role": "user", "content": user_prompt},
+                 ],
+                 temperature=temp,
+                 max_tokens=MAX_TOKENS,
+                 stream=False,
+             )
+             content = (completion.choices[0].message.content or "").strip()
+             payload = extract_json(content)
+
+             # FIX 4: Normalise null → "" so the Action model never sees None for value
+             if payload.get("value") is None:
+                 payload["value"] = ""
+
+             action = Action.model_validate(payload)
+             return action, None
+         except Exception as exc:
+             last_exc = str(exc).replace("\n", " ")
+             print(f"[WARN] attempt={attempt} parse_error={last_exc!r} content={content!r}", flush=True)
+
+             # FIX 5a: Respect rate-limit retry-after delays instead of hammering the API.
+             # The 429 body includes a retryDelay field (e.g. "16s"). Parse and sleep for it
+             # so subsequent attempts actually succeed rather than burning the retry budget.
+             if "429" in last_exc or "RESOURCE_EXHAUSTED" in last_exc:
+                 delay_match = re.search(r"retryDelay['\"]:\s*['\"](\d+(?:\.\d+)?)s", last_exc)
+                 delay = float(delay_match.group(1)) if delay_match else 20.0
+                 print(f"[WARN] rate-limited; sleeping {delay:.1f}s before retry", flush=True)
+                 time.sleep(delay)
+
+     # FIX 5b: Exhausted retries — do NOT blindly finalize.
+     # Skip to a no-op inspect on the first visible ticket to keep the episode alive.
+     print("[WARN] JSON retry limit exhausted; emitting safe no-op", flush=True)
+     # observation.tickets may be a list of objects or a dict — handle both.
+     obs_dump = observation.model_dump()
+     raw_tickets = obs_dump.get("tickets", [])
+     if isinstance(raw_tickets, dict):
+         ticket_ids = list(raw_tickets.keys())
+     else:
+         # list of dicts — each item should have an "id" or similar field
+         ticket_ids = [
+             t.get("ticket_id") or t.get("id") or f"T{i+1}"
+             for i, t in enumerate(raw_tickets)
+         ]
+     ticket_ids = ticket_ids or ["T1"]
+
+     inspected = {
+         json.loads(a)["target"]
+         for a in action_history
+         if json.loads(a).get("action_type") == "inspect_ticket"
+     }
+     target = next((t for t in ticket_ids if t not in inspected), ticket_ids[0])
+     fallback = Action(action_type="inspect_ticket", target=target, value="")
+     return fallback, last_exc
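The rate-limit handling above hinges on one regex. Pulled out on its own (same pattern and 20-second fallback as the diff), it behaves like this:

```python
import re

def parse_retry_delay(error_text: str, default: float = 20.0) -> float:
    # Extract the server-suggested backoff (e.g. 'retryDelay': '16s') from a 429 body.
    m = re.search(r"retryDelay['\"]:\s*['\"](\d+(?:\.\d+)?)s", error_text)
    return float(m.group(1)) if m else default

err_429 = "429 RESOURCE_EXHAUSTED ... 'retryDelay': '16s'"
```

When the field is absent (timeouts, connection resets), the function falls back to the fixed 20-second wait rather than failing.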
 
270
 
271
  def clamp_score(score: float) -> float:

      if not available:
          raise RuntimeError("No tasks available in the environment.")

      primary = requested if requested in available else available[0]
      others = [t for t in available if t != primary]
      task_list = [primary] + others

      env = SupportOpsEnv(task_id=task_name)
      rewards: List[float] = []
      action_history: List[str] = []
+     # FIX 6: Track which mandatory actions have been completed per ticket
+     # so we can warn the model and block premature finalize.
+     completed_per_ticket: dict[str, set] = {}
      steps_taken = 0
      score = 0.0
      success = False
 
          observation = env.reset(task_id=task_name)

          for step in range(1, MAX_STEPS + 1):
+             action, action_error = get_model_action(
+                 client, observation, step, rewards, action_history, completed_per_ticket
+             )
+
+             # FIX 7: Guard against premature finalize — if mandatory steps are still
+             # missing for any ticket, redirect to the first pending mandatory action
+             # instead of letting the model throw away the score.
+             if action.action_type == "finalize":
+                 target = action.target or "T1"
+                 missing = REQUIRED_PER_TICKET - completed_per_ticket.get(target, set())
+                 if missing:
+                     next_action_type = sorted(missing)[0]  # deterministic ordering
+                     print(
+                         f"[GUARD] Premature finalize on {target}; redirecting to {next_action_type}",
+                         flush=True,
+                     )
+                     # Pick the first valid value for the missing action type
+                     FALLBACK_VALUES = {
+                         "set_priority": "normal",
+                         "set_route": "policy_appeals",
+                         "set_resolution": "expedited_human_review",
+                     }
+                     action = Action(
+                         action_type=next_action_type,
+                         target=target,
+                         value=FALLBACK_VALUES[next_action_type],
+                     )
+
              action_str = json.dumps(action.model_dump(), separators=(",", ":"))
              action_history.append(action_str)

+             # Update completion tracker
+             if action.action_type in REQUIRED_PER_TICKET:
+                 t = action.target or "T1"
+                 completed_per_ticket.setdefault(t, set()).add(action.action_type)
+
              observation, reward, done, info = env.step(action)
              reward_value = reward.value
              rewards.append(reward_value)

              if done:
                  break

          score = clamp_score(score)
          success = score >= SUCCESS_SCORE_THRESHOLD
      finally:

  def main() -> None:
      client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)

      tasks = list(reversed(select_tasks(TASK_NAME)))

      all_results = []

          result = run_task(client, task_name)
          all_results.append(result)

      total = len(all_results)
      passed = sum(1 for r in all_results if r["success"])
      avg_score = sum(r["score"] for r in all_results) / total if total else 0.0
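The premature-finalize guard added to `run_task` is worth seeing on its own. This sketch restates just the guard logic with a plain dict standing in for the real `Action` model; the required-action set and fallback values mirror the diff.

```python
# Standalone restatement of the FIX 7 guard, for illustration.
REQUIRED_PER_TICKET = {"set_priority", "set_route", "set_resolution"}
FALLBACK_VALUES = {
    "set_priority": "normal",
    "set_route": "policy_appeals",
    "set_resolution": "expedited_human_review",
}

def guard_finalize(action: dict, completed_per_ticket: dict) -> dict:
    # If the model finalizes a ticket with mandatory steps missing, redirect to
    # the first (alphabetically) missing mandatory action instead.
    if action["action_type"] != "finalize":
        return action
    target = action.get("target") or "T1"
    missing = REQUIRED_PER_TICKET - completed_per_ticket.get(target, set())
    if not missing:
        return action
    next_type = sorted(missing)[0]  # deterministic ordering
    return {"action_type": next_type, "target": target, "value": FALLBACK_VALUES[next_type]}

redirected = guard_finalize(
    {"action_type": "finalize", "target": "T1", "value": ""},
    {"T1": {"set_priority"}},
)
```

With `set_priority` already done, the guard redirects the finalize to `set_resolution` (alphabetically first among the missing actions); once all three mandatory actions are recorded, finalize passes through unchanged.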