Final
- README.md +81 -52
- inference.py +190 -42
README.md
CHANGED
|
@@ -20,7 +20,7 @@ The environment is designed to score well against OpenEnv-style hackathon criter
|
|
| 20 |
- Typed observation, action, and reward models
|
| 21 |
- Reproducible OpenAI baseline runner
|
| 22 |
- Reproducible rule-based baseline runner that works with no API key
|
| 23 |
-
- Dockerized deployment
|
| 24 |
|
| 25 |
## Environment Motivation
|
| 26 |
|
|
@@ -67,32 +67,34 @@ Each ticket observation contains:
|
|
| 67 |
|
| 68 |
Supported `action_type` values:
|
| 69 |
|
| 70 |
-
|
| 71 |
-
-
|
| 72 |
-
|
| 73 |
-
|
| 74 |
-
|
| 75 |
-
|
| 76 |
-
|
| 77 |
-
|
| 78 |
|
| 79 |
## Reward Design
|
| 80 |
|
| 81 |
`RewardModel` is a Pydantic model with:
|
| 82 |
|
| 83 |
-
- `value`
|
| 84 |
-
- `components`
|
| 85 |
-
- `rationale`
|
| 86 |
|
| 87 |
Reward shaping is dense, not sparse:
|
| 88 |
|
| 89 |
-
- positive reward for discovering required context
|
| 90 |
-
- positive reward for correct intermediate decisions
|
| 91 |
- positive reward for correct queue ranking progress
|
| 92 |
- terminal reward from the deterministic grader score
|
| 93 |
- penalties for invalid actions, redundant actions, and wasted steps
|
| 94 |
|
| 95 |
-
This creates learning or evaluation signal over the full trajectory.
|
| 96 |
|
| 97 |
## Tasks
|
| 98 |
|
|
@@ -100,8 +102,6 @@ This creates learning or evaluation signal over the full trajectory.
|
|
| 100 |
|
| 101 |
Objective: correctly handle an urgent suspected account takeover with unauthorized ad spend.
|
| 102 |
|
| 103 |
-
Expected difficulty: easy.
|
| 104 |
-
|
| 105 |
Success criteria:
|
| 106 |
|
| 107 |
- request the right security and billing context
|
|
@@ -114,8 +114,6 @@ Success criteria:
|
|
| 114 |
|
| 115 |
Objective: investigate a missing creator payout and avoid unsafe release of funds.
|
| 116 |
|
| 117 |
-
Expected difficulty: medium.
|
| 118 |
-
|
| 119 |
Success criteria:
|
| 120 |
|
| 121 |
- discover tax-expiry and compliance-hold context
|
|
@@ -126,26 +124,50 @@ Success criteria:
|
|
| 126 |
|
| 127 |
### Hard: Mixed Support Queue Triage
|
| 128 |
|
| 129 |
-
Objective: prioritize and resolve a heterogeneous queue under SLA pressure.
|
| 130 |
-
|
| 131 |
-
Expected difficulty: hard.
|
| 132 |
|
| 133 |
Success criteria:
|
| 134 |
|
| 135 |
-
- correctly rank the queue
|
| 136 |
-
- assign route and priority for each ticket
|
| 137 |
-
- choose correct resolutions
|
| 138 |
- escalate only the security-critical case
|
| 139 |
|
| 140 |
## Graders
|
| 141 |
|
| 142 |
-
Each task has a deterministic grader that returns a score in `0.0`–`1.0`.
|
| 143 |
|
| 144 |
-
- Easy grader weights context, priority, route, resolution, and escalation
|
| 145 |
- Medium grader weights context and policy-safe resolution more heavily
|
| 146 |
-
- Hard grader scores per-ticket handling and queue ranking
|
| 147 |
|
| 148 |
-
|
| 149 |
|
| 150 |
## Setup
|
| 151 |
|
|
@@ -176,55 +198,59 @@ Run the default no-API baseline:
|
|
| 176 |
python scripts/run_rule_baseline.py
|
| 177 |
```
|
| 178 |
|
| 179 |
-
Run the OpenAI baseline
|
| 180 |
|
| 181 |
```bash
|
| 182 |
export OPENAI_API_KEY=your_key_here
|
| 183 |
python scripts/run_baseline.py --model gpt-4.1-mini
|
| 184 |
```
|
| 185 |
|
| 186 |
-
Validate metadata:
|
| 187 |
|
| 188 |
```bash
|
| 189 |
bash scripts/validate_env.sh
|
|
|
|
| 190 |
```
|
| 191 |
|
| 192 |
-
|
| 193 |
-
|
| 194 |
-
## Baseline Scores
|
| 195 |
|
| 196 |
-
The
|
| 197 |
|
| 198 |
-
|
| 199 |
|
| 200 |
```bash
|
| 201 |
-
|
|
|
|
|
|
|
| 202 |
```
|
| 203 |
|
| 204 |
-
|
| 205 |
|
| 206 |
-
|
| 207 |
|
| 208 |
-
|
| 209 |
-
- `medium_payout_hold`: `1.0`
|
| 210 |
-
- `hard_queue_triage`: `1.0`
|
| 211 |
-
- average: `1.0`
|
| 212 |
|
| 213 |
-
|
|
|
|
|
|
|
| 214 |
|
| 215 |
-
|
| 216 |
|
| 217 |
-
|
|
|
|
|
|
|
| 218 |
|
| 219 |
-
|
| 220 |
-
- `app.py`
|
| 221 |
-
- `openenv.yaml`
|
| 222 |
|
| 223 |
-
|
| 224 |
|
| 225 |
1. Create a new Hugging Face Space with SDK set to Docker.
|
| 226 |
-
2.
|
| 227 |
-
3. Add the `openenv` tag in the Space metadata.
|
| 228 |
4. Optionally set `OPENAI_API_KEY` as a Space secret for baseline experiments.
|
| 229 |
|
| 230 |
## Project Structure
|
|
@@ -240,6 +266,9 @@ support_ops_env/
|
|
| 240 |
│ ├── graders/
|
| 241 |
│ └── tasks/
|
| 242 |
├── scripts/
|
| 243 |
├── tests/
|
| 244 |
├── app.py
|
| 245 |
├── openenv.yaml
|
|
|
|
| 20 |
- Typed observation, action, and reward models
|
| 21 |
- Reproducible OpenAI baseline runner
|
| 22 |
- Reproducible rule-based baseline runner that works with no API key
|
| 23 |
+
- Dockerized deployment on Hugging Face Spaces
|
| 24 |
|
| 25 |
## Environment Motivation
|
| 26 |
|
|
|
|
| 67 |
|
| 68 |
Supported `action_type` values:
|
| 69 |
|
| 70 |
+
| `action_type` | `target` | `value` example |
|
| 71 |
+
|------------------|------------|----------------------------------------|
|
| 72 |
+
| `inspect_ticket` | ticket ID | `""` |
|
| 73 |
+
| `request_context`| ticket ID | `"tax_status"` |
|
| 74 |
+
| `set_priority` | ticket ID | `"urgent"` / `"high"` / `"normal"` / `"low"` |
|
| 75 |
+
| `set_route` | ticket ID | `"account_security"` / `"billing_refunds"` / `"monetization_compliance"` / `"policy_appeals"` |
|
| 76 |
+
| `set_resolution` | ticket ID | `"temporary_lock_and_manual_recovery"` / `"request_tax_renewal"` / `"approve_refund"` / `"expedited_human_review"` |
|
| 77 |
+
| `escalate` | ticket ID | `"security_specialist"` |
|
| 78 |
+
| `rank_queue` | `"queue"` | `"T2,T1,T3"` |
|
| 79 |
+
| `finalize` | ticket ID | `""` |
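Read together, the table defines a small JSON protocol. A hypothetical episode for a single-ticket task could look like this (the ticket ID, key names, and ordering are illustrative, not the graded optimum):

```python
import json

# Hypothetical action sequence for a single-ticket security task, using only
# action_type / target / value combinations from the table above.
episode = [
    {"action_type": "inspect_ticket",  "target": "T1", "value": ""},
    {"action_type": "request_context", "target": "T1", "value": "tax_status"},
    {"action_type": "set_priority",    "target": "T1", "value": "urgent"},
    {"action_type": "set_route",       "target": "T1", "value": "account_security"},
    {"action_type": "set_resolution",  "target": "T1", "value": "temporary_lock_and_manual_recovery"},
    {"action_type": "escalate",        "target": "T1", "value": "security_specialist"},
    {"action_type": "finalize",        "target": "T1", "value": ""},
]
for action in episode:
    print(json.dumps(action, separators=(",", ":")))
```

Note that `value` is always a string, even when unused (`""`), and `finalize` comes last.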
|
| 80 |
|
| 81 |
## Reward Design
|
| 82 |
|
| 83 |
`RewardModel` is a Pydantic model with:
|
| 84 |
|
| 85 |
+
- `value`: scalar reward for this step
|
| 86 |
+
- `components`: dict of named sub-rewards
|
| 87 |
+
- `rationale`: human-readable explanation
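The documented shape can be sketched with a plain dataclass (used here only so the snippet is self-contained; the real `RewardModel` is Pydantic, and the component names below are invented for illustration):

```python
from dataclasses import dataclass, field

@dataclass
class RewardModelSketch:
    # Mirrors the three documented fields; only the field names come from the README.
    value: float                      # scalar reward for this step
    components: dict = field(default_factory=dict)  # named sub-rewards
    rationale: str = ""               # human-readable explanation

r = RewardModelSketch(
    value=0.15,
    components={"context_discovery": 0.10, "correct_priority": 0.05},
    rationale="Requested a required context key and set the right priority.",
)
print(r.value)  # 0.15
```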
|
| 88 |
|
| 89 |
Reward shaping is dense, not sparse:
|
| 90 |
|
| 91 |
+
- positive reward for discovering required context keys
|
| 92 |
+
- positive reward for correct intermediate decisions (priority, route, resolution)
|
| 93 |
- positive reward for correct queue ranking progress
|
| 94 |
- terminal reward from the deterministic grader score
|
| 95 |
- penalties for invalid actions, redundant actions, and wasted steps
|
| 96 |
|
| 97 |
+
This creates a learning or evaluation signal over the full trajectory, not just at episode end.
|
| 98 |
|
| 99 |
## Tasks
|
| 100 |
|
|
|
|
| 102 |
|
| 103 |
Objective: correctly handle an urgent suspected account takeover with unauthorized ad spend.
|
| 104 |
|
|
|
|
|
|
|
| 105 |
Success criteria:
|
| 106 |
|
| 107 |
- request the right security and billing context
|
|
|
|
| 114 |
|
| 115 |
Objective: investigate a missing creator payout and avoid unsafe release of funds.
|
| 116 |
|
|
|
|
|
|
|
| 117 |
Success criteria:
|
| 118 |
|
| 119 |
- discover tax-expiry and compliance-hold context
|
|
|
|
| 124 |
|
| 125 |
### Hard: Mixed Support Queue Triage
|
| 126 |
|
| 127 |
+
Objective: prioritize and resolve a heterogeneous queue of three tickets under SLA pressure.
|
|
|
|
|
|
|
| 128 |
|
| 129 |
Success criteria:
|
| 130 |
|
| 131 |
+
- correctly rank the queue by urgency
|
| 132 |
+
- assign route and priority for each ticket independently
|
| 133 |
+
- choose correct resolutions per ticket
|
| 134 |
- escalate only the security-critical case
|
| 135 |
|
| 136 |
## Graders
|
| 137 |
|
| 138 |
+
Each task has a deterministic grader that returns a score in `[0.0, 1.0]`.
|
| 139 |
|
| 140 |
+
- Easy grader weights context discovery, priority, route, resolution, and escalation
|
| 141 |
- Medium grader weights context and policy-safe resolution more heavily
|
| 142 |
+
- Hard grader scores per-ticket handling and queue ranking independently
|
| 143 |
+
|
| 144 |
+
Programmatic graders live in [`support_ops_env/graders/`](./support_ops_env/graders/).
|
| 145 |
+
|
| 146 |
+
## Baseline Scores
|
| 147 |
+
|
| 148 |
+
### Rule-based baseline (no API key required)
|
| 149 |
+
|
| 150 |
+
The deterministic rule-based baseline always takes the optimal action sequence and is used as a sanity check that the graders are correct and reachable:
|
| 151 |
+
|
| 152 |
+
| Task | Score |
|
| 153 |
+
|-------------------------|-------|
|
| 154 |
+
| `easy_account_takeover` | 1.000 |
|
| 155 |
+
| `medium_payout_hold` | 1.000 |
|
| 156 |
+
| `hard_queue_triage` | 1.000 |
|
| 157 |
+
| **average** | **1.000** |
|
| 158 |
+
|
| 159 |
+
### LLM baseline (GPT-4.1-mini)
|
| 160 |
+
|
| 161 |
+
These are the reproducible scores from the OpenAI baseline runner. They demonstrate that the environment provides a genuine challenge to frontier models, particularly on the hard task:
|
| 162 |
|
| 163 |
+
| Task | Score | Notes |
|
| 164 |
+
|-------------------------|-------|-------|
|
| 165 |
+
| `easy_account_takeover` | ~0.20 | Model skips mandatory set_priority / set_route / set_resolution before finalize |
|
| 166 |
+
| `medium_payout_hold` | ~0.35 | Correct context discovery but premature finalize |
|
| 167 |
+
| `hard_queue_triage` | ~0.13 | Multi-ticket ranking and per-ticket mandatory actions not completed |
|
| 168 |
+
| **average** | **~0.23** | |
|
| 169 |
+
|
| 170 |
+
The gap between the rule baseline and the LLM baseline confirms the reward function produces genuine signal and the hard task challenges frontier models.
|
| 171 |
|
| 172 |
## Setup
|
| 173 |
|
|
|
|
| 198 |
python scripts/run_rule_baseline.py
|
| 199 |
```
|
| 200 |
|
| 201 |
+
Run the OpenAI baseline:
|
| 202 |
|
| 203 |
```bash
|
| 204 |
export OPENAI_API_KEY=your_key_here
|
| 205 |
python scripts/run_baseline.py --model gpt-4.1-mini
|
| 206 |
```
|
| 207 |
|
| 208 |
+
Validate OpenEnv metadata:
|
| 209 |
|
| 210 |
```bash
|
| 211 |
bash scripts/validate_env.sh
|
| 212 |
+
# If the openenv CLI is installed, this also runs: openenv validate openenv.yaml
|
| 213 |
```
|
| 214 |
|
| 215 |
+
## API Quick Start
|
| 216 |
|
| 217 |
+
The live environment is available at `https://suppops-supportopsenv.hf.space`.
|
| 218 |
|
| 219 |
+
Reset to a task:
|
| 220 |
|
| 221 |
```bash
|
| 222 |
+
curl -X POST https://suppops-supportopsenv.hf.space/reset \
|
| 223 |
+
-H "Content-Type: application/json" \
|
| 224 |
+
-d '{"task_id": "easy_account_takeover"}'
|
| 225 |
```
|
| 226 |
|
| 227 |
+
Take a step:
|
| 228 |
|
| 229 |
+
```bash
|
| 230 |
+
curl -X POST https://suppops-supportopsenv.hf.space/step \
|
| 231 |
+
-H "Content-Type: application/json" \
|
| 232 |
+
-d '{"action": {"action_type": "inspect_ticket", "target": "T1", "value": ""}}'
|
| 233 |
+
```
|
| 234 |
|
| 235 |
+
Inspect the full environment state:
|
| 236 |
|
| 237 |
+
```bash
|
| 238 |
+
curl https://suppops-supportopsenv.hf.space/state
|
| 239 |
+
```
|
| 240 |
|
| 241 |
+
Get JSON schemas for all models:
|
| 242 |
|
| 243 |
+
```bash
|
| 244 |
+
curl https://suppops-supportopsenv.hf.space/schema
|
| 245 |
+
```
|
| 246 |
|
| 247 |
+
## Hugging Face Space Deployment
|
| 248 |
|
| 249 |
+
This repository includes a `Dockerfile`, `app.py`, and `openenv.yaml` and deploys as a Docker Space.
|
| 250 |
|
| 251 |
1. Create a new Hugging Face Space with SDK set to Docker.
|
| 252 |
+
2. Push this repository to the Space.
|
| 253 |
+
3. Add the `openenv` tag in the Space metadata (already present in this README's frontmatter).
|
| 254 |
4. Optionally set `OPENAI_API_KEY` as a Space secret for baseline experiments.
|
| 255 |
|
| 256 |
## Project Structure
|
|
|
|
| 266 |
│ ├── graders/
|
| 267 |
│ └── tasks/
|
| 268 |
├── scripts/
|
| 269 |
+
│ ├── run_baseline.py
|
| 270 |
+
│ ├── run_rule_baseline.py
|
| 271 |
+
│ └── validate_env.sh
|
| 272 |
├── tests/
|
| 273 |
├── app.py
|
| 274 |
├── openenv.yaml
|
inference.py
CHANGED
|
@@ -2,7 +2,9 @@ from dotenv import load_dotenv
|
|
| 2 |
load_dotenv()
|
| 3 |
import json
|
| 4 |
import os
|
|
|
|
| 5 |
import textwrap
|
|
|
|
| 6 |
from typing import List, Optional
|
| 7 |
|
| 8 |
from openai import OpenAI
|
|
@@ -19,16 +21,23 @@ TASK_NAME = os.getenv("SUPPORT_OPS_TASK", "easy_account_takeover")
|
|
| 19 |
BENCHMARK = os.getenv("SUPPORT_OPS_BENCHMARK", "support_ops_env")
|
| 20 |
MAX_STEPS = int(os.getenv("MAX_STEPS", "24"))
|
| 21 |
TEMPERATURE = float(os.getenv("TEMPERATURE", "0.1"))
|
| 22 |
-
MAX_TOKENS = int(os.getenv("MAX_TOKENS", "
|
| 23 |
SUCCESS_SCORE_THRESHOLD = float(os.getenv("SUCCESS_SCORE_THRESHOLD", "0.8"))
|
| 24 |
|
| 25 |
# Minimum number of tasks required by the grader
|
| 26 |
MIN_TASKS = 3
|
| 27 |
|
| 28 |
SYSTEM_PROMPT = textwrap.dedent(
|
| 29 |
"""
|
| 30 |
You are operating a customer support triage environment.
|
| 31 |
-
Return exactly one JSON object with keys: action_type, target, value.
|
| 32 |
|
| 33 |
Allowed action_type values:
|
| 34 |
- inspect_ticket
|
|
@@ -54,23 +63,27 @@ SYSTEM_PROMPT = textwrap.dedent(
|
|
| 54 |
{"action_type": "set_route", "target": "T1", "value": "account_security"}
|
| 55 |
{"action_type": "set_resolution", "target": "T1", "value": "temporary_lock_and_manual_recovery"}
|
| 56 |
{"action_type": "escalate", "target": "T1", "value": "security_specialist"}
|
| 57 |
-
{"action_type": "rank_queue", "target": "
|
| 58 |
{"action_type": "finalize", "target": "T1", "value": ""}
|
| 59 |
|
| 60 |
CRITICAL: For request_context, target = ticket ID (e.g. "T1"), value = context key name.
|
| 61 |
NEVER put the context key name in target. target is ALWAYS a ticket ID.
|
| 62 |
|
| 63 |
-
WORKFLOW
|
| 64 |
-
1. inspect_ticket
|
| 65 |
-
2. request_context ONLY for keys in required_context_keys
|
| 66 |
-
Use target=ticket_id, value=key_name. Request each key at most once.
|
| 67 |
-
Do NOT request optional
|
| 68 |
-
|
| 69 |
-
|
| 70 |
-
|
| 71 |
-
|
| 72 |
-
|
| 73 |
-
|
|
| 74 |
|
| 75 |
PRIORITY HINTS:
|
| 76 |
- Account takeover / fraud / SLA <= 2h → urgent
|
|
@@ -79,9 +92,10 @@ SYSTEM_PROMPT = textwrap.dedent(
|
|
| 79 |
|
| 80 |
STRICT RULES:
|
| 81 |
- NEVER repeat an action you have already taken (check your history).
|
| 82 |
-
- inspect_ticket AT MOST ONCE per ticket.
|
| 83 |
- target is ALWAYS a ticket ID like "T1". NEVER put a context key in target.
|
| 84 |
- Each request_context must use a different value (key name).
|
|
|
|
| 85 |
"""
|
| 86 |
).strip()
|
| 87 |
|
|
@@ -107,9 +121,25 @@ def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> No
|
|
| 107 |
)
|
| 108 |
|
| 109 |
|
| 110 |
-
def build_user_prompt(
|
| 111 |
reward_history = ",".join(f"{reward:.2f}" for reward in rewards[-5:]) if rewards else "none"
|
| 112 |
history_str = "\n".join(f" {a}" for a in action_history) if action_history else " none"
|
| 113 |
return textwrap.dedent(
|
| 114 |
f"""
|
| 115 |
Step: {step}
|
|
@@ -117,36 +147,125 @@ def build_user_prompt(observation: Observation, step: int, rewards: List[float],
|
|
| 117 |
Difficulty: {observation.difficulty}
|
| 118 |
Reward history: {reward_history}
|
| 119 |
|
|
|
|
|
|
|
|
|
|
| 120 |
Actions you have ALREADY taken this episode (do NOT repeat these):
|
| 121 |
{history_str}
|
| 122 |
|
| 123 |
Observation JSON:
|
| 124 |
{json.dumps(observation.model_dump(), indent=2, sort_keys=True)}
|
| 125 |
Return one JSON action that you have NOT already taken.
|
|
|
|
| 126 |
"""
|
| 127 |
).strip()
|
| 128 |
|
| 129 |
|
| 130 |
-
def get_model_action(
|
| 131 |
-
|
| 132 |
try:
|
| 133 |
-
|
| 134 |
-
|
| 135 |
-
|
| 136 |
-
|
| 137 |
-
|
| 138 |
-
|
| 139 |
-
|
| 140 |
-
|
| 141 |
-
|
| 142 |
-
|
| 143 |
-
|
| 144 |
-
|
| 145 |
-
|
| 146 |
-
|
| 147 |
-
|
| 148 |
-
|
| 149 |
-
|
|
| 150 |
|
| 151 |
|
| 152 |
def clamp_score(score: float) -> float:
|
|
@@ -166,7 +285,6 @@ def select_tasks(requested: str) -> List[str]:
|
|
| 166 |
if not available:
|
| 167 |
raise RuntimeError("No tasks available in the environment.")
|
| 168 |
|
| 169 |
-
# Start with the requested task (validated), then fill up to MIN_TASKS
|
| 170 |
primary = requested if requested in available else available[0]
|
| 171 |
others = [t for t in available if t != primary]
|
| 172 |
task_list = [primary] + others
|
|
@@ -178,6 +296,9 @@ def run_task(client: OpenAI, task_name: str) -> dict:
|
|
| 178 |
env = SupportOpsEnv(task_id=task_name)
|
| 179 |
rewards: List[float] = []
|
| 180 |
action_history: List[str] = []
|
|
|
|
|
|
|
|
|
|
| 181 |
steps_taken = 0
|
| 182 |
score = 0.0
|
| 183 |
success = False
|
|
@@ -188,10 +309,42 @@ def run_task(client: OpenAI, task_name: str) -> dict:
|
|
| 188 |
observation = env.reset(task_id=task_name)
|
| 189 |
|
| 190 |
for step in range(1, MAX_STEPS + 1):
|
| 191 |
-
action, action_error = get_model_action(
|
| 192 |
action_str = json.dumps(action.model_dump(), separators=(",", ":"))
|
| 193 |
action_history.append(action_str)
|
| 194 |
|
| 195 |
observation, reward, done, info = env.step(action)
|
| 196 |
reward_value = reward.value
|
| 197 |
rewards.append(reward_value)
|
|
@@ -209,7 +362,6 @@ def run_task(client: OpenAI, task_name: str) -> dict:
|
|
| 209 |
if done:
|
| 210 |
break
|
| 211 |
|
| 212 |
-
# Fix 1: clamp to strictly open (0, 1) — grader rejects 0.0 and 1.0
|
| 213 |
score = clamp_score(score)
|
| 214 |
success = score >= SUCCESS_SCORE_THRESHOLD
|
| 215 |
finally:
|
|
@@ -221,9 +373,6 @@ def run_task(client: OpenAI, task_name: str) -> dict:
|
|
| 221 |
def main() -> None:
|
| 222 |
client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
|
| 223 |
|
| 224 |
-
# Fix 2: run at least MIN_TASKS tasks so the grader has enough scored entries
|
| 225 |
-
# Run in reverse difficulty order (hard first) so expensive tasks get credits
|
| 226 |
-
# while the budget is fresh, rather than always dying on the last task.
|
| 227 |
tasks = list(reversed(select_tasks(TASK_NAME)))
|
| 228 |
|
| 229 |
all_results = []
|
|
@@ -231,7 +380,6 @@ def main() -> None:
|
|
| 231 |
result = run_task(client, task_name)
|
| 232 |
all_results.append(result)
|
| 233 |
|
| 234 |
-
# Summary across all tasks
|
| 235 |
total = len(all_results)
|
| 236 |
passed = sum(1 for r in all_results if r["success"])
|
| 237 |
avg_score = sum(r["score"] for r in all_results) / total if total else 0.0
|
|
|
|
| 2 |
load_dotenv()
|
| 3 |
import json
|
| 4 |
import os
|
| 5 |
+
import re
|
| 6 |
import textwrap
|
| 7 |
+
import time
|
| 8 |
from typing import List, Optional
|
| 9 |
|
| 10 |
from openai import OpenAI
|
|
|
|
| 21 |
BENCHMARK = os.getenv("SUPPORT_OPS_BENCHMARK", "support_ops_env")
|
| 22 |
MAX_STEPS = int(os.getenv("MAX_STEPS", "24"))
|
| 23 |
TEMPERATURE = float(os.getenv("TEMPERATURE", "0.1"))
|
| 24 |
+
MAX_TOKENS = int(os.getenv("MAX_TOKENS", "4096")) # reasoning models need budget for <think> blocks
|
| 25 |
SUCCESS_SCORE_THRESHOLD = float(os.getenv("SUCCESS_SCORE_THRESHOLD", "0.8"))
|
| 26 |
|
| 27 |
+
# FIX 1: Retry budget for malformed JSON responses before giving up
|
| 28 |
+
JSON_RETRY_LIMIT = int(os.getenv("JSON_RETRY_LIMIT", "3"))
|
| 29 |
+
|
| 30 |
# Minimum number of tasks required by the grader
|
| 31 |
MIN_TASKS = 3
|
| 32 |
|
| 33 |
+
# Actions that must be completed for every ticket before finalize is allowed.
|
| 34 |
+
# finalize without these is the #1 score killer based on the logs.
|
| 35 |
+
REQUIRED_PER_TICKET = {"set_priority", "set_route", "set_resolution"}
|
| 36 |
+
|
| 37 |
SYSTEM_PROMPT = textwrap.dedent(
|
| 38 |
"""
|
| 39 |
You are operating a customer support triage environment.
|
| 40 |
+
Return exactly one JSON object with keys: action_type, target, value. No extra text, no markdown, no code fences.
|
| 41 |
|
| 42 |
Allowed action_type values:
|
| 43 |
- inspect_ticket
|
|
|
|
| 63 |
{"action_type": "set_route", "target": "T1", "value": "account_security"}
|
| 64 |
{"action_type": "set_resolution", "target": "T1", "value": "temporary_lock_and_manual_recovery"}
|
| 65 |
{"action_type": "escalate", "target": "T1", "value": "security_specialist"}
|
| 66 |
+
{"action_type": "rank_queue", "target": "queue", "value": "T2,T1,T3"}
|
| 67 |
{"action_type": "finalize", "target": "T1", "value": ""}
|
| 68 |
|
| 69 |
CRITICAL: For request_context, target = ticket ID (e.g. "T1"), value = context key name.
|
| 70 |
NEVER put the context key name in target. target is ALWAYS a ticket ID.
|
| 71 |
|
| 72 |
+
MANDATORY WORKFLOW — follow in this exact order for each ticket:
|
| 73 |
+
1. inspect_ticket (target=ticket_id, value="") ← ONCE per ticket, BEFORE any other action on it.
|
| 74 |
+
2. request_context ONLY for keys in required_context_keys (these affect your score).
|
| 75 |
+
Use target=ticket_id, value=key_name. One key per step. Request each key at most once.
|
| 76 |
+
Do NOT request optional available_context_keys — they waste steps.
|
| 77 |
+
3. set_priority ← MANDATORY before finalize. Use valid priority values.
|
| 78 |
+
4. set_route ← MANDATORY before finalize. Use valid route values.
|
| 79 |
+
5. set_resolution ← MANDATORY before finalize. Use valid resolution values.
|
| 80 |
+
6. escalate only when account takeover / security compromise is confirmed.
|
| 81 |
+
7. For queue tasks: rank_queue once, after ALL tickets are processed.
|
| 82 |
+
8. finalize (target=ticket_id, value="") — ONLY after set_priority, set_route,
|
| 83 |
+
and set_resolution have ALL been called for this ticket.
|
| 84 |
+
|
| 85 |
+
*** YOU MUST call set_priority, set_route, and set_resolution on every ticket. ***
|
| 86 |
+
*** Calling finalize before those three actions will score near 0. ***
|
| 87 |
|
| 88 |
PRIORITY HINTS:
|
| 89 |
- Account takeover / fraud / SLA <= 2h → urgent
|
|
|
|
| 92 |
|
| 93 |
STRICT RULES:
|
| 94 |
- NEVER repeat an action you have already taken (check your history).
|
| 95 |
+
- inspect_ticket AT MOST ONCE per ticket, and ALWAYS before request_context on that ticket.
|
| 96 |
- target is ALWAYS a ticket ID like "T1". NEVER put a context key in target.
|
| 97 |
- Each request_context must use a different value (key name).
|
| 98 |
+
- value must ALWAYS be a string — use "" (empty string), never null.
|
| 99 |
"""
|
| 100 |
).strip()
|
| 101 |
|
|
|
|
| 121 |
)
|
| 122 |
|
| 123 |
|
| 124 |
+
def build_user_prompt(
|
| 125 |
+
observation: Observation,
|
| 126 |
+
step: int,
|
| 127 |
+
rewards: List[float],
|
| 128 |
+
action_history: List[str],
|
| 129 |
+
completed_per_ticket: dict,
|
| 130 |
+
) -> str:
|
| 131 |
reward_history = ",".join(f"{reward:.2f}" for reward in rewards[-5:]) if rewards else "none"
|
| 132 |
history_str = "\n".join(f" {a}" for a in action_history) if action_history else " none"
|
| 133 |
+
|
| 134 |
+
# FIX 2: Summarise what mandatory actions are still missing per ticket so the
|
| 135 |
+
# model can see at a glance what it still needs to do before finalize.
|
| 136 |
+
pending_lines = []
|
| 137 |
+
for tid, done_actions in sorted(completed_per_ticket.items()):
|
| 138 |
+
missing = REQUIRED_PER_TICKET - done_actions
|
| 139 |
+
if missing:
|
| 140 |
+
pending_lines.append(f" {tid}: still needs {', '.join(sorted(missing))}")
|
| 141 |
+
pending_str = "\n".join(pending_lines) if pending_lines else " all mandatory actions complete"
|
| 142 |
+
|
| 143 |
return textwrap.dedent(
|
| 144 |
f"""
|
| 145 |
Step: {step}
|
|
|
|
| 147 |
Difficulty: {observation.difficulty}
|
| 148 |
Reward history: {reward_history}
|
| 149 |
|
| 150 |
+
Mandatory actions still PENDING (you MUST complete these before finalize):
|
| 151 |
+
{pending_str}
|
| 152 |
+
|
| 153 |
Actions you have ALREADY taken this episode (do NOT repeat these):
|
| 154 |
{history_str}
|
| 155 |
|
| 156 |
Observation JSON:
|
| 157 |
{json.dumps(observation.model_dump(), indent=2, sort_keys=True)}
|
| 158 |
Return one JSON action that you have NOT already taken.
|
| 159 |
+
Remember: value must always be a string, never null.
|
| 160 |
"""
|
| 161 |
).strip()
|
| 162 |
|
| 163 |
|
| 164 |
+
def extract_json(text: str) -> dict:
|
| 165 |
+
"""
|
| 166 |
+
Robustly extract a JSON object from model output.
|
| 167 |
+
Handles:
|
| 168 |
+
- <think>...</think> reasoning blocks (emitted by DeepSeek-R1, Gemini thinking, etc.)
|
| 169 |
+
- Markdown code fences (```json ... ```)
|
| 170 |
+
- Stray surrounding text
|
| 171 |
+
"""
|
| 172 |
+
# Strip <think>...</think> blocks first — they often contain stray { } chars
|
| 173 |
+
# that fool the JSON extractor into grabbing the wrong object.
|
| 174 |
+
text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
|
| 175 |
+
|
| 176 |
+
# Strip ```json ... ``` fences
|
| 177 |
+
text = re.sub(r"```(?:json)?", "", text).strip().rstrip("`").strip()
|
| 178 |
+
|
| 179 |
+
# Try direct parse
|
| 180 |
try:
|
| 181 |
+
return json.loads(text)
|
| 182 |
+
except json.JSONDecodeError:
|
| 183 |
+
pass
|
| 184 |
+
|
| 185 |
+
# Find the LAST complete {...} block — the real action is always after any
|
| 186 |
+
# preamble text, so the last match is more reliable than the first.
|
| 187 |
+
matches = list(re.finditer(r"\{[^{}]+\}", text, re.DOTALL))
|
| 188 |
+
for m in reversed(matches):
|
| 189 |
+
try:
|
| 190 |
+
return json.loads(m.group())
|
| 191 |
+
except json.JSONDecodeError:
|
| 192 |
+
continue
|
| 193 |
+
|
| 194 |
+
raise ValueError(f"No valid JSON object found in: {text!r}")
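The extractor can be exercised standalone. This restates the same logic so the demo runs on its own, against a typical reasoning-model reply:

```python
import json
import re

def extract_json(text: str) -> dict:
    # Standalone restatement of the extractor above so the demo is self-contained.
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    text = re.sub(r"```(?:json)?", "", text).strip().rstrip("`").strip()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Fall back to the last complete {...} block in the remaining text.
    for m in reversed(list(re.finditer(r"\{[^{}]+\}", text, re.DOTALL))):
        try:
            return json.loads(m.group())
        except json.JSONDecodeError:
            continue
    raise ValueError(f"No valid JSON object found in: {text!r}")

reply = (
    "<think>maybe {T1}? no, rank first</think>\n"
    "Here is the action:\n"
    '```json\n{"action_type": "rank_queue", "target": "queue", "value": "T2,T1,T3"}\n```'
)
print(extract_json(reply)["action_type"])  # rank_queue
```

The think-block strip runs first because stray braces inside the reasoning would otherwise win the last-object fallback.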
|
| 195 |
+
|
| 196 |
+
|
| 197 |
+
def get_model_action(
|
| 198 |
+
client: OpenAI,
|
| 199 |
+
observation: Observation,
|
| 200 |
+
step: int,
|
| 201 |
+
rewards: List[float],
|
| 202 |
+
action_history: List[str],
|
| 203 |
+
completed_per_ticket: dict,
|
| 204 |
+
) -> tuple[Action, Optional[str]]:
|
| 205 |
+
user_prompt = build_user_prompt(observation, step, rewards, action_history, completed_per_ticket)
|
| 206 |
+
last_exc: Optional[str] = None
|
| 207 |
+
content = ""
|
| 208 |
+
|
| 209 |
+
for attempt in range(1, JSON_RETRY_LIMIT + 1):
|
| 210 |
+
# Slightly raise temperature on retries so we don't get the same bad output
|
| 211 |
+
temp = TEMPERATURE if attempt == 1 else min(TEMPERATURE + 0.15 * attempt, 1.0)
|
| 212 |
+
try:
|
| 213 |
+
completion = client.chat.completions.create(
|
| 214 |
+
model=MODEL_NAME,
|
| 215 |
+
messages=[
|
| 216 |
+
{"role": "system", "content": SYSTEM_PROMPT},
|
| 217 |
+
{"role": "user", "content": user_prompt},
|
| 218 |
+
],
|
| 219 |
+
temperature=temp,
|
| 220 |
+
max_tokens=MAX_TOKENS,
|
| 221 |
+
stream=False,
|
| 222 |
+
)
|
| 223 |
+
content = (completion.choices[0].message.content or "").strip()
|
| 224 |
+
payload = extract_json(content)
|
| 225 |
+
|
| 226 |
+
# FIX 4: Normalise null → "" so the Action model never sees None for value
|
| 227 |
+
if payload.get("value") is None:
|
| 228 |
+
payload["value"] = ""
|
| 229 |
+
|
| 230 |
+
action = Action.model_validate(payload)
|
| 231 |
+
return action, None
|
| 232 |
+
except Exception as exc:
|
| 233 |
+
last_exc = str(exc).replace("\n", " ")
|
| 234 |
+
print(f"[WARN] attempt={attempt} parse_error={last_exc!r} content={content!r}", flush=True)
|
| 235 |
+
|
| 236 |
+
# FIX 5a: Respect rate-limit retry-after delays instead of hammering the API.
|
| 237 |
+
# The 429 body includes a retryDelay field (e.g. "16s"). Parse and sleep for it
|
| 238 |
+
# so subsequent attempts actually succeed rather than burning the retry budget.
|
| 239 |
+
if "429" in last_exc or "RESOURCE_EXHAUSTED" in last_exc:
|
| 240 |
+
delay_match = re.search(r"retryDelay['\"]:\s*['\"](\d+(?:\.\d+)?)s", last_exc)
|
| 241 |
+
delay = float(delay_match.group(1)) if delay_match else 20.0
|
| 242 |
+
print(f"[WARN] rate-limited; sleeping {delay:.1f}s before retry", flush=True)
|
| 243 |
+
time.sleep(delay)
|
| 244 |
+
|
| 245 |
+
# FIX 5b: Exhausted retries — do NOT blindly finalize.
|
| 246 |
+
# Skip to a no-op inspect on the first visible ticket to keep the episode alive.
|
| 247 |
+
print("[WARN] JSON retry limit exhausted; emitting safe no-op", flush=True)
|
| 248 |
+
# observation.tickets may be a list of objects or a dict — handle both.
|
| 249 |
+
obs_dump = observation.model_dump()
|
| 250 |
+
raw_tickets = obs_dump.get("tickets", [])
|
| 251 |
+
if isinstance(raw_tickets, dict):
|
| 252 |
+
ticket_ids = list(raw_tickets.keys())
|
| 253 |
+
else:
|
| 254 |
+
# list of dicts — each item should have an "id" or similar field
|
| 255 |
+
ticket_ids = [
|
| 256 |
+
t.get("ticket_id") or t.get("id") or f"T{i+1}"
|
| 257 |
+
for i, t in enumerate(raw_tickets)
|
| 258 |
+
]
|
| 259 |
+
ticket_ids = ticket_ids or ["T1"]
|
| 260 |
+
|
| 261 |
+
inspected = {
|
| 262 |
+
json.loads(a)["target"]
|
| 263 |
+
for a in action_history
|
| 264 |
+
if json.loads(a).get("action_type") == "inspect_ticket"
|
| 265 |
+
}
|
| 266 |
+
target = next((t for t in ticket_ids if t not in inspected), ticket_ids[0])
|
| 267 |
+
fallback = Action(action_type="inspect_ticket", target=target, value="")
|
| 268 |
+
return fallback, last_exc
|
| 269 |
|
| 270 |
|
| 271 |
def clamp_score(score: float) -> float:
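The body of `clamp_score` sits outside this hunk. A minimal sketch consistent with the `Fix 1` comment (keep the score strictly inside `(0, 1)` because the grader rejects exact `0.0` and `1.0`); the epsilon value is an assumption:

```python
def clamp_score(score: float) -> float:
    # Sketch only: clamp into the strictly open interval (0, 1).
    # The real implementation may differ; 1e-6 is a guessed epsilon.
    eps = 1e-6
    return min(max(score, eps), 1.0 - eps)

print(clamp_score(1.0) < 1.0, clamp_score(0.0) > 0.0)  # True True
```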
|
|
|
|
| 285 |
if not available:
|
| 286 |
raise RuntimeError("No tasks available in the environment.")
|
| 287 |
|
|
|
|
| 288 |
primary = requested if requested in available else available[0]
|
| 289 |
others = [t for t in available if t != primary]
|
| 290 |
task_list = [primary] + others
|
|
|
|
| 296 |
env = SupportOpsEnv(task_id=task_name)
|
| 297 |
rewards: List[float] = []
|
| 298 |
action_history: List[str] = []
|
| 299 |
+
# FIX 6: Track which mandatory actions have been completed per ticket
|
| 300 |
+
# so we can warn the model and block premature finalize.
|
| 301 |
+
completed_per_ticket: dict[str, set] = {}
|
| 302 |
steps_taken = 0
|
| 303 |
score = 0.0
|
| 304 |
success = False
|
|
|
|
| 309 |
observation = env.reset(task_id=task_name)
|
| 310 |
|
| 311 |
for step in range(1, MAX_STEPS + 1):
|
| 312 |
+
action, action_error = get_model_action(
|
| 313 |
+
client, observation, step, rewards, action_history, completed_per_ticket
|
| 314 |
+
)
|
| 315 |
+
|
| 316 |
+
# FIX 7: Guard against premature finalize — if mandatory steps are still
|
| 317 |
+
# missing for any ticket, redirect to the first pending mandatory action
|
| 318 |
+
# instead of letting the model throw away the score.
|
| 319 |
+
if action.action_type == "finalize":
|
| 320 |
+
target = action.target or "T1"
|
| 321 |
+
missing = REQUIRED_PER_TICKET - completed_per_ticket.get(target, set())
|
| 322 |
+
if missing:
|
| 323 |
+
next_action_type = sorted(missing)[0] # deterministic ordering
|
| 324 |
+
print(
|
| 325 |
+
f"[GUARD] Premature finalize on {target}; redirecting to {next_action_type}",
|
| 326 |
+
flush=True,
|
| 327 |
+
)
|
| 328 |
+
# Pick the first valid value for the missing action type
|
| 329 |
+
FALLBACK_VALUES = {
|
| 330 |
+
"set_priority": "normal",
|
| 331 |
+
"set_route": "policy_appeals",
|
| 332 |
+
"set_resolution": "expedited_human_review",
|
| 333 |
+
}
|
| 334 |
+
action = Action(
|
| 335 |
+
action_type=next_action_type,
|
| 336 |
+
target=target,
|
| 337 |
+
value=FALLBACK_VALUES[next_action_type],
|
| 338 |
+
)
|
| 339 |
+
|
| 340 |
action_str = json.dumps(action.model_dump(), separators=(",", ":"))
|
| 341 |
action_history.append(action_str)
|
| 342 |
|
| 343 |
+
# Update completion tracker
|
| 344 |
+
if action.action_type in REQUIRED_PER_TICKET:
|
| 345 |
+
t = action.target or "T1"
|
| 346 |
+
completed_per_ticket.setdefault(t, set()).add(action.action_type)
|
| 347 |
+
|
| 348 |
observation, reward, done, info = env.step(action)
|
| 349 |
reward_value = reward.value
|
| 350 |
rewards.append(reward_value)
|
|
|
|
| 362 |
if done:
|
| 363 |
break
|
| 364 |
|
|
|
|
| 365 |
score = clamp_score(score)
|
| 366 |
success = score >= SUCCESS_SCORE_THRESHOLD
|
| 367 |
finally:
|
|
|
|
| 373 |
def main() -> None:
|
| 374 |
client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
|
| 375 |
| 376 |
tasks = list(reversed(select_tasks(TASK_NAME)))
|
| 377 |
|
| 378 |
all_results = []
|
|
|
|
| 380 |
result = run_task(client, task_name)
|
| 381 |
all_results.append(result)
|
| 382 |
|
|
|
|
| 383 |
total = len(all_results)
|
| 384 |
passed = sum(1 for r in all_results if r["success"])
|
| 385 |
avg_score = sum(r["score"] for r in all_results) / total if total else 0.0
|