dbatcode28 commited on
Commit
bd67155
·
0 Parent(s):
This view is limited to 50 files because it contains too many changes.   See raw diff
Files changed (50) hide show
  1. .hfignore-extra +4 -0
  2. Dockerfile +15 -0
  3. README.md +249 -0
  4. __init__.py +1 -0
  5. __pycache__/inference.cpython-313.pyc +0 -0
  6. app.py +71 -0
  7. client.py +31 -0
  8. inference.py +152 -0
  9. models.py +27 -0
  10. openenv.yaml +22 -0
  11. pyproject.toml +25 -0
  12. requirements.txt +3 -0
  13. rule_baseline_results.json +524 -0
  14. scripts/__pycache__/run_rule_baseline.cpython-313.pyc +0 -0
  15. scripts/run_baseline.py +103 -0
  16. scripts/run_rule_baseline.py +238 -0
  17. scripts/validate_env.sh +10 -0
  18. server/__init__.py +1 -0
  19. server/__pycache__/__init__.cpython-313.pyc +0 -0
  20. server/__pycache__/app.cpython-313.pyc +0 -0
  21. server/app.py +36 -0
  22. support_ops_env/__init__.py +13 -0
  23. support_ops_env/__pycache__/__init__.cpython-313.pyc +0 -0
  24. support_ops_env/__pycache__/env.cpython-313.pyc +0 -0
  25. support_ops_env/__pycache__/models.cpython-313.pyc +0 -0
  26. support_ops_env/__pycache__/reward.cpython-313.pyc +0 -0
  27. support_ops_env/__pycache__/state.cpython-313.pyc +0 -0
  28. support_ops_env/data/easy_cases.json +37 -0
  29. support_ops_env/data/hard_cases.json +84 -0
  30. support_ops_env/data/medium_cases.json +37 -0
  31. support_ops_env/env.py +237 -0
  32. support_ops_env/graders/__init__.py +19 -0
  33. support_ops_env/graders/__pycache__/__init__.cpython-313.pyc +0 -0
  34. support_ops_env/graders/__pycache__/common.cpython-313.pyc +0 -0
  35. support_ops_env/graders/__pycache__/easy.cpython-313.pyc +0 -0
  36. support_ops_env/graders/__pycache__/hard.cpython-313.pyc +0 -0
  37. support_ops_env/graders/__pycache__/medium.cpython-313.pyc +0 -0
  38. support_ops_env/graders/common.py +106 -0
  39. support_ops_env/graders/easy.py +17 -0
  40. support_ops_env/graders/hard.py +18 -0
  41. support_ops_env/graders/medium.py +17 -0
  42. support_ops_env/models.py +119 -0
  43. support_ops_env/reward.py +13 -0
  44. support_ops_env/state.py +34 -0
  45. support_ops_env/tasks/__init__.py +3 -0
  46. support_ops_env/tasks/__pycache__/__init__.cpython-313.pyc +0 -0
  47. support_ops_env/tasks/__pycache__/loader.cpython-313.pyc +0 -0
  48. support_ops_env/tasks/loader.py +35 -0
  49. tests/__pycache__/test_env.cpython-313.pyc +0 -0
  50. tests/__pycache__/test_graders.cpython-313.pyc +0 -0
.hfignore-extra ADDED
@@ -0,0 +1,4 @@
 
 
 
 
 
1
+ __pycache__/
2
+ *.pyc
3
+ support_ops_env/venv/
4
+ .venv/
Dockerfile ADDED
@@ -0,0 +1,15 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ FROM python:3.11-slim
2
+
3
+ WORKDIR /app
4
+
5
+ ENV PYTHONDONTWRITEBYTECODE=1
6
+ ENV PYTHONUNBUFFERED=1
7
+
8
+ COPY requirements.txt .
9
+ RUN pip install --no-cache-dir -r requirements.txt
10
+
11
+ COPY . .
12
+
13
+ EXPOSE 7860
14
+
15
+ CMD ["python", "app.py"]
README.md ADDED
@@ -0,0 +1,249 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ title: SupportOpsEnv
3
+ sdk: docker
4
+ app_port: 7860
5
+ tags:
6
+ - openenv
7
+ - customer-support
8
+ - evaluation
9
+ ---
10
+
11
+ # SupportOpsEnv
12
+
13
+ SupportOpsEnv is a multi-step environment for evaluating agents on realistic customer support operations. The agent behaves like a support analyst: it reviews ticket summaries, requests missing context, assigns priority, chooses the correct internal route, selects a resolution, escalates when needed, and finalizes the case. This models a genuine workflow used by support operations, trust and safety, monetization, and account-recovery teams.
14
+
15
+ The environment is designed to score well against OpenEnv-style hackathon criteria:
16
+
17
+ - Real-world task simulation instead of a toy game
18
+ - Three deterministic tasks with easy, medium, and hard difficulty
19
+ - Dense reward shaping across the trajectory
20
+ - Typed observation, action, and reward models
21
+ - Reproducible OpenAI baseline runner
22
+ - Reproducible rule-based baseline runner that works with no API key
23
+ - Dockerized deployment path for Hugging Face Spaces
24
+
25
+ ## Environment Motivation
26
+
27
+ Support queue triage is one of the clearest real-world benchmarks for agent quality:
28
+
29
+ - Humans perform it every day
30
+ - It requires multi-step reasoning, not one-shot classification
31
+ - Progress can be measured deterministically
32
+ - It exposes practical agent failure modes such as premature resolution, wrong escalation, and poor prioritization
33
+
34
+ ## Observation Space
35
+
36
+ `Observation` is a Pydantic model with:
37
+
38
+ - `task_id`: active task identifier
39
+ - `difficulty`: `easy`, `medium`, or `hard`
40
+ - `title`: task title
41
+ - `instruction`: natural-language objective
42
+ - `queue_mode`: whether the task contains multiple tickets
43
+ - `tickets`: list of ticket observations
44
+ - `remaining_steps`: steps left in the episode
45
+ - `available_actions`: valid action names
46
+ - `current_queue_order`: current queue ranking, if any
47
+ - `score_hint`: latest intermediate grader snapshot
48
+
49
+ Each ticket observation contains:
50
+
51
+ - `ticket_id`
52
+ - `summary`
53
+ - `visible_context`
54
+ - `discovered_context`
55
+ - `selected_priority`
56
+ - `selected_route`
57
+ - `selected_resolution`
58
+ - `escalation_team`
59
+
60
+ ## Action Space
61
+
62
+ `Action` is a Pydantic model with:
63
+
64
+ - `action_type`
65
+ - `target`
66
+ - `value`
67
+
68
+ Supported `action_type` values:
69
+
70
+ - `inspect_ticket`
71
+ - `request_context`
72
+ - `set_priority`
73
+ - `set_route`
74
+ - `set_resolution`
75
+ - `escalate`
76
+ - `rank_queue`
77
+ - `finalize`
78
+
79
+ ## Reward Design
80
+
81
+ `RewardModel` is a Pydantic model with:
82
+
83
+ - `value`
84
+ - `components`
85
+ - `rationale`
86
+
87
+ Reward shaping is dense, not sparse:
88
+
89
+ - positive reward for discovering required context
90
+ - positive reward for correct intermediate decisions
91
+ - positive reward for correct queue ranking progress
92
+ - terminal reward from the deterministic grader score
93
+ - penalties for invalid actions, redundant actions, and wasted steps
94
+
95
+ This creates learning or evaluation signal over the full trajectory.
96
+
97
+ ## Tasks
98
+
99
+ ### Easy: Account Takeover Triage
100
+
101
+ Objective: correctly handle an urgent suspected account takeover with unauthorized ad spend.
102
+
103
+ Expected difficulty: easy.
104
+
105
+ Success criteria:
106
+
107
+ - request the right security and billing context
108
+ - assign `urgent`
109
+ - route to `account_security`
110
+ - choose `temporary_lock_and_manual_recovery`
111
+ - escalate to `security_specialist`
112
+
113
+ ### Medium: Monetization Payout Hold
114
+
115
+ Objective: investigate a missing creator payout and avoid unsafe release of funds.
116
+
117
+ Expected difficulty: medium.
118
+
119
+ Success criteria:
120
+
121
+ - discover tax-expiry and compliance-hold context
122
+ - assign `high`
123
+ - route to `monetization_compliance`
124
+ - choose `request_tax_renewal`
125
+ - avoid unnecessary escalation
126
+
127
+ ### Hard: Mixed Support Queue Triage
128
+
129
+ Objective: prioritize and resolve a heterogeneous queue under SLA pressure.
130
+
131
+ Expected difficulty: hard.
132
+
133
+ Success criteria:
134
+
135
+ - correctly rank the queue
136
+ - assign route and priority for each ticket
137
+ - choose correct resolutions
138
+ - escalate only the security-critical case
139
+
140
+ ## Graders
141
+
142
+ Each task has a deterministic grader that returns a score in `0.0` to `1.0`.
143
+
144
+ - Easy grader weights context, priority, route, resolution, and escalation
145
+ - Medium grader weights context and policy-safe resolution more heavily
146
+ - Hard grader scores per-ticket handling and queue ranking
147
+
148
+ Programmatic graders live in [support_ops_env/graders](/home/batman/Downloads/presentation_template/support_ops_env/support_ops_env/graders).
149
+
150
+ ## Setup
151
+
152
+ ```bash
153
+ cd support_ops_env
154
+ python -m venv .venv
155
+ source .venv/bin/activate
156
+ pip install -r requirements.txt
157
+ ```
158
+
159
+ ## Usage
160
+
161
+ Run the local tests:
162
+
163
+ ```bash
164
+ python -m unittest discover -s tests -p 'test_*.py'
165
+ ```
166
+
167
+ Run the app locally:
168
+
169
+ ```bash
170
+ python app.py
171
+ ```
172
+
173
+ Run the default no-API baseline:
174
+
175
+ ```bash
176
+ python scripts/run_rule_baseline.py
177
+ ```
178
+
179
+ Run the OpenAI baseline if you have an API key:
180
+
181
+ ```bash
182
+ export OPENAI_API_KEY=your_key_here
183
+ python scripts/run_baseline.py --model gpt-4.1-mini
184
+ ```
185
+
186
+ Validate metadata:
187
+
188
+ ```bash
189
+ bash scripts/validate_env.sh
190
+ ```
191
+
192
+ If the `openenv` CLI is installed, the script will also run `openenv validate openenv.yaml`.
193
+
194
+ ## Baseline Scores
195
+
196
+ The repository now includes a deterministic baseline in [run_rule_baseline.py](/home/batman/Downloads/presentation_template/support_ops_env/scripts/run_rule_baseline.py), so you can produce reproducible scores without any external API.
197
+
198
+ In this workspace, use:
199
+
200
+ ```bash
201
+ python scripts/run_rule_baseline.py
202
+ ```
203
+
204
+ This writes `rule_baseline_results.json` with per-task transcripts and the average score.
205
+
206
+ The current deterministic baseline score from this workspace is:
207
+
208
+ - `easy_account_takeover`: `1.0`
209
+ - `medium_payout_hold`: `1.0`
210
+ - `hard_queue_triage`: `1.0`
211
+ - average: `1.0`
212
+
213
+ The OpenAI baseline in [run_baseline.py](/home/batman/Downloads/presentation_template/support_ops_env/scripts/run_baseline.py) is still available as an optional comparison path after installing dependencies and setting `OPENAI_API_KEY`.
214
+
215
+ ## Hugging Face Space Deployment
216
+
217
+ This repository includes:
218
+
219
+ - `Dockerfile`
220
+ - `app.py`
221
+ - `openenv.yaml`
222
+
223
+ To deploy as a Docker Space:
224
+
225
+ 1. Create a new Hugging Face Space with SDK set to Docker.
226
+ 2. Upload this repository.
227
+ 3. Add the `openenv` tag in the Space metadata.
228
+ 4. Optionally set `OPENAI_API_KEY` as a Space secret for baseline experiments.
229
+
230
+ ## Project Structure
231
+
232
+ ```text
233
+ support_ops_env/
234
+ ├── support_ops_env/
235
+ │ ├── env.py
236
+ │ ├── models.py
237
+ │ ├── reward.py
238
+ │ ├── state.py
239
+ │ ├── data/
240
+ │ ├── graders/
241
+ │ └── tasks/
242
+ ├── scripts/
243
+ ├── tests/
244
+ ├── app.py
245
+ ├── openenv.yaml
246
+ ├── Dockerfile
247
+ ├── requirements.txt
248
+ └── README.md
249
+ ```
__init__.py ADDED
@@ -0,0 +1 @@
 
 
1
+ """SupportOpsEnv project root package marker for OpenEnv tooling."""
__pycache__/inference.cpython-313.pyc ADDED
Binary file (7.91 kB). View file
 
app.py ADDED
@@ -0,0 +1,71 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from __future__ import annotations
2
+
3
+ import json
4
+ import os
5
+
6
+ import gradio as gr
7
+
8
+ from support_ops_env.env import SupportOpsEnv
9
+ from support_ops_env.models import Action
10
+ from support_ops_env.tasks import list_task_ids
11
+
12
+
13
+ ENV = SupportOpsEnv()
14
+
15
+
16
+ def reset_env(task_id: str) -> str:
17
+ observation = ENV.reset(task_id=task_id)
18
+ return json.dumps(observation.model_dump(), indent=2)
19
+
20
+
21
+ def step_env(task_id: str, action_type: str, target: str, value: str) -> tuple[str, str]:
22
+ if ENV.state().task_id != task_id or ENV.state().step_count == 0 and not ENV.state().done:
23
+ ENV.reset(task_id=task_id)
24
+ action = Action(action_type=action_type, target=target or "T1", value=value or None)
25
+ observation, reward, done, info = ENV.step(action)
26
+ payload = {
27
+ "reward": reward.model_dump(),
28
+ "done": done,
29
+ "info": info,
30
+ }
31
+ return json.dumps(observation.model_dump(), indent=2), json.dumps(payload, indent=2)
32
+
33
+
34
+ with gr.Blocks(title="SupportOpsEnv") as demo:
35
+ gr.Markdown("# SupportOpsEnv")
36
+ gr.Markdown("Multi-step support triage benchmark with deterministic graders.")
37
+
38
+ task_id = gr.Dropdown(choices=list_task_ids(), value=list_task_ids()[0], label="Task")
39
+ action_type = gr.Dropdown(
40
+ choices=[
41
+ "inspect_ticket",
42
+ "request_context",
43
+ "set_priority",
44
+ "set_route",
45
+ "set_resolution",
46
+ "escalate",
47
+ "rank_queue",
48
+ "finalize",
49
+ ],
50
+ value="inspect_ticket",
51
+ label="Action Type",
52
+ )
53
+ target = gr.Textbox(value="T1", label="Target Ticket")
54
+ value = gr.Textbox(label="Value")
55
+ observation_output = gr.Code(label="Observation", language="json")
56
+ result_output = gr.Code(label="Step Result", language="json")
57
+
58
+ reset_button = gr.Button("Reset")
59
+ step_button = gr.Button("Step")
60
+
61
+ reset_button.click(reset_env, inputs=[task_id], outputs=[observation_output])
62
+ step_button.click(
63
+ step_env,
64
+ inputs=[task_id, action_type, target, value],
65
+ outputs=[observation_output, result_output],
66
+ )
67
+
68
+
69
+ if __name__ == "__main__":
70
+ port = int(os.getenv("PORT", os.getenv("GRADIO_SERVER_PORT", "7860")))
71
+ demo.launch(server_name="0.0.0.0", server_port=port)
client.py ADDED
@@ -0,0 +1,31 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Root-level client wrapper for OpenEnv packaging."""
2
+
3
+ from typing import Dict
4
+
5
+ from openenv.core import EnvClient
6
+ from openenv.core.client_types import StepResult
7
+ from openenv.core.env_server.types import State
8
+
9
+ from models import Action, Observation
10
+
11
+
12
+ class SupportOpsEnvClient(EnvClient[Action, Observation, State]):
13
+ def _step_payload(self, action: Action) -> Dict:
14
+ return action.model_dump()
15
+
16
+ def _parse_result(self, payload: Dict) -> StepResult[Observation]:
17
+ observation = Observation.model_validate(payload.get("observation", {}))
18
+ return StepResult(
19
+ observation=observation,
20
+ reward=payload.get("reward"),
21
+ done=payload.get("done", False),
22
+ )
23
+
24
+ def _parse_state(self, payload: Dict) -> State:
25
+ return State(
26
+ episode_id=payload.get("episode_id"),
27
+ step_count=payload.get("step_count", 0),
28
+ )
29
+
30
+
31
+ __all__ = ["SupportOpsEnvClient"]
inference.py ADDED
@@ -0,0 +1,152 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import json
2
+ import os
3
+ import textwrap
4
+ from typing import List, Optional
5
+
6
+ from openai import OpenAI
7
+
8
+ from support_ops_env.env import SupportOpsEnv
9
+ from support_ops_env.models import Action, Observation
10
+ from support_ops_env.tasks import list_task_ids
11
+
12
+
13
+ LOCAL_IMAGE_NAME = os.getenv("LOCAL_IMAGE_NAME")
14
+ API_KEY = os.getenv("HF_TOKEN") or os.getenv("OPENAI_API_KEY") or os.getenv("API_KEY")
15
+ API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
16
+ MODEL_NAME = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct")
17
+ TASK_NAME = os.getenv("SUPPORT_OPS_TASK", "easy_account_takeover")
18
+ BENCHMARK = os.getenv("SUPPORT_OPS_BENCHMARK", "support_ops_env")
19
+ MAX_STEPS = int(os.getenv("MAX_STEPS", "16"))
20
+ TEMPERATURE = float(os.getenv("TEMPERATURE", "0.1"))
21
+ MAX_TOKENS = int(os.getenv("MAX_TOKENS", "220"))
22
+ SUCCESS_SCORE_THRESHOLD = float(os.getenv("SUCCESS_SCORE_THRESHOLD", "0.8"))
23
+
24
+
25
+ SYSTEM_PROMPT = textwrap.dedent(
26
+ """
27
+ You are operating a customer support triage environment.
28
+ Return exactly one JSON object with keys: action_type, target, value.
29
+ Allowed action_type values:
30
+ - inspect_ticket
31
+ - request_context
32
+ - set_priority
33
+ - set_route
34
+ - set_resolution
35
+ - escalate
36
+ - rank_queue
37
+ - finalize
38
+ Choose only valid ticket ids from the observation.
39
+ Use concise string values.
40
+ Finalize only after enough evidence is gathered.
41
+ """
42
+ ).strip()
43
+
44
+
45
+ def log_start(task: str, env: str, model: str) -> None:
46
+ print(f"[START] task={task} env={env} model={model}", flush=True)
47
+
48
+
49
+ def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> None:
50
+ error_val = error if error else "null"
51
+ done_val = str(done).lower()
52
+ print(
53
+ f"[STEP] step={step} action={action} reward={reward:.2f} done={done_val} error={error_val}",
54
+ flush=True,
55
+ )
56
+
57
+
58
+ def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> None:
59
+ rewards_str = ",".join(f"{reward:.2f}" for reward in rewards)
60
+ print(
61
+ f"[END] success={str(success).lower()} steps={steps} score={score:.3f} rewards={rewards_str}",
62
+ flush=True,
63
+ )
64
+
65
+
66
+ def build_user_prompt(observation: Observation, step: int, rewards: List[float]) -> str:
67
+ reward_history = ",".join(f"{reward:.2f}" for reward in rewards[-5:]) if rewards else "none"
68
+ return textwrap.dedent(
69
+ f"""
70
+ Step: {step}
71
+ Task: {observation.task_id}
72
+ Difficulty: {observation.difficulty}
73
+ Reward history: {reward_history}
74
+ Observation JSON:
75
+ {json.dumps(observation.model_dump(), indent=2, sort_keys=True)}
76
+ Return one JSON action.
77
+ """
78
+ ).strip()
79
+
80
+
81
+ def get_model_action(client: OpenAI, observation: Observation, step: int, rewards: List[float]) -> tuple[Action, Optional[str]]:
82
+ user_prompt = build_user_prompt(observation, step, rewards)
83
+ try:
84
+ completion = client.chat.completions.create(
85
+ model=MODEL_NAME,
86
+ messages=[
87
+ {"role": "system", "content": SYSTEM_PROMPT},
88
+ {"role": "user", "content": user_prompt},
89
+ ],
90
+ temperature=TEMPERATURE,
91
+ max_tokens=MAX_TOKENS,
92
+ stream=False,
93
+ )
94
+ content = (completion.choices[0].message.content or "").strip()
95
+ payload = json.loads(content)
96
+ action = Action.model_validate(payload)
97
+ return action, None
98
+ except Exception as exc:
99
+ fallback = Action(action_type="finalize")
100
+ return fallback, str(exc).replace("\n", " ")
101
+
102
+
103
+ def ensure_known_task(task_name: str) -> str:
104
+ if task_name in list_task_ids():
105
+ return task_name
106
+ return list_task_ids()[0]
107
+
108
+
109
+ def main() -> None:
110
+ task_name = ensure_known_task(TASK_NAME)
111
+ client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
112
+ env = SupportOpsEnv(task_id=task_name)
113
+
114
+ rewards: List[float] = []
115
+ steps_taken = 0
116
+ score = 0.0
117
+ success = False
118
+
119
+ log_start(task=task_name, env=BENCHMARK, model=MODEL_NAME)
120
+
121
+ try:
122
+ observation = env.reset(task_id=task_name)
123
+
124
+ for step in range(1, MAX_STEPS + 1):
125
+ action, action_error = get_model_action(client, observation, step, rewards)
126
+ action_str = json.dumps(action.model_dump(), separators=(",", ":"))
127
+
128
+ observation, reward, done, info = env.step(action)
129
+ reward_value = reward.value
130
+ rewards.append(reward_value)
131
+ steps_taken = step
132
+
133
+ log_step(
134
+ step=step,
135
+ action=action_str,
136
+ reward=reward_value,
137
+ done=done,
138
+ error=action_error,
139
+ )
140
+
141
+ score = float(info.get("task_score", 0.0))
142
+ if done:
143
+ break
144
+
145
+ score = min(max(score, 0.0), 1.0)
146
+ success = score >= SUCCESS_SCORE_THRESHOLD
147
+ finally:
148
+ log_end(success=success, steps=steps_taken, score=score, rewards=rewards)
149
+
150
+
151
+ if __name__ == "__main__":
152
+ main()
models.py ADDED
@@ -0,0 +1,27 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Root-level model exports for OpenEnv packaging."""
2
+
3
+ from support_ops_env.models import (
4
+ Action,
5
+ BaselineResult,
6
+ Observation,
7
+ RewardModel,
8
+ StateModel,
9
+ StepInfo,
10
+ TaskGrade,
11
+ TaskSpec,
12
+ TicketObservation,
13
+ TicketSpec,
14
+ )
15
+
16
+ __all__ = [
17
+ "Action",
18
+ "BaselineResult",
19
+ "Observation",
20
+ "RewardModel",
21
+ "StateModel",
22
+ "StepInfo",
23
+ "TaskGrade",
24
+ "TaskSpec",
25
+ "TicketObservation",
26
+ "TicketSpec",
27
+ ]
openenv.yaml ADDED
@@ -0,0 +1,22 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ name: support-ops-env
2
+ description: Multi-step customer support triage and escalation benchmark for OpenEnv-style agents.
3
+ version: 0.1.0
4
+ tags:
5
+ - openenv
6
+ - customer-support
7
+ - triage
8
+ - evaluation
9
+ entrypoint: support_ops_env.env:SupportOpsEnv
10
+ observation_model: support_ops_env.models:Observation
11
+ action_model: support_ops_env.models:Action
12
+ reward_model: support_ops_env.models:RewardModel
13
+ tasks:
14
+ - id: easy_account_takeover
15
+ difficulty: easy
16
+ - id: medium_payout_hold
17
+ difficulty: medium
18
+ - id: hard_queue_triage
19
+ difficulty: hard
20
+ hf_space:
21
+ sdk: docker
22
+ suggested_app_file: app.py
pyproject.toml ADDED
@@ -0,0 +1,25 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ [build-system]
2
+ requires = ["setuptools>=68", "wheel"]
3
+ build-backend = "setuptools.build_meta"
4
+
5
+ [project]
6
+ name = "support-ops-env"
7
+ version = "0.1.0"
8
+ description = "Multi-step customer support triage and escalation benchmark for OpenEnv-style evaluation."
9
+ readme = "README.md"
10
+ requires-python = ">=3.11"
11
+ dependencies = [
12
+ "openenv-core[core]>=0.2.2",
13
+ "pydantic>=2.7,<3",
14
+ "openai>=1.30.0",
15
+ "gradio>=4.44.0",
16
+ ]
17
+
18
+ [project.scripts]
19
+ server = "server.app:main"
20
+
21
+ [tool.setuptools]
22
+ include-package-data = true
23
+
24
+ [tool.setuptools.packages.find]
25
+ include = ["support_ops_env*"]
requirements.txt ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ pydantic>=2.7,<3
2
+ openai>=1.30.0
3
+ gradio>=4.44.0
rule_baseline_results.json ADDED
@@ -0,0 +1,524 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "baseline": "rule_based",
3
+ "average_score": 1.0,
4
+ "results": [
5
+ {
6
+ "task_id": "easy_account_takeover",
7
+ "difficulty": "easy",
8
+ "score": 1.0,
9
+ "steps": 7,
10
+ "transcript": [
11
+ {
12
+ "action": {
13
+ "action_type": "request_context",
14
+ "target": "T1",
15
+ "value": "account_security"
16
+ },
17
+ "reward": {
18
+ "value": 0.11,
19
+ "components": {
20
+ "step_penalty": -0.01,
21
+ "required_context_found": 0.12
22
+ },
23
+ "rationale": "Processed request_context."
24
+ },
25
+ "task_score": 0.1,
26
+ "done": false
27
+ },
28
+ {
29
+ "action": {
30
+ "action_type": "request_context",
31
+ "target": "T1",
32
+ "value": "billing_activity"
33
+ },
34
+ "reward": {
35
+ "value": 0.11,
36
+ "components": {
37
+ "step_penalty": -0.01,
38
+ "required_context_found": 0.12
39
+ },
40
+ "rationale": "Processed request_context."
41
+ },
42
+ "task_score": 0.2,
43
+ "done": false
44
+ },
45
+ {
46
+ "action": {
47
+ "action_type": "set_priority",
48
+ "target": "T1",
49
+ "value": "urgent"
50
+ },
51
+ "reward": {
52
+ "value": 0.07,
53
+ "components": {
54
+ "step_penalty": -0.01,
55
+ "priority_match": 0.08
56
+ },
57
+ "rationale": "Processed set_priority."
58
+ },
59
+ "task_score": 0.4,
60
+ "done": false
61
+ },
62
+ {
63
+ "action": {
64
+ "action_type": "set_route",
65
+ "target": "T1",
66
+ "value": "account_security"
67
+ },
68
+ "reward": {
69
+ "value": 0.09,
70
+ "components": {
71
+ "step_penalty": -0.01,
72
+ "route_match": 0.1
73
+ },
74
+ "rationale": "Processed set_route."
75
+ },
76
+ "task_score": 0.65,
77
+ "done": false
78
+ },
79
+ {
80
+ "action": {
81
+ "action_type": "set_resolution",
82
+ "target": "T1",
83
+ "value": "temporary_lock_and_manual_recovery"
84
+ },
85
+ "reward": {
86
+ "value": 0.11,
87
+ "components": {
88
+ "step_penalty": -0.01,
89
+ "resolution_match": 0.12
90
+ },
91
+ "rationale": "Processed set_resolution."
92
+ },
93
+ "task_score": 0.85,
94
+ "done": false
95
+ },
96
+ {
97
+ "action": {
98
+ "action_type": "escalate",
99
+ "target": "T1",
100
+ "value": "security_specialist"
101
+ },
102
+ "reward": {
103
+ "value": 0.09,
104
+ "components": {
105
+ "step_penalty": -0.01,
106
+ "correct_escalation": 0.1
107
+ },
108
+ "rationale": "Processed escalate."
109
+ },
110
+ "task_score": 1.0,
111
+ "done": false
112
+ },
113
+ {
114
+ "action": {
115
+ "action_type": "finalize",
116
+ "target": "T1",
117
+ "value": null
118
+ },
119
+ "reward": {
120
+ "value": 0.99,
121
+ "components": {
122
+ "step_penalty": -0.01,
123
+ "terminal_grade": 1.0
124
+ },
125
+ "rationale": "Final task grade applied."
126
+ },
127
+ "task_score": 1.0,
128
+ "done": true
129
+ }
130
+ ]
131
+ },
132
+ {
133
+ "task_id": "medium_payout_hold",
134
+ "difficulty": "medium",
135
+ "score": 1.0,
136
+ "steps": 6,
137
+ "transcript": [
138
+ {
139
+ "action": {
140
+ "action_type": "request_context",
141
+ "target": "T1",
142
+ "value": "tax_status"
143
+ },
144
+ "reward": {
145
+ "value": 0.11,
146
+ "components": {
147
+ "step_penalty": -0.01,
148
+ "required_context_found": 0.12
149
+ },
150
+ "rationale": "Processed request_context."
151
+ },
152
+ "task_score": 0.225,
153
+ "done": false
154
+ },
155
+ {
156
+ "action": {
157
+ "action_type": "request_context",
158
+ "target": "T1",
159
+ "value": "payout_hold"
160
+ },
161
+ "reward": {
162
+ "value": 0.11,
163
+ "components": {
164
+ "step_penalty": -0.01,
165
+ "required_context_found": 0.12
166
+ },
167
+ "rationale": "Processed request_context."
168
+ },
169
+ "task_score": 0.35,
170
+ "done": false
171
+ },
172
+ {
173
+ "action": {
174
+ "action_type": "set_priority",
175
+ "target": "T1",
176
+ "value": "high"
177
+ },
178
+ "reward": {
179
+ "value": 0.07,
180
+ "components": {
181
+ "step_penalty": -0.01,
182
+ "priority_match": 0.08
183
+ },
184
+ "rationale": "Processed set_priority."
185
+ },
186
+ "task_score": 0.5,
187
+ "done": false
188
+ },
189
+ {
190
+ "action": {
191
+ "action_type": "set_route",
192
+ "target": "T1",
193
+ "value": "monetization_compliance"
194
+ },
195
+ "reward": {
196
+ "value": 0.09,
197
+ "components": {
198
+ "step_penalty": -0.01,
199
+ "route_match": 0.1
200
+ },
201
+ "rationale": "Processed set_route."
202
+ },
203
+ "task_score": 0.75,
204
+ "done": false
205
+ },
206
+ {
207
+ "action": {
208
+ "action_type": "set_resolution",
209
+ "target": "T1",
210
+ "value": "request_tax_renewal"
211
+ },
212
+ "reward": {
213
+ "value": 0.11,
214
+ "components": {
215
+ "step_penalty": -0.01,
216
+ "resolution_match": 0.12
217
+ },
218
+ "rationale": "Processed set_resolution."
219
+ },
220
+ "task_score": 1.0,
221
+ "done": false
222
+ },
223
+ {
224
+ "action": {
225
+ "action_type": "finalize",
226
+ "target": "T1",
227
+ "value": null
228
+ },
229
+ "reward": {
230
+ "value": 0.99,
231
+ "components": {
232
+ "step_penalty": -0.01,
233
+ "terminal_grade": 1.0
234
+ },
235
+ "rationale": "Final task grade applied."
236
+ },
237
+ "task_score": 1.0,
238
+ "done": true
239
+ }
240
+ ]
241
+ },
242
+ {
243
+ "task_id": "hard_queue_triage",
244
+ "difficulty": "hard",
245
+ "score": 1.0,
246
+ "steps": 16,
247
+ "transcript": [
248
+ {
249
+ "action": {
250
+ "action_type": "rank_queue",
251
+ "target": "T1",
252
+ "value": "T2,T3,T1"
253
+ },
254
+ "reward": {
255
+ "value": 0.11,
256
+ "components": {
257
+ "step_penalty": -0.01,
258
+ "queue_progress": 0.12
259
+ },
260
+ "rationale": "Processed rank_queue."
261
+ },
262
+ "task_score": 0.2167,
263
+ "done": false
264
+ },
265
+ {
266
+ "action": {
267
+ "action_type": "request_context",
268
+ "target": "T1",
269
+ "value": "payment_status"
270
+ },
271
+ "reward": {
272
+ "value": 0.11,
273
+ "components": {
274
+ "step_penalty": -0.01,
275
+ "required_context_found": 0.12
276
+ },
277
+ "rationale": "Processed request_context."
278
+ },
279
+ "task_score": 0.25,
280
+ "done": false
281
+ },
282
+ {
283
+ "action": {
284
+ "action_type": "request_context",
285
+ "target": "T2",
286
+ "value": "account_security"
287
+ },
288
+ "reward": {
289
+ "value": 0.11,
290
+ "components": {
291
+ "step_penalty": -0.01,
292
+ "required_context_found": 0.12
293
+ },
294
+ "rationale": "Processed request_context."
295
+ },
296
+ "task_score": 0.2667,
297
+ "done": false
298
+ },
299
+ {
300
+ "action": {
301
+ "action_type": "request_context",
302
+ "target": "T2",
303
+ "value": "billing_activity"
304
+ },
305
+ "reward": {
306
+ "value": 0.11,
307
+ "components": {
308
+ "step_penalty": -0.01,
309
+ "required_context_found": 0.12
310
+ },
311
+ "rationale": "Processed request_context."
312
+ },
313
+ "task_score": 0.2834,
314
+ "done": false
315
+ },
316
+ {
317
+ "action": {
318
+ "action_type": "request_context",
319
+ "target": "T3",
320
+ "value": "appeal_state"
321
+ },
322
+ "reward": {
323
+ "value": 0.11,
324
+ "components": {
325
+ "step_penalty": -0.01,
326
+ "required_context_found": 0.12
327
+ },
328
+ "rationale": "Processed request_context."
329
+ },
330
+ "task_score": 0.3,
331
+ "done": false
332
+ },
333
+ {
334
+ "action": {
335
+ "action_type": "request_context",
336
+ "target": "T3",
337
+ "value": "campaign_deadline"
338
+ },
339
+ "reward": {
340
+ "value": 0.11,
341
+ "components": {
342
+ "step_penalty": -0.01,
343
+ "required_context_found": 0.12
344
+ },
345
+ "rationale": "Processed request_context."
346
+ },
347
+ "task_score": 0.3167,
348
+ "done": false
349
+ },
350
+ {
351
+ "action": {
352
+ "action_type": "set_priority",
353
+ "target": "T1",
354
+ "value": "normal"
355
+ },
356
+ "reward": {
357
+ "value": 0.07,
358
+ "components": {
359
+ "step_penalty": -0.01,
360
+ "priority_match": 0.08
361
+ },
362
+ "rationale": "Processed set_priority."
363
+ },
364
+ "task_score": 0.3834,
365
+ "done": false
366
+ },
367
+ {
368
+ "action": {
369
+ "action_type": "set_priority",
370
+ "target": "T2",
371
+ "value": "urgent"
372
+ },
373
+ "reward": {
374
+ "value": 0.07,
375
+ "components": {
376
+ "step_penalty": -0.01,
377
+ "priority_match": 0.08
378
+ },
379
+ "rationale": "Processed set_priority."
380
+ },
381
+ "task_score": 0.45,
382
+ "done": false
383
+ },
384
+ {
385
+ "action": {
386
+ "action_type": "set_priority",
387
+ "target": "T3",
388
+ "value": "high"
389
+ },
390
+ "reward": {
391
+ "value": 0.07,
392
+ "components": {
393
+ "step_penalty": -0.01,
394
+ "priority_match": 0.08
395
+ },
396
+ "rationale": "Processed set_priority."
397
+ },
398
+ "task_score": 0.5167,
399
+ "done": false
400
+ },
401
+ {
402
+ "action": {
403
+ "action_type": "set_route",
404
+ "target": "T1",
405
+ "value": "billing_refunds"
406
+ },
407
+ "reward": {
408
+ "value": 0.09,
409
+ "components": {
410
+ "step_penalty": -0.01,
411
+ "route_match": 0.1
412
+ },
413
+ "rationale": "Processed set_route."
414
+ },
415
+ "task_score": 0.6,
416
+ "done": false
417
+ },
418
+ {
419
+ "action": {
420
+ "action_type": "set_route",
421
+ "target": "T2",
422
+ "value": "account_security"
423
+ },
424
+ "reward": {
425
+ "value": 0.09,
426
+ "components": {
427
+ "step_penalty": -0.01,
428
+ "route_match": 0.1
429
+ },
430
+ "rationale": "Processed set_route."
431
+ },
432
+ "task_score": 0.6834,
433
+ "done": false
434
+ },
435
+ {
436
+ "action": {
437
+ "action_type": "set_route",
438
+ "target": "T3",
439
+ "value": "policy_appeals"
440
+ },
441
+ "reward": {
442
+ "value": 0.09,
443
+ "components": {
444
+ "step_penalty": -0.01,
445
+ "route_match": 0.1
446
+ },
447
+ "rationale": "Processed set_route."
448
+ },
449
+ "task_score": 0.7667,
450
+ "done": false
451
+ },
452
+ {
453
+ "action": {
454
+ "action_type": "set_resolution",
455
+ "target": "T1",
456
+ "value": "approve_refund"
457
+ },
458
+ "reward": {
459
+ "value": 0.11,
460
+ "components": {
461
+ "step_penalty": -0.01,
462
+ "resolution_match": 0.12
463
+ },
464
+ "rationale": "Processed set_resolution."
465
+ },
466
+ "task_score": 0.8334,
467
+ "done": false
468
+ },
469
+ {
470
+ "action": {
471
+ "action_type": "set_resolution",
472
+ "target": "T2",
473
+ "value": "temporary_lock_and_manual_recovery"
474
+ },
475
+ "reward": {
476
+ "value": 0.11,
477
+ "components": {
478
+ "step_penalty": -0.01,
479
+ "resolution_match": 0.12
480
+ },
481
+ "rationale": "Processed set_resolution."
482
+ },
483
+ "task_score": 0.9,
484
+ "done": false
485
+ },
486
+ {
487
+ "action": {
488
+ "action_type": "set_resolution",
489
+ "target": "T3",
490
+ "value": "expedited_human_review"
491
+ },
492
+ "reward": {
493
+ "value": 0.11,
494
+ "components": {
495
+ "step_penalty": -0.01,
496
+ "resolution_match": 0.12
497
+ },
498
+ "rationale": "Processed set_resolution."
499
+ },
500
+ "task_score": 0.9667,
501
+ "done": false
502
+ },
503
+ {
504
+ "action": {
505
+ "action_type": "escalate",
506
+ "target": "T2",
507
+ "value": "security_specialist"
508
+ },
509
+ "reward": {
510
+ "value": 1.09,
511
+ "components": {
512
+ "step_penalty": -0.01,
513
+ "correct_escalation": 0.1,
514
+ "timeout_grade": 1.0
515
+ },
516
+ "rationale": "Processed escalate."
517
+ },
518
+ "task_score": 1.0,
519
+ "done": true
520
+ }
521
+ ]
522
+ }
523
+ ]
524
+ }
scripts/__pycache__/run_rule_baseline.cpython-313.pyc ADDED
Binary file (10.1 kB). View file
 
scripts/run_baseline.py ADDED
@@ -0,0 +1,103 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from __future__ import annotations
2
+
3
+ import argparse
4
+ import json
5
+ import os
6
+ import sys
7
+ from pathlib import Path
8
+ from typing import Dict, List
9
+
10
+ ROOT = Path(__file__).resolve().parent.parent
11
+ if str(ROOT) not in sys.path:
12
+ sys.path.insert(0, str(ROOT))
13
+
14
+ from openai import OpenAI
15
+
16
+ from support_ops_env.env import SupportOpsEnv
17
+ from support_ops_env.models import Action, BaselineResult
18
+ from support_ops_env.tasks import list_task_ids
19
+
20
+
21
+ SYSTEM_PROMPT = """You are evaluating a support operations environment.
22
+ Return exactly one JSON object with keys: action_type, target, value.
23
+ Choose from action_type:
24
+ - inspect_ticket
25
+ - request_context
26
+ - set_priority
27
+ - set_route
28
+ - set_resolution
29
+ - escalate
30
+ - rank_queue
31
+ - finalize
32
+ Be concise and deterministic. Only use ticket ids that appear in the observation.
33
+ When enough evidence is gathered, finalize."""
34
+
35
+
36
+ def main() -> None:
37
+ parser = argparse.ArgumentParser(description="Run a reproducible baseline over all SupportOpsEnv tasks.")
38
+ parser.add_argument("--model", default="gpt-4.1-mini", help="OpenAI model name")
39
+ parser.add_argument("--output", default="baseline_results.json", help="Path to write JSON results")
40
+ args = parser.parse_args()
41
+
42
+ api_key = os.getenv("OPENAI_API_KEY")
43
+ if not api_key:
44
+ raise SystemExit("OPENAI_API_KEY is required.")
45
+
46
+ client = OpenAI(api_key=api_key)
47
+ results: List[BaselineResult] = []
48
+
49
+ for task_id in list_task_ids():
50
+ env = SupportOpsEnv(task_id=task_id)
51
+ observation = env.reset()
52
+ done = False
53
+ transcript: List[Dict[str, object]] = []
54
+ last_info: Dict[str, object] = {}
55
+
56
+ while not done:
57
+ response = client.responses.create(
58
+ model=args.model,
59
+ temperature=0,
60
+ input=[
61
+ {"role": "system", "content": SYSTEM_PROMPT},
62
+ {
63
+ "role": "user",
64
+ "content": json.dumps(observation.model_dump(), indent=2, sort_keys=True),
65
+ },
66
+ ],
67
+ )
68
+ raw = response.output_text.strip()
69
+ payload = json.loads(raw)
70
+ action = Action.model_validate(payload)
71
+ observation, reward, done, info = env.step(action)
72
+ transcript.append(
73
+ {
74
+ "action": action.model_dump(),
75
+ "reward": reward.model_dump(),
76
+ "task_score": info["task_score"],
77
+ "done": done,
78
+ }
79
+ )
80
+ last_info = info
81
+
82
+ results.append(
83
+ BaselineResult(
84
+ task_id=task_id,
85
+ difficulty=observation.difficulty,
86
+ score=float(last_info.get("task_score", 0.0)),
87
+ steps=int(last_info.get("step_count", 0)),
88
+ transcript=transcript,
89
+ )
90
+ )
91
+
92
+ output_path = Path(args.output)
93
+ payload = {
94
+ "model": args.model,
95
+ "average_score": round(sum(item.score for item in results) / len(results), 4),
96
+ "results": [item.model_dump() for item in results],
97
+ }
98
+ output_path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
99
+ print(json.dumps(payload, indent=2))
100
+
101
+
102
+ if __name__ == "__main__":
103
+ main()
scripts/run_rule_baseline.py ADDED
@@ -0,0 +1,238 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from __future__ import annotations
2
+
3
+ import argparse
4
+ import json
5
+ import sys
6
+ from pathlib import Path
7
+ from typing import Dict, List
8
+
9
+ ROOT = Path(__file__).resolve().parent.parent
10
+ if str(ROOT) not in sys.path:
11
+ sys.path.insert(0, str(ROOT))
12
+
13
+ from support_ops_env.env import SupportOpsEnv
14
+ from support_ops_env.models import Action, BaselineResult, Observation, TicketObservation
15
+ from support_ops_env.tasks import list_task_ids
16
+
17
+
18
+ CONTEXT_PRIORITY = [
19
+ "account_security",
20
+ "billing_activity",
21
+ "tax_status",
22
+ "payout_hold",
23
+ "appeal_state",
24
+ "campaign_deadline",
25
+ "payment_status",
26
+ ]
27
+
28
+
29
+ def choose_next_action(observation: Observation) -> Action:
30
+ if observation.queue_mode and not observation.current_queue_order:
31
+ ranking = rank_tickets(observation.tickets)
32
+ return Action(action_type="rank_queue", value=",".join(ranking))
33
+
34
+ for ticket in observation.tickets:
35
+ next_context = missing_high_value_context(ticket)
36
+ if next_context:
37
+ return Action(action_type="request_context", target=ticket.ticket_id, value=next_context)
38
+
39
+ for ticket in observation.tickets:
40
+ priority = infer_priority(ticket)
41
+ if ticket.selected_priority != priority:
42
+ return Action(action_type="set_priority", target=ticket.ticket_id, value=priority)
43
+
44
+ for ticket in observation.tickets:
45
+ route = infer_route(ticket)
46
+ if ticket.selected_route != route:
47
+ return Action(action_type="set_route", target=ticket.ticket_id, value=route)
48
+
49
+ for ticket in observation.tickets:
50
+ resolution = infer_resolution(ticket)
51
+ if ticket.selected_resolution != resolution:
52
+ return Action(action_type="set_resolution", target=ticket.ticket_id, value=resolution)
53
+
54
+ for ticket in observation.tickets:
55
+ escalation = infer_escalation(ticket)
56
+ if ticket.escalation_team != escalation:
57
+ return Action(action_type="escalate", target=ticket.ticket_id, value=escalation)
58
+
59
+ return Action(action_type="finalize")
60
+
61
+
62
+ def missing_high_value_context(ticket: TicketObservation) -> str | None:
63
+ discovered = set(ticket.discovered_context)
64
+ haystack = flattened_text(ticket)
65
+
66
+ candidates: List[str] = infer_required_context(ticket)
67
+
68
+ for key in CONTEXT_PRIORITY:
69
+ if key in candidates and key not in discovered:
70
+ return key
71
+ return None
72
+
73
+
74
+ def infer_required_context(ticket: TicketObservation) -> List[str]:
75
+ text = flattened_text(ticket)
76
+ if "payout" in text or "w-9" in text or "bank details" in text or "funds released" in text:
77
+ return ["tax_status", "payout_hold"]
78
+ if "appeal" in text or "auto-removed" in text or "monetization is paused" in text:
79
+ return ["appeal_state", "campaign_deadline"]
80
+ if "duplicate charge" in text or "refund" in text:
81
+ return ["payment_status"]
82
+ if (
83
+ "login" in text
84
+ or "ad spend" in text
85
+ or "unfamiliar campaigns" in text
86
+ or "taken over" in text
87
+ or "recovery email was changed" in text
88
+ ):
89
+ return ["account_security", "billing_activity"]
90
+ return []
91
+
92
+
93
+ def infer_priority(ticket: TicketObservation) -> str:
94
+ text = flattened_text(ticket)
95
+ if (
96
+ "critical" in text
97
+ or "$1,900" in text
98
+ or "unauthorized ad spend" in text
99
+ or "impossible travel" in text
100
+ or "recovery email was changed" in text
101
+ ):
102
+ return "urgent"
103
+ if "campaign begins in 18 hours" in text or "monetization is paused" in text:
104
+ return "high"
105
+ if "w-9 expired" in text or "monthly payout" in text:
106
+ return "high"
107
+ return "normal"
108
+
109
+
110
+ def infer_route(ticket: TicketObservation) -> str:
111
+ text = flattened_text(ticket)
112
+ if (
113
+ "account takeover" in text
114
+ or "new devices" in text
115
+ or "recovery email was changed" in text
116
+ or "unfamiliar campaigns" in text
117
+ or "unauthorized ad spend" in text
118
+ or "losing access" in text
119
+ ):
120
+ return "account_security"
121
+ if "w-9 expired" in text or "compliance hold" in text:
122
+ return "monetization_compliance"
123
+ if "auto-removed" in text or "human yet" in text:
124
+ return "policy_appeals"
125
+ if "duplicate charge" in text or "automatically refundable" in text:
126
+ return "billing_refunds"
127
+ return "general_support"
128
+
129
+
130
+ def infer_resolution(ticket: TicketObservation) -> str:
131
+ text = flattened_text(ticket)
132
+ if (
133
+ "account takeover" in text
134
+ or "new devices" in text
135
+ or "impossible travel" in text
136
+ or "unfamiliar campaigns" in text
137
+ or "losing access" in text
138
+ ):
139
+ return "temporary_lock_and_manual_recovery"
140
+ if "w-9 expired" in text or "compliance hold" in text:
141
+ return "request_tax_renewal"
142
+ if "auto-removed" in text or "sponsored campaign begins" in text:
143
+ return "expedited_human_review"
144
+ if "duplicate charge" in text or "automatically refundable" in text:
145
+ return "approve_refund"
146
+ return "request_more_info"
147
+
148
+
149
+ def infer_escalation(ticket: TicketObservation) -> str | None:
150
+ text = flattened_text(ticket)
151
+ if (
152
+ "account takeover" in text
153
+ or "critical" in text
154
+ or "impossible travel" in text
155
+ or "unfamiliar campaigns" in text
156
+ or "losing access" in text
157
+ ):
158
+ return "security_specialist"
159
+ return None
160
+
161
+
162
+ def rank_tickets(tickets: List[TicketObservation]) -> List[str]:
163
+ scored = []
164
+ for ticket in tickets:
165
+ text = flattened_text(ticket)
166
+ score = 0
167
+ if "critical" in text or "account takeover" in text or "$1,900" in text or "unfamiliar campaigns" in text:
168
+ score += 100
169
+ if "campaign begins in 18 hours" in text or "sponsored campaign" in text:
170
+ score += 60
171
+ if "duplicate charge" in text:
172
+ score += 20
173
+ if ticket.visible_context.get("sla_hours_remaining") == "1":
174
+ score += 30
175
+ if ticket.visible_context.get("sla_hours_remaining") == "4":
176
+ score += 10
177
+ scored.append((score, ticket.ticket_id))
178
+ scored.sort(key=lambda item: (-item[0], item[1]))
179
+ return [ticket_id for _, ticket_id in scored]
180
+
181
+
182
+ def flattened_text(ticket: TicketObservation) -> str:
183
+ parts = [
184
+ ticket.summary,
185
+ json.dumps(ticket.visible_context, sort_keys=True),
186
+ json.dumps(ticket.discovered_context, sort_keys=True),
187
+ ]
188
+ return " ".join(parts).lower()
189
+
190
+
191
+ def main() -> None:
192
+ parser = argparse.ArgumentParser(description="Run a deterministic rule-based baseline over all tasks.")
193
+ parser.add_argument("--output", default="rule_baseline_results.json", help="Path to write JSON results")
194
+ args = parser.parse_args()
195
+
196
+ results: List[BaselineResult] = []
197
+ for task_id in list_task_ids():
198
+ env = SupportOpsEnv(task_id=task_id)
199
+ observation = env.reset()
200
+ done = False
201
+ transcript: List[Dict[str, object]] = []
202
+ last_info: Dict[str, object] = {}
203
+
204
+ while not done:
205
+ action = choose_next_action(observation)
206
+ observation, reward, done, info = env.step(action)
207
+ transcript.append(
208
+ {
209
+ "action": action.model_dump(),
210
+ "reward": reward.model_dump(),
211
+ "task_score": info["task_score"],
212
+ "done": done,
213
+ }
214
+ )
215
+ last_info = info
216
+
217
+ results.append(
218
+ BaselineResult(
219
+ task_id=task_id,
220
+ difficulty=observation.difficulty,
221
+ score=float(last_info.get("task_score", 0.0)),
222
+ steps=int(last_info.get("step_count", 0)),
223
+ transcript=transcript,
224
+ )
225
+ )
226
+
227
+ payload = {
228
+ "baseline": "rule_based",
229
+ "average_score": round(sum(item.score for item in results) / len(results), 4),
230
+ "results": [item.model_dump() for item in results],
231
+ }
232
+ output_path = Path(args.output)
233
+ output_path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
234
+ print(json.dumps(payload, indent=2))
235
+
236
+
237
+ if __name__ == "__main__":
238
+ main()
scripts/validate_env.sh ADDED
@@ -0,0 +1,10 @@
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env bash
2
+ set -euo pipefail
3
+
4
+ python -m unittest discover -s tests -p 'test_*.py'
5
+
6
+ if command -v openenv >/dev/null 2>&1; then
7
+ openenv validate openenv.yaml
8
+ else
9
+ echo "openenv CLI not installed; skipped 'openenv validate openenv.yaml'."
10
+ fi
server/__init__.py ADDED
@@ -0,0 +1 @@
 
 
1
+ """OpenEnv server package."""
server/__pycache__/__init__.cpython-313.pyc ADDED
Binary file (214 Bytes). View file
 
server/__pycache__/app.cpython-313.pyc ADDED
Binary file (1.42 kB). View file
 
server/app.py ADDED
@@ -0,0 +1,36 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from __future__ import annotations
2
+
3
+ from fastapi import FastAPI
4
+
5
+
6
+ app = FastAPI(
7
+ title="SupportOpsEnv Server",
8
+ description="Minimal server entry point for OpenEnv validation and deployment hooks.",
9
+ version="0.1.0",
10
+ )
11
+
12
+
13
+ @app.get("/")
14
+ def root() -> dict[str, str]:
15
+ return {
16
+ "name": "support-ops-env",
17
+ "status": "ok",
18
+ "message": "SupportOpsEnv server entry point is available.",
19
+ }
20
+
21
+
22
+ @app.get("/health")
23
+ def health() -> dict[str, str]:
24
+ return {"status": "healthy"}
25
+
26
+
27
+ def main(host: str = "0.0.0.0", port: int = 8000) -> None:
28
+ import uvicorn
29
+
30
+ uvicorn.run(app, host=host, port=port)
31
+
32
+ def uv_main():
33
+ return app
34
+
35
+ if __name__ == "__main__":
36
+ main()
support_ops_env/__init__.py ADDED
@@ -0,0 +1,13 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """SupportOpsEnv package."""
2
+
3
+ from .env import SupportOpsEnv
4
+ from .models import Action, Observation, RewardModel, StateModel, TaskGrade
5
+
6
+ __all__ = [
7
+ "Action",
8
+ "Observation",
9
+ "RewardModel",
10
+ "StateModel",
11
+ "SupportOpsEnv",
12
+ "TaskGrade",
13
+ ]
support_ops_env/__pycache__/__init__.cpython-313.pyc ADDED
Binary file (451 Bytes). View file
 
support_ops_env/__pycache__/env.cpython-313.pyc ADDED
Binary file (16.2 kB). View file
 
support_ops_env/__pycache__/models.cpython-313.pyc ADDED
Binary file (5.42 kB). View file
 
support_ops_env/__pycache__/reward.cpython-313.pyc ADDED
Binary file (708 Bytes). View file
 
support_ops_env/__pycache__/state.cpython-313.pyc ADDED
Binary file (1.94 kB). View file
 
support_ops_env/data/easy_cases.json ADDED
@@ -0,0 +1,37 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ [
2
+ {
3
+ "task_id": "easy_account_takeover",
4
+ "difficulty": "easy",
5
+ "title": "Account Takeover Triage",
6
+ "description": "Route an urgent support ticket involving suspected account takeover and unauthorized ad spend.",
7
+ "instruction": "Review the ticket, request any missing high-value context, set priority, choose the correct route, select a resolution, and escalate only if warranted.",
8
+ "max_steps": 8,
9
+ "queue_mode": false,
10
+ "grader_name": "easy_support_routing",
11
+ "tickets": [
12
+ {
13
+ "ticket_id": "T1",
14
+ "summary": "Creator reports losing access to their account after repeated login attempts. They also noticed unfamiliar ad spend starting this morning.",
15
+ "visible_context": {
16
+ "customer_tier": "managed_creator",
17
+ "surface": "ads_manager",
18
+ "sla_hours_remaining": "2"
19
+ },
20
+ "hidden_context": {
21
+ "account_security": "Impossible travel and two failed password-reset attempts were flagged overnight.",
22
+ "billing_activity": "A new ad campaign spent $420 in the last 2 hours from a new IP block.",
23
+ "recovery_channel": "A verified backup email is on file and can receive a recovery code.",
24
+ "prior_cases": "No previous enforcement or abuse flags are on record."
25
+ },
26
+ "required_context": [
27
+ "account_security",
28
+ "billing_activity"
29
+ ],
30
+ "gold_priority": "urgent",
31
+ "gold_route": "account_security",
32
+ "gold_resolution": "temporary_lock_and_manual_recovery",
33
+ "gold_escalation_team": "security_specialist"
34
+ }
35
+ ]
36
+ }
37
+ ]
support_ops_env/data/hard_cases.json ADDED
@@ -0,0 +1,84 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ [
2
+ {
3
+ "task_id": "hard_queue_triage",
4
+ "difficulty": "hard",
5
+ "title": "Mixed Support Queue Triage",
6
+ "description": "Prioritize a small queue of heterogeneous support tickets under SLA pressure and route each one correctly.",
7
+ "instruction": "Inspect the queue, gather missing context where useful, assign the right priority and route for each ticket, set a valid resolution, rank the queue from most urgent to least urgent, and finalize.",
8
+ "max_steps": 16,
9
+ "queue_mode": true,
10
+ "gold_queue_order": [
11
+ "T2",
12
+ "T3",
13
+ "T1"
14
+ ],
15
+ "grader_name": "hard_support_queue",
16
+ "tickets": [
17
+ {
18
+ "ticket_id": "T1",
19
+ "summary": "Customer reports a duplicate charge on a subscription renewal and asks when the refund will land.",
20
+ "visible_context": {
21
+ "customer_tier": "consumer",
22
+ "surface": "subscriptions",
23
+ "sla_hours_remaining": "24"
24
+ },
25
+ "hidden_context": {
26
+ "payment_status": "The duplicate charge was confirmed and is automatically refundable.",
27
+ "refund_status": "No refund has been issued yet.",
28
+ "risk_flags": "No fraud indicators or account compromise signals are present."
29
+ },
30
+ "required_context": [
31
+ "payment_status"
32
+ ],
33
+ "gold_priority": "normal",
34
+ "gold_route": "billing_refunds",
35
+ "gold_resolution": "approve_refund",
36
+ "gold_escalation_team": null
37
+ },
38
+ {
39
+ "ticket_id": "T2",
40
+ "summary": "Advertiser cannot log in, says unfamiliar campaigns are spending rapidly, and fears the account was taken over.",
41
+ "visible_context": {
42
+ "customer_tier": "managed_advertiser",
43
+ "surface": "ads_manager",
44
+ "sla_hours_remaining": "1"
45
+ },
46
+ "hidden_context": {
47
+ "account_security": "Two new devices were added and recovery email was changed 30 minutes ago.",
48
+ "billing_activity": "Spending accelerated to $1,900 in the last hour.",
49
+ "risk_flags": "Account takeover risk score is critical."
50
+ },
51
+ "required_context": [
52
+ "account_security",
53
+ "billing_activity"
54
+ ],
55
+ "gold_priority": "urgent",
56
+ "gold_route": "account_security",
57
+ "gold_resolution": "temporary_lock_and_manual_recovery",
58
+ "gold_escalation_team": "security_specialist"
59
+ },
60
+ {
61
+ "ticket_id": "T3",
62
+ "summary": "Verified creator appeals a content moderation decision because a sponsored campaign launches tomorrow and monetization is paused.",
63
+ "visible_context": {
64
+ "customer_tier": "verified_creator",
65
+ "surface": "content_appeals",
66
+ "sla_hours_remaining": "4"
67
+ },
68
+ "hidden_context": {
69
+ "appeal_state": "The content was auto-removed for policy ambiguity and has not been reviewed by a human yet.",
70
+ "campaign_deadline": "The sponsored campaign begins in 18 hours.",
71
+ "account_history": "No previous policy strikes or abuse reports."
72
+ },
73
+ "required_context": [
74
+ "appeal_state",
75
+ "campaign_deadline"
76
+ ],
77
+ "gold_priority": "high",
78
+ "gold_route": "policy_appeals",
79
+ "gold_resolution": "expedited_human_review",
80
+ "gold_escalation_team": null
81
+ }
82
+ ]
83
+ }
84
+ ]
support_ops_env/data/medium_cases.json ADDED
@@ -0,0 +1,37 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ [
2
+ {
3
+ "task_id": "medium_payout_hold",
4
+ "difficulty": "medium",
5
+ "title": "Monetization Payout Hold",
6
+ "description": "Handle a creator support issue where a monthly payout is missing and policy-compliance context is incomplete.",
7
+ "instruction": "Investigate the payout problem, request the most relevant missing context, decide the correct route, and choose a safe resolution without unnecessary escalation.",
8
+ "max_steps": 9,
9
+ "queue_mode": false,
10
+ "grader_name": "medium_support_resolution",
11
+ "tickets": [
12
+ {
13
+ "ticket_id": "T1",
14
+ "summary": "A small business creator says their monthly payout did not arrive. They mention changing bank details recently and want the funds released immediately.",
15
+ "visible_context": {
16
+ "customer_tier": "business_creator",
17
+ "surface": "creator_monetization",
18
+ "sla_hours_remaining": "12"
19
+ },
20
+ "hidden_context": {
21
+ "tax_status": "The W-9 expired last month and must be renewed before payout release.",
22
+ "payout_hold": "An automated compliance hold is active until tax renewal is confirmed.",
23
+ "bank_change": "The new bank account passed verification 3 days ago.",
24
+ "contract_status": "The creator remains in good standing with no strikes."
25
+ },
26
+ "required_context": [
27
+ "tax_status",
28
+ "payout_hold"
29
+ ],
30
+ "gold_priority": "high",
31
+ "gold_route": "monetization_compliance",
32
+ "gold_resolution": "request_tax_renewal",
33
+ "gold_escalation_team": null
34
+ }
35
+ ]
36
+ }
37
+ ]
support_ops_env/env.py ADDED
@@ -0,0 +1,237 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from __future__ import annotations
2
+
3
+ from typing import Dict, List, Optional, Tuple
4
+
5
+ from .graders import grade_task
6
+ from .models import Action, Observation, RewardModel, StateModel, StepInfo, TaskSpec, TicketObservation
7
+ from .reward import STEP_PENALTY, build_reward
8
+ from .state import discovered_for_ticket, initial_tracking, update_mapping
9
+ from .tasks import get_all_tasks, get_task
10
+
11
+
12
+ class SupportOpsEnv:
13
+ """OpenEnv-shaped benchmark for support operations workflows."""
14
+
15
+ def __init__(self, task_id: Optional[str] = None):
16
+ self._tasks = {task.task_id: task for task in get_all_tasks()}
17
+ self._task_order = sorted(self._tasks)
18
+ self._task_id = task_id or self._task_order[0]
19
+ self._task: TaskSpec = self._tasks[self._task_id]
20
+ self._state: StateModel = initial_tracking(self._task)
21
+
22
+ def reset(self, task_id: Optional[str] = None) -> Observation:
23
+ if task_id is not None:
24
+ self._task = get_task(task_id)
25
+ self._task_id = task_id
26
+ self._state = initial_tracking(self._task)
27
+ return self._build_observation()
28
+
29
+ def state(self) -> StateModel:
30
+ return self._state.model_copy(deep=True)
31
+
32
+ def step(self, action: Action) -> Tuple[Observation, RewardModel, bool, Dict[str, object]]:
33
+ if self._state.done:
34
+ reward = build_reward({"invalid_after_done": -0.1}, "Episode already finished.")
35
+ info = StepInfo(
36
+ task_id=self._task.task_id,
37
+ step_count=self._state.step_count,
38
+ task_score=self._state.latest_score.get("task_score", 0.0),
39
+ done_reason="already_done",
40
+ event="invalid_after_done",
41
+ event_score=reward.components,
42
+ )
43
+ return self._build_observation(), reward, True, info.model_dump()
44
+
45
+ self._state.step_count += 1
46
+ event_scores: Dict[str, float] = {"step_penalty": STEP_PENALTY}
47
+ event_name = action.action_type
48
+ done_reason = None
49
+
50
+ if action.action_type == "inspect_ticket":
51
+ event_scores.update(self._handle_inspect(action))
52
+ elif action.action_type == "request_context":
53
+ event_scores.update(self._handle_request_context(action))
54
+ elif action.action_type == "set_priority":
55
+ event_scores.update(self._handle_priority(action))
56
+ elif action.action_type == "set_route":
57
+ event_scores.update(self._handle_route(action))
58
+ elif action.action_type == "set_resolution":
59
+ event_scores.update(self._handle_resolution(action))
60
+ elif action.action_type == "escalate":
61
+ event_scores.update(self._handle_escalation(action))
62
+ elif action.action_type == "rank_queue":
63
+ event_scores.update(self._handle_rank_queue(action))
64
+ elif action.action_type == "finalize":
65
+ self._state.done = True
66
+ done_reason = "agent_finalized"
67
+ grade = grade_task(self._task, self._state)
68
+ self._state.latest_score = {"task_score": grade.score, **grade.component_scores}
69
+ event_scores["terminal_grade"] = grade.score
70
+ reward = build_reward(event_scores, "Final task grade applied.")
71
+ self._state.cumulative_reward = round(self._state.cumulative_reward + reward.value, 4)
72
+ info = StepInfo(
73
+ task_id=self._task.task_id,
74
+ step_count=self._state.step_count,
75
+ task_score=grade.score,
76
+ done_reason=done_reason,
77
+ grade=grade,
78
+ event=event_name,
79
+ event_score=reward.components,
80
+ )
81
+ return self._build_observation(), reward, True, info.model_dump()
82
+ else:
83
+ event_scores["invalid_action"] = -0.1
84
+ event_name = "invalid_action"
85
+
86
+ grade = grade_task(self._task, self._state)
87
+ self._state.latest_score = {"task_score": grade.score, **grade.component_scores}
88
+
89
+ if self._state.step_count >= self._task.max_steps and not self._state.done:
90
+ self._state.done = True
91
+ done_reason = "max_steps"
92
+ event_scores["timeout_grade"] = grade.score
93
+
94
+ reward = build_reward(event_scores, f"Processed {event_name}.")
95
+ self._state.cumulative_reward = round(self._state.cumulative_reward + reward.value, 4)
96
+ info = StepInfo(
97
+ task_id=self._task.task_id,
98
+ step_count=self._state.step_count,
99
+ task_score=grade.score,
100
+ done_reason=done_reason,
101
+ grade=grade if self._state.done else None,
102
+ event=event_name,
103
+ event_score=reward.components,
104
+ )
105
+ return self._build_observation(), reward, self._state.done, info.model_dump()
106
+
107
+ def _build_observation(self) -> Observation:
108
+ tickets: List[TicketObservation] = []
109
+ for ticket in self._task.tickets:
110
+ keys = self._state.discovered_keys.get(ticket.ticket_id, [])
111
+ discovered_context = {key: ticket.hidden_context[key] for key in keys}
112
+ tickets.append(
113
+ TicketObservation(
114
+ ticket_id=ticket.ticket_id,
115
+ summary=ticket.summary,
116
+ visible_context=ticket.visible_context,
117
+ discovered_context=discovered_context,
118
+ selected_priority=self._state.priorities.get(ticket.ticket_id),
119
+ selected_route=self._state.routes.get(ticket.ticket_id),
120
+ selected_resolution=self._state.resolutions.get(ticket.ticket_id),
121
+ escalation_team=self._state.escalations.get(ticket.ticket_id),
122
+ )
123
+ )
124
+
125
+ return Observation(
126
+ task_id=self._task.task_id,
127
+ difficulty=self._task.difficulty,
128
+ title=self._task.title,
129
+ instruction=self._task.instruction,
130
+ queue_mode=self._task.queue_mode,
131
+ tickets=tickets,
132
+ remaining_steps=max(self._task.max_steps - self._state.step_count, 0),
133
+ available_actions=[
134
+ "inspect_ticket",
135
+ "request_context",
136
+ "set_priority",
137
+ "set_route",
138
+ "set_resolution",
139
+ "escalate",
140
+ "rank_queue",
141
+ "finalize",
142
+ ],
143
+ current_queue_order=self._state.queue_order,
144
+ score_hint=self._state.latest_score,
145
+ )
146
+
147
+ def _find_ticket(self, ticket_id: str):
148
+ for ticket in self._task.tickets:
149
+ if ticket.ticket_id == ticket_id:
150
+ return ticket
151
+ return None
152
+
153
+ def _handle_inspect(self, action: Action) -> Dict[str, float]:
154
+ ticket = self._find_ticket(action.target)
155
+ if ticket is None:
156
+ return {"invalid_ticket": -0.1}
157
+ key = f"inspected::{ticket.ticket_id}"
158
+ notes = self._state.latest_score.setdefault("inspections", 0.0)
159
+ if notes and key in self._state.latest_score:
160
+ return {"redundant_inspect": -0.03}
161
+ self._state.latest_score[key] = 1.0
162
+ return {"inspect": 0.03}
163
+
164
+ def _handle_request_context(self, action: Action) -> Dict[str, float]:
165
+ ticket = self._find_ticket(action.target)
166
+ if ticket is None or not action.value:
167
+ return {"invalid_context_request": -0.1}
168
+ if action.value not in ticket.hidden_context:
169
+ return {"unknown_context_key": -0.08}
170
+
171
+ discovered = discovered_for_ticket(self._state.discovered_keys, ticket.ticket_id)
172
+ if action.value in discovered:
173
+ return {"redundant_context_request": -0.05}
174
+
175
+ discovered.append(action.value)
176
+ if action.value in ticket.required_context:
177
+ return {"required_context_found": 0.12}
178
+ return {"optional_context_found": 0.04}
179
+
180
+ def _handle_priority(self, action: Action) -> Dict[str, float]:
181
+ ticket = self._find_ticket(action.target)
182
+ if ticket is None or not action.value:
183
+ return {"invalid_priority": -0.1}
184
+ current = self._state.priorities.get(ticket.ticket_id)
185
+ update_mapping(self._state.priorities, ticket.ticket_id, action.value)
186
+ if action.value == current:
187
+ return {"redundant_priority": -0.03}
188
+ return {"priority_match": 0.08 if action.value == ticket.gold_priority else -0.04}
189
+
190
+ def _handle_route(self, action: Action) -> Dict[str, float]:
191
+ ticket = self._find_ticket(action.target)
192
+ if ticket is None or not action.value:
193
+ return {"invalid_route": -0.1}
194
+ current = self._state.routes.get(ticket.ticket_id)
195
+ update_mapping(self._state.routes, ticket.ticket_id, action.value)
196
+ if action.value == current:
197
+ return {"redundant_route": -0.03}
198
+ return {"route_match": 0.1 if action.value == ticket.gold_route else -0.06}
199
+
200
+ def _handle_resolution(self, action: Action) -> Dict[str, float]:
201
+ ticket = self._find_ticket(action.target)
202
+ if ticket is None or not action.value:
203
+ return {"invalid_resolution": -0.1}
204
+ current = self._state.resolutions.get(ticket.ticket_id)
205
+ update_mapping(self._state.resolutions, ticket.ticket_id, action.value)
206
+ if action.value == current:
207
+ return {"redundant_resolution": -0.03}
208
+ return {"resolution_match": 0.12 if action.value == ticket.gold_resolution else -0.08}
209
+
210
+ def _handle_escalation(self, action: Action) -> Dict[str, float]:
211
+ ticket = self._find_ticket(action.target)
212
+ if ticket is None:
213
+ return {"invalid_escalation": -0.1}
214
+ team = action.value
215
+ current = self._state.escalations.get(ticket.ticket_id)
216
+ update_mapping(self._state.escalations, ticket.ticket_id, team)
217
+ if team == current:
218
+ return {"redundant_escalation": -0.03}
219
+
220
+ if team == ticket.gold_escalation_team:
221
+ return {"correct_escalation": 0.1}
222
+ if ticket.gold_escalation_team is None and team is None:
223
+ return {"correct_no_escalation": 0.03}
224
+ return {"incorrect_escalation": -0.1}
225
+
226
+ def _handle_rank_queue(self, action: Action) -> Dict[str, float]:
227
+ if not self._task.queue_mode or not action.value:
228
+ return {"invalid_queue_ranking": -0.1}
229
+ ranked = [item.strip() for item in action.value.split(",") if item.strip()]
230
+ valid_ticket_ids = {ticket.ticket_id for ticket in self._task.tickets}
231
+ if set(ranked) != valid_ticket_ids:
232
+ return {"malformed_queue_ranking": -0.08}
233
+ self._state.queue_order = ranked
234
+ correct_positions = sum(
235
+ 1 for observed, expected in zip(ranked, self._task.gold_queue_order) if observed == expected
236
+ )
237
+ return {"queue_progress": round((correct_positions / len(ranked)) * 0.12, 4)}
support_ops_env/graders/__init__.py ADDED
@@ -0,0 +1,19 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from __future__ import annotations
2
+
3
+ from typing import Callable, Dict
4
+
5
+ from ..models import StateModel, TaskGrade, TaskSpec
6
+ from .easy import grade as easy_grade
7
+ from .hard import grade as hard_grade
8
+ from .medium import grade as medium_grade
9
+
10
+
11
+ GRADERS: Dict[str, Callable[[TaskSpec, StateModel], TaskGrade]] = {
12
+ "easy_support_routing": easy_grade,
13
+ "medium_support_resolution": medium_grade,
14
+ "hard_support_queue": hard_grade,
15
+ }
16
+
17
+
18
+ def grade_task(task: TaskSpec, state: StateModel) -> TaskGrade:
19
+ return GRADERS[task.grader_name](task, state)
support_ops_env/graders/__pycache__/__init__.cpython-313.pyc ADDED
Binary file (955 Bytes). View file
 
support_ops_env/graders/__pycache__/common.cpython-313.pyc ADDED
Binary file (6.43 kB). View file
 
support_ops_env/graders/__pycache__/easy.cpython-313.pyc ADDED
Binary file (709 Bytes). View file
 
support_ops_env/graders/__pycache__/hard.cpython-313.pyc ADDED
Binary file (729 Bytes). View file
 
support_ops_env/graders/__pycache__/medium.cpython-313.pyc ADDED
Binary file (711 Bytes). View file
 
support_ops_env/graders/common.py ADDED
@@ -0,0 +1,106 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from __future__ import annotations
2
+
3
+ from typing import Dict, List
4
+
5
+ from ..models import StateModel, TaskGrade, TaskSpec, TicketSpec
6
+
7
+
8
+ def _ticket_component(
9
+ ticket: TicketSpec,
10
+ state: StateModel,
11
+ weights: Dict[str, float],
12
+ ) -> Dict[str, float]:
13
+ discovered = set(state.discovered_keys.get(ticket.ticket_id, []))
14
+ required = set(ticket.required_context)
15
+ context_score = 1.0 if not required else len(discovered & required) / len(required)
16
+ escalation_value = state.escalations.get(ticket.ticket_id)
17
+ gold_escalation = ticket.gold_escalation_team
18
+ escalation_score = 1.0 if escalation_value == gold_escalation else 0.0
19
+ if gold_escalation is None and escalation_value is None:
20
+ escalation_score = 1.0
21
+
22
+ raw = {
23
+ "context": context_score,
24
+ "priority": 1.0 if state.priorities.get(ticket.ticket_id) == ticket.gold_priority else 0.0,
25
+ "route": 1.0 if state.routes.get(ticket.ticket_id) == ticket.gold_route else 0.0,
26
+ "resolution": 1.0 if state.resolutions.get(ticket.ticket_id) == ticket.gold_resolution else 0.0,
27
+ "escalation": escalation_score,
28
+ }
29
+ return {name: raw[name] * weights.get(name, 0.0) for name in raw}
30
+
31
+
32
+ def grade_single_ticket(
33
+ task: TaskSpec,
34
+ state: StateModel,
35
+ weights: Dict[str, float],
36
+ ) -> TaskGrade:
37
+ ticket = task.tickets[0]
38
+ weighted = _ticket_component(ticket, state, weights)
39
+ score = round(sum(weighted.values()), 4)
40
+ notes = _notes_for_ticket(ticket, state)
41
+ return TaskGrade(
42
+ task_id=task.task_id,
43
+ score=score,
44
+ passed=score >= 0.8,
45
+ component_scores=weighted,
46
+ notes=notes,
47
+ )
48
+
49
+
50
+ def grade_queue_task(
51
+ task: TaskSpec,
52
+ state: StateModel,
53
+ weights: Dict[str, float],
54
+ ) -> TaskGrade:
55
+ ticket_scores: List[float] = []
56
+ component_sums = {
57
+ "context": 0.0,
58
+ "priority": 0.0,
59
+ "route": 0.0,
60
+ "resolution": 0.0,
61
+ "escalation": 0.0,
62
+ }
63
+ notes: List[str] = []
64
+ for ticket in task.tickets:
65
+ weighted = _ticket_component(ticket, state, weights)
66
+ for name, value in weighted.items():
67
+ component_sums[name] += value
68
+ ticket_scores.append(sum(weighted.values()))
69
+ notes.extend(_notes_for_ticket(ticket, state))
70
+
71
+ divisor = max(len(task.tickets), 1)
72
+ averaged = {name: round(value / divisor, 4) for name, value in component_sums.items()}
73
+
74
+ ranking_score = 0.0
75
+ if task.gold_queue_order:
76
+ matches = sum(
77
+ 1 for observed, expected in zip(state.queue_order, task.gold_queue_order) if observed == expected
78
+ )
79
+ ranking_score = round((matches / len(task.gold_queue_order)) * weights.get("ranking", 0.0), 4)
80
+
81
+ averaged["ranking"] = ranking_score
82
+ score = round(sum(averaged.values()), 4)
83
+ return TaskGrade(
84
+ task_id=task.task_id,
85
+ score=score,
86
+ passed=score >= 0.8,
87
+ component_scores=averaged,
88
+ notes=notes,
89
+ )
90
+
91
+
92
+ def _notes_for_ticket(ticket: TicketSpec, state: StateModel) -> List[str]:
93
+ notes: List[str] = []
94
+ if state.priorities.get(ticket.ticket_id) != ticket.gold_priority:
95
+ notes.append(f"{ticket.ticket_id}: incorrect priority")
96
+ if state.routes.get(ticket.ticket_id) != ticket.gold_route:
97
+ notes.append(f"{ticket.ticket_id}: incorrect route")
98
+ if state.resolutions.get(ticket.ticket_id) != ticket.gold_resolution:
99
+ notes.append(f"{ticket.ticket_id}: incorrect resolution")
100
+ if state.escalations.get(ticket.ticket_id) != ticket.gold_escalation_team:
101
+ if not (ticket.gold_escalation_team is None and state.escalations.get(ticket.ticket_id) is None):
102
+ notes.append(f"{ticket.ticket_id}: incorrect escalation")
103
+ missing = set(ticket.required_context) - set(state.discovered_keys.get(ticket.ticket_id, []))
104
+ if missing:
105
+ notes.append(f"{ticket.ticket_id}: missing required context {sorted(missing)}")
106
+ return notes
support_ops_env/graders/easy.py ADDED
@@ -0,0 +1,17 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from __future__ import annotations
2
+
3
+ from ..models import StateModel, TaskGrade, TaskSpec
4
+ from .common import grade_single_ticket
5
+
6
+
7
+ WEIGHTS = {
8
+ "context": 0.2,
9
+ "priority": 0.2,
10
+ "route": 0.25,
11
+ "resolution": 0.2,
12
+ "escalation": 0.15,
13
+ }
14
+
15
+
16
+ def grade(task: TaskSpec, state: StateModel) -> TaskGrade:
17
+ return grade_single_ticket(task, state, WEIGHTS)
support_ops_env/graders/hard.py ADDED
@@ -0,0 +1,18 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from __future__ import annotations
2
+
3
+ from ..models import StateModel, TaskGrade, TaskSpec
4
+ from .common import grade_queue_task
5
+
6
+
7
+ WEIGHTS = {
8
+ "context": 0.1,
9
+ "priority": 0.2,
10
+ "route": 0.25,
11
+ "resolution": 0.2,
12
+ "escalation": 0.1,
13
+ "ranking": 0.15,
14
+ }
15
+
16
+
17
+ def grade(task: TaskSpec, state: StateModel) -> TaskGrade:
18
+ return grade_queue_task(task, state, WEIGHTS)
support_ops_env/graders/medium.py ADDED
@@ -0,0 +1,17 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from __future__ import annotations
2
+
3
+ from ..models import StateModel, TaskGrade, TaskSpec
4
+ from .common import grade_single_ticket
5
+
6
+
7
+ WEIGHTS = {
8
+ "context": 0.25,
9
+ "priority": 0.15,
10
+ "route": 0.25,
11
+ "resolution": 0.25,
12
+ "escalation": 0.1,
13
+ }
14
+
15
+
16
+ def grade(task: TaskSpec, state: StateModel) -> TaskGrade:
17
+ return grade_single_ticket(task, state, WEIGHTS)
support_ops_env/models.py ADDED
@@ -0,0 +1,119 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from __future__ import annotations
2
+
3
+ from typing import Any, Dict, List, Literal, Optional
4
+
5
+ from pydantic import BaseModel, Field
6
+
7
+
8
+ ActionType = Literal[
9
+ "inspect_ticket",
10
+ "request_context",
11
+ "set_priority",
12
+ "set_route",
13
+ "set_resolution",
14
+ "escalate",
15
+ "rank_queue",
16
+ "finalize",
17
+ ]
18
+
19
+
20
+ class RewardModel(BaseModel):
21
+ value: float
22
+ components: Dict[str, float] = Field(default_factory=dict)
23
+ rationale: str = ""
24
+
25
+
26
+ class Action(BaseModel):
27
+ action_type: ActionType
28
+ target: str = "T1"
29
+ value: Optional[str] = None
30
+
31
+
32
+ class TicketObservation(BaseModel):
33
+ ticket_id: str
34
+ summary: str
35
+ visible_context: Dict[str, str]
36
+ discovered_context: Dict[str, str] = Field(default_factory=dict)
37
+ selected_priority: Optional[str] = None
38
+ selected_route: Optional[str] = None
39
+ selected_resolution: Optional[str] = None
40
+ escalation_team: Optional[str] = None
41
+
42
+
43
+ class Observation(BaseModel):
44
+ task_id: str
45
+ difficulty: Literal["easy", "medium", "hard"]
46
+ title: str
47
+ instruction: str
48
+ queue_mode: bool
49
+ tickets: List[TicketObservation]
50
+ remaining_steps: int
51
+ available_actions: List[str]
52
+ current_queue_order: List[str] = Field(default_factory=list)
53
+ score_hint: Dict[str, float] = Field(default_factory=dict)
54
+
55
+
56
+ class StateModel(BaseModel):
57
+ task_id: str
58
+ step_count: int
59
+ done: bool
60
+ discovered_keys: Dict[str, List[str]]
61
+ priorities: Dict[str, Optional[str]]
62
+ routes: Dict[str, Optional[str]]
63
+ resolutions: Dict[str, Optional[str]]
64
+ escalations: Dict[str, Optional[str]]
65
+ queue_order: List[str]
66
+ cumulative_reward: float
67
+ latest_score: Dict[str, float] = Field(default_factory=dict)
68
+
69
+
70
+ class TicketSpec(BaseModel):
71
+ ticket_id: str
72
+ summary: str
73
+ visible_context: Dict[str, str]
74
+ hidden_context: Dict[str, str]
75
+ required_context: List[str]
76
+ gold_priority: str
77
+ gold_route: str
78
+ gold_resolution: str
79
+ gold_escalation_team: Optional[str] = None
80
+
81
+
82
+ class TaskSpec(BaseModel):
83
+ task_id: str
84
+ difficulty: Literal["easy", "medium", "hard"]
85
+ title: str
86
+ description: str
87
+ instruction: str
88
+ max_steps: int
89
+ queue_mode: bool = False
90
+ tickets: List[TicketSpec]
91
+ gold_queue_order: List[str] = Field(default_factory=list)
92
+ grader_name: str
93
+ reward_weights: Dict[str, float] = Field(default_factory=dict)
94
+
95
+
96
+ class TaskGrade(BaseModel):
97
+ task_id: str
98
+ score: float
99
+ passed: bool
100
+ component_scores: Dict[str, float]
101
+ notes: List[str] = Field(default_factory=list)
102
+
103
+
104
+ class StepInfo(BaseModel):
105
+ task_id: str
106
+ step_count: int
107
+ task_score: float
108
+ done_reason: Optional[str] = None
109
+ grade: Optional[TaskGrade] = None
110
+ event: str = ""
111
+ event_score: Dict[str, float] = Field(default_factory=dict)
112
+
113
+
114
+ class BaselineResult(BaseModel):
115
+ task_id: str
116
+ difficulty: str
117
+ score: float
118
+ steps: int
119
+ transcript: List[Dict[str, Any]]
support_ops_env/reward.py ADDED
@@ -0,0 +1,13 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from __future__ import annotations
2
+
3
+ from typing import Dict
4
+
5
+ from .models import RewardModel
6
+
7
+
8
+ STEP_PENALTY = -0.01
9
+
10
+
11
+ def build_reward(components: Dict[str, float], rationale: str) -> RewardModel:
12
+ value = round(sum(components.values()), 4)
13
+ return RewardModel(value=value, components=components, rationale=rationale)
support_ops_env/state.py ADDED
@@ -0,0 +1,34 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from __future__ import annotations
2
+
3
+ from typing import Dict, List, Optional
4
+
5
+ from .models import StateModel, TaskSpec
6
+
7
+
8
+ def initial_tracking(task: TaskSpec) -> StateModel:
9
+ return StateModel(
10
+ task_id=task.task_id,
11
+ step_count=0,
12
+ done=False,
13
+ discovered_keys={ticket.ticket_id: [] for ticket in task.tickets},
14
+ priorities={ticket.ticket_id: None for ticket in task.tickets},
15
+ routes={ticket.ticket_id: None for ticket in task.tickets},
16
+ resolutions={ticket.ticket_id: None for ticket in task.tickets},
17
+ escalations={ticket.ticket_id: None for ticket in task.tickets},
18
+ queue_order=[],
19
+ cumulative_reward=0.0,
20
+ latest_score={},
21
+ )
22
+
23
+
24
+ def update_mapping(
25
+ current: Dict[str, Optional[str]],
26
+ ticket_id: str,
27
+ value: Optional[str],
28
+ ) -> Dict[str, Optional[str]]:
29
+ current[ticket_id] = value
30
+ return current
31
+
32
+
33
+ def discovered_for_ticket(discovered_keys: Dict[str, List[str]], ticket_id: str) -> List[str]:
34
+ return discovered_keys.setdefault(ticket_id, [])
support_ops_env/tasks/__init__.py ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ from .loader import get_all_tasks, get_task, list_task_ids
2
+
3
+ __all__ = ["get_all_tasks", "get_task", "list_task_ids"]
support_ops_env/tasks/__pycache__/__init__.cpython-313.pyc ADDED
Binary file (306 Bytes). View file
 
support_ops_env/tasks/__pycache__/loader.cpython-313.pyc ADDED
Binary file (2.02 kB). View file
 
support_ops_env/tasks/loader.py ADDED
@@ -0,0 +1,35 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from __future__ import annotations
2
+
3
+ import json
4
+ from pathlib import Path
5
+ from typing import Dict, List
6
+
7
+ from ..models import TaskSpec
8
+
9
+
10
+ DATA_DIR = Path(__file__).resolve().parent.parent / "data"
11
+
12
+
13
+ def _load_file(name: str) -> List[TaskSpec]:
14
+ path = DATA_DIR / name
15
+ with path.open("r", encoding="utf-8") as handle:
16
+ raw = json.load(handle)
17
+ return [TaskSpec.model_validate(item) for item in raw]
18
+
19
+
20
+ def get_all_tasks() -> List[TaskSpec]:
21
+ tasks: List[TaskSpec] = []
22
+ for filename in ("easy_cases.json", "medium_cases.json", "hard_cases.json"):
23
+ tasks.extend(_load_file(filename))
24
+ return tasks
25
+
26
+
27
+ def get_task(task_id: str) -> TaskSpec:
28
+ for task in get_all_tasks():
29
+ if task.task_id == task_id:
30
+ return task
31
+ raise KeyError(f"Unknown task_id: {task_id}")
32
+
33
+
34
+ def list_task_ids() -> List[str]:
35
+ return [task.task_id for task in get_all_tasks()]
tests/__pycache__/test_env.cpython-313.pyc ADDED
Binary file (2.61 kB). View file
 
tests/__pycache__/test_graders.cpython-313.pyc ADDED
Binary file (1.85 kB). View file