Dar3devil committed · Commit 2b73c16 · verified · 1 Parent(s): b95386f

Initial customer support OpenEnv upload

.dockerignore ADDED
@@ -0,0 +1,6 @@
+ __pycache__/
+ .pytest_cache/
+ .git/
+ .env
+ outputs/
+ pytest-cache-files-*
.gitignore ADDED
@@ -0,0 +1,6 @@
+ __pycache__/
+ .pytest_cache/
+ *.pyc
+ .env
+ outputs/
+ pytest-cache-files-*
README.md CHANGED
@@ -1,11 +1,176 @@
- ---
- title: Customer Support Openenv
- emoji: 👁
- colorFrom: blue
- colorTo: blue
- sdk: docker
- pinned: false
- short_description: 'Meta Env Hackathon Submission Space '
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # AcmeCloud Customer Support Ticket Handler
+
+ A deterministic OpenEnv-style environment for training and evaluating agents on realistic B2B SaaS support workflows.
+
+ ## What It Simulates
+
+ Each episode is one inbound customer-support ticket at a fictional company, `AcmeCloud`.
+ The agent acts like a support representative and must choose the right sequence of typed tool actions to handle the ticket correctly.
+
+ The benchmark ships with three fixed tasks:
+
+ 1. `password_reset_guidance`
+ 2. `duplicate_charge_refund`
+ 3. `enterprise_data_loss_escalation`
+
+ ## Why This Is Useful
+
+ This environment models a real operational task rather than a toy game:
+
+ - reading support tickets
+ - searching internal knowledge base articles
+ - looking up customer account details
+ - deciding whether to resolve, refund, or escalate
+ - sending customer-facing replies under policy constraints
+
+ The environment is fully deterministic and graded without any LLM judge, which makes it suitable for reproducible RL rollouts and benchmark evaluation.
+
+ ## Action Space
+
+ The agent can take exactly six typed actions:
+
+ - `search_kb(query: str)`
+ - `lookup_account(customer_id: str)`
+ - `send_reply(message: str)`
+ - `issue_refund(amount_cents: int, reason_code: "duplicate_charge")`
+ - `resolve_ticket(resolution_code: "password_reset_guidance" | "billing_refund_processed")`
+ - `escalate_ticket(queue: "support_lead" | "legal_data_incident", priority: "P2" | "P0", summary: str)`
+
+ ## Observation Space
+
+ Each observation includes:
+
+ - task and ticket identifiers
+ - current ticket status
+ - customer metadata
+ - customer message and full conversation history
+ - the last tool result
+ - steps taken / remaining
+ - available action types
+ - last action error
+ - accumulated known facts learned from prior tool calls
+
+ ## Reward Design
+
+ The environment uses rubric-based reward shaping.
+
+ - Each task has a deterministic scorecard in `[0.0, 1.0]`
+ - Step reward is `score_delta - 0.01 - invalid_penalty - redundancy_penalty`
+ - Repeated search/lookup actions incur `-0.02`
+ - Invalid actions incur `-0.10`
+ - `resolve_ticket` and `escalate_ticket` terminate the episode
+ - `issue_refund` changes state but does not terminate the episode
+
+ Global success threshold: `0.75`
+
+ ## Task Details
+
+ ### 1. Password Reset Guidance
+
+ Customer issue: reset email did not arrive.
+
+ Expected flow:
+
+ - search password reset KB article
+ - send reply with reset URL and spam/junk guidance
+ - resolve with `password_reset_guidance`
+
+ ### 2. Duplicate Charge Refund
+
+ Customer issue: billed twice for the current subscription period.
+
+ Expected flow:
+
+ - lookup the account
+ - search the refund policy
+ - issue the verified duplicate-charge refund
+ - reply with apology and timeline
+ - resolve with `billing_refund_processed`
+
+ ### 3. Enterprise Data Loss Escalation
+
+ Customer issue: enterprise data-loss complaint with legal threat.
+
+ Expected flow:
+
+ - lookup the account
+ - send a careful acknowledgment reply
+ - escalate to `legal_data_incident` with `P0`
+ - do not refund
+ - do not resolve
+
+ ## Project Layout
+
+ - `support_ticket_env/`: models, fixtures, scoring, environment core, policy helpers, local/HTTP client
+ - `server/`: FastAPI app and Dockerfile
+ - `tests/`: unit and scenario tests
+ - `inference.py`: baseline runner using the OpenAI client interface
+ - `openenv.yaml`: environment metadata
+
+ ## Local Setup
+
+ ```bash
+ python -m pip install -e .[dev]
+ pytest
+ uvicorn server.app:app --host 0.0.0.0 --port 8000 --reload
+ ```
+
+ Open the docs at [http://localhost:8000/docs](http://localhost:8000/docs) or the simple UI at [http://localhost:8000/web](http://localhost:8000/web).
+
+ ## Docker
+
+ ```bash
+ docker build -t customer-support-openenv -f server/Dockerfile .
+ docker run -p 8000:8000 customer-support-openenv
+ ```
+
+ ## Baseline Inference
+
+ The baseline script uses the OpenAI client interface and supports any OpenAI-compatible endpoint.
+
+ Environment variables:
+
+ - `HF_TOKEN` or `OPENAI_API_KEY`
+ - `API_BASE_URL`
+ - `MODEL_NAME`
+ - optional `ENV_BASE_URL` if you want the script to hit a running server instead of the in-process environment
+
+ Run:
+
+ ```bash
+ python inference.py
+ ```
+
+ The script emits strict stdout lines in the required format:
+
+ - `[START]`
+ - `[STEP]`
+ - `[END]`
+
+ If the model call fails or credentials are missing, the script falls back to a deterministic scripted policy so the benchmark still runs reproducibly.
+
+ ## Example Gold Scores
+
+ Using the included scripted policy:
+
+ - `password_reset_guidance`: `1.0`
+ - `duplicate_charge_refund`: `1.0`
+ - `enterprise_data_loss_escalation`: `1.0`
+
+ ## Deployment Notes
+
+ - The app exposes `/health`, `/reset`, `/step`, `/state`, `/docs`, `/web`, and `/ws`
+ - Sessions are managed in-memory
+ - No external services are required to run the environment server itself
+ - The benchmark is designed to fit comfortably in the hackathon resource limits
+
+ ## Validation
+
+ If `openenv` is installed locally, run:
+
+ ```bash
+ openenv validate
+ ```
+
+ This repository does not depend on an LLM judge for grading.
+ All graders are deterministic and implemented directly in the environment scorer.
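Review note on the Reward Design section of README.md: the step-reward formula quoted there can be sketched as below. This is a minimal illustration only, not the code in `support_ticket_env/scoring.py`, which is not shown in this part of the diff. The function name `step_reward` and its keyword flags are hypothetical, and the sketch assumes the invalid and redundancy penalties may stack on one step, which the README does not state.

```python
# Constants taken from the README's Reward Design section.
STEP_COST = 0.01           # flat per-step cost
REDUNDANCY_PENALTY = 0.02  # repeated search/lookup actions
INVALID_PENALTY = 0.10     # invalid actions


def step_reward(
    score_before: float,
    score_after: float,
    *,
    invalid: bool = False,
    redundant: bool = False,
) -> float:
    """Return score_delta - 0.01 - invalid_penalty - redundancy_penalty.

    Hypothetical helper: the real scorer may compute these terms differently.
    """
    score_delta = score_after - score_before
    reward = score_delta - STEP_COST
    if invalid:
        reward -= INVALID_PENALTY
    if redundant:
        reward -= REDUNDANCY_PENALTY
    return reward


if __name__ == "__main__":
    # A step that raises the rubric score by 0.25 and triggers no penalties.
    print(f"{step_reward(0.0, 0.25):.2f}")
    # A repeated lookup that leaves the score unchanged.
    print(f"{step_reward(0.5, 0.5, redundant=True):.2f}")
```

Under these assumptions a useful step nets its score gain minus the flat cost, while a repeated lookup that adds nothing is strictly negative, which is what discourages padding an episode with redundant tool calls.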
customer_support_openenv.egg-info/PKG-INFO ADDED
@@ -0,0 +1,190 @@
+ Metadata-Version: 2.4
+ Name: customer-support-openenv
+ Version: 0.1.0
+ Summary: Deterministic OpenEnv-style customer support ticket benchmark for B2B SaaS workflows.
+ Requires-Python: >=3.11
+ Description-Content-Type: text/markdown
+ Requires-Dist: fastapi>=0.115
+ Requires-Dist: openenv-core>=0.2.0
+ Requires-Dist: openai>=1.30
+ Requires-Dist: pydantic>=2.7
+ Requires-Dist: uvicorn>=0.30
+ Provides-Extra: dev
+ Requires-Dist: pytest>=8.0; extra == "dev"
+
+ # AcmeCloud Customer Support Ticket Handler
+
+ A deterministic OpenEnv-style environment for training and evaluating agents on realistic B2B SaaS support workflows.
+
+ ## What It Simulates
+
+ Each episode is one inbound customer-support ticket at a fictional company, `AcmeCloud`.
+ The agent acts like a support representative and must choose the right sequence of typed tool actions to handle the ticket correctly.
+
+ The benchmark ships with three fixed tasks:
+
+ 1. `password_reset_guidance`
+ 2. `duplicate_charge_refund`
+ 3. `enterprise_data_loss_escalation`
+
+ ## Why This Is Useful
+
+ This environment models a real operational task rather than a toy game:
+
+ - reading support tickets
+ - searching internal knowledge base articles
+ - looking up customer account details
+ - deciding whether to resolve, refund, or escalate
+ - sending customer-facing replies under policy constraints
+
+ The environment is fully deterministic and graded without any LLM judge, which makes it suitable for reproducible RL rollouts and benchmark evaluation.
+
+ ## Action Space
+
+ The agent can take exactly six typed actions:
+
+ - `search_kb(query: str)`
+ - `lookup_account(customer_id: str)`
+ - `send_reply(message: str)`
+ - `issue_refund(amount_cents: int, reason_code: "duplicate_charge")`
+ - `resolve_ticket(resolution_code: "password_reset_guidance" | "billing_refund_processed")`
+ - `escalate_ticket(queue: "support_lead" | "legal_data_incident", priority: "P2" | "P0", summary: str)`
+
+ ## Observation Space
+
+ Each observation includes:
+
+ - task and ticket identifiers
+ - current ticket status
+ - customer metadata
+ - customer message and full conversation history
+ - the last tool result
+ - steps taken / remaining
+ - available action types
+ - last action error
+ - accumulated known facts learned from prior tool calls
+
+ ## Reward Design
+
+ The environment uses rubric-based reward shaping.
+
+ - Each task has a deterministic scorecard in `[0.0, 1.0]`
+ - Step reward is `score_delta - 0.01 - invalid_penalty - redundancy_penalty`
+ - Repeated search/lookup actions incur `-0.02`
+ - Invalid actions incur `-0.10`
+ - `resolve_ticket` and `escalate_ticket` terminate the episode
+ - `issue_refund` changes state but does not terminate the episode
+
+ Global success threshold: `0.75`
+
+ ## Task Details
+
+ ### 1. Password Reset Guidance
+
+ Customer issue: reset email did not arrive.
+
+ Expected flow:
+
+ - search password reset KB article
+ - send reply with reset URL and spam/junk guidance
+ - resolve with `password_reset_guidance`
+
+ ### 2. Duplicate Charge Refund
+
+ Customer issue: billed twice for the current subscription period.
+
+ Expected flow:
+
+ - lookup the account
+ - search the refund policy
+ - issue the verified duplicate-charge refund
+ - reply with apology and timeline
+ - resolve with `billing_refund_processed`
+
+ ### 3. Enterprise Data Loss Escalation
+
+ Customer issue: enterprise data-loss complaint with legal threat.
+
+ Expected flow:
+
+ - lookup the account
+ - send a careful acknowledgment reply
+ - escalate to `legal_data_incident` with `P0`
+ - do not refund
+ - do not resolve
+
+ ## Project Layout
+
+ - `support_ticket_env/`: models, fixtures, scoring, environment core, policy helpers, local/HTTP client
+ - `server/`: FastAPI app and Dockerfile
+ - `tests/`: unit and scenario tests
+ - `inference.py`: baseline runner using the OpenAI client interface
+ - `openenv.yaml`: environment metadata
+
+ ## Local Setup
+
+ ```bash
+ python -m pip install -e .[dev]
+ pytest
+ uvicorn server.app:app --host 0.0.0.0 --port 8000 --reload
+ ```
+
+ Open the docs at [http://localhost:8000/docs](http://localhost:8000/docs) or the simple UI at [http://localhost:8000/web](http://localhost:8000/web).
+
+ ## Docker
+
+ ```bash
+ docker build -t customer-support-openenv -f server/Dockerfile .
+ docker run -p 8000:8000 customer-support-openenv
+ ```
+
+ ## Baseline Inference
+
+ The baseline script uses the OpenAI client interface and supports any OpenAI-compatible endpoint.
+
+ Environment variables:
+
+ - `HF_TOKEN` or `OPENAI_API_KEY`
+ - `API_BASE_URL`
+ - `MODEL_NAME`
+ - optional `ENV_BASE_URL` if you want the script to hit a running server instead of the in-process environment
+
+ Run:
+
+ ```bash
+ python inference.py
+ ```
+
+ The script emits strict stdout lines in the required format:
+
+ - `[START]`
+ - `[STEP]`
+ - `[END]`
+
+ If the model call fails or credentials are missing, the script falls back to a deterministic scripted policy so the benchmark still runs reproducibly.
+
+ ## Example Gold Scores
+
+ Using the included scripted policy:
+
+ - `password_reset_guidance`: `1.0`
+ - `duplicate_charge_refund`: `1.0`
+ - `enterprise_data_loss_escalation`: `1.0`
+
+ ## Deployment Notes
+
+ - The app exposes `/health`, `/reset`, `/step`, `/state`, `/docs`, `/web`, and `/ws`
+ - Sessions are managed in-memory
+ - No external services are required to run the environment server itself
+ - The benchmark is designed to fit comfortably in the hackathon resource limits
+
+ ## Validation
+
+ If `openenv` is installed locally, run:
+
+ ```bash
+ openenv validate
+ ```
+
+ This repository does not depend on an LLM judge for grading.
+ All graders are deterministic and implemented directly in the environment scorer.
customer_support_openenv.egg-info/SOURCES.txt ADDED
@@ -0,0 +1,20 @@
+ README.md
+ pyproject.toml
+ customer_support_openenv.egg-info/PKG-INFO
+ customer_support_openenv.egg-info/SOURCES.txt
+ customer_support_openenv.egg-info/dependency_links.txt
+ customer_support_openenv.egg-info/entry_points.txt
+ customer_support_openenv.egg-info/requires.txt
+ customer_support_openenv.egg-info/top_level.txt
+ server/__init__.py
+ server/app.py
+ support_ticket_env/__init__.py
+ support_ticket_env/client.py
+ support_ticket_env/env.py
+ support_ticket_env/fixtures.py
+ support_ticket_env/models.py
+ support_ticket_env/policies.py
+ support_ticket_env/scoring.py
+ tests/test_env.py
+ tests/test_models.py
+ tests/test_scenarios.py
customer_support_openenv.egg-info/dependency_links.txt ADDED
@@ -0,0 +1 @@
+
customer_support_openenv.egg-info/entry_points.txt ADDED
@@ -0,0 +1,2 @@
+ [console_scripts]
+ server = server.app:main
customer_support_openenv.egg-info/requires.txt ADDED
@@ -0,0 +1,8 @@
+ fastapi>=0.115
+ openenv-core>=0.2.0
+ openai>=1.30
+ pydantic>=2.7
+ uvicorn>=0.30
+
+ [dev]
+ pytest>=8.0
customer_support_openenv.egg-info/top_level.txt ADDED
@@ -0,0 +1,2 @@
+ server
+ support_ticket_env
inference.py ADDED
@@ -0,0 +1,139 @@
+ from __future__ import annotations
+
+ import json
+ import os
+ import sys
+ from typing import Any
+
+ from openai import OpenAI
+
+ from support_ticket_env import BENCHMARK_NAME, DEFAULT_SUCCESS_THRESHOLD, SupportTicketEnv, fallback_action, list_task_ids, parse_action
+
+ API_KEY = os.getenv("HF_TOKEN") or os.getenv("OPENAI_API_KEY") or os.getenv("API_KEY")
+ API_BASE_URL = os.getenv("API_BASE_URL") or "https://router.huggingface.co/v1"
+ MODEL_NAME = os.getenv("MODEL_NAME") or "Qwen/Qwen2.5-72B-Instruct"
+ ENV_BASE_URL = os.getenv("ENV_BASE_URL")
+ TEMPERATURE = 0.0
+ MAX_TOKENS = 220
+ SUCCESS_THRESHOLD = DEFAULT_SUCCESS_THRESHOLD
+
+ SYSTEM_PROMPT = """You are operating a deterministic customer-support environment.
+ Choose exactly one tool action at each step and respond with exactly one JSON object.
+ Valid actions:
+ - {\"action_type\": \"search_kb\", \"query\": \"...\"}
+ - {\"action_type\": \"lookup_account\", \"customer_id\": \"...\"}
+ - {\"action_type\": \"send_reply\", \"message\": \"...\"}
+ - {\"action_type\": \"issue_refund\", \"amount_cents\": 4900, \"reason_code\": \"duplicate_charge\"}
+ - {\"action_type\": \"resolve_ticket\", \"resolution_code\": \"password_reset_guidance\"}
+ - {\"action_type\": \"resolve_ticket\", \"resolution_code\": \"billing_refund_processed\"}
+ - {\"action_type\": \"escalate_ticket\", \"queue\": \"support_lead\", \"priority\": \"P2\", \"summary\": \"...\"}
+ - {\"action_type\": \"escalate_ticket\", \"queue\": \"legal_data_incident\", \"priority\": \"P0\", \"summary\": \"...\"}
+ Do not include markdown, code fences, or explanations."""
+
+
+ def log_start(task: str, env: str, model: str) -> None:
+     print(f"[START] task={task} env={env} model={model}", flush=True)
+
+
+ def log_step(step: int, action: str, reward: float, done: bool, error: str | None) -> None:
+     error_value = "null" if not error else error.replace("\n", " ")
+     print(
+         f"[STEP] step={step} action={action} reward={reward:.2f} done={str(done).lower()} error={error_value}",
+         flush=True,
+     )
+
+
+ def log_end(success: bool, steps: int, score: float, rewards: list[float]) -> None:
+     rewards_str = ",".join(f"{reward:.2f}" for reward in rewards)
+     print(
+         f"[END] success={str(success).lower()} steps={steps} score={score:.3f} rewards={rewards_str}",
+         flush=True,
+     )
+
+
+ def _strip_code_fences(text: str) -> str:
+     cleaned = text.strip()
+     if cleaned.startswith("```"):
+         lines = cleaned.splitlines()
+         if lines and lines[0].startswith("```"):
+             lines = lines[1:]
+         if lines and lines[-1].startswith("```"):
+             lines = lines[:-1]
+         cleaned = "\n".join(lines).strip()
+     return cleaned
+
+
+ def _extract_json_object(text: str) -> dict[str, Any]:
+     cleaned = _strip_code_fences(text)
+     start = cleaned.find("{")
+     end = cleaned.rfind("}")
+     if start == -1 or end == -1 or end <= start:
+         raise ValueError("No JSON object found in model response")
+     return json.loads(cleaned[start : end + 1])
+
+
+ def build_user_prompt(observation: dict[str, Any]) -> str:
+     return (
+         "Choose the next best action for this support ticket. "
+         "Keep it valid and deterministic. Observation JSON:\n"
+         f"{json.dumps(observation, indent=2)}"
+     )
+
+
+ def choose_action(client: OpenAI | None, observation) -> Any:
+     fallback = fallback_action(observation)
+     if client is None:
+         return fallback
+
+     try:
+         completion = client.chat.completions.create(
+             model=MODEL_NAME,
+             messages=[
+                 {"role": "system", "content": SYSTEM_PROMPT},
+                 {"role": "user", "content": build_user_prompt(observation.model_dump(mode="json"))},
+             ],
+             temperature=TEMPERATURE,
+             max_tokens=MAX_TOKENS,
+         )
+         content = (completion.choices[0].message.content or "").strip()
+         return parse_action(_extract_json_object(content))
+     except Exception as exc:  # pragma: no cover - depends on external endpoint
+         print(f"[DEBUG] Falling back to scripted policy: {exc}", file=sys.stderr, flush=True)
+         return fallback
+
+
+ def run_episode(task_id: str, client: OpenAI | None) -> None:
+     env = SupportTicketEnv(base_url=ENV_BASE_URL, task_id=task_id)
+     rewards: list[float] = []
+     steps_taken = 0
+     final_score = 0.0
+     success = False
+
+     log_start(task=task_id, env=BENCHMARK_NAME, model=MODEL_NAME)
+
+     try:
+         result = env.reset(task_id)
+         while not result.done:
+             action = choose_action(client, result.observation)
+             result = env.step(action)
+             steps_taken += 1
+             rewards.append(result.reward)
+             action_str = json.dumps(action.model_dump(mode="json"), separators=(",", ":"))
+             log_step(
+                 step=steps_taken,
+                 action=action_str,
+                 reward=result.reward,
+                 done=result.done,
+                 error=result.observation.last_action_error,
+             )
+         final_score = float(result.info.get("score", 0.0))
+         success = final_score >= SUCCESS_THRESHOLD
+     finally:
+         env.close()
+         log_end(success=success, steps=steps_taken, score=final_score, rewards=rewards)
+
+
+ if __name__ == "__main__":
+     client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY) if API_KEY else None
+     for task_id in list_task_ids():
+         run_episode(task_id, client)
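Review note on inference.py: a downstream consumer of the `[STEP]` lines printed by `log_step` might parse them roughly as follows. This is a sketch under assumptions, not part of the commit. The names `STEP_RE` and `parse_step_line` are hypothetical, and the pattern relies on the field order and the two-decimal reward format visible in `log_step`. Because the action JSON can contain spaces inside string values, the pattern matches on the field names rather than splitting on whitespace.

```python
import json
import re

# Mirrors the f-string in log_step: step, action, reward, done, error, in that order.
STEP_RE = re.compile(
    r"^\[STEP\] step=(?P<step>\d+) action=(?P<action>.*) "
    r"reward=(?P<reward>-?\d+\.\d{2}) done=(?P<done>true|false) error=(?P<error>.*)$"
)


def parse_step_line(line: str) -> dict:
    """Parse one [STEP] stdout line into a dict (hypothetical helper)."""
    match = STEP_RE.match(line.strip())
    if match is None:
        raise ValueError(f"Not a [STEP] line: {line!r}")
    error = match["error"]
    return {
        "step": int(match["step"]),
        "action": json.loads(match["action"]),  # compact JSON written by json.dumps
        "reward": float(match["reward"]),
        "done": match["done"] == "true",
        "error": None if error == "null" else error,  # log_step writes "null" when there is no error
    }


if __name__ == "__main__":
    sample = (
        '[STEP] step=2 action={"action_type":"send_reply",'
        '"message":"Please check your spam folder."} reward=0.24 done=false error=null'
    )
    print(parse_step_line(sample))
```

The greedy `action` group means an error message that itself contained the text ` reward=... done=... error=` would confuse the parser; a stricter log format would be needed to rule that out.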
openenv.yaml ADDED
@@ -0,0 +1,11 @@
+ name: customer_support_ticket_handler
+ version: 0.1.0
+ description: Deterministic B2B SaaS support benchmark with typed tool actions and rubric-based rewards.
+ entrypoint: server.app:app
+ runtime: fastapi
+ port: 8000
+ tags:
+   - openenv
+   - customer-support
+   - reinforcement-learning
+   - benchmark
pyproject.toml ADDED
@@ -0,0 +1,32 @@
+ [build-system]
+ requires = ["setuptools>=68", "wheel"]
+ build-backend = "setuptools.build_meta"
+
+ [project]
+ name = "customer-support-openenv"
+ version = "0.1.0"
+ description = "Deterministic OpenEnv-style customer support ticket benchmark for B2B SaaS workflows."
+ readme = "README.md"
+ requires-python = ">=3.11"
+ dependencies = [
+     "fastapi>=0.115",
+     "openenv-core>=0.2.0",
+     "openai>=1.30",
+     "pydantic>=2.7",
+     "uvicorn>=0.30",
+ ]
+
+ [project.scripts]
+ server = "server.app:main"
+
+ [project.optional-dependencies]
+ dev = ["pytest>=8.0"]
+
+ [tool.setuptools.packages.find]
+ where = ["."]
+ include = ["support_ticket_env", "support_ticket_env.*", "server", "server.*"]
+
+ [tool.pytest.ini_options]
+ addopts = "-p no:cacheprovider"
+ pythonpath = ["."]
+ testpaths = ["tests"]
server/Dockerfile ADDED
@@ -0,0 +1,19 @@
+ FROM python:3.12-slim
+
+ ENV PYTHONDONTWRITEBYTECODE=1 \
+     PYTHONUNBUFFERED=1 \
+     PORT=8000
+
+ WORKDIR /app
+
+ COPY pyproject.toml README.md openenv.yaml ./
+ COPY support_ticket_env ./support_ticket_env
+ COPY server ./server
+ COPY inference.py ./inference.py
+
+ RUN pip install --no-cache-dir --upgrade pip && \
+     pip install --no-cache-dir .
+
+ EXPOSE 8000
+
+ CMD ["python", "-m", "server.app"]
server/__init__.py ADDED
@@ -0,0 +1 @@
+
server/app.py ADDED
@@ -0,0 +1,168 @@
+ from __future__ import annotations
+
+ from threading import Lock
+ from uuid import uuid4
+ from typing import Any
+
+ from fastapi import FastAPI, HTTPException, WebSocket, WebSocketDisconnect
+ from fastapi.responses import HTMLResponse
+ from pydantic import BaseModel, ConfigDict
+
+ from support_ticket_env import SupportTicketEnvironment, list_task_ids
+
+
+ class ResetRequest(BaseModel):
+     model_config = ConfigDict(extra="forbid")
+
+     task_id: str | None = None
+     session_id: str | None = None
+
+
+ class StepRequest(BaseModel):
+     model_config = ConfigDict(extra="forbid")
+
+     session_id: str
+     action: dict[str, Any]
+
+
+ class SessionManager:
+     def __init__(self) -> None:
+         self._sessions: dict[str, SupportTicketEnvironment] = {}
+         self._lock = Lock()
+
+     def create_or_reuse(self, session_id: str | None = None, task_id: str | None = None) -> tuple[str, SupportTicketEnvironment]:
+         with self._lock:
+             if session_id and session_id in self._sessions:
+                 return session_id, self._sessions[session_id]
+             new_session_id = session_id or str(uuid4())
+             env = SupportTicketEnvironment(task_id=task_id)
+             self._sessions[new_session_id] = env
+             return new_session_id, env
+
+     def get(self, session_id: str) -> SupportTicketEnvironment:
+         with self._lock:
+             if session_id not in self._sessions:
+                 raise KeyError(session_id)
+             return self._sessions[session_id]
+
+     def delete(self, session_id: str) -> None:
+         with self._lock:
+             self._sessions.pop(session_id, None)
+
+
+ manager = SessionManager()
+ app = FastAPI(
+     title="AcmeCloud Customer Support Ticket Handler",
+     version="0.1.0",
+     description="Deterministic OpenEnv-style customer support benchmark for B2B SaaS ticket handling.",
+ )
+
+
+ def _step_payload(result, session_id: str) -> dict[str, Any]:
+     payload = result.model_dump(mode="json")
+     payload.setdefault("info", {})["session_id"] = session_id
+     return payload
+
+
+ @app.get("/health")
+ def health() -> dict[str, Any]:
+     return {"status": "healthy", "tasks": list_task_ids()}
+
+
+ @app.post("/reset")
+ def reset(request: ResetRequest) -> dict[str, Any]:
+     session_id, env = manager.create_or_reuse(request.session_id, request.task_id)
+     result = env.reset(request.task_id)
+     return _step_payload(result, session_id)
+
+
+ @app.post("/step")
+ def step(request: StepRequest) -> dict[str, Any]:
+     try:
+         env = manager.get(request.session_id)
+     except KeyError as exc:
+         raise HTTPException(status_code=404, detail=f"Unknown session_id: {request.session_id}") from exc
+     result = env.step(request.action)
+     return _step_payload(result, request.session_id)
+
+
+ @app.get("/state")
+ def state(session_id: str) -> dict[str, Any]:
+     try:
+         env = manager.get(session_id)
+     except KeyError as exc:
+         raise HTTPException(status_code=404, detail=f"Unknown session_id: {session_id}") from exc
+     return {"session_id": session_id, **env.state()}
+
+
+ @app.delete("/session/{session_id}")
+ def close_session(session_id: str) -> dict[str, str]:
+     manager.delete(session_id)
+     return {"status": "deleted", "session_id": session_id}
+
+
+ @app.get("/web")
+ def web_ui() -> HTMLResponse:
+     task_items = "".join(f"<li><code>{task_id}</code></li>" for task_id in list_task_ids())
+     html = f"""
+     <html>
+       <head>
+         <title>AcmeCloud Customer Support Ticket Handler</title>
+         <style>
+           body {{ font-family: Segoe UI, sans-serif; margin: 2rem auto; max-width: 900px; line-height: 1.5; }}
+           code {{ background: #f4f4f4; padding: 0.15rem 0.35rem; border-radius: 0.25rem; }}
+           pre {{ background: #111827; color: #f9fafb; padding: 1rem; border-radius: 0.5rem; overflow-x: auto; }}
+         </style>
+       </head>
+       <body>
+         <h1>AcmeCloud Customer Support Ticket Handler</h1>
+         <p>One episode equals one support ticket. Available fixed tasks:</p>
+         <ul>{task_items}</ul>
+         <p>Example local reset:</p>
+         <pre>curl -X POST http://localhost:8000/reset -H "Content-Type: application/json" -d '{{"task_id":"password_reset_guidance"}}'</pre>
+       </body>
+     </html>
+     """
+     return HTMLResponse(html)
+
+
+ @app.websocket("/ws")
+ async def websocket_endpoint(websocket: WebSocket) -> None:
+     await websocket.accept()
+     session_id = str(uuid4())
+     env = SupportTicketEnvironment()
+     try:
+         while True:
+             payload = await websocket.receive_json()
+             message_type = payload.get("type")
+             if message_type == "reset":
+                 result = env.reset(payload.get("task_id"))
+                 await websocket.send_json(_step_payload(result, session_id))
+             elif message_type == "step":
+                 result = env.step(payload.get("action", {}))
+                 await websocket.send_json(_step_payload(result, session_id))
+             elif message_type == "state":
+                 await websocket.send_json({"session_id": session_id, **env.state()})
+             elif message_type == "close":
+                 await websocket.send_json({"status": "closed", "session_id": session_id})
+                 break
+             else:
+                 await websocket.send_json(
+                     {
+                         "error": "unsupported_message_type",
+                         "message": "Use reset, step, state, or close.",
+                         "session_id": session_id,
+                     }
+                 )
+     except WebSocketDisconnect:
+         return
+
+
+ def main() -> None:
+     import uvicorn
+
+     uvicorn.run("server.app:app", host="0.0.0.0", port=8000, reload=False)
+
+
+ if __name__ == "__main__":
+     main()
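Review note on server/app.py: the `/ws` handler dispatches on a `type` field with the values `reset`, `step`, `state`, and `close`. The sketch below builds the JSON frames a client might send for the password-reset task. It is illustrative only: the builder function names are hypothetical, the action field names follow the `SYSTEM_PROMPT` in inference.py, and the query and reply text are made-up placeholders. Nothing here opens a socket.

```python
from __future__ import annotations

import json
from typing import Any


def reset_message(task_id: str | None = None) -> dict[str, Any]:
    """Frame handled by the `reset` branch; task_id is optional on the server side."""
    message: dict[str, Any] = {"type": "reset"}
    if task_id is not None:
        message["task_id"] = task_id
    return message


def step_message(action: dict[str, Any]) -> dict[str, Any]:
    """Frame handled by the `step` branch; the action dict is passed to env.step()."""
    return {"type": "step", "action": action}


def close_message() -> dict[str, Any]:
    """Frame handled by the `close` branch, which ends the server loop."""
    return {"type": "close"}


def password_reset_frames() -> list[str]:
    """Serialized frames for one hypothetical password-reset episode."""
    actions = [
        {"action_type": "search_kb", "query": "password reset email"},  # placeholder query
        {"action_type": "send_reply", "message": "Placeholder reply text."},  # placeholder text
        {"action_type": "resolve_ticket", "resolution_code": "password_reset_guidance"},
    ]
    messages = [reset_message("password_reset_guidance")]
    messages += [step_message(action) for action in actions]
    messages.append(close_message())
    return [json.dumps(message) for message in messages]


if __name__ == "__main__":
    for frame in password_reset_frames():
        print(frame)
```

Each frame could then be sent with any WebSocket client; the server replies to `reset` and `step` with the step payload plus a `session_id` under `info`.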
server/requirements.txt ADDED
@@ -0,0 +1,4 @@
+ fastapi>=0.115
+ uvicorn>=0.30
+ pydantic>=2.7
+ openai>=1.30
support_ticket_env/__init__.py ADDED
@@ -0,0 +1,42 @@
+ from .client import SupportTicketEnv
+ from .env import SupportTicketEnvironment
+ from .fixtures import BENCHMARK_NAME, DEFAULT_SUCCESS_THRESHOLD, KB_ARTICLES, TASK_FIXTURES, list_task_ids
+ from .models import (
+     ACTION_TYPE_NAMES,
+     EscalateTicketAction,
+     IssueRefundAction,
+     LookupAccountAction,
+     ResolveTicketAction,
+     SearchKBAction,
+     SendReplyAction,
+     SupportTicketAction,
+     SupportTicketObservation,
+     SupportTicketStepResult,
+     TaskScorecard,
+     parse_action,
+ )
+ from .policies import fallback_action, scripted_policy
+
+ __all__ = [
+     "ACTION_TYPE_NAMES",
+     "BENCHMARK_NAME",
+     "DEFAULT_SUCCESS_THRESHOLD",
+     "EscalateTicketAction",
+     "IssueRefundAction",
+     "KB_ARTICLES",
+     "LookupAccountAction",
+     "ResolveTicketAction",
+     "SearchKBAction",
+     "SendReplyAction",
+     "SupportTicketAction",
+     "SupportTicketEnv",
+     "SupportTicketEnvironment",
+     "SupportTicketObservation",
+     "SupportTicketStepResult",
+     "TASK_FIXTURES",
+     "TaskScorecard",
+     "fallback_action",
+     "list_task_ids",
+     "parse_action",
+     "scripted_policy",
+ ]
support_ticket_env/client.py ADDED
@@ -0,0 +1,99 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
+ from __future__ import annotations
+
+ import json
+ from typing import Any
+ from urllib import parse, request
+
+ from .env import SupportTicketEnvironment
+ from .fixtures import list_task_ids
+ from .models import SupportTicketAction, SupportTicketStepResult
+
+
+ class SupportTicketEnv:
+     def __init__(self, base_url: str | None = None, task_id: str | None = None) -> None:
+         self.base_url = base_url.rstrip("/") if base_url else None
+         self.task_id = task_id
+         self.session_id: str | None = None
+         self._local_env = SupportTicketEnvironment(task_id=task_id) if not self.base_url else None
+
+     @classmethod
+     def from_docker_image(
+         cls,
+         image_name: str,
+         base_url: str = "http://localhost:8000",
+         task_id: str | None = None,
+     ) -> "SupportTicketEnv":
+         del image_name
+         return cls(base_url=base_url, task_id=task_id)
+
+     @classmethod
+     def from_env(
+         cls,
+         repo_id: str,
+         base_url: str,
+         task_id: str | None = None,
+     ) -> "SupportTicketEnv":
+         del repo_id
+         return cls(base_url=base_url, task_id=task_id)
+
+     def reset(self, task_id: str | None = None) -> SupportTicketStepResult:
+         effective_task_id = task_id or self.task_id
+         if self._local_env is not None:
+             return self._local_env.reset(effective_task_id)
+
+         payload = {}
+         if effective_task_id:
+             payload["task_id"] = effective_task_id
+         if self.session_id:
+             payload["session_id"] = self.session_id
+         result = self._post_json("/reset", payload)
+         self.session_id = result.info.get("session_id")
+         return result
+
+     def step(self, action: SupportTicketAction | dict[str, Any]) -> SupportTicketStepResult:
+         if self._local_env is not None:
+             return self._local_env.step(action)
+
+         payload = {
+             "session_id": self.session_id,
+             "action": action.model_dump(mode="json") if hasattr(action, "model_dump") else action,
+         }
+         result = self._post_json("/step", payload)
+         self.session_id = result.info.get("session_id", self.session_id)
+         return result
+
+     def state(self) -> dict[str, Any]:
+         if self._local_env is not None:
+             return self._local_env.state()
+
+         if not self.session_id:
+             raise RuntimeError("reset() must be called before state() when using HTTP mode.")
+         query = parse.urlencode({"session_id": self.session_id})
+         with request.urlopen(f"{self.base_url}/state?{query}") as response:
+             return json.loads(response.read().decode("utf-8"))
+
+     def close(self) -> None:
+         self.session_id = None
+
+     def __enter__(self) -> "SupportTicketEnv":
+         return self
+
+     def __exit__(self, exc_type, exc, tb) -> None:
+         self.close()
+         return None
+
+     @staticmethod
+     def list_tasks() -> list[str]:
+         return list_task_ids()
+
+     def _post_json(self, path: str, payload: dict[str, Any]) -> SupportTicketStepResult:
+         body = json.dumps(payload).encode("utf-8")
+         req = request.Request(
+             f"{self.base_url}{path}",
+             data=body,
+             headers={"Content-Type": "application/json"},
+             method="POST",
+         )
+         with request.urlopen(req) as response:
+             data = json.loads(response.read().decode("utf-8"))
+         return SupportTicketStepResult.model_validate(data)
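
In HTTP mode the client above serializes each call as a JSON POST body built with `urllib.request`. The sketch below isolates that request-building step so the wire format can be inspected without a running server. The helper name `build_json_request`, the base URL, and the session id are illustrative assumptions and are not part of the commit.

```python
import json
from typing import Any
from urllib import request


def build_json_request(base_url: str, path: str, payload: dict[str, Any]) -> request.Request:
    """Build (but do not send) a JSON POST request shaped like the one in _post_json."""
    body = json.dumps(payload).encode("utf-8")
    return request.Request(
        f"{base_url.rstrip('/')}{path}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical values, for illustration only.
req = build_json_request(
    "http://localhost:8000/",
    "/step",
    {
        "session_id": "abc123",
        "action": {"action_type": "search_kb", "query": "password reset email"},
    },
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Nothing is sent until the request is passed to `request.urlopen`, so a payload can be checked this way before pointing the client at a live endpoint.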
support_ticket_env/env.py ADDED
@@ -0,0 +1,423 @@
+ from __future__ import annotations
+
+ from dataclasses import dataclass, field
+ from typing import Any
+
+ from .fixtures import (
+     BENCHMARK_NAME,
+     DEFAULT_SUCCESS_THRESHOLD,
+     KB_ARTICLES,
+     KnowledgeBaseArticle,
+     TaskFixture,
+     get_task_fixture,
+     list_task_ids,
+ )
+ from .models import (
+     ACTION_TYPE_NAMES,
+     AccountLookupResult,
+     ConversationTurn,
+     KBSearchResult,
+     ErrorToolResult,
+     EscalateTicketAction,
+     EscalationResult,
+     IssueRefundAction,
+     LookupAccountAction,
+     RefundResult,
+     ReplyResult,
+     ResolveResult,
+     SearchKBAction,
+     SupportTicketAction,
+     SupportTicketObservation,
+     SupportTicketStepResult,
+     ToolResult,
+     parse_action,
+ )
+ from .scoring import build_scorecard, normalize_text
+
+
+ @dataclass
+ class SessionState:
+     fixture: TaskFixture
+     ticket_status: str = "open"
+     steps_taken: int = 0
+     conversation_history: list[ConversationTurn] = field(default_factory=list)
+     action_history: list[dict[str, Any]] = field(default_factory=list)
+     reply_history: list[dict[str, Any]] = field(default_factory=list)
+     known_facts: dict[str, Any] = field(default_factory=dict)
+     kb_articles_seen: set[str] = field(default_factory=set)
+     search_signatures: set[str] = field(default_factory=set)
+     lookup_performed: bool = False
+     lookup_customer_id: str | None = None
+     refund_record: dict[str, Any] | None = None
+     refund_attempted: bool = False
+     resolution_code: str | None = None
+     escalation: dict[str, Any] | None = None
+     done: bool = False
+     terminal_reason: str | None = None
+     previous_score: float = 0.0
+     last_tool_result: ToolResult | None = None
+     last_action_error: str | None = None
+
+
+ class SupportTicketEnvironment:
+     benchmark_name = BENCHMARK_NAME
+     max_steps = 8
+     step_cost = 0.01
+     invalid_action_penalty = 0.10
+     repeated_action_penalty = 0.02
+     success_threshold = DEFAULT_SUCCESS_THRESHOLD
+
+     def __init__(self, task_id: str | None = None) -> None:
+         self._default_task_id = task_id or list_task_ids()[0]
+         self._session: SessionState | None = None
+
+     def reset(self, task_id: str | None = None) -> SupportTicketStepResult:
+         fixture = get_task_fixture(task_id or self._default_task_id)
+         self._session = SessionState(
+             fixture=fixture,
+             conversation_history=[
+                 ConversationTurn(
+                     role="customer",
+                     message=fixture.ticket.message,
+                     step_index=0,
+                 )
+             ],
+         )
+         return self._build_result(reward=0.0)
+
+     def step(self, action: SupportTicketAction | dict[str, Any]) -> SupportTicketStepResult:
+         session = self._require_session()
+         if session.done:
+             session.last_action_error = "episode_already_done"
+             session.last_tool_result = ErrorToolResult(
+                 tool_name="error",
+                 success=False,
+                 error_code="episode_already_done",
+                 message="This ticket is already terminal. Reset the environment before stepping again.",
+             )
+             return self._build_result(reward=-self.invalid_action_penalty)
+
+         invalid_penalty = 0.0
+         redundancy_penalty = 0.0
+         session.last_action_error = None
+
+         try:
+             parsed_action = parse_action(action)
+         except Exception as exc:
+             session.steps_taken += 1
+             session.last_action_error = f"invalid_action: {exc}"
+             session.last_tool_result = ErrorToolResult(
+                 tool_name="error",
+                 success=False,
+                 error_code="invalid_action",
+                 message=str(exc),
+             )
+             invalid_penalty = self.invalid_action_penalty
+             self._record_action({"action_type": "invalid"}, False)
+             if session.steps_taken >= self.max_steps:
+                 session.done = True
+                 session.terminal_reason = "max_steps_exceeded"
+             return self._finalize_step(invalid_penalty=invalid_penalty, redundancy_penalty=0.0)
+
+         session.steps_taken += 1
+         session.last_tool_result, invalid_penalty, redundancy_penalty = self._apply_action(parsed_action)
+         action_succeeded = bool(getattr(session.last_tool_result, "success", False))
+         self._record_action(parsed_action.model_dump(mode="json"), action_succeeded)
+
+         if not session.done and session.steps_taken >= self.max_steps:
+             session.done = True
+             session.terminal_reason = "max_steps_exceeded"
+
+         return self._finalize_step(
+             invalid_penalty=invalid_penalty,
+             redundancy_penalty=redundancy_penalty,
+         )
+
+     def state(self) -> dict[str, Any]:
+         session = self._require_session()
+         scorecard = build_scorecard(session.fixture, session)
+         return {
+             "benchmark_name": self.benchmark_name,
+             "task_id": session.fixture.task_id,
+             "ticket_status": session.ticket_status,
+             "steps_taken": session.steps_taken,
+             "steps_remaining": max(self.max_steps - session.steps_taken, 0),
+             "conversation_history": [turn.model_dump(mode="json") for turn in session.conversation_history],
+             "audit_log": list(session.action_history),
+             "known_facts": dict(session.known_facts),
+             "current_rubric_score": scorecard.score,
+             "score_breakdown": scorecard.model_dump(mode="json"),
+             "terminal_reason": session.terminal_reason,
+             "done": session.done,
+         }
+
+     def _apply_action(self, action: SupportTicketAction) -> tuple[ToolResult, float, float]:
+         session = self._require_session()
+         invalid_penalty = 0.0
+         redundancy_penalty = 0.0
+
+         if isinstance(action, SearchKBAction):
+             query_signature = normalize_text(action.query)
+             if query_signature in session.search_signatures:
+                 redundancy_penalty = self.repeated_action_penalty
+             session.search_signatures.add(query_signature)
+             articles = self._search_knowledge_base(action.query)
+             article_ids = [article.article_id for article in articles]
+             session.kb_articles_seen.update(article_ids)
+             session.known_facts["kb_articles_seen"] = sorted(session.kb_articles_seen)
+             session.known_facts["kb_titles_seen"] = [KB_ARTICLES[article_id].title for article_id in sorted(session.kb_articles_seen)]
+             result = KBSearchResult(
+                 tool_name="search_kb",
+                 success=bool(articles),
+                 query=action.query,
+                 article_ids=article_ids,
+                 snippets=[article.snippet for article in articles],
+                 message="Knowledge base search completed." if articles else "No KB articles matched the query.",
+             )
+             return result, invalid_penalty, redundancy_penalty
+
+         if isinstance(action, LookupAccountAction):
+             if action.customer_id != session.fixture.account.customer_id:
+                 session.last_action_error = "unknown_customer_id"
+                 result = ErrorToolResult(
+                     tool_name="error",
+                     success=False,
+                     error_code="unknown_customer_id",
+                     message=f"No account found for customer_id={action.customer_id}.",
+                 )
+                 return result, self.invalid_action_penalty, redundancy_penalty
+
+             if session.lookup_performed and session.lookup_customer_id == action.customer_id:
+                 redundancy_penalty = self.repeated_action_penalty
+
+             account = session.fixture.account
+             session.lookup_performed = True
+             session.lookup_customer_id = action.customer_id
+             account_summary = {
+                 "customer_id": account.customer_id,
+                 "organization_name": account.organization_name,
+                 "plan": account.plan,
+                 "tenure_years": account.tenure_years,
+                 "arr_usd": account.arr_usd,
+                 "duplicate_charge_amount_cents": account.duplicate_charge_amount_cents,
+                 "duplicate_charge_count": account.duplicate_charge_count,
+                 "duplicate_charge_refund_eligible": account.duplicate_charge_refund_eligible,
+                 "legal_threat": account.legal_threat,
+                 "incident_severity": account.incident_severity,
+             }
+             session.known_facts["account"] = account_summary
+             result = AccountLookupResult(
+                 tool_name="lookup_account",
+                 success=True,
+                 customer_id=action.customer_id,
+                 account_summary=account_summary,
+                 message="Account lookup completed.",
+             )
+             return result, invalid_penalty, redundancy_penalty
+
+         if action.action_type == "send_reply":
+             reply = action.message.strip()
+             session.reply_history.append({"message": reply, "step_index": session.steps_taken})
+             session.conversation_history.append(
+                 ConversationTurn(role="agent", message=reply, step_index=session.steps_taken)
+             )
+             result = ReplyResult(
+                 tool_name="send_reply",
+                 success=True,
+                 message_preview=reply[:120],
+                 message="Reply sent to the customer.",
+             )
+             return result, invalid_penalty, redundancy_penalty
+
+         if isinstance(action, IssueRefundAction):
+             session.refund_attempted = True
+             account = session.fixture.account
+             if not session.lookup_performed:
+                 session.last_action_error = "lookup_required_before_refund"
+                 result = ErrorToolResult(
+                     tool_name="error",
+                     success=False,
+                     error_code="lookup_required_before_refund",
+                     message="lookup_account must succeed before issue_refund can be used.",
+                 )
+                 return result, self.invalid_action_penalty, redundancy_penalty
+
+             if not account.duplicate_charge_refund_eligible or not account.duplicate_charge_amount_cents:
+                 session.last_action_error = "refund_not_applicable"
+                 result = RefundResult(
+                     tool_name="issue_refund",
+                     success=False,
+                     refunded=False,
+                     amount_cents=action.amount_cents,
+                     reason_code=action.reason_code,
+                     message="No duplicate charge is eligible for refund on this account.",
+                 )
+                 return result, self.invalid_action_penalty, redundancy_penalty
+
+             if action.amount_cents != account.duplicate_charge_amount_cents or action.reason_code != "duplicate_charge":
+                 session.last_action_error = "incorrect_refund_payload"
+                 result = RefundResult(
+                     tool_name="issue_refund",
+                     success=False,
+                     refunded=False,
+                     amount_cents=action.amount_cents,
+                     reason_code=action.reason_code,
+                     message="Refund payload does not match the verified duplicate charge.",
+                 )
+                 return result, self.invalid_action_penalty, redundancy_penalty
+
+             session.refund_record = {
+                 "amount_cents": action.amount_cents,
+                 "reason_code": action.reason_code,
+                 "step_index": session.steps_taken,
+             }
+             result = RefundResult(
+                 tool_name="issue_refund",
+                 success=True,
+                 refunded=True,
+                 amount_cents=action.amount_cents,
+                 reason_code=action.reason_code,
+                 message="Refund recorded successfully.",
+             )
+             return result, invalid_penalty, redundancy_penalty
+
+         if action.action_type == "resolve_ticket":
+             session.resolution_code = action.resolution_code
+             session.ticket_status = "resolved"
+             session.done = True
+             session.terminal_reason = "resolved"
+             result = ResolveResult(
+                 tool_name="resolve_ticket",
+                 success=True,
+                 resolution_code=action.resolution_code,
+                 ticket_status="resolved",
+                 message="Ticket marked as resolved.",
+             )
+             return result, invalid_penalty, redundancy_penalty
+
+         if isinstance(action, EscalateTicketAction):
+             session.escalation = {
+                 "queue": action.queue,
+                 "priority": action.priority,
+                 "summary": action.summary,
+                 "step_index": session.steps_taken,
+             }
+             session.ticket_status = "escalated"
+             session.done = True
+             session.terminal_reason = "escalated"
+             result = EscalationResult(
+                 tool_name="escalate_ticket",
+                 success=True,
+                 queue=action.queue,
+                 priority=action.priority,
+                 summary=action.summary,
+                 ticket_status="escalated",
+                 message="Ticket escalated.",
+             )
+             return result, invalid_penalty, redundancy_penalty
+
+         session.last_action_error = "unsupported_action"
+         return (
+             ErrorToolResult(
+                 tool_name="error",
+                 success=False,
+                 error_code="unsupported_action",
+                 message=f"Unsupported action type: {type(action).__name__}",
+             ),
+             self.invalid_action_penalty,
+             redundancy_penalty,
+         )
+
+     def _search_knowledge_base(self, query: str) -> list[KnowledgeBaseArticle]:
+         query_terms = set(normalize_text(query).split())
+         ranked: list[tuple[int, str, KnowledgeBaseArticle]] = []
+         for article in KB_ARTICLES.values():
+             searchable = normalize_text(" ".join((article.title, article.content, " ".join(article.tags))))
+             article_terms = set(searchable.split())
+             score = len(query_terms & article_terms)
+             if score > 0:
+                 ranked.append((score, article.article_id, article))
+         ranked.sort(key=lambda item: (-item[0], item[1]))
+         return [article for _, _, article in ranked[:3]]
+
+     def _record_action(self, action_payload: dict[str, Any], action_succeeded: bool) -> None:
+         session = self._require_session()
+         session.action_history.append(
+             {
+                 "step_index": session.steps_taken,
+                 "action": action_payload,
+                 "success": action_succeeded,
+                 "ticket_status": session.ticket_status,
+             }
+         )
+
+     def _finalize_step(self, invalid_penalty: float, redundancy_penalty: float) -> SupportTicketStepResult:
+         session = self._require_session()
+         scorecard = build_scorecard(session.fixture, session)
+         reward = round(
+             (scorecard.score - session.previous_score) - self.step_cost - invalid_penalty - redundancy_penalty,
+             6,
+         )
+         session.previous_score = scorecard.score
+         return SupportTicketStepResult(
+             observation=self._build_observation(),
+             reward=reward,
+             done=session.done,
+             info={
+                 "task_id": session.fixture.task_id,
+                 "benchmark_name": self.benchmark_name,
+                 "score": scorecard.score,
+                 "score_breakdown": scorecard.model_dump(mode="json"),
+                 "success": scorecard.score >= self.success_threshold,
+                 "success_threshold": self.success_threshold,
+                 "terminal_reason": session.terminal_reason,
+                 "invalid_penalty": invalid_penalty,
+                 "redundancy_penalty": redundancy_penalty,
+             },
+         )
+
+     def _build_observation(self) -> SupportTicketObservation:
+         session = self._require_session()
+         ticket = session.fixture.ticket
+         return SupportTicketObservation(
+             task_id=session.fixture.task_id,
+             ticket_id=ticket.ticket_id,
+             ticket_status=session.ticket_status,
+             customer_id=ticket.customer_id,
+             organization_name=ticket.organization_name,
+             subject=ticket.subject,
+             customer_message=ticket.message,
+             conversation_history=list(session.conversation_history),
+             last_tool_result=session.last_tool_result,
+             steps_taken=session.steps_taken,
+             steps_remaining=max(self.max_steps - session.steps_taken, 0),
+             available_action_types=list(ACTION_TYPE_NAMES),
+             last_action_error=session.last_action_error,
+             known_facts=dict(session.known_facts),
+         )
+
+     def _build_result(self, reward: float) -> SupportTicketStepResult:
+         session = self._require_session()
+         scorecard = build_scorecard(session.fixture, session)
+         session.previous_score = scorecard.score
+         return SupportTicketStepResult(
+             observation=self._build_observation(),
+             reward=reward,
+             done=session.done,
+             info={
+                 "task_id": session.fixture.task_id,
+                 "benchmark_name": self.benchmark_name,
+                 "score": scorecard.score,
+                 "score_breakdown": scorecard.model_dump(mode="json"),
+                 "success": scorecard.score >= self.success_threshold,
+                 "success_threshold": self.success_threshold,
+                 "terminal_reason": session.terminal_reason,
+                 "invalid_penalty": 0.0,
+                 "redundancy_penalty": 0.0,
+             },
+         )
+
+     def _require_session(self) -> SessionState:
+         if self._session is None:
+             raise RuntimeError("Environment has not been reset yet.")
+         return self._session
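
The per-step reward in `_finalize_step` is the change in rubric score minus a fixed step cost and any penalties, rounded to six decimals. The standalone sketch below reproduces that arithmetic with the class constants copied from `SupportTicketEnvironment`. The function name `step_reward` and the score trajectory are illustrative assumptions; the real scores come from `build_scorecard`, which is not shown in this section.

```python
STEP_COST = 0.01
INVALID_ACTION_PENALTY = 0.10
REPEATED_ACTION_PENALTY = 0.02


def step_reward(score: float, previous_score: float,
                invalid_penalty: float = 0.0, redundancy_penalty: float = 0.0) -> float:
    """Score delta minus step cost and penalties, rounded as in _finalize_step."""
    return round((score - previous_score) - STEP_COST - invalid_penalty - redundancy_penalty, 6)


# Hypothetical trajectory: (rubric score after the step, invalid penalty, redundancy penalty).
trajectory = [
    (0.20, 0.0, 0.0),                      # a useful KB search raises the score
    (0.20, 0.0, REPEATED_ACTION_PENALTY),  # repeating the same search adds nothing
    (0.20, INVALID_ACTION_PENALTY, 0.0),   # a malformed action is penalized
    (0.70, 0.0, 0.0),                      # a good reply raises the score again
]

previous = 0.0
rewards = []
for score, invalid, redundant in trajectory:
    rewards.append(step_reward(score, previous, invalid, redundant))
    previous = score

print(rewards)
print(round(sum(rewards), 6))
```

Because each step rewards only the score delta, the deltas telescope: the episode return equals the final rubric score minus the accumulated step costs and penalties, so steps that do not move the rubric are always net negative.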
support_ticket_env/fixtures.py ADDED
@@ -0,0 +1,270 @@
+ from __future__ import annotations
+
+ from collections import OrderedDict
+ from dataclasses import dataclass, field
+ from typing import Literal
+
+
+ BENCHMARK_NAME = "customer_support_ticket_handler"
+ DEFAULT_SUCCESS_THRESHOLD = 0.75
+ RESET_URL = "https://app.acmecloud.com/reset"
+
+
+ @dataclass(frozen=True)
+ class KnowledgeBaseArticle:
+     article_id: str
+     title: str
+     tags: tuple[str, ...]
+     snippet: str
+     content: str
+
+
+ @dataclass(frozen=True)
+ class TicketFixture:
+     ticket_id: str
+     customer_id: str
+     organization_name: str
+     subject: str
+     message: str
+
+
+ @dataclass(frozen=True)
+ class AccountFixture:
+     customer_id: str
+     organization_name: str
+     plan: str
+     tenure_years: float | None = None
+     arr_usd: int | None = None
+     duplicate_charge_amount_cents: int | None = None
+     duplicate_charge_count: int = 0
+     duplicate_charge_refund_eligible: bool = False
+     legal_threat: bool = False
+     incident_severity: str | None = None
+     mandatory_escalation_queue: str | None = None
+     mandatory_escalation_priority: str | None = None
+
+
+ @dataclass(frozen=True)
+ class TaskFixture:
+     task_id: str
+     title: str
+     difficulty: Literal["easy", "medium", "hard"]
+     ticket: TicketFixture
+     account: AccountFixture
+     relevant_kb_article_id: str | None = None
+     expected_terminal_mode: Literal["resolve", "escalate"] = "resolve"
+     expected_resolution_code: str | None = None
+     expected_refund_amount_cents: int | None = None
+     refund_reason_code: str | None = None
+     expected_escalation_queue: str | None = None
+     expected_escalation_priority: str | None = None
+     reply_keyword_groups: dict[str, tuple[str, ...]] = field(default_factory=dict)
+     forbidden_reply_phrases: tuple[str, ...] = ()
+     rubric_weights: dict[str, float] = field(default_factory=dict)
+     efficiency_bonus_max_steps: int | None = None
+
+
+ KB_ARTICLES = OrderedDict(
+     (
+         article.article_id,
+         article,
+     )
+     for article in (
+         KnowledgeBaseArticle(
+             article_id="KB-PW-RESET",
+             title="Password reset email troubleshooting",
+             tags=("password", "reset", "email", "spam", "login"),
+             snippet="Ask the customer to use the AcmeCloud reset page, check spam or junk, and wait 5 minutes before retrying.",
+             content=(
+                 f"If a password reset email does not arrive, direct the user to {RESET_URL}. "
+                 "Ask them to check their spam or junk folder and wait 5 minutes before requesting another email."
+             ),
+         ),
+         KnowledgeBaseArticle(
+             article_id="KB-BILL-DUPLICATE",
+             title="Duplicate subscription charge refund policy",
+             tags=("billing", "refund", "duplicate", "charge", "subscription"),
+             snippet="After verifying a duplicate charge, support can refund the extra charge. Refunds settle in 3-5 business days.",
+             content=(
+                 "If account history confirms an accidental duplicate subscription charge, refund the duplicate amount in full. "
+                 "Communicate that the refund will appear in 3-5 business days."
+             ),
+         ),
+         KnowledgeBaseArticle(
+             article_id="KB-INCIDENT-LEGAL",
+             title="Critical data incident legal escalation",
+             tags=("incident", "legal", "data", "escalation", "enterprise"),
+             snippet="Legal threats and alleged customer data loss must be escalated immediately to the legal_data_incident queue at P0.",
+             content=(
+                 "If an enterprise customer reports data loss and mentions legal action, do not promise a resolution, do not admit fault, "
+                 "and escalate immediately to the legal_data_incident queue with priority P0."
+             ),
+         ),
+         KnowledgeBaseArticle(
+             article_id="KB-SSO-SETUP",
+             title="Single sign-on setup guide",
+             tags=("sso", "setup", "identity", "onboarding"),
+             snippet="Configure SAML or OIDC before enforcing SSO in production.",
+             content="SSO setup steps for administrators integrating AcmeCloud with their identity provider.",
+         ),
+         KnowledgeBaseArticle(
+             article_id="KB-INVOICE-DOWNLOAD",
+             title="Invoice download instructions",
+             tags=("invoice", "billing", "download", "finance"),
+             snippet="Billing administrators can download invoices from the Finance tab in workspace settings.",
+             content="Steps for locating billing history and downloading invoices from the AcmeCloud admin console.",
+         ),
+         KnowledgeBaseArticle(
+             article_id="KB-MFA-RESET",
+             title="Multi-factor authentication reset",
+             tags=("mfa", "reset", "login", "security"),
+             snippet="MFA resets require identity verification or admin override.",
+             content="How to reset multi-factor authentication for locked-out users.",
+         ),
+     )
+ )
+
+
+ TASK_FIXTURES = OrderedDict(
+     (
+         task.task_id,
+         task,
+     )
+     for task in (
+         TaskFixture(
+             task_id="password_reset_guidance",
+             title="Password reset guidance",
+             difficulty="easy",
+             ticket=TicketFixture(
+                 ticket_id="ticket_pw_001",
+                 customer_id="cust_pw_001",
+                 organization_name="Northstar Analytics",
+                 subject="Reset email never arrived",
+                 message="Hi, I forgot my password and the reset email isn't arriving. Please help.",
+             ),
+             account=AccountFixture(
+                 customer_id="cust_pw_001",
+                 organization_name="Northstar Analytics",
+                 plan="Pro",
+             ),
+             relevant_kb_article_id="KB-PW-RESET",
+             expected_terminal_mode="resolve",
+             expected_resolution_code="password_reset_guidance",
+             reply_keyword_groups={
+                 "reset_url": (RESET_URL,),
+                 "spam_folder": ("spam", "junk"),
+                 "wait_guidance": ("5 minutes", "five minutes"),
+             },
+             rubric_weights={
+                 "searched_kb": 0.20,
+                 "reply_has_reset_url": 0.30,
+                 "reply_mentions_spam_folder": 0.20,
+                 "resolved_correctly": 0.20,
+                 "efficient_completion": 0.10,
+             },
+             efficiency_bonus_max_steps=4,
+         ),
+         TaskFixture(
+             task_id="duplicate_charge_refund",
+             title="Duplicate subscription charge refund",
+             difficulty="medium",
+             ticket=TicketFixture(
+                 ticket_id="ticket_bill_002",
+                 customer_id="cust_bill_002",
+                 organization_name="BlueOrbit Labs",
+                 subject="Charged twice this month",
+                 message=(
+                     "I was charged twice for my subscription this month. I want both charges refunded immediately. "
+                     "I've been a customer for 3 years."
+                 ),
+             ),
+             account=AccountFixture(
+                 customer_id="cust_bill_002",
+                 organization_name="BlueOrbit Labs",
+                 plan="Business",
+                 tenure_years=3.2,
+                 duplicate_charge_amount_cents=4900,
+                 duplicate_charge_count=2,
+                 duplicate_charge_refund_eligible=True,
+             ),
+             relevant_kb_article_id="KB-BILL-DUPLICATE",
+             expected_terminal_mode="resolve",
+             expected_resolution_code="billing_refund_processed",
+             expected_refund_amount_cents=4900,
+             refund_reason_code="duplicate_charge",
+             reply_keyword_groups={
+                 "timeline": ("3-5 business days", "3 to 5 business days"),
+                 "duplicate_ack": ("duplicate charge", "double charge", "charged twice"),
+                 "regret": ("sorry", "apologize", "regret"),
+                 "refund_confirmed": ("refund", "refunded", "processed"),
+             },
+             rubric_weights={
+                 "lookup_account": 0.15,
+                 "searched_kb": 0.15,
+                 "correct_refund": 0.25,
+                 "reply_mentions_timeline": 0.20,
+                 "reply_acknowledges_and_apologizes": 0.15,
+                 "resolved_without_escalation": 0.10,
+             },
+         ),
+         TaskFixture(
+             task_id="enterprise_data_loss_escalation",
+             title="Enterprise data loss legal escalation",
+             difficulty="hard",
+             ticket=TicketFixture(
+                 ticket_id="ticket_ent_003",
+                 customer_id="cust_ent_003",
+                 organization_name="Granite Peak Holdings",
+                 subject="Critical enterprise incident",
+                 message=(
+                     "Your platform deleted 2 years of our customer data during last night's maintenance. "
+                     "We are a Fortune 500 client. Our legal team will be in contact unless this is resolved in 2 hours. "
+                     "I need to speak to your CTO immediately."
+                 ),
+             ),
+             account=AccountFixture(
+                 customer_id="cust_ent_003",
+                 organization_name="Granite Peak Holdings",
+                 plan="Enterprise",
+                 arr_usd=500000,
+                 legal_threat=True,
+                 incident_severity="data_loss",
+                 mandatory_escalation_queue="legal_data_incident",
+                 mandatory_escalation_priority="P0",
+             ),
+             relevant_kb_article_id="KB-INCIDENT-LEGAL",
+             expected_terminal_mode="escalate",
+             expected_escalation_queue="legal_data_incident",
+             expected_escalation_priority="P0",
+             reply_keyword_groups={
+                 "urgency": ("urgent", "immediately", "right away", "priority"),
+                 "escalation": ("escalating", "escalated", "investigated", "investigation"),
+             },
+             forbidden_reply_phrases=(
+                 "we deleted your data",
+                 "this is our fault",
+                 "we are liable",
+                 "we caused this",
+                 "we guarantee recovery",
+             ),
+             rubric_weights={
+                 "lookup_account": 0.10,
+                 "no_refund_or_policy_action": 0.20,
+                 "reply_sent_before_escalation": 0.15,
+                 "careful_reply": 0.20,
+                 "correct_escalation": 0.25,
+                 "not_resolved": 0.10,
+             },
+         ),
+     )
+ )
+
+
+ def list_task_ids() -> list[str]:
+     return list(TASK_FIXTURES.keys())
+
+
+ def get_task_fixture(task_id: str) -> TaskFixture:
+     if task_id not in TASK_FIXTURES:
+         raise KeyError(f"Unknown task_id: {task_id}")
+     return TASK_FIXTURES[task_id]
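
Each task fixture pairs `reply_keyword_groups` (named groups of acceptable phrasings) with `forbidden_reply_phrases`. The module that consumes them, `scoring.py`, is not shown in this section, so the sketch below is only one plausible reading: a group counts as satisfied when any of its alternatives appears in the reply, compared case-insensitively. The helper names `matched_keyword_groups` and `contains_forbidden_phrase` and the sample reply are assumptions made for illustration.

```python
def matched_keyword_groups(reply: str, groups: dict[str, tuple[str, ...]]) -> dict[str, bool]:
    """Map each group name to whether any of its alternatives occurs in the reply."""
    text = reply.lower()
    return {
        name: any(alternative.lower() in text for alternative in alternatives)
        for name, alternatives in groups.items()
    }


def contains_forbidden_phrase(reply: str, phrases: tuple[str, ...]) -> bool:
    """True if the reply contains any phrase the fixture forbids."""
    text = reply.lower()
    return any(phrase.lower() in text for phrase in phrases)


# Groups and phrases copied from the enterprise_data_loss_escalation fixture.
groups = {
    "urgency": ("urgent", "immediately", "right away", "priority"),
    "escalation": ("escalating", "escalated", "investigated", "investigation"),
}
forbidden = ("we deleted your data", "this is our fault", "we are liable")

# Hypothetical agent reply.
reply = "We are treating this as urgent and have escalated it to our incident team."

print(matched_keyword_groups(reply, groups))
print(contains_forbidden_phrase(reply, forbidden))
```

Plain substring matching keeps the check deterministic, which suits a fixed benchmark, though it would also accept a keyword embedded in a longer word.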
support_ticket_env/models.py ADDED
@@ -0,0 +1,214 @@
+ from __future__ import annotations
+
+ from typing import Annotated, Any, Literal, TypeAlias
+
+ from pydantic import BaseModel, ConfigDict, Field, TypeAdapter
+
+
+ class SearchKBAction(BaseModel):
+     model_config = ConfigDict(extra="forbid")
+
+     action_type: Literal["search_kb"]
+     query: str = Field(min_length=1)
+
+
+ class LookupAccountAction(BaseModel):
+     model_config = ConfigDict(extra="forbid")
+
+     action_type: Literal["lookup_account"]
+     customer_id: str = Field(min_length=1)
+
+
+ class SendReplyAction(BaseModel):
+     model_config = ConfigDict(extra="forbid")
+
+     action_type: Literal["send_reply"]
+     message: str = Field(min_length=1)
+
+
+ class IssueRefundAction(BaseModel):
+     model_config = ConfigDict(extra="forbid")
+
+     action_type: Literal["issue_refund"]
+     amount_cents: int = Field(gt=0)
+     reason_code: Literal["duplicate_charge"]
+
+
+ class ResolveTicketAction(BaseModel):
+     model_config = ConfigDict(extra="forbid")
+
+     action_type: Literal["resolve_ticket"]
+     resolution_code: Literal["password_reset_guidance", "billing_refund_processed"]
+
+
+ class EscalateTicketAction(BaseModel):
+     model_config = ConfigDict(extra="forbid")
+
+     action_type: Literal["escalate_ticket"]
+     queue: Literal["support_lead", "legal_data_incident"]
+     priority: Literal["P2", "P0"]
+     summary: str = Field(min_length=1)
+
+
+ SupportTicketAction: TypeAlias = Annotated[
+     SearchKBAction
+     | LookupAccountAction
+     | SendReplyAction
+     | IssueRefundAction
+     | ResolveTicketAction
+     | EscalateTicketAction,
+     Field(discriminator="action_type"),
+ ]
+
+ ACTION_ADAPTER = TypeAdapter(SupportTicketAction)
+
+ ACTION_TYPE_NAMES = [
+     "search_kb",
+     "lookup_account",
+     "send_reply",
+     "issue_refund",
+     "resolve_ticket",
+     "escalate_ticket",
+ ]
+
+
+ def parse_action(value: SupportTicketAction | dict[str, Any]) -> SupportTicketAction:
+     return ACTION_ADAPTER.validate_python(value)
+
+
+ class ConversationTurn(BaseModel):
+     model_config = ConfigDict(extra="forbid")
+
+     role: Literal["customer", "agent"]
+     message: str
+     step_index: int = Field(ge=0)
+
+
+ class KBSearchResult(BaseModel):
+     model_config = ConfigDict(extra="forbid")
+
+     tool_name: Literal["search_kb"]
+     success: bool
+     query: str
+     article_ids: list[str] = Field(default_factory=list)
+     snippets: list[str] = Field(default_factory=list)
+     message: str | None = None
+
+
+ class AccountLookupResult(BaseModel):
+     model_config = ConfigDict(extra="forbid")
+
+     tool_name: Literal["lookup_account"]
+     success: bool
+     customer_id: str
+     account_summary: dict[str, Any] = Field(default_factory=dict)
+     message: str | None = None
+
+
+ class ReplyResult(BaseModel):
+     model_config = ConfigDict(extra="forbid")
+
+     tool_name: Literal["send_reply"]
+     success: bool
+     message_preview: str
+     message: str | None = None
+
+
+ class RefundResult(BaseModel):
+     model_config = ConfigDict(extra="forbid")
+
+     tool_name: Literal["issue_refund"]
+     success: bool
+     refunded: bool
+     amount_cents: int
+     reason_code: str
+     message: str | None = None
+
+
+ class ResolveResult(BaseModel):
+     model_config = ConfigDict(extra="forbid")
+
+     tool_name: Literal["resolve_ticket"]
+     success: bool
+     resolution_code: str
+     ticket_status: Literal["resolved"]
+     message: str | None = None
+
+
+ class EscalationResult(BaseModel):
+     model_config = ConfigDict(extra="forbid")
+
+     tool_name: Literal["escalate_ticket"]
+     success: bool
+     queue: str
+     priority: str
+     summary: str
+     ticket_status: Literal["escalated"]
+     message: str | None = None
+
+
+ class ErrorToolResult(BaseModel):
+     model_config = ConfigDict(extra="forbid")
+
+     tool_name: Literal["error"]
+     success: Literal[False]
+     error_code: str
+     message: str
+
+
+ ToolResult: TypeAlias = Annotated[
+     KBSearchResult
+     | AccountLookupResult
+     | ReplyResult
+     | RefundResult
+     | ResolveResult
+     | EscalationResult
+     | ErrorToolResult,
+     Field(discriminator="tool_name"),
168
+ ]
169
+
170
+
171
+ class ScoreCriterion(BaseModel):
172
+ model_config = ConfigDict(extra="forbid")
173
+
174
+ criterion_id: str
175
+ label: str
176
+ weight: float = Field(ge=0.0, le=1.0)
177
+ earned: bool
178
+ contribution: float = Field(ge=0.0, le=1.0)
179
+
180
+
181
+ class TaskScorecard(BaseModel):
182
+ model_config = ConfigDict(extra="forbid")
183
+
184
+ task_id: str
185
+ score: float = Field(ge=0.0, le=1.0)
186
+ criteria: list[ScoreCriterion]
187
+
188
+
189
+ class SupportTicketObservation(BaseModel):
190
+ model_config = ConfigDict(extra="forbid")
191
+
192
+ task_id: str
193
+ ticket_id: str
194
+ ticket_status: Literal["open", "resolved", "escalated"]
195
+ customer_id: str
196
+ organization_name: str
197
+ subject: str
198
+ customer_message: str
199
+ conversation_history: list[ConversationTurn]
200
+ last_tool_result: ToolResult | None = None
201
+ steps_taken: int = Field(ge=0)
202
+ steps_remaining: int = Field(ge=0)
203
+ available_action_types: list[str]
204
+ last_action_error: str | None = None
205
+ known_facts: dict[str, Any] = Field(default_factory=dict)
206
+
207
+
208
+ class SupportTicketStepResult(BaseModel):
209
+ model_config = ConfigDict(extra="forbid")
210
+
211
+ observation: SupportTicketObservation
212
+ reward: float
213
+ done: bool
214
+ info: dict[str, Any] = Field(default_factory=dict)
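
The `SupportTicketAction` union above selects a concrete model from the `action_type` field, and every model rejects unknown keys via `extra="forbid"`. The block below is a minimal standard-library sketch of that dispatch idea, not the pydantic implementation: `REQUIRED_FIELDS` and `dispatch_action` are hypothetical names introduced only for illustration, and the sketch checks field presence and extra keys but not types or value constraints such as `min_length` or `gt`.

```python
from typing import Any

# Hypothetical table mirroring the fields declared on the action models.
REQUIRED_FIELDS: dict[str, set[str]] = {
    "search_kb": {"query"},
    "lookup_account": {"customer_id"},
    "send_reply": {"message"},
    "issue_refund": {"amount_cents", "reason_code"},
    "resolve_ticket": {"resolution_code"},
    "escalate_ticket": {"queue", "priority", "summary"},
}


def dispatch_action(payload: dict[str, Any]) -> str:
    """Return the action type after checking the payload's shape."""
    action_type = payload.get("action_type")
    # The discriminator must be a known string before anything else is checked.
    if not isinstance(action_type, str) or action_type not in REQUIRED_FIELDS:
        raise ValueError(f"Unknown action_type: {action_type!r}")
    expected = REQUIRED_FIELDS[action_type] | {"action_type"}
    present = set(payload)
    missing = expected - present
    extra = present - expected  # analogous to extra="forbid"
    if missing or extra:
        raise ValueError(f"missing={sorted(missing)} extra={sorted(extra)}")
    return action_type
```

In the real module, `parse_action` delegates all of this to `ACTION_ADAPTER.validate_python`, which also enforces the `Literal` and `Field` constraints.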
support_ticket_env/policies.py ADDED
@@ -0,0 +1,69 @@
+from __future__ import annotations
+
+from .fixtures import RESET_URL
+from .models import (
+    EscalateTicketAction,
+    IssueRefundAction,
+    LookupAccountAction,
+    ResolveTicketAction,
+    SearchKBAction,
+    SupportTicketAction,
+    SupportTicketObservation,
+    SendReplyAction,
+)
+
+
+def scripted_policy(task_id: str, step_index: int, customer_id: str) -> SupportTicketAction:
+    if task_id == "password_reset_guidance":
+        if step_index == 1:
+            return SearchKBAction(action_type="search_kb", query="password reset email not arriving")
+        if step_index == 2:
+            return SendReplyAction(
+                action_type="send_reply",
+                message=(
+                    f"Please use {RESET_URL}. Check your spam or junk folder, then wait 5 minutes before trying again."
+                ),
+            )
+        return ResolveTicketAction(action_type="resolve_ticket", resolution_code="password_reset_guidance")
+
+    if task_id == "duplicate_charge_refund":
+        if step_index == 1:
+            return LookupAccountAction(action_type="lookup_account", customer_id=customer_id)
+        if step_index == 2:
+            return SearchKBAction(action_type="search_kb", query="duplicate charge refund policy")
+        if step_index == 3:
+            return IssueRefundAction(action_type="issue_refund", amount_cents=4900, reason_code="duplicate_charge")
+        if step_index == 4:
+            return SendReplyAction(
+                action_type="send_reply",
+                message=(
+                    "I'm sorry about the duplicate charge. I've processed the refund for the extra subscription charge, "
+                    "and it should appear in 3-5 business days."
+                ),
+            )
+        return ResolveTicketAction(action_type="resolve_ticket", resolution_code="billing_refund_processed")
+
+    if task_id == "enterprise_data_loss_escalation":
+        if step_index == 1:
+            return LookupAccountAction(action_type="lookup_account", customer_id=customer_id)
+        if step_index == 2:
+            return SendReplyAction(
+                action_type="send_reply",
+                message=(
+                    "I understand this is urgent. I am escalating this to our legal and incident response team right now, "
+                    "and the case is being actively investigated."
+                ),
+            )
+        return EscalateTicketAction(
+            action_type="escalate_ticket",
+            queue="legal_data_incident",
+            priority="P0",
+            summary="Enterprise customer reporting possible data loss and a legal threat after maintenance.",
+        )
+
+    raise ValueError(f"Unsupported task_id: {task_id}")
+
+
+def fallback_action(observation: SupportTicketObservation) -> SupportTicketAction:
+    next_step = observation.steps_taken + 1
+    return scripted_policy(observation.task_id, next_step, observation.customer_id)
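
`fallback_action` turns the zero-based `steps_taken` counter into the one-based `step_index` that `scripted_policy` expects. The block below is a minimal sketch of that mapping for the `password_reset_guidance` task only, using plain dicts in place of the pydantic models. `SCRIPT`, `next_scripted_action`, and `EXAMPLE_RESET_URL` are hypothetical stand-ins; the real `RESET_URL` is defined in `fixtures.py` and its value is not shown in this diff.

```python
from typing import Any

# Placeholder only; the real RESET_URL lives in fixtures.py.
EXAMPLE_RESET_URL = "https://example.invalid/reset"

# Hypothetical table form of the password_reset_guidance branch.
SCRIPT: list[dict[str, Any]] = [
    {"action_type": "search_kb", "query": "password reset email not arriving"},
    {
        "action_type": "send_reply",
        "message": f"Please use {EXAMPLE_RESET_URL}. Check your spam or junk folder.",
    },
    {"action_type": "resolve_ticket", "resolution_code": "password_reset_guidance"},
]


def next_scripted_action(steps_taken: int) -> dict[str, Any]:
    """Pick the scripted action for an observation with `steps_taken` steps."""
    step_index = steps_taken + 1  # one-based, as in fallback_action
    # Steps past the scripted ones fall through to the final action,
    # like the unconditional `return ResolveTicketAction(...)` in the branch.
    position = min(step_index, len(SCRIPT)) - 1
    return SCRIPT[position]
```

Because the environment's `step` also accepts plain dicts (see `test_refund_before_lookup_is_invalid`), a table like this could in principle be fed to it directly, although the committed policy builds typed models instead.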
support_ticket_env/scoring.py ADDED
@@ -0,0 +1,218 @@
+from __future__ import annotations
+
+import re
+from typing import Any
+
+from .fixtures import TaskFixture
+from .models import ScoreCriterion, TaskScorecard
+
+
+_SPACE_RE = re.compile(r"\s+")
+
+
+def normalize_text(text: str) -> str:
+    lowered = text.lower().replace("-", " ")
+    normalized = _SPACE_RE.sub(" ", lowered)
+    return normalized.strip()
+
+
+def contains_any(text: str, phrases: tuple[str, ...] | list[str]) -> bool:
+    normalized = normalize_text(text)
+    return any(normalize_text(phrase) in normalized for phrase in phrases)
+
+
+def contains_all_groups(text: str, groups: list[tuple[str, ...]]) -> bool:
+    return all(contains_any(text, group) for group in groups)
+
+
+def _criterion(criterion_id: str, label: str, weight: float, earned: bool) -> ScoreCriterion:
+    contribution = round(weight if earned else 0.0, 6)
+    return ScoreCriterion(
+        criterion_id=criterion_id,
+        label=label,
+        weight=weight,
+        earned=earned,
+        contribution=contribution,
+    )
+
+
+def _reply_messages(state: Any) -> list[str]:
+    return [entry["message"] for entry in state.reply_history]
+
+
+def _has_reply_matching(state: Any, matcher) -> bool:
+    return any(matcher(message) for message in _reply_messages(state))
+
+
+def build_scorecard(fixture: TaskFixture, state: Any) -> TaskScorecard:
+    if fixture.task_id == "password_reset_guidance":
+        criteria = _score_password_reset(fixture, state)
+    elif fixture.task_id == "duplicate_charge_refund":
+        criteria = _score_duplicate_charge(fixture, state)
+    elif fixture.task_id == "enterprise_data_loss_escalation":
+        criteria = _score_enterprise_escalation(fixture, state)
+    else:
+        raise ValueError(f"Unsupported task_id: {fixture.task_id}")
+
+    total_score = round(sum(item.contribution for item in criteria), 6)
+    return TaskScorecard(task_id=fixture.task_id, score=min(max(total_score, 0.0), 1.0), criteria=criteria)
+
+
+def _score_password_reset(fixture: TaskFixture, state: Any) -> list[ScoreCriterion]:
+    weights = fixture.rubric_weights
+    replies = _reply_messages(state)
+    return [
+        _criterion(
+            "searched_kb",
+            "Relevant KB article retrieved",
+            weights["searched_kb"],
+            fixture.relevant_kb_article_id in state.kb_articles_seen,
+        ),
+        _criterion(
+            "reply_has_reset_url",
+            "Reply includes the password reset URL",
+            weights["reply_has_reset_url"],
+            any(fixture.reply_keyword_groups["reset_url"][0] in reply for reply in replies),
+        ),
+        _criterion(
+            "reply_mentions_spam_folder",
+            "Reply mentions checking spam or junk",
+            weights["reply_mentions_spam_folder"],
+            _has_reply_matching(state, lambda text: contains_any(text, fixture.reply_keyword_groups["spam_folder"])),
+        ),
+        _criterion(
+            "resolved_correctly",
+            "Ticket resolved with the correct resolution code",
+            weights["resolved_correctly"],
+            state.ticket_status == "resolved" and state.resolution_code == fixture.expected_resolution_code,
+        ),
+        _criterion(
+            "efficient_completion",
+            "Episode completed efficiently",
+            weights["efficient_completion"],
+            state.ticket_status == "resolved"
+            and state.resolution_code == fixture.expected_resolution_code
+            and state.steps_taken <= (fixture.efficiency_bonus_max_steps or 0),
+        ),
+    ]
+
+
+def _score_duplicate_charge(fixture: TaskFixture, state: Any) -> list[ScoreCriterion]:
+    weights = fixture.rubric_weights
+
+    def acknowledges_and_apologizes(reply: str) -> bool:
+        return contains_all_groups(
+            reply,
+            [
+                fixture.reply_keyword_groups["duplicate_ack"],
+                fixture.reply_keyword_groups["regret"],
+                fixture.reply_keyword_groups["refund_confirmed"],
+            ],
+        )
+
+    return [
+        _criterion(
+            "lookup_account",
+            "Account lookup completed",
+            weights["lookup_account"],
+            state.lookup_performed,
+        ),
+        _criterion(
+            "searched_kb",
+            "Duplicate charge policy article retrieved",
+            weights["searched_kb"],
+            fixture.relevant_kb_article_id in state.kb_articles_seen,
+        ),
+        _criterion(
+            "correct_refund",
+            "Correct full duplicate-charge refund issued",
+            weights["correct_refund"],
+            state.refund_record is not None
+            and state.refund_record["amount_cents"] == fixture.expected_refund_amount_cents
+            and state.refund_record["reason_code"] == fixture.refund_reason_code,
+        ),
+        _criterion(
+            "reply_mentions_timeline",
+            "Reply mentions the refund timeline",
+            weights["reply_mentions_timeline"],
+            _has_reply_matching(state, lambda text: contains_any(text, fixture.reply_keyword_groups["timeline"])),
+        ),
+        _criterion(
+            "reply_acknowledges_and_apologizes",
+            "Reply acknowledges the duplicate charge, apologizes, and confirms the refund",
+            weights["reply_acknowledges_and_apologizes"],
+            _has_reply_matching(state, acknowledges_and_apologizes),
+        ),
+        _criterion(
+            "resolved_without_escalation",
+            "Ticket resolved instead of escalated",
+            weights["resolved_without_escalation"],
+            state.ticket_status == "resolved" and state.resolution_code == fixture.expected_resolution_code,
+        ),
+    ]
+
+
+def _score_enterprise_escalation(fixture: TaskFixture, state: Any) -> list[ScoreCriterion]:
+    weights = fixture.rubric_weights
+    escalation_step = state.escalation["step_index"] if state.escalation else None
+
+    def careful_reply(reply: str) -> bool:
+        return (
+            contains_all_groups(
+                reply,
+                [
+                    fixture.reply_keyword_groups["urgency"],
+                    fixture.reply_keyword_groups["escalation"],
+                ],
+            )
+            and not contains_any(reply, fixture.forbidden_reply_phrases)
+        )
+
+    reply_before_escalation = any(
+        escalation_step is None or reply["step_index"] < escalation_step for reply in state.reply_history
+    )
+
+    return [
+        _criterion(
+            "lookup_account",
+            "Account lookup completed",
+            weights["lookup_account"],
+            state.lookup_performed,
+        ),
+        _criterion(
+            "no_refund_or_policy_action",
+            "No refund or resolution policy action was applied",
+            weights["no_refund_or_policy_action"],
+            state.done and not state.refund_attempted and state.resolution_code is None,
+        ),
+        _criterion(
+            "reply_sent_before_escalation",
+            "A reply was sent before escalation",
+            weights["reply_sent_before_escalation"],
+            reply_before_escalation and bool(state.reply_history),
+        ),
+        _criterion(
+            "careful_reply",
+            "Reply acknowledges urgency, mentions escalation, and avoids liability",
+            weights["careful_reply"],
+            any(
+                (escalation_step is None or reply["step_index"] < escalation_step)
+                and careful_reply(reply["message"])
+                for reply in state.reply_history
+            ),
+        ),
+        _criterion(
+            "correct_escalation",
+            "Escalation uses the correct queue and priority",
+            weights["correct_escalation"],
+            state.escalation is not None
+            and state.escalation["queue"] == fixture.expected_escalation_queue
+            and state.escalation["priority"] == fixture.expected_escalation_priority,
+        ),
+        _criterion(
+            "not_resolved",
+            "Ticket was not resolved",
+            weights["not_resolved"],
+            state.done and state.resolution_code is None,
+        ),
+    ]
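
The scoring helpers above combine two ideas: normalized substring matching over keyword groups, and a weighted sum of earned criteria clamped to `[0, 1]`. The block below is a small standalone sketch of both, assuming nothing beyond the standard library. The three helper functions are condensed copies of the ones in `scoring.py`; `EXAMPLE_GROUPS`, `EXAMPLE_WEIGHTS`, and `total_score` are hypothetical, since the real keyword groups and rubric weights live on `TaskFixture` and are not shown in this diff.

```python
import re

_SPACE_RE = re.compile(r"\s+")


def normalize_text(text: str) -> str:
    # Lowercase, treat hyphens as spaces, and collapse whitespace runs.
    return _SPACE_RE.sub(" ", text.lower().replace("-", " ")).strip()


def contains_any(text: str, phrases: tuple[str, ...]) -> bool:
    normalized = normalize_text(text)
    return any(normalize_text(phrase) in normalized for phrase in phrases)


def contains_all_groups(text: str, groups: list[tuple[str, ...]]) -> bool:
    # Every group must be satisfied by at least one of its phrases.
    return all(contains_any(text, group) for group in groups)


# Hypothetical keyword groups, for illustration only.
EXAMPLE_GROUPS = [
    ("duplicate charge", "charged twice"),
    ("sorry", "apologize"),
    ("refund",),
]

# Hypothetical rubric weights, for illustration only.
EXAMPLE_WEIGHTS = {"lookup_account": 0.2, "correct_refund": 0.5, "reply_ok": 0.3}


def total_score(earned: dict[str, bool]) -> float:
    # Sum the weights of earned criteria, round, then clamp to [0, 1].
    total = round(sum(w for key, w in EXAMPLE_WEIGHTS.items() if earned.get(key, False)), 6)
    return min(max(total, 0.0), 1.0)


reply = "I'm sorry about the Duplicate-Charge. Your refund is on its way."
matched = contains_all_groups(reply, EXAMPLE_GROUPS)
```

Because matching is plain substring containment, a phrase also matches inside longer words or negated sentences (for example, `"refund"` matches `"refunded"` and `"sorry"` matches `"not sorry"`), which may be worth keeping in mind when choosing fixture phrases.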
tests/test_env.py ADDED
@@ -0,0 +1,58 @@
+import pytest
+
+from support_ticket_env import SupportTicketEnvironment
+from support_ticket_env.models import LookupAccountAction, SearchKBAction, SendReplyAction
+
+
+def test_search_returns_expected_article_and_progress_reward() -> None:
+    env = SupportTicketEnvironment()
+    env.reset("password_reset_guidance")
+    result = env.step(SearchKBAction(action_type="search_kb", query="password reset email not arriving"))
+    assert result.observation.last_tool_result.tool_name == "search_kb"
+    assert result.observation.last_tool_result.article_ids[0] == "KB-PW-RESET"
+    assert result.reward == pytest.approx(0.19)
+
+
+def test_lookup_populates_account_facts() -> None:
+    env = SupportTicketEnvironment()
+    env.reset("duplicate_charge_refund")
+    result = env.step(LookupAccountAction(action_type="lookup_account", customer_id="cust_bill_002"))
+    account = result.observation.known_facts["account"]
+    assert account["plan"] == "Business"
+    assert account["duplicate_charge_amount_cents"] == 4900
+
+
+def test_redundant_search_is_penalized() -> None:
+    env = SupportTicketEnvironment()
+    env.reset("password_reset_guidance")
+    env.step(SearchKBAction(action_type="search_kb", query="password reset email not arriving"))
+    result = env.step(SearchKBAction(action_type="search_kb", query="password reset email not arriving"))
+    assert result.reward == pytest.approx(-0.03)
+    assert result.info["redundancy_penalty"] == pytest.approx(0.02)
+
+
+def test_refund_before_lookup_is_invalid() -> None:
+    env = SupportTicketEnvironment()
+    env.reset("duplicate_charge_refund")
+    result = env.step({"action_type": "issue_refund", "amount_cents": 4900, "reason_code": "duplicate_charge"})
+    assert result.observation.last_action_error == "lookup_required_before_refund"
+    assert result.reward == pytest.approx(-0.11)
+
+
+def test_reset_clears_previous_state() -> None:
+    env = SupportTicketEnvironment()
+    env.reset("password_reset_guidance")
+    env.step(SearchKBAction(action_type="search_kb", query="password reset email not arriving"))
+    result = env.reset("password_reset_guidance")
+    assert result.observation.steps_taken == 0
+    assert result.observation.known_facts == {}
+    assert len(result.observation.conversation_history) == 1
+
+
+def test_max_steps_timeout_is_deterministic() -> None:
+    env = SupportTicketEnvironment()
+    result = env.reset("password_reset_guidance")
+    for _ in range(8):
+        result = env.step(SendReplyAction(action_type="send_reply", message="Still investigating."))
+    assert result.done is True
+    assert result.info["terminal_reason"] == "max_steps_exceeded"
tests/test_models.py ADDED
@@ -0,0 +1,20 @@
+import pytest
+
+from support_ticket_env import parse_action
+
+
+def test_parse_action_accepts_valid_discriminated_union() -> None:
+    action = parse_action({"action_type": "search_kb", "query": "password reset"})
+    assert action.action_type == "search_kb"
+    assert action.query == "password reset"
+
+
+def test_parse_action_rejects_invalid_refund_reason_code() -> None:
+    with pytest.raises(Exception):
+        parse_action(
+            {
+                "action_type": "issue_refund",
+                "amount_cents": 4900,
+                "reason_code": "manual_override",
+            }
+        )
tests/test_scenarios.py ADDED
@@ -0,0 +1,74 @@
+from support_ticket_env import SupportTicketEnvironment, list_task_ids, scripted_policy
+from support_ticket_env.models import EscalateTicketAction, IssueRefundAction, LookupAccountAction, ResolveTicketAction, SendReplyAction
+
+
+def run_actions(task_id: str, actions: list[object]):
+    env = SupportTicketEnvironment()
+    result = env.reset(task_id)
+    for action in actions:
+        result = env.step(action)
+        if result.done:
+            break
+    return result
+
+
+def test_gold_policies_score_perfectly() -> None:
+    env = SupportTicketEnvironment()
+    for task_id in list_task_ids():
+        result = env.reset(task_id)
+        while not result.done:
+            action = scripted_policy(task_id, result.observation.steps_taken + 1, result.observation.customer_id)
+            result = env.step(action)
+        assert result.info["score"] == 1.0
+
+
+def test_premature_resolution_scores_poorly() -> None:
+    result = run_actions(
+        "duplicate_charge_refund",
+        [ResolveTicketAction(action_type="resolve_ticket", resolution_code="billing_refund_processed")],
+    )
+    assert result.done is True
+    assert result.info["score"] < 0.5
+
+
+def test_task3_refund_attempt_hurts_final_score() -> None:
+    result = run_actions(
+        "enterprise_data_loss_escalation",
+        [
+            LookupAccountAction(action_type="lookup_account", customer_id="cust_ent_003"),
+            IssueRefundAction(action_type="issue_refund", amount_cents=4900, reason_code="duplicate_charge"),
+            SendReplyAction(
+                action_type="send_reply",
+                message="This is urgent. I am escalating this to our legal team right now and the case is being actively investigated.",
+            ),
+            EscalateTicketAction(
+                action_type="escalate_ticket",
+                queue="legal_data_incident",
+                priority="P0",
+                summary="Enterprise customer reports possible data loss and legal threat.",
+            ),
+        ],
+    )
+    assert result.info["score"] == 0.8
+
+
+def test_task3_liability_admission_fails_careful_reply() -> None:
+    result = run_actions(
+        "enterprise_data_loss_escalation",
+        [
+            LookupAccountAction(action_type="lookup_account", customer_id="cust_ent_003"),
+            SendReplyAction(
+                action_type="send_reply",
+                message="This is urgent and we are escalating it, but this is our fault and we caused this.",
+            ),
+            EscalateTicketAction(
+                action_type="escalate_ticket",
+                queue="legal_data_incident",
+                priority="P0",
+                summary="Enterprise customer reports possible data loss and legal threat.",
+            ),
+        ],
+    )
+    criteria = {item["criterion_id"]: item for item in result.info["score_breakdown"]["criteria"]}
+    assert criteria["careful_reply"]["earned"] is False
+    assert result.info["score"] < 1.0
uv.lock ADDED
The diff for this file is too large to render. See raw diff