Initial customer support OpenEnv upload
- .dockerignore +6 -0
- .gitignore +6 -0
- README.md +176 -11
- customer_support_openenv.egg-info/PKG-INFO +190 -0
- customer_support_openenv.egg-info/SOURCES.txt +20 -0
- customer_support_openenv.egg-info/dependency_links.txt +1 -0
- customer_support_openenv.egg-info/entry_points.txt +2 -0
- customer_support_openenv.egg-info/requires.txt +8 -0
- customer_support_openenv.egg-info/top_level.txt +2 -0
- inference.py +139 -0
- openenv.yaml +11 -0
- pyproject.toml +32 -0
- server/Dockerfile +19 -0
- server/__init__.py +1 -0
- server/app.py +168 -0
- server/requirements.txt +4 -0
- support_ticket_env/__init__.py +42 -0
- support_ticket_env/client.py +99 -0
- support_ticket_env/env.py +423 -0
- support_ticket_env/fixtures.py +270 -0
- support_ticket_env/models.py +214 -0
- support_ticket_env/policies.py +69 -0
- support_ticket_env/scoring.py +218 -0
- tests/test_env.py +58 -0
- tests/test_models.py +20 -0
- tests/test_scenarios.py +74 -0
- uv.lock +0 -0
.dockerignore
ADDED
|
@@ -0,0 +1,6 @@
| 1 |
+
__pycache__/
|
| 2 |
+
.pytest_cache/
|
| 3 |
+
.git/
|
| 4 |
+
.env
|
| 5 |
+
outputs/
|
| 6 |
+
pytest-cache-files-*
|
.gitignore
ADDED
|
@@ -0,0 +1,6 @@
| 1 |
+
__pycache__/
|
| 2 |
+
.pytest_cache/
|
| 3 |
+
*.pyc
|
| 4 |
+
.env
|
| 5 |
+
outputs/
|
| 6 |
+
pytest-cache-files-*
|
README.md
CHANGED
|
@@ -1,11 +1,176 @@
|
|
| 1 |
-
|
| 2 |
-
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
|
| 8 |
-
|
| 9 |
-
|
| 10 |
-
|
| 11 |
-
|
| 1 |
+
# AcmeCloud Customer Support Ticket Handler
|
| 2 |
+
|
| 3 |
+
A deterministic OpenEnv-style environment for training and evaluating agents on realistic B2B SaaS support workflows.
|
| 4 |
+
|
| 5 |
+
## What It Simulates
|
| 6 |
+
|
| 7 |
+
Each episode is one inbound customer-support ticket at a fictional company, `AcmeCloud`.
|
| 8 |
+
The agent acts like a support representative and must choose the right sequence of typed tool actions to handle the ticket correctly.
|
| 9 |
+
|
| 10 |
+
The benchmark ships with three fixed tasks:
|
| 11 |
+
|
| 12 |
+
1. `password_reset_guidance`
|
| 13 |
+
2. `duplicate_charge_refund`
|
| 14 |
+
3. `enterprise_data_loss_escalation`
|
| 15 |
+
|
| 16 |
+
## Why This Is Useful
|
| 17 |
+
|
| 18 |
+
This environment models a real operational task rather than a toy game:
|
| 19 |
+
|
| 20 |
+
- reading support tickets
|
| 21 |
+
- searching internal knowledge base articles
|
| 22 |
+
- looking up customer account details
|
| 23 |
+
- deciding whether to resolve, refund, or escalate
|
| 24 |
+
- sending customer-facing replies under policy constraints
|
| 25 |
+
|
| 26 |
+
The environment is fully deterministic and graded without any LLM judge, which makes it suitable for reproducible RL rollouts and benchmark evaluation.
|
| 27 |
+
|
| 28 |
+
## Action Space
|
| 29 |
+
|
| 30 |
+
The agent can take exactly six typed actions:
|
| 31 |
+
|
| 32 |
+
- `search_kb(query: str)`
|
| 33 |
+
- `lookup_account(customer_id: str)`
|
| 34 |
+
- `send_reply(message: str)`
|
| 35 |
+
- `issue_refund(amount_cents: int, reason_code: "duplicate_charge")`
|
| 36 |
+
- `resolve_ticket(resolution_code: "password_reset_guidance" | "billing_refund_processed")`
|
| 37 |
+
- `escalate_ticket(queue: "support_lead" | "legal_data_incident", priority: "P2" | "P0", summary: str)`
|
| 38 |
+
|
| 39 |
+
## Observation Space
|
| 40 |
+
|
| 41 |
+
Each observation includes:
|
| 42 |
+
|
| 43 |
+
- task and ticket identifiers
|
| 44 |
+
- current ticket status
|
| 45 |
+
- customer metadata
|
| 46 |
+
- customer message and full conversation history
|
| 47 |
+
- the last tool result
|
| 48 |
+
- steps taken / remaining
|
| 49 |
+
- available action types
|
| 50 |
+
- last action error
|
| 51 |
+
- accumulated known facts learned from prior tool calls
|
| 52 |
+
|
| 53 |
+
## Reward Design
|
| 54 |
+
|
| 55 |
+
The environment uses rubric-based reward shaping.
|
| 56 |
+
|
| 57 |
+
- Each task has a deterministic scorecard in `[0.0, 1.0]`
|
| 58 |
+
- Step reward is `score_delta - 0.01 - invalid_penalty - redundancy_penalty`
|
| 59 |
+
- Repeated search/lookup actions incur `-0.02`
|
| 60 |
+
- Invalid actions incur `-0.10`
|
| 61 |
+
- `resolve_ticket` and `escalate_ticket` terminate the episode
|
| 62 |
+
- `issue_refund` changes state but does not terminate the episode
|
| 63 |
+
|
| 64 |
+
Global success threshold: `0.75`
|
| 65 |
+
|
| 66 |
+
## Task Details
|
| 67 |
+
|
| 68 |
+
### 1. Password Reset Guidance
|
| 69 |
+
|
| 70 |
+
Customer issue: reset email did not arrive.
|
| 71 |
+
|
| 72 |
+
Expected flow:
|
| 73 |
+
|
| 74 |
+
- search password reset KB article
|
| 75 |
+
- send reply with reset URL and spam/junk guidance
|
| 76 |
+
- resolve with `password_reset_guidance`
|
| 77 |
+
|
| 78 |
+
### 2. Duplicate Charge Refund
|
| 79 |
+
|
| 80 |
+
Customer issue: billed twice for the current subscription period.
|
| 81 |
+
|
| 82 |
+
Expected flow:
|
| 83 |
+
|
| 84 |
+
- lookup the account
|
| 85 |
+
- search the refund policy
|
| 86 |
+
- issue the verified duplicate-charge refund
|
| 87 |
+
- reply with apology and timeline
|
| 88 |
+
- resolve with `billing_refund_processed`
|
| 89 |
+
|
| 90 |
+
### 3. Enterprise Data Loss Escalation
|
| 91 |
+
|
| 92 |
+
Customer issue: enterprise data-loss complaint with legal threat.
|
| 93 |
+
|
| 94 |
+
Expected flow:
|
| 95 |
+
|
| 96 |
+
- lookup the account
|
| 97 |
+
- send a careful acknowledgment reply
|
| 98 |
+
- escalate to `legal_data_incident` with `P0`
|
| 99 |
+
- do not refund
|
| 100 |
+
- do not resolve
|
| 101 |
+
|
| 102 |
+
## Project Layout
|
| 103 |
+
|
| 104 |
+
- `support_ticket_env/`: models, fixtures, scoring, environment core, policy helpers, local/HTTP client
|
| 105 |
+
- `server/`: FastAPI app and Dockerfile
|
| 106 |
+
- `tests/`: unit and scenario tests
|
| 107 |
+
- `inference.py`: baseline runner using the OpenAI client interface
|
| 108 |
+
- `openenv.yaml`: environment metadata
|
| 109 |
+
|
| 110 |
+
## Local Setup
|
| 111 |
+
|
| 112 |
+
```bash
|
| 113 |
+
python -m pip install -e .[dev]
|
| 114 |
+
pytest
|
| 115 |
+
uvicorn server.app:app --host 0.0.0.0 --port 8000 --reload
|
| 116 |
+
```
|
| 117 |
+
|
| 118 |
+
Open the docs at [http://localhost:8000/docs](http://localhost:8000/docs) or the simple UI at [http://localhost:8000/web](http://localhost:8000/web).
|
| 119 |
+
|
| 120 |
+
## Docker
|
| 121 |
+
|
| 122 |
+
```bash
|
| 123 |
+
docker build -t customer-support-openenv -f server/Dockerfile .
|
| 124 |
+
docker run -p 8000:8000 customer-support-openenv
|
| 125 |
+
```
|
| 126 |
+
|
| 127 |
+
## Baseline Inference
|
| 128 |
+
|
| 129 |
+
The baseline script uses the OpenAI client interface and supports any OpenAI-compatible endpoint.
|
| 130 |
+
|
| 131 |
+
Environment variables:
|
| 132 |
+
|
| 133 |
+
- `HF_TOKEN` or `OPENAI_API_KEY`
|
| 134 |
+
- `API_BASE_URL`
|
| 135 |
+
- `MODEL_NAME`
|
| 136 |
+
- optional `ENV_BASE_URL` if you want the script to hit a running server instead of the in-process environment
|
| 137 |
+
|
| 138 |
+
Run:
|
| 139 |
+
|
| 140 |
+
```bash
|
| 141 |
+
python inference.py
|
| 142 |
+
```
|
| 143 |
+
|
| 144 |
+
The script emits strict stdout lines in the required format:
|
| 145 |
+
|
| 146 |
+
- `[START]`
|
| 147 |
+
- `[STEP]`
|
| 148 |
+
- `[END]`
|
| 149 |
+
|
| 150 |
+
If the model call fails or credentials are missing, the script falls back to a deterministic scripted policy so the benchmark still runs reproducibly.
|
| 151 |
+
|
| 152 |
+
## Example Gold Scores
|
| 153 |
+
|
| 154 |
+
Using the included scripted policy:
|
| 155 |
+
|
| 156 |
+
- `password_reset_guidance`: `1.0`
|
| 157 |
+
- `duplicate_charge_refund`: `1.0`
|
| 158 |
+
- `enterprise_data_loss_escalation`: `1.0`
|
| 159 |
+
|
| 160 |
+
## Deployment Notes
|
| 161 |
+
|
| 162 |
+
- The app exposes `/health`, `/reset`, `/step`, `/state`, `/docs`, `/web`, and `/ws`
|
| 163 |
+
- Sessions are managed in-memory
|
| 164 |
+
- No external services are required to run the environment server itself
|
| 165 |
+
- The benchmark is designed to fit comfortably in the hackathon resource limits
|
| 166 |
+
|
| 167 |
+
## Validation
|
| 168 |
+
|
| 169 |
+
If `openenv` is installed locally, run:
|
| 170 |
+
|
| 171 |
+
```bash
|
| 172 |
+
openenv validate
|
| 173 |
+
```
|
| 174 |
+
|
| 175 |
+
This repository does not depend on an LLM judge for grading.
|
| 176 |
+
All graders are deterministic and implemented directly in the environment scorer.
|
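A note on the Reward Design section of README.md above: the per-step reward rule can be sketched in a few lines of Python. This is only an illustration, not the code in `support_ticket_env/scoring.py`, which is not shown here. The names `step_reward` and `is_success`, the `invalid` and `redundant` flags, and the reading of `score_delta` as the change in rubric score between consecutive steps are all assumptions.

```python
# Constants taken from the README's Reward Design section.
STEP_COST = 0.01           # flat cost subtracted on every step
REDUNDANCY_PENALTY = 0.02  # repeated search/lookup action
INVALID_PENALTY = 0.10     # invalid action
SUCCESS_THRESHOLD = 0.75   # global success threshold


def step_reward(
    prev_score: float,
    new_score: float,
    *,
    invalid: bool = False,
    redundant: bool = False,
) -> float:
    """Sketch of: score_delta - 0.01 - invalid_penalty - redundancy_penalty.

    Assumption: score_delta is new_score - prev_score, where both are
    rubric scores in [0.0, 1.0].
    """
    reward = (new_score - prev_score) - STEP_COST
    if invalid:
        reward -= INVALID_PENALTY
    if redundant:
        reward -= REDUNDANCY_PENALTY
    # Round to keep floating-point noise out of logged values.
    return round(reward, 4)


def is_success(final_score: float) -> bool:
    """An episode counts as successful when the final score reaches the threshold."""
    return final_score >= SUCCESS_THRESHOLD
```

Under these assumptions, a first search that lifts a hypothetical rubric score from 0.0 to 0.25 would earn 0.24, and repeating the same search with no score change would earn -0.03.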
customer_support_openenv.egg-info/PKG-INFO
ADDED
|
@@ -0,0 +1,190 @@
| 1 |
+
Metadata-Version: 2.4
|
| 2 |
+
Name: customer-support-openenv
|
| 3 |
+
Version: 0.1.0
|
| 4 |
+
Summary: Deterministic OpenEnv-style customer support ticket benchmark for B2B SaaS workflows.
|
| 5 |
+
Requires-Python: >=3.11
|
| 6 |
+
Description-Content-Type: text/markdown
|
| 7 |
+
Requires-Dist: fastapi>=0.115
|
| 8 |
+
Requires-Dist: openenv-core>=0.2.0
|
| 9 |
+
Requires-Dist: openai>=1.30
|
| 10 |
+
Requires-Dist: pydantic>=2.7
|
| 11 |
+
Requires-Dist: uvicorn>=0.30
|
| 12 |
+
Provides-Extra: dev
|
| 13 |
+
Requires-Dist: pytest>=8.0; extra == "dev"
|
| 14 |
+
|
| 15 |
+
# AcmeCloud Customer Support Ticket Handler
|
| 16 |
+
|
| 17 |
+
A deterministic OpenEnv-style environment for training and evaluating agents on realistic B2B SaaS support workflows.
|
| 18 |
+
|
| 19 |
+
## What It Simulates
|
| 20 |
+
|
| 21 |
+
Each episode is one inbound customer-support ticket at a fictional company, `AcmeCloud`.
|
| 22 |
+
The agent acts like a support representative and must choose the right sequence of typed tool actions to handle the ticket correctly.
|
| 23 |
+
|
| 24 |
+
The benchmark ships with three fixed tasks:
|
| 25 |
+
|
| 26 |
+
1. `password_reset_guidance`
|
| 27 |
+
2. `duplicate_charge_refund`
|
| 28 |
+
3. `enterprise_data_loss_escalation`
|
| 29 |
+
|
| 30 |
+
## Why This Is Useful
|
| 31 |
+
|
| 32 |
+
This environment models a real operational task rather than a toy game:
|
| 33 |
+
|
| 34 |
+
- reading support tickets
|
| 35 |
+
- searching internal knowledge base articles
|
| 36 |
+
- looking up customer account details
|
| 37 |
+
- deciding whether to resolve, refund, or escalate
|
| 38 |
+
- sending customer-facing replies under policy constraints
|
| 39 |
+
|
| 40 |
+
The environment is fully deterministic and graded without any LLM judge, which makes it suitable for reproducible RL rollouts and benchmark evaluation.
|
| 41 |
+
|
| 42 |
+
## Action Space
|
| 43 |
+
|
| 44 |
+
The agent can take exactly six typed actions:
|
| 45 |
+
|
| 46 |
+
- `search_kb(query: str)`
|
| 47 |
+
- `lookup_account(customer_id: str)`
|
| 48 |
+
- `send_reply(message: str)`
|
| 49 |
+
- `issue_refund(amount_cents: int, reason_code: "duplicate_charge")`
|
| 50 |
+
- `resolve_ticket(resolution_code: "password_reset_guidance" | "billing_refund_processed")`
|
| 51 |
+
- `escalate_ticket(queue: "support_lead" | "legal_data_incident", priority: "P2" | "P0", summary: str)`
|
| 52 |
+
|
| 53 |
+
## Observation Space
|
| 54 |
+
|
| 55 |
+
Each observation includes:
|
| 56 |
+
|
| 57 |
+
- task and ticket identifiers
|
| 58 |
+
- current ticket status
|
| 59 |
+
- customer metadata
|
| 60 |
+
- customer message and full conversation history
|
| 61 |
+
- the last tool result
|
| 62 |
+
- steps taken / remaining
|
| 63 |
+
- available action types
|
| 64 |
+
- last action error
|
| 65 |
+
- accumulated known facts learned from prior tool calls
|
| 66 |
+
|
| 67 |
+
## Reward Design
|
| 68 |
+
|
| 69 |
+
The environment uses rubric-based reward shaping.
|
| 70 |
+
|
| 71 |
+
- Each task has a deterministic scorecard in `[0.0, 1.0]`
|
| 72 |
+
- Step reward is `score_delta - 0.01 - invalid_penalty - redundancy_penalty`
|
| 73 |
+
- Repeated search/lookup actions incur `-0.02`
|
| 74 |
+
- Invalid actions incur `-0.10`
|
| 75 |
+
- `resolve_ticket` and `escalate_ticket` terminate the episode
|
| 76 |
+
- `issue_refund` changes state but does not terminate the episode
|
| 77 |
+
|
| 78 |
+
Global success threshold: `0.75`
|
| 79 |
+
|
| 80 |
+
## Task Details
|
| 81 |
+
|
| 82 |
+
### 1. Password Reset Guidance
|
| 83 |
+
|
| 84 |
+
Customer issue: reset email did not arrive.
|
| 85 |
+
|
| 86 |
+
Expected flow:
|
| 87 |
+
|
| 88 |
+
- search password reset KB article
|
| 89 |
+
- send reply with reset URL and spam/junk guidance
|
| 90 |
+
- resolve with `password_reset_guidance`
|
| 91 |
+
|
| 92 |
+
### 2. Duplicate Charge Refund
|
| 93 |
+
|
| 94 |
+
Customer issue: billed twice for the current subscription period.
|
| 95 |
+
|
| 96 |
+
Expected flow:
|
| 97 |
+
|
| 98 |
+
- lookup the account
|
| 99 |
+
- search the refund policy
|
| 100 |
+
- issue the verified duplicate-charge refund
|
| 101 |
+
- reply with apology and timeline
|
| 102 |
+
- resolve with `billing_refund_processed`
|
| 103 |
+
|
| 104 |
+
### 3. Enterprise Data Loss Escalation
|
| 105 |
+
|
| 106 |
+
Customer issue: enterprise data-loss complaint with legal threat.
|
| 107 |
+
|
| 108 |
+
Expected flow:
|
| 109 |
+
|
| 110 |
+
- lookup the account
|
| 111 |
+
- send a careful acknowledgment reply
|
| 112 |
+
- escalate to `legal_data_incident` with `P0`
|
| 113 |
+
- do not refund
|
| 114 |
+
- do not resolve
|
| 115 |
+
|
| 116 |
+
## Project Layout
|
| 117 |
+
|
| 118 |
+
- `support_ticket_env/`: models, fixtures, scoring, environment core, policy helpers, local/HTTP client
|
| 119 |
+
- `server/`: FastAPI app and Dockerfile
|
| 120 |
+
- `tests/`: unit and scenario tests
|
| 121 |
+
- `inference.py`: baseline runner using the OpenAI client interface
|
| 122 |
+
- `openenv.yaml`: environment metadata
|
| 123 |
+
|
| 124 |
+
## Local Setup
|
| 125 |
+
|
| 126 |
+
```bash
|
| 127 |
+
python -m pip install -e .[dev]
|
| 128 |
+
pytest
|
| 129 |
+
uvicorn server.app:app --host 0.0.0.0 --port 8000 --reload
|
| 130 |
+
```
|
| 131 |
+
|
| 132 |
+
Open the docs at [http://localhost:8000/docs](http://localhost:8000/docs) or the simple UI at [http://localhost:8000/web](http://localhost:8000/web).
|
| 133 |
+
|
| 134 |
+
## Docker
|
| 135 |
+
|
| 136 |
+
```bash
|
| 137 |
+
docker build -t customer-support-openenv -f server/Dockerfile .
|
| 138 |
+
docker run -p 8000:8000 customer-support-openenv
|
| 139 |
+
```
|
| 140 |
+
|
| 141 |
+
## Baseline Inference
|
| 142 |
+
|
| 143 |
+
The baseline script uses the OpenAI client interface and supports any OpenAI-compatible endpoint.
|
| 144 |
+
|
| 145 |
+
Environment variables:
|
| 146 |
+
|
| 147 |
+
- `HF_TOKEN` or `OPENAI_API_KEY`
|
| 148 |
+
- `API_BASE_URL`
|
| 149 |
+
- `MODEL_NAME`
|
| 150 |
+
- optional `ENV_BASE_URL` if you want the script to hit a running server instead of the in-process environment
|
| 151 |
+
|
| 152 |
+
Run:
|
| 153 |
+
|
| 154 |
+
```bash
|
| 155 |
+
python inference.py
|
| 156 |
+
```
|
| 157 |
+
|
| 158 |
+
The script emits strict stdout lines in the required format:
|
| 159 |
+
|
| 160 |
+
- `[START]`
|
| 161 |
+
- `[STEP]`
|
| 162 |
+
- `[END]`
|
| 163 |
+
|
| 164 |
+
If the model call fails or credentials are missing, the script falls back to a deterministic scripted policy so the benchmark still runs reproducibly.
|
| 165 |
+
|
| 166 |
+
## Example Gold Scores
|
| 167 |
+
|
| 168 |
+
Using the included scripted policy:
|
| 169 |
+
|
| 170 |
+
- `password_reset_guidance`: `1.0`
|
| 171 |
+
- `duplicate_charge_refund`: `1.0`
|
| 172 |
+
- `enterprise_data_loss_escalation`: `1.0`
|
| 173 |
+
|
| 174 |
+
## Deployment Notes
|
| 175 |
+
|
| 176 |
+
- The app exposes `/health`, `/reset`, `/step`, `/state`, `/docs`, `/web`, and `/ws`
|
| 177 |
+
- Sessions are managed in-memory
|
| 178 |
+
- No external services are required to run the environment server itself
|
| 179 |
+
- The benchmark is designed to fit comfortably in the hackathon resource limits
|
| 180 |
+
|
| 181 |
+
## Validation
|
| 182 |
+
|
| 183 |
+
If `openenv` is installed locally, run:
|
| 184 |
+
|
| 185 |
+
```bash
|
| 186 |
+
openenv validate
|
| 187 |
+
```
|
| 188 |
+
|
| 189 |
+
This repository does not depend on an LLM judge for grading.
|
| 190 |
+
All graders are deterministic and implemented directly in the environment scorer.
|
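The six action shapes listed under Action Space in the README text (embedded again in PKG-INFO above) can be checked with a small standard-library validator. The sketch below is a hedged illustration rather than the project's `parse_action`, whose implementation is not shown here. `ACTION_SCHEMA` and `validate_action` are hypothetical names; the `action_type` key is the one used in the `inference.py` system prompt.

```python
from typing import Any

# Required fields per action type, copied from the README's Action Space list.
# A type means "must be an instance of"; a frozenset means "must be one of".
ACTION_SCHEMA: dict[str, dict[str, Any]] = {
    "search_kb": {"query": str},
    "lookup_account": {"customer_id": str},
    "send_reply": {"message": str},
    "issue_refund": {
        "amount_cents": int,
        "reason_code": frozenset({"duplicate_charge"}),
    },
    "resolve_ticket": {
        "resolution_code": frozenset(
            {"password_reset_guidance", "billing_refund_processed"}
        ),
    },
    "escalate_ticket": {
        "queue": frozenset({"support_lead", "legal_data_incident"}),
        "priority": frozenset({"P2", "P0"}),
        "summary": str,
    },
}


def validate_action(payload: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the payload looks valid."""
    action_type = payload.get("action_type")
    spec = ACTION_SCHEMA.get(action_type) if isinstance(action_type, str) else None
    if spec is None:
        return [f"unknown action_type: {action_type!r}"]

    problems: list[str] = []
    for field, rule in spec.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
            continue
        value = payload[field]
        if isinstance(rule, frozenset):
            if not isinstance(value, str) or value not in rule:
                problems.append(f"{field} must be one of {sorted(rule)}")
        # bool is a subclass of int, so reject it explicitly for typed fields.
        elif isinstance(value, bool) or not isinstance(value, rule):
            problems.append(f"{field} must be of type {rule.__name__}")
    return problems
```

Returning a list of problems instead of raising on the first one is a design choice for this sketch: it lets a caller report every issue with a model-produced action in a single error message.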
customer_support_openenv.egg-info/SOURCES.txt
ADDED
|
@@ -0,0 +1,20 @@
| 1 |
+
README.md
|
| 2 |
+
pyproject.toml
|
| 3 |
+
customer_support_openenv.egg-info/PKG-INFO
|
| 4 |
+
customer_support_openenv.egg-info/SOURCES.txt
|
| 5 |
+
customer_support_openenv.egg-info/dependency_links.txt
|
| 6 |
+
customer_support_openenv.egg-info/entry_points.txt
|
| 7 |
+
customer_support_openenv.egg-info/requires.txt
|
| 8 |
+
customer_support_openenv.egg-info/top_level.txt
|
| 9 |
+
server/__init__.py
|
| 10 |
+
server/app.py
|
| 11 |
+
support_ticket_env/__init__.py
|
| 12 |
+
support_ticket_env/client.py
|
| 13 |
+
support_ticket_env/env.py
|
| 14 |
+
support_ticket_env/fixtures.py
|
| 15 |
+
support_ticket_env/models.py
|
| 16 |
+
support_ticket_env/policies.py
|
| 17 |
+
support_ticket_env/scoring.py
|
| 18 |
+
tests/test_env.py
|
| 19 |
+
tests/test_models.py
|
| 20 |
+
tests/test_scenarios.py
|
customer_support_openenv.egg-info/dependency_links.txt
ADDED
|
@@ -0,0 +1 @@
| 1 |
+
|
customer_support_openenv.egg-info/entry_points.txt
ADDED
|
@@ -0,0 +1,2 @@
| 1 |
+
[console_scripts]
|
| 2 |
+
server = server.app:main
|
customer_support_openenv.egg-info/requires.txt
ADDED
|
@@ -0,0 +1,8 @@
| 1 |
+
fastapi>=0.115
|
| 2 |
+
openenv-core>=0.2.0
|
| 3 |
+
openai>=1.30
|
| 4 |
+
pydantic>=2.7
|
| 5 |
+
uvicorn>=0.30
|
| 6 |
+
|
| 7 |
+
[dev]
|
| 8 |
+
pytest>=8.0
|
customer_support_openenv.egg-info/top_level.txt
ADDED
|
@@ -0,0 +1,2 @@
| 1 |
+
server
|
| 2 |
+
support_ticket_env
|
inference.py
ADDED
|
@@ -0,0 +1,139 @@
+from __future__ import annotations
+
+import json
+import os
+import sys
+from typing import Any
+
+from openai import OpenAI
+
+from support_ticket_env import BENCHMARK_NAME, DEFAULT_SUCCESS_THRESHOLD, SupportTicketEnv, fallback_action, list_task_ids, parse_action
+
+API_KEY = os.getenv("HF_TOKEN") or os.getenv("OPENAI_API_KEY") or os.getenv("API_KEY")
+API_BASE_URL = os.getenv("API_BASE_URL") or "https://router.huggingface.co/v1"
+MODEL_NAME = os.getenv("MODEL_NAME") or "Qwen/Qwen2.5-72B-Instruct"
+ENV_BASE_URL = os.getenv("ENV_BASE_URL")
+TEMPERATURE = 0.0
+MAX_TOKENS = 220
+SUCCESS_THRESHOLD = DEFAULT_SUCCESS_THRESHOLD
+
+SYSTEM_PROMPT = """You are operating a deterministic customer-support environment.
+Choose exactly one tool action at each step and respond with exactly one JSON object.
+Valid actions:
+- {\"action_type\": \"search_kb\", \"query\": \"...\"}
+- {\"action_type\": \"lookup_account\", \"customer_id\": \"...\"}
+- {\"action_type\": \"send_reply\", \"message\": \"...\"}
+- {\"action_type\": \"issue_refund\", \"amount_cents\": 4900, \"reason_code\": \"duplicate_charge\"}
+- {\"action_type\": \"resolve_ticket\", \"resolution_code\": \"password_reset_guidance\"}
+- {\"action_type\": \"resolve_ticket\", \"resolution_code\": \"billing_refund_processed\"}
+- {\"action_type\": \"escalate_ticket\", \"queue\": \"support_lead\", \"priority\": \"P2\", \"summary\": \"...\"}
+- {\"action_type\": \"escalate_ticket\", \"queue\": \"legal_data_incident\", \"priority\": \"P0\", \"summary\": \"...\"}
+Do not include markdown, code fences, or explanations."""
+
+
+def log_start(task: str, env: str, model: str) -> None:
+    print(f"[START] task={task} env={env} model={model}", flush=True)
+
+
+def log_step(step: int, action: str, reward: float, done: bool, error: str | None) -> None:
+    error_value = "null" if not error else error.replace("\n", " ")
+    print(
+        f"[STEP] step={step} action={action} reward={reward:.2f} done={str(done).lower()} error={error_value}",
+        flush=True,
+    )
+
+
+def log_end(success: bool, steps: int, score: float, rewards: list[float]) -> None:
+    rewards_str = ",".join(f"{reward:.2f}" for reward in rewards)
+    print(
+        f"[END] success={str(success).lower()} steps={steps} score={score:.3f} rewards={rewards_str}",
+        flush=True,
+    )
+
+
+def _strip_code_fences(text: str) -> str:
+    cleaned = text.strip()
+    if cleaned.startswith("```"):
+        lines = cleaned.splitlines()
+        if lines and lines[0].startswith("```"):
+            lines = lines[1:]
+        if lines and lines[-1].startswith("```"):
+            lines = lines[:-1]
+        cleaned = "\n".join(lines).strip()
+    return cleaned
+
+
+def _extract_json_object(text: str) -> dict[str, Any]:
+    cleaned = _strip_code_fences(text)
+    start = cleaned.find("{")
+    end = cleaned.rfind("}")
+    if start == -1 or end == -1 or end <= start:
+        raise ValueError("No JSON object found in model response")
+    return json.loads(cleaned[start : end + 1])
+
+
+def build_user_prompt(observation: dict[str, Any]) -> str:
+    return (
+        "Choose the next best action for this support ticket. "
+        "Keep it valid and deterministic. Observation JSON:\n"
+        f"{json.dumps(observation, indent=2)}"
+    )
+
+
+def choose_action(client: OpenAI | None, observation) -> Any:
+    fallback = fallback_action(observation)
+    if client is None:
+        return fallback
+
+    try:
+        completion = client.chat.completions.create(
+            model=MODEL_NAME,
+            messages=[
+                {"role": "system", "content": SYSTEM_PROMPT},
+                {"role": "user", "content": build_user_prompt(observation.model_dump(mode="json"))},
+            ],
+            temperature=TEMPERATURE,
+            max_tokens=MAX_TOKENS,
+        )
+        content = (completion.choices[0].message.content or "").strip()
+        return parse_action(_extract_json_object(content))
+    except Exception as exc:  # pragma: no cover - depends on external endpoint
+        print(f"[DEBUG] Falling back to scripted policy: {exc}", file=sys.stderr, flush=True)
+        return fallback
+
+
+def run_episode(task_id: str, client: OpenAI | None) -> None:
+    env = SupportTicketEnv(base_url=ENV_BASE_URL, task_id=task_id)
+    rewards: list[float] = []
+    steps_taken = 0
+    final_score = 0.0
+    success = False
+
+    log_start(task=task_id, env=BENCHMARK_NAME, model=MODEL_NAME)
+
+    try:
+        result = env.reset(task_id)
+        while not result.done:
+            action = choose_action(client, result.observation)
+            result = env.step(action)
+            steps_taken += 1
+            rewards.append(result.reward)
+            action_str = json.dumps(action.model_dump(mode="json"), separators=(",", ":"))
+            log_step(
+                step=steps_taken,
+                action=action_str,
+                reward=result.reward,
+                done=result.done,
+                error=result.observation.last_action_error,
+            )
+        final_score = float(result.info.get("score", 0.0))
+        success = final_score >= SUCCESS_THRESHOLD
+    finally:
+        env.close()
+        log_end(success=success, steps=steps_taken, score=final_score, rewards=rewards)
+
+
+if __name__ == "__main__":
+    client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY) if API_KEY else None
+    for task_id in list_task_ids():
+        run_episode(task_id, client)
|
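The `[STEP]` line format produced by `log_step` in `inference.py` above can be read back with a single regular expression. The sketch below is one way a downstream consumer might do that; `STEP_RE` and `parse_step_line` are hypothetical names and are not part of this commit. It assumes the action is the compact JSON produced by `json.dumps` and that the line contains no raw newlines, which `log_step` ensures for the error field.

```python
import json
import re
from typing import Any

# Mirrors the f-string in log_step:
#   [STEP] step=N action=<json> reward=<x.xx> done=<true|false> error=<text|null>
# The action group is greedy, so spaces inside the JSON are tolerated and the
# match anchors on the last " reward=... done=... error=" sequence in the line.
STEP_RE = re.compile(
    r"^\[STEP\] step=(?P<step>\d+) action=(?P<action>.*) "
    r"reward=(?P<reward>-?\d+\.\d{2}) done=(?P<done>true|false) error=(?P<error>.*)$"
)


def parse_step_line(line: str) -> dict[str, Any]:
    """Parse one [STEP] stdout line into a dictionary of typed fields."""
    match = STEP_RE.match(line.strip())
    if match is None:
        raise ValueError(f"Not a [STEP] line: {line!r}")
    error = match["error"]
    return {
        "step": int(match["step"]),
        "action": json.loads(match["action"]),
        "reward": float(match["reward"]),
        "done": match["done"] == "true",
        # log_step writes the literal string "null" when there is no error.
        "error": None if error == "null" else error,
    }
```

The logged reward is already rounded to two decimals, so a consumer that needs exact values should read them from the environment result instead of from the log.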
openenv.yaml
ADDED
|
@@ -0,0 +1,11 @@
| 1 |
+
name: customer_support_ticket_handler
|
| 2 |
+
version: 0.1.0
|
| 3 |
+
description: Deterministic B2B SaaS support benchmark with typed tool actions and rubric-based rewards.
|
| 4 |
+
entrypoint: server.app:app
|
| 5 |
+
runtime: fastapi
|
| 6 |
+
port: 8000
|
| 7 |
+
tags:
|
| 8 |
+
- openenv
|
| 9 |
+
- customer-support
|
| 10 |
+
- reinforcement-learning
|
| 11 |
+
- benchmark
|
pyproject.toml
ADDED
|
@@ -0,0 +1,32 @@
| 1 |
+
[build-system]
|
| 2 |
+
requires = ["setuptools>=68", "wheel"]
|
| 3 |
+
build-backend = "setuptools.build_meta"
|
| 4 |
+
|
| 5 |
+
[project]
|
| 6 |
+
name = "customer-support-openenv"
|
| 7 |
+
version = "0.1.0"
|
| 8 |
+
description = "Deterministic OpenEnv-style customer support ticket benchmark for B2B SaaS workflows."
|
| 9 |
+
readme = "README.md"
|
| 10 |
+
requires-python = ">=3.11"
|
| 11 |
+
dependencies = [
|
| 12 |
+
"fastapi>=0.115",
|
| 13 |
+
"openenv-core>=0.2.0",
|
| 14 |
+
"openai>=1.30",
|
| 15 |
+
"pydantic>=2.7",
|
| 16 |
+
"uvicorn>=0.30",
|
| 17 |
+
]
|
| 18 |
+
|
| 19 |
+
[project.scripts]
|
| 20 |
+
server = "server.app:main"
|
| 21 |
+
|
| 22 |
+
[project.optional-dependencies]
|
| 23 |
+
dev = ["pytest>=8.0"]
|
| 24 |
+
|
| 25 |
+
[tool.setuptools.packages.find]
|
| 26 |
+
where = ["."]
|
| 27 |
+
include = ["support_ticket_env", "support_ticket_env.*", "server", "server.*"]
|
| 28 |
+
|
| 29 |
+
[tool.pytest.ini_options]
|
| 30 |
+
addopts = "-p no:cacheprovider"
|
| 31 |
+
pythonpath = ["."]
|
| 32 |
+
testpaths = ["tests"]
|
server/Dockerfile
ADDED
|
@@ -0,0 +1,19 @@
| 1 |
+
FROM python:3.12-slim
|
| 2 |
+
|
| 3 |
+
ENV PYTHONDONTWRITEBYTECODE=1 \
|
| 4 |
+
PYTHONUNBUFFERED=1 \
|
| 5 |
+
PORT=8000
|
| 6 |
+
|
| 7 |
+
WORKDIR /app
|
| 8 |
+
|
| 9 |
+
COPY pyproject.toml README.md openenv.yaml ./
|
| 10 |
+
COPY support_ticket_env ./support_ticket_env
|
| 11 |
+
COPY server ./server
|
| 12 |
+
COPY inference.py ./inference.py
|
| 13 |
+
|
| 14 |
+
RUN pip install --no-cache-dir --upgrade pip && \
|
| 15 |
+
pip install --no-cache-dir .
|
| 16 |
+
|
| 17 |
+
EXPOSE 8000
|
| 18 |
+
|
| 19 |
+
CMD ["python", "-m", "server.app"]
|
server/__init__.py
ADDED
|
@@ -0,0 +1 @@
| 1 |
+
|
server/app.py
ADDED
|
@@ -0,0 +1,168 @@
from __future__ import annotations

from threading import Lock
from uuid import uuid4
from typing import Any

from fastapi import FastAPI, HTTPException, WebSocket, WebSocketDisconnect
from fastapi.responses import HTMLResponse
from pydantic import BaseModel, ConfigDict

from support_ticket_env import SupportTicketEnvironment, list_task_ids


class ResetRequest(BaseModel):
    model_config = ConfigDict(extra="forbid")

    task_id: str | None = None
    session_id: str | None = None


class StepRequest(BaseModel):
    model_config = ConfigDict(extra="forbid")

    session_id: str
    action: dict[str, Any]


class SessionManager:
    def __init__(self) -> None:
        self._sessions: dict[str, SupportTicketEnvironment] = {}
        self._lock = Lock()

    def create_or_reuse(self, session_id: str | None = None, task_id: str | None = None) -> tuple[str, SupportTicketEnvironment]:
        with self._lock:
            if session_id and session_id in self._sessions:
                return session_id, self._sessions[session_id]
            new_session_id = session_id or str(uuid4())
            env = SupportTicketEnvironment(task_id=task_id)
            self._sessions[new_session_id] = env
            return new_session_id, env

    def get(self, session_id: str) -> SupportTicketEnvironment:
        with self._lock:
            if session_id not in self._sessions:
                raise KeyError(session_id)
            return self._sessions[session_id]

    def delete(self, session_id: str) -> None:
        with self._lock:
            self._sessions.pop(session_id, None)


manager = SessionManager()
app = FastAPI(
    title="AcmeCloud Customer Support Ticket Handler",
    version="0.1.0",
    description="Deterministic OpenEnv-style customer support benchmark for B2B SaaS ticket handling.",
)


def _step_payload(result, session_id: str) -> dict[str, Any]:
    payload = result.model_dump(mode="json")
    payload.setdefault("info", {})["session_id"] = session_id
    return payload


@app.get("/health")
def health() -> dict[str, Any]:
    return {"status": "healthy", "tasks": list_task_ids()}


@app.post("/reset")
def reset(request: ResetRequest) -> dict[str, Any]:
    session_id, env = manager.create_or_reuse(request.session_id, request.task_id)
    result = env.reset(request.task_id)
    return _step_payload(result, session_id)


@app.post("/step")
def step(request: StepRequest) -> dict[str, Any]:
    try:
        env = manager.get(request.session_id)
    except KeyError as exc:
        raise HTTPException(status_code=404, detail=f"Unknown session_id: {request.session_id}") from exc
    result = env.step(request.action)
    return _step_payload(result, request.session_id)


@app.get("/state")
def state(session_id: str) -> dict[str, Any]:
    try:
        env = manager.get(session_id)
    except KeyError as exc:
        raise HTTPException(status_code=404, detail=f"Unknown session_id: {session_id}") from exc
    return {"session_id": session_id, **env.state()}


@app.delete("/session/{session_id}")
def close_session(session_id: str) -> dict[str, str]:
    manager.delete(session_id)
    return {"status": "deleted", "session_id": session_id}


@app.get("/web")
def web_ui() -> HTMLResponse:
    task_items = "".join(f"<li><code>{task_id}</code></li>" for task_id in list_task_ids())
    html = f"""
    <html>
      <head>
        <title>AcmeCloud Customer Support Ticket Handler</title>
        <style>
          body {{ font-family: Segoe UI, sans-serif; margin: 2rem auto; max-width: 900px; line-height: 1.5; }}
          code {{ background: #f4f4f4; padding: 0.15rem 0.35rem; border-radius: 0.25rem; }}
          pre {{ background: #111827; color: #f9fafb; padding: 1rem; border-radius: 0.5rem; overflow-x: auto; }}
        </style>
      </head>
      <body>
        <h1>AcmeCloud Customer Support Ticket Handler</h1>
        <p>One episode equals one support ticket. Available fixed tasks:</p>
        <ul>{task_items}</ul>
        <p>Example local reset:</p>
        <pre>curl -X POST http://localhost:8000/reset -H "Content-Type: application/json" -d '{{"task_id":"password_reset_guidance"}}'</pre>
      </body>
    </html>
    """
    return HTMLResponse(html)


@app.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket) -> None:
    await websocket.accept()
    session_id = str(uuid4())
    env = SupportTicketEnvironment()
    try:
        while True:
            payload = await websocket.receive_json()
            message_type = payload.get("type")
            if message_type == "reset":
                result = env.reset(payload.get("task_id"))
                await websocket.send_json(_step_payload(result, session_id))
            elif message_type == "step":
                result = env.step(payload.get("action", {}))
                await websocket.send_json(_step_payload(result, session_id))
            elif message_type == "state":
                await websocket.send_json({"session_id": session_id, **env.state()})
            elif message_type == "close":
                await websocket.send_json({"status": "closed", "session_id": session_id})
                break
            else:
                await websocket.send_json(
                    {
                        "error": "unsupported_message_type",
                        "message": "Use reset, step, state, or close.",
                        "session_id": session_id,
                    }
                )
    except WebSocketDisconnect:
        return


def main() -> None:
    import uvicorn

    uvicorn.run("server.app:app", host="0.0.0.0", port=8000, reload=False)


if __name__ == "__main__":
    main()
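The session handling in `server/app.py` can be tried without FastAPI. The following is a minimal sketch of the same lock-guarded create-or-reuse pattern, assuming a hypothetical `FakeEnv` class in place of `SupportTicketEnvironment`; it is an illustration of the pattern, not the server code itself.

```python
from threading import Lock
from uuid import uuid4


class FakeEnv:
    """Hypothetical stand-in for SupportTicketEnvironment (assumption)."""

    def __init__(self, task_id=None):
        self.task_id = task_id


class SessionManager:
    """Maps session ids to environments; every access holds the lock."""

    def __init__(self):
        self._sessions = {}
        self._lock = Lock()

    def create_or_reuse(self, session_id=None, task_id=None):
        with self._lock:
            # A known id returns the existing environment unchanged;
            # task_id is ignored in that case.
            if session_id and session_id in self._sessions:
                return session_id, self._sessions[session_id]
            new_id = session_id or str(uuid4())
            env = FakeEnv(task_id=task_id)
            self._sessions[new_id] = env
            return new_id, env

    def delete(self, session_id):
        with self._lock:
            # pop with a default makes deletion idempotent.
            self._sessions.pop(session_id, None)


manager = SessionManager()
sid, env = manager.create_or_reuse(task_id="password_reset_guidance")
same_sid, same_env = manager.create_or_reuse(session_id=sid, task_id="other")
manager.delete(sid)
new_sid, new_env = manager.create_or_reuse(session_id=sid)
```

Because a reused session keeps its existing environment, the `/reset` handler still has to call `env.reset(request.task_id)` itself to switch tasks, which is what the endpoint in the diff does.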
server/requirements.txt
ADDED
|
@@ -0,0 +1,4 @@
fastapi>=0.115
uvicorn>=0.30
pydantic>=2.7
openai>=1.30
support_ticket_env/__init__.py
ADDED
|
@@ -0,0 +1,42 @@
from .client import SupportTicketEnv
from .env import SupportTicketEnvironment
from .fixtures import BENCHMARK_NAME, DEFAULT_SUCCESS_THRESHOLD, KB_ARTICLES, TASK_FIXTURES, list_task_ids
from .models import (
    ACTION_TYPE_NAMES,
    EscalateTicketAction,
    IssueRefundAction,
    LookupAccountAction,
    ResolveTicketAction,
    SearchKBAction,
    SendReplyAction,
    SupportTicketAction,
    SupportTicketObservation,
    SupportTicketStepResult,
    TaskScorecard,
    parse_action,
)
from .policies import fallback_action, scripted_policy

__all__ = [
    "ACTION_TYPE_NAMES",
    "BENCHMARK_NAME",
    "DEFAULT_SUCCESS_THRESHOLD",
    "EscalateTicketAction",
    "IssueRefundAction",
    "KB_ARTICLES",
    "LookupAccountAction",
    "ResolveTicketAction",
    "SearchKBAction",
    "SendReplyAction",
    "SupportTicketAction",
    "SupportTicketEnv",
    "SupportTicketEnvironment",
    "SupportTicketObservation",
    "SupportTicketStepResult",
    "TASK_FIXTURES",
    "TaskScorecard",
    "fallback_action",
    "list_task_ids",
    "parse_action",
    "scripted_policy",
]
support_ticket_env/client.py
ADDED
|
@@ -0,0 +1,99 @@
from __future__ import annotations

import json
from typing import Any
from urllib import parse, request

from .env import SupportTicketEnvironment
from .fixtures import list_task_ids
from .models import SupportTicketAction, SupportTicketStepResult


class SupportTicketEnv:
    def __init__(self, base_url: str | None = None, task_id: str | None = None) -> None:
        self.base_url = base_url.rstrip("/") if base_url else None
        self.task_id = task_id
        self.session_id: str | None = None
        self._local_env = SupportTicketEnvironment(task_id=task_id) if not self.base_url else None

    @classmethod
    def from_docker_image(
        cls,
        image_name: str,
        base_url: str = "http://localhost:8000",
        task_id: str | None = None,
    ) -> "SupportTicketEnv":
        del image_name
        return cls(base_url=base_url, task_id=task_id)

    @classmethod
    def from_env(
        cls,
        repo_id: str,
        base_url: str,
        task_id: str | None = None,
    ) -> "SupportTicketEnv":
        del repo_id
        return cls(base_url=base_url, task_id=task_id)

    def reset(self, task_id: str | None = None) -> SupportTicketStepResult:
        effective_task_id = task_id or self.task_id
        if self._local_env is not None:
            return self._local_env.reset(effective_task_id)

        payload = {}
        if effective_task_id:
            payload["task_id"] = effective_task_id
        if self.session_id:
            payload["session_id"] = self.session_id
        result = self._post_json("/reset", payload)
        self.session_id = result.info.get("session_id")
        return result

    def step(self, action: SupportTicketAction | dict[str, Any]) -> SupportTicketStepResult:
        if self._local_env is not None:
            return self._local_env.step(action)

        payload = {
            "session_id": self.session_id,
            "action": action.model_dump(mode="json") if hasattr(action, "model_dump") else action,
        }
        result = self._post_json("/step", payload)
        self.session_id = result.info.get("session_id", self.session_id)
        return result

    def state(self) -> dict[str, Any]:
        if self._local_env is not None:
            return self._local_env.state()

        if not self.session_id:
            raise RuntimeError("reset() must be called before state() when using HTTP mode.")
        query = parse.urlencode({"session_id": self.session_id})
        with request.urlopen(f"{self.base_url}/state?{query}") as response:
            return json.loads(response.read().decode("utf-8"))

    def close(self) -> None:
        self.session_id = None

    def __enter__(self) -> "SupportTicketEnv":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        self.close()
        return None

    @staticmethod
    def list_tasks() -> list[str]:
        return list_task_ids()

    def _post_json(self, path: str, payload: dict[str, Any]) -> SupportTicketStepResult:
        body = json.dumps(payload).encode("utf-8")
        req = request.Request(
            f"{self.base_url}{path}",
            data=body,
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with request.urlopen(req) as response:
            data = json.loads(response.read().decode("utf-8"))
        return SupportTicketStepResult.model_validate(data)
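The HTTP mode of `SupportTicketEnv` relies only on the standard library. As a rough illustration of what `_post_json` sends, the sketch below builds the same kind of `urllib.request.Request` without opening a connection. The helper name `build_post` is hypothetical, and the URL is the local default that appears in the diff.

```python
import json
from urllib import request


def build_post(base_url, path, payload):
    """Build (but do not send) a JSON POST request, mirroring _post_json."""
    body = json.dumps(payload).encode("utf-8")
    return request.Request(
        f"{base_url.rstrip('/')}{path}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_post(
    "http://localhost:8000/",
    "/reset",
    {"task_id": "password_reset_guidance"},
)

# Inspect the request instead of sending it.
print(req.get_method(), req.full_url)
print(json.loads(req.data))
```

Sending it would be `request.urlopen(req)`, which needs the server from `server/app.py` to be running. The response body would then carry the `session_id` under `info`, which the client stores for later `/step` calls.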
support_ticket_env/env.py
ADDED
|
@@ -0,0 +1,423 @@
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any

from .fixtures import (
    BENCHMARK_NAME,
    DEFAULT_SUCCESS_THRESHOLD,
    KB_ARTICLES,
    KnowledgeBaseArticle,
    TaskFixture,
    get_task_fixture,
    list_task_ids,
)
from .models import (
    ACTION_TYPE_NAMES,
    AccountLookupResult,
    ConversationTurn,
    KBSearchResult,
    ErrorToolResult,
    EscalateTicketAction,
    EscalationResult,
    IssueRefundAction,
    LookupAccountAction,
    RefundResult,
    ReplyResult,
    ResolveResult,
    SearchKBAction,
    SupportTicketAction,
    SupportTicketObservation,
    SupportTicketStepResult,
    ToolResult,
    parse_action,
)
from .scoring import build_scorecard, normalize_text


@dataclass
class SessionState:
    fixture: TaskFixture
    ticket_status: str = "open"
    steps_taken: int = 0
    conversation_history: list[ConversationTurn] = field(default_factory=list)
    action_history: list[dict[str, Any]] = field(default_factory=list)
    reply_history: list[dict[str, Any]] = field(default_factory=list)
    known_facts: dict[str, Any] = field(default_factory=dict)
    kb_articles_seen: set[str] = field(default_factory=set)
    search_signatures: set[str] = field(default_factory=set)
    lookup_performed: bool = False
    lookup_customer_id: str | None = None
    refund_record: dict[str, Any] | None = None
    refund_attempted: bool = False
    resolution_code: str | None = None
    escalation: dict[str, Any] | None = None
    done: bool = False
    terminal_reason: str | None = None
    previous_score: float = 0.0
    last_tool_result: ToolResult | None = None
    last_action_error: str | None = None

class SupportTicketEnvironment:
    benchmark_name = BENCHMARK_NAME
    max_steps = 8
    step_cost = 0.01
    invalid_action_penalty = 0.10
    repeated_action_penalty = 0.02
    success_threshold = DEFAULT_SUCCESS_THRESHOLD

    def __init__(self, task_id: str | None = None) -> None:
        self._default_task_id = task_id or list_task_ids()[0]
        self._session: SessionState | None = None

    def reset(self, task_id: str | None = None) -> SupportTicketStepResult:
        fixture = get_task_fixture(task_id or self._default_task_id)
        self._session = SessionState(
            fixture=fixture,
            conversation_history=[
                ConversationTurn(
                    role="customer",
                    message=fixture.ticket.message,
                    step_index=0,
                )
            ],
        )
        return self._build_result(reward=0.0)

    def step(self, action: SupportTicketAction | dict[str, Any]) -> SupportTicketStepResult:
        session = self._require_session()
        if session.done:
            session.last_action_error = "episode_already_done"
            session.last_tool_result = ErrorToolResult(
                tool_name="error",
                success=False,
                error_code="episode_already_done",
                message="This ticket is already terminal. Reset the environment before stepping again.",
            )
            return self._build_result(reward=-self.invalid_action_penalty)

        invalid_penalty = 0.0
        redundancy_penalty = 0.0
        session.last_action_error = None

        try:
            parsed_action = parse_action(action)
        except Exception as exc:
            session.steps_taken += 1
            session.last_action_error = f"invalid_action: {exc}"
            session.last_tool_result = ErrorToolResult(
                tool_name="error",
                success=False,
                error_code="invalid_action",
                message=str(exc),
            )
            invalid_penalty = self.invalid_action_penalty
            self._record_action({"action_type": "invalid"}, False)
            if session.steps_taken >= self.max_steps:
                session.done = True
                session.terminal_reason = "max_steps_exceeded"
            return self._finalize_step(invalid_penalty=invalid_penalty, redundancy_penalty=0.0)

        session.steps_taken += 1
        session.last_tool_result, invalid_penalty, redundancy_penalty = self._apply_action(parsed_action)
        action_succeeded = bool(getattr(session.last_tool_result, "success", False))
        self._record_action(parsed_action.model_dump(mode="json"), action_succeeded)

        if not session.done and session.steps_taken >= self.max_steps:
            session.done = True
            session.terminal_reason = "max_steps_exceeded"

        return self._finalize_step(
            invalid_penalty=invalid_penalty,
            redundancy_penalty=redundancy_penalty,
        )

    def state(self) -> dict[str, Any]:
        session = self._require_session()
        scorecard = build_scorecard(session.fixture, session)
        return {
            "benchmark_name": self.benchmark_name,
            "task_id": session.fixture.task_id,
            "ticket_status": session.ticket_status,
            "steps_taken": session.steps_taken,
            "steps_remaining": max(self.max_steps - session.steps_taken, 0),
            "conversation_history": [turn.model_dump(mode="json") for turn in session.conversation_history],
            "audit_log": list(session.action_history),
            "known_facts": dict(session.known_facts),
            "current_rubric_score": scorecard.score,
            "score_breakdown": scorecard.model_dump(mode="json"),
            "terminal_reason": session.terminal_reason,
            "done": session.done,
        }

    def _apply_action(self, action: SupportTicketAction) -> tuple[ToolResult, float, float]:
        session = self._require_session()
        invalid_penalty = 0.0
        redundancy_penalty = 0.0

        if isinstance(action, SearchKBAction):
            query_signature = normalize_text(action.query)
            if query_signature in session.search_signatures:
                redundancy_penalty = self.repeated_action_penalty
            session.search_signatures.add(query_signature)
            articles = self._search_knowledge_base(action.query)
            article_ids = [article.article_id for article in articles]
            session.kb_articles_seen.update(article_ids)
            session.known_facts["kb_articles_seen"] = sorted(session.kb_articles_seen)
            session.known_facts["kb_titles_seen"] = [KB_ARTICLES[article_id].title for article_id in sorted(session.kb_articles_seen)]
            result = KBSearchResult(
                tool_name="search_kb",
                success=bool(articles),
                query=action.query,
                article_ids=article_ids,
                snippets=[article.snippet for article in articles],
                message="Knowledge base search completed." if articles else "No KB articles matched the query.",
            )
            return result, invalid_penalty, redundancy_penalty

        if isinstance(action, LookupAccountAction):
            if action.customer_id != session.fixture.account.customer_id:
                session.last_action_error = "unknown_customer_id"
                result = ErrorToolResult(
                    tool_name="error",
                    success=False,
                    error_code="unknown_customer_id",
                    message=f"No account found for customer_id={action.customer_id}.",
                )
                return result, self.invalid_action_penalty, redundancy_penalty

            if session.lookup_performed and session.lookup_customer_id == action.customer_id:
                redundancy_penalty = self.repeated_action_penalty

            account = session.fixture.account
            session.lookup_performed = True
            session.lookup_customer_id = action.customer_id
            account_summary = {
                "customer_id": account.customer_id,
                "organization_name": account.organization_name,
                "plan": account.plan,
                "tenure_years": account.tenure_years,
                "arr_usd": account.arr_usd,
                "duplicate_charge_amount_cents": account.duplicate_charge_amount_cents,
                "duplicate_charge_count": account.duplicate_charge_count,
                "duplicate_charge_refund_eligible": account.duplicate_charge_refund_eligible,
                "legal_threat": account.legal_threat,
                "incident_severity": account.incident_severity,
            }
            session.known_facts["account"] = account_summary
            result = AccountLookupResult(
                tool_name="lookup_account",
                success=True,
                customer_id=action.customer_id,
                account_summary=account_summary,
                message="Account lookup completed.",
            )
            return result, invalid_penalty, redundancy_penalty

        if action.action_type == "send_reply":
            reply = action.message.strip()
            session.reply_history.append({"message": reply, "step_index": session.steps_taken})
            session.conversation_history.append(
                ConversationTurn(role="agent", message=reply, step_index=session.steps_taken)
            )
            result = ReplyResult(
                tool_name="send_reply",
                success=True,
                message_preview=reply[:120],
                message="Reply sent to the customer.",
            )
            return result, invalid_penalty, redundancy_penalty

        if isinstance(action, IssueRefundAction):
            session.refund_attempted = True
            account = session.fixture.account
            if not session.lookup_performed:
                session.last_action_error = "lookup_required_before_refund"
                result = ErrorToolResult(
                    tool_name="error",
                    success=False,
                    error_code="lookup_required_before_refund",
                    message="lookup_account must succeed before issue_refund can be used.",
                )
                return result, self.invalid_action_penalty, redundancy_penalty

            if not account.duplicate_charge_refund_eligible or not account.duplicate_charge_amount_cents:
                session.last_action_error = "refund_not_applicable"
                result = RefundResult(
                    tool_name="issue_refund",
                    success=False,
                    refunded=False,
                    amount_cents=action.amount_cents,
                    reason_code=action.reason_code,
                    message="No duplicate charge is eligible for refund on this account.",
                )
                return result, self.invalid_action_penalty, redundancy_penalty

            if action.amount_cents != account.duplicate_charge_amount_cents or action.reason_code != "duplicate_charge":
                session.last_action_error = "incorrect_refund_payload"
                result = RefundResult(
                    tool_name="issue_refund",
                    success=False,
                    refunded=False,
                    amount_cents=action.amount_cents,
                    reason_code=action.reason_code,
                    message="Refund payload does not match the verified duplicate charge.",
                )
                return result, self.invalid_action_penalty, redundancy_penalty

            session.refund_record = {
                "amount_cents": action.amount_cents,
                "reason_code": action.reason_code,
                "step_index": session.steps_taken,
            }
            result = RefundResult(
                tool_name="issue_refund",
                success=True,
                refunded=True,
                amount_cents=action.amount_cents,
                reason_code=action.reason_code,
                message="Refund recorded successfully.",
            )
            return result, invalid_penalty, redundancy_penalty

        if action.action_type == "resolve_ticket":
            session.resolution_code = action.resolution_code
            session.ticket_status = "resolved"
            session.done = True
            session.terminal_reason = "resolved"
            result = ResolveResult(
                tool_name="resolve_ticket",
                success=True,
                resolution_code=action.resolution_code,
                ticket_status="resolved",
                message="Ticket marked as resolved.",
            )
            return result, invalid_penalty, redundancy_penalty

        if isinstance(action, EscalateTicketAction):
            session.escalation = {
                "queue": action.queue,
                "priority": action.priority,
                "summary": action.summary,
                "step_index": session.steps_taken,
            }
            session.ticket_status = "escalated"
            session.done = True
            session.terminal_reason = "escalated"
            result = EscalationResult(
                tool_name="escalate_ticket",
                success=True,
                queue=action.queue,
                priority=action.priority,
                summary=action.summary,
                ticket_status="escalated",
                message="Ticket escalated.",
            )
            return result, invalid_penalty, redundancy_penalty

        session.last_action_error = "unsupported_action"
        return (
            ErrorToolResult(
                tool_name="error",
                success=False,
                error_code="unsupported_action",
                message=f"Unsupported action type: {type(action).__name__}",
            ),
            self.invalid_action_penalty,
            redundancy_penalty,
        )

    def _search_knowledge_base(self, query: str) -> list[KnowledgeBaseArticle]:
        query_terms = set(normalize_text(query).split())
        ranked: list[tuple[int, str, KnowledgeBaseArticle]] = []
        for article in KB_ARTICLES.values():
            searchable = normalize_text(" ".join((article.title, article.content, " ".join(article.tags))))
            article_terms = set(searchable.split())
            score = len(query_terms & article_terms)
            if score > 0:
                ranked.append((score, article.article_id, article))
        ranked.sort(key=lambda item: (-item[0], item[1]))
        return [article for _, _, article in ranked[:3]]

| 343 |
+
def _record_action(self, action_payload: dict[str, Any], action_succeeded: bool) -> None:
|
| 344 |
+
session = self._require_session()
|
| 345 |
+
session.action_history.append(
|
| 346 |
+
{
|
| 347 |
+
"step_index": session.steps_taken,
|
| 348 |
+
"action": action_payload,
|
| 349 |
+
"success": action_succeeded,
|
| 350 |
+
"ticket_status": session.ticket_status,
|
| 351 |
+
}
|
| 352 |
+
)
|
| 353 |
+
|
| 354 |
+
def _finalize_step(self, invalid_penalty: float, redundancy_penalty: float) -> SupportTicketStepResult:
|
| 355 |
+
session = self._require_session()
|
| 356 |
+
scorecard = build_scorecard(session.fixture, session)
|
| 357 |
+
reward = round(
|
| 358 |
+
(scorecard.score - session.previous_score) - self.step_cost - invalid_penalty - redundancy_penalty,
|
| 359 |
+
6,
|
| 360 |
+
)
|
| 361 |
+
session.previous_score = scorecard.score
|
| 362 |
+
return SupportTicketStepResult(
|
| 363 |
+
observation=self._build_observation(),
|
| 364 |
+
reward=reward,
|
| 365 |
+
done=session.done,
|
| 366 |
+
info={
|
| 367 |
+
"task_id": session.fixture.task_id,
|
| 368 |
+
"benchmark_name": self.benchmark_name,
|
| 369 |
+
"score": scorecard.score,
|
| 370 |
+
"score_breakdown": scorecard.model_dump(mode="json"),
|
| 371 |
+
"success": scorecard.score >= self.success_threshold,
|
| 372 |
+
"success_threshold": self.success_threshold,
|
| 373 |
+
"terminal_reason": session.terminal_reason,
|
| 374 |
+
"invalid_penalty": invalid_penalty,
|
| 375 |
+
"redundancy_penalty": redundancy_penalty,
|
| 376 |
+
},
|
| 377 |
+
)
|
| 378 |
+
|
| 379 |
+
def _build_observation(self) -> SupportTicketObservation:
|
| 380 |
+
session = self._require_session()
|
| 381 |
+
ticket = session.fixture.ticket
|
| 382 |
+
return SupportTicketObservation(
|
| 383 |
+
task_id=session.fixture.task_id,
|
| 384 |
+
ticket_id=ticket.ticket_id,
|
| 385 |
+
ticket_status=session.ticket_status,
|
| 386 |
+
customer_id=ticket.customer_id,
|
| 387 |
+
organization_name=ticket.organization_name,
|
| 388 |
+
subject=ticket.subject,
|
| 389 |
+
customer_message=ticket.message,
|
| 390 |
+
conversation_history=list(session.conversation_history),
|
| 391 |
+
last_tool_result=session.last_tool_result,
|
| 392 |
+
steps_taken=session.steps_taken,
|
| 393 |
+
steps_remaining=max(self.max_steps - session.steps_taken, 0),
|
| 394 |
+
available_action_types=list(ACTION_TYPE_NAMES),
|
| 395 |
+
last_action_error=session.last_action_error,
|
| 396 |
+
known_facts=dict(session.known_facts),
|
| 397 |
+
)
|
| 398 |
+
|
| 399 |
+
def _build_result(self, reward: float) -> SupportTicketStepResult:
|
| 400 |
+
session = self._require_session()
|
| 401 |
+
scorecard = build_scorecard(session.fixture, session)
|
| 402 |
+
session.previous_score = scorecard.score
|
| 403 |
+
return SupportTicketStepResult(
|
| 404 |
+
observation=self._build_observation(),
|
| 405 |
+
reward=reward,
|
| 406 |
+
done=session.done,
|
| 407 |
+
info={
|
| 408 |
+
"task_id": session.fixture.task_id,
|
| 409 |
+
"benchmark_name": self.benchmark_name,
|
| 410 |
+
"score": scorecard.score,
|
| 411 |
+
"score_breakdown": scorecard.model_dump(mode="json"),
|
| 412 |
+
"success": scorecard.score >= self.success_threshold,
|
| 413 |
+
"success_threshold": self.success_threshold,
|
| 414 |
+
"terminal_reason": session.terminal_reason,
|
| 415 |
+
"invalid_penalty": 0.0,
|
| 416 |
+
"redundancy_penalty": 0.0,
|
| 417 |
+
},
|
| 418 |
+
)
|
| 419 |
+
|
| 420 |
+
def _require_session(self) -> SessionState:
|
| 421 |
+
if self._session is None:
|
| 422 |
+
raise RuntimeError("Environment has not been reset yet.")
|
| 423 |
+
return self._session
|
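The step reward computed in `env.py` above is the change in rubric score since the previous step, minus a fixed step cost and any penalties, rounded to six decimals. The following is a minimal standalone sketch of that arithmetic. The function name `shaped_reward` and the default `step_cost` of 0.01 are assumptions for illustration; the actual step cost is not visible in this part of the diff.

```python
def shaped_reward(
    score: float,
    previous_score: float,
    step_cost: float = 0.01,  # hypothetical value, not taken from the source
    invalid_penalty: float = 0.0,
    redundancy_penalty: float = 0.0,
) -> float:
    """Score delta minus the per-step cost and penalties, rounded to 6 decimals."""
    return round(
        (score - previous_score) - step_cost - invalid_penalty - redundancy_penalty,
        6,
    )


# First step earns 0.20 of rubric credit; the second step earns nothing new.
first = shaped_reward(score=0.20, previous_score=0.0)
second = shaped_reward(score=0.20, previous_score=0.20)
```

Because only the score delta is rewarded, a step that adds no new rubric credit yields a negative reward equal to the step cost plus any penalties.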
support_ticket_env/fixtures.py
ADDED
|
@@ -0,0 +1,270 @@
|
| 1 |
+
from __future__ import annotations
|
| 2 |
+
|
| 3 |
+
from collections import OrderedDict
|
| 4 |
+
from dataclasses import dataclass, field
|
| 5 |
+
from typing import Literal
|
| 6 |
+
|
| 7 |
+
|
| 8 |
+
BENCHMARK_NAME = "customer_support_ticket_handler"
|
| 9 |
+
DEFAULT_SUCCESS_THRESHOLD = 0.75
|
| 10 |
+
RESET_URL = "https://app.acmecloud.com/reset"
|
| 11 |
+
|
| 12 |
+
|
| 13 |
+
@dataclass(frozen=True)
|
| 14 |
+
class KnowledgeBaseArticle:
|
| 15 |
+
article_id: str
|
| 16 |
+
title: str
|
| 17 |
+
tags: tuple[str, ...]
|
| 18 |
+
snippet: str
|
| 19 |
+
content: str
|
| 20 |
+
|
| 21 |
+
|
| 22 |
+
@dataclass(frozen=True)
|
| 23 |
+
class TicketFixture:
|
| 24 |
+
ticket_id: str
|
| 25 |
+
customer_id: str
|
| 26 |
+
organization_name: str
|
| 27 |
+
subject: str
|
| 28 |
+
message: str
|
| 29 |
+
|
| 30 |
+
|
| 31 |
+
@dataclass(frozen=True)
|
| 32 |
+
class AccountFixture:
|
| 33 |
+
customer_id: str
|
| 34 |
+
organization_name: str
|
| 35 |
+
plan: str
|
| 36 |
+
tenure_years: float | None = None
|
| 37 |
+
arr_usd: int | None = None
|
| 38 |
+
duplicate_charge_amount_cents: int | None = None
|
| 39 |
+
duplicate_charge_count: int = 0
|
| 40 |
+
duplicate_charge_refund_eligible: bool = False
|
| 41 |
+
legal_threat: bool = False
|
| 42 |
+
incident_severity: str | None = None
|
| 43 |
+
mandatory_escalation_queue: str | None = None
|
| 44 |
+
mandatory_escalation_priority: str | None = None
|
| 45 |
+
|
| 46 |
+
|
| 47 |
+
@dataclass(frozen=True)
|
| 48 |
+
class TaskFixture:
|
| 49 |
+
task_id: str
|
| 50 |
+
title: str
|
| 51 |
+
difficulty: Literal["easy", "medium", "hard"]
|
| 52 |
+
ticket: TicketFixture
|
| 53 |
+
account: AccountFixture
|
| 54 |
+
relevant_kb_article_id: str | None = None
|
| 55 |
+
expected_terminal_mode: Literal["resolve", "escalate"] = "resolve"
|
| 56 |
+
expected_resolution_code: str | None = None
|
| 57 |
+
expected_refund_amount_cents: int | None = None
|
| 58 |
+
refund_reason_code: str | None = None
|
| 59 |
+
expected_escalation_queue: str | None = None
|
| 60 |
+
expected_escalation_priority: str | None = None
|
| 61 |
+
reply_keyword_groups: dict[str, tuple[str, ...]] = field(default_factory=dict)
|
| 62 |
+
forbidden_reply_phrases: tuple[str, ...] = ()
|
| 63 |
+
rubric_weights: dict[str, float] = field(default_factory=dict)
|
| 64 |
+
efficiency_bonus_max_steps: int | None = None
|
| 65 |
+
|
| 66 |
+
|
| 67 |
+
KB_ARTICLES = OrderedDict(
|
| 68 |
+
(
|
| 69 |
+
article.article_id,
|
| 70 |
+
article,
|
| 71 |
+
)
|
| 72 |
+
for article in (
|
| 73 |
+
KnowledgeBaseArticle(
|
| 74 |
+
article_id="KB-PW-RESET",
|
| 75 |
+
title="Password reset email troubleshooting",
|
| 76 |
+
tags=("password", "reset", "email", "spam", "login"),
|
| 77 |
+
snippet="Ask the customer to use the AcmeCloud reset page, check spam or junk, and wait 5 minutes before retrying.",
|
| 78 |
+
content=(
|
| 79 |
+
f"If a password reset email does not arrive, direct the user to {RESET_URL}. "
|
| 80 |
+
"Ask them to check their spam or junk folder and wait 5 minutes before requesting another email."
|
| 81 |
+
),
|
| 82 |
+
),
|
| 83 |
+
KnowledgeBaseArticle(
|
| 84 |
+
article_id="KB-BILL-DUPLICATE",
|
| 85 |
+
title="Duplicate subscription charge refund policy",
|
| 86 |
+
tags=("billing", "refund", "duplicate", "charge", "subscription"),
|
| 87 |
+
snippet="After verifying a duplicate charge, support can refund the extra charge. Refunds settle in 3-5 business days.",
|
| 88 |
+
content=(
|
| 89 |
+
"If account history confirms an accidental duplicate subscription charge, refund the duplicate amount in full. "
|
| 90 |
+
"Communicate that the refund will appear in 3-5 business days."
|
| 91 |
+
),
|
| 92 |
+
),
|
| 93 |
+
KnowledgeBaseArticle(
|
| 94 |
+
article_id="KB-INCIDENT-LEGAL",
|
| 95 |
+
title="Critical data incident legal escalation",
|
| 96 |
+
tags=("incident", "legal", "data", "escalation", "enterprise"),
|
| 97 |
+
snippet="Legal threats and alleged customer data loss must be escalated immediately to the legal_data_incident queue at P0.",
|
| 98 |
+
content=(
|
| 99 |
+
"If an enterprise customer reports data loss and mentions legal action, do not promise a resolution, do not admit fault, "
|
| 100 |
+
"and escalate immediately to the legal_data_incident queue with priority P0."
|
| 101 |
+
),
|
| 102 |
+
),
|
| 103 |
+
KnowledgeBaseArticle(
|
| 104 |
+
article_id="KB-SSO-SETUP",
|
| 105 |
+
title="Single sign-on setup guide",
|
| 106 |
+
tags=("sso", "setup", "identity", "onboarding"),
|
| 107 |
+
snippet="Configure SAML or OIDC before enforcing SSO in production.",
|
| 108 |
+
content="SSO setup steps for administrators integrating AcmeCloud with their identity provider.",
|
| 109 |
+
),
|
| 110 |
+
KnowledgeBaseArticle(
|
| 111 |
+
article_id="KB-INVOICE-DOWNLOAD",
|
| 112 |
+
title="Invoice download instructions",
|
| 113 |
+
tags=("invoice", "billing", "download", "finance"),
|
| 114 |
+
snippet="Billing administrators can download invoices from the Finance tab in workspace settings.",
|
| 115 |
+
content="Steps for locating billing history and downloading invoices from the AcmeCloud admin console.",
|
| 116 |
+
),
|
| 117 |
+
KnowledgeBaseArticle(
|
| 118 |
+
article_id="KB-MFA-RESET",
|
| 119 |
+
title="Multi-factor authentication reset",
|
| 120 |
+
tags=("mfa", "reset", "login", "security"),
|
| 121 |
+
snippet="MFA resets require identity verification or admin override.",
|
| 122 |
+
content="How to reset multi-factor authentication for locked-out users.",
|
| 123 |
+
),
|
| 124 |
+
)
|
| 125 |
+
)
|
| 126 |
+
|
| 127 |
+
|
| 128 |
+
TASK_FIXTURES = OrderedDict(
|
| 129 |
+
(
|
| 130 |
+
task.task_id,
|
| 131 |
+
task,
|
| 132 |
+
)
|
| 133 |
+
for task in (
|
| 134 |
+
TaskFixture(
|
| 135 |
+
task_id="password_reset_guidance",
|
| 136 |
+
title="Password reset guidance",
|
| 137 |
+
difficulty="easy",
|
| 138 |
+
ticket=TicketFixture(
|
| 139 |
+
ticket_id="ticket_pw_001",
|
| 140 |
+
customer_id="cust_pw_001",
|
| 141 |
+
organization_name="Northstar Analytics",
|
| 142 |
+
subject="Reset email never arrived",
|
| 143 |
+
message="Hi, I forgot my password and the reset email isn't arriving. Please help.",
|
| 144 |
+
),
|
| 145 |
+
account=AccountFixture(
|
| 146 |
+
customer_id="cust_pw_001",
|
| 147 |
+
organization_name="Northstar Analytics",
|
| 148 |
+
plan="Pro",
|
| 149 |
+
),
|
| 150 |
+
relevant_kb_article_id="KB-PW-RESET",
|
| 151 |
+
expected_terminal_mode="resolve",
|
| 152 |
+
expected_resolution_code="password_reset_guidance",
|
| 153 |
+
reply_keyword_groups={
|
| 154 |
+
"reset_url": (RESET_URL,),
|
| 155 |
+
"spam_folder": ("spam", "junk"),
|
| 156 |
+
"wait_guidance": ("5 minutes", "five minutes"),
|
| 157 |
+
},
|
| 158 |
+
rubric_weights={
|
| 159 |
+
"searched_kb": 0.20,
|
| 160 |
+
"reply_has_reset_url": 0.30,
|
| 161 |
+
"reply_mentions_spam_folder": 0.20,
|
| 162 |
+
"resolved_correctly": 0.20,
|
| 163 |
+
"efficient_completion": 0.10,
|
| 164 |
+
},
|
| 165 |
+
efficiency_bonus_max_steps=4,
|
| 166 |
+
),
|
| 167 |
+
TaskFixture(
|
| 168 |
+
task_id="duplicate_charge_refund",
|
| 169 |
+
title="Duplicate subscription charge refund",
|
| 170 |
+
difficulty="medium",
|
| 171 |
+
ticket=TicketFixture(
|
| 172 |
+
ticket_id="ticket_bill_002",
|
| 173 |
+
customer_id="cust_bill_002",
|
| 174 |
+
organization_name="BlueOrbit Labs",
|
| 175 |
+
subject="Charged twice this month",
|
| 176 |
+
message=(
|
| 177 |
+
"I was charged twice for my subscription this month. I want both charges refunded immediately. "
|
| 178 |
+
"I've been a customer for 3 years."
|
| 179 |
+
),
|
| 180 |
+
),
|
| 181 |
+
account=AccountFixture(
|
| 182 |
+
customer_id="cust_bill_002",
|
| 183 |
+
organization_name="BlueOrbit Labs",
|
| 184 |
+
plan="Business",
|
| 185 |
+
tenure_years=3.2,
|
| 186 |
+
duplicate_charge_amount_cents=4900,
|
| 187 |
+
duplicate_charge_count=2,
|
| 188 |
+
duplicate_charge_refund_eligible=True,
|
| 189 |
+
),
|
| 190 |
+
relevant_kb_article_id="KB-BILL-DUPLICATE",
|
| 191 |
+
expected_terminal_mode="resolve",
|
| 192 |
+
expected_resolution_code="billing_refund_processed",
|
| 193 |
+
expected_refund_amount_cents=4900,
|
| 194 |
+
refund_reason_code="duplicate_charge",
|
| 195 |
+
reply_keyword_groups={
|
| 196 |
+
"timeline": ("3-5 business days", "3 to 5 business days"),
|
| 197 |
+
"duplicate_ack": ("duplicate charge", "double charge", "charged twice"),
|
| 198 |
+
"regret": ("sorry", "apologize", "regret"),
|
| 199 |
+
"refund_confirmed": ("refund", "refunded", "processed"),
|
| 200 |
+
},
|
| 201 |
+
rubric_weights={
|
| 202 |
+
"lookup_account": 0.15,
|
| 203 |
+
"searched_kb": 0.15,
|
| 204 |
+
"correct_refund": 0.25,
|
| 205 |
+
"reply_mentions_timeline": 0.20,
|
| 206 |
+
"reply_acknowledges_and_apologizes": 0.15,
|
| 207 |
+
"resolved_without_escalation": 0.10,
|
| 208 |
+
},
|
| 209 |
+
),
|
| 210 |
+
TaskFixture(
|
| 211 |
+
task_id="enterprise_data_loss_escalation",
|
| 212 |
+
title="Enterprise data loss legal escalation",
|
| 213 |
+
difficulty="hard",
|
| 214 |
+
ticket=TicketFixture(
|
| 215 |
+
ticket_id="ticket_ent_003",
|
| 216 |
+
customer_id="cust_ent_003",
|
| 217 |
+
organization_name="Granite Peak Holdings",
|
| 218 |
+
subject="Critical enterprise incident",
|
| 219 |
+
message=(
|
| 220 |
+
"Your platform deleted 2 years of our customer data during last night's maintenance. "
|
| 221 |
+
"We are a Fortune 500 client. Our legal team will be in contact unless this is resolved in 2 hours. "
|
| 222 |
+
"I need to speak to your CTO immediately."
|
| 223 |
+
),
|
| 224 |
+
),
|
| 225 |
+
account=AccountFixture(
|
| 226 |
+
customer_id="cust_ent_003",
|
| 227 |
+
organization_name="Granite Peak Holdings",
|
| 228 |
+
plan="Enterprise",
|
| 229 |
+
arr_usd=500000,
|
| 230 |
+
legal_threat=True,
|
| 231 |
+
incident_severity="data_loss",
|
| 232 |
+
mandatory_escalation_queue="legal_data_incident",
|
| 233 |
+
mandatory_escalation_priority="P0",
|
| 234 |
+
),
|
| 235 |
+
relevant_kb_article_id="KB-INCIDENT-LEGAL",
|
| 236 |
+
expected_terminal_mode="escalate",
|
| 237 |
+
expected_escalation_queue="legal_data_incident",
|
| 238 |
+
expected_escalation_priority="P0",
|
| 239 |
+
reply_keyword_groups={
|
| 240 |
+
"urgency": ("urgent", "immediately", "right away", "priority"),
|
| 241 |
+
"escalation": ("escalating", "escalated", "investigated", "investigation"),
|
| 242 |
+
},
|
| 243 |
+
forbidden_reply_phrases=(
|
| 244 |
+
"we deleted your data",
|
| 245 |
+
"this is our fault",
|
| 246 |
+
"we are liable",
|
| 247 |
+
"we caused this",
|
| 248 |
+
"we guarantee recovery",
|
| 249 |
+
),
|
| 250 |
+
rubric_weights={
|
| 251 |
+
"lookup_account": 0.10,
|
| 252 |
+
"no_refund_or_policy_action": 0.20,
|
| 253 |
+
"reply_sent_before_escalation": 0.15,
|
| 254 |
+
"careful_reply": 0.20,
|
| 255 |
+
"correct_escalation": 0.25,
|
| 256 |
+
"not_resolved": 0.10,
|
| 257 |
+
},
|
| 258 |
+
),
|
| 259 |
+
)
|
| 260 |
+
)
|
| 261 |
+
|
| 262 |
+
|
| 263 |
+
def list_task_ids() -> list[str]:
|
| 264 |
+
return list(TASK_FIXTURES.keys())
|
| 265 |
+
|
| 266 |
+
|
| 267 |
+
def get_task_fixture(task_id: str) -> TaskFixture:
|
| 268 |
+
if task_id not in TASK_FIXTURES:
|
| 269 |
+
raise KeyError(f"Unknown task_id: {task_id}")
|
| 270 |
+
return TASK_FIXTURES[task_id]
|
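Each task's `rubric_weights` in `fixtures.py` above appears intended to sum to 1.0, so that a fully earned rubric gives a score of 1.0. A small sanity check along those lines might look like the sketch below. The weights are copied from the three fixtures; the helper name `weights_are_normalized` and the tolerance are assumptions, not part of the package.

```python
import math

# Rubric weights copied from the three task fixtures in fixtures.py.
RUBRIC_WEIGHTS = {
    "password_reset_guidance": {
        "searched_kb": 0.20,
        "reply_has_reset_url": 0.30,
        "reply_mentions_spam_folder": 0.20,
        "resolved_correctly": 0.20,
        "efficient_completion": 0.10,
    },
    "duplicate_charge_refund": {
        "lookup_account": 0.15,
        "searched_kb": 0.15,
        "correct_refund": 0.25,
        "reply_mentions_timeline": 0.20,
        "reply_acknowledges_and_apologizes": 0.15,
        "resolved_without_escalation": 0.10,
    },
    "enterprise_data_loss_escalation": {
        "lookup_account": 0.10,
        "no_refund_or_policy_action": 0.20,
        "reply_sent_before_escalation": 0.15,
        "careful_reply": 0.20,
        "correct_escalation": 0.25,
        "not_resolved": 0.10,
    },
}


def weights_are_normalized(weights: dict[str, float], tol: float = 1e-9) -> bool:
    """Return True when the weights sum to 1.0 within a small tolerance."""
    return math.isclose(sum(weights.values()), 1.0, abs_tol=tol)


all_normalized = all(weights_are_normalized(w) for w in RUBRIC_WEIGHTS.values())
```

A check like this could run in the test suite so that a new fixture with mis-summed weights fails early rather than silently capping the achievable score.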
support_ticket_env/models.py
ADDED
|
@@ -0,0 +1,214 @@
|
| 1 |
+
from __future__ import annotations
|
| 2 |
+
|
| 3 |
+
from typing import Annotated, Any, Literal, TypeAlias
|
| 4 |
+
|
| 5 |
+
from pydantic import BaseModel, ConfigDict, Field, TypeAdapter
|
| 6 |
+
|
| 7 |
+
|
| 8 |
+
class SearchKBAction(BaseModel):
|
| 9 |
+
model_config = ConfigDict(extra="forbid")
|
| 10 |
+
|
| 11 |
+
action_type: Literal["search_kb"]
|
| 12 |
+
query: str = Field(min_length=1)
|
| 13 |
+
|
| 14 |
+
|
| 15 |
+
class LookupAccountAction(BaseModel):
|
| 16 |
+
model_config = ConfigDict(extra="forbid")
|
| 17 |
+
|
| 18 |
+
action_type: Literal["lookup_account"]
|
| 19 |
+
customer_id: str = Field(min_length=1)
|
| 20 |
+
|
| 21 |
+
|
| 22 |
+
class SendReplyAction(BaseModel):
|
| 23 |
+
model_config = ConfigDict(extra="forbid")
|
| 24 |
+
|
| 25 |
+
action_type: Literal["send_reply"]
|
| 26 |
+
message: str = Field(min_length=1)
|
| 27 |
+
|
| 28 |
+
|
| 29 |
+
class IssueRefundAction(BaseModel):
|
| 30 |
+
model_config = ConfigDict(extra="forbid")
|
| 31 |
+
|
| 32 |
+
action_type: Literal["issue_refund"]
|
| 33 |
+
amount_cents: int = Field(gt=0)
|
| 34 |
+
reason_code: Literal["duplicate_charge"]
|
| 35 |
+
|
| 36 |
+
|
| 37 |
+
class ResolveTicketAction(BaseModel):
|
| 38 |
+
model_config = ConfigDict(extra="forbid")
|
| 39 |
+
|
| 40 |
+
action_type: Literal["resolve_ticket"]
|
| 41 |
+
resolution_code: Literal["password_reset_guidance", "billing_refund_processed"]
|
| 42 |
+
|
| 43 |
+
|
| 44 |
+
class EscalateTicketAction(BaseModel):
|
| 45 |
+
model_config = ConfigDict(extra="forbid")
|
| 46 |
+
|
| 47 |
+
action_type: Literal["escalate_ticket"]
|
| 48 |
+
queue: Literal["support_lead", "legal_data_incident"]
|
| 49 |
+
priority: Literal["P2", "P0"]
|
| 50 |
+
summary: str = Field(min_length=1)
|
| 51 |
+
|
| 52 |
+
|
| 53 |
+
SupportTicketAction: TypeAlias = Annotated[
|
| 54 |
+
SearchKBAction
|
| 55 |
+
| LookupAccountAction
|
| 56 |
+
| SendReplyAction
|
| 57 |
+
| IssueRefundAction
|
| 58 |
+
| ResolveTicketAction
|
| 59 |
+
| EscalateTicketAction,
|
| 60 |
+
Field(discriminator="action_type"),
|
| 61 |
+
]
|
| 62 |
+
|
| 63 |
+
ACTION_ADAPTER = TypeAdapter(SupportTicketAction)
|
| 64 |
+
|
| 65 |
+
ACTION_TYPE_NAMES = [
|
| 66 |
+
"search_kb",
|
| 67 |
+
"lookup_account",
|
| 68 |
+
"send_reply",
|
| 69 |
+
"issue_refund",
|
| 70 |
+
"resolve_ticket",
|
| 71 |
+
"escalate_ticket",
|
| 72 |
+
]
|
| 73 |
+
|
| 74 |
+
|
| 75 |
+
def parse_action(value: SupportTicketAction | dict[str, Any]) -> SupportTicketAction:
|
| 76 |
+
return ACTION_ADAPTER.validate_python(value)
|
| 77 |
+
|
| 78 |
+
|
| 79 |
+
class ConversationTurn(BaseModel):
|
| 80 |
+
model_config = ConfigDict(extra="forbid")
|
| 81 |
+
|
| 82 |
+
role: Literal["customer", "agent"]
|
| 83 |
+
message: str
|
| 84 |
+
step_index: int = Field(ge=0)
|
| 85 |
+
|
| 86 |
+
|
| 87 |
+
class KBSearchResult(BaseModel):
|
| 88 |
+
model_config = ConfigDict(extra="forbid")
|
| 89 |
+
|
| 90 |
+
tool_name: Literal["search_kb"]
|
| 91 |
+
success: bool
|
| 92 |
+
query: str
|
| 93 |
+
article_ids: list[str] = Field(default_factory=list)
|
| 94 |
+
snippets: list[str] = Field(default_factory=list)
|
| 95 |
+
message: str | None = None
|
| 96 |
+
|
| 97 |
+
|
| 98 |
+
class AccountLookupResult(BaseModel):
|
| 99 |
+
model_config = ConfigDict(extra="forbid")
|
| 100 |
+
|
| 101 |
+
tool_name: Literal["lookup_account"]
|
| 102 |
+
success: bool
|
| 103 |
+
customer_id: str
|
| 104 |
+
account_summary: dict[str, Any] = Field(default_factory=dict)
|
| 105 |
+
message: str | None = None
|
| 106 |
+
|
| 107 |
+
|
| 108 |
+
class ReplyResult(BaseModel):
|
| 109 |
+
model_config = ConfigDict(extra="forbid")
|
| 110 |
+
|
| 111 |
+
tool_name: Literal["send_reply"]
|
| 112 |
+
success: bool
|
| 113 |
+
message_preview: str
|
| 114 |
+
message: str | None = None
|
| 115 |
+
|
| 116 |
+
|
| 117 |
+
class RefundResult(BaseModel):
|
| 118 |
+
model_config = ConfigDict(extra="forbid")
|
| 119 |
+
|
| 120 |
+
tool_name: Literal["issue_refund"]
|
| 121 |
+
success: bool
|
| 122 |
+
refunded: bool
|
| 123 |
+
amount_cents: int
|
| 124 |
+
reason_code: str
|
| 125 |
+
message: str | None = None
|
| 126 |
+
|
| 127 |
+
|
| 128 |
+
class ResolveResult(BaseModel):
|
| 129 |
+
model_config = ConfigDict(extra="forbid")
|
| 130 |
+
|
| 131 |
+
tool_name: Literal["resolve_ticket"]
|
| 132 |
+
success: bool
|
| 133 |
+
resolution_code: str
|
| 134 |
+
ticket_status: Literal["resolved"]
|
| 135 |
+
message: str | None = None
|
| 136 |
+
|
| 137 |
+
|
| 138 |
+
class EscalationResult(BaseModel):
|
| 139 |
+
model_config = ConfigDict(extra="forbid")
|
| 140 |
+
|
| 141 |
+
tool_name: Literal["escalate_ticket"]
|
| 142 |
+
success: bool
|
| 143 |
+
queue: str
|
| 144 |
+
priority: str
|
| 145 |
+
summary: str
|
| 146 |
+
ticket_status: Literal["escalated"]
|
| 147 |
+
message: str | None = None
|
| 148 |
+
|
| 149 |
+
|
| 150 |
+
class ErrorToolResult(BaseModel):
|
| 151 |
+
model_config = ConfigDict(extra="forbid")
|
| 152 |
+
|
| 153 |
+
tool_name: Literal["error"]
|
| 154 |
+
success: Literal[False]
|
| 155 |
+
error_code: str
|
| 156 |
+
message: str
|
| 157 |
+
|
| 158 |
+
|
| 159 |
+
ToolResult: TypeAlias = Annotated[
|
| 160 |
+
KBSearchResult
|
| 161 |
+
| AccountLookupResult
|
| 162 |
+
| ReplyResult
|
| 163 |
+
| RefundResult
|
| 164 |
+
| ResolveResult
|
| 165 |
+
| EscalationResult
|
| 166 |
+
| ErrorToolResult,
|
| 167 |
+
Field(discriminator="tool_name"),
|
| 168 |
+
]
|
| 169 |
+
|
| 170 |
+
|
| 171 |
+
class ScoreCriterion(BaseModel):
|
| 172 |
+
model_config = ConfigDict(extra="forbid")
|
| 173 |
+
|
| 174 |
+
criterion_id: str
|
| 175 |
+
label: str
|
| 176 |
+
weight: float = Field(ge=0.0, le=1.0)
|
| 177 |
+
earned: bool
|
| 178 |
+
contribution: float = Field(ge=0.0, le=1.0)
|
| 179 |
+
|
| 180 |
+
|
| 181 |
+
class TaskScorecard(BaseModel):
|
| 182 |
+
model_config = ConfigDict(extra="forbid")
|
| 183 |
+
|
| 184 |
+
task_id: str
|
| 185 |
+
score: float = Field(ge=0.0, le=1.0)
|
| 186 |
+
criteria: list[ScoreCriterion]
|
| 187 |
+
|
| 188 |
+
|
| 189 |
+
class SupportTicketObservation(BaseModel):
|
| 190 |
+
model_config = ConfigDict(extra="forbid")
|
| 191 |
+
|
| 192 |
+
task_id: str
|
| 193 |
+
ticket_id: str
|
| 194 |
+
ticket_status: Literal["open", "resolved", "escalated"]
|
| 195 |
+
customer_id: str
|
| 196 |
+
organization_name: str
|
| 197 |
+
subject: str
|
| 198 |
+
customer_message: str
|
| 199 |
+
conversation_history: list[ConversationTurn]
|
| 200 |
+
last_tool_result: ToolResult | None = None
|
| 201 |
+
steps_taken: int = Field(ge=0)
|
| 202 |
+
steps_remaining: int = Field(ge=0)
|
| 203 |
+
available_action_types: list[str]
|
| 204 |
+
last_action_error: str | None = None
|
| 205 |
+
known_facts: dict[str, Any] = Field(default_factory=dict)
|
| 206 |
+
|
| 207 |
+
|
| 208 |
+
class SupportTicketStepResult(BaseModel):
|
| 209 |
+
model_config = ConfigDict(extra="forbid")
|
| 210 |
+
|
| 211 |
+
observation: SupportTicketObservation
|
| 212 |
+
reward: float
|
| 213 |
+
done: bool
|
| 214 |
+
info: dict[str, Any] = Field(default_factory=dict)
|
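The action models in `models.py` above form a union discriminated on `action_type`, and each model sets `extra="forbid"`, so a payload must name a known action type and carry no unexpected keys. The sketch below illustrates that shape check using only the standard library. It is an analogue for illustration, not how pydantic validates: it checks key names only, not value types or the `Literal` constraints, and `check_action_shape` is a hypothetical helper.

```python
# Field names per action type, copied from the models in models.py.
ACTION_FIELDS = {
    "search_kb": {"query"},
    "lookup_account": {"customer_id"},
    "send_reply": {"message"},
    "issue_refund": {"amount_cents", "reason_code"},
    "resolve_ticket": {"resolution_code"},
    "escalate_ticket": {"queue", "priority", "summary"},
}


def check_action_shape(payload: dict) -> str:
    """Return the action type if the payload has exactly the expected keys."""
    action_type = payload.get("action_type")
    if action_type not in ACTION_FIELDS:
        raise ValueError(f"Unknown action_type: {action_type!r}")
    expected = ACTION_FIELDS[action_type] | {"action_type"}
    actual = set(payload)
    if actual != expected:
        missing = sorted(expected - actual)
        extra = sorted(actual - expected)
        raise ValueError(f"missing={missing} extra={extra}")
    return action_type


# A well-formed refund action passes the shape check.
ok = check_action_shape(
    {"action_type": "issue_refund", "amount_cents": 4900, "reason_code": "duplicate_charge"}
)

# An unexpected key is rejected, mirroring the effect of extra="forbid".
try:
    check_action_shape({"action_type": "send_reply", "message": "hi", "cc": "x"})
    rejected = False
except ValueError:
    rejected = True
```

In the package itself, `parse_action` delegates this work to the pydantic `TypeAdapter`, which also enforces the value constraints that this sketch leaves out.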
support_ticket_env/policies.py
ADDED
|
@@ -0,0 +1,69 @@
|
| 1 |
+
from __future__ import annotations
|
| 2 |
+
|
| 3 |
+
from .fixtures import RESET_URL
|
| 4 |
+
from .models import (
|
| 5 |
+
EscalateTicketAction,
|
| 6 |
+
IssueRefundAction,
|
| 7 |
+
LookupAccountAction,
|
| 8 |
+
ResolveTicketAction,
|
| 9 |
+
SearchKBAction,
|
| 10 |
+
SupportTicketAction,
|
| 11 |
+
SupportTicketObservation,
|
| 12 |
+
SendReplyAction,
|
| 13 |
+
)
|
| 14 |
+
|
| 15 |
+
|
| 16 |
+
def scripted_policy(task_id: str, step_index: int, customer_id: str) -> SupportTicketAction:
|
| 17 |
+
if task_id == "password_reset_guidance":
|
| 18 |
+
if step_index == 1:
|
| 19 |
+
return SearchKBAction(action_type="search_kb", query="password reset email not arriving")
|
| 20 |
+
if step_index == 2:
|
| 21 |
+
return SendReplyAction(
|
| 22 |
+
action_type="send_reply",
|
| 23 |
+
message=(
|
| 24 |
+
f"Please use {RESET_URL}. Check your spam or junk folder, then wait 5 minutes before trying again."
|
| 25 |
+
),
|
| 26 |
+
)
|
| 27 |
+
return ResolveTicketAction(action_type="resolve_ticket", resolution_code="password_reset_guidance")
|
| 28 |
+
|
| 29 |
+
if task_id == "duplicate_charge_refund":
|
| 30 |
+
if step_index == 1:
|
| 31 |
+
return LookupAccountAction(action_type="lookup_account", customer_id=customer_id)
|
| 32 |
+
if step_index == 2:
|
| 33 |
+
return SearchKBAction(action_type="search_kb", query="duplicate charge refund policy")
|
| 34 |
+
if step_index == 3:
|
| 35 |
+
return IssueRefundAction(action_type="issue_refund", amount_cents=4900, reason_code="duplicate_charge")
|
| 36 |
+
if step_index == 4:
|
| 37 |
+
return SendReplyAction(
|
| 38 |
+
action_type="send_reply",
|
| 39 |
+
message=(
|
| 40 |
+
"I'm sorry about the duplicate charge. I've processed the refund for the extra subscription charge, "
|
| 41 |
+
"and it should appear in 3-5 business days."
|
| 42 |
+
),
|
| 43 |
+
)
|
| 44 |
+
return ResolveTicketAction(action_type="resolve_ticket", resolution_code="billing_refund_processed")
|
| 45 |
+
|
| 46 |
+
if task_id == "enterprise_data_loss_escalation":
|
| 47 |
+
if step_index == 1:
|
| 48 |
+
return LookupAccountAction(action_type="lookup_account", customer_id=customer_id)
|
| 49 |
+
if step_index == 2:
|
| 50 |
+
return SendReplyAction(
|
| 51 |
+
action_type="send_reply",
|
| 52 |
+
message=(
|
| 53 |
+
"I understand this is urgent. I am escalating this to our legal and incident response team right now, "
|
| 54 |
+
"and the case is being actively investigated."
|
| 55 |
+
),
|
| 56 |
+
)
|
| 57 |
+
return EscalateTicketAction(
|
| 58 |
+
action_type="escalate_ticket",
|
| 59 |
+
queue="legal_data_incident",
|
| 60 |
+
priority="P0",
|
| 61 |
+
summary="Enterprise customer reporting possible data loss and a legal threat after maintenance.",
|
| 62 |
+
)
|
| 63 |
+
|
| 64 |
+
raise ValueError(f"Unsupported task_id: {task_id}")
|
| 65 |
+
|
| 66 |
+
|
| 67 |
+
def fallback_action(observation: SupportTicketObservation) -> SupportTicketAction:
|
| 68 |
+
next_step = observation.steps_taken + 1
|
| 69 |
+
return scripted_policy(observation.task_id, next_step, observation.customer_id)
|
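The `scripted_policy` in `policies.py` above amounts to a fixed action sequence per task, where any step index outside the scripted range returns the terminal action. The table-driven sketch below restates those sequences by action type only. The names `SCRIPTED_STEPS` and `scripted_action_type` are assumptions for illustration and do not exist in the package.

```python
# Action-type sequences read off scripted_policy; the last entry is the terminal action.
SCRIPTED_STEPS = {
    "password_reset_guidance": ["search_kb", "send_reply", "resolve_ticket"],
    "duplicate_charge_refund": [
        "lookup_account",
        "search_kb",
        "issue_refund",
        "send_reply",
        "resolve_ticket",
    ],
    "enterprise_data_loss_escalation": ["lookup_account", "send_reply", "escalate_ticket"],
}


def scripted_action_type(task_id: str, step_index: int) -> str:
    """Return the scripted action type for a 1-based step index."""
    steps = SCRIPTED_STEPS.get(task_id)
    if steps is None:
        raise ValueError(f"Unsupported task_id: {task_id}")
    if 1 <= step_index <= len(steps):
        return steps[step_index - 1]
    # Indexes outside the script fall through to the terminal action,
    # matching the final return in each branch of scripted_policy.
    return steps[-1]


third_refund_step = scripted_action_type("duplicate_charge_refund", 3)
late_password_step = scripted_action_type("password_reset_guidance", 9)
```

Since `fallback_action` passes `steps_taken + 1`, the index is at least 1 in practice, and a policy that has already run past its script keeps proposing the terminal action.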
support_ticket_env/scoring.py
ADDED
|
@@ -0,0 +1,218 @@
|
| 1 |
+
from __future__ import annotations
|
| 2 |
+
|
| 3 |
+
import re
|
| 4 |
+
from typing import Any
|
| 5 |
+
|
| 6 |
+
from .fixtures import TaskFixture
|
| 7 |
+
from .models import ScoreCriterion, TaskScorecard
|
| 8 |
+
|
| 9 |
+
|
| 10 |
+
_SPACE_RE = re.compile(r"\s+")
|
| 11 |
+
|
| 12 |
+
|
| 13 |
+
def normalize_text(text: str) -> str:
|
| 14 |
+
lowered = text.lower().replace("-", " ")
|
| 15 |
+
normalized = _SPACE_RE.sub(" ", lowered)
|
| 16 |
+
return normalized.strip()
|
| 17 |
+
|
| 18 |
+
|
| 19 |
+
def contains_any(text: str, phrases: tuple[str, ...] | list[str]) -> bool:
|
| 20 |
+
normalized = normalize_text(text)
|
| 21 |
+
return any(normalize_text(phrase) in normalized for phrase in phrases)
|
| 22 |
+
|
| 23 |
+
|
| 24 |
+
def contains_all_groups(text: str, groups: list[tuple[str, ...]]) -> bool:
|
| 25 |
+
return all(contains_any(text, group) for group in groups)
|
| 26 |
+
|
| 27 |
+
|
| 28 |
+
def _criterion(criterion_id: str, label: str, weight: float, earned: bool) -> ScoreCriterion:
|
| 29 |
+
contribution = round(weight if earned else 0.0, 6)
|
| 30 |
+
return ScoreCriterion(
|
| 31 |
+
criterion_id=criterion_id,
|
| 32 |
+
label=label,
|
| 33 |
+
weight=weight,
|
| 34 |
+
earned=earned,
|
| 35 |
+
contribution=contribution,
|
| 36 |
+
)
|
| 37 |
+
|
| 38 |
+
|
| 39 |
+
def _reply_messages(state: Any) -> list[str]:
|
| 40 |
+
return [entry["message"] for entry in state.reply_history]
|
| 41 |
+
|
| 42 |
+
|
| 43 |
+
def _has_reply_matching(state: Any, matcher) -> bool:
|
| 44 |
+
return any(matcher(message) for message in _reply_messages(state))
|
| 45 |
+
|
| 46 |
+
|
| 47 |
+
def build_scorecard(fixture: TaskFixture, state: Any) -> TaskScorecard:
|
| 48 |
+
if fixture.task_id == "password_reset_guidance":
|
| 49 |
+
criteria = _score_password_reset(fixture, state)
|
| 50 |
+
elif fixture.task_id == "duplicate_charge_refund":
|
| 51 |
+
criteria = _score_duplicate_charge(fixture, state)
|
| 52 |
+
elif fixture.task_id == "enterprise_data_loss_escalation":
|
| 53 |
+
criteria = _score_enterprise_escalation(fixture, state)
|
| 54 |
+
else:
|
| 55 |
+
raise ValueError(f"Unsupported task_id: {fixture.task_id}")
|
| 56 |
+
|
| 57 |
+
total_score = round(sum(item.contribution for item in criteria), 6)
|
| 58 |
+
return TaskScorecard(task_id=fixture.task_id, score=min(max(total_score, 0.0), 1.0), criteria=criteria)
|
| 59 |
+
|
| 60 |
+
|
| 61 |
+
def _score_password_reset(fixture: TaskFixture, state: Any) -> list[ScoreCriterion]:
|
| 62 |
+
weights = fixture.rubric_weights
|
| 63 |
+
replies = _reply_messages(state)
|
| 64 |
+
return [
|
| 65 |
+
_criterion(
|
| 66 |
+
"searched_kb",
|
| 67 |
+
"Relevant KB article retrieved",
|
| 68 |
+
weights["searched_kb"],
|
| 69 |
+
fixture.relevant_kb_article_id in state.kb_articles_seen,
|
| 70 |
+
),
|
| 71 |
+
_criterion(
|
| 72 |
+
"reply_has_reset_url",
|
| 73 |
+
"Reply includes the password reset URL",
|
| 74 |
+
weights["reply_has_reset_url"],
|
| 75 |
+
any(fixture.reply_keyword_groups["reset_url"][0] in reply for reply in replies),
|
| 76 |
+
),
|
| 77 |
+
_criterion(
|
| 78 |
+
"reply_mentions_spam_folder",
|
| 79 |
+
"Reply mentions checking spam or junk",
|
| 80 |
+
weights["reply_mentions_spam_folder"],
|
| 81 |
+
_has_reply_matching(state, lambda text: contains_any(text, fixture.reply_keyword_groups["spam_folder"])),
|
| 82 |
+
),
|
| 83 |
+
_criterion(
|
| 84 |
+
"resolved_correctly",
|
| 85 |
+
"Ticket resolved with the correct resolution code",
|
| 86 |
+
weights["resolved_correctly"],
|
| 87 |
+
state.ticket_status == "resolved" and state.resolution_code == fixture.expected_resolution_code,
|
| 88 |
+
),
|
| 89 |
+
_criterion(
|
| 90 |
+
"efficient_completion",
|
| 91 |
+
"Episode completed efficiently",
|
| 92 |
+
weights["efficient_completion"],
|
| 93 |
+
state.ticket_status == "resolved"
|
| 94 |
+
and state.resolution_code == fixture.expected_resolution_code
|
| 95 |
+
and state.steps_taken <= (fixture.efficiency_bonus_max_steps or 0),
|
| 96 |
+
),
|
| 97 |
+
]
|
| 98 |
+
|
| 99 |
+
|
+def _score_duplicate_charge(fixture: TaskFixture, state: Any) -> list[ScoreCriterion]:
+    weights = fixture.rubric_weights
+
+    def acknowledges_and_apologizes(reply: str) -> bool:
+        return contains_all_groups(
+            reply,
+            [
+                fixture.reply_keyword_groups["duplicate_ack"],
+                fixture.reply_keyword_groups["regret"],
+                fixture.reply_keyword_groups["refund_confirmed"],
+            ],
+        )
+
+    return [
+        _criterion(
+            "lookup_account",
+            "Account lookup completed",
+            weights["lookup_account"],
+            state.lookup_performed,
+        ),
+        _criterion(
+            "searched_kb",
+            "Duplicate charge policy article retrieved",
+            weights["searched_kb"],
+            fixture.relevant_kb_article_id in state.kb_articles_seen,
+        ),
+        _criterion(
+            "correct_refund",
+            "Correct full duplicate-charge refund issued",
+            weights["correct_refund"],
+            state.refund_record is not None
+            and state.refund_record["amount_cents"] == fixture.expected_refund_amount_cents
+            and state.refund_record["reason_code"] == fixture.refund_reason_code,
+        ),
+        _criterion(
+            "reply_mentions_timeline",
+            "Reply mentions the refund timeline",
+            weights["reply_mentions_timeline"],
+            _has_reply_matching(state, lambda text: contains_any(text, fixture.reply_keyword_groups["timeline"])),
+        ),
+        _criterion(
+            "reply_acknowledges_and_apologizes",
+            "Reply acknowledges the duplicate charge, apologizes, and confirms the refund",
+            weights["reply_acknowledges_and_apologizes"],
+            _has_reply_matching(state, acknowledges_and_apologizes),
+        ),
+        _criterion(
+            "resolved_without_escalation",
+            "Ticket resolved instead of escalated",
+            weights["resolved_without_escalation"],
+            state.ticket_status == "resolved" and state.resolution_code == fixture.expected_resolution_code,
+        ),
+    ]
+
+
+def _score_enterprise_escalation(fixture: TaskFixture, state: Any) -> list[ScoreCriterion]:
+    weights = fixture.rubric_weights
+    escalation_step = state.escalation["step_index"] if state.escalation else None
+
+    def careful_reply(reply: str) -> bool:
+        return (
+            contains_all_groups(
+                reply,
+                [
+                    fixture.reply_keyword_groups["urgency"],
+                    fixture.reply_keyword_groups["escalation"],
+                ],
+            )
+            and not contains_any(reply, fixture.forbidden_reply_phrases)
+        )
+
+    reply_before_escalation = any(
+        escalation_step is None or reply["step_index"] < escalation_step for reply in state.reply_history
+    )
+
+    return [
+        _criterion(
+            "lookup_account",
+            "Account lookup completed",
+            weights["lookup_account"],
+            state.lookup_performed,
+        ),
+        _criterion(
+            "no_refund_or_policy_action",
+            "No refund or resolution policy action was applied",
+            weights["no_refund_or_policy_action"],
+            state.done and not state.refund_attempted and state.resolution_code is None,
+        ),
+        _criterion(
+            "reply_sent_before_escalation",
+            "A reply was sent before escalation",
+            weights["reply_sent_before_escalation"],
+            reply_before_escalation and bool(state.reply_history),
+        ),
+        _criterion(
+            "careful_reply",
+            "Reply acknowledges urgency, mentions escalation, and avoids liability",
+            weights["careful_reply"],
+            any(
+                (escalation_step is None or reply["step_index"] < escalation_step)
+                and careful_reply(reply["message"])
+                for reply in state.reply_history
+            ),
+        ),
+        _criterion(
+            "correct_escalation",
+            "Escalation uses the correct queue and priority",
+            weights["correct_escalation"],
+            state.escalation is not None
+            and state.escalation["queue"] == fixture.expected_escalation_queue
+            and state.escalation["priority"] == fixture.expected_escalation_priority,
+        ),
+        _criterion(
+            "not_resolved",
+            "Ticket was not resolved",
+            weights["not_resolved"],
+            state.done and state.resolution_code is None,
+        ),
+    ]
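The scorers in this hunk call `contains_any` and `contains_all_groups`, whose definitions fall outside the lines shown here. The following is a minimal sketch of how helpers of that shape could behave, assuming case-insensitive substring matching. The matching rule, the keyword lists, and the `is_careful` wrapper are all assumptions made for illustration and are not taken from the commit.

```python
from typing import Iterable, Sequence


def contains_any(text: str, phrases: Iterable[str]) -> bool:
    """Return True if at least one phrase occurs in text (case-insensitive)."""
    lowered = text.lower()
    return any(phrase.lower() in lowered for phrase in phrases)


def contains_all_groups(text: str, groups: Sequence[Iterable[str]]) -> bool:
    """Return True if every group contributes at least one matching phrase."""
    return all(contains_any(text, group) for group in groups)


# Hypothetical keyword groups, invented for this illustration only.
URGENCY = ["urgent", "priority"]
ESCALATION = ["escalat"]  # matches "escalate", "escalating", "escalation"
FORBIDDEN = ["our fault", "we caused"]


def is_careful(reply: str) -> bool:
    """Mirror the shape of careful_reply: all required groups, no forbidden phrase."""
    return contains_all_groups(reply, [URGENCY, ESCALATION]) and not contains_any(reply, FORBIDDEN)


safe_reply = "This is urgent, so I am escalating it to a specialist team."
risky_reply = "This is urgent and we are escalating it, but this is our fault."

print(is_careful(safe_reply))   # all groups matched, nothing forbidden
print(is_careful(risky_reply))  # contains a forbidden phrase
```

Grouping keywords this way lets a rubric require one phrase from each topic without fixing the exact wording of the reply.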
tests/test_env.py
ADDED
|
@@ -0,0 +1,58 @@
+import pytest
+
+from support_ticket_env import SupportTicketEnvironment
+from support_ticket_env.models import LookupAccountAction, SearchKBAction, SendReplyAction
+
+
+def test_search_returns_expected_article_and_progress_reward() -> None:
+    env = SupportTicketEnvironment()
+    env.reset("password_reset_guidance")
+    result = env.step(SearchKBAction(action_type="search_kb", query="password reset email not arriving"))
+    assert result.observation.last_tool_result.tool_name == "search_kb"
+    assert result.observation.last_tool_result.article_ids[0] == "KB-PW-RESET"
+    assert result.reward == pytest.approx(0.19)
+
+
+def test_lookup_populates_account_facts() -> None:
+    env = SupportTicketEnvironment()
+    env.reset("duplicate_charge_refund")
+    result = env.step(LookupAccountAction(action_type="lookup_account", customer_id="cust_bill_002"))
+    account = result.observation.known_facts["account"]
+    assert account["plan"] == "Business"
+    assert account["duplicate_charge_amount_cents"] == 4900
+
+
+def test_redundant_search_is_penalized() -> None:
+    env = SupportTicketEnvironment()
+    env.reset("password_reset_guidance")
+    env.step(SearchKBAction(action_type="search_kb", query="password reset email not arriving"))
+    result = env.step(SearchKBAction(action_type="search_kb", query="password reset email not arriving"))
+    assert result.reward == pytest.approx(-0.03)
+    assert result.info["redundancy_penalty"] == pytest.approx(0.02)
+
+
+def test_refund_before_lookup_is_invalid() -> None:
+    env = SupportTicketEnvironment()
+    env.reset("duplicate_charge_refund")
+    result = env.step({"action_type": "issue_refund", "amount_cents": 4900, "reason_code": "duplicate_charge"})
+    assert result.observation.last_action_error == "lookup_required_before_refund"
+    assert result.reward == pytest.approx(-0.11)
+
+
+def test_reset_clears_previous_state() -> None:
+    env = SupportTicketEnvironment()
+    env.reset("password_reset_guidance")
+    env.step(SearchKBAction(action_type="search_kb", query="password reset email not arriving"))
+    result = env.reset("password_reset_guidance")
+    assert result.observation.steps_taken == 0
+    assert result.observation.known_facts == {}
+    assert len(result.observation.conversation_history) == 1
+
+
+def test_max_steps_timeout_is_deterministic() -> None:
+    env = SupportTicketEnvironment()
+    result = env.reset("password_reset_guidance")
+    for _ in range(8):
+        result = env.step(SendReplyAction(action_type="send_reply", message="Still investigating."))
+    assert result.done is True
+    assert result.info["terminal_reason"] == "max_steps_exceeded"
tests/test_models.py
ADDED
|
@@ -0,0 +1,20 @@
+import pytest
+
+from support_ticket_env import parse_action
+
+
+def test_parse_action_accepts_valid_discriminated_union() -> None:
+    action = parse_action({"action_type": "search_kb", "query": "password reset"})
+    assert action.action_type == "search_kb"
+    assert action.query == "password reset"
+
+
+def test_parse_action_rejects_invalid_refund_reason_code() -> None:
+    with pytest.raises(Exception):
+        parse_action(
+            {
+                "action_type": "issue_refund",
+                "amount_cents": 4900,
+                "reason_code": "manual_override",
+            }
+        )
tests/test_scenarios.py
ADDED
|
@@ -0,0 +1,74 @@
+from support_ticket_env import SupportTicketEnvironment, list_task_ids, scripted_policy
+from support_ticket_env.models import EscalateTicketAction, IssueRefundAction, LookupAccountAction, ResolveTicketAction, SendReplyAction
+
+
+def run_actions(task_id: str, actions: list[object]):
+    env = SupportTicketEnvironment()
+    result = env.reset(task_id)
+    for action in actions:
+        result = env.step(action)
+        if result.done:
+            break
+    return result
+
+
+def test_gold_policies_score_perfectly() -> None:
+    env = SupportTicketEnvironment()
+    for task_id in list_task_ids():
+        result = env.reset(task_id)
+        while not result.done:
+            action = scripted_policy(task_id, result.observation.steps_taken + 1, result.observation.customer_id)
+            result = env.step(action)
+        assert result.info["score"] == 1.0
+
+
+def test_premature_resolution_scores_poorly() -> None:
+    result = run_actions(
+        "duplicate_charge_refund",
+        [ResolveTicketAction(action_type="resolve_ticket", resolution_code="billing_refund_processed")],
+    )
+    assert result.done is True
+    assert result.info["score"] < 0.5
+
+
+def test_task3_refund_attempt_hurts_final_score() -> None:
+    result = run_actions(
+        "enterprise_data_loss_escalation",
+        [
+            LookupAccountAction(action_type="lookup_account", customer_id="cust_ent_003"),
+            IssueRefundAction(action_type="issue_refund", amount_cents=4900, reason_code="duplicate_charge"),
+            SendReplyAction(
+                action_type="send_reply",
+                message="This is urgent. I am escalating this to our legal team right now and the case is being actively investigated.",
+            ),
+            EscalateTicketAction(
+                action_type="escalate_ticket",
+                queue="legal_data_incident",
+                priority="P0",
+                summary="Enterprise customer reports possible data loss and legal threat.",
+            ),
+        ],
+    )
+    assert result.info["score"] == 0.8
+
+
+def test_task3_liability_admission_fails_careful_reply() -> None:
+    result = run_actions(
+        "enterprise_data_loss_escalation",
+        [
+            LookupAccountAction(action_type="lookup_account", customer_id="cust_ent_003"),
+            SendReplyAction(
+                action_type="send_reply",
+                message="This is urgent and we are escalating it, but this is our fault and we caused this.",
+            ),
+            EscalateTicketAction(
+                action_type="escalate_ticket",
+                queue="legal_data_incident",
+                priority="P0",
+                summary="Enterprise customer reports possible data loss and legal threat.",
+            ),
+        ],
+    )
+    criteria = {item["criterion_id"]: item for item in result.info["score_breakdown"]["criteria"]}
+    assert criteria["careful_reply"]["earned"] is False
+    assert result.info["score"] < 1.0
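The scenario tests above compare `result.info["score"]` against values such as `1.0` and `0.8`, which suggests the score is a sum of the weights of earned rubric criteria. A rough sketch of that aggregation follows, under stated assumptions: the `Criterion` dataclass, the `total_score` function, and every weight below are hypothetical, since the real weights live in the fixtures and are not visible in this diff. The weight of 0.2 for `no_refund_or_policy_action` is only an inference from the expected score of 0.8 in the refund-attempt test.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """Hypothetical stand-in for a scored rubric item."""
    criterion_id: str
    weight: float
    earned: bool


def total_score(criteria: list[Criterion]) -> float:
    """Sum the weights of earned criteria, rounded to limit float drift."""
    return round(sum(c.weight for c in criteria if c.earned), 4)


# Assumed weights for the enterprise escalation rubric (illustrative only).
criteria = [
    Criterion("lookup_account", 0.1, True),
    Criterion("no_refund_or_policy_action", 0.2, False),  # refund was attempted
    Criterion("reply_sent_before_escalation", 0.1, True),
    Criterion("careful_reply", 0.2, True),
    Criterion("correct_escalation", 0.3, True),
    Criterion("not_resolved", 0.1, True),
]

print(total_score(criteria))
```

Rounding the sum is one way to make exact comparisons like `== 0.8` stable; comparing with a tolerance is another option if the real scorer does not round.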
uv.lock
ADDED
|
The diff for this file is too large to render.