Dar3devil/customer-support-openenv
@@ -1,11 +1,176 @@
----
-title: Customer Support Openenv
-emoji: 👁
-colorFrom: blue
-colorTo: blue
-sdk: docker
-pinned: false
-short_description: 'Meta Env Hackathon Submission Space '
----
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

+# AcmeCloud Customer Support Ticket Handler
+A deterministic OpenEnv-style environment for training and evaluating agents on realistic B2B SaaS support workflows.
+## What It Simulates
+Each episode is one inbound customer-support ticket at a fictional company, `AcmeCloud`.
+The agent acts like a support representative and must choose the right sequence of typed tool actions to handle the ticket correctly.
+The benchmark ships with three fixed tasks:
+1. `password_reset_guidance`
+2. `duplicate_charge_refund`
+3. `enterprise_data_loss_escalation`
+## Why This Is Useful
+This environment models a real operational task rather than a toy game:
+- reading support tickets
+- searching internal knowledge base articles
+- looking up customer account details
+- deciding whether to resolve, refund, or escalate
+- sending customer-facing replies under policy constraints
+The environment is fully deterministic and graded without any LLM judge, which makes it suitable for reproducible RL rollouts and benchmark evaluation.
+## Action Space
+The agent can take exactly six typed actions:
+- `search_kb(query: str)`
+- `lookup_account(customer_id: str)`
+- `send_reply(message: str)`
+- `issue_refund(amount_cents: int, reason_code: "duplicate_charge")`
+- `resolve_ticket(resolution_code: "password_reset_guidance" | "billing_refund_processed")`
+- `escalate_ticket(queue: "support_lead" | "legal_data_incident", priority: "P2" | "P0", summary: str)`
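The six signatures above can be checked before an action is sent to the environment. The sketch below is a standalone illustration that uses only the standard library; the `ACTION_SCHEMA` table and the `validate_action` helper are hypothetical names invented for this example and are not the repository's own `parse_action`. The `action_type` key follows the JSON shape shown in the baseline script's system prompt.

```python
from typing import Any

# Allowed fields and literal values, transcribed from the action list above.
# The table layout itself is an assumption made for this sketch.
ACTION_SCHEMA: dict[str, dict[str, Any]] = {
    "search_kb": {"query": str},
    "lookup_account": {"customer_id": str},
    "send_reply": {"message": str},
    "issue_refund": {"amount_cents": int, "reason_code": {"duplicate_charge"}},
    "resolve_ticket": {
        "resolution_code": {"password_reset_guidance", "billing_refund_processed"}
    },
    "escalate_ticket": {
        "queue": {"support_lead", "legal_data_incident"},
        "priority": {"P2", "P0"},
        "summary": str,
    },
}


def validate_action(action: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the action looks valid."""
    action_type = action.get("action_type")
    if action_type not in ACTION_SCHEMA:
        return [f"unknown action_type: {action_type!r}"]
    schema = ACTION_SCHEMA[action_type]
    problems: list[str] = []
    for name, rule in schema.items():
        if name not in action:
            problems.append(f"missing field: {name}")
            continue
        value = action[name]
        if isinstance(rule, set):
            # Literal-valued field: must be one of the listed strings.
            if not isinstance(value, str) or value not in rule:
                problems.append(f"{name} must be one of {sorted(rule)}")
        elif not isinstance(value, rule) or isinstance(value, bool):
            # bool is a subclass of int in Python, so reject it explicitly.
            problems.append(f"{name} must be {rule.__name__}")
    extra = set(action) - set(schema) - {"action_type"}
    problems.extend(f"unexpected field: {name}" for name in sorted(extra))
    return problems
```

Returning a list of problems rather than raising makes it easy to log every issue with a malformed model response in one pass.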
+## Observation Space
+Each observation includes:
+- task and ticket identifiers
+- current ticket status
+- customer metadata
+- customer message and full conversation history
+- the last tool result
+- steps taken / remaining
+- available action types
+- last action error
+- accumulated known facts learned from prior tool calls
+## Reward Design
+The environment uses rubric-based reward shaping.
+- Each task has a deterministic scorecard in `[0.0, 1.0]`
+- Step reward is `score_delta - 0.01 - invalid_penalty - redundancy_penalty`, where `0.01` is a fixed per-step cost
+- Repeated search/lookup actions incur a redundancy penalty of `0.02`
+- Invalid actions incur an invalid-action penalty of `0.10`
+- `resolve_ticket` and `escalate_ticket` terminate the episode
+- `issue_refund` changes state but does not terminate the episode
+Global success threshold: an episode counts as a success when its final score is at least `0.75`.
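The reward arithmetic above can be written out as a small function. This is a sketch rather than the scorer itself: the constants are taken from the bullets in this section, while the `step_reward` and `is_success` names are hypothetical, and `score_delta` is assumed to mean the new scorecard value minus the previous one.

```python
STEP_COST = 0.01                # fixed cost charged on every step
INVALID_ACTION_PENALTY = 0.10   # charged when the action is invalid
REPEATED_ACTION_PENALTY = 0.02  # charged for a repeated search/lookup
SUCCESS_THRESHOLD = 0.75


def step_reward(
    previous_score: float,
    new_score: float,
    *,
    invalid: bool = False,
    repeated: bool = False,
) -> float:
    """score_delta - step cost - invalid penalty - redundancy penalty."""
    score_delta = new_score - previous_score
    invalid_penalty = INVALID_ACTION_PENALTY if invalid else 0.0
    redundancy_penalty = REPEATED_ACTION_PENALTY if repeated else 0.0
    return score_delta - STEP_COST - invalid_penalty - redundancy_penalty


def is_success(final_score: float) -> bool:
    """An episode succeeds when the final score reaches the threshold."""
    return final_score >= SUCCESS_THRESHOLD
```

Under these assumptions, a step that raises the score from `0.0` to `0.25` earns about `0.24`, while a repeated lookup that changes nothing earns about `-0.03`.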
+## Task Details
+### 1. Password Reset Guidance
+Customer issue: reset email did not arrive.
+Expected flow:
+- search password reset KB article
+- send reply with reset URL and spam/junk guidance
+- resolve with `password_reset_guidance`
+### 2. Duplicate Charge Refund
+Customer issue: billed twice for the current subscription period.
+Expected flow:
+- lookup the account
+- search the refund policy
+- issue the verified duplicate-charge refund
+- reply with apology and timeline
+- resolve with `billing_refund_processed`
+### 3. Enterprise Data Loss Escalation
+Customer issue: enterprise data-loss complaint with legal threat.
+Expected flow:
+- lookup the account
+- send a careful acknowledgment reply
+- escalate to `legal_data_incident` with `P0`
+- do not refund
+- do not resolve
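The three expected flows can be summarised as sequences of action types. The sketch below records only the action names listed above, without arguments, and checks two properties stated elsewhere in this repository: a terminal action (`resolve_ticket` or `escalate_ticket`) ends the episode, and `env.py` sets `max_steps = 8`. The `EXPECTED_FLOWS` table and the `flow_is_well_formed` helper are hypothetical names for illustration only.

```python
# Action-type sequences transcribed from the "Expected flow" lists above.
EXPECTED_FLOWS: dict[str, list[str]] = {
    "password_reset_guidance": ["search_kb", "send_reply", "resolve_ticket"],
    "duplicate_charge_refund": [
        "lookup_account",
        "search_kb",
        "issue_refund",
        "send_reply",
        "resolve_ticket",
    ],
    "enterprise_data_loss_escalation": [
        "lookup_account",
        "send_reply",
        "escalate_ticket",
    ],
}

TERMINAL_ACTIONS = {"resolve_ticket", "escalate_ticket"}
MAX_STEPS = 8  # mirrors SupportTicketEnvironment.max_steps


def flow_is_well_formed(flow: list[str]) -> bool:
    """True when the flow fits the step budget and ends on its only terminal action."""
    if not flow or len(flow) > MAX_STEPS:
        return False
    terminal_count = sum(1 for name in flow if name in TERMINAL_ACTIONS)
    return terminal_count == 1 and flow[-1] in TERMINAL_ACTIONS
```

Note that the third flow contains neither `issue_refund` nor `resolve_ticket`, matching the "do not refund" and "do not resolve" constraints.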
+## Project Layout
+- `support_ticket_env/`: models, fixtures, scoring, environment core, policy helpers, local/HTTP client
+- `server/`: FastAPI app and Dockerfile
+- `tests/`: unit and scenario tests
+- `inference.py`: baseline runner using the OpenAI client interface
+- `openenv.yaml`: environment metadata
+## Local Setup
+```bash
+python -m pip install -e ".[dev]"
+pytest
+uvicorn server.app:app --host 0.0.0.0 --port 8000 --reload
+```
+Open the docs at [http://localhost:8000/docs](http://localhost:8000/docs) or the simple UI at [http://localhost:8000/web](http://localhost:8000/web).
+## Docker
+```bash
+docker build -t customer-support-openenv -f server/Dockerfile .
+docker run -p 8000:8000 customer-support-openenv
+```
+## Baseline Inference
+The baseline script uses the OpenAI client interface and supports any OpenAI-compatible endpoint.
+Environment variables:
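+- `HF_TOKEN`, `OPENAI_API_KEY`, or `API_KEY` (checked in that order)
+- `API_BASE_URL` (defaults to `https://router.huggingface.co/v1`)
+- `MODEL_NAME` (defaults to `Qwen/Qwen2.5-72B-Instruct`)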
+- optional `ENV_BASE_URL` if you want the script to hit a running server instead of the in-process environment
+Run:
+```bash
+python inference.py
+```
+The script emits strict stdout lines in the required format:
+- `[START]`
+- `[STEP]`
+- `[END]`
+If the model call fails or credentials are missing, the script falls back to a deterministic scripted policy so the benchmark still runs reproducibly.
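For downstream tooling, the `[STEP]` lines can be parsed back into structured records. The sketch below assumes the field order printed by `log_step` in `inference.py` (`step`, `action`, `reward`, `done`, `error`) and that `action` is compact JSON; `STEP_LINE` and `parse_step_line` are hypothetical names for this example.

```python
import json
import re
from typing import Any

# The action JSON may contain spaces (for example inside a reply message),
# so the pattern anchors on the fixed " reward=... done=... error=..." tail.
STEP_LINE = re.compile(
    r"^\[STEP\] step=(?P<step>\d+) action=(?P<action>.*) "
    r"reward=(?P<reward>-?\d+\.\d{2}) done=(?P<done>true|false) "
    r"error=(?P<error>.*)$"
)


def parse_step_line(line: str) -> dict[str, Any]:
    """Parse one [STEP] stdout line into a dictionary."""
    match = STEP_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"Not a [STEP] line: {line!r}")
    error = match["error"]
    return {
        "step": int(match["step"]),
        "action": json.loads(match["action"]),
        "reward": float(match["reward"]),
        "done": match["done"] == "true",
        "error": None if error == "null" else error,
    }
```

The parser would need adjusting if an error message ever contained the same ` reward=... done=... error=` tail, which this sketch does not guard against.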
+## Example Gold Scores
+Using the included scripted policy:
+- `password_reset_guidance`: `1.0`
+- `duplicate_charge_refund`: `1.0`
+- `enterprise_data_loss_escalation`: `1.0`
+## Deployment Notes
+- The app exposes `/health`, `/reset`, `/step`, `/state`, `DELETE /session/{session_id}`, `/docs`, `/web`, and `/ws`
+- Sessions are held in memory, so they do not survive a server restart
+- No external services are required to run the environment server itself
+- The benchmark is designed to fit comfortably in the hackathon resource limits
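The HTTP endpoints can be driven with nothing but the standard library. The sketch below builds the `POST /reset` and `POST /step` requests using the body fields defined by `ResetRequest` and `StepRequest` in `server/app.py`; the helper names (`build_post`, `reset_request`, `step_request`, `send`) are hypothetical, and `send` assumes a server is already listening at the given base URL.

```python
import json
from typing import Any
from urllib import request


def build_post(base_url: str, path: str, payload: dict[str, Any]) -> request.Request:
    """Build a JSON POST request without sending it."""
    return request.Request(
        f"{base_url.rstrip('/')}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def reset_request(base_url: str, task_id: str) -> request.Request:
    """POST /reset starts an episode; the response carries info.session_id."""
    return build_post(base_url, "/reset", {"task_id": task_id})


def step_request(base_url: str, session_id: str, action: dict[str, Any]) -> request.Request:
    """POST /step applies one action to an existing session."""
    return build_post(base_url, "/step", {"session_id": session_id, "action": action})


def send(req: request.Request) -> dict[str, Any]:
    """Send a request and decode the JSON response (needs a running server)."""
    with request.urlopen(req) as response:
        return json.loads(response.read().decode("utf-8"))
```

A typical round trip would call `send(reset_request(...))`, read `info["session_id"]` from the result, and pass it to `step_request` for each subsequent action.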
+## Validation
+If `openenv` is installed locally, run:
+```bash
+openenv validate
+```
+This repository does not depend on an LLM judge for grading.
+All graders are deterministic and implemented directly in the environment scorer.

customer_support_openenv.egg-info/PKG-INFO ADDED

	@@ -0,0 +1,190 @@

+Metadata-Version: 2.4
+Name: customer-support-openenv
+Version: 0.1.0
+Summary: Deterministic OpenEnv-style customer support ticket benchmark for B2B SaaS workflows.
+Requires-Python: >=3.11
+Description-Content-Type: text/markdown
+Requires-Dist: fastapi>=0.115
+Requires-Dist: openenv-core>=0.2.0
+Requires-Dist: openai>=1.30
+Requires-Dist: pydantic>=2.7
+Requires-Dist: uvicorn>=0.30
+Provides-Extra: dev
+Requires-Dist: pytest>=8.0; extra == "dev"

customer_support_openenv.egg-info/SOURCES.txt ADDED

	@@ -0,0 +1,20 @@

+README.md
+pyproject.toml
+customer_support_openenv.egg-info/PKG-INFO
+customer_support_openenv.egg-info/SOURCES.txt
+customer_support_openenv.egg-info/dependency_links.txt
+customer_support_openenv.egg-info/entry_points.txt
+customer_support_openenv.egg-info/requires.txt
+customer_support_openenv.egg-info/top_level.txt
+server/__init__.py
+server/app.py
+support_ticket_env/__init__.py
+support_ticket_env/client.py
+support_ticket_env/env.py
+support_ticket_env/fixtures.py
+support_ticket_env/models.py
+support_ticket_env/policies.py
+support_ticket_env/scoring.py
+tests/test_env.py
+tests/test_models.py
+tests/test_scenarios.py

customer_support_openenv.egg-info/dependency_links.txt ADDED

	@@ -0,0 +1 @@

+

customer_support_openenv.egg-info/entry_points.txt ADDED

	@@ -0,0 +1,2 @@

+[console_scripts]
+server = server.app:main

customer_support_openenv.egg-info/requires.txt ADDED

	@@ -0,0 +1,8 @@

+fastapi>=0.115
+openenv-core>=0.2.0
+openai>=1.30
+pydantic>=2.7
+uvicorn>=0.30
+[dev]
+pytest>=8.0

customer_support_openenv.egg-info/top_level.txt ADDED

	@@ -0,0 +1,2 @@

+server
+support_ticket_env

inference.py ADDED

	@@ -0,0 +1,139 @@

+from __future__ import annotations
+import json
+import os
+import sys
+from typing import Any
+from openai import OpenAI
+from support_ticket_env import BENCHMARK_NAME, DEFAULT_SUCCESS_THRESHOLD, SupportTicketEnv, fallback_action, list_task_ids, parse_action
+API_KEY = os.getenv("HF_TOKEN") or os.getenv("OPENAI_API_KEY") or os.getenv("API_KEY")
+API_BASE_URL = os.getenv("API_BASE_URL") or "https://router.huggingface.co/v1"
+MODEL_NAME = os.getenv("MODEL_NAME") or "Qwen/Qwen2.5-72B-Instruct"
+ENV_BASE_URL = os.getenv("ENV_BASE_URL")
+TEMPERATURE = 0.0
+MAX_TOKENS = 220
+SUCCESS_THRESHOLD = DEFAULT_SUCCESS_THRESHOLD
+SYSTEM_PROMPT = """You are operating a deterministic customer-support environment.
+Choose exactly one tool action at each step and respond with exactly one JSON object.
+Valid actions:
+- {\"action_type\": \"search_kb\", \"query\": \"...\"}
+- {\"action_type\": \"lookup_account\", \"customer_id\": \"...\"}
+- {\"action_type\": \"send_reply\", \"message\": \"...\"}
+- {\"action_type\": \"issue_refund\", \"amount_cents\": 4900, \"reason_code\": \"duplicate_charge\"}
+- {\"action_type\": \"resolve_ticket\", \"resolution_code\": \"password_reset_guidance\"}
+- {\"action_type\": \"resolve_ticket\", \"resolution_code\": \"billing_refund_processed\"}
+- {\"action_type\": \"escalate_ticket\", \"queue\": \"support_lead\", \"priority\": \"P2\", \"summary\": \"...\"}
+- {\"action_type\": \"escalate_ticket\", \"queue\": \"legal_data_incident\", \"priority\": \"P0\", \"summary\": \"...\"}
+Do not include markdown, code fences, or explanations."""
+def log_start(task: str, env: str, model: str) -> None:
+    print(f"[START] task={task} env={env} model={model}", flush=True)
+def log_step(step: int, action: str, reward: float, done: bool, error: str | None) -> None:
+    error_value = "null" if not error else error.replace("\n", " ")
+    print(
+        f"[STEP] step={step} action={action} reward={reward:.2f} done={str(done).lower()} error={error_value}",
+        flush=True,
+    )
+def log_end(success: bool, steps: int, score: float, rewards: list[float]) -> None:
+    rewards_str = ",".join(f"{reward:.2f}" for reward in rewards)
+    print(
+        f"[END] success={str(success).lower()} steps={steps} score={score:.3f} rewards={rewards_str}",
+        flush=True,
+    )
+def _strip_code_fences(text: str) -> str:
+    cleaned = text.strip()
+    if cleaned.startswith("```"):
+        lines = cleaned.splitlines()
+        if lines and lines[0].startswith("```"):
+            lines = lines[1:]
+        if lines and lines[-1].startswith("```"):
+            lines = lines[:-1]
+        cleaned = "\n".join(lines).strip()
+    return cleaned
+def _extract_json_object(text: str) -> dict[str, Any]:
+    cleaned = _strip_code_fences(text)
+    start = cleaned.find("{")
+    end = cleaned.rfind("}")
+    if start == -1 or end == -1 or end <= start:
+        raise ValueError("No JSON object found in model response")
+    return json.loads(cleaned[start : end + 1])
+def build_user_prompt(observation: dict[str, Any]) -> str:
+    return (
+        "Choose the next best action for this support ticket. "
+        "Keep it valid and deterministic. Observation JSON:\n"
+        f"{json.dumps(observation, indent=2)}"
+    )
+def choose_action(client: OpenAI | None, observation) -> Any:
+    fallback = fallback_action(observation)
+    if client is None:
+        return fallback
+    try:
+        completion = client.chat.completions.create(
+            model=MODEL_NAME,
+            messages=[
+                {"role": "system", "content": SYSTEM_PROMPT},
+                {"role": "user", "content": build_user_prompt(observation.model_dump(mode="json"))},
+            ],
+            temperature=TEMPERATURE,
+            max_tokens=MAX_TOKENS,
+        )
+        content = (completion.choices[0].message.content or "").strip()
+        return parse_action(_extract_json_object(content))
+    except Exception as exc:  # pragma: no cover - depends on external endpoint
+        print(f"[DEBUG] Falling back to scripted policy: {exc}", file=sys.stderr, flush=True)
+        return fallback
+def run_episode(task_id: str, client: OpenAI | None) -> None:
+    env = SupportTicketEnv(base_url=ENV_BASE_URL, task_id=task_id)
+    rewards: list[float] = []
+    steps_taken = 0
+    final_score = 0.0
+    success = False
+    log_start(task=task_id, env=BENCHMARK_NAME, model=MODEL_NAME)
+    try:
+        result = env.reset(task_id)
+        while not result.done:
+            action = choose_action(client, result.observation)
+            result = env.step(action)
+            steps_taken += 1
+            rewards.append(result.reward)
+            action_str = json.dumps(action.model_dump(mode="json"), separators=(",", ":"))
+            log_step(
+                step=steps_taken,
+                action=action_str,
+                reward=result.reward,
+                done=result.done,
+                error=result.observation.last_action_error,
+            )
+        final_score = float(result.info.get("score", 0.0))
+        success = final_score >= SUCCESS_THRESHOLD
+    finally:
+        env.close()
+        log_end(success=success, steps=steps_taken, score=final_score, rewards=rewards)
+if __name__ == "__main__":
+    client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY) if API_KEY else None
+    for task_id in list_task_ids():
+        run_episode(task_id, client)

openenv.yaml ADDED

	@@ -0,0 +1,11 @@

+name: customer_support_ticket_handler
+version: 0.1.0
+description: Deterministic B2B SaaS support benchmark with typed tool actions and rubric-based rewards.
+entrypoint: server.app:app
+runtime: fastapi
+port: 8000
+tags:
+  - openenv
+  - customer-support
+  - reinforcement-learning
+  - benchmark

pyproject.toml ADDED

	@@ -0,0 +1,32 @@

+[build-system]
+requires = ["setuptools>=68", "wheel"]
+build-backend = "setuptools.build_meta"
+[project]
+name = "customer-support-openenv"
+version = "0.1.0"
+description = "Deterministic OpenEnv-style customer support ticket benchmark for B2B SaaS workflows."
+readme = "README.md"
+requires-python = ">=3.11"
+dependencies = [
+  "fastapi>=0.115",
+  "openenv-core>=0.2.0",
+  "openai>=1.30",
+  "pydantic>=2.7",
+  "uvicorn>=0.30",
+]
+[project.scripts]
+server = "server.app:main"
+[project.optional-dependencies]
+dev = ["pytest>=8.0"]
+[tool.setuptools.packages.find]
+where = ["."]
+include = ["support_ticket_env", "support_ticket_env.*", "server", "server.*"]
+[tool.pytest.ini_options]
+addopts = "-p no:cacheprovider"
+pythonpath = ["."]
+testpaths = ["tests"]

server/Dockerfile ADDED

	@@ -0,0 +1,19 @@

+FROM python:3.12-slim
+ENV PYTHONDONTWRITEBYTECODE=1 \
+    PYTHONUNBUFFERED=1 \
+    PORT=8000
+WORKDIR /app
+COPY pyproject.toml README.md openenv.yaml ./
+COPY support_ticket_env ./support_ticket_env
+COPY server ./server
+COPY inference.py ./inference.py
+RUN pip install --no-cache-dir --upgrade pip && \
+    pip install --no-cache-dir .
+EXPOSE 8000
+CMD ["python", "-m", "server.app"]

server/__init__.py ADDED

	@@ -0,0 +1 @@

+

server/app.py ADDED

	@@ -0,0 +1,168 @@

+from __future__ import annotations
+from threading import Lock
+from uuid import uuid4
+from typing import Any
+from fastapi import FastAPI, HTTPException, WebSocket, WebSocketDisconnect
+from fastapi.responses import HTMLResponse
+from pydantic import BaseModel, ConfigDict
+from support_ticket_env import SupportTicketEnvironment, list_task_ids
+class ResetRequest(BaseModel):
+    model_config = ConfigDict(extra="forbid")
+    task_id: str | None = None
+    session_id: str | None = None
+class StepRequest(BaseModel):
+    model_config = ConfigDict(extra="forbid")
+    session_id: str
+    action: dict[str, Any]
+class SessionManager:
+    def __init__(self) -> None:
+        self._sessions: dict[str, SupportTicketEnvironment] = {}
+        self._lock = Lock()
+    def create_or_reuse(self, session_id: str | None = None, task_id: str | None = None) -> tuple[str, SupportTicketEnvironment]:
+        with self._lock:
+            if session_id and session_id in self._sessions:
+                return session_id, self._sessions[session_id]
+            new_session_id = session_id or str(uuid4())
+            env = SupportTicketEnvironment(task_id=task_id)
+            self._sessions[new_session_id] = env
+            return new_session_id, env
+    def get(self, session_id: str) -> SupportTicketEnvironment:
+        with self._lock:
+            if session_id not in self._sessions:
+                raise KeyError(session_id)
+            return self._sessions[session_id]
+    def delete(self, session_id: str) -> None:
+        with self._lock:
+            self._sessions.pop(session_id, None)
+manager = SessionManager()
+app = FastAPI(
+    title="AcmeCloud Customer Support Ticket Handler",
+    version="0.1.0",
+    description="Deterministic OpenEnv-style customer support benchmark for B2B SaaS ticket handling.",
+)
+def _step_payload(result, session_id: str) -> dict[str, Any]:
+    payload = result.model_dump(mode="json")
+    payload.setdefault("info", {})["session_id"] = session_id
+    return payload
+@app.get("/health")
+def health() -> dict[str, Any]:
+    return {"status": "healthy", "tasks": list_task_ids()}
+@app.post("/reset")
+def reset(request: ResetRequest) -> dict[str, Any]:
+    session_id, env = manager.create_or_reuse(request.session_id, request.task_id)
+    result = env.reset(request.task_id)
+    return _step_payload(result, session_id)
+@app.post("/step")
+def step(request: StepRequest) -> dict[str, Any]:
+    try:
+        env = manager.get(request.session_id)
+    except KeyError as exc:
+        raise HTTPException(status_code=404, detail=f"Unknown session_id: {request.session_id}") from exc
+    result = env.step(request.action)
+    return _step_payload(result, request.session_id)
+@app.get("/state")
+def state(session_id: str) -> dict[str, Any]:
+    try:
+        env = manager.get(session_id)
+    except KeyError as exc:
+        raise HTTPException(status_code=404, detail=f"Unknown session_id: {session_id}") from exc
+    return {"session_id": session_id, **env.state()}
+@app.delete("/session/{session_id}")
+def close_session(session_id: str) -> dict[str, str]:
+    manager.delete(session_id)
+    return {"status": "deleted", "session_id": session_id}
+@app.get("/web")
+def web_ui() -> HTMLResponse:
+    task_items = "".join(f"<li><code>{task_id}</code></li>" for task_id in list_task_ids())
+    html = f"""
+    <html>
+      <head>
+        <title>AcmeCloud Customer Support Ticket Handler</title>
+        <style>
+          body {{ font-family: Segoe UI, sans-serif; margin: 2rem auto; max-width: 900px; line-height: 1.5; }}
+          code {{ background: #f4f4f4; padding: 0.15rem 0.35rem; border-radius: 0.25rem; }}
+          pre {{ background: #111827; color: #f9fafb; padding: 1rem; border-radius: 0.5rem; overflow-x: auto; }}
+        </style>
+      </head>
+      <body>
+        <h1>AcmeCloud Customer Support Ticket Handler</h1>
+        <p>One episode equals one support ticket. Available fixed tasks:</p>
+        <ul>{task_items}</ul>
+        <p>Example local reset:</p>
+        <pre>curl -X POST http://localhost:8000/reset -H "Content-Type: application/json" -d '{{"task_id":"password_reset_guidance"}}'</pre>
+      </body>
+    </html>
+    """
+    return HTMLResponse(html)
+@app.websocket("/ws")
+async def websocket_endpoint(websocket: WebSocket) -> None:
+    await websocket.accept()
+    session_id = str(uuid4())
+    env = SupportTicketEnvironment()
+    try:
+        while True:
+            payload = await websocket.receive_json()
+            message_type = payload.get("type")
+            if message_type == "reset":
+                result = env.reset(payload.get("task_id"))
+                await websocket.send_json(_step_payload(result, session_id))
+            elif message_type == "step":
+                result = env.step(payload.get("action", {}))
+                await websocket.send_json(_step_payload(result, session_id))
+            elif message_type == "state":
+                await websocket.send_json({"session_id": session_id, **env.state()})
+            elif message_type == "close":
+                await websocket.send_json({"status": "closed", "session_id": session_id})
+                break
+            else:
+                await websocket.send_json(
+                    {
+                        "error": "unsupported_message_type",
+                        "message": "Use reset, step, state, or close.",
+                        "session_id": session_id,
+                    }
+                )
+    except WebSocketDisconnect:
+        return
+def main() -> None:
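+    import os
+    import uvicorn
+    # Honour the PORT variable set in server/Dockerfile; default to 8000.
+    uvicorn.run("server.app:app", host="0.0.0.0", port=int(os.getenv("PORT", "8000")), reload=False)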
+if __name__ == "__main__":
+    main()

server/requirements.txt ADDED

	@@ -0,0 +1,4 @@

+fastapi>=0.115
+uvicorn>=0.30
+pydantic>=2.7
+openai>=1.30

support_ticket_env/__init__.py ADDED

	@@ -0,0 +1,42 @@

+from .client import SupportTicketEnv
+from .env import SupportTicketEnvironment
+from .fixtures import BENCHMARK_NAME, DEFAULT_SUCCESS_THRESHOLD, KB_ARTICLES, TASK_FIXTURES, list_task_ids
+from .models import (
+    ACTION_TYPE_NAMES,
+    EscalateTicketAction,
+    IssueRefundAction,
+    LookupAccountAction,
+    ResolveTicketAction,
+    SearchKBAction,
+    SendReplyAction,
+    SupportTicketAction,
+    SupportTicketObservation,
+    SupportTicketStepResult,
+    TaskScorecard,
+    parse_action,
+)
+from .policies import fallback_action, scripted_policy
+__all__ = [
+    "ACTION_TYPE_NAMES",
+    "BENCHMARK_NAME",
+    "DEFAULT_SUCCESS_THRESHOLD",
+    "EscalateTicketAction",
+    "IssueRefundAction",
+    "KB_ARTICLES",
+    "LookupAccountAction",
+    "ResolveTicketAction",
+    "SearchKBAction",
+    "SendReplyAction",
+    "SupportTicketAction",
+    "SupportTicketEnv",
+    "SupportTicketEnvironment",
+    "SupportTicketObservation",
+    "SupportTicketStepResult",
+    "TASK_FIXTURES",
+    "TaskScorecard",
+    "fallback_action",
+    "list_task_ids",
+    "parse_action",
+    "scripted_policy",
+]

support_ticket_env/client.py ADDED

	@@ -0,0 +1,99 @@

+from __future__ import annotations
+import json
+from typing import Any
+from urllib import parse, request
+from .env import SupportTicketEnvironment
+from .fixtures import list_task_ids
+from .models import SupportTicketAction, SupportTicketStepResult
+class SupportTicketEnv:
+    def __init__(self, base_url: str | None = None, task_id: str | None = None) -> None:
+        self.base_url = base_url.rstrip("/") if base_url else None
+        self.task_id = task_id
+        self.session_id: str | None = None
+        self._local_env = SupportTicketEnvironment(task_id=task_id) if not self.base_url else None
+    @classmethod
+    def from_docker_image(
+        cls,
+        image_name: str,
+        base_url: str = "http://localhost:8000",
+        task_id: str | None = None,
+    ) -> "SupportTicketEnv":
+        del image_name
+        return cls(base_url=base_url, task_id=task_id)
+    @classmethod
+    def from_env(
+        cls,
+        repo_id: str,
+        base_url: str,
+        task_id: str | None = None,
+    ) -> "SupportTicketEnv":
+        del repo_id
+        return cls(base_url=base_url, task_id=task_id)
+    def reset(self, task_id: str | None = None) -> SupportTicketStepResult:
+        effective_task_id = task_id or self.task_id
+        if self._local_env is not None:
+            return self._local_env.reset(effective_task_id)
+        payload = {}
+        if effective_task_id:
+            payload["task_id"] = effective_task_id
+        if self.session_id:
+            payload["session_id"] = self.session_id
+        result = self._post_json("/reset", payload)
+        self.session_id = result.info.get("session_id")
+        return result
+    def step(self, action: SupportTicketAction | dict[str, Any]) -> SupportTicketStepResult:
+        if self._local_env is not None:
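+            return self._local_env.step(action)
+        # Mirror the guard in state(): HTTP mode needs a session from reset().
+        if not self.session_id:
+            raise RuntimeError("reset() must be called before step() when using HTTP mode.")
+        payload = {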
+            "session_id": self.session_id,
+            "action": action.model_dump(mode="json") if hasattr(action, "model_dump") else action,
+        }
+        result = self._post_json("/step", payload)
+        self.session_id = result.info.get("session_id", self.session_id)
+        return result
+    def state(self) -> dict[str, Any]:
+        if self._local_env is not None:
+            return self._local_env.state()
+        if not self.session_id:
+            raise RuntimeError("reset() must be called before state() when using HTTP mode.")
+        query = parse.urlencode({"session_id": self.session_id})
+        with request.urlopen(f"{self.base_url}/state?{query}") as response:
+            return json.loads(response.read().decode("utf-8"))
+    def close(self) -> None:
+        self.session_id = None
+    def __enter__(self) -> "SupportTicketEnv":
+        return self
+    def __exit__(self, exc_type, exc, tb) -> None:
+        self.close()
+        return None
+    @staticmethod
+    def list_tasks() -> list[str]:
+        return list_task_ids()
+    def _post_json(self, path: str, payload: dict[str, Any]) -> SupportTicketStepResult:
+        body = json.dumps(payload).encode("utf-8")
+        req = request.Request(
+            f"{self.base_url}{path}",
+            data=body,
+            headers={"Content-Type": "application/json"},
+            method="POST",
+        )
+        with request.urlopen(req) as response:
+            data = json.loads(response.read().decode("utf-8"))
+        return SupportTicketStepResult.model_validate(data)

support_ticket_env/env.py ADDED

	@@ -0,0 +1,423 @@

+from __future__ import annotations
+from dataclasses import dataclass, field
+from typing import Any
+from .fixtures import (
+    BENCHMARK_NAME,
+    DEFAULT_SUCCESS_THRESHOLD,
+    KB_ARTICLES,
+    KnowledgeBaseArticle,
+    TaskFixture,
+    get_task_fixture,
+    list_task_ids,
+)
+from .models import (
+    ACTION_TYPE_NAMES,
+    AccountLookupResult,
+    ConversationTurn,
+    KBSearchResult,
+    ErrorToolResult,
+    EscalateTicketAction,
+    EscalationResult,
+    IssueRefundAction,
+    LookupAccountAction,
+    RefundResult,
+    ReplyResult,
+    ResolveResult,
+    SearchKBAction,
+    SupportTicketAction,
+    SupportTicketObservation,
+    SupportTicketStepResult,
+    ToolResult,
+    parse_action,
+)
+from .scoring import build_scorecard, normalize_text
+@dataclass
+class SessionState:
+    fixture: TaskFixture
+    ticket_status: str = "open"
+    steps_taken: int = 0
+    conversation_history: list[ConversationTurn] = field(default_factory=list)
+    action_history: list[dict[str, Any]] = field(default_factory=list)
+    reply_history: list[dict[str, Any]] = field(default_factory=list)
+    known_facts: dict[str, Any] = field(default_factory=dict)
+    kb_articles_seen: set[str] = field(default_factory=set)
+    search_signatures: set[str] = field(default_factory=set)
+    lookup_performed: bool = False
+    lookup_customer_id: str | None = None
+    refund_record: dict[str, Any] | None = None
+    refund_attempted: bool = False
+    resolution_code: str | None = None
+    escalation: dict[str, Any] | None = None
+    done: bool = False
+    terminal_reason: str | None = None
+    previous_score: float = 0.0
+    last_tool_result: ToolResult | None = None
+    last_action_error: str | None = None
+class SupportTicketEnvironment:
+    benchmark_name = BENCHMARK_NAME
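+    max_steps = 8  # hard cap on actions per episode
+    step_cost = 0.01  # subtracted from the reward on every step
+    invalid_action_penalty = 0.10  # unparseable or rejected actions
+    repeated_action_penalty = 0.02  # repeated KB query or account lookup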
+    success_threshold = DEFAULT_SUCCESS_THRESHOLD
+    def __init__(self, task_id: str | None = None) -> None:
+        self._default_task_id = task_id or list_task_ids()[0]
+        self._session: SessionState | None = None
+    def reset(self, task_id: str | None = None) -> SupportTicketStepResult:
+        fixture = get_task_fixture(task_id or self._default_task_id)
+        self._session = SessionState(
+            fixture=fixture,
+            conversation_history=[
+                ConversationTurn(
+                    role="customer",
+                    message=fixture.ticket.message,
+                    step_index=0,
+                )
+            ],
+        )
+        return self._build_result(reward=0.0)
+    def step(self, action: SupportTicketAction | dict[str, Any]) -> SupportTicketStepResult:
+        session = self._require_session()
+        if session.done:
+            session.last_action_error = "episode_already_done"
+            session.last_tool_result = ErrorToolResult(
+                tool_name="error",
+                success=False,
+                error_code="episode_already_done",
+                message="This ticket is already terminal. Reset the environment before stepping again.",
+            )
+            return self._build_result(reward=-self.invalid_action_penalty)
+        invalid_penalty = 0.0
+        redundancy_penalty = 0.0
+        session.last_action_error = None
+        try:
+            parsed_action = parse_action(action)
+        except Exception as exc:
+            session.steps_taken += 1
+            session.last_action_error = f"invalid_action: {exc}"
+            session.last_tool_result = ErrorToolResult(
+                tool_name="error",
+                success=False,
+                error_code="invalid_action",
+                message=str(exc),
+            )
+            invalid_penalty = self.invalid_action_penalty
+            self._record_action({"action_type": "invalid"}, False)
+            if session.steps_taken >= self.max_steps:
+                session.done = True
+                session.terminal_reason = "max_steps_exceeded"
+            return self._finalize_step(invalid_penalty=invalid_penalty, redundancy_penalty=0.0)
+        session.steps_taken += 1
+        session.last_tool_result, invalid_penalty, redundancy_penalty = self._apply_action(parsed_action)
+        action_succeeded = bool(getattr(session.last_tool_result, "success", False))
+        self._record_action(parsed_action.model_dump(mode="json"), action_succeeded)
+        if not session.done and session.steps_taken >= self.max_steps:
+            session.done = True
+            session.terminal_reason = "max_steps_exceeded"
+        return self._finalize_step(
+            invalid_penalty=invalid_penalty,
+            redundancy_penalty=redundancy_penalty,
+        )
+    def state(self) -> dict[str, Any]:
+        session = self._require_session()
+        scorecard = build_scorecard(session.fixture, session)
+        return {
+            "benchmark_name": self.benchmark_name,
+            "task_id": session.fixture.task_id,
+            "ticket_status": session.ticket_status,
+            "steps_taken": session.steps_taken,
+            "steps_remaining": max(self.max_steps - session.steps_taken, 0),
+            "conversation_history": [turn.model_dump(mode="json") for turn in session.conversation_history],
+            "audit_log": list(session.action_history),
+            "known_facts": dict(session.known_facts),
+            "current_rubric_score": scorecard.score,
+            "score_breakdown": scorecard.model_dump(mode="json"),
+            "terminal_reason": session.terminal_reason,
+            "done": session.done,
+        }
+    def _apply_action(self, action: SupportTicketAction) -> tuple[ToolResult, float, float]:
+        session = self._require_session()
+        invalid_penalty = 0.0
+        redundancy_penalty = 0.0
+        if isinstance(action, SearchKBAction):
+            query_signature = normalize_text(action.query)
+            if query_signature in session.search_signatures:
+                redundancy_penalty = self.repeated_action_penalty
+            session.search_signatures.add(query_signature)
+            articles = self._search_knowledge_base(action.query)
+            article_ids = [article.article_id for article in articles]
+            session.kb_articles_seen.update(article_ids)
+            session.known_facts["kb_articles_seen"] = sorted(session.kb_articles_seen)
+            session.known_facts["kb_titles_seen"] = [KB_ARTICLES[article_id].title for article_id in sorted(session.kb_articles_seen)]
+            result = KBSearchResult(
+                tool_name="search_kb",
+                success=bool(articles),
+                query=action.query,
+                article_ids=article_ids,
+                snippets=[article.snippet for article in articles],
+                message="Knowledge base search completed." if articles else "No KB articles matched the query.",
+            )
+            return result, invalid_penalty, redundancy_penalty
+        if isinstance(action, LookupAccountAction):
+            if action.customer_id != session.fixture.account.customer_id:
+                session.last_action_error = "unknown_customer_id"
+                result = ErrorToolResult(
+                    tool_name="error",
+                    success=False,
+                    error_code="unknown_customer_id",
+                    message=f"No account found for customer_id={action.customer_id}.",
+                )
+                return result, self.invalid_action_penalty, redundancy_penalty
+            if session.lookup_performed and session.lookup_customer_id == action.customer_id:
+                redundancy_penalty = self.repeated_action_penalty
+            account = session.fixture.account
+            session.lookup_performed = True
+            session.lookup_customer_id = action.customer_id
+            account_summary = {
+                "customer_id": account.customer_id,
+                "organization_name": account.organization_name,
+                "plan": account.plan,
+                "tenure_years": account.tenure_years,
+                "arr_usd": account.arr_usd,
+                "duplicate_charge_amount_cents": account.duplicate_charge_amount_cents,
+                "duplicate_charge_count": account.duplicate_charge_count,
+                "duplicate_charge_refund_eligible": account.duplicate_charge_refund_eligible,
+                "legal_threat": account.legal_threat,
+                "incident_severity": account.incident_severity,
+            }
+            session.known_facts["account"] = account_summary
+            result = AccountLookupResult(
+                tool_name="lookup_account",
+                success=True,
+                customer_id=action.customer_id,
+                account_summary=account_summary,
+                message="Account lookup completed.",
+            )
+            return result, invalid_penalty, redundancy_penalty
+        if action.action_type == "send_reply":
+            reply = action.message.strip()
+            session.reply_history.append({"message": reply, "step_index": session.steps_taken})
+            session.conversation_history.append(
+                ConversationTurn(role="agent", message=reply, step_index=session.steps_taken)
+            )
+            result = ReplyResult(
+                tool_name="send_reply",
+                success=True,
+                message_preview=reply[:120],
+                message="Reply sent to the customer.",
+            )
+            return result, invalid_penalty, redundancy_penalty
+        if isinstance(action, IssueRefundAction):
+            session.refund_attempted = True
+            account = session.fixture.account
+            if not session.lookup_performed:
+                session.last_action_error = "lookup_required_before_refund"
+                result = ErrorToolResult(
+                    tool_name="error",
+                    success=False,
+                    error_code="lookup_required_before_refund",
+                    message="lookup_account must succeed before issue_refund can be used.",
+                )
+                return result, self.invalid_action_penalty, redundancy_penalty
+            if not account.duplicate_charge_refund_eligible or not account.duplicate_charge_amount_cents:
+                session.last_action_error = "refund_not_applicable"
+                result = RefundResult(
+                    tool_name="issue_refund",
+                    success=False,
+                    refunded=False,
+                    amount_cents=action.amount_cents,
+                    reason_code=action.reason_code,
+                    message="No duplicate charge is eligible for refund on this account.",
+                )
+                return result, self.invalid_action_penalty, redundancy_penalty
+            if action.amount_cents != account.duplicate_charge_amount_cents or action.reason_code != "duplicate_charge":
+                session.last_action_error = "incorrect_refund_payload"
+                result = RefundResult(
+                    tool_name="issue_refund",
+                    success=False,
+                    refunded=False,
+                    amount_cents=action.amount_cents,
+                    reason_code=action.reason_code,
+                    message="Refund payload does not match the verified duplicate charge.",
+                )
+                return result, self.invalid_action_penalty, redundancy_penalty
+            session.refund_record = {
+                "amount_cents": action.amount_cents,
+                "reason_code": action.reason_code,
+                "step_index": session.steps_taken,
+            }
+            result = RefundResult(
+                tool_name="issue_refund",
+                success=True,
+                refunded=True,
+                amount_cents=action.amount_cents,
+                reason_code=action.reason_code,
+                message="Refund recorded successfully.",
+            )
+            return result, invalid_penalty, redundancy_penalty
+        if action.action_type == "resolve_ticket":
+            session.resolution_code = action.resolution_code
+            session.ticket_status = "resolved"
+            session.done = True
+            session.terminal_reason = "resolved"
+            result = ResolveResult(
+                tool_name="resolve_ticket",
+                success=True,
+                resolution_code=action.resolution_code,
+                ticket_status="resolved",
+                message="Ticket marked as resolved.",
+            )
+            return result, invalid_penalty, redundancy_penalty
+        if isinstance(action, EscalateTicketAction):
+            session.escalation = {
+                "queue": action.queue,
+                "priority": action.priority,
+                "summary": action.summary,
+                "step_index": session.steps_taken,
+            }
+            session.ticket_status = "escalated"
+            session.done = True
+            session.terminal_reason = "escalated"
+            result = EscalationResult(
+                tool_name="escalate_ticket",
+                success=True,
+                queue=action.queue,
+                priority=action.priority,
+                summary=action.summary,
+                ticket_status="escalated",
+                message="Ticket escalated.",
+            )
+            return result, invalid_penalty, redundancy_penalty
+        session.last_action_error = "unsupported_action"
+        return (
+            ErrorToolResult(
+                tool_name="error",
+                success=False,
+                error_code="unsupported_action",
+                message=f"Unsupported action type: {type(action).__name__}",
+            ),
+            self.invalid_action_penalty,
+            redundancy_penalty,
+        )
+    def _search_knowledge_base(self, query: str) -> list[KnowledgeBaseArticle]:
+        query_terms = set(normalize_text(query).split())
+        ranked: list[tuple[int, str, KnowledgeBaseArticle]] = []
+        for article in KB_ARTICLES.values():
+            searchable = normalize_text(" ".join((article.title, article.content, " ".join(article.tags))))
+            article_terms = set(searchable.split())
+            score = len(query_terms & article_terms)
+            if score > 0:
+                ranked.append((score, article.article_id, article))
+        ranked.sort(key=lambda item: (-item[0], item[1]))  # best overlap first; article_id breaks ties
+        return [article for _, _, article in ranked[:3]]
+    def _record_action(self, action_payload: dict[str, Any], action_succeeded: bool) -> None:
+        session = self._require_session()
+        session.action_history.append(
+            {
+                "step_index": session.steps_taken,
+                "action": action_payload,
+                "success": action_succeeded,
+                "ticket_status": session.ticket_status,
+            }
+        )
+    def _finalize_step(self, invalid_penalty: float, redundancy_penalty: float) -> SupportTicketStepResult:
+        session = self._require_session()
+        scorecard = build_scorecard(session.fixture, session)
+        reward = round(
+            (scorecard.score - session.previous_score) - self.step_cost - invalid_penalty - redundancy_penalty,
+            6,
+        )
+        session.previous_score = scorecard.score
+        return SupportTicketStepResult(
+            observation=self._build_observation(),
+            reward=reward,
+            done=session.done,
+            info={
+                "task_id": session.fixture.task_id,
+                "benchmark_name": self.benchmark_name,
+                "score": scorecard.score,
+                "score_breakdown": scorecard.model_dump(mode="json"),
+                "success": scorecard.score >= self.success_threshold,
+                "success_threshold": self.success_threshold,
+                "terminal_reason": session.terminal_reason,
+                "invalid_penalty": invalid_penalty,
+                "redundancy_penalty": redundancy_penalty,
+            },
+        )
+    def _build_observation(self) -> SupportTicketObservation:
+        session = self._require_session()
+        ticket = session.fixture.ticket
+        return SupportTicketObservation(
+            task_id=session.fixture.task_id,
+            ticket_id=ticket.ticket_id,
+            ticket_status=session.ticket_status,
+            customer_id=ticket.customer_id,
+            organization_name=ticket.organization_name,
+            subject=ticket.subject,
+            customer_message=ticket.message,
+            conversation_history=list(session.conversation_history),
+            last_tool_result=session.last_tool_result,
+            steps_taken=session.steps_taken,
+            steps_remaining=max(self.max_steps - session.steps_taken, 0),
+            available_action_types=list(ACTION_TYPE_NAMES),
+            last_action_error=session.last_action_error,
+            known_facts=dict(session.known_facts),
+        )
+    def _build_result(self, reward: float) -> SupportTicketStepResult:
+        session = self._require_session()
+        scorecard = build_scorecard(session.fixture, session)
+        session.previous_score = scorecard.score
+        return SupportTicketStepResult(
+            observation=self._build_observation(),
+            reward=reward,
+            done=session.done,
+            info={
+                "task_id": session.fixture.task_id,
+                "benchmark_name": self.benchmark_name,
+                "score": scorecard.score,
+                "score_breakdown": scorecard.model_dump(mode="json"),
+                "success": scorecard.score >= self.success_threshold,
+                "success_threshold": self.success_threshold,
+                "terminal_reason": session.terminal_reason,
+                "invalid_penalty": 0.0,
+                "redundancy_penalty": 0.0,
+            },
+        )
+    def _require_session(self) -> SessionState:
+        if self._session is None:
+            raise RuntimeError("Environment has not been reset yet.")
+        return self._session
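
Note on reward shaping in env.py: `_finalize_step` rewards the change in rubric score since the previous step, then subtracts the flat step cost and any invalid-action or redundancy penalty. The sketch below restates that arithmetic outside the class. `shaped_reward` is a hypothetical helper name that does not exist in the package, and the constants are copied from the `SupportTicketEnvironment` class attributes.

```python
# Standalone restatement of the reward arithmetic in _finalize_step.
# shaped_reward is a hypothetical name, not part of the package.
STEP_COST = 0.01
INVALID_ACTION_PENALTY = 0.10
REPEATED_ACTION_PENALTY = 0.02


def shaped_reward(
    score: float,
    previous_score: float,
    invalid_penalty: float = 0.0,
    redundancy_penalty: float = 0.0,
) -> float:
    """Score delta minus step cost and penalties, rounded to 6 places."""
    return round(
        (score - previous_score) - STEP_COST - invalid_penalty - redundancy_penalty,
        6,
    )


if __name__ == "__main__":
    # A step that raises the rubric score from 0.0 to 0.20.
    print(shaped_reward(0.20, 0.0))
    # A repeated KB query that leaves the score unchanged.
    print(shaped_reward(0.20, 0.20, redundancy_penalty=REPEATED_ACTION_PENALTY))
    # An action that fails validation.
    print(shaped_reward(0.0, 0.0, invalid_penalty=INVALID_ACTION_PENALTY))
```

Because every step pays the step cost, a step that does not move the rubric score always yields a negative reward.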

support_ticket_env/fixtures.py ADDED

	@@ -0,0 +1,270 @@

+from __future__ import annotations
+from collections import OrderedDict
+from dataclasses import dataclass, field
+from typing import Literal
+BENCHMARK_NAME = "customer_support_ticket_handler"
+DEFAULT_SUCCESS_THRESHOLD = 0.75
+RESET_URL = "https://app.acmecloud.com/reset"
+@dataclass(frozen=True)
+class KnowledgeBaseArticle:
+    article_id: str
+    title: str
+    tags: tuple[str, ...]
+    snippet: str
+    content: str
+@dataclass(frozen=True)
+class TicketFixture:
+    ticket_id: str
+    customer_id: str
+    organization_name: str
+    subject: str
+    message: str
+@dataclass(frozen=True)
+class AccountFixture:
+    customer_id: str
+    organization_name: str
+    plan: str
+    tenure_years: float | None = None
+    arr_usd: int | None = None
+    duplicate_charge_amount_cents: int | None = None
+    duplicate_charge_count: int = 0
+    duplicate_charge_refund_eligible: bool = False
+    legal_threat: bool = False
+    incident_severity: str | None = None
+    mandatory_escalation_queue: str | None = None
+    mandatory_escalation_priority: str | None = None
+@dataclass(frozen=True)
+class TaskFixture:
+    task_id: str
+    title: str
+    difficulty: Literal["easy", "medium", "hard"]
+    ticket: TicketFixture
+    account: AccountFixture
+    relevant_kb_article_id: str | None = None
+    expected_terminal_mode: Literal["resolve", "escalate"] = "resolve"
+    expected_resolution_code: str | None = None
+    expected_refund_amount_cents: int | None = None
+    refund_reason_code: str | None = None
+    expected_escalation_queue: str | None = None
+    expected_escalation_priority: str | None = None
+    reply_keyword_groups: dict[str, tuple[str, ...]] = field(default_factory=dict)
+    forbidden_reply_phrases: tuple[str, ...] = ()
+    rubric_weights: dict[str, float] = field(default_factory=dict)
+    efficiency_bonus_max_steps: int | None = None
+KB_ARTICLES = OrderedDict(
+    (
+        article.article_id,
+        article,
+    )
+    for article in (
+        KnowledgeBaseArticle(
+            article_id="KB-PW-RESET",
+            title="Password reset email troubleshooting",
+            tags=("password", "reset", "email", "spam", "login"),
+            snippet="Ask the customer to use the AcmeCloud reset page, check spam or junk, and wait 5 minutes before retrying.",
+            content=(
+                f"If a password reset email does not arrive, direct the user to {RESET_URL}. "
+                "Ask them to check their spam or junk folder and wait 5 minutes before requesting another email."
+            ),
+        ),
+        KnowledgeBaseArticle(
+            article_id="KB-BILL-DUPLICATE",
+            title="Duplicate subscription charge refund policy",
+            tags=("billing", "refund", "duplicate", "charge", "subscription"),
+            snippet="After verifying a duplicate charge, support can refund the extra charge. Refunds settle in 3-5 business days.",
+            content=(
+                "If account history confirms an accidental duplicate subscription charge, refund the duplicate amount in full. "
+                "Communicate that the refund will appear in 3-5 business days."
+            ),
+        ),
+        KnowledgeBaseArticle(
+            article_id="KB-INCIDENT-LEGAL",
+            title="Critical data incident legal escalation",
+            tags=("incident", "legal", "data", "escalation", "enterprise"),
+            snippet="Legal threats and alleged customer data loss must be escalated immediately to the legal_data_incident queue at P0.",
+            content=(
+                "If an enterprise customer reports data loss and mentions legal action, do not promise a resolution, do not admit fault, "
+                "and escalate immediately to the legal_data_incident queue with priority P0."
+            ),
+        ),
+        KnowledgeBaseArticle(
+            article_id="KB-SSO-SETUP",
+            title="Single sign-on setup guide",
+            tags=("sso", "setup", "identity", "onboarding"),
+            snippet="Configure SAML or OIDC before enforcing SSO in production.",
+            content="SSO setup steps for administrators integrating AcmeCloud with their identity provider.",
+        ),
+        KnowledgeBaseArticle(
+            article_id="KB-INVOICE-DOWNLOAD",
+            title="Invoice download instructions",
+            tags=("invoice", "billing", "download", "finance"),
+            snippet="Billing administrators can download invoices from the Finance tab in workspace settings.",
+            content="Steps for locating billing history and downloading invoices from the AcmeCloud admin console.",
+        ),
+        KnowledgeBaseArticle(
+            article_id="KB-MFA-RESET",
+            title="Multi-factor authentication reset",
+            tags=("mfa", "reset", "login", "security"),
+            snippet="MFA resets require identity verification or admin override.",
+            content="How to reset multi-factor authentication for locked-out users.",
+        ),
+    )
+)
+TASK_FIXTURES = OrderedDict(
+    (
+        task.task_id,
+        task,
+    )
+    for task in (
+        TaskFixture(
+            task_id="password_reset_guidance",
+            title="Password reset guidance",
+            difficulty="easy",
+            ticket=TicketFixture(
+                ticket_id="ticket_pw_001",
+                customer_id="cust_pw_001",
+                organization_name="Northstar Analytics",
+                subject="Reset email never arrived",
+                message="Hi, I forgot my password and the reset email isn't arriving. Please help.",
+            ),
+            account=AccountFixture(
+                customer_id="cust_pw_001",
+                organization_name="Northstar Analytics",
+                plan="Pro",
+            ),
+            relevant_kb_article_id="KB-PW-RESET",
+            expected_terminal_mode="resolve",
+            expected_resolution_code="password_reset_guidance",
+            reply_keyword_groups={
+                "reset_url": (RESET_URL,),
+                "spam_folder": ("spam", "junk"),
+                "wait_guidance": ("5 minutes", "five minutes"),
+            },
+            rubric_weights={
+                "searched_kb": 0.20,
+                "reply_has_reset_url": 0.30,
+                "reply_mentions_spam_folder": 0.20,
+                "resolved_correctly": 0.20,
+                "efficient_completion": 0.10,
+            },
+            efficiency_bonus_max_steps=4,
+        ),
+        TaskFixture(
+            task_id="duplicate_charge_refund",
+            title="Duplicate subscription charge refund",
+            difficulty="medium",
+            ticket=TicketFixture(
+                ticket_id="ticket_bill_002",
+                customer_id="cust_bill_002",
+                organization_name="BlueOrbit Labs",
+                subject="Charged twice this month",
+                message=(
+                    "I was charged twice for my subscription this month. I want both charges refunded immediately. "
+                    "I've been a customer for 3 years."
+                ),
+            ),
+            account=AccountFixture(
+                customer_id="cust_bill_002",
+                organization_name="BlueOrbit Labs",
+                plan="Business",
+                tenure_years=3.2,
+                duplicate_charge_amount_cents=4900,
+                duplicate_charge_count=2,
+                duplicate_charge_refund_eligible=True,
+            ),
+            relevant_kb_article_id="KB-BILL-DUPLICATE",
+            expected_terminal_mode="resolve",
+            expected_resolution_code="billing_refund_processed",
+            expected_refund_amount_cents=4900,
+            refund_reason_code="duplicate_charge",
+            reply_keyword_groups={
+                "timeline": ("3-5 business days", "3 to 5 business days"),
+                "duplicate_ack": ("duplicate charge", "double charge", "charged twice"),
+                "regret": ("sorry", "apologize", "regret"),
+                "refund_confirmed": ("refund", "refunded", "processed"),
+            },
+            rubric_weights={
+                "lookup_account": 0.15,
+                "searched_kb": 0.15,
+                "correct_refund": 0.25,
+                "reply_mentions_timeline": 0.20,
+                "reply_acknowledges_and_apologizes": 0.15,
+                "resolved_without_escalation": 0.10,
+            },
+        ),
+        TaskFixture(
+            task_id="enterprise_data_loss_escalation",
+            title="Enterprise data loss legal escalation",
+            difficulty="hard",
+            ticket=TicketFixture(
+                ticket_id="ticket_ent_003",
+                customer_id="cust_ent_003",
+                organization_name="Granite Peak Holdings",
+                subject="Critical enterprise incident",
+                message=(
+                    "Your platform deleted 2 years of our customer data during last night's maintenance. "
+                    "We are a Fortune 500 client. Our legal team will be in contact unless this is resolved in 2 hours. "
+                    "I need to speak to your CTO immediately."
+                ),
+            ),
+            account=AccountFixture(
+                customer_id="cust_ent_003",
+                organization_name="Granite Peak Holdings",
+                plan="Enterprise",
+                arr_usd=500000,
+                legal_threat=True,
+                incident_severity="data_loss",
+                mandatory_escalation_queue="legal_data_incident",
+                mandatory_escalation_priority="P0",
+            ),
+            relevant_kb_article_id="KB-INCIDENT-LEGAL",
+            expected_terminal_mode="escalate",
+            expected_escalation_queue="legal_data_incident",
+            expected_escalation_priority="P0",
+            reply_keyword_groups={
+                "urgency": ("urgent", "immediately", "right away", "priority"),
+                "escalation": ("escalating", "escalated", "investigated", "investigation"),
+            },
+            forbidden_reply_phrases=(
+                "we deleted your data",
+                "this is our fault",
+                "we are liable",
+                "we caused this",
+                "we guarantee recovery",
+            ),
+            rubric_weights={
+                "lookup_account": 0.10,
+                "no_refund_or_policy_action": 0.20,
+                "reply_sent_before_escalation": 0.15,
+                "careful_reply": 0.20,
+                "correct_escalation": 0.25,
+                "not_resolved": 0.10,
+            },
+        ),
+    )
+)
+def list_task_ids() -> list[str]:
+    return list(TASK_FIXTURES.keys())
+def get_task_fixture(task_id: str) -> TaskFixture:
+    if task_id not in TASK_FIXTURES:
+        raise KeyError(f"Unknown task_id: {task_id}")
+    return TASK_FIXTURES[task_id]
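
Note on rubric weights in fixtures.py: each task's `rubric_weights` looks designed to sum to 1.0, which would keep `TaskScorecard.score` inside its declared 0.0 to 1.0 range if the score is the sum of earned weights. That scoring rule is an assumption here, since scoring.py is not shown in this part of the diff. A minimal sanity check, with the weights copied from the fixtures and `weights_sum_to_one` as a hypothetical helper:

```python
import math

# Weights copied verbatim from TASK_FIXTURES in fixtures.py.
RUBRIC_WEIGHTS = {
    "password_reset_guidance": {
        "searched_kb": 0.20,
        "reply_has_reset_url": 0.30,
        "reply_mentions_spam_folder": 0.20,
        "resolved_correctly": 0.20,
        "efficient_completion": 0.10,
    },
    "duplicate_charge_refund": {
        "lookup_account": 0.15,
        "searched_kb": 0.15,
        "correct_refund": 0.25,
        "reply_mentions_timeline": 0.20,
        "reply_acknowledges_and_apologizes": 0.15,
        "resolved_without_escalation": 0.10,
    },
    "enterprise_data_loss_escalation": {
        "lookup_account": 0.10,
        "no_refund_or_policy_action": 0.20,
        "reply_sent_before_escalation": 0.15,
        "careful_reply": 0.20,
        "correct_escalation": 0.25,
        "not_resolved": 0.10,
    },
}


def weights_sum_to_one(weights: dict[str, float], tol: float = 1e-9) -> bool:
    """Hypothetical helper: True when the weights total 1.0 within tol."""
    return math.isclose(sum(weights.values()), 1.0, abs_tol=tol)


if __name__ == "__main__":
    for task_id, weights in RUBRIC_WEIGHTS.items():
        print(task_id, weights_sum_to_one(weights))
```

A check like this could run as a unit test so that a later edit to one weight does not silently push a task's maximum score above or below 1.0.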

support_ticket_env/models.py ADDED

	@@ -0,0 +1,214 @@

+from __future__ import annotations
+from typing import Annotated, Any, Literal, TypeAlias
+from pydantic import BaseModel, ConfigDict, Field, TypeAdapter
+class SearchKBAction(BaseModel):
+    model_config = ConfigDict(extra="forbid")
+    action_type: Literal["search_kb"]
+    query: str = Field(min_length=1)
+class LookupAccountAction(BaseModel):
+    model_config = ConfigDict(extra="forbid")
+    action_type: Literal["lookup_account"]
+    customer_id: str = Field(min_length=1)
+class SendReplyAction(BaseModel):
+    model_config = ConfigDict(extra="forbid")
+    action_type: Literal["send_reply"]
+    message: str = Field(min_length=1)
+class IssueRefundAction(BaseModel):
+    model_config = ConfigDict(extra="forbid")
+    action_type: Literal["issue_refund"]
+    amount_cents: int = Field(gt=0)
+    reason_code: Literal["duplicate_charge"]
+class ResolveTicketAction(BaseModel):
+    model_config = ConfigDict(extra="forbid")
+    action_type: Literal["resolve_ticket"]
+    resolution_code: Literal["password_reset_guidance", "billing_refund_processed"]
+class EscalateTicketAction(BaseModel):
+    model_config = ConfigDict(extra="forbid")
+    action_type: Literal["escalate_ticket"]
+    queue: Literal["support_lead", "legal_data_incident"]
+    priority: Literal["P2", "P0"]
+    summary: str = Field(min_length=1)
+SupportTicketAction: TypeAlias = Annotated[
+    SearchKBAction
+    | LookupAccountAction
+    | SendReplyAction
+    | IssueRefundAction
+    | ResolveTicketAction
+    | EscalateTicketAction,
+    Field(discriminator="action_type"),
+]
+ACTION_ADAPTER = TypeAdapter(SupportTicketAction)
+ACTION_TYPE_NAMES = [
+    "search_kb",
+    "lookup_account",
+    "send_reply",
+    "issue_refund",
+    "resolve_ticket",
+    "escalate_ticket",
+]
+def parse_action(value: SupportTicketAction | dict[str, Any]) -> SupportTicketAction:
+    return ACTION_ADAPTER.validate_python(value)
+class ConversationTurn(BaseModel):
+    model_config = ConfigDict(extra="forbid")
+    role: Literal["customer", "agent"]
+    message: str
+    step_index: int = Field(ge=0)
+class KBSearchResult(BaseModel):
+    model_config = ConfigDict(extra="forbid")
+    tool_name: Literal["search_kb"]
+    success: bool
+    query: str
+    article_ids: list[str] = Field(default_factory=list)
+    snippets: list[str] = Field(default_factory=list)
+    message: str | None = None
+class AccountLookupResult(BaseModel):
+    model_config = ConfigDict(extra="forbid")
+    tool_name: Literal["lookup_account"]
+    success: bool
+    customer_id: str
+    account_summary: dict[str, Any] = Field(default_factory=dict)
+    message: str | None = None
+class ReplyResult(BaseModel):
+    model_config = ConfigDict(extra="forbid")
+    tool_name: Literal["send_reply"]
+    success: bool
+    message_preview: str
+    message: str | None = None
+class RefundResult(BaseModel):
+    model_config = ConfigDict(extra="forbid")
+    tool_name: Literal["issue_refund"]
+    success: bool
+    refunded: bool
+    amount_cents: int
+    reason_code: str
+    message: str | None = None
+class ResolveResult(BaseModel):
+    model_config = ConfigDict(extra="forbid")
+    tool_name: Literal["resolve_ticket"]
+    success: bool
+    resolution_code: str
+    ticket_status: Literal["resolved"]
+    message: str | None = None
+class EscalationResult(BaseModel):
+    model_config = ConfigDict(extra="forbid")
+    tool_name: Literal["escalate_ticket"]
+    success: bool
+    queue: str
+    priority: str
+    summary: str
+    ticket_status: Literal["escalated"]
+    message: str | None = None
+class ErrorToolResult(BaseModel):
+    model_config = ConfigDict(extra="forbid")
+    tool_name: Literal["error"]
+    success: Literal[False]
+    error_code: str
+    message: str
+ToolResult: TypeAlias = Annotated[
+    KBSearchResult
+    | AccountLookupResult
+    | ReplyResult
+    | RefundResult
+    | ResolveResult
+    | EscalationResult
+    | ErrorToolResult,
+    Field(discriminator="tool_name"),
+]
+class ScoreCriterion(BaseModel):
+    model_config = ConfigDict(extra="forbid")
+    criterion_id: str
+    label: str
+    weight: float = Field(ge=0.0, le=1.0)
+    earned: bool
+    contribution: float = Field(ge=0.0, le=1.0)
+class TaskScorecard(BaseModel):
+    model_config = ConfigDict(extra="forbid")
+    task_id: str
+    score: float = Field(ge=0.0, le=1.0)
+    criteria: list[ScoreCriterion]
+class SupportTicketObservation(BaseModel):
+    model_config = ConfigDict(extra="forbid")
+    task_id: str
+    ticket_id: str
+    ticket_status: Literal["open", "resolved", "escalated"]
+    customer_id: str
+    organization_name: str
+    subject: str
+    customer_message: str
+    conversation_history: list[ConversationTurn]
+    last_tool_result: ToolResult | None = None
+    steps_taken: int = Field(ge=0)
+    steps_remaining: int = Field(ge=0)
+    available_action_types: list[str]
+    last_action_error: str | None = None
+    known_facts: dict[str, Any] = Field(default_factory=dict)
+class SupportTicketStepResult(BaseModel):
+    model_config = ConfigDict(extra="forbid")
+    observation: SupportTicketObservation
+    reward: float
+    done: bool
+    info: dict[str, Any] = Field(default_factory=dict)
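
The `ScoreCriterion` and `TaskScorecard` models above constrain `weight`, `contribution`, and `score` to the range 0.0 to 1.0. The following is a minimal sketch of how per-criterion contributions could be combined into a single clamped score. It uses a stdlib dataclass named `Criterion` as a stand-in for the pydantic model, and the weights are hypothetical values chosen only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """Stdlib stand-in for ScoreCriterion (hypothetical, no pydantic validation)."""

    criterion_id: str
    weight: float
    earned: bool

    @property
    def contribution(self) -> float:
        # An earned criterion contributes its full weight, otherwise nothing.
        return round(self.weight if self.earned else 0.0, 6)


def total_score(criteria: list[Criterion]) -> float:
    """Sum the contributions and clamp the result into [0.0, 1.0]."""
    raw = round(sum(item.contribution for item in criteria), 6)
    return min(max(raw, 0.0), 1.0)


# Hypothetical weights, chosen only so that they sum to 1.0.
example = [
    Criterion("searched_kb", 0.25, earned=True),
    Criterion("reply_has_reset_url", 0.25, earned=True),
    Criterion("resolved_correctly", 0.5, earned=False),
]
print(total_score(example))  # 0.5
```

Clamping keeps the result inside the bounds that `TaskScorecard.score` validates, even if a rubric's weights were to sum to more than 1.0.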

support_ticket_env/policies.py ADDED

	@@ -0,0 +1,69 @@

+from __future__ import annotations
+from .fixtures import RESET_URL
+from .models import (
+    EscalateTicketAction,
+    IssueRefundAction,
+    LookupAccountAction,
+    ResolveTicketAction,
+    SearchKBAction,
+    SupportTicketAction,
+    SupportTicketObservation,
+    SendReplyAction,
+)
+def scripted_policy(task_id: str, step_index: int, customer_id: str) -> SupportTicketAction:
+    if task_id == "password_reset_guidance":
+        if step_index == 1:
+            return SearchKBAction(action_type="search_kb", query="password reset email not arriving")
+        if step_index == 2:
+            return SendReplyAction(
+                action_type="send_reply",
+                message=(
+                    f"Please use {RESET_URL}. Check your spam or junk folder, then wait 5 minutes before trying again."
+                ),
+            )
+        return ResolveTicketAction(action_type="resolve_ticket", resolution_code="password_reset_guidance")
+    if task_id == "duplicate_charge_refund":
+        if step_index == 1:
+            return LookupAccountAction(action_type="lookup_account", customer_id=customer_id)
+        if step_index == 2:
+            return SearchKBAction(action_type="search_kb", query="duplicate charge refund policy")
+        if step_index == 3:
+            return IssueRefundAction(action_type="issue_refund", amount_cents=4900, reason_code="duplicate_charge")
+        if step_index == 4:
+            return SendReplyAction(
+                action_type="send_reply",
+                message=(
+                    "I'm sorry about the duplicate charge. I've processed the refund for the extra subscription charge, "
+                    "and it should appear in 3-5 business days."
+                ),
+            )
+        return ResolveTicketAction(action_type="resolve_ticket", resolution_code="billing_refund_processed")
+    if task_id == "enterprise_data_loss_escalation":
+        if step_index == 1:
+            return LookupAccountAction(action_type="lookup_account", customer_id=customer_id)
+        if step_index == 2:
+            return SendReplyAction(
+                action_type="send_reply",
+                message=(
+                    "I understand this is urgent. I am escalating this to our legal and incident response team right now, "
+                    "and the case is being actively investigated."
+                ),
+            )
+        return EscalateTicketAction(
+            action_type="escalate_ticket",
+            queue="legal_data_incident",
+            priority="P0",
+            summary="Enterprise customer reporting possible data loss and a legal threat after maintenance.",
+        )
+    raise ValueError(f"Unsupported task_id: {task_id}")
+def fallback_action(observation: SupportTicketObservation) -> SupportTicketAction:
+    next_step = observation.steps_taken + 1
+    return scripted_policy(observation.task_id, next_step, observation.customer_id)
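
The step-indexed branching in `scripted_policy` can also be read as a fixed list of action types per task, where any step past the scripted prefix falls through to the terminal action. The sketch below shows only that ordering. `SCRIPTS` and `scripted_action_type` are hypothetical names, and the sketch returns plain action-type strings rather than the typed action models.

```python
from __future__ import annotations

# Action type names per task, in the order scripted_policy emits them.
SCRIPTS: dict[str, tuple[str, ...]] = {
    "password_reset_guidance": ("search_kb", "send_reply", "resolve_ticket"),
    "duplicate_charge_refund": (
        "lookup_account",
        "search_kb",
        "issue_refund",
        "send_reply",
        "resolve_ticket",
    ),
    "enterprise_data_loss_escalation": ("lookup_account", "send_reply", "escalate_ticket"),
}


def scripted_action_type(task_id: str, step_index: int) -> str:
    """Return the action type used at a 1-based step_index (illustrative only)."""
    try:
        script = SCRIPTS[task_id]
    except KeyError:
        raise ValueError(f"Unsupported task_id: {task_id}") from None
    position = step_index - 1
    # Any step outside the scripted prefix falls through to the terminal action.
    if 0 <= position < len(script) - 1:
        return script[position]
    return script[-1]


print(scripted_action_type("duplicate_charge_refund", 3))  # issue_refund
```

Because `fallback_action` passes `steps_taken + 1`, a fresh episode starts at step index 1, which maps to the first entry of each tuple.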

support_ticket_env/scoring.py ADDED

	@@ -0,0 +1,218 @@

+from __future__ import annotations
+import re
+from typing import Any
+from .fixtures import TaskFixture
+from .models import ScoreCriterion, TaskScorecard
+_SPACE_RE = re.compile(r"\s+")
+def normalize_text(text: str) -> str:
+    lowered = text.lower().replace("-", " ")
+    normalized = _SPACE_RE.sub(" ", lowered)
+    return normalized.strip()
+def contains_any(text: str, phrases: tuple[str, ...] | list[str]) -> bool:
+    normalized = normalize_text(text)
+    return any(normalize_text(phrase) in normalized for phrase in phrases)
+def contains_all_groups(text: str, groups: list[tuple[str, ...]]) -> bool:
+    return all(contains_any(text, group) for group in groups)
+def _criterion(criterion_id: str, label: str, weight: float, earned: bool) -> ScoreCriterion:
+    contribution = round(weight if earned else 0.0, 6)
+    return ScoreCriterion(
+        criterion_id=criterion_id,
+        label=label,
+        weight=weight,
+        earned=earned,
+        contribution=contribution,
+    )
+def _reply_messages(state: Any) -> list[str]:
+    return [entry["message"] for entry in state.reply_history]
+def _has_reply_matching(state: Any, matcher) -> bool:
+    return any(matcher(message) for message in _reply_messages(state))
+def build_scorecard(fixture: TaskFixture, state: Any) -> TaskScorecard:
+    if fixture.task_id == "password_reset_guidance":
+        criteria = _score_password_reset(fixture, state)
+    elif fixture.task_id == "duplicate_charge_refund":
+        criteria = _score_duplicate_charge(fixture, state)
+    elif fixture.task_id == "enterprise_data_loss_escalation":
+        criteria = _score_enterprise_escalation(fixture, state)
+    else:
+        raise ValueError(f"Unsupported task_id: {fixture.task_id}")
+    total_score = round(sum(item.contribution for item in criteria), 6)
+    return TaskScorecard(task_id=fixture.task_id, score=min(max(total_score, 0.0), 1.0), criteria=criteria)
+def _score_password_reset(fixture: TaskFixture, state: Any) -> list[ScoreCriterion]:
+    weights = fixture.rubric_weights
+    replies = _reply_messages(state)
+    return [
+        _criterion(
+            "searched_kb",
+            "Relevant KB article retrieved",
+            weights["searched_kb"],
+            fixture.relevant_kb_article_id in state.kb_articles_seen,
+        ),
+        _criterion(
+            "reply_has_reset_url",
+            "Reply includes the password reset URL",
+            weights["reply_has_reset_url"],
+            any(fixture.reply_keyword_groups["reset_url"][0] in reply for reply in replies),
+        ),
+        _criterion(
+            "reply_mentions_spam_folder",
+            "Reply mentions checking spam or junk",
+            weights["reply_mentions_spam_folder"],
+            _has_reply_matching(state, lambda text: contains_any(text, fixture.reply_keyword_groups["spam_folder"])),
+        ),
+        _criterion(
+            "resolved_correctly",
+            "Ticket resolved with the correct resolution code",
+            weights["resolved_correctly"],
+            state.ticket_status == "resolved" and state.resolution_code == fixture.expected_resolution_code,
+        ),
+        _criterion(
+            "efficient_completion",
+            "Episode completed efficiently",
+            weights["efficient_completion"],
+            state.ticket_status == "resolved"
+            and state.resolution_code == fixture.expected_resolution_code
+            and state.steps_taken <= (fixture.efficiency_bonus_max_steps or 0),
+        ),
+    ]
+def _score_duplicate_charge(fixture: TaskFixture, state: Any) -> list[ScoreCriterion]:
+    weights = fixture.rubric_weights
+    def acknowledges_and_apologizes(reply: str) -> bool:
+        return contains_all_groups(
+            reply,
+            [
+                fixture.reply_keyword_groups["duplicate_ack"],
+                fixture.reply_keyword_groups["regret"],
+                fixture.reply_keyword_groups["refund_confirmed"],
+            ],
+        )
+    return [
+        _criterion(
+            "lookup_account",
+            "Account lookup completed",
+            weights["lookup_account"],
+            state.lookup_performed,
+        ),
+        _criterion(
+            "searched_kb",
+            "Duplicate charge policy article retrieved",
+            weights["searched_kb"],
+            fixture.relevant_kb_article_id in state.kb_articles_seen,
+        ),
+        _criterion(
+            "correct_refund",
+            "Correct full duplicate-charge refund issued",
+            weights["correct_refund"],
+            state.refund_record is not None
+            and state.refund_record["amount_cents"] == fixture.expected_refund_amount_cents
+            and state.refund_record["reason_code"] == fixture.refund_reason_code,
+        ),
+        _criterion(
+            "reply_mentions_timeline",
+            "Reply mentions the refund timeline",
+            weights["reply_mentions_timeline"],
+            _has_reply_matching(state, lambda text: contains_any(text, fixture.reply_keyword_groups["timeline"])),
+        ),
+        _criterion(
+            "reply_acknowledges_and_apologizes",
+            "Reply acknowledges the duplicate charge, apologizes, and confirms the refund",
+            weights["reply_acknowledges_and_apologizes"],
+            _has_reply_matching(state, acknowledges_and_apologizes),
+        ),
+        _criterion(
+            "resolved_without_escalation",
+            "Ticket resolved instead of escalated",
+            weights["resolved_without_escalation"],
+            state.ticket_status == "resolved" and state.resolution_code == fixture.expected_resolution_code,
+        ),
+    ]
+def _score_enterprise_escalation(fixture: TaskFixture, state: Any) -> list[ScoreCriterion]:
+    weights = fixture.rubric_weights
+    escalation_step = state.escalation["step_index"] if state.escalation else None
+    def careful_reply(reply: str) -> bool:
+        return (
+            contains_all_groups(
+                reply,
+                [
+                    fixture.reply_keyword_groups["urgency"],
+                    fixture.reply_keyword_groups["escalation"],
+                ],
+            )
+            and not contains_any(reply, fixture.forbidden_reply_phrases)
+        )
+    reply_before_escalation = any(
+        escalation_step is None or reply["step_index"] < escalation_step for reply in state.reply_history
+    )
+    return [
+        _criterion(
+            "lookup_account",
+            "Account lookup completed",
+            weights["lookup_account"],
+            state.lookup_performed,
+        ),
+        _criterion(
+            "no_refund_or_policy_action",
+            "No refund or resolution policy action was applied",
+            weights["no_refund_or_policy_action"],
+            state.done and not state.refund_attempted and state.resolution_code is None,
+        ),
+        _criterion(
+            "reply_sent_before_escalation",
+            "A reply was sent before escalation",
+            weights["reply_sent_before_escalation"],
+            reply_before_escalation and bool(state.reply_history),
+        ),
+        _criterion(
+            "careful_reply",
+            "Reply acknowledges urgency, mentions escalation, and avoids liability",
+            weights["careful_reply"],
+            any(
+                (escalation_step is None or reply["step_index"] < escalation_step)
+                and careful_reply(reply["message"])
+                for reply in state.reply_history
+            ),
+        ),
+        _criterion(
+            "correct_escalation",
+            "Escalation uses the correct queue and priority",
+            weights["correct_escalation"],
+            state.escalation is not None
+            and state.escalation["queue"] == fixture.expected_escalation_queue
+            and state.escalation["priority"] == fixture.expected_escalation_priority,
+        ),
+        _criterion(
+            "not_resolved",
+            "Ticket was not resolved",
+            weights["not_resolved"],
+            state.done and state.resolution_code is None,
+        ),
+    ]
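
The reply criteria above all rest on `normalize_text`, `contains_any`, and `contains_all_groups`. The standalone sketch below reproduces those three helpers to show how normalization makes matching tolerant of case, hyphens, and extra whitespace. The keyword groups `TIMELINE` and `REGRET` are hypothetical examples and are not the values defined in `fixtures`.

```python
from __future__ import annotations

import re

_SPACE_RE = re.compile(r"\s+")


def normalize_text(text: str) -> str:
    # Lowercase, treat hyphens as spaces, then collapse runs of whitespace.
    lowered = text.lower().replace("-", " ")
    return _SPACE_RE.sub(" ", lowered).strip()


def contains_any(text: str, phrases: tuple[str, ...] | list[str]) -> bool:
    # True if at least one normalized phrase is a substring of the normalized text.
    normalized = normalize_text(text)
    return any(normalize_text(phrase) in normalized for phrase in phrases)


def contains_all_groups(text: str, groups: list[tuple[str, ...]]) -> bool:
    # Every group must be satisfied by at least one of its phrases.
    return all(contains_any(text, group) for group in groups)


# Hypothetical keyword groups, for illustration only.
TIMELINE = ("3-5 business days", "three to five business days")
REGRET = ("sorry", "apologize")

reply = "I'm SORRY about that.  The refund arrives in 3 - 5 business days."
print(contains_all_groups(reply, [TIMELINE, REGRET]))  # True
```

Matching is by substring, so a phrase such as `sorry` would also match inside a longer word; the fixtures' phrase lists need to be chosen with that in mind.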

tests/test_env.py ADDED

	@@ -0,0 +1,58 @@

+import pytest
+from support_ticket_env import SupportTicketEnvironment
+from support_ticket_env.models import LookupAccountAction, SearchKBAction, SendReplyAction
+def test_search_returns_expected_article_and_progress_reward() -> None:
+    env = SupportTicketEnvironment()
+    env.reset("password_reset_guidance")
+    result = env.step(SearchKBAction(action_type="search_kb", query="password reset email not arriving"))
+    assert result.observation.last_tool_result.tool_name == "search_kb"
+    assert result.observation.last_tool_result.article_ids[0] == "KB-PW-RESET"
+    assert result.reward == pytest.approx(0.19)
+def test_lookup_populates_account_facts() -> None:
+    env = SupportTicketEnvironment()
+    env.reset("duplicate_charge_refund")
+    result = env.step(LookupAccountAction(action_type="lookup_account", customer_id="cust_bill_002"))
+    account = result.observation.known_facts["account"]
+    assert account["plan"] == "Business"
+    assert account["duplicate_charge_amount_cents"] == 4900
+def test_redundant_search_is_penalized() -> None:
+    env = SupportTicketEnvironment()
+    env.reset("password_reset_guidance")
+    env.step(SearchKBAction(action_type="search_kb", query="password reset email not arriving"))
+    result = env.step(SearchKBAction(action_type="search_kb", query="password reset email not arriving"))
+    assert result.reward == pytest.approx(-0.03)
+    assert result.info["redundancy_penalty"] == pytest.approx(0.02)
+def test_refund_before_lookup_is_invalid() -> None:
+    env = SupportTicketEnvironment()
+    env.reset("duplicate_charge_refund")
+    result = env.step({"action_type": "issue_refund", "amount_cents": 4900, "reason_code": "duplicate_charge"})
+    assert result.observation.last_action_error == "lookup_required_before_refund"
+    assert result.reward == pytest.approx(-0.11)
+def test_reset_clears_previous_state() -> None:
+    env = SupportTicketEnvironment()
+    env.reset("password_reset_guidance")
+    env.step(SearchKBAction(action_type="search_kb", query="password reset email not arriving"))
+    result = env.reset("password_reset_guidance")
+    assert result.observation.steps_taken == 0
+    assert result.observation.known_facts == {}
+    assert len(result.observation.conversation_history) == 1
+def test_max_steps_timeout_is_deterministic() -> None:
+    env = SupportTicketEnvironment()
+    result = env.reset("password_reset_guidance")
+    for _ in range(8):
+        result = env.step(SendReplyAction(action_type="send_reply", message="Still investigating."))
+    assert result.done is True
+    assert result.info["terminal_reason"] == "max_steps_exceeded"

tests/test_models.py ADDED

	@@ -0,0 +1,20 @@

+import pytest
+from support_ticket_env import parse_action
+def test_parse_action_accepts_valid_discriminated_union() -> None:
+    action = parse_action({"action_type": "search_kb", "query": "password reset"})
+    assert action.action_type == "search_kb"
+    assert action.query == "password reset"
+def test_parse_action_rejects_invalid_refund_reason_code() -> None:
+    with pytest.raises(Exception):
+        parse_action(
+            {
+                "action_type": "issue_refund",
+                "amount_cents": 4900,
+                "reason_code": "manual_override",
+            }
+        )

tests/test_scenarios.py ADDED

	@@ -0,0 +1,74 @@

+from support_ticket_env import SupportTicketEnvironment, list_task_ids, scripted_policy
+from support_ticket_env.models import EscalateTicketAction, IssueRefundAction, LookupAccountAction, ResolveTicketAction, SendReplyAction
+def run_actions(task_id: str, actions: list[object]):
+    env = SupportTicketEnvironment()
+    result = env.reset(task_id)
+    for action in actions:
+        result = env.step(action)
+        if result.done:
+            break
+    return result
+def test_gold_policies_score_perfectly() -> None:
+    env = SupportTicketEnvironment()
+    for task_id in list_task_ids():
+        result = env.reset(task_id)
+        while not result.done:
+            action = scripted_policy(task_id, result.observation.steps_taken + 1, result.observation.customer_id)
+            result = env.step(action)
+        assert result.info["score"] == 1.0
+def test_premature_resolution_scores_poorly() -> None:
+    result = run_actions(
+        "duplicate_charge_refund",
+        [ResolveTicketAction(action_type="resolve_ticket", resolution_code="billing_refund_processed")],
+    )
+    assert result.done is True
+    assert result.info["score"] < 0.5
+def test_task3_refund_attempt_hurts_final_score() -> None:
+    result = run_actions(
+        "enterprise_data_loss_escalation",
+        [
+            LookupAccountAction(action_type="lookup_account", customer_id="cust_ent_003"),
+            IssueRefundAction(action_type="issue_refund", amount_cents=4900, reason_code="duplicate_charge"),
+            SendReplyAction(
+                action_type="send_reply",
+                message="This is urgent. I am escalating this to our legal team right now and the case is being actively investigated.",
+            ),
+            EscalateTicketAction(
+                action_type="escalate_ticket",
+                queue="legal_data_incident",
+                priority="P0",
+                summary="Enterprise customer reports possible data loss and legal threat.",
+            ),
+        ],
+    )
+    assert result.info["score"] == 0.8
+def test_task3_liability_admission_fails_careful_reply() -> None:
+    result = run_actions(
+        "enterprise_data_loss_escalation",
+        [
+            LookupAccountAction(action_type="lookup_account", customer_id="cust_ent_003"),
+            SendReplyAction(
+                action_type="send_reply",
+                message="This is urgent and we are escalating it, but this is our fault and we caused this.",
+            ),
+            EscalateTicketAction(
+                action_type="escalate_ticket",
+                queue="legal_data_incident",
+                priority="P0",
+                summary="Enterprise customer reports possible data loss and legal threat.",
+            ),
+        ],
+    )
+    criteria = {item["criterion_id"]: item for item in result.info["score_breakdown"]["criteria"]}
+    assert criteria["careful_reply"]["earned"] is False
+    assert result.info["score"] < 1.0

uv.lock ADDED

The diff for this file is too large to render. See raw diff