---
title: Customer Support OpenEnv
emoji: "🎫"
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8000
short_description: Deterministic B2B SaaS support benchmark.
pinned: false
---
# AcmeCloud Customer Support Ticket Handler
A deterministic OpenEnv-style environment for training and evaluating agents on realistic B2B SaaS support workflows.
## What It Simulates
Each episode is one inbound customer-support ticket at a fictional company, `AcmeCloud`.
The agent acts like a support representative and must choose the right sequence of typed tool actions to handle the ticket correctly.
The benchmark ships with three fixed tasks:
1. `password_reset_guidance`
2. `duplicate_charge_refund`
3. `enterprise_data_loss_escalation`
## Why This Is Useful
This environment models a real operational task rather than a toy game:
- reading support tickets
- searching internal knowledge base articles
- looking up customer account details
- deciding whether to resolve, refund, or escalate
- sending customer-facing replies under policy constraints
The environment is fully deterministic and graded without any LLM judge, which makes it suitable for reproducible RL rollouts and benchmark evaluation.
## Action Space
The agent can take exactly six typed actions:
- `search_kb(query: str)`
- `lookup_account(customer_id: str)`
- `send_reply(message: str)`
- `issue_refund(amount_cents: int, reason_code: "duplicate_charge")`
- `resolve_ticket(resolution_code: "password_reset_guidance" | "billing_refund_processed")`
- `escalate_ticket(queue: "support_lead" | "legal_data_incident", priority: "P2" | "P0", summary: str)`
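The six signatures above can be modelled as typed Python values. The following is a minimal sketch, not the package's actual models: the class names, the `Action` union, and the `is_terminal` helper are hypothetical, and only the field names and literal values come from the list above.

```python
from dataclasses import dataclass
from typing import Literal, Union


@dataclass(frozen=True)
class SearchKb:
    query: str


@dataclass(frozen=True)
class LookupAccount:
    customer_id: str


@dataclass(frozen=True)
class SendReply:
    message: str


@dataclass(frozen=True)
class IssueRefund:
    amount_cents: int
    reason_code: Literal["duplicate_charge"] = "duplicate_charge"


@dataclass(frozen=True)
class ResolveTicket:
    resolution_code: Literal["password_reset_guidance", "billing_refund_processed"]


@dataclass(frozen=True)
class EscalateTicket:
    queue: Literal["support_lead", "legal_data_incident"]
    priority: Literal["P2", "P0"]
    summary: str


# Hypothetical union type covering all six actions.
Action = Union[
    SearchKb, LookupAccount, SendReply, IssueRefund, ResolveTicket, EscalateTicket
]

# Per the Reward Design section, only these two actions end the episode.
TERMINAL_ACTIONS = (ResolveTicket, EscalateTicket)


def is_terminal(action: Action) -> bool:
    """Return True if the action terminates the episode."""
    return isinstance(action, TERMINAL_ACTIONS)
```

Note that `Literal` annotations are checked by static type checkers only; a real implementation would presumably validate these values at runtime as well.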
## Observation Space
Each observation includes:
- task and ticket identifiers
- current ticket status
- customer metadata
- customer message and full conversation history
- the last tool result
- steps taken / remaining
- available action types
- last action error
- accumulated known facts learned from prior tool calls
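As an illustration, the observation fields listed above could be carried in a single dataclass. This is only a sketch: every field name and type below is an assumption chosen to mirror the list, and the environment's real observation model may differ.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Observation:
    # All names below are hypothetical; they mirror the bullet list one-to-one.
    task_id: str
    ticket_id: str
    ticket_status: str
    customer: dict[str, Any]
    customer_message: str
    conversation_history: list[dict[str, str]]
    last_tool_result: Optional[dict[str, Any]]
    steps_taken: int
    steps_remaining: int
    available_actions: list[str]
    last_action_error: Optional[str] = None
    known_facts: dict[str, Any] = field(default_factory=dict)
```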
## Reward Design
The environment uses rubric-based reward shaping.
- Each task has a deterministic scorecard in `[0.0, 1.0]`
- Step reward is `score_delta - 0.01 - invalid_penalty - redundancy_penalty`, where `0.01` is a fixed per-step cost
- Repeated search/lookup actions incur a redundancy penalty of `0.02`
- Invalid actions incur an invalid-action penalty of `0.10`
- `resolve_ticket` and `escalate_ticket` terminate the episode
- `issue_refund` changes state but does not terminate the episode
Global success threshold: `0.75`
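The step-reward formula can be sketched as a small function. This is an illustrative reimplementation, not the environment's scorer: the function and constant names are hypothetical, and it assumes that both penalties may apply to the same step and that the success threshold is inclusive, neither of which is stated above.

```python
STEP_COST = 0.01           # fixed cost charged on every step
REDUNDANCY_PENALTY = 0.02  # repeated search/lookup action
INVALID_PENALTY = 0.10     # invalid action
SUCCESS_THRESHOLD = 0.75   # global success threshold


def step_reward(score_before: float, score_after: float, *,
                invalid: bool = False, redundant: bool = False) -> float:
    """Reward for one step: scorecard delta minus step cost and penalties."""
    reward = (score_after - score_before) - STEP_COST
    if invalid:
        reward -= INVALID_PENALTY
    if redundant:
        reward -= REDUNDANCY_PENALTY
    return reward


def is_success(final_score: float) -> bool:
    # Assumption: a score exactly at the threshold counts as success.
    return final_score >= SUCCESS_THRESHOLD
```

For example, a step that raises the scorecard from `0.0` to `0.25` yields about `0.24`, while a repeated lookup that adds nothing to the score yields about `-0.03`.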
## Task Details
### 1. Password Reset Guidance
Customer issue: reset email did not arrive.
Expected flow:
- search password reset KB article
- send reply with reset URL and spam/junk guidance
- resolve with `password_reset_guidance`
### 2. Duplicate Charge Refund
Customer issue: billed twice for the current subscription period.
Expected flow:
- lookup the account
- search the refund policy
- issue the verified duplicate-charge refund
- reply with apology and timeline
- resolve with `billing_refund_processed`
### 3. Enterprise Data Loss Escalation
Customer issue: enterprise data-loss complaint with legal threat.
Expected flow:
- lookup the account
- send a careful acknowledgment reply
- escalate to `legal_data_incident` with `P0`
- do not refund
- do not resolve
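The three expected flows can be written down as ordered lists of action types, which is handy for building a scripted baseline. The sketch below is a hypothetical helper, not the environment's grader: scoring is rubric-based, so a trajectory does not necessarily have to match these sequences exactly to score well.

```python
# Expected action-type order per task, transcribed from the flows above.
EXPECTED_FLOWS = {
    "password_reset_guidance": [
        "search_kb", "send_reply", "resolve_ticket",
    ],
    "duplicate_charge_refund": [
        "lookup_account", "search_kb", "issue_refund",
        "send_reply", "resolve_ticket",
    ],
    "enterprise_data_loss_escalation": [
        "lookup_account", "send_reply", "escalate_ticket",
    ],
}


def next_expected_action(task, actions_taken):
    """Return the next action type in the expected flow, or None.

    None is returned when the flow is complete or when the actions taken
    so far have already diverged from the expected order.
    """
    flow = EXPECTED_FLOWS[task]
    n = len(actions_taken)
    if n >= len(flow) or list(actions_taken) != flow[:n]:
        return None
    return flow[n]
```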
## Project Layout
- `support_ticket_env/`: models, fixtures, scoring, environment core, policy helpers, local/HTTP client
- `server/`: FastAPI app and Dockerfile
- `tests/`: unit and scenario tests
- `inference.py`: baseline runner using the OpenAI client interface
- `openenv.yaml`: environment metadata
## Local Setup
```bash
# Quote the extras spec so shells such as zsh do not treat [dev] as a glob pattern.
python -m pip install -e ".[dev]"
pytest
uvicorn server.app:app --host 0.0.0.0 --port 8000 --reload
```
Open the docs at [http://localhost:8000/docs](http://localhost:8000/docs) or the simple UI at [http://localhost:8000/web](http://localhost:8000/web).
## Docker
```bash
docker build -t customer-support-openenv .
docker run -p 8000:8000 customer-support-openenv
```
## Baseline Inference
The baseline script uses the OpenAI client interface and supports any OpenAI-compatible endpoint.
Mandatory environment variables for hosted model inference:
- `HF_TOKEN`
- `API_BASE_URL`
- `MODEL_NAME`
Optional environment variables:
- `ENV_BASE_URL` to target a running local server or deployed HF Space
- `LOCAL_IMAGE_NAME` if you want the script to instantiate the environment via `from_docker_image(...)`
Inference environment selection:
1. `LOCAL_IMAGE_NAME` set: use `from_docker_image(...)`
2. otherwise `ENV_BASE_URL` set: use the running HTTP environment
3. otherwise: use the in-process local environment for offline reproducibility
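The precedence above amounts to a simple check of two environment variables. The following sketch shows the order only; `select_env_mode` and its return values are hypothetical names, and it assumes an empty string is treated the same as an unset variable.

```python
import os


def select_env_mode(environ=None):
    """Pick the environment backend using the precedence described above."""
    env = os.environ if environ is None else environ
    if env.get("LOCAL_IMAGE_NAME"):
        return "docker"  # instantiate via from_docker_image(...)
    if env.get("ENV_BASE_URL"):
        return "http"    # talk to a running server or HF Space
    return "local"       # in-process environment, works offline
```

Passing a plain dict instead of `os.environ` makes the precedence easy to check in a test without touching the real process environment.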
Run:
```bash
python inference.py
```
The script emits strict stdout lines in the required format:
- `[START]`
- `[STEP]`
- `[END]`
Output contract:
- one `[START]` line per task
- one `[STEP]` line immediately after each `env.step()`
- one `[END]` line per task, even on exception
- `reward`, `rewards`, and `score` formatted to 2 decimal places
- `done` and `success` emitted as lowercase booleans
- `error` emitted as the raw `last_action_error` string or `null`
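The per-field formatting rules in this contract can be expressed as a few small helpers. This is a sketch with hypothetical function names; it covers only how individual values are rendered, since the exact field order and layout of each line are not specified in this README.

```python
def fmt_float(value):
    """Format reward and score values to 2 decimal places."""
    return f"{value:.2f}"


def fmt_bool(flag):
    """Format done/success flags as lowercase booleans."""
    return "true" if flag else "false"


def fmt_error(last_action_error):
    """Emit the raw error string, or the literal null when there is none."""
    return "null" if last_action_error is None else str(last_action_error)
```

For instance, `fmt_float(0.5)` gives `0.50`, `fmt_bool(True)` gives `true`, and `fmt_error(None)` gives `null`.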
If the model call fails or credentials are missing, the script falls back to a deterministic scripted policy so the benchmark still runs reproducibly.
## Example Gold Scores
Using the included scripted policy:
- `password_reset_guidance`: `1.0`
- `duplicate_charge_refund`: `1.0`
- `enterprise_data_loss_escalation`: `1.0`
## Deployment Notes
- HF Space page: `https://huggingface.co/spaces/Dar3devil/customer-support-openenv`
- HF app URL: `https://dar3devil-customer-support-openenv.hf.space`
- The app exposes `/health`, `/reset`, `/step`, `/state`, `/docs`, `/web`, and `/ws`
- Sessions are managed in-memory
- No external services are required to run the environment server itself
- The benchmark is designed to fit comfortably in the hackathon resource limits
## Validation
If `openenv` is installed locally, run:
```bash
openenv validate
```
## Pre-Submission Commands
Local checks:
```powershell
cd "C:\Users\aarya\.codex\worktrees\e74f\Task Scheduler\customer_support_openenv"
openenv validate
pytest -q
```
Baseline run:
```powershell
$env:HF_TOKEN="<your-hf-token>"
$env:API_BASE_URL="https://router.huggingface.co/v1"
$env:MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
python inference.py
```
Local Docker smoke test:
```powershell
docker build -t customer-support-openenv .
docker run --rm -p 8000:8000 customer-support-openenv
curl.exe -sS http://localhost:8000/health
curl.exe -sS -X POST http://localhost:8000/reset -H "Content-Type: application/json" -d "{}"
```
Live Space smoke test:
```powershell
curl.exe -sS https://dar3devil-customer-support-openenv.hf.space/health
curl.exe -sS -X POST https://dar3devil-customer-support-openenv.hf.space/reset -H "Content-Type: application/json" -d "{}"
```
Submission validator:
```powershell
wsl bash -lc "cd '/mnt/c/Users/aarya/.codex/worktrees/e74f/Task Scheduler/customer_support_openenv' && chmod +x scripts/validate-submission.sh && scripts/validate-submission.sh https://dar3devil-customer-support-openenv.hf.space ."
```
Windows users should run the validator script through WSL or Git Bash.
This repository does not depend on an LLM judge for grading.
All graders are deterministic and implemented directly in the environment scorer.