---
title: Customer Support OpenEnv
emoji: "🎫"
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8000
short_description: Deterministic B2B SaaS support benchmark.
pinned: false
---
# AcmeCloud Customer Support Ticket Handler

A deterministic OpenEnv-style environment for training and evaluating agents on realistic B2B SaaS support workflows.

## What It Simulates

Each episode is one inbound customer-support ticket at a fictional company, `AcmeCloud`.
The agent acts as a support representative and must choose the right sequence of typed tool actions to handle the ticket correctly.

The benchmark ships with three fixed tasks:

1. `password_reset_guidance`
2. `duplicate_charge_refund`
3. `enterprise_data_loss_escalation`
## Why This Is Useful

This environment models a real operational task rather than a toy game:

- reading support tickets
- searching internal knowledge base articles
- looking up customer account details
- deciding whether to resolve, refund, or escalate
- sending customer-facing replies under policy constraints

The environment is fully deterministic and graded without any LLM judge, which makes it suitable for reproducible RL rollouts and benchmark evaluation.
## Action Space

The agent can take exactly six typed actions:

- `search_kb(query: str)`
- `lookup_account(customer_id: str)`
- `send_reply(message: str)`
- `issue_refund(amount_cents: int, reason_code: "duplicate_charge")`
- `resolve_ticket(resolution_code: "password_reset_guidance" | "billing_refund_processed")`
- `escalate_ticket(queue: "support_lead" | "legal_data_incident", priority: "P2" | "P0", summary: str)`
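
The six signatures above can be sketched as plain Python dataclasses. This is an illustrative sketch only: the class names, the `Action` alias, and the `is_terminal` helper are hypothetical and are not taken from `support_ticket_env/`. Only the field names and allowed literal values come from the list above.

```python
from dataclasses import dataclass
from typing import Literal, Union


# Hypothetical class names; field names and literal values mirror the action list.
@dataclass(frozen=True)
class SearchKb:
    query: str


@dataclass(frozen=True)
class LookupAccount:
    customer_id: str


@dataclass(frozen=True)
class SendReply:
    message: str


@dataclass(frozen=True)
class IssueRefund:
    amount_cents: int
    reason_code: Literal["duplicate_charge"] = "duplicate_charge"


@dataclass(frozen=True)
class ResolveTicket:
    resolution_code: Literal["password_reset_guidance", "billing_refund_processed"]


@dataclass(frozen=True)
class EscalateTicket:
    queue: Literal["support_lead", "legal_data_incident"]
    priority: Literal["P2", "P0"]
    summary: str


Action = Union[SearchKb, LookupAccount, SendReply, IssueRefund, ResolveTicket, EscalateTicket]

# resolve_ticket and escalate_ticket end the episode; issue_refund does not.
TERMINAL_ACTIONS = (ResolveTicket, EscalateTicket)


def is_terminal(action: Action) -> bool:
    """Return True if the action would end the episode."""
    return isinstance(action, TERMINAL_ACTIONS)
```

Note that `Literal` annotations are checked by static type checkers, not at runtime, so a real environment would still need its own validation of these values.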
## Observation Space

Each observation includes:

- task and ticket identifiers
- current ticket status
- customer metadata
- customer message and full conversation history
- the last tool result
- steps taken / remaining
- available action types
- last action error
- accumulated known facts learned from prior tool calls
## Reward Design

The environment uses rubric-based reward shaping.

- Each task has a deterministic scorecard in `[0.0, 1.0]`
- Step reward is `score_delta - 0.01 - invalid_penalty - redundancy_penalty`
- Repeated search/lookup actions incur a `redundancy_penalty` of `0.02`
- Invalid actions incur an `invalid_penalty` of `0.10`
- `resolve_ticket` and `escalate_ticket` terminate the episode
- `issue_refund` changes state but does not terminate the episode

Global success threshold: `0.75`
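
As a worked illustration of the step-reward formula, the sketch below computes the reward for a single step from the constants listed above. The function name `step_reward` and its boolean flags are hypothetical; the actual scorer in the repository may structure this differently.

```python
# Constants taken from the Reward Design list.
STEP_COST = 0.01
INVALID_PENALTY = 0.10
REDUNDANCY_PENALTY = 0.02


def step_reward(
    prev_score: float,
    new_score: float,
    *,
    invalid: bool = False,
    redundant: bool = False,
) -> float:
    """Step reward = score_delta - 0.01 - invalid_penalty - redundancy_penalty."""
    score_delta = new_score - prev_score
    reward = score_delta - STEP_COST
    if invalid:
        reward -= INVALID_PENALTY
    if redundant:
        reward -= REDUNDANCY_PENALTY
    return reward


# A step that raises the scorecard from 0.00 to 0.25 earns 0.25 - 0.01.
useful = step_reward(0.0, 0.25)
# A repeated lookup that leaves the scorecard unchanged costs 0.01 + 0.02.
repeated = step_reward(0.25, 0.25, redundant=True)
```

Under these constants, every step costs at least `0.01`, so shorter correct trajectories accumulate more total reward than longer ones that reach the same score.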
## Task Details

### 1. Password Reset Guidance

Customer issue: reset email did not arrive.

Expected flow:

- search the password reset KB article
- send a reply with the reset URL and spam/junk guidance
- resolve with `password_reset_guidance`

### 2. Duplicate Charge Refund

Customer issue: billed twice for the current subscription period.

Expected flow:

- look up the account
- search the refund policy
- issue the verified duplicate-charge refund
- reply with an apology and timeline
- resolve with `billing_refund_processed`

### 3. Enterprise Data Loss Escalation

Customer issue: enterprise data-loss complaint with legal threat.

Expected flow:

- look up the account
- send a careful acknowledgment reply
- escalate to `legal_data_incident` with `P0`
- do not refund
- do not resolve
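
The three expected flows can be written down as ordered action-type sequences. The sketch below is a hypothetical helper for sanity-checking a trajectory against them; it is not the environment's rubric scorer, and treating the flows as strictly ordered is an assumption made for illustration.

```python
from typing import Dict, List, Sequence, Set

# Action types in the order given by each task's expected flow.
EXPECTED_FLOWS: Dict[str, List[str]] = {
    "password_reset_guidance": ["search_kb", "send_reply", "resolve_ticket"],
    "duplicate_charge_refund": [
        "lookup_account",
        "search_kb",
        "issue_refund",
        "send_reply",
        "resolve_ticket",
    ],
    "enterprise_data_loss_escalation": ["lookup_account", "send_reply", "escalate_ticket"],
}

# Task 3 explicitly says "do not refund" and "do not resolve".
FORBIDDEN_ACTIONS: Dict[str, Set[str]] = {
    "enterprise_data_loss_escalation": {"issue_refund", "resolve_ticket"},
}


def follows_expected_flow(task_id: str, action_types: Sequence[str]) -> bool:
    """True if the expected steps appear in order and no forbidden action is used."""
    forbidden = FORBIDDEN_ACTIONS.get(task_id, set())
    if any(action in forbidden for action in action_types):
        return False
    remaining = iter(action_types)
    # `step in remaining` consumes the iterator, which enforces ordering.
    return all(step in remaining for step in EXPECTED_FLOWS[task_id])
```

Extra actions between the expected steps are tolerated by this check, although under the reward rules they would still cost the per-step penalty.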
## Project Layout

- `support_ticket_env/`: models, fixtures, scoring, environment core, policy helpers, local/HTTP client
- `server/`: FastAPI app and Dockerfile
- `tests/`: unit and scenario tests
- `inference.py`: baseline runner using the OpenAI client interface
- `openenv.yaml`: environment metadata
## Local Setup

```bash
# Quote the extras specifier so shells such as zsh do not treat [dev] as a glob.
python -m pip install -e ".[dev]"
pytest
uvicorn server.app:app --host 0.0.0.0 --port 8000 --reload
```

Open the docs at [http://localhost:8000/docs](http://localhost:8000/docs) or the simple UI at [http://localhost:8000/web](http://localhost:8000/web).
## Docker

```bash
docker build -t customer-support-openenv .
docker run -p 8000:8000 customer-support-openenv
```
## Baseline Inference

The baseline script uses the OpenAI client interface and supports any OpenAI-compatible endpoint.

Mandatory environment variables for hosted model inference:

- `HF_TOKEN`
- `API_BASE_URL`
- `MODEL_NAME`

Optional environment variables:

- `ENV_BASE_URL` to target a running local server or deployed HF Space
- `LOCAL_IMAGE_NAME` if you want the script to instantiate the environment via `from_docker_image(...)`

Inference environment selection:

1. `LOCAL_IMAGE_NAME` set: use `from_docker_image(...)`
2. otherwise `ENV_BASE_URL` set: use the running HTTP environment
3. otherwise: use the in-process local environment for offline reproducibility
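
The selection order above could be expressed roughly as follows. This is a minimal sketch, not the code in `inference.py`: the function name `select_env_backend` and the returned labels are hypothetical, and treating an empty variable as unset is an assumption.

```python
import os
from typing import Mapping, Optional, Tuple


def select_env_backend(
    environ: Optional[Mapping[str, str]] = None,
) -> Tuple[str, Optional[str]]:
    """Return (backend, target) following the documented precedence.

    Labels "docker", "http", and "local" are illustrative names only.
    """
    env = os.environ if environ is None else environ

    # 1. LOCAL_IMAGE_NAME takes priority (assumption: empty string counts as unset).
    image = env.get("LOCAL_IMAGE_NAME")
    if image:
        return ("docker", image)

    # 2. Otherwise use a running HTTP environment if ENV_BASE_URL is set.
    base_url = env.get("ENV_BASE_URL")
    if base_url:
        return ("http", base_url)

    # 3. Otherwise fall back to the in-process local environment.
    return ("local", None)
```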
Run:

```bash
python inference.py
```

The script emits strict stdout lines in the required format:

- `[START]`
- `[STEP]`
- `[END]`

Output contract:

- one `[START]` line per task
- one `[STEP]` line immediately after each `env.step()`
- one `[END]` line per task, even on exception
- `reward`, `rewards`, and `score` formatted to 2 decimal places
- `done` and `success` emitted as lowercase booleans
- `error` emitted as the raw `last_action_error` string or `null`
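
The value-formatting rules in the output contract can be illustrated with a few small helpers. The helper names and the `key=value` layout of the `[STEP]` line below are assumptions made for this sketch; the exact line layout is defined by `inference.py` and is not reproduced in this README.

```python
from typing import Optional, Sequence


def fmt_float(value: float) -> str:
    """Format reward and score values to 2 decimal places."""
    return f"{value:.2f}"


def fmt_bool(flag: bool) -> str:
    """Emit done/success as lowercase booleans."""
    return "true" if flag else "false"


def fmt_error(last_action_error: Optional[str]) -> str:
    """Emit the raw error string, or the literal null when there is none."""
    return "null" if last_action_error is None else last_action_error


def fmt_rewards(rewards: Sequence[float]) -> str:
    """Format a list of step rewards (comma-separated layout is assumed)."""
    return ",".join(fmt_float(r) for r in rewards)


def format_step_line(
    step: int, action: str, reward: float, done: bool, error: Optional[str]
) -> str:
    # Assumed field layout, for illustration only.
    return (
        f"[STEP] step={step} action={action} reward={fmt_float(reward)} "
        f"done={fmt_bool(done)} error={fmt_error(error)}"
    )
```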
If the model call fails or credentials are missing, the script falls back to a deterministic scripted policy so the benchmark still runs reproducibly.
## Example Gold Scores

Using the included scripted policy:

- `password_reset_guidance`: `1.0`
- `duplicate_charge_refund`: `1.0`
- `enterprise_data_loss_escalation`: `1.0`
## Deployment Notes

- HF Space page: `https://huggingface.co/spaces/Dar3devil/customer-support-openenv`
- HF app URL: `https://dar3devil-customer-support-openenv.hf.space`
- The app exposes `/health`, `/reset`, `/step`, `/state`, `/docs`, `/web`, and `/ws`
- Sessions are managed in-memory
- No external services are required to run the environment server itself
- The benchmark is designed to fit comfortably in the hackathon resource limits
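
For a quick check of the `/health` and `/reset` endpoints from Python, a standard-library sketch might look like the following. The helper names are hypothetical, and the empty JSON body for `/reset` simply mirrors the `curl` smoke tests in this README; the response schema is not assumed, so responses are returned as raw text.

```python
import json
import urllib.request


def build_health_request(base_url: str) -> urllib.request.Request:
    """GET <base_url>/health."""
    return urllib.request.Request(base_url.rstrip("/") + "/health", method="GET")


def build_reset_request(base_url: str) -> urllib.request.Request:
    """POST <base_url>/reset with an empty JSON object as the body."""
    return urllib.request.Request(
        base_url.rstrip("/") + "/reset",
        data=json.dumps({}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 10.0) -> str:
    """Send the request and return the response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Example usage against a locally running server:
#   print(send(build_health_request("http://localhost:8000")))
#   print(send(build_reset_request("http://localhost:8000")))
```

The same helpers work against the deployed Space by passing the HF app URL as `base_url`.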
## Validation

If `openenv` is installed locally, run:

```bash
openenv validate
```
## Pre-Submission Commands

Local checks:

```powershell
cd "C:\Users\aarya\.codex\worktrees\e74f\Task Scheduler\customer_support_openenv"
openenv validate
pytest -q
```

Baseline run:

```powershell
$env:HF_TOKEN="<your-hf-token>"
$env:API_BASE_URL="https://router.huggingface.co/v1"
$env:MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
python inference.py
```

Local Docker smoke test:

```powershell
docker build -t customer-support-openenv .
docker run --rm -p 8000:8000 customer-support-openenv
curl.exe -sS http://localhost:8000/health
curl.exe -sS -X POST http://localhost:8000/reset -H "Content-Type: application/json" -d "{}"
```

Live Space smoke test:

```powershell
curl.exe -sS https://dar3devil-customer-support-openenv.hf.space/health
curl.exe -sS -X POST https://dar3devil-customer-support-openenv.hf.space/reset -H "Content-Type: application/json" -d "{}"
```

Submission validator:

```powershell
wsl bash -lc "cd '/mnt/c/Users/aarya/.codex/worktrees/e74f/Task Scheduler/customer_support_openenv' && chmod +x scripts/validate-submission.sh && scripts/validate-submission.sh https://dar3devil-customer-support-openenv.hf.space ."
```
Windows users should run the validator script through WSL or Git Bash.

This repository does not depend on an LLM judge for grading.
All graders are deterministic and implemented directly in the environment scorer.