title: Customer Support OpenEnv
emoji: 🎫
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8000
short_description: Deterministic B2B SaaS support benchmark.
pinned: false
AcmeCloud Customer Support Ticket Handler
A deterministic OpenEnv-style environment for training and evaluating agents on realistic B2B SaaS support workflows.
What It Simulates
Each episode is one inbound customer-support ticket at a fictional company, AcmeCloud.
The agent acts like a support representative and must choose the right sequence of typed tool actions to handle the ticket correctly.
The benchmark ships with three fixed tasks:
- password_reset_guidance
- duplicate_charge_refund
- enterprise_data_loss_escalation
Why This Is Useful
This environment models a real operational task rather than a toy game:
- reading support tickets
- searching internal knowledge base articles
- looking up customer account details
- deciding whether to resolve, refund, or escalate
- sending customer-facing replies under policy constraints
The environment is fully deterministic and graded without any LLM judge, which makes it suitable for reproducible RL rollouts and benchmark evaluation.
Action Space
The agent can take exactly six typed actions:
- search_kb(query: str)
- lookup_account(customer_id: str)
- send_reply(message: str)
- issue_refund(amount_cents: int, reason_code: "duplicate_charge")
- resolve_ticket(resolution_code: "password_reset_guidance" | "billing_refund_processed")
- escalate_ticket(queue: "support_lead" | "legal_data_incident", priority: "P2" | "P0", summary: str)
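The six signatures above could be modeled as typed Python objects roughly as follows. This is a minimal sketch, not the code in support_ticket_env/: the class names, the validate helper, and the rule that amount_cents must be positive are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Literal, Optional, Union, get_args

# Allowed literal values, copied from the action signatures in this README.
ReasonCode = Literal["duplicate_charge"]
ResolutionCode = Literal["password_reset_guidance", "billing_refund_processed"]
Queue = Literal["support_lead", "legal_data_incident"]
Priority = Literal["P2", "P0"]


@dataclass(frozen=True)
class SearchKB:
    query: str


@dataclass(frozen=True)
class LookupAccount:
    customer_id: str


@dataclass(frozen=True)
class SendReply:
    message: str


@dataclass(frozen=True)
class IssueRefund:
    amount_cents: int
    reason_code: ReasonCode = "duplicate_charge"


@dataclass(frozen=True)
class ResolveTicket:
    resolution_code: ResolutionCode


@dataclass(frozen=True)
class EscalateTicket:
    queue: Queue
    priority: Priority
    summary: str


Action = Union[SearchKB, LookupAccount, SendReply, IssueRefund, ResolveTicket, EscalateTicket]


def validate(action: Action) -> Optional[str]:
    """Return an error message for an invalid action, or None if it is valid.

    Dataclasses do not enforce Literal types at runtime, so the allowed
    values are checked explicitly here.
    """
    if isinstance(action, IssueRefund):
        if action.amount_cents <= 0:  # assumption: refunds must be positive
            return "amount_cents must be positive"
        if action.reason_code not in get_args(ReasonCode):
            return f"unknown reason_code: {action.reason_code}"
    elif isinstance(action, ResolveTicket):
        if action.resolution_code not in get_args(ResolutionCode):
            return f"unknown resolution_code: {action.resolution_code}"
    elif isinstance(action, EscalateTicket):
        if action.queue not in get_args(Queue):
            return f"unknown queue: {action.queue}"
        if action.priority not in get_args(Priority):
            return f"unknown priority: {action.priority}"
    return None
```

For example, validate(EscalateTicket("legal_data_incident", "P0", "Data loss with legal threat")) returns None, while validate(ResolveTicket("closed")) returns an error string.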
Observation Space
Each observation includes:
- task and ticket identifiers
- current ticket status
- customer metadata
- customer message and full conversation history
- the last tool result
- steps taken / remaining
- available action types
- last action error
- accumulated known facts learned from prior tool calls
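One way to picture an observation is as a flat record with one field per item in the list above. The sketch below is illustrative only: every field name except last_action_error (which the baseline output contract mentions) is a guess, and the real models in support_ticket_env/ may be shaped differently.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Observation:
    """Hypothetical observation record; field names are assumptions."""

    task_id: str
    ticket_id: str
    ticket_status: str
    customer: Dict[str, Any]                    # customer metadata
    customer_message: str
    conversation: List[Dict[str, str]]          # full conversation history
    last_tool_result: Optional[Dict[str, Any]]  # None before the first tool call
    steps_taken: int
    steps_remaining: int
    available_actions: List[str]
    last_action_error: Optional[str] = None
    known_facts: Dict[str, Any] = field(default_factory=dict)


# Example of what an initial observation might look like (values are made up).
initial = Observation(
    task_id="password_reset_guidance",
    ticket_id="T-0001",
    ticket_status="open",
    customer={"customer_id": "C-0001"},
    customer_message="I never received the reset email.",
    conversation=[],
    last_tool_result=None,
    steps_taken=0,
    steps_remaining=8,
    available_actions=["search_kb", "lookup_account", "send_reply"],
)
```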
Reward Design
The environment uses rubric-based reward shaping.
- Each task has a deterministic scorecard in [0.0, 1.0]
- Step reward is score_delta - 0.01 - invalid_penalty - redundancy_penalty
- Repeated search/lookup actions incur -0.02
- Invalid actions incur -0.10
- resolve_ticket and escalate_ticket terminate the episode
- issue_refund changes state but does not terminate the episode
Global success threshold: 0.75
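The step reward rule can be written out as a small function. This is a sketch based only on the numbers listed above; the function and constant names are made up, and treating the 0.75 threshold as inclusive is an assumption.

```python
STEP_COST = 0.01           # flat per-step cost from the formula
REDUNDANCY_PENALTY = 0.02  # repeated search/lookup actions
INVALID_PENALTY = 0.10     # invalid actions
SUCCESS_THRESHOLD = 0.75   # global success threshold


def step_reward(prev_score: float, new_score: float, *,
                invalid: bool = False, redundant: bool = False) -> float:
    """score_delta - 0.01 - invalid_penalty - redundancy_penalty."""
    score_delta = new_score - prev_score
    invalid_penalty = INVALID_PENALTY if invalid else 0.0
    redundancy_penalty = REDUNDANCY_PENALTY if redundant else 0.0
    return score_delta - STEP_COST - invalid_penalty - redundancy_penalty


def is_success(final_score: float) -> bool:
    # Assumption: a score exactly at the threshold counts as success.
    return final_score >= SUCCESS_THRESHOLD
```

Under this reading, a step that raises the scorecard from 0.25 to 0.50 earns about 0.24, a repeated lookup that changes nothing earns about -0.03, and an invalid action that changes nothing earns about -0.11.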
Task Details
1. Password Reset Guidance
Customer issue: reset email did not arrive.
Expected flow:
- search password reset KB article
- send reply with reset URL and spam/junk guidance
- resolve with password_reset_guidance
2. Duplicate Charge Refund
Customer issue: billed twice for the current subscription period.
Expected flow:
- look up the account
- search the refund policy
- issue the verified duplicate-charge refund
- reply with apology and timeline
- resolve with billing_refund_processed
3. Enterprise Data Loss Escalation
Customer issue: enterprise data-loss complaint with legal threat.
Expected flow:
- look up the account
- send a careful acknowledgment reply
- escalate to legal_data_incident with P0
- do not refund
- do not resolve
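The three expected flows can be summarized as ordered lists of action types. The sketch below is not the environment's scorer: it only checks that the expected steps appear in order (other actions may be interleaved) and that task 3 contains no refund or resolve. Both the strict ordering and the helper names are assumptions.

```python
from typing import Dict, List, Set

EXPECTED_FLOWS: Dict[str, List[str]] = {
    "password_reset_guidance": ["search_kb", "send_reply", "resolve_ticket"],
    "duplicate_charge_refund": [
        "lookup_account", "search_kb", "issue_refund", "send_reply", "resolve_ticket",
    ],
    "enterprise_data_loss_escalation": ["lookup_account", "send_reply", "escalate_ticket"],
}

# Actions the task description explicitly rules out.
FORBIDDEN: Dict[str, Set[str]] = {
    "enterprise_data_loss_escalation": {"issue_refund", "resolve_ticket"},
}


def matches_expected_flow(task_id: str, action_types: List[str]) -> bool:
    """True if the expected steps occur in order and no forbidden action is used."""
    if FORBIDDEN.get(task_id, set()) & set(action_types):
        return False
    remaining = iter(action_types)
    # `step in remaining` advances the iterator, so this is a subsequence check.
    return all(step in remaining for step in EXPECTED_FLOWS[task_id])
```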
Project Layout
- support_ticket_env/: models, fixtures, scoring, environment core, policy helpers, local/HTTP client
- server/: FastAPI app and Dockerfile
- tests/: unit and scenario tests
- inference.py: baseline runner using the OpenAI client interface
- openenv.yaml: environment metadata
Local Setup
python -m pip install -e .[dev]
pytest
uvicorn server.app:app --host 0.0.0.0 --port 8000 --reload
Open the docs at http://localhost:8000/docs or the simple UI at http://localhost:8000/web.
Docker
docker build -t customer-support-openenv .
docker run -p 8000:8000 customer-support-openenv
Baseline Inference
The baseline script uses the OpenAI client interface and supports any OpenAI-compatible endpoint.
Mandatory environment variables for hosted model inference:
- HF_TOKEN
- API_BASE_URL
- MODEL_NAME
Optional environment variables:
- ENV_BASE_URL to target a running local server or deployed HF Space
- LOCAL_IMAGE_NAME if you want the script to instantiate the environment via from_docker_image(...)
Inference environment selection:
- LOCAL_IMAGE_NAME set: use from_docker_image(...)
- otherwise ENV_BASE_URL set: use the running HTTP environment
- otherwise: use the in-process local environment for offline reproducibility
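The precedence order can be expressed as a small pure function. This is a sketch of the selection rule only, not the code in inference.py: the function name and the returned mode labels are hypothetical, and treating an empty variable as unset is an assumption.

```python
import os
from typing import Mapping, Optional, Tuple


def select_environment(env: Mapping[str, str] = os.environ) -> Tuple[str, Optional[str]]:
    """Return (mode, target) following the documented precedence."""
    image = env.get("LOCAL_IMAGE_NAME")
    if image:  # assumption: an empty string counts as unset
        return ("docker", image)   # caller would use from_docker_image(image)
    base_url = env.get("ENV_BASE_URL")
    if base_url:
        return ("http", base_url)  # caller would connect to the running server
    return ("local", None)         # in-process environment, no network needed
```

Passing a plain dict instead of os.environ makes the rule easy to check: when both variables are set, LOCAL_IMAGE_NAME wins.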
Run:
python inference.py
The script emits strict stdout lines in the required format:
- [START]
- [STEP]
- [END]
Output contract:
- one [START] line per task
- one [STEP] line immediately after each env.step()
- one [END] line per task, even on exception
- reward, rewards, and score formatted to 2 decimal places
- done and success emitted as lowercase booleans
- error emitted as the raw last_action_error string or null
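The formatting rules in the contract can be illustrated with a few helpers. The contract above fixes the number format, the boolean casing, and the null convention; the key=value layout and the exact set of keys on each line are assumptions made for this sketch, so check inference.py for the real line format.

```python
from typing import List, Optional


def fmt_num(value: float) -> str:
    """reward, rewards, and score use exactly two decimal places."""
    return f"{value:.2f}"


def fmt_bool(value: bool) -> str:
    """done and success are lowercase booleans."""
    return "true" if value else "false"


def fmt_error(last_action_error: Optional[str]) -> str:
    """The raw error string, or the literal null when there is no error."""
    return "null" if last_action_error is None else last_action_error


def step_line(step: int, action: str, reward: float, done: bool,
              last_action_error: Optional[str]) -> str:
    # Hypothetical key=value layout; only the value formats come from the contract.
    return (f"[STEP] step={step} action={action} reward={fmt_num(reward)} "
            f"done={fmt_bool(done)} error={fmt_error(last_action_error)}")


def end_line(success: bool, score: float, rewards: List[float]) -> str:
    # Hypothetical layout, as above.
    joined = ",".join(fmt_num(r) for r in rewards)
    return f"[END] success={fmt_bool(success)} score={fmt_num(score)} rewards={joined}"
```

To honor the "even on exception" rule, a runner would typically emit the [END] line from a finally block around the episode loop.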
If the model call fails or credentials are missing, the script falls back to a deterministic scripted policy so the benchmark still runs reproducibly.
Example Gold Scores
Using the included scripted policy:
- password_reset_guidance: 1.0
- duplicate_charge_refund: 1.0
- enterprise_data_loss_escalation: 1.0
Deployment Notes
- HF Space page: https://huggingface.co/spaces/Dar3devil/customer-support-openenv
- HF app URL: https://dar3devil-customer-support-openenv.hf.space
- The app exposes /health, /reset, /step, /state, /docs, /web, and /ws
- Sessions are managed in-memory
- No external services are required to run the environment server itself
- The benchmark is designed to fit comfortably in the hackathon resource limits
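For a quick check from Python instead of curl, the /health and /reset endpoints can be called with the standard library alone. This is a sketch: the helper names are made up, the empty JSON body for /reset mirrors the curl smoke tests in this README, and nothing is assumed about the response schema, so responses are returned as raw text.

```python
import json
import urllib.request
from typing import Optional


def build_request(base_url: str, path: str,
                  payload: Optional[dict] = None) -> urllib.request.Request:
    """Build a GET request, or a JSON POST request when a payload is given."""
    url = base_url.rstrip("/") + path
    if payload is None:
        return urllib.request.Request(url, method="GET")
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call(request: urllib.request.Request, timeout: float = 10.0) -> str:
    """Send the request and return the response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Usage against a locally running server (requires the server to be up):
#   print(call(build_request("http://localhost:8000", "/health")))
#   print(call(build_request("http://localhost:8000", "/reset", {})))
```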
Validation
If openenv is installed locally, run:
openenv validate
Pre-Submission Commands
Local checks:
cd "C:\Users\aarya\.codex\worktrees\e74f\Task Scheduler\customer_support_openenv"
openenv validate
pytest -q
Baseline run:
$env:HF_TOKEN="<your-hf-token>"
$env:API_BASE_URL="https://router.huggingface.co/v1"
$env:MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
python inference.py
Local Docker smoke test:
docker build -t customer-support-openenv .
docker run --rm -p 8000:8000 customer-support-openenv
curl.exe -sS http://localhost:8000/health
curl.exe -sS -X POST http://localhost:8000/reset -H "Content-Type: application/json" -d "{}"
Live Space smoke test:
curl.exe -sS https://dar3devil-customer-support-openenv.hf.space/health
curl.exe -sS -X POST https://dar3devil-customer-support-openenv.hf.space/reset -H "Content-Type: application/json" -d "{}"
Submission validator:
wsl bash -lc "cd '/mnt/c/Users/aarya/.codex/worktrees/e74f/Task Scheduler/customer_support_openenv' && chmod +x scripts/validate-submission.sh && scripts/validate-submission.sh https://dar3devil-customer-support-openenv.hf.space ."
Windows users should run the validator script through WSL or Git Bash.
This repository does not depend on an LLM judge for grading. All graders are deterministic and implemented directly in the environment scorer.