title: Customer Support OpenEnv
emoji: 🎫
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8000
short_description: Deterministic B2B SaaS support benchmark.
pinned: false

AcmeCloud Customer Support Ticket Handler

A deterministic OpenEnv-style environment for training and evaluating agents on realistic B2B SaaS support workflows.

What It Simulates

Each episode is one inbound customer-support ticket at a fictional company, AcmeCloud. The agent acts like a support representative and must choose the right sequence of typed tool actions to handle the ticket correctly.

The benchmark ships with three fixed tasks:

  1. password_reset_guidance
  2. duplicate_charge_refund
  3. enterprise_data_loss_escalation

Why This Is Useful

This environment models a real operational task rather than a toy game:

  • reading support tickets
  • searching internal knowledge base articles
  • looking up customer account details
  • deciding whether to resolve, refund, or escalate
  • sending customer-facing replies under policy constraints

The environment is fully deterministic and graded without any LLM judge, which makes it suitable for reproducible RL rollouts and benchmark evaluation.

Action Space

The agent can take exactly six typed actions:

  • search_kb(query: str)
  • lookup_account(customer_id: str)
  • send_reply(message: str)
  • issue_refund(amount_cents: int, reason_code: "duplicate_charge")
  • resolve_ticket(resolution_code: "password_reset_guidance" | "billing_refund_processed")
  • escalate_ticket(queue: "support_lead" | "legal_data_incident", priority: "P2" | "P0", summary: str)
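The six signatures above can be sketched as plain Python types. This is a minimal illustration, not the models shipped in support_ticket_env/: the class names and the is_terminal helper are hypothetical, and only the parameter names and allowed literal values come from the list above.

```python
from dataclasses import dataclass
from typing import Literal, Union


@dataclass(frozen=True)
class SearchKb:
    query: str


@dataclass(frozen=True)
class LookupAccount:
    customer_id: str


@dataclass(frozen=True)
class SendReply:
    message: str


@dataclass(frozen=True)
class IssueRefund:
    amount_cents: int
    reason_code: Literal["duplicate_charge"] = "duplicate_charge"


@dataclass(frozen=True)
class ResolveTicket:
    resolution_code: Literal["password_reset_guidance", "billing_refund_processed"]


@dataclass(frozen=True)
class EscalateTicket:
    queue: Literal["support_lead", "legal_data_incident"]
    priority: Literal["P2", "P0"]
    summary: str


# Hypothetical union type covering all six actions.
Action = Union[SearchKb, LookupAccount, SendReply, IssueRefund, ResolveTicket, EscalateTicket]

# Per this README, only resolve_ticket and escalate_ticket end the episode.
TERMINAL_ACTIONS = (ResolveTicket, EscalateTicket)


def is_terminal(action: Action) -> bool:
    """Return True if the action would terminate the episode."""
    return isinstance(action, TERMINAL_ACTIONS)
```

Literal annotations are not enforced at runtime by dataclasses, so a real implementation would still need validation (for example through a schema library) to reject unknown codes.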

Observation Space

Each observation includes:

  • task and ticket identifiers
  • current ticket status
  • customer metadata
  • customer message and full conversation history
  • the last tool result
  • steps taken / remaining
  • available action types
  • last action error
  • accumulated known facts learned from prior tool calls
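As a rough picture of the fields listed above, an observation could be modelled like this. All field names and types here are assumptions made for illustration; the actual schema lives in support_ticket_env/ and may differ.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Observation:
    # Field names are hypothetical; they mirror the bullet list above.
    task_id: str
    ticket_id: str
    ticket_status: str
    customer: Dict[str, Any]
    customer_message: str
    conversation: List[Dict[str, str]]
    last_tool_result: Optional[Dict[str, Any]]
    steps_taken: int
    steps_remaining: int
    available_actions: List[str]
    last_action_error: Optional[str] = None
    known_facts: Dict[str, Any] = field(default_factory=dict)


def can_act(obs: Observation, action_type: str) -> bool:
    """True if there are steps left and the action type is currently offered."""
    return obs.steps_remaining > 0 and action_type in obs.available_actions
```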

Reward Design

The environment uses rubric-based reward shaping.

  • Each task has a deterministic scorecard in [0.0, 1.0]
  • Step reward is score_delta - 0.01 - invalid_penalty - redundancy_penalty, where 0.01 is a fixed per-step cost
  • Repeated search/lookup actions incur a redundancy_penalty of 0.02
  • Invalid actions incur an invalid_penalty of 0.10
  • resolve_ticket and escalate_ticket terminate the episode
  • issue_refund changes state but does not terminate the episode

Global success threshold: 0.75
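The reward rule can be written out as a small function. This is a sketch based only on the numbers stated above, not the environment's scorer: the function names are hypothetical, and treating the threshold as inclusive (>=) is an assumption.

```python
STEP_COST = 0.01            # subtracted on every step
REDUNDANCY_PENALTY = 0.02   # repeated search/lookup action
INVALID_PENALTY = 0.10      # invalid action
SUCCESS_THRESHOLD = 0.75    # global success threshold


def step_reward(prev_score: float, new_score: float, *,
                invalid: bool = False, redundant: bool = False) -> float:
    """score_delta - 0.01 - invalid_penalty - redundancy_penalty."""
    reward = (new_score - prev_score) - STEP_COST
    if invalid:
        reward -= INVALID_PENALTY
    if redundant:
        reward -= REDUNDANCY_PENALTY
    # Round to suppress floating-point noise in the illustration.
    return round(reward, 4)


def is_success(final_score: float) -> bool:
    # Assumption: a score exactly at the threshold counts as success.
    return final_score >= SUCCESS_THRESHOLD
```

For example, a step that raises the scorecard from 0.00 to 0.25 earns 0.24, while a repeated lookup that changes nothing earns -0.03.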

Task Details

1. Password Reset Guidance

Customer issue: reset email did not arrive.

Expected flow:

  • search password reset KB article
  • send reply with reset URL and spam/junk guidance
  • resolve with password_reset_guidance

2. Duplicate Charge Refund

Customer issue: billed twice for the current subscription period.

Expected flow:

  • look up the account
  • search the refund policy
  • issue the verified duplicate-charge refund
  • reply with apology and timeline
  • resolve with billing_refund_processed

3. Enterprise Data Loss Escalation

Customer issue: enterprise data-loss complaint with legal threat.

Expected flow:

  • look up the account
  • send a careful acknowledgment reply
  • escalate to legal_data_incident with P0
  • do not refund
  • do not resolve
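The three expected flows can be summarised as ordered lists of action types. The check below is a simplified illustration, not the rubric used by the environment scorer: it assumes the expected actions must appear in the listed order (extra actions in between are tolerated), and the names EXPECTED_FLOWS, FORBIDDEN_ACTIONS, and follows_expected_flow are hypothetical.

```python
from typing import Dict, List, Sequence, Set

EXPECTED_FLOWS: Dict[str, List[str]] = {
    "password_reset_guidance": ["search_kb", "send_reply", "resolve_ticket"],
    "duplicate_charge_refund": [
        "lookup_account", "search_kb", "issue_refund", "send_reply", "resolve_ticket",
    ],
    "enterprise_data_loss_escalation": ["lookup_account", "send_reply", "escalate_ticket"],
}

# Task 3 explicitly says: do not refund, do not resolve.
FORBIDDEN_ACTIONS: Dict[str, Set[str]] = {
    "enterprise_data_loss_escalation": {"issue_refund", "resolve_ticket"},
}


def follows_expected_flow(task: str, actions: Sequence[str]) -> bool:
    """True if no forbidden action is used and the expected steps occur in order."""
    if FORBIDDEN_ACTIONS.get(task, set()) & set(actions):
        return False
    remaining = iter(actions)
    # `step in remaining` advances the iterator, giving a subsequence check.
    return all(step in remaining for step in EXPECTED_FLOWS[task])
```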

Project Layout

  • support_ticket_env/: models, fixtures, scoring, environment core, policy helpers, local/HTTP client
  • server/: FastAPI app and Dockerfile
  • tests/: unit and scenario tests
  • inference.py: baseline runner using the OpenAI client interface
  • openenv.yaml: environment metadata

Local Setup

python -m pip install -e .[dev]
pytest
uvicorn server.app:app --host 0.0.0.0 --port 8000 --reload

Open the docs at http://localhost:8000/docs or the simple UI at http://localhost:8000/web.

Docker

docker build -t customer-support-openenv .
docker run -p 8000:8000 customer-support-openenv

Baseline Inference

The baseline script uses the OpenAI client interface and supports any OpenAI-compatible endpoint.

Mandatory environment variables for hosted model inference:

  • HF_TOKEN
  • API_BASE_URL
  • MODEL_NAME

Optional environment variables:

  • ENV_BASE_URL to target a running local server or deployed HF Space
  • LOCAL_IMAGE_NAME if you want the script to instantiate the environment via from_docker_image(...)

Inference environment selection:

  1. LOCAL_IMAGE_NAME set: use from_docker_image(...)
  2. otherwise ENV_BASE_URL set: use the running HTTP environment
  3. otherwise: use the in-process local environment for offline reproducibility
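The precedence above amounts to a short chain of checks. The sketch below only shows the selection order; select_env_backend and its return values are hypothetical, inference.py may structure this differently, and treating an empty variable as unset is an assumption.

```python
import os
from typing import Mapping, Optional, Tuple


def select_env_backend(environ: Optional[Mapping[str, str]] = None) -> Tuple[str, Optional[str]]:
    """Return (mode, target) following the documented precedence."""
    env = os.environ if environ is None else environ
    image = env.get("LOCAL_IMAGE_NAME")
    if image:
        # 1. Docker image takes priority over everything else.
        return ("docker", image)
    base_url = env.get("ENV_BASE_URL")
    if base_url:
        # 2. Otherwise talk to a running HTTP environment.
        return ("http", base_url)
    # 3. Otherwise fall back to the in-process local environment.
    return ("local", None)
```

Passing a plain dict instead of os.environ keeps the function easy to test without touching the real process environment.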

Run:

python inference.py

The script emits strict stdout lines in the required format:

  • [START]
  • [STEP]
  • [END]

Output contract:

  • one [START] line per task
  • one [STEP] line immediately after each env.step()
  • one [END] line per task, even on exception
  • reward, rewards, and score formatted to 2 decimal places
  • done and success emitted as lowercase booleans
  • error emitted as the raw last_action_error string or null
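The formatting rules in this contract can be illustrated with a few helpers. This README does not spell out the exact field layout of each line, so the key=value layout below is purely an assumption for illustration; only the 2-decimal floats, lowercase booleans, null for a missing error, and the guarantee of an [END] line on exception come from the contract. All function names are hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def fmt_float(value: float) -> str:
    return f"{value:.2f}"


def fmt_bool(value: bool) -> str:
    return "true" if value else "false"


def fmt_error(error: Optional[str]) -> str:
    return "null" if error is None else error


def step_line(step: int, reward: float, done: bool, error: Optional[str]) -> str:
    # Field names and ordering are assumed, not taken from inference.py.
    return (f"[STEP] step={step} reward={fmt_float(reward)} "
            f"done={fmt_bool(done)} error={fmt_error(error)}")


def end_line(success: bool, score: float, rewards: Sequence[float]) -> str:
    joined = ",".join(fmt_float(r) for r in rewards)
    return f"[END] success={fmt_bool(success)} score={fmt_float(score)} rewards={joined}"


def run_task(task: str,
             run_episode: Callable[[str], Tuple[bool, float, List[float]]],
             emit: Callable[[str], None] = print) -> None:
    """Emit one [START] and exactly one [END], even if the episode raises."""
    emit(f"[START] task={task}")
    success, score, rewards = False, 0.0, []
    try:
        success, score, rewards = run_episode(task)
    finally:
        emit(end_line(success, score, rewards))
```

Placing the [END] emission in a finally block is one simple way to satisfy the "even on exception" requirement; the exception still propagates to the caller.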

If the model call fails or credentials are missing, the script falls back to a deterministic scripted policy so the benchmark still runs reproducibly.

Example Gold Scores

Using the included scripted policy:

  • password_reset_guidance: 1.0
  • duplicate_charge_refund: 1.0
  • enterprise_data_loss_escalation: 1.0

Deployment Notes

  • HF Space page: https://huggingface.co/spaces/Dar3devil/customer-support-openenv
  • HF app URL: https://dar3devil-customer-support-openenv.hf.space
  • The app exposes /health, /reset, /step, /state, /docs, /web, and /ws
  • Sessions are managed in-memory
  • No external services are required to run the environment server itself
  • The benchmark is designed to fit comfortably in the hackathon resource limits
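For scripted checks against the endpoints listed above, a standard-library client is enough. This is a minimal sketch: the helper names are hypothetical, the empty JSON body for /reset mirrors the curl smoke test later in this README, and decoding the response as JSON is an assumption about the server.

```python
import json
import urllib.request
from typing import Any, Dict, Optional


def build_reset_request(base_url: str, payload: Optional[Dict[str, Any]] = None) -> urllib.request.Request:
    """Build a POST /reset request with a JSON body (defaults to {})."""
    body = json.dumps(payload or {}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/reset",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_health_request(base_url: str) -> urllib.request.Request:
    """Build a GET /health request."""
    return urllib.request.Request(url=base_url.rstrip("/") + "/health", method="GET")


def call_json(request: urllib.request.Request, timeout: float = 10.0) -> Any:
    """Send the request and decode the body as JSON (assumed response format)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server running locally, call_json(build_health_request("http://localhost:8000")) would perform the health check.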

Validation

If openenv is installed locally, run:

openenv validate

Pre-Submission Commands

Local checks (PowerShell; replace the path below with the location of your own checkout):

cd "C:\Users\aarya\.codex\worktrees\e74f\Task Scheduler\customer_support_openenv"
openenv validate
pytest -q

Baseline run (PowerShell syntax for setting environment variables):

$env:HF_TOKEN="<your-hf-token>"
$env:API_BASE_URL="https://router.huggingface.co/v1"
$env:MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
python inference.py

Local Docker smoke test:

docker build -t customer-support-openenv .
docker run --rm -p 8000:8000 customer-support-openenv
curl.exe -sS http://localhost:8000/health
curl.exe -sS -X POST http://localhost:8000/reset -H "Content-Type: application/json" -d "{}"

Live Space smoke test:

curl.exe -sS https://dar3devil-customer-support-openenv.hf.space/health
curl.exe -sS -X POST https://dar3devil-customer-support-openenv.hf.space/reset -H "Content-Type: application/json" -d "{}"

Submission validator (replace the /mnt/c/... path with the location of your own checkout):

wsl bash -lc "cd '/mnt/c/Users/aarya/.codex/worktrees/e74f/Task Scheduler/customer_support_openenv' && chmod +x scripts/validate-submission.sh && scripts/validate-submission.sh https://dar3devil-customer-support-openenv.hf.space ."

Windows users should run the validator script through WSL or Git Bash.

This repository does not depend on an LLM judge for grading. All graders are deterministic and implemented directly in the environment scorer.