metadata

title: Customer Support OpenEnv
emoji: 🎫
colorFrom: blue
colorTo: green
sdk: docker
app_port: 8000
short_description: Deterministic B2B SaaS support benchmark.
pinned: false

AcmeCloud Customer Support Ticket Handler

A deterministic OpenEnv-style environment for training and evaluating agents on realistic B2B SaaS support workflows.

What It Simulates

Each episode is one inbound customer-support ticket at a fictional company, AcmeCloud. The agent acts like a support representative and must choose the right sequence of typed tool actions to handle the ticket correctly.

The benchmark ships with three fixed tasks:

password_reset_guidance
duplicate_charge_refund
enterprise_data_loss_escalation

Why This Is Useful

This environment models a real operational task rather than a toy game:

reading support tickets
searching internal knowledge base articles
looking up customer account details
deciding whether to resolve, refund, or escalate
sending customer-facing replies under policy constraints

The environment is fully deterministic and graded without any LLM judge, which makes it suitable for reproducible RL rollouts and benchmark evaluation.

Action Space

The agent can take exactly six typed actions:

search_kb(query: str)
lookup_account(customer_id: str)
send_reply(message: str)
issue_refund(amount_cents: int, reason_code: "duplicate_charge")
resolve_ticket(resolution_code: "password_reset_guidance" | "billing_refund_processed")
escalate_ticket(queue: "support_lead" | "legal_data_incident", priority: "P2" | "P0", summary: str)
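
As a rough illustration, the six actions above could be represented as plain data and validated before being sent to the environment. This is a sketch under assumptions, not the package's actual models: the `ACTION_SCHEMA` table, the `make_action` helper, and the `type`/`args` payload keys are hypothetical. Only the action names, parameter names, and allowed literal values are taken from the list above.

```python
from typing import Any

# Parameters per action, copied from the action list above.
# None means "any value of the stated type"; a tuple lists the allowed literals.
ACTION_SCHEMA: dict[str, dict[str, Any]] = {
    "search_kb": {"query": None},
    "lookup_account": {"customer_id": None},
    "send_reply": {"message": None},
    "issue_refund": {"amount_cents": None, "reason_code": ("duplicate_charge",)},
    "resolve_ticket": {
        "resolution_code": ("password_reset_guidance", "billing_refund_processed")
    },
    "escalate_ticket": {
        "queue": ("support_lead", "legal_data_incident"),
        "priority": ("P2", "P0"),
        "summary": None,
    },
}


def make_action(action_type: str, **args: Any) -> dict[str, Any]:
    """Build a hypothetical action payload, rejecting unknown names and literals."""
    if action_type not in ACTION_SCHEMA:
        raise ValueError(f"unknown action type: {action_type}")
    schema = ACTION_SCHEMA[action_type]
    if set(args) != set(schema):
        raise ValueError(f"{action_type} expects parameters {sorted(schema)}")
    for name, allowed in schema.items():
        if allowed is not None and args[name] not in allowed:
            raise ValueError(f"{name} must be one of {allowed}")
    # The {"type": ..., "args": ...} layout is an assumption for illustration.
    return {"type": action_type, "args": args}
```

Rejecting a bad literal on the client side would avoid spending a step on an action the environment is likely to treat as invalid.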

Observation Space

Each observation includes:

task and ticket identifiers
current ticket status
customer metadata
customer message and full conversation history
the last tool result
steps taken / remaining
available action types
last action error
accumulated known facts learned from prior tool calls

Reward Design

The environment uses rubric-based reward shaping.

Each task has a deterministic scorecard in [0.0, 1.0]
Step reward is score_delta - 0.01 - invalid_penalty - redundancy_penalty
Repeated search/lookup actions incur a redundancy_penalty of 0.02
Invalid actions incur an invalid_penalty of 0.10
resolve_ticket and escalate_ticket terminate the episode
issue_refund changes state but does not terminate the episode

Global success threshold: 0.75
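
The reward rules above can be worked through with a small function. This is a minimal sketch, not the environment's scorer: the function and constant names are hypothetical, the penalties are read as magnitudes that get subtracted, and treating a score exactly equal to 0.75 as a success is an assumption.

```python
STEP_COST = 0.01           # flat per-step cost from the step reward formula
INVALID_PENALTY = 0.10     # applied when the action is invalid
REDUNDANCY_PENALTY = 0.02  # applied to a repeated search/lookup
SUCCESS_THRESHOLD = 0.75   # global success threshold


def step_reward(score_before: float, score_after: float,
                invalid: bool = False, redundant: bool = False) -> float:
    """Return score_delta - 0.01 - invalid_penalty - redundancy_penalty."""
    score_delta = score_after - score_before
    reward = score_delta - STEP_COST
    if invalid:
        reward -= INVALID_PENALTY
    if redundant:
        reward -= REDUNDANCY_PENALTY
    return reward


def is_success(final_score: float) -> bool:
    """Assumes success means the final score meets or exceeds the threshold."""
    return final_score >= SUCCESS_THRESHOLD
```

Under this reading, a repeated search that leaves the scorecard unchanged yields 0 - 0.01 - 0.02 = -0.03, and an invalid action that leaves it unchanged yields 0 - 0.01 - 0.10 = -0.11.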

Task Details

1. Password Reset Guidance

Customer issue: reset email did not arrive.

Expected flow:

search for the password reset KB article
send reply with reset URL and spam/junk guidance
resolve with password_reset_guidance

2. Duplicate Charge Refund

Customer issue: billed twice for the current subscription period.

Expected flow:

look up the account
search the refund policy
issue the verified duplicate-charge refund
reply with apology and timeline
resolve with billing_refund_processed

3. Enterprise Data Loss Escalation

Customer issue: enterprise data-loss complaint with legal threat.

Expected flow:

look up the account
send a careful acknowledgment reply
escalate to legal_data_incident with P0
do not refund
do not resolve
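
The three expected flows above can be summarized as ordered action-type sequences. The sketch below is illustrative only and is not the environment's grader: `EXPECTED_FLOWS`, `FORBIDDEN`, and `follows_expected_flow` are hypothetical names, and requiring the steps to appear in this exact order is an assumption, since the real scorer is rubric-based and may be more lenient.

```python
# Expected action types per task, in the order listed in the task details.
EXPECTED_FLOWS = {
    "password_reset_guidance": ["search_kb", "send_reply", "resolve_ticket"],
    "duplicate_charge_refund": [
        "lookup_account", "search_kb", "issue_refund", "send_reply", "resolve_ticket",
    ],
    "enterprise_data_loss_escalation": ["lookup_account", "send_reply", "escalate_ticket"],
}

# Actions the escalation task says must not be taken.
FORBIDDEN = {"enterprise_data_loss_escalation": {"issue_refund", "resolve_ticket"}}


def follows_expected_flow(task: str, actions: list[str]) -> bool:
    """Check that expected steps appear in order and no forbidden action is used."""
    if any(a in FORBIDDEN.get(task, set()) for a in actions):
        return False
    remaining = iter(actions)
    # Subsequence match: each expected step must occur after the previous one.
    return all(step in remaining for step in EXPECTED_FLOWS[task])
```

A check like this could be used in a test to confirm that a scripted policy visits the expected steps, without depending on the exact reply text.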

Project Layout

support_ticket_env/: models, fixtures, scoring, environment core, policy helpers, local/HTTP client
server/: FastAPI app and Dockerfile
tests/: unit and scenario tests
inference.py: baseline runner using the OpenAI client interface
openenv.yaml: environment metadata

Local Setup

python -m pip install -e .[dev]
pytest
uvicorn server.app:app --host 0.0.0.0 --port 8000 --reload

Open the docs at http://localhost:8000/docs or the simple UI at http://localhost:8000/web.

Docker

docker build -t customer-support-openenv .
docker run -p 8000:8000 customer-support-openenv

Baseline Inference

The baseline script uses the OpenAI client interface and supports any OpenAI-compatible endpoint.

Mandatory environment variables for hosted model inference:

HF_TOKEN
API_BASE_URL
MODEL_NAME

Optional environment variables:

ENV_BASE_URL to target a running local server or deployed HF Space
LOCAL_IMAGE_NAME if you want the script to instantiate the environment via from_docker_image(...)

Inference environment selection:

LOCAL_IMAGE_NAME set: use from_docker_image(...)
otherwise ENV_BASE_URL set: use the running HTTP environment
otherwise: use the in-process local environment for offline reproducibility
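
The precedence above amounts to a three-way branch. The following is a sketch of that selection logic only: `select_env_mode` and its return labels are hypothetical, and treating an empty variable as unset is an assumption. How `inference.py` actually constructs each environment is not shown here.

```python
import os
from typing import Mapping, Optional


def select_env_mode(env: Optional[Mapping[str, str]] = None) -> str:
    """Return "docker", "http", or "local" following the documented precedence."""
    env = os.environ if env is None else env
    if env.get("LOCAL_IMAGE_NAME"):
        return "docker"  # the script would call from_docker_image(...)
    if env.get("ENV_BASE_URL"):
        return "http"    # the script would use the running HTTP environment
    return "local"       # in-process environment for offline reproducibility
```

Note that `LOCAL_IMAGE_NAME` wins when both variables are set, which matches the order of the list above.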

Run:

python inference.py

The script writes strictly formatted lines to stdout, each beginning with one of three tags:

[START]
[STEP]
[END]

Output contract:

one [START] line per task
one [STEP] line immediately after each env.step()
one [END] line per task, even on exception
reward, rewards, and score formatted to 2 decimal places
done and success emitted as lowercase booleans
error emitted as the raw last_action_error string or null
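
The value-formatting rules in the output contract could be implemented with a few small helpers. This is a sketch of the formatting rules only: the helper names are hypothetical, the comma separator for the rewards list is an assumption, and the full layout of the [START], [STEP], and [END] lines is not specified here, so it is not reproduced.

```python
from typing import Optional, Sequence


def fmt_float(value: float) -> str:
    """Format a reward or score to exactly 2 decimal places."""
    return f"{value:.2f}"


def fmt_bool(value: bool) -> str:
    """Emit lowercase true/false rather than Python's True/False."""
    return "true" if value else "false"


def fmt_error(last_action_error: Optional[str]) -> str:
    """Emit the raw error string, or the literal null when there is none."""
    return "null" if last_action_error is None else last_action_error


def fmt_rewards(rewards: Sequence[float]) -> str:
    """Join per-step rewards; the comma separator is an assumption."""
    return ",".join(fmt_float(r) for r in rewards)
```

Routing every emitted value through helpers like these keeps the output stable regardless of how Python would print floats, booleans, or `None` by default.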

If the model call fails or credentials are missing, the script falls back to a deterministic scripted policy so the benchmark still runs reproducibly.
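
The fallback behavior described above might look roughly like the following. This is a minimal sketch, not the code in `inference.py`: `has_credentials`, `choose_action`, and the policy callables are hypothetical, and treating an empty variable as missing is an assumption.

```python
from typing import Any, Callable, Mapping

Policy = Callable[[Any], Any]
REQUIRED_VARS = ("HF_TOKEN", "API_BASE_URL", "MODEL_NAME")


def has_credentials(env: Mapping[str, str]) -> bool:
    """True only when every mandatory variable is set to a non-empty value."""
    return all(env.get(name) for name in REQUIRED_VARS)


def choose_action(observation: Any, model_policy: Policy,
                  scripted_policy: Policy, env: Mapping[str, str]) -> Any:
    """Use the model when possible; otherwise fall back to the scripted policy."""
    if not has_credentials(env):
        return scripted_policy(observation)
    try:
        return model_policy(observation)
    except Exception:
        # Any model-call failure falls back so the run stays reproducible.
        return scripted_policy(observation)
```

Because the scripted policy is deterministic, a run with no credentials would produce the same trajectory every time.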

Example Gold Scores

Using the included scripted policy:

password_reset_guidance: 1.0
duplicate_charge_refund: 1.0
enterprise_data_loss_escalation: 1.0

Deployment Notes

HF Space page: https://huggingface.co/spaces/Dar3devil/customer-support-openenv
HF app URL: https://dar3devil-customer-support-openenv.hf.space
The app exposes /health, /reset, /step, /state, /docs, /web, and /ws
Sessions are managed in-memory
No external services are required to run the environment server itself
The benchmark is designed to fit comfortably within the hackathon resource limits

Validation

If openenv is installed locally, run:

openenv validate

Pre-Submission Commands

Local checks:

cd "C:\Users\aarya\.codex\worktrees\e74f\Task Scheduler\customer_support_openenv"
openenv validate
pytest -q

Baseline run:

$env:HF_TOKEN="<your-hf-token>"
$env:API_BASE_URL="https://router.huggingface.co/v1"
$env:MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
python inference.py

Local Docker smoke test:

docker build -t customer-support-openenv .
docker run --rm -p 8000:8000 customer-support-openenv
curl.exe -sS http://localhost:8000/health
curl.exe -sS -X POST http://localhost:8000/reset -H "Content-Type: application/json" -d "{}"

Live Space smoke test:

curl.exe -sS https://dar3devil-customer-support-openenv.hf.space/health
curl.exe -sS -X POST https://dar3devil-customer-support-openenv.hf.space/reset -H "Content-Type: application/json" -d "{}"
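
For environments where curl is not available, the same /health and /reset checks could be made from Python with the standard library. This is a sketch that mirrors the curl commands above: `build_request` and `call` are hypothetical helper names, and the response bodies are not shown because their exact shape is not documented here.

```python
import json
import urllib.request


def build_request(base_url: str, path: str, payload=None) -> urllib.request.Request:
    """Build a GET request, or a JSON POST when a payload is given."""
    url = base_url.rstrip("/") + path
    if payload is None:
        return urllib.request.Request(url, method="GET")
    body = json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def call(request: urllib.request.Request, timeout: float = 10.0) -> str:
    """Send the request and return the response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With a server running locally, `call(build_request("http://localhost:8000", "/health"))` and `call(build_request("http://localhost:8000", "/reset", {}))` would correspond to the two local smoke-test commands.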

Submission validator:

wsl bash -lc "cd '/mnt/c/Users/aarya/.codex/worktrees/e74f/Task Scheduler/customer_support_openenv' && chmod +x scripts/validate-submission.sh && scripts/validate-submission.sh https://dar3devil-customer-support-openenv.hf.space ."

Windows users should run the validator script through WSL or Git Bash.

This repository does not depend on an LLM judge for grading. All graders are deterministic and implemented directly in the environment scorer.