	---
	title: Customer Support OpenEnv
	emoji: "🎫"
	colorFrom: blue
	colorTo: green
	sdk: docker
	app_port: 8000
	short_description: Deterministic B2B SaaS support benchmark.
	pinned: false
	---

	# AcmeCloud Customer Support Ticket Handler

	A deterministic OpenEnv-style environment for training and evaluating agents on realistic B2B SaaS support workflows.

	## What It Simulates

	Each episode is one inbound customer-support ticket at a fictional company, `AcmeCloud`.
	The agent acts as a support representative and must choose the right sequence of typed tool actions to handle the ticket correctly.

	The benchmark ships with three fixed tasks:

	1. `password_reset_guidance`
	2. `duplicate_charge_refund`
	3. `enterprise_data_loss_escalation`

	## Why This Is Useful

	This environment models a real operational task rather than a toy game:

	- reading support tickets
	- searching internal knowledge base articles
	- looking up customer account details
	- deciding whether to resolve, refund, or escalate
	- sending customer-facing replies under policy constraints

	The environment is fully deterministic and graded without any LLM judge, which makes it suitable for reproducible RL rollouts and benchmark evaluation.

	## Action Space

	The agent can take exactly six typed actions:

	- `search_kb(query: str)`
	- `lookup_account(customer_id: str)`
	- `send_reply(message: str)`
	- `issue_refund(amount_cents: int, reason_code: "duplicate_charge")`
	- `resolve_ticket(resolution_code: "password_reset_guidance" | "billing_refund_processed")`
	- `escalate_ticket(queue: "support_lead" | "legal_data_incident", priority: "P2" | "P0", summary: str)`
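
As an illustration, the six actions can be written as plain dictionaries and checked against the argument names and allowed values listed above. The `{"action_type": ..., "args": {...}}` payload shape, the `ACTION_SCHEMA` table, and the `validate_action` helper are assumptions made for this sketch; the real models live in `support_ticket_env/` and may look different.

```python
# Hypothetical validator for the six actions listed above.
# The payload shape {"action_type": ..., "args": {...}} is an assumption.

ACTION_SCHEMA = {
    "search_kb": {"query": str},
    "lookup_account": {"customer_id": str},
    "send_reply": {"message": str},
    "issue_refund": {"amount_cents": int, "reason_code": {"duplicate_charge"}},
    "resolve_ticket": {
        "resolution_code": {"password_reset_guidance", "billing_refund_processed"}
    },
    "escalate_ticket": {
        "queue": {"support_lead", "legal_data_incident"},
        "priority": {"P2", "P0"},
        "summary": str,
    },
}


def validate_action(action):
    """Return None if the action is well-formed, else an error string."""
    action_type = action.get("action_type")
    schema = ACTION_SCHEMA.get(action_type)
    if schema is None:
        return f"unknown action type: {action_type!r}"
    args = action.get("args", {})
    if set(args) != set(schema):
        return f"expected arguments {sorted(schema)}, got {sorted(args)}"
    for name, rule in schema.items():
        value = args[name]
        if isinstance(rule, set):
            # A set means the argument is restricted to fixed string values.
            if not isinstance(value, str) or value not in rule:
                return f"{name} must be one of {sorted(rule)}"
        elif isinstance(value, bool) or not isinstance(value, rule):
            # bool is rejected explicitly because it is a subclass of int.
            return f"{name} must be of type {rule.__name__}"
    return None
```

A check like this is one way an agent wrapper could catch malformed actions before they reach the environment and incur the invalid-action penalty.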

	## Observation Space

	Each observation includes:

	- task and ticket identifiers
	- current ticket status
	- customer metadata
	- customer message and full conversation history
	- the last tool result
	- steps taken / remaining
	- available action types
	- last action error
	- accumulated known facts learned from prior tool calls
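
To make the shape concrete, the fields above could be grouped into a single record along the following lines. Apart from `last_action_error`, which this README names elsewhere, every field name and type here is a guess for illustration; the actual observation model is defined in `support_ticket_env/`.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Observation:
    """Illustrative observation record; field names are assumptions."""

    task_id: str
    ticket_id: str
    ticket_status: str
    customer: Dict[str, Any]                     # customer metadata
    customer_message: str
    conversation_history: List[Dict[str, str]]   # full history so far
    last_tool_result: Optional[Dict[str, Any]]   # None before the first tool call
    steps_taken: int
    steps_remaining: int
    available_actions: List[str]
    last_action_error: Optional[str] = None
    known_facts: Dict[str, Any] = field(default_factory=dict)
```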

	## Reward Design

	The environment uses rubric-based reward shaping.

	- Each task has a deterministic scorecard in `[0.0, 1.0]`
	- Step reward is `score_delta - 0.01 - invalid_penalty - redundancy_penalty`
	- Repeated search/lookup actions set `redundancy_penalty` to `0.02`
	- Invalid actions set `invalid_penalty` to `0.10`
	- `resolve_ticket` and `escalate_ticket` terminate the episode
	- `issue_refund` changes state but does not terminate the episode

	Global success threshold: `0.75`
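
The step-reward formula can be sketched as a small function. For example, a valid, non-repeated step that raises the scorecard from `0.20` to `0.50` earns `0.30 - 0.01 = 0.29`. The function and constant names below are illustrative, and treating a final score exactly equal to the threshold as a success is an assumption.

```python
STEP_COST = 0.01           # flat per-step cost from the formula above
REDUNDANCY_PENALTY = 0.02  # repeated search/lookup action
INVALID_PENALTY = 0.10     # invalid action
SUCCESS_THRESHOLD = 0.75   # global success threshold


def step_reward(prev_score, new_score, invalid=False, redundant=False):
    """score_delta - 0.01 - invalid_penalty - redundancy_penalty."""
    reward = (new_score - prev_score) - STEP_COST
    if invalid:
        reward -= INVALID_PENALTY
    if redundant:
        reward -= REDUNDANCY_PENALTY
    return reward


def is_success(final_score):
    """Assumes the threshold is inclusive; the scorer may differ."""
    return final_score >= SUCCESS_THRESHOLD
```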

	## Task Details

	### 1. Password Reset Guidance

	Customer issue: reset email did not arrive.

	Expected flow:

	- search password reset KB article
	- send reply with reset URL and spam/junk guidance
	- resolve with `password_reset_guidance`

	### 2. Duplicate Charge Refund

	Customer issue: billed twice for the current subscription period.

	Expected flow:

	- lookup the account
	- search the refund policy
	- issue the verified duplicate-charge refund
	- reply with apology and timeline
	- resolve with `billing_refund_processed`

	### 3. Enterprise Data Loss Escalation

	Customer issue: enterprise data-loss complaint with legal threat.

	Expected flow:

	- lookup the account
	- send a careful acknowledgment reply
	- escalate to `legal_data_incident` with `P0`
	- do not refund
	- do not resolve
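
The three expected flows can be summarized as sequences of action types. This is a reading aid only: the names `EXPECTED_FLOWS`, `FORBIDDEN_ACTIONS`, and `ends_on_terminal_action` are hypothetical, the order simply follows the lists above, and the actual grader may accept other orderings or score steps individually.

```python
# Action-type sequences taken from the "Expected flow" lists above.
EXPECTED_FLOWS = {
    "password_reset_guidance": ["search_kb", "send_reply", "resolve_ticket"],
    "duplicate_charge_refund": [
        "lookup_account",
        "search_kb",
        "issue_refund",
        "send_reply",
        "resolve_ticket",
    ],
    "enterprise_data_loss_escalation": [
        "lookup_account",
        "send_reply",
        "escalate_ticket",
    ],
}

# Actions the task description explicitly rules out.
FORBIDDEN_ACTIONS = {
    "enterprise_data_loss_escalation": {"issue_refund", "resolve_ticket"},
}

# Per the reward design, these two actions end the episode.
TERMINAL_ACTIONS = {"resolve_ticket", "escalate_ticket"}


def ends_on_terminal_action(flow):
    """True if the flow's last action, and only that one, is terminal."""
    return flow[-1] in TERMINAL_ACTIONS and not any(
        action in TERMINAL_ACTIONS for action in flow[:-1]
    )
```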

	## Project Layout

	- `support_ticket_env/`: models, fixtures, scoring, environment core, policy helpers, local/HTTP client
	- `server/`: FastAPI app and Dockerfile
	- `tests/`: unit and scenario tests
	- `inference.py`: baseline runner using the OpenAI client interface
	- `openenv.yaml`: environment metadata

	## Local Setup

	```bash
	python -m pip install -e ".[dev]"
	pytest
	uvicorn server.app:app --host 0.0.0.0 --port 8000 --reload
	```

	Open the docs at [http://localhost:8000/docs](http://localhost:8000/docs) or the simple UI at [http://localhost:8000/web](http://localhost:8000/web).

	## Docker

	```bash
	docker build -t customer-support-openenv .
	docker run -p 8000:8000 customer-support-openenv
	```

	## Baseline Inference

	The baseline script uses the OpenAI client interface and supports any OpenAI-compatible endpoint.

	Mandatory environment variables for hosted model inference:

	- `HF_TOKEN`
	- `API_BASE_URL`
	- `MODEL_NAME`

	Optional environment variables:

	- `ENV_BASE_URL` to target a running local server or deployed HF Space
	- `LOCAL_IMAGE_NAME` if you want the script to instantiate the environment via `from_docker_image(...)`

	Inference environment selection:

	1. `LOCAL_IMAGE_NAME` set: use `from_docker_image(...)`
	2. otherwise `ENV_BASE_URL` set: use the running HTTP environment
	3. otherwise: use the in-process local environment for offline reproducibility
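
The precedence above amounts to a short chain of checks. The sketch below only decides which mode applies; the mode labels and the function name are made up for illustration, and treating an empty variable as unset is an assumption about how `inference.py` behaves.

```python
import os


def select_environment_mode(environ=None):
    """Return which environment backend the precedence rules would pick."""
    env = os.environ if environ is None else environ
    if env.get("LOCAL_IMAGE_NAME"):
        return "docker_image"  # rule 1: from_docker_image(...)
    if env.get("ENV_BASE_URL"):
        return "http"          # rule 2: running HTTP environment
    return "in_process"        # rule 3: offline, in-process environment
```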

	Run:

	```bash
	python inference.py
	```

	The script emits strict stdout lines in the required format:

	- `[START]`
	- `[STEP]`
	- `[END]`

	Output contract:

	- one `[START]` line per task
	- one `[STEP]` line immediately after each `env.step()`
	- one `[END]` line per task, even on exception
	- `reward`, `rewards`, and `score` formatted to 2 decimal places
	- `done` and `success` emitted as lowercase booleans
	- `error` emitted as the raw `last_action_error` string or `null`
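
The value-formatting rules in this contract are easy to get subtly wrong, since Python prints booleans as `True`/`False` and `None` as `None`. The helpers below sketch one way to satisfy them. The `key=value` layout of the `[STEP]` line is an assumption made for the example; this README does not spell out the exact field layout, so check `inference.py` for the real one.

```python
def fmt_float(value):
    """Format reward/score values to exactly two decimal places."""
    return f"{value:.2f}"


def fmt_bool(value):
    """Emit lowercase booleans instead of Python's True/False."""
    return "true" if value else "false"


def fmt_error(last_action_error):
    """Emit the raw error string, or the literal null when there is none."""
    return "null" if last_action_error is None else last_action_error


def format_step_line(step, action, reward, done, last_action_error):
    # Field names and ordering here are illustrative, not the official format.
    return (
        f"[STEP] step={step} action={action} reward={fmt_float(reward)} "
        f"done={fmt_bool(done)} error={fmt_error(last_action_error)}"
    )
```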

	If the model call fails or credentials are missing, the script falls back to a deterministic scripted policy so the benchmark still runs reproducibly.
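
A minimal sketch of that fallback, assuming both policies are callables that map an observation to an action. The function name, the `has_credentials` flag, and the policy signatures are hypothetical and not taken from `inference.py`.

```python
def choose_action(observation, model_policy, scripted_policy, has_credentials=True):
    """Prefer the model policy; fall back to the scripted one on any failure."""
    if not has_credentials:
        return scripted_policy(observation)
    try:
        return model_policy(observation)
    except Exception:
        # Any model-side failure falls back to the deterministic policy,
        # so the run still completes reproducibly.
        return scripted_policy(observation)
```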

	## Example Gold Scores

	Using the included scripted policy:

	- `password_reset_guidance`: `1.0`
	- `duplicate_charge_refund`: `1.0`
	- `enterprise_data_loss_escalation`: `1.0`

	## Deployment Notes

	- HF Space page: `https://huggingface.co/spaces/Dar3devil/customer-support-openenv`
	- HF app URL: `https://dar3devil-customer-support-openenv.hf.space`
	- The app exposes `/health`, `/reset`, `/step`, `/state`, `/docs`, `/web`, and `/ws`
	- Sessions are managed in-memory
	- No external services are required to run the environment server itself
	- The benchmark is designed to fit comfortably in the hackathon resource limits
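
For a quick check from Python rather than `curl`, the sketch below uses only the standard library. It covers `GET /health` and `POST /reset` with an empty JSON body, mirroring the smoke-test commands in this README. The helper names are made up for the example, and it assumes the endpoints return JSON; payloads for `/step` are not shown because their schema is not described here.

```python
import json
import urllib.request


def build_request(base_url, path, payload=None):
    """Build a GET request, or a JSON POST when a payload is given."""
    url = base_url.rstrip("/") + path
    if payload is None:
        return urllib.request.Request(url, method="GET")
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call(base_url, path, payload=None, timeout=10.0):
    """Send the request and decode the response, assuming a JSON body."""
    request = build_request(base_url, path, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example usage against a locally running server:
#   call("http://localhost:8000", "/health")
#   call("http://localhost:8000", "/reset", {})
```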

	## Validation

	If `openenv` is installed locally, run:

	```bash
	openenv validate
	```

	## Pre-Submission Commands

	Local checks:

	```powershell
	cd "<path-to-your-clone>\customer_support_openenv"
	openenv validate
	pytest -q
	```

	Baseline run:

	```powershell
	$env:HF_TOKEN="<your-hf-token>"
	$env:API_BASE_URL="https://router.huggingface.co/v1"
	$env:MODEL_NAME="Qwen/Qwen2.5-72B-Instruct"
	python inference.py
	```

	Local Docker smoke test:

	```powershell
	docker build -t customer-support-openenv .
	docker run --rm -p 8000:8000 customer-support-openenv
	# The container runs in the foreground; run the checks below from a second terminal.
	curl.exe -sS http://localhost:8000/health
	curl.exe -sS -X POST http://localhost:8000/reset -H "Content-Type: application/json" -d "{}"
	```

	Live Space smoke test:

	```powershell
	curl.exe -sS https://dar3devil-customer-support-openenv.hf.space/health
	curl.exe -sS -X POST https://dar3devil-customer-support-openenv.hf.space/reset -H "Content-Type: application/json" -d "{}"
	```

	Submission validator:

	```powershell
	wsl bash -lc "cd '/mnt/c/<path-to-your-clone>/customer_support_openenv' && chmod +x scripts/validate-submission.sh && scripts/validate-submission.sh https://dar3devil-customer-support-openenv.hf.space ."
	```

	Windows users should run the validator script through WSL or Git Bash.

	This repository does not depend on an LLM judge for grading.
	All graders are deterministic and implemented directly in the environment scorer.