---
title: SupportOpsEnv
sdk: docker
app_port: 7860
tags:
  - openenv
  - customer-support
  - evaluation
---
# SupportOpsEnv

SupportOpsEnv is a multi-step environment for evaluating agents on realistic customer support operations. The agent behaves like a support analyst: it reviews ticket summaries, requests missing context, assigns priority, chooses the correct internal route, selects a resolution, escalates when needed, and finalizes the case. This models a genuine workflow used by support operations, trust and safety, monetization, and account-recovery teams.

The environment is designed to score well against OpenEnv-style hackathon criteria:

- Real-world task simulation instead of a toy game
- Three deterministic tasks with easy, medium, and hard difficulty
- Dense reward shaping across the trajectory
- Typed observation, action, and reward models
- Reproducible OpenAI baseline runner
- Reproducible rule-based baseline runner that works with no API key
- Dockerized deployment on Hugging Face Spaces

## Environment Motivation

Support queue triage is one of the clearest real-world benchmarks for agent quality:

- Humans perform it every day
- It requires multi-step reasoning, not one-shot classification
- Progress can be measured deterministically
- It exposes practical agent failure modes such as premature resolution, wrong escalation, and poor prioritization
## Observation Space

`Observation` is a Pydantic model with:

- `task_id`: active task identifier
- `difficulty`: `easy`, `medium`, or `hard`
- `title`: task title
- `instruction`: natural-language objective
- `queue_mode`: whether the task contains multiple tickets
- `tickets`: list of ticket observations
- `remaining_steps`: steps left in the episode
- `available_actions`: valid action names
- `current_queue_order`: current queue ranking, if any
- `score_hint`: latest intermediate grader snapshot

Each ticket observation contains:

- `ticket_id`
- `summary`
- `visible_context`
- `discovered_context`
- `selected_priority`
- `selected_route`
- `selected_resolution`
- `escalation_team`
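
The field lists above can be sketched as typed models. The real `Observation` is a Pydantic model; this sketch uses standard-library dataclasses so that it runs with no dependencies. The field types, the defaults, and the `undecided_fields` helper are assumptions inferred from the descriptions, not copied from `models.py`.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TicketObservation:
    # Field names follow the README; the types are assumptions.
    ticket_id: str
    summary: str
    visible_context: Dict[str, str] = field(default_factory=dict)
    discovered_context: Dict[str, str] = field(default_factory=dict)
    selected_priority: Optional[str] = None
    selected_route: Optional[str] = None
    selected_resolution: Optional[str] = None
    escalation_team: Optional[str] = None


@dataclass
class Observation:
    task_id: str
    difficulty: str  # "easy", "medium", or "hard"
    title: str
    instruction: str
    queue_mode: bool
    tickets: List[TicketObservation]
    remaining_steps: int
    available_actions: List[str]
    current_queue_order: Optional[List[str]] = None
    score_hint: Optional[Dict[str, float]] = None


def undecided_fields(ticket):
    """Return the decision fields that are still unset on a ticket (hypothetical helper)."""
    names = ("selected_priority", "selected_route", "selected_resolution")
    return [name for name in names if getattr(ticket, name) is None]
```

A helper like `undecided_fields` is one way an agent could check that priority, route, and resolution are all set before it calls `finalize`.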
## Action Space

`Action` is a Pydantic model with:

- `action_type`
- `target`
- `value`

Supported `action_type` values:

| `action_type` | `target` | `value` example |
|------------------|------------|----------------------------------------|
| `inspect_ticket` | ticket ID | `""` |
| `request_context` | ticket ID | `"tax_status"` |
| `set_priority` | ticket ID | `"urgent"` / `"high"` / `"normal"` / `"low"` |
| `set_route` | ticket ID | `"account_security"` / `"billing_refunds"` / `"monetization_compliance"` / `"policy_appeals"` |
| `set_resolution` | ticket ID | `"temporary_lock_and_manual_recovery"` / `"request_tax_renewal"` / `"approve_refund"` / `"expedited_human_review"` |
| `escalate` | ticket ID | `"security_specialist"` |
| `rank_queue` | `"queue"` | `"T2,T1,T3"` |
| `finalize` | ticket ID | `""` |
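
As a rough illustration of the table above, the following sketch models an action and runs a few client-side sanity checks before sending it. It uses a standard-library dataclass in place of the Pydantic `Action` model, and the checks in `check_action` are assumptions for illustration, not the environment's own validation rules.

```python
from dataclasses import dataclass, asdict

# Action names and priority values taken from the table above.
ACTION_TYPES = {
    "inspect_ticket", "request_context", "set_priority", "set_route",
    "set_resolution", "escalate", "rank_queue", "finalize",
}
PRIORITIES = {"urgent", "high", "normal", "low"}


@dataclass(frozen=True)
class Action:
    action_type: str
    target: str
    value: str = ""


def check_action(action):
    """Return a list of problems; an empty list means the action looks valid.

    Illustrative checks only (assumed, not the environment's validation).
    """
    problems = []
    if action.action_type not in ACTION_TYPES:
        problems.append(f"unknown action_type: {action.action_type}")
    if action.action_type == "set_priority" and action.value not in PRIORITIES:
        problems.append(f"unknown priority: {action.value}")
    if action.action_type == "rank_queue":
        if action.target != "queue":
            problems.append("rank_queue expects target 'queue'")
        # The value is a comma-separated ranking such as "T2,T1,T3".
        ids = [part.strip() for part in action.value.split(",") if part.strip()]
        if not ids or len(ids) != len(set(ids)):
            problems.append("rank_queue value must list each ticket ID once")
    return problems


def to_payload(action):
    """Wrap an action in the JSON body shape used by the /step examples."""
    return {"action": asdict(action)}
```

For example, `check_action(Action("rank_queue", "queue", "T2,T1,T3"))` returns an empty list, while a misspelled priority is reported before any step is spent on it.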
## Reward Design

`RewardModel` is a Pydantic model with:

- `value`: scalar reward for this step
- `components`: dict of named sub-rewards
- `rationale`: human-readable explanation

Reward shaping is dense, not sparse:

- positive reward for discovering required context keys
- positive reward for correct intermediate decisions (priority, route, resolution)
- positive reward for correct queue ranking progress
- terminal reward from the deterministic grader score
- penalties for invalid actions, redundant actions, and wasted steps

This creates a learning or evaluation signal over the full trajectory, not just at episode end.
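
To make the shaping concrete, here is a minimal sketch of how named sub-rewards could be combined into a single step reward. The bonus and penalty magnitudes and the `step_reward` function are invented for illustration; the actual values live in `reward.py` and may differ. A dataclass stands in for the Pydantic `RewardModel`.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class RewardModel:
    value: float
    components: Dict[str, float] = field(default_factory=dict)
    rationale: str = ""


def step_reward(new_context_keys, correct_decisions, invalid, redundant):
    """Combine hypothetical per-step bonuses and penalties into one reward."""
    components = {
        "context_discovery": 0.05 * new_context_keys,     # assumed bonus per key
        "correct_decision": 0.10 * correct_decisions,     # assumed bonus per decision
        "invalid_action": -0.10 if invalid else 0.0,      # assumed penalty
        "redundant_action": -0.02 if redundant else 0.0,  # assumed penalty
    }
    # Drop zero entries so the breakdown only lists what actually happened.
    components = {name: amount for name, amount in components.items() if amount}
    value = round(sum(components.values()), 4)
    rationale = ", ".join(sorted(components)) or "no reward-relevant event"
    return RewardModel(value=value, components=components, rationale=rationale)
```

Keeping the per-component breakdown alongside the scalar makes it easy to see why a step was rewarded or penalized when reading a trajectory log.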
## Tasks

### Easy: Account Takeover Triage

Objective: correctly handle an urgent suspected account takeover with unauthorized ad spend.

Success criteria:

- request the right security and billing context
- assign `urgent`
- route to `account_security`
- choose `temporary_lock_and_manual_recovery`
- escalate to `security_specialist`

### Medium: Monetization Payout Hold

Objective: investigate a missing creator payout and avoid unsafe release of funds.

Success criteria:

- discover tax-expiry and compliance-hold context
- assign `high`
- route to `monetization_compliance`
- choose `request_tax_renewal`
- avoid unnecessary escalation

### Hard: Mixed Support Queue Triage

Objective: prioritize and resolve a heterogeneous queue of three tickets under SLA pressure.

Success criteria:

- correctly rank the queue by urgency
- assign route and priority for each ticket independently
- choose correct resolutions per ticket
- escalate only the security-critical case
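
The success criteria for the easy task translate fairly directly into an action sequence. The sketch below builds one such sequence as plain dictionaries. The context keys (`login_history`, `billing_activity`) and the ticket ID default are placeholders chosen for illustration; the real keys are defined in the task data and may be named differently.

```python
# Hypothetical context keys; the real keys live in the task data.
ASSUMED_CONTEXT_KEYS = ["login_history", "billing_activity"]


def make_action(action_type, target, value=""):
    """Build one action dictionary with the three fields of the Action model."""
    return {"action_type": action_type, "target": target, "value": value}


def easy_task_plan(ticket_id="T1"):
    """Build an action sequence implied by the easy task's success criteria."""
    plan = [make_action("inspect_ticket", ticket_id)]
    # Gather context first so later decisions are made on full information.
    for key in ASSUMED_CONTEXT_KEYS:
        plan.append(make_action("request_context", ticket_id, key))
    plan += [
        make_action("set_priority", ticket_id, "urgent"),
        make_action("set_route", ticket_id, "account_security"),
        make_action("set_resolution", ticket_id, "temporary_lock_and_manual_recovery"),
        make_action("escalate", ticket_id, "security_specialist"),
        # Finalize last: finalizing before the decisions are set loses credit.
        make_action("finalize", ticket_id),
    ]
    return plan
```

The ordering matters more than the exact keys: context requests come first, decisions follow, and `finalize` is the last step.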
## Graders

Each task has a deterministic grader that returns a score in `[0.0, 1.0]`.

- Easy grader weights context discovery, priority, route, resolution, and escalation
- Medium grader weights context and policy-safe resolution more heavily
- Hard grader scores per-ticket handling and queue ranking independently

Programmatic graders live in [`support_ops_env/graders/`](./support_ops_env/graders/).
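
A deterministic weighted grader of the kind described above might look like the following sketch for the easy task. The equal weights, the shape of `final_state`, and the `grade_easy` function are assumptions for illustration; the expected values come from the easy task's success criteria, and the actual graders in `support_ops_env/graders/` may be structured differently.

```python
# Hypothetical weights; they sum to 1.0 so the score stays within [0.0, 1.0].
EASY_WEIGHTS = {
    "context": 0.2,
    "priority": 0.2,
    "route": 0.2,
    "resolution": 0.2,
    "escalation": 0.2,
}

# Expected decisions from the easy task's success criteria.
EASY_EXPECTED = {
    "priority": "urgent",
    "route": "account_security",
    "resolution": "temporary_lock_and_manual_recovery",
    "escalation": "security_specialist",
}


def grade_easy(final_state, required_context):
    """Score a finished easy episode; the same input always gives the same score."""
    found = set(final_state.get("discovered_context", [])) & set(required_context)
    # Partial credit: fraction of the required context keys that were discovered.
    context_fraction = len(found) / len(required_context) if required_context else 1.0
    score = EASY_WEIGHTS["context"] * context_fraction
    for name, expected in EASY_EXPECTED.items():
        if final_state.get(name) == expected:
            score += EASY_WEIGHTS[name]
    # Clamp defensively and round away floating-point noise.
    return round(min(max(score, 0.0), 1.0), 4)
```

Because the function has no randomness and no hidden state, running it twice on the same final state yields the same score, which is the property the baselines rely on.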
## Baseline Scores

### Rule-based baseline (no API key required)

The deterministic rule-based baseline always takes the optimal action sequence. It serves as a sanity check that the graders are correct and that a perfect score is reachable:

| Task | Score |
|-------------------------|-------|
| `easy_account_takeover` | 1.000 |
| `medium_payout_hold` | 1.000 |
| `hard_queue_triage` | 1.000 |
| **average** | **1.000** |

### LLM baseline (GPT-4.1-mini)

These scores come from the reproducible OpenAI baseline runner; values are approximate. They show that the environment poses a genuine challenge to frontier models, particularly on the hard task:

| Task | Score | Notes |
|-------------------------|-------|-------|
| `easy_account_takeover` | ~0.20 | Model skips mandatory `set_priority` / `set_route` / `set_resolution` before `finalize` |
| `medium_payout_hold` | ~0.35 | Correct context discovery but premature `finalize` |
| `hard_queue_triage` | ~0.13 | Multi-ticket ranking and per-ticket mandatory actions not completed |
| **average** | **~0.23** | |

The gap between the rule-based baseline (1.000) and the LLM baseline (~0.23) indicates that the reward function produces genuine signal and that the hard task challenges frontier models.
## Setup

```bash
cd support_ops_env
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

## Usage

Run the local tests:

```bash
python -m unittest discover -s tests -p 'test_*.py'
```

Run the app locally:

```bash
python app.py
```

Run the rule-based baseline (no API key required):

```bash
python scripts/run_rule_baseline.py
```

Run the OpenAI baseline:

```bash
export OPENAI_API_KEY=your_key_here
python scripts/run_baseline.py --model gpt-4.1-mini
```

Validate OpenEnv metadata:

```bash
bash scripts/validate_env.sh
# If the openenv CLI is installed, this also runs: openenv validate openenv.yaml
```
## API Quick Start

The live environment is available at `https://suppops-supportopsenv.hf.space`.

Reset to a task:

```bash
curl -X POST https://suppops-supportopsenv.hf.space/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": "easy_account_takeover"}'
```

Take a step:

```bash
curl -X POST https://suppops-supportopsenv.hf.space/step \
  -H "Content-Type: application/json" \
  -d '{"action": {"action_type": "inspect_ticket", "target": "T1", "value": ""}}'
```

Inspect the full environment state:

```bash
curl https://suppops-supportopsenv.hf.space/state
```

Get JSON schemas for all models:

```bash
curl https://suppops-supportopsenv.hf.space/schema
```
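
The same endpoints can be called from Python. The sketch below uses only the standard library and mirrors the curl commands above; the helper names `build_request` and `call` are made up for this example, and the shape of the JSON responses is not assumed beyond being valid JSON.

```python
import json
import urllib.request

BASE_URL = "https://suppops-supportopsenv.hf.space"


def build_request(path, payload=None):
    """Build a urllib Request: POST with a JSON body if payload is given, else GET."""
    url = BASE_URL + path
    if payload is None:
        return urllib.request.Request(url, method="GET")
    body = json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def call(path, payload=None, timeout=30):
    """Send the request and decode the JSON response body."""
    request = build_request(path, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `call("/reset", {"task_id": "easy_account_takeover"})` followed by `call("/step", {"action": {"action_type": "inspect_ticket", "target": "T1", "value": ""}})` reproduces the first two curl commands, and `call("/state")` and `call("/schema")` cover the two GET endpoints.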
## Hugging Face Space Deployment

This repository includes a `Dockerfile`, `app.py`, and `openenv.yaml` and deploys as a Docker Space.

1. Create a new Hugging Face Space with SDK set to Docker.
2. Push this repository to the Space.
3. Add the `openenv` tag in the Space metadata (already present in this README's frontmatter).
4. Optionally set `OPENAI_API_KEY` as a Space secret for baseline experiments.
## Project Structure

```text
support_ops_env/
├── support_ops_env/
│   ├── env.py
│   ├── models.py
│   ├── reward.py
│   ├── state.py
│   ├── data/
│   ├── graders/
│   └── tasks/
├── scripts/
│   ├── run_baseline.py
│   ├── run_rule_baseline.py
│   └── validate_env.sh
├── tests/
├── app.py
├── openenv.yaml
├── Dockerfile
├── requirements.txt
└── README.md
```