Spaces:
Sleeping
Sleeping
| title: SupportOpsEnv | |
| sdk: docker | |
| app_port: 7860 | |
| tags: | |
| - openenv | |
| - customer-support | |
| - evaluation | |
| # SupportOpsEnv | |
| SupportOpsEnv is a multi-step environment for evaluating agents on realistic customer support operations. The agent behaves like a support analyst: it reviews ticket summaries, requests missing context, assigns priority, chooses the correct internal route, selects a resolution, escalates when needed, and finalizes the case. This models a genuine workflow used by support operations, trust and safety, monetization, and account-recovery teams. | |
| The environment is designed to score well against OpenEnv-style hackathon criteria: | |
| - Real-world task simulation instead of a toy game | |
| - Three deterministic tasks with easy, medium, and hard difficulty | |
| - Dense reward shaping across the trajectory | |
| - Typed observation, action, and reward models | |
| - Reproducible OpenAI baseline runner | |
| - Reproducible rule-based baseline runner that works with no API key | |
| - Dockerized deployment path for Hugging Face Spaces | |
| ## Environment Motivation | |
| Support queue triage is one of the clearest real-world benchmarks for agent quality: | |
| - Humans perform it every day | |
| - It requires multi-step reasoning, not one-shot classification | |
| - Progress can be measured deterministically | |
| - It exposes practical agent failure modes such as premature resolution, wrong escalation, and poor prioritization | |
| ## Observation Space | |
| `Observation` is a Pydantic model with: | |
| - `task_id`: active task identifier | |
| - `difficulty`: `easy`, `medium`, or `hard` | |
| - `title`: task title | |
| - `instruction`: natural-language objective | |
| - `queue_mode`: whether the task contains multiple tickets | |
| - `tickets`: list of ticket observations | |
| - `remaining_steps`: steps left in the episode | |
| - `available_actions`: valid action names | |
| - `current_queue_order`: current queue ranking, if any | |
| - `score_hint`: latest intermediate grader snapshot | |
| Each ticket observation contains: | |
| - `ticket_id` | |
| - `summary` | |
| - `visible_context` | |
| - `discovered_context` | |
| - `selected_priority` | |
| - `selected_route` | |
| - `selected_resolution` | |
| - `escalation_team` | |
| ## Action Space | |
| `Action` is a Pydantic model with: | |
| - `action_type` | |
| - `target` | |
| - `value` | |
| Supported `action_type` values: | |
| - `inspect_ticket` | |
| - `request_context` | |
| - `set_priority` | |
| - `set_route` | |
| - `set_resolution` | |
| - `escalate` | |
| - `rank_queue` | |
| - `finalize` | |
| ## Reward Design | |
| `RewardModel` is a Pydantic model with: | |
| - `value` | |
| - `components` | |
| - `rationale` | |
| Reward shaping is dense, not sparse: | |
| - positive reward for discovering required context | |
| - positive reward for correct intermediate decisions | |
| - positive reward for correct queue ranking progress | |
| - terminal reward from the deterministic grader score | |
| - penalties for invalid actions, redundant actions, and wasted steps | |
| This creates learning or evaluation signal over the full trajectory. | |
| ## Tasks | |
| ### Easy: Account Takeover Triage | |
| Objective: correctly handle an urgent suspected account takeover with unauthorized ad spend. | |
| Expected difficulty: easy. | |
| Success criteria: | |
| - request the right security and billing context | |
| - assign `urgent` | |
| - route to `account_security` | |
| - choose `temporary_lock_and_manual_recovery` | |
| - escalate to `security_specialist` | |
| ### Medium: Monetization Payout Hold | |
| Objective: investigate a missing creator payout and avoid unsafe release of funds. | |
| Expected difficulty: medium. | |
| Success criteria: | |
| - discover tax-expiry and compliance-hold context | |
| - assign `high` | |
| - route to `monetization_compliance` | |
| - choose `request_tax_renewal` | |
| - avoid unnecessary escalation | |
| ### Hard: Mixed Support Queue Triage | |
| Objective: prioritize and resolve a heterogeneous queue under SLA pressure. | |
| Expected difficulty: hard. | |
| Success criteria: | |
| - correctly rank the queue | |
| - assign route and priority for each ticket | |
| - choose correct resolutions | |
| - escalate only the security-critical case | |
| ## Graders | |
| Each task has a deterministic grader that returns a score in `0.0` to `1.0`. | |
| - Easy grader weights context, priority, route, resolution, and escalation | |
| - Medium grader weights context and policy-safe resolution more heavily | |
| - Hard grader scores per-ticket handling and queue ranking | |
| Programmatic graders live in [support_ops_env/graders](/home/batman/Downloads/presentation_template/support_ops_env/support_ops_env/graders). | |
| ## Setup | |
| ```bash | |
| cd support_ops_env | |
| python -m venv .venv | |
| source .venv/bin/activate | |
| pip install -r requirements.txt | |
| ``` | |
| ## Usage | |
| Run the local tests: | |
| ```bash | |
| python -m unittest discover -s tests -p 'test_*.py' | |
| ``` | |
| Run the app locally: | |
| ```bash | |
| python app.py | |
| ``` | |
| Run the default no-API baseline: | |
| ```bash | |
| python scripts/run_rule_baseline.py | |
| ``` | |
| Run the OpenAI baseline if you have an API key: | |
| ```bash | |
| export OPENAI_API_KEY=your_key_here | |
| python scripts/run_baseline.py --model gpt-4.1-mini | |
| ``` | |
| Validate metadata: | |
| ```bash | |
| bash scripts/validate_env.sh | |
| ``` | |
| If the `openenv` CLI is installed, the script will also run `openenv validate openenv.yaml`. | |
| ## Baseline Scores | |
| The repository now includes a deterministic baseline in [run_rule_baseline.py](/home/batman/Downloads/presentation_template/support_ops_env/scripts/run_rule_baseline.py), so you can produce reproducible scores without any external API. | |
| In this workspace, use: | |
| ```bash | |
| python scripts/run_rule_baseline.py | |
| ``` | |
| This writes `rule_baseline_results.json` with per-task transcripts and the average score. | |
| The current deterministic baseline score from this workspace is: | |
| - `easy_account_takeover`: `1.0` | |
| - `medium_payout_hold`: `1.0` | |
| - `hard_queue_triage`: `1.0` | |
| - average: `1.0` | |
| The OpenAI baseline in [run_baseline.py](/home/batman/Downloads/presentation_template/support_ops_env/scripts/run_baseline.py) is still available as an optional comparison path after installing dependencies and setting `OPENAI_API_KEY`. | |
| ## Hugging Face Space Deployment | |
| This repository includes: | |
| - `Dockerfile` | |
| - `app.py` | |
| - `openenv.yaml` | |
| To deploy as a Docker Space: | |
| 1. Create a new Hugging Face Space with SDK set to Docker. | |
| 2. Upload this repository. | |
| 3. Add the `openenv` tag in the Space metadata. | |
| 4. Optionally set `OPENAI_API_KEY` as a Space secret for baseline experiments. | |
| ## Project Structure | |
| ```text | |
| support_ops_env/ | |
| βββ support_ops_env/ | |
| β βββ env.py | |
| β βββ models.py | |
| β βββ reward.py | |
| β βββ state.py | |
| β βββ data/ | |
| β βββ graders/ | |
| β βββ tasks/ | |
| βββ scripts/ | |
| βββ tests/ | |
| βββ app.py | |
| βββ openenv.yaml | |
| βββ Dockerfile | |
| βββ requirements.txt | |
| βββ README.md | |
| ``` | |