# Support Queue OpenEnv

A real-world OpenEnv benchmark for **SaaS support triage**. Agents must read incoming support tickets, assign the right priority, route each case to the correct internal queue, choose the next action, and draft a safe first reply. The benchmark is designed to feel like an actual support operations workflow rather than a toy task.

## Why This Environment

Real support teams repeatedly solve the same high-value triage problems:

- decide how urgent a ticket is
- route it to the right team
- avoid unsafe or misleading replies
- handle ambiguous requests without over-escalating

This makes support triage a strong RL and agent-evaluation environment: success is measurable, partial credit is meaningful, and mistakes are easy to interpret.

## What The Agent Does

For each ticket, the agent must produce a `SupportQueueAction` with:

- `priority`: `P1 | P2 | P3 | P4`
- `queue`: `billing | security | technical | success | trust_safety`
- `disposition`: `respond | request_info | escalate | close`
- `summary`: short internal triage note
- `response`: first customer-facing reply
- `confidence`: float in `[0.0, 1.0]`

## Observation Space

Each `reset()` and `step()` returns a typed `SupportQueueObservation` containing:

| Field | Meaning |
| --- | --- |
| `task_id`, `task_title`, `difficulty` | Active benchmark task metadata |
| `instructions` | Task-specific operating guidance |
| `current_index`, `total_tickets` | Episode progress |
| `ticket` | Current customer ticket payload |
| `allowed_priorities`, `allowed_queues`, `allowed_dispositions` | Valid discrete actions |
| `scoring_weights` | Reward decomposition |
| `last_feedback` | Previous grader output |
| `reward`, `cumulative_reward`, `done` | Episode feedback |
| `info` | Extra metadata such as `episode_id` |

The ticket payload includes:

- `ticket_id`
- `subject`
- `body`
- `customer_tier`
- `product_area`
- `sla_hours`
- `recent_events`

## State Space

`state()` returns a typed `SupportQueueState` with:

- active task card
- current cursor
- cumulative and average reward
- processed ticket ids
- full action history
- full per-ticket grading history

## Tasks

The benchmark includes three deterministic tasks of increasing difficulty.

| Task ID | Difficulty | Tickets | Description |
| --- | --- | ---: | --- |
| `easy_inbox_cleanup` | Easy | 2 | Straightforward access and billing tickets |
| `medium_sla_defense` | Medium | 3 | Mix of phishing escalation, webhook failure, and billing ambiguity |
| `hard_exec_escalations` | Hard | 4 | Executive-pressure tickets spanning production, security, commercial, and retention workflows |

## Reward Design

Each processed ticket earns a reward in `[0.0, 1.0]`, decomposed as follows:

| Component | Weight |
| --- | ---: |
| Priority accuracy | `0.30` |
| Queue accuracy | `0.25` |
| Disposition accuracy | `0.20` |
| Summary keyword coverage | `0.15` |
| Response keyword coverage | `0.10` |
| Unsafe reply penalty | `-0.10` |

This gives useful partial-progress signals: an agent can still earn reward for a good route or a good reply even if one part of the triage decision is wrong.
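To make the decomposition concrete, here is a minimal scoring sketch. The helper `keyword_coverage`, the `expected` fields, and the unsafe-phrase check are illustrative assumptions, not the bundled grader; see `support_queue_env/grading.py` for the real rules.

```python
# Minimal sketch of the per-ticket reward decomposition described above.
# The expected-answer fields and helper logic are illustrative assumptions,
# not the environment's actual grader (see support_queue_env/grading.py).

WEIGHTS = {
    "priority": 0.30,
    "queue": 0.25,
    "disposition": 0.20,
    "summary": 0.15,
    "response": 0.10,
}
UNSAFE_PENALTY = 0.10


def keyword_coverage(text: str, keywords: list[str]) -> float:
    """Fraction of expected keywords present in the text (assumed metric)."""
    if not keywords:
        return 1.0
    hits = sum(1 for kw in keywords if kw.lower() in text.lower())
    return hits / len(keywords)


def score_ticket(action: dict, expected: dict) -> float:
    """Combine weighted components into a reward clamped to [0.0, 1.0]."""
    score = 0.0
    score += WEIGHTS["priority"] * (action["priority"] == expected["priority"])
    score += WEIGHTS["queue"] * (action["queue"] == expected["queue"])
    score += WEIGHTS["disposition"] * (action["disposition"] == expected["disposition"])
    score += WEIGHTS["summary"] * keyword_coverage(action["summary"], expected["summary_keywords"])
    score += WEIGHTS["response"] * keyword_coverage(action["response"], expected["response_keywords"])
    # Unsafe reply penalty: deduct if the reply contains a flagged phrase.
    if any(p in action["response"].lower() for p in expected.get("unsafe_phrases", [])):
        score -= UNSAFE_PENALTY
    return max(0.0, min(1.0, score))
```

Under this sketch, a correct priority, queue, and disposition alone would earn `0.30 + 0.25 + 0.20 = 0.75`, which is why a good route still pays off even with a weak reply.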
## API Surface

The environment server exposes:

- `POST /reset`
- `POST /step`
- `GET /state`
- `GET /tasks`
- `GET /health`
- `GET /`

Example reset payload:

```json
{ "task_id": "easy_inbox_cleanup" }
```

## Project Structure

```text
support_queue_env/
  client.py
  grading.py
  models.py
  tasks.py
  server/
    app.py
    openenv_compat.py
    support_queue_environment.py
Dockerfile
openenv.yaml
inference.py
```

## Running Locally

### Python

```bash
pip install -r requirements.txt
uvicorn support_queue_env.server.app:app --host 0.0.0.0 --port 8000
```

### Docker

```bash
docker build -t support-queue-openenv .
docker run --rm -p 8000:8000 support-queue-openenv
```

## Baseline Inference

The required inference script is [inference.py](./inference.py). It:

- uses the OpenAI Python client
- reads `API_BASE_URL`, `MODEL_NAME`, `HF_TOKEN`, and optional `LOCAL_IMAGE_NAME`
- emits structured `[START]`, `[STEP]`, and `[END]` logs
- writes `inference_results.json`

Set environment variables:

```bash
API_BASE_URL=https://api.openai.com/v1
MODEL_NAME=gpt-4o-mini
HF_TOKEN=your_token
LOCAL_IMAGE_NAME=
```

Then run:

```bash
python inference.py
```

## Baseline Scores

Expected deterministic baseline scores from the bundled heuristic policy:

| Task | Score |
| --- | ---: |
| `easy_inbox_cleanup` | `1.00` |
| `medium_sla_defense` | `0.98` |
| `hard_exec_escalations` | `0.97` |
| Average | `0.98` |

## Hugging Face Space

This repository is configured as a **Docker Space**:

- front matter in `README.md` sets `sdk: docker`
- the app serves on port `8000`
- `GET /health` and `POST /reset` support deployment checks

## OpenEnv Files

Core submission files:

- [openenv.yaml](./openenv.yaml)
- [inference.py](./inference.py)
- [Dockerfile](./Dockerfile)
- [support_queue_env/models.py](./support_queue_env/models.py)
- [support_queue_env/server/support_queue_environment.py](./support_queue_env/server/support_queue_environment.py)

## Submission Checklist

- typed action, observation, and state models included
- `reset()`, `step()`, and `state()` implemented
- three graded tasks included
- reward bounded to `[0.0, 1.0]`
- Dockerfile included
- Hugging Face Docker Space compatible
- root `inference.py` included
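As a quick end-to-end smoke test against a locally running server, the sketch below drives one episode over the HTTP surface above. The flat action payload and the response field names (`reward`, `done`, `last_feedback`) are assumptions read off the models in this README, not a verified wire format; `support_queue_env/client.py` is the authoritative typed client.

```python
# Minimal end-to-end sketch: one episode over the HTTP API above.
# Assumptions (not verified against the real server): /step accepts a flat
# action payload, and responses expose `reward`, `done`, and `last_feedback`.
# See support_queue_env/client.py for the authoritative typed client.
import requests

BASE = "http://localhost:8000"

obs = requests.post(f"{BASE}/reset", json={"task_id": "easy_inbox_cleanup"}).json()

done = False
while not done:
    # A fixed placeholder action; a real agent would derive this from obs["ticket"].
    action = {
        "priority": "P3",
        "queue": "technical",
        "disposition": "respond",
        "summary": "Routine technical question; no SLA risk identified.",
        "response": "Thanks for reaching out. Here are the next steps to resolve this.",
        "confidence": 0.6,
    }
    obs = requests.post(f"{BASE}/step", json=action).json()
    print(obs.get("reward"), obs.get("last_feedback"))
    done = obs.get("done", False)
```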