# Support Queue OpenEnv

A real-world OpenEnv benchmark for **SaaS support triage**.

Agents must read incoming support tickets, assign the right priority, route the case to the correct internal queue, choose the next action, and draft a safe first reply. The benchmark is designed to feel like an actual support operations workflow rather than a toy task.

## Why This Environment

Real support teams repeatedly solve the same high-value triage problems:

- decide how urgent a ticket is
- route it to the right team
- avoid unsafe or misleading replies
- handle ambiguous requests without over-escalating

This makes support triage a strong RL and agent-evaluation environment because success is measurable, partial credit is meaningful, and mistakes are easy to interpret.
## What The Agent Does

For each ticket, the agent must produce a `SupportQueueAction` with:

- `priority`: `P1 | P2 | P3 | P4`
- `queue`: `billing | security | technical | success | trust_safety`
- `disposition`: `respond | request_info | escalate | close`
- `summary`: short internal triage note
- `response`: first customer-facing reply
- `confidence`: float in `[0.0, 1.0]`
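As a concrete illustration, a valid action can be built as a plain dictionary and checked against the allowed discrete values. The field names follow the list above; the specific values are invented for the example.

```python
# Hypothetical SupportQueueAction payload; field names follow the spec above,
# but the specific values are invented for illustration.
action = {
    "priority": "P2",
    "queue": "technical",
    "disposition": "respond",
    "summary": "Webhook failures for a paid workspace; route to technical.",
    "response": "Thanks for the report. We are investigating the webhook "
                "failures and will follow up with an update shortly.",
    "confidence": 0.85,
}

# Client-side sanity checks against the allowed discrete values.
assert action["priority"] in {"P1", "P2", "P3", "P4"}
assert action["queue"] in {"billing", "security", "technical", "success", "trust_safety"}
assert action["disposition"] in {"respond", "request_info", "escalate", "close"}
assert 0.0 <= action["confidence"] <= 1.0
```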
## Observation Space

Each `reset()` and `step()` returns a typed `SupportQueueObservation` containing:

| Field | Meaning |
| --- | --- |
| `task_id`, `task_title`, `difficulty` | Active benchmark task metadata |
| `instructions` | Task-specific operating guidance |
| `current_index`, `total_tickets` | Episode progress |
| `ticket` | Current customer ticket payload |
| `allowed_priorities`, `allowed_queues`, `allowed_dispositions` | Valid discrete actions |
| `scoring_weights` | Reward decomposition |
| `last_feedback` | Previous grader output |
| `reward`, `cumulative_reward`, `done` | Episode feedback |
| `info` | Extra metadata such as `episode_id` |
The ticket payload includes:

- `ticket_id`
- `subject`
- `body`
- `customer_tier`
- `product_area`
- `sla_hours`
- `recent_events`
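For orientation, a ticket payload with these fields might look like the following sketch. All values are invented, not taken from the benchmark data, and the helper is a toy heuristic rather than the benchmark's grader.

```python
# Invented ticket payload illustrating the fields listed above.
ticket = {
    "ticket_id": "T-1042",
    "subject": "Webhooks failing since this morning",
    "body": "All delivery webhooks have returned 500 errors since 09:00 UTC.",
    "customer_tier": "enterprise",
    "product_area": "integrations",
    "sla_hours": 4,
    "recent_events": ["webhook_error_spike"],
}

def is_tight_sla(ticket, threshold_hours=8):
    """Toy helper (not the benchmark's grader): flag tickets with a short SLA window."""
    return ticket["sla_hours"] <= threshold_hours
```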
## State Space

`state()` returns a typed `SupportQueueState` with:

- active task card
- current cursor
- cumulative and average reward
- processed ticket ids
- full action history
- full per-ticket grading history
## Tasks

The benchmark includes three deterministic tasks with increasing difficulty.

| Task ID | Difficulty | Tickets | Description |
| --- | --- | ---: | --- |
| `easy_inbox_cleanup` | Easy | 2 | Straightforward access and billing tickets |
| `medium_sla_defense` | Medium | 3 | Mix of phishing escalation, webhook failure, and billing ambiguity |
| `hard_exec_escalations` | Hard | 4 | Executive-pressure tickets spanning production, security, commercial, and retention workflows |
## Reward Design

Each processed ticket receives a reward in `[0.0, 1.0]`, composed of the following components:

| Component | Weight |
| --- | ---: |
| Priority accuracy | `0.30` |
| Queue accuracy | `0.25` |
| Disposition accuracy | `0.20` |
| Summary keyword coverage | `0.15` |
| Response keyword coverage | `0.10` |
| Unsafe reply penalty | `-0.10` |
This decomposition yields useful partial-progress signals: an agent can still earn reward for a correct route or a good reply even if one part of the triage decision is wrong.
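Assuming each component is scored in `[0.0, 1.0]` and combined linearly by the weights above (the actual grader lives in `grading.py`), the combination can be sketched as:

```python
# Weights from the reward table above.
WEIGHTS = {
    "priority": 0.30,
    "queue": 0.25,
    "disposition": 0.20,
    "summary": 0.15,
    "response": 0.10,
}
UNSAFE_PENALTY = 0.10

def combine_reward(component_scores, unsafe=False):
    """Weighted sum of per-component scores in [0, 1], minus the unsafe-reply
    penalty when triggered, clamped back into [0.0, 1.0]."""
    total = sum(WEIGHTS[name] * component_scores.get(name, 0.0) for name in WEIGHTS)
    if unsafe:
        total -= UNSAFE_PENALTY
    return max(0.0, min(1.0, total))

# A wrong disposition still leaves 0.80 of the reward on the table:
# 0.30 + 0.25 + 0.15 + 0.10 = 0.80
partial = combine_reward(
    {"priority": 1.0, "queue": 1.0, "disposition": 0.0, "summary": 1.0, "response": 1.0}
)
```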
## API Surface

The environment server exposes:

- `POST /reset`
- `POST /step`
- `GET /state`
- `GET /tasks`
- `GET /health`
- `GET /`
Example reset payload:

```json
{
  "task_id": "easy_inbox_cleanup"
}
```
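A minimal client for these endpoints can be written with the standard library alone. The helper names and the exact JSON layout of the responses (e.g. a top-level `done` flag) are assumptions for illustration; `support_queue_env/client.py` provides the real client.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumes the server from "Running Locally"

def post(path, payload):
    """POST a JSON payload to the environment server and decode the JSON reply."""
    req = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def run_episode(task_id, choose_action):
    """Drive one episode: reset, then step until the observation reports done.

    `choose_action` maps an observation dict to a SupportQueueAction dict.
    """
    obs = post("/reset", {"task_id": task_id})
    while not obs.get("done"):
        obs = post("/step", choose_action(obs))
    return obs
```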
## Project Structure

```text
support_queue_env/
  client.py
  grading.py
  models.py
  tasks.py
  server/
    app.py
    openenv_compat.py
    support_queue_environment.py
Dockerfile
openenv.yaml
inference.py
```
## Running Locally

### Python

```bash
pip install -r requirements.txt
uvicorn support_queue_env.server.app:app --host 0.0.0.0 --port 8000
```

### Docker

```bash
docker build -t support-queue-openenv .
docker run --rm -p 8000:8000 support-queue-openenv
```
## Baseline Inference

The required inference script is [inference.py](./inference.py). It:

- uses the OpenAI Python client
- reads `API_BASE_URL`, `MODEL_NAME`, `HF_TOKEN`, and optional `LOCAL_IMAGE_NAME`
- emits structured `[START]`, `[STEP]`, and `[END]` logs
- writes `inference_results.json`
Set the environment variables (exported so they reach the Python process):

```bash
export API_BASE_URL=https://api.openai.com/v1
export MODEL_NAME=gpt-4o-mini
export HF_TOKEN=your_token
export LOCAL_IMAGE_NAME=
```

Then run:

```bash
python inference.py
```
## Baseline Scores

Expected deterministic baseline scores from the bundled heuristic policy:

| Task | Score |
| --- | ---: |
| `easy_inbox_cleanup` | `1.00` |
| `medium_sla_defense` | `0.98` |
| `hard_exec_escalations` | `0.97` |
| Average | `0.98` |
## Hugging Face Space

This repository is configured as a **Docker Space**:

- the front matter in `README.md` sets `sdk: docker`
- the app serves on port `8000`
- `GET /health` and `POST /reset` support deployment checks
## OpenEnv Files

Core submission files:

- [openenv.yaml](./openenv.yaml)
- [inference.py](./inference.py)
- [Dockerfile](./Dockerfile)
- [support_queue_env/models.py](./support_queue_env/models.py)
- [support_queue_env/server/support_queue_environment.py](./support_queue_env/server/support_queue_environment.py)
## Submission Checklist

- typed action, observation, and state models included
- `reset()`, `step()`, and `state()` implemented
- three graded tasks included
- reward bounded to `[0.0, 1.0]`
- Dockerfile included
- Hugging Face Docker Space compatible
- root `inference.py` included