# Support Queue OpenEnv
A real-world OpenEnv benchmark for **SaaS support triage**.
Agents must read incoming support tickets, assign the right priority, route the case to the correct internal queue, choose the next action, and draft a safe first reply. The benchmark is designed to feel like an actual support operations workflow rather than a toy task.
## Why This Environment
Real support teams repeatedly solve the same high-value triage problems:
- decide how urgent a ticket is
- route it to the right team
- avoid unsafe or misleading replies
- handle ambiguous requests without over-escalating
This makes support triage a strong RL and agent-evaluation environment because success is measurable, partial credit is meaningful, and mistakes are easy to interpret.
## What The Agent Does
For each ticket, the agent must produce a `SupportQueueAction` (sketched after this list) with:
- `priority`: `P1 | P2 | P3 | P4`
- `queue`: `billing | security | technical | success | trust_safety`
- `disposition`: `respond | request_info | escalate | close`
- `summary`: short internal triage note
- `response`: first customer-facing reply
- `confidence`: float in `[0.0, 1.0]`
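A minimal sketch of this action shape, assuming pydantic-style models consistent with `support_queue_env/models.py` (the field names come from the list above; exact types and validators are assumptions):
```python
from typing import Literal

from pydantic import BaseModel, Field


class SupportQueueAction(BaseModel):
    """Illustrative only; the canonical model lives in support_queue_env/models.py."""

    priority: Literal["P1", "P2", "P3", "P4"]
    queue: Literal["billing", "security", "technical", "success", "trust_safety"]
    disposition: Literal["respond", "request_info", "escalate", "close"]
    summary: str                               # short internal triage note
    response: str                              # first customer-facing reply
    confidence: float = Field(ge=0.0, le=1.0)  # self-reported confidence in [0, 1]
```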
## Observation Space
Each `reset()` and `step()` returns a typed `SupportQueueObservation` containing:
| Field | Meaning |
| --- | --- |
| `task_id`, `task_title`, `difficulty` | Active benchmark task metadata |
| `instructions` | Task-specific operating guidance |
| `current_index`, `total_tickets` | Episode progress |
| `ticket` | Current customer ticket payload |
| `allowed_priorities`, `allowed_queues`, `allowed_dispositions` | Valid discrete actions |
| `scoring_weights` | Reward decomposition |
| `last_feedback` | Previous grader output |
| `reward`, `cumulative_reward`, `done` | Episode feedback |
| `info` | Extra metadata such as `episode_id` |
The ticket payload includes the following fields (example below):
- `ticket_id`
- `subject`
- `body`
- `customer_tier`
- `product_area`
- `sla_hours`
- `recent_events`
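An illustrative ticket (all values invented for this example):
```json
{
  "ticket_id": "TCK-1042",
  "subject": "Cannot log in after password reset",
  "body": "Since resetting my password this morning, every login attempt returns a 403.",
  "customer_tier": "enterprise",
  "product_area": "auth",
  "sla_hours": 4,
  "recent_events": ["password_reset", "login_failed_x5"]
}
```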
## State Space
`state()` returns a typed `SupportQueueState` (illustrated below) with:
- active task card
- current cursor
- cumulative and average reward
- processed ticket ids
- full action history
- full per-ticket grading history
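A hypothetical response shape, with key names inferred from the list above (treat them as assumptions and check `support_queue_env/models.py` for the real schema):
```json
{
  "task": { "task_id": "easy_inbox_cleanup", "difficulty": "easy" },
  "current_index": 1,
  "cumulative_reward": 0.92,
  "average_reward": 0.92,
  "processed_ticket_ids": ["TCK-1042"],
  "action_history": [],
  "grading_history": []
}
```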
## Tasks
The benchmark includes three deterministic tasks with increasing difficulty.
| Task ID | Difficulty | Tickets | Description |
| --- | --- | ---: | --- |
| `easy_inbox_cleanup` | Easy | 2 | Straightforward access and billing tickets |
| `medium_sla_defense` | Medium | 3 | Mix of phishing escalation, webhook failure, and billing ambiguity |
| `hard_exec_escalations` | Hard | 4 | Executive-pressure tickets spanning production, security, commercial, and retention workflows |
## Reward Design
Each processed ticket receives a reward in `[0.0, 1.0]`, assembled from the weighted components below:
| Component | Weight |
| --- | ---: |
| Priority accuracy | `0.30` |
| Queue accuracy | `0.25` |
| Disposition accuracy | `0.20` |
| Summary keyword coverage | `0.15` |
| Response keyword coverage | `0.10` |
| Unsafe reply penalty | `-0.10` |
This gives useful partial progress signals. An agent can still earn reward for a good route or good reply even if one part of the triage decision is wrong.
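A sketch of how these weights could combine into a per-ticket reward (the real grader lives in `support_queue_env/grading.py`; the component scoring and clamping here are assumptions):
```python
# Weights mirror the table above; the unsafe-reply penalty is subtracted.
WEIGHTS = {
    "priority": 0.30,
    "queue": 0.25,
    "disposition": 0.20,
    "summary_coverage": 0.15,
    "response_coverage": 0.10,
}
UNSAFE_PENALTY = 0.10


def combine_reward(components: dict[str, float], unsafe_reply: bool) -> float:
    """components maps each key in WEIGHTS to a score in [0, 1]."""
    reward = sum(WEIGHTS[name] * components.get(name, 0.0) for name in WEIGHTS)
    if unsafe_reply:
        reward -= UNSAFE_PENALTY
    return max(0.0, min(1.0, reward))  # keep the documented [0.0, 1.0] bound


# Correct priority and queue, wrong disposition, partial keyword coverage:
# 0.30 + 0.25 + 0.0 + 0.15 * 0.5 + 0.10 = 0.725
print(combine_reward(
    {"priority": 1.0, "queue": 1.0, "disposition": 0.0,
     "summary_coverage": 0.5, "response_coverage": 1.0},
    unsafe_reply=False,
))
```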
## API Surface
The environment server exposes:
- `POST /reset`
- `POST /step`
- `GET /state`
- `GET /tasks`
- `GET /health`
- `GET /`
Example reset payload:
```json
{
"task_id": "easy_inbox_cleanup"
}
```
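Example step payload, assuming the server wraps the action under an `action` key (check `support_queue_env/server/app.py` for the exact envelope):
```json
{
  "action": {
    "priority": "P2",
    "queue": "technical",
    "disposition": "respond",
    "summary": "Enterprise customer blocked by 403s after password reset; likely auth regression.",
    "response": "Thanks for flagging this. We're investigating the login failures now and will update you shortly.",
    "confidence": 0.8
  }
}
```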
## Project Structure
```text
support_queue_env/
client.py
grading.py
models.py
tasks.py
server/
app.py
openenv_compat.py
support_queue_environment.py
Dockerfile
openenv.yaml
inference.py
```
## Running Locally
### Python
```bash
pip install -r requirements.txt
uvicorn support_queue_env.server.app:app --host 0.0.0.0 --port 8000
```
### Docker
```bash
docker build -t support-queue-openenv .
docker run --rm -p 8000:8000 support-queue-openenv
```
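With the server running (either way), a quick smoke test against the documented endpoints:
```bash
curl http://localhost:8000/health
curl -X POST http://localhost:8000/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": "easy_inbox_cleanup"}'
```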
## Baseline Inference
The required inference script is [inference.py](./inference.py).
It:
- uses the OpenAI Python client
- reads `API_BASE_URL`, `MODEL_NAME`, `HF_TOKEN`, and optional `LOCAL_IMAGE_NAME`
- emits structured `[START]`, `[STEP]`, and `[END]` logs
- writes `inference_results.json`
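A condensed sketch of the loop the script runs, assuming the HTTP endpoints above, an `action` request envelope, and a model that returns bare JSON; the real [inference.py](./inference.py) is the source of truth and may differ:
```python
import json
import os

import requests
from openai import OpenAI

BASE = "http://localhost:8000"  # assumed local server address
client = OpenAI(base_url=os.environ["API_BASE_URL"], api_key=os.environ["HF_TOKEN"])

obs = requests.post(f"{BASE}/reset", json={"task_id": "easy_inbox_cleanup"}).json()
print("[START]", obs["task_id"])

steps = []
while not obs["done"]:
    # Ask the model for a JSON action matching SupportQueueAction.
    completion = client.chat.completions.create(
        model=os.environ["MODEL_NAME"],
        messages=[{
            "role": "user",
            "content": "Triage this ticket; reply with bare JSON containing priority, "
                       "queue, disposition, summary, response, confidence: "
                       + json.dumps(obs["ticket"]),
        }],
    )
    action = json.loads(completion.choices[0].message.content)  # assumes bare JSON back
    obs = requests.post(f"{BASE}/step", json={"action": action}).json()
    print("[STEP]", obs["current_index"], obs["reward"])
    steps.append({"reward": obs["reward"], "feedback": obs["last_feedback"]})

print("[END]", obs["cumulative_reward"])
with open("inference_results.json", "w") as f:
    json.dump(steps, f, indent=2)
```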
Set environment variables:
```bash
export API_BASE_URL=https://api.openai.com/v1
export MODEL_NAME=gpt-4o-mini
export HF_TOKEN=your_token
export LOCAL_IMAGE_NAME=
```
Then run:
```bash
python inference.py
```
## Baseline Scores
Expected deterministic baseline scores from the bundled heuristic policy:
| Task | Score |
| --- | ---: |
| `easy_inbox_cleanup` | `1.00` |
| `medium_sla_defense` | `0.98` |
| `hard_exec_escalations` | `0.97` |
| Average | `0.98` |
## Hugging Face Space
This repository is configured for a **Docker Space**.
- front matter in `README.md` sets `sdk: docker`
- app serves on port `8000`
- `GET /health` and `POST /reset` support deployment checks
## OpenEnv Files
Core submission files:
- [openenv.yaml](./openenv.yaml)
- [inference.py](./inference.py)
- [Dockerfile](./Dockerfile)
- [support_queue_env/models.py](./support_queue_env/models.py)
- [support_queue_env/server/support_queue_environment.py](./support_queue_env/server/support_queue_environment.py)
## Submission Checklist
- typed action, observation, and state models included
- `reset()`, `step()`, and `state()` implemented
- three graded tasks included
- reward bounded to `[0.0, 1.0]`
- Dockerfile included
- Hugging Face Docker Space compatible
- root `inference.py` included