Support Queue OpenEnv

A real-world OpenEnv benchmark for SaaS support triage.

Agents must read incoming support tickets, assign the right priority, route the case to the correct internal queue, choose the next action, and draft a safe first reply. The benchmark is designed to feel like an actual support operations workflow rather than a toy task.

Why This Environment

Real support teams repeatedly solve the same high-value triage problems:

  • decide how urgent a ticket is
  • route it to the right team
  • avoid unsafe or misleading replies
  • handle ambiguous requests without over-escalating

This makes support triage a strong RL and agent-evaluation environment because success is measurable, partial credit is meaningful, and mistakes are easy to interpret.

What The Agent Does

For each ticket, the agent must produce a SupportQueueAction with:

  • priority: P1 | P2 | P3 | P4
  • queue: billing | security | technical | success | trust_safety
  • disposition: respond | request_info | escalate | close
  • summary: short internal triage note
  • response: first customer-facing reply
  • confidence: float in [0.0, 1.0]
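
For illustration, a well-formed action for a payment-failure ticket might look like the sketch below. The values are hypothetical; only the field names and allowed values listed above are defined by the environment.

# Hypothetical SupportQueueAction payload (illustrative values only)
action = {
    "priority": "P2",
    "queue": "billing",
    "disposition": "respond",
    "summary": "Renewal charge declined; verify card on file and retry the payment.",
    "response": "Thanks for reaching out. Your renewal payment did not go through. Please update your card details and we will retry the charge.",
    "confidence": 0.85,
}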

Observation Space

Both reset() and step() return a typed SupportQueueObservation containing:

  • task_id, task_title, difficulty: active benchmark task metadata
  • instructions: task-specific operating guidance
  • current_index, total_tickets: episode progress
  • ticket: current customer ticket payload
  • allowed_priorities, allowed_queues, allowed_dispositions: valid discrete actions
  • scoring_weights: reward decomposition
  • last_feedback: previous grader output
  • reward, cumulative_reward, done: episode feedback
  • info: extra metadata such as episode_id

The ticket payload includes:

  • ticket_id
  • subject
  • body
  • customer_tier
  • product_area
  • sla_hours
  • recent_events
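
A hypothetical ticket payload, shown only to illustrate the field shapes (all values are invented):

# Example ticket dict with the fields listed above (invented values)
ticket = {
    "ticket_id": "TCK-1042",
    "subject": "Cannot log in after password reset",
    "body": "I reset my password an hour ago and still get an invalid credentials error.",
    "customer_tier": "pro",
    "product_area": "auth",
    "sla_hours": 8,
    "recent_events": ["password_reset_requested", "login_failed_x3"],
}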

State Space

state() returns a typed SupportQueueState with:

  • active task card
  • current cursor
  • cumulative and average reward
  • processed ticket ids
  • full action history
  • full per-ticket grading history
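
When the server is running, the same state can be inspected over HTTP. A minimal sketch using the requests library; the exact JSON key names are assumptions based on the fields listed above:

import requests

# Fetch server-side episode state (key names assumed from SupportQueueState)
state = requests.get("http://localhost:8000/state").json()
print(state.get("cumulative_reward"), state.get("processed_ticket_ids"))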

Tasks

The benchmark includes three deterministic tasks with increasing difficulty.

  • easy_inbox_cleanup (easy, 2 tickets): straightforward access and billing tickets
  • medium_sla_defense (medium, 3 tickets): mix of phishing escalation, webhook failure, and billing ambiguity
  • hard_exec_escalations (hard, 4 tickets): executive-pressure tickets spanning production, security, commercial, and retention workflows

Reward Design

Each processed ticket gets a reward in [0.0, 1.0].

Reward components:

  • Priority accuracy: 0.30
  • Queue accuracy: 0.25
  • Disposition accuracy: 0.20
  • Summary keyword coverage: 0.15
  • Response keyword coverage: 0.10
  • Unsafe reply penalty: -0.10

This gives useful partial progress signals. An agent can still earn reward for a good route or good reply even if one part of the triage decision is wrong.
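
As a rough illustration of how the weights combine (the actual grader lives in grading.py and may differ in detail), consider a ticket where the priority and queue are correct, the disposition is wrong, keyword coverage is partial, and the reply is safe:

# Hypothetical per-component scores for one ticket (illustrative only)
weights = {"priority": 0.30, "queue": 0.25, "disposition": 0.20, "summary": 0.15, "response": 0.10}
scores  = {"priority": 1.0,  "queue": 1.0,  "disposition": 0.0,  "summary": 0.6,  "response": 0.8}

reward = sum(weights[k] * scores[k] for k in weights)  # 0.30 + 0.25 + 0.00 + 0.09 + 0.08 = 0.72
unsafe_reply = False  # no unsafe content detected in the drafted response
if unsafe_reply:
    reward -= 0.10
reward = max(0.0, min(1.0, reward))  # bounded to [0.0, 1.0]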

API Surface

The environment server exposes:

  • POST /reset
  • POST /step
  • GET /state
  • GET /tasks
  • GET /health
  • GET /

Example reset payload:

{
  "task_id": "easy_inbox_cleanup"
}
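
A minimal end-to-end sketch against a local server, using the requests library. The endpoint paths come from the list above; the exact request and response key names (for example wrapping the action under "action", or the observation fields returned by /step) are assumptions:

import requests

BASE_URL = "http://localhost:8000"

# Start an episode on the easy task
obs = requests.post(f"{BASE_URL}/reset", json={"task_id": "easy_inbox_cleanup"}).json()

done = False
while not done:
    # A trivial constant policy, purely to exercise the API
    action = {
        "priority": "P3",
        "queue": "technical",
        "disposition": "respond",
        "summary": "Triage placeholder.",
        "response": "Thanks for reaching out, we are looking into this.",
        "confidence": 0.5,
    }
    obs = requests.post(f"{BASE_URL}/step", json={"action": action}).json()
    done = obs.get("done", True)  # default to True so an unexpected schema cannot loop forever
    print(obs.get("reward"), obs.get("cumulative_reward"))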

Project Structure

support_queue_env/
  client.py
  grading.py
  models.py
  tasks.py
  server/
    app.py
    openenv_compat.py
    support_queue_environment.py
Dockerfile
openenv.yaml
inference.py

Running Locally

Python

pip install -r requirements.txt
uvicorn support_queue_env.server.app:app --host 0.0.0.0 --port 8000

Docker

docker build -t support-queue-openenv .
docker run --rm -p 8000:8000 support-queue-openenv

Baseline Inference

The required inference script is inference.py.

It:

  • uses the OpenAI Python client
  • reads API_BASE_URL, MODEL_NAME, HF_TOKEN, and optional LOCAL_IMAGE_NAME
  • emits structured [START], [STEP], and [END] logs
  • writes inference_results.json

Set environment variables:

API_BASE_URL=https://api.openai.com/v1
MODEL_NAME=gpt-4o-mini
HF_TOKEN=your_token
LOCAL_IMAGE_NAME=

Then run:

python inference.py

Baseline Scores

Expected deterministic baseline scores from the bundled heuristic policy:

  • easy_inbox_cleanup: 1.00
  • medium_sla_defense: 0.98
  • hard_exec_escalations: 0.97
  • Average: 0.98

Hugging Face Space

This repository is configured for a Docker Space.

  • front matter in README.md sets sdk: docker
  • app serves on port 8000
  • GET /health and POST /reset support deployment checks

OpenEnv Files

The core submission files are the typed models, tasks, grading logic, and environment server under support_queue_env/, together with the Dockerfile, openenv.yaml, and the root inference.py listed in Project Structure above.

Submission Checklist

  • typed action, observation, and state models included
  • reset(), step(), and state() implemented
  • three graded tasks included
  • reward bounded to [0.0, 1.0]
  • Dockerfile included
  • Hugging Face Docker Space compatible
  • root inference.py included