# 🎫 Ticket Triage – OpenEnv Environment

A real-world customer support ticket triage environment for training and evaluating AI agents. Agents learn to prioritize, categorize, route, respond to, and resolve support tickets – a genuine daily workflow at any software company.
## Why This Environment?
Support ticket triage is a high-stakes, real-world task that:

- Requires multi-step reasoning (read → prioritize → categorize → respond → close)
- Has clear objective signals (SLA compliance, classification accuracy, response quality)
- Tests language understanding (detecting urgency, product area, and customer tier)
- Scales naturally to harder tasks (easy prioritization → full workflow management)
- Is used by real teams every day – frontier model performance here has direct commercial value
## Environment Description
The environment simulates a support operations team receiving a mixed queue of incoming tickets. Each ticket includes a subject, body, customer tier (free/pro/enterprise), and an SLA deadline.
The agent takes structured actions on tickets across a full triage workflow. Shaped rewards guide the agent toward correct behavior at each sub-step, not just at episode end.
## Action Space
All actions are JSON objects:

```json
{
  "action_type": "<type>",
  "ticket_id": "<ticket_id>",
  "value": "<value or null>"
}
```
| action_type | value | Description |
|---|---|---|
| `prioritize` | `critical` \| `high` \| `medium` \| `low` | Set ticket urgency priority |
| `categorize` | `billing` \| `technical` \| `account` \| `shipping` \| `product` \| `general` | Assign product area |
| `assign` | `billing_team` \| `engineering` \| `account_team` \| `logistics` \| `product_team` \| `general_support` | Route to team |
| `respond` | Free-text response string (150–600 chars ideal) | Send a reply to the customer |
| `escalate` | `null` | Escalate to senior support / eng |
| `close` | `null` | Mark ticket as resolved |
| `skip` | `null` | Skip ticket (penalty incurred) |
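A small helper can validate actions client-side before sending them. This is an illustrative sketch, not part of the environment; the `VALID_VALUES` map and `make_action` name are assumptions, with enum values copied from the table above.

```python
# Hypothetical client-side helper; enum values are taken from the action
# table, but the helper itself is not provided by the environment.
VALID_VALUES = {
    "prioritize": {"critical", "high", "medium", "low"},
    "categorize": {"billing", "technical", "account", "shipping", "product", "general"},
    "assign": {"billing_team", "engineering", "account_team",
               "logistics", "product_team", "general_support"},
    "escalate": {None},
    "close": {None},
    "skip": {None},
}

def make_action(action_type: str, ticket_id: str, value=None) -> dict:
    """Build an action dict, rejecting invalid enum values early."""
    if action_type == "respond":
        if not isinstance(value, str):
            raise ValueError("respond requires a free-text string value")
    elif value not in VALID_VALUES.get(action_type, set()):
        raise ValueError(f"invalid value {value!r} for {action_type}")
    return {"action_type": action_type, "ticket_id": ticket_id, "value": value}
```

Catching an invalid enum locally is cheaper than burning a step in the episode budget on a rejected action.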
## Observation Space
Each step returns an Observation with:
| Field | Type | Description |
|---|---|---|
| `queue` | `list[dict]` | All tickets with current triage state (no ground truth) |
| `resolved` | `list[str]` | IDs of closed tickets |
| `current_ticket_id` | `str \| null` | First open ticket (suggested focus) |
| `time_step` | `int` | Step counter within episode |
| `sla_breaches` | `int` | Cumulative SLA violations |
| `budget_remaining` | `int` | Steps remaining before episode ends |
| `episode_done` | `bool` | Whether episode has ended |
| `info` | `dict` | Valid enum values for actions |
Each ticket in the queue exposes: `id`, `subject`, `body`, `customer_tier`, `created_at`, `sla_deadline`, `status`, `priority`, `category`, `assigned_team`, `response_sent`, `escalated`.
## Reward Function

Rewards are shaped across the full trajectory – not sparse end-of-episode signals.
| Action | Correct | Wrong / Partial |
|---|---|---|
| `prioritize` | +0.15 | +0.06 (1 level off), −0.10 (far off) |
| `categorize` | +0.15 | −0.08 |
| `assign` | +0.15 | −0.08 |
| `respond` | 0 to +0.25 | Proportional to quality heuristic |
| `escalate` (correct) | +0.10 | −0.10 if not needed |
| `close` + SLA met | +0.13 | +0.05 if SLA breached |
| `skip` | – | −0.05 |
| SLA breach (passive) | – | −0.15 per ticket |
| Over budget | – | −0.20 |
Response quality is scored heuristically on: length, professional tone, urgency matching, subject acknowledgement, escalation mentions, and structural elements (salutation, sign-off).
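The actual scorer lives in `env/response_eval.py`; the sketch below only illustrates the *kind* of checks listed above. The specific checks, thresholds, and equal weighting are assumptions, not the real implementation.

```python
# Simplified sketch of a heuristic response scorer. Every check and weight
# here is illustrative; the real scorer is env/response_eval.py.
def score_response(text: str) -> float:
    lower = text.lower()
    checks = [
        150 <= len(text) <= 600,                           # length in the ideal band
        lower.startswith(("hi", "hello", "dear")),         # salutation present
        any(s in lower for s in ("regards", "best", "thank")),  # sign-off / courtesy
        text.count("!") <= 1,                              # crude professional-tone proxy
    ]
    return sum(checks) / len(checks)
```

Because every check is a cheap string test, a scorer of this shape stays deterministic and reproducible, which is what the graders require.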
## Tasks
### Task 1 – Priority Sorting (`task_easy`) 🟢
- 10 tickets covering all priority levels and domains
- Objective: Assign the correct priority to each ticket
- Grading: Weighted partial-credit priority accuracy + step efficiency bonus
- Max steps: 30
- Expected difficulty: Solvable by any capable LLM
### Task 2 – Triage & Categorize (`task_medium`) 🟡
- 15 tickets (superset of easy)
- Objective: Correctly set priority + category + assign to the right team
- Grading: 35% priority (partial) + 35% category accuracy + 30% assignment accuracy
- Max steps: 75
- Expected difficulty: Requires domain knowledge and multi-label reasoning
### Task 3 – Full Support Workflow (`task_hard`) 🔴
- 20 tickets including PII breach, SSO failure, GDPR request
- Objective: End-to-end triage – priority, category, assign, respond, escalate when needed, close
- Grading: 15% priority + 15% category + 15% assignment + 15% resolution rate + 20% response quality + 10% escalation accuracy + 10% SLA compliance
- Max steps: 160
- Expected difficulty: Challenges frontier models – requires SLA awareness, escalation judgment, and quality response drafting
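The `task_hard` grading weights above combine into a single score as a weighted sum. A minimal sketch, using the stated weights (the component values in the test are hypothetical, and the function name is illustrative):

```python
# Composite score for task_hard from the grading weights listed above.
HARD_WEIGHTS = {
    "priority": 0.15, "category": 0.15, "assignment": 0.15,
    "resolution_rate": 0.15, "response_quality": 0.20,
    "escalation_accuracy": 0.10, "sla_compliance": 0.10,
}

def composite_score(metrics: dict) -> float:
    """Weighted sum of per-metric scores; weights must total 1.0."""
    assert abs(sum(HARD_WEIGHTS.values()) - 1.0) < 1e-9
    return sum(w * metrics[k] for k, w in HARD_WEIGHTS.items())
```

Since the weights sum to 1.0 and each metric is in [0, 1], the composite also stays in [0, 1].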
## Grader Details
All graders are deterministic and reproducible:
| Metric | Measurement |
|---|---|
| Priority accuracy | Partial credit: exact=1.0, 1-level off=0.5, 2-level=0.25, else 0 |
| Category accuracy | Exact match fraction across all tickets |
| Assignment accuracy | Exact match fraction across all tickets |
| Response quality | Heuristic scorer (0–1): length, tone, urgency, structure |
| Escalation accuracy | TP + TN rate: escalated iff `requires_escalation=True` |
| SLA compliance | 1 − (breaches / total tickets) |
| Resolution rate | Fraction of tickets closed |
Scores are always in [0.0, 1.0].
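The partial-credit priority rule from the table can be sketched directly (function names are illustrative, not the grader's actual API):

```python
# Partial-credit priority scoring as described in the grader table:
# exact = 1.0, one level off = 0.5, two levels off = 0.25, else 0.
LEVELS = ["low", "medium", "high", "critical"]

def priority_credit(predicted: str, truth: str) -> float:
    distance = abs(LEVELS.index(predicted) - LEVELS.index(truth))
    return {0: 1.0, 1: 0.5, 2: 0.25}.get(distance, 0.0)

def priority_accuracy(pairs: list[tuple[str, str]]) -> float:
    """Mean partial credit over (predicted, truth) pairs."""
    return sum(priority_credit(p, t) for p, t in pairs) / len(pairs)
```

Averaging per-ticket credits keeps the metric in [0.0, 1.0], consistent with the other graders.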
## Setup & Usage
### Option 1 – Python (local)

```bash
git clone https://huggingface.co/spaces/your-org/ticket-triage-env
cd ticket-triage-env
pip install -r requirements.txt

# Start API server
python app.py
# → http://localhost:7860/docs
```
### Option 2 – Docker

```bash
docker build -t ticket-triage-env .
docker run -p 7860:7860 ticket-triage-env
# → http://localhost:7860/docs
```
### Option 3 – Python API (no server)

```python
from env import TicketTriageEnv, Action, ActionType

env = TicketTriageEnv()
obs = env.reset("task_easy")

action = Action(action_type=ActionType.PRIORITIZE, ticket_id="T001", value="critical")
obs, reward, done, info = env.step(action)
print(f"Reward: {reward.total} – {reward.reason}")

result = env.grade()
print(f"Final score: {result.final_score:.4f}")
```
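The single-step pattern above extends naturally to a full episode. A minimal driver loop, assuming the same `reset`/`step`/`grade` interface; `policy` is any caller-supplied callable mapping an observation to an action, not something the environment provides:

```python
# Episode-loop sketch over the documented reset/step/grade interface.
# `policy` maps an observation to an action and is an assumption here.
def run_episode(env, policy, task_id="task_easy"):
    obs = env.reset(task_id)
    total_reward = 0.0
    done = False
    while not done:
        action = policy(obs)
        obs, reward, done, info = env.step(action)
        total_reward += reward.total
    return total_reward, env.grade()
```

The loop terminates either when all tickets are handled or when the step budget runs out, since both set `done`.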
## Baseline Scores
Run the baseline inference script against all 3 tasks:

```bash
export OPENAI_API_KEY=sk-...
export OPENAI_MODEL=gpt-4o-mini  # or gpt-4o, claude-*, etc.
python baseline.py
```
### Baseline Results (gpt-4o-mini)

| Task | Score | Notes |
|---|---|---|
| `task_easy` | ~0.82 | Strong priority classification |
| `task_medium` | ~0.64 | Category + assignment errors drag the score down |
| `task_hard` | ~0.41 | SLA misses and shallow responses hurt quality |
| Average | ~0.62 | |
Scores are reproducible with `temperature=0.2` and the provided system prompt.
## Running Tests

```bash
pytest tests/ -v
```
All tests are deterministic and pass without any API keys.
## API Reference
Once running, visit `/docs` for interactive Swagger documentation.
| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Health check |
| `/info` | GET | Environment metadata and valid values |
| `/reset` | POST | Start a new episode (returns `session_id`) |
| `/step` | POST | Take one action (returns obs/reward/done) |
| `/state` | POST | Full internal state (includes ground truth) |
| `/grade` | POST | Compute final episode score |
### Quick API Example

```bash
# 1. Reset
curl -X POST http://localhost:7860/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": "task_easy"}'
# → {"session_id": "abc123", "observation": {...}}

# 2. Step
curl -X POST http://localhost:7860/step \
  -H "Content-Type: application/json" \
  -d '{"session_id": "abc123", "action": {"action_type": "prioritize", "ticket_id": "T001", "value": "critical"}}'
# → {"reward": {"total": 0.15, ...}, "done": false, ...}

# 3. Grade
curl -X POST http://localhost:7860/grade \
  -H "Content-Type: application/json" \
  -d '{"session_id": "abc123"}'
# → {"result": {"final_score": 0.87, ...}}
```
## Project Structure

```
ticket-triage-env/
├── openenv.yaml          # OpenEnv spec metadata
├── app.py                # FastAPI server (HF Spaces entry)
├── baseline.py           # Baseline inference script
├── requirements.txt
├── Dockerfile
├── env/
│   ├── __init__.py
│   ├── models.py         # Typed Pydantic models (Observation, Action, Reward)
│   ├── environment.py    # TicketTriageEnv – full OpenEnv implementation
│   ├── tickets.py        # Realistic ticket dataset with ground truth
│   ├── graders.py        # Deterministic task graders (0.0–1.0)
│   └── response_eval.py  # Heuristic response quality scorer
└── tests/
    └── test_environment.py
```
## License

MIT License – free to use for research and commercial applications.