# 🎫 Ticket Triage – OpenEnv Environment

A real-world customer support ticket triage environment for training and evaluating AI agents. Agents learn to prioritize, categorize, route, respond to, and resolve support tickets – a genuine daily workflow at any software company.
## Why This Environment?
Support ticket triage is a high-stakes, real-world task that:

- Requires multi-step reasoning (read → prioritize → categorize → respond → close)
- Has clear objective signals (SLA compliance, classification accuracy, response quality)
- Tests language understanding (detecting urgency, product area, and customer tier)
- Scales naturally to harder tasks (easy prioritization → full workflow management)
- Is used by real teams every day – frontier model performance here has direct commercial value
## Environment Description
The environment simulates a support operations team receiving a mixed queue of incoming tickets. Each ticket includes a subject, body, customer tier (free/pro/enterprise), and an SLA deadline.
The agent takes structured actions on tickets across a full triage workflow. Shaped rewards guide the agent toward correct behavior at each sub-step, not just at episode end.
## Action Space
All actions are JSON objects:

```json
{
  "action_type": "<type>",
  "ticket_id": "<ticket_id>",
  "value": "<value or null>"
}
```
| action_type | value | Description |
|---|---|---|
| `prioritize` | `critical` \| `high` \| `medium` \| `low` | Set ticket urgency priority |
| `categorize` | `billing` \| `technical` \| `account` \| `shipping` \| `product` \| `general` | Assign product area |
| `assign` | `billing_team` \| `engineering` \| `account_team` \| `logistics` \| `product_team` \| `general_support` | Route to team |
| `respond` | Free-text response string (150–600 chars ideal) | Send a reply to the customer |
| `escalate` | `null` | Escalate to senior support / eng |
| `close` | `null` | Mark ticket as resolved |
| `skip` | `null` | Skip ticket (penalty incurred) |
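A small helper can validate actions client-side before sending them. This is an illustrative sketch, not part of the environment; the `VALID_VALUES` map and `make_action` name are assumptions, with enum values copied from the table above.

```python
# Hypothetical client-side helper; enum values are taken from the action
# table, but the helper itself is not provided by the environment.
VALID_VALUES = {
    "prioritize": {"critical", "high", "medium", "low"},
    "categorize": {"billing", "technical", "account", "shipping", "product", "general"},
    "assign": {"billing_team", "engineering", "account_team",
               "logistics", "product_team", "general_support"},
    "escalate": {None},
    "close": {None},
    "skip": {None},
}

def make_action(action_type: str, ticket_id: str, value=None) -> dict:
    """Build an action dict, rejecting invalid enum values early."""
    if action_type == "respond":
        if not isinstance(value, str):
            raise ValueError("respond requires a free-text string value")
    elif value not in VALID_VALUES.get(action_type, set()):
        raise ValueError(f"invalid value {value!r} for {action_type}")
    return {"action_type": action_type, "ticket_id": ticket_id, "value": value}
```

Catching an invalid enum locally is cheaper than burning a step in the episode budget on a rejected action.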
## Observation Space
Each step returns an Observation with:
| Field | Type | Description |
|---|---|---|
| `queue` | `list[dict]` | All tickets with current triage state (no ground truth) |
| `resolved` | `list[str]` | IDs of closed tickets |
| `current_ticket_id` | `str \| null` | First open ticket (suggested focus) |
| `time_step` | `int` | Step counter within episode |
| `sla_breaches` | `int` | Cumulative SLA violations |
| `budget_remaining` | `int` | Steps remaining before episode ends |
| `episode_done` | `bool` | Whether episode has ended |
| `info` | `dict` | Valid enum values for actions |
Each ticket in the queue exposes: `id`, `subject`, `body`, `customer_tier`, `created_at`, `sla_deadline`, `status`, `priority`, `category`, `assigned_team`, `response_sent`, `escalated`.
## Reward Function

Rewards are shaped across the full trajectory – not sparse end-of-episode signals.
| Action | Correct | Wrong / Partial |
|---|---|---|
| `prioritize` | +0.15 | +0.06 (1 level off), −0.10 (far off) |
| `categorize` | +0.15 | −0.08 |
| `assign` | +0.15 | −0.08 |
| `respond` | 0 to +0.25 | Proportional to quality heuristic |
| `escalate` (correct) | +0.10 | −0.10 if not needed |
| `close` + SLA met | +0.13 | +0.05 if SLA breached |
| `skip` | – | −0.05 |
| SLA breach (passive) | – | −0.15 per ticket |
| Over budget | – | −0.20 |
Response quality is scored heuristically on: length, professional tone, urgency matching, subject acknowledgement, escalation mentions, and structural elements (salutation, sign-off).
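The actual scorer lives in `env/response_eval.py`; the sketch below only illustrates the *kind* of checks listed above. The specific checks, thresholds, and equal weighting are assumptions, not the real implementation.

```python
# Simplified sketch of a heuristic response scorer. Every check and weight
# here is illustrative; the real scorer is env/response_eval.py.
def score_response(text: str) -> float:
    lower = text.lower()
    checks = [
        150 <= len(text) <= 600,                           # length in the ideal band
        lower.startswith(("hi", "hello", "dear")),         # salutation present
        any(s in lower for s in ("regards", "best", "thank")),  # sign-off / courtesy
        text.count("!") <= 1,                              # crude professional-tone proxy
    ]
    return sum(checks) / len(checks)
```

Because every check is a cheap string test, a scorer of this shape stays deterministic and reproducible, which is what the graders require.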
## Tasks
### Task 1 – Priority Sorting (`task_easy`) 🟢
- 10 tickets covering all priority levels and domains
- Objective: Assign the correct priority to each ticket
- Grading: Weighted partial-credit priority accuracy + step efficiency bonus
- Max steps: 30
- Expected difficulty: Solvable by any capable LLM
### Task 2 – Triage & Categorize (`task_medium`) 🟡
- 15 tickets (superset of easy)
- Objective: Correctly set priority + category + assign to the right team
- Grading: 35% priority (partial) + 35% category accuracy + 30% assignment accuracy
- Max steps: 75
- Expected difficulty: Requires domain knowledge and multi-label reasoning
### Task 3 – Full Support Workflow (`task_hard`) 🔴
- 20 tickets including PII breach, SSO failure, GDPR request
- Objective: End-to-end triage – priority, category, assign, respond, escalate when needed, close
- Grading: 15% priority + 15% category + 15% assignment + 15% resolution rate + 20% response quality + 10% escalation accuracy + 10% SLA compliance
- Max steps: 160
- Expected difficulty: Challenges frontier models – requires SLA awareness, escalation judgment, and quality response drafting
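The `task_hard` grading weights above combine into a single score as a weighted sum. A minimal sketch, using the stated weights (the component values in the test are hypothetical, and the function name is illustrative):

```python
# Composite score for task_hard from the grading weights listed above.
HARD_WEIGHTS = {
    "priority": 0.15, "category": 0.15, "assignment": 0.15,
    "resolution_rate": 0.15, "response_quality": 0.20,
    "escalation_accuracy": 0.10, "sla_compliance": 0.10,
}

def composite_score(metrics: dict) -> float:
    """Weighted sum of per-metric scores; weights must total 1.0."""
    assert abs(sum(HARD_WEIGHTS.values()) - 1.0) < 1e-9
    return sum(w * metrics[k] for k, w in HARD_WEIGHTS.items())
```

Since the weights sum to 1.0 and each metric is in [0, 1], the composite also stays in [0, 1].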
## Grader Details
All graders are deterministic and reproducible:
| Metric | Measurement |
|---|---|
| Priority accuracy | Partial credit: exact=1.0, 1-level off=0.5, 2-level=0.25, else 0 |
| Category accuracy | Exact match fraction across all tickets |
| Assignment accuracy | Exact match fraction across all tickets |
| Response quality | Heuristic scorer (0–1): length, tone, urgency, structure |
| Escalation accuracy | TP + TN rate: escalated iff `requires_escalation=True` |
| SLA compliance | 1 − (breaches / total tickets) |
| Resolution rate | Fraction of tickets closed |
Scores are always in [0.0, 1.0].
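The partial-credit priority rule from the table can be sketched directly (function names are illustrative, not the grader's actual API):

```python
# Partial-credit priority scoring as described in the grader table:
# exact = 1.0, one level off = 0.5, two levels off = 0.25, else 0.
LEVELS = ["low", "medium", "high", "critical"]

def priority_credit(predicted: str, truth: str) -> float:
    distance = abs(LEVELS.index(predicted) - LEVELS.index(truth))
    return {0: 1.0, 1: 0.5, 2: 0.25}.get(distance, 0.0)

def priority_accuracy(pairs: list[tuple[str, str]]) -> float:
    """Mean partial credit over (predicted, truth) pairs."""
    return sum(priority_credit(p, t) for p, t in pairs) / len(pairs)
```

Averaging per-ticket credits keeps the metric in [0.0, 1.0], consistent with the other graders.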
## Setup & Usage
### Option 1 – Python (local)

```bash
git clone https://huggingface.co/spaces/your-org/ticket-triage-env
cd ticket-triage-env
pip install -r requirements.txt

# Start API server
python app.py
# → http://localhost:7860/docs
```
### Option 2 – Docker

```bash
docker build -t ticket-triage-env .
docker run -p 7860:7860 ticket-triage-env
# → http://localhost:7860/docs
```
### Option 3 – Python API (no server)

```python
from env import TicketTriageEnv, Action, ActionType

env = TicketTriageEnv()
obs = env.reset("task_easy")

action = Action(action_type=ActionType.PRIORITIZE, ticket_id="T001", value="critical")
obs, reward, done, info = env.step(action)
print(f"Reward: {reward.total} – {reward.reason}")

result = env.grade()
print(f"Final score: {result.final_score:.4f}")
```
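The single-step pattern above extends naturally to a full episode. A minimal driver loop, assuming the same `reset`/`step`/`grade` interface; `policy` is any caller-supplied callable mapping an observation to an action, not something the environment provides:

```python
# Episode-loop sketch over the documented reset/step/grade interface.
# `policy` maps an observation to an action and is an assumption here.
def run_episode(env, policy, task_id="task_easy"):
    obs = env.reset(task_id)
    total_reward = 0.0
    done = False
    while not done:
        action = policy(obs)
        obs, reward, done, info = env.step(action)
        total_reward += reward.total
    return total_reward, env.grade()
```

The loop terminates either when all tickets are handled or when the step budget runs out, since both set `done`.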
## Baseline Scores
Run the baseline inference script against all 3 tasks:

```bash
export OPENAI_API_KEY=sk-...
export OPENAI_MODEL=gpt-4o-mini  # or gpt-4o, claude-*, etc.
python baseline.py
```
### Baseline Results (gpt-4o-mini)

| Task | Score | Notes |
|---|---|---|
| `task_easy` | ~0.82 | Strong priority classification |
| `task_medium` | ~0.64 | Category + assignment errors drag the score down |
| `task_hard` | ~0.41 | SLA misses and shallow responses hurt quality |
| Average | ~0.62 | |
Scores are reproducible with `temperature=0.2` and the provided system prompt.
## Running Tests

```bash
pytest tests/ -v
```
All tests are deterministic and pass without any API keys.
## API Reference
Once running, visit `/docs` for interactive Swagger documentation.
| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Health check |
| `/info` | GET | Environment metadata and valid values |
| `/reset` | POST | Start a new episode (returns `session_id`) |
| `/step` | POST | Take one action (returns obs/reward/done) |
| `/state` | POST | Full internal state (includes ground truth) |
| `/grade` | POST | Compute final episode score |
### Quick API Example

```bash
# 1. Reset
curl -X POST http://localhost:7860/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": "task_easy"}'
# → {"session_id": "abc123", "observation": {...}}

# 2. Step
curl -X POST http://localhost:7860/step \
  -H "Content-Type: application/json" \
  -d '{"session_id": "abc123", "action": {"action_type": "prioritize", "ticket_id": "T001", "value": "critical"}}'
# → {"reward": {"total": 0.15, ...}, "done": false, ...}

# 3. Grade
curl -X POST http://localhost:7860/grade \
  -H "Content-Type: application/json" \
  -d '{"session_id": "abc123"}'
# → {"result": {"final_score": 0.87, ...}}
```
## Project Structure

```
ticket-triage-env/
├── openenv.yaml          # OpenEnv spec metadata
├── app.py                # FastAPI server (HF Spaces entry)
├── baseline.py           # Baseline inference script
├── requirements.txt
├── Dockerfile
├── env/
│   ├── __init__.py
│   ├── models.py         # Typed Pydantic models (Observation, Action, Reward)
│   ├── environment.py    # TicketTriageEnv – full OpenEnv implementation
│   ├── tickets.py        # Realistic ticket dataset with ground truth
│   ├── graders.py        # Deterministic task graders (0.0–1.0)
│   └── response_eval.py  # Heuristic response quality scorer
└── tests/
    └── test_environment.py
```
## License

MIT License – free to use for research and commercial applications.