🎫 Ticket Triage — OpenEnv Environment

A real-world customer support ticket triage environment for training and evaluating AI agents. Agents learn to prioritize, categorize, route, respond to, and resolve support tickets — a genuine daily workflow at any software company.


Why This Environment?

Support ticket triage is a high-stakes, real-world task that:

  • Requires multi-step reasoning (read β†’ prioritize β†’ categorize β†’ respond β†’ close)
  • Has clear objective signals (SLA compliance, classification accuracy, response quality)
  • Tests language understanding (detecting urgency, product area, and customer tier)
  • Scales naturally to harder tasks (easy prioritization β†’ full workflow management)
  • Is used by real teams every day β€” frontier model performance here has direct commercial value

Environment Description

The environment simulates a support operations team receiving a mixed queue of incoming tickets. Each ticket includes a subject, body, customer tier (free/pro/enterprise), and an SLA deadline.

The agent takes structured actions on tickets across a full triage workflow. Shaped rewards guide the agent toward correct behavior at each sub-step, not just at episode end.


Action Space

All actions are JSON objects:

```json
{
  "action_type": "<type>",
  "ticket_id": "<ticket_id>",
  "value": "<value or null>"
}
```
| action_type | value | Description |
| --- | --- | --- |
| prioritize | critical \| high \| medium \| low | Set ticket urgency priority |
| categorize | billing \| technical \| account \| shipping \| product \| general | Assign product area |
| assign | billing_team \| engineering \| account_team \| logistics \| product_team \| general_support | Route to team |
| respond | Free-text response string (150–600 chars ideal) | Send a reply to the customer |
| escalate | null | Escalate to senior support / eng |
| close | null | Mark ticket as resolved |
| skip | null | Skip ticket (penalty incurred) |
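For concreteness, here is one well-formed payload per action type. The ticket ID `T001` and the response text are placeholders, not entries from the real dataset:

```python
# Illustrative action payloads, one per action_type.
# Ticket IDs ("T001") and the response text are placeholders.
actions = [
    {"action_type": "prioritize", "ticket_id": "T001", "value": "critical"},
    {"action_type": "categorize", "ticket_id": "T001", "value": "technical"},
    {"action_type": "assign", "ticket_id": "T001", "value": "engineering"},
    {"action_type": "respond", "ticket_id": "T001",
     "value": "Hi, thanks for reporting the outage. We are investigating now."},
    {"action_type": "escalate", "ticket_id": "T001", "value": None},
    {"action_type": "close", "ticket_id": "T001", "value": None},
    {"action_type": "skip", "ticket_id": "T001", "value": None},
]
```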

Observation Space

Each step returns an Observation with:

| Field | Type | Description |
| --- | --- | --- |
| queue | list[dict] | All tickets with current triage state (no ground truth) |
| resolved | list[str] | IDs of closed tickets |
| current_ticket_id | str \| null | First open ticket (suggested focus) |
| time_step | int | Step counter within episode |
| sla_breaches | int | Cumulative SLA violations |
| budget_remaining | int | Steps remaining before episode ends |
| episode_done | bool | Whether episode has ended |
| info | dict | Valid enum values for actions |

Each ticket in the queue exposes: id, subject, body, customer_tier, created_at, sla_deadline, status, priority, category, assigned_team, response_sent, escalated.
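As an illustration, a simple policy might focus on the open ticket whose SLA deadline is nearest. A minimal sketch, assuming the observation and its tickets are plain dicts with the field names above (the ticket contents here are invented):

```python
from typing import Optional

def next_open_ticket(obs: dict) -> Optional[str]:
    """Return the id of the open ticket with the earliest SLA deadline."""
    open_tickets = [t for t in obs["queue"] if t["status"] == "open"]
    if not open_tickets:
        return None
    # ISO-8601 timestamps compare correctly as strings.
    return min(open_tickets, key=lambda t: t["sla_deadline"])["id"]

# Invented observation fragment using the fields from the table above:
obs = {
    "queue": [
        {"id": "T001", "status": "open", "sla_deadline": "2024-01-02T12:00:00"},
        {"id": "T002", "status": "closed", "sla_deadline": "2024-01-01T09:00:00"},
        {"id": "T003", "status": "open", "sla_deadline": "2024-01-01T18:00:00"},
    ],
    "resolved": ["T002"],
}
print(next_open_ticket(obs))  # T003
```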


Reward Function

Rewards are shaped across the full trajectory — not sparse end-of-episode signals.

| Action | Correct | Wrong / Partial |
| --- | --- | --- |
| prioritize | +0.15 | +0.06 (1 level off), −0.10 (far off) |
| categorize | +0.15 | −0.08 |
| assign | +0.15 | −0.08 |
| respond | 0 to +0.25 | Proportional to quality heuristic |
| escalate (correct) | +0.10 | −0.10 if not needed |
| close + SLA met | +0.13 | +0.05 if SLA breached |
| skip | — | −0.05 |
| SLA breach (passive) | — | −0.15 per ticket |
| Over budget | — | −0.20 |

Response quality is scored heuristically on: length, professional tone, urgency matching, subject acknowledgement, escalation mentions, and structural elements (salutation, sign-off).
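The shipped scorer lives in `env/response_eval.py`; the toy sketch below illustrates the same style of heuristic, but its specific checks and weights are invented for illustration and do not match the real scorer:

```python
def score_response(text: str, subject: str) -> float:
    """Toy response-quality heuristic in [0, 1]: length band, structure, subject acknowledgement."""
    score = 0.0
    if 150 <= len(text) <= 600:                                       # ideal length band
        score += 0.4
    if text.lower().startswith(("hi", "hello", "dear")):              # salutation
        score += 0.2
    if any(s in text.lower() for s in ("regards", "best", "thank you")):  # sign-off
        score += 0.2
    if any(w in text.lower() for w in subject.lower().split()):       # subject acknowledged
        score += 0.2
    return round(score, 2)

good = (
    "Hi Alex, thanks for flagging the billing error on your latest invoice. "
    "We have refunded the duplicate charge, and the credit should appear on "
    "your statement within three to five business days. Best regards, the Support team"
)
print(score_response(good, "Billing error"))  # 1.0
```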


Tasks

Task 1 — Priority Sorting (task_easy) 🟢

  • 10 tickets covering all priority levels and domains
  • Objective: Assign the correct priority to each ticket
  • Grading: Weighted partial-credit priority accuracy + step efficiency bonus
  • Max steps: 30
  • Expected difficulty: Solvable by any capable LLM

Task 2 — Triage & Categorize (task_medium) 🟡

  • 15 tickets (superset of easy)
  • Objective: Correctly set priority + category + assign to the right team
  • Grading: 35% priority (partial) + 35% category accuracy + 30% assignment accuracy
  • Max steps: 75
  • Expected difficulty: Requires domain knowledge and multi-label reasoning

Task 3 — Full Support Workflow (task_hard) 🔴

  • 20 tickets including PII breach, SSO failure, GDPR request
  • Objective: End-to-end triage — priority, category, assign, respond, escalate when needed, close
  • Grading: 15% priority + 15% category + 15% assignment + 15% resolution rate + 20% response quality + 10% escalation accuracy + 10% SLA compliance
  • Max steps: 160
  • Expected difficulty: Challenges frontier models — requires SLA awareness, escalation judgment, and quality response drafting

Grader Details

All graders are deterministic and reproducible:

| Metric | Measurement |
| --- | --- |
| Priority accuracy | Partial credit: exact=1.0, 1 level off=0.5, 2 levels off=0.25, else 0 |
| Category accuracy | Exact match fraction across all tickets |
| Assignment accuracy | Exact match fraction across all tickets |
| Response quality | Heuristic scorer (0–1): length, tone, urgency, structure |
| Escalation accuracy | TP + TN rate: escalated iff requires_escalation=True |
| SLA compliance | 1 − (breaches / total tickets) |
| Resolution rate | Fraction of tickets closed |

Scores are always in [0.0, 1.0].
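The priority row of the table can be sketched directly, assuming the four priority levels form an ordered scale:

```python
# Ordered priority scale; distance between levels determines the credit.
LEVELS = ["low", "medium", "high", "critical"]

def priority_credit(predicted: str, truth: str) -> float:
    """Partial credit: exact=1.0, one level off=0.5, two levels off=0.25, else 0."""
    distance = abs(LEVELS.index(predicted) - LEVELS.index(truth))
    return {0: 1.0, 1: 0.5, 2: 0.25}.get(distance, 0.0)

print(priority_credit("high", "high"))     # 1.0
print(priority_credit("medium", "high"))   # 0.5
print(priority_credit("low", "critical"))  # 0.0
```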


Setup & Usage

Option 1 — Python (local)

```bash
git clone https://huggingface.co/spaces/your-org/ticket-triage-env
cd ticket-triage-env
pip install -r requirements.txt

# Start API server
python app.py
# → http://localhost:7860/docs
```

Option 2 — Docker

```bash
docker build -t ticket-triage-env .
docker run -p 7860:7860 ticket-triage-env
# → http://localhost:7860/docs
```

Option 3 — Python API (no server)

```python
from env import TicketTriageEnv, Action, ActionType

env = TicketTriageEnv()
obs = env.reset("task_easy")

action = Action(action_type=ActionType.PRIORITIZE, ticket_id="T001", value="critical")
obs, reward, done, info = env.step(action)

print(f"Reward: {reward.total} — {reward.reason}")
result = env.grade()
print(f"Final score: {result.final_score:.4f}")
```

Baseline Scores

Run the baseline inference script against all 3 tasks:

```bash
export OPENAI_API_KEY=sk-...
export OPENAI_MODEL=gpt-4o-mini   # or gpt-4o, claude-*, etc.
python baseline.py
```

Baseline Results (gpt-4o-mini)

| Task | Score | Notes |
| --- | --- | --- |
| task_easy | ~0.82 | Strong priority classification |
| task_medium | ~0.64 | Category + assignment errors drag score |
| task_hard | ~0.41 | SLA misses and shallow responses hurt quality |
| Average | ~0.62 | |

Scores are reproducible with temperature=0.2 and the provided system prompt.


Running Tests

```bash
pytest tests/ -v
```

All tests are deterministic and pass without any API keys.


API Reference

Once running, visit /docs for interactive Swagger documentation.

| Endpoint | Method | Description |
| --- | --- | --- |
| /health | GET | Health check |
| /info | GET | Environment metadata and valid values |
| /reset | POST | Start a new episode (returns session_id) |
| /step | POST | Take one action (returns obs/reward/done) |
| /state | POST | Full internal state (includes ground truth) |
| /grade | POST | Compute final episode score |

Quick API Example

```bash
# 1. Reset
curl -X POST http://localhost:7860/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": "task_easy"}'
# → {"session_id": "abc123", "observation": {...}}

# 2. Step
curl -X POST http://localhost:7860/step \
  -H "Content-Type: application/json" \
  -d '{"session_id": "abc123", "action": {"action_type": "prioritize", "ticket_id": "T001", "value": "critical"}}'
# → {"reward": {"total": 0.15, ...}, "done": false, ...}

# 3. Grade
curl -X POST http://localhost:7860/grade \
  -H "Content-Type: application/json" \
  -d '{"session_id": "abc123"}'
# → {"result": {"final_score": 0.87, ...}}
```
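The same three calls can be scripted with only the Python standard library. This is a sketch against the request/response shapes shown in the curl examples, and the demo at the bottom needs the server from Option 1 or 2 running:

```python
import json
import urllib.request

BASE = "http://localhost:7860"

def post(path: str, payload: dict) -> dict:
    """POST a JSON payload to the environment server and decode the JSON reply."""
    req = urllib.request.Request(
        BASE + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":  # requires a running server
    session = post("/reset", {"task_id": "task_easy"})["session_id"]
    step = post("/step", {
        "session_id": session,
        "action": {"action_type": "prioritize", "ticket_id": "T001", "value": "critical"},
    })
    print(step["reward"]["total"], step["done"])
    print(post("/grade", {"session_id": session}))
```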

Project Structure

```
ticket-triage-env/
├── openenv.yaml          # OpenEnv spec metadata
├── app.py                # FastAPI server (HF Spaces entry)
├── baseline.py           # Baseline inference script
├── requirements.txt
├── Dockerfile
├── env/
│   ├── __init__.py
│   ├── models.py         # Typed Pydantic models (Observation, Action, Reward)
│   ├── environment.py    # TicketTriageEnv — full OpenEnv implementation
│   ├── tickets.py        # Realistic ticket dataset with ground truth
│   ├── graders.py        # Deterministic task graders (0.0–1.0)
│   └── response_eval.py  # Heuristic response quality scorer
└── tests/
    └── test_environment.py
```

License

MIT License — free to use for research and commercial applications.
