# Product Requirements Document (PRD): Support Ticket Environment for OpenEnv

## 1. Introduction and Objectives
The **Support Ticket Environment** aims to test Large Language Models (LLMs) and agentic frameworks in a highly realistic, consequence-driven enterprise setting. Customer support resolution requires strict adherence to internal policies, information verification, and multi-step reasoning before taking terminal actions (e.g., refunds or escalations). 

**Objective**: Provide an OpenEnv-compliant simulation where an agent assumes the role of a support professional. The environment acts as an adversarial and deterministic evaluator to cleanly quantify an agent's ability to gather state, read contextual rules, and execute appropriate API actions.

## 2. Real-World Utility
Most AI evaluations focus on static benchmarks (e.g., MMLU) or gamified environments (e.g., Minecraft). However, the most immediate commercial application of agentic AI is customer support automation.

### The Problem
Companies lose millions to unchecked LLM agents hallucinating policies, issuing improper refunds, or frustrating high-tier enterprise clients.

### The Solution
This environment models the actual complexity of a ticketing system. It requires agents to verify `UserData`, correctly map each `IssueType` to its governing `Policy`, and avoid destructive actions (such as abruptly rejecting an enterprise client) under pressure or when faced with ambiguous queries.

## 3. Environment Architecture

### State Boundaries
- Each task begins with a newly opened ticket.
- The episode terminates either when the agent explicitly uses a terminal action (`close_ticket`, `escalate`) or after reaching the hard threshold of $N=10$ steps.

### Action Constraints
- Intermediate actions (`fetch_user_data`, `check_policy`) do not alter the external ticket state but provide critical context.
- Terminal actions irreversibly mutate the state and trigger evaluation.
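The state-boundary and action rules above can be sketched as a minimal step loop. The class and constant names here are illustrative, not part of the OpenEnv API:

```python
from dataclasses import dataclass
from enum import Enum


class ActionType(Enum):
    # Intermediate actions: read-only, provide context
    FETCH_USER_DATA = "fetch_user_data"
    CHECK_POLICY = "check_policy"
    # Terminal actions: irreversibly mutate state and trigger evaluation
    CLOSE_TICKET = "close_ticket"
    ESCALATE = "escalate"


TERMINAL_ACTIONS = {ActionType.CLOSE_TICKET, ActionType.ESCALATE}
MAX_STEPS = 10  # the hard threshold N = 10


@dataclass
class TicketEnv:
    steps_taken: int = 0
    done: bool = False

    def step(self, action: ActionType) -> bool:
        """Apply one action; return True if the episode has terminated."""
        if self.done:
            raise RuntimeError("Episode already terminated")
        self.steps_taken += 1
        if action in TERMINAL_ACTIONS or self.steps_taken >= MAX_STEPS:
            self.done = True  # terminal action or step budget exhausted
        return self.done
```

Note that intermediate actions never set `done`: only a terminal action or the step budget ends the episode.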

### Grading and Reward Shaping
- Graders are strictly deterministic.
- Fractional rewards are awarded for necessary intermediate context-gathering steps (promoting chain-of-thought grounding).
- Sharp penalties are applied for protocol violations (e.g., escalating a simple refund directly to billing Tier 2).
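A deterministic reward function following this shaping scheme might look like the sketch below. The reward magnitudes and parameter names are assumptions chosen for illustration:

```python
# Hypothetical reward shaping: constant names and magnitudes are illustrative.
INTERMEDIATE_REWARD = 0.1   # fractional credit for a necessary context step
VIOLATION_PENALTY = -1.0    # sharp penalty for a protocol violation
SUCCESS_REWARD = 1.0        # full credit for a correct terminal action


def shaped_reward(action: str, required_context: set, gathered: set,
                  violated_protocol: bool, correct_resolution: bool) -> float:
    """Deterministic reward for a single step."""
    if violated_protocol:
        return VIOLATION_PENALTY
    if action in required_context and action not in gathered:
        return INTERMEDIATE_REWARD  # first time gathering a needed fact
    if action in ("close_ticket", "escalate"):
        return SUCCESS_REWARD if correct_resolution else VIOLATION_PENALTY
    return 0.0  # redundant or neutral step earns nothing
```

Because every branch is a pure function of its inputs, two runs of the same episode always grade identically.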

## 4. Required Agent Capabilities
To succeed on hard tasks, an agent must demonstrate:
- **State Management**: Remembering the constraints of the `policy` retrieved earlier in the episode.
- **Self-Correction**: Adapting if `fetch_user_data` returns constraints (e.g., the user is not a premium member).
- **Nuanced Execution**: Generating an appropriately apologetic `reply_to_customer` response on a high-stakes failure ticket.

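The first two capabilities can be made concrete with a minimal agent scaffold. `FakeEnv` and its canned tool responses are stand-ins invented for this sketch, not part of the environment's API:

```python
class FakeEnv:
    """Canned responses so the sketch runs standalone (illustrative only)."""

    def step(self, action):
        responses = {
            "fetch_user_data": {"premium": False},
            "check_policy": {"refund_limit": 50},
            "escalate": "escalated",
            "close_ticket": "closed",
        }
        return responses[action]


def run_episode(env):
    memory = {}  # state management: persists for the whole episode
    memory["user"] = env.step("fetch_user_data")
    memory["policy"] = env.step("check_policy")  # remembered constraints

    # Self-correction: adapt when user data rules out the default path
    if not memory["user"].get("premium"):
        return env.step("escalate")
    return env.step("close_ticket")
```

The episode-local `memory` dict is what lets the agent honor a policy retrieved several steps earlier instead of re-deriving (or hallucinating) it.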
## 5. Evaluation Criteria

### Core Metrics
- **Task Completion Rate**: Fraction of tasks completed successfully.
- **Protocol Adherence**: Fraction of steps that align with the defined policy.
- **Efficiency**: Average number of steps taken to complete a task.

### Grader Outputs
Grader outputs are JSON objects with the following fields:
```json
{
  "task_id": "task_hard_1",
  "score": 0.8,
  "violations": ["policy_violation", "premature_closure"]
}
```
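A harness consuming grader output could validate it with a small dataclass like the following sketch (field names taken from the JSON schema above; the range check on `score` is an assumption):

```python
import json
from dataclasses import dataclass


@dataclass
class GraderResult:
    task_id: str
    score: float        # assumed to be a fraction of full credit in [0, 1]
    violations: list    # protocol violations observed during the episode

    @classmethod
    def from_json(cls, raw: str) -> "GraderResult":
        obj = json.loads(raw)
        score = float(obj["score"])
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range: {score}")
        return cls(obj["task_id"], score, list(obj["violations"]))
```

Rejecting out-of-range scores at parse time keeps downstream metric aggregation honest.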

### Constraints
- Agents must not exceed the step limit.
- Agents must avoid terminal actions unless confident in the resolution.

## 6. Future Extensions
- **Multi-Agent Collaboration**: Introduce scenarios where multiple agents must collaborate to resolve a ticket.
- **Dynamic Policies**: Add tasks where policies change mid-episode, requiring agents to adapt.
- **Realistic User Simulation**: Enhance the environment with stochastic user behavior to test robustness.