---
title: SupportEnv
emoji: 🎫
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
tags:
  - openenv
  - customer-support
  - nlp
  - ticket-triage
  - agent-evaluation
pinned: false
---

# SupportEnv

SupportEnv is an OpenEnv-compliant environment for evaluating LLM agents on customer support ticket triage. Each episode presents a realistic support ticket and asks the agent to classify it, extract information from it, or resolve it; responses are scored deterministically against ground-truth labels.

## Tasks

| Task | Difficulty | Action | Max Steps |
|------|-----------|--------|-----------|
| Task 1: Ticket Classification | Easy | `classify` | 3 |
| Task 2: Information Extraction | Medium | `extract` | 5 |
| Task 3: Resolution Generation | Hard | `respond` | 8 |

**Task 1: Ticket Classification (Easy)**  
Assign a `category` (billing / technical / account / feature_request / complaint / general) and `priority` (low / medium / high / critical) to each ticket.

**Task 2: Information Extraction (Medium)**  
Extract structured entities (IDs, names, amounts, dates) and identify the list of required resolution actions.

**Task 3: Resolution Generation (Hard)**  
Write a professional customer-facing response and an ordered list of internal resolution steps. The submission is graded on keyword coverage, step completeness, tone adherence, and minimum length.

## Observation Space

Each observation includes:

- `task_id`, `task_description`, `episode_id`
- `ticket` object with `ticket_id`, `subject`, `body`, `customer_tier`, `account_age_days`, `previous_tickets`, `attachments`
- `thread_history` as ordered action summaries
- `available_actions` for the current task state
- `step_number`, `max_steps`
- `hint` (optional guidance)
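
As a minimal sketch, a client could check an observation against the field list above before using it. The helper name `missing_observation_keys` is hypothetical, and treating only `hint` as optional is an assumption; this README does not state which fields the server guarantees.

```python
from typing import Any

# Field names copied from the list above.
OBSERVATION_KEYS = {
    "task_id", "task_description", "episode_id", "ticket", "thread_history",
    "available_actions", "step_number", "max_steps", "hint",
}
TICKET_KEYS = {
    "ticket_id", "subject", "body", "customer_tier",
    "account_age_days", "previous_tickets", "attachments",
}
OPTIONAL_KEYS = {"hint"}  # assumption: only `hint` may be absent


def missing_observation_keys(obs: dict[str, Any]) -> list[str]:
    """Return documented keys that are absent from an observation dict."""
    missing = sorted(OBSERVATION_KEYS - OPTIONAL_KEYS - set(obs))
    ticket = obs.get("ticket")
    if isinstance(ticket, dict):
        # Report nested ticket fields with a "ticket." prefix.
        missing += sorted(f"ticket.{key}" for key in TICKET_KEYS - set(ticket))
    return missing
```

An empty list means every documented field is present; it says nothing about whether the values have the expected types.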

## Action Space

Supported `action.action_type` values:

- `classify`: requires `category` and `priority`
- `extract`: requires `extracted_entities` and `required_actions`
- `respond`: requires `response_text` and `resolution_steps`
- `submit`: closes the episode and triggers terminal grading
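
The required fields above can be checked on the client side before an action is sent. The following is a sketch only: `missing_action_fields` is a hypothetical helper, and how the server itself validates or rejects incomplete actions is not described here.

```python
# Required fields per action type, as listed above.
REQUIRED_FIELDS = {
    "classify": ("category", "priority"),
    "extract": ("extracted_entities", "required_actions"),
    "respond": ("response_text", "resolution_steps"),
    "submit": (),
}


def missing_action_fields(action: dict) -> list[str]:
    """Return the documented required fields that are absent from `action`."""
    action_type = action.get("action_type")
    if action_type not in REQUIRED_FIELDS:
        raise ValueError(f"unsupported action_type: {action_type!r}")
    return [name for name in REQUIRED_FIELDS[action_type] if name not in action]
```

For example, `missing_action_fields({"action_type": "classify", "category": "billing"})` returns `["priority"]`.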

## API

| Method | Endpoint | Description |
|--------|----------|-------------|
| `POST` | `/reset` | Start a new episode |
| `POST` | `/step` | Submit an action |
| `GET` | `/state` | Get current episode state |
| `POST` | `/grader` | Grade a finished episode |
| `GET` | `/tasks` | List all tasks |
| `GET` | `/health` | Liveness check |
| `GET` | `/docs` | OpenAPI docs |

### Reset
```json
POST /reset
{"task_id": "task1", "ticket_index": 0}
```

### Step: Task 1 (classify)
```json
POST /step
{
  "episode_id": "<id>",
  "action": {"action_type": "classify", "category": "billing", "priority": "high"}
}
```

### Step: Task 2 (extract)
```json
POST /step
{
  "episode_id": "<id>",
  "action": {
    "action_type": "extract",
    "extracted_entities": {"customer_name": "Alice", "invoice_number": "INV-001"},
    "required_actions": ["issue_refund", "send_corrected_invoice"]
  }
}
```

### Step: Task 3 (respond)
```json
POST /step
{
  "episode_id": "<id>",
  "action": {
    "action_type": "respond",
    "response_text": "Dear customer, we sincerely apologize...",
    "resolution_steps": ["verify_account", "issue_refund", "send_confirmation"]
  }
}
```

### Submit
```json
POST /step
{"episode_id": "<id>", "action": {"action_type": "submit"}}
```
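
The request bodies above can be built and sent from Python using only the standard library. This is a sketch under assumptions: `reset_payload`, `step_payload`, and `post_json` are hypothetical helper names, and the shape of the JSON responses is not documented in this README, so the caller should consult `/docs` before relying on any response field.

```python
import json
import urllib.request


def reset_payload(task_id: str, ticket_index: int) -> dict:
    """Body for POST /reset, mirroring the example above."""
    return {"task_id": task_id, "ticket_index": ticket_index}


def step_payload(episode_id: str, action_type: str, **fields) -> dict:
    """Body for POST /step; `fields` are the action-specific keys."""
    return {
        "episode_id": episode_id,
        "action": {"action_type": action_type, **fields},
    }


def post_json(base_url: str, path: str, payload: dict, timeout: float = 30.0) -> dict:
    """POST `payload` as JSON and decode the JSON response body."""
    request = urllib.request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With a server running locally, `post_json("http://localhost:7860", "/reset", reset_payload("task1", 0))` would start an episode, and `step_payload(episode_id, "submit")` builds the body that closes it.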

## Scoring

**Task 1:** category match (0.50) + priority match (0.40) + efficiency (0.10)

**Task 2:** entity coverage (0.60) + action coverage (0.30) + no hallucination (0.10)

**Task 3:** keyword coverage (0.30) + step coverage (0.30) + tone compliance (0.25) + length adequate (0.10) + non-empty steps (0.05)
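
The weights for each task sum to 1.0. As an illustration, and assuming the final score is a plain weighted sum of per-component scores in the range 0 to 1 (the grader's actual implementation is not shown here), the calculation might look like this. The component key names and the task IDs `task2` and `task3` are hypothetical, chosen by analogy with `task1` in the reset example.

```python
# Weights copied from the scoring rules above; key names are illustrative.
WEIGHTS = {
    "task1": {"category_match": 0.50, "priority_match": 0.40, "efficiency": 0.10},
    "task2": {"entity_coverage": 0.60, "action_coverage": 0.30, "no_hallucination": 0.10},
    "task3": {
        "keyword_coverage": 0.30,
        "step_coverage": 0.30,
        "tone_compliance": 0.25,
        "length_adequate": 0.10,
        "non_empty_steps": 0.05,
    },
}


def weighted_score(task_id: str, components: dict[str, float]) -> float:
    """Weighted sum of component scores; missing components count as 0."""
    weights = WEIGHTS[task_id]
    total = sum(weight * components.get(name, 0.0) for name, weight in weights.items())
    return round(total, 4)  # rounding hides floating-point noise
```

Under this assumption, a Task 1 episode with the correct category, the wrong priority, and full efficiency credit would score 0.50 + 0.10 = 0.60.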

## Running Locally

```bash
pip install -r requirements.txt
uvicorn app:app --host 0.0.0.0 --port 7860
```

## Running the Baseline Agent

```bash
export API_BASE_URL=https://router.huggingface.co/v1
export HF_TOKEN=your_token_here
export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
python inference.py
```

Required environment variables for baseline LLM calls:

- `API_BASE_URL` (default provided in code)
- `MODEL_NAME` (default provided in code)
- `HF_TOKEN` (must be provided)

Environment endpoint variables for the baseline:

- `OPENENV_BASE_URL` (preferred, default `http://localhost:7860`)
- `API_BASE_URL_ENV` (backward-compatible alias)
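
The precedence described above can be sketched as follows. This illustrates the documented behavior and is not the baseline's actual code: `resolve_env_base_url` is a hypothetical name, and treating an empty value as unset is an assumption.

```python
import os
from collections.abc import Mapping

DEFAULT_ENV_URL = "http://localhost:7860"


def resolve_env_base_url(env: Mapping[str, str] = os.environ) -> str:
    """Prefer OPENENV_BASE_URL, then the API_BASE_URL_ENV alias, then the default."""
    return (
        env.get("OPENENV_BASE_URL")
        or env.get("API_BASE_URL_ENV")
        or DEFAULT_ENV_URL
    )
```

Passing the environment as a mapping keeps the function easy to test without modifying `os.environ`.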

The baseline writes only the following structured lines to stdout:

- `[START] task=<...> env=<...> model=<...>`
- `[STEP] step=<...> action=<...> reward=<...> done=<...> error=<...>`
- `[END] success=<...> steps=<...> rewards=<...>`

## Docker

```bash
docker build -t supportenv .
docker run -p 7860:7860 supportenv
```