---
title: SupportEnv
emoji: 🎫
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
tags:
- openenv
- customer-support
- nlp
- ticket-triage
- agent-evaluation
pinned: false
---
# SupportEnv
SupportEnv is an OpenEnv-compliant environment for evaluating LLM agents on customer support ticket triage. Each episode presents a realistic support ticket and asks the agent to classify it, extract information from it, or resolve it. Results are scored deterministically against ground-truth labels.
## Tasks
| Task | Difficulty | Action | Max Steps |
|------|-----------|--------|-----------|
| Task 1: Ticket Classification | Easy | `classify` | 3 |
| Task 2: Information Extraction | Medium | `extract` | 5 |
| Task 3: Resolution Generation | Hard | `respond` | 8 |
**Task 1: Ticket Classification (Easy)**
Assign a `category` (billing / technical / account / feature_request / complaint / general) and a `priority` (low / medium / high / critical) to each ticket.
**Task 2: Information Extraction (Medium)**
Extract structured entities (IDs, names, amounts, dates) and identify the list of required resolution actions.
**Task 3: Resolution Generation (Hard)**
Write a professional customer-facing response and an ordered list of internal resolution steps. Responses are graded on keyword coverage, step completeness, tone adherence, and minimum length.
## Observation Space
Each observation includes:
- `task_id`, `task_description`, `episode_id`
- `ticket` object with `ticket_id`, `subject`, `body`, `customer_tier`, `account_age_days`, `previous_tickets`, `attachments`
- `thread_history` as ordered action summaries
- `available_actions` for the current task state
- `step_number`, `max_steps`
- `hint` (optional guidance)
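The fields listed above can be sketched as Python types. This is a hypothetical mirror of the observation, not the environment's own schema: the field names come from the list, but every type annotation (for example, `attachments` as a list of strings) is an assumption, as is the `steps_remaining` helper.

```python
from typing import List, Optional, TypedDict


class Ticket(TypedDict):
    # Field names are from the list above; the types are assumptions.
    ticket_id: str
    subject: str
    body: str
    customer_tier: str
    account_age_days: int
    previous_tickets: int
    attachments: List[str]


class Observation(TypedDict):
    task_id: str
    task_description: str
    episode_id: str
    ticket: Ticket
    thread_history: List[str]      # ordered action summaries
    available_actions: List[str]   # valid for the current task state
    step_number: int
    max_steps: int
    hint: Optional[str]            # optional guidance, may be absent


def steps_remaining(obs: Observation) -> int:
    """Hypothetical helper: steps left, assuming step_number counts steps already taken."""
    return max(obs["max_steps"] - obs["step_number"], 0)
```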
## Action Space
Supported `action.action_type` values:
- `classify`: requires `category` and `priority`
- `extract`: requires `extracted_entities` and `required_actions`
- `respond`: requires `response_text` and `resolution_steps`
- `submit`: closes the episode and triggers terminal grading
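A client can check an action against these requirements before sending it. The sketch below is a hypothetical client-side helper (`REQUIRED_FIELDS` and `missing_fields` are illustrative names); the server's own validation may differ.

```python
from typing import Any, Dict, List

# Required fields per action type, taken from the list above.
REQUIRED_FIELDS: Dict[str, List[str]] = {
    "classify": ["category", "priority"],
    "extract": ["extracted_entities", "required_actions"],
    "respond": ["response_text", "resolution_steps"],
    "submit": [],
}


def missing_fields(action: Dict[str, Any]) -> List[str]:
    """Return the required fields that are absent from `action`.

    Raises ValueError when action_type is missing or not one of the four above.
    """
    action_type = action.get("action_type")
    if action_type not in REQUIRED_FIELDS:
        raise ValueError(f"unknown action_type: {action_type!r}")
    return [name for name in REQUIRED_FIELDS[action_type] if name not in action]
```

For example, `missing_fields({"action_type": "classify", "category": "billing"})` reports that `priority` is still needed.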
## API
| Method | Endpoint | Description |
|--------|----------|-------------|
| `POST` | `/reset` | Start a new episode |
| `POST` | `/step` | Submit an action |
| `GET` | `/state` | Get current episode state |
| `POST` | `/grader` | Grade a finished episode |
| `GET` | `/tasks` | List all tasks |
| `GET` | `/health` | Liveness check |
| `GET` | `/docs` | OpenAPI docs |
### Reset
```json
POST /reset
{"task_id": "task1", "ticket_index": 0}
```
### Step: Task 1 (classify)
```json
POST /step
{
  "episode_id": "<id>",
  "action": {"action_type": "classify", "category": "billing", "priority": "high"}
}
```
### Step: Task 2 (extract)
```json
POST /step
{
  "episode_id": "<id>",
  "action": {
    "action_type": "extract",
    "extracted_entities": {"customer_name": "Alice", "invoice_number": "INV-001"},
    "required_actions": ["issue_refund", "send_corrected_invoice"]
  }
}
```
### Step: Task 3 (respond)
```json
POST /step
{
  "episode_id": "<id>",
  "action": {
    "action_type": "respond",
    "response_text": "Dear customer, we sincerely apologize...",
    "resolution_steps": ["verify_account", "issue_refund", "send_confirmation"]
  }
}
```
### Submit
```json
POST /step
{"episode_id": "<id>", "action": {"action_type": "submit"}}
```
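The reset, step, and submit requests above can be chained from Python. The following is a minimal sketch using only the standard library. The request payloads are the ones shown above, but the response shape is not documented here, so `find_episode_id` assumes `episode_id` appears either at the top level of the reset reply or inside an `observation` object. All function names are hypothetical.

```python
import json
import urllib.request
from typing import Any, Callable, Dict

Post = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def http_post(base_url: str) -> Post:
    """Build a function that POSTs JSON to base_url + path and decodes the JSON reply."""
    def post(path: str, payload: Dict[str, Any]) -> Dict[str, Any]:
        request = urllib.request.Request(
            base_url.rstrip("/") + path,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(request) as response:
            return json.loads(response.read().decode("utf-8"))
    return post


def find_episode_id(reset_reply: Dict[str, Any]) -> str:
    # Assumption: episode_id is at the top level or inside "observation".
    # Adjust this to the real response shape.
    if "episode_id" in reset_reply:
        return reset_reply["episode_id"]
    return reset_reply["observation"]["episode_id"]


def run_classify_episode(post: Post, category: str, priority: str) -> Dict[str, Any]:
    """Reset Task 1, send one classify action, then submit. Returns the submit reply."""
    episode_id = find_episode_id(post("/reset", {"task_id": "task1", "ticket_index": 0}))
    post("/step", {
        "episode_id": episode_id,
        "action": {"action_type": "classify", "category": category, "priority": priority},
    })
    return post("/step", {"episode_id": episode_id, "action": {"action_type": "submit"}})
```

With the server running locally, `run_classify_episode(http_post("http://localhost:7860"), "billing", "high")` would play one Task 1 episode. Passing the transport in as a function keeps the episode logic testable without a live server.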
## Scoring
**Task 1:** category match (0.50) + priority match (0.40) + efficiency (0.10)
**Task 2:** entity coverage (0.60) + action coverage (0.30) + no hallucination (0.10)
**Task 3:** keyword coverage (0.30) + step coverage (0.30) + tone compliance (0.25) + length adequate (0.10) + non-empty steps (0.05)
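Each task's score is a weighted sum whose weights add up to 1.0. The sketch below shows only that combination step. The component key names and the `task2`/`task3` identifiers are hypothetical, and how each component value is computed (for example, what counts as "efficiency") is not specified here, so components are simply taken as inputs in the range 0 to 1.

```python
from typing import Dict

# Component weights copied from the scoring rules above.
WEIGHTS: Dict[str, Dict[str, float]] = {
    "task1": {"category_match": 0.50, "priority_match": 0.40, "efficiency": 0.10},
    "task2": {"entity_coverage": 0.60, "action_coverage": 0.30, "no_hallucination": 0.10},
    "task3": {
        "keyword_coverage": 0.30,
        "step_coverage": 0.30,
        "tone_compliance": 0.25,
        "length_adequate": 0.10,
        "non_empty_steps": 0.05,
    },
}


def weighted_score(task_id: str, components: Dict[str, float]) -> float:
    """Combine per-component values into one score.

    Values are clamped to [0, 1]; missing components count as 0.
    """
    weights = WEIGHTS[task_id]
    total = sum(
        weight * min(max(components.get(name, 0.0), 0.0), 1.0)
        for name, weight in weights.items()
    )
    return round(total, 4)
```

Under these assumptions, a Task 1 episode with the correct category, the wrong priority, and full efficiency credit would score 0.50 + 0.10 = 0.60.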
## Running Locally
```bash
pip install -r requirements.txt
uvicorn app:app --host 0.0.0.0 --port 7860
```
## Running the Baseline Agent
```bash
export API_BASE_URL=https://router.huggingface.co/v1
export HF_TOKEN=your_token_here
export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
python inference.py
```
Environment variables for baseline LLM calls:
- `API_BASE_URL` (default provided in code)
- `MODEL_NAME` (default provided in code)
- `HF_TOKEN` (required, no default)
Environment variables that point the baseline at the SupportEnv server:
- `OPENENV_BASE_URL` (preferred, default `http://localhost:7860`)
- `API_BASE_URL_ENV` (backward-compatible alias)
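One way to read this lookup order is sketched below. It is an illustration and not the code in `inference.py`: the function name is hypothetical, and the behavior when both variables are set (the preferred name wins) and when a variable is set to an empty string (treated as unset) are assumptions.

```python
import os
from typing import Mapping, Optional


def resolve_env_base_url(environ: Optional[Mapping[str, str]] = None) -> str:
    """Pick the environment server URL: preferred name, then alias, then default."""
    env = os.environ if environ is None else environ
    # Empty strings are treated as unset (an assumption of this sketch).
    return (
        env.get("OPENENV_BASE_URL")
        or env.get("API_BASE_URL_ENV")
        or "http://localhost:7860"
    )
```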
The baseline writes only structured lines to stdout, in exactly these formats:
- `[START] task=<...> env=<...> model=<...>`
- `[STEP] step=<...> action=<...> reward=<...> done=<...> error=<...>`
- `[END] success=<...> steps=<...> rewards=<...>`
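Because the output is line-oriented, it can be parsed with one regular expression per line kind. The sketch below assumes that fields are separated by single spaces and always appear in the order shown above; the exact value formats (for example, how `rewards` is serialized) are not specified here, so values are returned as raw strings. `parse_line` is a hypothetical helper.

```python
import re
from typing import Dict, Optional

# Field order per line kind, as listed above.
_FIELDS = {
    "START": ["task", "env", "model"],
    "STEP": ["step", "action", "reward", "done", "error"],
    "END": ["success", "steps", "rewards"],
}

# Non-greedy groups let a value contain spaces, as long as it does not
# contain the next " name=" marker.
_PATTERNS = {
    kind: re.compile(
        r"^\[" + kind + r"\] "
        + " ".join(f"{name}=(?P<{name}>.*?)" for name in names)
        + r"$"
    )
    for kind, names in _FIELDS.items()
}


def parse_line(line: str) -> Optional[Dict[str, str]]:
    """Return {'kind': ..., field: raw string, ...}, or None if the line does not match."""
    stripped = line.strip()
    for kind, pattern in _PATTERNS.items():
        match = pattern.match(stripped)
        if match:
            return {"kind": kind, **match.groupdict()}
    return None
```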
## Docker
```bash
docker build -t supportenv .
docker run -p 7860:7860 supportenv
```