---
title: Support Ticket Env
emoji: 🎫
colorFrom: blue
colorTo: green
sdk: docker
tags:
  - openenv
pinned: false
---

# Customer Support Ticket Resolution Environment

> 🏆 **OpenEnv x Scaler Hackathon**: Theme **#3.1 Professional Tasks** | Sub-theme: **Scaler AI Labs - Multi-App RL Environment for Enterprise Workflows**

A real-world [OpenEnv](https://github.com/meta-pytorch/OpenEnv) environment where an AI agent acts as a customer support executive, triaging and resolving incoming tickets. It simulates complex enterprise workflows, business rule nuances, and multi-step decision making under partial observability.

## Overview

Customer support triage is one of the most common real-world tasks for AI agents in enterprise settings. Support teams can handle thousands of tickets a day. A wrong classification routes the ticket to the wrong team, and a wrong action has direct business impact. This environment trains agents on exactly this challenge, with real tool interaction, dynamic state, and a multi-step reward structure that resists shortcuts.

## Quick Start

```python
from support_ticket_env import SupportAction, SupportTicketEnv

with SupportTicketEnv(base_url="https://algocore-support-ticket-env.hf.space").sync() as env:
    # Task 1 - Classify a ticket
    result = env.reset(task_id=1, seed=42)
    print(result.observation.ticket_text)

    result = env.step(SupportAction(action_type="classify", category="billing"))
    print(result.reward)  # 1.0 if correct
```

## Tasks

| Task | Difficulty | Description | Score Range |
|------|-----------|-------------|-------------|
| Task 1 | Easy | Classify ticket into correct category | 0.0 - 1.0 |
| Task 2 | Medium | Classify then choose correct action | 0.0 - 1.0 |
| Task 3 | Hard | Resolve a full queue of 3 tickets | 0.0 - 1.0 |

## Action Space

Actions are `SupportAction` Pydantic objects:

| Field | Type | Required | Values |
|-------|------|----------|--------|
| `action_type` | str | always | `classify` / `reply` / `escalate` / `close` |
| `category` | str | for classify | `billing` / `technical` / `account` / `general` / `refund` |
| `reply_text` | str | for reply | free text |
| `reason` | str | optional | free text |
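
The "Required" column above can be read as a small set of validation rules. The sketch below expresses them with a standard-library dataclass. It is an illustration only: the real `SupportAction` is a Pydantic model in `models.py`, and `ActionSketch`, `ACTION_TYPES`, and `CATEGORIES` are hypothetical names that may not match its implementation.

```python
from dataclasses import dataclass
from typing import Optional

# Allowed values copied from the Action Space table.
ACTION_TYPES = {"classify", "reply", "escalate", "close"}
CATEGORIES = {"billing", "technical", "account", "general", "refund"}


@dataclass(frozen=True)
class ActionSketch:
    """Hypothetical stand-in for SupportAction (not the real class)."""

    action_type: str
    category: Optional[str] = None    # required when action_type == "classify"
    reply_text: Optional[str] = None  # required when action_type == "reply"
    reason: Optional[str] = None      # always optional

    def __post_init__(self) -> None:
        if self.action_type not in ACTION_TYPES:
            raise ValueError(f"unknown action_type: {self.action_type!r}")
        if self.action_type == "classify" and self.category not in CATEGORIES:
            raise ValueError("classify requires a valid category")
        if self.action_type == "reply" and not self.reply_text:
            raise ValueError("reply requires non-empty reply_text")


# A valid classify action.
ok = ActionSketch(action_type="classify", category="billing")

# An invalid one: classify without a category raises ValueError.
try:
    ActionSketch(action_type="classify")
except ValueError as exc:
    print(exc)
```

Validating on construction means a malformed action fails on the client before it is ever sent to the environment.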

## Observation Space

| Field | Type | Description |
|-------|------|-------------|
| `ticket_id` | str | Unique ticket ID |
| `ticket_text` | str | Customer message |
| `task_id` | int | 1, 2, or 3 |
| `current_category` | str | Category assigned so far |
| `resolved` | bool | Whether ticket is resolved |
| `step_count` | int | Steps taken this episode |
| `feedback` | str | Human-readable feedback |
| `reward` | float | Reward signal |
| `done` | bool | Episode finished |

## Reward Function

Rewards provide partial progress signals throughout the trajectory:

- **Task 1:** 1.0 for the correct category, 0.0 for a wrong one
- **Task 2:** 1.0 for the correct action, 0.5 for a defensible alternative, 0.3 for classification only
- **Task 3:** 0.20 classification + 0.40 action + 0.25 reply quality + 0.15 efficiency bonus
- **Penalty:** -0.05 for each step beyond 10 (loop deterrent)
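
As a rough illustration of how the Task 3 weights and the step penalty listed above could combine, here is a minimal sketch. The function name `task3_reward` and its parameters are hypothetical, and it makes three assumptions not confirmed by this README: reply quality and efficiency are scores in [0, 1], the penalty is subtracted from the weighted sum, and the result is clamped to the 0.0 - 1.0 score range. The actual logic lives in `graders.py` and may differ.

```python
def task3_reward(
    classification_ok: bool,
    action_ok: bool,
    reply_quality: float,  # assumed to be in [0, 1]
    efficiency: float,     # assumed to be in [0, 1]
    step_count: int,
) -> float:
    """Hypothetical sketch of the Task 3 reward, not the real grader."""
    # Weighted components from the Reward Function list.
    base = (
        0.20 * classification_ok
        + 0.40 * action_ok
        + 0.25 * reply_quality
        + 0.15 * efficiency
    )
    # -0.05 for each step beyond the 10th (loop deterrent).
    penalty = 0.05 * max(0, step_count - 10)
    # Assumption: clamp to the documented 0.0 - 1.0 score range.
    return max(0.0, min(1.0, base - penalty))


# Perfect episode within the step budget.
print(task3_reward(True, True, 1.0, 1.0, step_count=5))
# Same episode, but two steps over budget.
print(task3_reward(True, True, 1.0, 1.0, step_count=12))
```

Because the action component carries the largest weight, an agent that only classifies correctly earns at most 0.20 under these assumptions, which matches the stated goal of discouraging shortcuts.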

## Project Structure

```
support_ticket_env/
├── __init__.py               # Package exports
├── models.py                 # SupportAction, SupportObservation, SupportState
├── tickets.py                # Ticket dataset with ground-truth labels
├── graders.py                # Reward/grader functions for all 3 tasks
├── client.py                 # EnvClient subclass
├── baseline.py               # Baseline inference script
├── get_baseline.py           # Fetch & save baseline results
├── gradio_ui.py              # Interactive Gradio playground UI
├── make_chart.py             # Plot training reward curves
├── plot_results.py           # Visualise evaluation results
├── grpo_results.png          # GRPO training results chart
├── reward_chart.png          # Reward curve chart
├── openenv.yaml              # Environment metadata
├── Dockerfile                # Container definition
├── train_sft.ipynb           # Step 1: SFT pre-training notebook
├── train_grpo.ipynb          # Step 2: GRPO fine-tuning notebook
└── server/
    ├── app.py                # FastAPI entry point (+ Gradio UI mounted at /playground)
    ├── support_environment.py # Environment logic
    └── requirements.txt      # Server dependencies
```

## Setup

```bash
# Install dependencies
pip install openenv-core fastapi uvicorn pydantic gradio openai pyyaml

# Run locally
uvicorn support_ticket_env.server.app:app --host 0.0.0.0 --port 7860

# Or via Docker
docker build -t support-ticket-env .
docker run -p 7860:7860 support-ticket-env

# Run tests
python run_tests.py
```

> 🎮 **Playground UI** available at `http://localhost:7860/playground` once the server is running.

## 📈 Training Results (GRPO): Evidence of Improvement

Fine-tuned `Qwen2.5-0.5B-Instruct` using **2-stage training** (SFT pre-training → GRPO) via HuggingFace TRL over **700+ steps** on the live environment API:

![GRPO Training Results](https://raw.githubusercontent.com/TryingHardToBeDeveloper/support-ticket-env/main/grpo_results.png)

| Task | Before GRPO | After GRPO | Improvement |
|------|-------------|------------|-------------|
| Task 1 - Classification | 0.67 | **1.00** | +49% 🚀 |
| Task 2 - Action Selection | 0.12 | **0.48** | +300% 🚀 |
| Task 3 - Full Resolution | 0.08 | **0.23** | +187% 🚀 |
| **Overall** | **0.29** | **0.57** | **+96% 🚀** |
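
For readers who want to check the table, the Improvement column is consistent with relative gain, `(after - before) / before`, and the Overall row with the unweighted mean of the three task scores. The short sketch below recomputes both from the table's values. The helper names are illustrative, and the percentages in the table appear to be truncated rather than rounded.

```python
# (before, after) scores copied from the table above.
scores = {
    "Task 1": (0.67, 1.00),
    "Task 2": (0.12, 0.48),
    "Task 3": (0.08, 0.23),
}


def relative_gain(before: float, after: float) -> float:
    """Relative improvement in percent."""
    return (after - before) / before * 100.0


def mean(values) -> float:
    values = list(values)
    return sum(values) / len(values)


# The Overall row as the unweighted mean of the three tasks.
overall_before = mean(b for b, _ in scores.values())
overall_after = mean(a for _, a in scores.values())

for name, (before, after) in scores.items():
    print(f"{name}: {relative_gain(before, after):.1f}%")
print(f"Overall: {overall_before:.2f} -> {overall_after:.2f} "
      f"({relative_gain(overall_before, overall_after):.1f}%)")
```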

## Baseline Scores

Measured with `gpt-4o-mini`, seeds `[42, 7, 123]`:

| Task | Avg Score |
|------|-----------|
| Task 1 - Classification | 0.87 |
| Task 2 - Action Selection | 0.71 |
| Task 3 - Full Resolution | 0.58 |
| **Overall** | **0.72** |

## 🎯 Why This Fits Theme 3.1: Professional Tasks

> *"Real interaction with tools, APIs, or dynamic systems where the model does real hard work instead of exploiting shortcuts"*

- ✅ **Live FastAPI environment**: the agent interacts with a real stateful API, not a simulation
- ✅ **No shortcut exploitation**: the reward function penalises loops (-0.05/step over 10) and forces genuine reasoning
- ✅ **Persistent world state**: ticket queue, classification state, and resolution state are tracked across steps
- ✅ **Multi-step causal reasoning**: classify → choose action → craft reply → resolve, all causally linked
- ✅ **Enterprise workflow complexity**: billing, technical, account, general, and refund categories with real business rules
- ✅ **Scaler AI Labs sub-theme**: demonstrates complex enterprise workflows and business rule nuances in an RL environment

## Links

- **HuggingFace Space:** https://huggingface.co/spaces/AlgoCore/support-ticket-env
- **GitHub:** https://github.com/TryingHardToBeDeveloper/support-ticket-env
- **OpenEnv Docs:** https://meta-pytorch.org/OpenEnv/

## License

MIT