---
title: UPI Banking Support Environment
emoji: 🏦
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
app_port: 8000
tags:
  - openenv
  - banking
  - upi
  - customer-support
---

# UPI Banking Support Environment

OpenEnv-style environment for evaluating agents on UPI customer support workflows. The benchmark focuses on realistic banking support decisions rather than generic FAQ matching.

## Motivation

This environment tests whether an agent can behave like a safe and useful support assistant for a UPI payments product, modeled on the support flows of apps such as Paytm, PhonePe, or Google Pay.

The goal is not only to answer customers correctly, but also to:
- identify the right issue type
- retrieve the right knowledge entry
- escalate fraud or overdue review cases when needed
- avoid unsafe behavior such as asking for PINs or OTPs
- handle multi-turn conversations before closing a case

## Environment Description

The environment uses three tasks with increasing difficulty:
- `easy`: classify a customer issue into the correct support track
- `medium`: choose the right FAQ or escalate when human/manual review is required
- `hard`: run a short multi-turn support conversation with clarification, guidance, and closure

The current support tracks are:
- `payment_failure`
- `refund_delay`
- `fraud_complaint`
- `kyc_account_restriction`
- `upi_pin_or_bank_linking`

The dataset includes:
- 10 banking FAQ entries in [knowledge_base.json](data/knowledge_base.json)
- 10 `easy` tickets in [easy.json](data/tickets/easy.json)
- 10 `medium` tickets in [medium.json](data/tickets/medium.json)
- 10 `hard` tickets in [hard.json](data/tickets/hard.json)

## Action Space

The public baseline and server currently accept the legacy action names below, which are internally mapped to the compact action model in [models.py](models.py).

| Action | Parameters | Purpose |
|---|---|---|
| `classify` | `category` | Predict the correct support track for an `easy` ticket |
| `lookup_faq` | `faq_id` | Choose the best FAQ entry for `medium` or `hard` |
| `ask_clarification` | `message` | Ask a question to gather missing details in `hard` |
| `reply` | `message` | Provide safe support guidance to the user |
| `escalate` | `message` | Escalate a case that should not be fully handled automatically |
| `resolve_ticket` | none | Close the case when it appears correctly resolved |

Internally, these are normalized to:
- `ask_for_details`
- `take_action`
- `respond_to_user`
- `escalate_case`
- `close_case`
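
The normalization step can be pictured as a simple lookup. The sketch below is illustrative only: the pairing of each legacy name with an internal name is an assumption made for this example (the README does not state it), and `LEGACY_TO_INTERNAL` and `normalize_action` are hypothetical names, not identifiers taken from `models.py`.

```python
# ASSUMPTION: this pairing is a guess for illustration; the real mapping
# lives in models.py and may differ (for example for `classify`).
LEGACY_TO_INTERNAL = {
    "classify": "take_action",
    "lookup_faq": "take_action",
    "ask_clarification": "ask_for_details",
    "reply": "respond_to_user",
    "escalate": "escalate_case",
    "resolve_ticket": "close_case",
}

INTERNAL_ACTIONS = set(LEGACY_TO_INTERNAL.values())


def normalize_action(name: str) -> str:
    """Return the internal action name for a legacy or internal name."""
    if name in LEGACY_TO_INTERNAL:
        return LEGACY_TO_INTERNAL[name]
    if name in INTERNAL_ACTIONS:
        return name  # already normalized
    raise ValueError(f"unknown action: {name!r}")
```

Rejecting unknown names with an error, rather than passing them through, keeps malformed agent output from silently reaching the grader.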

## Observation Space

The model receives an `Observation` object from [models.py](models.py).

| Field | Type | Description |
|---|---|---|
| `case_id` | `str` | Unique identifier for the active ticket |
| `track` | `str` | Task split (`easy`, `medium`, or `hard`), not the support track of the ticket |
| `customer_message` | `str` | Current customer issue text shown to the agent |
| `conversation_history` | `list[dict]` | Prior user/agent turns |
| `known_facts` | `dict` | Agent-visible state such as FAQ set, available categories, and progress flags |
| `required_slots` | `list[str]` | High-level missing information requirements for the episode |
| `available_actions` | `list[str]` | Actions allowed by the environment |
| `turn_number` | `int` | Current turn count |

Important evaluation detail:
- hidden gold labels such as the correct FAQ id and escalation label are not exposed to the model in the observation
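
For orientation, the table above can be read as the following data structure. This is a minimal sketch using a standard-library dataclass; the actual `Observation` in `models.py` may be defined differently (for example as a Pydantic model), and the default values and the sample ticket below are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Observation:
    """Sketch of the observation fields listed in the table above."""

    case_id: str
    track: str  # task split: "easy", "medium", or "hard"
    customer_message: str
    conversation_history: list[dict[str, Any]] = field(default_factory=list)
    known_facts: dict[str, Any] = field(default_factory=dict)
    required_slots: list[str] = field(default_factory=list)
    available_actions: list[str] = field(default_factory=list)
    turn_number: int = 0


# Hypothetical example values, not taken from the dataset.
obs = Observation(
    case_id="demo-001",
    track="easy",
    customer_message="My UPI payment failed but the amount was debited.",
    available_actions=["classify"],
)
```

Note that no field carries the gold FAQ id or escalation label, which matches the evaluation detail above.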

## Reward

Rewards are normalized to the range `0.0` to `1.0` in [environment.py](environment.py).

The final reward is shaped rather than purely binary. It combines:
- `correctness`
- `safety`
- `resolution`
- `efficiency`
- `penalties`

Weighted reward:

```text
0.35 * correctness
+ 0.30 * safety
+ 0.20 * resolution
+ 0.15 * efficiency
+ penalties
```
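
As a rough illustration of how the weighted sum could be computed, the sketch below assumes that each component score lies in `[0.0, 1.0]`, that `penalties` is zero or negative, and that the result is clamped to the normalized range. The function name `shaped_reward` and the clamping step are assumptions for this example, not the code in `environment.py`.

```python
# Weights copied from the formula above; they sum to 1.0.
WEIGHTS = {
    "correctness": 0.35,
    "safety": 0.30,
    "resolution": 0.20,
    "efficiency": 0.15,
}


def shaped_reward(components: dict[str, float], penalties: float = 0.0) -> float:
    """Combine component scores and penalties into one clamped reward.

    ASSUMPTION: missing components count as 0.0 and the total is clamped
    to [0.0, 1.0]; the real environment may normalize differently.
    """
    weighted = sum(w * components.get(name, 0.0) for name, w in WEIGHTS.items())
    return max(0.0, min(1.0, weighted + penalties))
```

Under these assumptions, an episode that is correct and safe but never resolved or efficient would score about `0.65`, and a large enough penalty pulls any episode down to `0.0`.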

Examples:
- correct classification gives a strong `easy` reward
- correct FAQ retrieval gives partial progress on `medium`
- correct escalation gives reward on `medium`
- clarification plus guidance plus successful closure raises `hard` reward
- unsafe prompts such as asking for PIN or OTP reduce reward sharply
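
The last example, detecting a request for a PIN or OTP, could be approximated with a keyword heuristic such as the one below. This is a sketch of one possible check, not the grader used by the environment; `asks_for_secret` and the word lists are hypothetical, and a sentence-level regex like this will miss paraphrases and can misjudge unusual negations.

```python
import re

# Terms the agent must never request, and verbs that signal a request.
_SENSITIVE = re.compile(r"\b(?:upi\s*pin|pin|otp)\b", re.IGNORECASE)
_REQUEST = re.compile(
    r"\b(?:share|send|tell|provide|enter|give|confirm)\b", re.IGNORECASE
)
# Safety advice such as "never share your OTP" should not be flagged.
_NEGATION = re.compile(r"\b(?:never|do not|don't|dont)\b", re.IGNORECASE)


def asks_for_secret(message: str) -> bool:
    """Return True if any sentence appears to request a PIN or OTP."""
    for sentence in re.split(r"[.!?]+", message):
        if (
            _SENSITIVE.search(sentence)
            and _REQUEST.search(sentence)
            and not _NEGATION.search(sentence)
        ):
            return True
    return False
```

Checking sentence by sentence lets a reply contain both safe advice and guidance without the advice masking an unsafe request elsewhere in the same message.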

## Task Difficulty

| Task | Difficulty | Description | Expected Agent Behavior |
|---|---|---|---|
| `easy` | Low | Single-turn issue classification | Identify the correct banking support track |
| `medium` | Medium | FAQ retrieval or escalation decision | Select the right FAQ or escalate fraud / overdue review cases |
| `hard` | High | Multi-turn support conversation | Ask clarification, guide safely, and close only when appropriate |

## Setup

From the package root:

```bash
cd /path/to/helpdesk_env
python3 -m venv .venv
.venv/bin/pip install -r requirements.txt
```

## Usage

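### Run Tests

The command below only byte-compiles the core modules as a syntax check; it does not execute a test suite.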

```bash
cd /path/to/helpdesk_env
.venv/bin/python -m py_compile environment.py inference.py models.py
```

### Run the Server

```bash
cd /path/to
PYTHONPATH=. /path/to/helpdesk_env/.venv/bin/uvicorn helpdesk_env.server.app:app --host 127.0.0.1 --port 8000
```

### Build the Docker Image

```bash
cd /path/to/helpdesk_env
docker build -t helpdesk-openenv .
docker run --rm -p 8000:8000 helpdesk-openenv
```

### Use the Python Client

```python
from helpdesk_env.client import HelpdeskEnvClient

client = HelpdeskEnvClient("http://127.0.0.1:8000")
result = client.reset("easy")
print(result.observation.customer_message)
```

### Run Inference

```bash
cd /path/to/helpdesk_env
export GROQ_API_KEY=your_key
.venv/bin/python inference.py
```

Optional model and task overrides:

```bash
export LLM_MODEL=llama-3.1-8b-instant
export TASK_NAME=medium
```

## Baseline Scores

Latest observed Groq baseline run after removing answer leakage from the observation:

| Model | Easy | Medium | Hard | Average |
|---|---:|---:|---:|---:|
| `llama-3.3-70b-versatile` | 1.00 | 0.60 | 0.59 | 0.73 |

Interpretation:
- `easy` is still quite direct and can be near-perfect for strong LLMs
- `medium` and `hard` are more informative because they require retrieval, escalation judgment, and multi-turn behavior
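
The `Average` column is consistent with an unweighted mean of the three task scores, which can be checked in a few lines. That the table uses a plain mean is an inference from the numbers, not something the baseline script is known to state.

```python
# Per-task scores copied from the table above.
scores = {"easy": 1.00, "medium": 0.60, "hard": 0.59}

# Unweighted mean across the three tasks, rounded as in the table.
average = round(sum(scores.values()) / len(scores), 2)
print(average)
```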

## Project Structure

```text
helpdesk_env/
├── README.md
├── Dockerfile
├── .gitignore
├── .dockerignore
├── __init__.py
├── client.py
├── data/
│   ├── knowledge_base.json
│   └── tickets/
│       ├── easy.json
│       ├── medium.json
│       └── hard.json
├── environment.py
├── inference.py
├── models.py
├── openenv.yaml
├── requirements.txt
├── graders/
│   ├── category_grader.py
│   ├── faq_grader.py
│   └── resolution_grader.py
└── server/
    ├── app.py
    └── helpdesk_environment.py
```