---
title: UPI Banking Support Environment
emoji: 🏦
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
app_port: 8000
tags:
- openenv
- banking
- upi
- customer-support
---

# UPI Banking Support Environment

An OpenEnv-style environment for evaluating agents on UPI customer support workflows. The benchmark focuses on realistic banking support decisions rather than generic FAQ matching.

## Motivation

This environment tests whether an agent can behave like a safe and useful support assistant for a UPI payments product, with support flows in the style of Paytm, PhonePe, or Google Pay.

The goal is not only to answer customers correctly, but also to:
- identify the right issue type
- retrieve the right knowledge entry
- escalate fraud or overdue review cases when needed
- avoid unsafe behavior such as asking for PINs or OTPs
- handle multi-turn conversations before closing a case

## Environment Description

The environment uses three tasks of increasing difficulty:
- `easy`: classify a customer issue into the correct support track
- `medium`: choose the right FAQ or escalate when human/manual review is required
- `hard`: run a short multi-turn support conversation with clarification, guidance, and closure

The current support tracks are:
- `payment_failure`
- `refund_delay`
- `fraud_complaint`
- `kyc_account_restriction`
- `upi_pin_or_bank_linking`

The dataset includes:
- 10 banking FAQ entries in [knowledge_base.json](data/knowledge_base.json)
- 10 `easy` tickets in [easy.json](data/tickets/easy.json)
- 10 `medium` tickets in [medium.json](data/tickets/medium.json)
- 10 `hard` tickets in [hard.json](data/tickets/hard.json)

## Action Space

The public baseline and server currently accept the legacy action names below, which are mapped internally to the compact action model in [models.py](models.py).

| Action | Parameters | Purpose |
|---|---|---|
| `classify` | `category` | Predict the correct support track for an `easy` ticket |
| `lookup_faq` | `faq_id` | Choose the best FAQ entry for `medium` or `hard` |
| `ask_clarification` | `message` | Ask a question to gather missing details in `hard` |
| `reply` | `message` | Provide safe support guidance to the user |
| `escalate` | `message` | Escalate a case that should not be fully handled automatically |
| `resolve_ticket` | none | Close the case when it appears correctly resolved |

Internally, these are normalized to:
- `ask_for_details`
- `take_action`
- `respond_to_user`
- `escalate_case`
- `close_case`

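
The normalization step described above could be sketched as a simple lookup table. The README lists both sets of names but not how they pair up, so every pairing below is an assumption made for illustration (in particular, sending both `classify` and `lookup_faq` to `take_action`), and `normalize_action` is a hypothetical helper, not a function from `models.py`.

```python
# Hypothetical mapping: the pairing of legacy to internal names is assumed,
# not taken from models.py.
LEGACY_TO_INTERNAL = {
    "classify": "take_action",
    "lookup_faq": "take_action",
    "ask_clarification": "ask_for_details",
    "reply": "respond_to_user",
    "escalate": "escalate_case",
    "resolve_ticket": "close_case",
}


def normalize_action(name: str) -> str:
    """Return the internal action name for a legacy action name."""
    try:
        return LEGACY_TO_INTERNAL[name]
    except KeyError:
        # Reject unknown names early instead of passing them to the environment.
        raise ValueError(f"unknown legacy action: {name!r}") from None
```

Keeping the mapping in one dictionary means an unknown action fails immediately with a clear error.
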
## Observation Space

The model receives an `Observation` object defined in [models.py](models.py).

| Field | Type | Description |
|---|---|---|
| `case_id` | `str` | Unique identifier for the active ticket |
| `track` | `str` | Task split (`easy`, `medium`, or `hard`), not the support track |
| `customer_message` | `str` | Current customer issue text shown to the agent |
| `conversation_history` | `list[dict]` | Prior user/agent turns |
| `known_facts` | `dict` | Agent-visible state such as FAQ set, available categories, and progress flags |
| `required_slots` | `list[str]` | High-level missing information requirements for the episode |
| `available_actions` | `list[str]` | Actions allowed by the environment |
| `turn_number` | `int` | Current turn count |

Important evaluation detail:
- hidden gold labels such as the correct FAQ id and escalation label are not exposed to the model in the observation

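
For orientation, the fields in the table above can be written down as a plain dataclass. This is only a sketch: the real `Observation` in `models.py` may use a different base class, defaults, or validation, and the example values at the bottom are made up. It needs Python 3.9 or later for the built-in generic types.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Observation:
    """Sketch of the observation fields listed in the table above."""

    case_id: str
    track: str  # task split: "easy", "medium", or "hard"
    customer_message: str
    conversation_history: list[dict[str, Any]] = field(default_factory=list)
    known_facts: dict[str, Any] = field(default_factory=dict)
    required_slots: list[str] = field(default_factory=list)
    available_actions: list[str] = field(default_factory=list)
    turn_number: int = 0


# Made-up values, for illustration only.
obs = Observation(
    case_id="case-001",
    track="easy",
    customer_message="My payment failed but the money was debited.",
    available_actions=["classify"],
)
```

Note that no gold-label field appears here, which matches the evaluation detail above.
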
## Reward

Rewards are normalized to the range `0.0` to `1.0` in [environment.py](environment.py).

The final reward is shaped rather than purely binary. It combines:
- `correctness`
- `safety`
- `resolution`
- `efficiency`
- `penalties`

Weighted reward:

```text
0.35 * correctness
+ 0.30 * safety
+ 0.20 * resolution
+ 0.15 * efficiency
+ penalties
```

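
The weighted sum above can be sketched in a few lines. The weights come from the formula. The assumptions are that each component lies in `[0, 1]`, that `penalties` is zero or negative, and that the result is clamped to the `0.0` to `1.0` range. The name `shaped_reward` is hypothetical and is not taken from `environment.py`.

```python
# Weights copied from the formula above; they sum to 1.0.
WEIGHTS = {
    "correctness": 0.35,
    "safety": 0.30,
    "resolution": 0.20,
    "efficiency": 0.15,
}


def shaped_reward(components: dict[str, float], penalties: float = 0.0) -> float:
    """Combine component scores and penalties, clamped to [0.0, 1.0].

    Assumes each component is in [0, 1] and penalties is <= 0.
    Missing components count as 0.0.
    """
    raw = sum(w * components.get(name, 0.0) for name, w in WEIGHTS.items())
    return min(1.0, max(0.0, raw + penalties))


# A correct, safe action that has not yet resolved the case, with a small penalty.
partial = shaped_reward({"correctness": 1.0, "safety": 1.0}, penalties=-0.1)
```

Because the weights sum to 1.0, a perfect episode with no penalties reaches the top of the range, and the clamp keeps large penalties from pushing the reward below zero.
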
Examples:
- correct classification gives a strong `easy` reward
- correct FAQ retrieval gives partial progress on `medium`
- correct escalation gives reward on `medium`
- clarification plus guidance plus successful closure raises the `hard` reward
- unsafe prompts such as asking for a PIN or OTP reduce the reward sharply

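
To make the last example concrete, here is a crude heuristic for spotting an agent message that asks the customer for a secret. It is an illustration only: the README does not say how the environment detects unsafe prompts, and the word lists, the negation handling, and the name `asks_for_secret` are all assumptions.

```python
import re

# Hypothetical word lists; a real check would need broader coverage.
_REQUEST = re.compile(
    r"\b(?:share|send|tell|provide|give)\b[^.?!]*\b(?:upi\s+pin|pin|otp|cvv)\b",
    re.IGNORECASE,
)
_NEGATION = re.compile(r"\b(?:never|do not|don't|not)\b", re.IGNORECASE)


def asks_for_secret(message: str) -> bool:
    """Return True if any sentence requests a PIN, OTP, or CVV.

    Sentences containing a negation ("never share your PIN") are treated
    as safety advice rather than as a request.
    """
    for sentence in re.split(r"[.?!]+", message):
        if _REQUEST.search(sentence) and not _NEGATION.search(sentence):
            return True
    return False
```

Checking sentence by sentence lets safe guidance such as "Never share your OTP with anyone" pass while a direct request is still flagged.
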
## Task Difficulty

| Task | Difficulty | Description | Expected Agent Behavior |
|---|---|---|---|
| `easy` | Low | Single-turn issue classification | Identify the correct banking support track |
| `medium` | Medium | FAQ retrieval or escalation decision | Select the right FAQ or escalate fraud / overdue review cases |
| `hard` | High | Multi-turn support conversation | Ask clarification, guide safely, and close only when appropriate |

## Setup

From the package root:

```bash
cd /path/to/helpdesk_env
python3 -m venv .venv
.venv/bin/pip install -r requirements.txt
```

## Usage

### Run Tests

This step byte-compiles the core modules as a syntax check:

```bash
cd /path/to/helpdesk_env
.venv/bin/python -m py_compile environment.py inference.py models.py
```

### Run the Server

Run this from the parent directory of the package so that `helpdesk_env` is importable:

```bash
cd /path/to
PYTHONPATH=. /path/to/helpdesk_env/.venv/bin/uvicorn helpdesk_env.server.app:app --host 127.0.0.1 --port 8000
```

### Build and Run the Docker Image

```bash
cd /path/to/helpdesk_env
docker build -t helpdesk-openenv .
docker run --rm -p 8000:8000 helpdesk-openenv
```

### Use the Python Client

```python
from helpdesk_env.client import HelpdeskEnvClient

client = HelpdeskEnvClient("http://127.0.0.1:8000")
result = client.reset("easy")
print(result.observation.customer_message)
```

### Run Inference

```bash
cd /path/to/helpdesk_env
export GROQ_API_KEY=your_key
.venv/bin/python inference.py
```

Optional model and task overrides:

```bash
export LLM_MODEL=llama-3.1-8b-instant
export TASK_NAME=medium
```

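
The sketch below shows one way a script could read the three environment variables above. It is not the code in `inference.py`. The helper name `load_settings` and both default values are assumptions: the default model is borrowed from the baseline table, and the default task is a guess.

```python
import os
from typing import Mapping, Optional

VALID_TASKS = ("easy", "medium", "hard")


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read the API key, model name, and task name from the environment."""
    env = os.environ if env is None else env

    api_key = env.get("GROQ_API_KEY")
    if not api_key:
        raise RuntimeError("GROQ_API_KEY is not set")

    # Assumed defaults; the real script may use different ones.
    model = env.get("LLM_MODEL", "llama-3.3-70b-versatile")
    task = env.get("TASK_NAME", "easy")
    if task not in VALID_TASKS:
        raise ValueError(f"TASK_NAME must be one of {VALID_TASKS}, got {task!r}")

    return {"api_key": api_key, "model": model, "task": task}
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to exercise without touching the real environment.
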
## Baseline Scores

Latest observed Groq baseline run, after removing answer leakage from the observation:

| Model | Easy | Medium | Hard | Average |
|---|---:|---:|---:|---:|
| `llama-3.3-70b-versatile` | 1.00 | 0.60 | 0.59 | 0.73 |

Interpretation:
- `easy` is still quite direct, and strong LLMs can score near-perfectly on it
- `medium` and `hard` are more informative because they require retrieval, escalation judgment, and multi-turn behavior

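
The Average column is consistent with an unweighted mean of the three task scores, which can be checked directly. Treating it as an unweighted mean is an inference from the numbers, not something the README states.

```python
# Scores copied from the baseline table above.
scores = {"easy": 1.00, "medium": 0.60, "hard": 0.59}

# Unweighted mean over the three tasks, rounded to two decimals.
average = round(sum(scores.values()) / len(scores), 2)
print(average)  # 0.73
```
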
## Project Structure

```text
helpdesk_env/
├── README.md
├── Dockerfile
├── .gitignore
├── .dockerignore
├── __init__.py
├── client.py
├── data/
│   ├── knowledge_base.json
│   └── tickets/
│       ├── easy.json
│       ├── medium.json
│       └── hard.json
├── environment.py
├── inference.py
├── models.py
├── openenv.yaml
├── requirements.txt
├── graders/
│   ├── category_grader.py
│   ├── faq_grader.py
│   └── resolution_grader.py
└── server/
    ├── app.py
    └── helpdesk_environment.py
```