---
title: OpenEnv Support Ticket RL Environment
emoji: 🤖
colorFrom: blue
colorTo: green
sdk: docker
app_file: inference.py
license: mit
library_name: openenv
language: en
tags:
- reinforcement-learning
- openenv
- hackathon
- customer-support
---
# OpenEnv: Support Ticket Resolution System
An OpenEnv standards-compliant reinforcement learning environment for customer support operations. The agent acts as a support specialist and resolves incoming tickets by choosing structured actions (fetch data, check policy, refund, reply, escalate, close).
## Motivation & Real-world Relevance
Most RL evaluations are game-like or synthetic. This environment evaluates policy adherence and operational safety in a realistic business workflow:
- The agent must gather context before taking irreversible actions.
- It is rewarded for compliance and penalized for destructive shortcuts.
- It is scored on both correctness and process quality.
*Please see the detailed [Product Requirements Document (PRD.md)](./PRD.md) for a full breakdown.*
## Core RL Task (Domain Clarification)
Each episode is a support ticket lifecycle.
- State: ticket metadata, optional fetched user profile, action history, and termination flag.
- Observation: current ticket, available actions, system message, history, optional tool output, and step count.
- Action: choose one of six typed operations with parameters.
- Reward: dense scorer in [0.01, 0.99] based on whether the action trajectory matches policy-safe resolution behavior.
This is not a navigation/game environment; it is a process-control environment where incorrect sequencing (for example, refunding before policy verification) reduces score.
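To make the sequencing penalty concrete, here is a minimal, hypothetical scorer sketch. It is not the environment's actual grader; it only illustrates the principle that irreversible actions taken before verification reduce the score, with the result clamped to the documented [0.01, 0.99] band.

```python
# Hypothetical sketch of a sequencing-aware scorer. The real grader is more
# detailed; the principle is the same: refunding before checking policy is
# penalized, and scores stay within [0.01, 0.99].

def score_trajectory(actions: list[str]) -> float:
    """Score a list of action_type strings in [0.01, 0.99]."""
    score = 0.5
    if "check_policy" in actions and "issue_refund" in actions:
        if actions.index("check_policy") < actions.index("issue_refund"):
            score += 0.4  # policy verified before the irreversible refund
        else:
            score -= 0.4  # destructive shortcut: refund before verification
    if actions and actions[-1] == "close_ticket":
        score += 0.09  # clean termination
    return min(max(score, 0.01), 0.99)

safe = score_trajectory(["check_policy", "issue_refund", "close_ticket"])
unsafe = score_trajectory(["issue_refund", "check_policy", "close_ticket"])
```

Under this sketch, the policy-safe ordering lands near the top of the band while the shortcut ordering is sharply penalized.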
## Domain Overview
This environment simulates a customer support ticket resolution system. The agent must navigate a structured workflow to resolve tickets efficiently and safely; the core challenge is adhering to policy constraints while optimizing for resolution speed and accuracy.
## Episode Walkthrough (Concrete Example)
Example: `task_easy_1`, an accidental-purchase refund.
1. Reset
- Observation includes refund ticket from `USR-A1`, open status, step_count=0.
2. Action 1: `check_policy({})`
- Tool output returns refund policy for accidental purchase.
- Reward increases for policy verification.
3. Action 2: `issue_refund({"amount": "full"})`
- Tool output confirms refund.
- Reward increases for correct remediation.
4. Action 3: `close_ticket({"resolution": "refunded"})`
- Episode ends.
- Final score reaches near-optimal band.
Flow (high-level):
```
reset -> check_policy -> issue_refund -> close_ticket -> done
```
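The trajectory above can be expressed as data and replayed against any step function. This is an illustrative sketch, not the actual client code: the `(action_type, parameters)` pairs come from the walkthrough, while the `step_fn` interface and its `reward`/`done` return fields are assumptions.

```python
# The optimal task_easy_1 trajectory as (action_type, parameters) pairs.
# The step-function interface below is illustrative, not the real client API.

OPTIMAL_TRAJECTORY = [
    ("check_policy", {}),
    ("issue_refund", {"amount": "full"}),
    ("close_ticket", {"resolution": "refunded"}),
]

def replay(step_fn, trajectory=OPTIMAL_TRAJECTORY):
    """Feed each action to a step function until the episode ends."""
    rewards = []
    for action_type, parameters in trajectory:
        result = step_fn({"action_type": action_type, "parameters": parameters})
        rewards.append(result["reward"])
        if result["done"]:
            break
    return rewards
```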
## Task Set and Difficulty Progression
The environment contains four tasks, including three required benchmark tasks of increasing difficulty.
| Task | Difficulty | What changes vs previous | Typical Horizon | Stochasticity | Expected Optimal Score |
|---|---|---|---:|---|---:|
| `task_easy_1` | easy | Baseline accidental purchase refund flow | 3 | Low | 0.99 |
| `task_medium_1` | medium | Adds policy-conflict trap: must reject invalid refund | 3 | Low | 0.99 |
| `task_hard_1` | hard | Requires data fetch + correct escalation reason + customer communication | 3 | Medium | 0.99 |
| `task_fraud_detection` | hard | Adds chargeback-based fraud risk and denial behavior | 4 | Medium | 0.99 |
Difficulty metadata is encoded in [env/tasks.py](env/tasks.py).
## Action Space
- `fetch_user_data(user_id)`
- `check_policy(issue_type)`
- `issue_refund(amount)`
- `reply_to_customer(message)`
- `escalate(reason)`
- `close_ticket(resolution)`
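Each action takes exactly one parameter, as listed above. A small client-side validator can catch malformed actions before they reach the server; this sketch is illustrative (the action names and their parameters come from the list above, but the helper itself is hypothetical):

```python
# Illustrative validator: maps each of the six action types to its parameter.
ACTION_PARAMS = {
    "fetch_user_data": "user_id",
    "check_policy": "issue_type",
    "issue_refund": "amount",
    "reply_to_customer": "message",
    "escalate": "reason",
    "close_ticket": "resolution",
}

def make_action(action_type: str, **parameters) -> dict:
    """Build a serialized action dict, rejecting unknown types/parameters."""
    if action_type not in ACTION_PARAMS:
        raise ValueError(f"unknown action_type: {action_type}")
    expected = ACTION_PARAMS[action_type]
    if parameters and expected not in parameters:
        raise ValueError(f"{action_type} expects parameter {expected!r}")
    return {"action_type": action_type, "parameters": parameters}
```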
## Observation Space
Observation object fields:
- `ticket`
- `available_actions`
- `system_message`
- `history`
- `tool_output`
- `step_count`
Schema is documented in [openenv.yaml](openenv.yaml).
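For orientation, the fields above can be pictured as a simple container. The field names match the list above; the types and example values are illustrative, and [openenv.yaml](openenv.yaml) remains the authoritative schema.

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative container mirroring the observation fields listed above.
# Field names come from the README; types and defaults are assumptions.
@dataclass
class Observation:
    ticket: dict
    available_actions: list
    system_message: str
    history: list = field(default_factory=list)
    tool_output: Optional[str] = None
    step_count: int = 0

obs = Observation(
    ticket={"id": "TCK-1", "issue_type": "refund_request", "status": "open"},
    available_actions=["check_policy", "issue_refund", "close_ticket"],
    system_message="Resolve the ticket in line with policy.",
)
```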
## Inference Interface Contract
The submission entrypoint is [inference.py](inference.py) in repository root.
Required environment variables:
- `API_BASE_URL`: OpenAI-compatible API endpoint
- `MODEL_NAME`: model identifier
- `HF_TOKEN`: API key/token
The inference loop uses OpenAI client calls and emits strict structured logs:
- `[START] task=... env=... model=...`
- `[STEP] step=... action=... reward=... done=... error=...`
- `[END] success=... steps=... score=... rewards=...`
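Reviewers can parse these log lines mechanically. A sketch, assuming the `key=value` fields appear exactly in the order shown above (the sample line in the test is illustrative, not taken from a real run):

```python
import re
from typing import Optional

# Matches lines shaped like:
#   [STEP] step=1 action=check_policy reward=0.2 done=False error=None
STEP_RE = re.compile(
    r"\[STEP\] step=(?P<step>\d+) action=(?P<action>\S+) "
    r"reward=(?P<reward>[\d.]+) done=(?P<done>\S+) error=(?P<error>\S+)"
)

def parse_step(line: str) -> Optional[dict]:
    """Return the [STEP] fields as a dict, or None for non-matching lines."""
    m = STEP_RE.match(line)
    return m.groupdict() if m else None
```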
Action serialization format expected from the model:
```json
{"action_type": "check_policy", "parameters": {"issue_type": "refund_request"}}
```
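Models sometimes wrap the JSON in extra prose, so a defensive parse is useful. This is a hedged sketch, not the actual `inference.py` logic; only the `action_type`/`parameters` keys come from the format above.

```python
import json

def parse_model_action(text: str) -> dict:
    """Extract the first-to-last JSON object span from model output (sketch)."""
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in model output")
    action = json.loads(text[start : end + 1])
    if "action_type" not in action or "parameters" not in action:
        raise ValueError("missing action_type or parameters")
    return action
```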
## API Endpoints (Runtime Environment)
Implemented in [server/app.py](server/app.py):
- `GET /` health check
- `POST /reset` starts a new session and returns initial observation
- `POST /step` applies an action for a session
- `GET /state?session_id=...` returns typed environment state
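A minimal standard-library client for these endpoints might look like the sketch below. The endpoint paths come from the list above; the request payload shapes (e.g. a `session_id`/`action` body for `/step`) are assumptions, since the authoritative contract lives in [server/app.py](server/app.py).

```python
import json
from urllib import request

class SupportEnvClient:
    """Illustrative HTTP client for the /reset and /step endpoints."""

    def __init__(self, base_url: str = "http://localhost:7860"):
        self.base_url = base_url.rstrip("/")

    def _post(self, path: str, payload: dict) -> dict:
        req = request.Request(
            f"{self.base_url}{path}",
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"},
        )
        with request.urlopen(req) as resp:
            return json.loads(resp.read())

    def reset(self) -> dict:
        # Assumed to return the initial observation plus a session id.
        return self._post("/reset", {})

    def step(self, session_id: str, action: dict) -> dict:
        # Payload shape is an assumption; check server/app.py for the contract.
        return self._post("/step", {"session_id": session_id, "action": action})
```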
## Reproducibility
- Environment dynamics are deterministic for a fixed action trajectory.
- Graders are deterministic and bounded; tests in [tests/test_graders.py](tests/test_graders.py) verify this.
- Fixed benchmark trajectories are provided in [evaluate.py](evaluate.py).
## Reproducibility Enhancements
- **Seed management**: the environment supports deterministic runs by setting a random seed; use the `--seed` flag in the provided scripts.
- **Baseline scores**:
  - Random policy: 0.33
  - Greedy policy: 0.75

Both baseline scores are checked by the validation script and can be reproduced with the provided `evaluate.py` script.
## Baseline Reproduction
Run the environment and evaluate the agent:
```bash
# Install dependencies
pip install -r requirements.txt
pip install -e .
# Run baseline evaluator
python evaluate.py
```
Example output:
```json
{
  "results": {
    "task_easy_1": {"score": 0.99},
    "task_medium_1": {"score": 0.99},
    "task_hard_1": {"score": 0.99}
  }
}
```
## Setup and Run
Using Docker:
```bash
docker build -t openenv_support .
# Run API Server (HF Spaces mode):
docker run -p 7860:7860 openenv_support
```
Run the baseline inference test script locally (install `pydantic` and `openai` first):
```bash
export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o"
export HF_TOKEN="your-key"
python inference.py
```
## Pre-submission Validation (Non-Docker)
Use the validation script provided for reviewers:
```bash
chmod +x scripts/validate_submission.sh
./scripts/validate_submission.sh
```
The script checks:
- pytest suite
- grader determinism and score bounds
- openenv.yaml parse + required fields
- task difficulty coverage
- baseline evaluation output
- inference smoke run and `[START]/[STEP]/[END]` log structure
## Reviewer Quickstart
For contributors and evaluators:
```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
pip install -e .
python -m pytest -q
```