Spaces:

Sid8421
/

openenv-rl-environment

Sleeping

App Files Files Community

openenv-rl-environment / README.md

Sid8421

Improve README, tests, and validation script for RL environment

aae9736 6 days ago

preview code

raw

history blame contribute delete

7.44 kB

metadata

title: OpenEnv Support Ticket RL Environment
emoji: 🤖
colorFrom: blue
colorTo: green
sdk: docker
app_file: inference.py
license: mit
library_name: openenv
language: en
tags:
  - reinforcement-learning
  - openenv
  - hackathon
  - customer-support

OpenEnv: Support Ticket Resolution System

An OpenEnv standards-compliant reinforcement learning environment for customer support operations. The agent acts as a support specialist and resolves incoming tickets by choosing structured actions (fetch data, check policy, refund, reply, escalate, close).

Motivation & Real-world Relevance

Most RL evaluations are game-like or synthetic. This environment evaluates policy adherence and operational safety in a realistic business workflow:

The agent must gather context before taking irreversible actions.
It is rewarded for compliance and penalized for destructive shortcuts.
It is scored on both correctness and process quality.

Please see our detailed Product Requirements Document (PRD.md) for full breakdown.

Core RL Task (Domain Clarification)

Each episode is a support ticket lifecycle.

State: ticket metadata, optional fetched user profile, action history, and termination flag.
Observation: current ticket, available actions, system message, history, optional tool output, and step count.
Action: choose one of six typed operations with parameters.
Reward: dense scorer in [0.01, 0.99] based on whether the action trajectory matches policy-safe resolution behavior.

This is not a navigation/game environment; it is a process-control environment where incorrect sequencing (for example, refunding before policy verification) reduces score.

Enhanced Domain Explanation

This environment simulates a customer support ticket resolution system. The agent must navigate through a structured workflow to resolve tickets efficiently and safely. The core challenge lies in adhering to policy constraints while optimizing for resolution speed and accuracy.

Example Episode Walkthrough

Here is a detailed walkthrough of an example episode for task_easy_1:

Reset:
- Observation: A refund ticket from USR-A1 with open status and step_count=0.
Action 1: check_policy({})
- Tool output: Refund policy for accidental purchases.
- Reward: Increases for verifying the policy.
Action 2: issue_refund({"amount": "full"})
- Tool output: Refund confirmed.
- Reward: Increases for correct remediation.
Action 3: close_ticket({"resolution": "refunded"})
- Episode ends.
- Final score: Near-optimal.

Visual Representation

A flowchart or diagram can be added here to visually represent the episode flow.

Episode Walkthrough (Concrete Example)

Example: task_easy_1 accidental purchase refund.

Reset

Observation includes refund ticket from USR-A1, open status, step_count=0.

Action 1: check_policy({})

Tool output returns refund policy for accidental purchase.
Reward increases for policy verification.

Action 2: issue_refund({"amount": "full"})

Tool output confirms refund.
Reward increases for correct remediation.

Action 3: close_ticket({"resolution": "refunded"})

Episode ends.
Final score reaches near-optimal band.

Flow (high-level):

reset -> check_policy -> issue_refund -> close_ticket -> done

Task Set and Difficulty Progression

The environment contains 4 tasks, including 3 required benchmark tasks with increasing difficulty.

Task	Difficulty	What changes vs previous	Typical Horizon	Stochasticity	Expected Optimal Score
`task_easy_1`	easy	Baseline accidental purchase refund flow	3	Low	0.99
`task_medium_1`	medium	Adds policy-conflict trap: must reject invalid refund	3	Low	0.99
`task_hard_1`	hard	Requires data fetch + correct escalation reason + customer communication	3	Medium	0.99
`task_fraud_detection`	hard	Adds chargeback-based fraud risk and denial behavior	4	Medium	0.99

Difficulty metadata is encoded in env/tasks.py.

Action Space

fetch_user_data(user_id)
check_policy(issue_type)
issue_refund(amount)
reply_to_customer(message)
escalate(reason)
close_ticket(resolution)

Observation Space

Observation object fields:

ticket
available_actions
system_message
history
tool_output
step_count

Schema is documented in openenv.yaml.

Inference Interface Contract

The submission entrypoint is inference.py in repository root.

Required environment variables:

API_BASE_URL: OpenAI-compatible API endpoint
MODEL_NAME: model identifier
HF_TOKEN: API key/token

The inference loop uses OpenAI client calls and emits strict structured logs:

[START] task=... env=... model=...
[STEP] step=... action=... reward=... done=... error=...
[END] success=... steps=... score=... rewards=...

Action serialization format expected from the model:

{"action_type": "check_policy", "parameters": {"issue_type": "refund_request"}}

API Endpoints (Runtime Environment)

Implemented in server/app.py:

GET / health check
POST /reset starts a new session and returns initial observation
POST /step applies an action for a session
GET /state?session_id=... returns typed environment state

Reproducibility

Environment dynamics are deterministic for a fixed action trajectory.
Graders are deterministic and bounded; tests in tests/test_graders.py verify this.
Fixed benchmark trajectories are provided in evaluate.py.

Reproducibility Enhancements

Seed Management: The environment supports deterministic runs by setting a random seed. Use the --seed flag in scripts to ensure reproducibility.
Baseline Scores:
- Random Policy: 0.33
- Greedy Policy: 0.75

These scores are verified in the validation script and can be reproduced using the provided evaluate.py script.

Baseline Reproduction

Run the environment and evaluate the agent:

# Install dependencies
pip install -r requirements.txt
pip install -e .

# Run baseline evaluator
python evaluate.py

Example output:

{
  "results": {
    "task_easy_1": {"score": 0.99},
    "task_medium_1": {"score": 0.99},
    "task_hard_1": {"score": 0.99}
  }
}

Setup and Run

Using Docker:

docker build -t openenv_support .
# Run API Server (HF Spaces mode):
docker run -p 7860:7860 openenv_support

Run baseline inference test script locally: Ensure you install pydantic and openai first.

export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o"
export HF_TOKEN="your-key"
python inference.py

Pre-submission Validation (Non-Docker)

Use the evaluator script introduced for reviewers:

chmod +x scripts/validate_submission.sh
./scripts/validate_submission.sh

The script checks:

pytest suite
grader determinism and score bounds
openenv.yaml parse + required fields
task difficulty coverage
baseline evaluation output
inference smoke run and [START]/[STEP]/[END] log structure

Reviewer Quickstart

For contributors and evaluators:

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
pip install -e .
python -m pytest -q