title: SupportOpsEnv
sdk: docker
app_port: 7860
tags:
  - openenv
  - customer-support
  - evaluation

SupportOpsEnv

SupportOpsEnv is a multi-step environment for evaluating agents on realistic customer support operations. The agent behaves like a support analyst: it reviews ticket summaries, requests missing context, assigns priority, chooses the correct internal route, selects a resolution, escalates when needed, and finalizes the case. This models a genuine workflow used by support operations, trust and safety, monetization, and account-recovery teams.

The environment is designed to score well against OpenEnv-style hackathon criteria:

Real-world task simulation instead of a toy game
Three deterministic tasks with easy, medium, and hard difficulty
Dense reward shaping across the trajectory
Typed observation, action, and reward models
Reproducible OpenAI baseline runner
Reproducible rule-based baseline runner that works with no API key
Dockerized deployment path for Hugging Face Spaces

Environment Motivation

Support queue triage is one of the clearest real-world benchmarks for agent quality:

Humans perform it every day
It requires multi-step reasoning, not one-shot classification
Progress can be measured deterministically
It exposes practical agent failure modes such as premature resolution, wrong escalation, and poor prioritization

Observation Space

Observation is a Pydantic model with:

task_id: active task identifier
difficulty: easy, medium, or hard
title: task title
instruction: natural-language objective
queue_mode: whether the task contains multiple tickets
tickets: list of ticket observations
remaining_steps: steps left in the episode
available_actions: valid action names
current_queue_order: current queue ranking, if any
score_hint: latest intermediate grader snapshot

Each ticket observation contains:

ticket_id
summary
visible_context
discovered_context
selected_priority
selected_route
selected_resolution
escalation_team
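As a rough illustration of the shape described above, the two models could be sketched like this. The repository uses Pydantic; this sketch uses standard-library dataclasses only to stay dependency-free, and every field type and default shown here is an assumption, since the README lists field names but not types.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TicketObservation:
    # Field names come from the README; the types are assumed.
    ticket_id: str
    summary: str
    visible_context: Dict[str, str] = field(default_factory=dict)
    discovered_context: Dict[str, str] = field(default_factory=dict)
    selected_priority: Optional[str] = None
    selected_route: Optional[str] = None
    selected_resolution: Optional[str] = None
    escalation_team: Optional[str] = None


@dataclass
class Observation:
    task_id: str
    difficulty: str  # "easy", "medium", or "hard"
    title: str
    instruction: str
    queue_mode: bool
    tickets: List[TicketObservation]
    remaining_steps: int
    available_actions: List[str]
    current_queue_order: List[str] = field(default_factory=list)
    score_hint: Dict[str, float] = field(default_factory=dict)


# Hypothetical initial observation for the easy task; "T1", the summary
# text, and the step budget are made up for illustration.
obs = Observation(
    task_id="easy_account_takeover",
    difficulty="easy",
    title="Account Takeover Triage",
    instruction="Handle a suspected account takeover.",
    queue_mode=False,
    tickets=[TicketObservation(ticket_id="T1", summary="Unauthorized ad spend")],
    remaining_steps=10,
    available_actions=["inspect_ticket", "request_context", "finalize"],
)
```

Nothing is selected on a fresh ticket, so the `selected_*` fields and `escalation_team` start as `None` in this sketch.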

Action Space

Action is a Pydantic model with:

action_type
target
value

Supported action_type values:

inspect_ticket
request_context
set_priority
set_route
set_resolution
escalate
rank_queue
finalize
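A minimal sketch of an action with validation against the supported `action_type` values might look like the following. It uses a standard-library dataclass in place of the repository's Pydantic model, and the optional typing of `target` and `value` is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

# The eight action types listed in the README.
SUPPORTED_ACTION_TYPES = (
    "inspect_ticket",
    "request_context",
    "set_priority",
    "set_route",
    "set_resolution",
    "escalate",
    "rank_queue",
    "finalize",
)


@dataclass(frozen=True)
class Action:
    action_type: str
    target: Optional[str] = None  # assumed: a ticket id, when one applies
    value: Optional[str] = None   # assumed: the chosen priority, route, etc.

    def __post_init__(self) -> None:
        # Reject unknown action types at construction time.
        if self.action_type not in SUPPORTED_ACTION_TYPES:
            raise ValueError(f"unsupported action_type: {self.action_type!r}")


# "T1" is a hypothetical ticket id.
set_urgent = Action(action_type="set_priority", target="T1", value="urgent")
```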

Reward Design

RewardModel is a Pydantic model with:

value
components
rationale

Reward shaping is dense, not sparse:

positive reward for discovering required context
positive reward for correct intermediate decisions
positive reward for correct queue ranking progress
terminal reward from the deterministic grader score
penalties for invalid actions, redundant actions, and wasted steps

This provides a learning or evaluation signal across the full trajectory, not only at the final step.
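One way to picture the shaping terms above is a function that assembles named components and sums them into `value`. The magnitudes below (0.05, 0.10, and the penalties) are hypothetical, the queue-ranking term is omitted for brevity, and `shape_reward` is an illustrative name, not a function from `reward.py`.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class RewardModel:
    value: float
    components: Dict[str, float]
    rationale: str


def shape_reward(
    new_context_items: int = 0,
    correct_decisions: int = 0,
    invalid: bool = False,
    redundant: bool = False,
    terminal_score: Optional[float] = None,
) -> RewardModel:
    """Combine per-step signals into one reward (all weights are assumed)."""
    components: Dict[str, float] = {}
    if new_context_items:
        components["context_discovery"] = 0.05 * new_context_items
    if correct_decisions:
        components["correct_decision"] = 0.10 * correct_decisions
    if invalid:
        components["invalid_action"] = -0.10
    if redundant:
        components["redundant_action"] = -0.05
    if terminal_score is not None:
        # On finalize, the deterministic grader score is added as-is.
        components["terminal"] = terminal_score
    value = round(sum(components.values()), 4)
    rationale = ", ".join(sorted(components)) or "no_signal"
    return RewardModel(value=value, components=components, rationale=rationale)
```

Keeping the components separate from the summed value makes it easier to see which behavior a given step was rewarded or penalized for.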

Tasks

Easy: Account Takeover Triage

Objective: correctly handle an urgent suspected account takeover with unauthorized ad spend.

Expected difficulty: easy.

Success criteria:

request the right security and billing context
assign urgent
route to account_security
choose temporary_lock_and_manual_recovery
escalate to security_specialist
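The success criteria above correspond to an action sequence roughly like the one below. The ticket id `T1` and the context keys `security_events` and `billing_activity` are hypothetical placeholders; the priority, route, resolution, and escalation values are the ones named in the criteria. The `replay` helper is an illustrative stand-in for the environment, not part of the repository.

```python
from typing import Dict, List, Optional, Tuple

Step = Tuple[str, Optional[str], Optional[str]]  # (action_type, target, value)

EASY_TRAJECTORY: List[Step] = [
    ("inspect_ticket", "T1", None),
    ("request_context", "T1", "security_events"),   # hypothetical key
    ("request_context", "T1", "billing_activity"),  # hypothetical key
    ("set_priority", "T1", "urgent"),
    ("set_route", "T1", "account_security"),
    ("set_resolution", "T1", "temporary_lock_and_manual_recovery"),
    ("escalate", "T1", "security_specialist"),
    ("finalize", None, None),
]

# Maps each decision-setting action to the state field it fills in.
FIELD_FOR_ACTION = {
    "set_priority": "priority",
    "set_route": "route",
    "set_resolution": "resolution",
    "escalate": "escalation_team",
}


def replay(trajectory: List[Step]) -> Dict[str, object]:
    """Fold a trajectory into the final decisions for a single ticket."""
    state: Dict[str, object] = {
        "context": [],
        "priority": None,
        "route": None,
        "resolution": None,
        "escalation_team": None,
        "finalized": False,
    }
    for action_type, _target, value in trajectory:
        if state["finalized"]:
            break  # nothing counts after finalize
        if action_type == "request_context":
            state["context"].append(value)
        elif action_type in FIELD_FOR_ACTION:
            state[FIELD_FOR_ACTION[action_type]] = value
        elif action_type == "finalize":
            state["finalized"] = True
    return state


final_state = replay(EASY_TRAJECTORY)
```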

Medium: Monetization Payout Hold

Objective: investigate a missing creator payout and avoid unsafe release of funds.

Expected difficulty: medium.

Success criteria:

discover tax-expiry and compliance-hold context
assign high
route to monetization_compliance
choose request_tax_renewal
avoid unnecessary escalation

Hard: Mixed Support Queue Triage

Objective: prioritize and resolve a heterogeneous queue under SLA pressure.

Expected difficulty: hard.

Success criteria:

correctly rank the queue
assign route and priority for each ticket
choose correct resolutions
escalate only the security-critical case
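The README does not say how queue ranking is scored. One plausible deterministic choice, shown here purely as an assumption, is pairwise agreement: the fraction of ticket pairs that the agent places in the same relative order as the reference ranking. The function name and ticket ids are hypothetical.

```python
from itertools import combinations
from typing import List


def pairwise_ranking_score(predicted: List[str], expected: List[str]) -> float:
    """Fraction of ticket pairs ordered the same way as in `expected`.

    Returns 0.0 if `predicted` is not a permutation of `expected`.
    Assumes ticket ids in `expected` are unique.
    """
    if len(predicted) != len(set(predicted)) or set(predicted) != set(expected):
        return 0.0
    if len(expected) < 2:
        return 1.0  # nothing to order
    position = {ticket: i for i, ticket in enumerate(predicted)}
    # combinations() yields (a, b) with a ranked ahead of b in `expected`.
    pairs = list(combinations(expected, 2))
    agreeing = sum(1 for a, b in pairs if position[a] < position[b])
    return agreeing / len(pairs)
```

A pairwise measure gives partial credit for a mostly correct ranking, which fits the dense "queue ranking progress" reward better than an all-or-nothing exact match would.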

Graders

Each task has a deterministic grader that returns a score between 0.0 and 1.0.

Easy grader weights context, priority, route, resolution, and escalation
Medium grader weights context and policy-safe resolution more heavily
Hard grader scores per-ticket handling and queue ranking

Programmatic graders live in support_ops_env/graders.
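To make the weighting idea concrete, here is a minimal sketch of what an easy-task grader could look like. The equal 0.2 weights and the required context keys are assumptions for illustration; the actual values live in `support_ops_env/graders` and may differ.

```python
from typing import Dict, Set

# Expected decisions, taken from the easy task's success criteria.
EXPECTED = {
    "priority": "urgent",
    "route": "account_security",
    "resolution": "temporary_lock_and_manual_recovery",
    "escalation_team": "security_specialist",
}

# Hypothetical context keys and weights (the weights sum to 1.0).
REQUIRED_CONTEXT = {"security_events", "billing_activity"}
WEIGHTS = {
    "context": 0.2,
    "priority": 0.2,
    "route": 0.2,
    "resolution": 0.2,
    "escalation_team": 0.2,
}


def grade_easy(decisions: Dict[str, str], discovered: Set[str]) -> float:
    """Deterministic weighted score in [0.0, 1.0] for the easy task."""
    # Partial credit for context: share of required items discovered.
    found = len(REQUIRED_CONTEXT & discovered) / len(REQUIRED_CONTEXT)
    score = WEIGHTS["context"] * found
    # Full weight for each decision that matches the expected value.
    for name, expected in EXPECTED.items():
        if decisions.get(name) == expected:
            score += WEIGHTS[name]
    return round(min(max(score, 0.0), 1.0), 4)
```

Because the function depends only on its inputs, the same trajectory always yields the same score, which is what makes the baseline numbers reproducible.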

Setup

cd support_ops_env
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Usage

Run the local tests:

python -m unittest discover -s tests -p 'test_*.py'

Run the app locally:

python app.py

Run the default no-API baseline:

python scripts/run_rule_baseline.py

Run the OpenAI baseline if you have an API key:

export OPENAI_API_KEY=your_key_here
python scripts/run_baseline.py --model gpt-4.1-mini

Validate metadata:

bash scripts/validate_env.sh

If the openenv CLI is installed, the script will also run openenv validate openenv.yaml.

Baseline Scores

The repository includes a deterministic baseline, scripts/run_rule_baseline.py, so you can produce reproducible scores without any external API.

Run it with:

python scripts/run_rule_baseline.py

This writes rule_baseline_results.json with per-task transcripts and the average score.

The current deterministic baseline scores from this workspace are:

easy_account_takeover: 1.0
medium_payout_hold: 1.0
hard_queue_triage: 1.0
average: 1.0

The OpenAI baseline in scripts/run_baseline.py remains available as an optional comparison path after installing dependencies and setting OPENAI_API_KEY.

Hugging Face Space Deployment

This repository includes:

Dockerfile
app.py
openenv.yaml

To deploy as a Docker Space:

Create a new Hugging Face Space with SDK set to Docker.
Upload this repository.
Add the openenv tag in the Space metadata.
Optionally set OPENAI_API_KEY as a Space secret for baseline experiments.

Project Structure

support_ops_env/
├── support_ops_env/
│   ├── env.py
│   ├── models.py
│   ├── reward.py
│   ├── state.py
│   ├── data/
│   ├── graders/
│   └── tasks/
├── scripts/
├── tests/
├── app.py
├── openenv.yaml
├── Dockerfile
├── requirements.txt
└── README.md