---
title: SupportOpsEnv
sdk: docker
app_port: 7860
tags:
  - openenv
  - customer-support
  - evaluation
---

# SupportOpsEnv

SupportOpsEnv is a multi-step environment for evaluating agents on realistic customer support operations. The agent behaves like a support analyst: it reviews ticket summaries, requests missing context, assigns priority, chooses the correct internal route, selects a resolution, escalates when needed, and finalizes the case. This models a genuine workflow used by support operations, trust and safety, monetization, and account-recovery teams.

The environment is designed to align with OpenEnv-style hackathon criteria:

- Real-world task simulation instead of a toy game
- Three deterministic tasks with easy, medium, and hard difficulty
- Dense reward shaping across the trajectory
- Typed observation, action, and reward models
- Reproducible OpenAI baseline runner
- Reproducible rule-based baseline runner that works with no API key
- Dockerized deployment on Hugging Face Spaces

## Environment Motivation

Support queue triage is one of the clearest real-world benchmarks for agent quality:

- Humans perform it every day
- It requires multi-step reasoning, not one-shot classification
- Progress can be measured deterministically
- It exposes practical agent failure modes such as premature resolution, wrong escalation, and poor prioritization

## Observation Space

Observation is a Pydantic model with:

- `task_id`: active task identifier
- `difficulty`: easy, medium, or hard
- `title`: task title
- `instruction`: natural-language objective
- `queue_mode`: whether the task contains multiple tickets
- `tickets`: list of ticket observations
- `remaining_steps`: steps left in the episode
- `available_actions`: valid action names
- `current_queue_order`: current queue ranking, if any
- `score_hint`: latest intermediate grader snapshot

Each ticket observation contains:

- `ticket_id`
- `summary`
- `visible_context`
- `discovered_context`
- `selected_priority`
- `selected_route`
- `selected_resolution`
- `escalation_team`
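
The shape of these observations can be sketched as follows. This is a minimal illustration using dataclasses; the actual environment defines these as Pydantic models (presumably in `support_ops_env/models.py`), so exact types, defaults, and field order may differ:

```python
from dataclasses import dataclass, field
from typing import Optional

# Minimal sketch of the observation models described above.
# The real environment uses Pydantic; types and defaults here are assumptions.

@dataclass
class TicketObservation:
    ticket_id: str
    summary: str
    visible_context: dict = field(default_factory=dict)
    discovered_context: dict = field(default_factory=dict)
    selected_priority: Optional[str] = None
    selected_route: Optional[str] = None
    selected_resolution: Optional[str] = None
    escalation_team: Optional[str] = None

@dataclass
class Observation:
    task_id: str
    difficulty: str            # "easy" | "medium" | "hard"
    title: str
    instruction: str
    queue_mode: bool           # True when the task contains multiple tickets
    tickets: list
    remaining_steps: int
    available_actions: list
    current_queue_order: Optional[list] = None
    score_hint: Optional[float] = None

obs = Observation(
    task_id="easy_account_takeover",
    difficulty="easy",
    title="Account Takeover Triage",
    instruction="Handle a suspected account takeover.",
    queue_mode=False,
    tickets=[TicketObservation(ticket_id="T1", summary="Unauthorized ad spend reported")],
    remaining_steps=12,
    available_actions=["inspect_ticket", "request_context", "set_priority"],
)
print(obs.tickets[0].ticket_id)  # T1
```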

## Action Space

`Action` is a Pydantic model with:

- `action_type`
- `target`
- `value`

Supported `action_type` values:

| `action_type` | `target` | `value` example |
| --- | --- | --- |
| `inspect_ticket` | ticket ID | `""` |
| `request_context` | ticket ID | `"tax_status"` |
| `set_priority` | ticket ID | `"urgent"` / `"high"` / `"normal"` / `"low"` |
| `set_route` | ticket ID | `"account_security"` / `"billing_refunds"` / `"monetization_compliance"` / `"policy_appeals"` |
| `set_resolution` | ticket ID | `"temporary_lock_and_manual_recovery"` / `"request_tax_renewal"` / `"approve_refund"` / `"expedited_human_review"` |
| `escalate` | ticket ID | `"security_specialist"` |
| `rank_queue` | `"queue"` | `"T2,T1,T3"` |
| `finalize` | ticket ID | `""` |
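
As an illustration of how these actions look on the wire, here is a hedged sketch of a client-side validator built from the table above (the value sets are copied from the table; the server performs its own validation via the Pydantic `Action` model, so this only mirrors the documented space):

```python
# Client-side sanity check for the documented action space.
# Value choices come from the action table; anything not listed passes through.

SUPPORTED_ACTIONS = {
    "inspect_ticket", "request_context", "set_priority", "set_route",
    "set_resolution", "escalate", "rank_queue", "finalize",
}
VALUE_CHOICES = {
    "set_priority": {"urgent", "high", "normal", "low"},
    "set_route": {"account_security", "billing_refunds",
                  "monetization_compliance", "policy_appeals"},
    "set_resolution": {"temporary_lock_and_manual_recovery", "request_tax_renewal",
                       "approve_refund", "expedited_human_review"},
}

def check_action(action: dict) -> bool:
    """Return True if the action matches the documented action space."""
    if action.get("action_type") not in SUPPORTED_ACTIONS:
        return False
    choices = VALUE_CHOICES.get(action["action_type"])
    if choices is not None and action.get("value") not in choices:
        return False
    return True

print(check_action({"action_type": "set_priority", "target": "T1", "value": "urgent"}))  # True
print(check_action({"action_type": "set_priority", "target": "T1", "value": "asap"}))    # False
```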

## Reward Design

`RewardModel` is a Pydantic model with:

- `value`: scalar reward for this step
- `components`: dict of named sub-rewards
- `rationale`: human-readable explanation

Reward shaping is dense, not sparse:

- positive reward for discovering required context keys
- positive reward for correct intermediate decisions (priority, route, resolution)
- positive reward for correct queue ranking progress
- terminal reward from the deterministic grader score
- penalties for invalid actions, redundant actions, and wasted steps

This creates a learning or evaluation signal over the full trajectory, not just at episode end.
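
A hedged sketch of how such a shaped step reward could be assembled from named components (the component names and magnitudes here are illustrative assumptions, not the environment's actual keys or weights):

```python
def build_reward(components: dict) -> dict:
    """Sum named sub-rewards into a scalar, mirroring the RewardModel fields."""
    value = sum(components.values())
    rationale = "; ".join(f"{name}: {delta:+.2f}" for name, delta in components.items())
    return {"value": value, "components": components, "rationale": rationale}

# Hypothetical shaped reward for one step: two correct moves, one penalty.
step_reward = build_reward({
    "context_discovered": 0.10,   # found a required context key
    "correct_priority": 0.15,     # intermediate decision matched the rubric
    "redundant_action": -0.05,    # penalty for repeating an action
})
print(round(step_reward["value"], 2))
```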

## Tasks

### Easy: Account Takeover Triage

Objective: correctly handle an urgent suspected account takeover with unauthorized ad spend.

Success criteria:

- request the right security and billing context
- assign `urgent` priority
- route to `account_security`
- choose `temporary_lock_and_manual_recovery`
- escalate to `security_specialist`
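
Under the success criteria above, an optimal trajectory might look like the following. The priority, route, resolution, and escalation values come from this README; the context keys (`login_history`, `billing_activity`) and the exact step order are illustrative assumptions:

```python
# Hypothetical optimal action sequence for easy_account_takeover.
# Context keys are invented for illustration; decision values match the task rubric.
optimal_trajectory = [
    {"action_type": "inspect_ticket",  "target": "T1", "value": ""},
    {"action_type": "request_context", "target": "T1", "value": "login_history"},
    {"action_type": "request_context", "target": "T1", "value": "billing_activity"},
    {"action_type": "set_priority",    "target": "T1", "value": "urgent"},
    {"action_type": "set_route",       "target": "T1", "value": "account_security"},
    {"action_type": "set_resolution",  "target": "T1",
     "value": "temporary_lock_and_manual_recovery"},
    {"action_type": "escalate",        "target": "T1", "value": "security_specialist"},
    {"action_type": "finalize",        "target": "T1", "value": ""},
]
for step in optimal_trajectory:
    print(step["action_type"])
```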

### Medium: Monetization Payout Hold

Objective: investigate a missing creator payout and avoid unsafe release of funds.

Success criteria:

- discover tax-expiry and compliance-hold context
- assign `high` priority
- route to `monetization_compliance`
- choose `request_tax_renewal`
- avoid unnecessary escalation

### Hard: Mixed Support Queue Triage

Objective: prioritize and resolve a heterogeneous queue of three tickets under SLA pressure.

Success criteria:

- correctly rank the queue by urgency
- assign route and priority for each ticket independently
- choose correct resolutions per ticket
- escalate only the security-critical case

## Graders

Each task has a deterministic grader that returns a score in [0.0, 1.0].

- The easy grader weights context discovery, priority, route, resolution, and escalation
- The medium grader weights context and policy-safe resolution more heavily
- The hard grader scores per-ticket handling and queue ranking independently

Programmatic graders live in `support_ops_env/graders/`.
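
To make the weighting idea concrete, here is a hedged sketch of what a checklist-style grader for the easy task could look like. The weights, check names, and state keys are invented for illustration; the actual graders in `support_ops_env/graders/` may be structured quite differently:

```python
# Illustrative checklist grader; weights and state keys are assumptions.
EASY_WEIGHTS = {
    "context_discovery": 0.2,
    "priority": 0.2,
    "route": 0.2,
    "resolution": 0.2,
    "escalation": 0.2,
}

def grade_easy(final_state: dict) -> float:
    """Return a deterministic score in [0.0, 1.0] from the final ticket state."""
    checks = {
        "context_discovery": final_state.get("required_context_found", False),
        "priority": final_state.get("selected_priority") == "urgent",
        "route": final_state.get("selected_route") == "account_security",
        "resolution": final_state.get("selected_resolution")
                      == "temporary_lock_and_manual_recovery",
        "escalation": final_state.get("escalation_team") == "security_specialist",
    }
    return sum(EASY_WEIGHTS[name] for name, passed in checks.items() if passed)

perfect = grade_easy({
    "required_context_found": True,
    "selected_priority": "urgent",
    "selected_route": "account_security",
    "selected_resolution": "temporary_lock_and_manual_recovery",
    "escalation_team": "security_specialist",
})
print(round(perfect, 3))
```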

## Baseline Scores

### Rule-based baseline (no API key required)

The deterministic rule-based baseline always takes the optimal action sequence and serves as a sanity check that the graders are correct and that a perfect score is reachable:

| Task | Score |
| --- | --- |
| `easy_account_takeover` | 1.000 |
| `medium_payout_hold` | 1.000 |
| `hard_queue_triage` | 1.000 |
| average | 1.000 |

### LLM baseline (GPT-4.1-mini)

These are the reproducible scores from the OpenAI baseline runner. They demonstrate that the environment provides a genuine challenge to frontier models, particularly on the hard task:

| Task | Score | Notes |
| --- | --- | --- |
| `easy_account_takeover` | ~0.20 | Model skips mandatory `set_priority` / `set_route` / `set_resolution` before `finalize` |
| `medium_payout_hold` | ~0.35 | Correct context discovery but premature `finalize` |
| `hard_queue_triage` | ~0.13 | Multi-ticket ranking and per-ticket mandatory actions not completed |
| average | ~0.23 | |

The gap between the rule baseline and the LLM baseline confirms the reward function produces genuine signal and the hard task challenges frontier models.

## Setup

```bash
cd support_ops_env
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

## Usage

Run the local tests:

```bash
python -m unittest discover -s tests -p 'test_*.py'
```

Run the app locally:

```bash
python app.py
```

Run the default no-API baseline:

```bash
python scripts/run_rule_baseline.py
```

Run the OpenAI baseline:

```bash
export OPENAI_API_KEY=your_key_here
python scripts/run_baseline.py --model gpt-4.1-mini
```

Validate OpenEnv metadata:

```bash
bash scripts/validate_env.sh
# If the openenv CLI is installed, this also runs: openenv validate openenv.yaml
```

## API Quick Start

The live environment is available at https://suppops-supportopsenv.hf.space.

Reset to a task:

```bash
curl -X POST https://suppops-supportopsenv.hf.space/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": "easy_account_takeover"}'
```

Take a step:

```bash
curl -X POST https://suppops-supportopsenv.hf.space/step \
  -H "Content-Type: application/json" \
  -d '{"action": {"action_type": "inspect_ticket", "target": "T1", "value": ""}}'
```

Inspect the full environment state:

```bash
curl https://suppops-supportopsenv.hf.space/state
```

Get JSON schemas for all models:

```bash
curl https://suppops-supportopsenv.hf.space/schema
```
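
The same loop can be driven from Python. A minimal sketch using only the standard library; the base URL and the `/reset` and `/step` payload shapes come from the curl examples above, and error handling is omitted:

```python
import json
from urllib import request

BASE_URL = "https://suppops-supportopsenv.hf.space"  # live Space from this README

def post_json(path: str, payload: dict) -> dict:
    """POST a JSON payload to the environment and decode the JSON response."""
    req = request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

# Payloads mirroring the curl examples above; send them with e.g.
#   obs = post_json("/reset", reset_payload)
#   result = post_json("/step", step_payload)
reset_payload = {"task_id": "easy_account_takeover"}
step_payload = {"action": {"action_type": "inspect_ticket", "target": "T1", "value": ""}}
print(json.dumps(step_payload, sort_keys=True))
```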

## Hugging Face Space Deployment

This repository includes a `Dockerfile`, `app.py`, and `openenv.yaml` and deploys as a Docker Space.

1. Create a new Hugging Face Space with SDK set to Docker.
2. Push this repository to the Space.
3. Add the `openenv` tag in the Space metadata (already present in this README's frontmatter).
4. Optionally set `OPENAI_API_KEY` as a Space secret for baseline experiments.

## Project Structure

```text
support_ops_env/
β”œβ”€β”€ support_ops_env/
β”‚   β”œβ”€β”€ env.py
β”‚   β”œβ”€β”€ models.py
β”‚   β”œβ”€β”€ reward.py
β”‚   β”œβ”€β”€ state.py
β”‚   β”œβ”€β”€ data/
β”‚   β”œβ”€β”€ graders/
β”‚   └── tasks/
β”œβ”€β”€ scripts/
β”‚   β”œβ”€β”€ run_baseline.py
β”‚   β”œβ”€β”€ run_rule_baseline.py
β”‚   └── validate_env.sh
β”œβ”€β”€ tests/
β”œβ”€β”€ app.py
β”œβ”€β”€ openenv.yaml
β”œβ”€β”€ Dockerfile
β”œβ”€β”€ requirements.txt
└── README.md
```