Spaces:
Sleeping
title: SupportOpsEnv
sdk: docker
app_port: 7860
tags:
- openenv
- customer-support
- evaluation
SupportOpsEnv
SupportOpsEnv is a multi-step environment for evaluating agents on realistic customer support operations. The agent behaves like a support analyst: it reviews ticket summaries, requests missing context, assigns priority, chooses the correct internal route, selects a resolution, escalates when needed, and finalizes the case. This models a genuine workflow used by support operations, trust and safety, monetization, and account-recovery teams.
The environment is designed to score well against OpenEnv-style hackathon criteria:
- Real-world task simulation instead of a toy game
- Three deterministic tasks with easy, medium, and hard difficulty
- Dense reward shaping across the trajectory
- Typed observation, action, and reward models
- Reproducible OpenAI baseline runner
- Reproducible rule-based baseline runner that works with no API key
- Dockerized deployment on Hugging Face Spaces
Environment Motivation
Support queue triage is one of the clearest real-world benchmarks for agent quality:
- Humans perform it every day
- It requires multi-step reasoning, not one-shot classification
- Progress can be measured deterministically
- It exposes practical agent failure modes such as premature resolution, wrong escalation, and poor prioritization
Observation Space
Observation is a Pydantic model with:
task_id: active task identifierdifficulty:easy,medium, orhardtitle: task titleinstruction: natural-language objectivequeue_mode: whether the task contains multiple ticketstickets: list of ticket observationsremaining_steps: steps left in the episodeavailable_actions: valid action namescurrent_queue_order: current queue ranking, if anyscore_hint: latest intermediate grader snapshot
Each ticket observation contains:
ticket_idsummaryvisible_contextdiscovered_contextselected_priorityselected_routeselected_resolutionescalation_team
Action Space
Action is a Pydantic model with:
action_typetargetvalue
Supported action_type values:
action_type |
target |
value example |
|---|---|---|
inspect_ticket |
ticket ID | "" |
request_context |
ticket ID | "tax_status" |
set_priority |
ticket ID | "urgent" / "high" / "normal" / "low" |
set_route |
ticket ID | "account_security" / "billing_refunds" / "monetization_compliance" / "policy_appeals" |
set_resolution |
ticket ID | "temporary_lock_and_manual_recovery" / "request_tax_renewal" / "approve_refund" / "expedited_human_review" |
escalate |
ticket ID | "security_specialist" |
rank_queue |
"queue" |
"T2,T1,T3" |
finalize |
ticket ID | "" |
Reward Design
RewardModel is a Pydantic model with:
value: scalar reward for this stepcomponents: dict of named sub-rewardsrationale: human-readable explanation
Reward shaping is dense, not sparse:
- positive reward for discovering required context keys
- positive reward for correct intermediate decisions (priority, route, resolution)
- positive reward for correct queue ranking progress
- terminal reward from the deterministic grader score
- penalties for invalid actions, redundant actions, and wasted steps
This creates a learning or evaluation signal over the full trajectory, not just at episode end.
Tasks
Easy: Account Takeover Triage
Objective: correctly handle an urgent suspected account takeover with unauthorized ad spend.
Success criteria:
- request the right security and billing context
- assign
urgent - route to
account_security - choose
temporary_lock_and_manual_recovery - escalate to
security_specialist
Medium: Monetization Payout Hold
Objective: investigate a missing creator payout and avoid unsafe release of funds.
Success criteria:
- discover tax-expiry and compliance-hold context
- assign
high - route to
monetization_compliance - choose
request_tax_renewal - avoid unnecessary escalation
Hard: Mixed Support Queue Triage
Objective: prioritize and resolve a heterogeneous queue of three tickets under SLA pressure.
Success criteria:
- correctly rank the queue by urgency
- assign route and priority for each ticket independently
- choose correct resolutions per ticket
- escalate only the security-critical case
Graders
Each task has a deterministic grader that returns a score in [0.0, 1.0].
- Easy grader weights context discovery, priority, route, resolution, and escalation
- Medium grader weights context and policy-safe resolution more heavily
- Hard grader scores per-ticket handling and queue ranking independently
Programmatic graders live in support_ops_env/graders/.
Baseline Scores
Rule-based baseline (no API key required)
The deterministic rule-based baseline always takes the optimal action sequence and is used as a sanity check that the graders are correct and reachable:
| Task | Score |
|---|---|
easy_account_takeover |
1.000 |
medium_payout_hold |
1.000 |
hard_queue_triage |
1.000 |
| average | 1.000 |
LLM baseline (GPT-4.1-mini)
These are the reproducible scores from the OpenAI baseline runner. They demonstrate that the environment provides a genuine challenge to frontier models, particularly on the hard task:
| Task | Score | Notes |
|---|---|---|
easy_account_takeover |
~0.20 | Model skips mandatory set_priority / set_route / set_resolution before finalize |
medium_payout_hold |
~0.35 | Correct context discovery but premature finalize |
hard_queue_triage |
~0.13 | Multi-ticket ranking and per-ticket mandatory actions not completed |
| average | ~0.23 |
The gap between the rule baseline and the LLM baseline confirms the reward function produces genuine signal and the hard task challenges frontier models.
Setup
cd support_ops_env
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Usage
Run the local tests:
python -m unittest discover -s tests -p 'test_*.py'
Run the app locally:
python app.py
Run the default no-API baseline:
python scripts/run_rule_baseline.py
Run the OpenAI baseline:
export OPENAI_API_KEY=your_key_here
python scripts/run_baseline.py --model gpt-4.1-mini
Validate OpenEnv metadata:
bash scripts/validate_env.sh
# If the openenv CLI is installed, this also runs: openenv validate openenv.yaml
API Quick Start
The live environment is available at https://suppops-supportopsenv.hf.space.
Reset to a task:
curl -X POST https://suppops-supportopsenv.hf.space/reset \
-H "Content-Type: application/json" \
-d '{"task_id": "easy_account_takeover"}'
Take a step:
curl -X POST https://suppops-supportopsenv.hf.space/step \
-H "Content-Type: application/json" \
-d '{"action": {"action_type": "inspect_ticket", "target": "T1", "value": ""}}'
Inspect the full environment state:
curl https://suppops-supportopsenv.hf.space/state
Get JSON schemas for all models:
curl https://suppops-supportopsenv.hf.space/schema
Hugging Face Space Deployment
This repository includes a Dockerfile, app.py, and openenv.yaml and deploys as a Docker Space.
- Create a new Hugging Face Space with SDK set to Docker.
- Push this repository to the Space.
- Add the
openenvtag in the Space metadata (already present in this README's frontmatter). - Optionally set
OPENAI_API_KEYas a Space secret for baseline experiments.
Project Structure
support_ops_env/
βββ support_ops_env/
β βββ env.py
β βββ models.py
β βββ reward.py
β βββ state.py
β βββ data/
β βββ graders/
β βββ tasks/
βββ scripts/
β βββ run_baseline.py
β βββ run_rule_baseline.py
β βββ validate_env.sh
βββ tests/
βββ app.py
βββ openenv.yaml
βββ Dockerfile
βββ requirements.txt
βββ README.md