---
title: SupportOpsEnv
sdk: docker
app_port: 7860
tags:
  - openenv
  - customer-support
  - evaluation
---

# SupportOpsEnv

SupportOpsEnv is a multi-step environment for evaluating agents on realistic customer support operations. The agent behaves like a support analyst: it reviews ticket summaries, requests missing context, assigns priority, chooses the correct internal route, selects a resolution, escalates when needed, and finalizes the case. This models a genuine workflow used by support operations, trust and safety, monetization, and account-recovery teams.

The environment is designed to align with OpenEnv-style hackathon criteria:

- Real-world task simulation instead of a toy game
- Three deterministic tasks with easy, medium, and hard difficulty
- Dense reward shaping across the trajectory
- Typed observation, action, and reward models
- Reproducible OpenAI baseline runner
- Reproducible rule-based baseline runner that works with no API key
- Dockerized deployment on Hugging Face Spaces

## Environment Motivation

Support queue triage is one of the clearest real-world benchmarks for agent quality:

- Humans perform it every day
- It requires multi-step reasoning, not one-shot classification
- Progress can be measured deterministically
- It exposes practical agent failure modes such as premature resolution, wrong escalation, and poor prioritization

## Observation Space

Observation is a Pydantic model with:

- `task_id`: active task identifier
- `difficulty`: easy, medium, or hard
- `title`: task title
- `instruction`: natural-language objective
- `queue_mode`: whether the task contains multiple tickets
- `tickets`: list of ticket observations
- `remaining_steps`: steps left in the episode
- `available_actions`: valid action names
- `current_queue_order`: current queue ranking, if any
- `score_hint`: latest intermediate grader snapshot

Each ticket observation contains:

- `ticket_id`
- `summary`
- `visible_context`
- `discovered_context`
- `selected_priority`
- `selected_route`
- `selected_resolution`
- `escalation_team`
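
The shape of these observations can be sketched as follows. This is a minimal illustration using dataclasses; the actual environment defines these as Pydantic models (presumably in `support_ops_env/models.py`), so exact types, defaults, and field order may differ:

```python
from dataclasses import dataclass, field
from typing import Optional

# Minimal sketch of the observation models described above.
# The real environment uses Pydantic; types and defaults here are assumptions.

@dataclass
class TicketObservation:
    ticket_id: str
    summary: str
    visible_context: dict = field(default_factory=dict)
    discovered_context: dict = field(default_factory=dict)
    selected_priority: Optional[str] = None
    selected_route: Optional[str] = None
    selected_resolution: Optional[str] = None
    escalation_team: Optional[str] = None

@dataclass
class Observation:
    task_id: str
    difficulty: str            # "easy" | "medium" | "hard"
    title: str
    instruction: str
    queue_mode: bool           # True when the task contains multiple tickets
    tickets: list
    remaining_steps: int
    available_actions: list
    current_queue_order: Optional[list] = None
    score_hint: Optional[float] = None

obs = Observation(
    task_id="easy_account_takeover",
    difficulty="easy",
    title="Account Takeover Triage",
    instruction="Handle a suspected account takeover.",
    queue_mode=False,
    tickets=[TicketObservation(ticket_id="T1", summary="Unauthorized ad spend reported")],
    remaining_steps=12,
    available_actions=["inspect_ticket", "request_context", "set_priority"],
)
print(obs.tickets[0].ticket_id)  # T1
```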

## Action Space

`Action` is a Pydantic model with:

- `action_type`
- `target`
- `value`

Supported `action_type` values:

| `action_type` | `target` | `value` example |
| --- | --- | --- |
| `inspect_ticket` | ticket ID | `""` |
| `request_context` | ticket ID | `"tax_status"` |
| `set_priority` | ticket ID | `"urgent"` / `"high"` / `"normal"` / `"low"` |
| `set_route` | ticket ID | `"account_security"` / `"billing_refunds"` / `"monetization_compliance"` / `"policy_appeals"` |
| `set_resolution` | ticket ID | `"temporary_lock_and_manual_recovery"` / `"request_tax_renewal"` / `"approve_refund"` / `"expedited_human_review"` |
| `escalate` | ticket ID | `"security_specialist"` |
| `rank_queue` | `"queue"` | `"T2,T1,T3"` |
| `finalize` | ticket ID | `""` |
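
As an illustration of how these actions look on the wire, here is a hedged sketch of a client-side validator built from the table above (the value sets are copied from the table; the server performs its own validation via the Pydantic `Action` model, so this only mirrors the documented space):

```python
# Client-side sanity check for the documented action space.
# Value choices come from the action table; anything not listed passes through.

SUPPORTED_ACTIONS = {
    "inspect_ticket", "request_context", "set_priority", "set_route",
    "set_resolution", "escalate", "rank_queue", "finalize",
}
VALUE_CHOICES = {
    "set_priority": {"urgent", "high", "normal", "low"},
    "set_route": {"account_security", "billing_refunds",
                  "monetization_compliance", "policy_appeals"},
    "set_resolution": {"temporary_lock_and_manual_recovery", "request_tax_renewal",
                       "approve_refund", "expedited_human_review"},
}

def check_action(action: dict) -> bool:
    """Return True if the action matches the documented action space."""
    if action.get("action_type") not in SUPPORTED_ACTIONS:
        return False
    choices = VALUE_CHOICES.get(action["action_type"])
    if choices is not None and action.get("value") not in choices:
        return False
    return True

print(check_action({"action_type": "set_priority", "target": "T1", "value": "urgent"}))  # True
print(check_action({"action_type": "set_priority", "target": "T1", "value": "asap"}))    # False
```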

## Reward Design

`RewardModel` is a Pydantic model with:

- `value`: scalar reward for this step
- `components`: dict of named sub-rewards
- `rationale`: human-readable explanation

Reward shaping is dense, not sparse:

- positive reward for discovering required context keys
- positive reward for correct intermediate decisions (priority, route, resolution)
- positive reward for correct queue ranking progress
- terminal reward from the deterministic grader score
- penalties for invalid actions, redundant actions, and wasted steps

This creates a learning or evaluation signal over the full trajectory, not just at episode end.
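
A hedged sketch of how such a shaped step reward could be assembled from named components (the component names and magnitudes here are illustrative assumptions, not the environment's actual keys or weights):

```python
def build_reward(components: dict) -> dict:
    """Sum named sub-rewards into a scalar, mirroring the RewardModel fields."""
    value = sum(components.values())
    rationale = "; ".join(f"{name}: {delta:+.2f}" for name, delta in components.items())
    return {"value": value, "components": components, "rationale": rationale}

# Hypothetical shaped reward for one step: two correct moves, one penalty.
step_reward = build_reward({
    "context_discovered": 0.10,   # found a required context key
    "correct_priority": 0.15,     # intermediate decision matched the rubric
    "redundant_action": -0.05,    # penalty for repeating an action
})
print(round(step_reward["value"], 2))
```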

## Tasks

### Easy: Account Takeover Triage

Objective: correctly handle an urgent suspected account takeover with unauthorized ad spend.

Success criteria:

- request the right security and billing context
- assign `urgent` priority
- route to `account_security`
- choose `temporary_lock_and_manual_recovery`
- escalate to `security_specialist`
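
Under the success criteria above, an optimal trajectory might look like the following. The priority, route, resolution, and escalation values come from this README; the context keys (`login_history`, `billing_activity`) and the exact step order are illustrative assumptions:

```python
# Hypothetical optimal action sequence for easy_account_takeover.
# Context keys are invented for illustration; decision values match the task rubric.
optimal_trajectory = [
    {"action_type": "inspect_ticket",  "target": "T1", "value": ""},
    {"action_type": "request_context", "target": "T1", "value": "login_history"},
    {"action_type": "request_context", "target": "T1", "value": "billing_activity"},
    {"action_type": "set_priority",    "target": "T1", "value": "urgent"},
    {"action_type": "set_route",       "target": "T1", "value": "account_security"},
    {"action_type": "set_resolution",  "target": "T1",
     "value": "temporary_lock_and_manual_recovery"},
    {"action_type": "escalate",        "target": "T1", "value": "security_specialist"},
    {"action_type": "finalize",        "target": "T1", "value": ""},
]
for step in optimal_trajectory:
    print(step["action_type"])
```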

### Medium: Monetization Payout Hold

Objective: investigate a missing creator payout and avoid unsafe release of funds.

Success criteria:

- discover tax-expiry and compliance-hold context
- assign `high` priority
- route to `monetization_compliance`
- choose `request_tax_renewal`
- avoid unnecessary escalation

### Hard: Mixed Support Queue Triage

Objective: prioritize and resolve a heterogeneous queue of three tickets under SLA pressure.

Success criteria:

- correctly rank the queue by urgency
- assign route and priority for each ticket independently
- choose correct resolutions per ticket
- escalate only the security-critical case

## Graders

Each task has a deterministic grader that returns a score in [0.0, 1.0].

- The easy grader weights context discovery, priority, route, resolution, and escalation
- The medium grader weights context and policy-safe resolution more heavily
- The hard grader scores per-ticket handling and queue ranking independently

Programmatic graders live in `support_ops_env/graders/`.
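
To make the weighting idea concrete, here is a hedged sketch of what a checklist-style grader for the easy task could look like. The weights, check names, and state keys are invented for illustration; the actual graders in `support_ops_env/graders/` may be structured quite differently:

```python
# Illustrative checklist grader; weights and state keys are assumptions.
EASY_WEIGHTS = {
    "context_discovery": 0.2,
    "priority": 0.2,
    "route": 0.2,
    "resolution": 0.2,
    "escalation": 0.2,
}

def grade_easy(final_state: dict) -> float:
    """Return a deterministic score in [0.0, 1.0] from the final ticket state."""
    checks = {
        "context_discovery": final_state.get("required_context_found", False),
        "priority": final_state.get("selected_priority") == "urgent",
        "route": final_state.get("selected_route") == "account_security",
        "resolution": final_state.get("selected_resolution")
                      == "temporary_lock_and_manual_recovery",
        "escalation": final_state.get("escalation_team") == "security_specialist",
    }
    return sum(EASY_WEIGHTS[name] for name, passed in checks.items() if passed)

perfect = grade_easy({
    "required_context_found": True,
    "selected_priority": "urgent",
    "selected_route": "account_security",
    "selected_resolution": "temporary_lock_and_manual_recovery",
    "escalation_team": "security_specialist",
})
print(round(perfect, 3))
```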

## Baseline Scores

### Rule-based baseline (no API key required)

The deterministic rule-based baseline always takes the optimal action sequence and serves as a sanity check that the graders are correct and that a perfect score is reachable:

| Task | Score |
| --- | --- |
| `easy_account_takeover` | 1.000 |
| `medium_payout_hold` | 1.000 |
| `hard_queue_triage` | 1.000 |
| average | 1.000 |

### LLM baseline (GPT-4.1-mini)

These are the reproducible scores from the OpenAI baseline runner. They demonstrate that the environment provides a genuine challenge to frontier models, particularly on the hard task:

| Task | Score | Notes |
| --- | --- | --- |
| `easy_account_takeover` | ~0.20 | Model skips mandatory `set_priority` / `set_route` / `set_resolution` before `finalize` |
| `medium_payout_hold` | ~0.35 | Correct context discovery but premature `finalize` |
| `hard_queue_triage` | ~0.13 | Multi-ticket ranking and per-ticket mandatory actions not completed |
| average | ~0.23 | |

The gap between the rule baseline and the LLM baseline confirms the reward function produces genuine signal and the hard task challenges frontier models.

## Setup

```bash
cd support_ops_env
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

## Usage

Run the local tests:

```bash
python -m unittest discover -s tests -p 'test_*.py'
```

Run the app locally:

```bash
python app.py
```

Run the default no-API baseline:

```bash
python scripts/run_rule_baseline.py
```

Run the OpenAI baseline:

```bash
export OPENAI_API_KEY=your_key_here
python scripts/run_baseline.py --model gpt-4.1-mini
```

Validate OpenEnv metadata:

```bash
bash scripts/validate_env.sh
# If the openenv CLI is installed, this also runs: openenv validate openenv.yaml
```

## API Quick Start

The live environment is available at https://suppops-supportopsenv.hf.space.

Reset to a task:

```bash
curl -X POST https://suppops-supportopsenv.hf.space/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": "easy_account_takeover"}'
```

Take a step:

```bash
curl -X POST https://suppops-supportopsenv.hf.space/step \
  -H "Content-Type: application/json" \
  -d '{"action": {"action_type": "inspect_ticket", "target": "T1", "value": ""}}'
```

Inspect the full environment state:

```bash
curl https://suppops-supportopsenv.hf.space/state
```

Get JSON schemas for all models:

```bash
curl https://suppops-supportopsenv.hf.space/schema
```
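
The same loop can be driven from Python. A minimal sketch using only the standard library; the base URL and the `/reset` and `/step` payload shapes come from the curl examples above, and error handling is omitted:

```python
import json
from urllib import request

BASE_URL = "https://suppops-supportopsenv.hf.space"  # live Space from this README

def post_json(path: str, payload: dict) -> dict:
    """POST a JSON payload to the environment and decode the JSON response."""
    req = request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

# Payloads mirroring the curl examples above; send them with e.g.
#   obs = post_json("/reset", reset_payload)
#   result = post_json("/step", step_payload)
reset_payload = {"task_id": "easy_account_takeover"}
step_payload = {"action": {"action_type": "inspect_ticket", "target": "T1", "value": ""}}
print(json.dumps(step_payload, sort_keys=True))
```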

## Hugging Face Space Deployment

This repository includes a `Dockerfile`, `app.py`, and `openenv.yaml` and deploys as a Docker Space.

1. Create a new Hugging Face Space with SDK set to Docker.
2. Push this repository to the Space.
3. Add the `openenv` tag in the Space metadata (already present in this README's frontmatter).
4. Optionally set `OPENAI_API_KEY` as a Space secret for baseline experiments.

## Project Structure

```text
support_ops_env/
β”œβ”€β”€ support_ops_env/
β”‚   β”œβ”€β”€ env.py
β”‚   β”œβ”€β”€ models.py
β”‚   β”œβ”€β”€ reward.py
β”‚   β”œβ”€β”€ state.py
β”‚   β”œβ”€β”€ data/
β”‚   β”œβ”€β”€ graders/
β”‚   └── tasks/
β”œβ”€β”€ scripts/
β”‚   β”œβ”€β”€ run_baseline.py
β”‚   β”œβ”€β”€ run_rule_baseline.py
β”‚   └── validate_env.sh
β”œβ”€β”€ tests/
β”œβ”€β”€ app.py
β”œβ”€β”€ openenv.yaml
β”œβ”€β”€ Dockerfile
β”œβ”€β”€ requirements.txt
└── README.md
```