{escape(APP_ENV

Task {t['id']} {escape(difficulty_labels.get(t['difficulty'], t['difficulty']).upper())}

{escape(t['name'])}

{escape(t['instructions'])}

{''.join(f'{escape(field)}' for field in t['allowed_fields'])}

{escape(APP_ENV_NAME)}

Queue decisions that actually carry forward.

A benchmark for sequential helpdesk routing: hidden context, cluster-aware follow-ons, incident handling, deferrals, and a terminal rubric that rewards queue strategy rather than isolated ticket classification.

Task family: easy to hard | Closed-form grader | Queue-level terminal objective

Explore the API | Run Hard Baseline | Inspect Task Definitions

{dataset_size} Tickets in the grounded dataset

Curated records plus queue mutation mechanics create repeatable but non-trivial episodes.

{alternate_route_count} Capacity-aware alternate routes

The grader can reward declared fallback routes instead of collapsing to all-or-nothing exact match.

{clustered_case_count} Cluster-linked or coordinated cases

Handling one ticket can stabilize or destabilize the downstream tickets in the same workstream.

{hidden_context_case_count} Hidden-context routing cases

Investigation tools matter because key evidence does not appear in the initial observation by default.
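To make the alternate-route idea concrete, here is a minimal sketch of partial-credit grading. It is an illustration under stated assumptions, not the benchmark's actual grader: `RouteKey`, `route_credit`, the queue names, and the 0.5 credit for a declared alternate are all hypothetical, and the capacity check that decides when an alternate is valid is left out.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RouteKey:
    """Hypothetical gold label for one ticket: a primary queue plus declared fallbacks."""
    primary: str
    alternates: frozenset = field(default_factory=frozenset)


def route_credit(predicted: str, gold: RouteKey, alternate_credit: float = 0.5) -> float:
    """Return full credit for the primary route, partial credit for a declared
    alternate, and zero otherwise. The 0.5 default is an assumed value."""
    if predicted == gold.primary:
        return 1.0
    if predicted in gold.alternates:
        return alternate_credit
    return 0.0


# Hypothetical ticket whose gold route is "billing" with one declared fallback.
gold = RouteKey(primary="billing", alternates=frozenset({"payments_escalation"}))
for guess in ("billing", "payments_escalation", "network"):
    print(guess, route_credit(guess, gold))
```

Because the credit is a closed-form function of the prediction and the gold key, the same episode always grades the same way, which fits the closed-form grader described above.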

Task Ladder

One benchmark family, not three disconnected demos

The difficulty ladder keeps the same full-routing output while progressively changing observability, queue dependencies, and operational pressure.

{task_cards}

Environment Signals

What the agent is balancing

The benchmark is designed so that strong policy choices change later tickets, incident coverage, and terminal queue quality rather than merely nudging a shaped reward.

Hidden context retrieval

Related-ticket previews, requester history, internal routing notes, queue cluster summaries, and capacity forecasts are revealed through explicit tool use.

investigate | request_info | cluster summary
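One way to picture hidden-context retrieval is an observation that only grows when the agent explicitly asks for more. The sketch below assumes a hypothetical `Ticket` class with `observe` and `investigate` methods; the field names and the example evidence are invented for illustration and are not taken from the environment's real schema.

```python
from dataclasses import dataclass, field


@dataclass
class Ticket:
    """Hypothetical ticket whose extra evidence is hidden until investigated."""
    ticket_id: str
    summary: str
    hidden: dict = field(default_factory=dict)
    revealed: set = field(default_factory=set)

    def observe(self) -> dict:
        """Return the default observation plus any context revealed so far."""
        obs = {"ticket_id": self.ticket_id, "summary": self.summary}
        obs.update({k: v for k, v in self.hidden.items() if k in self.revealed})
        return obs

    def investigate(self, key: str) -> dict:
        """Reveal one hidden field (if it exists) and return the new observation."""
        if key in self.hidden:
            self.revealed.add(key)
        return self.observe()


ticket = Ticket(
    ticket_id="T-1",
    summary="Cannot log in",
    hidden={"requester_history": "3 prior lockouts"},
)
print(ticket.observe())                         # initial view, no history
print(ticket.investigate("requester_history"))  # history now included
```

The point of the design is that an agent which never calls the tool routes on the summary alone, so skipping investigation has a measurable cost on hidden-context cases.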

Operational actions with consequences

Deferrals can raise later urgency, incident handling can reduce downstream debt, and weak handling can spawn or worsen follow-up work.

defer | open_incident | follow-up spawning
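As a rough illustration of how a deferral can carry consequences, the sketch below moves the head ticket to the back of the queue and raises its urgency. The function name `defer`, the dictionary fields, the urgency step, and the cap of 5 are assumptions made for this example; the environment's real mutation rules may differ.

```python
from collections import deque


def defer(queue: deque, urgency_step: int = 1, max_urgency: int = 5) -> None:
    """Move the head ticket to the back of the queue and raise its urgency.

    Both the step size and the cap are assumed values for illustration.
    """
    ticket = queue.popleft()
    ticket["urgency"] = min(ticket["urgency"] + urgency_step, max_urgency)
    ticket["deferrals"] = ticket.get("deferrals", 0) + 1
    queue.append(ticket)


queue = deque([
    {"id": "A", "urgency": 2},
    {"id": "B", "urgency": 1},
])
defer(queue)
print([(t["id"], t["urgency"]) for t in queue])
```

A deferred ticket comes back later with higher urgency, so deferring is a trade against future queue pressure rather than a free skip.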

Queue-level terminal rubric

Final scoring blends routing trajectory quality with queue management quality so agents are rewarded for coherent episode strategy, not just isolated ticket matches.

terminal rubric | queue quality | planning-aware
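The blend described above could be as simple as a weighted average of two normalized scores. The following is only a sketch: `terminal_score`, the assumption that both inputs lie in [0, 1], and the 0.6 default weight are hypothetical, and the actual rubric may combine more terms or use a different form.

```python
def terminal_score(
    routing_quality: float,
    queue_quality: float,
    routing_weight: float = 0.6,
) -> float:
    """Blend per-ticket routing quality with queue management quality.

    All three arguments are assumed to lie in [0, 1]; the default weight
    is an illustrative value, not the benchmark's real setting.
    """
    for name, value in (
        ("routing_quality", routing_quality),
        ("queue_quality", queue_quality),
        ("routing_weight", routing_weight),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return routing_weight * routing_quality + (1.0 - routing_weight) * queue_quality


# Perfect routing with a poorly managed queue still loses part of the score.
print(terminal_score(routing_quality=1.0, queue_quality=0.5, routing_weight=0.5))
```

Under a blend like this, an agent that matches every ticket but lets the queue degrade cannot reach the top score, which is the behavior the rubric is meant to encourage.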

Quick Routes

Fast ways to demo the environment

Useful entry points for judges, reviewers, or anyone trying to get signal from the project quickly.

Interactive API docs

Browse the full OpenEnv-compatible surface, request models, and built-in helper endpoints.

GET /docs | Open Docs

Task manifest

Inspect the easy, medium, and hard task definitions exactly as exposed by the server.

GET /tasks | View Tasks

Hard-task baseline rollout

See a deterministic baseline episode over the hardest queue with the current environment logic.

GET /baseline?task_id=3&seed=42 | Run Baseline

Health and deployment status

Quick check that the service is alive and ready for OpenEnv-style evaluation requests.

GET /health | Check Health
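For scripted checks, the routes above can be reached with the Python standard library alone. This is a minimal sketch: the base URL `http://localhost:8000` is an assumption about where the service is running, `endpoint_url` and `get_json` are hypothetical helper names, and `get_json` assumes the route returns JSON (`/docs` most likely returns HTML instead).

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:8000"  # assumption: adjust to the real deployment


def endpoint_url(path: str, base_url: str = BASE_URL, **params) -> str:
    """Build a full URL for one of the routes listed above."""
    query = f"?{urlencode(params)}" if params else ""
    return f"{base_url.rstrip('/')}{path}{query}"


def get_json(url: str, timeout: float = 10.0):
    """Fetch a URL and decode the body as JSON (assumes a JSON response)."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Print the URLs only; call get_json(url) once the service is reachable.
for url in (
    endpoint_url("/health"),
    endpoint_url("/tasks"),
    endpoint_url("/baseline", task_id=3, seed=42),
):
    print(url)
```

Keeping URL construction separate from the network call makes it easy to point the same script at a local server or a hosted deployment by changing `BASE_URL`.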