# AntiAtropos / openenv.yaml
# Divyansh Agrawal, commit fedca64: updated grader
spec_version: 1
name: AntiAtropos
type: space
runtime: fastapi
app: server.app:app
port: 7860
tasks:
- id: task-1
  difficulty: easy
  max_steps: 100
  grader:
    type: llm
    prompt_template: |
      You are grading a Predictive Scaling task (Task-1) in a microservice SRE simulation.
      Score the agent's performance strictly between 0 and 1 (exclusive; do not return exactly 0 or 1) based on the episode trace.
      Scoring criteria:
      - Traffic increases linearly; the agent should scale up BEFORE queues overflow (boot delay = 5 ticks).
      - Penalize heavily for high average_latency_ms or total_queue_backlog, especially on node-0 (VIP, weight=4).
      - Penalize over-provisioning (scaling up unnecessarily when load is low).
      - Reward proactive SCALE_UP actions that precede traffic spikes by 5+ steps.
      - Ignore sensor noise readings (0 or -1 values) when evaluating decisions.
      Return a single float strictly between 0 and 1 (exclusive; do not return exactly 0 or 1).
- id: task-2
  difficulty: medium
  max_steps: 100
  grader:
    type: llm
    prompt_template: |
      You are grading a Fault Tolerance task (Task-2) in a microservice SRE simulation.
      Score the agent's performance strictly between 0 and 1 (exclusive; do not return exactly 0 or 1) based on the episode trace.
      Scoring criteria:
      - A node fails randomly; the agent must detect and respond quickly (within a few steps).
      - Reward REROUTE_TRAFFIC or SCALE_UP on surviving nodes before backlog spikes.
      - Penalize sustained high total_queue_backlog or latency after the failure event.
      - REROUTE_TRAFFIC decays 50% per tick; reward agents that re-issue it persistently.
      - Penalize heavily for node-0 (VIP, weight=4) experiencing high latency post-failure.
      - Ignore sensor noise readings (0 or -1 values) when evaluating decisions.
      Return a single float strictly between 0 and 1 (exclusive; do not return exactly 0 or 1).
- id: task-3
  difficulty: hard
  max_steps: 100
  grader:
    type: llm
    prompt_template: |
      You are grading a Stability Under Surge task (Task-3) in a microservice SRE simulation.
      Score the agent's performance strictly between 0 and 1 (exclusive; do not return exactly 0 or 1) based on the episode trace.
      Scoring criteria:
      - A major traffic surge hits non-critical nodes; node-0 (Payment Gateway VIP) must be protected.
      - SHED_LOAD on node-0, node-1, or node-2 is forbidden; penalize heavily if used on them.
      - Reward pre-emptive SCALE_UP (boot delay = 5 ticks) that absorbs the surge before it arrives.
      - Penalize any spike in node-0's queue_backlog or latency during or after the surge.
      - REROUTE_TRAFFIC decays 50% per tick; reward persistent re-application to redirect load.
      - Ignore sensor noise readings (0 or -1 values) when evaluating decisions.
      Return a single float strictly between 0 and 1 (exclusive; do not return exactly 0 or 1).
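# Illustrative note on the timing rules quoted in the grader prompts above.
# These are assumptions about how the rules behave, not taken from the
# simulator source; the tick numbers are hypothetical.
#
# If "decays 50% per tick" is a plain multiplicative decay, a single
# REROUTE_TRAFFIC keeps 0.5^n of its effect after n ticks without re-issue:
#   n = 1 -> 0.5, n = 2 -> 0.25, n = 3 -> 0.125
#
# If the 5-tick boot delay means capacity requested at tick t first serves
# traffic at tick t + 5, then a surge expected at tick 20 would need
# SCALE_UP issued no later than tick 15.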