name: devops-incident-response
version: "2.0.0"
description: >
  ARIA (Adaptive Reward & Incident Architecture) is an OpenEnv-compliant RL
  environment where AI agents learn to diagnose and remediate production
  software incidents under partial observability. Agents read logs, metrics,
  and alerts across a 12-service microservices architecture, then choose
  from 14 action types (restart_service, rollback, block_ip_range,
  create_index, failover, alert_oncall, and more). Seven curated tasks of
  escalating difficulty plus procedural seed-based generation provide a
  meaningful progression for benchmarking agent reasoning quality. Dense
  reward shaping with anti-gaming mechanisms (collateral damage penalty,
  blind remediation penalty, semantic diagnosis matching) is intended to
  keep the reward signal informative and resistant to exploitation. A
  curriculum engine tracks agent mastery per task and recommends adaptive
  training sequences. Multi-agent mode splits observability between an
  Observer (logs/alerts) and a Responder (metrics/dependencies), enabling
  communication and coordination research.
author: "Arijit-07"
tags:
- openenv
- devops
- incident-response
- real-world
- multi-step
- microservices
- reward-shaping
tasks:
- id: easy
  name: Single Service Anomaly
  description: >
    A payment service is crash-looping due to a JVM heap memory leak.
    Logs clearly show OutOfMemoryError and OOMKilled pod restarts.
    The agent must read logs/metrics, diagnose the memory leak, and
    restart the affected service without touching healthy services.
  difficulty: easy
  max_steps: 15
  reward_range: [0.0, 1.0]
  expected_score_random_agent: 0.05
  expected_score_strong_llm: 0.90
- id: medium
  name: Cascading Multi-Service Failure
  description: >
    A bad deployment of inventory-service introduced connection pool
    exhaustion, cascading to order-service timeouts and api-gateway
    errors. A red-herring alert fires on notification-service (high CPU
    from a scheduled batch job). The agent must trace the cascade to the
    root service and roll it back rather than restart downstream victims.
  difficulty: medium
  max_steps: 20
  reward_range: [0.0, 1.0]
  expected_score_random_agent: 0.03
  expected_score_strong_llm: 0.55
- id: hard
  name: Silent Data Corruption
  description: >
    A data pipeline deployment silently writes incorrect price values to
    the product catalog. No standard error-rate or latency alerts fire;
    all services show green health. The signal is buried in
    price-validation WARN logs (15% mismatch rate) and an analytics
    anomaly (avg order value 9x baseline). Full credit requires both
    rollback of the pipeline AND alerting on-call for a data audit.
  difficulty: hard
  max_steps: 25
  reward_range: [0.0, 1.0]
  expected_score_random_agent: 0.01
  expected_score_strong_llm: 0.35
- id: bonus
  name: Simultaneous Dual Failure
  description: >
    Two independent failures strike at once: the log-aggregator disk is
    100% full (causing log loss across all services) and
    ml-inference-service is stuck in a model reload CPU loop. Neither
    failure is related to the other. Full credit requires fixing both
    root causes independently.
  difficulty: hard
  max_steps: 25
  reward_range: [0.0, 1.0]
  expected_score_random_agent: 0.01
  expected_score_strong_llm: 0.40
- id: security
  name: Security Incident (DDoS)
  description: >
    A botnet is running a combined DDoS and credential-stuffing attack
    against the login endpoint. The API gateway and Auth service are
    overwhelmed. The agent must read access logs, identify the attacking
    IP range, block the CIDR, and alert the security team.
  difficulty: hard
  max_steps: 20
  reward_range: [0.0, 1.0]
  expected_score_random_agent: 0.01
  expected_score_strong_llm: 0.35
- id: database
  name: Database Performance Degradation (Missing Index)
  description: >
    A database migration that ran 15 minutes ago added a new column but
    no index for it. Queries now perform full table scans (sequential
    scans), causing major database degradation. The agent must read the
    Postgres slow query logs, check sequential scan rates via metrics,
    and either create the missing index or roll back the migration.
  difficulty: hard
  max_steps: 20
  reward_range: [0.0, 1.0]
  expected_score_random_agent: 0.01
  expected_score_strong_llm: 0.35
- id: failover
  name: Multi-Region Failover
  description: >
    The primary datacenter region (us-east-1) is degraded due to a
    network partition. The agent must correctly identify which services
    support automatic multi-region failover (api-gateway, cdn-service,
    order-service, redis-cache) and which do not (payment-service,
    postgres-primary). Failing over the wrong services incurs severe
    data-inconsistency penalties.
  difficulty: hard
  max_steps: 25
  reward_range: [0.0, 1.0]
  expected_score_random_agent: 0.01
  expected_score_strong_llm: 0.25
- id: generated
  name: Procedural Incident
  description: >
    A seed-based procedural incident generated by ARIA's IncidentFactory.
    Deterministic and reproducible: any integer seed from 0 to 99999
    produces a unique, consistent incident scenario. Failure modes
    include OOM, cascade, corruption, security breaches, database
    degradation, and network partition.
  difficulty: variable
  max_steps: 20
  reward_range: [0.0, 1.0]
  expected_score_random_agent: 0.02
  expected_score_strong_llm: 0.60
action_space:
  type: structured
  description: >
    Discrete action types with optional service/parameter arguments.
    Actions are expressed as Pydantic Action objects with fields:
    action_type, service, root_cause, runbook, version, reason.
  actions:
  - name: diagnose
    description: Record the agent's root cause hypothesis
  - name: read_logs
    description: Read recent log lines for a named service
  - name: search_logs
    description: Search log lines for a service matching a query string
  - name: read_metrics
    description: Read CPU, memory, error rate, latency for a named service
  - name: read_runbook
    description: Read an operational runbook by filename
  - name: restart_service
    description: Restart a named service (clears memory, resets connections)
  - name: rollback
    description: Roll back a service to a previous version
  - name: scale_up
    description: Increase replica count for a named service
  - name: alert_oncall
    description: Page the on-call engineering team
  - name: acknowledge
    description: Acknowledge an active alert by ID
  - name: noop
    description: Take no action this step
  - name: block_ip_range
    description: Block traffic from an IP range (CIDR format)
  - name: create_index
    description: Create a database index on a specific table and column
  - name: failover
    description: Fail over a service to a different target region
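# Illustrative sketch only: one action written out with the fields listed in
# action_space.description. The service name comes from the "medium" task;
# the version and reason values are hypothetical placeholders, and leaving
# the unused fields (root_cause, runbook) unset is an assumption.
#
#   action_type: rollback
#   service: inventory-service
#   version: previous        # hypothetical version label
#   reason: connection pool exhaustion began after the latest deployment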
observation_space:
  type: structured
  description: >
    Pydantic Observation object containing: current step, task description,
    list of ServiceStatus objects (name, status, cpu, memory, error_rate,
    latency_p99, replicas, version, last_deployed), list of Alert objects
    (severity, service, message, acknowledged), recent log lines per
    service (dict of service_name -> last 10 lines), available runbook
    names, last action result/error, and incident timing info.
reward:
  type: dense
  range: [0.001, 0.999]
  description: >
    Partial credit for information gathering, correct diagnosis, and
    precise remediation. Penalties for collateral damage (restarting
    healthy services), excessive noops, and treating symptoms instead
    of root causes. Efficiency bonus for fast resolution. Rewards
    clamped to [0.001, 0.999] to avoid dead gradients in RL training.
    Anti-gaming mechanisms: collateral_damage_penalty,
    blind_remediation_penalty, semantic diagnosis matching (fuzzy match
    against ground truth root cause).
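# Sketch of the clamping described in reward.description, assuming it is
# applied once to the raw score after all bonuses and penalties (that
# ordering and the name raw_score are assumptions, not taken from the
# environment code):
#
#   reward = min(0.999, max(0.001, raw_score))
#
#   raw_score = -0.20  ->  reward = 0.001
#   raw_score =  0.64  ->  reward = 0.64
#   raw_score =  1.10  ->  reward = 0.999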
training:
  algorithm: GRPO
  model: Llama-3.1-8B-Instruct
  adapter: https://huggingface.co/Arijit-07/aria-devops-llama8b
  episodes: 160
  framework: HuggingFace TRL + Unsloth
  results:
    easy_pre: 0.42
    easy_post: 0.87
    medium_pre: 0.18
    medium_post: 0.51
    hard_pre: 0.05
    hard_post: 0.22
    average_improvement: 0.31
aria_features:
  curriculum_engine: true
  incident_generator: true
  dual_agent_mode: true
websocket:
  endpoint: /ws
  protocol: json
  commands: [reset, step, state]
docker:
  base_image: python:3.11-slim
  port: 7860
  health_endpoint: /health
  reset_endpoint: /reset
  step_endpoint: /step
  state_endpoint: /state
  metrics_endpoint: /metrics
  leaderboard_endpoint: /leaderboard