name: devops-incident-response
version: "2.0.0"
description: >
  ARIA (Adaptive Reward & Incident Architecture) is an OpenEnv-compliant RL
  environment where AI agents learn to diagnose and remediate production
  software incidents under partial observability. Agents read logs, metrics,
  and alerts across a 12-service microservices architecture, then choose from
  14 action types (restart, rollback, block_ip_range, create_index, failover,
  alert_oncall, and more).
  Seven curated tasks of escalating difficulty plus procedural seed-based
  generation provide a meaningful progression for benchmarking agent reasoning
  quality. Dense reward shaping with anti-gaming mechanisms (collateral damage
  penalty, blind remediation penalty, semantic diagnosis matching) ensures the
  reward signal is informative and resistant to exploitation.
  A curriculum engine tracks agent mastery per task and recommends adaptive
  training sequences. Multi-agent mode splits observability between an
  Observer (logs/alerts) and a Responder (metrics/dependencies), enabling
  communication and coordination research.
author: "Arijit-07"
tags:
  - openenv
  - devops
  - incident-response
  - real-world
  - multi-step
  - microservices
  - reward-shaping

tasks:
  - id: easy
    name: Single Service Anomaly
    description: >
      A payment service is crash-looping due to a JVM heap memory leak.
      Logs clearly show OutOfMemoryError and OOMKilled pod restarts.
      The agent must read logs/metrics, diagnose the memory leak, and
      restart the affected service without touching healthy services.
    difficulty: easy
    max_steps: 15
    reward_range: [0.0, 1.0]
    expected_score_random_agent: 0.05
    expected_score_strong_llm: 0.90

  - id: medium
    name: Cascading Multi-Service Failure
    description: >
      A bad deployment of inventory-service introduced connection pool
      exhaustion, cascading to order-service timeouts and api-gateway errors.
      A red-herring alert fires on notification-service (high CPU from a
      scheduled batch job). The agent must trace the cascade to the root
      service and roll it back rather than restart downstream victims.
    difficulty: medium
    max_steps: 20
    reward_range: [0.0, 1.0]
    expected_score_random_agent: 0.03
    expected_score_strong_llm: 0.55

  - id: hard
    name: Silent Data Corruption
    description: >
      A data pipeline deployment silently writes incorrect price values to
      the product catalog. No standard error-rate or latency alerts fire;
      all services show green health. The signal is buried in
      price-validation WARN logs (15% mismatch rate) and an analytics anomaly
      (avg order value 9x baseline). Full credit requires both rolling back
      the pipeline and alerting on-call for a data audit.
    difficulty: hard
    max_steps: 25
    reward_range: [0.0, 1.0]
    expected_score_random_agent: 0.01
    expected_score_strong_llm: 0.35

  - id: bonus
    name: Simultaneous Dual Failure
    description: >
      Two independent failures strike at once: log-aggregator disk is 100%
      full (causing log loss across all services) and ml-inference-service is
      stuck in a model reload CPU loop. Neither failure is related to the
      other. Full credit requires fixing both root causes independently.
    difficulty: hard
    max_steps: 25
    reward_range: [0.0, 1.0]
    expected_score_random_agent: 0.01
    expected_score_strong_llm: 0.40

  - id: security
    name: Security Incident (DDoS)
    description: >
      A botnet is performing a DDoS and credential-stuffing attack against
      the login endpoint. The API gateway and Auth service are overwhelmed.
      The agent must read access logs, identify the attacking IP range,
      block the CIDR, and alert the security team.
    difficulty: hard
    max_steps: 20
    reward_range: [0.0, 1.0]
    expected_score_random_agent: 0.01
    expected_score_strong_llm: 0.35

  - id: database
    name: Database Performance Degradation (Missing Index)
    description: >
      A database migration that ran 15 minutes ago added a new column
      without an index. Queries now fall back to sequential full-table
      scans, severely degrading database performance. The agent must read
      the Postgres slow query logs, evaluate sequential scan rates via
      metrics, and either create the missing index or roll back the
      migration.
    difficulty: hard
    max_steps: 20
    reward_range: [0.0, 1.0]
    expected_score_random_agent: 0.01
    expected_score_strong_llm: 0.35

  - id: failover
    name: Multi-Region Failover
    description: >
      A primary datacenter region (us-east-1) is degraded due to a network
      partition. The agent must correctly identify which services support
      automatic multi-region failover (api-gateway, cdn-service,
      order-service, redis-cache) and which do not (payment-service,
      postgres-primary). Failing over the wrong services causes severe data
      inconsistency penalties.
    difficulty: hard
    max_steps: 25
    reward_range: [0.0, 1.0]
    expected_score_random_agent: 0.01
    expected_score_strong_llm: 0.25

  - id: generated
    name: Procedural Incident
    description: >
      A seed-based procedural incident generated by ARIA's IncidentFactory.
      Deterministic and reproducible: any integer seed from 0 to 99999
      produces a unique, consistent incident scenario. Failure modes include
      OOM, cascade, corruption, security breaches, database degradation, and
      network partition.
    difficulty: variable
    max_steps: 20
    reward_range: [0.0, 1.0]
    expected_score_random_agent: 0.02
    expected_score_strong_llm: 0.60

action_space:
  type: structured
  description: >
    Discrete action types with optional service/parameter arguments.
    Actions are expressed as Pydantic Action objects with fields:
    action_type, service, root_cause, runbook, version, reason.
  actions:
    - name: diagnose
      description: Record the agent's root cause hypothesis
    - name: read_logs
      description: Read recent log lines for a named service
    - name: search_logs
      description: Search log lines for a service matching a query string
    - name: read_metrics
      description: Read CPU, memory, error rate, latency for a named service
    - name: read_runbook
      description: Read an operational runbook by filename
    - name: restart_service
      description: Restart a named service (clears memory, resets connections)
    - name: rollback
      description: Roll back a service to a previous version
    - name: scale_up
      description: Increase replica count for a named service
    - name: alert_oncall
      description: Page the on-call engineering team
    - name: acknowledge
      description: Acknowledge an active alert by ID
    - name: noop
      description: Take no action this step
    - name: block_ip_range
      description: Block traffic from an IP range (CIDR format)
    - name: create_index
      description: Create a database index on a specific table and column
    - name: failover
      description: Fail over a service to a different target region

observation_space:
  type: structured
  description: >
    Pydantic Observation object containing: current step, task description,
    list of ServiceStatus objects (name, status, cpu, memory, error_rate,
    latency_p99, replicas, version, last_deployed), list of Alert objects
    (severity, service, message, acknowledged), recent log lines per service
    (dict of service_name -> last 10 lines), available runbook names,
    last action result/error, and incident timing info.

reward:
  type: dense
  range: [0.001, 0.999]
  description: >
    Partial credit for information gathering, correct diagnosis, and precise
    remediation. Penalties for collateral damage (restarting healthy
    services), excessive noops, and treating symptoms instead of root causes.
    Efficiency bonus for fast resolution. Rewards are clamped to
    [0.001, 0.999] to avoid dead gradients in RL training.
    Anti-gaming mechanisms: collateral_damage_penalty,
    blind_remediation_penalty, semantic diagnosis matching (fuzzy match
    against ground truth root cause).

training:
  algorithm: GRPO
  model: Llama-3.1-8B-Instruct
  adapter: https://huggingface.co/Arijit-07/aria-devops-llama8b
  episodes: 160
  framework: HuggingFace TRL + Unsloth
  results:
    easy_pre: 0.42
    easy_post: 0.87
    medium_pre: 0.18
    medium_post: 0.51
    hard_pre: 0.05
    hard_post: 0.22
    average_improvement: 0.31

aria_features:
  curriculum_engine: true
  incident_generator: true
  dual_agent_mode: true

websocket:
  endpoint: /ws
  protocol: json
  commands: [reset, step, state]

docker:
  base_image: python:3.11-slim
  port: 7860
  health_endpoint: /health
  reset_endpoint: /reset
  step_endpoint: /step
  state_endpoint: /state
  metrics_endpoint: /metrics
  leaderboard_endpoint: /leaderboard
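
# Illustrative example only, not part of the environment spec: a sketch of a
# single Action for the "medium" task, using the field names listed under
# action_space (action_type, service, root_cause, runbook, version, reason).
# The layout below and every value in it are assumptions made for
# illustration; the version string in particular is a hypothetical
# placeholder, and fields that do not apply to a rollback are left out.
#
#   action_type: rollback
#   service: inventory-service
#   version: "previous"   # hypothetical placeholder, not a real version tag
#   reason: Connection pool exhaustion began after the inventory-service deploy
#
# An agent would normally gather evidence first (read_logs, read_metrics) and
# record a diagnose action before remediating, since the reward description
# above penalizes blind remediation.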