devops-incident-response / openenv.yaml

Arijit-07

Finalizing ARIA for production deployment: 8B model migration, documentation polish, cleanup

bdd0439 about 1 month ago

name: devops-incident-response
version: "2.0.0"
description: >
  ARIA (Adaptive Reward & Incident Architecture) is an OpenEnv-compliant RL
  environment where AI agents learn to diagnose and remediate production
  software incidents under partial observability. Agents read logs, metrics,
  and alerts across a 12-service microservices architecture, then choose
  from 14 action types (restart_service, rollback, block_ip_range,
  create_index, failover, alert_oncall, and more). Seven curated tasks of
  escalating difficulty plus procedural seed-based generation provide a
  graded progression for benchmarking agent reasoning quality. Dense reward
  shaping with anti-gaming mechanisms (collateral damage penalty, blind
  remediation penalty, semantic diagnosis matching) keeps the reward signal
  informative and resistant to exploitation. A curriculum engine tracks
  agent mastery per task and recommends adaptive training sequences.
  Multi-agent mode splits observability between an Observer (logs/alerts)
  and a Responder (metrics/dependencies), enabling communication and
  coordination research.

author: "Arijit-07"
tags:
  - openenv
  - devops
  - incident-response
  - real-world
  - multi-step
  - microservices
  - reward-shaping

tasks:
  - id: easy
    name: Single Service Anomaly
    description: >
      A payment service is crash-looping due to a JVM heap memory leak.
      Logs clearly show OutOfMemoryError and OOMKilled pod restarts.
      The agent must read logs/metrics, diagnose the memory leak, and
      restart the affected service without touching healthy services.
    difficulty: easy
    max_steps: 15
    reward_range: [0.0, 1.0]
    expected_score_random_agent: 0.05
    expected_score_strong_llm: 0.90

  - id: medium
    name: Cascading Multi-Service Failure
    description: >
      A bad deployment of inventory-service introduced connection pool
      exhaustion, cascading to order-service timeouts and api-gateway
      errors. A red-herring alert fires on notification-service (high CPU
      from a scheduled batch job). The agent must trace the cascade to the
      root service and roll it back rather than restart downstream victims.
    difficulty: medium
    max_steps: 20
    reward_range: [0.0, 1.0]
    expected_score_random_agent: 0.03
    expected_score_strong_llm: 0.55

  - id: hard
    name: Silent Data Corruption
    description: >
      A data pipeline deployment silently writes incorrect price values to
      the product catalog. No standard error-rate or latency alerts fire,
      and all services show green health. The signal is buried in
      price-validation WARN logs (15% mismatch rate) and an analytics
      anomaly (avg order value 9x baseline). Full credit requires both
      rollback of the pipeline AND alerting on-call for a data audit.
    difficulty: hard
    max_steps: 25
    reward_range: [0.0, 1.0]
    expected_score_random_agent: 0.01
    expected_score_strong_llm: 0.35

  - id: bonus
    name: Simultaneous Dual Failure
    description: >
      Two independent failures strike at once: log-aggregator disk is 100%
      full (causing log loss across all services) and ml-inference-service
      is stuck in a model reload CPU loop. Neither failure is related to
      the other. Full credit requires fixing both root causes independently.
    difficulty: hard
    max_steps: 25
    reward_range: [0.0, 1.0]
    expected_score_random_agent: 0.01
    expected_score_strong_llm: 0.40

  - id: security
    name: Security Incident (DDoS)
    description: >
      A botnet is performing a DDoS and credential stuffing attack against
      the login endpoint. The API gateway and Auth service are overwhelmed.
      The agent must read access logs, identify the attacking IP range,
      block the CIDR, and alert the security team.
    difficulty: hard
    max_steps: 20
    reward_range: [0.0, 1.0]
    expected_score_random_agent: 0.01
    expected_score_strong_llm: 0.35

  - id: database
    name: Database Performance Degradation (Missing Index)
    description: >
      A database migration that ran 15 minutes ago added a new column
      without a supporting index. Queries now fall back to sequential
      full-table scans, severely degrading database performance. The agent
      must read the Postgres slow query logs, check sequential scan rates
      in the metrics, and either create the missing index or roll back the
      migration.
    difficulty: hard
    max_steps: 20
    reward_range: [0.0, 1.0]
    expected_score_random_agent: 0.01
    expected_score_strong_llm: 0.35

  - id: failover
    name: Multi-Region Failover
    description: >
      A primary datacenter region (us-east-1) is degraded due to a network
      partition. The agent must correctly identify which services support
      automatic multi-region failover (api-gateway, cdn-service,
      order-service, redis-cache) and which do not (payment-service,
      postgres-primary). Failing over the wrong services incurs severe
      data-inconsistency penalties.
    difficulty: hard
    max_steps: 25
    reward_range: [0.0, 1.0]
    expected_score_random_agent: 0.01
    expected_score_strong_llm: 0.25

  - id: generated
    name: Procedural Incident
    description: >
      A seed-based procedural incident generated by ARIA's IncidentFactory.
      Deterministic and reproducible: any integer seed from 0 to 99999
      produces a unique, consistent incident scenario. Failure modes include
      OOM, cascade, corruption, security breaches, database degradation,
      and network partition.
    difficulty: variable
    max_steps: 20
    reward_range: [0.0, 1.0]
    expected_score_random_agent: 0.02
    expected_score_strong_llm: 0.60

action_space:
  type: structured
  description: >
    Discrete action types with optional service/parameter arguments.
    Actions are expressed as Pydantic Action objects with fields:
    action_type, service, root_cause, runbook, version, reason.
  actions:
    - name: diagnose
      description: Record the agent's root cause hypothesis
    - name: read_logs
      description: Read recent log lines for a named service
    - name: search_logs
      description: Search log lines for a service matching a query string
    - name: read_metrics
      description: Read CPU, memory, error rate, latency for a named service
    - name: read_runbook
      description: Read an operational runbook by filename
    - name: restart_service
      description: Restart a named service (clears memory, resets connections)
    - name: rollback
      description: Roll back a service to a previous version
    - name: scale_up
      description: Increase replica count for a named service
    - name: alert_oncall
      description: Page the on-call engineering team
    - name: acknowledge
      description: Acknowledge an active alert by ID
    - name: noop
      description: Take no action this step
    - name: block_ip_range
      description: Block traffic from an IP range (CIDR format)
    - name: create_index
      description: Create a database index on a specific table and column
    - name: failover
      description: Fail over a service to a different target region
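# Illustrative sketch only, not part of the spec: how actions for the
# "medium" task might look, using the field names listed in the
# action_space description. The values, and the choice to omit unused
# fields, are assumptions.
#
# example_diagnosis:
#   action_type: diagnose
#   root_cause: "inventory-service connection pool exhaustion"
#
# example_remediation:
#   action_type: rollback
#   service: inventory-service
#   reason: "Connection pool exhaustion began with the latest deployment"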

observation_space:
  type: structured
  description: >
    Pydantic Observation object containing: current step, task description,
    list of ServiceStatus objects (name, status, cpu, memory, error_rate,
    latency_p99, replicas, version, last_deployed), list of Alert objects
    (severity, service, message, acknowledged), recent log lines per
    service (dict of service_name -> last 10 lines), available runbook
    names, last action result/error, and incident timing info.
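# Illustrative sketch only: a possible shape for one observation in the
# "easy" task. The ServiceStatus and Alert field names come from the
# description above; the top-level key names, units, and every value are
# assumptions, and angle-bracket placeholders stand for unknown formats.
#
# example_observation:
#   step: 3
#   services:
#     - name: payment-service
#       status: degraded
#       cpu: 0.41
#       memory: 0.97
#       error_rate: 0.32
#       latency_p99: 1850
#       replicas: 2
#       version: "<current version>"
#       last_deployed: "<timestamp>"
#   alerts:
#     - severity: critical
#       service: payment-service
#       message: "OOMKilled: pod restarted"
#       acknowledged: false
#   logs:
#     payment-service:
#       - "java.lang.OutOfMemoryError: Java heap space"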

reward:
  type: dense
  range: [0.001, 0.999]
  description: >
    Partial credit for information gathering, correct diagnosis, and
    precise remediation. Penalties for collateral damage (restarting
    healthy services), excessive noops, and treating symptoms instead
    of root causes. Efficiency bonus for fast resolution. Rewards are
    clamped to [0.001, 0.999] to avoid dead gradients in RL training.
    Anti-gaming mechanisms: collateral_damage_penalty,
    blind_remediation_penalty, and semantic diagnosis matching (fuzzy
    match against the ground truth root cause).
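# Worked example of the clamp described above, assuming it is a plain
# min/max applied to the shaped reward (the raw values are made up):
#   clamped = min(0.999, max(0.001, raw))
#   raw = -0.250  ->  0.001
#   raw =  0.640  ->  0.640
#   raw =  1.100  ->  0.999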

training:
  algorithm: GRPO
  model: Llama-3.1-8B-Instruct
  adapter: https://huggingface.co/Arijit-07/aria-devops-llama8b
  episodes: 160
  framework: HuggingFace TRL + Unsloth
  results:
    easy_pre: 0.42
    easy_post: 0.87
    medium_pre: 0.18
    medium_post: 0.51
    hard_pre: 0.05
    hard_post: 0.22
    average_improvement: 0.31
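# Sanity check on the results above, assuming average_improvement is the
# unweighted mean of the three per-tier gains (the file does not say how
# it was computed):
#   easy:   0.87 - 0.42 = 0.45
#   medium: 0.51 - 0.18 = 0.33
#   hard:   0.22 - 0.05 = 0.17
#   mean:   (0.45 + 0.33 + 0.17) / 3 = 0.95 / 3 ≈ 0.317
# The reported 0.31 is consistent with this value truncated to two decimals.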

aria_features:
  curriculum_engine: true
  incident_generator: true
  dual_agent_mode: true

websocket:
  endpoint: /ws
  protocol: json
  commands: [reset, step, state]

docker:
  base_image: python:3.11-slim
  port: 7860
  health_endpoint: /health
  reset_endpoint: /reset
  step_endpoint: /step
  state_endpoint: /state
  metrics_endpoint: /metrics
  leaderboard_endpoint: /leaderboard