End-to-End Incident Flow
How a single chaos scenario travels through AtlasOps from button click to Discord notification.
Sequence Diagram
sequenceDiagram
participant Judge as Judge / User
participant UI as Live Ops UI
participant App as app.py
participant Coord as Coordinator
participant Corr as Correlator
participant CB as Circuit Breaker
participant LLM as HF Router / vLLM
participant Tools as SRE Tools
participant Discord as Discord Webhook
Judge->>UI: Click scenario (e.g. Pod Kill)
UI->>App: POST /inject {scenario_id}
App->>App: kubectl apply (or skip on Space)
App-->>UI: 200 OK {correlation_id}
UI->>UI: Start SSE stream, show timeline
Note over App: 20s delay for alert propagation
App->>Corr: ingest(alert)
Corr-->>App: incident_id, should_dispatch=true
App->>Coord: handle_incident(alert, incident_id)
Coord->>CB: start_incident()
rect rgb(30,40,60)
Note over Coord,LLM: Triage Agent (max 10 turns)
Coord->>LLM: chat/completions (triage prompt + tools)
LLM-->>Coord: tool_call: kubectl_get
Coord->>Tools: kubectl_get(pods)
Tools-->>Coord: pod status JSON
Coord->>UI: SSE thought (triage / tool_call)
Coord->>LLM: tool result + continue
LLM-->>Coord: conclusion {severity, title, blast_radius}
Coord->>UI: SSE thought (triage / conclusion)
end
rect rgb(30,40,60)
Note over Coord,LLM: Diagnosis Agent
Coord->>LLM: chat/completions (diagnosis prompt)
LLM-->>Coord: tool_call: promql_query, jaeger_search
Coord->>Tools: promql + jaeger
Tools-->>Coord: metrics + traces
Coord->>UI: SSE thoughts
LLM-->>Coord: conclusion {root_cause, confidence}
end
opt P1 severity (approval required)
Coord->>Discord: Approval required embed
Coord->>UI: SSE thought (waiting_approval)
Judge->>UI: Click Approve / Reject
UI->>App: POST /approve {token, decision}
App->>Coord: approval callback
end
alt Approved or auto-timeout
rect rgb(30,40,60)
Note over Coord,LLM: Remediation Agent
Coord->>LLM: chat/completions (remediation prompt)
LLM-->>Coord: tool_call: argocd_rollback
Coord->>Tools: argocd_rollback(app, revision)
Tools-->>Coord: rollback result
Coord->>UI: SSE thoughts
LLM-->>Coord: conclusion {outcome: resolved}
end
else Rejected
Note over Coord: Skip remediation, record approval_rejected
end
rect rgb(30,40,60)
Note over Coord,LLM: Comms Agent
Coord->>LLM: chat/completions (comms prompt)
LLM-->>Coord: tool_call: slack_post_update, postmortem_draft
Coord->>Discord: Closure embed (if webhook configured)
Coord->>UI: SSE thought (comms / conclusion)
end
Coord->>CB: finish_incident(resolved, reason)
Note over CB: approval_rejected does NOT trip breaker
Coord->>Discord: Scenario run complete ping (finally block)
Coord-->>App: full_record
App-->>UI: Incident complete
Key Design Decisions
Correlator bypass for UI injects: Each /inject request carries a correlation_id that starts with inj-. The correlator always dispatches these as new incidents instead of merging them into an existing incident with the same Alertmanager fingerprint. This prevents a second scenario from being silently swallowed while another run is still active.
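The bypass rule can be sketched as follows. This is a minimal illustration rather than the actual AtlasOps correlator: the `Correlator` class, the `fingerprint` key, and the `inc-N` incident IDs are assumptions made for the example. Only the `inj-` prefix and the `ingest(alert)` call returning an incident ID plus a dispatch flag come from the flow described here.

```python
from dataclasses import dataclass, field


@dataclass
class Correlator:
    """Toy correlator (hypothetical): merges by fingerprint, except UI injects."""

    # fingerprint -> incident_id for incidents that are still active
    active: dict = field(default_factory=dict)
    counter: int = 0

    def _new_incident_id(self) -> str:
        self.counter += 1
        return f"inc-{self.counter}"  # ID format is an assumption

    def ingest(self, alert: dict) -> tuple:
        """Return (incident_id, should_dispatch) for an incoming alert."""
        correlation_id = alert.get("correlation_id") or ""
        if correlation_id.startswith("inj-"):
            # UI inject: always dispatched as a new incident, never merged.
            return self._new_incident_id(), True

        fingerprint = alert["fingerprint"]
        if fingerprint in self.active:
            # Same fingerprint is already being handled: merge, do not dispatch.
            return self.active[fingerprint], False

        incident_id = self._new_incident_id()
        self.active[fingerprint] = incident_id
        return incident_id, True


correlator = Correlator()
first = correlator.ingest({"fingerprint": "abc", "correlation_id": "inj-1"})
second = correlator.ingest({"fingerprint": "abc", "correlation_id": "inj-2"})
organic = correlator.ingest({"fingerprint": "abc"})
duplicate = correlator.ingest({"fingerprint": "abc"})
```

In this sketch the two injected alerts share a fingerprint yet both dispatch, while the second alert without an `inj-` correlation ID is merged into the incident opened by the first.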
Circuit breaker semantics: Only system_error and agent_error outcomes count toward the consecutive-failure threshold. Designed outcomes such as approval_rejected, manual_runbook, and approval_timeout do not trip the breaker, so judges can reject remediation freely without locking the system.
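A rough sketch of these semantics is shown below. The class shape, the default threshold of 3, the `is_open` property, and the exact `finish_incident(resolved, reason)` signature are assumptions for illustration. The sketch also assumes that a resolved incident resets the counter and that designed outcomes leave it unchanged, which the description above does not specify.

```python
# Reasons that count as real failures, per the description above.
FAILURE_REASONS = {"system_error", "agent_error"}


class CircuitBreaker:
    """Hypothetical breaker that trips after N consecutive real failures."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold  # assumed default, not from the source
        self.consecutive_failures = 0

    @property
    def is_open(self) -> bool:
        return self.consecutive_failures >= self.threshold

    def finish_incident(self, resolved: bool, reason: str = "") -> None:
        if resolved:
            # Assumption: a successful run resets the failure streak.
            self.consecutive_failures = 0
        elif reason in FAILURE_REASONS:
            self.consecutive_failures += 1
        # Designed outcomes (approval_rejected, manual_runbook,
        # approval_timeout) fall through and leave the counter unchanged.


breaker = CircuitBreaker(threshold=2)

# Rejecting remediation repeatedly never opens the breaker.
for _ in range(5):
    breaker.finish_incident(False, "approval_rejected")
open_after_rejections = breaker.is_open

# Two consecutive real failures do.
breaker.finish_incident(False, "system_error")
breaker.finish_incident(False, "agent_error")
open_after_errors = breaker.is_open
```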
HTTP retry on LLM calls: The coordinator retries chat/completions on HTTP 429 (HF Router rate limit) and 5xx responses with exponential backoff, so that transient inference failures do not fail an entire scenario.
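The retry policy might look roughly like the sketch below. The function names, the attempt limit, and the 1-second base delay are illustrative assumptions, not values taken from the coordinator. The `send` callable stands in for the real chat/completions request so that the example runs without network access.

```python
import time
from typing import Callable, Tuple


def is_retryable(status: int) -> bool:
    """Retry on rate limiting (429) and on any 5xx server error."""
    return status == 429 or 500 <= status <= 599


def call_with_retry(
    send: Callable[[], Tuple[int, str]],
    max_attempts: int = 4,       # assumed limit
    base_delay: float = 1.0,     # assumed base delay in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call send() until it returns a non-retryable status or attempts run out."""
    for attempt in range(max_attempts):
        status, body = send()
        if not is_retryable(status):
            return status, body
        if attempt < max_attempts - 1:
            # Exponential backoff: base, 2x base, 4x base, ...
            sleep(base_delay * 2 ** attempt)
    # All attempts exhausted: return the last (failing) response.
    return status, body


# Usage with a fake transport: two transient failures, then success.
responses = iter([(429, "rate limited"), (503, "unavailable"), (200, "ok")])
delays = []  # records requested sleeps instead of actually waiting
result = call_with_retry(lambda: next(responses), sleep=delays.append)
```

Passing `sleep` as a parameter keeps the backoff schedule testable. A production version would likely add jitter and honor a `Retry-After` header when the server sends one.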
POST /reset clears everything: It resets chaos manifests, circuit breaker state, and correlator incident tracking, which makes the UI Reset button a true panic switch for demos.