| # End-to-End Incident Flow |
|
|
| How a single chaos scenario travels through AtlasOps from button click to Discord notification. |
|
|
| ## Sequence Diagram |
|
|
| ```mermaid |
| sequenceDiagram |
| participant Judge as Judge / User |
| participant UI as Live Ops UI |
| participant App as app.py |
| participant Coord as Coordinator |
| participant Corr as Correlator |
| participant CB as Circuit Breaker |
| participant LLM as HF Router / vLLM |
| participant Tools as SRE Tools |
| participant Discord as Discord Webhook |
| |
| Judge->>UI: Click scenario (e.g. Pod Kill) |
| UI->>App: POST /inject {scenario_id} |
| App->>App: kubectl apply (or skip on Space) |
| App-->>UI: 200 OK {correlation_id} |
| UI->>UI: Start SSE stream, show timeline |
| |
| Note over App: 20s delay for alert propagation |
| |
| App->>Corr: ingest(alert) |
| Corr-->>App: incident_id, should_dispatch=true |
| App->>Coord: handle_incident(alert, incident_id) |
| Coord->>CB: start_incident() |
| |
| rect rgb(30,40,60) |
| Note over Coord,LLM: Triage Agent (max 10 turns) |
| Coord->>LLM: chat/completions (triage prompt + tools) |
| LLM-->>Coord: tool_call: kubectl_get |
| Coord->>Tools: kubectl_get(pods) |
| Tools-->>Coord: pod status JSON |
| Coord->>UI: SSE thought (triage / tool_call) |
| Coord->>LLM: tool result + continue |
| LLM-->>Coord: conclusion {severity, title, blast_radius} |
| Coord->>UI: SSE thought (triage / conclusion) |
| end |
| |
| rect rgb(30,40,60) |
| Note over Coord,LLM: Diagnosis Agent |
| Coord->>LLM: chat/completions (diagnosis prompt) |
| LLM-->>Coord: tool_call: promql_query, jaeger_search |
| Coord->>Tools: promql + jaeger |
| Tools-->>Coord: metrics + traces |
| Coord->>UI: SSE thoughts |
| LLM-->>Coord: conclusion {root_cause, confidence} |
| end |
| |
| alt P1 severity (approval required) |
| Coord->>Discord: Approval required embed |
| Coord->>UI: SSE thought (waiting_approval) |
| Judge->>UI: Click Approve / Reject |
| UI->>App: POST /approve {token, decision} |
| App->>Coord: approval callback |
| end |
| |
| alt Approved or auto-timeout |
| rect rgb(30,40,60) |
| Note over Coord,LLM: Remediation Agent |
| Coord->>LLM: chat/completions (remediation prompt) |
| LLM-->>Coord: tool_call: argocd_rollback |
| Coord->>Tools: argocd_rollback(app, revision) |
| Tools-->>Coord: rollback result |
| Coord->>UI: SSE thoughts |
| LLM-->>Coord: conclusion {outcome: resolved} |
| end |
| else Rejected |
| Note over Coord: Skip remediation, record approval_rejected |
| end |
| |
| rect rgb(30,40,60) |
| Note over Coord,LLM: Comms Agent |
| Coord->>LLM: chat/completions (comms prompt) |
| LLM-->>Coord: tool_call: slack_post_update, postmortem_draft |
| Coord->>Discord: Closure embed (if webhook configured) |
| Coord->>UI: SSE thought (comms / conclusion) |
| end |
| |
| Coord->>CB: finish_incident(resolved, reason) |
| Note over CB: approval_rejected does NOT trip breaker |
| Coord->>Discord: Scenario run complete ping (finally block) |
| Coord-->>App: full_record |
| App-->>UI: Incident complete |
| ``` |
|
|
| ## Key Design Decisions |
|
|
| **Correlator bypass for UI injects** β Each `/inject` carries a `correlation_id` starting with `inj-`. The correlator always dispatches these as new incidents instead of merging with existing Alertmanager fingerprints. This prevents the second scenario from being silently swallowed when another run is still active. |
|
|
| **Circuit breaker semantics** β Only `system_error` and `agent_error` outcomes count toward the consecutive failure threshold. Designed outcomes like `approval_rejected`, `manual_runbook`, and `approval_timeout` do not trip the breaker, so judges can reject remediation freely without locking the system. |
|
|
| **HTTP retry on LLM calls** β The coordinator retries `chat/completions` on HTTP 429 (HF Router rate limit) and 5xx with exponential backoff, preventing transient inference hiccups from failing entire scenarios. |
|
|
| **POST /reset clears everything** β Resets chaos manifests, circuit breaker state, and correlator incident tracking. The UI Reset button is a true panic switch for demos. |
|
|