# End-to-End Incident Flow

This page traces a single chaos scenario through AtlasOps, from button click to Discord notification.

## Sequence Diagram

```mermaid
sequenceDiagram
    participant Judge as Judge / User
    participant UI as Live Ops UI
    participant App as app.py
    participant Coord as Coordinator
    participant Corr as Correlator
    participant CB as Circuit Breaker
    participant LLM as HF Router / vLLM
    participant Tools as SRE Tools
    participant Discord as Discord Webhook

    Judge->>UI: Click scenario (e.g. Pod Kill)
    UI->>App: POST /inject {scenario_id}
    App->>App: kubectl apply (or skip on Space)
    App-->>UI: 200 OK {correlation_id}
    UI->>UI: Start SSE stream, show timeline

    Note over App: 20s delay for alert propagation

    App->>Corr: ingest(alert)
    Corr-->>App: incident_id, should_dispatch=true
    App->>Coord: handle_incident(alert, incident_id)
    Coord->>CB: start_incident()

    rect rgb(30,40,60)
        Note over Coord,LLM: Triage Agent (max 10 turns)
        Coord->>LLM: chat/completions (triage prompt + tools)
        LLM-->>Coord: tool_call: kubectl_get
        Coord->>Tools: kubectl_get(pods)
        Tools-->>Coord: pod status JSON
        Coord->>UI: SSE thought (triage / tool_call)
        Coord->>LLM: tool result + continue
        LLM-->>Coord: conclusion {severity, title, blast_radius}
        Coord->>UI: SSE thought (triage / conclusion)
    end

    rect rgb(30,40,60)
        Note over Coord,LLM: Diagnosis Agent
        Coord->>LLM: chat/completions (diagnosis prompt)
        LLM-->>Coord: tool_call: promql_query, jaeger_search
        Coord->>Tools: promql + jaeger
        Tools-->>Coord: metrics + traces
        Coord->>UI: SSE thoughts
        LLM-->>Coord: conclusion {root_cause, confidence}
    end

    opt P1 severity (approval required)
        Coord->>Discord: Approval required embed
        Coord->>UI: SSE thought (waiting_approval)
        Judge->>UI: Click Approve / Reject
        UI->>App: POST /approve {token, decision}
        App->>Coord: approval callback
    end

    alt Approved or auto-timeout
        rect rgb(30,40,60)
            Note over Coord,LLM: Remediation Agent
            Coord->>LLM: chat/completions (remediation prompt)
            LLM-->>Coord: tool_call: argocd_rollback
            Coord->>Tools: argocd_rollback(app, revision)
            Tools-->>Coord: rollback result
            Coord->>UI: SSE thoughts
            LLM-->>Coord: conclusion {outcome: resolved}
        end
    else Rejected
        Note over Coord: Skip remediation, record approval_rejected
    end

    rect rgb(30,40,60)
        Note over Coord,LLM: Comms Agent
        Coord->>LLM: chat/completions (comms prompt)
        LLM-->>Coord: tool_call: slack_post_update, postmortem_draft
        Coord->>Discord: Closure embed (if webhook configured)
        Coord->>UI: SSE thought (comms / conclusion)
    end

    Coord->>CB: finish_incident(resolved, reason)
    Note over CB: approval_rejected does NOT trip breaker
    Coord->>Discord: Scenario run complete ping (finally block)
    Coord-->>App: full_record
    App-->>UI: Incident complete
```
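
The two HTTP calls the UI makes in the diagram can be sketched from a client's point of view. This is a minimal illustration, not the actual UI code: the base URL, the `pod-kill` scenario id, the placeholder token, and the `"approve"` decision value are all assumptions. Only the paths (`/inject`, `/approve`) and the field names (`scenario_id`, `token`, `decision`) come from the diagram.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: wherever app.py is served


def build_post(path: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for an app.py endpoint."""
    return urllib.request.Request(
        url=f"{BASE_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Step 1: inject a scenario. "pod-kill" is a made-up example id.
inject_req = build_post("/inject", {"scenario_id": "pod-kill"})

# Step 2 (P1 incidents only): answer the approval prompt.
# The token placeholder and the "approve" value are hypothetical.
approve_req = build_post(
    "/approve",
    {"token": "<token-from-approval-event>", "decision": "approve"},
)

# To actually send a request against a running instance:
#   with urllib.request.urlopen(inject_req) as resp:
#       print(json.load(resp))  # expected to contain a correlation_id
```

Building the requests separately from sending them keeps the sketch runnable without a live AtlasOps instance.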

## Key Design Decisions

**Correlator bypass for UI injects** — Each `/inject` carries a `correlation_id` starting with `inj-`. The correlator always dispatches these as new incidents instead of merging with existing Alertmanager fingerprints. This prevents the second scenario from being silently swallowed when another run is still active.
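
A minimal sketch of that dispatch rule, assuming alerts arrive as dicts with an optional `correlation_id` and a `fingerprint` key. `CorrelatorSketch`, its `active` map, and the `inc-` id format are hypothetical names for illustration. Only `ingest`, the `inj-` prefix, and the `(incident_id, should_dispatch)` result come from the flow above.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class CorrelatorSketch:
    """Hypothetical stand-in for the correlator, not the AtlasOps class."""

    # Maps a fingerprint (or inject correlation_id) to its open incident id.
    active: Dict[str, str] = field(default_factory=dict)

    def ingest(self, alert: dict) -> Tuple[str, bool]:
        """Return (incident_id, should_dispatch) for an incoming alert."""
        correlation_id = alert.get("correlation_id", "")
        if correlation_id.startswith("inj-"):
            # UI inject: always open a new incident, never merge.
            incident_id = f"inc-{uuid.uuid4().hex[:8]}"
            self.active[correlation_id] = incident_id
            return incident_id, True

        fingerprint = alert["fingerprint"]
        if fingerprint in self.active:
            # Repeat Alertmanager alert: attach to the open incident.
            return self.active[fingerprint], False

        incident_id = f"inc-{uuid.uuid4().hex[:8]}"
        self.active[fingerprint] = incident_id
        return incident_id, True
```

With this shape, two injects that share a fingerprint still yield two separate incidents, while repeated Alertmanager alerts collapse into one.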

**Circuit breaker semantics** — Only `system_error` and `agent_error` outcomes count toward the consecutive failure threshold. Designed outcomes like `approval_rejected`, `manual_runbook`, and `approval_timeout` do not trip the breaker, so judges can reject remediation freely without locking the system.
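
The counting rule can be sketched as follows. The class name, the default threshold of 3, and the choice to leave the counter unchanged (rather than reset it) on designed outcomes are assumptions. The outcome names and the `finish_incident(resolved, reason)` shape come from this document.

```python
class CircuitBreakerSketch:
    """Hypothetical illustration of the breaker's counting rule."""

    # Only these outcomes count toward the consecutive failure threshold.
    COUNTED_OUTCOMES = {"system_error", "agent_error"}

    def __init__(self, threshold: int = 3):  # threshold value is assumed
        self.threshold = threshold
        self.consecutive_failures = 0

    @property
    def is_open(self) -> bool:
        """True once enough counted failures have occurred in a row."""
        return self.consecutive_failures >= self.threshold

    def finish_incident(self, resolved: bool, reason: str) -> None:
        if resolved:
            # A successful run ends the failure streak.
            self.consecutive_failures = 0
        elif reason in self.COUNTED_OUTCOMES:
            self.consecutive_failures += 1
        # Designed outcomes (approval_rejected, manual_runbook,
        # approval_timeout) leave the counter untouched in this sketch.
```

For example, any number of `finish_incident(False, "approval_rejected")` calls leaves `is_open` false, while `threshold` counted errors without an intervening success opens the breaker.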

**HTTP retry on LLM calls** — The coordinator retries `chat/completions` on HTTP 429 (HF Router rate limit) and 5xx with exponential backoff, preventing transient inference hiccups from failing entire scenarios.
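
A minimal sketch of such a retry loop, assuming the HTTP call is wrapped in a `send` callable that returns `(status, body)`. The function names, the attempt limit of 4, and the 1-second base delay are illustrative assumptions, not the coordinator's actual settings.

```python
import time
from typing import Callable, Tuple


def is_retryable(status: int) -> bool:
    """HTTP 429 and any 5xx are treated as transient."""
    return status == 429 or 500 <= status <= 599


def call_with_backoff(
    send: Callable[[], Tuple[int, str]],
    max_attempts: int = 4,      # assumed limit
    base_delay: float = 1.0,    # assumed base delay in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call send() until it returns a non-retryable status or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = send()
        last_attempt = attempt == max_attempts - 1
        if not is_retryable(status) or last_attempt:
            return status, body
        # Exponential backoff: base_delay, 2x, 4x, ...
        sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter lets a test substitute a recorder for `time.sleep`, so the backoff schedule can be checked without waiting.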

**POST /reset clears everything** — Resets chaos manifests, circuit breaker state, and correlator incident tracking. The UI Reset button is a true panic switch for demos.