atlas-ops / docs /END_TO_END_FLOW.md
Harikishanth R
AtlasOps: full deploy with reliability fixes + training evidence
4a77231

End-to-End Incident Flow

How a single chaos scenario travels through AtlasOps from button click to Discord notification.

Sequence Diagram

sequenceDiagram
    participant Judge as Judge / User
    participant UI as Live Ops UI
    participant App as app.py
    participant Coord as Coordinator
    participant Corr as Correlator
    participant CB as Circuit Breaker
    participant LLM as HF Router / vLLM
    participant Tools as SRE Tools
    participant Discord as Discord Webhook

    Judge->>UI: Click scenario (e.g. Pod Kill)
    UI->>App: POST /inject {scenario_id}
    App->>App: kubectl apply (or skip on Space)
    App-->>UI: 200 OK {correlation_id}
    UI->>UI: Start SSE stream, show timeline

    Note over App: 20s delay for alert propagation

    App->>Corr: ingest(alert)
    Corr-->>App: incident_id, should_dispatch=true
    App->>Coord: handle_incident(alert, incident_id)
    Coord->>CB: start_incident()

    rect rgb(30,40,60)
        Note over Coord,LLM: Triage Agent (max 10 turns)
        Coord->>LLM: chat/completions (triage prompt + tools)
        LLM-->>Coord: tool_call: kubectl_get
        Coord->>Tools: kubectl_get(pods)
        Tools-->>Coord: pod status JSON
        Coord->>UI: SSE thought (triage / tool_call)
        Coord->>LLM: tool result + continue
        LLM-->>Coord: conclusion {severity, title, blast_radius}
        Coord->>UI: SSE thought (triage / conclusion)
    end

    rect rgb(30,40,60)
        Note over Coord,LLM: Diagnosis Agent
        Coord->>LLM: chat/completions (diagnosis prompt)
        LLM-->>Coord: tool_call: promql_query, jaeger_search
        Coord->>Tools: promql + jaeger
        Tools-->>Coord: metrics + traces
        Coord->>UI: SSE thoughts
        LLM-->>Coord: conclusion {root_cause, confidence}
    end

    alt P1 severity (approval required)
        Coord->>Discord: Approval required embed
        Coord->>UI: SSE thought (waiting_approval)
        Judge->>UI: Click Approve / Reject
        UI->>App: POST /approve {token, decision}
        App->>Coord: approval callback
    end

    alt Approved or auto-timeout
        rect rgb(30,40,60)
            Note over Coord,LLM: Remediation Agent
            Coord->>LLM: chat/completions (remediation prompt)
            LLM-->>Coord: tool_call: argocd_rollback
            Coord->>Tools: argocd_rollback(app, revision)
            Tools-->>Coord: rollback result
            Coord->>UI: SSE thoughts
            LLM-->>Coord: conclusion {outcome: resolved}
        end
    else Rejected
        Note over Coord: Skip remediation, record approval_rejected
    end

    rect rgb(30,40,60)
        Note over Coord,LLM: Comms Agent
        Coord->>LLM: chat/completions (comms prompt)
        LLM-->>Coord: tool_call: slack_post_update, postmortem_draft
        Coord->>Discord: Closure embed (if webhook configured)
        Coord->>UI: SSE thought (comms / conclusion)
    end

    Coord->>CB: finish_incident(resolved, reason)
    Note over CB: approval_rejected does NOT trip breaker
    Coord->>Discord: Scenario run complete ping (finally block)
    Coord-->>App: full_record
    App-->>UI: Incident complete

Key Design Decisions

Correlator bypass for UI injects — Each /inject carries a correlation_id starting with inj-. The correlator always dispatches these as new incidents instead of merging with existing Alertmanager fingerprints. This prevents the second scenario from being silently swallowed when another run is still active.

Circuit breaker semantics — Only system_error and agent_error outcomes count toward the consecutive failure threshold. Designed outcomes like approval_rejected, manual_runbook, and approval_timeout do not trip the breaker, so judges can reject remediation freely without locking the system.

HTTP retry on LLM calls — The coordinator retries chat/completions on HTTP 429 (HF Router rate limit) and 5xx with exponential backoff, preventing transient inference hiccups from failing entire scenarios.

POST /reset clears everything — Resets chaos manifests, circuit breaker state, and correlator incident tracking. The UI Reset button is a true panic switch for demos.