atlas-ops / docs /END_TO_END_FLOW.md

Harikishanth R

AtlasOps: full deploy with reliability fixes + training evidence

4a77231 25 days ago


	# End-to-End Incident Flow

	How a single chaos scenario travels through AtlasOps from button click to Discord notification.

	## Sequence Diagram

	```mermaid
	sequenceDiagram
	participant Judge as Judge / User
	participant UI as Live Ops UI
	participant App as app.py
	participant Coord as Coordinator
	participant Corr as Correlator
	participant CB as Circuit Breaker
	participant LLM as HF Router / vLLM
	participant Tools as SRE Tools
	participant Discord as Discord Webhook

	Judge->>UI: Click scenario (e.g. Pod Kill)
	UI->>App: POST /inject {scenario_id}
	App->>App: kubectl apply (or skip on Space)
	App-->>UI: 200 OK {correlation_id}
	UI->>UI: Start SSE stream, show timeline

	Note over App: 20s delay for alert propagation

	App->>Corr: ingest(alert)
	Corr-->>App: incident_id, should_dispatch=true
	App->>Coord: handle_incident(alert, incident_id)
	Coord->>CB: start_incident()

	rect rgb(30,40,60)
	Note over Coord,LLM: Triage Agent (max 10 turns)
	Coord->>LLM: chat/completions (triage prompt + tools)
	LLM-->>Coord: tool_call: kubectl_get
	Coord->>Tools: kubectl_get(pods)
	Tools-->>Coord: pod status JSON
	Coord->>UI: SSE thought (triage / tool_call)
	Coord->>LLM: tool result + continue
	LLM-->>Coord: conclusion {severity, title, blast_radius}
	Coord->>UI: SSE thought (triage / conclusion)
	end

	rect rgb(30,40,60)
	Note over Coord,LLM: Diagnosis Agent
	Coord->>LLM: chat/completions (diagnosis prompt)
	LLM-->>Coord: tool_call: promql_query, jaeger_search
	Coord->>Tools: promql + jaeger
	Tools-->>Coord: metrics + traces
	Coord->>UI: SSE thoughts
	LLM-->>Coord: conclusion {root_cause, confidence}
	end

	opt P1 severity (approval required)
	Coord->>Discord: Approval required embed
	Coord->>UI: SSE thought (waiting_approval)
	Judge->>UI: Click Approve / Reject
	UI->>App: POST /approve {token, decision}
	App->>Coord: approval callback
	end

	alt Approved or auto-timeout
	rect rgb(30,40,60)
	Note over Coord,LLM: Remediation Agent
	Coord->>LLM: chat/completions (remediation prompt)
	LLM-->>Coord: tool_call: argocd_rollback
	Coord->>Tools: argocd_rollback(app, revision)
	Tools-->>Coord: rollback result
	Coord->>UI: SSE thoughts
	LLM-->>Coord: conclusion {outcome: resolved}
	end
	else Rejected
	Note over Coord: Skip remediation, record approval_rejected
	end

	rect rgb(30,40,60)
	Note over Coord,LLM: Comms Agent
	Coord->>LLM: chat/completions (comms prompt)
	LLM-->>Coord: tool_call: slack_post_update, postmortem_draft
	Coord->>Discord: Closure embed (if webhook configured)
	Coord->>UI: SSE thought (comms / conclusion)
	end

	Coord->>CB: finish_incident(resolved, reason)
	Note over CB: approval_rejected does NOT trip breaker
	Coord->>Discord: Scenario run complete ping (finally block)
	Coord-->>App: full_record
	App-->>UI: Incident complete
	```
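The `SSE thought (agent / kind)` arrows in the diagram can be illustrated with a small framing helper. This is a minimal sketch only: the `thought` event name and the `agent` / `kind` payload keys are assumptions made for illustration, not the actual AtlasOps wire format. The `event:` / `data:` framing itself is the standard Server-Sent Events layout.

```python
import json


def format_sse(agent: str, kind: str, payload: dict, event: str = "thought") -> str:
    """Serialize one agent thought as a Server-Sent Events frame.

    The field names "agent" and "kind" are hypothetical; only the
    "event:" / "data:" lines and the blank-line terminator are standard SSE.
    """
    data = json.dumps({"agent": agent, "kind": kind, **payload})
    # json.dumps emits no raw newlines by default, so one data line is enough.
    return f"event: {event}\ndata: {data}\n\n"


def parse_sse(frame: str) -> tuple[str, dict]:
    """Parse a single frame produced by format_sse (not a full SSE parser)."""
    event, data_lines = "message", []
    for line in frame.splitlines():
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_lines.append(line[len("data:"):].lstrip())
    return event, json.loads("\n".join(data_lines))
```

A triage tool call would then be emitted as `format_sse("triage", "tool_call", {"tool": "kubectl_get"})`, and the UI side can recover the same dictionary with `parse_sse`.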

	## Key Design Decisions

	Correlator bypass for UI injects: Each `/inject` request carries a `correlation_id` that starts with `inj-`. The correlator always dispatches these as new incidents instead of merging them into an existing incident by Alertmanager fingerprint. This prevents a second scenario from being silently swallowed while another run is still active.
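The bypass rule can be sketched as follows. This is an illustrative model rather than the project's correlator: the `Correlator` class shape, the `fingerprint` alert key, and the `inc-N` id format are assumptions. Only the `inj-` prefix check and the `(incident_id, should_dispatch)` return pair come from the description and diagram above.

```python
import itertools
from dataclasses import dataclass, field


@dataclass
class Correlator:
    """Hypothetical correlator: merge by fingerprint unless the alert is UI-injected."""

    active: dict = field(default_factory=dict)  # fingerprint -> incident_id
    _seq: itertools.count = field(default_factory=lambda: itertools.count(1))

    def ingest(self, alert: dict) -> tuple[str, bool]:
        """Return (incident_id, should_dispatch) for an incoming alert."""
        correlation_id = alert.get("correlation_id", "")
        if correlation_id.startswith("inj-"):
            # UI injects always open a fresh incident, even if the
            # fingerprint matches one that is still active.
            return f"inc-{next(self._seq)}", True

        fingerprint = alert["fingerprint"]
        if fingerprint in self.active:
            # Duplicate Alertmanager alert: merge into the existing incident.
            return self.active[fingerprint], False

        incident_id = f"inc-{next(self._seq)}"
        self.active[fingerprint] = incident_id
        return incident_id, True
```

With this rule, two back-to-back injects that share a fingerprint each get their own incident, while repeated Alertmanager alerts for the same fingerprint are still deduplicated.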

	Circuit breaker semantics: Only `system_error` and `agent_error` outcomes count toward the consecutive-failure threshold. Designed outcomes such as `approval_rejected`, `manual_runbook`, and `approval_timeout` do not trip the breaker, so judges can reject remediation freely without locking the system.
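A minimal sketch of these semantics is shown below. The `start_incident` and `finish_incident(resolved, reason)` names mirror the diagram, but the threshold of 3, the reset-on-success behavior, and the choice to leave the counter untouched on designed outcomes are assumptions, not confirmed details of the AtlasOps breaker.

```python
class CircuitBreaker:
    """Hypothetical breaker that counts only genuine failures."""

    # Outcomes that count toward the consecutive-failure threshold.
    FAILURE_REASONS = frozenset({"system_error", "agent_error"})

    def __init__(self, threshold: int = 3):  # threshold value is an assumption
        self.threshold = threshold
        self.consecutive_failures = 0

    @property
    def is_open(self) -> bool:
        return self.consecutive_failures >= self.threshold

    def start_incident(self) -> None:
        if self.is_open:
            raise RuntimeError("circuit breaker open: refusing new incident")

    def finish_incident(self, resolved: bool, reason: str) -> None:
        if resolved:
            # Assumed: a successful run ends the failure streak.
            self.consecutive_failures = 0
        elif reason in self.FAILURE_REASONS:
            self.consecutive_failures += 1
        # Designed outcomes (approval_rejected, manual_runbook,
        # approval_timeout) fall through and leave the counter unchanged.

    def reset(self) -> None:
        self.consecutive_failures = 0
```

Keeping the failure reasons in an explicit allow-list means any new designed outcome is harmless by default; it has to be added to `FAILURE_REASONS` deliberately before it can trip the breaker.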

	HTTP retry on LLM calls: The coordinator retries `chat/completions` on HTTP 429 (HF Router rate limit) and on 5xx responses with exponential backoff, so transient inference failures do not fail an entire scenario.
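The retry policy can be sketched independently of any HTTP client. In this sketch `send` is a hypothetical callable that performs one `chat/completions` request and returns `(status_code, body)`; the attempt count and delay values are assumptions, since the source does not state them.

```python
import time
from typing import Any, Callable, Tuple


def is_retryable(status: int) -> bool:
    """Retry on 429 (rate limited) and on any 5xx server error."""
    return status == 429 or 500 <= status <= 599


def call_with_retry(
    send: Callable[[], Tuple[int, Any]],
    max_attempts: int = 4,      # assumed value
    base_delay: float = 1.0,    # assumed value, in seconds
    max_delay: float = 30.0,    # assumed cap, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, Any]:
    """Call send() until it returns a non-retryable status or attempts run out."""
    status, body = send()
    for attempt in range(1, max_attempts):
        if not is_retryable(status):
            break
        # Exponential backoff: base, 2*base, 4*base, ... capped at max_delay.
        sleep(min(max_delay, base_delay * 2 ** (attempt - 1)))
        status, body = send()
    return status, body
```

Passing `sleep` as a parameter keeps the function testable without real waiting. The last response is returned even when it is still a 429 or 5xx, so the caller decides how to record the failure.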

	`POST /reset` clears everything: It resets chaos manifests, circuit breaker state, and correlator incident tracking, which makes the UI Reset button a true panic switch for demos.
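The three-part reset can be pictured as one function that clears each piece of state in turn. Everything named here (`DemoState`, its fields, `reset_all`, and the `delete_manifest` hook) is hypothetical and stands in for whatever `app.py` actually holds; the sketch only shows that a single call should leave no chaos manifests, no breaker failures, and no tracked incidents behind.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class DemoState:
    """Hypothetical container for the state that /reset must clear."""

    applied_manifests: list = field(default_factory=list)  # chaos manifests in effect
    consecutive_failures: int = 0                           # circuit breaker counter
    active_incidents: dict = field(default_factory=dict)    # correlator tracking


def reset_all(
    state: DemoState,
    delete_manifest: Callable[[str], None] = lambda name: None,
) -> dict:
    """Undo chaos, close the breaker, and forget tracked incidents."""
    removed = list(state.applied_manifests)
    for name in removed:
        # In a real deployment this hook would remove the manifest from the cluster.
        delete_manifest(name)
    state.applied_manifests.clear()

    state.consecutive_failures = 0

    cleared = len(state.active_incidents)
    state.active_incidents.clear()

    return {"manifests_removed": removed, "incidents_cleared": cleared}
```

Returning a small summary gives the UI something concrete to display after the Reset button is pressed.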