File size: 8,767 Bytes
4a77231
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
# Hugging Face Spaces β€” wired 7B agents + 72B judge

## Hackathon Space URL (avoid 404)

The LabLab org Space slug uses a **hyphen**: **`atlas-ops`**, not `atlasops`.

| What | URL |
|------|-----|
| Space repo (clone / git push) | `https://huggingface.co/spaces/lablab-ai-amd-developer-hackathon/atlas-ops` |
| Embedded app (typical) | `https://lablab-ai-amd-developer-hackathon-atlas-ops.hf.space` |
| Health check | Same host as app + `/health` (not the repo `.git` URL) |

If you **duplicate or rename** the Space, replace `atlas-ops` everywhere below with your actual slug.

**Git remote example**

```bash
git remote remove lablab 2>/dev/null
git remote add lablab https://huggingface.co/spaces/lablab-ai-amd-developer-hackathon/atlas-ops
git push lablab main
```

**Secrets:** set **`ATLASOPS_PUBLIC_BASE_URL`** to the **app** URL (no trailing slash), e.g. `https://lablab-ai-amd-developer-hackathon-atlas-ops.hf.space`, so approval / Discord hints use the Space the judges open β€” not localhost and not an old slug.

---

AtlasOps speaks **two** OpenAI-compatible HTTP endpoints:

| Role | Env vars | Typical model id |
|------|-----------|-------------------|
| **Incident agents** (triage→comms) | `VLLM_BASE`, `AGENT_MODEL`, token | Your merged 7B on Hub **or** `Qwen/Qwen2.5-7B-Instruct` |
| **Judge** (scores responses; benchmarks + optional live ribbon) | `JUDGE_URL`, `JUDGE_MODEL`, token | Smaller HF model if 72B is blocked on quota |

## One-switch setup on the Space root

Configure **Space β†’ Settings β†’ Variables and secrets**:

1. **`HF_TOKEN`** β€” your HF access token (**read** plus **Inference** / **fine-grained Inference** permission if you use Router). **This is the minimum to un-block the agent strip** on a fresh Space: the app now **auto-detects** a Hugging Face Space container and, if `VLLM_BASE` / `JUDGE_URL` still look like `localhost`, **replaces them** with the HF Inference Router. You do **not** have to add `ATLASOPS_USE_HF_INFERENCE=1` for that rescue (but you can set it explicitly for clarity).
2. **`ATLASOPS_USE_HF_INFERENCE`** = `1` β€” optional if auto-routing already applied; set manually if you want the β€œHF pack” on every environment.

`config/hf_space_env.py` copies `HF_TOKEN` into `LLM_API_KEY` and `JUDGE_API_KEY`, and overwrites **loopback** `VLLM_BASE` / `JUDGE_URL` with **`https://router.huggingface.co/v1`**. If you **already** set a **public** `VLLM_BASE` to your MI300X, that URL is **left alone**.

**Opt out of auto-routing** (use only your self-hosted URL, even on a Space): `ATLASOPS_AUTO_HF_INFERENCE=0` or set a non-loopback `VLLM_BASE` first.

Optional override:

```
HF_INFERENCE_BASE=https://router.huggingface.co/v1
```

## Model IDs agents will call

Required:

```
AGENT_MODEL=<your-namespace/your-atlasops-merged-7b>
JUDGE_MODEL=Qwen/Qwen2.5-72B-Instruct-AWQ
```

(or any Hub model Router actually serves under your billing tier)

72B AWQ often needs a **paid** Inference allotment or **dedicated Inference Endpoint**.  
If Router returns 429/403 on 72B, set for example **`JUDGE_MODEL=Qwen/Qwen2.5-32B-Instruct`** temporarily β€” AtlasOps keeps working.

## Putting your GRPO weights on Hugging Face (7B)

The coordinator sends **only** a `model` string (no silent LoRA layer). Serving options:

1. **Merge LoRA locally** into the base checkpoint, upload the merged weights to `your-org/atlasops-7b-grpo`, set `AGENT_MODEL` to that repo (see `training/merge_lora_for_hub.py` after `pip install -e ".[train]"`).
2. **Self-hosted vLLM + `--enable-lora`** on AMD hardware (not HF Space CPU) β€” would require coordinator changes to attach LoRA per request unless you bake merged weights yourself.

For most hackathon demos **merged Hub model + Router** is the least painful.

## Live judge inside the Ops UI

When `ATLASOPS_USE_HF_INFERENCE=1`, the coordinator fires **one judge call after comms**, and the timeline prints the score (`judge_trajectory` tool line).

Explicit flags:

```
ATLASOPS_LIVE_JUDGE=1   # force ON
ATLASOPS_LIVE_JUDGE=0   # force OFF even with HF inference pack
```

Local MI300x with `JUDGE_URL=http://localhost:8001/v1`: keep **`ATLASOPS_USE_HF_INFERENCE` unset**, set **`ATLASOPS_LIVE_JUDGE=1`** if you still want judge lines in Grafana.

## Health checks after deploy

- `GET https://<space>/health` β€” agent + judge **model names** + bases (inspect JSON).
- `GET https://<space>/api/health` β€” coordinator copy with `live_judge` Boolean.

Neither endpoint prints raw tokens.

After redeploy with the bundled UI, the **footer bar** polls `/health` and shows **`Discord webhook βœ“`** or **`βœ— add DISCORD_WEBHOOK_URL`** so you know whether `DISCORD_WEBHOOK_URL` reached the Space (no webhook URL is ever rendered).

## Summary checklist

```
HF_TOKEN=<secret>
ATLASOPS_USE_HF_INFERENCE=1
AGENT_MODEL=your-org/your-trained-merged-7b
JUDGE_MODEL=Qwen/Qwen2.5-72B-Instruct-AWQ    # or a smaller HF model Router allows
BACKEND=openai                                 # optional; bootstrap sets default when ATLASOPS_USE_HF_INFERENCE=1
```

Redeploy the Space; trigger chaos from the sidebar and confirm timeline shows both tool calls (`judge` / `judge_trajectory`) and agent turns.

## Space has no kubeconfig (Pod Kill returns 500)

The UI’s **Inject** endpoint runs `kubectl apply` on chaos manifests. Typical HF Space containers **do not** have credentials to your GKE API server, so you will see **`POST /inject` β†’ 500** in logs and the coordinator never starts.

Set this **Space variable** so inject **skips** kubectl but still schedules the incident pipeline (reads **live** Alertmanager after a short delay):

```
ATLASOPS_SKIP_KUBECTL_INJECT=1
```

Real fault injection still requires a reachable cluster from **somewhere** that has kubeconfig (CI, laptop, or another service). For a public demo, you can rely on **already-firing** alerts in Alertmanager or webhook-driven incidents.

---

## Discord (why nothing appears in your server)

AtlasOps does **not** use a Discord β€œbot” that shows online in the member list. It posts through an **Incoming Webhook** URL.

1. Discord β†’ your server β†’ **Server Settings** β†’ **Integrations** β†’ **Webhooks** β†’ **New Webhook** β†’ pick `#general` (or a channel) β†’ copy **Webhook URL**.
2. In HF Space **Secrets**, add **`DISCORD_WEBHOOK_URL`** = that URL (same as `agents/tools/comms.py` expects).
3. **Every scenario/incident pipeline** triggers **one automatic Discord embed** when `DISCORD_WEBHOOK_URL` is set (coordinator fires this in `finally` after each `handle_incident`, including demo injects β€” independent of whether the LLM calls `slack_post_update`). Approval + closure + this ping can **burst** Discord’s webhook quota (**HTTP 429**); delivery now **retries** with `Retry-After`. Disable noise with **`ATLASOPS_DISCORD_EVERY_RUN_PING=0`**. Extra detail still appears when **comms** uses `slack_post_update`. If Discord is still empty after a run finishes, check Space logs and **`VLLM_BASE`** (stalled pipelines never reach `finally`).

Optional: **`SLACK_WEBHOOK_URL`** for Slack in parallel; both can be set.

---

## Final hour β€” submission checklist (do in order)

1. **Right Space URL** β€” Use **`lablab-ai-amd-developer-hackathon/atlas-ops`** (hyphen). Wrong slug β†’ **404**. Push: `git push https://huggingface.co/spaces/lablab-ai-amd-developer-hackathon/atlas-ops main` (or add `lablab` remote; see section at top).

2. **Build green** β€” Space β†’ **Logs** β†’ build finished, app listening on **7860**.

3. **Secrets sanity**
   - **`VLLM_BASE`** = `http://<MI300X_PUBLIC_IP>:8000/v1` β€” **not** `localhost`.
   - **`ATLASOPS_SKIP_KUBECTL_INJECT=1`** on Space (no kubeconfig).
   - **`ATLASOPS_PUBLIC_BASE_URL`** = `https://lablab-ai-amd-developer-hackathon-atlas-ops.hf.space` (no trailing slash; fix if your slug differs).
   - **`DISCORD_WEBHOOK_URL`** set; rotate if it was ever pasted in chat.
   - **Do not set `ATLASOPS_API_KEY`** on the public demo Space unless the UI is updated to send `X-AtlasOps-Key` β€” otherwise **`POST /inject` returns 401**.

4. **Smoke test in browser (90 seconds)**  
   Open app β†’ DevTools console:
   - `fetch('/health').then(r=>r.json()).then(console.log)` β†’ `status: ok`, `agent_base` not localhost.  
   - Click **Pod Kill** (or one historical replay) β†’ timeline fills; **Approve** if shown; check Discord.

5. **Judge story (one sentence each)**  
   Real GKE + real metrics; four agents; **human approval** (UI buttons + Discord notice before remediation); MI300X for **SFT + online GRPO**; skip-kubectl only for **fault inject**, agents still use **live tools** elsewhere.

6. **GitHub + slides** β€” `README` / **`docs/slides.md`** links match the Space you submit; export PDF from Marp if required.