Hugging Face Spaces: wired 7B agents + 72B judge
Hackathon Space URL (avoid 404)
The LabLab org Space slug uses a hyphen: atlas-ops, not atlasops.
| What | URL |
|---|---|
| Space repo (clone / git push) | https://huggingface.co/spaces/lablab-ai-amd-developer-hackathon/atlas-ops |
| Embedded app (typical) | https://lablab-ai-amd-developer-hackathon-atlas-ops.hf.space |
| Health check | Same host as app + /health (not the repo .git URL) |
If you duplicate or rename the Space, replace atlas-ops everywhere below with your actual slug.
Git remote example
git remote remove lablab 2>/dev/null
git remote add lablab https://huggingface.co/spaces/lablab-ai-amd-developer-hackathon/atlas-ops
git push lablab main
Secrets: set ATLASOPS_PUBLIC_BASE_URL to the app URL (no trailing slash), e.g. https://lablab-ai-amd-developer-hackathon-atlas-ops.hf.space, so that approval and Discord hints link to the Space the judges open, not localhost and not an old slug.
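The "no trailing slash, not localhost" rule can be checked before you paste the value into Space secrets. The helper below is a minimal sketch; `normalize_public_base_url` is a hypothetical name and not part of AtlasOps.

```python
from urllib.parse import urlsplit

# Hosts that would make approval / Discord links point at the local machine.
LOOPBACK_HOSTS = ("localhost", "127.0.0.1", "::1")


def normalize_public_base_url(raw: str) -> str:
    """Strip whitespace and trailing slashes; reject non-http(s) and loopback URLs."""
    url = raw.strip().rstrip("/")
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError(f"not an absolute http(s) URL: {raw!r}")
    if parts.hostname in LOOPBACK_HOSTS:
        raise ValueError("ATLASOPS_PUBLIC_BASE_URL must not point at localhost")
    return url
```

Run it on the value you intend to store; a ValueError means the URL would produce broken links for judges.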
AtlasOps talks to two OpenAI-compatible HTTP endpoints:
| Role | Env vars | Typical model id |
|---|---|---|
| Incident agents (triage to comms) | VLLM_BASE, AGENT_MODEL, token | Your merged 7B on Hub or Qwen/Qwen2.5-7B-Instruct |
| Judge (scores responses; benchmarks + optional live ribbon) | JUDGE_URL, JUDGE_MODEL, token | Qwen/Qwen2.5-72B-Instruct-AWQ, or a smaller HF model if 72B is blocked on quota |
One-switch setup on the Space root
Configure Space → Settings → Variables and secrets:
- HF_TOKEN: your HF access token (read, plus Inference / fine-grained Inference permission if you use Router). This is the minimum to unblock the agent strip on a fresh Space: the app now auto-detects a Hugging Face Space container and, if VLLM_BASE / JUDGE_URL still look like localhost, replaces them with the HF Inference Router. You do not have to add ATLASOPS_USE_HF_INFERENCE=1 for that rescue (but you can set it explicitly for clarity).
- ATLASOPS_USE_HF_INFERENCE=1: optional if auto-routing already applied; set it manually if you want the "HF pack" on every environment.
config/hf_space_env.py copies HF_TOKEN into LLM_API_KEY and JUDGE_API_KEY, and overwrites loopback VLLM_BASE / JUDGE_URL with https://router.huggingface.co/v1. If you already set a public VLLM_BASE to your MI300X, that URL is left alone.
Opt out of auto-routing (use only your self-hosted URL, even on a Space): ATLASOPS_AUTO_HF_INFERENCE=0 or set a non-loopback VLLM_BASE first.
Optional override:
HF_INFERENCE_BASE=https://router.huggingface.co/v1
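The routing rules above can be summarised in a few lines of Python. This is a sketch of the described behaviour only, not the contents of config/hf_space_env.py: the function names are hypothetical, detecting a Space through the SPACE_ID variable is an assumption, and so is treating an unset URL as loopback.

```python
from urllib.parse import urlsplit

DEFAULT_ROUTER = "https://router.huggingface.co/v1"
LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "::1"}


def is_loopback(url: str) -> bool:
    """True when the URL is unset (assumption) or points at the local machine."""
    if not url:
        return True
    return (urlsplit(url).hostname or "") in LOOPBACK_HOSTS


def apply_hf_routing(env: dict) -> dict:
    """Return a copy of env with Router URLs and keys filled in where the rules allow."""
    out = dict(env)
    on_space = bool(out.get("SPACE_ID"))  # assumption: how a Space container is detected
    forced = out.get("ATLASOPS_USE_HF_INFERENCE") == "1"
    opted_out = out.get("ATLASOPS_AUTO_HF_INFERENCE") == "0"
    active = forced or (on_space and not opted_out)
    token = out.get("HF_TOKEN")
    if not active or not token:
        return out
    router = out.get("HF_INFERENCE_BASE", DEFAULT_ROUTER)
    # Copy the token into both client keys.
    out["LLM_API_KEY"] = token
    out["JUDGE_API_KEY"] = token
    # Only loopback endpoints are overwritten; a public VLLM_BASE is left alone.
    for key in ("VLLM_BASE", "JUDGE_URL"):
        if is_loopback(out.get(key, "")):
            out[key] = router
    return out
```

Passing a plain dict instead of reading os.environ directly keeps the rules easy to test; call it as apply_hf_routing(dict(os.environ)) to preview what a given Space configuration would resolve to.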
Model IDs agents will call
Required:
AGENT_MODEL=<your-namespace/your-atlasops-merged-7b>
JUDGE_MODEL=Qwen/Qwen2.5-72B-Instruct-AWQ
(or any Hub model Router actually serves under your billing tier)
72B AWQ often needs a paid Inference allotment or dedicated Inference Endpoint.
If Router returns 429/403 on 72B, set for example JUDGE_MODEL=Qwen/Qwen2.5-32B-Instruct temporarily; AtlasOps keeps working.
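If you would rather decide the judge model before a demo than discover a 429 during it, the fallback can be scripted. The sketch below is a hypothetical pre-flight helper, not something AtlasOps does itself; `probe` stands in for whatever function you write to send one small chat completion request and return the HTTP status code.

```python
from typing import Callable, Sequence


def first_available_model(
    candidates: Sequence[str],
    probe: Callable[[str], int],
) -> str:
    """Return the first model id whose probe succeeds, skipping quota/permission errors."""
    for model in candidates:
        status = probe(model)
        if status == 200:
            return model
        if status not in (403, 429):
            # Anything else (404, 5xx, ...) is a different problem; surface it.
            raise RuntimeError(f"unexpected status {status} for {model}")
    raise RuntimeError("no candidate judge model is available on this tier")
```

List the candidates from largest to smallest, e.g. Qwen/Qwen2.5-72B-Instruct-AWQ followed by Qwen/Qwen2.5-32B-Instruct, and set JUDGE_MODEL to the returned id.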
Putting your GRPO weights on Hugging Face (7B)
The coordinator sends only a model string (no silent LoRA layer). Serving options:
- Merge LoRA locally into the base checkpoint, upload the merged weights to your-org/atlasops-7b-grpo, and set AGENT_MODEL to that repo (see training/merge_lora_for_hub.py after pip install -e ".[train]").
- Self-hosted vLLM + --enable-lora on AMD hardware (not HF Space CPU): would require coordinator changes to attach LoRA per request unless you bake merged weights yourself.
For most hackathon demos, a merged Hub model + Router is the least painful option.
Live judge inside the Ops UI
When ATLASOPS_USE_HF_INFERENCE=1, the coordinator fires one judge call after comms, and the timeline prints the score (judge_trajectory tool line).
Explicit flags:
ATLASOPS_LIVE_JUDGE=1 # force ON
ATLASOPS_LIVE_JUDGE=0 # force OFF even with HF inference pack
Local MI300X with JUDGE_URL=http://localhost:8001/v1: keep ATLASOPS_USE_HF_INFERENCE unset, and set ATLASOPS_LIVE_JUDGE=1 if you still want judge lines in Grafana.
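The precedence between the two flags can be written down as a small function. This is a sketch of the rules stated in this section, with a hypothetical function name; the real coordinator may consult additional state, such as whether auto-routing was applied.

```python
from typing import Mapping


def live_judge_enabled(env: Mapping[str, str]) -> bool:
    """Explicit ATLASOPS_LIVE_JUDGE wins; otherwise follow the HF inference pack."""
    explicit = env.get("ATLASOPS_LIVE_JUDGE")
    if explicit == "1":
        return True   # force ON, e.g. local MI300X with a localhost judge
    if explicit == "0":
        return False  # force OFF even with the HF inference pack
    return env.get("ATLASOPS_USE_HF_INFERENCE") == "1"
```

For example, {"ATLASOPS_USE_HF_INFERENCE": "1", "ATLASOPS_LIVE_JUDGE": "0"} resolves to off, while an environment with only ATLASOPS_LIVE_JUDGE=1 resolves to on.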
Health checks after deploy
- GET https://<space>/health: agent + judge model names + bases (inspect JSON).
- GET https://<space>/api/health: coordinator copy with live_judge Boolean.
Neither endpoint prints raw tokens.
After redeploy with the bundled UI, the footer bar polls /health and shows either a "Discord webhook" confirmation or an "add DISCORD_WEBHOOK_URL" prompt, so you know whether DISCORD_WEBHOOK_URL reached the Space (no webhook URL is ever rendered).
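The same health check can be run from a script instead of the browser. The sketch below assumes the /health JSON exposes the status and agent_base fields used in the smoke test later in this guide; the function names are hypothetical and any other fields are ignored.

```python
import json
from urllib.parse import urlsplit
from urllib.request import urlopen

LOOPBACK_HOSTS = ("localhost", "127.0.0.1", "::1")


def check_health(payload: dict) -> list:
    """Return a list of human-readable problems; an empty list means the Space looks sane."""
    problems = []
    if payload.get("status") != "ok":
        problems.append(f"status is {payload.get('status')!r}, expected 'ok'")
    host = urlsplit(str(payload.get("agent_base", ""))).hostname
    if host is None or host in LOOPBACK_HOSTS:
        problems.append("agent_base is missing or points at localhost")
    return problems


def fetch_health(base_url: str, timeout: float = 10.0) -> dict:
    """GET <base_url>/health and parse the JSON body."""
    with urlopen(f"{base_url.rstrip('/')}/health", timeout=timeout) as resp:
        return json.load(resp)
```

A typical use is check_health(fetch_health("https://lablab-ai-amd-developer-hackathon-atlas-ops.hf.space")), printing any returned problems after each redeploy.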
Summary checklist
HF_TOKEN=<secret>
ATLASOPS_USE_HF_INFERENCE=1
AGENT_MODEL=your-org/your-trained-merged-7b
JUDGE_MODEL=Qwen/Qwen2.5-72B-Instruct-AWQ # or a smaller HF model Router allows
BACKEND=openai # optional; bootstrap sets default when ATLASOPS_USE_HF_INFERENCE=1
Redeploy the Space, trigger chaos from the sidebar, and confirm the timeline shows both tool calls (judge / judge_trajectory) and agent turns.
Space has no kubeconfig (Pod Kill returns 500)
The UI's Inject endpoint runs kubectl apply on chaos manifests. Typical HF Space containers do not have credentials to your GKE API server, so you will see POST /inject → 500 in logs and the coordinator never starts.
Set this Space variable so inject skips kubectl but still schedules the incident pipeline (reads live Alertmanager after a short delay):
ATLASOPS_SKIP_KUBECTL_INJECT=1
Real fault injection still requires a reachable cluster from somewhere that has kubeconfig (CI, laptop, or another service). For a public demo, you can rely on already-firing alerts in Alertmanager or webhook-driven incidents.
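The effect of the flag can be pictured as a branch in the inject handler. This is an illustrative sketch only: `handle_inject`, its parameters, and the returned mode strings are hypothetical, and the real endpoint may differ in structure.

```python
from typing import Callable, Mapping, Sequence


def handle_inject(
    manifest_path: str,
    env: Mapping[str, str],
    run_kubectl: Callable[[Sequence[str]], None],
    schedule_pipeline: Callable[[], None],
) -> str:
    """Apply the chaos manifest unless skipping is enabled, then schedule the pipeline."""
    if env.get("ATLASOPS_SKIP_KUBECTL_INJECT") == "1":
        mode = "skipped-kubectl"  # no cluster credentials needed on the Space
    else:
        # Without a kubeconfig this call fails, which is what surfaces as a 500.
        run_kubectl(["kubectl", "apply", "-f", manifest_path])
        mode = "applied"
    # Either way the incident pipeline is scheduled and reads Alertmanager.
    schedule_pipeline()
    return mode
```

Injecting `run_kubectl` and `schedule_pipeline` as callables keeps the branch testable without a cluster.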
Discord (why nothing appears in your server)
AtlasOps does not use a Discord "bot" that shows online in the member list. It posts through an Incoming Webhook URL.
- Discord → your server → Server Settings → Integrations → Webhooks → New Webhook → pick #general (or a channel) → copy Webhook URL.
- In HF Space Secrets, add DISCORD_WEBHOOK_URL = that URL (same as agents/tools/comms.py expects).
- Every scenario/incident pipeline triggers one automatic Discord embed when DISCORD_WEBHOOK_URL is set (coordinator fires this in finally after each handle_incident, including demo injects, independent of whether the LLM calls slack_post_update). Approval + closure + this ping can burst Discord's webhook quota (HTTP 429); delivery now retries with Retry-After. Disable noise with ATLASOPS_DISCORD_EVERY_RUN_PING=0. Extra detail still appears when comms uses slack_post_update. If Discord is still empty after a run finishes, check Space logs and VLLM_BASE (stalled pipelines never reach finally).
Optional: SLACK_WEBHOOK_URL for Slack in parallel; both can be set.
Final hour: submission checklist (do in order)
- Right Space URL: Use lablab-ai-amd-developer-hackathon/atlas-ops (hyphen). Wrong slug → 404. Push: git push https://huggingface.co/spaces/lablab-ai-amd-developer-hackathon/atlas-ops main (or add the lablab remote; see section at top).
- Build green: Space → Logs → build finished, app listening on 7860.
- Secrets sanity
  - VLLM_BASE=http://<MI300X_PUBLIC_IP>:8000/v1, not localhost.
  - ATLASOPS_SKIP_KUBECTL_INJECT=1 on Space (no kubeconfig).
  - ATLASOPS_PUBLIC_BASE_URL=https://lablab-ai-amd-developer-hackathon-atlas-ops.hf.space (no trailing slash; fix if your slug differs).
  - DISCORD_WEBHOOK_URL set; rotate if it was ever pasted in chat.
  - Do not set ATLASOPS_API_KEY on the public demo Space unless the UI is updated to send X-AtlasOps-Key; otherwise POST /inject returns 401.
- Smoke test in browser (90 seconds)
  - Open app → DevTools console: fetch('/health').then(r=>r.json()).then(console.log) → status: ok, agent_base not localhost.
  - Click Pod Kill (or one historical replay) → timeline fills; Approve if shown; check Discord.
- Judge story (one sentence each): Real GKE + real metrics; four agents; human approval (UI buttons + Discord notice before remediation); MI300X for SFT + online GRPO; skip-kubectl only for fault inject, agents still use live tools elsewhere.
- GitHub + slides: README / docs/slides.md links match the Space you submit; export PDF from Marp if required.