Harikishanth R

Hugging Face Spaces: wired 7B agents + 72B judge

Hackathon Space URL (avoid 404)

The LabLab org Space slug uses a hyphen: atlas-ops, not atlasops.

| What | URL |
| --- | --- |
| Space repo (clone / git push) | https://huggingface.co/spaces/lablab-ai-amd-developer-hackathon/atlas-ops |
| Embedded app (typical) | https://lablab-ai-amd-developer-hackathon-atlas-ops.hf.space |
| Health check | Same host as the app + /health (not the repo .git URL) |

If you duplicate or rename the Space, replace atlas-ops everywhere below with your actual slug.

Git remote example

git remote remove lablab 2>/dev/null
git remote add lablab https://huggingface.co/spaces/lablab-ai-amd-developer-hackathon/atlas-ops
git push lablab main

Secrets: set ATLASOPS_PUBLIC_BASE_URL to the app URL (no trailing slash), e.g. https://lablab-ai-amd-developer-hackathon-atlas-ops.hf.space, so approval / Discord hints use the Space the judges open, not localhost and not an old slug.
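The rule above (no trailing slash, never localhost) can be checked before you paste the value into the Space secrets. The helper below is a minimal sketch; the name `normalize_public_base_url` and the set of loopback hosts are assumptions for illustration and are not part of AtlasOps.

```python
from urllib.parse import urlsplit

# Hosts treated as "not public" in this sketch (assumption, extend as needed).
LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "0.0.0.0", "::1"}


def normalize_public_base_url(raw: str) -> str:
    """Strip trailing slashes; raise ValueError for loopback or non-HTTP(S) URLs."""
    url = raw.strip().rstrip("/")
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError(f"not an absolute HTTP(S) URL: {raw!r}")
    if parts.hostname in LOOPBACK_HOSTS:
        raise ValueError(f"public base URL must not point at loopback: {raw!r}")
    return url
```

Calling it with the app URL plus a trailing slash returns the URL without the slash; calling it with http://localhost:7860 raises ValueError.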


AtlasOps talks to two OpenAI-compatible HTTP endpoints:

| Role | Env vars | Typical model id |
| --- | --- | --- |
| Incident agents (triage→comms) | VLLM_BASE, AGENT_MODEL, token | Your merged 7B on Hub or Qwen/Qwen2.5-7B-Instruct |
| Judge (scores responses; benchmarks + optional live ribbon) | JUDGE_URL, JUDGE_MODEL, token | Smaller HF model if 72B is blocked on quota |

One-switch setup on the Space root

Configure Space → Settings → Variables and secrets:

  1. HF_TOKEN: your HF access token (read, plus Inference / fine-grained Inference permission if you use Router). This is the minimum to unblock the agent strip on a fresh Space: the app auto-detects a Hugging Face Space container and, if VLLM_BASE / JUDGE_URL still look like localhost, replaces them with the HF Inference Router. You do not have to add ATLASOPS_USE_HF_INFERENCE=1 for that rescue (but you can set it explicitly for clarity).
  2. ATLASOPS_USE_HF_INFERENCE = 1: optional if auto-routing already applied; set it manually if you want the "HF pack" on every environment.

config/hf_space_env.py copies HF_TOKEN into LLM_API_KEY and JUDGE_API_KEY, and overwrites loopback VLLM_BASE / JUDGE_URL with https://router.huggingface.co/v1. If you already set a public VLLM_BASE to your MI300X, that URL is left alone.

Opt out of auto-routing (use only your self-hosted URL, even on a Space): ATLASOPS_AUTO_HF_INFERENCE=0 or set a non-loopback VLLM_BASE first.

Optional override:

HF_INFERENCE_BASE=https://router.huggingface.co/v1
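The routing rules described above can be summarized in a few lines. This is a sketch under stated assumptions, not the contents of config/hf_space_env.py: the function names are hypothetical, Space detection via the SPACE_ID variable is an assumption about how the app recognizes a Space container, and an unset URL is treated like a loopback URL.

```python
from urllib.parse import urlsplit

DEFAULT_ROUTER = "https://router.huggingface.co/v1"
LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "0.0.0.0", "::1"}


def is_loopback(url: str) -> bool:
    """True when url is empty (assumption) or its host is a loopback address."""
    if not url:
        return True
    return urlsplit(url).hostname in LOOPBACK_HOSTS


def apply_hf_routing(env: dict) -> dict:
    """Return a copy of env with the routing rules from the text applied."""
    out = dict(env)
    explicit = out.get("ATLASOPS_USE_HF_INFERENCE") == "1"
    # Assumption: a Space container is recognized by a non-empty SPACE_ID.
    on_space = bool(out.get("SPACE_ID"))
    auto = on_space and out.get("ATLASOPS_AUTO_HF_INFERENCE", "1") != "0"
    token = out.get("HF_TOKEN")
    if not token or not (explicit or auto):
        return out  # nothing to do: no token, or routing not requested
    router = out.get("HF_INFERENCE_BASE", DEFAULT_ROUTER)
    for key in ("VLLM_BASE", "JUDGE_URL"):
        # Public (non-loopback) URLs are left alone.
        if is_loopback(out.get(key, "")):
            out[key] = router
    out["LLM_API_KEY"] = token
    out["JUDGE_API_KEY"] = token
    return out
```

With a loopback VLLM_BASE on a Space, the result points at the Router; with a public VLLM_BASE or ATLASOPS_AUTO_HF_INFERENCE=0 (and no explicit ATLASOPS_USE_HF_INFERENCE=1), the original URL is preserved.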

Model IDs agents will call

Required:

AGENT_MODEL=<your-namespace/your-atlasops-merged-7b>
JUDGE_MODEL=Qwen/Qwen2.5-72B-Instruct-AWQ

(or any Hub model Router actually serves under your billing tier)

72B AWQ often needs a paid Inference allotment or dedicated Inference Endpoint.
If Router returns 429/403 on 72B, temporarily set, for example, JUDGE_MODEL=Qwen/Qwen2.5-32B-Instruct; AtlasOps keeps working.
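The fallback advice above is a manual step, but the decision can be written down as a small helper when you probe models from a script. This is an illustrative sketch: `next_judge_model` and the ordering of `JUDGE_FALLBACKS` are hypothetical, and nothing here implies AtlasOps switches models automatically.

```python
# Candidate judge models to try, largest first (ordering is an assumption).
JUDGE_FALLBACKS = [
    "Qwen/Qwen2.5-32B-Instruct",
    "Qwen/Qwen2.5-7B-Instruct",
]


def next_judge_model(current: str, status: int, fallbacks: list[str]) -> str:
    """Pick the model to try next after a Router response with HTTP `status`.

    Only 403 (not permitted) and 429 (quota / rate limit) trigger a step down;
    any other status keeps the current model.
    """
    if status not in (403, 429):
        return current
    # Step to the entry after `current`, or to the first entry if `current`
    # is not in the fallback list at all.
    idx = fallbacks.index(current) + 1 if current in fallbacks else 0
    return fallbacks[idx] if idx < len(fallbacks) else current
```

For example, a 429 on Qwen/Qwen2.5-72B-Instruct-AWQ yields Qwen/Qwen2.5-32B-Instruct as the next value to set in JUDGE_MODEL.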

Putting your GRPO weights on Hugging Face (7B)

The coordinator sends only a model string (no silent LoRA layer). Serving options:

  1. Merge LoRA locally into the base checkpoint, upload the merged weights to your-org/atlasops-7b-grpo, set AGENT_MODEL to that repo (see training/merge_lora_for_hub.py after pip install -e ".[train]").
  2. Self-hosted vLLM + --enable-lora on AMD hardware (not HF Space CPU). This would require coordinator changes to attach LoRA per request, unless you bake merged weights yourself.

For most hackathon demos, a merged Hub model + Router is the least painful option.

Live judge inside the Ops UI

When ATLASOPS_USE_HF_INFERENCE=1, the coordinator fires one judge call after comms, and the timeline prints the score (judge_trajectory tool line).

Explicit flags:

ATLASOPS_LIVE_JUDGE=1   # force ON
ATLASOPS_LIVE_JUDGE=0   # force OFF even with HF inference pack

Local MI300X with JUDGE_URL=http://localhost:8001/v1: keep ATLASOPS_USE_HF_INFERENCE unset, and set ATLASOPS_LIVE_JUDGE=1 if you still want judge lines in Grafana.
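The precedence between the two variables above can be read as: an explicit ATLASOPS_LIVE_JUDGE value wins, otherwise the judge follows ATLASOPS_USE_HF_INFERENCE. The function below is a sketch of that reading; `live_judge_enabled` is a hypothetical name, and treating any value other than "0" or "1" as unset is an assumption.

```python
def live_judge_enabled(env: dict) -> bool:
    """Resolve whether the live judge call should run after comms."""
    flag = env.get("ATLASOPS_LIVE_JUDGE")
    if flag == "1":
        return True  # force ON
    if flag == "0":
        return False  # force OFF, even with the HF inference pack
    # No explicit flag: default to the HF inference setting.
    return env.get("ATLASOPS_USE_HF_INFERENCE") == "1"
```

A local MI300X setup with only ATLASOPS_LIVE_JUDGE=1 resolves to True, while a Space with ATLASOPS_USE_HF_INFERENCE=1 and ATLASOPS_LIVE_JUDGE=0 resolves to False.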

Health checks after deploy

  • GET https://<space>/health returns agent + judge model names + bases (inspect JSON).
  • GET https://<space>/api/health returns the coordinator copy with a live_judge Boolean.

Neither endpoint prints raw tokens.

After redeploy with the bundled UI, the footer bar polls /health and shows Discord webhook ✓ or ✗ add DISCORD_WEBHOOK_URL, so you know whether DISCORD_WEBHOOK_URL reached the Space (no webhook URL is ever rendered).
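The health checks in this section can also be scripted. The sketch below fetches /health and flags the two conditions this document mentions elsewhere: status should be ok and agent_base should not be localhost. Only those two field names come from this document; the rest of the JSON shape is not assumed, and the function names are hypothetical.

```python
import json
from urllib.parse import urlsplit
from urllib.request import urlopen

LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "0.0.0.0", "::1"}


def health_problems(payload: dict) -> list[str]:
    """Return human-readable problems found in a /health JSON payload."""
    problems = []
    if payload.get("status") != "ok":
        problems.append(f"status is {payload.get('status')!r}, expected 'ok'")
    host = urlsplit(payload.get("agent_base", "")).hostname
    if host is None or host in LOOPBACK_HOSTS:
        problems.append("agent_base is missing or points at loopback")
    return problems


def fetch_health(app_url: str, timeout: float = 10.0) -> dict:
    """GET <app_url>/health and decode the JSON body."""
    with urlopen(f"{app_url.rstrip('/')}/health", timeout=timeout) as resp:
        return json.load(resp)
```

Usage: `health_problems(fetch_health("https://<space>"))` returns an empty list when both checks pass.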

Summary checklist

HF_TOKEN=<secret>
ATLASOPS_USE_HF_INFERENCE=1
AGENT_MODEL=your-org/your-trained-merged-7b
JUDGE_MODEL=Qwen/Qwen2.5-72B-Instruct-AWQ    # or a smaller HF model Router allows
BACKEND=openai                                 # optional; bootstrap sets default when ATLASOPS_USE_HF_INFERENCE=1

Redeploy the Space, trigger chaos from the sidebar, and confirm the timeline shows both tool calls (judge / judge_trajectory) and agent turns.

Space has no kubeconfig (Pod Kill returns 500)

The UI's Inject endpoint runs kubectl apply on chaos manifests. Typical HF Space containers do not have credentials for your GKE API server, so you will see POST /inject → 500 in logs and the coordinator never starts.

Set this Space variable so inject skips kubectl but still schedules the incident pipeline (reads live Alertmanager after a short delay):

ATLASOPS_SKIP_KUBECTL_INJECT=1

Real fault injection still requires a reachable cluster from somewhere that has kubeconfig (CI, laptop, or another service). For a public demo, you can rely on already-firing alerts in Alertmanager or webhook-driven incidents.


Discord (why nothing appears in your server)

AtlasOps does not use a Discord "bot" that shows online in the member list. It posts through an Incoming Webhook URL.

  1. Discord → your server → Server Settings → Integrations → Webhooks → New Webhook → pick #general (or a channel) → copy Webhook URL.
  2. In HF Space Secrets, add DISCORD_WEBHOOK_URL = that URL (same as agents/tools/comms.py expects).
  3. Every scenario/incident pipeline triggers one automatic Discord embed when DISCORD_WEBHOOK_URL is set (the coordinator fires this in finally after each handle_incident, including demo injects, independent of whether the LLM calls slack_post_update). Approval + closure + this ping can burst Discord's webhook quota (HTTP 429); delivery retries with Retry-After. Disable the noise with ATLASOPS_DISCORD_EVERY_RUN_PING=0. Extra detail still appears when comms uses slack_post_update. If Discord is still empty after a run finishes, check Space logs and VLLM_BASE (stalled pipelines never reach finally).

Optional: SLACK_WEBHOOK_URL for Slack in parallel; both can be set.
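The retry-on-429 behavior mentioned for Discord delivery can be sketched as follows. This is not the code in agents/tools/comms.py: `post_with_retry`, the injectable `send` callable, the one-second default wait, and the attempt limit are all assumptions made so the example stays self-contained and testable without network access.

```python
import time
from typing import Callable, Mapping, Tuple

# A transport returns (HTTP status, response headers).
Response = Tuple[int, Mapping[str, str]]


def post_with_retry(
    send: Callable[[dict], Response],
    payload: dict,
    max_attempts: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Send payload; on HTTP 429 wait for Retry-After seconds and try again.

    Returns the last HTTP status seen (429 if every attempt was rate limited).
    """
    status = 0
    for attempt in range(max_attempts):
        status, headers = send(payload)
        if status != 429:
            return status
        if attempt + 1 < max_attempts:
            # Fall back to 1 second if the header is absent (assumption).
            sleep(float(headers.get("Retry-After", "1")))
    return status
```

In real use, `send` would POST the JSON payload to DISCORD_WEBHOOK_URL and return the response status and headers; injecting it keeps the retry logic independent of any HTTP library.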


Final hour: submission checklist (do in order)

  1. Right Space URL: use lablab-ai-amd-developer-hackathon/atlas-ops (hyphen). Wrong slug → 404. Push: git push https://huggingface.co/spaces/lablab-ai-amd-developer-hackathon/atlas-ops main (or add the lablab remote; see the section at top).

  2. Build green: Space → Logs → build finished, app listening on 7860.

  3. Secrets sanity

    • VLLM_BASE = http://<MI300X_PUBLIC_IP>:8000/v1, not localhost.
    • ATLASOPS_SKIP_KUBECTL_INJECT=1 on Space (no kubeconfig).
    • ATLASOPS_PUBLIC_BASE_URL = https://lablab-ai-amd-developer-hackathon-atlas-ops.hf.space (no trailing slash; fix if your slug differs).
    • DISCORD_WEBHOOK_URL set; rotate if it was ever pasted in chat.
    • Do not set ATLASOPS_API_KEY on the public demo Space unless the UI is updated to send X-AtlasOps-Key; otherwise POST /inject returns 401.
  4. Smoke test in browser (90 seconds)
    Open app β†’ DevTools console:

    • fetch('/health').then(r=>r.json()).then(console.log) → status: ok, agent_base not localhost.
    • Click Pod Kill (or one historical replay) → timeline fills; Approve if shown; check Discord.
  5. Judge story (one sentence each)
    Real GKE + real metrics; four agents; human approval (UI buttons + Discord notice before remediation); MI300X for SFT + online GRPO; skip-kubectl only for fault inject, agents still use live tools elsewhere.

  6. GitHub + slides: README / docs/slides.md links match the Space you submit; export PDF from Marp if required.