# 🔥 FirewatchEnv: Quickstart Guide

> Get from zero to running your first AI SRE agent in under 5 minutes.

---

## What is FirewatchEnv?

FirewatchEnv is an **RL training environment** for autonomous SRE incident response, built for the [Meta PyTorch OpenEnv Hackathon India 2026](https://github.com/meta-pytorch/OpenEnv). Your AI agent acts as an on-call Site Reliability Engineer: it receives simulated microservice telemetry (OTel-compatible metrics, Prometheus-style alerts, log excerpts) and must **diagnose and remediate the root cause** before the SLO error budget runs out.

**Key highlights:**

- Single container, no Kubernetes: runs on 2 vCPUs / 8 GB RAM
- Three difficulty tiers (Easy → Medium → Hard) with adversarial prompt injection in Task 3
- Outcome-only reward function: the agent can't game the grader; it must actually fix the system

---
## Prerequisites

| Tool | Version | Install |
|------|---------|---------|
| **Python** | 3.10+ | [python.org](https://www.python.org/downloads/) |
| **uv** | latest | `pip install uv` or `curl -LsSf https://astral.sh/uv/install.sh \| sh` |
| **Git** | any | [git-scm.com](https://git-scm.com/) |
| **Docker** | latest *(optional: only for containerized runs)* | [docker.com](https://docs.docker.com/get-docker/) |

---
## 1. Clone & Install

```bash
git clone https://huggingface.co/spaces/10doshi12/firewatch-env
cd firewatch-env
```

> **Important:** All commands below should be run from inside the `firewatch_env/` directory, which contains the actual environment code.

```bash
cd firewatch_env
uv sync   # installs all Python dependencies from pyproject.toml + uv.lock
```

This installs:

- `openenv-core[core]` ≥ 0.2.2: FastAPI server + HTTP client types
- `pydantic` ≥ 2.0: data models
- `openai` ≥ 1.0: LLM inference via an OpenAI-compatible API
- `python-dotenv`: `.env` file loading

---
## 2. Configure Environment Variables

Copy the example and fill in your credentials:

```bash
cp .env.example .env
```

Edit `.env`:

```dotenv
# --- LLM Provider (HuggingFace Router) ---
API_BASE_URL=https://router.huggingface.co/v1
MODEL_NAME=Qwen/Qwen2.5-7B-Instruct
HF_TOKEN=hf_your_huggingface_token_here

# --- Server URL (usually auto-detected; leave commented for local dev) ---
# SPACE_URL=https://10doshi12-firewatch-env.hf.space
```

Get your HF token from [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens) (router access to gated models requires a **Pro** or **Enterprise** plan).

| Variable | Required | Description |
|----------|----------|-------------|
| `API_BASE_URL` | Yes | HuggingFace Router endpoint (`https://router.huggingface.co/v1`) |
| `MODEL_NAME` | Yes | Model on HF Hub (e.g. `Qwen/Qwen2.5-7B-Instruct`, `Qwen/Qwen2.5-72B-Instruct`) |
| `HF_TOKEN` | No* | HuggingFace API token. *If omitted, inference runs a deterministic rule-based fallback agent (no LLM calls).* |
| `SPACE_URL` | No | Override server URL. Auto-detected in order: `localhost:8000` → `localhost:7860` → HF Space |
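The resolution behavior described in the table can be sketched in a few lines. This is an illustrative snippet, not the project's actual loading code; the default values shown are assumptions taken from the table, and in the real project `python-dotenv`'s `load_dotenv()` populates `os.environ` from `.env` first.

```python
import os

# Illustrative sketch of the variable resolution described above.
# Defaults here are assumptions; the project reads .env via python-dotenv.
API_BASE_URL = os.environ.get("API_BASE_URL", "https://router.huggingface.co/v1")
MODEL_NAME = os.environ.get("MODEL_NAME", "Qwen/Qwen2.5-7B-Instruct")
HF_TOKEN = os.environ.get("HF_TOKEN")  # optional

# Without a token, the deterministic rule-based fallback agent runs instead of an LLM.
use_llm = HF_TOKEN is not None
```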
---
## 3. Start the Server

```bash
uv run server
```

The FastAPI server starts on **http://localhost:8000** with these endpoints:

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Health check |
| `/reset` | POST | Reset environment. Body: `{"difficulty": "easy", "seed": 42}` |
| `/step` | POST | Execute action. Body: `{"action": {"action_type": "fetch_logs", "target_service": "auth-service"}}` |
| `/state` | GET | Get current environment state |
| `/schema` | GET | Action / observation JSON schemas |
| `/ws` | WS | WebSocket for persistent sessions |
### Quick smoke test (new terminal)

```bash
# Reset an easy episode
curl -X POST http://localhost:8000/reset \
  -H "Content-Type: application/json" \
  -d '{"difficulty": "easy", "seed": 42}'

# Take an action
curl -X POST http://localhost:8000/step \
  -H "Content-Type: application/json" \
  -d '{"action": {"action_type": "fetch_logs", "target_service": "cache"}}'

# Check current state
curl http://localhost:8000/state
```
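The same smoke test can be driven from Python using only the standard library. This sketch builds the JSON payloads from the curl examples; the actual HTTP calls are left commented out since they need the server running. The `post` helper is illustrative, not part of the project's client.

```python
import json
from urllib import request

BASE = "http://localhost:8000"  # local dev server started with `uv run server`

def post(path, payload):
    """POST a JSON payload to the FirewatchEnv server and return the parsed reply."""
    req = request.Request(
        BASE + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

# Payloads mirror the curl smoke test above.
reset_body = {"difficulty": "easy", "seed": 42}
step_body = {"action": {"action_type": "fetch_logs", "target_service": "cache"}}

# With the server running:
# obs = post("/reset", reset_body)
# obs = post("/step", step_body)
```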
---
## 4. Run the Inference Agent

With the server running in one terminal, open a **second terminal**:

```bash
cd firewatch_env
python inference.py
```

This runs your agent across all three tasks sequentially:

| Task | Difficulty | Services | Red Herrings | Max Ticks | Seed |
|------|-----------|----------|-------------|-----------|------|
| `task_easy` | Easy | 3 | 0 | 20 | 42 |
| `task_medium` | Medium | 5 | 1 | 30 | 137 |
| `task_hard` | Hard | 7 | 3 (1 adversarial) | 40 | 256 |
### Expected Output

```
[START] task=task_easy env=firewatch-env model=x-ai/grok-4.1-fast
[STEP] step=1 action=fetch_logs:cache reward=-0.14 done=false error=null
[STEP] step=2 action=rollback_deploy:cache reward=-0.14 done=false error=null
...
[END] success=true steps=4 score=0.96 rewards=-0.14,-0.14,-0.14,1.86
```

Each `[STEP]` line shows the action taken, the intermediate reward, and whether the episode ended. The `[END]` line reports the final graded score (0.0–1.0).
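If you want to collect scores across runs, the `[END]` line is easy to parse. A small sketch, with the line format assumed from the sample output above:

```python
import re

# Matches the final summary line, e.g.:
# [END] success=true steps=4 score=0.96 rewards=-0.14,-0.14,-0.14,1.86
END_RE = re.compile(r"\[END\] success=(true|false) steps=(\d+) score=([\d.]+)")

def parse_end(line: str):
    """Return {'success', 'steps', 'score'} from an [END] line, or None otherwise."""
    m = END_RE.match(line)
    if m is None:
        return None
    return {
        "success": m.group(1) == "true",
        "steps": int(m.group(2)),
        "score": float(m.group(3)),
    }
```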
---
## 5. Docker (Alternative)

Build and run the environment as a Docker container:

```bash
# From the firewatch_env/ directory
docker build -t firewatch-env ./server
docker run -p 7860:7860 firewatch-env
```

The server will be available at **http://localhost:7860**. Set `SPACE_URL=http://localhost:7860` when running `inference.py` (or let auto-detection find it).

---
## 6. Deploy to HuggingFace Spaces

```bash
openenv validate   # must pass with zero errors
openenv push --repo-id 10doshi12/firewatch-env
```

Your environment will be live at `https://10doshi12-firewatch-env.hf.space`.

---
## Project Structure

```
firewatch_env/
├── models.py          # Pydantic models (FirewatchAction, SystemObservation, etc.)
├── simulation.py      # ServiceMesh + generate_episode() + fault physics
├── actions.py         # ActionHandler: all 17 action types
├── rewards.py         # RewardEngine + grade() + EpisodeResult
├── config.py          # Constants, TASKS dict, topology (pure data)
├── client.py          # OpenEnv-generated WebSocket client
├── inference.py       # LLM agent loop (stdout eval format)
├── openenv.yaml       # OpenEnv spec definition
├── .env.example       # Environment variable template
├── Dockerfile         # Multi-stage Docker build
├── pyproject.toml     # Dependencies & entry points
├── server/
│   ├── app.py                          # FastAPI application (entry point)
│   └── firewatch_env_environment.py    # Environment wiring
└── tests/
    ├── test_integration.py
    ├── test_simulation.py
    └── test_inference.py
```

---
## Action Space Reference

### Investigation Actions (read-only)

| Action | Description |
|--------|-------------|
| `fetch_logs` | Populates `recent_logs` on the target service |
| `get_metrics_detail` | Returns a 3-tick metric trend summary |
| `trace_dependencies` | Returns the full upstream/downstream dependency chain |
| `strace_process` | System-call-level process inspection |
| `profiler_dump` | CPU/memory profiler output |
| `check_gc_pressure` | GC pause times and heap pressure |
| `trace_distributed_request` | End-to-end distributed trace |
| `inspect_thread_pool` | Thread pool utilization and deadlock detection |
| `inspect_commit_diff` | Recent deployment diff |

### Remediation Actions (mutate state)

| Action | Description |
|--------|-------------|
| `restart_service` | Resets OOM state; wrong if `error_rate < 0.10` |
| `rollback_deploy` | Halts a bad deployment's progression |
| `revert_config` | Restores connection pool / config settings |
| `scale_replicas` | Increases memory headroom |
| `circuit_break` | Suppresses cascade for 3 ticks |
| `traffic_shift` | Redirects traffic away from a degraded service |

### Meta Actions

| Action | Description |
|--------|-------------|
| `declare_resolved` | Terminates the episode and triggers grading |
| `escalate` | Records an escalation (no state change) |

---
## Fault Types

| Fault | Signal in Logs | Correct Remediation |
|-------|---------------|---------------------|
| `oom` | OOMKilled, exit code 137 | `restart_service` |
| `bad_deploy` | Error spike post-deployment SHA | `rollback_deploy` |
| `config_drift` | HikariCP pool exhaustion, 30s timeouts | `revert_config` |
| `network_partition` | Connection refused, circuit breaker OPEN | `circuit_break` or `restart_service` |
| `memory_leak` | Gradual latency increase, slow memory growth | `scale_replicas` → `restart_service` |
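The table above is essentially a lookup from diagnosed fault to remediation, which a baseline agent could encode directly. A minimal sketch, transcribed from the table (where a row lists alternatives, one primary choice is picked here; `remediation_plan` and the `escalate` fallback are illustrative, not project code):

```python
# Remediation actions per fault type, transcribed from the table above.
# network_partition also accepts restart_service; memory_leak is a two-step
# fix (scale_replicas for headroom, then restart_service).
REMEDIATION = {
    "oom": ["restart_service"],
    "bad_deploy": ["rollback_deploy"],
    "config_drift": ["revert_config"],
    "network_partition": ["circuit_break"],
    "memory_leak": ["scale_replicas", "restart_service"],
}

def remediation_plan(fault_type: str) -> list:
    """Return the ordered remediation actions for a diagnosed fault."""
    return REMEDIATION.get(fault_type, ["escalate"])  # unknown fault: escalate
```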
---
## Scoring

The grader produces a score between **0.0 and 1.0** based on four weighted components:

| Component | Weight | What it Measures |
|-----------|--------|-----------------|
| Recovery | 40% | Did system health improve? |
| Speed | 25% | How quickly was the incident mitigated (MTTM)? |
| Precision | 20% | Were wrong actions avoided? |
| SLO | 15% | How much error budget remained? |
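Given these weights, the final grade is presumably a weighted sum of the four component scores. A sketch of that arithmetic only; the component names are assumed from the table, and the real logic lives in `grade()` in `rewards.py`:

```python
# Weights from the scoring table above.
WEIGHTS = {"recovery": 0.40, "speed": 0.25, "precision": 0.20, "slo": 0.15}

def weighted_score(components: dict) -> float:
    """Combine per-component scores in [0, 1] into the final 0.0-1.0 grade."""
    return sum(WEIGHTS[name] * components.get(name, 0.0) for name in WEIGHTS)

# The weights sum to 1, so a perfect run on all four components scores 1.0.
perfect = weighted_score({"recovery": 1.0, "speed": 1.0, "precision": 1.0, "slo": 1.0})
```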
---
## Running Tests

```bash
cd firewatch_env
uv run pytest tests/                         # all tests
uv run pytest tests/test_integration.py      # integration only
uv run pytest tests/test_simulation.py       # simulation logic
uv run pytest tests/test_integration.py::test_reset_deterministic   # single test
```

---
## Troubleshooting

| Problem | Solution |
|---------|----------|
| `uv: command not found` | Install uv: `pip install uv` or `curl -LsSf https://astral.sh/uv/install.sh \| sh` |
| `openenv-core` import error | Run `uv sync` inside `firewatch_env/` |
| Server won't start | Check that port 8000 isn't in use: `lsof -i :8000` |
| `inference.py` can't find the server | Auto-detection probes `localhost:8000` → `localhost:7860`. Ensure the server is running. |
| LLM API errors / 401 | Verify `HF_TOKEN` in `.env`. Without it, the rule-based fallback agent runs (no LLM). |
| Score is 0.0 | The agent didn't call `declare_resolved`, or the SLO budget hit 0%. Check the action logs. |
| Docker build fails | Ensure Docker Desktop is running. Build from `firewatch_env/`: `docker build -t fw ./server` |

---
## Next Steps

- **Swap the model**: change `MODEL_NAME` in `.env` to test different HF-hosted models (e.g. `Qwen/Qwen2.5-72B-Instruct`, `meta-llama/Llama-3.3-70B-Instruct`)
- **Tune the agent**: edit `SYSTEM_PROMPT` and `_recovery_hint()` in `inference.py` to improve decision-making
- **Add actions**: extend `actions.py` with new diagnostic or remediation actions
- **Custom tasks**: define new scenarios in `config.py` and `openenv.yaml`
- **Benchmark**: compare scores across models to find the best SRE agent

---

*FirewatchEnv · Meta PyTorch OpenEnv Hackathon India 2026*