# 🔥 FirewatchEnv – Quickstart Guide
> Get from zero to running your first AI SRE agent in under 5 minutes.
---
## What is FirewatchEnv?
FirewatchEnv is an **RL training environment** for autonomous SRE incident response, built for the [Meta PyTorch OpenEnv Hackathon India 2026](https://github.com/meta-pytorch/OpenEnv). Your AI agent acts as an on-call Site Reliability Engineer: it receives simulated microservice telemetry (OTel-compatible metrics, Prometheus-style alerts, log excerpts) and must **diagnose and remediate the root cause** before the SLO error budget runs out.
**Key highlights:**
- Single container, no Kubernetes – runs on 2 vCPUs / 8 GB RAM
- Three difficulty tiers (Easy → Medium → Hard) with adversarial prompt injection in Task 3
- Outcome-only reward function – the agent can't game the grader; it must actually fix the system
---
## Prerequisites
| Tool | Version | Install |
|------|---------|---------|
| **Python** | 3.10+ | [python.org](https://www.python.org/downloads/) |
| **uv** | latest | `pip install uv` or `curl -LsSf https://astral.sh/uv/install.sh \| sh` |
| **Git** | any | [git-scm.com](https://git-scm.com/) |
| **Docker** | latest *(optional β only for containerized runs)* | [docker.com](https://docs.docker.com/get-docker/) |
---
## 1 – Clone & Install
```bash
git clone https://huggingface.co/spaces/10doshi12/firewatch-env
cd firewatch-env
```
> **Important:** All commands below should be run from inside the `firewatch_env/` directory, which contains the actual environment code.
```bash
cd firewatch_env
uv sync # installs all Python dependencies from pyproject.toml + uv.lock
```
This installs:
- `openenv-core[core]` ≥ 0.2.2 – FastAPI server + HTTP client types
- `pydantic` ≥ 2.0 – data models
- `openai` ≥ 1.0 – LLM inference via an OpenAI-compatible API
- `python-dotenv` – `.env` file loading
---
## 2 – Configure Environment Variables
Copy the example and fill in your credentials:
```bash
cp .env.example .env
```
Edit `.env`:
```dotenv
# --- LLM Provider (HuggingFace Router) ---
API_BASE_URL=https://router.huggingface.co/v1
MODEL_NAME=Qwen/Qwen2.5-7B-Instruct
HF_TOKEN=hf_your_huggingface_token_here
# --- Server URL (usually auto-detected – leave commented for local dev) ---
# SPACE_URL=https://10doshi12-firewatch-env.hf.space
```
Get your HF token from [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens) (requires a **Pro** or **Enterprise** plan for router access to gated models).
| Variable | Required | Description |
|----------|----------|-------------|
| `API_BASE_URL` | Yes | HuggingFace Router endpoint (`https://router.huggingface.co/v1`) |
| `MODEL_NAME` | Yes | Model on HF Hub (e.g. `Qwen/Qwen2.5-7B-Instruct`, `Qwen/Qwen2.5-72B-Instruct`) |
| `HF_TOKEN` | No* | HuggingFace API token. *If omitted, inference runs a deterministic rule-based fallback agent (no LLM calls).* |
| `SPACE_URL` | No | Override server URL. Auto-detected in order: `localhost:8000` → `localhost:7860` → HF Space |
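The probe order (`localhost:8000`, then `localhost:7860`, then the HF Space) can be sketched as a small first-match loop. This is a hypothetical helper for illustration, not the project's actual code; the real detection lives in `inference.py`:

```python
from typing import Callable, Iterable, Optional

# Probe order mirrors the table: local dev ports first, then the HF Space.
CANDIDATES = [
    "http://localhost:8000",
    "http://localhost:7860",
    "https://10doshi12-firewatch-env.hf.space",
]

def resolve_server_url(
    candidates: Iterable[str],
    is_reachable: Callable[[str], bool],
) -> Optional[str]:
    """Return the first candidate URL whose health probe succeeds, else None."""
    for url in candidates:
        if is_reachable(url):
            return url
    return None
```

Injecting the probe as a callable keeps the selection logic testable without any network access.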
---
## 3 – Start the Server
```bash
uv run server
```
The FastAPI server starts on **http://localhost:8000** with these endpoints:
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Health check |
| `/reset` | POST | Reset environment – `{"difficulty": "easy", "seed": 42}` |
| `/step` | POST | Execute action – `{"action": {"action_type": "fetch_logs", "target_service": "auth-service"}}` |
| `/state` | GET | Get current environment state |
| `/schema` | GET | Action / observation JSON schemas |
| `/ws` | WS | WebSocket for persistent sessions |
### Quick smoke test (new terminal):
```bash
# Reset an easy episode
curl -X POST http://localhost:8000/reset \
-H "Content-Type: application/json" \
-d '{"difficulty": "easy", "seed": 42}'
# Take an action
curl -X POST http://localhost:8000/step \
-H "Content-Type: application/json" \
-d '{"action": {"action_type": "fetch_logs", "target_service": "cache"}}'
# Check current state
curl http://localhost:8000/state
```
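The same smoke test can be driven from Python using only the standard library. A minimal sketch: the payload shapes match the curl examples above, and the network call is kept out of module scope so the helpers import cleanly without a running server:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"

def reset_payload(difficulty: str = "easy", seed: int = 42) -> dict:
    """Request body for POST /reset."""
    return {"difficulty": difficulty, "seed": seed}

def step_payload(action_type: str, target_service: str) -> dict:
    """Request body for POST /step."""
    return {"action": {"action_type": action_type, "target_service": target_service}}

def post(path: str, body: dict) -> dict:
    """POST a JSON body to the server and decode the JSON response."""
    req = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Usage (with the server running):
#   post("/reset", reset_payload())
#   post("/step", step_payload("fetch_logs", "cache"))
```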
---
## 4 – Run the Inference Agent
With the server running in one terminal, open a **second terminal**:
```bash
cd firewatch_env
python inference.py
```
This runs your agent across all three tasks sequentially:
| Task | Difficulty | Services | Red Herrings | Max Ticks | Seed |
|------|-----------|----------|-------------|-----------|------|
| `task_easy` | Easy | 3 | 0 | 20 | 42 |
| `task_medium` | Medium | 5 | 1 | 30 | 137 |
| `task_hard` | Hard | 7 | 3 (1 adversarial) | 40 | 256 |
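For scripting, the task parameters above can be mirrored as a plain dict. This is an illustrative copy of the table only; the authoritative definitions live in `config.py`:

```python
# Task parameters as listed in the quickstart table (illustrative mirror).
TASKS = {
    "task_easy":   {"difficulty": "easy",   "services": 3, "red_herrings": 0, "max_ticks": 20, "seed": 42},
    "task_medium": {"difficulty": "medium", "services": 5, "red_herrings": 1, "max_ticks": 30, "seed": 137},
    "task_hard":   {"difficulty": "hard",   "services": 7, "red_herrings": 3, "max_ticks": 40, "seed": 256},
}
```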
### Expected Output
```
[START] task=task_easy env=firewatch-env model=Qwen/Qwen2.5-7B-Instruct
[STEP] step=1 action=fetch_logs:cache reward=-0.14 done=false error=null
[STEP] step=2 action=rollback_deploy:cache reward=-0.14 done=false error=null
...
[END] success=true steps=4 score=0.96 rewards=-0.14,-0.14,-0.14,1.86
```
Each `[STEP]` line shows the action taken, intermediate reward, and whether the episode ended. The `[END]` line reports the final graded score (0.0β1.0).
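If you want to post-process these logs (e.g. to chart reward per step), the `[STEP]` lines are easy to parse. A regex sketch, assuming the exact format shown in the example output above:

```python
import re
from typing import Optional

# Matches lines like:
#   [STEP] step=1 action=fetch_logs:cache reward=-0.14 done=false error=null
STEP_RE = re.compile(
    r"\[STEP\] step=(?P<step>\d+) action=(?P<action>\S+) "
    r"reward=(?P<reward>-?\d+\.\d+) done=(?P<done>true|false) error=(?P<error>\S+)"
)

def parse_step(line: str) -> Optional[dict]:
    """Parse one [STEP] log line into typed fields; return None on non-matching lines."""
    m = STEP_RE.match(line)
    if not m:
        return None
    return {
        "step": int(m["step"]),
        "action": m["action"],
        "reward": float(m["reward"]),
        "done": m["done"] == "true",
        "error": None if m["error"] == "null" else m["error"],
    }
```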
---
## 5 – Docker (Alternative)
Build and run the environment as a Docker container:
```bash
# From the firewatch_env/ directory
docker build -t firewatch-env ./server
docker run -p 7860:7860 firewatch-env
```
The server will be available at **http://localhost:7860**. Set `SPACE_URL=http://localhost:7860` when running `inference.py` (or let auto-detection find it).
---
## 6 – Deploy to HuggingFace Spaces
```bash
openenv validate # must pass with zero errors
openenv push --repo-id 10doshi12/firewatch-env
```
Your environment will be live at `https://10doshi12-firewatch-env.hf.space`.
---
## Project Structure
```
firewatch_env/
├── models.py                          # Pydantic models (FirewatchAction, SystemObservation, etc.)
├── simulation.py                      # ServiceMesh + generate_episode() + fault physics
├── actions.py                         # ActionHandler – all 17 action types
├── rewards.py                         # RewardEngine + grade() + EpisodeResult
├── config.py                          # Constants, TASKS dict, topology (pure data)
├── client.py                          # OpenEnv-generated WebSocket client
├── inference.py                       # LLM agent loop (stdout eval format)
├── openenv.yaml                       # OpenEnv spec definition
├── .env.example                       # Environment variable template
├── Dockerfile                         # Multi-stage Docker build
├── pyproject.toml                     # Dependencies & entry points
├── server/
│   ├── app.py                         # FastAPI application (entry point)
│   └── firewatch_env_environment.py   # Environment wiring
└── tests/
    ├── test_integration.py
    ├── test_simulation.py
    └── test_inference.py
```
---
## Action Space Reference
### Investigation Actions (read-only)
| Action | Description |
|--------|-------------|
| `fetch_logs` | Populates `recent_logs` on the target service |
| `get_metrics_detail` | Returns 3-tick metric trend summary |
| `trace_dependencies` | Returns full upstream/downstream dependency chain |
| `strace_process` | System-call level process inspection |
| `profiler_dump` | CPU/memory profiler output |
| `check_gc_pressure` | GC pause times and heap pressure |
| `trace_distributed_request` | End-to-end distributed trace |
| `inspect_thread_pool` | Thread pool utilization and deadlock detection |
| `inspect_commit_diff` | Recent deployment diff |
### Remediation Actions (mutate state)
| Action | Description |
|--------|-------------|
| `restart_service` | Resets OOM state; wrong if `error_rate < 0.10` |
| `rollback_deploy` | Halts bad deployment progression |
| `revert_config` | Restores connection pool / config settings |
| `scale_replicas` | Increases memory headroom |
| `circuit_break` | Suppresses cascade for 3 ticks |
| `traffic_shift` | Redirects traffic away from degraded service |
### Meta Actions
| Action | Description |
|--------|-------------|
| `declare_resolved` | Terminates episode and triggers grading |
| `escalate` | Records escalation (no state change) |
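A small sketch that groups the 17 action types above by category and validates a payload before sending it. This is illustrative only; the server's own `/schema` endpoint is the authoritative source for the action schema:

```python
from typing import Optional

INVESTIGATION = {
    "fetch_logs", "get_metrics_detail", "trace_dependencies", "strace_process",
    "profiler_dump", "check_gc_pressure", "trace_distributed_request",
    "inspect_thread_pool", "inspect_commit_diff",
}
REMEDIATION = {
    "restart_service", "rollback_deploy", "revert_config",
    "scale_replicas", "circuit_break", "traffic_shift",
}
META = {"declare_resolved", "escalate"}
ALL_ACTIONS = INVESTIGATION | REMEDIATION | META  # 17 total

def make_action(action_type: str, target_service: Optional[str] = None) -> dict:
    """Build a /step request body, rejecting unknown action types."""
    if action_type not in ALL_ACTIONS:
        raise ValueError(f"unknown action_type: {action_type}")
    action: dict = {"action_type": action_type}
    if target_service is not None:
        action["target_service"] = target_service
    return {"action": action}
```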
---
## Fault Types
| Fault | Signal in Logs | Correct Remediation |
|-------|---------------|---------------------|
| `oom` | OOMKilled, exit code 137 | `restart_service` |
| `bad_deploy` | Error spike post-deployment SHA | `rollback_deploy` |
| `config_drift` | HikariCP pool exhaustion, 30s timeouts | `revert_config` |
| `network_partition` | Connection refused, circuit breaker OPEN | `circuit_break` or `restart_service` |
| `memory_leak` | Gradual latency increase, slow memory growth | `scale_replicas` → `restart_service` |
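The fault-to-fix mapping above can be encoded as a lookup table for a rule-based agent. An illustrative mirror of the table, not the environment's own code; lists hold the accepted remediations in order (`memory_leak` is a two-step fix):

```python
# Correct remediation(s) per fault type, per the table above (illustrative).
REMEDIATION_FOR = {
    "oom": ["restart_service"],
    "bad_deploy": ["rollback_deploy"],
    "config_drift": ["revert_config"],
    "network_partition": ["circuit_break", "restart_service"],
    "memory_leak": ["scale_replicas", "restart_service"],
}

def first_remediation(fault: str) -> str:
    """Return the preferred first remediation for a diagnosed fault type."""
    return REMEDIATION_FOR[fault][0]
```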
---
## Scoring
The grader produces a score between **0.0 and 1.0** based on four components:
| Component | Weight | What it Measures |
|-----------|--------|-----------------|
| Recovery | 40% | Did system health improve? |
| Speed | 25% | How quickly was the incident mitigated (MTTM, mean time to mitigate)? |
| Precision | 20% | Were wrong actions avoided? |
| SLO | 15% | How much error budget remained? |
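The weighted sum can be sketched as follows. This is a hedged reconstruction from the weights in the table above, not the grader's actual implementation in `rewards.py`:

```python
# Component weights per the scoring table (sum to 1.0).
WEIGHTS = {"recovery": 0.40, "speed": 0.25, "precision": 0.20, "slo": 0.15}

def grade(components: dict) -> float:
    """Combine per-component scores (each clamped to [0, 1]) into a final 0.0-1.0 grade."""
    assert set(components) == set(WEIGHTS), "all four components required"
    total = sum(WEIGHTS[k] * max(0.0, min(1.0, v)) for k, v in components.items())
    return round(total, 4)
```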
---
## Running Tests
```bash
cd firewatch_env
uv run pytest tests/ # all tests
uv run pytest tests/test_integration.py # integration only
uv run pytest tests/test_simulation.py # simulation logic
uv run pytest tests/test_integration.py::test_reset_deterministic # single test
```
---
## Troubleshooting
| Problem | Solution |
|---------|----------|
| `uv: command not found` | Install uv: `pip install uv` or `curl -LsSf https://astral.sh/uv/install.sh \| sh` |
| `openenv-core` import error | Run `uv sync` inside `firewatch_env/` |
| Server won't start | Check port 8000 isn't in use: `lsof -i :8000` |
| `inference.py` can't find server | Server auto-detection probes `localhost:8000` → `localhost:7860`. Ensure the server is running. |
| LLM API errors / 401 | Verify `HF_TOKEN` in `.env`. Without it, the rule-based fallback agent runs (no LLM). |
| Score is 0.0 | Agent didn't call `declare_resolved` or SLO budget hit 0%. Check action logs. |
| Docker build fails | Ensure Docker Desktop is running. Build from `firewatch_env/`: `docker build -t fw ./server` |
---
## Next Steps
- **Swap the model**: Change `MODEL_NAME` in `.env` to test different HF-hosted models (e.g. `Qwen/Qwen2.5-72B-Instruct`, `meta-llama/Llama-3.3-70B-Instruct`)
- **Tune the agent**: Edit `SYSTEM_PROMPT` and `_recovery_hint()` in `inference.py` to improve decision-making
- **Add actions**: Extend `actions.py` with new diagnostic or remediation actions
- **Custom tasks**: Define new scenarios in `config.py` and `openenv.yaml`
- **Benchmark**: Compare scores across models to find the best SRE agent
---
*FirewatchEnv – Meta PyTorch OpenEnv Hackathon India 2026*