# πŸ”₯ FirewatchEnv β€” Quickstart Guide
> Get from zero to running your first AI SRE agent in under 5 minutes.
---
## What is FirewatchEnv?
FirewatchEnv is an **RL training environment** for autonomous SRE incident response, built for the [Meta PyTorch OpenEnv Hackathon India 2026](https://github.com/meta-pytorch/OpenEnv). Your AI agent acts as an on-call Site Reliability Engineer β€” it receives simulated microservice telemetry (OTel-compatible metrics, Prometheus-style alerts, log excerpts) and must **diagnose and remediate the root cause** before the SLO error budget runs out.
**Key highlights:**
- Single container, no Kubernetes β€” runs on 2 vCPUs / 8 GB RAM
- Three difficulty tiers (Easy β†’ Medium β†’ Hard) with adversarial prompt injection in Task 3
- Outcome-only reward function β€” the agent can't game the grader; it must actually fix the system
---
## Prerequisites
| Tool | Version | Install |
|------|---------|---------|
| **Python** | 3.10+ | [python.org](https://www.python.org/downloads/) |
| **uv** | latest | `pip install uv` or `curl -LsSf https://astral.sh/uv/install.sh \| sh` |
| **Git** | any | [git-scm.com](https://git-scm.com/) |
| **Docker** | latest *(optional β€” only for containerized runs)* | [docker.com](https://docs.docker.com/get-docker/) |
---
## 1 β€” Clone & Install
```bash
git clone https://huggingface.co/spaces/10doshi12/firewatch-env
cd firewatch-env
```
> **Important:** All commands below should be run from inside the `firewatch_env/` directory, which contains the actual environment code.
```bash
cd firewatch_env
uv sync # installs all Python dependencies from pyproject.toml + uv.lock
```
This installs:
- `openenv-core[core]` β‰₯ 0.2.2 β€” FastAPI server + HTTP client types
- `pydantic` β‰₯ 2.0 β€” data models
- `openai` β‰₯ 1.0 β€” LLM inference via OpenAI-compatible API
- `python-dotenv` β€” `.env` file loading
---
## 2 β€” Configure Environment Variables
Copy the example and fill in your credentials:
```bash
cp .env.example .env
```
Edit `.env`:
```dotenv
# --- LLM Provider (HuggingFace Router) ---
API_BASE_URL=https://router.huggingface.co/v1
MODEL_NAME=Qwen/Qwen2.5-7B-Instruct
HF_TOKEN=hf_your_huggingface_token_here
# --- Server URL (usually auto-detected β€” leave commented for local dev) ---
# SPACE_URL=https://10doshi12-firewatch-env.hf.space
```
Get your HF token from [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens) (requires a **Pro** or **Enterprise** plan for router access to gated models).
| Variable | Required | Description |
|----------|----------|-------------|
| `API_BASE_URL` | Yes | HuggingFace Router endpoint (`https://router.huggingface.co/v1`) |
| `MODEL_NAME` | Yes | Model on HF Hub (e.g. `Qwen/Qwen2.5-7B-Instruct`, `Qwen/Qwen2.5-72B-Instruct`) |
| `HF_TOKEN` | No* | HuggingFace API token. *If omitted, inference runs a deterministic rule-based fallback agent (no LLM calls).* |
| `SPACE_URL` | No | Override server URL. Auto-detected in order: `localhost:8000` β†’ `localhost:7860` β†’ HF Space |
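The fallback behavior in the table can be sketched in a few lines. This is a hypothetical helper, not the project's actual code: the variable names come from the table above, and the defaults mirror the documented `.env.example` values.

```python
# Hypothetical sketch of how an agent script might resolve its settings.
# Variable names match the table above; the fallback logic is an
# assumption, not FirewatchEnv's actual implementation.
import os

def resolve_config(env=os.environ):
    """Read FirewatchEnv settings, falling back to documented defaults."""
    return {
        "api_base_url": env.get("API_BASE_URL", "https://router.huggingface.co/v1"),
        "model_name": env.get("MODEL_NAME", "Qwen/Qwen2.5-7B-Instruct"),
        # No token -> the deterministic rule-based fallback agent runs.
        "hf_token": env.get("HF_TOKEN"),
        # None -> auto-detect localhost:8000, then localhost:7860, then the Space.
        "space_url": env.get("SPACE_URL"),
    }

cfg = resolve_config({})  # empty env: everything falls back to defaults
print(cfg["model_name"])  # Qwen/Qwen2.5-7B-Instruct
```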
---
## 3 β€” Start the Server
```bash
uv run server
```
The FastAPI server starts on **http://localhost:8000** with these endpoints:
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Health check |
| `/reset` | POST | Reset environment β€” `{"difficulty": "easy", "seed": 42}` |
| `/step` | POST | Execute action β€” `{"action": {"action_type": "fetch_logs", "target_service": "auth-service"}}` |
| `/state` | GET | Get current environment state |
| `/schema` | GET | Action / observation JSON schemas |
| `/ws` | WS | WebSocket for persistent sessions |
### Quick smoke test (new terminal):
```bash
# Reset an easy episode
curl -X POST http://localhost:8000/reset \
-H "Content-Type: application/json" \
-d '{"difficulty": "easy", "seed": 42}'
# Take an action
curl -X POST http://localhost:8000/step \
-H "Content-Type: application/json" \
-d '{"action": {"action_type": "fetch_logs", "target_service": "cache"}}'
# Check current state
curl http://localhost:8000/state
```
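The same smoke test can be driven from Python using only the standard library. The endpoint paths and payload shapes below come straight from the endpoint table; the helper names (`step_payload`, `post`) are this sketch's own, not part of the project's client.

```python
# Stdlib-only sketch of the curl smoke test above. Helper names are
# illustrative; the real project ships an OpenEnv-generated client.
import json
import urllib.request

BASE = "http://localhost:8000"

def step_payload(action_type: str, target_service: str) -> dict:
    """Build the /step request body shown in the endpoint table."""
    return {"action": {"action_type": action_type,
                       "target_service": target_service}}

def post(path: str, payload: dict) -> dict:
    """POST JSON to the local server and decode the JSON response."""
    req = urllib.request.Request(
        BASE + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# With the server running:
#   post("/reset", {"difficulty": "easy", "seed": 42})
#   post("/step", step_payload("fetch_logs", "cache"))
```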
---
## 4 β€” Run the Inference Agent
With the server running in one terminal, open a **second terminal**:
```bash
cd firewatch_env
uv run python inference.py
```
This runs your agent across all three tasks sequentially:
| Task | Difficulty | Services | Red Herrings | Max Ticks | Seed |
|------|-----------|----------|-------------|-----------|------|
| `task_easy` | Easy | 3 | 0 | 20 | 42 |
| `task_medium` | Medium | 5 | 1 | 30 | 137 |
| `task_hard` | Hard | 7 | 3 (1 adversarial) | 40 | 256 |
### Expected Output
```
[START] task=task_easy env=firewatch-env model=Qwen/Qwen2.5-7B-Instruct
[STEP] step=1 action=fetch_logs:cache reward=-0.14 done=false error=null
[STEP] step=2 action=rollback_deploy:cache reward=-0.14 done=false error=null
...
[END] success=true steps=4 score=0.96 rewards=-0.14,-0.14,-0.14,1.86
```
Each `[STEP]` line shows the action taken, intermediate reward, and whether the episode ended. The `[END]` line reports the final graded score (0.0–1.0).
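Because the output is a flat `key=value` stream, it is easy to post-process. The parser below infers the line grammar from the sample output above, so treat it as a sketch rather than a guaranteed format.

```python
# Tiny parser for the [STEP]/[END] lines shown above. The key=value
# grammar is inferred from the sample output, so this is a sketch.
def parse_eval_line(line: str) -> dict:
    """Split '[TAG] k1=v1 k2=v2 ...' into a dict (all values are strings)."""
    tag, _, rest = line.partition("] ")
    fields = {}
    for pair in rest.split():
        key, _, value = pair.partition("=")
        fields[key] = value
    fields["tag"] = tag.lstrip("[")
    return fields

end = parse_eval_line("[END] success=true steps=4 score=0.96 rewards=-0.14,-0.14,-0.14,1.86")
print(end["score"])  # 0.96
```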
---
## 5 β€” Docker (Alternative)
Build and run the environment as a Docker container:
```bash
# From the firewatch_env/ directory
docker build -t firewatch-env ./server
docker run -p 7860:7860 firewatch-env
```
The server will be available at **http://localhost:7860**. Set `SPACE_URL=http://localhost:7860` when running `inference.py` (or let auto-detection find it).
---
## 6 β€” Deploy to HuggingFace Spaces
```bash
openenv validate # must pass with zero errors
openenv push --repo-id 10doshi12/firewatch-env
```
Your environment will be live at `https://10doshi12-firewatch-env.hf.space`.
---
## Project Structure
```
firewatch_env/
β”œβ”€β”€ models.py # Pydantic models (FirewatchAction, SystemObservation, etc.)
β”œβ”€β”€ simulation.py # ServiceMesh + generate_episode() + fault physics
β”œβ”€β”€ actions.py # ActionHandler β€” all 17 action types
β”œβ”€β”€ rewards.py # RewardEngine + grade() + EpisodeResult
β”œβ”€β”€ config.py # Constants, TASKS dict, topology (pure data)
β”œβ”€β”€ client.py # OpenEnv-generated WebSocket client
β”œβ”€β”€ inference.py # LLM agent loop (stdout eval format)
β”œβ”€β”€ openenv.yaml # OpenEnv spec definition
β”œβ”€β”€ .env.example # Environment variable template
β”œβ”€β”€ Dockerfile # Multi-stage Docker build
β”œβ”€β”€ pyproject.toml # Dependencies & entry points
β”œβ”€β”€ server/
β”‚ β”œβ”€β”€ app.py # FastAPI application (entry point)
β”‚ └── firewatch_env_environment.py # Environment wiring
└── tests/
β”œβ”€β”€ test_integration.py
β”œβ”€β”€ test_simulation.py
└── test_inference.py
```
---
## Action Space Reference
### Investigation Actions (read-only)
| Action | Description |
|--------|-------------|
| `fetch_logs` | Populates `recent_logs` on the target service |
| `get_metrics_detail` | Returns 3-tick metric trend summary |
| `trace_dependencies` | Returns full upstream/downstream dependency chain |
| `strace_process` | System-call level process inspection |
| `profiler_dump` | CPU/memory profiler output |
| `check_gc_pressure` | GC pause times and heap pressure |
| `trace_distributed_request` | End-to-end distributed trace |
| `inspect_thread_pool` | Thread pool utilization and deadlock detection |
| `inspect_commit_diff` | Recent deployment diff |
### Remediation Actions (mutate state)
| Action | Description |
|--------|-------------|
| `restart_service` | Resets OOM state; wrong if `error_rate < 0.10` |
| `rollback_deploy` | Halts bad deployment progression |
| `revert_config` | Restores connection pool / config settings |
| `scale_replicas` | Increases memory headroom |
| `circuit_break` | Suppresses cascade for 3 ticks |
| `traffic_shift` | Redirects traffic away from degraded service |
### Meta Actions
| Action | Description |
|--------|-------------|
| `declare_resolved` | Terminates episode and triggers grading |
| `escalate` | Records escalation (no state change) |
---
## Fault Types
| Fault | Signal in Logs | Correct Remediation |
|-------|---------------|---------------------|
| `oom` | OOMKilled, exit code 137 | `restart_service` |
| `bad_deploy` | Error spike post-deployment SHA | `rollback_deploy` |
| `config_drift` | HikariCP pool exhaustion, 30s timeouts | `revert_config` |
| `network_partition` | Connection refused, circuit breaker OPEN | `circuit_break` or `restart_service` |
| `memory_leak` | Gradual latency increase, slow memory growth | `scale_replicas` β†’ `restart_service` |
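The fault table doubles as a lookup for a rule-based agent. The mapping below restates the table as data; note that `memory_leak` is a two-step plan and `network_partition` accepts either listed action.

```python
# The fault table above, restated as a lookup for a rule-based agent.
# memory_leak is ordered (scale first, then restart); network_partition
# lists two acceptable options.
REMEDIATION_PLAN = {
    "oom": ["restart_service"],
    "bad_deploy": ["rollback_deploy"],
    "config_drift": ["revert_config"],
    "network_partition": ["circuit_break", "restart_service"],
    "memory_leak": ["scale_replicas", "restart_service"],
}

def first_remediation(fault: str) -> str:
    """Return the first-choice remediation for a diagnosed fault."""
    return REMEDIATION_PLAN[fault][0]

print(first_remediation("bad_deploy"))  # rollback_deploy
```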
---
## Scoring
The grader produces a score between **0.0 and 1.0** based on four components:
| Component | Weight | What it Measures |
|-----------|--------|-----------------|
| Recovery | 40% | Did system health improve? |
| Speed | 25% | How quickly was mitigation reached (mean time to mitigate, MTTM)? |
| Precision | 20% | Were wrong actions avoided? |
| SLO | 15% | How much error budget remained? |
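A worked example makes the weighting concrete. The component values below are hypothetical; only the weights come from the table above.

```python
# Worked example of the weighted score. The component values passed to
# grade() are hypothetical; the weights come from the scoring table.
WEIGHTS = {"recovery": 0.40, "speed": 0.25, "precision": 0.20, "slo": 0.15}

def grade(components: dict) -> float:
    """Combine per-component scores (each in [0, 1]) into a final score."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(WEIGHTS[k] * components[k] for k in WEIGHTS)

score = grade({"recovery": 1.0, "speed": 0.8, "precision": 1.0, "slo": 0.5})
print(round(score, 3))  # 0.875
```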
---
## Running Tests
```bash
cd firewatch_env
uv run pytest tests/ # all tests
uv run pytest tests/test_integration.py # integration only
uv run pytest tests/test_simulation.py # simulation logic
uv run pytest tests/test_integration.py::test_reset_deterministic # single test
```
---
## Troubleshooting
| Problem | Solution |
|---------|----------|
| `uv: command not found` | Install uv: `pip install uv` or `curl -LsSf https://astral.sh/uv/install.sh \| sh` |
| `openenv-core` import error | Run `uv sync` inside `firewatch_env/` |
| Server won't start | Check port 8000 isn't in use: `lsof -i :8000` |
| `inference.py` can't find server | Server auto-detection probes `localhost:8000` β†’ `localhost:7860`. Ensure the server is running. |
| LLM API errors / 401 | Verify `HF_TOKEN` in `.env`. Without it, the rule-based fallback agent runs (no LLM). |
| Score is 0.0 | Agent didn't call `declare_resolved` or SLO budget hit 0%. Check action logs. |
| Docker build fails | Ensure Docker Desktop is running. Build from `firewatch_env/`: `docker build -t fw ./server` |
---
## Next Steps
- **Swap the model**: Change `MODEL_NAME` in `.env` to test different HF-hosted models (e.g. `Qwen/Qwen2.5-72B-Instruct`, `meta-llama/Llama-3.3-70B-Instruct`)
- **Tune the agent**: Edit `SYSTEM_PROMPT` and `_recovery_hint()` in `inference.py` to improve decision-making
- **Add actions**: Extend `actions.py` with new diagnostic or remediation actions
- **Custom tasks**: Define new scenarios in `config.py` and `openenv.yaml`
- **Benchmark**: Compare scores across models to find the best SRE agent
---
*FirewatchEnv β€” Meta PyTorch OpenEnv Hackathon India 2026*