🔥 FirewatchEnv Quickstart Guide
Get from zero to running your first AI SRE agent in under 5 minutes.
What is FirewatchEnv?
FirewatchEnv is an RL training environment for autonomous SRE incident response, built for the Meta PyTorch OpenEnv Hackathon India 2026. Your AI agent acts as an on-call Site Reliability Engineer: it receives simulated microservice telemetry (OTel-compatible metrics, Prometheus-style alerts, log excerpts) and must diagnose and remediate the root cause before the SLO error budget runs out.
Key highlights:
- Single container, no Kubernetes: runs on 2 vCPUs / 8 GB RAM
- Three difficulty tiers (Easy → Medium → Hard), with adversarial prompt injection in Task 3
- Outcome-only reward function: the agent can't game the grader; it must actually fix the system
Prerequisites
| Tool | Version | Install |
|---|---|---|
| Python | 3.10+ | python.org |
| uv | latest | `pip install uv` or `curl -LsSf https://astral.sh/uv/install.sh \| sh` |
| Git | any | git-scm.com |
| Docker | latest (optional β only for containerized runs) | docker.com |
1. Clone & Install
git clone https://huggingface.co/spaces/10doshi12/firewatch-env
cd firewatch-env
Important: All commands below should be run from inside the
firewatch_env/ directory, which contains the actual environment code.
cd firewatch_env
uv sync # installs all Python dependencies from pyproject.toml + uv.lock
This installs:
- `openenv-core[core] >= 0.2.2`: FastAPI server + HTTP client types
- `pydantic >= 2.0`: data models
- `openai >= 1.0`: LLM inference via an OpenAI-compatible API
- `python-dotenv`: `.env` file loading
2. Configure Environment Variables
Copy the example and fill in your credentials:
cp .env.example .env
Edit .env:
# --- LLM Provider (HuggingFace Router) ---
API_BASE_URL=https://router.huggingface.co/v1
MODEL_NAME=Qwen/Qwen2.5-7B-Instruct
HF_TOKEN=hf_your_huggingface_token_here
# --- Server URL (usually auto-detected; leave commented for local dev) ---
# SPACE_URL=https://10doshi12-firewatch-env.hf.space
Get your HF token from huggingface.co/settings/tokens (requires a Pro or Enterprise plan for router access to gated models).
| Variable | Required | Description |
|---|---|---|
| `API_BASE_URL` | Yes | HuggingFace Router endpoint (`https://router.huggingface.co/v1`) |
| `MODEL_NAME` | Yes | Model on HF Hub (e.g. `Qwen/Qwen2.5-7B-Instruct`, `Qwen/Qwen2.5-72B-Instruct`) |
| `HF_TOKEN` | No* | HuggingFace API token. If omitted, inference runs a deterministic rule-based fallback agent (no LLM calls). |
| `SPACE_URL` | No | Override server URL. Auto-detected in order: `localhost:8000` → `localhost:7860` → HF Space |
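The `HF_TOKEN` fallback rule can be sketched as a tiny helper. This is illustrative only (the helper name is hypothetical; the real `inference.py` may structure the check differently):

```python
import os

def select_agent_mode(env=None):
    """Return which agent loop runs, mirroring the HF_TOKEN rule above.

    Hypothetical helper: with a token set we call the LLM via the router;
    without one, the deterministic rule-based fallback agent runs.
    """
    env = os.environ if env is None else env
    return "llm" if env.get("HF_TOKEN") else "rule-based"
```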
3. Start the Server
uv run server
The FastAPI server starts on http://localhost:8000 with these endpoints:
| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Health check |
| `/reset` | POST | Reset environment: `{"difficulty": "easy", "seed": 42}` |
| `/step` | POST | Execute action: `{"action": {"action_type": "fetch_logs", "target_service": "auth-service"}}` |
| `/state` | GET | Get current environment state |
| `/schema` | GET | Action / observation JSON schemas |
| `/ws` | WS | WebSocket for persistent sessions |
Quick smoke test (new terminal):
# Reset an easy episode
curl -X POST http://localhost:8000/reset \
-H "Content-Type: application/json" \
-d '{"difficulty": "easy", "seed": 42}'
# Take an action
curl -X POST http://localhost:8000/step \
-H "Content-Type: application/json" \
-d '{"action": {"action_type": "fetch_logs", "target_service": "cache"}}'
# Check current state
curl http://localhost:8000/state
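The same smoke test can be driven from Python with only the standard library. A sketch, assuming the server is running on `localhost:8000`; the helper names here are illustrative, not part of the project's client API:

```python
import json
import urllib.request

BASE = "http://localhost:8000"

def reset_payload(difficulty="easy", seed=42):
    # Body shape matches the /reset example above
    return {"difficulty": difficulty, "seed": seed}

def step_payload(action_type, target_service):
    # Body shape matches the /step example above
    return {"action": {"action_type": action_type, "target_service": target_service}}

def post(path, payload):
    req = urllib.request.Request(
        BASE + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

if __name__ == "__main__":
    print(post("/reset", reset_payload("easy", 42)))
    print(post("/step", step_payload("fetch_logs", "cache")))
```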
4. Run the Inference Agent
With the server running in one terminal, open a second terminal:
cd firewatch_env
python inference.py
This runs your agent across all three tasks sequentially:
| Task | Difficulty | Services | Red Herrings | Max Ticks | Seed |
|---|---|---|---|---|---|
| `task_easy` | Easy | 3 | 0 | 20 | 42 |
| `task_medium` | Medium | 5 | 1 | 30 | 137 |
| `task_hard` | Hard | 7 | 3 (1 adversarial) | 40 | 256 |
Expected Output
[START] task=task_easy env=firewatch-env model=Qwen/Qwen2.5-7B-Instruct
[STEP] step=1 action=fetch_logs:cache reward=-0.14 done=false error=null
[STEP] step=2 action=rollback_deploy:cache reward=-0.14 done=false error=null
...
[END] success=true steps=4 score=0.96 rewards=-0.14,-0.14,-0.14,1.86
Each [STEP] line shows the action taken, intermediate reward, and whether the episode ended. The [END] line reports the final graded score (0.0–1.0).
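If you want to post-process an evaluation run, the [STEP] lines are straightforward to parse. A minimal sketch; the field set is taken from the sample output above, not from a documented log schema:

```python
def parse_step_line(line):
    """Parse a '[STEP] key=value ...' line from inference.py stdout."""
    _, _, rest = line.partition("] ")
    # Each token after the tag is a key=value pair
    fields = dict(kv.split("=", 1) for kv in rest.split())
    return {
        "step": int(fields["step"]),
        "action": fields["action"],
        "reward": float(fields["reward"]),
        "done": fields["done"] == "true",
        "error": None if fields["error"] == "null" else fields["error"],
    }
```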
5. Docker (Alternative)
Build and run the environment as a Docker container:
# From the firewatch_env/ directory
docker build -t firewatch-env ./server
docker run -p 7860:7860 firewatch-env
The server will be available at http://localhost:7860. Set SPACE_URL=http://localhost:7860 when running inference.py (or let auto-detection find it).
6. Deploy to HuggingFace Spaces
openenv validate # must pass with zero errors
openenv push --repo-id 10doshi12/firewatch-env
Your environment will be live at https://10doshi12-firewatch-env.hf.space.
Project Structure
firewatch_env/
├── models.py        # Pydantic models (FirewatchAction, SystemObservation, etc.)
├── simulation.py    # ServiceMesh + generate_episode() + fault physics
├── actions.py       # ActionHandler: all 17 action types
├── rewards.py       # RewardEngine + grade() + EpisodeResult
├── config.py        # Constants, TASKS dict, topology (pure data)
├── client.py        # OpenEnv-generated WebSocket client
├── inference.py     # LLM agent loop (stdout eval format)
├── openenv.yaml     # OpenEnv spec definition
├── .env.example     # Environment variable template
├── Dockerfile       # Multi-stage Docker build
├── pyproject.toml   # Dependencies & entry points
├── server/
│   ├── app.py       # FastAPI application (entry point)
│   └── firewatch_env_environment.py  # Environment wiring
└── tests/
    ├── test_integration.py
    ├── test_simulation.py
    └── test_inference.py
Action Space Reference
Investigation Actions (read-only)
| Action | Description |
|---|---|
| `fetch_logs` | Populates recent_logs on the target service |
| `get_metrics_detail` | Returns 3-tick metric trend summary |
| `trace_dependencies` | Returns full upstream/downstream dependency chain |
| `strace_process` | System-call level process inspection |
| `profiler_dump` | CPU/memory profiler output |
| `check_gc_pressure` | GC pause times and heap pressure |
| `trace_distributed_request` | End-to-end distributed trace |
| `inspect_thread_pool` | Thread pool utilization and deadlock detection |
| `inspect_commit_diff` | Recent deployment diff |
Remediation Actions (mutate state)
| Action | Description |
|---|---|
| `restart_service` | Resets OOM state; wrong if error_rate < 0.10 |
| `rollback_deploy` | Halts bad deployment progression |
| `revert_config` | Restores connection pool / config settings |
| `scale_replicas` | Increases memory headroom |
| `circuit_break` | Suppresses cascade for 3 ticks |
| `traffic_shift` | Redirects traffic away from degraded service |
Meta Actions
| Action | Description |
|---|---|
| `declare_resolved` | Terminates episode and triggers grading |
| `escalate` | Records escalation (no state change) |
Fault Types
| Fault | Signal in Logs | Correct Remediation |
|---|---|---|
| `oom` | OOMKilled, exit code 137 | `restart_service` |
| `bad_deploy` | Error spike post-deployment SHA | `rollback_deploy` |
| `config_drift` | HikariCP pool exhaustion, 30s timeouts | `revert_config` |
| `network_partition` | Connection refused, circuit breaker OPEN | `circuit_break` or `restart_service` |
| `memory_leak` | Gradual latency increase, slow memory growth | `scale_replicas`, then `restart_service` |
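The fault table amounts to a small fault-to-remediation playbook. A hypothetical sketch for scripting against episode transcripts; the authoritative logic lives in `rewards.py` and `actions.py`:

```python
# Mapping taken from the fault table above; the names are the env's
# action_type strings. memory_leak lists its two-step sequence in order.
FAULT_PLAYBOOK = {
    "oom": ["restart_service"],
    "bad_deploy": ["rollback_deploy"],
    "config_drift": ["revert_config"],
    "network_partition": ["circuit_break", "restart_service"],
    "memory_leak": ["scale_replicas", "restart_service"],
}

def is_correct_remediation(fault, action):
    """True if `action` is an accepted remediation for `fault`."""
    return action in FAULT_PLAYBOOK.get(fault, [])
```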
Scoring
The grader produces a score between 0.0 and 1.0 based on four components:
| Component | Weight | What it Measures |
|---|---|---|
| Recovery | 40% | Did system health improve? |
| Speed | 25% | How quickly was MTTM achieved? |
| Precision | 20% | Were wrong actions avoided? |
| SLO | 15% | How much error budget remained? |
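The weighting above can be sketched as a plain weighted sum. Illustrative only; the real `grade()` in `rewards.py` may clamp or shape the components differently before combining them:

```python
# Weights copied from the scoring table above
WEIGHTS = {"recovery": 0.40, "speed": 0.25, "precision": 0.20, "slo": 0.15}

def combined_score(components):
    """Weighted sum of per-component scores, each assumed in [0, 1]."""
    return round(sum(WEIGHTS[k] * components[k] for k in WEIGHTS), 4)
```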
Running Tests
cd firewatch_env
uv run pytest tests/ # all tests
uv run pytest tests/test_integration.py # integration only
uv run pytest tests/test_simulation.py # simulation logic
uv run pytest tests/test_integration.py::test_reset_deterministic # single test
Troubleshooting
| Problem | Solution |
|---|---|
| `uv: command not found` | Install uv: `pip install uv` or `curl -LsSf https://astral.sh/uv/install.sh \| sh` |
| `openenv-core` import error | Run `uv sync` inside `firewatch_env/` |
| Server won't start | Check port 8000 isn't in use: `lsof -i :8000` |
| `inference.py` can't find server | Server auto-detection probes `localhost:8000` → `localhost:7860`. Ensure the server is running. |
| LLM API errors / 401 | Verify `HF_TOKEN` in `.env`. Without it, the rule-based fallback agent runs (no LLM). |
| Score is 0.0 | Agent didn't call `declare_resolved`, or the SLO budget hit 0%. Check the action logs. |
| Docker build fails | Ensure Docker Desktop is running. Build from `firewatch_env/`: `docker build -t fw ./server` |
Next Steps
- Swap the model: change `MODEL_NAME` in `.env` to test different HF-hosted models (e.g. `Qwen/Qwen2.5-72B-Instruct`, `meta-llama/Llama-3.3-70B-Instruct`)
- Tune the agent: edit `SYSTEM_PROMPT` and `_recovery_hint()` in `inference.py` to improve decision-making
- Add actions: extend `actions.py` with new diagnostic or remediation actions
- Custom tasks: define new scenarios in `config.py` and `openenv.yaml`
- Benchmark: compare scores across models to find the best SRE agent
FirewatchEnv · Meta PyTorch OpenEnv Hackathon India 2026