---
title: distributed-systems-debug-env
sdk: docker
app_port: 8000
colorFrom: blue
colorTo: indigo
short_description: OpenEnv RL env for debugging distributed systems failures.
base_path: /web
---
# Distributed Systems Debug Environment
## Overview
This project provides an OpenEnv-compatible RL environment for debugging distributed systems failures.
The environment simulates a production-style pipeline:
- Gateway service (sync HTTP orchestration)
- Auth service (sync dependency)
- Redis queue (message bus)
- Worker service (async consumer + lock handling)
- SQLite sink (persistence simulation)
An agent interacts with the environment only through shell commands and must diagnose and fix the injected faults.
## Why this environment
Most RL environments focus on games or synthetic workflows. This one targets bugs I have run into on the job and focuses on the debugging skills used in real systems engineering:
- reading logs under uncertainty
- triaging latency and queue symptoms
- fixing misconfigurations safely
- validating recovery from metrics
## Architecture
```
Agent command -> /step (FastAPI)
                   |
                   +-> executes shell command (sandboxed, 10s timeout)
                   +-> polls metrics
                   +-> grades progress

Services (same container):
  gateway:3000 -> auth:3001 -> redis:6379 -> worker -> sqlite
```
## Observation Space
| Field | Type | Description |
|---|---|---|
| `command_output` | string | stdout+stderr of last command |
| `metrics.gateway_success_rate` | float [0,1] | rolling gateway success rate |
| `metrics.gateway_p99_latency_ms` | float | rolling p99 latency |
| `metrics.queue_depth` | int | Redis queue depth |
| `metrics.worker_restart_count` | int | simulated crash-loop count |
| `metrics.consumer_stall_count` | int | lock-starvation stall count |
| `process_status` | object | runtime status by service |
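For client code, the observation can be modeled with typed dictionaries. This is a minimal sketch, assuming the payload is JSON and that `metrics` is a nested object, as the dotted field names in the table suggest. The type names (`Metrics`, `Observation`) and the `looks_healthy` helper with its thresholds are hypothetical and not part of the environment.

```python
from typing import Any, TypedDict


class Metrics(TypedDict):
    gateway_success_rate: float    # rolling success rate in [0, 1]
    gateway_p99_latency_ms: float  # rolling p99 latency
    queue_depth: int               # Redis queue depth
    worker_restart_count: int      # simulated crash-loop count
    consumer_stall_count: int      # lock-starvation stall count


class Observation(TypedDict):
    command_output: str            # stdout+stderr of the last command
    metrics: Metrics
    process_status: dict[str, Any] # runtime status by service


def looks_healthy(obs: Observation, min_success: float = 0.9, max_queue: int = 10) -> bool:
    """Hypothetical heuristic; both thresholds are illustrative, not the grader's."""
    m = obs["metrics"]
    return m["gateway_success_rate"] >= min_success and m["queue_depth"] <= max_queue
```

An agent loop could use a check like this to decide when to stop issuing commands, but the grader remains the source of truth for success.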
## Action Space
Single command action:
```json
{ "command": "<bash command>" }
```
Examples:
- `tail -20 /tmp/worker.log`
- `redis-cli DEL LOCK:job_processor`
- `cat /mesh/gateway/blocked_routes.json`
- `kill -HUP $(cat /tmp/worker.pid)`
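The action payload can also be sent from Python. The sketch below uses only the standard library and assumes the API is reachable at `http://localhost:8000`, as in the Docker setup; the helper names (`build_step_request`, `step`) are hypothetical, and the reply is assumed to be a JSON object.

```python
import json
import urllib.request


def build_step_request(command: str, base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Wrap a shell command in the single-field action payload for POST /step."""
    body = json.dumps({"command": command}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/step",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def step(command: str, base_url: str = "http://localhost:8000", timeout: float = 30.0) -> dict:
    """Send the action and decode the reply (JSON object shape is an assumption)."""
    request = build_step_request(command, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running, `step("tail -20 /tmp/worker.log")` would submit the first example command above.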
## Tasks
| Task | Difficulty | Goal |
|---|---|---|
| `cascading-timeout` | easy | restore successful sync flow (auth delay vs gateway timeout) |
| `byzantine-queue-fault` | medium | remove poison message and stabilize worker |
| `distributed-lock-starvation` | hard | clear stale lock and resume consumption |
| `backpressure-cascade` | hard | recover throughput and reduce queue growth |
| `route-partition` | hard | unblock gateway->redis route policy |
| `registry-corruption` | medium | repair corrupted auth registry entry and restore request flow |
| `job-generator-runaway` | hard | reduce enqueue pressure so the queue drains sustainably |
## Reward Function
- Terminal reward: `1.0` when grader score >= `0.95`
- Dense shaping from grader progress + investigation command bonus + metric improvements
- Penalties for blocked/damaging actions and repeated non-productive behavior
- Reward clamped to `[0.0, 1.0]`
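One way to read the rules above is as a single function. This is only a sketch: the README does not state how the shaping terms are weighted or combined, so the additive form, the parameter names, and the use of a score delta as "progress" are all assumptions. Only the `0.95` threshold, the terminal value `1.0`, and the `[0.0, 1.0]` clamp come from the list above.

```python
def shaped_reward(
    grader_score: float,
    prev_grader_score: float,
    investigation_bonus: float = 0.0,  # assumed: small bonus for diagnostic commands
    metric_bonus: float = 0.0,         # assumed: bonus for improved metrics
    penalty: float = 0.0,              # assumed: blocked, damaging, or repeated actions
    success_threshold: float = 0.95,
) -> float:
    """Hypothetical combination of the reward terms; not the environment's code."""
    if grader_score >= success_threshold:
        return 1.0  # terminal reward
    progress = max(0.0, grader_score - prev_grader_score)
    raw = progress + investigation_bonus + metric_bonus - penalty
    return min(1.0, max(0.0, raw))  # clamp to [0.0, 1.0]
```

Because of the clamp, a heavily penalized step yields `0.0` rather than a negative reward.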
## Baseline Inference policy (3 of 7 by default)
All seven tasks are implemented in the environment.
`inference.py` runs these default tasks for runtime reliability:
1. `cascading-timeout` (easy)
2. `byzantine-queue-fault` (medium)
3. `distributed-lock-starvation` (hard)
Override with:
```bash
TASKS_CSV=cascading-timeout,route-partition python inference.py
```
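The override can be parsed along the following lines. This is a sketch of one plausible approach, not the actual logic in `inference.py`; the function name `select_tasks` and the choice to reject unknown task names are assumptions. The task names themselves are taken from the Tasks table.

```python
import os
from typing import Mapping, Optional

DEFAULT_TASKS = ["cascading-timeout", "byzantine-queue-fault", "distributed-lock-starvation"]
ALL_TASKS = set(DEFAULT_TASKS) | {
    "backpressure-cascade",
    "route-partition",
    "registry-corruption",
    "job-generator-runaway",
}


def select_tasks(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return tasks from TASKS_CSV, or the three defaults when it is unset or empty."""
    env = os.environ if env is None else env
    raw = env.get("TASKS_CSV", "").strip()
    if not raw:
        return list(DEFAULT_TASKS)
    tasks = [name.strip() for name in raw.split(",") if name.strip()]
    unknown = [name for name in tasks if name not in ALL_TASKS]
    if unknown:
        raise ValueError(f"unknown task(s): {', '.join(unknown)}")
    return tasks
```

Failing fast on a misspelled task name avoids spending a full run on a task that does not exist.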
## Setup
### Local
```bash
python3.12 -m venv .venv
. .venv/bin/activate
pip install -r requirements.txt
bun install --cwd mesh/gateway
bun install --cwd mesh/auth
bun install --cwd mesh/worker
APP_ROOT=$(pwd) MESH_ROOT=$(pwd)/mesh ./start.sh
```
### Docker
```bash
docker build -t dist-debug-env .
docker run -p 8000:8000 dist-debug-env
```
### API smoke check
```bash
curl http://localhost:8000/health
curl -X POST "http://localhost:8000/reset?task_name=cascading-timeout"
curl -X POST http://localhost:8000/step \
-H "Content-Type: application/json" \
-d '{"command":"ls /tmp"}'
```
## Inference script contract
`inference.py` writes log lines to stdout in a strict format:
```text
[START] task=<task_name> env=<benchmark> model=<model_name>
[STEP]  step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
[END]   success=<true|false> steps=<n> score=<0.00> rewards=<r1,r2,...,rn>
```
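Downstream tooling may want these lines as structured records. The sketch below parses `[STEP]` and `[END]` lines with regular expressions that accept any amount of whitespace after the tag; the function names and the returned dictionary keys are hypothetical, and treating `error=null` as "no error" is an assumption based on the `<msg|null>` placeholder.

```python
import re

STEP_RE = re.compile(
    r"^\[STEP\]\s+step=(?P<step>\d+) action=(?P<action>.*) "
    r"reward=(?P<reward>\d+\.\d{2}) done=(?P<done>true|false) error=(?P<error>.*)$"
)
END_RE = re.compile(
    r"^\[END\]\s+success=(?P<success>true|false) steps=(?P<steps>\d+) "
    r"score=(?P<score>\d+\.\d{2}) rewards=(?P<rewards>.*)$"
)


def parse_step(line: str) -> dict:
    """Parse one [STEP] line into a dict; raises ValueError on a mismatch."""
    m = STEP_RE.match(line)
    if m is None:
        raise ValueError(f"not a [STEP] line: {line!r}")
    return {
        "step": int(m["step"]),
        "action": m["action"],  # may contain spaces
        "reward": float(m["reward"]),
        "done": m["done"] == "true",
        "error": None if m["error"] == "null" else m["error"],
    }


def parse_end(line: str) -> dict:
    """Parse one [END] line into a dict; raises ValueError on a mismatch."""
    m = END_RE.match(line)
    if m is None:
        raise ValueError(f"not an [END] line: {line!r}")
    return {
        "success": m["success"] == "true",
        "steps": int(m["steps"]),
        "score": float(m["score"]),
        "rewards": [float(r) for r in m["rewards"].split(",") if r],
    }
```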
## Logging
Service logs (JSON-lines):
- `/tmp/gateway.log`
- `/tmp/auth.log`
- `/tmp/worker.log`
Common fields:
- `ts`, `level`, `service`, `event`, `pattern`
Example investigation commands:
```bash
tail -30 /tmp/worker.log
jq 'select(.level=="ERROR")' /tmp/worker.log
redis-cli LLEN job_queue
```
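When `jq` is not available, the same filter can be written in Python. This sketch assumes each log line is a standalone JSON object carrying the `level` field listed above; the function name `filter_log` is hypothetical, and lines that fail to parse are skipped rather than treated as errors.

```python
import json
from typing import Iterable, Iterator


def filter_log(lines: Iterable[str], level: str = "ERROR") -> Iterator[dict]:
    """Yield JSON-lines records whose `level` field equals the given level."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip partial or non-JSON lines
        if isinstance(record, dict) and record.get("level") == level:
            yield record
```

Passing an open file handle for `/tmp/worker.log` as `lines` streams the log without loading it fully into memory.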
## Baseline scores
Baseline scores depend on endpoint/model latency and quality. Reproduce with:
```bash
HF_TOKEN=<token> API_BASE_URL=<endpoint> MODEL_NAME=<model> python inference.py
```
## Run this locally
Use this checklist when running the full baseline end-to-end on your machine.
1. Install dependencies and validate project setup:
```bash
./setup-dev.sh
```
2. Activate the project virtual environment (required so `uvicorn` and Python deps are on PATH):
```bash
source .venv/bin/activate
```
3. Start the environment API (leave this terminal running):
```bash
APP_ROOT=$(pwd) MESH_ROOT=$(pwd)/mesh ./start.sh
```
4. In another terminal, activate the venv again and export the required inference variables:
```bash
source .venv/bin/activate
export API_BASE_URL="https://openrouter.ai/api/v1"
export MODEL_NAME="<your-model>"
export HF_TOKEN="<your-api-key>"
# Optional override; default already runs 3 baseline tasks
export TASKS_CSV="cascading-timeout,byzantine-queue-fault,distributed-lock-starvation"
```
If you have a `.env` file, you can load the variables from it instead:
```bash
set -a
source .env
set +a
```
5. Run inference with a 20-minute cap (1200 seconds) and capture the output:
```bash
# macOS (coreutils): gtimeout ; Linux: timeout
gtimeout 1200 python inference.py | tee inference.out
```
6. Validate structured stdout format quickly:
```bash
python - <<'PY'
import re, sys
from pathlib import Path

lines = Path("inference.out").read_text(encoding="utf-8").splitlines()
if not lines:
    print("FAIL: no output")
    raise SystemExit(1)

start_re = re.compile(r'^\[START\] task=\S+ env=\S+ model=.+$')
step_re = re.compile(r'^\[STEP\]\s{2}step=\d+ action=.* reward=\d+\.\d{2} done=(true|false) error=.*$')
end_re = re.compile(r'^\[END\]\s{3}success=(true|false) steps=\d+ score=\d+\.\d{2} rewards=.*$')

for i, line in enumerate(lines, 1):
    if line.startswith("[START]") and not start_re.match(line):
        print(f"FAIL: bad START line {i}: {line}")
        raise SystemExit(1)
    if line.startswith("[STEP]") and not step_re.match(line):
        print(f"FAIL: bad STEP line {i}: {line}")
        raise SystemExit(1)
    if line.startswith("[END]") and not end_re.match(line):
        print(f"FAIL: bad END line {i}: {line}")
        raise SystemExit(1)

print("PASS: stdout format valid")
PY
```
7. Re-run the required submission gates:
```bash
openenv validate .
docker build -t dist-debug-env:local .
```
## Benchmarks between Models
### 3-Task Benchmark
<img width="1177" height="752" alt="Screenshot 2026-04-04 at 11 54 25 PM" src="https://github.com/user-attachments/assets/3dbfa87a-6696-4589-a908-baa3f498bda8" />
### 7-Task Benchmark
<img width="1294" height="240" alt="Screenshot 2026-04-05 at 12 30 45 AM" src="https://github.com/user-attachments/assets/1d0d3847-212e-46ba-967f-f79be3f9067c" />