---
title: distributed-systems-debug-env
sdk: docker
app_port: 8000
colorFrom: blue
colorTo: indigo
short_description: OpenEnv RL env for debugging distributed systems failures.
base_path: /web
---


# Distributed Systems Debug Environment

## Overview
This project provides an OpenEnv-compatible RL environment for debugging distributed systems failures.

The environment simulates a production-style pipeline:

- Gateway service (sync HTTP orchestration)
- Auth service (sync dependency)
- Redis queue (message bus)
- Worker service (async consumer + lock handling)
- SQLite sink (persistence simulation)

An agent interacts only through shell commands and must diagnose/fix injected faults.

## Why this environment
Most RL environments focus on games or synthetic workflows. This one targets bugs I have personally encountered at work and focuses on the debugging skills used in real systems engineering:

- reading logs under uncertainty
- triaging latency and queue symptoms
- fixing misconfigurations safely
- validating recovery from metrics

## Architecture
```
Agent command -> /step (FastAPI)
                  |
                  +-> executes shell command (sandboxed, 10s timeout)
                  +-> polls metrics
                  +-> grades progress

Services (same container):
  gateway:3000 -> auth:3001 -> redis:6379 -> worker -> sqlite
```

## Observation Space
| Field | Type | Description |
|---|---|---|
| `command_output` | string | stdout+stderr of last command |
| `metrics.gateway_success_rate` | float [0,1] | rolling gateway success rate |
| `metrics.gateway_p99_latency_ms` | float | rolling p99 latency |
| `metrics.queue_depth` | int | Redis queue depth |
| `metrics.worker_restart_count` | int | simulated crash-loop count |
| `metrics.consumer_stall_count` | int | lock-starvation stall count |
| `process_status` | object | runtime status by service |
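
The fields above can be mirrored in a small typed wrapper on the agent side. This is a minimal sketch, not the environment's own client code: it assumes the dotted `metrics.*` names correspond to a nested `metrics` object in the JSON payload, and the `Metrics`, `Observation`, and `parse_observation` names are hypothetical.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Metrics:
    gateway_success_rate: float    # rolling success rate in [0, 1]
    gateway_p99_latency_ms: float  # rolling p99 latency
    queue_depth: int               # Redis queue depth
    worker_restart_count: int      # simulated crash-loop count
    consumer_stall_count: int      # lock-starvation stall count


@dataclass(frozen=True)
class Observation:
    command_output: str
    metrics: Metrics
    process_status: dict[str, Any]


def parse_observation(payload: dict[str, Any]) -> Observation:
    """Convert a decoded JSON observation into typed objects.

    Assumes `metrics` is a nested object; adjust if the real payload is flat.
    """
    m = payload["metrics"]
    return Observation(
        command_output=payload["command_output"],
        metrics=Metrics(
            gateway_success_rate=float(m["gateway_success_rate"]),
            gateway_p99_latency_ms=float(m["gateway_p99_latency_ms"]),
            queue_depth=int(m["queue_depth"]),
            worker_restart_count=int(m["worker_restart_count"]),
            consumer_stall_count=int(m["consumer_stall_count"]),
        ),
        process_status=dict(payload.get("process_status", {})),
    )
```

Typed access such as `obs.metrics.queue_depth` makes it harder to misspell a metric name when writing a policy or a validation check.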

## Action Space
Single command action:

```json
{ "command": "<bash command>" }
```

Examples:
- `tail -20 /tmp/worker.log`
- `redis-cli DEL LOCK:job_processor`
- `cat /mesh/gateway/blocked_routes.json`
- `kill -HUP $(cat /tmp/worker.pid)`
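
For agents written in Python, a command action can be posted to `/step` with only the standard library. The sketch below assumes the API listens on `http://localhost:8000` (the `app_port` in the front matter) and that the response body is JSON; `build_step_request` and `send` are hypothetical helper names, not part of this project.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: local run on the default app_port


def build_step_request(command: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST /step request for one shell command."""
    body = json.dumps({"command": command}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/step",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 15.0) -> dict:
    """Send the request and decode the response, assuming a JSON body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the environment running, `send(build_step_request("tail -20 /tmp/worker.log"))` would submit the first example command. The client timeout is set above the 10s sandbox timeout so that slow commands still return a response.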

## Tasks
| Task | Difficulty | Goal |
|---|---|---|
| `cascading-timeout` | easy | restore successful sync flow (auth delay vs gateway timeout) |
| `byzantine-queue-fault` | medium | remove poison message and stabilize worker |
| `distributed-lock-starvation` | hard | clear stale lock and resume consumption |
| `backpressure-cascade` | hard | recover throughput and reduce queue growth |
| `route-partition` | hard | unblock gateway->redis route policy |
| `registry-corruption` | medium | repair corrupted auth registry entry and restore request flow |
| `job-generator-runaway` | hard | reduce enqueue pressure so the queue drains sustainably |

## Reward Function
- Terminal reward: `1.0` when grader score >= `0.95`
- Dense shaping from grader progress + investigation command bonus + metric improvements
- Penalties for blocked/damaging actions and repeated non-productive behavior
- Reward clamped to `[0.0, 1.0]`
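
The rules above can be sketched as a single function. This is an illustration of the stated structure only: the way the dense terms are combined, and every parameter name, is an assumption, and the environment's actual weights are not documented here.

```python
def shaped_reward(
    grader_score: float,
    prev_grader_score: float,
    investigation_bonus: float = 0.0,  # hypothetical bonus for diagnostic commands
    metric_bonus: float = 0.0,         # hypothetical bonus for improved metrics
    penalty: float = 0.0,              # hypothetical penalty for blocked or repeated actions
) -> float:
    """Return a per-step reward in [0.0, 1.0]."""
    # Terminal case: the task counts as solved at a grader score of 0.95 or more.
    if grader_score >= 0.95:
        return 1.0
    # Dense shaping: only forward progress on the grader score earns reward.
    progress = max(0.0, grader_score - prev_grader_score)
    raw = progress + investigation_bonus + metric_bonus - penalty
    # Clamp so penalties never push the reward below zero.
    return min(1.0, max(0.0, raw))
```

For example, a step with no grader progress and a penalty of `0.3` yields `0.0` rather than a negative value, because of the clamp.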

## Baseline inference policy (3 of 7 by default)
All seven tasks are implemented in the environment.

By default, `inference.py` runs only the following three tasks, which keeps baseline runs reliable:

1. `cascading-timeout` (easy)
2. `byzantine-queue-fault` (medium)
3. `distributed-lock-starvation` (hard)

Override with:

```bash
TASKS_CSV=cascading-timeout,route-partition python inference.py
```
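
One way such an override could be parsed is shown below. This is a sketch under assumptions: how `inference.py` actually reads `TASKS_CSV` is not shown in this README, and `select_tasks` is a hypothetical name. The task names are the ones listed in the Tasks table.

```python
import os
from typing import Mapping, Optional

DEFAULT_TASKS = (
    "cascading-timeout",
    "byzantine-queue-fault",
    "distributed-lock-starvation",
)
KNOWN_TASKS = frozenset(DEFAULT_TASKS) | {
    "backpressure-cascade",
    "route-partition",
    "registry-corruption",
    "job-generator-runaway",
}


def select_tasks(environ: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the tasks to run, honoring a comma-separated TASKS_CSV override."""
    env = os.environ if environ is None else environ
    raw = env.get("TASKS_CSV", "")
    # Tolerate stray whitespace and empty entries such as a trailing comma.
    tasks = [name.strip() for name in raw.split(",") if name.strip()]
    if not tasks:
        return list(DEFAULT_TASKS)
    unknown = [name for name in tasks if name not in KNOWN_TASKS]
    if unknown:
        raise ValueError(f"unknown task(s): {', '.join(unknown)}")
    return tasks
```

Rejecting unknown names early avoids spending a long inference run on a misspelled task.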

## Setup
### Local
```bash
python3.12 -m venv .venv
. .venv/bin/activate
pip install -r requirements.txt

bun install --cwd mesh/gateway
bun install --cwd mesh/auth
bun install --cwd mesh/worker

APP_ROOT=$(pwd) MESH_ROOT=$(pwd)/mesh ./start.sh
```

### Docker
```bash
docker build -t dist-debug-env .
docker run -p 8000:8000 dist-debug-env
```

### API smoke check
```bash
curl http://localhost:8000/health
curl -X POST "http://localhost:8000/reset?task_name=cascading-timeout"
curl -X POST http://localhost:8000/step \
  -H "Content-Type: application/json" \
  -d '{"command":"ls /tmp"}'
```

## Inference script contract
`inference.py` emits strict logs:

```text
[START] task=<task_name> env=<benchmark> model=<model_name>
[STEP]  step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
[END]   success=<true|false> steps=<n> score=<0.00> rewards=<r1,r2,...,rn>
```
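
A minimal sketch of formatters that produce lines in this shape follows. The function names are hypothetical, and rendering each entry of `rewards` with two decimals is an assumption; the padding after `[STEP]` and `[END]` follows the template above.

```python
from typing import Optional, Sequence


def _flag(value: bool) -> str:
    """Render a boolean as lowercase true/false."""
    return "true" if value else "false"


def format_start(task: str, env: str, model: str) -> str:
    return f"[START] task={task} env={env} model={model}"


def format_step(step: int, action: str, reward: float, done: bool,
                error: Optional[str] = None) -> str:
    err = "null" if error is None else error
    # Two spaces after [STEP] keep the fields aligned with [START].
    return (f"[STEP]  step={step} action={action} "
            f"reward={reward:.2f} done={_flag(done)} error={err}")


def format_end(success: bool, steps: int, score: float,
               rewards: Sequence[float]) -> str:
    joined = ",".join(f"{r:.2f}" for r in rewards)
    # Three spaces after [END] for the same alignment.
    return (f"[END]   success={_flag(success)} steps={steps} "
            f"score={score:.2f} rewards={joined}")
```

For instance, `format_step(1, "ls /tmp", 0.1, False)` returns `[STEP]  step=1 action=ls /tmp reward=0.10 done=false error=null`.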

## Logging
Service logs (JSON-lines):
- `/tmp/gateway.log`
- `/tmp/auth.log`
- `/tmp/worker.log`

Common fields:
- `ts`, `level`, `service`, `event`, `pattern`

Example investigation commands:
```bash
tail -30 /tmp/worker.log
jq 'select(.level=="ERROR")' /tmp/worker.log
redis-cli LLEN job_queue
```
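
The same filtering can be done in Python when `jq` is unavailable. This sketch assumes one JSON object per line with the common fields listed above; `iter_events` is a hypothetical helper, and lines that are not valid JSON are skipped rather than treated as errors.

```python
import json
from typing import Iterable, Iterator, Optional


def iter_events(lines: Iterable[str], level: Optional[str] = None) -> Iterator[dict]:
    """Yield parsed log records, optionally keeping only one level."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate partial or non-JSON lines
        if not isinstance(record, dict):
            continue
        if level is None or record.get("level") == level:
            yield record
```

Usage, with the environment running:

```python
with open("/tmp/worker.log", encoding="utf-8") as fh:
    for event in iter_events(fh, level="ERROR"):
        print(event.get("ts"), event.get("event"), event.get("pattern"))
```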

## Baseline scores
Baseline scores depend on endpoint/model latency and quality. Reproduce with:

```bash
HF_TOKEN=<token> API_BASE_URL=<endpoint> MODEL_NAME=<model> python inference.py
```


## Run this locally
Use this checklist when running the full baseline end-to-end on your machine.

1. Install dependencies and validate project setup:
```bash
./setup-dev.sh
```

2. Activate the project virtual environment (required so `uvicorn` and Python deps are on PATH):
```bash
source .venv/bin/activate
```

3. Start the environment API (leave this terminal running):
```bash
APP_ROOT=$(pwd) MESH_ROOT=$(pwd)/mesh ./start.sh
```

4. In another terminal, activate the virtual environment again and export the required inference variables:
```bash
source .venv/bin/activate
export API_BASE_URL="https://openrouter.ai/api/v1"
export MODEL_NAME="<your-model>"
export HF_TOKEN="<your-api-key>"

# Optional override; default already runs 3 baseline tasks
export TASKS_CSV="cascading-timeout,byzantine-queue-fault,distributed-lock-starvation"
```

If you have a `.env` file, you can load the variables from it instead:

```bash
set -a
source .env
set +a
```

5. Run inference with a 20-minute cap and capture the output:
```bash
# macOS (Homebrew coreutils) provides gtimeout; on Linux, use timeout instead
gtimeout 1200 python inference.py | tee inference.out
```

6. Validate structured stdout format quickly:
```bash
python - <<'PY'
import re
from pathlib import Path

lines = Path("inference.out").read_text(encoding="utf-8").splitlines()
if not lines:
    print("FAIL: no output")
    raise SystemExit(1)

start_re = re.compile(r'^\[START\] task=\S+ env=\S+ model=.+$')
step_re = re.compile(r'^\[STEP\]\s{2}step=\d+ action=.* reward=\d+\.\d{2} done=(true|false) error=.*$')
end_re = re.compile(r'^\[END\]\s{3}success=(true|false) steps=\d+ score=\d+\.\d{2} rewards=.*$')

for i, line in enumerate(lines, 1):
    if line.startswith("[START]") and not start_re.match(line):
        print(f"FAIL: bad START line {i}: {line}")
        raise SystemExit(1)
    if line.startswith("[STEP]") and not step_re.match(line):
        print(f"FAIL: bad STEP line {i}: {line}")
        raise SystemExit(1)
    if line.startswith("[END]") and not end_re.match(line):
        print(f"FAIL: bad END line {i}: {line}")
        raise SystemExit(1)

print("PASS: stdout format valid")
PY
```

7. Re-run required submission gates:
```bash
openenv validate .
docker build -t dist-debug-env:local .
```





## Benchmarks between models

### 3-task benchmark
<img width="1177" height="752" alt="Screenshot 2026-04-04 at 11 54 25 PM" src="https://github.com/user-attachments/assets/3dbfa87a-6696-4589-a908-baa3f498bda8" />

### 7-task benchmark
<img width="1294" height="240" alt="Screenshot 2026-04-05 at 12 30 45 AM" src="https://github.com/user-attachments/assets/1d0d3847-212e-46ba-967f-f79be3f9067c" />