Spaces:

Veer15
/

openenv-distributed-systems-debugging

Sleeping

App Files Files Community

openenv-distributed-systems-debugging / README.md

Veer15

Upload folder using huggingface_hub

0da1902 verified 2 months ago

preview code

raw

history blame contribute delete

7.46 kB

	---
	title: distributed-systems-debug-env
	sdk: docker
	app_port: 8000
	colorFrom: blue
	colorTo: indigo
	short_description: OpenEnv RL env for debugging distributed systems failures.
	base_path: /web
	---


	# Distributed Systems Debug Environment

	## Overview
	This project provides an OpenEnv-compatible RL environment for debugging distributed systems failures.

	The environment simulates a production-style pipeline:

	- Gateway service (sync HTTP orchestration)
	- Auth service (sync dependency)
	- Redis queue (message bus)
	- Worker service (async consumer + lock handling)
	- SQLite sink (persistence simulation)

	An agent interacts only through shell commands and must diagnose/fix injected faults.

	## Why this environment
	Most RL environments focus on games or synthetic workflows. This one targets some bugs that I have faced personally at my job focussing on debugging skills used in real systems engineering:

	- reading logs under uncertainty
	- triaging latency and queue symptoms
	- fixing misconfigurations safely
	- validating recovery from metrics

	## Architecture
	```
	Agent command -> /step (FastAPI)
	\|
	+-> executes shell command (sandboxed, 10s timeout)
	+-> polls metrics
	+-> grades progress

	Services (same container):
	gateway:3000 -> auth:3001 -> redis:6379 -> worker -> sqlite
	```

	## Observation Space
	\| Field \| Type \| Description \|
	\|---\|---\|---\|
	\| `command_output` \| string \| stdout+stderr of last command \|
	\| `metrics.gateway_success_rate` \| float [0,1] \| rolling gateway success rate \|
	\| `metrics.gateway_p99_latency_ms` \| float \| rolling p99 latency \|
	\| `metrics.queue_depth` \| int \| Redis queue depth \|
	\| `metrics.worker_restart_count` \| int \| simulated crash-loop count \|
	\| `metrics.consumer_stall_count` \| int \| lock-starvation stall count \|
	\| `process_status` \| object \| runtime status by service \|

	## Action Space
	Single command action:

	```json
	{ "command": "<bash command>" }
	```

	Examples:
	- `tail -20 /tmp/worker.log`
	- `redis-cli DEL LOCK:job_processor`
	- `cat /mesh/gateway/blocked_routes.json`
	- `kill -HUP $(cat /tmp/worker.pid)`

	## Tasks
	\| Task \| Difficulty \| Goal \|
	\|---\|---\|---\|
	\| `cascading-timeout` \| easy \| restore successful sync flow (auth delay vs gateway timeout) \|
	\| `byzantine-queue-fault` \| medium \| remove poison message and stabilize worker \|
	\| `distributed-lock-starvation` \| hard \| clear stale lock and resume consumption \|
	\| `backpressure-cascade` \| hard \| recover throughput and reduce queue growth \|
	\| `route-partition` \| hard \| unblock gateway->redis route policy \|
	\| `registry-corruption` \| medium \| repair corrupted auth registry entry and restore request flow \|
	\| `job-generator-runaway` \| hard \| reduce enqueue pressure so the queue drains sustainably \|

	## Reward Function
	- Terminal reward: `1.0` when grader score >= `0.95`
	- Dense shaping from grader progress + investigation command bonus + metric improvements
	- Penalties for blocked/damaging actions and repeated non-productive behavior
	- Reward clamped to `[0.0, 1.0]`

	## Baseline Inference policy (3 of 7 by default)
	All seven tasks are implemented in the environment.

	`inference.py` runs these default tasks for runtime reliability:

	1. `cascading-timeout` (easy)
	2. `byzantine-queue-fault` (medium)
	3. `distributed-lock-starvation` (hard)

	Override with:

	```bash
	TASKS_CSV=cascading-timeout,route-partition python inference.py
	```

	## Setup
	### Local
	```bash
	python3.12 -m venv .venv
	. .venv/bin/activate
	pip install -r requirements.txt

	bun install --cwd mesh/gateway
	bun install --cwd mesh/auth
	bun install --cwd mesh/worker

	APP_ROOT=$(pwd) MESH_ROOT=$(pwd)/mesh ./start.sh
	```

	### Docker
	```bash
	docker build -t dist-debug-env .
	docker run -p 8000:8000 dist-debug-env
	```

	### API smoke check
	```bash
	curl http://localhost:8000/health
	curl -X POST "http://localhost:8000/reset?task_name=cascading-timeout"
	curl -X POST http://localhost:8000/step \
	-H "Content-Type: application/json" \
	-d '{"command":"ls /tmp"}'
	```

	## Inference script contract
	`inference.py` emits strict logs:

	```text
	[START] task=<task_name> env=<benchmark> model=<model_name>
	[STEP] step=<n> action=<action_str> reward=<0.00> done=<true\|false> error=<msg\|null>
	[END] success=<true\|false> steps=<n> score=<0.00> rewards=<r1,r2,...,rn>
	```

	## Logging
	Service logs (JSON-lines):
	- `/tmp/gateway.log`
	- `/tmp/auth.log`
	- `/tmp/worker.log`

	Common fields:
	- `ts`, `level`, `service`, `event`, `pattern`

	Example investigation commands:
	```bash
	tail -30 /tmp/worker.log
	jq 'select(.level=="ERROR")' /tmp/worker.log
	redis-cli LLEN job_queue
	```

	## Baseline scores
	Baseline scores depend on endpoint/model latency and quality. Reproduce with:

	```bash
	HF_TOKEN=<token> API_BASE_URL=<endpoint> MODEL_NAME=<model> python inference.py
	```


	## Run this locally
	Use this checklist when running the full baseline end-to-end on your machine.

	1. Install dependencies and validate project setup:
	```bash
	./setup-dev.sh
	```

	2. Activate the project virtual environment (required so `uvicorn` and Python deps are on PATH):
	```bash
	source .venv/bin/activate
	```

	3. Start the environment API (leave this terminal running):
	```bash
	APP_ROOT=$(pwd) MESH_ROOT=$(pwd)/mesh ./start.sh
	```

	4. In another terminal, activate venv again and export required inference variables:
	```bash
	source .venv/bin/activate
	export API_BASE_URL="https://openrouter.ai/api/v1"
	export MODEL_NAME="<your-model>"
	export HF_TOKEN="<your-api-key>"

	# Optional override; default already runs 3 baseline tasks
	export TASKS_CSV="cascading-timeout,byzantine-queue-fault,distributed-lock-starvation"
	```

	If you have a .env file you can set the variables from the file via this command

	```bash
	set -a
	source .env
	set +a
	```

	5. Run inference with a 20 minute cap and capture output:
	```bash
	# macOS (coreutils): gtimeout ; Linux: timeout
	gtimeout 1200 python inference.py \| tee inference.out
	```

	6. Validate structured stdout format quickly:
	```bash
	python - <<'PY'
	import re, sys
	from pathlib import Path

	lines = Path("inference.out").read_text(encoding="utf-8").splitlines()
	if not lines:
	print("FAIL: no output")
	raise SystemExit(1)

	start_re = re.compile(r'^\[START\] task=\S+ env=\S+ model=.+$')
	step_re = re.compile(r'^\[STEP\]\s{2}step=\d+ action=.* reward=\d+\.\d{2} done=(true\|false) error=.*$')
	end_re = re.compile(r'^\[END\]\s{3}success=(true\|false) steps=\d+ score=\d+\.\d{2} rewards=.*$')

	for i, line in enumerate(lines, 1):
	if line.startswith("[START]") and not start_re.match(line):
	print(f"FAIL: bad START line {i}: {line}")
	raise SystemExit(1)
	if line.startswith("[STEP]") and not step_re.match(line):
	print(f"FAIL: bad STEP line {i}: {line}")
	raise SystemExit(1)
	if line.startswith("[END]") and not end_re.match(line):
	print(f"FAIL: bad END line {i}: {line}")
	raise SystemExit(1)

	print("PASS: stdout format valid")
	PY
	```

	7. Re-run required submission gates:
	```bash
	openenv validate .
	docker build -t dist-debug-env:local .
	```





	## Benchmarks b/w Models

	### 3 Tasks Benchmark
	<img width="1177" height="752" alt="Screenshot 2026-04-04 at 11 54 25 PM" src="https://github.com/user-attachments/assets/3dbfa87a-6696-4589-a908-baa3f498bda8" />

	### 7 Task Benchmark
	<img width="1294" height="240" alt="Screenshot 2026-04-05 at 12 30 45 AM" src="https://github.com/user-attachments/assets/1d0d3847-212e-46ba-967f-f79be3f9067c" />