
title: Stack Doctor Environment Server
emoji: 🩺
colorFrom: red
colorTo: blue
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv

Stack Doctor

An OpenEnv RL environment where an overseer LLM diagnoses sick inference stacks. The agent probes subsystems, reconciles conflicting specialist-agent reports (some of which are wrong), and selects the minimal correct fix — all within a 6-step budget.

Inspired by real SM12x enablement bugs across vLLM, FlashInfer, SGLang, CUTLASS, and Flash-Attention.

Track: Statement 3.1 (World Modeling / Professional Tasks)
Sub-theme: Fleet AI (Scalable Oversight Agents, $10K)

Quick Start

import json

from stack_doctor import StackDoctorEnv, StackDoctorAction

env = StackDoctorEnv(base_url="https://bledden-stack-doctor.hf.space")
env.connect()

# Start a new incident
result = env.reset()
print(result.observation.incident_ticket)
print(result.observation.specialist_opinions)

# Investigate
result = env.step(StackDoctorAction(message=json.dumps(
    {"type": "inspect", "target": "logs"}
)))
print(result.observation.output)

# Submit diagnosis
result = env.step(StackDoctorAction(message=json.dumps(
    {"type": "submit", "root_cause": "arch_guard", "fix": "relax_arch_check"}
)))
print(f"Reward: {result.reward}, Done: {result.done}")

env.close()

Environment Design

Root Causes (6) and Fixes (6)

Root Cause	Fix	Real-World Motif
`arch_guard`	`relax_arch_check`	FlashInfer SM121 capability checks
`backend_whitelist`	`add_whitelist_entry`	vLLM Marlin SM121+ whitelist gaps
`runtime_loader`	`fix_runtime_path`	SGLang CUDA 13 runtime issues
`backend_selector`	`switch_backend`	CUTLASS dispatch mistakes
`model_config`	`update_model_config`	Model config mismatches on new hardware
`weight_layout`	`fix_weight_mapping`	Weight layout problems across backends
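
The pairing in the table can be captured as a simple lookup. This is a minimal sketch that assumes each root cause has exactly one matching fix, as the table suggests; `FIX_FOR_ROOT_CAUSE` and `fix_for` are illustrative names and are not part of the `stack_doctor` package.

```python
# Root cause -> fix pairs copied from the table above.
# FIX_FOR_ROOT_CAUSE and fix_for are hypothetical names, not the package's API.
FIX_FOR_ROOT_CAUSE = {
    "arch_guard": "relax_arch_check",
    "backend_whitelist": "add_whitelist_entry",
    "runtime_loader": "fix_runtime_path",
    "backend_selector": "switch_backend",
    "model_config": "update_model_config",
    "weight_layout": "fix_weight_mapping",
}


def fix_for(root_cause: str) -> str:
    """Return the fix paired with a root cause, or raise on an unknown label."""
    try:
        return FIX_FOR_ROOT_CAUSE[root_cause]
    except KeyError:
        raise ValueError(f"unknown root cause: {root_cause!r}") from None
```

Once an agent has settled on a root cause, a lookup like this yields the fix without spending another step.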

Specialists (4)

Four specialists (runtime, dispatch, kernel, loader) each offer an opinion. In every scenario, at least one of them gives wrong advice.

Action Space (JSON)

{"type":"inspect","target":"logs|config|snippet|metrics"}
{"type":"ask_specialist","specialist":"runtime|dispatch|kernel|loader"}
{"type":"apply_fix","fix":"<one of 6 fixes>"}
{"type":"submit","root_cause":"<one of 6>","fix":"<one of 6>"}

Reward Function

Event	Reward
`inspect` or `ask_specialist`	-0.25
Correct `apply_fix`	+3
Wrong `apply_fix`	-2
Correct `submit` (per field)	+8
Wrong `submit` (per field)	-4
Solved in ≤4 steps	+2 bonus
Invalid action	-2
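
The table can be turned into a small calculator for checking episode totals by hand. This is a sketch under stated assumptions, not the server's implementation: the bonus is assumed to apply only when both submitted fields are correct, every action (including `submit`) is assumed to count as a step, and the event encoding and the name `episode_reward` are hypothetical.

```python
def episode_reward(events: list[tuple]) -> float:
    """Sum rewards for one episode using the values in the reward table.

    Each event is a tuple (hypothetical encoding):
      ("inspect",) or ("ask_specialist",)
      ("apply_fix", is_correct)
      ("submit", root_cause_correct, fix_correct)
      ("invalid",)
    """
    total = 0.0
    solved = False
    for event in events:
        kind = event[0]
        if kind in ("inspect", "ask_specialist"):
            total -= 0.25
        elif kind == "apply_fix":
            total += 3.0 if event[1] else -2.0
        elif kind == "submit":
            rc_ok, fix_ok = event[1], event[2]
            total += 8.0 if rc_ok else -4.0   # root_cause field
            total += 8.0 if fix_ok else -4.0  # fix field
            solved = rc_ok and fix_ok
        elif kind == "invalid":
            total -= 2.0
        else:
            raise ValueError(f"unknown event: {kind!r}")
    # Assumption: the bonus needs a fully correct submit within 4 steps.
    if solved and len(events) <= 4:
        total += 2.0
    return total


print(episode_reward([("submit", True, True)]))  # 18.0
print(episode_reward([("inspect",), ("inspect",),
                      ("apply_fix", True), ("submit", True, True)]))  # 20.5
```

Under these assumptions, a single correct `submit` scores 18.0, and two inspections followed by a correct `apply_fix` and a correct `submit` score 20.5. These totals coincide with the Avg Reward values of the Oracle (1.0 steps) and Heuristic (4.0 steps) rows in the Baselines table, although the exact trajectories of those policies are not stated here.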

Baselines

Policy	Root-Cause Accuracy	Fix Accuracy	Avg Steps	Avg Reward
Oracle	100%	100%	1.0	18.0
Heuristic	100%	100%	4.0	20.5
Random	18%	18%	3.2	-4.1

Fleet AI: Specialist Oversight

The core mechanic targets Fleet AI's $10K sub-theme: the agent acts as a scalable oversight agent that reconciles conflicting specialist reports. Specialist reliability varies per scenario, so the agent cannot learn "always trust specialist X" and must evaluate the evidence in each case.
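
One way to picture evidence-first reconciliation is to treat specialist opinions as a shortlist and let inspected evidence decide among them. The sketch below is purely illustrative: the dictionary formats, the evidence counts, and the name `reconcile` are assumptions, and the real observation fields may look quite different.

```python
def reconcile(opinions: dict[str, str], evidence: dict[str, int]) -> str:
    """Pick the proposed root cause with the most supporting evidence.

    opinions: specialist name -> root cause it proposes (hypothetical format)
    evidence: root cause -> number of inspected clues supporting it (hypothetical)
    Specialist agreement is used only to break ties between equally supported causes.
    """
    def score(root_cause: str) -> tuple[int, int]:
        votes = sum(1 for proposed in opinions.values() if proposed == root_cause)
        return (evidence.get(root_cause, 0), votes)

    # Sort first so remaining ties resolve deterministically (alphabetically).
    candidates = sorted(set(opinions.values()))
    return max(candidates, key=score)


opinions = {
    "runtime": "runtime_loader",
    "dispatch": "backend_selector",
    "kernel": "arch_guard",
    "loader": "runtime_loader",
}
# Two clues from inspection point at arch_guard; nothing supports the rest.
print(reconcile(opinions, {"arch_guard": 2}))  # arch_guard
```

Here two specialists agree on `runtime_loader`, yet the evidence favors the lone `arch_guard` opinion, so a majority vote alone would have picked the wrong cause.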

Training

Training uses Unsloth + TRL GRPO with 3 reward signals:

- Valid JSON: can the output be parsed as an action plan?
- Environment reward: the cumulative reward from executing the plan.
- Efficiency: a bonus for shorter plans that still submit correctly.
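
The first signal can be sketched as a reward function in the shape TRL's GRPO trainer accepts for custom rewards: a callable that receives the completions and returns one float per completion. This sketch assumes plain-string completions and assumes a plan is a non-empty JSON list of action objects that each carry a `type` key. The plan format, the 1.0/0.0 scores, and the name `valid_json_reward` are assumptions, not the project's training code.

```python
import json


def valid_json_reward(completions: list[str], **kwargs) -> list[float]:
    """Score 1.0 if a completion parses as a JSON list of action objects, else 0.0."""
    rewards = []
    for text in completions:
        try:
            plan = json.loads(text)
        except (json.JSONDecodeError, TypeError):
            rewards.append(0.0)
            continue
        # Assumed plan shape: a non-empty list of dicts that each have a "type".
        is_plan = (
            isinstance(plan, list)
            and len(plan) > 0
            and all(isinstance(action, dict) and "type" in action for action in plan)
        )
        rewards.append(1.0 if is_plan else 0.0)
    return rewards
```

Keeping this check separate from the environment reward gives the model a learning signal for well-formed output even before its plans start solving incidents.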

Development

# Local server
cd stack_doctor && PYTHONPATH=. uvicorn server.app:app --port 8000

# Run baselines
PYTHONPATH=. python3 -c "from server.baselines import *; ..."

# Deploy to HF Spaces
openenv push --repo-id bledden/stack-doctor