Spaces:
Sleeping
title: Bug Triage Env
emoji: π
colorFrom: red
colorTo: yellow
sdk: docker
pinned: false
tags:
- openenv
π Bug Triage Environment v2.0
OpenEnv RL environment for the Meta PyTorch Hackathon x Scaler School of Technology
A multi-step reinforcement learning environment where an AI agent investigates and triages GitHub-style bug reports β deciding priority, labels, team ownership, and milestone β just like a senior engineer would.
Live: https://siteshcodes-bug-triage-env.hf.space GitHub: https://github.com/Siteshcodes/bug-triage-env
What Makes This Different
| Feature | v1.0 (before) | v2.0 (now) |
|---|---|---|
| Episode length | 1 step (quiz) | Multi-step investigation |
| Bug pool | 15 hardcrafted | 200+ procedurally generated |
| Label matching | Exact string | Semantic (synonym-aware) |
| Concurrency | Broken (global state) | Session-based, thread-safe |
| Information reveal | Everything at once | Progressive (title β body β comments β logs) |
| Tests | None | 50+ unit & integration tests |
| Grading depth | String matching | Weighted scoring + reasoning bonus |
Multi-Step Investigation
Unlike simple Q&A environments, the agent must investigate before deciding:
reset() β Agent sees: bug title + body preview
step(read_body) β Full description revealed
step(read_comments) β User comments revealed
step(check_logs) β Stack traces + severity signals revealed
step(submit, ...) β Final triage graded (reward returned)
Each investigation step costs a step (out of a limited budget). The agent must learn when it has enough information to decide correctly β balancing accuracy vs. efficiency.
Action Space
| Field | Type | Values |
|---|---|---|
action_type |
string | read_body Β· read_comments Β· check_logs Β· check_similar Β· submit |
priority |
string | P0 Β· P1 Β· P2 Β· P3 (only for submit) |
labels |
list[str] | bug Β· performance Β· security Β· ux Β· data-integrity Β· payments β¦ |
assigned_team |
string | backend Β· frontend Β· infra Β· security Β· devx |
milestone |
string | hotfix Β· v2.1 Β· backlog |
reasoning |
string | Free-form explanation (earns bonus points) |
Observation Space
| Field | Type | Description |
|---|---|---|
bug_report |
BugReport | Title, body, author, labels_hint, comments, stack_trace |
task_id |
string | Current difficulty: easy / medium / hard |
score |
float | Score from grader (0.0β1.0) |
reward |
float | Reward from last action (0.0β1.0) |
feedback |
string | Human-readable grader feedback |
done |
bool | Episode complete flag |
body_visible |
bool | Whether full body has been revealed |
comments_visible |
bool | Whether comments have been revealed |
logs_visible |
bool | Whether logs/stack traces have been revealed |
steps_taken |
int | Steps used so far |
max_steps |
int | Maximum steps allowed |
Tasks
Task 1 β Easy: Priority Assignment
Assign a single P0βP3 priority. Up to 4 steps.
- Grader:
server.task:priority_match - Scoring: exact β 0.95, Β±1 β 0.50, Β±2 β 0.20, else β 0.05
- Reward range: (0.0, 1.0)
Task 2 β Medium: Priority + Labels + Team
Assign priority, category labels, and team routing. Up to 5 steps.
- Grader:
server.task:priority_label_team - Scoring: priority 45% + label Jaccard (semantic) 40% + team 15%
- Reward range: (0.0, 1.0)
Task 3 β Hard: Full Triage
Full triage with security escalation penalty. Up to 6 steps.
- Grader:
server.task:full_triage - Scoring: priority 35% + labels 30% + team 20% + milestone 15%
- Penalty: β0.15 for missing security escalation
- Bonus: up to +0.15 for relevant reasoning
- Reward range: (0.0, 1.0)
Reward Function
- Priority: Graduated partial credit (0.95 β 0.50 β 0.20 β 0.05)
- Labels: Semantic Jaccard similarity with synonym matching (e.g., "defect" β "bug")
- Team routing: Binary accuracy, weighted per difficulty
- Security escalation: Hard penalty (β0.15) for ignoring security signals
- Reasoning bonus: Up to +0.15 for mentioning relevant signals
- Efficiency: +0.05 bonus for correct answers with minimal investigation
- Clamping: All scores strictly within (0.0, 1.0)
Procedural Bug Generation
The environment generates bugs from 7 template categories:
| Category | Example Bugs |
|---|---|
crash |
Service crashes, unhandled exceptions, segfaults |
security |
SQL injection, XSS, auth bypass, data exposure |
performance |
Memory leaks, slow queries, CPU spikes |
ui_bug |
Layout breaks, dark mode issues, accessibility |
data_corruption |
Race conditions, encoding issues, stale cache |
documentation |
Typos, outdated docs, missing guides |
api_bug |
Rate limiting bugs, pagination issues, webhook failures |
Each category has 5-6 title templates Γ 2 body templates Γ 6-12 variables = hundreds of unique combinations. The 15 original handcrafted bugs are preserved as a high-quality subset (40% chance per sample).
Setup
Run Locally
git clone https://github.com/Siteshcodes/bug-triage-env.git
cd bug-triage-env
pip install -r server/requirements.txt
uvicorn server.app:app --host 0.0.0.0 --port 7860
Run with Docker
docker build -t bug-triage-env .
docker run -p 7860:7860 bug-triage-env
Run Tests
pip install -e ".[dev]"
pytest tests/ -v
Run Inference (Hackathon Submission)
pip install openai openenv-core requests pydantic
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=meta-llama/Llama-3.3-70B-Instruct
export HF_TOKEN=your_hf_token_here
export ENV_BASE_URL=https://siteshcodes-bug-triage-env.hf.space
python inference.py
Environment Variables
| Variable | Description | Required |
|---|---|---|
API_BASE_URL |
LLM API endpoint | Yes |
MODEL_NAME |
Model identifier for inference | Yes |
HF_TOKEN |
Hugging Face / API key | Yes |
ENV_BASE_URL |
Bug Triage environment URL | Optional |
API Endpoints
| Method | Endpoint | Description |
|---|---|---|
| GET | / |
Interactive demo frontend |
| GET | /health |
Health check + active sessions |
| POST | /reset |
Start new episode (returns session_id) |
| POST | /step |
Investigation or submit action |
| GET | /state |
Current episode state |
| GET | /tasks |
List all 3 tasks |
| GET | /tasks/{id} |
Task metadata |
| GET | /leaderboard |
Top agent scores |
| POST | /leaderboard/submit |
Submit agent scores |
Example: Multi-Step Episode
# 1. Reset β get a bug and session_id
curl -X POST https://siteshcodes-bug-triage-env.hf.space/reset \
-H "Content-Type: application/json" \
-d '{"task_id": "hard"}'
# 2. Investigate β read full body (use session_id from step 1)
curl -X POST https://siteshcodes-bug-triage-env.hf.space/step \
-H "Content-Type: application/json" \
-d '{"session_id": "...", "action": {"action_type": "read_body"}}'
# 3. Investigate β read comments
curl -X POST https://siteshcodes-bug-triage-env.hf.space/step \
-H "Content-Type: application/json" \
-d '{"session_id": "...", "action": {"action_type": "read_comments"}}'
# 4. Submit triage decision
curl -X POST https://siteshcodes-bug-triage-env.hf.space/step \
-H "Content-Type: application/json" \
-d '{"session_id": "...", "action": {"action_type": "submit", "priority": "P0", "labels": ["bug", "security"], "assigned_team": "security", "milestone": "hotfix", "reasoning": "SQL injection in production β critical security vulnerability"}}'
Inference Log Format
Structured logs per OpenEnv spec (3 tasks, each with its own block):
[START] task=easy env=bug-triage-env model=meta-llama/Llama-3.3-70B-Instruct
[STEP] step=1 action=investigate:read_body reward=0.00 done=false error=null
[STEP] step=2 action=investigate:read_comments reward=0.00 done=false error=null
[STEP] step=3 action=priority=P0,team=backend,milestone=hotfix reward=0.95 done=true error=null
[END] success=true steps=3 score=0.95 rewards=0.95
[START] task=medium env=bug-triage-env model=meta-llama/Llama-3.3-70B-Instruct
[STEP] step=1 action=investigate:read_body reward=0.00 done=false error=null
[STEP] step=2 action=investigate:read_comments reward=0.00 done=false error=null
[STEP] step=3 action=priority=P0,team=backend,milestone=hotfix reward=0.85 done=true error=null
[END] success=true steps=3 score=0.85 rewards=0.85
[START] task=hard env=bug-triage-env model=meta-llama/Llama-3.3-70B-Instruct
[STEP] step=1 action=investigate:read_body reward=0.00 done=false error=null
[STEP] step=2 action=investigate:read_comments reward=0.00 done=false error=null
[STEP] step=3 action=priority=P0,team=security,milestone=hotfix reward=0.92 done=true error=null
[END] success=true steps=3 score=0.92 rewards=0.92
Project Structure
bug-triage-env/
βββ server/
β βββ app.py # FastAPI routes + session management
β βββ environment.py # Multi-step environment + SessionManager
β βββ task.py # 200+ bugs (procedural + handcrafted) + semantic grading
β βββ __init__.py
β βββ requirements.txt
β βββ static/
β βββ index.html # Interactive demo
βββ tests/
β βββ test_grading.py # Grading logic tests
β βββ test_environment.py # Environment flow tests
β βββ test_api.py # HTTP endpoint integration tests
βββ model.py # Pydantic models (TriageAction, TriageObservation, TriageState)
βββ client.py # HTTP client (single source of truth)
βββ inference.py # Multi-step OpenAI agent (hackathon submission)
βββ baseline.py # Groq baseline agent
βββ openenv.yaml # OpenEnv spec manifest
βββ Dockerfile # Docker config
βββ pyproject.toml # Package metadata + dev deps
βββ README.md
OpenEnv Spec Compliance
| Requirement | Status |
|---|---|
| Typed models (Action/Observation/State) | β |
step() / reset() / state() API |
β |
openenv.yaml manifest |
β |
| 3+ tasks with graders (easy β hard) | β |
| Reward range strictly (0.0, 1.0) | β |
| Multi-step episodes | β |
| Baseline inference with reproducible scores | β |
| Dockerfile builds | β |
| Deployed on HF Spaces | β |
Structured [START]/[STEP]/[END] logs |
β |
| Session-based concurrency | β |
| 50+ automated tests | β |
Built for the Meta PyTorch Hackathon x Scaler School of Technology β Round 1