title: Bug Triage Env
emoji: 🐛
colorFrom: red
colorTo: yellow
sdk: docker
pinned: false
tags:
  - openenv

🐛 Bug Triage Environment v2.0

OpenEnv RL environment for the Meta PyTorch Hackathon x Scaler School of Technology

A multi-step reinforcement learning environment where an AI agent investigates and triages GitHub-style bug reports (deciding priority, labels, team ownership, and milestone), just like a senior engineer would.

Live: https://siteshcodes-bug-triage-env.hf.space
GitHub: https://github.com/Siteshcodes/bug-triage-env


What Makes This Different

| Feature | v1.0 (before) | v2.0 (now) |
| --- | --- | --- |
| Episode length | 1 step (quiz) | Multi-step investigation |
| Bug pool | 15 handcrafted | 200+ procedurally generated |
| Label matching | Exact string | Semantic (synonym-aware) |
| Concurrency | Broken (global state) | Session-based, thread-safe |
| Information reveal | Everything at once | Progressive (title → body → comments → logs) |
| Tests | None | 50+ unit & integration tests |
| Grading depth | String matching | Weighted scoring + reasoning bonus |
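
The move from global state to session-based concurrency can be pictured with a small in-memory store guarded by a lock. This is a minimal sketch, not the repository's SessionManager: the class name `SessionStore`, its methods, and the use of plain dicts for episode state are all assumptions made for illustration.

```python
import threading
import uuid
from typing import Dict, Optional


class SessionStore:
    """Hypothetical thread-safe map from session_id to per-episode state."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._sessions: Dict[str, dict] = {}

    def create(self, state: dict) -> str:
        # A random hex id keeps concurrent episodes from sharing state.
        session_id = uuid.uuid4().hex
        with self._lock:
            self._sessions[session_id] = state
        return session_id

    def get(self, session_id: str) -> Optional[dict]:
        with self._lock:
            return self._sessions.get(session_id)

    def close(self, session_id: str) -> None:
        with self._lock:
            self._sessions.pop(session_id, None)

    def active_count(self) -> int:
        with self._lock:
            return len(self._sessions)
```

Each call to `create` returns a fresh id, so two agents resetting at the same time never overwrite each other's episode.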

Multi-Step Investigation

Unlike simple Q&A environments, the agent must investigate before deciding:

reset()              → Agent sees: bug title + body preview
step(read_body)      → Full description revealed
step(read_comments)  → User comments revealed
step(check_logs)     → Stack traces + severity signals revealed
step(submit, ...)    → Final triage graded (reward returned)

Each investigation action consumes one step from a limited per-task budget. The agent must learn when it has enough information to decide correctly, balancing accuracy against efficiency.
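
The budget trade-off can be illustrated with a simple hand-written policy. This is a sketch under stated assumptions, not the agent in inference.py: the function name `next_action` and the rule "always keep the last step for submit" are hypothetical, and the observation is assumed to be a plain dict carrying the visibility flags and step counters listed under Observation Space.

```python
from typing import Dict

# Investigation actions paired with the observation flag they flip,
# in the order the environment reveals information.
INVESTIGATION_ORDER = [
    ("read_body", "body_visible"),
    ("read_comments", "comments_visible"),
    ("check_logs", "logs_visible"),
]


def next_action(obs: Dict[str, object]) -> str:
    """Return the next action_type for a hypothetical rule-based agent."""
    steps_left = int(obs["max_steps"]) - int(obs["steps_taken"])
    if steps_left <= 1:
        # Reserve the final step of the budget for the triage decision.
        return "submit"
    for action, flag in INVESTIGATION_ORDER:
        if not obs.get(flag, False):
            return action
    # Everything is already revealed, so further investigation is wasted.
    return "submit"
```

A learned agent would replace the fixed ordering with a decision about whether the information seen so far already justifies submitting.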


Action Space

| Field | Type | Values |
| --- | --- | --- |
| action_type | string | read_body · read_comments · check_logs · check_similar · submit |
| priority | string | P0 · P1 · P2 · P3 (only for submit) |
| labels | list[str] | bug · performance · security · ux · data-integrity · payments … |
| assigned_team | string | backend · frontend · infra · security · devx |
| milestone | string | hotfix · v2.1 · backlog |
| reasoning | string | Free-form explanation (earns bonus points) |
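
The action fields above can be validated on the client before sending a request. The following is a minimal sketch using only the standard library; it is not the `TriageAction` Pydantic model from model.py. The class name `ActionSketch` is hypothetical, and the rule that `submit` always requires a priority is an assumption. Labels are left unchecked because the listed values end in an ellipsis.

```python
from dataclasses import dataclass, field
from typing import List, Optional

ACTION_TYPES = {"read_body", "read_comments", "check_logs", "check_similar", "submit"}
PRIORITIES = {"P0", "P1", "P2", "P3"}
TEAMS = {"backend", "frontend", "infra", "security", "devx"}
MILESTONES = {"hotfix", "v2.1", "backlog"}


@dataclass
class ActionSketch:
    """Hypothetical client-side mirror of the action fields."""

    action_type: str
    priority: Optional[str] = None
    labels: List[str] = field(default_factory=list)
    assigned_team: Optional[str] = None
    milestone: Optional[str] = None
    reasoning: str = ""

    def __post_init__(self) -> None:
        if self.action_type not in ACTION_TYPES:
            raise ValueError(f"unknown action_type: {self.action_type!r}")
        if self.action_type != "submit":
            return  # investigation actions carry no triage fields
        # Assumption: every submit must at least carry a priority.
        if self.priority not in PRIORITIES:
            raise ValueError(f"submit needs a priority in {sorted(PRIORITIES)}")
        if self.assigned_team is not None and self.assigned_team not in TEAMS:
            raise ValueError(f"unknown team: {self.assigned_team!r}")
        if self.milestone is not None and self.milestone not in MILESTONES:
            raise ValueError(f"unknown milestone: {self.milestone!r}")
```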

Observation Space

| Field | Type | Description |
| --- | --- | --- |
| bug_report | BugReport | Title, body, author, labels_hint, comments, stack_trace |
| task_id | string | Current difficulty: easy / medium / hard |
| score | float | Score from grader (0.0–1.0) |
| reward | float | Reward from last action (0.0–1.0) |
| feedback | string | Human-readable grader feedback |
| done | bool | Episode complete flag |
| body_visible | bool | Whether full body has been revealed |
| comments_visible | bool | Whether comments have been revealed |
| logs_visible | bool | Whether logs/stack traces have been revealed |
| steps_taken | int | Steps used so far |
| max_steps | int | Maximum steps allowed |

Tasks

Task 1 β€” Easy: Priority Assignment

Assign a single P0–P3 priority. Up to 4 steps.

  • Grader: server.task:priority_match
  • Scoring: exact → 0.95, ±1 → 0.50, ±2 → 0.20, else → 0.05
  • Reward range: (0.0, 1.0)
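
The graduated scoring for Task 1 can be written as a lookup on the distance between priority levels. This is a sketch of the rule as stated above, not the code behind `server.task:priority_match`; the function name `priority_score` and the choice to give an unrecognized priority the lowest score are assumptions.

```python
LEVELS = ["P0", "P1", "P2", "P3"]

# Score by how many levels the prediction is away from the expected priority.
SCORE_BY_DISTANCE = {0: 0.95, 1: 0.50, 2: 0.20}
FALLBACK_SCORE = 0.05


def priority_score(predicted: str, expected: str) -> float:
    """Graduated partial credit for a priority prediction."""
    if predicted not in LEVELS or expected not in LEVELS:
        # Assumption: malformed input is treated like the worst miss.
        return FALLBACK_SCORE
    distance = abs(LEVELS.index(predicted) - LEVELS.index(expected))
    return SCORE_BY_DISTANCE.get(distance, FALLBACK_SCORE)
```

For example, predicting P1 for a P0 bug is one level off and earns 0.50, while predicting P3 is three levels off and falls through to 0.05.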

Task 2 β€” Medium: Priority + Labels + Team

Assign priority, category labels, and team routing. Up to 5 steps.

  • Grader: server.task:priority_label_team
  • Scoring: priority 45% + label Jaccard (semantic) 40% + team 15%
  • Reward range: (0.0, 1.0)

Task 3 β€” Hard: Full Triage

Full triage with security escalation penalty. Up to 6 steps.

  • Grader: server.task:full_triage
  • Scoring: priority 35% + labels 30% + team 20% + milestone 15%
  • Penalty: −0.15 for missing security escalation
  • Bonus: up to +0.15 for relevant reasoning
  • Reward range: (0.0, 1.0)
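
The Task 3 weights, penalty, and bonus combine as sketched below. This is an illustration under assumptions, not the `server.task:full_triage` grader: the function names are hypothetical, the component scores are assumed to be computed elsewhere on a 0 to 1 scale, and the clamping margin `eps=0.01` is a placeholder because the README only says the result stays strictly inside (0.0, 1.0).

```python
def clamp_open(value: float, eps: float = 0.01) -> float:
    """Keep a score strictly inside (0, 1); eps is an assumed margin."""
    return min(1.0 - eps, max(eps, value))


def full_triage_score(
    priority: float,
    labels: float,
    team: float,
    milestone: float,
    *,
    missed_security: bool = False,
    reasoning_bonus: float = 0.0,
) -> float:
    """Weighted hard-task score from component scores in [0, 1]."""
    score = 0.35 * priority + 0.30 * labels + 0.20 * team + 0.15 * milestone
    if missed_security:
        score -= 0.15  # penalty for missing the security escalation
    score += min(max(reasoning_bonus, 0.0), 0.15)  # bonus is capped at +0.15
    return clamp_open(score)
```

With an exact priority (0.95) and perfect labels, team, and milestone, the weighted sum is 0.35 × 0.95 + 0.30 + 0.20 + 0.15 = 0.9825 before any bonus.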

Reward Function

  • Priority: Graduated partial credit (0.95 → 0.50 → 0.20 → 0.05)
  • Labels: Semantic Jaccard similarity with synonym matching (e.g., "defect" ≈ "bug")
  • Team routing: Binary accuracy, weighted per difficulty
  • Security escalation: Hard penalty (−0.15) for ignoring security signals
  • Reasoning bonus: Up to +0.15 for mentioning relevant signals
  • Efficiency: +0.05 bonus for correct answers with minimal investigation
  • Clamping: All scores strictly within (0.0, 1.0)
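
Semantic Jaccard similarity can be sketched by mapping each label to a canonical form before comparing sets. This is a minimal illustration, not the grader in server/task.py: only the "defect" ≈ "bug" pair comes from this README, the other synonym entries are invented placeholders, and returning 1.0 when both label sets are empty is an assumption.

```python
from typing import Iterable

# Only "defect" -> "bug" is taken from the README; the rest are placeholders.
SYNONYMS = {
    "defect": "bug",
    "slow": "performance",
    "vulnerability": "security",
}


def canonical(label: str) -> str:
    """Normalize case and whitespace, then collapse known synonyms."""
    key = label.strip().lower()
    return SYNONYMS.get(key, key)


def semantic_jaccard(predicted: Iterable[str], expected: Iterable[str]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| over canonical labels."""
    a = {canonical(label) for label in predicted}
    b = {canonical(label) for label in expected}
    if not a and not b:
        return 1.0  # assumption: two empty label sets count as a match
    return len(a & b) / len(a | b)
```

Under exact string matching, `["defect", "security"]` against `["bug", "security"]` would score 1/3; after canonicalization the two sets are identical.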

Procedural Bug Generation

The environment generates bugs from 7 template categories:

| Category | Example Bugs |
| --- | --- |
| crash | Service crashes, unhandled exceptions, segfaults |
| security | SQL injection, XSS, auth bypass, data exposure |
| performance | Memory leaks, slow queries, CPU spikes |
| ui_bug | Layout breaks, dark mode issues, accessibility |
| data_corruption | Race conditions, encoding issues, stale cache |
| documentation | Typos, outdated docs, missing guides |
| api_bug | Rate limiting bugs, pagination issues, webhook failures |

Each category combines 5-6 title templates × 2 body templates × 6-12 variables, which yields hundreds of unique combinations. The 15 original handcrafted bugs are preserved as a high-quality subset; each sample has a 40% chance of being drawn from it.
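
Template-based generation with a handcrafted fallback can be sketched as follows. The template strings, variable values, and function names here are invented for illustration and cover a single category; the real templates live in server/task.py and are not reproduced. The 40% handcrafted rate is the only number taken from the text above.

```python
import random
from typing import Dict, List

# Hypothetical templates for one category; the real pool has 7 categories.
TEMPLATES = {
    "crash": {
        "titles": [
            "{service} crashes on {trigger}",
            "Unhandled exception in {service} during {trigger}",
        ],
        "bodies": [
            "After {trigger}, {service} exits with a stack trace.",
            "{service} stops responding whenever {trigger} happens.",
        ],
        "variables": {
            "service": ["auth-api", "worker"],
            "trigger": ["startup", "bulk import", "logout"],
        },
    },
}


def generate_bug(rng: random.Random, category: str) -> Dict[str, str]:
    """Fill one title template and one body template with shared variables."""
    spec = TEMPLATES[category]
    values = {name: rng.choice(options) for name, options in spec["variables"].items()}
    return {
        "category": category,
        "title": rng.choice(spec["titles"]).format(**values),
        "body": rng.choice(spec["bodies"]).format(**values),
    }


def sample_bug(
    rng: random.Random,
    handcrafted: List[Dict[str, str]],
    p_handcrafted: float = 0.4,
) -> Dict[str, str]:
    """Draw a handcrafted bug with probability p_handcrafted, else generate one."""
    if handcrafted and rng.random() < p_handcrafted:
        return rng.choice(handcrafted)
    return generate_bug(rng, rng.choice(sorted(TEMPLATES)))
```

Passing a seeded `random.Random` makes a sampled episode reproducible, which helps when comparing agents on the same bugs.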


Setup

Run Locally

git clone https://github.com/Siteshcodes/bug-triage-env.git
cd bug-triage-env
pip install -r server/requirements.txt
uvicorn server.app:app --host 0.0.0.0 --port 7860

Run with Docker

docker build -t bug-triage-env .
docker run -p 7860:7860 bug-triage-env

Run Tests

pip install -e ".[dev]"
pytest tests/ -v

Run Inference (Hackathon Submission)

pip install openai openenv-core requests pydantic
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=meta-llama/Llama-3.3-70B-Instruct
export HF_TOKEN=your_hf_token_here
export ENV_BASE_URL=https://siteshcodes-bug-triage-env.hf.space
python inference.py

Environment Variables

| Variable | Description | Required |
| --- | --- | --- |
| API_BASE_URL | LLM API endpoint | Yes |
| MODEL_NAME | Model identifier for inference | Yes |
| HF_TOKEN | Hugging Face / API key | Yes |
| ENV_BASE_URL | Bug Triage environment URL | Optional |
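
A script consuming these variables might check the required ones up front and fall back to a default for the optional one. This is a sketch, not the logic in inference.py: the function name `load_config` is hypothetical, and using the live Space URL as the default for `ENV_BASE_URL` is an assumption.

```python
import os
from typing import Dict, Mapping

REQUIRED = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN")

# Assumption: the hosted Space is used when ENV_BASE_URL is not set.
DEFAULT_ENV_BASE_URL = "https://siteshcodes-bug-triage-env.hf.space"


def load_config(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Collect settings, failing early if a required variable is missing."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("missing required environment variables: " + ", ".join(missing))
    config = {name: env[name] for name in REQUIRED}
    # Strip a trailing slash so paths like "/reset" can be appended directly.
    config["ENV_BASE_URL"] = env.get("ENV_BASE_URL", DEFAULT_ENV_BASE_URL).rstrip("/")
    return config
```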

API Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | / | Interactive demo frontend |
| GET | /health | Health check + active sessions |
| POST | /reset | Start new episode (returns session_id) |
| POST | /step | Investigation or submit action |
| GET | /state | Current episode state |
| GET | /tasks | List all 3 tasks |
| GET | /tasks/{id} | Task metadata |
| GET | /leaderboard | Top agent scores |
| POST | /leaderboard/submit | Submit agent scores |

Example: Multi-Step Episode

# 1. Reset: get a bug and session_id
curl -X POST https://siteshcodes-bug-triage-env.hf.space/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": "hard"}'

# 2. Investigate: read full body (use session_id from step 1)
curl -X POST https://siteshcodes-bug-triage-env.hf.space/step \
  -H "Content-Type: application/json" \
  -d '{"session_id": "...", "action": {"action_type": "read_body"}}'

# 3. Investigate: read comments
curl -X POST https://siteshcodes-bug-triage-env.hf.space/step \
  -H "Content-Type: application/json" \
  -d '{"session_id": "...", "action": {"action_type": "read_comments"}}'

# 4. Submit triage decision
curl -X POST https://siteshcodes-bug-triage-env.hf.space/step \
  -H "Content-Type: application/json" \
  -d '{"session_id": "...", "action": {"action_type": "submit", "priority": "P0", "labels": ["bug", "security"], "assigned_team": "security", "milestone": "hotfix", "reasoning": "SQL injection in production: critical security vulnerability"}}'
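
The same four requests can be issued from Python with the standard library. This is a sketch rather than the repository's client.py: the helper names are hypothetical, and reading `session_id` from the top level of the `/reset` response is an assumption based on the endpoint description, so adjust the key if the actual response nests it.

```python
import json
import urllib.request
from typing import Any, Dict

BASE_URL = "https://siteshcodes-bug-triage-env.hf.space"


def step_payload(session_id: str, action_type: str, **fields: Any) -> Dict[str, Any]:
    """Build the JSON body for POST /step, matching the curl examples."""
    return {"session_id": session_id, "action": {"action_type": action_type, **fields}}


def post_json(path: str, payload: Dict[str, Any]) -> Dict[str, Any]:
    """POST a JSON payload and decode the JSON response."""
    request = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def run_episode() -> Dict[str, Any]:
    """Reset, investigate twice, then submit (requires network access)."""
    reset = post_json("/reset", {"task_id": "hard"})
    session_id = reset["session_id"]  # assumption: top-level key in the response
    post_json("/step", step_payload(session_id, "read_body"))
    post_json("/step", step_payload(session_id, "read_comments"))
    return post_json("/step", step_payload(
        session_id, "submit",
        priority="P0", labels=["bug", "security"],
        assigned_team="security", milestone="hotfix",
        reasoning="SQL injection in production: critical security vulnerability",
    ))
```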

Inference Log Format

Structured logs per OpenEnv spec (3 tasks, each with its own block):

[START] task=easy env=bug-triage-env model=meta-llama/Llama-3.3-70B-Instruct
[STEP] step=1 action=investigate:read_body reward=0.00 done=false error=null
[STEP] step=2 action=investigate:read_comments reward=0.00 done=false error=null
[STEP] step=3 action=priority=P0,team=backend,milestone=hotfix reward=0.95 done=true error=null
[END] success=true steps=3 score=0.95 rewards=0.95

[START] task=medium env=bug-triage-env model=meta-llama/Llama-3.3-70B-Instruct
[STEP] step=1 action=investigate:read_body reward=0.00 done=false error=null
[STEP] step=2 action=investigate:read_comments reward=0.00 done=false error=null
[STEP] step=3 action=priority=P0,team=backend,milestone=hotfix reward=0.85 done=true error=null
[END] success=true steps=3 score=0.85 rewards=0.85

[START] task=hard env=bug-triage-env model=meta-llama/Llama-3.3-70B-Instruct
[STEP] step=1 action=investigate:read_body reward=0.00 done=false error=null
[STEP] step=2 action=investigate:read_comments reward=0.00 done=false error=null
[STEP] step=3 action=priority=P0,team=security,milestone=hotfix reward=0.92 done=true error=null
[END] success=true steps=3 score=0.92 rewards=0.92
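
Lines in this format are easy to parse back into structured records, for example to aggregate rewards across runs. The sketch below handles only `[STEP]` lines as they appear in the sample above; the regular expression and the function name `parse_step` are assumptions, not part of the OpenEnv tooling.

```python
import re
from typing import Dict, Optional

STEP_RE = re.compile(
    r"^\[STEP\] step=(?P<step>\d+) action=(?P<action>\S+) "
    r"reward=(?P<reward>-?\d+(?:\.\d+)?) done=(?P<done>true|false) "
    r"error=(?P<error>.+)$"
)


def parse_step(line: str) -> Optional[Dict[str, object]]:
    """Parse one [STEP] log line; return None for any other line."""
    match = STEP_RE.match(line.strip())
    if match is None:
        return None
    return {
        "step": int(match["step"]),
        "action": match["action"],
        "reward": float(match["reward"]),
        "done": match["done"] == "true",
        # The literal string "null" marks the absence of an error.
        "error": None if match["error"] == "null" else match["error"],
    }
```

The action field is matched as a run of non-space characters, so submit actions such as `priority=P0,team=security,milestone=hotfix` are captured whole even though they contain `=` signs.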

Project Structure

bug-triage-env/
├── server/
│   ├── app.py             # FastAPI routes + session management
│   ├── environment.py     # Multi-step environment + SessionManager
│   ├── task.py            # 200+ bugs (procedural + handcrafted) + semantic grading
│   ├── __init__.py
│   ├── requirements.txt
│   └── static/
│       └── index.html     # Interactive demo
├── tests/
│   ├── test_grading.py    # Grading logic tests
│   ├── test_environment.py # Environment flow tests
│   └── test_api.py        # HTTP endpoint integration tests
├── model.py               # Pydantic models (TriageAction, TriageObservation, TriageState)
├── client.py              # HTTP client (single source of truth)
├── inference.py           # Multi-step OpenAI agent (hackathon submission)
├── baseline.py            # Groq baseline agent
├── openenv.yaml           # OpenEnv spec manifest
├── Dockerfile             # Docker config
├── pyproject.toml         # Package metadata + dev deps
└── README.md

OpenEnv Spec Compliance

| Requirement | Status |
| --- | --- |
| Typed models (Action/Observation/State) | ✅ |
| step() / reset() / state() API | ✅ |
| openenv.yaml manifest | ✅ |
| 3+ tasks with graders (easy → hard) | ✅ |
| Reward range strictly (0.0, 1.0) | ✅ |
| Multi-step episodes | ✅ |
| Baseline inference with reproducible scores | ✅ |
| Dockerfile builds | ✅ |
| Deployed on HF Spaces | ✅ |
| Structured [START]/[STEP]/[END] logs | ✅ |
| Session-based concurrency | ✅ |
| 50+ automated tests | ✅ |

Built for the Meta PyTorch Hackathon x Scaler School of Technology, Round 1