---
title: Bug Triage OpenEnv
emoji: 🐛
colorFrom: red
colorTo: yellow
sdk: docker
app_port: 8000
tags:
  - openenv
---
Bug Triage OpenEnv
A production-grade reinforcement learning environment for automated software bug triage, built on the OpenEnv framework.
| | |
|---|---|
| Live Space | huggingface.co/spaces/savetrees/bug-triage-openenv |
| Repository | github.com/savetree-1/bug-triage-openenv |
| Framework | FastAPI + Pydantic v2 |
| License | MIT |
Table of Contents
- Overview
- Getting Started
- Tasks
- API Reference
- Observation Space
- Action Space
- Reward Design
- Baseline Agent
- Architecture
- Deployment
- Project Structure
- References
Overview
Bug Triage OpenEnv simulates a real-world issue tracking system (comparable to Jira, GitHub Issues, or Linear) where an AI agent must read incoming bug reports and make triage decisions:
- Classify the bug type (crash, UI, security, performance, data loss, compatibility)
- Prioritize the severity (low, medium, high, critical)
- Route to the correct developer based on their domain expertise
- Recommend the appropriate action (fix immediately, schedule for sprint, etc.)
The environment includes 25 carefully crafted bug reports drawn from real-world software engineering workflows, each designed to test different reasoning capabilities of frontier language models.
Motivation
| Problem | Why It Matters |
|---|---|
| Every software company triages hundreds to thousands of bugs daily | High-volume, repetitive task ideal for automation |
| Manual triage costs senior engineering hours | Direct cost savings from accurate automation |
| Misrouted bugs cause cascading delays and outages | Incorrect triage has measurable downstream impact |
| Ambiguous bug reports require deep contextual reasoning | LLM agents must parse unstructured text and infer intent |
This environment was built for the Meta x PyTorch Hackathon and is designed for training RL agents via GRPO (Group Relative Policy Optimization).
Getting Started
Prerequisites
- Python 3.10+
- pip
- Docker (optional, for containerized deployment)
Installation
git clone https://github.com/savetree-1/bug-triage-openenv.git
cd bug-triage-openenv
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Quick Start
Start the server:
uvicorn bug_triage_env.server.app:app --host 0.0.0.0 --port 8000
Verify that the server is running:
curl http://localhost:8000/health
Expected response:
{"status": "healthy"}
Run a complete episode (reset, then step):
curl -X POST http://localhost:8000/reset \
-H "Content-Type: application/json" \
-d '{"task_id": "task_1"}'
Submit a triage action using the episode_id returned from /reset:
curl -X POST http://localhost:8000/step \
-H "Content-Type: application/json" \
-d '{"episode_id": "<episode_id>", "action": {"task_id": "task_1", "bug_type": "crash"}}'
Python Client
Synchronous and asynchronous HTTP clients are provided for programmatic access:
from bug_triage_env.client import BugTriageEnvClient
from bug_triage_env.models import BugTriageAction
with BugTriageEnvClient("http://localhost:8000") as client:
    obs = client.reset(task_id="task_3")
    action = BugTriageAction(
        task_id="task_3",
        bug_type="security",
        priority="critical",
        assigned_developer="Bob",
        suggested_action="fix_immediately",
    )
    result = client.step(obs["episode_id"], action)
    print(f"Grader score: {result['grader_score']}")
Tasks
The environment defines three tasks of increasing difficulty. Each task has a deterministic grader that returns a score in the range [0.0, 1.0].
Task 1: Bug Type Classification (Easy)
Given a bug report, classify it into one of six categories.
| Property | Value |
|---|---|
| Input | Bug title, description, logs, environment metadata |
| Output | bug_type: one of crash, ui, performance, security, data_loss, compatibility |
| Scoring | Exact match = 1.0; incorrect = 0.0 |
| Grader | task1_grader.py |
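In Python, the exact-match rule can be sketched as follows. This is a minimal illustration, not the actual contents of task1_grader.py; the rejection of unknown labels is an assumption:

```python
VALID_BUG_TYPES = {"crash", "ui", "performance", "security", "data_loss", "compatibility"}

def grade_task1(predicted: str, expected: str) -> float:
    """Task 1 sketch: 1.0 on an exact bug-type match, 0.0 otherwise."""
    if predicted not in VALID_BUG_TYPES:
        return 0.0  # assumed: malformed labels score zero
    return 1.0 if predicted == expected else 0.0
```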
Task 2: Priority Assignment (Medium)
Given a bug report, assign the correct severity level.
| Property | Value |
|---|---|
| Input | Bug title, description, logs, environment metadata |
| Output | priority: one of low, medium, high, critical |
| Scoring | Exact = 1.0; 1 level off = 0.67; 2 levels = 0.33; 3 levels = 0.0 |
| Grader | task2_grader.py |
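The distance-penalty scoring can be sketched directly from the table above (an illustration, not the actual task2_grader.py):

```python
PRIORITY_ORDER = ["low", "medium", "high", "critical"]

def grade_task2(predicted: str, expected: str) -> float:
    """Task 2 sketch: each level of distance from the true priority
    costs one third of the score (1.0, 0.67, 0.33, 0.0)."""
    distance = abs(PRIORITY_ORDER.index(predicted) - PRIORITY_ORDER.index(expected))
    return round(1.0 - distance / 3, 2)
```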
Task 3: Full Bug Triage (Hard)
Perform complete triage: classify the bug type, assign priority, route to the correct developer, and recommend an action.
| Property | Value |
|---|---|
| Output | bug_type + priority + assigned_developer + suggested_action |
| Developers | Alice (crash, performance), Bob (crash, security), Carol (UI, compatibility), David (security, data loss), Eve (UI, performance, compatibility) |
| Actions | fix_immediately, schedule_sprint, needs_more_info, wontfix, duplicate |
| Scoring | Weighted composite: 0.3 * type + 0.3 * priority + 0.2 * developer + 0.2 * action |
| Grader | task3_grader.py |
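The weighted composite can be sketched like this. It is illustrative only; in particular, whether the priority term inside task3_grader.py reuses Task 2's distance penalty (rather than an exact match) is an assumption:

```python
PRIORITY_ORDER = ["low", "medium", "high", "critical"]

def grade_task3(action: dict, gold: dict) -> float:
    """Task 3 sketch: 0.3*type + 0.3*priority + 0.2*developer + 0.2*action."""
    type_ok = 1.0 if action["bug_type"] == gold["bug_type"] else 0.0
    # Assumed: priority is graded with the same distance penalty as Task 2
    dist = abs(PRIORITY_ORDER.index(action["priority"]) - PRIORITY_ORDER.index(gold["priority"]))
    priority_score = 1.0 - dist / 3
    dev_ok = 1.0 if action["assigned_developer"] == gold["assigned_developer"] else 0.0
    act_ok = 1.0 if action["suggested_action"] == gold["suggested_action"] else 0.0
    return 0.3 * type_ok + 0.3 * priority_score + 0.2 * dev_ok + 0.2 * act_ok
```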
API Reference
All endpoints conform to the OpenEnv specification.
Endpoints
| Method | Path | Description |
|---|---|---|
| GET | /health | Liveness probe. Returns {"status": "healthy"}. |
| POST | /reset | Start a new episode. Accepts optional {"task_id": "task_1"}. Returns an observation containing a bug report. |
| POST | /step | Submit a triage action. Requires episode_id and action. Returns observation with reward and grader score. |
| GET | /state | Returns metadata about active episodes. |
| GET | /tasks | Lists all available tasks with their action schemas. |
| POST | /grader | Re-grade a completed episode. Requires episode_id and task_id. |
| POST | /baseline | Trigger baseline inference (requires OPENAI_API_KEY or GEMINI_API_KEY). |
| GET | /docs | Auto-generated Swagger UI documentation. |
POST /reset
Request:
{"task_id": "task_1"}
Response (abbreviated):
{
  "done": false,
  "reward": 0.0,
  "task_id": "task_1",
  "episode_id": "abc123",
  "step_number": 0,
  "feedback": "New bug report received. Please triage.",
  "available_developers": ["Alice", "Bob", "Carol", "David", "Eve"],
  "bug_report": {
    "bug_id": "BUG-001",
    "title": "Application crashes on login with SSO enabled",
    "description": "...",
    "logs": "...",
    "environment": "macOS 14.2, Chrome 120",
    "reporter": "user_42",
    "created_at": "2024-01-15T09:30:00Z",
    "metadata": {}
  }
}
POST /step
Request:
{
  "episode_id": "abc123",
  "action": {
    "task_id": "task_3",
    "bug_type": "crash",
    "priority": "critical",
    "assigned_developer": "Alice",
    "suggested_action": "fix_immediately"
  }
}
Response:
{
  "done": true,
  "reward": 1.0,
  "grader_score": 1.0,
  "task_id": "task_3",
  "feedback": "Grader score: 1.00 | Bug type: correct | Priority: correct | Developer: correct | Action: correct",
  "step_number": 1,
  "episode_id": "abc123"
}
Observation Space
Each observation returned by /reset and /step contains the following fields:
| Field | Type | Description |
|---|---|---|
| bug_report.bug_id | string | Unique bug identifier (e.g., BUG-001) |
| bug_report.title | string | Short summary of the bug |
| bug_report.description | string | Detailed description of the issue |
| bug_report.logs | string or null | Error logs, stack traces, or crash output |
| bug_report.environment | string or null | OS, browser, hardware, and version details |
| bug_report.reporter | string | Username of the person who filed the bug |
| bug_report.created_at | string | ISO 8601 timestamp |
| bug_report.metadata | object | Additional context (component, affected users, regression flag) |
| available_developers | array of strings | The 5 developers available for routing |
| done | boolean | Whether the episode has ended |
| reward | float | Shaped reward signal for RL training |
| grader_score | float or null | Raw evaluation score in [0.0, 1.0] (null before stepping) |
| episode_id | string | Unique episode identifier |
| step_number | integer | Current step count (0 after reset, 1 after step) |
| feedback | string | Human-readable feedback about the triage result |
Action Space
Actions are submitted as JSON objects to the /step endpoint. Required fields vary by task:
| Field | Type | Task 1 | Task 2 | Task 3 |
|---|---|---|---|---|
| task_id | string | Required | Required | Required |
| bug_type | string | Required | -- | Required |
| priority | string | -- | Required | Required |
| assigned_developer | string | -- | -- | Required |
| suggested_action | string | -- | -- | Required |
| confidence | float (0.0-1.0) | Optional | Optional | Optional |
| reasoning | string | Optional | Optional | Optional |
Valid values:
- bug_type: crash, ui, performance, security, data_loss, compatibility
- priority: low, medium, high, critical
- assigned_developer: Alice, Bob, Carol, David, Eve
- suggested_action: fix_immediately, schedule_sprint, needs_more_info, wontfix, duplicate
Reward Design
The environment provides two distinct signals:
| Signal | Range | Purpose |
|---|---|---|
| Grader Score | [0.0, 1.0] | Deterministic evaluation metric for benchmarking |
| Shaped Reward | [-0.5, 1.0] | Continuous training signal optimized for GRPO |
The shaped reward is derived from the grader score using the following formula:
reward = (grader_score * 1.5) - 0.5 + calibration_bonus
This mapping ensures:
- A score of 0.0 produces a reward of -0.5 (penalizes random guessing)
- A score of 1/3 (~0.33) produces a reward of 0.0 (breakeven point)
- A score of 1.0 produces a reward of 1.0 (maximum)
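The mapping is a one-line function; the sketch below treats the calibration bonus as an optional term that defaults to zero when no confidence is submitted:

```python
def shaped_reward(grader_score: float, calibration_bonus: float = 0.0) -> float:
    """Map a [0.0, 1.0] grader score onto the [-0.5, 1.0] training signal."""
    return grader_score * 1.5 - 0.5 + calibration_bonus
```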
Confidence Calibration
Agents may optionally submit a confidence value (float between 0.0 and 1.0) with their action. The environment applies a calibration bonus or penalty based on how well the agent's confidence aligns with its actual performance:
| Condition | Bonus | Description |
|---|---|---|
| Correct and confident (score >= 0.8, confidence >= 0.8) | +0.10 | Rewards agents that are confident and right |
| Wrong and overconfident (score < 0.5, confidence >= 0.8) | -0.15 | Penalizes dangerous overconfidence |
| Well-calibrated (absolute difference < 0.2) | +0.05 | Rewards honest uncertainty estimation |
| Poorly calibrated (absolute difference >= 0.2) | -0.05 | Penalizes miscalibrated confidence |
This mechanic introduces a genuine RL challenge: the agent must learn not only what is correct, but also when it is certain. In production bug triage, overconfident misrouting of a critical outage has severe downstream consequences.
Baseline Agent
The baseline inference script supports two LLM providers with automatic fallback:
| Priority | Provider | Environment Variable | Default Model |
|---|---|---|---|
| Primary | OpenAI | OPENAI_API_KEY | gpt-4o-mini |
| Fallback | Google Gemini | GEMINI_API_KEY | gemini-2.5-flash |
| Last resort | Random | -- | Random valid action |
Both providers implement exponential backoff with retry logic for HTTP 429 (rate limit) and 503 (service unavailable) responses.
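The retry behavior could look roughly like this. It is a sketch under the assumption that provider errors expose a status_code attribute; the actual baseline.py and the providers' SDKs may structure errors differently:

```python
import random
import time

RETRYABLE_STATUSES = {429, 503}  # rate limit, service unavailable

def with_backoff(call, max_retries: int = 5, base_delay: float = 1.0):
    """Retry `call` on retryable HTTP errors with jittered exponential backoff.

    Assumes exceptions raised by `call` carry a `status_code` attribute
    (an illustrative convention, not a guarantee of any real SDK).
    """
    for attempt in range(max_retries):
        try:
            return call()
        except Exception as exc:
            status = getattr(exc, "status_code", None)
            if status not in RETRYABLE_STATUSES or attempt == max_retries - 1:
                raise  # non-retryable error, or retries exhausted
            # delay doubles each attempt: base, 2*base, 4*base, ... plus jitter
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
```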
Running the Baseline
# Using OpenAI (required by hackathon spec)
export OPENAI_API_KEY="sk-..."
python -m bug_triage_env.baseline --all-tasks --episodes 5
# Using Gemini (free tier available at https://aistudio.google.com/apikey)
export GEMINI_API_KEY="AI..."
python -m bug_triage_env.baseline --all-tasks --episodes 5
# Single task with more episodes
python -m bug_triage_env.baseline --task task_1 --episodes 10
# JSON output
python -m bug_triage_env.baseline --all-tasks --json
Baseline Scores
| Task | Mean Score | Range | Description |
|---|---|---|---|
| Task 1 (Easy) | 0.80 | 0.00 - 1.00 | Bug type classification |
| Task 2 (Medium) | 0.93 | 0.67 - 1.00 | Priority assignment |
| Task 3 (Hard) | 0.78 | 0.60 - 1.00 | Full triage pipeline |
| Overall | 0.84 | -- | Weighted average across all tasks |
Without any API key configured, the baseline falls back to random actions and achieves an average score of approximately 0.15.
Hackathon Inference Script
The root-level inference.py is the hackathon-mandated entry point. It:
- Uses the OpenAI Python client exclusively
- Reads API_BASE_URL, MODEL_NAME, and HF_TOKEN from the environment
- Emits structured [START], [STEP], and [END] logs to stdout
- Completes in under 20 minutes on 2 vCPU / 8 GB RAM
Architecture
+------------------------------------------+
| FastAPI Server |
| +--------+ +--------+ +-----------+ |
| | /reset | | /step | | /grader | |
| +---+----+ +---+----+ +-----+-----+ |
| | | | |
| +---v-----------v--------------v-----+ |
| | BugTriageEnvironment | |
| | +----------+ +---------------+ | |
| | | Dataset | | Episode Store | | |
| | | 25 Bugs | | (thread-safe) | | |
| | +----------+ +---------------+ | |
| +--------------------+---------------+ |
| | |
| +--------------------v---------------+ |
| | Graders Registry | |
| | task1: exact match | |
| | task2: distance penalty | |
| | task3: weighted composite | |
| +------------------------------------+ |
+------------------------------------------+
^ ^
| HTTP | HTTP
+----+-----+ +----+----------+
| Client | | Baseline |
| (Python) | | OpenAI/Gemini |
+----------+ +---------------+
Key implementation details:
- Thread safety: The episode store uses a Python threading.Lock to support concurrent requests from multiple agents.
- Single-step episodes: Each episode consists of one reset (observation) and one step (action). The episode terminates immediately after the step.
- Deterministic grading: All three graders produce identical scores for identical inputs. No randomness is involved in evaluation.
- Dataset: 25 bug reports stored in bugs.json, covering crash reports, security vulnerabilities, performance regressions, UI glitches, data corruption, and compatibility issues.
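A lock-guarded episode store of this shape can be sketched as follows (illustrative only, not the package's actual environment.py; the stored fields are assumptions):

```python
import threading
import uuid

class EpisodeStore:
    """Minimal sketch of a thread-safe in-memory episode store."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._episodes: dict[str, dict] = {}

    def create(self, task_id: str) -> str:
        """Register a new episode and return its id."""
        episode_id = uuid.uuid4().hex[:8]
        with self._lock:  # serialize writers across request threads
            self._episodes[episode_id] = {"task_id": task_id, "done": False}
        return episode_id

    def finish(self, episode_id: str) -> dict:
        """Mark an episode done (single-step: it ends after one action)."""
        with self._lock:
            episode = self._episodes[episode_id]
            episode["done"] = True
            return episode
```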
Deployment
Docker
docker build -t bug-triage-env .
docker run -d -p 8000:8000 \
-e OPENAI_API_KEY="sk-..." \
bug-triage-env
curl http://localhost:8000/health
The Dockerfile uses Python 3.11-slim, installs only production dependencies, and includes a built-in health check.
Hugging Face Spaces
The environment is deployed as a Docker-based Hugging Face Space:
pip install huggingface_hub
python3 -c "
from huggingface_hub import HfApi
api = HfApi()
api.create_repo(repo_id='<username>/bug-triage-openenv', repo_type='space', space_sdk='docker', exist_ok=True)
api.upload_folder(folder_path='.', repo_id='<username>/bug-triage-openenv', repo_type='space')
"
The live deployment is accessible at: https://huggingface.co/spaces/savetrees/bug-triage-openenv
Project Structure
bug-triage-openenv/
|-- README.md Documentation
|-- Dockerfile Production container (Python 3.11-slim)
|-- openenv.yaml OpenEnv environment manifest
|-- inference.py Hackathon inference entry point
|-- pyproject.toml Python package configuration
|-- requirements.txt Pinned production dependencies
|-- .dockerignore Files excluded from Docker build
|-- .gitignore Files excluded from version control
|
|-- bug_triage_env/ Main Python package
| |-- __init__.py Package initialization
| |-- models.py Pydantic v2 data models (Action, Observation, State)
| |-- client.py Synchronous and asynchronous HTTP client
| |-- baseline.py Dual-provider LLM baseline (OpenAI + Gemini)
| |
| |-- data/
| | |-- __init__.py Dataset loader
| | |-- bugs.json 25 curated real-world bug reports
| |
| |-- graders/
| | |-- __init__.py Grader registry
| | |-- task1_grader.py Bug classification grader (exact match)
| | |-- task2_grader.py Priority assignment grader (distance penalty)
| | |-- task3_grader.py Full triage grader (weighted composite)
| |
| |-- server/
| |-- __init__.py Server package initialization
| |-- app.py FastAPI application with all 8 endpoints
| |-- environment.py Core RL environment (reset, step, state)
Environment Variables
| Variable | Required | Default | Description |
|---|---|---|---|
| OPENAI_API_KEY | For baseline | (none) | OpenAI API key for primary baseline inference |
| GEMINI_API_KEY | For fallback | (none) | Google Gemini API key for fallback inference |
| API_BASE_URL | For hackathon | https://api.openai.com/v1 | LLM API endpoint (used by inference.py) |
| MODEL_NAME | For hackathon | gpt-4o-mini | Model identifier (used by inference.py) |
| HF_TOKEN | For hackathon | (none) | Hugging Face token (used by inference.py) |
| PORT | No | 8000 | Server port |
| HOST | No | 0.0.0.0 | Server bind address |
| WORKERS | No | 4 | Number of Uvicorn worker processes |
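Reading the optional server variables with their documented defaults might look like this (an illustrative fragment, not the project's actual startup code):

```python
import os

# Fall back to the documented defaults when the variables are unset
PORT = int(os.environ.get("PORT", "8000"))
HOST = os.environ.get("HOST", "0.0.0.0")
WORKERS = int(os.environ.get("WORKERS", "4"))
```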
References
- OpenEnv Framework -- Standardized RL environment specification
- FastAPI -- High-performance Python web framework
- Pydantic v2 -- Data validation using Python type annotations
- OpenAI API -- Primary LLM provider for baseline inference
- Google Gemini API -- Fallback LLM provider
- GRPO (Group Relative Policy Optimization) -- RL training algorithm
- Hugging Face Spaces -- Deployment platform for ML applications
- Docker -- Containerization platform
License
This project is licensed under the MIT License. See LICENSE for details.