---
title: Bug Triage OpenEnv
emoji: 🐛
colorFrom: red
colorTo: yellow
sdk: docker
app_port: 8000
tags:
- openenv
---
# Bug Triage OpenEnv
A production-grade reinforcement learning environment for automated software bug triage, built on the [OpenEnv](https://github.com/OpenEnv-AI/openenv) framework.
| | |
|---|---|
| **Live Space** | [huggingface.co/spaces/savetrees/bug-triage-openenv](https://huggingface.co/spaces/savetrees/bug-triage-openenv) |
| **Repository** | [github.com/savetree-1/bug-triage-openenv](https://github.com/savetree-1/bug-triage-openenv) |
| **Framework** | [FastAPI](https://fastapi.tiangolo.com) + [Pydantic v2](https://docs.pydantic.dev/latest/) |
| **License** | MIT |
---
## Table of Contents
- [Overview](#overview)
- [Getting Started](#getting-started)
- [Tasks](#tasks)
- [API Reference](#api-reference)
- [Observation Space](#observation-space)
- [Action Space](#action-space)
- [Reward Design](#reward-design)
- [Baseline Agent](#baseline-agent)
- [Architecture](#architecture)
- [Deployment](#deployment)
- [Project Structure](#project-structure)
- [References](#references)
---
## Overview
**Bug Triage OpenEnv** simulates a real-world issue tracking system (comparable to [Jira](https://www.atlassian.com/software/jira), [GitHub Issues](https://github.com/features/issues), or [Linear](https://linear.app)) where an AI agent must read incoming bug reports and make triage decisions:
1. **Classify** the bug type (crash, UI, security, performance, data loss, compatibility)
2. **Prioritize** the severity (low, medium, high, critical)
3. **Route** to the correct developer based on their domain expertise
4. **Recommend** the appropriate action (fix immediately, schedule for sprint, etc.)
The environment includes 25 carefully crafted bug reports drawn from real-world software engineering workflows, each designed to test different reasoning capabilities of frontier language models.
### Motivation
| Problem | Why It Matters |
|---|---|
| Every software company triages hundreds to thousands of bugs daily | High-volume, repetitive task ideal for automation |
| Manual triage costs senior engineering hours | Direct cost savings from accurate automation |
| Misrouted bugs cause cascading delays and outages | Incorrect triage has measurable downstream impact |
| Ambiguous bug reports require deep contextual reasoning | LLM agents must parse unstructured text and infer intent |
This environment was built for the [Meta x PyTorch Hackathon](https://pytorch.org/) and is designed for training RL agents via [GRPO](https://arxiv.org/abs/2402.03300) (Group Relative Policy Optimization).
---
## Getting Started
### Prerequisites
- [Python 3.10+](https://www.python.org/downloads/)
- [pip](https://pip.pypa.io/en/stable/)
- [Docker](https://docs.docker.com/get-docker/) (optional, for containerized deployment)
### Installation
```bash
git clone https://github.com/savetree-1/bug-triage-openenv.git
cd bug-triage-openenv
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
### Quick Start
Start the server:
```bash
uvicorn bug_triage_env.server.app:app --host 0.0.0.0 --port 8000
```
Verify that the server is running:
```bash
curl http://localhost:8000/health
```
Expected response:
```json
{"status": "healthy"}
```
Run a complete episode (reset, then step):
```bash
curl -X POST http://localhost:8000/reset \
-H "Content-Type: application/json" \
-d '{"task_id": "task_1"}'
```
Submit a triage action using the `episode_id` returned from `/reset`:
```bash
curl -X POST http://localhost:8000/step \
-H "Content-Type: application/json" \
-d '{"episode_id": "<episode_id>", "action": {"task_id": "task_1", "bug_type": "crash"}}'
```
### Python Client
Synchronous and asynchronous clients are provided for programmatic access; the synchronous client is shown here:
```python
from bug_triage_env.client import BugTriageEnvClient
from bug_triage_env.models import BugTriageAction
with BugTriageEnvClient("http://localhost:8000") as client:
    obs = client.reset(task_id="task_3")
    action = BugTriageAction(
        task_id="task_3",
        bug_type="security",
        priority="critical",
        assigned_developer="Bob",
        suggested_action="fix_immediately",
    )
    result = client.step(obs["episode_id"], action)
    print(f"Grader score: {result['grader_score']}")
```
---
## Tasks
The environment defines three tasks of increasing difficulty. Each task has a deterministic grader that returns a score in the range `[0.0, 1.0]`.
### Task 1: Bug Type Classification (Easy)
Given a bug report, classify it into one of six categories.
| Property | Value |
|---|---|
| Input | Bug title, description, logs, environment metadata |
| Output | `bug_type`: one of `crash`, `ui`, `performance`, `security`, `data_loss`, `compatibility` |
| Scoring | Exact match = 1.0; incorrect = 0.0 |
| Grader | [task1_grader.py](bug_triage_env/graders/task1_grader.py) |
### Task 2: Priority Assignment (Medium)
Given a bug report, assign the correct severity level.
| Property | Value |
|---|---|
| Input | Bug title, description, logs, environment metadata |
| Output | `priority`: one of `low`, `medium`, `high`, `critical` |
| Scoring | Exact = 1.0; 1 level off = 0.67; 2 levels = 0.33; 3 levels = 0.0 |
| Grader | [task2_grader.py](bug_triage_env/graders/task2_grader.py) |
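The distance-penalty scoring above is consistent with the linear formula `1 - distance / 3`, where `distance` is the number of levels between predicted and expected priority. A minimal sketch (the exact formula in `task2_grader.py` is an assumption inferred from the score table):

```python
# Assumed Task 2 scoring: 1 - distance/3, rounded to two decimals.
# This reproduces the table: exact=1.0, 1 off=0.67, 2 off=0.33, 3 off=0.0.
PRIORITY_ORDER = ["low", "medium", "high", "critical"]

def priority_score(predicted: str, expected: str) -> float:
    distance = abs(PRIORITY_ORDER.index(predicted) - PRIORITY_ORDER.index(expected))
    return round(1 - distance / 3, 2)
```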
### Task 3: Full Bug Triage (Hard)
Perform complete triage: classify the bug type, assign priority, route to the correct developer, and recommend an action.
| Property | Value |
|---|---|
| Output | `bug_type` + `priority` + `assigned_developer` + `suggested_action` |
| Developers | Alice (crash, performance), Bob (crash, security), Carol (UI, compatibility), David (security, data loss), Eve (UI, performance, compatibility) |
| Actions | `fix_immediately`, `schedule_sprint`, `needs_more_info`, `wontfix`, `duplicate` |
| Scoring | Weighted composite: `0.3 * type + 0.3 * priority + 0.2 * developer + 0.2 * action` |
| Grader | [task3_grader.py](bug_triage_env/graders/task3_grader.py) |
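The weighted composite in the table above can be sketched directly; here each component sub-score is assumed to be in `[0.0, 1.0]` (1.0 for a correct field, and for priority possibly a partial score from the Task 2 distance penalty):

```python
def composite_score(type_score: float, priority_score: float,
                    developer_score: float, action_score: float) -> float:
    """Weighted composite from the Task 3 table: 0.3/0.3/0.2/0.2."""
    return (0.3 * type_score + 0.3 * priority_score
            + 0.2 * developer_score + 0.2 * action_score)
```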
---
## API Reference
All endpoints conform to the [OpenEnv specification](https://github.com/OpenEnv-AI/openenv).
### Endpoints
| Method | Path | Description |
|---|---|---|
| `GET` | `/health` | Liveness probe. Returns `{"status": "healthy"}`. |
| `POST` | `/reset` | Start a new episode. Accepts optional `{"task_id": "task_1"}`. Returns an observation containing a bug report. |
| `POST` | `/step` | Submit a triage action. Requires `episode_id` and `action`. Returns observation with reward and grader score. |
| `GET` | `/state` | Returns metadata about active episodes. |
| `GET` | `/tasks` | Lists all available tasks with their action schemas. |
| `POST` | `/grader` | Re-grade a completed episode. Requires `episode_id` and `task_id`. |
| `POST` | `/baseline` | Trigger baseline inference (requires `OPENAI_API_KEY` or `GEMINI_API_KEY`). |
| `GET` | `/docs` | Auto-generated [Swagger UI](https://swagger.io/tools/swagger-ui/) documentation. |
### POST /reset
Request:
```json
{"task_id": "task_1"}
```
Response (abbreviated):
```json
{
"done": false,
"reward": 0.0,
"task_id": "task_1",
"episode_id": "abc123",
"step_number": 0,
"feedback": "New bug report received. Please triage.",
"available_developers": ["Alice", "Bob", "Carol", "David", "Eve"],
"bug_report": {
"bug_id": "BUG-001",
"title": "Application crashes on login with SSO enabled",
"description": "...",
"logs": "...",
"environment": "macOS 14.2, Chrome 120",
"reporter": "user_42",
"created_at": "2024-01-15T09:30:00Z",
"metadata": {}
}
}
```
### POST /step
Request:
```json
{
"episode_id": "abc123",
"action": {
"task_id": "task_3",
"bug_type": "crash",
"priority": "critical",
"assigned_developer": "Alice",
"suggested_action": "fix_immediately"
}
}
```
Response:
```json
{
"done": true,
"reward": 1.0,
"grader_score": 1.0,
"task_id": "task_3",
"feedback": "Grader score: 1.00 | Bug type: correct | Priority: correct | Developer: correct | Action: correct",
"step_number": 1,
"episode_id": "abc123"
}
```
---
## Observation Space
Each observation returned by `/reset` and `/step` contains the following fields:
| Field | Type | Description |
|---|---|---|
| `bug_report.bug_id` | string | Unique bug identifier (e.g., `BUG-001`) |
| `bug_report.title` | string | Short summary of the bug |
| `bug_report.description` | string | Detailed description of the issue |
| `bug_report.logs` | string or null | Error logs, stack traces, or crash output |
| `bug_report.environment` | string or null | OS, browser, hardware, and version details |
| `bug_report.reporter` | string | Username of the person who filed the bug |
| `bug_report.created_at` | string | ISO 8601 timestamp |
| `bug_report.metadata` | object | Additional context (component, affected users, regression flag) |
| `available_developers` | array of strings | The 5 developers available for routing |
| `done` | boolean | Whether the episode has ended |
| `reward` | float | Shaped reward signal for RL training |
| `grader_score` | float or null | Raw evaluation score in `[0.0, 1.0]` (null before stepping) |
| `episode_id` | string | Unique episode identifier |
| `step_number` | integer | Current step count (0 after reset, 1 after step) |
| `feedback` | string | Human-readable feedback about the triage result |
---
## Action Space
Actions are submitted as JSON objects to the `/step` endpoint. Required fields vary by task:
| Field | Type | Task 1 | Task 2 | Task 3 |
|---|---|---|---|---|
| `task_id` | string | Required | Required | Required |
| `bug_type` | string | Required | -- | Required |
| `priority` | string | -- | Required | Required |
| `assigned_developer` | string | -- | -- | Required |
| `suggested_action` | string | -- | -- | Required |
| `confidence` | float (0.0-1.0) | Optional | Optional | Optional |
| `reasoning` | string | Optional | Optional | Optional |
Valid values:
- **bug_type**: `crash`, `ui`, `performance`, `security`, `data_loss`, `compatibility`
- **priority**: `low`, `medium`, `high`, `critical`
- **assigned_developer**: `Alice`, `Bob`, `Carol`, `David`, `Eve`
- **suggested_action**: `fix_immediately`, `schedule_sprint`, `needs_more_info`, `wontfix`, `duplicate`
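A client-side sanity check against these valid values can catch malformed actions before they hit `/step`. The sketch below is illustrative, not the package's actual Pydantic models:

```python
# Illustrative validator for the action space above (not the server's own code).
VALID = {
    "bug_type": {"crash", "ui", "performance", "security", "data_loss", "compatibility"},
    "priority": {"low", "medium", "high", "critical"},
    "assigned_developer": {"Alice", "Bob", "Carol", "David", "Eve"},
    "suggested_action": {"fix_immediately", "schedule_sprint",
                         "needs_more_info", "wontfix", "duplicate"},
}

def validate_action(action: dict) -> list:
    """Return a list of error messages; an empty list means the action is valid."""
    errors = []
    for field, allowed in VALID.items():
        value = action.get(field)
        if value is not None and value not in allowed:
            errors.append(f"{field}: {value!r} is not an allowed value")
    confidence = action.get("confidence")
    if confidence is not None and not (0.0 <= confidence <= 1.0):
        errors.append("confidence must be between 0.0 and 1.0")
    return errors
```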
---
## Reward Design
The environment provides two distinct signals:
| Signal | Range | Purpose |
|---|---|---|
| **Grader Score** | `[0.0, 1.0]` | Deterministic evaluation metric for benchmarking |
| **Shaped Reward** | `[-0.5, 1.0]` | Continuous training signal optimized for [GRPO](https://arxiv.org/abs/2402.03300) |
The shaped reward is derived from the grader score using the following formula:
```
reward = (grader_score * 1.5) - 0.5 + calibration_bonus
```
This mapping (ignoring the calibration bonus) ensures:
- A score of 0.0 produces a reward of -0.5 (penalizes random guessing)
- A score of 1/3 (≈0.33) produces a reward of 0.0 (break-even point)
- A score of 1.0 produces a reward of 1.0 (maximum)
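The formula translates directly into code; this sketch omits the calibration bonus by default:

```python
def shaped_reward(grader_score: float, calibration_bonus: float = 0.0) -> float:
    """Linear map of grader score [0, 1] onto [-0.5, 1.0], plus calibration bonus."""
    return grader_score * 1.5 - 0.5 + calibration_bonus
```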
### Confidence Calibration
Agents may optionally submit a `confidence` value (float between 0.0 and 1.0) with their action. The environment applies a calibration bonus or penalty based on how well the agent's confidence aligns with its actual performance:
| Condition | Bonus | Description |
|---|---|---|
| Correct and confident (score >= 0.8, confidence >= 0.8) | +0.10 | Rewards agents that are confident and right |
| Wrong and overconfident (score < 0.5, confidence >= 0.8) | -0.15 | Penalizes dangerous overconfidence |
| Well-calibrated (absolute difference < 0.2) | +0.05 | Rewards honest uncertainty estimation |
| Poorly calibrated (absolute difference >= 0.2) | -0.05 | Penalizes miscalibrated confidence |
This mechanic introduces a genuine RL challenge: the agent must learn not only what is correct, but also when it is certain. In production bug triage, overconfident misrouting of a critical outage has severe downstream consequences.
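The calibration table can be read as an ordered rule set. The evaluation order below (confident-and-right, then overconfident-and-wrong, then the calibration-gap rules) is an assumption; the environment's actual precedence may differ:

```python
def calibration_bonus(score: float, confidence) -> float:
    """Sketch of the calibration table above; rule precedence is assumed."""
    if confidence is None:
        return 0.0  # no confidence submitted, no adjustment
    if score >= 0.8 and confidence >= 0.8:
        return 0.10   # correct and confident
    if score < 0.5 and confidence >= 0.8:
        return -0.15  # wrong and overconfident
    if abs(score - confidence) < 0.2:
        return 0.05   # well-calibrated
    return -0.05      # poorly calibrated
```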
---
## Baseline Agent
The baseline inference script supports two LLM providers with automatic fallback:
| Priority | Provider | Environment Variable | Default Model |
|---|---|---|---|
| Primary | [OpenAI](https://platform.openai.com/docs) | `OPENAI_API_KEY` | gpt-4o-mini |
| Fallback | [Google Gemini](https://ai.google.dev/gemini-api/docs) | `GEMINI_API_KEY` | gemini-2.5-flash |
| Last resort | Random | -- | Random valid action |
Both providers implement exponential backoff with retry logic for HTTP 429 (rate limit) and 503 (service unavailable) responses.
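The retry pattern looks roughly like the following minimal sketch; `RetryableError` stands in for the HTTP 429/503 responses mentioned above, and the exact delays and retry count are assumptions, not the baseline's actual values:

```python
import random
import time

class RetryableError(Exception):
    """Stands in for HTTP 429 (rate limit) / 503 (unavailable) in this sketch."""

def with_backoff(call, max_retries: int = 5, base_delay: float = 1.0):
    """Retry `call` with exponential backoff plus jitter on retryable errors."""
    for attempt in range(max_retries):
        try:
            return call()
        except RetryableError:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error to the caller
            # delay doubles each attempt; jitter avoids synchronized retries
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```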
### Running the Baseline
```bash
# Using OpenAI (required by hackathon spec)
export OPENAI_API_KEY="sk-..."
python -m bug_triage_env.baseline --all-tasks --episodes 5
# Using Gemini (free tier available at https://aistudio.google.com/apikey)
export GEMINI_API_KEY="AI..."
python -m bug_triage_env.baseline --all-tasks --episodes 5
# Single task with more episodes
python -m bug_triage_env.baseline --task task_1 --episodes 10
# JSON output
python -m bug_triage_env.baseline --all-tasks --json
```
### Baseline Scores
| Task | Mean Score | Range | Description |
|---|---|---|---|
| Task 1 (Easy) | 0.80 | 0.00 - 1.00 | Bug type classification |
| Task 2 (Medium) | 0.93 | 0.67 - 1.00 | Priority assignment |
| Task 3 (Hard) | 0.78 | 0.60 - 1.00 | Full triage pipeline |
| **Overall** | **0.84** | | Weighted average across all tasks |
Without any API key configured, the baseline falls back to random actions and achieves an average score of approximately 0.15.
### Hackathon Inference Script
The root-level [`inference.py`](inference.py) is the hackathon-mandated entry point. It:
- Uses the [OpenAI Python client](https://github.com/openai/openai-python) exclusively
- Reads `API_BASE_URL`, `MODEL_NAME`, and `HF_TOKEN` from the environment
- Emits structured `[START]`, `[STEP]`, and `[END]` logs to stdout
- Completes in under 20 minutes on 2 vCPU / 8 GB RAM
---
## Architecture
```
+------------------------------------------+
| FastAPI Server |
| +--------+ +--------+ +-----------+ |
| | /reset | | /step | | /grader | |
| +---+----+ +---+----+ +-----+-----+ |
| | | | |
| +---v-----------v--------------v-----+ |
| | BugTriageEnvironment | |
| | +----------+ +---------------+ | |
| | | Dataset | | Episode Store | | |
| | | 25 Bugs | | (thread-safe) | | |
| | +----------+ +---------------+ | |
| +--------------------+---------------+ |
| | |
| +--------------------v---------------+ |
| | Graders Registry | |
| | task1: exact match | |
| | task2: distance penalty | |
| | task3: weighted composite | |
| +------------------------------------+ |
+------------------------------------------+
^ ^
| HTTP | HTTP
+----+-----+ +----+----------+
| Client | | Baseline |
| (Python) | | OpenAI/Gemini |
+----------+ +---------------+
```
Key implementation details:
- **Thread safety**: The episode store uses Python `threading.Lock` to support concurrent requests from multiple agents.
- **Single-step episodes**: Each episode consists of one reset (observation) and one step (action). The episode terminates immediately after the step.
- **Deterministic grading**: All three graders produce identical scores for identical inputs. No randomness is involved in evaluation.
- **Dataset**: 25 bug reports stored in [`bugs.json`](bug_triage_env/data/bugs.json), covering crash reports, security vulnerabilities, performance regressions, UI glitches, data corruption, and compatibility issues.
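The thread-safe, single-step episode pattern can be sketched as follows; this is an illustration of the design, not the actual `environment.py` implementation:

```python
import threading
import uuid

class EpisodeStore:
    """Minimal thread-safe episode store illustrating the pattern above."""

    def __init__(self):
        self._lock = threading.Lock()
        self._episodes = {}

    def create(self, task_id: str) -> str:
        """Start a new episode (the /reset path) and return its id."""
        episode_id = uuid.uuid4().hex[:8]
        with self._lock:  # guard shared state against concurrent agents
            self._episodes[episode_id] = {"task_id": task_id, "done": False}
        return episode_id

    def finish(self, episode_id: str) -> dict:
        """Record the single step (the /step path); the episode ends immediately."""
        with self._lock:
            episode = self._episodes[episode_id]
            episode["done"] = True
            return episode
```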
---
## Deployment
### Docker
```bash
docker build -t bug-triage-env .
docker run -d -p 8000:8000 \
-e OPENAI_API_KEY="sk-..." \
bug-triage-env
curl http://localhost:8000/health
```
The Dockerfile uses Python 3.11-slim, installs only production dependencies, and includes a built-in health check.
### Hugging Face Spaces
The environment is deployed as a Docker-based [Hugging Face Space](https://huggingface.co/docs/hub/spaces):
```bash
pip install huggingface_hub
python3 -c "
from huggingface_hub import HfApi
api = HfApi()
api.create_repo(repo_id='<username>/bug-triage-openenv', repo_type='space', space_sdk='docker', exist_ok=True)
api.upload_folder(folder_path='.', repo_id='<username>/bug-triage-openenv', repo_type='space')
"
```
The live deployment is accessible at:
**[https://huggingface.co/spaces/savetrees/bug-triage-openenv](https://huggingface.co/spaces/savetrees/bug-triage-openenv)**
---
## Project Structure
```
bug-triage-openenv/
|-- README.md Documentation
|-- Dockerfile Production container (Python 3.11-slim)
|-- openenv.yaml OpenEnv environment manifest
|-- inference.py Hackathon inference entry point
|-- pyproject.toml Python package configuration
|-- requirements.txt Pinned production dependencies
|-- .dockerignore Files excluded from Docker build
|-- .gitignore Files excluded from version control
|
|-- bug_triage_env/ Main Python package
| |-- __init__.py Package initialization
| |-- models.py Pydantic v2 data models (Action, Observation, State)
| |-- client.py Synchronous and asynchronous HTTP client
| |-- baseline.py Dual-provider LLM baseline (OpenAI + Gemini)
| |
| |-- data/
| | |-- __init__.py Dataset loader
| | |-- bugs.json 25 curated real-world bug reports
| |
| |-- graders/
| | |-- __init__.py Grader registry
| | |-- task1_grader.py Bug classification grader (exact match)
| | |-- task2_grader.py Priority assignment grader (distance penalty)
| | |-- task3_grader.py Full triage grader (weighted composite)
| |
| |-- server/
| |-- __init__.py Server package initialization
| |-- app.py FastAPI application with all 8 endpoints
| |-- environment.py Core RL environment (reset, step, state)
```
---
## Environment Variables
| Variable | Required | Default | Description |
|---|---|---|---|
| `OPENAI_API_KEY` | For baseline | (none) | [OpenAI API key](https://platform.openai.com/api-keys) for primary baseline inference |
| `GEMINI_API_KEY` | For fallback | (none) | [Google Gemini API key](https://aistudio.google.com/apikey) for fallback inference |
| `API_BASE_URL` | For hackathon | `https://api.openai.com/v1` | LLM API endpoint (used by `inference.py`) |
| `MODEL_NAME` | For hackathon | `gpt-4o-mini` | Model identifier (used by `inference.py`) |
| `HF_TOKEN` | For hackathon | (none) | Hugging Face token (used by `inference.py`) |
| `PORT` | No | `8000` | Server port |
| `HOST` | No | `0.0.0.0` | Server bind address |
| `WORKERS` | No | `4` | Number of Uvicorn worker processes |
---
## References
- [OpenEnv Framework](https://github.com/OpenEnv-AI/openenv) -- Standardized RL environment specification
- [FastAPI](https://fastapi.tiangolo.com) -- High-performance Python web framework
- [Pydantic v2](https://docs.pydantic.dev/latest/) -- Data validation using Python type annotations
- [OpenAI API](https://platform.openai.com/docs) -- Primary LLM provider for baseline inference
- [Google Gemini API](https://ai.google.dev/gemini-api/docs) -- Fallback LLM provider
- [GRPO (Group Relative Policy Optimization)](https://arxiv.org/abs/2402.03300) -- RL training algorithm
- [Hugging Face Spaces](https://huggingface.co/docs/hub/spaces) -- Deployment platform for ML applications
- [Docker](https://docs.docker.com/) -- Containerization platform
---
## License
This project is licensed under the MIT License. See [LICENSE](LICENSE) for details.