---
title: Bug Triage OpenEnv
emoji: 🐛
colorFrom: red
colorTo: yellow
sdk: docker
app_port: 8000
tags:
  - openenv
---

# Bug Triage OpenEnv

A production-grade reinforcement learning environment for automated software bug triage, built on the [OpenEnv](https://github.com/OpenEnv-AI/openenv) framework.

| | |
|---|---|
| **Live Space** | [huggingface.co/spaces/savetrees/bug-triage-openenv](https://huggingface.co/spaces/savetrees/bug-triage-openenv) |
| **Repository** | [github.com/savetree-1/bug-triage-openenv](https://github.com/savetree-1/bug-triage-openenv) |
| **Framework** | [FastAPI](https://fastapi.tiangolo.com) + [Pydantic v2](https://docs.pydantic.dev/latest/) |
| **License** | MIT |

---

## Table of Contents

- [Overview](#overview)
- [Getting Started](#getting-started)
- [Tasks](#tasks)
- [API Reference](#api-reference)
- [Observation Space](#observation-space)
- [Action Space](#action-space)
- [Reward Design](#reward-design)
- [Baseline Agent](#baseline-agent)
- [Architecture](#architecture)
- [Deployment](#deployment)
- [Project Structure](#project-structure)
- [Environment Variables](#environment-variables)
- [References](#references)
- [License](#license)

---

## Overview

**Bug Triage OpenEnv** simulates a real-world issue tracking system (comparable to [Jira](https://www.atlassian.com/software/jira), [GitHub Issues](https://github.com/features/issues), or [Linear](https://linear.app)) where an AI agent must read incoming bug reports and make triage decisions:

1. **Classify** the bug type (crash, UI, security, performance, data loss, compatibility)
2. **Prioritize** the severity (low, medium, high, critical)
3. **Route** to the correct developer based on their domain expertise
4. **Recommend** the appropriate action (fix immediately, schedule for sprint, etc.)

The environment includes 25 carefully crafted bug reports drawn from real-world software engineering workflows, each designed to test different reasoning capabilities of frontier language models.
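Concretely, a single episode bundles the four judgments above into one structured action. As a minimal sketch (field names and values follow the action schema defined later in this README):

```python
# A full triage decision for Task 3; values are illustrative.
triage_action = {
    "task_id": "task_3",
    "bug_type": "crash",            # crash | ui | performance | security | data_loss | compatibility
    "priority": "critical",         # low | medium | high | critical
    "assigned_developer": "Alice",  # one of the five available developers
    "suggested_action": "fix_immediately",
}
```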
### Motivation

| Problem | Why It Matters |
|---|---|
| Every software company triages hundreds to thousands of bugs daily | High-volume, repetitive task ideal for automation |
| Manual triage costs senior engineering hours | Direct cost savings from accurate automation |
| Misrouted bugs cause cascading delays and outages | Incorrect triage has measurable downstream impact |
| Ambiguous bug reports require deep contextual reasoning | LLM agents must parse unstructured text and infer intent |

This environment was built for the [Meta x PyTorch Hackathon](https://pytorch.org/) and is designed for training RL agents via [GRPO](https://arxiv.org/abs/2402.03300) (Group Relative Policy Optimization).

---

## Getting Started

### Prerequisites

- [Python 3.10+](https://www.python.org/downloads/)
- [pip](https://pip.pypa.io/en/stable/)
- [Docker](https://docs.docker.com/get-docker/) (optional, for containerized deployment)

### Installation

```bash
git clone https://github.com/savetree-1/bug-triage-openenv.git
cd bug-triage-openenv
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

### Quick Start

Start the server:

```bash
uvicorn bug_triage_env.server.app:app --host 0.0.0.0 --port 8000
```

Verify that the server is running:

```bash
curl http://localhost:8000/health
```

Expected response:

```json
{"status": "healthy"}
```

Run a complete episode (reset, then step):

```bash
curl -X POST http://localhost:8000/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": "task_1"}'
```

Submit a triage action using the `episode_id` returned from `/reset`:

```bash
curl -X POST http://localhost:8000/step \
  -H "Content-Type: application/json" \
  -d '{"episode_id": "<episode_id>", "action": {"task_id": "task_1", "bug_type": "crash"}}'
```

### Python Client

Synchronous and asynchronous clients are provided for programmatic access:

```python
from bug_triage_env.client import BugTriageEnvClient
from bug_triage_env.models import BugTriageAction

with BugTriageEnvClient("http://localhost:8000") as client:
    obs = client.reset(task_id="task_3")
    action = BugTriageAction(
        task_id="task_3",
        bug_type="security",
        priority="critical",
        assigned_developer="Bob",
        suggested_action="fix_immediately",
    )
    result = client.step(obs["episode_id"], action)
    print(f"Grader score: {result['grader_score']}")
```

---

## Tasks

The environment defines three tasks of increasing difficulty. Each task has a deterministic grader that returns a score in the range `[0.0, 1.0]`.

### Task 1: Bug Type Classification (Easy)

Given a bug report, classify it into one of six categories.

| Property | Value |
|---|---|
| Input | Bug title, description, logs, environment metadata |
| Output | `bug_type`: one of `crash`, `ui`, `performance`, `security`, `data_loss`, `compatibility` |
| Scoring | Exact match = 1.0; incorrect = 0.0 |
| Grader | [task1_grader.py](bug_triage_env/graders/task1_grader.py) |

### Task 2: Priority Assignment (Medium)

Given a bug report, assign the correct severity level.

| Property | Value |
|---|---|
| Input | Bug title, description, logs, environment metadata |
| Output | `priority`: one of `low`, `medium`, `high`, `critical` |
| Scoring | Exact = 1.0; 1 level off = 0.67; 2 levels = 0.33; 3 levels = 0.0 |
| Grader | [task2_grader.py](bug_triage_env/graders/task2_grader.py) |

### Task 3: Full Bug Triage (Hard)

Perform complete triage: classify the bug type, assign priority, route to the correct developer, and recommend an action.
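Task 3's weighted composite (per the scoring row in the table that follows) can be sketched as a small helper. This is an illustrative approximation with hypothetical boolean inputs; the actual implementation lives in `task3_grader.py`:

```python
# Sketch of Task 3's weighted composite score.
# The `correct_*` flags are hypothetical inputs for illustration;
# the real grader compares the submitted action against the expected triage.
def composite_score(correct_type: bool, correct_priority: bool,
                    correct_developer: bool, correct_action: bool) -> float:
    return (0.3 * correct_type + 0.3 * correct_priority
            + 0.2 * correct_developer + 0.2 * correct_action)
```

A fully correct triage scores 1.0, while getting only the type and priority right scores 0.6.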
| Property | Value |
|---|---|
| Output | `bug_type` + `priority` + `assigned_developer` + `suggested_action` |
| Developers | Alice (crash, performance), Bob (crash, security), Carol (UI, compatibility), David (security, data loss), Eve (UI, performance, compatibility) |
| Actions | `fix_immediately`, `schedule_sprint`, `needs_more_info`, `wontfix`, `duplicate` |
| Scoring | Weighted composite: `0.3 * type + 0.3 * priority + 0.2 * developer + 0.2 * action` |
| Grader | [task3_grader.py](bug_triage_env/graders/task3_grader.py) |

---

## API Reference

All endpoints conform to the [OpenEnv specification](https://github.com/OpenEnv-AI/openenv).

### Endpoints

| Method | Path | Description |
|---|---|---|
| `GET` | `/health` | Liveness probe. Returns `{"status": "healthy"}`. |
| `POST` | `/reset` | Start a new episode. Accepts optional `{"task_id": "task_1"}`. Returns an observation containing a bug report. |
| `POST` | `/step` | Submit a triage action. Requires `episode_id` and `action`. Returns an observation with reward and grader score. |
| `GET` | `/state` | Returns metadata about active episodes. |
| `GET` | `/tasks` | Lists all available tasks with their action schemas. |
| `POST` | `/grader` | Re-grade a completed episode. Requires `episode_id` and `task_id`. |
| `POST` | `/baseline` | Trigger baseline inference (requires `OPENAI_API_KEY` or `GEMINI_API_KEY`). |
| `GET` | `/docs` | Auto-generated [Swagger UI](https://swagger.io/tools/swagger-ui/) documentation. |

### POST /reset

Request:

```json
{"task_id": "task_1"}
```

Response (abbreviated):

```json
{
  "done": false,
  "reward": 0.0,
  "task_id": "task_1",
  "episode_id": "abc123",
  "step_number": 0,
  "feedback": "New bug report received. Please triage.",
  "available_developers": ["Alice", "Bob", "Carol", "David", "Eve"],
  "bug_report": {
    "bug_id": "BUG-001",
    "title": "Application crashes on login with SSO enabled",
    "description": "...",
    "logs": "...",
    "environment": "macOS 14.2, Chrome 120",
    "reporter": "user_42",
    "created_at": "2024-01-15T09:30:00Z",
    "metadata": {}
  }
}
```

### POST /step

Request:

```json
{
  "episode_id": "abc123",
  "action": {
    "task_id": "task_3",
    "bug_type": "crash",
    "priority": "critical",
    "assigned_developer": "Alice",
    "suggested_action": "fix_immediately"
  }
}
```

Response:

```json
{
  "done": true,
  "reward": 1.0,
  "grader_score": 1.0,
  "task_id": "task_3",
  "feedback": "Grader score: 1.00 | Bug type: correct | Priority: correct | Developer: correct | Action: correct",
  "step_number": 1,
  "episode_id": "abc123"
}
```

---

## Observation Space

Each observation returned by `/reset` and `/step` contains the following fields:

| Field | Type | Description |
|---|---|---|
| `bug_report.bug_id` | string | Unique bug identifier (e.g., `BUG-001`) |
| `bug_report.title` | string | Short summary of the bug |
| `bug_report.description` | string | Detailed description of the issue |
| `bug_report.logs` | string or null | Error logs, stack traces, or crash output |
| `bug_report.environment` | string or null | OS, browser, hardware, and version details |
| `bug_report.reporter` | string | Username of the person who filed the bug |
| `bug_report.created_at` | string | ISO 8601 timestamp |
| `bug_report.metadata` | object | Additional context (component, affected users, regression flag) |
| `available_developers` | array of strings | The 5 developers available for routing |
| `done` | boolean | Whether the episode has ended |
| `reward` | float | Shaped reward signal for RL training |
| `grader_score` | float or null | Raw evaluation score in `[0.0, 1.0]` (null before stepping) |
| `episode_id` | string | Unique episode identifier |
| `step_number` | integer | Current step count (0 after reset, 1 after step) |
| `feedback` | string | Human-readable feedback about the triage result |

---

## Action Space

Actions are submitted as JSON objects to the `/step` endpoint. Required fields vary by task:

| Field | Type | Task 1 | Task 2 | Task 3 |
|---|---|---|---|---|
| `task_id` | string | Required | Required | Required |
| `bug_type` | string | Required | -- | Required |
| `priority` | string | -- | Required | Required |
| `assigned_developer` | string | -- | -- | Required |
| `suggested_action` | string | -- | -- | Required |
| `confidence` | float (0.0-1.0) | Optional | Optional | Optional |
| `reasoning` | string | Optional | Optional | Optional |

Valid values:

- **bug_type**: `crash`, `ui`, `performance`, `security`, `data_loss`, `compatibility`
- **priority**: `low`, `medium`, `high`, `critical`
- **assigned_developer**: `Alice`, `Bob`, `Carol`, `David`, `Eve`
- **suggested_action**: `fix_immediately`, `schedule_sprint`, `needs_more_info`, `wontfix`, `duplicate`

---

## Reward Design

The environment provides two distinct signals:

| Signal | Range | Purpose |
|---|---|---|
| **Grader Score** | `[0.0, 1.0]` | Deterministic evaluation metric for benchmarking |
| **Shaped Reward** | `[-0.5, 1.0]` | Continuous training signal optimized for [GRPO](https://arxiv.org/abs/2402.03300) |

The shaped reward is derived from the grader score using the following formula:

```
reward = (grader_score * 1.5) - 0.5 + calibration_bonus
```

This mapping ensures:

- A score of 0.0 produces a reward of -0.5 (penalizes random guessing)
- A score of 1/3 (≈0.33) produces a reward of 0.0 (breakeven point)
- A score of 1.0 produces a reward of 1.0 (maximum)

### Confidence Calibration

Agents may optionally submit a `confidence` value (float between 0.0 and 1.0) with their action.
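As a rough sketch, the shaping formula above behaves as follows (the calibration bonus, described next, is shown as a separate input defaulting to zero; this is illustrative, not the server's implementation):

```python
# Maps a grader score in [0.0, 1.0] to a base reward in [-0.5, 1.0],
# then adds any confidence-calibration bonus or penalty.
def shaped_reward(grader_score: float, calibration_bonus: float = 0.0) -> float:
    return (grader_score * 1.5) - 0.5 + calibration_bonus
```

For example, `shaped_reward(0.0)` yields -0.5 and `shaped_reward(1.0)` yields 1.0, matching the table above.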
The environment applies a calibration bonus or penalty based on how well the agent's confidence aligns with its actual performance:

| Condition | Bonus | Description |
|---|---|---|
| Correct and confident (score >= 0.8, confidence >= 0.8) | +0.10 | Rewards agents that are confident and right |
| Wrong and overconfident (score < 0.5, confidence >= 0.8) | -0.15 | Penalizes dangerous overconfidence |
| Well-calibrated (absolute difference < 0.2) | +0.05 | Rewards honest uncertainty estimation |
| Poorly calibrated (absolute difference >= 0.2) | -0.05 | Penalizes miscalibrated confidence |

This mechanic introduces a genuine RL challenge: the agent must learn not only what is correct, but also when it is certain. In production bug triage, overconfident misrouting of a critical outage has severe downstream consequences.

---

## Baseline Agent

The baseline inference script supports two LLM providers with automatic fallback:

| Priority | Provider | Environment Variable | Default Model |
|---|---|---|---|
| Primary | [OpenAI](https://platform.openai.com/docs) | `OPENAI_API_KEY` | gpt-4o-mini |
| Fallback | [Google Gemini](https://ai.google.dev/gemini-api/docs) | `GEMINI_API_KEY` | gemini-2.5-flash |
| Last resort | Random | -- | Random valid action |

Both providers implement exponential backoff with retry logic for HTTP 429 (rate limit) and 503 (service unavailable) responses.

### Running the Baseline

```bash
# Using OpenAI (required by hackathon spec)
export OPENAI_API_KEY="sk-..."
python -m bug_triage_env.baseline --all-tasks --episodes 5

# Using Gemini (free tier available at https://aistudio.google.com/apikey)
export GEMINI_API_KEY="AI..."
python -m bug_triage_env.baseline --all-tasks --episodes 5

# Single task with more episodes
python -m bug_triage_env.baseline --task task_1 --episodes 10

# JSON output
python -m bug_triage_env.baseline --all-tasks --json
```

### Baseline Scores

| Task | Mean Score | Range | Description |
|---|---|---|---|
| Task 1 (Easy) | 0.80 | 0.00 - 1.00 | Bug type classification |
| Task 2 (Medium) | 0.93 | 0.67 - 1.00 | Priority assignment |
| Task 3 (Hard) | 0.78 | 0.60 - 1.00 | Full triage pipeline |
| **Overall** | **0.84** | | Weighted average across all tasks |

Without any API key configured, the baseline falls back to random actions and achieves an average score of approximately 0.15.

### Hackathon Inference Script

The root-level [`inference.py`](inference.py) is the hackathon-mandated entry point. It:

- Uses the [OpenAI Python client](https://github.com/openai/openai-python) exclusively
- Reads `API_BASE_URL`, `MODEL_NAME`, and `HF_TOKEN` from the environment
- Emits structured `[START]`, `[STEP]`, and `[END]` logs to stdout
- Completes in under 20 minutes on 2 vCPU / 8 GB RAM

---

## Architecture

```
+------------------------------------------+
|              FastAPI Server              |
|  +--------+  +--------+  +-----------+   |
|  | /reset |  | /step  |  |  /grader  |   |
|  +---+----+  +---+----+  +-----+-----+   |
|      |           |             |         |
|  +---v-----------v-------------v------+  |
|  |       BugTriageEnvironment         |  |
|  |  +----------+  +---------------+   |  |
|  |  | Dataset  |  | Episode Store |   |  |
|  |  | 25 Bugs  |  | (thread-safe) |   |  |
|  |  +----------+  +---------------+   |  |
|  +-----------------+------------------+  |
|                    |                     |
|  +-----------------v------------------+  |
|  |         Graders Registry           |  |
|  |  task1: exact match                |  |
|  |  task2: distance penalty           |  |
|  |  task3: weighted composite         |  |
|  +------------------------------------+  |
+------------------------------------------+
        ^                    ^
        | HTTP               | HTTP
   +----+-----+      +-------+-------+
   |  Client  |      |   Baseline    |
   | (Python) |      | OpenAI/Gemini |
   +----------+      +---------------+
```

Key implementation details:

- **Thread safety**: The episode store uses Python `threading.Lock` to support concurrent requests from multiple agents.
- **Single-step episodes**: Each episode consists of one reset (observation) and one step (action). The episode terminates immediately after the step.
- **Deterministic grading**: All three graders produce identical scores for identical inputs. No randomness is involved in evaluation.
- **Dataset**: 25 bug reports stored in [`bugs.json`](bug_triage_env/data/bugs.json), covering crash reports, security vulnerabilities, performance regressions, UI glitches, data corruption, and compatibility issues.

---

## Deployment

### Docker

```bash
docker build -t bug-triage-env .

docker run -d -p 8000:8000 \
  -e OPENAI_API_KEY="sk-..." \
  bug-triage-env

curl http://localhost:8000/health
```

The Dockerfile uses Python 3.11-slim, installs only production dependencies, and includes a built-in health check.

### Hugging Face Spaces

The environment is deployed as a Docker-based [Hugging Face Space](https://huggingface.co/docs/hub/spaces):

```bash
pip install huggingface_hub
python3 -c "
from huggingface_hub import HfApi
api = HfApi()
api.create_repo(repo_id='<username>/bug-triage-openenv', repo_type='space', space_sdk='docker', exist_ok=True)
api.upload_folder(folder_path='.', repo_id='<username>/bug-triage-openenv', repo_type='space')
"
```

The live deployment is accessible at:

**[https://huggingface.co/spaces/savetrees/bug-triage-openenv](https://huggingface.co/spaces/savetrees/bug-triage-openenv)**

---

## Project Structure

```
bug-triage-openenv/
|-- README.md               Documentation
|-- Dockerfile              Production container (Python 3.11-slim)
|-- openenv.yaml            OpenEnv environment manifest
|-- inference.py            Hackathon inference entry point
|-- pyproject.toml          Python package configuration
|-- requirements.txt        Pinned production dependencies
|-- .dockerignore           Files excluded from Docker build
|-- .gitignore              Files excluded from version control
|
|-- bug_triage_env/         Main Python package
|   |-- __init__.py         Package initialization
|   |-- models.py           Pydantic v2 data models (Action, Observation, State)
|   |-- client.py           Synchronous and asynchronous HTTP client
|   |-- baseline.py         Dual-provider LLM baseline (OpenAI + Gemini)
|   |
|   |-- data/
|   |   |-- __init__.py     Dataset loader
|   |   |-- bugs.json       25 curated real-world bug reports
|   |
|   |-- graders/
|   |   |-- __init__.py     Grader registry
|   |   |-- task1_grader.py Bug classification grader (exact match)
|   |   |-- task2_grader.py Priority assignment grader (distance penalty)
|   |   |-- task3_grader.py Full triage grader (weighted composite)
|   |
|   |-- server/
|       |-- __init__.py     Server package initialization
|       |-- app.py          FastAPI application with all 8 endpoints
|       |-- environment.py  Core RL environment (reset, step, state)
```

---

## Environment Variables

| Variable | Required | Default | Description |
|---|---|---|---|
| `OPENAI_API_KEY` | For baseline | (none) | [OpenAI API key](https://platform.openai.com/api-keys) for primary baseline inference |
| `GEMINI_API_KEY` | For fallback | (none) | [Google Gemini API key](https://aistudio.google.com/apikey) for fallback inference |
| `API_BASE_URL` | For hackathon | `https://api.openai.com/v1` | LLM API endpoint (used by `inference.py`) |
| `MODEL_NAME` | For hackathon | `gpt-4o-mini` | Model identifier (used by `inference.py`) |
| `HF_TOKEN` | For hackathon | (none) | Hugging Face token (used by `inference.py`) |
| `PORT` | No | `8000` | Server port |
| `HOST` | No | `0.0.0.0` | Server bind address |
| `WORKERS` | No | `4` | Number of Uvicorn worker processes |

---

## References

- [OpenEnv Framework](https://github.com/OpenEnv-AI/openenv) -- Standardized RL environment specification
- [FastAPI](https://fastapi.tiangolo.com) -- High-performance Python web framework
- [Pydantic v2](https://docs.pydantic.dev/latest/) -- Data validation using Python type annotations
- [OpenAI API](https://platform.openai.com/docs) -- Primary LLM provider for baseline inference
- [Google Gemini API](https://ai.google.dev/gemini-api/docs) -- Fallback LLM provider
- [GRPO (Group Relative Policy Optimization)](https://arxiv.org/abs/2402.03300) -- RL training algorithm
- [Hugging Face Spaces](https://huggingface.co/docs/hub/spaces) -- Deployment platform for ML applications
- [Docker](https://docs.docker.com/) -- Containerization platform

---

## License

This project is licensed under the MIT License. See [LICENSE](LICENSE) for details.