Spaces:

savetrees
/

bug-triage-openenv

Running

File size: 19,938 Bytes

---
title: Bug Triage OpenEnv
emoji: 🐛
colorFrom: red
colorTo: yellow
sdk: docker
app_port: 8000
tags:
  - openenv
---

# Bug Triage OpenEnv

A production-grade reinforcement learning environment for automated software bug triage, built on the [OpenEnv](https://github.com/OpenEnv-AI/openenv) framework.

| | |
|---|---|
| **Live Space** | [huggingface.co/spaces/savetrees/bug-triage-openenv](https://huggingface.co/spaces/savetrees/bug-triage-openenv) |
| **Repository** | [github.com/savetree-1/bug-triage-openenv](https://github.com/savetree-1/bug-triage-openenv) |
| **Framework** | [FastAPI](https://fastapi.tiangolo.com) + [Pydantic v2](https://docs.pydantic.dev/latest/) |
| **License** | MIT |

---

## Table of Contents

- [Overview](#overview)
- [Getting Started](#getting-started)
- [Tasks](#tasks)
- [API Reference](#api-reference)
- [Observation Space](#observation-space)
- [Action Space](#action-space)
- [Reward Design](#reward-design)
- [Baseline Agent](#baseline-agent)
- [Architecture](#architecture)
- [Deployment](#deployment)
- [Project Structure](#project-structure)
- [References](#references)

---

## Overview

**Bug Triage OpenEnv** simulates a real-world issue tracking system (comparable to [Jira](https://www.atlassian.com/software/jira), [GitHub Issues](https://github.com/features/issues), or [Linear](https://linear.app)) where an AI agent must read incoming bug reports and make triage decisions:

1. **Classify** the bug type (crash, UI, security, performance, data loss, compatibility)
2. **Prioritize** the severity (low, medium, high, critical)
3. **Route** to the correct developer based on their domain expertise
4. **Recommend** the appropriate action (fix immediately, schedule for sprint, etc.)

The environment includes 25 carefully crafted bug reports drawn from real-world software engineering workflows, each designed to test different reasoning capabilities of frontier language models.

### Motivation

| Problem | Why It Matters |
|---|---|
| Every software company triages hundreds to thousands of bugs daily | High-volume, repetitive task ideal for automation |
| Manual triage costs senior engineering hours | Direct cost savings from accurate automation |
| Misrouted bugs cause cascading delays and outages | Incorrect triage has measurable downstream impact |
| Ambiguous bug reports require deep contextual reasoning | LLM agents must parse unstructured text and infer intent |

This environment was built for the [Meta x PyTorch Hackathon](https://pytorch.org/) and is designed for training RL agents via [GRPO](https://arxiv.org/abs/2402.03300) (Group Relative Policy Optimization).

---

## Getting Started

### Prerequisites

- [Python 3.10+](https://www.python.org/downloads/)
- [pip](https://pip.pypa.io/en/stable/)
- [Docker](https://docs.docker.com/get-docker/) (optional, for containerized deployment)

### Installation

```bash
git clone https://github.com/savetree-1/bug-triage-openenv.git
cd bug-triage-openenv

python -m venv .venv
source .venv/bin/activate

pip install -r requirements.txt
```

### Quick Start

Start the server:

```bash
uvicorn bug_triage_env.server.app:app --host 0.0.0.0 --port 8000
```

Verify that the server is running:

```bash
curl http://localhost:8000/health
```

Expected response:

```json
{"status": "healthy"}
```

Run a complete episode (reset, then step):

```bash
curl -X POST http://localhost:8000/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": "task_1"}'
```

Submit a triage action using the `episode_id` returned from `/reset`:

```bash
curl -X POST http://localhost:8000/step \
  -H "Content-Type: application/json" \
  -d '{"episode_id": "<episode_id>", "action": {"task_id": "task_1", "bug_type": "crash"}}'
```

### Python Client

A synchronous and asynchronous client is provided for programmatic access:

```python
from bug_triage_env.client import BugTriageEnvClient
from bug_triage_env.models import BugTriageAction

with BugTriageEnvClient("http://localhost:8000") as client:
    obs = client.reset(task_id="task_3")

    action = BugTriageAction(
        task_id="task_3",
        bug_type="security",
        priority="critical",
        assigned_developer="Bob",
        suggested_action="fix_immediately",
    )

    result = client.step(obs["episode_id"], action)
    print(f"Grader score: {result['grader_score']}")
```

---

## Tasks

The environment defines three tasks of increasing difficulty. Each task has a deterministic grader that returns a score in the range `[0.0, 1.0]`.

### Task 1: Bug Type Classification (Easy)

Given a bug report, classify it into one of six categories.

| Property | Value |
|---|---|
| Input | Bug title, description, logs, environment metadata |
| Output | `bug_type`: one of `crash`, `ui`, `performance`, `security`, `data_loss`, `compatibility` |
| Scoring | Exact match = 1.0; incorrect = 0.0 |
| Grader | [task1_grader.py](bug_triage_env/graders/task1_grader.py) |

### Task 2: Priority Assignment (Medium)

Given a bug report, assign the correct severity level.

| Property | Value |
|---|---|
| Input | Bug title, description, logs, environment metadata |
| Output | `priority`: one of `low`, `medium`, `high`, `critical` |
| Scoring | Exact = 1.0; 1 level off = 0.67; 2 levels = 0.33; 3 levels = 0.0 |
| Grader | [task2_grader.py](bug_triage_env/graders/task2_grader.py) |

### Task 3: Full Bug Triage (Hard)

Perform complete triage: classify the bug type, assign priority, route to the correct developer, and recommend an action.

| Property | Value |
|---|---|
| Output | `bug_type` + `priority` + `assigned_developer` + `suggested_action` |
| Developers | Alice (crash, performance), Bob (crash, security), Carol (UI, compatibility), David (security, data loss), Eve (UI, performance, compatibility) |
| Actions | `fix_immediately`, `schedule_sprint`, `needs_more_info`, `wontfix`, `duplicate` |
| Scoring | Weighted composite: `0.3 * type + 0.3 * priority + 0.2 * developer + 0.2 * action` |
| Grader | [task3_grader.py](bug_triage_env/graders/task3_grader.py) |

---

## API Reference

All endpoints conform to the [OpenEnv specification](https://github.com/OpenEnv-AI/openenv).

### Endpoints

| Method | Path | Description |
|---|---|---|
| `GET` | `/health` | Liveness probe. Returns `{"status": "healthy"}`. |
| `POST` | `/reset` | Start a new episode. Accepts optional `{"task_id": "task_1"}`. Returns an observation containing a bug report. |
| `POST` | `/step` | Submit a triage action. Requires `episode_id` and `action`. Returns observation with reward and grader score. |
| `GET` | `/state` | Returns metadata about active episodes. |
| `GET` | `/tasks` | Lists all available tasks with their action schemas. |
| `POST` | `/grader` | Re-grade a completed episode. Requires `episode_id` and `task_id`. |
| `POST` | `/baseline` | Trigger baseline inference (requires `OPENAI_API_KEY` or `GEMINI_API_KEY`). |
| `GET` | `/docs` | Auto-generated [Swagger UI](https://swagger.io/tools/swagger-ui/) documentation. |

### POST /reset

Request:

```json
{"task_id": "task_1"}
```

Response (abbreviated):

```json
{
  "done": false,
  "reward": 0.0,
  "task_id": "task_1",
  "episode_id": "abc123",
  "step_number": 0,
  "feedback": "New bug report received. Please triage.",
  "available_developers": ["Alice", "Bob", "Carol", "David", "Eve"],
  "bug_report": {
    "bug_id": "BUG-001",
    "title": "Application crashes on login with SSO enabled",
    "description": "...",
    "logs": "...",
    "environment": "macOS 14.2, Chrome 120",
    "reporter": "user_42",
    "created_at": "2024-01-15T09:30:00Z",
    "metadata": {}
  }
}
```

### POST /step

Request:

```json
{
  "episode_id": "abc123",
  "action": {
    "task_id": "task_3",
    "bug_type": "crash",
    "priority": "critical",
    "assigned_developer": "Alice",
    "suggested_action": "fix_immediately"
  }
}
```

Response:

```json
{
  "done": true,
  "reward": 1.0,
  "grader_score": 1.0,
  "task_id": "task_3",
  "feedback": "Grader score: 1.00 | Bug type: correct | Priority: correct | Developer: correct | Action: correct",
  "step_number": 1,
  "episode_id": "abc123"
}
```

---

## Observation Space

Each observation returned by `/reset` and `/step` contains the following fields:

| Field | Type | Description |
|---|---|---|
| `bug_report.bug_id` | string | Unique bug identifier (e.g., `BUG-001`) |
| `bug_report.title` | string | Short summary of the bug |
| `bug_report.description` | string | Detailed description of the issue |
| `bug_report.logs` | string or null | Error logs, stack traces, or crash output |
| `bug_report.environment` | string or null | OS, browser, hardware, and version details |
| `bug_report.reporter` | string | Username of the person who filed the bug |
| `bug_report.created_at` | string | ISO 8601 timestamp |
| `bug_report.metadata` | object | Additional context (component, affected users, regression flag) |
| `available_developers` | array of strings | The 5 developers available for routing |
| `done` | boolean | Whether the episode has ended |
| `reward` | float | Shaped reward signal for RL training |
| `grader_score` | float or null | Raw evaluation score in `[0.0, 1.0]` (null before stepping) |
| `episode_id` | string | Unique episode identifier |
| `step_number` | integer | Current step count (0 after reset, 1 after step) |
| `feedback` | string | Human-readable feedback about the triage result |

---

## Action Space

Actions are submitted as JSON objects to the `/step` endpoint. Required fields vary by task:

| Field | Type | Task 1 | Task 2 | Task 3 |
|---|---|---|---|---|
| `task_id` | string | Required | Required | Required |
| `bug_type` | string | Required | -- | Required |
| `priority` | string | -- | Required | Required |
| `assigned_developer` | string | -- | -- | Required |
| `suggested_action` | string | -- | -- | Required |
| `confidence` | float (0.0-1.0) | Optional | Optional | Optional |
| `reasoning` | string | Optional | Optional | Optional |

Valid values:

- **bug_type**: `crash`, `ui`, `performance`, `security`, `data_loss`, `compatibility`
- **priority**: `low`, `medium`, `high`, `critical`
- **assigned_developer**: `Alice`, `Bob`, `Carol`, `David`, `Eve`
- **suggested_action**: `fix_immediately`, `schedule_sprint`, `needs_more_info`, `wontfix`, `duplicate`

---

## Reward Design

The environment provides two distinct signals:

| Signal | Range | Purpose |
|---|---|---|
| **Grader Score** | `[0.0, 1.0]` | Deterministic evaluation metric for benchmarking |
| **Shaped Reward** | `[-0.5, 1.0]` | Continuous training signal optimized for [GRPO](https://arxiv.org/abs/2402.03300) |

The shaped reward is derived from the grader score using the following formula:

```
reward = (grader_score * 1.5) - 0.5 + calibration_bonus
```

This mapping ensures:
- A score of 0.0 produces a reward of -0.5 (penalizes random guessing)
- A score of 0.33 produces a reward of 0.0 (breakeven point)
- A score of 1.0 produces a reward of 1.0 (maximum)

### Confidence Calibration

Agents may optionally submit a `confidence` value (float between 0.0 and 1.0) with their action. The environment applies a calibration bonus or penalty based on how well the agent's confidence aligns with its actual performance:

| Condition | Bonus | Description |
|---|---|---|
| Correct and confident (score >= 0.8, confidence >= 0.8) | +0.10 | Rewards agents that are confident and right |
| Wrong and overconfident (score < 0.5, confidence >= 0.8) | -0.15 | Penalizes dangerous overconfidence |
| Well-calibrated (absolute difference < 0.2) | +0.05 | Rewards honest uncertainty estimation |
| Poorly calibrated (absolute difference >= 0.2) | -0.05 | Penalizes miscalibrated confidence |

This mechanic introduces a genuine RL challenge: the agent must learn not only what is correct, but also when it is certain. In production bug triage, overconfident misrouting of a critical outage has severe downstream consequences.

---

## Baseline Agent

The baseline inference script supports two LLM providers with automatic fallback:

| Priority | Provider | Environment Variable | Default Model |
|---|---|---|---|
| Primary | [OpenAI](https://platform.openai.com/docs) | `OPENAI_API_KEY` | gpt-4o-mini |
| Fallback | [Google Gemini](https://ai.google.dev/gemini-api/docs) | `GEMINI_API_KEY` | gemini-2.5-flash |
| Last resort | Random | -- | Random valid action |

Both providers implement exponential backoff with retry logic for HTTP 429 (rate limit) and 503 (service unavailable) responses.

### Running the Baseline

```bash
# Using OpenAI (required by hackathon spec)
export OPENAI_API_KEY="sk-..."
python -m bug_triage_env.baseline --all-tasks --episodes 5

# Using Gemini (free tier available at https://aistudio.google.com/apikey)
export GEMINI_API_KEY="AI..."
python -m bug_triage_env.baseline --all-tasks --episodes 5

# Single task with more episodes
python -m bug_triage_env.baseline --task task_1 --episodes 10

# JSON output
python -m bug_triage_env.baseline --all-tasks --json
```

### Baseline Scores

| Task | Mean Score | Range | Description |
|---|---|---|---|
| Task 1 (Easy) | 0.80 | 0.00 - 1.00 | Bug type classification |
| Task 2 (Medium) | 0.93 | 0.67 - 1.00 | Priority assignment |
| Task 3 (Hard) | 0.78 | 0.60 - 1.00 | Full triage pipeline |
| **Overall** | **0.84** | | Weighted average across all tasks |

Without any API key configured, the baseline falls back to random actions and achieves an average score of approximately 0.15.

### Hackathon Inference Script

The root-level [`inference.py`](inference.py) is the hackathon-mandated entry point. It:

- Uses the [OpenAI Python client](https://github.com/openai/openai-python) exclusively
- Reads `API_BASE_URL`, `MODEL_NAME`, and `HF_TOKEN` from the environment
- Emits structured `[START]`, `[STEP]`, and `[END]` logs to stdout
- Completes in under 20 minutes on 2 vCPU / 8 GB RAM

---

## Architecture

```
+------------------------------------------+
|              FastAPI Server              |
|  +--------+  +--------+  +-----------+  |
|  | /reset |  | /step  |  | /grader   |  |
|  +---+----+  +---+----+  +-----+-----+  |
|      |           |              |        |
|  +---v-----------v--------------v-----+  |
|  |      BugTriageEnvironment          |  |
|  |  +----------+  +---------------+   |  |
|  |  | Dataset  |  | Episode Store |   |  |
|  |  | 25 Bugs  |  | (thread-safe) |   |  |
|  |  +----------+  +---------------+   |  |
|  +--------------------+---------------+  |
|                       |                  |
|  +--------------------v---------------+  |
|  |         Graders Registry           |  |
|  |  task1: exact match                |  |
|  |  task2: distance penalty           |  |
|  |  task3: weighted composite         |  |
|  +------------------------------------+  |
+------------------------------------------+
         ^                      ^
         | HTTP                 | HTTP
    +----+-----+          +----+----------+
    |  Client  |          |   Baseline    |
    | (Python) |          | OpenAI/Gemini |
    +----------+          +---------------+
```

Key implementation details:

- **Thread safety**: The episode store uses Python `threading.Lock` to support concurrent requests from multiple agents.
- **Single-step episodes**: Each episode consists of one reset (observation) and one step (action). The episode terminates immediately after the step.
- **Deterministic grading**: All three graders produce identical scores for identical inputs. No randomness is involved in evaluation.
- **Dataset**: 25 bug reports stored in [`bugs.json`](bug_triage_env/data/bugs.json), covering crash reports, security vulnerabilities, performance regressions, UI glitches, data corruption, and compatibility issues.

---

## Deployment

### Docker

```bash
docker build -t bug-triage-env .

docker run -d -p 8000:8000 \
  -e OPENAI_API_KEY="sk-..." \
  bug-triage-env

curl http://localhost:8000/health
```

The Dockerfile uses Python 3.11-slim, installs only production dependencies, and includes a built-in health check.

### Hugging Face Spaces

The environment is deployed as a Docker-based [Hugging Face Space](https://huggingface.co/docs/hub/spaces):

```bash
pip install huggingface_hub
python3 -c "
from huggingface_hub import HfApi
api = HfApi()
api.create_repo(repo_id='<username>/bug-triage-openenv', repo_type='space', space_sdk='docker', exist_ok=True)
api.upload_folder(folder_path='.', repo_id='<username>/bug-triage-openenv', repo_type='space')
"
```

The live deployment is accessible at:
**[https://huggingface.co/spaces/savetrees/bug-triage-openenv](https://huggingface.co/spaces/savetrees/bug-triage-openenv)**

---

## Project Structure

```
bug-triage-openenv/
|-- README.md                    Documentation
|-- Dockerfile                   Production container (Python 3.11-slim)
|-- openenv.yaml                 OpenEnv environment manifest
|-- inference.py                 Hackathon inference entry point
|-- pyproject.toml               Python package configuration
|-- requirements.txt             Pinned production dependencies
|-- .dockerignore                Files excluded from Docker build
|-- .gitignore                   Files excluded from version control
|
|-- bug_triage_env/              Main Python package
|   |-- __init__.py              Package initialization
|   |-- models.py                Pydantic v2 data models (Action, Observation, State)
|   |-- client.py                Synchronous and asynchronous HTTP client
|   |-- baseline.py              Dual-provider LLM baseline (OpenAI + Gemini)
|   |
|   |-- data/
|   |   |-- __init__.py          Dataset loader
|   |   |-- bugs.json            25 curated real-world bug reports
|   |
|   |-- graders/
|   |   |-- __init__.py          Grader registry
|   |   |-- task1_grader.py      Bug classification grader (exact match)
|   |   |-- task2_grader.py      Priority assignment grader (distance penalty)
|   |   |-- task3_grader.py      Full triage grader (weighted composite)
|   |
|   |-- server/
|       |-- __init__.py          Server package initialization
|       |-- app.py               FastAPI application with all 8 endpoints
|       |-- environment.py       Core RL environment (reset, step, state)
```

---

## Environment Variables

| Variable | Required | Default | Description |
|---|---|---|---|
| `OPENAI_API_KEY` | For baseline | (none) | [OpenAI API key](https://platform.openai.com/api-keys) for primary baseline inference |
| `GEMINI_API_KEY` | For fallback | (none) | [Google Gemini API key](https://aistudio.google.com/apikey) for fallback inference |
| `API_BASE_URL` | For hackathon | `https://api.openai.com/v1` | LLM API endpoint (used by `inference.py`) |
| `MODEL_NAME` | For hackathon | `gpt-4o-mini` | Model identifier (used by `inference.py`) |
| `HF_TOKEN` | For hackathon | (none) | Hugging Face token (used by `inference.py`) |
| `PORT` | No | `8000` | Server port |
| `HOST` | No | `0.0.0.0` | Server bind address |
| `WORKERS` | No | `4` | Number of Uvicorn worker processes |

---

## References

- [OpenEnv Framework](https://github.com/OpenEnv-AI/openenv) -- Standardized RL environment specification
- [FastAPI](https://fastapi.tiangolo.com) -- High-performance Python web framework
- [Pydantic v2](https://docs.pydantic.dev/latest/) -- Data validation using Python type annotations
- [OpenAI API](https://platform.openai.com/docs) -- Primary LLM provider for baseline inference
- [Google Gemini API](https://ai.google.dev/gemini-api/docs) -- Fallback LLM provider
- [GRPO (Group Relative Policy Optimization)](https://arxiv.org/abs/2402.03300) -- RL training algorithm
- [Hugging Face Spaces](https://huggingface.co/docs/hub/spaces) -- Deployment platform for ML applications
- [Docker](https://docs.docker.com/) -- Containerization platform

---

## License

This project is licensed under the MIT License. See [LICENSE](LICENSE) for details.