---
title: Bug Triage OpenEnv
emoji: 🐛
colorFrom: red
colorTo: yellow
sdk: docker
app_port: 8000
tags:
  - openenv
---

# Bug Triage OpenEnv

A production-grade reinforcement learning environment for automated software bug triage, built on the OpenEnv framework.



## Overview

Bug Triage OpenEnv simulates a real-world issue tracking system (comparable to Jira, GitHub Issues, or Linear) where an AI agent must read incoming bug reports and make triage decisions:

  1. Classify the bug type (crash, UI, security, performance, data loss, compatibility)
  2. Prioritize the severity (low, medium, high, critical)
  3. Route to the correct developer based on their domain expertise
  4. Recommend the appropriate action (fix immediately, schedule for sprint, etc.)

The environment includes 25 carefully crafted bug reports drawn from real-world software engineering workflows, each designed to test different reasoning capabilities of frontier language models.

## Motivation

| Problem | Why It Matters |
| --- | --- |
| Every software company triages hundreds to thousands of bugs daily | High-volume, repetitive task ideal for automation |
| Manual triage costs senior engineering hours | Direct cost savings from accurate automation |
| Misrouted bugs cause cascading delays and outages | Incorrect triage has measurable downstream impact |
| Ambiguous bug reports require deep contextual reasoning | LLM agents must parse unstructured text and infer intent |

This environment was built for the Meta x PyTorch Hackathon and is designed for training RL agents via GRPO (Group Relative Policy Optimization).


## Getting Started

### Prerequisites

### Installation

```bash
git clone https://github.com/savetree-1/bug-triage-openenv.git
cd bug-triage-openenv

python -m venv .venv
source .venv/bin/activate

pip install -r requirements.txt
```

### Quick Start

Start the server:

```bash
uvicorn bug_triage_env.server.app:app --host 0.0.0.0 --port 8000
```

Verify that the server is running:

```bash
curl http://localhost:8000/health
```

Expected response:

```json
{"status": "healthy"}
```

Run a complete episode (reset, then step):

```bash
curl -X POST http://localhost:8000/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": "task_1"}'
```

Submit a triage action using the `episode_id` returned from `/reset`:

```bash
curl -X POST http://localhost:8000/step \
  -H "Content-Type: application/json" \
  -d '{"episode_id": "<episode_id>", "action": {"task_id": "task_1", "bug_type": "crash"}}'
```

### Python Client

Synchronous and asynchronous clients are provided for programmatic access:

```python
from bug_triage_env.client import BugTriageEnvClient
from bug_triage_env.models import BugTriageAction

with BugTriageEnvClient("http://localhost:8000") as client:
    obs = client.reset(task_id="task_3")

    action = BugTriageAction(
        task_id="task_3",
        bug_type="security",
        priority="critical",
        assigned_developer="Bob",
        suggested_action="fix_immediately",
    )

    result = client.step(obs["episode_id"], action)
    print(f"Grader score: {result['grader_score']}")
```

## Tasks

The environment defines three tasks of increasing difficulty. Each task has a deterministic grader that returns a score in the range [0.0, 1.0].

### Task 1: Bug Type Classification (Easy)

Given a bug report, classify it into one of six categories.

| Property | Value |
| --- | --- |
| Input | Bug title, description, logs, environment metadata |
| Output | `bug_type`: one of `crash`, `ui`, `performance`, `security`, `data_loss`, `compatibility` |
| Scoring | Exact match = 1.0; incorrect = 0.0 |
| Grader | `task1_grader.py` |

### Task 2: Priority Assignment (Medium)

Given a bug report, assign the correct severity level.

| Property | Value |
| --- | --- |
| Input | Bug title, description, logs, environment metadata |
| Output | `priority`: one of `low`, `medium`, `high`, `critical` |
| Scoring | Exact = 1.0; 1 level off = 0.67; 2 levels = 0.33; 3 levels = 0.0 |
| Grader | `task2_grader.py` |
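The distance-based scoring above can be sketched as follows. This is an illustrative reimplementation, not the actual code in `task2_grader.py`, which may differ in detail:

```python
# Illustrative sketch of the Task 2 distance-penalty scoring: each level of
# distance between predicted and expected priority costs one third of the score.
PRIORITY_LEVELS = ["low", "medium", "high", "critical"]

def score_priority(predicted: str, expected: str) -> float:
    """Return 1.0 for an exact match, losing 1/3 per level of distance."""
    distance = abs(PRIORITY_LEVELS.index(predicted) - PRIORITY_LEVELS.index(expected))
    return round(max(0.0, 1.0 - distance / 3), 2)

print(score_priority("high", "critical"))  # one level off -> 0.67
```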

### Task 3: Full Bug Triage (Hard)

Perform complete triage: classify the bug type, assign priority, route to the correct developer, and recommend an action.

| Property | Value |
| --- | --- |
| Output | `bug_type` + `priority` + `assigned_developer` + `suggested_action` |
| Developers | Alice (crash, performance), Bob (crash, security), Carol (UI, compatibility), David (security, data loss), Eve (UI, performance, compatibility) |
| Actions | `fix_immediately`, `schedule_sprint`, `needs_more_info`, `wontfix`, `duplicate` |
| Scoring | Weighted composite: 0.3 * type + 0.3 * priority + 0.2 * developer + 0.2 * action |
| Grader | `task3_grader.py` |
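The weighted composite can be sketched as below. This is an illustration of the formula, not the actual `task3_grader.py`; it assumes each component score is 1.0 when correct and 0.0 otherwise (the priority component may be partial, per Task 2):

```python
# Illustrative sketch of the Task 3 weighted composite score.
WEIGHTS = {"bug_type": 0.3, "priority": 0.3,
           "assigned_developer": 0.2, "suggested_action": 0.2}

def composite_score(component_scores: dict) -> float:
    """Combine per-component scores (0.0-1.0) using the task weights."""
    return round(sum(WEIGHTS[k] * component_scores.get(k, 0.0) for k in WEIGHTS), 2)

# Everything correct except the suggested action:
print(composite_score({"bug_type": 1.0, "priority": 1.0,
                       "assigned_developer": 1.0, "suggested_action": 0.0}))  # 0.8
```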

## API Reference

All endpoints conform to the OpenEnv specification.

### Endpoints

| Method | Path | Description |
| --- | --- | --- |
| GET | `/health` | Liveness probe. Returns `{"status": "healthy"}`. |
| POST | `/reset` | Start a new episode. Accepts optional `{"task_id": "task_1"}`. Returns an observation containing a bug report. |
| POST | `/step` | Submit a triage action. Requires `episode_id` and `action`. Returns an observation with reward and grader score. |
| GET | `/state` | Returns metadata about active episodes. |
| GET | `/tasks` | Lists all available tasks with their action schemas. |
| POST | `/grader` | Re-grade a completed episode. Requires `episode_id` and `task_id`. |
| POST | `/baseline` | Trigger baseline inference (requires `OPENAI_API_KEY` or `GEMINI_API_KEY`). |
| GET | `/docs` | Auto-generated Swagger UI documentation. |

### POST /reset

Request:

```json
{"task_id": "task_1"}
```

Response (abbreviated):

```json
{
  "done": false,
  "reward": 0.0,
  "task_id": "task_1",
  "episode_id": "abc123",
  "step_number": 0,
  "feedback": "New bug report received. Please triage.",
  "available_developers": ["Alice", "Bob", "Carol", "David", "Eve"],
  "bug_report": {
    "bug_id": "BUG-001",
    "title": "Application crashes on login with SSO enabled",
    "description": "...",
    "logs": "...",
    "environment": "macOS 14.2, Chrome 120",
    "reporter": "user_42",
    "created_at": "2024-01-15T09:30:00Z",
    "metadata": {}
  }
}
```

### POST /step

Request:

```json
{
  "episode_id": "abc123",
  "action": {
    "task_id": "task_3",
    "bug_type": "crash",
    "priority": "critical",
    "assigned_developer": "Alice",
    "suggested_action": "fix_immediately"
  }
}
```

Response:

```json
{
  "done": true,
  "reward": 1.0,
  "grader_score": 1.0,
  "task_id": "task_3",
  "feedback": "Grader score: 1.00 | Bug type: correct | Priority: correct | Developer: correct | Action: correct",
  "step_number": 1,
  "episode_id": "abc123"
}
```

## Observation Space

Each observation returned by /reset and /step contains the following fields:

| Field | Type | Description |
| --- | --- | --- |
| `bug_report.bug_id` | string | Unique bug identifier (e.g., `BUG-001`) |
| `bug_report.title` | string | Short summary of the bug |
| `bug_report.description` | string | Detailed description of the issue |
| `bug_report.logs` | string or null | Error logs, stack traces, or crash output |
| `bug_report.environment` | string or null | OS, browser, hardware, and version details |
| `bug_report.reporter` | string | Username of the person who filed the bug |
| `bug_report.created_at` | string | ISO 8601 timestamp |
| `bug_report.metadata` | object | Additional context (component, affected users, regression flag) |
| `available_developers` | array of strings | The 5 developers available for routing |
| `done` | boolean | Whether the episode has ended |
| `reward` | float | Shaped reward signal for RL training |
| `grader_score` | float or null | Raw evaluation score in [0.0, 1.0] (null before stepping) |
| `episode_id` | string | Unique episode identifier |
| `step_number` | integer | Current step count (0 after reset, 1 after step) |
| `feedback` | string | Human-readable feedback about the triage result |

## Action Space

Actions are submitted as JSON objects to the /step endpoint. Required fields vary by task:

| Field | Type | Task 1 | Task 2 | Task 3 |
| --- | --- | --- | --- | --- |
| `task_id` | string | Required | Required | Required |
| `bug_type` | string | Required | -- | Required |
| `priority` | string | -- | Required | Required |
| `assigned_developer` | string | -- | -- | Required |
| `suggested_action` | string | -- | -- | Required |
| `confidence` | float (0.0-1.0) | Optional | Optional | Optional |
| `reasoning` | string | Optional | Optional | Optional |

Valid values:

- `bug_type`: `crash`, `ui`, `performance`, `security`, `data_loss`, `compatibility`
- `priority`: `low`, `medium`, `high`, `critical`
- `assigned_developer`: `Alice`, `Bob`, `Carol`, `David`, `Eve`
- `suggested_action`: `fix_immediately`, `schedule_sprint`, `needs_more_info`, `wontfix`, `duplicate`
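A quick client-side sanity check can catch invalid values before a request ever reaches the server. This is an illustrative helper, not part of the package (the server performs its own validation):

```python
# Client-side sanity check before POSTing to /step (illustrative only).
VALID = {
    "bug_type": {"crash", "ui", "performance", "security", "data_loss", "compatibility"},
    "priority": {"low", "medium", "high", "critical"},
    "assigned_developer": {"Alice", "Bob", "Carol", "David", "Eve"},
    "suggested_action": {"fix_immediately", "schedule_sprint", "needs_more_info",
                         "wontfix", "duplicate"},
}

def validate_action(action: dict) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    return [f"{field}: invalid value {action[field]!r}"
            for field, allowed in VALID.items()
            if field in action and action[field] not in allowed]

print(validate_action({"task_id": "task_1", "bug_type": "cash"}))
# -> ["bug_type: invalid value 'cash'"]
```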

## Reward Design

The environment provides two distinct signals:

| Signal | Range | Purpose |
| --- | --- | --- |
| Grader Score | [0.0, 1.0] | Deterministic evaluation metric for benchmarking |
| Shaped Reward | [-0.5, 1.0] | Continuous training signal optimized for GRPO |

The shaped reward is derived from the grader score using the following formula:

```
reward = (grader_score * 1.5) - 0.5 + calibration_bonus
```

This mapping ensures:

- A score of 0.0 produces a reward of -0.5 (penalizes random guessing)
- A score of 0.33 produces a reward of 0.0 (breakeven point)
- A score of 1.0 produces a reward of 1.0 (maximum)
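The mapping is a straight line through those three points, which can be checked directly (a minimal sketch of the formula above, with the calibration bonus defaulting to zero):

```python
# Shaped-reward mapping from the formula above: an affine rescaling of the
# grader score from [0, 1] to [-0.5, 1.0], plus an optional calibration bonus.
def shaped_reward(grader_score: float, calibration_bonus: float = 0.0) -> float:
    return (grader_score * 1.5) - 0.5 + calibration_bonus

print(shaped_reward(0.0))  # -0.5  (random guessing is penalized)
print(shaped_reward(1.0))  # 1.0   (maximum)
```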

### Confidence Calibration

Agents may optionally submit a confidence value (float between 0.0 and 1.0) with their action. The environment applies a calibration bonus or penalty based on how well the agent's confidence aligns with its actual performance:

| Condition | Bonus | Description |
| --- | --- | --- |
| Correct and confident (score >= 0.8, confidence >= 0.8) | +0.10 | Rewards agents that are confident and right |
| Wrong and overconfident (score < 0.5, confidence >= 0.8) | -0.15 | Penalizes dangerous overconfidence |
| Well-calibrated (absolute difference < 0.2) | +0.05 | Rewards honest uncertainty estimation |
| Poorly calibrated (absolute difference >= 0.2) | -0.05 | Penalizes miscalibrated confidence |

This mechanic introduces a genuine RL challenge: the agent must learn not only what is correct, but also when it is certain. In production bug triage, overconfident misrouting of a critical outage has severe downstream consequences.
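One plausible reading of the calibration table is sketched below. The thresholds are taken from the table, but the rule ordering (first match wins) is an assumption; the environment's actual implementation may resolve overlapping conditions differently:

```python
# Sketch of the calibration bonus from the table above; rules are checked
# top to bottom and the first matching condition applies (an assumption).
def calibration_bonus(score: float, confidence: float) -> float:
    if score >= 0.8 and confidence >= 0.8:
        return 0.10   # correct and confident
    if score < 0.5 and confidence >= 0.8:
        return -0.15  # wrong and overconfident
    if abs(score - confidence) < 0.2:
        return 0.05   # well-calibrated
    return -0.05      # poorly calibrated

print(calibration_bonus(1.0, 0.9))   # 0.1   (confident and right)
print(calibration_bonus(0.0, 0.95))  # -0.15 (overconfident and wrong)
```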


## Baseline Agent

The baseline inference script supports two LLM providers with automatic fallback:

| Priority | Provider | Environment Variable | Default Model |
| --- | --- | --- | --- |
| Primary | OpenAI | `OPENAI_API_KEY` | `gpt-4o-mini` |
| Fallback | Google Gemini | `GEMINI_API_KEY` | `gemini-2.5-flash` |
| Last resort | Random | -- | Random valid action |

Both providers implement exponential backoff with retry logic for HTTP 429 (rate limit) and 503 (service unavailable) responses.
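A generic retry helper of the kind described can be sketched as follows. This is an illustration of the pattern, not the baseline's actual code; `with_backoff` and its `status_code` attribute convention are assumptions for the example:

```python
import random
import time

# Generic exponential-backoff sketch: retry a callable when it raises an
# exception carrying HTTP status 429 or 503, sleeping 1s, 2s, 4s, ... plus
# random jitter between attempts.
def with_backoff(call, max_retries: int = 5):
    for attempt in range(max_retries):
        try:
            return call()
        except Exception as exc:
            status = getattr(exc, "status_code", None)
            if status not in (429, 503) or attempt == max_retries - 1:
                raise  # non-retryable error, or retries exhausted
            time.sleep((2 ** attempt) + random.random())
```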

### Running the Baseline

```bash
# Using OpenAI (required by hackathon spec)
export OPENAI_API_KEY="sk-..."
python -m bug_triage_env.baseline --all-tasks --episodes 5

# Using Gemini (free tier available at https://aistudio.google.com/apikey)
export GEMINI_API_KEY="AI..."
python -m bug_triage_env.baseline --all-tasks --episodes 5

# Single task with more episodes
python -m bug_triage_env.baseline --task task_1 --episodes 10

# JSON output
python -m bug_triage_env.baseline --all-tasks --json
```

### Baseline Scores

| Task | Mean Score | Range | Description |
| --- | --- | --- | --- |
| Task 1 (Easy) | 0.80 | 0.00 - 1.00 | Bug type classification |
| Task 2 (Medium) | 0.93 | 0.67 - 1.00 | Priority assignment |
| Task 3 (Hard) | 0.78 | 0.60 - 1.00 | Full triage pipeline |
| Overall | 0.84 | -- | Weighted average across all tasks |

Without any API key configured, the baseline falls back to random actions and achieves an average score of approximately 0.15.

### Hackathon Inference Script

The root-level `inference.py` is the hackathon-mandated entry point. It:

- Uses the OpenAI Python client exclusively
- Reads `API_BASE_URL`, `MODEL_NAME`, and `HF_TOKEN` from the environment
- Emits structured `[START]`, `[STEP]`, and `[END]` logs to stdout
- Completes in under 20 minutes on 2 vCPU / 8 GB RAM

## Architecture

```
+------------------------------------------+
|              FastAPI Server              |
|  +--------+  +--------+  +-----------+  |
|  | /reset |  | /step  |  | /grader   |  |
|  +---+----+  +---+----+  +-----+-----+  |
|      |           |              |        |
|  +---v-----------v--------------v-----+  |
|  |      BugTriageEnvironment          |  |
|  |  +----------+  +---------------+   |  |
|  |  | Dataset  |  | Episode Store |   |  |
|  |  | 25 Bugs  |  | (thread-safe) |   |  |
|  |  +----------+  +---------------+   |  |
|  +--------------------+---------------+  |
|                       |                  |
|  +--------------------v---------------+  |
|  |         Graders Registry           |  |
|  |  task1: exact match                |  |
|  |  task2: distance penalty           |  |
|  |  task3: weighted composite         |  |
|  +------------------------------------+  |
+------------------------------------------+
         ^                      ^
         | HTTP                 | HTTP
    +----+-----+          +----+----------+
    |  Client  |          |   Baseline    |
    | (Python) |          | OpenAI/Gemini |
    +----------+          +---------------+
```

Key implementation details:

- **Thread safety:** The episode store uses a Python `threading.Lock` to support concurrent requests from multiple agents.
- **Single-step episodes:** Each episode consists of one reset (observation) and one step (action). The episode terminates immediately after the step.
- **Deterministic grading:** All three graders produce identical scores for identical inputs. No randomness is involved in evaluation.
- **Dataset:** 25 bug reports stored in `bugs.json`, covering crash reports, security vulnerabilities, performance regressions, UI glitches, data corruption, and compatibility issues.
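The lock-guarded, single-step episode store described above can be sketched as below. This is an illustration of the pattern, not the actual `BugTriageEnvironment` code; the class and field names are assumptions:

```python
import threading
import uuid

# Minimal sketch of a thread-safe, single-step episode store: a dict guarded
# by a threading.Lock, where each episode is marked done after one action.
class EpisodeStore:
    def __init__(self):
        self._lock = threading.Lock()
        self._episodes = {}

    def create(self, task_id: str) -> str:
        """Register a new episode (the /reset path) and return its id."""
        episode_id = uuid.uuid4().hex[:8]
        with self._lock:
            self._episodes[episode_id] = {"task_id": task_id, "done": False}
        return episode_id

    def finish(self, episode_id: str) -> dict:
        """Mark the episode done (the /step path) and return its record."""
        with self._lock:
            episode = self._episodes[episode_id]
            episode["done"] = True  # single-step: terminates after one action
            return episode

store = EpisodeStore()
eid = store.create("task_1")
print(store.finish(eid)["done"])  # True
```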

## Deployment

### Docker

```bash
docker build -t bug-triage-env .

docker run -d -p 8000:8000 \
  -e OPENAI_API_KEY="sk-..." \
  bug-triage-env

curl http://localhost:8000/health
```

The Dockerfile uses Python 3.11-slim, installs only production dependencies, and includes a built-in health check.

### Hugging Face Spaces

The environment is deployed as a Docker-based Hugging Face Space:

```bash
pip install huggingface_hub
python3 -c "
from huggingface_hub import HfApi
api = HfApi()
api.create_repo(repo_id='<username>/bug-triage-openenv', repo_type='space', space_sdk='docker', exist_ok=True)
api.upload_folder(folder_path='.', repo_id='<username>/bug-triage-openenv', repo_type='space')
"
```

The live deployment is accessible at: https://huggingface.co/spaces/savetrees/bug-triage-openenv


## Project Structure

```
bug-triage-openenv/
|-- README.md                    Documentation
|-- Dockerfile                   Production container (Python 3.11-slim)
|-- openenv.yaml                 OpenEnv environment manifest
|-- inference.py                 Hackathon inference entry point
|-- pyproject.toml               Python package configuration
|-- requirements.txt             Pinned production dependencies
|-- .dockerignore                Files excluded from Docker build
|-- .gitignore                   Files excluded from version control
|
|-- bug_triage_env/              Main Python package
|   |-- __init__.py              Package initialization
|   |-- models.py                Pydantic v2 data models (Action, Observation, State)
|   |-- client.py                Synchronous and asynchronous HTTP client
|   |-- baseline.py              Dual-provider LLM baseline (OpenAI + Gemini)
|   |
|   |-- data/
|   |   |-- __init__.py          Dataset loader
|   |   |-- bugs.json            25 curated real-world bug reports
|   |
|   |-- graders/
|   |   |-- __init__.py          Grader registry
|   |   |-- task1_grader.py      Bug classification grader (exact match)
|   |   |-- task2_grader.py      Priority assignment grader (distance penalty)
|   |   |-- task3_grader.py      Full triage grader (weighted composite)
|   |
|   |-- server/
|       |-- __init__.py          Server package initialization
|       |-- app.py               FastAPI application with all 8 endpoints
|       |-- environment.py       Core RL environment (reset, step, state)
```

## Environment Variables

| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| `OPENAI_API_KEY` | For baseline | (none) | OpenAI API key for primary baseline inference |
| `GEMINI_API_KEY` | For fallback | (none) | Google Gemini API key for fallback inference |
| `API_BASE_URL` | For hackathon | `https://api.openai.com/v1` | LLM API endpoint (used by `inference.py`) |
| `MODEL_NAME` | For hackathon | `gpt-4o-mini` | Model identifier (used by `inference.py`) |
| `HF_TOKEN` | For hackathon | (none) | Hugging Face token (used by `inference.py`) |
| `PORT` | No | `8000` | Server port |
| `HOST` | No | `0.0.0.0` | Server bind address |
| `WORKERS` | No | `4` | Number of Uvicorn worker processes |


## License

This project is licensed under the MIT License. See LICENSE for details.