---
title: Bug Triage OpenEnv
emoji: 🐛
colorFrom: red
colorTo: yellow
sdk: docker
app_port: 8000
tags:
  - openenv
---

# Bug Triage OpenEnv

A production-grade reinforcement learning environment for automated software bug triage, built on the OpenEnv framework.



## Overview

Bug Triage OpenEnv simulates a real-world issue tracking system (comparable to Jira, GitHub Issues, or Linear) where an AI agent must read incoming bug reports and make triage decisions:

  1. Classify the bug type (crash, UI, security, performance, data loss, compatibility)
  2. Prioritize the severity (low, medium, high, critical)
  3. Route to the correct developer based on their domain expertise
  4. Recommend the appropriate action (fix immediately, schedule for sprint, etc.)

The environment includes 25 carefully crafted bug reports drawn from real-world software engineering workflows, each designed to test different reasoning capabilities of frontier language models.

## Motivation

| Problem | Why It Matters |
| --- | --- |
| Every software company triages hundreds to thousands of bugs daily | High-volume, repetitive task ideal for automation |
| Manual triage costs senior engineering hours | Direct cost savings from accurate automation |
| Misrouted bugs cause cascading delays and outages | Incorrect triage has measurable downstream impact |
| Ambiguous bug reports require deep contextual reasoning | LLM agents must parse unstructured text and infer intent |

This environment was built for the Meta x PyTorch Hackathon and is designed for training RL agents via GRPO (Group Relative Policy Optimization).


## Getting Started

### Prerequisites

### Installation

```bash
git clone https://github.com/savetree-1/bug-triage-openenv.git
cd bug-triage-openenv

python -m venv .venv
source .venv/bin/activate

pip install -r requirements.txt
```

### Quick Start

Start the server:

```bash
uvicorn bug_triage_env.server.app:app --host 0.0.0.0 --port 8000
```

Verify that the server is running:

```bash
curl http://localhost:8000/health
```

Expected response:

```json
{"status": "healthy"}
```

Run a complete episode (reset, then step):

```bash
curl -X POST http://localhost:8000/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": "task_1"}'
```

Submit a triage action using the `episode_id` returned from `/reset`:

```bash
curl -X POST http://localhost:8000/step \
  -H "Content-Type: application/json" \
  -d '{"episode_id": "<episode_id>", "action": {"task_id": "task_1", "bug_type": "crash"}}'
```

### Python Client

Synchronous and asynchronous clients are provided for programmatic access:

```python
from bug_triage_env.client import BugTriageEnvClient
from bug_triage_env.models import BugTriageAction

with BugTriageEnvClient("http://localhost:8000") as client:
    obs = client.reset(task_id="task_3")

    action = BugTriageAction(
        task_id="task_3",
        bug_type="security",
        priority="critical",
        assigned_developer="Bob",
        suggested_action="fix_immediately",
    )

    result = client.step(obs["episode_id"], action)
    print(f"Grader score: {result['grader_score']}")
```

## Tasks

The environment defines three tasks of increasing difficulty. Each task has a deterministic grader that returns a score in the range [0.0, 1.0].

### Task 1: Bug Type Classification (Easy)

Given a bug report, classify it into one of six categories.

| Property | Value |
| --- | --- |
| Input | Bug title, description, logs, environment metadata |
| Output | `bug_type`: one of `crash`, `ui`, `performance`, `security`, `data_loss`, `compatibility` |
| Scoring | Exact match = 1.0; incorrect = 0.0 |
| Grader | `task1_grader.py` |

### Task 2: Priority Assignment (Medium)

Given a bug report, assign the correct severity level.

| Property | Value |
| --- | --- |
| Input | Bug title, description, logs, environment metadata |
| Output | `priority`: one of `low`, `medium`, `high`, `critical` |
| Scoring | Exact = 1.0; 1 level off = 0.67; 2 levels = 0.33; 3 levels = 0.0 |
| Grader | `task2_grader.py` |
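The distance-based scoring above can be sketched as follows. This is an illustrative reimplementation, not the actual code in `task2_grader.py`, which may differ in detail:

```python
# Illustrative sketch of the Task 2 distance-penalty scoring: each level of
# distance between predicted and expected priority costs one third of the score.
PRIORITY_LEVELS = ["low", "medium", "high", "critical"]

def score_priority(predicted: str, expected: str) -> float:
    """Return 1.0 for an exact match, losing 1/3 per level of distance."""
    distance = abs(PRIORITY_LEVELS.index(predicted) - PRIORITY_LEVELS.index(expected))
    return round(max(0.0, 1.0 - distance / 3), 2)

print(score_priority("high", "critical"))  # one level off -> 0.67
```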

### Task 3: Full Bug Triage (Hard)

Perform complete triage: classify the bug type, assign priority, route to the correct developer, and recommend an action.

| Property | Value |
| --- | --- |
| Output | `bug_type` + `priority` + `assigned_developer` + `suggested_action` |
| Developers | Alice (crash, performance), Bob (crash, security), Carol (UI, compatibility), David (security, data loss), Eve (UI, performance, compatibility) |
| Actions | `fix_immediately`, `schedule_sprint`, `needs_more_info`, `wontfix`, `duplicate` |
| Scoring | Weighted composite: 0.3 * type + 0.3 * priority + 0.2 * developer + 0.2 * action |
| Grader | `task3_grader.py` |
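The weighted composite can be sketched as below. This is an illustration of the formula, not the actual `task3_grader.py`; it assumes each component score is 1.0 when correct and 0.0 otherwise (the priority component may be partial, per Task 2):

```python
# Illustrative sketch of the Task 3 weighted composite score.
WEIGHTS = {"bug_type": 0.3, "priority": 0.3,
           "assigned_developer": 0.2, "suggested_action": 0.2}

def composite_score(component_scores: dict) -> float:
    """Combine per-component scores (0.0-1.0) using the task weights."""
    return round(sum(WEIGHTS[k] * component_scores.get(k, 0.0) for k in WEIGHTS), 2)

# Everything correct except the suggested action:
print(composite_score({"bug_type": 1.0, "priority": 1.0,
                       "assigned_developer": 1.0, "suggested_action": 0.0}))  # 0.8
```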

## API Reference

All endpoints conform to the OpenEnv specification.

### Endpoints

| Method | Path | Description |
| --- | --- | --- |
| GET | `/health` | Liveness probe. Returns `{"status": "healthy"}`. |
| POST | `/reset` | Start a new episode. Accepts optional `{"task_id": "task_1"}`. Returns an observation containing a bug report. |
| POST | `/step` | Submit a triage action. Requires `episode_id` and `action`. Returns an observation with reward and grader score. |
| GET | `/state` | Returns metadata about active episodes. |
| GET | `/tasks` | Lists all available tasks with their action schemas. |
| POST | `/grader` | Re-grade a completed episode. Requires `episode_id` and `task_id`. |
| POST | `/baseline` | Trigger baseline inference (requires `OPENAI_API_KEY` or `GEMINI_API_KEY`). |
| GET | `/docs` | Auto-generated Swagger UI documentation. |

### POST /reset

Request:

```json
{"task_id": "task_1"}
```

Response (abbreviated):

```json
{
  "done": false,
  "reward": 0.0,
  "task_id": "task_1",
  "episode_id": "abc123",
  "step_number": 0,
  "feedback": "New bug report received. Please triage.",
  "available_developers": ["Alice", "Bob", "Carol", "David", "Eve"],
  "bug_report": {
    "bug_id": "BUG-001",
    "title": "Application crashes on login with SSO enabled",
    "description": "...",
    "logs": "...",
    "environment": "macOS 14.2, Chrome 120",
    "reporter": "user_42",
    "created_at": "2024-01-15T09:30:00Z",
    "metadata": {}
  }
}
```

### POST /step

Request:

```json
{
  "episode_id": "abc123",
  "action": {
    "task_id": "task_3",
    "bug_type": "crash",
    "priority": "critical",
    "assigned_developer": "Alice",
    "suggested_action": "fix_immediately"
  }
}
```

Response:

```json
{
  "done": true,
  "reward": 1.0,
  "grader_score": 1.0,
  "task_id": "task_3",
  "feedback": "Grader score: 1.00 | Bug type: correct | Priority: correct | Developer: correct | Action: correct",
  "step_number": 1,
  "episode_id": "abc123"
}
```

## Observation Space

Each observation returned by /reset and /step contains the following fields:

| Field | Type | Description |
| --- | --- | --- |
| `bug_report.bug_id` | string | Unique bug identifier (e.g., `BUG-001`) |
| `bug_report.title` | string | Short summary of the bug |
| `bug_report.description` | string | Detailed description of the issue |
| `bug_report.logs` | string or null | Error logs, stack traces, or crash output |
| `bug_report.environment` | string or null | OS, browser, hardware, and version details |
| `bug_report.reporter` | string | Username of the person who filed the bug |
| `bug_report.created_at` | string | ISO 8601 timestamp |
| `bug_report.metadata` | object | Additional context (component, affected users, regression flag) |
| `available_developers` | array of strings | The 5 developers available for routing |
| `done` | boolean | Whether the episode has ended |
| `reward` | float | Shaped reward signal for RL training |
| `grader_score` | float or null | Raw evaluation score in [0.0, 1.0] (null before stepping) |
| `episode_id` | string | Unique episode identifier |
| `step_number` | integer | Current step count (0 after reset, 1 after step) |
| `feedback` | string | Human-readable feedback about the triage result |

## Action Space

Actions are submitted as JSON objects to the /step endpoint. Required fields vary by task:

| Field | Type | Task 1 | Task 2 | Task 3 |
| --- | --- | --- | --- | --- |
| `task_id` | string | Required | Required | Required |
| `bug_type` | string | Required | -- | Required |
| `priority` | string | -- | Required | Required |
| `assigned_developer` | string | -- | -- | Required |
| `suggested_action` | string | -- | -- | Required |
| `confidence` | float (0.0-1.0) | Optional | Optional | Optional |
| `reasoning` | string | Optional | Optional | Optional |

Valid values:

- `bug_type`: `crash`, `ui`, `performance`, `security`, `data_loss`, `compatibility`
- `priority`: `low`, `medium`, `high`, `critical`
- `assigned_developer`: `Alice`, `Bob`, `Carol`, `David`, `Eve`
- `suggested_action`: `fix_immediately`, `schedule_sprint`, `needs_more_info`, `wontfix`, `duplicate`
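A quick client-side sanity check can catch invalid values before a request ever reaches the server. This is an illustrative helper, not part of the package (the server performs its own validation):

```python
# Client-side sanity check before POSTing to /step (illustrative only).
VALID = {
    "bug_type": {"crash", "ui", "performance", "security", "data_loss", "compatibility"},
    "priority": {"low", "medium", "high", "critical"},
    "assigned_developer": {"Alice", "Bob", "Carol", "David", "Eve"},
    "suggested_action": {"fix_immediately", "schedule_sprint", "needs_more_info",
                         "wontfix", "duplicate"},
}

def validate_action(action: dict) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    return [f"{field}: invalid value {action[field]!r}"
            for field, allowed in VALID.items()
            if field in action and action[field] not in allowed]

print(validate_action({"task_id": "task_1", "bug_type": "cash"}))
# -> ["bug_type: invalid value 'cash'"]
```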

## Reward Design

The environment provides two distinct signals:

| Signal | Range | Purpose |
| --- | --- | --- |
| Grader Score | [0.0, 1.0] | Deterministic evaluation metric for benchmarking |
| Shaped Reward | [-0.5, 1.0] | Continuous training signal optimized for GRPO |

The shaped reward is derived from the grader score using the following formula:

```
reward = (grader_score * 1.5) - 0.5 + calibration_bonus
```

This mapping ensures:

- A score of 0.0 produces a reward of -0.5 (penalizes random guessing)
- A score of 0.33 produces a reward of 0.0 (breakeven point)
- A score of 1.0 produces a reward of 1.0 (maximum)
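The mapping is a straight line through those three points, which can be checked directly (a minimal sketch of the formula above, with the calibration bonus defaulting to zero):

```python
# Shaped-reward mapping from the formula above: an affine rescaling of the
# grader score from [0, 1] to [-0.5, 1.0], plus an optional calibration bonus.
def shaped_reward(grader_score: float, calibration_bonus: float = 0.0) -> float:
    return (grader_score * 1.5) - 0.5 + calibration_bonus

print(shaped_reward(0.0))  # -0.5  (random guessing is penalized)
print(shaped_reward(1.0))  # 1.0   (maximum)
```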

### Confidence Calibration

Agents may optionally submit a confidence value (float between 0.0 and 1.0) with their action. The environment applies a calibration bonus or penalty based on how well the agent's confidence aligns with its actual performance:

| Condition | Bonus | Description |
| --- | --- | --- |
| Correct and confident (score >= 0.8, confidence >= 0.8) | +0.10 | Rewards agents that are confident and right |
| Wrong and overconfident (score < 0.5, confidence >= 0.8) | -0.15 | Penalizes dangerous overconfidence |
| Well-calibrated (absolute difference < 0.2) | +0.05 | Rewards honest uncertainty estimation |
| Poorly calibrated (absolute difference >= 0.2) | -0.05 | Penalizes miscalibrated confidence |

This mechanic introduces a genuine RL challenge: the agent must learn not only what is correct, but also when it is certain. In production bug triage, overconfident misrouting of a critical outage has severe downstream consequences.
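One plausible reading of the calibration table is sketched below. The thresholds are taken from the table, but the rule ordering (first match wins) is an assumption; the environment's actual implementation may resolve overlapping conditions differently:

```python
# Sketch of the calibration bonus from the table above; rules are checked
# top to bottom and the first matching condition applies (an assumption).
def calibration_bonus(score: float, confidence: float) -> float:
    if score >= 0.8 and confidence >= 0.8:
        return 0.10   # correct and confident
    if score < 0.5 and confidence >= 0.8:
        return -0.15  # wrong and overconfident
    if abs(score - confidence) < 0.2:
        return 0.05   # well-calibrated
    return -0.05      # poorly calibrated

print(calibration_bonus(1.0, 0.9))   # 0.1   (confident and right)
print(calibration_bonus(0.0, 0.95))  # -0.15 (overconfident and wrong)
```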


## Baseline Agent

The baseline inference script supports two LLM providers with automatic fallback:

| Priority | Provider | Environment Variable | Default Model |
| --- | --- | --- | --- |
| Primary | OpenAI | `OPENAI_API_KEY` | `gpt-4o-mini` |
| Fallback | Google Gemini | `GEMINI_API_KEY` | `gemini-2.5-flash` |
| Last resort | Random | -- | Random valid action |

Both providers implement exponential backoff with retry logic for HTTP 429 (rate limit) and 503 (service unavailable) responses.
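A generic retry helper of the kind described can be sketched as follows. This is an illustration of the pattern, not the baseline's actual code; `with_backoff` and its `status_code` attribute convention are assumptions for the example:

```python
import random
import time

# Generic exponential-backoff sketch: retry a callable when it raises an
# exception carrying HTTP status 429 or 503, sleeping 1s, 2s, 4s, ... plus
# random jitter between attempts.
def with_backoff(call, max_retries: int = 5):
    for attempt in range(max_retries):
        try:
            return call()
        except Exception as exc:
            status = getattr(exc, "status_code", None)
            if status not in (429, 503) or attempt == max_retries - 1:
                raise  # non-retryable error, or retries exhausted
            time.sleep((2 ** attempt) + random.random())
```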

### Running the Baseline

```bash
# Using OpenAI (required by hackathon spec)
export OPENAI_API_KEY="sk-..."
python -m bug_triage_env.baseline --all-tasks --episodes 5

# Using Gemini (free tier available at https://aistudio.google.com/apikey)
export GEMINI_API_KEY="AI..."
python -m bug_triage_env.baseline --all-tasks --episodes 5

# Single task with more episodes
python -m bug_triage_env.baseline --task task_1 --episodes 10

# JSON output
python -m bug_triage_env.baseline --all-tasks --json
```

### Baseline Scores

| Task | Mean Score | Range | Description |
| --- | --- | --- | --- |
| Task 1 (Easy) | 0.80 | 0.00 - 1.00 | Bug type classification |
| Task 2 (Medium) | 0.93 | 0.67 - 1.00 | Priority assignment |
| Task 3 (Hard) | 0.78 | 0.60 - 1.00 | Full triage pipeline |
| Overall | 0.84 | -- | Weighted average across all tasks |

Without any API key configured, the baseline falls back to random actions and achieves an average score of approximately 0.15.

### Hackathon Inference Script

The root-level `inference.py` is the hackathon-mandated entry point. It:

- Uses the OpenAI Python client exclusively
- Reads `API_BASE_URL`, `MODEL_NAME`, and `HF_TOKEN` from the environment
- Emits structured `[START]`, `[STEP]`, and `[END]` logs to stdout
- Completes in under 20 minutes on 2 vCPU / 8 GB RAM

## Architecture

```
+------------------------------------------+
|              FastAPI Server              |
|  +--------+  +--------+  +-----------+  |
|  | /reset |  | /step  |  | /grader   |  |
|  +---+----+  +---+----+  +-----+-----+  |
|      |           |              |        |
|  +---v-----------v--------------v-----+  |
|  |      BugTriageEnvironment          |  |
|  |  +----------+  +---------------+   |  |
|  |  | Dataset  |  | Episode Store |   |  |
|  |  | 25 Bugs  |  | (thread-safe) |   |  |
|  |  +----------+  +---------------+   |  |
|  +--------------------+---------------+  |
|                       |                  |
|  +--------------------v---------------+  |
|  |         Graders Registry           |  |
|  |  task1: exact match                |  |
|  |  task2: distance penalty           |  |
|  |  task3: weighted composite         |  |
|  +------------------------------------+  |
+------------------------------------------+
         ^                      ^
         | HTTP                 | HTTP
    +----+-----+          +----+----------+
    |  Client  |          |   Baseline    |
    | (Python) |          | OpenAI/Gemini |
    +----------+          +---------------+
```

Key implementation details:

- **Thread safety:** The episode store uses a Python `threading.Lock` to support concurrent requests from multiple agents.
- **Single-step episodes:** Each episode consists of one reset (observation) and one step (action). The episode terminates immediately after the step.
- **Deterministic grading:** All three graders produce identical scores for identical inputs. No randomness is involved in evaluation.
- **Dataset:** 25 bug reports stored in `bugs.json`, covering crash reports, security vulnerabilities, performance regressions, UI glitches, data corruption, and compatibility issues.
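The lock-guarded, single-step episode store described above can be sketched as below. This is an illustration of the pattern, not the actual `BugTriageEnvironment` code; the class and field names are assumptions:

```python
import threading
import uuid

# Minimal sketch of a thread-safe, single-step episode store: a dict guarded
# by a threading.Lock, where each episode is marked done after one action.
class EpisodeStore:
    def __init__(self):
        self._lock = threading.Lock()
        self._episodes = {}

    def create(self, task_id: str) -> str:
        """Register a new episode (the /reset path) and return its id."""
        episode_id = uuid.uuid4().hex[:8]
        with self._lock:
            self._episodes[episode_id] = {"task_id": task_id, "done": False}
        return episode_id

    def finish(self, episode_id: str) -> dict:
        """Mark the episode done (the /step path) and return its record."""
        with self._lock:
            episode = self._episodes[episode_id]
            episode["done"] = True  # single-step: terminates after one action
            return episode

store = EpisodeStore()
eid = store.create("task_1")
print(store.finish(eid)["done"])  # True
```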

## Deployment

### Docker

```bash
docker build -t bug-triage-env .

docker run -d -p 8000:8000 \
  -e OPENAI_API_KEY="sk-..." \
  bug-triage-env

curl http://localhost:8000/health
```

The Dockerfile uses Python 3.11-slim, installs only production dependencies, and includes a built-in health check.

### Hugging Face Spaces

The environment is deployed as a Docker-based Hugging Face Space:

```bash
pip install huggingface_hub
python3 -c "
from huggingface_hub import HfApi
api = HfApi()
api.create_repo(repo_id='<username>/bug-triage-openenv', repo_type='space', space_sdk='docker', exist_ok=True)
api.upload_folder(folder_path='.', repo_id='<username>/bug-triage-openenv', repo_type='space')
"
```

The live deployment is accessible at: https://huggingface.co/spaces/savetrees/bug-triage-openenv


## Project Structure

```
bug-triage-openenv/
|-- README.md                    Documentation
|-- Dockerfile                   Production container (Python 3.11-slim)
|-- openenv.yaml                 OpenEnv environment manifest
|-- inference.py                 Hackathon inference entry point
|-- pyproject.toml               Python package configuration
|-- requirements.txt             Pinned production dependencies
|-- .dockerignore                Files excluded from Docker build
|-- .gitignore                   Files excluded from version control
|
|-- bug_triage_env/              Main Python package
|   |-- __init__.py              Package initialization
|   |-- models.py                Pydantic v2 data models (Action, Observation, State)
|   |-- client.py                Synchronous and asynchronous HTTP client
|   |-- baseline.py              Dual-provider LLM baseline (OpenAI + Gemini)
|   |
|   |-- data/
|   |   |-- __init__.py          Dataset loader
|   |   |-- bugs.json            25 curated real-world bug reports
|   |
|   |-- graders/
|   |   |-- __init__.py          Grader registry
|   |   |-- task1_grader.py      Bug classification grader (exact match)
|   |   |-- task2_grader.py      Priority assignment grader (distance penalty)
|   |   |-- task3_grader.py      Full triage grader (weighted composite)
|   |
|   |-- server/
|       |-- __init__.py          Server package initialization
|       |-- app.py               FastAPI application with all 8 endpoints
|       |-- environment.py       Core RL environment (reset, step, state)
```

## Environment Variables

| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| `OPENAI_API_KEY` | For baseline | (none) | OpenAI API key for primary baseline inference |
| `GEMINI_API_KEY` | For fallback | (none) | Google Gemini API key for fallback inference |
| `API_BASE_URL` | For hackathon | `https://api.openai.com/v1` | LLM API endpoint (used by `inference.py`) |
| `MODEL_NAME` | For hackathon | `gpt-4o-mini` | Model identifier (used by `inference.py`) |
| `HF_TOKEN` | For hackathon | (none) | Hugging Face token (used by `inference.py`) |
| `PORT` | No | `8000` | Server port |
| `HOST` | No | `0.0.0.0` | Server bind address |
| `WORKERS` | No | `4` | Number of Uvicorn worker processes |


## License

This project is licensed under the MIT License. See LICENSE for details.