Spaces:

savetrees
/

bug-triage-openenv

Running

App Files Files Community

bug-triage-openenv / docs /PROJECT.md

savetrees

Upload folder using huggingface_hub

0135a17 verified 2 days ago

preview code

raw

history blame contribute delete

4.56 kB

PROJECT.md — Bug Triage OpenEnv Environment

Round 1 opens: April 1, 2026 | Deadline: April 7, 2026, 11:59 PM IST

Problem Statement

Software teams receive hundreds of bug reports daily. Manually triaging each report — classifying the bug type, assigning a priority level, routing to the right developer, and deciding the appropriate action — is slow, error-prone, and a significant cost to engineering velocity.

This environment trains an LLM agent to perform automated bug triage at production scale, simulating real-world issue tracking systems like Jira and GitHub Issues. The agent reads raw bug reports (title, description, stack traces, metadata) and must classify, prioritize, and route each issue accurately.

Real-World Relevance

Dimension	Detail
Domain	Software Engineering / DevOps / Issue Tracking
Industry use	Every software company with a bug tracker (GitHub Issues, Jira, Linear, etc.)
Scale	Large teams process 100–500+ bug reports per week
Cost of wrong triage	Critical bugs missed → outages; low-priority spamming senior devs → waste
Current solutions	Manual labels, basic keyword rules, ML classifiers (limited context)
LLM advantage	Can reason over free-text descriptions, logs, and metadata together

Users

User	Need
Developers	Only receive bugs relevant to their specialization
QA Engineers	Know which bugs to test first (priority-ordered)
Project Managers	Accurate sprint planning based on classified backlogs
Engineering Leads	Automated triage frees team from manual label overhead

Example Scenario

Bug Reported:
  Title:       "App crashes on iOS 17 when uploading files > 50MB"
  Description: "Consistently crashes immediately on upload tap. Blocking 
                3 enterprise customers."
  Logs:        "FATAL: EXC_BAD_ACCESS KERN_INVALID_ADDRESS 0x0000000000000000
                Stack: FileUploadManager.upload(url:size:)"
  Environment: "iOS 17.2, iPhone 15 Pro, App v3.2.1"

Agent Must Output:
  bug_type:           crash          ← Task 1
  priority:           critical       ← Task 2
  assigned_developer: Alice          ← Task 3 (crash specialist)
  suggested_action:   fix_immediately ← Task 3

Real-World Constraints the Agent Must Handle

Ambiguous descriptions — reporters with varying technical skill
Incomplete logs — stack traces cut off or missing
Priority inflation — reporters who label everything as "critical"
Routing uncertainty — cross-cutting bugs (e.g., security + crash)
Missing environment info — agent must infer from context

Tech Stack

Layer	Technology
Environment server	FastAPI + Uvicorn
Containerisation	Docker
Deployment	Hugging Face Spaces
Training	TRL GRPOTrainer + vLLM
Base model	Qwen/Qwen3-1.7B
Package manager	uv
Validation	Pydantic v2
Baseline LLM	OpenAI GPT-4o-mini (via `OPENAI_API_KEY`)

Repository Layout

bug_triage_env/
├── models.py             ← Pydantic-typed Action / Observation / State
├── client.py             ← HTTP client for training code
├── baseline.py           ← OpenAI-backed inference script
├── openenv.yaml          ← Manifest
├── pyproject.toml
├── requirements.txt
├── data/
│   └── bugs.json         ← 15 diverse real-world bug reports
├── graders/
│   ├── task1_grader.py   ← Bug type classification [0.0–1.0]
│   ├── task2_grader.py   ← Priority assignment [0.0–1.0]
│   └── task3_grader.py   ← Full triage [0.0–1.0]
└── server/
    ├── environment.py    ← OpenEnv Environment ABC implementation
    ├── app.py            ← FastAPI (standard + hackathon endpoints)
    └── Dockerfile

Hackathon Compliance Checklist

Requirement	Status
HF Space deploys + `/reset` returns 200	☐
`openenv.yaml` present	☐
Typed models (`Action`, `Observation`, `State`)	☐
`step()` / `reset()` / `state()` implemented	☐
Dockerfile builds	☐
`/tasks` — task list + action schema	☐
`/grader` — score in `[0.0, 1.0]`	☐
`/baseline` — OpenAI inference, all 3 tasks	☐
3+ graded tasks with varying scores	☐
`baseline.py` runs without error	☐