Commit 47bc4be
Parent(s): 7eb0325
docs: comprehensive README with spec compliance checklist, log format, API examples
README.md
@@ -9,68 +9,93 @@ tags:
  - openenv
  ---

- # Bug Triage Environment

  An OpenEnv reinforcement learning environment where an AI agent triages GitHub-style bug reports, assigning priority, labels, team ownership, and milestone, exactly as a senior engineer would.

  Every software team triages dozens of bug reports weekly. Getting prioritization wrong delays critical fixes and wastes engineering time. This environment trains and evaluates agents on real triage decision-making, with graders that reflect actual engineering judgment.

  | Field | Type | Values |
  |-----------------|-----------|-------------------------------------------------|
- | `priority` | string | `P0` `P1` `P2` `P3`
- | `labels` | list[str] | `bug` `performance` `security` `ux` `
- | `assigned_team` | string | `backend` `frontend` `infra` `security` `devx` |
- | `milestone` | string | `hotfix` `v2.1` `backlog`
- | `reasoning` | string | Free-form explanation

- ## Observation

  | Field | Type | Description |
  |--------------|-----------|------------------------------------------|
- | `bug_report` | BugReport | Title, body, author, comments
- | `task_id` | string | Current difficulty: easy / medium / hard |
- | `score` | float |
- | `reward` | float | Reward from last action (0.
  | `feedback` | string | Human-readable grader feedback |
  | `done` | bool | Episode complete flag |

  ## Tasks

- ### Task 1 - Easy
- - Grader:
- -
- -

- Agent assigns priority, category labels, and team routing.
- - Grader: priority 45% + label Jaccard similarity 40% + team routing 15%
- - Reward range: 0.05–0.95

- ##
- Agent must assign priority, labels, team, and milestone. Security escalation failures are penalized.
- - Grader: priority 35% + labels 30% + team 20% + milestone 15%
- - Penalty: -0.15 for missing security escalation
- - Reward range: 0.05–0.95

- - Partial credit for close-but-not-exact priority (0.5 vs 0.05 vs 0.95)
- - Label overlap via Jaccard similarity (continuous signal)
- - Team routing accuracy (binary, but weighted)
- - Security escalation penalty discourages ignoring critical signals
- - All scores clamped strictly to (0.05, 0.95)

  ## Setup

- ### Run
  ```bash
- git clone https://
  cd bug-triage-env
  pip install -r server/requirements.txt
  uvicorn server.app:app --host 0.0.0.0 --port 7860
@@ -82,9 +107,9 @@ docker build -t bug-triage-env .
  docker run -p 7860:7860 bug-triage-env
  ```

- ### Run
  ```bash
- pip install openai openenv-core
  export API_BASE_URL=https://router.huggingface.co/v1
  export MODEL_NAME=meta-llama/Llama-3.3-70B-Instruct
  export HF_TOKEN=your_hf_token_here
@@ -92,27 +117,31 @@ export ENV_BASE_URL=https://siteshcodes-bug-triage-env.hf.space
  python inference.py
  ```

- ###
- ```bash
- pip install groq openenv-core
- export GROQ_API_KEY=your_key_here
- python baseline.py
- ```

  Evaluated with `meta-llama/Llama-3.3-70B-Instruct` via HuggingFace router (temperature=0):

- | Task | Score |
- |------------|-------|
- | Easy | 0.
- | Medium | 0.
- | Hard | 0.
- | **

  ## API Endpoints

@@ -125,20 +154,78 @@ Scores vary per run due to random bug sampling from a pool of 5 bugs per task.
  | GET | `/tasks` | List all tasks with grader info |
  | GET | `/tasks/{id}` | Get specific task metadata |

- ##

  ```
  bug-triage-env/
  ├── server/
- │   ├── app.py
- │   ├── environment.py
- │   ├── task.py
  │   └── requirements.txt
- ├── model.py
- ├──
- ├──
- ├──
- ├──
- ├── Dockerfile
  └── README.md
- ```
  - openenv
  ---

+ # Bug Triage Environment
+
+ > **OpenEnv RL environment for the Meta PyTorch Hackathon x Scaler School of Technology**

  An OpenEnv reinforcement learning environment where an AI agent triages GitHub-style bug reports, assigning priority, labels, team ownership, and milestone, exactly as a senior engineer would.

+ **Live:** [https://siteshcodes-bug-triage-env.hf.space](https://siteshcodes-bug-triage-env.hf.space)
+ **GitHub:** [https://github.com/Siteshcodes/bug-triage-env](https://github.com/Siteshcodes/bug-triage-env)
+
+ ---
+
+ ## Why This Environment?

  Every software team triages dozens of bug reports weekly. Getting prioritization wrong delays critical fixes and wastes engineering time. This environment trains and evaluates agents on real triage decision-making, with graders that reflect actual engineering judgment.

+ **Key features:**
+ - Simulates a real-world engineering task (not a game or toy)
+ - 3 tasks of increasing difficulty with deterministic graders
+ - Meaningful partial-credit reward function
+ - Security escalation penalty for missed critical vulnerabilities
+ - Full OpenEnv spec compliance: `step()` / `reset()` / `state()`
+
+ ---
+
+ ## Action Space

  | Field | Type | Values |
  |-----------------|-----------|-------------------------------------------------|
+ | `priority` | string | `P0` · `P1` · `P2` · `P3` |
+ | `labels` | list[str] | `bug` · `performance` · `security` · `ux` · `data-integrity` · `payments` … |
+ | `assigned_team` | string | `backend` · `frontend` · `infra` · `security` · `devx` |
+ | `milestone` | string | `hotfix` · `v2.1` · `backlog` |
+ | `reasoning` | string | Free-form explanation of triage decision |

+ ## Observation Space

  | Field | Type | Description |
  |--------------|-----------|------------------------------------------|
+ | `bug_report` | BugReport | Title, body, author, labels_hint, comments |
+ | `task_id` | string | Current difficulty: `easy` / `medium` / `hard` |
+ | `score` | float | Score from grader (0.0–1.0) |
+ | `reward` | float | Reward from last action (0.0–1.0) |
  | `feedback` | string | Human-readable grader feedback |
  | `done` | bool | Episode complete flag |

+ ---
+
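To make the action space in the table above concrete, here is a minimal sketch that checks an action against the listed values. It is not the repository's `model.py` (which the README describes as Pydantic models); the `TriageAction` dataclass, its `validate` method, and the `ALLOWED_*` constants are hypothetical names used only for illustration. Labels are left unchecked because the README's label list is open-ended.

```python
from dataclasses import dataclass

# Allowed values copied from the Action Space table.
ALLOWED_PRIORITIES = {"P0", "P1", "P2", "P3"}
ALLOWED_TEAMS = {"backend", "frontend", "infra", "security", "devx"}
ALLOWED_MILESTONES = {"hotfix", "v2.1", "backlog"}


@dataclass
class TriageAction:
    """Hypothetical stand-in for the project's Pydantic action model."""
    priority: str
    labels: list[str]
    assigned_team: str
    milestone: str
    reasoning: str

    def validate(self) -> list[str]:
        """Return human-readable problems; an empty list means the action is well-formed."""
        problems = []
        if self.priority not in ALLOWED_PRIORITIES:
            problems.append(f"unknown priority: {self.priority!r}")
        if self.assigned_team not in ALLOWED_TEAMS:
            problems.append(f"unknown team: {self.assigned_team!r}")
        if self.milestone not in ALLOWED_MILESTONES:
            problems.append(f"unknown milestone: {self.milestone!r}")
        return problems
```

Returning a list of problems rather than raising on the first one lets a caller report every invalid field at once, which may be convenient when the action comes from an LLM response.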
  ## Tasks

+ ### Task 1 - Easy: Priority Assignment
+ Assign a single P0–P3 priority to a bug report.
+ - **Grader:** `server.task:priority_match`
+ - **Scoring:** exact match → 0.95, one level off → 0.50, else → 0.05
+ - **Weight:** priority 100%
+ - **Reward range:** (0.0, 1.0), strictly exclusive
+
+ ### Task 2 - Medium: Priority + Labels + Team
+ Assign priority, category labels, and team routing.
+ - **Grader:** `server.task:priority_label_team`
+ - **Scoring:** priority 45% + label Jaccard similarity 40% + team routing 15%
+ - **Reward range:** (0.0, 1.0), strictly exclusive
+
+ ### Task 3 - Hard: Full Triage
+ Full triage: priority, labels, team, and milestone. Security escalation failures are penalized.
+ - **Grader:** `server.task:full_triage`
+ - **Scoring:** priority 35% + labels 30% + team 20% + milestone 15%
+ - **Penalty:** -0.15 for missing security escalation (e.g., SQL injection assigned to `backend` instead of `security`)
+ - **Reward range:** (0.0, 1.0), strictly exclusive

+ ---
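The per-task weightings listed above can be read as a weighted sum of per-field scores. The sketch below shows one way that combination might look; only the weights and the 0.15 penalty come from the README. The `combine` function, the `TASK_WEIGHTS` table, and the convention that each component score lies in [0, 1] are assumptions, and the final clamping into the open interval is deliberately left out.

```python
# Weights copied from the task descriptions; everything else is an assumption.
TASK_WEIGHTS = {
    "easy": {"priority": 1.00},
    "medium": {"priority": 0.45, "labels": 0.40, "team": 0.15},
    "hard": {"priority": 0.35, "labels": 0.30, "team": 0.20, "milestone": 0.15},
}
SECURITY_PENALTY = 0.15  # hard task only: missed security escalation


def combine(task_id: str, components: dict[str, float], missed_security: bool = False) -> float:
    """Weighted sum of per-field scores, minus the hard-task security penalty."""
    weights = TASK_WEIGHTS[task_id]
    # A component the caller did not supply counts as 0.0.
    total = sum(weight * components.get(name, 0.0) for name, weight in weights.items())
    if task_id == "hard" and missed_security:
        total -= SECURITY_PENALTY
    return total
```

For example, a medium-task action with the right priority and team but only half the label overlap would come out at 0.45 + 0.40 × 0.5 + 0.15 = 0.80 under these assumptions.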

+ ## Reward Function

+ Rewards provide meaningful partial-credit signals at every step:
+ - **Priority:** Close-but-wrong gets partial credit (0.50 for 1-level off vs 0.05 for 2+ levels off vs 0.95 for exact match)
+ - **Labels:** Jaccard similarity between predicted and expected label sets (continuous signal)
+ - **Team routing:** Binary accuracy, weighted per task difficulty
+ - **Security escalation:** Hard penalty (-0.15) discourages ignoring critical security signals
+ - **Clamping:** All scores strictly within (0.0, 1.0), never exactly 0 or 1

+ ---
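The individual reward signals described above can be sketched as three small functions. This is an illustration under stated assumptions, not the code in `server/task.py`: the function names are hypothetical, treating two empty label sets as a perfect match is an assumption, and the clamping margin `eps=0.05` is an assumed value chosen to mirror the 0.05 / 0.95 endpoints of the priority scoring.

```python
PRIORITY_LEVELS = ["P0", "P1", "P2", "P3"]


def priority_score(predicted: str, expected: str) -> float:
    """0.95 for an exact match, 0.50 for one level off, 0.05 otherwise."""
    # list.index raises ValueError for a priority outside P0..P3.
    distance = abs(PRIORITY_LEVELS.index(predicted) - PRIORITY_LEVELS.index(expected))
    if distance == 0:
        return 0.95
    if distance == 1:
        return 0.50
    return 0.05


def jaccard(predicted: set[str], expected: set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    union = predicted | expected
    if not union:
        return 1.0  # assumption: two empty label sets count as a perfect match
    return len(predicted & expected) / len(union)


def clamp_open(score: float, eps: float = 0.05) -> float:
    """Keep a score strictly inside (0, 1); the margin eps is an assumed value."""
    return min(max(score, eps), 1.0 - eps)
```

The Jaccard term is what makes the label signal continuous: predicting `{"bug", "security"}` when only `{"bug"}` is expected scores 1/2 rather than 0.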

  ## Setup

+ ### Run Locally
  ```bash
+ git clone https://github.com/Siteshcodes/bug-triage-env.git
  cd bug-triage-env
  pip install -r server/requirements.txt
  uvicorn server.app:app --host 0.0.0.0 --port 7860
@@ -82,9 +107,9 @@ docker build -t bug-triage-env .
  docker run -p 7860:7860 bug-triage-env
  ```

+ ### Run Inference (Hackathon Submission Script)
  ```bash
+ pip install openai openenv-core requests pydantic
  export API_BASE_URL=https://router.huggingface.co/v1
  export MODEL_NAME=meta-llama/Llama-3.3-70B-Instruct
  export HF_TOKEN=your_hf_token_here
@@ -92,27 +117,31 @@ export ENV_BASE_URL=https://siteshcodes-bug-triage-env.hf.space
  python inference.py
  ```

+ ### Environment Variables

+ | Variable | Description | Required |
+ |----------------|--------------------------------------|----------|
+ | `API_BASE_URL` | LLM API endpoint | Yes |
+ | `MODEL_NAME` | Model identifier for inference | Yes |
+ | `HF_TOKEN` | Hugging Face / API key | Yes |
+ | `ENV_BASE_URL` | Bug Triage environment URL | Optional |

+ ---
+
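A script consuming the variables in the table above might read them as follows. This is a minimal sketch, not the contents of `inference.py`: the `load_config` helper and `REQUIRED_VARS` tuple are hypothetical, and since the README does not state what default applies when `ENV_BASE_URL` is unset, the sketch simply returns `None` for it.

```python
import os
from typing import Mapping, Optional

# The three variables the README marks as required.
REQUIRED_VARS = ("API_BASE_URL", "MODEL_NAME", "HF_TOKEN")


def load_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect the documented variables, failing fast if a required one is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing required environment variables: " + ", ".join(missing))
    config = {name: env[name] for name in REQUIRED_VARS}
    # ENV_BASE_URL is optional; None leaves the choice of default to the caller.
    config["ENV_BASE_URL"] = env.get("ENV_BASE_URL")
    return config
```

Accepting an explicit mapping instead of always reading `os.environ` keeps the helper easy to exercise in tests without touching the real process environment.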
+ ## Baseline Scores

  Evaluated with `meta-llama/Llama-3.3-70B-Instruct` via HuggingFace router (temperature=0):

+ | Task | Difficulty | Score |
+ |------------|------------|-------|
+ | Easy | easy | 0.95 |
+ | Medium | medium | 0.50 |
+ | Hard | hard | 0.85 |
+ | **Average**| | **0.77** |
+
+ > Scores vary per run due to random bug sampling from a pool of 5 bugs per task.

+ ---

  ## API Endpoints

@@ -125,20 +154,78 @@ Scores vary per run due to random bug sampling from a pool of 5 bugs per task.
  | GET | `/tasks` | List all tasks with grader info |
  | GET | `/tasks/{id}` | Get specific task metadata |

+ ### Example: Reset + Step
+
+ ```bash
+ # Reset for easy task
+ curl -X POST https://siteshcodes-bug-triage-env.hf.space/reset \
+   -H "Content-Type: application/json" \
+   -d '{"task_id": "easy"}'
+
+ # Submit triage action
+ curl -X POST https://siteshcodes-bug-triage-env.hf.space/step \
+   -H "Content-Type: application/json" \
+   -d '{"action": {"priority": "P0", "labels": ["bug"], "assigned_team": "backend", "milestone": "hotfix", "reasoning": "App crash affecting all users"}}'
+ ```
+
+ ---
+
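The same reset and step calls shown in the curl example can be prepared from Python using only the standard library. The sketch below builds the two requests without sending them; the `build_post` helper is a hypothetical name, and the payloads are copied from the curl example. The shape of the JSON the service returns is not documented in this diff, so the commented-out call at the end only decodes the body without assuming any fields.

```python
import json
import urllib.request

BASE_URL = "https://siteshcodes-bug-triage-env.hf.space"  # from the curl example


def build_post(path: str, payload: dict, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the environment API."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


reset_request = build_post("/reset", {"task_id": "easy"})
step_request = build_post("/step", {
    "action": {
        "priority": "P0",
        "labels": ["bug"],
        "assigned_team": "backend",
        "milestone": "hotfix",
        "reasoning": "App crash affecting all users",
    }
})

# Sending a request needs network access, so it is left commented out:
# with urllib.request.urlopen(reset_request, timeout=30) as response:
#     body = json.loads(response.read())
```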
+ ## Inference Log Format
+
+ The inference script emits structured logs per the OpenEnv spec:
+
+ ```
+ [START] task=easy env=bug-triage-env model=meta-llama/Llama-3.3-70B-Instruct
+ [STEP] step=1 action=priority=P0,team=backend,milestone=hotfix reward=0.95 done=true error=null
+ [END] success=true steps=1 score=0.95 rewards=0.95
+
+ [START] task=medium env=bug-triage-env model=meta-llama/Llama-3.3-70B-Instruct
+ [STEP] step=1 action=priority=P0,team=backend,milestone=hotfix reward=0.85 done=true error=null
+ [END] success=true steps=1 score=0.85 rewards=0.85
+
+ [START] task=hard env=bug-triage-env model=meta-llama/Llama-3.3-70B-Instruct
+ [STEP] step=1 action=priority=P0,team=security,milestone=hotfix reward=0.72 done=true error=null
+ [END] success=true steps=1 score=0.72 rewards=0.72
+ ```
+
+ Each task gets its own `[START]` → `[STEP]` → `[END]` block.
+
+ ---
+
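Log lines in the format shown above are straightforward to parse back into structured data. The following is a rough sketch of such a parser, not part of the repository: `parse_log_line` and `LINE_RE` are hypothetical names, and the approach assumes that no value contains whitespace, which holds for the sample lines but is not guaranteed by anything in the README.

```python
import re

# Matches a leading [START], [STEP], or [END] tag followed by the rest of the line.
LINE_RE = re.compile(r"^\[(START|STEP|END)\]\s+(.*)$")


def parse_log_line(line: str):
    """Split '[TAG] key=value key=value ...' into (tag, fields); None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    tag, rest = match.groups()
    fields = {}
    for token in rest.split():
        # Split on the first '=' only, so 'action=priority=P0,...' keeps its full value.
        key, _, value = token.partition("=")
        fields[key] = value
    return tag, fields
```

Values are kept as strings; converting `reward` to a float or `done` to a boolean is left to the caller, since the log format does not specify types beyond what the samples suggest.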
+ ## Project Structure

  ```
  bug-triage-env/
  ├── server/
+ │   ├── app.py              # FastAPI + OpenEnv stateful endpoints
+ │   ├── environment.py      # BugTriageEnvironment (reset/step/state)
+ │   ├── task.py             # 15 bug reports + 3 graders
+ │   ├── __init__.py
  │   └── requirements.txt
+ ├── model.py                # Pydantic models (TriageAction, TriageObservation, TriageState)
+ ├── inference.py            # OpenAI client submission script (per-task logs)
+ ├── openenv.yaml            # OpenEnv spec manifest (3 tasks with graders)
+ ├── Dockerfile              # Docker container config
+ ├── pyproject.toml          # Package metadata
  └── README.md
+ ```
+
+ ---
+
+ ## OpenEnv Spec Compliance
+
+ | Requirement | Status |
+ |-------------------------------------|--------|
+ | Typed models (Action/Observation/State) | ✅ |
+ | `step()` / `reset()` / `state()` API | ✅ |
+ | `openenv.yaml` manifest | ✅ |
+ | 3+ tasks with graders (easy→hard) | ✅ |
+ | Reward range strictly (0.0, 1.0) | ✅ |
+ | Baseline inference with reproducible scores | ✅ |
+ | Dockerfile builds | ✅ |
+ | Deployed on HF Spaces | ✅ |
+ | Structured `[START]/[STEP]/[END]` logs | ✅ |
+
+ ---
+
+ *Built for the Meta PyTorch Hackathon x Scaler School of Technology, Round 1*