Commit · 703aa57
Parent(s): 1893444
v2.0: multi-step episodes, procedural bugs, semantic grading, sessions, 71 tests
- .gitignore +0 -0
- README.md +163 -102
- __pycache__/client.cpython-314.pyc +0 -0
- __pycache__/model.cpython-314.pyc +0 -0
- baseline.py +41 -24
- bug_triage_client.py +0 -75
- client.py +81 -17
- inference.py +257 -69
- model.py +21 -9
- openenv.yaml +26 -8
- pyproject.toml +11 -2
- server/__pycache__/__init__.cpython-314.pyc +0 -0
- server/__pycache__/task.cpython-314.pyc +0 -0
- server/app.py +161 -81
- server/environment.py +263 -42
- server/requirements.txt +2 -1
- server/task.py +725 -78
- tests/__init__.py +1 -0
- tests/test_api.py +190 -0
- tests/test_environment.py +205 -0
- tests/test_grading.py +253 -0
.gitignore
CHANGED
|
Binary files a/.gitignore and b/.gitignore differ
|
|
|
README.md
CHANGED
|
@@ -9,85 +9,127 @@ tags:
|
|
| 9 |
- openenv
|
| 10 |
---
|
| 11 |
|
| 12 |
-
# Bug Triage Environment
|
| 13 |
|
| 14 |
> **OpenEnv RL environment for the Meta PyTorch Hackathon x Scaler School of Technology**
|
| 15 |
|
| 16 |
-
|
| 17 |
|
| 18 |
**Live:** [https://siteshcodes-bug-triage-env.hf.space](https://siteshcodes-bug-triage-env.hf.space)
|
| 19 |
**GitHub:** [https://github.com/Siteshcodes/bug-triage-env](https://github.com/Siteshcodes/bug-triage-env)
|
| 20 |
|
| 21 |
---
|
| 22 |
|
| 23 |
-
##
|
| 24 |
|
| 25 |
-
|
| 26 |
|
| 27 |
-
|
| 28 |
-
|
| 29 |
-
-
|
| 30 |
-
|
| 31 |
-
|
| 32 |
-
|
| 33 |
|
| 34 |
---
|
| 35 |
|
| 36 |
## Action Space
|
| 37 |
|
| 38 |
-
| Field
|
| 39 |
-
|-------
|
| 40 |
-
| `
|
| 41 |
-
| `
|
| 42 |
-
| `
|
| 43 |
-
| `
|
| 44 |
-
| `
|
|
|
|
| 45 |
|
| 46 |
## Observation Space
|
| 47 |
|
| 48 |
-
| Field
|
| 49 |
-
|-------
|
| 50 |
-
| `bug_report` | BugReport | Title, body, author, labels_hint, comments |
|
| 51 |
-
| `task_id`
|
| 52 |
-
| `score`
|
| 53 |
-
| `reward`
|
| 54 |
-
| `feedback`
|
| 55 |
-
| `done`
|
| 56 |
|
| 57 |
---
|
| 58 |
|
| 59 |
## Tasks
|
| 60 |
|
| 61 |
### Task 1 - Easy: Priority Assignment
|
| 62 |
-
Assign a single P0–P3 priority to
|
| 63 |
- **Grader:** `server.task:priority_match`
|
| 64 |
-
- **Scoring:** exact
|
| 65 |
-
- **
|
| 66 |
-
- **Reward range:** (0.0, 1.0), strictly exclusive
|
| 67 |
|
| 68 |
### Task 2 - Medium: Priority + Labels + Team
|
| 69 |
-
Assign priority, category labels, and team routing.
|
| 70 |
- **Grader:** `server.task:priority_label_team`
|
| 71 |
-
- **Scoring:** priority 45% + label Jaccard
|
| 72 |
-
- **Reward range:** (0.0, 1.0)
|
| 73 |
|
| 74 |
### Task 3 - Hard: Full Triage
|
| 75 |
-
Full triage
|
| 76 |
- **Grader:** `server.task:full_triage`
|
| 77 |
- **Scoring:** priority 35% + labels 30% + team 20% + milestone 15%
|
| 78 |
-
- **Penalty:** −0.15 for missing security escalation
|
| 79 |
-
- **
|
|
|
|
| 80 |
|
| 81 |
---
|
| 82 |
|
| 83 |
## Reward Function
|
| 84 |
|
| 85 |
-
|
| 86 |
-
- **
|
| 87 |
-
- **
|
| 88 |
-
- **
|
| 89 |
-
- **
|
| 90 |
-
- **
|
| 91 |
|
| 92 |
---
|
| 93 |
|
|
@@ -107,7 +149,13 @@ docker build -t bug-triage-env .
|
|
| 107 |
docker run -p 7860:7860 bug-triage-env
|
| 108 |
```
|
| 109 |
|
| 110 |
-
### Run
|
| 111 |
```bash
|
| 112 |
pip install openai openenv-core requests pydantic
|
| 113 |
export API_BASE_URL=https://router.huggingface.co/v1
|
|
@@ -119,77 +167,79 @@ python inference.py
|
|
| 119 |
|
| 120 |
### Environment Variables
|
| 121 |
|
| 122 |
-
| Variable
|
| 123 |
-
|----------
|
| 124 |
-
| `API_BASE_URL` | LLM API endpoint
|
| 125 |
-
| `MODEL_NAME`
|
| 126 |
-
| `HF_TOKEN`
|
| 127 |
-
| `ENV_BASE_URL` | Bug Triage environment URL
|
| 128 |
-
|
| 129 |
-
---
|
| 130 |
-
|
| 131 |
-
## Baseline Scores
|
| 132 |
-
|
| 133 |
-
Evaluated with `meta-llama/Llama-3.3-70B-Instruct` via HuggingFace router (temperature=0):
|
| 134 |
-
|
| 135 |
-
| Task | Difficulty | Score |
|
| 136 |
-
|------------|------------|-------|
|
| 137 |
-
| Easy | easy | 0.95 |
|
| 138 |
-
| Medium | medium | 0.50 |
|
| 139 |
-
| Hard | hard | 0.85 |
|
| 140 |
-
| **Average**| | **0.77** |
|
| 141 |
-
|
| 142 |
-
> Scores vary per run due to random bug sampling from a pool of 5 bugs per task.
|
| 143 |
|
| 144 |
---
|
| 145 |
|
| 146 |
## API Endpoints
|
| 147 |
|
| 148 |
-
| Method | Endpoint
|
| 149 |
-
|--------|----------
|
| 150 |
-
| GET
|
| 151 |
-
|
|
| 152 |
-
| POST
|
| 153 |
-
|
|
| 154 |
-
| GET
|
| 155 |
-
| GET
|
|
|
|
|
|
|
|
|
|
| 156 |
|
| 157 |
-
### Example:
|
| 158 |
|
| 159 |
```bash
|
| 160 |
-
# Reset
|
| 161 |
curl -X POST https://siteshcodes-bug-triage-env.hf.space/reset \
|
| 162 |
-H "Content-Type: application/json" \
|
| 163 |
-
-d '{"task_id": "
|
| 164 |
|
| 165 |
-
#
|
| 166 |
curl -X POST https://siteshcodes-bug-triage-env.hf.space/step \
|
| 167 |
-H "Content-Type: application/json" \
|
| 168 |
-
-d '{"
|
| 169 |
```
|
| 170 |
|
| 171 |
---
|
| 172 |
|
| 173 |
## Inference Log Format
|
| 174 |
|
| 175 |
-
|
| 176 |
|
| 177 |
```
|
| 178 |
[START] task=easy env=bug-triage-env model=meta-llama/Llama-3.3-70B-Instruct
|
| 179 |
-
[STEP] step=1 action=
|
| 180 |
-
[
|
|
|
|
|
|
|
| 181 |
|
| 182 |
[START] task=medium env=bug-triage-env model=meta-llama/Llama-3.3-70B-Instruct
|
| 183 |
-
[STEP] step=1 action=
|
| 184 |
-
[
|
|
|
|
|
|
|
| 185 |
|
| 186 |
[START] task=hard env=bug-triage-env model=meta-llama/Llama-3.3-70B-Instruct
|
| 187 |
-
[STEP] step=1 action=
|
| 188 |
-
[
|
|
|
|
|
|
|
| 189 |
```
|
| 190 |
|
| 191 |
-
Each task gets its own `[START]` → `[STEP]` → `[END]` block.
|
| 192 |
-
|
| 193 |
---
|
| 194 |
|
| 195 |
## Project Structure
|
|
@@ -197,16 +247,24 @@ Each task gets its own `[START]` → `[STEP]` → `[END]` block.
|
|
| 197 |
```
|
| 198 |
bug-triage-env/
|
| 199 |
├── server/
|
| 200 |
-
│   ├── app.py # FastAPI +
|
| 201 |
-
│   ├── environment.py #
|
| 202 |
-
│   ├── task.py #
|
| 203 |
│   ├── __init__.py
|
| 204 |
-
│
|
| 205 |
├── model.py # Pydantic models (TriageAction, TriageObservation, TriageState)
|
| 206 |
-
├──
|
| 207 |
-
├──
|
| 208 |
-
├──
|
| 209 |
-
├──
|
| 210 |
└── README.md
|
| 211 |
```
|
| 212 |
|
|
@@ -214,17 +272,20 @@ bug-triage-env/
|
|
| 214 |
|
| 215 |
## OpenEnv Spec Compliance
|
| 216 |
|
| 217 |
-
| Requirement
|
| 218 |
-
|-------------
|
| 219 |
| Typed models (Action/Observation/State) | ✅ |
|
| 220 |
-
| `step()` / `reset()` / `state()` API
|
| 221 |
-
| `openenv.yaml` manifest
|
| 222 |
-
| 3+ tasks with graders (easy→hard)
|
| 223 |
-
| Reward range strictly (0.0, 1.0)
|
|
|
|
| 224 |
| Baseline inference with reproducible scores | ✅ |
|
| 225 |
-
| Dockerfile builds
|
| 226 |
-
| Deployed on HF Spaces
|
| 227 |
-
| Structured `[START]/[STEP]/[END]` logs
|
|
|
|
|
|
|
| 228 |
|
| 229 |
---
|
| 230 |
|
|
|
|
| 9 |
- openenv
|
| 10 |
---
|
| 11 |
|
| 12 |
+
# Bug Triage Environment v2.0
|
| 13 |
|
| 14 |
> **OpenEnv RL environment for the Meta PyTorch Hackathon x Scaler School of Technology**
|
| 15 |
|
| 16 |
+
A multi-step reinforcement learning environment in which an AI agent investigates and triages GitHub-style bug reports, deciding priority, labels, team ownership, and milestone, much as a senior engineer would.
|
| 17 |
|
| 18 |
**Live:** [https://siteshcodes-bug-triage-env.hf.space](https://siteshcodes-bug-triage-env.hf.space)
|
| 19 |
**GitHub:** [https://github.com/Siteshcodes/bug-triage-env](https://github.com/Siteshcodes/bug-triage-env)
|
| 20 |
|
| 21 |
---
|
| 22 |
|
| 23 |
+
## What Makes This Different
|
| 24 |
|
| 25 |
+
| Feature | v1.0 (before) | v2.0 (now) |
|
| 26 |
+
|---------|---------------|------------|
|
| 27 |
+
| Episode length | 1 step (quiz) | Multi-step investigation |
|
| 28 |
+
| Bug pool | 15 handcrafted | 200+ procedurally generated |
|
| 29 |
+
| Label matching | Exact string | Semantic (synonym-aware) |
|
| 30 |
+
| Concurrency | Broken (global state) | Session-based, thread-safe |
|
| 31 |
+
| Information reveal | Everything at once | Progressive (title → body → comments → logs) |
|
| 32 |
+
| Tests | None | 50+ unit & integration tests |
|
| 33 |
+
| Grading depth | String matching | Weighted scoring + reasoning bonus |
|
| 34 |
|
| 35 |
+
---
|
| 36 |
+
|
| 37 |
+
## Multi-Step Investigation
|
| 38 |
+
|
| 39 |
+
Unlike simple Q&A environments, the agent must **investigate before deciding**:
|
| 40 |
+
|
| 41 |
+
```
|
| 42 |
+
reset()             → Agent sees: bug title + body preview
|
| 43 |
+
step(read_body)     → Full description revealed
|
| 44 |
+
step(read_comments) → User comments revealed
|
| 45 |
+
step(check_logs)    → Stack traces + severity signals revealed
|
| 46 |
+
step(submit, ...)   → Final triage graded (reward returned)
|
| 47 |
+
```
|
| 48 |
+
|
| 49 |
+
Each investigation action consumes one step from a limited budget. The agent must learn **when it has enough information to decide correctly**, balancing accuracy against efficiency.
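
The investigate-then-submit loop can be sketched with a toy in-memory environment. `ToyTriageEnv`, its fields, and the rule that an exhausted budget ends the episode are illustrative assumptions, not the code in `server/environment.py`; only the action names and the visibility flags come from the tables in this README.

```python
from dataclasses import dataclass, field

# Map each investigation action to the observation flag it reveals.
# Action and flag names follow the README; everything else is hypothetical.
REVEALS = {
    "read_body": "body_visible",
    "read_comments": "comments_visible",
    "check_logs": "logs_visible",
}


@dataclass
class ToyTriageEnv:
    max_steps: int = 4
    steps_taken: int = 0
    done: bool = False
    flags: dict = field(
        default_factory=lambda: {flag: False for flag in REVEALS.values()}
    )

    def step(self, action_type: str) -> dict:
        if self.done:
            raise RuntimeError("episode is finished; create a new environment")
        self.steps_taken += 1
        if action_type in REVEALS:
            self.flags[REVEALS[action_type]] = True
        # Assumption: the episode ends on submit or when the budget runs out.
        if action_type == "submit" or self.steps_taken >= self.max_steps:
            self.done = True
        return {**self.flags, "steps_taken": self.steps_taken, "done": self.done}


env = ToyTriageEnv(max_steps=4)
first = env.step("read_body")   # reveals the body, episode continues
last = env.step("submit")       # ends the episode after two steps
```

Every call, including `submit`, counts against `max_steps`, which is why an agent that investigates everything may run out of budget before it can submit.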
|
| 50 |
|
| 51 |
---
|
| 52 |
|
| 53 |
## Action Space
|
| 54 |
|
| 55 |
+
| Field | Type | Values |
|
| 56 |
+
|-------|------|--------|
|
| 57 |
+
| `action_type` | string | `read_body` · `read_comments` · `check_logs` · `check_similar` · `submit` |
|
| 58 |
+
| `priority` | string | `P0` · `P1` · `P2` · `P3` (only for submit) |
|
| 59 |
+
| `labels` | list[str] | `bug` · `performance` · `security` · `ux` · `data-integrity` · `payments` … |
|
| 60 |
+
| `assigned_team` | string | `backend` · `frontend` · `infra` · `security` · `devx` |
|
| 61 |
+
| `milestone` | string | `hotfix` · `v2.1` · `backlog` |
|
| 62 |
+
| `reasoning` | string | Free-form explanation (earns bonus points) |
|
| 63 |
|
| 64 |
## Observation Space
|
| 65 |
|
| 66 |
+
| Field | Type | Description |
|
| 67 |
+
|-------|------|-------------|
|
| 68 |
+
| `bug_report` | BugReport | Title, body, author, labels_hint, comments, stack_trace |
|
| 69 |
+
| `task_id` | string | Current difficulty: `easy` / `medium` / `hard` |
|
| 70 |
+
| `score` | float | Score from grader (0.0–1.0) |
|
| 71 |
+
| `reward` | float | Reward from last action (0.0–1.0) |
|
| 72 |
+
| `feedback` | string | Human-readable grader feedback |
|
| 73 |
+
| `done` | bool | Episode complete flag |
|
| 74 |
+
| `body_visible` | bool | Whether full body has been revealed |
|
| 75 |
+
| `comments_visible` | bool | Whether comments have been revealed |
|
| 76 |
+
| `logs_visible` | bool | Whether logs/stack traces have been revealed |
|
| 77 |
+
| `steps_taken` | int | Steps used so far |
|
| 78 |
+
| `max_steps` | int | Maximum steps allowed |
|
| 79 |
|
| 80 |
---
|
| 81 |
|
| 82 |
## Tasks
|
| 83 |
|
| 84 |
### Task 1 - Easy: Priority Assignment
|
| 85 |
+
Assign a single P0–P3 priority. Up to 4 steps.
|
| 86 |
- **Grader:** `server.task:priority_match`
|
| 87 |
+
- **Scoring:** exact → 0.95, ±1 → 0.50, ±2 → 0.20, else → 0.05
|
| 88 |
+
- **Reward range:** (0.0, 1.0)
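
The graduated priority scoring above can be sketched as a lookup on the distance between the predicted and expected priority. The function name `priority_score` and the handling of unknown priority strings are assumptions made for illustration; the actual grader is `server.task:priority_match`, whose body is not shown in this diff.

```python
PRIORITIES = ["P0", "P1", "P2", "P3"]

# Score by absolute distance between predicted and expected priority,
# using the values listed in the README (exact, off by 1, off by 2).
SCORE_BY_DISTANCE = {0: 0.95, 1: 0.50, 2: 0.20}
FLOOR_SCORE = 0.05


def priority_score(predicted: str, expected: str) -> float:
    """Return graduated partial credit for a priority prediction."""
    if predicted not in PRIORITIES:
        # Assumption: an unrecognised priority gets the floor score.
        return FLOOR_SCORE
    distance = abs(PRIORITIES.index(predicted) - PRIORITIES.index(expected))
    return SCORE_BY_DISTANCE.get(distance, FLOOR_SCORE)
```

Because the best score is 0.95 and the worst is 0.05, the result stays strictly inside (0.0, 1.0) without any extra clamping.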
|
|
|
|
| 89 |
|
| 90 |
### Task 2 - Medium: Priority + Labels + Team
|
| 91 |
+
Assign priority, category labels, and team routing. Up to 5 steps.
|
| 92 |
- **Grader:** `server.task:priority_label_team`
|
| 93 |
+
- **Scoring:** priority 45% + label Jaccard (semantic) 40% + team 15%
|
| 94 |
+
- **Reward range:** (0.0, 1.0)
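
A minimal sketch of the 45% / 40% / 15% weighting with synonym-aware Jaccard similarity follows. The `SYNONYMS` table is hypothetical apart from the "defect" to "bug" pair mentioned in this README, and the helper names are assumptions; the real logic lives in `server.task:priority_label_team`.

```python
# Hypothetical synonym table; only "defect" -> "bug" is taken from the README.
SYNONYMS = {"defect": "bug", "slow": "performance", "vulnerability": "security"}


def canonical(labels):
    """Lower-case labels and map known synonyms to their canonical form."""
    return {SYNONYMS.get(l.strip().lower(), l.strip().lower()) for l in labels}


def semantic_jaccard(predicted, expected) -> float:
    """Jaccard similarity computed on canonicalised label sets."""
    a, b = canonical(predicted), canonical(expected)
    if not a and not b:
        return 1.0  # assumption: two empty label sets count as a match
    return len(a & b) / len(a | b)


def medium_score(priority_part, predicted_labels, expected_labels, team_correct):
    """Weighted sum: priority 45% + label Jaccard 40% + team 15%."""
    return (
        0.45 * priority_part
        + 0.40 * semantic_jaccard(predicted_labels, expected_labels)
        + 0.15 * (1.0 if team_correct else 0.0)
    )
```

Here `priority_part` is the graduated priority score (for example 0.95 for an exact match), so a perfect submission scores slightly below 1.0.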
|
| 95 |
|
| 96 |
### Task 3 - Hard: Full Triage
|
| 97 |
+
Full triage with security escalation penalty. Up to 6 steps.
|
| 98 |
- **Grader:** `server.task:full_triage`
|
| 99 |
- **Scoring:** priority 35% + labels 30% + team 20% + milestone 15%
|
| 100 |
+
- **Penalty:** −0.15 for missing security escalation
|
| 101 |
+
- **Bonus:** up to +0.15 for relevant reasoning
|
| 102 |
+
- **Reward range:** (0.0, 1.0)
|
| 103 |
|
| 104 |
---
|
| 105 |
|
| 106 |
## Reward Function
|
| 107 |
|
| 108 |
+
- **Priority:** Graduated partial credit (0.95 → 0.50 → 0.20 → 0.05)
|
| 109 |
+
- **Labels:** Semantic Jaccard similarity with synonym matching (e.g., "defect" ≈ "bug")
|
| 110 |
+
- **Team routing:** Binary accuracy, weighted per difficulty
|
| 111 |
+
- **Security escalation:** Hard penalty (−0.15) for ignoring security signals
|
| 112 |
+
- **Reasoning bonus:** Up to +0.15 for mentioning relevant signals
|
| 113 |
+
- **Efficiency:** +0.05 bonus for correct answers with minimal investigation
|
| 114 |
+
- **Clamping:** All scores strictly within (0.0, 1.0)
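
One way to combine the hard-task components listed above is sketched below. The weights, the 0.15 penalty, and the 0.15 bonus cap come from this README; the function names, the order in which penalty and bonus are applied, and the clamping margin `eps` are assumptions, and the efficiency bonus is left out for brevity.

```python
def clamp_open(score: float, eps: float = 0.01) -> float:
    """Keep a score strictly inside (0.0, 1.0); eps is an assumed margin."""
    return min(max(score, eps), 1.0 - eps)


def full_triage_score(
    priority_part: float,
    label_part: float,
    team_correct: bool,
    milestone_correct: bool,
    missed_security: bool = False,
    reasoning_bonus: float = 0.0,
) -> float:
    """Priority 35% + labels 30% + team 20% + milestone 15%, then adjustments."""
    base = (
        0.35 * priority_part
        + 0.30 * label_part
        + 0.20 * float(team_correct)
        + 0.15 * float(milestone_correct)
    )
    if missed_security:
        base -= 0.15  # hard penalty for ignoring security signals
    base += min(max(reasoning_bonus, 0.0), 0.15)  # bonus capped at +0.15
    return clamp_open(base)
```

Clamping last means that neither the penalty nor the bonus can push the final reward to exactly 0.0 or 1.0.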
|
| 115 |
+
|
| 116 |
+
---
|
| 117 |
+
|
| 118 |
+
## Procedural Bug Generation
|
| 119 |
+
|
| 120 |
+
The environment generates bugs from **7 template categories**:
|
| 121 |
+
|
| 122 |
+
| Category | Example Bugs |
|
| 123 |
+
|----------|-------------|
|
| 124 |
+
| `crash` | Service crashes, unhandled exceptions, segfaults |
|
| 125 |
+
| `security` | SQL injection, XSS, auth bypass, data exposure |
|
| 126 |
+
| `performance` | Memory leaks, slow queries, CPU spikes |
|
| 127 |
+
| `ui_bug` | Layout breaks, dark mode issues, accessibility |
|
| 128 |
+
| `data_corruption` | Race conditions, encoding issues, stale cache |
|
| 129 |
+
| `documentation` | Typos, outdated docs, missing guides |
|
| 130 |
+
| `api_bug` | Rate limiting bugs, pagination issues, webhook failures |
|
| 131 |
+
|
| 132 |
+
Each category combines 5-6 title templates × 2 body templates × 6-12 variable values, giving hundreds of unique combinations. The 15 original handcrafted bugs are kept as a high-quality subset, drawn with a 40% chance per sample.
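
Template-based generation with a 40% handcrafted share could look roughly like this. The templates, variable values, and the handcrafted title are invented for illustration, and `sample_bug` is a hypothetical name; the real templates in `server/task.py` are not visible in this part of the diff.

```python
import random

# Hypothetical templates and variables for a single category.
TITLE_TEMPLATES = {
    "crash": [
        "{service} crashes on {trigger}",
        "Unhandled exception in {service} during {trigger}",
    ],
}
VARIABLES = {
    "service": ["auth-service", "billing-api", "search-worker"],
    "trigger": ["startup", "bulk import", "token refresh"],
}
HANDCRAFTED = ["Login page returns 500 after password reset"]


def sample_bug(rng: random.Random, category: str = "crash",
               handcrafted_chance: float = 0.40) -> str:
    """Return a bug title: handcrafted with the given chance, else generated."""
    if rng.random() < handcrafted_chance:
        return rng.choice(HANDCRAFTED)
    template = rng.choice(TITLE_TEMPLATES[category])
    values = {name: rng.choice(options) for name, options in VARIABLES.items()}
    return template.format(**values)


rng = random.Random(0)  # seeded for reproducible sampling
titles = [sample_bug(rng) for _ in range(5)]
```

Passing a seeded `random.Random` instance rather than using the module-level functions keeps sampling reproducible per session, which matters when several episodes run concurrently.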
|
| 133 |
|
| 134 |
---
|
| 135 |
|
|
|
|
| 149 |
docker run -p 7860:7860 bug-triage-env
|
| 150 |
```
|
| 151 |
|
| 152 |
+
### Run Tests
|
| 153 |
+
```bash
|
| 154 |
+
pip install -e ".[dev]"
|
| 155 |
+
pytest tests/ -v
|
| 156 |
+
```
|
| 157 |
+
|
| 158 |
+
### Run Inference (Hackathon Submission)
|
| 159 |
```bash
|
| 160 |
pip install openai openenv-core requests pydantic
|
| 161 |
export API_BASE_URL=https://router.huggingface.co/v1
|
|
|
|
| 167 |
|
| 168 |
### Environment Variables
|
| 169 |
|
| 170 |
+
| Variable | Description | Required |
|
| 171 |
+
|----------|-------------|----------|
|
| 172 |
+
| `API_BASE_URL` | LLM API endpoint | Yes |
|
| 173 |
+
| `MODEL_NAME` | Model identifier for inference | Yes |
|
| 174 |
+
| `HF_TOKEN` | Hugging Face / API key | Yes |
|
| 175 |
+
| `ENV_BASE_URL` | Bug Triage environment URL | Optional |
|
| 176 |
|
| 177 |
---
|
| 178 |
|
| 179 |
## API Endpoints
|
| 180 |
|
| 181 |
+
| Method | Endpoint | Description |
|
| 182 |
+
|--------|----------|-------------|
|
| 183 |
+
| GET | `/` | Interactive demo frontend |
|
| 184 |
+
| GET | `/health` | Health check + active sessions |
|
| 185 |
+
| POST | `/reset` | Start new episode (returns session_id) |
|
| 186 |
+
| POST | `/step` | Investigation or submit action |
|
| 187 |
+
| GET | `/state` | Current episode state |
|
| 188 |
+
| GET | `/tasks` | List all 3 tasks |
|
| 189 |
+
| GET | `/tasks/{id}` | Task metadata |
|
| 190 |
+
| GET | `/leaderboard` | Top agent scores |
|
| 191 |
+
| POST | `/leaderboard/submit` | Submit agent scores |
|
| 192 |
|
| 193 |
+
### Example: Multi-Step Episode
|
| 194 |
|
| 195 |
```bash
|
| 196 |
+
# 1. Reset: get a bug and session_id
|
| 197 |
curl -X POST https://siteshcodes-bug-triage-env.hf.space/reset \
|
| 198 |
-H "Content-Type: application/json" \
|
| 199 |
+
-d '{"task_id": "hard"}'
|
| 200 |
|
| 201 |
+
# 2. Investigate: read full body (use session_id from step 1)
|
| 202 |
curl -X POST https://siteshcodes-bug-triage-env.hf.space/step \
|
| 203 |
-H "Content-Type: application/json" \
|
| 204 |
+
-d '{"session_id": "...", "action": {"action_type": "read_body"}}'
|
| 205 |
+
|
| 206 |
+
# 3. Investigate: read comments
|
| 207 |
+
curl -X POST https://siteshcodes-bug-triage-env.hf.space/step \
|
| 208 |
+
-H "Content-Type: application/json" \
|
| 209 |
+
-d '{"session_id": "...", "action": {"action_type": "read_comments"}}'
|
| 210 |
+
|
| 211 |
+
# 4. Submit triage decision
|
| 212 |
+
curl -X POST https://siteshcodes-bug-triage-env.hf.space/step \
|
| 213 |
+
-H "Content-Type: application/json" \
|
| 214 |
+
-d '{"session_id": "...", "action": {"action_type": "submit", "priority": "P0", "labels": ["bug", "security"], "assigned_team": "security", "milestone": "hotfix", "reasoning": "SQL injection in production: critical security vulnerability"}}'
|
| 215 |
```
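
The same request bodies can be built in Python before sending them with any HTTP client. The helper `step_payload` and the session id `"abc123"` are placeholders introduced here; the payload shape simply mirrors the curl examples above and is not taken from `client.py`.

```python
import json


def step_payload(session_id: str, action_type: str, **fields) -> dict:
    """Build the JSON body for POST /step, mirroring the curl examples."""
    action = {"action_type": action_type, **fields}
    return {"session_id": session_id, "action": action}


# Placeholder session id; a real one is returned by POST /reset.
investigate = step_payload("abc123", "read_body")
submit = step_payload(
    "abc123",
    "submit",
    priority="P0",
    labels=["bug", "security"],
    assigned_team="security",
    milestone="hotfix",
)
print(json.dumps(submit, indent=2))
```

Keeping payload construction separate from the network call makes it easy to unit-test the request shape without a running server.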
|
| 216 |
|
| 217 |
---
|
| 218 |
|
| 219 |
## Inference Log Format
|
| 220 |
|
| 221 |
+
Structured logs per OpenEnv spec (3 tasks, each with its own block):
|
| 222 |
|
| 223 |
```
|
| 224 |
[START] task=easy env=bug-triage-env model=meta-llama/Llama-3.3-70B-Instruct
|
| 225 |
+
[STEP] step=1 action=investigate:read_body reward=0.00 done=false error=null
|
| 226 |
+
[STEP] step=2 action=investigate:read_comments reward=0.00 done=false error=null
|
| 227 |
+
[STEP] step=3 action=priority=P0,team=backend,milestone=hotfix reward=0.95 done=true error=null
|
| 228 |
+
[END] success=true steps=3 score=0.95 rewards=0.95
|
| 229 |
|
| 230 |
[START] task=medium env=bug-triage-env model=meta-llama/Llama-3.3-70B-Instruct
|
| 231 |
+
[STEP] step=1 action=investigate:read_body reward=0.00 done=false error=null
|
| 232 |
+
[STEP] step=2 action=investigate:read_comments reward=0.00 done=false error=null
|
| 233 |
+
[STEP] step=3 action=priority=P0,team=backend,milestone=hotfix reward=0.85 done=true error=null
|
| 234 |
+
[END] success=true steps=3 score=0.85 rewards=0.85
|
| 235 |
|
| 236 |
[START] task=hard env=bug-triage-env model=meta-llama/Llama-3.3-70B-Instruct
|
| 237 |
+
[STEP] step=1 action=investigate:read_body reward=0.00 done=false error=null
|
| 238 |
+
[STEP] step=2 action=investigate:read_comments reward=0.00 done=false error=null
|
| 239 |
+
[STEP] step=3 action=priority=P0,team=security,milestone=hotfix reward=0.92 done=true error=null
|
| 240 |
+
[END] success=true steps=3 score=0.92 rewards=0.92
|
| 241 |
```
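
For downstream tooling, a `[STEP]` line in the format shown above can be parsed with a regular expression. This is a sketch that assumes the `action` and `error` fields never contain spaces, which holds for the sample log but is not guaranteed by anything in this diff; `parse_step` is a hypothetical helper.

```python
import re

# Field order and names follow the sample log lines in the README.
STEP_RE = re.compile(
    r"^\[STEP\] step=(?P<step>\d+) action=(?P<action>\S+) "
    r"reward=(?P<reward>[\d.]+) done=(?P<done>true|false) error=(?P<error>\S+)$"
)


def parse_step(line: str) -> dict:
    """Parse one [STEP] log line into typed fields."""
    match = STEP_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a [STEP] line: {line!r}")
    return {
        "step": int(match["step"]),
        "action": match["action"],
        "reward": float(match["reward"]),
        "done": match["done"] == "true",
        "error": None if match["error"] == "null" else match["error"],
    }


parsed = parse_step(
    "[STEP] step=3 action=priority=P0,team=backend,milestone=hotfix "
    "reward=0.95 done=true error=null"
)
```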
|
| 242 |
|
|
|
|
|
|
|
| 243 |
---
|
| 244 |
|
| 245 |
## Project Structure
|
|
|
|
| 247 |
```
|
| 248 |
bug-triage-env/
|
| 249 |
├── server/
|
| 250 |
+
│   ├── app.py # FastAPI routes + session management
|
| 251 |
+
│   ├── environment.py # Multi-step environment + SessionManager
|
| 252 |
+
│   ├── task.py # 200+ bugs (procedural + handcrafted) + semantic grading
|
| 253 |
│   ├── __init__.py
|
| 254 |
+
│   ├── requirements.txt
|
| 255 |
+
│   └── static/
|
| 256 |
+
│       └── index.html # Interactive demo
|
| 257 |
+
├── tests/
|
| 258 |
+
│   ├── test_grading.py # Grading logic tests
|
| 259 |
+
│   ├── test_environment.py # Environment flow tests
|
| 260 |
+
│   └── test_api.py # HTTP endpoint integration tests
|
| 261 |
├── model.py # Pydantic models (TriageAction, TriageObservation, TriageState)
|
| 262 |
+
├── client.py # HTTP client (single source of truth)
|
| 263 |
+
├── inference.py # Multi-step OpenAI agent (hackathon submission)
|
| 264 |
+
├── baseline.py # Groq baseline agent
|
| 265 |
+
├── openenv.yaml # OpenEnv spec manifest
|
| 266 |
+
├── Dockerfile # Docker config
|
| 267 |
+
├── pyproject.toml # Package metadata + dev deps
|
| 268 |
└── README.md
|
| 269 |
```
|
| 270 |
|
|
|
|
| 272 |
|
| 273 |
## OpenEnv Spec Compliance
|
| 274 |
|
| 275 |
+
| Requirement | Status |
|
| 276 |
+
|-------------|--------|
|
| 277 |
| Typed models (Action/Observation/State) | ✅ |
|
| 278 |
+
| `step()` / `reset()` / `state()` API | ✅ |
|
| 279 |
+
| `openenv.yaml` manifest | ✅ |
|
| 280 |
+
| 3+ tasks with graders (easy → hard) | ✅ |
|
| 281 |
+
| Reward range strictly (0.0, 1.0) | ✅ |
|
| 282 |
+
| Multi-step episodes | ✅ |
|
| 283 |
| Baseline inference with reproducible scores | ✅ |
|
| 284 |
+
| Dockerfile builds | ✅ |
|
| 285 |
+
| Deployed on HF Spaces | ✅ |
|
| 286 |
+
| Structured `[START]/[STEP]/[END]` logs | ✅ |
|
| 287 |
+
| Session-based concurrency | ✅ |
|
| 288 |
+
| 50+ automated tests | ✅ |
|
| 289 |
|
| 290 |
---
|
| 291 |
|
__pycache__/client.cpython-314.pyc
DELETED
|
Binary file (5.72 kB)
|
|
|
__pycache__/model.cpython-314.pyc
DELETED
|
Binary file (4.18 kB)
|
|
|
baseline.py
CHANGED
|
@@ -1,17 +1,16 @@
|
|
| 1 |
# baseline.py
|
| 2 |
-
# Runs a Groq-hosted LLaMA model against all 3 tasks
|
| 3 |
# Set env vars: GROQ_API_KEY, ENV_BASE_URL (optional)
|
| 4 |
|
| 5 |
import os
|
| 6 |
import json
|
|
|
|
| 7 |
from groq import Groq
|
| 8 |
from client import BugTriageClient
|
| 9 |
from model import TriageAction
|
| 10 |
-
import time
|
| 11 |
|
| 12 |
-
# ── config ─────────────────────────────────────────────────
|
| 13 |
GROQ_API_KEY = os.getenv("GROQ_API_KEY")
|
| 14 |
-
MODEL = "llama-3.3-70b-versatile"
|
| 15 |
TEMPERATURE = 0.0
|
| 16 |
MAX_TOKENS = 400
|
| 17 |
|
|
@@ -40,12 +39,19 @@ Milestones: hotfix | v2.1 | backlog"""
|
|
| 40 |
|
| 41 |
def format_bug(obs) -> str:
|
| 42 |
bug = obs.bug_report
|
| 43 |
-
|
| 44 |
-
|
| 45 |
-
|
| 46 |
-
|
| 47 |
-
f"
|
| 48 |
-
|
| 49 |
|
| 50 |
|
| 51 |
def call_model(groq_client: Groq, bug_text: str) -> TriageAction:
|
|
@@ -60,7 +66,6 @@ def call_model(groq_client: Groq, bug_text: str) -> TriageAction:
|
|
| 60 |
)
|
| 61 |
raw = response.choices[0].message.content.strip()
|
| 62 |
|
| 63 |
-
# strip accidental markdown fences
|
| 64 |
if raw.startswith("```"):
|
| 65 |
raw = raw.split("```")[1]
|
| 66 |
if raw.startswith("json"):
|
|
@@ -68,6 +73,7 @@ def call_model(groq_client: Groq, bug_text: str) -> TriageAction:
|
|
| 68 |
|
| 69 |
data = json.loads(raw)
|
| 70 |
return TriageAction(
|
|
|
|
| 71 |
priority=data["priority"],
|
| 72 |
labels=data.get("labels", []),
|
| 73 |
assigned_team=data.get("assigned_team", "backend"),
|
|
@@ -78,26 +84,39 @@ def call_model(groq_client: Groq, bug_text: str) -> TriageAction:
|
|
| 78 |
|
| 79 |
def main():
|
| 80 |
if not GROQ_API_KEY:
|
| 81 |
-
raise EnvironmentError(
|
|
|
|
| 82 |
|
| 83 |
groq_client = Groq(api_key=GROQ_API_KEY)
|
| 84 |
scores = {}
|
| 85 |
-
step_count = 0
|
| 86 |
|
| 87 |
print("=" * 50)
|
| 88 |
-
print(" Bug Triage Env - Baseline
|
| 89 |
print(f" Model: {MODEL}")
|
| 90 |
print("=" * 50)
|
| 91 |
|
|
|
|
|
|
|
| 92 |
with BugTriageClient() as env:
|
| 93 |
-
|
| 94 |
-
|
| 95 |
-
|
| 96 |
-
|
| 97 |
-
task = obs.task_id
|
| 98 |
-
print(f"\n── Task: {task.upper()} ──")
|
| 99 |
print(f" Bug: {obs.bug_report.title}")
|
| 100 |
|
| 101 |
bug_text = format_bug(obs)
|
| 102 |
action = call_model(groq_client, bug_text)
|
| 103 |
|
|
@@ -112,21 +131,19 @@ def main():
|
|
| 112 |
print(f" β Reward: {result.reward:.3f}")
|
| 113 |
print(f" β Feedback: {obs.feedback}")
|
| 114 |
|
| 115 |
-
scores[
|
| 116 |
-
step_count += 1
|
| 117 |
time.sleep(2)
|
| 118 |
|
| 119 |
print("\n" + "=" * 50)
|
| 120 |
print(" BASELINE SCORES")
|
| 121 |
print("=" * 50)
|
| 122 |
-
task_order = ["easy", "medium", "hard"]
|
| 123 |
total = 0.0
|
| 124 |
for task in task_order:
|
| 125 |
s = scores.get(task, 0.0)
|
| 126 |
bar = "β" * int(s * 20) + "β" * (20 - int(s * 20))
|
| 127 |
print(f" {task:<8} {bar} {s:.3f}")
|
| 128 |
total += s
|
| 129 |
-
avg = total / max(
|
| 130 |
print(f"\n Average score: {avg:.3f}")
|
| 131 |
print("=" * 50)
|
| 132 |
|
|
|
|
| 1 |
# baseline.py
|
| 2 |
+
# Runs a Groq-hosted LLaMA model against all 3 tasks with multi-step investigation
|
| 3 |
# Set env vars: GROQ_API_KEY, ENV_BASE_URL (optional)
|
| 4 |
|
| 5 |
import os
|
| 6 |
import json
|
| 7 |
+
import time
|
| 8 |
from groq import Groq
|
| 9 |
from client import BugTriageClient
|
| 10 |
from model import TriageAction
|
|
|
|
| 11 |
|
|
|
|
| 12 |
GROQ_API_KEY = os.getenv("GROQ_API_KEY")
|
| 13 |
+
MODEL = "llama-3.3-70b-versatile"
|
| 14 |
TEMPERATURE = 0.0
|
| 15 |
MAX_TOKENS = 400
|
| 16 |
|
|
|
|
| 39 |
|
| 40 |
def format_bug(obs) -> str:
|
| 41 |
bug = obs.bug_report
|
| 42 |
+
parts = [f"Title: {bug.title}", f"\nDescription:\n{bug.body}"]
|
| 43 |
+
|
| 44 |
+
if obs.comments_visible and bug.comments:
|
| 45 |
+
comments = "\n".join(f" - {c}" for c in bug.comments)
|
| 46 |
+
parts.append(f"\nComments:\n{comments}")
|
| 47 |
+
|
| 48 |
+
if bug.labels_hint:
|
| 49 |
+
parts.append(f"\nExisting labels: {', '.join(bug.labels_hint)}")
|
| 50 |
+
|
| 51 |
+
if obs.logs_visible and bug.stack_trace:
|
| 52 |
+
parts.append(f"\nStack trace: {bug.stack_trace}")
|
| 53 |
+
|
| 54 |
+
return "\n".join(parts)
|
| 55 |
|
| 56 |
|
| 57 |
def call_model(groq_client: Groq, bug_text: str) -> TriageAction:
|
|
|
|
| 66 |
)
|
| 67 |
raw = response.choices[0].message.content.strip()
|
| 68 |
|
|
|
|
| 69 |
if raw.startswith("```"):
|
| 70 |
raw = raw.split("```")[1]
|
| 71 |
if raw.startswith("json"):
|
|
|
|
| 73 |
|
| 74 |
data = json.loads(raw)
|
| 75 |
return TriageAction(
|
| 76 |
+
action_type="submit",
|
| 77 |
priority=data["priority"],
|
| 78 |
labels=data.get("labels", []),
|
| 79 |
assigned_team=data.get("assigned_team", "backend"),
|
|
|
|
| 84 |
|
| 85 |
def main():
|
| 86 |
if not GROQ_API_KEY:
|
| 87 |
+
raise EnvironmentError(
|
| 88 |
+
"GROQ_API_KEY not set. Get a free key at console.groq.com")
|
| 89 |
|
| 90 |
groq_client = Groq(api_key=GROQ_API_KEY)
|
| 91 |
scores = {}
|
|
|
|
| 92 |
|
| 93 |
print("=" * 50)
|
| 94 |
+
print(" Bug Triage Env - Baseline (Multi-Step Agent)")
|
| 95 |
print(f" Model: {MODEL}")
|
| 96 |
print("=" * 50)
|
| 97 |
|
| 98 |
+
task_order = ["easy", "medium", "hard"]
|
| 99 |
+
|
| 100 |
with BugTriageClient() as env:
|
| 101 |
+
for task_id in task_order:
|
| 102 |
+
obs = env.reset(task_id=task_id)
|
| 103 |
+
|
| 104 |
+
print(f"\n── Task: {task_id.upper()} ──")
|
|
|
|
|
|
|
| 105 |
print(f" Bug: {obs.bug_report.title}")
|
| 106 |
|
| 107 |
+
# Step 1: Read full body
|
| 108 |
+
if not obs.body_visible:
|
| 109 |
+
result = env.investigate("read_body")
|
| 110 |
+
obs = result.observation
|
| 111 |
+
print(f" Investigated: read_body")
|
| 112 |
+
|
| 113 |
+
# Step 2: Read comments
|
| 114 |
+
if not obs.comments_visible:
|
| 115 |
+
result = env.investigate("read_comments")
|
| 116 |
+
obs = result.observation
|
| 117 |
+
print(f" Investigated: read_comments")
|
| 118 |
+
|
| 119 |
+
# Step 3: Submit triage
|
| 120 |
bug_text = format_bug(obs)
|
| 121 |
action = call_model(groq_client, bug_text)
|
| 122 |
|
|
|
|
| 131 |
print(f" β Reward: {result.reward:.3f}")
|
| 132 |
print(f" β Feedback: {obs.feedback}")
|
| 133 |
|
| 134 |
+
scores[task_id] = result.reward
|
|
|
|
| 135 |
time.sleep(2)
|
| 136 |
|
| 137 |
print("\n" + "=" * 50)
|
| 138 |
print(" BASELINE SCORES")
|
| 139 |
print("=" * 50)
|
|
|
|
| 140 |
total = 0.0
|
| 141 |
for task in task_order:
|
| 142 |
s = scores.get(task, 0.0)
|
| 143 |
bar = "β" * int(s * 20) + "β" * (20 - int(s * 20))
|
| 144 |
print(f" {task:<8} {bar} {s:.3f}")
|
| 145 |
total += s
|
| 146 |
+
avg = total / max(len(scores), 1)
|
| 147 |
print(f"\n Average score: {avg:.3f}")
|
| 148 |
print("=" * 50)
|
| 149 |
|
bug_triage_client.py
DELETED
|
@@ -1,75 +0,0 @@
|
|
| 1 |
-
# client.py
|
| 2 |
-
import os
|
| 3 |
-
import requests
|
| 4 |
-
from typing import Optional
|
| 5 |
-
from model import TriageAction, TriageObservation, BugReport
|
| 6 |
-
|
| 7 |
-
|
| 8 |
-
class StepResult:
|
| 9 |
-
def __init__(self, observation: TriageObservation, reward: float, done: bool, info: dict):
|
| 10 |
-
self.observation = observation
|
| 11 |
-
self.reward = reward
|
| 12 |
-
self.done = done
|
| 13 |
-
self.info = info
|
| 14 |
-
|
| 15 |
-
|
| 16 |
-
def _parse_observation(data: dict) -> TriageObservation:
|
| 17 |
-
bug_data = data["bug_report"]
|
| 18 |
-
bug = BugReport(**bug_data)
|
| 19 |
-
return TriageObservation(
|
| 20 |
-
bug_report=bug,
|
| 21 |
-
task_id=data.get("task_id", "easy"),
|
| 22 |
-
score=data.get("score", 0.0),
|
| 23 |
-
feedback=data.get("feedback", ""),
|
| 24 |
-
done=data.get("done", False),
|
| 25 |
-
reward=data.get("reward", 0.0),
|
| 26 |
-
)
|
| 27 |
-
|
| 28 |
-
|
| 29 |
-
class BugTriageClient:
|
| 30 |
-
def __init__(self, base_url: Optional[str] = None):
|
| 31 |
-
self.base_url = (
|
| 32 |
-
base_url
|
| 33 |
-
or os.getenv("ENV_BASE_URL", "https://siteshcodes-bug-triage-env.hf.space")
|
| 34 |
-
).rstrip("/")
|
| 35 |
-
self.session = requests.Session()
|
| 36 |
-
self.session.headers.update({"Content-Type": "application/json"})
|
| 37 |
-
|
| 38 |
-
def reset(self) -> TriageObservation:
|
| 39 |
-
response = self.session.post(f"{self.base_url}/reset", json={}, timeout=30)
|
| 40 |
-
response.raise_for_status()
|
| 41 |
-
data = response.json()
|
| 42 |
-
obs_data = data.get("observation", data)
|
| 43 |
-
return _parse_observation(obs_data)
|
| 44 |
-
|
| 45 |
-
def step(self, action: TriageAction) -> StepResult:
|
| 46 |
-
try:
|
| 47 |
-
action_dict = action.model_dump()
|
| 48 |
-
except AttributeError:
|
| 49 |
-
action_dict = action.dict()
|
| 50 |
-
payload = {"action": action_dict}
|
| 51 |
-
response = self.session.post(f"{self.base_url}/step", json=payload, timeout=30)
|
| 52 |
-
response.raise_for_status()
|
| 53 |
-
data = response.json()
|
| 54 |
-
obs_data = data.get("observation", data)
|
| 55 |
-
obs = _parse_observation(obs_data)
|
| 56 |
-
return StepResult(
|
| 57 |
-
observation=obs,
|
| 58 |
-
reward=data.get("reward", obs.reward) or 0.0,
|
| 59 |
-
done=data.get("done", obs.done),
|
| 60 |
-
info={},
|
| 61 |
-
)
|
| 62 |
-
|
| 63 |
-
def state(self) -> dict:
|
| 64 |
-
response = self.session.get(f"{self.base_url}/state", timeout=30)
|
| 65 |
-
response.raise_for_status()
|
| 66 |
-
return response.json()
|
| 67 |
-
|
| 68 |
-
def close(self):
|
| 69 |
-
self.session.close()
|
| 70 |
-
|
| 71 |
-
def __enter__(self):
|
| 72 |
-
return self
|
| 73 |
-
|
| 74 |
-
def __exit__(self, *args):
|
| 75 |
-
self.close()
|
client.py
CHANGED
|
@@ -1,12 +1,14 @@
|
|
| 1 |
-
# client.py
|
| 2 |
import os
|
| 3 |
import requests
|
| 4 |
-
from typing import Optional
|
| 5 |
from model import TriageAction, TriageObservation, BugReport
|
| 6 |
|
| 7 |
|
| 8 |
class StepResult:
|
| 9 |
-
|
|
|
|
|
|
|
| 10 |
self.observation = observation
|
| 11 |
self.reward = reward
|
| 12 |
self.done = done
|
|
@@ -14,11 +16,13 @@ class StepResult:
|
|
| 14 |
|
| 15 |
|
| 16 |
def _parse_observation(data: dict) -> TriageObservation:
|
|
|
|
| 17 |
bug_data = data["bug_report"]
|
| 18 |
try:
|
| 19 |
bug = BugReport.model_validate(bug_data)
|
| 20 |
except Exception:
|
| 21 |
bug = BugReport(**bug_data)
|
|
|
|
| 22 |
return TriageObservation(
|
| 23 |
bug_report=bug,
|
| 24 |
task_id=data.get("task_id", "easy"),
|
|
@@ -26,10 +30,18 @@ def _parse_observation(data: dict) -> TriageObservation:
|
|
| 26 |
feedback=data.get("feedback", ""),
|
| 27 |
done=data.get("done", False),
|
| 28 |
reward=data.get("reward", 0.0),
|
| 29 |
)
|
| 30 |
|
| 31 |
|
| 32 |
class BugTriageClient:
|
|
|
|
|
|
|
| 33 |
def __init__(self, base_url: Optional[str] = None):
|
| 34 |
self.base_url = (
|
| 35 |
base_url
|
|
@@ -37,39 +49,91 @@ class BugTriageClient:
|
|
| 37 |
).rstrip("/")
|
| 38 |
self.session = requests.Session()
|
| 39 |
self.session.headers.update({"Content-Type": "application/json"})
|
| 40 |
|
| 41 |
-
def reset(self, task_id: str = "easy") -> TriageObservation:
|
| 42 |
response = self.session.post(
|
| 43 |
-
f"{self.base_url}/reset",
|
| 44 |
-
json={"task_id": task_id},
|
| 45 |
-
timeout=30,
|
| 46 |
)
|
| 47 |
response.raise_for_status()
|
| 48 |
data = response.json()
|
| 49 |
-
|
|
|
|
|
|
|
|
|
|
| 50 |
|
| 51 |
def step(self, action: TriageAction) -> StepResult:
|
|
|
|
| 52 |
try:
|
| 53 |
-
action_dict = action.model_dump()
|
| 54 |
except AttributeError:
|
| 55 |
-
action_dict = action.dict()
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 56 |
response = self.session.post(
|
| 57 |
-
f"{self.base_url}/step",
|
| 58 |
-
json={"action": action_dict},
|
| 59 |
-
timeout=30,
|
| 60 |
)
|
| 61 |
response.raise_for_status()
|
| 62 |
data = response.json()
|
| 63 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 64 |
return StepResult(
|
| 65 |
observation=obs,
|
| 66 |
-
reward=
|
| 67 |
done=data.get("done", obs.done),
|
| 68 |
-
info={},
|
| 69 |
)
|
| 70 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 71 |
def state(self) -> dict:
|
| 72 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 73 |
response.raise_for_status()
|
| 74 |
return response.json()
|
| 75 |
|
|
|
|
| 1 |
+
# client.py: single source of truth for the environment client
|
| 2 |
import os
|
| 3 |
import requests
|
| 4 |
+
from typing import Optional, List
|
| 5 |
from model import TriageAction, TriageObservation, BugReport
|
| 6 |
|
| 7 |
|
| 8 |
class StepResult:
|
| 9 |
+
"""Result returned by env.step()."""
|
| 10 |
+
def __init__(self, observation: TriageObservation, reward: float,
|
| 11 |
+
done: bool, info: dict):
|
| 12 |
self.observation = observation
|
| 13 |
self.reward = reward
|
| 14 |
self.done = done
|
|
|
|
| 16 |
|
| 17 |
|
| 18 |
def _parse_observation(data: dict) -> TriageObservation:
|
| 19 |
+
"""Parse a JSON dict into a TriageObservation."""
|
| 20 |
bug_data = data["bug_report"]
|
| 21 |
try:
|
| 22 |
bug = BugReport.model_validate(bug_data)
|
| 23 |
except Exception:
|
| 24 |
bug = BugReport(**bug_data)
|
| 25 |
+
|
| 26 |
return TriageObservation(
|
| 27 |
bug_report=bug,
|
| 28 |
task_id=data.get("task_id", "easy"),
|
|
|
|
| 30 |
feedback=data.get("feedback", ""),
|
| 31 |
done=data.get("done", False),
|
| 32 |
reward=data.get("reward", 0.0),
|
| 33 |
+
body_visible=data.get("body_visible", False),
|
| 34 |
+
comments_visible=data.get("comments_visible", False),
|
| 35 |
+
logs_visible=data.get("logs_visible", False),
|
| 36 |
+
similar_visible=data.get("similar_visible", False),
|
| 37 |
+
steps_taken=data.get("steps_taken", 0),
|
| 38 |
+
max_steps=data.get("max_steps", 6),
|
| 39 |
)
|
| 40 |
|
| 41 |
|
| 42 |
class BugTriageClient:
|
| 43 |
+
"""HTTP client for the Bug Triage Environment server."""
|
| 44 |
+
|
| 45 |
def __init__(self, base_url: Optional[str] = None):
|
| 46 |
self.base_url = (
|
| 47 |
base_url
|
|
|
|
| 49 |
).rstrip("/")
|
| 50 |
self.session = requests.Session()
|
| 51 |
self.session.headers.update({"Content-Type": "application/json"})
|
| 52 |
+
self._session_id: Optional[str] = None
|
| 53 |
+
|
| 54 |
+
@property
|
| 55 |
+
def session_id(self) -> Optional[str]:
|
| 56 |
+
return self._session_id
|
| 57 |
+
|
| 58 |
+
def reset(self, task_id: str = "easy", seed: Optional[int] = None) -> TriageObservation:
|
| 59 |
+
"""Start a new episode. Stores session_id for subsequent step() calls."""
|
| 60 |
+
payload = {"task_id": task_id}
|
| 61 |
+
if seed is not None:
|
| 62 |
+
payload["seed"] = seed
|
| 63 |
+
if self._session_id:
|
| 64 |
+
payload["session_id"] = self._session_id
|
| 65 |
|
|
|
|
| 66 |
response = self.session.post(
|
| 67 |
+
f"{self.base_url}/reset", json=payload, timeout=30,
|
|
|
|
|
|
|
| 68 |
)
|
| 69 |
response.raise_for_status()
|
| 70 |
data = response.json()
|
| 71 |
+
|
| 72 |
+
self._session_id = data.get("session_id")
|
| 73 |
+
obs_data = data.get("observation", data)
|
| 74 |
+
return _parse_observation(obs_data)
|
| 75 |
|
| 76 |
def step(self, action: TriageAction) -> StepResult:
|
| 77 |
+
"""Send an action (investigation or submit) and get the result."""
|
| 78 |
try:
|
| 79 |
+
action_dict = action.model_dump()
|
| 80 |
except AttributeError:
|
| 81 |
+
action_dict = action.dict()
|
| 82 |
+
|
| 83 |
+
payload = {"action": action_dict}
|
| 84 |
+
if self._session_id:
|
| 85 |
+
payload["session_id"] = self._session_id
|
| 86 |
+
|
| 87 |
response = self.session.post(
|
| 88 |
+
f"{self.base_url}/step", json=payload, timeout=30,
|
|
|
|
|
|
|
| 89 |
)
|
| 90 |
response.raise_for_status()
|
| 91 |
data = response.json()
|
| 92 |
+
|
| 93 |
+
obs_data = data.get("observation", data)
|
| 94 |
+
obs = _parse_observation(obs_data)
|
| 95 |
+
|
| 96 |
+
reward = data.get("reward", obs.reward) or 0.0
|
| 97 |
+
reward = float(reward)
|
| 98 |
+
|
| 99 |
+
# Update session_id if server returned one
|
| 100 |
+
if "session_id" in data:
|
| 101 |
+
self._session_id = data["session_id"]
|
| 102 |
+
|
| 103 |
return StepResult(
|
| 104 |
observation=obs,
|
| 105 |
+
reward=reward,
|
| 106 |
done=data.get("done", obs.done),
|
| 107 |
+
info=data.get("info", {}),
|
| 108 |
)
|
| 109 |
|
| 110 |
+
def investigate(self, action_type: str) -> StepResult:
|
| 111 |
+
"""Shortcut for investigation actions."""
|
| 112 |
+
action = TriageAction(action_type=action_type)
|
| 113 |
+
return self.step(action)
|
| 114 |
+
|
| 115 |
+
def submit(self, priority: str, labels: Optional[List[str]] = None,
|
| 116 |
+
assigned_team: str = "backend", milestone: str = "backlog",
|
| 117 |
+
reasoning: str = "") -> StepResult:
|
| 118 |
+
"""Shortcut for submitting the final triage decision."""
|
| 119 |
+
action = TriageAction(
|
| 120 |
+
action_type="submit",
|
| 121 |
+
priority=priority,
|
| 122 |
+
labels=labels or ["bug"],
|
| 123 |
+
assigned_team=assigned_team,
|
| 124 |
+
milestone=milestone,
|
| 125 |
+
reasoning=reasoning,
|
| 126 |
+
)
|
| 127 |
+
return self.step(action)
|
| 128 |
+
|
| 129 |
def state(self) -> dict:
|
| 130 |
+
"""Get current environment state."""
|
| 131 |
+
params = {}
|
| 132 |
+
if self._session_id:
|
| 133 |
+
params["session_id"] = self._session_id
|
| 134 |
+
response = self.session.get(
|
| 135 |
+
f"{self.base_url}/state", params=params, timeout=30,
|
| 136 |
+
)
|
| 137 |
response.raise_for_status()
|
| 138 |
return response.json()
|
| 139 |
|
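The session handling added to `reset()` and `step()` above follows one pattern: attach `session_id` to the outgoing payload only when one is stored, then adopt whatever id the server sends back. The sketch below isolates that bookkeeping without any network calls. `SessionTracker` and its method names are hypothetical and do not appear in the repository. It also follows the `step()` behaviour (update only when the key is present), whereas `reset()` in the diff overwrites the stored id unconditionally.

```python
from typing import Any, Dict, Optional


class SessionTracker:
    """Hypothetical helper mirroring the client's session_id bookkeeping."""

    def __init__(self) -> None:
        self.session_id: Optional[str] = None

    def attach(self, payload: Dict[str, Any]) -> Dict[str, Any]:
        """Return a copy of payload, adding session_id if one is known."""
        out = dict(payload)
        if self.session_id:
            out["session_id"] = self.session_id
        return out

    def update(self, response: Dict[str, Any]) -> None:
        """Adopt the server's session_id when the response carries one."""
        if "session_id" in response:
            self.session_id = response["session_id"]


tracker = SessionTracker()
first = tracker.attach({"task_id": "easy"})  # no session yet, sent as-is
tracker.update({"session_id": "abc123", "observation": {}})
second = tracker.attach({"action": {"action_type": "read_body"}})
```

Keeping this logic in one place would, under these assumptions, avoid repeating the same `if self._session_id:` check in `reset()`, `step()`, and `state()`.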
inference.py
CHANGED
|
@@ -20,6 +20,10 @@ from openai import OpenAI
|
|
| 20 |
from model import TriageAction, TriageObservation, BugReport
|
| 21 |
|
| 22 |
|
|
|
|
|
|
|
|
|
|
|
|
|
| 23 |
API_BASE_URL = os.getenv("API_BASE_URL") or "https://router.huggingface.co/v1"
|
| 24 |
API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY") or os.getenv("OPENAI_API_KEY")
|
| 25 |
MODEL_NAME = os.getenv("MODEL_NAME") or "meta-llama/Llama-3.3-70B-Instruct"
|
|
@@ -31,9 +35,9 @@ if not API_KEY:
|
|
| 31 |
TASK_IDS = ["easy", "medium", "hard"]
|
| 32 |
BENCHMARK = "bug-triage-env"
|
| 33 |
TEMPERATURE = 0.0
|
| 34 |
-
MAX_TOKENS =
|
| 35 |
-
MAX_STEPS =
|
| 36 |
-
MAX_TOTAL_REWARD = 1.0
|
| 37 |
SUCCESS_SCORE_THRESHOLD = 0.4
|
| 38 |
|
| 39 |
print(f"[CONFIG] API_BASE_URL={API_BASE_URL}", flush=True)
|
|
@@ -41,7 +45,10 @@ print(f"[CONFIG] MODEL_NAME={MODEL_NAME}", flush=True)
|
|
| 41 |
print(f"[CONFIG] ENV_BASE_URL={ENV_BASE_URL}", flush=True)
|
| 42 |
print(f"[CONFIG] API_KEY={'set' if API_KEY else 'MISSING'}", flush=True)
|
| 43 |
|
| 44 |
-
|
|
|
|
|
|
|
|
|
|
| 45 |
|
| 46 |
def _parse_observation(data: dict) -> TriageObservation:
|
| 47 |
try:
|
|
@@ -51,15 +58,22 @@ def _parse_observation(data: dict) -> TriageObservation:
|
|
| 51 |
return TriageObservation(
|
| 52 |
bug_report=bug,
|
| 53 |
task_id=data.get("task_id", "easy"),
|
| 54 |
-
score=data.get("score", 0.
|
| 55 |
feedback=data.get("feedback", ""),
|
| 56 |
done=data.get("done", False),
|
| 57 |
-
reward=data.get("reward", 0.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 58 |
)
|
| 59 |
|
| 60 |
|
| 61 |
class StepResult:
|
| 62 |
-
def __init__(self, observation: TriageObservation, reward: float,
|
|
|
|
| 63 |
self.observation = observation
|
| 64 |
self.reward = reward
|
| 65 |
self.done = done
|
|
@@ -71,42 +85,53 @@ class BugTriageClient:
|
|
| 71 |
self.base_url = (base_url or ENV_BASE_URL).rstrip("/")
|
| 72 |
self.session = requests.Session()
|
| 73 |
self.session.headers.update({"Content-Type": "application/json"})
|
|
|
|
| 74 |
|
| 75 |
def reset(self, task_id: str = "easy") -> TriageObservation:
|
| 76 |
print(f"[ENV] Resetting env for task={task_id}", flush=True)
|
|
|
|
|
|
|
|
|
|
|
|
|
| 77 |
response = self.session.post(
|
| 78 |
-
f"{self.base_url}/reset",
|
| 79 |
-
json={"task_id": task_id},
|
| 80 |
-
timeout=30,
|
| 81 |
)
|
| 82 |
response.raise_for_status()
|
| 83 |
data = response.json()
|
|
|
|
| 84 |
return _parse_observation(data.get("observation", data))
|
| 85 |
|
| 86 |
def step(self, action: TriageAction) -> StepResult:
|
| 87 |
-
print("[ENV] Sending step action.
|
| 88 |
try:
|
| 89 |
action_dict = action.model_dump()
|
| 90 |
except AttributeError:
|
| 91 |
action_dict = action.dict()
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 92 |
response = self.session.post(
|
| 93 |
-
f"{self.base_url}/step",
|
| 94 |
-
json={"action": action_dict},
|
| 95 |
-
timeout=30,
|
| 96 |
)
|
| 97 |
response.raise_for_status()
|
| 98 |
data = response.json()
|
| 99 |
obs = _parse_observation(data.get("observation", data))
|
|
|
|
| 100 |
reward = data.get("reward", obs.reward)
|
| 101 |
-
if reward is None
|
| 102 |
-
reward = 0.
|
| 103 |
reward = float(reward)
|
| 104 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 105 |
return StepResult(
|
| 106 |
-
observation=obs,
|
| 107 |
-
|
| 108 |
-
done=data.get("done", obs.done),
|
| 109 |
-
info={},
|
| 110 |
)
|
| 111 |
|
| 112 |
def close(self):
|
|
@@ -119,12 +144,14 @@ class BugTriageClient:
|
|
| 119 |
self.close()
|
| 120 |
|
| 121 |
|
| 122 |
-
|
|
|
|
|
|
|
| 123 |
|
| 124 |
SYSTEM_PROMPT = textwrap.dedent("""
|
| 125 |
-
You are a senior software engineering manager.
|
| 126 |
-
You will receive a bug report
|
| 127 |
-
valid JSON: no markdown, no explanation, no backticks.
|
| 128 |
|
| 129 |
Return exactly this structure:
|
| 130 |
{
|
|
@@ -143,22 +170,44 @@ SYSTEM_PROMPT = textwrap.dedent("""
|
|
| 143 |
|
| 144 |
Teams: backend | frontend | infra | security | devx
|
| 145 |
Milestones: hotfix | v2.1 | backlog
|
|
|
|
|
|
|
|
|
|
| 146 |
""").strip()
|
| 147 |
|
|
|
|
|
|
|
|
|
|
|
|
|
| 148 |
|
|
|
|
| 149 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 150 |
|
| 151 |
def log_start(task: str, env: str, model: str) -> None:
|
| 152 |
print(f"[START] task={task} env={env} model={model}", flush=True)
|
| 153 |
|
| 154 |
|
| 155 |
-
def log_step(
|
| 156 |
-
|
| 157 |
-
action: str,
|
| 158 |
-
reward: float,
|
| 159 |
-
done: bool,
|
| 160 |
-
error: Optional[str] = None,
|
| 161 |
-
) -> None:
|
| 162 |
print(
|
| 163 |
f"[STEP] step={step} action={action} "
|
| 164 |
f"reward={reward:.2f} done={str(done).lower()} error={error or 'null'}",
|
|
@@ -166,7 +215,8 @@ def log_step(
|
|
| 166 |
)
|
| 167 |
|
| 168 |
|
| 169 |
-
def log_end(success: bool, steps: int, score: float,
|
|
|
|
| 170 |
rewards_str = ",".join(f"{r:.2f}" for r in rewards)
|
| 171 |
print(
|
| 172 |
f"[END] success={str(success).lower()} steps={steps} "
|
|
@@ -175,21 +225,97 @@ def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> No
|
|
| 175 |
)
|
| 176 |
|
| 177 |
|
| 178 |
-
|
|
|
|
|
|
|
| 179 |
|
| 180 |
def format_bug(obs: TriageObservation) -> str:
|
|
|
|
| 181 |
bug = obs.bug_report
|
| 182 |
-
|
| 183 |
-
|
| 184 |
-
|
| 185 |
-
|
| 186 |
-
|
| 187 |
-
|
| 188 |
-
|
| 189 |
|
| 190 |
|
| 191 |
def call_model(client: OpenAI, bug_text: str) -> TriageAction:
|
| 192 |
-
|
|
|
|
| 193 |
|
| 194 |
completion = client.chat.completions.create(
|
| 195 |
model=MODEL_NAME,
|
|
@@ -218,6 +344,7 @@ def call_model(client: OpenAI, bug_text: str) -> TriageAction:
|
|
| 218 |
data = {}
|
| 219 |
|
| 220 |
action = TriageAction(
|
|
|
|
| 221 |
priority=data.get("priority", "P2"),
|
| 222 |
labels=data.get("labels", ["bug"]),
|
| 223 |
assigned_team=data.get("assigned_team", "backend"),
|
|
@@ -233,12 +360,13 @@ def call_model(client: OpenAI, bug_text: str) -> TriageAction:
|
|
| 233 |
return action
|
| 234 |
|
| 235 |
|
| 236 |
-
|
|
|
|
|
|
|
| 237 |
|
| 238 |
def main() -> None:
|
| 239 |
client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
|
| 240 |
|
| 241 |
-
|
| 242 |
all_scores = []
|
| 243 |
|
| 244 |
with BugTriageClient(base_url=ENV_BASE_URL) as env:
|
|
@@ -247,32 +375,90 @@ def main() -> None:
|
|
| 247 |
score = 0.0
|
| 248 |
success = False
|
| 249 |
steps_taken = 0
|
|
|
|
| 250 |
log_start(task=task_id, env=BENCHMARK, model=MODEL_NAME)
|
| 251 |
|
| 252 |
try:
|
| 253 |
obs = env.reset(task_id=task_id)
|
| 254 |
-
|
| 255 |
-
|
| 256 |
-
|
| 257 |
-
|
| 258 |
-
|
| 259 |
-
|
| 260 |
-
|
| 261 |
-
|
| 262 |
-
|
| 263 |
-
|
| 264 |
-
|
| 265 |
-
|
| 266 |
-
|
| 267 |
-
|
| 268 |
-
|
| 269 |
-
|
| 270 |
-
|
| 271 |
-
|
| 272 |
-
|
| 273 |
-
|
| 274 |
-
|
| 275 |
-
|
| 276 |
score = min(max(score, 0.01), 0.99)
|
| 277 |
success = score >= SUCCESS_SCORE_THRESHOLD
|
| 278 |
|
|
@@ -282,15 +468,17 @@ def main() -> None:
|
|
| 282 |
score = min(max(score, 0.01), 0.99)
|
| 283 |
success = False
|
| 284 |
|
| 285 |
-
# [END] for this task
|
| 286 |
log_end(success, steps_taken, score, rewards)
|
| 287 |
all_scores.append(score)
|
| 288 |
|
| 289 |
time.sleep(0.5)
|
| 290 |
|
| 291 |
-
|
| 292 |
avg_score = sum(all_scores) / len(all_scores) if all_scores else 0.0
|
| 293 |
-
print(
|
|
|
|
|
|
|
|
|
|
|
|
|
| 294 |
|
| 295 |
|
| 296 |
if __name__ == "__main__":
|
|
|
|
| 20 |
from model import TriageAction, TriageObservation, BugReport
|
| 21 |
|
| 22 |
|
| 23 |
+
# ---------------------------------------------------------------------------
|
| 24 |
+
# CONFIG: uses env vars required by hackathon spec
|
| 25 |
+
# ---------------------------------------------------------------------------
|
| 26 |
+
|
| 27 |
API_BASE_URL = os.getenv("API_BASE_URL") or "https://router.huggingface.co/v1"
|
| 28 |
API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY") or os.getenv("OPENAI_API_KEY")
|
| 29 |
MODEL_NAME = os.getenv("MODEL_NAME") or "meta-llama/Llama-3.3-70B-Instruct"
|
|
|
|
| 35 |
TASK_IDS = ["easy", "medium", "hard"]
|
| 36 |
BENCHMARK = "bug-triage-env"
|
| 37 |
TEMPERATURE = 0.0
|
| 38 |
+
MAX_TOKENS = 500
|
| 39 |
+
MAX_STEPS = 4 # Max steps per task (investigate + submit)
|
| 40 |
+
MAX_TOTAL_REWARD = 1.0
|
| 41 |
SUCCESS_SCORE_THRESHOLD = 0.4
|
| 42 |
|
| 43 |
print(f"[CONFIG] API_BASE_URL={API_BASE_URL}", flush=True)
|
|
|
|
| 45 |
print(f"[CONFIG] ENV_BASE_URL={ENV_BASE_URL}", flush=True)
|
| 46 |
print(f"[CONFIG] API_KEY={'set' if API_KEY else 'MISSING'}", flush=True)
|
| 47 |
|
| 48 |
+
|
| 49 |
+
# ---------------------------------------------------------------------------
|
| 50 |
+
# INLINED CLIENT: self-contained, no external dependency
|
| 51 |
+
# ---------------------------------------------------------------------------
|
| 52 |
|
| 53 |
def _parse_observation(data: dict) -> TriageObservation:
|
| 54 |
try:
|
|
|
|
| 58 |
return TriageObservation(
|
| 59 |
bug_report=bug,
|
| 60 |
task_id=data.get("task_id", "easy"),
|
| 61 |
+
score=data.get("score", 0.0),
|
| 62 |
feedback=data.get("feedback", ""),
|
| 63 |
done=data.get("done", False),
|
| 64 |
+
reward=data.get("reward", 0.0),
|
| 65 |
+
body_visible=data.get("body_visible", False),
|
| 66 |
+
comments_visible=data.get("comments_visible", False),
|
| 67 |
+
logs_visible=data.get("logs_visible", False),
|
| 68 |
+
similar_visible=data.get("similar_visible", False),
|
| 69 |
+
steps_taken=data.get("steps_taken", 0),
|
| 70 |
+
max_steps=data.get("max_steps", 6),
|
| 71 |
)
|
| 72 |
|
| 73 |
|
| 74 |
class StepResult:
|
| 75 |
+
def __init__(self, observation: TriageObservation, reward: float,
|
| 76 |
+
done: bool, info: dict):
|
| 77 |
self.observation = observation
|
| 78 |
self.reward = reward
|
| 79 |
self.done = done
|
|
|
|
| 85 |
self.base_url = (base_url or ENV_BASE_URL).rstrip("/")
|
| 86 |
self.session = requests.Session()
|
| 87 |
self.session.headers.update({"Content-Type": "application/json"})
|
| 88 |
+
self._session_id: Optional[str] = None
|
| 89 |
|
| 90 |
def reset(self, task_id: str = "easy") -> TriageObservation:
|
| 91 |
print(f"[ENV] Resetting env for task={task_id}", flush=True)
|
| 92 |
+
payload = {"task_id": task_id}
|
| 93 |
+
if self._session_id:
|
| 94 |
+
payload["session_id"] = self._session_id
|
| 95 |
+
|
| 96 |
response = self.session.post(
|
| 97 |
+
f"{self.base_url}/reset", json=payload, timeout=30,
|
|
|
|
|
|
|
| 98 |
)
|
| 99 |
response.raise_for_status()
|
| 100 |
data = response.json()
|
| 101 |
+
self._session_id = data.get("session_id")
|
| 102 |
return _parse_observation(data.get("observation", data))
|
| 103 |
|
| 104 |
def step(self, action: TriageAction) -> StepResult:
|
| 105 |
+
print(f"[ENV] Sending step: action_type={action.action_type}", flush=True)
|
| 106 |
try:
|
| 107 |
action_dict = action.model_dump()
|
| 108 |
except AttributeError:
|
| 109 |
action_dict = action.dict()
|
| 110 |
+
|
| 111 |
+
payload = {"action": action_dict}
|
| 112 |
+
if self._session_id:
|
| 113 |
+
payload["session_id"] = self._session_id
|
| 114 |
+
|
| 115 |
response = self.session.post(
|
| 116 |
+
f"{self.base_url}/step", json=payload, timeout=30,
|
|
|
|
|
|
|
| 117 |
)
|
| 118 |
response.raise_for_status()
|
| 119 |
data = response.json()
|
| 120 |
obs = _parse_observation(data.get("observation", data))
|
| 121 |
+
|
| 122 |
reward = data.get("reward", obs.reward)
|
| 123 |
+
if reward is None:
|
| 124 |
+
reward = 0.0
|
| 125 |
reward = float(reward)
|
| 126 |
+
if obs.done:
|
| 127 |
+
reward = max(0.01, min(0.99, reward))
|
| 128 |
+
|
| 129 |
+
if "session_id" in data:
|
| 130 |
+
self._session_id = data["session_id"]
|
| 131 |
+
|
| 132 |
return StepResult(
|
| 133 |
+
observation=obs, reward=reward,
|
| 134 |
+
done=data.get("done", obs.done), info={},
|
|
|
|
|
|
|
| 135 |
)
|
| 136 |
|
| 137 |
def close(self):
|
|
|
|
| 144 |
self.close()
|
| 145 |
|
| 146 |
|
| 147 |
+
# ---------------------------------------------------------------------------
|
| 148 |
+
# LLM PROMPTS
|
| 149 |
+
# ---------------------------------------------------------------------------
|
| 150 |
|
| 151 |
SYSTEM_PROMPT = textwrap.dedent("""
|
| 152 |
+
You are a senior software engineering manager triaging a bug report.
|
| 153 |
+
You will receive a bug report (possibly with partial information).
|
| 154 |
+
Respond ONLY with valid JSON: no markdown, no explanation, no backticks.
|
| 155 |
|
| 156 |
Return exactly this structure:
|
| 157 |
{
|
|
|
|
| 170 |
|
| 171 |
Teams: backend | frontend | infra | security | devx
|
| 172 |
Milestones: hotfix | v2.1 | backlog
|
| 173 |
+
|
| 174 |
+
Important: Pay attention to security signals (SQL injection, XSS, auth bypass,
|
| 175 |
+
data exposure). Security bugs should almost always be P0 + security team + hotfix.
|
| 176 |
""").strip()
|
| 177 |
|
| 178 |
+
INVESTIGATION_PROMPT = textwrap.dedent("""
|
| 179 |
+
You are deciding whether to investigate further or submit your triage.
|
| 180 |
+
You have seen the following information about a bug. Based on what you see,
|
| 181 |
+
decide if you need more information or can triage now.
|
| 182 |
|
| 183 |
+
Respond with ONLY one of these JSON formats:
|
| 184 |
|
| 185 |
+
To investigate: {"action": "read_body"} or {"action": "read_comments"} or {"action": "check_logs"}
|
| 186 |
+
To submit:
|
| 187 |
+
{
|
| 188 |
+
"action": "submit",
|
| 189 |
+
"priority": "P0",
|
| 190 |
+
"labels": ["bug"],
|
| 191 |
+
"assigned_team": "backend",
|
| 192 |
+
"milestone": "hotfix",
|
| 193 |
+
"reasoning": "explanation"
|
| 194 |
+
}
|
| 195 |
+
|
| 196 |
+
Only investigate if the title and preview are genuinely ambiguous.
|
| 197 |
+
If the bug is clearly a typo or clearly critical, submit immediately.
|
| 198 |
+
""").strip()
|
| 199 |
+
|
| 200 |
+
|
| 201 |
+
# ---------------------------------------------------------------------------
|
| 202 |
+
# STRUCTURED LOGGING: strict [START]/[STEP]/[END] format
|
| 203 |
+
# ---------------------------------------------------------------------------
|
| 204 |
|
| 205 |
def log_start(task: str, env: str, model: str) -> None:
|
| 206 |
print(f"[START] task={task} env={env} model={model}", flush=True)
|
| 207 |
|
| 208 |
|
| 209 |
+
def log_step(step: int, action: str, reward: float, done: bool,
|
| 210 |
+
error: Optional[str] = None) -> None:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 211 |
print(
|
| 212 |
f"[STEP] step={step} action={action} "
|
| 213 |
f"reward={reward:.2f} done={str(done).lower()} error={error or 'null'}",
|
|
|
|
| 215 |
)
|
| 216 |
|
| 217 |
|
| 218 |
+
def log_end(success: bool, steps: int, score: float,
|
| 219 |
+
rewards: List[float]) -> None:
|
| 220 |
rewards_str = ",".join(f"{r:.2f}" for r in rewards)
|
| 221 |
print(
|
| 222 |
f"[END] success={str(success).lower()} steps={steps} "
|
|
|
|
| 225 |
)
|
| 226 |
|
| 227 |
|
| 228 |
+
# ---------------------------------------------------------------------------
|
| 229 |
+
# BUG FORMATTING
|
| 230 |
+
# ---------------------------------------------------------------------------
|
| 231 |
|
| 232 |
def format_bug(obs: TriageObservation) -> str:
|
| 233 |
+
"""Format a bug observation into text the LLM can read."""
|
| 234 |
bug = obs.bug_report
|
| 235 |
+
parts = [f"Title: {bug.title}"]
|
| 236 |
+
|
| 237 |
+
parts.append(f"\nDescription:\n{bug.body}")
|
| 238 |
+
|
| 239 |
+
if obs.comments_visible and bug.comments:
|
| 240 |
+
comments = "\n".join(f" - {c}" for c in bug.comments)
|
| 241 |
+
parts.append(f"\nComments:\n{comments}")
|
| 242 |
+
|
| 243 |
+
if bug.labels_hint:
|
| 244 |
+
parts.append(f"\nExisting labels: {', '.join(bug.labels_hint)}")
|
| 245 |
+
|
| 246 |
+
if obs.logs_visible:
|
| 247 |
+
if bug.stack_trace:
|
| 248 |
+
parts.append(f"\nStack trace: {bug.stack_trace}")
|
| 249 |
+
if bug.affected_component:
|
| 250 |
+
parts.append(f"\nAffected component: {bug.affected_component}")
|
| 251 |
+
if bug.severity_signals:
|
| 252 |
+
parts.append(f"\nSeverity signals: {', '.join(bug.severity_signals)}")
|
| 253 |
+
|
| 254 |
+
if obs.similar_visible and bug.related_bugs:
|
| 255 |
+
parts.append(f"\nRelated bugs: {', '.join(bug.related_bugs)}")
|
| 256 |
+
|
| 257 |
+
# Add visibility context
|
| 258 |
+
visibility = []
|
| 259 |
+
if not obs.body_visible:
|
| 260 |
+
visibility.append("body (truncated)")
|
| 261 |
+
if not obs.comments_visible:
|
| 262 |
+
visibility.append("comments (hidden)")
|
| 263 |
+
if not obs.logs_visible:
|
| 264 |
+
visibility.append("logs (hidden)")
|
| 265 |
+
if visibility:
|
| 266 |
+
parts.append(f"\n[Hidden info: {', '.join(visibility)}]")
|
| 267 |
+
|
| 268 |
+
parts.append(f"\nSteps used: {obs.steps_taken}/{obs.max_steps}")
|
| 269 |
+
|
| 270 |
+
return "\n".join(parts)
|
| 271 |
+
|
| 272 |
+
|
| 273 |
+
def format_bug_for_decision(obs: TriageObservation) -> str:
|
| 274 |
+
"""Shorter format for the investigation decision."""
|
| 275 |
+
bug = obs.bug_report
|
| 276 |
+
text = f"Title: {bug.title}\nPreview: {bug.body[:150]}"
|
| 277 |
+
if obs.body_visible:
|
| 278 |
+
text += "\n\nFull body visible."
|
| 279 |
+
if obs.comments_visible and bug.comments:
|
| 280 |
+
text += f"\nComments: {len(bug.comments)} visible."
|
| 281 |
+
text += f"\nSteps remaining: {obs.max_steps - obs.steps_taken}"
|
| 282 |
+
return text
|
| 283 |
+
|
| 284 |
+
|
| 285 |
+
# ---------------------------------------------------------------------------
|
| 286 |
+
# MODEL CALLS
|
| 287 |
+
# ---------------------------------------------------------------------------
|
| 288 |
+
|
| 289 |
+
def decide_action(client: OpenAI, obs: TriageObservation) -> dict:
|
| 290 |
+
"""Ask the LLM whether to investigate or submit."""
|
| 291 |
+
bug_text = format_bug_for_decision(obs)
|
| 292 |
+
|
| 293 |
+
try:
|
| 294 |
+
completion = client.chat.completions.create(
|
| 295 |
+
model=MODEL_NAME,
|
| 296 |
+
messages=[
|
| 297 |
+
{"role": "system", "content": INVESTIGATION_PROMPT},
|
| 298 |
+
{"role": "user", "content": bug_text},
|
| 299 |
+
],
|
| 300 |
+
temperature=TEMPERATURE,
|
| 301 |
+
max_tokens=200,
|
| 302 |
+
stream=False,
|
| 303 |
+
)
|
| 304 |
+
raw = (completion.choices[0].message.content or "").strip()
|
| 305 |
+
if raw.startswith("```"):
|
| 306 |
+
parts = raw.split("```")
|
| 307 |
+
raw = parts[1] if len(parts) > 1 else raw
|
| 308 |
+
if raw.startswith("json"):
|
| 309 |
+
raw = raw[4:].strip()
|
| 310 |
+
return json.loads(raw)
|
| 311 |
+
except Exception as e:
|
| 312 |
+
print(f"[DEBUG] Decision model call failed: {e}", flush=True)
|
| 313 |
+
return {"action": "submit"}
|
| 314 |
|
| 315 |
|
| 316 |
def call_model(client: OpenAI, bug_text: str) -> TriageAction:
|
| 317 |
+
"""Ask the LLM to triage the bug report."""
|
| 318 |
+
print("[LLM] Sending triage request to model...", flush=True)
|
| 319 |
|
| 320 |
completion = client.chat.completions.create(
|
| 321 |
model=MODEL_NAME,
|
|
|
|
| 344 |
data = {}
|
| 345 |
|
| 346 |
action = TriageAction(
|
| 347 |
+
action_type="submit",
|
| 348 |
priority=data.get("priority", "P2"),
|
| 349 |
labels=data.get("labels", ["bug"]),
|
| 350 |
assigned_team=data.get("assigned_team", "backend"),
|
|
|
|
| 360 |
return action
|
| 361 |
|
| 362 |
|
| 363 |
+
# ---------------------------------------------------------------------------
|
| 364 |
+
# MAIN: multi-step agent with per-task [START]/[STEP]/[END] logging
|
| 365 |
+
# ---------------------------------------------------------------------------
|
| 366 |
|
| 367 |
def main() -> None:
|
| 368 |
client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
|
| 369 |
|
|
|
|
| 370 |
all_scores = []
|
| 371 |
|
| 372 |
with BugTriageClient(base_url=ENV_BASE_URL) as env:
|
|
|
|
| 375 |
score = 0.0
|
| 376 |
success = False
|
| 377 |
steps_taken = 0
|
| 378 |
+
|
| 379 |
log_start(task=task_id, env=BENCHMARK, model=MODEL_NAME)
|
| 380 |
|
| 381 |
try:
|
| 382 |
obs = env.reset(task_id=task_id)
|
| 383 |
+
|
| 384 |
+
for step_num in range(1, MAX_STEPS + 1):
|
| 385 |
+
if obs.done:
|
| 386 |
+
break
|
| 387 |
+
|
| 388 |
+
# Decide: investigate or submit?
|
| 389 |
+
# For efficiency, check if we have enough info
|
| 390 |
+
# Step 1 reads the full body, step 2 reads comments, later steps submit
|
| 391 |
+
if step_num == 1 and not obs.body_visible:
|
| 392 |
+
# First step: read the full body
|
| 393 |
+
action = TriageAction(action_type="read_body")
|
| 394 |
+
result = env.step(action)
|
| 395 |
+
obs = result.observation
|
| 396 |
+
steps_taken = step_num
|
| 397 |
+
|
| 398 |
+
log_step(
|
| 399 |
+
step=step_num,
|
| 400 |
+
action="investigate:read_body",
|
| 401 |
+
reward=0.0,
|
| 402 |
+
done=result.done,
|
| 403 |
+
)
|
| 404 |
+
|
| 405 |
+
if result.done:
|
| 406 |
+
rewards.append(result.reward)
|
| 407 |
+
break
|
| 408 |
+
continue
|
| 409 |
+
|
| 410 |
+
elif step_num == 2 and not obs.comments_visible:
|
| 411 |
+
# Second step: read comments for extra context
|
| 412 |
+
action = TriageAction(action_type="read_comments")
|
| 413 |
+
result = env.step(action)
|
| 414 |
+
obs = result.observation
|
| 415 |
+
steps_taken = step_num
|
| 416 |
+
|
| 417 |
+
log_step(
|
| 418 |
+
step=step_num,
|
| 419 |
+
action="investigate:read_comments",
|
| 420 |
+
reward=0.0,
|
| 421 |
+
done=result.done,
|
| 422 |
+
)
|
| 423 |
+
|
| 424 |
+
if result.done:
|
| 425 |
+
rewards.append(result.reward)
|
| 426 |
+
break
|
| 427 |
+
continue
|
| 428 |
+
|
| 429 |
+
# Now submit the triage decision
|
| 430 |
+
bug_text = format_bug(obs)
|
| 431 |
+
action = call_model(client, bug_text)
|
| 432 |
+
result = env.step(action)
|
| 433 |
+
obs = result.observation
|
| 434 |
+
steps_taken = step_num
|
| 435 |
+
|
| 436 |
+
reward = float(result.reward or 0.0)
|
| 437 |
+
if result.done:
|
| 438 |
+
reward = max(0.01, min(0.99, reward))
|
| 439 |
+
rewards.append(reward)
|
| 440 |
+
|
| 441 |
+
action_str = (
|
| 442 |
+
f"priority={action.priority},"
|
| 443 |
+
f"team={action.assigned_team},"
|
| 444 |
+
f"milestone={action.milestone}"
|
| 445 |
+
)
|
| 446 |
+
|
| 447 |
+
log_step(
|
| 448 |
+
step=step_num,
|
| 449 |
+
action=action_str,
|
| 450 |
+
reward=reward,
|
| 451 |
+
done=result.done,
|
| 452 |
+
)
|
| 453 |
+
|
| 454 |
+
if result.done:
|
| 455 |
+
break
|
| 456 |
+
|
| 457 |
+
# Calculate score
|
| 458 |
+
if rewards:
|
| 459 |
+
score = sum(rewards) / MAX_TOTAL_REWARD
|
| 460 |
+
else:
|
| 461 |
+
score = 0.0
|
| 462 |
score = min(max(score, 0.01), 0.99)
|
| 463 |
success = score >= SUCCESS_SCORE_THRESHOLD
|
| 464 |
|
|
|
|
| 468 |
score = min(max(score, 0.01), 0.99)
|
| 469 |
success = False
|
| 470 |
|
|
|
|
| 471 |
log_end(success, steps_taken, score, rewards)
|
| 472 |
all_scores.append(score)
|
| 473 |
|
| 474 |
time.sleep(0.5)
|
| 475 |
|
|
|
|
| 476 |
avg_score = sum(all_scores) / len(all_scores) if all_scores else 0.0
|
| 477 |
+
print(
|
| 478 |
+
f"[SUMMARY] tasks={len(all_scores)} avg_score={avg_score:.2f} "
|
| 479 |
+
f"scores={all_scores}",
|
| 480 |
+
flush=True,
|
| 481 |
+
)
|
| 482 |
|
| 483 |
|
| 484 |
if __name__ == "__main__":
|
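Two small pieces of logic in `inference.py` above are easy to get wrong: stripping an optional code fence from the model's reply before `json.loads`, and clamping a terminal reward into the 0.01 to 0.99 range. The following is a minimal standalone sketch of both, assuming the same fallback (`{"action": "submit"}`) that `decide_action` uses. The function names `parse_llm_json` and `clamp_terminal_reward` are hypothetical and are not part of the repository.

```python
import json
from typing import Any, Dict

FENCE = "`" * 3  # triple backtick, built this way to keep the example readable


def parse_llm_json(raw: str) -> Dict[str, Any]:
    """Strip an optional code fence, then parse the remainder as a JSON object."""
    text = raw.strip()
    if text.startswith(FENCE):
        parts = text.split(FENCE)
        text = parts[1] if len(parts) > 1 else text
        if text.startswith("json"):
            text = text[4:]  # drop the language tag after the opening fence
    try:
        data = json.loads(text.strip())
    except json.JSONDecodeError:
        return {"action": "submit"}  # same fallback as decide_action
    return data if isinstance(data, dict) else {"action": "submit"}


def clamp_terminal_reward(reward: float, low: float = 0.01, high: float = 0.99) -> float:
    """Keep a final reward strictly inside (0, 1), as the diff does when done is true."""
    return max(low, min(high, reward))


fenced_reply = FENCE + 'json\n{"action": "read_body"}\n' + FENCE
parsed = parse_llm_json(fenced_reply)
clamped = clamp_terminal_reward(1.0)
```

Unlike the code in the diff, this sketch catches only `json.JSONDecodeError` and also rejects non-object JSON. That is a design choice made here for illustration, not something the source prescribes.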
model.py
CHANGED
|
@@ -1,13 +1,10 @@
|
|
| 1 |
# model.py
|
| 2 |
-
from typing import List
|
| 3 |
from pydantic import BaseModel, Field
|
| 4 |
from openenv.core.env_server import Action, Observation
|
| 5 |
from openenv.core.env_server.types import State
|
| 6 |
|
| 7 |
|
| 8 |
-
|
| 9 |
-
|
| 10 |
-
|
| 11 |
class BugReport(BaseModel):
|
| 12 |
"""A single GitHub-style bug report."""
|
| 13 |
id: str
|
|
@@ -16,16 +13,21 @@ class BugReport(BaseModel):
|
|
| 16 |
author: str
|
| 17 |
labels_hint: List[str] = Field(default_factory=list)
|
| 18 |
comments: List[str] = Field(default_factory=list)
|
|
|
|
|
|
|
|
|
|
|
|
|
| 19 |
|
| 20 |
class Config:
|
| 21 |
arbitrary_types_allowed = True
|
| 22 |
|
| 23 |
|
| 24 |
-
|
| 25 |
-
|
| 26 |
class TriageAction(Action):
|
| 27 |
-
"""What the agent submits
|
| 28 |
-
|
|
|
|
|
|
|
|
|
|
| 29 |
labels: List[str] = Field(default_factory=list)
|
| 30 |
assigned_team: str = "backend"
|
| 31 |
milestone: str = "backlog"
|
|
@@ -36,7 +38,7 @@ class TriageAction(Action):
|
|
| 36 |
|
| 37 |
|
| 38 |
class TriageObservation(Observation):
|
| 39 |
-
"""What the agent sees after each step."""
|
| 40 |
bug_report: BugReport
|
| 41 |
task_id: str = "easy"
|
| 42 |
score: float = 0.0
|
|
@@ -44,6 +46,14 @@ class TriageObservation(Observation):
|
|
| 44 |
done: bool = False
|
| 45 |
reward: float = 0.0
|
| 46 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 47 |
class Config:
|
| 48 |
arbitrary_types_allowed = True
|
| 49 |
|
|
@@ -51,10 +61,12 @@ class TriageObservation(Observation):
|
|
| 51 |
class TriageState(State):
|
| 52 |
"""Internal episode state."""
|
| 53 |
episode_id: str = ""
|
|
|
|
| 54 |
current_task: str = "easy"
|
| 55 |
step_count: int = 0
|
| 56 |
total_score: float = 0.0
|
| 57 |
tasks_completed: List[str] = Field(default_factory=list)
|
|
|
|
| 58 |
|
| 59 |
class Config:
|
| 60 |
arbitrary_types_allowed = True
|
|
|
|
| 1 |
# model.py
|
| 2 |
+
from typing import List, Optional, Dict, Any
|
| 3 |
from pydantic import BaseModel, Field
|
| 4 |
from openenv.core.env_server import Action, Observation
|
| 5 |
from openenv.core.env_server.types import State
|
| 6 |
|
| 7 |
|
|
|
|
|
|
|
|
|
|
| 8 |
class BugReport(BaseModel):
|
| 9 |
"""A single GitHub-style bug report."""
|
| 10 |
id: str
|
|
|
|
| 13 |
author: str
|
| 14 |
labels_hint: List[str] = Field(default_factory=list)
|
| 15 |
comments: List[str] = Field(default_factory=list)
|
| 16 |
+
severity_signals: List[str] = Field(default_factory=list)
|
| 17 |
+
related_bugs: List[str] = Field(default_factory=list)
|
| 18 |
+
stack_trace: str = ""
|
| 19 |
+
affected_component: str = ""
|
| 20 |
|
| 21 |
class Config:
|
| 22 |
arbitrary_types_allowed = True
|
| 23 |
|
| 24 |
|
|
|
|
|
|
|
| 25 |
class TriageAction(Action):
|
| 26 |
+
"""What the agent submits: either an investigation or a final triage decision."""
|
| 27 |
+
action_type: str = "submit" # "read_body" | "read_comments" | "check_logs" | "check_similar" | "submit"
|
| 28 |
+
|
| 29 |
+
# Only used when action_type == "submit"
|
| 30 |
+
priority: str = "P2"
|
| 31 |
labels: List[str] = Field(default_factory=list)
|
| 32 |
assigned_team: str = "backend"
|
| 33 |
milestone: str = "backlog"
|
|
|
|
| 38 |
|
| 39 |
|
| 40 |
class TriageObservation(Observation):
|
| 41 |
+
"""What the agent sees after each step; progressively reveals info."""
|
| 42 |
bug_report: BugReport
|
| 43 |
task_id: str = "easy"
|
| 44 |
score: float = 0.0
|
|
|
|
| 46 |
done: bool = False
|
| 47 |
reward: float = 0.0
|
| 48 |
|
| 49 |
+
# Progressive visibility fields
|
| 50 |
+
body_visible: bool = False
|
| 51 |
+
comments_visible: bool = False
|
| 52 |
+
logs_visible: bool = False
|
| 53 |
+
similar_visible: bool = False
|
| 54 |
+
steps_taken: int = 0
|
| 55 |
+
max_steps: int = 6
|
| 56 |
+
|
| 57 |
class Config:
|
| 58 |
arbitrary_types_allowed = True
|
| 59 |
|
|
|
|
| 61 |
class TriageState(State):
|
| 62 |
"""Internal episode state."""
|
| 63 |
episode_id: str = ""
|
| 64 |
+
session_id: str = ""
|
| 65 |
current_task: str = "easy"
|
| 66 |
step_count: int = 0
|
| 67 |
total_score: float = 0.0
|
| 68 |
tasks_completed: List[str] = Field(default_factory=list)
|
| 69 |
+
actions_taken: List[str] = Field(default_factory=list)
|
| 70 |
|
| 71 |
class Config:
|
| 72 |
arbitrary_types_allowed = True
|
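The new `action_type` field in `model.py` is a plain `str`, so the model itself does not reject a misspelled action name. The following is a minimal client-side sketch, assuming the five names listed in the `action_type` comment are the complete set. `build_action` and `ALLOWED_ACTIONS` are hypothetical names used for illustration and are not part of the repository.

```python
from typing import Any, Dict, List, Optional

# Action names copied from the comment on TriageAction.action_type.
INVESTIGATION_ACTIONS = {"read_body", "read_comments", "check_logs", "check_similar"}
ALLOWED_ACTIONS = INVESTIGATION_ACTIONS | {"submit"}


def build_action(action_type: str = "submit",
                 priority: str = "P2",
                 labels: Optional[List[str]] = None,
                 assigned_team: str = "backend",
                 milestone: str = "backlog") -> Dict[str, Any]:
    """Return a JSON-ready action dict (hypothetical helper, not repo code)."""
    if action_type not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action_type: {action_type!r}")
    if action_type != "submit":
        # The triage fields are only used on submit, so send the type alone.
        return {"action_type": action_type}
    return {
        "action_type": "submit",
        "priority": priority,
        "labels": list(labels or []),
        "assigned_team": assigned_team,
        "milestone": milestone,
    }
```

Sending only `action_type` for investigation steps relies on the defaults declared on `TriageAction`; whether the server ignores the remaining fields for those steps is an assumption based on the "Only used when action_type == "submit"" comment.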
openenv.yaml
CHANGED
|
@@ -1,32 +1,43 @@
|
|
| 1 |
spec_version: 1
|
| 2 |
name: bug-triage-env
|
| 3 |
-
version: "
|
| 4 |
description: >
|
| 5 |
-
A reinforcement learning environment where an
|
| 6 |
-
GitHub-style bug reports by assigning
|
| 7 |
-
and milestone.
|
|
|
|
|
|
|
|
|
|
| 8 |
endpoint: https://siteshcodes-bug-triage-env.hf.space
|
| 9 |
tags:
|
| 10 |
- openenv
|
| 11 |
- bug-triage
|
| 12 |
- real-world
|
| 13 |
- nlp
|
|
|
|
| 14 |
tasks:
|
| 15 |
- id: easy
|
| 16 |
name: Priority Assignment
|
| 17 |
-
description:
|
|
|
|
|
|
|
| 18 |
difficulty: easy
|
| 19 |
grader: server.task:priority_match
|
| 20 |
reward_range: [0.0, 1.0]
|
| 21 |
- id: medium
|
| 22 |
name: Priority Labels and Team
|
| 23 |
-
description:
|
|
|
|
|
|
|
| 24 |
difficulty: medium
|
| 25 |
grader: server.task:priority_label_team
|
| 26 |
reward_range: [0.0, 1.0]
|
| 27 |
- id: hard
|
| 28 |
name: Full Triage
|
| 29 |
-
description:
|
|
|
|
|
|
|
|
|
|
| 30 |
difficulty: hard
|
| 31 |
grader: server.task:full_triage
|
| 32 |
reward_range: [0.0, 1.0]
|
|
@@ -35,6 +46,7 @@ endpoints:
|
|
| 35 |
step: /step
|
| 36 |
state: /state
|
| 37 |
actions:
|
|
|
|
| 38 |
priority: string
|
| 39 |
labels: list
|
| 40 |
assigned_team: string
|
|
@@ -46,4 +58,10 @@ observations:
|
|
| 46 |
score: float
|
| 47 |
reward: float
|
| 48 |
feedback: string
|
| 49 |
-
done: bool
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
spec_version: 1
|
| 2 |
name: bug-triage-env
|
| 3 |
+
version: "2.0.0"
|
| 4 |
description: >
|
| 5 |
+
A multi-step reinforcement learning environment where an AI agent
|
| 6 |
+
investigates and triages GitHub-style bug reports by assigning
|
| 7 |
+
priority, labels, team, and milestone. Features progressive
|
| 8 |
+
information reveal, procedural bug generation (200+ unique bugs),
|
| 9 |
+
semantic label matching, and a security escalation penalty.
|
| 10 |
+
3 tasks of increasing difficulty (easy → medium → hard).
|
| 11 |
endpoint: https://siteshcodes-bug-triage-env.hf.space
|
| 12 |
tags:
|
| 13 |
- openenv
|
| 14 |
- bug-triage
|
| 15 |
- real-world
|
| 16 |
- nlp
|
| 17 |
+
- multi-step
|
| 18 |
tasks:
|
| 19 |
- id: easy
|
| 20 |
name: Priority Assignment
|
| 21 |
+
description: >
|
| 22 |
+
Investigate a bug report and assign correct P0-P3 priority.
|
| 23 |
+
Use investigation actions to gather info before submitting.
|
| 24 |
difficulty: easy
|
| 25 |
grader: server.task:priority_match
|
| 26 |
reward_range: [0.0, 1.0]
|
| 27 |
- id: medium
|
| 28 |
name: Priority Labels and Team
|
| 29 |
+
description: >
|
| 30 |
+
Investigate and assign correct priority, labels, and team
|
| 31 |
+
routing. More investigation steps available.
|
| 32 |
difficulty: medium
|
| 33 |
grader: server.task:priority_label_team
|
| 34 |
reward_range: [0.0, 1.0]
|
| 35 |
- id: hard
|
| 36 |
name: Full Triage
|
| 37 |
+
description: >
|
| 38 |
+
Full triage with priority, labels, team, milestone and
|
| 39 |
+
security escalation penalty. Investigation is critical:
|
| 40 |
+
missing security signals is penalized.
|
| 41 |
difficulty: hard
|
| 42 |
grader: server.task:full_triage
|
| 43 |
reward_range: [0.0, 1.0]
|
|
|
|
| 46 |
step: /step
|
| 47 |
state: /state
|
| 48 |
actions:
|
| 49 |
+
action_type: string
|
| 50 |
priority: string
|
| 51 |
labels: list
|
| 52 |
assigned_team: string
|
|
|
|
| 58 |
score: float
|
| 59 |
reward: float
|
| 60 |
feedback: string
|
| 61 |
+
done: bool
|
| 62 |
+
body_visible: bool
|
| 63 |
+
comments_visible: bool
|
| 64 |
+
logs_visible: bool
|
| 65 |
+
similar_visible: bool
|
| 66 |
+
steps_taken: int
|
| 67 |
+
max_steps: int
|
pyproject.toml
CHANGED
|
@@ -4,8 +4,8 @@ build-backend = "setuptools.backends.legacy:build"
|
|
| 4 |
|
| 5 |
[project]
|
| 6 |
name = "bug-triage-env"
|
| 7 |
-
version = "
|
| 8 |
-
description = "OpenEnv RL environment for bug report triage"
|
| 9 |
requires-python = ">=3.11"
|
| 10 |
dependencies = [
|
| 11 |
"openenv-core>=0.2.0",
|
|
@@ -13,6 +13,15 @@ dependencies = [
|
|
| 13 |
"uvicorn[standard]",
|
| 14 |
"pydantic",
|
| 15 |
"websockets",
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 16 |
"groq",
|
| 17 |
]
|
| 18 |
|
|
|
|
| 4 |
|
| 5 |
[project]
|
| 6 |
name = "bug-triage-env"
|
| 7 |
+
version = "2.0.0"
|
| 8 |
+
description = "Multi-step OpenEnv RL environment for bug report triage"
|
| 9 |
requires-python = ">=3.11"
|
| 10 |
dependencies = [
|
| 11 |
"openenv-core>=0.2.0",
|
|
|
|
| 13 |
"uvicorn[standard]",
|
| 14 |
"pydantic",
|
| 15 |
"websockets",
|
| 16 |
+
"requests",
|
| 17 |
+
"openai",
|
| 18 |
+
]
|
| 19 |
+
|
| 20 |
+
[project.optional-dependencies]
|
| 21 |
+
dev = [
|
| 22 |
+
"pytest>=7.0",
|
| 23 |
+
"pytest-cov",
|
| 24 |
+
"httpx",
|
| 25 |
"groq",
|
| 26 |
]
|
| 27 |
|
server/__pycache__/__init__.cpython-314.pyc
DELETED
|
Binary file (434 Bytes)
|
|
|
server/__pycache__/task.cpython-314.pyc
DELETED
|
Binary file (14.5 kB)
|
|
|
server/app.py
CHANGED
|
@@ -1,18 +1,16 @@
|
|
| 1 |
# server/app.py
|
| 2 |
import sys
|
| 3 |
import os
|
| 4 |
-
import json
|
| 5 |
sys.path.insert(0, "/app")
|
| 6 |
sys.path.insert(0, "/app/server")
|
| 7 |
|
| 8 |
from openenv.core.env_server import create_app
|
| 9 |
from model import TriageAction, TriageObservation
|
| 10 |
-
from environment import BugTriageEnvironment
|
| 11 |
from task import sample_bug, grade_action, TASKS
|
| 12 |
-
from fastapi import Response, Request
|
| 13 |
-
from fastapi.responses import FileResponse
|
| 14 |
from fastapi.staticfiles import StaticFiles
|
| 15 |
-
from pydantic import BaseModel
|
| 16 |
from typing import Optional, Dict, Any
|
| 17 |
|
| 18 |
app = create_app(
|
|
@@ -22,39 +20,15 @@ app = create_app(
|
|
| 22 |
env_name="bug-triage-env",
|
| 23 |
)
|
| 24 |
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
"id": "easy",
|
| 28 |
-
"name": "Priority Assignment",
|
| 29 |
-
"description": "Assign correct P0-P3 priority to a bug report",
|
| 30 |
-
"difficulty": "easy",
|
| 31 |
-
"grader": "server.task:priority_match",
|
| 32 |
-
"reward_range": [0.0, 1.0]
|
| 33 |
-
},
|
| 34 |
-
{
|
| 35 |
-
"id": "medium",
|
| 36 |
-
"name": "Priority Labels and Team",
|
| 37 |
-
"description": "Assign correct priority, labels, and team routing",
|
| 38 |
-
"difficulty": "medium",
|
| 39 |
-
"grader": "server.task:priority_label_team",
|
| 40 |
-
"reward_range": [0.0, 1.0]
|
| 41 |
-
},
|
| 42 |
-
{
|
| 43 |
-
"id": "hard",
|
| 44 |
-
"name": "Full Triage",
|
| 45 |
-
"description": "Full triage with priority, labels, team, milestone and security penalty",
|
| 46 |
-
"difficulty": "hard",
|
| 47 |
-
"grader": "server.task:full_triage",
|
| 48 |
-
"reward_range": [0.0, 1.0]
|
| 49 |
-
}
|
| 50 |
-
]
|
| 51 |
-
|
| 52 |
-
|
| 53 |
-
|
| 54 |
-
_global_env = BugTriageEnvironment()
|
| 55 |
|
|
|
|
|
|
|
|
|
|
| 56 |
|
| 57 |
|
|
|
|
| 58 |
routes_to_remove = []
|
| 59 |
for route in app.routes:
|
| 60 |
if hasattr(route, "path") and route.path in ("/reset", "/step", "/state"):
|
|
@@ -63,44 +37,60 @@ for route in routes_to_remove:
|
|
| 63 |
app.routes.remove(route)
|
| 64 |
|
| 65 |
|
|
|
|
|
|
|
|
|
|
|
|
|
| 66 |
@app.get("/health")
|
| 67 |
def health():
|
| 68 |
-
return {
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 69 |
|
| 70 |
@app.get("/")
|
| 71 |
def root():
|
| 72 |
-
"""Serve the interactive demo frontend
|
| 73 |
static_dir = os.path.join(os.path.dirname(__file__), "static")
|
| 74 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 75 |
|
| 76 |
@app.get("/web")
|
| 77 |
def web_ui():
|
| 78 |
"""Alias for the frontend."""
|
| 79 |
-
|
| 80 |
-
|
| 81 |
|
| 82 |
@app.get("/tasks")
|
| 83 |
def list_tasks():
|
| 84 |
return TASKS_META
|
| 85 |
|
| 86 |
-
@app.get("/tasks/easy")
|
| 87 |
-
def task_easy():
|
| 88 |
-
return TASKS_META[0]
|
| 89 |
|
| 90 |
-
@app.get("/tasks/
|
| 91 |
-
def
|
| 92 |
-
|
| 93 |
-
|
| 94 |
-
|
| 95 |
-
|
| 96 |
-
|
|
|
|
|
|
|
| 97 |
|
| 98 |
|
|
|
|
|
|
|
|
|
|
| 99 |
|
| 100 |
@app.post("/reset")
|
| 101 |
async def custom_reset(request: Request):
|
| 102 |
-
"""
|
| 103 |
-
global
|
| 104 |
|
| 105 |
body = {}
|
| 106 |
try:
|
|
@@ -111,9 +101,20 @@ async def custom_reset(request: Request):
|
|
| 111 |
task_id = body.get("task_id", "easy")
|
| 112 |
seed = body.get("seed", None)
|
| 113 |
episode_id = body.get("episode_id", None)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 114 |
|
| 115 |
-
|
| 116 |
-
|
| 117 |
|
| 118 |
try:
|
| 119 |
obs_dict = obs.model_dump(exclude={"reward", "done", "metadata"})
|
|
@@ -124,21 +125,31 @@ async def custom_reset(request: Request):
|
|
| 124 |
obs_dict.pop("metadata", None)
|
| 125 |
|
| 126 |
return {
|
|
|
|
| 127 |
"observation": obs_dict,
|
| 128 |
-
"reward":
|
| 129 |
-
"done":
|
| 130 |
}
|
| 131 |
|
| 132 |
|
| 133 |
@app.post("/step")
|
| 134 |
async def custom_step(request: Request):
|
| 135 |
-
"""
|
| 136 |
-
global
|
| 137 |
|
| 138 |
body = await request.json()
|
| 139 |
action_data = body.get("action", body)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 140 |
|
| 141 |
action = TriageAction(
|
|
|
|
| 142 |
priority=action_data.get("priority", "P2"),
|
| 143 |
labels=action_data.get("labels", ["bug"]),
|
| 144 |
assigned_team=action_data.get("assigned_team", "backend"),
|
|
@@ -146,7 +157,7 @@ async def custom_step(request: Request):
|
|
| 146 |
reasoning=action_data.get("reasoning", ""),
|
| 147 |
)
|
| 148 |
|
| 149 |
-
obs =
|
| 150 |
|
| 151 |
try:
|
| 152 |
obs_dict = obs.model_dump(exclude={"reward", "done", "metadata"})
|
|
@@ -156,53 +167,122 @@ async def custom_step(request: Request):
|
|
| 156 |
obs_dict.pop("done", None)
|
| 157 |
obs_dict.pop("metadata", None)
|
| 158 |
|
| 159 |
-
reward = float(obs.reward) if obs.reward is not None else 0.
|
| 160 |
-
|
| 161 |
-
reward = max(0.01, min(0.99, reward))
|
| 162 |
|
| 163 |
-
|
| 164 |
"observation": obs_dict,
|
| 165 |
"reward": reward,
|
| 166 |
"done": obs.done,
|
| 167 |
}
|
| 168 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 169 |
|
| 170 |
@app.get("/state")
|
| 171 |
-
def custom_state():
|
| 172 |
"""Return current environment state."""
|
| 173 |
-
|
| 174 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 175 |
try:
|
| 176 |
return state.model_dump()
|
| 177 |
except AttributeError:
|
| 178 |
return state.dict()
|
| 179 |
|
| 180 |
|
|
|
|
|
|
|
|
|
|
|
|
|
| 181 |
@app.post("/tasks/easy/reset")
|
| 182 |
-
def reset_easy():
|
| 183 |
-
|
| 184 |
-
|
| 185 |
-
|
| 186 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 187 |
|
| 188 |
@app.post("/tasks/medium/reset")
|
| 189 |
-
def reset_medium():
|
| 190 |
-
|
| 191 |
-
|
| 192 |
-
|
| 193 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 194 |
|
| 195 |
@app.post("/tasks/hard/reset")
|
| 196 |
-
def reset_hard():
|
| 197 |
-
|
| 198 |
-
|
| 199 |
-
|
| 200 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 201 |
|
|
| 202 |
|
| 203 |
def main():
|
| 204 |
import uvicorn
|
| 205 |
uvicorn.run(app, host="0.0.0.0", port=7860)
|
| 206 |
|
|
|
|
| 207 |
if __name__ == "__main__":
|
| 208 |
main()
|
|
|
|
| 1 |
# server/app.py
|
| 2 |
import sys
|
| 3 |
import os
|
|
|
|
| 4 |
sys.path.insert(0, "/app")
|
| 5 |
sys.path.insert(0, "/app/server")
|
| 6 |
|
| 7 |
from openenv.core.env_server import create_app
|
| 8 |
from model import TriageAction, TriageObservation
|
| 9 |
+
from environment import BugTriageEnvironment, SessionManager, TASKS_META
|
| 10 |
from task import sample_bug, grade_action, TASKS
|
| 11 |
+
from fastapi import Response, Request, HTTPException
|
| 12 |
+
from fastapi.responses import FileResponse, JSONResponse
|
| 13 |
from fastapi.staticfiles import StaticFiles
|
|
|
|
| 14 |
from typing import Optional, Dict, Any
|
| 15 |
|
| 16 |
app = create_app(
|
|
|
|
| 20 |
env_name="bug-triage-env",
|
| 21 |
)
|
| 22 |
|
| 23 |
+
# Session manager replaces the broken global state
|
| 24 |
+
sessions = SessionManager(max_sessions=500, ttl_seconds=600)
|
|
| 25 |
|
| 26 |
+
# Fallback env for backward-compatible (non-session) requests
|
| 27 |
+
_fallback_env = BugTriageEnvironment()
|
| 28 |
+
_fallback_answer = None
|
| 29 |
|
| 30 |
|
| 31 |
+
# Remove default routes from create_app; we override them
|
| 32 |
routes_to_remove = []
|
| 33 |
for route in app.routes:
|
| 34 |
if hasattr(route, "path") and route.path in ("/reset", "/step", "/state"):
|
|
|
|
| 37 |
app.routes.remove(route)
|
| 38 |
|
| 39 |
|
| 40 |
+
# ---------------------------------------------------------------------------
|
| 41 |
+
# CORE ENDPOINTS
|
| 42 |
+
# ---------------------------------------------------------------------------
|
| 43 |
+
|
| 44 |
@app.get("/health")
|
| 45 |
def health():
|
| 46 |
+
return {
|
| 47 |
+
"status": "ok",
|
| 48 |
+
"env": "bug-triage-env",
|
| 49 |
+
"version": "2.0.0",
|
| 50 |
+
"active_sessions": sessions.active_count,
|
| 51 |
+
}
|
| 52 |
+
|
| 53 |
|
| 54 |
@app.get("/")
|
| 55 |
def root():
|
| 56 |
+
"""Serve the interactive demo frontend."""
|
| 57 |
static_dir = os.path.join(os.path.dirname(__file__), "static")
|
| 58 |
+
index_path = os.path.join(static_dir, "index.html")
|
| 59 |
+
if os.path.exists(index_path):
|
| 60 |
+
return FileResponse(index_path)
|
| 61 |
+
return {"message": "Bug Triage Environment v2.0.0", "docs": "/docs"}
|
| 62 |
+
|
| 63 |
|
| 64 |
@app.get("/web")
|
| 65 |
def web_ui():
|
| 66 |
"""Alias for the frontend."""
|
| 67 |
+
return root()
|
| 68 |
+
|
| 69 |
|
| 70 |
@app.get("/tasks")
|
| 71 |
def list_tasks():
|
| 72 |
return TASKS_META
|
| 73 |
|
|
|
|
|
|
|
|
|
|
| 74 |
|
| 75 |
+
@app.get("/tasks/{task_id}")
|
| 76 |
+
def get_task(task_id: str):
|
| 77 |
+
for t in TASKS_META:
|
| 78 |
+
if t["id"] == task_id:
|
| 79 |
+
return t
|
| 80 |
+
raise HTTPException(404, detail={
|
| 81 |
+
"error": "task_not_found",
|
| 82 |
+
"message": f"Task '{task_id}' not found. Valid: easy, medium, hard",
|
| 83 |
+
})
|
| 84 |
|
| 85 |
|
| 86 |
+
# ---------------------------------------------------------------------------
|
| 87 |
+
# SESSION-BASED RESET / STEP / STATE
|
| 88 |
+
# ---------------------------------------------------------------------------
|
| 89 |
|
| 90 |
@app.post("/reset")
|
| 91 |
async def custom_reset(request: Request):
|
| 92 |
+
"""Start a new episode. Returns a session_id for subsequent step() calls."""
|
| 93 |
+
global _fallback_env, _fallback_answer
|
| 94 |
|
| 95 |
body = {}
|
| 96 |
try:
|
|
|
|
| 101 |
task_id = body.get("task_id", "easy")
|
| 102 |
seed = body.get("seed", None)
|
| 103 |
episode_id = body.get("episode_id", None)
|
| 104 |
+
session_id = body.get("session_id", None)
|
| 105 |
+
|
| 106 |
+
# If session_id provided, reuse that session
|
| 107 |
+
if session_id:
|
| 108 |
+
env = sessions.get_session(session_id)
|
| 109 |
+
if env is None:
|
| 110 |
+
session_id, env = sessions.create_session()
|
| 111 |
+
else:
|
| 112 |
+
session_id, env = sessions.create_session()
|
| 113 |
+
|
| 114 |
+
obs = env.reset(task_id=task_id, seed=seed, episode_id=episode_id)
|
| 115 |
|
| 116 |
+
# Also update fallback for backward compatibility
|
| 117 |
+
_fallback_env = env
|
| 118 |
|
| 119 |
try:
|
| 120 |
obs_dict = obs.model_dump(exclude={"reward", "done", "metadata"})
|
|
|
|
| 125 |
obs_dict.pop("metadata", None)
|
| 126 |
|
| 127 |
return {
|
| 128 |
+
"session_id": session_id,
|
| 129 |
"observation": obs_dict,
|
| 130 |
+
"reward": 0.0,
|
| 131 |
+
"done": False,
|
| 132 |
}
|
| 133 |
|
| 134 |
|
| 135 |
@app.post("/step")
|
| 136 |
async def custom_step(request: Request):
|
| 137 |
+
"""Process an action: either investigation or final triage submission."""
|
| 138 |
+
global _fallback_env
|
| 139 |
|
| 140 |
body = await request.json()
|
| 141 |
action_data = body.get("action", body)
|
| 142 |
+
session_id = body.get("session_id", None)
|
| 143 |
+
|
| 144 |
+
# Find the right environment
|
| 145 |
+
env = None
|
| 146 |
+
if session_id:
|
| 147 |
+
env = sessions.get_session(session_id)
|
| 148 |
+
if env is None:
|
| 149 |
+
env = _fallback_env
|
| 150 |
|
| 151 |
action = TriageAction(
|
| 152 |
+
action_type=action_data.get("action_type", "submit"),
|
| 153 |
priority=action_data.get("priority", "P2"),
|
| 154 |
labels=action_data.get("labels", ["bug"]),
|
| 155 |
assigned_team=action_data.get("assigned_team", "backend"),
|
|
|
|
| 157 |
reasoning=action_data.get("reasoning", ""),
|
| 158 |
)
|
| 159 |
|
| 160 |
+
obs = env.step(action)
|
| 161 |
|
| 162 |
try:
|
| 163 |
obs_dict = obs.model_dump(exclude={"reward", "done", "metadata"})
|
|
|
|
| 167 |
obs_dict.pop("done", None)
|
| 168 |
obs_dict.pop("metadata", None)
|
| 169 |
|
| 170 |
+
reward = float(obs.reward) if obs.reward is not None else 0.0
|
| 171 |
+
reward = max(0.01, min(0.99, reward)) if obs.done else 0.0
|
|
|
|
| 172 |
|
| 173 |
+
response_data = {
|
| 174 |
"observation": obs_dict,
|
| 175 |
"reward": reward,
|
| 176 |
"done": obs.done,
|
| 177 |
}
|
| 178 |
|
| 179 |
+
if session_id:
|
| 180 |
+
response_data["session_id"] = session_id
|
| 181 |
+
|
| 182 |
+
# Cleanup session when episode is done
|
| 183 |
+
if obs.done and session_id:
|
| 184 |
+
sessions.remove_session(session_id)
|
| 185 |
+
|
| 186 |
+
return response_data
|
| 187 |
+
|
| 188 |
|
| 189 |
@app.get("/state")
|
| 190 |
+
def custom_state(session_id: Optional[str] = None):
|
| 191 |
"""Return current environment state."""
|
| 192 |
+
env = None
|
| 193 |
+
if session_id:
|
| 194 |
+
env = sessions.get_session(session_id)
|
| 195 |
+
if env is None:
|
| 196 |
+
env = _fallback_env
|
| 197 |
+
|
| 198 |
+
state = env.get_state()
|
| 199 |
try:
|
| 200 |
return state.model_dump()
|
| 201 |
except AttributeError:
|
| 202 |
return state.dict()
|
| 203 |
|
| 204 |
|
| 205 |
+
# ---------------------------------------------------------------------------
|
| 206 |
+
# PER-TASK SHORTCUT ENDPOINTS
|
| 207 |
+
# ---------------------------------------------------------------------------
|
| 208 |
+
|
| 209 |
@app.post("/tasks/easy/reset")
|
| 210 |
+
async def reset_easy():
|
| 211 |
+
session_id, env = sessions.create_session()
|
| 212 |
+
obs = env.reset(task_id="easy")
|
| 213 |
+
return {
|
| 214 |
+
"session_id": session_id,
|
| 215 |
+
"task_id": "easy",
|
| 216 |
+
"bug_report": obs.bug_report.model_dump(),
|
| 217 |
+
"done": False,
|
| 218 |
+
"reward": 0.0,
|
| 219 |
+
}
|
| 220 |
+
|
| 221 |
|
| 222 |
@app.post("/tasks/medium/reset")
|
| 223 |
+
async def reset_medium():
|
| 224 |
+
session_id, env = sessions.create_session()
|
| 225 |
+
obs = env.reset(task_id="medium")
|
| 226 |
+
return {
|
| 227 |
+
"session_id": session_id,
|
| 228 |
+
"task_id": "medium",
|
| 229 |
+
"bug_report": obs.bug_report.model_dump(),
|
| 230 |
+
"done": False,
|
| 231 |
+
"reward": 0.0,
|
| 232 |
+
}
|
| 233 |
+
|
| 234 |
|
| 235 |
@app.post("/tasks/hard/reset")
|
| 236 |
+
async def reset_hard():
|
| 237 |
+
session_id, env = sessions.create_session()
|
| 238 |
+
obs = env.reset(task_id="hard")
|
| 239 |
+
return {
|
| 240 |
+
"session_id": session_id,
|
| 241 |
+
"task_id": "hard",
|
| 242 |
+
"bug_report": obs.bug_report.model_dump(),
|
| 243 |
+
"done": False,
|
| 244 |
+
"reward": 0.0,
|
| 245 |
+
}
|
| 246 |
+
|
| 247 |
|
| 248 |
+
# ---------------------------------------------------------------------------
|
| 249 |
+
# LEADERBOARD
|
| 250 |
+
# ---------------------------------------------------------------------------
|
| 251 |
+
|
| 252 |
+
_leaderboard = []
|
| 253 |
+
|
| 254 |
+
|
| 255 |
+
@app.get("/leaderboard")
|
| 256 |
+
def get_leaderboard():
|
| 257 |
+
"""Return top 50 agent scores."""
|
| 258 |
+
return sorted(_leaderboard, key=lambda x: x.get("avg_score", 0), reverse=True)[:50]
|
| 259 |
+
|
| 260 |
+
|
| 261 |
+
@app.post("/leaderboard/submit")
|
| 262 |
+
async def submit_to_leaderboard(request: Request):
|
| 263 |
+
"""Submit agent scores to the leaderboard."""
|
| 264 |
+
body = await request.json()
|
| 265 |
+
entry = {
|
| 266 |
+
"agent_name": body.get("agent_name", "anonymous"),
|
| 267 |
+
"model": body.get("model", "unknown"),
|
| 268 |
+
"scores": body.get("scores", {}),
|
| 269 |
+
"avg_score": body.get("avg_score", 0.0),
|
| 270 |
+
}
|
| 271 |
+
_leaderboard.append(entry)
|
| 272 |
+
rank = sorted(
|
| 273 |
+
_leaderboard, key=lambda x: x.get("avg_score", 0), reverse=True
|
| 274 |
+
).index(entry) + 1
|
| 275 |
+
return {"status": "submitted", "rank": rank, "total_entries": len(_leaderboard)}
|
| 276 |
+
|
| 277 |
+
|
| 278 |
+
# ---------------------------------------------------------------------------
|
| 279 |
+
# ENTRYPOINT
|
| 280 |
+
# ---------------------------------------------------------------------------
|
| 281 |
|
| 282 |
def main():
|
| 283 |
import uvicorn
|
| 284 |
uvicorn.run(app, host="0.0.0.0", port=7860)
|
| 285 |
|
| 286 |
+
|
| 287 |
if __name__ == "__main__":
|
| 288 |
main()
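`server/app.py` only shows how `SessionManager` is used: it is constructed with `max_sessions` and `ttl_seconds`, and exposes `create_session()`, `get_session()`, `remove_session()`, and `active_count`. The sketch below is one way such a store could behave, not the repository's implementation. The `factory` and `clock` parameters, the class name `SessionManagerSketch`, and the choice to evict the least recently used session when the cap is reached are all assumptions.

```python
import time
import uuid
from typing import Callable, Dict, Optional, Tuple


class SessionManagerSketch:
    """Illustrative session store with a size cap and an idle timeout."""

    def __init__(self, factory: Callable[[], object],
                 max_sessions: int = 500, ttl_seconds: float = 600,
                 clock: Callable[[], float] = time.monotonic):
        self._factory = factory          # builds one environment per session
        self._max = max_sessions
        self._ttl = ttl_seconds
        self._clock = clock              # injectable so tests can control time
        # session_id -> (environment, last-access timestamp)
        self._sessions: Dict[str, Tuple[object, float]] = {}

    def _evict_expired(self) -> None:
        now = self._clock()
        expired = [sid for sid, (_, seen) in self._sessions.items()
                   if now - seen > self._ttl]
        for sid in expired:
            del self._sessions[sid]

    def create_session(self) -> Tuple[str, object]:
        self._evict_expired()
        if len(self._sessions) >= self._max:
            # Assumption: drop the least recently used session when full.
            oldest = min(self._sessions, key=lambda s: self._sessions[s][1])
            del self._sessions[oldest]
        session_id = str(uuid.uuid4())
        env = self._factory()
        self._sessions[session_id] = (env, self._clock())
        return session_id, env

    def get_session(self, session_id: str) -> Optional[object]:
        self._evict_expired()
        entry = self._sessions.get(session_id)
        if entry is None:
            return None
        env, _ = entry
        self._sessions[session_id] = (env, self._clock())  # refresh idle timer
        return env

    def remove_session(self, session_id: str) -> None:
        self._sessions.pop(session_id, None)

    @property
    def active_count(self) -> int:
        self._evict_expired()
        return len(self._sessions)
```

Returning `None` from `get_session` for unknown or expired ids matches how `custom_step` and `custom_state` fall back to `_fallback_env` when the lookup fails.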
|
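The session-based endpoints in `server/app.py` imply a simple client protocol: `POST /reset` returns a `session_id`, and each `POST /step` carries that id plus an `action` object until `done` is true. A minimal sketch of that loop follows, with the HTTP transport injected as a callable so the logic can be exercised without a server. `run_episode` and the `Post` alias are hypothetical names, and the early-return branch assumes the server may end an episode before the agent submits (for example when the step budget runs out).

```python
from typing import Any, Callable, Dict, List

# post(path, json_payload) -> decoded JSON response
Post = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def run_episode(post: Post, task_id: str, investigations: List[str],
                decision: Dict[str, Any]) -> float:
    """Reset, run investigation actions, submit, and return the final reward."""
    reply = post("/reset", {"task_id": task_id})
    session_id = reply["session_id"]

    for action_type in investigations:
        reply = post("/step", {"session_id": session_id,
                               "action": {"action_type": action_type}})
        if reply["done"]:
            # Episode ended before submit (assumed possible); stop here.
            return reply["reward"]

    reply = post("/step", {"session_id": session_id,
                           "action": {"action_type": "submit", **decision}})
    return reply["reward"]
```

With the `requests` dependency added in `pyproject.toml`, a real transport could be as small as `lambda path, payload: requests.post(base_url + path, json=payload).json()`, where `base_url` is the deployed endpoint. Because `custom_step` removes the session once `done` is true, a new `/reset` is needed before the next episode.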
server/environment.py
CHANGED
|
@@ -3,25 +3,39 @@ import sys
|
|
| 3 |
sys.path.insert(0, "/app")
|
| 4 |
sys.path.insert(0, "/app/server")
|
| 5 |
import uuid
|
|
|
|
|
|
|
| 6 |
from openenv.core.env_server.interfaces import Environment
|
| 7 |
from model import TriageAction, TriageObservation, TriageState, BugReport
|
| 8 |
from task import grade_action, sample_bug
|
| 9 |
|
| 10 |
VALID_TASKS = ["easy", "medium", "hard"]
|
| 11 |
|
|
|
|
|
|
|
| 12 |
TASKS_META = [
|
| 13 |
-
{"id": "easy", "name": "Priority Assignment",
|
|
|
|
| 14 |
"difficulty": "easy", "reward_range": [0.0, 1.0],
|
| 15 |
-
"description": "
|
| 16 |
-
|
|
|
|
|
|
|
| 17 |
"difficulty": "medium", "reward_range": [0.0, 1.0],
|
| 18 |
-
"description": "
|
| 19 |
-
|
|
|
|
|
|
|
| 20 |
"difficulty": "hard", "reward_range": [0.0, 1.0],
|
| 21 |
-
"description": "Full triage with
|
|
|
|
| 22 |
]
|
| 23 |
|
|
|
|
|
|
|
|
|
|
| 24 |
class BugTriageEnvironment(Environment):
|
|
|
|
| 25 |
|
| 26 |
SUPPORTS_CONCURRENT_SESSIONS = True
|
| 27 |
|
|
@@ -29,13 +43,25 @@ class BugTriageEnvironment(Environment):
|
|
| 29 |
super().__init__()
|
| 30 |
self._current_task_key: str = "easy"
|
| 31 |
self._episode_done: bool = False
|
| 32 |
-
self._current_bug: BugReport =
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 33 |
self._state = TriageState(
|
| 34 |
episode_id=str(uuid.uuid4()),
|
| 35 |
current_task="easy",
|
| 36 |
step_count=0,
|
| 37 |
-
total_score=0.
|
| 38 |
tasks_completed=[],
|
|
|
|
| 39 |
)
|
| 40 |
|
| 41 |
def get_metadata(self):
|
|
@@ -43,72 +69,203 @@ class BugTriageEnvironment(Environment):
|
|
| 43 |
from openenv.core.env_server.types import EnvironmentMetadata
|
| 44 |
return EnvironmentMetadata(
|
| 45 |
name="bug-triage-env",
|
| 46 |
-
description="
|
| 47 |
-
|
|
|
|
| 48 |
author="Siteshcodes",
|
| 49 |
tasks=TASKS_META,
|
| 50 |
)
|
| 51 |
except Exception:
|
| 52 |
return {
|
| 53 |
"name": "bug-triage-env",
|
| 54 |
-
"description": "
|
| 55 |
-
"version": "
|
| 56 |
"author": "Siteshcodes",
|
| 57 |
"tasks": TASKS_META,
|
| 58 |
}
|
| 59 |
|
| 60 |
-
def
|
| 61 |
-
|
|
|
| 62 |
if task_id not in VALID_TASKS:
|
| 63 |
task_id = "easy"
|
| 64 |
|
| 65 |
self._current_task_key = task_id
|
| 66 |
self._episode_done = False
|
| 67 |
-
self.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 68 |
self._state = TriageState(
|
| 69 |
episode_id=episode_id or str(uuid.uuid4()),
|
| 70 |
current_task=task_id,
|
| 71 |
step_count=0,
|
| 72 |
-
total_score=0.
|
| 73 |
tasks_completed=[],
|
|
|
|
| 74 |
)
|
| 75 |
-
|
| 76 |
-
|
| 77 |
-
task_id
|
| 78 |
-
|
| 79 |
-
|
| 80 |
-
|
| 81 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 82 |
)
|
| 83 |
|
| 84 |
def step(self, action: TriageAction) -> TriageObservation:
|
| 85 |
-
"""Process
|
| 86 |
if self._episode_done:
|
| 87 |
-
return
|
| 88 |
-
|
| 89 |
-
task_id=self._current_task_key,
|
| 90 |
-
score=0.05,
|
| 91 |
feedback="Episode already complete. Call reset() to start a new episode.",
|
| 92 |
-
done=True,
|
| 93 |
-
|
|
|
| 94 |
)
|
| 95 |
|
| 96 |
-
|
| 97 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 98 |
|
| 99 |
-
|
|
|
| 100 |
|
| 101 |
self._state.total_score += score
|
| 102 |
-
self._state.tasks_completed.append(
|
| 103 |
self._episode_done = True
|
| 104 |
|
| 105 |
-
return
|
| 106 |
-
|
| 107 |
-
task_id=task_key,
|
| 108 |
-
score=round(score, 3),
|
| 109 |
-
feedback=feedback,
|
| 110 |
-
done=True,
|
| 111 |
-
reward=round(score, 3),
|
| 112 |
)
|
| 113 |
|
| 114 |
@property
|
|
@@ -116,4 +273,68 @@ class BugTriageEnvironment(Environment):
|
|
| 116 |
return self._state
|
| 117 |
|
| 118 |
def get_state(self) -> TriageState:
|
| 119 |
-
return self._state
|
|
|
|
| 3 |
sys.path.insert(0, "/app")
|
| 4 |
sys.path.insert(0, "/app/server")
|
| 5 |
import uuid
|
| 6 |
+
import time
|
| 7 |
+
from typing import Dict, Optional, Tuple
|
| 8 |
from openenv.core.env_server.interfaces import Environment
|
| 9 |
from model import TriageAction, TriageObservation, TriageState, BugReport
|
| 10 |
from task import grade_action, sample_bug
|
| 11 |
|
| 12 |
VALID_TASKS = ["easy", "medium", "hard"]
|
| 13 |
|
| 14 |
+
MAX_STEPS_PER_TASK = {"easy": 4, "medium": 5, "hard": 6}
|
| 15 |
+
|
| 16 |
TASKS_META = [
|
| 17 |
+
{"id": "easy", "name": "Priority Assignment",
|
| 18 |
+
"grader": "server.task:priority_match",
|
| 19 |
"difficulty": "easy", "reward_range": [0.0, 1.0],
|
| 20 |
+
"description": "Investigate a bug report and assign a P0-P3 priority. "
|
| 21 |
+
"Use investigation actions to gather info before submitting."},
|
| 22 |
+
{"id": "medium", "name": "Priority Labels and Team",
|
| 23 |
+
"grader": "server.task:priority_label_team",
|
| 24 |
"difficulty": "medium", "reward_range": [0.0, 1.0],
|
| 25 |
+
"description": "Investigate and assign priority, labels, and team routing. "
|
| 26 |
+
"More investigation steps available."},
|
| 27 |
+
{"id": "hard", "name": "Full Triage",
|
| 28 |
+
"grader": "server.task:full_triage",
|
| 29 |
"difficulty": "hard", "reward_range": [0.0, 1.0],
|
| 30 |
+
"description": "Full triage with priority, labels, team, milestone, "
|
| 31 |
+
"and security escalation penalty. Investigation is critical."},
|
| 32 |
]
|
| 33 |
|
| 34 |
+
INVESTIGATION_ACTIONS = {"read_body", "read_comments", "check_logs", "check_similar"}
|
| 35 |
+
|
| 36 |
+
|
| 37 |
class BugTriageEnvironment(Environment):
|
| 38 |
+
"""Multi-step bug triage environment with progressive information reveal."""
|
| 39 |
|
| 40 |
SUPPORTS_CONCURRENT_SESSIONS = True
|
| 41 |
|
|
|
|
| 43 |
super().__init__()
|
| 44 |
self._current_task_key: str = "easy"
|
| 45 |
self._episode_done: bool = False
|
| 46 |
+
self._current_bug: Optional[BugReport] = None
|
| 47 |
+
self._current_answer: Optional[dict] = None
|
| 48 |
+
self._step_count: int = 0
|
| 49 |
+
self._max_steps: int = 4
|
| 50 |
+
self._actions_taken: list = []
|
| 51 |
+
|
| 52 |
+
# Progressive visibility
|
| 53 |
+
self._body_visible: bool = False
|
| 54 |
+
self._comments_visible: bool = False
|
| 55 |
+
self._logs_visible: bool = False
|
| 56 |
+
self._similar_visible: bool = False
|
| 57 |
+
|
| 58 |
self._state = TriageState(
|
| 59 |
episode_id=str(uuid.uuid4()),
|
| 60 |
current_task="easy",
|
| 61 |
step_count=0,
|
| 62 |
+
total_score=0.0,
|
| 63 |
tasks_completed=[],
|
| 64 |
+
actions_taken=[],
|
| 65 |
)
|
| 66 |
|
| 67 |
def get_metadata(self):
|
|
|
|
| 69 |
from openenv.core.env_server.types import EnvironmentMetadata
|
| 70 |
return EnvironmentMetadata(
|
| 71 |
name="bug-triage-env",
|
| 72 |
+
description="Multi-step bug triage RL environment with progressive "
|
| 73 |
+
"information reveal and 3 difficulty levels",
|
| 74 |
+
version="2.0.0",
|
| 75 |
author="Siteshcodes",
|
| 76 |
tasks=TASKS_META,
|
| 77 |
)
|
| 78 |
except Exception:
|
| 79 |
return {
|
| 80 |
"name": "bug-triage-env",
|
| 81 |
+
"description": "Multi-step bug triage RL environment",
|
| 82 |
+
"version": "2.0.0",
|
| 83 |
"author": "Siteshcodes",
|
| 84 |
"tasks": TASKS_META,
|
| 85 |
}
|
| 86 |
|
| 87 |
+
def _build_observation(self, score=0.0, feedback="", done=False,
|
| 88 |
+
reward=0.0) -> TriageObservation:
|
| 89 |
+
"""Build observation with current visibility state."""
|
| 90 |
+
bug = self._current_bug
|
| 91 |
+
|
| 92 |
+
# Create a visibility-filtered view of the bug
|
| 93 |
+
visible_bug = BugReport(
|
| 94 |
+
id=bug.id,
|
| 95 |
+
title=bug.title,
|
| 96 |
+
body=bug.body if self._body_visible else (bug.body[:120] + "..." if len(bug.body) > 120 else bug.body),
|
| 97 |
+
author=bug.author,
|
| 98 |
+
labels_hint=bug.labels_hint,
|
| 99 |
+
comments=bug.comments if self._comments_visible else [],
|
| 100 |
+
severity_signals=bug.severity_signals if self._logs_visible else [],
|
| 101 |
+
related_bugs=bug.related_bugs if self._similar_visible else [],
|
| 102 |
+
stack_trace=bug.stack_trace if self._logs_visible else "",
|
| 103 |
+
affected_component=bug.affected_component if self._logs_visible else "",
|
| 104 |
+
)
|
| 105 |
+
|
| 106 |
+
return TriageObservation(
|
| 107 |
+
bug_report=visible_bug,
|
| 108 |
+
task_id=self._current_task_key,
|
| 109 |
+
score=round(score, 3),
|
| 110 |
+
feedback=feedback,
|
| 111 |
+
done=done,
|
| 112 |
+
reward=round(reward, 3),
|
| 113 |
+
body_visible=self._body_visible,
|
| 114 |
+
comments_visible=self._comments_visible,
|
| 115 |
+
logs_visible=self._logs_visible,
|
| 116 |
+
similar_visible=self._similar_visible,
|
| 117 |
+
steps_taken=self._step_count,
|
| 118 |
+
max_steps=self._max_steps,
|
| 119 |
+
)
|
| 120 |
+
|
| 121 |
+
def reset(self, task_id: str = "easy", seed: int = None,
|
| 122 |
+
episode_id: str = None, **kwargs) -> TriageObservation:
|
| 123 |
+
"""Start a fresh episode for the given task."""
|
| 124 |
if task_id not in VALID_TASKS:
|
| 125 |
task_id = "easy"
|
| 126 |
|
| 127 |
self._current_task_key = task_id
|
| 128 |
self._episode_done = False
|
| 129 |
+
self._step_count = 0
|
| 130 |
+
self._max_steps = MAX_STEPS_PER_TASK.get(task_id, 4)
|
| 131 |
+
self._actions_taken = []
|
| 132 |
+
|
| 133 |
+
# Reset visibility: title + truncated body are always visible
|
| 134 |
+
self._body_visible = False
|
| 135 |
+
self._comments_visible = False
|
| 136 |
+
self._logs_visible = False
|
| 137 |
+
self._similar_visible = False
|
| 138 |
+
|
| 139 |
+
# Sample a bug and its answer
|
| 140 |
+
self._current_bug, self._current_answer = sample_bug(task_id, seed=seed)
|
| 141 |
+
|
| 142 |
self._state = TriageState(
|
| 143 |
episode_id=episode_id or str(uuid.uuid4()),
|
| 144 |
current_task=task_id,
|
| 145 |
step_count=0,
|
| 146 |
+
total_score=0.0,
|
| 147 |
tasks_completed=[],
|
| 148 |
+
actions_taken=[],
|
| 149 |
)
|
| 150 |
+
|
| 151 |
+
feedback = (
|
| 152 |
+
f"Episode started for task: {task_id}. "
|
| 153 |
+
f"You see the bug title and a preview. "
|
| 154 |
+
f"Use investigation actions (read_body, read_comments, check_logs, check_similar) "
|
| 155 |
+
f"to reveal more information, then submit your triage. "
|
| 156 |
+
f"You have {self._max_steps} steps max."
|
| 157 |
+
)
|
| 158 |
+
|
| 159 |
+
return self._build_observation(
|
| 160 |
+
score=0.0, feedback=feedback, done=False, reward=0.0,
|
| 161 |
)
|
| 162 |
|
| 163 |
def step(self, action: TriageAction) -> TriageObservation:
|
| 164 |
+
"""Process agent's action: either investigate or submit final triage."""
|
| 165 |
if self._episode_done:
|
| 166 |
+
return self._build_observation(
|
| 167 |
+
score=0.0,
|
| 168 |
feedback="Episode already complete. Call reset() to start a new episode.",
|
| 169 |
+
done=True, reward=0.0,
|
| 170 |
+
)
|
| 171 |
+
|
| 172 |
+
self._step_count += 1
|
| 173 |
+
self._state.step_count = self._step_count
|
| 174 |
+
action_type = getattr(action, "action_type", "submit")
|
| 175 |
+
self._actions_taken.append(action_type)
|
| 176 |
+
self._state.actions_taken = list(self._actions_taken)
|
| 177 |
+
|
| 178 |
+
# Check if max steps reached: force submission
|
| 179 |
+
if self._step_count >= self._max_steps and action_type != "submit":
|
| 180 |
+
action_type = "submit"
|
| 181 |
+
|
| 182 |
+
# --- Investigation actions ---
|
| 183 |
+
if action_type in INVESTIGATION_ACTIONS:
|
| 184 |
+
feedback = self._handle_investigation(action_type)
|
| 185 |
+
return self._build_observation(
|
| 186 |
+
score=0.0, feedback=feedback, done=False, reward=0.0,
|
| 187 |
+
)
|
| 188 |
+
|
| 189 |
+
# --- Submit action ---
|
| 190 |
+
return self._handle_submission(action)
|
| 191 |
+
|
| 192 |
+
def _handle_investigation(self, action_type: str) -> str:
|
| 193 |
+
"""Reveal information based on the investigation action."""
|
| 194 |
+
if action_type == "read_body":
|
| 195 |
+
if self._body_visible:
|
| 196 |
+
return "Full body already revealed. Choose another action or submit."
|
| 197 |
+
self._body_visible = True
|
| 198 |
+
return (
|
| 199 |
+
f"Full bug description revealed. "
|
| 200 |
+
f"Steps used: {self._step_count}/{self._max_steps}."
|
| 201 |
)
|
| 202 |
|
| 203 |
+
elif action_type == "read_comments":
|
| 204 |
+
if self._comments_visible:
|
| 205 |
+
return "Comments already revealed. Choose another action or submit."
|
| 206 |
+
self._comments_visible = True
|
| 207 |
+
n = len(self._current_bug.comments)
|
| 208 |
+
return (
|
| 209 |
+
f"Revealed {n} comment(s). "
|
| 210 |
+
f"Steps used: {self._step_count}/{self._max_steps}."
|
| 211 |
+
)
|
| 212 |
|
| 213 |
+
elif action_type == "check_logs":
|
| 214 |
+
if self._logs_visible:
|
| 215 |
+
return "Logs already revealed. Choose another action or submit."
|
| 216 |
+
self._logs_visible = True
|
| 217 |
+
has_trace = bool(self._current_bug.stack_trace)
|
| 218 |
+
return (
|
| 219 |
+
f"System logs revealed. {'Stack trace available.' if has_trace else 'No stack trace.'} "
|
| 220 |
+
f"Steps used: {self._step_count}/{self._max_steps}."
|
| 221 |
+
)
|
| 222 |
+
|
| 223 |
+
elif action_type == "check_similar":
|
| 224 |
+
if self._similar_visible:
|
| 225 |
+
return "Similar bugs already revealed. Choose another action or submit."
|
| 226 |
+
self._similar_visible = True
|
| 227 |
+
n = len(self._current_bug.related_bugs)
|
| 228 |
+
return (
|
| 229 |
+
f"Found {n} related bug(s). "
|
| 230 |
+
f"Steps used: {self._step_count}/{self._max_steps}."
|
| 231 |
+
)
|
| 232 |
+
|
| 233 |
+
return f"Unknown investigation action: {action_type}"
|
| 234 |
+
|
| 235 |
+
def _handle_submission(self, action: TriageAction) -> TriageObservation:
|
| 236 |
+
"""Grade the agent's final triage submission."""
|
| 237 |
+
score, feedback = grade_action(
|
| 238 |
+
self._current_task_key, self._current_bug, action,
|
| 239 |
+
answer=self._current_answer,
|
| 240 |
+
)
|
| 241 |
+
|
| 242 |
+
# Apply time efficiency bonus/penalty
|
| 243 |
+
# Fewer steps = better (if the answer is good)
|
| 244 |
+
investigation_steps = self._step_count - 1 # subtract the submit step
|
| 245 |
+
if investigation_steps == 0 and score >= 0.7:
|
| 246 |
+
# Got it right without investigating: impressive!
|
| 247 |
+
efficiency_bonus = 0.05
|
| 248 |
+
feedback += " | ⚡ Efficiency bonus: +0.05 (correct with minimal investigation)"
|
| 249 |
+
elif investigation_steps >= 3 and score >= 0.7:
|
| 250 |
+
# Took many steps but got it right: slight penalty for slowness
|
| 251 |
+
efficiency_penalty = 0.02 * (investigation_steps - 2)
|
| 252 |
+
score = score - efficiency_penalty
|
| 253 |
+
feedback += f" | ⏱ Time penalty: -{efficiency_penalty:.2f} ({investigation_steps} investigation steps)"
|
| 254 |
+
elif investigation_steps == 0 and score < 0.5:
|
| 255 |
+
# Rushed and got it wrong: no score penalty, only a hint in the feedback
|
| 256 |
+
feedback += " | ⚠ Consider investigating before submitting next time"
|
| 257 |
+
|
| 258 |
+
if investigation_steps == 0 and score >= 0.7:
|
| 259 |
+
score += 0.05
|
| 260 |
+
|
| 261 |
+
score = max(0.01, min(0.99, score))
|
| 262 |
|
| 263 |
self._state.total_score += score
|
| 264 |
+
self._state.tasks_completed.append(self._current_task_key)
|
| 265 |
self._episode_done = True
|
| 266 |
|
| 267 |
+
return self._build_observation(
|
| 268 |
+
score=score, feedback=feedback, done=True, reward=score,
|
| 269 |
)
|
| 270 |
|
| 271 |
@property
|
|
|
|
| 273 |
return self._state
|
| 274 |
|
| 275 |
def get_state(self) -> TriageState:
|
| 276 |
+
return self._state
|
| 277 |
+
|
| 278 |
+
|
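For reference, the step-efficiency adjustment that `_handle_submission` applies above can be isolated as a pure function. This is a minimal sketch rather than code from the commit: the name `apply_efficiency_adjustment` is hypothetical, and the final rounding to three decimals mirrors what `_build_observation` does with the score.

```python
def apply_efficiency_adjustment(score: float, investigation_steps: int) -> float:
    """Adjust a graded score by how many investigation steps preceded the submit.

    Hypothetical helper that restates the rules visible in the diff:
    - good answer (>= 0.7) with no investigation: +0.05 bonus
    - good answer (>= 0.7) after 3+ investigation steps: -0.02 per step beyond 2
    - everything else: unchanged
    The result is clamped to [0.01, 0.99].
    """
    if score >= 0.7:
        if investigation_steps == 0:
            score += 0.05
        elif investigation_steps >= 3:
            score -= 0.02 * (investigation_steps - 2)
    clamped = max(0.01, min(0.99, score))
    # Rounding is an assumption borrowed from _build_observation.
    return round(clamped, 3)


if __name__ == "__main__":
    for graded, steps in [(0.8, 0), (0.8, 1), (0.8, 3), (0.3, 0), (0.97, 0)]:
        print(graded, steps, apply_efficiency_adjustment(graded, steps))
```

Keeping the adjustment separate from the feedback strings makes the scoring rule easy to unit-test without constructing an environment.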
| 279 |
+
# ---------------------------------------------------------------------------
|
| 280 |
+
# SESSION MANAGER: handles concurrent sessions safely
|
| 281 |
+
# ---------------------------------------------------------------------------
|
| 282 |
+
|
| 283 |
+
class SessionManager:
|
| 284 |
+
"""Thread-safe session management for multiple concurrent agents."""
|
| 285 |
+
|
| 286 |
+
def __init__(self, max_sessions: int = 1000, ttl_seconds: int = 600):
|
| 287 |
+
self._sessions: Dict[str, BugTriageEnvironment] = {}
|
| 288 |
+
self._timestamps: Dict[str, float] = {}
|
| 289 |
+
self._max_sessions = max_sessions
|
| 290 |
+
self._ttl = ttl_seconds
|
| 291 |
+
|
| 292 |
+
def create_session(self) -> Tuple[str, BugTriageEnvironment]:
|
| 293 |
+
"""Create a new session and return (session_id, env)."""
|
| 294 |
+
self._cleanup_expired()
|
| 295 |
+
session_id = str(uuid.uuid4())
|
| 296 |
+
env = BugTriageEnvironment()
|
| 297 |
+
self._sessions[session_id] = env
|
| 298 |
+
self._timestamps[session_id] = time.time()
|
| 299 |
+
# Enforce max after adding
|
| 300 |
+
while len(self._sessions) > self._max_sessions:
|
| 301 |
+
oldest = min(self._timestamps, key=self._timestamps.get)
|
| 302 |
+
if oldest == session_id:
|
| 303 |
+
break
|
| 304 |
+
self._sessions.pop(oldest, None)
|
| 305 |
+
self._timestamps.pop(oldest, None)
|
| 306 |
+
return session_id, env
|
| 307 |
+
|
| 308 |
+
def get_session(self, session_id: str) -> Optional[BugTriageEnvironment]:
|
| 309 |
+
"""Get an existing session's environment, or None if expired/missing."""
|
| 310 |
+
if session_id not in self._sessions:
|
| 311 |
+
return None
|
| 312 |
+
# Refresh TTL on access
|
| 313 |
+
self._timestamps[session_id] = time.time()
|
| 314 |
+
return self._sessions[session_id]
|
| 315 |
+
|
| 316 |
+
def remove_session(self, session_id: str) -> None:
|
| 317 |
+
"""Remove a session after episode completes."""
|
| 318 |
+
self._sessions.pop(session_id, None)
|
| 319 |
+
self._timestamps.pop(session_id, None)
|
| 320 |
+
|
| 321 |
+
def _cleanup_expired(self) -> None:
|
| 322 |
+
"""Remove sessions that exceeded TTL."""
|
| 323 |
+
now = time.time()
|
| 324 |
+
expired = [
|
| 325 |
+
sid for sid, ts in self._timestamps.items()
|
| 326 |
+
if now - ts > self._ttl
|
| 327 |
+
]
|
| 328 |
+
for sid in expired:
|
| 329 |
+
self._sessions.pop(sid, None)
|
| 330 |
+
self._timestamps.pop(sid, None)
|
| 331 |
+
|
| 332 |
+
# Also enforce max sessions (remove oldest)
|
| 333 |
+
while len(self._sessions) > self._max_sessions:
|
| 334 |
+
oldest = min(self._timestamps, key=self._timestamps.get)
|
| 335 |
+
self._sessions.pop(oldest, None)
|
| 336 |
+
self._timestamps.pop(oldest, None)
|
| 337 |
+
|
| 338 |
+
@property
|
| 339 |
+
def active_count(self) -> int:
|
| 340 |
+
return len(self._sessions)
|
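The `SessionManager` above is meant to serve multiple concurrent agents, and the diff shown here contains no explicit locking. One possible way to guard the same TTL-plus-capacity logic is sketched below. `LockedSessionStore`, its `factory` parameter, and the use of `time.monotonic()` in place of `time.time()` are assumptions made for illustration and are not part of the commit.

```python
import threading
import time
import uuid
from typing import Callable, Dict, Generic, Optional, Tuple, TypeVar

T = TypeVar("T")


class LockedSessionStore(Generic[T]):
    """Hypothetical lock-guarded variant of the TTL + max-size session logic."""

    def __init__(self, factory: Callable[[], T],
                 max_sessions: int = 1000, ttl_seconds: float = 600.0):
        self._factory = factory            # builds one environment per session
        self._max_sessions = max_sessions
        self._ttl = ttl_seconds
        self._lock = threading.Lock()      # serializes every dict mutation
        self._sessions: Dict[str, T] = {}
        self._timestamps: Dict[str, float] = {}

    def _drop(self, session_id: str) -> None:
        # Caller must already hold the lock.
        self._sessions.pop(session_id, None)
        self._timestamps.pop(session_id, None)

    def create(self) -> Tuple[str, T]:
        with self._lock:
            now = time.monotonic()
            # Expire sessions whose last access is older than the TTL.
            expired = [s for s, ts in self._timestamps.items() if now - ts > self._ttl]
            for sid in expired:
                self._drop(sid)
            session_id = str(uuid.uuid4())
            self._sessions[session_id] = self._factory()
            self._timestamps[session_id] = now
            # Enforce the cap by evicting the oldest, never the new session.
            while len(self._sessions) > self._max_sessions:
                oldest = min(self._timestamps, key=self._timestamps.get)
                if oldest == session_id:
                    break
                self._drop(oldest)
            return session_id, self._sessions[session_id]

    def get(self, session_id: str) -> Optional[T]:
        with self._lock:
            if session_id not in self._sessions:
                return None
            self._timestamps[session_id] = time.monotonic()  # refresh TTL on access
            return self._sessions[session_id]

    @property
    def active_count(self) -> int:
        with self._lock:
            return len(self._sessions)
```

A single coarse lock is the simplest choice here because every operation is a short dictionary update. Whether locking is needed at all depends on how the server dispatches requests, which this part of the diff does not show.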
server/requirements.txt
CHANGED
|
@@ -4,4 +4,5 @@ uvicorn[standard]
|
|
| 4 |
pydantic
|
| 5 |
websockets
|
| 6 |
openai
|
| 7 |
-
httpx
|
|
|
|
|
|
| 4 |
pydantic
|
| 5 |
websockets
|
| 6 |
openai
|
| 7 |
+
httpx
|
| 8 |
+
requests
|
server/task.py
CHANGED
|
@@ -1,16 +1,426 @@
|
|
| 1 |
# server/task.py
|
| 2 |
import sys
|
| 3 |
import random
|
|
|
|
| 4 |
sys.path.insert(0, "/app")
|
| 5 |
|
| 6 |
-
from typing import Tuple, List
|
| 7 |
from model import BugReport, TriageAction
|
| 8 |
|
| 9 |
|
| 10 |
-
#
|
| 11 |
|
| 12 |
|
| 13 |
-
|
| 14 |
"easy": {
|
| 15 |
"bugs": [
|
| 16 |
BugReport(
|
|
@@ -22,6 +432,9 @@ TASKS = {
|
|
| 22 |
author="user123",
|
| 23 |
labels_hint=[],
|
| 24 |
comments=["Confirmed on iOS and Android.", "Happens every time."],
|
| 25 |
),
|
| 26 |
BugReport(
|
| 27 |
id="easy-002",
|
|
@@ -31,6 +444,9 @@ TASKS = {
|
|
| 31 |
author="docs_fan",
|
| 32 |
labels_hint=["documentation"],
|
| 33 |
comments=[],
|
| 34 |
),
|
| 35 |
BugReport(
|
| 36 |
id="easy-003",
|
|
@@ -40,6 +456,9 @@ TASKS = {
|
|
| 40 |
author="power_user",
|
| 41 |
labels_hint=["performance"],
|
| 42 |
comments=["Noticed after the last deploy.", "CPU spikes to 100%."],
|
| 43 |
),
|
| 44 |
BugReport(
|
| 45 |
id="easy-004",
|
|
@@ -49,7 +468,11 @@ TASKS = {
|
|
| 49 |
"Affects all users attempting password reset.",
|
| 50 |
author="support_team",
|
| 51 |
labels_hint=["bug"],
|
| 52 |
-
comments=["Reported by 12 users this week.",
|
| 53 |
),
|
| 54 |
BugReport(
|
| 55 |
id="easy-005",
|
|
@@ -59,9 +482,11 @@ TASKS = {
|
|
| 59 |
author="intern_dev",
|
| 60 |
labels_hint=["documentation"],
|
| 61 |
comments=[],
|
| 62 |
),
|
| 63 |
],
|
| 64 |
-
# Ground truth for grader
|
| 65 |
"answers": {
|
| 66 |
"easy-001": {"priority": "P0"},
|
| 67 |
"easy-002": {"priority": "P3"},
|
|
@@ -82,6 +507,9 @@ TASKS = {
|
|
| 82 |
author="store_owner",
|
| 83 |
labels_hint=["bug"],
|
| 84 |
comments=["Revenue impact confirmed.", "Happening since Tuesday."],
|
| 85 |
),
|
| 86 |
BugReport(
|
| 87 |
id="med-002",
|
|
@@ -92,6 +520,9 @@ TASKS = {
|
|
| 92 |
author="moderator_jane",
|
| 93 |
labels_hint=[],
|
| 94 |
comments=["GDPR concern - deleted content still visible."],
|
| 95 |
),
|
| 96 |
BugReport(
|
| 97 |
id="med-003",
|
|
@@ -101,6 +532,9 @@ TASKS = {
|
|
| 101 |
author="safari_user",
|
| 102 |
labels_hint=["bug", "ux"],
|
| 103 |
comments=["Only on Safari, not Chrome/Firefox."],
|
| 104 |
),
|
| 105 |
BugReport(
|
| 106 |
id="med-004",
|
|
@@ -110,7 +544,11 @@ TASKS = {
|
|
| 110 |
"Affects users with international data.",
|
| 111 |
author="data_analyst",
|
| 112 |
labels_hint=["bug"],
|
| 113 |
-
comments=["Encoding issue - UTF-8 not respected.",
|
| 114 |
),
|
| 115 |
BugReport(
|
| 116 |
id="med-005",
|
|
@@ -120,7 +558,11 @@ TASKS = {
|
|
| 120 |
"The unblock logic has a bug - it never clears the blocked flag.",
|
| 121 |
author="api_user",
|
| 122 |
labels_hint=["bug"],
|
| 123 |
-
comments=["Affects CI/CD pipelines hitting the API.",
|
| 124 |
),
|
| 125 |
],
|
| 126 |
"answers": {
|
|
@@ -144,6 +586,10 @@ TASKS = {
|
|
| 144 |
author="security_researcher",
|
| 145 |
labels_hint=[],
|
| 146 |
comments=["Critical. Affects production.", "Do not discuss publicly."],
|
| 147 |
),
|
| 148 |
BugReport(
|
| 149 |
id="hard-002",
|
|
@@ -155,6 +601,9 @@ TASKS = {
|
|
| 155 |
author="devops_alice",
|
| 156 |
labels_hint=["performance"],
|
| 157 |
comments=["Verified with heap profiler.", "Started in v1.9."],
|
| 158 |
),
|
| 159 |
BugReport(
|
| 160 |
id="hard-003",
|
|
@@ -167,7 +616,12 @@ TASKS = {
|
|
| 167 |
"Risk is low-probability but affects data integrity.",
|
| 168 |
author="qa_bot",
|
| 169 |
labels_hint=["bug"],
|
| 170 |
-
comments=["Reproduced with locust at 50 concurrent users.",
|
| 171 |
),
|
| 172 |
BugReport(
|
| 173 |
id="hard-004",
|
|
@@ -178,7 +632,12 @@ TASKS = {
|
|
| 178 |
"This is a session management security vulnerability.",
|
| 179 |
author="pentest_team",
|
| 180 |
labels_hint=["security"],
|
| 181 |
-
comments=["Verified on staging.",
|
| 182 |
),
|
| 183 |
BugReport(
|
| 184 |
id="hard-005",
|
|
@@ -189,126 +648,316 @@ TASKS = {
|
|
| 189 |
"Triggered in production twice this week. Requires process kill to recover.",
|
| 190 |
author="oncall_eng",
|
| 191 |
labels_hint=["bug", "performance"],
|
| 192 |
-
comments=["PagerDuty alert fired twice.",
|
| 193 |
),
|
| 194 |
],
|
| 195 |
"answers": {
|
| 196 |
"hard-001": {
|
| 197 |
-
"priority": "P0",
|
| 198 |
-
"
|
| 199 |
-
"assigned_team": "security",
|
| 200 |
-
"milestone": "hotfix",
|
| 201 |
},
|
| 202 |
"hard-002": {
|
| 203 |
-
"priority": "P1",
|
| 204 |
-
"
|
| 205 |
-
"assigned_team": "backend",
|
| 206 |
-
"milestone": "v2.1",
|
| 207 |
},
|
| 208 |
"hard-003": {
|
| 209 |
-
"priority": "P1",
|
| 210 |
-
"
|
| 211 |
-
"assigned_team": "backend",
|
| 212 |
-
"milestone": "v2.1",
|
| 213 |
},
|
| 214 |
"hard-004": {
|
| 215 |
-
"priority": "P0",
|
| 216 |
-
"
|
| 217 |
-
"assigned_team": "security",
|
| 218 |
-
"milestone": "hotfix",
|
| 219 |
},
|
| 220 |
"hard-005": {
|
| 221 |
-
"priority": "P0",
|
| 222 |
-
"
|
| 223 |
-
"assigned_team": "backend",
|
| 224 |
-
"milestone": "hotfix",
|
| 225 |
},
|
| 226 |
},
|
| 227 |
},
|
| 228 |
}
|
| 229 |
|
| 230 |
|
| 231 |
-
|
| 232 |
-
|
| 233 |
-
|
| 234 |
-
|
| 235 |
-
|
| 236 |
-
|
| 237 |
-
|
| 238 |
-
|
| 239 |
-
|
| 240 |
-
|
| 241 |
-
|
| 242 |
-
|
| 243 |
|
| 244 |
PRIORITY_ORDER = {"P0": 0, "P1": 1, "P2": 2, "P3": 3}
|
| 245 |
|
| 246 |
|
| 247 |
def _priority_score(predicted: str, correct: str) -> float:
|
|
|
|
| 248 |
if predicted == correct:
|
| 249 |
return 0.95
|
| 250 |
-
|
| 251 |
-
|
| 252 |
|
| 253 |
|
| 254 |
|
| 255 |
def _label_score(predicted: List[str], correct: List[str]) -> float:
|
| 256 |
-
|
| 257 |
-
|
| 258 |
-
|
| 259 |
return 0.95
|
| 260 |
-
|
| 261 |
-
|
| 262 |
-
|
| 263 |
return max(0.05, min(0.95, raw))
|
| 264 |
|
| 265 |
|
| 266 |
-
def
|
| 267 |
-
|
| 268 |
feedback_parts = []
|
|
|
|
| 269 |
|
| 270 |
if task_key == "easy":
|
| 271 |
score = _priority_score(action.priority, answer["priority"])
|
| 272 |
symbol = "✓" if score >= 0.9 else "~" if score >= 0.4 else "✗"
|
| 273 |
-
feedback_parts.append(
|
| 274 |
score = max(0.01, min(0.99, score))
|
| 275 |
return round(score, 3), " | ".join(feedback_parts)
|
| 276 |
|
| 277 |
elif task_key == "medium":
|
| 278 |
p_score = _priority_score(action.priority, answer["priority"])
|
| 279 |
-
l_score = _label_score(action.labels, answer
|
| 280 |
expected_team = answer.get("assigned_team", "")
|
| 281 |
t_score = 0.95 if expected_team and action.assigned_team.lower() == expected_team.lower() else 0.05
|
| 282 |
-
|
| 283 |
-
|
| 284 |
-
|
| 285 |
-
feedback_parts.append(
|
| 286 |
score = max(0.01, min(0.99, score))
|
| 287 |
return round(score, 3), " | ".join(feedback_parts)
|
| 288 |
|
| 289 |
else: # hard
|
| 290 |
p_score = _priority_score(action.priority, answer["priority"])
|
| 291 |
-
l_score = _label_score(action.labels, answer
|
| 292 |
t_score = 0.95 if action.assigned_team.lower() == answer["assigned_team"].lower() else 0.05
|
| 293 |
m_score = 0.95 if action.milestone.lower() == answer["milestone"].lower() else 0.05
|
| 294 |
-
|
| 295 |
-
|
| 296 |
-
|
| 297 |
-
feedback_parts.append(
|
| 298 |
-
|
| 299 |
if answer.get("assigned_team") == "security" and action.assigned_team.lower() != "security":
|
| 300 |
score = max(0.01, score - 0.15)
|
| 301 |
feedback_parts.append("⚠ Security escalation missed (-0.15)")
|
|
|
|
| 302 |
score = max(0.01, min(0.99, score))
|
| 303 |
return round(score, 3), " | ".join(feedback_parts)
|
| 304 |
-
|
| 305 |
def priority_match(*args, **kwargs):
|
| 306 |
if len(args) < 2:
|
| 307 |
return 0.5
|
| 308 |
-
|
| 309 |
-
bug = args[0]
|
| 310 |
-
action = args[1]
|
| 311 |
-
|
| 312 |
score, _ = grade_action("easy", bug, action)
|
| 313 |
return float(score)
|
| 314 |
|
|
@@ -316,10 +965,7 @@ def priority_match(*args, **kwargs):
|
|
| 316 |
def priority_label_team(*args, **kwargs):
|
| 317 |
if len(args) < 2:
|
| 318 |
return 0.5
|
| 319 |
-
|
| 320 |
-
bug = args[0]
|
| 321 |
-
action = args[1]
|
| 322 |
-
|
| 323 |
score, _ = grade_action("medium", bug, action)
|
| 324 |
return float(score)
|
| 325 |
|
|
@@ -327,17 +973,18 @@ def priority_label_team(*args, **kwargs):
|
|
| 327 |
def full_triage(*args, **kwargs):
|
| 328 |
if len(args) < 2:
|
| 329 |
return 0.5
|
| 330 |
-
|
| 331 |
-
bug = args[0]
|
| 332 |
-
action = args[1]
|
| 333 |
-
|
| 334 |
score, _ = grade_action("hard", bug, action)
|
| 335 |
return float(score)
|
|
|
|
|
|
|
| 336 |
__all__ = [
|
| 337 |
"priority_match",
|
| 338 |
"priority_label_team",
|
| 339 |
"full_triage",
|
| 340 |
"sample_bug",
|
|
|
|
| 341 |
"grade_action",
|
| 342 |
-
"TASKS",
|
|
|
|
| 343 |
]
|
|
|
|
| 1 |
# server/task.py
|
| 2 |
import sys
|
| 3 |
import random
|
| 4 |
+
import hashlib
|
| 5 |
sys.path.insert(0, "/app")
|
| 6 |
|
| 7 |
+
from typing import Tuple, List, Dict, Any
|
| 8 |
from model import BugReport, TriageAction
|
| 9 |
|
| 10 |
|
| 11 |
+
# ---------------------------------------------------------------------------
|
| 12 |
+
# LABEL SYNONYM MAP: allows semantic matching
|
| 13 |
+
# ---------------------------------------------------------------------------
|
| 14 |
+
|
| 15 |
+
LABEL_SYNONYMS: Dict[str, set] = {
|
| 16 |
+
"bug": {"defect", "issue", "error", "fault", "broken"},
|
| 17 |
+
"security": {"vulnerability", "cve", "exploit", "auth", "injection"},
|
| 18 |
+
"performance": {"perf", "slow", "latency", "optimization", "speed", "memory"},
|
| 19 |
+
"ux": {"ui", "frontend", "user-experience", "design", "usability"},
|
| 20 |
+
"data-integrity": {"data-loss", "corruption", "data", "consistency"},
|
| 21 |
+
"payments": {"billing", "payment", "stripe", "checkout", "revenue"},
|
| 22 |
+
"documentation": {"docs", "typo", "readme", "wiki"},
|
| 23 |
+
"infrastructure": {"infra", "devops", "deploy", "ci", "cd", "docker"},
|
| 24 |
+
"api": {"endpoint", "rest", "graphql", "http", "request"},
|
| 25 |
+
"database": {"db", "sql", "query", "migration", "schema"},
|
| 26 |
+
}
|
| 27 |
|
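The `LABEL_SYNONYMS` map above suggests that label grading treats near-equivalent labels as matches. The grader's actual matching logic is not visible in this part of the diff, so the following is only a sketch under assumptions: `canonical_label` and `label_overlap` are hypothetical helpers, Jaccard overlap is a stand-in metric, and only three entries of the map are copied here.

```python
from typing import Dict, Iterable, Set

# Subset of the synonym map from the diff, copied for a self-contained example.
LABEL_SYNONYMS_SAMPLE: Dict[str, Set[str]] = {
    "bug": {"defect", "issue", "error", "fault", "broken"},
    "security": {"vulnerability", "cve", "exploit", "auth", "injection"},
    "performance": {"perf", "slow", "latency", "optimization", "speed", "memory"},
}


def canonical_label(label: str, synonyms: Dict[str, Set[str]]) -> str:
    """Map a label to its canonical key, or return it normalized if unknown."""
    key = label.strip().lower()
    if key in synonyms:
        return key
    for canonical, alternatives in synonyms.items():
        if key in alternatives:
            return canonical
    return key


def label_overlap(predicted: Iterable[str], correct: Iterable[str],
                  synonyms: Dict[str, Set[str]]) -> float:
    """Jaccard overlap between two label lists after synonym normalization."""
    pred = {canonical_label(item, synonyms) for item in predicted}
    gold = {canonical_label(item, synonyms) for item in correct}
    if not pred and not gold:
        return 1.0  # nothing expected, nothing predicted
    return len(pred & gold) / len(pred | gold)


if __name__ == "__main__":
    print(label_overlap(["Defect", "CVE"], ["bug", "security"], LABEL_SYNONYMS_SAMPLE))
    print(label_overlap(["perf"], ["bug", "performance"], LABEL_SYNONYMS_SAMPLE))
```

Normalizing both sides before comparing means an agent that answers `defect` is not penalized against a ground truth of `bug`. Labels outside the map still have to match exactly.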
| 28 |
+
# ---------------------------------------------------------------------------
|
| 29 |
+
# BUG TEMPLATE SYSTEM: generates hundreds of unique bugs
|
| 30 |
+
# ---------------------------------------------------------------------------
|
| 31 |
+
|
| 32 |
+
_BUG_TEMPLATES = {
|
| 33 |
+
"crash": {
|
| 34 |
+
"titles": [
|
| 35 |
+
"{service} crashes on {trigger}",
|
| 36 |
+
"{service} throws {error_type} when {trigger}",
|
| 37 |
+
"Fatal error in {service} during {trigger}",
|
| 38 |
+
"Unhandled exception in {service}: {error_type}",
|
| 39 |
+
"{service} segfaults under {condition}",
|
| 40 |
+
],
|
| 41 |
+
"bodies": [
|
| 42 |
+
"When a user {trigger}, the {service} crashes immediately. "
|
| 43 |
+
"Error: {error_type}. Stack trace points to {component}. "
|
| 44 |
+
"Affects {impact}. {workaround}",
|
| 45 |
+
"The {service} is failing with {error_type} every time a user {trigger}. "
|
| 46 |
+
"No error message is shown to the user - the process just dies. "
|
| 47 |
+
"Impact: {impact}. {workaround}",
|
| 48 |
+
],
|
| 49 |
+
"vars": {
|
| 50 |
+
"service": ["auth service", "payment gateway", "search API", "notification worker",
|
| 51 |
+
"session manager", "user profile service", "file upload handler",
|
| 52 |
+
"webhook processor", "background job runner", "cache layer"],
|
| 53 |
+
"trigger": ["submits a form with special characters", "uploads a file larger than 10MB",
|
| 54 |
+
"logs in with SSO", "resets their password", "exports data to CSV",
|
| 55 |
+
"switches between tabs rapidly", "uses the bulk import feature",
|
| 56 |
+
"accesses the admin panel", "triggers a webhook", "runs a scheduled job"],
|
| 57 |
+
"error_type": ["NullPointerException", "SegmentationFault", "OutOfMemoryError",
|
| 58 |
+
"ConnectionTimeoutException", "StackOverflowError",
|
| 59 |
+
"IndexOutOfBoundsException", "TypeError", "KeyError"],
|
| 60 |
+
"component": ["UserController.java:142", "PaymentService.py:89",
|
| 61 |
+
"AuthMiddleware.ts:56", "SearchIndex.go:203",
|
| 62 |
+
"NotificationQueue.rb:77", "FileHandler.py:234"],
|
| 63 |
+
"impact": ["100% of users on this flow", "all mobile users", "EU region users only",
|
| 64 |
+
"users with accounts older than 1 year", "approximately 30% of sessions",
|
| 65 |
+
"every request during peak hours"],
|
| 66 |
+
"workaround": ["No workaround exists - the feature is completely broken.",
|
| 67 |
+
"Workaround: users can retry after clearing browser cache.",
|
| 68 |
+
"Temporary fix: restart the service every 2 hours.",
|
| 69 |
+
"No known workaround. Users are blocked."],
|
| 70 |
+
"condition": ["high concurrent load", "memory pressure above 80%",
|
| 71 |
+
"when connection pool is exhausted", "after running for 6+ hours"],
|
| 72 |
+
},
|
| 73 |
+
"answer_template": {
|
| 74 |
+
"severe": {"priority": "P0", "labels": ["bug"], "assigned_team": "backend", "milestone": "hotfix"},
|
| 75 |
+
"moderate": {"priority": "P1", "labels": ["bug"], "assigned_team": "backend", "milestone": "v2.1"},
|
| 76 |
+
},
|
| 77 |
+
"severity_keywords": {
|
| 78 |
+
"severe": ["100%", "all mobile", "No workaround", "completely broken", "blocked",
|
| 79 |
+
"SegmentationFault", "OutOfMemoryError"],
|
| 80 |
+
"moderate": ["retry", "30%", "Temporary fix", "restart"],
|
| 81 |
+
},
|
| 82 |
+
},
|
| 83 |
|
| 84 |
+
"security": {
|
| 85 |
+
"titles": [
|
| 86 |
+
"SQL injection vulnerability in {endpoint}",
|
| 87 |
+
"XSS attack possible via {input_field}",
|
| 88 |
+
"Authentication bypass in {service}",
|
| 89 |
+
"Sensitive data exposed in {location}",
|
| 90 |
+
"{credential_type} not invalidated after {event}",
|
| 91 |
+
"SSRF vulnerability in {endpoint}",
|
| 92 |
+
],
|
| 93 |
+
"bodies": [
|
| 94 |
+
"The {endpoint} endpoint does not sanitize {input_field} inputs. "
|
| 95 |
+
"Crafted queries can {exploit_result}. PoC attached and verified on {env}. "
|
| 96 |
+
"Treat as confidential - do not discuss publicly until patched. {additional_context}",
|
| 97 |
+
"When a user {event}, existing {credential_type} remain valid for {duration}. "
|
| 98 |
+
"An attacker who {attack_vector} can continue to access the account. "
|
| 99 |
+
"This is a {vuln_category} vulnerability. {additional_context}",
|
| 100 |
+
],
|
| 101 |
+
"vars": {
|
| 102 |
+
"endpoint": ["/api/search", "/api/users", "/api/export", "/admin/query",
|
| 103 |
+
"/api/upload", "/graphql", "/api/webhook"],
|
| 104 |
+
"input_field": ["search query", "username field", "file upload name",
|
| 105 |
+
"comment body", "profile bio", "webhook URL"],
|
| 106 |
+
"service": ["login flow", "OAuth callback", "API gateway", "admin panel",
|
| 107 |
+
"password reset", "2FA verification"],
|
| 108 |
+
"location": ["API error responses", "debug logs shipped to client",
|
| 109 |
+
"public S3 bucket", "unencrypted cookies", "localStorage"],
|
| 110 |
+
"credential_type": ["JWT tokens", "session cookies", "API keys", "OAuth tokens"],
|
| 111 |
+
"event": ["changes their password", "revokes API access",
|
| 112 |
+
"is suspended by admin", "enables 2FA"],
|
| 113 |
+
"exploit_result": ["dump the entire user table including password hashes",
|
| 114 |
+
"execute arbitrary JavaScript in other users' browsers",
|
| 115 |
+
"access any user's account without credentials",
|
| 116 |
+
"read internal service endpoints via SSRF"],
|
| 117 |
+
"env": ["production", "staging", "production replica"],
|
| 118 |
+
"duration": ["up to 24 hours", "indefinitely", "until manual cache clear",
|
| 119 |
+
"for the full token TTL (7 days)"],
|
| 120 |
+
"attack_vector": ["previously stole a token", "intercepted a session cookie",
|
| 121 |
+
"obtained a leaked API key"],
|
| 122 |
+
"vuln_category": ["session management", "access control",
|
| 123 |
+
"injection", "broken authentication"],
|
| 124 |
+
"additional_context": [
|
| 125 |
+
"OWASP A03 - Injection.",
|
| 126 |
+
"OWASP A07 - Identification and Authentication Failures.",
|
| 127 |
+
"CVSS score estimated at 9.1 (Critical).",
|
| 128 |
+
"Compliance impact: potential GDPR violation if user PII is exfiltrated.",
|
| 129 |
+
"Bounty hunter reported this 48 hours ago β disclosure deadline approaching.",
|
| 130 |
+
],
|
| 131 |
+
},
|
| 132 |
+
"answer_template": {
|
| 133 |
+
"default": {"priority": "P0", "labels": ["bug", "security"],
|
| 134 |
+
"assigned_team": "security", "milestone": "hotfix"},
|
| 135 |
+
},
|
| 136 |
+
"severity_keywords": {"default": []},
|
| 137 |
+
},
|
| 138 |
+
|
| 139 |
+
"performance": {
|
| 140 |
+
"titles": [
|
| 141 |
+
"{page} loads slowly for {dataset_size}",
|
| 142 |
+
"Memory leak in {service} causes OOM after {duration}",
|
| 143 |
+
"API response time degrades under {load_condition}",
|
| 144 |
+
"{operation} takes {duration} for {dataset_size}",
|
| 145 |
+
"CPU spikes to 100% when {trigger}",
|
| 146 |
+
],
|
| 147 |
+
"bodies": [
|
| 148 |
+
"When {condition}, the {page} takes {response_time} to load. "
|
| 149 |
+
"{diagnostic_info}. {impact}. {workaround}",
|
| 150 |
+
"The {service} allocates memory during {operation} and never frees it. "
|
| 151 |
+
"Server runs out of memory every {duration}. {diagnostic_info}. "
|
| 152 |
+
"{workaround}",
|
| 153 |
+
],
|
| 154 |
+
"vars": {
|
| 155 |
+
"page": ["dashboard", "analytics page", "user list", "search results",
|
| 156 |
+
"audit log", "reports page", "admin overview"],
|
| 157 |
+
"service": ["background job processor", "cache warming service",
|
| 158 |
+
"log aggregator", "image resizer", "ETL pipeline"],
|
| 159 |
+
"dataset_size": ["large datasets (10k+ rows)", "enterprise accounts",
|
| 160 |
+
"tables with 100k+ entries", "files over 50MB"],
|
| 161 |
+
"duration": ["6 hours", "4 hours", "12 hours", "30+ seconds",
|
| 162 |
+
"2+ minutes", "an entire day"],
|
| 163 |
+
"load_condition": ["concurrent load", "peak traffic", "batch processing",
|
| 164 |
+
"more than 50 simultaneous users"],
|
| 165 |
+
"operation": ["bulk export", "report generation", "data migration",
|
| 166 |
+
"full-text search", "image processing"],
|
| 167 |
+
"trigger": ["running bulk exports", "processing large uploads",
|
| 168 |
+
"generating PDF reports", "reindexing search"],
|
| 169 |
+
"condition": ["a dataset has more than 10k rows",
|
| 170 |
+
"multiple users trigger exports simultaneously",
|
| 171 |
+
"the nightly ETL job runs alongside user traffic"],
|
| 172 |
+
"response_time": ["30+ seconds", "over a minute", "2-3 minutes",
|
| 173 |
+
"timeout after 60 seconds"],
|
| 174 |
+
"diagnostic_info": ["CPU spikes to 100%", "Heap profiler confirms the leak",
|
| 175 |
+
"Database EXPLAIN shows full table scan",
|
| 176 |
+
"N+1 query pattern detected in APM",
|
| 177 |
+
"Garbage collector running every 500ms"],
|
| 178 |
+
"impact": ["Affects power users with large accounts",
|
| 179 |
+
"All users experience slowness during peak hours",
|
| 180 |
+
"Requires manual restart to recover",
|
| 181 |
+
"Operational overhead: scheduled restarts every 4 hours"],
|
| 182 |
+
"workaround": ["Workaround: export data and use offline tools.",
|
| 183 |
+
"Workaround: scheduled restarts every 4 hours.",
|
| 184 |
+
"No workaround - users just wait.",
|
| 185 |
+
"Workaround: paginate results (but UX is degraded)."],
|
| 186 |
+
},
|
| 187 |
+
"answer_template": {
|
| 188 |
+
"severe": {"priority": "P1", "labels": ["bug", "performance"],
|
| 189 |
+
"assigned_team": "backend", "milestone": "v2.1"},
|
| 190 |
+
"moderate": {"priority": "P2", "labels": ["bug", "performance"],
|
| 191 |
+
"assigned_team": "backend", "milestone": "v2.1"},
|
| 192 |
+
},
|
| 193 |
+
"severity_keywords": {
|
| 194 |
+
"severe": ["OOM", "100%", "manual restart", "timeout", "No workaround",
|
| 195 |
+
"all users", "never frees"],
|
| 196 |
+
"moderate": ["Workaround", "power users", "paginate"],
|
| 197 |
+
},
|
| 198 |
+
},
|
| 199 |
+
|
| 200 |
+
"ui_bug": {
|
| 201 |
+
"titles": [
|
| 202 |
+
"{ui_element} breaks layout on {browser}",
|
| 203 |
+
"{ui_element} not rendering correctly in {mode}",
|
| 204 |
+
"Responsive layout broken on {device}",
|
| 205 |
+
"{feature} toggle not persisting across {context}",
|
| 206 |
+
"Accessibility: {ui_element} missing {a11y_attr}",
|
| 207 |
+
],
|
| 208 |
+
"bodies": [
|
| 209 |
+
"Switching to {mode} on {browser} causes {ui_element} to {visual_issue}. "
|
| 210 |
+
"{other_browsers}. {workaround}",
|
| 211 |
+
"On {device}, the {ui_element} is {visual_issue}. "
|
| 212 |
+
"Tested on {browser}. {impact}. {workaround}",
|
| 213 |
+
],
|
| 214 |
+
"vars": {
|
| 215 |
+
"ui_element": ["navigation bar", "sidebar menu", "modal dialog",
|
| 216 |
+
"dropdown selector", "data table", "footer",
|
| 217 |
+
"toast notifications", "breadcrumb trail"],
|
| 218 |
+
"browser": ["Safari 16", "Firefox ESR", "Chrome on Android",
|
| 219 |
+
"Edge on Windows", "iOS Safari", "Samsung Internet"],
|
| 220 |
+
"mode": ["dark mode", "high contrast mode", "RTL layout",
|
| 221 |
+
"compact view", "print view"],
|
| 222 |
+
"device": ["iPhone SE", "tablets in portrait", "screens below 768px",
|
| 223 |
+
"ultra-wide monitors", "4K displays"],
|
| 224 |
+
"feature": ["dark mode", "compact view", "language preference",
|
| 225 |
+
"notification settings"],
|
| 226 |
+
"context": ["page reloads", "different tabs", "sessions",
|
| 227 |
+
"browser restarts"],
|
| 228 |
+
"visual_issue": ["overlap the main content", "disappear entirely",
|
| 229 |
+
"render with incorrect colors", "become unclickable",
|
| 230 |
+
"overflow beyond the viewport"],
|
| 231 |
+
"other_browsers": ["Chrome and Firefox are unaffected.",
|
| 232 |
+
"Only reproducible on this specific browser.",
|
| 233 |
+
"Affects all WebKit-based browsers."],
|
| 234 |
+
"a11y_attr": ["ARIA labels", "keyboard focus indicators",
|
| 235 |
+
"screen reader text", "proper heading hierarchy"],
|
| 236 |
+
"impact": ["Cosmetic issue, no functional impact.",
|
| 237 |
+
"Users cannot access the affected feature.",
|
| 238 |
+
"Usability is degraded but the feature works."],
|
| 239 |
+
"workaround": ["Workaround: use a different browser.",
|
| 240 |
+
"Workaround: manually resize the window.",
|
| 241 |
+
"No workaround for this browser.",
|
| 242 |
+
"Workaround: disable the feature in settings."],
|
| 243 |
+
},
|
| 244 |
+
"answer_template": {
|
| 245 |
+
"severe": {"priority": "P2", "labels": ["bug", "ux"],
|
| 246 |
+
"assigned_team": "frontend", "milestone": "v2.1"},
|
| 247 |
+
"moderate": {"priority": "P3", "labels": ["bug", "ux"],
|
| 248 |
+
"assigned_team": "frontend", "milestone": "backlog"},
|
| 249 |
+
},
|
| 250 |
+
"severity_keywords": {
|
| 251 |
+
"severe": ["cannot access", "unclickable", "disappear", "No workaround"],
|
| 252 |
+
"moderate": ["Cosmetic", "different browser", "resize"],
|
| 253 |
+
},
|
| 254 |
+
},
|
| 255 |
+
|
| 256 |
+
"data_corruption": {
|
| 257 |
+
"titles": [
|
| 258 |
+
"Race condition in {feature}: {consequence}",
|
| 259 |
+
"Data inconsistency in {feature} under concurrent writes",
|
| 260 |
+
"{export_format} export produces corrupted output for {edge_case}",
|
| 261 |
+
"Stale data served from cache after {trigger}",
|
| 262 |
+
"Duplicate records created when {trigger}",
|
| 263 |
+
],
|
| 264 |
+
"bodies": [
|
| 265 |
+
"Under concurrent load, {feature} can {consequence} due to a race condition "
|
| 266 |
+
"in {root_cause}. Frequency: {frequency}. {impact}. {workaround}",
|
| 267 |
+
"When {feature} data contains {edge_case}, the exported {export_format} file "
|
| 268 |
+
"is corrupted and cannot be {consumer}. {impact}. {workaround}",
|
| 269 |
+
],
|
| 270 |
+
"vars": {
|
| 271 |
+
"feature": ["file upload", "order processing", "user registration",
|
| 272 |
+
"inventory update", "comment system", "permission assignment"],
|
| 273 |
+
"consequence": ["files occasionally overwrite each other",
|
| 274 |
+
"orders are duplicated or lost",
|
| 275 |
+
"users get assigned wrong permissions",
|
| 276 |
+
"inventory counts become negative"],
|
| 277 |
+
"root_cause": ["temp file naming logic", "lack of database locking",
|
| 278 |
+
"non-atomic read-modify-write cycle",
|
| 279 |
+
"missing unique constraint"],
|
| 280 |
+
"frequency": ["approximately 1 in 10,000 operations",
|
| 281 |
+
"consistently under 50+ concurrent users",
|
| 282 |
+
"intermittently - hard to reproduce",
|
| 283 |
+
"every time the batch job runs"],
|
| 284 |
+
"edge_case": ["non-ASCII characters (e.g., café, naïve)",
|
| 285 |
+
"values containing commas or quotes",
|
| 286 |
+
"null or empty fields",
|
| 287 |
+
"timestamps crossing DST boundaries"],
|
| 288 |
+
"export_format": ["CSV", "Excel", "JSON", "PDF"],
|
| 289 |
+
"consumer": ["opened in Excel", "parsed by downstream services",
|
| 290 |
+
"imported back into the system"],
|
| 291 |
+
"trigger": ["double-clicking the submit button",
|
| 292 |
+
"cache TTL expires during a write operation",
|
| 293 |
+
"two users edit the same record simultaneously",
|
| 294 |
+
"the nightly sync job overlaps with user activity"],
|
| 295 |
+
"impact": ["Potential data loss confirmed.",
|
| 296 |
+
"No data loss confirmed yet, but risk exists.",
|
| 297 |
+
"Affects users with international data.",
|
| 298 |
+
"Breaks downstream pipeline processing."],
|
| 299 |
+
"workaround": ["Workaround: enable sequential mode in settings.",
|
| 300 |
+
"Workaround: manually re-export after cleanup.",
|
| 301 |
+
"No reliable workaround - data must be manually verified.",
|
| 302 |
+
"Workaround: add a mutex lock externally (operational overhead)."],
|
| 303 |
+
},
|
| 304 |
+
"answer_template": {
|
| 305 |
+
"severe": {"priority": "P1", "labels": ["bug", "data-integrity"],
|
| 306 |
+
"assigned_team": "backend", "milestone": "v2.1"},
|
| 307 |
+
"moderate": {"priority": "P2", "labels": ["bug", "data-integrity"],
|
| 308 |
+
"assigned_team": "backend", "milestone": "v2.1"},
|
| 309 |
+
},
|
| 310 |
+
"severity_keywords": {
|
| 311 |
+
"severe": ["data loss", "No reliable workaround", "consistently",
|
| 312 |
+
"permissions", "overwrite", "negative"],
|
| 313 |
+
"moderate": ["No data loss", "intermittently", "sequential mode",
|
| 314 |
+
"re-export", "non-ASCII"],
|
| 315 |
+
},
|
| 316 |
+
},
|
| 317 |
+
|
| 318 |
+
"documentation": {
|
| 319 |
+
"titles": [
|
| 320 |
+
"Typo in {location}",
|
| 321 |
+
"Outdated {doc_type} on {page}",
|
| 322 |
+
"Missing documentation for {feature}",
|
| 323 |
+
"Incorrect {doc_element} in {location}",
|
| 324 |
+
],
|
| 325 |
+
"bodies": [
|
| 326 |
+
"There is a {issue_type} on the {page}: {detail}. No functional impact, "
|
| 327 |
+
"purely cosmetic. {extra}",
|
| 328 |
+
"The {doc_type} for {feature} is {issue_type}. {detail}. {extra}",
|
| 329 |
+
],
|
| 330 |
+
"vars": {
|
| 331 |
+
"location": ["homepage docs", "API reference", "README", "changelog",
|
| 332 |
+
"contributing guide", "onboarding wiki"],
|
| 333 |
+
"doc_type": ["installation guide", "API documentation", "changelog",
|
| 334 |
+
"migration guide", "code comments"],
|
| 335 |
+
"page": ["landing page", "docs homepage", "getting started page",
|
| 336 |
+
"FAQ section", "footer"],
|
| 337 |
+
"feature": ["new webhook API", "batch processing endpoint",
|
| 338 |
+
"SSO integration", "rate limiting"],
|
| 339 |
+
"doc_element": ["code example", "endpoint URL", "parameter description",
|
| 340 |
+
"copyright year", "version number"],
|
| 341 |
+
"issue_type": ["a typo", "outdated", "missing", "incorrect", "misleading"],
|
| 342 |
+
"detail": ["'Welccome' should be 'Welcome'",
|
| 343 |
+
"references removed v1.x API that no longer exists",
|
| 344 |
+
"completely undocumented despite being a core feature",
|
| 345 |
+
"shows '© 2022' but should be '© 2024'",
|
| 346 |
+
"the curl example uses the wrong HTTP method"],
|
| 347 |
+
"extra": ["", "Low priority - does not block any workflow.",
|
| 348 |
+
"New users have reported confusion.",
|
| 349 |
+
"Only noticed by contributors reading source code."],
|
| 350 |
+
},
|
| 351 |
+
"answer_template": {
|
| 352 |
+
"default": {"priority": "P3", "labels": ["documentation"],
|
| 353 |
+
"assigned_team": "devx", "milestone": "backlog"},
|
| 354 |
+
},
|
| 355 |
+
"severity_keywords": {"default": []},
|
| 356 |
+
},
|
| 357 |
+
|
| 358 |
+
"api_bug": {
|
| 359 |
+
"titles": [
|
| 360 |
+
"API rate limiter {issue} after {trigger}",
|
| 361 |
+
"{endpoint} returns {status_code} instead of {expected_code}",
|
| 362 |
+
"Pagination broken on {endpoint}: {symptom}",
|
| 363 |
+
"Webhook delivery {issue} for {event_type} events",
|
| 364 |
+
"API versioning: {endpoint} behaves differently on v1 vs v2",
|
| 365 |
+
],
|
| 366 |
+
"bodies": [
|
| 367 |
+
"After receiving a {status_code} response, {consequence}. "
|
| 368 |
+
"The {root_cause}. {impact}. {workaround}",
|
| 369 |
+
"The {endpoint} endpoint {symptom} when {trigger}. "
|
| 370 |
+
"Expected behavior: {expected}. Actual: {actual}. {impact}.",
|
| 371 |
+
],
|
| 372 |
+
"vars": {
|
| 373 |
+
"endpoint": ["/api/users", "/api/search", "/api/export",
|
| 374 |
+
"/api/webhooks", "/api/billing", "/api/analytics"],
|
| 375 |
+
"issue": ["blocks legitimate users", "fails silently",
|
| 376 |
+
"returns incorrect retry headers", "drops events"],
|
| 377 |
+
"trigger": ["a 429 error", "rate limit window resets",
|
| 378 |
+
"a burst of requests from CI/CD", "server restart"],
|
| 379 |
+
"status_code": ["429", "500", "502", "504", "403"],
|
| 380 |
+
"expected_code": ["200", "201", "204", "404"],
|
| 381 |
+
"symptom": ["returns duplicate entries",
|
| 382 |
+
"skips items between pages",
|
| 383 |
+
"returns empty page despite more data existing"],
|
| 384 |
+
"event_type": ["payment.completed", "user.created",
|
| 385 |
+
"subscription.cancelled", "deployment.finished"],
|
| 386 |
+
"consequence": ["legitimate users remain blocked for 1 hour",
|
| 387 |
+
"data is silently lost with no error",
|
| 388 |
+
"downstream services receive stale data"],
|
| 389 |
+
"root_cause": ["unblock logic has a bug - it never clears the blocked flag",
|
| 390 |
+
"cursor-based pagination uses wrong sort order",
|
| 391 |
+
"retry-after header reports seconds instead of milliseconds"],
|
| 392 |
+
"expected": ["200 OK with paginated results",
|
| 393 |
+
"successful delivery with retry on failure",
|
| 394 |
+
"proper rate limit reset after window expires"],
|
| 395 |
+
"actual": ["empty response with 200 status",
|
| 396 |
+
"permanent block until manual intervention",
|
| 397 |
+
"events dropped without any error log"],
|
| 398 |
+
"impact": ["Affects CI/CD pipelines hitting the API.",
|
| 399 |
+
"External integrations break silently.",
|
| 400 |
+
"Customer-facing dashboards show wrong data.",
|
| 401 |
+
"Retry-After header causes clients to wait too long."],
|
| 402 |
+
"workaround": ["Workaround: manually clear Redis key.",
|
| 403 |
+
"Workaround: add client-side deduplication.",
|
| 404 |
+
"No workaround - requires server-side fix.",
|
| 405 |
+
"Workaround: pin API version to v1 in headers."],
|
| 406 |
+
},
|
| 407 |
+
"answer_template": {
|
| 408 |
+
"severe": {"priority": "P1", "labels": ["bug", "api"],
|
| 409 |
+
"assigned_team": "backend", "milestone": "v2.1"},
|
| 410 |
+
"moderate": {"priority": "P2", "labels": ["bug", "api"],
|
| 411 |
+
"assigned_team": "backend", "milestone": "v2.1"},
|
| 412 |
+
},
|
| 413 |
+
"severity_keywords": {
|
| 414 |
+
"severe": ["silently lost", "permanent block", "No workaround",
|
| 415 |
+
"dropped", "external integrations"],
|
| 416 |
+
"moderate": ["Workaround", "pin API", "deduplication"],
|
| 417 |
+
},
|
| 418 |
+
},
|
| 419 |
+
}
|
| 420 |
+
|
| 421 |
+
|
| 422 |
+
# The original handcrafted bugs, kept as a gold-standard subset
|
| 423 |
+
_HANDCRAFTED_BUGS = {
|
| 424 |
"easy": {
|
| 425 |
"bugs": [
|
| 426 |
BugReport(
|
|
|
|
| 432 |
author="user123",
|
| 433 |
labels_hint=[],
|
| 434 |
comments=["Confirmed on iOS and Android.", "Happens every time."],
|
| 435 |
+
severity_signals=["100% of users", "crashes", "no workaround"],
|
| 436 |
+
stack_trace="NullPointerException at AuthController.java:87",
|
| 437 |
+
affected_component="auth-service",
|
| 438 |
),
|
| 439 |
BugReport(
|
| 440 |
id="easy-002",
|
|
|
|
| 444 |
author="docs_fan",
|
| 445 |
labels_hint=["documentation"],
|
| 446 |
comments=[],
|
| 447 |
+
severity_signals=["cosmetic", "no functional impact"],
|
| 448 |
+
stack_trace="",
|
| 449 |
+
affected_component="docs",
|
| 450 |
),
|
| 451 |
BugReport(
|
| 452 |
id="easy-003",
|
|
|
|
| 456 |
author="power_user",
|
| 457 |
labels_hint=["performance"],
|
| 458 |
comments=["Noticed after the last deploy.", "CPU spikes to 100%."],
|
| 459 |
+
severity_signals=["workaround exists", "power users only"],
|
| 460 |
+
stack_trace="",
|
| 461 |
+
affected_component="dashboard",
|
| 462 |
),
|
| 463 |
BugReport(
|
| 464 |
id="easy-004",
|
|
|
|
| 468 |
"Affects all users attempting password reset.",
|
| 469 |
author="support_team",
|
| 470 |
labels_hint=["bug"],
|
| 471 |
+
comments=["Reported by 12 users this week.",
|
| 472 |
+
"Started after email service migration."],
|
| 473 |
+
severity_signals=["all users", "never dispatched"],
|
| 474 |
+
stack_trace="",
|
| 475 |
+
affected_component="email-service",
|
| 476 |
),
|
| 477 |
BugReport(
|
| 478 |
id="easy-005",
|
|
|
|
| 482 |
author="intern_dev",
|
| 483 |
labels_hint=["documentation"],
|
| 484 |
comments=[],
|
| 485 |
+
severity_signals=["no functional impact"],
|
| 486 |
+
stack_trace="",
|
| 487 |
+
affected_component="frontend",
|
| 488 |
),
|
| 489 |
],
|
|
|
|
| 490 |
"answers": {
|
| 491 |
"easy-001": {"priority": "P0"},
|
| 492 |
"easy-002": {"priority": "P3"},
|
|
|
|
| 507 |
author="store_owner",
|
| 508 |
labels_hint=["bug"],
|
| 509 |
comments=["Revenue impact confirmed.", "Happening since Tuesday."],
|
| 510 |
+
severity_signals=["revenue loss", "silently", "every failed checkout"],
|
| 511 |
+
stack_trace="Stripe API: card_declined at PaymentService.py:145",
|
| 512 |
+
affected_component="payment-service",
|
| 513 |
),
|
| 514 |
BugReport(
|
| 515 |
id="med-002",
|
|
|
|
| 520 |
author="moderator_jane",
|
| 521 |
labels_hint=[],
|
| 522 |
comments=["GDPR concern - deleted content still visible."],
|
| 523 |
+
severity_signals=["GDPR violation", "deleted content visible"],
|
| 524 |
+
stack_trace="",
|
| 525 |
+
affected_component="search-index",
|
| 526 |
),
|
| 527 |
BugReport(
|
| 528 |
id="med-003",
|
|
|
|
| 532 |
author="safari_user",
|
| 533 |
labels_hint=["bug", "ux"],
|
| 534 |
comments=["Only on Safari, not Chrome/Firefox."],
|
| 535 |
+
severity_signals=["workaround exists", "single browser"],
|
| 536 |
+
stack_trace="",
|
| 537 |
+
affected_component="frontend-css",
|
| 538 |
),
|
| 539 |
BugReport(
|
| 540 |
id="med-004",
|
|
|
|
| 544 |
"Affects users with international data.",
|
| 545 |
author="data_analyst",
|
| 546 |
labels_hint=["bug"],
|
| 547 |
+
comments=["Encoding issue - UTF-8 not respected.",
|
| 548 |
+
"Workaround: manual copy-paste."],
|
| 549 |
+
severity_signals=["corrupted", "workaround exists"],
|
| 550 |
+
stack_trace="",
|
| 551 |
+
affected_component="export-service",
|
| 552 |
),
|
| 553 |
BugReport(
|
| 554 |
id="med-005",
|
|
|
|
| 558 |
"The unblock logic has a bug - it never clears the blocked flag.",
|
| 559 |
author="api_user",
|
| 560 |
labels_hint=["bug"],
|
| 561 |
+
comments=["Affects CI/CD pipelines hitting the API.",
|
| 562 |
+
"Retry-After header is wrong."],
|
| 563 |
+
severity_signals=["permanent block", "never clears", "bug in logic"],
|
| 564 |
+
stack_trace="",
|
| 565 |
+
affected_component="api-gateway",
|
| 566 |
),
|
| 567 |
],
|
| 568 |
"answers": {
|
|
|
|
| 586 |
author="security_researcher",
|
| 587 |
labels_hint=[],
|
| 588 |
comments=["Critical. Affects production.", "Do not discuss publicly."],
|
| 589 |
+
severity_signals=["SQL injection", "password hashes", "production",
|
| 590 |
+
"confidential"],
|
| 591 |
+
stack_trace="",
|
| 592 |
+
affected_component="search-api",
|
| 593 |
),
|
| 594 |
BugReport(
|
| 595 |
id="hard-002",
|
|
|
|
| 601 |
author="devops_alice",
|
| 602 |
labels_hint=["performance"],
|
| 603 |
comments=["Verified with heap profiler.", "Started in v1.9."],
|
| 604 |
+
severity_signals=["memory leak", "OOM", "manual restart", "never frees"],
|
| 605 |
+
stack_trace="HeapDump: JobProcessor.process() -> 50MB/call, never GC'd",
|
| 606 |
+
affected_component="job-processor",
|
| 607 |
),
|
| 608 |
BugReport(
|
| 609 |
id="hard-003",
|
|
|
|
| 616 |
"Risk is low-probability but affects data integrity.",
|
| 617 |
author="qa_bot",
|
| 618 |
labels_hint=["bug"],
|
| 619 |
+
comments=["Reproduced with locust at 50 concurrent users.",
|
| 620 |
+
"Sequential mode avoids it."],
|
| 621 |
+
severity_signals=["race condition", "data integrity",
|
| 622 |
+
"workaround exists", "low-probability"],
|
| 623 |
+
stack_trace="",
|
| 624 |
+
affected_component="file-upload",
|
| 625 |
),
|
| 626 |
BugReport(
|
| 627 |
id="hard-004",
|
|
|
|
| 632 |
"This is a session management security vulnerability.",
|
| 633 |
author="pentest_team",
|
| 634 |
labels_hint=["security"],
|
| 635 |
+
comments=["Verified on staging.",
|
| 636 |
+
"OWASP A07 - Identification and Authentication Failures."],
|
| 637 |
+
severity_signals=["JWT not invalidated", "attacker", "security vulnerability",
|
| 638 |
+
"stolen token"],
|
| 639 |
+
stack_trace="",
|
| 640 |
+
affected_component="auth-service",
|
| 641 |
),
|
| 642 |
BugReport(
|
| 643 |
id="hard-005",
|
|
|
|
| 648 |
"Triggered in production twice this week. Requires process kill to recover.",
|
| 649 |
author="oncall_eng",
|
| 650 |
labels_hint=["bug", "performance"],
|
| 651 |
+
comments=["PagerDuty alert fired twice.",
|
| 652 |
+
"Needs exponential backoff + max retry cap."],
|
| 653 |
+
severity_signals=["infinite loop", "100%", "production",
|
| 654 |
+
"process kill", "starves other services"],
|
| 655 |
+
stack_trace="Thread dump: WebhookRetrier.retry() - recursive call, no exit",
|
| 656 |
+
affected_component="webhook-service",
|
| 657 |
),
|
| 658 |
],
|
| 659 |
"answers": {
|
| 660 |
"hard-001": {
|
| 661 |
+
"priority": "P0", "labels": ["bug", "security"],
|
| 662 |
+
"assigned_team": "security", "milestone": "hotfix",
|
|
|
|
|
|
|
| 663 |
},
|
| 664 |
"hard-002": {
|
| 665 |
+
"priority": "P1", "labels": ["bug", "performance"],
|
| 666 |
+
"assigned_team": "backend", "milestone": "v2.1",
|
|
|
|
|
|
|
| 667 |
},
|
| 668 |
"hard-003": {
|
| 669 |
+
"priority": "P1", "labels": ["bug", "data-integrity"],
|
| 670 |
+
"assigned_team": "backend", "milestone": "v2.1",
|
|
|
|
|
|
|
| 671 |
},
|
| 672 |
"hard-004": {
|
| 673 |
+
"priority": "P0", "labels": ["bug", "security"],
|
| 674 |
+
"assigned_team": "security", "milestone": "hotfix",
|
|
|
|
|
|
|
| 675 |
},
|
| 676 |
"hard-005": {
|
| 677 |
+
"priority": "P0", "labels": ["bug", "performance"],
|
| 678 |
+
"assigned_team": "backend", "milestone": "hotfix",
|
|
|
|
|
|
|
| 679 |
},
|
| 680 |
},
|
| 681 |
},
|
| 682 |
}
|
| 683 |
|
| 684 |
|
| 685 |
+
# Combine into single TASKS dict (backward compatible)
|
| 686 |
+
TASKS = _HANDCRAFTED_BUGS
|
| 687 |
+
|
| 688 |
+
|
| 689 |
+
# ---------------------------------------------------------------------------
|
| 690 |
+
# PROCEDURAL BUG GENERATOR
|
| 691 |
+
# ---------------------------------------------------------------------------
|
| 692 |
+
|
| 693 |
+
def _determine_severity(text: str, keywords: Dict[str, list]) -> str:
|
| 694 |
+
"""Check which severity level the generated text matches."""
|
| 695 |
+
text_lower = text.lower()
|
| 696 |
+
for level, kws in keywords.items():
|
| 697 |
+
if level == "default":
|
| 698 |
+
return "default"
|
| 699 |
+
hits = sum(1 for kw in kws if kw.lower() in text_lower)
|
| 700 |
+
if hits >= 1:
|
| 701 |
+
return level
|
| 702 |
+
# fallback to first non-default key
|
| 703 |
+
return list(keywords.keys())[0] if keywords else "moderate"
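The keyword lookup above can be tried in isolation. The following is a minimal standalone sketch, assuming the same first-match semantics as `_determine_severity`; the function name `determine_severity` and the sample sentences are illustrative, and the keyword table is copied from the `ui_bug` template.

```python
from typing import Dict, List


def determine_severity(text: str, keywords: Dict[str, List[str]]) -> str:
    """Return the first severity level whose keyword list matches the text."""
    text_lower = text.lower()
    for level, kws in keywords.items():
        # A "default" entry short-circuits: such categories have one answer.
        if level == "default":
            return "default"
        if any(kw.lower() in text_lower for kw in kws):
            return level
    # No keyword matched: fall back to the first declared level.
    return next(iter(keywords), "moderate")


ui_keywords = {
    "severe": ["cannot access", "unclickable", "disappear", "No workaround"],
    "moderate": ["Cosmetic", "different browser", "resize"],
}

severe = determine_severity(
    "Switching to dark mode causes the footer to become unclickable.", ui_keywords)
moderate = determine_severity("Cosmetic issue, no functional impact.", ui_keywords)
fallback = determine_severity("Nothing in this text matches.", ui_keywords)
print(severe, moderate, fallback)
```

Two properties are worth keeping in mind when editing the keyword tables: levels are checked in dictionary order, so a "severe" keyword wins over a "moderate" one when both appear, and text that matches nothing falls back to the first declared level, which is "severe" for the templates in this file.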
|
| 704 |
+
|
| 705 |
+
|
| 706 |
+
def generate_bug(task_key: str, seed: int = None) -> Tuple[BugReport, dict]:
|
| 707 |
+
"""Generate a procedural bug report with its correct answer."""
|
| 708 |
+
rng = random.Random(seed)
|
| 709 |
+
|
| 710 |
+
# Weight categories by difficulty
|
| 711 |
+
weights = {
|
| 712 |
+
"easy": {"documentation": 3, "ui_bug": 3, "performance": 2,
|
| 713 |
+
"crash": 1, "api_bug": 1},
|
| 714 |
+
"medium": {"crash": 3, "performance": 3, "api_bug": 2,
|
| 715 |
+
"data_corruption": 2, "ui_bug": 1},
|
| 716 |
+
"hard": {"security": 4, "crash": 3, "data_corruption": 3,
|
| 717 |
+
"performance": 2, "api_bug": 2},
|
| 718 |
+
}
|
| 719 |
+
|
| 720 |
+
task_weights = weights.get(task_key, weights["medium"])
|
| 721 |
+
categories = []
|
| 722 |
+
for cat, w in task_weights.items():
|
| 723 |
+
categories.extend([cat] * w)
|
| 724 |
+
category = rng.choice(categories)
|
| 725 |
+
|
| 726 |
+
template = _BUG_TEMPLATES[category]
|
| 727 |
+
|
| 728 |
+
# Pick random variable values
|
| 729 |
+
chosen_vars = {}
|
| 730 |
+
for var_name, options in template["vars"].items():
|
| 731 |
+
chosen_vars[var_name] = rng.choice(options)
|
| 732 |
+
|
| 733 |
+
# Build title and body
|
| 734 |
+
title_tmpl = rng.choice(template["titles"])
|
| 735 |
+
body_tmpl = rng.choice(template["bodies"])
|
| 736 |
+
|
| 737 |
+
# Safe format: ignore missing keys
|
| 738 |
+
def safe_format(tmpl, vars_dict):
|
| 739 |
+
result = tmpl
|
| 740 |
+
for k, v in vars_dict.items():
|
| 741 |
+
result = result.replace("{" + k + "}", v)
|
| 742 |
+
return result
|
| 743 |
+
|
| 744 |
+
title = safe_format(title_tmpl, chosen_vars)
|
| 745 |
+
body = safe_format(body_tmpl, chosen_vars)
|
| 746 |
+
|
| 747 |
+
# Generate unique ID from seed
|
| 748 |
+
bug_id = f"gen-{seed if seed is not None else rng.randint(0, 999999):06d}"
|
| 749 |
+
|
| 750 |
+
# Pick author
|
| 751 |
+
authors = ["user_report", "qa_engineer", "support_team", "dev_oncall",
|
| 752 |
+
"security_bot", "customer_jane", "automated_monitor",
|
| 753 |
+
"intern_dev", "senior_eng", "pm_feedback"]
|
| 754 |
+
author = rng.choice(authors)
|
| 755 |
+
|
| 756 |
+
# Build comments
|
| 757 |
+
comment_templates = [
|
| 758 |
+
"Confirmed on our side.", "Reproduced in staging.",
|
| 759 |
+
"Multiple reports from users.", "Started after last deployment.",
|
| 760 |
+
"Urgent - customer escalation.", "Low priority - no user complaints.",
|
| 761 |
+
"Needs investigation.", "Related to ticket from last sprint.",
|
| 762 |
+
]
|
| 763 |
+
num_comments = rng.randint(0, 3)
|
| 764 |
+
comments = rng.sample(comment_templates, min(num_comments, len(comment_templates)))
|
| 765 |
+
|
| 766 |
+
# Determine severity and answer
|
| 767 |
+
full_text = f"{title} {body} {' '.join(comments)}"
|
| 768 |
+
severity_kws = template.get("severity_keywords", {})
|
| 769 |
+
severity = _determine_severity(full_text, severity_kws)
|
| 770 |
+
|
| 771 |
+
answer_templates = template["answer_template"]
|
| 772 |
+
answer = dict(answer_templates.get(severity, list(answer_templates.values())[0]))
|
| 773 |
+
|
| 774 |
+
# For easy tasks, only priority matters
|
| 775 |
+
if task_key == "easy":
|
| 776 |
+
answer = {"priority": answer["priority"]}
|
| 777 |
+
elif task_key == "medium":
|
| 778 |
+
answer.pop("milestone", None)
|
| 779 |
+
|
| 780 |
+
bug = BugReport(
|
| 781 |
+
id=bug_id,
|
| 782 |
+
title=title,
|
| 783 |
+
body=body,
|
| 784 |
+
author=author,
|
| 785 |
+
labels_hint=rng.sample(["bug", "needs-triage", "reported"], rng.randint(0, 2)),
|
| 786 |
+
comments=comments,
|
| 787 |
+
severity_signals=[],
|
| 788 |
+
stack_trace="",
|
| 789 |
+
affected_component=chosen_vars.get("service", chosen_vars.get("endpoint", "")),
|
| 790 |
+
)
|
| 791 |
+
|
| 792 |
+
return bug, answer
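The two core steps of `generate_bug`, weighted category selection and placeholder substitution, can be sketched on their own. This is an illustration under stated assumptions, not the file's implementation: it swaps the expanded-list approach (`[cat] * w` followed by `rng.choice`) for `random.Random.choices` with explicit weights, so a given seed will not pick the same category as the original code. The helper names `pick_category` and `fill` are hypothetical.

```python
import random

# Category weights for the "hard" tier, copied from the weights table above.
hard_weights = {"security": 4, "crash": 3, "data_corruption": 3,
                "performance": 2, "api_bug": 2}


def pick_category(rng: random.Random, task_weights: dict) -> str:
    """Draw one category with probability proportional to its weight."""
    names = list(task_weights)
    return rng.choices(names, weights=[task_weights[n] for n in names], k=1)[0]


def fill(template: str, values: dict) -> str:
    """Replace {name} placeholders; unknown placeholders are left untouched."""
    for key, value in values.items():
        template = template.replace("{" + key + "}", value)
    return template


rng = random.Random(42)
category = pick_category(rng, hard_weights)
title = fill("CPU spikes to 100% when {trigger}",
             {"trigger": "reindexing search", "page": "dashboard"})
partial = fill("{page} loads slowly for {dataset_size}", {"page": "dashboard"})
print(category, "|", title, "|", partial)
```

Unlike `str.format`, the replace-based substitution never raises on a missing key, which is why a template can safely be combined with a variable set that does not cover every placeholder; the cost is that an unfilled `{dataset_size}` would pass through to the generated report unnoticed.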
|
| 793 |
+
|
| 794 |
+
|
| 795 |
+
# ---------------------------------------------------------------------------
|
| 796 |
+
# BUG SAMPLER: uses handcrafted bugs first, then procedural for variety
|
| 797 |
+
# ---------------------------------------------------------------------------
|
| 798 |
+
|
| 799 |
+
def sample_bug(task_key: str, seed: int = None) -> Tuple[BugReport, dict]:
|
| 800 |
+
"""Return a bug and its answer. Mixes handcrafted + procedural."""
|
| 801 |
+
rng = random.Random(seed)
|
| 802 |
+
|
| 803 |
+
# 40% chance of handcrafted, 60% procedural
|
| 804 |
+
if rng.random() < 0.4 and task_key in _HANDCRAFTED_BUGS:
|
| 805 |
+
bugs = _HANDCRAFTED_BUGS[task_key]["bugs"]
|
| 806 |
+
bug = rng.choice(bugs)
|
| 807 |
+
answer = _HANDCRAFTED_BUGS[task_key]["answers"][bug.id]
|
| 808 |
+
return bug, answer
|
| 809 |
+
else:
|
| 810 |
+
gen_seed = seed if seed is not None else rng.randint(0, 999999)
|
| 811 |
+
return generate_bug(task_key, seed=gen_seed)
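Because `sample_bug` builds its own `random.Random(seed)`, the handcrafted/procedural decision is a pure function of the seed. A small sketch of just that decision, assuming the same 0.4 threshold (the helper name `pick_source` is hypothetical):

```python
import random


def pick_source(seed: int) -> str:
    """Mirror the 40/60 split: the first draw from a seeded RNG decides."""
    rng = random.Random(seed)
    return "handcrafted" if rng.random() < 0.4 else "procedural"


# The same seed always yields the same source, so episodes are reproducible.
repeatable = pick_source(3) == pick_source(3)

# Across many seeds the handcrafted share should sit near 40%.
sources = [pick_source(s) for s in range(1000)]
share = sources.count("handcrafted") / len(sources)
print(repeatable, round(share, 2))
```

One consequence of this design is that a caller who passes `seed=None` gets a fresh, unseeded RNG on every call, so reproducible evaluation runs need to pass explicit seeds.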
|
| 812 |
+
|
| 813 |
+
|
| 814 |
+
# ---------------------------------------------------------------------------
|
| 815 |
+
# GRADING: with semantic label matching
|
| 816 |
+
# ---------------------------------------------------------------------------
|
| 817 |
|
| 818 |
PRIORITY_ORDER = {"P0": 0, "P1": 1, "P2": 2, "P3": 3}
|
| 819 |
|
| 820 |
|
| 821 |
def _priority_score(predicted: str, correct: str) -> float:
|
| 822 |
+
"""Score priority assignment with partial credit for near-misses."""
|
| 823 |
if predicted == correct:
|
| 824 |
return 0.95
|
| 825 |
+
pred_rank = PRIORITY_ORDER.get(predicted, 99)
|
| 826 |
+
corr_rank = PRIORITY_ORDER.get(correct, 99)
|
| 827 |
+
diff = abs(pred_rank - corr_rank)
|
| 828 |
+
if diff == 1:
|
| 829 |
+
return 0.5
|
| 830 |
+
elif diff == 2:
|
| 831 |
+
return 0.2
|
| 832 |
+
return 0.05
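As a worked example of the partial-credit scheme, the sketch below restates the scoring rule in compact form and tabulates the score of every prediction against a correct answer of P1. The function name `priority_score` and the dictionary-based lookup are illustrative; the thresholds are the ones in `_priority_score`.

```python
PRIORITY_ORDER = {"P0": 0, "P1": 1, "P2": 2, "P3": 3}


def priority_score(predicted: str, correct: str) -> float:
    """Exact match scores 0.95; near-misses earn partial credit by distance."""
    if predicted == correct:
        return 0.95
    # Unknown priorities map to rank 99, which lands in the 0.05 bucket.
    diff = abs(PRIORITY_ORDER.get(predicted, 99) - PRIORITY_ORDER.get(correct, 99))
    return {1: 0.5, 2: 0.2}.get(diff, 0.05)


table = {p: priority_score(p, "P1") for p in ["P0", "P1", "P2", "P3", "urgent"]}
print(table)
```

The score is symmetric in distance: over-escalating P1 to P0 and under-escalating it to P2 both earn 0.5.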
|
| 833 |
|
| 834 |
|
| 835 |
+
def _normalize_label(label: str) -> str:
|
| 836 |
+
"""Normalize a label to its canonical form."""
|
| 837 |
+
label_lower = label.lower().strip()
|
| 838 |
+
for canonical, synonyms in LABEL_SYNONYMS.items():
|
| 839 |
+
if label_lower == canonical or label_lower in synonyms:
|
| 840 |
+
return canonical
|
| 841 |
+
return label_lower
|
| 842 |
+
|
| 843 |
|
| 844 |
def _label_score(predicted: List[str], correct: List[str]) -> float:
|
| 845 |
+
"""Score labels using semantic matching via synonym groups."""
|
| 846 |
+
pred_normalized = set(_normalize_label(l) for l in predicted)
|
| 847 |
+
corr_normalized = set(_normalize_label(l) for l in correct)
|
| 848 |
+
|
| 849 |
+
if not corr_normalized:
|
| 850 |
return 0.95
|
| 851 |
+
|
| 852 |
+
intersection = pred_normalized & corr_normalized
|
| 853 |
+
union = pred_normalized | corr_normalized
|
| 854 |
+
|
| 855 |
+
raw = len(intersection) / len(union) if union else 0.0
|
| 856 |
return max(0.05, min(0.95, raw))
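The label score is a Jaccard index over normalized labels, clamped to [0.05, 0.95]. The sketch below shows the effect on three predictions. `LABEL_SYNONYMS` is defined elsewhere in the file and is not shown here, so the synonym table in this example is a made-up stand-in used only for illustration.

```python
from typing import List

# Hypothetical synonym groups; the real LABEL_SYNONYMS table may differ.
LABEL_SYNONYMS = {
    "performance": {"perf", "slow"},
    "security": {"vulnerability", "sec"},
}


def normalize(label: str) -> str:
    """Map a label to its canonical form, or return it lowercased."""
    lowered = label.lower().strip()
    for canonical, synonyms in LABEL_SYNONYMS.items():
        if lowered == canonical or lowered in synonyms:
            return canonical
    return lowered


def label_score(predicted: List[str], correct: List[str]) -> float:
    """Jaccard similarity of normalized label sets, clamped to [0.05, 0.95]."""
    pred = {normalize(x) for x in predicted}
    corr = {normalize(x) for x in correct}
    if not corr:
        return 0.95
    raw = len(pred & corr) / len(pred | corr)
    return max(0.05, min(0.95, raw))


exact = label_score(["Bug", "perf"], ["bug", "performance"])  # synonyms match
extra = label_score(["bug", "performance", "ux"], ["bug", "performance"])
miss = label_score(["ux"], ["bug", "security"])
print(exact, round(extra, 3), miss)
```

Because the denominator is the union, adding an unneeded label lowers the score (2 of 3 here), so an agent cannot gain credit by listing every label it knows.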
|
| 857 |
|
| 858 |
|
| 859 |
+
def _reasoning_score(reasoning: str, answer: dict) -> float:
|
| 860 |
+
"""Bonus for reasoning that mentions relevant signals."""
|
| 861 |
+
if not reasoning or len(reasoning.strip()) < 10:
|
| 862 |
+
return 0.0
|
| 863 |
+
|
| 864 |
+
key_signals = {
|
| 865 |
+
"P0": ["production", "all users", "data loss", "security", "crash",
|
| 866 |
+
"revenue", "injection", "vulnerability", "100%"],
|
| 867 |
+
"P1": ["major", "significant", "no workaround", "broken",
|
| 868 |
+
"gdpr", "blocked", "leak", "never"],
|
| 869 |
+
"P2": ["degraded", "workaround", "partial", "slow",
|
| 870 |
+
"affected", "power users"],
|
| 871 |
+
"P3": ["minor", "cosmetic", "docs", "typo", "low",
|
| 872 |
+
"no functional impact"],
|
| 873 |
+
}
|
| 874 |
+
|
| 875 |
+
expected_priority = answer.get("priority", "P2")
|
| 876 |
+
signals = key_signals.get(expected_priority, [])
|
| 877 |
+
reasoning_lower = reasoning.lower()
|
| 878 |
+
|
| 879 |
+
hits = sum(1 for s in signals if s in reasoning_lower)
|
| 880 |
+
return min(0.15, hits * 0.05)
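The reasoning bonus awards 0.05 per matched signal, capped at 0.15, using plain substring checks. A minimal sketch using the P0 and P3 signal lists from `key_signals` above (the function name `reasoning_bonus` and the sample sentences are illustrative):

```python
KEY_SIGNALS = {
    "P0": ["production", "all users", "data loss", "security", "crash",
           "revenue", "injection", "vulnerability", "100%"],
    "P3": ["minor", "cosmetic", "docs", "typo", "low", "no functional impact"],
}


def reasoning_bonus(reasoning: str, expected_priority: str) -> float:
    """0.05 per signal found in the reasoning, capped at 0.15."""
    if not reasoning or len(reasoning.strip()) < 10:
        return 0.0
    text = reasoning.lower()
    hits = sum(1 for s in KEY_SIGNALS.get(expected_priority, []) if s in text)
    return min(0.15, hits * 0.05)


full = reasoning_bonus("Crash in production affecting all users", "P0")
too_short = reasoning_bonus("crash", "P0")
substring = reasoning_bonus("The page is slow under load", "P3")
print(full, too_short, substring)
```

The last call illustrates a side effect of substring matching: "slow" contains the P3 signal "low", so it earns a bonus even though the reasoning does not argue for low priority. Word-boundary matching would avoid this, at the cost of missing inflected forms.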
|
| 881 |
+
|
| 882 |
+
|
| 883 |
+
def grade_action(task_key: str, bug: BugReport, action: TriageAction,
|
| 884 |
+
answer: dict = None) -> Tuple[float, str]:
|
| 885 |
+
"""Grade the agent's triage action against the correct answer."""
|
| 886 |
+
|
| 887 |
+
# Backward compatibility: look up answer from handcrafted if not provided
|
| 888 |
+
if answer is None:
|
| 889 |
+
if task_key in _HANDCRAFTED_BUGS and bug.id in _HANDCRAFTED_BUGS[task_key]["answers"]:
|
| 890 |
+
answer = _HANDCRAFTED_BUGS[task_key]["answers"][bug.id]
|
| 891 |
+
else:
|
| 892 |
+
return 0.5, "No answer key found for this bug."
|
| 893 |
+
|
| 894 |
feedback_parts = []
|
| 895 |
+
reasoning_bonus = _reasoning_score(action.reasoning, answer)
|
| 896 |
|
| 897 |
if task_key == "easy":
|
| 898 |
score = _priority_score(action.priority, answer["priority"])
|
| 899 |
symbol = "✓" if score >= 0.9 else "~" if score >= 0.4 else "✗"
|
| 900 |
+
feedback_parts.append(
|
| 901 |
+
f"Priority: {symbol} (got {action.priority}, expected {answer['priority']})")
|
| 902 |
+
score = score + reasoning_bonus
|
| 903 |
score = max(0.01, min(0.99, score))
|
| 904 |
return round(score, 3), " | ".join(feedback_parts)
|
| 905 |
|
| 906 |
elif task_key == "medium":
|
| 907 |
p_score = _priority_score(action.priority, answer["priority"])
|
| 908 |
+
l_score = _label_score(action.labels, answer.get("labels", []))
|
| 909 |
expected_team = answer.get("assigned_team", "")
|
| 910 |
t_score = 0.95 if expected_team and action.assigned_team.lower() == expected_team.lower() else 0.05
|
| 911 |
+
|
| 912 |
+
score = 0.45 * p_score + 0.40 * l_score + 0.15 * t_score + reasoning_bonus
|
| 913 |
+
|
| 914 |
+
feedback_parts.append(
|
| 915 |
+
f"Priority: {p_score:.2f} (got {action.priority}, expected {answer['priority']})")
|
| 916 |
+
feedback_parts.append(f"Labels: {l_score:.2f} (semantic match)")
|
| 917 |
+
feedback_parts.append(
|
| 918 |
+
f"Team: {t_score:.2f} (got {action.assigned_team}, expected {expected_team})")
|
| 919 |
+
if reasoning_bonus > 0:
|
| 920 |
+
feedback_parts.append(f"Reasoning bonus: +{reasoning_bonus:.2f}")
|
| 921 |
+
|
| 922 |
score = max(0.01, min(0.99, score))
|
| 923 |
return round(score, 3), " | ".join(feedback_parts)
|
| 924 |
|
| 925 |
else: # hard
|
| 926 |
p_score = _priority_score(action.priority, answer["priority"])
|
| 927 |
+
l_score = _label_score(action.labels, answer.get("labels", []))
|
| 928 |
t_score = 0.95 if action.assigned_team.lower() == answer["assigned_team"].lower() else 0.05
|
| 929 |
m_score = 0.95 if action.milestone.lower() == answer["milestone"].lower() else 0.05
|
| 930 |
+
|
| 931 |
+
score = 0.35 * p_score + 0.30 * l_score + 0.20 * t_score + 0.15 * m_score + reasoning_bonus
|
| 932 |
+
|
| 933 |
+
feedback_parts.append(
|
| 934 |
+
f"Priority: {p_score:.2f} (got {action.priority}, expected {answer['priority']})")
|
| 935 |
+
feedback_parts.append(f"Labels: {l_score:.2f} (semantic match)")
|
| 936 |
+
feedback_parts.append(
|
| 937 |
+
f"Team: {t_score:.2f} (got {action.assigned_team}, expected {answer['assigned_team']})")
|
| 938 |
+
feedback_parts.append(
|
| 939 |
+
f"Milestone: {m_score:.2f} (got {action.milestone}, expected {answer['milestone']})")
|
| 940 |
+
|
| 941 |
+
if reasoning_bonus > 0:
|
| 942 |
+
feedback_parts.append(f"Reasoning bonus: +{reasoning_bonus:.2f}")
|
| 943 |
+
|
| 944 |
+
# Security escalation penalty
|
| 945 |
if answer.get("assigned_team") == "security" and action.assigned_team.lower() != "security":
|
| 946 |
score = max(0.01, score - 0.15)
|
| 947 |
feedback_parts.append("⚠ Security escalation missed (-0.15)")
|
| 948 |
+
|
| 949 |
score = max(0.01, min(0.99, score))
|
| 950 |
return round(score, 3), " | ".join(feedback_parts)
|
| 951 |
+
|
| 952 |
+
|
| 953 |
+
# ---------------------------------------------------------------------------
|
| 954 |
+
# NAMED GRADER FUNCTIONS - referenced by openenv.yaml
|
| 955 |
+
# ---------------------------------------------------------------------------
|
| 956 |
+
|
| 957 |
def priority_match(*args, **kwargs):
|
| 958 |
if len(args) < 2:
|
| 959 |
return 0.5
|
| 960 |
+
bug, action = args[0], args[1]
|
| 961 |
score, _ = grade_action("easy", bug, action)
|
| 962 |
return float(score)
|
| 963 |
|
| 965 |
def priority_label_team(*args, **kwargs):
|
| 966 |
if len(args) < 2:
|
| 967 |
return 0.5
|
| 968 |
+
bug, action = args[0], args[1]
|
| 969 |
score, _ = grade_action("medium", bug, action)
|
| 970 |
return float(score)
|
| 971 |
|
| 973 |
def full_triage(*args, **kwargs):
|
| 974 |
if len(args) < 2:
|
| 975 |
return 0.5
|
| 976 |
+
bug, action = args[0], args[1]
|
| 977 |
score, _ = grade_action("hard", bug, action)
|
| 978 |
return float(score)
|
| 979 |
+
|
| 980 |
+
|
| 981 |
__all__ = [
|
| 982 |
"priority_match",
|
| 983 |
"priority_label_team",
|
| 984 |
"full_triage",
|
| 985 |
"sample_bug",
|
| 986 |
+
"generate_bug",
|
| 987 |
"grade_action",
|
| 988 |
+
"TASKS",
|
| 989 |
+
"LABEL_SYNONYMS",
|
| 990 |
]
|
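A minimal sketch of how the weighted scores in `grade_action` combine, assuming only the weights, the 0.15 security penalty, and the [0.01, 0.99] clamp visible in the diff above. The helper name `combine_scores` and the two weight tables are illustrative and are not part of `server/task.py`.

```python
from typing import Dict

# Weights copied from the medium and hard branches of grade_action.
MEDIUM_WEIGHTS: Dict[str, float] = {"priority": 0.45, "labels": 0.40, "team": 0.15}
HARD_WEIGHTS: Dict[str, float] = {
    "priority": 0.35, "labels": 0.30, "team": 0.20, "milestone": 0.15,
}


def combine_scores(weights: Dict[str, float], scores: Dict[str, float],
                   bonus: float = 0.0, penalty: float = 0.0) -> float:
    """Weighted sum plus reasoning bonus, minus penalty, clamped to [0.01, 0.99]."""
    raw = sum(weights[name] * scores[name] for name in weights)
    raw = raw + bonus - penalty
    return round(max(0.01, min(0.99, raw)), 3)


# Hard task, everything right except the team on a security bug:
# 0.35*0.95 + 0.30*0.95 + 0.20*0.05 + 0.15*0.95 = 0.77, minus the 0.15 penalty.
missed_security = {"priority": 0.95, "labels": 0.95, "team": 0.05, "milestone": 0.95}
print(combine_scores(HARD_WEIGHTS, missed_security, penalty=0.15))  # 0.62
```

Because of the clamp, a perfect answer with a full reasoning bonus still tops out at 0.99, which matches the open-interval check in the tests.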
tests/__init__.py
ADDED
|
@@ -0,0 +1 @@
|
| 1 |
+
# tests/__init__.py
|
tests/test_api.py
ADDED
|
@@ -0,0 +1,190 @@
|
| 1 |
+
# tests/test_api.py
|
| 2 |
+
"""Integration tests for the FastAPI endpoints."""
|
| 3 |
+
import sys
|
| 4 |
+
import os
|
| 5 |
+
sys.path.insert(0, os.path.dirname(os.path.dirname(__file__)))
|
| 6 |
+
sys.path.insert(0, os.path.join(os.path.dirname(os.path.dirname(__file__)), "server"))
|
| 7 |
+
|
| 8 |
+
import pytest
|
| 9 |
+
|
| 10 |
+
# These tests require fastapi and httpx
|
| 11 |
+
try:
|
| 12 |
+
from fastapi.testclient import TestClient
|
| 13 |
+
from server.app import app
|
| 14 |
+
HAS_DEPS = True
|
| 15 |
+
except ImportError:
|
| 16 |
+
HAS_DEPS = False
|
| 17 |
+
|
| 18 |
+
pytestmark = pytest.mark.skipif(not HAS_DEPS, reason="FastAPI/httpx not installed")
|
| 19 |
+
|
| 20 |
+
|
| 21 |
+
@pytest.fixture
|
| 22 |
+
def client():
|
| 23 |
+
return TestClient(app)
|
| 24 |
+
|
| 25 |
+
|
| 26 |
+
class TestHealthEndpoint:
|
| 27 |
+
def test_health_returns_ok(self, client):
|
| 28 |
+
r = client.get("/health")
|
| 29 |
+
assert r.status_code == 200
|
| 30 |
+
data = r.json()
|
| 31 |
+
assert data.get("status") in ("ok", "healthy")
|
| 32 |
+
|
| 33 |
+
|
| 34 |
+
class TestTaskEndpoints:
|
| 35 |
+
def test_list_tasks(self, client):
|
| 36 |
+
r = client.get("/tasks")
|
| 37 |
+
assert r.status_code == 200
|
| 38 |
+
tasks = r.json()
|
| 39 |
+
assert len(tasks) == 3
|
| 40 |
+
ids = [t["id"] for t in tasks]
|
| 41 |
+
assert "easy" in ids
|
| 42 |
+
assert "medium" in ids
|
| 43 |
+
assert "hard" in ids
|
| 44 |
+
|
| 45 |
+
def test_get_specific_task(self, client):
|
| 46 |
+
r = client.get("/tasks/easy")
|
| 47 |
+
assert r.status_code == 200
|
| 48 |
+
assert r.json()["id"] == "easy"
|
| 49 |
+
|
| 50 |
+
def test_get_nonexistent_task(self, client):
|
| 51 |
+
r = client.get("/tasks/impossible")
|
| 52 |
+
assert r.status_code == 404
|
| 53 |
+
|
| 54 |
+
|
| 55 |
+
class TestResetEndpoint:
|
| 56 |
+
def test_reset_returns_observation(self, client):
|
| 57 |
+
r = client.post("/reset", json={"task_id": "easy"})
|
| 58 |
+
assert r.status_code == 200
|
| 59 |
+
data = r.json()
|
| 60 |
+
assert "observation" in data
|
| 61 |
+
assert "session_id" in data
|
| 62 |
+
assert data["done"] is False
|
| 63 |
+
|
| 64 |
+
def test_reset_with_empty_body(self, client):
|
| 65 |
+
r = client.post("/reset", json={})
|
| 66 |
+
assert r.status_code == 200
|
| 67 |
+
|
| 68 |
+
def test_reset_returns_bug_report(self, client):
|
| 69 |
+
r = client.post("/reset", json={"task_id": "medium"})
|
| 70 |
+
data = r.json()
|
| 71 |
+
obs = data["observation"]
|
| 72 |
+
assert "bug_report" in obs
|
| 73 |
+
assert "title" in obs["bug_report"]
|
| 74 |
+
|
| 75 |
+
|
| 76 |
+
class TestStepEndpoint:
|
| 77 |
+
def test_investigation_step(self, client):
|
| 78 |
+
# Reset first
|
| 79 |
+
r = client.post("/reset", json={"task_id": "easy"})
|
| 80 |
+
session_id = r.json()["session_id"]
|
| 81 |
+
|
| 82 |
+
# Investigate
|
| 83 |
+
r = client.post("/step", json={
|
| 84 |
+
"session_id": session_id,
|
| 85 |
+
"action": {"action_type": "read_body"},
|
| 86 |
+
})
|
| 87 |
+
assert r.status_code == 200
|
| 88 |
+
data = r.json()
|
| 89 |
+
assert data["done"] is False
|
| 90 |
+
|
| 91 |
+
def test_submit_step(self, client):
|
| 92 |
+
# Reset
|
| 93 |
+
r = client.post("/reset", json={"task_id": "easy"})
|
| 94 |
+
session_id = r.json()["session_id"]
|
| 95 |
+
|
| 96 |
+
# Submit
|
| 97 |
+
r = client.post("/step", json={
|
| 98 |
+
"session_id": session_id,
|
| 99 |
+
"action": {
|
| 100 |
+
"action_type": "submit",
|
| 101 |
+
"priority": "P0",
|
| 102 |
+
"labels": ["bug"],
|
| 103 |
+
"assigned_team": "backend",
|
| 104 |
+
},
|
| 105 |
+
})
|
| 106 |
+
assert r.status_code == 200
|
| 107 |
+
data = r.json()
|
| 108 |
+
assert data["done"] is True
|
| 109 |
+
assert 0 < data["reward"] < 1
|
| 110 |
+
|
| 111 |
+
def test_full_episode_flow(self, client):
|
| 112 |
+
# Reset
|
| 113 |
+
r = client.post("/reset", json={"task_id": "hard"})
|
| 114 |
+
assert r.status_code == 200
|
| 115 |
+
session_id = r.json()["session_id"]
|
| 116 |
+
|
| 117 |
+
# Investigate: read body
|
| 118 |
+
r = client.post("/step", json={
|
| 119 |
+
"session_id": session_id,
|
| 120 |
+
"action": {"action_type": "read_body"},
|
| 121 |
+
})
|
| 122 |
+
assert r.status_code == 200
|
| 123 |
+
assert r.json()["done"] is False
|
| 124 |
+
|
| 125 |
+
# Investigate: read comments
|
| 126 |
+
r = client.post("/step", json={
|
| 127 |
+
"session_id": session_id,
|
| 128 |
+
"action": {"action_type": "read_comments"},
|
| 129 |
+
})
|
| 130 |
+
assert r.status_code == 200
|
| 131 |
+
assert r.json()["done"] is False
|
| 132 |
+
|
| 133 |
+
# Submit triage
|
| 134 |
+
r = client.post("/step", json={
|
| 135 |
+
"session_id": session_id,
|
| 136 |
+
"action": {
|
| 137 |
+
"action_type": "submit",
|
| 138 |
+
"priority": "P0",
|
| 139 |
+
"labels": ["bug", "security"],
|
| 140 |
+
"assigned_team": "security",
|
| 141 |
+
"milestone": "hotfix",
|
| 142 |
+
"reasoning": "Critical security vulnerability in production",
|
| 143 |
+
},
|
| 144 |
+
})
|
| 145 |
+
assert r.status_code == 200
|
| 146 |
+
data = r.json()
|
| 147 |
+
assert data["done"] is True
|
| 148 |
+
assert 0 < data["reward"] < 1
|
| 149 |
+
|
| 150 |
+
def test_backward_compatible_no_session(self, client):
|
| 151 |
+
"""Old-style requests without session_id should still work."""
|
| 152 |
+
r = client.post("/reset", json={"task_id": "easy"})
|
| 153 |
+
assert r.status_code == 200
|
| 154 |
+
|
| 155 |
+
r = client.post("/step", json={
|
| 156 |
+
"action": {
|
| 157 |
+
"priority": "P0",
|
| 158 |
+
"labels": ["bug"],
|
| 159 |
+
},
|
| 160 |
+
})
|
| 161 |
+
assert r.status_code == 200
|
| 162 |
+
|
| 163 |
+
|
| 164 |
+
class TestStateEndpoint:
|
| 165 |
+
def test_state_returns_data(self, client):
|
| 166 |
+
client.post("/reset", json={"task_id": "easy"})
|
| 167 |
+
r = client.get("/state")
|
| 168 |
+
assert r.status_code == 200
|
| 169 |
+
data = r.json()
|
| 170 |
+
assert "current_task" in data
|
| 171 |
+
assert "step_count" in data
|
| 172 |
+
|
| 173 |
+
|
| 174 |
+
class TestLeaderboard:
|
| 175 |
+
def test_get_empty_leaderboard(self, client):
|
| 176 |
+
r = client.get("/leaderboard")
|
| 177 |
+
assert r.status_code == 200
|
| 178 |
+
assert isinstance(r.json(), list)
|
| 179 |
+
|
| 180 |
+
def test_submit_to_leaderboard(self, client):
|
| 181 |
+
r = client.post("/leaderboard/submit", json={
|
| 182 |
+
"agent_name": "test-agent",
|
| 183 |
+
"model": "test-model",
|
| 184 |
+
"scores": {"easy": 0.9, "medium": 0.7, "hard": 0.5},
|
| 185 |
+
"avg_score": 0.7,
|
| 186 |
+
})
|
| 187 |
+
assert r.status_code == 200
|
| 188 |
+
data = r.json()
|
| 189 |
+
assert data["status"] == "submitted"
|
| 190 |
+
assert "rank" in data
|
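The request bodies used throughout `tests/test_api.py` follow two shapes: an investigation step and a final submit. A small sketch of helpers that build them, assuming only the field names seen in the tests above; `investigate_payload` and `submit_payload` are hypothetical names and do not exist in `client.py`.

```python
from typing import Any, Dict, List, Optional


def investigate_payload(session_id: str, action_type: str) -> Dict[str, Any]:
    """Body for POST /step with an investigation action such as 'read_body'."""
    return {"session_id": session_id, "action": {"action_type": action_type}}


def submit_payload(session_id: str, priority: str,
                   labels: Optional[List[str]] = None,
                   assigned_team: Optional[str] = None,
                   milestone: Optional[str] = None,
                   reasoning: Optional[str] = None) -> Dict[str, Any]:
    """Body for POST /step that ends the episode with a triage decision."""
    action: Dict[str, Any] = {"action_type": "submit", "priority": priority}
    optional_fields = {
        "labels": labels,
        "assigned_team": assigned_team,
        "milestone": milestone,
        "reasoning": reasoning,
    }
    # Only send fields the caller actually set, as the easy-task tests do.
    action.update({k: v for k, v in optional_fields.items() if v is not None})
    return {"session_id": session_id, "action": action}
```

With a `TestClient`, these would be passed as `client.post("/step", json=submit_payload(...))` after taking `session_id` from the `/reset` response.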
tests/test_environment.py
ADDED
|
@@ -0,0 +1,205 @@
|
| 1 |
+
# tests/test_environment.py
|
| 2 |
+
"""Tests for the environment logic in server/environment.py"""
|
| 3 |
+
import sys
|
| 4 |
+
import os
|
| 5 |
+
sys.path.insert(0, os.path.dirname(os.path.dirname(__file__)))
|
| 6 |
+
sys.path.insert(0, os.path.join(os.path.dirname(os.path.dirname(__file__)), "server"))
|
| 7 |
+
|
| 8 |
+
import pytest
|
| 9 |
+
from model import TriageAction, TriageObservation
|
| 10 |
+
from server.environment import BugTriageEnvironment, SessionManager
|
| 11 |
+
|
| 12 |
+
|
| 13 |
+
class TestEnvironmentReset:
|
| 14 |
+
def test_reset_returns_observation(self):
|
| 15 |
+
env = BugTriageEnvironment()
|
| 16 |
+
obs = env.reset(task_id="easy")
|
| 17 |
+
assert isinstance(obs, TriageObservation)
|
| 18 |
+
assert obs.bug_report is not None
|
| 19 |
+
assert obs.done is False
|
| 20 |
+
assert obs.task_id == "easy"
|
| 21 |
+
|
| 22 |
+
def test_reset_different_tasks(self):
|
| 23 |
+
env = BugTriageEnvironment()
|
| 24 |
+
for task_id in ["easy", "medium", "hard"]:
|
| 25 |
+
obs = env.reset(task_id=task_id)
|
| 26 |
+
assert obs.task_id == task_id
|
| 27 |
+
assert obs.done is False
|
| 28 |
+
|
| 29 |
+
def test_reset_invalid_task_defaults_to_easy(self):
|
| 30 |
+
env = BugTriageEnvironment()
|
| 31 |
+
obs = env.reset(task_id="nonexistent")
|
| 32 |
+
assert obs.task_id == "easy"
|
| 33 |
+
|
| 34 |
+
def test_reset_shows_truncated_body(self):
|
| 35 |
+
env = BugTriageEnvironment()
|
| 36 |
+
obs = env.reset(task_id="easy")
|
| 37 |
+
# Body should be truncated (not fully visible) on reset
|
| 38 |
+
assert obs.body_visible is False
|
| 39 |
+
|
| 40 |
+
def test_reset_hides_comments(self):
|
| 41 |
+
env = BugTriageEnvironment()
|
| 42 |
+
obs = env.reset(task_id="easy")
|
| 43 |
+
assert obs.comments_visible is False
|
| 44 |
+
|
| 45 |
+
def test_reset_clears_previous_state(self):
|
| 46 |
+
env = BugTriageEnvironment()
|
| 47 |
+
env.reset(task_id="easy")
|
| 48 |
+
env.step(TriageAction(action_type="submit", priority="P0"))
|
| 49 |
+
# Reset should clear everything
|
| 50 |
+
obs = env.reset(task_id="medium")
|
| 51 |
+
assert obs.done is False
|
| 52 |
+
assert obs.task_id == "medium"
|
| 53 |
+
assert obs.steps_taken == 0
|
| 54 |
+
|
| 55 |
+
|
| 56 |
+
class TestEnvironmentInvestigation:
|
| 57 |
+
def test_read_body_reveals_full_body(self):
|
| 58 |
+
env = BugTriageEnvironment()
|
| 59 |
+
env.reset(task_id="easy")
|
| 60 |
+
obs = env.step(TriageAction(action_type="read_body"))
|
| 61 |
+
assert obs.body_visible is True
|
| 62 |
+
assert obs.done is False
|
| 63 |
+
assert obs.steps_taken == 1
|
| 64 |
+
|
| 65 |
+
def test_read_comments_reveals_comments(self):
|
| 66 |
+
env = BugTriageEnvironment()
|
| 67 |
+
env.reset(task_id="easy")
|
| 68 |
+
obs = env.step(TriageAction(action_type="read_comments"))
|
| 69 |
+
assert obs.comments_visible is True
|
| 70 |
+
assert obs.done is False
|
| 71 |
+
|
| 72 |
+
def test_check_logs_reveals_logs(self):
|
| 73 |
+
env = BugTriageEnvironment()
|
| 74 |
+
env.reset(task_id="easy")
|
| 75 |
+
obs = env.step(TriageAction(action_type="check_logs"))
|
| 76 |
+
assert obs.logs_visible is True
|
| 77 |
+
assert obs.done is False
|
| 78 |
+
|
| 79 |
+
def test_duplicate_investigation_gives_feedback(self):
|
| 80 |
+
env = BugTriageEnvironment()
|
| 81 |
+
env.reset(task_id="easy")
|
| 82 |
+
env.step(TriageAction(action_type="read_body"))
|
| 83 |
+
obs = env.step(TriageAction(action_type="read_body"))
|
| 84 |
+
assert "already" in obs.feedback.lower()
|
| 85 |
+
|
| 86 |
+
def test_step_count_increments(self):
|
| 87 |
+
env = BugTriageEnvironment()
|
| 88 |
+
env.reset(task_id="easy")
|
| 89 |
+
obs1 = env.step(TriageAction(action_type="read_body"))
|
| 90 |
+
assert obs1.steps_taken == 1
|
| 91 |
+
obs2 = env.step(TriageAction(action_type="read_comments"))
|
| 92 |
+
assert obs2.steps_taken == 2
|
| 93 |
+
|
| 94 |
+
|
| 95 |
+
class TestEnvironmentSubmission:
|
| 96 |
+
def test_submit_returns_done(self):
|
| 97 |
+
env = BugTriageEnvironment()
|
| 98 |
+
env.reset(task_id="easy")
|
| 99 |
+
obs = env.step(TriageAction(action_type="submit", priority="P0"))
|
| 100 |
+
assert obs.done is True
|
| 101 |
+
|
| 102 |
+
def test_submit_returns_valid_score(self):
|
| 103 |
+
env = BugTriageEnvironment()
|
| 104 |
+
env.reset(task_id="easy")
|
| 105 |
+
obs = env.step(TriageAction(action_type="submit", priority="P0"))
|
| 106 |
+
assert 0 < obs.score < 1
|
| 107 |
+
assert 0 < obs.reward < 1
|
| 108 |
+
|
| 109 |
+
def test_investigate_then_submit(self):
|
| 110 |
+
env = BugTriageEnvironment()
|
| 111 |
+
env.reset(task_id="medium")
|
| 112 |
+
env.step(TriageAction(action_type="read_body"))
|
| 113 |
+
env.step(TriageAction(action_type="read_comments"))
|
| 114 |
+
obs = env.step(TriageAction(
|
| 115 |
+
action_type="submit", priority="P0",
|
| 116 |
+
labels=["bug"], assigned_team="backend",
|
| 117 |
+
))
|
| 118 |
+
assert obs.done is True
|
| 119 |
+
assert 0 < obs.score < 1
|
| 120 |
+
|
| 121 |
+
def test_double_submit_stays_done(self):
|
| 122 |
+
env = BugTriageEnvironment()
|
| 123 |
+
env.reset(task_id="easy")
|
| 124 |
+
env.step(TriageAction(action_type="submit", priority="P0"))
|
| 125 |
+
obs = env.step(TriageAction(action_type="submit", priority="P1"))
|
| 126 |
+
assert obs.done is True
|
| 127 |
+
assert "already complete" in obs.feedback.lower()
|
| 128 |
+
|
| 129 |
+
def test_max_steps_forces_submit(self):
|
| 130 |
+
env = BugTriageEnvironment()
|
| 131 |
+
obs = env.reset(task_id="easy")
|
| 132 |
+
max_steps = obs.max_steps
|
| 133 |
+
|
| 134 |
+
# Use all steps investigating
|
| 135 |
+
for _ in range(max_steps - 1):
|
| 136 |
+
obs = env.step(TriageAction(action_type="read_body"))
|
| 137 |
+
if obs.done:
|
| 138 |
+
break
|
| 139 |
+
|
| 140 |
+
# This should force a submit even if action_type is investigate
|
| 141 |
+
if not obs.done:
|
| 142 |
+
obs = env.step(TriageAction(
|
| 143 |
+
action_type="read_comments", # will be forced to submit
|
| 144 |
+
priority="P0",
|
| 145 |
+
))
|
| 146 |
+
|
| 147 |
+
|
| 148 |
+
class TestEnvironmentState:
|
| 149 |
+
def test_state_tracks_steps(self):
|
| 150 |
+
env = BugTriageEnvironment()
|
| 151 |
+
env.reset(task_id="easy")
|
| 152 |
+
env.step(TriageAction(action_type="read_body"))
|
| 153 |
+
state = env.get_state()
|
| 154 |
+
assert state.step_count == 1
|
| 155 |
+
assert "read_body" in state.actions_taken
|
| 156 |
+
|
| 157 |
+
def test_state_tracks_completed_tasks(self):
|
| 158 |
+
env = BugTriageEnvironment()
|
| 159 |
+
env.reset(task_id="easy")
|
| 160 |
+
env.step(TriageAction(action_type="submit", priority="P0"))
|
| 161 |
+
state = env.get_state()
|
| 162 |
+
assert "easy" in state.tasks_completed
|
| 163 |
+
|
| 164 |
+
|
| 165 |
+
class TestSessionManager:
|
| 166 |
+
def test_create_session(self):
|
| 167 |
+
mgr = SessionManager(max_sessions=10, ttl_seconds=60)
|
| 168 |
+
session_id, env = mgr.create_session()
|
| 169 |
+
assert session_id is not None
|
| 170 |
+
assert isinstance(env, BugTriageEnvironment)
|
| 171 |
+
assert mgr.active_count == 1
|
| 172 |
+
|
| 173 |
+
def test_get_session(self):
|
| 174 |
+
mgr = SessionManager()
|
| 175 |
+
session_id, env = mgr.create_session()
|
| 176 |
+
retrieved = mgr.get_session(session_id)
|
| 177 |
+
assert retrieved is env
|
| 178 |
+
|
| 179 |
+
def test_get_missing_session(self):
|
| 180 |
+
mgr = SessionManager()
|
| 181 |
+
assert mgr.get_session("nonexistent") is None
|
| 182 |
+
|
| 183 |
+
def test_remove_session(self):
|
| 184 |
+
mgr = SessionManager()
|
| 185 |
+
session_id, _ = mgr.create_session()
|
| 186 |
+
mgr.remove_session(session_id)
|
| 187 |
+
assert mgr.get_session(session_id) is None
|
| 188 |
+
assert mgr.active_count == 0
|
| 189 |
+
|
| 190 |
+
def test_max_sessions_enforced(self):
|
| 191 |
+
mgr = SessionManager(max_sessions=3, ttl_seconds=60)
|
| 192 |
+
for _ in range(5):
|
| 193 |
+
mgr.create_session()
|
| 194 |
+
assert mgr.active_count <= 3
|
| 195 |
+
|
| 196 |
+
def test_multiple_sessions_independent(self):
|
| 197 |
+
mgr = SessionManager()
|
| 198 |
+
sid1, env1 = mgr.create_session()
|
| 199 |
+
sid2, env2 = mgr.create_session()
|
| 200 |
+
|
| 201 |
+
env1.reset(task_id="easy")
|
| 202 |
+
env2.reset(task_id="hard")
|
| 203 |
+
|
| 204 |
+
assert env1.get_state().current_task == "easy"
|
| 205 |
+
assert env2.get_state().current_task == "hard"
|
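The `TestSessionManager` cases pin down an interface: `create_session()` returns an `(id, env)` pair, `get_session()` returns `None` for unknown ids, and `active_count` never exceeds `max_sessions`. The following is one way such a manager could be written, not the code in `server/environment.py`. The class name `SketchSessionManager`, the `env_factory` parameter, the default limits, and the evict-oldest policy are all assumptions.

```python
import time
import uuid
from collections import OrderedDict
from typing import Callable, Optional, Tuple


class SketchSessionManager:
    def __init__(self, env_factory: Callable[[], object] = dict,
                 max_sessions: int = 100, ttl_seconds: float = 3600.0) -> None:
        self._env_factory = env_factory
        self._max_sessions = max_sessions
        self._ttl = ttl_seconds
        # session_id -> (last_used_timestamp, env), oldest first
        self._sessions: "OrderedDict[str, Tuple[float, object]]" = OrderedDict()

    def _evict_expired(self) -> None:
        now = time.monotonic()
        expired = [sid for sid, (ts, _) in self._sessions.items()
                   if now - ts > self._ttl]
        for sid in expired:
            del self._sessions[sid]

    def create_session(self) -> Tuple[str, object]:
        self._evict_expired()
        # Make room by dropping the least recently used sessions.
        while self._sessions and len(self._sessions) >= self._max_sessions:
            self._sessions.popitem(last=False)
        session_id = uuid.uuid4().hex
        env = self._env_factory()
        self._sessions[session_id] = (time.monotonic(), env)
        return session_id, env

    def get_session(self, session_id: str) -> Optional[object]:
        self._evict_expired()
        entry = self._sessions.get(session_id)
        if entry is None:
            return None
        _, env = entry
        # Refresh the timestamp and mark the session as most recently used.
        self._sessions[session_id] = (time.monotonic(), env)
        self._sessions.move_to_end(session_id)
        return env

    def remove_session(self, session_id: str) -> None:
        self._sessions.pop(session_id, None)

    @property
    def active_count(self) -> int:
        return len(self._sessions)
```

Evicting the oldest session keeps `create_session()` from ever failing, at the cost of silently ending an idle client's episode; rejecting new sessions when full would be the alternative.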
tests/test_grading.py
ADDED
|
@@ -0,0 +1,253 @@
|
| 1 |
+
# tests/test_grading.py
|
| 2 |
+
"""Tests for the grading logic in server/task.py"""
|
| 3 |
+
import sys
|
| 4 |
+
import os
|
| 5 |
+
sys.path.insert(0, os.path.dirname(os.path.dirname(__file__)))
|
| 6 |
+
sys.path.insert(0, os.path.join(os.path.dirname(os.path.dirname(__file__)), "server"))
|
| 7 |
+
|
| 8 |
+
import pytest
|
| 9 |
+
from model import BugReport, TriageAction
|
| 10 |
+
from server.task import (
|
| 11 |
+
_priority_score, _label_score, _normalize_label, _reasoning_score,
|
| 12 |
+
grade_action, generate_bug, sample_bug, TASKS, LABEL_SYNONYMS,
|
| 13 |
+
)
|
| 14 |
+
|
| 15 |
+
|
| 16 |
+
# ── Priority Scoring ──────────────────────────────────────
|
| 17 |
+
|
| 18 |
+
class TestPriorityScoring:
|
| 19 |
+
def test_exact_match_gives_high_score(self):
|
| 20 |
+
assert _priority_score("P0", "P0") == 0.95
|
| 21 |
+
|
| 22 |
+
def test_all_exact_matches(self):
|
| 23 |
+
for p in ["P0", "P1", "P2", "P3"]:
|
| 24 |
+
assert _priority_score(p, p) == 0.95
|
| 25 |
+
|
| 26 |
+
def test_off_by_one_gives_partial_credit(self):
|
| 27 |
+
assert _priority_score("P0", "P1") == 0.5
|
| 28 |
+
assert _priority_score("P1", "P2") == 0.5
|
| 29 |
+
assert _priority_score("P2", "P3") == 0.5
|
| 30 |
+
|
| 31 |
+
def test_off_by_two_gives_low_credit(self):
|
| 32 |
+
assert _priority_score("P0", "P2") == 0.2
|
| 33 |
+
assert _priority_score("P1", "P3") == 0.2
|
| 34 |
+
|
| 35 |
+
def test_completely_wrong_gives_minimum(self):
|
| 36 |
+
assert _priority_score("P0", "P3") == 0.05
|
| 37 |
+
|
| 38 |
+
def test_invalid_priority(self):
|
| 39 |
+
assert _priority_score("P9", "P0") == 0.05
|
| 40 |
+
assert _priority_score("invalid", "P0") == 0.05
|
| 41 |
+
|
| 42 |
+
|
| 43 |
+
# ── Label Scoring ─────────────────────────────────────────
|
| 44 |
+
|
| 45 |
+
class TestLabelScoring:
|
| 46 |
+
def test_perfect_match(self):
|
| 47 |
+
score = _label_score(["bug", "security"], ["bug", "security"])
|
| 48 |
+
assert score >= 0.9
|
| 49 |
+
|
| 50 |
+
def test_partial_overlap(self):
|
| 51 |
+
score = _label_score(["bug"], ["bug", "security"])
|
| 52 |
+
assert 0.3 < score < 0.7 # ~50% Jaccard
|
| 53 |
+
|
| 54 |
+
def test_no_overlap(self):
|
| 55 |
+
score = _label_score(["docs"], ["bug", "security"])
|
| 56 |
+
assert score == 0.05 # clamped minimum
|
| 57 |
+
|
| 58 |
+
def test_empty_correct_labels(self):
|
| 59 |
+
score = _label_score(["bug"], [])
|
| 60 |
+
assert score == 0.95 # nothing expected => full credit
|
| 61 |
+
|
| 62 |
+
def test_synonym_matching(self):
|
| 63 |
+
# "defect" is a synonym for "bug"
|
| 64 |
+
score = _label_score(["defect"], ["bug"])
|
| 65 |
+
assert score >= 0.9 # should match via synonym
|
| 66 |
+
|
| 67 |
+
def test_case_insensitive(self):
|
| 68 |
+
score = _label_score(["BUG", "Security"], ["bug", "security"])
|
| 69 |
+
assert score >= 0.9
|
| 70 |
+
|
| 71 |
+
|
| 72 |
+
# ── Label Normalization ───────────────────────────────────
|
| 73 |
+
|
| 74 |
+
class TestLabelNormalization:
|
| 75 |
+
def test_canonical_stays_same(self):
|
| 76 |
+
assert _normalize_label("bug") == "bug"
|
| 77 |
+
assert _normalize_label("security") == "security"
|
| 78 |
+
|
| 79 |
+
def test_synonym_maps_to_canonical(self):
|
| 80 |
+
assert _normalize_label("defect") == "bug"
|
| 81 |
+
assert _normalize_label("vulnerability") == "security"
|
| 82 |
+
assert _normalize_label("slow") == "performance"
|
| 83 |
+
assert _normalize_label("ui") == "ux"
|
| 84 |
+
|
| 85 |
+
def test_unknown_label_passes_through(self):
|
| 86 |
+
assert _normalize_label("my-custom-label") == "my-custom-label"
|
| 87 |
+
|
| 88 |
+
def test_case_insensitive(self):
|
| 89 |
+
assert _normalize_label("BUG") == "bug"
|
| 90 |
+
assert _normalize_label("Vulnerability") == "security"
|
| 91 |
+
|
| 92 |
+
|
| 93 |
+
# ── Reasoning Scoring ─────────────────────────────────────
|
| 94 |
+
|
| 95 |
+
class TestReasoningScoring:
|
| 96 |
+
def test_empty_reasoning_gives_zero(self):
|
| 97 |
+
assert _reasoning_score("", {"priority": "P0"}) == 0.0
|
| 98 |
+
|
| 99 |
+
def test_short_reasoning_gives_zero(self):
|
| 100 |
+
assert _reasoning_score("bad", {"priority": "P0"}) == 0.0
|
| 101 |
+
|
| 102 |
+
def test_relevant_reasoning_gives_bonus(self):
|
| 103 |
+
score = _reasoning_score(
|
| 104 |
+
"This is a critical security vulnerability affecting production and causing data loss",
|
| 105 |
+
{"priority": "P0"},
|
| 106 |
+
)
|
| 107 |
+
assert score > 0
|
| 108 |
+
|
| 109 |
+
def test_bonus_capped_at_max(self):
|
| 110 |
+
score = _reasoning_score(
|
| 111 |
+
"production down all users data loss security crash revenue injection vulnerability 100%",
|
| 112 |
+
{"priority": "P0"},
|
| 113 |
+
)
|
| 114 |
+
assert score <= 0.15
|
| 115 |
+
|
| 116 |
+
|
| 117 |
+
# ── Grade Action ──────────────────────────────────────────
|
| 118 |
+
|
| 119 |
+
class TestGradeAction:
|
| 120 |
+
@pytest.fixture
|
| 121 |
+
def easy_bug(self):
|
| 122 |
+
return TASKS["easy"]["bugs"][0] # easy-001: P0
|
| 123 |
+
|
| 124 |
+
@pytest.fixture
|
| 125 |
+
def medium_bug(self):
|
| 126 |
+
return TASKS["medium"]["bugs"][0] # med-001: P0, payments, backend
|
| 127 |
+
|
| 128 |
+
@pytest.fixture
|
| 129 |
+
def hard_bug(self):
|
| 130 |
+
return TASKS["hard"]["bugs"][0] # hard-001: P0, security, hotfix
|
| 131 |
+
|
| 132 |
+
def test_easy_perfect_answer(self, easy_bug):
|
| 133 |
+
action = TriageAction(priority="P0")
|
| 134 |
+
score, feedback = grade_action("easy", easy_bug, action)
|
| 135 |
+
assert 0.9 <= score <= 0.99
|
| 136 |
+
assert "✓" in feedback
|
| 137 |
+
|
| 138 |
+
def test_easy_wrong_answer(self, easy_bug):
|
| 139 |
+
action = TriageAction(priority="P3")
|
| 140 |
+
score, feedback = grade_action("easy", easy_bug, action)
|
| 141 |
+
assert score < 0.2
|
| 142 |
+
|
| 143 |
+
def test_medium_perfect_answer(self, medium_bug):
|
| 144 |
+
action = TriageAction(
|
| 145 |
+
priority="P0",
|
| 146 |
+
labels=["bug", "payments"],
|
| 147 |
+
assigned_team="backend",
|
| 148 |
+
)
|
| 149 |
+
score, feedback = grade_action("medium", medium_bug, action)
|
| 150 |
+
assert score > 0.8
|
| 151 |
+
|
| 152 |
+
def test_hard_security_penalty(self, hard_bug):
|
| 153 |
+
# hard-001 requires security team; assigning backend should be penalized
|
| 154 |
+
action_wrong = TriageAction(
|
| 155 |
+
priority="P0",
|
| 156 |
+
labels=["bug", "security"],
|
| 157 |
+
assigned_team="backend", # Wrong! Should be security
|
| 158 |
+
milestone="hotfix",
|
| 159 |
+
)
|
| 160 |
+
action_right = TriageAction(
|
| 161 |
+
priority="P0",
|
| 162 |
+
labels=["bug", "security"],
|
| 163 |
+
assigned_team="security",
|
| 164 |
+
milestone="hotfix",
|
| 165 |
+
)
|
| 166 |
+
score_wrong, fb_wrong = grade_action("hard", hard_bug, action_wrong)
|
| 167 |
+
score_right, fb_right = grade_action("hard", hard_bug, action_right)
|
| 168 |
+
|
| 169 |
+
assert score_right > score_wrong
|
| 170 |
+
assert "Security escalation missed" in fb_wrong
|
| 171 |
+
|
| 172 |
+
def test_all_scores_in_valid_range(self):
|
| 173 |
+
"""Every grading result must be in (0, 1), an open interval."""
|
| 174 |
+
for task_key in ["easy", "medium", "hard"]:
|
| 175 |
+
for bug in TASKS[task_key]["bugs"]:
|
| 176 |
+
for priority in ["P0", "P1", "P2", "P3"]:
|
| 177 |
+
action = TriageAction(
|
| 178 |
+
priority=priority,
|
| 179 |
+
labels=["bug"],
|
| 180 |
+
assigned_team="backend",
|
| 181 |
+
milestone="backlog",
|
| 182 |
+
)
|
| 183 |
+
score, feedback = grade_action(task_key, bug, action)
|
| 184 |
+
assert 0 < score < 1, (
|
| 185 |
+
f"Score {score} out of range for {bug.id} "
|
| 186 |
+
f"with priority={priority}"
|
| 187 |
+
)
|
| 188 |
+
assert isinstance(feedback, str)
|
| 189 |
+
assert len(feedback) > 0
|
| 190 |
+
|
| 191 |
+
|
| 192 |
+
# ── Procedural Bug Generation ─────────────────────────────
|
| 193 |
+
|
| 194 |
+
class TestBugGeneration:
|
| 195 |
+
def test_generate_produces_valid_bug(self):
|
| 196 |
+
bug, answer = generate_bug("easy", seed=42)
|
| 197 |
+
assert isinstance(bug, BugReport)
|
| 198 |
+
assert bug.id.startswith("gen-")
|
| 199 |
+
assert len(bug.title) > 5
|
| 200 |
+
assert len(bug.body) > 20
|
| 201 |
+
assert "priority" in answer
|
| 202 |
+
|
| 203 |
+
def test_different_seeds_produce_different_bugs(self):
|
| 204 |
+
bug1, _ = generate_bug("easy", seed=1)
|
| 205 |
+
bug2, _ = generate_bug("easy", seed=2)
|
| 206 |
+
# Very unlikely to produce the same title with different seeds
|
| 207 |
+
assert bug1.title != bug2.title or bug1.body != bug2.body
|
| 208 |
+
|
| 209 |
+
def test_same_seed_produces_same_bug(self):
|
| 210 |
+
bug1, ans1 = generate_bug("easy", seed=42)
|
| 211 |
+
bug2, ans2 = generate_bug("easy", seed=42)
|
| 212 |
+
assert bug1.title == bug2.title
|
| 213 |
+
assert bug1.body == bug2.body
|
| 214 |
+
assert ans1 == ans2
|
| 215 |
+
|
| 216 |
+
def test_easy_bugs_have_only_priority(self):
|
| 217 |
+
for seed in range(10):
|
| 218 |
+
_, answer = generate_bug("easy", seed=seed)
|
| 219 |
+
assert "priority" in answer
|
| 220 |
+
# easy should NOT include milestone
|
| 221 |
+
assert "milestone" not in answer
|
| 222 |
+
|
| 223 |
+
def test_hard_bugs_have_full_answer(self):
|
| 224 |
+
for seed in range(50):
|
| 225 |
+
_, answer = generate_bug("hard", seed=seed)
|
| 226 |
+
assert "priority" in answer
|
| 227 |
+
|
| 228 |
+
def test_all_difficulties(self):
|
| 229 |
+
for difficulty in ["easy", "medium", "hard"]:
|
| 230 |
+
bug, answer = generate_bug(difficulty, seed=100)
|
| 231 |
+
assert isinstance(bug, BugReport)
|
| 232 |
+
assert "priority" in answer
|
| 233 |
+
|
| 234 |
+
def test_sample_bug_returns_tuple(self):
|
| 235 |
+
bug, answer = sample_bug("easy", seed=42)
|
| 236 |
+
assert isinstance(bug, BugReport)
|
| 237 |
+
assert isinstance(answer, dict)
|
| 238 |
+
|
| 239 |
+
def test_generated_bugs_are_gradeable(self):
|
| 240 |
+
"""Generated bugs should work with the grading system."""
|
| 241 |
+
for difficulty in ["easy", "medium", "hard"]:
|
| 242 |
+
for seed in range(5):
|
| 243 |
+
bug, answer = generate_bug(difficulty, seed=seed)
|
| 244 |
+
action = TriageAction(
|
| 245 |
+
priority=answer["priority"],
|
| 246 |
+
labels=answer.get("labels", ["bug"]),
|
| 247 |
+
assigned_team=answer.get("assigned_team", "backend"),
|
| 248 |
+
milestone=answer.get("milestone", "backlog"),
|
| 249 |
+
)
|
| 250 |
+
score, feedback = grade_action(difficulty, bug, action, answer=answer)
|
| 251 |
+
assert 0 < score < 1, (
|
| 252 |
+
f"Score {score} for {bug.id} ({difficulty})"
|
| 253 |
+
)
|
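The priority and label tests above fix the expected values closely enough to reconstruct scoring functions that would pass them. The following is a sketch under that assumption only; the real `_priority_score`, `_normalize_label`, and `_label_score` in `server/task.py` may be implemented differently. The `_sketch` names and the four-entry synonym table (taken from the test cases) are illustrative.

```python
from typing import Dict, Iterable, List

PRIORITY_LEVELS: List[str] = ["P0", "P1", "P2", "P3"]
# Score by distance between predicted and expected priority, per the tests.
DISTANCE_TO_SCORE: Dict[int, float] = {0: 0.95, 1: 0.5, 2: 0.2, 3: 0.05}

# Synonym -> canonical label; only the pairs exercised by the tests.
SYNONYMS: Dict[str, str] = {
    "defect": "bug",
    "vulnerability": "security",
    "slow": "performance",
    "ui": "ux",
}


def priority_score_sketch(predicted: str, expected: str) -> float:
    if predicted not in PRIORITY_LEVELS or expected not in PRIORITY_LEVELS:
        return 0.05  # invalid priorities get the minimum score
    distance = abs(PRIORITY_LEVELS.index(predicted) - PRIORITY_LEVELS.index(expected))
    return DISTANCE_TO_SCORE[distance]


def normalize_label_sketch(label: str) -> str:
    key = label.strip().lower()
    return SYNONYMS.get(key, key)  # unknown labels pass through lowercased


def label_score_sketch(predicted: Iterable[str], expected: Iterable[str]) -> float:
    expected_set = {normalize_label_sketch(l) for l in expected}
    if not expected_set:
        return 0.95  # nothing expected, so full credit
    predicted_set = {normalize_label_sketch(l) for l in predicted}
    jaccard = len(predicted_set & expected_set) / len(predicted_set | expected_set)
    return max(0.05, min(0.95, jaccard))  # clamp away from 0 and 1
```

Normalizing before taking the Jaccard ratio is what lets `["defect"]` score the same as `["bug"]`, while the clamp keeps every component strictly inside (0, 1).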