Siteshcodes committed on
Commit
703aa57
·
1 Parent(s): 1893444

v2.0: multi-step episodes, procedural bugs, semantic grading, sessions, 71 tests

.gitignore CHANGED
Binary files a/.gitignore and b/.gitignore differ
 
README.md CHANGED
@@ -9,85 +9,127 @@ tags:
9
  - openenv
10
  ---
11
 
12
- # 🐛 Bug Triage Environment
13
 
14
  > **OpenEnv RL environment for the Meta PyTorch Hackathon x Scaler School of Technology**
15
 
16
- An OpenEnv reinforcement learning environment where an AI agent triages GitHub-style bug reports, assigning priority, labels, team ownership, and milestone, exactly as a senior engineer would.
17
 
18
  **Live:** [https://siteshcodes-bug-triage-env.hf.space](https://siteshcodes-bug-triage-env.hf.space)
19
  **GitHub:** [https://github.com/Siteshcodes/bug-triage-env](https://github.com/Siteshcodes/bug-triage-env)
20
 
21
  ---
22
 
23
- ## Why This Environment?
24
 
25
- Every software team triages dozens of bug reports weekly. Getting prioritization wrong delays critical fixes and wastes engineering time. This environment trains and evaluates agents on real triage decision-making, with graders that reflect actual engineering judgment.
 
 
 
 
 
 
 
 
26
 
- **Key features:**
- - 🎯 Simulates a real-world engineering task (not a game or toy)
- - 📊 3 tasks of increasing difficulty with deterministic graders
- - 🔄 Meaningful partial-credit reward function
- - 🛡️ Security escalation penalty for missed critical vulnerabilities
- - 📦 Full OpenEnv spec compliance: `step()` / `reset()` / `state()`
33
 
34
  ---
35
 
36
  ## Action Space
37
 
- | Field | Type | Values |
- |-----------------|-----------|-------------------------------------------------|
- | `priority` | string | `P0` · `P1` · `P2` · `P3` |
- | `labels` | list[str] | `bug` · `performance` · `security` · `ux` · `data-integrity` · `payments` … |
- | `assigned_team` | string | `backend` · `frontend` · `infra` · `security` · `devx` |
- | `milestone` | string | `hotfix` · `v2.1` · `backlog` |
- | `reasoning` | string | Free-form explanation of triage decision |
45
 
46
  ## Observation Space
47
 
- | Field | Type | Description |
- |--------------|-----------|------------------------------------------|
- | `bug_report` | BugReport | Title, body, author, labels_hint, comments |
- | `task_id` | string | Current difficulty: `easy` / `medium` / `hard` |
- | `score` | float | Score from grader (0.0–1.0) |
- | `reward` | float | Reward from last action (0.0–1.0) |
- | `feedback` | string | Human-readable grader feedback |
- | `done` | bool | Episode complete flag |
56
 
57
  ---
58
 
59
  ## Tasks
60
 
  ### Task 1 - Easy: Priority Assignment
- Assign a single P0–P3 priority to a bug report.
  - **Grader:** `server.task:priority_match`
- - **Scoring:** exact match → 0.95, one level off → 0.50, else → 0.05
- - **Weight:** priority 100%
- - **Reward range:** (0.0, 1.0), strictly exclusive

  ### Task 2 - Medium: Priority + Labels + Team
- Assign priority, category labels, and team routing.
  - **Grader:** `server.task:priority_label_team`
- - **Scoring:** priority 45% + label Jaccard similarity 40% + team routing 15%
- - **Reward range:** (0.0, 1.0), strictly exclusive

  ### Task 3 - Hard: Full Triage
- Full triage: priority, labels, team, and milestone. Security escalation failures are penalized.
  - **Grader:** `server.task:full_triage`
  - **Scoring:** priority 35% + labels 30% + team 20% + milestone 15%
- - **Penalty:** −0.15 for missing security escalation (e.g., SQL injection assigned to `backend` instead of `security`)
- - **Reward range:** (0.0, 1.0), strictly exclusive
80
 
81
  ---
82
 
83
  ## Reward Function
84
 
- Rewards provide meaningful partial-credit signals at every step:
- - **Priority:** Close-but-wrong answers get partial credit (0.95 for an exact match, 0.50 for one level off, 0.05 for two or more levels off)
- - **Labels:** Jaccard similarity between predicted and expected label sets (continuous signal)
- - **Team routing:** Binary accuracy, weighted per task difficulty
- - **Security escalation:** Hard penalty (−0.15) discourages ignoring critical security signals
- - **Clamping:** All scores strictly within (0.0, 1.0), never exactly 0 or 1
91
 
92
  ---
93
 
@@ -107,7 +149,13 @@ docker build -t bug-triage-env .
107
  docker run -p 7860:7860 bug-triage-env
108
  ```
109
 
110
- ### Run Inference (Hackathon Submission Script)
 
 
 
 
 
 
111
  ```bash
112
  pip install openai openenv-core requests pydantic
113
  export API_BASE_URL=https://router.huggingface.co/v1
@@ -119,77 +167,79 @@ python inference.py
119
 
120
  ### Environment Variables
121
 
- | Variable | Description | Required |
- |----------------|--------------------------------------|----------|
- | `API_BASE_URL` | LLM API endpoint | Yes |
- | `MODEL_NAME` | Model identifier for inference | Yes |
- | `HF_TOKEN` | Hugging Face / API key | Yes |
- | `ENV_BASE_URL` | Bug Triage environment URL | Optional |
-
- ---
-
- ## Baseline Scores
-
- Evaluated with `meta-llama/Llama-3.3-70B-Instruct` via HuggingFace router (temperature=0):
-
- | Task | Difficulty | Score |
- |------------|------------|-------|
- | Easy | easy | 0.95 |
- | Medium | medium | 0.50 |
- | Hard | hard | 0.85 |
- | **Average**| | **0.77** |
-
- > Scores vary per run due to random bug sampling from a pool of 5 bugs per task.
143
 
144
  ---
145
 
146
  ## API Endpoints
147
 
- | Method | Endpoint | Description |
- |--------|------------------|------------------------------------|
- | GET | `/` | Health check |
- | POST | `/reset` | Start new episode for a task |
- | POST | `/step` | Submit triage action |
- | GET | `/state` | Get current episode state |
- | GET | `/tasks` | List all tasks with grader info |
- | GET | `/tasks/{id}` | Get specific task metadata |
156
 
157
- ### Example: Reset + Step
158
 
159
  ```bash
160
- # Reset for easy task
161
  curl -X POST https://siteshcodes-bug-triage-env.hf.space/reset \
162
  -H "Content-Type: application/json" \
163
- -d '{"task_id": "easy"}'
164
 
165
- # Submit triage action
166
  curl -X POST https://siteshcodes-bug-triage-env.hf.space/step \
167
  -H "Content-Type: application/json" \
168
- -d '{"action": {"priority": "P0", "labels": ["bug"], "assigned_team": "backend", "milestone": "hotfix", "reasoning": "App crash affecting all users"}}'
 
 
 
 
 
 
 
 
 
 
169
  ```
170
 
171
  ---
172
 
173
  ## Inference Log Format
174
 
175
- The inference script emits structured logs per the OpenEnv spec:
176
 
177
  ```
178
  [START] task=easy env=bug-triage-env model=meta-llama/Llama-3.3-70B-Instruct
179
- [STEP] step=1 action=priority=P0,team=backend,milestone=hotfix reward=0.95 done=true error=null
180
- [END] success=true steps=1 score=0.95 rewards=0.95
 
 
181
 
182
  [START] task=medium env=bug-triage-env model=meta-llama/Llama-3.3-70B-Instruct
183
- [STEP] step=1 action=priority=P0,team=backend,milestone=hotfix reward=0.85 done=true error=null
184
- [END] success=true steps=1 score=0.85 rewards=0.85
 
 
185
 
186
  [START] task=hard env=bug-triage-env model=meta-llama/Llama-3.3-70B-Instruct
187
- [STEP] step=1 action=priority=P0,team=security,milestone=hotfix reward=0.72 done=true error=null
188
- [END] success=true steps=1 score=0.72 rewards=0.72
 
 
189
  ```
190
 
191
- Each task gets its own `[START]` → `[STEP]` → `[END]` block.
192
-
193
  ---
194
 
195
  ## Project Structure
@@ -197,16 +247,24 @@ Each task gets its own `[START]` → `[STEP]` → `[END]` block.
  ```
  bug-triage-env/
  ├── server/
- │   ├── app.py # FastAPI + OpenEnv stateful endpoints
- │   ├── environment.py # BugTriageEnvironment (reset/step/state)
- │   ├── task.py # 15 bug reports + 3 graders
  │   ├── __init__.py
- │   └── requirements.txt
  ├── model.py # Pydantic models (TriageAction, TriageObservation, TriageState)
- ├── inference.py # OpenAI client submission script (per-task logs)
- ├── openenv.yaml # OpenEnv spec manifest (3 tasks with graders)
- ├── Dockerfile # Docker container config
- ├── pyproject.toml # Package metadata
  └── README.md
  ```
212
 
@@ -214,17 +272,20 @@ bug-triage-env/
214
 
215
  ## OpenEnv Spec Compliance
216
 
- | Requirement | Status |
- |-------------------------------------|--------|
  | Typed models (Action/Observation/State) | ✅ |
- | `step()` / `reset()` / `state()` API | ✅ |
- | `openenv.yaml` manifest | ✅ |
- | 3+ tasks with graders (easy→hard) | ✅ |
- | Reward range strictly (0.0, 1.0) | ✅ |
  | Baseline inference with reproducible scores | ✅ |
- | Dockerfile builds | ✅ |
- | Deployed on HF Spaces | ✅ |
- | Structured `[START]/[STEP]/[END]` logs | ✅ |
228
 
229
  ---
230
 
 
9
  - openenv
10
  ---
11
 
12
+ # 🐛 Bug Triage Environment v2.0
13
 
14
  > **OpenEnv RL environment for the Meta PyTorch Hackathon x Scaler School of Technology**
15
 
16
+ A multi-step reinforcement learning environment where an AI agent investigates and triages GitHub-style bug reports, deciding priority, labels, team ownership, and milestone, just like a senior engineer would.
17
 
18
  **Live:** [https://siteshcodes-bug-triage-env.hf.space](https://siteshcodes-bug-triage-env.hf.space)
19
  **GitHub:** [https://github.com/Siteshcodes/bug-triage-env](https://github.com/Siteshcodes/bug-triage-env)
20
 
21
  ---
22
 
23
+ ## What Makes This Different
24
 
+ | Feature | v1.0 (before) | v2.0 (now) |
+ |---------|---------------|------------|
+ | Episode length | 1 step (quiz) | Multi-step investigation |
+ | Bug pool | 15 handcrafted | 200+ procedurally generated |
+ | Label matching | Exact string | Semantic (synonym-aware) |
+ | Concurrency | Broken (global state) | Session-based, thread-safe |
+ | Information reveal | Everything at once | Progressive (title → body → comments → logs) |
+ | Tests | None | 50+ unit & integration tests |
+ | Grading depth | String matching | Weighted scoring + reasoning bonus |
34
 
35
+ ---
36
+
37
+ ## Multi-Step Investigation
38
+
+ Unlike simple Q&A environments, the agent must **investigate before deciding**:
+
+ ```
+ reset()             → Agent sees: bug title + body preview
+ step(read_body)     → Full description revealed
+ step(read_comments) → User comments revealed
+ step(check_logs)    → Stack traces + severity signals revealed
+ step(submit, ...)   → Final triage graded (reward returned)
+ ```
+
+ Each investigation action consumes one step from a limited budget. The agent must learn **when it has enough information to decide correctly**, balancing accuracy against efficiency.
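The step budget described above can be illustrated with a toy model. This is a minimal sketch, not the repository's `environment.py`: the class name `ToyTriageEpisode`, the placeholder submit reward of 0.5, and the rule that an exhausted budget ends the episode are all assumptions made for illustration. Only the action names come from the README.

```python
from dataclasses import dataclass, field

# Action names taken from the README's action space; everything else is hypothetical.
INVESTIGATE_ACTIONS = {"read_body", "read_comments", "check_logs", "check_similar"}


@dataclass
class ToyTriageEpisode:
    """Tiny stand-in for a multi-step triage episode with a step budget."""

    max_steps: int = 6
    steps_taken: int = 0
    done: bool = False
    revealed: set = field(default_factory=set)

    def step(self, action_type: str) -> float:
        """Apply one action and return its reward (0.0 for investigation)."""
        if self.done:
            raise RuntimeError("episode already finished")
        self.steps_taken += 1
        if action_type in INVESTIGATE_ACTIONS:
            self.revealed.add(action_type)
            # Assumption: using up the budget without submitting ends the episode.
            self.done = self.steps_taken >= self.max_steps
            return 0.0
        if action_type == "submit":
            self.done = True
            # Placeholder value; the real grader scores the triage fields.
            return 0.5
        raise ValueError(f"unknown action_type: {action_type}")
```

Calling `step("read_body")`, `step("read_comments")`, then `step("submit")` on a fresh episode uses three steps and ends the episode on the submit, which mirrors the flow listed in the README.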
50
 
51
  ---
52
 
53
  ## Action Space
54
 
+ | Field | Type | Values |
+ |-------|------|--------|
+ | `action_type` | string | `read_body` · `read_comments` · `check_logs` · `check_similar` · `submit` |
+ | `priority` | string | `P0` · `P1` · `P2` · `P3` (only for submit) |
+ | `labels` | list[str] | `bug` · `performance` · `security` · `ux` · `data-integrity` · `payments` … |
+ | `assigned_team` | string | `backend` · `frontend` · `infra` · `security` · `devx` |
+ | `milestone` | string | `hotfix` · `v2.1` · `backlog` |
+ | `reasoning` | string | Free-form explanation (earns bonus points) |
63
 
64
  ## Observation Space
65
 
+ | Field | Type | Description |
+ |-------|------|-------------|
+ | `bug_report` | BugReport | Title, body, author, labels_hint, comments, stack_trace |
+ | `task_id` | string | Current difficulty: `easy` / `medium` / `hard` |
+ | `score` | float | Score from grader (0.0–1.0) |
+ | `reward` | float | Reward from last action (0.0–1.0) |
+ | `feedback` | string | Human-readable grader feedback |
+ | `done` | bool | Episode complete flag |
+ | `body_visible` | bool | Whether full body has been revealed |
+ | `comments_visible` | bool | Whether comments have been revealed |
+ | `logs_visible` | bool | Whether logs/stack traces have been revealed |
+ | `steps_taken` | int | Steps used so far |
+ | `max_steps` | int | Maximum steps allowed |
79
 
80
  ---
81
 
82
  ## Tasks
83
 
  ### Task 1 - Easy: Priority Assignment
+ Assign a single P0–P3 priority. Up to 4 steps.
  - **Grader:** `server.task:priority_match`
+ - **Scoring:** exact → 0.95, ±1 → 0.50, ±2 → 0.20, else → 0.05
+ - **Reward range:** (0.0, 1.0)
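The graduated priority scoring listed for Task 1 can be sketched as a distance lookup. This is an illustration under assumptions, not the code behind `server.task:priority_match`: the function name `priority_score` and the choice to give an unrecognized priority the lowest score are hypothetical.

```python
PRIORITY_LEVELS = ["P0", "P1", "P2", "P3"]

# Score by distance between predicted and expected level, per the README.
SCORE_BY_DISTANCE = {0: 0.95, 1: 0.50, 2: 0.20}
FALLBACK_SCORE = 0.05


def priority_score(predicted: str, expected: str) -> float:
    """Return graduated partial credit for a predicted priority."""
    if predicted not in PRIORITY_LEVELS or expected not in PRIORITY_LEVELS:
        # Assumption: invalid input is treated like the worst mismatch.
        return FALLBACK_SCORE
    distance = abs(PRIORITY_LEVELS.index(predicted) - PRIORITY_LEVELS.index(expected))
    return SCORE_BY_DISTANCE.get(distance, FALLBACK_SCORE)
```

With four levels the largest possible distance is three, so the "else" case in the README corresponds to a P0/P3 mix-up.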
 
89
 
  ### Task 2 - Medium: Priority + Labels + Team
+ Assign priority, category labels, and team routing. Up to 5 steps.
  - **Grader:** `server.task:priority_label_team`
+ - **Scoring:** priority 45% + label Jaccard (semantic) 40% + team 15%
+ - **Reward range:** (0.0, 1.0)
95
 
  ### Task 3 - Hard: Full Triage
+ Full triage with security escalation penalty. Up to 6 steps.
  - **Grader:** `server.task:full_triage`
  - **Scoring:** priority 35% + labels 30% + team 20% + milestone 15%
+ - **Penalty:** −0.15 for missing security escalation
+ - **Bonus:** up to +0.15 for relevant reasoning
+ - **Reward range:** (0.0, 1.0)
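One way the Task 3 weights, penalty, bonus, and open reward range could combine is sketched below. The weights and the ±0.15 figures come from the README; the function names, the order in which penalty and bonus are applied, and the clamping margin `eps=0.01` are assumptions, since the diff does not show the grader itself.

```python
def clamp_open_unit(value: float, eps: float = 0.01) -> float:
    """Keep a score strictly inside (0.0, 1.0); eps is a hypothetical margin."""
    return min(max(value, eps), 1.0 - eps)


def full_triage_score(
    priority: float,
    labels: float,
    team: float,
    milestone: float,
    missed_security_escalation: bool = False,
    reasoning_bonus: float = 0.0,
) -> float:
    """Combine per-field scores (each in [0, 1]) with the README weights."""
    score = 0.35 * priority + 0.30 * labels + 0.20 * team + 0.15 * milestone
    if missed_security_escalation:
        score -= 0.15  # penalty stated in the README
    # The README caps the reasoning bonus at +0.15.
    score += min(max(reasoning_bonus, 0.0), 0.15)
    return clamp_open_unit(score)
```

Clamping last means a perfect answer with a full bonus still stays below 1.0, and a heavily penalized answer stays above 0.0, which matches the stated reward range.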
103
 
104
  ---
105
 
106
  ## Reward Function
107
 
+ - **Priority:** Graduated partial credit (0.95 → 0.50 → 0.20 → 0.05)
+ - **Labels:** Semantic Jaccard similarity with synonym matching (e.g., "defect" ≈ "bug")
+ - **Team routing:** Binary accuracy, weighted per difficulty
+ - **Security escalation:** Hard penalty (−0.15) for ignoring security signals
+ - **Reasoning bonus:** Up to +0.15 for mentioning relevant signals
+ - **Efficiency:** +0.05 bonus for correct answers with minimal investigation
+ - **Clamping:** All scores strictly within (0.0, 1.0)
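The synonym-aware label matching in the list above can be sketched as Jaccard similarity over canonicalized labels. Only the "defect" to "bug" mapping appears in the README; the second synonym entry, the helper names, and the choice to score two empty label sets as 1.0 are assumptions for illustration.

```python
# "defect" -> "bug" is from the README; the other entry is a hypothetical example.
SYNONYMS = {
    "defect": "bug",
    "vulnerability": "security",
}


def canonical(label: str) -> str:
    """Normalize case and whitespace, then map known synonyms to one label."""
    key = label.strip().lower()
    return SYNONYMS.get(key, key)


def semantic_jaccard(predicted: list, expected: list) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| over canonical labels."""
    a = {canonical(label) for label in predicted}
    b = {canonical(label) for label in expected}
    if not a and not b:
        return 1.0  # assumption: nothing expected and nothing predicted is a match
    return len(a & b) / len(a | b)
```

Because canonicalization happens before the set comparison, a prediction of `["Defect"]` scores the same as `["bug"]`, while extra or missing labels still lower the score continuously.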
115
+
116
+ ---
117
+
118
+ ## Procedural Bug Generation
119
+
+ The environment generates bugs from **7 template categories**:
+
+ | Category | Example Bugs |
+ |----------|-------------|
+ | `crash` | Service crashes, unhandled exceptions, segfaults |
+ | `security` | SQL injection, XSS, auth bypass, data exposure |
+ | `performance` | Memory leaks, slow queries, CPU spikes |
+ | `ui_bug` | Layout breaks, dark mode issues, accessibility |
+ | `data_corruption` | Race conditions, encoding issues, stale cache |
+ | `documentation` | Typos, outdated docs, missing guides |
+ | `api_bug` | Rate limiting bugs, pagination issues, webhook failures |
+
+ Each category has 5–6 title templates × 2 body templates × 6–12 variables = hundreds of unique combinations. The 15 original handcrafted bugs are preserved as a high-quality subset (40% chance per sample).
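Template-based generation with a handcrafted fallback could look roughly like the sketch below. The 40% figure comes from the README; the template strings, variable names, the single `crash` category shown, and the function `sample_bug` are hypothetical stand-ins, since the contents of `server/task.py` are not part of this diff.

```python
import random

# Hypothetical templates; the real categories and wording live in server/task.py.
TEMPLATES = {
    "crash": {
        "titles": ["{service} crashes on {trigger}", "Unhandled exception in {service}"],
        "bodies": ["Since the last deploy, {service} fails during {trigger}."],
        "variables": {
            "service": ["auth-api", "billing-worker"],
            "trigger": ["startup", "login"],
        },
    },
}

# Placeholder for the 15 handcrafted bugs mentioned in the README.
HANDCRAFTED = [{"title": "Login page returns 500", "body": "Example handcrafted report."}]


def sample_bug(rng: random.Random, handcrafted_prob: float = 0.4) -> dict:
    """Draw a handcrafted bug with the given probability, else fill a template."""
    if rng.random() < handcrafted_prob:
        return dict(rng.choice(HANDCRAFTED), source="handcrafted")
    category = rng.choice(sorted(TEMPLATES))
    spec = TEMPLATES[category]
    # Pick one value per variable, then substitute into a random title and body.
    values = {name: rng.choice(options) for name, options in spec["variables"].items()}
    return {
        "title": rng.choice(spec["titles"]).format(**values),
        "body": rng.choice(spec["bodies"]).format(**values),
        "category": category,
        "source": "procedural",
    }
```

Passing a seeded `random.Random` keeps sampling reproducible, which is one plausible reason for the `seed` parameter added to `reset()` in `client.py`.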
133
 
134
  ---
135
 
 
149
  docker run -p 7860:7860 bug-triage-env
150
  ```
151
 
152
+ ### Run Tests
153
+ ```bash
154
+ pip install -e ".[dev]"
155
+ pytest tests/ -v
156
+ ```
157
+
158
+ ### Run Inference (Hackathon Submission)
159
  ```bash
160
  pip install openai openenv-core requests pydantic
161
  export API_BASE_URL=https://router.huggingface.co/v1
 
167
 
168
  ### Environment Variables
169
 
+ | Variable | Description | Required |
+ |----------|-------------|----------|
+ | `API_BASE_URL` | LLM API endpoint | Yes |
+ | `MODEL_NAME` | Model identifier for inference | Yes |
+ | `HF_TOKEN` | Hugging Face / API key | Yes |
+ | `ENV_BASE_URL` | Bug Triage environment URL | Optional |
176
 
177
  ---
178
 
179
  ## API Endpoints
180
 
+ | Method | Endpoint | Description |
+ |--------|----------|-------------|
+ | GET | `/` | Interactive demo frontend |
+ | GET | `/health` | Health check + active sessions |
+ | POST | `/reset` | Start new episode (returns session_id) |
+ | POST | `/step` | Investigation or submit action |
+ | GET | `/state` | Current episode state |
+ | GET | `/tasks` | List all 3 tasks |
+ | GET | `/tasks/{id}` | Task metadata |
+ | GET | `/leaderboard` | Top agent scores |
+ | POST | `/leaderboard/submit` | Submit agent scores |
192
 
193
+ ### Example: Multi-Step Episode
194
 
  ```bash
+ # 1. Reset: get a bug and session_id
  curl -X POST https://siteshcodes-bug-triage-env.hf.space/reset \
  -H "Content-Type: application/json" \
+ -d '{"task_id": "hard"}'

+ # 2. Investigate: read full body (use session_id from step 1)
  curl -X POST https://siteshcodes-bug-triage-env.hf.space/step \
  -H "Content-Type: application/json" \
+ -d '{"session_id": "...", "action": {"action_type": "read_body"}}'
+
+ # 3. Investigate: read comments
+ curl -X POST https://siteshcodes-bug-triage-env.hf.space/step \
+ -H "Content-Type: application/json" \
+ -d '{"session_id": "...", "action": {"action_type": "read_comments"}}'
+
+ # 4. Submit triage decision
+ curl -X POST https://siteshcodes-bug-triage-env.hf.space/step \
+ -H "Content-Type: application/json" \
+ -d '{"session_id": "...", "action": {"action_type": "submit", "priority": "P0", "labels": ["bug", "security"], "assigned_team": "security", "milestone": "hotfix", "reasoning": "SQL injection in production: critical security vulnerability"}}'
  ```
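The same four-call episode can be sketched in Python by building the request bodies shown in the curl example. The helper names (`reset_payload`, `step_payload`, `run_episode`) are hypothetical, and reading `session_id` from the top level of the reset response is an assumption; the diff only states that `/reset` returns a session_id. The HTTP call is injected as a function so the sketch runs without network access.

```python
from typing import Callable

BASE_URL = "https://siteshcodes-bug-triage-env.hf.space"


def reset_payload(task_id: str) -> dict:
    """Body for POST /reset."""
    return {"task_id": task_id}


def step_payload(session_id: str, action_type: str, **fields) -> dict:
    """Body for POST /step: an investigation or a submit action."""
    return {"session_id": session_id, "action": {"action_type": action_type, **fields}}


def run_episode(post: Callable[[str, dict], dict], task_id: str = "hard") -> dict:
    """Reset, investigate twice, then submit; returns the final JSON response."""
    # Assumption: the reset response exposes "session_id" at the top level.
    session_id = post(f"{BASE_URL}/reset", reset_payload(task_id))["session_id"]
    post(f"{BASE_URL}/step", step_payload(session_id, "read_body"))
    post(f"{BASE_URL}/step", step_payload(session_id, "read_comments"))
    return post(
        f"{BASE_URL}/step",
        step_payload(
            session_id,
            "submit",
            priority="P0",
            labels=["bug", "security"],
            assigned_team="security",
            milestone="hotfix",
            reasoning="SQL injection in production",
        ),
    )
```

To talk to a live server, `post` could be something like `lambda url, body: requests.post(url, json=body, timeout=30).json()`, using the `requests` library that `client.py` already imports.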
216
 
217
  ---
218
 
219
  ## Inference Log Format
220
 
221
+ Structured logs per OpenEnv spec (3 tasks, each with its own block):
222
 
  ```
  [START] task=easy env=bug-triage-env model=meta-llama/Llama-3.3-70B-Instruct
+ [STEP] step=1 action=investigate:read_body reward=0.00 done=false error=null
+ [STEP] step=2 action=investigate:read_comments reward=0.00 done=false error=null
+ [STEP] step=3 action=priority=P0,team=backend,milestone=hotfix reward=0.95 done=true error=null
+ [END] success=true steps=3 score=0.95 rewards=0.95

  [START] task=medium env=bug-triage-env model=meta-llama/Llama-3.3-70B-Instruct
+ [STEP] step=1 action=investigate:read_body reward=0.00 done=false error=null
+ [STEP] step=2 action=investigate:read_comments reward=0.00 done=false error=null
+ [STEP] step=3 action=priority=P0,team=backend,milestone=hotfix reward=0.85 done=true error=null
+ [END] success=true steps=3 score=0.85 rewards=0.85

  [START] task=hard env=bug-triage-env model=meta-llama/Llama-3.3-70B-Instruct
+ [STEP] step=1 action=investigate:read_body reward=0.00 done=false error=null
+ [STEP] step=2 action=investigate:read_comments reward=0.00 done=false error=null
+ [STEP] step=3 action=priority=P0,team=security,milestone=hotfix reward=0.92 done=true error=null
+ [END] success=true steps=3 score=0.92 rewards=0.92
  ```
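Log lines in the shape shown above could be produced by small formatting helpers like the ones below. The function names are hypothetical and this is not taken from `inference.py`. The sample shows a single value after `rewards=`, so the sketch takes one float there; whether the real script emits one value or a list is not visible in this diff.

```python
from typing import Optional


def format_start(task: str, env: str, model: str) -> str:
    """Build the [START] line for one task."""
    return f"[START] task={task} env={env} model={model}"


def format_step(step: int, action: str, reward: float, done: bool,
                error: Optional[str] = None) -> str:
    """Build one [STEP] line; booleans are lowercase and a missing error is 'null'."""
    error_text = "null" if error is None else error
    return (
        f"[STEP] step={step} action={action} reward={reward:.2f} "
        f"done={str(done).lower()} error={error_text}"
    )


def format_end(success: bool, steps: int, score: float, rewards: float) -> str:
    """Build the [END] line that closes a task block."""
    return (
        f"[END] success={str(success).lower()} steps={steps} "
        f"score={score:.2f} rewards={rewards:.2f}"
    )
```

Keeping the formatting in one place makes it easier to keep every task block in the same `[START]`, `[STEP]`, `[END]` order.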
242
 
 
 
243
  ---
244
 
245
  ## Project Structure
 
  ```
  bug-triage-env/
  ├── server/
+ │   ├── app.py # FastAPI routes + session management
+ │   ├── environment.py # Multi-step environment + SessionManager
+ │   ├── task.py # 200+ bugs (procedural + handcrafted) + semantic grading
  │   ├── __init__.py
+ │   ├── requirements.txt
+ │   └── static/
+ │       └── index.html # Interactive demo
+ ├── tests/
+ │   ├── test_grading.py # Grading logic tests
+ │   ├── test_environment.py # Environment flow tests
+ │   └── test_api.py # HTTP endpoint integration tests
  ├── model.py # Pydantic models (TriageAction, TriageObservation, TriageState)
+ ├── client.py # HTTP client (single source of truth)
+ ├── inference.py # Multi-step OpenAI agent (hackathon submission)
+ ├── baseline.py # Groq baseline agent
+ ├── openenv.yaml # OpenEnv spec manifest
+ ├── Dockerfile # Docker config
+ ├── pyproject.toml # Package metadata + dev deps
  └── README.md
  ```
270
 
 
272
 
273
  ## OpenEnv Spec Compliance
274
 
+ | Requirement | Status |
+ |-------------|--------|
  | Typed models (Action/Observation/State) | ✅ |
+ | `step()` / `reset()` / `state()` API | ✅ |
+ | `openenv.yaml` manifest | ✅ |
+ | 3+ tasks with graders (easy → hard) | ✅ |
+ | Reward range strictly (0.0, 1.0) | ✅ |
+ | Multi-step episodes | ✅ |
  | Baseline inference with reproducible scores | ✅ |
+ | Dockerfile builds | ✅ |
+ | Deployed on HF Spaces | ✅ |
+ | Structured `[START]/[STEP]/[END]` logs | ✅ |
+ | Session-based concurrency | ✅ |
+ | 50+ automated tests | ✅ |
289
 
290
  ---
291
 
__pycache__/client.cpython-314.pyc DELETED
Binary file (5.72 kB)
 
__pycache__/model.cpython-314.pyc DELETED
Binary file (4.18 kB)
 
baseline.py CHANGED
@@ -1,17 +1,16 @@
1
  # baseline.py
2
- # Runs a Groq-hosted LLaMA model against all 3 tasks
3
  # Set env vars: GROQ_API_KEY, ENV_BASE_URL (optional)
4
 
5
  import os
6
  import json
 
7
  from groq import Groq
8
  from client import BugTriageClient
9
  from model import TriageAction
10
- import time
11
 
12
- # ── config ─────────────────────────────────────────────────
13
  GROQ_API_KEY = os.getenv("GROQ_API_KEY")
14
- MODEL = "llama-3.3-70b-versatile" # strong + free on Groq
15
  TEMPERATURE = 0.0
16
  MAX_TOKENS = 400
17
 
@@ -40,12 +39,19 @@ Milestones: hotfix | v2.1 | backlog"""
40
 
41
  def format_bug(obs) -> str:
42
  bug = obs.bug_report
43
- return (
44
- f"Title: {bug.title}\n\n"
45
- f"Description:\n{bug.body}\n\n"
46
- f"Existing labels: {', '.join(bug.labels_hint) or 'none'}\n"
47
- f"Comments:\n" + "\n".join(f" - {c}" for c in bug.comments)
48
- )
 
 
 
 
 
 
 
49
 
50
 
51
  def call_model(groq_client: Groq, bug_text: str) -> TriageAction:
@@ -60,7 +66,6 @@ def call_model(groq_client: Groq, bug_text: str) -> TriageAction:
60
  )
61
  raw = response.choices[0].message.content.strip()
62
 
63
- # strip accidental markdown fences
64
  if raw.startswith("```"):
65
  raw = raw.split("```")[1]
66
  if raw.startswith("json"):
@@ -68,6 +73,7 @@ def call_model(groq_client: Groq, bug_text: str) -> TriageAction:
68
 
69
  data = json.loads(raw)
70
  return TriageAction(
 
71
  priority=data["priority"],
72
  labels=data.get("labels", []),
73
  assigned_team=data.get("assigned_team", "backend"),
@@ -78,26 +84,39 @@ def call_model(groq_client: Groq, bug_text: str) -> TriageAction:
78
 
79
  def main():
80
  if not GROQ_API_KEY:
81
- raise EnvironmentError("GROQ_API_KEY not set. Get a free key at console.groq.com")
 
82
 
83
  groq_client = Groq(api_key=GROQ_API_KEY)
84
  scores = {}
85
- step_count = 0
86
 
87
  print("=" * 50)
88
- print(" Bug Triage Env - Baseline Inference Script")
89
  print(f" Model: {MODEL}")
90
  print("=" * 50)
91
 
 
 
92
  with BugTriageClient() as env:
93
- obs = env.reset()
94
- MAX_STEPS = 3
95
- step_count = 0
96
- while not obs.done and step_count < MAX_STEPS:
97
- task = obs.task_id
98
- print(f"\n── Task: {task.upper()} ──")
99
  print(f" Bug: {obs.bug_report.title}")
100
 
 
 
 
 
 
 
 
 
 
 
 
 
 
101
  bug_text = format_bug(obs)
102
  action = call_model(groq_client, bug_text)
103
 
@@ -112,21 +131,19 @@ def main():
112
  print(f" ✓ Reward: {result.reward:.3f}")
113
  print(f" ✓ Feedback: {obs.feedback}")
114
 
115
- scores[task] = result.reward
116
- step_count += 1
117
  time.sleep(2)
118
 
119
  print("\n" + "=" * 50)
120
  print(" BASELINE SCORES")
121
  print("=" * 50)
122
- task_order = ["easy", "medium", "hard"]
123
  total = 0.0
124
  for task in task_order:
125
  s = scores.get(task, 0.0)
126
  bar = "█" * int(s * 20) + "░" * (20 - int(s * 20))
127
  print(f" {task:<8} {bar} {s:.3f}")
128
  total += s
129
- avg = total / max(step_count, 1)
130
  print(f"\n Average score: {avg:.3f}")
131
  print("=" * 50)
132
 
 
1
  # baseline.py
2
+ # Runs a Groq-hosted LLaMA model against all 3 tasks with multi-step investigation
3
  # Set env vars: GROQ_API_KEY, ENV_BASE_URL (optional)
4
 
5
  import os
6
  import json
7
+ import time
8
  from groq import Groq
9
  from client import BugTriageClient
10
  from model import TriageAction
 
11
 
 
12
  GROQ_API_KEY = os.getenv("GROQ_API_KEY")
13
+ MODEL = "llama-3.3-70b-versatile"
14
  TEMPERATURE = 0.0
15
  MAX_TOKENS = 400
16
 
 
39
 
  def format_bug(obs) -> str:
      bug = obs.bug_report
+     parts = [f"Title: {bug.title}", f"\nDescription:\n{bug.body}"]
+
+     if obs.comments_visible and bug.comments:
+         comments = "\n".join(f" - {c}" for c in bug.comments)
+         parts.append(f"\nComments:\n{comments}")
+
+     if bug.labels_hint:
+         parts.append(f"\nExisting labels: {', '.join(bug.labels_hint)}")
+
+     if obs.logs_visible and bug.stack_trace:
+         parts.append(f"\nStack trace: {bug.stack_trace}")
+
+     return "\n".join(parts)
55
 
56
 
57
  def call_model(groq_client: Groq, bug_text: str) -> TriageAction:
 
66
  )
67
  raw = response.choices[0].message.content.strip()
68
 
 
69
  if raw.startswith("```"):
70
  raw = raw.split("```")[1]
71
  if raw.startswith("json"):
 
73
 
74
  data = json.loads(raw)
75
  return TriageAction(
76
+ action_type="submit",
77
  priority=data["priority"],
78
  labels=data.get("labels", []),
79
  assigned_team=data.get("assigned_team", "backend"),
 
84
 
85
  def main():
86
  if not GROQ_API_KEY:
87
+ raise EnvironmentError(
88
+ "GROQ_API_KEY not set. Get a free key at console.groq.com")
89
 
90
  groq_client = Groq(api_key=GROQ_API_KEY)
91
  scores = {}
 
92
 
93
  print("=" * 50)
94
+ print(" Bug Triage Env - Baseline (Multi-Step Agent)")
95
  print(f" Model: {MODEL}")
96
  print("=" * 50)
97
 
+     task_order = ["easy", "medium", "hard"]
+
      with BugTriageClient() as env:
+         for task_id in task_order:
+             obs = env.reset(task_id=task_id)
+
+             print(f"\n── Task: {task_id.upper()} ──")
              print(f" Bug: {obs.bug_report.title}")

+             # Step 1: Read full body
+             if not obs.body_visible:
+                 result = env.investigate("read_body")
+                 obs = result.observation
+                 print(f" 📖 Investigated: read_body")
+
+             # Step 2: Read comments
+             if not obs.comments_visible:
+                 result = env.investigate("read_comments")
+                 obs = result.observation
+                 print(f" 💬 Investigated: read_comments")
+
+             # Step 3: Submit triage
              bug_text = format_bug(obs)
              action = call_model(groq_client, bug_text)
122
 
 
              print(f" ✓ Reward: {result.reward:.3f}")
              print(f" ✓ Feedback: {obs.feedback}")

+             scores[task_id] = result.reward
              time.sleep(2)

      print("\n" + "=" * 50)
      print(" BASELINE SCORES")
      print("=" * 50)
      total = 0.0
      for task in task_order:
          s = scores.get(task, 0.0)
          bar = "█" * int(s * 20) + "░" * (20 - int(s * 20))
          print(f" {task:<8} {bar} {s:.3f}")
          total += s
+     avg = total / max(len(scores), 1)
      print(f"\n Average score: {avg:.3f}")
      print("=" * 50)
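The fence-stripping step inside `call_model` is only partly visible in this diff. A self-contained sketch of the same idea follows; the helper name `parse_model_json` is hypothetical, and the removal of a leading `json` language tag is an assumption about the elided line that follows the visible `startswith("json")` check.

```python
import json

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the example readable


def parse_model_json(raw: str) -> dict:
    """Parse JSON from an LLM reply that may be wrapped in a Markdown code fence."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Take the content between the first pair of fences.
        text = text.split(FENCE)[1]
        if text.startswith("json"):
            # Assumption: drop the language tag that follows the opening fence.
            text = text[len("json"):]
    return json.loads(text.strip())
```

A reply that is neither bare JSON nor fenced JSON still raises `json.JSONDecodeError`, so a caller such as `call_model` may want to catch that and fall back to a default action.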
 
bug_triage_client.py DELETED
@@ -1,75 +0,0 @@
1
- # client.py
2
- import os
3
- import requests
4
- from typing import Optional
5
- from model import TriageAction, TriageObservation, BugReport
6
-
7
-
8
- class StepResult:
9
- def __init__(self, observation: TriageObservation, reward: float, done: bool, info: dict):
10
- self.observation = observation
11
- self.reward = reward
12
- self.done = done
13
- self.info = info
14
-
15
-
16
- def _parse_observation(data: dict) -> TriageObservation:
17
- bug_data = data["bug_report"]
18
- bug = BugReport(**bug_data)
19
- return TriageObservation(
20
- bug_report=bug,
21
- task_id=data.get("task_id", "easy"),
22
- score=data.get("score", 0.0),
23
- feedback=data.get("feedback", ""),
24
- done=data.get("done", False),
25
- reward=data.get("reward", 0.0),
26
- )
27
-
28
-
29
- class BugTriageClient:
30
- def __init__(self, base_url: Optional[str] = None):
31
- self.base_url = (
32
- base_url
33
- or os.getenv("ENV_BASE_URL", "https://siteshcodes-bug-triage-env.hf.space")
34
- ).rstrip("/")
35
- self.session = requests.Session()
36
- self.session.headers.update({"Content-Type": "application/json"})
37
-
38
- def reset(self) -> TriageObservation:
39
- response = self.session.post(f"{self.base_url}/reset", json={}, timeout=30)
40
- response.raise_for_status()
41
- data = response.json()
42
- obs_data = data.get("observation", data)
43
- return _parse_observation(obs_data)
44
-
45
- def step(self, action: TriageAction) -> StepResult:
46
- try:
47
- action_dict = action.model_dump()
48
- except AttributeError:
49
- action_dict = action.dict()
50
- payload = {"action": action_dict}
51
- response = self.session.post(f"{self.base_url}/step", json=payload, timeout=30)
52
- response.raise_for_status()
53
- data = response.json()
54
- obs_data = data.get("observation", data)
55
- obs = _parse_observation(obs_data)
56
- return StepResult(
57
- observation=obs,
58
- reward=data.get("reward", obs.reward) or 0.0,
59
- done=data.get("done", obs.done),
60
- info={},
61
- )
62
-
63
- def state(self) -> dict:
64
- response = self.session.get(f"{self.base_url}/state", timeout=30)
65
- response.raise_for_status()
66
- return response.json()
67
-
68
- def close(self):
69
- self.session.close()
70
-
71
- def __enter__(self):
72
- return self
73
-
74
- def __exit__(self, *args):
75
- self.close()
client.py CHANGED
@@ -1,12 +1,14 @@
1
- # client.py
2
  import os
3
  import requests
4
- from typing import Optional
5
  from model import TriageAction, TriageObservation, BugReport
6
 
7
 
8
  class StepResult:
9
- def __init__(self, observation: TriageObservation, reward: float, done: bool, info: dict):
 
 
10
  self.observation = observation
11
  self.reward = reward
12
  self.done = done
@@ -14,11 +16,13 @@ class StepResult:
14
 
15
 
16
  def _parse_observation(data: dict) -> TriageObservation:
 
17
  bug_data = data["bug_report"]
18
  try:
19
  bug = BugReport.model_validate(bug_data)
20
  except Exception:
21
  bug = BugReport(**bug_data)
 
22
  return TriageObservation(
23
  bug_report=bug,
24
  task_id=data.get("task_id", "easy"),
@@ -26,10 +30,18 @@ def _parse_observation(data: dict) -> TriageObservation:
26
  feedback=data.get("feedback", ""),
27
  done=data.get("done", False),
28
  reward=data.get("reward", 0.0),
 
 
 
 
 
 
29
  )
30
 
31
 
32
  class BugTriageClient:
 
 
33
  def __init__(self, base_url: Optional[str] = None):
34
  self.base_url = (
35
  base_url
@@ -37,39 +49,91 @@ class BugTriageClient:
37
  ).rstrip("/")
38
  self.session = requests.Session()
39
  self.session.headers.update({"Content-Type": "application/json"})
 
 
 
 
 
 
 
 
 
 
 
 
 
40
 
41
- def reset(self, task_id: str = "easy") -> TriageObservation:
42
  response = self.session.post(
43
- f"{self.base_url}/reset",
44
- json={"task_id": task_id},
45
- timeout=30,
46
  )
47
  response.raise_for_status()
48
  data = response.json()
49
- return _parse_observation(data.get("observation", data))
 
 
 
50
 
51
  def step(self, action: TriageAction) -> StepResult:
 
52
  try:
53
- action_dict = action.model_dump() # Pydantic v2
54
  except AttributeError:
55
- action_dict = action.dict() # Pydantic v1 fallback
 
 
 
 
 
56
  response = self.session.post(
57
- f"{self.base_url}/step",
58
- json={"action": action_dict},
59
- timeout=30,
60
  )
61
  response.raise_for_status()
62
  data = response.json()
63
- obs = _parse_observation(data.get("observation", data))
 
 
 
 
 
 
 
 
 
 
64
  return StepResult(
65
  observation=obs,
66
- reward=data.get("reward", obs.reward) or 0.0,
67
  done=data.get("done", obs.done),
68
- info={},
69
  )
70
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
71
  def state(self) -> dict:
72
- response = self.session.get(f"{self.base_url}/state", timeout=30)
 
 
 
 
 
 
73
  response.raise_for_status()
74
  return response.json()
75
 
 
1
+ # client.py - Single source of truth for environment client
2
  import os
3
  import requests
4
+ from typing import Optional, List
5
  from model import TriageAction, TriageObservation, BugReport
6
 
7
 
8
  class StepResult:
9
+ """Result returned by env.step()."""
10
+ def __init__(self, observation: TriageObservation, reward: float,
11
+ done: bool, info: dict):
12
  self.observation = observation
13
  self.reward = reward
14
  self.done = done
 
16
 
17
 
18
  def _parse_observation(data: dict) -> TriageObservation:
19
+ """Parse a JSON dict into a TriageObservation."""
20
  bug_data = data["bug_report"]
21
  try:
22
  bug = BugReport.model_validate(bug_data)
23
  except Exception:
24
  bug = BugReport(**bug_data)
25
+
26
  return TriageObservation(
27
  bug_report=bug,
28
  task_id=data.get("task_id", "easy"),
 
30
  feedback=data.get("feedback", ""),
31
  done=data.get("done", False),
32
  reward=data.get("reward", 0.0),
33
+ body_visible=data.get("body_visible", False),
34
+ comments_visible=data.get("comments_visible", False),
35
+ logs_visible=data.get("logs_visible", False),
36
+ similar_visible=data.get("similar_visible", False),
37
+ steps_taken=data.get("steps_taken", 0),
38
+ max_steps=data.get("max_steps", 6),
39
  )
40
 
41
 
42
  class BugTriageClient:
43
+ """HTTP client for the Bug Triage Environment server."""
44
+
45
  def __init__(self, base_url: Optional[str] = None):
46
  self.base_url = (
47
  base_url
 
49
  ).rstrip("/")
50
  self.session = requests.Session()
51
  self.session.headers.update({"Content-Type": "application/json"})
52
+ self._session_id: Optional[str] = None
53
+
54
+ @property
55
+ def session_id(self) -> Optional[str]:
56
+ return self._session_id
57
+
58
+ def reset(self, task_id: str = "easy", seed: Optional[int] = None) -> TriageObservation:
59
+ """Start a new episode. Stores session_id for subsequent step() calls."""
60
+ payload = {"task_id": task_id}
61
+ if seed is not None:
62
+ payload["seed"] = seed
63
+ if self._session_id:
64
+ payload["session_id"] = self._session_id
65
 
 
66
  response = self.session.post(
67
+ f"{self.base_url}/reset", json=payload, timeout=30,
 
 
68
  )
69
  response.raise_for_status()
70
  data = response.json()
71
+
72
+ self._session_id = data.get("session_id")
73
+ obs_data = data.get("observation", data)
74
+ return _parse_observation(obs_data)
75
 
76
  def step(self, action: TriageAction) -> StepResult:
77
+ """Send an action (investigation or submit) and get the result."""
78
  try:
79
+ action_dict = action.model_dump()
80
  except AttributeError:
81
+ action_dict = action.dict()
82
+
83
+ payload = {"action": action_dict}
84
+ if self._session_id:
85
+ payload["session_id"] = self._session_id
86
+
87
  response = self.session.post(
88
+ f"{self.base_url}/step", json=payload, timeout=30,
 
 
89
  )
90
  response.raise_for_status()
91
  data = response.json()
92
+
93
+ obs_data = data.get("observation", data)
94
+ obs = _parse_observation(obs_data)
95
+
96
+ reward = data.get("reward", obs.reward) or 0.0
97
+ reward = float(reward)
98
+
99
+ # Update session_id if server returned one
100
+ if "session_id" in data:
101
+ self._session_id = data["session_id"]
102
+
103
  return StepResult(
104
  observation=obs,
105
+ reward=reward,
106
  done=data.get("done", obs.done),
107
+ info=data.get("info", {}),
108
  )
109
 
110
+ def investigate(self, action_type: str) -> StepResult:
111
+ """Shortcut for investigation actions."""
112
+ action = TriageAction(action_type=action_type)
113
+ return self.step(action)
114
+
115
+ def submit(self, priority: str, labels: Optional[List[str]] = None,
116
+ assigned_team: str = "backend", milestone: str = "backlog",
117
+ reasoning: str = "") -> StepResult:
118
+ """Shortcut for submitting the final triage decision."""
119
+ action = TriageAction(
120
+ action_type="submit",
121
+ priority=priority,
122
+ labels=labels or ["bug"],
123
+ assigned_team=assigned_team,
124
+ milestone=milestone,
125
+ reasoning=reasoning,
126
+ )
127
+ return self.step(action)
128
+
129
  def state(self) -> dict:
130
+ """Get current environment state."""
131
+ params = {}
132
+ if self._session_id:
133
+ params["session_id"] = self._session_id
134
+ response = self.session.get(
135
+ f"{self.base_url}/state", params=params, timeout=30,
136
+ )
137
  response.raise_for_status()
138
  return response.json()
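Review note: the new `reset()` and `step()` thread a `session_id` through every request, but they treat a response without that key differently. `reset()` overwrites the stored value with `data.get("session_id")`, so a missing key clears it, while `step()` only updates the stored value when the key is present. The sketch below isolates that bookkeeping with no HTTP involved. `build_payload` and `next_session_id` are hypothetical helper names used for illustration; they are not functions in this commit.

```python
from typing import Any, Dict, Optional


def build_payload(body: Dict[str, Any], session_id: Optional[str]) -> Dict[str, Any]:
    """Copy the request body and attach session_id only when one is stored."""
    payload = dict(body)
    if session_id:
        payload["session_id"] = session_id
    return payload


def next_session_id(
    current: Optional[str], response: Dict[str, Any], *, is_reset: bool
) -> Optional[str]:
    """Mirror how the client updates its stored session id after a response."""
    if is_reset:
        # reset(): unconditional overwrite, so a missing key yields None.
        return response.get("session_id")
    # step(): keep the current id unless the server sent a new one.
    return response["session_id"] if "session_id" in response else current


# Hypothetical exchange: reset returns an id, the next step reuses it.
sid = next_session_id(None, {"session_id": "abc"}, is_reset=True)
step_payload = build_payload({"action": {"action_type": "read_body"}}, sid)
print(step_payload)
```

If the server is expected to return `session_id` on every `/reset`, the overwrite is harmless; if not, the `step()`-style conditional update may be the safer choice for both methods.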
139
 
inference.py CHANGED
@@ -20,6 +20,10 @@ from openai import OpenAI
20
  from model import TriageAction, TriageObservation, BugReport
21
 
22
 
 
 
 
 
23
  API_BASE_URL = os.getenv("API_BASE_URL") or "https://router.huggingface.co/v1"
24
  API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY") or os.getenv("OPENAI_API_KEY")
25
  MODEL_NAME = os.getenv("MODEL_NAME") or "meta-llama/Llama-3.3-70B-Instruct"
@@ -31,9 +35,9 @@ if not API_KEY:
31
  TASK_IDS = ["easy", "medium", "hard"]
32
  BENCHMARK = "bug-triage-env"
33
  TEMPERATURE = 0.0
34
- MAX_TOKENS = 400
35
- MAX_STEPS = 1 # Each task is 1 step (reset → step → done)
36
- MAX_TOTAL_REWARD = 1.0 # Per-task max reward
37
  SUCCESS_SCORE_THRESHOLD = 0.4
38
 
39
  print(f"[CONFIG] API_BASE_URL={API_BASE_URL}", flush=True)
@@ -41,7 +45,10 @@ print(f"[CONFIG] MODEL_NAME={MODEL_NAME}", flush=True)
41
  print(f"[CONFIG] ENV_BASE_URL={ENV_BASE_URL}", flush=True)
42
  print(f"[CONFIG] API_KEY={'set' if API_KEY else 'MISSING'}", flush=True)
43
 
44
- #inlined client
 
 
 
45
 
46
  def _parse_observation(data: dict) -> TriageObservation:
47
  try:
@@ -51,15 +58,22 @@ def _parse_observation(data: dict) -> TriageObservation:
51
  return TriageObservation(
52
  bug_report=bug,
53
  task_id=data.get("task_id", "easy"),
54
- score=data.get("score", 0.05),
55
  feedback=data.get("feedback", ""),
56
  done=data.get("done", False),
57
- reward=data.get("reward", 0.05),
 
 
 
 
 
 
58
  )
59
 
60
 
61
  class StepResult:
62
- def __init__(self, observation: TriageObservation, reward: float, done: bool, info: dict):
 
63
  self.observation = observation
64
  self.reward = reward
65
  self.done = done
@@ -71,42 +85,53 @@ class BugTriageClient:
71
  self.base_url = (base_url or ENV_BASE_URL).rstrip("/")
72
  self.session = requests.Session()
73
  self.session.headers.update({"Content-Type": "application/json"})
 
74
 
75
  def reset(self, task_id: str = "easy") -> TriageObservation:
76
  print(f"[ENV] Resetting env for task={task_id}", flush=True)
 
 
 
 
77
  response = self.session.post(
78
- f"{self.base_url}/reset",
79
- json={"task_id": task_id},
80
- timeout=30,
81
  )
82
  response.raise_for_status()
83
  data = response.json()
 
84
  return _parse_observation(data.get("observation", data))
85
 
86
  def step(self, action: TriageAction) -> StepResult:
87
- print("[ENV] Sending step action...", flush=True)
88
  try:
89
  action_dict = action.model_dump()
90
  except AttributeError:
91
  action_dict = action.dict()
 
 
 
 
 
92
  response = self.session.post(
93
- f"{self.base_url}/step",
94
- json={"action": action_dict},
95
- timeout=30,
96
  )
97
  response.raise_for_status()
98
  data = response.json()
99
  obs = _parse_observation(data.get("observation", data))
 
100
  reward = data.get("reward", obs.reward)
101
- if reward is None or reward == 0:
102
- reward = 0.05
103
  reward = float(reward)
104
- reward = max(0.01, min(0.99, reward))
 
 
 
 
 
105
  return StepResult(
106
- observation=obs,
107
- reward=reward,
108
- done=data.get("done", obs.done),
109
- info={},
110
  )
111
 
112
  def close(self):
@@ -119,12 +144,14 @@ class BugTriageClient:
119
  self.close()
120
 
121
 
122
-
 
 
123
 
124
  SYSTEM_PROMPT = textwrap.dedent("""
125
- You are a senior software engineering manager.
126
- You will receive a bug report and must triage it. Respond ONLY with
127
- valid JSON - no markdown, no explanation, no backticks.
128
 
129
  Return exactly this structure:
130
  {
@@ -143,22 +170,44 @@ SYSTEM_PROMPT = textwrap.dedent("""
143
 
144
  Teams: backend | frontend | infra | security | devx
145
  Milestones: hotfix | v2.1 | backlog
 
 
 
146
  """).strip()
147
 
 
 
 
 
148
 
 
149
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
150
 
151
  def log_start(task: str, env: str, model: str) -> None:
152
  print(f"[START] task={task} env={env} model={model}", flush=True)
153
 
154
 
155
- def log_step(
156
- step: int,
157
- action: str,
158
- reward: float,
159
- done: bool,
160
- error: Optional[str] = None,
161
- ) -> None:
162
  print(
163
  f"[STEP] step={step} action={action} "
164
  f"reward={reward:.2f} done={str(done).lower()} error={error or 'null'}",
@@ -166,7 +215,8 @@ def log_step(
166
  )
167
 
168
 
169
- def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> None:
 
170
  rewards_str = ",".join(f"{r:.2f}" for r in rewards)
171
  print(
172
  f"[END] success={str(success).lower()} steps={steps} "
@@ -175,21 +225,97 @@ def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> No
175
  )
176
 
177
 
178
-
 
 
179
 
180
  def format_bug(obs: TriageObservation) -> str:
 
181
  bug = obs.bug_report
182
- comments = "\n".join(f" - {c}" for c in bug.comments) if bug.comments else " None"
183
- return (
184
- f"Title: {bug.title}\n\n"
185
- f"Description:\n{bug.body}\n\n"
186
- f"Existing labels: {', '.join(bug.labels_hint) if bug.labels_hint else 'none'}\n"
187
- f"Comments:\n{comments}"
188
- )
189
 
190
 
191
  def call_model(client: OpenAI, bug_text: str) -> TriageAction:
192
- print("[LLM] Sending request to model...", flush=True)
 
193
 
194
  completion = client.chat.completions.create(
195
  model=MODEL_NAME,
@@ -218,6 +344,7 @@ def call_model(client: OpenAI, bug_text: str) -> TriageAction:
218
  data = {}
219
 
220
  action = TriageAction(
 
221
  priority=data.get("priority", "P2"),
222
  labels=data.get("labels", ["bug"]),
223
  assigned_team=data.get("assigned_team", "backend"),
@@ -233,12 +360,13 @@ def call_model(client: OpenAI, bug_text: str) -> TriageAction:
233
  return action
234
 
235
 
236
-
 
 
237
 
238
  def main() -> None:
239
  client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
240
 
241
-
242
  all_scores = []
243
 
244
  with BugTriageClient(base_url=ENV_BASE_URL) as env:
@@ -247,32 +375,90 @@ def main() -> None:
247
  score = 0.0
248
  success = False
249
  steps_taken = 0
 
250
  log_start(task=task_id, env=BENCHMARK, model=MODEL_NAME)
251
 
252
  try:
253
  obs = env.reset(task_id=task_id)
254
- action = call_model(client, format_bug(obs))
255
- result = env.step(action)
256
-
257
- reward = float(result.reward or 0.05)
258
- reward = max(0.01, min(0.99, reward))
259
- rewards.append(reward)
260
- steps_taken = 1
261
-
262
- action_str = (
263
- f"priority={action.priority},"
264
- f"team={action.assigned_team},"
265
- f"milestone={action.milestone}"
266
- )
267
- log_step(
268
- step=1,
269
- action=action_str,
270
- reward=reward,
271
- done=True,
272
- )
273
-
274
-
275
- score = sum(rewards) / MAX_TOTAL_REWARD if MAX_TOTAL_REWARD > 0 else 0.0
276
  score = min(max(score, 0.01), 0.99)
277
  success = score >= SUCCESS_SCORE_THRESHOLD
278
 
@@ -282,15 +468,17 @@ def main() -> None:
282
  score = min(max(score, 0.01), 0.99)
283
  success = False
284
 
285
- # [END] for this task
286
  log_end(success, steps_taken, score, rewards)
287
  all_scores.append(score)
288
 
289
  time.sleep(0.5)
290
 
291
-
292
  avg_score = sum(all_scores) / len(all_scores) if all_scores else 0.0
293
- print(f"[SUMMARY] tasks={len(all_scores)} avg_score={avg_score:.2f} scores={all_scores}", flush=True)
 
 
 
 
294
 
295
 
296
  if __name__ == "__main__":
 
20
  from model import TriageAction, TriageObservation, BugReport
21
 
22
 
23
+ # ---------------------------------------------------------------------------
24
+ # CONFIG - uses env vars required by hackathon spec
25
+ # ---------------------------------------------------------------------------
26
+
27
  API_BASE_URL = os.getenv("API_BASE_URL") or "https://router.huggingface.co/v1"
28
  API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY") or os.getenv("OPENAI_API_KEY")
29
  MODEL_NAME = os.getenv("MODEL_NAME") or "meta-llama/Llama-3.3-70B-Instruct"
 
35
  TASK_IDS = ["easy", "medium", "hard"]
36
  BENCHMARK = "bug-triage-env"
37
  TEMPERATURE = 0.0
38
+ MAX_TOKENS = 500
39
+ MAX_STEPS = 4 # Max steps per task (investigate + submit)
40
+ MAX_TOTAL_REWARD = 1.0
41
  SUCCESS_SCORE_THRESHOLD = 0.4
42
 
43
  print(f"[CONFIG] API_BASE_URL={API_BASE_URL}", flush=True)
 
45
  print(f"[CONFIG] ENV_BASE_URL={ENV_BASE_URL}", flush=True)
46
  print(f"[CONFIG] API_KEY={'set' if API_KEY else 'MISSING'}", flush=True)
47
 
48
+
49
+ # ---------------------------------------------------------------------------
50
+ # INLINED CLIENT - self-contained, no external dependency
51
+ # ---------------------------------------------------------------------------
52
 
53
  def _parse_observation(data: dict) -> TriageObservation:
54
  try:
 
58
  return TriageObservation(
59
  bug_report=bug,
60
  task_id=data.get("task_id", "easy"),
61
+ score=data.get("score", 0.0),
62
  feedback=data.get("feedback", ""),
63
  done=data.get("done", False),
64
+ reward=data.get("reward", 0.0),
65
+ body_visible=data.get("body_visible", False),
66
+ comments_visible=data.get("comments_visible", False),
67
+ logs_visible=data.get("logs_visible", False),
68
+ similar_visible=data.get("similar_visible", False),
69
+ steps_taken=data.get("steps_taken", 0),
70
+ max_steps=data.get("max_steps", 6),
71
  )
72
 
73
 
74
  class StepResult:
75
+ def __init__(self, observation: TriageObservation, reward: float,
76
+ done: bool, info: dict):
77
  self.observation = observation
78
  self.reward = reward
79
  self.done = done
 
85
  self.base_url = (base_url or ENV_BASE_URL).rstrip("/")
86
  self.session = requests.Session()
87
  self.session.headers.update({"Content-Type": "application/json"})
88
+ self._session_id: Optional[str] = None
89
 
90
  def reset(self, task_id: str = "easy") -> TriageObservation:
91
  print(f"[ENV] Resetting env for task={task_id}", flush=True)
92
+ payload = {"task_id": task_id}
93
+ if self._session_id:
94
+ payload["session_id"] = self._session_id
95
+
96
  response = self.session.post(
97
+ f"{self.base_url}/reset", json=payload, timeout=30,
 
 
98
  )
99
  response.raise_for_status()
100
  data = response.json()
101
+ self._session_id = data.get("session_id")
102
  return _parse_observation(data.get("observation", data))
103
 
104
  def step(self, action: TriageAction) -> StepResult:
105
+ print(f"[ENV] Sending step: action_type={action.action_type}", flush=True)
106
  try:
107
  action_dict = action.model_dump()
108
  except AttributeError:
109
  action_dict = action.dict()
110
+
111
+ payload = {"action": action_dict}
112
+ if self._session_id:
113
+ payload["session_id"] = self._session_id
114
+
115
  response = self.session.post(
116
+ f"{self.base_url}/step", json=payload, timeout=30,
 
 
117
  )
118
  response.raise_for_status()
119
  data = response.json()
120
  obs = _parse_observation(data.get("observation", data))
121
+
122
  reward = data.get("reward", obs.reward)
123
+ if reward is None:
124
+ reward = 0.0
125
  reward = float(reward)
126
+ if obs.done:
127
+ reward = max(0.01, min(0.99, reward))
128
+
129
+ if "session_id" in data:
130
+ self._session_id = data["session_id"]
131
+
132
  return StepResult(
133
+ observation=obs, reward=reward,
134
+ done=data.get("done", obs.done), info={},
 
 
135
  )
136
 
137
  def close(self):
 
144
  self.close()
145
 
146
 
147
+ # ---------------------------------------------------------------------------
148
+ # LLM PROMPTS
149
+ # ---------------------------------------------------------------------------
150
 
151
  SYSTEM_PROMPT = textwrap.dedent("""
152
+ You are a senior software engineering manager triaging a bug report.
153
+ You will receive a bug report (possibly with partial information).
154
+ Respond ONLY with valid JSON - no markdown, no explanation, no backticks.
155
 
156
  Return exactly this structure:
157
  {
 
170
 
171
  Teams: backend | frontend | infra | security | devx
172
  Milestones: hotfix | v2.1 | backlog
173
+
174
+ Important: Pay attention to security signals (SQL injection, XSS, auth bypass,
175
+ data exposure). Security bugs should almost always be P0 + security team + hotfix.
176
  """).strip()
177
 
178
+ INVESTIGATION_PROMPT = textwrap.dedent("""
179
+ You are deciding whether to investigate further or submit your triage.
180
+ You have seen the following information about a bug. Based on what you see,
181
+ decide if you need more information or can triage now.
182
 
183
+ Respond with ONLY one of these JSON formats:
184
 
185
+ To investigate: {"action": "read_body"} or {"action": "read_comments"} or {"action": "check_logs"}
186
+ To submit:
187
+ {
188
+ "action": "submit",
189
+ "priority": "P0",
190
+ "labels": ["bug"],
191
+ "assigned_team": "backend",
192
+ "milestone": "hotfix",
193
+ "reasoning": "explanation"
194
+ }
195
+
196
+ Only investigate if the title and preview are genuinely ambiguous.
197
+ If the bug is clearly a typo or clearly critical, submit immediately.
198
+ """).strip()
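Review note: `INVESTIGATION_PROMPT` asks the model to reply with an `"action"` key, while `TriageAction` names the field `action_type`. The prompt also lists three investigation actions, whereas the comment on `TriageAction.action_type` lists four (`check_similar` is the extra one). Any caller that acts on the model's reply has to translate between the two shapes. The sketch below shows one way to do that, assuming the defaults visible in `TriageAction` and `BugTriageClient.submit`. `decision_to_action` is a hypothetical helper, not part of this commit.

```python
from typing import Any, Dict

# Investigation action names taken from the TriageAction.action_type comment.
INVESTIGATION_ACTIONS = {"read_body", "read_comments", "check_logs", "check_similar"}


def decision_to_action(decision: Dict[str, Any]) -> Dict[str, Any]:
    """Map a model reply ({"action": ...}) onto TriageAction-style fields."""
    kind = decision.get("action", "submit")
    if kind in INVESTIGATION_ACTIONS:
        # Investigation actions carry no triage fields.
        return {"action_type": kind}
    # Anything else, including unknown names, is treated as a submit.
    return {
        "action_type": "submit",
        "priority": decision.get("priority", "P2"),
        "labels": decision.get("labels") or ["bug"],
        "assigned_team": decision.get("assigned_team", "backend"),
        "milestone": decision.get("milestone", "backlog"),
        "reasoning": decision.get("reasoning", ""),
    }


print(decision_to_action({"action": "check_logs"}))
```

Treating unknown action names as a submit is a design choice made here for safety; rejecting them with an error would be an equally reasonable policy.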
199
+
200
+
201
+ # ---------------------------------------------------------------------------
202
+ # STRUCTURED LOGGING - strict [START]/[STEP]/[END] format
203
+ # ---------------------------------------------------------------------------
204
 
205
  def log_start(task: str, env: str, model: str) -> None:
206
  print(f"[START] task={task} env={env} model={model}", flush=True)
207
 
208
 
209
+ def log_step(step: int, action: str, reward: float, done: bool,
210
+ error: Optional[str] = None) -> None:
 
 
 
 
 
211
  print(
212
  f"[STEP] step={step} action={action} "
213
  f"reward={reward:.2f} done={str(done).lower()} error={error or 'null'}",
 
215
  )
216
 
217
 
218
+ def log_end(success: bool, steps: int, score: float,
219
+ rewards: List[float]) -> None:
220
  rewards_str = ",".join(f"{r:.2f}" for r in rewards)
221
  print(
222
  f"[END] success={str(success).lower()} steps={steps} "
 
225
  )
226
 
227
 
228
+ # ---------------------------------------------------------------------------
229
+ # BUG FORMATTING
230
+ # ---------------------------------------------------------------------------
231
 
232
  def format_bug(obs: TriageObservation) -> str:
233
+ """Format a bug observation into text the LLM can read."""
234
  bug = obs.bug_report
235
+ parts = [f"Title: {bug.title}"]
236
+
237
+ parts.append(f"\nDescription:\n{bug.body}")
238
+
239
+ if obs.comments_visible and bug.comments:
240
+ comments = "\n".join(f" - {c}" for c in bug.comments)
241
+ parts.append(f"\nComments:\n{comments}")
242
+
243
+ if bug.labels_hint:
244
+ parts.append(f"\nExisting labels: {', '.join(bug.labels_hint)}")
245
+
246
+ if obs.logs_visible:
247
+ if bug.stack_trace:
248
+ parts.append(f"\nStack trace: {bug.stack_trace}")
249
+ if bug.affected_component:
250
+ parts.append(f"\nAffected component: {bug.affected_component}")
251
+ if bug.severity_signals:
252
+ parts.append(f"\nSeverity signals: {', '.join(bug.severity_signals)}")
253
+
254
+ if obs.similar_visible and bug.related_bugs:
255
+ parts.append(f"\nRelated bugs: {', '.join(bug.related_bugs)}")
256
+
257
+ # Add visibility context
258
+ visibility = []
259
+ if not obs.body_visible:
260
+ visibility.append("body (truncated)")
261
+ if not obs.comments_visible:
262
+ visibility.append("comments (hidden)")
263
+ if not obs.logs_visible:
264
+ visibility.append("logs (hidden)")
265
+ if visibility:
266
+ parts.append(f"\n[Hidden info: {', '.join(visibility)}]")
267
+
268
+ parts.append(f"\nSteps used: {obs.steps_taken}/{obs.max_steps}")
269
+
270
+ return "\n".join(parts)
271
+
272
+
273
+ def format_bug_for_decision(obs: TriageObservation) -> str:
274
+ """Shorter format for the investigation decision."""
275
+ bug = obs.bug_report
276
+ text = f"Title: {bug.title}\nPreview: {bug.body[:150]}"
277
+ if obs.body_visible:
278
+ text += "\n\nFull body visible."
279
+ if obs.comments_visible and bug.comments:
280
+ text += f"\nComments: {len(bug.comments)} visible."
281
+ text += f"\nSteps remaining: {obs.max_steps - obs.steps_taken}"
282
+ return text
283
+
284
+
285
+ # ---------------------------------------------------------------------------
286
+ # MODEL CALLS
287
+ # ---------------------------------------------------------------------------
288
+
289
+ def decide_action(client: OpenAI, obs: TriageObservation) -> dict:
290
+ """Ask the LLM whether to investigate or submit."""
291
+ bug_text = format_bug_for_decision(obs)
292
+
293
+ try:
294
+ completion = client.chat.completions.create(
295
+ model=MODEL_NAME,
296
+ messages=[
297
+ {"role": "system", "content": INVESTIGATION_PROMPT},
298
+ {"role": "user", "content": bug_text},
299
+ ],
300
+ temperature=TEMPERATURE,
301
+ max_tokens=200,
302
+ stream=False,
303
+ )
304
+ raw = (completion.choices[0].message.content or "").strip()
305
+ if raw.startswith("```"):
306
+ parts = raw.split("```")
307
+ raw = parts[1] if len(parts) > 1 else raw
308
+ if raw.startswith("json"):
309
+ raw = raw[4:].strip()
310
+ return json.loads(raw)
311
+ except Exception as e:
312
+ print(f"[DEBUG] Decision model call failed: {e}", flush=True)
313
+ return {"action": "submit"}
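Review note: `decide_action` strips an optional Markdown code fence before calling `json.loads`, and falls back to `{"action": "submit"}` on any exception. In the lines shown in this diff, `main()` follows a fixed policy (read body, read comments, submit) and does not appear to call `decide_action`. The sketch below isolates the same fence-stripping logic so it can be exercised without a model call. `parse_model_json` is a hypothetical name, and the extra check that the parsed value is a dict is an addition, not something the commit does.

```python
import json
from typing import Any, Dict


def parse_model_json(raw: str, fallback: Dict[str, Any]) -> Dict[str, Any]:
    """Parse a model reply that may be wrapped in a ``` or ```json fence."""
    text = raw.strip()
    if text.startswith("```"):
        # "```json\n{...}\n```".split("```") -> ["", "json\n{...}\n", ""]
        parts = text.split("```")
        text = parts[1] if len(parts) > 1 else text
        if text.startswith("json"):
            text = text[4:]
    try:
        parsed = json.loads(text.strip())
    except json.JSONDecodeError:
        return dict(fallback)
    # Added guard: a bare list or string is not a usable decision.
    return parsed if isinstance(parsed, dict) else dict(fallback)


default = {"action": "submit"}
print(parse_model_json('```json\n{"action": "read_body"}\n```', default))
```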
314
 
315
 
316
  def call_model(client: OpenAI, bug_text: str) -> TriageAction:
317
+ """Ask the LLM to triage the bug report."""
318
+ print("[LLM] Sending triage request to model...", flush=True)
319
 
320
  completion = client.chat.completions.create(
321
  model=MODEL_NAME,
 
344
  data = {}
345
 
346
  action = TriageAction(
347
+ action_type="submit",
348
  priority=data.get("priority", "P2"),
349
  labels=data.get("labels", ["bug"]),
350
  assigned_team=data.get("assigned_team", "backend"),
 
360
  return action
361
 
362
 
363
+ # ---------------------------------------------------------------------------
364
+ # MAIN - multi-step agent with per-task [START]/[STEP]/[END] logging
365
+ # ---------------------------------------------------------------------------
366
 
367
  def main() -> None:
368
  client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
369
 
 
370
  all_scores = []
371
 
372
  with BugTriageClient(base_url=ENV_BASE_URL) as env:
 
375
  score = 0.0
376
  success = False
377
  steps_taken = 0
378
+
379
  log_start(task=task_id, env=BENCHMARK, model=MODEL_NAME)
380
 
381
  try:
382
  obs = env.reset(task_id=task_id)
383
+
384
+ for step_num in range(1, MAX_STEPS + 1):
385
+ if obs.done:
386
+ break
387
+
388
+ # Decide: investigate or submit?
389
+ # For efficiency, check if we have enough info
390
+ # On step 1, always read full body; on later steps, decide
391
+ if step_num == 1 and not obs.body_visible:
392
+ # First step: read the full body
393
+ action = TriageAction(action_type="read_body")
394
+ result = env.step(action)
395
+ obs = result.observation
396
+ steps_taken = step_num
397
+
398
+ log_step(
399
+ step=step_num,
400
+ action="investigate:read_body",
401
+ reward=0.0,
402
+ done=result.done,
403
+ )
404
+
405
+ if result.done:
406
+ rewards.append(result.reward)
407
+ break
408
+ continue
409
+
410
+ elif step_num == 2 and not obs.comments_visible:
411
+ # Second step: read comments for extra context
412
+ action = TriageAction(action_type="read_comments")
413
+ result = env.step(action)
414
+ obs = result.observation
415
+ steps_taken = step_num
416
+
417
+ log_step(
418
+ step=step_num,
419
+ action="investigate:read_comments",
420
+ reward=0.0,
421
+ done=result.done,
422
+ )
423
+
424
+ if result.done:
425
+ rewards.append(result.reward)
426
+ break
427
+ continue
428
+
429
+ # Now submit the triage decision
430
+ bug_text = format_bug(obs)
431
+ action = call_model(client, bug_text)
432
+ result = env.step(action)
433
+ obs = result.observation
434
+ steps_taken = step_num
435
+
436
+ reward = float(result.reward or 0.0)
437
+ if result.done:
438
+ reward = max(0.01, min(0.99, reward))
439
+ rewards.append(reward)
440
+
441
+ action_str = (
442
+ f"priority={action.priority},"
443
+ f"team={action.assigned_team},"
444
+ f"milestone={action.milestone}"
445
+ )
446
+
447
+ log_step(
448
+ step=step_num,
449
+ action=action_str,
450
+ reward=reward,
451
+ done=result.done,
452
+ )
453
+
454
+ if result.done:
455
+ break
456
+
457
+ # Calculate score
458
+ if rewards:
459
+ score = sum(rewards) / MAX_TOTAL_REWARD
460
+ else:
461
+ score = 0.0
462
  score = min(max(score, 0.01), 0.99)
463
  success = score >= SUCCESS_SCORE_THRESHOLD
464
 
 
468
  score = min(max(score, 0.01), 0.99)
469
  success = False
470
 
 
471
  log_end(success, steps_taken, score, rewards)
472
  all_scores.append(score)
473
 
474
  time.sleep(0.5)
475
 
 
476
  avg_score = sum(all_scores) / len(all_scores) if all_scores else 0.0
477
+ print(
478
+ f"[SUMMARY] tasks={len(all_scores)} avg_score={avg_score:.2f} "
479
+ f"scores={all_scores}",
480
+ flush=True,
481
+ )
482
 
483
 
484
  if __name__ == "__main__":
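Review note: per-task scoring in `main()` sums the collected rewards, divides by `MAX_TOTAL_REWARD`, clamps the result to the range 0.01 to 0.99, and compares it with `SUCCESS_SCORE_THRESHOLD`. Investigation steps are logged with reward 0.0 and are appended to `rewards` only when they end the episode. The sketch below restates that arithmetic as a standalone function, using the constants from the new side of the diff. `episode_score` is a hypothetical name.

```python
from typing import List, Tuple

MAX_TOTAL_REWARD = 1.0
SUCCESS_SCORE_THRESHOLD = 0.4


def episode_score(rewards: List[float]) -> Tuple[float, bool]:
    """Return (clamped score, success flag) for one task's rewards."""
    raw = sum(rewards) / MAX_TOTAL_REWARD if rewards else 0.0
    # The clamp keeps every reported score strictly inside (0, 1).
    score = min(max(raw, 0.01), 0.99)
    return score, score >= SUCCESS_SCORE_THRESHOLD


print(episode_score([0.75]))
```

One consequence of the clamp is that an episode with no rewards still reports 0.01 rather than 0.0, and a perfect submission reports 0.99 rather than 1.0.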
model.py CHANGED
@@ -1,13 +1,10 @@
1
  # model.py
2
- from typing import List
3
  from pydantic import BaseModel, Field
4
  from openenv.core.env_server import Action, Observation
5
  from openenv.core.env_server.types import State
6
 
7
 
8
-
9
-
10
-
11
  class BugReport(BaseModel):
12
  """A single GitHub-style bug report."""
13
  id: str
@@ -16,16 +13,21 @@ class BugReport(BaseModel):
16
  author: str
17
  labels_hint: List[str] = Field(default_factory=list)
18
  comments: List[str] = Field(default_factory=list)
 
 
 
 
19
 
20
  class Config:
21
  arbitrary_types_allowed = True
22
 
23
 
24
-
25
-
26
  class TriageAction(Action):
27
- """What the agent submits as its triage decision."""
28
- priority: str # "P0" | "P1" | "P2" | "P3"
 
 
 
29
  labels: List[str] = Field(default_factory=list)
30
  assigned_team: str = "backend"
31
  milestone: str = "backlog"
@@ -36,7 +38,7 @@ class TriageAction(Action):
36
 
37
 
38
  class TriageObservation(Observation):
39
- """What the agent sees after each step."""
40
  bug_report: BugReport
41
  task_id: str = "easy"
42
  score: float = 0.0
@@ -44,6 +46,14 @@ class TriageObservation(Observation):
44
  done: bool = False
45
  reward: float = 0.0
46
 
 
 
 
 
 
 
 
 
47
  class Config:
48
  arbitrary_types_allowed = True
49
 
@@ -51,10 +61,12 @@ class TriageObservation(Observation):
51
  class TriageState(State):
52
  """Internal episode state."""
53
  episode_id: str = ""
 
54
  current_task: str = "easy"
55
  step_count: int = 0
56
  total_score: float = 0.0
57
  tasks_completed: List[str] = Field(default_factory=list)
 
58
 
59
  class Config:
60
  arbitrary_types_allowed = True
 
1
  # model.py
2
+ from typing import List, Optional, Dict, Any
3
  from pydantic import BaseModel, Field
4
  from openenv.core.env_server import Action, Observation
5
  from openenv.core.env_server.types import State
6
 
7
 
 
 
 
8
  class BugReport(BaseModel):
9
  """A single GitHub-style bug report."""
10
  id: str
 
13
  author: str
14
  labels_hint: List[str] = Field(default_factory=list)
15
  comments: List[str] = Field(default_factory=list)
16
+ severity_signals: List[str] = Field(default_factory=list)
17
+ related_bugs: List[str] = Field(default_factory=list)
18
+ stack_trace: str = ""
19
+ affected_component: str = ""
20
 
21
  class Config:
22
  arbitrary_types_allowed = True
23
 
24
 
 
 
25
  class TriageAction(Action):
26
+ """What the agent submits - either an investigation or a final triage decision."""
27
+ action_type: str = "submit" # "read_body" | "read_comments" | "check_logs" | "check_similar" | "submit"
28
+
29
+ # Only used when action_type == "submit"
30
+ priority: str = "P2"
31
  labels: List[str] = Field(default_factory=list)
32
  assigned_team: str = "backend"
33
  milestone: str = "backlog"
 
38
 
39
 
40
  class TriageObservation(Observation):
41
+ """What the agent sees after each step - progressively reveals info."""
42
  bug_report: BugReport
43
  task_id: str = "easy"
44
  score: float = 0.0
 
46
  done: bool = False
47
  reward: float = 0.0
48
 
49
+ # Progressive visibility fields
50
+ body_visible: bool = False
51
+ comments_visible: bool = False
52
+ logs_visible: bool = False
53
+ similar_visible: bool = False
54
+ steps_taken: int = 0
55
+ max_steps: int = 6
56
+
57
  class Config:
58
  arbitrary_types_allowed = True
59
 
 
61
  class TriageState(State):
62
  """Internal episode state."""
63
  episode_id: str = ""
64
+ session_id: str = ""
65
  current_task: str = "easy"
66
  step_count: int = 0
67
  total_score: float = 0.0
68
  tasks_completed: List[str] = Field(default_factory=list)
69
+ actions_taken: List[str] = Field(default_factory=list)
70
 
71
  class Config:
72
  arbitrary_types_allowed = True
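Review note: the new `*_visible` flags, `steps_taken`, and `max_steps` on `TriageObservation` describe a progressive reveal, but the server-side transition is outside this part of the diff. The sketch below shows one plausible transition, assuming each investigation action flips the flag its name suggests and costs one step. The action-to-flag mapping and the `Visibility` and `apply_investigation` names are assumptions for illustration, and stdlib dataclasses stand in for the pydantic models.

```python
from dataclasses import dataclass, replace

# Assumed mapping from action_type to the flag it reveals.
ACTION_TO_FLAG = {
    "read_body": "body_visible",
    "read_comments": "comments_visible",
    "check_logs": "logs_visible",
    "check_similar": "similar_visible",
}


@dataclass(frozen=True)
class Visibility:
    """Stand-in for the visibility fields on TriageObservation."""
    body_visible: bool = False
    comments_visible: bool = False
    logs_visible: bool = False
    similar_visible: bool = False
    steps_taken: int = 0
    max_steps: int = 6


def apply_investigation(vis: Visibility, action_type: str) -> Visibility:
    """Return a new Visibility with one more flag revealed and one step used."""
    if action_type not in ACTION_TO_FLAG:
        raise ValueError(f"not an investigation action: {action_type}")
    changes = {ACTION_TO_FLAG[action_type]: True, "steps_taken": vis.steps_taken + 1}
    return replace(vis, **changes)


after_body = apply_investigation(Visibility(), "read_body")
print(after_body)
```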
openenv.yaml CHANGED
@@ -1,32 +1,43 @@
1
  spec_version: 1
2
  name: bug-triage-env
3
- version: "1.0.0"
4
  description: >
5
- A reinforcement learning environment where an agent triages
6
- GitHub-style bug reports by assigning priority, labels, team,
7
- and milestone. 3 tasks of increasing difficulty (easy → medium → hard).
 
 
 
8
  endpoint: https://siteshcodes-bug-triage-env.hf.space
9
  tags:
10
  - openenv
11
  - bug-triage
12
  - real-world
13
  - nlp
 
14
  tasks:
15
  - id: easy
16
  name: Priority Assignment
17
- description: Assign correct P0-P3 priority to a bug report
 
 
18
  difficulty: easy
19
  grader: server.task:priority_match
20
  reward_range: [0.0, 1.0]
21
  - id: medium
22
  name: Priority Labels and Team
23
- description: Assign correct priority, labels, and team routing
 
 
24
  difficulty: medium
25
  grader: server.task:priority_label_team
26
  reward_range: [0.0, 1.0]
27
  - id: hard
28
  name: Full Triage
29
- description: Full triage with priority, labels, team, milestone and security penalty
 
 
 
30
  difficulty: hard
31
  grader: server.task:full_triage
32
  reward_range: [0.0, 1.0]
@@ -35,6 +46,7 @@ endpoints:
35
  step: /step
36
  state: /state
37
  actions:
 
38
  priority: string
39
  labels: list
40
  assigned_team: string
@@ -46,4 +58,10 @@ observations:
46
  score: float
47
  reward: float
48
  feedback: string
49
- done: bool
 
 
 
 
 
 
 
1
  spec_version: 1
2
  name: bug-triage-env
3
+ version: "2.0.0"
4
  description: >
5
+ A multi-step reinforcement learning environment where an AI agent
6
+ investigates and triages GitHub-style bug reports by assigning
7
+ priority, labels, team, and milestone. Features progressive
8
+ information reveal, procedural bug generation (200+ unique bugs),
9
+ semantic label matching, and a security escalation penalty.
10
+ 3 tasks of increasing difficulty (easy → medium → hard).
11
  endpoint: https://siteshcodes-bug-triage-env.hf.space
12
  tags:
13
  - openenv
14
  - bug-triage
15
  - real-world
16
  - nlp
17
+ - multi-step
18
  tasks:
19
  - id: easy
20
  name: Priority Assignment
21
+ description: >
22
+ Investigate a bug report and assign correct P0-P3 priority.
23
+ Use investigation actions to gather info before submitting.
24
  difficulty: easy
25
  grader: server.task:priority_match
26
  reward_range: [0.0, 1.0]
27
  - id: medium
28
  name: Priority Labels and Team
29
+ description: >
30
+ Investigate and assign correct priority, labels, and team
31
+ routing. More investigation steps available.
32
  difficulty: medium
33
  grader: server.task:priority_label_team
34
  reward_range: [0.0, 1.0]
35
  - id: hard
36
  name: Full Triage
37
+ description: >
38
+ Full triage with priority, labels, team, milestone and
39
+ security escalation penalty. Investigation is critical β€”
40
+ missing security signals is penalized.
41
  difficulty: hard
42
  grader: server.task:full_triage
43
  reward_range: [0.0, 1.0]
 
46
  step: /step
47
  state: /state
48
  actions:
49
+ action_type: string
50
  priority: string
51
  labels: list
52
  assigned_team: string
 
58
  score: float
59
  reward: float
60
  feedback: string
61
+ done: bool
62
+ body_visible: bool
63
+ comments_visible: bool
64
+ logs_visible: bool
65
+ similar_visible: bool
66
+ steps_taken: int
67
+ max_steps: int
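
The `actions` block above adds `action_type` next to the triage fields. The sketch below shows one way a client could build the JSON body for `POST /step`. It assumes the action names listed in `server/environment.py` and the `{"action": ..., "session_id": ...}` shape read by `custom_step`. The helper name `build_step_payload` is hypothetical and not part of the repository.

```python
from typing import Any, Dict, List, Optional

# Action names taken from INVESTIGATION_ACTIONS in server/environment.py.
INVESTIGATION_ACTIONS = {"read_body", "read_comments", "check_logs", "check_similar"}
PRIORITIES = {"P0", "P1", "P2", "P3"}


def build_step_payload(
    action_type: str,
    session_id: Optional[str] = None,
    priority: str = "P2",
    labels: Optional[List[str]] = None,
    assigned_team: str = "backend",
) -> Dict[str, Any]:
    """Build a request body for POST /step (hypothetical client helper)."""
    if action_type not in INVESTIGATION_ACTIONS | {"submit"}:
        raise ValueError(f"unknown action_type: {action_type!r}")

    action: Dict[str, Any] = {"action_type": action_type}
    if action_type == "submit":
        # Only a final submission carries the triage decision.
        if priority not in PRIORITIES:
            raise ValueError(f"unknown priority: {priority!r}")
        action.update(
            priority=priority,
            labels=labels or ["bug"],
            assigned_team=assigned_team,
        )

    payload: Dict[str, Any] = {"action": action}
    if session_id:
        payload["session_id"] = session_id
    return payload
```

Rejecting unknown action names on the client side matters here because, in the server code shown in this commit, any `action_type` outside the investigation set is treated as a submission.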
pyproject.toml CHANGED
@@ -4,8 +4,8 @@ build-backend = "setuptools.backends.legacy:build"
4
 
5
  [project]
6
  name = "bug-triage-env"
7
- version = "1.0.0"
8
- description = "OpenEnv RL environment for bug report triage"
9
  requires-python = ">=3.11"
10
  dependencies = [
11
  "openenv-core>=0.2.0",
@@ -13,6 +13,15 @@ dependencies = [
13
  "uvicorn[standard]",
14
  "pydantic",
15
  "websockets",
 
 
 
 
 
 
 
 
 
16
  "groq",
17
  ]
18
 
 
4
 
5
  [project]
6
  name = "bug-triage-env"
7
+ version = "2.0.0"
8
+ description = "Multi-step OpenEnv RL environment for bug report triage"
9
  requires-python = ">=3.11"
10
  dependencies = [
11
  "openenv-core>=0.2.0",
 
13
  "uvicorn[standard]",
14
  "pydantic",
15
  "websockets",
16
+ "requests",
17
+ "openai",
18
+ ]
19
+
20
+ [project.optional-dependencies]
21
+ dev = [
22
+ "pytest>=7.0",
23
+ "pytest-cov",
24
+ "httpx",
25
  "groq",
26
  ]
27
 
server/__pycache__/__init__.cpython-314.pyc DELETED
Binary file (434 Bytes)
 
server/__pycache__/task.cpython-314.pyc DELETED
Binary file (14.5 kB)
 
server/app.py CHANGED
@@ -1,18 +1,16 @@
1
  # server/app.py
2
  import sys
3
  import os
4
- import json
5
  sys.path.insert(0, "/app")
6
  sys.path.insert(0, "/app/server")
7
 
8
  from openenv.core.env_server import create_app
9
  from model import TriageAction, TriageObservation
10
- from environment import BugTriageEnvironment
11
  from task import sample_bug, grade_action, TASKS
12
- from fastapi import Response, Request
13
- from fastapi.responses import FileResponse
14
  from fastapi.staticfiles import StaticFiles
15
- from pydantic import BaseModel
16
  from typing import Optional, Dict, Any
17
 
18
  app = create_app(
@@ -22,39 +20,15 @@ app = create_app(
22
  env_name="bug-triage-env",
23
  )
24
 
25
- TASKS_META = [
26
- {
27
- "id": "easy",
28
- "name": "Priority Assignment",
29
- "description": "Assign correct P0-P3 priority to a bug report",
30
- "difficulty": "easy",
31
- "grader": "server.task:priority_match",
32
- "reward_range": [0.0, 1.0]
33
- },
34
- {
35
- "id": "medium",
36
- "name": "Priority Labels and Team",
37
- "description": "Assign correct priority, labels, and team routing",
38
- "difficulty": "medium",
39
- "grader": "server.task:priority_label_team",
40
- "reward_range": [0.0, 1.0]
41
- },
42
- {
43
- "id": "hard",
44
- "name": "Full Triage",
45
- "description": "Full triage with priority, labels, team, milestone and security penalty",
46
- "difficulty": "hard",
47
- "grader": "server.task:full_triage",
48
- "reward_range": [0.0, 1.0]
49
- }
50
- ]
51
-
52
-
53
-
54
- _global_env = BugTriageEnvironment()
55
 
 
 
 
56
 
57
 
 
58
  routes_to_remove = []
59
  for route in app.routes:
60
  if hasattr(route, "path") and route.path in ("/reset", "/step", "/state"):
@@ -63,44 +37,60 @@ for route in routes_to_remove:
63
  app.routes.remove(route)
64
 
65
 
 
 
 
 
66
  @app.get("/health")
67
  def health():
68
- return {"status": "ok", "env": "bug-triage-env"}
 
 
 
 
 
 
69
 
70
  @app.get("/")
71
  def root():
72
- """Serve the interactive demo frontend at root."""
73
  static_dir = os.path.join(os.path.dirname(__file__), "static")
74
- return FileResponse(os.path.join(static_dir, "index.html"))
 
 
 
 
75
 
76
  @app.get("/web")
77
  def web_ui():
78
  """Alias for the frontend."""
79
- static_dir = os.path.join(os.path.dirname(__file__), "static")
80
- return FileResponse(os.path.join(static_dir, "index.html"))
81
 
82
  @app.get("/tasks")
83
  def list_tasks():
84
  return TASKS_META
85
 
86
- @app.get("/tasks/easy")
87
- def task_easy():
88
- return TASKS_META[0]
89
 
90
- @app.get("/tasks/medium")
91
- def task_medium():
92
- return TASKS_META[1]
93
-
94
- @app.get("/tasks/hard")
95
- def task_hard():
96
- return TASKS_META[2]
 
 
97
 
98
 
 
 
 
99
 
100
  @app.post("/reset")
101
  async def custom_reset(request: Request):
102
- """Stateful reset - remembers the bug for the subsequent step() call."""
103
- global _global_env
104
 
105
  body = {}
106
  try:
@@ -111,9 +101,20 @@ async def custom_reset(request: Request):
111
  task_id = body.get("task_id", "easy")
112
  seed = body.get("seed", None)
113
  episode_id = body.get("episode_id", None)
 
 
 
 
 
 
 
 
 
 
 
114
 
115
- _global_env = BugTriageEnvironment()
116
- obs = _global_env.reset(task_id=task_id, seed=seed, episode_id=episode_id)
117
 
118
  try:
119
  obs_dict = obs.model_dump(exclude={"reward", "done", "metadata"})
@@ -124,21 +125,31 @@ async def custom_reset(request: Request):
124
  obs_dict.pop("metadata", None)
125
 
126
  return {
 
127
  "observation": obs_dict,
128
- "reward": obs.reward,
129
- "done": obs.done,
130
  }
131
 
132
 
133
  @app.post("/step")
134
  async def custom_step(request: Request):
135
- """Stateful step - uses the bug from the last reset() call."""
136
- global _global_env
137
 
138
  body = await request.json()
139
  action_data = body.get("action", body)
 
 
 
 
 
 
 
 
140
 
141
  action = TriageAction(
 
142
  priority=action_data.get("priority", "P2"),
143
  labels=action_data.get("labels", ["bug"]),
144
  assigned_team=action_data.get("assigned_team", "backend"),
@@ -146,7 +157,7 @@ async def custom_step(request: Request):
146
  reasoning=action_data.get("reasoning", ""),
147
  )
148
 
149
- obs = _global_env.step(action)
150
 
151
  try:
152
  obs_dict = obs.model_dump(exclude={"reward", "done", "metadata"})
@@ -156,53 +167,122 @@ async def custom_step(request: Request):
156
  obs_dict.pop("done", None)
157
  obs_dict.pop("metadata", None)
158
 
159
- reward = float(obs.reward) if obs.reward is not None else 0.05
160
- # Strictly clamp to open interval (0, 1)
161
- reward = max(0.01, min(0.99, reward))
162
 
163
- return {
164
  "observation": obs_dict,
165
  "reward": reward,
166
  "done": obs.done,
167
  }
168
 
 
 
 
 
 
 
 
 
 
169
 
170
  @app.get("/state")
171
- def custom_state():
172
  """Return current environment state."""
173
- global _global_env
174
- state = _global_env.get_state()
 
 
 
 
 
175
  try:
176
  return state.model_dump()
177
  except AttributeError:
178
  return state.dict()
179
 
180
 
 
 
 
 
181
  @app.post("/tasks/easy/reset")
182
- def reset_easy():
183
- global _global_env
184
- _global_env = BugTriageEnvironment()
185
- obs = _global_env.reset(task_id="easy")
186
- return {"task_id": "easy", "bug_report": obs.bug_report.model_dump(), "done": False, "reward": 0.05}
 
 
 
 
 
 
187
 
188
  @app.post("/tasks/medium/reset")
189
- def reset_medium():
190
- global _global_env
191
- _global_env = BugTriageEnvironment()
192
- obs = _global_env.reset(task_id="medium")
193
- return {"task_id": "medium", "bug_report": obs.bug_report.model_dump(), "done": False, "reward": 0.05}
 
 
 
 
 
 
194
 
195
  @app.post("/tasks/hard/reset")
196
- def reset_hard():
197
- global _global_env
198
- _global_env = BugTriageEnvironment()
199
- obs = _global_env.reset(task_id="hard")
200
- return {"task_id": "hard", "bug_report": obs.bug_report.model_dump(), "done": False, "reward": 0.05}
 
 
 
 
 
 
201
 
202
 
203
  def main():
204
  import uvicorn
205
  uvicorn.run(app, host="0.0.0.0", port=7860)
206
 
 
207
  if __name__ == "__main__":
208
  main()
 
1
  # server/app.py
2
  import sys
3
  import os
 
4
  sys.path.insert(0, "/app")
5
  sys.path.insert(0, "/app/server")
6
 
7
  from openenv.core.env_server import create_app
8
  from model import TriageAction, TriageObservation
9
+ from environment import BugTriageEnvironment, SessionManager, TASKS_META
10
  from task import sample_bug, grade_action, TASKS
11
+ from fastapi import Response, Request, HTTPException
12
+ from fastapi.responses import FileResponse, JSONResponse
13
  from fastapi.staticfiles import StaticFiles
 
14
  from typing import Optional, Dict, Any
15
 
16
  app = create_app(
 
20
  env_name="bug-triage-env",
21
  )
22
 
23
+ # Session manager replaces the broken global state
24
+ sessions = SessionManager(max_sessions=500, ttl_seconds=600)
25
 
26
+ # Fallback env for backward-compatible (non-session) requests
27
+ _fallback_env = BugTriageEnvironment()
28
+ _fallback_answer = None
29
 
30
 
31
+ # Remove default routes from create_app - we override them
32
  routes_to_remove = []
33
  for route in app.routes:
34
  if hasattr(route, "path") and route.path in ("/reset", "/step", "/state"):
 
37
  app.routes.remove(route)
38
 
39
 
40
+ # ---------------------------------------------------------------------------
41
+ # CORE ENDPOINTS
42
+ # ---------------------------------------------------------------------------
43
+
44
  @app.get("/health")
45
  def health():
46
+ return {
47
+ "status": "ok",
48
+ "env": "bug-triage-env",
49
+ "version": "2.0.0",
50
+ "active_sessions": sessions.active_count,
51
+ }
52
+
53
 
54
  @app.get("/")
55
  def root():
56
+ """Serve the interactive demo frontend."""
57
  static_dir = os.path.join(os.path.dirname(__file__), "static")
58
+ index_path = os.path.join(static_dir, "index.html")
59
+ if os.path.exists(index_path):
60
+ return FileResponse(index_path)
61
+ return {"message": "Bug Triage Environment v2.0.0", "docs": "/docs"}
62
+
63
 
64
  @app.get("/web")
65
  def web_ui():
66
  """Alias for the frontend."""
67
+ return root()
68
+
69
 
70
  @app.get("/tasks")
71
  def list_tasks():
72
  return TASKS_META
73
 
 
 
 
74
 
75
+ @app.get("/tasks/{task_id}")
76
+ def get_task(task_id: str):
77
+ for t in TASKS_META:
78
+ if t["id"] == task_id:
79
+ return t
80
+ raise HTTPException(404, detail={
81
+ "error": "task_not_found",
82
+ "message": f"Task '{task_id}' not found. Valid: easy, medium, hard",
83
+ })
84
 
85
 
86
+ # ---------------------------------------------------------------------------
87
+ # SESSION-BASED RESET / STEP / STATE
88
+ # ---------------------------------------------------------------------------
89
 
90
  @app.post("/reset")
91
  async def custom_reset(request: Request):
92
+ """Start a new episode. Returns a session_id for subsequent step() calls."""
93
+ global _fallback_env, _fallback_answer
94
 
95
  body = {}
96
  try:
 
101
  task_id = body.get("task_id", "easy")
102
  seed = body.get("seed", None)
103
  episode_id = body.get("episode_id", None)
104
+ session_id = body.get("session_id", None)
105
+
106
+ # If session_id provided, reuse that session
107
+ if session_id:
108
+ env = sessions.get_session(session_id)
109
+ if env is None:
110
+ session_id, env = sessions.create_session()
111
+ else:
112
+ session_id, env = sessions.create_session()
113
+
114
+ obs = env.reset(task_id=task_id, seed=seed, episode_id=episode_id)
115
 
116
+ # Also update fallback for backward compatibility
117
+ _fallback_env = env
118
 
119
  try:
120
  obs_dict = obs.model_dump(exclude={"reward", "done", "metadata"})
 
125
  obs_dict.pop("metadata", None)
126
 
127
  return {
128
+ "session_id": session_id,
129
  "observation": obs_dict,
130
+ "reward": 0.0,
131
+ "done": False,
132
  }
133
 
134
 
135
  @app.post("/step")
136
  async def custom_step(request: Request):
137
+ """Process an action - either investigation or final triage submission."""
138
+ global _fallback_env
139
 
140
  body = await request.json()
141
  action_data = body.get("action", body)
142
+ session_id = body.get("session_id", None)
143
+
144
+ # Find the right environment
145
+ env = None
146
+ if session_id:
147
+ env = sessions.get_session(session_id)
148
+ if env is None:
149
+ env = _fallback_env
150
 
151
  action = TriageAction(
152
+ action_type=action_data.get("action_type", "submit"),
153
  priority=action_data.get("priority", "P2"),
154
  labels=action_data.get("labels", ["bug"]),
155
  assigned_team=action_data.get("assigned_team", "backend"),
 
157
  reasoning=action_data.get("reasoning", ""),
158
  )
159
 
160
+ obs = env.step(action)
161
 
162
  try:
163
  obs_dict = obs.model_dump(exclude={"reward", "done", "metadata"})
 
167
  obs_dict.pop("done", None)
168
  obs_dict.pop("metadata", None)
169
 
170
+ reward = float(obs.reward) if obs.reward is not None else 0.0
171
+ reward = max(0.01, min(0.99, reward)) if obs.done else 0.0
 
172
 
173
+ response_data = {
174
  "observation": obs_dict,
175
  "reward": reward,
176
  "done": obs.done,
177
  }
178
 
179
+ if session_id:
180
+ response_data["session_id"] = session_id
181
+
182
+ # Cleanup session when episode is done
183
+ if obs.done and session_id:
184
+ sessions.remove_session(session_id)
185
+
186
+ return response_data
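
The reward handling in `custom_step` can be read as a small pure function: intermediate investigation steps always report `0.0`, and only the terminal step reports a score clamped to `[0.01, 0.99]`. The sketch below restates that rule in isolation. The function name `shape_reward` is hypothetical.

```python
from typing import Optional


def shape_reward(raw: Optional[float], done: bool) -> float:
    """Mirror the reward rule used in /step (illustrative restatement)."""
    if not done:
        # Investigation steps never carry reward.
        return 0.0
    value = float(raw) if raw is not None else 0.0
    # Terminal rewards are kept strictly inside (0, 1).
    return max(0.01, min(0.99, value))
```

One consequence worth noting is that a finished episode never returns exactly `0.0` or `1.0`, even though `reward_range` in `openenv.yaml` is declared as `[0.0, 1.0]`.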
187
+
188
 
189
  @app.get("/state")
190
+ def custom_state(session_id: Optional[str] = None):
191
  """Return current environment state."""
192
+ env = None
193
+ if session_id:
194
+ env = sessions.get_session(session_id)
195
+ if env is None:
196
+ env = _fallback_env
197
+
198
+ state = env.get_state()
199
  try:
200
  return state.model_dump()
201
  except AttributeError:
202
  return state.dict()
203
 
204
 
205
+ # ---------------------------------------------------------------------------
206
+ # PER-TASK SHORTCUT ENDPOINTS
207
+ # ---------------------------------------------------------------------------
208
+
209
  @app.post("/tasks/easy/reset")
210
+ async def reset_easy():
211
+ session_id, env = sessions.create_session()
212
+ obs = env.reset(task_id="easy")
213
+ return {
214
+ "session_id": session_id,
215
+ "task_id": "easy",
216
+ "bug_report": obs.bug_report.model_dump(),
217
+ "done": False,
218
+ "reward": 0.0,
219
+ }
220
+
221
 
222
  @app.post("/tasks/medium/reset")
223
+ async def reset_medium():
224
+ session_id, env = sessions.create_session()
225
+ obs = env.reset(task_id="medium")
226
+ return {
227
+ "session_id": session_id,
228
+ "task_id": "medium",
229
+ "bug_report": obs.bug_report.model_dump(),
230
+ "done": False,
231
+ "reward": 0.0,
232
+ }
233
+
234
 
235
  @app.post("/tasks/hard/reset")
236
+ async def reset_hard():
237
+ session_id, env = sessions.create_session()
238
+ obs = env.reset(task_id="hard")
239
+ return {
240
+ "session_id": session_id,
241
+ "task_id": "hard",
242
+ "bug_report": obs.bug_report.model_dump(),
243
+ "done": False,
244
+ "reward": 0.0,
245
+ }
246
+
247
 
248
+ # ---------------------------------------------------------------------------
249
+ # LEADERBOARD
250
+ # ---------------------------------------------------------------------------
251
+
252
+ _leaderboard = []
253
+
254
+
255
+ @app.get("/leaderboard")
256
+ def get_leaderboard():
257
+ """Return top 50 agent scores."""
258
+ return sorted(_leaderboard, key=lambda x: x.get("avg_score", 0), reverse=True)[:50]
259
+
260
+
261
+ @app.post("/leaderboard/submit")
262
+ async def submit_to_leaderboard(request: Request):
263
+ """Submit agent scores to the leaderboard."""
264
+ body = await request.json()
265
+ entry = {
266
+ "agent_name": body.get("agent_name", "anonymous"),
267
+ "model": body.get("model", "unknown"),
268
+ "scores": body.get("scores", {}),
269
+ "avg_score": body.get("avg_score", 0.0),
270
+ }
271
+ _leaderboard.append(entry)
272
+ rank = sorted(
273
+ _leaderboard, key=lambda x: x.get("avg_score", 0), reverse=True
274
+ ).index(entry) + 1
275
+ return {"status": "submitted", "rank": rank, "total_entries": len(_leaderboard)}
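
The submit handler derives `rank` from the position of the first equal entry in the sorted list, so two identical submissions report the same position. An alternative, shown here only as a sketch and not as the method used in this commit, is standard competition ranking: one plus the number of entries with a strictly higher `avg_score`. The name `competition_rank` is hypothetical.

```python
from typing import Any, Dict, List


def competition_rank(entry: Dict[str, Any], board: List[Dict[str, Any]]) -> int:
    """Rank = 1 + number of entries scoring strictly higher (ties share a rank)."""
    score = entry.get("avg_score", 0.0)
    return 1 + sum(1 for other in board if other.get("avg_score", 0.0) > score)
```

This avoids sorting the whole board on every submission and makes tie behavior explicit.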
276
+
277
+
278
+ # ---------------------------------------------------------------------------
279
+ # ENTRYPOINT
280
+ # ---------------------------------------------------------------------------
281
 
282
  def main():
283
  import uvicorn
284
  uvicorn.run(app, host="0.0.0.0", port=7860)
285
 
286
+
287
  if __name__ == "__main__":
288
  main()
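
Taken together, the session-based endpoints in `server/app.py` imply a simple client loop: call `/reset`, keep the returned `session_id`, send investigation actions to `/step`, then submit. The sketch below assumes the response keys shown in this diff (`session_id`, `reward`, `done`). The `run_episode` function and the injected `post` callable are hypothetical, used so the loop can be exercised without a running server.

```python
from typing import Any, Callable, Dict, List

# A transport callable: (path, json_body) -> decoded JSON response.
Post = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def run_episode(
    post: Post,
    task_id: str,
    investigations: List[str],
    final_action: Dict[str, Any],
) -> float:
    """Drive one multi-step episode and return the final reward."""
    reset = post("/reset", {"task_id": task_id})
    session_id = reset["session_id"]

    for name in investigations:
        result = post(
            "/step",
            {"session_id": session_id, "action": {"action_type": name}},
        )
        if result["done"]:
            # The server forces a submission once max_steps is reached.
            return result["reward"]

    result = post(
        "/step",
        {
            "session_id": session_id,
            "action": {"action_type": "submit", **final_action},
        },
    )
    return result["reward"]
```

With the `requests` library, `post` could be a thin wrapper around `requests.post(base_url + path, json=body).json()`, where `base_url` is the deployed Space URL.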
server/environment.py CHANGED
@@ -3,25 +3,39 @@ import sys
3
  sys.path.insert(0, "/app")
4
  sys.path.insert(0, "/app/server")
5
  import uuid
 
 
6
  from openenv.core.env_server.interfaces import Environment
7
  from model import TriageAction, TriageObservation, TriageState, BugReport
8
  from task import grade_action, sample_bug
9
 
10
  VALID_TASKS = ["easy", "medium", "hard"]
11
 
 
 
12
  TASKS_META = [
13
- {"id": "easy", "name": "Priority Assignment", "grader": "server.task:priority_match",
 
14
  "difficulty": "easy", "reward_range": [0.0, 1.0],
15
- "description": "Assign a single P0-P3 priority to a bug report"},
16
- {"id": "medium", "name": "Priority Labels and Team", "grader": "server.task:priority_label_team",
 
 
17
  "difficulty": "medium", "reward_range": [0.0, 1.0],
18
- "description": "Assign priority, labels, and team routing"},
19
- {"id": "hard", "name": "Full Triage", "grader": "server.task:full_triage",
 
 
20
  "difficulty": "hard", "reward_range": [0.0, 1.0],
21
- "description": "Full triage with security escalation penalty"},
 
22
  ]
23
 
 
 
 
24
  class BugTriageEnvironment(Environment):
 
25
 
26
  SUPPORTS_CONCURRENT_SESSIONS = True
27
 
@@ -29,13 +43,25 @@ class BugTriageEnvironment(Environment):
29
  super().__init__()
30
  self._current_task_key: str = "easy"
31
  self._episode_done: bool = False
32
- self._current_bug: BugReport = sample_bug("easy")
 
 
 
 
 
 
 
 
 
 
 
33
  self._state = TriageState(
34
  episode_id=str(uuid.uuid4()),
35
  current_task="easy",
36
  step_count=0,
37
- total_score=0.05,
38
  tasks_completed=[],
 
39
  )
40
 
41
  def get_metadata(self):
@@ -43,72 +69,203 @@ class BugTriageEnvironment(Environment):
43
  from openenv.core.env_server.types import EnvironmentMetadata
44
  return EnvironmentMetadata(
45
  name="bug-triage-env",
46
- description="Bug triage RL environment with 3 tasks of increasing difficulty",
47
- version="1.0.0",
 
48
  author="Siteshcodes",
49
  tasks=TASKS_META,
50
  )
51
  except Exception:
52
  return {
53
  "name": "bug-triage-env",
54
- "description": "Bug triage RL environment with 3 tasks of increasing difficulty",
55
- "version": "1.0.0",
56
  "author": "Siteshcodes",
57
  "tasks": TASKS_META,
58
  }
59
 
60
- def reset(self, task_id: str = "easy", seed: int = None, episode_id: str = None, **kwargs) -> TriageObservation:
61
- """Start a fresh episode for the specified task."""
62
  if task_id not in VALID_TASKS:
63
  task_id = "easy"
64
 
65
  self._current_task_key = task_id
66
  self._episode_done = False
67
- self._current_bug = sample_bug(task_id)
 
 
 
 
 
 
 
 
 
 
 
 
68
  self._state = TriageState(
69
  episode_id=episode_id or str(uuid.uuid4()),
70
  current_task=task_id,
71
  step_count=0,
72
- total_score=0.05,
73
  tasks_completed=[],
 
74
  )
75
- return TriageObservation(
76
- bug_report=self._current_bug,
77
- task_id=task_id,
78
- score=0.05,
79
- feedback=f"Episode started for task: {task_id}. Triage this bug report.",
80
- done=False,
81
- reward=0.05,
 
 
 
 
82
  )
83
 
84
  def step(self, action: TriageAction) -> TriageObservation:
85
- """Process the agent's triage action - one step, then done."""
86
  if self._episode_done:
87
- return TriageObservation(
88
- bug_report=self._current_bug,
89
- task_id=self._current_task_key,
90
- score=0.05,
91
  feedback="Episode already complete. Call reset() to start a new episode.",
92
- done=True,
93
- reward=0.05,
94
  )
95
 
96
- self._state.step_count += 1
97
- task_key = self._current_task_key
 
 
 
 
 
 
 
98
 
99
- score, feedback = grade_action(task_key, self._current_bug, action)
100
 
101
  self._state.total_score += score
102
- self._state.tasks_completed.append(task_key)
103
  self._episode_done = True
104
 
105
- return TriageObservation(
106
- bug_report=self._current_bug,
107
- task_id=task_key,
108
- score=round(score, 3),
109
- feedback=feedback,
110
- done=True,
111
- reward=round(score, 3),
112
  )
113
 
114
  @property
@@ -116,4 +273,68 @@ class BugTriageEnvironment(Environment):
116
  return self._state
117
 
118
  def get_state(self) -> TriageState:
119
- return self._state
 
3
  sys.path.insert(0, "/app")
4
  sys.path.insert(0, "/app/server")
5
  import uuid
6
+ import time
7
+ from typing import Dict, Optional, Tuple
8
  from openenv.core.env_server.interfaces import Environment
9
  from model import TriageAction, TriageObservation, TriageState, BugReport
10
  from task import grade_action, sample_bug
11
 
12
  VALID_TASKS = ["easy", "medium", "hard"]
13
 
14
+ MAX_STEPS_PER_TASK = {"easy": 4, "medium": 5, "hard": 6}
15
+
16
  TASKS_META = [
17
+ {"id": "easy", "name": "Priority Assignment",
18
+ "grader": "server.task:priority_match",
19
  "difficulty": "easy", "reward_range": [0.0, 1.0],
20
+ "description": "Investigate a bug report and assign a P0-P3 priority. "
21
+ "Use investigation actions to gather info before submitting."},
22
+ {"id": "medium", "name": "Priority Labels and Team",
23
+ "grader": "server.task:priority_label_team",
24
  "difficulty": "medium", "reward_range": [0.0, 1.0],
25
+ "description": "Investigate and assign priority, labels, and team routing. "
26
+ "More investigation steps available."},
27
+ {"id": "hard", "name": "Full Triage",
28
+ "grader": "server.task:full_triage",
29
  "difficulty": "hard", "reward_range": [0.0, 1.0],
30
+ "description": "Full triage with priority, labels, team, milestone, "
31
+ "and security escalation penalty. Investigation is critical."},
32
  ]
33
 
34
+ INVESTIGATION_ACTIONS = {"read_body", "read_comments", "check_logs", "check_similar"}
35
+
36
+
37
  class BugTriageEnvironment(Environment):
38
+ """Multi-step bug triage environment with progressive information reveal."""
39
 
40
  SUPPORTS_CONCURRENT_SESSIONS = True
41
 
 
43
  super().__init__()
44
  self._current_task_key: str = "easy"
45
  self._episode_done: bool = False
46
+ self._current_bug: Optional[BugReport] = None
47
+ self._current_answer: Optional[dict] = None
48
+ self._step_count: int = 0
49
+ self._max_steps: int = 4
50
+ self._actions_taken: list = []
51
+
52
+ # Progressive visibility
53
+ self._body_visible: bool = False
54
+ self._comments_visible: bool = False
55
+ self._logs_visible: bool = False
56
+ self._similar_visible: bool = False
57
+
58
  self._state = TriageState(
59
  episode_id=str(uuid.uuid4()),
60
  current_task="easy",
61
  step_count=0,
62
+ total_score=0.0,
63
  tasks_completed=[],
64
+ actions_taken=[],
65
  )
66
 
67
  def get_metadata(self):
 
69
  from openenv.core.env_server.types import EnvironmentMetadata
70
  return EnvironmentMetadata(
71
  name="bug-triage-env",
72
+ description="Multi-step bug triage RL environment with progressive "
73
+ "information reveal and 3 difficulty levels",
74
+ version="2.0.0",
75
  author="Siteshcodes",
76
  tasks=TASKS_META,
77
  )
78
  except Exception:
79
  return {
80
  "name": "bug-triage-env",
81
+ "description": "Multi-step bug triage RL environment",
82
+ "version": "2.0.0",
83
  "author": "Siteshcodes",
84
  "tasks": TASKS_META,
85
  }
86
 
87
+ def _build_observation(self, score=0.0, feedback="", done=False,
88
+ reward=0.0) -> TriageObservation:
89
+ """Build observation with current visibility state."""
90
+ bug = self._current_bug
91
+
92
+ # Create a visibility-filtered view of the bug
93
+ visible_bug = BugReport(
94
+ id=bug.id,
95
+ title=bug.title,
96
+ body=bug.body if self._body_visible else bug.body[:120] + "..." if len(bug.body) > 120 else bug.body,
97
+ author=bug.author,
98
+ labels_hint=bug.labels_hint,
99
+ comments=bug.comments if self._comments_visible else [],
100
+ severity_signals=bug.severity_signals if self._logs_visible else [],
101
+ related_bugs=bug.related_bugs if self._similar_visible else [],
102
+ stack_trace=bug.stack_trace if self._logs_visible else "",
103
+ affected_component=bug.affected_component if self._logs_visible else "",
104
+ )
105
+
106
+ return TriageObservation(
107
+ bug_report=visible_bug,
108
+ task_id=self._current_task_key,
109
+ score=round(score, 3),
110
+ feedback=feedback,
111
+ done=done,
112
+ reward=round(reward, 3),
113
+ body_visible=self._body_visible,
114
+ comments_visible=self._comments_visible,
115
+ logs_visible=self._logs_visible,
116
+ similar_visible=self._similar_visible,
117
+ steps_taken=self._step_count,
118
+ max_steps=self._max_steps,
119
+ )
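
The `body=` argument in `_build_observation` chains two conditional expressions on one line, which is easy to misread. Python parses it as `full if visible else (preview if long else full)`. The sketch below spells out the same rule as a standalone function. The name `visible_body` is hypothetical, and `limit=120` matches the constant in the diff.

```python
def visible_body(body: str, body_visible: bool, limit: int = 120) -> str:
    """Return the full body once revealed, otherwise a truncated preview."""
    if body_visible or len(body) <= limit:
        # Short bodies are shown in full even before read_body is used.
        return body
    return body[:limit] + "..."
```

Bodies of 120 characters or fewer are therefore fully visible from the first observation, so `read_body` reveals nothing new for them.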
120
+
121
+ def reset(self, task_id: str = "easy", seed: int = None,
122
+ episode_id: str = None, **kwargs) -> TriageObservation:
123
+ """Start a fresh episode for the given task."""
124
  if task_id not in VALID_TASKS:
125
  task_id = "easy"
126
 
127
  self._current_task_key = task_id
128
  self._episode_done = False
129
+ self._step_count = 0
130
+ self._max_steps = MAX_STEPS_PER_TASK.get(task_id, 4)
131
+ self._actions_taken = []
132
+
133
+ # Reset visibility - title + truncated body are always visible
134
+ self._body_visible = False
135
+ self._comments_visible = False
136
+ self._logs_visible = False
137
+ self._similar_visible = False
138
+
139
+ # Sample a bug and its answer
140
+ self._current_bug, self._current_answer = sample_bug(task_id, seed=seed)
141
+
142
  self._state = TriageState(
143
  episode_id=episode_id or str(uuid.uuid4()),
144
  current_task=task_id,
145
  step_count=0,
146
+ total_score=0.0,
147
  tasks_completed=[],
148
+ actions_taken=[],
149
  )
150
+
151
+ feedback = (
152
+ f"Episode started for task: {task_id}. "
153
+ f"You see the bug title and a preview. "
154
+ f"Use investigation actions (read_body, read_comments, check_logs, check_similar) "
155
+ f"to reveal more information, then submit your triage. "
156
+ f"You have {self._max_steps} steps max."
157
+ )
158
+
159
+ return self._build_observation(
160
+ score=0.0, feedback=feedback, done=False, reward=0.0,
161
  )
162
 
163
  def step(self, action: TriageAction) -> TriageObservation:
164
+ """Process agent's action - either investigate or submit final triage."""
165
  if self._episode_done:
166
+ return self._build_observation(
167
+ score=0.0,
 
 
168
  feedback="Episode already complete. Call reset() to start a new episode.",
169
+ done=True, reward=0.0,
170
+ )
171
+
172
+ self._step_count += 1
173
+ self._state.step_count = self._step_count
174
+ action_type = getattr(action, "action_type", "submit")
175
+ self._actions_taken.append(action_type)
176
+ self._state.actions_taken = list(self._actions_taken)
177
+
178
+ # Check if max steps reached - force submission
179
+ if self._step_count >= self._max_steps and action_type != "submit":
180
+ action_type = "submit"
181
+
182
+ # --- Investigation actions ---
183
+ if action_type in INVESTIGATION_ACTIONS:
184
+ feedback = self._handle_investigation(action_type)
185
+ return self._build_observation(
186
+ score=0.0, feedback=feedback, done=False, reward=0.0,
187
+ )
188
+
189
+ # --- Submit action ---
190
+ return self._handle_submission(action)
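
The dispatch rule in `step()` has two consequences that are easy to miss: the last allowed step is always treated as a submission, and any `action_type` that is not a known investigation action also falls through to submission. The sketch below isolates that decision. The function name `classify_step` is hypothetical, and `step_count` is the count after it has been incremented for the current call.

```python
INVESTIGATION_ACTIONS = {"read_body", "read_comments", "check_logs", "check_similar"}


def classify_step(requested: str, step_count: int, max_steps: int) -> str:
    """Return "investigate" or "submit" for the current step."""
    if step_count >= max_steps and requested != "submit":
        # Out of budget: the action is graded as a submission.
        requested = "submit"
    if requested in INVESTIGATION_ACTIONS:
        return "investigate"
    # "submit" and unrecognized action names both end the episode.
    return "submit"
```

With `MAX_STEPS_PER_TASK = {"easy": 4, "medium": 5, "hard": 6}`, an agent can use at most 3, 4, and 5 investigation actions respectively, so on `easy` it cannot reveal all four information sources.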
191
+
192
+ def _handle_investigation(self, action_type: str) -> str:
193
+ """Reveal information based on the investigation action."""
194
+ if action_type == "read_body":
195
+ if self._body_visible:
196
+ return "Full body already revealed. Choose another action or submit."
197
+ self._body_visible = True
198
+ return (
199
+ f"Full bug description revealed. "
200
+ f"Steps used: {self._step_count}/{self._max_steps}."
201
  )
202
 
203
+ elif action_type == "read_comments":
204
+ if self._comments_visible:
205
+ return "Comments already revealed. Choose another action or submit."
206
+ self._comments_visible = True
207
+ n = len(self._current_bug.comments)
208
+ return (
209
+ f"Revealed {n} comment(s). "
210
+ f"Steps used: {self._step_count}/{self._max_steps}."
211
+ )
212
 
213
+ elif action_type == "check_logs":
214
+ if self._logs_visible:
215
+ return "Logs already revealed. Choose another action or submit."
216
+ self._logs_visible = True
217
+ has_trace = bool(self._current_bug.stack_trace)
218
+ return (
219
+ f"System logs revealed. {'Stack trace available.' if has_trace else 'No stack trace.'} "
220
+ f"Steps used: {self._step_count}/{self._max_steps}."
221
+ )
222
+
223
+ elif action_type == "check_similar":
224
+ if self._similar_visible:
225
+ return "Similar bugs already revealed. Choose another action or submit."
226
+ self._similar_visible = True
227
+ n = len(self._current_bug.related_bugs)
228
+ return (
229
+ f"Found {n} related bug(s). "
230
+ f"Steps used: {self._step_count}/{self._max_steps}."
231
+ )
232
+
233
+ return f"Unknown investigation action: {action_type}"
234
+
235
+ def _handle_submission(self, action: TriageAction) -> TriageObservation:
236
+ """Grade the agent's final triage submission."""
237
+ score, feedback = grade_action(
238
+ self._current_task_key, self._current_bug, action,
239
+ answer=self._current_answer,
240
+ )
241
+
242
+ # Apply time efficiency bonus/penalty
243
+ # Fewer steps = better (if the answer is good)
244
+ investigation_steps = self._step_count - 1 # subtract the submit step
245
+ if investigation_steps == 0 and score >= 0.7:
246
+ # Got it right without investigating - impressive!
247
+ efficiency_bonus = 0.05
248
+ feedback += " | ⚡ Efficiency bonus: +0.05 (correct with minimal investigation)"
249
+ elif investigation_steps >= 3 and score >= 0.7:
250
+ # Took many steps but got it right - slight penalty for slowness
251
+ efficiency_penalty = 0.02 * (investigation_steps - 2)
252
+ score = score - efficiency_penalty
253
+ feedback += f" | ⏱ Time penalty: -{efficiency_penalty:.2f} ({investigation_steps} investigation steps)"
254
+ elif investigation_steps == 0 and score < 0.5:
255
+ # Rushed and got it wrong - feedback hint only, no score penalty is applied
256
+ feedback += " | ⚠ Consider investigating before submitting next time"
257
+
258
+ if investigation_steps == 0 and score >= 0.7:
259
+ score += 0.05
260
+
261
+ score = max(0.01, min(0.99, score))
262
 
263
  self._state.total_score += score
264
+ self._state.tasks_completed.append(self._current_task_key)
265
  self._episode_done = True
266
 
267
+ return self._build_observation(
268
+ score=score, feedback=feedback, done=True, reward=score,
 
 
 
 
 
269
  )
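
The efficiency adjustment in `_handle_submission` is spread over two separate `if` blocks, with an `efficiency_bonus` variable that is assigned but never read. Its net effect can be summarized in a few lines. The sketch below is a restatement under that reading, not a replacement for the committed code. The name `adjust_for_efficiency` is hypothetical.

```python
def adjust_for_efficiency(score: float, investigation_steps: int) -> float:
    """Apply the bonus/penalty described in _handle_submission, then clamp."""
    if score >= 0.7:
        if investigation_steps == 0:
            # Correct without investigating: flat bonus.
            score += 0.05
        elif investigation_steps >= 3:
            # Correct but slow: 0.02 per investigation step beyond the second.
            score -= 0.02 * (investigation_steps - 2)
    # Scores below 0.7 are left unchanged; the low-score case only adds feedback.
    return max(0.01, min(0.99, score))
```

For example, a graded score of `0.8` becomes roughly `0.85` with no investigation and roughly `0.76` after four investigation steps.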
270
 
271
  @property
 
273
  return self._state
274
 
275
  def get_state(self) -> TriageState:
276
+ return self._state
277
+
278
+
279
+ # ---------------------------------------------------------------------------
280
+ # SESSION MANAGER - handles concurrent sessions safely
281
+ # ---------------------------------------------------------------------------
282
+
283
+ class SessionManager:
284
+ """Thread-safe session management for multiple concurrent agents."""
285
+
286
+ def __init__(self, max_sessions: int = 1000, ttl_seconds: int = 600):
287
+ self._sessions: Dict[str, BugTriageEnvironment] = {}
288
+ self._timestamps: Dict[str, float] = {}
289
+ self._max_sessions = max_sessions
290
+ self._ttl = ttl_seconds
291
+
292
+ def create_session(self) -> Tuple[str, BugTriageEnvironment]:
293
+ """Create a new session and return (session_id, env)."""
294
+ self._cleanup_expired()
295
+ session_id = str(uuid.uuid4())
296
+ env = BugTriageEnvironment()
297
+ self._sessions[session_id] = env
298
+ self._timestamps[session_id] = time.time()
299
+ # Enforce max after adding
300
+ while len(self._sessions) > self._max_sessions:
301
+ oldest = min(self._timestamps, key=self._timestamps.get)
302
+ if oldest == session_id:
303
+ break
304
+ self._sessions.pop(oldest, None)
305
+ self._timestamps.pop(oldest, None)
306
+ return session_id, env
307
+
308
+ def get_session(self, session_id: str) -> Optional[BugTriageEnvironment]:
309
+ """Get an existing session's environment, or None if expired/missing."""
310
+ if session_id not in self._sessions:
311
+ return None
312
+ # Refresh TTL on access
313
+ self._timestamps[session_id] = time.time()
314
+ return self._sessions[session_id]
315
+
316
+ def remove_session(self, session_id: str) -> None:
317
+ """Remove a session after episode completes."""
318
+ self._sessions.pop(session_id, None)
319
+ self._timestamps.pop(session_id, None)
320
+
321
+ def _cleanup_expired(self) -> None:
322
+ """Remove sessions that exceeded TTL."""
323
+ now = time.time()
324
+ expired = [
325
+ sid for sid, ts in self._timestamps.items()
326
+ if now - ts > self._ttl
327
+ ]
328
+ for sid in expired:
329
+ self._sessions.pop(sid, None)
330
+ self._timestamps.pop(sid, None)
331
+
332
+ # Also enforce max sessions (remove oldest)
333
+ while len(self._sessions) > self._max_sessions:
334
+ oldest = min(self._timestamps, key=self._timestamps.get)
335
+ self._sessions.pop(oldest, None)
336
+ self._timestamps.pop(oldest, None)
337
+
338
+ @property
339
+ def active_count(self) -> int:
340
+ return len(self._sessions)
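Note on `SessionManager`: the docstring describes it as thread-safe, but no lock is visible in this hunk, and sequences such as "clean up, insert, evict oldest" span several dictionary operations. The sketch below shows one way the same bookkeeping could be guarded with a `threading.Lock`. The class name `LockedSessionStore` and the `factory` parameter are assumptions made for illustration and are not part of the commit.

```python
import threading
import time
import uuid
from typing import Any, Callable, Dict, Optional, Tuple


class LockedSessionStore:
    """Illustrative lock-guarded variant of the session bookkeeping."""

    def __init__(self, factory: Callable[[], Any],
                 max_sessions: int = 1000, ttl_seconds: float = 600.0) -> None:
        self._factory = factory            # hypothetical: builds one env per session
        self._max_sessions = max_sessions
        self._ttl = ttl_seconds
        self._lock = threading.Lock()
        # session_id -> (env, last_access_timestamp)
        self._sessions: Dict[str, Tuple[Any, float]] = {}

    def create_session(self) -> Tuple[str, Any]:
        session_id = str(uuid.uuid4())
        env = self._factory()
        with self._lock:
            now = time.time()
            self._drop_expired(now)
            self._sessions[session_id] = (env, now)
            # Evict the oldest other sessions while over capacity.
            while len(self._sessions) > self._max_sessions:
                oldest = min(self._sessions, key=lambda s: self._sessions[s][1])
                if oldest == session_id:
                    break
                del self._sessions[oldest]
        return session_id, env

    def get_session(self, session_id: str) -> Optional[Any]:
        with self._lock:
            entry = self._sessions.get(session_id)
            if entry is None:
                return None
            env, _ = entry
            self._sessions[session_id] = (env, time.time())  # refresh TTL on access
            return env

    def remove_session(self, session_id: str) -> None:
        with self._lock:
            self._sessions.pop(session_id, None)

    def _drop_expired(self, now: float) -> None:
        # Caller must already hold self._lock.
        expired = [sid for sid, (_, ts) in self._sessions.items()
                   if now - ts > self._ttl]
        for sid in expired:
            del self._sessions[sid]

    @property
    def active_count(self) -> int:
        with self._lock:
            return len(self._sessions)
```

Keeping the environment and its timestamp in one dictionary entry means the two can never drift apart, which is one less invariant to protect than with the two parallel dictionaries in the diff.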
server/requirements.txt CHANGED
@@ -4,4 +4,5 @@ uvicorn[standard]
4
  pydantic
5
  websockets
6
  openai
7
- httpx
 
 
4
  pydantic
5
  websockets
6
  openai
7
+ httpx
8
+ requests
server/task.py CHANGED
@@ -1,16 +1,426 @@
1
  # server/task.py
2
  import sys
3
  import random
 
4
  sys.path.insert(0, "/app")
5
 
6
- from typing import Tuple, List
7
  from model import BugReport, TriageAction
8
 
9
 
10
- # BUG REPORT DATASET
11
12
 
13
- TASKS = {
14
  "easy": {
15
  "bugs": [
16
  BugReport(
@@ -22,6 +432,9 @@ TASKS = {
22
  author="user123",
23
  labels_hint=[],
24
  comments=["Confirmed on iOS and Android.", "Happens every time."],
 
 
 
25
  ),
26
  BugReport(
27
  id="easy-002",
@@ -31,6 +444,9 @@ TASKS = {
31
  author="docs_fan",
32
  labels_hint=["documentation"],
33
  comments=[],
 
 
 
34
  ),
35
  BugReport(
36
  id="easy-003",
@@ -40,6 +456,9 @@ TASKS = {
40
  author="power_user",
41
  labels_hint=["performance"],
42
  comments=["Noticed after the last deploy.", "CPU spikes to 100%."],
 
 
 
43
  ),
44
  BugReport(
45
  id="easy-004",
@@ -49,7 +468,11 @@ TASKS = {
49
  "Affects all users attempting password reset.",
50
  author="support_team",
51
  labels_hint=["bug"],
52
- comments=["Reported by 12 users this week.", "Started after email service migration."],
 
 
 
 
53
  ),
54
  BugReport(
55
  id="easy-005",
@@ -59,9 +482,11 @@ TASKS = {
59
  author="intern_dev",
60
  labels_hint=["documentation"],
61
  comments=[],
 
 
 
62
  ),
63
  ],
64
- # Ground truth for grader
65
  "answers": {
66
  "easy-001": {"priority": "P0"},
67
  "easy-002": {"priority": "P3"},
@@ -82,6 +507,9 @@ TASKS = {
82
  author="store_owner",
83
  labels_hint=["bug"],
84
  comments=["Revenue impact confirmed.", "Happening since Tuesday."],
 
 
 
85
  ),
86
  BugReport(
87
  id="med-002",
@@ -92,6 +520,9 @@ TASKS = {
92
  author="moderator_jane",
93
  labels_hint=[],
94
  comments=["GDPR concern - deleted content still visible."],
 
 
 
95
  ),
96
  BugReport(
97
  id="med-003",
@@ -101,6 +532,9 @@ TASKS = {
101
  author="safari_user",
102
  labels_hint=["bug", "ux"],
103
  comments=["Only on Safari, not Chrome/Firefox."],
 
 
 
104
  ),
105
  BugReport(
106
  id="med-004",
@@ -110,7 +544,11 @@ TASKS = {
110
  "Affects users with international data.",
111
  author="data_analyst",
112
  labels_hint=["bug"],
113
- comments=["Encoding issue - UTF-8 not respected.", "Workaround: manual copy-paste."],
 
 
 
 
114
  ),
115
  BugReport(
116
  id="med-005",
@@ -120,7 +558,11 @@ TASKS = {
120
  "The unblock logic has a bug - it never clears the blocked flag.",
121
  author="api_user",
122
  labels_hint=["bug"],
123
- comments=["Affects CI/CD pipelines hitting the API.", "Retry-After header is wrong."],
 
 
 
 
124
  ),
125
  ],
126
  "answers": {
@@ -144,6 +586,10 @@ TASKS = {
144
  author="security_researcher",
145
  labels_hint=[],
146
  comments=["Critical. Affects production.", "Do not discuss publicly."],
 
 
 
 
147
  ),
148
  BugReport(
149
  id="hard-002",
@@ -155,6 +601,9 @@ TASKS = {
155
  author="devops_alice",
156
  labels_hint=["performance"],
157
  comments=["Verified with heap profiler.", "Started in v1.9."],
 
 
 
158
  ),
159
  BugReport(
160
  id="hard-003",
@@ -167,7 +616,12 @@ TASKS = {
167
  "Risk is low-probability but affects data integrity.",
168
  author="qa_bot",
169
  labels_hint=["bug"],
170
- comments=["Reproduced with locust at 50 concurrent users.", "Sequential mode avoids it."],
171
  ),
172
  BugReport(
173
  id="hard-004",
@@ -178,7 +632,12 @@ TASKS = {
178
  "This is a session management security vulnerability.",
179
  author="pentest_team",
180
  labels_hint=["security"],
181
- comments=["Verified on staging.", "OWASP A07 - Identification and Authentication Failures."],
182
  ),
183
  BugReport(
184
  id="hard-005",
@@ -189,126 +648,316 @@ TASKS = {
189
  "Triggered in production twice this week. Requires process kill to recover.",
190
  author="oncall_eng",
191
  labels_hint=["bug", "performance"],
192
- comments=["PagerDuty alert fired twice.", "Needs exponential backoff + max retry cap."],
193
  ),
194
  ],
195
  "answers": {
196
  "hard-001": {
197
- "priority": "P0",
198
- "labels": ["bug", "security"],
199
- "assigned_team": "security",
200
- "milestone": "hotfix",
201
  },
202
  "hard-002": {
203
- "priority": "P1",
204
- "labels": ["bug", "performance"],
205
- "assigned_team": "backend",
206
- "milestone": "v2.1",
207
  },
208
  "hard-003": {
209
- "priority": "P1",
210
- "labels": ["bug", "data-integrity"],
211
- "assigned_team": "backend",
212
- "milestone": "v2.1",
213
  },
214
  "hard-004": {
215
- "priority": "P0",
216
- "labels": ["bug", "security"],
217
- "assigned_team": "security",
218
- "milestone": "hotfix",
219
  },
220
  "hard-005": {
221
- "priority": "P0",
222
- "labels": ["bug", "performance"],
223
- "assigned_team": "backend",
224
- "milestone": "hotfix",
225
  },
226
  },
227
  },
228
  }
229
 
230
 
231
-
232
- # TASK SAMPLER - picks a random bug each reset
233
-
234
-
235
- def sample_bug(task_key: str) -> BugReport:
236
- """Return a random bug from the given task's pool."""
237
- return random.choice(TASKS[task_key]["bugs"])
238
-
239
-
240
-
241
- # GRADERS
242
-
243
 
244
  PRIORITY_ORDER = {"P0": 0, "P1": 1, "P2": 2, "P3": 3}
245
 
246
 
247
  def _priority_score(predicted: str, correct: str) -> float:
 
248
  if predicted == correct:
249
  return 0.95
250
- diff = abs(PRIORITY_ORDER.get(predicted, 99) - PRIORITY_ORDER.get(correct, 99))
251
- return 0.5 if diff == 1 else 0.05
252
 
253
 
254
 
255
  def _label_score(predicted: List[str], correct: List[str]) -> float:
256
- pred_set = set(l.lower() for l in predicted)
257
- corr_set = set(l.lower() for l in correct)
258
- if not corr_set:
 
 
259
  return 0.95
260
- intersection = pred_set & corr_set
261
- union = pred_set | corr_set
262
- raw = len(intersection) / len(union)
 
 
263
  return max(0.05, min(0.95, raw))
264
 
265
 
266
- def grade_action(task_key, bug, action):
267
- answer = TASKS[task_key]["answers"][bug.id]
268
  feedback_parts = []
 
269
 
270
  if task_key == "easy":
271
  score = _priority_score(action.priority, answer["priority"])
272
  symbol = "✓" if score >= 0.9 else "~" if score >= 0.4 else "✗"
273
- feedback_parts.append(f"Priority: {symbol} (got {action.priority}, expected {answer['priority']})")
 
 
274
  score = max(0.01, min(0.99, score))
275
  return round(score, 3), " | ".join(feedback_parts)
276
 
277
  elif task_key == "medium":
278
  p_score = _priority_score(action.priority, answer["priority"])
279
- l_score = _label_score(action.labels, answer["labels"])
280
  expected_team = answer.get("assigned_team", "")
281
  t_score = 0.95 if expected_team and action.assigned_team.lower() == expected_team.lower() else 0.05
282
- score = 0.45 * p_score + 0.40 * l_score + 0.15 * t_score
283
- feedback_parts.append(f"Priority: {p_score:.2f} (got {action.priority}, expected {answer['priority']})")
284
- feedback_parts.append(f"Labels: {l_score:.2f}")
285
- feedback_parts.append(f"Team: {t_score:.2f} (got {action.assigned_team}, expected {expected_team})")
286
  score = max(0.01, min(0.99, score))
287
  return round(score, 3), " | ".join(feedback_parts)
288
 
289
  else: # hard
290
  p_score = _priority_score(action.priority, answer["priority"])
291
- l_score = _label_score(action.labels, answer["labels"])
292
  t_score = 0.95 if action.assigned_team.lower() == answer["assigned_team"].lower() else 0.05
293
  m_score = 0.95 if action.milestone.lower() == answer["milestone"].lower() else 0.05
294
- score = 0.35 * p_score + 0.30 * l_score + 0.20 * t_score + 0.15 * m_score
295
- feedback_parts.append(f"Priority: {p_score:.2f} (got {action.priority}, expected {answer['priority']})")
296
- feedback_parts.append(f"Labels: {l_score:.2f}")
297
- feedback_parts.append(f"Team: {t_score:.2f} (got {action.assigned_team}, expected {answer['assigned_team']})")
298
- feedback_parts.append(f"Milestone: {m_score:.2f} (got {action.milestone}, expected {answer['milestone']})")
299
  if answer.get("assigned_team") == "security" and action.assigned_team.lower() != "security":
300
  score = max(0.01, score - 0.15)
301
  feedback_parts.append("⚠ Security escalation missed (-0.15)")
 
302
  score = max(0.01, min(0.99, score))
303
  return round(score, 3), " | ".join(feedback_parts)
304
-
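Note on the pre-change grader: the removed lines above score priority by distance (0.95 for an exact match, 0.5 when off by one level, 0.05 otherwise) and combine the hard-task components with weights 0.35, 0.30, 0.20 and 0.15, followed by a 0.15 penalty when a security bug is routed to another team. The sketch below restates only that removed logic. The function names `priority_score` and `hard_score` are illustrative, and the weights used by the new version are not visible in this hunk.

```python
PRIORITY_ORDER = {"P0": 0, "P1": 1, "P2": 2, "P3": 3}


def priority_score(predicted: str, correct: str) -> float:
    """Distance-based partial credit, as in the removed _priority_score lines."""
    if predicted == correct:
        return 0.95
    # Unknown priorities map to 99, so they always land in the "far off" bucket.
    diff = abs(PRIORITY_ORDER.get(predicted, 99) - PRIORITY_ORDER.get(correct, 99))
    return 0.5 if diff == 1 else 0.05


def hard_score(p: float, l: float, t: float, m: float,
               expected_team: str, assigned_team: str) -> float:
    """Weighted sum of priority, label, team and milestone scores (old weights)."""
    score = 0.35 * p + 0.30 * l + 0.20 * t + 0.15 * m
    # Missing a security escalation costs a flat 0.15.
    if expected_team == "security" and assigned_team.lower() != "security":
        score = max(0.01, score - 0.15)
    return round(max(0.01, min(0.99, score)), 3)


if __name__ == "__main__":
    p = priority_score("P1", "P0")  # one level off
    print(hard_score(p, 0.95, 0.05, 0.95, "security", "backend"))
```

The four weights sum to 1.0, so a submission that is perfect on every component scores 0.95 before the final clamp.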
305
  def priority_match(*args, **kwargs):
306
  if len(args) < 2:
307
  return 0.5
308
-
309
- bug = args[0]
310
- action = args[1]
311
-
312
  score, _ = grade_action("easy", bug, action)
313
  return float(score)
314
 
@@ -316,10 +965,7 @@ def priority_match(*args, **kwargs):
316
  def priority_label_team(*args, **kwargs):
317
  if len(args) < 2:
318
  return 0.5
319
-
320
- bug = args[0]
321
- action = args[1]
322
-
323
  score, _ = grade_action("medium", bug, action)
324
  return float(score)
325
 
@@ -327,17 +973,18 @@ def priority_label_team(*args, **kwargs):
327
  def full_triage(*args, **kwargs):
328
  if len(args) < 2:
329
  return 0.5
330
-
331
- bug = args[0]
332
- action = args[1]
333
-
334
  score, _ = grade_action("hard", bug, action)
335
  return float(score)
 
 
336
  __all__ = [
337
  "priority_match",
338
  "priority_label_team",
339
  "full_triage",
340
  "sample_bug",
 
341
  "grade_action",
342
- "TASKS",
 
343
  ]
 
1
  # server/task.py
2
  import sys
3
  import random
4
+ import hashlib
5
  sys.path.insert(0, "/app")
6
 
7
+ from typing import Tuple, List, Dict, Any
8
  from model import BugReport, TriageAction
9
 
10
 
11
+ # ---------------------------------------------------------------------------
12
+ # LABEL SYNONYM MAP - allows semantic matching
13
+ # ---------------------------------------------------------------------------
14
+
15
+ LABEL_SYNONYMS: Dict[str, set] = {
16
+ "bug": {"defect", "issue", "error", "fault", "broken"},
17
+ "security": {"vulnerability", "cve", "exploit", "auth", "injection"},
18
+ "performance": {"perf", "slow", "latency", "optimization", "speed", "memory"},
19
+ "ux": {"ui", "frontend", "user-experience", "design", "usability"},
20
+ "data-integrity": {"data-loss", "corruption", "data", "consistency"},
21
+ "payments": {"billing", "payment", "stripe", "checkout", "revenue"},
22
+ "documentation": {"docs", "typo", "readme", "wiki"},
23
+ "infrastructure": {"infra", "devops", "deploy", "ci", "cd", "docker"},
24
+ "api": {"endpoint", "rest", "graphql", "http", "request"},
25
+ "database": {"db", "sql", "query", "migration", "schema"},
26
+ }
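Note on `LABEL_SYNONYMS`: a map of this shape can be inverted once so that each alias resolves to its canonical label before sets are compared. The sketch below pairs that lookup with the Jaccard overlap and the 0.05 to 0.95 clamp used by the pre-change `_label_score`. The names `canonical` and `label_score` and the trimmed `SYNONYMS` map are assumptions for illustration, since the new `_label_score` body is not visible in this hunk.

```python
from typing import Dict, Iterable, Set

# Trimmed copy of the synonym map, just enough for the example.
SYNONYMS: Dict[str, Set[str]] = {
    "bug": {"defect", "issue", "error"},
    "performance": {"perf", "slow", "latency"},
    "documentation": {"docs", "typo", "readme"},
}

# Invert once: alias -> canonical label.
_ALIAS_TO_CANONICAL: Dict[str, str] = {
    alias: canon for canon, aliases in SYNONYMS.items() for alias in aliases
}


def canonical(label: str) -> str:
    """Normalise case and whitespace, then map aliases to the canonical label."""
    key = label.strip().lower()
    return _ALIAS_TO_CANONICAL.get(key, key)


def label_score(predicted: Iterable[str], correct: Iterable[str]) -> float:
    """Jaccard overlap on canonical labels, clamped to [0.05, 0.95]."""
    pred = {canonical(label) for label in predicted}
    corr = {canonical(label) for label in correct}
    if not corr:
        return 0.95  # nothing expected, so nothing can be wrong
    raw = len(pred & corr) / len(pred | corr)
    return max(0.05, min(0.95, raw))


if __name__ == "__main__":
    print(label_score(["Defect", "perf"], ["bug", "performance"]))
```

With this lookup, an agent that answers `defect` and `perf` gets the same credit as one that answers `bug` and `performance`.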
27
 
28
+ # ---------------------------------------------------------------------------
29
+ # BUG TEMPLATE SYSTEM - generates hundreds of unique bugs
30
+ # ---------------------------------------------------------------------------
31
+
32
+ _BUG_TEMPLATES = {
33
+ "crash": {
34
+ "titles": [
35
+ "{service} crashes on {trigger}",
36
+ "{service} throws {error_type} when {trigger}",
37
+ "Fatal error in {service} during {trigger}",
38
+ "Unhandled exception in {service}: {error_type}",
39
+ "{service} segfaults under {condition}",
40
+ ],
41
+ "bodies": [
42
+ "When a user {trigger}, the {service} crashes immediately. "
43
+ "Error: {error_type}. Stack trace points to {component}. "
44
+ "Affects {impact}. {workaround}",
45
+ "The {service} is failing with {error_type} every time a user {trigger}. "
46
+ "No error message is shown to the user - the process just dies. "
47
+ "Impact: {impact}. {workaround}",
48
+ ],
49
+ "vars": {
50
+ "service": ["auth service", "payment gateway", "search API", "notification worker",
51
+ "session manager", "user profile service", "file upload handler",
52
+ "webhook processor", "background job runner", "cache layer"],
53
+ "trigger": ["submits a form with special characters", "uploads a file larger than 10MB",
54
+ "logs in with SSO", "resets their password", "exports data to CSV",
55
+ "switches between tabs rapidly", "uses the bulk import feature",
56
+ "accesses the admin panel", "triggers a webhook", "runs a scheduled job"],
57
+ "error_type": ["NullPointerException", "SegmentationFault", "OutOfMemoryError",
58
+ "ConnectionTimeoutException", "StackOverflowError",
59
+ "IndexOutOfBoundsException", "TypeError", "KeyError"],
60
+ "component": ["UserController.java:142", "PaymentService.py:89",
61
+ "AuthMiddleware.ts:56", "SearchIndex.go:203",
62
+ "NotificationQueue.rb:77", "FileHandler.py:234"],
63
+ "impact": ["100% of users on this flow", "all mobile users", "EU region users only",
64
+ "users with accounts older than 1 year", "approximately 30% of sessions",
65
+ "every request during peak hours"],
66
+ "workaround": ["No workaround exists - the feature is completely broken.",
67
+ "Workaround: users can retry after clearing browser cache.",
68
+ "Temporary fix: restart the service every 2 hours.",
69
+ "No known workaround. Users are blocked."],
70
+ "condition": ["high concurrent load", "memory pressure above 80%",
71
+ "when connection pool is exhausted", "after running for 6+ hours"],
72
+ },
73
+ "answer_template": {
74
+ "severe": {"priority": "P0", "labels": ["bug"], "assigned_team": "backend", "milestone": "hotfix"},
75
+ "moderate": {"priority": "P1", "labels": ["bug"], "assigned_team": "backend", "milestone": "v2.1"},
76
+ },
77
+ "severity_keywords": {
78
+ "severe": ["100%", "all mobile", "No workaround", "completely broken", "blocked",
79
+ "SegmentationFault", "OutOfMemoryError"],
80
+ "moderate": ["retry", "30%", "Temporary fix", "restart"],
81
+ },
82
+ },
83
 
84
+ "security": {
85
+ "titles": [
86
+ "SQL injection vulnerability in {endpoint}",
87
+ "XSS attack possible via {input_field}",
88
+ "Authentication bypass in {service}",
89
+ "Sensitive data exposed in {location}",
90
+ "{credential_type} not invalidated after {event}",
91
+ "SSRF vulnerability in {endpoint}",
92
+ ],
93
+ "bodies": [
94
+ "The {endpoint} endpoint does not sanitize {input_field} inputs. "
95
+ "Crafted queries can {exploit_result}. PoC attached and verified on {env}. "
96
+ "Treat as confidential - do not discuss publicly until patched. {additional_context}",
97
+ "When a user {event}, existing {credential_type} remain valid for {duration}. "
98
+ "An attacker who {attack_vector} can continue to access the account. "
99
+ "This is a {vuln_category} vulnerability. {additional_context}",
100
+ ],
101
+ "vars": {
102
+ "endpoint": ["/api/search", "/api/users", "/api/export", "/admin/query",
103
+ "/api/upload", "/graphql", "/api/webhook"],
104
+ "input_field": ["search query", "username field", "file upload name",
105
+ "comment body", "profile bio", "webhook URL"],
106
+ "service": ["login flow", "OAuth callback", "API gateway", "admin panel",
107
+ "password reset", "2FA verification"],
108
+ "location": ["API error responses", "debug logs shipped to client",
109
+ "public S3 bucket", "unencrypted cookies", "localStorage"],
110
+ "credential_type": ["JWT tokens", "session cookies", "API keys", "OAuth tokens"],
111
+ "event": ["changes their password", "revokes API access",
112
+ "is suspended by admin", "enables 2FA"],
113
+ "exploit_result": ["dump the entire user table including password hashes",
114
+ "execute arbitrary JavaScript in other users' browsers",
115
+ "access any user's account without credentials",
116
+ "read internal service endpoints via SSRF"],
117
+ "env": ["production", "staging", "production replica"],
118
+ "duration": ["up to 24 hours", "indefinitely", "until manual cache clear",
119
+ "for the full token TTL (7 days)"],
120
+ "attack_vector": ["previously stole a token", "intercepted a session cookie",
121
+ "obtained a leaked API key"],
122
+ "vuln_category": ["session management", "access control",
123
+ "injection", "broken authentication"],
124
+ "additional_context": [
125
+ "OWASP A03 - Injection.",
126
+ "OWASP A07 - Identification and Authentication Failures.",
127
+ "CVSS score estimated at 9.1 (Critical).",
128
+ "Compliance impact: potential GDPR violation if user PII is exfiltrated.",
129
+ "Bounty hunter reported this 48 hours ago β€” disclosure deadline approaching.",
130
+ ],
131
+ },
132
+ "answer_template": {
133
+ "default": {"priority": "P0", "labels": ["bug", "security"],
134
+ "assigned_team": "security", "milestone": "hotfix"},
135
+ },
136
+ "severity_keywords": {"default": []},
137
+ },
138
+
139
+ "performance": {
140
+ "titles": [
141
+ "{page} loads slowly for {dataset_size}",
142
+ "Memory leak in {service} causes OOM after {duration}",
143
+ "API response time degrades under {load_condition}",
144
+ "{operation} takes {duration} for {dataset_size}",
145
+ "CPU spikes to 100% when {trigger}",
146
+ ],
147
+ "bodies": [
148
+ "When {condition}, the {page} takes {response_time} to load. "
149
+ "{diagnostic_info}. {impact}. {workaround}",
150
+ "The {service} allocates memory during {operation} and never frees it. "
151
+ "Server runs out of memory every {duration}. {diagnostic_info}. "
152
+ "{workaround}",
153
+ ],
154
+ "vars": {
155
+ "page": ["dashboard", "analytics page", "user list", "search results",
156
+ "audit log", "reports page", "admin overview"],
157
+ "service": ["background job processor", "cache warming service",
158
+ "log aggregator", "image resizer", "ETL pipeline"],
159
+ "dataset_size": ["large datasets (10k+ rows)", "enterprise accounts",
160
+ "tables with 100k+ entries", "files over 50MB"],
161
+ "duration": ["6 hours", "4 hours", "12 hours", "30+ seconds",
162
+ "2+ minutes", "an entire day"],
163
+ "load_condition": ["concurrent load", "peak traffic", "batch processing",
164
+ "more than 50 simultaneous users"],
165
+ "operation": ["bulk export", "report generation", "data migration",
166
+ "full-text search", "image processing"],
167
+ "trigger": ["running bulk exports", "processing large uploads",
168
+ "generating PDF reports", "reindexing search"],
169
+ "condition": ["a dataset has more than 10k rows",
170
+ "multiple users trigger exports simultaneously",
171
+ "the nightly ETL job runs alongside user traffic"],
172
+ "response_time": ["30+ seconds", "over a minute", "2-3 minutes",
173
+ "timeout after 60 seconds"],
174
+ "diagnostic_info": ["CPU spikes to 100%", "Heap profiler confirms the leak",
175
+ "Database EXPLAIN shows full table scan",
176
+ "N+1 query pattern detected in APM",
177
+ "Garbage collector running every 500ms"],
178
+ "impact": ["Affects power users with large accounts",
179
+ "All users experience slowness during peak hours",
180
+ "Requires manual restart to recover",
181
+ "Operational overhead: scheduled restarts every 4 hours"],
182
+ "workaround": ["Workaround: export data and use offline tools.",
183
+ "Workaround: scheduled restarts every 4 hours.",
184
+ "No workaround - users just wait.",
185
+ "Workaround: paginate results (but UX is degraded)."],
186
+ },
187
+ "answer_template": {
188
+ "severe": {"priority": "P1", "labels": ["bug", "performance"],
189
+ "assigned_team": "backend", "milestone": "v2.1"},
190
+ "moderate": {"priority": "P2", "labels": ["bug", "performance"],
191
+ "assigned_team": "backend", "milestone": "v2.1"},
192
+ },
193
+ "severity_keywords": {
194
+ "severe": ["OOM", "100%", "manual restart", "timeout", "No workaround",
195
+ "all users", "never frees"],
196
+ "moderate": ["Workaround", "power users", "paginate"],
197
+ },
198
+ },
199
+
200
+ "ui_bug": {
201
+ "titles": [
202
+ "{ui_element} breaks layout on {browser}",
203
+ "{ui_element} not rendering correctly in {mode}",
204
+ "Responsive layout broken on {device}",
205
+ "{feature} toggle not persisting across {context}",
206
+ "Accessibility: {ui_element} missing {a11y_attr}",
207
+ ],
208
+ "bodies": [
209
+ "Switching to {mode} on {browser} causes {ui_element} to {visual_issue}. "
210
+ "{other_browsers}. {workaround}",
211
+ "On {device}, the {ui_element} is {visual_issue}. "
212
+ "Tested on {browser}. {impact}. {workaround}",
213
+ ],
214
+ "vars": {
215
+ "ui_element": ["navigation bar", "sidebar menu", "modal dialog",
216
+ "dropdown selector", "data table", "footer",
217
+ "toast notifications", "breadcrumb trail"],
218
+ "browser": ["Safari 16", "Firefox ESR", "Chrome on Android",
219
+ "Edge on Windows", "iOS Safari", "Samsung Internet"],
220
+ "mode": ["dark mode", "high contrast mode", "RTL layout",
221
+ "compact view", "print view"],
222
+ "device": ["iPhone SE", "tablets in portrait", "screens below 768px",
223
+ "ultra-wide monitors", "4K displays"],
224
+ "feature": ["dark mode", "compact view", "language preference",
225
+ "notification settings"],
226
+ "context": ["page reloads", "different tabs", "sessions",
227
+ "browser restarts"],
228
+ "visual_issue": ["overlap the main content", "disappear entirely",
229
+ "render with incorrect colors", "become unclickable",
230
+ "overflow beyond the viewport"],
231
+ "other_browsers": ["Chrome and Firefox are unaffected.",
232
+ "Only reproducible on this specific browser.",
233
+ "Affects all WebKit-based browsers."],
234
+ "a11y_attr": ["ARIA labels", "keyboard focus indicators",
235
+ "screen reader text", "proper heading hierarchy"],
236
+ "impact": ["Cosmetic issue, no functional impact.",
237
+ "Users cannot access the affected feature.",
238
+ "Usability is degraded but the feature works."],
239
+ "workaround": ["Workaround: use a different browser.",
240
+ "Workaround: manually resize the window.",
241
+ "No workaround for this browser.",
242
+ "Workaround: disable the feature in settings."],
243
+ },
244
+ "answer_template": {
245
+ "severe": {"priority": "P2", "labels": ["bug", "ux"],
246
+ "assigned_team": "frontend", "milestone": "v2.1"},
247
+ "moderate": {"priority": "P3", "labels": ["bug", "ux"],
248
+ "assigned_team": "frontend", "milestone": "backlog"},
249
+ },
250
+ "severity_keywords": {
251
+ "severe": ["cannot access", "unclickable", "disappear", "No workaround"],
252
+ "moderate": ["Cosmetic", "different browser", "resize"],
253
+ },
254
+ },
255
+
256
+ "data_corruption": {
257
+ "titles": [
258
+ "Race condition in {feature}: {consequence}",
259
+ "Data inconsistency in {feature} under concurrent writes",
260
+ "{export_format} export produces corrupted output for {edge_case}",
261
+ "Stale data served from cache after {trigger}",
262
+ "Duplicate records created when {trigger}",
263
+ ],
264
+ "bodies": [
265
+ "Under concurrent load, {feature} can {consequence} due to a race condition "
266
+ "in {root_cause}. Frequency: {frequency}. {impact}. {workaround}",
267
+ "When {feature} data contains {edge_case}, the exported {export_format} file "
268
+ "is corrupted and cannot be {consumer}. {impact}. {workaround}",
269
+ ],
270
+ "vars": {
271
+ "feature": ["file upload", "order processing", "user registration",
272
+ "inventory update", "comment system", "permission assignment"],
273
+ "consequence": ["files occasionally overwrite each other",
274
+ "orders are duplicated or lost",
275
+ "users get assigned wrong permissions",
276
+ "inventory counts become negative"],
277
+ "root_cause": ["temp file naming logic", "lack of database locking",
278
+ "non-atomic read-modify-write cycle",
279
+ "missing unique constraint"],
280
+ "frequency": ["approximately 1 in 10,000 operations",
281
+ "consistently under 50+ concurrent users",
282
+ "intermittently - hard to reproduce",
283
+ "every time the batch job runs"],
284
+ "edge_case": ["non-ASCII characters (e.g., café, naïve)",
285
+ "values containing commas or quotes",
286
+ "null or empty fields",
287
+ "timestamps crossing DST boundaries"],
288
+ "export_format": ["CSV", "Excel", "JSON", "PDF"],
289
+ "consumer": ["opened in Excel", "parsed by downstream services",
290
+ "imported back into the system"],
291
+ "trigger": ["double-clicking the submit button",
292
+ "cache TTL expires during a write operation",
293
+ "two users edit the same record simultaneously",
294
+ "the nightly sync job overlaps with user activity"],
295
+ "impact": ["Potential data loss confirmed.",
296
+ "No data loss confirmed yet, but risk exists.",
297
+ "Affects users with international data.",
298
+ "Breaks downstream pipeline processing."],
299
+ "workaround": ["Workaround: enable sequential mode in settings.",
300
+ "Workaround: manually re-export after cleanup.",
301
+ "No reliable workaround - data must be manually verified.",
302
+ "Workaround: add a mutex lock externally (operational overhead)."],
303
+ },
304
+ "answer_template": {
305
+ "severe": {"priority": "P1", "labels": ["bug", "data-integrity"],
306
+ "assigned_team": "backend", "milestone": "v2.1"},
307
+ "moderate": {"priority": "P2", "labels": ["bug", "data-integrity"],
308
+ "assigned_team": "backend", "milestone": "v2.1"},
309
+ },
310
+ "severity_keywords": {
311
+ "severe": ["data loss", "No reliable workaround", "consistently",
312
+ "permissions", "overwrite", "negative"],
313
+ "moderate": ["No data loss", "intermittently", "sequential mode",
314
+ "re-export", "non-ASCII"],
315
+ },
316
+ },
317
+
318
+ "documentation": {
319
+ "titles": [
320
+ "Typo in {location}",
321
+ "Outdated {doc_type} on {page}",
322
+ "Missing documentation for {feature}",
323
+ "Incorrect {doc_element} in {location}",
324
+ ],
325
+ "bodies": [
326
+ "There is a {issue_type} on the {page}: {detail}. No functional impact, "
327
+ "purely cosmetic. {extra}",
328
+ "The {doc_type} for {feature} is {issue_type}. {detail}. {extra}",
329
+ ],
330
+ "vars": {
331
+ "location": ["homepage docs", "API reference", "README", "changelog",
332
+ "contributing guide", "onboarding wiki"],
333
+ "doc_type": ["installation guide", "API documentation", "changelog",
334
+ "migration guide", "code comments"],
335
+ "page": ["landing page", "docs homepage", "getting started page",
336
+ "FAQ section", "footer"],
337
+ "feature": ["new webhook API", "batch processing endpoint",
338
+ "SSO integration", "rate limiting"],
339
+ "doc_element": ["code example", "endpoint URL", "parameter description",
340
+ "copyright year", "version number"],
341
+ "issue_type": ["a typo", "outdated", "missing", "incorrect", "misleading"],
342
+ "detail": ["'Welccome' should be 'Welcome'",
343
+ "references removed v1.x API that no longer exists",
344
+ "completely undocumented despite being a core feature",
345
+ "shows '© 2022' but should be '© 2024'",
346
+ "the curl example uses the wrong HTTP method"],
347
+ "extra": ["", "Low priority - does not block any workflow.",
348
+ "New users have reported confusion.",
349
+ "Only noticed by contributors reading source code."],
350
+ },
351
+ "answer_template": {
352
+ "default": {"priority": "P3", "labels": ["documentation"],
353
+ "assigned_team": "devx", "milestone": "backlog"},
354
+ },
355
+ "severity_keywords": {"default": []},
356
+ },
357
+
358
+ "api_bug": {
359
+ "titles": [
360
+ "API rate limiter {issue} after {trigger}",
361
+ "{endpoint} returns {status_code} instead of {expected_code}",
362
+ "Pagination broken on {endpoint}: {symptom}",
363
+ "Webhook delivery {issue} for {event_type} events",
364
+ "API versioning: {endpoint} behaves differently on v1 vs v2",
365
+ ],
366
+ "bodies": [
367
+ "After receiving a {status_code} response, {consequence}. "
368
+ "The {root_cause}. {impact}. {workaround}",
369
+ "The {endpoint} endpoint {symptom} when {trigger}. "
370
+ "Expected behavior: {expected}. Actual: {actual}. {impact}.",
371
+ ],
372
+ "vars": {
373
+ "endpoint": ["/api/users", "/api/search", "/api/export",
374
+ "/api/webhooks", "/api/billing", "/api/analytics"],
375
+ "issue": ["blocks legitimate users", "fails silently",
376
+ "returns incorrect retry headers", "drops events"],
377
+ "trigger": ["a 429 error", "rate limit window resets",
378
+ "a burst of requests from CI/CD", "server restart"],
379
+ "status_code": ["429", "500", "502", "504", "403"],
380
+ "expected_code": ["200", "201", "204", "404"],
381
+ "symptom": ["returns duplicate entries",
382
+ "skips items between pages",
383
+ "returns empty page despite more data existing"],
384
+ "event_type": ["payment.completed", "user.created",
385
+ "subscription.cancelled", "deployment.finished"],
386
+ "consequence": ["legitimate users remain blocked for 1 hour",
387
+ "data is silently lost with no error",
388
+ "downstream services receive stale data"],
389
+ "root_cause": ["unblock logic has a bug - it never clears the blocked flag",
390
+ "cursor-based pagination uses wrong sort order",
391
+ "retry-after header reports seconds instead of milliseconds"],
392
+ "expected": ["200 OK with paginated results",
393
+ "successful delivery with retry on failure",
394
+ "proper rate limit reset after window expires"],
395
+ "actual": ["empty response with 200 status",
396
+ "permanent block until manual intervention",
397
+ "events dropped without any error log"],
398
+ "impact": ["Affects CI/CD pipelines hitting the API.",
399
+ "External integrations break silently.",
400
+ "Customer-facing dashboards show wrong data.",
401
+ "Retry-After header causes clients to wait too long."],
402
+ "workaround": ["Workaround: manually clear Redis key.",
403
+ "Workaround: add client-side deduplication.",
404
+ "No workaround - requires server-side fix.",
405
+ "Workaround: pin API version to v1 in headers."],
406
+ },
407
+ "answer_template": {
408
+ "severe": {"priority": "P1", "labels": ["bug", "api"],
409
+ "assigned_team": "backend", "milestone": "v2.1"},
410
+ "moderate": {"priority": "P2", "labels": ["bug", "api"],
411
+ "assigned_team": "backend", "milestone": "v2.1"},
412
+ },
413
+ "severity_keywords": {
414
+ "severe": ["silently lost", "permanent block", "No workaround",
415
+ "dropped", "external integrations"],
416
+ "moderate": ["Workaround", "pin API", "deduplication"],
417
+ },
418
+ },
419
+ }
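Note on `_BUG_TEMPLATES`: each entry pairs format strings with pools of values and a keyword list per severity level. The generator that consumes these templates is not visible in this hunk, so the following is only a sketch of how one entry could be rendered deterministically from a seed. The names `render_bug`, `placeholder_names` and `pick_severity`, the trimmed `CRASH_TEMPLATE`, and the rule "first level with a keyword hit wins, otherwise the last level" are all assumptions.

```python
import random
import string
from typing import Any, Dict, List

# Trimmed, hypothetical template in the same shape as one _BUG_TEMPLATES entry.
CRASH_TEMPLATE: Dict[str, Any] = {
    "titles": ["{service} crashes on {trigger}"],
    "bodies": ["When a user {trigger}, the {service} crashes. "
               "Affects {impact}. {workaround}"],
    "vars": {
        "service": ["auth service", "search API"],
        "trigger": ["logs in with SSO", "exports data to CSV"],
        "impact": ["100% of users on this flow", "approximately 30% of sessions"],
        "workaround": ["No known workaround. Users are blocked.",
                       "Workaround: users can retry after clearing browser cache."],
    },
    "answer_template": {
        "severe": {"priority": "P0", "milestone": "hotfix"},
        "moderate": {"priority": "P1", "milestone": "v2.1"},
    },
    "severity_keywords": {
        "severe": ["100%", "No known workaround", "blocked"],
        "moderate": ["retry", "30%"],
    },
}


def placeholder_names(template: str) -> List[str]:
    """Return the {field} names used in a format string."""
    return [name for _, name, _, _ in string.Formatter().parse(template) if name]


def pick_severity(text: str, template: Dict[str, Any]) -> str:
    """Assumed rule: first level with a keyword hit wins, else the last level."""
    levels = list(template["answer_template"])
    for level in levels:
        if any(k in text for k in template["severity_keywords"].get(level, [])):
            return level
    return levels[-1]


def render_bug(template: Dict[str, Any], seed: int) -> Dict[str, Any]:
    """Fill one title and body from the value pools, reproducibly for a given seed."""
    rng = random.Random(seed)
    title_t = rng.choice(template["titles"])
    body_t = rng.choice(template["bodies"])
    # Sort the names so the draw order does not depend on set iteration order.
    names = sorted(set(placeholder_names(title_t) + placeholder_names(body_t)))
    values = {name: rng.choice(template["vars"][name]) for name in names}
    title, body = title_t.format(**values), body_t.format(**values)
    level = pick_severity(f"{title} {body}", template)
    return {"title": title, "body": body, "severity": level,
            "answer": dict(template["answer_template"][level])}


if __name__ == "__main__":
    print(render_bug(CRASH_TEMPLATE, seed=42))
```

Seeding a private `random.Random` instance keeps generated bugs reproducible across runs without touching the global random state that other parts of the server may rely on.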
420
+
421
+
422
+ # The original handcrafted bugs - kept as a gold-standard subset
423
+ _HANDCRAFTED_BUGS = {
424
  "easy": {
425
  "bugs": [
426
  BugReport(
 
432
  author="user123",
433
  labels_hint=[],
434
  comments=["Confirmed on iOS and Android.", "Happens every time."],
435
+ severity_signals=["100% of users", "crashes", "no workaround"],
436
+ stack_trace="NullPointerException at AuthController.java:87",
437
+ affected_component="auth-service",
438
  ),
439
  BugReport(
440
  id="easy-002",
 
444
  author="docs_fan",
445
  labels_hint=["documentation"],
446
  comments=[],
447
+ severity_signals=["cosmetic", "no functional impact"],
448
+ stack_trace="",
449
+ affected_component="docs",
450
  ),
451
  BugReport(
452
  id="easy-003",
 
456
  author="power_user",
457
  labels_hint=["performance"],
458
  comments=["Noticed after the last deploy.", "CPU spikes to 100%."],
459
+ severity_signals=["workaround exists", "power users only"],
460
+ stack_trace="",
461
+ affected_component="dashboard",
462
  ),
463
  BugReport(
464
  id="easy-004",
 
468
  "Affects all users attempting password reset.",
469
  author="support_team",
470
  labels_hint=["bug"],
471
+ comments=["Reported by 12 users this week.",
472
+ "Started after email service migration."],
473
+ severity_signals=["all users", "never dispatched"],
474
+ stack_trace="",
475
+ affected_component="email-service",
476
  ),
477
  BugReport(
478
  id="easy-005",
 
482
  author="intern_dev",
483
  labels_hint=["documentation"],
484
  comments=[],
485
+ severity_signals=["no functional impact"],
486
+ stack_trace="",
487
+ affected_component="frontend",
488
  ),
489
  ],
 
490
  "answers": {
491
  "easy-001": {"priority": "P0"},
492
  "easy-002": {"priority": "P3"},
 
507
  author="store_owner",
508
  labels_hint=["bug"],
509
  comments=["Revenue impact confirmed.", "Happening since Tuesday."],
510
+ severity_signals=["revenue loss", "silently", "every failed checkout"],
511
+ stack_trace="Stripe API: card_declined at PaymentService.py:145",
512
+ affected_component="payment-service",
513
  ),
514
  BugReport(
515
  id="med-002",
 
520
  author="moderator_jane",
521
  labels_hint=[],
522
  comments=["GDPR concern - deleted content still visible."],
523
+ severity_signals=["GDPR violation", "deleted content visible"],
524
+ stack_trace="",
525
+ affected_component="search-index",
526
  ),
527
  BugReport(
528
  id="med-003",
 
532
  author="safari_user",
533
  labels_hint=["bug", "ux"],
534
  comments=["Only on Safari, not Chrome/Firefox."],
535
+ severity_signals=["workaround exists", "single browser"],
536
+ stack_trace="",
537
+ affected_component="frontend-css",
538
  ),
539
  BugReport(
540
  id="med-004",
 
544
  "Affects users with international data.",
545
  author="data_analyst",
546
  labels_hint=["bug"],
547
+ comments=["Encoding issue - UTF-8 not respected.",
548
+ "Workaround: manual copy-paste."],
549
+ severity_signals=["corrupted", "workaround exists"],
550
+ stack_trace="",
551
+ affected_component="export-service",
552
  ),
553
  BugReport(
554
  id="med-005",
 
558
  "The unblock logic has a bug - it never clears the blocked flag.",
559
  author="api_user",
560
  labels_hint=["bug"],
561
+ comments=["Affects CI/CD pipelines hitting the API.",
562
+ "Retry-After header is wrong."],
563
+ severity_signals=["permanent block", "never clears", "bug in logic"],
564
+ stack_trace="",
565
+ affected_component="api-gateway",
566
  ),
567
  ],
568
  "answers": {
 
586
  author="security_researcher",
587
  labels_hint=[],
588
  comments=["Critical. Affects production.", "Do not discuss publicly."],
589
+ severity_signals=["SQL injection", "password hashes", "production",
590
+ "confidential"],
591
+ stack_trace="",
592
+ affected_component="search-api",
593
  ),
594
  BugReport(
595
  id="hard-002",
 
601
  author="devops_alice",
602
  labels_hint=["performance"],
603
  comments=["Verified with heap profiler.", "Started in v1.9."],
604
+ severity_signals=["memory leak", "OOM", "manual restart", "never frees"],
605
+ stack_trace="HeapDump: JobProcessor.process() -> 50MB/call, never GC'd",
606
+ affected_component="job-processor",
607
  ),
608
  BugReport(
609
  id="hard-003",
 
616
  "Risk is low-probability but affects data integrity.",
617
  author="qa_bot",
618
  labels_hint=["bug"],
619
+ comments=["Reproduced with locust at 50 concurrent users.",
620
+ "Sequential mode avoids it."],
621
+ severity_signals=["race condition", "data integrity",
622
+ "workaround exists", "low-probability"],
623
+ stack_trace="",
624
+ affected_component="file-upload",
625
  ),
626
  BugReport(
627
  id="hard-004",
 
632
  "This is a session management security vulnerability.",
633
  author="pentest_team",
634
  labels_hint=["security"],
635
+ comments=["Verified on staging.",
636
+ "OWASP A07 - Identification and Authentication Failures."],
637
+ severity_signals=["JWT not invalidated", "attacker", "security vulnerability",
638
+ "stolen token"],
639
+ stack_trace="",
640
+ affected_component="auth-service",
641
  ),
642
  BugReport(
643
  id="hard-005",
 
648
  "Triggered in production twice this week. Requires process kill to recover.",
649
  author="oncall_eng",
650
  labels_hint=["bug", "performance"],
651
+ comments=["PagerDuty alert fired twice.",
652
+ "Needs exponential backoff + max retry cap."],
653
+ severity_signals=["infinite loop", "100%", "production",
654
+ "process kill", "starves other services"],
655
+ stack_trace="Thread dump: WebhookRetrier.retry() → recursive call, no exit",
656
+ affected_component="webhook-service",
657
  ),
658
  ],
659
  "answers": {
660
  "hard-001": {
661
+ "priority": "P0", "labels": ["bug", "security"],
662
+ "assigned_team": "security", "milestone": "hotfix",
 
 
663
  },
664
  "hard-002": {
665
+ "priority": "P1", "labels": ["bug", "performance"],
666
+ "assigned_team": "backend", "milestone": "v2.1",
 
 
667
  },
668
  "hard-003": {
669
+ "priority": "P1", "labels": ["bug", "data-integrity"],
670
+ "assigned_team": "backend", "milestone": "v2.1",
 
 
671
  },
672
  "hard-004": {
673
+ "priority": "P0", "labels": ["bug", "security"],
674
+ "assigned_team": "security", "milestone": "hotfix",
 
 
675
  },
676
  "hard-005": {
677
+ "priority": "P0", "labels": ["bug", "performance"],
678
+ "assigned_team": "backend", "milestone": "hotfix",
 
 
679
  },
680
  },
681
  },
682
  }
683
 
684
 
685
+ # Combine into single TASKS dict (backward compatible)
686
+ TASKS = _HANDCRAFTED_BUGS
687
+
688
+
689
+ # ---------------------------------------------------------------------------
690
+ # PROCEDURAL BUG GENERATOR
691
+ # ---------------------------------------------------------------------------
692
+
693
+ def _determine_severity(text: str, keywords: Dict[str, list]) -> str:
694
+ """Check which severity level the generated text matches."""
695
+ text_lower = text.lower()
696
+ for level, kws in keywords.items():
697
+ if level == "default":
698
+ return "default"
699
+ hits = sum(1 for kw in kws if kw.lower() in text_lower)
700
+ if hits >= 1:
701
+ return level
702
+ # fallback to first non-default key
703
+ return list(keywords.keys())[0] if keywords else "moderate"
704
+
705
+
706
+ def generate_bug(task_key: str, seed: int = None) -> Tuple[BugReport, dict]:
707
+ """Generate a procedural bug report with its correct answer."""
708
+ rng = random.Random(seed)
709
+
710
+ # Weight categories by difficulty
711
+ weights = {
712
+ "easy": {"documentation": 3, "ui_bug": 3, "performance": 2,
713
+ "crash": 1, "api_bug": 1},
714
+ "medium": {"crash": 3, "performance": 3, "api_bug": 2,
715
+ "data_corruption": 2, "ui_bug": 1},
716
+ "hard": {"security": 4, "crash": 3, "data_corruption": 3,
717
+ "performance": 2, "api_bug": 2},
718
+ }
719
+
720
+ task_weights = weights.get(task_key, weights["medium"])
721
+ categories = []
722
+ for cat, w in task_weights.items():
723
+ categories.extend([cat] * w)
724
+ category = rng.choice(categories)
725
+
726
+ template = _BUG_TEMPLATES[category]
727
+
728
+ # Pick random variable values
729
+ chosen_vars = {}
730
+ for var_name, options in template["vars"].items():
731
+ chosen_vars[var_name] = rng.choice(options)
732
+
733
+ # Build title and body
734
+ title_tmpl = rng.choice(template["titles"])
735
+ body_tmpl = rng.choice(template["bodies"])
736
+
737
+ # Safe format - placeholders without a matching key are left untouched
738
+ def safe_format(tmpl, vars_dict):
739
+ result = tmpl
740
+ for k, v in vars_dict.items():
741
+ result = result.replace("{" + k + "}", v)
742
+ return result
743
+
744
+ title = safe_format(title_tmpl, chosen_vars)
745
+ body = safe_format(body_tmpl, chosen_vars)
746
+
747
+ # Generate unique ID from seed
748
+ bug_id = f"gen-{(seed if seed is not None else rng.randint(0, 999999)):06d}"
749
+
750
+ # Pick author
751
+ authors = ["user_report", "qa_engineer", "support_team", "dev_oncall",
752
+ "security_bot", "customer_jane", "automated_monitor",
753
+ "intern_dev", "senior_eng", "pm_feedback"]
754
+ author = rng.choice(authors)
755
+
756
+ # Build comments
757
+ comment_templates = [
758
+ "Confirmed on our side.", "Reproduced in staging.",
759
+ "Multiple reports from users.", "Started after last deployment.",
760
+ "Urgent - customer escalation.", "Low priority - no user complaints.",
761
+ "Needs investigation.", "Related to ticket from last sprint.",
762
+ ]
763
+ num_comments = rng.randint(0, 3)
764
+ comments = rng.sample(comment_templates, min(num_comments, len(comment_templates)))
765
+
766
+ # Determine severity and answer
767
+ full_text = f"{title} {body} {' '.join(comments)}"
768
+ severity_kws = template.get("severity_keywords", {})
769
+ severity = _determine_severity(full_text, severity_kws)
770
+
771
+ answer_templates = template["answer_template"]
772
+ answer = dict(answer_templates.get(severity, list(answer_templates.values())[0]))
773
+
774
+ # For easy tasks, only priority matters
775
+ if task_key == "easy":
776
+ answer = {"priority": answer["priority"]}
777
+ elif task_key == "medium":
778
+ answer.pop("milestone", None)
779
+
780
+ bug = BugReport(
781
+ id=bug_id,
782
+ title=title,
783
+ body=body,
784
+ author=author,
785
+ labels_hint=rng.sample(["bug", "needs-triage", "reported"], rng.randint(0, 2)),
786
+ comments=comments,
787
+ severity_signals=[],
788
+ stack_trace="",
789
+ affected_component=chosen_vars.get("service", chosen_vars.get("endpoint", "")),
790
+ )
791
+
792
+ return bug, answer
793
+
794
+
795
+ # ---------------------------------------------------------------------------
796
+ # BUG SAMPLER - mixes handcrafted bugs (about 40%) with procedural ones (about 60%)
797
+ # ---------------------------------------------------------------------------
798
+
799
+ def sample_bug(task_key: str, seed: int = None) -> Tuple[BugReport, dict]:
800
+ """Return a bug and its answer. Mixes handcrafted + procedural."""
801
+ rng = random.Random(seed)
802
+
803
+ # 40% chance of handcrafted, 60% procedural
804
+ if rng.random() < 0.4 and task_key in _HANDCRAFTED_BUGS:
805
+ bugs = _HANDCRAFTED_BUGS[task_key]["bugs"]
806
+ bug = rng.choice(bugs)
807
+ answer = _HANDCRAFTED_BUGS[task_key]["answers"][bug.id]
808
+ return bug, answer
809
+ else:
810
+ gen_seed = seed if seed is not None else rng.randint(0, 999999)
811
+ return generate_bug(task_key, seed=gen_seed)
812
+
813
+
814
+ # ---------------------------------------------------------------------------
815
+ # GRADING - with semantic label matching
816
+ # ---------------------------------------------------------------------------
817
 
818
  PRIORITY_ORDER = {"P0": 0, "P1": 1, "P2": 2, "P3": 3}
819
 
820
 
821
  def _priority_score(predicted: str, correct: str) -> float:
822
+ """Score priority assignment with partial credit for near-misses."""
823
  if predicted == correct:
824
  return 0.95
825
+ pred_rank = PRIORITY_ORDER.get(predicted, 99)
826
+ corr_rank = PRIORITY_ORDER.get(correct, 99)
827
+ diff = abs(pred_rank - corr_rank)
828
+ if diff == 1:
829
+ return 0.5
830
+ elif diff == 2:
831
+ return 0.2
832
+ return 0.05
833
 
834
 
835
+ def _normalize_label(label: str) -> str:
836
+ """Normalize a label to its canonical form."""
837
+ label_lower = label.lower().strip()
838
+ for canonical, synonyms in LABEL_SYNONYMS.items():
839
+ if label_lower == canonical or label_lower in synonyms:
840
+ return canonical
841
+ return label_lower
842
+
843
 
844
  def _label_score(predicted: List[str], correct: List[str]) -> float:
845
+ """Score labels using semantic matching via synonym groups."""
846
+ pred_normalized = set(_normalize_label(l) for l in predicted)
847
+ corr_normalized = set(_normalize_label(l) for l in correct)
848
+
849
+ if not corr_normalized:
850
  return 0.95
851
+
852
+ intersection = pred_normalized & corr_normalized
853
+ union = pred_normalized | corr_normalized
854
+
855
+ raw = len(intersection) / len(union) if union else 0.0
856
  return max(0.05, min(0.95, raw))
857
 
858
 
859
+ def _reasoning_score(reasoning: str, answer: dict) -> float:
860
+ """Bonus for reasoning that mentions relevant signals."""
861
+ if not reasoning or len(reasoning.strip()) < 10:
862
+ return 0.0
863
+
864
+ key_signals = {
865
+ "P0": ["production", "all users", "data loss", "security", "crash",
866
+ "revenue", "injection", "vulnerability", "100%"],
867
+ "P1": ["major", "significant", "no workaround", "broken",
868
+ "gdpr", "blocked", "leak", "never"],
869
+ "P2": ["degraded", "workaround", "partial", "slow",
870
+ "affected", "power users"],
871
+ "P3": ["minor", "cosmetic", "docs", "typo", "low",
872
+ "no functional impact"],
873
+ }
874
+
875
+ expected_priority = answer.get("priority", "P2")
876
+ signals = key_signals.get(expected_priority, [])
877
+ reasoning_lower = reasoning.lower()
878
+
879
+ hits = sum(1 for s in signals if s in reasoning_lower)
880
+ return min(0.15, hits * 0.05)
881
+
882
+
883
+ def grade_action(task_key: str, bug: BugReport, action: TriageAction,
884
+ answer: dict = None) -> Tuple[float, str]:
885
+ """Grade the agent's triage action against the correct answer."""
886
+
887
+ # Backward compatibility: look up answer from handcrafted if not provided
888
+ if answer is None:
889
+ if task_key in _HANDCRAFTED_BUGS and bug.id in _HANDCRAFTED_BUGS[task_key]["answers"]:
890
+ answer = _HANDCRAFTED_BUGS[task_key]["answers"][bug.id]
891
+ else:
892
+ return 0.5, "No answer key found for this bug."
893
+
894
  feedback_parts = []
895
+ reasoning_bonus = _reasoning_score(action.reasoning, answer)
896
 
897
  if task_key == "easy":
898
  score = _priority_score(action.priority, answer["priority"])
899
  symbol = "✓" if score >= 0.9 else "~" if score >= 0.4 else "✗"
900
+ feedback_parts.append(
901
+ f"Priority: {symbol} (got {action.priority}, expected {answer['priority']})")
902
+ score = score + reasoning_bonus
903
  score = max(0.01, min(0.99, score))
904
  return round(score, 3), " | ".join(feedback_parts)
905
 
906
  elif task_key == "medium":
907
  p_score = _priority_score(action.priority, answer["priority"])
908
+ l_score = _label_score(action.labels, answer.get("labels", []))
909
  expected_team = answer.get("assigned_team", "")
910
  t_score = 0.95 if expected_team and action.assigned_team.lower() == expected_team.lower() else 0.05
911
+
912
+ score = 0.45 * p_score + 0.40 * l_score + 0.15 * t_score + reasoning_bonus
913
+
914
+ feedback_parts.append(
915
+ f"Priority: {p_score:.2f} (got {action.priority}, expected {answer['priority']})")
916
+ feedback_parts.append(f"Labels: {l_score:.2f} (semantic match)")
917
+ feedback_parts.append(
918
+ f"Team: {t_score:.2f} (got {action.assigned_team}, expected {expected_team})")
919
+ if reasoning_bonus > 0:
920
+ feedback_parts.append(f"Reasoning bonus: +{reasoning_bonus:.2f}")
921
+
922
  score = max(0.01, min(0.99, score))
923
  return round(score, 3), " | ".join(feedback_parts)
924
 
925
  else: # hard
926
  p_score = _priority_score(action.priority, answer["priority"])
927
+ l_score = _label_score(action.labels, answer.get("labels", []))
928
  t_score = 0.95 if action.assigned_team.lower() == answer["assigned_team"].lower() else 0.05
929
  m_score = 0.95 if action.milestone.lower() == answer["milestone"].lower() else 0.05
930
+
931
+ score = 0.35 * p_score + 0.30 * l_score + 0.20 * t_score + 0.15 * m_score + reasoning_bonus
932
+
933
+ feedback_parts.append(
934
+ f"Priority: {p_score:.2f} (got {action.priority}, expected {answer['priority']})")
935
+ feedback_parts.append(f"Labels: {l_score:.2f} (semantic match)")
936
+ feedback_parts.append(
937
+ f"Team: {t_score:.2f} (got {action.assigned_team}, expected {answer['assigned_team']})")
938
+ feedback_parts.append(
939
+ f"Milestone: {m_score:.2f} (got {action.milestone}, expected {answer['milestone']})")
940
+
941
+ if reasoning_bonus > 0:
942
+ feedback_parts.append(f"Reasoning bonus: +{reasoning_bonus:.2f}")
943
+
944
+ # Security escalation penalty
945
  if answer.get("assigned_team") == "security" and action.assigned_team.lower() != "security":
946
  score = max(0.01, score - 0.15)
947
  feedback_parts.append("⚠ Security escalation missed (-0.15)")
948
+
949
  score = max(0.01, min(0.99, score))
950
  return round(score, 3), " | ".join(feedback_parts)
951
+
952
+
953
+ # ---------------------------------------------------------------------------
954
+ # NAMED GRADER FUNCTIONS - referenced by openenv.yaml
955
+ # ---------------------------------------------------------------------------
956
+
957
  def priority_match(*args, **kwargs):
958
  if len(args) < 2:
959
  return 0.5
960
+ bug, action = args[0], args[1]
 
 
 
961
  score, _ = grade_action("easy", bug, action)
962
  return float(score)
963
 
 
965
  def priority_label_team(*args, **kwargs):
966
  if len(args) < 2:
967
  return 0.5
968
+ bug, action = args[0], args[1]
 
 
 
969
  score, _ = grade_action("medium", bug, action)
970
  return float(score)
971
 
 
973
  def full_triage(*args, **kwargs):
974
  if len(args) < 2:
975
  return 0.5
976
+ bug, action = args[0], args[1]
 
 
 
977
  score, _ = grade_action("hard", bug, action)
978
  return float(score)
979
+
980
+
981
  __all__ = [
982
  "priority_match",
983
  "priority_label_team",
984
  "full_triage",
985
  "sample_bug",
986
+ "generate_bug",
987
  "grade_action",
988
+ "TASKS",
989
+ "LABEL_SYNONYMS",
990
  ]
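
The grading rules in the `server/task.py` diff above reduce to a small amount of arithmetic. What follows is a minimal standalone sketch, not the module itself: the names `priority_score`, `label_score`, and `hard_score` are hypothetical, and synonym normalization is left out because the contents of `LABEL_SYNONYMS` are not shown in this part of the diff. The numeric constants are copied from the diff.

```python
from typing import List

PRIORITY_ORDER = {"P0": 0, "P1": 1, "P2": 2, "P3": 3}


def priority_score(predicted: str, correct: str) -> float:
    """Partial credit that shrinks with the distance between priorities."""
    if predicted == correct:
        return 0.95
    # Unknown priorities get rank 99, so they fall through to the minimum.
    diff = abs(PRIORITY_ORDER.get(predicted, 99) - PRIORITY_ORDER.get(correct, 99))
    return {1: 0.5, 2: 0.2}.get(diff, 0.05)


def label_score(predicted: List[str], correct: List[str]) -> float:
    """Jaccard overlap of the two label sets, clamped to [0.05, 0.95]."""
    pred = {label.lower().strip() for label in predicted}
    corr = {label.lower().strip() for label in correct}
    if not corr:
        return 0.95
    raw = len(pred & corr) / len(pred | corr)
    return max(0.05, min(0.95, raw))


def hard_score(p_score: float, l_score: float, team_ok: bool,
               milestone_ok: bool, bonus: float = 0.0,
               missed_security: bool = False) -> float:
    """Weighted sum used for the hard task, with the security penalty."""
    t_score = 0.95 if team_ok else 0.05
    m_score = 0.95 if milestone_ok else 0.05
    score = 0.35 * p_score + 0.30 * l_score + 0.20 * t_score + 0.15 * m_score + bonus
    if missed_security:
        # Routing a security bug to another team costs a flat 0.15.
        score = max(0.01, score - 0.15)
    return round(max(0.01, min(0.99, score)), 3)


if __name__ == "__main__":
    print(priority_score("P0", "P1"))
    print(label_score(["bug"], ["bug", "security"]))
    print(hard_score(0.95, 0.95, team_ok=False, milestone_ok=True,
                     missed_security=True))
```

Under these rules a fully correct hard triage without a reasoning bonus scores 0.95, while the same submission routed away from the security team drops to about 0.62: the team component falls from 0.95 to 0.05 and the escalation penalty is applied on top.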
tests/__init__.py ADDED
@@ -0,0 +1 @@
 
 
1
+ # tests/__init__.py
tests/test_api.py ADDED
@@ -0,0 +1,190 @@
1
+ # tests/test_api.py
2
+ """Integration tests for the FastAPI endpoints."""
3
+ import sys
4
+ import os
5
+ sys.path.insert(0, os.path.dirname(os.path.dirname(__file__)))
6
+ sys.path.insert(0, os.path.join(os.path.dirname(os.path.dirname(__file__)), "server"))
7
+
8
+ import pytest
9
+
10
+ # These tests require fastapi and httpx
11
+ try:
12
+ from fastapi.testclient import TestClient
13
+ from server.app import app
14
+ HAS_DEPS = True
15
+ except ImportError:
16
+ HAS_DEPS = False
17
+
18
+ pytestmark = pytest.mark.skipif(not HAS_DEPS, reason="FastAPI/httpx not installed")
19
+
20
+
21
+ @pytest.fixture
22
+ def client():
23
+ return TestClient(app)
24
+
25
+
26
+ class TestHealthEndpoint:
27
+ def test_health_returns_ok(self, client):
28
+ r = client.get("/health")
29
+ assert r.status_code == 200
30
+ data = r.json()
31
+ assert data.get("status") in ("ok", "healthy")
32
+
33
+
34
+ class TestTaskEndpoints:
35
+ def test_list_tasks(self, client):
36
+ r = client.get("/tasks")
37
+ assert r.status_code == 200
38
+ tasks = r.json()
39
+ assert len(tasks) == 3
40
+ ids = [t["id"] for t in tasks]
41
+ assert "easy" in ids
42
+ assert "medium" in ids
43
+ assert "hard" in ids
44
+
45
+ def test_get_specific_task(self, client):
46
+ r = client.get("/tasks/easy")
47
+ assert r.status_code == 200
48
+ assert r.json()["id"] == "easy"
49
+
50
+ def test_get_nonexistent_task(self, client):
51
+ r = client.get("/tasks/impossible")
52
+ assert r.status_code == 404
53
+
54
+
55
+ class TestResetEndpoint:
56
+ def test_reset_returns_observation(self, client):
57
+ r = client.post("/reset", json={"task_id": "easy"})
58
+ assert r.status_code == 200
59
+ data = r.json()
60
+ assert "observation" in data
61
+ assert "session_id" in data
62
+ assert data["done"] is False
63
+
64
+ def test_reset_with_empty_body(self, client):
65
+ r = client.post("/reset", json={})
66
+ assert r.status_code == 200
67
+
68
+ def test_reset_returns_bug_report(self, client):
69
+ r = client.post("/reset", json={"task_id": "medium"})
70
+ data = r.json()
71
+ obs = data["observation"]
72
+ assert "bug_report" in obs
73
+ assert "title" in obs["bug_report"]
74
+
75
+
76
+ class TestStepEndpoint:
77
+ def test_investigation_step(self, client):
78
+ # Reset first
79
+ r = client.post("/reset", json={"task_id": "easy"})
80
+ session_id = r.json()["session_id"]
81
+
82
+ # Investigate
83
+ r = client.post("/step", json={
84
+ "session_id": session_id,
85
+ "action": {"action_type": "read_body"},
86
+ })
87
+ assert r.status_code == 200
88
+ data = r.json()
89
+ assert data["done"] is False
90
+
91
+ def test_submit_step(self, client):
92
+ # Reset
93
+ r = client.post("/reset", json={"task_id": "easy"})
94
+ session_id = r.json()["session_id"]
95
+
96
+ # Submit
97
+ r = client.post("/step", json={
98
+ "session_id": session_id,
99
+ "action": {
100
+ "action_type": "submit",
101
+ "priority": "P0",
102
+ "labels": ["bug"],
103
+ "assigned_team": "backend",
104
+ },
105
+ })
106
+ assert r.status_code == 200
107
+ data = r.json()
108
+ assert data["done"] is True
109
+ assert 0 < data["reward"] < 1
110
+
111
+ def test_full_episode_flow(self, client):
112
+ # Reset
113
+ r = client.post("/reset", json={"task_id": "hard"})
114
+ assert r.status_code == 200
115
+ session_id = r.json()["session_id"]
116
+
117
+ # Investigate: read body
118
+ r = client.post("/step", json={
119
+ "session_id": session_id,
120
+ "action": {"action_type": "read_body"},
121
+ })
122
+ assert r.status_code == 200
123
+ assert r.json()["done"] is False
124
+
125
+ # Investigate: read comments
126
+ r = client.post("/step", json={
127
+ "session_id": session_id,
128
+ "action": {"action_type": "read_comments"},
129
+ })
130
+ assert r.status_code == 200
131
+ assert r.json()["done"] is False
132
+
133
+ # Submit triage
134
+ r = client.post("/step", json={
135
+ "session_id": session_id,
136
+ "action": {
137
+ "action_type": "submit",
138
+ "priority": "P0",
139
+ "labels": ["bug", "security"],
140
+ "assigned_team": "security",
141
+ "milestone": "hotfix",
142
+ "reasoning": "Critical security vulnerability in production",
143
+ },
144
+ })
145
+ assert r.status_code == 200
146
+ data = r.json()
147
+ assert data["done"] is True
148
+ assert 0 < data["reward"] < 1
149
+
150
+ def test_backward_compatible_no_session(self, client):
151
+ """Old-style requests without session_id should still work."""
152
+ r = client.post("/reset", json={"task_id": "easy"})
153
+ assert r.status_code == 200
154
+
155
+ r = client.post("/step", json={
156
+ "action": {
157
+ "priority": "P0",
158
+ "labels": ["bug"],
159
+ },
160
+ })
161
+ assert r.status_code == 200
162
+
163
+
164
+ class TestStateEndpoint:
165
+ def test_state_returns_data(self, client):
166
+ client.post("/reset", json={"task_id": "easy"})
167
+ r = client.get("/state")
168
+ assert r.status_code == 200
169
+ data = r.json()
170
+ assert "current_task" in data
171
+ assert "step_count" in data
172
+
173
+
174
+ class TestLeaderboard:
175
+ def test_get_empty_leaderboard(self, client):
176
+ r = client.get("/leaderboard")
177
+ assert r.status_code == 200
178
+ assert isinstance(r.json(), list)
179
+
180
+ def test_submit_to_leaderboard(self, client):
181
+ r = client.post("/leaderboard/submit", json={
182
+ "agent_name": "test-agent",
183
+ "model": "test-model",
184
+ "scores": {"easy": 0.9, "medium": 0.7, "hard": 0.5},
185
+ "avg_score": 0.7,
186
+ })
187
+ assert r.status_code == 200
188
+ data = r.json()
189
+ assert data["status"] == "submitted"
190
+ assert "rank" in data
tests/test_environment.py ADDED
@@ -0,0 +1,205 @@
1
+ # tests/test_environment.py
2
+ """Tests for the environment logic in server/environment.py"""
3
+ import sys
4
+ import os
5
+ sys.path.insert(0, os.path.dirname(os.path.dirname(__file__)))
6
+ sys.path.insert(0, os.path.join(os.path.dirname(os.path.dirname(__file__)), "server"))
7
+
8
+ import pytest
9
+ from model import TriageAction, TriageObservation
10
+ from server.environment import BugTriageEnvironment, SessionManager
11
+
12
+
13
+ class TestEnvironmentReset:
14
+ def test_reset_returns_observation(self):
15
+ env = BugTriageEnvironment()
16
+ obs = env.reset(task_id="easy")
17
+ assert isinstance(obs, TriageObservation)
18
+ assert obs.bug_report is not None
19
+ assert obs.done is False
20
+ assert obs.task_id == "easy"
21
+
22
+ def test_reset_different_tasks(self):
23
+ env = BugTriageEnvironment()
24
+ for task_id in ["easy", "medium", "hard"]:
25
+ obs = env.reset(task_id=task_id)
26
+ assert obs.task_id == task_id
27
+ assert obs.done is False
28
+
29
+ def test_reset_invalid_task_defaults_to_easy(self):
30
+ env = BugTriageEnvironment()
31
+ obs = env.reset(task_id="nonexistent")
32
+ assert obs.task_id == "easy"
33
+
34
+ def test_reset_shows_truncated_body(self):
35
+ env = BugTriageEnvironment()
36
+ obs = env.reset(task_id="easy")
37
+ # Body should be truncated (not fully visible) on reset
38
+ assert obs.body_visible is False
39
+
40
+ def test_reset_hides_comments(self):
41
+ env = BugTriageEnvironment()
42
+ obs = env.reset(task_id="easy")
43
+ assert obs.comments_visible is False
44
+
45
+ def test_reset_clears_previous_state(self):
46
+ env = BugTriageEnvironment()
47
+ env.reset(task_id="easy")
48
+ env.step(TriageAction(action_type="submit", priority="P0"))
49
+ # Reset should clear everything
50
+ obs = env.reset(task_id="medium")
51
+ assert obs.done is False
52
+ assert obs.task_id == "medium"
53
+ assert obs.steps_taken == 0
54
+
55
+
56
+ class TestEnvironmentInvestigation:
57
+ def test_read_body_reveals_full_body(self):
58
+ env = BugTriageEnvironment()
59
+ env.reset(task_id="easy")
60
+ obs = env.step(TriageAction(action_type="read_body"))
61
+ assert obs.body_visible is True
62
+ assert obs.done is False
63
+ assert obs.steps_taken == 1
64
+
65
+ def test_read_comments_reveals_comments(self):
66
+ env = BugTriageEnvironment()
67
+ env.reset(task_id="easy")
68
+ obs = env.step(TriageAction(action_type="read_comments"))
69
+ assert obs.comments_visible is True
70
+ assert obs.done is False
71
+
72
+ def test_check_logs_reveals_logs(self):
73
+ env = BugTriageEnvironment()
74
+ env.reset(task_id="easy")
75
+ obs = env.step(TriageAction(action_type="check_logs"))
76
+ assert obs.logs_visible is True
77
+ assert obs.done is False
78
+
79
+ def test_duplicate_investigation_gives_feedback(self):
80
+ env = BugTriageEnvironment()
81
+ env.reset(task_id="easy")
82
+ env.step(TriageAction(action_type="read_body"))
83
+ obs = env.step(TriageAction(action_type="read_body"))
84
+ assert "already" in obs.feedback.lower()
85
+
86
+ def test_step_count_increments(self):
87
+ env = BugTriageEnvironment()
88
+ env.reset(task_id="easy")
89
+ obs1 = env.step(TriageAction(action_type="read_body"))
90
+ assert obs1.steps_taken == 1
91
+ obs2 = env.step(TriageAction(action_type="read_comments"))
92
+ assert obs2.steps_taken == 2
93
+
94
+
95
+ class TestEnvironmentSubmission:
96
+ def test_submit_returns_done(self):
97
+ env = BugTriageEnvironment()
98
+ env.reset(task_id="easy")
99
+ obs = env.step(TriageAction(action_type="submit", priority="P0"))
100
+ assert obs.done is True
101
+
102
+ def test_submit_returns_valid_score(self):
103
+ env = BugTriageEnvironment()
104
+ env.reset(task_id="easy")
105
+ obs = env.step(TriageAction(action_type="submit", priority="P0"))
106
+ assert 0 < obs.score < 1
107
+ assert 0 < obs.reward < 1
108
+
109
+ def test_investigate_then_submit(self):
110
+ env = BugTriageEnvironment()
111
+ env.reset(task_id="medium")
112
+ env.step(TriageAction(action_type="read_body"))
113
+ env.step(TriageAction(action_type="read_comments"))
114
+ obs = env.step(TriageAction(
115
+ action_type="submit", priority="P0",
116
+ labels=["bug"], assigned_team="backend",
117
+ ))
118
+ assert obs.done is True
119
+ assert 0 < obs.score < 1
120
+
121
+ def test_double_submit_stays_done(self):
122
+ env = BugTriageEnvironment()
123
+ env.reset(task_id="easy")
124
+ env.step(TriageAction(action_type="submit", priority="P0"))
125
+ obs = env.step(TriageAction(action_type="submit", priority="P1"))
126
+ assert obs.done is True
127
+ assert "already complete" in obs.feedback.lower()
128
+
129
+ def test_max_steps_forces_submit(self):
130
+ env = BugTriageEnvironment()
131
+ obs = env.reset(task_id="easy")
132
+ max_steps = obs.max_steps
133
+
134
+ # Use all steps investigating
135
+ for _ in range(max_steps - 1):
136
+ obs = env.step(TriageAction(action_type="read_body"))
137
+ if obs.done:
138
+ break
139
+
140
+ # This should force a submit even if action_type is investigate
141
+ if not obs.done:
142
+ obs = env.step(TriageAction(
143
+ action_type="read_comments", # will be forced to submit
144
+ priority="P0",
145
+ ))
146
+ # The step budget is exhausted, so the episode must have ended.
+ assert obs.done is True
147
+
148
+ class TestEnvironmentState:
149
+ def test_state_tracks_steps(self):
150
+ env = BugTriageEnvironment()
151
+ env.reset(task_id="easy")
152
+ env.step(TriageAction(action_type="read_body"))
153
+ state = env.get_state()
154
+ assert state.step_count == 1
155
+ assert "read_body" in state.actions_taken
156
+
157
+ def test_state_tracks_completed_tasks(self):
158
+ env = BugTriageEnvironment()
159
+ env.reset(task_id="easy")
160
+ env.step(TriageAction(action_type="submit", priority="P0"))
161
+ state = env.get_state()
162
+ assert "easy" in state.tasks_completed
163
+
164
+
165
+ class TestSessionManager:
166
+ def test_create_session(self):
167
+ mgr = SessionManager(max_sessions=10, ttl_seconds=60)
168
+ session_id, env = mgr.create_session()
169
+ assert session_id is not None
170
+ assert isinstance(env, BugTriageEnvironment)
171
+ assert mgr.active_count == 1
172
+
173
+ def test_get_session(self):
174
+ mgr = SessionManager()
175
+ session_id, env = mgr.create_session()
176
+ retrieved = mgr.get_session(session_id)
177
+ assert retrieved is env
178
+
179
+ def test_get_missing_session(self):
180
+ mgr = SessionManager()
181
+ assert mgr.get_session("nonexistent") is None
182
+
183
+ def test_remove_session(self):
184
+ mgr = SessionManager()
185
+ session_id, _ = mgr.create_session()
186
+ mgr.remove_session(session_id)
187
+ assert mgr.get_session(session_id) is None
188
+ assert mgr.active_count == 0
189
+
190
+ def test_max_sessions_enforced(self):
191
+ mgr = SessionManager(max_sessions=3, ttl_seconds=60)
192
+ for _ in range(5):
193
+ mgr.create_session()
194
+ assert mgr.active_count <= 3
195
+
196
+ def test_multiple_sessions_independent(self):
197
+ mgr = SessionManager()
198
+ sid1, env1 = mgr.create_session()
199
+ sid2, env2 = mgr.create_session()
200
+
201
+ env1.reset(task_id="easy")
202
+ env2.reset(task_id="hard")
203
+
204
+ assert env1.get_state().current_task == "easy"
205
+ assert env2.get_state().current_task == "hard"
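
The `TestSessionManager` cases above pin down an interface (`create_session`, `get_session`, `remove_session`, `active_count`, a session cap, and a TTL) without showing the implementation. One way such a manager could be written is sketched below. This is an assumption-laden illustration, not the code in `server/environment.py`: the class name `SketchSessionManager`, the `factory` parameter, the default limits, and the evict-oldest policy are all hypothetical.

```python
import time
import uuid
from collections import OrderedDict
from typing import Any, Callable, Optional, Tuple


class SketchSessionManager:
    """Hypothetical session store with a size cap and idle-time expiry."""

    def __init__(self, factory: Callable[[], Any] = dict,
                 max_sessions: int = 100, ttl_seconds: float = 3600.0):
        self._factory = factory          # builds one environment per session
        self._max = max_sessions
        self._ttl = ttl_seconds
        # Maps session id -> (last access time, environment), oldest first.
        self._sessions: "OrderedDict[str, Tuple[float, Any]]" = OrderedDict()

    def _evict_expired(self) -> None:
        now = time.monotonic()
        expired = [sid for sid, (seen, _) in self._sessions.items()
                   if now - seen > self._ttl]
        for sid in expired:
            del self._sessions[sid]

    def create_session(self) -> Tuple[str, Any]:
        self._evict_expired()
        # Assumed policy: drop the least recently used session when full.
        while self._sessions and len(self._sessions) >= self._max:
            self._sessions.popitem(last=False)
        sid = uuid.uuid4().hex
        env = self._factory()
        self._sessions[sid] = (time.monotonic(), env)
        return sid, env

    def get_session(self, sid: str) -> Optional[Any]:
        self._evict_expired()
        entry = self._sessions.get(sid)
        if entry is None:
            return None
        _, env = entry
        # Refresh the access time and mark the session as most recent.
        self._sessions[sid] = (time.monotonic(), env)
        self._sessions.move_to_end(sid)
        return env

    def remove_session(self, sid: str) -> None:
        self._sessions.pop(sid, None)

    @property
    def active_count(self) -> int:
        return len(self._sessions)
```

Evicting the oldest session keeps `create_session` from failing under load, at the cost of silently ending an idle episode. A real server might prefer to reject new sessions instead; the tests shown here only require that `active_count` never exceeds the cap.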
tests/test_grading.py ADDED
@@ -0,0 +1,253 @@
+ # tests/test_grading.py
+ """Tests for the grading logic in server/task.py"""
+ import sys
+ import os
+ sys.path.insert(0, os.path.dirname(os.path.dirname(__file__)))
+ sys.path.insert(0, os.path.join(os.path.dirname(os.path.dirname(__file__)), "server"))
+
+ import pytest
+ from model import BugReport, TriageAction
+ from server.task import (
+     _priority_score, _label_score, _normalize_label, _reasoning_score,
+     grade_action, generate_bug, sample_bug, TASKS, LABEL_SYNONYMS,
+ )
+
+
+ # ── Priority Scoring ──────────────────────────────────────
+
+ class TestPriorityScoring:
+     def test_exact_match_gives_high_score(self):
+         assert _priority_score("P0", "P0") == 0.95
+
+     def test_all_exact_matches(self):
+         for p in ["P0", "P1", "P2", "P3"]:
+             assert _priority_score(p, p) == 0.95
+
+     def test_off_by_one_gives_partial_credit(self):
+         assert _priority_score("P0", "P1") == 0.5
+         assert _priority_score("P1", "P2") == 0.5
+         assert _priority_score("P2", "P3") == 0.5
+
+     def test_off_by_two_gives_low_credit(self):
+         assert _priority_score("P0", "P2") == 0.2
+         assert _priority_score("P1", "P3") == 0.2
+
+     def test_completely_wrong_gives_minimum(self):
+         assert _priority_score("P0", "P3") == 0.05
+
+     def test_invalid_priority(self):
+         assert _priority_score("P9", "P0") == 0.05
+         assert _priority_score("invalid", "P0") == 0.05
+
+
+ # ── Label Scoring ─────────────────────────────────────────
+
+ class TestLabelScoring:
+     def test_perfect_match(self):
+         score = _label_score(["bug", "security"], ["bug", "security"])
+         assert score >= 0.9
+
+     def test_partial_overlap(self):
+         score = _label_score(["bug"], ["bug", "security"])
+         assert 0.3 < score < 0.7  # ~50% Jaccard
+
+     def test_no_overlap(self):
+         score = _label_score(["docs"], ["bug", "security"])
+         assert score == 0.05  # clamped minimum
+
+     def test_empty_correct_labels(self):
+         score = _label_score(["bug"], [])
+         assert score == 0.95  # nothing expected => full credit
+
+     def test_synonym_matching(self):
+         # "defect" is a synonym for "bug"
+         score = _label_score(["defect"], ["bug"])
+         assert score >= 0.9  # should match via synonym
+
+     def test_case_insensitive(self):
+         score = _label_score(["BUG", "Security"], ["bug", "security"])
+         assert score >= 0.9
+
+
+ # ── Label Normalization ───────────────────────────────────
+
+ class TestLabelNormalization:
+     def test_canonical_stays_same(self):
+         assert _normalize_label("bug") == "bug"
+         assert _normalize_label("security") == "security"
+
+     def test_synonym_maps_to_canonical(self):
+         assert _normalize_label("defect") == "bug"
+         assert _normalize_label("vulnerability") == "security"
+         assert _normalize_label("slow") == "performance"
+         assert _normalize_label("ui") == "ux"
+
+     def test_unknown_label_passes_through(self):
+         assert _normalize_label("my-custom-label") == "my-custom-label"
+
+     def test_case_insensitive(self):
+         assert _normalize_label("BUG") == "bug"
+         assert _normalize_label("Vulnerability") == "security"
+
+
+ # ── Reasoning Scoring ─────────────────────────────────────
+
+ class TestReasoningScoring:
+     def test_empty_reasoning_gives_zero(self):
+         assert _reasoning_score("", {"priority": "P0"}) == 0.0
+
+     def test_short_reasoning_gives_zero(self):
+         assert _reasoning_score("bad", {"priority": "P0"}) == 0.0
+
+     def test_relevant_reasoning_gives_bonus(self):
+         score = _reasoning_score(
+             "This is a critical security vulnerability affecting production and causing data loss",
+             {"priority": "P0"},
+         )
+         assert score > 0
+
+     def test_bonus_capped_at_max(self):
+         score = _reasoning_score(
+             "production down all users data loss security crash revenue injection vulnerability 100%",
+             {"priority": "P0"},
+         )
+         assert score <= 0.15
+
+
+ # ── Grade Action ──────────────────────────────────────────
+
+ class TestGradeAction:
+     @pytest.fixture
+     def easy_bug(self):
+         return TASKS["easy"]["bugs"][0]  # easy-001: P0
+
+     @pytest.fixture
+     def medium_bug(self):
+         return TASKS["medium"]["bugs"][0]  # med-001: P0, payments, backend
+
+     @pytest.fixture
+     def hard_bug(self):
+         return TASKS["hard"]["bugs"][0]  # hard-001: P0, security, hotfix
+
+     def test_easy_perfect_answer(self, easy_bug):
+         action = TriageAction(priority="P0")
+         score, feedback = grade_action("easy", easy_bug, action)
+         assert 0.9 <= score <= 0.99
+         assert "✓" in feedback
+
+     def test_easy_wrong_answer(self, easy_bug):
+         action = TriageAction(priority="P3")
+         score, feedback = grade_action("easy", easy_bug, action)
+         assert score < 0.2
+
+     def test_medium_perfect_answer(self, medium_bug):
+         action = TriageAction(
+             priority="P0",
+             labels=["bug", "payments"],
+             assigned_team="backend",
+         )
+         score, feedback = grade_action("medium", medium_bug, action)
+         assert score > 0.8
+
+     def test_hard_security_penalty(self, hard_bug):
+         # hard-001 requires security team; assigning backend should be penalized
+         action_wrong = TriageAction(
+             priority="P0",
+             labels=["bug", "security"],
+             assigned_team="backend",  # Wrong! Should be security
+             milestone="hotfix",
+         )
+         action_right = TriageAction(
+             priority="P0",
+             labels=["bug", "security"],
+             assigned_team="security",
+             milestone="hotfix",
+         )
+         score_wrong, fb_wrong = grade_action("hard", hard_bug, action_wrong)
+         score_right, fb_right = grade_action("hard", hard_bug, action_right)
+
+         assert score_right > score_wrong
+         assert "Security escalation missed" in fb_wrong
+
+     def test_all_scores_in_valid_range(self):
+         """Every grading result must be in the open interval (0, 1)."""
+         for task_key in ["easy", "medium", "hard"]:
+             for bug in TASKS[task_key]["bugs"]:
+                 for priority in ["P0", "P1", "P2", "P3"]:
+                     action = TriageAction(
+                         priority=priority,
+                         labels=["bug"],
+                         assigned_team="backend",
+                         milestone="backlog",
+                     )
+                     score, feedback = grade_action(task_key, bug, action)
+                     assert 0 < score < 1, (
+                         f"Score {score} out of range for {bug.id} "
+                         f"with priority={priority}"
+                     )
+                     assert isinstance(feedback, str)
+                     assert len(feedback) > 0
+
+
+ # ── Procedural Bug Generation ─────────────────────────────
+
+ class TestBugGeneration:
+     def test_generate_produces_valid_bug(self):
+         bug, answer = generate_bug("easy", seed=42)
+         assert isinstance(bug, BugReport)
+         assert bug.id.startswith("gen-")
+         assert len(bug.title) > 5
+         assert len(bug.body) > 20
+         assert "priority" in answer
+
+     def test_different_seeds_produce_different_bugs(self):
+         bug1, _ = generate_bug("easy", seed=1)
+         bug2, _ = generate_bug("easy", seed=2)
+         # Very unlikely to produce the same title with different seeds
+         assert bug1.title != bug2.title or bug1.body != bug2.body
+
+     def test_same_seed_produces_same_bug(self):
+         bug1, ans1 = generate_bug("easy", seed=42)
+         bug2, ans2 = generate_bug("easy", seed=42)
+         assert bug1.title == bug2.title
+         assert bug1.body == bug2.body
+         assert ans1 == ans2
+
+     def test_easy_bugs_have_only_priority(self):
+         for seed in range(10):
+             _, answer = generate_bug("easy", seed=seed)
+             assert "priority" in answer
+             # easy should NOT include milestone
+             assert "milestone" not in answer
+
+     def test_hard_bugs_have_full_answer(self):
+         for seed in range(50):
+             _, answer = generate_bug("hard", seed=seed)
+             assert "priority" in answer
+
+     def test_all_difficulties(self):
+         for difficulty in ["easy", "medium", "hard"]:
+             bug, answer = generate_bug(difficulty, seed=100)
+             assert isinstance(bug, BugReport)
+             assert "priority" in answer
+
+     def test_sample_bug_returns_tuple(self):
+         bug, answer = sample_bug("easy", seed=42)
+         assert isinstance(bug, BugReport)
+         assert isinstance(answer, dict)
+
+     def test_generated_bugs_are_gradeable(self):
+         """Generated bugs should work with the grading system."""
+         for difficulty in ["easy", "medium", "hard"]:
+             for seed in range(5):
+                 bug, answer = generate_bug(difficulty, seed=seed)
+                 action = TriageAction(
+                     priority=answer["priority"],
+                     labels=answer.get("labels", ["bug"]),
+                     assigned_team=answer.get("assigned_team", "backend"),
+                     milestone=answer.get("milestone", "backlog"),
+                 )
+                 score, feedback = grade_action(difficulty, bug, action, answer=answer)
+                 assert 0 < score < 1, (
+                     f"Score {score} for {bug.id} ({difficulty})"
+                 )
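Reviewer note: the priority and label tests in this file imply a distance-based priority score and a clamped Jaccard overlap on normalized labels. The sketch below is one set of scoring functions consistent with those assertions; it is not the code in `server/task.py`. The `sketch_*` names, the `SYNONYM_TO_CANONICAL` layout, and the exact clamping bounds are assumptions inferred from the expected values in the tests.

```python
# Hypothetical constants, inferred from the values asserted in the tests.
PRIORITY_LEVELS = ["P0", "P1", "P2", "P3"]
DISTANCE_TO_SCORE = {0: 0.95, 1: 0.5, 2: 0.2, 3: 0.05}
SYNONYM_TO_CANONICAL = {
    "defect": "bug",
    "vulnerability": "security",
    "slow": "performance",
    "ui": "ux",
}
MIN_SCORE, MAX_SCORE = 0.05, 0.95


def sketch_priority_score(predicted, correct):
    """Score by how many levels apart the two priorities are."""
    if predicted not in PRIORITY_LEVELS or correct not in PRIORITY_LEVELS:
        return MIN_SCORE  # invalid input gets the floor score
    distance = abs(PRIORITY_LEVELS.index(predicted) - PRIORITY_LEVELS.index(correct))
    return DISTANCE_TO_SCORE[distance]


def sketch_normalize_label(label):
    """Lowercase the label, then map known synonyms to their canonical form."""
    lowered = label.strip().lower()
    return SYNONYM_TO_CANONICAL.get(lowered, lowered)


def sketch_label_score(predicted, correct):
    """Jaccard overlap of normalized label sets, clamped to [0.05, 0.95]."""
    if not correct:
        return MAX_SCORE  # nothing expected, so full credit
    predicted_set = {sketch_normalize_label(label) for label in predicted}
    correct_set = {sketch_normalize_label(label) for label in correct}
    jaccard = len(predicted_set & correct_set) / len(predicted_set | correct_set)
    return min(MAX_SCORE, max(MIN_SCORE, jaccard))
```

Clamping both ends keeps every component strictly inside (0, 1), which is presumably why `test_all_scores_in_valid_range` can assert an open interval on the combined score.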