Spaces:
Running
Running
TASKS.md — Bug Triage Environment Tasks
3 graded tasks, easy → medium → hard. Scores in [0.0, 1.0].
Task 1 (Easy): Bug Type Classification
Goal: Given a bug report, classify its type.
Output labels: crash | ui | performance | security | data_loss | compatibility
Grader: Exact match → 1.0, wrong → 0.0
Input Example
Title: "App crashes on iOS 17 when uploading files > 50MB"
Description: "Consistently crashes immediately on upload tap."
Logs: "FATAL: EXC_BAD_ACCESS KERN_INVALID_ADDRESS at 0x0"
Expected Output
{"task_id": "task_1", "bug_type": "crash"}
Success Criteria
| Agent Output | Score |
|---|---|
bug_type: "crash" |
1.0 ✓ |
bug_type: "performance" |
0.0 ✗ |
| Missing field | 0.0 ✗ |
Task 2 (Medium): Priority Assignment
Goal: Given a bug report, assign the correct priority level.
Output labels: low | medium | high | critical
Grader: Distance-based partial credit.
Input Example
Title: "SQL injection possible in search endpoint"
Description: "The /api/v1/search endpoint does not sanitize user input.
Passing ' OR 1=1 -- returns all records."
Metadata: {"severity": "CVSS_9.8", "disclosure_deadline": "2025-03-27"}
Expected Output
{"task_id": "task_2", "priority": "critical"}
Success Criteria (Partial Credit)
| Agent Output | Distance | Score |
|---|---|---|
priority: "critical" |
0 | 1.00 ✓ |
priority: "high" |
1 | 0.67 |
priority: "medium" |
2 | 0.33 |
priority: "low" |
3 | 0.00 ✗ |
Task 3 (Hard): Full Bug Triage
Goal: Complete triage — classify + prioritize + assign developer + suggest action.
Weights: bug_type (30%) + priority (30%) + developer (20%) + action (20%)
Input Example
Title: "Export to CSV corrupts data with special characters"
Description: "Characters like é, ü, 中文 replaced with question marks."
Logs: "UnicodeEncodeError: 'latin-1' codec can't encode character"
Metadata: {"affected_users": 3200, "regression": true}
Expected Output
{
"task_id": "task_3",
"bug_type": "data_loss",
"priority": "high",
"assigned_developer": "David",
"suggested_action": "fix_immediately"
}
Success Criteria (Composite)
| Dimension | Weight | Scoring |
|---|---|---|
| bug_type | 30% | exact match → 1.0, else 0.0 |
| priority | 30% | distance: 0→1.0, 1→0.67, 2→0.33, 3→0.0 |
| developer | 20% | exact→1.0, right specialization→0.5, else→0.0 |
| action | 20% | exact→1.0, adjacent→0.5, else→0.0 |
Example Scoring
Agent output: bug_type=data_loss ✓ priority=high ✓ dev=Alice (wrong, but knows data_loss) →0.5 action=fix_immediately ✓
Score = 0.30×1.0 + 0.30×1.0 + 0.20×0.5 + 0.20×1.0 = 0.30+0.30+0.10+0.20 = 0.90
Bug Dataset (15 reports)
| Bug ID | Type | Priority | Developer | Action |
|---|---|---|---|---|
| BUG-001 | crash | critical | Alice | fix_immediately |
| BUG-002 | ui | low | Carol | schedule_sprint |
| BUG-003 | performance | high | Alice | fix_immediately |
| BUG-004 | security | critical | Bob | fix_immediately |
| BUG-005 | data_loss | high | David | fix_immediately |
| BUG-006 | compatibility | high | Eve | fix_immediately |
| BUG-007 | ui | medium | Carol | schedule_sprint |
| BUG-008 | performance | critical | Alice | fix_immediately |
| BUG-009 | ui | medium | Carol | fix_immediately |
| BUG-010 | data_loss | critical | David | fix_immediately |
| BUG-011 | ui | low | Carol | schedule_sprint |
| BUG-012 | security | critical | Bob | fix_immediately |
| BUG-013 | compatibility | medium | Eve | schedule_sprint |
| BUG-014 | performance | high | Eve | fix_immediately |
| BUG-015 | ui | low | Carol | needs_more_info |