Spaces:

savetrees
/

bug-triage-openenv

Running

App Files Files Community

bug-triage-openenv / docs /TASKS.md

savetrees

Upload folder using huggingface_hub

0135a17 verified 2 days ago

preview code

raw

history blame contribute delete

3.88 kB

TASKS.md — Bug Triage Environment Tasks

3 graded tasks, easy → medium → hard. Scores in [0.0, 1.0].

Task 1 (Easy): Bug Type Classification

Goal: Given a bug report, classify its type.

Grader: Exact match → 1.0, wrong → 0.0

Input Example

Title: "App crashes on iOS 17 when uploading files > 50MB"
Description: "Consistently crashes immediately on upload tap."
Logs: "FATAL: EXC_BAD_ACCESS KERN_INVALID_ADDRESS at 0x0"

Expected Output

{"task_id": "task_1", "bug_type": "crash"}

Success Criteria

Agent Output	Score
`bug_type: "crash"`	1.0 ✓
`bug_type: "performance"`	0.0 ✗
Missing field	0.0 ✗

Task 2 (Medium): Priority Assignment

Goal: Given a bug report, assign the correct priority level.

Output labels: low | medium | high | critical

Grader: Distance-based partial credit.

Input Example

Title: "SQL injection possible in search endpoint"
Description: "The /api/v1/search endpoint does not sanitize user input.
  Passing ' OR 1=1 -- returns all records."
Metadata: {"severity": "CVSS_9.8", "disclosure_deadline": "2025-03-27"}

Expected Output

{"task_id": "task_2", "priority": "critical"}

Success Criteria (Partial Credit)

Agent Output	Distance	Score
`priority: "critical"`	0	1.00 ✓
`priority: "high"`	1	0.67
`priority: "medium"`	2	0.33
`priority: "low"`	3	0.00 ✗

Task 3 (Hard): Full Bug Triage

Goal: Complete triage — classify + prioritize + assign developer + suggest action.

Weights: bug_type (30%) + priority (30%) + developer (20%) + action (20%)

Input Example

Title: "Export to CSV corrupts data with special characters"
Description: "Characters like é, ü, 中文 replaced with question marks."
Logs: "UnicodeEncodeError: 'latin-1' codec can't encode character"
Metadata: {"affected_users": 3200, "regression": true}

Expected Output

{
  "task_id": "task_3",
  "bug_type": "data_loss",
  "priority": "high",
  "assigned_developer": "David",
  "suggested_action": "fix_immediately"
}

Success Criteria (Composite)

Dimension	Weight	Scoring
bug_type	30%	exact match → 1.0, else 0.0
priority	30%	distance: 0→1.0, 1→0.67, 2→0.33, 3→0.0
developer	20%	exact→1.0, right specialization→0.5, else→0.0
action	20%	exact→1.0, adjacent→0.5, else→0.0

Example Scoring

Agent output:  bug_type=data_loss ✓  priority=high ✓  dev=Alice (wrong, but knows data_loss) →0.5  action=fix_immediately ✓
Score = 0.30×1.0 + 0.30×1.0 + 0.20×0.5 + 0.20×1.0 = 0.30+0.30+0.10+0.20 = 0.90

Bug Dataset (15 reports)

Bug ID	Type	Priority	Developer	Action
BUG-001	crash	critical	Alice	fix_immediately
BUG-002	ui	low	Carol	schedule_sprint
BUG-003	performance	high	Alice	fix_immediately
BUG-004	security	critical	Bob	fix_immediately
BUG-005	data_loss	high	David	fix_immediately
BUG-006	compatibility	high	Eve	fix_immediately
BUG-007	ui	medium	Carol	schedule_sprint
BUG-008	performance	critical	Alice	fix_immediately
BUG-009	ui	medium	Carol	fix_immediately
BUG-010	data_loss	critical	David	fix_immediately
BUG-011	ui	low	Carol	schedule_sprint
BUG-012	security	critical	Bob	fix_immediately
BUG-013	compatibility	medium	Eve	schedule_sprint
BUG-014	performance	high	Eve	fix_immediately
BUG-015	ui	low	Carol	needs_more_info