# REWARD_DESIGN.md — Bug Triage Reward & Grader Design

## Two Systems: Reward vs Grader
| | Step Reward | Grader Score |
|---|---|---|
| Purpose | GRPO training signal | Hackathon evaluation |
| When | Every `step()` | Only at `done=True` |
| Range | [-0.5, 1.0] (shaped) | [0.0, 1.0] (strict) |
| Endpoint | Via observation reward | `/grader` endpoint |
Mapping: `reward = (grader_score × 1.5) − 0.5`
- Perfect score (1.0) → reward = +1.0
- Zero score (0.0) → reward = -0.5
- This shaping gives gradient signal even for bad predictions.
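The mapping above is a single affine transform; a minimal sketch (the function name `shaped_reward` is illustrative, not the actual training-loop code):

```python
def shaped_reward(grader_score: float) -> float:
    """Map a strict grader score in [0.0, 1.0] to a shaped reward in [-0.5, 1.0].

    Affine transform: score 1.0 -> +1.0, score 0.0 -> -0.5, so even a
    zero-scoring prediction produces a nonzero gradient signal.
    """
    return grader_score * 1.5 - 0.5
```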
## Task 1 Reward: Bug Type Classification
| Agent Output | Score |
|---|---|
| Correct type | +1.0 |
| Wrong type | 0.0 |
Simple exact match. 6 possible types, so random baseline ≈ 0.167.
## Task 2 Reward: Priority Assignment
| Distance from Correct | Score |
|---|---|
| Exact match | 1.00 |
| Off by 1 level | 0.67 |
| Off by 2 levels | 0.33 |
| Off by 3 levels | 0.00 |
Why partial credit? Priority has ordinal structure — predicting "high" when answer is "critical" is better than predicting "low".
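The table above is a linear falloff over ordinal distance. A sketch, assuming a four-level scale (the level names in `PRIORITY_LEVELS` are an assumption; only the distance-to-score mapping comes from the table):

```python
# Assumed ordinal scale, lowest to highest; only the 4-level structure is given.
PRIORITY_LEVELS = ["low", "medium", "high", "critical"]

def priority_score(predicted: str, truth: str) -> float:
    """Partial credit falls off linearly with ordinal distance:
    0 -> 1.0, 1 -> 0.67, 2 -> 0.33, 3 -> 0.0."""
    distance = abs(PRIORITY_LEVELS.index(predicted) - PRIORITY_LEVELS.index(truth))
    return round(1.0 - distance / 3.0, 2)
```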
## Task 3 Reward: Full Triage (Composite)
| Component | Weight | Scoring |
|---|---|---|
| Bug type correct | 0.30 | 1.0 exact, 0.0 else |
| Priority correct | 0.30 | Distance: 0→1.0, 1→0.67, 2→0.33, 3→0.0 |
| Developer correct | 0.20 | Exact→1.0, right specialty→0.5, else→0.0 |
| Action correct | 0.20 | Exact→1.0, adjacent→0.5, else→0.0 |
### Partial Credit Details

**Developer assignment:**
- Exact match to ground truth → 1.0
- Wrong person but has the right specialization for the bug type → 0.5
- No match → 0.0
**Action suggestion (adjacency map):**
- fix_immediately ↔ schedule_sprint (0.5)
- schedule_sprint ↔ needs_more_info (0.5)
- wontfix ↔ duplicate (0.5)
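Putting the weights, the ordinal priority falloff, and the two partial-credit rules together, the composite grader can be sketched as follows. The weights and adjacency pairs come from the tables above; the priority level names, the `DEV_SPECIALTY` roster, and the dict keys on `pred`/`truth` are assumptions for illustration:

```python
# Symmetric adjacency pairs from the action adjacency map above.
ADJACENT_ACTIONS = {
    ("fix_immediately", "schedule_sprint"),
    ("schedule_sprint", "needs_more_info"),
    ("wontfix", "duplicate"),
}
PRIORITY_LEVELS = ["low", "medium", "high", "critical"]  # assumed level names
DEV_SPECIALTY = {"Alice": "crash", "Bob": "crash"}       # hypothetical roster

def grade_task3(pred: dict, truth: dict) -> float:
    # Bug type: exact match only (weight 0.30).
    type_s = 1.0 if pred["type"] == truth["type"] else 0.0
    # Priority: linear falloff by ordinal distance (weight 0.30).
    dist = abs(PRIORITY_LEVELS.index(pred["priority"])
               - PRIORITY_LEVELS.index(truth["priority"]))
    prio_s = 1.0 - dist / 3.0
    # Developer: exact -> 1.0, right specialty for the bug type -> 0.5 (weight 0.20).
    if pred["dev"] == truth["dev"]:
        dev_s = 1.0
    elif DEV_SPECIALTY.get(pred["dev"]) == truth["type"]:
        dev_s = 0.5
    else:
        dev_s = 0.0
    # Action: exact -> 1.0, adjacent in either direction -> 0.5 (weight 0.20).
    pair = (pred["action"], truth["action"])
    if pred["action"] == truth["action"]:
        act_s = 1.0
    elif pair in ADJACENT_ACTIONS or pair[::-1] in ADJACENT_ACTIONS:
        act_s = 0.5
    else:
        act_s = 0.0
    return 0.30 * type_s + 0.30 * prio_s + 0.20 * dev_s + 0.20 * act_s
```

The adjacency check is symmetric (both orderings are tested), matching the bidirectional pairs in the map above.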
## Example Scoring
Ground truth: `type=crash, priority=critical, dev=Alice, action=fix_immediately`
Agent output: `type=crash, priority=high, dev=Bob, action=fix_immediately`

- type: crash == crash → 1.0 × 0.30 = 0.30
- priority: high vs critical (off by 1) → 0.67 × 0.30 = 0.20
- dev: Bob vs Alice, but Bob has the crash specialty → 0.5 × 0.20 = 0.10
- action: fix_immediately (exact) → 1.0 × 0.20 = 0.20

Total: 0.80
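The arithmetic above, as a quick check (variable names are illustrative):

```python
# Reproducing the worked example: each weighted component, then the total.
components = {
    "type":     1.00 * 0.30,  # crash == crash, exact match
    "priority": 0.67 * 0.30,  # high vs critical, off by one level
    "dev":      0.50 * 0.20,  # wrong person, right specialty
    "action":   1.00 * 0.20,  # fix_immediately, exact match
}
total = round(sum(components.values()), 2)  # -> 0.8
```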
## Why This Reward Design Is Meaningful
- Non-trivial — random agent scores ~0.20 on Task 3; a good model should score 0.7+
- Decomposed — each dimension provides independent learning signal
- Partial credit — prevents plateau; agent improves incrementally
- Varies per bug — different bugs have different answers, so grader scores vary (no fixed output)
- Production-relevant — mirrors how humans evaluate triage quality
## Grader Code Locations

| Task | Grader File | Method |
|---|---|---|
| task_1 | `graders/task1_grader.py` | `grade(episode_log, ground_truth) → float` |
| task_2 | `graders/task2_grader.py` | `grade(episode_log, ground_truth) → float` |
| task_3 | `graders/task3_grader.py` | `grade(episode_log, ground_truth) → float` |