# REWARD_DESIGN.md — Bug Triage Reward & Grader Design

## Two Systems: Reward vs Grader
| | Step Reward | Grader Score |
|---|---|---|
| Purpose | GRPO training signal | Hackathon evaluation |
| When | Every `step()` | Only at `done=True` |
| Range | [-0.5, 1.0] (shaped) | [0.0, 1.0] (strict) |
| Endpoint | Via observation reward | `/grader` endpoint |
Mapping: `reward = (grader_score × 1.5) − 0.5`
- Perfect score (1.0) → reward = +1.0
- Zero score (0.0) → reward = -0.5
- This shaping gives gradient signal even for bad predictions.
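The mapping above is a single affine transform; a minimal sketch (the function name `shaped_reward` is illustrative, not the actual training-loop code):

```python
def shaped_reward(grader_score: float) -> float:
    """Map a strict grader score in [0.0, 1.0] to a shaped reward in [-0.5, 1.0].

    Affine transform: score 1.0 -> +1.0, score 0.0 -> -0.5, so even a
    zero-scoring prediction produces a nonzero gradient signal.
    """
    return grader_score * 1.5 - 0.5
```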
## Task 1 Reward: Bug Type Classification
| Agent Output | Score |
|---|---|
| Correct type | +1.0 |
| Wrong type | 0.0 |
Simple exact match. 6 possible types, so random baseline ≈ 0.167.
## Task 2 Reward: Priority Assignment
| Distance from Correct | Score |
|---|---|
| Exact match | 1.00 |
| Off by 1 level | 0.67 |
| Off by 2 levels | 0.33 |
| Off by 3 levels | 0.00 |
Why partial credit? Priority has ordinal structure — predicting "high" when answer is "critical" is better than predicting "low".
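The table above is a linear falloff over ordinal distance. A sketch, assuming a four-level scale (the level names in `PRIORITY_LEVELS` are an assumption; only the distance-to-score mapping comes from the table):

```python
# Assumed ordinal scale, lowest to highest; only the 4-level structure is given.
PRIORITY_LEVELS = ["low", "medium", "high", "critical"]

def priority_score(predicted: str, truth: str) -> float:
    """Partial credit falls off linearly with ordinal distance:
    0 -> 1.0, 1 -> 0.67, 2 -> 0.33, 3 -> 0.0."""
    distance = abs(PRIORITY_LEVELS.index(predicted) - PRIORITY_LEVELS.index(truth))
    return round(1.0 - distance / 3.0, 2)
```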
## Task 3 Reward: Full Triage (Composite)
| Component | Weight | Scoring |
|---|---|---|
| Bug type correct | 0.30 | 1.0 exact, 0.0 else |
| Priority correct | 0.30 | Distance: 0→1.0, 1→0.67, 2→0.33, 3→0.0 |
| Developer correct | 0.20 | Exact→1.0, right specialty→0.5, else→0.0 |
| Action correct | 0.20 | Exact→1.0, adjacent→0.5, else→0.0 |
### Partial Credit Details

**Developer assignment:**
- Exact match to ground truth → 1.0
- Wrong person but has the right specialization for the bug type → 0.5
- No match → 0.0
**Action suggestion (adjacency map):**
- fix_immediately ↔ schedule_sprint (0.5)
- schedule_sprint ↔ needs_more_info (0.5)
- wontfix ↔ duplicate (0.5)
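Putting the weights, the ordinal priority falloff, and the two partial-credit rules together, the composite grader can be sketched as follows. The weights and adjacency pairs come from the tables above; the priority level names, the `DEV_SPECIALTY` roster, and the dict keys on `pred`/`truth` are assumptions for illustration:

```python
# Symmetric adjacency pairs from the action adjacency map above.
ADJACENT_ACTIONS = {
    ("fix_immediately", "schedule_sprint"),
    ("schedule_sprint", "needs_more_info"),
    ("wontfix", "duplicate"),
}
PRIORITY_LEVELS = ["low", "medium", "high", "critical"]  # assumed level names
DEV_SPECIALTY = {"Alice": "crash", "Bob": "crash"}       # hypothetical roster

def grade_task3(pred: dict, truth: dict) -> float:
    # Bug type: exact match only (weight 0.30).
    type_s = 1.0 if pred["type"] == truth["type"] else 0.0
    # Priority: linear falloff by ordinal distance (weight 0.30).
    dist = abs(PRIORITY_LEVELS.index(pred["priority"])
               - PRIORITY_LEVELS.index(truth["priority"]))
    prio_s = 1.0 - dist / 3.0
    # Developer: exact -> 1.0, right specialty for the bug type -> 0.5 (weight 0.20).
    if pred["dev"] == truth["dev"]:
        dev_s = 1.0
    elif DEV_SPECIALTY.get(pred["dev"]) == truth["type"]:
        dev_s = 0.5
    else:
        dev_s = 0.0
    # Action: exact -> 1.0, adjacent in either direction -> 0.5 (weight 0.20).
    pair = (pred["action"], truth["action"])
    if pred["action"] == truth["action"]:
        act_s = 1.0
    elif pair in ADJACENT_ACTIONS or pair[::-1] in ADJACENT_ACTIONS:
        act_s = 0.5
    else:
        act_s = 0.0
    return 0.30 * type_s + 0.30 * prio_s + 0.20 * dev_s + 0.20 * act_s
```

The adjacency check is symmetric (both orderings are tested), matching the bidirectional pairs in the map above.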
## Example Scoring
Ground truth: `type=crash, priority=critical, dev=Alice, action=fix_immediately`
Agent output: `type=crash, priority=high, dev=Bob, action=fix_immediately`

- type: crash == crash → 1.0 × 0.30 = 0.30
- priority: high vs critical (off by 1) → 0.67 × 0.30 = 0.20
- dev: Bob vs Alice, but Bob has the crash specialty → 0.5 × 0.20 = 0.10
- action: fix_immediately (exact) → 1.0 × 0.20 = 0.20

Total: 0.80
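The arithmetic above, as a quick check (variable names are illustrative):

```python
# Reproducing the worked example: each weighted component, then the total.
components = {
    "type":     1.00 * 0.30,  # crash == crash, exact match
    "priority": 0.67 * 0.30,  # high vs critical, off by one level
    "dev":      0.50 * 0.20,  # wrong person, right specialty
    "action":   1.00 * 0.20,  # fix_immediately, exact match
}
total = round(sum(components.values()), 2)  # -> 0.8
```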
## Why This Reward Design Is Meaningful
- Non-trivial — random agent scores ~0.20 on Task 3; a good model should score 0.7+
- Decomposed — each dimension provides independent learning signal
- Partial credit — prevents plateau; agent improves incrementally
- Varies per bug — different bugs have different answers, so grader scores vary (no fixed output)
- Production-relevant — mirrors how humans evaluate triage quality
## Grader Code Locations

| Task | Grader File | Method |
|---|---|---|
| task_1 | `graders/task1_grader.py` | `grade(episode_log, ground_truth) → float` |
| task_2 | `graders/task2_grader.py` | `grade(episode_log, ground_truth) → float` |
| task_3 | `graders/task3_grader.py` | `grade(episode_log, ground_truth) → float` |