# REWARD_DESIGN.md — Bug Triage Reward & Grader Design

---

## Two Systems: Reward vs Grader

| | Step Reward | Grader Score |
|-|------------|-------------|
| Purpose | GRPO training signal | Hackathon evaluation |
| When | Every `step()` | Only at `done=True` |
| Range | [-0.5, 1.0] (shaped) | [0.0, 1.0] (strict) |
| Endpoint | Via observation `reward` | `/grader` endpoint |

**Mapping:** `reward = (grader_score × 1.5) − 0.5`
- Perfect score (1.0) → reward = +1.0
- Zero score (0.0) → reward = -0.5
- This shaping gives gradient signal even for bad predictions.

---

## Task 1 Reward: Bug Type Classification

| Agent Output | Score |
|-------------|-------|
| Correct type | **+1.0** |
| Wrong type | **0.0** |

Simple exact match. 6 possible types, so random baseline ≈ 0.167.

---

## Task 2 Reward: Priority Assignment

| Distance from Correct | Score |
|-----------------------|-------|
| Exact match | **1.00** |
| Off by 1 level | **0.67** |
| Off by 2 levels | **0.33** |
| Off by 3 levels | **0.00** |

**Why partial credit?** Priority has ordinal structure — predicting "high" when answer is "critical" is better than predicting "low".

---

## Task 3 Reward: Full Triage (Composite)

| Component | Weight | Scoring |
|-----------|--------|---------|
| Bug type correct | **0.30** | 1.0 exact, 0.0 else |
| Priority correct | **0.30** | Distance: 0→1.0, 1→0.67, 2→0.33, 3→0.0 |
| Developer correct | **0.20** | Exact→1.0, right specialty→0.5, else→0.0 |
| Action correct | **0.20** | Exact→1.0, adjacent→0.5, else→0.0 |

### Partial Credit Details

**Developer assignment:**
- Exact match to ground truth → 1.0
- Wrong person but has the right specialization for the bug type → 0.5
- No match → 0.0

**Action suggestion (adjacency map):**
```
fix_immediately  schedule_sprint     (0.5)
schedule_sprint  needs_more_info     (0.5)
wontfix  duplicate                   (0.5)
```

### Example Scoring
```
Ground truth: type=crash, priority=critical, dev=Alice, action=fix_immediately
Agent output: type=crash, priority=high,     dev=Bob,   action=fix_immediately

type:     crash == crash       → 1.0 × 0.30 = 0.30
priority: high vs critical     → 0.67 × 0.30 = 0.20
dev:      Bob vs Alice         → Bob knows crash → 0.5 × 0.20 = 0.10
action:   fix_immediately      → 1.0 × 0.20 = 0.20

Total: 0.80
```

---

## Why This Reward Design Is Meaningful

1. **Non-trivial** — random agent scores ~0.20 on Task 3; a good model should score 0.7+
2. **Decomposed** — each dimension provides independent learning signal
3. **Partial credit** — prevents plateau; agent improves incrementally
4. **Varies per bug** — different bugs have different answers, so grader scores vary (no fixed output)
5. **Production-relevant** — mirrors how humans evaluate triage quality

---

## Grader Code Locations

| Task | Grader File | Method |
|------|-------------|--------|
| task_1 | `graders/task1_grader.py` | `grade(episode_log, ground_truth) → float` |
| task_2 | `graders/task2_grader.py` | `grade(episode_log, ground_truth) → float` |
| task_3 | `graders/task3_grader.py` | `grade(episode_log, ground_truth) → float` |