# Reward Function & Training Loop — Agentic RAG Gym
## Reward Function Design
### Philosophy
The reward function follows three core principles:
1. **Process Supervision**: Reward intermediate steps, not just final outcomes
2. **Anti-Hacking**: Detect and penalize degenerate strategies
3. **Composite Signals**: Score along several dimensions so that no single metric can be gamed in isolation
### Composite Reward Architecture
The `CompositeRewardFunction` combines five weighted signals:
| Signal | Weight | Description |
|---|---|---|
| Retrieval Relevance | 0.25 | Average relevance score of retrieved documents |
| Reasoning Quality | 0.20 | Depth and logic of reasoning traces |
| Answer Completeness | 0.30 | Coverage of task requirements in final answer |
| Efficiency | 0.15 | Penalty for excessive steps |
| Anti-Hack Penalty | 0.10 | Deductions for detected hacking patterns |
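
As a rough illustration of how the five weights might combine, here is a minimal sketch. It assumes every signal is already normalised to `[0, 1]` and that the anti-hack slot contributes `1 - penalty`; the source does not say how the penalty enters the sum. The function name `composite_reward` and the dictionary keys are hypothetical.

```python
# Weights taken from the table above; the key names are illustrative.
WEIGHTS = {
    "retrieval_relevance": 0.25,
    "reasoning_quality": 0.20,
    "answer_completeness": 0.30,
    "efficiency": 0.15,
    "anti_hack": 0.10,
}


def composite_reward(signals: dict[str, float], hack_penalty: float = 0.0) -> float:
    """Weighted sum of signals, each expected in [0, 1].

    Assumption: the anti-hack slot contributes (1 - hack_penalty), so a clean
    episode earns the full 0.10 and a penalised one earns less.
    """
    scored = dict(signals, anti_hack=max(0.0, 1.0 - hack_penalty))
    total = sum(WEIGHTS[name] * scored.get(name, 0.0) for name in WEIGHTS)
    # Clamp to the [0.01, 0.99] range described under Score Clamping.
    return min(0.99, max(0.01, total))
```

With every signal at 0.5 and no penalty, the sketch yields 0.5 × 0.90 + 0.10 = 0.55.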
### Step-Level Rewards (Process Supervision)
Each action type receives a specialized evaluation:
#### Retrieval Steps
- Base reward proportional to average document relevance scores
- Bonus for retrieving diverse sources
- No reward if no documents found
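
The retrieval rules above could be sketched as follows. This assumes each retrieved document is a dict with a `score` in `[0, 1]` and a `source` identifier; the 0.1 bonus and the "more than one distinct source" rule are placeholders, not values from the implementation.

```python
def retrieval_step_reward(docs: list[dict], diversity_bonus: float = 0.1) -> float:
    """Score a retrieval step from a list of {"score": float, "source": str} dicts."""
    if not docs:
        return 0.0  # no documents found, so no reward
    base = sum(d["score"] for d in docs) / len(docs)
    # Hypothetical rule: grant the bonus when more than one distinct source appears.
    distinct_sources = {d["source"] for d in docs}
    bonus = diversity_bonus if len(distinct_sources) > 1 else 0.0
    return min(1.0, base + bonus)
```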
#### Reasoning Steps
- Trace length (traces longer than 50 characters earn the baseline credit)
- Evidence markers: "because", "therefore", "based on"
- Critical thinking markers: "however", "alternatively", "caveat"
- Scaled by step position (earlier steps receive a higher efficiency multiplier)
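
A minimal sketch of these heuristics, under stated assumptions: the 0.4 / 0.3 / 0.3 split between the three checks and the linear decay from 1.0 to 0.5 across the episode are illustrative choices, and `reasoning_step_reward` is a hypothetical name.

```python
EVIDENCE_MARKERS = ("because", "therefore", "based on")
CRITICAL_MARKERS = ("however", "alternatively", "caveat")


def reasoning_step_reward(trace: str, step_index: int, max_steps: int) -> float:
    """Heuristic score for one reasoning trace (all weights are assumptions)."""
    text = trace.lower()
    score = 0.0
    if len(trace) > 50:  # baseline length check
        score += 0.4
    if any(marker in text for marker in EVIDENCE_MARKERS):
        score += 0.3
    if any(marker in text for marker in CRITICAL_MARKERS):
        score += 0.3
    # Assumed multiplier: 1.0 at the first step, decaying linearly to 0.5 at the last.
    position_multiplier = 1.0 - 0.5 * step_index / max(1, max_steps - 1)
    return score * position_multiplier
```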
#### Answer Steps
- Length heuristic (word count thresholds)
- Source grounding: overlap between answer terms and retrieved document terms
- Rubric alignment: presence of task-specific criteria keywords
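
Source grounding and rubric alignment both reduce to overlap ratios. The sketch below shows one plausible way to compute them; the tokenisation (lowercased alphanumeric runs) and the substring matching for rubric keywords are assumptions, and both function names are hypothetical.

```python
import re


def _terms(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def source_grounding(answer: str, documents: list[str]) -> float:
    """Fraction of answer terms that also appear in the retrieved documents."""
    answer_terms = _terms(answer)
    if not answer_terms:
        return 0.0
    doc_terms: set[str] = set()
    for doc in documents:
        doc_terms |= _terms(doc)
    return len(answer_terms & doc_terms) / len(answer_terms)


def rubric_alignment(answer: str, criteria_keywords: list[str]) -> float:
    """Fraction of rubric keywords found in the answer (case-insensitive substring match)."""
    if not criteria_keywords:
        return 0.0
    text = answer.lower()
    return sum(k.lower() in text for k in criteria_keywords) / len(criteria_keywords)
```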
### Anti-Reward-Hacking Measures
Four detection mechanisms:
1. **Repetition Detection**: If the last 3 queries are identical → 0.3 penalty
2. **Monotonic Action Exploit**: If all steps use the same action type → 0.3 penalty
3. **Copy-Paste Detection**: If the answer equals a previous query → 0.4 penalty
4. **Degenerate Output**: If the unique-word ratio is below 0.3 → 0.3 penalty
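
The four checks can be sketched as a single function. The thresholds and penalty sizes come from the list above; summing the penalties, skipping the monotonic check for single-step episodes, and splitting words on whitespace are assumptions. `hacking_penalty` is a hypothetical name.

```python
def hacking_penalty(queries: list[str], action_types: list[str], answer: str) -> float:
    """Sum the penalties for the four patterns (additive combination is assumed)."""
    penalty = 0.0
    # 1. Repetition: the last three queries are identical.
    if len(queries) >= 3 and len(set(queries[-3:])) == 1:
        penalty += 0.3
    # 2. Monotonic actions: every step used the same action type.
    #    Assumption: only checked once the episode has more than one step.
    if len(action_types) > 1 and len(set(action_types)) == 1:
        penalty += 0.3
    # 3. Copy-paste: the answer is identical to an earlier query.
    if answer and answer in queries:
        penalty += 0.4
    # 4. Degenerate output: fewer than 30% of the answer's words are unique.
    words = answer.lower().split()
    if words and len(set(words)) / len(words) < 0.3:
        penalty += 0.3
    return penalty
```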
### Score Clamping
All scores are strictly clamped to `[0.01, 0.99]` to ensure:
- No task returns exactly 0.0 (which would indicate a broken grader)
- No task returns exactly 1.0 (which would indicate a trivial task)
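
Clamping itself is a one-liner. The sketch below adds one assumption the source does not mention: a NaN score is mapped to the lower bound rather than propagated.

```python
import math


def clamp_score(score: float, low: float = 0.01, high: float = 0.99) -> float:
    """Clamp a raw score into [low, high]."""
    if math.isnan(score):
        return low  # assumption: treat an undefined score as the minimum
    return max(low, min(high, score))
```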
### LLM Judge Mode
An alternative `LLMJudgeRewardFunction` uses the LLM itself as an evaluator:
- Per-step: Prompts the LLM to rate step quality 0.0-1.0
- Per-episode: Prompts the LLM to rate overall performance
This mode is intended for domains where rule-based evaluation is insufficient.
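
A per-step judge call might look like the sketch below. It is not the actual `LLMJudgeRewardFunction`: the `llm` argument is a hypothetical callable that takes a prompt string and returns the model's text reply, and the prompt wording, the regex parsing, and the fallback for unparseable replies are all assumptions.

```python
import re
from typing import Callable


def llm_judge_score(llm: Callable[[str], str], step_description: str) -> float:
    """Ask an LLM for a 0.0-1.0 rating and parse the first number in its reply."""
    prompt = (
        "Rate the quality of the following agent step on a scale from 0.0 to 1.0. "
        "Reply with the number only.\n\n" + step_description
    )
    reply = llm(prompt)
    match = re.search(r"\d*\.?\d+", reply)
    if match is None:
        return 0.01  # assumption: unparseable replies get the minimum score
    # Apply the same [0.01, 0.99] clamp used for rule-based scores.
    return max(0.01, min(0.99, float(match.group())))
```

Passing the model as a plain callable keeps the sketch independent of any particular client library and makes it easy to stub in tests.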
## Training Loop
### Episode Structure
```
reset(task_id)
│
├── for step in range(max_steps):
│   ├── Agent observes state
│   ├── Agent selects action (retrieve/reason/answer/plan/critique/verify)
│   ├── Environment processes action
│   ├── Step reward computed (process supervision)
│   ├── State updated
│   └── if done: break
│
├── Episode reward computed
├── Grader evaluates final performance
└── Trajectory saved for training
```
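
The same structure can be written as a runnable loop. Everything here is a stand-in: `ToyEnv`, its placeholder rewards, the `policy` callable, and the mean-of-steps episode reward are assumptions made for illustration, and the grader call is omitted.

```python
from typing import Callable


class ToyEnv:
    """Stand-in environment (hypothetical): the episode ends when the agent answers."""

    def reset(self, task_id: str) -> dict:
        self.history: list[str] = []
        return {"task_id": task_id, "history": self.history}

    def step(self, action: str) -> tuple[dict, float, bool]:
        self.history.append(action)
        reward = 0.9 if action == "answer" else 0.5  # placeholder step rewards
        done = action == "answer"
        return {"history": self.history}, reward, done


def run_episode(env, policy: Callable[[dict], str], task_id: str, max_steps: int) -> dict:
    state = env.reset(task_id)
    trajectory = []
    for step in range(max_steps):
        action = policy(state)                  # agent observes state, selects action
        state, reward, done = env.step(action)  # environment processes and scores it
        trajectory.append({"step": step, "action": action, "reward": reward})
        if done:
            break
    # Assumption: the episode reward is the mean of the step rewards.
    episode_reward = sum(t["reward"] for t in trajectory) / max(1, len(trajectory))
    return {"task_id": task_id, "steps": trajectory, "episode_reward": episode_reward}
```

A policy such as `lambda s: "answer" if len(s["history"]) >= 2 else "retrieve"` retrieves twice and then answers, ending the episode at the third step.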
### Self-Improvement Loop
The environment supports self-improvement through:
1. **Adversarial Critique**: The Critic agent identifies weaknesses in reasoning
2. **Iterative Refinement**: The Retriever can be redirected based on the critique
3. **Verification Gate**: The Verifier checks answer grounding before submission
4. **Curriculum Difficulty**: Tasks range from easy to hard, with the hardest intended to challenge frontier models
### Trajectory Analysis
Each trajectory records:
- Per-step actions and reasoning traces
- Per-step intermediate rewards
- Final score and episode metadata
- Agent communication history
This data enables:
- Offline RL training on collected trajectories
- Process reward model training
- Failure mode analysis
- Self-improvement curriculum generation
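
One way to hold the recorded fields is a pair of dataclasses that serialise to JSON for offline use. The class and field names below are hypothetical; the source lists what is recorded but not the schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class StepRecord:
    action: str      # e.g. "retrieve", "reason", "answer"
    reasoning: str   # reasoning trace for this step
    reward: float    # intermediate (process) reward


@dataclass
class Trajectory:
    task_id: str
    steps: list[StepRecord] = field(default_factory=list)
    messages: list[str] = field(default_factory=list)  # agent communication history
    final_score: float = 0.0
    metadata: dict = field(default_factory=dict)

    def to_json(self) -> str:
        """Serialise the whole trajectory, including nested steps."""
        return json.dumps(asdict(self))
```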
## Grading System
### Deterministic Graders
Each task has a `KeywordCoverageGrader` that:
1. Checks keyword presence across rubric categories
2. Weights coverage by category importance
3. Adds a process quality bonus (20% weight)
4. Clamps the final score to `[0.01, 0.99]`
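
The four grading steps can be sketched as one function. This is not the actual `KeywordCoverageGrader`: the rubric shape (category to keyword list), the substring matching, and the 80/20 split between coverage and process quality are assumptions inferred from the 20% figure.

```python
def keyword_coverage_grade(
    answer: str,
    rubric: dict[str, list[str]],
    category_weights: dict[str, float],
    process_score: float,
) -> float:
    """Weighted keyword coverage plus a process bonus, clamped to [0.01, 0.99]."""
    text = answer.lower()
    total_weight = sum(category_weights.get(c, 1.0) for c in rubric) or 1.0
    coverage = 0.0
    for category, keywords in rubric.items():
        # Share of this category's keywords that appear in the answer.
        hit_ratio = sum(k.lower() in text for k in keywords) / max(1, len(keywords))
        coverage += category_weights.get(category, 1.0) * hit_ratio
    coverage /= total_weight
    # Assumption: 80% keyword coverage, 20% process quality.
    score = 0.8 * coverage + 0.2 * process_score
    return max(0.01, min(0.99, score))
```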
### Process Quality Evaluation
The process component (20% of final grade) rewards:
- Diverse action types (3+ types → bonus)
- Proper sequencing (retrieve before answer)
- Reasoning steps present
- Not using excessive steps
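
A minimal sketch of the process component treats the four criteria as an equal-weight checklist. The equal weighting and the `step_budget` parameter are assumptions; the source gives neither the relative weights nor the threshold for "excessive".

```python
def process_quality(action_types: list[str], step_budget: int) -> float:
    """Fraction of the four process criteria that an episode satisfies."""
    first_retrieve = action_types.index("retrieve") if "retrieve" in action_types else None
    first_answer = action_types.index("answer") if "answer" in action_types else None
    checks = [
        len(set(action_types)) >= 3,                      # diverse action types
        first_retrieve is not None
        and first_answer is not None
        and first_retrieve < first_answer,                # retrieve before answer
        "reason" in action_types,                         # reasoning steps present
        len(action_types) <= step_budget,                 # not using excessive steps
    ]
    return sum(checks) / len(checks)
```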