Reward Function & Training Loop — Agentic RAG Gym
Reward Function Design
Philosophy
The reward function follows three core principles:
- Process Supervision — Reward intermediate steps, not just final outcomes
- Anti-Hacking — Detect and penalize degenerate strategies
- Composite Signals — Multiple evaluation dimensions prevent reward hacking
Composite Reward Architecture
The CompositeRewardFunction combines five weighted signals:
| Signal | Weight | Description |
|---|---|---|
| Retrieval Relevance | 0.25 | Average relevance score of retrieved documents |
| Reasoning Quality | 0.20 | Depth and logic of reasoning traces |
| Answer Completeness | 0.30 | Coverage of task requirements in final answer |
| Efficiency | 0.15 | Penalty for excessive steps |
| Anti-Hack Penalty | 0.10 | Deductions for detected hacking patterns |
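One way to read the table is as a plain weighted sum. The sketch below is illustrative only: the signal key names are hypothetical, and it assumes each signal is already normalized to [0, 1], with the anti-hack signal expressed as 1.0 minus the total detected penalty so that a clean episode earns the full 0.10.

```python
# Weights copied from the table above; the dictionary keys are hypothetical names.
WEIGHTS = {
    "retrieval_relevance": 0.25,
    "reasoning_quality": 0.20,
    "answer_completeness": 0.30,
    "efficiency": 0.15,
    "anti_hack": 0.10,  # assumed to be 1.0 minus the total detected penalty
}


def composite_reward(signals: dict) -> float:
    """Weighted sum of per-signal scores, each assumed to lie in [0, 1]."""
    return sum(weight * signals[name] for name, weight in WEIGHTS.items())
```

Because the weights sum to 1.0, the result stays in [0, 1] whenever every input signal does.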
Step-Level Rewards (Process Supervision)
Each action type receives a specialized evaluation:
Retrieval Steps
- Base reward proportional to average document relevance scores
- Bonus for retrieving diverse sources
- No reward if no documents found
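A minimal sketch of the retrieval rules above, under stated assumptions: the document schema (`relevance`, `source`), the size of the diversity bonus, and the cap at 1.0 are all hypothetical choices, not values taken from the project.

```python
def retrieval_step_reward(docs: list, diversity_bonus: float = 0.1) -> float:
    """Score one retrieval step.

    docs: list of {"relevance": float, "source": str} (hypothetical schema).
    """
    if not docs:
        return 0.0  # no documents found, so no reward
    base = sum(d["relevance"] for d in docs) / len(docs)
    distinct_sources = {d["source"] for d in docs}
    # Assumed rule: the bonus applies when more than one source is represented.
    bonus = diversity_bonus if len(distinct_sources) > 1 else 0.0
    return min(1.0, base + bonus)
```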
Reasoning Steps
- Trace length (baseline threshold: more than 50 characters)
- Evidence markers: "because", "therefore", "based on"
- Critical thinking markers: "however", "alternatively", "caveat"
- Scaled by step position (earlier steps receive a higher efficiency multiplier)
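The reasoning criteria can be sketched as a marker-based heuristic. The marker lists and the 50-character threshold come from the list above; the per-criterion point values and the linear position multiplier are assumptions chosen only for illustration.

```python
EVIDENCE_MARKERS = ("because", "therefore", "based on")
CRITICAL_MARKERS = ("however", "alternatively", "caveat")


def reasoning_step_reward(trace: str, step_index: int, max_steps: int) -> float:
    """Heuristic score for one reasoning trace (point values are assumed)."""
    text = trace.lower()
    score = 0.0
    if len(trace) > 50:  # length baseline
        score += 0.4
    if any(marker in text for marker in EVIDENCE_MARKERS):
        score += 0.3
    if any(marker in text for marker in CRITICAL_MARKERS):
        score += 0.3
    # Assumed multiplier: 1.0 at step 0, decreasing linearly toward 0.5.
    position_multiplier = 1.0 - 0.5 * step_index / max(1, max_steps)
    return score * position_multiplier
```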
Answer Steps
- Length heuristic (word count thresholds)
- Source grounding: overlap between answer terms and retrieved document terms
- Rubric alignment: presence of task-specific criteria keywords
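Source grounding and rubric alignment can both be approximated with simple term matching. The sketch below assumes lowercase alphanumeric tokenization and substring matching for rubric keywords; the function names are hypothetical.

```python
import re


def _terms(text: str) -> set:
    """Lowercase alphanumeric tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def source_grounding(answer: str, documents: list) -> float:
    """Fraction of answer terms that also appear in the retrieved documents."""
    answer_terms = _terms(answer)
    if not answer_terms:
        return 0.0
    doc_terms = set()
    for doc in documents:
        doc_terms |= _terms(doc)
    return len(answer_terms & doc_terms) / len(answer_terms)


def rubric_alignment(answer: str, rubric_keywords: list) -> float:
    """Fraction of rubric keywords present in the answer (substring match)."""
    if not rubric_keywords:
        return 0.0
    text = answer.lower()
    hits = sum(1 for kw in rubric_keywords if kw.lower() in text)
    return hits / len(rubric_keywords)
```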
Anti-Reward-Hacking Measures
Four detection mechanisms:
- Repetition Detection — If last 3 queries are identical → 0.3 penalty
- Monotonic Action Exploit — If all steps use same action type → 0.3 penalty
- Copy-Paste Detection — If answer equals a previous query → 0.4 penalty
- Degenerate Output — If unique word ratio < 0.3 → 0.3 penalty
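The four checks above could be combined as follows. The thresholds and penalty values are taken from the list; summing the penalties when several checks fire, and requiring more than one step before the monotonic check applies, are assumptions this sketch makes.

```python
def anti_hack_penalty(queries: list, action_types: list, answer: str) -> float:
    """Total penalty from the four detectors (additive stacking is assumed)."""
    penalty = 0.0
    # Repetition: the last three queries are identical.
    if len(queries) >= 3 and len(set(queries[-3:])) == 1:
        penalty += 0.3
    # Monotonic action exploit: every step uses the same action type.
    if len(action_types) > 1 and len(set(action_types)) == 1:
        penalty += 0.3
    # Copy-paste: the answer equals a previous query.
    if answer and answer in queries:
        penalty += 0.4
    # Degenerate output: unique word ratio below 0.3.
    words = answer.split()
    if words and len(set(words)) / len(words) < 0.3:
        penalty += 0.3
    return penalty
```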
Score Clamping
All scores are strictly clamped to [0.01, 0.99] to ensure:
- No task returns exactly 0.0 (which would indicate a broken grader)
- No task returns exactly 1.0 (which would indicate a trivial task)
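The clamp itself is a one-liner; the function name here is hypothetical.

```python
def clamp_score(score: float, low: float = 0.01, high: float = 0.99) -> float:
    """Clamp a score into [low, high] so it is never exactly 0.0 or 1.0."""
    return max(low, min(high, score))
```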
LLM Judge Mode
An alternative LLMJudgeRewardFunction uses the LLM itself as an evaluator:
- Per-step: Prompts the LLM to rate step quality on a scale from 0.0 to 1.0
- Per-episode: Prompts the LLM to rate overall performance
This is used when rule-based evaluation is insufficient for the domain.
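A rough sketch of a per-step judge, not the project's actual LLMJudgeRewardFunction: the prompt wording, the `llm` callable (any function mapping a prompt string to a reply string), and the fallback for unparseable replies are all assumptions.

```python
import re
from typing import Callable


def judge_step(llm: Callable[[str], str], action: str, content: str) -> float:
    """Ask an LLM to rate one step and parse the first number in its reply."""
    prompt = (
        "Rate the quality of this agent step from 0.0 to 1.0. "
        "Reply with the number only.\n"
        f"Action: {action}\nContent: {content}"
    )
    reply = llm(prompt)
    match = re.search(r"\d*\.?\d+", reply)
    if match is None:
        return 0.01  # assumed fallback: lowest clamped score for unparseable replies
    # Apply the same [0.01, 0.99] clamp used by the rule-based scores.
    return max(0.01, min(0.99, float(match.group())))
```

Passing the model as a callable keeps the parsing logic testable with a stub, independent of any particular LLM client.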
Training Loop
Episode Structure
reset(task_id)
│
for step in range(max_steps):
│ │
│ ├── Agent observes state
│ ├── Agent selects action (retrieve/reason/answer/plan/critique/verify)
│ ├── Environment processes action
│ ├── Step reward computed (process supervision)
│ ├── State updated
│ │
│ └── if done: break
│
├── Episode reward computed
├── Grader evaluates final performance
└── Trajectory saved for training
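The episode structure above maps onto a conventional reset/step loop. The sketch below uses a toy environment and a scripted agent, both hypothetical stand-ins; the real environment's method names, state shape, and reward computation may differ.

```python
class ToyEnv:
    """Stand-in environment: the episode ends when the agent answers."""

    def reset(self, task_id):
        self.history = []
        return {"task_id": task_id, "history": self.history}

    def step(self, action):
        self.history.append(action)
        reward = 0.5  # placeholder for the process-supervision step reward
        done = action == "answer"
        return {"history": self.history}, reward, done


class ScriptedAgent:
    """Stand-in agent that replays a fixed list of actions."""

    def __init__(self, script):
        self.script = list(script)

    def act(self, state):
        return self.script[len(state["history"])]


def run_episode(env, agent, task_id, max_steps=10):
    state = env.reset(task_id)
    trajectory = []
    for step in range(max_steps):
        action = agent.act(state)                # agent observes state, selects action
        state, reward, done = env.step(action)   # environment processes action
        trajectory.append({"step": step, "action": action, "reward": reward})
        if done:
            break
    return trajectory  # episode reward and grading would follow here
```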
Self-Improvement Loop
The environment supports self-improvement through:
- Adversarial Critique — The Critic agent identifies weaknesses in reasoning
- Iterative Refinement — Retriever can be redirected based on critique
- Verification Gate — Verifier checks answer grounding before submission
- Curriculum Difficulty — Tasks range from easy to hard, with the hardest intended to challenge frontier models
Trajectory Analysis
Each trajectory records:
- Per-step actions and reasoning traces
- Per-step intermediate rewards
- Final score and episode metadata
- Agent communication history
This data enables:
- Offline RL training on collected trajectories
- Process reward model training
- Failure mode analysis
- Self-improvement curriculum generation
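For offline use, a trajectory needs to be serializable. The following is one possible record layout covering the four recorded items; the class and field names are hypothetical, not the project's schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class StepRecord:
    action: str
    reasoning: str
    reward: float  # per-step intermediate reward


@dataclass
class Trajectory:
    task_id: str
    steps: list = field(default_factory=list)     # StepRecord entries
    messages: list = field(default_factory=list)  # agent communication history
    final_score: float = 0.0
    metadata: dict = field(default_factory=dict)

    def to_json(self) -> str:
        """Serialize the whole trajectory, including nested step records."""
        return json.dumps(asdict(self))
```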
Grading System
Deterministic Graders
Each task has a KeywordCoverageGrader that:
- Checks keyword presence across rubric categories
- Weights coverage by category importance
- Adds process quality bonus (20% weight)
- Clamps final score to [0.01, 0.99]
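The grading steps above might look like the sketch below. It is not the actual KeywordCoverageGrader: the rubric shape, substring matching, and the interpretation of the 20% process weight as an 80/20 blend with keyword coverage are assumptions.

```python
def keyword_coverage_grade(answer: str, rubric: dict, process_score: float,
                           process_weight: float = 0.2) -> float:
    """Deterministic grade from weighted keyword coverage plus process quality.

    rubric: {category: (weight, [keywords])} (hypothetical shape).
    """
    text = answer.lower()
    total_weight = sum(weight for weight, _ in rubric.values())
    coverage = 0.0
    for weight, keywords in rubric.values():
        if keywords:
            hits = sum(1 for kw in keywords if kw.lower() in text)
            coverage += weight * hits / len(keywords)
    coverage = coverage / total_weight if total_weight else 0.0
    # Assumed blend: 80% keyword coverage, 20% process quality.
    score = (1 - process_weight) * coverage + process_weight * process_score
    return max(0.01, min(0.99, score))  # clamp to [0.01, 0.99]
```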
Process Quality Evaluation
The process component (20% of final grade) rewards:
- Diverse action types (3+ types → bonus)
- Proper sequencing (retrieve before answer)
- Reasoning steps present
- Not using excessive steps
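The four process criteria can be sketched as boolean checks. Weighting them equally and measuring "excessive" against a `step_budget` parameter are assumptions; the action names follow the episode structure described earlier.

```python
def process_quality(action_types: list, step_budget: int) -> float:
    """Fraction of the four process checks that pass (equal weights assumed)."""
    checks = [
        # Diverse action types: at least three distinct types.
        len(set(action_types)) >= 3,
        # Proper sequencing: the first retrieve precedes the first answer.
        "retrieve" in action_types
        and "answer" in action_types
        and action_types.index("retrieve") < action_types.index("answer"),
        # Reasoning steps present.
        "reason" in action_types,
        # Not using excessive steps (step_budget is a hypothetical threshold).
        len(action_types) <= step_budget,
    ]
    return sum(checks) / len(checks)
```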