# Reward Function & Training Loop — Agentic RAG Gym ## Reward Function Design ### Philosophy The reward function follows three core principles: 1. **Process Supervision** — Reward intermediate steps, not just final outcomes 2. **Anti-Hacking** — Detect and penalize degenerate strategies 3. **Composite Signals** — Multiple evaluation dimensions prevent reward hacking ### Composite Reward Architecture The `CompositeRewardFunction` combines five weighted signals: | Signal | Weight | Description | |---|---|---| | Retrieval Relevance | 0.25 | Average relevance score of retrieved documents | | Reasoning Quality | 0.20 | Depth and logic of reasoning traces | | Answer Completeness | 0.30 | Coverage of task requirements in final answer | | Efficiency | 0.15 | Penalty for excessive steps | | Anti-Hack Penalty | 0.10 | Deductions for detected hacking patterns | ### Step-Level Rewards (Process Supervision) Each action type receives a specialized evaluation: #### Retrieval Steps - Base reward proportional to average document relevance scores - Bonus for retrieving diverse sources - No reward if no documents found #### Reasoning Steps - Trace length (>50 chars baseline) - Evidence markers: "because", "therefore", "based on" - Critical thinking markers: "however", "alternatively", "caveat" - Scaled by step position (earlier = higher efficiency multiplier) #### Answer Steps - Length heuristic (word count thresholds) - Source grounding: overlap between answer terms and retrieved document terms - Rubric alignment: presence of task-specific criteria keywords ### Anti-Reward-Hacking Measures Four detection mechanisms: 1. **Repetition Detection** — If last 3 queries are identical → 0.3 penalty 2. **Monotonic Action Exploit** — If all steps use same action type → 0.3 penalty 3. **Copy-Paste Detection** — If answer equals a previous query → 0.4 penalty 4. **Degenerate Output** — If unique word ratio < 0.3 → 0.3 penalty ### Score Clamping All scores are strictly clamped to `[0.01, 0.99]` to ensure: - No task returns exactly 0.0 (which would indicate a broken grader) - No task returns exactly 1.0 (which would indicate a trivial task) ### LLM Judge Mode An alternative `LLMJudgeRewardFunction` uses the LLM itself as an evaluator: - Per-step: Prompts the LLM to rate step quality 0.0-1.0 - Per-episode: Prompts the LLM to rate overall performance This is used when rule-based evaluation is insufficient for the domain. ## Training Loop ### Episode Structure ``` reset(task_id) │ for step in range(max_steps): │ │ │ ├── Agent observes state │ ├── Agent selects action (retrieve/reason/answer/plan/critique/verify) │ ├── Environment processes action │ ├── Step reward computed (process supervision) │ ├── State updated │ │ │ └── if done: break │ ├── Episode reward computed ├── Grader evaluates final performance └── Trajectory saved for training ``` ### Self-Improvement Loop The environment supports self-improvement through: 1. **Adversarial Critique** — The Critic agent identifies weaknesses in reasoning 2. **Iterative Refinement** — Retriever can be redirected based on critique 3. **Verification Gate** — Verifier checks answer grounding before submission 4. **Curriculum Difficulty** — Tasks range easy → hard, challenging frontier models ### Trajectory Analysis Each trajectory records: - Per-step actions and reasoning traces - Per-step intermediate rewards - Final score and episode metadata - Agent communication history This data enables: - Offline RL training on collected trajectories - Process reward model training - Failure mode analysis - Self-improvement curriculum generation ## Grading System ### Deterministic Graders Each task has a `KeywordCoverageGrader` that: 1. Checks keyword presence across rubric categories 2. Weights coverage by category importance 3. Adds process quality bonus (20% weight) 4. Clamps final score to [0.01, 0.99] ### Process Quality Evaluation The process component (20% of final grade) rewards: - Diverse action types (3+ types → bonus) - Proper sequencing (retrieve before answer) - Reasoning steps present - Not using excessive steps