Reward Function & Training Loop — Agentic RAG Gym

Reward Function Design

Philosophy

The reward function follows three core principles:

Process Supervision — Reward intermediate steps, not just final outcomes
Anti-Hacking — Detect and penalize degenerate strategies
Composite Signals — Multiple evaluation dimensions prevent reward hacking

Composite Reward Architecture

The CompositeRewardFunction combines five weighted signals:

Signal	Weight	Description
Retrieval Relevance	0.25	Average relevance score of retrieved documents
Reasoning Quality	0.20	Depth and logic of reasoning traces
Answer Completeness	0.30	Coverage of task requirements in final answer
Efficiency	0.15	Penalty for excessive steps
Anti-Hack Penalty	0.10	Deductions for detected hacking patterns
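As a rough illustration of how the five weights could combine, here is a minimal sketch. The weights come from the table above; the function name `composite_reward`, the signal keys, and the choice to score the anti-hack slot as `1 - penalty` are assumptions made for this example, not the actual `CompositeRewardFunction` code.

```python
# Weights taken from the table above; everything else is an illustrative assumption.
WEIGHTS = {
    "retrieval_relevance": 0.25,
    "reasoning_quality": 0.20,
    "answer_completeness": 0.30,
    "efficiency": 0.15,
    "anti_hack": 0.10,
}


def composite_reward(signals: dict, hack_penalty: float) -> float:
    """Weighted sum of per-signal scores, each expected in [0, 1].

    Assumption: the anti-hack slot is scored as (1 - penalty), so an episode
    with no detected hacking earns the full 0.10.
    """
    scores = dict(signals)
    scores["anti_hack"] = max(0.0, 1.0 - hack_penalty)
    # Missing signals count as 0.0 rather than raising.
    return sum(weight * scores.get(name, 0.0) for name, weight in WEIGHTS.items())
```

Because the weights sum to 1.0, an episode that scores 1.0 on every signal with no penalty lands at 1.0 before any clamping is applied.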

Step-Level Rewards (Process Supervision)

Each action type receives a specialized evaluation:

Retrieval Steps

Base reward proportional to average document relevance scores
Bonus for retrieving diverse sources
No reward if no documents are found
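The three retrieval rules above could be sketched as follows. The document shape (`source` and `relevance` keys), the function name, and the 0.1 bonus size are hypothetical; the source does not specify them.

```python
def retrieval_step_reward(docs: list, diversity_bonus: float = 0.1) -> float:
    """Score one retrieval step from the documents it returned.

    Each document is assumed to look like {"source": str, "relevance": float}.
    """
    if not docs:
        return 0.0  # no documents found, no reward
    # Base reward: average relevance of the retrieved documents.
    base = sum(d["relevance"] for d in docs) / len(docs)
    # Diversity bonus (assumed size): granted when more than one source appears.
    distinct_sources = {d["source"] for d in docs}
    bonus = diversity_bonus if len(distinct_sources) > 1 else 0.0
    return min(1.0, base + bonus)
```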

Reasoning Steps

Trace length (>50 chars baseline)
Evidence markers: "because", "therefore", "based on"
Critical thinking markers: "however", "alternatively", "caveat"
Scaled by step position (earlier = higher efficiency multiplier)
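A minimal sketch of how these heuristics might fit together is shown below. The marker lists and the 50-character baseline come from the list above; the per-feature point values (0.4 / 0.3 / 0.3) and the linear position multiplier are assumptions chosen only to make the example concrete.

```python
EVIDENCE_MARKERS = ("because", "therefore", "based on")
CRITICAL_MARKERS = ("however", "alternatively", "caveat")


def reasoning_step_reward(trace: str, step_index: int, max_steps: int) -> float:
    """Heuristic score for one reasoning trace (point values are assumed)."""
    text = trace.lower()
    score = 0.0
    if len(trace) > 50:  # length baseline from the list above
        score += 0.4
    if any(marker in text for marker in EVIDENCE_MARKERS):
        score += 0.3
    if any(marker in text for marker in CRITICAL_MARKERS):
        score += 0.3
    # Earlier steps get a higher multiplier: 1.0 at step 0, falling toward 0.5.
    multiplier = 1.0 - 0.5 * (step_index / max(1, max_steps))
    return score * multiplier
```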

Answer Steps

Length heuristic (word count thresholds)
Source grounding: overlap between answer terms and retrieved document terms
Rubric alignment: presence of task-specific criteria keywords
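Source grounding and rubric alignment are both overlap measures, so a small sketch may help. The tokenization (lowercased alphanumeric runs) and both function names are assumptions; the source only says that term overlap and keyword presence are checked.

```python
import re


def _terms(text: str) -> set:
    """Lowercased alphanumeric tokens (an assumed, deliberately simple tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def source_grounding(answer: str, documents: list) -> float:
    """Fraction of answer terms that also appear in the retrieved documents."""
    answer_terms = _terms(answer)
    if not answer_terms:
        return 0.0
    doc_terms = set().union(*(_terms(doc) for doc in documents))
    return len(answer_terms & doc_terms) / len(answer_terms)


def rubric_alignment(answer: str, rubric_keywords: list) -> float:
    """Fraction of task-specific rubric keywords present in the answer."""
    if not rubric_keywords:
        return 0.0
    text = answer.lower()
    hits = sum(1 for keyword in rubric_keywords if keyword.lower() in text)
    return hits / len(rubric_keywords)
```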

Anti-Reward-Hacking Measures

Four detection mechanisms:

Repetition Detection — If last 3 queries are identical → 0.3 penalty
Monotonic Action Exploit — If all steps use same action type → 0.3 penalty
Copy-Paste Detection — If answer equals a previous query → 0.4 penalty
Degenerate Output — If unique word ratio < 0.3 → 0.3 penalty
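The four checks can be sketched as one function. The thresholds and penalty sizes are the ones listed above; the function signature, the additive combination capped at 1.0, and the requirement of more than one step before the monotonic check fires are assumptions for this example.

```python
def anti_hack_penalty(queries: list, action_types: list, answer: str) -> float:
    """Sum the penalties for each detected pattern (additive, capped at 1.0 by assumption)."""
    penalty = 0.0

    # Repetition: the last 3 queries are identical.
    if len(queries) >= 3 and len(set(queries[-3:])) == 1:
        penalty += 0.3

    # Monotonic action exploit: every step used the same action type.
    if len(action_types) > 1 and len(set(action_types)) == 1:
        penalty += 0.3

    # Copy-paste: the answer equals a previous query.
    cleaned = answer.strip()
    if cleaned and cleaned in {query.strip() for query in queries}:
        penalty += 0.4

    # Degenerate output: unique word ratio below 0.3.
    words = answer.lower().split()
    if words and len(set(words)) / len(words) < 0.3:
        penalty += 0.3

    return min(1.0, penalty)
```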

Score Clamping

All scores are strictly clamped to [0.01, 0.99] to ensure:

No task returns exactly 0.0 (which would indicate a broken grader)
No task returns exactly 1.0 (which would indicate a trivial task)
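The clamp itself is a one-liner. The bounds are from the text above; the function name `clamp_score` is hypothetical.

```python
def clamp_score(score: float, low: float = 0.01, high: float = 0.99) -> float:
    """Clamp a raw score into [0.01, 0.99] so it is never exactly 0.0 or 1.0."""
    return max(low, min(high, score))
```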

LLM Judge Mode

An alternative LLMJudgeRewardFunction uses the LLM itself as an evaluator:

Per-step: Prompts the LLM to rate step quality 0.0-1.0
Per-episode: Prompts the LLM to rate overall performance

This mode is intended for domains where rule-based evaluation is insufficient.
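A per-step judge could look roughly like the sketch below. The prompt wording, the `llm` callable (any function that maps a prompt string to a reply string), and the neutral 0.5 fallback for unparseable replies are all assumptions; none of this is taken from `LLMJudgeRewardFunction`.

```python
import re
from typing import Callable


def judge_step(llm: Callable[[str], str], action: str, content: str) -> float:
    """Ask an LLM to rate one step from 0.0 to 1.0 and parse its reply."""
    prompt = (
        "Rate the quality of this agent step from 0.0 to 1.0.\n"
        f"Action: {action}\n"
        f"Content: {content}\n"
        "Reply with a single number."
    )
    reply = llm(prompt)
    match = re.search(r"\d*\.?\d+", reply)  # first number in the reply
    if match is None:
        return 0.5  # assumed neutral fallback when no number can be parsed
    return max(0.0, min(1.0, float(match.group())))
```

Passing the model in as a plain callable keeps the sketch independent of any particular client library, and makes it easy to test with a stub such as `lambda prompt: "0.8"`.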

Training Loop

Episode Structure

reset(task_id)
    │
    for step in range(max_steps):
    │   │
    │   ├── Agent observes state
    │   ├── Agent selects action (retrieve/reason/answer/plan/critique/verify)
    │   ├── Environment processes action
    │   ├── Step reward computed (process supervision)
    │   ├── State updated
    │   │
    │   └── if done: break
    │
    ├── Episode reward computed
    ├── Grader evaluates final performance
    └── Trajectory saved for training
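The episode structure above can be sketched in Python as follows. The `env.reset` / `env.step` / `agent.act` interface, the `ToyEnv` and `ScriptedAgent` stand-ins, and the use of the mean step reward as the episode reward are assumptions for illustration; the gym's real classes may differ.

```python
def run_episode(env, agent, task_id, max_steps=10):
    """Roll out one episode and return (steps, episode_reward)."""
    steps = []
    state = env.reset(task_id)
    for _ in range(max_steps):
        action = agent.act(state)                    # agent observes state, selects action
        state, step_reward, done = env.step(action)  # env processes action, scores the step
        steps.append({"action": action, "reward": step_reward})
        if done:
            break
    # Assumption: the episode reward is the mean of the step rewards.
    episode_reward = sum(s["reward"] for s in steps) / len(steps) if steps else 0.0
    return steps, episode_reward


class ToyEnv:
    """Stand-in environment: the episode ends when it sees an 'answer' action."""

    def reset(self, task_id):
        return {"task_id": task_id, "last_action": None}

    def step(self, action):
        return {"last_action": action}, 0.5, action == "answer"


class ScriptedAgent:
    """Stand-in agent that replays a fixed list of actions."""

    def __init__(self, script):
        self._script = iter(script)

    def act(self, state):
        return next(self._script)
```

For example, `run_episode(ToyEnv(), ScriptedAgent(["retrieve", "reason", "answer"]), "task-1")` stops after the third step because the toy environment reports `done` on `answer`.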

Self-Improvement Loop

The environment supports self-improvement through:

Adversarial Critique — The Critic agent identifies weaknesses in reasoning
Iterative Refinement — Retriever can be redirected based on critique
Verification Gate — Verifier checks answer grounding before submission
Curriculum Difficulty — Tasks range from easy to hard, with the hardest intended to challenge frontier models

Trajectory Analysis

Each trajectory records:

Per-step actions and reasoning traces
Per-step intermediate rewards
Final score and episode metadata
Agent communication history

This data enables:

Offline RL training on collected trajectories
Process reward model training
Failure mode analysis
Self-improvement curriculum generation
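One possible shape for a saved trajectory, covering the recorded fields listed above, is sketched here. The class and field names are hypothetical, and JSON is only one plausible storage format; the source does not say how trajectories are serialized.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class StepRecord:
    action: str           # per-step action
    reasoning_trace: str  # per-step reasoning trace
    reward: float         # per-step intermediate reward


@dataclass
class TrajectoryRecord:
    task_id: str
    steps: list = field(default_factory=list)     # list of StepRecord
    final_score: float = 0.0
    metadata: dict = field(default_factory=dict)  # episode metadata
    messages: list = field(default_factory=list)  # agent communication history

    def to_json(self) -> str:
        """Serialize the whole trajectory, including nested steps."""
        return json.dumps(asdict(self))
```

Keeping per-step rewards next to each action is what makes the same file usable for both offline RL and process reward model training.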

Grading System

Deterministic Graders

Each task has a KeywordCoverageGrader that:

Checks keyword presence across rubric categories
Weights coverage by category importance
Adds process quality bonus (20% weight)
Clamps final score to [0.01, 0.99]
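The grading steps above might be combined as in this sketch. The rubric shape (category mapped to a weight and keyword list), the function name, and the reading of "20% weight" as an 80/20 blend of coverage and process quality are assumptions, not the actual `KeywordCoverageGrader` code.

```python
def keyword_coverage_grade(answer: str, rubric: dict, process_score: float,
                           process_weight: float = 0.2) -> float:
    """Grade an answer by weighted keyword coverage plus a process component.

    `rubric` is assumed to map category name -> (weight, [keywords]).
    """
    text = answer.lower()
    total_weight = sum(weight for weight, _ in rubric.values())
    weighted_hits = 0.0
    for weight, keywords in rubric.values():
        if keywords:
            found = sum(1 for keyword in keywords if keyword.lower() in text)
            weighted_hits += weight * found / len(keywords)
    coverage = weighted_hits / total_weight if total_weight else 0.0
    # Assumed blend: 80% keyword coverage, 20% process quality.
    score = (1.0 - process_weight) * coverage + process_weight * process_score
    return max(0.01, min(0.99, score))  # clamp to [0.01, 0.99]
```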

Process Quality Evaluation

The process component (20% of final grade) rewards:

Diverse action types (3+ types → bonus)
Proper sequencing (retrieve before answer)
Reasoning steps present
Not using excessive steps
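The four process criteria could be scored as below. Giving each criterion an equal 0.25 share, the `step_budget` parameter that defines "excessive", and the action names `retrieve`, `reason`, and `answer` as plain strings are assumptions for this example.

```python
def process_quality(actions: list, step_budget: int) -> float:
    """Score the process on four equally weighted criteria (weights assumed)."""
    score = 0.0
    # Diverse action types: 3 or more distinct types earns the bonus.
    if len(set(actions)) >= 3:
        score += 0.25
    # Proper sequencing: the first retrieve comes before the first answer.
    if ("retrieve" in actions and "answer" in actions
            and actions.index("retrieve") < actions.index("answer")):
        score += 0.25
    # Reasoning steps present.
    if "reason" in actions:
        score += 0.25
    # Not using excessive steps (budget is an assumed parameter).
    if len(actions) <= step_budget:
        score += 0.25
    return score
```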