agentic-rag-gym / documents /reward_function.md

williyam

feat: complete Agentic RAG Gym implementation

7f38b9c about 1 month ago

	# Reward Function & Training Loop — Agentic RAG Gym

	## Reward Function Design

	### Philosophy

	The reward function follows three core principles:
	1. Process Supervision — Reward intermediate steps, not just final outcomes
	2. Anti-Hacking — Detect and penalize degenerate strategies
	3. Composite Signals — Multiple evaluation dimensions prevent reward hacking

	### Composite Reward Architecture

	The `CompositeRewardFunction` combines five weighted signals:

	| Signal | Weight | Description |
	|---|---|---|
	| Retrieval Relevance | 0.25 | Average relevance score of retrieved documents |
	| Reasoning Quality | 0.20 | Depth and logic of reasoning traces |
	| Answer Completeness | 0.30 | Coverage of task requirements in final answer |
	| Efficiency | 0.15 | Penalty for excessive steps |
	| Anti-Hack Penalty | 0.10 | Deductions for detected hacking patterns |
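The weighted combination in the table can be sketched as follows. This is a minimal illustration, not the actual `CompositeRewardFunction`: the function name, the signal keys, and the choice to score the anti-hack component as `1 - penalty` are all assumptions. Only the weights come from the table.

```python
# Weights are copied from the table above; every name here is hypothetical.
WEIGHTS = {
    "retrieval_relevance": 0.25,
    "reasoning_quality": 0.20,
    "answer_completeness": 0.30,
    "efficiency": 0.15,
    "anti_hack": 0.10,
}


def composite_reward(signals: dict, hack_penalty: float = 0.0) -> float:
    """Weighted sum of per-signal scores, each expected in [0, 1].

    Assumption: the anti-hack component is scored as 1 - penalty, so a clean
    trajectory earns the full 0.10 and a penalized one earns less.
    """
    scores = dict(signals)
    scores["anti_hack"] = max(0.0, 1.0 - hack_penalty)
    total = sum(WEIGHTS[name] * scores.get(name, 0.0) for name in WEIGHTS)
    # Clamp to the [0.01, 0.99] range used throughout this document.
    return min(0.99, max(0.01, total))
```

Because the weights sum to 1.0, the unclamped total stays in `[0, 1]` whenever each signal does.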

	### Step-Level Rewards (Process Supervision)

	Each action type receives a specialized evaluation:

	#### Retrieval Steps
	- Base reward proportional to average document relevance scores
	- Bonus for retrieving diverse sources
	- No reward if no documents found
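A rough sketch of the retrieval rules above. The document schema (`relevance` and `source` keys) and the size of the diversity bonus are assumptions made for illustration; the source does not specify them.

```python
def retrieval_step_reward(docs: list, diversity_bonus: float = 0.1) -> float:
    """Score one retrieval step.

    Each doc is assumed to be a dict with a 'relevance' score in [0, 1]
    and a 'source' identifier (hypothetical schema).
    """
    if not docs:
        # No documents found: no reward.
        return 0.0
    # Base reward: average relevance of the retrieved documents.
    base = sum(d["relevance"] for d in docs) / len(docs)
    # Diversity bonus: granted when more than one distinct source appears.
    distinct_sources = {d["source"] for d in docs}
    bonus = diversity_bonus if len(distinct_sources) > 1 else 0.0
    return min(1.0, base + bonus)
```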

	#### Reasoning Steps
	- Trace length (baseline threshold: more than 50 characters)
	- Evidence markers: "because", "therefore", "based on"
	- Critical thinking markers: "however", "alternatively", "caveat"
	- Score scaled by step position (earlier steps receive a higher efficiency multiplier)
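The reasoning heuristics could look roughly like this. The marker lists and the 50-character threshold come from the list above; the point values (0.4, 0.3, 0.3) and the linear position multiplier are assumptions chosen only to make the sketch concrete.

```python
EVIDENCE_MARKERS = ("because", "therefore", "based on")
CRITICAL_MARKERS = ("however", "alternatively", "caveat")


def reasoning_step_reward(trace: str, step_index: int, max_steps: int) -> float:
    """Heuristic score for one reasoning trace (point values are assumed)."""
    text = trace.lower()
    score = 0.0
    if len(trace) > 50:  # length baseline
        score += 0.4
    if any(marker in text for marker in EVIDENCE_MARKERS):
        score += 0.3
    if any(marker in text for marker in CRITICAL_MARKERS):
        score += 0.3
    # Assumed multiplier: 1.0 at the first step, falling linearly to 0.5
    # at the last step, so earlier reasoning is rewarded more.
    multiplier = 1.0 - 0.5 * (step_index / max(1, max_steps - 1))
    return score * multiplier
```

Substring matching on markers is easy to game, which is one reason the anti-hacking checks exist alongside it.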

	#### Answer Steps
	- Length heuristic (word count thresholds)
	- Source grounding: overlap between answer terms and retrieved document terms
	- Rubric alignment: presence of task-specific criteria keywords
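As an illustration of the three answer signals, the sketch below computes source grounding as the fraction of answer terms that also appear in the retrieved documents. The tokenizer, the `min_words` threshold, and the equal weighting of the three signals are assumptions, not details from the source.

```python
import re


def tokenize(text: str) -> set:
    """Lowercase alphanumeric terms (a deliberately simple tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def source_grounding(answer: str, documents: list) -> float:
    """Fraction of answer terms that also occur in any retrieved document."""
    answer_terms = tokenize(answer)
    if not answer_terms:
        return 0.0
    doc_terms = set().union(*(tokenize(d) for d in documents))
    return len(answer_terms & doc_terms) / len(answer_terms)


def answer_step_reward(answer, documents, rubric_keywords, min_words=20):
    """Average of length, grounding, and rubric scores (equal weights assumed)."""
    length_score = min(1.0, len(answer.split()) / min_words)
    grounding = source_grounding(answer, documents)
    terms = tokenize(answer)
    hits = sum(1 for kw in rubric_keywords if kw.lower() in terms)
    rubric = hits / len(rubric_keywords) if rubric_keywords else 0.0
    return (length_score + grounding + rubric) / 3
```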

	### Anti-Reward-Hacking Measures

	Four detection mechanisms:

	1. Repetition Detection — If last 3 queries are identical → 0.3 penalty
	2. Monotonic Action Exploit — If all steps use same action type → 0.3 penalty
	3. Copy-Paste Detection — If answer equals a previous query → 0.4 penalty
	4. Degenerate Output — If unique word ratio < 0.3 → 0.3 penalty
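The four checks can be sketched as one function. The thresholds and penalty values are taken from the list above; the function signature, the additive combination of penalties, the cap at 1.0, and the requirement of more than one step for the monotonic check are assumptions.

```python
def anti_hack_penalty(queries: list, actions: list, answer: str) -> float:
    """Sum the penalties for each detected hacking pattern (capped at 1.0)."""
    penalty = 0.0
    # 1. Repetition: the last 3 queries are identical.
    if len(queries) >= 3 and len(set(queries[-3:])) == 1:
        penalty += 0.3
    # 2. Monotonic action exploit: every step uses the same action type.
    #    Assumption: a single-step episode is not flagged.
    if len(actions) > 1 and len(set(actions)) == 1:
        penalty += 0.3
    # 3. Copy-paste: the answer equals a previous query.
    if answer and answer in queries:
        penalty += 0.4
    # 4. Degenerate output: unique word ratio below 0.3.
    words = answer.split()
    if words and len(set(words)) / len(words) < 0.3:
        penalty += 0.3
    return min(1.0, penalty)
```

For example, an answer of `"spam spam spam spam spam"` has a unique word ratio of 0.2 and triggers the degenerate-output penalty.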

	### Score Clamping

	All scores are strictly clamped to `[0.01, 0.99]` to ensure:
	- No task returns exactly 0.0 (which would indicate a broken grader)
	- No task returns exactly 1.0 (which would indicate a trivial task)
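The clamping step itself is small. A minimal sketch, with a hypothetical helper name:

```python
def clamp_score(score: float, low: float = 0.01, high: float = 0.99) -> float:
    """Clamp a raw score into [low, high] so it is never exactly 0.0 or 1.0."""
    return max(low, min(high, score))
```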

	### LLM Judge Mode

	An alternative `LLMJudgeRewardFunction` uses the LLM itself as an evaluator:
	- Per-step: Prompts the LLM to rate step quality on a scale from 0.0 to 1.0
	- Per-episode: Prompts the LLM to rate overall performance

	This mode is intended for domains where rule-based evaluation is insufficient.
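A per-step judge call might be structured as below. This is a sketch only: the prompt wording, the `llm` callable interface (prompt string in, reply string out), and the fallback score for unparseable replies are assumptions, and the real `LLMJudgeRewardFunction` may differ in all of them.

```python
import re
from typing import Callable


def judge_step(llm: Callable[[str], str], action: str, content: str) -> float:
    """Ask an LLM to rate one step, then parse and clamp the reply."""
    prompt = (
        "Rate the quality of this agent step from 0.0 to 1.0. "
        "Reply with the number only.\n"
        f"Action: {action}\nContent: {content}"
    )
    reply = llm(prompt)
    # Take the first number in the reply; models often add extra words.
    match = re.search(r"\d*\.?\d+", reply)
    if match is None:
        return 0.01  # assumed fallback when no number can be parsed
    return max(0.01, min(0.99, float(match.group())))
```

Passing the model as a plain callable keeps the sketch independent of any particular client library and makes it easy to stub in tests.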

	## Training Loop

	### Episode Structure

	```
	reset(task_id)
	│
	for step in range(max_steps):
	│ │
	│ ├── Agent observes state
	│ ├── Agent selects action (retrieve/reason/answer/plan/critique/verify)
	│ ├── Environment processes action
	│ ├── Step reward computed (process supervision)
	│ ├── State updated
	│ │
	│ └── if done: break
	│
	├── Episode reward computed
	├── Grader evaluates final performance
	└── Trajectory saved for training
	```
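The episode structure above maps onto a loop like the following. `ToyEnv`, its fixed step reward, and the use of the mean step reward as the episode reward are stand-ins invented for this sketch. The grader call is omitted.

```python
from dataclasses import dataclass


@dataclass
class ToyEnv:
    """Hypothetical stand-in environment with a fixed step reward."""
    max_steps: int = 5

    def reset(self, task_id: str) -> dict:
        return {"task_id": task_id, "history": []}

    def step(self, state: dict, action: str):
        state["history"].append(action)
        reward = 0.5  # placeholder for the real step-level reward
        done = action == "answer"
        return state, reward, done


def run_episode(env, policy, task_id: str):
    """Roll out one episode and return (trajectory, episode_reward)."""
    state = env.reset(task_id)
    trajectory = []
    for step in range(env.max_steps):
        action = policy(state)  # agent observes state, selects action
        state, reward, done = env.step(state, action)
        trajectory.append({"step": step, "action": action, "reward": reward})
        if done:
            break
    # Assumption: episode reward is the mean of the step rewards.
    rewards = [t["reward"] for t in trajectory]
    episode_reward = sum(rewards) / len(rewards) if rewards else 0.0
    return trajectory, episode_reward
```

A scripted policy such as `lambda s: ["retrieve", "reason", "answer"][len(s["history"])]` ends the episode after three steps.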

	### Self-Improvement Loop

	The environment supports self-improvement through:

	1. Adversarial Critique — The Critic agent identifies weaknesses in reasoning
	2. Iterative Refinement — Retriever can be redirected based on critique
	3. Verification Gate — Verifier checks answer grounding before submission
	4. Curriculum Difficulty — Tasks range from easy to hard, with the hardest intended to challenge frontier models

	### Trajectory Analysis

	Each trajectory records:
	- Per-step actions and reasoning traces
	- Per-step intermediate rewards
	- Final score and episode metadata
	- Agent communication history

	This data enables:
	- Offline RL training on collected trajectories
	- Process reward model training
	- Failure mode analysis
	- Self-improvement curriculum generation
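One possible shape for a trajectory record holding the four kinds of data listed above, serialized to JSON for offline use. The class and field names are hypothetical; the source does not describe the actual schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class StepRecord:
    action: str
    reasoning_trace: str
    reward: float  # per-step intermediate reward


@dataclass
class Trajectory:
    task_id: str
    steps: list = field(default_factory=list)     # StepRecord items
    messages: list = field(default_factory=list)  # agent communication history
    final_score: float = 0.01
    metadata: dict = field(default_factory=dict)

    def to_json(self) -> str:
        """Serialize the whole trajectory, including nested step records."""
        return json.dumps(asdict(self))
```

Storing per-step rewards next to the actions is what makes the record usable for process reward model training as well as offline RL.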

	## Grading System

	### Deterministic Graders

	Each task has a `KeywordCoverageGrader` that:
	1. Checks keyword presence across rubric categories
	2. Weights coverage by category importance
	3. Adds a process quality bonus (20% weight)
	4. Clamps the final score to `[0.01, 0.99]`
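The four grading steps can be sketched as a single function. The rubric layout (`category -> (weight, keywords)`), the function name, and substring matching for keyword presence are assumptions. The 20% process weight and the clamp range come from the list above.

```python
def keyword_coverage_grade(answer: str, rubric: dict, process_score: float,
                           process_weight: float = 0.20) -> float:
    """Grade an answer by weighted keyword coverage plus a process bonus.

    rubric maps a category name to (importance_weight, [keywords]);
    this layout is hypothetical.
    """
    text = answer.lower()
    total_weight = sum(weight for weight, _ in rubric.values())
    covered = 0.0
    for weight, keywords in rubric.values():
        if keywords:
            hits = sum(1 for kw in keywords if kw.lower() in text)
            covered += weight * hits / len(keywords)
    coverage = covered / total_weight if total_weight else 0.0
    # Blend coverage (80%) with process quality (20%), then clamp.
    score = (1 - process_weight) * coverage + process_weight * process_score
    return max(0.01, min(0.99, score))
```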

	### Process Quality Evaluation

	The process component (20% of final grade) rewards:
	- Diverse action types (a bonus for three or more types)
	- Proper sequencing (retrieve before answer)
	- Presence of reasoning steps
	- Avoiding excessive steps
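A sketch of how the four process criteria might be scored. The equal 0.25 share per criterion and the `step_budget` parameter are assumptions. The action names follow the episode structure described earlier.

```python
def process_quality(actions: list, step_budget: int) -> float:
    """Score the process in [0, 1]; each criterion contributes 0.25 (assumed)."""
    if not actions:
        return 0.0
    score = 0.0
    # Diverse action types: three or more distinct types.
    if len(set(actions)) >= 3:
        score += 0.25
    # Proper sequencing: first retrieve comes before first answer.
    if ("retrieve" in actions and "answer" in actions
            and actions.index("retrieve") < actions.index("answer")):
        score += 0.25
    # Reasoning steps present.
    if "reason" in actions:
        score += 0.25
    # Not using excessive steps: stay within the (hypothetical) budget.
    if len(actions) <= step_budget:
        score += 0.25
    return score
```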