# advanced-reward-function
## table-of-contents
1. [Overview](#overview)
2. [Reward Components](#reward-components)
3. [Planning Quality](#planning-quality)
4. [Recovery Ability](#recovery-ability)
5. [Exploration Bonus](#exploration-bonus)
6. [Redundancy Penalty](#redundancy-penalty)
7. [Generalization Score](#generalization-score)
8. [Tool Usage Efficiency](#tool-usage-efficiency)
9. [Memory Utilization](#memory-utilization)
10. [Final Reward Formula](#final-reward-formula)
11. [Configuration](#configuration)
---
## overview
The **Advanced Reward Function** provides dense, interpretable signals that guide the agent toward intelligent, efficient, and generalizable web scraping strategies.
### design-principles
1. **Dense Rewards:** Provide feedback at every step, not just terminal states
2. **Interpretable:** Each component has a clear purpose agents (and humans) can understand
3. **Balanced:** Prevent reward hacking by balancing conflicting objectives
4. **Adaptive:** Adjust weights based on task difficulty and agent progress
### basic-vs-advanced
**Basic Reward (existing):**
```python
reward = task_completion_score # 0.0 to 1.0
```
**Advanced Reward:**
```python
reward = (
    w1 * task_completion +
    w2 * efficiency +
    w3 * planning_quality +
    w4 * recovery_ability +
    w5 * exploration_bonus +
    w6 * tool_usage +
    w7 * memory_usage +
    w8 * generalization
) - penalties
```
---
## reward-components
### 1-task-completion-w1-0-40
**Purpose:** Measure how much of the task is complete.
**Calculation:**
```python
def task_completion_score(extracted: Dict, ground_truth: Dict) -> float:
    """Score based on field completeness and accuracy."""
    if not ground_truth:
        return 0.0
    total_fields = len(ground_truth)
    correct_fields = 0
    partial_fields = 0
    for field, true_value in ground_truth.items():
        extracted_value = extracted.get(field)
        if extracted_value is None:
            continue  # Missing field, 0 points
        # Exact match
        if normalize(extracted_value) == normalize(true_value):
            correct_fields += 1
        # Partial match (fuzzy)
        elif similarity(extracted_value, true_value) > 0.7:
            partial_fields += 1
    score = (correct_fields + 0.5 * partial_fields) / total_fields
    return score
```
**Example:**
```python
# Task: Extract name, price, rating
ground_truth = {"name": "Widget Pro", "price": "$49.99", "rating": "4.5"}
# Agent extracted 2/3 correctly
extracted = {"name": "Widget Pro", "price": "$49.99", "rating": None}
task_completion = 2/3 = 0.67
```
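The calculation above can be exercised end-to-end with stand-in helpers. Here `normalize` and `similarity` are illustrative implementations (lowercase/strip and a `difflib` ratio), not necessarily the project's actual ones:

```python
from difflib import SequenceMatcher
from typing import Dict, Optional

def normalize(value: str) -> str:
    # Illustrative helper: case- and whitespace-insensitive comparison
    return str(value).strip().lower()

def similarity(a: str, b: str) -> float:
    # Illustrative helper: fuzzy match ratio in [0, 1]
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def task_completion_score(extracted: Dict[str, Optional[str]],
                          ground_truth: Dict[str, str]) -> float:
    if not ground_truth:
        return 0.0
    correct = partial = 0
    for field, true_value in ground_truth.items():
        value = extracted.get(field)
        if value is None:
            continue  # Missing field, 0 points
        if normalize(value) == normalize(true_value):
            correct += 1
        elif similarity(value, true_value) > 0.7:
            partial += 1
    return (correct + 0.5 * partial) / len(ground_truth)

ground_truth = {"name": "Widget Pro", "price": "$49.99", "rating": "4.5"}
extracted = {"name": "Widget Pro", "price": "$49.99", "rating": None}
print(round(task_completion_score(extracted, ground_truth), 2))  # 2/3 -> 0.67
```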
---
### 2-efficiency-w2-0-15
**Purpose:** Reward completing tasks quickly with fewer actions.
**Calculation:**
```python
def efficiency_score(steps_taken: int, max_steps: int, pages_visited: int, ideal_pages: int = 1) -> float:
    """Lower steps and pages = higher efficiency."""
    # Step efficiency
    step_efficiency = 1.0 - (steps_taken / max_steps)
    # Page efficiency (prefer fewer page visits);
    # ideal_pages comes from estimate_ideal_page_count(task)
    page_efficiency = 1.0 - abs(pages_visited - ideal_pages) / ideal_pages
    page_efficiency = max(0.0, page_efficiency)
    return 0.7 * step_efficiency + 0.3 * page_efficiency
```
**Example:**
```python
# Task with max 20 steps
steps_taken = 8
step_efficiency = 1.0 - (8/20) = 0.60  # Good!
steps_taken = 18
step_efficiency = 1.0 - (18/20) = 0.10  # Inefficient
```
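A runnable sketch of the combined score, with `ideal_pages` passed in directly since `estimate_ideal_page_count` is defined elsewhere:

```python
def efficiency_score(steps_taken: int, max_steps: int,
                     pages_visited: int, ideal_pages: int = 1) -> float:
    # Step efficiency: fewer steps relative to the budget scores higher
    step_efficiency = 1.0 - (steps_taken / max_steps)
    # Page efficiency: distance from the ideal page count, floored at 0
    page_efficiency = max(0.0, 1.0 - abs(pages_visited - ideal_pages) / ideal_pages)
    return 0.7 * step_efficiency + 0.3 * page_efficiency

print(round(efficiency_score(8, 20, 2, ideal_pages=2), 2))   # 0.7*0.6 + 0.3*1.0 = 0.72
print(round(efficiency_score(18, 20, 5, ideal_pages=2), 2))  # 0.7*0.1 + 0.3*0.0 = 0.07
```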
---
## planning-quality
### 3-planning-quality-score-w3-0-10
**Purpose:** Reward agents that plan before acting.
**Signals:**
- Used WRITE_MEMORY with reasoning notes
- Actions follow a coherent strategy
- Fewer backtracking actions
**Calculation:**
```python
def planning_quality_score(episode_history: List[Action]) -> float:
    """Measure planning behavior."""
    score = 0.0
    # 1. Did agent write reasoning notes?
    reasoning_actions = [a for a in episode_history if a.notes]
    if reasoning_actions:
        score += 0.3
    # 2. Action coherence: Do actions follow a logical sequence?
    coherence = measure_action_coherence(episode_history)
    score += 0.4 * coherence
    # 3. Backtracking penalty: Visiting same page multiple times
    unique_pages = len(set(a.navigate_to for a in episode_history if a.navigate_to))
    total_navigations = len([a for a in episode_history if a.action_type == "NAVIGATE"])
    if total_navigations > 0:
        backtrack_ratio = 1.0 - (unique_pages / total_navigations)
        score += 0.3 * (1.0 - backtrack_ratio)  # Lower backtracking = higher score
    else:
        score += 0.3  # No navigation at all means no backtracking
    return min(score, 1.0)

def measure_action_coherence(actions: List[Action]) -> float:
    """Are actions logically connected?"""
    coherence_patterns = [
        # Good patterns
        ("SEARCH_PAGE", "EXTRACT_FIELD"),   # Search then extract
        ("NAVIGATE", "EXTRACT_FIELD"),      # Navigate then extract
        ("EXTRACT_FIELD", "VERIFY_FACT"),   # Extract then verify
        ("SEARCH_ENGINE", "NAVIGATE"),      # Search then visit
    ]
    coherent_pairs = 0
    total_pairs = len(actions) - 1
    for i in range(total_pairs):
        pair = (actions[i].action_type, actions[i+1].action_type)
        if pair in coherence_patterns:
            coherent_pairs += 1
    return coherent_pairs / total_pairs if total_pairs > 0 else 0.0
```
**Example:**
```python
# Good planning:
actions = [
    Action(type="SEARCH_PAGE", notes="Looking for price pattern"),
    Action(type="EXTRACT_FIELD", target="price"),
    Action(type="VERIFY_FACT", field="price")
]
planning_score = 0.3 (notes) + 0.4*1.0 (both pairs coherent) + 0.3 (no backtracking) = 1.0
# Poor planning:
actions = [
    Action(type="NAVIGATE", navigate_to="/page1"),
    Action(type="NAVIGATE", navigate_to="/page2"),
    Action(type="NAVIGATE", navigate_to="/page1"),  # Backtrack!
    Action(type="EXTRACT_FIELD")
]
planning_score = 0.0 (no notes) + 0.4*0.33 (one coherent pair) + 0.3*0.67 (backtracking) ≈ 0.33
```
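The coherence check can be isolated into a self-contained sketch; the `Action` dataclass here is a minimal stand-in for the project's action type:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Action:  # Minimal stand-in for the project's Action type
    action_type: str
    notes: Optional[str] = None
    navigate_to: Optional[str] = None

COHERENCE_PATTERNS = {
    ("SEARCH_PAGE", "EXTRACT_FIELD"),
    ("NAVIGATE", "EXTRACT_FIELD"),
    ("EXTRACT_FIELD", "VERIFY_FACT"),
    ("SEARCH_ENGINE", "NAVIGATE"),
}

def measure_action_coherence(actions: List[Action]) -> float:
    total_pairs = len(actions) - 1
    if total_pairs <= 0:
        return 0.0
    # Count adjacent pairs that match a known-good pattern
    coherent = sum(
        1 for a, b in zip(actions, actions[1:])
        if (a.action_type, b.action_type) in COHERENCE_PATTERNS
    )
    return coherent / total_pairs

good = [Action("SEARCH_PAGE", notes="price pattern"),
        Action("EXTRACT_FIELD"), Action("VERIFY_FACT")]
print(measure_action_coherence(good))  # 2/2 -> 1.0
```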
---
## recovery-ability
### 4-recovery-ability-score-w4-0-08
**Purpose:** Reward agents that recover from failures.
**Signals:**
- Action failed → Agent tried alternative approach
- Extraction returned empty → Agent searched with different selector
- Page blocked → Agent switched proxy/VPN
**Calculation:**
```python
def recovery_ability_score(episode_history: List[Tuple[Action, Reward]]) -> float:
    """Measure ability to recover from failures."""
    recoveries = 0
    failures = 0
    for i in range(len(episode_history) - 1):
        action, reward = episode_history[i]
        next_action, next_reward = episode_history[i + 1]
        # Detect failure (negative reward or failure message)
        if reward.value < 0 or "failed" in (reward.message or "").lower():
            failures += 1
            # Check if next action was a recovery attempt
            if is_recovery_action(action, next_action):
                if next_reward.value > reward.value:  # Recovery succeeded
                    recoveries += 1
    return recoveries / failures if failures > 0 else 0.0

def is_recovery_action(failed_action: Action, next_action: Action) -> bool:
    """Is next_action a recovery attempt for failed_action?"""
    # Same action type with different parameters
    if failed_action.action_type == next_action.action_type:
        if failed_action.selector != next_action.selector:
            return True  # Tried different selector
    # Switched to alternative action type
    recovery_alternatives = {
        "EXTRACT_FIELD": ["SEARCH_PAGE", "INSPECT_ELEMENT"],
        "NAVIGATE": ["FETCH_URL"],       # Try direct fetch if navigate blocked
        "SEARCH_ENGINE": ["NAVIGATE"],   # Try direct URL if search fails
    }
    if next_action.action_type in recovery_alternatives.get(failed_action.action_type, []):
        return True
    return False
```
**Example:**
```python
# Good recovery:
history = [
    (Action(type="EXTRACT_FIELD", selector=".price"), Reward(value=-0.1, message="Not found")),
    (Action(type="SEARCH_PAGE", query="price"), Reward(value=0.2, message="Found price pattern")),
    (Action(type="EXTRACT_FIELD", selector="span.product-price"), Reward(value=0.5, message="Extracted"))
]
recovery_score = 1/1 = 1.0  # 1 failure, 1 successful recovery
# No recovery:
history = [
    (Action(type="EXTRACT_FIELD", selector=".price"), Reward(value=-0.1)),
    (Action(type="EXTRACT_FIELD", selector=".price"), Reward(value=-0.1)),  # Repeated same failed action!
    (Action(type="SUBMIT"), Reward(value=0.0))
]
recovery_score = 0/2 = 0.0 # 2 failures, 0 recoveries
```
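A self-contained version of the recovery logic, using minimal stand-ins for `Action` and `Reward` (the real types carry more fields):

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Action:
    action_type: str
    selector: Optional[str] = None

@dataclass
class Reward:
    value: float
    message: str = ""

RECOVERY_ALTERNATIVES = {
    "EXTRACT_FIELD": ["SEARCH_PAGE", "INSPECT_ELEMENT"],
    "NAVIGATE": ["FETCH_URL"],
    "SEARCH_ENGINE": ["NAVIGATE"],
}

def is_recovery_action(failed: Action, nxt: Action) -> bool:
    # Same action type retried with a different selector counts as recovery
    if failed.action_type == nxt.action_type and failed.selector != nxt.selector:
        return True
    # Switching to a known alternative action type also counts
    return nxt.action_type in RECOVERY_ALTERNATIVES.get(failed.action_type, [])

def recovery_ability_score(history: List[Tuple[Action, Reward]]) -> float:
    recoveries = failures = 0
    for (action, reward), (nxt, nxt_reward) in zip(history, history[1:]):
        if reward.value < 0 or "failed" in reward.message.lower():
            failures += 1
            if is_recovery_action(action, nxt) and nxt_reward.value > reward.value:
                recoveries += 1
    return recoveries / failures if failures > 0 else 0.0

history = [
    (Action("EXTRACT_FIELD", ".price"), Reward(-0.1, "Not found")),
    (Action("SEARCH_PAGE"), Reward(0.2, "Found price pattern")),
]
print(recovery_ability_score(history))  # 1 failure, 1 recovery -> 1.0
```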
---
## exploration-bonus
### 5-exploration-bonus-w5-0-05
**Purpose:** Encourage discovering new pages and patterns early in training.
**Calculation:**
```python
import math

def exploration_bonus(
    pages_visited: List[str],
    known_pages: Set[str],  # From long-term memory
    episode_number: int
) -> float:
    """Bonus for discovering new pages/patterns."""
    new_pages = set(pages_visited) - known_pages
    # Bonus decreases over time (we want agent to eventually exploit)
    decay_factor = math.exp(-0.01 * episode_number)
    # Bonus per new page discovered
    bonus_per_page = 0.1
    return min(len(new_pages) * bonus_per_page * decay_factor, 1.0)
```
**Example:**
```python
# Episode 10: Agent discovers 3 new pages
exploration_bonus = 3 * 0.1 * exp(-0.01*10) = 0.3 * 0.90 = 0.27
# Episode 500: Same discovery
exploration_bonus = 3 * 0.1 * exp(-0.01*500) = 0.3 * 0.007 = 0.002 # Minimal bonus now
```
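The bonus function needs nothing beyond the standard library, so the decay can be checked directly:

```python
import math
from typing import List, Set

def exploration_bonus(pages_visited: List[str],
                      known_pages: Set[str],
                      episode_number: int) -> float:
    new_pages = set(pages_visited) - known_pages
    # Exponential decay shifts the agent from exploration to exploitation
    decay_factor = math.exp(-0.01 * episode_number)
    return min(len(new_pages) * 0.1 * decay_factor, 1.0)

print(round(exploration_bonus(["/a", "/b", "/c"], set(), 10), 3))   # ~0.271
print(round(exploration_bonus(["/a", "/b", "/c"], set(), 500), 3))  # ~0.002
```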
---
## redundancy-penalty
### 6-redundancy-penalty-penalty-not-bonus
**Purpose:** Penalize visiting the same page repeatedly without progress.
**Calculation:**
```python
def redundancy_penalty(pages_visited: List[str]) -> float:
    """Penalty for revisiting pages."""
    from collections import Counter
    visit_counts = Counter(pages_visited)
    penalty = 0.0
    for page, count in visit_counts.items():
        if count > 1:
            # Superlinear penalty for repeat visits
            penalty += 0.05 * (count - 1) ** 1.5
    return min(penalty, 1.0)
```
**Example:**
```python
pages = ["/page1", "/page2", "/page1", "/page1", "/page3"]
# page1 visited 3 times
redundancy_penalty = 0.05 * (3-1)**1.5 = 0.05 * 2.83 = 0.14
```
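This one is also runnable as-is, which makes the worked example easy to verify:

```python
from collections import Counter
from typing import List

def redundancy_penalty(pages_visited: List[str]) -> float:
    # Superlinear penalty per page: 0.05 * (visits - 1)^1.5, capped at 1.0
    penalty = sum(
        0.05 * (count - 1) ** 1.5
        for count in Counter(pages_visited).values()
        if count > 1
    )
    return min(penalty, 1.0)

print(round(redundancy_penalty(["/page1", "/page2", "/page1", "/page1", "/page3"]), 2))  # 0.14
```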
---
## generalization-score
### 7-generalization-score-w8-0-07
**Purpose:** Reward strategies that work across different page layouts.
**Measurement:** After training, evaluate agent on unseen task variations.
**Calculation:**
```python
import numpy as np

def generalization_score(
    agent: Agent,
    test_tasks: List[Task],
    training_tasks: List[Task]
) -> float:
    """Test agent on unseen variations of trained tasks."""
    training_ids = {t.id for t in training_tasks}
    test_results = []
    for task in test_tasks:
        # Ensure task is not in training set
        if task.id in training_ids:
            continue
        result = agent.run(task)
        test_results.append(result.completion_score)
    # Average performance on unseen tasks
    return np.mean(test_results) if test_results else 0.0
```
---
## tool-usage-efficiency
### 8-tool-usage-w6-0-05
**Purpose:** Reward using the right tools at the right time.
**Calculation:**
```python
def tool_usage_score(actions: List[Action]) -> float:
    """Reward appropriate tool usage."""
    score = 0.0
    # 1. Used memory appropriately
    memory_actions = [a for a in actions if a.action_type in ["READ_MEMORY", "WRITE_MEMORY"]]
    if memory_actions:
        score += 0.3
    # 2. Used MCP tools when appropriate
    mcp_actions = [a for a in actions if a.action_type == "MCP_TOOL_CALL"]
    if mcp_actions:
        score += 0.3
    # 3. Verified important extractions
    verify_actions = [a for a in actions if a.action_type == "VERIFY_FACT"]
    extract_actions = [a for a in actions if a.action_type == "EXTRACT_FIELD"]
    if verify_actions and extract_actions:
        verification_ratio = len(verify_actions) / len(extract_actions)
        score += 0.4 * min(verification_ratio, 1.0)
    return min(score, 1.0)
```
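A compact, self-contained sketch of the same scoring, operating on action-type strings (the `Action` dataclass is a minimal stand-in):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Action:
    action_type: str

def tool_usage_score(actions: List[Action]) -> float:
    score = 0.0
    types = [a.action_type for a in actions]
    # Memory use: read or write at least once
    if any(t in ("READ_MEMORY", "WRITE_MEMORY") for t in types):
        score += 0.3
    # MCP tool use
    if "MCP_TOOL_CALL" in types:
        score += 0.3
    # Verification ratio: verifications relative to extractions, capped at 1
    verifies = types.count("VERIFY_FACT")
    extracts = types.count("EXTRACT_FIELD")
    if verifies and extracts:
        score += 0.4 * min(verifies / extracts, 1.0)
    return min(score, 1.0)

actions = [Action("WRITE_MEMORY"), Action("EXTRACT_FIELD"), Action("VERIFY_FACT")]
print(round(tool_usage_score(actions), 2))  # 0.3 + 0.4*1.0 = 0.7
```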
---
## memory-utilization
### 9-memory-usage-w7-0-05
**Purpose:** Reward effective use of memory system.
**Calculation:**
```python
def memory_usage_score(episode: Episode) -> float:
    """Reward effective memory usage."""
    score = 0.0
    # 1. Did agent query long-term memory for similar patterns?
    if episode.memory_queries > 0:
        score += 0.4
    # 2. Did agent write successful patterns to long-term memory?
    if episode.memory_writes > 0:
        score += 0.3
    # 3. Did memory queries lead to successful actions?
    if episode.total_actions > 0:
        memory_assisted_success = episode.memory_assisted_actions / episode.total_actions
        score += 0.3 * memory_assisted_success
    return min(score, 1.0)
```
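With a stand-in `Episode` carrying just the counters used here, the scoring can be run directly:

```python
from dataclasses import dataclass

@dataclass
class Episode:  # Minimal stand-in with only the counters used below
    memory_queries: int
    memory_writes: int
    memory_assisted_actions: int
    total_actions: int

def memory_usage_score(episode: Episode) -> float:
    score = 0.0
    if episode.memory_queries > 0:   # Queried long-term memory
        score += 0.4
    if episode.memory_writes > 0:    # Persisted successful patterns
        score += 0.3
    if episode.total_actions > 0:    # Guard against empty episodes
        score += 0.3 * (episode.memory_assisted_actions / episode.total_actions)
    return min(score, 1.0)

print(round(memory_usage_score(Episode(2, 1, 3, 10)), 2))  # 0.4 + 0.3 + 0.3*0.3 = 0.79
```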
---
## final-reward-formula
### complete-formula
```python
def calculate_reward(episode: Episode, config: RewardConfig) -> Reward:
    """Calculate comprehensive reward."""
    # Positive components
    R_completion = task_completion_score(episode.extracted, episode.ground_truth)
    R_efficiency = efficiency_score(episode.steps, episode.max_steps, len(episode.pages))
    R_planning = planning_quality_score(episode.actions)
    R_recovery = recovery_ability_score(episode.history)
    R_exploration = exploration_bonus(episode.pages, episode.memory.known_pages, episode.number)
    R_tools = tool_usage_score(episode.actions)
    R_memory = memory_usage_score(episode)
    R_generalization = generalization_score(episode.agent, episode.test_tasks, episode.training_tasks)
    # Penalties
    P_redundancy = redundancy_penalty(episode.pages)
    P_timeout = 1.0 if episode.timed_out else 0.0
    P_invalid = sum(1 for a in episode.actions if not a.valid) * 0.1
    # Weighted sum
    w = config.weights
    reward_value = (
        w.completion * R_completion +
        w.efficiency * R_efficiency +
        w.planning * R_planning +
        w.recovery * R_recovery +
        w.exploration * R_exploration +
        w.tools * R_tools +
        w.memory * R_memory +
        w.generalization * R_generalization
    ) - (P_redundancy + P_timeout + P_invalid)
    # Clamp to [-1, 1]
    reward_value = max(-1.0, min(1.0, reward_value))
    # Build breakdown for interpretability
    breakdown = {
        "task_completion": R_completion,
        "efficiency": R_efficiency,
        "planning_quality": R_planning,
        "recovery_ability": R_recovery,
        "exploration_bonus": R_exploration,
        "tool_usage": R_tools,
        "memory_usage": R_memory,
        "generalization": R_generalization,
        "redundancy_penalty": -P_redundancy,
        "timeout_penalty": -P_timeout,
        "invalid_action_penalty": -P_invalid
    }
    # Generate explanation
    message = generate_reward_explanation(breakdown, reward_value)
    return Reward(
        value=reward_value,
        cumulative=episode.cumulative_reward + reward_value,
        breakdown=breakdown,
        message=message
    )
```
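To see how the weighted sum and clamp behave numerically, here is a minimal standalone sketch; the component scores and penalty are made-up illustrative values, and the weights are the defaults listed below:

```python
# Illustrative component scores (not from a real episode)
components = {
    "completion": 0.9, "efficiency": 0.6, "planning": 0.5, "recovery": 1.0,
    "exploration": 0.2, "tools": 0.6, "memory": 0.4, "generalization": 0.5,
}
# Default weights (sum to 0.95)
weights = {
    "completion": 0.40, "efficiency": 0.15, "planning": 0.10, "recovery": 0.08,
    "exploration": 0.05, "tools": 0.05, "memory": 0.05, "generalization": 0.07,
}
penalties = 0.1  # e.g. a small redundancy penalty

raw = sum(weights[k] * components[k] for k in weights) - penalties
reward_value = max(-1.0, min(1.0, raw))  # Clamp to [-1, 1]
print(round(reward_value, 3))  # 0.675 - 0.1 = 0.575
```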
### default-weights
```python
class RewardWeights(BaseModel):
    completion: float = 0.40       # Most important
    efficiency: float = 0.15       # Moderate importance
    planning: float = 0.10         # Encourages good habits
    recovery: float = 0.08         # Resilience
    exploration: float = 0.05      # Early training
    tools: float = 0.05            # Appropriate tool use
    memory: float = 0.05           # Effective memory
    generalization: float = 0.07   # Transfer learning
    # Total: 0.95, leaves room for penalties
```
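Since the defaults intentionally sum to 0.95, a small validation hook can catch misconfigured weights early. This sketch uses a plain dataclass rather than pydantic's `BaseModel` to stay dependency-free; `validate_weights` and the 0.95 target are illustrative, not part of the project's API:

```python
from dataclasses import dataclass, astuple

@dataclass
class RewardWeights:
    completion: float = 0.40
    efficiency: float = 0.15
    planning: float = 0.10
    recovery: float = 0.08
    exploration: float = 0.05
    tools: float = 0.05
    memory: float = 0.05
    generalization: float = 0.07

def validate_weights(w: RewardWeights, expected_total: float = 0.95) -> None:
    # Reject configurations whose weights no longer sum to the intended budget
    total = sum(astuple(w))
    if abs(total - expected_total) > 1e-6:
        raise ValueError(f"Reward weights sum to {total:.2f}, expected {expected_total}")

validate_weights(RewardWeights())  # Defaults pass: 0.40 + 0.15 + ... + 0.07 = 0.95
```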
---
## configuration
### settings
```typescript
interface RewardConfig {
  weights: RewardWeights;
  // Component toggles
  enablePlanningReward: boolean;
  enableRecoveryReward: boolean;
  enableExplorationBonus: boolean;
  enableGeneralizationTest: boolean;
  // Penalty settings
  redundancyThreshold: number;     // Penalize after N visits to same page
  timeoutPenalty: number;          // Penalty for exceeding time limit
  invalidActionPenalty: number;    // Penalty per invalid action
  // Exploration decay
  explorationDecayRate: number;    // Default: 0.01
  // Generalization
  testTaskCount: number;           // Number of unseen tasks to test on
}
```
### ui-component
A settings control adjusts `explorationDecayRate` ("How quickly exploration bonus decreases over episodes").
---
## reward-visualization
```jsx
{/* Sketch of the breakdown display; the bar component name is illustrative */}
{Object.entries(breakdown).map(([component, value]) => (
  <RewardBar
    key={component}
    label={component}
    value={value}
    color={value >= 0 ? 'green' : 'red'}
  />
))}
<p>{reward.message}</p>
```
**Example Output:**
```
Reward Breakdown (Total: 0.72)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Task Completion: ████████████████████ 0.85
Efficiency: ████████████░░░░░░░░ 0.65
Planning Quality: ███████████████░░░░░ 0.78
Recovery Ability: ██████████████████░░ 0.90
Exploration: ████░░░░░░░░░░░░░░░░ 0.20
Tool Usage: ███████████████████░ 0.95
Memory Usage: ████████░░░░░░░░░░░░ 0.40
Generalization: ██████████████░░░░░░ 0.72
Redundancy Penalty: ░░░░░░░░░░░░░░░░░░░░ -0.15
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Explanation:
Excellent task completion (85% of fields extracted correctly)
Good efficiency (completed in 8/20 steps)
Strong recovery ability (recovered from 2/2 failures)
Moderate redundancy (visited homepage 3 times)
→ Overall: Strong performance!
```
---
**Next:** See [html-processing.md](./html-processing.md) for advanced HTML handling.
## related-api-reference
| item | value |
| --- | --- |
| api-reference | `api-reference.md` |
## document-metadata
| key | value |
| --- | --- |
| document | `rewards.md` |
| status | active |
## document-flow
```mermaid
flowchart TD
A[document] --> B[key-sections]
B --> C[implementation]
B --> D[operations]
B --> E[validation]
```