# Advanced Reward Function

## Table of Contents

1. [Overview](#overview)
2. [Reward Components](#reward-components)
3. [Planning Quality](#planning-quality)
4. [Recovery Ability](#recovery-ability)
5. [Exploration Bonus](#exploration-bonus)
6. [Redundancy Penalty](#redundancy-penalty)
7. [Generalization Score](#generalization-score)
8. [Tool Usage Efficiency](#tool-usage-efficiency)
9. [Memory Utilization](#memory-utilization)
10. [Final Reward Formula](#final-reward-formula)
11. [Configuration](#configuration)

---
## Overview

The **Advanced Reward Function** provides dense, interpretable signals that guide the agent toward intelligent, efficient, and generalizable web scraping strategies.

### Design Principles

1. **Dense Rewards:** Provide feedback at every step, not just at terminal states
2. **Interpretable:** Each component has a clear purpose that agents (and humans) can understand
3. **Balanced:** Prevent reward hacking by balancing conflicting objectives
4. **Adaptive:** Adjust weights based on task difficulty and agent progress (see the sketch below)
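
Principles 1–3 are reflected in the components that follow; principle 4 has no code anywhere in this document, so here is a minimal sketch of what adaptive weighting could look like. The episode threshold, success-rate cutoff, and shift amounts are illustrative assumptions, and `RewardWeights` refers to the model defined under [Default Weights](#final-reward-formula):

```python
def adapt_weights(weights: "RewardWeights", episode_number: int,
                  recent_success_rate: float) -> "RewardWeights":
    """Illustrative only: shift weight between components as training progresses."""
    adapted = weights.model_copy()  # pydantic v2; use .copy() on v1
    if episode_number >= 100:
        # Later training: move half of the exploration budget into completion.
        adapted.completion += adapted.exploration * 0.5
        adapted.exploration *= 0.5
    if recent_success_rate < 0.3:
        # A struggling agent gets denser shaping from planning and recovery.
        adapted.planning += 0.05
        adapted.recovery += 0.05
        adapted.completion = max(0.0, adapted.completion - 0.10)  # keep the total budget roughly constant
    return adapted
```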
### Basic vs. Advanced

**Basic Reward (existing):**

```python
reward = task_completion_score  # 0.0 to 1.0
```

**Advanced Reward:**

```python
reward = (
    w1 * task_completion +
    w2 * efficiency +
    w3 * planning_quality +
    w4 * recovery_ability +
    w5 * exploration_bonus +
    w6 * tool_usage +
    w7 * memory_usage +
    w8 * generalization
) - penalties
```

---
## Reward Components

### 1. Task Completion (w1 = 0.40)

**Purpose:** Measure how much of the task is complete.

**Calculation:**

```python
from typing import Dict

def task_completion_score(extracted: Dict, ground_truth: Dict) -> float:
    """Score based on field completeness and accuracy."""
    if not ground_truth:
        return 0.0
    total_fields = len(ground_truth)
    correct_fields = 0
    partial_fields = 0
    for field, true_value in ground_truth.items():
        extracted_value = extracted.get(field)
        if extracted_value is None:
            continue  # Missing field, 0 points
        # Exact match
        if normalize(extracted_value) == normalize(true_value):
            correct_fields += 1
        # Partial match (fuzzy)
        elif similarity(extracted_value, true_value) > 0.7:
            partial_fields += 1
    score = (correct_fields + 0.5 * partial_fields) / total_fields
    return score
```
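
`normalize` and `similarity` are used above but never defined in this document. A minimal sketch using only the standard library — the exact normalization rules and fuzzy matcher are assumptions:

```python
import re
from difflib import SequenceMatcher

def normalize(value) -> str:
    """Lowercase and collapse whitespace for exact comparison (assumed behavior)."""
    text = str(value).strip().lower()
    return re.sub(r"\s+", " ", text)

def similarity(a, b) -> float:
    """Fuzzy match ratio in [0, 1] via difflib (one possible choice)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()
```

With these helpers, `task_completion_score(extracted, ground_truth)` in the example below evaluates to 2/3 ≈ 0.67.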
**Example:**

```python
# Task: Extract name, price, rating
ground_truth = {"name": "Widget Pro", "price": "$49.99", "rating": "4.5"}

# Agent extracted 2/3 fields correctly
extracted = {"name": "Widget Pro", "price": "$49.99", "rating": None}

task_completion = 2 / 3  # = 0.67
```

---
### 2. Efficiency (w2 = 0.15)

**Purpose:** Reward completing tasks quickly with fewer actions.

**Calculation:**

```python
def efficiency_score(steps_taken: int, max_steps: int,
                     pages_visited: int, task) -> float:
    """Fewer steps and fewer pages = higher efficiency."""
    # Step efficiency
    step_efficiency = 1.0 - (steps_taken / max_steps)

    # Page efficiency (prefer a page count close to the ideal)
    ideal_pages = max(1, estimate_ideal_page_count(task))  # guard against division by zero
    page_efficiency = 1.0 - abs(pages_visited - ideal_pages) / ideal_pages
    page_efficiency = max(0.0, page_efficiency)

    return 0.7 * step_efficiency + 0.3 * page_efficiency
```
**Example (step-efficiency term only):**

```python
# Task with max 20 steps
steps_taken = 8
step_efficiency = 1.0 - (8 / 20)   # = 0.60 -- good!

steps_taken = 18
step_efficiency = 1.0 - (18 / 20)  # = 0.10 -- inefficient
```
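
`estimate_ideal_page_count` is referenced but not defined here. One plausible heuristic — an assumption, the real estimator may use task metadata instead — derives it from the task's target URLs:

```python
def estimate_ideal_page_count(task) -> int:
    """Hypothetical helper: one page per distinct target URL, at least one page."""
    target_urls = getattr(task, "target_urls", None) or []
    return max(1, len(set(target_urls)))
```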
---

## Planning Quality

### 3. Planning Quality Score (w3 = 0.10)

**Purpose:** Reward agents that plan before acting.

**Signals:**

- Used WRITE_MEMORY with reasoning notes
- Actions follow a coherent strategy
- Fewer backtracking actions

**Calculation:**

```python
from typing import List

def planning_quality_score(episode_history: List["Action"]) -> float:
    """Measure planning behavior."""
    score = 0.0

    # 1. Did the agent write reasoning notes?
    reasoning_actions = [a for a in episode_history if a.notes]
    if reasoning_actions:
        score += 0.3

    # 2. Action coherence: do actions follow a logical sequence?
    coherence = measure_action_coherence(episode_history)
    score += 0.4 * coherence

    # 3. Backtracking penalty: visiting the same page multiple times
    unique_pages = len(set(a.navigate_to for a in episode_history if a.navigate_to))
    total_navigations = len([a for a in episode_history if a.action_type == "NAVIGATE"])
    if total_navigations > 0:
        backtrack_ratio = 1.0 - (unique_pages / total_navigations)
        score += 0.3 * (1.0 - backtrack_ratio)  # Less backtracking = higher score

    return min(score, 1.0)

def measure_action_coherence(actions: List["Action"]) -> float:
    """Are consecutive actions logically connected?"""
    coherence_patterns = [
        # Good patterns
        ("SEARCH_PAGE", "EXTRACT_FIELD"),  # Search, then extract
        ("NAVIGATE", "EXTRACT_FIELD"),     # Navigate, then extract
        ("EXTRACT_FIELD", "VERIFY_FACT"),  # Extract, then verify
        ("SEARCH_ENGINE", "NAVIGATE"),     # Search, then visit
    ]
    coherent_pairs = 0
    total_pairs = len(actions) - 1
    for i in range(total_pairs):
        pair = (actions[i].action_type, actions[i + 1].action_type)
        if pair in coherence_patterns:
            coherent_pairs += 1
    return coherent_pairs / total_pairs if total_pairs > 0 else 0.0
```
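
The `Action` objects consumed above (and the `(Action, Reward)` tuples in the next section) are never defined in this document. The field names below are inferred from usage here and should be read as a sketch, not a published schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Action:
    action_type: str                   # e.g. "NAVIGATE", "EXTRACT_FIELD"
    selector: Optional[str] = None     # CSS selector, when applicable
    navigate_to: Optional[str] = None  # target URL for NAVIGATE actions
    query: Optional[str] = None        # search query, when applicable
    notes: Optional[str] = None        # free-form reasoning notes
    valid: bool = True                 # False for rejected/malformed actions

@dataclass
class Reward:
    value: float
    message: str = ""
    cumulative: float = 0.0
    breakdown: Optional[dict] = None
```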
**Example:**

```python
# Good planning:
actions = [
    Action(action_type="SEARCH_PAGE", notes="Looking for price pattern"),
    Action(action_type="EXTRACT_FIELD", selector=".price"),
    Action(action_type="VERIFY_FACT"),
]
# Both adjacent pairs match a coherence pattern (2/2 = 1.0), and with no
# NAVIGATE actions the backtracking term is skipped:
planning_score = 0.3 + 0.4 * 1.0  # = 0.70 (notes + coherence)

# Poor planning:
actions = [
    Action(action_type="NAVIGATE", navigate_to="/page1"),
    Action(action_type="NAVIGATE", navigate_to="/page2"),
    Action(action_type="NAVIGATE", navigate_to="/page1"),  # Backtrack!
    Action(action_type="EXTRACT_FIELD"),
]
# Only one of three pairs is coherent (0.33); 2 unique pages over
# 3 navigations gives a backtrack ratio of 0.33:
planning_score = 0.0 + 0.4 * 0.33 + 0.3 * 0.67  # ≈ 0.33 -- still well below 0.70
```
---

## Recovery Ability

### 4. Recovery Ability Score (w4 = 0.08)

**Purpose:** Reward agents that recover from failures.

**Signals:**

- Action failed → agent tried an alternative approach
- Extraction returned empty → agent searched with a different selector
- Page blocked → agent switched proxy/VPN

**Calculation:**

```python
from typing import List, Tuple

def recovery_ability_score(episode_history: List[Tuple["Action", "Reward"]]) -> float:
    """Measure the ability to recover from failures."""
    recoveries = 0
    failures = 0
    for i in range(len(episode_history) - 1):
        action, reward = episode_history[i]
        next_action, next_reward = episode_history[i + 1]
        # Detect failure (negative reward or an explicit failure message)
        if reward.value < 0 or "failed" in (reward.message or "").lower():
            failures += 1
            # Check whether the next action was a recovery attempt
            if is_recovery_action(action, next_action):
                if next_reward.value > reward.value:  # Recovery succeeded
                    recoveries += 1
    return recoveries / failures if failures > 0 else 0.0

def is_recovery_action(failed_action: "Action", next_action: "Action") -> bool:
    """Is next_action a recovery attempt for failed_action?"""
    # Same action type with different parameters
    if failed_action.action_type == next_action.action_type:
        if failed_action.selector != next_action.selector:
            return True  # Tried a different selector
    # Switched to an alternative action type
    recovery_alternatives = {
        "EXTRACT_FIELD": ["SEARCH_PAGE", "INSPECT_ELEMENT"],
        "NAVIGATE": ["FETCH_URL"],      # Try a direct fetch if navigation is blocked
        "SEARCH_ENGINE": ["NAVIGATE"],  # Try a direct URL if search fails
    }
    if next_action.action_type in recovery_alternatives.get(failed_action.action_type, []):
        return True
    return False
```
**Example:**

```python
# Good recovery:
history = [
    (Action(action_type="EXTRACT_FIELD", selector=".price"),
     Reward(value=-0.1, message="Not found")),
    (Action(action_type="SEARCH_PAGE", query="price"),
     Reward(value=0.2, message="Found price pattern")),
    (Action(action_type="EXTRACT_FIELD", selector="span.product-price"),
     Reward(value=0.5, message="Extracted")),
]
recovery_score = 1 / 1  # = 1.0: one failure, one successful recovery

# No recovery:
history = [
    (Action(action_type="EXTRACT_FIELD", selector=".price"), Reward(value=-0.1)),
    (Action(action_type="EXTRACT_FIELD", selector=".price"), Reward(value=-0.1)),  # Same failed action repeated!
    (Action(action_type="SUBMIT"), Reward(value=0.0)),
]
recovery_score = 0 / 2  # = 0.0: two failures, zero recoveries
```

---
## Exploration Bonus

### 5. Exploration Bonus (w5 = 0.05)

**Purpose:** Encourage discovering new pages and patterns early in training.

**Calculation:**

```python
import math
from typing import List, Set

def exploration_bonus(
    pages_visited: List[str],
    known_pages: Set[str],  # From long-term memory
    episode_number: int
) -> float:
    """Bonus for discovering new pages/patterns."""
    new_pages = set(pages_visited) - known_pages

    # The bonus decays over time (we want the agent to eventually exploit)
    decay_factor = math.exp(-0.01 * episode_number)

    # Bonus per new page discovered
    bonus_per_page = 0.1

    return min(len(new_pages) * bonus_per_page * decay_factor, 1.0)
```

**Example:**

```python
# Episode 10: agent discovers 3 new pages
exploration_bonus = 3 * 0.1 * math.exp(-0.01 * 10)   # = 0.3 * 0.90  = 0.27

# Episode 500: same discovery
exploration_bonus = 3 * 0.1 * math.exp(-0.01 * 500)  # = 0.3 * 0.007 = 0.002 -- minimal bonus now
```

---
## Redundancy Penalty

### 6. Redundancy Penalty (a penalty, not a bonus)

**Purpose:** Penalize visiting the same page repeatedly without progress.

**Calculation:**

```python
from collections import Counter
from typing import List

def redundancy_penalty(pages_visited: List[str]) -> float:
    """Penalty for revisiting pages."""
    visit_counts = Counter(pages_visited)
    penalty = 0.0
    for page, count in visit_counts.items():
        if count > 1:
            # Superlinear penalty for repeat visits
            penalty += 0.05 * (count - 1) ** 1.5
    return min(penalty, 1.0)
```

**Example:**

```python
pages = ["/page1", "/page2", "/page1", "/page1", "/page3"]
# /page1 visited 3 times
redundancy_penalty = 0.05 * (3 - 1) ** 1.5  # = 0.05 * 2.83 = 0.14
```

---
## Generalization Score

### 7. Generalization Score (w8 = 0.07)

**Purpose:** Reward strategies that work across different page layouts.

**Measurement:** After training, evaluate the agent on unseen task variations.

**Calculation:**

```python
from typing import List

import numpy as np

def generalization_score(
    agent: "Agent",
    test_tasks: List["Task"],
    training_tasks: List["Task"]
) -> float:
    """Test the agent on unseen variations of trained tasks."""
    training_ids = {t.id for t in training_tasks}
    test_results = []
    for task in test_tasks:
        # Ensure the task is not in the training set
        if task.id in training_ids:
            continue
        result = agent.run(task)
        test_results.append(result.completion_score)
    # Average performance on unseen tasks
    return float(np.mean(test_results)) if test_results else 0.0
```
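
Note that this sweeps the agent over the whole test set, which is too expensive to repeat on every reward computation (the final formula below calls it directly). A cadence-and-cache wrapper — an illustrative assumption, not part of the documented API — keeps the per-episode cost near zero:

```python
# Hypothetical wrapper: re-run the unseen-task sweep only every `every` episodes.
_generalization_cache = {"episode": -1, "score": 0.0}

def cached_generalization_score(agent, test_tasks, training_tasks,
                                episode_number: int, every: int = 50) -> float:
    if episode_number - _generalization_cache["episode"] >= every:
        _generalization_cache["score"] = generalization_score(
            agent, test_tasks, training_tasks)
        _generalization_cache["episode"] = episode_number
    return _generalization_cache["score"]
```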
---

## Tool Usage Efficiency

### 8. Tool Usage (w6 = 0.05)

**Purpose:** Reward using the right tools at the right time.

**Calculation:**

```python
from typing import List

def tool_usage_score(actions: List["Action"]) -> float:
    """Reward appropriate tool usage."""
    score = 0.0

    # 1. Used memory appropriately
    memory_actions = [a for a in actions if a.action_type in ["READ_MEMORY", "WRITE_MEMORY"]]
    if memory_actions:
        score += 0.3

    # 2. Used MCP tools when appropriate
    mcp_actions = [a for a in actions if a.action_type == "MCP_TOOL_CALL"]
    if mcp_actions:
        score += 0.3

    # 3. Verified important extractions
    verify_actions = [a for a in actions if a.action_type == "VERIFY_FACT"]
    extract_actions = [a for a in actions if a.action_type == "EXTRACT_FIELD"]
    if verify_actions and extract_actions:
        verification_ratio = len(verify_actions) / len(extract_actions)
        score += 0.4 * min(verification_ratio, 1.0)

    return min(score, 1.0)
```

---
## Memory Utilization

### 9. Memory Usage (w7 = 0.05)

**Purpose:** Reward effective use of the memory system.

**Calculation:**

```python
def memory_usage_score(episode: "Episode") -> float:
    """Reward effective memory usage."""
    score = 0.0

    # 1. Did the agent query long-term memory for similar patterns?
    if episode.memory_queries > 0:
        score += 0.4

    # 2. Did the agent write successful patterns to long-term memory?
    if episode.memory_writes > 0:
        score += 0.3

    # 3. Did memory queries lead to successful actions?
    if episode.total_actions > 0:  # guard against division by zero
        memory_assisted_success = episode.memory_assisted_actions / episode.total_actions
        score += 0.3 * memory_assisted_success

    return min(score, 1.0)
```
| ``` | |
| --- | |
| ## final-reward-formula | |
| ### complete-formula | |
| ```python | |
| def calculate_reward(episode: Episode, config: RewardConfig) -> Reward: | |
| """Calculate comprehensive reward.""" | |
| # Positive components | |
| R_completion = task_completion_score(episode.extracted, episode.ground_truth) | |
| R_efficiency = efficiency_score(episode.steps, episode.max_steps, len(episode.pages)) | |
| R_planning = planning_quality_score(episode.actions) | |
| R_recovery = recovery_ability_score(episode.history) | |
| R_exploration = exploration_bonus(episode.pages, episode.memory.known_pages, episode.number) | |
| R_tools = tool_usage_score(episode.actions) | |
| R_memory = memory_usage_score(episode) | |
| R_generalization = generalization_score(episode.agent, episode.test_tasks, episode.training_tasks) | |
| # Penalties | |
| P_redundancy = redundancy_penalty(episode.pages) | |
| P_timeout = 1.0 if episode.timed_out else 0.0 | |
| P_invalid = sum(1 for a in episode.actions if not a.valid) * 0.1 | |
| # Weighted sum | |
| w = config.weights | |
| reward_value = ( | |
| w.completion * R_completion + | |
| w.efficiency * R_efficiency + | |
| w.planning * R_planning + | |
| w.recovery * R_recovery + | |
| w.exploration * R_exploration + | |
| w.tools * R_tools + | |
| w.memory * R_memory + | |
| w.generalization * R_generalization | |
| ) - (P_redundancy + P_timeout + P_invalid) | |
| # Clamp to [-1, 1] | |
| reward_value = max(-1.0, min(1.0, reward_value)) | |
| # Build breakdown for interpretability | |
| breakdown = { | |
| "task_completion": R_completion, | |
| "efficiency": R_efficiency, | |
| "planning_quality": R_planning, | |
| "recovery_ability": R_recovery, | |
| "exploration_bonus": R_exploration, | |
| "tool_usage": R_tools, | |
| "memory_usage": R_memory, | |
| "generalization": R_generalization, | |
| "redundancy_penalty": -P_redundancy, | |
| "timeout_penalty": -P_timeout, | |
| "invalid_action_penalty": -P_invalid | |
| } | |
| # Generate explanation | |
| message = generate_reward_explanation(breakdown, reward_value) | |
| return Reward( | |
| value=reward_value, | |
| cumulative=episode.cumulative_reward + reward_value, | |
| breakdown=breakdown, | |
| message=message | |
| ) | |
| ``` | |
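
`generate_reward_explanation` is called above but not shown. A minimal sketch that turns the breakdown into the human-readable summary used in the visualization section — the phrasing thresholds are assumptions:

```python
def generate_reward_explanation(breakdown: dict, total: float) -> str:
    """Summarize the largest contributors (hypothetical implementation)."""
    lines = []
    for component, value in sorted(breakdown.items(),
                                   key=lambda kv: abs(kv[1]), reverse=True):
        if abs(value) < 0.05:
            continue  # skip negligible components
        verdict = ("strong" if value >= 0.7
                   else "moderate" if value >= 0.3
                   else "weak" if value >= 0.0
                   else "penalty")
        lines.append(f"{component}: {value:+.2f} ({verdict})")
    lines.append(f"Overall reward: {total:+.2f}")
    return "\n".join(lines)
```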
### Default Weights

```python
from pydantic import BaseModel

class RewardWeights(BaseModel):
    completion: float = 0.40      # Most important
    efficiency: float = 0.15      # Moderate importance
    planning: float = 0.10        # Encourages good habits
    recovery: float = 0.08        # Resilience
    exploration: float = 0.05     # Early training
    tools: float = 0.05           # Appropriate tool use
    memory: float = 0.05          # Effective memory
    generalization: float = 0.07  # Transfer learning
    # Total: 0.95, which leaves headroom for penalties
```
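
The 0.95 budget is only enforced by a comment. A validator — a suggested addition assuming pydantic v2, not shipped code — could make misconfigured presets fail fast:

```python
from pydantic import model_validator

class ValidatedRewardWeights(RewardWeights):
    """RewardWeights plus a sum check (suggested, not part of the original model)."""

    @model_validator(mode="after")
    def check_budget(self):
        total = (self.completion + self.efficiency + self.planning +
                 self.recovery + self.exploration + self.tools +
                 self.memory + self.generalization)
        if total > 1.0:
            raise ValueError(f"weights sum to {total:.2f}, must be <= 1.0")
        return self
```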
---

## Configuration

### Settings

```typescript
interface RewardConfig {
  weights: RewardWeights;

  // Component toggles
  enablePlanningReward: boolean;
  enableRecoveryReward: boolean;
  enableExplorationBonus: boolean;
  enableGeneralizationTest: boolean;

  // Penalty settings
  redundancyThreshold: number;   // Penalize after N visits to the same page
  timeoutPenalty: number;        // Penalty for exceeding the time limit
  invalidActionPenalty: number;  // Penalty per invalid action

  // Exploration decay
  explorationDecayRate: number;  // Default: 0.01

  // Generalization
  testTaskCount: number;         // Number of unseen tasks to test on
}
```
### UI Component

```jsx
<RewardSettings>
  <Section title="Component Weights">
    <Slider label="Task Completion" value={weights.completion} min={0} max={1} step={0.05} />
    <Slider label="Efficiency" value={weights.efficiency} min={0} max={1} step={0.05} />
    <Slider label="Planning Quality" value={weights.planning} min={0} max={1} step={0.05} />
    <Slider label="Recovery Ability" value={weights.recovery} min={0} max={1} step={0.05} />
    <Slider label="Exploration Bonus" value={weights.exploration} min={0} max={1} step={0.05} />
    <Slider label="Tool Usage" value={weights.tools} min={0} max={1} step={0.05} />
    <Slider label="Memory Usage" value={weights.memory} min={0} max={1} step={0.05} />
    <Slider label="Generalization" value={weights.generalization} min={0} max={1} step={0.05} />
    <TotalWeight value={Object.values(weights).reduce((a, b) => a + b, 0)} max={1.0} />
  </Section>

  <Section title="Penalties">
    <NumberInput label="Redundancy Threshold (page visits)" value={redundancyThreshold} />
    <NumberInput label="Timeout Penalty" value={timeoutPenalty} min={0} max={1} step={0.1} />
    <NumberInput label="Invalid Action Penalty" value={invalidActionPenalty} min={0} max={1} step={0.1} />
  </Section>

  <Section title="Exploration">
    <NumberInput label="Decay Rate" value={explorationDecayRate} min={0} max={0.1} step={0.001} />
    <HelpText>How quickly the exploration bonus decreases over episodes</HelpText>
  </Section>

  <Section title="Presets">
    <Button onClick={() => loadPreset('balanced')}>Balanced (Default)</Button>
    <Button onClick={() => loadPreset('efficiency_focused')}>Efficiency Focused</Button>
    <Button onClick={() => loadPreset('quality_focused')}>Quality Focused</Button>
    <Button onClick={() => loadPreset('exploration')}>Exploration Mode</Button>
  </Section>
</RewardSettings>
```

---
## Reward Visualization

```jsx
<RewardBreakdown>
  <BarChart>
    {Object.entries(breakdown).map(([component, value]) => (
      <Bar
        key={component}
        label={component}
        value={value}
        color={value >= 0 ? 'green' : 'red'}
      />
    ))}
  </BarChart>
  <TotalReward value={reward.value} />
  <Explanation>{reward.message}</Explanation>
</RewardBreakdown>
```
**Example Output:**

```
Reward Breakdown (Total: 0.72)
──────────────────────────────────────────
Task Completion:     █████████████████░░░  0.85
Efficiency:          █████████████░░░░░░░  0.65
Planning Quality:    ████████████████░░░░  0.78
Recovery Ability:    ██████████████████░░  0.90
Exploration:         ████░░░░░░░░░░░░░░░░  0.20
Tool Usage:          ███████████████████░  0.95
Memory Usage:        ████████░░░░░░░░░░░░  0.40
Generalization:      ██████████████░░░░░░  0.72
Redundancy Penalty:  ███░░░░░░░░░░░░░░░░░ -0.15
──────────────────────────────────────────
Explanation:
Excellent task completion (85% of fields extracted correctly)
Good efficiency (completed in 8/20 steps)
Strong recovery ability (recovered from 2/2 failures)
Moderate redundancy (visited the homepage 3 times)
→ Overall: strong performance!
```
---

**Next:** See [html-processing.md](./html-processing.md) for advanced HTML handling.

## Related API Reference

| Item | Value |
| --- | --- |
| API Reference | `api-reference.md` |

## Document Metadata

| Key | Value |
| --- | --- |
| Document | `rewards.md` |
| Status | active |