# advanced-reward-function

## table-of-contents
1. [Overview](#overview)
2. [Reward Components](#reward-components)
3. [Planning Quality](#planning-quality)
4. [Recovery Ability](#recovery-ability)
5. [Exploration Bonus](#exploration-bonus)
6. [Redundancy Penalty](#redundancy-penalty)
7. [Generalization Score](#generalization-score)
8. [Tool Usage Efficiency](#tool-usage-efficiency)
9. [Memory Utilization](#memory-utilization)
10. [Final Reward Formula](#final-reward-formula)
11. [Configuration](#configuration)

---

## overview

The **Advanced Reward Function** provides dense, interpretable signals that guide the agent toward intelligent, efficient, and generalizable web scraping strategies.

### design-principles

1. **Dense Rewards:** Provide feedback at every step, not just terminal states
2. **Interpretable:** Each component has a clear purpose that both agents and humans can understand
3. **Balanced:** Prevent reward hacking by balancing conflicting objectives
4. **Adaptive:** Adjust weights based on task difficulty and agent progress

### basic-vs-advanced

**Basic Reward (existing):**
```python
reward = task_completion_score  # 0.0 to 1.0
```

**Advanced Reward:**
```python
reward = (
    w1 * task_completion +
    w2 * efficiency +
    w3 * planning_quality +
    w4 * recovery_ability +
    w5 * exploration_bonus +
    w6 * tool_usage +
    w7 * memory_usage +
    w8 * generalization
) - penalties
```

---

## reward-components

### 1-task-completion-w1-0-40

**Purpose:** Measure how much of the task is complete.

**Calculation:**
```python
from typing import Dict

def task_completion_score(extracted: Dict, ground_truth: Dict) -> float:
    """Score based on field completeness and accuracy.

    Relies on `normalize` and `similarity` helpers, which are not shown here.
    """
    if not ground_truth:
        return 0.0
    
    total_fields = len(ground_truth)
    correct_fields = 0
    partial_fields = 0
    
    for field, true_value in ground_truth.items():
        extracted_value = extracted.get(field)
        
        if extracted_value is None:
            continue  # Missing field, 0 points
        
        # Exact match
        if normalize(extracted_value) == normalize(true_value):
            correct_fields += 1
        # Partial match (fuzzy)
        elif similarity(extracted_value, true_value) > 0.7:
            partial_fields += 1
    
    score = (correct_fields + 0.5 * partial_fields) / total_fields
    return score
```

**Example:**
```python
# Task: Extract name, price, rating
ground_truth = {"name": "Widget Pro", "price": "$49.99", "rating": "4.5"}

# Agent extracted 2/3 correctly
extracted = {"name": "Widget Pro", "price": "$49.99", "rating": None}
task_completion = 2 / 3  # ≈ 0.67 (2 of 3 fields correct)
```
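
The scoring function calls `normalize` and `similarity`, which this document does not define. A minimal sketch of what they could look like, assuming string-like field values, case- and whitespace-insensitive comparison, and the standard library's `difflib.SequenceMatcher` as the fuzzy-match measure (the project's real helpers may differ):

```python
from difflib import SequenceMatcher


def normalize(value) -> str:
    """Hypothetical helper: lowercase the value and collapse whitespace."""
    return " ".join(str(value).lower().split())


def similarity(a, b) -> float:
    """Hypothetical helper: similarity ratio in [0, 1] on normalized text."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


# Exact match after normalization
print(normalize("  Widget   PRO ") == normalize("widget pro"))
# Close but not identical: clears the 0.7 partial-match threshold
print(similarity("Widget Pro", "Widget Pro 2") > 0.7)
```

With helpers like these, a value such as `"Widget Pro 2"` would earn half credit rather than zero.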

---

### 2-efficiency-w2-0-15

**Purpose:** Reward completing tasks quickly with fewer actions.

**Calculation:**
```python
def efficiency_score(
    steps_taken: int,
    max_steps: int,
    pages_visited: int,
    ideal_pages: int = 1,
) -> float:
    """Fewer steps and fewer unnecessary page visits = higher efficiency.

    `ideal_pages` is the estimated number of pages the task needs
    (e.g. from estimate_ideal_page_count(task)); it defaults to 1.
    """
    # Step efficiency: fraction of the step budget left unused
    step_efficiency = max(0.0, 1.0 - steps_taken / max_steps)
    
    # Page efficiency: penalize deviation from the ideal page count
    page_efficiency = 1.0 - abs(pages_visited - ideal_pages) / ideal_pages
    page_efficiency = max(0.0, page_efficiency)
    
    return 0.7 * step_efficiency + 0.3 * page_efficiency
```

**Example:**
```python
# Task with max 20 steps (step-efficiency term only)
steps_taken = 8
step_efficiency = 1.0 - (8 / 20)   # 0.60, good

steps_taken = 18
step_efficiency = 1.0 - (18 / 20)  # 0.10, inefficient
```
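
The example covers the step term only. A small sketch of the combined score, assuming the ideal page count is supplied by the caller rather than estimated (the function name `combined_efficiency` is illustrative):

```python
def combined_efficiency(steps_taken: int, max_steps: int,
                        pages_visited: int, ideal_pages: int) -> float:
    """Illustrative only: 70% step efficiency + 30% page efficiency."""
    step_eff = 1.0 - steps_taken / max_steps
    page_eff = max(0.0, 1.0 - abs(pages_visited - ideal_pages) / ideal_pages)
    return 0.7 * step_eff + 0.3 * page_eff


# 8 of 20 steps used, 3 pages visited where 2 would have sufficed:
# step term 0.60, page term 0.50 -> 0.7 * 0.60 + 0.3 * 0.50 = 0.57
score = combined_efficiency(steps_taken=8, max_steps=20, pages_visited=3, ideal_pages=2)
print(round(score, 2))  # 0.57
```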

---

## planning-quality

### 3-planning-quality-score-w3-0-10

**Purpose:** Reward agents that plan before acting.

**Signals:**
- Used WRITE_MEMORY with reasoning notes
- Actions follow a coherent strategy
- Fewer backtracking actions

**Calculation:**
```python
from typing import List

def planning_quality_score(episode_history: List[Action]) -> float:
    """Measure planning behavior."""
    score = 0.0
    
    # 1. Did agent write reasoning notes?
    reasoning_actions = [a for a in episode_history if a.notes]
    if reasoning_actions:
        score += 0.3
    
    # 2. Action coherence: Do actions follow a logical sequence?
    coherence = measure_action_coherence(episode_history)
    score += 0.4 * coherence
    
    # 3. Backtracking penalty: Visiting same page multiple times
    unique_pages = len(set(a.navigate_to for a in episode_history if a.navigate_to))
    total_navigations = len([a for a in episode_history if a.action_type == "NAVIGATE"])
    if total_navigations > 0:
        backtrack_ratio = 1.0 - (unique_pages / total_navigations)
        score += 0.3 * (1.0 - backtrack_ratio)  # Lower backtracking = higher score
    
    return min(score, 1.0)

def measure_action_coherence(actions: List[Action]) -> float:
    """Are actions logically connected?"""
    coherence_patterns = [
        # Good patterns
        ("SEARCH_PAGE", "EXTRACT_FIELD"),      # Search then extract
        ("NAVIGATE", "EXTRACT_FIELD"),          # Navigate then extract
        ("EXTRACT_FIELD", "VERIFY_FACT"),       # Extract then verify
        ("SEARCH_ENGINE", "NAVIGATE"),          # Search then visit
    ]
    
    coherent_pairs = 0
    total_pairs = len(actions) - 1
    
    for i in range(total_pairs):
        pair = (actions[i].action_type, actions[i+1].action_type)
        if pair in coherence_patterns:
            coherent_pairs += 1
    
    return coherent_pairs / total_pairs if total_pairs > 0 else 0.0
```

**Example:**
```python
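# Good planning:
actions = [
    Action(action_type="SEARCH_PAGE", notes="Looking for price pattern"),
    Action(action_type="EXTRACT_FIELD", target="price"),
    Action(action_type="VERIFY_FACT", field="price")
]
# Both consecutive pairs match a coherence pattern (coherence = 2/2 = 1.0).
# There are no NAVIGATE actions, so the backtracking term adds nothing.
# planning_score = 0.3 (notes) + 0.4 * 1.0 (coherence) + 0.0 = 0.70

# Poor planning:
actions = [
    Action(action_type="NAVIGATE", navigate_to="/page1"),
    Action(action_type="NAVIGATE", navigate_to="/page2"),
    Action(action_type="NAVIGATE", navigate_to="/page1"),  # Backtrack!
    Action(action_type="EXTRACT_FIELD")
]
# Only the final NAVIGATE -> EXTRACT_FIELD pair is coherent (coherence = 1/3).
# 2 unique pages over 3 navigations gives a backtracking term of 0.3 * 2/3.
# planning_score = 0.0 (no notes) + 0.4 * 0.33 + 0.3 * 0.67 = 0.33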
```
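
To make the scoring concrete, the sketch below re-implements the three terms in compact, self-contained form and applies them to a hypothetical four-step episode. The `Step` dataclass and `planning_score` function are stand-ins invented for this illustration, not part of the environment's API; the pattern set is copied from `measure_action_coherence`.

```python
from dataclasses import dataclass
from typing import List, Optional

# Same "good" action pairs as in measure_action_coherence
GOOD_PAIRS = {
    ("SEARCH_PAGE", "EXTRACT_FIELD"),
    ("NAVIGATE", "EXTRACT_FIELD"),
    ("EXTRACT_FIELD", "VERIFY_FACT"),
    ("SEARCH_ENGINE", "NAVIGATE"),
}


@dataclass
class Step:
    """Hypothetical stand-in for Action with only the fields scored here."""
    action_type: str
    notes: Optional[str] = None
    navigate_to: Optional[str] = None


def planning_score(steps: List[Step]) -> float:
    # Term 1: any reasoning notes at all
    score = 0.3 if any(s.notes for s in steps) else 0.0
    # Term 2: share of consecutive pairs that match a good pattern
    pairs = list(zip(steps, steps[1:]))
    if pairs:
        coherent = sum((a.action_type, b.action_type) in GOOD_PAIRS for a, b in pairs)
        score += 0.4 * coherent / len(pairs)
    # Term 3: unique pages per navigation (1.0 means no backtracking)
    navigations = [s for s in steps if s.action_type == "NAVIGATE"]
    if navigations:
        unique_pages = len({s.navigate_to for s in steps if s.navigate_to})
        score += 0.3 * unique_pages / len(navigations)
    return min(score, 1.0)


episode = [
    Step("SEARCH_ENGINE", notes="Find the product page first"),
    Step("NAVIGATE", navigate_to="/product/42"),
    Step("NAVIGATE", navigate_to="/product/42"),  # redundant revisit
    Step("EXTRACT_FIELD"),
]
# notes: 0.3; coherence: 2 of 3 pairs -> 0.4 * 2/3; pages: 1 unique / 2 navigations -> 0.3 * 1/2
print(round(planning_score(episode), 2))  # 0.72
```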

---

## recovery-ability

### 4-recovery-ability-score-w4-0-08

**Purpose:** Reward agents that recover from failures.

**Signals:**
- Action failed → Agent tried alternative approach
- Extraction returned empty → Agent searched with different selector
- Page blocked → Agent switched proxy/VPN

**Calculation:**
```python
from typing import List, Tuple

def recovery_ability_score(episode_history: List[Tuple[Action, Reward]]) -> float:
    """Measure ability to recover from failures."""
    recoveries = 0
    failures = 0
    
    for i in range(len(episode_history) - 1):
        action, reward = episode_history[i]
        next_action, next_reward = episode_history[i + 1]
        
        # Detect failure (negative reward or a message reporting failure)
        if reward.value < 0 or "failed" in (reward.message or "").lower():
            failures += 1
            
            # Check if next action was a recovery attempt
            if is_recovery_action(action, next_action):
                if next_reward.value > reward.value:  # Recovery succeeded
                    recoveries += 1
    
    return recoveries / failures if failures > 0 else 0.0

def is_recovery_action(failed_action: Action, next_action: Action) -> bool:
    """Is next_action a recovery attempt for failed_action?"""
    # Same action type with different parameters
    if failed_action.action_type == next_action.action_type:
        if failed_action.selector != next_action.selector:
            return True  # Tried different selector
    
    # Switched to alternative action type
    recovery_alternatives = {
        "EXTRACT_FIELD": ["SEARCH_PAGE", "INSPECT_ELEMENT"],
        "NAVIGATE": ["FETCH_URL"],  # Try direct fetch if navigate blocked
        "SEARCH_ENGINE": ["NAVIGATE"],  # Try direct URL if search fails
    }
    
    if next_action.action_type in recovery_alternatives.get(failed_action.action_type, []):
        return True
    
    return False
```

**Example:**
```python
# Good recovery:
history = [
    (Action(action_type="EXTRACT_FIELD", selector=".price"), Reward(value=-0.1, message="Not found")),
    (Action(action_type="SEARCH_PAGE", query="price"), Reward(value=0.2, message="Found price pattern")),
    (Action(action_type="EXTRACT_FIELD", selector="span.product-price"), Reward(value=0.5, message="Extracted"))
]
recovery_score = 1 / 1  # 1.0: 1 failure, 1 successful recovery

# No recovery:
history = [
    (Action(action_type="EXTRACT_FIELD", selector=".price"), Reward(value=-0.1)),
    (Action(action_type="EXTRACT_FIELD", selector=".price"), Reward(value=-0.1)),  # Repeated same failed action!
    (Action(action_type="SUBMIT"), Reward(value=0.0))
]
recovery_score = 0 / 2  # 0.0: 2 failures, 0 recoveries
```
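
A runnable sketch of the same logic on a mixed episode, where one failure is followed by a blind retry and the next by a real change of approach. `Act` and `Rew` are hypothetical stand-ins for `Action` and `Reward`, reduced to the fields the scoring reads.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Alternative action types treated as recovery attempts (copied from is_recovery_action)
RECOVERY_ALTERNATIVES = {
    "EXTRACT_FIELD": ["SEARCH_PAGE", "INSPECT_ELEMENT"],
    "NAVIGATE": ["FETCH_URL"],
    "SEARCH_ENGINE": ["NAVIGATE"],
}


@dataclass
class Act:
    """Hypothetical stand-in for Action."""
    action_type: str
    selector: Optional[str] = None


@dataclass
class Rew:
    """Hypothetical stand-in for Reward."""
    value: float
    message: str = ""


def is_recovery(failed: Act, nxt: Act) -> bool:
    # Same action type retried with a different selector
    if failed.action_type == nxt.action_type and failed.selector != nxt.selector:
        return True
    # Or a switch to a listed alternative action type
    return nxt.action_type in RECOVERY_ALTERNATIVES.get(failed.action_type, [])


def recovery_score(history: List[Tuple[Act, Rew]]) -> float:
    failures = recoveries = 0
    for (act, rew), (nxt_act, nxt_rew) in zip(history, history[1:]):
        if rew.value < 0 or "failed" in rew.message.lower():
            failures += 1
            if is_recovery(act, nxt_act) and nxt_rew.value > rew.value:
                recoveries += 1
    return recoveries / failures if failures else 0.0


history = [
    (Act("EXTRACT_FIELD", ".price"), Rew(-0.1, "Not found")),
    (Act("EXTRACT_FIELD", ".price"), Rew(-0.1, "Not found")),  # same selector: not a recovery
    (Act("SEARCH_PAGE"), Rew(0.2, "Found price pattern")),     # alternative action: recovery
    (Act("EXTRACT_FIELD", "span.product-price"), Rew(0.5, "Extracted")),
]
print(recovery_score(history))  # 0.5 (2 failures, 1 recovery)
```

Note that an episode with no failures scores 0.0 under this definition, the same as one that never recovers.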

---

## exploration-bonus

### 5-exploration-bonus-w5-0-05

**Purpose:** Encourage discovering new pages and patterns early in training.

**Calculation:**
```python
import math
from typing import List, Set

def exploration_bonus(
    pages_visited: List[str],
    known_pages: Set[str],  # From long-term memory
    episode_number: int
) -> float:
    """Bonus for discovering new pages/patterns."""
    new_pages = set(pages_visited) - known_pages
    
    # Bonus decreases over time (we want agent to eventually exploit)
    decay_factor = math.exp(-0.01 * episode_number)
    
    # Bonus per new page discovered
    bonus_per_page = 0.1
    
    return min(len(new_pages) * bonus_per_page * decay_factor, 1.0)
```

**Example:**
```python
# Episode 10: Agent discovers 3 new pages
exploration_bonus = 3 * 0.1 * math.exp(-0.01 * 10)   # 0.3 * 0.90 = 0.27

# Episode 500: Same discovery
exploration_bonus = 3 * 0.1 * math.exp(-0.01 * 500)  # 0.3 * 0.007 = 0.002, minimal bonus now
```
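
The decay schedule can be summarized by its half-life: with the rate of 0.01 used here, the bonus halves roughly every 69 episodes (ln 2 / 0.01). The sketch below takes the rate as a parameter, which is an assumption made for illustration so that other rates can be compared.

```python
import math


def decay_factor(episode_number: int, decay_rate: float = 0.01) -> float:
    """Exponential decay applied to the exploration bonus."""
    return math.exp(-decay_rate * episode_number)


def half_life(decay_rate: float = 0.01) -> float:
    """Number of episodes after which the bonus has halved."""
    return math.log(2) / decay_rate


print(round(half_life(), 1))  # 69.3

# Bonus for 3 newly discovered pages (0.1 each) at different points in training
for episode in (0, 10, 100, 500):
    bonus = min(3 * 0.1 * decay_factor(episode), 1.0)
    print(episode, round(bonus, 3))
```

A larger rate shifts the agent from exploration to exploitation sooner; a rate of 0.05, for instance, gives a half-life of about 14 episodes.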

---

## redundancy-penalty

### 6-redundancy-penalty-penalty-not-bonus

**Purpose:** Penalize visiting the same page repeatedly without progress.

**Calculation:**
```python
def redundancy_penalty(pages_visited: List[str]) -> float:
    """Penalty for revisiting pages."""
    from collections import Counter
    visit_counts = Counter(pages_visited)
    
    penalty = 0.0
    for page, count in visit_counts.items():
        if count > 1:
            # Superlinear (power-law) penalty for repeat visits
            penalty += 0.05 * (count - 1) ** 1.5
    
    return min(penalty, 1.0)
```

**Example:**
```python
pages = ["/page1", "/page2", "/page1", "/page1", "/page3"]
# page1 visited 3 times
redundancy_penalty = 0.05 * (3 - 1) ** 1.5  # 0.05 * 2.83 = 0.14
```
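
The configuration section lists a `redundancyThreshold` ("penalize after N visits to same page"), while the function here always starts penalizing at the second visit. How the threshold enters the formula is not specified; one possible reading, sketched under that assumption, is to penalize only the visits beyond the threshold:

```python
from collections import Counter
from typing import List


def redundancy_penalty_with_threshold(pages_visited: List[str], threshold: int = 1) -> float:
    """Hypothetical variant: penalize only visits beyond `threshold` per page.

    threshold=1 reproduces the behavior of redundancy_penalty.
    """
    penalty = 0.0
    for count in Counter(pages_visited).values():
        extra = count - threshold
        if extra > 0:
            penalty += 0.05 * extra ** 1.5
    return min(penalty, 1.0)


pages = ["/page1", "/page2", "/page1", "/page1", "/page3"]
print(round(redundancy_penalty_with_threshold(pages), 2))               # 0.14
print(round(redundancy_penalty_with_threshold(pages, threshold=2), 2))  # 0.05
```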

---

## generalization-score

### 7-generalization-score-w8-0-07

**Purpose:** Reward strategies that work across different page layouts.

**Measurement:** After training, evaluate agent on unseen task variations.

**Calculation:**
```python
import numpy as np
from typing import List

def generalization_score(
    agent: Agent,
    test_tasks: List[Task],
    training_tasks: List[Task]
) -> float:
    """Test agent on unseen variations of trained tasks."""
    training_ids = {t.id for t in training_tasks}  # Build once for O(1) lookups
    test_results = []
    
    for task in test_tasks:
        # Skip any task that also appears in the training set
        if task.id in training_ids:
            continue
        
        result = agent.run(task)
        test_results.append(result.completion_score)
    
    # Average performance on unseen tasks
    return float(np.mean(test_results)) if test_results else 0.0
```
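
The held-out filtering can be exercised without a real agent. In the sketch below, `Task`, `Result`, and `StubAgent` are hypothetical minimal stand-ins, and the stub's score simply falls with a made-up difficulty value; only the filtering and averaging mirror the function described here.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class Task:
    """Hypothetical minimal task: an id plus a difficulty in [0, 1]."""
    id: str
    difficulty: float


@dataclass
class Result:
    completion_score: float


class StubAgent:
    """Hypothetical agent whose score simply falls with task difficulty."""

    def run(self, task: Task) -> Result:
        return Result(completion_score=1.0 - task.difficulty)


def held_out_score(agent: StubAgent, test_tasks: List[Task], training_tasks: List[Task]) -> float:
    training_ids = {t.id for t in training_tasks}
    scores = [agent.run(t).completion_score for t in test_tasks if t.id not in training_ids]
    return mean(scores) if scores else 0.0


train = [Task("shop-a", 0.2)]
test = [Task("shop-a", 0.2), Task("shop-b", 0.4), Task("shop-c", 0.6)]
# "shop-a" is skipped because it was seen in training; the mean covers shop-b and shop-c
print(round(held_out_score(StubAgent(), test, train), 2))  # 0.5
```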

---

## tool-usage-efficiency

### 8-tool-usage-w6-0-05

**Purpose:** Reward using the right tools at the right time.

**Calculation:**
```python
from typing import List

def tool_usage_score(actions: List[Action]) -> float:
    """Reward appropriate tool usage."""
    score = 0.0
    
    # 1. Used memory appropriately
    memory_actions = [a for a in actions if a.action_type in ["READ_MEMORY", "WRITE_MEMORY"]]
    if memory_actions:
        score += 0.3
    
    # 2. Used MCP tools when appropriate
    mcp_actions = [a for a in actions if a.action_type == "MCP_TOOL_CALL"]
    if mcp_actions:
        score += 0.3
    
    # 3. Verified important extractions
    verify_actions = [a for a in actions if a.action_type == "VERIFY_FACT"]
    extract_actions = [a for a in actions if a.action_type == "EXTRACT_FIELD"]
    if verify_actions and extract_actions:
        verification_ratio = len(verify_actions) / len(extract_actions)
        score += 0.4 * min(verification_ratio, 1.0)
    
    return min(score, 1.0)
```
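
A worked example may help show how the three terms combine. The sketch below restates the same rules on plain action-type strings (an assumption made so the example needs no `Action` class); `tool_score` is an illustrative name.

```python
from typing import List


def tool_score(action_types: List[str]) -> float:
    """Compact restatement of tool_usage_score on action-type strings."""
    score = 0.0
    if any(t in ("READ_MEMORY", "WRITE_MEMORY") for t in action_types):
        score += 0.3
    if "MCP_TOOL_CALL" in action_types:
        score += 0.3
    verifies = action_types.count("VERIFY_FACT")
    extracts = action_types.count("EXTRACT_FIELD")
    if verifies and extracts:
        score += 0.4 * min(verifies / extracts, 1.0)
    return min(score, 1.0)


episode = ["READ_MEMORY", "NAVIGATE", "EXTRACT_FIELD", "VERIFY_FACT", "EXTRACT_FIELD", "SUBMIT"]
# memory used: 0.3; no MCP call: 0.0; 1 verification for 2 extractions: 0.4 * 0.5
print(round(tool_score(episode), 2))  # 0.5
```

Because the first two terms only check for presence, a single memory or MCP call earns the full 0.3 regardless of whether it was useful.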

---

## memory-utilization

### 9-memory-usage-w7-0-05

**Purpose:** Reward effective use of memory system.

**Calculation:**
```python
def memory_usage_score(episode: Episode) -> float:
    """Reward effective memory usage."""
    score = 0.0
    
    # 1. Did agent query long-term memory for similar patterns?
    if episode.memory_queries > 0:
        score += 0.4
    
    # 2. Did agent write successful patterns to long-term memory?
    if episode.memory_writes > 0:
        score += 0.3
    
    # 3. Did memory queries lead to successful actions?
    if episode.total_actions > 0:  # Avoid division by zero on empty episodes
        memory_assisted_success = episode.memory_assisted_actions / episode.total_actions
        score += 0.3 * memory_assisted_success
    
    return min(score, 1.0)
```
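
A short worked example of the three terms, using a hypothetical `EpisodeStats` container for the counters the function reads (the real `Episode` type is not shown in this document):

```python
from dataclasses import dataclass


@dataclass
class EpisodeStats:
    """Hypothetical container for the counters memory_usage_score reads."""
    memory_queries: int
    memory_writes: int
    memory_assisted_actions: int
    total_actions: int


def memory_score(ep: EpisodeStats) -> float:
    score = 0.0
    if ep.memory_queries > 0:
        score += 0.4
    if ep.memory_writes > 0:
        score += 0.3
    if ep.total_actions > 0:  # guard against empty episodes
        score += 0.3 * ep.memory_assisted_actions / ep.total_actions
    return min(score, 1.0)


# Queried memory twice, wrote nothing, and 3 of 10 actions built on a memory hit:
# 0.4 + 0.0 + 0.3 * 0.3 = 0.49
print(round(memory_score(EpisodeStats(2, 0, 3, 10)), 2))  # 0.49
```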

---

## final-reward-formula

### complete-formula

```python
def calculate_reward(episode: Episode, config: RewardConfig) -> Reward:
    """Calculate comprehensive reward."""
    
    # Positive components
    R_completion = task_completion_score(episode.extracted, episode.ground_truth)
    R_efficiency = efficiency_score(episode.steps, episode.max_steps, len(episode.pages))
    R_planning = planning_quality_score(episode.actions)
    R_recovery = recovery_ability_score(episode.history)
    R_exploration = exploration_bonus(episode.pages, episode.memory.known_pages, episode.number)
    R_tools = tool_usage_score(episode.actions)
    R_memory = memory_usage_score(episode)
    R_generalization = generalization_score(episode.agent, episode.test_tasks, episode.training_tasks)
    
    # Penalties
    P_redundancy = redundancy_penalty(episode.pages)
    P_timeout = 1.0 if episode.timed_out else 0.0
    P_invalid = sum(1 for a in episode.actions if not a.valid) * 0.1
    
    # Weighted sum
    w = config.weights
    reward_value = (
        w.completion * R_completion +
        w.efficiency * R_efficiency +
        w.planning * R_planning +
        w.recovery * R_recovery +
        w.exploration * R_exploration +
        w.tools * R_tools +
        w.memory * R_memory +
        w.generalization * R_generalization
    ) - (P_redundancy + P_timeout + P_invalid)
    
    # Clamp to [-1, 1]
    reward_value = max(-1.0, min(1.0, reward_value))
    
    # Build breakdown for interpretability
    breakdown = {
        "task_completion": R_completion,
        "efficiency": R_efficiency,
        "planning_quality": R_planning,
        "recovery_ability": R_recovery,
        "exploration_bonus": R_exploration,
        "tool_usage": R_tools,
        "memory_usage": R_memory,
        "generalization": R_generalization,
        "redundancy_penalty": -P_redundancy,
        "timeout_penalty": -P_timeout,
        "invalid_action_penalty": -P_invalid
    }
    
    # Generate explanation
    message = generate_reward_explanation(breakdown, reward_value)
    
    return Reward(
        value=reward_value,
        cumulative=episode.cumulative_reward + reward_value,
        breakdown=breakdown,
        message=message
    )
```

### default-weights

```python
class RewardWeights(BaseModel):
    completion: float = 0.40      # Most important
    efficiency: float = 0.15       # Moderate importance
    planning: float = 0.10         # Encourages good habits
    recovery: float = 0.08         # Resilience
    exploration: float = 0.05      # Early training
    tools: float = 0.05            # Appropriate tool use
    memory: float = 0.05           # Effective memory
    generalization: float = 0.07   # Transfer learning
    # Total: 0.95. Penalties are subtracted separately, so the weighted sum alone never exceeds 0.95
```

---

## configuration

### settings

```typescript
interface RewardConfig {
  weights: RewardWeights;
  
  // Component toggles
  enablePlanningReward: boolean;
  enableRecoveryReward: boolean;
  enableExplorationBonus: boolean;
  enableGeneralizationTest: boolean;
  
  // Penalty settings
  redundancyThreshold: number;       // Penalize after N visits to same page
  timeoutPenalty: number;            // Penalty for exceeding time limit
  invalidActionPenalty: number;      // Penalty per invalid action
  
  // Exploration decay
  explorationDecayRate: number;      // Default: 0.01
  
  // Generalization
  testTaskCount: number;             // Number of unseen tasks to test on
}
```

### ui-component

```jsx
<RewardSettings>
  <Section title="Component Weights">
    <Slider label="Task Completion" value={weights.completion} min={0} max={1} step={0.05} />
    <Slider label="Efficiency" value={weights.efficiency} min={0} max={1} step={0.05} />
    <Slider label="Planning Quality" value={weights.planning} min={0} max={1} step={0.05} />
    <Slider label="Recovery Ability" value={weights.recovery} min={0} max={1} step={0.05} />
    <Slider label="Exploration Bonus" value={weights.exploration} min={0} max={1} step={0.05} />
    <Slider label="Tool Usage" value={weights.tools} min={0} max={1} step={0.05} />
    <Slider label="Memory Usage" value={weights.memory} min={0} max={1} step={0.05} />
    <Slider label="Generalization" value={weights.generalization} min={0} max={1} step={0.05} />
    
    <TotalWeight value={Object.values(weights).reduce((a,b) => a+b, 0)} max={1.0} />
  </Section>
  
  <Section title="Penalties">
    <NumberInput label="Redundancy Threshold (page visits)" value={redundancyThreshold} />
    <NumberInput label="Timeout Penalty" value={timeoutPenalty} min={0} max={1} step={0.1} />
    <NumberInput label="Invalid Action Penalty" value={invalidActionPenalty} min={0} max={1} step={0.1} />
  </Section>
  
  <Section title="Exploration">
    <NumberInput label="Decay Rate" value={explorationDecayRate} min={0} max={0.1} step={0.001} />
    <HelpText>How quickly exploration bonus decreases over episodes</HelpText>
  </Section>
  
  <Section title="Presets">
    <Button onClick={() => loadPreset('balanced')}>Balanced (Default)</Button>
    <Button onClick={() => loadPreset('efficiency_focused')}>Efficiency Focused</Button>
    <Button onClick={() => loadPreset('quality_focused')}>Quality Focused</Button>
    <Button onClick={() => loadPreset('exploration')}>Exploration Mode</Button>
  </Section>
</RewardSettings>
```

---

## reward-visualization

```jsx
<RewardBreakdown>
  <BarChart>
    {Object.entries(breakdown).map(([component, value]) => (
      <Bar 
        key={component}
        label={component}
        value={value}
        color={value >= 0 ? 'green' : 'red'}
      />
    ))}
  </BarChart>
  
  <TotalReward value={reward.value} />
  
  <Explanation>{reward.message}</Explanation>
</RewardBreakdown>
```

**Example Output:**
```
Reward Breakdown (Total: 0.57)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Task Completion:    ████████████████████ 0.85
Efficiency:         ████████████░░░░░░░░ 0.65
Planning Quality:   ███████████████░░░░░ 0.78
Recovery Ability:   ██████████████████░░ 0.90
Exploration:        ████░░░░░░░░░░░░░░░░ 0.20
Tool Usage:         ███████████████████░ 0.95
Memory Usage:       ████████░░░░░░░░░░░░ 0.40
Generalization:     ██████████████░░░░░░ 0.72
Redundancy Penalty: ░░░░░░░░░░░░░░░░░░░░ -0.15
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Explanation:
 Excellent task completion (85% of fields extracted correctly)
 Good efficiency (completed in 8/20 steps)
 Strong recovery ability (recovered from 2/2 failures)
 Moderate redundancy (visited homepage 3 times)
→ Overall: Strong performance!
```
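
A breakdown like this can be turned back into a total by applying the weights and subtracting the penalties. The sketch below assumes the default weights from `RewardWeights` and that the redundancy penalty is the only penalty incurred; with a different preset the numbers change.

```python
# Default weights from RewardWeights
WEIGHTS = {
    "task_completion": 0.40, "efficiency": 0.15, "planning_quality": 0.10,
    "recovery_ability": 0.08, "exploration_bonus": 0.05, "tool_usage": 0.05,
    "memory_usage": 0.05, "generalization": 0.07,
}

# Component scores and penalty taken from the example breakdown
components = {
    "task_completion": 0.85, "efficiency": 0.65, "planning_quality": 0.78,
    "recovery_ability": 0.90, "exploration_bonus": 0.20, "tool_usage": 0.95,
    "memory_usage": 0.40, "generalization": 0.72,
}
penalties = {"redundancy_penalty": 0.15}

# Weighted sum of positive components, then subtract penalties and clamp to [-1, 1]
weighted_sum = sum(WEIGHTS[name] * value for name, value in components.items())
total = max(-1.0, min(1.0, weighted_sum - sum(penalties.values())))

print(round(weighted_sum, 4))  # 0.7154
print(round(total, 4))         # 0.5654
```

Under these assumptions the weighted components contribute about 0.72 and the penalty brings the final reward to about 0.57.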

---

**Next:** See [html-processing.md](./html-processing.md) for advanced HTML handling.


## related-api-reference

| item | value |
| --- | --- |
| api-reference | `api-reference.md` |

## document-metadata

| key | value |
| --- | --- |
| document | `rewards.md` |
| status | active |

## document-flow

```mermaid
flowchart TD
    A[document] --> B[key-sections]
    B --> C[implementation]
    B --> D[operations]
    B --> E[validation]
```