# Observability and Dashboard

## Overview

The observability layer exposes runtime behavior, model usage, tool execution, memory quality, and rewards through a dashboard, structured logs, per-episode traces, alerts, and a metrics API.

## Dashboard Sections

### 1. Live Thought Stream

- chronological reasoning notes
- model/router choice trace
- action confidence timeline
- override events

### 2. Navigation Map

Graph of visited pages:

- nodes = URLs
- edges = transitions
- node color = relevance/confidence
- revisit highlighting
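
The graph data behind this map could be derived from the ordered list of visited URLs. The sketch below is only an illustration: `build_navigation_graph` and its return shape are hypothetical names, not part of the documented system.

```python
from collections import Counter


def build_navigation_graph(visits):
    """Derive nodes, edges, and revisited URLs from an ordered visit list."""
    # Node weight = number of times the URL was visited.
    nodes = Counter(visits)
    # Edge weight = number of times the agent moved from one URL to the next.
    edges = Counter(zip(visits, visits[1:]))
    # URLs seen more than once are candidates for revisit highlighting.
    revisited = {url for url, count in nodes.items() if count > 1}
    return nodes, edges, revisited


visits = [
    "https://example.com/",
    "https://example.com/a",
    "https://example.com/",
    "https://example.com/b",
]
nodes, edges, revisited = build_navigation_graph(visits)
```

Node color (relevance/confidence) would need a per-URL score from the agent, which this sketch does not model.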

### 3. MCP Usage Panel

- tool call count by server
- avg latency by tool
- error rate and retries
- top successful tool chains
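
A minimal sketch of how the call count, average latency, and error rate might be aggregated, assuming records shaped like the structured log entries in the Logging Model section (`event`, `tool`, `latency_ms`, `success`). The function name `summarize_tool_calls` is hypothetical, and grouping is per tool rather than per server because the log example does not carry a separate server field.

```python
from collections import defaultdict


def summarize_tool_calls(records):
    """Aggregate call count, average latency, and error rate per tool."""
    stats = defaultdict(lambda: {"calls": 0, "errors": 0, "latency_total": 0})
    for record in records:
        # Only tool_call events contribute to this panel.
        if record.get("event") != "tool_call":
            continue
        entry = stats[record["tool"]]
        entry["calls"] += 1
        entry["latency_total"] += record["latency_ms"]
        if not record["success"]:
            entry["errors"] += 1
    return {
        tool: {
            "calls": entry["calls"],
            "avg_latency_ms": entry["latency_total"] / entry["calls"],
            "error_rate": entry["errors"] / entry["calls"],
        }
        for tool, entry in stats.items()
    }


sample_records = [
    {"event": "tool_call", "tool": "beautifulsoup.find_all", "latency_ms": 50, "success": True},
    {"event": "tool_call", "tool": "beautifulsoup.find_all", "latency_ms": 100, "success": False},
    {"event": "observation"},
]
tool_summary = summarize_tool_calls(sample_records)
```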

### 4. Memory Viewer

- inspect short-term, working, long-term, and shared memory
- filter by task/domain/confidence
- edit/delete entries
- prune previews

### 5. Reward Analytics

- per-step reward breakdown
- component contribution trends
- penalty heatmap
- episode comparison
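
One way to compute component contributions is to sum each reward component across the steps of an episode. The sketch below assumes each step's breakdown is a mapping from component name to value; the component names `progress` and `penalty` are made up for illustration.

```python
def component_contributions(steps):
    """Sum each reward component over all steps of one episode."""
    totals = {}
    for step in steps:
        for name, value in step.items():
            totals[name] = totals.get(name, 0.0) + value
    return totals


# Hypothetical per-step breakdowns for a two-step episode.
episode_steps = [
    {"progress": 0.5, "penalty": -0.25},
    {"progress": 0.25},
]
contributions = component_contributions(episode_steps)
episode_reward = sum(contributions.values())
```

Running the same function over two episodes and comparing the resulting dictionaries gives a simple basis for episode comparison.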

### 6. Cost and Token Monitor

- per-provider usage
- per-model token counts
- cumulative cost vs budget
- forecasted burn rate
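
The document does not say how the burn rate is forecast. A simple option is linear extrapolation from spend so far, sketched below; `forecast_budget` and its parameters are hypothetical.

```python
def forecast_budget(cumulative_cost, elapsed_hours, budget):
    """Return (burn rate per hour, hours until the budget is exhausted).

    Assumes spending continues at the average rate observed so far.
    """
    if elapsed_hours <= 0:
        raise ValueError("elapsed_hours must be positive")
    burn_rate = cumulative_cost / elapsed_hours
    remaining = max(budget - cumulative_cost, 0.0)
    # With no spend yet, the budget never runs out under this model.
    hours_left = remaining / burn_rate if burn_rate > 0 else float("inf")
    return burn_rate, hours_left


burn_rate, hours_left = forecast_budget(cumulative_cost=2.0, elapsed_hours=4.0, budget=10.0)
```

A rolling window over recent spend would react faster to usage spikes than the whole-run average used here.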

## Core Metrics

### Agent Metrics

- task completion rate
- avg steps to completion
- recovery score
- generalization score
- exploration ratio
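
As an illustration, the first two metrics could be computed from per-episode summaries. The record fields `completed` and `steps` are assumptions, as is the choice to average steps over completed episodes only.

```python
def agent_summary(episodes):
    """Compute task completion rate and average steps to completion."""
    completed = [e for e in episodes if e["completed"]]
    completion_rate = len(completed) / len(episodes) if episodes else 0.0
    # Average only over episodes that actually finished the task.
    avg_steps = (
        sum(e["steps"] for e in completed) / len(completed) if completed else None
    )
    return {"task_completion_rate": completion_rate, "avg_steps_to_completion": avg_steps}


sample_episodes = [
    {"completed": True, "steps": 4},
    {"completed": True, "steps": 6},
    {"completed": True, "steps": 8},
    {"completed": False, "steps": 20},
]
summary = agent_summary(sample_episodes)
```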

### Tool Metrics

- tool success rate
- timeout ratio
- fallback frequency
- schema validation failures

### Memory Metrics

- retrieval hit rate
- relevance score distribution
- prune rate
- memory-assisted success ratio

### Search Metrics

- query success rate
- multi-hop depth distribution
- credibility score average
- duplicate result ratio

## Logging Model

Structured logs (JSON):

```json
{
  "timestamp": "2026-03-27T00:00:00Z",
  "episode_id": "ep_123",
  "step": 7,
  "event": "tool_call",
  "tool": "beautifulsoup.find_all",
  "latency_ms": 54,
  "success": true,
  "reward_delta": 0.08
}
```
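
A record in this shape can be produced with the standard library alone. The sketch below is one possible emitter, not the project's actual logger; `make_log_record` is a hypothetical helper.

```python
import json
from datetime import datetime, timezone


def make_log_record(episode_id, step, event, tool, latency_ms, success,
                    reward_delta, now=None):
    """Build a log record with the same fields as the example above."""
    # Normalize to UTC so the trailing "Z" is accurate.
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return {
        "timestamp": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "episode_id": episode_id,
        "step": step,
        "event": event,
        "tool": tool,
        "latency_ms": latency_ms,
        "success": success,
        "reward_delta": reward_delta,
    }


record = make_log_record(
    "ep_123", 7, "tool_call", "beautifulsoup.find_all", 54, True, 0.08,
    now=datetime(2026, 3, 27, tzinfo=timezone.utc),
)
# One JSON object per line keeps the log easy to parse and stream.
line = json.dumps(record)
```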

## Tracing

Per-episode trace includes:

- observations
- actions
- rewards
- tool calls
- memory operations
- final submission and grader results
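
The trace contents listed above map naturally onto a single container type. The dataclass below is a sketch under that assumption; the class name, field names, and element types are illustrative, since the document does not define the trace schema.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class EpisodeTrace:
    """Hypothetical container for everything recorded during one episode."""
    episode_id: str
    observations: List[Any] = field(default_factory=list)
    actions: List[Any] = field(default_factory=list)
    rewards: List[float] = field(default_factory=list)
    tool_calls: List[Dict[str, Any]] = field(default_factory=list)
    memory_operations: List[Dict[str, Any]] = field(default_factory=list)
    final_submission: Optional[Any] = None
    grader_results: Optional[Dict[str, Any]] = None


trace = EpisodeTrace(episode_id="ep_123")
trace.actions.append({"type": "click", "target": "#next"})
trace.rewards.append(0.08)
# asdict() yields plain dicts and lists that serialize directly to JSON.
trace_dict = asdict(trace)
```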

## Alerts

Configurable alerts:

- budget threshold crossed
- error spike
- tool outage
- memory bloat
- anomalously low reward streak
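
A few of these alerts reduce to threshold checks over current metrics. The sketch below covers three of them; all metric and threshold keys are hypothetical, and tool outage and memory bloat are left out because their signals are not specified here.

```python
def check_alerts(metrics, thresholds):
    """Return the names of alerts whose conditions currently hold."""
    alerts = []
    if metrics["cumulative_cost"] >= thresholds["budget"]:
        alerts.append("budget threshold crossed")
    if metrics["error_rate"] >= thresholds["error_rate"]:
        alerts.append("error spike")
    # Count how many of the most recent rewards fall below the floor in a row.
    streak = 0
    for reward in reversed(metrics["recent_rewards"]):
        if reward < thresholds["low_reward"]:
            streak += 1
        else:
            break
    if streak >= thresholds["low_reward_streak"]:
        alerts.append("low reward streak")
    return alerts


thresholds = {"budget": 10.0, "error_rate": 0.2, "low_reward": 0.0, "low_reward_streak": 3}
metrics = {
    "cumulative_cost": 11.5,
    "error_rate": 0.05,
    "recent_rewards": [0.1, -0.2, -0.1, -0.3],
}
active_alerts = check_alerts(metrics, thresholds)
```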

## APIs

- `GET /api/metrics/summary`
- `GET /api/metrics/timeseries`
- `GET /api/traces/{episode_id}`
- `GET /api/costs`
- `GET /api/memory/stats`
- `GET /api/tools/stats`
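
A client might build and call these endpoints as sketched below. The base URL and helper names are assumptions, and the response is assumed to be JSON; the document does not specify response formats, query parameters, or authentication.

```python
import json
from urllib.parse import quote, urljoin
from urllib.request import urlopen


def trace_url(base_url, episode_id):
    """Build the trace endpoint URL, escaping the episode ID for use in a path."""
    return urljoin(base_url, "/api/traces/" + quote(episode_id, safe=""))


def fetch_json(url, timeout=10):
    """GET a URL and decode the body as JSON (assumed response format)."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical local deployment; adjust to wherever the service is hosted.
BASE_URL = "http://localhost:8000"
url = trace_url(BASE_URL, "ep_123")
# trace = fetch_json(url)  # requires a running server
```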

## Recommended Dashboard Layout

1. Top row: completion, cost, latency, error rate
2. Mid row: thought stream + navigation graph
3. Lower row: reward breakdown + MCP usage + memory viewer
4. Bottom row: raw trace and export controls

## Export and Audit

Exports:

- JSON trace
- CSV metrics
- reward analysis report
- model usage report

All exports include episode and configuration fingerprints for reproducibility.
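
The fingerprint scheme is not specified. One common approach, sketched here as an assumption, is to hash a canonical JSON serialization of the configuration so that equal configurations always produce the same fingerprint. The config keys shown are made up.

```python
import hashlib
import json


def config_fingerprint(config):
    """Return a SHA-256 hex digest of a canonical JSON form of the config."""
    # Sorted keys and fixed separators make the serialization order-independent.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical configuration values for illustration only.
config_a = {"model": "example-model", "max_steps": 50}
config_b = {"max_steps": 50, "model": "example-model"}
fingerprint = config_fingerprint(config_a)
```

Attaching `fingerprint` to each export lets a later reader confirm that two runs used identical settings.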