# Observability and Dashboard
## Overview
The observability layer exposes runtime behavior, model usage, tool execution, memory quality, and rewards for each episode.
## Dashboard Sections
### 1. Live Thought Stream
- chronological reasoning notes
- model/router choice trace
- action confidence timeline
- override events
### 2. Navigation Map
Graph of visited pages:
- nodes = URLs
- edges = transitions
- node color = relevance/confidence
- revisit highlighting
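
The navigation map can be backed by a small directed graph. The sketch below is illustrative only: the class name `NavigationGraph`, its methods, and the idea of storing one relevance score per URL are assumptions, not part of the documented system.

```python
from collections import defaultdict


class NavigationGraph:
    """Hypothetical in-memory graph: nodes are URLs, edges are transitions."""

    def __init__(self):
        self.visits = defaultdict(int)  # URL -> number of visits
        self.edges = defaultdict(int)   # (source URL, target URL) -> transition count
        self.relevance = {}             # URL -> most recent relevance score
        self._current = None            # URL of the page visited last

    def visit(self, url, relevance):
        """Record a page visit and the transition from the previous page."""
        self.visits[url] += 1
        self.relevance[url] = relevance
        if self._current is not None:
            self.edges[(self._current, url)] += 1
        self._current = url

    def revisited(self):
        """URLs visited more than once, for revisit highlighting."""
        return sorted(url for url, count in self.visits.items() if count > 1)


graph = NavigationGraph()
graph.visit("https://example.com/", 0.2)
graph.visit("https://example.com/a", 0.9)
graph.visit("https://example.com/", 0.3)
print(graph.revisited())
```

A renderer could map `relevance` to node color and use `revisited()` to decide which nodes to highlight.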
### 3. MCP Usage Panel
- tool call count by server
- avg latency by tool
- error rate and retries
- top successful tool chains
### 4. Memory Viewer
- inspect short-term, working, long-term, and shared memory
- filter by task/domain/confidence
- edit/delete entries
- prune previews
### 5. Reward Analytics
- per-step reward breakdown
- component contribution trends
- penalty heatmap
- episode comparison
### 6. Cost and Token Monitor
- per-provider usage
- per-model token counts
- cumulative cost vs budget
- forecasted burn rate
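
One simple way to forecast burn rate is a linear projection from the average cost per step so far. The document does not specify a forecasting method, so the function below (`forecast_budget`, a hypothetical name) is only a sketch under that assumption.

```python
def forecast_budget(cumulative_cost, steps_done, budget):
    """Linear forecast: assume future steps cost the same as the average so far."""
    if steps_done <= 0:
        raise ValueError("steps_done must be positive")
    burn_rate = cumulative_cost / steps_done          # cost per step
    remaining = max(budget - cumulative_cost, 0.0)    # budget left, never negative
    if burn_rate > 0:
        steps_left = remaining / burn_rate
    else:
        steps_left = float("inf")                     # nothing spent yet
    return {
        "burn_rate": burn_rate,
        "remaining": remaining,
        "steps_until_exhausted": steps_left,
    }


# 2.0 spent over 8 steps against a budget of 5.0
print(forecast_budget(2.0, 8, 5.0))
```

A rolling window over recent steps would react faster to cost spikes than the whole-episode average used here.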
## Core Metrics
### Agent Metrics
- task completion rate
- avg steps to completion
- recovery score
- generalization score
- exploration ratio
### Tool Metrics
- tool success rate
- timeout ratio
- fallback frequency
- schema validation failures
### Memory Metrics
- retrieval hit rate
- relevance score distribution
- prune rate
- memory-assisted success ratio
### Search Metrics
- query success rate
- multi-hop depth distribution
- credibility score average
- duplicate result ratio
## Logging Model
Structured logs (JSON):
```json
{
  "timestamp": "2026-03-27T00:00:00Z",
  "episode_id": "ep_123",
  "step": 7,
  "event": "tool_call",
  "tool": "beautifulsoup.find_all",
  "latency_ms": 54,
  "success": true,
  "reward_delta": 0.08
}
```
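
Records in this shape can be aggregated into per-tool statistics such as call count, success rate, and average latency. The sketch below assumes one JSON object per log line (JSON Lines); that storage format and the function name `summarize_tool_calls` are assumptions.

```python
import json
from collections import defaultdict


def summarize_tool_calls(log_lines):
    """Aggregate `tool_call` events into per-tool call count, success rate, latency."""
    totals = defaultdict(lambda: {"calls": 0, "ok": 0, "latency_ms": 0})
    for line in log_lines:
        record = json.loads(line)
        if record.get("event") != "tool_call":
            continue  # ignore other event types
        entry = totals[record["tool"]]
        entry["calls"] += 1
        entry["ok"] += 1 if record["success"] else 0
        entry["latency_ms"] += record["latency_ms"]
    return {
        tool: {
            "calls": entry["calls"],
            "success_rate": entry["ok"] / entry["calls"],
            "avg_latency_ms": entry["latency_ms"] / entry["calls"],
        }
        for tool, entry in totals.items()
    }


lines = [
    '{"event": "tool_call", "tool": "beautifulsoup.find_all", "latency_ms": 54, "success": true}',
    '{"event": "tool_call", "tool": "beautifulsoup.find_all", "latency_ms": 46, "success": false}',
]
print(summarize_tool_calls(lines))
```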
## Tracing
Per-episode trace includes:
- observations
- actions
- rewards
- tool calls
- memory operations
- final submission and grader results
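
A per-episode trace with these fields could be modeled roughly as follows. All class and field names here (`StepTrace`, `EpisodeTrace`, `memory_ops`, and so on) are hypothetical; the document lists what a trace contains but not its schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class StepTrace:
    """One step of an episode (hypothetical schema)."""
    step: int
    observation: str
    action: str
    reward: float
    tool_calls: list = field(default_factory=list)
    memory_ops: list = field(default_factory=list)


@dataclass
class EpisodeTrace:
    """Everything recorded for a single episode (hypothetical schema)."""
    episode_id: str
    steps: list = field(default_factory=list)
    final_submission: object = None
    grader_result: object = None

    def total_reward(self):
        """Sum of per-step rewards."""
        return sum(step.reward for step in self.steps)

    def to_json(self):
        """Serialize the full trace, nested steps included."""
        return json.dumps(asdict(self), sort_keys=True)


trace = EpisodeTrace(episode_id="ep_123")
trace.steps.append(StepTrace(step=1, observation="home page", action="click", reward=0.5))
trace.steps.append(StepTrace(step=2, observation="results", action="extract", reward=0.25))
print(trace.total_reward())
```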
## Alerts
Configurable alerts:
- budget threshold crossed
- error spike
- tool outage
- memory bloat
- anomalous low-reward streak
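
Two of these alerts reduce to simple threshold checks. The rules below are assumptions made for illustration: the function names, the default 80% budget fraction, and the definition of a "streak" as the most recent N rewards all falling below a threshold are not taken from the documented system.

```python
def budget_alert(cumulative_cost, budget, fraction=0.8):
    """True once spending reaches the given fraction of the budget."""
    return cumulative_cost >= fraction * budget


def low_reward_streak(rewards, threshold, min_length):
    """True if the most recent `min_length` rewards are all below `threshold`."""
    if min_length <= 0:
        raise ValueError("min_length must be positive")
    if len(rewards) < min_length:
        return False  # not enough history to call it a streak
    return all(reward < threshold for reward in rewards[-min_length:])


print(budget_alert(4.5, 5.0))
print(low_reward_streak([0.3, 0.01, 0.0, 0.02], threshold=0.05, min_length=3))
```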
## APIs
- `GET /api/metrics/summary`
- `GET /api/metrics/timeseries`
- `GET /api/traces/{episode_id}`
- `GET /api/costs`
- `GET /api/memory/stats`
- `GET /api/tools/stats`
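
A client for these endpoints might look like the sketch below, which uses only the Python standard library. The base URL `http://localhost:8000` is a placeholder, and the assumption that each endpoint returns a JSON body is not confirmed by this document.

```python
import json
import urllib.request
from urllib.parse import quote, urljoin

BASE_URL = "http://localhost:8000"  # placeholder; replace with the real host


def trace_url(episode_id, base_url=BASE_URL):
    """Build the URL for GET /api/traces/{episode_id}, escaping the ID."""
    return urljoin(base_url, "/api/traces/" + quote(episode_id, safe=""))


def fetch_json(url, timeout=10.0):
    """GET a URL and decode the body as JSON (assumes a JSON response)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


print(trace_url("ep_123"))
# With a running server: fetch_json(trace_url("ep_123"))
```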
## Recommended Dashboard Layout
1. Top row: completion, cost, latency, error rate
2. Mid row: thought stream + navigation graph
3. Lower row: reward breakdown + MCP usage + memory viewer
4. Bottom row: raw trace and export controls
## Export and Audit
Exports:
- JSON trace
- CSV metrics
- reward analysis report
- model usage report
All exports include episode and configuration fingerprints for reproducibility.
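
The document does not say how fingerprints are computed. One common approach, shown here purely as an assumption, is to hash a canonical JSON serialization of the configuration so that key order does not change the result. The function name `config_fingerprint` and the sample configuration keys are hypothetical.

```python
import hashlib
import json


def config_fingerprint(config):
    """SHA-256 hex digest of the config serialized as canonical JSON."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


a = config_fingerprint({"model": "m1", "max_steps": 20})
b = config_fingerprint({"max_steps": 20, "model": "m1"})
print(a == b)  # key order does not affect the fingerprint
```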