# system-architecture
## overview
WebScraper-OpenEnv is designed as a modular, dashboard-first RL environment with extensible APIs, MCP tools, and multi-model routing.
## high-level-topology
```text
Frontend Dashboard (React/Vite)
|
v
FastAPI Control Plane
- episode lifecycle
- action dispatch
- reward engine
- tool registry API
- settings + policy
|
+--> Agent Runtime
| - planner/navigator/extractor/verifier
| - memory manager
| - model router
|
+--> MCP Gateway
| - tool discovery
| - lazy install/load
| - schema + timeout + retries
|
+--> Search Layer
| - provider routing
| - query optimization
| - credibility scoring
|
+--> Memory Layer
| - short/working/long/shared
| - vector index + persistent storage
|
+--> Observability
- traces/logs/metrics/cost dashboard
```
## core-subsystems
### 1-control-plane
Responsibilities:
- reset/step/state APIs
- request validation
- action authorization and policy checks
- deterministic episode management
### 2-agent-runtime
Responsibilities:
- policy inference
- strategy execution
- fallback handling
- action explainability
### 3-tooling-plane-mcp
Responsibilities:
- dynamic tool registry
- server health checks
- lazy installation
- composition workflows
### 3-5-site-template-layer
Responsibilities:
- maintain inbuilt domain templates (`backend/app/sites/`)
- map instructions/assets to known site behavior
- provide reusable navigation goals/fields for planner and navigator agents
- expose template catalog through `/api/sites*` endpoints
### 4-data-plane
Responsibilities:
- HTML ingestion and chunking
- extraction and normalization
- verification and reconciliation
- output persistence
### 5-analytics-plane
Responsibilities:
- reward component logging
- model/token/cost accounting
- tool usage telemetry
- memory quality analytics
## processing-pipeline
1. `reset(task_id, seed)`
2. observation emitted
3. policy selects action
4. action executes (native/MCP/search/memory)
5. reward computed and logged
6. done check
7. repeat until terminal
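
The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: `ToyEnv`, `run_episode`, the observation fields, and the random placeholder reward are all assumptions made for the example. Only the `reset(task_id, seed)` entry point and the observe/act/reward/done cycle come from the steps listed above.

```python
from __future__ import annotations

import random
from dataclasses import dataclass, field


@dataclass
class ToyEnv:
    """Hypothetical stand-in for the environment (not the real class)."""

    max_steps: int = 5
    steps: int = 0
    rng: random.Random = field(default_factory=random.Random)

    def reset(self, task_id: str, seed: int) -> dict:
        # A private, seeded RNG makes the episode reproducible per seed.
        self.rng = random.Random(seed)
        self.steps = 0
        return {"task_id": task_id, "step": 0}

    def step(self, action: str) -> tuple[dict, float, bool]:
        self.steps += 1
        reward = round(self.rng.random(), 3)  # placeholder reward signal
        done = self.steps >= self.max_steps   # terminal check
        return {"step": self.steps, "last_action": action}, reward, done


def run_episode(env: ToyEnv, policy, task_id: str, seed: int) -> list[float]:
    """Run reset -> (policy -> step -> reward) until the episode is done."""
    obs = env.reset(task_id, seed)
    rewards: list[float] = []
    done = False
    while not done:
        action = policy(obs)
        obs, reward, done = env.step(action)
        rewards.append(reward)
    return rewards
```

Running `run_episode` twice with the same `seed` yields the same reward sequence, which is the property the deterministic episode management in the control plane is meant to guarantee.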
## batch-and-parallel-design
### batch
- large HTML split into semantic chunks
- chunk extraction batched with bounded size
- merge + dedupe + confidence rank
### parallel
- independent chunk tasks run concurrently
- search and verification can run in parallel branches
- configurable worker limits and queue priorities
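
One way to combine bounded batches with a worker limit is an `asyncio.Semaphore` around each chunk task. The sketch below assumes the HTML has already been split into chunks; `extract_chunk` is a hypothetical extractor that returns `(value, confidence)` pairs, and `batch_size` and `max_workers` are illustrative parameter names.

```python
from __future__ import annotations

import asyncio


def batched(items: list[str], size: int):
    """Yield consecutive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


async def extract_chunk(chunk: str) -> list[tuple[str, float]]:
    # Hypothetical extractor: one (value, confidence) pair per word.
    await asyncio.sleep(0)
    return [(word.lower(), len(word) / 10) for word in chunk.split()]


async def extract_all(chunks: list[str], batch_size: int = 2,
                      max_workers: int = 2) -> list[tuple[str, float]]:
    sem = asyncio.Semaphore(max_workers)  # caps concurrent chunk tasks

    async def guarded(chunk: str):
        async with sem:
            return await extract_chunk(chunk)

    best: dict[str, float] = {}
    for batch in batched(chunks, batch_size):
        # Chunks inside one batch are independent, so they run concurrently.
        results = await asyncio.gather(*(guarded(c) for c in batch))
        for pairs in results:
            for value, conf in pairs:
                # Merge + dedupe: keep the highest confidence per value.
                best[value] = max(conf, best.get(value, 0.0))
    # Confidence rank: highest first.
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)
```

Calling `asyncio.run(extract_all(["Price Title", "price URL"]))` returns one entry per distinct value, ordered by confidence.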
## queue-and-scheduler
Task queue supports:
- priority classes (`high`, `normal`, `low`)
- cancellation tokens
- retry policy with backoff
- dead-letter queue for repeated failures
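
These four features fit together in a small in-process queue. The sketch below is an assumption-laden illustration: `Task`, `TaskQueue`, `max_attempts`, and `base_delay` are hypothetical names, the cancellation token is reduced to a boolean flag, and the backoff is plain exponential doubling. A production deployment would more likely sit on a real broker.

```python
from __future__ import annotations

import heapq
import itertools
from dataclasses import dataclass

PRIORITY = {"high": 0, "normal": 1, "low": 2}  # lower number pops first


@dataclass
class Task:
    name: str
    priority: str = "normal"
    attempts: int = 0
    cancelled: bool = False  # stands in for a cancellation token


class TaskQueue:
    def __init__(self, max_attempts: int = 3, base_delay: float = 0.5):
        self._heap: list[tuple[int, int, Task]] = []
        self._counter = itertools.count()  # FIFO tie-break within a class
        self.max_attempts = max_attempts
        self.base_delay = base_delay
        self.dead_letter: list[Task] = []

    def push(self, task: Task) -> None:
        entry = (PRIORITY[task.priority], next(self._counter), task)
        heapq.heappush(self._heap, entry)

    def pop(self) -> Task | None:
        # Cancelled tasks are skipped lazily when they reach the front.
        while self._heap:
            _, _, task = heapq.heappop(self._heap)
            if not task.cancelled:
                return task
        return None

    def fail(self, task: Task) -> float | None:
        """Record a failure; return the backoff delay in seconds,
        or None when the task has been moved to the dead-letter queue."""
        task.attempts += 1
        if task.attempts >= self.max_attempts:
            self.dead_letter.append(task)
            return None
        self.push(task)
        return self.base_delay * 2 ** (task.attempts - 1)
```

The caller is expected to wait for the returned delay before processing the re-queued task; the queue itself does not sleep.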
## storage-architecture
- Episode state: in-memory + optional persistence
- Long-term memory: vector DB + metadata store
- Logs/metrics: append-only time-series-friendly sink
- Exports: JSON/CSV trace packs
## backend-folder-notes-template-system
```text
backend/app/sites/
- models.py # SiteTemplate dataclass
- templates.py # 50+ inbuilt site templates
- registry.py # list/get/match/serialize helpers
```
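
To make the three modules concrete, here is a rough sketch of how a template dataclass and the list/get/match/serialize helpers might look. The field names (`site_id`, `domains`, `fields`), the sample entry, and the suffix-based domain matching are assumptions for illustration; the actual `SiteTemplate` definition in `models.py` may differ.

```python
from __future__ import annotations

from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class SiteTemplate:
    # Field names are hypothetical, chosen only for this example.
    site_id: str
    domains: tuple[str, ...]
    fields: tuple[str, ...] = ()


_TEMPLATES = {
    "example-shop": SiteTemplate(
        site_id="example-shop",
        domains=("example.com",),
        fields=("title", "price"),
    ),
}


def list_templates() -> list[SiteTemplate]:
    return list(_TEMPLATES.values())


def get_template(site_id: str) -> SiteTemplate | None:
    return _TEMPLATES.get(site_id)


def match_template(url: str) -> SiteTemplate | None:
    """Return the first template whose domain equals or is a parent of the URL host."""
    host = urlparse(url).hostname or ""
    for template in _TEMPLATES.values():
        for domain in template.domains:
            if host == domain or host.endswith("." + domain):
                return template
    return None


def serialize(template: SiteTemplate) -> dict:
    """Convert a template to a JSON-friendly dict (tuples become lists)."""
    return {
        "site_id": template.site_id,
        "domains": list(template.domains),
        "fields": list(template.fields),
    }
```

With this shape, `match_template("https://www.example.com/item/1")` resolves to the `example-shop` entry, and `serialize` produces the kind of payload a template catalog endpoint could return.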
## reliability
- per-tool timeout and retry
- per-step safety budget
- circuit breaker for failing providers
- deterministic fallback chains
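
A circuit breaker and a deterministic fallback chain can be combined as in the sketch below. All names (`CircuitBreaker`, `call_with_fallback`, `failure_threshold`, `reset_after`) are hypothetical, and the half-open behaviour is simplified to "allow calls again once the cool-down has elapsed". Timeouts and retries per tool are left out to keep the example short.

```python
from __future__ import annotations

import time


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock  # injectable so tests can control time
        self.failures = 0
        self.opened_at: float | None = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True  # closed: calls pass through
        # Open: block until the cool-down has elapsed, then allow a trial.
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()  # (re)open the circuit


def call_with_fallback(providers, breakers, query):
    """Try providers in list order; the fixed order makes the chain deterministic."""
    for name, fn in providers:
        breaker = breakers[name]
        if not breaker.allow():
            continue  # skip providers whose circuit is open
        try:
            result = fn(query)
        except Exception:
            breaker.record_failure()
            continue
        breaker.record_success()
        return name, result
    raise RuntimeError("all providers failed or have open circuits")
```

Because the provider list is ordered, the same failure pattern always selects the same fallback, which keeps episodes reproducible.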
## security
- API key vaulting via env/config secrets
- MCP allowlist
- output sanitization
- redaction of sensitive tokens in logs
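
Log redaction can be done with a standard `logging.Filter`. The sketch below is illustrative only: the regular expression covers a few common `key=value` and `Authorization: Bearer ...` shapes and is not an exhaustive secret detector, and `redact` and `RedactingFilter` are hypothetical names.

```python
import logging
import re

# Matches e.g. "api_key=...", "token: ...", "Authorization: Bearer ...".
_SECRET = re.compile(
    r"(?i)(api[_-]?key|token|authorization)(\s*[=:]\s*)(?:bearer\s+)?\S+"
)


def redact(text: str) -> str:
    """Replace the secret value while keeping the key name for debugging."""
    return _SECRET.sub(r"\1\2[REDACTED]", text)


class RedactingFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        # Format the message first so secrets passed as args are covered too.
        record.msg = redact(record.getMessage())
        record.args = ()
        return True  # never drop the record, only rewrite it
```

Attaching the filter to a handler (`handler.addFilter(RedactingFilter())`) applies it to every record that handler emits, regardless of which logger produced it.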
## deployment
Single-container baseline:
- frontend static build served by API backend
- optional sidecars for DB/vector/MCP infra

Scale-out profile:
- separate API and worker pools
- managed vector DB
- queue-backed distributed execution
- central observability backend
## compatibility-goals
- local dev mode with minimal dependencies
- cloud mode with managed infra
- optional self-hosted LLM endpoints
## future-architecture-extensions
- distributed multi-agent graph execution
- adaptive autoscaling by queue pressure
- global memory federation across projects
## api-reference-alignment
| architecture-plane | primary-endpoints |
| --- | --- |
| control-plane | `/api/health`, `/api/ready`, `/api/settings`, `/api/tasks` |
| episode-runtime | `/api/episode/reset`, `/api/episode/step`, `/api/episode/state/{episode_id}` |
| agent-runtime | `/api/agents/*`, `/api/providers/*` |
| tooling-memory | `/api/tools/*`, `/api/plugins/*`, `/api/memory/*` |
| scraping-runtime | `/api/scrape/stream`, `/api/scrape/{session_id}/result`, `/ws/episode/{episode_id}` |

Use `api-reference.md` as the authoritative endpoint inventory.
## document-metadata
| key | value |
| --- | --- |
| document | `architecture.md` |
| status | active |
## document-flow
```mermaid
flowchart TD
A[document] --> B[key-sections]
B --> C[implementation]
B --> D[operations]
B --> E[validation]
```