| # system-architecture | |
| ## overview | |
| WebScraper-OpenEnv is designed as a modular, dashboard-first RL environment with extensible APIs, MCP tools, and multi-model routing. | |
| ## high-level-topology | |
| ```text | |
| Frontend Dashboard (React/Vite) | |
|           | | |
|           v | |
| FastAPI Control Plane | |
|   - episode lifecycle | |
|   - action dispatch | |
|   - reward engine | |
|   - tool registry API | |
|   - settings + policy | |
|           | | |
|           +--> Agent Runtime | |
|           |      - planner/navigator/extractor/verifier | |
|           |      - memory manager | |
|           |      - model router | |
|           | | |
|           +--> MCP Gateway | |
|           |      - tool discovery | |
|           |      - lazy install/load | |
|           |      - schema + timeout + retries | |
|           | | |
|           +--> Search Layer | |
|           |      - provider routing | |
|           |      - query optimization | |
|           |      - credibility scoring | |
|           | | |
|           +--> Memory Layer | |
|           |      - short/working/long/shared | |
|           |      - vector index + persistent storage | |
|           | | |
|           +--> Observability | |
|                  - traces/logs/metrics/cost dashboard | |
| ``` | |
| ## core-subsystems | |
| ### 1-control-plane | |
| Responsibilities: | |
| - reset/step/state APIs | |
| - request validation | |
| - action authorization and policy checks | |
| - deterministic episode management | |
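The validation and policy responsibilities above could look roughly like the sketch below. It is only an illustration: `StepRequest`, `PolicyError`, `validate_step`, and the action names in `ALLOWED_ACTIONS` are hypothetical and are not taken from the project's code.

```python
from dataclasses import dataclass

# Assumed action names, for illustration only.
ALLOWED_ACTIONS = {"navigate", "extract", "search", "submit"}


@dataclass(frozen=True)
class StepRequest:
    episode_id: str
    action: str
    params: dict


class PolicyError(ValueError):
    """Raised when a well-formed action is disabled by policy."""


def validate_step(raw: dict, blocked_actions: frozenset = frozenset()) -> StepRequest:
    # 1. Request validation: required fields must be present.
    missing = {"episode_id", "action"} - set(raw)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    # 2. Action authorization: only known actions are accepted.
    action = raw["action"]
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    # 3. Policy check: settings may disable an otherwise valid action.
    if action in blocked_actions:
        raise PolicyError(f"action {action!r} is disabled by policy")
    return StepRequest(str(raw["episode_id"]), action, dict(raw.get("params", {})))
```

Keeping validation, authorization, and policy as separate checks lets the API report which layer rejected a request.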
| ### 2-agent-runtime | |
| Responsibilities: | |
| - policy inference | |
| - strategy execution | |
| - fallback handling | |
| - action explainability | |
| ### 3-tooling-plane-mcp | |
| Responsibilities: | |
| - dynamic tool registry | |
| - server health checks | |
| - lazy installation | |
| - composition workflows | |
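A dynamic registry with lazy loading and simple composition might be sketched as below. `LazyToolRegistry` and its methods are hypothetical names for illustration. The sketch does not use any MCP client API; a "tool" here is just a Python callable produced by a loader function.

```python
class LazyToolRegistry:
    """Registers tool loaders and runs each loader only on first use."""

    def __init__(self) -> None:
        self._loaders = {}  # name -> zero-argument loader
        self._loaded = {}   # name -> loaded tool callable

    def register(self, name, loader) -> None:
        self._loaders[name] = loader

    def names(self) -> list:
        return sorted(self._loaders)

    def is_loaded(self, name) -> bool:
        return name in self._loaded

    def get(self, name):
        if name not in self._loaded:
            if name not in self._loaders:
                raise KeyError(f"unknown tool: {name}")
            # Lazy step: installation/loading happens here, not at registration.
            self._loaded[name] = self._loaders[name]()
        return self._loaded[name]

    def compose(self, *names):
        """Chain tools so the output of one feeds the next."""
        tools = [self.get(n) for n in names]

        def pipeline(value):
            for tool in tools:
                value = tool(value)
            return value

        return pipeline
```

For example, registering `strip` and `upper` loaders and calling `registry.compose("strip", "upper")("  price  ")` loads both tools on demand and returns `"PRICE"`.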
| ### 3-5-site-template-layer | |
| Responsibilities: | |
| - maintain inbuilt domain templates (`backend/app/sites/`) | |
| - map instructions/assets to known site behavior | |
| - provide reusable navigation goals/fields for planner and navigator agents | |
| - expose template catalog through `/api/sites*` endpoints | |
| ### 4-data-plane | |
| Responsibilities: | |
| - HTML ingestion and chunking | |
| - extraction and normalization | |
| - verification and reconciliation | |
| - output persistence | |
| ### 5-analytics-plane | |
| Responsibilities: | |
| - reward component logging | |
| - model/token/cost accounting | |
| - tool usage telemetry | |
| - memory quality analytics | |
| ## processing-pipeline | |
| 1. `reset(task_id, seed)` | |
| 2. observation emitted | |
| 3. policy selects action | |
| 4. action executes (native/MCP/search/memory) | |
| 5. reward computed and logged | |
| 6. done check | |
| 7. repeat until terminal | |
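The seven steps above can be sketched as a plain loop. This is a minimal illustration, not the project's environment: `ToyEpisodeEnv`, `StepResult`, `choose_action`, and the random placeholder reward are all assumptions made for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class StepResult:
    observation: dict
    reward: float
    done: bool


class ToyEpisodeEnv:
    """Hypothetical stand-in for the episode runtime."""

    def __init__(self, max_steps: int = 5) -> None:
        self.max_steps = max_steps
        self.steps = 0
        self.rng = random.Random(0)

    def reset(self, task_id: str, seed: int) -> dict:
        # Re-seeding the RNG is what makes the episode reproducible.
        self.rng = random.Random(seed)
        self.steps = 0
        return {"task_id": task_id, "step": 0}

    def step(self, action: str) -> StepResult:
        self.steps += 1
        reward = round(self.rng.random(), 3)  # placeholder reward signal
        done = action == "submit" or self.steps >= self.max_steps
        return StepResult({"step": self.steps, "last_action": action}, reward, done)


def choose_action(observation: dict) -> str:
    # Trivial fixed policy: navigate twice, extract once, then submit.
    plan = ["navigate", "navigate", "extract", "submit"]
    return plan[min(observation["step"], len(plan) - 1)]


def run_episode(env: ToyEpisodeEnv, task_id: str, seed: int) -> list:
    observation = env.reset(task_id, seed)      # steps 1-2
    reward_log = []
    done = False
    while not done:                              # step 7
        action = choose_action(observation)      # step 3
        result = env.step(action)                # step 4
        reward_log.append(result.reward)         # step 5
        observation, done = result.observation, result.done  # step 6
    return reward_log
```

Running the same `task_id` and `seed` twice yields the same reward log, which is the property that deterministic episode management relies on.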
| ## batch-and-parallel-design | |
| ### batch | |
| - large HTML split into semantic chunks | |
| - chunk extraction batched with bounded size | |
| - merge + dedupe + confidence rank | |
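One possible shape for the batch path is sketched below. The splitter is deliberately naive (it splits on blank lines rather than on semantic boundaries), and the record keys `field`, `value`, and `confidence` are assumptions for the example.

```python
from itertools import islice


def split_into_chunks(html: str, max_chars: int = 200) -> list:
    """Naive splitter on blank lines; a real one would follow document structure."""
    chunks, current = [], ""
    for block in html.split("\n\n"):
        if current and len(current) + len(block) > max_chars:
            chunks.append(current)
            current = ""
        current = f"{current}\n\n{block}" if current else block
    if current:
        chunks.append(current)
    return chunks


def batched(items, size: int):
    """Yield lists of at most `size` items, so each batch has a bounded size."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


def merge_results(records: list) -> list:
    """Dedupe on (field, value), keep the highest confidence, rank descending."""
    best = {}
    for record in records:
        key = (record["field"], record["value"])
        if key not in best or record["confidence"] > best[key]["confidence"]:
            best[key] = record
    return sorted(best.values(), key=lambda r: r["confidence"], reverse=True)
```

Bounding the batch size keeps memory use and per-request payloads predictable regardless of how large the source page is.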
| ### parallel | |
| - independent chunk tasks run concurrently | |
| - search and verification can run in parallel branches | |
| - configurable worker limits and queue priorities | |
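A configurable worker limit can be sketched with `asyncio.Semaphore`. This is a minimal sketch: `extract_chunk` is a hypothetical placeholder for the real extraction call, and `run_bounded` returns the observed peak concurrency only to make the limit visible.

```python
import asyncio


async def extract_chunk(chunk: str) -> dict:
    await asyncio.sleep(0.01)  # stands in for a model or parser call
    return {"chunk": chunk, "length": len(chunk)}


async def run_bounded(chunks: list, max_workers: int = 3):
    semaphore = asyncio.Semaphore(max_workers)
    active = 0
    peak = 0

    async def worker(chunk: str) -> dict:
        nonlocal active, peak
        async with semaphore:  # at most max_workers tasks pass this point
            active += 1
            peak = max(peak, active)
            try:
                return await extract_chunk(chunk)
            finally:
                active -= 1

    # gather() returns results in input order, even if tasks finish out of order.
    results = await asyncio.gather(*(worker(c) for c in chunks))
    return results, peak
```

Usage: `results, peak = asyncio.run(run_bounded(chunks, max_workers=2))`. The independent search and verification branches could be awaited the same way with a second `gather` call.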
| ## queue-and-scheduler | |
| Task queue supports: | |
| - priority classes (`high`, `normal`, `low`) | |
| - cancellation tokens | |
| - retry policy with backoff | |
| - dead-letter queue for repeated failures | |
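The four queue features above fit together as in the following in-process sketch. `TaskQueue`, `QueuedTask`, and `CancellationToken` are hypothetical names. Backoff delays are recorded rather than slept so the example stays fast; a real worker would wait for each delay before retrying.

```python
import heapq
import itertools
from dataclasses import dataclass, field
from typing import Callable

PRIORITY = {"high": 0, "normal": 1, "low": 2}


@dataclass
class CancellationToken:
    cancelled: bool = False

    def cancel(self) -> None:
        self.cancelled = True


@dataclass(order=True)
class QueuedTask:
    rank: int  # priority class; lower runs first
    seq: int   # tie-breaker that preserves submission order
    name: str = field(compare=False)
    run: Callable[[], object] = field(compare=False)
    token: CancellationToken = field(compare=False, default_factory=CancellationToken)
    attempts: int = field(compare=False, default=0)


class TaskQueue:
    def __init__(self, max_attempts: int = 3, base_delay: float = 0.5) -> None:
        self._heap = []
        self._seq = itertools.count()
        self.max_attempts = max_attempts
        self.base_delay = base_delay
        self.dead_letter = []  # names of tasks that exhausted their retries
        self.delays = []       # exponential backoff delays a worker would sleep for

    def submit(self, name, run, priority="normal") -> CancellationToken:
        task = QueuedTask(PRIORITY[priority], next(self._seq), name, run)
        heapq.heappush(self._heap, task)
        return task.token

    def drain(self) -> list:
        completed = []
        while self._heap:
            task = heapq.heappop(self._heap)
            if task.token.cancelled:
                continue  # cancelled before it started
            try:
                task.run()
                completed.append(task.name)
            except Exception:
                task.attempts += 1
                if task.attempts >= self.max_attempts:
                    self.dead_letter.append(task.name)
                else:
                    self.delays.append(self.base_delay * 2 ** (task.attempts - 1))
                    task.seq = next(self._seq)  # requeue behind peers of equal priority
                    heapq.heappush(self._heap, task)
        return completed
```

In the scale-out profile the same semantics would normally come from a queue service rather than an in-memory heap.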
| ## storage-architecture | |
| - Episode state: in-memory + optional persistence | |
| - Long-term memory: vector DB + metadata store | |
| - Logs/metrics: append-only time-series-friendly sink | |
| - Exports: JSON/CSV trace packs | |
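The JSON/CSV export could be as small as the sketch below. The step fields (`step`, `action`, `reward`) and the function names are assumptions; the actual trace pack schema is not specified in this document.

```python
import csv
import io
import json


def export_trace_json(steps: list) -> str:
    """Serialize steps with sorted keys so repeated exports are byte-identical."""
    return json.dumps({"steps": steps}, indent=2, sort_keys=True)


def export_trace_csv(steps: list) -> str:
    """Flatten steps into CSV; columns are the union of all step keys."""
    fieldnames = sorted({key for step in steps for key in step})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(steps)  # keys missing from a step are written as empty cells
    return buffer.getvalue()
```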
| ## backend-folder-notes-template-system | |
| ```text | |
| backend/app/sites/ | |
| - models.py # SiteTemplate dataclass | |
| - templates.py # 50+ inbuilt site templates | |
| - registry.py # list/get/match/serialize helpers | |
| ``` | |
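To show how the three modules might relate, here is a compact sketch of a template dataclass with list/get/match/serialize helpers. Every field name, the two example templates, and the hostname-suffix matching rule are assumptions; the real definitions live in `models.py`, `templates.py`, and `registry.py`.

```python
from dataclasses import asdict, dataclass
from typing import Optional
from urllib.parse import urlparse


@dataclass(frozen=True)
class SiteTemplate:
    # Field names are illustrative, not the project's actual schema.
    site_id: str
    domains: tuple
    navigation_goals: tuple = ()
    fields: tuple = ()


_TEMPLATES = {
    t.site_id: t
    for t in (
        SiteTemplate("example-shop", ("shop.example.com",), ("open product page",), ("title", "price")),
        SiteTemplate("example-news", ("news.example.org",), ("open article",), ("headline", "author")),
    )
}


def list_templates() -> list:
    return sorted(_TEMPLATES.values(), key=lambda t: t.site_id)


def get_template(site_id: str) -> Optional[SiteTemplate]:
    return _TEMPLATES.get(site_id)


def match_template(url: str) -> Optional[SiteTemplate]:
    """Match on exact hostname or any subdomain of a template domain."""
    host = (urlparse(url).hostname or "").lower()
    for template in _TEMPLATES.values():
        if any(host == d or host.endswith("." + d) for d in template.domains):
            return template
    return None


def serialize_template(template: SiteTemplate) -> dict:
    return asdict(template)
```

A planner could call `match_template(url)` first and fall back to generic navigation when it returns `None`.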
| ## reliability | |
| - per-tool timeout and retry | |
| - per-step safety budget | |
| - circuit breaker for failing providers | |
| - deterministic fallback chains | |
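A circuit breaker combined with a fixed-order fallback chain can be sketched as follows. The class and function names, thresholds, and the cool-down behavior are assumptions chosen for the example, not the project's actual settings.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: permit trial calls once the cool-down has elapsed.
        return self.clock() - self.opened_at >= self.reset_after

    def record(self, ok: bool) -> None:
        if ok:
            self.failures, self.opened_at = 0, None
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


def call_with_fallback(providers, breakers, query):
    """Try providers in a fixed order, skipping any whose breaker is open."""
    for name, fn in providers:
        breaker = breakers[name]
        if not breaker.allow():
            continue
        try:
            result = fn(query)
        except Exception:
            breaker.record(False)
            continue
        breaker.record(True)
        return name, result
    raise RuntimeError("all providers failed or are circuit-broken")
```

Because the provider order is a plain list, the same failure pattern always produces the same fallback path, which is what makes the chain deterministic.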
| ## security | |
| - API key vaulting via env/config secrets | |
| - MCP allowlist | |
| - output sanitization | |
| - redaction of sensitive tokens in logs | |
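Log redaction can be implemented as a standard `logging.Filter`. The sketch below assumes two secret shapes (`Bearer <token>` headers and `key=value` style credentials); the patterns are illustrative and would need extending for whatever secrets the deployment actually handles.

```python
import logging
import re

# Assumed patterns; each keeps the label in group 1 and drops the secret itself.
_SECRET_PATTERNS = [
    re.compile(r"(Bearer\s+)[A-Za-z0-9._\-]+"),
    re.compile(r"((?:api[_-]?key|token|secret)\s*[=:]\s*)[^\s,;]+", re.IGNORECASE),
]


def redact(text: str) -> str:
    for pattern in _SECRET_PATTERNS:
        text = pattern.sub(r"\1[REDACTED]", text)
    return text


class RedactingFilter(logging.Filter):
    """Rewrites each record's message with secrets masked before it is emitted."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Format first so secrets passed as %-style args are also covered.
        record.msg = redact(record.getMessage())
        record.args = ()
        return True
```

Attach it with `handler.addFilter(RedactingFilter())` on every handler that writes to a shared sink.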
| ## deployment | |
| Single-container baseline: | |
| - frontend static build served by API backend | |
| - optional sidecars for DB/vector/MCP infra | |
| Scale-out profile: | |
| - separate API and worker pools | |
| - managed vector DB | |
| - queue-backed distributed execution | |
| - central observability backend | |
| ## compatibility-goals | |
| - local dev mode with minimal dependencies | |
| - cloud mode with managed infra | |
| - optional self-hosted LLM endpoints | |
| ## future-architecture-extensions | |
| - distributed multi-agent graph execution | |
| - adaptive autoscaling by queue pressure | |
| - global memory federation across projects | |
| ## api-reference-alignment | |
| | architecture-plane | primary-endpoints | | |
| | --- | --- | | |
| | control-plane | `/api/health`, `/api/ready`, `/api/settings`, `/api/tasks` | | |
| | episode-runtime | `/api/episode/reset`, `/api/episode/step`, `/api/episode/state/{episode_id}` | | |
| | agent-runtime | `/api/agents/*`, `/api/providers/*` | | |
| | tooling-memory | `/api/tools/*`, `/api/plugins/*`, `/api/memory/*` | | |
| | scraping-runtime | `/api/scrape/stream`, `/api/scrape/{session_id}/result`, `/ws/episode/{episode_id}` | | |
| Use `api-reference.md` as the authoritative endpoint inventory. | |
| ## document-metadata | |
| | key | value | | |
| | --- | --- | | |
| | document | `architecture.md` | | |
| | status | active | | |
| ## document-flow | |
| ```mermaid | |
| flowchart TD | |
| A[document] --> B[key-sections] | |
| B --> C[implementation] | |
| B --> D[operations] | |
| B --> E[validation] | |
| ``` | |