scrapeRL / docs /architecture.md

NeerajCodz

docs: init proto

24f0bf0 about 1 month ago

	# system-architecture

	## overview

	WebScraper-OpenEnv is designed as a modular, dashboard-first RL environment with extensible APIs, MCP tools, and multi-model routing.

	## high-level-topology

	```text
	Frontend Dashboard (React/Vite)
	        |
	        v
	FastAPI Control Plane
	  - episode lifecycle
	  - action dispatch
	  - reward engine
	  - tool registry API
	  - settings + policy
	        |
	        +--> Agent Runtime
	        |      - planner/navigator/extractor/verifier
	        |      - memory manager
	        |      - model router
	        |
	        +--> MCP Gateway
	        |      - tool discovery
	        |      - lazy install/load
	        |      - schema + timeout + retries
	        |
	        +--> Search Layer
	        |      - provider routing
	        |      - query optimization
	        |      - credibility scoring
	        |
	        +--> Memory Layer
	        |      - short/working/long/shared
	        |      - vector index + persistent storage
	        |
	        +--> Observability
	               - traces/logs/metrics/cost dashboard
	```

	## core-subsystems

	### 1-control-plane

	Responsibilities:

	- reset/step/state APIs
	- request validation
	- action authorization and policy checks
	- deterministic episode management

	### 2-agent-runtime

	Responsibilities:

	- policy inference
	- strategy execution
	- fallback handling
	- action explainability

	### 3-tooling-plane-mcp

	Responsibilities:

	- dynamic tool registry
	- server health checks
	- lazy installation
	- composition workflows
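
The dynamic registry and lazy-loading responsibilities above could look roughly like the sketch below. `LazyToolRegistry` and its method names are hypothetical and are not taken from the project code; the point is only that a tool is registered as a loader and materialized on first call.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

ToolFn = Callable[..., Any]


@dataclass
class LazyToolRegistry:
    """Hypothetical registry: tools are stored as loaders and built on first use."""

    _loaders: Dict[str, Callable[[], ToolFn]] = field(default_factory=dict)
    _loaded: Dict[str, ToolFn] = field(default_factory=dict)

    def register(self, name: str, loader: Callable[[], ToolFn]) -> None:
        # Registration is cheap: nothing is installed or imported yet.
        self._loaders[name] = loader

    def discover(self) -> List[str]:
        # Tool discovery returns every registered name, loaded or not.
        return sorted(self._loaders)

    def is_loaded(self, name: str) -> bool:
        return name in self._loaded

    def call(self, name: str, *args: Any, **kwargs: Any) -> Any:
        if name not in self._loaders:
            raise KeyError(f"unknown tool: {name}")
        if name not in self._loaded:
            # Lazy step: run the loader only when the tool is first needed.
            self._loaded[name] = self._loaders[name]()
        return self._loaded[name](*args, **kwargs)
```

A real gateway would also attach schema validation, timeouts, and health checks around `call`; those are omitted here to keep the lazy-load idea visible.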

	### 3-5-site-template-layer

	Responsibilities:

	- maintain inbuilt domain templates (`backend/app/sites/`)
	- map instructions/assets to known site behavior
	- provide reusable navigation goals/fields for planner and navigator agents
	- expose template catalog through `/api/sites*` endpoints

	### 4-data-plane

	Responsibilities:

	- HTML ingestion and chunking
	- extraction and normalization
	- verification and reconciliation
	- output persistence

	### 5-analytics-plane

	Responsibilities:

	- reward component logging
	- model/token/cost accounting
	- tool usage telemetry
	- memory quality analytics

	## processing-pipeline

	1. `reset(task_id, seed)`
	2. observation emitted
	3. policy selects action
	4. action executes (native/MCP/search/memory)
	5. reward computed and logged
	6. done check
	7. repeat until terminal
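
The loop above can be sketched with a toy environment. `ToyEpisode`, `Observation`, the action names, and the reward values are all illustrative assumptions; only the `reset(task_id, seed)` signature and the step order come from the list.

```python
import random
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Observation:
    step: int
    remaining: int  # hypothetical: number of fields still to extract


class ToyEpisode:
    """Hypothetical stand-in for the environment, not the project's class."""

    def __init__(self, max_steps: int = 10) -> None:
        self.max_steps = max_steps

    def reset(self, task_id: str, seed: int) -> Observation:
        # Seeding from (task_id, seed) makes the episode reproducible.
        self._rng = random.Random(f"{task_id}:{seed}")
        self._step = 0
        self._remaining = self._rng.randint(2, 5)
        return Observation(self._step, self._remaining)

    def step(self, action: str) -> Tuple[Observation, float, bool]:
        self._step += 1
        reward = -0.01  # small per-step cost (assumed value)
        if action == "extract" and self._remaining > 0:
            self._remaining -= 1
            reward += 1.0
        done = self._remaining == 0 or self._step >= self.max_steps
        return Observation(self._step, self._remaining), reward, done


def run_episode(env: ToyEpisode, task_id: str, seed: int) -> float:
    obs = env.reset(task_id, seed)          # steps 1-2: reset, observation
    total, done = 0.0, False
    while not done:                         # step 7: repeat until terminal
        action = "extract" if obs.remaining > 0 else "noop"  # step 3: policy
        obs, reward, done = env.step(action)                 # steps 4-6
        total += reward
    return total
```

Running `run_episode` twice with the same `task_id` and `seed` yields the same total reward, which is the property the deterministic episode management is meant to guarantee.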

	## batch-and-parallel-design

	### batch

	- large HTML split into semantic chunks
	- chunk extraction batched with bounded size
	- merge + dedupe + confidence rank
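
A minimal sketch of the bounded batching and the merge step, assuming chunks already exist and that each extracted record is a dict with `field`, `value`, and `confidence` keys (these key names are assumptions, not the project's schema):

```python
from typing import Dict, Iterable, Iterator, List


def batched(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def merge_dedupe_rank(records: Iterable[Dict]) -> List[Dict]:
    """Keep the highest-confidence record per (field, value), best first."""
    best: Dict[tuple, Dict] = {}
    for rec in records:
        key = (rec["field"], rec["value"])
        if key not in best or rec["confidence"] > best[key]["confidence"]:
            best[key] = rec
    return sorted(best.values(), key=lambda r: r["confidence"], reverse=True)
```

Deduplicating on the `(field, value)` pair is one possible choice; a stricter variant could normalize values before comparing them.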

	### parallel

	- independent chunk tasks run concurrently
	- search and verification can run in parallel branches
	- configurable worker limits and queue priorities
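
One way to run independent chunk tasks concurrently under a worker limit is an `asyncio.Semaphore`. The function name `run_bounded` and the `worker` callback are hypothetical; this is a sketch of the technique, not the project's scheduler.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence, TypeVar

T = TypeVar("T")


async def run_bounded(
    chunks: Sequence[str],
    worker: Callable[[str], Awaitable[T]],
    max_workers: int = 4,
) -> List[T]:
    """Run `worker` over all chunks with at most `max_workers` in flight."""
    sem = asyncio.Semaphore(max_workers)

    async def guarded(chunk: str) -> T:
        async with sem:  # blocks while max_workers tasks are already running
            return await worker(chunk)

    # gather preserves input order in its result list.
    return list(await asyncio.gather(*(guarded(c) for c in chunks)))
```

Usage from synchronous code would be along the lines of `asyncio.run(run_bounded(chunks, extract_chunk, max_workers=8))`, where `extract_chunk` is whatever coroutine performs the extraction.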

	## queue-and-scheduler

	Task queue supports:

	- priority classes (`high`, `normal`, `low`)
	- cancellation tokens
	- retry policy with backoff
	- dead-letter queue for repeated failures
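
The four queue features can be illustrated with a small in-process sketch. `Task`, `TaskQueue`, the `cancelled` flag (a simple stand-in for a cancellation token), and the backoff constants are all assumptions made for illustration.

```python
import heapq
import itertools
from dataclasses import dataclass
from typing import Callable, List, Tuple

PRIORITY = {"high": 0, "normal": 1, "low": 2}


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff: base * 2**attempt, capped at `cap` seconds."""
    return min(cap, base * (2 ** attempt))


@dataclass
class Task:
    name: str
    fn: Callable[[], object]
    priority: str = "normal"
    attempts: int = 0
    cancelled: bool = False  # stand-in for a cancellation token


class TaskQueue:
    def __init__(self, max_attempts: int = 3) -> None:
        self.max_attempts = max_attempts
        self.dead_letter: List[Task] = []
        self._heap: List[Tuple[int, int, Task]] = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO order

    def submit(self, task: Task) -> None:
        heapq.heappush(
            self._heap, (PRIORITY[task.priority], next(self._counter), task)
        )

    def run_all(self) -> List[str]:
        completed: List[str] = []
        while self._heap:
            _, _, task = heapq.heappop(self._heap)
            if task.cancelled:
                continue  # cancelled tasks are dropped without running
            try:
                task.fn()
                completed.append(task.name)
            except Exception:
                task.attempts += 1
                if task.attempts >= self.max_attempts:
                    self.dead_letter.append(task)  # repeated failure
                else:
                    # A real scheduler would wait backoff_delay(task.attempts)
                    # before re-queueing; the wait is skipped in this sketch.
                    self.submit(task)
        return completed
```

With the defaults above, `backoff_delay` grows as 0.5 s, 1 s, 2 s, and so on until it reaches the 30 s cap.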

	## storage-architecture

	- Episode state: in-memory + optional persistence
	- Long-term memory: vector DB + metadata store
	- Logs/metrics: append-only time-series-friendly sink
	- Exports: JSON/CSV trace packs
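
As a rough illustration of the JSON/CSV export path, the sketch below serializes a list of step records with the standard library. The record keys (`step`, `action`, `reward`) and the function names are assumptions; the actual trace pack layout is not specified here.

```python
import csv
import io
import json
from typing import Dict, List


def export_trace_json(steps: List[Dict]) -> str:
    """Serialize step records as pretty-printed JSON with stable key order."""
    return json.dumps(steps, indent=2, sort_keys=True)


def export_trace_csv(steps: List[Dict]) -> str:
    """Serialize step records as CSV; columns are the union of all keys."""
    fieldnames = sorted({key for step in steps for key in step})
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(steps)  # missing keys are written as empty cells
    return buf.getvalue()
```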

	## backend-folder-notes-template-system

	```text
	backend/app/sites/
	- models.py # SiteTemplate dataclass
	- templates.py # 50+ inbuilt site templates
	- registry.py # list/get/match/serialize helpers
	```
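
A hedged sketch of how the three modules might fit together. The `SiteTemplate` field names (`site_id`, `domains`, `fields`), the two example templates, and the helper function names are hypothetical; only the dataclass name and the list/get/match/serialize roles come from the listing above.

```python
from dataclasses import asdict, dataclass
from typing import Dict, List, Optional, Tuple
from urllib.parse import urlparse


@dataclass(frozen=True)
class SiteTemplate:
    site_id: str
    domains: Tuple[str, ...]
    fields: Tuple[str, ...] = ()  # hypothetical: fields the extractor targets


# Placeholder catalog; the real templates.py is described as holding 50+ entries.
_TEMPLATES: Dict[str, SiteTemplate] = {
    t.site_id: t
    for t in (
        SiteTemplate("example-shop", ("shop.example.com",), ("title", "price")),
        SiteTemplate("example-docs", ("docs.example.org",), ("heading", "body")),
    )
}


def list_templates() -> List[SiteTemplate]:
    return sorted(_TEMPLATES.values(), key=lambda t: t.site_id)


def get_template(site_id: str) -> Optional[SiteTemplate]:
    return _TEMPLATES.get(site_id)


def match_template(url: str) -> Optional[SiteTemplate]:
    """Return the first template whose domain equals or is a parent of the host."""
    host = urlparse(url).hostname or ""
    for template in list_templates():
        for domain in template.domains:
            if host == domain or host.endswith("." + domain):
                return template
    return None


def serialize_template(template: SiteTemplate) -> dict:
    return asdict(template)
```

A planner could call `match_template(url)` before navigation and fall back to generic extraction when it returns `None`.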

	## reliability

	- per-tool timeout and retry
	- per-step safety budget
	- circuit breaker for failing providers
	- deterministic fallback chains
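
The circuit breaker and fallback chain can be sketched together. `CircuitBreaker`, `call_with_fallback`, and the threshold values are illustrative assumptions rather than the project's implementation.

```python
import time
from typing import Callable, Optional, Sequence, Tuple


class CircuitBreaker:
    """Opens after N consecutive failures; allows a trial call after a cooldown."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self._opened_at is None:
            return True  # closed: calls pass through
        # Open: only allow a trial call once the cooldown has elapsed.
        return self._clock() - self._opened_at >= self.reset_after

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()


Provider = Tuple[str, Callable[[], str], CircuitBreaker]


def call_with_fallback(providers: Sequence[Provider]) -> Tuple[str, str]:
    """Try providers in a fixed order, skipping any whose breaker is open."""
    for name, fn, breaker in providers:
        if not breaker.allow():
            continue
        try:
            result = fn()
        except Exception:
            breaker.record_failure()
            continue
        breaker.record_success()
        return name, result
    raise RuntimeError("all providers failed or are unavailable")
```

Because the provider order is fixed, the same failure pattern always selects the same fallback, which keeps the chain deterministic.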

	## security

	- API key vaulting via env/config secrets
	- MCP allowlist
	- output sanitization
	- redaction of sensitive tokens in logs
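
Log redaction could be approached with a `logging.Filter` that rewrites messages before they reach any handler. The patterns below are illustrative and far from exhaustive; `redact` and `RedactingFilter` are hypothetical names.

```python
import logging
import re

# Illustrative patterns only; a real deployment would need a broader set.
_SECRET_PATTERNS = [
    re.compile(r"(?i)(authorization:\s*bearer\s+)\S+"),
    re.compile(r"(?i)((?:api[_-]?key|token|secret)\s*[=:]\s*)\S+"),
]


def redact(text: str) -> str:
    """Replace the secret part of each match with a fixed marker."""
    for pattern in _SECRET_PATTERNS:
        text = pattern.sub(r"\1[REDACTED]", text)
    return text


class RedactingFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        # Format the message first so secrets passed as args are covered too.
        record.msg = redact(record.getMessage())
        record.args = ()
        return True  # never drop the record, only rewrite it
```

The filter would be attached with `logger.addFilter(RedactingFilter())` on each logger (or handler) that may see request headers or provider credentials.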

	## deployment

	Single-container baseline:

	- frontend static build served by API backend
	- optional sidecars for DB/vector/MCP infra

	Scale-out profile:

	- separate API and worker pools
	- managed vector DB
	- queue-backed distributed execution
	- central observability backend

	## compatibility-goals

	- local dev mode with minimal dependencies
	- cloud mode with managed infra
	- optional self-hosted LLM endpoints

	## future-architecture-extensions

	- distributed multi-agent graph execution
	- adaptive autoscaling by queue pressure
	- global memory federation across projects

	## api-reference-alignment

	| architecture-plane | primary-endpoints |
	| --- | --- |
	| control-plane | `/api/health`, `/api/ready`, `/api/settings`, `/api/tasks` |
	| episode-runtime | `/api/episode/reset`, `/api/episode/step`, `/api/episode/state/{episode_id}` |
	| agent-runtime | `/api/agents/`, `/api/providers/` |
	| tooling-memory | `/api/tools/`, `/api/plugins/`, `/api/memory/*` |
	| scraping-runtime | `/api/scrape/stream`, `/api/scrape/{session_id}/result`, `/ws/episode/{episode_id}` |

	Use `api-reference.md` as the authoritative endpoint inventory.

	## document-metadata

	| key | value |
	| --- | --- |
	| document | `architecture.md` |
	| status | active |

	## document-flow

	```mermaid
	flowchart TD
	A[document] --> B[key-sections]
	B --> C[implementation]
	B --> D[operations]
	B --> E[validation]
	```