# system-architecture

## overview

WebScraper-OpenEnv is a modular, dashboard-first reinforcement learning (RL) environment for web scraping, with extensible APIs, MCP tools, and multi-model routing.

## high-level-topology

```text
Frontend Dashboard (React/Vite)
        |
        v
FastAPI Control Plane
  - episode lifecycle
  - action dispatch
  - reward engine
  - tool registry API
  - settings + policy
        |
        +--> Agent Runtime
        |      - planner/navigator/extractor/verifier
        |      - memory manager
        |      - model router
        |
        +--> MCP Gateway
        |      - tool discovery
        |      - lazy install/load
        |      - schema + timeout + retries
        |
        +--> Search Layer
        |      - provider routing
        |      - query optimization
        |      - credibility scoring
        |
        +--> Memory Layer
        |      - short/working/long/shared
        |      - vector index + persistent storage
        |
        +--> Observability
               - traces/logs/metrics/cost dashboard
```

## core-subsystems

### 1-control-plane

Responsibilities:

- reset/step/state APIs
- request validation
- action authorization and policy checks
- deterministic episode management

### 2-agent-runtime

Responsibilities:

- policy inference
- strategy execution
- fallback handling
- action explainability

### 3-tooling-plane-mcp

Responsibilities:

- dynamic tool registry
- server health checks
- lazy installation
- composition workflows

### 3-5-site-template-layer

Responsibilities:

- maintain inbuilt domain templates (`backend/app/sites/`)
- map instructions/assets to known site behavior
- provide reusable navigation goals/fields for planner and navigator agents
- expose template catalog through `/api/sites*` endpoints

### 4-data-plane

Responsibilities:

- HTML ingestion and chunking
- extraction and normalization
- verification and reconciliation
- output persistence

### 5-analytics-plane

Responsibilities:

- reward component logging
- model/token/cost accounting
- tool usage telemetry
- memory quality analytics

## processing-pipeline

1. `reset(task_id, seed)`
2. observation emitted
3. policy selects action
4. action executes (native/MCP/search/memory)
5. reward computed and logged
6. done check
7. repeat until terminal
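
The loop above can be sketched in a few lines. This is a minimal illustration, not the project's code: `ToyEnv`, `StepResult`, `simple_policy`, and `run_episode` are hypothetical names, and the reward is a seeded random number standing in for the real reward engine.

```python
import random
from dataclasses import dataclass


@dataclass
class StepResult:
    observation: dict
    reward: float
    done: bool


class ToyEnv:
    """Hypothetical stand-in for the environment, used only for illustration."""

    def __init__(self, max_steps: int = 5):
        self.max_steps = max_steps

    def reset(self, task_id: str, seed: int) -> dict:
        # Seeding a private RNG keeps the episode reproducible.
        self.rng = random.Random(seed)
        self.steps = 0
        return {"task_id": task_id, "step": 0}

    def step(self, action: str) -> StepResult:
        self.steps += 1
        reward = round(self.rng.random(), 3)   # placeholder reward
        done = self.steps >= self.max_steps    # placeholder terminal check
        return StepResult({"step": self.steps, "last_action": action}, reward, done)


def simple_policy(observation: dict) -> str:
    # Placeholder policy: alternate between two action names.
    return "extract" if observation["step"] % 2 else "navigate"


def run_episode(env, policy, task_id: str, seed: int) -> list:
    observation = env.reset(task_id, seed)     # steps 1-2: reset, observation emitted
    rewards, done = [], False
    while not done:                            # step 7: repeat until terminal
        action = policy(observation)           # step 3: policy selects action
        result = env.step(action)              # step 4: action executes
        rewards.append(result.reward)          # step 5: reward computed and logged
        observation, done = result.observation, result.done  # step 6: done check
    return rewards
```

Running `run_episode(ToyEnv(), simple_policy, "demo-task", 42)` twice yields the same reward list, which is the property the `seed` argument is meant to provide.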

## batch-and-parallel-design

### batch

- large HTML split into semantic chunks
- chunk extraction batched with bounded size
- merge + dedupe + confidence rank
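
A rough sketch of the batching and merge steps, assuming each extraction result is a dict with `field`, `value`, and `confidence` keys. That record shape and the helper names `batched` and `merge_results` are assumptions made for illustration.

```python
from itertools import islice


def batched(items, size):
    """Yield consecutive lists of at most `size` items."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


def merge_results(records):
    """Dedupe by (field, value), keep the highest confidence, rank descending."""
    best = {}
    for record in records:
        key = (record["field"], record["value"])
        if key not in best or record["confidence"] > best[key]["confidence"]:
            best[key] = record
    return sorted(best.values(), key=lambda r: r["confidence"], reverse=True)
```

For example, two records for the same `("price", "9.99")` pair with confidences 0.6 and 0.9 collapse into the single 0.9 record.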

### parallel

- independent chunk tasks run concurrently
- search and verification can run in parallel branches
- configurable worker limits and queue priorities
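
One way to bound concurrency is an `asyncio.Semaphore`. The sketch below assumes an async extraction function. `extract_chunk` and `run_parallel` are hypothetical names, and the body of `extract_chunk` is a placeholder for real work.

```python
import asyncio


async def extract_chunk(chunk: str) -> dict:
    # Placeholder for real extraction I/O (model call, HTTP request, ...).
    await asyncio.sleep(0)
    return {"chunk": chunk, "length": len(chunk)}


async def run_parallel(chunks, max_workers: int = 4) -> list:
    # The semaphore caps how many chunk tasks run at the same time.
    semaphore = asyncio.Semaphore(max_workers)

    async def bounded(chunk):
        async with semaphore:
            return await extract_chunk(chunk)

    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(bounded(c) for c in chunks))
```

Usage: `asyncio.run(run_parallel(["a", "bb", "ccc"], max_workers=2))`. Queue priorities are not modeled here.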

## queue-and-scheduler

Task queue supports:

- priority classes (`high`, `normal`, `low`)
- cancellation tokens
- retry policy with backoff
- dead-letter queue for repeated failures
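
The four features above can be combined in a small in-process queue. This is only a sketch under stated assumptions: `TaskQueue` and its methods are hypothetical, the backoff is exponential with an assumed base delay, and the retry is re-queued immediately instead of sleeping so the example stays deterministic.

```python
import heapq
import itertools

PRIORITY = {"high": 0, "normal": 1, "low": 2}


class TaskQueue:
    """Illustrative priority queue with cancellation, retries, and a dead-letter list."""

    def __init__(self, max_attempts: int = 3, base_delay: float = 0.5):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO order per class
        self.max_attempts = max_attempts
        self.base_delay = base_delay
        self.cancelled = set()
        self.dead_letter = []

    def submit(self, task_id, func, priority="normal", attempt=1):
        entry = (PRIORITY[priority], next(self._counter), task_id, func, priority, attempt)
        heapq.heappush(self._heap, entry)

    def cancel(self, task_id):
        self.cancelled.add(task_id)

    def backoff(self, attempt: int) -> float:
        # Exponential backoff: base, 2*base, 4*base, ...
        return self.base_delay * 2 ** (attempt - 1)

    def run_next(self):
        """Run one task and return (task_id, outcome), or None if the queue is empty."""
        if not self._heap:
            return None
        _, _, task_id, func, priority, attempt = heapq.heappop(self._heap)
        if task_id in self.cancelled:
            return task_id, "cancelled"
        try:
            func()
            return task_id, "ok"
        except Exception:
            if attempt >= self.max_attempts:
                self.dead_letter.append(task_id)
                return task_id, "dead-lettered"
            # A real scheduler would wait self.backoff(attempt) seconds first.
            self.submit(task_id, func, priority, attempt + 1)
            return task_id, "retry"
```

A task that keeps raising is retried until `max_attempts` is reached and then lands in `dead_letter`. A cancelled task is skipped when it is popped.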

## storage-architecture

- Episode state: in-memory + optional persistence
- Long-term memory: vector DB + metadata store
- Logs/metrics: append-only time-series-friendly sink
- Exports: JSON/CSV trace packs
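
As an illustration of the export path, the sketch below writes the same per-step records to JSON and CSV. The record keys and the function name `export_trace` are assumptions. The actual trace pack layout is not specified here.

```python
import csv
import io
import json


def export_trace(steps):
    """Return (json_text, csv_text) for a list of per-step dicts."""
    json_text = json.dumps({"steps": steps}, indent=2)

    # Use the union of keys so that sparse records still fit one header row.
    fieldnames = sorted({key for step in steps for key in step})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(steps)
    return json_text, buffer.getvalue()
```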

## backend-folder-notes-template-system

```text
backend/app/sites/
  - models.py      # SiteTemplate dataclass
  - templates.py   # 50+ inbuilt site templates
  - registry.py    # list/get/match/serialize helpers
```
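
To make the split between the three modules concrete, here is a condensed single-file sketch. The `SiteTemplate` fields, the example template, and the helper names are all hypothetical. Only the dataclass name and the list/get/match/serialize roles come from the layout above.

```python
from dataclasses import asdict, dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class SiteTemplate:
    # Field names are assumptions made for this sketch.
    site_id: str
    domains: tuple
    navigation_goals: tuple = ()
    fields: tuple = ()


# Stand-in for the catalog that templates.py would hold.
_TEMPLATES = {
    "example-shop": SiteTemplate(
        site_id="example-shop",
        domains=("example.com",),
        navigation_goals=("open product page",),
        fields=("title", "price"),
    ),
}


def list_templates():
    return sorted(_TEMPLATES)


def get_template(site_id):
    return _TEMPLATES.get(site_id)


def match_template(url):
    """Return the first template whose domain equals or is a parent of the URL host."""
    host = (urlparse(url).hostname or "").lower()
    for template in _TEMPLATES.values():
        if any(host == d or host.endswith("." + d) for d in template.domains):
            return template
    return None


def serialize_template(template):
    return asdict(template)
```

With this catalog, `match_template("https://shop.example.com/p/1")` returns the `example-shop` template, and an unknown host returns `None`.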

## reliability

- per-tool timeout and retry
- per-step safety budget
- circuit breaker for failing providers
- deterministic fallback chains
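
A circuit breaker and an ordered fallback chain can be sketched as follows. The threshold, the reset interval, and the names `CircuitBreaker` and `call_with_fallback` are illustrative assumptions, and per-tool timeouts are left out for brevity.

```python
import time


class CircuitBreaker:
    """Opens after `failure_threshold` consecutive failures, closes after `reset_after` seconds."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.reset_after:
            # Cool-down elapsed: close the circuit and allow a new attempt.
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record_success(self):
        self.failures = 0

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


def call_with_fallback(providers, breakers, query):
    """Try providers in their fixed list order, skipping any with an open circuit."""
    for name, provider in providers:
        breaker = breakers[name]
        if not breaker.allow():
            continue
        try:
            result = provider(query)
        except Exception:
            breaker.record_failure()
            continue
        breaker.record_success()
        return name, result
    raise RuntimeError("all providers failed or are circuit-open")
```

Because the provider list is ordered and never shuffled, the same failure pattern always resolves to the same fallback, which is what makes the chain deterministic.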

## security

- API key vaulting via env/config secrets
- MCP allowlist
- output sanitization
- redaction of sensitive tokens in logs
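
Log redaction can be handled with a `logging.Filter`. The patterns below are illustrative assumptions, not an exhaustive list, and `redact` and `RedactingFilter` are hypothetical names.

```python
import logging
import re

# Illustrative patterns only; real deployments need a reviewed list.
_SECRET_PATTERNS = [
    re.compile(r"(?i)\b(bearer)(\s+)([A-Za-z0-9._\-]+)"),
    re.compile(r"(?i)\b(api[_-]?key|token|secret|password)\b(\s*[=:]\s*)(\S+)"),
]


def redact(text: str) -> str:
    """Replace the value part of key=value style secrets with a marker."""
    for pattern in _SECRET_PATTERNS:
        text = pattern.sub(lambda m: f"{m.group(1)}{m.group(2)}[REDACTED]", text)
    return text


class RedactingFilter(logging.Filter):
    """Rewrites each record's formatted message before handlers see it."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = redact(record.getMessage())
        record.args = ()  # message is already formatted
        return True
```

Attach it with `logger.addFilter(RedactingFilter())` on the logger that emits the records, or on a handler to cover everything routed through it.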

## deployment

Single-container baseline:

- frontend static build served by API backend
- optional sidecars for DB/vector/MCP infra

Scale-out profile:

- separate API and worker pools
- managed vector DB
- queue-backed distributed execution
- central observability backend

## compatibility-goals

- local dev mode with minimal dependencies
- cloud mode with managed infra
- optional self-hosted LLM endpoints

## future-architecture-extensions

- distributed multi-agent graph execution
- adaptive autoscaling by queue pressure
- global memory federation across projects

## api-reference-alignment

| architecture-plane | primary-endpoints |
| --- | --- |
| control-plane | `/api/health`, `/api/ready`, `/api/settings`, `/api/tasks` |
| episode-runtime | `/api/episode/reset`, `/api/episode/step`, `/api/episode/state/{episode_id}` |
| agent-runtime | `/api/agents/*`, `/api/providers/*` |
| tooling-memory | `/api/tools/*`, `/api/plugins/*`, `/api/memory/*` |
| scraping-runtime | `/api/scrape/stream`, `/api/scrape/{session_id}/result`, `/ws/episode/{episode_id}` |

Use `api-reference.md` as the authoritative endpoint inventory.

## document-metadata

| key | value |
| --- | --- |
| document | `architecture.md` |
| status | active |

## document-flow

```mermaid
flowchart TD
    A[document] --> B[key-sections]
    B --> C[implementation]
    B --> D[operations]
    B --> E[validation]
```