	# agentic-scraper-sandbox-plugin-execution-report

	## goal
	Enable the scraper to act as an agent that can:
	- search from non-URL prompts,
	- navigate and scrape links,
	- execute plugin-based Python analysis (`numpy`, `pandas`, `bs4`) safely,
	- run in a sandboxed per-request environment with cleanup.

	## what-was-implemented
	- Added sandbox plugin executor: `backend/app/plugins/python_sandbox.py`
	  - AST safety validation (restricted imports and blocked dangerous calls/attributes)
	  - isolated execution with `python -I`
	  - per-request temp workspace
	  - automatic cleanup after execution
	- Wired sandbox plugin execution into scrape flow (`/api/scrape/stream` and `/api/scrape/` via shared pipeline):
	  - `mcp-python-sandbox`
	  - `proc-python`
	  - `proc-pandas`
	  - `proc-numpy`
	  - `proc-bs4`
	- Added optional request field:
	  - `python_code` (sandboxed code, must assign `result`)
	- Enhanced non-URL asset resolution:
	  - MCP search attempt via DuckDuckGo provider
	  - deterministic fallback resolution for scraper workflows
	- Updated plugin registry and installed plugin set for new plugins.
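
The executor behaviour listed above (isolated `python -I` run, per-request temp workspace, cleanup after execution) could look roughly like the sketch below. It is not the code in `backend/app/plugins/python_sandbox.py`. The function name `run_in_sandbox`, the `main.py` script name, the timeout, and the convention of passing `result` back as JSON on the last stdout line are all assumptions made for illustration.

```python
import json
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path

# Assumption: `result` is handed back to the caller as JSON on the last stdout line.
_EPILOGUE = "\nimport json as _json\nprint(_json.dumps(result))\n"


def run_in_sandbox(python_code: str, session_id: str, timeout: float = 10.0):
    """Run user code in a throwaway workspace and return the value it assigned to `result`."""
    # Per-request workspace, named after the pattern quoted in the safety model.
    workspace = Path(tempfile.mkdtemp(prefix=f"scraperl-sandbox-{session_id}-"))
    try:
        script = workspace / "main.py"  # hypothetical script name
        script.write_text(python_code + _EPILOGUE, encoding="utf-8")
        # `-I` is Python's isolated mode: PYTHON* environment variables and the
        # user site directory are ignored, and the script directory is not
        # added to sys.path.
        proc = subprocess.run(
            [sys.executable, "-I", str(script)],
            cwd=workspace,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        if proc.returncode != 0:
            raise RuntimeError(proc.stderr.strip())
        return json.loads(proc.stdout.strip().splitlines()[-1])
    finally:
        # Cleanup runs on success, failure, and timeout alike.
        shutil.rmtree(workspace, ignore_errors=True)
```

For example, `run_in_sandbox("result = sum(range(5))", "demo")` returns `10`, and code that never assigns `result` fails with a `RuntimeError` carrying the child's traceback. Process isolation on its own does not restrict file or network access, which is why the report pairs it with static AST checks.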

	## safety-model
	- The sandbox runs in an isolated temporary directory per request (`scraperl-sandbox-<session>-*`).
	- Dangerous operations are blocked by static AST checks (`open`, `exec`, `eval`, `subprocess`, `os`-style operations, dunder access, etc.).
	- No persistent artifacts are kept after a run (the workspace is removed in a `finally` cleanup).
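
A static check of this kind can be sketched with the standard `ast` module. This is a minimal illustration and not the project's validator. The allow-list, the blocked-name set, and the function name `find_violations` are assumptions, and the real rule set may be broader.

```python
import ast

# Assumed allow-list: the analysis libraries named in the goal, plus two stdlib helpers.
ALLOWED_IMPORTS = {"numpy", "pandas", "bs4", "math", "json"}
# Assumed block-list of builtins that must not be referenced at all.
BLOCKED_NAMES = {"open", "exec", "eval", "compile", "__import__"}


def find_violations(source: str) -> list[str]:
    """Return human-readable violations; an empty list means the code passed."""
    violations = []
    assigns_result = False
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] not in ALLOWED_IMPORTS:
                    violations.append(f"import of '{alias.name}' is not allowed")
        elif isinstance(node, ast.ImportFrom):
            root = (node.module or "").split(".")[0]
            if node.level > 0 or root not in ALLOWED_IMPORTS:
                violations.append(f"import from '{node.module}' is not allowed")
        elif isinstance(node, ast.Name):
            if node.id in BLOCKED_NAMES:
                violations.append(f"use of '{node.id}' is blocked")
            if node.id == "result" and isinstance(node.ctx, ast.Store):
                assigns_result = True
        elif isinstance(node, ast.Attribute):
            # Dunder access such as `().__class__` is a common sandbox escape route.
            if node.attr.startswith("__") and node.attr.endswith("__"):
                violations.append(f"dunder attribute '{node.attr}' is blocked")
    if not assigns_result:
        violations.append("code must assign `result`")
    return violations
```

Because `os` and `subprocess` are absent from the allow-list, `import os` and `import subprocess` are rejected by the import rule and need no separate entry. Static checks like this are a first filter and are best combined with the isolated workspace and cleanup described above.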

	## one-request-validation-real-curl-n-runs
	Each test was executed with a single request to `POST /api/scrape/stream`.

	| Test | Status | Errors | URLs Processed | Python Analysis Present | Dataset Row Count |
	| --- | --- | ---: | ---: | --- | ---: |
	| gold-csv-agentic | completed | 0 | 2 | true | 123 |
	| ev-data-search-json | completed | 0 | 6 | true | - |
	| direct-dataset-python-analysis | completed | 0 | 1 | true | 123 |
	## notes
	- The gold trend request produced monthly dataset rows from 2016 onward, with source links, in a single stream request.
	- Python plugin analysis was present in all validation scenarios.
	- The agent step stream included planner, search, navigator, extractor, and verifier events, plus sandbox analysis events.

	## document-flow

	```mermaid
	flowchart TD
	A[document] --> B[key-sections]
	B --> C[implementation]
	B --> D[operations]
	B --> E[validation]
	```
	## related-api-reference

	| item | value |
	| --- | --- |
	| api-reference | `api-reference.md` |