# agentic-scraper-sandbox-plugin-execution-report
## goal
Enable the scraper to act as an agent that can:
- search from non-URL prompts,
- navigate and scrape links,
- execute plugin-based Python analysis (`numpy`, `pandas`, `bs4`) safely,
- run in a sandboxed per-request environment with cleanup.
## what-was-implemented
- Added sandbox plugin executor: `backend/app/plugins/python_sandbox.py`
  - AST safety validation (restricted imports and blocked dangerous calls/attributes)
  - isolated execution with `python -I`
  - per-request temp workspace
  - automatic cleanup after execution
- Wired sandbox plugin execution into scrape flow (`/api/scrape/stream` and `/api/scrape/` via shared pipeline):
  - `mcp-python-sandbox`
  - `proc-python`
  - `proc-pandas`
  - `proc-numpy`
  - `proc-bs4`
- Added optional request field:
  - `python_code` (sandboxed code, must assign `result`)
- Enhanced non-URL asset resolution:
  - MCP search attempt via DuckDuckGo provider
  - deterministic fallback resolution for scraper workflows
- Updated plugin registry and installed plugin set for new plugins.
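
The AST safety validation listed above can be illustrated with a minimal sketch. This is not the code in `python_sandbox.py`. The allowlist, blocklist, and the `validate_source` name are assumptions chosen for illustration, and the real checks may be stricter.

```python
import ast

# Hypothetical lists; the actual allowlist/blocklist in python_sandbox.py may differ.
ALLOWED_IMPORTS = {"numpy", "pandas", "bs4", "math", "json"}
BLOCKED_CALLS = {"open", "exec", "eval", "compile", "__import__"}


def validate_source(source: str) -> list[str]:
    """Statically inspect code and return violations (empty list means it passed)."""
    violations = []
    tree = ast.parse(source)  # parses only; nothing is executed or imported
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] not in ALLOWED_IMPORTS:
                    violations.append(f"import not allowed: {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            root = (node.module or "").split(".")[0]
            # Relative imports (level > 0) are rejected as well.
            if node.level > 0 or root not in ALLOWED_IMPORTS:
                violations.append(f"import not allowed: {node.module}")
        elif isinstance(node, ast.Call):
            if isinstance(node.func, ast.Name) and node.func.id in BLOCKED_CALLS:
                violations.append(f"call not allowed: {node.func.id}")
        elif isinstance(node, ast.Attribute):
            if node.attr.startswith("__") and node.attr.endswith("__"):
                violations.append(f"dunder access not allowed: {node.attr}")
    return violations


def assigns_result(source: str) -> bool:
    """Check that the code assigns a top-level name called `result`."""
    for node in ast.parse(source).body:
        if isinstance(node, ast.Assign):
            if any(isinstance(t, ast.Name) and t.id == "result" for t in node.targets):
                return True
    return False
```

A static check like this only narrows what reaches the interpreter. It is a first filter and should not be treated as a complete security boundary on its own, which is why the report pairs it with isolated execution.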
## safety-model
- Sandbox runs in isolated temp directory per request (`scraperl-sandbox-<session>-*`)
- Dangerous operations blocked by static AST checks (`open`, `exec`, `eval`, `subprocess`, `os`-style operations, dunder access, etc.)
- No persistent artifacts are kept after run (workspace removed in `finally` cleanup).
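
The workspace lifecycle described in this section could look roughly like the sketch below. The function name `run_in_sandbox`, the JSON footer used to report `result`, and the timeout value are assumptions, not the project's actual implementation. Only the `python -I` flag, the `scraperl-sandbox-<session>-*` prefix, and the `finally` cleanup come from this report.

```python
import json
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path

# Assumed mechanism: a footer makes the child process print `result` as JSON.
RESULT_FOOTER = "\nimport json as _json\nprint(_json.dumps(result, default=str))\n"


def run_in_sandbox(user_code: str, session_id: str, timeout: float = 10.0) -> dict:
    """Run code in a fresh temp workspace and always remove the workspace."""
    workspace = Path(tempfile.mkdtemp(prefix=f"scraperl-sandbox-{session_id}-"))
    try:
        script = workspace / "main.py"
        script.write_text(user_code + RESULT_FOOTER, encoding="utf-8")
        proc = subprocess.run(
            # -I: isolated mode (ignores PYTHON* env vars and the user site dir)
            [sys.executable, "-I", str(script)],
            cwd=workspace,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        if proc.returncode != 0:
            return {"ok": False, "error": proc.stderr.strip(), "workspace": str(workspace)}
        last_line = proc.stdout.strip().splitlines()[-1]
        return {"ok": True, "result": json.loads(last_line), "workspace": str(workspace)}
    except subprocess.TimeoutExpired:
        return {"ok": False, "error": "timeout", "workspace": str(workspace)}
    finally:
        # Cleanup runs on success, failure, and timeout alike.
        shutil.rmtree(workspace, ignore_errors=True)
```

Because the cleanup sits in `finally`, the returned `workspace` path no longer exists by the time the caller sees it. It is included here only so the removal can be verified.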
## one-request-validation-real-curl-n-runs
Each test was executed with a single request to `POST /api/scrape/stream`.
| Test | Status | Errors | URLs Processed | Python Analysis Present | Dataset Row Count |
| --- | --- | ---: | ---: | --- | ---: |
| gold-csv-agentic | completed | 0 | 2 | true | 123 |
| ev-data-search-json | completed | 0 | 6 | true | - |
| direct-dataset-python-analysis | completed | 0 | 1 | true | 123 |
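
For illustration, a single validation request might be assembled as in the sketch below. Only the endpoint path, the `python_code` field, and the plugin IDs come from this report. The `prompt` and `plugins` field names, the base URL, and the `build_scrape_request` helper are hypothetical, so check the actual request schema in `api-reference.md` before relying on them.

```python
import json
import urllib.request


def build_scrape_request(base_url: str, prompt: str, python_code: str,
                         plugins: list[str]) -> urllib.request.Request:
    """Build (but do not send) a POST request for the streaming scrape endpoint."""
    payload = {
        "prompt": prompt,            # hypothetical field name
        "plugins": plugins,          # hypothetical field name
        "python_code": python_code,  # documented: sandboxed code, must assign `result`
    }
    return urllib.request.Request(
        url=f"{base_url}/api/scrape/stream",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_scrape_request(
    base_url="http://localhost:8000",  # assumed local dev address
    prompt="gold price trend",         # illustrative non-URL prompt
    python_code="result = {'rows': 0}",
    plugins=["proc-python", "proc-pandas"],
)
```

Sending it would be a matter of passing `request` to `urllib.request.urlopen` and reading the streamed response, which is left out here because the stream format is not described in this report.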
## notes
- Gold trend request produced monthly dataset rows from 2016 onward with source links in one stream request.
- Python plugin analysis was present in all validation scenarios.
- Agent step stream included planner/search/navigator/extractor/verifier + sandbox analysis events.
## document-flow
```mermaid
flowchart TD
A[document] --> B[key-sections]
B --> C[implementation]
B --> D[operations]
B --> E[validation]
```
## related-api-reference
| item | value |
| --- | --- |
| api-reference | `api-reference.md` |