# agentic-scraper-sandbox-plugin-execution-report

## goal
Enable the scraper to act as an agent that can:
- search from non-URL prompts,
- navigate and scrape links,
- execute plugin-based Python analysis (`numpy`, `pandas`, `bs4`) safely,
- run in a sandboxed per-request environment with cleanup.

## what-was-implemented
- Added sandbox plugin executor: `backend/app/plugins/python_sandbox.py`
  - AST safety validation (restricted imports and blocked dangerous calls/attributes)
  - isolated execution with `python -I`
  - per-request temp workspace
  - automatic cleanup after execution
- Wired sandbox plugin execution into the scrape flow (`/api/scrape/stream` and `/api/scrape/`, via the shared pipeline) for these plugins:
  - `mcp-python-sandbox`
  - `proc-python`
  - `proc-pandas`
  - `proc-numpy`
  - `proc-bs4`
- Added optional request field:
  - `python_code` (sandboxed code, must assign `result`)
- Enhanced non-URL asset resolution:
  - MCP search attempt via DuckDuckGo provider
  - deterministic fallback resolution for scraper workflows
- Updated plugin registry and installed plugin set for new plugins.
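The isolated-execution steps listed above (per-request temp workspace, `python -I`, automatic cleanup) can be sketched roughly as follows. This is a minimal illustration, not the contents of `backend/app/plugins/python_sandbox.py`: the function name `run_in_sandbox`, the `result.json` hand-off file, and the timeout value are all assumptions made for the example.

```python
import json
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path


def run_in_sandbox(python_code: str, session_id: str, timeout: float = 10.0) -> dict:
    """Run code in a throwaway workspace and return its `result` value.

    Hypothetical helper: the real executor's interface is not shown in this report.
    """
    # Per-request workspace, named after the pattern given in the safety model.
    workspace = Path(tempfile.mkdtemp(prefix=f"scraperl-sandbox-{session_id}-"))
    try:
        script = workspace / "main.py"
        # Assumed hand-off: a trusted footer serialises `result` to a JSON file.
        footer = (
            "\nimport json as _json\n"
            "with open('result.json', 'w') as _fh:\n"
            "    _json.dump(result, _fh)\n"
        )
        script.write_text(python_code + footer, encoding="utf-8")
        # -I runs Python in isolated mode (no user site-packages, no PYTHON* env vars).
        proc = subprocess.run(
            [sys.executable, "-I", str(script)],
            cwd=workspace,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        if proc.returncode != 0:
            return {"ok": False, "error": proc.stderr.strip(), "workspace": str(workspace)}
        result = json.loads((workspace / "result.json").read_text(encoding="utf-8"))
        return {"ok": True, "result": result, "workspace": str(workspace)}
    finally:
        # Cleanup runs on success, failure, and timeout alike.
        shutil.rmtree(workspace, ignore_errors=True)
```

Putting the removal in `finally` means the workspace disappears even when the child process fails or times out, which matches the "no persistent artifacts" behaviour described in this report.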

## safety-model
- The sandbox runs in an isolated temp directory per request (`scraperl-sandbox-<session>-*`).
- Dangerous operations are blocked by static AST checks (`open`, `exec`, `eval`, `subprocess`, `os`-style operations, dunder access, etc.).
- No persistent artifacts are kept after a run (the workspace is removed in a `finally` cleanup).
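As an illustration of the static AST checks, the sketch below walks the parsed tree and reports disallowed imports, blocked calls, and dunder attribute access. The allowlist and blocklist contents, and the name `validate_code`, are assumptions for this example; the actual rules in the sandbox plugin may differ.

```python
import ast

# Assumed lists for illustration; the real plugin's lists are not shown in this report.
ALLOWED_IMPORTS = {"numpy", "pandas", "bs4", "math", "json"}
BLOCKED_CALLS = {"open", "exec", "eval", "compile", "__import__"}


def validate_code(source: str) -> list[str]:
    """Return a list of violations; an empty list means the code passed."""
    problems = []
    tree = ast.parse(source)
    for node in ast.walk(tree):
        # Collect the module names referenced by import statements.
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            names = []
        for name in names:
            if name.split(".")[0] not in ALLOWED_IMPORTS:
                problems.append(f"import not allowed: {name}")
        # Reject direct calls to blocked builtins such as eval(...).
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in BLOCKED_CALLS:
                problems.append(f"call not allowed: {node.func.id}")
        # Reject attribute access like obj.__class__.
        if isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            problems.append(f"dunder access not allowed: {node.attr}")
    # The `python_code` field requires the code to assign `result`.
    assigns_result = any(
        isinstance(n, ast.Name) and n.id == "result" and isinstance(n.ctx, ast.Store)
        for n in ast.walk(tree)
    )
    if not assigns_result:
        problems.append("code must assign `result`")
    return problems
```

Because the import rule is an allowlist, modules such as `os` and `subprocess` are rejected without being named explicitly. Static checks of this kind are a first filter only, which is presumably why the report pairs them with isolated execution and a throwaway workspace.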

## one-request-validation-real-curl-n-runs
Each test was executed as a single request to `POST /api/scrape/stream`.

| Test | Status | Errors | URLs Processed | Python Analysis Present | Dataset Row Count |
| --- | --- | ---: | ---: | --- | ---: |
| gold-csv-agentic | completed | 0 | 2 | true | 123 |
| ev-data-search-json | completed | 0 | 6 | true | - |
| direct-dataset-python-analysis | completed | 0 | 1 | true | 123 |
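For orientation, a single validation request of this kind might be assembled as in the sketch below. Only the endpoint path and the `python_code` field come from this report. The `prompt` field name, the base URL, and the helper `build_scrape_request` are hypothetical, and the real request schema may require other fields.

```python
import json
import urllib.request


def build_scrape_request(base_url: str, prompt: str, python_code: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for the streaming scrape endpoint."""
    payload = {
        "prompt": prompt,            # hypothetical field name for the non-URL prompt
        "python_code": python_code,  # sandboxed code; must assign `result`
    }
    return urllib.request.Request(
        url=f"{base_url}/api/scrape/stream",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Example: the base URL is an assumption for a locally running backend.
request = build_scrape_request(
    "http://localhost:8000",
    "gold price trend",
    "result = {'rows': 0}",
)
```

The request is only constructed here; sending it with `urllib.request.urlopen(request)` and reading the streamed response would depend on the event format the backend emits, which this report does not specify.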

## notes
- The gold trend request produced monthly dataset rows from 2016 onward, with source links, in a single stream request.
- Python plugin analysis was present in all validation scenarios.
- The agent step stream included planner, search, navigator, extractor, and verifier events, plus sandbox analysis events.

## document-flow

```mermaid
flowchart TD
    A[document] --> B[key-sections]
    B --> C[implementation]
    B --> D[operations]
    B --> E[validation]
```

## related-api-reference

| item | value |
| --- | --- |
| api-reference | `api-reference.md` |