---
title: ScrapeRL
emoji: 🌖
colorFrom: blue
colorTo: gray
sdk: docker
pinned: false
---

# scraperl

ScrapeRL is an AI-first web-scraping platform that combines reinforcement-learning-style episodes, multi-agent planning, dynamic tool and plugin calls, and multi-provider LLM routing. It supports synchronous and streaming scrape APIs, session-based execution, real-time frontend updates, and OpenEnv-compatible inference.

## what-this-project-delivers

| area | capability |
| --- | --- |
| scraping-runtime | endpoint-driven scraping with `json`, `csv`, `markdown`, and `text` output modes |
| ai-routing | provider/model routing across OpenAI, Anthropic, Google, Groq, and NVIDIA |
| agentic-tooling | registry-based runtime tool planning and execution with streamed `tool_call` steps |
| memory | short-term, working, long-term, and shared memory layers |
| interface | React + Vite dashboard with live stream progress and session visibility |
| deployment | local dev, Docker Compose, and Hugging Face Space-compatible Docker setup |
| evaluation | root `inference.py` following strict `[START]/[STEP]/[END]` OpenEnv output contract |

## system-topology

```mermaid
flowchart TD
    A[frontend-dashboard] --> B[fastapi-control-plane]
    B --> C[episode-runtime]
    B --> D[scrape-runtime]
    B --> E[agent-runtime]
    E --> F[model-router]
    E --> G[tool-and-plugin-registry]
    E --> H[memory-manager]
    D --> G
    D --> H
    B --> I[websocket-and-sse-streams]
```
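
The `websocket-and-sse-streams` node carries progress events to the dashboard. As a rough illustration of the SSE side, the sketch below parses the standard `text/event-stream` framing (`event:` and `data:` fields, with a blank line ending each event). The `tool_call` event name is taken from the capability table; the JSON payload fields (`tool`, `step`) and the assumption that `data:` carries JSON are hypothetical and may not match what the backend actually sends.

```python
import json
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per SSE event from an iterable of text lines."""
    event_name = "message"  # SSE default when no `event:` field is sent
    data_parts: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data_parts:
                # Assumption: the payload is JSON.
                yield {"event": event_name, "data": json.loads("\n".join(data_parts))}
            event_name, data_parts = "message", []
        elif line.startswith(":"):
            continue  # comment / keep-alive line
        elif line.startswith("event:"):
            event_name = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # The SSE format drops a single leading space after the colon.
            data_parts.append(value[1:] if value.startswith(" ") else value)


# Hypothetical payload shape; the real fields are defined by the backend.
sample = [
    "event: tool_call\n",
    'data: {"tool": "html_parser", "step": 1}\n',
    "\n",
]
events = list(parse_sse(sample))
```

In practice the `lines` iterable would come from a streaming HTTP response rather than a hard-coded list.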

## repository-layout

```text
scrapeRL/
  backend/
    app/
      api/routes/   # FastAPI route modules
      agents/       # agent planning/runtime logic
      models/       # model router + provider adapters
      plugins/      # plugin registry + runtime integrations
      memory/       # memory layers and manager
      core/         # env/reward/observation/action foundations
    requirements.txt
  frontend/
    src/            # React app
    package.json
  docs/             # modular technical documentation
  inference.py      # OpenEnv-compliant inference runner
  docker-compose.yml
  .env.example
```

## quick-start

### docker-compose

```bash
git clone https://github.com/NeerajCodz/scrapeRL.git
cd scrapeRL
cp .env.example .env
# set api keys in .env
docker compose up --build
```

| service | url |
| --- | --- |
| frontend | `http://localhost:3000` |
| backend-api | `http://localhost:8000` |
| swagger | `http://localhost:8000/swagger` |

### local-development

Backend:

```bash
cd backend
pip install -r requirements.txt
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
```

Frontend:

```bash
cd frontend
npm install
npm run dev -- --host 0.0.0.0 --port 3000
```

## configuration

Root configuration lives in `.env` (template: `.env.example`).

### provider-and-model-keys

| variable | purpose |
| --- | --- |
| `OPENAI_API_KEY` | OpenAI chat + embeddings access |
| `ANTHROPIC_API_KEY` | Anthropic model access |
| `GOOGLE_API_KEY` | Google provider and embeddings access |
| `GEMINI_API_KEY` | alias key used by tests/compose for Gemini |
| `GROQ_API_KEY` | Groq provider access |
| `NVIDIA_API_KEY` | NVIDIA provider access |
| `NVIDIA_BASE_URL` | base URL of the NVIDIA OpenAI-compatible endpoint |
| `GEMINI_MODEL_EMBEDDING` | embedding model id for Google embeddings |
| `HF_TOKEN` | token required by `inference.py` to authenticate its OpenAI client |

### app-runtime

| variable | default |
| --- | --- |
| `DEBUG` | `false` |
| `LOG_LEVEL` | `INFO` |
| `HOST` | `0.0.0.0` |
| `PORT` | `8000` |
| `CORS_ORIGINS` | `["http://localhost:5173","http://localhost:3000"]` |
| `SESSION_TIMEOUT` | `3600` |
| `MEMORY_TTL` | `86400` |
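
The `CORS_ORIGINS` default is written as a JSON array, so it needs parsing rather than a plain string read. A minimal sketch of how such a value could be loaded and validated, assuming JSON parsing is the intended format; the helper name `load_cors_origins` is hypothetical and not taken from the backend code:

```python
import json
import os
from typing import Mapping, Optional

# Default copied from the app-runtime table.
DEFAULT_CORS_ORIGINS = '["http://localhost:5173","http://localhost:3000"]'


def load_cors_origins(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Read CORS_ORIGINS as a JSON array of strings, using the documented default."""
    env = os.environ if env is None else env
    origins = json.loads(env.get("CORS_ORIGINS", DEFAULT_CORS_ORIGINS))
    if not isinstance(origins, list) or not all(isinstance(o, str) for o in origins):
        raise ValueError("CORS_ORIGINS must be a JSON array of strings")
    return origins
```

Rejecting a non-list value early gives a clearer error than a CORS failure surfacing later in the browser.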

### inference-runtime

| variable | default |
| --- | --- |
| `API_BASE_URL` | `https://api.openai.com/v1` |
| `MODEL_NAME` | `gpt-4.1-mini` |
| `ENV_API_BASE_URL` | `http://localhost:8000/api` |
| `TASK_NAME` | `task_001` |
| `BENCHMARK` | `openenv` |
| `MAX_STEPS` | `12` |
| `EPISODE_SEED` | `42` |
| `LLM_TEMPERATURE` | `0.0` |
| `PROMPT_HTML_LIMIT` | `5000` |
| `REQUEST_TIMEOUT_SECONDS` | `30` |
| `USE_OPENENV_SDK` | `true` |
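
The inference-runtime defaults above can be pictured as a typed settings object. The sketch below is one possible way to read them from the environment; the `InferenceConfig` class, its field types, and the set of accepted truthy strings are assumptions for illustration, not the actual `inference.py` implementation.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class InferenceConfig:
    api_base_url: str
    model_name: str
    env_api_base_url: str
    task_name: str
    benchmark: str
    max_steps: int
    episode_seed: int
    llm_temperature: float
    prompt_html_limit: int
    request_timeout_seconds: float
    use_openenv_sdk: bool

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "InferenceConfig":
        """Build the config from environment variables, using the documented defaults."""
        env = os.environ if env is None else env
        get = env.get
        return cls(
            api_base_url=get("API_BASE_URL", "https://api.openai.com/v1"),
            model_name=get("MODEL_NAME", "gpt-4.1-mini"),
            env_api_base_url=get("ENV_API_BASE_URL", "http://localhost:8000/api"),
            task_name=get("TASK_NAME", "task_001"),
            benchmark=get("BENCHMARK", "openenv"),
            max_steps=int(get("MAX_STEPS", "12")),
            episode_seed=int(get("EPISODE_SEED", "42")),
            llm_temperature=float(get("LLM_TEMPERATURE", "0.0")),
            prompt_html_limit=int(get("PROMPT_HTML_LIMIT", "5000")),
            request_timeout_seconds=float(get("REQUEST_TIMEOUT_SECONDS", "30")),
            # Assumption: these strings count as "true"; anything else is false.
            use_openenv_sdk=get("USE_OPENENV_SDK", "true").strip().lower() in {"1", "true", "yes"},
        )
```

Calling `InferenceConfig.from_env()` with no arguments reads `os.environ`; passing a plain dict makes the parsing easy to test.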

## inferencepy-openenv-contract

The root `inference.py` uses the `OpenAI` client (`from openai import OpenAI`) for all LLM calls and emits logs in a strict structured format:

```text
[START] task=<task_name> env=<benchmark> model=<model_name>
[STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
[END] success=<true|false> steps=<n> rewards=<r1,r2,...,rn>
```

Run:

```bash
python inference.py --task task_001 --benchmark openenv
```
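
To make the log contract concrete, here is a minimal sketch of formatter functions that produce the three line types. The function names are hypothetical, the `navigate` action in the usage note is an invented example, and formatting every reward with two decimals is an assumption inferred from the `<0.00>` placeholder.

```python
from typing import Optional, Sequence


def format_start(task: str, env: str, model: str) -> str:
    return f"[START] task={task} env={env} model={model}"


def format_step(step: int, action: str, reward: float, done: bool,
                error: Optional[str] = None) -> str:
    # Booleans are lowercased and a missing error is printed as `null`.
    return (
        f"[STEP] step={step} action={action} reward={reward:.2f} "
        f"done={str(done).lower()} error={error if error is not None else 'null'}"
    )


def format_end(success: bool, steps: int, rewards: Sequence[float]) -> str:
    # Assumption: rewards use the same two-decimal format as [STEP] lines.
    joined = ",".join(f"{r:.2f}" for r in rewards)
    return f"[END] success={str(success).lower()} steps={steps} rewards={joined}"
```

For example, `format_step(1, "navigate", 0.5, False)` returns `[STEP] step=1 action=navigate reward=0.50 done=false error=null`.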

## api-quick-map

Use `docs/api-reference.md` for the full endpoint inventory. Core surfaces:

| surface | endpoints |
| --- | --- |
| health | `/api/health`, `/api/ready`, `/api/ping` |
| episode | `/api/episode/reset`, `/api/episode/step`, `/api/episode/state/{episode_id}` |
| scrape | `/api/scrape/stream`, `/api/scrape/{session_id}/status`, `/api/scrape/{session_id}/result` |
| agents-tools-memory | `/api/agents/*`, `/api/tools/*`, `/api/plugins/*`, `/api/memory/*` |
| realtime | `/ws/episode/{episode_id}` |
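
The path templates above can be filled in with small helpers. This is a sketch only: the helper names are hypothetical, the default base URL is the documented `ENV_API_BASE_URL`, and the `ws://` scheme and host for the realtime endpoint are assumptions based on the local quick-start ports. Request and response bodies are not shown because they are defined in `docs/api-reference.md`.

```python
from urllib.parse import quote

DEFAULT_API_BASE = "http://localhost:8000/api"  # documented ENV_API_BASE_URL default


def _segment(value: str) -> str:
    """Percent-encode an id so it is safe to use as a single path segment."""
    return quote(value, safe="")


def episode_state_url(episode_id: str, base: str = DEFAULT_API_BASE) -> str:
    return f"{base.rstrip('/')}/episode/state/{_segment(episode_id)}"


def scrape_status_url(session_id: str, base: str = DEFAULT_API_BASE) -> str:
    return f"{base.rstrip('/')}/scrape/{_segment(session_id)}/status"


def scrape_result_url(session_id: str, base: str = DEFAULT_API_BASE) -> str:
    return f"{base.rstrip('/')}/scrape/{_segment(session_id)}/result"


def episode_ws_url(episode_id: str, host: str = "localhost:8000") -> str:
    # Assumption: plain ws:// on the backend host; the path has no /api prefix.
    return f"ws://{host}/ws/episode/{_segment(episode_id)}"
```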

## documentation-map

| document | purpose |
| --- | --- |
| `docs/overview.md` | platform overview and navigation |
| `docs/api-reference.md` | authoritative HTTP and WebSocket reference |
| `docs/architecture.md` | system architecture and runtime planes |
| `docs/openenv.md` | OpenEnv environment contract |
| `docs/tool-calls.md` | streamed tool-call event patterns |
| `docs/plugins.md` | plugin registry and dynamic tool model |
| `docs/memory.md` | memory design and operations |
| `docs/readme.md` | docs index |

## testing-and-validation

Backend:

```bash
cd backend
pytest
```

Frontend:

```bash
cd frontend
npm run test
```

## deployment-notes

| mode | notes |
| --- | --- |
| docker-compose | preferred local full-stack run |
| hugging-face-space | root `README.md` front matter with the Docker SDK config is Space-compatible |
| direct-backend | run `uvicorn app.main:app` with `.env` configured |

## troubleshooting

| symptom | likely-cause | check |
| --- | --- | --- |
| provider not available | missing api key | verify `.env` provider key |
| streaming has no step events | scrape runtime failed early | inspect `/api/scrape/{session_id}/status` |
| inference exits with failure | missing `HF_TOKEN` or endpoint mismatch | verify `HF_TOKEN`, `API_BASE_URL`, `MODEL_NAME` |
| no frontend data | backend not reachable from frontend | check `VITE_API_PROXY_TARGET` / backend health |
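
The first troubleshooting row can be checked before starting the stack. The sketch below lists providers whose key variables are unset or blank, using the variable names from the provider-and-model-keys table. Treating `GEMINI_API_KEY` as an acceptable substitute for `GOOGLE_API_KEY` is an assumption, and `missing_providers` is a hypothetical helper rather than part of the backend.

```python
import os
from typing import Mapping, Optional

# Provider -> environment variables that can supply its key.
PROVIDER_KEYS = {
    "openai": ("OPENAI_API_KEY",),
    "anthropic": ("ANTHROPIC_API_KEY",),
    "google": ("GOOGLE_API_KEY", "GEMINI_API_KEY"),  # alias handling is an assumption
    "groq": ("GROQ_API_KEY",),
    "nvidia": ("NVIDIA_API_KEY",),
}


def missing_providers(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return providers for which no key variable is set to a non-blank value."""
    env = os.environ if env is None else env
    return [
        provider
        for provider, names in PROVIDER_KEYS.items()
        if not any(env.get(name, "").strip() for name in names)
    ]
```

Running `missing_providers()` after loading `.env` into the environment gives a quick list of providers that the router would likely report as unavailable.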

## license

MIT.