---
title: ScrapeRL
emoji: 🌖
colorFrom: blue
colorTo: gray
sdk: docker
pinned: false
---
# scraperl
ScrapeRL is an AI-first web-scraping platform that combines reinforcement-learning-style episodes, multi-agent planning, dynamic tool/plugin calls, and multi-provider LLM routing. It supports synchronous and streaming scrape APIs, session-based execution, real-time frontend updates, and OpenEnv-compatible inference.
## what-this-project-delivers
| area | capability |
| --- | --- |
| scraping-runtime | endpoint-driven scraping with `json`, `csv`, `markdown`, and `text` output modes |
| ai-routing | provider/model routing across OpenAI, Anthropic, Google, Groq, and NVIDIA |
| agentic-tooling | registry-based runtime tool planning and execution with streamed `tool_call` steps |
| memory | short-term, working, long-term, and shared memory layers |
| interface | React + Vite dashboard with live stream progress and session visibility |
| deployment | local dev, Docker Compose, and Hugging Face Space-compatible Docker setup |
| evaluation | root `inference.py` following the strict `[START]/[STEP]/[END]` OpenEnv output contract |
## system-topology
```mermaid
flowchart TD
A[frontend-dashboard] --> B[fastapi-control-plane]
B --> C[episode-runtime]
B --> D[scrape-runtime]
B --> E[agent-runtime]
E --> F[model-router]
E --> G[tool-and-plugin-registry]
E --> H[memory-manager]
D --> G
D --> H
B --> I[websocket-and-sse-streams]
```
## repository-layout
```text
scrapeRL/
backend/
app/
api/routes/ # FastAPI route modules
agents/ # agent planning/runtime logic
models/ # model router + provider adapters
plugins/ # plugin registry + runtime integrations
memory/ # memory layers and manager
core/ # env/reward/observation/action foundations
requirements.txt
frontend/
src/ # React app
package.json
docs/ # modular technical documentation
inference.py # OpenEnv-compliant inference runner
docker-compose.yml
.env.example
```
## quick-start
### docker-compose
```bash
git clone https://github.com/NeerajCodz/scrapeRL.git
cd scrapeRL
cp .env.example .env
# set api keys in .env
docker compose up --build
```
| service | url |
| --- | --- |
| frontend | `http://localhost:3000` |
| backend-api | `http://localhost:8000` |
| swagger | `http://localhost:8000/swagger` |
### local-development
Backend:
```bash
cd backend
pip install -r requirements.txt
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
```
Frontend:
```bash
cd frontend
npm install
npm run dev -- --host 0.0.0.0 --port 3000
```
## configuration
Root configuration lives in `.env` (template: `.env.example`).
### provider-and-model-keys
| variable | purpose |
| --- | --- |
| `OPENAI_API_KEY` | OpenAI chat + embeddings access |
| `ANTHROPIC_API_KEY` | Anthropic model access |
| `GOOGLE_API_KEY` | Google provider and embeddings access |
| `GEMINI_API_KEY` | alias key used by tests/compose for Gemini |
| `GROQ_API_KEY` | Groq provider access |
| `NVIDIA_API_KEY` | NVIDIA provider access |
| `NVIDIA_BASE_URL` | NVIDIA OpenAI-compatible endpoint base URL |
| `GEMINI_MODEL_EMBEDDING` | embedding model id for Google embeddings |
| `HF_TOKEN` | required token for `inference.py` OpenAI client auth |
### app-runtime
| variable | default |
| --- | --- |
| `DEBUG` | `false` |
| `LOG_LEVEL` | `INFO` |
| `HOST` | `0.0.0.0` |
| `PORT` | `8000` |
| `CORS_ORIGINS` | `["http://localhost:5173","http://localhost:3000"]` |
| `SESSION_TIMEOUT` | `3600` |
| `MEMORY_TTL` | `86400` |
### inference-runtime
| variable | default |
| --- | --- |
| `API_BASE_URL` | `https://api.openai.com/v1` |
| `MODEL_NAME` | `gpt-4.1-mini` |
| `ENV_API_BASE_URL` | `http://localhost:8000/api` |
| `TASK_NAME` | `task_001` |
| `BENCHMARK` | `openenv` |
| `MAX_STEPS` | `12` |
| `EPISODE_SEED` | `42` |
| `LLM_TEMPERATURE` | `0.0` |
| `PROMPT_HTML_LIMIT` | `5000` |
| `REQUEST_TIMEOUT_SECONDS` | `30` |
| `USE_OPENENV_SDK` | `true` |
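Pulling representative variables from the three tables above, a minimal `.env` might look like this (values are placeholders; copy `.env.example` for the full set):

```bash
# provider keys (set only the providers you route to)
OPENAI_API_KEY=...
GROQ_API_KEY=...
HF_TOKEN=...               # required by inference.py

# app runtime
DEBUG=false
PORT=8000
CORS_ORIGINS=["http://localhost:5173","http://localhost:3000"]

# inference runtime
MODEL_NAME=gpt-4.1-mini
MAX_STEPS=12
USE_OPENENV_SDK=true
```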
## inference-py-openenv-contract
The root `inference.py` uses `from openai import OpenAI` for all LLM calls and emits strict structured logs:
```text
[START] task=<task_name> env=<benchmark> model=<model_name>
[STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
[END] success=<true|false> steps=<n> rewards=<r1,r2,...,rn>
```
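Because the log contract is strict, downstream tooling can parse it with plain regexes. A minimal sketch (assumes `action` strings contain no spaces; adjust the patterns if they do):

```python
import re
from typing import Optional

# Patterns mirroring the [STEP] and [END] line formats shown above.
STEP_RE = re.compile(
    r"\[STEP\] step=(?P<step>\d+) action=(?P<action>\S+) "
    r"reward=(?P<reward>-?\d+\.\d+) done=(?P<done>true|false) "
    r"error=(?P<error>\S+)"
)
END_RE = re.compile(
    r"\[END\] success=(?P<success>true|false) steps=(?P<steps>\d+) "
    r"rewards=(?P<rewards>.*)"
)

def parse_step(line: str) -> Optional[dict]:
    """Parse one [STEP] line into typed fields, or None if it doesn't match."""
    m = STEP_RE.match(line)
    if not m:
        return None
    return {
        "step": int(m["step"]),
        "action": m["action"],
        "reward": float(m["reward"]),
        "done": m["done"] == "true",
        "error": None if m["error"] == "null" else m["error"],
    }

def parse_end(line: str) -> Optional[dict]:
    """Parse the [END] summary line, including the comma-separated rewards."""
    m = END_RE.match(line)
    if not m:
        return None
    return {
        "success": m["success"] == "true",
        "steps": int(m["steps"]),
        "rewards": [float(r) for r in m["rewards"].split(",") if r],
    }
```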
Run:
```bash
python inference.py --task task_001 --benchmark openenv
```
## api-quick-map
Use `docs/api-reference.md` for the full endpoint inventory. Core surfaces:
| surface | endpoints |
| --- | --- |
| health | `/api/health`, `/api/ready`, `/api/ping` |
| episode | `/api/episode/reset`, `/api/episode/step`, `/api/episode/state/{episode_id}` |
| scrape | `/api/scrape/stream`, `/api/scrape/{session_id}/status`, `/api/scrape/{session_id}/result` |
| agents-tools-memory | `/api/agents/*`, `/api/tools/*`, `/api/plugins/*`, `/api/memory/*` |
| realtime | `/ws/episode/{episode_id}` |
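As a sketch of how the per-session scrape surface fits together (the base URL is the quick-start default; JSON payload shapes are not shown here and `docs/api-reference.md` is authoritative):

```python
import json
import urllib.request

BASE = "http://localhost:8000/api"  # assumed default from the quick-start table

def status_url(session_id: str) -> str:
    # /api/scrape/{session_id}/status from the table above
    return f"{BASE}/scrape/{session_id}/status"

def result_url(session_id: str) -> str:
    # /api/scrape/{session_id}/result from the table above
    return f"{BASE}/scrape/{session_id}/result"

def fetch_json(url: str, timeout: float = 10.0) -> dict:
    """GET a JSON endpoint; the response schema is defined by the backend."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

A typical flow would start a session via `/api/scrape/stream`, poll `status_url(...)` until the session completes, then fetch `result_url(...)`.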
## documentation-map
| document | purpose |
| --- | --- |
| `docs/overview.md` | platform overview and navigation |
| `docs/api-reference.md` | authoritative HTTP and WebSocket reference |
| `docs/architecture.md` | system architecture and runtime planes |
| `docs/openenv.md` | OpenEnv environment contract |
| `docs/tool-calls.md` | streamed tool-call event patterns |
| `docs/plugins.md` | plugin registry and dynamic tool model |
| `docs/memory.md` | memory design and operations |
| `docs/readme.md` | docs index |
## testing-and-validation
Backend:
```bash
cd backend
pytest
```
Frontend:
```bash
cd frontend
npm run test
```
## deployment-notes
| mode | notes |
| --- | --- |
| docker-compose | preferred local full-stack run |
| hugging-face-space | root `README.md` front matter and Docker SDK config are Space-compatible |
| direct-backend | run `uvicorn app.main:app` with `.env` configured |
## troubleshooting
| symptom | likely-cause | check |
| --- | --- | --- |
| provider not available | missing api key | verify `.env` provider key |
| streaming has no step events | scrape runtime failed early | inspect `/api/scrape/{session_id}/status` |
| inference exits with failure | missing `HF_TOKEN` or endpoint mismatch | verify `HF_TOKEN`, `API_BASE_URL`, `MODEL_NAME` |
| no frontend data | backend not reachable from frontend | check `VITE_API_PROXY_TARGET` / backend health |
## license
MIT.