---
title: ScrapeRL
emoji: 🌖
colorFrom: blue
colorTo: gray
sdk: docker
pinned: false
---
# scraperl
ScrapeRL is an AI-first web-scraping platform that combines reinforcement-learning-style episodes, multi-agent planning, dynamic tool/plugin calls, and multi-provider LLM routing. It supports synchronous and streaming scrape APIs, session-based execution, real-time frontend updates, and OpenEnv-compatible inference.
## what-this-project-delivers
| area | capability |
| --- | --- |
| scraping-runtime | endpoint-driven scraping with `json`, `csv`, `markdown`, and `text` output modes |
| ai-routing | provider/model routing across OpenAI, Anthropic, Google, Groq, and NVIDIA |
| agentic-tooling | registry-based runtime tool planning and execution with streamed `tool_call` steps |
| memory | short-term, working, long-term, and shared memory layers |
| interface | React + Vite dashboard with live stream progress and session visibility |
| deployment | local dev, Docker Compose, and Hugging Face Space-compatible Docker setup |
| evaluation | root `inference.py` following the strict `[START]/[STEP]/[END]` OpenEnv output contract |
## system-topology
```mermaid
flowchart TD
A[frontend-dashboard] --> B[fastapi-control-plane]
B --> C[episode-runtime]
B --> D[scrape-runtime]
B --> E[agent-runtime]
E --> F[model-router]
E --> G[tool-and-plugin-registry]
E --> H[memory-manager]
D --> G
D --> H
B --> I[websocket-and-sse-streams]
```
## repository-layout
```text
scrapeRL/
backend/
app/
api/routes/ # FastAPI route modules
agents/ # agent planning/runtime logic
models/ # model router + provider adapters
plugins/ # plugin registry + runtime integrations
memory/ # memory layers and manager
core/ # env/reward/observation/action foundations
requirements.txt
frontend/
src/ # React app
package.json
docs/ # modular technical documentation
inference.py # OpenEnv-compliant inference runner
docker-compose.yml
.env.example
```
## quick-start
### docker-compose
```bash
git clone https://github.com/NeerajCodz/scrapeRL.git
cd scrapeRL
cp .env.example .env
# set api keys in .env
docker compose up --build
```
| service | url |
| --- | --- |
| frontend | `http://localhost:3000` |
| backend-api | `http://localhost:8000` |
| swagger | `http://localhost:8000/swagger` |
### local-development
Backend:
```bash
cd backend
pip install -r requirements.txt
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
```
Frontend:
```bash
cd frontend
npm install
npm run dev -- --host 0.0.0.0 --port 3000
```
## configuration
Root configuration lives in `.env` (template: `.env.example`).
### provider-and-model-keys
| variable | purpose |
| --- | --- |
| `OPENAI_API_KEY` | OpenAI chat + embeddings access |
| `ANTHROPIC_API_KEY` | Anthropic model access |
| `GOOGLE_API_KEY` | Google provider and embeddings access |
| `GEMINI_API_KEY` | alias key used by tests/compose for Gemini |
| `GROQ_API_KEY` | Groq provider access |
| `NVIDIA_API_KEY` | NVIDIA provider access |
| `NVIDIA_BASE_URL` | NVIDIA OpenAI-compatible endpoint base URL |
| `GEMINI_MODEL_EMBEDDING` | embedding model id for Google embeddings |
| `HF_TOKEN` | required token for `inference.py` OpenAI client auth |
### app-runtime
| variable | default |
| --- | --- |
| `DEBUG` | `false` |
| `LOG_LEVEL` | `INFO` |
| `HOST` | `0.0.0.0` |
| `PORT` | `8000` |
| `CORS_ORIGINS` | `["http://localhost:5173","http://localhost:3000"]` |
| `SESSION_TIMEOUT` | `3600` |
| `MEMORY_TTL` | `86400` |
### inference-runtime
| variable | default |
| --- | --- |
| `API_BASE_URL` | `https://api.openai.com/v1` |
| `MODEL_NAME` | `gpt-4.1-mini` |
| `ENV_API_BASE_URL` | `http://localhost:8000/api` |
| `TASK_NAME` | `task_001` |
| `BENCHMARK` | `openenv` |
| `MAX_STEPS` | `12` |
| `EPISODE_SEED` | `42` |
| `LLM_TEMPERATURE` | `0.0` |
| `PROMPT_HTML_LIMIT` | `5000` |
| `REQUEST_TIMEOUT_SECONDS` | `30` |
| `USE_OPENENV_SDK` | `true` |
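Pulling representative variables from the three tables above, a minimal `.env` might look like this (values are placeholders; copy `.env.example` for the full set):

```bash
# provider keys (set only the providers you route to)
OPENAI_API_KEY=...
GROQ_API_KEY=...
HF_TOKEN=...               # required by inference.py

# app runtime
DEBUG=false
PORT=8000
CORS_ORIGINS=["http://localhost:5173","http://localhost:3000"]

# inference runtime
MODEL_NAME=gpt-4.1-mini
MAX_STEPS=12
USE_OPENENV_SDK=true
```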
## inference-py-openenv-contract
The root `inference.py` uses `from openai import OpenAI` for all LLM calls and emits strict structured logs:
```text
[START] task=<task_name> env=<benchmark> model=<model_name>
[STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
[END] success=<true|false> steps=<n> rewards=<r1,r2,...,rn>
```
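Because the log contract is strict, downstream tooling can parse it with plain regexes. A minimal sketch (assumes `action` strings contain no spaces; adjust the patterns if they do):

```python
import re
from typing import Optional

# Patterns mirroring the [STEP] and [END] line formats shown above.
STEP_RE = re.compile(
    r"\[STEP\] step=(?P<step>\d+) action=(?P<action>\S+) "
    r"reward=(?P<reward>-?\d+\.\d+) done=(?P<done>true|false) "
    r"error=(?P<error>\S+)"
)
END_RE = re.compile(
    r"\[END\] success=(?P<success>true|false) steps=(?P<steps>\d+) "
    r"rewards=(?P<rewards>.*)"
)

def parse_step(line: str) -> Optional[dict]:
    """Parse one [STEP] line into typed fields, or None if it doesn't match."""
    m = STEP_RE.match(line)
    if not m:
        return None
    return {
        "step": int(m["step"]),
        "action": m["action"],
        "reward": float(m["reward"]),
        "done": m["done"] == "true",
        "error": None if m["error"] == "null" else m["error"],
    }

def parse_end(line: str) -> Optional[dict]:
    """Parse the [END] summary line, including the comma-separated rewards."""
    m = END_RE.match(line)
    if not m:
        return None
    return {
        "success": m["success"] == "true",
        "steps": int(m["steps"]),
        "rewards": [float(r) for r in m["rewards"].split(",") if r],
    }
```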
Run:
```bash
python inference.py --task task_001 --benchmark openenv
```
## api-quick-map
Use `docs/api-reference.md` for the full endpoint inventory. Core surfaces:
| surface | endpoints |
| --- | --- |
| health | `/api/health`, `/api/ready`, `/api/ping` |
| episode | `/api/episode/reset`, `/api/episode/step`, `/api/episode/state/{episode_id}` |
| scrape | `/api/scrape/stream`, `/api/scrape/{session_id}/status`, `/api/scrape/{session_id}/result` |
| agents-tools-memory | `/api/agents/*`, `/api/tools/*`, `/api/plugins/*`, `/api/memory/*` |
| realtime | `/ws/episode/{episode_id}` |
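As a sketch of how the per-session scrape surface fits together (the base URL is the quick-start default; JSON payload shapes are not shown here and `docs/api-reference.md` is authoritative):

```python
import json
import urllib.request

BASE = "http://localhost:8000/api"  # assumed default from the quick-start table

def status_url(session_id: str) -> str:
    # /api/scrape/{session_id}/status from the table above
    return f"{BASE}/scrape/{session_id}/status"

def result_url(session_id: str) -> str:
    # /api/scrape/{session_id}/result from the table above
    return f"{BASE}/scrape/{session_id}/result"

def fetch_json(url: str, timeout: float = 10.0) -> dict:
    """GET a JSON endpoint; the response schema is defined by the backend."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

A typical flow would start a session via `/api/scrape/stream`, poll `status_url(...)` until the session completes, then fetch `result_url(...)`.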
## documentation-map
| document | purpose |
| --- | --- |
| `docs/overview.md` | platform overview and navigation |
| `docs/api-reference.md` | authoritative HTTP and WebSocket reference |
| `docs/architecture.md` | system architecture and runtime planes |
| `docs/openenv.md` | OpenEnv environment contract |
| `docs/tool-calls.md` | streamed tool-call event patterns |
| `docs/plugins.md` | plugin registry and dynamic tool model |
| `docs/memory.md` | memory design and operations |
| `docs/readme.md` | docs index |
## testing-and-validation
Backend:
```bash
cd backend
pytest
```
Frontend:
```bash
cd frontend
npm run test
```
## deployment-notes
| mode | notes |
| --- | --- |
| docker-compose | preferred local full-stack run |
| hugging-face-space | root `README.md` front matter and Docker SDK config are Space-compatible |
| direct-backend | run `uvicorn app.main:app` with `.env` configured |
## troubleshooting
| symptom | likely-cause | check |
| --- | --- | --- |
| provider not available | missing api key | verify `.env` provider key |
| streaming has no step events | scrape runtime failed early | inspect `/api/scrape/{session_id}/status` |
| inference exits with failure | missing `HF_TOKEN` or endpoint mismatch | verify `HF_TOKEN`, `API_BASE_URL`, `MODEL_NAME` |
| no frontend data | backend not reachable from frontend | check `VITE_API_PROXY_TARGET` / backend health |
## license
MIT.