title: ScrapeRL
emoji: π
colorFrom: blue
colorTo: gray
sdk: docker
pinned: false
ScrapeRL
AI-Powered Web Scraping with Reinforcement Learning
A next-generation web scraping system that uses reinforcement learning and multi-agent coordination to intelligently extract data from websites. Features multiple AI provider support (OpenAI, Anthropic, Google Gemini, Groq, NVIDIA), embeddings, real-time WebSocket updates, and a modern navy blue/cyan themed UI.
Key Features
AI & Machine Learning
- Multi-LLM Support - OpenAI, Anthropic (Claude), Google (Gemini 2.5/2.0/3.0), Groq (Llama 3.3, Mixtral, Gemma2), NVIDIA (DeepSeek, Nemotron, Llama 3.3)
- Smart Model Router - Automatic selection of optimal model based on task type (code, reasoning, extraction, etc.)
- Embeddings Service - Semantic search with OpenAI and Google embeddings, in-memory caching
- RL-Powered Scraping - Reinforcement learning agents that learn optimal extraction strategies
- Multi-Agent System - Coordinated planner, extractor, and navigator agents
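One way to picture the Smart Model Router is as a lookup from task type to a preferred provider, falling back to whichever providers have keys configured. The sketch below is illustrative only: the `ROUTES` table, the preference order, and the `pick_provider` name are assumptions, not the project's actual router.

```python
from typing import Mapping, Sequence

# Hypothetical preference table: task type -> providers to try, in order.
# Provider names come from this README; the ordering is an assumption.
ROUTES: Mapping[str, Sequence[str]] = {
    "code": ("anthropic", "openai", "nvidia"),
    "reasoning": ("openai", "google", "nvidia"),
    "extraction": ("groq", "google", "openai"),
}
DEFAULT_ORDER: Sequence[str] = ("openai", "anthropic", "google", "groq", "nvidia")


def pick_provider(task_type: str, configured: set[str]) -> str:
    """Return the first preferred provider that has an API key configured."""
    for name in (*ROUTES.get(task_type, ()), *DEFAULT_ORDER):
        if name in configured:
            return name
    raise LookupError("no AI provider is configured")
```

With only OpenAI and Groq keys present, `pick_provider("extraction", {"openai", "groq"})` would select `"groq"` under this table, while an unknown task type falls through to the default order.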
Real-Time Features
- WebSocket Support - Live progress updates during scraping episodes
- Session-Based - Clean slate on each session, no persistent rewards
- Real-Time Metrics - Track rewards, progress, and extraction in real time
Modern UI/UX
- Navy Blue & Cyan Theme - Beautiful gradient design with glow effects
- Fullscreen Layout - Optimized for productivity
- React + TailwindCSS - Responsive and modern interface
- Live Episode Monitoring - Watch scraper progress in real time
Developer Experience
- FastAPI Backend - High-performance async Python API
- TypeScript Frontend - Type-safe React application
- Docker Ready - Multi-stage builds with optimized images
- Comprehensive Testing - End-to-end test scripts included
- Plugin System - Extensible architecture with plugin support
Quick Start
Prerequisites
- Python 3.11+
- Node.js 20+
- Docker (optional, but recommended)
- At least one AI provider API key (OpenAI, Anthropic, Google, Groq, or NVIDIA)
Docker (Recommended)
# Clone the repository
git clone https://github.com/NeerajCodz/scrapeRL.git
cd scrapeRL
# Copy and configure environment
cp .env.example .env
# Edit .env and add your API keys
# Build and run
docker-compose up --build
Access the app at http://localhost:7860
Local Development
Backend:
cd backend
pip install -r requirements.txt
# Copy environment file
cp ../.env.example ../.env
# Add your API keys to .env
# Run server
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
Frontend:
cd frontend
npm install
npm run dev
Frontend will be at http://localhost:5173
API Endpoints
Core Endpoints
| Method | Endpoint | Description |
|---|---|---|
| GET | `/api/health` | Health check and system status |
| POST | `/api/episode/reset` | Create a new scraping episode |
| POST | `/api/episode/step` | Execute an action in an episode |
| GET | `/api/episode/state/{episode_id}` | Get current episode state |
Scrape Streaming Endpoints
| Method | Endpoint | Description |
|---|---|---|
| POST | `/api/scrape/stream` | Run scrape with SSE live events (init, url_start, step, url_complete, complete) |
| POST | `/api/scrape/` | Start scrape in background and return session_id |
| GET | `/api/scrape/{session_id}/status` | Session status, reward, steps, plugin info |
| GET | `/api/scrape/{session_id}/result` | Final formatted output (json/csv/markdown/text) |
| GET | `/api/scrape/sessions` | List active scrape sessions |
| DELETE | `/api/scrape/{session_id}` | Cancel running scrape session |
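A client of `/api/scrape/stream` has to split the response into events. The sketch below assumes the endpoint uses standard `text/event-stream` framing (`event:` and `data:` lines, a blank line between events) with JSON in `data:`; this README only names the event types, so the framing and the `iter_sse_events` helper are assumptions.

```python
import json
from typing import Iterable, Iterator


def iter_sse_events(lines: Iterable[str]) -> Iterator[tuple[str, dict]]:
    """Yield (event_name, payload) pairs from a stream of SSE lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":                        # blank line terminates an event
            if data:
                yield event, json.loads("\n".join(data))
            event, data = "message", []
        elif line.startswith(":"):            # comment or keep-alive line
            continue
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].lstrip(" "))
    if data:                                  # lenient: flush a trailing event
        yield event, json.loads("\n".join(data))
```

The function accepts any iterable of text lines, so it can be fed from a decoded HTTP response body or from a list of strings in a test.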
Scrape plugin capabilities
- Query assets can be discovered via `mcp-search` (non-URL asset text -> resolved links).
- Python sandbox analysis plugins: `mcp-python-sandbox`, `proc-python`, `proc-pandas`, `proc-numpy`, `proc-bs4`
- Optional request field: `python_code` (sandboxed, validated code; must assign `result`).
- Sandbox execution is per-request isolated and cleaned after run.
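Because `python_code` must assign `result`, it can be useful to check a snippet locally before submitting it. The sketch below is a client-side pre-check built on the standard `ast` module; it is not the server's validator, and the `assigns_result` helper is a hypothetical name.

```python
import ast


def assigns_result(source: str) -> bool:
    """True if `source` parses and assigns a top-level name `result`."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    for node in tree.body:
        if isinstance(node, ast.Assign):
            targets = node.targets
        elif isinstance(node, ast.AnnAssign) and node.value is not None:
            targets = [node.target]
        else:
            continue
        # Only plain names are checked; tuple unpacking is deliberately ignored.
        if any(isinstance(t, ast.Name) and t.id == "result" for t in targets):
            return True
    return False
```

This only inspects top-level statements, so a `result` assigned inside a function or loop would be reported as missing; whether the sandbox accepts such code is not stated here.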
AI Provider Endpoints
| Method | Endpoint | Description |
|---|---|---|
| GET | `/api/providers` | List all configured AI providers |
| GET | `/api/providers/{name}` | Get specific provider details |
| GET | `/api/providers/models/all` | List all available models |
| GET | `/api/providers/costs/summary` | Get cost tracking summary |
WebSocket Endpoints
| Type | Endpoint | Description |
|---|---|---|
| WS | `/ws/episode/{episode_id}` | Real-time episode/session updates |
Other Endpoints
- `/api/tasks` - Task management
- `/api/agents` - Agent configuration
- `/api/tools` - MCP tools registry
- `/api/memory` - Memory management
- `/api/plugins` - Plugin system
- `/api/settings` - System settings
Architecture
scrapeRL/
├── backend/
│   ├── app/
│   │   ├── main.py                    # FastAPI app entry
│   │   ├── config.py                  # Configuration management
│   │   ├── api/
│   │   │   └── routes/                # API endpoints
│   │   │       ├── episode.py         # Episode management
│   │   │       ├── providers.py       # AI provider APIs
│   │   │       ├── websocket.py       # Real-time updates
│   │   │       └── ...
│   │   ├── core/
│   │   │   ├── env.py                 # RL environment
│   │   │   ├── reward.py              # Reward engine
│   │   │   ├── embeddings.py          # Embeddings service
│   │   │   └── ...
│   │   ├── agents/
│   │   │   ├── coordinator.py         # Agent orchestration
│   │   │   ├── planner.py             # Planning agent
│   │   │   ├── extractor.py           # Extraction agent
│   │   │   └── navigator.py           # Navigation agent
│   │   ├── models/
│   │   │   ├── router.py              # Smart model router
│   │   │   └── providers/             # AI provider implementations
│   │   │       ├── openai.py          # OpenAI GPT-4
│   │   │       ├── anthropic.py       # Claude 3.5 Sonnet
│   │   │       ├── google.py          # Gemini 2.5/2.0/3.0
│   │   │       ├── groq.py            # Llama 3.3, Mixtral
│   │   │       └── nvidia.py          # DeepSeek, Nemotron
│   │   ├── memory/                    # Memory system
│   │   ├── tools/                     # MCP tools
│   │   ├── plugins/                   # Sandboxed plugin executors
│   │   └── types/                     # Type definitions
│   └── requirements.txt
├── frontend/
│   ├── src/
│   │   ├── components/                # React components
│   │   ├── hooks/
│   │   │   ├── useWebSocket.ts        # WebSocket hook
│   │   │   └── useEpisodeProgress.ts  # Episode tracking
│   │   ├── api/                       # API clients
│   │   ├── types/                     # TypeScript types
│   │   └── index.css                  # Navy/cyan theme
│   └── package.json
├── Dockerfile                         # Multi-stage build
├── docker-compose.yml                 # Local development
├── .env.example                       # Environment template
└── README.md
Configuration
Create a .env file in the root directory (see .env.example for template):
AI Provider API Keys (Optional - at least one recommended)
| Variable | Description | Models |
|---|---|---|
| `OPENAI_API_KEY` | OpenAI API key | GPT-4o, GPT-4o-mini, O1 |
| `ANTHROPIC_API_KEY` | Anthropic API key | Claude 3.5 Sonnet, Haiku, Opus |
| `GOOGLE_API_KEY` | Google AI API key | Gemini 2.5 Pro/Flash, Gemini 2.0, Gemini 3.0 |
| `GROQ_API_KEY` | Groq API key | Llama 3.3 70B, Llama 3.2 Vision, Mixtral, Gemma2 |
| `NVIDIA_API_KEY` | NVIDIA API key | DeepSeek R1/V3.2, Nemotron 70B, Llama 3.3 70B |
HuggingFace (Optional)
| Variable | Description |
|---|---|
| `HF_TOKEN` | HuggingFace token for model access |
App Settings
| Variable | Default | Description |
|---|---|---|
| `DEBUG` | `false` | Enable debug mode |
| `LOG_LEVEL` | `INFO` | Logging level (DEBUG, INFO, WARN, ERROR) |
| `HOST` | `0.0.0.0` | Server host |
| `PORT` | `8000` | Server port |
CORS Settings
| Variable | Default | Description |
|---|---|---|
| `CORS_ORIGINS` | `["http://localhost:5173"]` | Allowed CORS origins |
Session & Memory
| Variable | Default | Description |
|---|---|---|
| `SESSION_TIMEOUT` | `3600` | Session timeout in seconds |
| `MEMORY_TTL` | `86400` | Memory TTL in seconds |
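To make the defaults in the tables above concrete, here is a standard-library sketch that reads the documented variables and falls back to the documented defaults. It is not the project's `config.py`; the `Settings` dataclass, the `load_settings` helper, the set of truthy strings, and the assumption that `CORS_ORIGINS` holds a JSON array are all illustrative.

```python
import json
import os
from dataclasses import dataclass
from typing import Mapping, Optional

TRUTHY = {"1", "true", "yes", "on"}  # assumed spellings for boolean flags


@dataclass(frozen=True)
class Settings:
    debug: bool
    log_level: str
    host: str
    port: int
    cors_origins: tuple[str, ...]
    session_timeout: int  # seconds
    memory_ttl: int       # seconds


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read the documented variables, using the documented defaults."""
    env = os.environ if env is None else env
    # Assumption: CORS_ORIGINS is a JSON array, as its default value suggests.
    origins = json.loads(env.get("CORS_ORIGINS", '["http://localhost:5173"]'))
    return Settings(
        debug=env.get("DEBUG", "false").strip().lower() in TRUTHY,
        log_level=env.get("LOG_LEVEL", "INFO").upper(),
        host=env.get("HOST", "0.0.0.0"),
        port=int(env.get("PORT", "8000")),
        cors_origins=tuple(origins),
        session_timeout=int(env.get("SESSION_TIMEOUT", "3600")),
        memory_ttl=int(env.get("MEMORY_TTL", "86400")),
    )
```

Passing a plain dictionary instead of relying on `os.environ` keeps the loader easy to exercise in tests, for example `load_settings({"PORT": "7860"})`.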
Testing
Run the end-to-end test script:
cd backend
python test_scraper.py
This will:
- Create a scraping episode
- Execute navigation and extraction actions
- Track rewards and progress
- Verify WebSocket connectivity
- Display final results
Expected output:
✓ Episode created: <uuid>
✓ Action executed successfully
Reward: 0.65
Progress: 0.0%
✓ Final state retrieved
Steps: 3
Total reward: 2.26
Deployment
HuggingFace Spaces
This app is configured for HuggingFace Spaces with Docker SDK:
- Port: 7860
- Health check: `/api/health`
- Auto-builds on push
- Multi-stage build for optimized image size
Manual Docker
# Run frontend + backend together
docker compose up --build
After startup:
- Frontend: http://localhost:3000
- Backend API: http://localhost:8000/api
Environment Variables in Production
Set all required environment variables in your deployment platform:
- HuggingFace Spaces: Settings → Repository secrets
- Docker: Use `--env-file` or the environment section in docker-compose
- Kubernetes: ConfigMaps and Secrets
Usage Examples
Example 1: Simple Scraping Task
curl -X POST http://localhost:8000/api/episode/reset \
-H "Content-Type: application/json" \
-d '{
"task_id": "scrape-quotes",
"config": {
"start_url": "http://quotes.toscrape.com",
"target_fields": {
"quotes": {"text": "quote text", "author": "author name"}
},
"max_steps": 20
}
}'
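The same request can be built from Python with only the standard library. This sketch mirrors the curl call above; the `build_reset_request` helper is a hypothetical name, and nothing is assumed about the shape of the response.

```python
import json
import urllib.request


def build_reset_request(base_url: str, task_id: str, config: dict) -> urllib.request.Request:
    """Build (but do not send) the POST /api/episode/reset request."""
    body = json.dumps({"task_id": task_id, "config": config}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/episode/reset",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_reset_request(
    "http://localhost:8000",
    "scrape-quotes",
    {
        "start_url": "http://quotes.toscrape.com",
        "target_fields": {"quotes": {"text": "quote text", "author": "author name"}},
        "max_steps": 20,
    },
)
# To send it against a running backend:
#   with urllib.request.urlopen(request) as response:
#       print(json.load(response))
```

Separating request construction from sending makes it possible to inspect the URL and payload without a server running.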
Example 2: WebSocket Connection
// Frontend JavaScript
const ws = new WebSocket('ws://localhost:8000/ws/episode/<episode_id>');
ws.onmessage = (event) => {
const message = JSON.parse(event.data);
if (message.type === 'progress') {
console.log(`Step ${message.step}: ${message.action_type}`);
console.log(`Reward: ${message.reward}, Progress: ${message.progress}%`);
}
if (message.type === 'completion') {
console.log(`Episode completed! Success: ${message.success}`);
console.log(`Total reward: ${message.total_reward}`);
}
};
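For a Python consumer, the message handling in Example 2 can be expressed as a pure function that is independent of any WebSocket client library. The field names below are taken from the JavaScript example; the `describe_message` helper is a hypothetical name, and the transport (connecting and receiving frames) is intentionally left out.

```python
import json


def describe_message(raw: str) -> list[str]:
    """Turn one JSON WebSocket message into the log lines from Example 2."""
    message = json.loads(raw)
    kind = message.get("type")
    if kind == "progress":
        return [
            f"Step {message['step']}: {message['action_type']}",
            f"Reward: {message['reward']}, Progress: {message['progress']}%",
        ]
    if kind == "completion":
        return [
            f"Episode completed! Success: {message['success']}",
            f"Total reward: {message['total_reward']}",
        ]
    return []  # other message types are ignored in this sketch
```

Each text frame received from `/ws/episode/{episode_id}` can be passed to `describe_message` and the returned lines printed or logged.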
Contributing
Contributions welcome! This project follows conventional commit messages:
- `feat:` - New features
- `fix:` - Bug fixes
- `chore:` - Maintenance tasks
- `docs:` - Documentation updates
- `test:` - Test additions/updates
License
MIT License - see LICENSE for details.
Acknowledgments
- Built with FastAPI, React, TailwindCSS
- Powered by OpenAI, Anthropic, Google, Groq, and NVIDIA AI models
- Inspired by reinforcement learning research in web automation