---
title: ScrapeRL
emoji: 🌖
colorFrom: blue
colorTo: gray
sdk: docker
pinned: false
---

# ScrapeRL 🌖

**AI-Powered Web Scraping with Reinforcement Learning**

A next-generation web scraping system that uses reinforcement learning and multi-agent coordination to intelligently extract data from websites. Features multiple AI provider support (OpenAI, Anthropic, Google Gemini, Groq, NVIDIA), embeddings, real-time WebSocket updates, and a modern navy blue/cyan themed UI.

## ✨ Key Features

### 🤖 AI & Machine Learning

- **Multi-LLM Support** - OpenAI, Anthropic (Claude), Google (Gemini 2.5/2.0/3.0), Groq (Llama 3.3, Mixtral, Gemma2), NVIDIA (DeepSeek, Nemotron, Llama 3.3)
- **Smart Model Router** - Automatic selection of optimal model based on task type (code, reasoning, extraction, etc.)
- **Embeddings Service** - Semantic search with OpenAI and Google embeddings, in-memory caching
- **RL-Powered Scraping** - Reinforcement learning agents that learn optimal extraction strategies
- **Multi-Agent System** - Coordinated planner, extractor, and navigator agents

### ⚡ Real-Time Features

- **WebSocket Support** - Live progress updates during scraping episodes
- **Session-Based** - Clean slate on each session, no persistent rewards
- **Real-Time Metrics** - Track rewards, progress, and extraction in real-time

### 🎨 Modern UI/UX

- **Navy Blue & Cyan Theme** - Beautiful gradient design with glow effects
- **Fullscreen Layout** - Optimized for productivity
- **React + TailwindCSS** - Responsive and modern interface
- **Live Episode Monitoring** - Watch scraper progress in real-time

### 🔧 Developer Experience

- **FastAPI Backend** - High-performance async Python API
- **TypeScript Frontend** - Type-safe React application
- **Docker Ready** - Multi-stage builds with optimized images
- **Comprehensive Testing** - End-to-end test scripts included
- **Plugin System** - Extensible architecture with plugin support

## 🚀 Quick Start

### Prerequisites

- Python 3.11+
- Node.js 20+
- Docker (optional, but recommended)
- At least one AI provider API key (OpenAI, Anthropic, Google, Groq, or NVIDIA)

### Docker (Recommended)

```bash
# Clone the repository
git clone https://github.com/NeerajCodz/scrapeRL.git
cd scrapeRL

# Copy and configure environment
cp .env.example .env
# Edit .env and add your API keys

# Build and run
docker-compose up --build
```

Access the app at **http://localhost:7860**

### Local Development

**Backend:**

```bash
cd backend
pip install -r requirements.txt

# Copy environment file
cp ../.env.example ../.env
# Add your API keys to .env

# Run server
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
```

**Frontend:**

```bash
cd frontend
npm install
npm run dev
```

Frontend will be at **http://localhost:5173**

## 📡 API Endpoints

### Core Endpoints

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/health` | Health check and system status |
| POST | `/api/episode/reset` | Create a new scraping episode |
| POST | `/api/episode/step` | Execute an action in an episode |
| GET | `/api/episode/state/{episode_id}` | Get current episode state |

### Scrape Streaming Endpoints

| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | `/api/scrape/stream` | Run scrape with SSE live events (`init`, `url_start`, `step`, `url_complete`, `complete`) |
| POST | `/api/scrape/` | Start scrape in background and return `session_id` |
| GET | `/api/scrape/{session_id}/status` | Session status, reward, steps, plugin info |
| GET | `/api/scrape/{session_id}/result` | Final formatted output (json/csv/markdown/text) |
| GET | `/api/scrape/sessions` | List active scrape sessions |
| DELETE | `/api/scrape/{session_id}` | Cancel running scrape session |

#### Scrape plugin capabilities

- Query assets can be discovered via `mcp-search` (non-URL asset text -> resolved links).
- Python sandbox analysis plugins:
  - `mcp-python-sandbox`
  - `proc-python`
  - `proc-pandas`
  - `proc-numpy`
  - `proc-bs4`
- Optional request field: `python_code` (sandboxed, validated code; must assign `result`).
- Sandbox execution is per-request isolated and cleaned after run.

### AI Provider Endpoints

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/providers` | List all configured AI providers |
| GET | `/api/providers/{name}` | Get specific provider details |
| GET | `/api/providers/models/all` | List all available models |
| GET | `/api/providers/costs/summary` | Get cost tracking summary |

### WebSocket Endpoints

| Type | Endpoint | Description |
|------|----------|-------------|
| WS | `/ws/episode/{episode_id}` | Real-time episode/session updates |

### Other Endpoints

- `/api/tasks` - Task management
- `/api/agents` - Agent configuration
- `/api/tools` - MCP tools registry
- `/api/memory` - Memory management
- `/api/plugins` - Plugin system
- `/api/settings` - System settings

## 🏗️ Architecture

```
scrapeRL/
├── backend/
│   ├── app/
│   │   ├── main.py              # FastAPI app entry
│   │   ├── config.py            # Configuration management
│   │   ├── api/
│   │   │   └── routes/          # API endpoints
│   │   │       ├── episode.py   # Episode management
│   │   │       ├── providers.py # AI provider APIs
│   │   │       ├── websocket.py # Real-time updates
│   │   │       └── ...
│   │   ├── core/
│   │   │   ├── env.py           # RL environment
│   │   │   ├── reward.py        # Reward engine
│   │   │   ├── embeddings.py    # Embeddings service
│   │   │   └── ...
│   │   ├── agents/
│   │   │   ├── coordinator.py   # Agent orchestration
│   │   │   ├── planner.py       # Planning agent
│   │   │   ├── extractor.py     # Extraction agent
│   │   │   └── navigator.py     # Navigation agent
│   │   ├── models/
│   │   │   ├── router.py        # Smart model router
│   │   │   └── providers/       # AI provider implementations
│   │   │       ├── openai.py    # OpenAI GPT-4
│   │   │       ├── anthropic.py # Claude 3.5 Sonnet
│   │   │       ├── google.py    # Gemini 2.5/2.0/3.0
│   │   │       ├── groq.py      # Llama 3.3, Mixtral
│   │   │       └── nvidia.py    # DeepSeek, Nemotron
│   │   ├── memory/              # Memory system
│   │   ├── tools/               # MCP tools
│   │   ├── plugins/             # Sandboxed plugin executors
│   │   └── types/               # Type definitions
│   └── requirements.txt
├── frontend/
│   ├── src/
│   │   ├── components/          # React components
│   │   ├── hooks/
│   │   │   ├── useWebSocket.ts       # WebSocket hook
│   │   │   └── useEpisodeProgress.ts # Episode tracking
│   │   ├── api/                 # API clients
│   │   ├── types/               # TypeScript types
│   │   └── index.css            # Navy/cyan theme
│   └── package.json
├── Dockerfile                   # Multi-stage build
├── docker-compose.yml           # Local development
├── .env.example                 # Environment template
└── README.md
```

## ⚙️ Configuration

Create a `.env` file in the root directory (see `.env.example` for template):

### AI Provider API Keys (Optional - at least one recommended)

| Variable | Description | Provider |
|----------|-------------|----------|
| `OPENAI_API_KEY` | OpenAI API key | GPT-4o, GPT-4o-mini, O1 |
| `ANTHROPIC_API_KEY` | Anthropic API key | Claude 3.5 Sonnet, Haiku, Opus |
| `GOOGLE_API_KEY` | Google AI API key | Gemini 2.5 Pro/Flash, Gemini 2.0, Gemini 3.0 |
| `GROQ_API_KEY` | Groq API key | Llama 3.3 70B, Llama 3.2 Vision, Mixtral, Gemma2 |
| `NVIDIA_API_KEY` | NVIDIA API key | DeepSeek R1/V3.2, Nemotron 70B, Llama 3.3 70B |

### HuggingFace (Optional)

| Variable | Description |
|----------|-------------|
| `HF_TOKEN` | HuggingFace token for model access |

### App Settings

| Variable | Default | Description |
|----------|---------|-------------|
| `DEBUG` | `false` | Enable debug mode |
| `LOG_LEVEL` | `INFO` | Logging level (DEBUG, INFO, WARN, ERROR) |
| `HOST` | `0.0.0.0` | Server host |
| `PORT` | `8000` | Server port |

### CORS Settings

| Variable | Default | Description |
|----------|---------|-------------|
| `CORS_ORIGINS` | `["http://localhost:5173"]` | Allowed CORS origins |

### Session & Memory

| Variable | Default | Description |
|----------|---------|-------------|
| `SESSION_TIMEOUT` | `3600` | Session timeout in seconds |
| `MEMORY_TTL` | `86400` | Memory TTL in seconds |

## 🧪 Testing

Run the end-to-end test script:

```bash
cd backend
python test_scraper.py
```

This will:

1. Create a scraping episode
2. Execute navigation and extraction actions
3. Track rewards and progress
4. Verify WebSocket connectivity
5. Display final results

Expected output:

```
✓ Episode created:
✓ Action executed successfully
  Reward: 0.65
  Progress: 0.0%
✓ Final state retrieved
  Steps: 3
  Total reward: 2.26
```

## 🚀 Deployment

### HuggingFace Spaces

This app is configured for HuggingFace Spaces with Docker SDK:

- Port: 7860
- Health check: `/api/health`
- Auto-builds on push
- Multi-stage build for optimized image size

### Manual Docker

```bash
# Run frontend + backend together
docker compose up --build
```

After startup:

- Frontend: `http://localhost:3000`
- Backend API: `http://localhost:8000/api`

### Environment Variables in Production

Set all required environment variables in your deployment platform:

- HuggingFace Spaces: Settings → Repository secrets
- Docker: Use `--env-file` or environment section in docker-compose
- Kubernetes: ConfigMaps and Secrets

## 🎯 Usage Examples

### Example 1: Simple Scraping Task

```bash
curl -X POST http://localhost:8000/api/episode/reset \
  -H "Content-Type: application/json" \
  -d '{
    "task_id": "scrape-quotes",
    "config": {
      "start_url": "http://quotes.toscrape.com",
      "target_fields": {
        "quotes": {"text": "quote text", "author": "author name"}
      },
      "max_steps": 20
    }
  }'
```

### Example 2: WebSocket Connection

```javascript
// Frontend JavaScript
// The endpoint is /ws/episode/{episode_id}: append your episode's ID to this URL.
const ws = new WebSocket('ws://localhost:8000/ws/episode/');

ws.onmessage = (event) => {
  const message = JSON.parse(event.data);

  if (message.type === 'progress') {
    console.log(`Step ${message.step}: ${message.action_type}`);
    console.log(`Reward: ${message.reward}, Progress: ${message.progress}%`);
  }

  if (message.type === 'completion') {
    console.log(`Episode completed! Success: ${message.success}`);
    console.log(`Total reward: ${message.total_reward}`);
  }
};
```

## 🤝 Contributing

Contributions welcome!
This project follows conventional commit messages:

- `feat:` - New features
- `fix:` - Bug fixes
- `chore:` - Maintenance tasks
- `docs:` - Documentation updates
- `test:` - Test additions/updates

## 📄 License

MIT License - see [LICENSE](LICENSE) for details.

## 🙏 Acknowledgments

- Built with FastAPI, React, TailwindCSS
- Powered by OpenAI, Anthropic, Google, Groq, and NVIDIA AI models
- Inspired by reinforcement learning research in web automation
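
A note on the `/api/scrape/stream` endpoint from the API section: a client has to split the SSE response into the events listed there (`init`, `url_start`, `step`, `url_complete`, `complete`). The sketch below shows one way to do that in Python, assuming the server emits standard `event:` and `data:` fields with JSON payloads. This README does not specify the payload shape, so the helper name `parse_sse` and the keys in the sample stream are hypothetical.

```python
import json
from typing import Iterable, Iterator, Tuple


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, dict]]:
    """Yield (event_name, payload) pairs from an iterable of SSE text lines."""
    event, data = "message", []  # "message" is the SSE default event name
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            # A blank line terminates the current event.
            if data:
                yield event, json.loads("\n".join(data))
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment or keep-alive line
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())


# Hypothetical sample stream; real payload keys may differ.
sample = (
    'event: init\ndata: {"session_id": "demo"}\n\n'
    'event: step\ndata: {"step": 1, "reward": 0.65}\n\n'
    'event: complete\ndata: {"total_reward": 2.26}\n\n'
)

events = list(parse_sse(sample.splitlines()))
for name, payload in events:
    print(name, payload)
```

Any HTTP client that can iterate over a streaming response line by line can feed `parse_sse` in place of the sample string.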