---
title: ScrapeRL
colorFrom: blue
colorTo: gray
sdk: docker
pinned: false
---
# ScrapeRL

**AI-Powered Web Scraping with Reinforcement Learning**

A next-generation web scraping system that uses reinforcement learning and multi-agent coordination to intelligently extract data from websites. It supports multiple AI providers (OpenAI, Anthropic, Google Gemini, Groq, NVIDIA) and offers embeddings, real-time WebSocket updates, and a modern navy blue and cyan themed UI.

## Key Features

### AI & Machine Learning

- **Multi-LLM Support** - OpenAI, Anthropic (Claude), Google (Gemini 2.5/2.0/3.0), Groq (Llama 3.3, Mixtral, Gemma2), NVIDIA (DeepSeek, Nemotron, Llama 3.3)
- **Smart Model Router** - Automatic selection of the optimal model based on task type (code, reasoning, extraction, etc.)
- **Embeddings Service** - Semantic search with OpenAI and Google embeddings, with in-memory caching
- **RL-Powered Scraping** - Reinforcement learning agents that learn optimal extraction strategies
- **Multi-Agent System** - Coordinated planner, extractor, and navigator agents

### Real-Time Features

- **WebSocket Support** - Live progress updates during scraping episodes
- **Session-Based** - Clean slate on each session, no persistent rewards
- **Real-Time Metrics** - Track rewards, progress, and extraction in real time

### Modern UI/UX

- **Navy Blue & Cyan Theme** - Gradient design with glow effects
- **Fullscreen Layout** - Optimized for productivity
- **React + TailwindCSS** - Responsive and modern interface
- **Live Episode Monitoring** - Watch scraper progress in real time

### Developer Experience

- **FastAPI Backend** - High-performance async Python API
- **TypeScript Frontend** - Type-safe React application
- **Docker Ready** - Multi-stage builds with optimized images
- **Comprehensive Testing** - End-to-end test scripts included
- **Plugin System** - Extensible architecture with plugin support

## Quick Start

### Prerequisites

- Python 3.11+
- Node.js 20+
- Docker (optional, but recommended)
- At least one AI provider API key (OpenAI, Anthropic, Google, Groq, or NVIDIA)

### Docker (Recommended)

```bash
# Clone the repository
git clone https://github.com/NeerajCodz/scrapeRL.git
cd scrapeRL

# Copy and configure environment
cp .env.example .env
# Edit .env and add your API keys

# Build and run
docker-compose up --build
```

Access the app at **http://localhost:7860**

### Local Development

**Backend:**

```bash
cd backend
pip install -r requirements.txt

# Copy environment file
cp ../.env.example ../.env
# Add your API keys to .env

# Run server
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
```

**Frontend:**

```bash
cd frontend
npm install
npm run dev
```

The frontend is served at **http://localhost:5173**

## API Endpoints

### Core Endpoints

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/health` | Health check and system status |
| POST | `/api/episode/reset` | Create a new scraping episode |
| POST | `/api/episode/step` | Execute an action in an episode |
| GET | `/api/episode/state/{episode_id}` | Get current episode state |

### Scrape Streaming Endpoints

| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | `/api/scrape/stream` | Run scrape with SSE live events (`init`, `url_start`, `step`, `url_complete`, `complete`) |
| POST | `/api/scrape/` | Start scrape in background and return `session_id` |
| GET | `/api/scrape/{session_id}/status` | Session status, reward, steps, plugin info |
| GET | `/api/scrape/{session_id}/result` | Final formatted output (json/csv/markdown/text) |
| GET | `/api/scrape/sessions` | List active scrape sessions |
| DELETE | `/api/scrape/{session_id}` | Cancel running scrape session |

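A client of `/api/scrape/stream` has to split the response into Server-Sent Events. The following is a minimal sketch of a parser for the standard `event:`/`data:` line format. It assumes every event carries a JSON `data` payload, which the table above does not confirm; the `parse_sse` name and the sample payload fields are illustrative only.

```python
import json
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[tuple[str, object]]:
    """Yield (event_name, payload) pairs from an iterable of SSE text lines.

    Assumption: each `data:` field holds JSON, and events end with a blank line.
    """
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data:
                yield event, json.loads("\n".join(data))
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment or keep-alive line
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].lstrip(" "))
    if data:
        # Flush a trailing event that lacks a final blank line.
        yield event, json.loads("\n".join(data))


# Hand-written stream; the payload fields are made up for illustration.
sample = [
    "event: init\n",
    'data: {"session_id": "abc"}\n',
    "\n",
    "event: step\n",
    'data: {"step": 1, "reward": 0.5}\n',
    "\n",
]
events = list(parse_sse(sample))
```

In practice the `lines` argument would be the decoded lines of the streaming HTTP response rather than a fixed list.
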
#### Scrape plugin capabilities

- Query assets can be discovered via `mcp-search` (non-URL asset text -> resolved links).
- Python sandbox analysis plugins:
  - `mcp-python-sandbox`
  - `proc-python`
  - `proc-pandas`
  - `proc-numpy`
  - `proc-bs4`
- Optional request field: `python_code` (sandboxed, validated code; must assign `result`).
- Sandbox execution is isolated per request and cleaned up after each run.

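Because `python_code` must assign `result`, it can be worth checking a snippet locally before sending it. The sketch below uses the standard `ast` module to confirm that a simple top-level assignment to `result` exists. It is only a client-side pre-check and does not reproduce the server's validation rules; the `assigns_result` helper is hypothetical, and request fields other than `python_code` are left out because they are not documented here.

```python
import ast


def assigns_result(code: str) -> bool:
    """Return True if `code` parses and assigns the name `result` at top level.

    Only plain (`result = ...`) and annotated (`result: T = ...`) assignments
    are recognised; tuple unpacking and nested scopes are ignored.
    """
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return False
    for node in tree.body:
        if isinstance(node, ast.Assign):
            targets = node.targets
        elif isinstance(node, ast.AnnAssign) and node.value is not None:
            targets = [node.target]
        else:
            continue
        for target in targets:
            if isinstance(target, ast.Name) and target.id == "result":
                return True
    return False


snippet = "values = [1, 2, 3]\nresult = sum(values) / len(values)"
payload = {
    "python_code": snippet,  # field name taken from the list above
    # Other request fields are omitted: this README does not specify them.
}
```
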
### AI Provider Endpoints

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/providers` | List all configured AI providers |
| GET | `/api/providers/{name}` | Get specific provider details |
| GET | `/api/providers/models/all` | List all available models |
| GET | `/api/providers/costs/summary` | Get cost tracking summary |

### WebSocket Endpoints

| Type | Endpoint | Description |
|------|----------|-------------|
| WS | `/ws/episode/{episode_id}` | Real-time episode/session updates |

### Other Endpoints

- `/api/tasks` - Task management
- `/api/agents` - Agent configuration
- `/api/tools` - MCP tools registry
- `/api/memory` - Memory management
- `/api/plugins` - Plugin system
- `/api/settings` - System settings

## Architecture

```
scrapeRL/
├── backend/
│   ├── app/
│   │   ├── main.py              # FastAPI app entry
│   │   ├── config.py            # Configuration management
│   │   ├── api/
│   │   │   └── routes/          # API endpoints
│   │   │       ├── episode.py   # Episode management
│   │   │       ├── providers.py # AI provider APIs
│   │   │       ├── websocket.py # Real-time updates
│   │   │       └── ...
│   │   ├── core/
│   │   │   ├── env.py           # RL environment
│   │   │   ├── reward.py        # Reward engine
│   │   │   ├── embeddings.py    # Embeddings service
│   │   │   └── ...
│   │   ├── agents/
│   │   │   ├── coordinator.py   # Agent orchestration
│   │   │   ├── planner.py       # Planning agent
│   │   │   ├── extractor.py     # Extraction agent
│   │   │   └── navigator.py     # Navigation agent
│   │   ├── models/
│   │   │   ├── router.py        # Smart model router
│   │   │   └── providers/       # AI provider implementations
│   │   │       ├── openai.py    # OpenAI GPT-4
│   │   │       ├── anthropic.py # Claude 3.5 Sonnet
│   │   │       ├── google.py    # Gemini 2.5/2.0/3.0
│   │   │       ├── groq.py      # Llama 3.3, Mixtral
│   │   │       └── nvidia.py    # DeepSeek, Nemotron
│   │   ├── memory/              # Memory system
│   │   ├── tools/               # MCP tools
│   │   ├── plugins/             # Sandboxed plugin executors
│   │   └── types/               # Type definitions
│   └── requirements.txt
├── frontend/
│   ├── src/
│   │   ├── components/          # React components
│   │   ├── hooks/
│   │   │   ├── useWebSocket.ts       # WebSocket hook
│   │   │   └── useEpisodeProgress.ts # Episode tracking
│   │   ├── api/                 # API clients
│   │   ├── types/               # TypeScript types
│   │   └── index.css            # Navy/cyan theme
│   └── package.json
├── Dockerfile                   # Multi-stage build
├── docker-compose.yml           # Local development
├── .env.example                 # Environment template
└── README.md
```

## Configuration

Create a `.env` file in the root directory (see `.env.example` for a template).

### AI Provider API Keys (Optional - at least one recommended)

| Variable | Description | Provider |
|----------|-------------|----------|
| `OPENAI_API_KEY` | OpenAI API key | GPT-4o, GPT-4o-mini, O1 |
| `ANTHROPIC_API_KEY` | Anthropic API key | Claude 3.5 Sonnet, Haiku, Opus |
| `GOOGLE_API_KEY` | Google AI API key | Gemini 2.5 Pro/Flash, Gemini 2.0, Gemini 3.0 |
| `GROQ_API_KEY` | Groq API key | Llama 3.3 70B, Llama 3.2 Vision, Mixtral, Gemma2 |
| `NVIDIA_API_KEY` | NVIDIA API key | DeepSeek R1/V3.2, Nemotron 70B, Llama 3.3 70B |

### HuggingFace (Optional)

| Variable | Description |
|----------|-------------|
| `HF_TOKEN` | HuggingFace token for model access |

### App Settings

| Variable | Default | Description |
|----------|---------|-------------|
| `DEBUG` | `false` | Enable debug mode |
| `LOG_LEVEL` | `INFO` | Logging level (DEBUG, INFO, WARN, ERROR) |
| `HOST` | `0.0.0.0` | Server host |
| `PORT` | `8000` | Server port |

### CORS Settings

| Variable | Default | Description |
|----------|---------|-------------|
| `CORS_ORIGINS` | `["http://localhost:5173"]` | Allowed CORS origins |

### Session & Memory

| Variable | Default | Description |
|----------|---------|-------------|
| `SESSION_TIMEOUT` | `3600` | Session timeout in seconds |
| `MEMORY_TTL` | `86400` | Memory TTL in seconds |

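One way to read the variables above with their documented defaults is sketched here using only the standard library. The `Settings` dataclass and `load_settings` function are hypothetical and are not the project's `config.py`. The sketch also assumes that `CORS_ORIGINS` is supplied as a JSON array, matching the default shown in the table.

```python
import json
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Defaults mirror the tables above; the class itself is illustrative."""
    debug: bool = False
    log_level: str = "INFO"
    host: str = "0.0.0.0"
    port: int = 8000
    cors_origins: tuple[str, ...] = ("http://localhost:5173",)
    session_timeout: int = 3600  # seconds
    memory_ttl: int = 86400      # seconds


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from an environment mapping, falling back to defaults."""
    defaults = Settings()
    # Assumption: CORS_ORIGINS, when set, is a JSON array of origin strings.
    if "CORS_ORIGINS" in env:
        origins = tuple(json.loads(env["CORS_ORIGINS"]))
    else:
        origins = defaults.cors_origins
    return Settings(
        debug=env.get("DEBUG", "false").strip().lower() in {"1", "true", "yes"},
        log_level=env.get("LOG_LEVEL", defaults.log_level).upper(),
        host=env.get("HOST", defaults.host),
        port=int(env.get("PORT", defaults.port)),
        cors_origins=origins,
        session_timeout=int(env.get("SESSION_TIMEOUT", defaults.session_timeout)),
        memory_ttl=int(env.get("MEMORY_TTL", defaults.memory_ttl)),
    )


# Example: override two values and keep the remaining defaults.
example = load_settings({"DEBUG": "true", "PORT": "9000"})
```
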
## Testing

Run the end-to-end test script:

```bash
cd backend
python test_scraper.py
```

The script will:

1. Create a scraping episode
2. Execute navigation and extraction actions
3. Track rewards and progress
4. Verify WebSocket connectivity
5. Display final results

Expected output:

```
✓ Episode created: <uuid>
✓ Action executed successfully
  Reward: 0.65
  Progress: 0.0%
✓ Final state retrieved
  Steps: 3
  Total reward: 2.26
```

## Deployment

### HuggingFace Spaces

This app is configured for HuggingFace Spaces with the Docker SDK:

- Port: 7860
- Health check: `/api/health`
- Auto-builds on push
- Multi-stage build for optimized image size

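After a deploy, a small readiness poll against the health endpoint can confirm that the container is up. The sketch below treats an HTTP 200 from `/api/health` as healthy and does not inspect the response body, since its schema is not documented here. Both function names and the retry parameters are illustrative.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_health_check(base_url: str = "http://localhost:7860") -> bool:
    """Return True if GET {base_url}/api/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/health", timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or an HTTP error status.
        return False


def wait_until_healthy(check: Callable[[], bool],
                       attempts: int = 10,
                       delay: float = 1.0) -> bool:
    """Call `check` up to `attempts` times, sleeping `delay` seconds between failures."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

Calling `wait_until_healthy(http_health_check)` polls the default local address; pass a lambda to target another base URL.
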
### Manual Docker

```bash
# Run frontend + backend together
docker compose up --build
```

After startup:

- Frontend: `http://localhost:3000`
- Backend API: `http://localhost:8000/api`

### Environment Variables in Production

Set all required environment variables in your deployment platform:

- HuggingFace Spaces: Settings → Repository secrets
- Docker: Use `--env-file` or the `environment` section in docker-compose
- Kubernetes: ConfigMaps and Secrets

## Usage Examples

### Example 1: Simple Scraping Task

```bash
curl -X POST http://localhost:8000/api/episode/reset \
  -H "Content-Type: application/json" \
  -d '{
    "task_id": "scrape-quotes",
    "config": {
      "start_url": "http://quotes.toscrape.com",
      "target_fields": {
        "quotes": {"text": "quote text", "author": "author name"}
      },
      "max_steps": 20
    }
  }'
```
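
The same request can be prepared from Python with the standard library. This sketch only builds the request object; `build_reset_request` is an illustrative helper, and the response schema is not shown because it is not documented in this README.

```python
import json
import urllib.request


def build_reset_request(base_url: str, task_id: str, config: dict) -> urllib.request.Request:
    """Build (but do not send) a POST to /api/episode/reset with a JSON body."""
    body = json.dumps({"task_id": task_id, "config": config}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/episode/reset",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_reset_request(
    "http://localhost:8000",
    "scrape-quotes",
    {
        "start_url": "http://quotes.toscrape.com",
        "target_fields": {"quotes": {"text": "quote text", "author": "author name"}},
        "max_steps": 20,
    },
)

# To send it against a running backend:
#   with urllib.request.urlopen(request) as resp:
#       episode = json.load(resp)
```
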
### Example 2: WebSocket Connection

```javascript
// Frontend JavaScript
const ws = new WebSocket('ws://localhost:8000/ws/episode/<episode_id>');

ws.onmessage = (event) => {
  const message = JSON.parse(event.data);

  if (message.type === 'progress') {
    console.log(`Step ${message.step}: ${message.action_type}`);
    console.log(`Reward: ${message.reward}, Progress: ${message.progress}%`);
  }

  if (message.type === 'completion') {
    console.log(`Episode completed! Success: ${message.success}`);
    console.log(`Total reward: ${message.total_reward}`);
  }
};
```
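
For a Python consumer, the message handling above can be expressed as a small pure function. The sketch assumes the same message fields that the JavaScript example reads (`type`, `step`, `action_type`, `reward`, `progress`, `success`, `total_reward`). `describe_message` is a hypothetical helper, the `navigate` action type in the sample is made up, and the transport (any WebSocket client library) is left out.

```python
import json


def describe_message(raw: str) -> list[str]:
    """Turn one WebSocket text frame into the log lines the JS example prints."""
    message = json.loads(raw)
    kind = message.get("type")
    if kind == "progress":
        return [
            f"Step {message['step']}: {message['action_type']}",
            f"Reward: {message['reward']}, Progress: {message['progress']}%",
        ]
    if kind == "completion":
        return [
            f"Episode completed! Success: {message['success']}",
            f"Total reward: {message['total_reward']}",
        ]
    return []  # message types not covered by the example are ignored


# Sample frame with illustrative values.
lines = describe_message(
    '{"type": "progress", "step": 1, "action_type": "navigate", '
    '"reward": 0.65, "progress": 10}'
)
```
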
## Contributing

Contributions are welcome! This project follows conventional commit messages:

- `feat:` - New features
- `fix:` - Bug fixes
- `chore:` - Maintenance tasks
- `docs:` - Documentation updates
- `test:` - Test additions/updates

## License

MIT License - see [LICENSE](LICENSE) for details.

## Acknowledgments

- Built with FastAPI, React, TailwindCSS
- Powered by OpenAI, Anthropic, Google, Groq, and NVIDIA AI models
- Inspired by reinforcement learning research in web automation