---
title: ScrapeRL
emoji: 🌖
colorFrom: blue
colorTo: gray
sdk: docker
pinned: false
---
# ScrapeRL 🌖
**AI-Powered Web Scraping with Reinforcement Learning**
A web scraping system that uses reinforcement learning and multi-agent coordination to extract data from websites. It supports multiple AI providers (OpenAI, Anthropic, Google Gemini, Groq, NVIDIA) and includes an embeddings service, real-time WebSocket updates, and a navy blue/cyan themed UI.
## ✨ Key Features
### 🤖 AI & Machine Learning
- **Multi-LLM Support** - OpenAI, Anthropic (Claude), Google (Gemini 2.5/2.0/3.0), Groq (Llama 3.3, Mixtral, Gemma2), NVIDIA (DeepSeek, Nemotron, Llama 3.3)
- **Smart Model Router** - Automatic selection of optimal model based on task type (code, reasoning, extraction, etc.)
- **Embeddings Service** - Semantic search with OpenAI and Google embeddings, in-memory caching
- **RL-Powered Scraping** - Reinforcement learning agents that learn optimal extraction strategies
- **Multi-Agent System** - Coordinated planner, extractor, and navigator agents
### ⚡ Real-Time Features
- **WebSocket Support** - Live progress updates during scraping episodes
- **Session-Based** - Clean slate on each session, no persistent rewards
- **Real-Time Metrics** - Track rewards, progress, and extraction in real time
### 🎨 Modern UI/UX
- **Navy Blue & Cyan Theme** - Beautiful gradient design with glow effects
- **Fullscreen Layout** - Optimized for productivity
- **React + TailwindCSS** - Responsive and modern interface
- **Live Episode Monitoring** - Watch scraper progress in real time
### 🔧 Developer Experience
- **FastAPI Backend** - High-performance async Python API
- **TypeScript Frontend** - Type-safe React application
- **Docker Ready** - Multi-stage builds with optimized images
- **Comprehensive Testing** - End-to-end test scripts included
- **Plugin System** - Extensible architecture with plugin support
## 🚀 Quick Start
### Prerequisites
- Python 3.11+
- Node.js 20+
- Docker (optional, but recommended)
- At least one AI provider API key (OpenAI, Anthropic, Google, Groq, or NVIDIA)
### Docker (Recommended)
```bash
# Clone the repository
git clone https://github.com/NeerajCodz/scrapeRL.git
cd scrapeRL
# Copy and configure environment
cp .env.example .env
# Edit .env and add your API keys
# Build and run
docker-compose up --build
```
Access the app at **http://localhost:7860**
### Local Development
**Backend:**
```bash
cd backend
pip install -r requirements.txt
# Copy environment file
cp ../.env.example ../.env
# Add your API keys to .env
# Run server
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
```
**Frontend:**
```bash
cd frontend
npm install
npm run dev
```
The frontend will be available at **http://localhost:5173**
## 📡 API Endpoints
### Core Endpoints
| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/health` | Health check and system status |
| POST | `/api/episode/reset` | Create a new scraping episode |
| POST | `/api/episode/step` | Execute an action in an episode |
| GET | `/api/episode/state/{episode_id}` | Get current episode state |
### Scrape Streaming Endpoints
| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | `/api/scrape/stream` | Run scrape with SSE live events (`init`, `url_start`, `step`, `url_complete`, `complete`) |
| POST | `/api/scrape/` | Start scrape in background and return `session_id` |
| GET | `/api/scrape/{session_id}/status` | Session status, reward, steps, plugin info |
| GET | `/api/scrape/{session_id}/result` | Final formatted output (json/csv/markdown/text) |
| GET | `/api/scrape/sessions` | List active scrape sessions |
| DELETE | `/api/scrape/{session_id}` | Cancel running scrape session |
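A client consuming `/api/scrape/stream` has to split the response into individual events. The sketch below parses standard Server-Sent Events framing into `(event, data)` pairs. It assumes the event names listed above (`init`, `url_start`, `step`, `url_complete`, `complete`) arrive in the SSE `event:` field and that `data:` carries JSON; neither is confirmed by this README, and `parse_sse` and the sample payloads are hypothetical.

```python
import json
from typing import Any, Iterable, Iterator, Tuple


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, Any]]:
    """Yield (event_name, data) pairs from an iterable of SSE text lines.

    Data is JSON-decoded when possible and left as a string otherwise.
    """
    event, data = "message", []  # "message" is the SSE default event name
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            # A blank line ends the current event.
            if data:
                payload = "\n".join(data)
                try:
                    decoded = json.loads(payload)
                except json.JSONDecodeError:
                    decoded = payload
                yield event, decoded
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment or keep-alive line
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # The SSE format strips one optional space after the colon.
            data.append(value[1:] if value.startswith(" ") else value)


# Hypothetical stream content, used only to exercise the parser.
sample = [
    "event: init\n",
    'data: {"session_id": "demo"}\n',
    "\n",
    ": keep-alive\n",
    "event: complete\n",
    'data: {"ok": true}\n',
    "\n",
]
for name, data in parse_sse(sample):
    print(name, data)
```

Any HTTP client that can iterate over response lines could feed this function; the parser itself performs no network I/O.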
#### Scrape plugin capabilities
- Non-URL assets (plain query text) can be resolved to links via `mcp-search`.
- Python sandbox analysis plugins:
  - `mcp-python-sandbox`
  - `proc-python`
  - `proc-pandas`
  - `proc-numpy`
  - `proc-bs4`
- Optional request field: `python_code` (sandboxed, validated code that must assign `result`).
- Sandbox execution is isolated per request and cleaned up after each run.
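Because `python_code` must assign `result`, it can be useful to check a snippet on the client side before sending it. The following is a minimal sketch of such a pre-check using the standard `ast` module. It is not the project's own validator, whose rules are not documented here; `assigns_result` is a hypothetical helper, and it only looks at top-level assignments.

```python
import ast


def assigns_result(code: str) -> bool:
    """Return True if `code` parses and binds the name `result` at top level.

    Only plain and annotated assignments in the module body are considered;
    assignments nested inside `if`, loops, or functions are ignored.
    """
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return False
    for node in tree.body:
        if isinstance(node, ast.Assign):
            targets = node.targets
        elif isinstance(node, ast.AnnAssign) and node.value is not None:
            targets = [node.target]
        else:
            continue
        for target in targets:
            # Walk the target so tuple unpacking such as `x, result = ...` counts,
            # while `result[0] = ...` (where `result` is only read) does not.
            for name in ast.walk(target):
                if (
                    isinstance(name, ast.Name)
                    and name.id == "result"
                    and isinstance(name.ctx, ast.Store)
                ):
                    return True
    return False


print(assigns_result("import math\nresult = math.sqrt(16)"))
print(assigns_result("print('no result here')"))
```

A check like this only catches the most obvious mistake early; the server-side sandbox remains the authority on what code is accepted.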
### AI Provider Endpoints
| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/providers` | List all configured AI providers |
| GET | `/api/providers/{name}` | Get specific provider details |
| GET | `/api/providers/models/all` | List all available models |
| GET | `/api/providers/costs/summary` | Get cost tracking summary |
### WebSocket Endpoints
| Type | Endpoint | Description |
|------|----------|-------------|
| WS | `/ws/episode/{episode_id}` | Real-time episode/session updates |
### Other Endpoints
- `/api/tasks` - Task management
- `/api/agents` - Agent configuration
- `/api/tools` - MCP tools registry
- `/api/memory` - Memory management
- `/api/plugins` - Plugin system
- `/api/settings` - System settings
## 🏗️ Architecture
```
scrapeRL/
├── backend/
│   ├── app/
│   │   ├── main.py              # FastAPI app entry
│   │   ├── config.py            # Configuration management
│   │   ├── api/
│   │   │   └── routes/          # API endpoints
│   │   │       ├── episode.py   # Episode management
│   │   │       ├── providers.py # AI provider APIs
│   │   │       ├── websocket.py # Real-time updates
│   │   │       └── ...
│   │   ├── core/
│   │   │   ├── env.py           # RL environment
│   │   │   ├── reward.py        # Reward engine
│   │   │   ├── embeddings.py    # Embeddings service
│   │   │   └── ...
│   │   ├── agents/
│   │   │   ├── coordinator.py   # Agent orchestration
│   │   │   ├── planner.py       # Planning agent
│   │   │   ├── extractor.py     # Extraction agent
│   │   │   └── navigator.py     # Navigation agent
│   │   ├── models/
│   │   │   ├── router.py        # Smart model router
│   │   │   └── providers/       # AI provider implementations
│   │   │       ├── openai.py    # OpenAI GPT-4
│   │   │       ├── anthropic.py # Claude 3.5 Sonnet
│   │   │       ├── google.py    # Gemini 2.5/2.0/3.0
│   │   │       ├── groq.py      # Llama 3.3, Mixtral
│   │   │       └── nvidia.py    # DeepSeek, Nemotron
│   │   ├── memory/              # Memory system
│   │   ├── tools/               # MCP tools
│   │   ├── plugins/             # Sandboxed plugin executors
│   │   └── types/               # Type definitions
│   └── requirements.txt
├── frontend/
│   ├── src/
│   │   ├── components/          # React components
│   │   ├── hooks/
│   │   │   ├── useWebSocket.ts  # WebSocket hook
│   │   │   └── useEpisodeProgress.ts # Episode tracking
│   │   ├── api/                 # API clients
│   │   ├── types/               # TypeScript types
│   │   └── index.css            # Navy/cyan theme
│   └── package.json
├── Dockerfile                   # Multi-stage build
├── docker-compose.yml           # Local development
├── .env.example                 # Environment template
└── README.md
```
## ⚙️ Configuration
Create a `.env` file in the root directory (see `.env.example` for template):
### AI Provider API Keys (Optional - at least one recommended)
| Variable | Description | Models |
|----------|-------------|----------|
| `OPENAI_API_KEY` | OpenAI API key | GPT-4o, GPT-4o-mini, O1 |
| `ANTHROPIC_API_KEY` | Anthropic API key | Claude 3.5 Sonnet, Haiku, Opus |
| `GOOGLE_API_KEY` | Google AI API key | Gemini 2.5 Pro/Flash, Gemini 2.0, Gemini 3.0 |
| `GROQ_API_KEY` | Groq API key | Llama 3.3 70B, Llama 3.2 Vision, Mixtral, Gemma2 |
| `NVIDIA_API_KEY` | NVIDIA API key | DeepSeek R1/V3.2, Nemotron 70B, Llama 3.3 70B |
### HuggingFace (Optional)
| Variable | Description |
|----------|-------------|
| `HF_TOKEN` | HuggingFace token for model access |
### App Settings
| Variable | Default | Description |
|----------|---------|-------------|
| `DEBUG` | `false` | Enable debug mode |
| `LOG_LEVEL` | `INFO` | Logging level (DEBUG, INFO, WARN, ERROR) |
| `HOST` | `0.0.0.0` | Server host |
| `PORT` | `8000` | Server port |
### CORS Settings
| Variable | Default | Description |
|----------|---------|-------------|
| `CORS_ORIGINS` | `["http://localhost:5173"]` | Allowed CORS origins |
### Session & Memory
| Variable | Default | Description |
|----------|---------|-------------|
| `SESSION_TIMEOUT` | `3600` | Session timeout in seconds |
| `MEMORY_TTL` | `86400` | Memory TTL in seconds |
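To make the documented defaults concrete, here is a minimal sketch of how the variables above could be read from the environment. It is not the project's `config.py`; `Settings` and `load_settings` are hypothetical names, the set of strings treated as "true" is an assumption, and parsing `CORS_ORIGINS` as a JSON list is inferred only from the format of its default value.

```python
import json
import os
from dataclasses import dataclass, field
from typing import List, Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Defaults copied from the tables above."""
    debug: bool = False
    log_level: str = "INFO"
    host: str = "0.0.0.0"
    port: int = 8000
    cors_origins: List[str] = field(
        default_factory=lambda: ["http://localhost:5173"]
    )
    session_timeout: int = 3600  # seconds
    memory_ttl: int = 86400      # seconds


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from `env` (defaults to os.environ), falling back to defaults."""
    env = os.environ if env is None else env
    defaults = Settings()
    if "CORS_ORIGINS" in env:
        cors_origins = json.loads(env["CORS_ORIGINS"])  # assumed to be a JSON list
    else:
        cors_origins = defaults.cors_origins
    return Settings(
        # Assumption: these strings count as "true"; anything else is false.
        debug=env.get("DEBUG", "false").strip().lower() in {"1", "true", "yes"},
        log_level=env.get("LOG_LEVEL", defaults.log_level).upper(),
        host=env.get("HOST", defaults.host),
        port=int(env.get("PORT", defaults.port)),
        cors_origins=cors_origins,
        session_timeout=int(env.get("SESSION_TIMEOUT", defaults.session_timeout)),
        memory_ttl=int(env.get("MEMORY_TTL", defaults.memory_ttl)),
    )


# Example with an explicit mapping instead of the real process environment.
print(load_settings({"DEBUG": "true", "PORT": "7860"}))
```

Passing an explicit mapping keeps the loader easy to test without touching the real environment.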
## 🧪 Testing
Run the end-to-end test script:
```bash
cd backend
python test_scraper.py
```
This will:
1. Create a scraping episode
2. Execute navigation and extraction actions
3. Track rewards and progress
4. Verify WebSocket connectivity
5. Display final results
Expected output:
```
✓ Episode created: <uuid>
✓ Action executed successfully
Reward: 0.65
Progress: 0.0%
✓ Final state retrieved
Steps: 3
Total reward: 2.26
```
## 🚀 Deployment
### HuggingFace Spaces
This app is configured for HuggingFace Spaces with the Docker SDK:
- Port: 7860
- Health check: `/api/health`
- Auto-builds on push
- Multi-stage build for optimized image size
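After a deploy, a script can poll the health check endpoint listed above until the app is ready. The sketch below shows one way to do that with the standard library. It assumes that an HTTP 200 from `/api/health` means the service is up, which this README does not spell out; `is_healthy` and `wait_until_healthy` are hypothetical helpers.

```python
import time
import urllib.request
from typing import Callable


def is_healthy(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if GET {base_url}/api/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers URLError, HTTPError, timeouts, and refused connections.
        return False


def wait_until_healthy(
    base_url: str,
    attempts: int = 10,
    delay: float = 1.0,
    probe: Callable[[str], bool] = is_healthy,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe` up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe(base_url):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

For a local container this could be called as `wait_until_healthy("http://localhost:7860")`; the `probe` and `sleep` parameters exist so the retry loop can be exercised without a running server.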
### Manual Docker
```bash
# Run frontend + backend together
docker compose up --build
```
After startup:
- Frontend: `http://localhost:3000`
- Backend API: `http://localhost:8000/api`
### Environment Variables in Production
Set all required environment variables in your deployment platform:
- HuggingFace Spaces: Settings → Repository secrets
- Docker: Use `--env-file` or environment section in docker-compose
- Kubernetes: ConfigMaps and Secrets
## 🎯 Usage Examples
### Example 1: Simple Scraping Task
```bash
curl -X POST http://localhost:8000/api/episode/reset \
  -H "Content-Type: application/json" \
  -d '{
    "task_id": "scrape-quotes",
    "config": {
      "start_url": "http://quotes.toscrape.com",
      "target_fields": {
        "quotes": {"text": "quote text", "author": "author name"}
      },
      "max_steps": 20
    }
  }'
```
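The same request can be built from Python with only the standard library. This is a sketch under the assumption that the backend runs at `http://localhost:8000` as in the local development setup; `build_reset_request` and `reset_episode` are hypothetical helper names, and the response is returned as decoded JSON because its exact shape is not documented here.

```python
import json
import urllib.request
from typing import Any, Dict

BASE_URL = "http://localhost:8000"  # assumed local backend address


def build_reset_request(
    task_id: str,
    start_url: str,
    target_fields: Dict[str, Any],
    max_steps: int = 20,
    base_url: str = BASE_URL,
) -> urllib.request.Request:
    """Build (but do not send) the POST /api/episode/reset request."""
    payload = {
        "task_id": task_id,
        "config": {
            "start_url": start_url,
            "target_fields": target_fields,
            "max_steps": max_steps,
        },
    }
    return urllib.request.Request(
        url=f"{base_url}/api/episode/reset",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def reset_episode(request: urllib.request.Request, timeout: float = 30.0) -> Any:
    """Send the request and return the decoded JSON body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Same payload as the curl command above; nothing is sent until reset_episode() is called.
request = build_reset_request(
    task_id="scrape-quotes",
    start_url="http://quotes.toscrape.com",
    target_fields={"quotes": {"text": "quote text", "author": "author name"}},
)
print(request.get_method(), request.full_url)
```

Splitting request construction from sending makes the payload easy to inspect or unit-test before the backend is running.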
### Example 2: WebSocket Connection
```javascript
// Frontend JavaScript
const ws = new WebSocket('ws://localhost:8000/ws/episode/<episode_id>');
ws.onmessage = (event) => {
  const message = JSON.parse(event.data);
  if (message.type === 'progress') {
    console.log(`Step ${message.step}: ${message.action_type}`);
    console.log(`Reward: ${message.reward}, Progress: ${message.progress}%`);
  }
  if (message.type === 'completion') {
    console.log(`Episode completed! Success: ${message.success}`);
    console.log(`Total reward: ${message.total_reward}`);
  }
};
```
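For scripts or tests written in Python, the message handling above can be mirrored as a pure function that any WebSocket client library could call for each received text frame. This sketch relies only on the message fields used in the JavaScript example (`type`, `step`, `action_type`, `reward`, `progress`, `success`, `total_reward`); `describe_message` is a hypothetical helper and the sample message values are made up.

```python
import json
from typing import List


def describe_message(raw: str) -> List[str]:
    """Turn one WebSocket text frame into human-readable log lines."""
    message = json.loads(raw)
    kind = message.get("type")
    if kind == "progress":
        return [
            f"Step {message['step']}: {message['action_type']}",
            f"Reward: {message['reward']}, Progress: {message['progress']}%",
        ]
    if kind == "completion":
        return [
            f"Episode completed! Success: {message['success']}",
            f"Total reward: {message['total_reward']}",
        ]
    return []  # ignore message types this sketch does not know about


# Made-up frame, used only to exercise the handler.
frame = json.dumps(
    {"type": "progress", "step": 1, "action_type": "navigate", "reward": 0.65, "progress": 10}
)
for line in describe_message(frame):
    print(line)
```

Keeping the handler free of network code means it can be tested with plain strings and reused regardless of which WebSocket client is chosen.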
## 🤝 Contributing
Contributions are welcome! This project uses conventional commit message prefixes:
- `feat:` - New features
- `fix:` - Bug fixes
- `chore:` - Maintenance tasks
- `docs:` - Documentation updates
- `test:` - Test additions/updates
## 📄 License
MIT License - see [LICENSE](LICENSE) for details.
## 🙏 Acknowledgments
- Built with FastAPI, React, TailwindCSS
- Powered by OpenAI, Anthropic, Google, Groq, and NVIDIA AI models
- Inspired by reinforcement learning research in web automation