	---
	title: ScrapeRL
	emoji: 🌖
	colorFrom: blue
	colorTo: gray
	sdk: docker
	pinned: false
	---

	# ScrapeRL 🌖

	AI-Powered Web Scraping with Reinforcement Learning

	ScrapeRL is a web scraping system that uses reinforcement learning and multi-agent coordination to extract data from websites. It supports multiple AI providers (OpenAI, Anthropic, Google Gemini, Groq, NVIDIA), embeddings, real-time WebSocket updates, and a navy blue/cyan themed UI.

	## ✨ Key Features

	### 🤖 AI & Machine Learning
	- Multi-LLM Support - OpenAI, Anthropic (Claude), Google (Gemini 2.5/2.0/3.0), Groq (Llama 3.3, Mixtral, Gemma2), NVIDIA (DeepSeek, Nemotron, Llama 3.3)
	- Smart Model Router - Automatic selection of optimal model based on task type (code, reasoning, extraction, etc.)
	- Embeddings Service - Semantic search with OpenAI and Google embeddings, in-memory caching
	- RL-Powered Scraping - Reinforcement learning agents that learn optimal extraction strategies
	- Multi-Agent System - Coordinated planner, extractor, and navigator agents

	### ⚡ Real-Time Features
	- WebSocket Support - Live progress updates during scraping episodes
	- Session-Based - Clean slate on each session, no persistent rewards
	- Real-Time Metrics - Track rewards, progress, and extraction results in real time

	### 🎨 Modern UI/UX
	- Navy Blue & Cyan Theme - Beautiful gradient design with glow effects
	- Fullscreen Layout - Optimized for productivity
	- React + TailwindCSS - Responsive and modern interface
	- Live Episode Monitoring - Watch scraper progress in real time

	### 🔧 Developer Experience
	- FastAPI Backend - High-performance async Python API
	- TypeScript Frontend - Type-safe React application
	- Docker Ready - Multi-stage builds with optimized images
	- Comprehensive Testing - End-to-end test scripts included
	- Plugin System - Extensible architecture with plugin support

	## 🚀 Quick Start

	### Prerequisites
	- Python 3.11+
	- Node.js 20+
	- Docker (optional, but recommended)
	- At least one AI provider API key (OpenAI, Anthropic, Google, Groq, or NVIDIA)

	### Docker (Recommended)

	```bash
	# Clone the repository
	git clone https://github.com/NeerajCodz/scrapeRL.git
	cd scrapeRL

	# Copy and configure environment
	cp .env.example .env
	# Edit .env and add your API keys

	# Build and run
	docker-compose up --build
	```

	Access the app at http://localhost:7860

	### Local Development

	Backend:
	```bash
	cd backend
	pip install -r requirements.txt

	# Copy environment file
	cp ../.env.example ../.env
	# Add your API keys to .env

	# Run server
	uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
	```

	Frontend:
	```bash
	cd frontend
	npm install
	npm run dev
	```

	Frontend will be at http://localhost:5173

	## 📡 API Endpoints

	### Core Endpoints
	| Method | Endpoint | Description |
	|--------|----------|-------------|
	| GET | `/api/health` | Health check and system status |
	| POST | `/api/episode/reset` | Create a new scraping episode |
	| POST | `/api/episode/step` | Execute an action in an episode |
	| GET | `/api/episode/state/{episode_id}` | Get current episode state |

	### Scrape Streaming Endpoints
	| Method | Endpoint | Description |
	|--------|----------|-------------|
	| POST | `/api/scrape/stream` | Run scrape with SSE live events (`init`, `url_start`, `step`, `url_complete`, `complete`) |
	| POST | `/api/scrape/` | Start scrape in background and return `session_id` |
	| GET | `/api/scrape/{session_id}/status` | Session status, reward, steps, plugin info |
	| GET | `/api/scrape/{session_id}/result` | Final formatted output (json/csv/markdown/text) |
	| GET | `/api/scrape/sessions` | List active scrape sessions |
	| DELETE | `/api/scrape/{session_id}` | Cancel running scrape session |
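As a rough illustration of consuming `/api/scrape/stream`, the sketch below parses Server-Sent Events text into `(event, payload)` pairs. It assumes the server uses standard `event:`/`data:` SSE framing with JSON payloads, which this README does not confirm. The function name `parse_sse` and the sample payload fields are hypothetical.

```python
import json
from typing import Iterable, Iterator, Tuple


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, dict]]:
    """Yield (event_name, payload) pairs from raw SSE lines.

    Assumes standard SSE framing: "event:" and "data:" fields, a blank
    line ending each event, and JSON-encoded data.
    """
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":  # a blank line terminates the current event
            if data:
                yield event, json.loads("\n".join(data))
            event, data = "message", []
        elif line.startswith(":"):
            continue  # SSE comment / keep-alive line
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].lstrip())
    if data:  # flush a final event that has no trailing blank line
        yield event, json.loads("\n".join(data))


# Hypothetical sample stream; real payload fields may differ.
sample = [
    "event: init\n",
    'data: {"session_id": "abc"}\n',
    "\n",
    "event: step\n",
    'data: {"step": 1, "reward": 0.65}\n',
    "\n",
    "event: complete\n",
    'data: {"total_reward": 2.26}\n',
    "\n",
]
events = list(parse_sse(sample))
```

In practice the lines would come from a streaming HTTP response rather than a list. Keeping the parser separate from the transport makes it easy to test offline.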

	#### Scrape plugin capabilities
	- Assets given as plain-text queries rather than URLs can be resolved to links via `mcp-search`.
	- Python sandbox analysis plugins:
	  - `mcp-python-sandbox`
	  - `proc-python`
	  - `proc-pandas`
	  - `proc-numpy`
	  - `proc-bs4`
	- Optional request field: `python_code` (sandboxed and validated; the code must assign a variable named `result`).
	- Sandbox execution is isolated per request and cleaned up after each run.
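To make the `result` requirement concrete, here is a minimal client-side sketch that checks a snippet for a top-level assignment to `result` before sending it. This is not the server's validator, whose rules are not documented here. The helper `assigns_result` and the shape of `payload` beyond the `python_code` field are assumptions.

```python
import ast


def assigns_result(code: str) -> bool:
    """Return True if the code has a top-level assignment to `result`."""
    tree = ast.parse(code)
    for node in tree.body:
        if isinstance(node, ast.Assign):
            targets = node.targets
        elif isinstance(node, ast.AnnAssign) and node.value is not None:
            targets = [node.target]
        else:
            continue
        for target in targets:
            if isinstance(target, ast.Name) and target.id == "result":
                return True
    return False


# Example snippet that could be sent as the `python_code` field.
python_code = """
prices = [12.5, 8.0, 19.99]
result = {"count": len(prices), "max": max(prices)}
"""

# Hypothetical request body: only `python_code` is named in this README.
payload = {"python_code": python_code}
```

A pre-check like this only catches the most obvious mistake. The sandbox may still reject code for other reasons, such as disallowed imports.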

	### AI Provider Endpoints
	| Method | Endpoint | Description |
	|--------|----------|-------------|
	| GET | `/api/providers` | List all configured AI providers |
	| GET | `/api/providers/{name}` | Get specific provider details |
	| GET | `/api/providers/models/all` | List all available models |
	| GET | `/api/providers/costs/summary` | Get cost tracking summary |

	### WebSocket Endpoints
	| Type | Endpoint | Description |
	|------|----------|-------------|
	| WS | `/ws/episode/{episode_id}` | Real-time episode/session updates |

	### Other Endpoints
	- `/api/tasks` - Task management
	- `/api/agents` - Agent configuration
	- `/api/tools` - MCP tools registry
	- `/api/memory` - Memory management
	- `/api/plugins` - Plugin system
	- `/api/settings` - System settings

	## 🏗️ Architecture

	```
	scrapeRL/
	├── backend/
	│   ├── app/
	│   │   ├── main.py  # FastAPI app entry
	│   │   ├── config.py  # Configuration management
	│   │   ├── api/
	│   │   │   └── routes/  # API endpoints
	│   │   │       ├── episode.py  # Episode management
	│   │   │       ├── providers.py  # AI provider APIs
	│   │   │       ├── websocket.py  # Real-time updates
	│   │   │       └── ...
	│   │   ├── core/
	│   │   │   ├── env.py  # RL environment
	│   │   │   ├── reward.py  # Reward engine
	│   │   │   ├── embeddings.py  # Embeddings service
	│   │   │   └── ...
	│   │   ├── agents/
	│   │   │   ├── coordinator.py  # Agent orchestration
	│   │   │   ├── planner.py  # Planning agent
	│   │   │   ├── extractor.py  # Extraction agent
	│   │   │   └── navigator.py  # Navigation agent
	│   │   ├── models/
	│   │   │   ├── router.py  # Smart model router
	│   │   │   └── providers/  # AI provider implementations
	│   │   │       ├── openai.py  # OpenAI GPT-4
	│   │   │       ├── anthropic.py  # Claude 3.5 Sonnet
	│   │   │       ├── google.py  # Gemini 2.5/2.0/3.0
	│   │   │       ├── groq.py  # Llama 3.3, Mixtral
	│   │   │       └── nvidia.py  # DeepSeek, Nemotron
	│   │   ├── memory/  # Memory system
	│   │   ├── tools/  # MCP tools
	│   │   ├── plugins/  # Sandboxed plugin executors
	│   │   └── types/  # Type definitions
	│   └── requirements.txt
	├── frontend/
	│   ├── src/
	│   │   ├── components/  # React components
	│   │   ├── hooks/
	│   │   │   ├── useWebSocket.ts  # WebSocket hook
	│   │   │   └── useEpisodeProgress.ts  # Episode tracking
	│   │   ├── api/  # API clients
	│   │   ├── types/  # TypeScript types
	│   │   └── index.css  # Navy/cyan theme
	│   └── package.json
	├── Dockerfile  # Multi-stage build
	├── docker-compose.yml  # Local development
	├── .env.example  # Environment template
	└── README.md
	```

	## ⚙️ Configuration

	Create a `.env` file in the root directory (see `.env.example` for template):

	### AI Provider API Keys (Optional - at least one recommended)
	| Variable | Description | Models |
	|----------|-------------|--------|
	| `OPENAI_API_KEY` | OpenAI API key | GPT-4o, GPT-4o-mini, O1 |
	| `ANTHROPIC_API_KEY` | Anthropic API key | Claude 3.5 Sonnet, Haiku, Opus |
	| `GOOGLE_API_KEY` | Google AI API key | Gemini 2.5 Pro/Flash, Gemini 2.0, Gemini 3.0 |
	| `GROQ_API_KEY` | Groq API key | Llama 3.3 70B, Llama 3.2 Vision, Mixtral, Gemma2 |
	| `NVIDIA_API_KEY` | NVIDIA API key | DeepSeek R1/V3.2, Nemotron 70B, Llama 3.3 70B |

	### HuggingFace (Optional)
	| Variable | Description |
	|----------|-------------|
	| `HF_TOKEN` | HuggingFace token for model access |

	### App Settings
	| Variable | Default | Description |
	|----------|---------|-------------|
	| `DEBUG` | `false` | Enable debug mode |
	| `LOG_LEVEL` | `INFO` | Logging level (DEBUG, INFO, WARN, ERROR) |
	| `HOST` | `0.0.0.0` | Server host |
	| `PORT` | `8000` | Server port |

	### CORS Settings
	| Variable | Default | Description |
	|----------|---------|-------------|
	| `CORS_ORIGINS` | `["http://localhost:5173"]` | Allowed CORS origins |

	### Session & Memory
	| Variable | Default | Description |
	|----------|---------|-------------|
	| `SESSION_TIMEOUT` | `3600` | Session timeout in seconds |
	| `MEMORY_TTL` | `86400` | Memory TTL in seconds |
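The variables and defaults in the tables above could be read as in the following sketch. It is illustrative only: the project's actual `config.py` may load settings differently. The `Settings` class, the `load_settings` helper, and the set of strings accepted as true for `DEBUG` are assumptions.

```python
import json
import os
from dataclasses import dataclass
from typing import List, Mapping


@dataclass(frozen=True)
class Settings:
    debug: bool
    log_level: str
    host: str
    port: int
    cors_origins: List[str]
    session_timeout: int
    memory_ttl: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables, using the documented defaults."""
    return Settings(
        # Assumption: "1", "true", and "yes" (any case) enable debug mode.
        debug=env.get("DEBUG", "false").strip().lower() in {"1", "true", "yes"},
        log_level=env.get("LOG_LEVEL", "INFO").upper(),
        host=env.get("HOST", "0.0.0.0"),
        port=int(env.get("PORT", "8000")),
        # CORS_ORIGINS is documented as a JSON list of origins.
        cors_origins=json.loads(env.get("CORS_ORIGINS", '["http://localhost:5173"]')),
        session_timeout=int(env.get("SESSION_TIMEOUT", "3600")),
        memory_ttl=int(env.get("MEMORY_TTL", "86400")),
    )


defaults = load_settings({})  # no overrides: documented defaults apply
custom = load_settings({"DEBUG": "true", "PORT": "7860"})
```

Passing the environment as a mapping keeps the loader testable without touching the real process environment.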

	## 🧪 Testing

	Run the end-to-end test script:

	```bash
	cd backend
	python test_scraper.py
	```

	This will:
	1. Create a scraping episode
	2. Execute navigation and extraction actions
	3. Track rewards and progress
	4. Verify WebSocket connectivity
	5. Display final results

	Expected output:
	```
	✓ Episode created: <uuid>
	✓ Action executed successfully
	Reward: 0.65
	Progress: 0.0%
	✓ Final state retrieved
	Steps: 3
	Total reward: 2.26
	```

	## 🚀 Deployment

	### HuggingFace Spaces

	This app is configured for HuggingFace Spaces with Docker SDK:
	- Port: 7860
	- Health check: `/api/health`
	- Auto-builds on push
	- Multi-stage build for optimized image size

	### Manual Docker

	```bash
	# Run frontend + backend together
	docker compose up --build
	```

	After startup:
	- Frontend: `http://localhost:3000`
	- Backend API: `http://localhost:8000/api`

	### Environment Variables in Production

	Set all required environment variables in your deployment platform:
	- HuggingFace Spaces: Settings → Repository secrets
	- Docker: Use `--env-file` or environment section in docker-compose
	- Kubernetes: ConfigMaps and Secrets

	## 🎯 Usage Examples

	### Example 1: Simple Scraping Task

	```bash
	curl -X POST http://localhost:8000/api/episode/reset \
	  -H "Content-Type: application/json" \
	  -d '{
	    "task_id": "scrape-quotes",
	    "config": {
	      "start_url": "http://quotes.toscrape.com",
	      "target_fields": {
	        "quotes": {"text": "quote text", "author": "author name"}
	      },
	      "max_steps": 20
	    }
	  }'
	```
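The same request can be built from Python using only the standard library. The sketch below mirrors the curl payload above and constructs the request without sending it. The helper name `build_reset_request` is hypothetical, and the response format is not documented here.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # local backend from "Local Development"


def build_reset_request(task_id: str, start_url: str, max_steps: int = 20) -> urllib.request.Request:
    """Build (but do not send) the POST /api/episode/reset request."""
    body = {
        "task_id": task_id,
        "config": {
            "start_url": start_url,
            "target_fields": {
                "quotes": {"text": "quote text", "author": "author name"}
            },
            "max_steps": max_steps,
        },
    }
    return urllib.request.Request(
        f"{BASE_URL}/api/episode/reset",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_reset_request("scrape-quotes", "http://quotes.toscrape.com")

# To actually send it (requires the backend to be running):
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```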

	### Example 2: WebSocket Connection

	```javascript
	// Frontend JavaScript
	const ws = new WebSocket('ws://localhost:8000/ws/episode/<episode_id>');

	ws.onmessage = (event) => {
	  const message = JSON.parse(event.data);

	  if (message.type === 'progress') {
	    console.log(`Step ${message.step}: ${message.action_type}`);
	    console.log(`Reward: ${message.reward}, Progress: ${message.progress}%`);
	  }

	  if (message.type === 'completion') {
	    console.log(`Episode completed! Success: ${message.success}`);
	    console.log(`Total reward: ${message.total_reward}`);
	  }
	};
	```
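For a Python client, the same message handling might look like the sketch below. It covers only the two message types and fields used in the JavaScript example and leaves the WebSocket transport out. The function name `describe_message` and the sample values, including the `navigate` action type, are made up for illustration.

```python
import json


def describe_message(raw: str) -> str:
    """Turn one WebSocket text frame into a log line."""
    message = json.loads(raw)
    kind = message.get("type")
    if kind == "progress":
        return (
            f"Step {message['step']}: {message['action_type']} | "
            f"reward={message['reward']}, progress={message['progress']}%"
        )
    if kind == "completion":
        return (
            f"Episode completed, success={message['success']}, "
            f"total_reward={message['total_reward']}"
        )
    return f"Unhandled message type: {kind!r}"


# Sample frames with made-up values, shaped like the fields used above.
frames = [
    '{"type": "progress", "step": 1, "action_type": "navigate", "reward": 0.65, "progress": 10}',
    '{"type": "completion", "success": true, "total_reward": 2.26}',
]
lines = [describe_message(frame) for frame in frames]
```

Unknown message types fall through to a generic line rather than raising, so a client built this way keeps running if the server adds new event types.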

	## 🤝 Contributing

	Contributions are welcome! Commit messages follow the Conventional Commits format:
	- `feat:` - New features
	- `fix:` - Bug fixes
	- `chore:` - Maintenance tasks
	- `docs:` - Documentation updates
	- `test:` - Test additions/updates

	## 📄 License

	MIT License - see [LICENSE](LICENSE) for details.

	## 🙏 Acknowledgments

	- Built with FastAPI, React, TailwindCSS
	- Powered by OpenAI, Anthropic, Google, Groq, and NVIDIA AI models
	- Inspired by reinforcement learning research in web automation