title: ScrapeRL
emoji: 🌖
colorFrom: blue
colorTo: gray
sdk: docker
pinned: false

ScrapeRL 🌖

AI-Powered Web Scraping with Reinforcement Learning

A next-generation web scraping system that uses reinforcement learning and multi-agent coordination to extract data from websites. It supports multiple AI providers (OpenAI, Anthropic, Google Gemini, Groq, NVIDIA) and includes an embeddings service, real-time WebSocket updates, and a modern navy blue/cyan themed UI.

✨ Key Features

🤖 AI & Machine Learning

  • Multi-LLM Support - OpenAI, Anthropic (Claude), Google (Gemini 2.5/2.0/3.0), Groq (Llama 3.3, Mixtral, Gemma2), NVIDIA (DeepSeek, Nemotron, Llama 3.3)
  • Smart Model Router - Automatic selection of optimal model based on task type (code, reasoning, extraction, etc.)
  • Embeddings Service - Semantic search with OpenAI and Google embeddings, in-memory caching
  • RL-Powered Scraping - Reinforcement learning agents that learn optimal extraction strategies
  • Multi-Agent System - Coordinated planner, extractor, and navigator agents
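The routing idea behind the Smart Model Router can be pictured with a small sketch. This is not the project's actual router: the preference table and the `pick_provider` helper are hypothetical, and only the environment variable names come from the Configuration section.

```python
import os

# Hypothetical preference table: task type -> providers to try, in order.
# The real router's rules are not documented here; these orderings are illustrative.
TASK_PREFERENCES = {
    "code": ["anthropic", "openai", "nvidia"],
    "reasoning": ["openai", "google", "nvidia"],
    "extraction": ["groq", "google", "openai"],
}

# Environment variable that enables each provider (names from the Configuration section).
PROVIDER_KEYS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "google": "GOOGLE_API_KEY",
    "groq": "GROQ_API_KEY",
    "nvidia": "NVIDIA_API_KEY",
}


def pick_provider(task_type, env=None):
    """Return the first preferred provider whose API key is set, or None."""
    env = os.environ if env is None else env
    # Unknown task types fall back to trying every provider in a fixed order.
    candidates = TASK_PREFERENCES.get(task_type, list(PROVIDER_KEYS))
    for provider in candidates:
        if env.get(PROVIDER_KEYS[provider]):
            return provider
    return None
```

With only `OPENAI_API_KEY` set, `pick_provider("code")` skips the unconfigured Anthropic entry and returns `"openai"`; with no keys set it returns `None`.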

⚡ Real-Time Features

  • WebSocket Support - Live progress updates during scraping episodes
  • Session-Based - Clean slate on each session, no persistent rewards
  • Real-Time Metrics - Track rewards, progress, and extraction results in real time

🎨 Modern UI/UX

  • Navy Blue & Cyan Theme - Beautiful gradient design with glow effects
  • Fullscreen Layout - Optimized for productivity
  • React + TailwindCSS - Responsive and modern interface
  • Live Episode Monitoring - Watch scraper progress in real time

🔧 Developer Experience

  • FastAPI Backend - High-performance async Python API
  • TypeScript Frontend - Type-safe React application
  • Docker Ready - Multi-stage builds with optimized images
  • Comprehensive Testing - End-to-end test scripts included
  • Plugin System - Extensible architecture with plugin support

🚀 Quick Start

Prerequisites

  • Python 3.11+
  • Node.js 20+
  • Docker (optional, but recommended)
  • At least one AI provider API key (OpenAI, Anthropic, Google, Groq, or NVIDIA)

Docker (Recommended)

# Clone the repository
git clone https://github.com/NeerajCodz/scrapeRL.git
cd scrapeRL

# Copy and configure environment
cp .env.example .env
# Edit .env and add your API keys

# Build and run
docker-compose up --build

Access the app at http://localhost:7860

Local Development

Backend:

cd backend
pip install -r requirements.txt

# Copy environment file
cp ../.env.example ../.env
# Add your API keys to .env

# Run server
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

Frontend:

cd frontend
npm install
npm run dev

Frontend will be at http://localhost:5173

📡 API Endpoints

Core Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /api/health | Health check and system status |
| POST | /api/episode/reset | Create a new scraping episode |
| POST | /api/episode/step | Execute an action in an episode |
| GET | /api/episode/state/{episode_id} | Get current episode state |

Scrape Streaming Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /api/scrape/stream | Run scrape with SSE live events (init, url_start, step, url_complete, complete) |
| POST | /api/scrape/ | Start scrape in background and return session_id |
| GET | /api/scrape/{session_id}/status | Session status, reward, steps, plugin info |
| GET | /api/scrape/{session_id}/result | Final formatted output (json/csv/markdown/text) |
| GET | /api/scrape/sessions | List active scrape sessions |
| DELETE | /api/scrape/{session_id} | Cancel running scrape session |
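A client consuming /api/scrape/stream needs to split the response into individual server-sent events. The sketch below parses standard SSE framing from an iterable of text lines. It assumes the server sends the event name in the `event:` field and a JSON object in `data:`; the payload keys in the sample (`session_id`, `step`, `reward`) are illustrative and not confirmed by this README.

```python
import json


def parse_sse(lines):
    """Yield (event, data) pairs from an iterable of SSE text lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            # A blank line terminates the current event.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment or keep-alive line
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].lstrip())
    # Simplification: flush a trailing event even without a final blank line.
    if data:
        yield event, "\n".join(data)


# Hand-written sample stream; real payloads may carry different keys.
sample = [
    "event: init\n",
    'data: {"session_id": "abc"}\n',
    "\n",
    ": keep-alive\n",
    "event: step\n",
    'data: {"step": 1, "reward": 0.5}\n',
    "\n",
]
events = [(name, json.loads(data)) for name, data in parse_sse(sample)]
```

In practice the lines would come from a streaming HTTP response rather than a list; the parser only needs an iterable of decoded text lines.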

Scrape plugin capabilities

  • Query assets can be discovered via mcp-search (non-URL asset text -> resolved links).
  • Python sandbox analysis plugins:
    • mcp-python-sandbox
    • proc-python
    • proc-pandas
    • proc-numpy
    • proc-bs4
  • Optional request field: python_code (sandboxed, validated code; must assign result).
  • Sandbox execution is per-request isolated and cleaned after run.
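The rule that `python_code` must assign `result` can be checked on the client before a request is sent. The sketch below is a hypothetical pre-flight check built on the standard `ast` module; it is not the server's validator, whose exact rules are not described here.

```python
import ast


def assigns_result(source):
    """Return True if the snippet parses and assigns `result` at the top level."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    for node in tree.body:
        if isinstance(node, ast.Assign):
            targets = node.targets
        elif isinstance(node, ast.AnnAssign) and node.value is not None:
            targets = [node.target]
        else:
            continue
        if any(isinstance(t, ast.Name) and t.id == "result" for t in targets):
            return True
    return False


# Example snippet that would pass this check.
snippet = "values = [3, 1, 2]\nresult = sorted(values)"
ok = assigns_result(snippet)
```

The check only looks at top-level statements, so a `result` assigned inside a function body or a conditional branch is deliberately not counted.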

AI Provider Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /api/providers | List all configured AI providers |
| GET | /api/providers/{name} | Get specific provider details |
| GET | /api/providers/models/all | List all available models |
| GET | /api/providers/costs/summary | Get cost tracking summary |

WebSocket Endpoints

| Type | Endpoint | Description |
| --- | --- | --- |
| WS | /ws/episode/{episode_id} | Real-time episode/session updates |

Other Endpoints

  • /api/tasks - Task management
  • /api/agents - Agent configuration
  • /api/tools - MCP tools registry
  • /api/memory - Memory management
  • /api/plugins - Plugin system
  • /api/settings - System settings

🏗️ Architecture

scrapeRL/
├── backend/
│   ├── app/
│   │   ├── main.py              # FastAPI app entry
│   │   ├── config.py            # Configuration management
│   │   ├── api/
│   │   │   └── routes/          # API endpoints
│   │   │       ├── episode.py   # Episode management
│   │   │       ├── providers.py # AI provider APIs
│   │   │       ├── websocket.py # Real-time updates
│   │   │       └── ...
│   │   ├── core/
│   │   │   ├── env.py           # RL environment
│   │   │   ├── reward.py        # Reward engine
│   │   │   ├── embeddings.py    # Embeddings service
│   │   │   └── ...
│   │   ├── agents/
│   │   │   ├── coordinator.py   # Agent orchestration
│   │   │   ├── planner.py       # Planning agent
│   │   │   ├── extractor.py     # Extraction agent
│   │   │   └── navigator.py     # Navigation agent
│   │   ├── models/
│   │   │   ├── router.py        # Smart model router
│   │   │   └── providers/       # AI provider implementations
│   │   │       ├── openai.py    # OpenAI GPT-4
│   │   │       ├── anthropic.py # Claude 3.5 Sonnet
│   │   │       ├── google.py    # Gemini 2.5/2.0/3.0
│   │   │       ├── groq.py      # Llama 3.3, Mixtral
│   │   │       └── nvidia.py    # DeepSeek, Nemotron
│   │   ├── memory/              # Memory system
│   │   ├── tools/               # MCP tools
│   │   ├── plugins/             # Sandboxed plugin executors
│   │   └── types/               # Type definitions
│   └── requirements.txt
├── frontend/
│   ├── src/
│   │   ├── components/          # React components
│   │   ├── hooks/
│   │   │   ├── useWebSocket.ts  # WebSocket hook
│   │   │   └── useEpisodeProgress.ts # Episode tracking
│   │   ├── api/                 # API clients
│   │   ├── types/               # TypeScript types
│   │   └── index.css            # Navy/cyan theme
│   └── package.json
├── Dockerfile                   # Multi-stage build
├── docker-compose.yml           # Local development
├── .env.example                 # Environment template
└── README.md

⚙️ Configuration

Create a .env file in the root directory (see .env.example for template):

AI Provider API Keys (Optional - at least one recommended)

| Variable | Description | Models |
| --- | --- | --- |
| OPENAI_API_KEY | OpenAI API key | GPT-4o, GPT-4o-mini, O1 |
| ANTHROPIC_API_KEY | Anthropic API key | Claude 3.5 Sonnet, Haiku, Opus |
| GOOGLE_API_KEY | Google AI API key | Gemini 2.5 Pro/Flash, Gemini 2.0, Gemini 3.0 |
| GROQ_API_KEY | Groq API key | Llama 3.3 70B, Llama 3.2 Vision, Mixtral, Gemma2 |
| NVIDIA_API_KEY | NVIDIA API key | DeepSeek R1/V3.2, Nemotron 70B, Llama 3.3 70B |

HuggingFace (Optional)

| Variable | Description |
| --- | --- |
| HF_TOKEN | HuggingFace token for model access |

App Settings

| Variable | Default | Description |
| --- | --- | --- |
| DEBUG | false | Enable debug mode |
| LOG_LEVEL | INFO | Logging level (DEBUG, INFO, WARN, ERROR) |
| HOST | 0.0.0.0 | Server host |
| PORT | 8000 | Server port |

CORS Settings

| Variable | Default | Description |
| --- | --- | --- |
| CORS_ORIGINS | ["http://localhost:5173"] | Allowed CORS origins |

Session & Memory

| Variable | Default | Description |
| --- | --- | --- |
| SESSION_TIMEOUT | 3600 | Session timeout in seconds |
| MEMORY_TTL | 86400 | Memory TTL in seconds |
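To show how the variables and defaults listed in this section fit together, here is a minimal loader sketch. It is not the project's `config.py`: the `load_settings` helper is hypothetical, and parsing `CORS_ORIGINS` as a JSON list is an assumption based on the format of its default value.

```python
import json
import os


def load_settings(env=None):
    """Read the documented variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return {
        "debug": env.get("DEBUG", "false").strip().lower() in ("1", "true", "yes"),
        "log_level": env.get("LOG_LEVEL", "INFO").upper(),
        "host": env.get("HOST", "0.0.0.0"),
        "port": int(env.get("PORT", "8000")),
        # Assumption: the value is a JSON array of origin strings.
        "cors_origins": json.loads(env.get("CORS_ORIGINS", '["http://localhost:5173"]')),
        "session_timeout": int(env.get("SESSION_TIMEOUT", "3600")),
        "memory_ttl": int(env.get("MEMORY_TTL", "86400")),
    }


# With an empty environment, every value comes from the defaults above.
defaults = load_settings({})
```

Passing a plain dict instead of `os.environ` keeps the sketch easy to test, for example `load_settings({"PORT": "9000"})`.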

🧪 Testing

Run the end-to-end test script:

cd backend
python test_scraper.py

This will:

  1. Create a scraping episode
  2. Execute navigation and extraction actions
  3. Track rewards and progress
  4. Verify WebSocket connectivity
  5. Display final results

Expected output:

✓ Episode created: <uuid>
✓ Action executed successfully
  Reward: 0.65
  Progress: 0.0%
✓ Final state retrieved
  Steps: 3
  Total reward: 2.26

🚀 Deployment

HuggingFace Spaces

This app is configured for HuggingFace Spaces with Docker SDK:

  • Port: 7860
  • Health check: /api/health
  • Auto-builds on push
  • Multi-stage build for optimized image size

Manual Docker

# Run frontend + backend together
docker compose up --build

After startup:

  • Frontend: http://localhost:3000
  • Backend API: http://localhost:8000/api

Environment Variables in Production

Set all required environment variables in your deployment platform:

  • HuggingFace Spaces: Settings → Repository secrets
  • Docker: Use --env-file or environment section in docker-compose
  • Kubernetes: ConfigMaps and Secrets

🎯 Usage Examples

Example 1: Simple Scraping Task

curl -X POST http://localhost:8000/api/episode/reset \
  -H "Content-Type: application/json" \
  -d '{
    "task_id": "scrape-quotes",
    "config": {
      "start_url": "http://quotes.toscrape.com",
      "target_fields": {
        "quotes": {"text": "quote text", "author": "author name"}
      },
      "max_steps": 20
    }
  }'
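
The same request can be built from Python with only the standard library. This is a sketch mirroring the curl call above; the `build_reset_request` and `send` helpers are hypothetical names, and the shape of the response is not documented here, so it is returned as parsed JSON without further interpretation.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # local backend from the Quick Start


def build_reset_request(start_url, target_fields, task_id="scrape-quotes", max_steps=20):
    """Build (but do not send) the POST /api/episode/reset request."""
    body = {
        "task_id": task_id,
        "config": {
            "start_url": start_url,
            "target_fields": target_fields,
            "max_steps": max_steps,
        },
    }
    return urllib.request.Request(
        f"{BASE_URL}/api/episode/reset",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=30):
    """Send a request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_reset_request(
    "http://quotes.toscrape.com",
    {"quotes": {"text": "quote text", "author": "author name"}},
)
# episode = send(request)  # uncomment with the backend running
```

Separating request construction from sending makes the payload easy to inspect or log before it reaches the server.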

Example 2: WebSocket Connection

// Frontend JavaScript
const ws = new WebSocket('ws://localhost:8000/ws/episode/<episode_id>');

ws.onmessage = (event) => {
  const message = JSON.parse(event.data);
  
  if (message.type === 'progress') {
    console.log(`Step ${message.step}: ${message.action_type}`);
    console.log(`Reward: ${message.reward}, Progress: ${message.progress}%`);
  }
  
  if (message.type === 'completion') {
    console.log(`Episode completed! Success: ${message.success}`);
    console.log(`Total reward: ${message.total_reward}`);
  }
};
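
For a backend script or test, the same message handling can be written in Python. The sketch below mirrors the JavaScript handler and uses the field names shown there; the `describe_message` helper is hypothetical and operates on a single text frame, so it works with any WebSocket client library.

```python
import json


def describe_message(raw):
    """Turn one WebSocket text frame into log lines, mirroring the JavaScript handler."""
    message = json.loads(raw)
    if message.get("type") == "progress":
        return [
            f"Step {message['step']}: {message['action_type']}",
            f"Reward: {message['reward']}, Progress: {message['progress']}%",
        ]
    if message.get("type") == "completion":
        return [
            f"Episode completed! Success: {message['success']}",
            f"Total reward: {message['total_reward']}",
        ]
    return []  # ignore message types this sketch does not know about
```

Each frame received from `/ws/episode/<episode_id>` would be passed to `describe_message`, and the returned lines printed or logged.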

🀝 Contributing

Contributions are welcome! This project uses conventional commit message prefixes:

  • feat: - New features
  • fix: - Bug fixes
  • chore: - Maintenance tasks
  • docs: - Documentation updates
  • test: - Test additions/updates

📄 License

MIT License - see LICENSE for details.

🙏 Acknowledgments

  • Built with FastAPI, React, TailwindCSS
  • Powered by OpenAI, Anthropic, Google, Groq, and NVIDIA AI models
  • Inspired by reinforcement learning research in web automation