
NeerajCodz committed on Apr 4

Commit

9160ee4

1 Parent(s): 8512126

docs: comprehensive README update with all new features and examples


Files changed (1)

README.md +248 -38


@@ -9,34 +9,61 @@ pinned: false
 # ScrapeRL 🌖
-A reinforcement learning-powered web scraping tool with a FastAPI backend and React frontend.
-## Features
-- 🤖 **RL-Powered Scraping** - Intelligent web scraping using reinforcement learning
-- 🔌 **Multi-LLM Support** - Works with OpenAI, Anthropic, Google, and Groq
-- ⚡ **FastAPI Backend** - High-performance async API
-- 🎨 **React Frontend** - Modern, responsive UI
-- 🐳 **Docker Ready** - Easy deployment with Docker
-- 🤗 **HuggingFace Spaces** - One-click deployment
-## Quick Start
 ### Docker (Recommended)
 ```bash
 # Clone the repository
-git clone https://github.com/yourusername/scrapeRL.git
 cd scrapeRL
-# Copy environment file
 cp .env.example .env
 # Build and run
 docker-compose up --build
 ```
-Access the app at http://localhost:7860
 ### Local Development
@@ -44,7 +71,13 @@ Access the app at http://localhost:7860
 ```bash
 cd backend
 pip install -r requirements.txt
-uvicorn app.main:app --reload --port 7860
 ```
 **Frontend:**
@@ -54,63 +87,240 @@ npm install
 npm run dev
 ```
-## API Endpoints
 | Method | Endpoint | Description |
 |--------|----------|-------------|
-| GET | `/health` | Health check |
-| GET | `/api/v1/...` | API routes |
-| GET | `/` | Serve frontend |
-## Architecture
 ```
 scrapeRL/
 ├── backend/
 │   ├── app/
-│   │   ├── main.py         # FastAPI app entry
-│   │   ├── api/            # API routes
-│   │   ├── core/           # Core logic
-│   │   └── services/       # Business logic
 │   └── requirements.txt
 ├── frontend/
 │   ├── src/
 │   └── package.json
-├── Dockerfile              # Multi-stage build
-├── docker-compose.yml      # Local development
-└── .env.example
 ```
-## Configuration
-Set these environment variables (see `.env.example`):
-| Variable | Description | Required |
 |----------|-------------|----------|
-| `OPENAI_API_KEY` | OpenAI API key | No |
-| `ANTHROPIC_API_KEY` | Anthropic API key | No |
-| `GOOGLE_API_KEY` | Google AI API key | No |
-| `GROQ_API_KEY` | Groq API key | No |
-| `HF_TOKEN` | HuggingFace token | No |
-| `DEBUG` | Enable debug mode | No |
-| `LOG_LEVEL` | Logging level | No |
-## Deployment
 ### HuggingFace Spaces
 This app is configured for HuggingFace Spaces with Docker SDK:
 - Port: 7860
-- Health check: `/health`
 - Auto-builds on push
 ### Manual Docker
 ```bash
 docker build -t scraperl .
 docker run -p 7860:7860 --env-file .env scraperl
 ```
-## License
 MIT License - see [LICENSE](LICENSE) for details.

 # ScrapeRL 🌖
+**AI-Powered Web Scraping with Reinforcement Learning**
+A next-generation web scraping system that uses reinforcement learning and multi-agent coordination to intelligently extract data from websites. Features multiple AI provider support (OpenAI, Anthropic, Google Gemini, Groq, NVIDIA), embeddings, real-time WebSocket updates, and a modern navy blue/cyan themed UI.
+## ✨ Key Features
+### 🤖 AI & Machine Learning
+- **Multi-LLM Support** - OpenAI, Anthropic (Claude), Google (Gemini 2.5/2.0/3.0), Groq (Llama 3.3, Mixtral, Gemma2), NVIDIA (DeepSeek, Nemotron, Llama 3.3)
+- **Smart Model Router** - Automatic selection of optimal model based on task type (code, reasoning, extraction, etc.)
+- **Embeddings Service** - Semantic search with OpenAI and Google embeddings, in-memory caching
+- **RL-Powered Scraping** - Reinforcement learning agents that learn optimal extraction strategies
+- **Multi-Agent System** - Coordinated planner, extractor, and navigator agents
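Reviewer sketch (not part of the commit): the "Smart Model Router" bullet describes choosing a model by task type. A minimal Python illustration of that idea follows. The preference table, the `choose_provider` name, and the fallback order are all assumptions made for illustration; the actual logic in `models/router.py` is not shown in this diff and may differ.

```python
# Hypothetical per-task preference order; NOT taken from the repository.
PREFERENCES = {
    "code": ("anthropic", "openai", "nvidia"),
    "reasoning": ("openai", "nvidia", "google"),
    "extraction": ("groq", "google", "openai"),
}

# Assumed fallback order, covering the five providers the README lists.
DEFAULT_ORDER = ("openai", "anthropic", "google", "groq", "nvidia")


def choose_provider(task_type: str, configured: set) -> str:
    """Return the first preferred provider that has an API key configured."""
    candidates = PREFERENCES.get(task_type, ()) + DEFAULT_ORDER
    for name in candidates:
        if name in configured:
            return name
    raise LookupError("no AI provider is configured")


# Example: only OpenAI and Groq keys are present.
available = {"openai", "groq"}
code_choice = choose_provider("code", available)
extraction_choice = choose_provider("extraction", available)
```

Falling back to a fixed default order keeps unknown task types working as long as at least one key is set, which matches the README's "at least one AI provider API key" prerequisite.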
+### ⚡ Real-Time Features
+- **WebSocket Support** - Live progress updates during scraping episodes
+- **Session-Based** - Clean slate on each session, no persistent rewards
+- **Real-Time Metrics** - Track rewards, progress, and extraction in real-time
+### 🎨 Modern UI/UX
+- **Navy Blue & Cyan Theme** - Beautiful gradient design with glow effects
+- **Fullscreen Layout** - Optimized for productivity
+- **React + TailwindCSS** - Responsive and modern interface
+- **Live Episode Monitoring** - Watch scraper progress in real-time
+### 🔧 Developer Experience
+- **FastAPI Backend** - High-performance async Python API
+- **TypeScript Frontend** - Type-safe React application
+- **Docker Ready** - Multi-stage builds with optimized images
+- **Comprehensive Testing** - End-to-end test scripts included
+- **Plugin System** - Extensible architecture with plugin support
+## 🚀 Quick Start
+### Prerequisites
+- Python 3.11+
+- Node.js 20+
+- Docker (optional, but recommended)
+- At least one AI provider API key (OpenAI, Anthropic, Google, Groq, or NVIDIA)
 ### Docker (Recommended)
 ```bash
 # Clone the repository
+git clone https://github.com/NeerajCodz/scrapeRL.git
 cd scrapeRL
+# Copy and configure environment
 cp .env.example .env
+# Edit .env and add your API keys
 # Build and run
 docker-compose up --build
 ```
+Access the app at **http://localhost:7860**
 ### Local Development
 ```bash
 cd backend
 pip install -r requirements.txt
+# Copy environment file
+cp ../.env.example ../.env
+# Add your API keys to .env
+# Run server
+uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
 ```
 **Frontend:**
 npm run dev
 ```
+Frontend will be at **http://localhost:5173**
+## 📡 API Endpoints
+### Core Endpoints
 | Method | Endpoint | Description |
 |--------|----------|-------------|
+| GET | `/api/health` | Health check and system status |
+| POST | `/api/episode/reset` | Create a new scraping episode |
+| POST | `/api/episode/step` | Execute an action in an episode |
+| GET | `/api/episode/state/{episode_id}` | Get current episode state |
+### AI Provider Endpoints
+| Method | Endpoint | Description |
+|--------|----------|-------------|
+| GET | `/api/providers` | List all configured AI providers |
+| GET | `/api/providers/{name}` | Get specific provider details |
+| GET | `/api/providers/models/all` | List all available models |
+| GET | `/api/providers/costs/summary` | Get cost tracking summary |
+### WebSocket Endpoints
+| Type | Endpoint | Description |
+|------|----------|-------------|
+| WS | `/ws/episode/{episode_id}` | Real-time episode progress updates |
+### Other Endpoints
+- `/api/tasks` - Task management
+- `/api/agents` - Agent configuration
+- `/api/tools` - MCP tools registry
+- `/api/memory` - Memory management
+- `/api/plugins` - Plugin system
+- `/api/settings` - System settings
+## 🏗️ Architecture
 ```
 scrapeRL/
 ├── backend/
 │   ├── app/
+│   │   ├── main.py              # FastAPI app entry
+│   │   ├── config.py            # Configuration management
+│   │   ├── api/
+│   │   │   └── routes/          # API endpoints
+│   │   │       ├── episode.py   # Episode management
+│   │   │       ├── providers.py # AI provider APIs
+│   │   │       ├── websocket.py # Real-time updates
+│   │   │       └── ...
+│   │   ├── core/
+│   │   │   ├── env.py           # RL environment
+│   │   │   ├── reward.py        # Reward engine
+│   │   │   ├── embeddings.py   # Embeddings service
+│   │   │   └── ...
+│   │   ├── agents/
+│   │   │   ├── coordinator.py   # Agent orchestration
+│   │   │   ├── planner.py       # Planning agent
+│   │   │   ├── extractor.py     # Extraction agent
+│   │   │   └── navigator.py     # Navigation agent
+│   │   ├── models/
+│   │   │   ├── router.py        # Smart model router
+│   │   │   └── providers/       # AI provider implementations
+│   │   │       ├── openai.py    # OpenAI GPT-4
+│   │   │       ├── anthropic.py # Claude 3.5 Sonnet
+│   │   │       ├── google.py    # Gemini 2.5/2.0/3.0
+│   │   │       ├── groq.py      # Llama 3.3, Mixtral
+│   │   │       └── nvidia.py    # DeepSeek, Nemotron
+│   │   ├── memory/              # Memory system
+│   │   ├── tools/               # MCP tools
+│   │   └── types/               # Type definitions
 │   └── requirements.txt
 ├── frontend/
 │   ├── src/
+│   │   ├── components/          # React components
+│   │   ├── hooks/
+│   │   │   ├── useWebSocket.ts  # WebSocket hook
+│   │   │   └── useEpisodeProgress.ts # Episode tracking
+│   │   ├── api/                 # API clients
+│   │   ├── types/               # TypeScript types
+│   │   └── index.css            # Navy/cyan theme
 │   └── package.json
+├── Dockerfile                   # Multi-stage build
+├── docker-compose.yml           # Local development
+├── .env.example                 # Environment template
+└── README.md
 ```
+## ⚙️ Configuration
+Create a `.env` file in the root directory (see `.env.example` for template):
+### AI Provider API Keys (Optional - at least one recommended)
+| Variable | Description | Provider |
 |----------|-------------|----------|
+| `OPENAI_API_KEY` | OpenAI API key | GPT-4o, GPT-4o-mini, O1 |
+| `ANTHROPIC_API_KEY` | Anthropic API key | Claude 3.5 Sonnet, Haiku, Opus |
+| `GOOGLE_API_KEY` | Google AI API key | Gemini 2.5 Pro/Flash, Gemini 2.0, Gemini 3.0 |
+| `GROQ_API_KEY` | Groq API key | Llama 3.3 70B, Llama 3.2 Vision, Mixtral, Gemma2 |
+| `NVIDIA_API_KEY` | NVIDIA API key | DeepSeek R1/V3.2, Nemotron 70B, Llama 3.3 70B |
+### HuggingFace (Optional)
+| Variable | Description |
+|----------|-------------|
+| `HF_TOKEN` | HuggingFace token for model access |
+### App Settings
+| Variable | Default | Description |
+|----------|---------|-------------|
+| `DEBUG` | `false` | Enable debug mode |
+| `LOG_LEVEL` | `INFO` | Logging level (DEBUG, INFO, WARN, ERROR) |
+| `HOST` | `0.0.0.0` | Server host |
+| `PORT` | `8000` | Server port |
+### CORS Settings
+| Variable | Default | Description |
+|----------|---------|-------------|
+| `CORS_ORIGINS` | `["http://localhost:5173"]` | Allowed CORS origins |
+### Session & Memory
+| Variable | Default | Description |
+|----------|---------|-------------|
+| `SESSION_TIMEOUT` | `3600` | Session timeout in seconds |
+| `MEMORY_TTL` | `86400` | Memory TTL in seconds |
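Reviewer sketch (not part of the commit): the settings tables above can be read into a typed object roughly as follows. The variable names and defaults come from the tables; the `Settings` class, the `load_settings` helper, and the assumption that `CORS_ORIGINS` is a JSON array are illustrative and may not match the project's `config.py`.

```python
import json
import os
from dataclasses import dataclass


def _as_bool(value: str) -> bool:
    """Interpret common truthy strings; anything else is False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    # Defaults mirror the README tables.
    debug: bool = False
    log_level: str = "INFO"
    host: str = "0.0.0.0"
    port: int = 8000
    cors_origins: tuple = ("http://localhost:5173",)
    session_timeout: int = 3600
    memory_ttl: int = 86400


def load_settings(env=None) -> Settings:
    """Build Settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    defaults = Settings()
    origins = env.get("CORS_ORIGINS")
    return Settings(
        debug=_as_bool(env.get("DEBUG", "false")),
        log_level=env.get("LOG_LEVEL", defaults.log_level).upper(),
        host=env.get("HOST", defaults.host),
        port=int(env.get("PORT", defaults.port)),
        # Assumption: CORS_ORIGINS is a JSON array, as the default suggests.
        cors_origins=tuple(json.loads(origins)) if origins else defaults.cors_origins,
        session_timeout=int(env.get("SESSION_TIMEOUT", defaults.session_timeout)),
        memory_ttl=int(env.get("MEMORY_TTL", defaults.memory_ttl)),
    )


settings = load_settings({"DEBUG": "true", "PORT": "7860"})
```

Passing a plain dict instead of `os.environ` keeps the loader easy to test without touching the real environment.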
+## 🧪 Testing
+Run the end-to-end test script:
+```bash
+cd backend
+python test_scraper.py
+```
+This will:
+1. Create a scraping episode
+2. Execute navigation and extraction actions
+3. Track rewards and progress
+4. Verify WebSocket connectivity
+5. Display final results
+Expected output:
+```
+✓ Episode created: <uuid>
+✓ Action executed successfully
+  Reward: 0.65
+  Progress: 0.0%
+✓ Final state retrieved
+  Steps: 3
+  Total reward: 2.26
+```
+## 🚀 Deployment
 ### HuggingFace Spaces
 This app is configured for HuggingFace Spaces with Docker SDK:
 - Port: 7860
+- Health check: `/api/health`
 - Auto-builds on push
+- Multi-stage build for optimized image size
 ### Manual Docker
 ```bash
+# Build
 docker build -t scraperl .
+# Run
 docker run -p 7860:7860 --env-file .env scraperl
+# Or use docker-compose
+docker-compose up
 ```
+### Environment Variables in Production
+Set all required environment variables in your deployment platform:
+- HuggingFace Spaces: Settings → Repository secrets
+- Docker: Use `--env-file` or environment section in docker-compose
+- Kubernetes: ConfigMaps and Secrets
+## 🎯 Usage Examples
+### Example 1: Simple Scraping Task
+```bash
+curl -X POST http://localhost:8000/api/episode/reset \
+  -H "Content-Type: application/json" \
+  -d '{
+    "task_id": "scrape-quotes",
+    "config": {
+      "start_url": "http://quotes.toscrape.com",
+      "target_fields": {
+        "quotes": {"text": "quote text", "author": "author name"}
+      },
+      "max_steps": 20
+    }
+  }'
+```
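Reviewer sketch (not part of the commit): the same reset request can be built from Python with only the standard library. The endpoint path and payload shape are copied from the curl example; the `build_reset_request` helper is a hypothetical name, and the response format is not documented in this diff, so the sketch stops short of parsing it.

```python
import json
import urllib.request


def build_reset_request(base_url: str, task_id: str, start_url: str,
                        target_fields: dict, max_steps: int = 20) -> urllib.request.Request:
    """Build (but do not send) a POST to /api/episode/reset."""
    payload = {
        "task_id": task_id,
        "config": {
            "start_url": start_url,
            "target_fields": target_fields,
            "max_steps": max_steps,
        },
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/episode/reset",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_reset_request(
    "http://localhost:8000",
    task_id="scrape-quotes",
    start_url="http://quotes.toscrape.com",
    target_fields={"quotes": {"text": "quote text", "author": "author name"}},
)

# To send it against a running backend:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```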
+### Example 2: WebSocket Connection
+```javascript
+// Frontend JavaScript
+const ws = new WebSocket('ws://localhost:8000/ws/episode/<episode_id>');
+ws.onmessage = (event) => {
+  const message = JSON.parse(event.data);
+  if (message.type === 'progress') {
+    console.log(`Step ${message.step}: ${message.action_type}`);
+    console.log(`Reward: ${message.reward}, Progress: ${message.progress}%`);
+  }
+  if (message.type === 'completion') {
+    console.log(`Episode completed! Success: ${message.success}`);
+    console.log(`Total reward: ${message.total_reward}`);
+  }
+};
+```
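Reviewer sketch (not part of the commit): a Python consumer of the same WebSocket stream would need equivalent message handling. The function below mirrors the JavaScript handler above and uses only the fields it reads (`type`, `step`, `action_type`, `reward`, `progress`, `success`, `total_reward`). The `describe_message` name and the sample `"navigate"` action are assumptions for illustration; the transport itself is left out because no Python WebSocket client is named in the README.

```python
import json


def describe_message(raw: str) -> list[str]:
    """Turn one JSON message from /ws/episode/{episode_id} into log lines."""
    message = json.loads(raw)
    kind = message.get("type")
    if kind == "progress":
        return [
            f"Step {message['step']}: {message['action_type']}",
            f"Reward: {message['reward']}, Progress: {message['progress']}%",
        ]
    if kind == "completion":
        return [
            f"Episode completed! Success: {message['success']}",
            f"Total reward: {message['total_reward']}",
        ]
    # Unknown message types are ignored, as in the JavaScript example.
    return []


sample = json.dumps({
    "type": "progress",
    "step": 1,
    "action_type": "navigate",  # illustrative value, not from the README
    "reward": 0.65,
    "progress": 0.0,
})
progress_lines = describe_message(sample)
```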
+## 🤝 Contributing
+Contributions welcome! This project follows conventional commit messages:
+- `feat:` - New features
+- `fix:` - Bug fixes
+- `chore:` - Maintenance tasks
+- `docs:` - Documentation updates
+- `test:` - Test additions/updates
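Reviewer sketch (not part of the commit): the prefix list above could be checked automatically, for example in a commit hook. The prefixes are taken from the list; the optional `(scope)` part and the `is_conventional` helper are assumptions, since the README does not say whether scopes are allowed.

```python
import re

# Prefixes listed in the Contributing section.
PREFIXES = ("feat", "fix", "chore", "docs", "test")

# "<prefix>: <subject>" with an optional "(scope)" (the scope is an assumption).
PATTERN = re.compile(rf"^(?:{'|'.join(PREFIXES)})(?:\([\w.-]+\))?: \S.*")


def is_conventional(subject: str) -> bool:
    """Return True if the commit subject starts with an accepted prefix."""
    return PATTERN.match(subject) is not None


this_commit_ok = is_conventional(
    "docs: comprehensive README update with all new features and examples"
)
```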
+## 📄 License
 MIT License - see [LICENSE](LICENSE) for details.
+## 🙏 Acknowledgments
+- Built with FastAPI, React, TailwindCSS
+- Powered by OpenAI, Anthropic, Google, Groq, and NVIDIA AI models
+- Inspired by reinforcement learning research in web automation