NeerajCodz committed
Commit 9160ee4 · 1 Parent(s): 8512126

docs: comprehensive README update with all new features and examples

Files changed (1)
  1. README.md +248 -38
README.md CHANGED
@@ -9,34 +9,61 @@ pinned: false

  # ScrapeRL 🌖

- A reinforcement learning-powered web scraping tool with a FastAPI backend and React frontend.

- ## Features

- - 🤖 **RL-Powered Scraping** - Intelligent web scraping using reinforcement learning
- - 🔌 **Multi-LLM Support** - Works with OpenAI, Anthropic, Google, and Groq
- - ⚡ **FastAPI Backend** - High-performance async API
- - 🎨 **React Frontend** - Modern, responsive UI
- - 🐳 **Docker Ready** - Easy deployment with Docker
- - 🤗 **HuggingFace Spaces** - One-click deployment

- ## Quick Start

  ### Docker (Recommended)

  ```bash
  # Clone the repository
- git clone https://github.com/yourusername/scrapeRL.git
  cd scrapeRL

- # Copy environment file
  cp .env.example .env

  # Build and run
  docker-compose up --build
  ```

- Access the app at http://localhost:7860

  ### Local Development

@@ -44,7 +71,13 @@ Access the app at http://localhost:7860
  ```bash
  cd backend
  pip install -r requirements.txt
- uvicorn app.main:app --reload --port 7860
  ```

  **Frontend:**
@@ -54,63 +87,240 @@ npm install
  npm run dev
  ```

- ## API Endpoints

  | Method | Endpoint | Description |
  |--------|----------|-------------|
- | GET | `/health` | Health check |
- | GET | `/api/v1/...` | API routes |
- | GET | `/` | Serve frontend |

- ## Architecture

  ```
  scrapeRL/
  ├── backend/
  │   ├── app/
- │   │   ├── main.py           # FastAPI app entry
- │   │   ├── api/              # API routes
- │   │   ├── core/             # Core logic
- │   │   └── services/         # Business logic
  │   └── requirements.txt
  ├── frontend/
  │   ├── src/
  │   └── package.json
- ├── Dockerfile                # Multi-stage build
- ├── docker-compose.yml        # Local development
- └── .env.example
  ```

- ## Configuration

- Set these environment variables (see `.env.example`):

- | Variable | Description | Required |
  |----------|-------------|----------|
- | `OPENAI_API_KEY` | OpenAI API key | No |
- | `ANTHROPIC_API_KEY` | Anthropic API key | No |
- | `GOOGLE_API_KEY` | Google AI API key | No |
- | `GROQ_API_KEY` | Groq API key | No |
- | `HF_TOKEN` | HuggingFace token | No |
- | `DEBUG` | Enable debug mode | No |
- | `LOG_LEVEL` | Logging level | No |

- ## Deployment

  ### HuggingFace Spaces

  This app is configured for HuggingFace Spaces with Docker SDK:
  - Port: 7860
- - Health check: `/health`
  - Auto-builds on push

  ### Manual Docker

  ```bash
  docker build -t scraperl .
  docker run -p 7860:7860 --env-file .env scraperl
  ```

- ## License

  MIT License - see [LICENSE](LICENSE) for details.
 

  # ScrapeRL 🌖

+ **AI-Powered Web Scraping with Reinforcement Learning**

+ A next-generation web scraping system that uses reinforcement learning and multi-agent coordination to intelligently extract data from websites. Features multiple AI provider support (OpenAI, Anthropic, Google Gemini, Groq, NVIDIA), embeddings, real-time WebSocket updates, and a modern navy blue/cyan themed UI.

+ ## ✨ Key Features

+ ### 🤖 AI & Machine Learning
+ - **Multi-LLM Support** - OpenAI, Anthropic (Claude), Google (Gemini 2.5/2.0/3.0), Groq (Llama 3.3, Mixtral, Gemma2), NVIDIA (DeepSeek, Nemotron, Llama 3.3)
+ - **Smart Model Router** - Automatic selection of optimal model based on task type (code, reasoning, extraction, etc.)
+ - **Embeddings Service** - Semantic search with OpenAI and Google embeddings, in-memory caching
+ - **RL-Powered Scraping** - Reinforcement learning agents that learn optimal extraction strategies
+ - **Multi-Agent System** - Coordinated planner, extractor, and navigator agents
+
+ ### ⚡ Real-Time Features
+ - **WebSocket Support** - Live progress updates during scraping episodes
+ - **Session-Based** - Clean slate on each session, no persistent rewards
+ - **Real-Time Metrics** - Track rewards, progress, and extraction in real-time
+
+ ### 🎨 Modern UI/UX
+ - **Navy Blue & Cyan Theme** - Beautiful gradient design with glow effects
+ - **Fullscreen Layout** - Optimized for productivity
+ - **React + TailwindCSS** - Responsive and modern interface
+ - **Live Episode Monitoring** - Watch scraper progress in real-time
+
+ ### 🔧 Developer Experience
+ - **FastAPI Backend** - High-performance async Python API
+ - **TypeScript Frontend** - Type-safe React application
+ - **Docker Ready** - Multi-stage builds with optimized images
+ - **Comprehensive Testing** - End-to-end test scripts included
+ - **Plugin System** - Extensible architecture with plugin support
+
+ ## 🚀 Quick Start
+
+ ### Prerequisites
+ - Python 3.11+
+ - Node.js 20+
+ - Docker (optional, but recommended)
+ - At least one AI provider API key (OpenAI, Anthropic, Google, Groq, or NVIDIA)

  ### Docker (Recommended)

  ```bash
  # Clone the repository
+ git clone https://github.com/NeerajCodz/scrapeRL.git
  cd scrapeRL

+ # Copy and configure environment
  cp .env.example .env
+ # Edit .env and add your API keys

  # Build and run
  docker-compose up --build
  ```

+ Access the app at **http://localhost:7860**

  ### Local Development

 
  ```bash
  cd backend
  pip install -r requirements.txt
+
+ # Copy environment file
+ cp ../.env.example ../.env
+ # Add your API keys to .env
+
+ # Run server
+ uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
  ```

  **Frontend:**
 
  npm run dev
  ```

+ Frontend will be at **http://localhost:5173**

+ ## 📡 API Endpoints
+
+ ### Core Endpoints
  | Method | Endpoint | Description |
  |--------|----------|-------------|
+ | GET | `/api/health` | Health check and system status |
+ | POST | `/api/episode/reset` | Create a new scraping episode |
+ | POST | `/api/episode/step` | Execute an action in an episode |
+ | GET | `/api/episode/state/{episode_id}` | Get current episode state |

+ ### AI Provider Endpoints
+ | Method | Endpoint | Description |
+ |--------|----------|-------------|
+ | GET | `/api/providers` | List all configured AI providers |
+ | GET | `/api/providers/{name}` | Get specific provider details |
+ | GET | `/api/providers/models/all` | List all available models |
+ | GET | `/api/providers/costs/summary` | Get cost tracking summary |
+
+ ### WebSocket Endpoints
+ | Type | Endpoint | Description |
+ |------|----------|-------------|
+ | WS | `/ws/episode/{episode_id}` | Real-time episode progress updates |
+
+ ### Other Endpoints
+ - `/api/tasks` - Task management
+ - `/api/agents` - Agent configuration
+ - `/api/tools` - MCP tools registry
+ - `/api/memory` - Memory management
+ - `/api/plugins` - Plugin system
+ - `/api/settings` - System settings
+
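Reviewer note: the episode and WebSocket paths listed in the new tables could be addressed from a Python client roughly as follows. This is only a sketch. `BASE_URL`, `episode_url`, and `ws_url` are hypothetical names that do not appear in the commit; only the paths and the port 8000 dev server come from the README text.

```python
from typing import Optional
from urllib.parse import quote

# Assumption: the local development server started with uvicorn on port 8000.
BASE_URL = "http://localhost:8000"


def episode_url(action: str, episode_id: Optional[str] = None) -> str:
    """Build the REST URL for one of the episode endpoints in the README table."""
    if action in ("reset", "step"):
        return f"{BASE_URL}/api/episode/{action}"
    if action == "state":
        if episode_id is None:
            raise ValueError("episode_id is required for the state endpoint")
        # Percent-encode the id so it is safe to embed in a path segment.
        return f"{BASE_URL}/api/episode/state/{quote(episode_id, safe='')}"
    raise ValueError(f"unknown episode action: {action}")


def ws_url(episode_id: str) -> str:
    """Build the WebSocket URL for live progress updates of one episode."""
    host = BASE_URL.split("://", 1)[1]
    return f"ws://{host}/ws/episode/{quote(episode_id, safe='')}"
```

Keeping URL construction in one place means a change of host or port (for example 7860 under Docker) only touches `BASE_URL`.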
+ ## 🏗️ Architecture

  ```
  scrapeRL/
  ├── backend/
  │   ├── app/
+ │   │   ├── main.py               # FastAPI app entry
+ │   │   ├── config.py             # Configuration management
+ │   │   ├── api/
+ │   │   │   └── routes/           # API endpoints
+ │   │   │       ├── episode.py    # Episode management
+ │   │   │       ├── providers.py  # AI provider APIs
+ │   │   │       ├── websocket.py  # Real-time updates
+ │   │   │       └── ...
+ │   │   ├── core/
+ │   │   │   ├── env.py            # RL environment
+ │   │   │   ├── reward.py         # Reward engine
+ │   │   │   ├── embeddings.py     # Embeddings service
+ │   │   │   └── ...
+ │   │   ├── agents/
+ │   │   │   ├── coordinator.py    # Agent orchestration
+ │   │   │   ├── planner.py        # Planning agent
+ │   │   │   ├── extractor.py      # Extraction agent
+ │   │   │   └── navigator.py      # Navigation agent
+ │   │   ├── models/
+ │   │   │   ├── router.py         # Smart model router
+ │   │   │   └── providers/        # AI provider implementations
+ │   │   │       ├── openai.py     # OpenAI GPT-4
+ │   │   │       ├── anthropic.py  # Claude 3.5 Sonnet
+ │   │   │       ├── google.py     # Gemini 2.5/2.0/3.0
+ │   │   │       ├── groq.py       # Llama 3.3, Mixtral
+ │   │   │       └── nvidia.py     # DeepSeek, Nemotron
+ │   │   ├── memory/               # Memory system
+ │   │   ├── tools/                # MCP tools
+ │   │   └── types/                # Type definitions
  │   └── requirements.txt
  ├── frontend/
  │   ├── src/
+ │   │   ├── components/           # React components
+ │   │   ├── hooks/
+ │   │   │   ├── useWebSocket.ts   # WebSocket hook
+ │   │   │   └── useEpisodeProgress.ts  # Episode tracking
+ │   │   ├── api/                  # API clients
+ │   │   ├── types/                # TypeScript types
+ │   │   └── index.css             # Navy/cyan theme
  │   └── package.json
+ ├── Dockerfile                    # Multi-stage build
+ ├── docker-compose.yml            # Local development
+ ├── .env.example                  # Environment template
+ └── README.md
  ```

+ ## ⚙️ Configuration

+ Create a `.env` file in the root directory (see `.env.example` for template):

+ ### AI Provider API Keys (Optional - at least one recommended)
+ | Variable | Description | Provider |
  |----------|-------------|----------|
+ | `OPENAI_API_KEY` | OpenAI API key | GPT-4o, GPT-4o-mini, O1 |
+ | `ANTHROPIC_API_KEY` | Anthropic API key | Claude 3.5 Sonnet, Haiku, Opus |
+ | `GOOGLE_API_KEY` | Google AI API key | Gemini 2.5 Pro/Flash, Gemini 2.0, Gemini 3.0 |
+ | `GROQ_API_KEY` | Groq API key | Llama 3.3 70B, Llama 3.2 Vision, Mixtral, Gemma2 |
+ | `NVIDIA_API_KEY` | NVIDIA API key | DeepSeek R1/V3.2, Nemotron 70B, Llama 3.3 70B |
+
+ ### HuggingFace (Optional)
+ | Variable | Description |
+ |----------|-------------|
+ | `HF_TOKEN` | HuggingFace token for model access |
+
+ ### App Settings
+ | Variable | Default | Description |
+ |----------|---------|-------------|
+ | `DEBUG` | `false` | Enable debug mode |
+ | `LOG_LEVEL` | `INFO` | Logging level (DEBUG, INFO, WARN, ERROR) |
+ | `HOST` | `0.0.0.0` | Server host |
+ | `PORT` | `8000` | Server port |
+
+ ### CORS Settings
+ | Variable | Default | Description |
+ |----------|---------|-------------|
+ | `CORS_ORIGINS` | `["http://localhost:5173"]` | Allowed CORS origins |
+
+ ### Session & Memory
+ | Variable | Default | Description |
+ |----------|---------|-------------|
+ | `SESSION_TIMEOUT` | `3600` | Session timeout in seconds |
+ | `MEMORY_TTL` | `86400` | Memory TTL in seconds |
+
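Reviewer note: to illustrate how the documented defaults fit together, here is a small sketch that reads the app, CORS, and session variables from an environment mapping. It is not the project's `config.py`, whose contents are not shown in this commit. `AppSettings` and `load_settings` are hypothetical names, and parsing `CORS_ORIGINS` as a JSON list is an assumption based on the default value in the table.

```python
import json
import os
from dataclasses import dataclass, field
from typing import List, Mapping


@dataclass
class AppSettings:
    """Defaults copied from the README configuration tables."""
    debug: bool = False
    log_level: str = "INFO"
    host: str = "0.0.0.0"
    port: int = 8000
    cors_origins: List[str] = field(default_factory=lambda: ["http://localhost:5173"])
    session_timeout: int = 3600  # seconds
    memory_ttl: int = 86400      # seconds


def load_settings(env: Mapping[str, str] = os.environ) -> AppSettings:
    """Build settings from an environment mapping, falling back to defaults."""
    defaults = AppSettings()
    # Assumption: CORS_ORIGINS is encoded as a JSON array of strings.
    if "CORS_ORIGINS" in env:
        cors = json.loads(env["CORS_ORIGINS"])
    else:
        cors = defaults.cors_origins
    return AppSettings(
        debug=env.get("DEBUG", "false").strip().lower() in ("1", "true", "yes"),
        log_level=env.get("LOG_LEVEL", defaults.log_level).upper(),
        host=env.get("HOST", defaults.host),
        port=int(env.get("PORT", defaults.port)),
        cors_origins=cors,
        session_timeout=int(env.get("SESSION_TIMEOUT", defaults.session_timeout)),
        memory_ttl=int(env.get("MEMORY_TTL", defaults.memory_ttl)),
    )
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise in tests, for example `load_settings({"PORT": "7860"})` for the Docker port.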
+ ## 🧪 Testing
+
+ Run the end-to-end test script:
+
+ ```bash
+ cd backend
+ python test_scraper.py
+ ```
+
+ This will:
+ 1. Create a scraping episode
+ 2. Execute navigation and extraction actions
+ 3. Track rewards and progress
+ 4. Verify WebSocket connectivity
+ 5. Display final results
+
+ Expected output:
+ ```
+ ✓ Episode created: <uuid>
+ ✓ Action executed successfully
+   Reward: 0.65
+   Progress: 0.0%
+ ✓ Final state retrieved
+   Steps: 3
+   Total reward: 2.26
+ ```

+ ## 🚀 Deployment

  ### HuggingFace Spaces

  This app is configured for HuggingFace Spaces with Docker SDK:
  - Port: 7860
+ - Health check: `/api/health`
  - Auto-builds on push
+ - Multi-stage build for optimized image size

  ### Manual Docker

  ```bash
+ # Build
  docker build -t scraperl .
+
+ # Run
  docker run -p 7860:7860 --env-file .env scraperl
+
+ # Or use docker-compose
+ docker-compose up
  ```

+ ### Environment Variables in Production
+
+ Set all required environment variables in your deployment platform:
+ - HuggingFace Spaces: Settings → Repository secrets
+ - Docker: Use `--env-file` or environment section in docker-compose
+ - Kubernetes: ConfigMaps and Secrets
+
+ ## 🎯 Usage Examples
+
+ ### Example 1: Simple Scraping Task
+
+ ```bash
+ curl -X POST http://localhost:8000/api/episode/reset \
+   -H "Content-Type: application/json" \
+   -d '{
+     "task_id": "scrape-quotes",
+     "config": {
+       "start_url": "http://quotes.toscrape.com",
+       "target_fields": {
+         "quotes": {"text": "quote text", "author": "author name"}
+       },
+       "max_steps": 20
+     }
+   }'
+ ```
+
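Reviewer note: the same reset request can be prepared from Python with only the standard library. The sketch below builds the request object without sending it. `build_reset_request` is a hypothetical helper name; the payload is copied from the curl example, and the shape of the server's response is not documented in this commit, so none is assumed.

```python
import json
import urllib.request


def build_reset_request(base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Prepare the POST /api/episode/reset request from the curl example."""
    payload = {
        "task_id": "scrape-quotes",
        "config": {
            "start_url": "http://quotes.toscrape.com",
            "target_fields": {
                "quotes": {"text": "quote text", "author": "author name"}
            },
            "max_steps": 20,
        },
    }
    return urllib.request.Request(
        url=f"{base_url}/api/episode/reset",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

With the backend running, the request could be sent with `urllib.request.urlopen(build_reset_request())` and the body decoded with `json.load`.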
+ ### Example 2: WebSocket Connection
+
+ ```javascript
+ // Frontend JavaScript
+ const ws = new WebSocket('ws://localhost:8000/ws/episode/<episode_id>');
+
+ ws.onmessage = (event) => {
+   const message = JSON.parse(event.data);
+
+   if (message.type === 'progress') {
+     console.log(`Step ${message.step}: ${message.action_type}`);
+     console.log(`Reward: ${message.reward}, Progress: ${message.progress}%`);
+   }
+
+   if (message.type === 'completion') {
+     console.log(`Episode completed! Success: ${message.success}`);
+     console.log(`Total reward: ${message.total_reward}`);
+   }
+ };
+ ```
+
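Reviewer note: for a Python consumer, the message handling in the JavaScript example might look like the sketch below. It only covers decoding and formatting; the WebSocket transport is left out because the commit does not name a client library. `describe_message` is a hypothetical name, and the field names (`type`, `step`, `action_type`, `reward`, `progress`, `success`, `total_reward`) are taken from the JavaScript snippet rather than from a documented schema.

```python
import json
from typing import List


def describe_message(raw: str) -> List[str]:
    """Turn one raw WebSocket frame into the log lines the JS example prints."""
    message = json.loads(raw)
    kind = message.get("type")
    if kind == "progress":
        return [
            f"Step {message['step']}: {message['action_type']}",
            f"Reward: {message['reward']}, Progress: {message['progress']}%",
        ]
    if kind == "completion":
        return [
            f"Episode completed! Success: {message['success']}",
            f"Total reward: {message['total_reward']}",
        ]
    # Unknown message types are ignored, as in the JavaScript handler.
    return []
```

Returning the lines instead of printing them keeps the handler independent of whichever WebSocket client ends up delivering the frames.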
+ ## 🤝 Contributing
+
+ Contributions welcome! This project follows conventional commit messages:
+ - `feat:` - New features
+ - `fix:` - Bug fixes
+ - `chore:` - Maintenance tasks
+ - `docs:` - Documentation updates
+ - `test:` - Test additions/updates
+
+ ## 📄 License

  MIT License - see [LICENSE](LICENSE) for details.
+
+ ## 🙏 Acknowledgments
+
+ - Built with FastAPI, React, TailwindCSS
+ - Powered by OpenAI, Anthropic, Google, Groq, and NVIDIA AI models
+ - Inspired by reinforcement learning research in web automation