---
title: Scrapling - Web Scraping API
emoji: 🕷️
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
license: mit
---

# Scrapling - Advanced Web Scraping API

A powerful web scraping API with AI-powered content extraction, session management, and multiple scraping modes (HTTP, JavaScript rendering, and stealthy browser automation).
## Features

- 🚀 **REST API** - FastAPI-based endpoints for programmatic access
- 🤖 **AI-Powered Extraction** - natural-language queries for content extraction
- 🔐 **Session Management** - persistent sessions for efficient batch processing
- 🌐 **Multiple Scraping Modes**:
  - Standard HTTP (fast, low protection)
  - Dynamic fetching (JavaScript support)
  - Stealthy browser (anti-bot bypass)
- 📊 **Structured Output** - returns data as JSON, Markdown, HTML, or plain text
- 🎨 **Gradio UI** - interactive web interface for testing

## API Endpoints

### Base URL

```
https://grazieprego-scrapling.hf.space
```

### Quick Reference

| | Endpoint | Method | Description | |
| |----------|--------|-------------| |
| | `/health` | GET | Check API status | |
| | `/api/scrape` | POST | Stateless scrape request | |
| | `/api/session` | POST | Create persistent session | |
| | `/api/session/{id}/scrape` | POST | Scrape using session | |
| | `/api/session/{id}` | DELETE | Close session | |
| | `/docs` | GET | API documentation (HTML) | |
| | `/api-docs` | GET | API documentation (JSON) | |
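
The `{id}` placeholder in the session routes is filled with the `session_id` returned by `POST /api/session`. A tiny helper like the following (illustrative only, not part of the API) builds full endpoint URLs:

```python
BASE_URL = "https://grazieprego-scrapling.hf.space"

def endpoint(path, session_id=None):
    """Build a full endpoint URL, substituting the {id} placeholder if given."""
    if session_id is not None:
        path = path.format(id=session_id)
    return BASE_URL + path

print(endpoint("/api/session/{id}/scrape", session_id="abc123"))
# https://grazieprego-scrapling.hf.space/api/session/abc123/scrape
```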

## Usage Examples

### 1. Stateless Scrape (one-off requests)

```bash
curl -X POST https://grazieprego-scrapling.hf.space/api/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "query": "Extract all product prices",
    "model_name": "alias-fast"
  }'
```
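
The same stateless call can be sketched in Python using only the standard library (the helper names, `timeout`, and explicit error propagation are illustrative robustness additions, not API requirements):

```python
import json
from urllib import request

BASE_URL = "https://grazieprego-scrapling.hf.space"

def build_payload(url, query, model_name="alias-fast"):
    """Assemble the JSON body for /api/scrape; model_name falls back to the default."""
    return {"url": url, "query": query, "model_name": model_name}

def scrape_once(url, query, model_name="alias-fast", timeout=60):
    """One-off stateless scrape; raises urllib.error.HTTPError on 4xx/5xx."""
    req = request.Request(
        BASE_URL + "/api/scrape",
        data=json.dumps(build_payload(url, query, model_name)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```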

### 2. Session-Based Scraping (multiple requests)

```python
import requests

# Create a session
session = requests.post(
    'https://grazieprego-scrapling.hf.space/api/session',
    json={'model_name': 'alias-fast'}
)
session_id = session.json()['session_id']

try:
    # Multiple scrapes using the same session
    urls = [
        'https://example.com/page1',
        'https://example.com/page2',
        'https://example.com/page3'
    ]

    for url in urls:
        result = requests.post(
            f'https://grazieprego-scrapling.hf.space/api/session/{session_id}/scrape',
            json={'url': url, 'query': 'Extract product data'}
        )
        print(f"Scraped {url}: {result.json()}")
finally:
    # Always close the session
    requests.delete(f'https://grazieprego-scrapling.hf.space/api/session/{session_id}')
```

### 3. Using the Gradio UI

Visit the Space URL and use the interactive interface:

- **Fetch (HTTP)** tab: for standard HTTP scraping
- **Stealthy Fetch (Browser)** tab: for sites with bot protection

## API Documentation

- **HTML docs**: https://grazieprego-scrapling.hf.space/docs
- **JSON docs**: https://grazieprego-scrapling.hf.space/api-docs

## Request Parameters

### `/api/scrape` & `/api/session/{id}/scrape`

```json
{
  "url": "https://example.com",
  "query": "Extract all headings and prices",
  "model_name": "alias-fast"
}
```

**Parameters:**

- `url` (string, required): the URL to scrape
- `query` (string, required): natural-language extraction instruction
- `model_name` (string, optional): AI model to use (default: `"alias-fast"`)

### `/api/session`

```json
{
  "model_name": "alias-fast"
}
```

## Response Format

```json
{
  "url": "https://example.com",
  "query": "Extract prices",
  "response": {
    "status": 200,
    "content": ["# Product 1: $19.99", "# Product 2: $29.99"],
    "url": "https://example.com"
  }
}
```
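
Given that shape, pulling the extracted items out of a response is a one-liner; `extract_content` below is an illustrative helper, not an API call:

```python
import json

# Sample response copied from the format shown above
sample = '''
{
  "url": "https://example.com",
  "query": "Extract prices",
  "response": {
    "status": 200,
    "content": ["# Product 1: $19.99", "# Product 2: $29.99"],
    "url": "https://example.com"
  }
}
'''

def extract_content(payload):
    """Pull the list of extracted strings out of a scrape response."""
    return payload["response"]["content"]

items = extract_content(json.loads(sample))
print(items)  # ['# Product 1: $19.99', '# Product 2: $29.99']
```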

## Best Practices

1. **Use stateless endpoints** for one-off requests
2. **Use sessions** for batch processing multiple URLs
3. **Always close sessions** when finished to free resources
4. **Implement error handling** - 500 errors may occur on complex sites
5. **Add retry logic** for production use
6. **Respect rate limits** - use responsibly
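
A minimal retry wrapper for point 5 could look like this (the attempt count and backoff constants are illustrative defaults):

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0):
    """Call fn(); on failure, wait and retry with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: propagate the last error
            time.sleep(base_delay * (2 ** attempt))
```

Wrap any single request in it, e.g. `with_retries(lambda: requests.post(url, json=payload))`.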

## Error Handling

- **404**: Session not found
- **500**: Internal server error (check the `detail` field for specifics)
- **Common issues**:
  - URL unreachable or timeout
  - JavaScript-heavy sites may need `stealthy_fetch`
  - Bot protection may block requests

## Deployment

This Space runs on Docker with:

- Python 3.11
- FastAPI + Uvicorn
- Gradio 5.x
- Playwright for browser automation
- Scrapling for advanced scraping

## License

MIT License - see the LICENSE file for details.

## Credits

Built with [Scrapling](https://github.com/D4Vinci/Scrapling), an advanced web scraping library.

---

**Note**: This is a demonstration Space. For production use, consider self-hosting with appropriate rate limiting and authentication.