---
title: Scrapling - Web Scraping API
emoji: 🕷️
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
license: mit
---
# Scrapling - Advanced Web Scraping API
A powerful web scraping API with AI-powered content extraction, session management, and multiple scraping modes (HTTP, JavaScript rendering, and stealthy browser automation).
## Features
- **REST API** - FastAPI-based endpoints for programmatic access
- **AI-Powered Extraction** - Natural language queries for content extraction
- **Session Management** - Persistent sessions for efficient batch processing
- **Multiple Scraping Modes**:
  - Standard HTTP (fast, for sites with little bot protection)
  - Dynamic fetching (JavaScript support)
  - Stealthy browser (anti-bot bypass)
- **Structured Output** - Returns data in JSON, Markdown, HTML, or Text formats
- **Gradio UI** - Interactive web interface for testing
## API Endpoints
### Base URL
```
https://grazieprego-scrapling.hf.space
```
### Quick Reference
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Check API status |
| `/api/scrape` | POST | Stateless scrape request |
| `/api/session` | POST | Create persistent session |
| `/api/session/{id}/scrape` | POST | Scrape using session |
| `/api/session/{id}` | DELETE | Close session |
| `/docs` | GET | API documentation (HTML) |
| `/api-docs` | GET | API documentation (JSON) |
## Usage Examples
### 1. Stateless Scrape (One-off requests)
```bash
curl -X POST https://grazieprego-scrapling.hf.space/api/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "query": "Extract all product prices",
    "model_name": "alias-fast"
  }'
```
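The same stateless request can also be sketched in Python using only the standard library. The helper names `build_scrape_request` and `send`, and the 60-second timeout, are illustrative assumptions and not part of the API; the endpoint and JSON fields are the ones used in the curl call above.

```python
import json
import urllib.request

BASE_URL = "https://grazieprego-scrapling.hf.space"


def build_scrape_request(url, query, model_name="alias-fast"):
    """Build (but do not send) a POST request for /api/scrape."""
    body = json.dumps({"url": url, "query": query, "model_name": model_name})
    return urllib.request.Request(
        f"{BASE_URL}/api/scrape",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and decode the JSON reply (timeout is an assumed value)."""
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read().decode("utf-8"))
```

Calling `send(build_scrape_request("https://example.com", "Extract all product prices"))` should then return the parsed JSON reply, provided the Space is reachable.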
### 2. Session-Based Scraping (Multiple requests)
```python
import requests

# Create session
session = requests.post(
    'https://grazieprego-scrapling.hf.space/api/session',
    json={'model_name': 'alias-fast'}
)
session_id = session.json()['session_id']

try:
    # Multiple scrapes using the same session
    urls = [
        'https://example.com/page1',
        'https://example.com/page2',
        'https://example.com/page3'
    ]
    for url in urls:
        result = requests.post(
            f'https://grazieprego-scrapling.hf.space/api/session/{session_id}/scrape',
            json={'url': url, 'query': 'Extract product data'}
        )
        print(f"Scraped {url}: {result.json()}")
finally:
    # Always close the session
    requests.delete(f'https://grazieprego-scrapling.hf.space/api/session/{session_id}')
```
```
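If the create/scrape/close pattern is repeated in several places, it may be convenient to wrap it in a context manager so the session is closed even when a scrape fails. The sketch below is one possible shape, not part of the API: `scraping_session` is a hypothetical helper, and `http` stands for any object with requests-style `post()` and `delete()` methods (for example the `requests` module itself).

```python
from contextlib import contextmanager

BASE_URL = "https://grazieprego-scrapling.hf.space"


@contextmanager
def scraping_session(http, model_name="alias-fast"):
    """Create a session, yield a scrape function, and always delete the session."""
    created = http.post(f"{BASE_URL}/api/session", json={"model_name": model_name})
    created.raise_for_status()
    session_id = created.json()["session_id"]

    def scrape(url, query):
        # Reuses the session created above for every call.
        reply = http.post(
            f"{BASE_URL}/api/session/{session_id}/scrape",
            json={"url": url, "query": query},
        )
        reply.raise_for_status()
        return reply.json()

    try:
        yield scrape
    finally:
        # Runs even if a scrape raises, so the server-side session is released.
        http.delete(f"{BASE_URL}/api/session/{session_id}")
```

With `requests` installed, usage would look like `with scraping_session(requests) as scrape: data = scrape("https://example.com/page1", "Extract product data")`.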
### 3. Using the Gradio UI
Open the Space at the base URL above and use the interactive interface:
- **Fetch (HTTP)** tab: For standard HTTP scraping
- **Stealthy Fetch (Browser)** tab: For sites with bot protection
## API Documentation
- **HTML Docs**: https://grazieprego-scrapling.hf.space/docs
- **JSON Docs**: https://grazieprego-scrapling.hf.space/api-docs
## Request Parameters
### `/api/scrape` & `/api/session/{id}/scrape`
```json
{
  "url": "https://example.com",
  "query": "Extract all headings and prices",
  "model_name": "alias-fast"
}
```
**Parameters:**
- `url` (string, required): The URL to scrape
- `query` (string, required): Natural language extraction instruction
- `model_name` (string, optional): AI model to use (default: "alias-fast")
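A client may want to check these parameters before sending anything. The sketch below mirrors the three fields and the documented default; the class name `ScrapeParams` and the specific checks (absolute http(s) URL, non-empty query) are client-side assumptions, since the server's own validation rules are not documented here.

```python
from dataclasses import asdict, dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class ScrapeParams:
    url: str
    query: str
    model_name: str = "alias-fast"  # documented default

    def __post_init__(self):
        parsed = urlparse(self.url)
        # Assumed rule: only absolute http(s) URLs are worth sending.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"url must be an absolute http(s) URL, got {self.url!r}")
        # Assumed rule: a blank query cannot describe what to extract.
        if not self.query.strip():
            raise ValueError("query must be a non-empty instruction")

    def to_payload(self):
        """Return the JSON-serialisable request body."""
        return asdict(self)
```

`ScrapeParams("https://example.com", "Extract all headings and prices").to_payload()` yields the request body shown above.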
### `/api/session`
```json
{
  "model_name": "alias-fast"
}
```
## Response Format
```json
{
  "url": "https://example.com",
  "query": "Extract prices",
  "response": {
    "status": 200,
    "content": ["# Product 1: $19.99", "# Product 2: $29.99"],
    "url": "https://example.com"
  }
}
```
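To pull the extracted items out of a reply shaped like the one above, something along these lines should work. `extract_content` is a hypothetical helper; tolerating a plain string in `content` is an assumption made for safety, as only the list form is shown in this README.

```python
import json

SAMPLE_REPLY = """
{
  "url": "https://example.com",
  "query": "Extract prices",
  "response": {
    "status": 200,
    "content": ["# Product 1: $19.99", "# Product 2: $29.99"],
    "url": "https://example.com"
  }
}
"""


def extract_content(payload):
    """Return (status, content_items) from a parsed scrape reply."""
    response = payload.get("response", {})
    content = response.get("content", [])
    # Assumption: wrap a bare string so callers can always iterate over a list.
    if isinstance(content, str):
        content = [content]
    return response.get("status"), list(content)


status, items = extract_content(json.loads(SAMPLE_REPLY))
for item in items:
    print(status, item)
```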
## Best Practices
1. **Use stateless endpoints** for one-off requests
2. **Use sessions** for batch processing multiple URLs
3. **Always close sessions** when finished to free resources
4. **Implement error handling** - 500 errors may occur on complex sites
5. **Add retry logic** for production use
6. **Respect rate limits** - use responsibly
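Items 4 and 5 can be combined in a small retry wrapper with exponential backoff. This is a generic sketch, not something the API provides: `with_retries`, the attempt count, and the delays are all illustrative choices, and production code would likely retry only on specific errors rather than on every exception.

```python
import time


def with_retries(call, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run call() until it succeeds or the attempts are used up.

    Waits base_delay, then 2x, then 4x, ... between failed tries.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
```

A scrape could then be wrapped as `with_retries(lambda: do_scrape(url))`, where `do_scrape` is whatever function performs the request and raises on failure.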
## Error Handling
- **404**: Session not found
- **500**: Internal server error (check `detail` field for specifics)
- **Common issues**:
- URL unreachable or timeout
- JavaScript-heavy sites may need the stealthy browser mode (`stealthy_fetch`)
- Bot protection may block requests
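On the client side, the status codes above can be turned into distinct exceptions so callers can react differently to a missing session and a failed scrape. The exception and function names below are hypothetical; the only API facts relied on are the 404 and 500 meanings and the `detail` field.

```python
class SessionNotFound(Exception):
    """Raised for 404: session not found."""


class ScrapeFailed(Exception):
    """Raised for 500 and any other error status."""


def check_reply(status_code, body):
    """Return body on success, otherwise raise an exception describing the failure."""
    if status_code < 400:
        return body
    # `detail` is the documented error field; fall back gracefully if it is absent.
    if isinstance(body, dict):
        detail = body.get("detail", "no detail provided")
    else:
        detail = str(body)
    if status_code == 404:
        raise SessionNotFound(detail)
    raise ScrapeFailed(f"HTTP {status_code}: {detail}")
```

A caller that sees `SessionNotFound` might create a fresh session, while `ScrapeFailed` is a candidate for a retry or for switching scraping mode.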
## Deployment
This space uses Docker with:
- Python 3.11
- FastAPI + Uvicorn
- Gradio 5.x
- Playwright for browser automation
- Scrapling for advanced scraping
## License
MIT License - See LICENSE file for details
## Credits
Built with [Scrapling](https://github.com/D4Vinci/Scrapling) - Advanced web scraping library
---
**Note**: This is a demonstration space. For production use, consider self-hosting with appropriate rate limiting and authentication.