---
title: Scrapling - Web Scraping API
emoji: 🕷️
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
license: mit
---
# Scrapling - Advanced Web Scraping API
A powerful web scraping API with AI-powered content extraction, session management, and multiple scraping modes (HTTP, JavaScript rendering, and stealthy browser automation).
## Features
- 🚀 **REST API** - FastAPI-based endpoints for programmatic access
- 🤖 **AI-Powered Extraction** - Natural language queries for content extraction
- 🔐 **Session Management** - Persistent sessions for efficient batch processing
- 🌐 **Multiple Scraping Modes**:
  - Standard HTTP (fast, low protection)
  - Dynamic fetching (JavaScript support)
  - Stealthy browser (anti-bot bypass)
- 📊 **Structured Output** - Returns data in JSON, Markdown, HTML, or Text formats
- 🎨 **Gradio UI** - Interactive web interface for testing
## API Endpoints
### Base URL
```
https://grazieprego-scrapling.hf.space
```
### Quick Reference
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Check API status |
| `/api/scrape` | POST | Stateless scrape request |
| `/api/session` | POST | Create persistent session |
| `/api/session/{id}/scrape` | POST | Scrape using session |
| `/api/session/{id}` | DELETE | Close session |
| `/docs` | GET | API documentation (HTML) |
| `/api-docs` | GET | API documentation (JSON) |
## Usage Examples
### 1. Stateless Scrape (One-off requests)
```bash
curl -X POST https://grazieprego-scrapling.hf.space/api/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "query": "Extract all product prices",
    "model_name": "alias-fast"
  }'
```
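The same request can be issued from Python. The sketch below uses only the standard library (`urllib`) so it has no extra dependencies; the endpoint and field names follow the parameter reference later in this README.

```python
import json
import urllib.request

BASE = "https://grazieprego-scrapling.hf.space"

def build_scrape_request(url: str, query: str, model_name: str = "alias-fast") -> urllib.request.Request:
    """Build a POST request for the stateless /api/scrape endpoint."""
    body = json.dumps({"url": url, "query": query, "model_name": model_name}).encode()
    return urllib.request.Request(
        f"{BASE}/api/scrape",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# To send it (network call):
# with urllib.request.urlopen(build_scrape_request("https://example.com", "Extract prices"), timeout=120) as resp:
#     result = json.load(resp)
```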
### 2. Session-Based Scraping (Multiple requests)
```python
import requests

# Create session
session = requests.post(
    'https://grazieprego-scrapling.hf.space/api/session',
    json={'model_name': 'alias-fast'}
)
session_id = session.json()['session_id']

try:
    # Multiple scrapes using the same session
    urls = [
        'https://example.com/page1',
        'https://example.com/page2',
        'https://example.com/page3'
    ]
    for url in urls:
        result = requests.post(
            f'https://grazieprego-scrapling.hf.space/api/session/{session_id}/scrape',
            json={'url': url, 'query': 'Extract product data'}
        )
        print(f"Scraped {url}: {result.json()}")
finally:
    # Always close the session
    requests.delete(f'https://grazieprego-scrapling.hf.space/api/session/{session_id}')
```
### 3. Using the Gradio UI
Visit the space URL and use the interactive interface:
- **Fetch (HTTP)** tab: For standard HTTP scraping
- **Stealthy Fetch (Browser)** tab: For sites with bot protection
## API Documentation
- **HTML Docs**: https://grazieprego-scrapling.hf.space/docs
- **JSON Docs**: https://grazieprego-scrapling.hf.space/api-docs
## Request Parameters
### `/api/scrape` & `/api/session/{id}/scrape`
```json
{
  "url": "https://example.com",
  "query": "Extract all headings and prices",
  "model_name": "alias-fast"
}
```
**Parameters:**
- `url` (string, required): The URL to scrape
- `query` (string, required): Natural language extraction instruction
- `model_name` (string, optional): AI model to use (default: "alias-fast")
### `/api/session`
```json
{
  "model_name": "alias-fast"
}
```
## Response Format
```json
{
  "url": "https://example.com",
  "query": "Extract prices",
  "response": {
    "status": 200,
    "content": ["# Product 1: $19.99", "# Product 2: $29.99"],
    "url": "https://example.com"
  }
}
```
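A quick way to pull the extracted lines out of that envelope — a minimal sketch written against the response shape shown above:

```python
def extract_content(result: dict) -> list[str]:
    """Return the extracted content strings from a scrape response.

    Raises ValueError if the inner status is not 200.
    """
    response = result.get("response", {})
    if response.get("status") != 200:
        raise ValueError(f"scrape failed with status {response.get('status')}")
    return response.get("content", [])

sample = {
    "url": "https://example.com",
    "query": "Extract prices",
    "response": {
        "status": 200,
        "content": ["# Product 1: $19.99", "# Product 2: $29.99"],
        "url": "https://example.com",
    },
}
print(extract_content(sample))  # ['# Product 1: $19.99', '# Product 2: $29.99']
```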
## Best Practices
1. **Use stateless endpoints** for one-off requests
2. **Use sessions** for batch processing multiple URLs
3. **Always close sessions** when finished to free resources
4. **Implement error handling** - 500 errors may occur on complex sites
5. **Add retry logic** for production use
6. **Respect rate limits** - use responsibly
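Points 4 and 5 can be combined into a small retry helper. This is a generic sketch (exponential backoff), not part of the API itself; `scrape()` in the usage comment is a hypothetical wrapper around `/api/scrape`:

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0):
    """Call fn(), retrying on any exception with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts; surface the last error
            time.sleep(base_delay * (2 ** attempt))

# Usage (scrape() is a hypothetical helper that POSTs to /api/scrape):
# data = with_retries(lambda: scrape("https://example.com", "Extract prices"))
```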
## Error Handling
- **404**: Session not found
- **500**: Internal server error (check `detail` field for specifics)
- **Common issues**:
- URL unreachable or timeout
- JavaScript-heavy sites may need `stealthy_fetch`
- Bot protection may block requests
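In client code, the status codes above can be mapped to distinct exceptions so callers can handle a missing session differently from a server failure. A minimal sketch: the `detail` field name comes from this section, while the exception classes are illustrative.

```python
class SessionNotFound(Exception):
    """Raised when the API returns 404 for a session id."""

class ScrapeServerError(Exception):
    """Raised when the API returns a 5xx error."""

def raise_for_scrape_status(status_code: int, body: dict) -> None:
    """Map the API's documented error codes to exceptions, surfacing `detail`."""
    if status_code == 404:
        raise SessionNotFound(body.get("detail", "session not found"))
    if status_code >= 500:
        raise ScrapeServerError(body.get("detail", "internal server error"))
```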
## Deployment
This space uses Docker with:
- Python 3.11
- FastAPI + Uvicorn
- Gradio 5.x
- Playwright for browser automation
- Scrapling for advanced scraping
## License
MIT License - See LICENSE file for details
## Credits
Built with [Scrapling](https://github.com/D4Vinci/Scrapling) - Advanced web scraping library
---
**Note**: This is a demonstration space. For production use, consider self-hosting with appropriate rate limiting and authentication.