---
title: Scrapling - Web Scraping API
emoji: 🕷️
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
license: mit
---
# Scrapling - Advanced Web Scraping API
A powerful web scraping API with AI-powered content extraction, session management, and multiple scraping modes (HTTP, JavaScript rendering, and stealthy browser automation).
## Features
- 🚀 **REST API** - FastAPI-based endpoints for programmatic access
- 🤖 **AI-Powered Extraction** - Natural language queries for content extraction
- 🔐 **Session Management** - Persistent sessions for efficient batch processing
- 🌐 **Multiple Scraping Modes**:
  - Standard HTTP (fast, low protection)
  - Dynamic fetching (JavaScript support)
  - Stealthy browser (anti-bot bypass)
- 📊 **Structured Output** - Returns data in JSON, Markdown, HTML, or Text formats
- 🎨 **Gradio UI** - Interactive web interface for testing
## API Endpoints
### Base URL
```
https://grazieprego-scrapling.hf.space
```
### Quick Reference
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Check API status |
| `/api/scrape` | POST | Stateless scrape request |
| `/api/session` | POST | Create persistent session |
| `/api/session/{id}/scrape` | POST | Scrape using session |
| `/api/session/{id}` | DELETE | Close session |
| `/docs` | GET | API documentation (HTML) |
| `/api-docs` | GET | API documentation (JSON) |
## Usage Examples
### 1. Stateless Scrape (One-off requests)
```bash
curl -X POST https://grazieprego-scrapling.hf.space/api/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "query": "Extract all product prices",
    "model_name": "alias-fast"
  }'
```
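The same request can be issued from Python. The sketch below uses only the standard library (`urllib`) so it has no extra dependencies; the endpoint and field names follow the parameter reference later in this README.

```python
import json
import urllib.request

BASE = "https://grazieprego-scrapling.hf.space"

def build_scrape_request(url: str, query: str, model_name: str = "alias-fast") -> urllib.request.Request:
    """Build a POST request for the stateless /api/scrape endpoint."""
    body = json.dumps({"url": url, "query": query, "model_name": model_name}).encode()
    return urllib.request.Request(
        f"{BASE}/api/scrape",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# To send it (network call):
# with urllib.request.urlopen(build_scrape_request("https://example.com", "Extract prices"), timeout=120) as resp:
#     result = json.load(resp)
```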
### 2. Session-Based Scraping (Multiple requests)
```python
import requests

# Create session
session = requests.post(
    'https://grazieprego-scrapling.hf.space/api/session',
    json={'model_name': 'alias-fast'}
)
session_id = session.json()['session_id']

try:
    # Multiple scrapes using the same session
    urls = [
        'https://example.com/page1',
        'https://example.com/page2',
        'https://example.com/page3'
    ]
    for url in urls:
        result = requests.post(
            f'https://grazieprego-scrapling.hf.space/api/session/{session_id}/scrape',
            json={'url': url, 'query': 'Extract product data'}
        )
        print(f"Scraped {url}: {result.json()}")
finally:
    # Always close the session
    requests.delete(f'https://grazieprego-scrapling.hf.space/api/session/{session_id}')
```
### 3. Using the Gradio UI
Visit the space URL and use the interactive interface:
- **Fetch (HTTP)** tab: For standard HTTP scraping
- **Stealthy Fetch (Browser)** tab: For sites with bot protection
## API Documentation
- **HTML Docs**: https://grazieprego-scrapling.hf.space/docs
- **JSON Docs**: https://grazieprego-scrapling.hf.space/api-docs
## Request Parameters
### `/api/scrape` & `/api/session/{id}/scrape`
```json
{
  "url": "https://example.com",
  "query": "Extract all headings and prices",
  "model_name": "alias-fast"
}
```
**Parameters:**
- `url` (string, required): The URL to scrape
- `query` (string, required): Natural language extraction instruction
- `model_name` (string, optional): AI model to use (default: "alias-fast")
### `/api/session`
```json
{
  "model_name": "alias-fast"
}
```
## Response Format
```json
{
  "url": "https://example.com",
  "query": "Extract prices",
  "response": {
    "status": 200,
    "content": ["# Product 1: $19.99", "# Product 2: $29.99"],
    "url": "https://example.com"
  }
}
```
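A quick way to pull the extracted lines out of that envelope — a minimal sketch written against the response shape shown above:

```python
def extract_content(result: dict) -> list[str]:
    """Return the extracted content strings from a scrape response.

    Raises ValueError if the inner status is not 200.
    """
    response = result.get("response", {})
    if response.get("status") != 200:
        raise ValueError(f"scrape failed with status {response.get('status')}")
    return response.get("content", [])

sample = {
    "url": "https://example.com",
    "query": "Extract prices",
    "response": {
        "status": 200,
        "content": ["# Product 1: $19.99", "# Product 2: $29.99"],
        "url": "https://example.com",
    },
}
print(extract_content(sample))  # ['# Product 1: $19.99', '# Product 2: $29.99']
```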
## Best Practices
1. **Use stateless endpoints** for one-off requests
2. **Use sessions** for batch processing multiple URLs
3. **Always close sessions** when finished to free resources
4. **Implement error handling** - 500 errors may occur on complex sites
5. **Add retry logic** for production use
6. **Respect rate limits** - use responsibly
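Points 4 and 5 can be combined into a small retry helper. This is a generic sketch (exponential backoff), not part of the API itself; `scrape()` in the usage comment is a hypothetical wrapper around `/api/scrape`:

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0):
    """Call fn(), retrying on any exception with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts; surface the last error
            time.sleep(base_delay * (2 ** attempt))

# Usage (scrape() is a hypothetical helper that POSTs to /api/scrape):
# data = with_retries(lambda: scrape("https://example.com", "Extract prices"))
```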
## Error Handling
- **404**: Session not found
- **500**: Internal server error (check `detail` field for specifics)
- **Common issues**:
- URL unreachable or timeout
- JavaScript-heavy sites may need `stealthy_fetch`
- Bot protection may block requests
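In client code, the status codes above can be mapped to distinct exceptions so callers can handle a missing session differently from a server failure. A minimal sketch: the `detail` field name comes from this section, while the exception classes are illustrative.

```python
class SessionNotFound(Exception):
    """Raised when the API returns 404 for a session id."""

class ScrapeServerError(Exception):
    """Raised when the API returns a 5xx error."""

def raise_for_scrape_status(status_code: int, body: dict) -> None:
    """Map the API's documented error codes to exceptions, surfacing `detail`."""
    if status_code == 404:
        raise SessionNotFound(body.get("detail", "session not found"))
    if status_code >= 500:
        raise ScrapeServerError(body.get("detail", "internal server error"))
```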
## Deployment
This space uses Docker with:
- Python 3.11
- FastAPI + Uvicorn
- Gradio 5.x
- Playwright for browser automation
- Scrapling for advanced scraping
## License
MIT License - See LICENSE file for details
## Credits
Built with [Scrapling](https://github.com/D4Vinci/Scrapling) - Advanced web scraping library
---
**Note**: This is a demonstration space. For production use, consider self-hosting with appropriate rate limiting and authentication.