---
title: Scrapling - Web Scraping API
emoji: 🕷️
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
license: mit
---

# Scrapling - Advanced Web Scraping API

A powerful web scraping API with AI-powered content extraction, session management, and multiple scraping modes (HTTP, JavaScript rendering, and stealthy browser automation).

## Features

- 🚀 **REST API** - FastAPI-based endpoints for programmatic access
- 🤖 **AI-Powered Extraction** - Natural language queries for content extraction
- 🔐 **Session Management** - Persistent sessions for efficient batch processing
- 🌐 **Multiple Scraping Modes**:
  - Standard HTTP (fast, low protection)
  - Dynamic fetching (JavaScript support)
  - Stealthy browser (anti-bot bypass)
- 📊 **Structured Output** - Returns data in JSON, Markdown, HTML, or Text formats
- 🎨 **Gradio UI** - Interactive web interface for testing

## API Endpoints

### Base URL
```
https://grazieprego-scrapling.hf.space
```

### Quick Reference

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Check API status |
| `/api/scrape` | POST | Stateless scrape request |
| `/api/session` | POST | Create persistent session |
| `/api/session/{id}/scrape` | POST | Scrape using session |
| `/api/session/{id}` | DELETE | Close session |
| `/docs` | GET | API documentation (HTML) |
| `/api-docs` | GET | API documentation (JSON) |

## Usage Examples

### 1. Stateless Scrape (One-off requests)

```bash
curl -X POST https://grazieprego-scrapling.hf.space/api/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "query": "Extract all product prices",
    "model_name": "alias-fast"
  }'
```

### 2. Session-Based Scraping (Multiple requests)

```python
import requests

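# Create a session; a timeout avoids hanging forever, and
# raise_for_status() fails early if the API returns an error status
session = requests.post(
    'https://grazieprego-scrapling.hf.space/api/session',
    json={'model_name': 'alias-fast'},
    timeout=60
)
session.raise_for_status()
session_id = session.json()['session_id']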

try:
    # Multiple scrapes using the same session
    urls = [
        'https://example.com/page1',
        'https://example.com/page2',
        'https://example.com/page3'
    ]
    
    for url in urls:
        result = requests.post(
            f'https://grazieprego-scrapling.hf.space/api/session/{session_id}/scrape',
            json={'url': url, 'query': 'Extract product data'}
        )
        print(f"Scraped {url}: {result.json()}")
finally:
    # Always close the session
    requests.delete(f'https://grazieprego-scrapling.hf.space/api/session/{session_id}')
```
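
The create/scrape/close pattern above can also be wrapped in a context manager so the session is closed even if a scrape raises. The following is a minimal sketch, not part of the API itself: `scraping_session` and `BASE_URL` are hypothetical names, and `http` is assumed to be any object with `requests`-style `post` and `delete` methods (for example the `requests` module or a `requests.Session()`).

```python
from contextlib import contextmanager
from typing import Any, Iterator

BASE_URL = "https://grazieprego-scrapling.hf.space"


@contextmanager
def scraping_session(http: Any, model_name: str = "alias-fast") -> Iterator[str]:
    """Create an API session, yield its ID, and always delete it afterwards.

    `http` is a hypothetical injection point: anything exposing
    requests-style post(url, json=...) and delete(url) methods.
    """
    resp = http.post(f"{BASE_URL}/api/session", json={"model_name": model_name})
    resp.raise_for_status()
    session_id = resp.json()["session_id"]
    try:
        yield session_id
    finally:
        # Runs on normal exit and on exceptions, mirroring the try/finally above
        http.delete(f"{BASE_URL}/api/session/{session_id}")
```

With `requests` installed, usage might look like `with scraping_session(requests) as sid: ...`, posting to `f"{BASE_URL}/api/session/{sid}/scrape"` inside the block.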

### 3. Using the Gradio UI

Open the Space URL in a browser and use the interactive interface:
- **Fetch (HTTP)** tab: For standard HTTP scraping
- **Stealthy Fetch (Browser)** tab: For sites with bot protection

## API Documentation

- **HTML Docs**: https://grazieprego-scrapling.hf.space/docs
- **JSON Docs**: https://grazieprego-scrapling.hf.space/api-docs

## Request Parameters

### `/api/scrape` & `/api/session/{id}/scrape`

```json
{
  "url": "https://example.com",
  "query": "Extract all headings and prices",
  "model_name": "alias-fast"
}
```

**Parameters:**
- `url` (string, required): The URL to scrape
- `query` (string, required): Natural language extraction instruction
- `model_name` (string, optional): AI model to use (default: "alias-fast")
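
A small client-side helper can assemble this request body and catch obvious mistakes before a request is sent. This is a sketch under assumptions: `build_scrape_payload` is a hypothetical name, and the http(s) and non-empty checks are client-side conveniences, not rules documented by the API.

```python
from typing import Optional
from urllib.parse import urlparse


def build_scrape_payload(url: str, query: str, model_name: Optional[str] = None) -> dict:
    """Build the JSON body for /api/scrape or /api/session/{id}/scrape."""
    parsed = urlparse(url)
    # Assumed client-side check: only absolute http(s) URLs are accepted here
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"not an absolute http(s) URL: {url!r}")
    if not query.strip():
        raise ValueError("query must not be empty")

    payload = {"url": url, "query": query}
    # model_name is optional; omit it so the server applies its default
    if model_name is not None:
        payload["model_name"] = model_name
    return payload
```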

### `/api/session`

```json
{
  "model_name": "alias-fast"
}
```

## Response Format

```json
{
  "url": "https://example.com",
  "query": "Extract prices",
  "response": {
    "status": 200,
    "content": ["# Product 1: $19.99", "# Product 2: $29.99"],
    "url": "https://example.com"
  }
}
```
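
To pull the extracted items out of a response shaped like the sample above, something along these lines may help. It is a sketch that assumes the nested `response.content` field is a list of strings, as in the sample; `extract_content` is a hypothetical helper name.

```python
from typing import Any


def extract_content(payload: dict[str, Any]) -> list[str]:
    """Return the extracted items from a scrape response, or an empty list."""
    inner = payload.get("response") or {}
    content = inner.get("content")
    if content is None:
        return []
    # The sample shows a list of strings; wrap a bare string defensively
    if isinstance(content, str):
        return [content]
    return list(content)


sample = {
    "url": "https://example.com",
    "query": "Extract prices",
    "response": {
        "status": 200,
        "content": ["# Product 1: $19.99", "# Product 2: $29.99"],
        "url": "https://example.com",
    },
}

for item in extract_content(sample):
    print(item)
```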

## Best Practices

1. **Use stateless endpoints** for one-off requests
2. **Use sessions** for batch processing multiple URLs
3. **Always close sessions** when finished to free resources
4. **Implement error handling** - 500 errors may occur on complex sites
5. **Add retry logic** for production use
6. **Respect rate limits** - use responsibly
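
The retry advice above could be implemented with exponential backoff. The sketch below is one possible approach, not something the API provides: `with_retries` is a hypothetical helper, and the attempt count and delays are illustrative defaults rather than documented limits.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying on any exception with exponential backoff.

    Waits base_delay, then 2x, 4x, ... between attempts; the last
    failure is re-raised to the caller.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

When wrapping a `requests` call, have the callable invoke `raise_for_status()` on the response so that a 500 surfaces as an exception and triggers a retry.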

## Error Handling

- **404**: Session not found
- **500**: Internal server error (check `detail` field for specifics)
- **Common issues**:
  - URL unreachable or timeout
  - JavaScript-heavy sites may need `stealthy_fetch`
  - Bot protection may block requests
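
The status codes listed above can be mapped to actionable messages on the client side. This is a minimal sketch: `describe_error` is a hypothetical helper, and it assumes the error body is a JSON object carrying the `detail` field mentioned above.

```python
from typing import Any


def describe_error(status_code: int, body: dict[str, Any]) -> str:
    """Translate a status code and parsed JSON body into a short message."""
    if 200 <= status_code < 300:
        return "OK"
    if status_code == 404:
        return "Session not found; create a new session and try again."
    if status_code == 500:
        # The `detail` field carries the specifics for server errors
        detail = body.get("detail", "no detail provided")
        return f"Internal server error: {detail}"
    return f"Unexpected status {status_code}"


print(describe_error(500, {"detail": "URL unreachable or timeout"}))
```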

## Deployment

This space uses Docker with:
- Python 3.11
- FastAPI + Uvicorn
- Gradio 5.x
- Playwright for browser automation
- Scrapling for advanced scraping

## License

MIT License - See LICENSE file for details

## Credits

Built with [Scrapling](https://github.com/D4Vinci/Scrapling) - Advanced web scraping library

---

**Note**: This is a demonstration space. For production use, consider self-hosting with appropriate rate limiting and authentication.