---
title: Scrapling - Web Scraping API
emoji: 🕷️
colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
license: mit
---

# Scrapling - Advanced Web Scraping API

A web scraping API with AI-powered content extraction, session management, and three scraping modes: plain HTTP, JavaScript rendering, and stealthy browser automation.

## Features

- 🚀 **REST API** - FastAPI-based endpoints for programmatic access
- 🤖 **AI-Powered Extraction** - Natural language queries for content extraction
- 🔐 **Session Management** - Persistent sessions for efficient batch processing
- 🌐 **Multiple Scraping Modes**:
  - Standard HTTP (fast; for sites with little or no bot protection)
  - Dynamic fetching (JavaScript support)
  - Stealthy browser (anti-bot bypass)
- 📊 **Structured Output** - Returns data in JSON, Markdown, HTML, or Text formats
- 🎨 **Gradio UI** - Interactive web interface for testing

## API Endpoints

### Base URL
```
https://grazieprego-scrapling.hf.space
```

### Quick Reference

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/health` | GET | Check API status |
| `/api/scrape` | POST | Stateless scrape request |
| `/api/session` | POST | Create persistent session |
| `/api/session/{id}/scrape` | POST | Scrape using session |
| `/api/session/{id}` | DELETE | Close session |
| `/docs` | GET | API documentation (HTML) |
| `/api-docs` | GET | API documentation (JSON) |

## Usage Examples

### 1. Stateless Scrape (One-off requests)

```bash
curl -X POST https://grazieprego-scrapling.hf.space/api/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "query": "Extract all product prices",
    "model_name": "alias-fast"
  }'
```
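
For scripts that cannot shell out to `curl`, the same request can be sketched in Python with only the standard library. The helper names `build_scrape_request` and `scrape` and the 120-second timeout are illustrative assumptions, not part of the API; only the endpoint, headers, and payload fields come from the example above.

```python
import json
import urllib.request

BASE_URL = "https://grazieprego-scrapling.hf.space"


def build_scrape_request(url, query, model_name="alias-fast"):
    """Build (but do not send) the POST request for /api/scrape."""
    payload = {"url": url, "query": query, "model_name": model_name}
    return urllib.request.Request(
        f"{BASE_URL}/api/scrape",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def scrape(url, query, model_name="alias-fast", timeout=120):
    """Send the request and return the decoded JSON body.

    The timeout value is an assumption; browser-based scrapes can be slow.
    """
    request = build_scrape_request(url, query, model_name)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `scrape("https://example.com", "Extract all product prices")` would then issue the same request as the `curl` command. Keeping request construction separate from sending makes the payload easy to inspect without touching the network.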

### 2. Session-Based Scraping (Multiple requests)

```python
import requests

BASE_URL = 'https://grazieprego-scrapling.hf.space'

# Create a session and fail fast if the API rejects the request
created = requests.post(
    f'{BASE_URL}/api/session',
    json={'model_name': 'alias-fast'}
)
created.raise_for_status()
session_id = created.json()['session_id']

try:
    # Multiple scrapes using the same session
    urls = [
        'https://example.com/page1',
        'https://example.com/page2',
        'https://example.com/page3'
    ]

    for url in urls:
        result = requests.post(
            f'{BASE_URL}/api/session/{session_id}/scrape',
            json={'url': url, 'query': 'Extract product data'}
        )
        result.raise_for_status()
        print(f"Scraped {url}: {result.json()}")
finally:
    # Always close the session, even if a scrape failed
    requests.delete(f'{BASE_URL}/api/session/{session_id}')
```
```
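
The create/scrape/close pattern above can also be wrapped in a context manager so the session is closed even when a scrape raises. This is a minimal sketch using the standard library; `call_api` and `scraping_session` are hypothetical helper names, and it assumes every response body is either empty or JSON.

```python
import json
import urllib.request
from contextlib import contextmanager

BASE_URL = "https://grazieprego-scrapling.hf.space"


def call_api(method, path, payload=None, timeout=120):
    """Send a JSON request and return the decoded body, or None if empty."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(
        f"{BASE_URL}{path}",
        data=data,
        headers={"Content-Type": "application/json"},
        method=method,
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = response.read()
    return json.loads(body) if body else None


@contextmanager
def scraping_session(model_name="alias-fast", call=call_api):
    """Create a session, yield its id, and always close it afterwards."""
    created = call("POST", "/api/session", {"model_name": model_name})
    session_id = created["session_id"]
    try:
        yield session_id
    finally:
        # Runs on normal exit and on exceptions alike.
        call("DELETE", f"/api/session/{session_id}")
```

Inside `with scraping_session() as session_id:` each page can then be fetched with `call_api("POST", f"/api/session/{session_id}/scrape", {...})`. The `call` parameter exists only so the transport can be swapped out, for example in tests.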

### 3. Using the Gradio UI

Visit the space URL and use the interactive interface:
- **Fetch (HTTP)** tab: For standard HTTP scraping
- **Stealthy Fetch (Browser)** tab: For sites with bot protection

## API Documentation

- **HTML Docs**: https://grazieprego-scrapling.hf.space/docs
- **JSON Docs**: https://grazieprego-scrapling.hf.space/api-docs

## Request Parameters

### `/api/scrape` & `/api/session/{id}/scrape`

```json
{
  "url": "https://example.com",
  "query": "Extract all headings and prices",
  "model_name": "alias-fast"
}
```

**Parameters:**
- `url` (string, required): The URL to scrape
- `query` (string, required): Natural language extraction instruction
- `model_name` (string, optional): AI model to use (default: "alias-fast")
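
A client can check these parameters before sending anything. The sketch below models the request body as a dataclass; the class name `ScrapeRequest` and the specific checks (absolute http(s) URL, non-empty query) are assumptions about sensible client-side validation, not rules documented by the API.

```python
from dataclasses import asdict, dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class ScrapeRequest:
    url: str                          # required
    query: str                        # required
    model_name: str = "alias-fast"    # optional, documented default

    def __post_init__(self):
        parsed = urlparse(self.url)
        # Assumed rule: only absolute http(s) URLs are worth sending.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"url must be an absolute http(s) URL, got {self.url!r}")
        # Assumed rule: a blank query cannot describe anything to extract.
        if not self.query.strip():
            raise ValueError("query must not be empty")

    def to_payload(self):
        """Return the dict to send as the JSON request body."""
        return asdict(self)
```

`ScrapeRequest("https://example.com", "Extract all headings and prices").to_payload()` yields the JSON body shown above, with `model_name` filled in from the default.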

### `/api/session`

```json
{
  "model_name": "alias-fast"
}
```

## Response Format

```json
{
  "url": "https://example.com",
  "query": "Extract prices",
  "response": {
    "status": 200,
    "content": ["# Product 1: $19.99", "# Product 2: $29.99"],
    "url": "https://example.com"
  }
}
```
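
To pull the extracted items out of a response shaped like the one above, a small helper along these lines may be enough. `extract_content` is a hypothetical name. Treating any nested `status` other than 200 as a failure, and accepting a bare string for `content`, are assumptions rather than documented behavior.

```python
def extract_content(result):
    """Return the extracted content items from a scrape result dict."""
    response = result.get("response") or {}
    status = response.get("status")
    # Assumption: the nested status reflects the fetch of the target page.
    if status != 200:
        raise ValueError(
            f"scrape of {result.get('url')!r} returned status {status!r}"
        )
    content = response.get("content", [])
    # Assumption: content may arrive as a single string instead of a list.
    return [content] if isinstance(content, str) else list(content)
```

Applied to the sample response, this returns the two product lines as a Python list.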

## Best Practices

1. **Use stateless endpoints** for one-off requests
2. **Use sessions** for batch processing multiple URLs
3. **Always close sessions** when finished to free resources
4. **Implement error handling** - 500 errors may occur on complex sites
5. **Add retry logic** for production use
6. **Respect rate limits** - use responsibly
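
The retry advice in item 5 can be sketched as exponential backoff around any callable that performs a request with `urllib`. The function name `with_retries`, the attempt count, and the delays are illustrative assumptions; tune them to whatever limits the deployment actually enforces.

```python
import time
import urllib.error


def with_retries(operation, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call operation(); retry on 5xx responses and network errors.

    Waits base_delay, then 2 * base_delay, and so on between attempts.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        last_attempt = attempt == attempts - 1
        try:
            return operation()
        except urllib.error.HTTPError as error:
            # 4xx errors (such as 404 for a missing session) will not
            # succeed on retry, so surface them immediately.
            if error.code < 500 or last_attempt:
                raise
        except urllib.error.URLError:
            # Connection failures and timeouts reported by urllib.
            if last_attempt:
                raise
        sleep(base_delay * 2 ** attempt)
```

Backing off between attempts also helps with item 6, since a failing client does not hammer the API. The `sleep` parameter is injectable so the delays can be observed or skipped in tests.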

## Error Handling

- **404**: Session not found
- **500**: Internal server error (check `detail` field for specifics)
- **Common issues**:
  - URL unreachable or timeout
  - JavaScript-heavy sites may need `stealthy_fetch`
  - Bot protection may block requests
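
One way to act on these status codes is to translate them into exceptions that carry the `detail` field. The exception classes and `raise_for_api_error` below are hypothetical names, and the sketch assumes error bodies are JSON objects with a `detail` key, falling back to the raw body when they are not.

```python
import json


class SessionNotFound(Exception):
    """Raised for 404: the session id is unknown or already closed."""


class ScrapeFailed(Exception):
    """Raised for any other error status, including 500."""


def raise_for_api_error(status, body):
    """Raise a descriptive exception for an error status; no-op otherwise."""
    if status < 400:
        return
    try:
        parsed = json.loads(body)
        detail = parsed.get("detail", body) if isinstance(parsed, dict) else body
    except (TypeError, ValueError):
        # Body was not valid JSON; report it unchanged.
        detail = body
    if status == 404:
        raise SessionNotFound(detail)
    raise ScrapeFailed(f"HTTP {status}: {detail}")
```

A caller might catch `SessionNotFound` to create a fresh session, and catch `ScrapeFailed` to log the detail and decide whether to retry.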

## Deployment

This space uses Docker with:
- Python 3.11
- FastAPI + Uvicorn
- Gradio 5.x
- Playwright for browser automation
- Scrapling for advanced scraping

## License

MIT License - See LICENSE file for details

## Credits

Built with [Scrapling](https://github.com/D4Vinci/Scrapling) - Advanced web scraping library

---

**Note**: This is a demonstration space. For production use, consider self-hosting with appropriate rate limiting and authentication.