# HTTP requests
The `Fetcher` class provides rapid and lightweight HTTP requests using the high-performance `curl_cffi` library, with extensive stealth capabilities.
!!! success "Prerequisites"

    1. You've completed or read the [Fetchers basics](../fetching/choosing.md) page to understand what the [Response object](../fetching/choosing.md#response-object) is and which fetcher to use.
    2. You've completed or read the [Querying elements](../parsing/selection.md) page to understand how to find/extract elements from the [Selector](../parsing/main_classes.md#selector)/[Response](../fetching/choosing.md#response-object) object.
    3. You've completed or read the [Main classes](../parsing/main_classes.md) page to know what properties/methods the [Response](../fetching/choosing.md#response-object) class inherits from the [Selector](../parsing/main_classes.md#selector) class.
## Basic Usage
There is one primary way to import this fetcher, and it's the same for all fetchers:
```python
>>> from scrapling.fetchers import Fetcher
```
Check out how to configure the parsing options [here](choosing.md#parser-configuration-in-all-fetchers).
### Shared arguments
All methods for making requests here share some arguments, so let's discuss them first.
- **url**: The targeted URL
- **stealthy_headers**: If enabled (default), it creates and adds real browser headers. It also sets the referer header as if this request came from a Google search of the URL's domain.
- **follow_redirects**: As the name implies, tells the fetcher to follow redirections. **Enabled by default**.
- **timeout**: The number of seconds to wait for each request to finish. **Defaults to 30 seconds**.
- **retries**: The number of retries the fetcher will perform for failed requests. **Defaults to three retries**.
- **retry_delay**: Number of seconds to wait between retry attempts. **Defaults to 1 second**.
- **impersonate**: Impersonate specific browsers' TLS fingerprints. Accepts browser strings or a list of them like `"chrome110"`, `"firefox102"`, `"safari15_5"` to use specific versions or `"chrome"`, `"firefox"`, `"safari"`, `"edge"` to automatically use the latest version available. This makes your requests appear to come from real browsers at the TLS level. If you pass it a list of strings, it will choose a random one with each request. **Defaults to the latest available Chrome version.**
- **http3**: Use HTTP/3 protocol for requests. **Defaults to False**. It might be problematic if used with `impersonate`.
- **cookies**: Cookies to use in the request. Can be a dictionary of `name→value` or a list of dictionaries.
- **proxy**: The proxy through which all of this request's traffic (HTTP and HTTPS) is routed. The accepted format is `http://username:password@localhost:8030`.
- **proxy_auth**: HTTP basic auth for proxy, tuple of (username, password).
- **proxies**: Dict of proxies to use. Format: `{"http": proxy_url, "https": proxy_url}`.
- **proxy_rotator**: A `ProxyRotator` instance for automatic proxy rotation. Cannot be combined with `proxy` or `proxies`.
- **headers**: Headers to include in the request. These can override any header generated by the `stealthy_headers` argument.
- **max_redirects**: Maximum number of redirects. **Defaults to 30**, use -1 for unlimited.
- **verify**: Whether to verify HTTPS certificates. **Defaults to True**.
- **cert**: Tuple of (cert, key) filenames for the client certificate.
- **selector_config**: A dictionary of custom parsing arguments to be used when creating the final `Selector`/`Response` class.
!!! note "Notes:"

    1. The currently available browsers to impersonate are (`"edge"`, `"chrome"`, `"chrome_android"`, `"safari"`, `"safari_beta"`, `"safari_ios"`, `"safari_ios_beta"`, `"firefox"`, `"tor"`)<br/>
    2. The available browsers to impersonate, along with their corresponding versions, are automatically displayed in the argument autocompletion and updated with each `curl_cffi` update.<br/>
    3. If either `impersonate` or `stealthy_headers` is enabled, the fetchers will automatically generate real browser headers that match the browser version used.
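When `impersonate` receives a list, one entry is chosen at random for each request. Conceptually, the selection is just the following (a plain-Python sketch with a hypothetical pool, not the library's code):

```python
import random

# A hypothetical pool mixing a pinned version with bare browser names
browsers = ["chrome110", "firefox", "safari"]

# One random pick per request; the fetcher does the equivalent internally
picked = random.choice(browsers)
```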
Beyond these, for further customization, you can pass any argument that `curl_cffi` supports to any method, as long as that method doesn't already define it.
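To make the retry arguments concrete, here's a standalone sketch of what `retries` and `retry_delay` mean (the function and names below are illustrative, not the library's internals):

```python
import time

def fetch_with_retries(do_request, retries=3, retry_delay=1):
    """Try the request; on failure, wait `retry_delay` seconds and retry,
    up to `retries` extra attempts — mirroring the argument semantics above."""
    last_error = None
    for attempt in range(retries + 1):
        try:
            return do_request()
        except Exception as exc:  # the real fetcher narrows this to request errors
            last_error = exc
            if attempt < retries:
                time.sleep(retry_delay)
    raise last_error

# A stand-in "request" that fails twice, then succeeds
calls = {"count": 0}
def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"

result = fetch_with_retries(flaky, retries=3, retry_delay=0)
# result == "ok" after three attempts
```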
### HTTP Methods
There are additional arguments for each method, depending on the method, such as `params` for GET requests and `data`/`json` for POST/PUT/DELETE requests.
Examples are the best way to explain this:
> Note: The `OPTIONS` and `HEAD` methods are not supported.
#### GET
```python
>>> from scrapling.fetchers import Fetcher
>>> # Basic GET
>>> page = Fetcher.get('https://example.com')
>>> page = Fetcher.get('https://scrapling.requestcatcher.com/get', stealthy_headers=True, follow_redirects=True)
>>> page = Fetcher.get('https://scrapling.requestcatcher.com/get', proxy='http://username:password@localhost:8030')
>>> # With parameters
>>> page = Fetcher.get('https://example.com/search', params={'q': 'query'})
>>>
>>> # With headers
>>> page = Fetcher.get('https://example.com', headers={'User-Agent': 'Custom/1.0'})
>>> # Basic HTTP authentication
>>> page = Fetcher.get("https://example.com", auth=("my_user", "password123"))
>>> # Browser impersonation
>>> page = Fetcher.get('https://example.com', impersonate='chrome')
>>> # HTTP/3 support
>>> page = Fetcher.get('https://example.com', http3=True)
```
And for asynchronous requests, it's only a small adjustment:
```python
>>> from scrapling.fetchers import AsyncFetcher
>>> # Basic GET
>>> page = await AsyncFetcher.get('https://example.com')
>>> page = await AsyncFetcher.get('https://scrapling.requestcatcher.com/get', stealthy_headers=True, follow_redirects=True)
>>> page = await AsyncFetcher.get('https://scrapling.requestcatcher.com/get', proxy='http://username:password@localhost:8030')
>>> # With parameters
>>> page = await AsyncFetcher.get('https://example.com/search', params={'q': 'query'})
>>>
>>> # With headers
>>> page = await AsyncFetcher.get('https://example.com', headers={'User-Agent': 'Custom/1.0'})
>>> # Basic HTTP authentication
>>> page = await AsyncFetcher.get("https://example.com", auth=("my_user", "password123"))
>>> # Browser impersonation
>>> page = await AsyncFetcher.get('https://example.com', impersonate='chrome110')
>>> # HTTP/3 support
>>> page = await AsyncFetcher.get('https://example.com', http3=True)
```
In all cases, the `page` object is a [Response](choosing.md#response-object) object which, as mentioned, is also a [Selector](../parsing/main_classes.md#selector), so you can use it directly:
```python
>>> page.css('.something.something')
>>> page = Fetcher.get('https://api.github.com/events')
>>> page.json()
[{'id': '<redacted>',
  'type': 'PushEvent',
  'actor': {'id': '<redacted>',
            'login': '<redacted>',
            'display_login': '<redacted>',
            'gravatar_id': '',
            'url': 'https://api.github.com/users/<redacted>',
            'avatar_url': 'https://avatars.githubusercontent.com/u/<redacted>'},
  'repo': {'id': '<redacted>',
  ...
```
#### POST
```python
>>> from scrapling.fetchers import Fetcher
>>> # Basic POST
>>> page = Fetcher.post('https://scrapling.requestcatcher.com/post', data={'key': 'value'}, params={'q': 'query'})
>>> page = Fetcher.post('https://scrapling.requestcatcher.com/post', data={'key': 'value'}, stealthy_headers=True, follow_redirects=True)
>>> page = Fetcher.post('https://scrapling.requestcatcher.com/post', data={'key': 'value'}, proxy='http://username:password@localhost:8030', impersonate="chrome")
>>> # Another example of form-encoded data
>>> page = Fetcher.post('https://example.com/submit', data={'username': 'user', 'password': 'pass'}, http3=True)
>>> # JSON data
>>> page = Fetcher.post('https://example.com/api', json={'key': 'value'})
```
And for asynchronous requests, it's only a small adjustment:
```python
>>> from scrapling.fetchers import AsyncFetcher
>>> # Basic POST
>>> page = await AsyncFetcher.post('https://scrapling.requestcatcher.com/post', data={'key': 'value'})
>>> page = await AsyncFetcher.post('https://scrapling.requestcatcher.com/post', data={'key': 'value'}, stealthy_headers=True, follow_redirects=True)
>>> page = await AsyncFetcher.post('https://scrapling.requestcatcher.com/post', data={'key': 'value'}, proxy='http://username:password@localhost:8030', impersonate="chrome")
>>> # Another example of form-encoded data
>>> page = await AsyncFetcher.post('https://example.com/submit', data={'username': 'user', 'password': 'pass'}, http3=True)
>>> # JSON data
>>> page = await AsyncFetcher.post('https://example.com/api', json={'key': 'value'})
```
#### PUT
```python
>>> from scrapling.fetchers import Fetcher
>>> # Basic PUT
>>> page = Fetcher.put('https://example.com/update', data={'status': 'updated'})
>>> page = Fetcher.put('https://example.com/update', data={'status': 'updated'}, stealthy_headers=True, follow_redirects=True, impersonate="chrome")
>>> page = Fetcher.put('https://example.com/update', data={'status': 'updated'}, proxy='http://username:password@localhost:8030')
>>> # Another example of form-encoded data
>>> page = Fetcher.put("https://scrapling.requestcatcher.com/put", data={'key': ['value1', 'value2']})
```
And for asynchronous requests, it's only a small adjustment:
```python
>>> from scrapling.fetchers import AsyncFetcher
>>> # Basic PUT
>>> page = await AsyncFetcher.put('https://example.com/update', data={'status': 'updated'})
>>> page = await AsyncFetcher.put('https://example.com/update', data={'status': 'updated'}, stealthy_headers=True, follow_redirects=True, impersonate="chrome")
>>> page = await AsyncFetcher.put('https://example.com/update', data={'status': 'updated'}, proxy='http://username:password@localhost:8030')
>>> # Another example of form-encoded data
>>> page = await AsyncFetcher.put("https://scrapling.requestcatcher.com/put", data={'key': ['value1', 'value2']})
```
#### DELETE
```python
>>> from scrapling.fetchers import Fetcher
>>> page = Fetcher.delete('https://example.com/resource/123')
>>> page = Fetcher.delete('https://example.com/resource/123', stealthy_headers=True, follow_redirects=True, impersonate="chrome")
>>> page = Fetcher.delete('https://example.com/resource/123', proxy='http://username:password@localhost:8030')
```
And for asynchronous requests, it's only a small adjustment:
```python
>>> from scrapling.fetchers import AsyncFetcher
>>> page = await AsyncFetcher.delete('https://example.com/resource/123')
>>> page = await AsyncFetcher.delete('https://example.com/resource/123', stealthy_headers=True, follow_redirects=True, impersonate="chrome")
>>> page = await AsyncFetcher.delete('https://example.com/resource/123', proxy='http://username:password@localhost:8030')
```
## Session Management
For making multiple requests with the same configuration, use the `FetcherSession` class. It works in both synchronous and asynchronous code; the class automatically detects the context and switches the session type accordingly, without requiring a different import.
The `FetcherSession` class can accept nearly all the arguments that the methods can take, which enables you to specify a config for the entire session and later choose a different config for one of the requests effortlessly, as you will see in the following examples.
```python
from scrapling.fetchers import FetcherSession

# Create a session with default configuration
with FetcherSession(
    impersonate='chrome',
    http3=True,
    stealthy_headers=True,
    timeout=30,
    retries=3
) as session:
    # Make multiple requests with the same settings and the same cookies
    page1 = session.get('https://scrapling.requestcatcher.com/get')
    page2 = session.post('https://scrapling.requestcatcher.com/post', data={'key': 'value'})
    page3 = session.get('https://api.github.com/events')
    # All requests share the same session and connection pool
```
You can also use a `ProxyRotator` with `FetcherSession` for automatic proxy rotation across requests:
```python
from scrapling.fetchers import FetcherSession, ProxyRotator

rotator = ProxyRotator([
    'http://proxy1:8080',
    'http://proxy2:8080',
    'http://proxy3:8080',
])

with FetcherSession(proxy_rotator=rotator, impersonate='chrome') as session:
    # Each request automatically uses the next proxy in rotation
    page1 = session.get('https://example.com/page1')
    page2 = session.get('https://example.com/page2')
    # You can check which proxy was used via the response metadata
    print(page1.meta['proxy'])
```
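Conceptually, round-robin rotation just cycles through the proxy list in order, wrapping around at the end. Here's a plain-Python sketch of the idea (not `ProxyRotator`'s actual implementation):

```python
from itertools import cycle

proxies = ['http://proxy1:8080', 'http://proxy2:8080', 'http://proxy3:8080']
rotation = cycle(proxies)

# Four picks: the fourth wraps back to the first proxy
picked = [next(rotation) for _ in range(4)]
# picked[3] == 'http://proxy1:8080'
```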
You can also override the session proxy (or rotator) for a specific request by passing `proxy=` directly to the request method:
```python
with FetcherSession(proxy='http://default-proxy:8080') as session:
    # Uses the session proxy
    page1 = session.get('https://example.com/page1')
    # Override the proxy for this specific request
    page2 = session.get('https://example.com/page2', proxy='http://special-proxy:9090')
```
And here's an async example
```python
async with FetcherSession(impersonate='firefox', http3=True) as session:
    # All standard HTTP methods available
    response = await session.get('https://example.com')
    response = await session.post('https://scrapling.requestcatcher.com/post', json={'data': 'value'})
    response = await session.put('https://scrapling.requestcatcher.com/put', data={'update': 'info'})
    response = await session.delete('https://scrapling.requestcatcher.com/delete')
```
Or better, run requests concurrently:
```python
import asyncio
from scrapling.fetchers import FetcherSession

# Async session usage
async with FetcherSession(impersonate="safari") as session:
    urls = ['https://example.com/page1', 'https://example.com/page2']
    tasks = [session.get(url) for url in urls]
    pages = await asyncio.gather(*tasks)
```
Under the hood, the `Fetcher` class uses `FetcherSession` to create a temporary session for each request you make.
### Session Benefits
- **Much faster**: Around 10 times faster than creating a new session for each request
- **Cookie persistence**: Automatic cookie handling across requests
- **Resource efficiency**: Better memory and CPU usage for multiple requests
- **Centralized configuration**: Single place to manage request settings
## Examples
Some well-rounded examples to help newcomers to web scraping.
### Basic HTTP Request
```python
from scrapling.fetchers import Fetcher

# Make a request
page = Fetcher.get('https://example.com')

# Check the status
if page.status == 200:
    # Extract title
    title = page.css('title::text').get()
    print(f"Page title: {title}")

    # Extract all links
    links = page.css('a::attr(href)').getall()
    print(f"Found {len(links)} links")
```
### Product Scraping
```python
from scrapling.fetchers import Fetcher

def scrape_products():
    page = Fetcher.get('https://example.com/products')

    # Find all product elements
    products = page.css('.product')

    results = []
    for product in products:
        results.append({
            'title': product.css('.title::text').get(),
            'price': product.css('.price::text').re_first(r'\d+\.\d{2}'),
            'description': product.css('.description::text').get(),
            'in_stock': product.has_class('in-stock')
        })
    return results
```
### Downloading Files
```python
from scrapling.fetchers import Fetcher

page = Fetcher.get('https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/main_cover.png')
with open(file='main_cover.png', mode='wb') as f:
    f.write(page.body)
```
### Pagination Handling
```python
from scrapling.fetchers import Fetcher

def scrape_all_pages():
    base_url = 'https://example.com/products?page={}'
    page_num = 1
    all_products = []

    while True:
        # Get current page
        page = Fetcher.get(base_url.format(page_num))

        # Find products
        products = page.css('.product')
        if not products:
            break

        # Process products
        for product in products:
            all_products.append({
                'name': product.css('.name::text').get(),
                'price': product.css('.price::text').get()
            })

        # Next page
        page_num += 1

    return all_products
```
### Form Submission
```python
from scrapling.fetchers import Fetcher

# Submit login form
response = Fetcher.post(
    'https://example.com/login',
    data={
        'username': 'user@example.com',
        'password': 'password123'
    }
)

# Check login success
if response.status == 200:
    # Extract user info
    user_name = response.css('.user-name::text').get()
    print(f"Logged in as: {user_name}")
```
### Table Extraction
```python
from scrapling.fetchers import Fetcher

def extract_table():
    page = Fetcher.get('https://example.com/data')

    # Find table
    table = page.css('table')[0]

    # Extract headers
    headers = [th.text for th in table.css('thead th')]

    # Extract rows
    rows = []
    for row in table.css('tbody tr'):
        cells = [td.text for td in row.css('td')]
        rows.append(dict(zip(headers, cells)))

    return rows
```
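The `dict(zip(headers, cells))` pattern above pairs each header with the cell at the same position, so every row becomes a mapping:

```python
headers = ["Name", "Price"]
cells = ["Widget", "9.99"]

# zip pairs items positionally; dict turns the pairs into a mapping
row = dict(zip(headers, cells))
# row == {"Name": "Widget", "Price": "9.99"}
```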
### Navigation Menu
```python
from scrapling.fetchers import Fetcher

def extract_menu():
    page = Fetcher.get('https://example.com')

    # Find navigation
    nav = page.css('nav')[0]

    menu = {}
    for item in nav.css('li'):
        links = item.css('a')
        if links:
            link = links[0]
            menu[link.text] = {
                'url': link['href'],
                'has_submenu': bool(item.css('.submenu'))
            }
    return menu
```
## When to Use
Use `Fetcher` when:
- Need rapid HTTP requests.
- Want minimal overhead.
- Don't need JavaScript execution (the website can be scraped through requests).
- Need some stealth features (e.g., the targeted website uses protection but doesn't use JavaScript challenges).
Use `FetcherSession` when:
- Making multiple requests to the same or different sites.
- Need to maintain cookies/authentication between requests.
- Want connection pooling for better performance.
- Require consistent configuration across requests.
- Working with APIs that require a session state.
Use other fetchers when:
- Need browser automation.
- Need advanced anti-bot/stealth capabilities.
- Need JavaScript support or to interact with dynamic content.