Karim shoair

docs: update the website to reflect the google referer logic

61dec8a 22 days ago

	# HTTP requests

	The `Fetcher` class provides fast, lightweight HTTP requests built on the high-performance `curl_cffi` library, with a range of stealth capabilities.

	!!! success "Prerequisites"

	    1. You've completed or read the [Fetchers basics](../fetching/choosing.md) page to understand what the [Response object](../fetching/choosing.md#response-object) is and which fetcher to use.
	    2. You've completed or read the [Querying elements](../parsing/selection.md) page to understand how to find/extract elements from the [Selector](../parsing/main_classes.md#selector)/[Response](../fetching/choosing.md#response-object) object.
	    3. You've completed or read the [Main classes](../parsing/main_classes.md) page to know what properties/methods the [Response](../fetching/choosing.md#response-object) class is inheriting from the [Selector](../parsing/main_classes.md#selector) class.

	## Basic Usage
	You have one primary way to import this Fetcher, which is the same for all fetchers.

	```python
	>>> from scrapling.fetchers import Fetcher
	```
	Check out how to configure the parsing options [here](choosing.md#parser-configuration-in-all-fetchers).

	### Shared arguments
	All methods for making requests here share some arguments, so let's discuss them first.

	- url: The targeted URL
	- stealthy_headers: If enabled (default), it creates and adds real browser headers. It also sets a Google referer header.
	- follow_redirects: Whether the fetcher follows redirects. Enabled by default.
	- timeout: The number of seconds to wait for each request to be finished. Defaults to 30 seconds.
	- retries: The number of retries that the fetcher will do for failed requests. Defaults to three retries.
	- retry_delay: Number of seconds to wait between retry attempts. Defaults to 1 second.
	- impersonate: Impersonate specific browsers' TLS fingerprints. Accepts browser strings or a list of them like `"chrome110"`, `"firefox102"`, `"safari15_5"` to use specific versions or `"chrome"`, `"firefox"`, `"safari"`, `"edge"` to automatically use the latest version available. This makes your requests appear to come from real browsers at the TLS level. If you pass it a list of strings, it will choose a random one with each request. Defaults to the latest available Chrome version.
	- http3: Use HTTP/3 protocol for requests. Defaults to False. It might be problematic if used with `impersonate`.
	- cookies: Cookies to use in the request. Can be a dictionary of `name→value` or a list of dictionaries.
	- proxy: The proxy used to route all traffic (HTTP and HTTPS) for this request. The accepted format is `http://username:password@localhost:8030`.
	- proxy_auth: HTTP basic auth for proxy, tuple of (username, password).
	- proxies: Dict of proxies to use. Format: `{"http": proxy_url, "https": proxy_url}`.
	- proxy_rotator: A `ProxyRotator` instance for automatic proxy rotation. Cannot be combined with `proxy` or `proxies`.
	- headers: Headers to include in the request. These can override any header generated by the `stealthy_headers` argument.
	- max_redirects: Maximum number of redirects. Defaults to 30, use -1 for unlimited.
	- verify: Whether to verify HTTPS certificates. Defaults to True.
	- cert: Tuple of (cert, key) filenames for the client certificate.
	- selector_config: A dictionary of custom parsing arguments to be used when creating the final `Selector`/`Response` class.
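
To make the `retries` and `retry_delay` arguments more concrete, here is a minimal standard-library sketch of a retry loop. It is not Scrapling's implementation: `call_with_retries` and `flaky` are hypothetical names, and the sketch assumes `retries` counts the total number of attempts, which the list above does not specify.

```python
import time


def call_with_retries(func, retries=3, retry_delay=1.0, sleep=time.sleep):
    """Call func() up to `retries` times, pausing `retry_delay` seconds between attempts."""
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return func()
        except Exception as error:  # a real client would catch narrower errors
            last_error = error
            if attempt < retries:
                sleep(retry_delay)  # wait before the next attempt
    raise last_error


# Hypothetical flaky operation: fails twice, then succeeds.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


# Record the delays instead of actually sleeping, so the sketch runs instantly.
delays = []
result = call_with_retries(flaky, retries=3, retry_delay=1.0, sleep=delays.append)
print(result, calls["count"], delays)
```

Passing the `sleep` function in as a parameter is only a convenience for trying the loop without waiting.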

	!!! note "Notes:"

	    1. The currently available browsers to impersonate are (`"edge"`, `"chrome"`, `"chrome_android"`, `"safari"`, `"safari_beta"`, `"safari_ios"`, `"safari_ios_beta"`, `"firefox"`, `"tor"`)<br/>
	    2. The available browsers to impersonate, along with their corresponding versions, are automatically displayed in the argument autocompletion and updated with each `curl_cffi` update.<br/>
	    3. If either of the arguments `impersonate` or `stealthy_headers` is enabled, the fetchers will automatically generate real browser headers that match the browser version used.

	Beyond these, for further customization, you can pass any other argument that `curl_cffi` supports to any method that doesn't already accept it.
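
Going back to the `impersonate` argument: the description above says that a list of browser strings results in a random pick for each request. The sketch below illustrates that behaviour with the standard library only. `pick_browser` is a hypothetical helper, not part of Scrapling or `curl_cffi`.

```python
import random


def pick_browser(impersonate, rng=random):
    """Return one browser string: the value itself, or a random element of a list."""
    if isinstance(impersonate, str):
        return impersonate
    return rng.choice(list(impersonate))


# A seeded generator keeps the demonstration repeatable.
rng = random.Random(0)
choices = [pick_browser(["chrome", "firefox", "safari"], rng) for _ in range(5)]
single = pick_browser("chrome", rng)
print(choices, single)
```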

	### HTTP Methods
	Each method accepts additional arguments of its own, such as `params` for GET requests and `data`/`json` for POST/PUT/DELETE requests.

	> Note: The `OPTIONS` and `HEAD` methods are not supported.

	The examples below show each method in turn.

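
As a rough picture of where `params`, `data`, and `json` end up in a request, the following standard-library sketch builds each piece by hand. It follows general HTTP conventions; the exact encoding performed by `curl_cffi` is an assumption here, and the `page` and `other` keys are made up for illustration.

```python
import json
from urllib.parse import urlencode

base_url = "https://example.com/search"

# `params` are appended to the URL as a query string.
params = {"q": "query", "page": 2}
full_url = f"{base_url}?{urlencode(params)}"

# `data` given as a dict is conventionally sent as a form-encoded body.
form_body = urlencode({"key": "value", "other": "a b"})

# `json` is serialized to a JSON document in the body.
json_body = json.dumps({"key": "value"})

print(full_url)
print(form_body)
print(json_body)
```
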
	#### GET
	```python
	>>> from scrapling.fetchers import Fetcher
	>>> # Basic GET
	>>> page = Fetcher.get('https://example.com')
	>>> page = Fetcher.get('https://scrapling.requestcatcher.com/get', stealthy_headers=True, follow_redirects=True)
	>>> page = Fetcher.get('https://scrapling.requestcatcher.com/get', proxy='http://username:password@localhost:8030')
	>>> # With parameters
	>>> page = Fetcher.get('https://example.com/search', params={'q': 'query'})
	>>>
	>>> # With headers
	>>> page = Fetcher.get('https://example.com', headers={'User-Agent': 'Custom/1.0'})
	>>> # Basic HTTP authentication
	>>> page = Fetcher.get("https://example.com", auth=("my_user", "password123"))
	>>> # Browser impersonation
	>>> page = Fetcher.get('https://example.com', impersonate='chrome')
	>>> # HTTP/3 support
	>>> page = Fetcher.get('https://example.com', http3=True)
	```
	And for asynchronous requests, it's a small adjustment:
	```python
	>>> from scrapling.fetchers import AsyncFetcher
	>>> # Basic GET
	>>> page = await AsyncFetcher.get('https://example.com')
	>>> page = await AsyncFetcher.get('https://scrapling.requestcatcher.com/get', stealthy_headers=True, follow_redirects=True)
	>>> page = await AsyncFetcher.get('https://scrapling.requestcatcher.com/get', proxy='http://username:password@localhost:8030')
	>>> # With parameters
	>>> page = await AsyncFetcher.get('https://example.com/search', params={'q': 'query'})
	>>>
	>>> # With headers
	>>> page = await AsyncFetcher.get('https://example.com', headers={'User-Agent': 'Custom/1.0'})
	>>> # Basic HTTP authentication
	>>> page = await AsyncFetcher.get("https://example.com", auth=("my_user", "password123"))
	>>> # Browser impersonation
	>>> page = await AsyncFetcher.get('https://example.com', impersonate='chrome110')
	>>> # HTTP/3 support
	>>> page = await AsyncFetcher.get('https://example.com', http3=True)
	```
	In all cases, the `page` object is a [Response](choosing.md#response-object) object, which is also a [Selector](../parsing/main_classes.md#selector), so you can use it directly:
	```python
	>>> page.css('.something.something')

	>>> page = Fetcher.get('https://api.github.com/events')
	>>> page.json()
	[{'id': '<redacted>',
	'type': 'PushEvent',
	'actor': {'id': '<redacted>',
	'login': '<redacted>',
	'display_login': '<redacted>',
	'gravatar_id': '',
	'url': 'https://api.github.com/users/<redacted>',
	'avatar_url': 'https://avatars.githubusercontent.com/u/<redacted>'},
	'repo': {'id': '<redacted>',
	...
	```
	#### POST
	```python
	>>> from scrapling.fetchers import Fetcher
	>>> # Basic POST
	>>> page = Fetcher.post('https://scrapling.requestcatcher.com/post', data={'key': 'value'}, params={'q': 'query'})
	>>> page = Fetcher.post('https://scrapling.requestcatcher.com/post', data={'key': 'value'}, stealthy_headers=True, follow_redirects=True)
	>>> page = Fetcher.post('https://scrapling.requestcatcher.com/post', data={'key': 'value'}, proxy='http://username:password@localhost:8030', impersonate="chrome")
	>>> # Another example of form-encoded data
	>>> page = Fetcher.post('https://example.com/submit', data={'username': 'user', 'password': 'pass'}, http3=True)
	>>> # JSON data
	>>> page = Fetcher.post('https://example.com/api', json={'key': 'value'})
	```
	And for asynchronous requests, it's a small adjustment:
	```python
	>>> from scrapling.fetchers import AsyncFetcher
	>>> # Basic POST
	>>> page = await AsyncFetcher.post('https://scrapling.requestcatcher.com/post', data={'key': 'value'})
	>>> page = await AsyncFetcher.post('https://scrapling.requestcatcher.com/post', data={'key': 'value'}, stealthy_headers=True, follow_redirects=True)
	>>> page = await AsyncFetcher.post('https://scrapling.requestcatcher.com/post', data={'key': 'value'}, proxy='http://username:password@localhost:8030', impersonate="chrome")
	>>> # Another example of form-encoded data
	>>> page = await AsyncFetcher.post('https://example.com/submit', data={'username': 'user', 'password': 'pass'}, http3=True)
	>>> # JSON data
	>>> page = await AsyncFetcher.post('https://example.com/api', json={'key': 'value'})
	```
	#### PUT
	```python
	>>> from scrapling.fetchers import Fetcher
	>>> # Basic PUT
	>>> page = Fetcher.put('https://example.com/update', data={'status': 'updated'})
	>>> page = Fetcher.put('https://example.com/update', data={'status': 'updated'}, stealthy_headers=True, follow_redirects=True, impersonate="chrome")
	>>> page = Fetcher.put('https://example.com/update', data={'status': 'updated'}, proxy='http://username:password@localhost:8030')
	>>> # Another example of form-encoded data
	>>> page = Fetcher.put("https://scrapling.requestcatcher.com/put", data={'key': ['value1', 'value2']})
	```
	And for asynchronous requests, it's a small adjustment:
	```python
	>>> from scrapling.fetchers import AsyncFetcher
	>>> # Basic PUT
	>>> page = await AsyncFetcher.put('https://example.com/update', data={'status': 'updated'})
	>>> page = await AsyncFetcher.put('https://example.com/update', data={'status': 'updated'}, stealthy_headers=True, follow_redirects=True, impersonate="chrome")
	>>> page = await AsyncFetcher.put('https://example.com/update', data={'status': 'updated'}, proxy='http://username:password@localhost:8030')
	>>> # Another example of form-encoded data
	>>> page = await AsyncFetcher.put("https://scrapling.requestcatcher.com/put", data={'key': ['value1', 'value2']})
	```

	#### DELETE
	```python
	>>> from scrapling.fetchers import Fetcher
	>>> page = Fetcher.delete('https://example.com/resource/123')
	>>> page = Fetcher.delete('https://example.com/resource/123', stealthy_headers=True, follow_redirects=True, impersonate="chrome")
	>>> page = Fetcher.delete('https://example.com/resource/123', proxy='http://username:password@localhost:8030')
	```
	And for asynchronous requests, it's a small adjustment:
	```python
	>>> from scrapling.fetchers import AsyncFetcher
	>>> page = await AsyncFetcher.delete('https://example.com/resource/123')
	>>> page = await AsyncFetcher.delete('https://example.com/resource/123', stealthy_headers=True, follow_redirects=True, impersonate="chrome")
	>>> page = await AsyncFetcher.delete('https://example.com/resource/123', proxy='http://username:password@localhost:8030')
	```

	## Session Management

	For making multiple requests with the same configuration, use the `FetcherSession` class. It can be used in both synchronous and asynchronous code without issue; the class automatically detects and changes the session type, without requiring a different import.

	The `FetcherSession` class can accept nearly all the arguments that the methods can take, which enables you to specify a config for the entire session and later choose a different config for one of the requests effortlessly, as you will see in the following examples.

	```python
	from scrapling.fetchers import FetcherSession

	# Create a session with default configuration
	with FetcherSession(
	    impersonate='chrome',
	    http3=True,
	    stealthy_headers=True,
	    timeout=30,
	    retries=3
	) as session:
	    # Make multiple requests with the same settings and the same cookies
	    page1 = session.get('https://scrapling.requestcatcher.com/get')
	    page2 = session.post('https://scrapling.requestcatcher.com/post', data={'key': 'value'})
	    page3 = session.get('https://api.github.com/events')

	    # All requests share the same session and connection pool
	```

	You can also use a `ProxyRotator` with `FetcherSession` for automatic proxy rotation across requests:

	```python
	from scrapling.fetchers import FetcherSession, ProxyRotator

	rotator = ProxyRotator([
	    'http://proxy1:8080',
	    'http://proxy2:8080',
	    'http://proxy3:8080',
	])

	with FetcherSession(proxy_rotator=rotator, impersonate='chrome') as session:
	    # Each request automatically uses the next proxy in rotation
	    page1 = session.get('https://example.com/page1')
	    page2 = session.get('https://example.com/page2')

	    # You can check which proxy was used via the response metadata
	    print(page1.meta['proxy'])
	```
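
For intuition about what "the next proxy in rotation" can mean, here is a small standard-library sketch of round-robin rotation. `RoundRobinProxies` is a hypothetical stand-in, not the `ProxyRotator` class, and whether `ProxyRotator` rotates in strict round-robin order is an assumption.

```python
from itertools import cycle


class RoundRobinProxies:
    """Hypothetical stand-in that hands out proxies in a fixed, repeating order."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._pool = cycle(list(proxies))

    def next_proxy(self):
        # Each call returns the next proxy and wraps around at the end of the list.
        return next(self._pool)


rotation = RoundRobinProxies([
    'http://proxy1:8080',
    'http://proxy2:8080',
    'http://proxy3:8080',
])
used = [rotation.next_proxy() for _ in range(4)]
print(used)
```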

	You can also override the session proxy (or rotator) for a specific request by passing `proxy=` directly to the request method:

	```python
	from scrapling.fetchers import FetcherSession

	with FetcherSession(proxy='http://default-proxy:8080') as session:
	    # Uses the session proxy
	    page1 = session.get('https://example.com/page1')

	    # Override the proxy for this specific request
	    page2 = session.get('https://example.com/page2', proxy='http://special-proxy:9090')
	```
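
One way to picture a per-request override is as a dictionary merge in which the request's arguments win over the session's defaults. The sketch below uses plain dictionaries; `merge_request_config` is a hypothetical helper and does not reflect how `FetcherSession` stores its settings internally.

```python
def merge_request_config(session_defaults, **overrides):
    """Return the effective settings for one request: overrides win over session defaults."""
    return {**session_defaults, **overrides}


session_defaults = {
    "proxy": "http://default-proxy:8080",
    "timeout": 30,
    "impersonate": "chrome",
}

# No overrides: the request inherits everything from the session.
first = merge_request_config(session_defaults)

# Only the proxy changes; the other settings are still inherited.
second = merge_request_config(session_defaults, proxy="http://special-proxy:9090")

print(first["proxy"], second["proxy"], second["timeout"])
```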

	And here's an async example:

	```python
	import asyncio
	from scrapling.fetchers import FetcherSession

	async def main():
	    async with FetcherSession(impersonate='firefox', http3=True) as session:
	        # All standard HTTP methods are available
	        response = await session.get('https://example.com')
	        response = await session.post('https://scrapling.requestcatcher.com/post', json={'data': 'value'})
	        response = await session.put('https://scrapling.requestcatcher.com/put', data={'update': 'info'})
	        response = await session.delete('https://scrapling.requestcatcher.com/delete')

	asyncio.run(main())
	```
	Or, better, run the requests concurrently:
	```python
	import asyncio
	from scrapling.fetchers import FetcherSession

	async def main():
	    # Async session usage
	    async with FetcherSession(impersonate="safari") as session:
	        urls = ['https://example.com/page1', 'https://example.com/page2']

	        # Calling session.get() creates the coroutines; gather() runs them concurrently
	        tasks = [session.get(url) for url in urls]

	        pages = await asyncio.gather(*tasks)

	asyncio.run(main())
	```
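
When the list of URLs is long, it can help to cap how many requests are in flight at once. The sketch below shows the general `asyncio.gather` plus `asyncio.Semaphore` pattern without touching the network. `fake_get` is a hypothetical stand-in for `session.get`, and the limit of 2 is arbitrary.

```python
import asyncio


async def fake_get(url):
    """Hypothetical stand-in for `session.get`; waits briefly instead of using the network."""
    await asyncio.sleep(0.01)
    return f"fetched {url}"


async def fetch_all(urls, limit=2):
    semaphore = asyncio.Semaphore(limit)  # at most `limit` requests in flight

    async def bounded(url):
        async with semaphore:
            return await fake_get(url)

    # gather() returns results in the same order as the awaitables passed in
    return await asyncio.gather(*(bounded(url) for url in urls))


urls = [f"https://example.com/page{n}" for n in range(1, 5)]
pages = asyncio.run(fetch_all(urls))
print(pages)
```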

	The `Fetcher` class uses `FetcherSession` to create a temporary session with each request you make.

	### Session Benefits

	- A lot faster: 10 times faster than creating a new session for each request
	- Cookie persistence: Automatic cookie handling across requests
	- Resource efficiency: Better memory and CPU usage for multiple requests
	- Centralized configuration: Single place to manage request settings

	## Examples
	Here are some well-rounded examples to aid newcomers to web scraping.

	### Basic HTTP Request

	```python
	from scrapling.fetchers import Fetcher

	# Make a request
	page = Fetcher.get('https://example.com')

	# Check the status
	if page.status == 200:
	    # Extract title
	    title = page.css('title::text').get()
	    print(f"Page title: {title}")

	    # Extract all links
	    links = page.css('a::attr(href)').getall()
	    print(f"Found {len(links)} links")
	```

	### Product Scraping

	```python
	from scrapling.fetchers import Fetcher

	def scrape_products():
	    page = Fetcher.get('https://example.com/products')

	    # Find all product elements
	    products = page.css('.product')

	    results = []
	    for product in products:
	        results.append({
	            'title': product.css('.title::text').get(),
	            'price': product.css('.price::text').re_first(r'\d+\.\d{2}'),
	            'description': product.css('.description::text').get(),
	            'in_stock': product.has_class('in-stock')
	        })

	    return results
	```
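
The price pattern used above, `\d+\.\d{2}`, can be tried on its own with Python's `re` module before running it against a live page. This sketch only exercises the regular expression; `first_price` is a hypothetical helper and the sample strings are made up.

```python
import re

PRICE_PATTERN = re.compile(r'\d+\.\d{2}')


def first_price(text):
    """Return the first price-like substring, or None when nothing matches."""
    match = PRICE_PATTERN.search(text)
    return match.group(0) if match else None


plain = first_price('Price: $19.99 (was $24.50)')
missing = first_price('Call for price')
with_comma = first_price('$1,299.00')  # the thousands separator splits the number
print(plain, missing, with_comma)
```

The last case returns `299.00` rather than the full amount, so prices with thousands separators need a broader pattern or some cleanup first.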

	### Downloading Files

	```python
	from scrapling.fetchers import Fetcher

	page = Fetcher.get('https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/main_cover.png')
	with open(file='main_cover.png', mode='wb') as f:
	    f.write(page.body)
	```

	### Pagination Handling

	```python
	from scrapling.fetchers import Fetcher

	def scrape_all_pages():
	    base_url = 'https://example.com/products?page={}'
	    page_num = 1
	    all_products = []

	    while True:
	        # Get current page
	        page = Fetcher.get(base_url.format(page_num))

	        # Find products
	        products = page.css('.product')
	        if not products:
	            break

	        # Process products
	        for product in products:
	            all_products.append({
	                'name': product.css('.name::text').get(),
	                'price': product.css('.price::text').get()
	            })

	        # Next page
	        page_num += 1

	    return all_products
	```
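
The loop above stops only when a page comes back empty, so a site that keeps returning results would never end it. A variation with an upper bound on the page count might look like the sketch below. It runs against an in-memory stand-in rather than a real site; `collect_pages`, `fake_fetch`, and `FAKE_SITE` are hypothetical names.

```python
def collect_pages(fetch_page, max_pages=100):
    """Call fetch_page(1), fetch_page(2), ... until a page is empty or max_pages is reached."""
    items = []
    for page_num in range(1, max_pages + 1):
        batch = fetch_page(page_num)
        if not batch:
            break  # an empty page marks the end of the listing
        items.extend(batch)
    return items


# Hypothetical data source with three pages of results.
FAKE_SITE = {1: ['a', 'b'], 2: ['c'], 3: ['d', 'e']}


def fake_fetch(page_num):
    return FAKE_SITE.get(page_num, [])


everything = collect_pages(fake_fetch)
capped = collect_pages(fake_fetch, max_pages=2)
print(everything, capped)
```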

	### Form Submission

	```python
	from scrapling.fetchers import Fetcher

	# Submit login form
	response = Fetcher.post(
	    'https://example.com/login',
	    data={
	        'username': 'user@example.com',
	        'password': 'password123'
	    }
	)

	# Check login success
	if response.status == 200:
	    # Extract user info
	    user_name = response.css('.user-name::text').get()
	    print(f"Logged in as: {user_name}")
	```

	### Table Extraction

	```python
	from scrapling.fetchers import Fetcher

	def extract_table():
	    page = Fetcher.get('https://example.com/data')

	    # Find table
	    table = page.css('table')[0]

	    # Extract headers
	    headers = [
	        th.text for th in table.css('thead th')
	    ]

	    # Extract rows
	    rows = []
	    for row in table.css('tbody tr'):
	        cells = [td.text for td in row.css('td')]
	        rows.append(dict(zip(headers, cells)))

	    return rows
	```
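
The `dict(zip(headers, cells))` step is worth seeing on plain lists, since `zip` stops at the shorter input and silently drops missing columns. The data below is made up for illustration.

```python
headers = ['Name', 'Price', 'Stock']
raw_rows = [
    ['Widget', '9.99', '12'],
    ['Gadget', '24.50'],  # one cell short
]

# Pair each header with the cell in the same position.
rows = [dict(zip(headers, cells)) for cells in raw_rows]
print(rows)
```

The second row has no `Stock` key, so code that reads the result may prefer `row.get('Stock')` over `row['Stock']`.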

	### Navigation Menu

	```python
	from scrapling.fetchers import Fetcher

	def extract_menu():
	    page = Fetcher.get('https://example.com')

	    # Find navigation
	    nav = page.css('nav')[0]

	    menu = {}
	    for item in nav.css('li'):
	        links = item.css('a')
	        if links:
	            link = links[0]
	            menu[link.text] = {
	                'url': link['href'],
	                'has_submenu': bool(item.css('.submenu'))
	            }

	    return menu
	```

	## When to Use

	Use `Fetcher` when you:

	- Need rapid HTTP requests.
	- Want minimal overhead.
	- Don't need JavaScript execution (the website can be scraped through requests).
	- Need some stealth features (e.g., the targeted website uses protection but not JavaScript challenges).

	Use `FetcherSession` when you:

	- Make multiple requests to the same or different sites.
	- Need to maintain cookies/authentication between requests.
	- Want connection pooling for better performance.
	- Require consistent configuration across requests.
	- Work with APIs that require session state.

	Use other fetchers when you:

	- Need browser automation.
	- Need advanced anti-bot/stealth capabilities.
	- Need JavaScript support or need to interact with dynamic content.