Karim shoair committed on
Commit edde4f1 · 1 Parent(s): d7284eb

docs: update the fetcher page

Files changed (1): docs/fetching/static.md (+118 -30)
docs/fetching/static.md CHANGED
@@ -1,6 +1,6 @@
# Introduction

- The `Fetcher` class provides fast and lightweight HTTP requests with some stealth capabilities. This class uses [httpx](https://www.python-httpx.org/) as an engine for making requests. For advanced usages, you will need some knowledge about [httpx](https://www.python-httpx.org/), but it becomes simpler and simpler with user feedback and updates.

## Basic Usage
You have one primary way to import this Fetcher, which is the same for all fetchers.
@@ -13,17 +13,34 @@ Check out how to configure the parsing options [here](choosing.md#parser-configu
### Shared arguments
All methods for making requests here share some arguments, so let's discuss them first.

- - **url**: The URL you want to request, of course :)
- **proxy**: As the name implies, the proxy for this request is used to route all traffic (HTTP and HTTPS). The format accepted here is `http://username:password@localhost:8030`.
- - **stealthy_headers**: Generate and use real browser's headers, then create a referer header as if this request came from a Google search page of this URL's domain. Enabled by default, all headers generated can be overwritten by you through the `headers` argument.
- - **follow_redirects**: As the name implies, tell the fetcher to follow redirections. Enabled by default
- - **timeout**: The timeout to wait for each request to be finished in milliseconds. The default is 30000ms (30 seconds).
- - **retries**: The number of retries that [httpx](https://www.python-httpx.org/) will do for failed requests. The default number of retries is 3.

- Other than this, you can pass any arguments that `httpx.<method_name>` takes, and that's why I said, in the beginning, you need a bit of knowledge about [httpx](https://www.python-httpx.org/), but in the following examples, we will try to cover most cases.

### HTTP Methods
- Examples are the best way to explain this

> Note: `OPTIONS` and `HEAD` methods are not supported.
#### GET
@@ -40,6 +57,10 @@ Examples are the best way to explain this
>>> page = Fetcher.get('https://example.com', headers={'User-Agent': 'Custom/1.0'})
>>> # Basic HTTP authentication
>>> page = Fetcher.get("https://example.com", auth=("my_user", "password123"))
```
And for asynchronous requests, it's a small adjustment
```python
@@ -55,8 +76,12 @@ And for asynchronous requests, it's a small adjustment
>>> page = await AsyncFetcher.get('https://example.com', headers={'User-Agent': 'Custom/1.0'})
>>> # Basic HTTP authentication
>>> page = await AsyncFetcher.get("https://example.com", auth=("my_user", "password123"))
```
- Needless to say, the `page` object in all cases is [Response](choosing.md#response-object) object, which is an `Adaptor` as we said, so you will use it directly
```python
>>> page.css('.something.something')

@@ -77,15 +102,13 @@ Needless to say, the `page` object in all cases is [Response](choosing.md#respon
```python
>>> from scrapling.fetchers import Fetcher
>>> # Basic POST
- >>> page = Fetcher.post('https://httpbin.org/post', data={'key': 'value'})
>>> page = Fetcher.post('https://httpbin.org/post', data={'key': 'value'}, stealthy_headers=True, follow_redirects=True)
- >>> page = Fetcher.post('https://httpbin.org/post', data={'key': 'value'}, proxy='http://username:password@localhost:8030')
>>> # Another example of form-encoded data
- >>> page = Fetcher.post('https://example.com/submit', data={'username': 'user', 'password': 'pass'})
>>> # JSON data
>>> page = Fetcher.post('https://example.com/api', json={'key': 'value'})
- >>> # Uploading file
- >>> r = Fetcher.post("https://httpbin.org/post", files={'upload-file': open('something.xlsx', 'rb')})
```
And for asynchronous requests, it's a small adjustment
```python
@@ -93,20 +116,18 @@ And for asynchronous requests, it's a small adjustment
>>> # Basic POST
>>> page = await AsyncFetcher.post('https://httpbin.org/post', data={'key': 'value'})
>>> page = await AsyncFetcher.post('https://httpbin.org/post', data={'key': 'value'}, stealthy_headers=True, follow_redirects=True)
- >>> page = await AsyncFetcher.post('https://httpbin.org/post', data={'key': 'value'}, proxy='http://username:password@localhost:8030')
>>> # Another example of form-encoded data
- >>> page = await AsyncFetcher.post('https://example.com/submit', data={'username': 'user', 'password': 'pass'})
>>> # JSON data
>>> page = await AsyncFetcher.post('https://example.com/api', json={'key': 'value'})
- >>> # Uploading file
- >>> r = await AsyncFetcher.post("https://httpbin.org/post", files={'upload-file': open('something.xlsx', 'rb')})
```
#### PUT
```python
>>> from scrapling.fetchers import Fetcher
>>> # Basic PUT
>>> page = Fetcher.put('https://example.com/update', data={'status': 'updated'})
- >>> page = Fetcher.put('https://example.com/update', data={'status': 'updated'}, stealthy_headers=True, follow_redirects=True)
>>> page = Fetcher.put('https://example.com/update', data={'status': 'updated'}, proxy='http://username:password@localhost:8030')
>>> # Another example of form-encoded data
>>> page = Fetcher.put("https://httpbin.org/put", data={'key': ['value1', 'value2']})
@@ -116,7 +137,7 @@ And for asynchronous requests, it's a small adjustment
>>> from scrapling.fetchers import AsyncFetcher
>>> # Basic PUT
>>> page = await AsyncFetcher.put('https://example.com/update', data={'status': 'updated'})
- >>> page = await AsyncFetcher.put('https://example.com/update', data={'status': 'updated'}, stealthy_headers=True, follow_redirects=True)
>>> page = await AsyncFetcher.put('https://example.com/update', data={'status': 'updated'}, proxy='http://username:password@localhost:8030')
>>> # Another example of form-encoded data
>>> page = await AsyncFetcher.put("https://httpbin.org/put", data={'key': ['value1', 'value2']})
@@ -126,17 +147,77 @@ And for asynchronous requests, it's a small adjustment
```python
>>> from scrapling.fetchers import Fetcher
>>> page = Fetcher.delete('https://example.com/resource/123')
- >>> page = Fetcher.delete('https://example.com/resource/123', stealthy_headers=True, follow_redirects=True)
>>> page = Fetcher.delete('https://example.com/resource/123', proxy='http://username:password@localhost:8030')
```
And for asynchronous requests, it's a small adjustment
```python
>>> from scrapling.fetchers import AsyncFetcher
>>> page = await AsyncFetcher.delete('https://example.com/resource/123')
- >>> page = await AsyncFetcher.delete('https://example.com/resource/123', stealthy_headers=True, follow_redirects=True)
>>> page = await AsyncFetcher.delete('https://example.com/resource/123', proxy='http://username:password@localhost:8030')
```
## Examples
Some well-rounded examples to aid newcomers to Web Scraping

@@ -276,7 +357,7 @@ def extract_menu():
        link = item.css_first('a')
        if link:
            menu[link.text] = {
-               'url': link.attrib['href'],
                'has_submenu': bool(item.css('.submenu'))
            }
@@ -287,14 +368,21 @@ def extract_menu():

Use `Fetcher` when:

- - Need fast HTTP requests
- - Want minimal overhead
- - Don't need JavaScript
- - Want simple configuration
- - Need basic stealth features

Use other fetchers when:

- Need browser automation.
- - Need advanced anti-bot/stealth.
- - Need JavaScript support.
 
# Introduction

+ The `Fetcher` class provides rapid and lightweight HTTP requests using the high-performance `curl_cffi` library, with extensive stealth capabilities.

## Basic Usage
You have one primary way to import this Fetcher, which is the same for all fetchers.
 
### Shared arguments
All methods for making requests here share some arguments, so let's discuss them first.

+ - **url**: The URL to request.
+ - **stealthy_headers**: If enabled (default), it creates and adds real browser headers. It also sets the referer header as if this request came from a Google search of the URL's domain.
+ - **follow_redirects**: As the name implies, tell the fetcher to follow redirections. **Enabled by default**.
+ - **timeout**: The number of seconds to wait for each request to finish. **Defaults to 30 seconds**.
+ - **retries**: The number of retries that the fetcher will do for failed requests. **Defaults to three retries**.
+ - **retry_delay**: The number of seconds to wait between retry attempts. **Defaults to 1 second**.
+ - **impersonate**: Impersonate specific browsers' TLS fingerprints. Accepts browser strings like `"chrome110"`, `"firefox102"`, and `"safari15_5"` to use specific versions, or `"chrome"`, `"firefox"`, `"safari"`, and `"edge"` to automatically use the latest version available. This makes your requests appear as if they're coming from real browsers at the TLS level. **Defaults to the latest available Chrome version.**
+ - **http3**: Use the HTTP/3 protocol for requests. **Defaults to False**. It might be problematic if used with `impersonate`.
+ - **cookies**: Cookies to use in the request. Can be a dictionary of `name→value` pairs or a list of dictionaries.
- **proxy**: As the name implies, the proxy for this request is used to route all traffic (HTTP and HTTPS). The format accepted here is `http://username:password@localhost:8030`.
+ - **proxy_auth**: HTTP basic auth for the proxy, as a tuple of (username, password).
+ - **proxies**: A dict of proxies to use. Format: `{"http": proxy_url, "https": proxy_url}`.
+ - **headers**: Headers to include in the request. Can override any header generated by the `stealthy_headers` argument.
+ - **max_redirects**: The maximum number of redirects. **Defaults to 30**; use -1 for unlimited.
+ - **verify**: Whether to verify HTTPS certificates. **Defaults to True**.
+ - **cert**: A tuple of (cert, key) filenames for the client certificate.
+ - **selector_config**: A dictionary of custom parsing arguments to be used when creating the final `Selector`/`Response` class.
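The `retries`/`retry_delay` behavior above can be pictured as a simple loop. This is only a conceptual sketch with a hypothetical `fetch_with_retries` helper and a fake sender, not Scrapling's actual implementation:

```python
import time

def fetch_with_retries(send, retries=3, retry_delay=1.0):
    # Hypothetical helper: try the request once, then up to `retries` more
    # times, sleeping `retry_delay` seconds between attempts.
    last_error = None
    for attempt in range(retries + 1):
        try:
            return send()
        except ConnectionError as exc:
            last_error = exc
            if attempt < retries:
                time.sleep(retry_delay)
    raise last_error

# Fake sender that fails twice before succeeding
calls = {"count": 0}
def flaky_send():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"

result = fetch_with_retries(flaky_send, retries=3, retry_delay=0)
```

With the defaults, a request that keeps failing is attempted four times in total (the original try plus three retries) before the error is raised.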
 
+ > Note: <br/>
+ > 1. The currently available browsers to impersonate are (`"edge"`, `"chrome"`, `"chrome_android"`, `"safari"`, `"safari_beta"`, `"safari_ios"`, `"safari_ios_beta"`, `"firefox"`, `"tor"`)<br/>
+ > 2. The available browsers to impersonate and their corresponding versions are shown in the argument's autocompletion and are updated automatically with each `curl_cffi` update.
+
+ Other than this, for further customization, you can pass any additional argument that `curl_cffi` supports to any method, as long as that method doesn't already cover it.
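To make the proxy formats above concrete, here is a small standard-library sketch that picks apart the documented `proxy` URL shape and builds the equivalent `proxies` dict (illustration only; the URL and dict shapes are exactly the ones quoted above):

```python
from urllib.parse import urlparse

# The documented single-proxy format: scheme://username:password@host:port
proxy = "http://username:password@localhost:8030"
parts = urlparse(proxy)

# The credentials and endpoint are all encoded in the URL itself
print(parts.username, parts.password, parts.hostname, parts.port)

# The `proxies` argument takes the same kind of URL, keyed by scheme
proxies = {"http": proxy, "https": proxy}
```

So `proxy` covers the common case of one proxy for all traffic, while `proxies` lets you route HTTP and HTTPS through different endpoints.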
 
### HTTP Methods
+ Each method also accepts additional arguments of its own, such as `params` for GET requests and `data`/`json` for POST/PUT/DELETE requests.
+
+ Examples are the best way to explain this.

> Note: `OPTIONS` and `HEAD` methods are not supported.
#### GET
 
>>> page = Fetcher.get('https://example.com', headers={'User-Agent': 'Custom/1.0'})
>>> # Basic HTTP authentication
>>> page = Fetcher.get("https://example.com", auth=("my_user", "password123"))
+ >>> # Browser impersonation
+ >>> page = Fetcher.get('https://example.com', impersonate='chrome')
+ >>> # HTTP/3 support
+ >>> page = Fetcher.get('https://example.com', http3=True)
```
And for asynchronous requests, it's a small adjustment
```python

>>> page = await AsyncFetcher.get('https://example.com', headers={'User-Agent': 'Custom/1.0'})
>>> # Basic HTTP authentication
>>> page = await AsyncFetcher.get("https://example.com", auth=("my_user", "password123"))
+ >>> # Browser impersonation
+ >>> page = await AsyncFetcher.get('https://example.com', impersonate='chrome110')
+ >>> # HTTP/3 support
+ >>> page = await AsyncFetcher.get('https://example.com', http3=True)
```
+ Needless to say, the `page` object in all cases is a [Response](choosing.md#response-object) object, which is a [Selector](../parsing/main_classes.md#selector) as we said, so you can use it directly
```python
>>> page.css('.something.something')

 
```python
>>> from scrapling.fetchers import Fetcher
>>> # Basic POST
+ >>> page = Fetcher.post('https://httpbin.org/post', data={'key': 'value'}, params={'q': 'query'})
>>> page = Fetcher.post('https://httpbin.org/post', data={'key': 'value'}, stealthy_headers=True, follow_redirects=True)
+ >>> page = Fetcher.post('https://httpbin.org/post', data={'key': 'value'}, proxy='http://username:password@localhost:8030', impersonate="chrome")
>>> # Another example of form-encoded data
+ >>> page = Fetcher.post('https://example.com/submit', data={'username': 'user', 'password': 'pass'}, http3=True)
>>> # JSON data
>>> page = Fetcher.post('https://example.com/api', json={'key': 'value'})
```
And for asynchronous requests, it's a small adjustment
```python
 
>>> # Basic POST
>>> page = await AsyncFetcher.post('https://httpbin.org/post', data={'key': 'value'})
>>> page = await AsyncFetcher.post('https://httpbin.org/post', data={'key': 'value'}, stealthy_headers=True, follow_redirects=True)
+ >>> page = await AsyncFetcher.post('https://httpbin.org/post', data={'key': 'value'}, proxy='http://username:password@localhost:8030', impersonate="chrome")
>>> # Another example of form-encoded data
+ >>> page = await AsyncFetcher.post('https://example.com/submit', data={'username': 'user', 'password': 'pass'}, http3=True)
>>> # JSON data
>>> page = await AsyncFetcher.post('https://example.com/api', json={'key': 'value'})
```
#### PUT
```python
>>> from scrapling.fetchers import Fetcher
>>> # Basic PUT
>>> page = Fetcher.put('https://example.com/update', data={'status': 'updated'})
+ >>> page = Fetcher.put('https://example.com/update', data={'status': 'updated'}, stealthy_headers=True, follow_redirects=True, impersonate="chrome")
>>> page = Fetcher.put('https://example.com/update', data={'status': 'updated'}, proxy='http://username:password@localhost:8030')
>>> # Another example of form-encoded data
>>> page = Fetcher.put("https://httpbin.org/put", data={'key': ['value1', 'value2']})

>>> from scrapling.fetchers import AsyncFetcher
>>> # Basic PUT
>>> page = await AsyncFetcher.put('https://example.com/update', data={'status': 'updated'})
+ >>> page = await AsyncFetcher.put('https://example.com/update', data={'status': 'updated'}, stealthy_headers=True, follow_redirects=True, impersonate="chrome")
>>> page = await AsyncFetcher.put('https://example.com/update', data={'status': 'updated'}, proxy='http://username:password@localhost:8030')
>>> # Another example of form-encoded data
>>> page = await AsyncFetcher.put("https://httpbin.org/put", data={'key': ['value1', 'value2']})
 
```python
>>> from scrapling.fetchers import Fetcher
>>> page = Fetcher.delete('https://example.com/resource/123')
+ >>> page = Fetcher.delete('https://example.com/resource/123', stealthy_headers=True, follow_redirects=True, impersonate="chrome")
>>> page = Fetcher.delete('https://example.com/resource/123', proxy='http://username:password@localhost:8030')
```
And for asynchronous requests, it's a small adjustment
```python
>>> from scrapling.fetchers import AsyncFetcher
>>> page = await AsyncFetcher.delete('https://example.com/resource/123')
+ >>> page = await AsyncFetcher.delete('https://example.com/resource/123', stealthy_headers=True, follow_redirects=True, impersonate="chrome")
>>> page = await AsyncFetcher.delete('https://example.com/resource/123', proxy='http://username:password@localhost:8030')
```

+ ## Session Management
+
+ For making multiple requests with the same configuration, use the `FetcherSession` class. It can be used in both synchronous and asynchronous code without issue; the class detects and switches the session type automatically, without requiring a different import.
+
+ The `FetcherSession` class accepts nearly all the arguments that the methods can take, which lets you set a config for the entire session and later choose a different config for any single request effortlessly, as you will see in the following examples.
+
+ ```python
+ from scrapling.fetchers import FetcherSession
+
+ # Create a session with default configuration
+ with FetcherSession(
+     impersonate='chrome',
+     http3=True,
+     stealthy_headers=True,
+     timeout=30,
+     retries=3
+ ) as session:
+     # Make multiple requests with the same settings
+     page1 = session.get('https://httpbin.org/get')
+     page2 = session.post('https://httpbin.org/post', data={'key': 'value'})
+     page3 = session.get('https://api.github.com/events')
+
+ # All requests share the same session and connection pool
+ ```

+ And here's an async example
+
+ ```python
+ from scrapling.fetchers import FetcherSession
+
+ async with FetcherSession(impersonate='firefox', http3=True) as session:
+     # All standard HTTP methods available
+     response = await session.get('https://example.com')
+     response = await session.post('https://httpbin.org/post', json={'data': 'value'})
+     response = await session.put('https://httpbin.org/put', data={'update': 'info'})
+     response = await session.delete('https://httpbin.org/delete')
+ ```
+ Or better, run the requests concurrently
+ ```python
+ import asyncio
+ from scrapling.fetchers import FetcherSession
+
+ # Async session usage
+ async with FetcherSession(impersonate="safari") as session:
+     urls = ['https://example.com/page1', 'https://example.com/page2']
+
+     tasks = [
+         session.get(url) for url in urls
+     ]
+
+     pages = await asyncio.gather(*tasks)
+ ```
+
+ The `Fetcher` class uses `FetcherSession` to create a temporary session with each request you make.
+
+ ### Session Benefits
+
+ - **Much faster**: Up to 10 times faster than creating a new session for each request
+ - **Cookie persistence**: Automatic cookie handling across requests
+ - **Resource efficiency**: Better memory and CPU usage for multiple requests
+ - **Centralized configuration**: A single place to manage request settings
221
  ## Examples
222
  Some well-rounded examples to aid newcomers to Web Scraping
223
 
 
357
  link = item.css_first('a')
358
  if link:
359
  menu[link.text] = {
360
+ 'url': link['href'],
361
  'has_submenu': bool(item.css('.submenu'))
362
  }
363
 
 

Use `Fetcher` when:

+ - Need rapid HTTP requests.
+ - Want minimal overhead.
+ - Don't need JavaScript execution (the website can be scraped through plain requests).
+ - Need some stealth features (e.g., the targeted website uses protection but doesn't use JavaScript challenges).
+
+ Use `FetcherSession` when:
+
+ - Making multiple requests to the same or different sites.
+ - Need to maintain cookies/authentication between requests.
+ - Want connection pooling for better performance.
+ - Require consistent configuration across requests.
+ - Working with APIs that require session state.

Use other fetchers when:

- Need browser automation.
+ - Need advanced anti-bot/stealth capabilities.
+ - Need JavaScript support or to interact with dynamic content.