Karim shoair committed on
Commit edde4f1 · 1 Parent(s): d7284eb

docs: update the fetcher page

Files changed (1): docs/fetching/static.md (+118 -30)
docs/fetching/static.md CHANGED
@@ -1,6 +1,6 @@
# Introduction

- The `Fetcher` class provides fast and lightweight HTTP requests with some stealth capabilities. This class uses [httpx](https://www.python-httpx.org/) as an engine for making requests. For advanced usages, you will need some knowledge about [httpx](https://www.python-httpx.org/), but it becomes simpler and simpler with user feedback and updates.

## Basic Usage
You have one primary way to import this Fetcher, which is the same for all fetchers.
@@ -13,17 +13,34 @@ Check out how to configure the parsing options [here](choosing.md#parser-configu
### Shared arguments
All methods for making requests here share some arguments, so let's discuss them first.

- - **url**: The URL you want to request, of course :)
- **proxy**: As the name implies, the proxy for this request is used to route all traffic (HTTP and HTTPS). The format accepted here is `http://username:password@localhost:8030`.
- - **stealthy_headers**: Generate and use real browser's headers, then create a referer header as if this request came from a Google search page of this URL's domain. Enabled by default, all headers generated can be overwritten by you through the `headers` argument.
- - **follow_redirects**: As the name implies, tell the fetcher to follow redirections. Enabled by default
- - **timeout**: The timeout to wait for each request to be finished in milliseconds. The default is 30000ms (30 seconds).
- - **retries**: The number of retries that [httpx](https://www.python-httpx.org/) will do for failed requests. The default number of retries is 3.

- Other than this, you can pass any arguments that `httpx.<method_name>` takes, and that's why I said, in the beginning, you need a bit of knowledge about [httpx](https://www.python-httpx.org/), but in the following examples, we will try to cover most cases.

### HTTP Methods
- Examples are the best way to explain this

> Note: `OPTIONS` and `HEAD` methods are not supported.
#### GET
@@ -40,6 +57,10 @@ Examples are the best way to explain this
>>> page = Fetcher.get('https://example.com', headers={'User-Agent': 'Custom/1.0'})
>>> # Basic HTTP authentication
>>> page = Fetcher.get("https://example.com", auth=("my_user", "password123"))
```
And for asynchronous requests, it's a small adjustment
```python
@@ -55,8 +76,12 @@ And for asynchronous requests, it's a small adjustment
>>> page = await AsyncFetcher.get('https://example.com', headers={'User-Agent': 'Custom/1.0'})
>>> # Basic HTTP authentication
>>> page = await AsyncFetcher.get("https://example.com", auth=("my_user", "password123"))
```
- Needless to say, the `page` object in all cases is [Response](choosing.md#response-object) object, which is an `Adaptor` as we said, so you will use it directly
```python
>>> page.css('.something.something')

@@ -77,15 +102,13 @@ Needless to say, the `page` object in all cases is [Response](choosing.md#respon
```python
>>> from scrapling.fetchers import Fetcher
>>> # Basic POST
- >>> page = Fetcher.post('https://httpbin.org/post', data={'key': 'value'})
>>> page = Fetcher.post('https://httpbin.org/post', data={'key': 'value'}, stealthy_headers=True, follow_redirects=True)
- >>> page = Fetcher.post('https://httpbin.org/post', data={'key': 'value'}, proxy='http://username:password@localhost:8030')
>>> # Another example of form-encoded data
- >>> page = Fetcher.post('https://example.com/submit', data={'username': 'user', 'password': 'pass'})
>>> # JSON data
>>> page = Fetcher.post('https://example.com/api', json={'key': 'value'})
- >>> # Uploading file
- >>> r = Fetcher.post("https://httpbin.org/post", files={'upload-file': open('something.xlsx', 'rb')})
```
And for asynchronous requests, it's a small adjustment
```python
@@ -93,20 +116,18 @@ And for asynchronous requests, it's a small adjustment
>>> # Basic POST
>>> page = await AsyncFetcher.post('https://httpbin.org/post', data={'key': 'value'})
>>> page = await AsyncFetcher.post('https://httpbin.org/post', data={'key': 'value'}, stealthy_headers=True, follow_redirects=True)
- >>> page = await AsyncFetcher.post('https://httpbin.org/post', data={'key': 'value'}, proxy='http://username:password@localhost:8030')
>>> # Another example of form-encoded data
- >>> page = await AsyncFetcher.post('https://example.com/submit', data={'username': 'user', 'password': 'pass'})
>>> # JSON data
>>> page = await AsyncFetcher.post('https://example.com/api', json={'key': 'value'})
- >>> # Uploading file
- >>> r = await AsyncFetcher.post("https://httpbin.org/post", files={'upload-file': open('something.xlsx', 'rb')})
```
#### PUT
```python
>>> from scrapling.fetchers import Fetcher
>>> # Basic PUT
>>> page = Fetcher.put('https://example.com/update', data={'status': 'updated'})
- >>> page = Fetcher.put('https://example.com/update', data={'status': 'updated'}, stealthy_headers=True, follow_redirects=True)
>>> page = Fetcher.put('https://example.com/update', data={'status': 'updated'}, proxy='http://username:password@localhost:8030')
>>> # Another example of form-encoded data
>>> page = Fetcher.put("https://httpbin.org/put", data={'key': ['value1', 'value2']})
@@ -116,7 +137,7 @@ And for asynchronous requests, it's a small adjustment
>>> from scrapling.fetchers import AsyncFetcher
>>> # Basic PUT
>>> page = await AsyncFetcher.put('https://example.com/update', data={'status': 'updated'})
- >>> page = await AsyncFetcher.put('https://example.com/update', data={'status': 'updated'}, stealthy_headers=True, follow_redirects=True)
>>> page = await AsyncFetcher.put('https://example.com/update', data={'status': 'updated'}, proxy='http://username:password@localhost:8030')
>>> # Another example of form-encoded data
>>> page = await AsyncFetcher.put("https://httpbin.org/put", data={'key': ['value1', 'value2']})
@@ -126,17 +147,77 @@ And for asynchronous requests, it's a small adjustment
```python
>>> from scrapling.fetchers import Fetcher
>>> page = Fetcher.delete('https://example.com/resource/123')
- >>> page = Fetcher.delete('https://example.com/resource/123', stealthy_headers=True, follow_redirects=True)
>>> page = Fetcher.delete('https://example.com/resource/123', proxy='http://username:password@localhost:8030')
```
And for asynchronous requests, it's a small adjustment
```python
>>> from scrapling.fetchers import AsyncFetcher
>>> page = await AsyncFetcher.delete('https://example.com/resource/123')
- >>> page = await AsyncFetcher.delete('https://example.com/resource/123', stealthy_headers=True, follow_redirects=True)
>>> page = await AsyncFetcher.delete('https://example.com/resource/123', proxy='http://username:password@localhost:8030')
```
## Examples
Some well-rounded examples to aid newcomers to Web Scraping

@@ -276,7 +357,7 @@ def extract_menu():
        link = item.css_first('a')
        if link:
            menu[link.text] = {
-               'url': link.attrib['href'],
                'has_submenu': bool(item.css('.submenu'))
            }
@@ -287,14 +368,21 @@ def extract_menu():

Use `Fetcher` when:

- - Need fast HTTP requests
- - Want minimal overhead
- - Don't need JavaScript
- - Want simple configuration
- - Need basic stealth features

Use other fetchers when:

- Need browser automation.
- - Need advanced anti-bot/stealth.
- - Need JavaScript support.
 
# Introduction

+ The `Fetcher` class provides rapid and lightweight HTTP requests using the high-performance `curl_cffi` library, with extensive stealth capabilities.

## Basic Usage
You have one primary way to import this Fetcher, which is the same for all fetchers.
 
### Shared arguments
All methods for making requests here share some arguments, so let's discuss them first.

+ - **url**: The URL to request.
+ - **stealthy_headers**: If enabled (default), it creates and adds real browser headers. It also sets the referer header as if this request came from a Google search of the URL's domain.
+ - **follow_redirects**: As the name implies, tell the fetcher to follow redirections. **Enabled by default**.
+ - **timeout**: The number of seconds to wait for each request to finish. **Defaults to 30 seconds**.
+ - **retries**: The number of retries that the fetcher will do for failed requests. **Defaults to three retries**.
+ - **retry_delay**: The number of seconds to wait between retry attempts. **Defaults to 1 second**.
+ - **impersonate**: Impersonate specific browsers' TLS fingerprints. Accepts browser strings like `"chrome110"`, `"firefox102"`, and `"safari15_5"` to use specific versions, or `"chrome"`, `"firefox"`, `"safari"`, and `"edge"` to automatically use the latest version available. This makes your requests appear as if they're coming from real browsers at the TLS level. **Defaults to the latest available Chrome version.**
+ - **http3**: Use the HTTP/3 protocol for requests. **Defaults to False**. It might be problematic if used with `impersonate`.
+ - **cookies**: Cookies to use in the request. Can be a dictionary of `name→value` pairs or a list of dictionaries.
- **proxy**: As the name implies, the proxy for this request is used to route all traffic (HTTP and HTTPS). The format accepted here is `http://username:password@localhost:8030`.
+ - **proxy_auth**: HTTP basic auth for the proxy, as a tuple of (username, password).
+ - **proxies**: A dict of proxies to use. Format: `{"http": proxy_url, "https": proxy_url}`.
+ - **headers**: Headers to include in the request. Can override any header generated by the `stealthy_headers` argument.
+ - **max_redirects**: The maximum number of redirects. **Defaults to 30**; use -1 for unlimited.
+ - **verify**: Whether to verify HTTPS certificates. **Defaults to True**.
+ - **cert**: A tuple of (cert, key) filenames for the client certificate.
+ - **selector_config**: A dictionary of custom parsing arguments to be used when creating the final `Selector`/`Response` class.
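The `retries`/`retry_delay` behavior above can be pictured as a simple loop. This is only a conceptual sketch with a hypothetical `fetch_with_retries` helper and a fake sender, not Scrapling's actual implementation:

```python
import time

def fetch_with_retries(send, retries=3, retry_delay=1.0):
    # Hypothetical helper: try the request once, then up to `retries` more
    # times, sleeping `retry_delay` seconds between attempts.
    last_error = None
    for attempt in range(retries + 1):
        try:
            return send()
        except ConnectionError as exc:
            last_error = exc
            if attempt < retries:
                time.sleep(retry_delay)
    raise last_error

# Fake sender that fails twice before succeeding
calls = {"count": 0}
def flaky_send():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"

result = fetch_with_retries(flaky_send, retries=3, retry_delay=0)
```

With the defaults, a request that keeps failing is attempted four times in total (the original try plus three retries) before the error is raised.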
 
+ > Note: <br/>
+ > 1. The currently available browsers to impersonate are (`"edge"`, `"chrome"`, `"chrome_android"`, `"safari"`, `"safari_beta"`, `"safari_ios"`, `"safari_ios_beta"`, `"firefox"`, `"tor"`)<br/>
+ > 2. The available browsers to impersonate and their corresponding versions are shown in the argument's autocompletion and are updated automatically with each `curl_cffi` update.
+
+ Other than this, for further customization, you can pass any additional argument that `curl_cffi` supports to any method, as long as that method doesn't already cover it.
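To make the proxy formats above concrete, here is a small standard-library sketch that picks apart the documented `proxy` URL shape and builds the equivalent `proxies` dict (illustration only; the URL and dict shapes are exactly the ones quoted above):

```python
from urllib.parse import urlparse

# The documented single-proxy format: scheme://username:password@host:port
proxy = "http://username:password@localhost:8030"
parts = urlparse(proxy)

# The credentials and endpoint are all encoded in the URL itself
print(parts.username, parts.password, parts.hostname, parts.port)

# The `proxies` argument takes the same kind of URL, keyed by scheme
proxies = {"http": proxy, "https": proxy}
```

So `proxy` covers the common case of one proxy for all traffic, while `proxies` lets you route HTTP and HTTPS through different endpoints.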
 
### HTTP Methods
+ Each method also accepts additional arguments of its own, such as `params` for GET requests and `data`/`json` for POST/PUT/DELETE requests.
+
+ Examples are the best way to explain this.

> Note: `OPTIONS` and `HEAD` methods are not supported.
#### GET
 
>>> page = Fetcher.get('https://example.com', headers={'User-Agent': 'Custom/1.0'})
>>> # Basic HTTP authentication
>>> page = Fetcher.get("https://example.com", auth=("my_user", "password123"))
+ >>> # Browser impersonation
+ >>> page = Fetcher.get('https://example.com', impersonate='chrome')
+ >>> # HTTP/3 support
+ >>> page = Fetcher.get('https://example.com', http3=True)
```
And for asynchronous requests, it's a small adjustment
```python

>>> page = await AsyncFetcher.get('https://example.com', headers={'User-Agent': 'Custom/1.0'})
>>> # Basic HTTP authentication
>>> page = await AsyncFetcher.get("https://example.com", auth=("my_user", "password123"))
+ >>> # Browser impersonation
+ >>> page = await AsyncFetcher.get('https://example.com', impersonate='chrome110')
+ >>> # HTTP/3 support
+ >>> page = await AsyncFetcher.get('https://example.com', http3=True)
```
+ Needless to say, the `page` object in all cases is a [Response](choosing.md#response-object) object, which is a [Selector](../parsing/main_classes.md#selector) as we said, so you can use it directly
```python
>>> page.css('.something.something')

 
```python
>>> from scrapling.fetchers import Fetcher
>>> # Basic POST
+ >>> page = Fetcher.post('https://httpbin.org/post', data={'key': 'value'}, params={'q': 'query'})
>>> page = Fetcher.post('https://httpbin.org/post', data={'key': 'value'}, stealthy_headers=True, follow_redirects=True)
+ >>> page = Fetcher.post('https://httpbin.org/post', data={'key': 'value'}, proxy='http://username:password@localhost:8030', impersonate="chrome")
>>> # Another example of form-encoded data
+ >>> page = Fetcher.post('https://example.com/submit', data={'username': 'user', 'password': 'pass'}, http3=True)
>>> # JSON data
>>> page = Fetcher.post('https://example.com/api', json={'key': 'value'})
```
And for asynchronous requests, it's a small adjustment
```python
 
>>> # Basic POST
>>> page = await AsyncFetcher.post('https://httpbin.org/post', data={'key': 'value'})
>>> page = await AsyncFetcher.post('https://httpbin.org/post', data={'key': 'value'}, stealthy_headers=True, follow_redirects=True)
+ >>> page = await AsyncFetcher.post('https://httpbin.org/post', data={'key': 'value'}, proxy='http://username:password@localhost:8030', impersonate="chrome")
>>> # Another example of form-encoded data
+ >>> page = await AsyncFetcher.post('https://example.com/submit', data={'username': 'user', 'password': 'pass'}, http3=True)
>>> # JSON data
>>> page = await AsyncFetcher.post('https://example.com/api', json={'key': 'value'})
```
#### PUT
```python
>>> from scrapling.fetchers import Fetcher
>>> # Basic PUT
>>> page = Fetcher.put('https://example.com/update', data={'status': 'updated'})
+ >>> page = Fetcher.put('https://example.com/update', data={'status': 'updated'}, stealthy_headers=True, follow_redirects=True, impersonate="chrome")
>>> page = Fetcher.put('https://example.com/update', data={'status': 'updated'}, proxy='http://username:password@localhost:8030')
>>> # Another example of form-encoded data
>>> page = Fetcher.put("https://httpbin.org/put", data={'key': ['value1', 'value2']})

>>> from scrapling.fetchers import AsyncFetcher
>>> # Basic PUT
>>> page = await AsyncFetcher.put('https://example.com/update', data={'status': 'updated'})
+ >>> page = await AsyncFetcher.put('https://example.com/update', data={'status': 'updated'}, stealthy_headers=True, follow_redirects=True, impersonate="chrome")
>>> page = await AsyncFetcher.put('https://example.com/update', data={'status': 'updated'}, proxy='http://username:password@localhost:8030')
>>> # Another example of form-encoded data
>>> page = await AsyncFetcher.put("https://httpbin.org/put", data={'key': ['value1', 'value2']})
 
```python
>>> from scrapling.fetchers import Fetcher
>>> page = Fetcher.delete('https://example.com/resource/123')
+ >>> page = Fetcher.delete('https://example.com/resource/123', stealthy_headers=True, follow_redirects=True, impersonate="chrome")
>>> page = Fetcher.delete('https://example.com/resource/123', proxy='http://username:password@localhost:8030')
```
And for asynchronous requests, it's a small adjustment
```python
>>> from scrapling.fetchers import AsyncFetcher
>>> page = await AsyncFetcher.delete('https://example.com/resource/123')
+ >>> page = await AsyncFetcher.delete('https://example.com/resource/123', stealthy_headers=True, follow_redirects=True, impersonate="chrome")
>>> page = await AsyncFetcher.delete('https://example.com/resource/123', proxy='http://username:password@localhost:8030')
```

+ ## Session Management
+
+ For making multiple requests with the same configuration, use the `FetcherSession` class. It can be used in both synchronous and asynchronous code without issue; the class detects and switches the session type automatically, without requiring a different import.
+
+ The `FetcherSession` class accepts nearly all the arguments that the methods can take, which lets you set a config for the entire session and later choose a different config for any single request effortlessly, as you will see in the following examples.
+
+ ```python
+ from scrapling.fetchers import FetcherSession
+
+ # Create a session with default configuration
+ with FetcherSession(
+     impersonate='chrome',
+     http3=True,
+     stealthy_headers=True,
+     timeout=30,
+     retries=3
+ ) as session:
+     # Make multiple requests with the same settings
+     page1 = session.get('https://httpbin.org/get')
+     page2 = session.post('https://httpbin.org/post', data={'key': 'value'})
+     page3 = session.get('https://api.github.com/events')
+
+ # All requests share the same session and connection pool
+ ```

+ And here's an async example
+
+ ```python
+ from scrapling.fetchers import FetcherSession
+
+ async with FetcherSession(impersonate='firefox', http3=True) as session:
+     # All standard HTTP methods available
+     response = await session.get('https://example.com')
+     response = await session.post('https://httpbin.org/post', json={'data': 'value'})
+     response = await session.put('https://httpbin.org/put', data={'update': 'info'})
+     response = await session.delete('https://httpbin.org/delete')
+ ```
+ Or better, run the requests concurrently
+ ```python
+ import asyncio
+ from scrapling.fetchers import FetcherSession
+
+ # Async session usage
+ async with FetcherSession(impersonate="safari") as session:
+     urls = ['https://example.com/page1', 'https://example.com/page2']
+
+     tasks = [
+         session.get(url) for url in urls
+     ]
+
+     pages = await asyncio.gather(*tasks)
+ ```
+
+ The `Fetcher` class uses `FetcherSession` to create a temporary session with each request you make.
+
+ ### Session Benefits
+
+ - **Much faster**: Up to 10 times faster than creating a new session for each request
+ - **Cookie persistence**: Automatic cookie handling across requests
+ - **Resource efficiency**: Better memory and CPU usage for multiple requests
+ - **Centralized configuration**: A single place to manage request settings
221
  ## Examples
222
  Some well-rounded examples to aid newcomers to Web Scraping
223
 
 
357
  link = item.css_first('a')
358
  if link:
359
  menu[link.text] = {
360
+ 'url': link['href'],
361
  'has_submenu': bool(item.css('.submenu'))
362
  }
363
 
 

Use `Fetcher` when:

+ - Need rapid HTTP requests.
+ - Want minimal overhead.
+ - Don't need JavaScript execution (the website can be scraped through plain requests).
+ - Need some stealth features (e.g., the targeted website uses protection but doesn't use JavaScript challenges).
+
+ Use `FetcherSession` when:
+
+ - Making multiple requests to the same or different sites.
+ - Need to maintain cookies/authentication between requests.
+ - Want connection pooling for better performance.
+ - Require consistent configuration across requests.
+ - Working with APIs that require session state.

Use other fetchers when:

- Need browser automation.
+ - Need advanced anti-bot/stealth capabilities.
+ - Need JavaScript support or to interact with dynamic content.