Karim shoair committed on
Commit
e1908ab
·
1 Parent(s): 9b4c21b

docs: update the StealthyFetcher page

Files changed (1)
  1. docs/fetching/stealthy.md +130 -42
docs/fetching/stealthy.md CHANGED
@@ -1,8 +1,8 @@
1
  # Introduction
2
 
3
- Here, we will discuss the `StealthyFetcher` class. This class is similar to [PlayWrightFetcher](dynamic.md#introduction) in many ways, like browser automation and using [PlayWright](https://playwright.dev/python/docs/intro) as an engine for fetching websites. The main difference is that this class provides advanced anti-bot protection bypass capabilities and a modified Firefox browser called [Camoufox](https://github.com/daijro/camoufox), from which most stealth comes.
4
 
5
- As with [PlayWrightFetcher](dynamic.md#introduction), you will need some knowledge about [PlayWright's Page API](https://playwright.dev/python/docs/api/class-page) to automate the page, as we will explain later.
6
 
7
  ## Basic Usage
8
  You have one primary way to import this Fetcher, which is the same for all fetchers.
@@ -14,40 +14,43 @@ Check out how to configure the parsing options [here](choosing.md#parser-configu
14
 
15
  > Notes:
16
  >
17
- > 1. Every time you fetch a website with this fetcher, it waits by default for all JavaScript to fully load and execute, so you don't have to (waits for the `domcontentloaded` state).
18
  > 2. Of course, the async version of the `fetch` method is the `async_fetch` method.
19
 
20
  ## Full list of arguments
21
  Before jumping to [examples](#examples), here's the full list of arguments:
22
 
23
 
24
- | Argument | Description | Optional |
25
- |:--------------------:|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:--------:|
26
- | url | Target url | ❌ |
27
- | headless | Pass `True` to run the browser in headless/hidden (**default**), `virtual` to run it in virtual screen mode, or `False` for headful/visible mode. The `virtual` mode requires having `xvfb` installed. | ✔️ |
28
- | block_images | Prevent the loading of images through Firefox preferences. _This can help save your proxy usage, but be careful with this option as it makes some websites never finish loading._ | ✔️ |
29
- | disable_resources | Drop requests of unnecessary resources for a speed boost. It depends, but it made requests ~25% faster in my tests for some websites.<br/>Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`. _This can help save your proxy usage, but be careful with this option as it makes some websites never finish loading._ | ✔️ |
30
- | google_search | Enabled by default, Scrapling will set the referer header as if this request came from a Google search for this website's domain name. | ✔️ |
31
- | extra_headers | A dictionary of extra headers to add to the request. _The referer set by the `google_search` argument takes priority over the referer set here if used together._ | ✔️ |
32
- | block_webrtc | Blocks WebRTC entirely. | ✔️ |
33
- | page_action | Added for automation. A function that takes the `page` object and does the automation you need, then returns `page` again. | ✔️ |
34
- | addons | List of Firefox addons to use. **Must be paths to extracted addons.** | ✔️ |
35
- | humanize | Humanize the cursor movement. The cursor movement takes either True or the MAX duration in seconds. The cursor typically takes up to 1.5 seconds to move across the window. | ✔️ |
36
- | allow_webgl | Enabled by default. Disabling WebGL is not recommended, as many WAFs now check if WebGL is enabled. | ✔️ |
37
- | geoip | Recommended to use with proxies; Automatically use IP's longitude, latitude, timezone, country, locale, & spoof the WebRTC IP address. It will also calculate and spoof the browser's language based on the distribution of language speakers in the target region. | ✔️ |
38
- | os_randomize | If enabled, Scrapling will randomize the OS fingerprints used. The default is matching the fingerprints with the current OS. | ✔️ |
39
- | disable_ads | Disabled by default; this installs the `uBlock Origin` addon on the browser if enabled. | ✔️ |
40
- | network_idle | Wait for the page until there are no network connections for at least 500 ms. | ✔️ |
41
- | timeout | The timeout used in all operations and waits through the page. It's in milliseconds, and the default is 30000. | ✔️ |
42
- | wait | The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the `Response` object. | ✔️ |
43
- | wait_selector | Wait for a specific css selector to be in a specific state. | ✔️ |
44
- | wait_selector_state | Scrapling will wait for the given state to be fulfilled for the selector given with `wait_selector`. _Default state is `attached`._ | ✔️ |
45
- | proxy | The proxy to be used with requests. It can be a string or a dictionary with the keys 'server', 'username', and 'password' only. | ✔️ |
46
- | additional_arguments | Arguments passed to Camoufox as additional settings that take higher priority than Scrapling's. | ✔️ |
 
 
 
47
 
48
 
49
  ## Examples
50
- It's easier to understand with examples, so now we will go over most of the arguments individually with examples.
51
 
52
  ### Browser Modes
53
 
@@ -55,9 +58,6 @@ It's easier to understand with examples, so now we will go over most of the argu
55
  # Headless/hidden mode (default)
56
  page = StealthyFetcher.fetch('https://example.com', headless=True)
57
 
58
- # Virtual display mode (requires having `xvfb` installed)
59
- page = StealthyFetcher.fetch('https://example.com', headless='virtual')
60
-
61
  # Visible browser mode
62
  page = StealthyFetcher.fetch('https://example.com', headless=False)
63
  ```
@@ -72,6 +72,37 @@ page = StealthyFetcher.fetch('https://example.com', block_images=True)
72
  page = StealthyFetcher.fetch('https://example.com', disable_resources=True) # Blocks fonts, images, media, etc.
73
  ```
74
 
75
  ### Additional stealth options
76
 
77
  ```python
@@ -79,7 +110,7 @@ page = StealthyFetcher.fetch(
79
  'https://example.com',
80
  block_webrtc=True, # Block WebRTC
81
  allow_webgl=False, # Disable WebGL
82
- humanize=True, # Make the mouse move as how a human would move it
83
  geoip=True, # Use IP's longitude, latitude, timezone, country, and locale, then spoof the WebRTC IP address...
84
  os_randomize=True, # Randomize the OS fingerprints used. The default is matching the fingerprints with the current OS.
85
  disable_ads=True, # Block ads with uBlock Origin addon (enabled by default)
@@ -93,7 +124,7 @@ page = StealthyFetcher.fetch(
93
  )
94
  ```
95
 
96
- The `google_search` argument is enabled by default. It makes the request as if it came from Google, so for a request for `https://example.com`, it will set the referer to `https://www.google.com/search?q=example`. Also, if used together, it takes priority over the referer set by the `extra_headers` argument.
97
 
98
  ### Network Control
99
 
@@ -112,9 +143,9 @@ page = StealthyFetcher.fetch(
112
  ```
113
 
114
  ### Browser Automation
115
- This is where your knowledge about [PlayWright's Page API](https://playwright.dev/python/docs/api/class-page) comes into play. The function you pass here takes the page object from Playwright's API, does what you want, and then returns it again for the current fetcher to continue working on it.
116
 
117
- This function is executed right after waiting for `network_idle` (if enabled) and before waiting for the `wait_selector` argument, so it can be used for many things, not just automation. You can alter the page as you want.
118
 
119
  In the example below, I used page [mouse events](https://playwright.dev/python/docs/api/class-mouse) to move the mouse wheel to scroll the page and then move the mouse.
120
  ```python
@@ -158,14 +189,14 @@ page = StealthyFetcher.fetch(
158
  ```
159
  This is the last wait the fetcher will do before returning the response (if enabled). You pass a CSS selector to the `wait_selector` argument, and the fetcher will wait for the state you passed in the `wait_selector_state` argument to be fulfilled. If you didn't pass a state, the default would be `attached`, which means it will wait for the element to be present in the DOM.
160
 
161
- After that, the fetcher will check again to see if all JS files are loaded and executed (the `domcontentloaded` state) and wait for them to be. If you have enabled `network_idle` with this, the fetcher will wait for `network_idle` to be fulfilled again, as explained above.
162
 
163
- The states the fetcher can wait for can be either ([source](https://playwright.dev/python/docs/api/class-page#page-wait-for-selector)):
164
 
165
- - `attached`: wait for the element to be present in DOM.
166
- - `detached`: wait for the element to not be present in DOM.
167
- - `visible`: wait for the element to have a non-empty bounding box and no `visibility:hidden`. Note that an element without any content or with `display:none` has an empty bounding box and is not considered visible.
168
- - `hidden`: Wait for the element to be detached from DOM, have an empty bounding box, or have `visibility:hidden`. This is opposite to the `'visible'` option.
169
 
170
  ### Firefox Addons
171
 
@@ -179,7 +210,7 @@ page = StealthyFetcher.fetch(
179
  The paths here must be paths of extracted addons, which will be installed automatically upon browser launch.
180
 
181
  ### Real-world example (Amazon)
182
- This is for educational purposes only; this example was generated by AI, which shows too how easy it is to work with Scrapling through AI
183
  ```python
184
  def scrape_amazon_product(url):
185
  # Use StealthyFetcher to bypass protection
@@ -201,6 +232,62 @@ def scrape_amazon_product(url):
201
  }
202
  ```
203
 
204
  ## When to Use
205
 
206
  Use StealthyFetcher when:
@@ -209,4 +296,5 @@ Use StealthyFetcher when:
209
  - Need a reliable browser fingerprint
210
  - Full JavaScript support needed
211
  - Want automatic stealth features
212
- - Need browser automation
 
 
1
  # Introduction
2
 
3
+ Here, we will discuss the `StealthyFetcher` class. This class is similar to [DynamicFetcher](dynamic.md#introduction) in many ways, such as browser automation and utilizing [Playwright's API](https://playwright.dev/python/docs/intro). The main difference is that this class provides advanced anti-bot protection bypass capabilities and uses a modified version of Firefox called [Camoufox](https://github.com/daijro/camoufox), from which most of its stealth comes.
4
 
5
+ As with [DynamicFetcher](dynamic.md#introduction), you will need some knowledge about [Playwright's Page API](https://playwright.dev/python/docs/api/class-page) to automate the page, as we will explain later.
6
 
7
  ## Basic Usage
8
  You have one primary way to import this Fetcher, which is the same for all fetchers.
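As a quick sketch (assuming Scrapling is installed and exposes `StealthyFetcher` from `scrapling.fetchers`, the same module the session examples later on this page import from), a minimal fetch looks like this; `page_summary` is a hypothetical helper for the example, not part of Scrapling:

```python
def page_summary(page):
    # Pull a couple of fields off the returned Response object;
    # `status` and `css_first` follow the Response API shown elsewhere in these docs.
    return {"status": page.status, "title": page.css_first("title::text")}

if __name__ == "__main__":
    from scrapling.fetchers import StealthyFetcher  # assumes Scrapling is installed
    page = StealthyFetcher.fetch('https://example.com')
    print(page_summary(page))
```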
 
14
 
15
  > Notes:
16
  >
17
+ > 1. Every time you fetch a website with this fetcher, it waits by default for all JavaScript to fully load and execute, so you don't have to (it waits for the `domcontentloaded` state).
18
  > 2. Of course, the async version of the `fetch` method is the `async_fetch` method.
19
 
20
  ## Full list of arguments
21
  Before jumping to [examples](#examples), here's the full list of arguments:
22
 
23
 
24
+ | Argument | Description | Optional |
25
+ |:-------------------:|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:--------:|
26
+ | url | Target url | ❌ |
27
+ | headless | Pass `True` to run the browser in headless/hidden (**default**) or `False` for headful/visible mode. | ✔️ |
28
+ | block_images | Prevent the loading of images through Firefox preferences. _This can help save your proxy usage, but be cautious with this option, as it may cause some websites to never finish loading._ | ✔️ |
29
+ | disable_resources | Drop requests of unnecessary resources for a speed boost. It depends, but it made requests ~25% faster in my tests for some websites.<br/>Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`. _This can help save your proxy usage, but be cautious with this option, as it may cause some websites to never finish loading._ | ✔️ |
30
+ | cookies | Set cookies for the next request. | ✔️ |
31
+ | google_search | Enabled by default, Scrapling will set the referer header as if this request came from a Google search of this website's domain name. | ✔️ |
32
+ | extra_headers | A dictionary of extra headers to add to the request. _The referer set by the `google_search` argument takes priority over the referer set here if used together._ | ✔️ |
33
+ | block_webrtc | Blocks WebRTC entirely. | ✔️ |
34
+ | page_action | Added for automation. Pass a function that takes the `page` object and does the necessary automation, then returns `page` again. | ✔️ |
35
+ | addons | List of Firefox addons to use. **Must be paths to extracted addons.** | ✔️ |
36
+ | humanize | Humanize the cursor movement. Takes either `True` or the maximum duration of the cursor movement in seconds. The cursor typically takes up to 1.5 seconds to move across the window. | ✔️ |
37
+ | allow_webgl | Enabled by default. Disabling WebGL is not recommended, as many WAFs now check if WebGL is enabled. | ✔️ |
38
+ | geoip | Recommended when using proxies; automatically uses the IP's longitude, latitude, timezone, country, and locale, then spoofs the WebRTC IP address. It will also calculate and spoof the browser's language based on the distribution of language speakers in the target region. | ✔️ |
39
+ | os_randomize | If enabled, Scrapling will randomize the OS fingerprints used. The default is matching the fingerprints with the current OS. | ✔️ |
40
+ | disable_ads | Disabled by default; this installs the `uBlock Origin` addon on the browser if enabled. | ✔️ |
41
+ | solve_cloudflare | When enabled, the fetcher solves all three types of Cloudflare's Turnstile wait/captcha pages before returning the response to you. | ✔️ |
42
+ | network_idle | Wait for the page until there are no network connections for at least 500 ms. | ✔️ |
43
+ | timeout | The timeout used in all operations and waits through the page. It's in milliseconds, and the default is 30000. | ✔️ |
44
+ | wait | The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the `Response` object. | ✔️ |
45
+ | wait_selector | Wait for a specific css selector to be in a specific state. | ✔️ |
46
+ | wait_selector_state | Scrapling will wait for the given state to be fulfilled for the selector given with `wait_selector`. _Default state is `attached`._ | ✔️ |
47
+ | proxy | The proxy to be used with requests. It can be a string or a dictionary with the keys 'server', 'username', and 'password' only. | ✔️ |
48
+ | additional_args | Additional arguments to be passed to Camoufox as additional settings, and they take higher priority than Scrapling's settings. | ✔️ |
49
+ | selector_config | A dictionary of custom parsing arguments to be used when creating the final `Selector`/`Response` class. | ✔️ |
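For instance, the `proxy` argument accepts either of the two forms described in the table above. A small sketch (the server address and credentials are placeholders):

```python
# String form: one URL carrying the credentials.
proxy_url = "http://user:pass@proxy.example.com:8080"  # placeholder proxy

# Dictionary form: only these three keys are accepted.
proxy_dict = {
    "server": "http://proxy.example.com:8080",
    "username": "user",  # placeholder credentials
    "password": "pass",
}
```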
50
 
51
 
52
  ## Examples
53
+ It's easier to understand with examples, so we will now go over most of the arguments individually.
54
 
55
  ### Browser Modes
56
 
 
58
  # Headless/hidden mode (default)
59
  page = StealthyFetcher.fetch('https://example.com', headless=True)
60
 
61
  # Visible browser mode
62
  page = StealthyFetcher.fetch('https://example.com', headless=False)
63
  ```
 
72
  page = StealthyFetcher.fetch('https://example.com', disable_resources=True) # Blocks fonts, images, media, etc.
73
  ```
74
 
75
+ ### Cloudflare Protection Bypass
76
+
77
+ ```python
78
+ # Automatic Cloudflare solver
79
+ page = StealthyFetcher.fetch(
80
+ 'https://nopecha.com/demo/cloudflare',
81
+ solve_cloudflare=True # Automatically solve Cloudflare challenges
82
+ )
83
+
84
+ # Works with other stealth options
85
+ page = StealthyFetcher.fetch(
86
+ 'https://protected-site.com',
87
+ solve_cloudflare=True,
88
+ humanize=True,
89
+ geoip=True,
90
+ os_randomize=True
91
+ )
92
+ ```
93
+
94
+ The `solve_cloudflare` parameter enables automatic detection and solving of all three types of Cloudflare's Turnstile challenges:
95
+
96
+ - JavaScript challenges (managed)
97
+ - Interactive challenges (clicking verification boxes)
98
+ - Invisible challenges (automatic background verification)
99
+
100
+ **Important notes:**
101
+
102
+ - When `solve_cloudflare=True` is enabled, `humanize=True` is automatically activated for more realistic behavior
103
+ - The timeout should be at least 60 seconds when using the Cloudflare solver, to allow sufficient time for challenge solving
104
+ - This feature works seamlessly with proxies and other stealth options
105
+
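Since `timeout` is in milliseconds (default 30000), a small helper avoids unit mistakes when raising it for the Cloudflare solver. A sketch assuming Scrapling is installed; `seconds_to_ms` is just an illustrative helper, not part of Scrapling:

```python
def seconds_to_ms(seconds: float) -> int:
    # The fetcher's `timeout` argument is expressed in milliseconds.
    return int(seconds * 1000)

if __name__ == "__main__":
    from scrapling.fetchers import StealthyFetcher
    page = StealthyFetcher.fetch(
        'https://nopecha.com/demo/cloudflare',
        solve_cloudflare=True,
        timeout=seconds_to_ms(90),  # comfortably above the recommended 60s minimum
    )
```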
106
  ### Additional stealth options
107
 
108
  ```python
 
110
  'https://example.com',
111
  block_webrtc=True, # Block WebRTC
112
  allow_webgl=False, # Disable WebGL
113
+ humanize=True, # Make the mouse move as a human would move it
114
  geoip=True, # Use IP's longitude, latitude, timezone, country, and locale, then spoof the WebRTC IP address...
115
  os_randomize=True, # Randomize the OS fingerprints used. The default is matching the fingerprints with the current OS.
116
  disable_ads=True, # Block ads with uBlock Origin addon (enabled by default)
 
124
  )
125
  ```
126
 
127
+ The `google_search` argument is enabled by default, making the request look as if it came from a Google search page. So, a request for `https://example.com` will set the referer to `https://www.google.com/search?q=example`. Also, if used together, it takes priority over the referer set by the `extra_headers` argument.
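To illustrate how such a referer is derived (a simplified reconstruction for illustration only, not Scrapling's actual implementation):

```python
from urllib.parse import urlparse

def google_referer(url: str) -> str:
    # Build a referer that looks like a Google search for the site's name,
    # e.g. https://example.com -> https://www.google.com/search?q=example
    hostname = urlparse(url).hostname or ""
    site_name = hostname.removeprefix("www.").split(".")[0]
    return f"https://www.google.com/search?q={site_name}"
```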
128
 
129
  ### Network Control
130
 
 
143
  ```
144
 
145
  ### Browser Automation
146
+ This is where your knowledge about [Playwright's Page API](https://playwright.dev/python/docs/api/class-page) comes into play. The function you pass here takes the page object from Playwright's API, performs the desired action, and then returns it for the current fetcher to continue processing.
147
 
148
+ This function is executed immediately after waiting for `network_idle` (if enabled) and before waiting for the `wait_selector` argument, allowing it to be used for various purposes, not just automation. You can alter the page as you want.
149
 
150
  In the example below, I used page [mouse events](https://playwright.dev/python/docs/api/class-mouse) to move the mouse wheel to scroll the page and then move the mouse.
151
  ```python
 
189
  ```
190
  This is the last wait the fetcher will do before returning the response (if enabled). You pass a CSS selector to the `wait_selector` argument, and the fetcher will wait for the state you passed in the `wait_selector_state` argument to be fulfilled. If you didn't pass a state, the default would be `attached`, which means it will wait for the element to be present in the DOM.
191
 
192
+ After that, the fetcher will check again whether all JS files have loaded and executed (the `domcontentloaded` state) and keep waiting until they have. If you have enabled `network_idle`, the fetcher will wait for `network_idle` to be fulfilled again, as explained above.
193
 
194
+ The states the fetcher can wait for can be any of the following ([source](https://playwright.dev/python/docs/api/class-page#page-wait-for-selector)):
195
 
196
+ - `attached`: Wait for an element to be present in the DOM.
197
+ - `detached`: Wait for an element to not be present in the DOM.
198
+ - `visible`: Wait for an element to have a non-empty bounding box and no `visibility:hidden`. Note that an element without any content or with `display:none` has an empty bounding box and is not considered visible.
199
+ - `hidden`: Wait for an element to be detached from the DOM, have an empty bounding box, or have `visibility:hidden`. This is the opposite of the `visible` option.
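For example (a sketch assuming Scrapling is installed; `check_state` is a hypothetical guard for the example, not part of Scrapling's API):

```python
VALID_STATES = ("attached", "detached", "visible", "hidden")

def check_state(state: str) -> str:
    # Catch typos before they reach wait_selector_state.
    if state not in VALID_STATES:
        raise ValueError(f"unknown selector state: {state!r}")
    return state

if __name__ == "__main__":
    from scrapling.fetchers import StealthyFetcher
    page = StealthyFetcher.fetch(
        'https://example.com',
        wait_selector='main',  # hypothetical selector for this example
        wait_selector_state=check_state('visible'),
    )
```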
200
 
201
  ### Firefox Addons
202
 
 
210
  The paths here must be paths of extracted addons, which will be installed automatically upon browser launch.
211
 
212
  ### Real-world example (Amazon)
213
+ This is for educational purposes only. The example below was generated by AI, which also shows how easy it is to work with Scrapling through AI tools.
214
  ```python
215
  def scrape_amazon_product(url):
216
  # Use StealthyFetcher to bypass protection
 
232
  }
233
  ```
234
 
235
+ ## Session Management
236
+
237
+ To keep the browser open while you make multiple requests with the same configuration, use the `StealthySession`/`AsyncStealthySession` classes. These classes accept all the arguments the `fetch` method takes, which lets you specify a configuration for the entire session.
238
+
239
+ ```python
240
+ from scrapling.fetchers import StealthySession
241
+
242
+ # Create a session with default configuration
243
+ with StealthySession(
244
+ headless=True,
245
+ geoip=True,
246
+ humanize=True,
247
+ solve_cloudflare=True
248
+ ) as session:
249
+ # Make multiple requests with the same browser instance
250
+ page1 = session.fetch('https://example1.com')
251
+ page2 = session.fetch('https://example2.com')
252
+ page3 = session.fetch('https://nopecha.com/demo/cloudflare')
253
+
254
+ # All requests reuse the same tab on the same browser instance
255
+ ```
256
+
257
+ ### Async Session Usage
258
+
259
+ ```python
260
+ import asyncio
261
+ from scrapling.fetchers import AsyncStealthySession
262
+
263
+ async def scrape_multiple_sites():
264
+ async with AsyncStealthySession(
265
+ geoip=True,
266
+ os_randomize=True,
267
+ solve_cloudflare=True,
268
+ timeout=60000, # 60 seconds for Cloudflare challenges
269
+ max_pages=3
270
+ ) as session:
271
+ # Make async requests with shared browser configuration
272
+ pages = await asyncio.gather(
273
+ session.fetch('https://site1.com'),
274
+ session.fetch('https://site2.com'),
275
+ session.fetch('https://protected-site.com')
276
+ )
277
+ return pages
278
+ ```
279
+
280
+ You may have noticed the `max_pages` argument. This argument enables the fetcher to create a **pool of browser tabs** that is rotated automatically: instead of waiting for one browser tab to become ready, the fetcher checks whether the next tab in the pool is ready and uses it. This allows multiple websites to be fetched at the same time in the same browser, which saves a lot of resources and, most importantly, is very fast :)
281
+
282
+ When all tabs in the pool are busy, the fetcher polls several times per second for a tab to become ready. If none becomes free within 30 seconds, it raises a `TimeoutError`. This can happen when the website you are fetching becomes unresponsive for some reason.
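That waiting behavior can be sketched roughly like this (an illustrative model of the pool described above, not Scrapling's actual code):

```python
import time

def acquire_tab(pool: list, timeout: float = 30.0, poll: float = 0.1) -> dict:
    # Poll sub-second for a free tab; give up after `timeout` seconds.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        for tab in pool:
            if tab.get("ready"):
                tab["ready"] = False  # mark the tab busy
                return tab
        time.sleep(poll)
    raise TimeoutError("no browser tab became ready in time")
```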
283
+
284
+ ### Session Benefits
285
+
286
+ - **Browser reuse**: Much faster subsequent requests by reusing the same browser instance.
287
+ - **Cookie persistence**: Cookies and session state are carried across requests automatically, as in any browser.
288
+ - **Consistent fingerprint**: Same browser fingerprint across all requests.
289
+ - **Memory efficiency**: Better resource usage compared to launching new browsers with each fetch.
290
+
291
  ## When to Use
292
 
293
  Use StealthyFetcher when:
 
296
  - Need a reliable browser fingerprint
297
  - Full JavaScript support needed
298
  - Want automatic stealth features
299
+ - Need browser automation
300
+ - Dealing with Cloudflare protection