Karim shoair committed on
Commit
6aa96de
·
1 Parent(s): 1f369fd

docs: update the StealthyFetcher page

Files changed (2)
  1. docs/fetching/dynamic.md +1 -1
  2. docs/fetching/stealthy.md +160 -144
docs/fetching/dynamic.md CHANGED
@@ -115,7 +115,7 @@ page = DynamicFetcher.fetch('https://example.com', network_idle=True)
115
  # Custom timeout (in milliseconds)
116
  page = DynamicFetcher.fetch('https://example.com', timeout=30000) # 30 seconds
117
 
118
- # Proxy support (It can also be a dictionary with the keys 'server', 'username', and 'password' only)
119
  page = DynamicFetcher.fetch('https://example.com', proxy='http://username:password@host:port')
120
  ```
121
 
 
115
  # Custom timeout (in milliseconds)
116
  page = DynamicFetcher.fetch('https://example.com', timeout=30000) # 30 seconds
117
 
118
+ # Proxy support (It can also be a dictionary with only the keys 'server', 'username', and 'password'.)
119
  page = DynamicFetcher.fetch('https://example.com', proxy='http://username:password@host:port')
120
  ```
121
 
docs/fetching/stealthy.md CHANGED
@@ -1,14 +1,17 @@
1
  # Introduction
2
 
3
- Here, we will discuss the `StealthyFetcher` class. This class is similar to [DynamicFetcher](dynamic.md#introduction) in many ways, including browser automation and the use of [Playwright's API](https://playwright.dev/python/docs/intro). The main difference is that this class provides advanced anti-bot protection bypass capabilities and a custom version of a modified Firefox browser called [Camoufox](https://github.com/daijro/camoufox), from which most stealth comes.
4
 
5
  As with [DynamicFetcher](dynamic.md#introduction), you will need some knowledge about [Playwright's Page API](https://playwright.dev/python/docs/api/class-page) to automate the page, as we will explain later.
6
 
 
 
7
  > 💡 **Prerequisites:**
8
  >
9
- > 1. Youve completed or read the [Fetchers basics](../fetching/choosing.md) page to understand what the [Response object](../fetching/choosing.md#response-object) is and which fetcher to use.
10
- > 2. You’ve completed or read the [Querying elements](../parsing/selection.md) page to understand how to find/extract elements from the [Selector](../parsing/main_classes.md#selector)/[Response](../fetching/choosing.md#response-object) object.
11
- > 3. You’ve completed or read the [Main classes](../parsing/main_classes.md) page to know what properties/methods the [Response](../fetching/choosing.md#response-object) class is inheriting from the [Selector](../parsing/main_classes.md#selector) class.
 
12
 
13
  ## Basic Usage
14
  You have one primary way to import this Fetcher, which is the same for all fetchers.
@@ -20,81 +23,80 @@ Check out how to configure the parsing options [here](choosing.md#parser-configu
20
 
21
  > Note: The async version of the `fetch` method is the `async_fetch` method, of course.
22
 
23
- ## Full list of arguments
24
- Scrapling provides many options with this fetcher and its session classes. Before jumping to the [examples](#examples), here's the full list of arguments
25
 
 
26
 
27
- | Argument | Description | Optional |
28
- |:-------------------:|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:--------:|
29
- | url | Target url | ❌ |
30
- | headless | Pass `True` to run the browser in headless/hidden (**default**) or `False` for headful/visible mode. | ✔️ |
31
- | block_images | Prevent the loading of images through Firefox preferences. _This can help save your proxy usage, but be cautious with this option, as it may cause some websites to never finish loading._ | ✔️ |
32
- | disable_resources | Drop requests for unnecessary resources for a speed boost. Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`. | ✔️ |
33
- | cookies | Set cookies for the next request. | ✔️ |
34
- | google_search | Enabled by default, Scrapling will set the referer header as if this request came from a Google search of this website's domain name. | ✔️ |
35
- | extra_headers | A dictionary of extra headers to add to the request. _The referer set by the `google_search` argument takes priority over the referer set here if used together._ | ✔️ |
36
- | block_webrtc | Blocks WebRTC entirely. | ✔️ |
37
- | page_action | Added for automation. Pass a function that takes the `page` object and does the necessary automation. | ✔️ |
38
- | addons | List of Firefox addons to use. **Must be paths to extracted addons.** | ✔️ |
39
- | humanize | Humanize the cursor movement. The cursor movement takes either True or the maximum duration in seconds. The cursor typically takes up to 1.5 seconds to move across the window. | ✔️ |
40
- | allow_webgl | Enabled by default. Disabling WebGL is not recommended, as many WAFs now check if WebGL is enabled. | ✔️ |
41
- | geoip | Recommended to use with proxies; Automatically use IPs' longitude, latitude, timezone, country, locale, & spoof the WebRTC IP address. It will also calculate and spoof the browser's language based on the distribution of language speakers in the target region. | ✔️ |
42
- | os_randomize | If enabled, Scrapling will randomize the OS fingerprints used. The default is matching the fingerprints with the current OS. | ✔️ |
43
- | disable_ads | Disabled by default; this installs the `uBlock Origin` addon on the browser if enabled. | ✔️ |
44
- | solve_cloudflare | When enabled, fetcher solves all types of Cloudflare's Turnstile/Interstitial challenges before returning the response to you. | ✔️ |
45
- | network_idle | Wait for the page until there are no network connections for at least 500 ms. | ✔️ |
46
- | load_dom | Enabled by default, wait for all JavaScript on page(s) to fully load and execute (wait for the `domcontentloaded` state). | ✔️ |
47
- | timeout | The timeout used in all operations and waits through the page. It's in milliseconds, and the default is 30000. | ✔️ |
48
- | wait | The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the `Response` object. | ✔️ |
49
- | wait_selector | Wait for a specific css selector to be in a specific state. | ✔️ |
50
- | init_script | An absolute path to a JavaScript file to be executed on page creation for all pages in this session. | ✔️ |
51
- | wait_selector_state | Scrapling will wait for the given state to be fulfilled for the selector given with `wait_selector`. _Default state is `attached`._ | ✔️ |
52
- | proxy | The proxy to be used with requests. It can be a string or a dictionary with only the keys 'server', 'username', and 'password'. | ✔️ |
53
- | user_data_dir | Path to a User Data Directory, which stores browser session data like cookies and local storage. The default is to create a temporary directory. **Only Works with sessions** | ✔️ |
54
- | additional_args | Additional arguments to be passed to Camoufox as additional settings, and they take higher priority than Scrapling's settings. | ✔️ |
55
- | selector_config | A dictionary of custom parsing arguments to be used when creating the final `Selector`/`Response` class. | ✔️ |
56
-
57
- In session classes, all these arguments can be set globally for the session. Still, you can configure each request individually by passing some of the arguments here that can be configured on the browser tab level like: `google_search`, `timeout`, `wait`, `page_action`, `extra_headers`, `disable_resources`, `wait_selector`, `wait_selector_state`, `network_idle`, `load_dom`, `solve_cloudflare`, and `selector_config`.
58
 
59
- ## Examples
60
- It's easier to understand with examples, so we will now review most of the arguments individually.
61
-
62
- ### Browser Modes
63
 
64
- ```python
65
- # Headless/hidden mode (default)
66
- page = StealthyFetcher.fetch('https://example.com', headless=True)
67
 
68
- # Visible browser mode
69
- page = StealthyFetcher.fetch('https://example.com', headless=False)
70
- ```
 
 
71
 
72
- ### Resource Control
73
 
74
- ```python
75
- # Block images
76
- page = StealthyFetcher.fetch('https://example.com', block_images=True)
 
 
 
77
 
78
- # Disable unnecessary resources
79
- page = StealthyFetcher.fetch('https://example.com', disable_resources=True) # Blocks fonts, images, media, etc.
80
- ```
81
 
82
- ### Cloudflare Protection Bypass
83
 
84
  ```python
85
  # Automatic Cloudflare solver
86
- page = StealthyFetcher.fetch(
87
- 'https://nopecha.com/demo/cloudflare',
88
- solve_cloudflare=True # Automatically solve Cloudflare challenges
89
- )
90
 
91
  # Works with other stealth options
92
  page = StealthyFetcher.fetch(
93
  'https://protected-site.com',
94
  solve_cloudflare=True,
95
- humanize=True,
96
- geoip=True,
97
- os_randomize=True
 
 
98
  )
99
  ```
100
 
@@ -104,64 +106,13 @@ The `solve_cloudflare` parameter enables automatic detection and solving all typ
104
  - Interactive challenges (clicking verification boxes)
105
  - Invisible challenges (automatic background verification)
106
 
107
- And even solves the custom pages.
108
-
109
- **Important notes:**
110
-
111
- - Sometimes, with websites that use custom implementations, you will need to use `wait_selector` to make sure Scrapling waits for the real website content to be loaded after solving the captcha. Some websites can be the real definition of an edge case while we are trying to make the solver as generic as possible.
112
- - When `solve_cloudflare=True` is enabled, `humanize=True` is automatically activated for more realistic behavior
113
- - The timeout should be at least 60 seconds when using the Cloudflare solver for sufficient challenge-solving time
114
- - This feature works seamlessly with proxies and other stealth options
115
-
116
- ### Additional stealth options
117
-
118
- ```python
119
- page = StealthyFetcher.fetch(
120
- 'https://example.com',
121
- block_webrtc=True, # Block WebRTC
122
- allow_webgl=False, # Disable WebGL
123
- humanize=True, # Make the mouse move as a human would move it
124
- geoip=True, # Use IP's longitude, latitude, timezone, country, and locale, then spoof the WebRTC IP address...
125
- os_randomize=True, # Randomize the OS fingerprints used. The default is matching the fingerprints with the current OS.
126
- disable_ads=True, # Block ads with uBlock Origin addon (enabled by default)
127
- google_search=True
128
- )
129
-
130
- # Custom humanization duration
131
- page = StealthyFetcher.fetch(
132
- 'https://example.com',
133
- humanize=1.5 # Max 1.5 seconds for cursor movement
134
- )
135
- ```
136
-
137
- The `google_search` argument is enabled by default, making the request appear to come from a Google search page. So, a request for `https://example.com` will set the referer to `https://www.google.com/search?q=example`. Also, if used together, it takes priority over the referer set by the `extra_headers` argument.
138
-
139
- ### Network Control
140
-
141
- ```python
142
- # Wait for network idle (Consider fetch to be finished when there are no network connections for at least 500 ms)
143
- page = StealthyFetcher.fetch('https://example.com', network_idle=True)
144
-
145
- # Custom timeout (in milliseconds)
146
- page = StealthyFetcher.fetch('https://example.com', timeout=30000) # 30 seconds
147
 
148
- # Proxy support
149
- page = StealthyFetcher.fetch(
150
- 'https://example.com',
151
- proxy='http://username:password@host:port' # Or it can be a dictionary with the keys 'server', 'username', and 'password' only
152
- )
153
- ```
154
-
155
- ### Downloading Files
156
-
157
- ```python
158
- page = StealthyFetcher.fetch('https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/poster.png')
159
-
160
- with open(file='poster.png', mode='wb') as f:
161
- f.write(page.body)
162
- ```
163
-
164
- The `body` attribute of the `Response` object is a `bytes` object containing the response body in case of Non-HTML responses.
165
 
166
  ### Browser Automation
167
  This is where your knowledge about [Playwright's Page API](https://playwright.dev/python/docs/api/class-page) comes into play. The function you pass here takes the page object from Playwright's API, performs the desired action, and then the fetcher continues.
@@ -177,10 +128,7 @@ def scroll_page(page: Page):
177
  page.mouse.move(100, 400)
178
  page.mouse.up()
179
 
180
- page = StealthyFetcher.fetch(
181
- 'https://example.com',
182
- page_action=scroll_page
183
- )
184
  ```
185
  Of course, if you use the async fetch version, the function must also be async.
186
  ```python
@@ -191,10 +139,7 @@ async def scroll_page(page: Page):
191
  await page.mouse.move(100, 400)
192
  await page.mouse.up()
193
 
194
- page = await StealthyFetcher.async_fetch(
195
- 'https://example.com',
196
- page_action=scroll_page
197
- )
198
  ```
199
 
200
  ### Wait Conditions
@@ -217,19 +162,9 @@ The states the fetcher can wait for can be any of the following ([source](https:
217
  - `visible`: wait for an element to have a non-empty bounding box and no `visibility:hidden`. Note that an element without any content or with `display:none` has an empty bounding box and is not considered visible.
218
  - `hidden`: wait for an element to be either detached from the DOM, or have an empty bounding box, or `visibility:hidden`. This is opposite to the `'visible'` option.
219
 
220
- ### Firefox Addons
221
-
222
- ```python
223
- # Custom Firefox addons
224
- page = StealthyFetcher.fetch(
225
- 'https://example.com',
226
- addons=['/path/to/addon1', '/path/to/addon2']
227
- )
228
- ```
229
- The paths here must point to extracted addons that will be installed automatically upon browser launch.
230
 
231
  ### Real-world example (Amazon)
232
- This is for educational purposes only; this example was generated by AI, which shows how easy it is to work with Scrapling through AI
233
  ```python
234
  def scrape_amazon_product(url):
235
  # Use StealthyFetcher to bypass protection
@@ -261,8 +196,8 @@ from scrapling.fetchers import StealthySession
261
  # Create a session with default configuration
262
  with StealthySession(
263
  headless=True,
264
- geoip=True,
265
- humanize=True,
266
  solve_cloudflare=True
267
  ) as session:
268
  # Make multiple requests with the same browser instance
@@ -281,8 +216,8 @@ from scrapling.fetchers import AsyncStealthySession
281
 
282
  async def scrape_multiple_sites():
283
  async with AsyncStealthySession(
284
- geoip=True,
285
- os_randomize=True,
286
  solve_cloudflare=True,
287
  timeout=60000, # 60 seconds for Cloudflare challenges
288
  max_pages=3
@@ -296,12 +231,12 @@ async def scrape_multiple_sites():
296
  return pages
297
  ```
298
 
299
- You may have noticed the `max_pages` argument. This is a new argument that enables the fetcher to create a **rotating pool of Browser tabs**. Instead of using a single tab for all your requests, you set a limit on the maximum number of pages. With each request, the library will close all tabs that have finished their task and check if the number of the current tabs is lower than the maximum allowed number of pages/tabs, then:
300
 
301
  1. If you are within the allowed range, the fetcher will create a new tab for you, and then all is as normal.
302
  2. Otherwise, it will keep checking every sub-second for up to 60 seconds whether creating a new tab is allowed, then raise `TimeoutError`. This can happen when the website you are fetching becomes unresponsive.
303
 
304
- This logic allows for multiple websites to be fetched at the same time in the same browser, which saves a lot of resources, but most importantly, is so fast :)
305
 
306
  In versions 0.3 and 0.3.1, the pool was reusing finished tabs to save more resources/time. That logic proved flawed, as it's nearly impossible to protect pages/tabs from contamination by the previous configuration used in the request before this one.
307
 
@@ -312,6 +247,87 @@ In versions 0.3 and 0.3.1, the pool was reusing finished tabs to save more resou
312
  - **Consistent fingerprint**: Same browser fingerprint across all requests.
313
  - **Memory efficiency**: Better resource usage compared to launching new browsers with each fetch.
314
 
 
315
  ## When to Use
316
 
317
  Use StealthyFetcher when:
 
1
  # Introduction
2
 
3
+ Here, we will discuss the `StealthyFetcher` class. This class is very similar to the [DynamicFetcher](dynamic.md#introduction) class, sharing the same browsers, automation, and use of [Playwright's API](https://playwright.dev/python/docs/intro). The main difference is that this class provides advanced anti-bot protection bypass capabilities; most of them are handled automatically under the hood, and the rest are up to you to enable.
4
 
5
  As with [DynamicFetcher](dynamic.md#introduction), you will need some knowledge about [Playwright's Page API](https://playwright.dev/python/docs/api/class-page) to automate the page, as we will explain later.
6
 
7
+ **Note:** _Before version 0.3.13, this fetcher used a custom version of [Camoufox](https://github.com/daijro/camoufox) as an engine, which has since been replaced with [patchright](https://github.com/Kaliiiiiiiiii-Vinyzu/patchright) for many reasons. See [this section](#using-camoufox-as-an-engine) if you still need to use [Camoufox](https://github.com/daijro/camoufox). We might switch back to [Camoufox](https://github.com/daijro/camoufox) in the future if its development continues._
8
+
9
  > 💡 **Prerequisites:**
10
  >
11
+ > 1. You've completed or read the [DynamicFetcher](dynamic.md#introduction) page since this class builds upon it, and we won't repeat the same information here for that reason.
12
+ > 2. You’ve completed or read the [Fetchers basics](../fetching/choosing.md) page to understand what the [Response object](../fetching/choosing.md#response-object) is and which fetcher to use.
13
+ > 3. You’ve completed or read the [Querying elements](../parsing/selection.md) page to understand how to find/extract elements from the [Selector](../parsing/main_classes.md#selector)/[Response](../fetching/choosing.md#response-object) object.
14
+ > 4. You’ve completed or read the [Main classes](../parsing/main_classes.md) page to know what properties/methods the [Response](../fetching/choosing.md#response-object) class is inheriting from the [Selector](../parsing/main_classes.md#selector) class.
15
 
16
  ## Basic Usage
17
  You have one primary way to import this Fetcher, which is the same for all fetchers.
 
23
 
24
  > Note: The async version of the `fetch` method is the `async_fetch` method, of course.
25
 
26
+ ## What does it do?
 
27
 
28
+ The `StealthyFetcher` class is a stealthy version of the [DynamicFetcher](dynamic.md#introduction) class, and here are some of the things it does:
29
 
30
+ 1. It automatically bypasses all types of Cloudflare's Turnstile/Interstitial challenges.
31
+ 2. It bypasses CDP runtime leaks and WebRTC leaks.
32
+ 3. It isolates JS execution, removes many Playwright fingerprints, and prevents detection through known bot-like behaviors.
33
+ 4. It generates canvas noise to prevent fingerprinting through canvas.
34
+ 5. It automatically patches known methods of detecting headless mode and provides an option to defeat timezone-mismatch attacks.
35
+ 6. It makes requests look as if they came from Google's search page of the requested website.
36
+ 7. ...and other anti-detection options.
 
37
 
38
+ ## Full list of arguments
39
+ Scrapling provides many options with this fetcher and its session classes. Before jumping to the [examples](#examples), here's the full list of arguments:
 
 
40
 
 
 
 
41
 
42
+ | Argument | Description | Optional |
43
+ |:-------------------:|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:--------:|
44
+ | url | Target url | ❌ |
45
+ | headless | Pass `True` to run the browser in headless/hidden (**default**) or `False` for headful/visible mode. | ✔️ |
46
+ | disable_resources | Drop requests for unnecessary resources for a speed boost. Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`. | ✔️ |
47
+ | cookies | Set cookies for the next request. | ✔️ |
48
+ | useragent | Pass a useragent string to be used. **Otherwise, the fetcher will generate and use a real Useragent of the same browser and version.** | ✔️ |
49
+ | network_idle | Wait for the page until there are no network connections for at least 500 ms. | ✔️ |
50
+ | load_dom | Enabled by default, wait for all JavaScript on page(s) to fully load and execute (wait for the `domcontentloaded` state). | ✔️ |
51
+ | timeout | The timeout (milliseconds) used in all operations and waits through the page. The default is 30,000 ms (30 seconds). | ✔️ |
52
+ | wait | The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the `Response` object. | ✔️ |
53
+ | page_action | Added for automation. Pass a function that takes the `page` object and does the necessary automation. | ✔️ |
54
+ | wait_selector | Wait for a specific css selector to be in a specific state. | ✔️ |
55
+ | init_script | An absolute path to a JavaScript file to be executed on page creation for all pages in this session. | ✔️ |
56
+ | wait_selector_state | Scrapling will wait for the given state to be fulfilled for the selector given with `wait_selector`. _Default state is `attached`._ | ✔️ |
57
+ | google_search | Enabled by default, Scrapling will set the referer header as if this request came from a Google search of this website's domain name. | ✔️ |
58
+ | extra_headers | A dictionary of extra headers to add to the request. _The referer set by the `google_search` argument takes priority over the referer set here if used together._ | ✔️ |
59
+ | proxy | The proxy to be used with requests. It can be a string or a dictionary with only the keys 'server', 'username', and 'password'. | ✔️ |
60
+ | real_chrome | If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch and use an instance of your browser. | ✔️ |
61
+ | locale | Specify user locale, for example, `en-GB`, `de-DE`, etc. Locale will affect `navigator.language` value, `Accept-Language` request header value, as well as number and date formatting rules. Defaults to the system default locale. | ✔️ |
62
+ | timezone_id | Changes the timezone of the browser. Defaults to the system timezone. | ✔️ |
63
+ | cdp_url | Instead of launching a new browser instance, connect to this CDP URL to control real browsers through CDP. | ✔️ |
64
+ | user_data_dir | Path to a User Data Directory, which stores browser session data like cookies and local storage. The default is to create a temporary directory. **Only Works with sessions** | ✔️ |
65
+ | extra_flags | A list of additional browser flags to pass to the browser on launch. | ✔️ |
66
+ | solve_cloudflare | When enabled, fetcher solves all types of Cloudflare's Turnstile/Interstitial challenges before returning the response to you. | ✔️ |
67
+ | block_webrtc | Forces WebRTC to respect proxy settings to prevent local IP address leaks. | ✔️ |
68
+ | hide_canvas | Add random noise to canvas operations to prevent fingerprinting. | ✔️ |
69
+ | allow_webgl | Enabled by default. Disabling it disables WebGL and WebGL 2.0 support entirely. Disabling WebGL is not recommended, as many WAFs now check if WebGL is enabled. | ✔️ |
70
+ | additional_args | Additional arguments to be passed to Playwright's context as additional settings, and they take higher priority than Scrapling's settings. | ✔️ |
71
+ | selector_config | A dictionary of custom parsing arguments to be used when creating the final `Selector`/`Response` class. | ✔️ |
72
 
73
+ In session classes, all these arguments can be set globally for the session. Still, you can configure each request individually by passing some of the arguments here that can be configured on the browser tab level like: `google_search`, `timeout`, `wait`, `page_action`, `extra_headers`, `disable_resources`, `wait_selector`, `wait_selector_state`, `network_idle`, `load_dom`, `solve_cloudflare`, and `selector_config`.
74
 
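As a quick sketch of this split (the values are placeholders; the fetch call is commented out because it would launch a real browser):

```python
# Session-level defaults vs. per-request (tab-level) overrides; values are placeholders.
session_defaults = {'headless': True, 'solve_cloudflare': True, 'timeout': 60000}
request_overrides = {'wait_selector': '.content', 'google_search': False}  # tab-level arguments

# from scrapling.fetchers import StealthySession
# with StealthySession(**session_defaults) as session:
#     page = session.fetch('https://example.com', **request_overrides)
```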
75
+ > 🔍 Notes:
76
+ >
77
+ > 1. These are basically the same arguments as in the [DynamicFetcher](dynamic.md#introduction) class, but with these additional arguments: `solve_cloudflare`, `block_webrtc`, `hide_canvas`, and `allow_webgl`.
78
+ > 2. The `disable_resources` option made requests ~25% faster in my tests for some websites and can help save your proxy usage, but be careful with it, as it can cause some websites to never finish loading.
79
+ > 3. The `google_search` argument is enabled by default for all requests, making the request appear to come from a Google search page. So, a request for `https://example.com` will set the referer to `https://www.google.com/search?q=example`. Also, if used together, it takes priority over the referer set by the `extra_headers` argument.
80
+ > 4. If you didn't set a user agent and enabled headless mode, the fetcher will generate a real user agent matching the browser version and use it. If you didn't set a user agent and didn't enable headless mode, the fetcher uses the browser's default user agent, the same as in standard browsers of the latest versions.
81
 
82
+ ## Examples
83
+ It's easier to understand with examples, so we will now review most of the arguments individually. Since this class builds on the same foundation as [DynamicFetcher](dynamic.md#introduction), you can refer to that page for more examples; we won't repeat them here.
 
84
 
85
+ ### Cloudflare and stealth options
86
 
87
  ```python
88
  # Automatic Cloudflare solver
89
+ page = StealthyFetcher.fetch('https://nopecha.com/demo/cloudflare', solve_cloudflare=True)
 
 
 
90
 
91
  # Works with other stealth options
92
  page = StealthyFetcher.fetch(
      'https://protected-site.com',
      solve_cloudflare=True,
+     block_webrtc=True,
+     real_chrome=True,
+     hide_canvas=True,
+     google_search=True,
+     proxy='http://username:password@host:port',  # It can also be a dictionary with only the keys 'server', 'username', and 'password'.
  )
101
  ```
102
 
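Both accepted forms of the `proxy` argument from the example above, side by side (the values are placeholders):

```python
# String form: credentials embedded in the proxy URL
proxy_str = 'http://username:password@host:port'

# Dictionary form: only these three keys are accepted
proxy_dict = {
    'server': 'http://host:port',
    'username': 'username',
    'password': 'password',
}
# page = StealthyFetcher.fetch('https://example.com', proxy=proxy_dict)
```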
 
106
  - Interactive challenges (clicking verification boxes)
107
  - Invisible challenges (automatic background verification)
108
 
109
+ And it even solves custom pages with embedded captchas.
 
110
 
111
+ > 🔍 **Important notes:**
112
+ >
113
+ > 1. Sometimes, with websites that use custom implementations, you will need to use `wait_selector` to make sure Scrapling waits for the real website content to be loaded after solving the captcha. Some websites can be the real definition of an edge case while we are trying to make the solver as generic as possible.
114
+ > 2. The timeout should be at least 60 seconds when using the Cloudflare solver for sufficient challenge-solving time.
115
+ > 3. This feature works seamlessly with proxies and other stealth options.
 
116
 
117
  ### Browser Automation
118
  This is where your knowledge about [Playwright's Page API](https://playwright.dev/python/docs/api/class-page) comes into play. The function you pass here takes the page object from Playwright's API, performs the desired action, and then the fetcher continues.
 
128
  page.mouse.move(100, 400)
129
  page.mouse.up()
130
 
131
+ page = StealthyFetcher.fetch('https://example.com', page_action=scroll_page)
 
 
 
132
  ```
133
  Of course, if you use the async fetch version, the function must also be async.
134
  ```python
 
139
  await page.mouse.move(100, 400)
140
  await page.mouse.up()
141
 
142
+ page = await StealthyFetcher.async_fetch('https://example.com', page_action=scroll_page)
 
 
 
143
  ```
144
 
145
  ### Wait Conditions
 
162
  - `visible`: wait for an element to have a non-empty bounding box and no `visibility:hidden`. Note that an element without any content or with `display:none` has an empty bounding box and is not considered visible.
163
  - `hidden`: wait for an element to be either detached from the DOM, or have an empty bounding box, or `visibility:hidden`. This is opposite to the `'visible'` option.
164
 
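For example, to wait until a results container is fully visible rather than merely attached (the selector and URL are placeholders; the fetch call is commented out because it would launch a real browser):

```python
# Wait for the element matching the selector to reach the 'visible' state
# (the default state is 'attached').
wait_args = {
    'wait_selector': '.results',
    'wait_selector_state': 'visible',
}
# page = StealthyFetcher.fetch('https://example.com', **wait_args)
```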
165
 
166
  ### Real-world example (Amazon)
167
+ This is for educational purposes only; this example was generated by AI, which also shows how easy it is to work with Scrapling through AI.
168
  ```python
169
  def scrape_amazon_product(url):
170
  # Use StealthyFetcher to bypass protection
 
196
  # Create a session with default configuration
197
  with StealthySession(
198
  headless=True,
199
+ real_chrome=True,
200
+ block_webrtc=True,
201
  solve_cloudflare=True
202
  ) as session:
203
  # Make multiple requests with the same browser instance
 
216
 
217
  async def scrape_multiple_sites():
218
  async with AsyncStealthySession(
219
+ real_chrome=True,
220
+ block_webrtc=True,
221
  solve_cloudflare=True,
222
  timeout=60000, # 60 seconds for Cloudflare challenges
223
  max_pages=3
 
231
  return pages
232
  ```
233
 
234
+ You may have noticed the `max_pages` argument. This new argument enables the fetcher to create a **rotating pool of browser tabs**. Instead of using a single tab for all your requests, you set a limit on the maximum number of tabs that can be open at once. With each request, the library closes all tabs that have finished their task and checks whether the number of current tabs is lower than the maximum allowed number of pages/tabs, then:
235
 
236
  1. If you are within the allowed range, the fetcher will create a new tab for you, and then all is as normal.
237
  2. Otherwise, it will keep checking every sub-second for up to 60 seconds whether creating a new tab is allowed, then raise `TimeoutError`. This can happen when the website you are fetching becomes unresponsive.
238
 
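The checking loop described above can be sketched in plain Python as follows (a simplified illustration with hypothetical names; the real implementation lives inside Scrapling's session classes):

```python
import time

def acquire_tab_slot(open_tabs: list, max_pages: int, timeout_s: float = 60.0) -> bool:
    """Return True once a new tab may be created, following the two rules above."""
    deadline = time.monotonic() + timeout_s
    while True:
        # Close all tabs that have finished their task
        open_tabs[:] = [tab for tab in open_tabs if not tab['finished']]
        if len(open_tabs) < max_pages:
            open_tabs.append({'finished': False})  # within the allowed range: open a new tab
            return True
        if time.monotonic() >= deadline:
            # No slot freed up in time, e.g., the website became unresponsive
            raise TimeoutError('No free tab slot became available within the timeout')
        time.sleep(0.1)  # keep checking every sub-second
```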
239
+ This logic allows for multiple URLs to be fetched at the same time in the same browser, which saves a lot of resources, but most importantly, is so fast :)
240
 
241
  In versions 0.3 and 0.3.1, the pool was reusing finished tabs to save more resources/time. That logic proved flawed, as it's nearly impossible to protect pages/tabs from contamination by the previous configuration used in the request before this one.
242
 
 
247
  - **Consistent fingerprint**: Same browser fingerprint across all requests.
248
  - **Memory efficiency**: Better resource usage compared to launching new browsers with each fetch.
249
 
250
+ ## Using Camoufox as an engine
251
+ If Camoufox is stable on your device, doesn't cause high memory issues, and you want to continue using it as you did before v0.3.13, this section is for you.
252
+
253
+ First, you will need to install the Camoufox library, browser, and Firefox system dependencies if you haven't already:
254
+ ```commandline
255
+ pip install camoufox
256
+ playwright install-deps firefox
257
+ camoufox fetch
258
+ ```
259
+ Then, subclass `StealthySession` and override its `start` method as shown below:
260
+ ```python
+ from scrapling.fetchers import StealthySession
+ from playwright.sync_api import sync_playwright
+ from camoufox.utils import launch_options as generate_launch_options
+
+ class StealthySession(StealthySession):
+     def start(self):
+         """Create a browser for this instance and context."""
+         if not self.playwright:
+             self.playwright = sync_playwright().start()
+             # Configure camoufox run options here
+             launch_options = generate_launch_options(**{"headless": True, "user_data_dir": ''})
+             # Here's an example, part of what we have been doing before v0.3.13
+             launch_options = generate_launch_options(**{
+                 "geoip": False,
+                 "proxy": self._config.proxy,
+                 "headless": self._config.headless,
+                 "humanize": True if self._config.solve_cloudflare else False,  # Better enable humanize for Cloudflare, otherwise it's up to you
+                 "i_know_what_im_doing": True,  # To turn warnings off with the user configurations
+                 "allow_webgl": self._config.allow_webgl,
+                 "block_webrtc": self._config.block_webrtc,
+                 "os": None,
+                 "user_data_dir": self._config.user_data_dir,
+                 "firefox_user_prefs": {
+                     # This is what enabling `enable_cache` does internally, so we do it from here instead
+                     "browser.sessionhistory.max_entries": 10,
+                     "browser.sessionhistory.max_total_viewers": -1,
+                     "browser.cache.memory.enable": True,
+                     "browser.cache.disk_cache_ssl": True,
+                     "browser.cache.disk.smart_size.enabled": True,
+                 },
+                 # etc...
+             })
+             self.context = self.playwright.firefox.launch_persistent_context(**launch_options)
+         else:
+             raise RuntimeError("Session has been already started")
+ ```
297
+ After that, you can use it normally as before, even for solving Cloudflare challenges:
298
+ ```python
+ with StealthySession(solve_cloudflare=True, headless=True) as session:
+     page = session.fetch('https://sergiodemo.com/security/challenge/legacy-challenge')
+     if page.css('#page-not-found-404'):
+         print('Cloudflare challenge solved successfully!')
+ ```
304
+
305
+ The same logic applies to the `AsyncStealthySession` class with a few differences:
306
+ ```python
+ from scrapling.fetchers import AsyncStealthySession
+ from playwright.async_api import async_playwright
+ from camoufox.utils import launch_options as generate_launch_options
+
+ class AsyncStealthySession(AsyncStealthySession):
+     async def start(self):
+         """Create a browser for this instance and context."""
+         if not self.playwright:
+             self.playwright = await async_playwright().start()
+             # Configure camoufox run options here
+             launch_options = generate_launch_options(**{"headless": True, "user_data_dir": ''})
+             # or set the launch options as in the above example
+             self.context = await self.playwright.firefox.launch_persistent_context(**launch_options)
+         else:
+             raise RuntimeError("Session has been already started")
+
+ async with AsyncStealthySession(solve_cloudflare=True, headless=True) as session:
+     page = await session.fetch('https://sergiodemo.com/security/challenge/legacy-challenge')
+     if page.css('#page-not-found-404'):
+         print('Cloudflare challenge solved successfully!')
+ ```
328
+
329
+ Enjoy! :)
330
+
331
  ## When to Use
332
 
333
  Use StealthyFetcher when: