Commit e1908ab · Karim shoair committed
Parent(s): 9b4c21b
docs: update the StealthyFetcher page
docs/fetching/stealthy.md CHANGED (+130 -42)
|
@@ -1,8 +1,8 @@
|
|
| 1 |
# Introduction
|
| 2 |
|
| 3 |
-
Here, we will discuss the `StealthyFetcher` class. This class is similar to [
|
| 4 |
|
| 5 |
-
As with [
|
| 6 |
|
| 7 |
## Basic Usage
|
| 8 |
You have one primary way to import this Fetcher, which is the same for all fetchers.
|
|
@@ -14,40 +14,43 @@ Check out how to configure the parsing options [here](choosing.md#parser-configu
|
|
| 14 |
|
| 15 |
> Notes:
|
| 16 |
>
|
| 17 |
-
> 1. Every time you fetch a website with this fetcher, it waits by default for all JavaScript to fully load and execute, so you don't have to (
|
| 18 |
> 2. Of course, the async version of the `fetch` method is the `async_fetch` method.
|
| 19 |
|
| 20 |
## Full list of arguments
|
| 21 |
Before jumping to [examples](#examples), here's the full list of arguments
|
| 22 |
|
| 23 |
|
| 24 |
-
|
|
| 25 |
-
|:-------------------
|
| 26 |
-
| url
|
| 27 |
-
|
|
| 28 |
-
|
|
| 29 |
-
| disable_resources
|
| 30 |
-
|
|
| 31 |
-
|
|
| 32 |
-
|
|
| 33 |
-
|
|
| 34 |
-
|
|
| 35 |
-
|
|
| 36 |
-
|
|
| 37 |
-
|
|
| 38 |
-
|
|
| 39 |
-
|
|
| 40 |
-
|
|
| 41 |
-
|
|
| 42 |
-
|
|
| 43 |
-
|
|
| 44 |
-
|
|
| 45 |
-
|
|
| 46 |
-
|
|
| 47 |
|
| 48 |
|
| 49 |
## Examples
|
| 50 |
-
It's easier to understand with examples, so
|
| 51 |
|
| 52 |
### Browser Modes
|
| 53 |
|
|
@@ -55,9 +58,6 @@ It's easier to understand with examples, so now we will go over most of the argu
|
|
| 55 |
# Headless/hidden mode (default)
|
| 56 |
page = StealthyFetcher.fetch('https://example.com', headless=True)
|
| 57 |
|
| 58 |
-
# Virtual display mode (requires having `xvfb` installed)
|
| 59 |
-
page = StealthyFetcher.fetch('https://example.com', headless='virtual')
|
| 60 |
-
|
| 61 |
# Visible browser mode
|
| 62 |
page = StealthyFetcher.fetch('https://example.com', headless=False)
|
| 63 |
```
|
|
@@ -72,6 +72,37 @@ page = StealthyFetcher.fetch('https://example.com', block_images=True)
|
|
| 72 |
page = StealthyFetcher.fetch('https://example.com', disable_resources=True) # Blocks fonts, images, media, etc.
|
| 73 |
```
|
| 74 |
|
| 75 |
### Additional stealth options
|
| 76 |
|
| 77 |
```python
|
|
@@ -79,7 +110,7 @@ page = StealthyFetcher.fetch(
|
|
| 79 |
'https://example.com',
|
| 80 |
block_webrtc=True, # Block WebRTC
|
| 81 |
allow_webgl=False, # Disable WebGL
|
| 82 |
-
humanize=True, # Make the mouse move as
|
| 83 |
geoip=True, # Use IP's longitude, latitude, timezone, country, and locale, then spoof the WebRTC IP address...
|
| 84 |
os_randomize=True, # Randomize the OS fingerprints used. The default is matching the fingerprints with the current OS.
|
| 85 |
disable_ads=True, # Block ads with uBlock Origin addon (enabled by default)
|
|
@@ -93,7 +124,7 @@ page = StealthyFetcher.fetch(
|
|
| 93 |
)
|
| 94 |
```
|
| 95 |
|
| 96 |
-
The `google_search` argument is enabled by default
|
| 97 |
|
| 98 |
### Network Control
|
| 99 |
|
|
@@ -112,9 +143,9 @@ page = StealthyFetcher.fetch(
|
|
| 112 |
```
|
| 113 |
|
| 114 |
### Browser Automation
|
| 115 |
-
This is where your knowledge about [
|
| 116 |
|
| 117 |
-
This function is executed
|
| 118 |
|
| 119 |
In the example below, I used page [mouse events](https://playwright.dev/python/docs/api/class-mouse) to move the mouse wheel to scroll the page and then move the mouse.
|
| 120 |
```python
|
|
@@ -158,14 +189,14 @@ page = StealthyFetcher.fetch(
|
|
| 158 |
```
|
| 159 |
This is the last wait the fetcher will do before returning the response (if enabled). You pass a CSS selector to the `wait_selector` argument, and the fetcher will wait for the state you passed in the `wait_selector_state` argument to be fulfilled. If you didn't pass a state, the default would be `attached`, which means it will wait for the element to be present in the DOM.
|
| 160 |
|
| 161 |
-
After that, the fetcher will check again to see if all JS files are loaded and executed (the `domcontentloaded` state)
|
| 162 |
|
| 163 |
-
The states the fetcher can wait for can be
|
| 164 |
|
| 165 |
-
- `attached`:
|
| 166 |
-
- `detached`:
|
| 167 |
-
- `visible`: wait for
|
| 168 |
-
- `hidden`:
|
| 169 |
|
| 170 |
### Firefox Addons
|
| 171 |
|
|
@@ -179,7 +210,7 @@ page = StealthyFetcher.fetch(
|
|
| 179 |
The paths here must be paths of extracted addons, which will be installed automatically upon browser launch.
|
| 180 |
|
| 181 |
### Real-world example (Amazon)
|
| 182 |
-
This is for educational purposes only; this example was generated by AI, which shows
|
| 183 |
```python
|
| 184 |
def scrape_amazon_product(url):
|
| 185 |
# Use StealthyFetcher to bypass protection
|
|
@@ -201,6 +232,62 @@ def scrape_amazon_product(url):
|
|
| 201 |
}
|
| 202 |
```
|
| 203 |
|
| 204 |
## When to Use
|
| 205 |
|
| 206 |
Use StealthyFetcher when:
|
|
@@ -209,4 +296,5 @@ Use StealthyFetcher when:
|
|
| 209 |
- Need a reliable browser fingerprint
|
| 210 |
- Full JavaScript support needed
|
| 211 |
- Want automatic stealth features
|
| 212 |
-
- Need browser automation
|
|
|
|
|
|
| 1 |
# Introduction
|
| 2 |
|
| 3 |
+
Here, we will discuss the `StealthyFetcher` class. This class is similar to [DynamicFetcher](dynamic.md#introduction) in many ways, such as browser automation and its use of [Playwright's API](https://playwright.dev/python/docs/intro). The main difference is that this class provides advanced anti-bot protection bypass capabilities and runs on [Camoufox](https://github.com/daijro/camoufox), a custom, modified build of Firefox that provides most of the stealth.
|
| 4 |
|
| 5 |
+
As with [DynamicFetcher](dynamic.md#introduction), you will need some knowledge about [Playwright's Page API](https://playwright.dev/python/docs/api/class-page) to automate the page, as we will explain later.
|
| 6 |
|
| 7 |
## Basic Usage
|
| 8 |
You have one primary way to import this Fetcher, which is the same for all fetchers.
|
|
|
|
| 14 |
|
| 15 |
> Notes:
|
| 16 |
>
|
| 17 |
+
> 1. Every time you fetch a website with this fetcher, it waits by default for all JavaScript to fully load and execute, so you don't have to (wait for the `domcontentloaded` state).
|
| 18 |
> 2. Of course, the async version of the `fetch` method is the `async_fetch` method.
|
| 19 |
|
| 20 |
## Full list of arguments
|
| 21 |
Before jumping to [examples](#examples), here's the full list of arguments
|
| 22 |
|
| 23 |
|
| 24 |
+
| Argument | Description | Optional |
|
| 25 |
+
|:-------------------:|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:--------:|
|
| 26 |
+
| url                 | Target URL | ❌ |
|
| 27 |
+
| headless | Pass `True` to run the browser in headless/hidden (**default**) or `False` for headful/visible mode. | ✔️ |
|
| 28 |
+
| block_images | Prevent the loading of images through Firefox preferences. _This can help save your proxy usage, but be cautious with this option, as it may cause some websites to never finish loading._ | ✔️ |
|
| 29 |
+
| disable_resources | Drop requests of unnecessary resources for a speed boost. It depends, but it made requests ~25% faster in my tests for some websites.<br/>Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`. _This can help save your proxy usage, but be cautious with this option, as it may cause some websites to never finish loading._ | ✔️ |
|
| 30 |
+
| cookies | Set cookies for the next request. | ✔️ |
|
| 31 |
+
| google_search | Enabled by default, Scrapling will set the referer header as if this request came from a Google search of this website's domain name. | ✔️ |
|
| 32 |
+
| extra_headers | A dictionary of extra headers to add to the request. _The referer set by the `google_search` argument takes priority over the referer set here if used together._ | ✔️ |
|
| 33 |
+
| block_webrtc | Blocks WebRTC entirely. | ✔️ |
|
| 34 |
+
| page_action | Added for automation. Pass a function that takes the `page` object and does the necessary automation, then returns `page` again. | ✔️ |
|
| 35 |
+
| addons | List of Firefox addons to use. **Must be paths to extracted addons.** | ✔️ |
|
| 36 |
+
| humanize            | Humanize the cursor movement. Pass either `True` or the maximum duration, in seconds, of a cursor movement. The cursor typically takes up to 1.5 seconds to move across the window. | ✔️ |
|
| 37 |
+
| allow_webgl | Enabled by default. Disabling WebGL is not recommended, as many WAFs now check if WebGL is enabled. | ✔️ |
|
| 38 |
+
| geoip               | Recommended when using proxies. Automatically uses the IP's longitude, latitude, timezone, country, and locale, and spoofs the WebRTC IP address. It also calculates and spoofs the browser's language based on the distribution of language speakers in the target region. | ✔️ |
|
| 39 |
+
| os_randomize | If enabled, Scrapling will randomize the OS fingerprints used. The default is matching the fingerprints with the current OS. | ✔️ |
|
| 40 |
+
| disable_ads | Disabled by default; this installs the `uBlock Origin` addon on the browser if enabled. | ✔️ |
|
| 41 |
+
| solve_cloudflare | When enabled, fetcher solves all three types of Cloudflare's Turnstile wait/captcha page before returning the response to you. | ✔️ |
|
| 42 |
+
| network_idle | Wait for the page until there are no network connections for at least 500 ms. | ✔️ |
|
| 43 |
+
| timeout | The timeout used in all operations and waits through the page. It's in milliseconds, and the default is 30000. | ✔️ |
|
| 44 |
+
| wait | The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the `Response` object. | ✔️ |
|
| 45 |
+
| wait_selector       | Wait for a specific CSS selector to be in a specific state. | ✔️ |
|
| 46 |
+
| wait_selector_state | Scrapling will wait for the given state to be fulfilled for the selector given with `wait_selector`. _Default state is `attached`._ | ✔️ |
|
| 47 |
+
| proxy | The proxy to be used with requests. It can be a string or a dictionary with the keys 'server', 'username', and 'password' only. | ✔️ |
|
| 48 |
+
| additional_args | Additional arguments to be passed to Camoufox as additional settings, and they take higher priority than Scrapling's settings. | ✔️ |
|
| 49 |
+
| selector_config | A dictionary of custom parsing arguments to be used when creating the final `Selector`/`Response` class. | ✔️ |
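
The `proxy` row above says the value can be a string or a dictionary limited to the keys 'server', 'username', and 'password'. The sketch below shows one way to check a value against that rule before passing it to the fetcher. `normalize_proxy` is a hypothetical helper written for this illustration, not part of Scrapling, and treating `server` as required is an assumption.

```python
# Hypothetical helper, not part of Scrapling: checks a proxy value against
# the shape described in the arguments table.
ALLOWED_PROXY_KEYS = {"server", "username", "password"}


def normalize_proxy(proxy):
    """Return the proxy as a dictionary, or raise if its shape is unexpected."""
    if isinstance(proxy, str):
        # A plain string is treated as the proxy server address.
        return {"server": proxy}
    if isinstance(proxy, dict):
        unexpected = set(proxy) - ALLOWED_PROXY_KEYS
        if unexpected:
            raise ValueError(f"Unexpected proxy keys: {sorted(unexpected)}")
        if "server" not in proxy:
            # Assumption: a proxy without a server address is not usable.
            raise ValueError("The proxy dictionary needs a 'server' key")
        return dict(proxy)
    raise TypeError("proxy must be a string or a dictionary")
```

The returned dictionary could then be passed as the `proxy` argument.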
|
| 50 |
|
| 51 |
|
| 52 |
## Examples
|
| 53 |
+
It's easier to understand with examples, so we will now review most of the arguments individually with examples.
|
| 54 |
|
| 55 |
### Browser Modes
|
| 56 |
|
|
|
|
| 58 |
# Headless/hidden mode (default)
|
| 59 |
page = StealthyFetcher.fetch('https://example.com', headless=True)
|
| 60 |
|
| 61 |
# Visible browser mode
|
| 62 |
page = StealthyFetcher.fetch('https://example.com', headless=False)
|
| 63 |
```
|
|
|
|
| 72 |
page = StealthyFetcher.fetch('https://example.com', disable_resources=True) # Blocks fonts, images, media, etc.
|
| 73 |
```
|
| 74 |
|
| 75 |
+
### Cloudflare Protection Bypass
|
| 76 |
+
|
| 77 |
+
```python
|
| 78 |
+
# Automatic Cloudflare solver
|
| 79 |
+
page = StealthyFetcher.fetch(
|
| 80 |
+
'https://nopecha.com/demo/cloudflare',
|
| 81 |
+
solve_cloudflare=True # Automatically solve Cloudflare challenges
|
| 82 |
+
)
|
| 83 |
+
|
| 84 |
+
# Works with other stealth options
|
| 85 |
+
page = StealthyFetcher.fetch(
|
| 86 |
+
'https://protected-site.com',
|
| 87 |
+
solve_cloudflare=True,
|
| 88 |
+
humanize=True,
|
| 89 |
+
geoip=True,
|
| 90 |
+
os_randomize=True
|
| 91 |
+
)
|
| 92 |
+
```
|
| 93 |
+
|
| 94 |
+
The `solve_cloudflare` parameter enables automatic detection and solving of all three types of Cloudflare's Turnstile challenges:
|
| 95 |
+
|
| 96 |
+
- JavaScript challenges (managed)
|
| 97 |
+
- Interactive challenges (clicking verification boxes)
|
| 98 |
+
- Invisible challenges (automatic background verification)
|
| 99 |
+
|
| 100 |
+
**Important notes:**
|
| 101 |
+
|
| 102 |
+
- When `solve_cloudflare=True` is enabled, `humanize=True` is automatically activated for more realistic behavior
|
| 103 |
+
- Set `timeout` to at least 60 seconds (`timeout=60000`, since the value is in milliseconds) when using the Cloudflare solver, so it has enough time to solve the challenge
|
| 104 |
+
- This feature works seamlessly with proxies and other stealth options
|
| 105 |
+
|
| 106 |
### Additional stealth options
|
| 107 |
|
| 108 |
```python
|
|
|
|
| 110 |
'https://example.com',
|
| 111 |
block_webrtc=True, # Block WebRTC
|
| 112 |
allow_webgl=False, # Disable WebGL
|
| 113 |
+
humanize=True, # Make the mouse move as a human would move it
|
| 114 |
geoip=True, # Use IP's longitude, latitude, timezone, country, and locale, then spoof the WebRTC IP address...
|
| 115 |
os_randomize=True, # Randomize the OS fingerprints used. The default is matching the fingerprints with the current OS.
|
| 116 |
disable_ads=True, # Block ads with uBlock Origin addon (disabled by default)
|
|
|
|
| 124 |
)
|
| 125 |
```
|
| 126 |
|
| 127 |
+
The `google_search` argument is enabled by default, making the request look as if it came from a Google search page. So, a request for `https://example.com` will set the referer to `https://www.google.com/search?q=example`. Also, if used together, it takes priority over the referer set by the `extra_headers` argument.
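
To make the referer rule concrete, here is a minimal sketch that reproduces the mapping from `https://example.com` to `https://www.google.com/search?q=example`. `google_referer` is a hypothetical function written for this illustration. It naively takes the first label of the host name, so Scrapling's actual extraction may differ, for example on subdomains.

```python
from urllib.parse import urlparse


def google_referer(url):
    """Build a Google-search referer from the first label of the URL's host.

    Illustration only: this is not Scrapling's implementation.
    """
    host = urlparse(url).hostname or ""
    if host.startswith("www."):
        host = host[4:]  # Drop a leading "www." so it is not used as the name
    name = host.split(".")[0]
    return f"https://www.google.com/search?q={name}"


print(google_referer("https://example.com"))
# https://www.google.com/search?q=example
```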
|
| 128 |
|
| 129 |
### Network Control
|
| 130 |
|
|
|
|
| 143 |
```
|
| 144 |
|
| 145 |
### Browser Automation
|
| 146 |
+
This is where your knowledge about [Playwright's Page API](https://playwright.dev/python/docs/api/class-page) comes into play. The function you pass here takes the page object from Playwright's API, performs the desired action, and then returns it for the current fetcher to continue processing.
|
| 147 |
|
| 148 |
+
This function is executed immediately after waiting for `network_idle` (if enabled) and before waiting for the `wait_selector` argument, allowing it to be used for various purposes, not just automation. You can alter the page as you want.
|
| 149 |
|
| 150 |
In the example below, I used page [mouse events](https://playwright.dev/python/docs/api/class-mouse) to move the mouse wheel to scroll the page and then move the mouse.
|
| 151 |
```python
|
|
|
|
| 189 |
```
|
| 190 |
This is the last wait the fetcher will do before returning the response (if enabled). You pass a CSS selector to the `wait_selector` argument, and the fetcher will wait for the state you passed in the `wait_selector_state` argument to be fulfilled. If you didn't pass a state, the default would be `attached`, which means it will wait for the element to be present in the DOM.
|
| 191 |
|
| 192 |
+
After that, the fetcher will check again to see if all JS files are loaded and executed (the `domcontentloaded` state) or continue waiting. If you have enabled `network_idle`, the fetcher will wait for `network_idle` to be fulfilled again, as explained above.
|
| 193 |
|
| 194 |
+
The states the fetcher can wait for can be any of the following ([source](https://playwright.dev/python/docs/api/class-page#page-wait-for-selector)):
|
| 195 |
|
| 196 |
+
- `attached`: Wait for an element to be present in the DOM.
|
| 197 |
+
- `detached`: Wait for an element to not be present in the DOM.
|
| 198 |
+
- `visible`: Wait for an element to have a non-empty bounding box and no `visibility:hidden`. Note that an element without any content or with `display:none` has an empty bounding box and is not considered visible.
|
| 199 |
+
- `hidden`: Wait for an element to be either detached from the DOM, have an empty bounding box, or have `visibility:hidden`. This is the opposite of the `'visible'` option.
|
| 200 |
|
| 201 |
### Firefox Addons
|
| 202 |
|
|
|
|
| 210 |
The paths here must be paths of extracted addons, which will be installed automatically upon browser launch.
|
| 211 |
|
| 212 |
### Real-world example (Amazon)
|
| 213 |
+
This is for educational purposes only. This example was generated by AI, which shows how easy it is to work with Scrapling through AI.
|
| 214 |
```python
|
| 215 |
def scrape_amazon_product(url):
|
| 216 |
# Use StealthyFetcher to bypass protection
|
|
|
|
| 232 |
}
|
| 233 |
```
|
| 234 |
|
| 235 |
+
## Session Management
|
| 236 |
+
|
| 237 |
+
To keep the browser open while you make multiple requests with the same configuration, use the `StealthySession`/`AsyncStealthySession` classes. These classes accept all the arguments that the `fetch` function takes, which lets you specify a configuration for the entire session.
|
| 238 |
+
|
| 239 |
+
```python
|
| 240 |
+
from scrapling.fetchers import StealthySession
|
| 241 |
+
|
| 242 |
+
# Create a session with default configuration
|
| 243 |
+
with StealthySession(
|
| 244 |
+
headless=True,
|
| 245 |
+
geoip=True,
|
| 246 |
+
humanize=True,
|
| 247 |
+
solve_cloudflare=True
|
| 248 |
+
) as session:
|
| 249 |
+
# Make multiple requests with the same browser instance
|
| 250 |
+
page1 = session.fetch('https://example1.com')
|
| 251 |
+
page2 = session.fetch('https://example2.com')
|
| 252 |
+
page3 = session.fetch('https://nopecha.com/demo/cloudflare')
|
| 253 |
+
|
| 254 |
+
# All requests reuse the same tab on the same browser instance
|
| 255 |
+
```
|
| 256 |
+
|
| 257 |
+
### Async Session Usage
|
| 258 |
+
|
| 259 |
+
```python
|
| 260 |
+
import asyncio
|
| 261 |
+
from scrapling.fetchers import AsyncStealthySession
|
| 262 |
+
|
| 263 |
+
async def scrape_multiple_sites():
|
| 264 |
+
async with AsyncStealthySession(
|
| 265 |
+
geoip=True,
|
| 266 |
+
os_randomize=True,
|
| 267 |
+
solve_cloudflare=True,
|
| 268 |
+
timeout=60000, # 60 seconds for Cloudflare challenges
|
| 269 |
+
max_pages=3
|
| 270 |
+
) as session:
|
| 271 |
+
# Make async requests with shared browser configuration
|
| 272 |
+
pages = await asyncio.gather(
|
| 273 |
+
session.fetch('https://site1.com'),
|
| 274 |
+
session.fetch('https://site2.com'),
|
| 275 |
+
session.fetch('https://protected-site.com')
|
| 276 |
+
)
|
| 277 |
+
return pages
|
| 278 |
+
```
|
| 279 |
+
|
| 280 |
+
You may have noticed the `max_pages` argument. This is a new argument that enables the fetcher to create a **pool of Browser tabs** that will be rotated automatically. Instead of waiting for one browser tab to become ready, it checks if the next tab in the pool is ready to be used and uses it. This allows for multiple websites to be fetched at the same time in the same browser, which saves a lot of resources, but most importantly, is so fast :)
|
| 281 |
+
|
| 282 |
+
When all tabs in the pool are busy, the fetcher checks at sub-second intervals whether a tab has become free. If none becomes free within 30 seconds, it raises a `TimeoutError`. This can happen when the website you are fetching becomes unresponsive for some reason.
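
The waiting behaviour described above can be pictured with a small, simplified sketch. `TabPool`, `acquire`, and `release` are hypothetical names used only for this illustration. The real pool lives inside Scrapling and rotates through tabs, while this toy version simply takes the first free slot and polls until a deadline passes.

```python
import time


class TabPool:
    """Toy stand-in for a pool of browser tabs; not Scrapling's real class."""

    def __init__(self, max_pages):
        # One flag per tab: True while the tab is in use.
        self.busy = [False] * max_pages

    def acquire(self, timeout=30.0, poll_interval=0.05):
        """Return the index of a free tab, polling until `timeout` seconds pass."""
        deadline = time.monotonic() + timeout
        while True:
            for index, in_use in enumerate(self.busy):
                if not in_use:
                    self.busy[index] = True
                    return index
            if time.monotonic() >= deadline:
                raise TimeoutError(f"No tab became free within {timeout} seconds")
            time.sleep(poll_interval)  # Sub-second pause before checking again

    def release(self, index):
        """Mark a tab as free so another fetch can use it."""
        self.busy[index] = False
```

With `max_pages=3`, three fetches can hold a tab at once; a fourth call to `acquire` waits until one of them calls `release` or the timeout expires.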
|
| 283 |
+
|
| 284 |
+
### Session Benefits
|
| 285 |
+
|
| 286 |
+
- **Browser reuse**: Much faster subsequent requests by reusing the same browser instance.
|
| 287 |
+
- **Cookie persistence**: Cookies and session state are handled automatically, as in any browser.
|
| 288 |
+
- **Consistent fingerprint**: Same browser fingerprint across all requests.
|
| 289 |
+
- **Memory efficiency**: Better resource usage compared to launching new browsers with each fetch.
|
| 290 |
+
|
| 291 |
## When to Use
|
| 292 |
|
| 293 |
Use StealthyFetcher when:
|
|
|
|
| 296 |
- Need a reliable browser fingerprint
|
| 297 |
- Full JavaScript support needed
|
| 298 |
- Want automatic stealth features
|
| 299 |
+
- Need browser automation
|
| 300 |
+
- Dealing with Cloudflare protection
|