Karim Shoair committed on
Commit ·
3ba53b4
Parent(s): e6bbd0d
docs: update website accordingly
- docs/fetching/dynamic.md +12 -12
- docs/fetching/stealthy.md +14 -11
docs/fetching/dynamic.md
CHANGED
|
@@ -17,7 +17,7 @@ Now, we will review most of the arguments one by one, using examples. If you wan
|
|
| 17 |
> Note: The async version of the `fetch` method is the `async_fetch` method, of course.
|
| 18 |
|
| 19 |
|
| 20 |
-
This fetcher currently provides four main run options
|
| 21 |
|
| 22 |
Which are:
|
| 23 |
|
|
@@ -62,7 +62,7 @@ DynamicFetcher.fetch('https://example.com', cdp_url='ws://localhost:9222')
|
|
| 62 |
Instead of launching a browser locally (Chromium/Google Chrome), you can connect to a remote browser through the [Chrome DevTools Protocol](https://chromedevtools.github.io/devtools-protocol/).
|
| 63 |
|
| 64 |
## Full list of arguments
|
| 65 |
-
Scrapling provides many options with this fetcher and its session classes. To make it as simple as possible, we will list the options here and give examples of
|
| 66 |
|
| 67 |
| Argument | Description | Optional |
|
| 68 |
|:-------------------:|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:--------:|
|
|
@@ -81,7 +81,7 @@ Scrapling provides many options with this fetcher and its session classes. To ma
|
|
| 81 |
| wait_selector_state | Scrapling will wait for the given state to be fulfilled for the selector given with `wait_selector`. _Default state is `attached`._ | ✔️ |
|
| 82 |
| google_search | Enabled by default, Scrapling will set the referer header as if this request came from a Google search of this website's domain name. | ✔️ |
|
| 83 |
| extra_headers | A dictionary of extra headers to add to the request. _The referer set by the `google_search` argument takes priority over the referer set here if used together._ | ✔️ |
|
| 84 |
-
| proxy | The proxy to be used with requests. It can be a string or a dictionary with the keys 'server', 'username', and 'password'
|
| 85 |
| hide_canvas | Add random noise to canvas operations to prevent fingerprinting. | ✔️ |
|
| 86 |
| disable_webgl | Disables WebGL and WebGL 2.0 support entirely. | ✔️ |
|
| 87 |
| stealth | Enables stealth mode; you should always check the documentation to see what the stealth mode does currently. | ✔️ |
|
|
@@ -89,7 +89,7 @@ Scrapling provides many options with this fetcher and its session classes. To ma
|
|
| 89 |
| locale | Set the locale for the browser if wanted. The default value is `en-US`. | ✔️ |
|
| 90 |
| cdp_url | Instead of launching a new browser instance, connect to this CDP URL to control real browsers through CDP. | ✔️ |
|
| 91 |
| user_data_dir | Path to a User Data Directory, which stores browser session data like cookies and local storage. The default is to create a temporary directory. **Only Works with sessions** | ✔️ |
|
| 92 |
-
| additional_args | Additional arguments to be passed to Playwright's context as additional settings, and
|
| 93 |
| selector_config | A dictionary of custom parsing arguments to be used when creating the final `Selector`/`Response` class. | ✔️ |
|
| 94 |
|
| 95 |
In session classes, all these arguments can be set globally for the session. Still, you can configure each request individually by passing some of the arguments here that can be configured on the browser tab level like: `google_search`, `timeout`, `wait`, `page_action`, `extra_headers`, `disable_resources`, `wait_selector`, `wait_selector_state`, `network_idle`, `load_dom`, and `selector_config`.
|
|
@@ -125,11 +125,11 @@ page = DynamicFetcher.fetch(
|
|
| 125 |
```
|
| 126 |
|
| 127 |
### Browser Automation
|
| 128 |
-
This is where your knowledge about [Playwright's Page API](https://playwright.dev/python/docs/api/class-page) comes into play. The function you pass here takes the page object from Playwright's API, performs the desired action, and then
|
| 129 |
|
| 130 |
-
This function is executed immediately after waiting for `network_idle` (if enabled) and before waiting for the `wait_selector` argument, allowing it to be used for
|
| 131 |
|
| 132 |
-
In the example below, I used
|
| 133 |
```python
|
| 134 |
from playwright.sync_api import Page
|
| 135 |
|
|
@@ -203,9 +203,9 @@ page = DynamicFetcher.fetch(
|
|
| 203 |
locale='en-US'
|
| 204 |
)
|
| 205 |
```
|
| 206 |
-
Hence, the `hide_canvas` argument doesn't disable the canvas
|
| 207 |
|
| 208 |
-
The `google_search` argument is enabled by default, making the request
|
| 209 |
|
| 210 |
### General example
|
| 211 |
```python
|
|
@@ -274,14 +274,14 @@ async def scrape_multiple_sites():
|
|
| 274 |
return pages
|
| 275 |
```
|
| 276 |
|
| 277 |
-
You may have noticed the `max_pages` argument. This is a new argument that enables the fetcher to create a **pool of Browser tabs**
|
| 278 |
|
| 279 |
1. If you are within the allowed range, the fetcher will create a new tab for you, and then all is as normal.
|
| 280 |
-
2. Otherwise, it will keep checking every subsecond if creating a new tab is allowed or not for 60 seconds, then raise `TimeoutError`. This can happen when the website you are fetching becomes unresponsive
|
| 281 |
|
| 282 |
This logic allows for multiple websites to be fetched at the same time in the same browser, which saves a lot of resources, but most importantly, is so fast :)
|
| 283 |
|
| 284 |
-
In versions 0.3 and 0.3.1, the pool was reusing finished tabs to save more resources/time. That logic proved
|
| 285 |
|
| 286 |
### Session Benefits
|
| 287 |
|
|
|
|
| 17 |
> Note: The async version of the `fetch` method is the `async_fetch` method, of course.
|
| 18 |
|
| 19 |
|
| 20 |
+
This fetcher currently provides four main run options that can be combined as desired.
|
| 21 |
|
| 22 |
Which are:
|
| 23 |
|
|
|
|
| 62 |
Instead of launching a browser locally (Chromium/Google Chrome), you can connect to a remote browser through the [Chrome DevTools Protocol](https://chromedevtools.github.io/devtools-protocol/).
|
| 63 |
|
| 64 |
## Full list of arguments
|
| 65 |
+
Scrapling provides many options with this fetcher and its session classes. To make it as simple as possible, we will list the options here and give examples of how to use most of them.
|
| 66 |
|
| 67 |
| Argument | Description | Optional |
|
| 68 |
|:-------------------:|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:--------:|
|
|
|
|
| 81 |
| wait_selector_state | Scrapling will wait for the given state to be fulfilled for the selector given with `wait_selector`. _Default state is `attached`._ | ✔️ |
|
| 82 |
| google_search | Enabled by default, Scrapling will set the referer header as if this request came from a Google search of this website's domain name. | ✔️ |
|
| 83 |
| extra_headers | A dictionary of extra headers to add to the request. _The referer set by the `google_search` argument takes priority over the referer set here if used together._ | ✔️ |
|
| 84 |
+
| proxy | The proxy to be used with requests. It can be a string or a dictionary with only the keys 'server', 'username', and 'password'. | ✔️ |
|
| 85 |
| hide_canvas | Add random noise to canvas operations to prevent fingerprinting. | ✔️ |
|
| 86 |
| disable_webgl | Disables WebGL and WebGL 2.0 support entirely. | ✔️ |
|
| 87 |
| stealth | Enables stealth mode; you should always check the documentation to see what the stealth mode does currently. | ✔️ |
|
|
|
|
| 89 |
| locale | Set the locale for the browser if wanted. The default value is `en-US`. | ✔️ |
|
| 90 |
| cdp_url | Instead of launching a new browser instance, connect to this CDP URL to control real browsers through CDP. | ✔️ |
|
| 91 |
| user_data_dir | Path to a User Data Directory, which stores browser session data like cookies and local storage. The default is to create a temporary directory. **Only Works with sessions** | ✔️ |
|
| 92 |
+
| additional_args | Additional arguments to be passed to Playwright's context as additional settings, and they take higher priority than Scrapling's settings. | ✔️ |
|
| 93 |
| selector_config | A dictionary of custom parsing arguments to be used when creating the final `Selector`/`Response` class. | ✔️ |
|
| 94 |
|
| 95 |
In session classes, all these arguments can be set globally for the session. Still, you can configure each request individually by passing some of the arguments here that can be configured on the browser tab level like: `google_search`, `timeout`, `wait`, `page_action`, `extra_headers`, `disable_resources`, `wait_selector`, `wait_selector_state`, `network_idle`, `load_dom`, and `selector_config`.
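To illustrate the two accepted forms of the `proxy` argument from the table above, here is a small hypothetical helper (not part of Scrapling) that normalizes both forms into the documented dictionary shape:

```python
# Hypothetical sketch: normalize the two `proxy` forms the table describes.
ALLOWED_KEYS = {"server", "username", "password"}

def normalize_proxy(proxy):
    """Turn a proxy string or dict into the documented dict form."""
    if isinstance(proxy, str):
        # A bare string is treated as the proxy server address.
        return {"server": proxy}
    if isinstance(proxy, dict):
        unknown = set(proxy) - ALLOWED_KEYS
        if unknown:
            raise ValueError(f"Unexpected proxy keys: {sorted(unknown)}")
        return dict(proxy)
    raise TypeError("proxy must be a string or a dict")
```

So both `proxy='http://127.0.0.1:8080'` and `proxy={'server': 'http://127.0.0.1:8080', 'username': 'user', 'password': 'pass'}` describe the same thing; any other dictionary key would be rejected.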
|
|
|
|
| 125 |
```
|
| 126 |
|
| 127 |
### Browser Automation
|
| 128 |
+
This is where your knowledge about [Playwright's Page API](https://playwright.dev/python/docs/api/class-page) comes into play. The function you pass here takes the page object from Playwright's API, performs the desired action, and then the fetcher continues.
|
| 129 |
|
| 130 |
+
This function is executed immediately after waiting for `network_idle` (if enabled) and before waiting for the `wait_selector` argument, allowing it to be used for purposes beyond automation. You can alter the page as you want.
|
| 131 |
|
| 132 |
+
In the example below, I used the page's [mouse events](https://playwright.dev/python/docs/api/class-mouse) to scroll the page with the mouse wheel, then move the mouse.
|
| 133 |
```python
|
| 134 |
from playwright.sync_api import Page
|
| 135 |
|
|
|
|
| 203 |
locale='en-US'
|
| 204 |
)
|
| 205 |
```
|
| 206 |
+
Hence, the `hide_canvas` argument doesn't disable the canvas; instead, it adds random noise to canvas operations to prevent fingerprinting. Also, if you didn't set a user agent (which is preferred), the fetcher will generate a real user agent matching the same browser and use it.
|
| 207 |
|
| 208 |
+
The `google_search` argument is enabled by default, making the request appear to come from a Google search page. So, a request for `https://example.com` will set the referer to `https://www.google.com/search?q=example`. Also, when both are used together, it takes priority over the referer set by the `extra_headers` argument.
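The referer construction described above can be sketched in plain Python. This is a rough illustration of the stated behavior (taking the first hostname label after any `www.` as the search query), not Scrapling's actual implementation:

```python
from urllib.parse import urlparse

def google_referer(url: str) -> str:
    """Build a Google-search referer from a URL's domain name,
    mirroring the documented behavior (a rough sketch)."""
    host = urlparse(url).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    name = host.split(".")[0]  # 'example.com' -> 'example'
    return f"https://www.google.com/search?q={name}"

print(google_referer("https://example.com"))
# https://www.google.com/search?q=example
```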
|
| 209 |
|
| 210 |
### General example
|
| 211 |
```python
|
|
|
|
| 274 |
return pages
|
| 275 |
```
|
| 276 |
|
| 277 |
+
You may have noticed the `max_pages` argument. This is a new argument that enables the fetcher to create a **rotating pool of Browser tabs**. Instead of using a single tab for all your requests, you set a limit on the maximum number of pages. With each request, the library will close all tabs that have finished their task and check if the number of current tabs is lower than the maximum allowed number of pages/tabs, then:
|
| 278 |
|
| 279 |
1. If you are within the allowed range, the fetcher will create a new tab for you, and then all is as normal.
|
| 280 |
+
2. Otherwise, it will keep checking at sub-second intervals, for up to 60 seconds, whether creating a new tab is allowed, then raise `TimeoutError`. This can happen when the website you are fetching becomes unresponsive.
|
| 281 |
|
| 282 |
This logic allows for multiple websites to be fetched at the same time in the same browser, which saves a lot of resources, but most importantly, is so fast :)
|
| 283 |
|
| 284 |
+
In versions 0.3 and 0.3.1, the pool reused finished tabs to save more resources/time. That logic proved flawed, as it's nearly impossible to protect pages/tabs from contamination by the configuration used in the previous request.
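The pool's admission logic described above can be sketched as follows. This is an illustrative approximation, not Scrapling's actual code; `open_tab_count` is a hypothetical callable standing in for the pool's bookkeeping:

```python
import time

def wait_for_tab_slot(open_tab_count, max_pages, timeout=60.0, poll=0.1):
    """Sketch of the tab-pool admission logic: poll at sub-second
    intervals until a new tab is allowed, or raise TimeoutError
    after `timeout` seconds.

    `open_tab_count` is a callable returning how many tabs remain
    open after finished tabs have been closed."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if open_tab_count() < max_pages:
            return True  # within the allowed range: a new tab may be created
        time.sleep(poll)
    raise TimeoutError("No browser tab slot became free within the timeout")
```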
|
| 285 |
|
| 286 |
### Session Benefits
|
| 287 |
|
docs/fetching/stealthy.md
CHANGED
|
@@ -1,6 +1,6 @@
|
|
| 1 |
# Introduction
|
| 2 |
|
| 3 |
-
Here, we will discuss the `StealthyFetcher` class. This class is similar to [DynamicFetcher](dynamic.md#introduction) in many ways,
|
| 4 |
|
| 5 |
As with [DynamicFetcher](dynamic.md#introduction), you will need some knowledge about [Playwright's Page API](https://playwright.dev/python/docs/api/class-page) to automate the page, as we will explain later.
|
| 6 |
|
|
@@ -43,7 +43,7 @@ Scrapling provides many options with this fetcher and its session classes. Befor
|
|
| 43 |
| wait_selector | Wait for a specific css selector to be in a specific state. | ✔️ |
|
| 44 |
| init_script | An absolute path to a JavaScript file to be executed on page creation for all pages in this session. | ✔️ |
|
| 45 |
| wait_selector_state | Scrapling will wait for the given state to be fulfilled for the selector given with `wait_selector`. _Default state is `attached`._ | ✔️ |
|
| 46 |
-
| proxy | The proxy to be used with requests. It can be a string or a dictionary with the keys 'server', 'username', and 'password'
|
| 47 |
| user_data_dir | Path to a User Data Directory, which stores browser session data like cookies and local storage. The default is to create a temporary directory. **Only Works with sessions** | ✔️ |
|
| 48 |
| additional_args | Additional arguments to be passed to Camoufox as additional settings, and they take higher priority than Scrapling's settings. | ✔️ |
|
| 49 |
| selector_config | A dictionary of custom parsing arguments to be used when creating the final `Selector`/`Response` class. | ✔️ |
|
|
@@ -51,7 +51,7 @@ Scrapling provides many options with this fetcher and its session classes. Befor
|
|
| 51 |
In session classes, all these arguments can be set globally for the session. Still, you can configure each request individually by passing some of the arguments here that can be configured on the browser tab level like: `google_search`, `timeout`, `wait`, `page_action`, `extra_headers`, `disable_resources`, `wait_selector`, `wait_selector_state`, `network_idle`, `load_dom`, `solve_cloudflare`, and `selector_config`.
|
| 52 |
|
| 53 |
## Examples
|
| 54 |
-
It's easier to understand with examples, so we will now review most of the arguments individually
|
| 55 |
|
| 56 |
### Browser Modes
|
| 57 |
|
|
@@ -98,8 +98,11 @@ The `solve_cloudflare` parameter enables automatic detection and solving all typ
|
|
| 98 |
- Interactive challenges (clicking verification boxes)
|
| 99 |
- Invisible challenges (automatic background verification)
|
| 100 |
|
|
|
|
|
|
|
| 101 |
**Important notes:**
|
| 102 |
|
|
|
|
| 103 |
- When `solve_cloudflare=True` is enabled, `humanize=True` is automatically activated for more realistic behavior
|
| 104 |
- The timeout should be at least 60 seconds when using the Cloudflare solver for sufficient challenge-solving time
|
| 105 |
- This feature works seamlessly with proxies and other stealth options
|
|
@@ -125,7 +128,7 @@ page = StealthyFetcher.fetch(
|
|
| 125 |
)
|
| 126 |
```
|
| 127 |
|
| 128 |
-
The `google_search` argument is enabled by default, making the request
|
| 129 |
|
| 130 |
### Network Control
|
| 131 |
|
|
@@ -144,11 +147,11 @@ page = StealthyFetcher.fetch(
|
|
| 144 |
```
|
| 145 |
|
| 146 |
### Browser Automation
|
| 147 |
-
This is where your knowledge about [Playwright's Page API](https://playwright.dev/python/docs/api/class-page) comes into play. The function you pass here takes the page object from Playwright's API, performs the desired action, and then
|
| 148 |
|
| 149 |
-
This function is executed immediately after waiting for `network_idle` (if enabled) and before waiting for the `wait_selector` argument, allowing it to be used for
|
| 150 |
|
| 151 |
-
In the example below, I used
|
| 152 |
```python
|
| 153 |
from playwright.sync_api import Page
|
| 154 |
|
|
@@ -206,7 +209,7 @@ page = StealthyFetcher.fetch(
|
|
| 206 |
addons=['/path/to/addon1', '/path/to/addon2']
|
| 207 |
)
|
| 208 |
```
|
| 209 |
-
The paths here must
|
| 210 |
|
| 211 |
### Real-world example (Amazon)
|
| 212 |
This is for educational purposes only; this example was generated by AI, which shows how easy it is to work with Scrapling through AI
|
|
@@ -276,14 +279,14 @@ async def scrape_multiple_sites():
|
|
| 276 |
return pages
|
| 277 |
```
|
| 278 |
|
| 279 |
-
You may have noticed the `max_pages` argument. This is a new argument that enables the fetcher to create a **pool of Browser tabs**
|
| 280 |
|
| 281 |
1. If you are within the allowed range, the fetcher will create a new tab for you, and then all is as normal.
|
| 282 |
-
2. Otherwise, it will keep checking every subsecond if creating a new tab is allowed or not for 60 seconds, then raise `TimeoutError`. This can happen when the website you are fetching becomes unresponsive
|
| 283 |
|
| 284 |
This logic allows for multiple websites to be fetched at the same time in the same browser, which saves a lot of resources, but most importantly, is so fast :)
|
| 285 |
|
| 286 |
-
In versions 0.3 and 0.3.1, the pool was reusing finished tabs to save more resources/time. That logic proved
|
| 287 |
|
| 288 |
### Session Benefits
|
| 289 |
|
|
|
|
| 1 |
# Introduction
|
| 2 |
|
| 3 |
+
Here, we will discuss the `StealthyFetcher` class. This class is similar to [DynamicFetcher](dynamic.md#introduction) in many ways, including browser automation and the use of [Playwright's API](https://playwright.dev/python/docs/intro). The main difference is that this class provides advanced anti-bot protection bypass capabilities through a modified Firefox browser called [Camoufox](https://github.com/daijro/camoufox), from which most of its stealth comes.
|
| 4 |
|
| 5 |
As with [DynamicFetcher](dynamic.md#introduction), you will need some knowledge about [Playwright's Page API](https://playwright.dev/python/docs/api/class-page) to automate the page, as we will explain later.
|
| 6 |
|
|
|
|
| 43 |
| wait_selector | Wait for a specific css selector to be in a specific state. | ✔️ |
|
| 44 |
| init_script | An absolute path to a JavaScript file to be executed on page creation for all pages in this session. | ✔️ |
|
| 45 |
| wait_selector_state | Scrapling will wait for the given state to be fulfilled for the selector given with `wait_selector`. _Default state is `attached`._ | ✔️ |
|
| 46 |
+
| proxy | The proxy to be used with requests. It can be a string or a dictionary with only the keys 'server', 'username', and 'password'. | ✔️ |
|
| 47 |
| user_data_dir | Path to a User Data Directory, which stores browser session data like cookies and local storage. The default is to create a temporary directory. **Only Works with sessions** | ✔️ |
|
| 48 |
| additional_args | Additional arguments to be passed to Camoufox as additional settings, and they take higher priority than Scrapling's settings. | ✔️ |
|
| 49 |
| selector_config | A dictionary of custom parsing arguments to be used when creating the final `Selector`/`Response` class. | ✔️ |
|
|
|
|
| 51 |
In session classes, all these arguments can be set globally for the session. Still, you can configure each request individually by passing some of the arguments here that can be configured on the browser tab level like: `google_search`, `timeout`, `wait`, `page_action`, `extra_headers`, `disable_resources`, `wait_selector`, `wait_selector_state`, `network_idle`, `load_dom`, `solve_cloudflare`, and `selector_config`.
|
| 52 |
|
| 53 |
## Examples
|
| 54 |
+
It's easier to understand with examples, so we will now review most of the arguments individually.
|
| 55 |
|
| 56 |
### Browser Modes
|
| 57 |
|
|
|
|
| 98 |
- Interactive challenges (clicking verification boxes)
|
| 99 |
- Invisible challenges (automatic background verification)
|
| 100 |
|
| 101 |
+
It even solves custom challenge pages.
|
| 102 |
+
|
| 103 |
**Important notes:**
|
| 104 |
|
| 105 |
+
- Sometimes, with websites that use custom implementations, you will need `wait_selector` to make sure Scrapling waits for the real website content to load after solving the captcha. Some websites are edge cases in themselves, while we try to keep the solver as generic as possible.
|
| 106 |
- When `solve_cloudflare=True` is enabled, `humanize=True` is automatically activated for more realistic behavior
|
| 107 |
- The timeout should be at least 60 seconds when using the Cloudflare solver for sufficient challenge-solving time
|
| 108 |
- This feature works seamlessly with proxies and other stealth options
|
|
|
|
| 128 |
)
|
| 129 |
```
|
| 130 |
|
| 131 |
+
The `google_search` argument is enabled by default, making the request appear to come from a Google search page. So, a request for `https://example.com` will set the referer to `https://www.google.com/search?q=example`. Also, when both are used together, it takes priority over the referer set by the `extra_headers` argument.
|
| 132 |
|
| 133 |
### Network Control
|
| 134 |
|
|
|
|
| 147 |
```
|
| 148 |
|
| 149 |
### Browser Automation
|
| 150 |
+
This is where your knowledge about [Playwright's Page API](https://playwright.dev/python/docs/api/class-page) comes into play. The function you pass here takes the page object from Playwright's API, performs the desired action, and then the fetcher continues.
|
| 151 |
|
| 152 |
+
This function is executed immediately after waiting for `network_idle` (if enabled) and before waiting for the `wait_selector` argument, allowing it to be used for purposes beyond automation. You can alter the page as you want.
|
| 153 |
|
| 154 |
+
In the example below, I used the page's [mouse events](https://playwright.dev/python/docs/api/class-mouse) to scroll the page with the mouse wheel, then move the mouse.
|
| 155 |
```python
|
| 156 |
from playwright.sync_api import Page
|
| 157 |
|
|
|
|
| 209 |
addons=['/path/to/addon1', '/path/to/addon2']
|
| 210 |
)
|
| 211 |
```
|
| 212 |
+
The paths here must point to extracted addons that will be installed automatically upon browser launch.
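A small pre-flight check along these lines can catch the common mistake of passing a packed `.xpi`/`.zip` file instead of an extracted directory. This is a hypothetical helper, not part of Scrapling or Camoufox:

```python
import os

def check_addon_paths(paths):
    """Hypothetical pre-flight check: verify each addon path points
    to an extracted (unzipped) addon directory, as required above,
    rather than a packed .xpi/.zip file."""
    bad = [p for p in paths if not os.path.isdir(p)]
    if bad:
        raise ValueError(f"Not extracted addon directories: {bad}")
    return True
```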
|
| 213 |
|
| 214 |
### Real-world example (Amazon)
|
| 215 |
This is for educational purposes only; this example was generated by AI, which shows how easy it is to work with Scrapling through AI
|
|
|
|
| 279 |
return pages
|
| 280 |
```
|
| 281 |
|
| 282 |
+
You may have noticed the `max_pages` argument. This is a new argument that enables the fetcher to create a **rotating pool of Browser tabs**. Instead of using a single tab for all your requests, you set a limit on the maximum number of pages. With each request, the library will close all tabs that have finished their task and check if the number of current tabs is lower than the maximum allowed number of pages/tabs, then:
|
| 283 |
|
| 284 |
1. If you are within the allowed range, the fetcher will create a new tab for you, and then all is as normal.
|
| 285 |
+
2. Otherwise, it will keep checking at sub-second intervals, for up to 60 seconds, whether creating a new tab is allowed, then raise `TimeoutError`. This can happen when the website you are fetching becomes unresponsive.
|
| 286 |
|
| 287 |
This logic allows for multiple websites to be fetched at the same time in the same browser, which saves a lot of resources, but most importantly, is so fast :)
|
| 288 |
|
| 289 |
+
In versions 0.3 and 0.3.1, the pool reused finished tabs to save more resources/time. That logic proved flawed, as it's nearly impossible to protect pages/tabs from contamination by the configuration used in the previous request.
|
| 290 |
|
| 291 |
### Session Benefits
|
| 292 |
|