Karim Shoair committed
Commit · d10d3f9
Parent(s): 24ea5ca
docs: update DynamicFetcher page

docs/fetching/dynamic.md CHANGED (+47 -74)
@@ -1,6 +1,6 @@
# Introduction

- Here, we will discuss the `DynamicFetcher` class (

As we will explain later, to automate the page, you need some knowledge of [Playwright's Page API](https://playwright.dev/python/docs/api/class-page).
@@ -23,7 +23,7 @@ Now, we will review most of the arguments one by one, using examples. If you wan
> Note: The async version of the `fetch` method is the `async_fetch` method, of course.

- This fetcher currently provides

Which are:
@@ -31,77 +31,66 @@ Which are:
```python
DynamicFetcher.fetch('https://example.com')
```
- Using it in that manner will open a Chromium browser and load the page. There are no tricks or extra features unless you enable some; it's just the plain Playwright API.

- ### 2.
- ```python
- DynamicFetcher.fetch('https://example.com', stealth=True)
- ```
- It's the same as the vanilla Playwright option, but it provides a simple stealth mode suitable for websites with a small to medium protection layer(s).
-
- Some of the things this fetcher's stealth mode does include:
-
- * Patching the CDP runtime fingerprint by using PatchRight.
- * Mimicking some real browsers' properties by injecting several JS files and using custom options.
- * Using custom flags on launch to hide Playwright even more and make it faster.
- * Generating real browser headers of the same browser type and user OS, then appending them to the request's headers.
-
- ### 3. Real Chrome
```python
DynamicFetcher.fetch('https://example.com', real_chrome=True)
```
- If you have a Google Chrome browser installed, use this option. It's the same as the first option, but it will use the Google Chrome browser installed on your device instead of Chromium.
-
- This will make your requests look more authentic, so it's less detectable, and you can even use the `stealth=True` mode with it for better results, like below:
- ```python
- DynamicFetcher.fetch('https://example.com', real_chrome=True, stealth=True)
- ```
If you don't have Google Chrome installed and want to use this option, you can use the command below in the terminal to install it for the library instead of installing it manually:
```commandline
playwright install chrome
```

- ###
```python
DynamicFetcher.fetch('https://example.com', cdp_url='ws://localhost:9222')
```
Instead of launching a browser locally (Chromium/Google Chrome), you can connect to a remote browser through the [Chrome DevTools Protocol](https://chromedevtools.github.io/devtools-protocol/).

## Full list of arguments
Scrapling provides many options with this fetcher and its session classes. To make it as simple as possible, we will list the options here and give examples of how to use most of them.

- | Argument | Description |
- |:-------------------:|-------------|
- | url | Target url |
- | headless | Pass `True` to run the browser in headless/hidden (**default**) or `False` for headful/visible mode. |
- | disable_resources | Drop requests for unnecessary resources for a speed boost. |
- | cookies | Set cookies for the next request. |
- | useragent | Pass a useragent string to be used. **Otherwise, the fetcher will generate and use a real Useragent of the same browser.** |
- | network_idle | Wait for the page until there are no network connections for at least 500 ms. |
- | load_dom | Enabled by default, wait for all JavaScript on page(s) to fully load and execute (wait for the `domcontentloaded` state). |
- | timeout | The timeout (milliseconds) used in all operations and waits through the page. The default is 30,000 ms (30 seconds). |
- | wait | The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the `Response` object. |
- | page_action | Added for automation. Pass a function that takes the `page` object and does the necessary automation. |
- | wait_selector | Wait for a specific CSS selector to be in a specific state. |
- | init_script | An absolute path to a JavaScript file to be executed on page creation for all pages in this session. |
- | wait_selector_state | Scrapling will wait for the given state to be fulfilled for the selector given with `wait_selector`. _Default state is `attached`._ |
- | google_search | Enabled by default, Scrapling will set the referer header as if this request came from a Google search of this website's domain name. |
- | extra_headers | A dictionary of extra headers to add to the request. _The referer set by the `google_search` argument takes priority over the referer set here if used together._ |
- | proxy | The proxy to be used with requests. It can be a string or a dictionary with only the keys 'server', 'username', and 'password'. |
- | extra_flags | A list of additional browser flags to pass to the browser on launch. |
- | additional_args | Additional arguments to be passed to Playwright's context as additional settings; they take higher priority than Scrapling's settings. |
- | selector_config | A dictionary of custom parsing arguments to be used when creating the final `Selector`/`Response` class. |

In session classes, all these arguments can be set globally for the session. Still, you can configure each request individually by passing some of the arguments here that can be configured on the browser tab level like: `google_search`, `timeout`, `wait`, `page_action`, `extra_headers`, `disable_resources`, `wait_selector`, `wait_selector_state`, `network_idle`, `load_dom`, and `selector_config`.

## Examples
It's easier to understand with examples, so let's take a look.
@@ -201,28 +190,14 @@ The states the fetcher can wait for can be any of the following ([source](https:
### Some Stealth Features

```python
- # Full stealth mode
- page = DynamicFetcher.fetch(
-     'https://example.com',
-     stealth=True,
-     hide_canvas=True,
-     disable_webgl=True,
-     google_search=True
- )
-
- # Custom user agent
- page = DynamicFetcher.fetch(
-     'https://example.com',
-     useragent='Mozilla/5.0...'
- )
-
- # Set browser locale
page = DynamicFetcher.fetch(
    'https://example.com',
)
```

The `google_search` argument is enabled by default, making the request appear to come from a Google search page. So, a request for `https://example.com` will set the referer to `https://www.google.com/search?q=example`. Also, if used together, it takes priority over the referer set by the `extra_headers` argument.
@@ -259,7 +234,6 @@ from scrapling.fetchers import DynamicSession
# Create a session with default configuration
with DynamicSession(
    headless=True,
-     stealth=True,
    disable_resources=True,
    real_chrome=True
) as session:
@@ -279,7 +253,6 @@ from scrapling.fetchers import AsyncDynamicSession

async def scrape_multiple_sites():
    async with AsyncDynamicSession(
-         stealth=True,
        network_idle=True,
        timeout=30000,
        max_pages=3
@@ -298,7 +271,7 @@ You may have noticed the `max_pages` argument. This is a new argument that enabl
1. If you are within the allowed range, the fetcher will create a new tab for you, and then all is as normal.
2. Otherwise, it will keep checking every subsecond if creating a new tab is allowed or not for 60 seconds, then raise `TimeoutError`. This can happen when the website you are fetching becomes unresponsive.

- This logic allows for multiple

In versions 0.3 and 0.3.1, the pool was reusing finished tabs to save more resources/time. That logic proved flawed, as it's nearly impossible to protect pages/tabs from contamination by the previous configuration used in the request before this one.
@@ -317,6 +290,6 @@ Use DynamicFetcher when:
- Want multiple browser options
- Using a real Chrome browser
- Need custom browser config
- - Want

If you want more stealth and control without much config, check out the [StealthyFetcher](stealthy.md).

# Introduction

Here, we will discuss the `DynamicFetcher` class (formerly `PlayWrightFetcher`). This class provides flexible browser automation with multiple configuration options and a few under-the-hood stealth improvements.

As we will explain later, to automate the page, you need some knowledge of [Playwright's Page API](https://playwright.dev/python/docs/api/class-page).
> Note: The async version of the `fetch` method is the `async_fetch` method, of course.

This fetcher currently provides three main run options that can be combined as desired. Which are:
```python
DynamicFetcher.fetch('https://example.com')
```
Using it in that manner will open a Chromium browser and load the page. There are optimizations for speed, and some stealth is applied automatically under the hood, but other than that, there are no tricks or extra features unless you enable some; it's just the plain Playwright API.

### 2. Real Chrome
```python
DynamicFetcher.fetch('https://example.com', real_chrome=True)
```
If you have a Google Chrome browser installed, use this option. It's the same as the first option, but it will use the Google Chrome browser installed on your device instead of Chromium. This makes your requests look more authentic and less detectable, for better results.

If you don't have Google Chrome installed and want to use this option, you can use the command below in the terminal to install it for the library instead of installing it manually:
```commandline
playwright install chrome
```

### 3. CDP Connection
```python
DynamicFetcher.fetch('https://example.com', cdp_url='ws://localhost:9222')
```
Instead of launching a browser locally (Chromium/Google Chrome), you can connect to a remote browser through the [Chrome DevTools Protocol](https://chromedevtools.github.io/devtools-protocol/).

> Note:<br/>
> * Since version 0.3.13, the `stealth` option that used to live here has been moved to the `StealthyFetcher` class with more features, as explained on the next page.<br/>
> * This makes it less confusing for new users and easier to maintain, among other reasons explained in the [StealthyFetcher page](../fetching/stealthy.md).
## Full list of arguments
Scrapling provides many options with this fetcher and its session classes. To make it as simple as possible, we will list the options here and give examples of how to use most of them.

| Argument | Description | Optional |
|:-------------------:|-------------|:--------:|
| url | Target url | ❌ |
| headless | Pass `True` to run the browser in headless/hidden (**default**) or `False` for headful/visible mode. | ✔️ |
| disable_resources | Drop requests for unnecessary resources for a speed boost. Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`. | ✔️ |
| cookies | Set cookies for the next request. | ✔️ |
| useragent | Pass a useragent string to be used. **Otherwise, the fetcher will generate and use a real Useragent of the same browser and same version.** | ✔️ |
| network_idle | Wait for the page until there are no network connections for at least 500 ms. | ✔️ |
| load_dom | Enabled by default, wait for all JavaScript on page(s) to fully load and execute (wait for the `domcontentloaded` state). | ✔️ |
| timeout | The timeout (milliseconds) used in all operations and waits through the page. The default is 30,000 ms (30 seconds). | ✔️ |
| wait | The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the `Response` object. | ✔️ |
| page_action | Added for automation. Pass a function that takes the `page` object and does the necessary automation. | ✔️ |
| wait_selector | Wait for a specific CSS selector to be in a specific state. | ✔️ |
| init_script | An absolute path to a JavaScript file to be executed on page creation for all pages in this session. | ✔️ |
| wait_selector_state | Scrapling will wait for the given state to be fulfilled for the selector given with `wait_selector`. _Default state is `attached`._ | ✔️ |
| google_search | Enabled by default, Scrapling will set the referer header as if this request came from a Google search of this website's domain name. | ✔️ |
| extra_headers | A dictionary of extra headers to add to the request. _The referer set by the `google_search` argument takes priority over the referer set here if used together._ | ✔️ |
| proxy | The proxy to be used with requests. It can be a string or a dictionary with only the keys 'server', 'username', and 'password'. | ✔️ |
| real_chrome | If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch and use an instance of your browser. | ✔️ |
| locale | Specify user locale, for example, `en-GB`, `de-DE`, etc. Locale will affect the `navigator.language` value, the `Accept-Language` request header value, and number and date formatting rules. Defaults to the system default locale. | ✔️ |
| timezone_id | Changes the timezone of the browser. Defaults to the system timezone. | ✔️ |
| cdp_url | Instead of launching a new browser instance, connect to this CDP URL to control real browsers through CDP. | ✔️ |
| user_data_dir | Path to a User Data Directory, which stores browser session data like cookies and local storage. The default is to create a temporary directory. **Only works with sessions.** | ✔️ |
| extra_flags | A list of additional browser flags to pass to the browser on launch. | ✔️ |
| additional_args | Additional arguments to be passed to Playwright's context as additional settings; they take higher priority than Scrapling's settings. | ✔️ |
| selector_config | A dictionary of custom parsing arguments to be used when creating the final `Selector`/`Response` class. | ✔️ |
In session classes, all these arguments can be set globally for the session. Still, you can configure each request individually by passing some of the arguments here that can be configured on the browser tab level like: `google_search`, `timeout`, `wait`, `page_action`, `extra_headers`, `disable_resources`, `wait_selector`, `wait_selector_state`, `network_idle`, `load_dom`, and `selector_config`.

> 🔍 Notes:<br/>
> 1. The `disable_resources` option made requests ~25% faster in my tests on some websites, and it can help save your proxy usage, but be careful with it, as it makes some websites never finish loading.<br/>
> 2. Since version 0.3.13, the `stealth` option has been removed from here in favor of the `StealthyFetcher` class, and the `hide_canvas` option has moved to it. The `disable_webgl` option has also been moved to the `StealthyFetcher` class and renamed `allow_webgl`.
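As an aside, the resource filtering that `disable_resources` performs can be pictured with a tiny sketch (purely illustrative; this is not Scrapling's internal code, and `should_abort` is a made-up name):

```python
# Resource types listed in the argument table above; when `disable_resources`
# is enabled, requests of these types are dropped before they hit the network.
BLOCKED_RESOURCE_TYPES = {
    "font", "image", "media", "beacon", "object",
    "imageset", "texttrack", "websocket", "csp_report", "stylesheet",
}

def should_abort(resource_type: str) -> bool:
    """Return True if a request of this resource type should be dropped."""
    return resource_type in BLOCKED_RESOURCE_TYPES
```

With plain Playwright, a similar effect can be achieved by registering a route handler that aborts any request whose `resource_type` falls in this set.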

## Examples
It's easier to understand with examples, so let's take a look.

### Some Stealth Features

```python
page = DynamicFetcher.fetch(
    'https://example.com',
    google_search=True,
    useragent='Mozilla/5.0...',  # Custom user agent
    locale='en-US',  # Set browser locale
)
```
If you didn't set a user agent while headless mode is enabled, the fetcher will generate a real User Agent of the same browser version and use it. If you didn't set a user agent with headless mode disabled, the fetcher will keep the browser's default User Agent, since in the latest versions it's exactly the same as a normal browser's.
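In other words, the decision reduces to a simple rule, sketched here from the documented behavior (the function name is made up and not part of Scrapling's API):

```python
def needs_generated_useragent(useragent, headless):
    """A real User Agent is generated only when no custom one was passed
    and the browser runs headless; a headful browser's default UA already
    matches a normal browser's."""
    return useragent is None and headless
```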
The `google_search` argument is enabled by default, making the request appear to come from a Google search page. So, a request for `https://example.com` will set the referer to `https://www.google.com/search?q=example`. Also, if used together, it takes priority over the referer set by the `extra_headers` argument.
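For illustration, the referer construction just described can be sketched as follows (this mirrors the documented behavior; it is not Scrapling's actual implementation):

```python
from urllib.parse import urlparse

def google_search_referer(url: str) -> str:
    """Build a referer that looks like a Google search for the site's domain name."""
    hostname = urlparse(url).hostname or ""
    parts = hostname.split(".")
    if parts and parts[0] == "www":  # drop a leading 'www.'
        parts = parts[1:]
    name = parts[0] if parts else hostname  # keep the bare domain name, no TLD
    return f"https://www.google.com/search?q={name}"

print(google_search_referer("https://example.com"))
# https://www.google.com/search?q=example
```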

# Create a session with default configuration
with DynamicSession(
    headless=True,
    disable_resources=True,
    real_chrome=True
) as session:

async def scrape_multiple_sites():
    async with AsyncDynamicSession(
        network_idle=True,
        timeout=30000,
        max_pages=3

1. If you are within the allowed range, the fetcher will create a new tab for you, and then all is as normal.
2. Otherwise, it will keep checking, at sub-second intervals for up to 60 seconds, whether creating a new tab is allowed, then raise `TimeoutError`. This can happen when the website you are fetching becomes unresponsive.

This logic allows multiple URLs to be fetched at the same time in the same browser, which saves a lot of resources, but most importantly, is so fast :)
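The tab-allocation behavior in the two steps above can be sketched in plain Python (a simplified illustration of the described logic with hypothetical names, not the actual pool implementation):

```python
import time

def acquire_tab_slot(open_tabs, max_pages, timeout=60.0, poll_interval=0.5):
    """Wait until the pool has room for a new tab, polling at sub-second
    intervals, and raise TimeoutError if no slot frees up in time."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if len(open_tabs) < max_pages:
            open_tabs.append(object())  # reserve the slot for the new tab
            return
        time.sleep(poll_interval)
    raise TimeoutError("no free browser tab became available in time")
```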

In versions 0.3 and 0.3.1, the pool reused finished tabs to save more resources/time. That logic proved flawed, as it's nearly impossible to protect pages/tabs from contamination by the configuration used in the previous request.
- Want multiple browser options
- Using a real Chrome browser
- Need custom browser config
- Want a few stealth options

If you want more stealth and control without much config, check out the [StealthyFetcher](stealthy.md).