Karim Shoair committed
Commit d540c57 · 1 parent: d10d3f9

docs: update DynamicFetcher page

Files changed (1): docs/fetching/dynamic.md (+45 −56)
docs/fetching/dynamic.md CHANGED
@@ -1,6 +1,6 @@
  # Introduction

- Here, we will discuss the `DynamicFetcher` class (formerly `PlayWrightFetcher`). This class provides flexible browser automation with multiple configuration options and little underhood stealth improvements.

  As we will explain later, to automate the page, you need some knowledge of [Playwright's Page API](https://playwright.dev/python/docs/api/class-page).

@@ -31,13 +31,13 @@ Which are:
  ```python
  DynamicFetcher.fetch('https://example.com')
  ```
- Using it in that manner will open a Chromium browser and load the page. There are optimizations for speed, and some stealth goes automatically under the hood but other than that there are no tricks or extra features unless you enable some; it's just a plain PlayWright API.

  ### 2. Real Chrome
  ```python
  DynamicFetcher.fetch('https://example.com', real_chrome=True)
  ```
- If you have a Google Chrome browser installed, use this option. It's the same as the first option, but it will use the Google Chrome browser you installed on your device instead of Chromium. This will make your requests look more authentic, so it's less detectable for better results.

  If you don't have Google Chrome installed and want to use this option, you can use the command below in the terminal to install it for the library instead of installing it manually:
  ```commandline
@@ -51,45 +51,49 @@ DynamicFetcher.fetch('https://example.com', cdp_url='ws://localhost:9222')
  Instead of launching a browser locally (Chromium/Google Chrome), you can connect to a remote browser through the [Chrome DevTools Protocol](https://chromedevtools.github.io/devtools-protocol/).

- > Note:<br/>
- > * There was a `stealth` option here, but it was moved to `StealthyFetcher` class as explained in the next page with more features since version 0.3.13.<br/>
- > * This makes it less confusing for new users, easier to maintain and other reasons explained in the [StealthyFetcher page](../fetching/stealthy.md).

  ## Full list of arguments
  Scrapling provides many options with this fetcher and its session classes. To make it as simple as possible, we will list the options here and give examples of how to use most of them.

- | Argument | Description | Optional |
- |:-------------------:|-------------|:--------:|
- | url | Target url | ❌ |
- | headless | Pass `True` to run the browser in headless/hidden (**default**) or `False` for headful/visible mode. | ✔️ |
- | disable_resources | Drop requests for unnecessary resources for a speed boost. Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`. | ✔️ |
- | cookies | Set cookies for the next request. | ✔️ |
- | useragent | Pass a useragent string to be used. **Otherwise, the fetcher will generate and use a real Useragent of the same browser and same version.** | ✔️ |
- | network_idle | Wait for the page until there are no network connections for at least 500 ms. | ✔️ |
- | load_dom | Enabled by default, wait for all JavaScript on page(s) to fully load and execute (wait for the `domcontentloaded` state). | ✔️ |
- | timeout | The timeout (milliseconds) used in all operations and waits through the page. The default is 30,000 ms (30 seconds). | ✔️ |
- | wait | The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the `Response` object. | ✔️ |
- | page_action | Added for automation. Pass a function that takes the `page` object and does the necessary automation. | ✔️ |
- | wait_selector | Wait for a specific css selector to be in a specific state. | ✔️ |
- | init_script | An absolute path to a JavaScript file to be executed on page creation for all pages in this session. | ✔️ |
- | wait_selector_state | Scrapling will wait for the given state to be fulfilled for the selector given with `wait_selector`. _Default state is `attached`._ | ✔️ |
- | google_search | Enabled by default, Scrapling will set the referer header as if this request came from a Google search of this website's domain name. | ✔️ |
- | extra_headers | A dictionary of extra headers to add to the request. _The referer set by the `google_search` argument takes priority over the referer set here if used together._ | ✔️ |
- | proxy | The proxy to be used with requests. It can be a string or a dictionary with only the keys 'server', 'username', and 'password'. | ✔️ |
- | real_chrome | If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch and use an instance of your browser. | ✔️ |
- | locale | Specify user locale, for example, `en-GB`, `de-DE`, etc. Locale will affect `navigator.language` value, `Accept-Language` request header value as well as number and date formatting rules. Defaults to the system default locale. | ✔️ |
- | timezone_id | Changes the timezone of the browser. Defaults to the system timezone. | ✔️ |
- | cdp_url | Instead of launching a new browser instance, connect to this CDP URL to control real browsers through CDP. | ✔️ |
- | user_data_dir | Path to a User Data Directory, which stores browser session data like cookies and local storage. The default is to create a temporary directory. **Only Works with sessions** | ✔️ |
- | extra_flags | A list of additional browser flags to pass to the browser on launch. | ✔️ |
- | additional_args | Additional arguments to be passed to Playwright's context as additional settings, and they take higher priority than Scrapling's settings. | ✔️ |
- | selector_config | A dictionary of custom parsing arguments to be used when creating the final `Selector`/`Response` class. | ✔️ |

  In session classes, all these arguments can be set globally for the session. Still, you can configure each request individually by passing some of the arguments here that can be configured on the browser tab level like: `google_search`, `timeout`, `wait`, `page_action`, `extra_headers`, `disable_resources`, `wait_selector`, `wait_selector_state`, `network_idle`, `load_dom`, and `selector_config`.

- > 🔍 Notes:<br/>
- > 1. The `disable_resources` made requests ~25% faster in my tests for some websites and it can help save your proxy usage but be careful with this option as it makes some websites never finish loading.<br/>
- > 2. Since version 0.3.13, the `stealth` option has been removed from here in favor for the `StealthyFetcher` class and the `hide_canvas` option moved to it. The `disable_webgl` has been moved to the `StealthyFetcher` class as well and renamed as `allow_webgl`.

  ## Examples
@@ -99,10 +103,7 @@ It's easier to understand with examples, so let's take a look.

  ```python
  # Disable unnecessary resources
- page = DynamicFetcher.fetch(
-     'https://example.com',
-     disable_resources=True  # Blocks fonts, images, media, etc...
- )
  ```

  ### Network Control
@@ -114,11 +115,8 @@ page = DynamicFetcher.fetch('https://example.com', network_idle=True)

  # Custom timeout (in milliseconds)
  page = DynamicFetcher.fetch('https://example.com', timeout=30000)  # 30 seconds

- # Proxy support
- page = DynamicFetcher.fetch(
-     'https://example.com',
-     proxy='http://username:password@host:port'  # Or it can be a dictionary with the keys 'server', 'username', and 'password' only
- )
  ```

  ### Downloading Files
@@ -130,7 +128,7 @@ with open(file='poster.png', mode='wb') as f:
      f.write(page.body)
  ```

- The `body` attribute of the `Response` object is a `bytes` object containing the response body in case of Non-HTML responses.

  ### Browser Automation
  This is where your knowledge about [Playwright's Page API](https://playwright.dev/python/docs/api/class-page) comes into play. The function you pass here takes the page object from Playwright's API, performs the desired action, and then the fetcher continues.
@@ -146,10 +144,7 @@ def scroll_page(page: Page):
      page.mouse.move(100, 400)
      page.mouse.up()

- page = DynamicFetcher.fetch(
-     'https://example.com',
-     page_action=scroll_page
- )
  ```
  Of course, if you use the async fetch version, the function must also be async.
  ```python
@@ -160,10 +155,7 @@ async def scroll_page(page: Page):
      await page.mouse.move(100, 400)
      await page.mouse.up()

- page = await DynamicFetcher.async_fetch(
-     'https://example.com',
-     page_action=scroll_page
- )
  ```

  ### Wait Conditions
@@ -197,9 +189,6 @@ page = DynamicFetcher.fetch(
      locale='en-US',  # Set browser locale
  )
  ```
- If you didn't set a user agent and enabled headless mode, the fetcher will generate a real User Agent of the same browser version and use it. If you didn't set a user agent without enabling headless mode, the fetcher will leave the browser's default User Agent because it's the same exact as normal browsers in the latest versions.
-
- The `google_search` argument is enabled by default, making the request appear to come from a Google search page. So, a request for `https://example.com` will set the referer to `https://www.google.com/search?q=example`. Also, if used together, it takes priority over the referer set by the `extra_headers` argument.

  ### General example
  ```python
 
  # Introduction

+ Here, we will discuss the `DynamicFetcher` class (formerly `PlayWrightFetcher`). This class provides flexible browser automation with multiple configuration options and a few under-the-hood stealth improvements.

  As we will explain later, to automate the page, you need some knowledge of [Playwright's Page API](https://playwright.dev/python/docs/api/class-page).
 
  ```python
  DynamicFetcher.fetch('https://example.com')
  ```
+ Using it in that manner will open a Chromium browser and load the page. There are optimizations for speed, and some stealth goes automatically under the hood, but other than that, there are no tricks or extra features unless you enable some; it's just the plain Playwright API.

  ### 2. Real Chrome
  ```python
  DynamicFetcher.fetch('https://example.com', real_chrome=True)
  ```
+ If you have a Google Chrome browser installed, use this option. It's the same as the first option, but it will use the Google Chrome browser installed on your device instead of Chromium. This makes your requests look more authentic, so they're less detectable for better results.

  If you don't have Google Chrome installed and want to use this option, you can use the command below in the terminal to install it for the library instead of installing it manually:
  ```commandline
 
  Instead of launching a browser locally (Chromium/Google Chrome), you can connect to a remote browser through the [Chrome DevTools Protocol](https://chromedevtools.github.io/devtools-protocol/).

+ > Notes:
+ >
+ > * There was a `stealth` option here, but since version 0.3.13 it has been moved to the `StealthyFetcher` class, as explained on the next page, with additional features.
+ > * This makes it less confusing for new users, easier to maintain, and provides other benefits, as explained on the [StealthyFetcher page](../fetching/stealthy.md).

  ## Full list of arguments
  Scrapling provides many options with this fetcher and its session classes. To make it as simple as possible, we will list the options here and give examples of how to use most of them.

+ | Argument | Description | Optional |
+ |:-------------------:|-------------|:--------:|
+ | url | Target URL | ❌ |
+ | headless | Pass `True` to run the browser in headless/hidden (**default**) or `False` for headful/visible mode. | ✔️ |
+ | disable_resources | Drop requests for unnecessary resources for a speed boost. Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`. | ✔️ |
+ | cookies | Set cookies for the next request. | ✔️ |
+ | useragent | Pass a useragent string to be used. **Otherwise, the fetcher will generate and use a real user agent for the same browser and version.** | ✔️ |
+ | network_idle | Wait for the page until there are no network connections for at least 500 ms. | ✔️ |
+ | load_dom | Enabled by default, wait for all JavaScript on page(s) to fully load and execute (wait for the `domcontentloaded` state). | ✔️ |
+ | timeout | The timeout (milliseconds) used in all operations and waits through the page. The default is 30,000 ms (30 seconds). | ✔️ |
+ | wait | The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the `Response` object. | ✔️ |
+ | page_action | Added for automation. Pass a function that takes the `page` object and does the necessary automation. | ✔️ |
+ | wait_selector | Wait for a specific CSS selector to be in a specific state. | ✔️ |
+ | init_script | An absolute path to a JavaScript file to be executed on page creation for all pages in this session. | ✔️ |
+ | wait_selector_state | Scrapling will wait for the given state to be fulfilled for the selector given with `wait_selector`. _Default state is `attached`._ | ✔️ |
+ | google_search | Enabled by default, Scrapling will set the referer header as if this request came from a Google search of this website's domain name. | ✔️ |
+ | extra_headers | A dictionary of extra headers to add to the request. _The referer set by the `google_search` argument takes priority over the referer set here if used together._ | ✔️ |
+ | proxy | The proxy to be used with requests. It can be a string or a dictionary with only the keys 'server', 'username', and 'password'. | ✔️ |
+ | real_chrome | If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch and use an instance of your browser. | ✔️ |
+ | locale | Specify user locale, for example, `en-GB`, `de-DE`, etc. Locale will affect the `navigator.language` value, the `Accept-Language` request header value, as well as number and date formatting rules. Defaults to the system default locale. | ✔️ |
+ | timezone_id | Changes the timezone of the browser. Defaults to the system timezone. | ✔️ |
+ | cdp_url | Instead of launching a new browser instance, connect to this CDP URL to control real browsers through CDP. | ✔️ |
+ | user_data_dir | Path to a User Data Directory, which stores browser session data like cookies and local storage. The default is to create a temporary directory. **Only works with sessions.** | ✔️ |
+ | extra_flags | A list of additional browser flags to pass to the browser on launch. | ✔️ |
+ | additional_args | Additional arguments to be passed to Playwright's context as additional settings; they take higher priority than Scrapling's settings. | ✔️ |
+ | selector_config | A dictionary of custom parsing arguments to be used when creating the final `Selector`/`Response` class. | ✔️ |

  In session classes, all these arguments can be set globally for the session. Still, you can configure each request individually by passing some of the arguments here that can be configured on the browser tab level like: `google_search`, `timeout`, `wait`, `page_action`, `extra_headers`, `disable_resources`, `wait_selector`, `wait_selector_state`, `network_idle`, `load_dom`, and `selector_config`.
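The session-versus-request precedence above can be pictured as a simple dictionary merge. The sketch below is plain Python and purely illustrative (not Scrapling's actual implementation); the setting names are taken from the arguments table:

```python
# Hypothetical sketch of the precedence rule: per-request arguments
# override the session-wide defaults, everything else is inherited.
session_defaults = {'timeout': 30000, 'google_search': True, 'network_idle': False}

def effective_settings(defaults: dict, **per_request) -> dict:
    merged = dict(defaults)     # start from the session-wide settings
    merged.update(per_request)  # request-level values take priority
    return merged

print(effective_settings(session_defaults, timeout=60000))
# → {'timeout': 60000, 'google_search': True, 'network_idle': False}
```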
+ > 🔍 Notes:
+ >
+ > 1. The `disable_resources` option made requests ~25% faster in my tests for some websites and can help save your proxy usage, but be careful with it, as it can cause some websites to never finish loading.
+ > 2. The `google_search` argument is enabled by default for all requests, making the request appear to come from a Google search page. So, a request for `https://example.com` will set the referer to `https://www.google.com/search?q=example`. Also, if used together, it takes priority over the referer set by the `extra_headers` argument.
+ > 3. Since version 0.3.13, the `stealth` option has been removed here in favor of the `StealthyFetcher` class, and the `hide_canvas` option has been moved to it. The `disable_webgl` argument has also been moved to the `StealthyFetcher` class and renamed `allow_webgl`.
+ > 4. If you didn't set a user agent and enabled headless mode, the fetcher will generate a real user agent for the same browser version and use it. If you didn't set a user agent and didn't enable headless mode, the fetcher will keep the browser's default user agent, which in recent versions is identical to a normal browser's.
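To make the referer behavior in note 2 concrete, here is a hypothetical helper (not part of Scrapling's API) showing how such a referer can be derived from a target URL:

```python
from urllib.parse import urlparse

def google_referer(url: str) -> str:
    # Extract the site's bare name (e.g. 'example' from 'example.com')
    # and build a Google-search-style referer for it.
    domain = urlparse(url).netloc
    name = domain.removeprefix('www.').split('.')[0]
    return f'https://www.google.com/search?q={name}'

print(google_referer('https://example.com'))  # → https://www.google.com/search?q=example
```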
 
  ## Examples

  ```python
  # Disable unnecessary resources
+ page = DynamicFetcher.fetch('https://example.com', disable_resources=True)  # Blocks fonts, images, media, etc.
  ```
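The resource types dropped by `disable_resources` (listed in the arguments table) amount to a simple membership check on the request's resource type. A minimal illustrative sketch, not Scrapling's actual code:

```python
# Resource types the docs say `disable_resources` drops.
BLOCKED_TYPES = {
    'font', 'image', 'media', 'beacon', 'object', 'imageset',
    'texttrack', 'websocket', 'csp_report', 'stylesheet',
}

def should_block(resource_type: str) -> bool:
    # A request is dropped when its resource type is in the blocklist.
    return resource_type in BLOCKED_TYPES

print(should_block('image'), should_block('document'))  # → True False
```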
  ### Network Control

  # Custom timeout (in milliseconds)
  page = DynamicFetcher.fetch('https://example.com', timeout=30000)  # 30 seconds

+ # Proxy support (it can also be a dictionary with only the keys 'server', 'username', and 'password')
+ page = DynamicFetcher.fetch('https://example.com', proxy='http://username:password@host:port')
  ```
  ### Downloading Files

      f.write(page.body)
  ```

+ The `body` attribute of the `Response` object is a `bytes` object containing the response body in the case of non-HTML responses.

  ### Browser Automation
  This is where your knowledge about [Playwright's Page API](https://playwright.dev/python/docs/api/class-page) comes into play. The function you pass here takes the page object from Playwright's API, performs the desired action, and then the fetcher continues.
 
      page.mouse.move(100, 400)
      page.mouse.up()

+ page = DynamicFetcher.fetch('https://example.com', page_action=scroll_page)
  ```
  Of course, if you use the async fetch version, the function must also be async.
  ```python
 
      await page.mouse.move(100, 400)
      await page.mouse.up()

+ page = await DynamicFetcher.async_fetch('https://example.com', page_action=scroll_page)
  ```

  ### Wait Conditions
 
      locale='en-US',  # Set browser locale
  )
  ```

  ### General example
  ```python