Karim Shoair committed
Commit · 3985986
Parent(s): 1812d2b

docs: updating the `Fetcher Basics` and the `HTTP requests` pages

Browse files
- docs/fetching/choosing.md +12 -12
- docs/fetching/stealthy.md +29 -29
docs/fetching/choosing.md CHANGED

@@ -3,7 +3,7 @@ Fetchers are classes that can do requests or fetch pages for you easily in a sin

 This feature was introduced because, before v0.2, Scrapling was only a parsing engine. The target here is to gradually become the one-stop shop for all Web Scraping needs.

-> Fetchers are not wrappers built on top of other libraries. However, they
+> Fetchers are not wrappers built on top of other libraries. However, they only use these libraries as an engine to request/fetch pages. To further clarify this, all fetchers have features that the underlying engines don't, while still fully leveraging those engines and optimizing them for Web Scraping.

 ## Fetchers Overview

@@ -12,17 +12,17 @@ Scrapling provides three different fetcher classes with their session classes; e

 The following table compares them and can be quickly used for guidance.

-| Feature | Fetcher | DynamicFetcher
-|--------------------|---------------------------------------------------|----------------------------|-----------------------------|
-| Relative speed | 🚀🚀🚀🚀🚀 | 🚀🚀🚀
-| Stealth | ⭐⭐ | ⭐⭐⭐
-| Anti-Bot options | ⭐⭐ | ⭐⭐⭐
-| JavaScript loading | ❌ | ✅
-| Memory Usage | ⭐ | ⭐⭐⭐
-| Best used for | Basic scraping when HTTP requests alone can do it | - Dynamically loaded websites <br/>- Small automation<br/>-
-| Browser(s) | ❌ | Chromium and Google Chrome
-| Browser API used | ❌ | Playwright
-| Setup Complexity | Simple | Simple
+| Feature | Fetcher | DynamicFetcher | StealthyFetcher |
+|--------------------|---------------------------------------------------|----------------------------|-----------------------------|
+| Relative speed | 🚀🚀🚀🚀🚀 | 🚀🚀🚀 | 🚀🚀🚀 |
+| Stealth | ⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
+| Anti-Bot options | ⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
+| JavaScript loading | ❌ | ✅ | ✅ |
+| Memory Usage | ⭐ | ⭐⭐⭐ | ⭐⭐⭐ |
+| Best used for | Basic scraping when HTTP requests alone can do it | - Dynamically loaded websites <br/>- Small automation<br/>- Small-Mid protections | - Dynamically loaded websites <br/>- Small automation <br/>- Small-Complicated protections |
+| Browser(s) | ❌ | Chromium and Google Chrome | Camoufox (modified Firefox) |
+| Browser API used | ❌ | Playwright | Playwright |
+| Setup Complexity | Simple | Simple | Simple |

 In the following pages, we will talk about each one in detail.
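The guidance in the comparison table above can be sketched as a tiny selection helper. This is an illustrative sketch only; the `choose_fetcher` function and its parameters are hypothetical and not part of Scrapling's API.

```python
# Hypothetical helper mirroring the guidance in the comparison table above.
# `choose_fetcher` and its parameters are illustrative, not Scrapling API.

def choose_fetcher(needs_javascript: bool, protection: str = "none") -> str:
    """Return the fetcher class name the table would suggest.

    protection: "none", "small", "mid", or "complicated".
    """
    if not needs_javascript and protection == "none":
        return "Fetcher"  # plain HTTP requests are enough
    if protection in ("none", "small", "mid"):
        return "DynamicFetcher"  # dynamic sites, small-to-mid protections
    return "StealthyFetcher"  # complicated protections

print(choose_fetcher(needs_javascript=False))  # Fetcher
print(choose_fetcher(needs_javascript=True, protection="complicated"))  # StealthyFetcher
```

The thresholds here simply restate the table's "Best used for" row; in practice, you would start with the fastest fetcher that works and move down the list only when a site requires it.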
docs/fetching/stealthy.md CHANGED

@@ -24,35 +24,35 @@ Check out how to configure the parsing options [here](choosing.md#parser-configu

 Scrapling provides many options with this fetcher and its session classes. Before jumping to the [examples](#examples), here's the full list of arguments

-| Argument | Description |
-|:-------------------:|-------------|
-| url | Target URL |
-| headless | Pass `True` to run the browser in headless/hidden (**default**) or `False` for headful/visible mode. |
-| block_images | Prevent the loading of images through Firefox preferences. _This can help save your proxy usage, but be cautious with this option, as it may cause some websites to never finish loading._ |
-| disable_resources | Drop requests for unnecessary resources for a speed boost. |
-| cookies | Set cookies for the next request. |
-| google_search | Enabled by default, Scrapling will set the referer header as if this request came from a Google search of this website's domain name. |
-| extra_headers | A dictionary of extra headers to add to the request. _The referer set by the `google_search` argument takes priority over the referer set here if used together._ |
-| block_webrtc | Blocks WebRTC entirely. |
-| page_action | Added for automation. Pass a function that takes the `page` object and does the necessary automation. |
-| addons | List of Firefox addons to use. **Must be paths to extracted addons.** |
-| humanize | Humanize the cursor movement. It takes either `True` or the maximum duration of the movement in seconds. The cursor typically takes up to 1.5 seconds to move across the window. |
-| allow_webgl | Enabled by default. Disabling WebGL is not recommended, as many WAFs now check if WebGL is enabled. |
-| geoip | Recommended to use with proxies; automatically uses the IP's longitude, latitude, timezone, country, and locale, and spoofs the WebRTC IP address. It will also calculate and spoof the browser's language based on the distribution of language speakers in the target region. |
-| os_randomize | If enabled, Scrapling will randomize the OS fingerprints used. The default is matching the fingerprints with the current OS. |
-| disable_ads | Disabled by default; this installs the `uBlock Origin` addon on the browser if enabled. |
-| solve_cloudflare | When enabled, the fetcher solves all types of Cloudflare's Turnstile/Interstitial challenges before returning the response to you. |
-| network_idle | Wait for the page until there are no network connections for at least 500 ms. |
-| load_dom | Enabled by default; wait for all JavaScript on page(s) to fully load and execute (wait for the `domcontentloaded` state). |
-| timeout | The timeout used in all operations and waits through the page. It's in milliseconds, and the default is 30000. |
-| wait | The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the `Response` object. |
-| wait_selector | Wait for a specific CSS selector to be in a specific state. |
-| init_script | An absolute path to a JavaScript file to be executed on page creation for all pages in this session. |
-| wait_selector_state | Scrapling will wait for the given state to be fulfilled for the selector given with `wait_selector`. _Default state is `attached`._ |
-| proxy | The proxy to be used with requests. It can be a string or a dictionary with only the keys 'server', 'username', and 'password'. |
-| user_data_dir | Path to a User Data Directory, which stores browser session data like cookies and local storage. The default is to create a temporary directory. **Only works with sessions.** |
-| additional_args | Additional arguments to be passed to Camoufox as additional settings; they take higher priority than Scrapling's settings. |
-| selector_config | A dictionary of custom parsing arguments to be used when creating the final `Selector`/`Response` class. |
+| Argument | Description | Optional |
+|:-------------------:|-------------|:--------:|
+| url | Target URL | ❌ |
+| headless | Pass `True` to run the browser in headless/hidden (**default**) or `False` for headful/visible mode. | ✔️ |
+| block_images | Prevent the loading of images through Firefox preferences. _This can help save your proxy usage, but be cautious with this option, as it may cause some websites to never finish loading._ | ✔️ |
+| disable_resources | Drop requests for unnecessary resources for a speed boost. Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`. | ✔️ |
+| cookies | Set cookies for the next request. | ✔️ |
+| google_search | Enabled by default, Scrapling will set the referer header as if this request came from a Google search of this website's domain name. | ✔️ |
+| extra_headers | A dictionary of extra headers to add to the request. _The referer set by the `google_search` argument takes priority over the referer set here if used together._ | ✔️ |
+| block_webrtc | Blocks WebRTC entirely. | ✔️ |
+| page_action | Added for automation. Pass a function that takes the `page` object and does the necessary automation. | ✔️ |
+| addons | List of Firefox addons to use. **Must be paths to extracted addons.** | ✔️ |
+| humanize | Humanize the cursor movement. It takes either `True` or the maximum duration of the movement in seconds. The cursor typically takes up to 1.5 seconds to move across the window. | ✔️ |
+| allow_webgl | Enabled by default. Disabling WebGL is not recommended, as many WAFs now check if WebGL is enabled. | ✔️ |
+| geoip | Recommended to use with proxies; automatically uses the IP's longitude, latitude, timezone, country, and locale, and spoofs the WebRTC IP address. It will also calculate and spoof the browser's language based on the distribution of language speakers in the target region. | ✔️ |
+| os_randomize | If enabled, Scrapling will randomize the OS fingerprints used. The default is matching the fingerprints with the current OS. | ✔️ |
+| disable_ads | Disabled by default; this installs the `uBlock Origin` addon on the browser if enabled. | ✔️ |
+| solve_cloudflare | When enabled, the fetcher solves all types of Cloudflare's Turnstile/Interstitial challenges before returning the response to you. | ✔️ |
+| network_idle | Wait for the page until there are no network connections for at least 500 ms. | ✔️ |
+| load_dom | Enabled by default; wait for all JavaScript on page(s) to fully load and execute (wait for the `domcontentloaded` state). | ✔️ |
+| timeout | The timeout used in all operations and waits through the page. It's in milliseconds, and the default is 30000. | ✔️ |
+| wait | The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the `Response` object. | ✔️ |
+| wait_selector | Wait for a specific CSS selector to be in a specific state. | ✔️ |
+| init_script | An absolute path to a JavaScript file to be executed on page creation for all pages in this session. | ✔️ |
+| wait_selector_state | Scrapling will wait for the given state to be fulfilled for the selector given with `wait_selector`. _Default state is `attached`._ | ✔️ |
+| proxy | The proxy to be used with requests. It can be a string or a dictionary with only the keys 'server', 'username', and 'password'. | ✔️ |
+| user_data_dir | Path to a User Data Directory, which stores browser session data like cookies and local storage. The default is to create a temporary directory. **Only works with sessions.** | ✔️ |
+| additional_args | Additional arguments to be passed to Camoufox as additional settings; they take higher priority than Scrapling's settings. | ✔️ |
+| selector_config | A dictionary of custom parsing arguments to be used when creating the final `Selector`/`Response` class. | ✔️ |

 In session classes, all these arguments can be set globally for the session. Still, you can configure each request individually by passing some of the arguments here that can be configured on the browser tab level like: `google_search`, `timeout`, `wait`, `page_action`, `extra_headers`, `disable_resources`, `wait_selector`, `wait_selector_state`, `network_idle`, `load_dom`, `solve_cloudflare`, and `selector_config`.
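As a side note on the `proxy` argument documented above (a string, or a dictionary with only the keys 'server', 'username', and 'password'), the accepted shape can be sketched with a small validator. `normalize_proxy` is a hypothetical helper written for illustration, not part of Scrapling's API, and the assumption that a dict must at least contain 'server' is mine.

```python
# Illustrative validation of the `proxy` argument shape described in the table:
# a string, or a dict restricted to the keys 'server', 'username', 'password'.
# `normalize_proxy` is a hypothetical helper, not part of Scrapling's API;
# requiring a 'server' key in the dict form is an assumption.

ALLOWED_PROXY_KEYS = {"server", "username", "password"}

def normalize_proxy(proxy):
    """Return a dict form of the proxy argument, rejecting unexpected keys."""
    if isinstance(proxy, str):
        return {"server": proxy}
    if isinstance(proxy, dict):
        extra = set(proxy) - ALLOWED_PROXY_KEYS
        if extra:
            raise ValueError(f"unexpected proxy keys: {sorted(extra)}")
        if "server" not in proxy:
            raise ValueError("a proxy dict needs at least a 'server' key")
        return dict(proxy)
    raise TypeError("proxy must be a string or a dict")

print(normalize_proxy("http://127.0.0.1:8080"))  # {'server': 'http://127.0.0.1:8080'}
```

Validating early like this surfaces a typo in a proxy key as an exception at configuration time rather than as a silently ignored setting during a fetch.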