Karim Shoair committed
Commit · 3985986
Parent(s): 1812d2b

docs: updating the `Fetcher Basics` and the `HTTP requests` pages

Browse files
- docs/fetching/choosing.md +12 -12
- docs/fetching/stealthy.md +29 -29
docs/fetching/choosing.md CHANGED

@@ -3,7 +3,7 @@ Fetchers are classes that can do requests or fetch pages for you easily in a sin

 This feature was introduced because, before v0.2, Scrapling was only a parsing engine. The target here is to gradually become the one-stop shop for all Web Scraping needs.

-> Fetchers are not wrappers built on top of other libraries. However, they
+> Fetchers are not wrappers built on top of other libraries. However, they only use these libraries as an engine to request/fetch pages. To further clarify this, all fetchers have features that the underlying engines don't, while still fully leveraging those engines and optimizing them for Web Scraping.

 ## Fetchers Overview

@@ -12,17 +12,17 @@ Scrapling provides three different fetcher classes with their session classes; e

 The following table compares them and can be quickly used for guidance.

-| Feature | Fetcher | DynamicFetcher
-|--------------------|---------------------------------------------------|----------------------------|-----------------------------|
-| Relative speed | 🚀🚀🚀🚀🚀 | 🚀🚀🚀
-| Stealth | ⭐⭐ | ⭐⭐⭐
-| Anti-Bot options | ⭐⭐ | ⭐⭐⭐
-| JavaScript loading | ❌ | ✅
-| Memory Usage | ⭐ | ⭐⭐⭐
-| Best used for | Basic scraping when HTTP requests alone can do it | - Dynamically loaded websites <br/>- Small automation<br/>-
-| Browser(s) | ❌ | Chromium and Google Chrome
-| Browser API used | ❌ | Playwright
-| Setup Complexity | Simple | Simple
+| Feature | Fetcher | DynamicFetcher | StealthyFetcher |
+|--------------------|---------------------------------------------------|----------------------------|-----------------------------|
+| Relative speed | 🚀🚀🚀🚀🚀 | 🚀🚀🚀 | 🚀🚀🚀 |
+| Stealth | ⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
+| Anti-Bot options | ⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
+| JavaScript loading | ❌ | ✅ | ✅ |
+| Memory Usage | ⭐ | ⭐⭐⭐ | ⭐⭐⭐ |
+| Best used for | Basic scraping when HTTP requests alone can do it | - Dynamically loaded websites <br/>- Small automation<br/>- Small-Mid protections | - Dynamically loaded websites <br/>- Small automation <br/>- Small-Complicated protections |
+| Browser(s) | ❌ | Chromium and Google Chrome | Camoufox (modified Firefox) |
+| Browser API used | ❌ | Playwright | Playwright |
+| Setup Complexity | Simple | Simple | Simple |

 In the following pages, we will talk about each one in detail.
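The guidance in the comparison table above can be sketched as a tiny selection helper. This is an illustrative sketch only; the `choose_fetcher` function and its parameters are hypothetical and not part of Scrapling's API.

```python
# Hypothetical helper mirroring the guidance in the comparison table above.
# `choose_fetcher` and its parameters are illustrative, not Scrapling API.

def choose_fetcher(needs_javascript: bool, protection: str = "none") -> str:
    """Return the fetcher class name the table would suggest.

    protection: "none", "small", "mid", or "complicated".
    """
    if not needs_javascript and protection == "none":
        return "Fetcher"  # plain HTTP requests are enough
    if protection in ("none", "small", "mid"):
        return "DynamicFetcher"  # dynamic sites, small-to-mid protections
    return "StealthyFetcher"  # complicated protections

print(choose_fetcher(needs_javascript=False))  # Fetcher
print(choose_fetcher(needs_javascript=True, protection="complicated"))  # StealthyFetcher
```

The thresholds here simply restate the table's "Best used for" row; in practice, you would start with the fastest fetcher that works and move down the list only when a site requires it.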
docs/fetching/stealthy.md CHANGED

@@ -24,35 +24,35 @@ Check out how to configure the parsing options [here](choosing.md#parser-configu

 Scrapling provides many options with this fetcher and its session classes. Before jumping to the [examples](#examples), here's the full list of arguments

-| Argument | Description |
-|:-------------------:|-------------|
-| url | Target URL |
-| headless | Pass `True` to run the browser in headless/hidden (**default**) or `False` for headful/visible mode. |
-| block_images | Prevent the loading of images through Firefox preferences. _This can help save your proxy usage, but be cautious with this option, as it may cause some websites to never finish loading._ |
-| disable_resources | Drop requests for unnecessary resources for a speed boost. |
-| cookies | Set cookies for the next request. |
-| google_search | Enabled by default, Scrapling will set the referer header as if this request came from a Google search of this website's domain name. |
-| extra_headers | A dictionary of extra headers to add to the request. _The referer set by the `google_search` argument takes priority over the referer set here if used together._ |
-| block_webrtc | Blocks WebRTC entirely. |
-| page_action | Added for automation. Pass a function that takes the `page` object and does the necessary automation. |
-| addons | List of Firefox addons to use. **Must be paths to extracted addons.** |
-| humanize | Humanize the cursor movement. It takes either `True` or the maximum duration of the movement in seconds. The cursor typically takes up to 1.5 seconds to move across the window. |
-| allow_webgl | Enabled by default. Disabling WebGL is not recommended, as many WAFs now check if WebGL is enabled. |
-| geoip | Recommended to use with proxies; automatically uses the IP's longitude, latitude, timezone, country, and locale, and spoofs the WebRTC IP address. It will also calculate and spoof the browser's language based on the distribution of language speakers in the target region. |
-| os_randomize | If enabled, Scrapling will randomize the OS fingerprints used. The default is matching the fingerprints with the current OS. |
-| disable_ads | Disabled by default; this installs the `uBlock Origin` addon on the browser if enabled. |
-| solve_cloudflare | When enabled, the fetcher solves all types of Cloudflare's Turnstile/Interstitial challenges before returning the response to you. |
-| network_idle | Wait for the page until there are no network connections for at least 500 ms. |
-| load_dom | Enabled by default; wait for all JavaScript on page(s) to fully load and execute (wait for the `domcontentloaded` state). |
-| timeout | The timeout used in all operations and waits through the page. It's in milliseconds, and the default is 30000. |
-| wait | The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the `Response` object. |
-| wait_selector | Wait for a specific CSS selector to be in a specific state. |
-| init_script | An absolute path to a JavaScript file to be executed on page creation for all pages in this session. |
-| wait_selector_state | Scrapling will wait for the given state to be fulfilled for the selector given with `wait_selector`. _Default state is `attached`._ |
-| proxy | The proxy to be used with requests. It can be a string or a dictionary with only the keys 'server', 'username', and 'password'. |
-| user_data_dir | Path to a User Data Directory, which stores browser session data like cookies and local storage. The default is to create a temporary directory. **Only works with sessions.** |
-| additional_args | Additional arguments to be passed to Camoufox as additional settings; they take higher priority than Scrapling's settings. |
-| selector_config | A dictionary of custom parsing arguments to be used when creating the final `Selector`/`Response` class. |
+| Argument | Description | Optional |
+|:-------------------:|-------------|:--------:|
+| url | Target URL | ❌ |
+| headless | Pass `True` to run the browser in headless/hidden (**default**) or `False` for headful/visible mode. | ✔️ |
+| block_images | Prevent the loading of images through Firefox preferences. _This can help save your proxy usage, but be cautious with this option, as it may cause some websites to never finish loading._ | ✔️ |
+| disable_resources | Drop requests for unnecessary resources for a speed boost. Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`. | ✔️ |
+| cookies | Set cookies for the next request. | ✔️ |
+| google_search | Enabled by default, Scrapling will set the referer header as if this request came from a Google search of this website's domain name. | ✔️ |
+| extra_headers | A dictionary of extra headers to add to the request. _The referer set by the `google_search` argument takes priority over the referer set here if used together._ | ✔️ |
+| block_webrtc | Blocks WebRTC entirely. | ✔️ |
+| page_action | Added for automation. Pass a function that takes the `page` object and does the necessary automation. | ✔️ |
+| addons | List of Firefox addons to use. **Must be paths to extracted addons.** | ✔️ |
+| humanize | Humanize the cursor movement. It takes either `True` or the maximum duration of the movement in seconds. The cursor typically takes up to 1.5 seconds to move across the window. | ✔️ |
+| allow_webgl | Enabled by default. Disabling WebGL is not recommended, as many WAFs now check if WebGL is enabled. | ✔️ |
+| geoip | Recommended to use with proxies; automatically uses the IP's longitude, latitude, timezone, country, and locale, and spoofs the WebRTC IP address. It will also calculate and spoof the browser's language based on the distribution of language speakers in the target region. | ✔️ |
+| os_randomize | If enabled, Scrapling will randomize the OS fingerprints used. The default is matching the fingerprints with the current OS. | ✔️ |
+| disable_ads | Disabled by default; this installs the `uBlock Origin` addon on the browser if enabled. | ✔️ |
+| solve_cloudflare | When enabled, the fetcher solves all types of Cloudflare's Turnstile/Interstitial challenges before returning the response to you. | ✔️ |
+| network_idle | Wait for the page until there are no network connections for at least 500 ms. | ✔️ |
+| load_dom | Enabled by default; wait for all JavaScript on page(s) to fully load and execute (wait for the `domcontentloaded` state). | ✔️ |
+| timeout | The timeout used in all operations and waits through the page. It's in milliseconds, and the default is 30000. | ✔️ |
+| wait | The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the `Response` object. | ✔️ |
+| wait_selector | Wait for a specific CSS selector to be in a specific state. | ✔️ |
+| init_script | An absolute path to a JavaScript file to be executed on page creation for all pages in this session. | ✔️ |
+| wait_selector_state | Scrapling will wait for the given state to be fulfilled for the selector given with `wait_selector`. _Default state is `attached`._ | ✔️ |
+| proxy | The proxy to be used with requests. It can be a string or a dictionary with only the keys 'server', 'username', and 'password'. | ✔️ |
+| user_data_dir | Path to a User Data Directory, which stores browser session data like cookies and local storage. The default is to create a temporary directory. **Only works with sessions.** | ✔️ |
+| additional_args | Additional arguments to be passed to Camoufox as additional settings; they take higher priority than Scrapling's settings. | ✔️ |
+| selector_config | A dictionary of custom parsing arguments to be used when creating the final `Selector`/`Response` class. | ✔️ |

 In session classes, all these arguments can be set globally for the session. Still, you can configure each request individually by passing some of the arguments here that can be configured on the browser tab level like: `google_search`, `timeout`, `wait`, `page_action`, `extra_headers`, `disable_resources`, `wait_selector`, `wait_selector_state`, `network_idle`, `load_dom`, `solve_cloudflare`, and `selector_config`.
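As a side note on the `proxy` argument documented above (a string, or a dictionary with only the keys 'server', 'username', and 'password'), the accepted shape can be sketched with a small validator. `normalize_proxy` is a hypothetical helper written for illustration, not part of Scrapling's API, and the assumption that a dict must at least contain 'server' is mine.

```python
# Illustrative validation of the `proxy` argument shape described in the table:
# a string, or a dict restricted to the keys 'server', 'username', 'password'.
# `normalize_proxy` is a hypothetical helper, not part of Scrapling's API;
# requiring a 'server' key in the dict form is an assumption.

ALLOWED_PROXY_KEYS = {"server", "username", "password"}

def normalize_proxy(proxy):
    """Return a dict form of the proxy argument, rejecting unexpected keys."""
    if isinstance(proxy, str):
        return {"server": proxy}
    if isinstance(proxy, dict):
        extra = set(proxy) - ALLOWED_PROXY_KEYS
        if extra:
            raise ValueError(f"unexpected proxy keys: {sorted(extra)}")
        if "server" not in proxy:
            raise ValueError("a proxy dict needs at least a 'server' key")
        return dict(proxy)
    raise TypeError("proxy must be a string or a dict")

print(normalize_proxy("http://127.0.0.1:8080"))  # {'server': 'http://127.0.0.1:8080'}
```

Validating early like this surfaces a typo in a proxy key as an exception at configuration time rather than as a silently ignored setting during a fetch.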