Karim Shoair committed
Commit 60d0c55 · Parent(s): c181b7d
docs: Update all docstring according to the new changes
docs/fetching/dynamic.md
CHANGED

@@ -77,7 +77,7 @@ Scrapling provides many options with this fetcher. To make it as simple as possi
 | network_idle | Wait for the page until there are no network connections for at least 500 ms. | ✔️ |
 | timeout | The timeout (milliseconds) used in all operations and waits through the page. The default is 30,000 ms (30 seconds). | ✔️ |
 | wait | The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the `Response` object. | ✔️ |
-| page_action | Added for automation. Pass a function that takes the `page` object and does the necessary automation
+| page_action | Added for automation. Pass a function that takes the `page` object and does the necessary automation. | ✔️ |
 | wait_selector | Wait for a specific css selector to be in a specific state. | ✔️ |
 | init_script | An absolute path to a JavaScript file to be executed on page creation for all pages in this session. | ✔️ |
 | wait_selector_state | Scrapling will wait for the given state to be fulfilled for the selector given with `wait_selector`. _Default state is `attached`._ | ✔️ |

@@ -134,7 +134,6 @@ def scroll_page(page: Page):
     page.mouse.wheel(10, 0)
     page.mouse.move(100, 400)
     page.mouse.up()
-    return page

 page = DynamicFetcher.fetch(
     'https://example.com',

@@ -149,7 +148,6 @@ async def scroll_page(page: Page):
     await page.mouse.wheel(10, 0)
     await page.mouse.move(100, 400)
     await page.mouse.up()
-    return page

 page = await DynamicFetcher.async_fetch(
     'https://example.com',

@@ -273,9 +271,14 @@ async def scrape_multiple_sites():
     return pages
 ```

-You may have noticed the `max_pages` argument. This is a new argument that enables the fetcher to create a **pool of Browser tabs** that will be rotated automatically. Instead of
-
+You may have noticed the `max_pages` argument. This is a new argument that enables the fetcher to create a **pool of browser tabs** that will be rotated automatically. Instead of using one tab for all your requests, you set a maximum number of allowed pages/tabs. With each request, the library closes every tab that has finished its task and checks whether the current number of tabs is below that maximum, then:
+
+1. If you are within the allowed range, the fetcher creates a new tab for you and everything proceeds as normal.
+2. Otherwise, it keeps checking at sub-second intervals whether a new tab can be created, for up to 60 seconds, then raises a `TimeoutError`. This can happen when the website you are fetching becomes unresponsive for some reason.
+
+This logic allows multiple websites to be fetched at the same time in the same browser, which saves a lot of resources but, most importantly, is fast :)
+
+In versions 0.3 and 0.3.1, the pool reused finished tabs to save more resources/time, but this logic proved flawed since it's nearly impossible to protect pages/tabs from contamination by the configuration used with the previous request.

 ### Session Benefits
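The tab-pool behavior this hunk documents (close finished tabs, admit a new one while under `max_pages`, otherwise poll at sub-second intervals for up to 60 seconds before raising `TimeoutError`) can be sketched in plain Python. This is an illustrative stand-in, not Scrapling's actual implementation; `TabPool` and all its names are hypothetical:

```python
import time


class TabPool:
    """Illustrative sketch of the documented tab-rotation logic (not Scrapling's real code)."""

    def __init__(self, max_pages: int, timeout: float = 60.0, poll_interval: float = 0.1):
        self.max_pages = max_pages
        self.timeout = timeout            # how long to wait for a free slot, in seconds
        self.poll_interval = poll_interval
        self.open_tabs = []               # tabs currently open in the browser

    def _reap_finished(self):
        # Close and drop every tab whose task has finished
        self.open_tabs = [tab for tab in self.open_tabs if not tab["finished"]]

    def acquire_tab(self):
        deadline = time.monotonic() + self.timeout
        while True:
            self._reap_finished()
            if len(self.open_tabs) < self.max_pages:
                # Within the allowed range: always create a fresh tab (never reuse one,
                # to avoid contamination from a previous request's configuration)
                tab = {"finished": False}
                self.open_tabs.append(tab)
                return tab
            if time.monotonic() >= deadline:
                # The pool stayed full for the whole window, e.g. an unresponsive site
                raise TimeoutError("No free tab slot became available in time")
            time.sleep(self.poll_interval)
```

For example, with `max_pages=2`, two requests get fresh tabs immediately; a third waits until one of them finishes, or times out if neither does.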
docs/fetching/stealthy.md
CHANGED

@@ -31,7 +31,7 @@ Before jumping to [examples](#examples), here's the full list of arguments
 | google_search | Enabled by default, Scrapling will set the referer header as if this request came from a Google search of this website's domain name. | ✔️ |
 | extra_headers | A dictionary of extra headers to add to the request. _The referer set by the `google_search` argument takes priority over the referer set here if used together._ | ✔️ |
 | block_webrtc | Blocks WebRTC entirely. | ✔️ |
-| page_action | Added for automation. Pass a function that takes the `page` object and does the necessary automation
+| page_action | Added for automation. Pass a function that takes the `page` object and does the necessary automation. | ✔️ |
 | addons | List of Firefox addons to use. **Must be paths to extracted addons.** | ✔️ |
 | humanize | Humanize the cursor movement. The cursor movement takes either True or the maximum duration in seconds. The cursor typically takes up to 1.5 seconds to move across the window. | ✔️ |
 | allow_webgl | Enabled by default. Disabling WebGL is not recommended, as many WAFs now check if WebGL is enabled. | ✔️ |

@@ -156,7 +156,6 @@ def scroll_page(page: Page):
     page.mouse.wheel(10, 0)
     page.mouse.move(100, 400)
     page.mouse.up()
-    return page

 page = StealthyFetcher.fetch(
     'https://example.com',

@@ -171,7 +170,6 @@ async def scroll_page(page: Page):
     await page.mouse.wheel(10, 0)
     await page.mouse.move(100, 400)
     await page.mouse.up()
-    return page

 page = await StealthyFetcher.async_fetch(
     'https://example.com',

@@ -278,9 +276,14 @@ async def scrape_multiple_sites():
     return pages
 ```

-You may have noticed the `max_pages` argument. This is a new argument that enables the fetcher to create a **pool of Browser tabs** that will be rotated automatically. Instead of
-
+You may have noticed the `max_pages` argument. This is a new argument that enables the fetcher to create a **pool of browser tabs** that will be rotated automatically. Instead of using one tab for all your requests, you set a maximum number of allowed pages/tabs. With each request, the library closes every tab that has finished its task and checks whether the current number of tabs is below that maximum, then:
+
+1. If you are within the allowed range, the fetcher creates a new tab for you and everything proceeds as normal.
+2. Otherwise, it keeps checking at sub-second intervals whether a new tab can be created, for up to 60 seconds, then raises a `TimeoutError`. This can happen when the website you are fetching becomes unresponsive for some reason.
+
+This logic allows multiple websites to be fetched at the same time in the same browser, which saves a lot of resources but, most importantly, is fast :)
+
+In versions 0.3 and 0.3.1, the pool reused finished tabs to save more resources/time, but this logic proved flawed since it's nearly impossible to protect pages/tabs from contamination by the configuration used with the previous request.

 ### Session Benefits
scrapling/engines/_browsers/_camoufox.py
CHANGED

@@ -119,7 +119,7 @@ class StealthySession(StealthySessionMixin, SyncSession):
 :param os_randomize: If enabled, Scrapling will randomize the OS fingerprints used. The default is Scrapling matching the fingerprints with the current OS.
 :param wait: The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the `Response` object.
 :param timeout: The timeout in milliseconds that is used in all operations and waits through the page. The default is 30,000
-:param page_action: Added for automation. A function that takes the `page` object
+:param page_action: Added for automation. A function that takes the `page` object and does the automation you need.
 :param wait_selector: Wait for a specific CSS selector to be in a specific state.
 :param init_script: An absolute path to a JavaScript file to be executed on page creation for all pages in this session.
 :param geoip: Recommended to use with proxies; Automatically use IP's longitude, latitude, timezone, country, locale, and spoof the WebRTC IP address.

@@ -269,7 +269,7 @@ class StealthySession(StealthySessionMixin, SyncSession):
 :param google_search: Enabled by default, Scrapling will set the referer header to be as if this request came from a Google search of this website's domain name.
 :param timeout: The timeout in milliseconds that is used in all operations and waits through the page. The default is 30,000
 :param wait: The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the `Response` object.
-:param page_action: Added for automation. A function that takes the `page` object
+:param page_action: Added for automation. A function that takes the `page` object and does the automation you need.
 :param extra_headers: A dictionary of extra headers to add to the request. _The referer set by the `google_search` argument takes priority over the referer set here if used together._
 :param disable_resources: Drop requests of unnecessary resources for a speed boost. It depends, but it made requests ~25% faster in my tests for some websites.
     Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`.

@@ -433,7 +433,7 @@ class AsyncStealthySession(StealthySessionMixin, AsyncSession):
 :param os_randomize: If enabled, Scrapling will randomize the OS fingerprints used. The default is Scrapling matching the fingerprints with the current OS.
 :param wait: The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the `Response` object.
 :param timeout: The timeout in milliseconds that is used in all operations and waits through the page. The default is 30,000
-:param page_action: Added for automation. A function that takes the `page` object
+:param page_action: Added for automation. A function that takes the `page` object and does the automation you need.
 :param wait_selector: Wait for a specific CSS selector to be in a specific state.
 :param init_script: An absolute path to a JavaScript file to be executed on page creation for all pages in this session.
 :param geoip: Recommended to use with proxies; Automatically use IP's longitude, latitude, timezone, country, locale, and spoof the WebRTC IP address.

@@ -585,7 +585,7 @@ class AsyncStealthySession(StealthySessionMixin, AsyncSession):
 :param google_search: Enabled by default, Scrapling will set the referer header to be as if this request came from a Google search of this website's domain name.
 :param timeout: The timeout in milliseconds that is used in all operations and waits through the page. The default is 30,000
 :param wait: The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the `Response` object.
-:param page_action: Added for automation. A function that takes the `page` object
+:param page_action: Added for automation. A function that takes the `page` object and does the automation you need.
 :param extra_headers: A dictionary of extra headers to add to the request. _The referer set by the `google_search` argument takes priority over the referer set here if used together._
 :param disable_resources: Drop requests of unnecessary resources for a speed boost. It depends, but it made requests ~25% faster in my tests for some websites.
     Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`.
scrapling/engines/_browsers/_controllers.py
CHANGED

@@ -106,7 +106,7 @@ class DynamicSession(DynamicSessionMixin, SyncSession):
 :param network_idle: Wait for the page until there are no network connections for at least 500 ms.
 :param timeout: The timeout in milliseconds that is used in all operations and waits through the page. The default is 30,000
 :param wait: The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the `Response` object.
-:param page_action: Added for automation. A function that takes the `page` object
+:param page_action: Added for automation. A function that takes the `page` object and does the automation you need.
 :param wait_selector: Wait for a specific CSS selector to be in a specific state.
 :param init_script: An absolute path to a JavaScript file to be executed on page creation for all pages in this session.
 :param locale: Set the locale for the browser if wanted. The default value is `en-US`.

@@ -217,7 +217,7 @@ class DynamicSession(DynamicSessionMixin, SyncSession):
 :param google_search: Enabled by default, Scrapling will set the referer header to be as if this request came from a Google search of this website's domain name.
 :param timeout: The timeout in milliseconds that is used in all operations and waits through the page. The default is 30,000
 :param wait: The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the `Response` object.
-:param page_action: Added for automation. A function that takes the `page` object
+:param page_action: Added for automation. A function that takes the `page` object and does the automation you need.
 :param extra_headers: A dictionary of extra headers to add to the request. _The referer set by the `google_search` argument takes priority over the referer set here if used together._
 :param disable_resources: Drop requests of unnecessary resources for a speed boost. It depends, but it made requests ~25% faster in my tests for some websites.
     Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`.

@@ -360,7 +360,7 @@ class AsyncDynamicSession(DynamicSessionMixin, AsyncSession):
 :param network_idle: Wait for the page until there are no network connections for at least 500 ms.
 :param timeout: The timeout in milliseconds that is used in all operations and waits through the page. The default is 30,000
 :param wait: The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the `Response` object.
-:param page_action: Added for automation. A function that takes the `page` object
+:param page_action: Added for automation. A function that takes the `page` object and does the automation you need.
 :param wait_selector: Wait for a specific CSS selector to be in a specific state.
 :param init_script: An absolute path to a JavaScript file to be executed on page creation for all pages in this session.
 :param locale: Set the locale for the browser if wanted. The default value is `en-US`.

@@ -478,7 +478,7 @@ class AsyncDynamicSession(DynamicSessionMixin, AsyncSession):
 :param google_search: Enabled by default, Scrapling will set the referer header to be as if this request came from a Google search of this website's domain name.
 :param timeout: The timeout in milliseconds that is used in all operations and waits through the page. The default is 30,000
 :param wait: The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the `Response` object.
-:param page_action: Added for automation. A function that takes the `page` object
+:param page_action: Added for automation. A function that takes the `page` object and does the automation you need.
 :param extra_headers: A dictionary of extra headers to add to the request. _The referer set by the `google_search` argument takes priority over the referer set here if used together._
 :param disable_resources: Drop requests of unnecessary resources for a speed boost. It depends, but it made requests ~25% faster in my tests for some websites.
     Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`.
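The `disable_resources` behavior documented in these docstrings (dropping requests of certain resource types for a speed boost) boils down to a predicate over a request's resource type. A minimal sketch, with the type list taken verbatim from the docstring; `should_abort` is a hypothetical helper name, and how such a predicate gets wired into request routing is left out:

```python
# Resource types the docstring says are dropped when disable_resources=True
BLOCKED_RESOURCE_TYPES = frozenset({
    "font", "image", "media", "beacon", "object", "imageset",
    "texttrack", "websocket", "csp_report", "stylesheet",
})


def should_abort(resource_type: str, disable_resources: bool) -> bool:
    """Return True if a request of this resource type should be dropped."""
    return disable_resources and resource_type in BLOCKED_RESOURCE_TYPES
```

In a Playwright-style setup, a predicate like this would typically be consulted inside a route handler that aborts matching requests and lets everything else (e.g. `document`, `script`, `xhr`) continue.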
scrapling/fetchers.py
CHANGED

@@ -96,7 +96,7 @@ class StealthyFetcher(BaseFetcher):
 :param os_randomize: If enabled, Scrapling will randomize the OS fingerprints used. The default is Scrapling matching the fingerprints with the current OS.
 :param wait: The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the `Response` object.
 :param timeout: The timeout in milliseconds that is used in all operations and waits through the page. The default is 30,000
-:param page_action: Added for automation. A function that takes the `page` object
+:param page_action: Added for automation. A function that takes the `page` object and does the automation you need.
 :param wait_selector: Wait for a specific CSS selector to be in a specific state.
 :param init_script: An absolute path to a JavaScript file to be executed on page creation with this request.
 :param geoip: Recommended to use with proxies; Automatically use IP's longitude, latitude, timezone, country, locale, and spoof the WebRTC IP address.

@@ -194,7 +194,7 @@ class StealthyFetcher(BaseFetcher):
 :param os_randomize: If enabled, Scrapling will randomize the OS fingerprints used. The default is Scrapling matching the fingerprints with the current OS.
 :param wait: The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the `Response` object.
 :param timeout: The timeout in milliseconds that is used in all operations and waits through the page. The default is 30,000
-:param page_action: Added for automation. A function that takes the `page` object
+:param page_action: Added for automation. A function that takes the `page` object and does the automation you need.
 :param wait_selector: Wait for a specific CSS selector to be in a specific state.
 :param init_script: An absolute path to a JavaScript file to be executed on page creation with this request.
 :param geoip: Recommended to use with proxies; Automatically use IP's longitude, latitude, timezone, country, locale, and spoof the WebRTC IP address.

@@ -299,7 +299,7 @@ class DynamicFetcher(BaseFetcher):
 :param network_idle: Wait for the page until there are no network connections for at least 500 ms.
 :param timeout: The timeout in milliseconds that is used in all operations and waits through the page. The default is 30,000
 :param wait: The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the `Response` object.
-:param page_action: Added for automation. A function that takes the `page` object
+:param page_action: Added for automation. A function that takes the `page` object and does the automation you need.
 :param wait_selector: Wait for a specific CSS selector to be in a specific state.
 :param init_script: An absolute path to a JavaScript file to be executed on page creation with this request.
 :param locale: Set the locale for the browser if wanted. The default value is `en-US`.

@@ -385,7 +385,7 @@ class DynamicFetcher(BaseFetcher):
 :param network_idle: Wait for the page until there are no network connections for at least 500 ms.
 :param timeout: The timeout in milliseconds that is used in all operations and waits through the page. The default is 30,000
 :param wait: The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the `Response` object.
-:param page_action: Added for automation. A function that takes the `page` object
+:param page_action: Added for automation. A function that takes the `page` object and does the automation you need.
 :param wait_selector: Wait for a specific CSS selector to be in a specific state.
 :param init_script: An absolute path to a JavaScript file to be executed on page creation with this request.
 :param locale: Set the locale for the browser if wanted. The default value is `en-US`.
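Alongside the docstring updates, the documentation hunks in this commit drop `return page` from the `scroll_page` examples: a `page_action` callable now only needs to accept the `page` object and perform its automation, and its return value is no longer used. A stand-in sketch of that contract; the `FakePage` stub and `run_page_action` are purely illustrative, not Scrapling's real objects:

```python
class FakePage:
    """Stand-in for the Playwright-style `page` object the fetchers pass to page_action."""

    def __init__(self):
        self.actions = []

    def scroll(self, x: int, y: int):
        self.actions.append(("scroll", x, y))


def scroll_page(page):
    # A page_action: drive the page through its API; no `return page` needed anymore
    page.scroll(10, 0)
    page.scroll(0, 400)


def run_page_action(page, page_action):
    # How a fetcher might invoke it: call for side effects, ignore any return value
    page_action(page)
    return page
```

This is why the new docstring text reads "takes the `page` object and does the automation you need" rather than describing a function that returns the page.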