Karim shoair committed
Commit: 061c69c
Parent(s): 66fd35f

docs: update all pages related to last changes

Changed files:
- docs/fetching/dynamic.md (+7 -9)
- docs/fetching/stealthy.md (+7 -9)
- docs/overview.md (+0 -2)
docs/fetching/dynamic.md CHANGED

@@ -14,10 +14,7 @@ Check out how to configure the parsing options [here](choosing.md#parser-configuration-in-all-fetchers)

Now, we will review most of the arguments one by one, using examples. If you want to jump to a table of all arguments for quick reference, [click here](#full-list-of-arguments)

- >
- >
- > 1. Every time you fetch a website with this fetcher, it waits by default for all JavaScript to fully load and execute, so you don't have to (wait for the `domcontentloaded` state).
- > 2. Of course, the async version of the `fetch` method is the `async_fetch` method.
+ > Note: The async version of the `fetch` method is the `async_fetch` method, of course.

This fetcher currently provides four main run options, which can be mixed as desired.

@@ -75,6 +72,7 @@ Scrapling provides many options with this fetcher. To make it as simple as possible,

| cookies | Set cookies for the next request. | ✔️ |
| useragent | Pass a useragent string to be used. **Otherwise, the fetcher will generate and use a real Useragent of the same browser.** | ✔️ |
| network_idle | Wait for the page until there are no network connections for at least 500 ms. | ✔️ |
+ | load_dom | Enabled by default; wait for all JavaScript on the page(s) to fully load and execute (wait for the `domcontentloaded` state). | ✔️ |
| timeout | The timeout (milliseconds) used in all operations and waits through the page. The default is 30,000 ms (30 seconds). | ✔️ |
| wait | The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the `Response` object. | ✔️ |
| page_action | Added for automation. Pass a function that takes the `page` object and does the necessary automation. | ✔️ |

@@ -167,7 +165,7 @@ page = DynamicFetcher.fetch(
```
This is the last wait the fetcher will do before returning the response (if enabled). You pass a CSS selector to the `wait_selector` argument, and the fetcher will wait for the state you passed in the `wait_selector_state` argument to be fulfilled. If you didn't pass a state, the default would be `attached`, which means it will wait for the element to be present in the DOM.

- After that, the fetcher will check again to see if all JS files are loaded and executed (the `domcontentloaded` state) or continue waiting. If you have enabled `network_idle`, the fetcher will wait for `network_idle` to be fulfilled again, as explained above.
+ After that, if `load_dom` is enabled (the default), the fetcher will check again to see if all JS files are loaded and executed (the `domcontentloaded` state) or continue waiting. If you have enabled `network_idle`, the fetcher will wait for `network_idle` to be fulfilled again, as explained above.

The states the fetcher can wait for can be any of the following ([source](https://playwright.dev/python/docs/api/class-page#page-wait-for-selector)):

@@ -271,14 +269,14 @@ async def scrape_multiple_sites():
    return pages
```

- You may have noticed the `max_pages` argument. This is a new argument that enables the fetcher to create a **pool of Browser tabs** that will be rotated automatically. Instead of using one tab for all your requests, you set a limit
+ You may have noticed the `max_pages` argument. This is a new argument that enables the fetcher to create a **pool of Browser tabs** that will be rotated automatically. Instead of using one tab for all your requests, you set a limit on the maximum number of pages allowed. With each request, the library will close all tabs that have finished their task and check whether the number of current tabs is lower than the maximum allowed number of pages/tabs, then:

- 1. If you are within the allowed range, the fetcher will create a new tab for you and then all is as normal.
- 2. Otherwise, it will keep checking every
+ 1. If you are within the allowed range, the fetcher will create a new tab for you, and then all is as normal.
+ 2. Otherwise, it will keep checking, at sub-second intervals, whether creating a new tab is allowed; after 60 seconds, it raises a `TimeoutError`. This can happen when the website you are fetching becomes unresponsive for some reason.

This logic allows for multiple websites to be fetched at the same time in the same browser, which saves a lot of resources, but most importantly, is so fast :)

- In versions 0.3 and 0.3.1, the pool was reusing finished tabs to save more resources/time
+ In versions 0.3 and 0.3.1, the pool was reusing finished tabs to save more resources/time. That logic proved to have flaws, since it's nearly impossible to protect pages/tabs from contamination by the configuration used in the previous request.

### Session Benefits
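The pool-rotation behavior added above (close finished tabs, open a new one while under `max_pages`, otherwise poll for up to 60 seconds before raising `TimeoutError`) can be sketched in plain Python. This is a hypothetical illustration of the counting and timeout logic only, not Scrapling's actual implementation:

```python
import time


class TabPool:
    """Hypothetical sketch of the tab-pool rotation described above.

    NOT Scrapling's real code: it models only the slot counting and the
    sub-second polling with a timeout, with tabs reduced to dicts.
    """

    def __init__(self, max_pages, poll_interval=0.1, max_wait=60.0):
        self.max_pages = max_pages
        self.poll_interval = poll_interval  # "keep checking" cadence (sub-second)
        self.max_wait = max_wait            # give up after this many seconds
        self.tabs = []                      # open tabs: {"finished": bool}

    def _close_finished(self):
        # With each request, close all tabs that have finished their task.
        self.tabs = [tab for tab in self.tabs if not tab["finished"]]

    def acquire_tab(self):
        deadline = time.monotonic() + self.max_wait
        while True:
            self._close_finished()
            # 1. Within the allowed range -> open a fresh tab.
            if len(self.tabs) < self.max_pages:
                tab = {"finished": False}
                self.tabs.append(tab)
                return tab
            # 2. Otherwise keep polling until the deadline, then give up
            #    (e.g. the website became unresponsive and no tab finished).
            if time.monotonic() >= deadline:
                raise TimeoutError("no free tab slot became available in time")
            time.sleep(self.poll_interval)
```

In the real fetcher, each request would acquire a slot like this before navigating; the library manages actual browser tabs rather than placeholder dicts.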
docs/fetching/stealthy.md CHANGED

@@ -12,10 +12,7 @@ You have one primary way to import this Fetcher, which is the same for all fetchers
```
Check out how to configure the parsing options [here](choosing.md#parser-configuration-in-all-fetchers)

- >
- >
- > 1. Every time you fetch a website with this fetcher, it waits by default for all JavaScript to fully load and execute, so you don't have to (wait for the `domcontentloaded` state).
- > 2. Of course, the async version of the `fetch` method is the `async_fetch` method.
+ > Note: The async version of the `fetch` method is the `async_fetch` method, of course.

## Full list of arguments
Before jumping to [examples](#examples), here's the full list of arguments

@@ -40,6 +37,7 @@ Before jumping to [examples](#examples), here's the full list of arguments

| disable_ads | Disabled by default; this installs the `uBlock Origin` addon on the browser if enabled. | ✔️ |
| solve_cloudflare | When enabled, the fetcher solves all three types of Cloudflare's Turnstile wait/captcha page before returning the response to you. | ✔️ |
| network_idle | Wait for the page until there are no network connections for at least 500 ms. | ✔️ |
+ | load_dom | Enabled by default; wait for all JavaScript on the page(s) to fully load and execute (wait for the `domcontentloaded` state). | ✔️ |
| timeout | The timeout used in all operations and waits through the page. It's in milliseconds, and the default is 30,000 (30 seconds). | ✔️ |
| wait | The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the `Response` object. | ✔️ |
| wait_selector | Wait for a specific CSS selector to be in a specific state. | ✔️ |

@@ -188,7 +186,7 @@ page = StealthyFetcher.fetch(
```
This is the last wait the fetcher will do before returning the response (if enabled). You pass a CSS selector to the `wait_selector` argument, and the fetcher will wait for the state you passed in the `wait_selector_state` argument to be fulfilled. If you didn't pass a state, the default would be `attached`, which means it will wait for the element to be present in the DOM.

- After that, the fetcher will check again to see if all JS files are loaded and executed (the `domcontentloaded` state) or continue waiting. If you have enabled `network_idle`, the fetcher will wait for `network_idle` to be fulfilled again, as explained above.
+ After that, if `load_dom` is enabled (the default), the fetcher will check again to see if all JS files are loaded and executed (the `domcontentloaded` state) or continue waiting. If you have enabled `network_idle`, the fetcher will wait for `network_idle` to be fulfilled again, as explained above.

The states the fetcher can wait for can be any of the following ([source](https://playwright.dev/python/docs/api/class-page#page-wait-for-selector)):

@@ -276,14 +274,14 @@ async def scrape_multiple_sites():
    return pages
```

- You may have noticed the `max_pages` argument. This is a new argument that enables the fetcher to create a **pool of Browser tabs** that will be rotated automatically. Instead of using one tab for all your requests, you set a limit
+ You may have noticed the `max_pages` argument. This is a new argument that enables the fetcher to create a **pool of Browser tabs** that will be rotated automatically. Instead of using one tab for all your requests, you set a limit on the maximum number of pages allowed. With each request, the library will close all tabs that have finished their task and check whether the number of current tabs is lower than the maximum allowed number of pages/tabs, then:

- 1. If you are within the allowed range, the fetcher will create a new tab for you and then all is as normal.
- 2. Otherwise, it will keep checking every
+ 1. If you are within the allowed range, the fetcher will create a new tab for you, and then all is as normal.
+ 2. Otherwise, it will keep checking, at sub-second intervals, whether creating a new tab is allowed; after 60 seconds, it raises a `TimeoutError`. This can happen when the website you are fetching becomes unresponsive for some reason.

This logic allows for multiple websites to be fetched at the same time in the same browser, which saves a lot of resources, but most importantly, is so fast :)

- In versions 0.3 and 0.3.1, the pool was reusing finished tabs to save more resources/time
+ In versions 0.3 and 0.3.1, the pool was reusing finished tabs to save more resources/time. That logic proved to have flaws, since it's nearly impossible to protect pages/tabs from contamination by the configuration used in the previous request.

### Session Benefits
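The final wait ordering described in the changed paragraphs above (`wait_selector` with its state, then the `load_dom` re-check, then `network_idle` again) can be sketched against a Playwright-style page object. The helper name and structure here are illustrative, not Scrapling's real code; only the Playwright method names (`wait_for_selector`, `wait_for_load_state`) and the documented order come from the docs:

```python
def final_wait_sequence(page, wait_selector=None, wait_selector_state="attached",
                        load_dom=True, network_idle=False):
    """Illustrative helper (not Scrapling's actual implementation) showing
    the order of the final waits before the Response is returned."""
    # 1. Last wait: the selector must reach the requested state
    #    ('attached' by default, i.e. present in the DOM).
    if wait_selector:
        page.wait_for_selector(wait_selector, state=wait_selector_state)
    # 2. If load_dom is enabled (the default), check again that the DOM
    #    content is fully loaded and executed.
    if load_dom:
        page.wait_for_load_state("domcontentloaded")
    # 3. If network_idle was enabled, wait for it to be fulfilled again.
    if network_idle:
        page.wait_for_load_state("networkidle")
```

With a real Playwright page, each call blocks until its condition is met, so the sequence above is exactly the "last waits" chain the docs describe.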
docs/overview.md CHANGED

@@ -292,7 +292,6 @@ It's built on top of [Playwright](https://playwright.dev/python/) and it's currently

- Stealthy Playwright with custom stealth mode explicitly written for it. It's not top-tier stealth mode, but it bypasses many online tests like [Sannysoft's](https://bot.sannysoft.com/). Check out the `StealthyFetcher` class below for more advanced stealth mode. It uses the Chromium browser.
- Real browsers like your Chrome browser by passing the `real_chrome` argument or the CDP URL of your browser to be controlled by the Fetcher, and most of the options can be enabled on it.

- > Note: All requests done by this fetcher are waiting by default for all JavaScript to be fully loaded and executed. In detail, it waits for the `load` and `domcontentloaded` load states to be reached; you can make it wait for the `networkidle` load state by passing `network_idle=True`, as you will see later.

Again, this is just the tip of the iceberg with this fetcher. Check out the rest from [here](fetching/dynamic.md) for all details and the complete list of arguments.

@@ -314,7 +313,6 @@ True
>>> page.status == 200
True
```
- > Note: All requests done by this fetcher are waiting by default for all JavaScript to be fully loaded and executed. In detail, it waits for the `load` and `domcontentloaded` load states to be reached; you can make it wait for the `networkidle` load state by passing `network_idle=True`, as you will see later.

Again, this is just the tip of the iceberg with this fetcher. Check out the rest from [here](fetching/dynamic.md) for all details and the complete list of arguments.
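The notes removed above documented which Playwright load states the fetcher reaches by default. As a quick reference for that behavior, the effect of `network_idle` on the awaited states can be sketched; the helper itself is illustrative and not part of the library, only the state names are Playwright's:

```python
def awaited_load_states(network_idle=False):
    """Sketch of the default described in the removed note (illustrative
    helper, not library code): which Playwright load states are awaited."""
    # By default the fetcher reaches 'domcontentloaded' and 'load';
    # passing network_idle=True makes it also wait for 'networkidle'.
    states = ["domcontentloaded", "load"]
    if network_idle:
        states.append("networkidle")
    return states
```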