Karim shoair committed on
Commit 061c69c · 1 Parent(s): 66fd35f

docs: update all pages related to last changes

docs/fetching/dynamic.md CHANGED
@@ -14,10 +14,7 @@ Check out how to configure the parsing options [here](choosing.md#parser-configu
 
 Now, we will review most of the arguments one by one, using examples. If you want to jump to a table of all arguments for quick reference, [click here](#full-list-of-arguments)
 
-> Notes:
->
-> 1. Every time you fetch a website with this fetcher, it waits by default for all JavaScript to fully load and execute, so you don't have to (wait for the `domcontentloaded` state).
-> 2. Of course, the async version of the `fetch` method is the `async_fetch` method.
+> Note: The async version of the `fetch` method is `async_fetch`.
 
 
 This fetcher currently provides four main run options, which can be mixed as desired.
@@ -75,6 +72,7 @@ Scrapling provides many options with this fetcher. To make it as simple as possi
 | cookies | Set cookies for the next request. | ✔️ |
 | useragent | Pass a useragent string to be used. **Otherwise, the fetcher will generate and use a real Useragent of the same browser.** | ✔️ |
 | network_idle | Wait for the page until there are no network connections for at least 500 ms. | ✔️ |
+| load_dom | Enabled by default; wait for all JavaScript on the page to fully load and execute (the `domcontentloaded` state). | ✔️ |
 | timeout | The timeout (milliseconds) used in all operations and waits through the page. The default is 30,000 ms (30 seconds). | ✔️ |
 | wait | The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the `Response` object. | ✔️ |
 | page_action | Added for automation. Pass a function that takes the `page` object and does the necessary automation. | ✔️ |
@@ -167,7 +165,7 @@ page = DynamicFetcher.fetch(
 ```
 This is the last wait the fetcher will do before returning the response (if enabled). You pass a CSS selector to the `wait_selector` argument, and the fetcher will wait for the state you passed in the `wait_selector_state` argument to be fulfilled. If you didn't pass a state, the default would be `attached`, which means it will wait for the element to be present in the DOM.
 
-After that, the fetcher will check again to see if all JS files are loaded and executed (the `domcontentloaded` state) or continue waiting. If you have enabled `network_idle` with this, the fetcher will wait for `network_idle` to be fulfilled again, as explained above.
+After that, if `load_dom` is enabled (the default), the fetcher will check again whether all JS files are loaded and executed (the `domcontentloaded` state) or continue waiting. If you have enabled `network_idle`, the fetcher will wait for `network_idle` to be fulfilled again, as explained above.
 
 The states the fetcher can wait for can be any of the following ([source](https://playwright.dev/python/docs/api/class-page#page-wait-for-selector)):
 
@@ -271,14 +269,14 @@ async def scrape_multiple_sites():
 return pages
 ```
 
-You may have noticed the `max_pages` argument. This is a new argument that enables the fetcher to create a **pool of Browser tabs** that will be rotated automatically. Instead of using one tab for all your requests, you set a limit of the maximum number of pages allowed and with each request, the library will close all tabs that finished its task and check if the number of the current tabs is lower than the number of maximum allowed number of pages/tabs then:
+You may have noticed the `max_pages` argument. This new argument enables the fetcher to create a **pool of browser tabs** that is rotated automatically. Instead of using one tab for all your requests, you set a limit on the maximum number of pages allowed. With each request, the library closes all tabs that have finished their task and checks whether the number of current tabs is lower than the maximum allowed number of pages/tabs; then:
 
-1. If you are within the allowed range, the fetcher will create a new tab for you and then all is as normal.
-2. Otherwise, it will keep checking every sub second if creating a new tab is allowed or not for 60 seconds then raise `TimeoutError`. This can happen when the website you are fetching becomes unresponsive for some reason.
+1. If you are within the allowed range, the fetcher will create a new tab for you, and everything proceeds as normal.
+2. Otherwise, it will keep checking at sub-second intervals, for up to 60 seconds, whether creating a new tab is allowed, then raise `TimeoutError`. This can happen when the website you are fetching becomes unresponsive for some reason.
 
 This logic allows for multiple websites to be fetched at the same time in the same browser, which saves a lot of resources, but most importantly, is so fast :)
 
-In versions 0.3 and 0.3.1, the pool was reusing finished tabs to save more resources/time but this logic proved to have flaws since it's nearly impossible to protections pages/tabs from contamination of the previous configuration you used with the request before this one.
+In versions 0.3 and 0.3.1, the pool reused finished tabs to save more resources/time. That logic proved flawed, since it's nearly impossible to protect pages/tabs from contamination by the configuration used in the previous request.
 
 ### Session Benefits
 
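The post-`wait_selector` behavior documented in this file (an optional `domcontentloaded` re-check when `load_dom` is on, then `network_idle` if enabled, then the final `wait` pause) can be sketched as plain Python. The helper below is illustrative only, not Scrapling's implementation; the argument names mirror the documented options.

```python
# Illustrative sketch -- not Scrapling's actual code. It models the order of
# waits the docs describe for the fetcher after `wait_selector` is satisfied.

def waits_after_selector(load_dom: bool = True,
                         network_idle: bool = False,
                         wait: int = 0) -> list:
    """Return the ordered wait steps performed after `wait_selector`."""
    steps = ["wait_selector"]    # the selector wait always comes first
    if load_dom:                 # default: re-check that JS loaded and executed
        steps.append("domcontentloaded")
    if network_idle:             # 500 ms with no network connections
        steps.append("networkidle")
    if wait:                     # final pause before the Response is returned
        steps.append("sleep %dms" % wait)
    return steps

print(waits_after_selector())
# -> ['wait_selector', 'domcontentloaded']
print(waits_after_selector(network_idle=True, wait=2000))
# -> ['wait_selector', 'domcontentloaded', 'networkidle', 'sleep 2000ms']
```

Disabling `load_dom` simply drops the `domcontentloaded` step from this sequence; the other waits are unaffected.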
docs/fetching/stealthy.md CHANGED
@@ -12,10 +12,7 @@ You have one primary way to import this Fetcher, which is the same for all fetch
 ```
 Check out how to configure the parsing options [here](choosing.md#parser-configuration-in-all-fetchers)
 
-> Notes:
->
-> 1. Every time you fetch a website with this fetcher, it waits by default for all JavaScript to fully load and execute, so you don't have to (wait for the `domcontentloaded` state).
-> 2. Of course, the async version of the `fetch` method is the `async_fetch` method.
+> Note: The async version of the `fetch` method is `async_fetch`.
 
 ## Full list of arguments
 Before jumping to [examples](#examples), here's the full list of arguments
@@ -40,6 +37,7 @@ Before jumping to [examples](#examples), here's the full list of arguments
 | disable_ads | Disabled by default; this installs the `uBlock Origin` addon on the browser if enabled. | ✔️ |
 | solve_cloudflare | When enabled, the fetcher solves all three types of Cloudflare's Turnstile wait/captcha page before returning the response to you. | ✔️ |
 | network_idle | Wait for the page until there are no network connections for at least 500 ms. | ✔️ |
+| load_dom | Enabled by default; wait for all JavaScript on the page to fully load and execute (the `domcontentloaded` state). | ✔️ |
 | timeout | The timeout used in all operations and waits through the page. It's in milliseconds, and the default is 30,000 (30 seconds). | ✔️ |
 | wait | The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the `Response` object. | ✔️ |
 | wait_selector | Wait for a specific CSS selector to be in a specific state. | ✔️ |
@@ -188,7 +186,7 @@ page = StealthyFetcher.fetch(
 ```
 This is the last wait the fetcher will do before returning the response (if enabled). You pass a CSS selector to the `wait_selector` argument, and the fetcher will wait for the state you passed in the `wait_selector_state` argument to be fulfilled. If you didn't pass a state, the default would be `attached`, which means it will wait for the element to be present in the DOM.
 
-After that, the fetcher will check again to see if all JS files are loaded and executed (the `domcontentloaded` state) or continue waiting. If you have enabled `network_idle`, the fetcher will wait for `network_idle` to be fulfilled again, as explained above.
+After that, if `load_dom` is enabled (the default), the fetcher will check again whether all JS files are loaded and executed (the `domcontentloaded` state) or continue waiting. If you have enabled `network_idle`, the fetcher will wait for `network_idle` to be fulfilled again, as explained above.
 
 The states the fetcher can wait for can be any of the following ([source](https://playwright.dev/python/docs/api/class-page#page-wait-for-selector)):
 
@@ -276,14 +274,14 @@ async def scrape_multiple_sites():
 return pages
 ```
 
-You may have noticed the `max_pages` argument. This is a new argument that enables the fetcher to create a **pool of Browser tabs** that will be rotated automatically. Instead of using one tab for all your requests, you set a limit of the maximum number of pages allowed and with each request, the library will close all tabs that finished its task and check if the number of the current tabs is lower than the number of maximum allowed number of pages/tabs then:
+You may have noticed the `max_pages` argument. This new argument enables the fetcher to create a **pool of browser tabs** that is rotated automatically. Instead of using one tab for all your requests, you set a limit on the maximum number of pages allowed. With each request, the library closes all tabs that have finished their task and checks whether the number of current tabs is lower than the maximum allowed number of pages/tabs; then:
 
-1. If you are within the allowed range, the fetcher will create a new tab for you and then all is as normal.
-2. Otherwise, it will keep checking every sub second if creating a new tab is allowed or not for 60 seconds then raise `TimeoutError`. This can happen when the website you are fetching becomes unresponsive for some reason.
+1. If you are within the allowed range, the fetcher will create a new tab for you, and everything proceeds as normal.
+2. Otherwise, it will keep checking at sub-second intervals, for up to 60 seconds, whether creating a new tab is allowed, then raise `TimeoutError`. This can happen when the website you are fetching becomes unresponsive for some reason.
 
 This logic allows for multiple websites to be fetched at the same time in the same browser, which saves a lot of resources, but most importantly, is so fast :)
 
-In versions 0.3 and 0.3.1, the pool was reusing finished tabs to save more resources/time but this logic proved to have flaws since it's nearly impossible to protections pages/tabs from contamination of the previous configuration you used with the request before this one.
+In versions 0.3 and 0.3.1, the pool reused finished tabs to save more resources/time. That logic proved flawed, since it's nearly impossible to protect pages/tabs from contamination by the configuration used in the previous request.
 
 ### Session Benefits
 
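The tab-pool behavior these pages describe (close finished tabs, open a new one while under `max_pages`, otherwise poll at sub-second intervals for up to 60 seconds before raising `TimeoutError`) can be modeled in pure Python. This is a hypothetical sketch, not the library's code: `TabPool`, `acquire`, and the poll interval are invented names; only the limit, the polling cadence, the 60-second deadline, and the `TimeoutError` come from the docs.

```python
import time

# Hypothetical model of the tab pool -- not Scrapling's implementation.
class TabPool:
    def __init__(self, max_pages: int):
        self.max_pages = max_pages
        self.busy = []                       # one busy-flag per open tab

    def _close_finished(self):
        # The library closes every tab whose task has finished.
        self.busy = [b for b in self.busy if b]

    def mark_all_finished(self):
        # Stand-in for requests completing in a real session.
        self.busy = [False] * len(self.busy)

    def acquire(self, deadline: float = 60.0, poll: float = 0.1) -> int:
        """Open a new tab, or poll sub-second until a slot frees up."""
        start = time.monotonic()
        while time.monotonic() - start <= deadline:
            self._close_finished()
            if len(self.busy) < self.max_pages:
                self.busy.append(True)       # within range: open a fresh tab
                return len(self.busy)
            time.sleep(poll)                 # re-check in under a second
        raise TimeoutError("no tab slot freed up before the deadline")

pool = TabPool(max_pages=2)
print(pool.acquire(), pool.acquire())   # -> 1 2
pool.mark_all_finished()                # both requests completed
print(pool.acquire())                   # finished tabs are closed first -> 1
```

The `TimeoutError` branch is what you would hit when a fetched site becomes unresponsive and no tab ever finishes within the deadline.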
docs/overview.md CHANGED
@@ -292,7 +292,6 @@ It's built on top of [Playwright](https://playwright.dev/python/) and it's curre
 - Stealthy Playwright with custom stealth mode explicitly written for it. It's not top-tier stealth mode, but it bypasses many online tests like [Sannysoft's](https://bot.sannysoft.com/). Check out the `StealthyFetcher` class below for more advanced stealth mode. It uses the Chromium browser.
 - Real browsers like your Chrome browser by passing the `real_chrome` argument or the CDP URL of your browser to be controlled by the Fetcher, and most of the options can be enabled on it.
 
-> Note: All requests done by this fetcher are waiting by default for all JavaScript to be fully loaded and executed. In detail, it waits for the `load` and `domcontentloaded` load states to be reached; you can make it wait for the `networkidle` load state by passing `network_idle=True`, as you will see later.
 
 Again, this is just the tip of the iceberg with this fetcher. Check out the rest from [here](fetching/dynamic.md) for all details and the complete list of arguments.
 
@@ -314,7 +313,6 @@ True
 >>> page.status == 200
 True
 ```
-> Note: All requests done by this fetcher are waiting by default for all JavaScript to be fully loaded and executed. In detail, it waits for the `load` and `domcontentloaded` load states to be reached; you can make it wait for the `networkidle` load state by passing `network_idle=True`, as you will see later.
 
 Again, this is just the tip of the iceberg with this fetcher. Check out the rest from [here](fetching/dynamic.md) for all details and the complete list of arguments.
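Putting the arguments documented across these pages together, a call might look like the sketch below. The option values, the selector, and the URL are placeholders, and the import path is assumed from Scrapling's documentation; run `fetch_example` only where the library and its browser dependencies are installed.

```python
# Placeholder options combining arguments documented above; the values are
# illustrative, not recommendations.
FETCH_OPTIONS = {
    "load_dom": True,                   # default: wait for `domcontentloaded`
    "network_idle": True,               # plus 500 ms of network silence
    "wait_selector": "h1",              # hypothetical selector for this example
    "wait_selector_state": "attached",  # default: element present in the DOM
    "timeout": 30_000,                  # 30,000 ms = 30 seconds (the default)
}

def fetch_example(url="https://example.com"):
    # Import path assumed from Scrapling's docs; requires the library installed.
    from scrapling.fetchers import DynamicFetcher
    return DynamicFetcher.fetch(url, **FETCH_OPTIONS)
```

The returned `Response` object exposes `page.status`, as shown in the overview examples above.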