Karim Shoair committed
Commit 3ba53b4 · 1 Parent(s): e6bbd0d

docs: update website accordingly

Files changed (2):
  1. docs/fetching/dynamic.md +12 -12
  2. docs/fetching/stealthy.md +14 -11
docs/fetching/dynamic.md CHANGED
@@ -17,7 +17,7 @@ Now, we will review most of the arguments one by one, using examples. If you wan
 > Note: The async version of the `fetch` method is the `async_fetch` method, of course.


-This fetcher currently provides four main run options, which can be mixed as desired.
+This fetcher currently provides four main run options that can be combined as desired.

 Which are:

@@ -62,7 +62,7 @@ DynamicFetcher.fetch('https://example.com', cdp_url='ws://localhost:9222')
 Instead of launching a browser locally (Chromium/Google Chrome), you can connect to a remote browser through the [Chrome DevTools Protocol](https://chromedevtools.github.io/devtools-protocol/).

 ## Full list of arguments
-Scrapling provides many options with this fetcher and its session classes. To make it as simple as possible, we will list the options here and give examples of using most of them.
+Scrapling provides many options with this fetcher and its session classes. To make it as simple as possible, we will list the options here and give examples of how to use most of them.

 | Argument | Description | Optional |
 |:-------------------:|--------------------------------------------------------------|:--------:|
@@ -81,7 +81,7 @@ Scrapling provides many options with this fetcher and its session classes. To ma
 | wait_selector_state | Scrapling will wait for the given state to be fulfilled for the selector given with `wait_selector`. _Default state is `attached`._ | ✔️ |
 | google_search | Enabled by default, Scrapling will set the referer header as if this request came from a Google search of this website's domain name. | ✔️ |
 | extra_headers | A dictionary of extra headers to add to the request. _The referer set by the `google_search` argument takes priority over the referer set here if used together._ | ✔️ |
-| proxy | The proxy to be used with requests. It can be a string or a dictionary with the keys 'server', 'username', and 'password' only. | ✔️ |
+| proxy | The proxy to be used with requests. It can be a string or a dictionary with only the keys 'server', 'username', and 'password'. | ✔️ |
 | hide_canvas | Add random noise to canvas operations to prevent fingerprinting. | ✔️ |
 | disable_webgl | Disables WebGL and WebGL 2.0 support entirely. | ✔️ |
 | stealth | Enables stealth mode; you should always check the documentation to see what the stealth mode does currently. | ✔️ |
@@ -89,7 +89,7 @@ Scrapling provides many options with this fetcher and its session classes. To ma
 | locale | Set the locale for the browser if wanted. The default value is `en-US`. | ✔️ |
 | cdp_url | Instead of launching a new browser instance, connect to this CDP URL to control real browsers through CDP. | ✔️ |
 | user_data_dir | Path to a User Data Directory, which stores browser session data like cookies and local storage. The default is to create a temporary directory. **Only works with sessions** | ✔️ |
-| additional_args | Additional arguments to be passed to Playwright's context as additional settings, and it takes higher priority than Scrapling's settings. | ✔️ |
+| additional_args | Additional arguments to be passed to Playwright's context as additional settings, and they take higher priority than Scrapling's settings. | ✔️ |
 | selector_config | A dictionary of custom parsing arguments to be used when creating the final `Selector`/`Response` class. | ✔️ |

 In session classes, all these arguments can be set globally for the session. Still, you can configure each request individually by passing some of the arguments here that can be configured on the browser tab level like: `google_search`, `timeout`, `wait`, `page_action`, `extra_headers`, `disable_resources`, `wait_selector`, `wait_selector_state`, `network_idle`, `load_dom`, and `selector_config`.
@@ -125,11 +125,11 @@ page = DynamicFetcher.fetch(
 ```

 ### Browser Automation
-This is where your knowledge about [Playwright's Page API](https://playwright.dev/python/docs/api/class-page) comes into play. The function you pass here takes the page object from Playwright's API, performs the desired action, and then returns it for the current fetcher to continue processing.
+This is where your knowledge about [Playwright's Page API](https://playwright.dev/python/docs/api/class-page) comes into play. The function you pass here takes the page object from Playwright's API, performs the desired action, and then the fetcher continues.

-This function is executed immediately after waiting for `network_idle` (if enabled) and before waiting for the `wait_selector` argument, allowing it to be used for various purposes, not just automation. You can alter the page as you want.
+This function is executed immediately after waiting for `network_idle` (if enabled) and before waiting for the `wait_selector` argument, allowing it to be used for purposes beyond automation. You can alter the page as you want.

-In the example below, I used page [mouse events](https://playwright.dev/python/docs/api/class-mouse) to move the mouse wheel to scroll the page and then move the mouse.
+In the example below, I used the page's [mouse events](https://playwright.dev/python/docs/api/class-mouse) to scroll the page with the mouse wheel, then move the mouse.
 ```python
 from playwright.sync_api import Page

@@ -203,9 +203,9 @@ page = DynamicFetcher.fetch(
     locale='en-US'
 )
 ```
-Hence, the `hide_canvas` argument doesn't disable the canvas but instead hides it by adding random noise to canvas operations, preventing fingerprinting. Also, if you didn't set a user agent (preferred), the fetcher will generate a real User Agent of the same browser and use it.
+Hence, the `hide_canvas` argument doesn't disable the canvas; instead, it hides it by adding random noise to canvas operations, preventing fingerprinting. Also, if you didn't set a user agent (preferred), the fetcher will generate a real User Agent of the same browser and use it.

-The `google_search` argument is enabled by default, making the request look as if it came from a Google search page. So, a request for `https://example.com` will set the referer to `https://www.google.com/search?q=example`. Also, if used together, it takes priority over the referer set by the `extra_headers` argument.
+The `google_search` argument is enabled by default, making the request appear to come from a Google search page. So, a request for `https://example.com` will set the referer to `https://www.google.com/search?q=example`. Also, if used together, it takes priority over the referer set by the `extra_headers` argument.

 ### General example
 ```python
@@ -274,14 +274,14 @@ async def scrape_multiple_sites():
     return pages
 ```

-You may have noticed the `max_pages` argument. This is a new argument that enables the fetcher to create a **pool of Browser tabs** that will be rotated automatically. Instead of using one tab for all your requests, you set a limit on the maximum number of pages allowed. With each request, the library will close all tabs that have finished their task and check if the number of the current tabs is lower than the maximum allowed number of pages/tabs, then:
+You may have noticed the `max_pages` argument. This is a new argument that enables the fetcher to create a **rotating pool of browser tabs**. Instead of using a single tab for all your requests, you set a limit on the maximum number of pages. With each request, the library closes all tabs that have finished their task and checks whether the number of current tabs is below the maximum allowed number of pages/tabs, then:

 1. If you are within the allowed range, the fetcher will create a new tab for you, and then all is as normal.
-2. Otherwise, it will keep checking every subsecond if creating a new tab is allowed or not for 60 seconds, then raise `TimeoutError`. This can happen when the website you are fetching becomes unresponsive for some reason.
+2. Otherwise, it will keep checking at sub-second intervals, for up to 60 seconds, whether a new tab is allowed, then raise `TimeoutError`. This can happen when the website you are fetching becomes unresponsive.

 This logic allows for multiple websites to be fetched at the same time in the same browser, which saves a lot of resources, but most importantly, is so fast :)

-In versions 0.3 and 0.3.1, the pool was reusing finished tabs to save more resources/time. That logic proved to have flaws, as it's nearly impossible to protect pages/tabs from contamination by the previous configuration used with the request before this one.
+In versions 0.3 and 0.3.1, the pool reused finished tabs to save more resources/time. That logic proved flawed, as it's nearly impossible to protect pages/tabs from contamination by the configuration used in the previous request.

 ### Session Benefits
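The tab-pool behavior described above (close finished tabs, check the limit, poll at sub-second intervals for up to 60 seconds, then raise `TimeoutError`) can be sketched in plain Python. This is an illustrative model only, not Scrapling's actual implementation; `acquire_tab_slot` and its tab dictionaries are hypothetical names:

```python
import time

def acquire_tab_slot(open_tabs, max_pages, poll_interval=0.1, deadline=60.0):
    """Sketch of the pool check: prune finished tabs, then wait
    (polling sub-second) until creating a new tab is allowed."""
    waited = 0.0
    while True:
        # 1. Close every tab that has finished its task.
        open_tabs[:] = [tab for tab in open_tabs if not tab["finished"]]
        # 2. If we are under the limit, a new tab may be created.
        if len(open_tabs) < max_pages:
            new_tab = {"finished": False}
            open_tabs.append(new_tab)
            return new_tab
        # 3. Otherwise keep polling until the deadline, then give up.
        if waited >= deadline:
            raise TimeoutError(f"no free tab slot after {deadline:.0f} seconds")
        time.sleep(poll_interval)
        waited += poll_interval
```

The real pool creates a fresh tab rather than reusing an old one, for the contamination reason the docs mention above.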
docs/fetching/stealthy.md CHANGED
@@ -1,6 +1,6 @@
 # Introduction

-Here, we will discuss the `StealthyFetcher` class. This class is similar to [DynamicFetcher](dynamic.md#introduction) in many ways, such as browser automation and the utilization of [Playwright's API](https://playwright.dev/python/docs/intro). The main difference is that this class provides advanced anti-bot protection bypass capabilities and a custom version of a modified Firefox browser called [Camoufox](https://github.com/daijro/camoufox), from which most stealth comes.
+Here, we will discuss the `StealthyFetcher` class. This class is similar to [DynamicFetcher](dynamic.md#introduction) in many ways, including browser automation and the use of [Playwright's API](https://playwright.dev/python/docs/intro). The main difference is that this class provides advanced anti-bot protection bypass capabilities and uses a modified Firefox browser called [Camoufox](https://github.com/daijro/camoufox), from which most of the stealth comes.

 As with [DynamicFetcher](dynamic.md#introduction), you will need some knowledge about [Playwright's Page API](https://playwright.dev/python/docs/api/class-page) to automate the page, as we will explain later.

@@ -43,7 +43,7 @@ Scrapling provides many options with this fetcher and its session classes. Befor
 | wait_selector | Wait for a specific CSS selector to be in a specific state. | ✔️ |
 | init_script | An absolute path to a JavaScript file to be executed on page creation for all pages in this session. | ✔️ |
 | wait_selector_state | Scrapling will wait for the given state to be fulfilled for the selector given with `wait_selector`. _Default state is `attached`._ | ✔️ |
-| proxy | The proxy to be used with requests. It can be a string or a dictionary with the keys 'server', 'username', and 'password' only. | ✔️ |
+| proxy | The proxy to be used with requests. It can be a string or a dictionary with only the keys 'server', 'username', and 'password'. | ✔️ |
 | user_data_dir | Path to a User Data Directory, which stores browser session data like cookies and local storage. The default is to create a temporary directory. **Only works with sessions** | ✔️ |
 | additional_args | Additional arguments to be passed to Camoufox as additional settings, and they take higher priority than Scrapling's settings. | ✔️ |
 | selector_config | A dictionary of custom parsing arguments to be used when creating the final `Selector`/`Response` class. | ✔️ |
@@ -51,7 +51,7 @@ Scrapling provides many options with this fetcher and its session classes. Befor
 In session classes, all these arguments can be set globally for the session. Still, you can configure each request individually by passing some of the arguments here that can be configured on the browser tab level like: `google_search`, `timeout`, `wait`, `page_action`, `extra_headers`, `disable_resources`, `wait_selector`, `wait_selector_state`, `network_idle`, `load_dom`, `solve_cloudflare`, and `selector_config`.

 ## Examples
-It's easier to understand with examples, so we will now review most of the arguments individually with examples.
+It's easier to understand with examples, so we will now review most of the arguments individually.

 ### Browser Modes

@@ -98,8 +98,11 @@ The `solve_cloudflare` parameter enables automatic detection and solving all typ
 - Interactive challenges (clicking verification boxes)
 - Invisible challenges (automatic background verification)

+It even solves custom challenge pages.
+
 **Important notes:**

+- Sometimes, with websites that use custom implementations, you will need to use `wait_selector` to make sure Scrapling waits for the real website content to load after solving the captcha. Some websites are the very definition of an edge case, even as we try to keep the solver as generic as possible.
 - When `solve_cloudflare=True` is enabled, `humanize=True` is automatically activated for more realistic behavior
 - The timeout should be at least 60 seconds when using the Cloudflare solver for sufficient challenge-solving time
 - This feature works seamlessly with proxies and other stealth options
@@ -125,7 +128,7 @@ page = StealthyFetcher.fetch(
 )
 ```

-The `google_search` argument is enabled by default, making the request look as if it came from a Google search page. So, a request for `https://example.com` will set the referer to `https://www.google.com/search?q=example`. Also, if used together, it takes priority over the referer set by the `extra_headers` argument.
+The `google_search` argument is enabled by default, making the request appear to come from a Google search page. So, a request for `https://example.com` will set the referer to `https://www.google.com/search?q=example`. Also, if used together, it takes priority over the referer set by the `extra_headers` argument.

 ### Network Control

@@ -144,11 +147,11 @@ page = StealthyFetcher.fetch(
 ```

 ### Browser Automation
-This is where your knowledge about [Playwright's Page API](https://playwright.dev/python/docs/api/class-page) comes into play. The function you pass here takes the page object from Playwright's API, performs the desired action, and then returns it for the current fetcher to continue processing.
+This is where your knowledge about [Playwright's Page API](https://playwright.dev/python/docs/api/class-page) comes into play. The function you pass here takes the page object from Playwright's API, performs the desired action, and then the fetcher continues.

-This function is executed immediately after waiting for `network_idle` (if enabled) and before waiting for the `wait_selector` argument, allowing it to be used for various purposes, not just automation. You can alter the page as you want.
+This function is executed immediately after waiting for `network_idle` (if enabled) and before waiting for the `wait_selector` argument, allowing it to be used for purposes beyond automation. You can alter the page as you want.

-In the example below, I used page [mouse events](https://playwright.dev/python/docs/api/class-mouse) to move the mouse wheel to scroll the page and then move the mouse.
+In the example below, I used the page's [mouse events](https://playwright.dev/python/docs/api/class-mouse) to scroll the page with the mouse wheel, then move the mouse.
 ```python
 from playwright.sync_api import Page

@@ -206,7 +209,7 @@ page = StealthyFetcher.fetch(
     addons=['/path/to/addon1', '/path/to/addon2']
 )
 ```
-The paths here must be paths of extracted addons, which will be installed automatically upon browser launch.
+The paths here must point to extracted addons that will be installed automatically upon browser launch.

 ### Real-world example (Amazon)
 This is for educational purposes only. This example was generated by AI, which shows how easy it is to work with Scrapling through AI.
@@ -276,14 +279,14 @@ async def scrape_multiple_sites():
     return pages
 ```

-You may have noticed the `max_pages` argument. This is a new argument that enables the fetcher to create a **pool of Browser tabs** that will be rotated automatically. Instead of using one tab for all your requests, you set a limit on the maximum number of pages allowed. With each request, the library will close all tabs that have finished their task and check if the number of the current tabs is lower than the maximum allowed number of pages/tabs, then:
+You may have noticed the `max_pages` argument. This is a new argument that enables the fetcher to create a **rotating pool of browser tabs**. Instead of using a single tab for all your requests, you set a limit on the maximum number of pages. With each request, the library closes all tabs that have finished their task and checks whether the number of current tabs is below the maximum allowed number of pages/tabs, then:

 1. If you are within the allowed range, the fetcher will create a new tab for you, and then all is as normal.
-2. Otherwise, it will keep checking every subsecond if creating a new tab is allowed or not for 60 seconds, then raise `TimeoutError`. This can happen when the website you are fetching becomes unresponsive for some reason.
+2. Otherwise, it will keep checking at sub-second intervals, for up to 60 seconds, whether a new tab is allowed, then raise `TimeoutError`. This can happen when the website you are fetching becomes unresponsive.

 This logic allows for multiple websites to be fetched at the same time in the same browser, which saves a lot of resources, but most importantly, is so fast :)

-In versions 0.3 and 0.3.1, the pool was reusing finished tabs to save more resources/time. That logic proved to have flaws, as it's nearly impossible to protect pages/tabs from contamination by the previous configuration used with the request before this one.
+In versions 0.3 and 0.3.1, the pool reused finished tabs to save more resources/time. That logic proved flawed, as it's nearly impossible to protect pages/tabs from contamination by the configuration used in the previous request.

 ### Session Benefits
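Both fetchers document the same two accepted shapes for the `proxy` argument: a plain string, or a dictionary restricted to the keys 'server', 'username', and 'password'. As a quick illustration of those shapes, here is a hypothetical validator; `normalize_proxy` is not part of Scrapling's API:

```python
def normalize_proxy(proxy):
    """Check the two proxy shapes the argument tables describe:
    a plain string, or a dict with only 'server'/'username'/'password'."""
    allowed = {"server", "username", "password"}
    if isinstance(proxy, str):
        # A bare string is treated as the proxy server address.
        return {"server": proxy}
    if isinstance(proxy, dict):
        extra = set(proxy) - allowed
        if extra:
            raise ValueError(f"unexpected proxy keys: {sorted(extra)}")
        if "server" not in proxy:
            raise ValueError("a proxy dictionary must include 'server'")
        return dict(proxy)
    raise TypeError("proxy must be a string or a dictionary")
```

So `proxy='http://user:pass@host:8080'` and `proxy={'server': 'http://host:8080', 'username': 'user', 'password': 'pass'}` are both valid, while a dictionary with any other key (e.g. `port`) is not.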