Karim shoair committed on
Commit
d10d3f9
·
1 Parent(s): 24ea5ca

docs: update DynamicFetcher page

Files changed (1)
  1. docs/fetching/dynamic.md +47 -74
docs/fetching/dynamic.md CHANGED
@@ -1,6 +1,6 @@
1
  # Introduction
2
 
3
- Here, we will discuss the `DynamicFetcher` class (previously known as `PlayWrightFetcher`). This class provides flexible browser automation with multiple configuration options and some stealth capabilities.
4
 
5
  As we will explain later, to automate the page, you need some knowledge of [Playwright's Page API](https://playwright.dev/python/docs/api/class-page).
6
 
@@ -23,7 +23,7 @@ Now, we will review most of the arguments one by one, using examples. If you wan
23
  > Note: The async version of the `fetch` method is the `async_fetch` method, of course.
24
 
25
 
26
- This fetcher currently provides four main run options that can be combined as desired.
27
 
28
  Which are:
29
 
@@ -31,77 +31,66 @@ Which are:
31
  ```python
32
  DynamicFetcher.fetch('https://example.com')
33
  ```
34
- Using it in that manner will open a Chromium browser and load the page. There are no tricks or extra features unless you enable some; it's just a plain PlayWright API.
35
 
36
- ### 2. Stealth Mode
37
- ```python
38
- DynamicFetcher.fetch('https://example.com', stealth=True)
39
- ```
40
- It's the same as the vanilla Playwright option, but it provides a simple stealth mode suitable for websites with a small to medium protection layer(s).
41
-
42
- Some of the things this fetcher's stealth mode does include:
43
-
44
- * Patching the CDP runtime fingerprint by using PatchRight.
45
- * Mimics some of the real browsers' properties by injecting several JS files and using custom options.
46
- * Custom flags are used on launch to hide Playwright even more and make it faster.
47
- * Generates real browser headers of the same type and user OS, then appends them to the request's headers.
48
-
49
- ### 3. Real Chrome
50
  ```python
51
  DynamicFetcher.fetch('https://example.com', real_chrome=True)
52
  ```
53
- If you have a Google Chrome browser installed, use this option. It's the same as the first option, but it will use the Google Chrome browser you installed on your device instead of Chromium.
54
 
55
- This will make your requests look more authentic, so it's less detectable, and you can even use the `stealth=True` mode with it for better results, like below:
56
- ```python
57
- DynamicFetcher.fetch('https://example.com', real_chrome=True, stealth=True)
58
- ```
59
  If you don't have Google Chrome installed and want to use this option, you can use the command below in the terminal to install it for the library instead of installing it manually:
60
  ```commandline
61
  playwright install chrome
62
  ```
63
 
64
- ### 4. CDP Connection
65
  ```python
66
  DynamicFetcher.fetch('https://example.com', cdp_url='ws://localhost:9222')
67
  ```
68
  Instead of launching a browser locally (Chromium/Google Chrome), you can connect to a remote browser through the [Chrome DevTools Protocol](https://chromedevtools.github.io/devtools-protocol/).
69
 
70
  ## Full list of arguments
71
  Scrapling provides many options with this fetcher and its session classes. To make it as simple as possible, we will list the options here and give examples of how to use most of them.
72
 
73
- | Argument | Description | Optional |
74
- |:-------------------:|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:--------:|
75
- | url | Target url | ❌ |
76
- | headless | Pass `True` to run the browser in headless/hidden (**default**) or `False` for headful/visible mode. | ✔️ |
77
- | disable_resources | Drop requests for unnecessary resources for a speed boost. It depends, but it made requests ~25% faster in my tests for some websites.<br/>Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`. _This can help save your proxy usage, but be cautious with this option, as it may cause some websites to never finish loading._ | ✔️ |
78
- | cookies | Set cookies for the next request. | ✔️ |
79
- | useragent | Pass a useragent string to be used. **Otherwise, the fetcher will generate and use a real Useragent of the same browser.** | ✔️ |
80
- | network_idle | Wait for the page until there are no network connections for at least 500 ms. | ✔️ |
81
- | load_dom | Enabled by default, wait for all JavaScript on page(s) to fully load and execute (wait for the `domcontentloaded` state). | ✔️ |
82
- | timeout | The timeout (milliseconds) used in all operations and waits through the page. The default is 30,000 ms (30 seconds). | ✔️ |
83
- | wait | The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the `Response` object. | ✔️ |
84
- | page_action | Added for automation. Pass a function that takes the `page` object and does the necessary automation. | ✔️ |
85
- | wait_selector | Wait for a specific css selector to be in a specific state. | ✔️ |
86
- | init_script | An absolute path to a JavaScript file to be executed on page creation for all pages in this session. | ✔️ |
87
- | wait_selector_state | Scrapling will wait for the given state to be fulfilled for the selector given with `wait_selector`. _Default state is `attached`._ | ✔️ |
88
- | google_search | Enabled by default, Scrapling will set the referer header as if this request came from a Google search of this website's domain name. | ✔️ |
89
- | extra_headers | A dictionary of extra headers to add to the request. _The referer set by the `google_search` argument takes priority over the referer set here if used together._ | ✔️ |
90
- | proxy | The proxy to be used with requests. It can be a string or a dictionary with only the keys 'server', 'username', and 'password'. | ✔️ |
91
- | hide_canvas | Add random noise to canvas operations to prevent fingerprinting. | ✔️ |
92
- | disable_webgl | Disables WebGL and WebGL 2.0 support entirely. | ✔️ |
93
- | stealth | Enables stealth mode; you should always check the documentation to see what the stealth mode does currently. | ✔️ |
94
- | real_chrome | If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch and use an instance of your browser. | ✔️ |
95
- | locale | Set the locale for the browser if wanted. The default value is `en-US`. | ✔️ |
96
- | timezone_id | Set the timezone for the browser if wanted. | ✔️ |
97
- | cdp_url | Instead of launching a new browser instance, connect to this CDP URL to control real browsers through CDP. | ✔️ |
98
- | user_data_dir | Path to a User Data Directory, which stores browser session data like cookies and local storage. The default is to create a temporary directory. **Only Works with sessions** | ✔️ |
99
- | extra_flags | A list of additional browser flags to pass to the browser on launch. | ✔️ |
100
- | additional_args | Additional arguments to be passed to Playwright's context as additional settings, and they take higher priority than Scrapling's settings. | ✔️ |
101
- | selector_config | A dictionary of custom parsing arguments to be used when creating the final `Selector`/`Response` class. | ✔️ |
102
 
103
  In session classes, all these arguments can be set globally for the session. Still, you can configure each request individually by passing some of the arguments here that can be configured on the browser tab level like: `google_search`, `timeout`, `wait`, `page_action`, `extra_headers`, `disable_resources`, `wait_selector`, `wait_selector_state`, `network_idle`, `load_dom`, and `selector_config`.
104
 
105
 
106
  ## Examples
107
  It's easier to understand with examples, so let's take a look.
@@ -201,28 +190,14 @@ The states the fetcher can wait for can be any of the following ([source](https:
201
  ### Some Stealth Features
202
 
203
  ```python
204
- # Full stealth mode
205
- page = DynamicFetcher.fetch(
206
- 'https://example.com',
207
- stealth=True,
208
- hide_canvas=True,
209
- disable_webgl=True,
210
- google_search=True
211
- )
212
-
213
- # Custom user agent
214
- page = DynamicFetcher.fetch(
215
- 'https://example.com',
216
- useragent='Mozilla/5.0...'
217
- )
218
-
219
- # Set browser locale
220
  page = DynamicFetcher.fetch(
221
  'https://example.com',
222
- locale='en-US'
223
  )
224
  ```
225
- Hence, the `hide_canvas` argument doesn't disable the canvas; instead, it hides it by adding random noise to canvas operations, preventing fingerprinting. Also, if you didn't set a user agent (preferred), the fetcher will generate a real User Agent of the same browser and use it.
226
 
227
  The `google_search` argument is enabled by default, making the request appear to come from a Google search page. So, a request for `https://example.com` will set the referer to `https://www.google.com/search?q=example`. Also, if used together, it takes priority over the referer set by the `extra_headers` argument.
228
 
@@ -259,7 +234,6 @@ from scrapling.fetchers import DynamicSession
259
  # Create a session with default configuration
260
  with DynamicSession(
261
  headless=True,
262
- stealth=True,
263
  disable_resources=True,
264
  real_chrome=True
265
  ) as session:
@@ -279,7 +253,6 @@ from scrapling.fetchers import AsyncDynamicSession
279
 
280
  async def scrape_multiple_sites():
281
  async with AsyncDynamicSession(
282
- stealth=True,
283
  network_idle=True,
284
  timeout=30000,
285
  max_pages=3
@@ -298,7 +271,7 @@ You may have noticed the `max_pages` argument. This is a new argument that enabl
298
  1. If you are within the allowed range, the fetcher will create a new tab for you, and then all is as normal.
299
  2. Otherwise, it will keep checking at sub-second intervals, for up to 60 seconds, whether creating a new tab is allowed, then raise `TimeoutError`. This can happen when the website you are fetching becomes unresponsive.
300
 
301
- This logic allows for multiple websites to be fetched at the same time in the same browser, which saves a lot of resources, but most importantly, is so fast :)
302
 
303
  In versions 0.3 and 0.3.1, the pool was reusing finished tabs to save more resources/time. That logic proved flawed, as it's nearly impossible to protect pages/tabs from contamination by the configuration used in the previous request.
304
 
@@ -317,6 +290,6 @@ Use DynamicFetcher when:
317
  - Want multiple browser options
318
  - Using a real Chrome browser
319
  - Need custom browser config
320
- - Want flexible stealth options
321
 
322
  If you want more stealth and control without much config, check out the [StealthyFetcher](stealthy.md).
 
1
  # Introduction
2
 
3
+ Here, we will discuss the `DynamicFetcher` class (formerly `PlayWrightFetcher`). This class provides flexible browser automation with multiple configuration options and a few under-the-hood stealth improvements.
4
 
5
  As we will explain later, to automate the page, you need some knowledge of [Playwright's Page API](https://playwright.dev/python/docs/api/class-page).
6
 
 
23
  > Note: The async version of the `fetch` method is the `async_fetch` method, of course.
24
 
25
 
26
+ This fetcher currently provides three main run options that can be combined as desired.
27
 
28
  Which are:
29
 
 
31
  ```python
32
  DynamicFetcher.fetch('https://example.com')
33
  ```
34
+ Using it in that manner will open a Chromium browser and load the page. There are some speed optimizations, and a little stealth is applied automatically under the hood, but other than that, there are no tricks or extra features unless you enable some; it's just the plain Playwright API.
35
 
36
+ ### 2. Real Chrome
37
  ```python
38
  DynamicFetcher.fetch('https://example.com', real_chrome=True)
39
  ```
40
+ If you have a Google Chrome browser installed, use this option. It's the same as the first option, but it will use the Google Chrome browser installed on your device instead of Chromium. This makes your requests look more authentic, so they are less detectable.
41
 
42
  If you don't have Google Chrome installed and want to use this option, you can use the command below in the terminal to install it for the library instead of installing it manually:
43
  ```commandline
44
  playwright install chrome
45
  ```
46
 
47
+ ### 3. CDP Connection
48
  ```python
49
  DynamicFetcher.fetch('https://example.com', cdp_url='ws://localhost:9222')
50
  ```
51
  Instead of launching a browser locally (Chromium/Google Chrome), you can connect to a remote browser through the [Chrome DevTools Protocol](https://chromedevtools.github.io/devtools-protocol/).
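If you need a CDP endpoint to test this against, one way is to start a local Chrome instance with remote debugging enabled (this uses a standard Chrome switch, not a Scrapling feature; adjust the binary name for your OS):

```commandline
google-chrome --remote-debugging-port=9222
```

Chrome then serves `http://localhost:9222/json/version`, whose `webSocketDebuggerUrl` field is the `ws://` URL you can pass as `cdp_url`.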
52
 
53
+
54
+ > Note:<br/>
55
+ > * There was a `stealth` option here, but since version 0.3.13, it has been moved to the `StealthyFetcher` class with more features, as explained on the next page.<br/>
56
+ > * This makes the library less confusing for new users and easier to maintain, among other reasons explained on the [StealthyFetcher page](../fetching/stealthy.md).
57
+
58
  ## Full list of arguments
59
  Scrapling provides many options with this fetcher and its session classes. To make it as simple as possible, we will list the options here and give examples of how to use most of them.
60
 
61
+ | Argument | Description | Optional |
62
+ |:-------------------:|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:--------:|
63
+ | url | Target url | ❌ |
64
+ | headless | Pass `True` to run the browser in headless/hidden (**default**) or `False` for headful/visible mode. | ✔️ |
65
+ | disable_resources | Drop requests for unnecessary resources for a speed boost. Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`. | ✔️ |
66
+ | cookies | Set cookies for the next request. | ✔️ |
67
+ | useragent | Pass a useragent string to be used. **Otherwise, the fetcher will generate and use a real Useragent of the same browser and same version.** | ✔️ |
68
+ | network_idle | Wait for the page until there are no network connections for at least 500 ms. | ✔️ |
69
+ | load_dom | Enabled by default, wait for all JavaScript on page(s) to fully load and execute (wait for the `domcontentloaded` state). | ✔️ |
70
+ | timeout | The timeout (milliseconds) used in all operations and waits through the page. The default is 30,000 ms (30 seconds). | ✔️ |
71
+ | wait | The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the `Response` object. | ✔️ |
72
+ | page_action | Added for automation. Pass a function that takes the `page` object and does the necessary automation. | ✔️ |
73
+ | wait_selector | Wait for a specific css selector to be in a specific state. | ✔️ |
74
+ | init_script | An absolute path to a JavaScript file to be executed on page creation for all pages in this session. | ✔️ |
75
+ | wait_selector_state | Scrapling will wait for the given state to be fulfilled for the selector given with `wait_selector`. _Default state is `attached`._ | ✔️ |
76
+ | google_search | Enabled by default, Scrapling will set the referer header as if this request came from a Google search of this website's domain name. | ✔️ |
77
+ | extra_headers | A dictionary of extra headers to add to the request. _The referer set by the `google_search` argument takes priority over the referer set here if used together._ | ✔️ |
78
+ | proxy | The proxy to be used with requests. It can be a string or a dictionary with only the keys 'server', 'username', and 'password'. | ✔️ |
79
+ | real_chrome | If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch and use an instance of your browser. | ✔️ |
80
+ | locale | Specify user locale, for example, `en-GB`, `de-DE`, etc. Locale will affect `navigator.language` value, `Accept-Language` request header value as well as number and date formatting rules. Defaults to the system default locale. | ✔️ |
81
+ | timezone_id | Changes the timezone of the browser. Defaults to the system timezone. | ✔️ |
82
+ | cdp_url | Instead of launching a new browser instance, connect to this CDP URL to control real browsers through CDP. | ✔️ |
83
+ | user_data_dir | Path to a User Data Directory, which stores browser session data like cookies and local storage. The default is to create a temporary directory. **Only Works with sessions** | ✔️ |
84
+ | extra_flags | A list of additional browser flags to pass to the browser on launch. | ✔️ |
85
+ | additional_args | Additional arguments to be passed to Playwright's context as additional settings, and they take higher priority than Scrapling's settings. | ✔️ |
86
+ | selector_config | A dictionary of custom parsing arguments to be used when creating the final `Selector`/`Response` class. | ✔️ |
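
As a quick illustration of the `proxy` row above, these are the two shapes the argument accepts (values below are placeholders, not real endpoints):

```python
# As a single string (credentials embedded in the URL)
proxy_as_string = "http://username:password@proxy.example.com:8080"

# As a dictionary with only these three keys
proxy_as_dict = {
    "server": "http://proxy.example.com:8080",
    "username": "username",
    "password": "password",
}
```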
87
 
88
  In session classes, all these arguments can be set globally for the session. Still, you can configure each request individually by passing some of the arguments here that can be configured on the browser tab level like: `google_search`, `timeout`, `wait`, `page_action`, `extra_headers`, `disable_resources`, `wait_selector`, `wait_selector_state`, `network_idle`, `load_dom`, and `selector_config`.
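
For instance, a `page_action` callable simply receives the page object and runs whatever automation you need. A minimal sketch (the selector and the scenario are hypothetical; the methods used are standard [Playwright Page API](https://playwright.dev/python/docs/api/class-page) calls):

```python
def accept_cookies(page):
    # Hypothetical automation: wait for a consent button, then click it.
    page.wait_for_selector("button#accept-cookies")
    page.click("button#accept-cookies")
    return page

# Then pass it per request, e.g.:
# session.fetch('https://example.com', page_action=accept_cookies)
```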
89
 
90
+ > 🔍 Notes:<br/>
91
+ > 1. The `disable_resources` option made requests ~25% faster in my tests on some websites, and it can help save your proxy usage, but be careful with it, as it may cause some websites to never finish loading.<br/>
92
+ > 2. Since version 0.3.13, the `stealth` option has been removed in favor of the `StealthyFetcher` class, and the `hide_canvas` option has moved there as well. The `disable_webgl` option has also been moved to the `StealthyFetcher` class and renamed `allow_webgl`.
93
+
94
 
95
  ## Examples
96
  It's easier to understand with examples, so let's take a look.
 
190
  ### Some Stealth Features
191
 
192
  ```python
193
  page = DynamicFetcher.fetch(
194
  'https://example.com',
195
+ google_search=True,
196
+ useragent='Mozilla/5.0...', # Custom user agent
197
+ locale='en-US', # Set browser locale
198
  )
199
  ```
200
+ If you didn't set a user agent and headless mode is enabled, the fetcher will generate a real User Agent matching the browser version and use it. If you didn't set a user agent and headless mode is disabled, the fetcher leaves the browser's default User Agent, because in recent versions it's exactly the same as a normal browser's.
201
 
202
  The `google_search` argument is enabled by default, making the request appear to come from a Google search page. So, a request for `https://example.com` will set the referer to `https://www.google.com/search?q=example`. Also, if used together, it takes priority over the referer set by the `extra_headers` argument.
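
That referer can be sketched roughly like this (a simplified illustration of the behavior described above, not Scrapling's actual implementation; the domain parsing here is naive and ignores multi-part TLDs):

```python
from urllib.parse import urlparse

def google_search_referer(url):
    # Take the website's name from the domain and build a Google search URL,
    # mimicking the behavior described above.
    hostname = urlparse(url).hostname               # e.g. "example.com"
    name = hostname.removeprefix("www.").split(".")[0]
    return f"https://www.google.com/search?q={name}"

print(google_search_referer("https://example.com"))  # https://www.google.com/search?q=example
```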
203
 
 
234
  # Create a session with default configuration
235
  with DynamicSession(
236
  headless=True,
237
  disable_resources=True,
238
  real_chrome=True
239
  ) as session:
 
253
 
254
  async def scrape_multiple_sites():
255
  async with AsyncDynamicSession(
256
  network_idle=True,
257
  timeout=30000,
258
  max_pages=3
 
271
  1. If you are within the allowed range, the fetcher will create a new tab for you, and then all is as normal.
272
  2. Otherwise, it will keep checking at sub-second intervals, for up to 60 seconds, whether creating a new tab is allowed, then raise `TimeoutError`. This can happen when the website you are fetching becomes unresponsive.
273
 
274
+ This logic allows multiple URLs to be fetched at the same time in the same browser, which saves a lot of resources, but most importantly, it's very fast :)
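
The waiting part of that logic can be sketched in plain Python (a simplified illustration, not Scrapling's actual code; `can_open_tab` is a hypothetical callback):

```python
import time

def wait_for_free_tab(can_open_tab, timeout=60.0, poll_interval=0.5):
    # Keep checking at sub-second intervals whether a new tab is allowed,
    # as described above; give up with TimeoutError after `timeout` seconds.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if can_open_tab():
            return
        time.sleep(poll_interval)
    raise TimeoutError("Could not open a new browser tab within the timeout")
```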
275
 
276
  In versions 0.3 and 0.3.1, the pool was reusing finished tabs to save more resources/time. That logic proved flawed, as it's nearly impossible to protect pages/tabs from contamination by the configuration used in the previous request.
277
 
 
290
  - Want multiple browser options
291
  - Using a real Chrome browser
292
  - Need custom browser config
293
+ - Want a few stealth options
294
 
295
  If you want more stealth and control without much config, check out the [StealthyFetcher](stealthy.md).