Karim Shoair committed
Commit d540c57 · 1 parent: d10d3f9

docs: update DynamicFetcher page

Files changed (1): docs/fetching/dynamic.md (+45 −56)
docs/fetching/dynamic.md CHANGED
@@ -1,6 +1,6 @@
  # Introduction

- Here, we will discuss the `DynamicFetcher` class (formerly `PlayWrightFetcher`). This class provides flexible browser automation with multiple configuration options and little underhood stealth improvements.

  As we will explain later, to automate the page, you need some knowledge of [Playwright's Page API](https://playwright.dev/python/docs/api/class-page).

@@ -31,13 +31,13 @@ Which are:
  ```python
  DynamicFetcher.fetch('https://example.com')
  ```
- Using it in that manner will open a Chromium browser and load the page. There are optimizations for speed, and some stealth goes automatically under the hood but other than that there are no tricks or extra features unless you enable some; it's just a plain PlayWright API.

  ### 2. Real Chrome
  ```python
  DynamicFetcher.fetch('https://example.com', real_chrome=True)
  ```
- If you have a Google Chrome browser installed, use this option. It's the same as the first option, but it will use the Google Chrome browser you installed on your device instead of Chromium. This will make your requests look more authentic, so it's less detectable for better results.

  If you don't have Google Chrome installed and want to use this option, you can use the command below in the terminal to install it for the library instead of installing it manually:
  ```commandline
@@ -51,45 +51,49 @@ DynamicFetcher.fetch('https://example.com', cdp_url='ws://localhost:9222')
  Instead of launching a browser locally (Chromium/Google Chrome), you can connect to a remote browser through the [Chrome DevTools Protocol](https://chromedevtools.github.io/devtools-protocol/).

- > Note:<br/>
- > * There was a `stealth` option here, but it was moved to `StealthyFetcher` class as explained in the next page with more features since version 0.3.13.<br/>
- > * This makes it less confusing for new users, easier to maintain and other reasons explained in the [StealthyFetcher page](../fetching/stealthy.md).

  ## Full list of arguments
  Scrapling provides many options with this fetcher and its session classes. To make it as simple as possible, we will list the options here and give examples of how to use most of them.

- | Argument | Description | Optional |
- |:-------------------:|-------------|:--------:|
- | url | Target url | ❌ |
- | headless | Pass `True` to run the browser in headless/hidden (**default**) or `False` for headful/visible mode. | ✔️ |
- | disable_resources | Drop requests for unnecessary resources for a speed boost. Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`. | ✔️ |
- | cookies | Set cookies for the next request. | ✔️ |
- | useragent | Pass a useragent string to be used. **Otherwise, the fetcher will generate and use a real Useragent of the same browser and same version.** | ✔️ |
- | network_idle | Wait for the page until there are no network connections for at least 500 ms. | ✔️ |
- | load_dom | Enabled by default, wait for all JavaScript on page(s) to fully load and execute (wait for the `domcontentloaded` state). | ✔️ |
- | timeout | The timeout (milliseconds) used in all operations and waits through the page. The default is 30,000 ms (30 seconds). | ✔️ |
- | wait | The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the `Response` object. | ✔️ |
- | page_action | Added for automation. Pass a function that takes the `page` object and does the necessary automation. | ✔️ |
- | wait_selector | Wait for a specific css selector to be in a specific state. | ✔️ |
- | init_script | An absolute path to a JavaScript file to be executed on page creation for all pages in this session. | ✔️ |
- | wait_selector_state | Scrapling will wait for the given state to be fulfilled for the selector given with `wait_selector`. _Default state is `attached`._ | ✔️ |
- | google_search | Enabled by default, Scrapling will set the referer header as if this request came from a Google search of this website's domain name. | ✔️ |
- | extra_headers | A dictionary of extra headers to add to the request. _The referer set by the `google_search` argument takes priority over the referer set here if used together._ | ✔️ |
- | proxy | The proxy to be used with requests. It can be a string or a dictionary with only the keys 'server', 'username', and 'password'. | ✔️ |
- | real_chrome | If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch and use an instance of your browser. | ✔️ |
- | locale | Specify user locale, for example, `en-GB`, `de-DE`, etc. Locale will affect `navigator.language` value, `Accept-Language` request header value as well as number and date formatting rules. Defaults to the system default locale. | ✔️ |
- | timezone_id | Changes the timezone of the browser. Defaults to the system timezone. | ✔️ |
- | cdp_url | Instead of launching a new browser instance, connect to this CDP URL to control real browsers through CDP. | ✔️ |
- | user_data_dir | Path to a User Data Directory, which stores browser session data like cookies and local storage. The default is to create a temporary directory. **Only Works with sessions** | ✔️ |
- | extra_flags | A list of additional browser flags to pass to the browser on launch. | ✔️ |
- | additional_args | Additional arguments to be passed to Playwright's context as additional settings, and they take higher priority than Scrapling's settings. | ✔️ |
- | selector_config | A dictionary of custom parsing arguments to be used when creating the final `Selector`/`Response` class. | ✔️ |

  In session classes, all these arguments can be set globally for the session. Still, you can configure each request individually by passing some of the arguments here that can be configured on the browser tab level like: `google_search`, `timeout`, `wait`, `page_action`, `extra_headers`, `disable_resources`, `wait_selector`, `wait_selector_state`, `network_idle`, `load_dom`, and `selector_config`.

- > 🔍 Notes:<br/>
- > 1. The `disable_resources` made requests ~25% faster in my tests for some websites and it can help save your proxy usage but be careful with this option as it makes some websites never finish loading.<br/>
- > 2. Since version 0.3.13, the `stealth` option has been removed from here in favor for the `StealthyFetcher` class and the `hide_canvas` option moved to it. The `disable_webgl` has been moved to the `StealthyFetcher` class as well and renamed as `allow_webgl`.

  ## Examples
@@ -99,10 +103,7 @@ It's easier to understand with examples, so let's take a look.

  ```python
  # Disable unnecessary resources
- page = DynamicFetcher.fetch(
-     'https://example.com',
-     disable_resources=True  # Blocks fonts, images, media, etc...
- )
  ```

  ### Network Control
@@ -114,11 +115,8 @@ page = DynamicFetcher.fetch('https://example.com', network_idle=True)

  # Custom timeout (in milliseconds)
  page = DynamicFetcher.fetch('https://example.com', timeout=30000)  # 30 seconds

- # Proxy support
- page = DynamicFetcher.fetch(
-     'https://example.com',
-     proxy='http://username:password@host:port'  # Or it can be a dictionary with the keys 'server', 'username', and 'password' only
- )
  ```

  ### Downloading Files
@@ -130,7 +128,7 @@ with open(file='poster.png', mode='wb') as f:
      f.write(page.body)
  ```

- The `body` attribute of the `Response` object is a `bytes` object containing the response body in case of Non-HTML responses.

  ### Browser Automation
  This is where your knowledge about [Playwright's Page API](https://playwright.dev/python/docs/api/class-page) comes into play. The function you pass here takes the page object from Playwright's API, performs the desired action, and then the fetcher continues.
@@ -146,10 +144,7 @@ def scroll_page(page: Page):
      page.mouse.move(100, 400)
      page.mouse.up()

- page = DynamicFetcher.fetch(
-     'https://example.com',
-     page_action=scroll_page
- )
  ```
  Of course, if you use the async fetch version, the function must also be async.
  ```python
@@ -160,10 +155,7 @@ async def scroll_page(page: Page):
      await page.mouse.move(100, 400)
      await page.mouse.up()

- page = await DynamicFetcher.async_fetch(
-     'https://example.com',
-     page_action=scroll_page
- )
  ```

  ### Wait Conditions
@@ -197,9 +189,6 @@ page = DynamicFetcher.fetch(
      locale='en-US',  # Set browser locale
  )
  ```
- If you didn't set a user agent and enabled headless mode, the fetcher will generate a real User Agent of the same browser version and use it. If you didn't set a user agent without enabling headless mode, the fetcher will leave the browser's default User Agent because it's the same exact as normal browsers in the latest versions.
-
- The `google_search` argument is enabled by default, making the request appear to come from a Google search page. So, a request for `https://example.com` will set the referer to `https://www.google.com/search?q=example`. Also, if used together, it takes priority over the referer set by the `extra_headers` argument.

  ### General example
  ```python
 
  # Introduction

+ Here, we will discuss the `DynamicFetcher` class (formerly `PlayWrightFetcher`). This class provides flexible browser automation with multiple configuration options and a few under-the-hood stealth improvements.

  As we will explain later, to automate the page, you need some knowledge of [Playwright's Page API](https://playwright.dev/python/docs/api/class-page).
 
  ```python
  DynamicFetcher.fetch('https://example.com')
  ```
+ Using it in that manner will open a Chromium browser and load the page. There are optimizations for speed, and some stealth goes automatically under the hood, but other than that, there are no tricks or extra features unless you enable some; it's just the plain Playwright API.

  ### 2. Real Chrome
  ```python
  DynamicFetcher.fetch('https://example.com', real_chrome=True)
  ```
+ If you have a Google Chrome browser installed, use this option. It's the same as the first option, but it will use the Google Chrome browser installed on your device instead of Chromium. This makes your requests look more authentic, so they're less detectable for better results.

  If you don't have Google Chrome installed and want to use this option, you can use the command below in the terminal to install it for the library instead of installing it manually:
  ```commandline
 
  Instead of launching a browser locally (Chromium/Google Chrome), you can connect to a remote browser through the [Chrome DevTools Protocol](https://chromedevtools.github.io/devtools-protocol/).

+ > Notes:
+ >
+ > * There was a `stealth` option here, but since version 0.3.13 it has been moved to the `StealthyFetcher` class, as explained on the next page, with additional features.
+ > * This makes it less confusing for new users, easier to maintain, and provides other benefits, as explained on the [StealthyFetcher page](../fetching/stealthy.md).

  ## Full list of arguments
  Scrapling provides many options with this fetcher and its session classes. To make it as simple as possible, we will list the options here and give examples of how to use most of them.

+ | Argument | Description | Optional |
+ |:-------------------:|-------------|:--------:|
+ | url | Target URL | ❌ |
+ | headless | Pass `True` to run the browser in headless/hidden (**default**) or `False` for headful/visible mode. | ✔️ |
+ | disable_resources | Drop requests for unnecessary resources for a speed boost. Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`. | ✔️ |
+ | cookies | Set cookies for the next request. | ✔️ |
+ | useragent | Pass a useragent string to be used. **Otherwise, the fetcher will generate and use a real user agent for the same browser and version.** | ✔️ |
+ | network_idle | Wait for the page until there are no network connections for at least 500 ms. | ✔️ |
+ | load_dom | Enabled by default, wait for all JavaScript on page(s) to fully load and execute (wait for the `domcontentloaded` state). | ✔️ |
+ | timeout | The timeout (milliseconds) used in all operations and waits through the page. The default is 30,000 ms (30 seconds). | ✔️ |
+ | wait | The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the `Response` object. | ✔️ |
+ | page_action | Added for automation. Pass a function that takes the `page` object and does the necessary automation. | ✔️ |
+ | wait_selector | Wait for a specific CSS selector to be in a specific state. | ✔️ |
+ | init_script | An absolute path to a JavaScript file to be executed on page creation for all pages in this session. | ✔️ |
+ | wait_selector_state | Scrapling will wait for the given state to be fulfilled for the selector given with `wait_selector`. _Default state is `attached`._ | ✔️ |
+ | google_search | Enabled by default, Scrapling will set the referer header as if this request came from a Google search of this website's domain name. | ✔️ |
+ | extra_headers | A dictionary of extra headers to add to the request. _The referer set by the `google_search` argument takes priority over the referer set here if used together._ | ✔️ |
+ | proxy | The proxy to be used with requests. It can be a string or a dictionary with only the keys 'server', 'username', and 'password'. | ✔️ |
+ | real_chrome | If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch and use an instance of your browser. | ✔️ |
+ | locale | Specify user locale, for example, `en-GB`, `de-DE`, etc. Locale will affect the `navigator.language` value, the `Accept-Language` request header value, as well as number and date formatting rules. Defaults to the system default locale. | ✔️ |
+ | timezone_id | Changes the timezone of the browser. Defaults to the system timezone. | ✔️ |
+ | cdp_url | Instead of launching a new browser instance, connect to this CDP URL to control real browsers through CDP. | ✔️ |
+ | user_data_dir | Path to a User Data Directory, which stores browser session data like cookies and local storage. The default is to create a temporary directory. **Only works with sessions.** | ✔️ |
+ | extra_flags | A list of additional browser flags to pass to the browser on launch. | ✔️ |
+ | additional_args | Additional arguments to be passed to Playwright's context as additional settings; they take higher priority than Scrapling's settings. | ✔️ |
+ | selector_config | A dictionary of custom parsing arguments to be used when creating the final `Selector`/`Response` class. | ✔️ |

  In session classes, all these arguments can be set globally for the session. Still, you can configure each request individually by passing some of the arguments here that can be configured on the browser tab level like: `google_search`, `timeout`, `wait`, `page_action`, `extra_headers`, `disable_resources`, `wait_selector`, `wait_selector_state`, `network_idle`, `load_dom`, and `selector_config`.
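The session-versus-request precedence above can be pictured as a simple dictionary merge. The sketch below is plain Python and purely illustrative (not Scrapling's actual implementation); the setting names are taken from the arguments table:

```python
# Hypothetical sketch of the precedence rule: per-request arguments
# override the session-wide defaults, everything else is inherited.
session_defaults = {'timeout': 30000, 'google_search': True, 'network_idle': False}

def effective_settings(defaults: dict, **per_request) -> dict:
    merged = dict(defaults)     # start from the session-wide settings
    merged.update(per_request)  # request-level values take priority
    return merged

print(effective_settings(session_defaults, timeout=60000))
# → {'timeout': 60000, 'google_search': True, 'network_idle': False}
```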
+ > 🔍 Notes:
+ >
+ > 1. The `disable_resources` option made requests ~25% faster in my tests for some websites and can help save your proxy usage, but be careful with it, as it can cause some websites to never finish loading.
+ > 2. The `google_search` argument is enabled by default for all requests, making the request appear to come from a Google search page. So, a request for `https://example.com` will set the referer to `https://www.google.com/search?q=example`. Also, if used together, it takes priority over the referer set by the `extra_headers` argument.
+ > 3. Since version 0.3.13, the `stealth` option has been removed here in favor of the `StealthyFetcher` class, and the `hide_canvas` option has been moved to it. The `disable_webgl` argument has also been moved to the `StealthyFetcher` class and renamed `allow_webgl`.
+ > 4. If you didn't set a user agent and enabled headless mode, the fetcher will generate a real user agent for the same browser version and use it. If you didn't set a user agent and didn't enable headless mode, the fetcher will keep the browser's default user agent, which in recent versions is identical to a normal browser's.
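To make the referer behavior in note 2 concrete, here is a hypothetical helper (not part of Scrapling's API) showing how such a referer can be derived from a target URL:

```python
from urllib.parse import urlparse

def google_referer(url: str) -> str:
    # Extract the site's bare name (e.g. 'example' from 'example.com')
    # and build a Google-search-style referer for it.
    domain = urlparse(url).netloc
    name = domain.removeprefix('www.').split('.')[0]
    return f'https://www.google.com/search?q={name}'

print(google_referer('https://example.com'))  # → https://www.google.com/search?q=example
```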
 
  ## Examples

  ```python
  # Disable unnecessary resources
+ page = DynamicFetcher.fetch('https://example.com', disable_resources=True)  # Blocks fonts, images, media, etc.
  ```
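The resource types dropped by `disable_resources` (listed in the arguments table) amount to a simple membership check on the request's resource type. A minimal illustrative sketch, not Scrapling's actual code:

```python
# Resource types the docs say `disable_resources` drops.
BLOCKED_TYPES = {
    'font', 'image', 'media', 'beacon', 'object', 'imageset',
    'texttrack', 'websocket', 'csp_report', 'stylesheet',
}

def should_block(resource_type: str) -> bool:
    # A request is dropped when its resource type is in the blocklist.
    return resource_type in BLOCKED_TYPES

print(should_block('image'), should_block('document'))  # → True False
```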
  ### Network Control

  # Custom timeout (in milliseconds)
  page = DynamicFetcher.fetch('https://example.com', timeout=30000)  # 30 seconds

+ # Proxy support (it can also be a dictionary with only the keys 'server', 'username', and 'password')
+ page = DynamicFetcher.fetch('https://example.com', proxy='http://username:password@host:port')
  ```
  ### Downloading Files

      f.write(page.body)
  ```

+ The `body` attribute of the `Response` object is a `bytes` object containing the response body in the case of non-HTML responses.

  ### Browser Automation
  This is where your knowledge about [Playwright's Page API](https://playwright.dev/python/docs/api/class-page) comes into play. The function you pass here takes the page object from Playwright's API, performs the desired action, and then the fetcher continues.
 
      page.mouse.move(100, 400)
      page.mouse.up()

+ page = DynamicFetcher.fetch('https://example.com', page_action=scroll_page)
  ```
  Of course, if you use the async fetch version, the function must also be async.
  ```python
 
      await page.mouse.move(100, 400)
      await page.mouse.up()

+ page = await DynamicFetcher.async_fetch('https://example.com', page_action=scroll_page)
  ```

  ### Wait Conditions
 
      locale='en-US',  # Set browser locale
  )
  ```

  ### General example
  ```python