Karim Shoair committed on
Commit 4d52585 · 1 Parent(s): 6bc5321

docs: updating `dynamic websites` page and some corrections

Files changed (1)
  1. docs/fetching/dynamic.md +40 -8

docs/fetching/dynamic.md CHANGED
@@ -4,11 +4,11 @@ Here, we will discuss the `DynamicFetcher` class (formerly `PlayWrightFetcher`).
 
 As we will explain later, to automate the page, you need some knowledge of [Playwright's Page API](https://playwright.dev/python/docs/api/class-page).
 
-> 💡 **Prerequisites:**
->
-> 1. You've completed or read the [Fetchers basics](../fetching/choosing.md) page to understand what the [Response object](../fetching/choosing.md#response-object) is and which fetcher to use.
-> 2. You've completed or read the [Querying elements](../parsing/selection.md) page to understand how to find/extract elements from the [Selector](../parsing/main_classes.md#selector)/[Response](../fetching/choosing.md#response-object) object.
-> 3. You've completed or read the [Main classes](../parsing/main_classes.md) page to know what properties/methods the [Response](../fetching/choosing.md#response-object) class is inheriting from the [Selector](../parsing/main_classes.md#selector) class.
+!!! success "Prerequisites"
+
+    1. You've completed or read the [Fetchers basics](../fetching/choosing.md) page to understand what the [Response object](../fetching/choosing.md#response-object) is and which fetcher to use.
+    2. You've completed or read the [Querying elements](../parsing/selection.md) page to understand how to find/extract elements from the [Selector](../parsing/main_classes.md#selector)/[Response](../fetching/choosing.md#response-object) object.
+    3. You've completed or read the [Main classes](../parsing/main_classes.md) page to know what properties/methods the [Response](../fetching/choosing.md#response-object) class is inheriting from the [Selector](../parsing/main_classes.md#selector) class.
 
 ## Basic Usage
 You have one primary way to import this Fetcher, which is the same for all fetchers.
@@ -85,8 +85,12 @@ Scrapling provides many options with this fetcher and its session classes. To ma
 | extra_flags | A list of additional browser flags to pass to the browser on launch. | ✔️ |
 | additional_args | Additional arguments to be passed to Playwright's context as additional settings, and they take higher priority than Scrapling's settings. | ✔️ |
 | selector_config | A dictionary of custom parsing arguments to be used when creating the final `Selector`/`Response` class. | ✔️ |
+| blocked_domains | A set of domain names to block requests to. Subdomains are also matched (e.g., `"example.com"` blocks `"sub.example.com"` too). | ✔️ |
+| proxy_rotator | A `ProxyRotator` instance for automatic proxy rotation. Cannot be combined with `proxy`. | ✔️ |
+| retries | Number of retry attempts for failed requests. Defaults to 3. | ✔️ |
+| retry_delay | Seconds to wait between retry attempts. Defaults to 1. | ✔️ |
 
-In session classes, all these arguments can be set globally for the session. Still, you can configure each request individually by passing some of the arguments here that can be configured on the browser tab level like: `google_search`, `timeout`, `wait`, `page_action`, `extra_headers`, `disable_resources`, `wait_selector`, `wait_selector_state`, `network_idle`, `load_dom`, and `selector_config`.
+In session classes, all these arguments can be set globally for the session. Still, you can configure each request individually by passing some of the arguments here that can be configured on the browser tab level like: `google_search`, `timeout`, `wait`, `page_action`, `extra_headers`, `disable_resources`, `wait_selector`, `wait_selector_state`, `network_idle`, `load_dom`, `blocked_domains`, `proxy`, and `selector_config`.
 
 > 🔍 Notes:
 >
@@ -106,6 +110,13 @@ It's easier to understand with examples, so let's take a look.
 page = DynamicFetcher.fetch('https://example.com', disable_resources=True)  # Blocks fonts, images, media, etc.
 ```
 
+### Domain Blocking
+
+```python
+# Block requests to specific domains (and their subdomains)
+page = DynamicFetcher.fetch('https://example.com', blocked_domains={"ads.example.com", "tracker.net"})
+```
+
 ### Network Control
 
 ```python
@@ -119,6 +130,27 @@ page = DynamicFetcher.fetch('https://example.com', timeout=30000)  # 30 seconds
 page = DynamicFetcher.fetch('https://example.com', proxy='http://username:password@host:port')
 ```
 
+### Proxy Rotation
+
+```python
+from scrapling.fetchers import ProxyRotator
+
+# Set up proxy rotation
+rotator = ProxyRotator([
+    "http://proxy1:8080",
+    "http://proxy2:8080",
+    "http://proxy3:8080",
+])
+
+# Use with session - rotates proxy automatically with each request
+with DynamicSession(proxy_rotator=rotator, headless=True) as session:
+    page1 = session.fetch('https://example1.com')
+    page2 = session.fetch('https://example2.com')
+
+    # Override rotator for a specific request
+    page3 = session.fetch('https://example3.com', proxy='http://specific-proxy:8080')
+```
+
 ### Downloading Files
 
 ```python
@@ -128,7 +160,7 @@ with open(file='main_cover.png', mode='wb') as f:
     f.write(page.body)
 ```
 
-The `body` attribute of the `Response` object is a `bytes` object containing the response body in case of non-HTML responses.
+The `body` attribute of the `Response` object always returns `bytes`.
 
 ### Browser Automation
 This is where your knowledge about [Playwright's Page API](https://playwright.dev/python/docs/api/class-page) comes into play. The function you pass here takes the page object from Playwright's API, performs the desired action, and then the fetcher continues.
@@ -206,7 +238,7 @@ def scrape_dynamic_content():
     content = page.css('.content')
 
     return {
-        'title': content.css_first('h1::text'),
+        'title': content.css('h1::text').get(),
        'items': [
            item.text for item in content.css('.item')
        ]
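
A note on the new `blocked_domains` row added above: the table states that subdomains are matched as well. A minimal sketch of that matching rule (my own helper for illustration, not Scrapling's actual implementation):

```python
def is_blocked(host: str, blocked: set) -> bool:
    # Blocked if the host equals a blocked domain or is a subdomain of one,
    # matching the documented behavior ("example.com" blocks "sub.example.com")
    return any(host == domain or host.endswith('.' + domain) for domain in blocked)

blocked = {"ads.example.com", "tracker.net"}
print(is_blocked("sub.tracker.net", blocked))  # → True
print(is_blocked("example.com", blocked))      # → False (only the ads subdomain is listed)
```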
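
The Proxy Rotation section says the session "rotates proxy automatically with each request" but does not specify the order. Assuming plain round-robin rotation (an assumption, sketched here with stdlib `itertools`, not `ProxyRotator` itself), the behavior would look like:

```python
from itertools import cycle

proxies = ["http://proxy1:8080", "http://proxy2:8080", "http://proxy3:8080"]
rotation = cycle(proxies)  # round-robin: wraps back to the first proxy

# Four "requests" pick proxies 1, 2, 3, then 1 again
picked = [next(rotation) for _ in range(4)]
print(picked[0], picked[3])  # → http://proxy1:8080 http://proxy1:8080
```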
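
Likewise for the new `retries`/`retry_delay` options: the documented behavior (a fixed number of attempts with a fixed delay between them) corresponds to a loop like the following sketch (my own illustration; Scrapling's internals may differ):

```python
import time

def fetch_with_retries(fetch, url, retries=3, retry_delay=1):
    # Try up to `retries` times, sleeping `retry_delay` seconds between attempts
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            time.sleep(retry_delay)

# Demo with a fake fetcher that fails twice, then succeeds
calls = {"n": 0}
def flaky(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"

print(fetch_with_retries(flaky, "https://example.com", retries=3, retry_delay=0))  # → ok
```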