Karim shoair committed
Commit · 4d52585 · Parent(s): 6bc5321
docs: updating `dynamic websites` page and some corrections

1 file changed: docs/fetching/dynamic.md (+40 −8)
@@ -4,11 +4,11 @@ Here, we will discuss the `DynamicFetcher` class (formerly `PlayWrightFetcher`).
 
 As we will explain later, to automate the page, you need some knowledge of [Playwright's Page API](https://playwright.dev/python/docs/api/class-page).
 
-
-
-
-
-
+!!! success "Prerequisites"
+
+    1. You've completed or read the [Fetchers basics](../fetching/choosing.md) page to understand what the [Response object](../fetching/choosing.md#response-object) is and which fetcher to use.
+    2. You've completed or read the [Querying elements](../parsing/selection.md) page to understand how to find/extract elements from the [Selector](../parsing/main_classes.md#selector)/[Response](../fetching/choosing.md#response-object) object.
+    3. You've completed or read the [Main classes](../parsing/main_classes.md) page to know what properties/methods the [Response](../fetching/choosing.md#response-object) class is inheriting from the [Selector](../parsing/main_classes.md#selector) class.
 
 ## Basic Usage
 You have one primary way to import this Fetcher, which is the same for all fetchers.
@@ -85,8 +85,12 @@ Scrapling provides many options with this fetcher and its session classes. To ma
 | extra_flags | A list of additional browser flags to pass to the browser on launch. | ✔️ |
 | additional_args | Additional arguments to be passed to Playwright's context as additional settings, and they take higher priority than Scrapling's settings. | ✔️ |
 | selector_config | A dictionary of custom parsing arguments to be used when creating the final `Selector`/`Response` class. | ✔️ |
+| blocked_domains | A set of domain names to block requests to. Subdomains are also matched (e.g., `"example.com"` blocks `"sub.example.com"` too). | ✔️ |
+| proxy_rotator | A `ProxyRotator` instance for automatic proxy rotation. Cannot be combined with `proxy`. | ✔️ |
+| retries | Number of retry attempts for failed requests. Defaults to 3. | ✔️ |
+| retry_delay | Seconds to wait between retry attempts. Defaults to 1. | ✔️ |
 
-In session classes, all these arguments can be set globally for the session. Still, you can configure each request individually by passing some of the arguments here that can be configured on the browser tab level like: `google_search`, `timeout`, `wait`, `page_action`, `extra_headers`, `disable_resources`, `wait_selector`, `wait_selector_state`, `network_idle`, `load_dom`, and `selector_config`.
+In session classes, all these arguments can be set globally for the session. Still, you can configure each request individually by passing some of the arguments here that can be configured on the browser tab level like: `google_search`, `timeout`, `wait`, `page_action`, `extra_headers`, `disable_resources`, `wait_selector`, `wait_selector_state`, `network_idle`, `load_dom`, `blocked_domains`, `proxy`, and `selector_config`.
 
 > 🔍 Notes:
 >
@@ -106,6 +110,13 @@ It's easier to understand with examples, so let's take a look.
 page = DynamicFetcher.fetch('https://example.com', disable_resources=True) # Blocks fonts, images, media, etc.
 ```
 
+### Domain Blocking
+
+```python
+# Block requests to specific domains (and their subdomains)
+page = DynamicFetcher.fetch('https://example.com', blocked_domains={"ads.example.com", "tracker.net"})
+```
+
 ### Network Control
 
 ```python
@@ -119,6 +130,27 @@ page = DynamicFetcher.fetch('https://example.com', timeout=30000) # 30 seconds
 page = DynamicFetcher.fetch('https://example.com', proxy='http://username:password@host:port')
 ```
 
+### Proxy Rotation
+
+```python
+from scrapling.fetchers import ProxyRotator
+
+# Set up proxy rotation
+rotator = ProxyRotator([
+    "http://proxy1:8080",
+    "http://proxy2:8080",
+    "http://proxy3:8080",
+])
+
+# Use with session - rotates proxy automatically with each request
+with DynamicSession(proxy_rotator=rotator, headless=True) as session:
+    page1 = session.fetch('https://example1.com')
+    page2 = session.fetch('https://example2.com')
+
+    # Override rotator for a specific request
+    page3 = session.fetch('https://example3.com', proxy='http://specific-proxy:8080')
+```
+
 ### Downloading Files
 
 ```python
@@ -128,7 +160,7 @@ with open(file='main_cover.png', mode='wb') as f:
     f.write(page.body)
 ```
 
-The `body` attribute of the `Response` object
+The `body` attribute of the `Response` object always returns `bytes`.
 
 ### Browser Automation
 This is where your knowledge about [Playwright's Page API](https://playwright.dev/python/docs/api/class-page) comes into play. The function you pass here takes the page object from Playwright's API, performs the desired action, and then the fetcher continues.
@@ -206,7 +238,7 @@ def scrape_dynamic_content():
     content = page.css('.content')
 
     return {
-        'title': content.
+        'title': content.css('h1::text').get(),
         'items': [
             item.text for item in content.css('.item')
         ]
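A note on the `blocked_domains` row: the stated rule is that a blocked domain also matches all of its subdomains. The matching logic can be sketched in plain Python as follows — this is only an illustration of the documented behavior (the `is_blocked` helper is hypothetical, not Scrapling's actual implementation):

```python
def is_blocked(hostname: str, blocked_domains: set[str]) -> bool:
    """Return True if hostname equals a blocked domain or is a subdomain
    of one, mirroring the rule described in the blocked_domains row."""
    return any(
        hostname == domain or hostname.endswith("." + domain)
        for domain in blocked_domains
    )

blocked = {"ads.example.com", "tracker.net"}
print(is_blocked("ads.example.com", blocked))  # exact match -> True
print(is_blocked("cdn.tracker.net", blocked))  # subdomain -> True
print(is_blocked("example.com", blocked))      # parent domain, not blocked -> False
```

Note that suffix matching alone would wrongly block `notads.example.com`, which is why the sketch checks for a leading dot.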
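The new `retries`/`retry_delay` options follow the common retry-loop pattern. A minimal generic sketch of that pattern, assuming `retries` counts additional attempts after the first one (the `fetch_with_retries` helper is illustrative, not Scrapling's internal code):

```python
import time

def fetch_with_retries(fetch, url, retries=3, retry_delay=1):
    """Call fetch(url), retrying up to `retries` more times on failure,
    sleeping `retry_delay` seconds between attempts (defaults mirror the table)."""
    last_error = None
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except Exception as exc:
            last_error = exc
            if attempt < retries:
                time.sleep(retry_delay)
    raise last_error

# Fake fetcher that fails twice, then succeeds, to exercise the loop
attempts = {"count": 0}
def flaky_fetch(url):
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary failure")
    return f"response for {url}"

print(fetch_with_retries(flaky_fetch, "https://example.com", retry_delay=0))
```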
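The `proxy_rotator` row says the rotator switches proxies automatically with each request; the simplest scheme with that shape is a round-robin over the proxy list. A toy stand-in to show the idea (the `RoundRobinRotator` class below is hypothetical and not the real `ProxyRotator`, whose rotation strategy the diff does not specify):

```python
from itertools import cycle

class RoundRobinRotator:
    """Toy proxy rotator: hands out proxies in order, wrapping around
    when the list is exhausted."""
    def __init__(self, proxies):
        if not proxies:
            raise ValueError("need at least one proxy")
        self._proxies = cycle(proxies)

    def next_proxy(self):
        return next(self._proxies)

rotator = RoundRobinRotator([
    "http://proxy1:8080",
    "http://proxy2:8080",
    "http://proxy3:8080",
])
print([rotator.next_proxy() for _ in range(4)])
# ['http://proxy1:8080', 'http://proxy2:8080', 'http://proxy3:8080', 'http://proxy1:8080']
```

This also makes the "Cannot be combined with `proxy`" constraint intuitive: a fixed proxy and a rotating one would fight over the same setting, which is why the docs let a per-request `proxy` argument override the session rotator instead.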