Karim shoair committed
Commit · 5df81ed
Parent(s): edde4f1
docs: update the DynamicFetcher page

docs/fetching/dynamic.md CHANGED (+115 −72)
# Introduction

Here, we will discuss the `DynamicFetcher` class (previously known as `PlayWrightFetcher`). This class provides flexible browser automation with multiple configuration options and some stealth capabilities.

As we will explain later, to automate the page, you need some knowledge of [Playwright's Page API](https://playwright.dev/python/docs/api/class-page).

## Basic Usage
You have one primary way to import this Fetcher, which is the same for all fetchers.

```python
>>> from scrapling.fetchers import DynamicFetcher
```
Check out how to configure the parsing options [here](choosing.md#parser-configuration-in-all-fetchers)

Now, we will review most of the arguments one by one, using examples. If you want to jump to a table of all arguments for quick reference, [click here](#full-list-of-arguments).

> Notes:
>
> 1. Every time you fetch a website with this fetcher, it waits by default for all JavaScript to fully load and execute (the `domcontentloaded` state), so you don't have to.
> 2. Of course, the async version of the `fetch` method is the `async_fetch` method.


This fetcher currently provides four main run options, which can be mixed as desired.

Which are:

### 1. Vanilla Playwright
```python
DynamicFetcher.fetch('https://example.com')
```
Using it in that manner will open a Chromium browser and load the page. There are no tricks or extra features unless you enable some; it's just the plain Playwright API.

### 2. Stealth Mode
```python
DynamicFetcher.fetch('https://example.com', stealth=True)
```
It's the same as the vanilla Playwright option, but it provides a simple stealth mode suitable for websites with small to medium protection layers.

Some of the things this fetcher's stealth mode does include:

* Patches the CDP runtime fingerprint.
* Mimics some real browsers' properties by injecting several JS files and using custom options.
* Uses custom flags on launch to hide Playwright even more and make it faster.
* Generates real browser headers matching the browser type and the user's OS, then appends them to the request's headers.

### 3. Real Chrome
```python
DynamicFetcher.fetch('https://example.com', real_chrome=True)
```
If you have a Google Chrome browser installed, use this option. It's the same as the first option, but it will use the Google Chrome browser installed on your device instead of Chromium.

This will make your requests look more authentic, so it's less detectable, and you can even combine it with `stealth=True` for better results, like below:
```python
DynamicFetcher.fetch('https://example.com', real_chrome=True, stealth=True)
```
If you don't have Google Chrome installed and want to use this option, you can use the command below in the terminal to install it for the library instead of installing it manually:
```commandline
playwright install chrome
```

### 4. CDP Connection
```python
DynamicFetcher.fetch('https://example.com', cdp_url='ws://localhost:9222')
```
Instead of launching a browser locally (Chromium/Google Chrome), you can connect to a remote browser through the [Chrome DevTools Protocol](https://chromedevtools.github.io/devtools-protocol/).

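If you don't already have a CDP endpoint, the usual way to get one is to start a browser with remote debugging enabled. A minimal sketch, assuming a local Chromium install; the flag and port below are standard CDP conventions, not Scrapling options:

```shell
# Start a browser with the DevTools protocol exposed on port 9222
chromium --headless --remote-debugging-port=9222 &

# The browser-level WebSocket URL to pass as cdp_url is reported
# in the webSocketDebuggerUrl field of this endpoint
curl -s http://localhost:9222/json/version
```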
## Full list of arguments
Scrapling provides many options with this fetcher. To make it as simple as possible, we will list the options here and give examples of using most of them.

| Argument | Description | Optional |
|:-------------------:|-------------|:--------:|
| url | Target URL | ❌ |
| headless | Pass `True` to run the browser in headless/hidden (**default**) or `False` for headful/visible mode. | ✔️ |
| disable_resources | Drop requests of unnecessary resources for a speed boost. It depends, but it made requests ~25% faster in my tests for some websites.<br/>Requests dropped are of type `font`, `image`, `media`, `beacon`, `object`, `imageset`, `texttrack`, `websocket`, `csp_report`, and `stylesheet`. _This can help save your proxy usage, but be cautious with this option, as it may cause some websites to never finish loading._ | ✔️ |
| cookies | Set cookies for the next request. | ✔️ |
| useragent | Pass a useragent string to be used. **Otherwise, the fetcher will generate and use a real Useragent of the same browser.** | ✔️ |
| network_idle | Wait for the page until there are no network connections for at least 500 ms. | ✔️ |
| timeout | The timeout (milliseconds) used in all operations and waits through the page. The default is 30,000 ms (30 seconds). | ✔️ |
| wait | The time (milliseconds) the fetcher will wait after everything finishes before closing the page and returning the `Response` object. | ✔️ |
| page_action | Added for automation. Pass a function that takes the `page` object and does the necessary automation, then returns `page` again. | ✔️ |
| wait_selector | Wait for a specific CSS selector to be in a specific state. | ✔️ |
| wait_selector_state | Scrapling will wait for the given state to be fulfilled for the selector given with `wait_selector`. _Default state is `attached`._ | ✔️ |
| google_search | Enabled by default; Scrapling will set the referer header as if this request came from a Google search of this website's domain name. | ✔️ |
| extra_headers | A dictionary of extra headers to add to the request. _The referer set by the `google_search` argument takes priority over the referer set here if used together._ | ✔️ |
| proxy | The proxy to be used with requests. It can be a string or a dictionary with the keys 'server', 'username', and 'password' only. | ✔️ |
| hide_canvas | Add random noise to canvas operations to prevent fingerprinting. | ✔️ |
| disable_webgl | Disables WebGL and WebGL 2.0 support entirely. | ✔️ |
| stealth | Enables stealth mode; you should always check the documentation to see what the stealth mode does currently. | ✔️ |
| real_chrome | If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch and use an instance of your browser. | ✔️ |
| locale | Set the locale for the browser if wanted. The default value is `en-US`. | ✔️ |
| cdp_url | Instead of launching a new browser instance, connect to this CDP URL to control real browsers through CDP. | ✔️ |
| selector_config | A dictionary of custom parsing arguments to be used when creating the final `Selector`/`Response` class. | ✔️ |

## Examples
It's easier to understand with examples, so let's take a look.

### Resource Control

```python
# Disable unnecessary resources
page = DynamicFetcher.fetch(
    'https://example.com',
    disable_resources=True  # Blocks fonts, images, media, etc...
)
```
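Conceptually, the resource filtering above amounts to a membership check against the dropped resource types listed in the arguments table. A minimal sketch of that decision (the names here are ours for illustration, not Scrapling internals):

```python
# Resource types the docs list as dropped when disable_resources=True
BLOCKED_RESOURCE_TYPES = {
    'font', 'image', 'media', 'beacon', 'object',
    'imageset', 'texttrack', 'websocket', 'csp_report', 'stylesheet',
}

def should_block(resource_type: str) -> bool:
    """Return True when a request of this resource type would be dropped."""
    return resource_type in BLOCKED_RESOURCE_TYPES

print(should_block('image'))   # → True
print(should_block('script'))  # → False, scripts still load
```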

```python
# Wait for network idle (consider the fetch finished when there are no network connections for at least 500 ms)
page = DynamicFetcher.fetch('https://example.com', network_idle=True)

# Custom timeout (in milliseconds)
page = DynamicFetcher.fetch('https://example.com', timeout=30000)  # 30 seconds

# Proxy support
page = DynamicFetcher.fetch(
    'https://example.com',
    proxy='http://username:password@host:port'  # Or a dictionary with the keys 'server', 'username', and 'password' only
)
```
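Both proxy forms carry the same information. As a quick illustration of the equivalence, here is a small helper (our own, not part of Scrapling) that splits the URL form into the dictionary form with the `server`, `username`, and `password` keys:

```python
from urllib.parse import urlparse

def proxy_dict_from_url(proxy_url: str) -> dict:
    """Split a proxy URL into the 'server'/'username'/'password' mapping."""
    parsed = urlparse(proxy_url)
    return {
        'server': f'{parsed.scheme}://{parsed.hostname}:{parsed.port}',
        'username': parsed.username or '',
        'password': parsed.password or '',
    }

print(proxy_dict_from_url('http://user:pass@127.0.0.1:8080'))
# → {'server': 'http://127.0.0.1:8080', 'username': 'user', 'password': 'pass'}
```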

### Browser Automation
This is where your knowledge of [Playwright's Page API](https://playwright.dev/python/docs/api/class-page) comes into play. The function you pass here takes the page object from Playwright's API, performs the desired actions, and then returns it for the current fetcher to continue working on it.

This function is executed immediately after waiting for `network_idle` (if enabled) and before waiting for the `wait_selector` argument, allowing it to be used for various purposes, not just automation. You can alter the page as you want.

In the example below, I used page [mouse events](https://playwright.dev/python/docs/api/class-mouse) to move the mouse wheel to scroll the page and then move the mouse.
```python
from playwright.sync_api import Page

def scroll_page(page: Page):
    page.mouse.wheel(10, 0)
    page.mouse.move(100, 400)
    page.mouse.up()
    return page

page = DynamicFetcher.fetch(
    'https://example.com',
    page_action=scroll_page
)
```
The async version of the same example:
```python
from playwright.async_api import Page

async def scroll_page(page: Page):
    await page.mouse.wheel(10, 0)
    await page.mouse.move(100, 400)
    await page.mouse.up()
    return page

page = await DynamicFetcher.async_fetch(
    'https://example.com',
    page_action=scroll_page
)
```

```python
# Wait for the selector
page = DynamicFetcher.fetch(
    'https://example.com',
    wait_selector='h1',
    wait_selector_state='visible'
)
```
This is the last wait the fetcher will do before returning the response (if enabled). You pass a CSS selector to the `wait_selector` argument, and the fetcher will wait for the state you passed in the `wait_selector_state` argument to be fulfilled. If you didn't pass a state, the default would be `attached`, which means it will wait for the element to be present in the DOM.

After that, the fetcher will check again to see if all JS files are loaded and executed (the `domcontentloaded` state) or continue waiting. If you have enabled `network_idle` with this, the fetcher will wait for `network_idle` to be fulfilled again, as explained above.

The states the fetcher can wait for can be any of the following ([source](https://playwright.dev/python/docs/api/class-page#page-wait-for-selector)):

- `attached`: Wait for an element to be present in the DOM.
- `detached`: Wait for an element to not be present in the DOM.
- `visible`: Wait for an element to have a non-empty bounding box and no `visibility:hidden`. Note that an element without any content or with `display:none` has an empty bounding box and is not considered visible.
- `hidden`: Wait for an element to be either detached from the DOM, have an empty bounding box, or `visibility:hidden`. This is the opposite of the `'visible'` option.

### Some Stealth Features

```python
# Full stealth mode
page = DynamicFetcher.fetch(
    'https://example.com',
    stealth=True,
    hide_canvas=True
)

# Custom user agent
page = DynamicFetcher.fetch(
    'https://example.com',
    useragent='Mozilla/5.0...'
)

# Set browser locale
page = DynamicFetcher.fetch(
    'https://example.com',
    locale='en-US'
)
```
Note that the `hide_canvas` argument doesn't disable the canvas but instead hides it by adding random noise to canvas operations, preventing fingerprinting. Also, if you didn't set a user agent (preferred), the fetcher will generate a real User Agent of the same browser and use it.

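To make the idea concrete, canvas noise means nudging pixel values slightly so repeated canvas renders never hash to the same fingerprint. The toy sketch below illustrates the concept only; it is not Scrapling's actual implementation:

```python
import random

def add_canvas_noise(pixels: list[int], seed: int = 42) -> list[int]:
    """Perturb each channel value by at most ±1, clamped to the 0-255 range."""
    rng = random.Random(seed)
    return [min(255, max(0, p + rng.choice((-1, 0, 1)))) for p in pixels]

original = [0, 128, 255, 64]
noisy = add_canvas_noise(original)
print(noisy)  # each value stays within ±1 of the original
```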
The `google_search` argument is enabled by default, making the request look as if it came from a Google search page. So, a request for `https://example.com` will set the referer to `https://www.google.com/search?q=example`. Also, if used together, it takes priority over the referer set by the `extra_headers` argument.

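The mapping just described can be sketched as a small helper (our own illustration; the exact query Scrapling derives from the domain may differ):

```python
from urllib.parse import urlparse

def google_search_referer(url: str) -> str:
    """Build a referer mimicking a Google search for the site's name."""
    hostname = urlparse(url).hostname or ''
    # Take the site name without the 'www.' prefix or the TLD,
    # e.g. 'example' from 'https://example.com'
    site_name = hostname.removeprefix('www.').split('.')[0]
    return f'https://www.google.com/search?q={site_name}'

print(google_search_referer('https://example.com'))
# → https://www.google.com/search?q=example
```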
### General example
```python
from scrapling.fetchers import DynamicFetcher

def scrape_dynamic_content():
    # Use Playwright for JavaScript content
    page = DynamicFetcher.fetch(
        'https://example.com/dynamic',
        network_idle=True,
        wait_selector='.content'
    )
    # ...
    return {
        # ...
    }
```

## Session Management

To keep the browser open across multiple requests with the same configuration, use the `DynamicSession`/`AsyncDynamicSession` classes. These classes accept all the arguments the `fetch` function can take, which lets you specify a config for the entire session.

```python
from scrapling.fetchers import DynamicSession

# Create a session with default configuration
with DynamicSession(
    headless=True,
    stealth=True,
    disable_resources=True,
    real_chrome=True
) as session:
    # Make multiple requests with the same browser instance
    page1 = session.fetch('https://example1.com')
    page2 = session.fetch('https://example2.com')
    page3 = session.fetch('https://dynamic-site.com')

    # All requests reuse the same tab on the same browser instance
```

### Async Session Usage

```python
import asyncio
from scrapling.fetchers import AsyncDynamicSession

async def scrape_multiple_sites():
    async with AsyncDynamicSession(
        stealth=True,
        network_idle=True,
        timeout=30000
    ) as session:
        # Make async requests with shared browser configuration
        pages = await asyncio.gather(
            session.fetch('https://spa-app1.com'),
            session.fetch('https://spa-app2.com'),
            session.fetch('https://dynamic-content.com')
        )
        return pages
```

### Session Benefits

- **Browser reuse**: Much faster subsequent requests by reusing the same browser instance.
- **Cookie persistence**: Automatic cookie and session-state handling, as in any browser.
- **Consistent fingerprint**: The same browser fingerprint across all requests.
- **Memory efficiency**: Better resource usage compared to launching a new browser with each fetch.

## When to Use

Use `DynamicFetcher` when you:

- Need browser automation
- Want multiple browser options