Karim shoair committed
Commit · 028ed17
Parent(s): 309c3e5

docs: Update all pages related to last changes

Files changed:
- README.md +28 -24
- docs/fetching/dynamic.md +4 -1
- docs/fetching/stealthy.md +2 -1
- docs/index.md +24 -20
README.md
CHANGED

````diff
@@ -92,7 +92,7 @@ Built for the modern Web, Scrapling has its own rapid parsing engine and its fet
 - 🔄 **Smart Element Tracking**: Relocate elements after website changes using intelligent similarity algorithms.
 - 🎯 **Smart Flexible Selection**: CSS selectors, XPath selectors, filter-based search, text search, regex search, and more.
 - 🔍 **Find Similar Elements**: Automatically locate elements similar to found elements.
-- 🤖 **MCP Server to be used with AI**: Built-in MCP server for AI-assisted Web Scraping and data extraction. The MCP server features custom, powerful capabilities that utilize Scrapling to extract targeted content before passing it to the AI (Claude/Cursor/etc), thereby speeding up operations and reducing costs by minimizing token usage.
+- 🤖 **MCP Server to be used with AI**: Built-in MCP server for AI-assisted Web Scraping and data extraction. The MCP server features custom, powerful capabilities that utilize Scrapling to extract targeted content before passing it to the AI (Claude/Cursor/etc), thereby speeding up operations and reducing costs by minimizing token usage. ([demo video](https://www.youtube.com/watch?v=qyFk3ZNwOxE))
 
 ### High-Performance & battle-tested Architecture
 - 🚀 **Lightning Fast**: Optimized performance outperforming most Python scraping libraries.
@@ -134,7 +134,7 @@ quotes = page.css('.quote .text::text')
 
 # Advanced stealth mode (Keep the browser open until you finish)
 with StealthySession(headless=True, solve_cloudflare=True) as session:
-    page = session.fetch('https://nopecha.com/demo/cloudflare')
+    page = session.fetch('https://nopecha.com/demo/cloudflare', google_search=False)
     data = page.css('#padded_content a')
 
 # Or use one-off request style, it opens the browser for this request, then closes it after finishing
@@ -143,7 +143,7 @@ data = page.css('#padded_content a')
 
 # Full browser automation (Keep the browser open until you finish)
 with DynamicSession(headless=True, disable_resources=False, network_idle=True) as session:
-    page = session.fetch('https://quotes.toscrape.com/')
+    page = session.fetch('https://quotes.toscrape.com/', load_dom=False)
     data = page.xpath('//span[@class="text"]/text()') # XPath selector if you prefer it
 
 # Or use one-off request style, it opens the browser for this request, then closes it after finishing
@@ -187,7 +187,7 @@ from scrapling.parser import Selector
 
 page = Selector("<html>...</html>")
 ```
-And it works
+And it works precisely the same way!
 
 ### Async Session Management Examples
 ```python
@@ -271,29 +271,33 @@ Scrapling requires Python 3.10 or higher:
 pip install scrapling
 ```
 
-
-
-If you are going to use any of the fetchers or their classes, then install browser dependencies with
-```bash
-scrapling install
-```
-
-This downloads all browsers with their system dependencies and fingerprint manipulation dependencies.
+Starting with v0.3.2, this installation only includes the parser engine and its dependencies, without any fetchers.
 
 ### Optional Dependencies
 
-
-```bash
-pip install "scrapling[
-
-
-```
-
-
-
-
-
-```
+1. If you are going to use any of the extra features below, the fetchers, or their classes, then you need to install the fetchers' dependencies, and then install their browser dependencies with
+   ```bash
+   pip install "scrapling[fetchers]"
+
+   scrapling install
+   ```
+
+   This downloads all browsers with their system dependencies and fingerprint manipulation dependencies.
+
+2. Extra features:
+   - Install the MCP server feature:
+     ```bash
+     pip install "scrapling[ai]"
+     ```
+   - Install shell features (Web Scraping shell and the `extract` command):
+     ```bash
+     pip install "scrapling[shell]"
+     ```
+   - Install everything:
+     ```bash
+     pip install "scrapling[all]"
+     ```
+   Don't forget that you need to install the browser dependencies with `scrapling install` after any of these extras (if you didn't already)
 
 ## Contributing
 
````
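The session-versus-one-off distinction in the README snippets above (keep the browser open across several fetches vs. open a browser for one request and close it) can be sketched with a toy stand-in. All names below (`ToyBrowser`, `ToySession`, `one_off_fetch`) are hypothetical illustrations, not Scrapling's real classes:

```python
class ToyBrowser:
    """Hypothetical stand-in for a real browser process."""
    def __init__(self):
        self.open = True
        self.fetch_count = 0

    def fetch(self, url):
        assert self.open, "browser already closed"
        self.fetch_count += 1
        return f"<html from {url}>"

    def close(self):
        self.open = False


class ToySession:
    """Session style: one browser stays open for many fetches."""
    def __enter__(self):
        self.browser = ToyBrowser()
        return self

    def __exit__(self, *exc):
        self.browser.close()

    def fetch(self, url):
        return self.browser.fetch(url)


def one_off_fetch(url):
    """One-off style: opens a browser for a single request, then closes it."""
    browser = ToyBrowser()
    try:
        return browser.fetch(url)
    finally:
        browser.close()


with ToySession() as session:
    session.fetch('https://quotes.toscrape.com/page/1/')
    session.fetch('https://quotes.toscrape.com/page/2/')  # same browser, no relaunch
    assert session.browser.fetch_count == 2

one_off_fetch('https://quotes.toscrape.com/')  # fresh browser, closed right after
```

The session style amortizes the (expensive) browser launch over many requests, which is why the README keeps it as the default pattern for multi-page jobs.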
docs/fetching/dynamic.md
CHANGED

````diff
@@ -62,7 +62,7 @@ DynamicFetcher.fetch('https://example.com', cdp_url='ws://localhost:9222')
 Instead of launching a browser locally (Chromium/Google Chrome), you can connect to a remote browser through the [Chrome DevTools Protocol](https://chromedevtools.github.io/devtools-protocol/).
 
 ## Full list of arguments
-Scrapling provides many options with this fetcher. To make it as simple as possible, we will list the options here and give examples of using most of them.
+Scrapling provides many options with this fetcher and its session classes. To make it as simple as possible, we will list the options here and give examples of using most of them.
 
 | Argument | Description | Optional |
 |:-------------------:|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:--------:|
@@ -90,6 +90,9 @@ Scrapling provides many options with this fetcher. To make it as simple as possi
 | cdp_url | Instead of launching a new browser instance, connect to this CDP URL to control real browsers through CDP. | ✔️ |
 | selector_config | A dictionary of custom parsing arguments to be used when creating the final `Selector`/`Response` class. | ✔️ |
 
+In the session classes, all these arguments can be set for the session globally. Still, you can configure each request individually by passing some of the arguments here that can be configured on the browser tab level like: `google_search`, `timeout`, `wait`, `page_action`, `extra_headers`, `disable_resources`, `wait_selector`, `wait_selector_state`, `network_idle`, `load_dom`, and `selector_config`.
+
+
 ## Examples
 It's easier to understand with examples, so let's take a look.
 
````
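The precedence rule the added paragraph describes — session-wide defaults, with a limited set of tab-level arguments overridable per request — can be sketched in plain Python. This is a conceptual sketch with hypothetical names (`ToyDynamicSession`), not Scrapling's real implementation:

```python
# Arguments the docs list as configurable at the browser-tab level,
# i.e. overridable on each individual fetch.
TAB_LEVEL_ARGS = {
    'google_search', 'timeout', 'wait', 'page_action', 'extra_headers',
    'disable_resources', 'wait_selector', 'wait_selector_state',
    'network_idle', 'load_dom', 'selector_config',
}


class ToyDynamicSession:
    """Sketch: store session-wide defaults, merge per-request overrides on top."""
    def __init__(self, **session_args):
        self.session_args = session_args

    def fetch(self, url, **overrides):
        unknown = set(overrides) - TAB_LEVEL_ARGS
        if unknown:
            raise TypeError(f"not overridable per request: {sorted(unknown)}")
        # Per-request values take precedence over the session-wide defaults.
        effective = {**self.session_args, **overrides}
        return url, effective


session = ToyDynamicSession(headless=True, network_idle=True, load_dom=True)
_, args = session.fetch('https://quotes.toscrape.com/', load_dom=False)
assert args['load_dom'] is False      # per-request override wins
assert args['network_idle'] is True   # session default kept
```

Arguments tied to the browser process itself (e.g. `headless`, `cdp_url`) can only be set when the session is created, which is why the sketch rejects them at fetch time.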
docs/fetching/stealthy.md
CHANGED

````diff
@@ -15,7 +15,7 @@ Check out how to configure the parsing options [here](choosing.md#parser-configu
 > Note: The async version of the `fetch` method is the `async_fetch` method, of course.
 
 ## Full list of arguments
-Before jumping to [examples](#examples), here's the full list of arguments
+Scrapling provides many options with this fetcher and its session classes. Before jumping to the [examples](#examples), here's the full list of arguments
 
 
 | Argument | Description | Optional |
@@ -47,6 +47,7 @@ Before jumping to [examples](#examples), here's the full list of arguments
 | additional_args | Additional arguments to be passed to Camoufox as additional settings, and they take higher priority than Scrapling's settings. | ✔️ |
 | selector_config | A dictionary of custom parsing arguments to be used when creating the final `Selector`/`Response` class. | ✔️ |
 
+In the session classes, all these arguments can be set for the session globally. Still, you can configure each request individually by passing some of the arguments here that can be configured on the browser tab level like: `google_search`, `timeout`, `wait`, `page_action`, `extra_headers`, `disable_resources`, `wait_selector`, `wait_selector_state`, `network_idle`, `load_dom`, `solve_cloudflare`, and `selector_config`.
 
 ## Examples
 It's easier to understand with examples, so we will now review most of the arguments individually with examples.
````
docs/index.md
CHANGED

````diff
@@ -114,29 +114,33 @@ Scrapling requires Python 3.10 or higher:
 pip install scrapling
 ```
 
-
-
-If you are going to use any of the fetchers or their session classes, then install browser dependencies with
-```bash
-scrapling install
-```
-
-This downloads all browsers with their system dependencies and fingerprint manipulation dependencies.
+Starting with v0.3.2, this installation only includes the parser engine and its dependencies, without any fetchers.
 
 ### Optional Dependencies
 
-
-```bash
-pip install "scrapling[
-
-
-```
-
-
-
-
-
-```
+1. If you are going to use any of the extra features below, the fetchers, or their classes, then you need to install the fetchers' dependencies, and then install their browser dependencies with
+   ```bash
+   pip install "scrapling[fetchers]"
+
+   scrapling install
+   ```
+
+   This downloads all browsers with their system dependencies and fingerprint manipulation dependencies.
+
+2. Extra features:
+   - Install the MCP server feature:
+     ```bash
+     pip install "scrapling[ai]"
+     ```
+   - Install shell features (Web Scraping shell and the `extract` command):
+     ```bash
+     pip install "scrapling[shell]"
+     ```
+   - Install everything:
+     ```bash
+     pip install "scrapling[all]"
+     ```
+   Don't forget that you need to install the browser dependencies with `scrapling install` after any of these extras (if you didn't already)
 
 ## How the documentation is organized
 Scrapling has a lot of documentation, so we try to follow a guideline called the [Diátaxis documentation framework](https://diataxis.fr/).
````
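The install matrix added in these docs (base parser-only install, plus `fetchers`, `ai`, `shell`, and `all` extras) can be summarized in a small lookup. The feature groupings below come from the docs; the descriptive labels inside each set are illustrative, not real dependency names:

```python
# Feature groups from the install docs; labels are illustrative only.
EXTRAS = {
    'fetchers': {'fetcher classes and their browser tooling'},
    'ai': {'MCP server'},
    'shell': {'Web Scraping shell', 'extract command'},
}
# Per the docs, "all" installs everything the other extras cover.
EXTRAS['all'] = set().union(*EXTRAS.values())


def pip_command(extra):
    """Build the pip invocation shown in the docs for a given extra."""
    return f'pip install "scrapling[{extra}]"'


assert pip_command('ai') == 'pip install "scrapling[ai]"'
assert EXTRAS['shell'] <= EXTRAS['all']  # "all" is a superset of each extra
```

Whichever extra is chosen, the docs note that `scrapling install` must still be run afterwards to download the browsers and their system dependencies.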