Karim shoair committed
Commit 790e9ee · Parent(s): 6f38a9c

docs: improve wording, style, and use more visual ways

Changed files:
- docs/ai/mcp-server.md +1 -1
- docs/fetching/choosing.md +3 -1
- docs/fetching/dynamic.md +16 -14
- docs/fetching/static.md +6 -5
- docs/fetching/stealthy.md +17 -15
- docs/overview.md +5 -5
- docs/parsing/adaptive.md +3 -1
- docs/parsing/main_classes.md +4 -4
- docs/parsing/selection.md +13 -8
- docs/tutorials/migrating_from_beautifulsoup.md +5 -5
docs/ai/mcp-server.md CHANGED

@@ -179,7 +179,7 @@ We will gradually go from simple prompts to more complex ones. We will use Claud
 ```
 Use regular requests to scrape the main content from https://example.com and convert it to markdown format.
 ```
-This tells Claude which tool to use here, so it doesn't have to guess. Sometimes it will start using normal requests on its own, and at other times, it will assume browsers are better suited for this website without any apparent reason. As a
+This tells Claude which tool to use here, so it doesn't have to guess. Sometimes it will start using normal requests on its own, and at other times, it will assume browsers are better suited for this website without any apparent reason. As a rule of thumb, you should always tell Claude which tool to use to save time and money and get consistent results.
 
 2. **Targeted Data Extraction**
 
docs/fetching/choosing.md CHANGED

@@ -52,7 +52,9 @@ Then, continue your code as usual.
 
 The available configuration arguments are: `adaptive`, `huge_tree`, `keep_comments`, `keep_cdata`, `storage`, and `storage_args`, which are the same ones you give to the [Selector](../parsing/main_classes.md#selector) class. You can display the current configuration anytime by running `<fetcher_class>.display_config()`.
 
-
+!!! info
+
+    The `adaptive` argument is disabled by default; you must enable it to use that feature.
 
 ### Set parser config per request
 As you probably understand, the logic above for setting the parser config will apply globally to all requests/fetches made through that class, and it's intended for simplicity.
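The global-then-per-request configuration pattern this hunk documents can be sketched generically. The names below (`BaseFetcher`, `fetch`) are hypothetical stand-ins chosen only to illustrate the merge order the docs describe, not Scrapling's actual internals:

```python
# Minimal sketch: class-level parser config applies to all fetches,
# while per-request keyword arguments override it for a single call.
class BaseFetcher:
    # Global parser configuration (defaults mirror the docs: adaptive is off)
    parser_config = {"adaptive": False, "huge_tree": True, "keep_comments": False}

    @classmethod
    def display_config(cls):
        return dict(cls.parser_config)

    @classmethod
    def fetch(cls, url, **overrides):
        # Per-request options take priority over the class-level defaults
        config = {**cls.parser_config, **overrides}
        return config  # a real fetcher would parse the response using this config


assert BaseFetcher.display_config()["adaptive"] is False
# Enabling `adaptive` for one request only; the global default is untouched
cfg = BaseFetcher.fetch("https://example.com", adaptive=True)
assert cfg["adaptive"] is True and cfg["huge_tree"] is True
assert BaseFetcher.parser_config["adaptive"] is False
```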
docs/fetching/dynamic.md CHANGED

@@ -20,7 +20,9 @@ Check out how to configure the parsing options [here](choosing.md#parser-configu
 
 Now, we will review most of the arguments one by one, using examples. If you want to jump to a table of all arguments for quick reference, [click here](#full-list-of-arguments)
 
-
+!!! abstract
+
+    The async version of the `fetch` method is `async_fetch`, of course.
 
 This fetcher currently provides three main run options that can be combined as desired.

@@ -51,10 +53,10 @@ DynamicFetcher.fetch('https://example.com', cdp_url='ws://localhost:9222')
 Instead of launching a browser locally (Chromium/Google Chrome), you can connect to a remote browser through the [Chrome DevTools Protocol](https://chromedevtools.github.io/devtools-protocol/).
 
 
-
-
-
-
+!!! note "Notes:"
+
+    * There was a `stealth` option here, but it was moved to the `StealthyFetcher` class, as explained on the next page, with additional features since version 0.3.13.<br/>
+    * This makes it less confusing for new users, easier to maintain, and provides other benefits, as explained on the [StealthyFetcher page](../fetching/stealthy.md).
 
 ## Full list of arguments
 Scrapling provides many options with this fetcher and its session classes. To make it as simple as possible, we will list the options here and give examples of how to use most of them.

@@ -85,19 +87,19 @@ Scrapling provides many options with this fetcher and its session classes. To ma
 | extra_flags | A list of additional browser flags to pass to the browser on launch. | ✔️ |
 | additional_args | Additional arguments to be passed to Playwright's context as additional settings, and they take higher priority than Scrapling's settings. | ✔️ |
 | selector_config | A dictionary of custom parsing arguments to be used when creating the final `Selector`/`Response` class. | ✔️ |
-
-
+| blocked_domains | A set of domain names to block requests to. Subdomains are also matched (e.g., `"example.com"` blocks `"sub.example.com"` too). | ✔️ |
+| proxy_rotator | A `ProxyRotator` instance for automatic proxy rotation. Cannot be combined with `proxy`. | ✔️ |
 | retries | Number of retry attempts for failed requests. Defaults to 3. | ✔️ |
-
+| retry_delay | Seconds to wait between retry attempts. Defaults to 1. | ✔️ |
 
 In session classes, all these arguments can be set globally for the session. Still, you can configure each request individually by passing some of the arguments here that can be configured on the browser tab level like: `google_search`, `timeout`, `wait`, `page_action`, `extra_headers`, `disable_resources`, `wait_selector`, `wait_selector_state`, `network_idle`, `load_dom`, `blocked_domains`, `proxy`, and `selector_config`.
 
-
-
-
-
-
-
+!!! note "Notes:"
+
+    1. The `disable_resources` option made requests ~25% faster in my tests for some websites and can help save your proxy usage, but be careful with it, as it can cause some websites to never finish loading.
+    2. The `google_search` argument is enabled by default for all requests, making the request appear to come from a Google search page. So, a request for `https://example.com` will set the referer to `https://www.google.com/search?q=example`. Also, if used together, it takes priority over the referer set by the `extra_headers` argument.
+    3. Since version 0.3.13, the `stealth` option has been removed here in favor of the `StealthyFetcher` class, and the `hide_canvas` option has been moved to it. The `disable_webgl` argument has been moved to the `StealthyFetcher` class and renamed as `allow_webgl`.
+    4. If you didn't set a user agent and enabled headless mode, the fetcher will generate a real user agent for the same browser version and use it. If you didn't set a user agent and didn't enable headless mode, the fetcher will use the browser's default user agent, which is the same as in standard browsers in the latest versions.
 
 
 ## Examples
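The subdomain matching described in the new `blocked_domains` table row ("`example.com`" blocks "`sub.example.com`" too) can be sketched as a small helper. This is an illustrative guess at the matching rule, not Scrapling's implementation:

```python
# Sketch of the blocked_domains matching rule the table describes:
# a blocked domain matches itself and every subdomain under it.
from urllib.parse import urlparse


def is_blocked(url, blocked_domains):
    host = urlparse(url).hostname or ""
    # Blocked if the host equals a blocked domain, or ends with
    # "." + that domain (i.e., is a subdomain of it).
    return any(host == d or host.endswith("." + d) for d in blocked_domains)


blocked = {"example.com"}
assert is_blocked("https://example.com/page", blocked)
assert is_blocked("https://sub.example.com/", blocked)      # subdomain matched too
assert not is_blocked("https://notexample.com/", blocked)   # no suffix false positive
```

Note the `"." + d` check: a plain `endswith("example.com")` would wrongly block `notexample.com`.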
docs/fetching/static.md CHANGED

@@ -38,12 +38,13 @@ All methods for making requests here share some arguments, so let's discuss them
 - **cert**: Tuple of (cert, key) filenames for the client certificate.
 - **selector_config**: A dictionary of custom parsing arguments to be used when creating the final `Selector`/`Response` class.
 
-
-> 1. The currently available browsers to impersonate are (`"edge"`, `"chrome"`, `"chrome_android"`, `"safari"`, `"safari_beta"`, `"safari_ios"`, `"safari_ios_beta"`, `"firefox"`, `"tor"`)<br/>
-> 2. The available browsers to impersonate, along with their corresponding versions, are automatically displayed in the argument autocompletion and updated with each `curl_cffi` update.<br/>
-> 3. If any of the arguments `impersonate` or `stealthy_headers` are enabled, the fetchers will automatically generate real browser headers that match the browser version used.
-
-
+!!! note "Notes:"
+
+    1. The currently available browsers to impersonate are (`"edge"`, `"chrome"`, `"chrome_android"`, `"safari"`, `"safari_beta"`, `"safari_ios"`, `"safari_ios_beta"`, `"firefox"`, `"tor"`)<br/>
+    2. The available browsers to impersonate, along with their corresponding versions, are automatically displayed in the argument autocompletion and updated with each `curl_cffi` update.<br/>
+    3. If any of the arguments `impersonate` or `stealthy_headers` are enabled, the fetchers will automatically generate real browser headers that match the browser version used.
+
+    Other than this, for further customization, you can pass any arguments that `curl_cffi` supports for any method if that method doesn't already support them.
 
 ### HTTP Methods
 There are additional arguments for each method, depending on the method, such as `params` for GET requests and `data`/`json` for POST/PUT/DELETE requests.
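The referer behavior described in these docs (a request to `https://example.com` appearing to come from `https://www.google.com/search?q=example` when `stealthy_headers`/`google_search` is enabled) can be sketched as below. The query-term derivation here is a hypothetical approximation; Scrapling's exact rule may differ:

```python
# Sketch of the Google-search referer described in the notes:
# https://example.com -> https://www.google.com/search?q=example
from urllib.parse import urlparse


def google_search_referer(url):
    host = urlparse(url).hostname or ""
    # Drop a leading "www." and take the first label as the search term
    # (an assumed heuristic matching the docs' example, not verified code).
    parts = host.removeprefix("www.").split(".")
    query = parts[0] if parts else host
    return f"https://www.google.com/search?q={query}"


assert google_search_referer("https://example.com") == "https://www.google.com/search?q=example"
assert google_search_referer("https://www.example.com/page") == "https://www.google.com/search?q=example"
```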
docs/fetching/stealthy.md CHANGED

@@ -19,7 +19,9 @@ You have one primary way to import this Fetcher, which is the same for all fetch
 ```
 Check out how to configure the parsing options [here](choosing.md#parser-configuration-in-all-fetchers)
 
-
+!!! abstract
+
+    The async version of the `fetch` method is `async_fetch`, of course.
 
 ## What does it do?
 

@@ -67,19 +69,19 @@ Scrapling provides many options with this fetcher and its session classes. Befor
 | allow_webgl | Enabled by default. Disabling it disables WebGL and WebGL 2.0 support entirely. Disabling WebGL is not recommended, as many WAFs now check if WebGL is enabled. | ✔️ |
 | additional_args | Additional arguments to be passed to Playwright's context as additional settings, and they take higher priority than Scrapling's settings. | ✔️ |
 | selector_config | A dictionary of custom parsing arguments to be used when creating the final `Selector`/`Response` class. | ✔️ |
-
-
+| blocked_domains | A set of domain names to block requests to. Subdomains are also matched (e.g., `"example.com"` blocks `"sub.example.com"` too). | ✔️ |
+| proxy_rotator | A `ProxyRotator` instance for automatic proxy rotation. Cannot be combined with `proxy`. | ✔️ |
 | retries | Number of retry attempts for failed requests. Defaults to 3. | ✔️ |
 | retry_delay | Seconds to wait between retry attempts. Defaults to 1. | ✔️ |
 
 In session classes, all these arguments can be set globally for the session. Still, you can configure each request individually by passing some of the arguments here that can be configured on the browser tab level like: `google_search`, `timeout`, `wait`, `page_action`, `extra_headers`, `disable_resources`, `wait_selector`, `wait_selector_state`, `network_idle`, `load_dom`, `solve_cloudflare`, `blocked_domains`, `proxy`, and `selector_config`.
 
-
-
-
-
-
-
+!!! note "Notes:"
+
+    1. It's basically the same arguments as [DynamicFetcher](dynamic.md#introduction) class, but with these additional arguments: `solve_cloudflare`, `block_webrtc`, `hide_canvas`, and `allow_webgl`.
+    2. The `disable_resources` option made requests ~25% faster in my tests for some websites and can help save your proxy usage, but be careful with it, as it can cause some websites to never finish loading.
+    3. The `google_search` argument is enabled by default for all requests, making the request appear to come from a Google search page. So, a request for `https://example.com` will set the referer to `https://www.google.com/search?q=example`. Also, if used together, it takes priority over the referer set by the `extra_headers` argument.
+    4. If you didn't set a user agent and enabled headless mode, the fetcher will generate a real user agent for the same browser version and use it. If you didn't set a user agent and didn't enable headless mode, the fetcher will use the browser's default user agent, which is the same as in standard browsers in the latest versions.
 
 ## Examples
 It's easier to understand with examples, so we will now review most of the arguments individually. Since it's the same class as the [DynamicFetcher](dynamic.md#introduction), you can refer to that page for more examples, as we won't repeat all the examples from there.

@@ -110,11 +112,11 @@ The `solve_cloudflare` parameter enables automatic detection and solving all typ
 
 And even solves the custom pages with embedded captcha.
 
-
-
-
-
-
+!!! notes "**Important notes:**"
+
+    1. Sometimes, with websites that use custom implementations, you will need to use `wait_selector` to make sure Scrapling waits for the real website content to be loaded after solving the captcha. Some websites can be the real definition of an edge case while we are trying to make the solver as generic as possible.
+    2. The timeout should be at least 60 seconds when using the Cloudflare solver for sufficient challenge-solving time.
+    3. This feature works seamlessly with proxies and other stealth options.
 
 ### Browser Automation
 This is where your knowledge about [Playwright's Page API](https://playwright.dev/python/docs/api/class-page) comes into play. The function you pass here takes the page object from Playwright's API, performs the desired action, and then the fetcher continues.

@@ -251,7 +253,7 @@ In versions 0.3 and 0.3.1, the pool was reusing finished tabs to save more resou
 
 ## Using Camoufox as an engine
 
-This fetcher
+This fetcher used a custom version of [Camoufox](https://github.com/daijro/camoufox) as an engine before version 0.3.13, which was replaced by [patchright](https://github.com/Kaliiiiiiiiii-Vinyzu/patchright) for many reasons. If you see that Camoufox is stable on your device, has no high memory issues, and you want to continue using it, then you can.
 
 First, you will need to install the Camoufox library, browser, and Firefox system dependencies if you didn't already:
 ```commandline
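The retry behavior listed in the table (`retries`, default 3, with `retry_delay` seconds between attempts, default 1) can be sketched as a plain retry loop. This is an illustration of the documented knobs only, not Scrapling's internal retry logic:

```python
# Sketch: retry a request-like callable up to `retries` times,
# sleeping `retry_delay` seconds between failed attempts.
import time


def fetch_with_retries(do_request, retries=3, retry_delay=1.0):
    last_error = None
    for attempt in range(retries):
        try:
            return do_request()
        except Exception as exc:  # a real implementation would narrow this
            last_error = exc
            if attempt < retries - 1:
                time.sleep(retry_delay)
    raise last_error


calls = []

def flaky():
    # Fails twice, then succeeds, so the default 3 retries are enough
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("temporary failure")
    return "ok"

assert fetch_with_retries(flaky, retries=3, retry_delay=0) == "ok"
assert len(calls) == 3
```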
docs/overview.md CHANGED

@@ -264,11 +264,11 @@ For Async requests, you will replace the import like below:
 >>> page = await AsyncFetcher.delete('https://scrapling.requestcatcher.com/delete')
 ```
 
-
-
-
-
-
+!!! note "Notes:"
+
+    1. You have the `stealthy_headers` argument, which, when enabled, makes requests to generate real browser headers and use them, including a referer header, as if this request came from a Google search of this domain. It's enabled by default.
+    2. The `impersonate` argument lets you fake the TLS fingerprint for a specific browser version.
+    3. There's also the `http3` argument, which, when enabled, makes the fetcher use HTTP/3 for requests, which makes your requests more authentic.
 
 This is just the tip of the iceberg with this fetcher; check out the rest from [here](fetching/static.md)
 
docs/parsing/adaptive.md CHANGED

@@ -99,7 +99,9 @@ The code will be the same in a real-world scenario, except it will use the same
 
 Hence, in the two examples above, I used both the `Selector` and `Fetcher` classes to show that the adaptive logic is the same.
 
-
+!!! info
+
+    The main reason for creating the `adaptive_domain` argument was to handle if the website changed its URL while changing the design/structure. In that case, you can use it to continue using the previously stored adaptive data for the new URL. Otherwise, scrapling will consider it a new website and discard the old data.
 
 ## How the adaptive scraping feature works
 Adaptive scraping works in two phases:
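The `adaptive_domain` note above implies that stored adaptive data is keyed by domain, with the argument acting as a key override when a site moves. A minimal sketch of that lookup, with a hypothetical in-memory storage (not Scrapling's actual storage backend):

```python
# Sketch: adaptive data keyed by domain; adaptive_domain overrides the key
# so data saved under an old domain can still be found after a URL change.
from urllib.parse import urlparse

storage = {"old-site.com": {"title": "h1.headline"}}  # previously saved selectors


def lookup_adaptive_data(url, adaptive_domain=None):
    # Without an override, the key is the URL's own domain, so a moved
    # site looks like a brand-new website with no saved data.
    key = adaptive_domain or (urlparse(url).hostname or "")
    return storage.get(key)


assert lookup_adaptive_data("https://new-site.com/page") is None
assert lookup_adaptive_data(
    "https://new-site.com/page", adaptive_domain="old-site.com"
) == {"title": "h1.headline"}
```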
docs/parsing/main_classes.md CHANGED

@@ -276,7 +276,7 @@ If your case needs more than the element's parent, you can iterate over the whol
 for ancestor in article.iterancestors():
     # do something with it...
 ```
-You can search for a specific ancestor of an element that satisfies a search function; all you need to do is
+You can search for a specific ancestor of an element that satisfies a search function; all you need to do is pass a function that takes a [Selector](#selector) object as an argument and return `True` if the condition satisfies or `False` otherwise, like below:
 ```python
 >>> article.find_ancestor(lambda ancestor: ancestor.has_class('product-list'))
 <data='<div class="product-list"> <article clas...' parent='<body> <div class="product-list"> <artic...'>

@@ -289,7 +289,7 @@ The class `Selectors` is the "List" version of the [Selector](#selector) class.
 
 In the [Selector](#selector) class, all methods/properties that should return a group of elements return them as a [Selectors](#selectors) class instance.
 
-Starting with v0.4, all selection methods consistently return [Selector](#selector)/[Selectors](#selectors) objects, even for text nodes and attribute values. Text nodes (selected via `::text`, `/text()`, `::attr()`, `/@attr`) are wrapped in [Selector](#selector) objects. These text node selectors have `tag` set to `"#text"`, and their `text` property returns the text value. You can still access the text value directly, and all other properties
+Starting with v0.4, all selection methods consistently return [Selector](#selector)/[Selectors](#selectors) objects, even for text nodes and attribute values. Text nodes (selected via `::text`, `/text()`, `::attr()`, `/@attr`) are wrapped in [Selector](#selector) objects. These text node selectors have `tag` set to `"#text"`, and their `text` property returns the text value. You can still access the text value directly, and all other properties return empty/default values gracefully.
 
 ```python
 >>> page.css('a::text') # -> Selectors (of text node Selectors)

@@ -531,7 +531,7 @@ First, we start with the `re` and `re_first` methods. These are the same methods
 {'some_key': 'some_value'}
 ```
 You might wonder how this happened, given that the `html` tag doesn't contain direct text.<br/>
-Well, for cases like JSON responses, I made the [Selector](#selector) class keep a raw copy of the content it receives. This way, when you use the `.json()` method, it checks for that raw copy and then converts it to JSON. If the raw copy is
+Well, for cases like JSON responses, I made the [Selector](#selector) class keep a raw copy of the content it receives. This way, when you use the `.json()` method, it checks for that raw copy and then converts it to JSON. If the raw copy is unavailable, as with the elements, it checks the current element's text content; otherwise, it uses the `get_all_text` method directly.<br/>
 
 - Another handy method is `.clean()`, which will remove all white spaces and consecutive spaces for you and return a new `TextHandler` instance
 ```python

@@ -559,7 +559,7 @@ You probably guessed it: This class is similar to [Selectors](#selectors) and [S
 The only difference is that the `re_first` method logic here runs `re` on each [TextHandler](#texthandler) and returns the first result, or `None`. Nothing new needs to be explained here, but new methods will be added over time.
 
 ## AttributesHandler
-This is a read-only version of Python's standard dictionary, or `dict`, used solely to store the attributes of each element
+This is a read-only version of Python's standard dictionary, or `dict`, used solely to store the attributes of each element/[Selector](#selector) instance.
 ```python
 >>> print(page.find('script').attrib)
 {'id': 'page-data', 'type': 'application/json'}
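A read-only dictionary like the `AttributesHandler` described in this hunk can be sketched with `collections.abc.Mapping`: reads behave like a normal `dict`, while item assignment is rejected because `__setitem__` is never defined. This is a generic illustration, not Scrapling's actual class:

```python
# Sketch of a read-only dict: Mapping gives us keys(), items(), "in",
# iteration, and equality for free; without __setitem__, writes fail.
from collections.abc import Mapping


class ReadOnlyAttributes(Mapping):
    def __init__(self, data):
        self._data = dict(data)  # private copy so callers can't mutate it

    def __getitem__(self, key):
        return self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


attrs = ReadOnlyAttributes({"id": "page-data", "type": "application/json"})
assert attrs["id"] == "page-data"
assert dict(attrs) == {"id": "page-data", "type": "application/json"}
try:
    attrs["id"] = "other"  # no __setitem__, so this raises TypeError
except TypeError:
    pass
else:
    raise AssertionError("expected attribute assignment to fail")
```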
docs/parsing/selection.md
CHANGED
|
@@ -27,7 +27,7 @@ Also, Scrapling implements some non-standard pseudo-elements like:
|
|
| 27 |
|
| 28 |
In short, if you come from Scrapy/Parsel, you will find the same logic for selectors here to make it easier. No need to implement a stranger logic to the one that most of us are used to :)
|
| 29 |
|
| 30 |
-
To select elements with CSS selectors,
|
| 31 |
|
| 32 |
### What are XPath selectors?
|
| 33 |
[XPath](https://en.wikipedia.org/wiki/XPath) is a language for selecting nodes in XML documents, which can also be used with HTML. This [cheatsheet](https://devhints.io/xpath) is a good resource for learning about [XPath](https://en.wikipedia.org/wiki/XPath). Scrapling adds XPath selectors directly through [lxml](https://lxml.de/).
|
|
@@ -46,7 +46,9 @@ Select all elements with the class `product`.
|
|
| 46 |
products = page.css('.product')
|
| 47 |
products = page.xpath('//*[@class="product"]')
|
| 48 |
```
|
| 49 |
-
|
|
|
|
|
|
|
| 50 |
|
| 51 |
Select the first element with the class `product`.
|
| 52 |
```python
|
|
@@ -109,7 +111,9 @@ By default, Scrapling searches for the exact matching of the text/pattern you pa
|
|
| 109 |
|
| 110 |
* **partial**: If enabled, `find_by_text` will return elements that contain the input text. So it's not an exact match anymore
|
| 111 |
|
| 112 |
-
|
|
|
|
|
|
|
| 113 |
|
| 114 |
### Finding Similar Elements
|
| 115 |
One of the most remarkable new features Scrapling puts on the table is the ability to tell Scrapling to find elements similar to the element at hand. This feature's inspiration came from the AutoScraper library, but in Scrapling, it can be used on elements found by any method. Most of its usage would likely occur after finding elements through text content, similar to how AutoScraper works, making it convenient to explain here.
|
|
@@ -349,10 +353,10 @@ It filters all elements in the current page/element in the following order:
|
|
| 349 |
3. All elements that match all passed regex patterns are collected, or if previous filter(s) are used, then previously collected elements are filtered.
|
| 350 |
4. All elements that fulfill all passed function(s) are collected; if a previous filter(s) is used, then previously collected elements are filtered.
|
| 351 |
|
| 352 |
-
Notes:
|
| 353 |
|
| 354 |
-
1. As you probably understood, the filtering process always starts from the first filter it finds in the filtering order above. So, if no tag name(s) are passed but attributes are passed, the process starts from that step (number 2), and so on.
|
| 355 |
-
2. The order in which you pass the arguments doesn't matter. The only order considered is the one explained above.
|
| 356 |
|
| 357 |
Check examples to clear any confusion :)
|
| 358 |
|
|
@@ -468,8 +472,9 @@ Generate a full XPath selector for the `url_element` element from the start of t
|
|
| 468 |
>>> url_element.generate_full_xpath_selector
|
| 469 |
'//body/div/div[2]/div/div/span[2]/a'
|
| 470 |
```
|
| 471 |
-
|
| 472 |
-
|
|
|
|
| 473 |
|
| 474 |
## Using selectors with regular expressions
|
| 475 |
Similar to `parsel`/`scrapy`, `re` and `re_first` methods are available for extracting data using regular expressions. However, unlike the former libraries, these methods are in nearly all classes like `Selector`/`Selectors`/`TextHandler` and `TextHandlers`, which means you can use them directly on the element even if you didn't select a text node.
|
|
|
|
| 27 |
|
| 28 |
In short, if you come from Scrapy/Parsel, you will find the same logic for selectors here to make it easier. No need to implement a stranger logic to the one that most of us are used to :)
|
| 29 |
|
| 30 |
+
To select elements with CSS selectors, use the `css` method, which returns `Selectors`. Use `[0]` to get the first element, or `.get()` / `.getall()` to extract text values from text/attribute pseudo-selectors.
|
| 31 |
|
| 32 |
### What are XPath selectors?
|
| 33 |
[XPath](https://en.wikipedia.org/wiki/XPath) is a language for selecting nodes in XML documents, which can also be used with HTML. This [cheatsheet](https://devhints.io/xpath) is a good resource for learning about [XPath](https://en.wikipedia.org/wiki/XPath). Scrapling adds XPath selectors directly through [lxml](https://lxml.de/).
|
|
|
|
| 46 |
products = page.css('.product')
|
| 47 |
products = page.xpath('//*[@class="product"]')
|
| 48 |
```
|
| 49 |
+
!!! info "Note:"
|
| 50 |
+
|
| 51 |
+
The XPath one won't be accurate if there's another class; **it's always better to rely on CSS for selecting by class**
|
| 52 |
Select the first element with the class `product`.
```python
* **partial**: If enabled, `find_by_text` will return elements that contain the input text, so it's no longer an exact match.

+!!! abstract "Note:"
+
+    The method `find_by_regex` can accept either a plain string or a compiled regex pattern as its first argument, as you will see in the upcoming examples.
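That string-or-compiled duality can be sketched with the stdlib `re` module (the `search_texts` helper is invented for this demo; it is not Scrapling's implementation):

```python
import re

def search_texts(texts, pattern):
    # Accept either a plain string or an already-compiled regex,
    # mirroring the documented behavior of `find_by_regex`.
    compiled = pattern if isinstance(pattern, re.Pattern) else re.compile(pattern)
    return [t for t in texts if compiled.search(t)]

texts = ["Price: $10.99", "Out of stock", "Price: $3.50"]
print(search_texts(texts, r"\$\d+\.\d{2}"))       # plain string pattern
print(search_texts(texts, re.compile(r"\$\d+")))  # compiled pattern
```

Both calls return the two price strings; the caller never needs to care which form the pattern was passed in.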
### Finding Similar Elements
One of the most remarkable new features Scrapling puts on the table is the ability to tell Scrapling to find elements similar to the element at hand. This feature was inspired by the AutoScraper library, but in Scrapling, it can be used on elements found by any method. Most of its usage will likely occur after finding elements through their text content, similar to how AutoScraper works, which makes it convenient to explain here.
3. All elements that match all passed regex patterns are collected, or, if previous filter(s) were used, the previously collected elements are filtered.
4. All elements that fulfill all passed function(s) are collected, or, if previous filter(s) were used, the previously collected elements are filtered.

+!!! note "Notes:"
+
+    1. As you probably understood, the filtering process always starts from the first filter it finds in the filtering order above. So, if no tag name(s) are passed but attributes are, the process starts from that step (number 2), and so on.
+    2. The order in which you pass the arguments doesn't matter. The only order considered is the one explained above.

Check examples to clear any confusion :)
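The four-step filtering order above can be sketched in plain Python (an illustrative stand-in, not Scrapling's actual implementation; the dict-based elements and the `filter_elements` name are invented for the demo):

```python
import re

def filter_elements(elements, tags=(), attrs=None, patterns=(), funcs=()):
    result = elements
    if tags:  # step 1: keep elements whose tag matches any passed name
        result = [e for e in result if e["tag"] in tags]
    if attrs:  # step 2: keep elements carrying all passed attributes
        result = [e for e in result
                  if all(e.get(k) == v for k, v in attrs.items())]
    if patterns:  # step 3: keep elements whose text matches every pattern
        result = [e for e in result
                  if all(re.search(p, e.get("text", "")) for p in patterns)]
    if funcs:  # step 4: keep elements fulfilling every passed function
        result = [e for e in result if all(f(e) for f in funcs)]
    return result

elements = [
    {"tag": "a", "href": "/next", "text": "Next page"},
    {"tag": "a", "href": "/", "text": "Home"},
    {"tag": "div", "text": "Next page"},
]
print(filter_elements(elements, tags=("a",), patterns=(r"Next",)))
# [{'tag': 'a', 'href': '/next', 'text': 'Next page'}]
```

Note how the call passes `patterns` before it could pass `attrs` without changing anything: each `if` fires only for filters actually supplied, always in the fixed order above.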
>>> url_element.generate_full_xpath_selector
'//body/div/div[2]/div/div/span[2]/a'
```
+!!! abstract "Note:"
+
+    When you tell Scrapling to create a short selector, it tries to find a unique element, such as one with an `id` attribute, to use as a stopping point in the generation. In our case there wasn't any, which is why the short and the full selector are the same.

## Using selectors with regular expressions
Similar to `parsel`/`scrapy`, `re` and `re_first` methods are available for extracting data using regular expressions. However, unlike the former libraries, these methods are in nearly all classes like `Selector`/`Selectors`/`TextHandler` and `TextHandlers`, which means you can use them directly on the element even if you didn't select a text node.
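The all-matches vs. first-match semantics can be sketched with the stdlib `re` module (the `re_all` and `re_first` helper names are invented for this demo and are not Scrapling's API):

```python
import re

text = "Released versions: 1.4, 2.0, and 3.1"

def re_all(pattern, s):
    # Like calling `re` on a text node: return every match.
    return re.findall(pattern, s)

def re_first(pattern, s, default=None):
    # Like `re_first`: return only the first match, or a default.
    m = re.search(pattern, s)
    return m.group(0) if m else default

print(re_all(r"\d+\.\d+", text))    # ['1.4', '2.0', '3.1']
print(re_first(r"\d+\.\d+", text))  # '1.4'
print(re_first(r"[A-Z]{5}", text))  # None (no match, no exception)
```

The default-instead-of-exception behavior of the first-match variant is what makes it safe to chain in extraction code.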
docs/tutorials/migrating_from_beautifulsoup.md CHANGED
@@ -80,12 +80,12 @@ for link in links:
As you can see, Scrapling simplifies the process by combining fetching and parsing into a single step, making your code cleaner and more efficient.

-**Additional Notes:**

-- **Different parsers**: BeautifulSoup allows you to set the parser engine to use, and one of them is `lxml`. Scrapling doesn't do that and uses the `lxml` library by default for performance reasons.
-- **Element Types**: In BeautifulSoup, elements are `Tag` objects; in Scrapling, they are `Selector` objects. However, they provide similar methods and properties for navigation and data extraction.
-- **Error Handling**: Both libraries return `None` when an element is not found (e.g., `soup.find()` or `page.find()`). In Scrapling, `page.css()` returns an empty `Selectors` list when no elements match, and you can use `page.css('.foo').first` to safely get the first match or `None`. To avoid errors, check for `None` or empty results before accessing properties.
-- **Text Extraction**: Scrapling provides additional methods for handling text through `TextHandler`, such as `clean()`, which can help remove extra whitespace, consecutive spaces, or unwanted characters. Please check out the documentation for the complete list.

The documentation provides more details on Scrapling's features and the complete list of arguments that can be passed to all methods.
As you can see, Scrapling simplifies the process by combining fetching and parsing into a single step, making your code cleaner and more efficient.

+!!! abstract "**Additional Notes:**"

+    - **Different parsers**: BeautifulSoup allows you to set the parser engine to use, and one of them is `lxml`. Scrapling doesn't do that and uses the `lxml` library by default for performance reasons.
+    - **Element Types**: In BeautifulSoup, elements are `Tag` objects; in Scrapling, they are `Selector` objects. However, they provide similar methods and properties for navigation and data extraction.
+    - **Error Handling**: Both libraries return `None` when an element is not found (e.g., `soup.find()` or `page.find()`). In Scrapling, `page.css()` returns an empty `Selectors` list when no elements match, and you can use `page.css('.foo').first` to safely get the first match or `None`. To avoid errors, check for `None` or empty results before accessing properties.
+    - **Text Extraction**: Scrapling provides additional methods for handling text through `TextHandler`, such as `clean()`, which can help remove extra whitespace, consecutive spaces, or unwanted characters. Please check out the documentation for the complete list.

The documentation provides more details on Scrapling's features and the complete list of arguments that can be passed to all methods.
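The None-safe access pattern from the "Error Handling" note can be sketched in plain Python (a plain list stands in for Scrapling's `Selectors` here, and the `first_or_none` helper is invented to mirror the documented `.first` behavior):

```python
def first_or_none(matches):
    # Return the first match, or None instead of raising IndexError.
    return matches[0] if matches else None

print(first_or_none([]))              # None -> nothing matched, no crash
print(first_or_none(["<a>x</a>"]))    # '<a>x</a>'

element = first_or_none([])
if element is None:
    print("No element matched; skipping extraction.")
```

Checking for `None` (or an empty result list) before touching properties is what keeps a scraper from crashing on pages where the target element is missing.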