Karim shoair committed on
Commit
790e9ee
·
1 Parent(s): 6f38a9c

docs: improve wording, style, and use more visual ways

docs/ai/mcp-server.md CHANGED
@@ -179,7 +179,7 @@ We will gradually go from simple prompts to more complex ones. We will use Claud
179
  ```
180
  Use regular requests to scrape the main content from https://example.com and convert it to markdown format.
181
  ```
182
- This tells Claude which tool to use here, so it doesn't have to guess. Sometimes it will start using normal requests on its own, and at other times, it will assume browsers are better suited for this website without any apparent reason. As a general rule of thumb, you should always tell Claude which tool to use if you want to save time and money and get consistent results.
183
 
184
  2. **Targeted Data Extraction**
185
 
 
179
  ```
180
  Use regular requests to scrape the main content from https://example.com and convert it to markdown format.
181
  ```
182
+ This tells Claude which tool to use here, so it doesn't have to guess. Sometimes it will start using normal requests on its own, and at other times, it will assume browsers are better suited for this website without any apparent reason. As a rule of thumb, you should always tell Claude which tool to use to save time and money and get consistent results.
183
 
184
  2. **Targeted Data Extraction**
185
 
docs/fetching/choosing.md CHANGED
@@ -52,7 +52,9 @@ Then, continue your code as usual.
52
 
53
  The available configuration arguments are: `adaptive`, `huge_tree`, `keep_comments`, `keep_cdata`, `storage`, and `storage_args`, which are the same ones you give to the [Selector](../parsing/main_classes.md#selector) class. You can display the current configuration anytime by running `<fetcher_class>.display_config()`.
54
 
55
- > Note: The `adaptive` argument is disabled by default; you must enable it to use that feature.
 
 
56
 
57
  ### Set parser config per request
58
  As you probably understand, the logic above for setting the parser config will apply globally to all requests/fetches made through that class, and it's intended for simplicity.
 
52
 
53
  The available configuration arguments are: `adaptive`, `huge_tree`, `keep_comments`, `keep_cdata`, `storage`, and `storage_args`, which are the same ones you give to the [Selector](../parsing/main_classes.md#selector) class. You can display the current configuration anytime by running `<fetcher_class>.display_config()`.
54
 
55
+ !!! info
56
+
57
+ The `adaptive` argument is disabled by default; you must enable it to use that feature.
58
 
59
  ### Set parser config per request
60
  As you probably understand, the logic above for setting the parser config will apply globally to all requests/fetches made through that class, and it's intended for simplicity.
docs/fetching/dynamic.md CHANGED
@@ -20,7 +20,9 @@ Check out how to configure the parsing options [here](choosing.md#parser-configu
20
 
21
  Now, we will review most of the arguments one by one, using examples. If you want to jump to a table of all arguments for quick reference, [click here](#full-list-of-arguments)
22
 
23
- > Note: The async version of the `fetch` method is the `async_fetch` method, of course.
 
 
24
 
25
 
26
  This fetcher currently provides three main run options that can be combined as desired.
@@ -51,10 +53,10 @@ DynamicFetcher.fetch('https://example.com', cdp_url='ws://localhost:9222')
51
  Instead of launching a browser locally (Chromium/Google Chrome), you can connect to a remote browser through the [Chrome DevTools Protocol](https://chromedevtools.github.io/devtools-protocol/).
52
 
53
 
54
- > Notes:
55
- >
56
- > * There was a `stealth` option here, but it was moved to the `StealthyFetcher` class, as explained on the next page, with additional features since version 0.3.13.<br/>
57
- > * This makes it less confusing for new users, easier to maintain, and provides other benefits, as explained on the [StealthyFetcher page](../fetching/stealthy.md).
58
 
59
  ## Full list of arguments
60
  Scrapling provides many options with this fetcher and its session classes. To make it as simple as possible, we will list the options here and give examples of how to use most of them.
@@ -85,19 +87,19 @@ Scrapling provides many options with this fetcher and its session classes. To ma
85
  | extra_flags | A list of additional browser flags to pass to the browser on launch. | ✔️ |
86
  | additional_args | Additional arguments to be passed to Playwright's context as additional settings, and they take higher priority than Scrapling's settings. | ✔️ |
87
  | selector_config | A dictionary of custom parsing arguments to be used when creating the final `Selector`/`Response` class. | ✔️ |
88
- | blocked_domains | A set of domain names to block requests to. Subdomains are also matched (e.g., `"example.com"` blocks `"sub.example.com"` too). | ✔️ |
89
- | proxy_rotator | A `ProxyRotator` instance for automatic proxy rotation. Cannot be combined with `proxy`. | ✔️ |
90
  | retries | Number of retry attempts for failed requests. Defaults to 3. | ✔️ |
91
- | retry_delay | Seconds to wait between retry attempts. Defaults to 1. | ✔️ |
92
 
93
  In session classes, all these arguments can be set globally for the session. Still, you can configure each request individually by passing some of the arguments here that can be configured on the browser tab level like: `google_search`, `timeout`, `wait`, `page_action`, `extra_headers`, `disable_resources`, `wait_selector`, `wait_selector_state`, `network_idle`, `load_dom`, `blocked_domains`, `proxy`, and `selector_config`.
94
 
95
- > 🔍 Notes:
96
- >
97
- > 1. The `disable_resources` option made requests ~25% faster in my tests for some websites and can help save your proxy usage, but be careful with it, as it can cause some websites to never finish loading.
98
- > 2. The `google_search` argument is enabled by default for all requests, making the request appear to come from a Google search page. So, a request for `https://example.com` will set the referer to `https://www.google.com/search?q=example`. Also, if used together, it takes priority over the referer set by the `extra_headers` argument.
99
- > 3. Since version 0.3.13, the `stealth` option has been removed here in favor of the `StealthyFetcher` class, and the `hide_canvas` option has been moved to it. The `disable_webgl` argument has been moved to the `StealthyFetcher` class and renamed as `allow_webgl`.
100
- > 4. If you didn't set a user agent and enabled headless mode, the fetcher will generate a real user agent for the same browser version and use it. If you didn't set a user agent and didn't enable headless mode, the fetcher will use the browser's default user agent, which is the same as in standard browsers in the latest versions.
101
 
102
 
103
  ## Examples
 
20
 
21
  Now, we will review most of the arguments one by one, using examples. If you want to jump to a table of all arguments for quick reference, [click here](#full-list-of-arguments)
22
 
23
+ !!! abstract
24
+
25
+ The async version of the `fetch` method is `async_fetch`, of course.
26
 
27
 
28
  This fetcher currently provides three main run options that can be combined as desired.
 
53
  Instead of launching a browser locally (Chromium/Google Chrome), you can connect to a remote browser through the [Chrome DevTools Protocol](https://chromedevtools.github.io/devtools-protocol/).
54
 
55
 
56
+ !!! note "Notes:"
57
+
58
+ * The `stealth` option that used to be here was moved to the `StealthyFetcher` class (with additional features) in version 0.3.13, as explained on the next page.
59
+ * This makes it less confusing for new users, easier to maintain, and provides other benefits, as explained on the [StealthyFetcher page](../fetching/stealthy.md).
60
 
61
  ## Full list of arguments
62
  Scrapling provides many options with this fetcher and its session classes. To make it as simple as possible, we will list the options here and give examples of how to use most of them.
 
87
  | extra_flags | A list of additional browser flags to pass to the browser on launch. | ✔️ |
88
  | additional_args | Additional arguments to be passed to Playwright's context as additional settings, and they take higher priority than Scrapling's settings. | ✔️ |
89
  | selector_config | A dictionary of custom parsing arguments to be used when creating the final `Selector`/`Response` class. | ✔️ |
90
+ | blocked_domains | A set of domain names to block requests to. Subdomains are also matched (e.g., `"example.com"` blocks `"sub.example.com"` too). | ✔️ |
91
+ | proxy_rotator | A `ProxyRotator` instance for automatic proxy rotation. Cannot be combined with `proxy`. | ✔️ |
92
  | retries | Number of retry attempts for failed requests. Defaults to 3. | ✔️ |
93
+ | retry_delay | Seconds to wait between retry attempts. Defaults to 1. | ✔️ |
94
 
95
  In session classes, all these arguments can be set globally for the session. Still, you can configure each request individually by passing some of the arguments here that can be configured on the browser tab level like: `google_search`, `timeout`, `wait`, `page_action`, `extra_headers`, `disable_resources`, `wait_selector`, `wait_selector_state`, `network_idle`, `load_dom`, `blocked_domains`, `proxy`, and `selector_config`.
96
 
97
+ !!! note "Notes:"
98
+
99
+ 1. The `disable_resources` option made requests ~25% faster in my tests for some websites and can help save your proxy usage, but be careful with it, as it can cause some websites to never finish loading.
100
+ 2. The `google_search` argument is enabled by default for all requests, making the request appear to come from a Google search page. So, a request for `https://example.com` will set the referer to `https://www.google.com/search?q=example`. Also, if used together, it takes priority over the referer set by the `extra_headers` argument.
101
+ 3. Since version 0.3.13, the `stealth` option has been removed here in favor of the `StealthyFetcher` class, and the `hide_canvas` option has been moved to it. The `disable_webgl` argument has been moved to the `StealthyFetcher` class and renamed as `allow_webgl`.
102
+ 4. If you didn't set a user agent and enabled headless mode, the fetcher will generate a real user agent for the same browser version and use it. If you didn't set a user agent and didn't enable headless mode, the fetcher will use the browser's default user agent, which is the same as in standard browsers in the latest versions.
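The subdomain rule described for `blocked_domains` in the table above (`"example.com"` also blocks `"sub.example.com"`) can be restated in plain Python. This is only the matching rule as documented, not Scrapling's internal code:

```python
# Illustrative sketch of the documented `blocked_domains` matching rule:
# a blocked domain also blocks all of its subdomains.
def is_blocked(host: str, blocked_domains: set[str]) -> bool:
    return any(host == d or host.endswith("." + d) for d in blocked_domains)

blocked = {"example.com"}
print(is_blocked("example.com", blocked))      # True
print(is_blocked("sub.example.com", blocked))  # True
print(is_blocked("notexample.com", blocked))   # False
```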
103
 
104
 
105
  ## Examples
docs/fetching/static.md CHANGED
@@ -38,12 +38,13 @@ All methods for making requests here share some arguments, so let's discuss them
38
  - **cert**: Tuple of (cert, key) filenames for the client certificate.
39
  - **selector_config**: A dictionary of custom parsing arguments to be used when creating the final `Selector`/`Response` class.
40
 
41
- > Note: <br/>
42
- > 1. The currently available browsers to impersonate are (`"edge"`, `"chrome"`, `"chrome_android"`, `"safari"`, `"safari_beta"`, `"safari_ios"`, `"safari_ios_beta"`, `"firefox"`, `"tor"`)<br/>
43
- > 2. The available browsers to impersonate, along with their corresponding versions, are automatically displayed in the argument autocompletion and updated with each `curl_cffi` update.<br/>
44
- > 3. If any of the arguments `impersonate` or `stealthy_headers` are enabled, the fetchers will automatically generate real browser headers that match the browser version used.
45
 
46
- Other than this, for further customization, you can pass any arguments that `curl_cffi` supports for any method if that method doesn't already support it.
 
 
 
 
47
 
48
  ### HTTP Methods
49
  There are additional arguments for each method, depending on the method, such as `params` for GET requests and `data`/`json` for POST/PUT/DELETE requests.
 
38
  - **cert**: Tuple of (cert, key) filenames for the client certificate.
39
  - **selector_config**: A dictionary of custom parsing arguments to be used when creating the final `Selector`/`Response` class.
40
 
41
+ !!! note "Notes:"
 
 
 
42
 
43
+ 1. The currently available browsers to impersonate are: `"edge"`, `"chrome"`, `"chrome_android"`, `"safari"`, `"safari_beta"`, `"safari_ios"`, `"safari_ios_beta"`, `"firefox"`, and `"tor"`.
44
+ 2. The available browsers to impersonate, along with their corresponding versions, are automatically displayed in the argument autocompletion and updated with each `curl_cffi` update.
45
+ 3. If either of the `impersonate` or `stealthy_headers` arguments is enabled, the fetchers will automatically generate real browser headers that match the browser version used.
46
+
47
+ Other than this, for further customization, you can pass any arguments that `curl_cffi` supports for any method if that method doesn't already support them.
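The supported impersonation targets from note 1 can be checked client-side before making a request. This is a sketch only; `check_impersonate` is a hypothetical helper, not part of Scrapling or `curl_cffi` (both do their own validation):

```python
# The impersonation targets listed in the note above, restated as a quick
# client-side check. Hypothetical helper -- not a Scrapling/curl_cffi API.
SUPPORTED_IMPERSONATE = (
    "edge", "chrome", "chrome_android", "safari", "safari_beta",
    "safari_ios", "safari_ios_beta", "firefox", "tor",
)

def check_impersonate(name: str) -> str:
    if name not in SUPPORTED_IMPERSONATE:
        raise ValueError(f"Unsupported browser to impersonate: {name!r}")
    return name

print(check_impersonate("chrome"))  # chrome
```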
48
 
49
  ### HTTP Methods
50
  There are additional arguments for each method, depending on the method, such as `params` for GET requests and `data`/`json` for POST/PUT/DELETE requests.
docs/fetching/stealthy.md CHANGED
@@ -19,7 +19,9 @@ You have one primary way to import this Fetcher, which is the same for all fetch
19
  ```
20
  Check out how to configure the parsing options [here](choosing.md#parser-configuration-in-all-fetchers)
21
 
22
- > Note: The async version of the `fetch` method is the `async_fetch` method, of course.
 
 
23
 
24
  ## What does it do?
25
 
@@ -67,19 +69,19 @@ Scrapling provides many options with this fetcher and its session classes. Befor
67
  | allow_webgl | Enabled by default. Disabling it disables WebGL and WebGL 2.0 support entirely. Disabling WebGL is not recommended, as many WAFs now check if WebGL is enabled. | ✔️ |
68
  | additional_args | Additional arguments to be passed to Playwright's context as additional settings, and they take higher priority than Scrapling's settings. | ✔️ |
69
  | selector_config | A dictionary of custom parsing arguments to be used when creating the final `Selector`/`Response` class. | ✔️ |
70
- | blocked_domains | A set of domain names to block requests to. Subdomains are also matched (e.g., `"example.com"` blocks `"sub.example.com"` too). | ✔️ |
71
- | proxy_rotator | A `ProxyRotator` instance for automatic proxy rotation. Cannot be combined with `proxy`. | ✔️ |
72
  | retries | Number of retry attempts for failed requests. Defaults to 3. | ✔️ |
73
  | retry_delay | Seconds to wait between retry attempts. Defaults to 1. | ✔️ |
74
 
75
  In session classes, all these arguments can be set globally for the session. Still, you can configure each request individually by passing some of the arguments here that can be configured on the browser tab level like: `google_search`, `timeout`, `wait`, `page_action`, `extra_headers`, `disable_resources`, `wait_selector`, `wait_selector_state`, `network_idle`, `load_dom`, `solve_cloudflare`, `blocked_domains`, `proxy`, and `selector_config`.
76
 
77
- > 🔍 Notes:
78
- >
79
- > 1. It's basically the same arguments as [DynamicFetcher](dynamic.md#introduction) class but with these additional arguments `solve_cloudflare`, `block_webrtc`, `hide_canvas`, and `allow_webgl`.
80
- > 2. The `disable_resources` option made requests ~25% faster in my tests for some websites and can help save your proxy usage, but be careful with it, as it can cause some websites to never finish loading.
81
- > 3. The `google_search` argument is enabled by default for all requests, making the request appear to come from a Google search page. So, a request for `https://example.com` will set the referer to `https://www.google.com/search?q=example`. Also, if used together, it takes priority over the referer set by the `extra_headers` argument.
82
- > 4. If you didn't set a user agent and enabled headless mode, the fetcher will generate a real user agent for the same browser version and use it. If you didn't set a user agent and didn't enable headless mode, the fetcher will use the browser's default user agent, which is the same as in standard browsers in the latest versions.
83
 
84
  ## Examples
85
  It's easier to understand with examples, so we will now review most of the arguments individually. Since it's the same class as the [DynamicFetcher](dynamic.md#introduction), you can refer to that page for more examples, as we won't repeat all the examples from there.
@@ -110,11 +112,11 @@ The `solve_cloudflare` parameter enables automatic detection and solving all typ
110
 
111
  And even solves the custom pages with embedded captcha.
112
 
113
- > 🔍 **Important notes:**
114
- >
115
- > 1. Sometimes, with websites that use custom implementations, you will need to use `wait_selector` to make sure Scrapling waits for the real website content to be loaded after solving the captcha. Some websites can be the real definition of an edge case while we are trying to make the solver as generic as possible.
116
- > 2. The timeout should be at least 60 seconds when using the Cloudflare solver for sufficient challenge-solving time.
117
- > 3. This feature works seamlessly with proxies and other stealth options.
118
 
119
  ### Browser Automation
120
  This is where your knowledge about [Playwright's Page API](https://playwright.dev/python/docs/api/class-page) comes into play. The function you pass here takes the page object from Playwright's API, performs the desired action, and then the fetcher continues.
@@ -251,7 +253,7 @@ In versions 0.3 and 0.3.1, the pool was reusing finished tabs to save more resou
251
 
252
  ## Using Camoufox as an engine
253
 
254
- This fetcher was using a custom version of [Camoufox](https://github.com/daijro/camoufox) as an engine before version 0.3.13, which was replaced now with [patchright](https://github.com/Kaliiiiiiiiii-Vinyzu/patchright) for many reasons. If you see that Camoufox is stable on your device, has no high memory issues, and want to continue using it. This section is for you.
255
 
256
  First, you will need to install the Camoufox library, browser, and Firefox system dependencies if you didn't already:
257
  ```commandline
 
19
  ```
20
  Check out how to configure the parsing options [here](choosing.md#parser-configuration-in-all-fetchers)
21
 
22
+ !!! abstract
23
+
24
+ The async version of the `fetch` method is `async_fetch`, of course.
25
 
26
  ## What does it do?
27
 
 
69
  | allow_webgl | Enabled by default. Disabling it disables WebGL and WebGL 2.0 support entirely. Disabling WebGL is not recommended, as many WAFs now check if WebGL is enabled. | ✔️ |
70
  | additional_args | Additional arguments to be passed to Playwright's context as additional settings, and they take higher priority than Scrapling's settings. | ✔️ |
71
  | selector_config | A dictionary of custom parsing arguments to be used when creating the final `Selector`/`Response` class. | ✔️ |
72
+ | blocked_domains | A set of domain names to block requests to. Subdomains are also matched (e.g., `"example.com"` blocks `"sub.example.com"` too). | ✔️ |
73
+ | proxy_rotator | A `ProxyRotator` instance for automatic proxy rotation. Cannot be combined with `proxy`. | ✔️ |
74
  | retries | Number of retry attempts for failed requests. Defaults to 3. | ✔️ |
75
  | retry_delay | Seconds to wait between retry attempts. Defaults to 1. | ✔️ |
76
 
77
  In session classes, all these arguments can be set globally for the session. Still, you can configure each request individually by passing some of the arguments here that can be configured on the browser tab level like: `google_search`, `timeout`, `wait`, `page_action`, `extra_headers`, `disable_resources`, `wait_selector`, `wait_selector_state`, `network_idle`, `load_dom`, `solve_cloudflare`, `blocked_domains`, `proxy`, and `selector_config`.
78
 
79
+ !!! note "Notes:"
80
+
81
+ 1. These are basically the same arguments as the [DynamicFetcher](dynamic.md#introduction) class, but with the additional arguments `solve_cloudflare`, `block_webrtc`, `hide_canvas`, and `allow_webgl`.
82
+ 2. The `disable_resources` option made requests ~25% faster in my tests for some websites and can help save your proxy usage, but be careful with it, as it can cause some websites to never finish loading.
83
+ 3. The `google_search` argument is enabled by default for all requests, making the request appear to come from a Google search page. So, a request for `https://example.com` will set the referer to `https://www.google.com/search?q=example`. Also, if used together, it takes priority over the referer set by the `extra_headers` argument.
84
+ 4. If you didn't set a user agent and enabled headless mode, the fetcher will generate a real user agent for the same browser version and use it. If you didn't set a user agent and didn't enable headless mode, the fetcher will use the browser's default user agent, which is the same as in standard browsers in the latest versions.
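The `retries`/`retry_delay` behavior from the table above (3 attempts and a 1-second pause by default) can be sketched as a simple loop. This is illustrative only, not the fetcher's actual implementation:

```python
import time

# A sketch of the documented retry behavior: `retries` defaults to 3 and
# `retry_delay` defaults to 1 second. Illustrative only.
def fetch_with_retries(do_request, retries: int = 3, retry_delay: float = 1.0):
    last_error = None
    for attempt in range(retries):
        try:
            return do_request()
        except Exception as error:  # in practice, catch narrower errors
            last_error = error
            if attempt < retries - 1:
                time.sleep(retry_delay)
    raise last_error
```

For example, a request that fails twice and succeeds on the third attempt would return normally after two delays.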
85
 
86
  ## Examples
87
  It's easier to understand with examples, so we will now review most of the arguments individually. Since it's the same class as the [DynamicFetcher](dynamic.md#introduction), you can refer to that page for more examples, as we won't repeat all the examples from there.
 
112
 
113
  And even solves the custom pages with embedded captcha.
114
 
115
+ !!! note "Important notes:"
116
+
117
+ 1. Sometimes, with websites that use custom implementations, you will need to use `wait_selector` to make sure Scrapling waits for the real website content to load after solving the captcha. Some websites are the very definition of an edge case, even as we try to make the solver as generic as possible.
118
+ 2. The timeout should be at least 60 seconds when using the Cloudflare solver for sufficient challenge-solving time.
119
+ 3. This feature works seamlessly with proxies and other stealth options.
120
 
121
  ### Browser Automation
122
  This is where your knowledge about [Playwright's Page API](https://playwright.dev/python/docs/api/class-page) comes into play. The function you pass here takes the page object from Playwright's API, performs the desired action, and then the fetcher continues.
 
253
 
254
  ## Using Camoufox as an engine
255
 
256
+ This fetcher used a custom version of [Camoufox](https://github.com/daijro/camoufox) as an engine before version 0.3.13, when it was replaced with [patchright](https://github.com/Kaliiiiiiiiii-Vinyzu/patchright) for many reasons. If Camoufox is stable on your device, doesn't cause high memory usage, and you want to continue using it, this section is for you.
257
 
258
  First, you will need to install the Camoufox library, browser, and Firefox system dependencies if you didn't already:
259
  ```commandline
docs/overview.md CHANGED
@@ -264,11 +264,11 @@ For Async requests, you will replace the import like below:
264
  >>> page = await AsyncFetcher.delete('https://scrapling.requestcatcher.com/delete')
265
  ```
266
 
267
- > Notes:
268
- >
269
- > 1. You have the `stealthy_headers` argument, which, when enabled, makes requests to generate real browser headers and use them, including a referer header, as if this request came from a Google search of this domain. It's enabled by default.
270
- > 2. The `impersonate` argument lets you fake the TLS fingerprint for a specific browser version.
271
- > 3. There's also the `http3` argument, which, when enabled, makes the fetcher use HTTP/3 for requests, which makes your requests more authentic
272
 
273
  This is just the tip of the iceberg with this fetcher; check out the rest from [here](fetching/static.md)
274
 
 
264
  >>> page = await AsyncFetcher.delete('https://scrapling.requestcatcher.com/delete')
265
  ```
266
 
267
+ !!! note "Notes:"
268
+
269
+ 1. The `stealthy_headers` argument, which is enabled by default, makes the fetcher generate real browser headers and use them, including a referer header, as if the request came from a Google search for this domain.
270
+ 2. The `impersonate` argument lets you fake the TLS fingerprint for a specific browser version.
271
+ 3. There's also the `http3` argument, which, when enabled, makes the fetcher use HTTP/3, making your requests appear more authentic.
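The Google-search referer mentioned in note 1 can be sketched with the standard library: for `https://example.com`, the referer looks like a Google search for "example" (the format shown on the DynamicFetcher page). The `google_search_referer` helper below is hypothetical and extracts the domain label naively, not how Scrapling actually builds it:

```python
from urllib.parse import urlparse

# Hypothetical sketch of the referer described in note 1: a request to
# `https://example.com` looks like it came from a Google search for "example".
def google_search_referer(url: str) -> str:
    host = urlparse(url).hostname or ""
    # Naively take the label right before the last dot-separated part.
    label = host.split(".")[-2] if "." in host else host
    return f"https://www.google.com/search?q={label}"

print(google_search_referer("https://example.com"))
# https://www.google.com/search?q=example
```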
272
 
273
  This is just the tip of the iceberg with this fetcher; check out the rest from [here](fetching/static.md)
274
 
docs/parsing/adaptive.md CHANGED
@@ -99,7 +99,9 @@ The code will be the same in a real-world scenario, except it will use the same
99
 
100
  Hence, in the two examples above, I used both the `Selector` and `Fetcher` classes to show that the adaptive logic is the same.
101
 
102
- > Note: the main reason for creating the `adaptive_domain` argument was to handle if the website changed its URL while changing the design/structure. In that case, you can use it to continue using the previously stored adaptive data for the new URL. Otherwise, scrapling will consider it a new website and discard the old data.
 
 
103
 
104
  ## How the adaptive scraping feature works
105
  Adaptive scraping works in two phases:
 
99
 
100
  Hence, in the two examples above, I used both the `Selector` and `Fetcher` classes to show that the adaptive logic is the same.
101
 
102
+ !!! info
103
+
104
+ The main reason for creating the `adaptive_domain` argument was to handle cases where a website changes its URL along with its design/structure. In that case, you can use it to continue using the previously stored adaptive data for the new URL. Otherwise, Scrapling will consider it a new website and discard the old data.
105
 
106
  ## How the adaptive scraping feature works
107
  Adaptive scraping works in two phases:
docs/parsing/main_classes.md CHANGED
@@ -276,7 +276,7 @@ If your case needs more than the element's parent, you can iterate over the whol
276
  for ancestor in article.iterancestors():
277
  # do something with it...
278
  ```
279
- You can search for a specific ancestor of an element that satisfies a search function; all you need to do is to pass a function that takes a [Selector](#selector) object as an argument and return `True` if the condition satisfies or `False` otherwise, like below:
280
  ```python
281
  >>> article.find_ancestor(lambda ancestor: ancestor.has_class('product-list'))
282
  <data='<div class="product-list"> <article clas...' parent='<body> <div class="product-list"> <artic...'>
@@ -289,7 +289,7 @@ The class `Selectors` is the "List" version of the [Selector](#selector) class.
289
 
290
  In the [Selector](#selector) class, all methods/properties that should return a group of elements return them as a [Selectors](#selectors) class instance.
291
 
292
- Starting with v0.4, all selection methods consistently return [Selector](#selector)/[Selectors](#selectors) objects, even for text nodes and attribute values. Text nodes (selected via `::text`, `/text()`, `::attr()`, `/@attr`) are wrapped in [Selector](#selector) objects. These text node selectors have `tag` set to `"#text"`, and their `text` property returns the text value. You can still access the text value directly, and all other properties gracefully return empty/default values.
293
 
294
  ```python
295
  >>> page.css('a::text') # -> Selectors (of text node Selectors)
@@ -531,7 +531,7 @@ First, we start with the `re` and `re_first` methods. These are the same methods
531
  {'some_key': 'some_value'}
532
  ```
533
  You might wonder how this happened, given that the `html` tag doesn't contain direct text.<br/>
534
- Well, for cases like JSON responses, I made the [Selector](#selector) class keep a raw copy of the content it receives. This way, when you use the `.json()` method, it checks for that raw copy and then converts it to JSON. If the raw copy is not available like the case with the elements, it checks for the current element text content, or otherwise it uses the `get_all_text` method directly.<br/>
535
 
536
  - Another handy method is `.clean()`, which will remove all white spaces and consecutive spaces for you and return a new `TextHandler` instance
537
  ```python
@@ -559,7 +559,7 @@ You probably guessed it: This class is similar to [Selectors](#selectors) and [S
559
  The only difference is that the `re_first` method logic here runs `re` on each [TextHandler](#texthandler) and returns the first result, or `None`. Nothing new needs to be explained here, but new methods will be added over time.
560
 
561
  ## AttributesHandler
562
- This is a read-only version of Python's standard dictionary, or `dict`, used solely to store the attributes of each element or [Selector](#selector) instance.
563
  ```python
564
  >>> print(page.find('script').attrib)
565
  {'id': 'page-data', 'type': 'application/json'}
 
276
  for ancestor in article.iterancestors():
277
  # do something with it...
278
  ```
279
+ You can search for a specific ancestor of an element that satisfies a search function; all you need to do is pass a function that takes a [Selector](#selector) object as an argument and returns `True` if the condition is satisfied or `False` otherwise, like below:
280
  ```python
281
  >>> article.find_ancestor(lambda ancestor: ancestor.has_class('product-list'))
282
  <data='<div class="product-list"> <article clas...' parent='<body> <div class="product-list"> <artic...'>
 
289
 
290
  In the [Selector](#selector) class, all methods/properties that should return a group of elements return them as a [Selectors](#selectors) class instance.
291
 
292
+ Starting with v0.4, all selection methods consistently return [Selector](#selector)/[Selectors](#selectors) objects, even for text nodes and attribute values. Text nodes (selected via `::text`, `/text()`, `::attr()`, `/@attr`) are wrapped in [Selector](#selector) objects. These text node selectors have `tag` set to `"#text"`, and their `text` property returns the text value. You can still access the text value directly, and all other properties return empty/default values gracefully.
293
 
294
  ```python
295
  >>> page.css('a::text') # -> Selectors (of text node Selectors)
 
531
  {'some_key': 'some_value'}
532
  ```
533
  You might wonder how this happened, given that the `html` tag doesn't contain direct text.<br/>
534
+ Well, for cases like JSON responses, I made the [Selector](#selector) class keep a raw copy of the content it receives. This way, when you use the `.json()` method, it checks for that raw copy and then converts it to JSON. If the raw copy is unavailable, as with the elements, it checks the current element's text content; otherwise, it uses the `get_all_text` method directly.<br/>
535
 
536
  - Another handy method is `.clean()`, which will remove all white spaces and consecutive spaces for you and return a new `TextHandler` instance
537
  ```python
 
559
  The only difference is that the `re_first` method logic here runs `re` on each [TextHandler](#texthandler) and returns the first result, or `None`. Nothing new needs to be explained here, but new methods will be added over time.
560
 
561
  ## AttributesHandler
562
+ This is a read-only version of Python's standard dictionary, or `dict`, used solely to store the attributes of each element/[Selector](#selector) instance.
563
  ```python
564
  >>> print(page.find('script').attrib)
565
  {'id': 'page-data', 'type': 'application/json'}
docs/parsing/selection.md CHANGED
@@ -27,7 +27,7 @@ Also, Scrapling implements some non-standard pseudo-elements like:
27
 
28
  In short, if you come from Scrapy/Parsel, you will find the same logic for selectors here to make it easier. No need to implement a stranger logic to the one that most of us are used to :)
29
 
30
- To select elements with CSS selectors, you have the `css` method which returns `Selectors`. Use `[0]` to get the first element, or `.get()` / `.getall()` to extract text values from text/attribute pseudo-selectors.
31
 
32
  ### What are XPath selectors?
33
  [XPath](https://en.wikipedia.org/wiki/XPath) is a language for selecting nodes in XML documents, which can also be used with HTML. This [cheatsheet](https://devhints.io/xpath) is a good resource for learning about [XPath](https://en.wikipedia.org/wiki/XPath). Scrapling adds XPath selectors directly through [lxml](https://lxml.de/).
@@ -46,7 +46,9 @@ Select all elements with the class `product`.
46
  products = page.css('.product')
47
  products = page.xpath('//*[@class="product"]')
48
  ```
49
- Note: The XPath one won't be accurate if there's another class; **it's always better to rely on CSS for selecting by class**
 
 
50
 
51
  Select the first element with the class `product`.
52
  ```python
@@ -109,7 +111,9 @@ By default, Scrapling searches for the exact matching of the text/pattern you pa
109
 
110
  * **partial**: If enabled, `find_by_text` will return elements that contain the input text. So it's not an exact match anymore
111
 
112
- Note: The method `find_by_regex` can accept both regular strings and a compiled regex pattern as its first argument, as you will see in the upcoming examples.
 
 
113
 
114
  ### Finding Similar Elements
115
  One of the most remarkable new features Scrapling puts on the table is the ability to tell Scrapling to find elements similar to the element at hand. This feature's inspiration came from the AutoScraper library, but in Scrapling, it can be used on elements found by any method. Most of its usage would likely occur after finding elements through text content, similar to how AutoScraper works, making it convenient to explain here.
@@ -349,10 +353,10 @@ It filters all elements in the current page/element in the following order:
349
  3. All elements that match all passed regex patterns are collected, or if previous filter(s) are used, then previously collected elements are filtered.
350
  4. All elements that fulfill all passed function(s) are collected; if a previous filter(s) is used, then previously collected elements are filtered.
351
 
352
- Notes:
353
 
354
- 1. As you probably understood, the filtering process always starts from the first filter it finds in the filtering order above. So, if no tag name(s) are passed but attributes are passed, the process starts from that step (number 2), and so on.
355
- 2. The order in which you pass the arguments doesn't matter. The only order considered is the one explained above.
356
 
357
  Check examples to clear any confusion :)
358
 
@@ -468,8 +472,9 @@ Generate a full XPath selector for the `url_element` element from the start of t
468
  >>> url_element.generate_full_xpath_selector
469
  '//body/div/div[2]/div/div/span[2]/a'
470
  ```
471
- > Note: <br>
472
- > When you tell Scrapling to create a short selector, it tries to find a unique element to use in generation as a stop point, like an element with an `id` attribute, but in our case, there wasn't any, so that's why the short and the full selector will be the same.
 
473
 
474
  ## Using selectors with regular expressions
475
  Similar to `parsel`/`scrapy`, `re` and `re_first` methods are available for extracting data using regular expressions. However, unlike the former libraries, these methods are in nearly all classes like `Selector`/`Selectors`/`TextHandler` and `TextHandlers`, which means you can use them directly on the element even if you didn't select a text node.
 
27
 
28
  In short, if you come from Scrapy/Parsel, you will find the same logic for selectors here to make it easier. No need to implement a stranger logic to the one that most of us are used to :)
29
 
30
+ To select elements with CSS selectors, use the `css` method, which returns `Selectors`. Use `[0]` to get the first element, or `.get()` / `.getall()` to extract text values from text/attribute pseudo-selectors.
31
 
32
  ### What are XPath selectors?
33
  [XPath](https://en.wikipedia.org/wiki/XPath) is a language for selecting nodes in XML documents, which can also be used with HTML. This [cheatsheet](https://devhints.io/xpath) is a good resource for learning about [XPath](https://en.wikipedia.org/wiki/XPath). Scrapling adds XPath selectors directly through [lxml](https://lxml.de/).
 
46
  products = page.css('.product')
47
  products = page.xpath('//*[@class="product"]')
48
  ```
49
+ !!! info "Note:"
50
+
51
+ The XPath selector won't match elements that carry additional classes; **it's always better to rely on CSS for selecting by class**
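The pitfall can be reproduced with nothing but the standard library's limited XPath support (an illustration of the exact-match behavior, not Scrapling-specific code):

```python
import xml.etree.ElementTree as ET

html = ('<div>'
        '<p class="product">A</p>'
        '<p class="product featured">B</p>'
        '</div>')
root = ET.fromstring(html)

# [@class="product"] compares the whole attribute string,
# so the element with class="product featured" is missed
matches = root.findall('.//*[@class="product"]')
print(len(matches))  # 1, even though two elements carry the class
```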
52
 
53
  Select the first element with the class `product`.
54
  ```python
 
111
 
112
  * **partial**: If enabled, `find_by_text` will return elements that contain the input text. So it's not an exact match anymore
113
 
114
+ !!! abstract "Note:"
115
+
116
+ The method `find_by_regex` can accept both regular strings and a compiled regex pattern as its first argument, as you will see in the upcoming examples.
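The same flexibility is easy to mirror with the `re` module (a hypothetical helper showing the idea, not Scrapling's implementation):

```python
import re

def find_by_regex(text, pattern):
    # Accept either a plain string or a pre-compiled pattern,
    # mirroring how the first argument is treated
    if isinstance(pattern, str):
        pattern = re.compile(pattern)
    return pattern.findall(text)

text = 'Price: $10, Sale: $7'
print(find_by_regex(text, r'\$\d+'))              # ['$10', '$7']
print(find_by_regex(text, re.compile(r'\$\d+')))  # same result
```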
117
 
118
  ### Finding Similar Elements
119
  One of the most remarkable new features Scrapling puts on the table is the ability to tell Scrapling to find elements similar to the element at hand. This feature's inspiration came from the AutoScraper library, but in Scrapling, it can be used on elements found by any method. Most of its usage would likely occur after finding elements through text content, similar to how AutoScraper works, making it convenient to explain here.
 
353
  3. All elements that match all passed regex patterns are collected, or if previous filter(s) are used, then previously collected elements are filtered.
354
  4. All elements that fulfill all passed function(s) are collected; if a previous filter(s) is used, then previously collected elements are filtered.
355
 
356
+ !!! note "Notes:"
357
 
358
+ 1. As you have probably noticed, the filtering process always starts from the first filter it finds in the order above. So, if no tag name(s) are passed but attributes are, the process starts from that step (number 2), and so on.
359
+ 2. The order in which you pass the arguments doesn't matter. The only order considered is the one explained above.
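The filtering order can be pictured as a simple pipeline, where each stage narrows the candidates produced by the previous one (a toy model with made-up element dicts, not Scrapling's code):

```python
elements = [
    {'tag': 'a', 'attrs': {'class': 'link'}, 'text': 'Home'},
    {'tag': 'a', 'attrs': {'class': 'btn'},  'text': 'Buy now'},
    {'tag': 'p', 'attrs': {},                'text': 'Buy now'},
]

def find_all(elements, tags=None, attrs=None, contains=None):
    results = elements
    if tags:      # 1. tag-name filter (skipped entirely if not passed)
        results = [e for e in results if e['tag'] in tags]
    if attrs:     # 2. attribute filter narrows the previous result
        results = [e for e in results
                   if all(e['attrs'].get(k) == v for k, v in attrs.items())]
    if contains:  # 3. text filter narrows again
        results = [e for e in results if contains in e['text']]
    return results

print(find_all(elements, tags={'a'}, contains='Buy'))
```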
360
 
361
  Check examples to clear any confusion :)
362
 
 
472
  >>> url_element.generate_full_xpath_selector
473
  '//body/div/div[2]/div/div/span[2]/a'
474
  ```
475
+ !!! abstract "Note:"
476
+
477
+ When you tell Scrapling to create a short selector, it tries to find a unique element, such as one with an `id` attribute, to use as a stopping point during generation. In our case, there wasn't any, which is why the short and full selectors are identical.
478
 
479
  ## Using selectors with regular expressions
480
  Similar to `parsel`/`scrapy`, `re` and `re_first` methods are available for extracting data using regular expressions. However, unlike the former libraries, these methods are in nearly all classes like `Selector`/`Selectors`/`TextHandler` and `TextHandlers`, which means you can use them directly on the element even if you didn't select a text node.
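A minimal stand-in shows why having `re`/`re_first` available on string-like results is convenient (an illustration only; the real `TextHandler` offers more):

```python
import re

class Text(str):
    # Minimal TextHandler-style regex helpers for illustration only
    def re(self, pattern):
        return re.findall(pattern, self)

    def re_first(self, pattern, default=None):
        match = re.search(pattern, self)
        return match.group(0) if match else default

t = Text('Item #42 costs $9.99')
print(t.re_first(r'\$\d+\.\d+'))   # $9.99
print(t.re_first(r'€\d+', 'n/a'))  # n/a (default instead of an exception)
```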
docs/tutorials/migrating_from_beautifulsoup.md CHANGED
@@ -80,12 +80,12 @@ for link in links:
80
 
81
  As you can see, Scrapling simplifies the process by combining fetching and parsing into a single step, making your code cleaner and more efficient.
82
 
83
- **Additional Notes:**
84
 
85
- - **Different parsers**: BeautifulSoup allows you to set the parser engine to use, and one of them is `lxml`. Scrapling doesn't do that and uses the `lxml` library by default for performance reasons.
86
- - **Element Types**: In BeautifulSoup, elements are `Tag` objects; in Scrapling, they are `Selector` objects. However, they provide similar methods and properties for navigation and data extraction.
87
- - **Error Handling**: Both libraries return `None` when an element is not found (e.g., `soup.find()` or `page.find()`). In Scrapling, `page.css()` returns an empty `Selectors` list when no elements match, and you can use `page.css('.foo').first` to safely get the first match or `None`. To avoid errors, check for `None` or empty results before accessing properties.
88
- - **Text Extraction**: Scrapling provides additional methods for handling text through `TextHandler`, such as `clean()`, which can help remove extra whitespace, consecutive spaces, or unwanted characters. Please check out the documentation for the complete list.
89
 
90
  The documentation provides more details on Scrapling's features and the complete list of arguments that can be passed to all methods.
91
 
 
80
 
81
  As you can see, Scrapling simplifies the process by combining fetching and parsing into a single step, making your code cleaner and more efficient.
82
 
83
+ !!! abstract "**Additional Notes:**"
84
 
85
+ - **Different parsers**: BeautifulSoup lets you choose the parser engine, and one of the options is `lxml`. Scrapling doesn't expose that choice and uses the `lxml` library by default for performance reasons.
86
+ - **Element Types**: In BeautifulSoup, elements are `Tag` objects; in Scrapling, they are `Selector` objects. However, they provide similar methods and properties for navigation and data extraction.
87
+ - **Error Handling**: Both libraries return `None` when an element is not found (e.g., `soup.find()` or `page.find()`). In Scrapling, `page.css()` returns an empty `Selectors` list when no elements match, and you can use `page.css('.foo').first` to safely get the first match or `None`. To avoid errors, check for `None` or empty results before accessing properties.
88
+ - **Text Extraction**: Scrapling provides additional methods for handling text through `TextHandler`, such as `clean()`, which can help remove extra whitespace, consecutive spaces, or unwanted characters. Please check out the documentation for the complete list.
89
 
90
  The documentation provides more details on Scrapling's features and the complete list of arguments that can be passed to all methods.
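The safe-access pattern from the error-handling note above can be pictured with a tiny sketch (the real `Selectors` class is Scrapling's own; this only illustrates the `.first`-or-`None` convenience):

```python
class Selectors(list):
    # Tiny illustration of the ".first or None" convenience described above
    @property
    def first(self):
        return self[0] if self else None

assert Selectors(['a', 'b']).first == 'a'
assert Selectors([]).first is None  # no IndexError when nothing matched
print('safe access works')
```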
91