Karim Shoair committed
Commit · 4329d45
Parent(s): 5618e2f

docs: general correction across all the pages

Automated. I don't know how I missed all of that!
- docs/ai/mcp-server.md +1 -1
- docs/api-reference/mcp-server.md +1 -1
- docs/cli/extract-commands.md +2 -3
- docs/development/adaptive_storage_system.md +1 -1
- docs/fetching/choosing.md +2 -3
- docs/fetching/stealthy.md +1 -1
- docs/parsing/adaptive.md +1 -1
- docs/parsing/main_classes.md +1 -1
- docs/parsing/selection.md +1 -1
- docs/tutorials/migrating_from_beautifulsoup.md +5 -3
docs/ai/mcp-server.md CHANGED

@@ -189,7 +189,7 @@ We will gradually go from simple prompts to more complex ones. We will use Claud
 Get all product titles from https://shop.example.com using the CSS selector '.product-title'. If the request fails, retry up to 5 times every 10 seconds.
 ```
 
-The server will extract only the elements matching your selector and return them as a structured list. Notice I told it to set the tool to try
+The server will extract only the elements matching your selector and return them as a structured list. Notice I told it to set the tool to try up to 5 times in case the website has connection issues, but the default setting should be fine for most cases.
 
 3. **E-commerce Data Collection**
 
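The retry instruction in the prompt above ("retry up to 5 times every 10 seconds") boils down to a plain retry loop. A minimal sketch of that behavior, not the MCP server's actual implementation (`fetch_with_retry` and `flaky` are illustrative names):

```python
import time

def fetch_with_retry(fetch, retries=5, delay=10):
    """Call `fetch()` until it succeeds, retrying up to `retries` times
    with `delay` seconds between attempts. Re-raises the last error."""
    last_error = None
    for attempt in range(retries):
        try:
            return fetch()
        except Exception as exc:  # in practice, catch the specific network error
            last_error = exc
            if attempt < retries - 1:
                time.sleep(delay)
    raise last_error

# A flaky stand-in for the HTTP request: fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return ["Product A", "Product B"]

print(fetch_with_retry(flaky, retries=5, delay=0))  # ['Product A', 'Product B']
```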
docs/api-reference/mcp-server.md CHANGED

@@ -19,7 +19,7 @@ Or import the server class directly:
 from scrapling.core.ai import ScraplingMCPServer
 
 server = ScraplingMCPServer()
-server.serve()
+server.serve(http=False, host="0.0.0.0", port=8000)
 ```
 
 ## Response Model
docs/cli/extract-commands.md CHANGED

@@ -280,7 +280,7 @@ We will go through each command in detail below.
 -s, --css-selector TEXT CSS selector to extract specific content from the page. It returns all matches.
 --wait-selector TEXT CSS selector to wait for before proceeding
 --locale TEXT Specify user locale. Defaults to the system default locale.
---
+--real-chrome/--no-real-chrome If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch an instance of your browser and use it. (default: False)
 --proxy TEXT Proxy URL in format "http://username:password@host:port"
 -H, --extra-headers TEXT Extra headers in format "Key: Value" (can be used multiple times)
 --help Show this message and exit.

@@ -320,8 +320,7 @@ We will go through each command in detail below.
 --solve-cloudflare / --no-solve-cloudflare Solve Cloudflare challenges (default: False)
 --allow-webgl / --block-webgl Allow WebGL (default: True)
 --network-idle / --no-network-idle Wait for network idle (default: False)
---
---hide-canvas/--show-canvas Add noise to canvas operations (default: False)
+--real-chrome/--no-real-chrome If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch an instance of your browser and use it. (default: False)
 --timeout INTEGER Timeout in milliseconds (default: 30000)
 --wait INTEGER Additional wait time in milliseconds after page load (default: 0)
 -s, --css-selector TEXT CSS selector to extract specific content from the page. It returns all matches.
docs/development/adaptive_storage_system.md CHANGED

@@ -56,7 +56,7 @@ class RedisStorage(StorageSystemMixin):
             orjson.dumps(element_dict)
         )
 
-    def retrieve(self, identifier: str) -> dict:
+    def retrieve(self, identifier: str) -> dict | None:
         # Get data
         key = f"scrapling:{self._get_base_url()}:{identifier}"
         data = self.redis.get(key)
docs/fetching/choosing.md CHANGED

@@ -40,19 +40,18 @@ Then you use it right away without initializing like this, and it will use the d
 If you want to configure the parser ([Selector class](../parsing/main_classes.md#selector)) that will be used on the response before returning it for you, then do this first:
 ```python
 >>> from scrapling.fetchers import Fetcher
->>> Fetcher.configure(adaptive=True,
+>>> Fetcher.configure(adaptive=True, keep_comments=False, keep_cdata=False) # and the rest
 ```
 or
 ```python
 >>> from scrapling.fetchers import Fetcher
 >>> Fetcher.adaptive=True
->>> Fetcher.encoding="utf-8"
 >>> Fetcher.keep_comments=False
 >>> Fetcher.keep_cdata=False # and the rest
 ```
 Then, continue your code as usual.
 
-The available configuration arguments are: `adaptive`, `huge_tree`, `keep_comments`, `keep_cdata`, `storage`, and `storage_args`, which are the same ones you give to the [Selector](../parsing/main_classes.md#selector) class. You can display the current configuration anytime by running `<fetcher_class>.display_config()`.
+The available configuration arguments are: `adaptive`, `adaptive_domain`, `huge_tree`, `keep_comments`, `keep_cdata`, `storage`, and `storage_args`, which are the same ones you give to the [Selector](../parsing/main_classes.md#selector) class. You can display the current configuration anytime by running `<fetcher_class>.display_config()`.
 
 !!! info
 
docs/fetching/stealthy.md CHANGED

@@ -181,7 +181,7 @@ def scrape_amazon_product(url):
         'rating': page.css('[data-feature-name="averageCustomerReviews"] .a-popover-trigger .a-color-base::text').get(),
         'reviews_count': page.css('#acrCustomerReviewText::text').re_first(r'[\d,]+'),
         'features': [
-            li.clean() for li in page.css('#feature-bullets li span::text')
+            li.get().clean() for li in page.css('#feature-bullets li span::text')
         ],
         'availability': page.css('#availability')[0].get_all_text(strip=True),
         'images': [
docs/parsing/adaptive.md CHANGED

@@ -145,7 +145,7 @@ Examples:
 >>> page = Selector(html_doc, adaptive=True)
 # OR
 >>> Fetcher.adaptive = True
->>> page = Fetcher.
+>>> page = Fetcher.get('https://example.com')
 ```
 If you are using the [Selector](main_classes.md#selector) class, you need to pass the url of the website you are using with the argument `url` so Scrapling can separate the properties saved for each element by domain.
 
docs/parsing/main_classes.md CHANGED

@@ -343,7 +343,7 @@ Apart from the standard operations on Python lists, such as iteration and slicin
 
 You can do the following:
 
-Execute CSS and XPath selectors directly on the [Selector](#selector) instances it has, while the
+Execute CSS and XPath selectors directly on the [Selector](#selector) instances it has, while the return types are the same as [Selector](#selector)'s `css` and `xpath` methods. The arguments are similar, except the `adaptive` argument is not available here. This, of course, makes chaining methods very straightforward.
 ```python
 >>> page.css('.product_pod a')
 [<data='<a href="catalogue/a-light-in-the-attic_...' parent='<div class="image_container"> <a href="c...'>,
docs/parsing/selection.md CHANGED

@@ -398,7 +398,7 @@ Find all div elements with a class that equals `quote` and contains the element
 >>> page.find_all('div', {'class': 'quote'}, lambda e: "world" in e.css('.text::text').get())
 [<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>]
 ```
-Find all elements that
+Find all elements that have children.
 ```python
 >>> page.find_all(lambda element: len(element.children) > 0)
 [<data='<html lang="en"><head><meta charset="UTF...'>,
|
docs/tutorials/migrating_from_beautifulsoup.md
CHANGED
|
@@ -21,7 +21,7 @@ You will notice that some shortcuts in BeautifulSoup are missing in Scrapling, w
|
|
| 21 |
| Using CSS selectors to find the first matching element | `elements = soup.select_one('div.example')` | `elements = page.css('div.example').first` |
|
| 22 |
| Using CSS selectors to find all matching element | `elements = soup.select('div.example')` | `elements = page.css('div.example')` |
|
| 23 |
| Get a prettified version of the page/element source | `prettified = soup.prettify()` | `prettified = page.prettify()` |
|
| 24 |
-
| Get a Non-pretty version of the page/element source | `source = str(soup)` | `source = page.
|
| 25 |
| Get tag name of an element | `name = element.name` | `name = element.tag` |
|
| 26 |
| Extracting text content of an element | `string = element.string` | `string = element.text` |
|
| 27 |
| Extracting all the text in a document or beneath a tag | `text = soup.get_text(strip=True)` | `text = page.get_all_text(strip=True)` |
|
|
@@ -36,14 +36,16 @@ You will notice that some shortcuts in BeautifulSoup are missing in Scrapling, w
|
|
| 36 |
| Searching for elements in the siblings of an element | `target_sibling = element.find_next_siblings("a")`<br/>`target_sibling = element.find_previous_siblings("a")` | `target_sibling = element.siblings.filter(lambda s: s.tag == 'a')` |
|
| 37 |
| Searching for an element in the next elements of an element | `target_parent = element.find_next("a")` | `target_parent = element.below_elements.search(lambda p: p.tag == 'a')` |
|
| 38 |
| Searching for elements in the next elements of an element | `target_parent = element.find_all_next("a")` | `target_parent = element.below_elements.filter(lambda p: p.tag == 'a')` |
|
| 39 |
-
| Searching for an element in the
|
| 40 |
-
| Searching for elements in the
|
| 41 |
| Get previous sibling of an element | `prev_element = element.previous_sibling` | `prev_element = element.previous` |
|
| 42 |
| Navigating to children | `children = list(element.children)` | `children = element.children` |
|
| 43 |
| Get all descendants of an element | `children = list(element.descendants)` | `children = element.below_elements` |
|
| 44 |
| Filtering a group of elements that satisfies a condition | `group = soup.find('p', 'story').css.filter('a')` | `group = page.find_all('p', 'story').filter(lambda p: p.tag == 'a')` |
|
| 45 |
|
| 46 |
|
|
|
|
|
|
|
| 47 |
**One key point to remember**: BeautifulSoup offers features for modifying and manipulating the page after it has been parsed. Scrapling focuses more on scraping the page faster for you, and then you can do what you want with the extracted information. So, two different tools can be used in Web Scraping, but one of them specializes in Web Scraping :)
|
| 48 |
|
| 49 |
### Putting It All Together
|
|
|
|
| 21 |
| Using CSS selectors to find the first matching element | `elements = soup.select_one('div.example')` | `elements = page.css('div.example').first` |
|
| 22 |
| Using CSS selectors to find all matching element | `elements = soup.select('div.example')` | `elements = page.css('div.example')` |
|
| 23 |
| Get a prettified version of the page/element source | `prettified = soup.prettify()` | `prettified = page.prettify()` |
|
| 24 |
+
| Get a Non-pretty version of the page/element source | `source = str(soup)` | `source = page.html_content` |
|
| 25 |
| Get tag name of an element | `name = element.name` | `name = element.tag` |
|
| 26 |
| Extracting text content of an element | `string = element.string` | `string = element.text` |
|
| 27 |
| Extracting all the text in a document or beneath a tag | `text = soup.get_text(strip=True)` | `text = page.get_all_text(strip=True)` |
|
|
|
|
| 36 |
| Searching for elements in the siblings of an element | `target_sibling = element.find_next_siblings("a")`<br/>`target_sibling = element.find_previous_siblings("a")` | `target_sibling = element.siblings.filter(lambda s: s.tag == 'a')` |
|
| 37 |
| Searching for an element in the next elements of an element | `target_parent = element.find_next("a")` | `target_parent = element.below_elements.search(lambda p: p.tag == 'a')` |
|
| 38 |
| Searching for elements in the next elements of an element | `target_parent = element.find_all_next("a")` | `target_parent = element.below_elements.filter(lambda p: p.tag == 'a')` |
|
| 39 |
+
| Searching for an element in the ancestors of an element | `target_parent = element.find_previous("a")` ¹ | `target_parent = element.path.search(lambda p: p.tag == 'a')` |
|
| 40 |
+
| Searching for elements in the ancestors of an element | `target_parent = element.find_all_previous("a")` ¹ | `target_parent = element.path.filter(lambda p: p.tag == 'a')` |
|
| 41 |
| Get previous sibling of an element | `prev_element = element.previous_sibling` | `prev_element = element.previous` |
|
| 42 |
| Navigating to children | `children = list(element.children)` | `children = element.children` |
|
| 43 |
| Get all descendants of an element | `children = list(element.descendants)` | `children = element.below_elements` |
|
| 44 |
| Filtering a group of elements that satisfies a condition | `group = soup.find('p', 'story').css.filter('a')` | `group = page.find_all('p', 'story').filter(lambda p: p.tag == 'a')` |
|
| 45 |
|
| 46 |
|
| 47 |
+
¹ **Note:** BS4's `find_previous`/`find_all_previous` searches all preceding elements in document order, while Scrapling's `path` only returns ancestors (the parent chain). These are not exact equivalents, but ancestor search covers the most common use case.
|
| 48 |
+
|
| 49 |
**One key point to remember**: BeautifulSoup offers features for modifying and manipulating the page after it has been parsed. Scrapling focuses more on scraping the page faster for you, and then you can do what you want with the extracted information. So, two different tools can be used in Web Scraping, but one of them specializes in Web Scraping :)
|
| 50 |
|
| 51 |
### Putting It All Together
|
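The footnote added in this commit ("preceding in document order" vs "ancestor chain") can be illustrated without either library. A self-contained sketch with a hypothetical `Node` stand-in (not Scrapling's or BS4's API): `path` walks only the parent chain, while `preceding` collects the set of elements that BS4's `find_all_previous` searches (BS4 yields them nearest-first; collected here in document order).

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    tag: str
    parent: "Node | None" = None
    children: list["Node"] = field(default_factory=list)

    def append(self, child):
        child.parent = self
        self.children.append(child)
        return child

    @property
    def path(self):
        # Ancestor chain only, innermost first (what Scrapling's `path` returns)
        node, chain = self.parent, []
        while node:
            chain.append(node)
            node = node.parent
        return chain

def preceding(node, root):
    """Every element that starts before `node` in document order,
    collected via a pre-order traversal that stops at `node`."""
    seen = []
    def walk(n):
        if n is node:
            return True
        seen.append(n)
        return any(walk(c) for c in n.children)
    walk(root)
    return seen

# <html><body><div><a/></div><p><span/></p></body></html>
html = Node("html")
body = html.append(Node("body"))
div = body.append(Node("div"))
a = div.append(Node("a"))
p = body.append(Node("p"))
span = p.append(Node("span"))

# `span`'s ancestors never include the earlier <a>, but its preceding
# elements do -- hence the two table rows are only near-equivalents.
print([n.tag for n in span.path])              # ['p', 'body', 'html']
print([n.tag for n in preceding(span, html)])  # ['html', 'body', 'div', 'a', 'p']
```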