Karim Shoair committed
Commit 4329d45 · 1 Parent(s): 5618e2f

docs: general correction across all the pages

Automated. I don't know how I missed all of that!

docs/ai/mcp-server.md CHANGED
@@ -189,7 +189,7 @@ We will gradually go from simple prompts to more complex ones. We will use Claud
  Get all product titles from https://shop.example.com using the CSS selector '.product-title'. If the request fails, retry up to 5 times every 10 seconds.
  ```
 
- The server will extract only the elements matching your selector and return them as a structured list. Notice I told it to set the tool to try only 3 times in case the website has connection issues, but the default setting should be fine for most cases.
+ The server will extract only the elements matching your selector and return them as a structured list. Notice I told it to set the tool to try up to 5 times in case the website has connection issues, but the default setting should be fine for most cases.
 
  3. **E-commerce Data Collection**
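The prompt in this hunk asks the tool to retry up to 5 times, waiting 10 seconds between attempts. As a rough illustration of that retry policy (a standalone sketch under stated assumptions — `fetch` is a placeholder callable, not Scrapling's actual implementation):

```python
import time

def fetch_with_retries(fetch, url, retries=5, delay=10):
    """Call `fetch(url)`, retrying on failure up to `retries` times,
    sleeping `delay` seconds between attempts."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as exc:  # real code would catch the specific connection error
            last_error = exc
            if attempt < retries:
                time.sleep(delay)
    raise last_error
```

This is also why the "+" side of the hunk matters: the prose now matches the prompt's retry count instead of contradicting it.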
docs/api-reference/mcp-server.md CHANGED
@@ -19,7 +19,7 @@ Or import the server class directly:
  from scrapling.core.ai import ScraplingMCPServer
 
  server = ScraplingMCPServer()
- server.serve()
+ server.serve(http=False, host="0.0.0.0", port=8000)
  ```
 
  ## Response Model
docs/cli/extract-commands.md CHANGED
@@ -280,7 +280,7 @@ We will go through each command in detail below.
  -s, --css-selector TEXT CSS selector to extract specific content from the page. It returns all matches.
  --wait-selector TEXT CSS selector to wait for before proceeding
  --locale TEXT Specify user locale. Defaults to the system default locale.
- ---real-chrome/--no-real-chrome If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch an instance of your browser and use it. (default: False)
+ --real-chrome/--no-real-chrome If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch an instance of your browser and use it. (default: False)
  --proxy TEXT Proxy URL in format "http://username:password@host:port"
  -H, --extra-headers TEXT Extra headers in format "Key: Value" (can be used multiple times)
  --help Show this message and exit.
@@ -320,8 +320,7 @@ We will go through each command in detail below.
  --solve-cloudflare / --no-solve-cloudflare Solve Cloudflare challenges (default: False)
  --allow-webgl / --block-webgl Allow WebGL (default: True)
  --network-idle / --no-network-idle Wait for network idle (default: False)
- ---real-chrome/--no-real-chrome If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch an instance of your browser and use it. (default: False)
- --hide-canvas/--show-canvas Add noise to canvas operations (default: False)
+ --real-chrome/--no-real-chrome If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch an instance of your browser and use it. (default: False)
  --timeout INTEGER Timeout in milliseconds (default: 30000)
  --wait INTEGER Additional wait time in milliseconds after page load (default: 0)
  -s, --css-selector TEXT CSS selector to extract specific content from the page. It returns all matches.
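The `--real-chrome/--no-real-chrome` pair fixed above follows the common paired on/off flag convention for CLIs. A minimal sketch of the same pattern using only Python's standard library (this is illustrative, not Scrapling's actual CLI code):

```python
import argparse

# Paired flags like "--real-chrome/--no-real-chrome" can be modeled with
# argparse.BooleanOptionalAction, which auto-generates the "--no-" variant.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--real-chrome",
    action=argparse.BooleanOptionalAction,
    default=False,
    help="Launch an instance of the locally installed Chrome browser.",
)

args = parser.parse_args(["--real-chrome"])
print(args.real_chrome)  # True; passing --no-real-chrome (or nothing) gives False
```

The stray third dash the diff removes would otherwise define a flag literally named `---real-chrome`, which no user would guess.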
docs/development/adaptive_storage_system.md CHANGED
@@ -56,7 +56,7 @@ class RedisStorage(StorageSystemMixin):
  orjson.dumps(element_dict)
  )
 
- def retrieve(self, identifier: str) -> dict:
+ def retrieve(self, identifier: str) -> dict | None:
  # Get data
  key = f"scrapling:{self._get_base_url()}:{identifier}"
  data = self.redis.get(key)
docs/fetching/choosing.md CHANGED
@@ -40,19 +40,18 @@ Then you use it right away without initializing like this, and it will use the d
  If you want to configure the parser ([Selector class](../parsing/main_classes.md#selector)) that will be used on the response before returning it for you, then do this first:
  ```python
  >>> from scrapling.fetchers import Fetcher
- >>> Fetcher.configure(adaptive=True, encoding="utf-8", keep_comments=False, keep_cdata=False) # and the rest
+ >>> Fetcher.configure(adaptive=True, keep_comments=False, keep_cdata=False) # and the rest
  ```
  or
  ```python
  >>> from scrapling.fetchers import Fetcher
  >>> Fetcher.adaptive=True
- >>> Fetcher.encoding="utf-8"
  >>> Fetcher.keep_comments=False
  >>> Fetcher.keep_cdata=False # and the rest
  ```
  Then, continue your code as usual.
 
- The available configuration arguments are: `adaptive`, `huge_tree`, `keep_comments`, `keep_cdata`, `storage`, and `storage_args`, which are the same ones you give to the [Selector](../parsing/main_classes.md#selector) class. You can display the current configuration anytime by running `<fetcher_class>.display_config()`.
+ The available configuration arguments are: `adaptive`, `adaptive_domain`, `huge_tree`, `keep_comments`, `keep_cdata`, `storage`, and `storage_args`, which are the same ones you give to the [Selector](../parsing/main_classes.md#selector) class. You can display the current configuration anytime by running `<fetcher_class>.display_config()`.
 
  !!! info
 
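The hunk above configures parser defaults at the class level, either through `configure(...)` or by setting class attributes directly. A minimal sketch of that pattern (a toy class: the attribute names come from the diff, but the implementation is illustrative and not Scrapling's):

```python
class ToyFetcher:
    # Class-level parser defaults; both configuration styles mutate these.
    adaptive = False
    keep_comments = False
    keep_cdata = False

    _CONFIG_KEYS = ("adaptive", "keep_comments", "keep_cdata")

    @classmethod
    def configure(cls, **kwargs):
        for name, value in kwargs.items():
            if name not in cls._CONFIG_KEYS:
                raise TypeError(f"Unknown configuration argument: {name!r}")
            setattr(cls, name, value)

    @classmethod
    def display_config(cls):
        return {name: getattr(cls, name) for name in cls._CONFIG_KEYS}
```

With this shape, `ToyFetcher.configure(adaptive=True)` and `ToyFetcher.adaptive = True` are equivalent, and rejecting unknown keywords is exactly why the removed `encoding="utf-8"` example had to go once `encoding` stopped being a valid argument.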
docs/fetching/stealthy.md CHANGED
@@ -181,7 +181,7 @@ def scrape_amazon_product(url):
  'rating': page.css('[data-feature-name="averageCustomerReviews"] .a-popover-trigger .a-color-base::text').get(),
  'reviews_count': page.css('#acrCustomerReviewText::text').re_first(r'[\d,]+'),
  'features': [
- li.clean() for li in page.css('#feature-bullets li span::text')
+ li.get().clean() for li in page.css('#feature-bullets li span::text')
  ],
  'availability': page.css('#availability')[0].get_all_text(strip=True),
  'images': [
docs/parsing/adaptive.md CHANGED
@@ -145,7 +145,7 @@ Examples:
  >>> page = Selector(html_doc, adaptive=True)
  # OR
  >>> Fetcher.adaptive = True
- >>> page = Fetcher.fetch('https://example.com')
+ >>> page = Fetcher.get('https://example.com')
  ```
  If you are using the [Selector](main_classes.md#selector) class, you need to pass the url of the website you are using with the argument `url` so Scrapling can separate the properties saved for each element by domain.
 
docs/parsing/main_classes.md CHANGED
@@ -343,7 +343,7 @@ Apart from the standard operations on Python lists, such as iteration and slicin
 
  You can do the following:
 
- Execute CSS and XPath selectors directly on the [Selector](#selector) instances it has, while the arguments and the return types are the same as [Selector](#selector)'s `css` and `xpath` methods. This, of course, makes chaining methods very straightforward.
+ Execute CSS and XPath selectors directly on the [Selector](#selector) instances it has, while the return types are the same as [Selector](#selector)'s `css` and `xpath` methods. The arguments are similar, except the `adaptive` argument is not available here. This, of course, makes chaining methods very straightforward.
  ```python
  >>> page.css('.product_pod a')
  [<data='<a href="catalogue/a-light-in-the-attic_...' parent='<div class="image_container"> <a href="c...'>,
docs/parsing/selection.md CHANGED
@@ -398,7 +398,7 @@ Find all div elements with a class that equals `quote` and contains the element
  >>> page.find_all('div', {'class': 'quote'}, lambda e: "world" in e.css('.text::text').get())
  [<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>]
  ```
- Find all elements that don't have children.
+ Find all elements that have children.
  ```python
  >>> page.find_all(lambda element: len(element.children) > 0)
  [<data='<html lang="en"><head><meta charset="UTF...'>,
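The corrected caption now matches the predicate: `len(element.children) > 0` keeps elements that *do* have children. A toy model of predicate-based `find_all` over a tiny tree shows why (illustrative only, not Scrapling's implementation):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    tag: str
    children: list["Node"] = field(default_factory=list)

def find_all(root: Node, predicate) -> list[Node]:
    """Walk the tree depth-first and collect every node the predicate accepts."""
    matches = []
    stack = [root]
    while stack:
        node = stack.pop()
        if predicate(node):
            matches.append(node)
        stack.extend(reversed(node.children))
    return matches

# <html><head/><body><p/></body></html>
tree = Node("html", [Node("head"), Node("body", [Node("p")])])
parents = find_all(tree, lambda element: len(element.children) > 0)
print([n.tag for n in parents])  # ['html', 'body'] — only elements with children
```

Leaf elements like `head` and `p` are exactly what the old caption ("elements that don't have children") would have described, which is the opposite of what the example returns.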
docs/tutorials/migrating_from_beautifulsoup.md CHANGED
@@ -21,7 +21,7 @@ You will notice that some shortcuts in BeautifulSoup are missing in Scrapling, w
  | Using CSS selectors to find the first matching element | `elements = soup.select_one('div.example')` | `elements = page.css('div.example').first` |
  | Using CSS selectors to find all matching element | `elements = soup.select('div.example')` | `elements = page.css('div.example')` |
  | Get a prettified version of the page/element source | `prettified = soup.prettify()` | `prettified = page.prettify()` |
- | Get a Non-pretty version of the page/element source | `source = str(soup)` | `source = page.body` |
+ | Get a Non-pretty version of the page/element source | `source = str(soup)` | `source = page.html_content` |
  | Get tag name of an element | `name = element.name` | `name = element.tag` |
  | Extracting text content of an element | `string = element.string` | `string = element.text` |
  | Extracting all the text in a document or beneath a tag | `text = soup.get_text(strip=True)` | `text = page.get_all_text(strip=True)` |
@@ -36,14 +36,16 @@ You will notice that some shortcuts in BeautifulSoup are missing in Scrapling, w
  | Searching for elements in the siblings of an element | `target_sibling = element.find_next_siblings("a")`<br/>`target_sibling = element.find_previous_siblings("a")` | `target_sibling = element.siblings.filter(lambda s: s.tag == 'a')` |
  | Searching for an element in the next elements of an element | `target_parent = element.find_next("a")` | `target_parent = element.below_elements.search(lambda p: p.tag == 'a')` |
  | Searching for elements in the next elements of an element | `target_parent = element.find_all_next("a")` | `target_parent = element.below_elements.filter(lambda p: p.tag == 'a')` |
- | Searching for an element in the previous elements of an element | `target_parent = element.find_previous("a")` | `target_parent = element.path.search(lambda p: p.tag == 'a')` |
- | Searching for elements in the previous elements of an element | `target_parent = element.find_all_previous("a")` | `target_parent = element.path.filter(lambda p: p.tag == 'a')` |
+ | Searching for an element in the ancestors of an element | `target_parent = element.find_previous("a")` ¹ | `target_parent = element.path.search(lambda p: p.tag == 'a')` |
+ | Searching for elements in the ancestors of an element | `target_parent = element.find_all_previous("a")` ¹ | `target_parent = element.path.filter(lambda p: p.tag == 'a')` |
  | Get previous sibling of an element | `prev_element = element.previous_sibling` | `prev_element = element.previous` |
  | Navigating to children | `children = list(element.children)` | `children = element.children` |
  | Get all descendants of an element | `children = list(element.descendants)` | `children = element.below_elements` |
  | Filtering a group of elements that satisfies a condition | `group = soup.find('p', 'story').css.filter('a')` | `group = page.find_all('p', 'story').filter(lambda p: p.tag == 'a')` |
 
 
+ ¹ **Note:** BS4's `find_previous`/`find_all_previous` searches all preceding elements in document order, while Scrapling's `path` only returns ancestors (the parent chain). These are not exact equivalents, but ancestor search covers the most common use case.
+
  **One key point to remember**: BeautifulSoup offers features for modifying and manipulating the page after it has been parsed. Scrapling focuses more on scraping the page faster for you, and then you can do what you want with the extracted information. So, two different tools can be used in Web Scraping, but one of them specializes in Web Scraping :)
 
  ### Putting It All Together
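The footnote added in this hunk is worth making concrete: "preceding in document order" and "ancestor chain" genuinely differ. A toy tree (neither BS4 nor Scrapling, just plain Python) shows an element that `find_previous("a")` would find but an ancestors-only walk would not:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class El:
    tag: str
    parent: Optional["El"] = None
    children: list["El"] = field(default_factory=list)

    def add(self, child: "El") -> "El":
        child.parent = self
        self.children.append(child)
        return child

def ancestors(el):
    """The parent chain only — what the docs say Scrapling's `path` returns."""
    node = el.parent
    while node:
        yield node
        node = node.parent

def preceding(el):
    """Every element before `el` in document order — BS4's find_all_previous."""
    order = []
    def walk(node):
        order.append(node)
        for child in node.children:
            walk(child)
    root = el
    while root.parent:
        root = root.parent
    walk(root)
    return list(reversed(order[:order.index(el)]))

# <html><body><a/><div><span/></div></body></html>
html = El("html")
body = html.add(El("body"))
a = body.add(El("a"))
div = body.add(El("div"))
span = div.add(El("span"))

print([n.tag for n in ancestors(span)])  # ['div', 'body', 'html'] — no <a>
print([n.tag for n in preceding(span)])  # ['div', 'a', 'body', 'html'] — includes <a>
```

The sibling `<a>` precedes `<span>` in the document but is not its ancestor, so a migrated `element.path.search(...)` call would miss it — exactly the caveat the footnote flags.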