Karim Shoair committed
Commit · 4329d45
Parent(s): 5618e2f

docs: general correction across all the pages

Automated. I don't know how I missed all of that!
- docs/ai/mcp-server.md +1 -1
- docs/api-reference/mcp-server.md +1 -1
- docs/cli/extract-commands.md +2 -3
- docs/development/adaptive_storage_system.md +1 -1
- docs/fetching/choosing.md +2 -3
- docs/fetching/stealthy.md +1 -1
- docs/parsing/adaptive.md +1 -1
- docs/parsing/main_classes.md +1 -1
- docs/parsing/selection.md +1 -1
- docs/tutorials/migrating_from_beautifulsoup.md +5 -3
docs/ai/mcp-server.md CHANGED

@@ -189,7 +189,7 @@ We will gradually go from simple prompts to more complex ones. We will use Claud
 Get all product titles from https://shop.example.com using the CSS selector '.product-title'. If the request fails, retry up to 5 times every 10 seconds.
 ```
 
-The server will extract only the elements matching your selector and return them as a structured list. Notice I told it to set the tool to try
+The server will extract only the elements matching your selector and return them as a structured list. Notice I told it to set the tool to try up to 5 times in case the website has connection issues, but the default setting should be fine for most cases.
 
 3. **E-commerce Data Collection**
 
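The retry instruction in the prompt above ("retry up to 5 times every 10 seconds") boils down to a plain retry loop. A minimal sketch of that behavior, not the MCP server's actual implementation (`fetch_with_retry` and `flaky` are illustrative names):

```python
import time

def fetch_with_retry(fetch, retries=5, delay=10):
    """Call `fetch()` until it succeeds, retrying up to `retries` times
    with `delay` seconds between attempts. Re-raises the last error."""
    last_error = None
    for attempt in range(retries):
        try:
            return fetch()
        except Exception as exc:  # in practice, catch the specific network error
            last_error = exc
            if attempt < retries - 1:
                time.sleep(delay)
    raise last_error

# A flaky stand-in for the HTTP request: fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return ["Product A", "Product B"]

print(fetch_with_retry(flaky, retries=5, delay=0))  # ['Product A', 'Product B']
```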
docs/api-reference/mcp-server.md CHANGED

@@ -19,7 +19,7 @@ Or import the server class directly:
 from scrapling.core.ai import ScraplingMCPServer
 
 server = ScraplingMCPServer()
-server.serve()
+server.serve(http=False, host="0.0.0.0", port=8000)
 ```
 
 ## Response Model
docs/cli/extract-commands.md CHANGED

@@ -280,7 +280,7 @@ We will go through each command in detail below.
 -s, --css-selector TEXT CSS selector to extract specific content from the page. It returns all matches.
 --wait-selector TEXT CSS selector to wait for before proceeding
 --locale TEXT Specify user locale. Defaults to the system default locale.
---
+--real-chrome/--no-real-chrome If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch an instance of your browser and use it. (default: False)
 --proxy TEXT Proxy URL in format "http://username:password@host:port"
 -H, --extra-headers TEXT Extra headers in format "Key: Value" (can be used multiple times)
 --help Show this message and exit.

@@ -320,8 +320,7 @@ We will go through each command in detail below.
 --solve-cloudflare / --no-solve-cloudflare Solve Cloudflare challenges (default: False)
 --allow-webgl / --block-webgl Allow WebGL (default: True)
 --network-idle / --no-network-idle Wait for network idle (default: False)
---
---hide-canvas/--show-canvas Add noise to canvas operations (default: False)
+--real-chrome/--no-real-chrome If you have a Chrome browser installed on your device, enable this, and the Fetcher will launch an instance of your browser and use it. (default: False)
 --timeout INTEGER Timeout in milliseconds (default: 30000)
 --wait INTEGER Additional wait time in milliseconds after page load (default: 0)
 -s, --css-selector TEXT CSS selector to extract specific content from the page. It returns all matches.
docs/development/adaptive_storage_system.md CHANGED

@@ -56,7 +56,7 @@ class RedisStorage(StorageSystemMixin):
             orjson.dumps(element_dict)
         )
 
-    def retrieve(self, identifier: str) -> dict:
+    def retrieve(self, identifier: str) -> dict | None:
         # Get data
         key = f"scrapling:{self._get_base_url()}:{identifier}"
         data = self.redis.get(key)
docs/fetching/choosing.md CHANGED

@@ -40,19 +40,18 @@ Then you use it right away without initializing like this, and it will use the d
 If you want to configure the parser ([Selector class](../parsing/main_classes.md#selector)) that will be used on the response before returning it for you, then do this first:
 ```python
 >>> from scrapling.fetchers import Fetcher
->>> Fetcher.configure(adaptive=True,
+>>> Fetcher.configure(adaptive=True, keep_comments=False, keep_cdata=False) # and the rest
 ```
 or
 ```python
 >>> from scrapling.fetchers import Fetcher
 >>> Fetcher.adaptive=True
->>> Fetcher.encoding="utf-8"
 >>> Fetcher.keep_comments=False
 >>> Fetcher.keep_cdata=False # and the rest
 ```
 Then, continue your code as usual.
 
-The available configuration arguments are: `adaptive`, `huge_tree`, `keep_comments`, `keep_cdata`, `storage`, and `storage_args`, which are the same ones you give to the [Selector](../parsing/main_classes.md#selector) class. You can display the current configuration anytime by running `<fetcher_class>.display_config()`.
+The available configuration arguments are: `adaptive`, `adaptive_domain`, `huge_tree`, `keep_comments`, `keep_cdata`, `storage`, and `storage_args`, which are the same ones you give to the [Selector](../parsing/main_classes.md#selector) class. You can display the current configuration anytime by running `<fetcher_class>.display_config()`.
 
 !!! info
 
docs/fetching/stealthy.md CHANGED

@@ -181,7 +181,7 @@ def scrape_amazon_product(url):
         'rating': page.css('[data-feature-name="averageCustomerReviews"] .a-popover-trigger .a-color-base::text').get(),
         'reviews_count': page.css('#acrCustomerReviewText::text').re_first(r'[\d,]+'),
         'features': [
-            li.clean() for li in page.css('#feature-bullets li span::text')
+            li.get().clean() for li in page.css('#feature-bullets li span::text')
         ],
         'availability': page.css('#availability')[0].get_all_text(strip=True),
         'images': [
docs/parsing/adaptive.md CHANGED

@@ -145,7 +145,7 @@ Examples:
 >>> page = Selector(html_doc, adaptive=True)
 # OR
 >>> Fetcher.adaptive = True
->>> page = Fetcher.
+>>> page = Fetcher.get('https://example.com')
 ```
 If you are using the [Selector](main_classes.md#selector) class, you need to pass the url of the website you are using with the argument `url` so Scrapling can separate the properties saved for each element by domain.
 
docs/parsing/main_classes.md CHANGED

@@ -343,7 +343,7 @@ Apart from the standard operations on Python lists, such as iteration and slicin
 
 You can do the following:
 
-Execute CSS and XPath selectors directly on the [Selector](#selector) instances it has, while the
+Execute CSS and XPath selectors directly on the [Selector](#selector) instances it has, while the return types are the same as [Selector](#selector)'s `css` and `xpath` methods. The arguments are similar, except the `adaptive` argument is not available here. This, of course, makes chaining methods very straightforward.
 ```python
 >>> page.css('.product_pod a')
 [<data='<a href="catalogue/a-light-in-the-attic_...' parent='<div class="image_container"> <a href="c...'>,
docs/parsing/selection.md CHANGED

@@ -398,7 +398,7 @@ Find all div elements with a class that equals `quote` and contains the element
 >>> page.find_all('div', {'class': 'quote'}, lambda e: "world" in e.css('.text::text').get())
 [<data='<div class="quote" itemscope itemtype="h...' parent='<div class="col-md-8"> <div class="quote...'>]
 ```
-Find all elements that
+Find all elements that have children.
 ```python
 >>> page.find_all(lambda element: len(element.children) > 0)
 [<data='<html lang="en"><head><meta charset="UTF...'>,
|
docs/tutorials/migrating_from_beautifulsoup.md
CHANGED
|
@@ -21,7 +21,7 @@ You will notice that some shortcuts in BeautifulSoup are missing in Scrapling, w
|
|
| 21 |
| Using CSS selectors to find the first matching element | `elements = soup.select_one('div.example')` | `elements = page.css('div.example').first` |
|
| 22 |
| Using CSS selectors to find all matching element | `elements = soup.select('div.example')` | `elements = page.css('div.example')` |
|
| 23 |
| Get a prettified version of the page/element source | `prettified = soup.prettify()` | `prettified = page.prettify()` |
|
| 24 |
-
| Get a Non-pretty version of the page/element source | `source = str(soup)` | `source = page.
|
| 25 |
| Get tag name of an element | `name = element.name` | `name = element.tag` |
|
| 26 |
| Extracting text content of an element | `string = element.string` | `string = element.text` |
|
| 27 |
| Extracting all the text in a document or beneath a tag | `text = soup.get_text(strip=True)` | `text = page.get_all_text(strip=True)` |
|
|
@@ -36,14 +36,16 @@ You will notice that some shortcuts in BeautifulSoup are missing in Scrapling, w
|
|
| 36 |
| Searching for elements in the siblings of an element | `target_sibling = element.find_next_siblings("a")`<br/>`target_sibling = element.find_previous_siblings("a")` | `target_sibling = element.siblings.filter(lambda s: s.tag == 'a')` |
|
| 37 |
| Searching for an element in the next elements of an element | `target_parent = element.find_next("a")` | `target_parent = element.below_elements.search(lambda p: p.tag == 'a')` |
|
| 38 |
| Searching for elements in the next elements of an element | `target_parent = element.find_all_next("a")` | `target_parent = element.below_elements.filter(lambda p: p.tag == 'a')` |
|
| 39 |
-
| Searching for an element in the
|
| 40 |
-
| Searching for elements in the
|
| 41 |
| Get previous sibling of an element | `prev_element = element.previous_sibling` | `prev_element = element.previous` |
|
| 42 |
| Navigating to children | `children = list(element.children)` | `children = element.children` |
|
| 43 |
| Get all descendants of an element | `children = list(element.descendants)` | `children = element.below_elements` |
|
| 44 |
| Filtering a group of elements that satisfies a condition | `group = soup.find('p', 'story').css.filter('a')` | `group = page.find_all('p', 'story').filter(lambda p: p.tag == 'a')` |
|
| 45 |
|
| 46 |
|
|
|
|
|
|
|
| 47 |
**One key point to remember**: BeautifulSoup offers features for modifying and manipulating the page after it has been parsed. Scrapling focuses more on scraping the page faster for you, and then you can do what you want with the extracted information. So, two different tools can be used in Web Scraping, but one of them specializes in Web Scraping :)
|
| 48 |
|
| 49 |
### Putting It All Together
|
|
|
|
| 21 |
| Using CSS selectors to find the first matching element | `elements = soup.select_one('div.example')` | `elements = page.css('div.example').first` |
|
| 22 |
| Using CSS selectors to find all matching element | `elements = soup.select('div.example')` | `elements = page.css('div.example')` |
|
| 23 |
| Get a prettified version of the page/element source | `prettified = soup.prettify()` | `prettified = page.prettify()` |
|
| 24 |
+
| Get a Non-pretty version of the page/element source | `source = str(soup)` | `source = page.html_content` |
|
| 25 |
| Get tag name of an element | `name = element.name` | `name = element.tag` |
|
| 26 |
| Extracting text content of an element | `string = element.string` | `string = element.text` |
|
| 27 |
| Extracting all the text in a document or beneath a tag | `text = soup.get_text(strip=True)` | `text = page.get_all_text(strip=True)` |
|
|
|
|
| 36 |
| Searching for elements in the siblings of an element | `target_sibling = element.find_next_siblings("a")`<br/>`target_sibling = element.find_previous_siblings("a")` | `target_sibling = element.siblings.filter(lambda s: s.tag == 'a')` |
|
| 37 |
| Searching for an element in the next elements of an element | `target_parent = element.find_next("a")` | `target_parent = element.below_elements.search(lambda p: p.tag == 'a')` |
|
| 38 |
| Searching for elements in the next elements of an element | `target_parent = element.find_all_next("a")` | `target_parent = element.below_elements.filter(lambda p: p.tag == 'a')` |
|
| 39 |
+
| Searching for an element in the ancestors of an element | `target_parent = element.find_previous("a")` ¹ | `target_parent = element.path.search(lambda p: p.tag == 'a')` |
|
| 40 |
+
| Searching for elements in the ancestors of an element | `target_parent = element.find_all_previous("a")` ¹ | `target_parent = element.path.filter(lambda p: p.tag == 'a')` |
|
| 41 |
| Get previous sibling of an element | `prev_element = element.previous_sibling` | `prev_element = element.previous` |
|
| 42 |
| Navigating to children | `children = list(element.children)` | `children = element.children` |
|
| 43 |
| Get all descendants of an element | `children = list(element.descendants)` | `children = element.below_elements` |
|
| 44 |
| Filtering a group of elements that satisfies a condition | `group = soup.find('p', 'story').css.filter('a')` | `group = page.find_all('p', 'story').filter(lambda p: p.tag == 'a')` |
|
| 45 |
|
| 46 |
|
| 47 |
+
¹ **Note:** BS4's `find_previous`/`find_all_previous` searches all preceding elements in document order, while Scrapling's `path` only returns ancestors (the parent chain). These are not exact equivalents, but ancestor search covers the most common use case.
|
| 48 |
+
|
| 49 |
**One key point to remember**: BeautifulSoup offers features for modifying and manipulating the page after it has been parsed. Scrapling focuses more on scraping the page faster for you, and then you can do what you want with the extracted information. So, two different tools can be used in Web Scraping, but one of them specializes in Web Scraping :)
|
| 50 |
|
| 51 |
### Putting It All Together
|
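The footnote added in this commit ("preceding in document order" vs "ancestor chain") can be illustrated without either library. A self-contained sketch with a hypothetical `Node` stand-in (not Scrapling's or BS4's API): `path` walks only the parent chain, while `preceding` collects the set of elements that BS4's `find_all_previous` searches (BS4 yields them nearest-first; collected here in document order).

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    tag: str
    parent: "Node | None" = None
    children: list["Node"] = field(default_factory=list)

    def append(self, child):
        child.parent = self
        self.children.append(child)
        return child

    @property
    def path(self):
        # Ancestor chain only, innermost first (what Scrapling's `path` returns)
        node, chain = self.parent, []
        while node:
            chain.append(node)
            node = node.parent
        return chain

def preceding(node, root):
    """Every element that starts before `node` in document order,
    collected via a pre-order traversal that stops at `node`."""
    seen = []
    def walk(n):
        if n is node:
            return True
        seen.append(n)
        return any(walk(c) for c in n.children)
    walk(root)
    return seen

# <html><body><div><a/></div><p><span/></p></body></html>
html = Node("html")
body = html.append(Node("body"))
div = body.append(Node("div"))
a = div.append(Node("a"))
p = body.append(Node("p"))
span = p.append(Node("span"))

# `span`'s ancestors never include the earlier <a>, but its preceding
# elements do -- hence the two table rows are only near-equivalents.
print([n.tag for n in span.path])              # ['p', 'body', 'html']
print([n.tag for n in preceding(span, html)])  # ['html', 'body', 'div', 'a', 'p']
```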