Karim Shoair committed
Commit · 92b1671 · Parent(s): be5aa87
docs: Update the MCP server page
Browse files: docs/ai/mcp-server.md (+20 −16)

docs/ai/mcp-server.md (CHANGED)
<iframe width="560" height="315" src="https://www.youtube.com/embed/qyFk3ZNwOxE?si=3FHzgcYCb66iJ6e3" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>

The **Scrapling MCP Server** is a new feature that brings Scrapling's powerful Web Scraping capabilities directly to your favorite AI chatbot or AI agent. This integration allows you to scrape websites, extract data, and bypass anti-bot protections conversationally through Claude's AI interface or any interface that supports MCP.

## Features

The Scrapling MCP Server provides six powerful tools for web scraping:

- **`bulk_get`**: An async version of the `get` tool that allows scraping of multiple URLs at the same time!

### 🌐 Dynamic Content Scraping

- **`fetch`**: Rapidly fetch dynamic content with a Chromium/Chrome browser, with complete control over the request/browser, and more!
- **`bulk_fetch`**: An async version of the above tool that allows scraping of multiple URLs in different browser tabs at the same time!

### 🔒 Stealth Scraping

- **`stealthy_fetch`**: Uses our Stealthy browser to bypass Cloudflare Turnstile/Interstitial and other anti-bot systems, with complete control over the request/browser!
- **`bulk_stealthy_fetch`**: An async version of the above tool that allows stealth scraping of multiple URLs in different browser tabs at the same time!

### Key Capabilities

#### But why use Scrapling MCP Server instead of other available tools?

Aside from its stealth capabilities and its ability to bypass Cloudflare Turnstile/Interstitial, Scrapling's server is the only one that lets you select specific elements to pass to the AI, saving a lot of time and tokens!

The way other servers work is that they extract the content, then pass it all to the AI to extract the fields you want. This causes the AI to consume far more tokens than needed (from irrelevant content). Scrapling solves this problem by allowing you to pass a CSS selector to narrow down the content you want before passing it to the AI, which makes the whole process much faster and more efficient.

If you don't know how to write/use CSS selectors, don't worry. You can tell the AI in the prompt to write selectors to match possible fields for you and watch it try different combinations until it finds the right one, as we will show in the examples section.

```bash
pip install "scrapling[ai]"
scrapling install
```

Or use the Docker image directly from the Docker registry:

```bash
docker pull pyd4vinci/scrapling
```

Or download it from the GitHub registry:

```bash
docker pull ghcr.io/d4vinci/scrapling:latest
```
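If you go the Docker route, running the image might look like the sketch below; the container entrypoint, inner command, and port mapping are all assumptions not documented on this page, so the sketch only prints the command instead of executing it:

```bash
# Assemble a hypothetical run command; the 'scrapling mcp --http' entrypoint
# and the 8000 port mapping are assumptions -- check the image docs first.
IMAGE="pyd4vinci/scrapling"
RUN_CMD="docker run --rm -p 8000:8000 ${IMAGE} scrapling mcp --http"
echo "${RUN_CMD}"
```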

## Setting up the MCP Server

As per the [official article](https://modelcontextprotocol.io/quickstart/user), this action either creates a new configuration file if none exists or opens your existing configuration. The file is located at:

1. **MacOS**: `~/Library/Application Support/Claude/claude_desktop_config.json`
2. **Windows**: `%APPDATA%\Claude\claude_desktop_config.json`

To ensure it's working, use the full path to the `scrapling` executable. Open the terminal and execute the following command:

1. **MacOS**: `which scrapling`
2. **Windows**: `where scrapling`
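Putting those two steps together, a small shell sketch can resolve the executable and print a matching server entry. The `mcpServers`/`command`/`args` shape follows the MCP quickstart linked above; the fallback path below is a placeholder for illustration, not a guaranteed install location:

```bash
# Resolve the full path of the scrapling executable; fall back to a
# placeholder path (illustrative only) if it isn't installed.
SCRAPLING_BIN="$(command -v scrapling || echo /usr/local/bin/scrapling)"

# Print a config entry in the claude_desktop_config.json shape.
printf '{\n  "mcpServers": {\n    "scrapling": {\n      "command": "%s",\n      "args": ["mcp"]\n    }\n  }\n}\n' "$SCRAPLING_BIN"
```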

Use the following to enable 'Streamable HTTP' transport mode:

```bash
scrapling mcp --http
```

By default, the server listens on host '0.0.0.0' and port 8000; both can be configured as below:

```bash
scrapling mcp --http --host '127.0.0.1' --port 8000
```
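To check that the HTTP transport is reachable, you could probe the port with curl. Note that the `/mcp` endpoint path is an assumption based on common Streamable HTTP setups, not something this page documents:

```bash
# Probe the default host/port; prints the HTTP status code, or 000 when
# nothing answers (or curl is unavailable). The /mcp path is an assumption.
STATUS="$(curl -s -o /dev/null -w '%{http_code}' --max-time 2 http://127.0.0.1:8000/mcp 2>/dev/null)" || STATUS="000"
echo "HTTP status: $STATUS"
```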

We will gradually go from simple prompts to more complex ones.

```
Scrape the main content from https://example.com and convert it to markdown format.
```

Claude will use the `get` tool to fetch the page and return clean, readable content. If it fails, it will continue retrying every second for 3 attempts, unless you instruct it otherwise. If it still fails to retrieve content for any reason, such as bot protection or a dynamic website, it will automatically try the other tools. If Claude doesn't do that automatically for some reason, you can add that instruction to the prompt.

A more optimized version of the same prompt would be:

```
Use regular requests to scrape the main content from https://example.com and convert it to markdown format.
```

This tells Claude which tool to use, so it doesn't have to guess. Sometimes it will start with normal requests on its own, and at other times it will assume browsers are better suited for the website without any apparent reason. As a general rule of thumb, always tell Claude which tool to use if you want to save time and money and get consistent results.
2. **Targeted Data Extraction**

```
Get all product titles from https://shop.example.com using the CSS selector '.product-title'. If the request fails, retry up to 5 times every 10 seconds.
```

The server will extract only the elements matching your selector and return them as a structured list. Notice I told it to set the tool to retry up to 5 times in case the website has connection issues, but the default settings should be fine for most cases.
3. **E-commerce Data Collection**

```
Get the product names, prices, and descriptions from each page.
```

Claude will use `bulk_fetch` to concurrently scrape all URLs, then analyze the extracted data.
4. **More advanced workflow**

And if you know how to write CSS selectors, you can instruct Claude to apply the selectors to the elements you want, and it will nearly complete the task immediately.

```
Use normal requests to extract the URLs of all games on the page below, then perform a bulk request to them and return a list of all action games.

The selector for games on the first page is `[href*="/concept/"]` and the selector for the genre in the second request is `[data-qa="gameInfo#releaseInformation#genre-value"]`.

URL: https://store.playstation.com/en-us/pages/browse
```

5. **Get data from a website with Cloudflare protection**

If you think the website you are targeting has Cloudflare protection, tell Claude instead of letting it discover it on its own.

```
What's the price of this product? Be cautious, as it utilizes Cloudflare's Turnstile protection. Make the browser visible while you work.
```

You can, for example, use a prompt like this:

```
Extract all product URLs for the following category, then return the prices and details for the first 3 products.

https://www.arnotts.ie/furniture/bedroom/bed-frames/
```