Karim Shoair committed
Commit · cc4ddeb
Parent(s): b5d12da

docs: update the website main page

docs/index.md CHANGED (+40 -22)

@@ -14,27 +14,34 @@
-<i><code>Easy, effortless Web Scraping as it should be!</code></i>
-<br/><br/>
-</div>

@@ -56,16 +63,27 @@ Built for the modern Web, Scrapling features **its own rapid parsing engine**
-- **Dynamic Loading**: Fetch dynamic websites with full browser automation through the `DynamicFetcher` class supporting Playwright's Chromium
-- **Anti-bot Bypass**: Advanced stealth capabilities with `StealthyFetcher` and fingerprint spoofing. Can bypass all types of Cloudflare's Turnstile/Interstitial with automation

@@ -77,12 +95,12 @@
-- **Use it directly from the Terminal**: Optionally, you can use Scrapling to scrape a URL without writing a single code!
-- **Complete Type Coverage**: Full type hints for excellent IDE support and code completion.
</a>
</div>

<h2 align="center"><i>Effortless Web Scraping for the Modern Web</i></h2><br>

Scrapling is an adaptive Web Scraping framework that handles everything from a single request to a full-scale crawl.

Its parser learns from website changes and automatically relocates your elements when pages update. Its fetchers bypass anti-bot systems like Cloudflare Turnstile out of the box. And its spider framework lets you scale up to concurrent, multi-session crawls with pause/resume and automatic proxy rotation, all in a few lines of Python. One library, zero compromises.

Blazing fast crawls with real-time stats and streaming. Built by Web Scrapers for Web Scrapers and regular users alike; there's something for everyone.
```python
from scrapling.fetchers import Fetcher, AsyncFetcher, StealthyFetcher, DynamicFetcher

StealthyFetcher.adaptive = True
page = StealthyFetcher.fetch('https://example.com', headless=True, network_idle=True)  # Fetch website under the radar!
products = page.css('.product', auto_save=True)  # Scrape data that survives website design changes!
products = page.css('.product', adaptive=True)  # Later, if the website structure changes, pass `adaptive=True` to find them!
```

Or scale up to full crawls:

```python
from scrapling.spiders import Spider, Response


class MySpider(Spider):
    name = "demo"
    start_urls = ["https://example.com/"]

    async def parse(self, response: Response):
        for item in response.css('.product'):
            yield {"title": item.css('h2::text').get()}


MySpider().start()
```

## Top Sponsors

## Key Features

### Spiders: A Full Crawling Framework

- **Scrapy-like Spider API**: Define spiders with `start_urls`, async `parse` callbacks, and `Request`/`Response` objects.
- **Concurrent Crawling**: Configurable concurrency limits, per-domain throttling, and download delays.
- **Multi-Session Support**: Unified interface for HTTP requests and stealthy headless browsers in a single spider; route requests to different sessions by ID.
- **Pause & Resume**: Checkpoint-based crawl persistence. Press Ctrl+C for a graceful shutdown; restart to resume from where you left off.
- **Streaming Mode**: Stream scraped items as they arrive via `async for item in spider.stream()` with real-time stats; ideal for UIs, pipelines, and long-running crawls.
- **Blocked Request Detection**: Automatic detection and retry of blocked requests with customizable logic.
- **Built-in Export**: Export results through hooks and your own pipeline, or use the built-in JSON/JSONL export via `result.items.to_json()` / `result.items.to_jsonl()`.
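The streaming mode described above can be sketched with plain `asyncio`. The generator below is only a stand-in for a spider's `stream()`; the names and signatures are illustrative, not Scrapling's actual implementation:

```python
# Conceptual sketch of the "Streaming Mode" consumption pattern.
# `fake_stream` stands in for a spider's stream() async generator;
# the real Scrapling API may differ.
import asyncio
from typing import AsyncIterator


async def fake_stream() -> AsyncIterator[dict]:
    """Yield scraped items one by one, as they 'arrive'."""
    for i in range(3):
        await asyncio.sleep(0)  # simulate waiting on network I/O
        yield {"title": f"Product {i}"}


async def consume() -> list[dict]:
    items = []
    async for item in fake_stream():  # process each item as it arrives
        items.append(item)
    return items


collected = asyncio.run(consume())
print(collected)
```

The point of the pattern is that items reach your code while the crawl is still running, so a UI or pipeline never has to wait for the full result set.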

### Advanced Websites Fetching with Session Support

- **HTTP Requests**: Fast and stealthy HTTP requests with the `Fetcher` class. Can impersonate browsers' TLS fingerprints and headers, and use HTTP/3.
- **Dynamic Loading**: Fetch dynamic websites with full browser automation through the `DynamicFetcher` class, supporting Playwright's Chromium and Google's Chrome.
- **Anti-bot Bypass**: Advanced stealth capabilities with `StealthyFetcher` and fingerprint spoofing. Can easily bypass all types of Cloudflare's Turnstile/Interstitial with automation.
- **Session Management**: Persistent session support with the `FetcherSession`, `StealthySession`, and `DynamicSession` classes for cookie and state management across requests.
- **Proxy Rotation**: Built-in `ProxyRotator` with round-robin or custom strategies across all session types, plus per-request proxy overrides.
- **Domain Blocking**: Block requests to specific domains (and their subdomains) in browser-based fetchers.
- **Async Support**: Complete async support across all fetchers, plus dedicated async session classes.
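The round-robin strategy mentioned in the proxy rotation bullet can be sketched in a few lines. This is an illustration of the strategy only, not the implementation or API of Scrapling's `ProxyRotator`:

```python
# Minimal round-robin proxy rotation sketch. Illustrative only;
# names and proxy URLs here are made up for the example.
from itertools import cycle


class RoundRobinRotator:
    def __init__(self, proxies: list[str]):
        self._pool = cycle(proxies)  # wraps around endlessly

    def next_proxy(self) -> str:
        """Return the next proxy in the list, wrapping at the end."""
        return next(self._pool)


rotator = RoundRobinRotator(["http://p1:8080", "http://p2:8080"])
picked = [rotator.next_proxy() for _ in range(4)]
print(picked)
```

A custom strategy would replace the `cycle` with any policy you like, e.g. weighted choice or health-based selection.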

### Adaptive Scraping & AI Integration

- **Smart Element Tracking**: Relocate elements after website changes using intelligent similarity algorithms.
- **Smart Flexible Selection**: CSS selectors, XPath selectors, filter-based search, text search, regex search, and more.
- **Find Similar Elements**: Automatically locate elements similar to an element you have already found.
- **MCP Server for AI Assistants**: Built-in MCP server for AI-assisted Web Scraping and data extraction. The MCP server features powerful, custom capabilities that leverage Scrapling to extract targeted content before passing it to the AI (Claude, Cursor, etc.), speeding up operations and reducing costs by minimizing token usage. ([demo video](https://www.youtube.com/watch?v=qyFk3ZNwOxE))
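The idea behind similarity-based element relocation can be shown with a toy scoring function. Scrapling's actual algorithm is far more sophisticated; this sketch (with made-up element dictionaries) only demonstrates the principle of scoring candidates against a saved element:

```python
# Toy illustration of similarity-based element relocation.
# Elements are modeled as plain dicts of properties for the example.
def similarity(saved: dict, candidate: dict) -> float:
    """Fraction of properties on which the two elements agree."""
    keys = saved.keys() | candidate.keys()
    shared = sum(1 for k in keys if saved.get(k) == candidate.get(k))
    return shared / len(keys)


saved = {"tag": "div", "class": "product", "text": "Laptop"}
candidates = [
    {"tag": "div", "class": "item-card", "text": "Laptop"},  # class renamed in a redesign
    {"tag": "span", "class": "price", "text": "$999"},
]
# Pick the candidate most similar to the element saved before the redesign.
best = max(candidates, key=lambda c: similarity(saved, c))
print(best["text"])
```

Even after the `class` attribute changed, the first candidate still matches on tag and text, so it wins.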

### Developer/Web Scraper Friendly Experience

- **Interactive Web Scraping Shell**: Optional built-in IPython shell with Scrapling integration, shortcuts, and new tools that speed up the development of Web Scraping scripts, like converting curl commands to Scrapling requests and viewing request results in your browser.
- **Use it directly from the Terminal**: Optionally, you can use Scrapling to scrape a URL without writing a single line of code!
- **Rich Navigation API**: Advanced DOM traversal with parent, sibling, and child navigation methods.
- **Enhanced Text Processing**: Built-in regex, cleaning methods, and optimized string operations.
- **Auto Selector Generation**: Generate robust CSS/XPath selectors for any element.
- **Familiar API**: Similar to Scrapy/BeautifulSoup, with the same pseudo-elements used in Scrapy/Parsel.
- **Complete Type Coverage**: Full type hints for excellent IDE support and code completion. The entire codebase is automatically scanned with **PyRight** and **MyPy** on each change.
- **Ready Docker image**: With each release, a Docker image containing all browsers is automatically built and pushed.
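The gist of auto selector generation can be sketched by walking an element's ancestor chain and joining tag/class steps into a CSS path. This is a conceptual toy (with a hand-built chain of dicts), not Scrapling's generator, which is considerably more robust:

```python
# Conceptual sketch of CSS selector generation: build a path from the
# outermost ancestor down to the target element. Illustrative only.
def css_path(chain: list[dict]) -> str:
    """Build a CSS selector from an ancestor chain (outermost first)."""
    steps = []
    for node in chain:
        step = node["tag"]
        if node.get("cls"):
            step += "." + node["cls"]  # qualify the step with its class
        steps.append(step)
    return " > ".join(steps)  # direct-child combinator between steps


chain = [
    {"tag": "div", "cls": "product"},
    {"tag": "h2", "cls": None},
    {"tag": "a", "cls": "title"},
]
selector = css_path(chain)
print(selector)
```

A robust generator would also handle ids, sibling indices, and ambiguous classes; this sketch only shows the path-building idea.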
Scrapling requires Python 3.10 or higher:

```
pip install scrapling
```

This installation includes only the parser engine and its dependencies, without any fetcher or command-line dependencies.

### Optional Dependencies