Karim Shoair committed
Commit cc4ddeb · 1 Parent(s): b5d12da

docs: update the website main page

Files changed (1):
  docs/index.md (+40 -22)

docs/index.md CHANGED
@@ -14,27 +14,34 @@
  </a>
  </div>

- <div align="center">
- <i><code>Easy, effortless Web Scraping as it should be!</code></i>
- <br/><br/>
- </div>

- **Stop fighting anti-bot systems. Stop rewriting selectors after every website update.**

- Scrapling isn't just another Web Scraping library. It's the first **adaptive** scraping library that learns from website changes and evolves with them. While other libraries break when websites update their structure, Scrapling automatically relocates your elements and keeps your scrapers running.

- Built for the modern Web, Scrapling features **its own rapid parsing engine** and fetchers to handle all Web Scraping challenges you face or will face. Built by Web Scrapers for Web Scrapers and regular users, there's something for everyone.

  ```python
- >> from scrapling.fetchers import Fetcher, AsyncFetcher, StealthyFetcher, DynamicFetcher
- >> StealthyFetcher.adaptive = True
- # Fetch websites' source under the radar!
- >> page = StealthyFetcher.fetch('https://example.com', headless=True, network_idle=True)
- >> print(page.status)
- 200
- >> products = page.css('.product', auto_save=True)  # Scrape data that survives website design changes!
- >> # Later, if the website structure changes, pass `adaptive=True`
- >> products = page.css('.product', adaptive=True)  # and Scrapling still finds them!
  ```

  ## Top Sponsors
@@ -56,16 +63,27 @@ Built for the modern Web, Scrapling features **its own rapid parsing engine** an

  ## Key Features

  ### Advanced Websites Fetching with Session Support
  - **HTTP Requests**: Fast and stealthy HTTP requests with the `Fetcher` class. Can impersonate browsers' TLS fingerprint, headers, and use HTTP/3.
- - **Dynamic Loading**: Fetch dynamic websites with full browser automation through the `DynamicFetcher` class supporting Playwright's Chromium, and Google's Chrome.
- - **Anti-bot Bypass**: Advanced stealth capabilities with `StealthyFetcher` and fingerprint spoofing. Can bypass all types of Cloudflare's Turnstile/Interstitial with automation easily.
  - **Session Management**: Persistent session support with `FetcherSession`, `StealthySession`, and `DynamicSession` classes for cookie and state management across requests.
  - **Async Support**: Complete async support across all fetchers and dedicated async session classes.

  ### Adaptive Scraping & AI Integration
  - 🔄 **Smart Element Tracking**: Relocate elements after website changes using intelligent similarity algorithms.
- - 🎯 **Smart Flexible Selection**: CSS selectors, XPath selectors, filter-based search, text search, regex search, and more.
  - 🔍 **Find Similar Elements**: Automatically locate elements similar to found elements.
  - 🤖 **MCP Server to be used with AI**: Built-in MCP server for AI-assisted Web Scraping and data extraction. The MCP server features powerful, custom capabilities that leverage Scrapling to extract targeted content before passing it to the AI (Claude/Cursor/etc), thereby speeding up operations and reducing costs by minimizing token usage. ([demo video](https://www.youtube.com/watch?v=qyFk3ZNwOxE))

@@ -77,12 +95,12 @@ Built for the modern Web, Scrapling features **its own rapid parsing engine** an

  ### Developer/Web Scraper Friendly Experience
  - 🎯 **Interactive Web Scraping Shell**: Optional built-in IPython shell with Scrapling integration, shortcuts, and new tools to speed up Web Scraping scripts development, like converting curl requests to Scrapling requests and viewing requests results in your browser.
- - 🚀 **Use it directly from the Terminal**: Optionally, you can use Scrapling to scrape a URL without writing a single code!
  - 🛠️ **Rich Navigation API**: Advanced DOM traversal with parent, sibling, and child navigation methods.
  - 🧬 **Enhanced Text Processing**: Built-in regex, cleaning methods, and optimized string operations.
  - 📝 **Auto Selector Generation**: Generate robust CSS/XPath selectors for any element.
  - 🔌 **Familiar API**: Similar to Scrapy/BeautifulSoup with the same pseudo-elements used in Scrapy/Parsel.
- - 📘 **Complete Type Coverage**: Full type hints for excellent IDE support and code completion.
  - 🔋 **Ready Docker image**: With each release, a Docker image containing all browsers is automatically built and pushed.

@@ -127,7 +145,7 @@ Scrapling requires Python 3.10 or higher:
  pip install scrapling
  ```

- Starting with v0.3.2, this installation only includes the parser engine and its dependencies, without any fetchers or commandline dependencies.

  ### Optional Dependencies

@@ -14,27 +14,34 @@
  </a>
  </div>

+ <h2 align="center"><i>Effortless Web Scraping for the Modern Web</i></h2><br>

+ Scrapling is an adaptive Web Scraping framework that handles everything from a single request to a full-scale crawl.

+ Its parser learns from website changes and automatically relocates your elements when pages update. Its fetchers bypass anti-bot systems like Cloudflare Turnstile out of the box. Its spider framework lets you scale up to concurrent, multi-session crawls with pause/resume and automatic proxy rotation, all in a few lines of Python. One library, zero compromises.

+ Crawls are blazing fast, with real-time stats and streaming. Built by Web Scrapers for Web Scrapers and regular users, there's something for everyone.

  ```python
+ from scrapling.fetchers import Fetcher, AsyncFetcher, StealthyFetcher, DynamicFetcher
+ StealthyFetcher.adaptive = True
+ page = StealthyFetcher.fetch('https://example.com', headless=True, network_idle=True)  # Fetch the website under the radar!
+ products = page.css('.product', auto_save=True)  # Scrape data that survives website design changes!
+ products = page.css('.product', adaptive=True)  # Later, if the website structure changes, pass `adaptive=True` to find them!
+ ```
+ Or scale up to a full crawl:
+ ```python
+ from scrapling.spiders import Spider, Response
+
+ class MySpider(Spider):
+     name = "demo"
+     start_urls = ["https://example.com/"]
+
+     async def parse(self, response: Response):
+         for item in response.css('.product'):
+             yield {"title": item.css('h2::text').get()}
+
+ MySpider().start()
  ```
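The streaming consumption style this commit advertises (`async for item in spider.stream()`) is the standard async-generator pattern. Here is a library-free sketch of that pattern using plain `asyncio`; `DemoSpider` is an illustrative stand-in, not Scrapling's actual class:

```python
import asyncio
from typing import AsyncIterator

class DemoSpider:
    """Toy stand-in for a crawler that yields items as pages finish."""

    def __init__(self, urls: list[str]):
        self.urls = urls

    async def stream(self) -> AsyncIterator[dict]:
        # A real crawler yields items as responses arrive over the network;
        # here we only simulate the asynchronous hand-off.
        for url in self.urls:
            await asyncio.sleep(0)  # stand-in for network I/O
            yield {"url": url}

async def collect() -> list[dict]:
    items = []
    # Consume items in real time instead of waiting for the whole crawl.
    async for item in DemoSpider(["https://example.com/a", "https://example.com/b"]).stream():
        items.append(item)
    return items

items = asyncio.run(collect())
print(len(items))
```

Because the consumer sees each item as soon as it is yielded, the same loop body can feed a UI or a pipeline while a long crawl is still running.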

  ## Top Sponsors
 
@@ -56,16 +63,27 @@ Built for the modern Web, Scrapling features **its own rapid parsing engine** an

  ## Key Features

+ ### Spiders: A Full Crawling Framework
+ - 🕷️ **Scrapy-like Spider API**: Define spiders with `start_urls`, async `parse` callbacks, and `Request`/`Response` objects.
+ - ⚡ **Concurrent Crawling**: Configurable concurrency limits, per-domain throttling, and download delays.
+ - 🔄 **Multi-Session Support**: Unified interface for HTTP requests and stealthy headless browsers in a single spider; route requests to different sessions by ID.
+ - 💾 **Pause & Resume**: Checkpoint-based crawl persistence. Press Ctrl+C for a graceful shutdown; restart to resume from where you left off.
+ - 📡 **Streaming Mode**: Stream scraped items as they arrive via `async for item in spider.stream()`, with real-time stats. Ideal for UIs, pipelines, and long-running crawls.
+ - 🛡️ **Blocked Request Detection**: Automatic detection and retry of blocked requests, with customizable logic.
+ - 📦 **Built-in Export**: Export results through hooks and your own pipeline, or use the built-in JSON/JSONL export via `result.items.to_json()` / `result.items.to_jsonl()`.
+
  ### Advanced Websites Fetching with Session Support
  - **HTTP Requests**: Fast and stealthy HTTP requests with the `Fetcher` class. Can impersonate browsers' TLS fingerprint, headers, and use HTTP/3.
+ - **Dynamic Loading**: Fetch dynamic websites with full browser automation through the `DynamicFetcher` class supporting Playwright's Chromium and Google's Chrome.
+ - **Anti-bot Bypass**: Advanced stealth capabilities with `StealthyFetcher` and fingerprint spoofing. Can easily bypass all types of Cloudflare's Turnstile/Interstitial with automation.
  - **Session Management**: Persistent session support with `FetcherSession`, `StealthySession`, and `DynamicSession` classes for cookie and state management across requests.
+ - **Proxy Rotation**: Built-in `ProxyRotator` with round-robin or custom strategies across all session types, plus per-request proxy overrides.
+ - **Domain Blocking**: Block requests to specific domains (and their subdomains) in browser-based fetchers.
  - **Async Support**: Complete async support across all fetchers and dedicated async session classes.

  ### Adaptive Scraping & AI Integration
  - 🔄 **Smart Element Tracking**: Relocate elements after website changes using intelligent similarity algorithms.
+ - 🎯 **Smart Flexible Selection**: CSS selectors, XPath selectors, filter-based search, text search, regex search, and more.
  - 🔍 **Find Similar Elements**: Automatically locate elements similar to found elements.
  - 🤖 **MCP Server to be used with AI**: Built-in MCP server for AI-assisted Web Scraping and data extraction. The MCP server features powerful, custom capabilities that leverage Scrapling to extract targeted content before passing it to the AI (Claude/Cursor/etc), thereby speeding up operations and reducing costs by minimizing token usage. ([demo video](https://www.youtube.com/watch?v=qyFk3ZNwOxE))

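The round-robin strategy named under the new Proxy Rotation bullet can be sketched in a few lines of plain Python. This illustrates the rotation strategy only; `RoundRobinProxies` and the proxy URLs are hypothetical, not Scrapling's actual `ProxyRotator` implementation:

```python
from itertools import cycle

class RoundRobinProxies:
    """Minimal round-robin proxy rotation (illustrative sketch)."""

    def __init__(self, proxies: list[str]):
        if not proxies:
            raise ValueError("need at least one proxy")
        self._pool = cycle(proxies)  # endless iterator over the list

    def next_proxy(self) -> str:
        # Each call hands out the next proxy, wrapping around the list.
        return next(self._pool)

rotator = RoundRobinProxies(["http://proxy-a:8080", "http://proxy-b:8080"])
picks = [rotator.next_proxy() for _ in range(3)]
print(picks)  # the third pick wraps back to proxy-a
```

A "custom strategy" in this scheme is simply a different `next_proxy` policy, e.g. random choice or health-weighted selection.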
@@ -77,12 +95,12 @@ Built for the modern Web, Scrapling features **its own rapid parsing engine** an

  ### Developer/Web Scraper Friendly Experience
  - 🎯 **Interactive Web Scraping Shell**: Optional built-in IPython shell with Scrapling integration, shortcuts, and new tools to speed up Web Scraping scripts development, like converting curl requests to Scrapling requests and viewing requests results in your browser.
+ - 🚀 **Use it directly from the Terminal**: Optionally, you can use Scrapling to scrape a URL without writing a single line of code!
  - 🛠️ **Rich Navigation API**: Advanced DOM traversal with parent, sibling, and child navigation methods.
  - 🧬 **Enhanced Text Processing**: Built-in regex, cleaning methods, and optimized string operations.
  - 📝 **Auto Selector Generation**: Generate robust CSS/XPath selectors for any element.
  - 🔌 **Familiar API**: Similar to Scrapy/BeautifulSoup with the same pseudo-elements used in Scrapy/Parsel.
+ - 📘 **Complete Type Coverage**: Full type hints for excellent IDE support and code completion. The entire codebase is automatically scanned with **PyRight** and **MyPy** with each change.
  - 🔋 **Ready Docker image**: With each release, a Docker image containing all browsers is automatically built and pushed.

 
@@ -127,7 +145,7 @@ Scrapling requires Python 3.10 or higher:
  pip install scrapling
  ```

+ This installation only includes the parser engine and its dependencies, without any fetchers or command-line dependencies.

  ### Optional Dependencies
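The JSON Lines export mentioned under Built-in Export boils down to one JSON object per line. A library-free sketch of that serialization; the `to_jsonl` helper here is illustrative, not Scrapling's actual `result.items.to_jsonl()` implementation:

```python
import json

def to_jsonl(items: list[dict]) -> str:
    """Serialize items as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(item, ensure_ascii=False) for item in items)

lines = to_jsonl([{"title": "A"}, {"title": "B"}])
print(lines)
```

JSONL's per-line structure is what makes it friendly to streaming crawls: each scraped item can be appended to a file as it arrives, with no closing bracket to maintain.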