| <style> |
| .md-typeset h1 { |
| display: none; |
| } |
| [data-md-color-scheme="default"] .only-dark { display: none; } |
| [data-md-color-scheme="slate"] .only-light { display: none; } |
| </style> |
|
|
| <br/> |
| <div align="center"> |
| <a href="https://scrapling.readthedocs.io/en/latest/" alt="poster"> |
| <img alt="Scrapling" src="assets/cover_light.svg" class="only-light"> |
| <img alt="Scrapling" src="assets/cover_dark.svg" class="only-dark"> |
| </a> |
| </div> |
| |
| <h2 align="center"><i>Effortless Web Scraping for the Modern Web</i></h2><br> |
|
|
| Scrapling is an adaptive Web Scraping framework that handles everything from a single request to a full-scale crawl. |
|
|
| Its parser learns from website changes and automatically relocates your elements when pages update. Its fetchers bypass anti-bot systems like Cloudflare Turnstile out of the box. And its spider framework lets you scale up to concurrent, multi-session crawls with pause/resume and automatic proxy rotation, all in a few lines of Python. One library, zero compromises. |
|
|
| Blazing fast crawls with real-time stats and streaming. Built by Web Scrapers for Web Scrapers and regular users alike, so there's something for everyone. |
|
|
| ```python |
| from scrapling.fetchers import Fetcher, StealthyFetcher, DynamicFetcher |
| StealthyFetcher.adaptive = True |
| page = StealthyFetcher.fetch('https://example.com', headless=True, network_idle=True) # Fetch website under the radar! |
| products = page.css('.product', auto_save=True) # Scrape data that survives website design changes! |
| products = page.css('.product', adaptive=True) # Later, if the website structure changes, pass `adaptive=True` to find them! |
| ``` |
| Or scale up to full crawls: |
| ```python |
| from scrapling.spiders import Spider, Response |
| |
| class MySpider(Spider): |
|     name = "demo" |
|     start_urls = ["https://example.com/"] |
| |
|     async def parse(self, response: Response): |
|         for item in response.css('.product'): |
|             yield {"title": item.css('h2::text').get()} |
| |
| MySpider().start() |
| ``` |
|
|
| ## Top Sponsors |
|
|
| <style> |
| .ad { |
| width:240px; |
| height:100px; |
| } |
| |
| </style> |
|
|
| <!-- sponsors --> |
| <div style="text-align: center;"> |
| <a href="https://hypersolutions.co/?utm_source=github&utm_medium=readme&utm_campaign=scrapling" target="_blank" title="Bot Protection Bypass API for Akamai, DataDome, Incapsula & Kasada"> |
| <img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/HyperSolutions.png" class="ad"> |
| </a> |
| <a href="https://birdproxies.com/t/scrapling" target="_blank" title="At Bird Proxies, we eliminate your pains such as banned IPs, geo restriction, and high costs so you can focus on your work."> |
| <img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/BirdProxies.jpg" class="ad"> |
| </a> |
| <a href="https://evomi.com?utm_source=github&utm_medium=banner&utm_campaign=d4vinci-scrapling" target="_blank" title="Evomi is your Swiss Quality Proxy Provider, starting at $0.49/GB"> |
| <img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/evomi.png" class="ad"> |
| </a> |
| <a href="https://tikhub.io/?ref=KarimShoair" target="_blank" title="Unlock the Power of Social Media Data & AI"> |
| <img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/TikHub.jpg" class="ad"> |
| </a> |
| <a href="https://www.nsocks.com/?keyword=2p67aivg" target="_blank" title="Scalable Web Data Access for AI Applications"> |
| <img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/nsocks.png" class="ad"> |
| </a> |
| <a href="https://petrosky.io/d4vinci" target="_blank" title="PetroSky delivers cutting-edge VPS hosting."> |
| <img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/petrosky.png" class="ad"> |
| </a> |
| <a href="https://substack.thewebscraping.club/p/scrapling-hands-on-guide?utm_source=github&utm_medium=repo&utm_campaign=scrapling" target="_blank" title="The #1 newsletter dedicated to Web Scraping"> |
| <img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/TWSC.png" class="ad"> |
| </a> |
| <a href="https://proxy-seller.com/?partner=CU9CAA5TBYFFT2" target="_blank" title="Proxy-Seller provides reliable proxy infrastructure for Web Scraping"> |
| <img src="https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/ProxySeller.png" class="ad"> |
| </a> |
| <br /> |
| <br /> |
| </div> |
| <!-- /sponsors --> |
| |
| <i><sub>Do you want to show your ad here? Click [here](https://github.com/sponsors/D4Vinci), choose a plan, and enjoy the rest of the perks!</sub></i> |
|
|
| ## Key Features |
|
|
| ### Spiders: A Full Crawling Framework |
| - **Scrapy-like Spider API**: Define spiders with `start_urls`, async `parse` callbacks, and `Request`/`Response` objects. |
| - **Concurrent Crawling**: Configurable concurrency limits, per-domain throttling, and download delays. |
| - **Multi-Session Support**: A unified interface for HTTP requests and stealthy headless browsers in a single spider; route requests to different sessions by ID. |
| - **Pause & Resume**: Checkpoint-based crawl persistence. Press Ctrl+C for a graceful shutdown; restart to resume from where you left off. |
| - **Streaming Mode**: Stream scraped items as they arrive via `async for item in spider.stream()` with real-time stats. Ideal for UIs, pipelines, and long-running crawls. |
| - **Blocked Request Detection**: Automatic detection and retry of blocked requests with customizable logic. |
| - **Built-in Export**: Export results through hooks and your own pipeline, or use the built-in JSON/JSONL export with `result.items.to_json()` / `result.items.to_jsonl()`, respectively. |
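The checkpoint-based pause and resume idea can be illustrated with a small, standard-library-only sketch. It is not Scrapling's implementation: the `CheckpointedCrawl` class, the JSON file layout, and the in-memory `SITE` mapping are all hypothetical names invented for this example. It only shows the general technique of persisting the crawl frontier so that a restarted process continues where the previous one stopped.

```python
import json
import os
import tempfile
from collections import deque


class CheckpointedCrawl:
    """Toy crawl frontier whose state is saved to and restored from a JSON file."""

    def __init__(self, start_urls, checkpoint_path):
        self.checkpoint_path = checkpoint_path
        self.pending = deque(start_urls)  # URLs still to visit
        self.seen = set(start_urls)       # URLs already queued or visited
        self.items = []                   # results collected so far

    @classmethod
    def resume_or_start(cls, start_urls, checkpoint_path):
        """Load a previous checkpoint if one exists, otherwise start fresh."""
        crawl = cls(start_urls, checkpoint_path)
        if os.path.exists(checkpoint_path):
            with open(checkpoint_path, encoding="utf-8") as fh:
                state = json.load(fh)
            crawl.pending = deque(state["pending"])
            crawl.seen = set(state["seen"])
            crawl.items = state["items"]
        return crawl

    def save(self):
        state = {
            "pending": list(self.pending),
            "seen": sorted(self.seen),
            "items": self.items,
        }
        with open(self.checkpoint_path, "w", encoding="utf-8") as fh:
            json.dump(state, fh)

    def run(self, fetch_links, max_pages=None):
        """Process up to `max_pages` URLs, writing a checkpoint after each one."""
        processed = 0
        while self.pending and (max_pages is None or processed < max_pages):
            url = self.pending.popleft()
            self.items.append({"url": url})
            for link in fetch_links(url):
                if link not in self.seen:
                    self.seen.add(link)
                    self.pending.append(link)
            self.save()
            processed += 1
        return self.items


# Hypothetical in-memory "site" standing in for real HTTP requests.
SITE = {"/": ["/a", "/b"], "/a": ["/b", "/c"], "/b": [], "/c": []}


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "crawl.json")
        first = CheckpointedCrawl.resume_or_start(["/"], path)
        first.run(SITE.__getitem__, max_pages=2)  # simulate an interrupted run
        second = CheckpointedCrawl.resume_or_start(["/"], path)
        return [item["url"] for item in second.run(SITE.__getitem__)]
```

Calling `demo()` simulates an interrupted run followed by a restart. The second run picks up the saved frontier instead of revisiting pages that were already processed.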
|
|
| ### Advanced Websites Fetching with Session Support |
| - **HTTP Requests**: Fast and stealthy HTTP requests with the `Fetcher` class. It can impersonate browsers' TLS fingerprints and headers, and it supports HTTP/3. |
| - **Dynamic Loading**: Fetch dynamic websites with full browser automation through the `DynamicFetcher` class supporting Playwright's Chromium and Google's Chrome. |
| - **Anti-bot Bypass**: Advanced stealth capabilities with `StealthyFetcher` and fingerprint spoofing. It can automatically bypass all types of Cloudflare's Turnstile/Interstitial challenges. |
| - **Session Management**: Persistent session support with `FetcherSession`, `StealthySession`, and `DynamicSession` classes for cookie and state management across requests. |
| - **Proxy Rotation**: Built-in `ProxyRotator` with cyclic or custom rotation strategies across all session types, plus per-request proxy overrides. |
| - **Domain Blocking**: Block requests to specific domains (and their subdomains) in browser-based fetchers. |
| - **Async Support**: Complete async support across all fetchers and dedicated async session classes. |
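To make the "specific domains (and their subdomains)" wording concrete, here is a minimal sketch of one way such a match could be done. The `is_blocked` helper is a hypothetical name for this illustration, not part of Scrapling's API, and the real fetchers may apply different rules. The point it shows is that a suffix match must respect the dot boundary, so `notexample.com` is not treated as a subdomain of `example.com`.

```python
from urllib.parse import urlsplit


def is_blocked(url: str, blocked_domains: set[str]) -> bool:
    """Return True if the URL's host is a blocked domain or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    return any(
        host == domain or host.endswith("." + domain)
        for domain in blocked_domains
    )


blocked = {"example.com", "tracker.test"}
print(is_blocked("https://ads.example.com/pixel.js", blocked))  # subdomain of a blocked domain
print(is_blocked("https://notexample.com/", blocked))           # different domain, same suffix
```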
|
|
| ### Adaptive Scraping & AI Integration |
| - **Smart Element Tracking**: Relocate elements after website changes using intelligent similarity algorithms. |
| - **Smart Flexible Selection**: CSS selectors, XPath selectors, filter-based search, text search, regex search, and more. |
| - **Find Similar Elements**: Automatically locate elements similar to ones you have already found. |
| - **MCP Server to be used with AI**: Built-in MCP server for AI-assisted Web Scraping and data extraction. The MCP server features powerful, custom capabilities that leverage Scrapling to extract targeted content before passing it to the AI (Claude, Cursor, etc.), thereby speeding up operations and reducing costs by minimizing token usage. ([demo video](https://www.youtube.com/watch?v=qyFk3ZNwOxE)) |
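The idea behind similarity-based element tracking can be sketched with the standard library alone. This is an illustration of the general technique, not Scrapling's actual algorithm: the dictionary representation of an element, the `signature` and `relocate` helpers, and the `0.5` threshold are all assumptions made for this example. The sketch remembers a few properties of an element, then picks the most similar candidate on the changed page.

```python
from difflib import SequenceMatcher


def signature(element: dict) -> str:
    """Flatten the remembered properties of an element into one comparable string."""
    attrs = " ".join(f"{k}={v}" for k, v in sorted(element.get("attrs", {}).items()))
    return f'{element["tag"]} {attrs} {element.get("text", "")}'


def relocate(saved: dict, candidates: list[dict], threshold: float = 0.5):
    """Return the candidate most similar to the saved element, or None if none is close enough."""
    best, best_score = None, threshold
    for candidate in candidates:
        score = SequenceMatcher(None, signature(saved), signature(candidate)).ratio()
        if score > best_score:
            best, best_score = candidate, score
    return best


# Element as it looked when it was first saved.
saved = {"tag": "div", "attrs": {"class": "product"}, "text": "Blue Widget"}

# Elements found after a hypothetical redesign renamed the tag and class.
candidates = [
    {"tag": "nav", "attrs": {"class": "menu"}, "text": "Home"},
    {"tag": "article", "attrs": {"class": "product-card"}, "text": "Blue Widget"},
]

match = relocate(saved, candidates)
print(match["tag"] if match else None)
```

A real implementation would likely weigh more signals (position in the tree, parent and sibling properties, attribute names versus values), but the selection step stays the same: score every candidate and keep the best one above a threshold.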
|
|
| ### High-Performance & Battle-Tested Architecture |
| - **Lightning Fast**: Optimized performance outperforming most Python scraping libraries. |
| - **Memory Efficient**: Optimized data structures and lazy loading for a minimal memory footprint. |
| - **Fast JSON Serialization**: 10x faster than the standard library. |
| - **Battle-tested**: Scrapling not only has 92% test coverage and full type-hint coverage, but has also been used daily by hundreds of Web Scrapers over the past year. |
|
|
| ### Developer/Web Scraper Friendly Experience |
| - **Interactive Web Scraping Shell**: Optional built-in IPython shell with Scrapling integration, shortcuts, and new tools to speed up Web Scraping script development, such as converting curl requests to Scrapling requests and viewing request results in your browser. |
| - **Use it directly from the Terminal**: Optionally, you can use Scrapling to scrape a URL without writing a single line of code! |
| - **Rich Navigation API**: Advanced DOM traversal with parent, sibling, and child navigation methods. |
| - **Enhanced Text Processing**: Built-in regex, cleaning methods, and optimized string operations. |
| - **Auto Selector Generation**: Generate robust CSS/XPath selectors for any element. |
| - **Familiar API**: Similar to Scrapy/BeautifulSoup with the same pseudo-elements used in Scrapy/Parsel. |
| - **Complete Type Coverage**: Full type hints for excellent IDE support and code completion. The entire codebase is automatically scanned with **PyRight** and **MyPy** with each change. |
| - **Ready Docker image**: With each release, a Docker image containing all browsers is automatically built and pushed. |
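As a rough illustration of what automatic selector generation involves, the sketch below builds a CSS path for an element by walking up its ancestors. It uses only `xml.etree.ElementTree` on a well-formed snippet, and the `css_path` helper is a hypothetical name for this example; Scrapling's own generator is not shown here and may work differently. The walk stops early at the nearest ancestor with an `id`, and adds `:nth-of-type()` only when siblings share the same tag.

```python
import xml.etree.ElementTree as ET


def css_path(root: ET.Element, target: ET.Element) -> str:
    """Build a CSS selector for `target` by walking up to `root` or to the nearest id."""
    parents = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not root:
        if node.get("id"):
            # An id is assumed unique, so the path can start here.
            steps.append(f'#{node.get("id")}')
            break
        parent = parents[node]
        same_tag = [child for child in parent if child.tag == node.tag]
        step = node.tag
        if len(same_tag) > 1:
            step += f":nth-of-type({same_tag.index(node) + 1})"
        steps.append(step)
        node = parent
    else:
        # No id was found on the way up, so anchor the path at the root element.
        steps.append(root.tag)
    return " > ".join(reversed(steps))


doc = ET.fromstring(
    '<html><body><div id="main"><ul><li>a</li><li>b</li></ul></div></body></html>'
)
second_item = doc.find(".//li[2]")
print(css_path(doc, second_item))
```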
|
|
|
|
| ## Star History |
| Scrapling's GitHub stars have grown steadily since its release (see chart below). |
|
|
| <div id="chartContainer"> |
| <a href="https://github.com/D4Vinci/Scrapling"> |
| <img id="chartImage" alt="Star History Chart" loading="lazy" src="https://api.star-history.com/svg?repos=D4Vinci/Scrapling&type=Date" height="400"/> |
| </a> |
| </div> |
| |
| <script> |
| const observer = new MutationObserver((mutations) => { |
| mutations.forEach((mutation) => { |
| if (mutation.attributeName === 'data-md-color-media') { |
| const colorMedia = document.body.getAttribute('data-md-color-media'); |
| const isDarkScheme = document.body.getAttribute('data-md-color-scheme') === 'slate'; |
| const chartImg = document.querySelector('#chartImage'); |
| const baseUrl = 'https://api.star-history.com/svg?repos=D4Vinci/Scrapling&type=Date'; |
| |
| if (colorMedia === '(prefers-color-scheme)' ? isDarkScheme : colorMedia.includes('dark')) { |
| chartImg.src = `${baseUrl}&theme=dark`; |
| } else { |
| chartImg.src = baseUrl; |
| } |
| } |
| }); |
| }); |
| |
| observer.observe(document.body, { |
| attributes: true, |
| attributeFilter: ['data-md-color-media', 'data-md-color-scheme'] |
| }); |
| </script> |
|
|
|
|
| ## Installation |
| Scrapling requires Python 3.10 or higher: |
|
|
| ```bash |
| pip install scrapling |
| ``` |
|
|
| This installation only includes the parser engine and its dependencies, without any fetchers or command-line dependencies. |
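Before installing, it can help to confirm that the interpreter meets the Python 3.10 requirement and to check which packages are already importable. The following is a small sketch using only the standard library; the helper names `python_is_supported` and `is_installed` are made up for this example and are not part of Scrapling.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 10)  # minimum version stated in this section


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True if the given (major, minor, ...) version meets the minimum."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    print("Python version supported:", python_is_supported())
    print("scrapling importable:", is_installed("scrapling"))
```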
|
|
| ### Optional Dependencies |
|
|
| 1. If you are going to use any of the extra features below, the fetchers, or their classes, you will need to install fetchers' dependencies and their browser dependencies as follows: |
| ```bash |
| pip install "scrapling[fetchers]" |
| |
| scrapling install # normal install |
| scrapling install --force # force reinstall |
| ``` |
| |
| This downloads all browsers, along with their system dependencies and fingerprint manipulation dependencies. |
| |
| Alternatively, you can run the same installation from Python code instead of the command line: |
| ```python |
| from scrapling.cli import install |
| |
| install([], standalone_mode=False) # normal install |
| install(["--force"], standalone_mode=False) # force reinstall |
| ``` |
| |
| 2. Extra features: |
|
|
|
|
| - Install the MCP server feature: |
| ```bash |
| pip install "scrapling[ai]" |
| ``` |
| - Install shell features (Web Scraping shell and the `extract` command): |
| ```bash |
| pip install "scrapling[shell]" |
| ``` |
| - Install everything: |
| ```bash |
| pip install "scrapling[all]" |
| ``` |
| Don't forget that you need to install the browser dependencies with `scrapling install` after installing any of these extras (if you haven't already). |
| |
| ### Docker |
| You can also pull a Docker image with all extras and browsers from DockerHub with the following command: |
| ```bash |
| docker pull pyd4vinci/scrapling |
| ``` |
| Or download it from the GitHub registry: |
| ```bash |
| docker pull ghcr.io/d4vinci/scrapling:latest |
| ``` |
| This image is automatically built and pushed using GitHub Actions and the repository's main branch. |
|
|
| ## How the documentation is organized |
| Scrapling has extensive documentation, so we try to follow the [Diátaxis documentation framework](https://diataxis.fr/). |
|
|
| ## Support |
|
|
| If you like Scrapling and want to support its development: |
|
|
| - Star the [GitHub repository](https://github.com/D4Vinci/Scrapling) |
| - Follow us on [Twitter](https://x.com/Scrapling_dev) and join the [Discord server](https://discord.gg/EMgGbDceNQ) |
| - Consider [sponsoring the project or buying me a coffee](donate.md) :wink: |
| - Report bugs and suggest features through [GitHub Issues](https://github.com/D4Vinci/Scrapling/issues) |
|
|
| ## License |
|
|
| This project is licensed under the BSD 3-Clause License. See the [LICENSE](https://github.com/D4Vinci/Scrapling/blob/main/LICENSE) file for details. |