| ## Pick Your Path |
|
|
| Not sure where to start? Pick the path that matches what you're trying to do: |
|
|
| | I want to... | Start here | |
| |:---|:---| |
| | **Parse HTML** I already have | [Querying elements](parsing/selection.md) — CSS, XPath, and text-based selection | |
| | **Quickly scrape a page** and prototype | Pick a [fetcher](fetching/choosing.md) and test right away, or launch the [interactive shell](cli/interactive-shell.md) | |
| | **Build a crawler** that scales | [Spiders](spiders/getting-started.md) — concurrent, multi-session crawls with pause/resume | |
| | **Scrape without writing code** | [CLI extract commands](cli/extract-commands.md) or hook up the [MCP server](ai/mcp-server.md) to your favourite AI tool | |
| | **Migrate** from another library | [From BeautifulSoup](tutorials/migrating_from_beautifulsoup.md) or [Scrapy comparison](spiders/architecture.md#comparison-with-scrapy) | |
|
|
| --- |
|
|
We'll start with a quick tour of the parsing capabilities. Then we'll fetch websites, with plain HTTP requests or full browsers, and parse the responses.
|
|
| Here's an HTML document generated by ChatGPT that we will be using as an example throughout this page: |
| ```html |
| <html> |
| <head> |
| <title>Complex Web Page</title> |
| <style> |
| .hidden { display: none; } |
| </style> |
| </head> |
| <body> |
| <header> |
| <nav> |
| <ul> |
| <li> <a href="#home">Home</a> </li> |
| <li> <a href="#about">About</a> </li> |
| <li> <a href="#contact">Contact</a> </li> |
| </ul> |
| </nav> |
| </header> |
| <main> |
| <section id="products" schema='{"jsonable": "data"}'> |
| <h2>Products</h2> |
| <div class="product-list"> |
| <article class="product" data-id="1"> |
| <h3>Product 1</h3> |
| <p class="description">This is product 1</p> |
| <span class="price">$10.99</span> |
| <div class="hidden stock">In stock: 5</div> |
| </article> |
| |
| <article class="product" data-id="2"> |
| <h3>Product 2</h3> |
| <p class="description">This is product 2</p> |
| <span class="price">$20.99</span> |
| <div class="hidden stock">In stock: 3</div> |
| </article> |
| |
| <article class="product" data-id="3"> |
| <h3>Product 3</h3> |
| <p class="description">This is product 3</p> |
| <span class="price">$15.99</span> |
| <div class="hidden stock">Out of stock</div> |
| </article> |
| </div> |
| </section> |
| |
| <section id="reviews"> |
| <h2>Customer Reviews</h2> |
| <div class="review-list"> |
| <div class="review" data-rating="5"> |
| <p class="review-text">Great product!</p> |
| <span class="reviewer">John Doe</span> |
| </div> |
| <div class="review" data-rating="4"> |
| <p class="review-text">Good value for money.</p> |
| <span class="reviewer">Jane Smith</span> |
| </div> |
| </div> |
| </section> |
| </main> |
| <script id="page-data" type="application/json"> |
| { |
| "lastUpdated": "2024-09-22T10:30:00Z", |
| "totalProducts": 3 |
| } |
| </script> |
| </body> |
| </html> |
| ``` |
Start by loading the raw HTML above like this:
```python
from scrapling.parser import Selector
page = Selector(html_doc)  # html_doc is the sample HTML document above as a string
page # <data='<html><head><title>Complex Web Page</tit...'>
```
| Get all text content on the page recursively |
| ```python |
| page.get_all_text(ignore_tags=('script', 'style')) |
| # 'Complex Web Page\nHome\nAbout\nContact\nProducts\nProduct 1\nThis is product 1\n$10.99\nIn stock: 5\nProduct 2\nThis is product 2\n$20.99\nIn stock: 3\nProduct 3\nThis is product 3\n$15.99\nOut of stock\nCustomer Reviews\nGreat product!\nJohn Doe\nGood value for money.\nJane Smith' |
| ``` |
|
|
| ## Finding elements |
If there's an element you want to find on the page, you will find it! Your creativity is the only limit!
|
|
| Finding the first HTML `section` element |
| ```python |
| section_element = page.find('section') |
| # <data='<section id="products" schema='{"jsonabl...' parent='<main><section id="products" schema='{"j...'> |
| ``` |
| Find all `section` elements |
| ```python |
| section_elements = page.find_all('section') |
| # [<data='<section id="products" schema='{"jsonabl...' parent='<main><section id="products" schema='{"j...'>, <data='<section id="reviews"><h2>Customer Revie...' parent='<main><section id="products" schema='{"j...'>] |
| ``` |
| Find all `section` elements whose `id` attribute value is `products`. |
| ```python |
| section_elements = page.find_all('section', {'id':"products"}) |
| # Same as |
| section_elements = page.find_all('section', id="products") |
| # [<data='<section id="products" schema='{"jsonabl...' parent='<main><section id="products" schema='{"j...'>] |
| ``` |
| Find all `section` elements whose `id` attribute value contains `product`. |
| ```python |
| section_elements = page.find_all('section', {'id*':"product"}) |
| ``` |
| Find all `h3` elements whose text content matches this regex `Product \d` |
```python
import re

page.find_all('h3', re.compile(r'Product \d'))
| # [<data='<h3>Product 1</h3>' parent='<article class="product" data-id="1"><h3...'>, <data='<h3>Product 2</h3>' parent='<article class="product" data-id="2"><h3...'>, <data='<h3>Product 3</h3>' parent='<article class="product" data-id="3"><h3...'>] |
| ``` |
| Find all `h3` and `h2` elements whose text content matches the regex `Product` only |
| ```python |
| page.find_all(['h3', 'h2'], re.compile(r'Product')) |
| # [<data='<h3>Product 1</h3>' parent='<article class="product" data-id="1"><h3...'>, <data='<h3>Product 2</h3>' parent='<article class="product" data-id="2"><h3...'>, <data='<h3>Product 3</h3>' parent='<article class="product" data-id="3"><h3...'>, <data='<h2>Products</h2>' parent='<section id="products" schema='{"jsonabl...'>] |
| ``` |
Find all elements whose text content is exactly `Products` (whitespace is ignored)
| ```python |
| page.find_by_text('Products', first_match=False) |
| # [<data='<h2>Products</h2>' parent='<section id="products" schema='{"jsonabl...'>] |
| ``` |
| Or find all elements whose text content matches regex `Product \d` |
| ```python |
| page.find_by_regex(r'Product \d', first_match=False) |
| # [<data='<h3>Product 1</h3>' parent='<article class="product" data-id="1"><h3...'>, <data='<h3>Product 2</h3>' parent='<article class="product" data-id="2"><h3...'>, <data='<h3>Product 3</h3>' parent='<article class="product" data-id="3"><h3...'>] |
| ``` |
| Find all elements that are similar to the element you want |
| ```python |
| target_element = page.find_by_regex(r'Product \d', first_match=True) |
| # <data='<h3>Product 1</h3>' parent='<article class="product" data-id="1"><h3...'> |
| target_element.find_similar() |
| # [<data='<h3>Product 2</h3>' parent='<article class="product" data-id="2"><h3...'>, <data='<h3>Product 3</h3>' parent='<article class="product" data-id="3"><h3...'>] |
| ``` |
| Find the first element that matches a CSS selector |
| ```python |
| page.css('.product-list [data-id="1"]')[0] |
| # <data='<article class="product" data-id="1"><h3...' parent='<div class="product-list"> <article clas...'> |
| ``` |
| Find all elements that match a CSS selector |
| ```python |
| page.css('.product-list article') |
| # [<data='<article class="product" data-id="1"><h3...' parent='<div class="product-list"> <article clas...'>, <data='<article class="product" data-id="2"><h3...' parent='<div class="product-list"> <article clas...'>, <data='<article class="product" data-id="3"><h3...' parent='<div class="product-list"> <article clas...'>] |
| ``` |
| Find the first element that matches an XPath selector |
| ```python |
| page.xpath("//*[@id='products']/div/article")[0] |
| # <data='<article class="product" data-id="1"><h3...' parent='<div class="product-list"> <article clas...'> |
| ``` |
| Find all elements that match an XPath selector |
| ```python |
| page.xpath("//*[@id='products']/div/article") |
| # [<data='<article class="product" data-id="1"><h3...' parent='<div class="product-list"> <article clas...'>, <data='<article class="product" data-id="2"><h3...' parent='<div class="product-list"> <article clas...'>, <data='<article class="product" data-id="3"><h3...' parent='<div class="product-list"> <article clas...'>] |
| ``` |
|
|
This only scratches the surface of these functions; more advanced options for these selection methods are covered later.
| ## Accessing elements' data |
| It's as simple as |
| ```python |
| >>> section_element.tag |
| 'section' |
| >>> print(section_element.attrib) |
| {'id': 'products', 'schema': '{"jsonable": "data"}'} |
| >>> section_element.attrib['schema'].json() # If an attribute value can be converted to json, then use `.json()` to convert it |
| {'jsonable': 'data'} |
| >>> section_element.text # Direct text content |
| '' |
| >>> section_element.get_all_text() # All text content recursively |
| 'Products\nProduct 1\nThis is product 1\n$10.99\nIn stock: 5\nProduct 2\nThis is product 2\n$20.99\nIn stock: 3\nProduct 3\nThis is product 3\n$15.99\nOut of stock' |
| >>> section_element.html_content # The HTML content of the element |
| '<section id="products" schema=\'{"jsonable": "data"}\'><h2>Products</h2>\n <div class="product-list">\n <article class="product" data-id="1"><h3>Product 1</h3>\n <p class="description">This is product 1</p>\n <span class="price">$10.99</span>\n <div class="hidden stock">In stock: 5</div>\n </article><article class="product" data-id="2"><h3>Product 2</h3>\n <p class="description">This is product 2</p>\n <span class="price">$20.99</span>\n <div class="hidden stock">In stock: 3</div>\n </article><article class="product" data-id="3"><h3>Product 3</h3>\n <p class="description">This is product 3</p>\n <span class="price">$15.99</span>\n <div class="hidden stock">Out of stock</div>\n </article></div>\n </section>' |
| >>> print(section_element.prettify()) # The prettified version |
| ''' |
| <section id="products" schema='{"jsonable": "data"}'><h2>Products</h2> |
| <div class="product-list"> |
| <article class="product" data-id="1"><h3>Product 1</h3> |
| <p class="description">This is product 1</p> |
| <span class="price">$10.99</span> |
| <div class="hidden stock">In stock: 5</div> |
| </article><article class="product" data-id="2"><h3>Product 2</h3> |
| <p class="description">This is product 2</p> |
| <span class="price">$20.99</span> |
| <div class="hidden stock">In stock: 3</div> |
| </article><article class="product" data-id="3"><h3>Product 3</h3> |
| <p class="description">This is product 3</p> |
| <span class="price">$15.99</span> |
| <div class="hidden stock">Out of stock</div> |
| </article> |
| </div> |
| </section> |
| ''' |
| >>> section_element.path # All the ancestors in the DOM tree of this element |
| [<data='<main><section id="products" schema='{"j...' parent='<body> <header><nav><ul><li> <a href="#h...'>, |
| <data='<body> <header><nav><ul><li> <a href="#h...' parent='<html><head><title>Complex Web Page</tit...'>, |
| <data='<html><head><title>Complex Web Page</tit...'>] |
| >>> section_element.generate_css_selector |
| '#products' |
| >>> section_element.generate_full_css_selector |
| 'body > main > #products > #products' |
| >>> section_element.generate_xpath_selector |
| "//*[@id='products']" |
| >>> section_element.generate_full_xpath_selector |
| "//body/main/*[@id='products']" |
| ``` |
|
|
| ## Navigation |
| Using the elements we found above |
|
|
| ```python |
| >>> section_element.parent |
| <data='<main><section id="products" schema='{"j...' parent='<body> <header><nav><ul><li> <a href="#h...'> |
| >>> section_element.parent.tag |
| 'main' |
| >>> section_element.parent.parent.tag |
| 'body' |
| >>> section_element.children |
| [<data='<h2>Products</h2>' parent='<section id="products" schema='{"jsonabl...'>, |
| <data='<div class="product-list"> <article clas...' parent='<section id="products" schema='{"jsonabl...'>] |
| >>> section_element.siblings |
| [<data='<section id="reviews"><h2>Customer Revie...' parent='<main><section id="products" schema='{"j...'>] |
>>> section_element.next # gets the next element; the same logic applies to `section_element.previous`.
| <data='<section id="reviews"><h2>Customer Revie...' parent='<main><section id="products" schema='{"j...'> |
| >>> section_element.children.css('h2::text').getall() |
| ['Products'] |
| >>> page.css('[data-id="1"]')[0].has_class('product') |
| True |
| ``` |
If you need more than the element's direct parent, you can iterate over its entire chain of ancestors, like below
| ```python |
for ancestor in section_element.iterancestors():
    print(ancestor.tag)  # do something with each ancestor...
| ``` |
You can also search for a specific ancestor of an element that satisfies a condition: pass a function that takes a `Selector` object as an argument and returns `True` (or any truthy value) when the condition is met, like below:
| ```python |
| >>> section_element.find_ancestor(lambda ancestor: ancestor.css('nav')) |
| <data='<body> <header><nav><ul><li> <a href="#h...' parent='<html><head><title>Complex Web Page</tit...'> |
| ``` |
|
|
| ## Fetching websites |
| Instead of passing the raw HTML to Scrapling, you can retrieve a website's response directly via HTTP requests or by fetching it in a browser. |
|
|
There's a fetcher for every use case.
|
|
| ### HTTP Requests |
| For simple HTTP requests, there's a `Fetcher` class that can be imported and used as below: |
| ```python |
| from scrapling.fetchers import Fetcher |
| page = Fetcher.get('https://scrapling.requestcatcher.com/get', impersonate="chrome") |
| ``` |
With that out of the way, here's how to use all the HTTP methods:
| ```python |
| >>> from scrapling.fetchers import Fetcher |
| >>> page = Fetcher.get('https://scrapling.requestcatcher.com/get', stealthy_headers=True, follow_redirects=True) |
| >>> page = Fetcher.post('https://scrapling.requestcatcher.com/post', data={'key': 'value'}, proxy='http://username:password@localhost:8030') |
| >>> page = Fetcher.put('https://scrapling.requestcatcher.com/put', data={'key': 'value'}) |
| >>> page = Fetcher.delete('https://scrapling.requestcatcher.com/delete') |
| ``` |
For async requests, just swap the import, like below:
| ```python |
| >>> from scrapling.fetchers import AsyncFetcher |
| >>> page = await AsyncFetcher.get('https://scrapling.requestcatcher.com/get', stealthy_headers=True, follow_redirects=True) |
| >>> page = await AsyncFetcher.post('https://scrapling.requestcatcher.com/post', data={'key': 'value'}, proxy='http://username:password@localhost:8030') |
| >>> page = await AsyncFetcher.put('https://scrapling.requestcatcher.com/put', data={'key': 'value'}) |
| >>> page = await AsyncFetcher.delete('https://scrapling.requestcatcher.com/delete') |
| ``` |
|
|
| !!! note "Notes:" |
|
|
    1. The `stealthy_headers` argument, when enabled, generates real browser headers and uses them for the request, including a Google referer header. It's enabled by default.
    2. The `impersonate` argument lets you fake the TLS fingerprint of a specific browser version.
    3. There's also the `http3` argument, which, when enabled, makes the fetcher use HTTP/3 for its requests, making them look more authentic.
|
|
This is just the tip of the iceberg with this fetcher; check out the rest from [here](fetching/static.md).
|
|
| ### Dynamic loading |
We have you covered if you deal with dynamic websites, like most websites today!
|
|
| The `DynamicFetcher` class (formerly `PlayWrightFetcher`) offers many options for fetching and loading web pages using Chromium-based browsers. |
| ```python |
| >>> from scrapling.fetchers import DynamicFetcher |
| >>> page = DynamicFetcher.fetch('https://www.google.com/search?q=%22Scrapling%22', disable_resources=True) # Vanilla Playwright option |
| >>> page.css("#search a::attr(href)").get() |
| 'https://github.com/D4Vinci/Scrapling' |
| >>> # The async version of fetch |
| >>> page = await DynamicFetcher.async_fetch('https://www.google.com/search?q=%22Scrapling%22', disable_resources=True) |
| >>> page.css("#search a::attr(href)").get() |
| 'https://github.com/D4Vinci/Scrapling' |
| ``` |
It's built on top of [Playwright](https://playwright.dev/python/), and it currently provides two main run options that can be mixed as you want:
|
|
| - Vanilla Playwright without any modifications other than the ones you chose. It uses the Chromium browser. |
- A real browser, such as your installed Chrome, by passing the `real_chrome` argument or the CDP URL of a browser for the fetcher to control; most of the options can still be enabled in this mode.
|
|
|
|
| Again, this is just the tip of the iceberg with this fetcher. Check out the rest from [here](fetching/dynamic.md) for all details and the complete list of arguments. |
|
|
| ### Dynamic anti-protection loading |
| We also have you covered if you deal with dynamic websites with annoying anti-protections! |
|
|
| The `StealthyFetcher` class uses a stealthy version of the `DynamicFetcher` explained above. |
|
|
| Some of the things it does: |
|
|
1. It automatically bypasses all types of Cloudflare's Turnstile and Interstitial challenges.
| 2. It bypasses CDP runtime leaks and WebRTC leaks. |
| 3. It isolates JS execution, removes many Playwright fingerprints, and stops detection through some of the known behaviors that bots do. |
| 4. It generates canvas noise to prevent fingerprinting through canvas. |
5. It automatically patches known headless-mode detection methods and provides an option to defeat timezone-mismatch attacks.
| 6. and other anti-protection options... |
|
|
| ```python |
| >>> from scrapling.fetchers import StealthyFetcher |
| >>> page = StealthyFetcher.fetch('https://www.browserscan.net/bot-detection') # Running headless by default |
| >>> page.status == 200 |
| True |
| >>> page = StealthyFetcher.fetch('https://nopecha.com/demo/cloudflare', solve_cloudflare=True) # Solve Cloudflare captcha automatically if presented |
| >>> page.status == 200 |
| True |
| >>> page = StealthyFetcher.fetch('https://www.browserscan.net/bot-detection', humanize=True, os_randomize=True) # and the rest of arguments... |
| >>> # The async version of fetch |
| >>> page = await StealthyFetcher.async_fetch('https://www.browserscan.net/bot-detection') |
| >>> page.status == 200 |
| True |
| ``` |
|
|
| Again, this is just the tip of the iceberg with this fetcher. Check out the rest from [here](fetching/stealthy.md) for all details and the complete list of arguments. |
|
|
| --- |
|
|
| That's Scrapling at a glance. If you want to learn more, continue to the next section. |