## Pick Your Path

Not sure where to start? Pick the path that matches what you're trying to do:

| I want to... | Start here |
|:---|:---|
| **Parse HTML** I already have | [Querying elements](parsing/selection.md) — CSS, XPath, and text-based selection |
| **Quickly scrape a page** and prototype | Pick a [fetcher](fetching/choosing.md) and test right away, or launch the [interactive shell](cli/interactive-shell.md) |
| **Build a crawler** that scales | [Spiders](spiders/getting-started.md) — concurrent, multi-session crawls with pause/resume |
| **Scrape without writing code** | [CLI extract commands](cli/extract-commands.md) or hook up the [MCP server](ai/mcp-server.md) to your favourite AI tool |
| **Migrate** from another library | [From BeautifulSoup](tutorials/migrating_from_beautifulsoup.md) or [Scrapy comparison](spiders/architecture.md#comparison-with-scrapy) |

---
We will start by quickly reviewing the parsing capabilities. Then we will fetch websites with custom browsers, make requests, and parse the responses.

Here's an HTML document, generated by ChatGPT, that we will use as the example throughout this page:
```html
<html>
  <head>
    <title>Complex Web Page</title>
    <style>
      .hidden { display: none; }
    </style>
  </head>
  <body>
    <header>
      <nav>
        <ul>
          <li> <a href="#home">Home</a> </li>
          <li> <a href="#about">About</a> </li>
          <li> <a href="#contact">Contact</a> </li>
        </ul>
      </nav>
    </header>
    <main>
      <section id="products" schema='{"jsonable": "data"}'>
        <h2>Products</h2>
        <div class="product-list">
          <article class="product" data-id="1">
            <h3>Product 1</h3>
            <p class="description">This is product 1</p>
            <span class="price">$10.99</span>
            <div class="hidden stock">In stock: 5</div>
          </article>
          <article class="product" data-id="2">
            <h3>Product 2</h3>
            <p class="description">This is product 2</p>
            <span class="price">$20.99</span>
            <div class="hidden stock">In stock: 3</div>
          </article>
          <article class="product" data-id="3">
            <h3>Product 3</h3>
            <p class="description">This is product 3</p>
            <span class="price">$15.99</span>
            <div class="hidden stock">Out of stock</div>
          </article>
        </div>
      </section>
      <section id="reviews">
        <h2>Customer Reviews</h2>
        <div class="review-list">
          <div class="review" data-rating="5">
            <p class="review-text">Great product!</p>
            <span class="reviewer">John Doe</span>
          </div>
          <div class="review" data-rating="4">
            <p class="review-text">Good value for money.</p>
            <span class="reviewer">Jane Smith</span>
          </div>
        </div>
      </section>
    </main>
    <script id="page-data" type="application/json">
      {
        "lastUpdated": "2024-09-22T10:30:00Z",
        "totalProducts": 3
      }
    </script>
  </body>
</html>
```
Start by loading the raw HTML above into a `Selector`:
```python
from scrapling.parser import Selector

html_doc = "..."  # the HTML document above, as a string
page = Selector(html_doc)
page  # <data='<html><head><title>Complex Web Page</tit...'>
```
Get all text content on the page recursively:
```python
page.get_all_text(ignore_tags=('script', 'style'))
# 'Complex Web Page\nHome\nAbout\nContact\nProducts\nProduct 1\nThis is product 1\n$10.99\nIn stock: 5\nProduct 2\nThis is product 2\n$20.99\nIn stock: 3\nProduct 3\nThis is product 3\n$15.99\nOut of stock\nCustomer Reviews\nGreat product!\nJohn Doe\nGood value for money.\nJane Smith'
```
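The flattened text usually needs light post-processing before use. A minimal stdlib sketch (plain Python, independent of Scrapling) that splits a result like the one above into clean, non-empty lines:

```python
# A shortened stand-in for the string returned by get_all_text() above
raw = 'Complex Web Page\nHome\nAbout\nContact\nProducts\nProduct 1'

# Split on newlines, strip surrounding whitespace, and drop empty entries
lines = [line.strip() for line in raw.splitlines() if line.strip()]
print(lines)  # ['Complex Web Page', 'Home', 'About', 'Contact', 'Products', 'Product 1']
```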
## Finding elements
If an element exists on the page, you can find it. Scrapling lets you select by tag, attributes, text content, regular expressions, CSS, and XPath, and you can combine them freely.
Finding the first HTML `section` element:
```python
section_element = page.find('section')
# <data='<section id="products" schema='{"jsonabl...' parent='<main><section id="products" schema='{"j...'>
```
Find all `section` elements:
```python
section_elements = page.find_all('section')
# [<data='<section id="products" schema='{"jsonabl...' parent='<main><section id="products" schema='{"j...'>, <data='<section id="reviews"><h2>Customer Revie...' parent='<main><section id="products" schema='{"j...'>]
```
Find all `section` elements whose `id` attribute value is `products`:
```python
section_elements = page.find_all('section', {'id': "products"})
# Same as
section_elements = page.find_all('section', id="products")
# [<data='<section id="products" schema='{"jsonabl...' parent='<main><section id="products" schema='{"j...'>]
```
Find all `section` elements whose `id` attribute value contains `product`:
```python
section_elements = page.find_all('section', {'id*': "product"})
```
Find all `h3` elements whose text content matches the regex `Product \d`:
```python
import re

page.find_all('h3', re.compile(r'Product \d'))
# [<data='<h3>Product 1</h3>' parent='<article class="product" data-id="1"><h3...'>, <data='<h3>Product 2</h3>' parent='<article class="product" data-id="2"><h3...'>, <data='<h3>Product 3</h3>' parent='<article class="product" data-id="3"><h3...'>]
```
Find all `h3` and `h2` elements whose text content matches the regex `Product`:
```python
page.find_all(['h3', 'h2'], re.compile(r'Product'))
# [<data='<h3>Product 1</h3>' parent='<article class="product" data-id="1"><h3...'>, <data='<h3>Product 2</h3>' parent='<article class="product" data-id="2"><h3...'>, <data='<h3>Product 3</h3>' parent='<article class="product" data-id="3"><h3...'>, <data='<h2>Products</h2>' parent='<section id="products" schema='{"jsonabl...'>]
```
Find all elements whose text content matches exactly `Products` (surrounding whitespace is ignored):
```python
page.find_by_text('Products', first_match=False)
# [<data='<h2>Products</h2>' parent='<section id="products" schema='{"jsonabl...'>]
```
Or find all elements whose text content matches the regex `Product \d`:
```python
page.find_by_regex(r'Product \d', first_match=False)
# [<data='<h3>Product 1</h3>' parent='<article class="product" data-id="1"><h3...'>, <data='<h3>Product 2</h3>' parent='<article class="product" data-id="2"><h3...'>, <data='<h3>Product 3</h3>' parent='<article class="product" data-id="3"><h3...'>]
```
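Under the hood, this is ordinary regular-expression matching against each element's text. You can reason about which elements will match by testing the pattern with the stdlib `re` module alone (the text list below is hand-picked from the example page):

```python
import re

# Text content of a few elements from the example page
texts = ['Product 1', 'This is product 1', 'Products', 'Product 3']

# `Product \d` requires a capital P followed by a space and a digit
pattern = re.compile(r'Product \d')
matches = [t for t in texts if pattern.search(t)]
print(matches)  # ['Product 1', 'Product 3']
```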
Find all elements that are similar to the element you want:
```python
target_element = page.find_by_regex(r'Product \d', first_match=True)
# <data='<h3>Product 1</h3>' parent='<article class="product" data-id="1"><h3...'>
target_element.find_similar()
# [<data='<h3>Product 2</h3>' parent='<article class="product" data-id="2"><h3...'>, <data='<h3>Product 3</h3>' parent='<article class="product" data-id="3"><h3...'>]
```
Find the first element that matches a CSS selector:
```python
page.css('.product-list [data-id="1"]')[0]
# <data='<article class="product" data-id="1"><h3...' parent='<div class="product-list"> <article clas...'>
```
Find all elements that match a CSS selector:
```python
page.css('.product-list article')
# [<data='<article class="product" data-id="1"><h3...' parent='<div class="product-list"> <article clas...'>, <data='<article class="product" data-id="2"><h3...' parent='<div class="product-list"> <article clas...'>, <data='<article class="product" data-id="3"><h3...' parent='<div class="product-list"> <article clas...'>]
```
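Once you have the matched elements, turning the scraped strings into usable values is plain Python. A sketch that converts the example page's `$10.99`-style price strings into floats; the list below stands in for what a selection such as `page.css('.price::text')` might return:

```python
import re

prices = ['$10.99', '$20.99', '$15.99']  # e.g. extracted from the example page

# Strip everything except digits and the decimal point, then convert
values = [float(re.sub(r'[^\d.]', '', p)) for p in prices]
print(values)                 # [10.99, 20.99, 15.99]
print(round(sum(values), 2))  # 47.97
```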
Find the first element that matches an XPath selector:
```python
page.xpath("//*[@id='products']/div/article")[0]
# <data='<article class="product" data-id="1"><h3...' parent='<div class="product-list"> <article clas...'>
```
Find all elements that match an XPath selector:
```python
page.xpath("//*[@id='products']/div/article")
# [<data='<article class="product" data-id="1"><h3...' parent='<div class="product-list"> <article clas...'>, <data='<article class="product" data-id="2"><h3...' parent='<div class="product-list"> <article clas...'>, <data='<article class="product" data-id="3"><h3...' parent='<div class="product-list"> <article clas...'>]
```
These examples only scratch the surface; more advanced options for these selection methods are covered later.
## Accessing elements' data
It's as simple as:
```python
>>> section_element.tag
'section'
>>> print(section_element.attrib)
{'id': 'products', 'schema': '{"jsonable": "data"}'}
>>> section_element.attrib['schema'].json()  # If an attribute value can be converted to JSON, use `.json()` to convert it
{'jsonable': 'data'}
>>> section_element.text  # Direct text content
''
>>> section_element.get_all_text()  # All text content recursively
'Products\nProduct 1\nThis is product 1\n$10.99\nIn stock: 5\nProduct 2\nThis is product 2\n$20.99\nIn stock: 3\nProduct 3\nThis is product 3\n$15.99\nOut of stock'
>>> section_element.html_content  # The HTML content of the element
'<section id="products" schema=\'{"jsonable": "data"}\'><h2>Products</h2>\n <div class="product-list">\n <article class="product" data-id="1"><h3>Product 1</h3>\n <p class="description">This is product 1</p>\n <span class="price">$10.99</span>\n <div class="hidden stock">In stock: 5</div>\n </article><article class="product" data-id="2"><h3>Product 2</h3>\n <p class="description">This is product 2</p>\n <span class="price">$20.99</span>\n <div class="hidden stock">In stock: 3</div>\n </article><article class="product" data-id="3"><h3>Product 3</h3>\n <p class="description">This is product 3</p>\n <span class="price">$15.99</span>\n <div class="hidden stock">Out of stock</div>\n </article></div>\n </section>'
>>> print(section_element.prettify())  # The prettified version
'''
<section id="products" schema='{"jsonable": "data"}'><h2>Products</h2>
 <div class="product-list">
 <article class="product" data-id="1"><h3>Product 1</h3>
 <p class="description">This is product 1</p>
 <span class="price">$10.99</span>
 <div class="hidden stock">In stock: 5</div>
 </article><article class="product" data-id="2"><h3>Product 2</h3>
 <p class="description">This is product 2</p>
 <span class="price">$20.99</span>
 <div class="hidden stock">In stock: 3</div>
 </article><article class="product" data-id="3"><h3>Product 3</h3>
 <p class="description">This is product 3</p>
 <span class="price">$15.99</span>
 <div class="hidden stock">Out of stock</div>
 </article>
 </div>
 </section>
'''
>>> section_element.path  # All the ancestors of this element in the DOM tree
[<data='<main><section id="products" schema='{"j...' parent='<body> <header><nav><ul><li> <a href="#h...'>,
 <data='<body> <header><nav><ul><li> <a href="#h...' parent='<html><head><title>Complex Web Page</tit...'>,
 <data='<html><head><title>Complex Web Page</tit...'>]
>>> section_element.generate_css_selector
'#products'
>>> section_element.generate_full_css_selector
'body > main > #products > #products'
>>> section_element.generate_xpath_selector
"//*[@id='products']"
>>> section_element.generate_full_xpath_selector
"//body/main/*[@id='products']"
```
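The example page also embeds machine-readable JSON in its `<script id="page-data">` tag. You could select its text with something like `page.css('#page-data::text')`; the decoding step itself is just the stdlib `json` module:

```python
import json

# The raw text of the <script id="page-data"> tag from the example page
raw = '{ "lastUpdated": "2024-09-22T10:30:00Z", "totalProducts": 3 }'

data = json.loads(raw)
print(data['totalProducts'])  # 3
print(data['lastUpdated'])    # 2024-09-22T10:30:00Z
```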
## Navigation
Using the elements we found above:
```python
>>> section_element.parent
<data='<main><section id="products" schema='{"j...' parent='<body> <header><nav><ul><li> <a href="#h...'>
>>> section_element.parent.tag
'main'
>>> section_element.parent.parent.tag
'body'
>>> section_element.children
[<data='<h2>Products</h2>' parent='<section id="products" schema='{"jsonabl...'>,
 <data='<div class="product-list"> <article clas...' parent='<section id="products" schema='{"jsonabl...'>]
>>> section_element.siblings
[<data='<section id="reviews"><h2>Customer Revie...' parent='<main><section id="products" schema='{"j...'>]
>>> section_element.next  # Gets the next element; the same logic applies to `section_element.previous`
<data='<section id="reviews"><h2>Customer Revie...' parent='<main><section id="products" schema='{"j...'>
>>> section_element.children.css('h2::text').getall()
['Products']
>>> page.css('[data-id="1"]')[0].has_class('product')
True
```
If you need more than the element's direct parent, you can iterate over all of its ancestors in the DOM tree:
```python
for ancestor in section_element.iterancestors():
    ...  # do something with the ancestor
```
You can also search for the first ancestor of an element that satisfies a condition: pass a function that takes a `Selector` object as an argument and returns `True` if the condition is satisfied or `False` otherwise, like below:
```python
>>> section_element.find_ancestor(lambda ancestor: ancestor.css('nav'))
<data='<body> <header><nav><ul><li> <a href="#h...' parent='<html><head><title>Complex Web Page</tit...'>
```
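The predicate pattern itself is easy to picture outside Scrapling: walk up the parent chain and return the first node for which the function is truthy. A toy stdlib sketch of that idea, using a hypothetical `Node` class rather than Scrapling's API:

```python
class Node:
    """Minimal stand-in for a DOM element with a parent link."""
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent

def find_ancestor(node, predicate):
    # Walk upward until the predicate matches or the root is reached
    current = node.parent
    while current is not None:
        if predicate(current):
            return current
        current = current.parent
    return None

# A tiny tree mirroring the example page's structure
html = Node('html')
body = Node('body', html)
section = Node('section', body)
h2 = Node('h2', section)

print(find_ancestor(h2, lambda n: n.tag == 'body').tag)  # body
```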
## Fetching websites
Instead of passing raw HTML to Scrapling, you can retrieve a website's response directly, either via HTTP requests or by loading the page in a browser. There's a fetcher for every use case.
### HTTP Requests
For simple HTTP requests, there's the `Fetcher` class, which can be imported and used as below:
```python
from scrapling.fetchers import Fetcher

page = Fetcher.get('https://scrapling.requestcatcher.com/get', impersonate="chrome")
```
With that out of the way, here's how to use all the HTTP methods:
```python
>>> from scrapling.fetchers import Fetcher
>>> page = Fetcher.get('https://scrapling.requestcatcher.com/get', stealthy_headers=True, follow_redirects=True)
>>> page = Fetcher.post('https://scrapling.requestcatcher.com/post', data={'key': 'value'}, proxy='http://username:password@localhost:8030')
>>> page = Fetcher.put('https://scrapling.requestcatcher.com/put', data={'key': 'value'})
>>> page = Fetcher.delete('https://scrapling.requestcatcher.com/delete')
```
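Real-world requests fail intermittently, so you may want to wrap any of the calls above in a small retry helper. A generic sketch in plain Python (not part of Scrapling's API); in practice you would pass something like `lambda: Fetcher.get(url)`:

```python
import time

def fetch_with_retry(fetch, retries=3, delay=1.0):
    """Call `fetch` until it succeeds or the retries are exhausted."""
    last_error = None
    for attempt in range(retries):
        try:
            return fetch()
        except Exception as exc:
            last_error = exc
            time.sleep(delay * (attempt + 1))  # simple linear backoff
    raise last_error

# Usage sketch: page = fetch_with_retry(lambda: Fetcher.get('https://example.com'))
```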
For async requests, just swap the import:
```python
>>> from scrapling.fetchers import AsyncFetcher
>>> page = await AsyncFetcher.get('https://scrapling.requestcatcher.com/get', stealthy_headers=True, follow_redirects=True)
>>> page = await AsyncFetcher.post('https://scrapling.requestcatcher.com/post', data={'key': 'value'}, proxy='http://username:password@localhost:8030')
>>> page = await AsyncFetcher.put('https://scrapling.requestcatcher.com/put', data={'key': 'value'})
>>> page = await AsyncFetcher.delete('https://scrapling.requestcatcher.com/delete')
```
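Because the async fetcher returns awaitables, several pages can be fetched concurrently with `asyncio.gather`. A sketch of that pattern with a stub coroutine standing in for `AsyncFetcher.get` (the URLs are placeholders):

```python
import asyncio

async def fetch(url):
    # Stand-in for `await AsyncFetcher.get(url)`
    await asyncio.sleep(0)
    return f'page for {url}'

async def main():
    urls = ['https://example.com/a', 'https://example.com/b']
    # Launch all fetches concurrently; results come back in input order
    return await asyncio.gather(*(fetch(u) for u in urls))

pages = asyncio.run(main())
print(pages)  # ['page for https://example.com/a', 'page for https://example.com/b']
```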
!!! note "Notes:"
    1. The `stealthy_headers` argument, enabled by default, generates real-browser headers for each request and uses them, including a referer header that makes the request look as if it came from a Google search for that domain.
    2. The `impersonate` argument lets you mimic the TLS fingerprint of a specific browser version.
    3. The `http3` argument, when enabled, makes the fetcher use HTTP/3 for requests, which makes them look more authentic.

This is just the tip of the iceberg with this fetcher; check out the rest [here](fetching/static.md).
### Dynamic loading
We have you covered if you deal with dynamic websites, as most are today!
The `DynamicFetcher` class (formerly `PlayWrightFetcher`) offers many options for fetching and loading web pages with Chromium-based browsers.
```python
>>> from scrapling.fetchers import DynamicFetcher
>>> page = DynamicFetcher.fetch('https://www.google.com/search?q=%22Scrapling%22', disable_resources=True)  # Vanilla Playwright option
>>> page.css("#search a::attr(href)").get()
'https://github.com/D4Vinci/Scrapling'
>>> # The async version of fetch
>>> page = await DynamicFetcher.async_fetch('https://www.google.com/search?q=%22Scrapling%22', disable_resources=True)
>>> page.css("#search a::attr(href)").get()
'https://github.com/D4Vinci/Scrapling'
```
It's built on top of [Playwright](https://playwright.dev/python/) and currently provides two main run modes that can be mixed as you want:

- Vanilla Playwright, with no modifications other than the ones you choose; it uses the Chromium browser.
- A real browser, such as your own Chrome: pass the `real_chrome` argument, or pass your browser's CDP URL so the fetcher controls it. Most of the options can be enabled in this mode as well.

Again, this is just the tip of the iceberg with this fetcher. See [here](fetching/dynamic.md) for all details and the complete list of arguments.
### Dynamic anti-protection loading
We also have you covered if you deal with dynamic websites behind annoying anti-bot protections!
The `StealthyFetcher` class is a stealthy version of the `DynamicFetcher` explained above. Some of the things it does:

1. It automatically bypasses all types of Cloudflare's Turnstile/Interstitial challenges.
2. It bypasses CDP runtime leaks and WebRTC leaks.
3. It isolates JS execution, removes many Playwright fingerprints, and blocks detection through some well-known bot behaviors.
4. It generates canvas noise to prevent canvas fingerprinting.
5. It automatically patches known headless-mode detection methods and provides an option to defeat timezone-mismatch attacks.
6. It makes requests look as if they came from Google's search page for the requested website.
7. ...and other anti-protection options.

```python
>>> from scrapling.fetchers import StealthyFetcher
>>> page = StealthyFetcher.fetch('https://www.browserscan.net/bot-detection')  # Runs headless by default
>>> page.status == 200
True
>>> page = StealthyFetcher.fetch('https://nopecha.com/demo/cloudflare', solve_cloudflare=True)  # Solve the Cloudflare captcha automatically if presented
>>> page.status == 200
True
>>> page = StealthyFetcher.fetch('https://www.browserscan.net/bot-detection', humanize=True, os_randomize=True)  # and the rest of the arguments...
>>> # The async version of fetch
>>> page = await StealthyFetcher.async_fetch('https://www.browserscan.net/bot-detection')
>>> page.status == 200
True
```
Again, this is just the tip of the iceberg with this fetcher. See [here](fetching/stealthy.md) for all details and the complete list of arguments.
---
That's Scrapling at a glance. If you want to learn more, continue to the next section.